Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware

2026-03-24

I wanted voice capabilities for my personal AI setup without sending audio to a cloud API. Not for privacy paranoia — I just didn’t want to pay per-request for something that should run locally. Turns out the open-source TTS landscape has gotten surprisingly good.

Piper: Fast Local TTS

Piper is a local text-to-speech engine that runs on CPU. No GPU required. I have it running in Docker on a TrueNAS server via the Wyoming protocol, which is the same protocol Home Assistant uses for voice assistants.

The speed surprised me. On modest hardware, Piper generates speech at realtime speed or better: a 10-second clip takes less than 10 seconds to render. The voice quality isn’t ElevenLabs, but it’s completely usable: clear, natural-sounding, and consistent.

I wrote a small OpenAI-compatible proxy that wraps the Wyoming protocol in a standard /v1/audio/speech endpoint. Any tool that speaks the OpenAI TTS API can now use my local Piper instance without modification.
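To make the proxy idea concrete, here is a minimal sketch of its two pure pieces: framing a synthesis request for the Wyoming side, and wrapping the raw PCM that comes back in a WAV container for the HTTP response. This is not my actual proxy code. The function names are made up for illustration, the one-JSON-line framing is my simplified reading of Wyoming events, and the 22050 Hz / 16-bit / mono defaults are assumptions that a real proxy should replace with the audio metadata the server reports. Only the `input` field of the OpenAI request body is used here.

```python
import io
import json
import wave


def build_synthesize_event(text: str) -> bytes:
    """Encode a Wyoming-style 'synthesize' event as a single JSON line.

    Assumption: simplified framing with the event data inline in the header.
    """
    header = {"type": "synthesize", "data": {"text": text}}
    return (json.dumps(header) + "\n").encode("utf-8")


def pcm_to_wav(pcm: bytes, rate: int = 22050, width: int = 2, channels: int = 1) -> bytes:
    """Wrap raw PCM samples in a WAV container (defaults are assumed values)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def handle_speech_request(body: dict, synthesize) -> bytes:
    """Handle the JSON body of a POST to /v1/audio/speech.

    `synthesize` is a hypothetical callable (text -> raw PCM bytes) that would
    talk to Piper over the network; it is injected so this sketch stays testable.
    """
    text = body["input"]  # OpenAI's TTS request carries the text in "input"
    return pcm_to_wav(synthesize(text))
```

In a real service, `handle_speech_request` would sit behind an HTTP route and `synthesize` would open a socket to the Piper container, send the event, and collect the audio chunks.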

Piper ships with dozens of voice models across multiple languages. I’m running a single English voice right now, but you can load multiple models on different ports if you need variety.

Qwen3-TTS: Voice Cloning on Apple Silicon

Qwen3-TTS is where things get interesting. It’s a text-to-speech model that supports voice cloning from a 3-second audio sample. I run it on an M1 Pro MacBook using mlx-audio, which is built on Apple’s MLX framework and runs the model on the Apple Silicon GPU.
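Since cloning depends on a short reference clip, it can help to check the clip length before handing it to the model. The sketch below uses only Python’s standard `wave` module and assumes the clip is an uncompressed WAV file. The function names and the 3-second default threshold are my own choices for illustration, not part of mlx-audio or Qwen3-TTS.

```python
import io
import wave


def clip_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(wav_bytes: bytes, min_seconds: float = 3.0) -> bool:
    """True if the clip is at least `min_seconds` long (assumed threshold)."""
    return clip_duration_seconds(wav_bytes) >= min_seconds
```

Reading the clip with `open(path, "rb").read()` and passing the bytes in is enough to use it. Duration is only a sanity check and says nothing about how well a clip will clone.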

The workflow: give it a short .wav clip of someone’s voice, and it generates new speech in that voice. The quality varies — natural conversational samples work much better than scripted readings. A few things I learned the hard way:

I set it up as a LaunchAgent so it starts when I log in and stays running. The API is FastAPI-based, exposed on the local network.
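For anyone who hasn’t written a LaunchAgent before, the definition is a small property list. The sketch below generates one with Python’s standard `plistlib`. `Label`, `ProgramArguments`, `RunAtLoad`, and `KeepAlive` are standard launchd keys, but the label and command in the usage note are hypothetical placeholders, not my actual configuration.

```python
import plistlib


def build_launch_agent(label: str, program_args: list[str]) -> bytes:
    """Return the XML plist for a LaunchAgent that starts a long-running service."""
    agent = {
        "Label": label,                    # unique identifier for the job
        "ProgramArguments": program_args,  # command and arguments to run
        "RunAtLoad": True,                 # start as soon as the agent is loaded
        "KeepAlive": True,                 # relaunch the process if it exits
    }
    return plistlib.dumps(agent)
```

Calling `build_launch_agent("com.example.tts", ["/usr/bin/env", "python3", "-m", "uvicorn", "server:app"])` and writing the result to `~/Library/LaunchAgents/com.example.tts.plist` would be one way to register a service like this. The `server:app` module name is an assumption.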

What I Actually Use This For

Voice replies on Telegram. When someone sends me a voice message, Whisper transcribes it, the AI generates a text response, and Piper or Qwen converts it back to speech. The reply goes back as a voice note. The whole round-trip takes a few seconds.
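The round-trip is three stages chained together. A minimal sketch of the shape, with each stage passed in as a plain callable so the backends can be swapped (Piper or Qwen for the last step). The function and parameter names are illustrative, and the Telegram handling itself is left out.

```python
from typing import Callable


def voice_reply(
    voice_message: bytes,
    transcribe: Callable[[bytes], str],      # speech-to-text, e.g. Whisper
    generate_reply: Callable[[str], str],    # the AI producing a text answer
    synthesize: Callable[[str], bytes],      # text-to-speech, e.g. Piper or Qwen
) -> bytes:
    """Turn an incoming voice message into a spoken reply."""
    text_in = transcribe(voice_message)
    text_out = generate_reply(text_in)
    return synthesize(text_out)
```

Keeping the stages as injected functions means the TTS backend can be picked per message without touching the rest of the flow.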

Podcast generation. I built a pipeline that researches news via web search, writes a two-host podcast script, and generates 20 minutes of audio using different TTS voices for each host. Piper handles the bulk generation (fast), and I’ve experimented with Qwen for more distinctive character voices.
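The step that matters for multi-voice audio is splitting the script into per-speaker segments and mapping each speaker to a voice. Here is a rough sketch of that step, assuming a simple `SPEAKER: line` script format. The format, the speaker names, and the voice identifiers are hypothetical, not what my pipeline actually emits.

```python
def parse_script(script: str) -> list[tuple[str, str]]:
    """Split a 'SPEAKER: text' script into (speaker, text) segments."""
    segments = []
    for line in script.splitlines():
        speaker, sep, text = line.partition(":")
        # Skip blank lines and lines without a speaker prefix.
        if sep and text.strip():
            segments.append((speaker.strip(), text.strip()))
    return segments


def assign_voices(
    segments: list[tuple[str, str]], voices: dict[str, str]
) -> list[tuple[str, str]]:
    """Replace each speaker name with the TTS voice configured for it."""
    return [(voices[speaker], text) for speaker, text in segments]
```

Each resulting `(voice, text)` pair can then be sent to whichever TTS backend serves that voice, and the audio clips concatenated in order.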

Notifications and alerts. Instead of text notifications, some of my home automation alerts are spoken aloud through a Sonos speaker. Piper handles this — it’s fast enough that the latency is imperceptible.

Piper vs. Qwen: When to Use Which

| | Piper | Qwen3-TTS |
|---|---|---|
| Speed | Realtime or faster | ~0.08x realtime |
| Voice cloning | No | Yes (3-second sample) |
| Hardware | CPU (any machine) | Apple Silicon recommended |
| Quality | Good, consistent | Excellent for cloned voices |
| Use case | Bulk TTS, real-time | Short personalized clips |

For anything that needs to be fast or long-form, Piper wins. For anything that needs to sound like a specific person, Qwen wins. I use both.
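The speed gap is easier to appreciate with some quick arithmetic. The sketch below reads “~0.08x realtime” as 0.08 seconds of audio produced per second of wall-clock time. That reading is my assumption about the figure, and the function name is just for illustration.

```python
def render_minutes(audio_minutes: float, speed_factor: float) -> float:
    """Estimate wall-clock minutes needed to render `audio_minutes` of speech.

    `speed_factor` is audio time produced per unit of wall-clock time
    (1.0 = realtime, below 1.0 = slower than realtime).
    """
    return audio_minutes / speed_factor


# A 20-minute episode at realtime speed takes about 20 minutes to render.
piper_estimate = render_minutes(20, 1.0)

# The same episode at 0.08x realtime takes roughly 250 minutes, over four hours.
qwen_estimate = render_minutes(20, 0.08)
```

Under that assumption, a 10-second personalized clip costs about two minutes on Qwen, which is fine for short replies and impractical for long-form audio.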

The Setup

The whole thing runs on hardware I already owned:

Total additional cost: $0. These are all open-source models running on consumer hardware.

What’s Next

I want to get Piper running with multiple voice models simultaneously, with different voices on different ports, so the podcast pipeline can use truly distinct local voices without hitting any external API. The current single-voice setup works, but having two or three local voices would eliminate the last dependency on cloud TTS entirely.
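A sketch of what the routing side of that could look like: a small table mapping voice names to the port of the Piper instance that serves them. The voice names, port numbers, and function name are all hypothetical, since none of this is built yet.

```python
# Hypothetical mapping: one Piper instance per voice, each on its own port.
VOICE_PORTS = {
    "host_a": 10200,
    "host_b": 10201,
}


def endpoint_for(voice: str, host: str = "127.0.0.1") -> tuple[str, int]:
    """Return the (host, port) of the local Piper instance serving `voice`."""
    try:
        return host, VOICE_PORTS[voice]
    except KeyError:
        raise ValueError(f"no local Piper instance configured for voice {voice!r}") from None
```

Failing loudly on an unknown voice keeps the pipeline from silently falling back to a cloud voice.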

If you have a machine that can run Docker, you can have local TTS running in about 15 minutes. Piper’s documentation is solid, and the Wyoming protocol integration with Home Assistant makes it trivially easy if you’re already in that ecosystem.