Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware

I wanted voice capabilities for my personal AI setup without sending audio to a cloud API. Not for privacy paranoia — I just didn’t want to pay per-request for something that should run locally. Turns out the open-source TTS landscape has gotten surprisingly good.

Piper: Fast Local TTS

Piper is a local text-to-speech engine that runs on CPU. No GPU required. I have it running in Docker on a TrueNAS server via the Wyoming protocol, which is the same protocol Home Assistant uses for voice assistants.

The speed surprised me. Even on modest CPU-only hardware, Piper generates speech at realtime speed or better: a 10-second clip takes less than 10 seconds to render. The voice quality isn’t ElevenLabs, but it’s completely usable: clear, natural-sounding, and consistent.

I wrote a small OpenAI-compatible proxy that wraps the Wyoming protocol in a standard /v1/audio/speech endpoint. Any tool that speaks the OpenAI TTS API can now use my local Piper instance without modification.
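Here is a minimal sketch of what a client call against such a proxy could look like, using only the Python standard library. The request body follows the OpenAI TTS API shape (model, input, voice, response_format). The hostname, the model name, and the voice name are placeholders, and whether a given proxy honors or ignores them is an assumption.

```python
import json
import urllib.request

# Assumption: the proxy listens on port 8951; the hostname is a placeholder.
PROXY_URL = "http://truenas.local:8951/v1/audio/speech"


def build_speech_request(
    text: str, voice: str = "alloy", url: str = PROXY_URL
) -> urllib.request.Request:
    """Build an OpenAI-style TTS request without sending it."""
    payload = {
        "model": "tts-1",          # OpenAI model name; a Piper-backed proxy may ignore it
        "input": text,             # the text to speak
        "voice": voice,            # a proxy could map this to a Piper voice model
        "response_format": "wav",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str, timeout: float = 30.0) -> None:
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Calling synthesize("Dinner is ready", "out.wav") would then produce a WAV file, assuming the proxy is reachable at that address.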

Piper ships with dozens of voice models across multiple languages. I’m running a single English voice right now, but you can load multiple models on different ports if you need variety.

Qwen3-TTS: Voice Cloning on Apple Silicon

Qwen3-TTS is where things get interesting. It’s a text-to-speech model that supports voice cloning from a 3-second audio sample. I run it on an M1 Pro MacBook using mlx-audio, which is built on Apple’s MLX framework and runs the model on the Apple Silicon GPU.

The workflow: give it a short .wav clip of someone’s voice, and it generates new speech in that voice. The quality varies — natural conversational samples work much better than scripted readings. A few things I learned the hard way:

Input audio must be 24kHz mono WAV. 16kHz samples produce truncated garbage with no error message. I lost hours to this.
Skip the style steering. The --instruct flag for controlling tone paradoxically produces worse results for voice cloning. Just let the model match the reference audio naturally.
Keep it short. Generation speed is about 0.08x realtime on M1 Pro — a 10-second clip takes roughly 2 minutes. Fine for short messages, impractical for long-form audio.
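Since a wrong sample rate fails silently, a pre-flight check on the reference clip is cheap insurance. The sketch below uses Python's standard wave module to reject anything that isn't 24 kHz mono before it reaches the model. The function name is my own; it is not part of mlx-audio or Qwen3-TTS.

```python
import wave

REQUIRED_RATE = 24_000   # Hz, per the note above
REQUIRED_CHANNELS = 1    # mono


def check_reference_clip(path: str) -> float:
    """Return the clip duration in seconds, or raise ValueError if the format is wrong."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    if rate != REQUIRED_RATE:
        raise ValueError(f"{path}: sample rate is {rate} Hz, expected {REQUIRED_RATE} Hz")
    if channels != REQUIRED_CHANNELS:
        raise ValueError(f"{path}: {channels} channels, expected mono")
    return frames / rate
```

If the check fails, ffmpeg can convert the clip with the options -ar 24000 -ac 1.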

I set it up as a LaunchAgent so it starts when I log in and stays running. The API is FastAPI-based, exposed on the local network.
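For reference, a LaunchAgent is just a property list in ~/Library/LaunchAgents. The sketch below generates one with Python's plistlib. The keys (Label, ProgramArguments, RunAtLoad, KeepAlive, StandardOutPath, StandardErrorPath) are standard launchd keys. The label, log path, and command line are hypothetical, and launching the server through uvicorn with a module called server:app is an assumption, not the setup described here.

```python
import plistlib


def build_launch_agent(label: str, program_args: list[str], log_path: str) -> bytes:
    """Return an XML property list describing a keep-alive LaunchAgent."""
    agent = {
        "Label": label,                    # unique job name
        "ProgramArguments": program_args,  # argv of the process to launch
        "RunAtLoad": True,                 # start as soon as the agent is loaded
        "KeepAlive": True,                 # restart the process if it exits
        "StandardOutPath": log_path,
        "StandardErrorPath": log_path,
    }
    return plistlib.dumps(agent)


# Hypothetical command line: a FastAPI app served by uvicorn on port 8880.
EXAMPLE_ARGS = [
    "/usr/bin/env", "python3", "-m", "uvicorn",
    "server:app", "--host", "0.0.0.0", "--port", "8880",
]
```

Writing build_launch_agent("com.example.qwen3-tts", EXAMPLE_ARGS, "/tmp/qwen3-tts.log") to ~/Library/LaunchAgents/com.example.qwen3-tts.plist and loading it with launchctl would register the job.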

What I Actually Use This For

Voice replies on Telegram. When someone sends me a voice message, Whisper transcribes it, the AI generates a text response, and Piper or Qwen converts it back to speech. The reply goes back as a voice note. With Piper the whole round-trip takes a few seconds; with Qwen the synthesis step alone takes minutes, so I only use it when the cloned voice matters.
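The shape of that round-trip can be sketched as three steps chained together. The function below takes each stage as a callable, so the same flow works whether synthesis goes to Piper or Qwen. All names here are illustrative; the real handlers would wrap the Whisper endpoint, the language model, and a TTS endpoint.

```python
from typing import Callable


def reply_to_voice_message(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # e.g. a wrapper around the Whisper endpoint
    respond: Callable[[str], str],        # e.g. a wrapper around the language model
    synthesize: Callable[[str], bytes],   # e.g. a wrapper around Piper or Qwen3-TTS
) -> bytes:
    """Speech in, speech out: transcribe, generate a reply, synthesize it."""
    text_in = transcribe(audio_in)
    text_out = respond(text_in)
    return synthesize(text_out)
```

Passing the stages in as arguments also makes the flow easy to test with stand-in functions before any model is running.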

Podcast generation. I built a pipeline that researches news via web search, writes a two-host podcast script, and generates 20 minutes of audio using different TTS voices for each host. Piper handles the bulk generation (fast), and I’ve experimented with Qwen for more distinctive character voices.
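One piece of that pipeline that is easy to illustrate is routing each script line to the right voice. The sketch below assumes a script format of "NAME: text" per line, which is my assumption rather than the actual format used, and the host names and voice identifiers are placeholders.

```python
from typing import Iterator

# Hypothetical mapping from host name to a voice identifier.
VOICES = {"ALEX": "voice-a", "SAM": "voice-b"}


def script_segments(script: str, voices: dict[str, str]) -> Iterator[tuple[str, str]]:
    """Yield (voice, text) pairs for lines formatted as 'NAME: text'."""
    for raw in script.splitlines():
        speaker, sep, text = raw.partition(":")  # split on the first colon only
        speaker = speaker.strip().upper()
        if not sep or speaker not in voices or not text.strip():
            continue  # skip blank lines, stage directions, and unknown speakers
        yield voices[speaker], text.strip()
```

Each pair can then be sent to the TTS engine for that voice, and the resulting clips concatenated in order.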

Notifications and alerts. Instead of text notifications, some of my home automation alerts are spoken aloud through a Sonos speaker. Piper handles this — it’s fast enough that the latency is imperceptible.

Piper vs. Qwen: When to Use Which

	Piper	Qwen3-TTS
Speed	Realtime or faster	~0.08x realtime
Voice cloning	No	Yes (3-second sample)
Hardware	CPU (any machine)	Apple Silicon recommended
Quality	Good, consistent	Excellent for cloned voices
Use case	Bulk TTS, real-time	Short personalized clips

For anything that needs to be fast or long-form, Piper wins. For anything that needs to sound like a specific person, Qwen wins. I use both.
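The speed gap is easier to feel as wall-clock time. A realtime factor is audio duration divided by render time, so render time is audio duration divided by the factor. A small sketch using the figures from the table:

```python
def render_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate wall-clock render time for a clip at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# At 0.08x realtime, a 10-second clip needs about 125 seconds (roughly 2 minutes).
short_clip = render_seconds(10, 0.08)

# A 20-minute episode at the same rate needs about 15,000 seconds (over 4 hours).
long_form = render_seconds(20 * 60, 0.08)
```

At 1.0x or better, the same 20-minute episode renders in 20 minutes or less, which is why bulk generation goes to Piper.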

The Setup

The whole thing runs on hardware I already owned:

Piper: Docker container on TrueNAS, Wyoming protocol on port 10200, OpenAI-compatible proxy on port 8951
Qwen3-TTS: LaunchAgent on MacBook, FastAPI server on port 8880
Whisper: Docker container on TrueNAS, OpenAI-compatible endpoint on port 8950
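With services spread across two machines, a quick reachability check saves some debugging. The sketch below only tests whether each TCP port accepts a connection; it says nothing about whether the service behind it is healthy. The ports are the ones listed above, and the hostnames are placeholders.

```python
import socket

# Ports from the list above; hostnames are hypothetical.
SERVICES = {
    "piper-wyoming": ("truenas.local", 10200),
    "piper-openai-proxy": ("truenas.local", 8951),
    "whisper": ("truenas.local", 8950),
    "qwen3-tts": ("macbook.local", 8880),
}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its port accepted a connection."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}
```

Running check_all(SERVICES) returns a dictionary of service names to booleans, which is enough to spot a container that didn't come back after a reboot.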

Total additional cost: $0. These are all open-source models running on consumer hardware.

What’s Next

I want to get Piper running with multiple voice models simultaneously, with different voices on different ports, so the podcast pipeline can use truly distinct local voices without hitting any external API. The current single-voice setup works, but having two or three local voices would eliminate the last dependency on cloud TTS entirely.

If you have a machine that can run Docker, you can have local TTS running in about 15 minutes. Piper’s documentation is solid, and the Wyoming protocol integration with Home Assistant makes it trivially easy if you’re already in that ecosystem.