Rewrite post with actual experience, not docs claims
@@ -1,65 +1,60 @@
---
layout: ../../layouts/PostLayout.astro
title: "Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware"
title: "Local Voice AI: What Actually Works for TTS and Voice Cloning"
date: "2026-03-24"
---

I wanted voice capabilities for my personal AI setup without sending audio to a cloud API. Not for privacy paranoia; I just didn't want to pay per-request for something that should run locally. Turns out the open-source TTS landscape has gotten surprisingly good.
I spent the last few weeks setting up local text-to-speech and voice cloning for a personal AI assistant. Here's what actually happened, not what the docs promise.

## Piper: Fast Local TTS
## Piper TTS: The Reliable Workhorse

[Piper](https://github.com/rhasspy/piper) is a local text-to-speech engine that runs on CPU. No GPU required. I have it running in Docker on a TrueNAS server via the [Wyoming protocol](https://github.com/rhasspy/wyoming), which is the same protocol Home Assistant uses for voice assistants.
[Piper](https://github.com/rhasspy/piper) runs in Docker on my TrueNAS server via the [Wyoming protocol](https://github.com/rhasspy/wyoming). Setup was straightforward. Pick a voice model, point it at a port, done.

The speed surprised me. On modest hardware, Piper generates speech at realtime speed or better. A 10-second clip takes less than 10 seconds to render. The voice quality isn't ElevenLabs, but it's completely usable: clear, natural-sounding, and consistent.
The speed genuinely surprised me. On a machine that's primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small proxy that wraps the Wyoming protocol in an OpenAI-compatible `/v1/audio/speech` endpoint so anything that speaks that API can use it.

I wrote a small [OpenAI-compatible proxy](https://github.com/ddoc) that wraps the Wyoming protocol in a standard `/v1/audio/speech` endpoint. Any tool that speaks the OpenAI TTS API can now use my local Piper instance without modification.
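To make "speaks the OpenAI TTS API" concrete, here is a minimal client sketch using only the standard library. The host name `truenas.local`, the model name `piper`, and the voice name are placeholders I made up; port 8951 is the proxy port from the setup list in this post. Whether the proxy honors `response_format` is an assumption, since the proxy's code is not shown here.

```python
import json
import urllib.request

# Assumptions: host, model, and voice names are placeholders;
# 8951 is the proxy port mentioned in the setup notes.
BASE_URL = "http://truenas.local:8951"


def build_speech_request(text, voice="en_US-example", model="piper", fmt="wav"):
    """Build a POST request shaped like an OpenAI /v1/audio/speech call."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": fmt,  # assumed to be accepted by the proxy
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="reply.wav"):
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text), timeout=30) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Calling `synthesize("Dinner is ready")` would write `reply.wav` if a compatible server is listening at that address. Building the request separately from sending it keeps the payload easy to inspect without a network.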
The catch: my Piper instance serves one voice at a time. I tried using it for a two-host podcast and both hosts sounded identical. For single-voice use cases (notifications, voice replies, alerts through a Sonos speaker) it's excellent. For anything requiring distinct voices, you need something else.

Piper ships with dozens of voice models across multiple languages. I'm running a single English voice right now, but you can load multiple models on different ports if you need variety.
## Qwen3-TTS Voice Cloning: Impressive But Painful

## Qwen3-TTS: Voice Cloning on Apple Silicon
[Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) runs on my M1 Pro MacBook via [mlx-audio](https://github.com/Blaizzy/mlx-audio). The pitch: give it a voice sample and it generates new speech in that voice.

[Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) is where things get interesting. It's a text-to-speech model that supports voice cloning from a 3-second audio sample. I run it on an M1 Pro MacBook using [mlx-audio](https://github.com/Blaizzy/mlx-audio), which runs the model on Apple Silicon through Apple's MLX framework.
The reality is more complicated.

The workflow: give it a short `.wav` clip of someone's voice, and it generates new speech in that voice. The quality varies: natural conversational samples work much better than scripted readings. A few things I learned the hard way:
**It's slow.** About 0.08x realtime on M1 Pro. A 10-second clip takes roughly two minutes to generate. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.

- **Input audio must be 24 kHz mono WAV.** 16 kHz samples produce truncated garbage with no error message. I lost hours to this.
- **Skip the style steering.** The `--instruct` flag for controlling tone paradoxically produces worse results for voice cloning. Just let the model match the reference audio naturally.
- **Keep it short.** Generation speed is about 0.08x realtime on M1 Pro, so a 10-second clip takes roughly 2 minutes. Fine for short messages, impractical for long-form audio.
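Since the wrong sample rate fails silently, a pre-flight check on the reference clip is cheap insurance. This is a minimal sketch using Python's standard `wave` module; the function name and message wording are my own, and it only checks the two properties named in the list above (24 kHz, mono).

```python
import wave

REQUIRED_RATE = 24000   # Hz, the rate the reference clip needs
REQUIRED_CHANNELS = 1   # mono


def check_reference_wav(path):
    """Return a list of problems with a reference clip; empty means it passes."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != REQUIRED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

If the check reports a problem, resampling with something like `ffmpeg -i input.wav -ar 24000 -ac 1 output.wav` should produce a clip in the expected format.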
**The 24 kHz gotcha cost me hours.** Input reference audio must be 24000 Hz mono WAV. If you feed it 16 kHz audio, which is what most voice recorders produce, the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.

I set it up as a [LaunchAgent](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html) so it starts on boot and stays running. The API is FastAPI-based, exposed on the local network.
**Style steering makes it worse.** The model has an `--instruct` mode for controlling tone and delivery. Every time I used it for voice cloning, the output quality dropped. Removing it and letting the model match the reference audio naturally produced better results every time.

**Sample quality matters more than sample length.** The docs suggest a few seconds of audio is enough. Technically true, but a clean 10-second conversational clip produces noticeably better clones than a noisy 3-second snippet. I got the best results from natural speech: someone talking normally, not reading a script.

I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It's not a "give it any audio and get a perfect clone" situation. It's more like "give it good audio and you'll get a recognizable approximation."

## What I Actually Use This For

**Voice replies on Telegram.** When someone sends me a voice message, Whisper transcribes it, the AI generates a text response, and Piper or Qwen converts it back to speech. The reply goes back as a voice note. The whole round-trip takes a few seconds.
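The round trip is three stages chained together. The sketch below shows only that shape, with each stage passed in as a plain function. The names `transcribe`, `respond`, and `synthesize` are hypothetical stand-ins for the Whisper call, the assistant, and the TTS call; none of them are real library APIs.

```python
from typing import Callable


def voice_reply(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one voice message through speech-to-text, the assistant, and TTS."""
    text = transcribe(audio_in)
    if not text.strip():
        # Nothing usable was heard, so do not ask the assistant to answer it.
        raise ValueError("transcription was empty")
    reply = respond(text)
    return synthesize(reply)
```

Passing the stages in as arguments means the TTS backend can be swapped between Piper and Qwen without touching the pipeline itself.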
**Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.

**Podcast generation.** I built a pipeline that researches news via web search, writes a two-host podcast script, and generates 20 minutes of audio using different TTS voices for each host. Piper handles the bulk generation (fast), and I've experimented with Qwen for more distinctive character voices.
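For scale, the roughly 0.08x realtime figure for Qwen3-TTS on the M1 Pro can be turned into wall-clock estimates. This is back-of-the-envelope arithmetic, assuming the speed stays constant for longer inputs, which I have not verified.

```python
def generation_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimate wall-clock time to generate audio at a given realtime multiple."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


QWEN_SPEED = 0.08  # measured figure from this post, M1 Pro

clip = generation_seconds(10, QWEN_SPEED)          # about 125 s, roughly two minutes
episode = generation_seconds(20 * 60, QWEN_SPEED)  # about 15,000 s, a bit over four hours
```

A 20-minute episode at that rate is a multi-hour job, which is why it only makes sense for short clips.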
**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too slow for 20 minutes of audio. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.

**Notifications and alerts.** Instead of text notifications, some of my home automation alerts are spoken aloud through a Sonos speaker. Piper handles this; it's fast enough that the latency is imperceptible.
**Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.

## Piper vs. Qwen: When to Use Which
**Voice cloning for short clips.** Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds where a generation time of two to three minutes is acceptable.

## The Honest Comparison

| | Piper | Qwen3-TTS |
|---|---|---|
| Speed | Realtime or faster | ~0.08x realtime |
| Voice cloning | No | Yes (3-second sample) |
| Hardware | CPU (any machine) | Apple Silicon recommended |
| Quality | Good, consistent | Excellent for cloned voices |
| Use case | Bulk TTS, real-time | Short personalized clips |

For anything that needs to be fast or long-form, Piper wins. For anything that needs to sound like a specific person, Qwen wins. I use both.

## The Setup

The whole thing runs on hardware I already owned:

- **Piper**: Docker container on TrueNAS, Wyoming protocol on port 10200, OpenAI-compatible proxy on port 8951
- **Qwen3-TTS**: LaunchAgent on MacBook, FastAPI server on port 8880
- **Whisper**: Docker container on TrueNAS, OpenAI-compatible endpoint on port 8950

Total additional cost: $0. These are all open-source models running on consumer hardware.
| Speed | Realtime+ | ~2 min per 10 sec |
| Voices | One at a time | Clone any voice |
| Quality | Good, consistent | Variable, sample-dependent |
| Hardware | CPU (anything) | Apple Silicon |
| Best for | Fast bulk TTS | Short personalized clips |
| Worst at | Voice variety | Long-form, speed |

## What's Next

I want to get Piper running with multiple voice models simultaneously — different voices on different ports — so the podcast pipeline can use truly distinct local voices without hitting any external API. The current single-voice setup works but having two or three local voices would eliminate the last dependency on cloud TTS entirely.
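One way the multi-port idea could plug into the podcast pipeline is a small router that maps each speaker in the script to its own endpoint. Everything here is hypothetical: the `SPEAKER: line` script format, the speaker labels, the host name, and the second port (8952) are invented for illustration and do not describe the current setup.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one Piper-backed endpoint per podcast host.
VOICE_ENDPOINTS: Dict[str, str] = {
    "HOST_A": "http://truenas.local:8951/v1/audio/speech",
    "HOST_B": "http://truenas.local:8952/v1/audio/speech",
}


def route_script(script: str) -> List[Tuple[str, str]]:
    """Turn 'SPEAKER: line' text into ordered (endpoint, line) pairs."""
    jobs = []
    for raw in script.splitlines():
        if not raw.strip():
            continue  # skip blank lines between turns
        speaker, sep, line = raw.partition(":")
        speaker = speaker.strip()
        if not sep or speaker not in VOICE_ENDPOINTS:
            raise ValueError(f"unknown or missing speaker in line: {raw!r}")
        jobs.append((VOICE_ENDPOINTS[speaker], line.strip()))
    return jobs
```

Each pair can then be sent to its endpoint in order and the resulting clips concatenated into the episode.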
I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices; I just need to run more than one instance.

If you have a machine that can run Docker, you can have local TTS running in about 15 minutes. [Piper's documentation](https://github.com/rhasspy/piper) is solid, and the Wyoming protocol integration with Home Assistant makes it trivially easy if you're already in that ecosystem.
For Qwen, the speed is the bottleneck. If Apple ships faster silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it's a novelty for short clips.

The setup cost me nothing beyond hardware I already owned. If you have a machine that runs Docker, [Piper](https://github.com/rhasspy/piper) is a 15-minute setup. Qwen requires more patience, both for the initial configuration and for waiting on every generation.