Fix Piper multi-voice claim, add proxy repo link
10
dist/posts/self-hosted-ai-stack/index.html
vendored
@@ -3,8 +3,8 @@
</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga> <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Local Voice AI: What Actually Works for TTS and Voice Cloning</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I spent the last few weeks setting up local text-to-speech and voice cloning for a personal AI assistant. Here’s what actually happened — not what the docs promise.</p>
<h2 id="piper-tts-the-reliable-workhorse">Piper TTS: The Reliable Workhorse</h2>
<p><a href="https://github.com/rhasspy/piper">Piper</a> runs in Docker on my TrueNAS server via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Setup was straightforward. Pick a voice model, point it at a port, done.</p>
<p>The speed genuinely surprised me. On a machine that’s primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small proxy that wraps the Wyoming protocol in an OpenAI-compatible <code>/v1/audio/speech</code> endpoint so anything that speaks that API can use it.</p>
<p>The catch: Piper has one voice at a time. I tried using it for a two-host podcast and both hosts sounded identical. For single-voice use cases — notifications, voice replies, alerts through a Sonos speaker — it’s excellent. For anything requiring distinct voices, you need something else.</p>
<p>The speed genuinely surprised me. On a machine that’s primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small <a href="https://github.com/ddoc/piper-openai-proxy">OpenAI-compatible proxy</a> that wraps the Wyoming protocol in a standard <code>/v1/audio/speech</code> endpoint so anything that speaks that API can use it.</p>
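<p>For illustration, here is a minimal sketch of what a client call against such an endpoint might look like. The host, port, model name, and voice name are assumptions, not values from the proxy itself; the request body follows the general shape of OpenAI's speech API (<code>model</code>, <code>input</code>, <code>voice</code>), and whether the proxy honors <code>response_format</code> is also an assumption.</p>

```python
import json
import urllib.request

# Hypothetical proxy address; host and port are assumptions for this sketch.
PROXY_URL = "http://localhost:8000/v1/audio/speech"


def build_speech_request(text, voice="default", model="piper", url=PROXY_URL):
    """Build a POST request shaped like OpenAI's speech API (model, input, voice)."""
    payload = {
        "model": model,            # placeholder name; the proxy may ignore it
        "input": text,
        "voice": voice,            # placeholder voice name
        "response_format": "wav",  # assumption: the proxy can return WAV
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

<p>Calling <code>synthesize("Front door opened", "alert.wav")</code> would write the returned audio to disk, provided something is actually listening at the assumed address.</p>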
<p>Piper supports multiple voices — either through multi-speaker models or by running separate instances with different voice models on different ports. I initially ran it with a single voice and assumed that was the limit. It’s not. For the podcast use case, running two Piper containers with different voices would have worked.</p>
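<p>One way to check whether a given voice model is multi-speaker is to inspect the JSON config that ships next to the model file. The sketch below assumes that config carries <code>num_speakers</code> and <code>speaker_id_map</code> fields; the helper name and the example speaker names are hypothetical.</p>

```python
import json


def list_speakers(config_text):
    """Return {speaker_name: speaker_id} from a Piper voice config given as JSON text.

    Assumes the config uses "num_speakers" and "speaker_id_map" keys.
    Single-speaker configs yield an empty dict.
    """
    config = json.loads(config_text)
    if config.get("num_speakers", 1) <= 1:
        return {}
    return dict(config.get("speaker_id_map", {}))


# Made-up config fragment for illustration only.
example = '{"num_speakers": 2, "speaker_id_map": {"host_a": 0, "host_b": 1}}'
speakers = list_speakers(example)
```

<p>An empty result suggests the model has a single voice, in which case separate instances on different ports are the way to get distinct speakers.</p>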
<h2 id="qwen3-tts-voice-cloning-impressive-but-painful">Qwen3-TTS Voice Cloning: Impressive But Painful</h2>
<p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> runs on my M1 Pro MacBook via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>. The pitch: give it a voice sample and it generates new speech in that voice.</p>
<p>The reality is more complicated.</p>
@@ -15,7 +15,7 @@
<p>I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It’s not a “give it any audio and get a perfect clone” situation. It’s more like “give it good audio and you’ll get a recognizable approximation.”</p>
<h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2>
<p><strong>Voice replies.</strong> Someone sends a voice message on Telegram, <a href="https://github.com/openai/whisper">Whisper</a> transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.</p>
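<p>The voice-reply flow is a three-stage chain: transcribe, respond, synthesize. A rough sketch of that shape follows, with each stage injected as a plain function so the real Whisper, assistant, and Piper calls can be swapped in. The class and parameter names are hypothetical, and the stand-in lambdas exist only to make the example runnable.</p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceReplyPipeline:
    transcribe: Callable[[bytes], str]  # speech-to-text stage, e.g. a Whisper wrapper
    respond: Callable[[str], str]       # assistant stage that produces the reply text
    synthesize: Callable[[str], bytes]  # text-to-speech stage, e.g. a Piper request

    def handle(self, voice_message: bytes) -> bytes:
        """Run one voice message through all three stages and return reply audio."""
        text = self.transcribe(voice_message).strip()
        if not text:
            # Nothing intelligible was said, so there is nothing to reply to.
            return b""
        reply = self.respond(text)
        return self.synthesize(reply)


# Stand-in stages: they treat the bytes as text so the flow can run anywhere.
pipeline = VoiceReplyPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

<p>Keeping the stages separate makes it easy to time each one, which is how it becomes obvious that synthesis speed is what makes or breaks the experience.</p>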
<p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn’t there yet for multi-voice long-form content.</p>
<p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first, but with a single instance both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio: generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. With only one Piper instance, my local setup wasn’t ready for multi-voice long-form content. In hindsight, running multiple Piper instances with different voices would have been the right call.</p>
<p><strong>Home automation alerts.</strong> Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.</p>
<p><strong>Voice cloning for short clips.</strong> Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds where the two-minute generation time is acceptable.</p>
<h2 id="the-honest-comparison">The Honest Comparison</h2>
@@ -59,9 +59,9 @@
<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime+</td><td>~30-50s per 12s (varies)</td></tr><tr><td>Voices</td><td>One at a time</td><td>Clone any voice</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>CPU (anything)</td><td>Apple Silicon</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table>
<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime+</td><td>~30-50s per 12s (varies)</td></tr><tr><td>Voices</td><td>Multiple (multi-speaker or multi-instance)</td><td>Clone any voice</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>CPU (anything)</td><td>Apple Silicon</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table>
<h2 id="whats-next">What’s Next</h2>
<p>I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices — I just need to run more than one instance.</p>
<p>I’m setting up multiple Piper instances with different voice models for the podcast pipeline. Piper supports dozens of voices across 40+ languages — I just need to configure the routing.</p>
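<p>The routing could be as simple as a lookup from speaker label to instance. Here is a minimal sketch, assuming the script uses <code>SPEAKER: text</code> lines and that each host has its own endpoint; the labels, ports, and voice names below are illustrative assumptions, not the actual configuration.</p>

```python
# Hypothetical mapping from speaker label to a Piper-backed endpoint.
# Labels, ports, and voice names are assumptions for this sketch.
INSTANCES = {
    "HOST_A": {
        "url": "http://localhost:8001/v1/audio/speech",
        "voice": "en_US-lessac-medium",
    },
    "HOST_B": {
        "url": "http://localhost:8002/v1/audio/speech",
        "voice": "en_GB-alan-medium",
    },
}


def route_script(script, instances=INSTANCES):
    """Turn 'SPEAKER: text' lines into ordered (url, voice, text) synthesis jobs."""
    jobs = []
    for line_number, line in enumerate(script.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines separate turns and carry no speech
        # Split on the first colon only, so colons inside the spoken text survive.
        speaker, sep, text = line.partition(":")
        speaker = speaker.strip()
        if not sep or speaker not in instances:
            raise ValueError(f"line {line_number}: unknown or missing speaker")
        target = instances[speaker]
        jobs.append((target["url"], target["voice"], text.strip()))
    return jobs
```

<p>Each job can then be sent to its endpoint in order and the resulting clips concatenated. Failing loudly on an unknown speaker is a deliberate choice here: a silent fallback would bring back the two-hosts-one-voice problem.</p>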
<p>For Qwen, speed is the bottleneck. If Apple ships faster silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it’s a novelty for short clips.</p>
<p>The setup cost me nothing beyond hardware I already owned. If you have a machine that runs Docker, <a href="https://github.com/rhasspy/piper">Piper</a> is a 15-minute setup. Qwen requires more patience — both for the initial configuration and for waiting on every generation.</p> </div> </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> ·
<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> ·