dd0c-site/dist/posts/self-hosted-ai-stack/index.html

<!DOCTYPE html><html lang="en" data-astro-cid-5hce7sga> <head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Local Voice AI: What Actually Works for TTS and Voice Cloning — dd0c.net</title><link rel="icon" href="/favicon.ico"><style>[data-astro-cid-5hce7sga],[data-astro-cid-5hce7sga]:before,[data-astro-cid-5hce7sga]:after{box-sizing:border-box}body{font-family:system-ui,-apple-system,sans-serif;margin:0;background:#f8f9fa;color:#333;line-height:1.6}nav[data-astro-cid-5hce7sga]{background:#fff;border-bottom:2px solid #3294D2;padding:0 1.5rem;display:flex;align-items:center;gap:0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{display:flex;align-items:center;gap:.5rem;text-decoration:none;margin-right:1.5rem;padding:.75rem 0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga] img[data-astro-cid-5hce7sga]{height:32px;width:auto}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#333;text-decoration:none;padding:.75rem .85rem;font-size:.95rem;transition:color .15s}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{color:#3294d2}nav[data-astro-cid-5hce7sga] .spacer[data-astro-cid-5hce7sga]{flex:1}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga].external{color:#3294d2}main[data-astro-cid-5hce7sga]{max-width:760px;margin:2rem auto;padding:0 1.25rem}footer[data-astro-cid-5hce7sga]{margin-top:3rem;padding:1.25rem;text-align:center;font-size:.875rem;color:#666;border-top:1px solid #e0e0e0;background:#fff}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#3294d2;text-decoration:none}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{text-decoration:underline}h1[data-astro-cid-5hce7sga]{font-size:1.75rem;color:#1a1a1a}h2[data-astro-cid-5hce7sga]{font-size:1.2rem;margin-top:1.75rem;color:#1a1a1a}a[data-astro-cid-5hce7sga]{color:#3294d2}p[data-astro-cid-5hce7sga]{margin:.6rem 0}@media (max-width: 
600px){nav[data-astro-cid-5hce7sga]{flex-wrap:wrap;padding:0 .75rem}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{margin-right:.5rem}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{padding:.6rem .5rem;font-size:.875rem}}
.post-meta[data-astro-cid-gjtny2mx]{color:#888;font-size:.875rem;margin-bottom:1.5rem}.post-body[data-astro-cid-gjtny2mx]{line-height:1.75}.post-body[data-astro-cid-gjtny2mx] iframe[data-astro-cid-gjtny2mx]{max-width:100%}.back[data-astro-cid-gjtny2mx]{display:inline-block;margin-bottom:1.25rem;font-size:.9rem;color:#3294d2;text-decoration:none}.back[data-astro-cid-gjtny2mx]:hover{text-decoration:underline}
</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga>   <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Local Voice AI: What Actually Works for TTS and Voice Cloning</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I spent the last few weeks setting up local text-to-speech and voice cloning for a personal AI assistant. Here’s what actually happened — not what the docs promise.</p>
<h2 id="piper-tts-the-reliable-workhorse">Piper TTS: The Reliable Workhorse</h2>
<p><a href="https://github.com/rhasspy/piper">Piper</a> runs in Docker on my TrueNAS server via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Setup was straightforward. Pick a voice model, point it at a port, done.</p>
<p>The speed genuinely surprised me. On a machine that’s primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small <a href="https://github.com/ddoc/piper-openai-proxy">OpenAI-compatible proxy</a> that wraps the Wyoming protocol in a standard <code>/v1/audio/speech</code> endpoint so anything that speaks that API can use it.</p>
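<p>As a rough illustration of what "anything that speaks that API" means in practice, here is a minimal client sketch. The request fields follow the general shape of OpenAI's speech endpoint; the base URL, the <code>model</code> value, and the <code>voice</code> name are assumptions for illustration, not values taken from the proxy itself.</p>

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str = "default") -> urllib.request.Request:
    """Build a POST request shaped like OpenAI's /v1/audio/speech API."""
    payload = {
        "model": "piper",  # hypothetical; the proxy may ignore or expect another value
        "input": text,
        "voice": voice,  # hypothetical voice name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to disk."""
    request = build_speech_request(base_url, text)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
```

<p>Calling <code>synthesize("http://localhost:8000", "Hello from the NAS", "hello.wav")</code> would write the audio to disk, assuming the proxy listens on that (hypothetical) address.</p>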
<p>Piper supports multiple voices — either through multi-speaker models or by running separate instances with different voice models on different ports. I initially ran it with a single voice and assumed that was the limit. It’s not. For the podcast use case, running two Piper containers with different voices would have worked.</p>
<h2 id="qwen3-tts-voice-cloning-impressive-but-painful">Qwen3-TTS Voice Cloning: Impressive But Painful</h2>
<p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> runs on my M1 Pro MacBook via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>. The pitch: give it a voice sample and it generates new speech in that voice.</p>
<p>The reality is more complicated.</p>
<p><strong>It’s slow.</strong> About 0.3x realtime on M1 Pro on average, though it varies wildly — I benchmarked five runs and got anywhere from 0.09x to 0.41x. A 12-second clip takes 30–50 seconds on a good run, but can spike to over two minutes when thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.</p>
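<p>The numbers above are related by simple division: the realtime factor is audio duration divided by wall-clock generation time. A small sketch of that arithmetic, using only the figures quoted in this post:</p>

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return audio_seconds / wall_seconds


def generation_seconds(audio_seconds: float, factor: float) -> float:
    """Wall-clock time needed to generate a clip at a given realtime factor."""
    return audio_seconds / factor


if __name__ == "__main__":
    # A 12-second clip at the best, average, and worst factors I measured.
    for factor in (0.41, 0.30, 0.09):
        seconds = generation_seconds(12, factor)
        print(f"12 s clip at {factor:.2f}x: about {seconds:.0f} s")

    # A 20-minute podcast (1200 s of audio) at the average and worst factors.
    for factor in (0.30, 0.09):
        minutes = generation_seconds(20 * 60, factor) / 60
        print(f"20 min podcast at {factor:.2f}x: about {minutes:.0f} min")
```

<p>At the average factor a 20-minute episode would take over an hour to generate, and at the worst measured factor several hours, which is why long-form output is off the table.</p>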
<p><strong>The 24kHz gotcha cost me hours.</strong> Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio, which is a common default for speech-oriented recorders and pipelines, the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but I found nothing in the docs pointing to this.</p>
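<p>Since the failure is silent, a pre-flight check on the reference clip is cheap insurance. The sketch below reads the WAV header with Python's standard <code>wave</code> module and builds an ffmpeg command using the standard <code>-ar</code> (sample rate) and <code>-ac</code> (channel count) options. The function names are my own, this is not necessarily the exact command I ran, and <code>wave</code> only reads PCM WAV files.</p>

```python
import wave

REQUIRED_RATE = 24000  # Hz, per the requirement described above
REQUIRED_CHANNELS = 1  # mono


def needs_resample(wav_path: str) -> bool:
    """Return True if the WAV file is not 24000 Hz mono."""
    with wave.open(wav_path, "rb") as wav:
        rate_ok = wav.getframerate() == REQUIRED_RATE
        channels_ok = wav.getnchannels() == REQUIRED_CHANNELS
    return not (rate_ok and channels_ok)


def ffmpeg_resample_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts src to 24000 Hz mono at dst."""
    return [
        "ffmpeg",
        "-y",  # overwrite dst if it already exists
        "-i", src,
        "-ar", str(REQUIRED_RATE),
        "-ac", str(REQUIRED_CHANNELS),
        dst,
    ]
```

<p>The command list can be passed to <code>subprocess.run(..., check=True)</code> whenever <code>needs_resample</code> returns <code>True</code>.</p>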
<p><strong>Style steering makes it worse.</strong> The model has an <code>--instruct</code> mode for controlling tone and delivery. Every time I used it for voice cloning, the output quality dropped. Removing it and letting the model match the reference audio naturally produced better results every time.</p>
<p><strong>Sample quality matters more than sample length.</strong> The docs suggest a few seconds of audio is enough. Technically true, but a clean 10-second conversational clip produces noticeably better clones than a noisy 3-second snippet. I got the best results from natural speech — someone talking normally, not reading a script.</p>
<p>I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It’s not a “give it any audio and get a perfect clone” situation. It’s more like “give it good audio and you’ll get a recognizable approximation.”</p>
<h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2>
<p><strong>Voice replies.</strong> Someone sends a voice message on Telegram, <a href="https://github.com/openai/whisper">Whisper</a> transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.</p>
<p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first, but with a single voice model both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio: generation speed varied more than 4x between runs (0.09x to 0.41x realtime). In hindsight, multiple Piper instances would have been the right call. I ended up using a cloud TTS API with two distinct voices for this. My local setup, as configured at the time, wasn’t there yet for multi-voice long-form content.</p>
<p><strong>Home automation alerts.</strong> Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.</p>
<p><strong>Voice cloning for short clips.</strong> Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds, where a wait of 30 seconds to two minutes per clip is acceptable.</p>
<h2 id="the-honest-comparison">The Honest Comparison</h2>


<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime or faster</td><td>~0.3x realtime (30–50 s per 12 s clip, varies)</td></tr><tr><td>Voices</td><td>Multiple (multi-speaker or multi-instance)</td><td>Clone any voice</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>CPU (modest hardware is enough)</td><td>Apple Silicon (via mlx-audio, as tested)</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table>
<h2 id="whats-next">What’s Next</h2>
<p>I’m setting up multiple Piper instances with different voice models for the podcast pipeline. Piper supports dozens of voices across 40+ languages — I just need to configure the routing.</p>
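<p>The routing itself can be quite small. Here is a minimal sketch, assuming a script format where each line starts with a speaker label and that each Piper instance listens on its own port. The labels, hosts, and ports are all hypothetical placeholders, not my actual configuration.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiperInstance:
    host: str
    port: int


# Hypothetical mapping: one Piper container per podcast host.
VOICE_ROUTES = {
    "HOST_A": PiperInstance("localhost", 10200),
    "HOST_B": PiperInstance("localhost", 10201),
}


def route_script(script: str) -> list[tuple[PiperInstance, str]]:
    """Pair each 'SPEAKER: text' line with the Piper instance for that speaker."""
    routed = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between turns
        speaker, separator, text = line.partition(":")
        speaker = speaker.strip()
        if not separator or speaker not in VOICE_ROUTES:
            raise ValueError(f"Unroutable line: {line!r}")
        routed.append((VOICE_ROUTES[speaker], text.strip()))
    return routed
```

<p>Each returned pair says which instance should synthesize which line; sending the text over Wyoming (or through the proxy) and concatenating the audio is left out of this sketch.</p>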
<p>For Qwen, the speed is the bottleneck. If Apple ships faster silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it’s a novelty for short clips.</p>
<p>The setup cost me nothing beyond hardware I already owned. If you have a machine that runs Docker, <a href="https://github.com/rhasspy/piper">Piper</a> is a 15-minute setup. Qwen requires more patience — both for the initial configuration and for waiting on every generation.</p> </div>  </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> &nbsp;·&nbsp;
<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> &nbsp;·&nbsp;
    &copy; Brian Galura 2004&ndash;2026
</footer> </body></html>