
<!DOCTYPE html><html lang="en" data-astro-cid-5hce7sga> <head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Local Voice AI: What Actually Works for TTS and Voice Cloning — dd0c.net</title><link rel="icon" href="/favicon.ico"><style>[data-astro-cid-5hce7sga],[data-astro-cid-5hce7sga]:before,[data-astro-cid-5hce7sga]:after{box-sizing:border-box}body{font-family:system-ui,-apple-system,sans-serif;margin:0;background:#f8f9fa;color:#333;line-height:1.6}nav[data-astro-cid-5hce7sga]{background:#fff;border-bottom:2px solid #3294D2;padding:0 1.5rem;display:flex;align-items:center;gap:0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{display:flex;align-items:center;gap:.5rem;text-decoration:none;margin-right:1.5rem;padding:.75rem 0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga] img[data-astro-cid-5hce7sga]{height:32px;width:auto}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#333;text-decoration:none;padding:.75rem .85rem;font-size:.95rem;transition:color .15s}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{color:#3294d2}nav[data-astro-cid-5hce7sga] .spacer[data-astro-cid-5hce7sga]{flex:1}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga].external{color:#3294d2}main[data-astro-cid-5hce7sga]{max-width:760px;margin:2rem auto;padding:0 1.25rem}footer[data-astro-cid-5hce7sga]{margin-top:3rem;padding:1.25rem;text-align:center;font-size:.875rem;color:#666;border-top:1px solid #e0e0e0;background:#fff}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#3294d2;text-decoration:none}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{text-decoration:underline}h1[data-astro-cid-5hce7sga]{font-size:1.75rem;color:#1a1a1a}h2[data-astro-cid-5hce7sga]{font-size:1.2rem;margin-top:1.75rem;color:#1a1a1a}a[data-astro-cid-5hce7sga]{color:#3294d2}p[data-astro-cid-5hce7sga]{margin:.6rem 0}@media (max-width: 
600px){nav[data-astro-cid-5hce7sga]{flex-wrap:wrap;padding:0 .75rem}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{margin-right:.5rem}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{padding:.6rem .5rem;font-size:.875rem}}
.post-meta[data-astro-cid-gjtny2mx]{color:#888;font-size:.875rem;margin-bottom:1.5rem}.post-body[data-astro-cid-gjtny2mx]{line-height:1.75}.post-body[data-astro-cid-gjtny2mx] iframe[data-astro-cid-gjtny2mx]{max-width:100%}.back[data-astro-cid-gjtny2mx]{display:inline-block;margin-bottom:1.25rem;font-size:.9rem;color:#3294d2;text-decoration:none}.back[data-astro-cid-gjtny2mx]:hover{text-decoration:underline}
</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga>   <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Local Voice AI: What Actually Works for TTS and Voice Cloning</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I spent the last few weeks setting up local text-to-speech and voice cloning for a personal AI assistant. Here’s what actually happened — not what the docs promise.</p>
<h2 id="piper-tts-the-reliable-workhorse">Piper TTS: The Reliable Workhorse</h2>
<p><a href="https://github.com/rhasspy/piper">Piper</a> runs in Docker on my TrueNAS server via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Setup was straightforward. Pick a voice model, point it at a port, done.</p>
<p>The speed genuinely surprised me. On a machine that’s primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small proxy that wraps the Wyoming protocol in an OpenAI-compatible <code>/v1/audio/speech</code> endpoint so anything that speaks that API can use it.</p>
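<p>A client call against a proxy like this might look like the sketch below. The request fields (<code>model</code>, <code>input</code>, <code>voice</code>, <code>response_format</code>) follow the OpenAI speech API. The host, port, model name, and voice name are placeholders chosen for illustration, not values from the setup described here, and how a given proxy maps <code>voice</code> to a Piper model is an assumption.</p>

```python
import json
import urllib.request

# Hypothetical address of the proxy; adjust to wherever yours listens.
PROXY_URL = "http://nas.local:8000/v1/audio/speech"


def build_speech_request(text: str, voice: str = "en_US-lessac-medium",
                         url: str = PROXY_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style speech request."""
    payload = {
        "model": "piper",          # placeholder; a proxy may ignore this field
        "input": text,
        "voice": voice,            # assumed to map to a Piper voice model name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

<p>Calling <code>synthesize("Dinner is ready", "alert.wav")</code> would write the returned audio to disk, assuming a proxy is actually listening at that address.</p>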
<p>The catch: a Piper instance serves one voice at a time. I tried using it for a two-host podcast and both hosts sounded identical. For single-voice use cases (notifications, voice replies, alerts through a Sonos speaker) it’s excellent. For anything requiring distinct voices, you need something else.</p>
<h2 id="qwen3-tts-voice-cloning-impressive-but-painful">Qwen3-TTS Voice Cloning: Impressive But Painful</h2>
<p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> runs on my M1 Pro MacBook via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>. The pitch: give it a voice sample and it generates new speech in that voice.</p>
<p>The reality is more complicated.</p>
<p><strong>It’s slow.</strong> On my M1 Pro it averages about 0.3x realtime, but it varies wildly: across five benchmark runs I measured anywhere from 0.09x to 0.41x. A 12-second clip takes 30–50 seconds on a good run and can spike to over two minutes when the machine is thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.</p>
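<p>The arithmetic behind those numbers is simple: generation time is clip length divided by the realtime factor. A minimal sketch, using only the factors quoted above (the function name is mine):</p>

```python
def generation_seconds(clip_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to generate a clip at a given realtime factor.

    A factor of 1.0 means one second of audio per second of compute;
    0.3 means the model produces 0.3 seconds of audio per second.
    """
    if not realtime_factor > 0:
        raise ValueError("realtime_factor must be positive")
    return clip_seconds / realtime_factor


# Factors quoted in the post: average, slowest, and fastest measured run.
for label, rtf in [("average", 0.30), ("slowest", 0.09), ("fastest", 0.41)]:
    print(f"{label}: {generation_seconds(12, rtf):.0f} s for a 12 s clip")
```

<p>By the same formula, a 20-minute episode at the 0.3x average works out to roughly 67 minutes of generation, and to well over three hours at 0.09x.</p>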
<p><strong>The 24kHz gotcha cost me hours.</strong> Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.</p>
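<p>Because the failure is silent, it may be worth checking every reference clip before it reaches the model. The sketch below is one way to do that, not the exact commands I ran: it reads the WAV header with Python’s standard <code>wave</code> module and, if needed, calls ffmpeg with its standard <code>-ar</code> (sample rate) and <code>-ac</code> (channel count) options. The function names are hypothetical, and it assumes the input is already a PCM WAV file and that <code>ffmpeg</code> is on the PATH.</p>

```python
import subprocess
import wave

REQUIRED_RATE = 24000   # Hz, per the requirement described above
REQUIRED_CHANNELS = 1   # mono


def is_valid_reference(path: str) -> bool:
    """Return True if the WAV file is already 24000 Hz mono.

    wave.open raises wave.Error for files that are not PCM WAV.
    """
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == REQUIRED_RATE
                and wav.getnchannels() == REQUIRED_CHANNELS)


def ffmpeg_resample_command(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command that converts src to 24000 Hz mono."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", str(REQUIRED_RATE), "-ac", str(REQUIRED_CHANNELS), dst]


def ensure_reference(src: str, dst: str) -> str:
    """Return a path that is safe to use as reference audio."""
    if is_valid_reference(src):
        return src
    # check=True turns an ffmpeg failure into an exception instead of
    # silently handing a bad file to the model.
    subprocess.run(ffmpeg_resample_command(src, dst), check=True)
    return dst
```

<p>The equivalent one-off shell command is <code>ffmpeg -i input.wav -ar 24000 -ac 1 output.wav</code>.</p>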
<p><strong>Style steering makes it worse.</strong> The model has an <code>--instruct</code> mode for controlling tone and delivery. Every time I used it for voice cloning, the output quality dropped. Removing it and letting the model match the reference audio naturally produced better results every time.</p>
<p><strong>Sample quality matters more than sample length.</strong> The docs suggest a few seconds of audio is enough. Technically true, but a clean 10-second conversational clip produces noticeably better clones than a noisy 3-second snippet. I got the best results from natural speech — someone talking normally, not reading a script.</p>
<p>I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It’s not a “give it any audio and get a perfect clone” situation. It’s more like “give it good audio and you’ll get a recognizable approximation.”</p>
<h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2>
<p><strong>Voice replies.</strong> Someone sends a voice message on Telegram, <a href="https://github.com/openai/whisper">Whisper</a> transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.</p>
<p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first, but both hosts sounded the same. Qwen was too slow and too inconsistent for 20 minutes of audio: generation speed varied more than 4x between runs (0.09x to 0.41x realtime). I ended up using a cloud TTS API with two distinct voices for this. My local setup isn’t there yet for multi-voice long-form content.</p>
<p><strong>Home automation alerts.</strong> Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.</p>
<p><strong>Voice cloning for short clips.</strong> Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing: anything under 15 seconds where waiting anywhere from 30 seconds to a couple of minutes per clip is acceptable.</p>
<h2 id="the-honest-comparison">The Honest Comparison</h2>


<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime or faster</td><td>~0.3x realtime (30–50 s per 12 s clip, varies)</td></tr><tr><td>Voices</td><td>One per instance</td><td>Cloned from a reference clip</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>Any CPU</td><td>Apple Silicon (via mlx-audio)</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table>
<h2 id="whats-next">What’s Next</h2>
<p>I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices — I just need to run more than one instance.</p>
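<p>I haven’t built this yet, so treat the following as a sketch of the routing idea rather than working infrastructure. It assumes one Piper instance per host voice and a script format of <code>SPEAKER: text</code> lines. The hostname, the second port, the speaker labels, and the choice of voices are all hypothetical.</p>

```python
from typing import NamedTuple


class PiperInstance(NamedTuple):
    host: str
    port: int
    voice: str


# Hypothetical mapping of podcast hosts to Piper instances, one voice each.
# 10200 is a port commonly used in Wyoming Piper examples; 10201 is made up.
HOSTS = {
    "HOST_A": PiperInstance("nas.local", 10200, "en_US-lessac-medium"),
    "HOST_B": PiperInstance("nas.local", 10201, "en_GB-alan-medium"),
}


def route_script(script: str) -> list[tuple[PiperInstance, str]]:
    """Split 'SPEAKER: text' lines into (instance, text) pairs.

    An unknown speaker label raises KeyError, so a typo in the script
    is caught before any audio is generated.
    """
    routed = []
    for line in script.splitlines():
        speaker, sep, text = line.partition(":")
        if not sep or not text.strip():
            continue  # skip blank lines and lines without a speaker label
        routed.append((HOSTS[speaker.strip()], text.strip()))
    return routed
```

<p>Each pair can then be sent to its own instance and the resulting clips concatenated in script order.</p>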
<p>For Qwen, the speed is the bottleneck. If Apple ships faster silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it’s a novelty for short clips.</p>
<p>The setup cost me nothing beyond hardware I already owned. If you have a machine that runs Docker, <a href="https://github.com/rhasspy/piper">Piper</a> is a 15-minute setup. Qwen requires more patience — both for the initial configuration and for waiting on every generation.</p> </div>  </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> &nbsp;·&nbsp;
<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> &nbsp;·&nbsp;
    &copy; Brian Galura 2004&ndash;2026
</footer> </body></html>