<!DOCTYPE html><html lang="en" data-astro-cid-5hce7sga> <head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware — dd0c.net</title><link rel="icon" href="/favicon.ico"><style>[data-astro-cid-5hce7sga],[data-astro-cid-5hce7sga]:before,[data-astro-cid-5hce7sga]:after{box-sizing:border-box}body{font-family:system-ui,-apple-system,sans-serif;margin:0;background:#f8f9fa;color:#333;line-height:1.6}nav[data-astro-cid-5hce7sga]{background:#fff;border-bottom:2px solid #3294D2;padding:0 1.5rem;display:flex;align-items:center;gap:0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{display:flex;align-items:center;gap:.5rem;text-decoration:none;margin-right:1.5rem;padding:.75rem 0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga] img[data-astro-cid-5hce7sga]{height:32px;width:auto}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#333;text-decoration:none;padding:.75rem .85rem;font-size:.95rem;transition:color .15s}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{color:#3294d2}nav[data-astro-cid-5hce7sga] .spacer[data-astro-cid-5hce7sga]{flex:1}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga].external{color:#3294d2}main[data-astro-cid-5hce7sga]{max-width:760px;margin:2rem auto;padding:0 1.25rem}footer[data-astro-cid-5hce7sga]{margin-top:3rem;padding:1.25rem;text-align:center;font-size:.875rem;color:#666;border-top:1px solid #e0e0e0;background:#fff}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#3294d2;text-decoration:none}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{text-decoration:underline}h1[data-astro-cid-5hce7sga]{font-size:1.75rem;color:#1a1a1a}h2[data-astro-cid-5hce7sga]{font-size:1.2rem;margin-top:1.75rem;color:#1a1a1a}a[data-astro-cid-5hce7sga]{color:#3294d2}p[data-astro-cid-5hce7sga]{margin:.6rem 0}@media (max-width: 
600px){nav[data-astro-cid-5hce7sga]{flex-wrap:wrap;padding:0 .75rem}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{margin-right:.5rem}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{padding:.6rem .5rem;font-size:.875rem}}
.post-meta[data-astro-cid-gjtny2mx]{color:#888;font-size:.875rem;margin-bottom:1.5rem}.post-body[data-astro-cid-gjtny2mx]{line-height:1.75}.post-body[data-astro-cid-gjtny2mx] iframe[data-astro-cid-gjtny2mx]{max-width:100%}.back[data-astro-cid-gjtny2mx]{display:inline-block;margin-bottom:1.25rem;font-size:.9rem;color:#3294d2;text-decoration:none}.back[data-astro-cid-gjtny2mx]:hover{text-decoration:underline}
</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga> <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I wanted voice capabilities for my personal AI setup without sending audio to a cloud API. Not for privacy paranoia — I just didn’t want to pay per-request for something that should run locally. Turns out the open-source TTS landscape has gotten surprisingly good.</p>
<h2 id="piper-fast-local-tts">Piper: Fast Local TTS</h2>
<p><a href="https://github.com/rhasspy/piper">Piper</a> is a local text-to-speech engine that runs on CPU. No GPU required. I have it running in Docker on a TrueNAS server via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>, which is the same protocol Home Assistant uses for voice assistants.</p>
<p>The speed surprised me. On modest hardware, Piper generates speech at realtime speed or better: a 10-second clip takes less than 10 seconds to render. The voice quality isn’t ElevenLabs, but it’s completely usable: clear, natural-sounding, and consistent.</p>
<p>I wrote a small <a href="https://github.com/ddoc">OpenAI-compatible proxy</a> that wraps the Wyoming protocol in a standard <code>/v1/audio/speech</code> endpoint. Any tool that speaks the OpenAI TTS API can now use my local Piper instance without modification.</p>
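<p>To make that concrete, here is a minimal sketch of what a client call to such an endpoint could look like. It follows the request shape of the OpenAI text-to-speech API (<code>model</code>, <code>voice</code>, <code>input</code>, <code>response_format</code>). The host name, model name, and voice name are placeholders, and whether the proxy returns WAV is an assumption; only the path and port 8951 come from this post.</p>

```python
import json
import urllib.request

# Assumed values: the host name is a placeholder, and the model and voice
# names are illustrative. Port 8951 is the proxy port listed under "The Setup".
BASE_URL = "http://truenas.local:8951"


def build_speech_request(text, voice="en-voice", model="piper"):
    """Build a POST request in the shape of the OpenAI text-to-speech API."""
    payload = {
        "model": model,            # hypothetical model name
        "voice": voice,            # hypothetical voice name
        "input": text,             # the text to speak
        "response_format": "wav",  # assumes the proxy can return WAV
    }
    return urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text), timeout=30) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

<p>Calling <code>synthesize("Dinner is ready.", "out.wav")</code> would write the response body to disk. Building the request in a separate function keeps the payload easy to inspect without a running server.</p>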
<p>Piper ships with dozens of voice models across multiple languages. I’m running a single English voice right now, but you can load multiple models on different ports if you need variety.</p>
<h2 id="qwen3-tts-voice-cloning-on-apple-silicon">Qwen3-TTS: Voice Cloning on Apple Silicon</h2>
<p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> is where things get interesting. It’s a text-to-speech model that supports voice cloning from a 3-second audio sample. I run it on an M1 Pro MacBook using <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>, which is built on Apple’s MLX framework and runs the model on the Apple Silicon GPU rather than the Neural Engine.</p>
<p>The workflow: give it a short <code>.wav</code> clip of someone’s voice, and it generates new speech in that voice. The quality varies — natural conversational samples work much better than scripted readings. A few things I learned the hard way:</p>
<ul>
<li><strong>Input audio must be 24 kHz mono WAV.</strong> 16 kHz samples produce truncated garbage with no error message. I lost hours to this.</li>
<li><strong>Skip the style steering.</strong> The <code>--instruct</code> flag for controlling tone paradoxically produces worse results for voice cloning. Just let the model match the reference audio naturally.</li>
<li><strong>Keep it short.</strong> Generation speed is about 0.08x realtime on M1 Pro — a 10-second clip takes roughly 2 minutes. Fine for short messages, impractical for long-form audio.</li>
</ul>
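<p>Since the wrong sample rate fails silently, it may be worth checking a reference clip before handing it to the model. The sketch below uses Python’s standard <code>wave</code> module to verify the 24 kHz mono requirement described in the list above. The function name is made up for illustration, and the check only handles uncompressed PCM WAV files, which is all the <code>wave</code> module reads.</p>

```python
import wave

REQUIRED_RATE_HZ = 24000  # sample rate the post reports as required
REQUIRED_CHANNELS = 1     # mono


def check_reference_wav(path):
    """Return a list of problems found in a reference clip (empty if none)."""
    problems = []
    # wave.open accepts a file path or a binary file-like object.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

<p>An empty list means the clip passes both checks. Anything else is a reason to resample before generating.</p>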
<p>I set it up as a <a href="https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html">LaunchAgent</a> so it starts when I log in and stays running. The API is FastAPI-based, exposed on the local network.</p>
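<p>For readers who have not written a launchd job before, here is a rough sketch of what such a job description could contain, generated with Python’s <code>plistlib</code>. The keys (<code>Label</code>, <code>ProgramArguments</code>, <code>RunAtLoad</code>, <code>KeepAlive</code>, <code>StandardErrorPath</code>) are standard launchd keys. The label, interpreter path, script path, and log path are all placeholders and not my actual configuration.</p>

```python
import plistlib

# All values below are placeholders: the label, interpreter path, and server
# script are hypothetical and would need to match the real installation.
agent = {
    "Label": "net.example.qwen3-tts",
    "ProgramArguments": [
        "/usr/local/bin/python3",
        "/Users/me/tts/server.py",
    ],
    "RunAtLoad": True,   # start the job as soon as launchd loads it
    "KeepAlive": True,   # restart the process if it exits
    "StandardErrorPath": "/tmp/qwen3-tts.err.log",
}


def render_launch_agent(config):
    """Serialize a launchd job description to XML property-list bytes."""
    return plistlib.dumps(config, fmt=plistlib.FMT_XML)


plist_bytes = render_launch_agent(agent)
```

<p>The resulting bytes would be saved as a <code>.plist</code> file under <code>~/Library/LaunchAgents/</code>, where launchd picks up per-user agents.</p>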
<h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2>
<p><strong>Voice replies on Telegram.</strong> When someone sends me a voice message, Whisper transcribes it, the AI generates a text response, and Piper or Qwen converts it back to speech. The reply goes back as a voice note. With Piper, the whole round trip takes a few seconds; Qwen replies take longer because of its slower generation speed.</p>
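<p>The shape of that round trip is three stages chained together. The sketch below shows the chaining only, with each stage passed in as a plain function so the TTS backend can be swapped. The function and parameter names are illustrative, and the Telegram handling and the actual service calls are left out.</p>

```python
from typing import Callable


def voice_reply(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one voice message through speech-to-text, the model, and TTS."""
    text_in = transcribe(audio_in).strip()
    if not text_in:
        # Nothing intelligible was transcribed, so there is nothing to answer.
        return b""
    text_out = generate_reply(text_in)
    return synthesize(text_out)
```

<p>Passing the stages as arguments means the same function works whether <code>synthesize</code> points at Piper or at Qwen.</p>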
<p><strong>Podcast generation.</strong> I built a pipeline that researches news via web search, writes a two-host podcast script, and generates 20 minutes of audio using different TTS voices for each host. Piper handles the bulk generation (fast), and I’ve experimented with Qwen for more distinctive character voices.</p>
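<p>One way to route a two-host script to different voices is to split it into per-speaker segments first. The sketch below assumes a script format of one <code>SPEAKER: text</code> line per turn. That format, the speaker labels, and the voice names are all made up for illustration and are not what my pipeline actually uses.</p>

```python
# Hypothetical script format: one "SPEAKER: line" per row. The speaker labels
# and voice names are placeholders, not the author's actual configuration.
VOICES = {"HOST_A": "voice-one", "HOST_B": "voice-two"}


def split_script(script):
    """Turn a two-host script into (voice, text) pairs in speaking order."""
    segments = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        # Split on the first colon only, so colons inside the text survive.
        speaker, sep, text = line.partition(":")
        if not sep or speaker.strip() not in VOICES:
            raise ValueError(f"unrecognized script line: {line!r}")
        segments.append((VOICES[speaker.strip()], text.strip()))
    return segments
```

<p>Each pair can then be sent to the TTS engine with the matching voice, and the resulting clips concatenated in order.</p>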
<p><strong>Notifications and alerts.</strong> Instead of text notifications, some of my home automation alerts are spoken aloud through a Sonos speaker. Piper handles this — it’s fast enough that the latency is imperceptible.</p>
<h2 id="piper-vs-qwen-when-to-use-which">Piper vs. Qwen: When to Use Which</h2>
<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime or faster</td><td>~0.08x realtime</td></tr><tr><td>Voice cloning</td><td>No</td><td>Yes (3-second sample)</td></tr><tr><td>Hardware</td><td>CPU (any machine)</td><td>Apple Silicon recommended</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Excellent for cloned voices</td></tr><tr><td>Use case</td><td>Bulk TTS, real-time</td><td>Short personalized clips</td></tr></tbody></table>
<p>For anything that needs to be fast or long-form, Piper wins. For anything that needs to sound like a specific person, Qwen wins. I use both.</p>
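<p>The speed gap is easier to appreciate with the arithmetic written out. The sketch below treats the realtime factor as seconds of audio produced per second of compute, so render time is audio length divided by that factor. The 0.08 figure is the Qwen3-TTS number from the table; the helper name is illustrative.</p>

```python
def render_seconds(audio_seconds, realtime_factor):
    """Estimate wall-clock render time for a clip.

    realtime_factor is seconds of audio produced per second of compute,
    so 1.0 is realtime and 0.08 is the Qwen3-TTS figure quoted above.
    """
    if not realtime_factor > 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


short_clip = render_seconds(10, 0.08)        # 10-second clip
podcast = render_seconds(20 * 60, 0.08)      # 20-minute episode
print(f"{short_clip:.0f} s, {podcast / 3600:.1f} h")
```

<p>At 0.08x, a 10-second clip comes out at about 125 seconds, and a 20-minute episode at roughly 4.2 hours, which is why bulk generation goes to Piper.</p>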
<h2 id="the-setup">The Setup</h2>
<p>The whole thing runs on hardware I already owned:</p>
<ul>
<li><strong>Piper</strong>: Docker container on TrueNAS, Wyoming protocol on port 10200, OpenAI-compatible proxy on port 8951</li>
<li><strong>Qwen3-TTS</strong>: LaunchAgent on MacBook, FastAPI server on port 8880</li>
<li><strong>Whisper</strong>: Docker container on TrueNAS, OpenAI-compatible endpoint on port 8950</li>
</ul>
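<p>With services spread across two machines, a quick check that every port is answering can save some debugging. The sketch below tries a TCP connection to each port from the list above. The ports are the ones listed; the host names and service labels are placeholders, and a successful connection only shows that something is listening, not that the service is healthy.</p>

```python
import socket

# Ports come from the list above; the host names are placeholders.
SERVICES = {
    "piper-wyoming": ("truenas.local", 10200),
    "piper-proxy": ("truenas.local", 8951),
    "whisper": ("truenas.local", 8950),
    "qwen3-tts": ("macbook.local", 8880),
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def report(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}
```

<p>Running <code>report()</code> returns a dictionary of service names to <code>True</code> or <code>False</code>, which is enough to spot a container that did not come back after a reboot.</p>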
<p>Total additional cost: $0. These are all open-source models running on consumer hardware.</p>
<h2 id="whats-next">What’s Next</h2>
<p>I want to get Piper running with multiple voice models simultaneously, with different voices on different ports, so the podcast pipeline can use truly distinct local voices without hitting any external API. The current single-voice setup works, but having two or three local voices would eliminate the last dependency on cloud TTS entirely.</p>
<p>If you have a machine that can run Docker, you can have local TTS running in about 15 minutes. <a href="https://github.com/rhasspy/piper">Piper’s documentation</a> is solid, and the Wyoming protocol integration with Home Assistant makes it trivially easy if you’re already in that ecosystem.</p> </div> </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> ·
<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> ·
© Brian Galura 2004–2026
</footer> </body></html>