Fix Piper multi-voice claim, add proxy repo link

Jarvis Prime
2026-03-23 12:24:19 +00:00
parent 70e0be3703
commit afd886f51a
2 changed files with 10 additions and 10 deletions

@@ -3,8 +3,8 @@
</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga> <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Local Voice AI: What Actually Works for TTS and Voice Cloning</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I spent the last few weeks setting up local text-to-speech and voice cloning for a personal AI assistant. Here's what actually happened — not what the docs promise.</p> </style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga> <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Local Voice AI: What Actually Works for TTS and Voice Cloning</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I spent the last few weeks setting up local text-to-speech and voice cloning for a personal AI assistant. 
Here's what actually happened — not what the docs promise.</p>
<h2 id="piper-tts-the-reliable-workhorse">Piper TTS: The Reliable Workhorse</h2> <h2 id="piper-tts-the-reliable-workhorse">Piper TTS: The Reliable Workhorse</h2>
<p><a href="https://github.com/rhasspy/piper">Piper</a> runs in Docker on my TrueNAS server via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Setup was straightforward. Pick a voice model, point it at a port, done.</p> <p><a href="https://github.com/rhasspy/piper">Piper</a> runs in Docker on my TrueNAS server via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Setup was straightforward. Pick a voice model, point it at a port, done.</p>
<p>The speed genuinely surprised me. On a machine that's primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small proxy that wraps the Wyoming protocol in an OpenAI-compatible <code>/v1/audio/speech</code> endpoint so anything that speaks that API can use it.</p> <p>The speed genuinely surprised me. On a machine that's primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small <a href="https://github.com/ddoc/piper-openai-proxy">OpenAI-compatible proxy</a> that wraps the Wyoming protocol in a standard <code>/v1/audio/speech</code> endpoint so anything that speaks that API can use it.</p>
<p>The catch: Piper has one voice at a time. I tried using it for a two-host podcast and both hosts sounded identical. For single-voice use cases — notifications, voice replies, alerts through a Sonos speaker — it's excellent. For anything requiring distinct voices, you need something else.</p> <p>Piper supports multiple voices — either through multi-speaker models or by running separate instances with different voice models on different ports. I initially ran it with a single voice and assumed that was the limit. It's not. For the podcast use case, running two Piper containers with different voices would have worked.</p>
<h2 id="qwen3-tts-voice-cloning-impressive-but-painful">Qwen3-TTS Voice Cloning: Impressive But Painful</h2> <h2 id="qwen3-tts-voice-cloning-impressive-but-painful">Qwen3-TTS Voice Cloning: Impressive But Painful</h2>
<p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> runs on my M1 Pro MacBook via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>. The pitch: give it a voice sample and it generates new speech in that voice.</p> <p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> runs on my M1 Pro MacBook via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>. The pitch: give it a voice sample and it generates new speech in that voice.</p>
<p>The reality is more complicated.</p> <p>The reality is more complicated.</p>
@@ -15,7 +15,7 @@
<p>I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It's not a “give it any audio and get a perfect clone” situation. It's more like “give it good audio and you'll get a recognizable approximation.”</p> <p>I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It's not a “give it any audio and get a perfect clone” situation. It's more like “give it good audio and you'll get a recognizable approximation.”</p>
<h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2> <h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2>
<p><strong>Voice replies.</strong> Someone sends a voice message on Telegram, <a href="https://github.com/openai/whisper">Whisper</a> transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.</p> <p><strong>Voice replies.</strong> Someone sends a voice message on Telegram, <a href="https://github.com/openai/whisper">Whisper</a> transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.</p>
<p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.</p> <p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. In hindsight, multiple Piper instances would have been the right call. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.</p>
<p><strong>Home automation alerts.</strong> Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.</p> <p><strong>Home automation alerts.</strong> Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.</p>
<p><strong>Voice cloning for short clips.</strong> Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds where the two-minute generation time is acceptable.</p> <p><strong>Voice cloning for short clips.</strong> Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds where the two-minute generation time is acceptable.</p>
<h2 id="the-honest-comparison">The Honest Comparison</h2> <h2 id="the-honest-comparison">The Honest Comparison</h2>
@@ -59,9 +59,9 @@
<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime+</td><td>~30-50s per 12s (varies)</td></tr><tr><td>Voices</td><td>One at a time</td><td>Clone any voice</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>CPU (anything)</td><td>Apple Silicon</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table> <table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime+</td><td>~30-50s per 12s (varies)</td></tr><tr><td>Voices</td><td>Multiple (multi-speaker or multi-instance)</td><td>Clone any voice</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>CPU (anything)</td><td>Apple Silicon</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table>
<h2 id="whats-next">What's Next</h2> <h2 id="whats-next">What's Next</h2>
<p>I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices — I just need to run more than one instance.</p> <p>I'm setting up multiple Piper instances with different voice models for the podcast pipeline. Piper supports dozens of voices across 40+ languages — I just need to configure the routing.</p>
<p>For Qwen, the speed is the bottleneck. If Apple ships faster Neural Engine silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it's a novelty for short clips.</p> <p>For Qwen, the speed is the bottleneck. If Apple ships faster Neural Engine silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it's a novelty for short clips.</p>
<p>The setup cost me nothing beyond hardware I already owned. If you have a machine that runs Docker, <a href="https://github.com/rhasspy/piper">Piper</a> is a 15-minute setup. Qwen requires more patience — both for the initial configuration and for waiting on every generation.</p> </div> </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> &nbsp;·&nbsp; <p>The setup cost me nothing beyond hardware I already owned. If you have a machine that runs Docker, <a href="https://github.com/rhasspy/piper">Piper</a> is a 15-minute setup. Qwen requires more patience — both for the initial configuration and for waiting on every generation.</p> </div> </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> &nbsp;·&nbsp;
<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> &nbsp;·&nbsp; <a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> &nbsp;·&nbsp;

@@ -10,9 +10,9 @@ I spent the last few weeks setting up local text-to-speech and voice cloning for
[Piper](https://github.com/rhasspy/piper) runs in Docker on my TrueNAS server via the [Wyoming protocol](https://github.com/rhasspy/wyoming). Setup was straightforward. Pick a voice model, point it at a port, done. [Piper](https://github.com/rhasspy/piper) runs in Docker on my TrueNAS server via the [Wyoming protocol](https://github.com/rhasspy/wyoming). Setup was straightforward. Pick a voice model, point it at a port, done.
The speed genuinely surprised me. On a machine that's primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small proxy that wraps the Wyoming protocol in an OpenAI-compatible `/v1/audio/speech` endpoint so anything that speaks that API can use it. The speed genuinely surprised me. On a machine that's primarily a NAS, not a compute server, Piper generates speech at realtime speed or better. I wrote a small [OpenAI-compatible proxy](https://github.com/ddoc/piper-openai-proxy) that wraps the Wyoming protocol in a standard `/v1/audio/speech` endpoint so anything that speaks that API can use it.
The catch: Piper has one voice at a time. I tried using it for a two-host podcast and both hosts sounded identical. For single-voice use cases — notifications, voice replies, alerts through a Sonos speaker — it's excellent. For anything requiring distinct voices, you need something else. Piper supports multiple voices — either through multi-speaker models or by running separate instances with different voice models on different ports. I initially ran it with a single voice and assumed that was the limit. It's not. For the podcast use case, running two Piper containers with different voices would have worked.
## Qwen3-TTS Voice Cloning: Impressive But Painful ## Qwen3-TTS Voice Cloning: Impressive But Painful
@@ -34,7 +34,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
**Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety. **Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.
**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content. **Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. In hindsight, multiple Piper instances would have been the right call. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.
**Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible. **Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.
@@ -45,7 +45,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
| | Piper | Qwen3-TTS | | | Piper | Qwen3-TTS |
|---|---|---| |---|---|---|
| Speed | Realtime+ | ~30-50s per 12s (varies) | | Speed | Realtime+ | ~30-50s per 12s (varies) |
| Voices | One at a time | Clone any voice | | Voices | Multiple (multi-speaker or multi-instance) | Clone any voice |
| Quality | Good, consistent | Variable, sample-dependent | | Quality | Good, consistent | Variable, sample-dependent |
| Hardware | CPU (anything) | Apple Silicon | | Hardware | CPU (anything) | Apple Silicon |
| Best for | Fast bulk TTS | Short personalized clips | | Best for | Fast bulk TTS | Short personalized clips |
@@ -53,7 +53,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
## What's Next ## What's Next
I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices — I just need to run more than one instance. I'm setting up multiple Piper instances with different voice models for the podcast pipeline. Piper supports dozens of voices across 40+ languages — I just need to configure the routing.
For Qwen, the speed is the bottleneck. If Apple ships faster Neural Engine silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it's a novelty for short clips. For Qwen, the speed is the bottleneck. If Apple ships faster Neural Engine silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it's a novelty for short clips.
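
The "configure the routing" step for multiple Piper instances could look roughly like the sketch below. It is not taken from the pipeline or from the proxy repo. It assumes one OpenAI-compatible proxy per Piper instance, each on its own port, and that the proxy accepts the standard `/v1/audio/speech` JSON body (`model`, `input`, `voice`, `response_format`). The host labels, ports, voice names, and the `"piper"` model string are all placeholders.

```python
import json
from urllib import request

# Hypothetical mapping: one proxy + Piper instance per podcast host.
# Ports and voice names are placeholders, not values from the post.
HOST_ENDPOINTS = {
    "host_a": {"base_url": "http://localhost:8001", "voice": "en_US-lessac-medium"},
    "host_b": {"base_url": "http://localhost:8002", "voice": "en_GB-alan-medium"},
}


def build_speech_request(speaker: str, text: str) -> request.Request:
    """Build (but do not send) a POST aimed at the instance that owns this speaker."""
    endpoint = HOST_ENDPOINTS[speaker]  # raises KeyError for an unknown speaker
    payload = {
        "model": "piper",  # assumption: the proxy accepts or ignores this field
        "input": text,
        "voice": endpoint["voice"],
        "response_format": "wav",
    }
    return request.Request(
        url=endpoint["base_url"] + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def plan_episode(script):
    """Turn [(speaker, line), ...] into one request per line, in script order."""
    return [build_speech_request(speaker, line) for speaker, line in script]


def synthesize(req: request.Request, timeout: float = 60.0) -> bytes:
    """Send a planned request and return the raw audio bytes."""
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

Planning the requests separately from sending them keeps the speaker-to-port mapping easy to check without any containers running. Calling `synthesize` on each planned request in order and concatenating the clips would produce the episode audio.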