Fix Qwen TTS speed claims with actual benchmark data

Jarvis Prime
2026-03-23 04:55:17 +00:00
parent b4d039f797
commit 70e0be3703
2 changed files with 6 additions and 6 deletions

@@ -8,14 +8,14 @@
<h2 id="qwen3-tts-voice-cloning-impressive-but-painful">Qwen3-TTS Voice Cloning: Impressive But Painful</h2> <h2 id="qwen3-tts-voice-cloning-impressive-but-painful">Qwen3-TTS Voice Cloning: Impressive But Painful</h2>
<p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> runs on my M1 Pro MacBook via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>. The pitch: give it a voice sample and it generates new speech in that voice.</p> <p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> runs on my M1 Pro MacBook via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>. The pitch: give it a voice sample and it generates new speech in that voice.</p>
<p>The reality is more complicated.</p> <p>The reality is more complicated.</p>
<p><strong>It's slow.</strong> About 0.08x realtime on M1 Pro. A 10-second clip takes roughly two minutes to generate. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.</p> <p><strong>It's slow.</strong> About 0.3x realtime on M1 Pro on average, though it varies wildly — I benchmarked five runs and got anywhere from 0.09x to 0.41x. A 12-second clip takes 30-50 seconds on a good run, but can spike to over two minutes when thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.</p>
<p><strong>The 24kHz gotcha cost me hours.</strong> Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.</p> <p><strong>The 24kHz gotcha cost me hours.</strong> Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.</p>
<p><strong>Style steering makes it worse.</strong> The model has an <code>--instruct</code> mode for controlling tone and delivery. Every time I used it for voice cloning, the output quality dropped. Removing it and letting the model match the reference audio naturally produced better results every time.</p> <p><strong>Style steering makes it worse.</strong> The model has an <code>--instruct</code> mode for controlling tone and delivery. Every time I used it for voice cloning, the output quality dropped. Removing it and letting the model match the reference audio naturally produced better results every time.</p>
<p><strong>Sample quality matters more than sample length.</strong> The docs suggest a few seconds of audio is enough. Technically true, but a clean 10-second conversational clip produces noticeably better clones than a noisy 3-second snippet. I got the best results from natural speech — someone talking normally, not reading a script.</p> <p><strong>Sample quality matters more than sample length.</strong> The docs suggest a few seconds of audio is enough. Technically true, but a clean 10-second conversational clip produces noticeably better clones than a noisy 3-second snippet. I got the best results from natural speech — someone talking normally, not reading a script.</p>
<p>I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It's not a “give it any audio and get a perfect clone” situation. It's more like “give it good audio and you'll get a recognizable approximation.”</p> <p>I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It's not a “give it any audio and get a perfect clone” situation. It's more like “give it good audio and you'll get a recognizable approximation.”</p>
<h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2> <h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2>
<p><strong>Voice replies.</strong> Someone sends a voice message on Telegram, <a href="https://github.com/openai/whisper">Whisper</a> transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.</p> <p><strong>Voice replies.</strong> Someone sends a voice message on Telegram, <a href="https://github.com/openai/whisper">Whisper</a> transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.</p>
<p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too slow for 20 minutes of audio. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.</p> <p><strong>Podcast generation.</strong> I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.</p>
<p><strong>Home automation alerts.</strong> Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.</p> <p><strong>Home automation alerts.</strong> Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.</p>
<p><strong>Voice cloning for short clips.</strong> Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds where the two-minute generation time is acceptable.</p> <p><strong>Voice cloning for short clips.</strong> Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds where the two-minute generation time is acceptable.</p>
<h2 id="the-honest-comparison">The Honest Comparison</h2> <h2 id="the-honest-comparison">The Honest Comparison</h2>
@@ -59,7 +59,7 @@
<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime+</td><td>~2 min per 10 sec</td></tr><tr><td>Voices</td><td>One at a time</td><td>Clone any voice</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>CPU (anything)</td><td>Apple Silicon</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table> <table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime+</td><td>~30-50s per 12s (varies)</td></tr><tr><td>Voices</td><td>One at a time</td><td>Clone any voice</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Variable, sample-dependent</td></tr><tr><td>Hardware</td><td>CPU (anything)</td><td>Apple Silicon</td></tr><tr><td>Best for</td><td>Fast bulk TTS</td><td>Short personalized clips</td></tr><tr><td>Worst at</td><td>Voice variety</td><td>Long-form, speed</td></tr></tbody></table>
<h2 id="whats-next">What's Next</h2> <h2 id="whats-next">What's Next</h2>
<p>I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices — I just need to run more than one instance.</p> <p>I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices — I just need to run more than one instance.</p>
<p>For Qwen, the speed is the bottleneck. If Apple ships faster Neural Engine silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it's a novelty for short clips.</p> <p>For Qwen, the speed is the bottleneck. If Apple ships faster Neural Engine silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it's a novelty for short clips.</p>

@@ -20,7 +20,7 @@ The catch: Piper has one voice at a time. I tried using it for a two-host podcas
The reality is more complicated. The reality is more complicated.
**It's slow.** About 0.08x realtime on M1 Pro. A 10-second clip takes roughly two minutes to generate. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening. **It's slow.** About 0.3x realtime on M1 Pro on average, though it varies wildly — I benchmarked five runs and got anywhere from 0.09x to 0.41x. A 12-second clip takes 30-50 seconds on a good run, but can spike to over two minutes when thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.
**The 24kHz gotcha cost me hours.** Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this. **The 24kHz gotcha cost me hours.** Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.
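The post does not show the ffmpeg command it used, so the following is only a minimal sketch of how the 24000 Hz mono requirement could be checked before a reference clip reaches the model. The function names `needs_resample` and `ffmpeg_resample_cmd` are hypothetical, and the check uses Python's standard `wave` module, which only reads uncompressed PCM WAV files. The `-ar` and `-ac` options are ffmpeg's standard flags for output sample rate and channel count.

```python
import wave

TARGET_RATE = 24000   # Hz, the input rate the post says the model needs
TARGET_CHANNELS = 1   # mono


def needs_resample(path):
    """Return True if the PCM WAV file at `path` is not 24000 Hz mono."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() != TARGET_RATE
            or wav.getnchannels() != TARGET_CHANNELS
        )


def ffmpeg_resample_cmd(src, dst):
    """Build an ffmpeg command that writes a 24000 Hz mono copy of `src` to `dst`.

    The command is returned as a list so it can be passed to subprocess.run().
    """
    return [
        "ffmpeg", "-y",               # overwrite dst if it already exists
        "-i", src,                    # input file
        "-ar", str(TARGET_RATE),      # output sample rate
        "-ac", str(TARGET_CHANNELS),  # output channel count
        dst,
    ]
```

With ffmpeg installed, `subprocess.run(ffmpeg_resample_cmd("ref_16k.wav", "ref_24k.wav"), check=True)` would perform the conversion; the file names here are placeholders. Since the failure described above is silent, running a check like this first is a cheap way to catch it.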
@@ -34,7 +34,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
**Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety. **Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.
**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too slow for 20 minutes of audio. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content. **Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.
**Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible. **Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.
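To put the podcast numbers in perspective, here is a small worked calculation using the benchmark figures quoted in this commit (0.09x, 0.3x, and 0.41x realtime). It assumes the realtime factor means seconds of audio produced per second of generation, which matches the "12-second clip takes 30-50 seconds" figure; `generation_seconds` is a hypothetical helper, not part of any library.

```python
def generation_seconds(audio_seconds, realtime_factor):
    """Estimate wall-clock generation time for a clip.

    realtime_factor is audio seconds produced per second of compute,
    so 0.3x means 1 second of audio takes about 3.3 seconds to generate.
    """
    return audio_seconds / realtime_factor


# Benchmark range quoted in the post, applied to a 20-minute episode.
podcast_seconds = 20 * 60
for label, rtf in [("slowest run", 0.09), ("average", 0.3), ("fastest run", 0.41)]:
    minutes = generation_seconds(podcast_seconds, rtf) / 60
    print(f"{label}: {rtf}x realtime -> about {minutes:.0f} min of generation")
```

Under these assumptions a 20-minute episode lands somewhere between roughly 49 minutes and 3.7 hours of generation, which is the spread that makes long-form output hard to plan around.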
@@ -44,7 +44,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
| | Piper | Qwen3-TTS | | | Piper | Qwen3-TTS |
|---|---|---| |---|---|---|
| Speed | Realtime+ | ~2 min per 10 sec | | Speed | Realtime+ | ~30-50s per 12s (varies) |
| Voices | One at a time | Clone any voice | | Voices | One at a time | Clone any voice |
| Quality | Good, consistent | Variable, sample-dependent | | Quality | Good, consistent | Variable, sample-dependent |
| Hardware | CPU (anything) | Apple Silicon | | Hardware | CPU (anything) | Apple Silicon |