Fix Qwen TTS speed claims with actual benchmark data

2026-03-23 04:55:17 +00:00
parent b4d039f797
commit 70e0be3703
2 changed files with 6 additions and 6 deletions
--- a/src/pages/posts/self-hosted-ai-stack.md
+++ b/src/pages/posts/self-hosted-ai-stack.md
@@ -20,7 +20,7 @@ The catch: Piper has one voice at a time. I tried using it for a two-host podcas

 The reality is more complicated.

-**It's slow.** About 0.08x realtime on M1 Pro. A 10-second clip takes roughly two minutes to generate. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.
+**It's slow.** About 0.3x realtime on M1 Pro on average, though it varies wildly — I benchmarked five runs and got anywhere from 0.09x to 0.41x. A 12-second clip takes 30–50 seconds on a good run, but can spike to over two minutes when thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.

 **The 24kHz gotcha cost me hours.** Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.

@@ -34,7 +34,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu

 **Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.

-**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too slow for 20 minutes of audio. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.
+**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.

 **Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.

@@ -44,7 +44,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu

 | | Piper | Qwen3-TTS |
 |---|---|---|
-| Speed | Realtime+ | ~2 min per 10 sec |
+| Speed | Realtime+ | ~30-50s per 12s (varies) |
 | Voices | One at a time | Clone any voice |
 | Quality | Good, consistent | Variable, sample-dependent |
 | Hardware | CPU (anything) | Apple Silicon |
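Note on the revised speed figures in this diff: they can be cross-checked with simple arithmetic. The sketch below assumes "0.3x realtime" means seconds of audio produced per second of wall-clock time, which matches how the post uses the numbers; the function name `generation_seconds` is illustrative only.

```python
def generation_seconds(clip_seconds, realtime_factor):
    """Wall-clock seconds needed to synthesize a clip at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return clip_seconds / realtime_factor


CLIP_SECONDS = 12.0  # clip length quoted in the post

# Realtime factors quoted in the post: slowest run, average, fastest run.
for label, rtf in [("slowest", 0.09), ("average", 0.3), ("fastest", 0.41)]:
    print(f"{label}: {generation_seconds(CLIP_SECONDS, rtf):.1f} s")

# Ratio between the fastest and slowest benchmarked runs.
spread = 0.41 / 0.09
print(f"spread: {spread:.1f}x")
```

Under that assumption, a 12-second clip takes about 29 s at 0.41x, 40 s at 0.3x, and about 133 s at 0.09x, which is consistent with the "30–50 seconds" and "over two minutes" figures. The fastest-to-slowest ratio comes out at roughly 4.6x, slightly above the "4x" quoted in the podcast paragraph.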