Fix Qwen TTS speed claims with actual benchmark data

Jarvis Prime
2026-03-23 04:55:17 +00:00
parent b4d039f797
commit 70e0be3703
2 changed files with 6 additions and 6 deletions


@@ -20,7 +20,7 @@ The catch: Piper has one voice at a time. I tried using it for a two-host podcas
The reality is more complicated.
**It's slow.** About 0.08x realtime on M1 Pro. A 10-second clip takes roughly two minutes to generate. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.
**It's slow.** About 0.3x realtime on M1 Pro on average, though it varies wildly: across five benchmark runs I measured anywhere from 0.09x to 0.41x. A 12-second clip takes 30-50 seconds on a good run, but can spike to over two minutes when thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.
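To make the arithmetic behind those figures explicit, here is a minimal sketch. It assumes "realtime factor" means seconds of audio produced per second of wall-clock time; the function names are mine for illustration and don't come from any TTS library.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time (assumed definition)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def generation_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock seconds needed to generate a clip at a given realtime factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


if __name__ == "__main__":
    # 12-second clip at the average, worst, and best factors from the five runs.
    for rtf in (0.3, 0.09, 0.41):
        print(f"{rtf:.2f}x -> {generation_time(12, rtf):.0f} s")
    # 20-minute podcast at the average factor, in minutes of wall-clock time.
    print(f"podcast: {generation_time(20 * 60, 0.3) / 60:.0f} min")
```

At 0.3x a 12-second clip works out to about 40 seconds, and at 0.09x to a bit over two minutes, which lines up with the numbers above. The same division puts a 20-minute podcast at over an hour of generation on an average run.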
**The 24kHz gotcha cost me hours.** Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.
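Since nothing warns you about a mismatched sample rate, a pre-flight check is cheap insurance. The sketch below reads the WAV header with Python's standard `wave` module and, if needed, calls ffmpeg with its `-ar` (sample rate) and `-ac` (channel count) options. The helper names and the idea of running this before every clone are my assumptions, not part of Qwen3-TTS, and the exact ffmpeg command I originally ran may have differed.

```python
import subprocess
import wave

TARGET_RATE = 24000    # Hz, the reference-audio rate described above
TARGET_CHANNELS = 1    # mono


def wav_format(path: str) -> tuple[int, int]:
    """Return (sample_rate, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def needs_resample(path: str) -> bool:
    """True if the file is not already 24 kHz mono."""
    return wav_format(path) != (TARGET_RATE, TARGET_CHANNELS)


def ffmpeg_resample_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts src to 24 kHz mono WAV at dst."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", str(TARGET_RATE),
        "-ac", str(TARGET_CHANNELS),
        dst,
    ]


def ensure_24k_mono(src: str, dst: str) -> str:
    """Return a path to 24 kHz mono audio, resampling with ffmpeg only if needed."""
    if not needs_resample(src):
        return src
    subprocess.run(ffmpeg_resample_cmd(src, dst), check=True)
    return dst
```

Calling `ensure_24k_mono("reference.wav", "reference_24k.wav")` (hypothetical file names) returns the original path untouched when it is already in the right format, and otherwise the resampled copy. Note that `wave` only reads uncompressed PCM WAV, so other formats would need to go through ffmpeg unconditionally.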
@@ -34,7 +34,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
**Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.
**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too slow for 20 minutes of audio. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.
**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio: generation speed varies by more than 4x between runs (0.09x to 0.41x realtime). I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.
**Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.
@@ -44,7 +44,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
| | Piper | Qwen3-TTS |
|---|---|---|
| Speed | Realtime+ | ~2 min per 10 sec |
| Speed | Realtime+ | ~30-50s per 12s (varies) |
| Voices | One at a time | Clone any voice |
| Quality | Good, consistent | Variable, sample-dependent |
| Hardware | CPU (anything) | Apple Silicon |