diff --git a/dist/posts/self-hosted-ai-stack/index.html b/dist/posts/self-hosted-ai-stack/index.html
index 62422cf..20d5cc4 100644
--- a/dist/posts/self-hosted-ai-stack/index.html
+++ b/dist/posts/self-hosted-ai-stack/index.html
@@ -8,14 +8,14 @@

Qwen3-TTS Voice Cloning: Impressive But Painful

Qwen3-TTS runs on my M1 Pro MacBook via mlx-audio. The pitch: give it a voice sample and it generates new speech in that voice.

The reality is more complicated.

-

It’s slow. About 0.08x realtime on M1 Pro. A 10-second clip takes roughly two minutes to generate. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.

+

It’s slow. About 0.3x realtime on M1 Pro on average, though it varies wildly — I benchmarked five runs and got anywhere from 0.09x to 0.41x. A 12-second clip takes 30–50 seconds on a good run, but can spike to over two minutes when thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.
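The benchmark figures translate into wall-clock time by dividing clip length by the realtime factor. The following is a minimal sketch of that arithmetic; the function name `generation_seconds` is made up for illustration and is not part of mlx-audio.

```python
def generation_seconds(clip_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to synthesize a clip at a given realtime factor.

    A factor of 1.0 means generation keeps pace with playback;
    0.3 means each second of audio takes about 3.3 seconds to produce.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return clip_seconds / realtime_factor


if __name__ == "__main__":
    # Factors taken from the five benchmark runs quoted above.
    for label, factor in [("worst", 0.09), ("average", 0.3), ("best", 0.41)]:
        print(f"12 s clip, {label} run: {generation_seconds(12, factor):.0f} s")
    # A 20-minute episode at the average factor, reported in minutes.
    print(f"20 min podcast: {generation_seconds(20 * 60, 0.3) / 60:.0f} min")
```

For the 12-second clip this works out to roughly 133 s on the worst run, 40 s on average, and 29 s on the best run. A 20-minute episode at the average factor would take about 67 minutes.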

The 24kHz gotcha cost me hours. Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.
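Because the failure is silent, a pre-flight check on the reference clip can save time. The sketch below uses Python's standard `wave` module to verify the format, and builds an ffmpeg command with the standard `-ar` (sample rate) and `-ac` (channel count) options. The post does not give the exact ffmpeg invocation that was used, so the command and the helper names here are assumptions.

```python
import subprocess
import wave

REQUIRED_RATE = 24000  # Hz, per the requirement described above
REQUIRED_CHANNELS = 1  # mono


def is_valid_reference(path: str) -> bool:
    """Return True if the WAV file is 24 kHz mono.

    wave.open raises wave.Error for files that are not PCM WAV.
    """
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == REQUIRED_RATE
                and wav.getnchannels() == REQUIRED_CHANNELS)


def ffmpeg_resample_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts src to 24 kHz mono.

    -ar sets the output sample rate, -ac the channel count,
    and -y overwrites dst if it already exists.
    """
    return ["ffmpeg", "-y", "-i", src,
            "-ar", str(REQUIRED_RATE), "-ac", str(REQUIRED_CHANNELS), dst]


def ensure_reference(src: str, dst: str) -> str:
    """Return a path to a compliant reference clip, resampling if needed."""
    if is_valid_reference(src):
        return src
    # Requires ffmpeg on PATH; check=True raises if the conversion fails.
    subprocess.run(ffmpeg_resample_command(src, dst), check=True)
    return dst
```

Calling `ensure_reference("sample.wav", "sample_24k.wav")` before each cloning run would turn the silent failure into either a compliant file or an explicit exception.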

Style steering makes it worse. The model has an --instruct mode for controlling tone and delivery. Every time I used it for voice cloning, the output quality dropped. Removing it and letting the model match the reference audio naturally produced better results every time.

Sample quality matters more than sample length. The docs suggest a few seconds of audio is enough. Technically true, but a clean 10-second conversational clip produces noticeably better clones than a noisy 3-second snippet. I got the best results from natural speech — someone talking normally, not reading a script.

I cloned a few voices for fun. Some worked well on the first try. Others took multiple attempts with different reference clips before they sounded right. It’s not a “give it any audio and get a perfect clone” situation. It’s more like “give it good audio and you’ll get a recognizable approximation.”

What I Actually Use This For

Voice replies. Someone sends a voice message on Telegram, Whisper transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.

-

Podcast generation. I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too slow for 20 minutes of audio. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn’t there yet for multi-voice long-form content.

+

Podcast generation. I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn’t there yet for multi-voice long-form content.

Home automation alerts. Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.

Voice cloning for short clips. Qwen handles personalized greetings, short jokes in cloned voices, that kind of thing. Anything under 15 seconds, where a generation time of up to two minutes is acceptable.

The Honest Comparison

@@ -59,7 +59,7 @@ -
| | Piper | Qwen3-TTS |
|---|---|---|
| Speed | Realtime+ | ~2 min per 10 sec |
| Voices | One at a time | Clone any voice |
| Quality | Good, consistent | Variable, sample-dependent |
| Hardware | CPU (anything) | Apple Silicon |
| Best for | Fast bulk TTS | Short personalized clips |
| Worst at | Voice variety | Long-form, speed |
+
| | Piper | Qwen3-TTS |
|---|---|---|
| Speed | Realtime+ | ~30-50s per 12s (varies) |
| Voices | One at a time | Clone any voice |
| Quality | Good, consistent | Variable, sample-dependent |
| Hardware | CPU (anything) | Apple Silicon |
| Best for | Fast bulk TTS | Short personalized clips |
| Worst at | Voice variety | Long-form, speed |

What’s Next

I want to get Piper running with multiple voice models on different ports. That would solve the podcast problem without needing cloud TTS. Piper supports dozens of voices — I just need to run more than one instance.
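One way to picture the multi-instance idea is a small router that assigns each line of a two-host script to the instance serving that host's voice. This is a sketch under stated assumptions: the speaker labels, the port numbers, and the `SPEAKER: text` script format are all hypothetical, and the code deliberately stops short of calling Piper, since the post does not describe how the instances would be exposed.

```python
# Hypothetical mapping: speaker label -> local port of a Piper instance.
VOICE_PORTS = {"HOST_A": 5001, "HOST_B": 5002}


def route_script(script: str) -> list[tuple[int, str]]:
    """Split 'SPEAKER: text' lines into ordered (port, text) jobs."""
    jobs = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        speaker, sep, text = line.partition(":")
        speaker = speaker.strip()
        if not sep or speaker not in VOICE_PORTS:
            raise ValueError(f"unrecognized script line: {line!r}")
        jobs.append((VOICE_PORTS[speaker], text.strip()))
    return jobs
```

Each job could then be sent to the matching instance in order and the resulting clips concatenated into a single episode.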

For Qwen, the speed is the bottleneck. If Apple ships faster silicon or the model gets optimized further, voice cloning could become practical for longer content. Right now it’s a novelty for short clips.

diff --git a/src/pages/posts/self-hosted-ai-stack.md b/src/pages/posts/self-hosted-ai-stack.md
index 63e676c..6712833 100644
--- a/src/pages/posts/self-hosted-ai-stack.md
+++ b/src/pages/posts/self-hosted-ai-stack.md
@@ -20,7 +20,7 @@ The catch: Piper has one voice at a time. I tried using it for a two-host podcas
The reality is more complicated.
-**It's slow.** About 0.08x realtime on M1 Pro. A 10-second clip takes roughly two minutes to generate. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.
+**It's slow.** About 0.3x realtime on M1 Pro on average, though it varies wildly — I benchmarked five runs and got anywhere from 0.09x to 0.41x. A 12-second clip takes 30–50 seconds on a good run, but can spike to over two minutes when thermally throttled. I set it up as a LaunchAgent so the server stays running, but the speed makes it impractical for anything interactive. Short voice messages are fine. A 20-minute podcast is not happening.
**The 24kHz gotcha cost me hours.** Input reference audio must be 24000 Hz mono WAV. If you feed it 16kHz audio — which is what most voice recorders produce — the output is truncated garbage. No error message. No warning. Just broken audio. I resampled everything with ffmpeg and the problem disappeared, but there was nothing in the docs pointing to this.
@@ -34,7 +34,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
**Voice replies.** Someone sends a voice message on Telegram, [Whisper](https://github.com/openai/whisper) transcribes it, the AI generates a text response, and Piper converts it to speech. The reply goes back as a voice note. Piper handles this well because speed matters more than voice variety.
-**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too slow for 20 minutes of audio. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.
+**Podcast generation.** I built a pipeline that researches news, writes a two-host script, and generates audio. I tried Piper first but both hosts sounded the same. Qwen was too inconsistent for 20 minutes of audio — generation speed varies 4x between runs. I ended up using a cloud TTS API with two distinct voices for this. Local TTS isn't there yet for multi-voice long-form content.
**Home automation alerts.** Piper speaks weather alerts and system notifications through a Sonos speaker. Fast enough that the latency is imperceptible.
@@ -44,7 +44,7 @@ I cloned a few voices for fun. Some worked well on the first try. Others took mu
| | Piper | Qwen3-TTS |
|---|---|---|
-| Speed | Realtime+ | ~2 min per 10 sec |
+| Speed | Realtime+ | ~30-50s per 12s (varies) |
| Voices | One at a time | Clone any voice |
| Quality | Good, consistent | Variable, sample-dependent |
| Hardware | CPU (anything) | Apple Silicon |