Rewrite post: focus on voice cloning and Piper TTS
98
dist/posts/self-hosted-ai-stack/index.html
vendored
@@ -1,43 +1,73 @@
|
|||||||
<!DOCTYPE html><html lang="en" data-astro-cid-5hce7sga> <head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware — dd0c.net</title><link rel="icon" href="/favicon.ico"><style>[data-astro-cid-5hce7sga],[data-astro-cid-5hce7sga]:before,[data-astro-cid-5hce7sga]:after{box-sizing:border-box}body{font-family:system-ui,-apple-system,sans-serif;margin:0;background:#f8f9fa;color:#333;line-height:1.6}nav[data-astro-cid-5hce7sga]{background:#fff;border-bottom:2px solid #3294D2;padding:0 1.5rem;display:flex;align-items:center;gap:0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{display:flex;align-items:center;gap:.5rem;text-decoration:none;margin-right:1.5rem;padding:.75rem 0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga] img[data-astro-cid-5hce7sga]{height:32px;width:auto}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#333;text-decoration:none;padding:.75rem .85rem;font-size:.95rem;transition:color .15s}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{color:#3294d2}nav[data-astro-cid-5hce7sga] .spacer[data-astro-cid-5hce7sga]{flex:1}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga].external{color:#3294d2}main[data-astro-cid-5hce7sga]{max-width:760px;margin:2rem auto;padding:0 1.25rem}footer[data-astro-cid-5hce7sga]{margin-top:3rem;padding:1.25rem;text-align:center;font-size:.875rem;color:#666;border-top:1px solid #e0e0e0;background:#fff}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#3294d2;text-decoration:none}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{text-decoration:underline}h1[data-astro-cid-5hce7sga]{font-size:1.75rem;color:#1a1a1a}h2[data-astro-cid-5hce7sga]{font-size:1.2rem;margin-top:1.75rem;color:#1a1a1a}a[data-astro-cid-5hce7sga]{color:#3294d2}p[data-astro-cid-5hce7sga]{margin:.6rem 0}@media (max-width: 
600px){nav[data-astro-cid-5hce7sga]{flex-wrap:wrap;padding:0 .75rem}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{margin-right:.5rem}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{padding:.6rem .5rem;font-size:.875rem}}
|
<!DOCTYPE html><html lang="en" data-astro-cid-5hce7sga> <head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware — dd0c.net</title><link rel="icon" href="/favicon.ico"><style>[data-astro-cid-5hce7sga],[data-astro-cid-5hce7sga]:before,[data-astro-cid-5hce7sga]:after{box-sizing:border-box}body{font-family:system-ui,-apple-system,sans-serif;margin:0;background:#f8f9fa;color:#333;line-height:1.6}nav[data-astro-cid-5hce7sga]{background:#fff;border-bottom:2px solid #3294D2;padding:0 1.5rem;display:flex;align-items:center;gap:0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{display:flex;align-items:center;gap:.5rem;text-decoration:none;margin-right:1.5rem;padding:.75rem 0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga] img[data-astro-cid-5hce7sga]{height:32px;width:auto}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#333;text-decoration:none;padding:.75rem .85rem;font-size:.95rem;transition:color .15s}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{color:#3294d2}nav[data-astro-cid-5hce7sga] .spacer[data-astro-cid-5hce7sga]{flex:1}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga].external{color:#3294d2}main[data-astro-cid-5hce7sga]{max-width:760px;margin:2rem auto;padding:0 1.25rem}footer[data-astro-cid-5hce7sga]{margin-top:3rem;padding:1.25rem;text-align:center;font-size:.875rem;color:#666;border-top:1px solid #e0e0e0;background:#fff}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#3294d2;text-decoration:none}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{text-decoration:underline}h1[data-astro-cid-5hce7sga]{font-size:1.75rem;color:#1a1a1a}h2[data-astro-cid-5hce7sga]{font-size:1.2rem;margin-top:1.75rem;color:#1a1a1a}a[data-astro-cid-5hce7sga]{color:#3294d2}p[data-astro-cid-5hce7sga]{margin:.6rem 0}@media (max-width: 
600px){nav[data-astro-cid-5hce7sga]{flex-wrap:wrap;padding:0 .75rem}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{margin-right:.5rem}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{padding:.6rem .5rem;font-size:.875rem}}
|
||||||
.post-meta[data-astro-cid-gjtny2mx]{color:#888;font-size:.875rem;margin-bottom:1.5rem}.post-body[data-astro-cid-gjtny2mx]{line-height:1.75}.post-body[data-astro-cid-gjtny2mx] iframe[data-astro-cid-gjtny2mx]{max-width:100%}.back[data-astro-cid-gjtny2mx]{display:inline-block;margin-bottom:1.25rem;font-size:.9rem;color:#3294d2;text-decoration:none}.back[data-astro-cid-gjtny2mx]:hover{text-decoration:underline}
|
.post-meta[data-astro-cid-gjtny2mx]{color:#888;font-size:.875rem;margin-bottom:1.5rem}.post-body[data-astro-cid-gjtny2mx]{line-height:1.75}.post-body[data-astro-cid-gjtny2mx] iframe[data-astro-cid-gjtny2mx]{max-width:100%}.back[data-astro-cid-gjtny2mx]{display:inline-block;margin-bottom:1.25rem;font-size:.9rem;color:#3294d2;text-decoration:none}.back[data-astro-cid-gjtny2mx]:hover{text-decoration:underline}
|
||||||
</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga> <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I wanted a personal AI assistant that I actually controlled. Not a SaaS product with a monthly bill and a privacy policy I’d never read. Something running on my own hardware, connected to my own accounts, with no vendor in the middle deciding what it could or couldn’t do.</p>
|
</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga> <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I wanted voice capabilities for my personal AI setup without sending audio to a cloud API. Not for privacy paranoia — I just didn’t want to pay per-request for something that should run locally. Turns out the open-source TTS landscape has gotten surprisingly good.</p>
|
||||||
<p>I didn’t buy new hardware for this. I used what I had: a TrueNAS server in my home lab and an M1 Pro MacBook. Not a data center. Just the stuff already sitting on my desk.</p>
|
<h2 id="piper-fast-local-tts">Piper: Fast Local TTS</h2>
|
||||||
<h2 id="the-stack">The Stack</h2>
|
<p><a href="https://github.com/rhasspy/piper">Piper</a> is a local text-to-speech engine that runs on CPU. No GPU required. I have it running in Docker on a TrueNAS server via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>, which is the same protocol Home Assistant uses for voice assistants.</p>
|
||||||
<p>The core pieces are:</p>
|
<p>The speed surprised me. On modest hardware, Piper generates speech at realtime speed or better. A 10-second clip takes less than 10 seconds to render. The voice quality isn’t Eleven Labs, but it’s completely usable — clear, natural-sounding, and consistent.</p>
|
||||||
|
<p>I wrote a small <a href="https://github.com/ddoc">OpenAI-compatible proxy</a> that wraps the Wyoming protocol in a standard <code>/v1/audio/speech</code> endpoint. Any tool that speaks the OpenAI TTS API can now use my local Piper instance without modification.</p>
|
||||||
|
<p>Piper ships with dozens of voice models across multiple languages. I’m running a single English voice right now, but you can load multiple models on different ports if you need variety.</p>
|
||||||
|
<h2 id="qwen3-tts-voice-cloning-on-apple-silicon">Qwen3-TTS: Voice Cloning on Apple Silicon</h2>
|
||||||
|
<p><a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> is where things get interesting. It’s a text-to-speech model that supports voice cloning from a 3-second audio sample. I run it on an M1 Pro MacBook using <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>, which is built on Apple’s MLX framework and runs the model on the Apple Silicon GPU rather than the Neural Engine.</p>
|
||||||
|
<p>The workflow: give it a short <code>.wav</code> clip of someone’s voice, and it generates new speech in that voice. The quality varies — natural conversational samples work much better than scripted readings. A few things I learned the hard way:</p>
|
||||||
<ul>
|
<ul>
|
||||||
<li><strong>LLM</strong>: Claude via a local reverse proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what’s being sent.</li>
|
<li><strong>Input audio must be 24kHz mono WAV.</strong> 16kHz samples produce truncated garbage with no error message. I lost hours to this.</li>
|
||||||
<li><strong>TTS</strong>: <a href="https://github.com/rhasspy/piper">Piper TTS</a> running on TrueNAS via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Also <a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> on the Mac for voice cloning.</li>
|
<li><strong>Skip the style steering.</strong> The <code>--instruct</code> flag for controlling tone paradoxically produces worse results for voice cloning. Just let the model match the reference audio naturally.</li>
|
||||||
<li><strong>STT</strong>: <a href="https://github.com/openai/whisper">Whisper</a> server on TrueNAS, exposed as an OpenAI-compatible <code>/v1/audio/transcriptions</code> endpoint.</li>
|
<li><strong>Keep it short.</strong> Generation speed is about 0.08x realtime on M1 Pro — a 10-second clip takes roughly 2 minutes. Fine for short messages, impractical for long-form audio.</li>
|
||||||
<li><strong>Agent framework</strong>: An open-source agent framework — self-hosted, connects to Telegram and WhatsApp.</li>
|
|
||||||
</ul>
|
</ul>
|
||||||
<p>The whole thing runs on hardware I already owned. The only recurring cost is the LLM API.</p>
|
<p>I set it up as a <a href="https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html">LaunchAgent</a> so it starts when I log in and stays running. The API is FastAPI-based, exposed on the local network.</p>
|
||||||
<h2 id="getting-it-running">Getting It Running</h2>
|
<h2 id="what-i-actually-use-this-for">What I Actually Use This For</h2>
|
||||||
<p>Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward — pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I’m running <code>small.en</code> which is a good balance of speed and accuracy for English transcription.</p>
|
<p><strong>Voice replies on Telegram.</strong> When someone sends me a voice message, Whisper transcribes it, the AI generates a text response, and Piper or Qwen converts it back to speech. The reply goes back as a voice note. With Piper the whole round trip takes a few seconds; with Qwen it takes longer, since generation runs at about 0.08x realtime.</p>
|
||||||
<p>The agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a <code>SKILL.md</code> and some scripts — easy to add, easy to audit.</p>
|
<p><strong>Podcast generation.</strong> I built a pipeline that researches news via web search, writes a two-host podcast script, and generates 20 minutes of audio using different TTS voices for each host. Piper handles the bulk generation (fast), and I’ve experimented with Qwen for more distinctive character voices.</p>
|
||||||
<p>The LLM proxy sits between the agent and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.</p>
|
<p><strong>Notifications and alerts.</strong> Instead of text notifications, some of my home automation alerts are spoken aloud through a Sonos speaker. Piper handles this — it’s fast enough that the latency is imperceptible.</p>
|
||||||
<p>Qwen3-TTS on the Mac runs via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>, which is built on Apple’s MLX framework and runs on the Apple Silicon GPU. The voice cloning feature is genuinely impressive: give it a 3-second audio sample and it’ll match the voice reasonably well.</p>
|
<h2 id="piper-vs-qwen-when-to-use-which">Piper vs. Qwen: When to Use Which</h2>
|
||||||
<h2 id="what-surprised-me">What Surprised Me</h2>
|
|
||||||
<p><strong>Piper is fast.</strong> I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech at realtime speed or better. The voice quality isn’t Eleven Labs, but it’s completely usable for a personal assistant.</p>
|
|
||||||
<p><strong>Whisper <code>small.en</code> is accurate enough.</strong> I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it’s fine.</p>
|
|
||||||
<p><strong>The real cost is tokens, not compute.</strong> I assumed the bottleneck would be CPU/GPU. It’s not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from <a href="https://github.com/comet-ml/opik">Opik</a> and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast.</p>
|
|
||||||
<p><strong>V8 memory limits in Docker will silently kill your agent.</strong> The agent framework runs on Node.js. Node’s V8 heap has a default size cap (Docker itself sets no memory limit unless you configure one), and the process will hit that cap and crash without a clear error. The fix is <code>NODE_OPTIONS=--max-old-space-size=4096</code> in your container environment. I lost a few hours to this before finding it.</p>
|
|
||||||
<h2 id="the-numbers">The Numbers</h2>
|
|
||||||
<p>Rough monthly costs running this setup:</p>
|
|
||||||
|
<table><thead><tr><th></th><th>Piper</th><th>Qwen3-TTS</th></tr></thead><tbody><tr><td>Speed</td><td>Realtime or faster</td><td>~0.08x realtime</td></tr><tr><td>Voice cloning</td><td>No</td><td>Yes (3-second sample)</td></tr><tr><td>Hardware</td><td>CPU (any machine)</td><td>Apple Silicon recommended</td></tr><tr><td>Quality</td><td>Good, consistent</td><td>Excellent for cloned voices</td></tr><tr><td>Use case</td><td>Bulk TTS, real-time</td><td>Short personalized clips</td></tr></tbody></table>
|
||||||
|
<p>For anything that needs to be fast or long-form, Piper wins. For anything that needs to sound like a specific person, Qwen wins. I use both.</p>
|
||||||
|
<h2 id="the-setup">The Setup</h2>
|
||||||
|
<p>The whole thing runs on hardware I already owned:</p>
|
||||||
<ul>
|
<ul>
|
||||||
<li>LLM API (Claude): ~$15–25/month depending on conversation volume</li>
|
<li><strong>Piper</strong>: Docker container on TrueNAS, Wyoming protocol on port 10200, OpenAI-compatible proxy on port 8951</li>
|
||||||
<li>Electricity for TrueNAS: already running, marginal cost near zero</li>
|
<li><strong>Qwen3-TTS</strong>: LaunchAgent on MacBook, FastAPI server on port 8880</li>
|
||||||
<li>Everything else (Piper, Whisper, agent framework): $0</li>
|
<li><strong>Whisper</strong>: Docker container on TrueNAS, OpenAI-compatible endpoint on port 8950</li>
|
||||||
</ul>
|
</ul>
|
||||||
<p>The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token in context costs money on both ends of the conversation.</p>
|
<p>Total additional cost: $0. These are all open-source models running on consumer hardware.</p>
|
||||||
<h2 id="whats-next">What’s Next</h2>
|
<h2 id="whats-next">What’s Next</h2>
|
||||||
<p>A few things I’m working on:</p>
|
<p>I want to get Piper running with multiple voice models simultaneously (different voices on different ports) so the podcast pipeline can use truly distinct local voices without hitting any external API. The current single-voice setup works, but having two or three local voices would eliminate the last dependency on cloud TTS entirely.</p>
|
||||||
<ul>
|
<p>If you have a machine that can run Docker, you can have local TTS running in about 15 minutes. <a href="https://github.com/rhasspy/piper">Piper’s documentation</a> is solid, and the Wyoming protocol integration with Home Assistant makes it trivially easy if you’re already in that ecosystem.</p> </div> </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> ·
|
||||||
<li>WhatsApp integration — Telegram works great but WhatsApp is where most people actually are</li>
|
|
||||||
<li>Multi-tenant hosting — running this for a small group, not just myself</li>
|
|
||||||
<li>Thermal mass modeling for HVAC — using the agent framework to build something that actually reasons about home energy, not just schedules</li>
|
|
||||||
</ul>
|
|
||||||
<p>The infrastructure is solid enough that I’m spending more time on what the agent <em>does</em> than on keeping it running. That’s the right place to be.</p>
|
|
||||||
<p>If you’re thinking about building something similar, the barrier is lower than you’d expect. You don’t need a GPU server. You don’t need a cloud budget. You need a machine that can run Docker, an API key, and a weekend.</p> </div> </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> ·
|
|
||||||
<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> ·
|
<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> ·
|
||||||
© Brian Galura 2004–2026
|
© Brian Galura 2004–2026
|
||||||
</footer> </body></html>
|
</footer> </body></html>
|
||||||
@@ -1,62 +1,65 @@
|
|||||||
---
|
---
|
||||||
layout: ../../layouts/PostLayout.astro
|
layout: ../../layouts/PostLayout.astro
|
||||||
title: "Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware"
|
title: "Local Voice AI: Text-to-Speech and Voice Cloning on Consumer Hardware"
|
||||||
date: "2026-03-24"
|
date: "2026-03-24"
|
||||||
---
|
---
|
||||||
|
|
||||||
I wanted a personal AI assistant that I actually controlled. Not a SaaS product with a monthly bill and a privacy policy I'd never read. Something running on my own hardware, connected to my own accounts, with no vendor in the middle deciding what it could or couldn't do.
|
I wanted voice capabilities for my personal AI setup without sending audio to a cloud API. Not for privacy paranoia — I just didn't want to pay per-request for something that should run locally. Turns out the open-source TTS landscape has gotten surprisingly good.
|
||||||
|
|
||||||
I didn't buy new hardware for this. I used what I had: a TrueNAS server in my home lab and an M1 Pro MacBook. Not a data center. Just the stuff already sitting on my desk.
|
## Piper: Fast Local TTS
|
||||||
|
|
||||||
## The Stack
|
[Piper](https://github.com/rhasspy/piper) is a local text-to-speech engine that runs on CPU. No GPU required. I have it running in Docker on a TrueNAS server via the [Wyoming protocol](https://github.com/rhasspy/wyoming), which is the same protocol Home Assistant uses for voice assistants.
|
||||||
|
|
||||||
The core pieces are:
|
The speed surprised me. On modest hardware, Piper generates speech at realtime speed or better. A 10-second clip takes less than 10 seconds to render. The voice quality isn't Eleven Labs, but it's completely usable — clear, natural-sounding, and consistent.
|
||||||
|
|
||||||
- **LLM**: Claude via a local reverse proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what's being sent.
|
I wrote a small [OpenAI-compatible proxy](https://github.com/ddoc) that wraps the Wyoming protocol in a standard `/v1/audio/speech` endpoint. Any tool that speaks the OpenAI TTS API can now use my local Piper instance without modification.
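
A client for an endpoint like this could look roughly like the sketch below. It assumes the proxy accepts the same JSON body as OpenAI's `/v1/audio/speech` call (`model`, `input`, `voice`, `response_format`). The host name `truenas.local`, the model name `piper`, and the voice name are placeholders for illustration, not values taken from the actual proxy.

```python
import json
import urllib.request

# Hypothetical base URL: replace with wherever the proxy actually listens.
BASE_URL = "http://truenas.local:8951"


def build_speech_request(text, voice="en_US-example", base_url=BASE_URL):
    """Build a POST request shaped like OpenAI's /v1/audio/speech call."""
    payload = {
        "model": "piper",          # placeholder; a proxy may ignore this field
        "input": text,
        "voice": voice,            # placeholder voice name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Keeping request construction separate from the network call makes it easy to inspect what would be sent before pointing it at a real server.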
|
||||||
- **TTS**: [Piper TTS](https://github.com/rhasspy/piper) running on TrueNAS via the [Wyoming protocol](https://github.com/rhasspy/wyoming). Also [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) on the Mac for voice cloning.
|
|
||||||
- **STT**: [Whisper](https://github.com/openai/whisper) server on TrueNAS, exposed as an OpenAI-compatible `/v1/audio/transcriptions` endpoint.
|
|
||||||
- **Agent framework**: An open-source agent framework — self-hosted, connects to Telegram and WhatsApp.
|
|
||||||
|
|
||||||
The whole thing runs on hardware I already owned. The only recurring cost is the LLM API.
|
Piper ships with dozens of voice models across multiple languages. I'm running a single English voice right now, but you can load multiple models on different ports if you need variety.
|
||||||
|
|
||||||
## Getting It Running
|
## Qwen3-TTS: Voice Cloning on Apple Silicon
|
||||||
|
|
||||||
Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward — pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I'm running `small.en` which is a good balance of speed and accuracy for English transcription.
|
[Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) is where things get interesting. It's a text-to-speech model that supports voice cloning from a 3-second audio sample. I run it on an M1 Pro MacBook using [mlx-audio](https://github.com/Blaizzy/mlx-audio), which is built on Apple's MLX framework and runs the model on the Apple Silicon GPU rather than the Neural Engine.
|
||||||
|
|
||||||
The agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a `SKILL.md` and some scripts — easy to add, easy to audit.
|
The workflow: give it a short `.wav` clip of someone's voice, and it generates new speech in that voice. The quality varies — natural conversational samples work much better than scripted readings. A few things I learned the hard way:
|
||||||
|
|
||||||
The LLM proxy sits between the agent and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.
|
- **Input audio must be 24kHz mono WAV.** 16kHz samples produce truncated garbage with no error message. I lost hours to this.
|
||||||
|
- **Skip the style steering.** The `--instruct` flag for controlling tone paradoxically produces worse results for voice cloning. Just let the model match the reference audio naturally.
|
||||||
|
- **Keep it short.** Generation speed is about 0.08x realtime on M1 Pro — a 10-second clip takes roughly 2 minutes. Fine for short messages, impractical for long-form audio.
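
Since a wrong sample rate fails silently, it can help to check a reference clip before handing it to the model. The sketch below uses only Python's standard `wave` module and tests for the 24kHz mono requirement from the list above. The function name is made up for illustration and is not part of mlx-audio or Qwen3-TTS.

```python
import wave

REQUIRED_RATE_HZ = 24_000
REQUIRED_CHANNELS = 1


def check_reference_clip(path):
    """Return a list of problems with a reference WAV; empty means it passes."""
    problems = []
    with wave.open(path, "rb") as clip:
        rate = clip.getframerate()
        channels = clip.getnchannels()
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

An empty list means the clip matches the format; anything else is worth fixing by resampling before generation.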
|
||||||
|
|
||||||
Qwen3-TTS on the Mac runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio), which is built on Apple's MLX framework and runs on the Apple Silicon GPU. The voice cloning feature is genuinely impressive: give it a 3-second audio sample and it'll match the voice reasonably well.
|
I set it up as a [LaunchAgent](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html) so it starts when I log in and stays running. The API is FastAPI-based, exposed on the local network.
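
For reference, a launchd job description for a server like this can be generated with Python's standard `plistlib`. This is a sketch under assumptions: the label, the `tts_server:app` module, and the use of `uvicorn` as the launcher are hypothetical. Only the port (8880) and the FastAPI/LaunchAgent combination come from the post.

```python
import plistlib

# Placeholder job description: label, interpreter lookup, and module name
# are illustrative, not the ones from the real setup.
LAUNCH_AGENT = {
    "Label": "net.example.qwen3-tts",
    "ProgramArguments": [
        "/usr/bin/env", "python3", "-m", "uvicorn", "tts_server:app",
        "--host", "0.0.0.0", "--port", "8880",
    ],
    "RunAtLoad": True,   # start as soon as launchd loads the agent
    "KeepAlive": True,   # ask launchd to restart the process if it exits
}


def render_launch_agent(config):
    """Serialize a launchd job description to XML plist bytes."""
    return plistlib.dumps(config, fmt=plistlib.FMT_XML)
```

Per-user agents live under `~/Library/LaunchAgents/`, so the rendered bytes would be written to a `.plist` file there.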
|
||||||
|
|
||||||
## What Surprised Me
|
## What I Actually Use This For
|
||||||
|
|
||||||
**Piper is fast.** I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech at realtime speed or better. The voice quality isn't Eleven Labs, but it's completely usable for a personal assistant.
|
**Voice replies on Telegram.** When someone sends me a voice message, Whisper transcribes it, the AI generates a text response, and Piper or Qwen converts it back to speech. The reply goes back as a voice note. With Piper the whole round trip takes a few seconds; with Qwen it takes longer, since generation runs at about 0.08x realtime.
|
||||||
|
|
||||||
**Whisper `small.en` is accurate enough.** I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it's fine.
|
**Podcast generation.** I built a pipeline that researches news via web search, writes a two-host podcast script, and generates 20 minutes of audio using different TTS voices for each host. Piper handles the bulk generation (fast), and I've experimented with Qwen for more distinctive character voices.
|
||||||
|
|
||||||
**The real cost is tokens, not compute.** I assumed the bottleneck would be CPU/GPU. It's not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from [Opik](https://github.com/comet-ml/opik) and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast.
|
**Notifications and alerts.** Instead of text notifications, some of my home automation alerts are spoken aloud through a Sonos speaker. Piper handles this — it's fast enough that the latency is imperceptible.
|
||||||
|
|
||||||
**V8 memory limits in Docker will silently kill your agent.** The agent framework runs on Node.js. Node's V8 heap has a default size cap (Docker itself sets no memory limit unless you configure one), and the process will hit that cap and crash without a clear error. The fix is `NODE_OPTIONS=--max-old-space-size=4096` in your container environment. I lost a few hours to this before finding it.
|
## Piper vs. Qwen: When to Use Which
|
||||||
|
|
||||||
## The Numbers
|
| | Piper | Qwen3-TTS |
|
||||||
|
|---|---|---|
|
||||||
|
| Speed | Realtime or faster | ~0.08x realtime |
|
||||||
|
| Voice cloning | No | Yes (3-second sample) |
|
||||||
|
| Hardware | CPU (any machine) | Apple Silicon recommended |
|
||||||
|
| Quality | Good, consistent | Excellent for cloned voices |
|
||||||
|
| Use case | Bulk TTS, real-time | Short personalized clips |
|
||||||
|
|
||||||
Rough monthly costs running this setup:
|
For anything that needs to be fast or long-form, Piper wins. For anything that needs to sound like a specific person, Qwen wins. I use both.
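
The speed row in the table translates directly into wall-clock time: render time is audio length divided by the realtime factor. A small sketch of that arithmetic, using the 0.08x figure from the table and treating Piper's "realtime or faster" as a worst-case factor of 1.0 (an assumption for illustration):

```python
QWEN_FACTOR = 0.08   # from the comparison table
PIPER_FACTOR = 1.0   # assumed worst case for "realtime or faster"


def render_seconds(audio_seconds, realtime_factor):
    """Wall-clock seconds needed to generate audio at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 10-second clip on Qwen: about 125 seconds, i.e. roughly 2 minutes.
clip_qwen = render_seconds(10, QWEN_FACTOR)

# A 20-minute podcast on Qwen: about 15,000 seconds, a bit over 4 hours.
podcast_qwen = render_seconds(20 * 60, QWEN_FACTOR)

# The same podcast on Piper: at most about 20 minutes under the assumption above.
podcast_piper = render_seconds(20 * 60, PIPER_FACTOR)
```

That gap is why the long-form work goes to Piper and Qwen is kept for short clips.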
|
||||||
|
|
||||||
- LLM API (Claude): ~$15–25/month depending on conversation volume
|
## The Setup
|
||||||
- Electricity for TrueNAS: already running, marginal cost near zero
|
|
||||||
- Everything else (Piper, Whisper, agent framework): $0
|
|
||||||
|
|
||||||
The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token in context costs money on both ends of the conversation.
|
The whole thing runs on hardware I already owned:
|
||||||
|
|
||||||
|
- **Piper**: Docker container on TrueNAS, Wyoming protocol on port 10200, OpenAI-compatible proxy on port 8951
|
||||||
|
- **Qwen3-TTS**: LaunchAgent on MacBook, FastAPI server on port 8880
|
||||||
|
- **Whisper**: Docker container on TrueNAS, OpenAI-compatible endpoint on port 8950
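
With several services spread across two machines, a quick reachability check is handy. The sketch below tries a TCP connection to each port from the list above. The host names `truenas.local` and `macbook.local` are placeholders, and a successful connection only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports come from the list above; the host names are placeholders.
SERVICES = {
    "piper-wyoming": ("truenas.local", 10200),
    "piper-openai-proxy": ("truenas.local", 8951),
    "whisper": ("truenas.local", 8950),
    "qwen3-tts": ("macbook.local", 8880),
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(services=SERVICES):
    """Map each service name to whether its port accepted a connection."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}
```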
|
||||||
|
|
||||||
|
Total additional cost: $0. These are all open-source models running on consumer hardware.
|
||||||
|
|
||||||
## What's Next
|
## What's Next
|
||||||
|
|
||||||
A few things I'm working on:
|
I want to get Piper running with multiple voice models simultaneously (different voices on different ports) so the podcast pipeline can use truly distinct local voices without hitting any external API. The current single-voice setup works, but having two or three local voices would eliminate the last dependency on cloud TTS entirely.
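
One way the "one voice per port" idea could plug into a two-host script is a simple speaker-to-port lookup. Everything specific here is hypothetical: the `HOST_A`/`HOST_B` labels, the `SPEAKER: text` script format, and port 10201 are made up for the sketch. Only the notion of separate Piper instances on separate ports comes from the post.

```python
# Hypothetical mapping of speaker labels to Piper instances.
VOICE_PORTS = {"HOST_A": 10200, "HOST_B": 10201}


def parse_script(script):
    """Split 'SPEAKER: text' lines into (speaker, text) pairs, skipping blanks."""
    segments = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"missing speaker label: {line!r}")
        segments.append((speaker.strip(), text.strip()))
    return segments


def plan_segments(script, voice_ports=VOICE_PORTS):
    """Return (port, text) pairs saying which instance should speak each line."""
    plan = []
    for speaker, text in parse_script(script):
        if speaker not in voice_ports:
            raise KeyError(f"no voice configured for {speaker!r}")
        plan.append((voice_ports[speaker], text))
    return plan
```

Each `(port, text)` pair can then be sent to the matching instance and the resulting clips concatenated in order.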
|
||||||
|
|
||||||
- WhatsApp integration — Telegram works great but WhatsApp is where most people actually are
|
If you have a machine that can run Docker, you can have local TTS running in about 15 minutes. [Piper's documentation](https://github.com/rhasspy/piper) is solid, and the Wyoming protocol integration with Home Assistant makes it trivially easy if you're already in that ecosystem.
|
||||||
- Multi-tenant hosting — running this for a small group, not just myself
|
|
||||||
- Thermal mass modeling for HVAC — using the agent framework to build something that actually reasons about home energy, not just schedules
|
|
||||||
|
|
||||||
The infrastructure is solid enough that I'm spending more time on what the agent *does* than on keeping it running. That's the right place to be.
|
|
||||||
|
|
||||||
If you're thinking about building something similar, the barrier is lower than you'd expect. You don't need a GPU server. You don't need a cloud budget. You need a machine that can run Docker, an API key, and a weekend.
|
|
||||||
|
|||||||