Add draft post: Self-Hosted AI Stack (unlisted)

2026-03-22 23:19:23 +00:00
parent d67a3d416b
commit edadadd431
2 changed files with 105 additions and 0 deletions
--- a/dist/posts/self-hosted-ai-stack/index.html
+++ b/dist/posts/self-hosted-ai-stack/index.html
@@ -0,0 +1,43 @@
+<!DOCTYPE html><html lang="en" data-astro-cid-5hce7sga> <head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware — dd0c.net</title><link rel="icon" href="/favicon.ico"><style>[data-astro-cid-5hce7sga],[data-astro-cid-5hce7sga]:before,[data-astro-cid-5hce7sga]:after{box-sizing:border-box}body{font-family:system-ui,-apple-system,sans-serif;margin:0;background:#f8f9fa;color:#333;line-height:1.6}nav[data-astro-cid-5hce7sga]{background:#fff;border-bottom:2px solid #3294D2;padding:0 1.5rem;display:flex;align-items:center;gap:0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{display:flex;align-items:center;gap:.5rem;text-decoration:none;margin-right:1.5rem;padding:.75rem 0}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga] img[data-astro-cid-5hce7sga]{height:32px;width:auto}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#333;text-decoration:none;padding:.75rem .85rem;font-size:.95rem;transition:color .15s}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{color:#3294d2}nav[data-astro-cid-5hce7sga] .spacer[data-astro-cid-5hce7sga]{flex:1}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga].external{color:#3294d2}main[data-astro-cid-5hce7sga]{max-width:760px;margin:2rem auto;padding:0 1.25rem}footer[data-astro-cid-5hce7sga]{margin-top:3rem;padding:1.25rem;text-align:center;font-size:.875rem;color:#666;border-top:1px solid #e0e0e0;background:#fff}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{color:#3294d2;text-decoration:none}footer[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]:hover{text-decoration:underline}h1[data-astro-cid-5hce7sga]{font-size:1.75rem;color:#1a1a1a}h2[data-astro-cid-5hce7sga]{font-size:1.2rem;margin-top:1.75rem;color:#1a1a1a}a[data-astro-cid-5hce7sga]{color:#3294d2}p[data-astro-cid-5hce7sga]{margin:.6rem 0}@media (max-width: 
600px){nav[data-astro-cid-5hce7sga]{flex-wrap:wrap;padding:0 .75rem}nav[data-astro-cid-5hce7sga] .brand[data-astro-cid-5hce7sga]{margin-right:.5rem}nav[data-astro-cid-5hce7sga] a[data-astro-cid-5hce7sga]{padding:.6rem .5rem;font-size:.875rem}}
+.post-meta[data-astro-cid-gjtny2mx]{color:#888;font-size:.875rem;margin-bottom:1.5rem}.post-body[data-astro-cid-gjtny2mx]{line-height:1.75}.post-body[data-astro-cid-gjtny2mx] iframe[data-astro-cid-gjtny2mx]{max-width:100%}.back[data-astro-cid-gjtny2mx]{display:inline-block;margin-bottom:1.25rem;font-size:.9rem;color:#3294d2;text-decoration:none}.back[data-astro-cid-gjtny2mx]:hover{text-decoration:underline}
+</style></head> <body data-astro-cid-5hce7sga> <nav data-astro-cid-5hce7sga> <a class="brand" href="/" data-astro-cid-5hce7sga> <img src="/logo-white.svg" alt="dd0c.net" width="40" height="40" data-astro-cid-5hce7sga> </a> <a href="/" data-astro-cid-5hce7sga>Home</a> <a href="/about" data-astro-cid-5hce7sga>About</a> <a href="/services" data-astro-cid-5hce7sga>Services</a> <div class="spacer" data-astro-cid-5hce7sga></div> <a class="external" href="https://github.com/ddoc" target="_blank" rel="noopener" data-astro-cid-5hce7sga>GitHub</a> </nav> <main data-astro-cid-5hce7sga>   <a class="back" href="/" data-astro-cid-gjtny2mx>← All posts</a> <h1 data-astro-cid-gjtny2mx>Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware</h1> <p class="post-meta" data-astro-cid-gjtny2mx>2026-03-24</p> <div class="post-body" data-astro-cid-gjtny2mx> <p>I wanted a personal AI assistant that I actually controlled. Not a SaaS product with a monthly bill and a privacy policy I’d never read. Something running on my own hardware, connected to my own accounts, with no vendor in the middle deciding what it could or couldn’t do.</p>
+<p>I didn’t buy new hardware for this. I used what I had: a TrueNAS server in my home lab and an M1 Pro MacBook. Not a data center. Just the stuff already sitting on my desk.</p>
+<h2 id="the-stack">The Stack</h2>
+<p>The core pieces are:</p>
+<ul>
+<li><strong>LLM</strong>: Claude via <a href="https://github.com/openclaw/kiro">kiro-anthropic</a>, a local proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what’s being sent.</li>
+<li><strong>TTS</strong>: <a href="https://github.com/rhasspy/piper">Piper TTS</a> running on TrueNAS via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Also <a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> on the Mac for voice cloning.</li>
+<li><strong>STT</strong>: <a href="https://github.com/openai/whisper">Whisper</a> server on TrueNAS, exposed as an OpenAI-compatible <code>/v1/audio/transcriptions</code> endpoint.</li>
+<li><strong>Agent framework</strong>: <a href="https://github.com/openclaw/openclaw">OpenClaw</a> — open source, self-hosted, connects to Telegram and WhatsApp.</li>
+</ul>
+<p>The whole thing runs on hardware I already owned. The only recurring cost is the LLM API.</p>
+<h2 id="getting-it-running">Getting It Running</h2>
+<p>Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward: pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I’m running <code>small.en</code>, which is a good balance of speed and accuracy for English transcription.</p>
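+<p>As a rough sketch of what talking to the Whisper endpoint looks like, the snippet below builds a multipart upload for an OpenAI-style <code>/v1/audio/transcriptions</code> route using only the Python standard library. The host, port, and the assumption that the server accepts <code>small.en</code> as the <code>model</code> field are placeholders, not details of my actual setup.</p>

```python
import io
import urllib.request
import uuid

# Hypothetical address: substitute the host and port your Whisper container listens on.
WHISPER_URL = "http://truenas.local:8000/v1/audio/transcriptions"


def build_transcription_request(audio: bytes, filename: str,
                                model: str = "small.en",
                                url: str = WHISPER_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for an OpenAI-style STT endpoint."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    # Plain text form field: the model name (assumed to be accepted by the server).
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode() + b"\r\n")

    # File form field: the raw audio bytes.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    body.write(file_header.encode())
    body.write(audio + b"\r\n")

    # Closing boundary ends the multipart body.
    body.write(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url,
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

+<p>Passing the returned request to <code>urllib.request.urlopen</code> sends it. An OpenAI-compatible server normally answers with JSON containing a <code>text</code> field, but check what your particular Whisper image returns.</p>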
+<p>The OpenClaw agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a <code>SKILL.md</code> and some scripts — easy to add, easy to audit.</p>
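+<p>Because a skill is just a directory with a <code>SKILL.md</code> in it, auditing what is installed can be a few lines of Python. This is a hypothetical helper of my own, not OpenClaw’s loader, and the skills root path is whatever your install uses.</p>

```python
from pathlib import Path
from typing import Dict, List


def list_skills(skills_root: str) -> Dict[str, List[str]]:
    """Map each skill directory (one containing SKILL.md) to the other files it ships."""
    skills: Dict[str, List[str]] = {}
    for entry in sorted(Path(skills_root).iterdir()):
        # Only directories that carry a SKILL.md count as skills.
        if entry.is_dir() and (entry / "SKILL.md").is_file():
            skills[entry.name] = sorted(
                p.relative_to(entry).as_posix()
                for p in entry.rglob("*")
                if p.is_file() and p.name != "SKILL.md"
            )
    return skills
```

+<p>Printing the result gives a quick inventory of every script a skill could run, which is the part worth reading before enabling it.</p>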
+<p>For the LLM proxy, kiro-anthropic sits between OpenClaw and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.</p>
+<p>Qwen3-TTS on the Mac runs via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>, which is built on MLX, Apple’s array framework for Apple Silicon, so inference runs on the Mac’s GPU and unified memory rather than the Neural Engine. The voice cloning feature is genuinely impressive: give it a 3-second audio sample and it’ll match the voice reasonably well.</p>
+<h2 id="what-surprised-me">What Surprised Me</h2>
+<p><strong>Piper is fast.</strong> I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech in real time or faster. The voice quality isn’t ElevenLabs, but it’s completely usable for a personal assistant.</p>
+<p><strong>Whisper <code>small.en</code> is accurate enough.</strong> I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it’s fine.</p>
+<p><strong>The real cost is tokens, not compute.</strong> I assumed the bottleneck would be CPU/GPU. It’s not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from <a href="https://github.com/comet-ml/opik">Opik</a> and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast.</p>
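+<p>The arithmetic behind that is worth spelling out. The sketch below takes the two figures from the traces and then models, under simplified assumptions, how input tokens pile up when the whole history is resent on every turn. The system prompt size, per-turn growth, and turn count are illustrative numbers, not measurements, and the model ignores prompt caching.</p>

```python
TOTAL_INPUT_TOKENS = 75_000_000  # from the Opik traces
TRACES = 100

# Average input tokens per trace: 750,000.
avg_per_trace = TOTAL_INPUT_TOKENS / TRACES


def cumulative_input_tokens(system_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens sent when the full history is resent on every turn."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # new message or tool output joins the context
        total += history            # the whole context goes out as input again
    return total


# Illustrative only: 2,000-token system prompt, 1,000 new tokens per turn, 50 turns.
example_total = cumulative_input_tokens(2_000, 1_000, 50)  # 1,375,000
```

+<p>Only 50,000 new tokens were added in that example, yet 1,375,000 were sent as input. The total grows roughly with the square of the number of turns, which is why long tool-heavy sessions dominate the bill.</p>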
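+<p><strong>V8 memory limits in Docker will silently kill your agent.</strong> OpenClaw runs on Node.js. Docker does not cap container memory unless a limit is configured, but V8 has its own default ceiling on the old-space heap, and a long-running agent can hit it and crash. If the container also has a memory limit, the kernel’s OOM killer can end the process with no message from Node at all. The fix that worked for me is <code>NODE_OPTIONS=--max-old-space-size=4096</code> (the value is in megabytes) in your container environment, with any container memory limit set comfortably above it. I lost a few hours to this before finding it.</p>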
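+<p>One way to catch this before it bites is a small pre-flight check that compares the heap flag against whatever memory limit the container has. This is a hypothetical helper, and the 512 MB reserve for non-heap memory (buffers, native modules, the runtime itself) is an arbitrary assumption you should tune.</p>

```python
import re
from typing import Optional


def heap_limit_mb(node_options: str) -> Optional[int]:
    """Extract --max-old-space-size (in megabytes) from a NODE_OPTIONS string, if present."""
    match = re.search(r"--max[-_]old[-_]space[-_]size=(\d+)", node_options)
    return int(match.group(1)) if match else None


def has_headroom(node_options: str, container_limit_mb: int, reserve_mb: int = 512) -> bool:
    """True when the V8 heap cap leaves at least reserve_mb under the container limit.

    Returns False when the flag is absent, since the heap cap cannot be verified.
    """
    heap = heap_limit_mb(node_options)
    if heap is None:
        return False
    return container_limit_mb - heap >= reserve_mb
```

+<p>For example, <code>has_headroom("--max-old-space-size=4096", 6144)</code> passes, while the same flag in a container limited to 4096 MB does not, because the heap alone could consume the entire allowance.</p>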
+<h2 id="the-numbers">The Numbers</h2>
+<p>Rough monthly costs running this setup:</p>
+<ul>
+<li>LLM API (Claude): ~$15–25/month depending on conversation volume</li>
+<li>Electricity for TrueNAS: already running, marginal cost near zero</li>
+<li>Everything else (Piper, Whisper, OpenClaw): $0</li>
+</ul>
+<p>The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token kept in context is billed as input again on each request, because the full history is resent with every turn.</p>
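+<p>Being deliberate about history can be as simple as keeping only the most recent messages that fit a token budget. The sketch below is a minimal illustration, not what OpenClaw does internally. The four-characters-per-token estimate is a crude heuristic for English text, and the message shape (a dict with <code>role</code> and <code>content</code>) is an assumption.</p>

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about four characters per token for English text."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Dict[str, str]], budget: int) -> List[Dict[str, str]]:
    """Keep the most recent messages whose estimated tokens fit within budget."""
    kept: List[Dict[str, str]] = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    # Restore chronological order before returning.
    return list(reversed(kept))
```

+<p>A real agent needs more care than this: dropping a tool call without its result (or the reverse) can confuse the model, so trimming on whole exchanges or summarizing older turns is usually the safer variant.</p>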
+<h2 id="whats-next">What’s Next</h2>
+<p>A few things I’m working on:</p>
+<ul>
+<li>WhatsApp integration — Telegram works great but WhatsApp is where most people actually are</li>
+<li>Multi-tenant hosting — running this for a small group, not just myself</li>
+<li>Thermal mass modeling for HVAC — using the agent framework to build something that actually reasons about home energy, not just schedules</li>
+</ul>
+<p>The infrastructure is solid enough that I’m spending more time on what the agent <em>does</em> than on keeping it running. That’s the right place to be.</p>
+<p>If you’re thinking about building something similar, the barrier is lower than you’d expect. You don’t need a GPU server. You don’t need a cloud budget. You need a machine that can run Docker, an API key, and a weekend.</p> </div>  </main> <footer data-astro-cid-5hce7sga> <a href="/privacy" data-astro-cid-5hce7sga>Privacy Policy</a> &nbsp;·&nbsp;
+<a href="/terms" data-astro-cid-5hce7sga>Terms of Service</a> &nbsp;·&nbsp;
+    &copy; Brian Galura 2004&ndash;2026
+</footer> </body></html>