---
layout: ../../layouts/PostLayout.astro
title: "Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware"
date: "2026-03-24"
---

I wanted a personal AI assistant that I actually controlled. Not a SaaS product with a monthly bill and a privacy policy I'd never read. Something running on my own hardware, connected to my own accounts, with no vendor in the middle deciding what it could or couldn't do.

I didn't buy new hardware for this. I used what I had: a TrueNAS server in my home lab and an M1 Pro MacBook. Not a data center. Just the stuff already sitting on my desk.

## The Stack

The core pieces are:

- **LLM**: Claude via [kiro-anthropic](https://github.com/openclaw/kiro), a local proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what's being sent.
- **TTS**: [Piper TTS](https://github.com/rhasspy/piper) running on TrueNAS via the [Wyoming protocol](https://github.com/rhasspy/wyoming). Also [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) on the Mac for voice cloning.
- **STT**: [Whisper](https://github.com/openai/whisper) server on TrueNAS, exposed as an OpenAI-compatible `/v1/audio/transcriptions` endpoint.
- **Agent framework**: [OpenClaw](https://github.com/openclaw/openclaw) — open source, self-hosted, connects to Telegram and WhatsApp.

Everything except the LLM itself runs on hardware I already owned. The only recurring cost is the LLM API.

## Getting It Running

Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward: pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I'm running `small.en`, which is a good balance of speed and accuracy for English transcription.
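
As a rough illustration of what "OpenAI-compatible" means in practice, the sketch below builds a multipart upload for the `/v1/audio/transcriptions` endpoint using only Python's standard library. The host and port (`truenas.local:9000`), the file name, and the `model` value are placeholders rather than details from this setup, and the exact fields a given Whisper server accepts may differ.

```python
import json
import urllib.request
import uuid


def build_transcription_request(base_url, audio_bytes, filename="clip.wav", model="small.en"):
    """Build a multipart/form-data POST with the `model` and `file` form fields."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def transcribe(base_url, audio_bytes):
    """Send the audio and return the transcript, assuming a {"text": ...} JSON reply."""
    request = build_transcription_request(base_url, audio_bytes)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["text"]
```

Calling `transcribe("http://truenas.local:9000", open("clip.wav", "rb").read())` should return the recognized text, provided the server answers with the usual `{"text": ...}` body.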

The OpenClaw agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a `SKILL.md` and some scripts — easy to add, easy to audit.
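
Because a skill is just a directory, auditing what the agent can run takes only a few lines. The following is a minimal sketch that assumes a flat layout (one subdirectory per skill, each with a `SKILL.md` at its top level). The root path and that layout are assumptions for illustration, not OpenClaw's documented structure.

```python
from pathlib import Path


def discover_skills(skills_root):
    """Map each skill name to the files that sit next to its SKILL.md."""
    skills = {}
    for manifest in sorted(Path(skills_root).glob("*/SKILL.md")):
        skill_dir = manifest.parent
        # Everything in the skill directory other than the manifest is worth reviewing.
        scripts = sorted(
            entry.name
            for entry in skill_dir.iterdir()
            if entry.is_file() and entry.name != "SKILL.md"
        )
        skills[skill_dir.name] = scripts
    return skills
```

Running `discover_skills("./skills")` (a hypothetical path) gives a quick inventory of every script the agent could invoke, which is a reasonable starting point for a manual review.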

For the LLM proxy, kiro-anthropic sits between OpenClaw and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.

Qwen3-TTS on the Mac runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio), which is built on Apple's MLX framework and runs on the Apple Silicon GPU rather than the Neural Engine. The voice cloning feature is genuinely impressive: give it a 3-second audio sample and it'll match the voice reasonably well.

## What Surprised Me

**Piper is fast.** I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech at real-time speed or better. The voice quality isn't ElevenLabs, but it's completely usable for a personal assistant.
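
One way to put a number on "real-time or better" is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where anything below 1.0 is faster than real time. The sketch below computes it for a WAV file. The timing value is something you would measure yourself around your own Piper call, and the function names are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF below 1.0 means speech was generated faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds
```

For example, if synthesis took 0.8 seconds and produced a 2-second clip, the RTF is 0.4.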

**Whisper `small.en` is accurate enough.** I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it's fine.
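
An accuracy figure like this can be checked by computing word error rate (WER) against a hand-corrected transcript: the word-level edit distance divided by the number of reference words. Below is a small, dependency-free sketch. The sample sentences are made up, and real evaluations usually normalize punctuation and numbers more carefully than this does.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # seen so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One wrong word out of five reference words.
print(word_error_rate("turn on the kitchen lights", "turn on the kitchen light"))  # 0.2
```

A WER of 0.05 corresponds to the "95% accuracy" shorthand, so a handful of transcribed voice notes is enough for a rough sanity check.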

**The real cost is tokens, not compute.** I assumed the bottleneck would be CPU/GPU. It's not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from [Opik](https://github.com/comet-ml/opik) and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast.
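
To make those figures concrete, the arithmetic below works out the average input size per trace and shows how a bill scales with a per-token price. The price used is a placeholder chosen to keep the arithmetic readable, not a quoted rate. Substitute the current price for your model, and note that prompt-caching discounts are not modeled.

```python
def average_tokens_per_trace(total_input_tokens, trace_count):
    """Mean input tokens per trace."""
    return total_input_tokens / trace_count


def input_cost_usd(total_input_tokens, usd_per_million_tokens):
    """Input-side cost for a given price per million tokens."""
    return total_input_tokens / 1_000_000 * usd_per_million_tokens


TOTAL_INPUT_TOKENS = 75_000_000  # figure reported above
TRACE_COUNT = 100                # figure reported above
PLACEHOLDER_PRICE = 1.00         # hypothetical USD per million input tokens

print(average_tokens_per_trace(TOTAL_INPUT_TOKENS, TRACE_COUNT))  # 750000.0
print(input_cost_usd(TOTAL_INPUT_TOKENS, PLACEHOLDER_PRICE))
```

An average of 750,000 input tokens per trace suggests that each trace contains many model calls, each resending a large context.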

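**V8 memory limits in Docker will silently kill your agent.** OpenClaw runs on Node.js. Docker itself doesn't cap a container's memory unless a limit is configured, but V8 caps its own heap by default, and a long-running agent with large contexts can exceed that cap. When it does, the process crashes, and in my case it did so without a clear error. The fix is `NODE_OPTIONS=--max-old-space-size=4096` in your container environment, which raises V8's old-space limit to about 4 GB. If the container does have a memory limit, keep it above that value, or the kernel's OOM killer will end the process instead. I lost a few hours to this before finding it.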

## The Numbers

Rough monthly costs running this setup:

- LLM API (Claude): ~$15–25/month depending on conversation volume
- Electricity for TrueNAS: already running, marginal cost near zero
- Everything else (Piper, Whisper, OpenClaw): $0

The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token in context is billed as input again on each request, so a long history is paid for on every turn.
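
The reason history matters so much is that a chat API is stateless: each request resends the system prompt plus every earlier turn, so total input tokens grow roughly quadratically with conversation length. The sketch below models that with made-up sizes (a 2,000-token system prompt and 1,500 tokens added per turn) and compares it with keeping only the most recent turns. It is a simplification that ignores output tokens and caching.

```python
def cumulative_input_tokens(system_tokens, tokens_per_turn, turns, max_history_turns=None):
    """Total input tokens sent over a conversation when every request resends its context."""
    total = 0
    for turn in range(1, turns + 1):
        # Without a cap, every earlier turn rides along; with one, only the latest N do.
        kept_turns = turn if max_history_turns is None else min(turn, max_history_turns)
        total += system_tokens + kept_turns * tokens_per_turn
    return total


# Hypothetical sizes, for illustration only.
print(cumulative_input_tokens(2_000, 1_500, turns=50))                        # 2012500
print(cumulative_input_tokens(2_000, 1_500, turns=50, max_history_turns=10))  # 782500
```

In this toy model, capping history at the last ten turns cuts input tokens by more than half over a 50-turn conversation, and the gap widens as conversations get longer.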

## What's Next

A few things I'm working on:

- WhatsApp integration — Telegram works great but WhatsApp is where most people actually are
- Multi-tenant hosting — running this for a small group, not just myself
- Thermal mass modeling for HVAC — using the agent framework to build something that actually reasons about home energy, not just schedules

The infrastructure is solid enough that I'm spending more time on what the agent *does* than on keeping it running. That's the right place to be.

If you're thinking about building something similar, the barrier is lower than you'd expect. You don't need a GPU server. You don't need a cloud budget. You need a machine that can run Docker, an API key, and a weekend.