Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware

2026-03-24

I wanted a personal AI assistant that I actually controlled. Not a SaaS product with a monthly bill and a privacy policy I’d never read. Something running on my own hardware, connected to my own accounts, with no vendor in the middle deciding what it could or couldn’t do.

I didn’t buy new hardware for this. I used what I had: a TrueNAS server in my home lab and an M1 Pro MacBook. Not a data center. Just the stuff already sitting on my desk.

The Stack

The core pieces are:

The whole thing runs on hardware I already owned. The only recurring cost is the LLM API.

Getting It Running

Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward: pick a voice model, point your client at its Wyoming port, done. Whisper took a bit more tuning. I’m running small.en, which is a good balance of speed and accuracy for English transcription.
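Before wiring anything else to these containers, it helps to confirm that both ports answer. The sketch below only opens a TCP connection; it does not speak the Wyoming protocol. The host name is hypothetical, and 10200 and 10300 are assumed here because they are the ports Wyoming Piper and Wyoming Whisper containers are commonly published on. Adjust them to whatever your compose file maps.

```python
import socket

# Assumptions: "truenas.local" is a hypothetical host name, and 10200 / 10300
# are the commonly used Wyoming ports for Piper and Whisper.
SERVICES = {
    "piper": ("truenas.local", 10200),
    "whisper": ("truenas.local", 10300),
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups.
        return False


def check_services(services: dict) -> dict:
    """Map each service name to whether its port accepted a connection."""
    return {name: port_is_open(host, port) for name, (host, port) in services.items()}
```

Calling `check_services(SERVICES)` returns a dict of booleans, one per service. A `False` usually means the container is down or the port mapping is wrong.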

The agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a SKILL.md and some scripts — easy to add, easy to audit.
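To make the “easy to audit” point concrete, here is a minimal sketch of how a skills folder like this could be enumerated. It is not the framework’s actual loader. It assumes only what is described above (one directory per skill, each with a SKILL.md next to its scripts), and the function name is made up for illustration.

```python
from pathlib import Path


def discover_skills(root: Path) -> dict[str, list[str]]:
    """Map each skill name to the sorted file names that sit beside its SKILL.md.

    A directory counts as a skill only if it contains a SKILL.md file.
    """
    skills: dict[str, list[str]] = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and (entry / "SKILL.md").is_file():
            skills[entry.name] = sorted(
                p.name
                for p in entry.iterdir()
                if p.is_file() and p.name != "SKILL.md"
            )
    return skills
```

Running this against the skills directory gives a one-screen inventory of what the agent can execute, which is most of what an audit needs as a starting point.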

The LLM proxy sits between the agent and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.
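The core of a proxy like this is small. The sketch below shows the three jobs in one function: resolve a model alias, attach the API key, and log the request. The alias table and its model IDs are placeholders, not real model names, and this is not the proxy described above, just one plausible shape for it. The `x-api-key` and `anthropic-version` headers are the ones the Anthropic API expects.

```python
import logging

logger = logging.getLogger("llm_proxy")

# Hypothetical alias table: the agent asks for "default" or "cheap", and only
# the proxy knows which upstream model ID that means. IDs are placeholders.
MODEL_ALIASES = {
    "default": "upstream-model-large",
    "cheap": "upstream-model-small",
}


def prepare_upstream_request(body: dict, api_key: str) -> tuple[dict, dict]:
    """Return (headers, body) for the upstream call with the alias resolved."""
    requested = body.get("model", "default")
    # Unknown names pass through unchanged so explicit model IDs still work.
    resolved = MODEL_ALIASES.get(requested, requested)
    upstream_body = {**body, "model": resolved}
    headers = {
        "x-api-key": api_key,  # the key is held by the proxy, not the agent
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    logger.info(
        "model=%s resolved=%s messages=%d",
        requested,
        resolved,
        len(body.get("messages", [])),
    )
    return headers, upstream_body
```

Swapping models then means editing one table entry. The agent config keeps saying `"default"` and never sees the key.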

Qwen3-TTS on the Mac runs via mlx-audio, which is built on MLX and runs on the Apple Silicon GPU rather than the Neural Engine. The voice cloning feature is genuinely impressive: give it a 3-second audio sample and it’ll match the voice reasonably well.

What Surprised Me

Piper is fast. I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech in real time or faster. The voice quality isn’t ElevenLabs, but it’s completely usable for a personal assistant.

Whisper small.en is accurate enough. I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it’s fine.
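A figure like “95% accuracy” is easier to trust when it is measured. The usual metric for transcription is word error rate (WER): the word-level edit distance between what was said and what was transcribed, divided by the number of words spoken. About 95% accuracy corresponds to a WER near 0.05. The sketch below is a plain-Python version of that calculation. The lowercase-and-split normalization is a simplifying assumption; real evaluations usually strip punctuation as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, transcribing “turn on the office lights” as “turn on the office light” is one substitution in five words, a WER of 0.2. Scoring a few dozen of your own recordings this way shows whether proper nouns and jargon really are the only misses.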

The real cost is tokens, not compute. I assumed the bottleneck would be CPU/GPU. It’s not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from Opik and found 75 million input tokens across 100 traces, an average of about 750,000 input tokens per trace. Context size is the killer: each request re-sends the conversation so far, so long conversations with tool call history get expensive fast.
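The arithmetic behind that is worth spelling out. When every request carries the whole history, total input tokens grow roughly with the square of the number of turns. The sketch below is a simplified model. It assumes a fixed system prompt and a constant number of tokens added per turn, and it ignores prompt caching and pricing tiers. The per-turn numbers in the example are made up.

```python
def average_tokens_per_trace(total_tokens: int, traces: int) -> float:
    """Mean input tokens per trace."""
    return total_tokens / traces


def cumulative_input_tokens(system_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed when every request re-sends the full history.

    Request k carries the system prompt plus k turns of accumulated history.
    """
    return sum(system_tokens + k * tokens_per_turn for k in range(1, turns + 1))
```

`average_tokens_per_trace(75_000_000, 100)` gives 750,000. With a hypothetical 2,000-token system prompt and 1,500 tokens added per turn, 10 turns cost 102,500 input tokens in total and 40 turns cost 1,310,000. Four times the turns costs almost thirteen times the input tokens.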

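V8 memory limits in Docker will silently kill your agent. The agent framework runs on Node.js. V8 caps its heap at a default size, and a container memory limit can sit on top of that. When the agent hit the ceiling, it crashed without a clear error. The fix is NODE_OPTIONS=--max-old-space-size=4096 in your container environment, which raises V8’s old-generation heap limit to 4096 MiB. If the container has its own memory limit, it needs to be higher than that. I lost a few hours to this before finding it.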
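It is worth knowing which limit the container actually has before picking a heap size. On Linux the limit is exposed through the cgroup filesystem, and the sketch below reads it from inside the container. The two paths are the standard cgroup v2 and v1 locations. The 75% headroom factor is a rule-of-thumb assumption, not something Node or Docker prescribes.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations inside a Linux container (v2 first, then v1).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when cgroup v2 reports 'max'."""
    value = raw.strip()
    return None if value == "max" else int(value)


def container_memory_limit() -> Optional[int]:
    """Read the memory limit of the current container, if one is visible."""
    for path in CGROUP_LIMIT_FILES:
        if path.is_file():
            return parse_limit(path.read_text())
    return None


def suggested_old_space_mb(limit_bytes: int, fraction: float = 0.75) -> int:
    """Heap size in MiB that leaves headroom under the container limit.

    The default fraction is an assumption; V8's heap is only part of what a
    Node process uses, so the heap cap should stay below the container cap.
    """
    return int(limit_bytes * fraction) // (1024 * 1024)
```

With a 6 GiB container limit, `suggested_old_space_mb` returns 4608, so a 4096 MiB heap fits with room to spare. Note that cgroup v1 reports “no limit” as a very large number rather than `max`, so treat huge values as unlimited.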

The Numbers

Rough monthly costs running this setup:

The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token you keep in context gets billed again on each request that includes it.
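One simple way to act on that lesson is to cap how much history goes into each request. The sketch below keeps only the most recent messages that fit a token budget. The four-characters-per-token estimate is a rough assumption, not a real tokenizer, and the message shape (a dict with a string `content`) is simplified for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose estimated tokens fit in budget."""
    kept: list[dict] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one
    # that would push the running total over the budget.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

A real agent would need more care than this. Tool calls and their results have to be kept or dropped together, and summarizing old turns often works better than cutting them. Even so, a hard budget puts a ceiling on what each request can cost.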

What’s Next

A few things I’m working on:

The infrastructure is solid enough that I’m spending more time on what the agent does than on keeping it running. That’s the right place to be.

If you’re thinking about building something similar, the barrier is lower than you’d expect. You don’t need a GPU server. You don’t need a cloud budget. You need a machine that can run Docker, an API key, and a weekend.