diff --git a/dist/posts/self-hosted-ai-stack/index.html b/dist/posts/self-hosted-ai-stack/index.html new file mode 100644 index 0000000..2004218 --- /dev/null +++ b/dist/posts/self-hosted-ai-stack/index.html @@ -0,0 +1,43 @@ + Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware — dd0c.net
← All posts

Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware

2026-03-24

I wanted a personal AI assistant that I actually controlled. Not a SaaS product with a monthly bill and a privacy policy I’d never read. Something running on my own hardware, connected to my own accounts, with no vendor in the middle deciding what it could or couldn’t do.

+

I didn’t buy new hardware for this. I used what I had: a TrueNAS server in my home lab and an M1 Pro MacBook. Not a data center. Just the stuff already sitting on my desk.

+

The Stack

+

The core pieces are:

+ +

The whole thing runs on hardware I already owned. The only recurring cost is the LLM API.

+

Getting It Running

+

Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward — pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I’m running small.en which is a good balance of speed and accuracy for English transcription.

+

The OpenClaw agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a SKILL.md and some scripts — easy to add, easy to audit.

+

For the LLM proxy, kiro-anthropic sits between OpenClaw and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.

+

Qwen3-TTS on the Mac runs via mlx-audio, which uses Apple Silicon’s Neural Engine. The voice cloning feature is genuinely impressive — give it a 3-second audio sample and it’ll match the voice reasonably well.

+

What Surprised Me

+

Piper is fast. I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech at realtime speed or better. The voice quality isn’t Eleven Labs, but it’s completely usable for a personal assistant.

+

Whisper small.en is accurate enough. I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it’s fine.

+

The real cost is tokens, not compute. I assumed the bottleneck would be CPU/GPU. It’s not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from Opik and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast.

+

V8 memory limits in Docker will silently kill your agent. OpenClaw runs on Node.js. Docker containers have a default memory limit, and Node’s V8 heap will hit it and crash without a clear error. The fix is NODE_OPTIONS=--max-old-space-size=4096 in your container environment. I lost a few hours to this before finding it.

+

The Numbers

+

Rough monthly costs running this setup:

+ +

The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token in context costs money on both ends of the conversation.

+

What’s Next

+

A few things I’m working on:

+ +

The infrastructure is solid enough that I’m spending more time on what the agent does than on keeping it running. That’s the right place to be.

+

If you’re thinking about building something similar, the barrier is lower than you’d expect. You don’t need a GPU server. You don’t need a cloud budget. You need a machine that can run Docker, an API key, and a weekend.

\ No newline at end of file diff --git a/src/pages/posts/self-hosted-ai-stack.md b/src/pages/posts/self-hosted-ai-stack.md new file mode 100644 index 0000000..6f3b39e --- /dev/null +++ b/src/pages/posts/self-hosted-ai-stack.md @@ -0,0 +1,62 @@ +--- +layout: ../../layouts/PostLayout.astro +title: "Self-Hosted AI: Running LLMs, TTS, and Whisper on Consumer Hardware" +date: "2026-03-24" +--- + +I wanted a personal AI assistant that I actually controlled. Not a SaaS product with a monthly bill and a privacy policy I'd never read. Something running on my own hardware, connected to my own accounts, with no vendor in the middle deciding what it could or couldn't do. + +I didn't buy new hardware for this. I used what I had: a TrueNAS server in my home lab and an M1 Pro MacBook. Not a data center. Just the stuff already sitting on my desk. + +## The Stack + +The core pieces are: + +- **LLM**: Claude via [kiro-anthropic](https://github.com/openclaw/kiro), a local proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what's being sent. +- **TTS**: [Piper TTS](https://github.com/rhasspy/piper) running on TrueNAS via the [Wyoming protocol](https://github.com/rhasspy/wyoming). Also [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) on the Mac for voice cloning. +- **STT**: [Whisper](https://github.com/openai/whisper) server on TrueNAS, exposed as an OpenAI-compatible `/v1/audio/transcriptions` endpoint. +- **Agent framework**: [OpenClaw](https://github.com/openclaw/openclaw) — open source, self-hosted, connects to Telegram and WhatsApp. + +The whole thing runs on hardware I already owned. The only recurring cost is the LLM API. + +## Getting It Running + +Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward — pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I'm running `small.en` which is a good balance of speed and accuracy for English transcription. + +The OpenClaw agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a `SKILL.md` and some scripts — easy to add, easy to audit. + +For the LLM proxy, kiro-anthropic sits between OpenClaw and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key. + +Qwen3-TTS on the Mac runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio), which uses Apple Silicon's Neural Engine. The voice cloning feature is genuinely impressive — give it a 3-second audio sample and it'll match the voice reasonably well. + +## What Surprised Me + +**Piper is fast.** I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech at realtime speed or better. The voice quality isn't Eleven Labs, but it's completely usable for a personal assistant. + +**Whisper `small.en` is accurate enough.** I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it's fine. + +**The real cost is tokens, not compute.** I assumed the bottleneck would be CPU/GPU. It's not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from [Opik](https://github.com/comet-ml/opik) and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast. + +**V8 memory limits in Docker will silently kill your agent.** OpenClaw runs on Node.js. Docker containers have a default memory limit, and Node's V8 heap will hit it and crash without a clear error. The fix is `NODE_OPTIONS=--max-old-space-size=4096` in your container environment. I lost a few hours to this before finding it. + +## The Numbers + +Rough monthly costs running this setup: + +- LLM API (Claude): ~$15–25/month depending on conversation volume +- Electricity for TrueNAS: already running, marginal cost near zero +- Everything else (Piper, Whisper, OpenClaw): $0 + +The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token in context costs money on both ends of the conversation. + +## What's Next + +A few things I'm working on: + +- WhatsApp integration — Telegram works great but WhatsApp is where most people actually are +- Multi-tenant hosting — running this for a small group, not just myself +- Thermal mass modeling for HVAC — using the agent framework to build something that actually reasons about home energy, not just schedules + +The infrastructure is solid enough that I'm spending more time on what the agent *does* than on keeping it running. That's the right place to be. + +If you're thinking about building something similar, the barrier is lower than you'd expect. You don't need a GPU server. You don't need a cloud budget. You need a machine that can run Docker, an API key, and a weekend.