Remove product references from blog post
12
dist/posts/self-hosted-ai-stack/index.html
vendored
@@ -5,28 +5,28 @@
 <h2 id="the-stack">The Stack</h2>
 <p>The core pieces are:</p>
 <ul>
-<li><strong>LLM</strong>: Claude via <a href="https://github.com/openclaw/kiro">kiro-anthropic</a>, a local proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what’s being sent.</li>
+<li><strong>LLM</strong>: Claude via a local reverse proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what’s being sent.</li>
 <li><strong>TTS</strong>: <a href="https://github.com/rhasspy/piper">Piper TTS</a> running on TrueNAS via the <a href="https://github.com/rhasspy/wyoming">Wyoming protocol</a>. Also <a href="https://huggingface.co/Qwen/Qwen3-TTS">Qwen3-TTS</a> on the Mac for voice cloning.</li>
 <li><strong>STT</strong>: <a href="https://github.com/openai/whisper">Whisper</a> server on TrueNAS, exposed as an OpenAI-compatible <code>/v1/audio/transcriptions</code> endpoint.</li>
-<li><strong>Agent framework</strong>: <a href="https://github.com/openclaw/openclaw">OpenClaw</a> — open source, self-hosted, connects to Telegram and WhatsApp.</li>
+<li><strong>Agent framework</strong>: An open-source agent framework — self-hosted, connects to Telegram and WhatsApp.</li>
 </ul>
 <p>The whole thing runs on hardware I already owned. The only recurring cost is the LLM API.</p>
 <h2 id="getting-it-running">Getting It Running</h2>
 <p>Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward — pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I’m running <code>small.en</code> which is a good balance of speed and accuracy for English transcription.</p>
-<p>The OpenClaw agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a <code>SKILL.md</code> and some scripts — easy to add, easy to audit.</p>
+<p>The agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a <code>SKILL.md</code> and some scripts — easy to add, easy to audit.</p>
-<p>For the LLM proxy, kiro-anthropic sits between OpenClaw and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.</p>
+<p>The LLM proxy sits between the agent and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.</p>
 <p>Qwen3-TTS on the Mac runs via <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a>, which uses Apple Silicon’s Neural Engine. The voice cloning feature is genuinely impressive — give it a 3-second audio sample and it’ll match the voice reasonably well.</p>
 <h2 id="what-surprised-me">What Surprised Me</h2>
 <p><strong>Piper is fast.</strong> I expected local TTS to be slow and robotic. Piper is neither. On TrueNAS (not a powerful machine), it generates speech at realtime speed or better. The voice quality isn’t Eleven Labs, but it’s completely usable for a personal assistant.</p>
 <p><strong>Whisper <code>small.en</code> is accurate enough.</strong> I was skeptical about running a smaller model, but it handles my voice well — probably 95% accuracy on normal speech. The only failures are proper nouns and technical jargon, which is expected. For a personal assistant that mostly hears the same vocabulary over and over, it’s fine.</p>
 <p><strong>The real cost is tokens, not compute.</strong> I assumed the bottleneck would be CPU/GPU. It’s not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from <a href="https://github.com/comet-ml/opik">Opik</a> and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast.</p>
-<p><strong>V8 memory limits in Docker will silently kill your agent.</strong> OpenClaw runs on Node.js. Docker containers have a default memory limit, and Node’s V8 heap will hit it and crash without a clear error. The fix is <code>NODE_OPTIONS=--max-old-space-size=4096</code> in your container environment. I lost a few hours to this before finding it.</p>
+<p><strong>V8 memory limits in Docker will silently kill your agent.</strong> The agent framework runs on Node.js. Docker containers have a default memory limit, and Node’s V8 heap will hit it and crash without a clear error. The fix is <code>NODE_OPTIONS=--max-old-space-size=4096</code> in your container environment. I lost a few hours to this before finding it.</p>
 <h2 id="the-numbers">The Numbers</h2>
 <p>Rough monthly costs running this setup:</p>
 <ul>
 <li>LLM API (Claude): ~$15–25/month depending on conversation volume</li>
 <li>Electricity for TrueNAS: already running, marginal cost near zero</li>
-<li>Everything else (Piper, Whisper, OpenClaw): $0</li>
+<li>Everything else (Piper, Whisper, agent framework): $0</li>
 </ul>
 <p>The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token in context costs money on both ends of the conversation.</p>
 <h2 id="whats-next">What’s Next</h2>
@@ -12,10 +12,10 @@ I didn't buy new hardware for this. I used what I had: a TrueNAS server in my ho
 
 The core pieces are:
 
-- **LLM**: Claude via [kiro-anthropic](https://github.com/openclaw/kiro), a local proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what's being sent.
+- **LLM**: Claude via a local reverse proxy that routes requests through my own Anthropic API key. No vendor lock-in, no shared rate limits, full visibility into what's being sent.
 - **TTS**: [Piper TTS](https://github.com/rhasspy/piper) running on TrueNAS via the [Wyoming protocol](https://github.com/rhasspy/wyoming). Also [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS) on the Mac for voice cloning.
 - **STT**: [Whisper](https://github.com/openai/whisper) server on TrueNAS, exposed as an OpenAI-compatible `/v1/audio/transcriptions` endpoint.
-- **Agent framework**: [OpenClaw](https://github.com/openclaw/openclaw) — open source, self-hosted, connects to Telegram and WhatsApp.
+- **Agent framework**: An open-source agent framework — self-hosted, connects to Telegram and WhatsApp.
 
 The whole thing runs on hardware I already owned. The only recurring cost is the LLM API.
 
@@ -23,9 +23,9 @@ The whole thing runs on hardware I already owned. The only recurring cost is the
 
 Piper and Whisper both run in Docker on TrueNAS. Piper is straightforward — pick a voice model, point it at the Wyoming port, done. Whisper took a bit more tuning. I'm running `small.en` which is a good balance of speed and accuracy for English transcription.
 
-The OpenClaw agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a `SKILL.md` and some scripts — easy to add, easy to audit.
+The agent framework is what ties it together. It handles the conversation loop, routes messages to the right tools, and manages the Telegram integration. Skills are just directories with a `SKILL.md` and some scripts — easy to add, easy to audit.
 
-For the LLM proxy, kiro-anthropic sits between OpenClaw and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.
+The LLM proxy sits between the agent and the Anthropic API. It adds request logging, lets me swap models without touching agent config, and gives me a single place to manage the API key.
 
 Qwen3-TTS on the Mac runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio), which uses Apple Silicon's Neural Engine. The voice cloning feature is genuinely impressive — give it a 3-second audio sample and it'll match the voice reasonably well.
 
@@ -37,7 +37,7 @@ Qwen3-TTS on the Mac runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio),
 
 **The real cost is tokens, not compute.** I assumed the bottleneck would be CPU/GPU. It's not. The hardware handles everything comfortably. What actually costs money is the LLM API. I pulled traces from [Opik](https://github.com/comet-ml/opik) and found 75 million input tokens across 100 traces. Context window size is the killer — long conversations with tool call history get expensive fast.
 
-**V8 memory limits in Docker will silently kill your agent.** OpenClaw runs on Node.js. Docker containers have a default memory limit, and Node's V8 heap will hit it and crash without a clear error. The fix is `NODE_OPTIONS=--max-old-space-size=4096` in your container environment. I lost a few hours to this before finding it.
+**V8 memory limits in Docker will silently kill your agent.** The agent framework runs on Node.js. Docker containers have a default memory limit, and Node's V8 heap will hit it and crash without a clear error. The fix is `NODE_OPTIONS=--max-old-space-size=4096` in your container environment. I lost a few hours to this before finding it.
 
 ## The Numbers
 
@@ -45,7 +45,7 @@ Rough monthly costs running this setup:
 
 - LLM API (Claude): ~$15–25/month depending on conversation volume
 - Electricity for TrueNAS: already running, marginal cost near zero
-- Everything else (Piper, Whisper, OpenClaw): $0
+- Everything else (Piper, Whisper, agent framework): $0
 
 The 75M input tokens I mentioned came from a period of heavy testing with long context windows. Normal usage is significantly lower. The lesson: be deliberate about what you include in the system prompt and conversation history. Every token in context costs money on both ends of the conversation.
 
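Note on the token figure in the post text: 75 million input tokens across 100 traces works out to an average of 750,000 input tokens per trace, which is why context size dominates the bill. The sketch below only does that arithmetic. The function names are hypothetical, and the price of 3.00 USD per million input tokens is a placeholder assumption, not a figure from the post or from any price list; it also ignores prompt caching and output tokens.

```python
def average_tokens_per_trace(total_tokens: int, traces: int) -> float:
    """Mean number of input tokens per trace."""
    if traces <= 0:
        raise ValueError("traces must be positive")
    return total_tokens / traces


def input_cost_usd(total_tokens: int, usd_per_million: float) -> float:
    """Input-token cost at a flat per-million rate (no caching discounts)."""
    return total_tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    TOTAL_INPUT_TOKENS = 75_000_000   # figure quoted in the post
    TRACES = 100                      # figure quoted in the post
    ASSUMED_USD_PER_MILLION = 3.00    # placeholder assumption, check real pricing

    print(average_tokens_per_trace(TOTAL_INPUT_TOKENS, TRACES))
    print(input_cost_usd(TOTAL_INPUT_TOKENS, ASSUMED_USD_PER_MILLION))
```

Swapping in the actual per-million rate for the model in use, and a month's real token count from the traces, gives a rough check against the monthly estimate in the post.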