Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
dd0c/run — Product Brief
AI-Powered Runbook Automation
Version: 1.0 | Date: 2026-02-28 | Author: Product Management | Status: Investor-Ready Draft
1. EXECUTIVE SUMMARY
Elevator Pitch
dd0c/run converts your team's existing runbooks — the stale Confluence pages, the Slack threads, the knowledge trapped in your senior engineer's head — into structured, executable workflows that guide on-call engineers through incidents step by step. Paste a runbook, get an intelligent copilot in under 5 seconds. No YAML. No configuration. No new DSL to learn.
This is the most safety-critical module in the dd0c platform. It touches production. We built it that way on purpose.
The Problem
The documentation-to-execution gap is killing engineering teams.
- The average on-call engineer spends 12+ minutes finding and interpreting the right runbook during a 3am incident. Cognitive function drops 30-40% during nighttime pages. Every minute of that search is a minute of downtime, a minute of cortisol, and a minute closer to burnout.
- 60% of runbooks are stale within 30 days of creation. Infrastructure changes, UIs get redesigned, scripts move repos. The runbook becomes a historical artifact that actively sabotages incident response.
- On-call burnout is the #1 reason SREs quit. Replacing a single engineer costs $150K+. The tooling that's supposed to help — Rundeck, PagerDuty Automation Actions, Shoreline — either requires weeks of setup, costs thousands per month, or demands a proprietary DSL that nobody has time to learn.
- SOC 2 and ISO 27001 require documented, auditable incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive a serious audit.
The industry has tools that route alerts (PagerDuty), tools that document incidents (Rootly, Incident.io), and tools that schedule jobs (Rundeck). Nobody owns the bridge between tribal knowledge and automated execution. The runbook sits in Confluence. The terminal sits on the engineer's laptop. The gap between them is where MTTR lives.
The Solution
dd0c/run is an AI-powered runbook engine that:
- Ingests runbooks from anywhere — paste raw text, connect Confluence/Notion, or point at a Git repo of markdown files.
- Parses prose into structured, executable steps with automatic risk classification (🟢 Safe / 🟡 Caution / 🔴 Dangerous) in under 5 seconds.
- Matches runbooks to incoming alerts via PagerDuty/OpsGenie webhooks, so the right procedure appears in the incident Slack channel before the engineer finishes rubbing their eyes.
- Guides execution in Copilot mode — auto-executing safe diagnostic steps, prompting for approval on state changes, blocking destructive actions without explicit confirmation.
- Learns from every execution — tracking which steps were skipped, modified, or added — and suggests runbook updates automatically. Runbooks get better with every incident instead of decaying.
Target Customer
Primary: Mid-market engineering teams (Series A through Series D startups, 10-100 engineers) with 5-15 SREs supporting high incident volume. They have existing runbooks in Confluence/Notion that they know are stale, they can't afford a dedicated SRE tooling team, and they're drowning in on-call.
Secondary: Startups approaching their first SOC 2 audit who need documented, auditable incident response procedures immediately.
Beachhead: Teams already using dd0c/cost or dd0c/alert. We've saved their budget and their sleep. Now we save their production environment.
Key Differentiators
- Zero-Configuration Intelligence. Paste a runbook. Get structured, risk-classified, executable steps in under 5 seconds. Rundeck requires Java, a database, and YAML definitions. Shoreline requires a proprietary DSL. We require a clipboard.
- The Trust Gradient. We don't ask teams to hand production to an AI on day one. dd0c/run starts in read-only suggestion mode. Trust is earned through data — 10 successful copilot runs with zero modifications before the system even suggests promotion to autopilot. Trust is earned in millimeters and lost in miles. We designed for that.
- The dd0c Ecosystem Flywheel. dd0c/alert identifies the incident pattern. dd0c/run provides the resolution. Execution telemetry feeds back into alert intelligence, training the matching engine. dd0c/portal provides service ownership context. dd0c/cost tracks the financial impact. The modules are 10x more valuable together than apart. No point solution can replicate this.
- The Resolution Pattern Database. Every skipped step, every manual override, every successful rollback is logged. We're building the industry's first database of what actually works in an incident — not what the runbook says, but what the engineer actually typed at 3:14 AM. This data moat compounds daily.
- Agent-Based Security Model. A lightweight, open-source Rust agent runs inside the customer's VPC. The SaaS never sees credentials. The agent is strictly outbound-only. No inbound firewall rules. No root AWS credentials. InfoSec teams can audit the binary. The execution boundary is the customer's perimeter, not ours.
The Safety Narrative
Let's be direct: this product executes actions in production environments. An LLM suggesting the wrong command at 3am to a sleep-deprived engineer is not a theoretical risk — it's the scenario that kills the company.
Our entire architecture is built around one principle: assume the AI wants to delete production. Constrain it accordingly.
- V1 is Copilot-only. No autonomous execution of state-changing commands. Period. The AI suggests. The human approves. The audit trail proves it.
- Risk classification is deterministic + AI. The LLM classifies steps, but a deterministic regex/AST scanner validates against known destructive patterns. The scanner overrides the LLM. Always.
- The agent has a command whitelist. Anything outside the whitelist is blocked at the agent level, regardless of what the SaaS sends. The agent is the last line of defense, and it doesn't trust the cloud.
- 🔴 Dangerous actions require typing the resource name to confirm. Not clicking a button. Typing. UI friction is a feature, not a bug.
- Every state-changing step records its rollback command. One-click undo at any point. The safety net that makes engineers brave enough to click "Approve."
- Kill condition: If beta testing shows the LLM misclassifies a destructive command as Safe (🟢) even once, or if the false-positive rate exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.
Trust is built incrementally: read-only diagnostics first, then copilot with approval gates, then — only after proven track records — selective autopilot for safe steps. The engineer always has the kill switch. The AI never insists.
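The "scanner overrides the LLM" rule above can be sketched in a few lines. The patterns below are illustrative placeholders, not the production pattern set, and the function names are assumptions for this sketch:

```python
import re

# Known destructive operations (illustrative, not exhaustive).
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bkubectl\s+delete\b"),
    re.compile(r"\brm\s+.*-(rf|fr)\b"),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
    re.compile(r"\bterraform\s+destroy\b"),
]

def classify(command: str, llm_label: str) -> str:
    """The LLM's label is advisory; the scanner's verdict is authoritative.
    Any destructive pattern match forces 'dangerous', regardless of
    what the model said."""
    if any(p.search(command) for p in DESTRUCTIVE_PATTERNS):
        return "dangerous"
    return llm_label

# An LLM misclassification cannot slip through the deterministic layer:
assert classify("kubectl delete deployment payment-svc", llm_label="safe") == "dangerous"
assert classify("kubectl get pods", llm_label="safe") == "safe"
```

The design choice worth noting: the scanner can only escalate risk, never downgrade it, so a false negative from the LLM is caught but a false positive from the scanner simply adds friction.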
2. MARKET OPPORTUNITY
The Documentation-to-Execution Gap
Every engineering team has some form of incident documentation. Confluence pages, Notion databases, Google Docs, Slack threads, senior engineers' brains. And every team has a terminal where incidents get resolved. The gap between those two things — the document and the execution — is where MTTR lives, where burnout festers, and where $12B+ in operational tooling spend fails to deliver.
The current market is segmented into tools that solve pieces of the incident lifecycle but leave the critical bridge unbuilt:
| Category | Players | What They Do | What They Don't Do |
|---|---|---|---|
| Alert Routing | PagerDuty, OpsGenie, Grafana OnCall | Route alerts to the right human | Help that human actually resolve the incident |
| Incident Management | Rootly, Incident.io, FireHydrant | Document the bureaucracy of the outage | Execute the resolution |
| Job Scheduling | Rundeck (OSS) | Run pre-defined jobs via YAML | Parse natural language, classify risk, learn from execution |
| Orchestration Platforms | Shoreline, Transposit | Execute complex remediation workflows | Onboard in under 5 minutes, work without a proprietary DSL |
| AIOps | BigPanda, Moogsoft | Cluster and correlate alerts | Bridge the gap from "we know what's wrong" to "here's how to fix it" |
Nobody owns the bridge from documentation to execution. That's the gap. That's the market.
Market Sizing
TAM (Total Addressable Market): $12B+
26 million software developers globally. ~20% involved in ops/on-call rotations (5.2 million). Average enterprise tooling spend of $200/month per engineer across incident management, AIOps, and automation tooling. The TAM encompasses the full operational tooling budget that dd0c/run competes for or displaces.
SAM (Serviceable Available Market): $1.5B
Focus on mid-market: startups to mid-size tech companies (Series A through Series D). These teams have the highest pain-to-budget ratio — 10-100 engineers, can't afford dedicated SRE tooling teams, can't justify Shoreline's enterprise pricing. Estimated 50,000 such companies globally, averaging 30 on-call engineers each (1.5 million target seats). At $1,000/year per seat equivalent, that's $1.5B.
SOM (Serviceable Obtainable Market): $15M (Year 3 Target)
1% penetration of SAM. 500 companies with average ARR of $30,000. This is a bootstrapped operation — the numbers must be defensible, not inflated for imaginary VCs.
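The sizing figures above follow directly from the stated assumptions; a quick arithmetic check:

```python
# TAM: on-call engineers × monthly tooling spend, annualized
developers = 26_000_000
on_call = developers * 0.20          # 5.2M engineers in on-call rotations
tam = on_call * 200 * 12             # ≈ $12.5B/year

# SAM: 50,000 mid-market companies × 30 seats × $1,000/seat-year
sam = 50_000 * 30 * 1_000            # $1.5B

# SOM: 1% of SAM ≈ 500 companies × $30K average ARR
som = 500 * 30_000                   # $15M

print(tam, sam, som)
```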
Competitive Landscape
Rundeck (Open Source / PagerDuty-owned)
- Strengths: Free, established, large community.
- Weaknesses: 2015-era UX. Requires Java, a database, YAML job definitions. It's a job scheduler masquerading as a runbook engine. Time-to-value is measured in days, not seconds.
- Our advantage: Zero-config AI parsing vs. manual YAML authoring. It's Notepad vs. VS Code — different products for different eras.
Transposit / Shoreline
- Strengths: Deep orchestration capabilities, enterprise customers.
- Weaknesses: Over-engineered for the 1% of orgs that have bandwidth to learn a proprietary DSL. They built jetpacks for people who are currently drowning. Pricing is enterprise-only.
- Our advantage: Paste-to-parse in 5 seconds. No DSL. Mid-market pricing. We meet teams where they are, not where Shoreline wishes they were.
Rootly / Incident.io / FireHydrant
- Strengths: Excellent incident management workflows, growing fast.
- Weaknesses: They document the fire; they don't hold the hose. They stop at the boundary of execution.
- Our advantage: We start where they stop. And with dd0c/alert integration, we own the full chain from detection to resolution.
PagerDuty Automation Actions
- Strengths: Distribution. Every on-call team already has PagerDuty.
- Weaknesses: Cynical upsell — thousands of dollars to automate the resolution of alerts they already charge you to receive. Locked to PagerDuty ecosystem. No runbook intelligence, just pre-defined script execution.
- Our advantage: Platform-agnostic (works with PagerDuty, OpsGenie, Grafana OnCall, any webhook). AI-native intelligence vs. dumb script execution. 10x cheaper.
The Real Threat: PagerDuty or Incident.io acquiring an AI runbook startup and bundling it into Enterprise tier. Mitigation: They will build it as a closed, proprietary upsell. They cannot integrate with dd0c/cost, dd0c/drift, or dd0c/alert. They will sell to the CIO; we sell to the on-call engineer at 3 AM. We win on the open ecosystem, cross-platform nature, and mid-market pricing.
Timing Thesis: Why 2026
Two years ago, this product was impossible. Three things changed:
- LLM Parsing Reliability. A 2024 model would hallucinate destructive commands or fail to parse implicit prerequisites. 2026 models can perform rigorous structural extraction and risk classification with the accuracy required for production-adjacent tooling. Context windows are large enough to digest a 50-page postmortem and distill the exact terminal commands that fixed it.
- Inference Economics. Inference latency is under 2 seconds. Costs have commoditized to the point where we can offer AI-powered parsing for $29/runbook/month, destroying the enterprise pricing models of incumbents who charge $50-100/seat/month for dumb automation.
- The Agentic Shift. The industry is transitioning from "human-in-the-loop" to "human-on-the-loop." Engineering teams are psychologically ready for AI-assisted operations in a way they weren't 18 months ago. The dread of the 3am pager now outweighs the skepticism of the AI — and that's the inflection point.
The window: If we don't build this in the next 12 months, PagerDuty, Incident.io, or a well-funded startup will. The documentation-to-execution gap is too obvious and too painful to remain unowned. First-mover advantage accrues to whoever builds the Resolution Pattern Database first — that data moat compounds daily and is nearly impossible to replicate.
3. PRODUCT DEFINITION
Value Proposition
For on-call engineers: Replace the 3am panic spiral — searching Confluence, interpreting stale docs, copy-pasting commands you don't understand — with a calm copilot that already knows what's wrong, already has the runbook, and walks you through it step by step.
For SRE managers: Replace vibes-based operational health with data. Know which services lack runbooks, which runbooks are stale, which steps get skipped, and what your actual MTTR is — with audit-ready compliance evidence generated automatically.
For senior engineers (the bus factor): Stop being the human runbook. Your expertise gets captured from your natural workflow — terminal sessions, Slack threads, incident resolutions — and lives on in the system even when you're on vacation, asleep, or gone.
One sentence: dd0c/run turns your team's scattered incident knowledge into a living, learning, executable system that makes every on-call engineer as effective as your best one.
Personas
| Persona | Name | Role | Primary Need | Key Metric |
|---|---|---|---|---|
| The On-Call Engineer | Riley | Mid-level SRE, 2 years exp, paged at 3am | Instantly know what to do without searching or guessing | Time-to-resolution, confidence during incidents |
| The SRE Manager | Jordan | Manages 8 SREs, owns incident response quality | Consistent, auditable, measurable incident response | MTTR trends, runbook coverage, compliance readiness |
| The Runbook Author | Morgan | Staff engineer, 6 years, carries institutional knowledge | Transfer expertise without the overhead of writing docs | Knowledge capture rate, runbook usage by others |
The Trust Gradient — The Core Design Principle
This is the architectural decision that defines dd0c/run. It is non-negotiable.
┌─────────────────────────────────────────────────────────────┐
│ THE TRUST GRADIENT │
│ │
│ READ-ONLY ──→ SUGGEST ──→ COPILOT ──→ AUTOPILOT │
│ │
│ "Show me "Here's what "I'll do it, "Handled. │
│ the steps" I'd do" you approve" Here's │
│ the log" │
│ │
│ V1 ◄──────────────────► │
│ (V1 scope: Read-Only + Suggest + Copilot for 🟢 only) │
│ │
│ ● Per-runbook setting (not global) │
│ ● Per-step override (🟢 auto, 🟡 prompt, 🔴 block) │
│ ● Earned through data (10 successful runs → suggest upgrade│
│ ● Instantly revocable (one bad run → auto-downgrade) │
│ ● The human always has the kill switch │
└─────────────────────────────────────────────────────────────┘
V1 is Read-Only + Suggest + Copilot for 🟢 Safe steps ONLY. The AI auto-executes read-only diagnostic commands (check logs, query metrics, list pods). All state-changing commands (🟡 Caution, 🔴 Dangerous) require explicit human approval. Full Autopilot mode is V2 — and only for runbooks with a proven track record.
This is not a limitation. This is the product. Trust earned through data is the moat that no competitor can shortcut.
Features by Release
V1: "Paste → Parse → Page → Pilot" (Months 4-6)
The minimum viable product. Four verbs. If you can't explain it in those four words, it's out of scope.
| Feature | Description | Priority |
|---|---|---|
| Paste & Parse | Copy-paste raw text from anywhere. AI structures it into numbered, executable steps with risk classification (🟢🟡🔴) in < 5 seconds. Zero configuration. | P0 — This IS the product |
| Risk Classification Engine | AI + deterministic scanner labels every step. LLM classifies intent; regex/AST scanner validates against known destructive patterns. Scanner overrides LLM. Always. | P0 — Trust foundation |
| Copilot Execution | Slack bot + web UI walks engineer through runbook step-by-step. Auto-executes 🟢 steps. Prompts for 🟡. Blocks 🔴 without explicit typed confirmation. | P0 — Core value prop |
| Alert-to-Runbook Matching | PagerDuty/OpsGenie webhook integration. Alert fires → dd0c/run matches to most relevant runbook via keyword + metadata + basic semantic similarity. Posts in incident Slack channel. | P0 — "The runbook finds you" |
| Alert Context Injection | Matched runbook arrives pre-populated: affected service, region, recent deploys, related metrics. No manual lookup. | P0 — 3am brain can't look things up |
| Rollback-Aware Execution | Every state-changing step records its inverse command. One-click undo at any point. | P0 — Safety net |
| Divergence Detection | Post-incident: AI compares what engineer did vs. what runbook prescribed. Flags skipped steps, modified commands, unlisted actions. | P1 — Learning loop |
| Auto-Update Suggestions | Generates runbook diffs from divergence data. "You skipped steps 6-8 in 4/4 runs. Remove them?" | P1 — Self-improving runbooks |
| Runbook Health Dashboard | Coverage %, average staleness, MTTR with/without runbook, step skip rates. Jordan's command center. | P1 — Management visibility |
| Compliance Export | PDF/CSV of timestamped execution logs, approval chains, audit trail. Not pretty, but functional. | P1 — SOC 2 readiness |
| Prerequisite Detection | AI identifies implicit requirements ("you need kubectl access", "make sure you're on the VPN") and surfaces pre-flight checklist. | P2 |
| Ambiguity Highlighter | AI flags vague steps ("check the logs" — which logs?) and prompts author to clarify before the runbook goes live. | P2 |
What V1 does NOT include: Terminal Watcher (V2), Full Autopilot (V2), Confluence/Notion crawlers (V2 — V1 is paste), Incident Simulator (V3), Runbook Marketplace (V3), Multi-cloud abstraction (V3), Voice-guided runbooks (V3).
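To give shape to the "keyword + metadata" half of alert-to-runbook matching, here is a toy scoring function. It deliberately omits the semantic-similarity component and uses invented field names; it shows the shape of the signal, not the production algorithm:

```python
def match_score(alert: dict, runbook: dict) -> float:
    """Keyword overlap between the alert summary and the runbook's
    title/tags, plus a bonus when service metadata agrees."""
    alert_words = set(alert["summary"].lower().split())
    rb_words = set(runbook["title"].lower().split()) | set(runbook.get("tags", []))
    overlap = len(alert_words & rb_words) / max(len(rb_words), 1)
    service_bonus = 0.5 if alert.get("service") == runbook.get("service") else 0.0
    return overlap + service_bonus

alert = {"summary": "payment-service latency > 5000ms", "service": "payment-svc"}
runbooks = [
    {"title": "Payment Service Latency", "tags": ["latency", "payment-service"],
     "service": "payment-svc"},
    {"title": "Redis Memory Pressure", "tags": ["redis", "memory"],
     "service": "cache"},
]
best = max(runbooks, key=lambda rb: match_score(alert, rb))
```

In production the score would also feed a confidence threshold (the journey below cites "92% confidence") so that weak matches are surfaced as suggestions rather than auto-posted.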
V2: "Watch → Learn → Predict → Protect" (Months 7-9)
| Feature | Description | Unlocks |
|---|---|---|
| Terminal Watcher | Opt-in agent captures commands during incidents. AI generates runbooks from real actions. | Solves cold-start. Captures Morgan's expertise passively. |
| Confluence/Notion Crawlers | Automated discovery and sync of runbook-tagged pages. | Bulk import for large teams with 100+ runbooks. |
| Full Autopilot Mode | Runbooks with 10+ successful copilot runs and zero modifications can promote 🟢 steps to autonomous execution. | "Sleep through the night" promise. |
| dd0c/alert Deep Integration | Multi-alert correlation, enriched context passing, bidirectional feedback loop. | The platform flywheel engages. |
| Infrastructure-Aware Staleness | Cross-reference steps against live Terraform state, K8s manifests, AWS resources via dd0c/portal. | Runbooks that know when they're lying. |
| Runbook Effectiveness ML Model | Predict runbook success probability based on alert context, time of day, engineer experience. | Data-driven trust promotion. |
V3: "Simulate → Train → Marketplace → Scale" (Months 10-12)
| Feature | Description | Unlocks |
|---|---|---|
| Incident Simulator / Fire Drills | Sandbox environment for practicing runbooks. Gamified with scores. | Viral growth. "My team's score is 94." |
| Voice-Guided Runbooks | AI reads steps aloud at 3am. Hands-free incident response. | Genuine differentiation nobody else has. |
| Runbook Marketplace | Community-contributed, anonymized templates. "How teams running EKS + RDS handle connection storms." | Network effect. Templates improve with every customer. |
| Predictive Runbook Staging | dd0c/alert detects anomaly trending toward incident → dd0c/run pre-stages runbook → on-call gets heads-up 30 min early. | The incident that never happens. |
User Journey: Riley's 3am Page
3:17 AM — Phone buzzes. PagerDuty: CRITICAL — payment-service latency > 5000ms
3:17 AM — dd0c/run webhook fires. Matches alert to "Payment Service Latency Runbook" (92% confidence).
3:17 AM — Slack bot posts in #incident-2847:
🔔 Runbook matched: Payment Service Latency
📊 Pre-filled: region=us-east-1, service=payment-svc, deploy=v2.4.1 (2h ago)
🟢🟡🔴 8 steps (4 safe, 3 caution, 1 danger)
[▶ Start Copilot]
3:18 AM — Riley taps Start Copilot. Steps 1-3 (🟢 Safe) auto-execute:
✅ Checked pod status — 2/5 pods in CrashLoopBackOff
✅ Pulled logs — 847 connection timeout errors in last 5 min
✅ Queried pg_stat_activity — 312 idle-in-transaction connections
3:19 AM — Step 4 (🟡 Caution): "Bounce connection pool — kubectl rollout restart"
⚠️ This will restart all pods. ~30s downtime.
↩️ Rollback: kubectl rollout undo ...
Riley taps [✅ Approve & Execute]
3:20 AM — Step 5 (🟢 Safe) auto-executes: Verify latency recovery.
✅ Latency recovered to baseline. All pods Running.
3:21 AM — ✅ Incident resolved. MTTR: 3m 47s.
📝 "You skipped steps 6-8. Also ran a command not in the runbook:
SELECT count(*) FROM pg_stat_activity"
Suggested updates: Remove steps 6-8, add DB connection check before step 4.
[✅ Apply Updates]
3:21 AM — Riley applies updates. Goes back to sleep. The cat didn't even wake up.
Previous MTTR for this incident type without dd0c/run: 38-45 minutes. With dd0c/run: under 4 minutes. That's the product.
Pricing
| Tier | Price | Includes |
|---|---|---|
| Free | $0 | 3 runbooks, read-along mode only (no execution), basic parsing |
| Pro | $29/runbook/month | Unlimited runbooks, copilot execution, Slack bot, PagerDuty/OpsGenie integration, basic dashboard, divergence detection |
| Business | $49/seat/month | Everything in Pro + autopilot mode (V2), API access, SSO, compliance export, audit trail, priority support |
Pricing rationale: The per-runbook model ($29/runbook/month) aligns cost with value — teams pay for the runbooks they actually use, not empty seats. A team with 10 active runbooks pays $290/month. As they add more runbooks and see MTTR drop, revenue grows with demonstrated value. The per-seat Business tier captures larger teams that want platform features (SSO, compliance, API).
Note from Party Mode: The VC advisor recommended switching to pure per-seat pricing for simplicity. This is a valid concern. We will A/B test per-runbook vs. per-seat during beta to determine which model drives faster adoption and lower churn. The per-runbook model has the advantage of a lower entry point and direct value alignment; the per-seat model has the advantage of predictability and simpler billing.
4. GO-TO-MARKET PLAN
Launch Strategy
dd0c/run is Phase 3 in the dd0c platform rollout (Months 4-6). It does not launch alone. It launches alongside dd0c/alert as the "On-Call Savior" bundle — because a runbook engine without alert intelligence is a document viewer, and alert intelligence without execution is a notification system. Together, they close the loop from detection to resolution.
Prerequisite: dd0c/cost and dd0c/route must be live and generating revenue (Phase 1, Months 1-3). These FinOps modules prove immediate, hard-dollar ROI. If we can't save a company money, we have no right to ask them to trust us with their production environment. The FinOps wedge buys the political capital for the operational wedge.
Beachhead: Teams Drowning in On-Call
The ideal early customer has three characteristics:
- High incident volume. 10+ pages per week across the team. They feel the pain daily.
- Existing runbooks that they know are stale. They've tried to document. They know it's broken. They're looking for a better way.
- No dedicated SRE tooling team. They can't afford to spend 3 months configuring Rundeck or learning Shoreline's DSL. They need something that works in 5 minutes.
This is the Series B/C startup with 5-15 SREs supporting 50-200 developers. They're big enough to have real infrastructure problems, small enough that every engineer feels the on-call burden personally.
Secondary beachhead: Compliance chasers — startups preparing for SOC 2 who need documented, auditable incident response procedures yesterday. We sell them the audit trail masquerading as an automation tool.
The dd0c/alert → dd0c/run Upsell Path
This is the primary growth engine for dd0c/run. The conversion funnel:
dd0c/cost user (saves money) → trusts the platform
│
▼
dd0c/alert user (reduces noise, sleeps better) → trusts the intelligence
│
▼
dd0c/alert fires an alert → Slack message includes:
"📋 A runbook exists for this alert pattern. Want dd0c/run to guide you?"
│
▼
Engineer clicks through → lands on Paste & Parse → 5-second wow moment
│
▼
dd0c/run user (resolves incidents faster) → trusts the execution
│
▼
dd0c/portal user (owns the full developer experience) → locked in
Every dd0c/alert notification becomes a dd0c/run acquisition channel. The upsell is embedded in the product, not in a sales email.
Growth Loops
Loop 1: The Parsing Flywheel (Product-Led) Engineer pastes runbook → AI parses in 5 seconds → "Wow" → pastes 5 more → invites teammate → teammate pastes theirs → team has 20 runbooks in a week → first incident uses copilot → MTTR drops → team is hooked.
Fuel: The 5-second parse moment must be so good that engineers paste runbooks for fun.
Loop 2: The Incident Evidence Loop (Manager-Led) Jordan sees MTTR data → shows leadership → "With dd0c/run: 6 minutes. Without: 38 minutes." → leadership asks "Why don't all teams use this?" → org-wide rollout → more teams = more runbooks = better matching = better MTTR.
Fuel: The MTTR comparison chart. One number that justifies the budget.
Loop 3: The Open-Source Wedge (Community-Led)
Release ddoc-parse — a free, open-source CLI that parses runbooks locally. No account needed. No SaaS. Engineers who love it self-select into the beta. Their runbooks (anonymized) improve the parsing model. The CLI gets better. More users. More conversions.
Fuel: A genuinely useful free tool, not a crippled demo.
Loop 4: The Knowledge Capture Loop (Retention) Morgan's expertise captured in dd0c/run → Morgan leaves → Riley handles incident using Morgan's captured knowledge → team realizes dd0c/run IS their institutional memory → switching cost becomes infinite → renewal is automatic.
Fuel: The "Ghost of Morgan" moment — the first time a junior resolves an incident using a runbook generated from a senior's session.
Content Strategy
Engineering-as-marketing. Developers use adblockers and hate salespeople. We don't sell to them. We teach them.
| Content | Channel | Purpose |
|---|---|---|
| "The Anatomy of a 3am Page" — blog post with real data on cognitive impairment during nighttime incidents | Blog, Hacker News, r/sre | Thought leadership. Establishes the problem before pitching the solution. |
| ddoc-parse open-source CLI | GitHub, Product Hunt | Free tool that demonstrates AI parsing quality. Acquisition funnel. |
| "Your Runbooks Are Lying to You" — analysis of runbook staleness rates across 100 teams | Blog, SRE Weekly newsletter | Data-driven content that managers share internally. |
| Conference lightning talks (SREcon, KubeCon, DevOpsDays) | In-person | 5-minute talk ending with beta signup QR code. |
| Incident postmortem outreach | Direct DM | Companies publishing postmortems are self-selecting. "I read your Redis incident writeup. We're building something that would have cut your MTTR in half." |
| Pre-seeded runbook templates (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure) | In-product, GitHub | Solve the cold-start problem. Demonstrate value before the user pastes anything. |
90-Day Launch Timeline
| Day | Milestone |
|---|---|
| 1-14 | Private alpha with 5 hand-picked teams from dd0c/cost user base. Paste & Parse + basic Copilot in Slack. Gather parsing quality feedback. |
| 15-30 | Iterate on parsing accuracy based on alpha feedback. Add PagerDuty webhook integration. Add risk classification validation (deterministic scanner). Ship divergence detection. |
| 31-45 | Expand to 15-20 beta teams. Launch ddoc-parse open-source CLI. Begin collecting MTTR data. Add health dashboard for Jordan persona. |
| 46-60 | Beta teams running in production. First MTTR comparison data available. Begin compliance export feature. Publish "Anatomy of a 3am Page" blog post. |
| 61-75 | Refine based on beta feedback. A/B test pricing models (per-runbook vs. per-seat). Secure 3+ case study commitments. Ship dd0c/alert integration (webhook-based). |
| 76-90 | Public launch. Product Hunt launch. Hacker News "Show HN" post. Conference talk submissions. Convert beta teams to paid. Target: 50 teams with ≥ 5 active runbooks. |
5. BUSINESS MODEL
Revenue Model
Primary: $29/runbook/month (Pro tier) or $49/seat/month (Business tier).
A team with 10 active runbooks on Pro pays $290/month. A team of 8 SREs on Business pays $392/month. Revenue scales with demonstrated value — more runbooks means more incidents resolved faster, which means higher willingness to pay.
Secondary: The dd0c platform bundle. Teams using dd0c/cost + dd0c/alert + dd0c/run together represent an average deal size of $500-800/month. The platform is stickier than any individual module.
Unit Economics
| Metric | Value | Notes |
|---|---|---|
| COGS per runbook/month | ~$3-5 | LLM inference (via dd0c/route, optimized model selection), compute for Rust API + agent coordination, PostgreSQL storage. Parsing is a one-time cost per runbook; execution inference is per-incident. |
| Gross Margin | ~85% | SaaS-standard. The Rust stack keeps infrastructure costs low. LLM costs are the primary variable, managed by dd0c/route. |
| CAC (Target) | < $500 | Product-led growth via ddoc-parse CLI + dd0c/alert upsell. No outbound sales team. Content marketing + community seeding. |
| LTV (Target) | > $5,000 | 18+ month retention (institutional memory lock-in). Average $290/month × 18 months = $5,220. |
| LTV:CAC Ratio | > 10:1 | Healthy for bootstrapped SaaS. The dd0c/alert upsell path has near-zero incremental CAC. |
| Payback Period | < 2 months | At $290/month with $500 CAC, payback in ~1.7 months. |
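The LTV and payback targets in the table follow from the stated assumptions; a quick check:

```python
price = 290           # $/month: 10 Pro runbooks × $29
retention = 18        # months (institutional-memory lock-in)
cac = 500             # $ target

ltv = price * retention         # $5,220
payback_months = cac / price    # ≈ 1.7 months

print(ltv, round(payback_months, 1), round(ltv / cac, 1))
```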
Path to Revenue Milestones
$10K MRR (Month 8 — 4 months post-launch)
- 35 Pro teams × 10 runbooks × $29 = $10,150/month
- Source: Beta conversions (15-20 teams) + dd0c/alert upsell (10-15 teams) + organic from ddoc-parse CLI (5-10 teams)
- Key assumption: 70% beta-to-paid conversion rate
$50K MRR (Month 14 — 10 months post-launch)
- 120 Pro teams ($34,800) + 30 Business teams ($14,700) = $49,500/month
- Source: Platform flywheel engaged. dd0c/alert → dd0c/run conversion running at 25%. Community templates driving organic acquisition. First conference talks generating inbound.
- Key assumption: < 5% monthly churn (institutional memory lock-in)
$100K MRR (Month 20 — 16 months post-launch)
- 200 Pro teams ($58,000) + 80 Business teams ($39,200) + 5 custom enterprise ($10,000) = $107,200/month
- Source: Runbook Marketplace (V3) creating network effects. Multi-team deployments within companies. SOC 2 compliance driving Business tier upgrades.
- Key assumption: Average expansion revenue of 30% (teams add runbooks and seats over time)
Solo Founder Constraints
This is the hardest product in the dd0c lineup to support as a solo founder. The reasons are structural:
- Production safety liability. If dd0c/run contributes to a production outage, the reputational damage extends to the entire dd0c brand. There is no "move fast and break things" with a product that touches production. Every release must be paranoid.
- Support burden. When a customer's weird custom Kubernetes setup doesn't play nice with the Rust agent, that's a high-urgency, high-complexity support ticket at 3am. Unlike dd0c/cost (where a bug means a wrong number on a dashboard), a dd0c/run bug means a failed incident response.
- Security surface area. The VPC agent is open-source and auditable, but it's still a binary running inside customer infrastructure. A CVE in the agent is an existential event. Security reviews from enterprise customers will be thorough and time-consuming.
Mitigations:
- Shared platform architecture. One API gateway, one auth layer, one billing system, one OpenTelemetry ingest pipeline across all dd0c modules. If you build six separate data models, you burn out in 14 months.
- V1 scope discipline. Copilot-only. No Autopilot. No Terminal Watcher. No crawlers. The smaller the surface area, the smaller the support burden.
- Community-driven templates. Pre-seed 50 high-quality templates for standard infrastructure. Let the community maintain and improve them. Reduce the "my setup is unique" support tickets.
- Aggressive kill criteria. If the support burden exceeds 10 hours/week within the first 3 months, re-evaluate the agent architecture. Consider a managed-execution model where the SaaS handles execution via customer-provided cloud credentials (higher trust barrier, lower support burden).
6. RISKS & MITIGATIONS
Risk 1: LLM Hallucination Causes a Production Outage
Severity: 10/10 — Extinction-level event for the company. Probability: Medium (with mitigations), High (without).
The scenario: The runbook says "Restart the pod." The LLM hallucinates and outputs kubectl delete deployment instead, classifying it as 🟢 Safe. The tired engineer clicks approve. Production goes down. The customer cancels. dd0c goes bankrupt from reputational damage.
Mitigations:
- Deterministic regex/AST scanner validates every command against known destructive patterns. The scanner overrides the LLM classification. Always. The LLM is advisory; the scanner is authoritative.
- Agent-level command whitelist. Anything outside the whitelist is blocked at the agent, regardless of what the SaaS sends. Defense in depth.
- 🔴 Dangerous actions require typing the resource name to confirm. Not clicking a button. UI friction is a feature.
- Every state-changing step records its rollback command. One-click undo.
- V1 limits auto-execution to 🟢 Safe (read-only diagnostic) commands only. State changes always require human approval.
- Comprehensive logging of every command suggested, approved, executed, and rolled back. Full audit trail.
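The scanner-overrides-LLM rule can be sketched in a few lines. The pattern list and function names here are illustrative, not the shipped scanner:

```python
import re

# Illustrative destructive-command patterns; the real list would be far
# larger and maintained alongside the agent whitelist.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bkubectl\s+delete\b",
    r"\bdrop\s+(table|database)\b",
    r"\bterraform\s+destroy\b",
]

def classify(command: str, llm_label: str) -> str:
    """The deterministic scanner is authoritative: any destructive match
    forces DANGEROUS, regardless of the (advisory) LLM label."""
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return "DANGEROUS"
    return llm_label  # scanner found nothing; fall back to the LLM

# A hallucinated "Safe" label on a destructive command is overridden:
print(classify("kubectl delete deployment api", "SAFE"))  # DANGEROUS
print(classify("kubectl get pods -n prod", "SAFE"))       # SAFE
```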
Kill condition: If the LLM misclassifies a destructive command as Safe (🟢) even once during beta, or if the false-positive rate for safe commands exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.
Risk 2: PagerDuty / Incident.io Ships Native AI Runbook Automation
Severity: 8/10. Probability: High — they will build something. The question is when and how good.
The scenario: PagerDuty acquires a small AI runbook startup and bundles it into their Enterprise tier for "free" (subsidized by their massive license cost), choking off dd0c/run's distribution.
Mitigations:
- Platform agnosticism. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, and any webhook source. PagerDuty's automation is locked to PagerDuty.
- Cross-module data advantage. PagerDuty can't integrate with dd0c/cost anomalies, dd0c/drift detection, or dd0c/portal service ownership. We have the context. They have the routing rules.
- Mid-market pricing. PagerDuty's automation is an enterprise upsell ($$$). We sell to the on-call engineer at 3am for $29/runbook.
- The Resolution Pattern Database. PagerDuty keeps data siloed per enterprise. We anonymize and share the "ideal" runbooks across the mid-market. Network effect they can't replicate without cannibalizing their enterprise model.
Pivot option: If PagerDuty ships a compelling native solution, double down on the dd0c ecosystem play. dd0c/run becomes the execution arm of dd0c/alert + dd0c/drift + dd0c/portal — a tightly coupled platform that no single-feature bolt-on can touch.
Risk 3: Teams Don't Have Documented Runbooks (Cold Start Problem)
Severity: 7/10. Probability: High — many teams have zero runbooks.
The scenario: A prospect signs up, goes to the "Paste Runbook" screen, and realizes they have nothing to paste. Churn happens in 60 seconds.
Mitigations:
- Pre-seed the platform with 50 high-quality templates for standard infrastructure (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure, cert expiry, etc.). New users see value before they paste anything.
- Slack Thread Distiller (V1): Paste a Slack thread URL from a past incident. AI extracts the resolution commands and generates a draft runbook. If they have incidents, they have Slack threads.
- Postmortem-to-Runbook Pipeline (V1): Feed in a postmortem doc. AI extracts "what we did to fix it" and generates a structured runbook.
- Terminal Watcher (V2): Captures commands during live incidents and generates runbooks automatically.
- Shift marketing from "Automate your runbooks" to "Generate your missing runbooks." The product creates runbooks, not just executes them.
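A hypothetical first pass at the Slack Thread Distiller's command extraction, assuming engineers paste commands into threads with a leading "$" (the heuristic and names are illustrative; the real feature would hand these candidates to the LLM for step generation):

```python
import re

# Hypothetical pre-filter: pull likely shell commands out of an incident
# thread before the LLM drafts runbook steps. Prefix list is illustrative.
COMMAND_PREFIXES = ("kubectl", "aws", "redis-cli", "psql", "systemctl", "curl")

def extract_commands(thread_text: str) -> list[str]:
    # Match "$ command ..." anywhere in a message line, then keep only
    # candidates that start with a known tool name.
    candidates = re.findall(r"\$\s*(\S.*)", thread_text)
    return [c.strip() for c in candidates if c.startswith(COMMAND_PREFIXES)]

thread = """\
alice: pods are crashlooping again
bob: $ kubectl get pods -n checkout
bob: ok, restarting
bob: $ kubectl rollout restart deployment/checkout
alice: back to green, thanks
"""
print(extract_commands(thread))
```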
Risk 4: The "Agentic AI" Obsolescence Event
Severity: High. Probability: Low (in the next 3 years).
The scenario: Autonomous AI agents (Devin, GitHub Copilot Workspace, Pulumi Neo) can detect and fix infrastructure issues without human intervention. Who needs runbooks?
Mitigations:
- Runbooks become the "policy" that defines what the agent should do. They're the bridge between human intent and agent execution. We pivot from "human automation" to "agent policy management."
- Position dd0c/run as the control plane for agentic operations — the system that defines, constrains, and audits what AI agents are allowed to do in production.
- The Trust Gradient already models this transition: Read-Only → Copilot → Autopilot is the same spectrum as Human-Driven → Human-on-the-Loop → Agent-Driven.
Risk 5: Solo Founder Scaling / The Bus Factor
Severity: High. Probability: High — Brian is building six products.
The scenario: The support burden of a production-safety-critical product overwhelms a solo founder. A critical bug in the VPC agent requires immediate response at 3am. Brian burns out.
Mitigations:
- Shared platform architecture reduces per-module engineering overhead by 60%+.
- V1 scope discipline: Copilot-only, no Autopilot, no crawlers. Smallest possible surface area.
- Open-source the Rust agent. Community contributions for edge-case Kubernetes configurations. Community security audits.
- Aggressive automation of support: self-healing agent updates, comprehensive error messages, in-product diagnostics.
- If dd0c/run reaches $50K MRR, hire a dedicated SRE for agent support. This is the first hire, non-negotiable.
The Catastrophic Scenario and How to Prevent It
The nightmare: dd0c/run's AI suggests a destructive command. A sleep-deprived engineer approves it. Production goes down for a major customer. The incident gets posted on Hacker News. The dd0c brand — across all six modules — is destroyed. Not just dd0c/run. Everything.
Prevention (defense in depth):
- Layer 1 — LLM Classification: AI labels every step with risk level. This is the first pass, and it's the least trusted.
- Layer 2 — Deterministic Scanner: Regex/AST pattern matching against known destructive commands (delete, drop, rm -rf, kubectl delete namespace, etc.). Overrides LLM. Catches hallucinations.
- Layer 3 — Agent Whitelist: The Rust agent maintains a local whitelist of allowed command patterns. Anything not on the whitelist is rejected at the agent level, regardless of what the SaaS sends. The agent doesn't trust the cloud.
- Layer 4 — UI Friction: 🟡 commands require click-to-approve. 🔴 commands require typing the resource name. No "approve all" button. Ever.
- Layer 5 — Rollback Recording: Every state-changing command has a recorded inverse. One-click undo. The safety net.
- Layer 6 — Audit Trail: Every command suggested, approved, modified, executed, and rolled back is logged with timestamps, user identity, and alert context. Full forensic capability.
If all six layers fail simultaneously, the product deserves to die. But they won't fail simultaneously — that's the point of defense in depth.
7. SUCCESS METRICS
North Star Metric
Incidents resolved via dd0c/run copilot per month.
This single metric captures adoption (teams are using it), trust (engineers choose copilot over skipping), and value (incidents are actually getting resolved). If this number grows, everything else follows.
Leading Indicators
| Metric | Target (Month 6) | Why It Matters |
|---|---|---|
| Time-to-First-Runbook | < 5 minutes | If onboarding is slow, nobody reaches the value. The Vercel test. |
| Paste & Parse success rate | > 90% | If parsing fails or requires heavy manual editing, the magic moment is broken. |
| Copilot adoption rate | ≥ 60% of matched incidents | If engineers bypass copilot, the product isn't trusted or isn't useful. |
| Risk classification accuracy | > 99.9% (zero false-safe on destructive commands) | The safety foundation. One misclassification and we're done. |
| Weekly active runbooks per team | ≥ 5 | The product is alive, not shelfware. |
| Runbook update acceptance rate | ≥ 30% of suggestions | The learning loop is working. Runbooks are improving. |
Lagging Indicators
| Metric | Target (Month 6) | Why It Matters |
|---|---|---|
| MTTR reduction | ≥ 40% vs. baseline | The headline number. "Teams using dd0c/run resolve incidents 40% faster." |
| NPS from on-call engineers | > 50 | Riley actually likes this. Not just tolerates it. |
| Monthly churn | < 5% | Institutional memory lock-in is working. |
| Expansion revenue | > 20% | Teams adding runbooks and seats over time. |
| Zero safety incidents | 0 | dd0c/run never made an incident worse. Non-negotiable. |
30/60/90 Day Milestones
Day 30: Prove the Parse
- 15-20 beta teams onboarded
- Paste & Parse working with > 90% accuracy across diverse runbook formats
- PagerDuty webhook integration live
- Risk classification validated: zero false-safe misclassifications on destructive commands
- First MTTR data points collected
- Success criteria: Engineers say "wow" when they paste their first runbook
Day 60: Prove the Pilot
- Beta teams running copilot in production incidents
- MTTR reduction ≥ 30% for at least 8 teams
- Divergence detection generating useful runbook update suggestions
- Health dashboard live for Jordan persona
- dd0c/alert webhook integration functional
- ddoc-parse open-source CLI launched
- Success criteria: At least one engineer says "I actually slept through the night because dd0c/run handled the diagnostics"
Day 90: Prove the Business
- 50 teams with ≥ 5 active runbooks
- MTTR reduction ≥ 40% for at least 12 teams
- 3+ teams committed as named case studies
- Pricing model validated (per-runbook vs. per-seat A/B test complete)
- Zero safety incidents across all beta teams
- Public launch executed (Product Hunt, Hacker News, conference submissions)
- $10K MRR trajectory confirmed
- Success criteria: Beta-to-paid conversion rate ≥ 70%
Kill Criteria
The product is killed or fundamentally re-architected if any of the following occur:
- Safety failure. The LLM misclassifies a destructive command as Safe (🟢) during beta. Even once.
- Trust failure. Engineers skip copilot mode > 50% of the time after 30 days. The product isn't trusted.
- Parse failure. Paste & Parse accuracy stays below 80% after 60 days of iteration. The core AI capability doesn't work.
- Adoption failure. Fewer than 8 beta teams active after 4 weeks. The problem isn't painful enough or the solution isn't compelling enough.
- MTTR failure. MTTR reduction < 20% or inconsistent across teams after 60 days. The product doesn't deliver measurable value.
If we hit a kill criterion, the pivot options are:
- Pivot to read-only intelligence: Strip execution entirely. Become the "runbook quality platform" — parsing, staleness detection, coverage dashboards, compliance evidence. Lower risk, lower value, but viable.
- Pivot to agent policy management: If agentic AI arrives faster than expected, position dd0c/run as the policy layer that defines what AI agents are allowed to do in production.
- Absorb into dd0c/portal: The Contrarian from Party Mode was right about one thing — if dd0c/run can't stand alone, it becomes a feature of the IDP, not a product.
APPENDIX: RESOLVED CONTRADICTIONS ACROSS PHASES
| Contradiction | Brainstorm Position | Party Mode Position | Resolution |
|---|---|---|---|
| Standalone product vs. portal feature | Standalone with ecosystem integration | Contrarian argued it's a portal feature, not a product | Standalone with kill-criteria pivot to portal. Launch as standalone to test market demand. If adoption fails, absorb into dd0c/portal as a feature. |
| Per-runbook vs. per-seat pricing | $29/runbook/month | VC advisor recommended per-seat for simplicity | A/B test during beta. Per-runbook aligns cost with value; per-seat is simpler. Let data decide. |
| V1 execution scope | Full copilot with 🟢🟡🔴 approval gates | CTO demanded no execution until deterministic validation exists; Bootstrap Founder said copilot-only | V1 auto-executes 🟢 only. 🟡🔴 require human approval. Deterministic scanner overrides LLM. Synthesizes both positions. |
| Confluence/Notion crawlers in V1 | Design Thinking included crawlers as V1 | Innovation Strategy said "do not build crawlers; force the user to paste" | Paste-only in V1. Crawlers are V2. Solo founder can't maintain integration APIs for V1. Paste is the 5-second wow moment anyway. |
| Cold start solution | Slack Thread Scraper in V1 | Terminal Watcher in V1 | Slack Thread Distiller in V1. Terminal Watcher deferred to V2. Slack threads require no agent installation (lower trust barrier). Terminal Watcher requires an agent on the engineer's machine — too much friction for V1. |
This brief synthesizes insights from four prior development phases: Brainstorm (Carson, Venture Architect), Design Thinking (Maya, Design Maestro), Innovation Strategy (Victor, Disruption Oracle), and Party Mode Advisory Board (5-person expert panel). All contradictions have been identified and resolved with rationale.
dd0c/run is the most safety-critical module in the dd0c platform. This brief reflects that gravity. Build it paranoid. Assume the AI wants to delete production. Constrain it accordingly. Then ship it — because the 3am pager isn't going to fix itself.