Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
dd0c/run — Product Brief
AI-Powered Runbook Automation
Version: 1.0 | Date: 2026-02-28 | Author: Product Management | Status: Investor-Ready Draft
1. EXECUTIVE SUMMARY
Elevator Pitch
dd0c/run converts your team's existing runbooks — the stale Confluence pages, the Slack threads, the knowledge trapped in your senior engineer's head — into structured, executable workflows that guide on-call engineers through incidents step by step. Paste a runbook, get an intelligent copilot in under 5 seconds. No YAML. No configuration. No new DSL to learn.
This is the most safety-critical module in the dd0c platform. It touches production. We built it that way on purpose.
The Problem
The documentation-to-execution gap is killing engineering teams.
- The average on-call engineer spends 12+ minutes finding and interpreting the right runbook during a 3am incident. Cognitive function drops 30-40% during nighttime pages. Every minute of that search is a minute of downtime, a minute of cortisol, and a minute closer to burnout.
- 60% of runbooks are stale within 30 days of creation. Infrastructure changes, UIs get redesigned, scripts move repos. The runbook becomes a historical artifact that actively sabotages incident response.
- On-call burnout is the #1 reason SREs quit. Replacing a single engineer costs $150K+. The tooling that's supposed to help — Rundeck, PagerDuty Automation Actions, Shoreline — either requires weeks of setup, costs thousands per month, or demands a proprietary DSL that nobody has time to learn.
- SOC 2 and ISO 27001 require documented, auditable incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive a serious audit.
The industry has tools that route alerts (PagerDuty), tools that document incidents (Rootly, Incident.io), and tools that schedule jobs (Rundeck). Nobody owns the bridge between tribal knowledge and automated execution. The runbook sits in Confluence. The terminal sits on the engineer's laptop. The gap between them is where MTTR lives.
The Solution
dd0c/run is an AI-powered runbook engine that:
- Ingests runbooks from anywhere — paste raw text, connect Confluence/Notion, or point at a Git repo of markdown files.
- Parses prose into structured, executable steps with automatic risk classification (🟢 Safe / 🟡 Caution / 🔴 Dangerous) in under 5 seconds.
- Matches runbooks to incoming alerts via PagerDuty/OpsGenie webhooks, so the right procedure appears in the incident Slack channel before the engineer finishes rubbing their eyes.
- Guides execution in Copilot mode — auto-executing safe diagnostic steps, prompting for approval on state changes, blocking destructive actions without explicit confirmation.
- Learns from every execution — tracking which steps were skipped, modified, or added — and suggests runbook updates automatically. Runbooks get better with every incident instead of decaying.
Target Customer
Primary: Mid-market engineering teams (Series A through Series D startups, 10-100 engineers) with 5-15 SREs supporting high incident volume. They have existing runbooks in Confluence/Notion that they know are stale, they can't afford a dedicated SRE tooling team, and they're drowning in on-call.
Secondary: Startups approaching their first SOC 2 audit who need documented, auditable incident response procedures immediately.
Beachhead: Teams already using dd0c/cost or dd0c/alert. We've saved their budget and their sleep. Now we save their production environment.
Key Differentiators
- Zero-Configuration Intelligence. Paste a runbook. Get structured, risk-classified, executable steps in under 5 seconds. Rundeck requires Java, a database, and YAML definitions. Shoreline requires a proprietary DSL. We require a clipboard.
- The Trust Gradient. We don't ask teams to hand production to an AI on day one. dd0c/run starts in read-only suggestion mode. Trust is earned through data — 10 successful copilot runs with zero modifications before the system even suggests promotion to autopilot. Trust is earned in millimeters and lost in miles. We designed for that.
- The dd0c Ecosystem Flywheel. dd0c/alert identifies the incident pattern. dd0c/run provides the resolution. Execution telemetry feeds back into alert intelligence, training the matching engine. dd0c/portal provides service ownership context. dd0c/cost tracks the financial impact. The modules are 10x more valuable together than apart. No point solution can replicate this.
- The Resolution Pattern Database. Every skipped step, every manual override, every successful rollback is logged. We're building the industry's first database of what actually works in an incident — not what the runbook says, but what the engineer actually typed at 3:14 AM. This data moat compounds daily.
- Agent-Based Security Model. A lightweight, open-source Rust agent runs inside the customer's VPC. The SaaS never sees credentials. The agent is strictly outbound-only. No inbound firewall rules. No root AWS credentials. InfoSec teams can audit the binary. The execution boundary is the customer's perimeter, not ours.
The Safety Narrative
Let's be direct: this product executes actions in production environments. An LLM suggesting the wrong command at 3am to a sleep-deprived engineer is not a theoretical risk — it's the scenario that kills the company.
Our entire architecture is built around one principle: assume the AI wants to delete production. Constrain it accordingly.
- V1 is Copilot-only. No autonomous execution of state-changing commands. Period. The AI suggests. The human approves. The audit trail proves it.
- Risk classification is deterministic + AI. The LLM classifies steps, but a deterministic regex/AST scanner validates against known destructive patterns. The scanner overrides the LLM. Always.
- The agent has a command whitelist. Anything outside the whitelist is blocked at the agent level, regardless of what the SaaS sends. The agent is the last line of defense, and it doesn't trust the cloud.
- 🔴 Dangerous actions require typing the resource name to confirm. Not clicking a button. Typing. UI friction is a feature, not a bug.
- Every state-changing step records its rollback command. One-click undo at any point. The safety net that makes engineers brave enough to click "Approve."
- Kill condition: If beta testing shows the LLM misclassifies a destructive command as Safe (🟢) even once, or if the false-positive rate exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.
Trust is built incrementally: read-only diagnostics first, then copilot with approval gates, then — only after proven track records — selective autopilot for safe steps. The engineer always has the kill switch. The AI never insists.
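The "scanner overrides the LLM" rule above can be sketched in a few lines. The patterns below are illustrative placeholders, not the production pattern set, and the function names are assumptions for this sketch:

```python
import re

# Known destructive operations (illustrative, not exhaustive).
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bkubectl\s+delete\b"),
    re.compile(r"\brm\s+.*-(rf|fr)\b"),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
    re.compile(r"\bterraform\s+destroy\b"),
]

def classify(command: str, llm_label: str) -> str:
    """The LLM's label is advisory; the scanner's verdict is authoritative.
    Any destructive pattern match forces 'dangerous', regardless of
    what the model said."""
    if any(p.search(command) for p in DESTRUCTIVE_PATTERNS):
        return "dangerous"
    return llm_label

# An LLM misclassification cannot slip through the deterministic layer:
assert classify("kubectl delete deployment payment-svc", llm_label="safe") == "dangerous"
assert classify("kubectl get pods", llm_label="safe") == "safe"
```

The design choice worth noting: the scanner can only escalate risk, never downgrade it, so a false negative from the LLM is caught but a false positive from the scanner simply adds friction.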
2. MARKET OPPORTUNITY
The Documentation-to-Execution Gap
Every engineering team has some form of incident documentation. Confluence pages, Notion databases, Google Docs, Slack threads, senior engineers' brains. And every team has a terminal where incidents get resolved. The gap between those two things — the document and the execution — is where MTTR lives, where burnout festers, and where $12B+ in operational tooling spend fails to deliver.
The current market is segmented into tools that solve pieces of the incident lifecycle but leave the critical bridge unbuilt:
| Category | Players | What They Do | What They Don't Do |
|---|---|---|---|
| Alert Routing | PagerDuty, OpsGenie, Grafana OnCall | Route alerts to the right human | Help that human actually resolve the incident |
| Incident Management | Rootly, Incident.io, FireHydrant | Document the bureaucracy of the outage | Execute the resolution |
| Job Scheduling | Rundeck (OSS) | Run pre-defined jobs via YAML | Parse natural language, classify risk, learn from execution |
| Orchestration Platforms | Shoreline, Transposit | Execute complex remediation workflows | Onboard in under 5 minutes, work without a proprietary DSL |
| AIOps | BigPanda, Moogsoft | Cluster and correlate alerts | Bridge the gap from "we know what's wrong" to "here's how to fix it" |
Nobody owns the bridge from documentation to execution. That's the gap. That's the market.
Market Sizing
TAM (Total Addressable Market): $12B+
26 million software developers globally. ~20% involved in ops/on-call rotations (5.2 million). Average enterprise tooling spend of $200/month per engineer across incident management, AIOps, and automation tooling. The TAM encompasses the full operational tooling budget that dd0c/run competes for or displaces.
SAM (Serviceable Available Market): $1.5B
Focus on mid-market: startups to mid-size tech companies (Series A through Series D). These teams have the highest pain-to-budget ratio — 10-100 engineers, can't afford dedicated SRE tooling teams, can't justify Shoreline's enterprise pricing. Estimated 50,000 such companies globally, averaging 30 on-call engineers each (1.5 million target seats). At $1,000/year per seat equivalent, that's $1.5B.
SOM (Serviceable Obtainable Market): $15M (Year 3 Target)
1% penetration of SAM. 500 companies with average ARR of $30,000. This is a bootstrapped operation — the numbers must be defensible, not inflated for imaginary VCs.
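The sizing figures above follow directly from the stated assumptions; a quick arithmetic check:

```python
# TAM: on-call engineers × monthly tooling spend, annualized
developers = 26_000_000
on_call = developers * 0.20          # 5.2M engineers in on-call rotations
tam = on_call * 200 * 12             # ≈ $12.5B/year

# SAM: 50,000 mid-market companies × 30 seats × $1,000/seat-year
sam = 50_000 * 30 * 1_000            # $1.5B

# SOM: 1% of SAM ≈ 500 companies × $30K average ARR
som = 500 * 30_000                   # $15M

print(tam, sam, som)
```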
Competitive Landscape
Rundeck (Open Source / PagerDuty-owned)
- Strengths: Free, established, large community.
- Weaknesses: 2015-era UX. Requires Java, a database, YAML job definitions. It's a job scheduler masquerading as a runbook engine. Time-to-value is measured in days, not seconds.
- Our advantage: Zero-config AI parsing vs. manual YAML authoring. It's Notepad vs. VS Code — different products for different eras.
Transposit / Shoreline
- Strengths: Deep orchestration capabilities, enterprise customers.
- Weaknesses: Over-engineered for the 1% of orgs that have bandwidth to learn a proprietary DSL. They built jetpacks for people who are currently drowning. Pricing is enterprise-only.
- Our advantage: Paste-to-parse in 5 seconds. No DSL. Mid-market pricing. We meet teams where they are, not where Shoreline wishes they were.
Rootly / Incident.io / FireHydrant
- Strengths: Excellent incident management workflows, growing fast.
- Weaknesses: They document the fire; they don't hold the hose. They stop at the boundary of execution.
- Our advantage: We start where they stop. And with dd0c/alert integration, we own the full chain from detection to resolution.
PagerDuty Automation Actions
- Strengths: Distribution. Every on-call team already has PagerDuty.
- Weaknesses: Cynical upsell — thousands of dollars to automate the resolution of alerts they already charge you to receive. Locked to PagerDuty ecosystem. No runbook intelligence, just pre-defined script execution.
- Our advantage: Platform-agnostic (works with PagerDuty, OpsGenie, Grafana OnCall, any webhook). AI-native intelligence vs. dumb script execution. 10x cheaper.
The Real Threat: PagerDuty or Incident.io acquiring an AI runbook startup and bundling it into Enterprise tier. Mitigation: They will build it as a closed, proprietary upsell. They cannot integrate with dd0c/cost, dd0c/drift, or dd0c/alert. They will sell to the CIO; we sell to the on-call engineer at 3 AM. We win on the open ecosystem, cross-platform nature, and mid-market pricing.
Timing Thesis: Why 2026
Two years ago, this product was impossible. Three things changed:
- LLM Parsing Reliability. A 2024 model would hallucinate destructive commands or fail to parse implicit prerequisites. 2026 models can perform rigorous structural extraction and risk classification with the accuracy required for production-adjacent tooling. Context windows are large enough to digest a 50-page postmortem and distill the exact terminal commands that fixed it.
- Inference Economics. Inference latency is under 2 seconds. Costs have commoditized to the point where we can offer AI-powered parsing for $29/runbook/month, destroying the enterprise pricing models of incumbents who charge $50-100/seat/month for dumb automation.
- The Agentic Shift. The industry is transitioning from "human-in-the-loop" to "human-on-the-loop." Engineering teams are psychologically ready for AI-assisted operations in a way they weren't 18 months ago. The dread of the 3am pager now outweighs the skepticism of the AI — and that's the inflection point.
The window: If we don't build this in the next 12 months, PagerDuty, Incident.io, or a well-funded startup will. The documentation-to-execution gap is too obvious and too painful to remain unowned. First-mover advantage accrues to whoever builds the Resolution Pattern Database first — that data moat compounds daily and is nearly impossible to replicate.
3. PRODUCT DEFINITION
Value Proposition
For on-call engineers: Replace the 3am panic spiral — searching Confluence, interpreting stale docs, copy-pasting commands you don't understand — with a calm copilot that already knows what's wrong, already has the runbook, and walks you through it step by step.
For SRE managers: Replace vibes-based operational health with data. Know which services lack runbooks, which runbooks are stale, which steps get skipped, and what your actual MTTR is — with audit-ready compliance evidence generated automatically.
For senior engineers (the bus factor): Stop being the human runbook. Your expertise gets captured from your natural workflow — terminal sessions, Slack threads, incident resolutions — and lives on in the system even when you're on vacation, asleep, or gone.
One sentence: dd0c/run turns your team's scattered incident knowledge into a living, learning, executable system that makes every on-call engineer as effective as your best one.
Personas
| Persona | Name | Role | Primary Need | Key Metric |
|---|---|---|---|---|
| The On-Call Engineer | Riley | Mid-level SRE, 2 years exp, paged at 3am | Instantly know what to do without searching or guessing | Time-to-resolution, confidence during incidents |
| The SRE Manager | Jordan | Manages 8 SREs, owns incident response quality | Consistent, auditable, measurable incident response | MTTR trends, runbook coverage, compliance readiness |
| The Runbook Author | Morgan | Staff engineer, 6 years, carries institutional knowledge | Transfer expertise without the overhead of writing docs | Knowledge capture rate, runbook usage by others |
The Trust Gradient — The Core Design Principle
This is the architectural decision that defines dd0c/run. It is non-negotiable.
┌─────────────────────────────────────────────────────────────┐
│ THE TRUST GRADIENT │
│ │
│ READ-ONLY ──→ SUGGEST ──→ COPILOT ──→ AUTOPILOT │
│ │
│ "Show me "Here's what "I'll do it, "Handled. │
│ the steps" I'd do" you approve" Here's │
│ the log" │
│ │
│ V1 ◄──────────────────► │
│ (V1 scope: Read-Only + Suggest + Copilot for 🟢 only) │
│ │
│ ● Per-runbook setting (not global) │
│ ● Per-step override (🟢 auto, 🟡 prompt, 🔴 block) │
│ ● Earned through data (10 successful runs → suggest upgrade│
│ ● Instantly revocable (one bad run → auto-downgrade) │
│ ● The human always has the kill switch │
└─────────────────────────────────────────────────────────────┘
V1 is Read-Only + Suggest + Copilot for 🟢 Safe steps ONLY. The AI auto-executes read-only diagnostic commands (check logs, query metrics, list pods). All state-changing commands (🟡 Caution, 🔴 Dangerous) require explicit human approval. Full Autopilot mode is V2 — and only for runbooks with a proven track record.
This is not a limitation. This is the product. Trust earned through data is the moat that no competitor can shortcut.
Features by Release
V1: "Paste → Parse → Page → Pilot" (Months 4-6)
The minimum viable product. Four verbs. If you can't explain it in those four words, it's out of scope.
| Feature | Description | Priority |
|---|---|---|
| Paste & Parse | Copy-paste raw text from anywhere. AI structures it into numbered, executable steps with risk classification (🟢🟡🔴) in < 5 seconds. Zero configuration. | P0 — This IS the product |
| Risk Classification Engine | AI + deterministic scanner labels every step. LLM classifies intent; regex/AST scanner validates against known destructive patterns. Scanner overrides LLM. Always. | P0 — Trust foundation |
| Copilot Execution | Slack bot + web UI walks engineer through runbook step-by-step. Auto-executes 🟢 steps. Prompts for 🟡. Blocks 🔴 without explicit typed confirmation. | P0 — Core value prop |
| Alert-to-Runbook Matching | PagerDuty/OpsGenie webhook integration. Alert fires → dd0c/run matches to most relevant runbook via keyword + metadata + basic semantic similarity. Posts in incident Slack channel. | P0 — "The runbook finds you" |
| Alert Context Injection | Matched runbook arrives pre-populated: affected service, region, recent deploys, related metrics. No manual lookup. | P0 — 3am brain can't look things up |
| Rollback-Aware Execution | Every state-changing step records its inverse command. One-click undo at any point. | P0 — Safety net |
| Divergence Detection | Post-incident: AI compares what engineer did vs. what runbook prescribed. Flags skipped steps, modified commands, unlisted actions. | P1 — Learning loop |
| Auto-Update Suggestions | Generates runbook diffs from divergence data. "You skipped steps 6-8 in 4/4 runs. Remove them?" | P1 — Self-improving runbooks |
| Runbook Health Dashboard | Coverage %, average staleness, MTTR with/without runbook, step skip rates. Jordan's command center. | P1 — Management visibility |
| Compliance Export | PDF/CSV of timestamped execution logs, approval chains, audit trail. Not pretty, but functional. | P1 — SOC 2 readiness |
| Prerequisite Detection | AI identifies implicit requirements ("you need kubectl access", "make sure you're on the VPN") and surfaces pre-flight checklist. | P2 |
| Ambiguity Highlighter | AI flags vague steps ("check the logs" — which logs?) and prompts author to clarify before the runbook goes live. | P2 |
What V1 does NOT include: Terminal Watcher (V2), Full Autopilot (V2), Confluence/Notion crawlers (V2 — V1 is paste), Incident Simulator (V3), Runbook Marketplace (V3), Multi-cloud abstraction (V3), Voice-guided runbooks (V3).
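To give shape to the "keyword + metadata" half of alert-to-runbook matching, here is a toy scoring function. It deliberately omits the semantic-similarity component and uses invented field names; it shows the shape of the signal, not the production algorithm:

```python
def match_score(alert: dict, runbook: dict) -> float:
    """Keyword overlap between the alert summary and the runbook's
    title/tags, plus a bonus when service metadata agrees."""
    alert_words = set(alert["summary"].lower().split())
    rb_words = set(runbook["title"].lower().split()) | set(runbook.get("tags", []))
    overlap = len(alert_words & rb_words) / max(len(rb_words), 1)
    service_bonus = 0.5 if alert.get("service") == runbook.get("service") else 0.0
    return overlap + service_bonus

alert = {"summary": "payment-service latency > 5000ms", "service": "payment-svc"}
runbooks = [
    {"title": "Payment Service Latency", "tags": ["latency", "payment-service"],
     "service": "payment-svc"},
    {"title": "Redis Memory Pressure", "tags": ["redis", "memory"],
     "service": "cache"},
]
best = max(runbooks, key=lambda rb: match_score(alert, rb))
```

In production the score would also feed a confidence threshold (the journey below cites "92% confidence") so that weak matches are surfaced as suggestions rather than auto-posted.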
V2: "Watch → Learn → Predict → Protect" (Months 7-9)
| Feature | Description | Unlocks |
|---|---|---|
| Terminal Watcher | Opt-in agent captures commands during incidents. AI generates runbooks from real actions. | Solves cold-start. Captures Morgan's expertise passively. |
| Confluence/Notion Crawlers | Automated discovery and sync of runbook-tagged pages. | Bulk import for large teams with 100+ runbooks. |
| Full Autopilot Mode | Runbooks with 10+ successful copilot runs and zero modifications can promote 🟢 steps to autonomous execution. | "Sleep through the night" promise. |
| dd0c/alert Deep Integration | Multi-alert correlation, enriched context passing, bidirectional feedback loop. | The platform flywheel engages. |
| Infrastructure-Aware Staleness | Cross-reference steps against live Terraform state, K8s manifests, AWS resources via dd0c/portal. | Runbooks that know when they're lying. |
| Runbook Effectiveness ML Model | Predict runbook success probability based on alert context, time of day, engineer experience. | Data-driven trust promotion. |
V3: "Simulate → Train → Marketplace → Scale" (Months 10-12)
| Feature | Description | Unlocks |
|---|---|---|
| Incident Simulator / Fire Drills | Sandbox environment for practicing runbooks. Gamified with scores. | Viral growth. "My team's score is 94." |
| Voice-Guided Runbooks | AI reads steps aloud at 3am. Hands-free incident response. | Genuine differentiation nobody else has. |
| Runbook Marketplace | Community-contributed, anonymized templates. "How teams running EKS + RDS handle connection storms." | Network effect. Templates improve with every customer. |
| Predictive Runbook Staging | dd0c/alert detects anomaly trending toward incident → dd0c/run pre-stages runbook → on-call gets heads-up 30 min early. | The incident that never happens. |
User Journey: Riley's 3am Page
3:17 AM — Phone buzzes. PagerDuty: CRITICAL — payment-service latency > 5000ms
3:17 AM — dd0c/run webhook fires. Matches alert to "Payment Service Latency Runbook" (92% confidence).
3:17 AM — Slack bot posts in #incident-2847:
🔔 Runbook matched: Payment Service Latency
📊 Pre-filled: region=us-east-1, service=payment-svc, deploy=v2.4.1 (2h ago)
🟢🟡🔴 8 steps (4 safe, 3 caution, 1 danger)
[▶ Start Copilot]
3:18 AM — Riley taps Start Copilot. Steps 1-3 (🟢 Safe) auto-execute:
✅ Checked pod status — 2/5 pods in CrashLoopBackOff
✅ Pulled logs — 847 connection timeout errors in last 5 min
✅ Queried pg_stat_activity — 312 idle-in-transaction connections
3:19 AM — Step 4 (🟡 Caution): "Bounce connection pool — kubectl rollout restart"
⚠️ This will restart all pods. ~30s downtime.
↩️ Rollback: kubectl rollout undo ...
Riley taps [✅ Approve & Execute]
3:20 AM — Step 5 (🟢 Safe) auto-executes: Verify latency recovery.
✅ Latency recovered to baseline. All pods Running.
3:21 AM — ✅ Incident resolved. MTTR: 3m 47s.
📝 "You skipped steps 6-8. Also ran a command not in the runbook:
SELECT count(*) FROM pg_stat_activity"
Suggested updates: Remove steps 6-8, add DB connection check before step 4.
[✅ Apply Updates]
3:21 AM — Riley applies updates. Goes back to sleep. The cat didn't even wake up.
Previous MTTR for this incident type without dd0c/run: 38-45 minutes. With dd0c/run: under 4 minutes. That's the product.
Pricing
| Tier | Price | Includes |
|---|---|---|
| Free | $0 | 3 runbooks, read-along mode only (no execution), basic parsing |
| Pro | $29/runbook/month | Unlimited runbooks, copilot execution, Slack bot, PagerDuty/OpsGenie integration, basic dashboard, divergence detection |
| Business | $49/seat/month | Everything in Pro + autopilot mode (V2), API access, SSO, compliance export, audit trail, priority support |
Pricing rationale: The per-runbook model ($29/runbook/month) aligns cost with value — teams pay for the runbooks they actually use, not empty seats. A team with 10 active runbooks pays $290/month. As they add more runbooks and see MTTR drop, revenue grows with demonstrated value. The per-seat Business tier captures larger teams that want platform features (SSO, compliance, API).
Note from Party Mode: The VC advisor recommended switching to pure per-seat pricing for simplicity. This is a valid concern. We will A/B test per-runbook vs. per-seat during beta to determine which model drives faster adoption and lower churn. The per-runbook model has the advantage of a lower entry point and direct value alignment; the per-seat model has the advantage of predictability and simpler billing.
4. GO-TO-MARKET PLAN
Launch Strategy
dd0c/run is Phase 3 in the dd0c platform rollout (Months 4-6). It does not launch alone. It launches alongside dd0c/alert as the "On-Call Savior" bundle — because a runbook engine without alert intelligence is a document viewer, and alert intelligence without execution is a notification system. Together, they close the loop from detection to resolution.
Prerequisite: dd0c/cost and dd0c/route must be live and generating revenue (Phase 1, Months 1-3). These FinOps modules prove immediate, hard-dollar ROI. If we can't save a company money, we have no right to ask them to trust us with their production environment. The FinOps wedge buys the political capital for the operational wedge.
Beachhead: Teams Drowning in On-Call
The ideal early customer has three characteristics:
- High incident volume. 10+ pages per week across the team. They feel the pain daily.
- Existing runbooks that they know are stale. They've tried to document. They know it's broken. They're looking for a better way.
- No dedicated SRE tooling team. They can't afford to spend 3 months configuring Rundeck or learning Shoreline's DSL. They need something that works in 5 minutes.
This is the Series B/C startup with 5-15 SREs supporting 50-200 developers. They're big enough to have real infrastructure problems, small enough that every engineer feels the on-call burden personally.
Secondary beachhead: Compliance chasers — startups preparing for SOC 2 who need documented, auditable incident response procedures yesterday. We sell them the audit trail masquerading as an automation tool.
The dd0c/alert → dd0c/run Upsell Path
This is the primary growth engine for dd0c/run. The conversion funnel:
dd0c/cost user (saves money) → trusts the platform
│
▼
dd0c/alert user (reduces noise, sleeps better) → trusts the intelligence
│
▼
dd0c/alert fires an alert → Slack message includes:
"📋 A runbook exists for this alert pattern. Want dd0c/run to guide you?"
│
▼
Engineer clicks through → lands on Paste & Parse → 5-second wow moment
│
▼
dd0c/run user (resolves incidents faster) → trusts the execution
│
▼
dd0c/portal user (owns the full developer experience) → locked in
Every dd0c/alert notification becomes a dd0c/run acquisition channel. The upsell is embedded in the product, not in a sales email.
Growth Loops
Loop 1: The Parsing Flywheel (Product-Led) Engineer pastes runbook → AI parses in 5 seconds → "Wow" → pastes 5 more → invites teammate → teammate pastes theirs → team has 20 runbooks in a week → first incident uses copilot → MTTR drops → team is hooked.
Fuel: The 5-second parse moment must be so good that engineers paste runbooks for fun.
Loop 2: The Incident Evidence Loop (Manager-Led) Jordan sees MTTR data → shows leadership → "With dd0c/run: 6 minutes. Without: 38 minutes." → leadership asks "Why don't all teams use this?" → org-wide rollout → more teams = more runbooks = better matching = better MTTR.
Fuel: The MTTR comparison chart. One number that justifies the budget.
Loop 3: The Open-Source Wedge (Community-Led)
Release ddoc-parse — a free, open-source CLI that parses runbooks locally. No account needed. No SaaS. Engineers who love it self-select into the beta. Their runbooks (anonymized) improve the parsing model. The CLI gets better. More users. More conversions.
Fuel: A genuinely useful free tool, not a crippled demo.
Loop 4: The Knowledge Capture Loop (Retention) Morgan's expertise captured in dd0c/run → Morgan leaves → Riley handles incident using Morgan's captured knowledge → team realizes dd0c/run IS their institutional memory → switching cost becomes infinite → renewal is automatic.
Fuel: The "Ghost of Morgan" moment — the first time a junior resolves an incident using a runbook generated from a senior's session.
Content Strategy
Engineering-as-marketing. Developers use adblockers and hate salespeople. We don't sell to them. We teach them.
| Content | Channel | Purpose |
|---|---|---|
| "The Anatomy of a 3am Page" — blog post with real data on cognitive impairment during nighttime incidents | Blog, Hacker News, r/sre | Thought leadership. Establishes the problem before pitching the solution. |
| ddoc-parse open-source CLI | GitHub, Product Hunt | Free tool that demonstrates AI parsing quality. Acquisition funnel. |
| "Your Runbooks Are Lying to You" — analysis of runbook staleness rates across 100 teams | Blog, SRE Weekly newsletter | Data-driven content that managers share internally. |
| Conference lightning talks (SREcon, KubeCon, DevOpsDays) | In-person | 5-minute talk ending with beta signup QR code. |
| Incident postmortem outreach | Direct DM | Companies publishing postmortems are self-selecting. "I read your Redis incident writeup. We're building something that would have cut your MTTR in half." |
| Pre-seeded runbook templates (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure) | In-product, GitHub | Solve the cold-start problem. Demonstrate value before the user pastes anything. |
90-Day Launch Timeline
| Day | Milestone |
|---|---|
| 1-14 | Private alpha with 5 hand-picked teams from dd0c/cost user base. Paste & Parse + basic Copilot in Slack. Gather parsing quality feedback. |
| 15-30 | Iterate on parsing accuracy based on alpha feedback. Add PagerDuty webhook integration. Add risk classification validation (deterministic scanner). Ship divergence detection. |
| 31-45 | Expand to 15-20 beta teams. Launch ddoc-parse open-source CLI. Begin collecting MTTR data. Add health dashboard for Jordan persona. |
| 46-60 | Beta teams running in production. First MTTR comparison data available. Begin compliance export feature. Publish "Anatomy of a 3am Page" blog post. |
| 61-75 | Refine based on beta feedback. A/B test pricing models (per-runbook vs. per-seat). Secure 3+ case study commitments. Ship dd0c/alert integration (webhook-based). |
| 76-90 | Public launch. Product Hunt launch. Hacker News "Show HN" post. Conference talk submissions. Convert beta teams to paid. Target: 50 teams with ≥ 5 active runbooks. |
5. BUSINESS MODEL
Revenue Model
Primary: $29/runbook/month (Pro tier) or $49/seat/month (Business tier).
A team with 10 active runbooks on Pro pays $290/month. A team of 8 SREs on Business pays $392/month. Revenue scales with demonstrated value — more runbooks means more incidents resolved faster, which means higher willingness to pay.
Secondary: The dd0c platform bundle. Teams using dd0c/cost + dd0c/alert + dd0c/run together represent an average deal size of $500-800/month. The platform is stickier than any individual module.
Unit Economics
| Metric | Value | Notes |
|---|---|---|
| COGS per runbook/month | ~$3-5 | LLM inference (via dd0c/route, optimized model selection), compute for Rust API + agent coordination, PostgreSQL storage. Parsing is a one-time cost per runbook; execution inference is per-incident. |
| Gross Margin | ~85% | SaaS-standard. The Rust stack keeps infrastructure costs low. LLM costs are the primary variable, managed by dd0c/route. |
| CAC (Target) | < $500 | Product-led growth via ddoc-parse CLI + dd0c/alert upsell. No outbound sales team. Content marketing + community seeding. |
| LTV (Target) | > $5,000 | 18+ month retention (institutional memory lock-in). Average $290/month × 18 months = $5,220. |
| LTV:CAC Ratio | > 10:1 | Healthy for bootstrapped SaaS. The dd0c/alert upsell path has near-zero incremental CAC. |
| Payback Period | < 2 months | At $290/month with $500 CAC, payback in ~1.7 months. |
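The LTV and payback targets in the table follow from the stated assumptions; a quick check:

```python
price = 290           # $/month: 10 Pro runbooks × $29
retention = 18        # months (institutional-memory lock-in)
cac = 500             # $ target

ltv = price * retention         # $5,220
payback_months = cac / price    # ≈ 1.7 months

print(ltv, round(payback_months, 1), round(ltv / cac, 1))
```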
Path to Revenue Milestones
$10K MRR (Month 8 — 4 months post-launch)
- 35 Pro teams × 10 runbooks × $29 = $10,150/month
- Source: Beta conversions (15-20 teams) + dd0c/alert upsell (10-15 teams) + organic from ddoc-parse CLI (5-10 teams)
- Key assumption: 70% beta-to-paid conversion rate
$50K MRR (Month 14 — 10 months post-launch)
- 120 Pro teams ($34,800) + 30 Business teams ($14,700) = $49,500/month
- Source: Platform flywheel engaged. dd0c/alert → dd0c/run conversion running at 25%. Community templates driving organic acquisition. First conference talks generating inbound.
- Key assumption: < 5% monthly churn (institutional memory lock-in)
$100K MRR (Month 20 — 16 months post-launch)
- 200 Pro teams ($58,000) + 80 Business teams ($39,200) + 5 custom enterprise ($10,000) = $107,200/month
- Source: Runbook Marketplace (V3) creating network effects. Multi-team deployments within companies. SOC 2 compliance driving Business tier upgrades.
- Key assumption: Average expansion revenue of 30% (teams add runbooks and seats over time)
Solo Founder Constraints
This is the hardest product in the dd0c lineup to support as a solo founder. The reasons are structural:
- Production safety liability. If dd0c/run contributes to a production outage, the reputational damage extends to the entire dd0c brand. There is no "move fast and break things" with a product that touches production. Every release must be paranoid.
- Support burden. When a customer's weird custom Kubernetes setup doesn't play nice with the Rust agent, that's a high-urgency, high-complexity support ticket at 3am. Unlike dd0c/cost (where a bug means a wrong number on a dashboard), a dd0c/run bug means a failed incident response.
- Security surface area. The VPC agent is open-source and auditable, but it's still a binary running inside customer infrastructure. A CVE in the agent is an existential event. Security reviews from enterprise customers will be thorough and time-consuming.
Mitigations:
- Shared platform architecture. One API gateway, one auth layer, one billing system, one OpenTelemetry ingest pipeline across all dd0c modules. If you build six separate data models, you burn out in 14 months.
- V1 scope discipline. Copilot-only. No Autopilot. No Terminal Watcher. No crawlers. The smaller the surface area, the smaller the support burden.
- Community-driven templates. Pre-seed 50 high-quality templates for standard infrastructure. Let the community maintain and improve them. Reduce the "my setup is unique" support tickets.
- Aggressive kill criteria. If the support burden exceeds 10 hours/week within the first 3 months, re-evaluate the agent architecture. Consider a managed-execution model where the SaaS handles execution via customer-provided cloud credentials (higher trust barrier, lower support burden).
6. RISKS & MITIGATIONS
Risk 1: LLM Hallucination Causes a Production Outage
Severity: 10/10 — Extinction-level event for the company. Probability: Medium (with mitigations), High (without).
The scenario: The runbook says "Restart the pod." The LLM hallucinates and outputs kubectl delete deployment instead, classifying it as 🟢 Safe. The tired engineer clicks approve. Production goes down. The customer cancels. dd0c goes bankrupt from reputational damage.
Mitigations:
- Deterministic regex/AST scanner validates every command against known destructive patterns. The scanner overrides the LLM classification. Always. The LLM is advisory; the scanner is authoritative.
- Agent-level command whitelist. Anything outside the whitelist is blocked at the agent, regardless of what the SaaS sends. Defense in depth.
- 🔴 Dangerous actions require typing the resource name to confirm. Not clicking a button. UI friction is a feature.
- Every state-changing step records its rollback command. One-click undo.
- V1 limits auto-execution to 🟢 Safe (read-only diagnostic) commands only. State changes always require human approval.
- Comprehensive logging of every command suggested, approved, executed, and rolled back. Full audit trail.
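The scanner-overrides-LLM rule can be sketched in a few lines. The pattern list and function names here are illustrative, not the shipped scanner:

```python
import re

# Illustrative destructive-command patterns; the real list would be far
# larger and maintained alongside the agent whitelist.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bkubectl\s+delete\b",
    r"\bdrop\s+(table|database)\b",
    r"\bterraform\s+destroy\b",
]

def classify(command: str, llm_label: str) -> str:
    """The deterministic scanner is authoritative: any destructive match
    forces DANGEROUS, regardless of the (advisory) LLM label."""
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return "DANGEROUS"
    return llm_label  # scanner found nothing; fall back to the LLM

# A hallucinated "Safe" label on a destructive command is overridden:
print(classify("kubectl delete deployment api", "SAFE"))  # DANGEROUS
print(classify("kubectl get pods -n prod", "SAFE"))       # SAFE
```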
Kill condition: If the LLM misclassifies a destructive command as Safe (🟢) even once during beta, or if the false-positive rate for safe commands exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.
Risk 2: PagerDuty / Incident.io Ships Native AI Runbook Automation
Severity: 8/10. Probability: High — they will build something. The question is when and how good.
The scenario: PagerDuty acquires a small AI runbook startup and bundles it into their Enterprise tier for "free" (subsidized by their massive license cost), choking off dd0c/run's distribution.
Mitigations:
- Platform agnosticism. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, and any webhook source. PagerDuty's automation is locked to PagerDuty.
- Cross-module data advantage. PagerDuty can't integrate with dd0c/cost anomalies, dd0c/drift detection, or dd0c/portal service ownership. We have the context. They have the routing rules.
- Mid-market pricing. PagerDuty's automation is an enterprise upsell ($$$). We sell to the on-call engineer at 3am for $29/runbook.
- The Resolution Pattern Database. PagerDuty keeps data siloed per enterprise. We anonymize and share the "ideal" runbooks across the mid-market. Network effect they can't replicate without cannibalizing their enterprise model.
Pivot option: If PagerDuty ships a compelling native solution, double down on the dd0c ecosystem play. dd0c/run becomes the execution arm of dd0c/alert + dd0c/drift + dd0c/portal — a tightly coupled platform that no single-feature bolt-on can touch.
Risk 3: Teams Don't Have Documented Runbooks (Cold Start Problem)
Severity: 7/10. Probability: High — many teams have zero runbooks.
The scenario: A prospect signs up, goes to the "Paste Runbook" screen, and realizes they have nothing to paste. Churn happens in 60 seconds.
Mitigations:
- Pre-seed the platform with 50 high-quality templates for standard infrastructure (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure, cert expiry, etc.). New users see value before they paste anything.
- Slack Thread Distiller (V1): Paste a Slack thread URL from a past incident. AI extracts the resolution commands and generates a draft runbook. If they have incidents, they have Slack threads.
- Postmortem-to-Runbook Pipeline (V1): Feed in a postmortem doc. AI extracts "what we did to fix it" and generates a structured runbook.
- Terminal Watcher (V2): Captures commands during live incidents and generates runbooks automatically.
- Shift marketing from "Automate your runbooks" to "Generate your missing runbooks." The product creates runbooks, not just executes them.
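A hypothetical first pass at the Slack Thread Distiller's command extraction, assuming engineers paste commands into threads with a leading "$" (the heuristic and names are illustrative; the real feature would hand these candidates to the LLM for step generation):

```python
import re

# Hypothetical pre-filter: pull likely shell commands out of an incident
# thread before the LLM drafts runbook steps. Prefix list is illustrative.
COMMAND_PREFIXES = ("kubectl", "aws", "redis-cli", "psql", "systemctl", "curl")

def extract_commands(thread_text: str) -> list[str]:
    # Match "$ command ..." anywhere in a message line, then keep only
    # candidates that start with a known tool name.
    candidates = re.findall(r"\$\s*(\S.*)", thread_text)
    return [c.strip() for c in candidates if c.startswith(COMMAND_PREFIXES)]

thread = """\
alice: pods are crashlooping again
bob: $ kubectl get pods -n checkout
bob: ok, restarting
bob: $ kubectl rollout restart deployment/checkout
alice: back to green, thanks
"""
print(extract_commands(thread))
```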
Risk 4: The "Agentic AI" Obsolescence Event
Severity: High. Probability: Low (in the next 3 years).
The scenario: Autonomous AI agents (Devin, GitHub Copilot Workspace, Pulumi Neo) can detect and fix infrastructure issues without human intervention. Who needs runbooks?
Mitigations:
- Runbooks become the "policy" that defines what the agent should do. They're the bridge between human intent and agent execution. We pivot from "human automation" to "agent policy management."
- Position dd0c/run as the control plane for agentic operations — the system that defines, constrains, and audits what AI agents are allowed to do in production.
- The Trust Gradient already models this transition: Read-Only → Copilot → Autopilot is the same spectrum as Human-Driven → Human-on-the-Loop → Agent-Driven.
Risk 5: Solo Founder Scaling / The Bus Factor
Severity: High. Probability: High — Brian is building six products.
The scenario: The support burden of a production-safety-critical product overwhelms a solo founder. A critical bug in the VPC agent requires immediate response at 3am. Brian burns out.
Mitigations:
- Shared platform architecture reduces per-module engineering overhead by 60%+.
- V1 scope discipline: Copilot-only, no Autopilot, no crawlers. Smallest possible surface area.
- Open-source the Rust agent. Community contributions for edge-case Kubernetes configurations. Community security audits.
- Aggressive automation of support: self-healing agent updates, comprehensive error messages, in-product diagnostics.
- If dd0c/run reaches $50K MRR, hire a dedicated SRE for agent support. This is the first hire, non-negotiable.
The Catastrophic Scenario and How to Prevent It
The nightmare: dd0c/run's AI suggests a destructive command. A sleep-deprived engineer approves it. Production goes down for a major customer. The incident gets posted on Hacker News. The dd0c brand — across all six modules — is destroyed. Not just dd0c/run. Everything.
Prevention (defense in depth):
- Layer 1 — LLM Classification: AI labels every step with risk level. This is the first pass, and it's the least trusted.
- Layer 2 — Deterministic Scanner: Regex/AST pattern matching against known destructive commands (delete, drop, rm -rf, kubectl delete namespace, etc.). Overrides LLM. Catches hallucinations.
- Layer 3 — Agent Whitelist: The Rust agent maintains a local whitelist of allowed command patterns. Anything not on the whitelist is rejected at the agent level, regardless of what the SaaS sends. The agent doesn't trust the cloud.
- Layer 4 — UI Friction: 🟡 commands require click-to-approve. 🔴 commands require typing the resource name. No "approve all" button. Ever.
- Layer 5 — Rollback Recording: Every state-changing command has a recorded inverse. One-click undo. The safety net.
- Layer 6 — Audit Trail: Every command suggested, approved, modified, executed, and rolled back is logged with timestamps, user identity, and alert context. Full forensic capability.
If all six layers fail simultaneously, the product deserves to die. But they won't fail simultaneously — that's the point of defense in depth.
7. SUCCESS METRICS
North Star Metric
Incidents resolved via dd0c/run copilot per month.
This single metric captures adoption (teams are using it), trust (engineers choose copilot over skipping), and value (incidents are actually getting resolved). If this number grows, everything else follows.
Leading Indicators
| Metric | Target (Month 6) | Why It Matters |
|---|---|---|
| Time-to-First-Runbook | < 5 minutes | If onboarding is slow, nobody reaches the value. The Vercel test. |
| Paste & Parse success rate | > 90% | If parsing fails or requires heavy manual editing, the magic moment is broken. |
| Copilot adoption rate | ≥ 60% of matched incidents | If engineers bypass copilot, the product isn't trusted or isn't useful. |
| Risk classification accuracy | > 99.9% (zero false-safe on destructive commands) | The safety foundation. One misclassification and we're done. |
| Weekly active runbooks per team | ≥ 5 | The product is alive, not shelfware. |
| Runbook update acceptance rate | ≥ 30% of suggestions | The learning loop is working. Runbooks are improving. |
Lagging Indicators
| Metric | Target (Month 6) | Why It Matters |
|---|---|---|
| MTTR reduction | ≥ 40% vs. baseline | The headline number. "Teams using dd0c/run resolve incidents 40% faster." |
| NPS from on-call engineers | > 50 | Riley actually likes this. Not just tolerates it. |
| Monthly churn | < 5% | Institutional memory lock-in is working. |
| Expansion revenue | > 20% | Teams adding runbooks and seats over time. |
| Zero safety incidents | 0 | dd0c/run never made an incident worse. Non-negotiable. |
30/60/90 Day Milestones
Day 30: Prove the Parse
- 15-20 beta teams onboarded
- Paste & Parse working with > 90% accuracy across diverse runbook formats
- PagerDuty webhook integration live
- Risk classification validated: zero false-safe misclassifications on destructive commands
- First MTTR data points collected
- Success criteria: Engineers say "wow" when they paste their first runbook
Day 60: Prove the Pilot
- Beta teams running copilot in production incidents
- MTTR reduction ≥ 30% for at least 8 teams
- Divergence detection generating useful runbook update suggestions
- Health dashboard live for Jordan persona
- dd0c/alert webhook integration functional
- ddoc-parse open-source CLI launched
- Success criteria: At least one engineer says "I actually slept through the night because dd0c/run handled the diagnostics"
Day 90: Prove the Business
- 50 teams with ≥ 5 active runbooks
- MTTR reduction ≥ 40% for at least 12 teams
- 3+ teams committed as named case studies
- Pricing model validated (per-runbook vs. per-seat A/B test complete)
- Zero safety incidents across all beta teams
- Public launch executed (Product Hunt, Hacker News, conference submissions)
- $10K MRR trajectory confirmed
- Success criteria: Beta-to-paid conversion rate ≥ 70%
Kill Criteria
The product is killed or fundamentally re-architected if any of the following occur:
- Safety failure. The LLM misclassifies a destructive command as Safe (🟢) during beta. Even once.
- Trust failure. Engineers skip copilot mode > 50% of the time after 30 days. The product isn't trusted.
- Parse failure. Paste & Parse accuracy stays below 80% after 60 days of iteration. The core AI capability doesn't work.
- Adoption failure. Fewer than 8 beta teams active after 4 weeks. The problem isn't painful enough or the solution isn't compelling enough.
- MTTR failure. MTTR reduction < 20% or inconsistent across teams after 60 days. The product doesn't deliver measurable value.
If we hit a kill criterion, the pivot options are:
- Pivot to read-only intelligence: Strip execution entirely. Become the "runbook quality platform" — parsing, staleness detection, coverage dashboards, compliance evidence. Lower risk, lower value, but viable.
- Pivot to agent policy management: If agentic AI arrives faster than expected, position dd0c/run as the policy layer that defines what AI agents are allowed to do in production.
- Absorb into dd0c/portal: The Contrarian from Party Mode was right about one thing — if dd0c/run can't stand alone, it becomes a feature of the IDP, not a product.
APPENDIX: RESOLVED CONTRADICTIONS ACROSS PHASES
| Contradiction | Brainstorm Position | Party Mode Position | Resolution |
|---|---|---|---|
| Standalone product vs. portal feature | Standalone with ecosystem integration | Contrarian argued it's a portal feature, not a product | Standalone with kill-criteria pivot to portal. Launch as standalone to test market demand. If adoption fails, absorb into dd0c/portal as a feature. |
| Per-runbook vs. per-seat pricing | $29/runbook/month | VC advisor recommended per-seat for simplicity | A/B test during beta. Per-runbook aligns cost with value; per-seat is simpler. Let data decide. |
| V1 execution scope | Full copilot with 🟢🟡🔴 approval gates | CTO demanded no execution until deterministic validation exists; Bootstrap Founder said copilot-only | V1 auto-executes 🟢 only. 🟡🔴 require human approval. Deterministic scanner overrides LLM. Synthesizes both positions. |
| Confluence/Notion crawlers in V1 | Design Thinking included crawlers as V1 | Innovation Strategy said "do not build crawlers; force the user to paste" | Paste-only in V1. Crawlers are V2. Solo founder can't maintain integration APIs for V1. Paste is the 5-second wow moment anyway. |
| Cold start solution | Slack Thread Scraper in V1 | Terminal Watcher in V1 | Slack Thread Distiller in V1. Terminal Watcher deferred to V2. Slack threads require no agent installation (lower trust barrier). Terminal Watcher requires an agent on the engineer's machine — too much friction for V1. |
This brief synthesizes insights from four prior development phases: Brainstorm (Carson, Venture Architect), Design Thinking (Maya, Design Maestro), Innovation Strategy (Victor, Disruption Oracle), and Party Mode Advisory Board (5-person expert panel). All contradictions have been identified and resolved with rationale.
dd0c/run is the most safety-critical module in the dd0c platform. This brief reflects that gravity. Build it paranoid. Assume the AI wants to delete production. Constrain it accordingly. Then ship it — because the 3am pager isn't going to fix itself.