dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
2144
products/06-runbook-automation/architecture/architecture.md
Normal file
File diff suppressed because it is too large
270
products/06-runbook-automation/brainstorm/session.md
Normal file
@@ -0,0 +1,270 @@
# dd0c/run — Brainstorm Session: AI-Powered Runbook Automation

**Facilitator:** Carson, Elite Brainstorming Specialist

**Date:** 2026-02-28

**Product:** dd0c/run (Product #6 in the dd0c platform)

**Phase:** "On-Call Savior" (Months 4-6 per brand strategy)

---
## Phase 1: Problem Space (25 ideas)

The graveyard of runbooks is REAL. Let's map every angle of the pain.

### Discovery & Awareness

1. **The Invisible Runbook** — On-call engineer gets paged, doesn't know a runbook exists for this exact alert. It's buried on page 47 of a Confluence space nobody bookmarks.
2. **The "Ask Steve" Problem** — The runbook is Steve's brain. Steve is on vacation in Bali. Steve didn't write it down. Steve never will.
3. **The Wrong Runbook** — Engineer finds a runbook but it's for a different version of the service, or a similar-but-different failure mode. They follow it anyway. Things get worse.
4. **The Search Tax** — At 3am, panicking, the engineer spends 12 minutes searching Confluence, Notion, Slack, and Google Docs for the right runbook. MTTR just doubled.
5. **The Tribal Knowledge Silo** — Senior engineers have mental runbooks for every failure mode. They never write them down because "it's faster to just fix it myself."

### Runbook Rot & Maintenance

6. **The Day-One Decay** — A runbook is accurate the day it's written. By day 30, the infrastructure has changed and 3 of the 8 steps are wrong.
7. **The Nobody-Owns-It Problem** — Who maintains the runbook? The person who wrote it left 6 months ago. The team that owns the service doesn't know the runbook exists.
8. **The Copy-Paste Drift** — Runbooks get forked, copied, slightly modified. Now there are 4 versions and none are canonical.
9. **The Screenshot Graveyard** — Runbooks full of screenshots of UIs that have been redesigned twice since.
10. **The "Works on My Machine" Runbook** — Steps that assume specific IAM permissions, VPN configs, or CLI versions that the on-call engineer doesn't have.

### Cognitive Load & Human Factors

11. **3am Brain** — Cognitive function can drop 30-40% during night pages. Complex multi-step runbooks become impossible to follow accurately.
12. **The Panic Spiral** — Alert fires → engineer panics → skips steps → makes it worse → more alerts fire → more panic. The runbook can't help if the human can't process it.
13. **Context Switching Hell** — Following a runbook means jumping between the doc, the terminal, the AWS console, Datadog, Slack, and PagerDuty. Each switch costs 30 seconds of re-orientation.
14. **The "Which Step Am I On?" Problem** — Engineer gets interrupted by a Slack message, loses their place in a 20-step runbook, and re-executes step 7, which was supposed to be idempotent but isn't.
15. **Decision Fatigue at the Fork** — Runbook says "If X, do A. If Y, do B. If neither, escalate." Engineer can't tell if it's X or Y. Freezes.

### Organizational & Cultural

16. **The Postmortem Lie** — Every postmortem says "Action item: update runbook." It never happens. The Jira ticket sits in the backlog for eternity.
17. **The Hero Culture** — Organizations reward the engineer who heroically fixes the incident, not the one who writes the boring runbook. Incentives are backwards.
18. **The New Hire Cliff** — New on-call engineer's first page. No context, no muscle memory, runbooks assume 2 years of institutional knowledge.
19. **The Handoff Gap** — Shift change during an active incident. The outgoing engineer's context is lost. The incoming engineer starts from scratch.
20. **The "We Don't Have Runbooks" Admission** — Many teams simply don't have runbooks at all. They rely entirely on tribal knowledge and hope.

### Economic & Business Impact

21. **The MTTR Multiplier** — Every minute of downtime costs money. A 30-minute MTTR vs. a 5-minute MTTR on a revenue-critical service can mean a $50K+ difference per incident for mid-market companies.
22. **The Attrition Cost** — On-call burnout is the #1 reason SREs quit. Bad runbooks = more stress = more turnover = $150K+ per lost engineer.
23. **The Compliance Gap** — SOC 2 and ISO 27001 require documented incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive an audit.
24. **The Repeated Incident Tax** — Same incident happens monthly. Same engineer fixes it manually each time. Nobody automates it because "it only takes 20 minutes." That's 4 hours/year of senior engineer time per recurring incident.
25. **The Escalation Cascade** — Junior engineer can't follow the runbook → escalates to senior → senior is also paged for 3 other things → everyone's awake, nobody's effective.

---
## Phase 2: Solution Space (78 ideas)

LET'S GO. Every idea is valid. We're building the future of incident response.

### Ingestion & Import (Ideas 26-36)

26. **Confluence Crawler** — API integration that discovers and imports all runbook-tagged pages from Confluence spaces. Parses prose into structured steps.
27. **Notion Sync** — Bidirectional sync with Notion databases. Import existing runbooks, push updates back.
28. **GitHub/GitLab Markdown Ingest** — Point at a repo directory of `.md` runbooks. Auto-import on merge to main.
29. **Slack Thread Scraper** — "That time we fixed the database" lives in a Slack thread. AI extracts the resolution steps from the conversation noise.
30. **Google Docs Connector** — Many teams keep runbooks in shared Google Docs. Import and keep synced.
31. **Video Transcription Import** — Senior engineer recorded a Loom/screen recording of fixing an issue. AI transcribes it, extracts the steps, generates a runbook.
32. **Terminal Session Replay Import** — Import asciinema recordings or shell history from past incidents. AI identifies the commands that actually fixed the issue vs. the diagnostic noise.
33. **Postmortem-to-Runbook Pipeline** — Feed in your postmortem doc. AI extracts the resolution steps and generates a draft runbook automatically.
34. **PagerDuty/OpsGenie Notes Scraper** — Engineers often leave resolution notes in the incident timeline. Scrape those into runbook drafts.
35. **Jira/Linear Ticket Mining** — Incident tickets often contain resolution steps in comments. Mine them.
36. **Clipboard/Paste Import** — Zero-friction: just paste the text of your runbook into dd0c/run. AI structures it instantly.
### AI Parsing & Understanding (Ideas 37-45)

37. **Prose-to-Steps Converter** — AI takes a wall of text ("First you need to SSH into the bastion, then check the logs for...") and converts it into numbered, executable steps.
38. **Command Extraction** — AI identifies shell commands, API calls, and SQL queries embedded in prose. Tags them as executable.
39. **Prerequisite Detection** — AI identifies implicit prerequisites ("you need kubectl access", "make sure you're on the VPN") and surfaces them as a checklist before execution.
40. **Conditional Logic Mapping** — AI identifies decision points ("if the error is X, do Y; otherwise do Z") and creates branching workflow trees.
41. **Risk Classification per Step** — AI labels each step: 🟢 Safe (read-only), 🟡 Caution (state change, reversible), 🔴 Dangerous (destructive, irreversible). Auto-execute green, prompt for yellow, require explicit approval for red.
42. **Staleness Detection** — AI cross-references runbook commands against current infrastructure (Terraform state, K8s manifests, AWS resource tags). Flags steps that reference resources that no longer exist.
43. **Ambiguity Highlighter** — AI flags vague steps ("check the logs" — which logs? where?) and prompts the author to clarify.
44. **Multi-Language Support** — Parse runbooks written in English, Spanish, Japanese, etc. Incident response is global.
45. **Diagram/Flowchart Generation** — AI generates a visual flowchart from the runbook steps. Engineers can see the whole decision tree at a glance.
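Idea 41's traffic-light model is concrete enough to sketch. A minimal, hypothetical classifier and policy in Python (keyword heuristics stand in for the real LLM labeling; names and word lists are illustrative, not the product's actual schema):

```python
from enum import Enum

class Risk(Enum):
    SAFE = "green"        # read-only: auto-execute
    CAUTION = "yellow"    # state change, reversible: prompt first
    DANGEROUS = "red"     # destructive, irreversible: explicit approval

# Illustrative keyword heuristics; the real product would label steps with an LLM.
DESTRUCTIVE = ("rm ", "drop ", "delete ", "terminate", "destroy")
MUTATING = ("restart", "scale", "apply", "failover", "reboot")

def classify(command: str) -> Risk:
    lowered = command.lower()
    if any(word in lowered for word in DESTRUCTIVE):
        return Risk.DANGEROUS
    if any(word in lowered for word in MUTATING):
        return Risk.CAUTION
    return Risk.SAFE

def execution_policy(risk: Risk) -> str:
    # Auto-execute green, prompt for yellow, require explicit approval for red.
    return {
        Risk.SAFE: "auto-execute",
        Risk.CAUTION: "prompt",
        Risk.DANGEROUS: "require-approval",
    }[risk]
```

The point is the mapping, not the heuristic: whatever labels the step, the policy layer is a pure function from risk to behavior, which makes it easy to audit.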
### Execution Modes (Ideas 46-54)

46. **Full Autopilot** — For well-tested, low-risk runbooks: AI executes every step automatically, reports results.
47. **Copilot Mode (Human-in-the-Loop)** — AI suggests each step, pre-fills the command, engineer clicks "Execute" or modifies it first. The default mode.
48. **Suggestion-Only / Read-Along** — AI walks the engineer through the runbook step by step, highlighting the current step, but doesn't execute anything. Training wheels.
49. **Dry-Run Mode** — AI simulates execution of each step, shows what WOULD happen without actually doing it. Perfect for testing runbooks.
50. **Progressive Trust** — Starts in suggestion-only mode. As the team builds confidence, they can promote individual runbooks to copilot or autopilot mode.
51. **Approval Chains** — Dangerous steps require approval from a second engineer or a manager. Integrated with Slack/Teams for quick approvals.
52. **Rollback-Aware Execution** — Every step that changes state also records the rollback command. If things go wrong, one-click undo.
53. **Parallel Step Execution** — Some runbook steps are independent. AI identifies parallelizable steps and executes them simultaneously to reduce MTTR.
54. **Breakpoint Mode** — Engineer sets breakpoints in the runbook like a debugger. Execution pauses at those points for manual inspection.
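Idea 52 implies a simple invariant: no state-changing step without a recorded inverse. A sketch of the bookkeeping (hypothetical shapes; the real in-VPC agent would do the actual execution):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    command: str
    rollback: Optional[str] = None  # inverse command, if the step changes state

@dataclass
class Execution:
    done: list = field(default_factory=list)

    def run(self, step: Step) -> None:
        # (The in-VPC agent would actually execute step.command here.)
        self.done.append(step)

    def undo_plan(self) -> list:
        # One-click undo: replay recorded rollbacks in reverse order,
        # skipping read-only steps that have no inverse.
        return [s.rollback for s in reversed(self.done) if s.rollback]
```

Reverse order matters: the last state change must be unwound first, the same way a transaction log is replayed backwards.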
### Alert Integration (Ideas 55-62)

55. **Auto-Attach Runbook to Incident** — When PagerDuty/OpsGenie fires an alert, dd0c/run automatically identifies the most relevant runbook and attaches it to the incident.
56. **Alert-to-Runbook Matching Engine** — ML model that learns which alerts map to which runbooks based on historical resolution patterns.
57. **Slack Bot Integration** — `/ddoc run database-failover` in Slack. The bot walks you through the runbook right in the channel.
58. **PagerDuty Custom Action** — One-click "Run Runbook" button directly in the PagerDuty incident page.
59. **Pre-Incident Warm-Up** — When anomaly detection suggests an incident is LIKELY (but hasn't fired yet), dd0c/run pre-loads the relevant runbook and notifies the on-call.
60. **Multi-Alert Correlation** — When 5 alerts fire simultaneously, AI determines they're all symptoms of one root cause and suggests the single runbook that addresses it.
61. **Escalation-Aware Routing** — If the L1 runbook doesn't resolve the issue within N minutes, automatically escalate to the L2 runbook and page the senior engineer with full context.
62. **Alert Context Injection** — When the runbook starts, AI pre-populates variables (affected service, region, customer impact) from the alert payload. No manual lookup needed.
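Idea 62's variable pre-population is essentially template substitution over the alert payload. A minimal sketch (the `$service`/`$region` field names are hypothetical, not a defined payload schema):

```python
from string import Template

def inject_context(step_command: str, alert_payload: dict) -> str:
    """Pre-fill $-style runbook variables from the alert payload.

    safe_substitute leaves unknown variables untouched, so a partially
    populated alert never corrupts the command text.
    """
    return Template(step_command).safe_substitute(alert_payload)
```

Using `safe_substitute` rather than `substitute` is the design choice: a missing variable should surface visibly in the pre-filled command, not crash the copilot mid-incident.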
### Learning Loop & Continuous Improvement (Ideas 63-72)

63. **Resolution Tracking** — Track which runbook steps actually resolved the incident vs. which were skipped or failed. Use this data to improve runbooks.
64. **Auto-Update Suggestions** — After an incident, AI compares what the engineer actually did vs. what the runbook said. Suggests updates for divergences.
65. **Runbook Effectiveness Score** — Each runbook gets a score: success rate, average MTTR when used, skip rate per step. Surface the worst-performing runbooks for review.
66. **Dead Step Detection** — If step 4 is skipped by every engineer every time, it's probably unnecessary. Flag it for removal.
67. **New Failure Mode Detection** — AI notices an incident that doesn't match any existing runbook. Prompts the resolving engineer to create one from their actions.
68. **A/B Testing Runbooks** — Two approaches to fixing the same issue? Run both, track which has better MTTR. Data-driven runbook optimization.
69. **Seasonal Pattern Learning** — "This database issue happens every month-end during batch processing." AI learns temporal patterns and pre-stages runbooks.
70. **Cross-Team Learning** — Anonymized patterns: "Teams using this AWS architecture commonly need this type of runbook." Suggest runbook templates based on infrastructure fingerprint.
71. **Confidence Decay Model** — Runbook confidence score decreases over time since last successful use or last infrastructure change. Triggers review when confidence drops below threshold.
72. **Incident Replay for Training** — Record the full incident timeline (alerts, runbook execution, engineer actions). Replay it for training new on-call engineers.
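Idea 71's decay model could be as simple as exponential half-life decay since the last successful use (the 90-day half-life and 0.5 review threshold below are made-up defaults, not product decisions):

```python
def confidence(base: float, days_since_success: float, half_life_days: float = 90.0) -> float:
    # Confidence halves every `half_life_days` without a successful run;
    # a detected infrastructure change could additionally cut `base` directly.
    return base * 0.5 ** (days_since_success / half_life_days)

def needs_review(base: float, days_since_success: float, threshold: float = 0.5) -> bool:
    # Trigger a review once decayed confidence falls below the threshold.
    return confidence(base, days_since_success) < threshold
```

With these defaults, a runbook unused for 90 days sits at half its original confidence and is flagged for review; a run used last week stays well above the threshold.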
### Collaboration & Handoff (Ideas 73-80)

73. **Multi-Engineer Incident View** — Multiple engineers working the same incident can see each other's progress through the runbook in real-time.
74. **Shift Handoff Package** — When shift changes during an incident, dd0c/run generates a context package: what's been tried, what's left, current state.
75. **War Room Mode** — Dedicated incident channel with the runbook pinned, step progress visible, and AI providing real-time suggestions.
76. **Expert Paging with Context** — When escalating, the paged expert receives not just "help needed" but the full runbook execution history, what's been tried, and where it's stuck.
77. **Async Runbook Contributions** — After an incident, any engineer can suggest edits to the runbook. Changes go through a review process like a PR.
78. **Runbook Comments & Annotations** — Engineers can leave inline comments on runbook steps ("This step takes 5 minutes, don't panic if it seems stuck").
79. **Incident Narration** — AI generates a real-time narrative of the incident for stakeholders: "The team is on step 5 of 8. Database failover is in progress. ETA: 10 minutes."
80. **Cross-Timezone Handoff Intelligence** — AI knows which engineers are in which timezone and suggests optimal handoff points.

### Runbook Creation & Generation (Ideas 81-90)

81. **Terminal Watcher** — Opt-in agent that watches your terminal during an incident. After resolution, AI generates a runbook from the commands you ran.
82. **Incident Postmortem → Runbook** — Feed in the postmortem. AI generates the runbook. Close the loop that every team promises but never delivers.
83. **Screen Recording → Runbook** — Record your screen while fixing an issue. AI watches, transcribes, and generates a step-by-step runbook.
84. **Slack Thread → Runbook** — Point at a Slack thread where an incident was resolved. AI extracts the signal from the noise and generates a runbook.
85. **Template Library** — Pre-built runbook templates for common scenarios: "AWS RDS failover", "Kubernetes pod crash loop", "Redis memory pressure", "Certificate expiry".
86. **Infrastructure-Aware Generation** — dd0c/run knows your infrastructure (via dd0c/portal integration). When you deploy a new service, it auto-suggests runbook templates based on the tech stack.
87. **Chaos Engineering Integration** — Run a chaos experiment (Gremlin, LitmusChaos). dd0c/run observes the resolution and generates a runbook from it.
88. **Pair Programming Runbooks** — AI and engineer co-author a runbook interactively. AI asks questions ("What do you check first?"), engineer answers, AI structures it.
89. **Runbook from Architecture Diagram** — Feed in your architecture diagram. AI identifies potential failure points and generates skeleton runbooks for each.
90. **Git-Backed Runbooks** — All runbooks stored as code in a Git repo. Version history, PRs for changes, CI/CD for validation. The runbook-as-code movement.
### Wild & Visionary Ideas (Ideas 91-103)

91. **Incident Simulator / Fire Drills** — Simulate incidents in a sandbox environment. On-call engineers practice runbooks without real consequences. Gamified with scores and leaderboards.
92. **Voice-Guided Runbooks** — At 3am, reading is hard. AI reads the runbook steps aloud through your headphones while you type commands. Hands-free incident response.
93. **Runbook Marketplace** — Community-contributed runbook templates. "Here's how Stripe handles Redis failover." Anonymized, vetted, rated.
94. **Predictive Runbook Staging** — AI predicts incidents before they happen (based on metrics trends) and pre-stages the relevant runbook, pre-approves safe steps, and alerts the on-call: "Heads up, you might need this in 30 minutes."
95. **Natural Language Incident Response** — Engineer types "the database is slow" in Slack. AI figures out which database, runs diagnostics, identifies the issue, and suggests the right runbook. No alert needed.
96. **Runbook Dependency Graph** — Visualize how runbooks relate to each other. "If Runbook A fails, try Runbook B." "Runbook C is a prerequisite for Runbook D."
97. **Self-Healing Runbooks** — Runbooks that detect their own staleness by periodically dry-running against the current infrastructure and flagging broken steps.
98. **Customer-Impact Aware Execution** — AI knows which customers are affected (via the dd0c/portal service catalog) and prioritizes runbook execution based on customer tier/revenue impact.
99. **Regulatory Compliance Mode** — Every runbook execution is logged with a full audit trail. Who ran what, when, what changed. Auto-generates compliance evidence for SOC 2/ISO 27001.
100. **Multi-Cloud Runbook Abstraction** — Write one runbook that works across AWS, GCP, and Azure. AI translates cloud-specific commands based on the target environment.
101. **Runbook Health Dashboard** — Single pane of glass: total runbooks, coverage gaps (services without runbooks), staleness scores, usage frequency, effectiveness ratings.
102. **"What Would Steve Do?" Mode** — AI learns from how senior engineers resolve incidents (via terminal watcher + historical data) and can suggest their approach even when they're not available.
103. **Incident Cost Tracker** — Real-time cost counter during incident: "This outage has cost $12,400 so far. Estimated savings if resolved in next 5 minutes: $8,200."

---
## Phase 3: Differentiation & Moat (18 ideas)

### Beating Rundeck (Free/OSS)

104. **UX Superiority** — Rundeck's UI is from 2015. dd0c/run is Linear-quality UX. Engineers will pay for beautiful, fast tools.
105. **Zero Config** — Rundeck requires Java, a database, and YAML job definitions. dd0c/run: paste your runbook, it works. Time-to-value < 5 minutes.
106. **AI-Native vs. Bolt-On** — Rundeck is a job scheduler with runbook features bolted on. dd0c/run is AI-first. The AI IS the product, not a feature.
107. **SaaS vs. Self-Hosted Burden** — Rundeck requires hosting, patching, upgrading. dd0c/run is managed SaaS. One less thing to maintain.

### Beating PagerDuty Automation Actions

108. **Not Locked to PagerDuty** — PagerDuty Automation Actions only works within PagerDuty. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, or any webhook-based alerting.
109. **Runbook Intelligence vs. Dumb Automation** — PagerDuty runs pre-defined scripts. dd0c/run understands the runbook, adapts to context, handles branching logic, and learns.
110. **Ingestion from Anywhere** — PagerDuty can't import your existing Confluence runbooks. dd0c/run can.
111. **Mid-Market Pricing** — PagerDuty's automation is an expensive add-on to an already expensive product. dd0c/run is $15-30/seat standalone.

### The Data Moat

112. **Runbook Corpus** — Every runbook ingested makes the AI smarter at parsing and structuring new runbooks. Network effect.
113. **Resolution Pattern Database** — "When alert X fires for service type Y, step Z resolves it 87% of the time." This data is incredibly valuable and compounds over time.
114. **Infrastructure Fingerprinting** — dd0c/run learns common failure patterns for specific tech stacks (EKS + RDS + Redis = these 5 failure modes). New customers with similar stacks get instant runbook suggestions.
115. **MTTR Benchmarking** — "Your MTTR for database incidents is 23 minutes. Similar teams average 8 minutes. Here's what they do differently." Anonymized cross-customer intelligence.

### Platform Integration Moat (dd0c Ecosystem)

116. **dd0c/alert → dd0c/run Pipeline** — Alert intelligence identifies the incident, runbook automation resolves it. Together they're 10x more valuable than apart.
117. **dd0c/portal Service Catalog** — dd0c/run knows who owns the service, what it depends on, and who to page. No configuration needed if you're already using portal.
118. **dd0c/cost Integration** — Runbook execution can factor in cost: "This remediation will spin up 3 extra instances costing $X/hour. Approve?"
119. **dd0c/drift Integration** — "This incident was caused by infrastructure drift detected by dd0c/drift. Here's the runbook to remediate AND the drift to fix."
120. **Unified Audit Trail** — All dd0c modules share one audit log. Compliance teams get a single source of truth for incident response, cost decisions, and infrastructure changes.
121. **The "Last Mile" Advantage** — Competitors solve one piece. dd0c solves the whole chain: detect anomaly → correlate alerts → identify runbook → execute resolution → update documentation → generate postmortem.

---
## Phase 4: Anti-Ideas & Red Team (15 ideas)

Time to be brutally honest. Let's stress-test this thing.

### Why This Could Fail

122. **AI Agents Make Runbooks Obsolete** — If autonomous AI agents (Pulumi Neo, GitHub Agentic Workflows) can detect and fix infrastructure issues without human intervention, who needs runbooks? *Counter:* We're 3-5 years from trusting AI to autonomously fix production. Runbooks are the bridge. And even with AI agents, you need runbooks as the "policy" that defines what the agent should do.
123. **Trust Barrier** — Will engineers let an AI run commands in their production environment? The first time dd0c/run makes an incident worse, trust is destroyed forever. *Counter:* Progressive trust model. Start with suggestion-only. Graduate to copilot. Autopilot only for proven runbooks. Never force it.
124. **The AI Makes It Worse** — AI misinterprets a runbook step, executes the wrong command, cascading failure. *Counter:* Risk classification per step. Dangerous steps always require human approval. Dry-run mode. Rollback-aware execution.
125. **Runbook Quality Garbage-In** — If the existing runbooks are terrible (and they usually are), AI can't magically make them good. It'll just execute bad steps faster. *Counter:* AI quality scoring on import. Flag ambiguous, incomplete, or risky runbooks. Suggest improvements before enabling execution.
126. **Security & Compliance Nightmare** — dd0c/run needs access to production systems to execute commands. That's a massive attack surface and compliance concern. *Counter:* Agent-based architecture (as the dd0c brand strategy specifies). The agent runs in their VPC. The SaaS never sees credentials. SOC 2 compliance from day one.
127. **Small Market?** — How many teams actually have runbooks worth automating? Most teams don't have runbooks at all. *Counter:* That's the opportunity. dd0c/run doesn't just automate existing runbooks — it helps CREATE them. The terminal watcher and postmortem-to-runbook features address teams with zero runbooks.
128. **Rundeck is Free** — Why pay for dd0c/run when Rundeck is open source? *Counter:* Rundeck is a job scheduler, not an AI runbook engine. It's like comparing Notepad to VS Code. Different products for different eras.
129. **PagerDuty/Rootly Acqui-Hire the Space** — Big players could build or acquire this capability. *Counter:* PagerDuty is slow-moving and enterprise-focused. Rootly is incident management, not runbook execution. By the time they build it, dd0c/run has the data moat.
130. **Engineer Resistance** — "I don't need an AI to tell me how to do my job." Cultural resistance from senior engineers. *Counter:* Position it as a tool for the 3am junior engineer, not the senior. Seniors benefit because they stop getting paged for things juniors can now handle.
131. **Integration Fatigue** — Yet another tool to integrate with PagerDuty, Slack, AWS, etc. *Counter:* The dd0c platform handles integrations once. dd0c/run inherits them.
132. **Latency During Incidents** — If dd0c/run adds latency to incident response (loading, parsing, waiting for AI), engineers will bypass it. *Counter:* Pre-stage runbooks. Cache everything. AI inference must be < 2 seconds. If it's slower than reading the doc, it's useless.
133. **Liability** — If dd0c/run's AI suggestion causes data loss, who's liable? *Counter:* Clear ToS. AI suggests, human approves (in copilot mode). Audit trail proves the human clicked "Execute."
134. **Hallucination Risk** — AI "invents" a runbook step that doesn't exist in the source material. *Counter:* Strict grounding. Every suggested step must trace back to the source runbook. Hallucination detection layer. Never generate steps that aren't in the original document unless explicitly in "creative" mode.
135. **Chicken-and-Egg: No Runbooks = No Product** — Teams without runbooks can't use dd0c/run. *Counter:* Terminal watcher, postmortem mining, Slack thread scraping, and the template library all solve cold-start. dd0c/run creates runbooks, not just executes them.
136. **Pricing Pressure** — If the market commoditizes AI runbook execution, margins collapse. *Counter:* The moat isn't the execution engine — it's the resolution pattern database and the dd0c platform integration. Those compound over time.

---
## Phase 5: Synthesis

### Top 10 Ideas (Ranked by Impact × Feasibility)

| Rank | Idea | # | Why |
|------|------|---|-----|
| 1 | **Copilot Mode (Human-in-the-Loop Execution)** | 47 | The core product. AI suggests, human approves. Safe, trustworthy, immediately valuable. This IS dd0c/run. |
| 2 | **Auto-Attach Runbook to Incident** | 55 | The killer integration. Alert fires → runbook appears. Solves the #1 problem (engineers don't know the runbook exists). |
| 3 | **Risk Classification per Step** | 41 | The trust enabler. Green/yellow/red labeling makes engineers comfortable letting AI execute safe steps while maintaining control over dangerous ones. |
| 4 | **Confluence/Notion/Markdown Ingestion** | 26-28 | Meet teams where they are. Zero migration friction. Import existing runbooks in minutes. |
| 5 | **Prose-to-Steps AI Converter** | 37 | The magic moment. Paste a wall of text, get a structured executable runbook. This is the demo that sells the product. |
| 6 | **Terminal Watcher → Auto-Generate Runbook** | 81 | Solves the cold-start problem AND the "seniors won't write runbooks" problem. The runbook writes itself. |
| 7 | **Resolution Tracking & Auto-Update Suggestions** | 63-64 | The learning loop that kills runbook rot. Runbooks get better with every incident instead of decaying. |
| 8 | **Slack Bot Integration** | 57 | Meet engineers where they already are during incidents. No context switching. `/ddoc run` in the incident channel. |
| 9 | **Runbook Effectiveness Score** | 65 | Data-driven runbook management. Surface the worst runbooks, celebrate the best. Gamification of operational excellence. |
| 10 | **dd0c/alert → dd0c/run Pipeline** | 116 | The platform play. Alert intelligence + runbook automation = the complete incident response stack. This is how you beat point solutions. |
### 3 Wild Cards 🃏

1. **🃏 Incident Simulator / Fire Drills (#91)** — Gamified incident response training. Engineers practice runbooks in a sandbox. Leaderboards, scores, team competitions. This could be the viral growth mechanism — "My team's incident response score is 94. What's yours?" Could become a standalone product.

2. **🃏 Voice-Guided Runbooks (#92)** — At 3am, your eyes are barely open. What if dd0c/run talked you through the incident like a calm co-pilot? "Step 3: SSH into the bastion. The command is ready in your clipboard. Press enter when ready." This is genuinely differentiated — nobody else is doing audio-guided incident response.

3. **🃏 "What Would Steve Do?" Mode (#102)** — AI learns senior engineers' incident response patterns and can replicate their decision-making. This is the ultimate knowledge capture tool. When Steve leaves the company, his expertise stays. Emotionally compelling pitch for engineering managers worried about bus factor.

### Recommended V1 Scope

**V1 = "Paste → Parse → Page → Pilot"**

The minimum viable product that delivers immediate value:

1. **Ingest** — Paste a runbook (plain text, markdown) or connect Confluence/Notion. AI parses it into structured, executable steps with risk classification (green/yellow/red).

2. **Match** — PagerDuty/OpsGenie webhook integration. When an alert fires, dd0c/run matches it to the most relevant runbook using semantic similarity + alert metadata.

3. **Execute (Copilot Mode)** — Slack bot or web UI walks the on-call engineer through the runbook step-by-step. Auto-executes green (safe) steps. Prompts for approval on yellow/red steps. Pre-fills commands with context from the alert.

4. **Learn** — Track which steps were executed, skipped, or modified. After incident resolution, suggest runbook updates based on what actually happened vs. what the runbook said.
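Step 2's "semantic similarity + alert metadata" could combine cosine similarity over stored runbook embeddings with a small boost for service-name matches. A sketch (the embeddings are assumed to come from the vector DB in the tech stack; the 0.1 boost weight is invented for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def match_runbook(alert_vec, runbooks, alert_service=None):
    # runbooks: list of (name, embedding, owning_service) tuples.
    best, best_score = None, -1.0
    for name, vec, service in runbooks:
        score = cosine(alert_vec, vec)
        if alert_service and service == alert_service:
            score += 0.1  # metadata boost: runbook owned by the alerting service
        if score > best_score:
            best, best_score = name, score
    return best
```

The metadata boost is what keeps a semantically close but wrong-service runbook from winning a near-tie, which matters when several services share similar failure language.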

**What V1 does NOT include:**

- Terminal watcher (V2)
- Full autopilot mode (V2 — need trust first)
- Incident simulator (V3)
- Multi-cloud abstraction (V3)
- Runbook marketplace (V4)

**V1 Success Metrics:**

- Time-to-first-runbook: < 5 minutes (paste and go)
- MTTR reduction: 40%+ for teams using dd0c/run vs. manual runbook following
- Runbook coverage: surface services with zero runbooks, track coverage growth
- NPS from on-call engineers: > 50 (they actually LIKE being on-call now)

**V1 Tech Stack:**

- Lightweight agent (Rust/Go) runs in the customer VPC for command execution
- SaaS dashboard + Slack bot for the UI
- OpenAI/Anthropic for runbook parsing and step generation (use dd0c/route for cost optimization — eat your own dog food)
- PagerDuty + OpsGenie webhooks for alert integration
- PostgreSQL + vector DB for runbook storage and semantic matching

**V1 Pricing:**

- Free: 3 runbooks, suggestion-only mode
- Pro ($25/seat/month): Unlimited runbooks, copilot execution, Slack bot
- Business ($49/seat/month): Autopilot mode, API access, SSO, audit trail

---

*Total ideas generated: 136*

*Session complete. Let's build this thing.* 🔥
1097
products/06-runbook-automation/design-thinking/session.md
Normal file
File diff suppressed because it is too large
552
products/06-runbook-automation/epics/epics.md
Normal file
# dd0c/run — V1 MVP Epics

This document breaks down the V1 MVP of dd0c/run into implementation Epics and User Stories. Scope is strictly limited to V1 (read-only execution 🟢, copilot approval for 🟡/🔴, no autopilot, no terminal watcher).

---

## Epic 1: Runbook Parser

**Description:** The "5-second wow moment." Ingest raw unstructured text from a paste event, normalize it, and use a fast LLM (e.g., Claude Haiku via dd0c/route) to extract a structured JSON representation of executable steps, variables, conditional branches, and prerequisites in under 5 seconds.

**Dependencies:** None. This is the foundational data structure.

**Technical Notes:**

- Pure Rust normalizer to strip HTML/Markdown before hitting the LLM to save tokens and latency.
- LLM prompt must enforce a strict JSON schema.
- Idempotent parsing: same text + temperature 0 = same output.
- Must run in < 3.5s to leave room for the Action Classifier within the 5s SLA.
### User Stories

**Story 1.1: Raw Text Normalization & Ingestion**

*As a runbook author (Morgan), I want to paste raw text from Confluence or Notion into the system, so that I don't have to learn a proprietary YAML DSL.*

- **Acceptance Criteria:**
  - System accepts raw text payload via API.
  - Rust-based normalizer strips HTML tags, markdown formatting, and normalizes whitespace/bullet points.
  - Original raw text and hash are preserved in the DB for audit/re-parsing.
- **Story Points:** 2
- **Dependencies:** None
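The normalizer in Story 1.1 can be sketched in a few lines of stdlib-only Rust. This is illustrative only — the shipped normalizer would also handle HTML entities, code fences, and nested lists:

```rust
// Minimal sketch of the Story 1.1 normalizer, assuming stdlib only: strip
// HTML tags and leading markdown decoration, then drop blank lines. The real
// normalizer would also handle entities, code fences, and nested lists.
fn normalize(raw: &str) -> String {
    // Pass 1: drop anything inside <...> tags.
    let mut no_tags = String::with_capacity(raw.len());
    let mut in_tag = false;
    for c in raw.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => no_tags.push(c),
            _ => {}
        }
    }
    // Pass 2: strip leading bullet/heading markers and empty lines.
    no_tags
        .lines()
        .map(|l| l.trim().trim_start_matches(|c: char| matches!(c, '-' | '*' | '#' | ' ')))
        .filter(|l| !l.is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "<h2>Restart checklist</h2>\n- Step 1: check pods\n* Step 2: restart";
    println!("{}", normalize(raw));
}
```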
**Story 1.2: LLM Structured Step Extraction**

*As a system, I want to pass normalized text to a fast LLM to extract an ordered JSON array of steps and commands, so that the runbook becomes machine-readable.*

- **Acceptance Criteria:**
  - Sends normalized text to `dd0c/route` with a strict JSON schema prompt.
  - Correctly extracts step order, natural language description, and embedded CLI/shell commands.
  - P95 latency for extraction is < 3.5 seconds.
  - Rejects/errors gracefully if the text contains no actionable steps.
- **Story Points:** 3
- **Dependencies:** 1.1

**Story 1.3: Variable & Prerequisite Detection**

*As an on-call engineer (Riley), I want the parser to identify implicit prerequisites (like VPN) and variables (like `<instance-id>`), so that I'm fully prepared before the runbook starts.*

- **Acceptance Criteria:**
  - Regex/heuristic scanner identifies common placeholders (`$VAR`, `<var>`, `{var}`).
  - LLM identifies implicit requirements ("ensure you are on VPN", "requires prod AWS profile").
  - Outputs a structured array of `variables` and `prerequisites` in the JSON payload.
- **Story Points:** 2
- **Dependencies:** 1.2
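The heuristic half of Story 1.3 can be sketched as a single scan over the command text. This hand-rolled scanner stands in for the compiled regex sets the story calls for:

```rust
// Heuristic placeholder scanner for Story 1.3 (illustrative — the shipped
// version would use compiled regex sets). Detects `$VAR`, `<var>`, `{var}`.
fn find_placeholders(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut found = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            '$' => {
                // $VAR: dollar sign followed by identifier characters.
                let start = i + 1;
                let mut j = start;
                while j < chars.len() && (chars[j].is_alphanumeric() || chars[j] == '_') {
                    j += 1;
                }
                if j > start {
                    found.push(chars[i..j].iter().collect());
                }
                i = j.max(i + 1);
            }
            c if c == '<' || c == '{' => {
                // <var> or {var}: bracketed identifier (hyphens allowed).
                let close = if c == '<' { '>' } else { '}' };
                let start = i + 1;
                let mut j = start;
                while j < chars.len() && (chars[j].is_alphanumeric() || chars[j] == '-' || chars[j] == '_') {
                    j += 1;
                }
                if j > start && j < chars.len() && chars[j] == close {
                    found.push(chars[i..=j].iter().collect());
                    i = j + 1;
                } else {
                    i += 1;
                }
            }
            _ => i += 1,
        }
    }
    found
}

fn main() {
    let cmd = "aws ec2 reboot-instances --instance-ids <instance-id> --region $AWS_REGION";
    println!("{:?}", find_placeholders(cmd));
}
```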
**Story 1.4: Branching & Ambiguity Highlighting**

*As a runbook author (Morgan), I want the system to map conditional logic (if/else) and flag vague instructions, so that my runbooks are deterministic and clear.*

- **Acceptance Criteria:**
  - Parser detects simple "if X then Y" statements and maps them to a step execution DAG (Directed Acyclic Graph).
  - Identifies vague steps (e.g., "check the logs" without specifying which logs) and adds them to an `ambiguities` array for author review.
- **Story Points:** 3
- **Dependencies:** 1.2

## Epic 2: Action Classifier

**Description:** The safety-critical core of the system. Classifies every extracted command as 🟢 Safe, 🟡 Caution, 🔴 Dangerous, or ⬜ Unknown. Uses a defense-in-depth "dual-key" architecture: a deterministic compiled Rust scanner overrides an advisory LLM classifier. A misclassification here is an existential risk.

**Dependencies:** Epic 1 (needs parsed steps to classify)

**Technical Notes:**

- The deterministic scanner must use tree-sitter or compiled regex sets.
- Merge rules must be hardcoded in Rust (not configurable).
- LLM classification must run in parallel across steps to meet the 5-second total parse+classify SLA.
- If the LLM confidence is low or unknown, risk escalates. Scanner wins disagreements.

### User Stories

**Story 2.1: Deterministic Safety Scanner**

*As a system, I want to pattern-match commands against a compiled database of known signatures, so that destructive commands are definitively blocked without relying on an LLM.*

- **Acceptance Criteria:**
  - Rust library with compiled regex sets (allowlist 🟢, caution 🟡, blocklist 🔴).
  - Uses tree-sitter to parse SQL/shell ASTs to detect destructive patterns (e.g., `DELETE` without `WHERE`, piped `rm -rf`).
  - Executes in < 1ms per command.
  - Defaults to `Unknown` (🟡 minimum) if no patterns match.
- **Story Points:** 5
- **Dependencies:** None (standalone library)
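A toy version of the Story 2.1 scanner shows the classification shape and the Unknown default. The real scanner uses compiled regex sets and tree-sitter ASTs; the signature lists here are invented examples:

```rust
// Toy version of the Story 2.1 deterministic scanner. The real scanner uses
// compiled regex sets and tree-sitter ASTs; this sketch uses substring
// signatures only, and the signature lists are invented examples.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScanResult {
    Safe,      // 🟢 matched allowlist
    Caution,   // 🟡 matched caution list
    Dangerous, // 🔴 matched blocklist
    Unknown,   // ⬜ no pattern matched — treated as 🟡 minimum downstream
}

fn scan(command: &str) -> ScanResult {
    const BLOCK: [&str; 3] = ["rm -rf", "drop table", "terminate-instances"];
    const CAUTION: [&str; 2] = ["rollout restart", "systemctl restart"];
    const ALLOW: [&str; 3] = ["kubectl get", "kubectl describe", "aws s3 ls"];

    let cmd = command.to_lowercase();
    // Blocklist wins first: destructive signatures override everything.
    if BLOCK.iter().any(|sig| cmd.contains(sig)) {
        return ScanResult::Dangerous;
    }
    if CAUTION.iter().any(|sig| cmd.contains(sig)) {
        return ScanResult::Caution;
    }
    if ALLOW.iter().any(|sig| cmd.starts_with(sig)) {
        return ScanResult::Safe;
    }
    ScanResult::Unknown // never guess "safe" for unrecognized commands
}

fn main() {
    for c in ["kubectl get pods", "rm -rf /var/log/app", "./custom-fix.sh"] {
        println!("{c:?} => {:?}", scan(c));
    }
}
```

Note the ordering: the blocklist is checked before the allowlist, so a command that matches both (e.g., a safe prefix piped into `rm -rf`) still comes out 🔴.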
**Story 2.2: LLM Contextual Classifier**

*As a system, I want an LLM to assess the command in the context of surrounding steps, so that implicit state changes or custom scripts are caught.*

- **Acceptance Criteria:**
  - Sends the command, surrounding steps, and infrastructure context to an LLM via a strict JSON prompt.
  - Returns a classification (🟢/🟡/🔴), a confidence score, and suggested rollbacks.
  - Escalates the classification to the next risk tier if confidence < 0.9.
- **Story Points:** 3
- **Dependencies:** Epic 1 (requires structured context)

**Story 2.3: Classification Merge Engine**

*As an on-call engineer (Riley), I want the system to safely merge the LLM and deterministic scanner results, so that I can trust the final 🟢/🟡/🔴 label.*

- **Acceptance Criteria:**
  - Implements the 5 hardcoded merge rules exactly as architected (e.g., if Scanner says 🔴, final is 🔴; if Scanner says 🟡 and LLM says 🟢, final is 🟡).
  - The final risk level is `Safe` ONLY if both the Scanner and the LLM agree it is safe.
- **Story Points:** 2
- **Dependencies:** 2.1, 2.2
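The merge behavior above can be sketched compactly. The example rules given (🔴 scanner wins, 🟡 + 🟢 → 🟡, Safe only on agreement) are all consistent with a max-severity merge, which is what this sketch implements — the real 5-rule table may encode more nuance:

```rust
// Sketch of the Story 2.3 merge behavior as a max-severity merge; the real
// hardcoded 5-rule table may encode more nuance.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Risk {
    Safe,      // 🟢
    Caution,   // 🟡
    Dangerous, // 🔴
}

/// Bump a verdict one tier (Story 2.2: low confidence escalates).
fn escalate(r: Risk) -> Risk {
    match r {
        Risk::Safe => Risk::Caution,
        _ => Risk::Dangerous,
    }
}

fn merge(scanner: Risk, llm: Risk, llm_confidence: f64) -> Risk {
    let llm = if llm_confidence < 0.9 { escalate(llm) } else { llm };
    // Safe only if BOTH say Safe; otherwise the more severe verdict wins.
    scanner.max(llm)
}

fn main() {
    println!("{:?}", merge(Risk::Caution, Risk::Safe, 0.99)); // scanner 🟡 wins
    println!("{:?}", merge(Risk::Safe, Risk::Safe, 0.5));     // low confidence escalates
}
```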
**Story 2.4: Immutable Classification Audit**

*As an SRE manager (Jordan), I want every classification decision logged with full context, so that I have a forensic record of why the system made its choice.*

- **Acceptance Criteria:**
  - Emits a `runbook.classified` event to the PostgreSQL database for every step.
  - Log includes: scanner result (matched patterns), LLM reasoning, confidence scores, final classification, and merge rule applied.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 3: Execution Engine

**Description:** A state machine orchestrating the step-by-step runbook execution, enforcing the Trust Gradient (Levels 0, 1, 2) at every transition. It never auto-executes 🟡 or 🔴 commands in V1. Handles timeouts, rollbacks, and skipped-step tracking.

**Dependencies:** Epics 1 & 2 (needs classified runbooks to execute)

**Technical Notes:**

- V1 is Copilot-only. State transitions must block 🔴 and prompt for 🟡.
- Engine must communicate with the Agent via gRPC over mTLS.
- Each step execution receives a unique ID to prevent duplicate deliveries.

### User Stories

**Story 3.1: Execution State Machine**

*As a system, I want a state machine to orchestrate step-by-step runbook execution, so that no two commands execute simultaneously and the Trust Gradient is enforced.*

- **Acceptance Criteria:**
  - Starts in `Pending` state upon alert match.
  - Progresses to `AutoExecute` ONLY if the step is 🟢 and the trust level allows.
  - Transitions to `AwaitApproval` for 🟡/🔴 steps, blocking execution until approved.
  - Aborts or transitions to `ManualIntervention` on timeout (e.g., 60s for 🟢, 300s for 🔴).
- **Story Points:** 5
- **Dependencies:** 2.3
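The core transition of Story 3.1 can be sketched as a pure function. State names follow the acceptance criteria; the trust-level guard is an assumption (level 0 = suggest only):

```rust
// Sketch of the Story 3.1 state machine. State names follow the acceptance
// criteria; the trust-level guard is an assumption (level 0 = suggest only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Risk { Green, Yellow, Red }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StepState {
    Pending,
    AutoExecute,        // 🟢 only, and only if the trust level allows
    AwaitApproval,      // 🟡/🔴 block here until a human approves
    ManualIntervention, // timeout or unrecoverable failure
}

/// Decide the next state for a step; approval and timeout transitions are
/// event-driven and omitted from this sketch.
fn next_state(state: StepState, risk: Risk, trust_level: u8) -> StepState {
    match (state, risk) {
        (StepState::Pending, Risk::Green) if trust_level >= 1 => StepState::AutoExecute,
        (StepState::Pending, _) => StepState::AwaitApproval, // V1 never auto-runs 🟡/🔴
        (s, _) => s,
    }
}

fn main() {
    println!("{:?}", next_state(StepState::Pending, Risk::Green, 1));
    println!("{:?}", next_state(StepState::Pending, Risk::Red, 2));
    let _ = (Risk::Yellow, StepState::ManualIntervention); // silence unused-variant warnings
}
```

Because `next_state` is pure, the "no two commands execute simultaneously" invariant lives in the single-threaded driver loop that owns the state, not in the transition logic itself.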
**Story 3.2: gRPC Agent Communication Protocol**

*As a system, I want to communicate securely with the dd0c Agent in the customer VPC, so that commands are executed safely without inbound firewall rules.*

- **Acceptance Criteria:**
  - Outbound-only gRPC streaming connection initiated by the Agent.
  - Engine sends `ExecuteStep` payload (command, timeout, risk level, environment variables).
  - Agent streams `StepOutput` (stdout/stderr) back to the Engine.
  - Agent returns `StepResult` (exit code, duration, stdout/stderr hashes) on completion.
- **Story Points:** 5
- **Dependencies:** 3.1

**Story 3.3: Rollback Integration**

*As an on-call engineer (Riley), I want every state-changing step to record its inverse command, so that I can undo it with one click if it fails.*

- **Acceptance Criteria:**
  - If a step fails, the Engine transitions to `RollbackAvailable` and awaits human approval.
  - Engine stores the rollback command before executing the forward command.
  - Executing rollback triggers an independent execution step and returns the state machine to `StepReady` or `ManualIntervention`.
- **Story Points:** 3
- **Dependencies:** 3.1

**Story 3.4: Divergence Analysis**

*As an SRE manager (Jordan), I want the system to track what the engineer actually did vs. what the runbook prescribed, so that runbooks get better over time.*

- **Acceptance Criteria:**
  - Post-execution analyzer compares prescribed steps with executed/skipped steps and any unlisted commands detected by the agent.
  - Flags skipped steps, modified commands, and unlisted actions.
  - Emits a `divergence.detected` event with suggested runbook updates.
- **Story Points:** 3
- **Dependencies:** 3.1
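The skipped/unlisted buckets of Story 3.4 reduce to a set difference. A sketch (real matching would normalize commands and detect edited arguments, not just exact matches):

```rust
use std::collections::HashSet;

// Sketch of the Story 3.4 divergence analyzer: compare prescribed vs.
// actually executed commands. Real matching would normalize commands and
// detect edits; this shows only the skipped/unlisted buckets.
fn divergence<'a>(prescribed: &[&'a str], executed: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let ran: HashSet<&str> = executed.iter().copied().collect();
    let planned: HashSet<&str> = prescribed.iter().copied().collect();
    let skipped: Vec<&str> = prescribed.iter().copied().filter(|c| !ran.contains(c)).collect();
    let unlisted: Vec<&str> = executed.iter().copied().filter(|c| !planned.contains(c)).collect();
    (skipped, unlisted)
}

fn main() {
    let prescribed = ["kubectl get pods", "kubectl rollout restart deploy/api", "kubectl get svc"];
    let executed = ["kubectl get pods", "kubectl delete pod api-7f9c"];
    let (skipped, unlisted) = divergence(&prescribed, &executed);
    println!("skipped: {skipped:?}");
    println!("unlisted: {unlisted:?}");
}
```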
## Epic 4: Slack Bot Copilot

**Description:** The primary 3am interface for on-call engineers. Guides the engineer step-by-step through an active incident, enforcing the Trust Gradient via interactive buttons for 🟡 actions and explicit typed confirmations for 🔴 actions.

**Dependencies:** Epic 3 (needs execution state to drive the UI)

**Technical Notes:**

- Uses Rust + Slack Bolt SDK in Socket Mode (no inbound webhooks).
- Must use Slack Block Kit for formatting.
- Respects Slack's 1 message/sec/channel rate limit by batching rapid updates.
- Uses dedicated threads to keep main channels clean.

### User Stories

**Story 4.1: Alert Matching & Notification**

*As an on-call engineer (Riley), I want a Slack message with the relevant runbook when an alert fires, so that I don't have to search for it.*

- **Acceptance Criteria:**
  - Integrates with PagerDuty/OpsGenie webhook payloads.
  - Matches alert context (service, region, keywords) to the most relevant runbook.
  - Posts a `🔔 Runbook matched` message with a [▶ Start Copilot] button in the incident channel.
- **Story Points:** 3
- **Dependencies:** None

**Story 4.2: Step-by-Step Interactive UI**

*As an on-call engineer (Riley), I want the bot to guide me through the runbook step-by-step in a thread, so that I can focus on one command at a time.*

- **Acceptance Criteria:**
  - Tapping [▶ Start Copilot] opens a thread and triggers the Execution Engine.
  - Shows 🟢 steps auto-executing with live stdout snippets.
  - Shows 🟡/🔴 steps awaiting approval with command details and rollback.
  - Includes [⏭ Skip] and [🛑 Abort] buttons for every step.
- **Story Points:** 5
- **Dependencies:** 3.1, 4.1

**Story 4.3: Risk-Aware Approval Gates**

*As a system, I want to block dangerous commands until explicit confirmation is provided, so that accidental misclicks don't cause outages.*

- **Acceptance Criteria:**
  - 🟡 steps require clicking [✅ Approve] or [✏️ Edit].
  - 🔴 steps require opening a text input modal and typing the resource name exactly.
  - Cannot bulk-approve steps. Each step must be individually gated.
  - Approver's Slack identity is captured and passed to the Execution Engine.
- **Story Points:** 3
- **Dependencies:** 4.2

**Story 4.4: Real-Time Output Streaming**

*As an on-call engineer (Riley), I want to see the execution output of commands in real-time, so that I know what's happening.*

- **Acceptance Criteria:**
  - Slack messages update dynamically with command stdout/stderr.
  - Batches rapid output updates to respect Slack rate limits.
  - Truncates long outputs and provides a link to the full log in the dashboard.
- **Story Points:** 2
- **Dependencies:** 4.2
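The batching required by Story 4.4 can be sketched as a small coalescing buffer. The flush trigger here is pull-based (checked on each push); a real bot would also flush on a timer so trailing output is never stranded:

```rust
use std::time::{Duration, Instant};

// Sketch of the Story 4.4 batcher: coalesce rapid stdout chunks into at most
// one Slack message update per second per channel. Pull-based flushing only;
// a real bot would also flush on a timer.
struct OutputBatcher {
    buffer: String,
    last_flush: Option<Instant>,
    min_interval: Duration,
}

impl OutputBatcher {
    fn new() -> Self {
        Self {
            buffer: String::new(),
            last_flush: None,
            min_interval: Duration::from_secs(1), // Slack: 1 msg/sec/channel
        }
    }

    /// Append a chunk; returns the batched text when it is time to flush.
    fn push(&mut self, chunk: &str) -> Option<String> {
        self.buffer.push_str(chunk);
        let due = match self.last_flush {
            None => true, // first chunk flushes immediately
            Some(t) => t.elapsed() >= self.min_interval,
        };
        if due && !self.buffer.is_empty() {
            self.last_flush = Some(Instant::now());
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}

fn main() {
    let mut b = OutputBatcher::new();
    println!("{:?}", b.push("pod/api-1 restarted\n")); // flushes immediately
    println!("{:?}", b.push("pod/api-2 restarted\n")); // buffered for the next window
}
```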
## Epic 5: Audit Trail

**Description:** The compliance backbone and forensic record. An append-only, immutable, partitioned PostgreSQL log that tracks every runbook parse, classification, approval, execution, and rollback.

**Dependencies:** Epics 2, 3, 4 (needs events to log)

**Technical Notes:**

- PostgreSQL partitioned table (by month) for query performance.
- Application role must have `INSERT` and `SELECT` only. No `UPDATE` or `DELETE` grants.
- Row-Level Security (RLS) enforces tenant isolation at the database level.
- V1 includes a basic PDF/CSV export for SOC 2 readiness.

### User Stories

**Story 5.1: Append-Only Audit Schema**

*As a system, I want a partitioned database table to store all execution events immutably, so that no one can tamper with the forensic record.*

- **Acceptance Criteria:**
  - Creates a PostgreSQL table partitioned by month.
  - Revokes `UPDATE` and `DELETE` grants from the application role.
  - Schema captures `tenant_id`, `event_type`, `execution_id`, `actor_id`, `event_data` (JSONB), and `created_at`.
- **Story Points:** 2
- **Dependencies:** None

**Story 5.2: Event Ingestion Pipeline**

*As an SRE manager (Jordan), I want every action recorded in real-time, so that I know exactly who did what during an incident.*

- **Acceptance Criteria:**
  - Ingests events from Parser, Classifier, Execution Engine, and Slack Bot.
  - Logs `runbook.parsed`, `runbook.classified`, `step.auto_executed`, `step.approved`, `step.executed`, etc., exactly as architected.
  - Captures the exact command executed, exit code, and stdout/stderr hashes.
- **Story Points:** 3
- **Dependencies:** 5.1

**Story 5.3: Compliance Export**

*As an SRE manager (Jordan), I want to export timestamped execution logs, so that I can provide evidence to SOC 2 auditors.*

- **Acceptance Criteria:**
  - Generates a PDF or CSV of an execution run, including the approval chain and audit trail.
  - Includes risk classifications and any divergence/modifications made by the engineer.
  - Links to S3 for full stdout/stderr logs if needed.
- **Story Points:** 3
- **Dependencies:** 5.2

**Story 5.4: Multi-Tenant Data Isolation (RLS)**

*As a system, I want to ensure no tenant can see another tenant's data, so that security boundaries are guaranteed at the database level.*

- **Acceptance Criteria:**
  - Enables Row-Level Security (RLS) on all tenant-scoped tables (`runbooks`, `executions`, `audit_events`).
  - API middleware sets `app.current_tenant_id` session variable on every database connection.
  - Cross-tenant queries return zero rows, not an error.
- **Story Points:** 2
- **Dependencies:** 5.1
## Epic 6: Dashboard API

**Description:** The control plane REST API served by the dd0c API Gateway. Handles runbook CRUD, parsing endpoints, execution history, and integration with the Alert-Runbook Matcher. Provides the backend for the Dashboard UI and external integrations.

**Dependencies:** Epics 1, 3, 5

**Technical Notes:**

- Served via Axum (Rust) and secured with JWT.
- Integrates with the shared dd0c API Gateway.
- Enforces tenant isolation (RLS) on every request.

### User Stories

**Story 6.1: Runbook CRUD & Parsing Endpoints**

*As a runbook author (Morgan), I want to create, read, update, and soft-delete runbooks via a REST API, so that I can manage my team's active runbooks.*

- **Acceptance Criteria:**
  - Implements `POST /runbooks` (paste raw text → auto-parse), `GET /runbooks`, `GET /runbooks/:id/versions`, `PUT /runbooks/:id`, `DELETE /runbooks/:id`.
  - Implements `POST /runbooks/parse-preview` for the 5-second wow moment (parses without saving).
  - Returns 422 if parsing fails.
- **Story Points:** 3
- **Dependencies:** 1.2

**Story 6.2: Execution History & Status Queries**

*As an SRE manager (Jordan), I want to view active and completed execution runs, so that I can monitor incident response times and skipped steps.*

- **Acceptance Criteria:**
  - Implements `POST /executions` to start copilot manually.
  - Implements `GET /executions` and `GET /executions/:id` to retrieve status, steps, exit codes, and durations.
  - Implements `GET /executions/:id/divergence` to get the post-execution analysis (skipped steps, modified commands).
- **Story Points:** 2
- **Dependencies:** 3.1, 5.2

**Story 6.3: Alert-Runbook Matcher Integration**

*As a system, I want to match incoming webhooks (e.g., PagerDuty) to the correct runbook, so that the Slack bot can post the runbook immediately.*

- **Acceptance Criteria:**
  - Implements `POST /webhooks/pagerduty`, `POST /webhooks/opsgenie`, and `POST /webhooks/dd0c-alert`.
  - Uses keyword + metadata matching (and optionally pgvector similarity) to find the best runbook for the alert payload.
  - Generates the `alert_context` and triggers the Slack bot notification (Epic 4).
- **Story Points:** 3
- **Dependencies:** 6.1
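The keyword half of the Story 6.3 matcher can be sketched as overlap scoring. The shipped matcher would add pgvector semantic similarity; the runbook names and tags below are invented examples:

```rust
// Sketch of the keyword half of the Story 6.3 matcher: score each runbook by
// token overlap with alert metadata. Names and tags are invented examples;
// the shipped matcher would add pgvector semantic similarity.
fn best_match<'a>(alert_tokens: &[&str], runbooks: &[(&'a str, &[&str])]) -> Option<&'a str> {
    runbooks
        .iter()
        .map(|(name, tags)| {
            let score = alert_tokens.iter().copied().filter(|t| tags.contains(t)).count();
            (*name, score)
        })
        .filter(|(_, score)| *score > 0) // never match on zero overlap
        .max_by_key(|(_, score)| *score)
        .map(|(name, _)| name)
}

fn main() {
    let runbooks: Vec<(&str, &[&str])> = vec![
        ("payments-restart", &["payments", "5xx", "restart"][..]),
        ("db-failover", &["postgres", "replica", "lag"][..]),
    ];
    println!("{:?}", best_match(&["payments", "5xx", "prod"], &runbooks));
}
```

The zero-overlap filter matters: posting a wrong runbook is worse than posting none, so the matcher should abstain below a confidence threshold rather than always returning its best guess.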
**Story 6.4: Classification Query Endpoints**

*As a runbook author (Morgan), I want to query the classification engine directly, so that I can test commands before publishing a runbook.*

- **Acceptance Criteria:**
  - Implements `POST /classify` for testing/debugging.
  - Implements `GET /classifications/:step_id` to retrieve full classification details.
  - Rate-limited to 30 req/min per tenant.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 7: Dashboard UI

**Description:** A React Single-Page Application (SPA) for runbook authors and managers to view runbooks, past executions, and risk classifications. The primary interface for onboarding, reviewing runbooks, and analyzing post-incident data.

**Dependencies:** Epic 6 (needs APIs to consume)

**Technical Notes:**

- React SPA, integrates with the shared dd0c portal.
- Displays the 5-second "wow moment" parse preview.
- Must visually distinguish 🟢 Safe, 🟡 Caution, and 🔴 Dangerous commands.

### User Stories

**Story 7.1: Runbook Paste & Preview UI**

*As a runbook author (Morgan), I want to paste raw text and instantly see the parsed steps, so that I can verify the structure before saving.*

- **Acceptance Criteria:**
  - Large text area for pasting raw runbook text.
  - Calls `POST /runbooks/parse-preview` and displays the structured steps.
  - Highlights variables, prerequisites, and ambiguities.
  - Allows editing the raw text to trigger a re-parse.
- **Story Points:** 3
- **Dependencies:** 6.1
**Story 7.2: Execution Timeline & Divergence View**

*As an SRE manager (Jordan), I want to view a timeline of an incident's execution, so that I can see what was skipped or modified.*

- **Acceptance Criteria:**
  - Displays the execution run with a visual timeline of steps.
  - Shows who approved each step, the exact command executed, and the exit code.
  - Highlights divergence (skipped steps, modified commands, unlisted actions).
  - Provides an "Apply Updates" button to update the runbook based on divergence.
- **Story Points:** 5
- **Dependencies:** 6.2
**Story 7.3: Trust Level & Risk Visualization**

*As a runbook author (Morgan), I want to clearly see the risk level of each step in my runbook, so that I know what requires approval.*

- **Acceptance Criteria:**
  - Color-codes steps: 🟢 Green (Safe), 🟡 Yellow (Caution), 🔴 Red (Dangerous).
  - Clicking a risk badge shows the classification reasoning (scanner rules matched, LLM explanation).
  - Displays the runbook's overall Trust Level (0, 1, 2) and allows authorized users to change it (V1 max = 2).
- **Story Points:** 3
- **Dependencies:** 7.1

**Story 7.4: Basic Health & MTTR Dashboard**

*As an SRE manager (Jordan), I want a high-level view of my team's runbooks and execution stats, so that I can measure the value of dd0c/run.*

- **Acceptance Criteria:**
  - Displays a list of runbooks, their coverage, and average staleness.
  - Shows MTTR (Mean Time To Resolution) for incidents handled via Copilot.
  - Displays the total number of Copilot runs and skipped steps per month.
- **Story Points:** 2
- **Dependencies:** 6.2
## Epic 8: Infrastructure & DevOps

**Description:** The foundational cloud infrastructure and CI/CD pipelines required to run the dd0c/run SaaS platform and build the customer-facing Agent. Ensures strict security isolation (mTLS), observability, and zero-downtime ECS Fargate deployments.

**Dependencies:** None (can be built in parallel with Epic 1).

**Technical Notes:**

- All infrastructure defined as code (Terraform) and deployed to AWS.
- ECS Fargate for all core services (Parser, Classifier, Engine, APIs).
- Agent binary built for Linux x86_64 and ARM64 via GitHub Actions.
- mTLS certificate generation and rotation pipeline for Agent authentication.

### User Stories

**Story 8.1: Core AWS Infrastructure Provisioning**

*As a system administrator (Brian), I want the base AWS infrastructure provisioned via Terraform, so that the services have a secure, scalable environment to run in.*

- **Acceptance Criteria:**
  - Terraform configures VPC, public/private subnets, NAT Gateway, and ALB.
  - Provisions RDS PostgreSQL 16 (Multi-AZ) with pgvector and S3 buckets for audit/logs.
  - Provisions ECS Fargate cluster and SQS queues.
- **Story Points:** 3
- **Dependencies:** None
**Story 8.2: CI/CD Pipeline & Agent Build**

*As a developer (Brian), I want automated build and deployment pipelines, so that I can ship code to production safely and distribute the Agent binary.*

- **Acceptance Criteria:**
  - GitHub Actions pipeline runs `cargo clippy`, `cargo test`, and the "canary test suite" on every PR.
  - Merges to `main` auto-deploy ECS Fargate services with zero downtime.
  - Release tags compile the Agent binary (x86_64, aarch64), sign it, and publish to GitHub Releases / S3.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.3: Agent mTLS & gRPC Setup**

*As a system, I want secure outbound-only gRPC communication between the Agent and the Execution Engine, so that commands are transmitted safely without VPNs or inbound firewall rules.*

- **Acceptance Criteria:**
  - Internal CA issues tenant-scoped mTLS certificates.
  - API Gateway terminates mTLS and validates the tenant ID in the certificate against the request.
  - Agent successfully establishes a persistent, outbound-only gRPC stream to the Engine.
- **Story Points:** 5
- **Dependencies:** 8.1

**Story 8.4: Observability & Party Mode Alerting**

*As the solo on-call engineer (Brian), I want comprehensive monitoring and alerting, so that I am woken up if a safety invariant is violated (Party Mode).*

- **Acceptance Criteria:**
  - OpenTelemetry (OTEL) collector deployed and routing metrics/traces to Grafana Cloud.
  - PagerDuty integration configured for P1 alerts.
  - Alert fires immediately if the "Party Mode" database flag is set or if Agent heartbeats drop unexpectedly.
- **Story Points:** 2
- **Dependencies:** 8.1
## Epic 9: Onboarding & PLG

**Description:** Product-Led Growth (PLG) and frictionless onboarding. Focuses on the "5-second wow moment" where a user can paste a runbook and instantly see the parsed, classified steps without installing an agent, plus a self-serve tier to capture leads.

**Dependencies:** Epics 1, 6, 7.

**Technical Notes:**

- The demo uses the `POST /runbooks/parse-preview` endpoint (does not persist data).
- Tenant provisioning must be fully automated upon signup.
- Free tier enforced via database limits (no Stripe required initially for free tier).

### User Stories

**Story 9.1: The 5-Second Interactive Demo**

*As a prospective customer, I want to paste my messy runbook into the marketing website and see it parsed instantly, so that I immediately understand the product's value without signing up.*

- **Acceptance Criteria:**
  - Landing page features a prominent text area for pasting a runbook.
  - Submitting calls the unauthenticated parse-preview endpoint (rate-limited by IP).
  - Renders the structured steps with 🟢/🟡/🔴 risk badges dynamically.
- **Story Points:** 3
- **Dependencies:** 6.1, 7.1
**Story 9.2: Self-Serve Signup & Tenant Provisioning**

*As a new user, I want to create an account and get my tenant provisioned instantly, so that I can start using the product without talking to sales.*

- **Acceptance Criteria:**
  - OAuth/Email signup flow creates a User and a Tenant record in PostgreSQL.
  - Automated worker initializes the tenant workspace, generates the initial mTLS certs, and provides the Agent installation command.
  - Limits the free tier to 5 active runbooks and 50 executions/month.
- **Story Points:** 3
- **Dependencies:** 8.1

**Story 9.3: Agent Installation Wizard**

*As a new user, I want a simple copy-paste command to install the read-only Agent in my infrastructure, so that I can execute my first runbook within 10 minutes of signup.*

- **Acceptance Criteria:**
  - Dashboard provides a dynamically generated `curl | bash` or `kubectl apply` snippet.
  - Snippet includes the tenant-specific mTLS certificate and Agent binary download.
  - Dashboard shows a live "Waiting for Agent heartbeat..." state that turns green when the Agent connects.
- **Story Points:** 3
- **Dependencies:** 8.3, 9.2

**Story 9.4: First Runbook Copilot Walkthrough**

*As a new user, I want an interactive guide for my first runbook execution, so that I learn how the Copilot approval flow and Trust Gradient work safely.*

- **Acceptance Criteria:**
  - System automatically provisions a "Hello World" runbook containing safe read-only commands (e.g., `kubectl get nodes`).
  - In-app tooltip guides the user to trigger the runbook via the Slack integration.
  - Completing this first execution unlocks the "Runbook Author" badge/status.
- **Story Points:** 2
- **Dependencies:** 4.2, 9.3

---
## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/run adheres to the 5 Transparent Factory tenets. Of all 6 dd0c products, runbook automation carries the highest governance burden — it executes commands in production infrastructure. Every tenet here is load-bearing.

### Story 10.1: Atomic Flagging — Feature Flags for Execution Behaviors

**As a** solo founder, **I want** every new runbook execution capability, command parser, and auto-remediation behavior behind a feature flag (default: off), **so that** a bad parser change never executes the wrong command in a customer's production environment.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the execution engine. V1: JSON file provider.
- All flags evaluate locally — no network calls during runbook execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged execution path fails >2 times in any 10-minute window, the flag auto-disables and all in-flight executions using that path are paused (not killed).
- Flags required for: new command parsers, conditional logic handlers, new infrastructure targets (K8s, AWS, bare metal), auto-retry behaviors, approval bypass rules.
- **Hard rule:** any flag that controls execution of destructive commands (`rm`, `delete`, `terminate`, `drop`) requires a 48-hour bake time at 10% rollout before full enablement.

**Estimate:** 5 points

**Dependencies:** Epic 3 (Execution Engine)

**Technical Notes:**

- Circuit breaker threshold is intentionally low (2 failures) because production execution failures are high-severity.
- Pause vs. kill: paused executions hold state and can be resumed after review. Killing mid-execution risks partial state.
- 48-hour bake for destructive flags: enforced via flag metadata `destructive: true` + CI check.
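The circuit breaker criterion above (>2 failures in any 10-minute sliding window auto-disables the flag) can be sketched with a timestamp queue:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Sketch of the Story 10.1 circuit breaker: auto-disable a flagged execution
// path after more than 2 failures inside a 10-minute sliding window. Pausing
// the in-flight executions would happen elsewhere, driven by `enabled`.
struct FlagBreaker {
    failures: VecDeque<Instant>,
    window: Duration,
    max_failures: usize,
    enabled: bool,
}

impl FlagBreaker {
    fn new() -> Self {
        Self {
            failures: VecDeque::new(),
            window: Duration::from_secs(600), // 10-minute sliding window
            max_failures: 2,
            enabled: true,
        }
    }

    /// Record a failure at time `now`; returns false once the flag has tripped.
    fn record_failure(&mut self, now: Instant) -> bool {
        self.failures.push_back(now);
        // Drop failures that fell out of the sliding window.
        while let Some(&oldest) = self.failures.front() {
            if now.duration_since(oldest) > self.window {
                self.failures.pop_front();
            } else {
                break;
            }
        }
        if self.failures.len() > self.max_failures {
            self.enabled = false; // trip: pause (don't kill) in-flight executions
        }
        self.enabled
    }
}

fn main() {
    let mut b = FlagBreaker::new();
    let t0 = Instant::now();
    for s in [0, 60, 120] {
        println!("enabled after failure: {}", b.record_failure(t0 + Duration::from_secs(s)));
    }
}
```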
### Story 10.2: Elastic Schema — Additive-Only for Execution Audit Trail

**As a** solo founder, **I want** all execution log and runbook state schema changes to be strictly additive, **so that** the forensic audit trail of every production command ever executed is never corrupted or lost.

**Acceptance Criteria:**

- CI rejects any migration that removes, renames, or changes the type of existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes.
- All execution log parsers ignore unknown fields.
- Dual-writes during migration windows occur within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).
- **Hard rule:** execution audit records are immutable. No `UPDATE` or `DELETE` statements are permitted on the `execution_log` table/collection. Ever.

**Estimate:** 3 points

**Dependencies:** Epic 4 (Audit & Forensics)

**Technical Notes:**

- Execution logs are append-only by design and by policy. Use DynamoDB with no `UpdateItem` IAM permission on the audit table.
- For runbook definition versioning: every edit creates a new version record. Old versions are never mutated.
- Schema for parsed runbook steps: version the parser output format. V1 steps and V2 steps coexist.
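The additive-only CI gate could start as a simple lint over migration SQL. A minimal sketch, assuming plain-SQL migrations (the pattern list is illustrative and would need to grow):

```python
import re

# Schema changes the CI gate rejects (assumed plain-SQL migrations).
FORBIDDEN = [
    (re.compile(r"\bDROP\s+COLUMN\b", re.I), "removes a column"),
    (re.compile(r"\bRENAME\s+(COLUMN|TO)\b", re.I), "renames a column/table"),
    (re.compile(r"\bALTER\s+COLUMN\s+\w+\s+(SET\s+DATA\s+)?TYPE\b", re.I),
     "changes a column type"),
    # Hard rule: the audit trail is immutable.
    (re.compile(r"\b(UPDATE|DELETE\s+FROM)\s+execution_log\b", re.I),
     "mutates execution_log"),
]

def lint_migration(sql: str) -> list:
    """Return reasons a migration violates the additive-only policy."""
    return [reason for pattern, reason in FORBIDDEN if pattern.search(sql)]
```

Additive changes (`ADD COLUMN exit_code_v2 ...`) pass untouched; anything on the forbidden list blocks the merge.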
### Story 10.3: Cognitive Durability — Decision Logs for Execution Logic

**As a** future maintainer, **I want** every change to runbook parsing, step classification, or execution logic accompanied by a `decision_log.json`, **so that** I understand why the system interpreted step 3 as "destructive" and required approval.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/parser/`, `pkg/classifier/`, `pkg/executor/`, or `pkg/approval/`.
- Cyclomatic complexity cap of 10 enforced via linter. PRs exceeding this are blocked.
- Decision logs live in `docs/decisions/`.
- **Additional requirement:** every change to the "destructive command" classification list requires a decision log entry explaining why the command was added or removed.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- The destructive command list is the most sensitive config in the entire product. Changes must be documented, reviewed, and logged.
- Parser logic decision logs should include sample runbook snippets showing before/after interpretation.
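Validating the `decision_log.json` shape in CI could be as small as this sketch (the `[0, 1]` confidence range is an assumption the story does not specify):

```python
# Keys required by the decision_log.json schema in the acceptance criteria.
REQUIRED_KEYS = {"prompt", "reasoning", "alternatives_considered",
                 "confidence", "timestamp", "author"}

def validate_decision_log(entry: dict) -> list:
    """Return problems with a decision_log.json entry; empty list means valid."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - entry.keys())]
    confidence = entry.get("confidence")
    if confidence is not None and not (0.0 <= confidence <= 1.0):
        problems.append("confidence must be in [0, 1]")  # assumed range
    return problems
```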
### Story 10.4: Semantic Observability — AI Reasoning Spans on Classification & Execution

**As an** SRE manager investigating an incident caused by automated execution, **I want** every runbook classification and execution decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I have a complete forensic record of what the system decided, why, and what it executed.

**Acceptance Criteria:**

- Every runbook execution creates a parent `runbook_execution` span. Child spans for each step: `step_classification`, `step_approval_check`, `step_execution`.
- `step_classification` attributes: `step.text_hash`, `step.classified_as` (safe/destructive/ambiguous), `step.confidence_score`, `step.alternatives_considered`.
- `step_execution` attributes: `step.command_hash`, `step.target_host_hash`, `step.exit_code`, `step.duration_ms`, `step.output_truncated` (first 500 chars, hashed).
- If AI-assisted classification: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- `step_approval_check` attributes: `step.approval_required` (bool), `step.approval_source` (human/policy/auto), `step.approval_latency_ms`.
- No PII. No raw commands in spans — everything hashed. Full commands only in the encrypted audit log.

**Estimate:** 5 points

**Dependencies:** Epic 3 (Execution Engine), Epic 4 (Audit & Forensics)

**Technical Notes:**

- This is the highest-effort observability story across all 6 products because execution tracing is forensic-grade.
- Span hierarchy: `runbook_execution` → `step_N` → `[classification, approval, execution]`. Three levels deep.
- Output hashing: SHA-256 of command output for correlation. Raw output only in encrypted audit store.
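The hashing discipline for span attributes can be illustrated without committing to an OpenTelemetry SDK; a sketch of building the `step_execution` attribute set (attribute names from the story, function shape assumed):

```python
import hashlib

def _hash(value: str) -> str:
    """Spans never carry raw commands or output, only SHA-256 digests."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def step_execution_attributes(command: str, target_host: str,
                              exit_code: int, duration_ms: int,
                              output: str) -> dict:
    """Build the attribute set for a `step_execution` span."""
    return {
        "step.command_hash": _hash(command),
        "step.target_host_hash": _hash(target_host),
        "step.exit_code": exit_code,
        "step.duration_ms": duration_ms,
        # First 500 chars of output, hashed for correlation; raw output
        # lives only in the encrypted audit store.
        "step.output_truncated": _hash(output[:500]),
    }
```

The real implementation would attach these via `span.set_attribute` inside the `runbook_execution` → `step_N` hierarchy.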
### Story 10.5: Configurable Autonomy — Governance for Production Execution

**As a** solo founder, **I want** a `policy.json` that strictly controls what dd0c/run is allowed to execute autonomously vs. what requires human approval, **so that** no automated system ever runs a destructive command without explicit authorization.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (all steps require human approval) or `audit` (safe steps auto-execute, destructive steps require approval).
- Default for ALL customers: `strict`. There is no "fully autonomous" mode for dd0c/run. Even in `audit` mode, destructive commands always require approval.
- `panic_mode`: when true, ALL execution halts immediately. In-flight steps are paused (not killed). All pending approvals are revoked. System enters read-only forensic mode.
- Governance drift monitoring: weekly report of auto-executed vs. human-approved steps. If the auto-execution ratio exceeds the configured threshold, the system auto-downgrades to `strict`.
- Per-customer, per-runbook governance overrides. Customers can lock specific runbooks to `strict` regardless of system mode.
- **Hard rule:** `panic_mode` must be triggerable in <1 second via API call, CLI command, Slack command, OR a physical hardware button (webhook endpoint).

**Estimate:** 5 points

**Dependencies:** Epic 3 (Execution Engine), Epic 5 (Approval Workflow)

**Technical Notes:**

- No "full auto" mode is a deliberate product decision. dd0c/run assists humans, it doesn't replace them for destructive actions.
- Panic mode implementation: Redis key `dd0c:panic` checked at the top of every execution loop iteration. Webhook endpoint `POST /admin/panic` sets the key and broadcasts a halt signal via pub/sub.
- The <1 second requirement means panic cannot depend on a DB write — Redis pub/sub only.
- Governance drift: cron job queries execution logs weekly. Threshold configurable per-org (default: 70% auto-execution triggers downgrade).
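The panic-mode control flow could be sketched like this (a plain dict stands in for the Redis key `dd0c:panic` so the flow is testable; the real path would also broadcast a halt signal via pub/sub as noted above):

```python
class PanicSwitch:
    """Halts all execution when the `dd0c:panic` key is set."""

    PANIC_KEY = "dd0c:panic"

    def __init__(self, store: dict):
        self.store = store  # Redis in production, dict here

    def trigger(self) -> None:
        # Called by POST /admin/panic, the CLI, or the Slack command.
        self.store[self.PANIC_KEY] = "1"

    def is_active(self) -> bool:
        return self.store.get(self.PANIC_KEY) == "1"

def run_steps(steps, switch: PanicSwitch, execute) -> list:
    """Execute steps in order, pausing (not killing) when panic is raised."""
    paused = []
    for step in steps:
        if switch.is_active():    # checked at the top of every iteration
            paused.append(step)   # held for resume after review
            continue
        execute(step)
    return paused
```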
### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 5 |
| 10.5 | Configurable Autonomy | 5 |
| **Total** | | **20** |
---

`products/06-runbook-automation/innovation-strategy/session.md` (new file, 128 lines)
# dd0c/run — Innovation Strategy & Disruption Verdict

**Strategist:** Victor, Disruptive Innovation Oracle
**Date:** 2026-02-28

## Section 1: MARKET LANDSCAPE

Let us dispense with the industry hallucinations. The current runbook automation market is a museum of failed promises. We are selling to teams whose "documentation" is a stale Confluence page that actively sabotages their incident response.

**The Incumbent Graveyard (Current State):**

* **Rundeck:** A 2015-era job scheduler masquerading as a modern runbook engine. It requires Java, a database, and YAML definitions. It is the definition of toil.
* **Transposit & Shoreline:** Over-engineered orchestration platforms built for the 1% of engineering orgs that have the bandwidth to learn yet another proprietary DSL. They built jetpacks for people who are currently drowning.
* **Rootly:** Excellent at incident management (the bureaucracy of the outage), but they stop at the boundary of execution. They document the fire; they don't hold the hose.

**Adjacent Markets (The Collision Course):**

* **Incident Management (PagerDuty, OpsGenie):** They own the alert routing but treat the actual resolution as an "exercise left to the reader." Their native automation add-ons are extortionately priced bolt-ons.
* **AIOps:** A buzzword that has historically meant "we will group your 5,000 meaningless alerts into 50 slightly-less-meaningless clusters."
* **Workflow Automation (Zapier/Make for DevOps):** Too generic. They don't understand infrastructure state, blast radius, or the concept of a 3 AM rollback.

**Key Macro Trends (2026):**

1. **Shift from Documentation to Execution:** The era of static text is dead. If a runbook cannot execute its own read-only steps, it is legacy technical debt.
2. **LLM-Powered Ops (The Parsing Revolution):** We finally have models capable of translating ambiguous human intent ("bounce the connection pool") into deterministic shell commands with high reliability.
3. **Agentic Automation:** We are transitioning from "human-in-the-loop" to "human-on-the-loop." The trust gradient is shifting.
## Section 2: DISRUPTION ANALYSIS

The incumbents have built moats out of complexity. They mistake the density of their UI for enterprise value.

**Vulnerabilities of the Old Guard:**

* **The Complexity Tax:** Rundeck and Shoreline require upfront investment. You do not buy them; you marry them. This violates the 5-minute time-to-value constraint.
* **The PagerDuty Extortion:** PagerDuty's native automation is a cynical upsell. It demands thousands of dollars simply to automate the resolution of the alerts it already charges you to receive. They are taxing their own utility.

**The Unowned Gap:**

Nobody owns the bridge between *tribal knowledge* and *automated execution*. The "documentation-to-execution" gap is vast. Teams currently have to write documentation, then write automation code. We eliminate the intermediate step. The documentation *is* the code.

**Why 2026? The Paradigm Shift:**

Two years ago, this was impossible. A 2024 LLM would hallucinate a destructive command or fail to parse implicit prerequisites. Now, we have models capable of rigorous structural extraction and risk classification (🟢🟡🔴). The context windows are large enough to digest a 50-page postmortem and distill the exact terminal commands that fixed it. The inference latency is under 2 seconds. The underlying intelligence has commoditized to the point where we can offer it for $29/runbook/month, destroying the enterprise pricing models of the incumbents.
## Section 3: COMPETITIVE MOAT STRATEGY

You cannot rely on LLM capabilities as a moat. OpenAI or Anthropic will drop prices or release better reasoning models every six months. The moat is your data and your ecosystem. If you do not lock this down, you are just a generic wrapper waiting to be replaced by a Pulumi or GitHub Agentic Workflow.

**The Flywheel of the dd0c Ecosystem:**

* **The Alert/Run Coupling:** `dd0c/alert` identifies the incident pattern; `dd0c/run` provides the resolution. The execution telemetry from `dd0c/run` feeds back into `dd0c/alert`, training the matching engine. It is a closed-loop system of continuous improvement. The data moat compounds daily.

**The Network Effect:**

* **The Template Marketplace:** A company signs up and immediately inherits the collective wisdom of thousands of other engineering teams. A shared template for "AWS RDS Failover" that has been battle-tested and refined across 500 organizations is infinitely more valuable than a blank slate. The value of the platform scales non-linearly with every new user.

**The Data Moat (Execution Telemetry):**

* We log every skipped step, every manual override, every successful rollback. We are building the industry's first database of *what actually works in an incident*. This "Resolution Pattern Database" is an asset no incumbent possesses. They only know what the runbook says; we know what the human actually typed at 3:14 AM.

**Why the Incumbents Cannot Replicate This:**

* PagerDuty and Incident.io cannot simply add a "generate runbook" button. To replicate `dd0c/run`, they would need the deep infrastructure context, the FinOps integration (`dd0c/cost`), and the alert intelligence pipeline (`dd0c/alert`). We have the context. They just have the routing rules.
## Section 4: MARKET SIZING

The numbers must be defensible. Stop inflating them to please imaginary VCs. We are a bootstrapped operation.

**Methodology & Market Sizing:**

* **TAM (Total Addressable Market): $12B+**
    * *Calculation:* There are roughly 26 million software developers globally. Assume 20% are involved in ops/on-call rotations (5.2 million). Assume an average enterprise tooling spend of $200/month per engineer for incident management, AIOps, and automation: 5.2M × $200 × 12 months ≈ $12.5B.
* **SAM (Serviceable Available Market): $1.5B**
    * *Calculation:* Focus on the mid-market (startups to mid-size tech companies, Series A to Series D). These teams have the highest pain-to-budget ratio. They have 10-100 engineers and cannot afford to hire a dedicated SRE team or buy Shoreline. Let's estimate 50,000 such companies, averaging 30 on-call engineers each (1.5 million target seats). At $1,000/year per seat, that's $1.5B.
* **SOM (Serviceable Obtainable Market): $15M (Year 3 Target)**
    * *Calculation:* 1% penetration of the SAM. 500 companies with an average ARR of $30,000.
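As a quick sanity check, the back-of-envelope arithmetic behind the TAM/SAM/SOM figures (annualizing the $200/month spend over 12 months):

```python
# TAM: global developers involved in on-call, at the assumed tooling spend.
developers = 26_000_000
on_call_share = 0.20
spend_per_month = 200
tam = developers * on_call_share * spend_per_month * 12   # ~ $12.5B

# SAM: mid-market target seats at $1,000/year.
companies = 50_000
seats_per_company = 30
sam = companies * seats_per_company * 1_000               # $1.5B

# SOM: 1% of SAM, i.e. 500 companies at $30K ARR.
som = 500 * 30_000                                        # $15M
```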
**Beachhead Segment Identification:**

* **The Drowning SRE Team (Series B/C Startups):** Teams of 5-15 SREs supporting 50-200 developers. They have high incident volume and their existing Confluence runbooks are a known liability. They are desperate for a solution that does not require a six-month migration.
* **The Compliance Chasers:** Startups preparing for their first or second SOC 2 audit. They need documented, auditable incident response procedures immediately. We sell them the audit trail masquerading as an automation tool.

**Revenue Projections (Based on $29/runbook/month or equivalent per-seat pricing):**

* **Conservative (Year 1):** $250K ARR. We capture the early adopters from our FinOps wedge (`dd0c/cost`) and convert them to the platform.
* **Moderate (Year 2):** $1.2M ARR. The flywheel engages. The template marketplace drives organic acquisition. `dd0c/alert` and `dd0c/run` are sold as a bundled pair.
* **Aggressive (Year 3):** $5M+ ARR. The platform takes over the incident management budget of 150-200 mid-market companies.
## Section 5: STRATEGIC RISKS

This is where the hallucination stops. This is where you look at the barrel of the gun. The market is not kind to solo founders.

**Top 5 Existential Risks:**

1. **PagerDuty/Incident.io Building Native AI Automation**
    * **Severity:** Critical
    * **Probability:** High
    * **Mitigation:** They *will* build this, but they will build it as a closed, proprietary upsell for Enterprise tiers. They will not integrate deeply with your AWS cost anomalies (`dd0c/cost`) or your infrastructure drift (`dd0c/drift`). We win on the open ecosystem, the cross-platform nature of the agent, and mid-market pricing. They will sell to the CIO; we sell to the on-call engineer at 3 AM.

2. **LLM Hallucination in Production Runbooks (Safety Critical)**
    * **Severity:** Catastrophic
    * **Probability:** Medium
    * **Mitigation:** The Trust Gradient is non-negotiable. 🟢 (Safe), 🟡 (Caution), 🔴 (Dangerous). We never execute state-changing commands without explicit human approval (Copilot Mode) or a proven track record (Autopilot Mode). We must implement strict grounding techniques; the LLM cannot invent steps not found in the source material unless explicitly requested. Every action must have a recorded rollback command. The first time `dd0c/run` breaks production autonomously, the company is dead.

3. **The "Agentic AI" Obsolescence Event**
    * **Severity:** High
    * **Probability:** Low (in the next 3 years)
    * **Mitigation:** If autonomous AI agents (Devin, GitHub Copilot Workspace, Pulumi Neo) can detect and fix infrastructure issues without human intervention, who needs runbooks? The answer: the agents themselves need runbooks as the "policy" that defines what they *should* do. Runbooks become the bridge between human intent and agent execution. We pivot from "human automation" to "agent policy management."

4. **Solo Founder Scaling Constraints (The Bus Factor)**
    * **Severity:** High
    * **Probability:** High
    * **Mitigation:** Brian, you are building six products. You must rigorously enforce the "Anti-Bloatware Platform Strategy." Share the API gateway, the auth layer, the OpenTelemetry ingest, and the Rust agent architecture across all `dd0c` modules. If you build six separate data models, you will burn out in 14 months. Do not build custom integrations where webhooks will suffice. Do not build crawlers; force the user to paste.

5. **The "No Runbooks at All" Cold Start Problem**
    * **Severity:** Medium
    * **Probability:** High
    * **Mitigation:** Many teams have zero runbooks. They cannot use `dd0c/run` if they have nothing to paste. V1 must include the "Slack Thread Scraper" and "Postmortem-to-Runbook Pipeline." V2 must include the "Terminal Watcher." The product must *create* runbooks, not just execute them.
## Section 6: INNOVATION VERDICT

This is the final word. The market is saturated with "Ops" products, but it is entirely devoid of execution velocity.

**Verdict: CONDITIONAL GO.**

**Timing Recommendation:**

Do not build this first. Do not build this second.

1. **Month 1-3:** Build `dd0c/route` and `dd0c/cost`. These are the FinOps wedges. They prove immediate, hard-dollar ROI. If you cannot save a company money, you have no right to ask them to trust you with their production environment.
2. **Month 4-6:** Build `dd0c/alert` and `dd0c/run` as a bundled pair. The "On-Call Savior" phase. You have saved their budget; now save their sleep.

**Key Conditions & Kill Criteria:**

* **Condition 1:** The "Paste & Parse" AI must take < 5 seconds and perfectly classify destructive vs. read-only commands. If it requires 10 minutes of manual YAML adjustment after pasting, the product is dead. Kill it.
* **Condition 2:** You must secure the `dd0c/alert` integration pipeline. `dd0c/run` is not a standalone product; it is the execution arm of your alert intelligence. If it cannot auto-attach the correct runbook to a PagerDuty webhook with >90% confidence, kill it.
* **Condition 3:** Zero-configuration local execution. The Rust agent must run in their VPC and pull commands from the SaaS. If you require inbound firewall rules or AWS root credentials, the security review will block every sale. Kill it.

**The One Thing That Must Be True:**

Engineers must hate their current 3 AM reality more than they fear giving an LLM the ability to suggest a production terminal command. The dread of the pager must outweigh the skepticism of the AI.

If that is true—and my telemetry suggests it is—then `dd0c/run` is not just a feature. It is the beginning of the end for static documentation.

Build the weapon.

— Victor
---

`products/06-runbook-automation/party-mode/session.md` (new file, 101 lines)
# dd0c/run — Party Mode Advisory Board Review

**Date:** 2026-02-28
**Product:** dd0c/run (AI-Powered Runbook Automation)
**Phase:** 3

> *The boardroom doors lock. The coffee is black. The whiteboard is empty. Five experts sit around the table to tear dd0c/run apart. If it survives this, it survives production.*
## Round 1: INDIVIDUAL REVIEWS

**The VC (Pattern Matcher):**

"I'll be honest, the initial pitch sounds like a feature PagerDuty will announce at their next user conference. But the *Resolution Pattern Database*—that's a data moat. If you actually capture what engineers type at 3am across 500 companies, you're not building an ops tool, you're building the industry's first operational intelligence graph. What worries me is the go-to-market. $29/runbook/month sounds weird. Just do price per seat. **CONDITIONAL GO**—only if you change the pricing model and prove PagerDuty can't just copy the UI."

**The CTO (Infrastructure Veteran):**

"I have 20 years of scars from production outages. The idea of an LLM piping commands directly into a root shell makes my blood run cold. 'Hallucinate a failover' is a resume-generating event. However, the strict Risk Classification (🟢🟡🔴) and the local Rust agent in the customer's VPC give me a sliver of hope. What worries me is the parsing engine misclassifying a destructive command as 'safe' because the LLM got confused by context. **NO-GO** until I see the deterministic AST-level validation that overrides the LLM's risk guesses."

**The Bootstrap Founder (Solo Shipper):**

"A secure, remote-execution VPC agent plus an LLM parsing layer plus a Slack bot? For a solo founder? Brian, you're going to burn out before month 6. I love the 'Paste & Parse' wedge—that's a viral 5-second time-to-value that doesn't require a sales call. But what worries me is the support burden when a customer's weird custom Kubernetes setup doesn't play nice with your agent. **GO**, but you have to ruthlessly cut scope. V1 is Copilot-only, no Autopilot."

**The On-Call Engineer (3am Survivor):**

"You had me at 'Slack bot that pre-fills the variables.' When I'm paged at 3am, my IQ drops by 40 points. I don't want to read Confluence. I want a button that says 'Check Database Locks' and just does it. What worries me is trust. If this thing lies to me once—if it runs the wrong script or links the wrong runbook—I will uninstall it and tell every SRE I know that it's garbage. The rollback feature is the only reason I'm at the table. **CONDITIONAL GO**—if it actually works in Slack."

**The Contrarian (The Skeptic):**

"You're all missing the point. Automating a runbook is a fundamental anti-pattern. If a process can be documented as a step-by-step runbook, it should be codified as a self-healing script in the infrastructure, not babysat by an AI chatbot! Runbooks are a symptom of engineering failure. You're building a highly profitable band-aid for a bullet wound. Furthermore, this is a feature of `dd0c/portal`, not a standalone product. **NO-GO**."
## Round 2: CROSS-EXAMINATION

**The CTO:** "Let's talk blast radius. What happens when the LLM reads a stale runbook that says `kubectl delete namespace payments-legacy`, but that namespace was repurposed last week for the new billing engine?"

**The On-Call Engineer:** "Wait, if the AI suggests that at 3am, I might just click approve because I'm half asleep. If the UI makes it look 'green' or safe, I'm blindly trusting it."

**The CTO:** "Exactly. An LLM cannot know the current state of the cluster unless it's constantly diffing against live infrastructure. It's just reading a dead document."

**The Contrarian:** "Which is why relying on text documents to manage infrastructure in 2026 is archaic. The product is enforcing bad habits. We should be writing declarative state, not chat scripts."

**The Bootstrap Founder:** "Guys, focus. We're not solving the platonic ideal of engineering, we're solving the reality that 50,000 startups have garbage Confluence pages. CTO, the design doc says V2 includes 'Infrastructure-Aware Staleness' via `dd0c/portal` to catch dead resources."

**The CTO:** "V2 doesn't pay the bills if V1 drops a production database. The risk classifier cannot be pure LLM. It has to be a deterministic regex/AST scanner matching against known destructive patterns."

**The VC:** "Let's pivot to market dynamics. Bootstrap Founder, you think one person can't build this. But if he uses the shared `dd0c` platform architecture—same auth, same billing, same gateway—isn't this just a Slack bot piping to an LLM router?"

**The Bootstrap Founder:** "The SaaS side is easy. The Rust execution agent that runs in the customer's VPC is hard. That has to be bulletproof. If that agent has a CVE, the whole dd0c brand is dead."

**The VC:** "But without the execution, we're just a markdown parser. Incident.io can build a markdown parser in a weekend. They're probably building it right now."

**The On-Call Engineer:** "I don't care about Incident.io. I care that when I get a PagerDuty alert, the Slack thread already has the logs pulled. If `dd0c/run` can do that without even executing a write command, I'd pay for it."

**The Contrarian:** "So you're paying $29 a month for a glorified `grep`?"

**The On-Call Engineer:** "At 3am, a glorified `grep` that knows exactly which pod is failing is worth its weight in gold. You don't get it because you sleep through the night."

**The Bootstrap Founder:** "That's the wedge! The read-only Copilot. V1 doesn't need to execute state changes. V1 just runs the 🟢 safe diagnostic steps automatically and presents the context."

**The CTO:** "I can stomach that. Read-only diagnostics limit the blast radius to zero, while proving the agent works."
## Round 3: STRESS TEST

**1. The Catastrophic Scenario: LLM Hallucination Causes a Production Outage**

* **Severity:** 10/10. An extinction-level event for the company.
* **The Attack:** The runbook says "Restart the pod." The LLM hallucinates and outputs `kubectl delete deployment` instead, classifying it as 🟢 Safe. The tired engineer clicks approve. Production goes down, the customer cancels, and dd0c goes bankrupt from lawsuits.
* **Mitigation:**
    * *Systemic:* Strict UI friction. 🔴 actions require typing the resource name to confirm, not just clicking a button.
    * *Architectural:* The agent must have a deterministic whitelist of allowed commands. Anything outside the whitelist is blocked at the agent level, regardless of the SaaS payload.
* **Pivot Option:** Limit V1 to Read-Only Diagnostic execution. The AI only fetches logs and metrics. State changes must be copy-pasted by the human.
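The deterministic classifier the CTO is demanding could start as small as this sketch (the safe-list contents, pattern list, and `classify` helper are illustrative, not a spec; the point is that the LLM's risk guess is never consulted):

```python
import re
import shlex

# Read-only diagnostics the agent may auto-run (🟢). Everything else needs
# a human; known-destructive patterns are hard-blocked (🔴) at the agent.
SAFE_COMMANDS = {"kubectl get", "kubectl describe", "kubectl logs",
                 "df", "free", "uptime", "dig", "ps"}
DESTRUCTIVE = re.compile(
    r"\b(rm|delete|terminate|drop|shutdown)\b", re.I)

def classify(command: str) -> str:
    """Deterministic risk class that overrides any LLM guess."""
    if DESTRUCTIVE.search(command):
        return "dangerous"               # 🔴 blocked regardless of SaaS payload
    tokens = shlex.split(command)
    head = " ".join(tokens[:2]) if tokens else ""
    if head in SAFE_COMMANDS or (tokens and tokens[0] in SAFE_COMMANDS):
        return "safe"                    # 🟢 auto-executable diagnostic
    return "caution"                     # 🟡 requires human approval
```

Anything not provably safe defaults to 🟡, so the failure mode is friction, not blast radius.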
**2. The Market Threat: PagerDuty / Incident.io Ships Native Runbook Automation**

* **Severity:** 8/10.
* **The Attack:** PagerDuty acquires a small AI runbook startup and bundles it into their Enterprise tier for "free" (subsidized by their massive license cost), choking off dd0c/run's distribution.
* **Mitigation:**
    * *Positioning:* PagerDuty's automation is locked to PagerDuty. `dd0c/run` must be agnostic (works with OpsGenie, Grafana OnCall, direct webhooks).
    * *Data Moat:* The cross-customer resolution template library. PagerDuty keeps data siloed per enterprise; we anonymize and share the "ideal" runbooks across the mid-market.
* **Pivot Option:** Double down on the `dd0c` ecosystem. `dd0c/run` becomes the execution arm of `dd0c/alert` and `dd0c/drift`, creating a tightly coupled platform that PagerDuty can't touch.

**3. The Cold Start: Teams Don't Have Documented Runbooks**

* **Severity:** 7/10.
* **The Attack:** A prospect signs up, goes to the "Paste Runbook" screen, and realizes they have nothing to paste. Churn happens in 60 seconds.
* **Mitigation:**
    * *Product:* The "Slack Thread Scraper" and "Postmortem-to-Runbook Pipeline" must be in V1. If they have incidents, they have Slack threads. Extract the runbooks from history.
    * *Community:* Pre-seed the platform with 50 high-quality templates for standard infra (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure).
* **Pivot Option:** Shift marketing from "Automate your runbooks" to "Generate your missing runbooks."
## Round 4: FINAL VERDICT

**The Board Vote:** Split Decision (3 GO, 1 CONDITIONAL GO, 1 NO-GO).

**The Verdict:** **PIVOT TO CONDITIONAL GO.**

**Revised Strategy & Timing:**

Confirmed as Phase 3 product (Months 4-6), but ONLY if `dd0c/alert` is functional first. `dd0c/run` cannot stand alone in the market against incumbents; it must be the execution arm of the alert intelligence.

**Top 3 Must-Get-Right Items:**

1. **The Trust Gradient (Read-Only First):** V1 must aggressively limit itself to 🟢 Safe (diagnostic/read-only) commands for auto-execution. 🟡 and 🔴 commands must have extreme UI friction.
2. **The "Paste to Pilot" Vercel Moment:** The onboarding must be under 5 minutes. If parsing fails or requires heavy YAML tweaking, the solo founder loses the GTM battle.
3. **The Agent Security Model:** The Rust VPC agent must be open-source, audited, and strictly outbound-only. If InfoSec teams balk, the sales cycle extends beyond a solo founder's runway.

**The Kill Condition:**

If beta testing shows that the LLM misclassifies a destructive command as "Safe" (🟢) even once, or if the false-positive rate for safe commands exceeds 0.1%, the product must be killed or fundamentally re-architected to remove LLMs from the execution path.

**Closing Remarks from The Board:**

*The board recognizes the immense leverage of this product. If Brian can pull this off, he effectively creates the bridge between static documentation and autonomous AI agents. But the execution risk is astronomical. Build it paranoid. Assume the AI wants to delete production. Constrain it accordingly.*

*Meeting adjourned.*
---

`products/06-runbook-automation/product-brief/brief.md` (new file, 617 lines)
# dd0c/run — Product Brief

## AI-Powered Runbook Automation

**Version:** 1.0 | **Date:** 2026-02-28 | **Author:** Product Management | **Status:** Investor-Ready Draft

---
## 1. EXECUTIVE SUMMARY

### Elevator Pitch

dd0c/run converts your team's existing runbooks — the stale Confluence pages, the Slack threads, the knowledge trapped in your senior engineer's head — into structured, executable workflows that guide on-call engineers through incidents step by step. Paste a runbook, get an intelligent copilot in under 5 seconds. No YAML. No configuration. No new DSL to learn.

This is the most safety-critical module in the dd0c platform. It touches production. We built it that way on purpose.
### The Problem

**The documentation-to-execution gap is killing engineering teams.**

- The average on-call engineer spends 12+ minutes *finding and interpreting* the right runbook during a 3am incident. Cognitive function drops 30-40% during nighttime pages. Every minute of that search is a minute of downtime, a minute of cortisol, and a minute closer to burnout.
- 60% of runbooks are stale within 30 days of creation. Infrastructure changes, UIs get redesigned, scripts move repos. The runbook becomes a historical artifact that actively sabotages incident response.
- On-call burnout is the #1 reason SREs quit. Replacing a single engineer costs $150K+. The tooling that's supposed to help — Rundeck, PagerDuty Automation Actions, Shoreline — either requires weeks of setup, costs thousands per month, or demands a proprietary DSL that nobody has time to learn.
- SOC 2 and ISO 27001 require documented, auditable incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive a serious audit.

The industry has tools that *route* alerts (PagerDuty), tools that *document* incidents (Rootly, Incident.io), and tools that *schedule* jobs (Rundeck). Nobody owns the bridge between tribal knowledge and automated execution. The runbook sits in Confluence. The terminal sits on the engineer's laptop. The gap between them is where MTTR lives.
### The Solution
|
||||
|
||||
dd0c/run is an AI-powered runbook engine that:
|
||||
|
||||
1. **Ingests** runbooks from anywhere — paste raw text, connect Confluence/Notion, or point at a Git repo of markdown files.
|
||||
2. **Parses** prose into structured, executable steps with automatic risk classification (🟢 Safe / 🟡 Caution / 🔴 Dangerous) in under 5 seconds.
|
||||
3. **Matches** runbooks to incoming alerts via PagerDuty/OpsGenie webhooks, so the right procedure appears in the incident Slack channel before the engineer finishes rubbing their eyes.
|
||||
4. **Guides** execution in Copilot mode — auto-executing safe diagnostic steps, prompting for approval on state changes, blocking destructive actions without explicit confirmation.
|
||||
5. **Learns** from every execution — tracking which steps were skipped, modified, or added — and suggests runbook updates automatically. Runbooks get better with every incident instead of decaying.
### Target Customer

**Primary:** Mid-market engineering teams (Series A through Series D startups, 10-100 engineers) with 5-15 SREs supporting high incident volume. They have existing runbooks in Confluence/Notion that they know are stale, they can't afford a dedicated SRE tooling team, and they're drowning in on-call.

**Secondary:** Startups approaching their first SOC 2 audit who need documented, auditable incident response procedures immediately.

**Beachhead:** Teams already using dd0c/cost or dd0c/alert. We've saved their budget and their sleep. Now we save their production environment.
### Key Differentiators

1. **Zero-Configuration Intelligence.** Paste a runbook. Get structured, risk-classified, executable steps in under 5 seconds. Rundeck requires Java, a database, and YAML definitions. Shoreline requires a proprietary DSL. We require a clipboard.

2. **The Trust Gradient.** We don't ask teams to hand production to an AI on day one. dd0c/run starts in read-only suggestion mode. Trust is earned through data — 10 successful copilot runs with zero modifications before the system even *suggests* promotion to autopilot. Trust is earned in millimeters and lost in miles. We designed for that.

3. **The dd0c Ecosystem Flywheel.** dd0c/alert identifies the incident pattern. dd0c/run provides the resolution. Execution telemetry feeds back into alert intelligence, training the matching engine. dd0c/portal provides service ownership context. dd0c/cost tracks the financial impact. The modules are 10x more valuable together than apart. No point solution can replicate this.

4. **The Resolution Pattern Database.** Every skipped step, every manual override, every successful rollback is logged. We're building the industry's first database of *what actually works in an incident* — not what the runbook says, but what the engineer actually typed at 3:14 AM. This data moat compounds daily.

5. **Agent-Based Security Model.** A lightweight, open-source Rust agent runs inside the customer's VPC. The SaaS never sees credentials. The agent is strictly outbound-only. No inbound firewall rules. No root AWS credentials. InfoSec teams can audit the binary. The execution boundary is the customer's perimeter, not ours.
### The Safety Narrative

Let's be direct: this product executes actions in production environments. An LLM suggesting the wrong command at 3am to a sleep-deprived engineer is not a theoretical risk — it's the scenario that kills the company.

Our entire architecture is built around one principle: **assume the AI wants to delete production. Constrain it accordingly.**

- **V1 is Copilot-only.** No autonomous execution of state-changing commands. Period. The AI suggests. The human approves. The audit trail proves it.
- **Risk classification is deterministic + AI.** The LLM classifies steps, but a deterministic regex/AST scanner validates against known destructive patterns. The scanner overrides the LLM. Always.
- **The agent has a command whitelist.** Anything outside the whitelist is blocked at the agent level, regardless of what the SaaS sends. The agent is the last line of defense, and it doesn't trust the cloud.
- **🔴 Dangerous actions require typing the resource name to confirm.** Not clicking a button. Typing. UI friction is a feature, not a bug.
- **Every state-changing step records its rollback command.** One-click undo at any point. The safety net that makes engineers brave enough to click "Approve."
- **Kill condition:** If beta testing shows the LLM misclassifies a destructive command as Safe (🟢) even once, or if the false-positive rate exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.

Trust is built incrementally: read-only diagnostics first, then copilot with approval gates, then — only after a proven track record — selective autopilot for safe steps. The engineer always has the kill switch. The AI never insists.
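The scanner-overrides-LLM rule described above is simple enough to sketch. A minimal illustration in Python, under assumptions: the `classify` helper and the tiny pattern list are invented for this sketch, and a production scanner would be far more exhaustive and would parse shell/SQL syntax (AST) rather than rely on regex alone.

```python
import re

# Hypothetical pattern list -- deliberately tiny for illustration.
DESTRUCTIVE = [
    re.compile(r"\brm\s+-[a-z]*r[a-z]*f"),            # rm -rf and flag variants
    re.compile(r"\bdrop\s+(table|database)\b", re.I),  # destructive SQL
    re.compile(r"\bkubectl\s+delete\b"),               # deletes k8s resources
    re.compile(r"\bterraform\s+destroy\b"),
]

def classify(command: str, llm_label: str) -> str:
    """Return the final risk label for a runbook step.

    llm_label is whatever the model said ("safe" / "caution" / "dangerous").
    The deterministic scan runs afterwards and can only escalate, never
    downgrade -- the scanner overrides the LLM, always.
    """
    if any(p.search(command) for p in DESTRUCTIVE):
        return "dangerous"
    return llm_label

# Even if the LLM misclassifies, the scanner escalates:
print(classify("rm -rf /var/lib/payments", llm_label="safe"))    # dangerous
print(classify("kubectl get pods -n payments", llm_label="safe"))  # safe
```

The key design choice is the asymmetry: the deterministic layer can only raise a step's risk, so an LLM failure degrades toward over-caution, never toward silent danger.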
---

## 2. MARKET OPPORTUNITY

### The Documentation-to-Execution Gap

Every engineering team has some form of incident documentation: Confluence pages, Notion databases, Google Docs, Slack threads, senior engineers' brains. And every team has a terminal where incidents get resolved. The gap between those two things — the document and the execution — is where MTTR lives, where burnout festers, and where $12B+ in operational tooling spend fails to deliver.
The current market is segmented into tools that solve *pieces* of the incident lifecycle but leave the critical bridge unbuilt:

| Category | Players | What They Do | What They Don't Do |
|----------|---------|-------------|-------------------|
| Alert Routing | PagerDuty, OpsGenie, Grafana OnCall | Route alerts to the right human | Help that human actually resolve the incident |
| Incident Management | Rootly, Incident.io, FireHydrant | Document the bureaucracy of the outage | Execute the resolution |
| Job Scheduling | Rundeck (OSS) | Run pre-defined jobs via YAML | Parse natural language, classify risk, learn from execution |
| Orchestration Platforms | Shoreline, Transposit | Execute complex remediation workflows | Onboard in under 5 minutes, work without a proprietary DSL |
| AIOps | BigPanda, Moogsoft | Cluster and correlate alerts | Bridge the gap from "we know what's wrong" to "here's how to fix it" |

Nobody owns the bridge from documentation to execution. That's the gap. That's the market.
### Market Sizing

**TAM (Total Addressable Market): $12B+**

26 million software developers globally. ~20% involved in ops/on-call rotations (5.2 million). Average enterprise tooling spend of $200/month per engineer across incident management, AIOps, and automation tooling. The TAM encompasses the full operational tooling budget that dd0c/run competes for or displaces.

**SAM (Serviceable Available Market): $1.5B**

Focus on mid-market: startups to mid-size tech companies (Series A through Series D). These teams have the highest pain-to-budget ratio — 10-100 engineers, can't afford dedicated SRE tooling teams, can't justify Shoreline's enterprise pricing. Estimated 50,000 such companies globally, averaging 30 on-call engineers each (1.5 million target seats). At $1,000/year per seat equivalent, that's $1.5B.

**SOM (Serviceable Obtainable Market): $15M (Year 3 Target)**

1% penetration of SAM. 500 companies with average ARR of $30,000. This is a bootstrapped operation — the numbers must be defensible, not inflated for imaginary VCs.
### Competitive Landscape

**Rundeck (Open Source / PagerDuty-owned)**
- Strengths: Free, established, large community.
- Weaknesses: 2015-era UX. Requires Java, a database, YAML job definitions. It's a job scheduler masquerading as a runbook engine. Time-to-value is measured in days, not seconds.
- Our advantage: Zero-config AI parsing vs. manual YAML authoring. It's Notepad vs. VS Code — different products for different eras.

**Transposit / Shoreline**
- Strengths: Deep orchestration capabilities, enterprise customers.
- Weaknesses: Over-engineered for the 1% of orgs that have bandwidth to learn a proprietary DSL. They built jetpacks for people who are currently drowning. Pricing is enterprise-only.
- Our advantage: Paste-to-parse in 5 seconds. No DSL. Mid-market pricing. We meet teams where they are, not where Shoreline wishes they were.

**Rootly / Incident.io / FireHydrant**
- Strengths: Excellent incident management workflows, growing fast.
- Weaknesses: They document the fire; they don't hold the hose. They stop at the boundary of execution.
- Our advantage: We start where they stop. And with dd0c/alert integration, we own the full chain from detection to resolution.

**PagerDuty Automation Actions**
- Strengths: Distribution. Every on-call team already has PagerDuty.
- Weaknesses: A cynical upsell — thousands of dollars to automate the resolution of alerts they already charge you to receive. Locked to the PagerDuty ecosystem. No runbook intelligence, just pre-defined script execution.
- Our advantage: Platform-agnostic (works with PagerDuty, OpsGenie, Grafana OnCall, any webhook). AI-native intelligence vs. dumb script execution. 10x cheaper.

**The real threat: PagerDuty or Incident.io acquiring an AI runbook startup and bundling it into their Enterprise tier.** Mitigation: they will build it as a closed, proprietary upsell. They cannot integrate with dd0c/cost, dd0c/drift, or dd0c/alert. They will sell to the CIO; we sell to the on-call engineer at 3 AM. We win on the open ecosystem, the cross-platform footprint, and mid-market pricing.
### Timing Thesis: Why 2026

Two years ago, this product was impossible. Three things changed:

1. **LLM Parsing Reliability.** A 2024 model would hallucinate destructive commands or fail to parse implicit prerequisites. 2026 models can perform rigorous structural extraction and risk classification with the accuracy required for production-adjacent tooling. Context windows are large enough to digest a 50-page postmortem and distill the exact terminal commands that fixed it.

2. **Inference Economics.** Inference latency is under 2 seconds. Costs have commoditized to the point where we can offer AI-powered parsing for $29/runbook/month, destroying the enterprise pricing models of incumbents who charge $50-100/seat/month for dumb automation.

3. **The Agentic Shift.** The industry is transitioning from "human-in-the-loop" to "human-on-the-loop." Engineering teams are psychologically ready for AI-assisted operations in a way they weren't 18 months ago. The dread of the 3am pager now outweighs the skepticism of the AI — and that's the inflection point.

**The window:** If we don't build this in the next 12 months, PagerDuty, Incident.io, or a well-funded startup will. The documentation-to-execution gap is too obvious and too painful to remain unowned. First-mover advantage accrues to whoever builds the Resolution Pattern Database first — that data moat compounds daily and is nearly impossible to replicate.
---

## 3. PRODUCT DEFINITION

### Value Proposition

**For on-call engineers:** Replace the 3am panic spiral — searching Confluence, interpreting stale docs, copy-pasting commands you don't understand — with a calm copilot that already knows what's wrong, already has the runbook, and walks you through it step by step.

**For SRE managers:** Replace vibes-based operational health with data. Know which services lack runbooks, which runbooks are stale, which steps get skipped, and what your actual MTTR is — with audit-ready compliance evidence generated automatically.

**For senior engineers (the bus factor):** Stop being the human runbook. Your expertise gets captured from your natural workflow — terminal sessions, Slack threads, incident resolutions — and lives on in the system even when you're on vacation, asleep, or gone.

**One sentence:** dd0c/run turns your team's scattered incident knowledge into a living, learning, executable system that makes every on-call engineer as effective as your best one.
### Personas

| Persona | Name | Role | Primary Need | Key Metric |
|---------|------|------|-------------|------------|
| The On-Call Engineer | Riley | Mid-level SRE, 2 years exp, paged at 3am | Instantly know what to do without searching or guessing | Time-to-resolution, confidence during incidents |
| The SRE Manager | Jordan | Manages 8 SREs, owns incident response quality | Consistent, auditable, measurable incident response | MTTR trends, runbook coverage, compliance readiness |
| The Runbook Author | Morgan | Staff engineer, 6 years, carries institutional knowledge | Transfer expertise without the overhead of writing docs | Knowledge capture rate, runbook usage by others |

### The Trust Gradient — The Core Design Principle

This is the architectural decision that defines dd0c/run. It is non-negotiable.
```
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│                     THE TRUST GRADIENT                      │
│                                                             │
│   READ-ONLY ──→ SUGGEST ──→ COPILOT ──→ AUTOPILOT           │
│                                                             │
│   "Show me     "Here's what "I'll do it, "Handled.          │
│    the steps"   I'd do"      you approve"  Here's           │
│                                            the log"         │
│                                                             │
│   V1 ◄──────────────────►                                   │
│   (V1 scope: Read-Only + Suggest + Copilot for 🟢 only)     │
│                                                             │
│   ● Per-runbook setting (not global)                        │
│   ● Per-step override (🟢 auto, 🟡 prompt, 🔴 block)        │
│   ● Earned through data (10 clean runs → suggest upgrade)   │
│   ● Instantly revocable (one bad run → auto-downgrade)      │
│   ● The human always has the kill switch                    │
└─────────────────────────────────────────────────────────────┘
```
**V1 is Read-Only + Suggest + Copilot for 🟢 Safe steps ONLY.** The AI auto-executes read-only diagnostic commands (check logs, query metrics, list pods). All state-changing commands (🟡 Caution, 🔴 Dangerous) require explicit human approval. Full Autopilot mode is V2 — and only for runbooks with a proven track record.

This is not a limitation. This is the product. Trust earned through data is the moat that no competitor can shortcut.
### Features by Release

#### V1: "Paste → Parse → Page → Pilot" (Months 4-6)

The minimum viable product. Four verbs. If you can't explain it in those four words, it's out of scope.
| Feature | Description | Priority |
|---------|-------------|----------|
| **Paste & Parse** | Copy-paste raw text from anywhere. AI structures it into numbered, executable steps with risk classification (🟢🟡🔴) in < 5 seconds. Zero configuration. | P0 — This IS the product |
| **Risk Classification Engine** | AI + deterministic scanner labels every step. LLM classifies intent; regex/AST scanner validates against known destructive patterns. Scanner overrides LLM. Always. | P0 — Trust foundation |
| **Copilot Execution** | Slack bot + web UI walks engineer through runbook step-by-step. Auto-executes 🟢 steps. Prompts for 🟡. Blocks 🔴 without explicit typed confirmation. | P0 — Core value prop |
| **Alert-to-Runbook Matching** | PagerDuty/OpsGenie webhook integration. Alert fires → dd0c/run matches to most relevant runbook via keyword + metadata + basic semantic similarity. Posts in incident Slack channel. | P0 — "The runbook finds you" |
| **Alert Context Injection** | Matched runbook arrives pre-populated: affected service, region, recent deploys, related metrics. No manual lookup. | P0 — 3am brain can't look things up |
| **Rollback-Aware Execution** | Every state-changing step records its inverse command. One-click undo at any point. | P0 — Safety net |
| **Divergence Detection** | Post-incident: AI compares what engineer did vs. what runbook prescribed. Flags skipped steps, modified commands, unlisted actions. | P1 — Learning loop |
| **Auto-Update Suggestions** | Generates runbook diffs from divergence data. "You skipped steps 6-8 in 4/4 runs. Remove them?" | P1 — Self-improving runbooks |
| **Runbook Health Dashboard** | Coverage %, average staleness, MTTR with/without runbook, step skip rates. Jordan's command center. | P1 — Management visibility |
| **Compliance Export** | PDF/CSV of timestamped execution logs, approval chains, audit trail. Not pretty, but functional. | P1 — SOC 2 readiness |
| **Prerequisite Detection** | AI identifies implicit requirements ("you need kubectl access", "make sure you're on the VPN") and surfaces pre-flight checklist. | P2 |
| **Ambiguity Highlighter** | AI flags vague steps ("check the logs" — which logs?) and prompts author to clarify before the runbook goes live. | P2 |

**What V1 does NOT include:** Terminal Watcher (V2), Full Autopilot (V2), Confluence/Notion crawlers (V2 — V1 is paste), Incident Simulator (V3), Runbook Marketplace (V3), Multi-cloud abstraction (V3), Voice-guided runbooks (V3).
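The alert-to-runbook matching feature combines keyword overlap, metadata, and basic semantic similarity. A hedged sketch of how those signals might combine — the weights are made up, and the "semantic" signal is stubbed out as token Jaccard overlap purely for illustration (a real system would use embeddings and tuned weights):

```python
def match_score(alert: dict, runbook: dict) -> float:
    """Blend keyword overlap with a metadata (service tag) match."""
    a_tokens = set(alert["summary"].lower().split())
    r_tokens = set(runbook["title"].lower().split())
    # Jaccard overlap stands in for "basic semantic similarity" here.
    keyword = len(a_tokens & r_tokens) / max(len(a_tokens | r_tokens), 1)
    service = 1.0 if alert.get("service") == runbook.get("service") else 0.0
    return 0.6 * keyword + 0.4 * service   # illustrative weights only

alert = {"summary": "payment service latency high", "service": "payment-svc"}
books = [
    {"title": "Payment Service Latency Runbook", "service": "payment-svc"},
    {"title": "Redis Memory Pressure", "service": "cache"},
]
best = max(books, key=lambda rb: match_score(alert, rb))
print(best["title"])  # Payment Service Latency Runbook
```

The highest-scoring runbook is what gets posted into the incident Slack channel, along with the confidence score the journey below-the-fold quotes (e.g. "92% confidence").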
#### V2: "Watch → Learn → Predict → Protect" (Months 7-9)

| Feature | Description | Unlocks |
|---------|-------------|---------|
| Terminal Watcher | Opt-in agent captures commands during incidents. AI generates runbooks from real actions. | Solves cold-start. Captures Morgan's expertise passively. |
| Confluence/Notion Crawlers | Automated discovery and sync of runbook-tagged pages. | Bulk import for large teams with 100+ runbooks. |
| Full Autopilot Mode | Runbooks with 10+ successful copilot runs and zero modifications can promote 🟢 steps to autonomous execution. | "Sleep through the night" promise. |
| dd0c/alert Deep Integration | Multi-alert correlation, enriched context passing, bidirectional feedback loop. | The platform flywheel engages. |
| Infrastructure-Aware Staleness | Cross-reference steps against live Terraform state, K8s manifests, AWS resources via dd0c/portal. | Runbooks that know when they're lying. |
| Runbook Effectiveness ML Model | Predict runbook success probability based on alert context, time of day, engineer experience. | Data-driven trust promotion. |
#### V3: "Simulate → Train → Marketplace → Scale" (Months 10-12)

| Feature | Description | Unlocks |
|---------|-------------|---------|
| Incident Simulator / Fire Drills | Sandbox environment for practicing runbooks. Gamified with scores. | Viral growth. "My team's score is 94." |
| Voice-Guided Runbooks | AI reads steps aloud at 3am. Hands-free incident response. | Genuine differentiation nobody else has. |
| Runbook Marketplace | Community-contributed, anonymized templates. "How teams running EKS + RDS handle connection storms." | Network effect. Templates improve with every customer. |
| Predictive Runbook Staging | dd0c/alert detects anomaly trending toward incident → dd0c/run pre-stages runbook → on-call gets heads-up 30 min early. | The incident that never happens. |

### User Journey: Riley's 3am Page
```
3:17 AM — Phone buzzes. PagerDuty: CRITICAL — payment-service latency > 5000ms

3:17 AM — dd0c/run webhook fires. Matches alert to "Payment Service Latency
          Runbook" (92% confidence).

3:17 AM — Slack bot posts in #incident-2847:
          🔔 Runbook matched: Payment Service Latency
          📊 Pre-filled: region=us-east-1, service=payment-svc, deploy=v2.4.1 (2h ago)
          🟢🟡🔴 8 steps (4 safe, 3 caution, 1 danger)
          [▶ Start Copilot]

3:18 AM — Riley taps Start Copilot. Steps 1-3 (🟢 Safe) auto-execute:
          ✅ Checked pod status — 2/5 pods in CrashLoopBackOff
          ✅ Pulled logs — 847 connection timeout errors in last 5 min
          ✅ Queried pg_stat_activity — 312 idle-in-transaction connections

3:19 AM — Step 4 (🟡 Caution): "Bounce connection pool — kubectl rollout restart"
          ⚠️ This will restart all pods. ~30s downtime.
          ↩️ Rollback: kubectl rollout undo ...
          Riley taps [✅ Approve & Execute]

3:20 AM — Step 5 (🟢 Safe) auto-executes: Verify latency recovery.
          ✅ Latency recovered to baseline. All pods Running.

3:21 AM — ✅ Incident resolved. MTTR: 3m 47s.
          📝 "You skipped steps 6-8. Also ran a command not in the runbook:
              SELECT count(*) FROM pg_stat_activity"
          Suggested updates: Remove steps 6-8, add DB connection check before step 4.
          [✅ Apply Updates]

3:21 AM — Riley applies updates. Goes back to sleep. The cat didn't even wake up.
```

Previous MTTR for this incident type without dd0c/run: 38-45 minutes. With dd0c/run: under 4 minutes. That's the product.
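The execution rules Riley's journey walks through (🟢 auto-executes, 🟡 prompts for approval, 🔴 demands a typed resource name, state changes record their inverse) can be sketched as follows. Everything here is illustrative: the step dicts, callback names, and undo stack are invented for the sketch.

```python
executed = []  # (cmd, rollback) pairs, newest last -- powers one-click undo

def execute_step(step: dict, run, approve=None, typed=None) -> bool:
    """Apply the risk gates before running a step's command."""
    risk = step["risk"]
    if risk == "caution" and not (approve and approve(step)):
        return False               # 🟡 requires one-tap approval
    if risk == "dangerous" and typed != step["resource"]:
        return False               # 🔴 clicking is not enough; typing is required
    run(step["cmd"])               # 🟢 falls straight through to execution
    if step.get("rollback"):       # state-changing steps record their inverse
        executed.append((step["cmd"], step["rollback"]))
    return True

def undo_last(run) -> None:
    _, rollback = executed.pop()
    run(rollback)                  # replay the recorded inverse command

log = []
execute_step({"risk": "safe", "cmd": "kubectl get pods"}, log.append)
execute_step(
    {"risk": "caution",
     "cmd": "kubectl rollout restart deploy/payment-svc",
     "rollback": "kubectl rollout undo deploy/payment-svc"},
    log.append, approve=lambda s: True)
undo_last(log.append)
print(log[-1])  # kubectl rollout undo deploy/payment-svc
```

The rollback stack is what makes the 🟡 approval cheap to grant: the engineer knows the inverse command was captured before anything changed.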
### Pricing

| Tier | Price | Includes |
|------|-------|----------|
| **Free** | $0 | 3 runbooks, read-along mode only (no execution), basic parsing |
| **Pro** | $29/runbook/month | Unlimited runbooks, copilot execution, Slack bot, PagerDuty/OpsGenie integration, basic dashboard, divergence detection |
| **Business** | $49/seat/month | Everything in Pro + autopilot mode (V2), API access, SSO, compliance export, audit trail, priority support |

**Pricing rationale:** The per-runbook model ($29/runbook/month) aligns cost with value — teams pay for the runbooks they actually use, not empty seats. A team with 10 active runbooks pays $290/month. As they add more runbooks and see MTTR drop, revenue grows with demonstrated value. The per-seat Business tier captures larger teams that want platform features (SSO, compliance, API).

**Note from Party Mode:** The VC advisor recommended switching to pure per-seat pricing for simplicity. This is a valid concern. We will A/B test per-runbook vs. per-seat during beta to determine which model drives faster adoption and lower churn. The per-runbook model has the advantage of a lower entry point and direct value alignment; the per-seat model has the advantage of predictability and simpler billing.
---

## 4. GO-TO-MARKET PLAN

### Launch Strategy

dd0c/run is Phase 3 in the dd0c platform rollout (Months 4-6). It does not launch alone. It launches alongside dd0c/alert as the "On-Call Savior" bundle — because a runbook engine without alert intelligence is a document viewer, and alert intelligence without execution is a notification system. Together, they close the loop from detection to resolution.

**Prerequisite:** dd0c/cost and dd0c/route must be live and generating revenue (Phase 1, Months 1-3). These FinOps modules prove immediate, hard-dollar ROI. If we can't save a company money, we have no right to ask them to trust us with their production environment. The FinOps wedge buys the political capital for the operational wedge.

### Beachhead: Teams Drowning in On-Call

The ideal early customer has three characteristics:

1. **High incident volume.** 10+ pages per week across the team. They feel the pain daily.
2. **Existing runbooks that they know are stale.** They've tried to document. They know it's broken. They're looking for a better way.
3. **No dedicated SRE tooling team.** They can't afford to spend 3 months configuring Rundeck or learning Shoreline's DSL. They need something that works in 5 minutes.

This is the Series B/C startup with 5-15 SREs supporting 50-200 developers. They're big enough to have real infrastructure problems, small enough that every engineer feels the on-call burden personally.

**Secondary beachhead:** Compliance chasers — startups preparing for SOC 2 who need documented, auditable incident response procedures yesterday. We sell them the audit trail masquerading as an automation tool.

### The dd0c/alert → dd0c/run Upsell Path

This is the primary growth engine for dd0c/run. The conversion funnel:
```
dd0c/cost user (saves money) → trusts the platform
        │
        ▼
dd0c/alert user (reduces noise, sleeps better) → trusts the intelligence
        │
        ▼
dd0c/alert fires an alert → Slack message includes:
"📋 A runbook exists for this alert pattern. Want dd0c/run to guide you?"
        │
        ▼
Engineer clicks through → lands on Paste & Parse → 5-second wow moment
        │
        ▼
dd0c/run user (resolves incidents faster) → trusts the execution
        │
        ▼
dd0c/portal user (owns the full developer experience) → locked in
```

Every dd0c/alert notification becomes a dd0c/run acquisition channel. The upsell is embedded in the product, not in a sales email.
### Growth Loops

**Loop 1: The Parsing Flywheel (Product-Led)**
Engineer pastes runbook → AI parses in 5 seconds → "Wow" → pastes 5 more → invites teammate → teammate pastes theirs → team has 20 runbooks in a week → first incident uses copilot → MTTR drops → team is hooked.

*Fuel:* The 5-second parse moment must be so good that engineers paste runbooks for fun.

**Loop 2: The Incident Evidence Loop (Manager-Led)**
Jordan sees MTTR data → shows leadership → "With dd0c/run: 6 minutes. Without: 38 minutes." → leadership asks "Why don't all teams use this?" → org-wide rollout → more teams = more runbooks = better matching = better MTTR.

*Fuel:* The MTTR comparison chart. One number that justifies the budget.

**Loop 3: The Open-Source Wedge (Community-Led)**
Release `ddoc-parse` — a free, open-source CLI that parses runbooks locally. No account needed. No SaaS. Engineers who love it self-select into the beta. Their runbooks (anonymized) improve the parsing model. The CLI gets better. More users. More conversions.

*Fuel:* A genuinely useful free tool, not a crippled demo.

**Loop 4: The Knowledge Capture Loop (Retention)**
Morgan's expertise captured in dd0c/run → Morgan leaves → Riley handles incident using Morgan's captured knowledge → team realizes dd0c/run IS their institutional memory → switching cost becomes infinite → renewal is automatic.

*Fuel:* The "Ghost of Morgan" moment — the first time a junior resolves an incident using a runbook generated from a senior's session.
### Content Strategy

**Engineering-as-marketing.** Developers use adblockers and hate salespeople. We don't sell to them. We teach them.

| Content | Channel | Purpose |
|---------|---------|---------|
| "The Anatomy of a 3am Page" — blog post with real data on cognitive impairment during nighttime incidents | Blog, Hacker News, r/sre | Thought leadership. Establishes the problem before pitching the solution. |
| `ddoc-parse` open-source CLI | GitHub, Product Hunt | Free tool that demonstrates AI parsing quality. Acquisition funnel. |
| "Your Runbooks Are Lying to You" — analysis of runbook staleness rates across 100 teams | Blog, SRE Weekly newsletter | Data-driven content that managers share internally. |
| Conference lightning talks (SREcon, KubeCon, DevOpsDays) | In-person | 5-minute talk ending with beta signup QR code. |
| Incident postmortem outreach | Direct DM | Companies publishing postmortems are self-selecting. "I read your Redis incident writeup. We're building something that would have cut your MTTR in half." |
| Pre-seeded runbook templates (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure) | In-product, GitHub | Solve the cold-start problem. Demonstrate value before the user pastes anything. |
### 90-Day Launch Timeline

| Day | Milestone |
|-----|-----------|
| **1-14** | Private alpha with 5 hand-picked teams from the dd0c/cost user base. Paste & Parse + basic Copilot in Slack. Gather parsing quality feedback. |
| **15-30** | Iterate on parsing accuracy based on alpha feedback. Add PagerDuty webhook integration. Add risk classification validation (deterministic scanner). Ship divergence detection. |
| **31-45** | Expand to 15-20 beta teams. Launch `ddoc-parse` open-source CLI. Begin collecting MTTR data. Add health dashboard for the Jordan persona. |
| **46-60** | Beta teams running in production. First MTTR comparison data available. Begin compliance export feature. Publish "The Anatomy of a 3am Page" blog post. |
| **61-75** | Refine based on beta feedback. A/B test pricing models (per-runbook vs. per-seat). Secure 3+ case study commitments. Ship dd0c/alert integration (webhook-based). |
| **76-90** | Public launch. Product Hunt launch. Hacker News "Show HN" post. Conference talk submissions. Convert beta teams to paid. Target: 50 teams with ≥ 5 active runbooks. |
---

## 5. BUSINESS MODEL

### Revenue Model

**Primary:** $29/runbook/month (Pro tier) or $49/seat/month (Business tier).

A team with 10 active runbooks on Pro pays $290/month. A team of 8 SREs on Business pays $392/month. Revenue scales with demonstrated value — more runbooks means more incidents resolved faster, which means higher willingness to pay.

**Secondary:** The dd0c platform bundle. Teams using dd0c/cost + dd0c/alert + dd0c/run together represent an average deal size of $500-800/month. The platform is stickier than any individual module.
### Unit Economics

| Metric | Value | Notes |
|--------|-------|-------|
| **COGS per runbook/month** | ~$3-5 | LLM inference (via dd0c/route, optimized model selection), compute for Rust API + agent coordination, PostgreSQL storage. Parsing is a one-time cost per runbook; execution inference is per-incident. |
| **Gross Margin** | ~85% | SaaS-standard. The Rust stack keeps infrastructure costs low. LLM costs are the primary variable, managed by dd0c/route. |
| **CAC (Target)** | < $500 | Product-led growth via `ddoc-parse` CLI + dd0c/alert upsell. No outbound sales team. Content marketing + community seeding. |
| **LTV (Target)** | > $5,000 | 18+ month retention (institutional memory lock-in). Average $290/month × 18 months = $5,220. |
| **LTV:CAC Ratio** | > 10:1 | Healthy for bootstrapped SaaS. The dd0c/alert upsell path has near-zero incremental CAC. |
| **Payback Period** | < 2 months | At $290/month with $500 CAC, payback in ~1.7 months. |
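The table's targets follow directly from its own inputs; recomputing them keeps the numbers honest. All figures below come from the table itself (nothing new is assumed):

```python
# Inputs, straight from the unit-economics table above.
monthly_revenue = 10 * 29    # 10 active Pro runbooks = $290/month
cac = 500                    # target customer acquisition cost
retention_months = 18        # retention floor from the LTV row

ltv = monthly_revenue * retention_months
print(ltv)                               # 5220 -> clears the "> $5,000" target
print(round(ltv / cac, 1))               # 10.4 -> clears the "> 10:1" ratio
print(round(cac / monthly_revenue, 1))   # 1.7  -> months to payback, under 2
```

Note how sensitive the ratio is to retention: at 12 months instead of 18, LTV falls to $3,480 and the ratio to roughly 7:1, which is why the institutional-memory lock-in argument carries real financial weight.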
### Path to Revenue Milestones

**$10K MRR (Month 8 — 4 months post-launch)**
- 35 Pro teams × 10 runbooks × $29 = $10,150/month
- Source: Beta conversions (15-20 teams) + dd0c/alert upsell (10-15 teams) + organic from `ddoc-parse` CLI (5-10 teams)
- Key assumption: 70% beta-to-paid conversion rate

**$50K MRR (Month 14 — 10 months post-launch)**
- 120 Pro teams ($34,800) + 30 Business teams ($14,700) = $49,500/month
- Source: Platform flywheel engaged. dd0c/alert → dd0c/run conversion running at 25%. Community templates driving organic acquisition. First conference talks generating inbound.
- Key assumption: < 5% monthly churn (institutional memory lock-in)

**$100K MRR (Month 20 — 16 months post-launch)**
- 200 Pro teams ($58,000) + 80 Business teams ($39,200) + 5 custom enterprise deals ($10,000) = $107,200/month
- Source: Runbook Marketplace (V3) creating network effects. Multi-team deployments within companies. SOC 2 compliance driving Business tier upgrades.
- Key assumption: Average expansion revenue of 30% (teams add runbooks and seats over time)
|
||||
|
||||
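The milestone math above can be sanity-checked in a few lines. Note one inference: the Business per-team price of $490/month is derived from the stated totals ($14,700 / 30 teams), not given directly in the document:

```rust
// Sanity-check the MRR milestone arithmetic.
// Pro team = 10 runbooks × $29 = $290/month (stated).
// Business team = $490/month (inferred from $14,700 / 30 teams).
fn mrr(pro_teams: u32, business_teams: u32, enterprise_custom: u32) -> u32 {
    pro_teams * 10 * 29 + business_teams * 490 + enterprise_custom
}

fn main() {
    println!("Month 8:  ${}", mrr(35, 0, 0));        // $10,150
    println!("Month 14: ${}", mrr(120, 30, 0));      // $49,500
    println!("Month 20: ${}", mrr(200, 80, 10_000)); // $107,200
}
```

All three milestones reproduce the document's totals exactly, so the tier math is internally consistent.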
### Solo Founder Constraints

This is the hardest product in the dd0c lineup to support as a solo founder. The reasons are structural:

1. **Production safety liability.** If dd0c/run contributes to a production outage, the reputational damage extends to the entire dd0c brand. There is no "move fast and break things" with a product that touches production. Every release must be paranoid.
2. **Support burden.** When a customer's weird custom Kubernetes setup doesn't play nice with the Rust agent, that's a high-urgency, high-complexity support ticket at 3am. Unlike dd0c/cost (where a bug means a wrong number on a dashboard), a dd0c/run bug means a failed incident response.
3. **Security surface area.** The VPC agent is open-source and auditable, but it's still a binary running inside customer infrastructure. A CVE in the agent is an existential event. Security reviews from enterprise customers will be thorough and time-consuming.

**Mitigations:**

- **Shared platform architecture.** One API gateway, one auth layer, one billing system, one OpenTelemetry ingest pipeline across all dd0c modules. If you build six separate data models, you burn out in 14 months.
- **V1 scope discipline.** Copilot-only. No Autopilot. No Terminal Watcher. No crawlers. The smaller the surface area, the smaller the support burden.
- **Community-driven templates.** Pre-seed 50 high-quality templates for standard infrastructure. Let the community maintain and improve them. Reduce the "my setup is unique" support tickets.
- **Aggressive kill criteria.** If the support burden exceeds 10 hours/week within the first 3 months, re-evaluate the agent architecture. Consider a managed-execution model where the SaaS handles execution via customer-provided cloud credentials (higher trust barrier, lower support burden).

---

## 6. RISKS & MITIGATIONS

### Risk 1: LLM Hallucination Causes a Production Outage

**Severity:** 10/10 — Extinction-level event for the company.
**Probability:** Medium (with mitigations), High (without).

**The scenario:** The runbook says "Restart the pod." The LLM hallucinates and outputs `kubectl delete deployment` instead, classifying it as 🟢 Safe. The tired engineer clicks approve. Production goes down. The customer cancels. dd0c goes bankrupt from reputational damage.

**Mitigations:**

- Deterministic regex/AST scanner validates every command against known destructive patterns. The scanner overrides the LLM classification. Always. The LLM is advisory; the scanner is authoritative.
- Agent-level command whitelist. Anything outside the whitelist is blocked at the agent, regardless of what the SaaS sends. Defense in depth.
- 🔴 Dangerous actions require typing the resource name to confirm. Not clicking a button. UI friction is a feature.
- Every state-changing step records its rollback command. One-click undo.
- V1 limits auto-execution to 🟢 Safe (read-only diagnostic) commands only. State changes always require human approval.
- Comprehensive logging of every command suggested, approved, executed, and rolled back. Full audit trail.

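The scanner-overrides-LLM rule from the first mitigation can be sketched in a few lines. The pattern list, enum, and function names here are illustrative, not the shipped set; a real scanner would also parse command ASTs rather than rely on substring matching alone:

```rust
// Sketch: deterministic scanner verdicts replace LLM labels, never the reverse.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Risk { Safe, Caution, Dangerous } // 🟢 🟡 🔴

// Illustrative destructive patterns; the real list would be far larger.
const DESTRUCTIVE: &[&str] = &[
    "rm -rf", "drop table", "kubectl delete", "terraform destroy",
];

/// Deterministic pass: returns a verdict only when a pattern matches.
fn scan(command: &str) -> Option<Risk> {
    let lower = command.to_lowercase();
    DESTRUCTIVE.iter().any(|p| lower.contains(p)).then_some(Risk::Dangerous)
}

/// The scanner is authoritative; the LLM label is only a fallback.
fn classify(command: &str, llm_label: Risk) -> Risk {
    scan(command).unwrap_or(llm_label)
}

fn main() {
    // The hallucination scenario: LLM labels a delete as Safe; scanner overrides.
    assert_eq!(classify("kubectl delete deployment api", Risk::Safe), Risk::Dangerous);
    // No destructive match: the advisory LLM label stands.
    assert_eq!(classify("kubectl get pods", Risk::Safe), Risk::Safe);
    assert_eq!(classify("systemctl restart api", Risk::Caution), Risk::Caution);
    println!("scanner override holds");
}
```

The key design point is the `unwrap_or`: the deterministic path can only tighten a classification, so an LLM hallucination can never loosen one.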
**Kill condition:** If the LLM misclassifies a destructive command as Safe (🟢) even once during beta, or if the false-positive rate for safe commands exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.

### Risk 2: PagerDuty / Incident.io Ships Native AI Runbook Automation

**Severity:** 8/10.
**Probability:** High — they will build something. The question is when and how good.

**The scenario:** PagerDuty acquires a small AI runbook startup and bundles it into their Enterprise tier for "free" (subsidized by their massive license cost), choking off dd0c/run's distribution.

**Mitigations:**

- Platform agnosticism. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, and any webhook source. PagerDuty's automation is locked to PagerDuty.
- Cross-module data advantage. PagerDuty can't integrate with dd0c/cost anomalies, dd0c/drift detection, or dd0c/portal service ownership. We have the context. They have the routing rules.
- Mid-market pricing. PagerDuty's automation is an enterprise upsell ($$$). We sell to the on-call engineer at 3am for $29/runbook.
- The Resolution Pattern Database. PagerDuty keeps data siloed per enterprise. We anonymize and share the "ideal" runbooks across the mid-market. A network effect they can't replicate without cannibalizing their enterprise model.

**Pivot option:** If PagerDuty ships a compelling native solution, double down on the dd0c ecosystem play. dd0c/run becomes the execution arm of dd0c/alert + dd0c/drift + dd0c/portal — a tightly coupled platform that no single-feature bolt-on can touch.

### Risk 3: Teams Don't Have Documented Runbooks (Cold Start Problem)

**Severity:** 7/10.
**Probability:** High — many teams have zero runbooks.

**The scenario:** A prospect signs up, goes to the "Paste Runbook" screen, and realizes they have nothing to paste. Churn happens in 60 seconds.

**Mitigations:**

- Pre-seed the platform with 50 high-quality templates for standard infrastructure (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure, cert expiry, etc.). New users see value before they paste anything.
- Slack Thread Distiller (V1): Paste a Slack thread URL from a past incident. AI extracts the resolution commands and generates a draft runbook. If they have incidents, they have Slack threads.
- Postmortem-to-Runbook Pipeline (V1): Feed in a postmortem doc. AI extracts "what we did to fix it" and generates a structured runbook.
- Terminal Watcher (V2): Captures commands during live incidents and generates runbooks automatically.
- Shift marketing from "Automate your runbooks" to "Generate your missing runbooks." The product creates runbooks, not just executes them.

### Risk 4: The "Agentic AI" Obsolescence Event

**Severity:** High.
**Probability:** Low (in the next 3 years).

**The scenario:** Autonomous AI agents (Devin, GitHub Copilot Workspace, Pulumi Neo) can detect and fix infrastructure issues without human intervention. Who needs runbooks?

**Mitigations:**

- Runbooks become the "policy" that defines what the agent *should* do. They're the bridge between human intent and agent execution. We pivot from "human automation" to "agent policy management."
- Position dd0c/run as the control plane for agentic operations — the system that defines, constrains, and audits what AI agents are allowed to do in production.
- The Trust Gradient already models this transition: Read-Only → Copilot → Autopilot is the same spectrum as Human-Driven → Human-on-the-Loop → Agent-Driven.

### Risk 5: Solo Founder Scaling / The Bus Factor

**Severity:** High.
**Probability:** High — Brian is building six products.

**The scenario:** The support burden of a production-safety-critical product overwhelms a solo founder. A critical bug in the VPC agent requires immediate response at 3am. Brian burns out.

**Mitigations:**

- Shared platform architecture reduces per-module engineering overhead by 60%+.
- V1 scope discipline: Copilot-only, no Autopilot, no crawlers. The smallest possible surface area.
- Open-source the Rust agent. Community contributions for edge-case Kubernetes configurations. Community security audits.
- Aggressive automation of support: self-healing agent updates, comprehensive error messages, in-product diagnostics.
- If dd0c/run reaches $50K MRR, hire a dedicated SRE for agent support. This is the first hire, non-negotiable.

### The Catastrophic Scenario and How to Prevent It

**The nightmare:** dd0c/run's AI suggests a destructive command. A sleep-deprived engineer approves it. Production goes down for a major customer. The incident gets posted on Hacker News. The dd0c brand — across all six modules — is destroyed. Not just dd0c/run. Everything.

**Prevention (defense in depth):**

1. **Layer 1 — LLM Classification:** AI labels every step with a risk level. This is the first pass, and it's the least trusted.
2. **Layer 2 — Deterministic Scanner:** Regex/AST pattern matching against known destructive commands (`delete`, `drop`, `rm -rf`, `kubectl delete namespace`, etc.). Overrides the LLM. Catches hallucinations.
3. **Layer 3 — Agent Whitelist:** The Rust agent maintains a local whitelist of allowed command patterns. Anything not on the whitelist is rejected at the agent level, regardless of what the SaaS sends. The agent doesn't trust the cloud.
4. **Layer 4 — UI Friction:** 🟡 commands require click-to-approve. 🔴 commands require typing the resource name. No "approve all" button. Ever.
5. **Layer 5 — Rollback Recording:** Every state-changing command has a recorded inverse. One-click undo. The safety net.
6. **Layer 6 — Audit Trail:** Every command suggested, approved, modified, executed, and rolled back is logged with timestamps, user identity, and alert context. Full forensic capability.

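Layers 3 and 4 are the two gates that live closest to the blast radius, and both fit in a few lines. The whitelist entries and function names below are illustrative sketches, not the shipped agent's:

```rust
// Sketch of Layer 3 (agent-local whitelist) and Layer 4 (typed confirmation).
// Entries and names are illustrative assumptions, not the production set.
const WHITELIST: &[&str] = &["kubectl get", "kubectl describe", "kubectl logs"];

/// Layer 3: the agent does not trust the cloud. Anything the SaaS sends
/// that doesn't match a local whitelist prefix is rejected here.
fn agent_allows(command: &str) -> bool {
    WHITELIST.iter().any(|prefix| command.starts_with(prefix))
}

/// Layer 4: a 🔴 step is approved only if the engineer typed the exact
/// resource name. Friction is the feature; there is no "approve all".
fn confirm_dangerous(resource: &str, typed: &str) -> bool {
    typed == resource
}

fn main() {
    assert!(agent_allows("kubectl get pods -n payments"));
    assert!(!agent_allows("kubectl delete namespace payments")); // off-whitelist
    assert!(!confirm_dangerous("payments-db", "payments"));      // partial typing rejected
    assert!(confirm_dangerous("payments-db", "payments-db"));
    println!("agent gates hold");
}
```

Note that the whitelist check runs inside the agent binary, so even a fully compromised SaaS control plane cannot push an off-whitelist command past it.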
If all six layers fail simultaneously, the product deserves to die. But they won't fail simultaneously — that's the point of defense in depth.

---

## 7. SUCCESS METRICS

### North Star Metric

**Incidents resolved via dd0c/run copilot per month.**

This single metric captures adoption (teams are using it), trust (engineers choose copilot over skipping it), and value (incidents are actually getting resolved). If this number grows, everything else follows.

### Leading Indicators

| Metric | Target (Month 6) | Why It Matters |
|--------|-------------------|----------------|
| Time-to-First-Runbook | < 5 minutes | If onboarding is slow, nobody reaches the value. The Vercel test. |
| Paste & Parse success rate | > 90% | If parsing fails or requires heavy manual editing, the magic moment is broken. |
| Copilot adoption rate | ≥ 60% of matched incidents | If engineers bypass copilot, the product isn't trusted or isn't useful. |
| Risk classification accuracy | > 99.9% (zero false-safe on destructive commands) | The safety foundation. One misclassification and we're done. |
| Weekly active runbooks per team | ≥ 5 | The product is alive, not shelfware. |
| Runbook update acceptance rate | ≥ 30% of suggestions | The learning loop is working. Runbooks are improving. |

### Lagging Indicators

| Metric | Target (Month 6) | Why It Matters |
|--------|-------------------|----------------|
| MTTR reduction | ≥ 40% vs. baseline | The headline number. "Teams using dd0c/run resolve incidents 40% faster." |
| NPS from on-call engineers | > 50 | Riley actually likes this. Not just tolerates it. |
| Monthly churn | < 5% | Institutional memory lock-in is working. |
| Expansion revenue | > 20% | Teams adding runbooks and seats over time. |
| Safety incidents | 0 | dd0c/run never made an incident worse. Non-negotiable. |

### 30/60/90 Day Milestones

**Day 30: Prove the Parse**

- 15-20 beta teams onboarded
- Paste & Parse working with > 90% accuracy across diverse runbook formats
- PagerDuty webhook integration live
- Risk classification validated: zero false-safe misclassifications on destructive commands
- First MTTR data points collected
- Success criteria: Engineers say "wow" when they paste their first runbook

**Day 60: Prove the Pilot**

- Beta teams running copilot in production incidents
- MTTR reduction ≥ 30% for at least 8 teams
- Divergence detection generating useful runbook update suggestions
- Health dashboard live for the Jordan persona
- dd0c/alert webhook integration functional
- `ddoc-parse` open-source CLI launched
- Success criteria: At least one engineer says "I actually slept through the night because dd0c/run handled the diagnostics"

**Day 90: Prove the Business**

- 50 teams with ≥ 5 active runbooks
- MTTR reduction ≥ 40% for at least 12 teams
- 3+ teams committed as named case studies
- Pricing model validated (per-runbook vs. per-seat A/B test complete)
- Zero safety incidents across all beta teams
- Public launch executed (Product Hunt, Hacker News, conference submissions)
- $10K MRR trajectory confirmed
- Success criteria: Beta-to-paid conversion rate ≥ 70%

### Kill Criteria

The product is killed or fundamentally re-architected if any of the following occur:

1. **Safety failure.** The LLM misclassifies a destructive command as Safe (🟢) during beta. Even once.
2. **Trust failure.** Engineers skip copilot mode > 50% of the time after 30 days. The product isn't trusted.
3. **Parse failure.** Paste & Parse accuracy stays below 80% after 60 days of iteration. The core AI capability doesn't work.
4. **Adoption failure.** Fewer than 8 beta teams active after 4 weeks. The problem isn't painful enough or the solution isn't compelling enough.
5. **MTTR failure.** MTTR reduction < 20% or inconsistent across teams after 60 days. The product doesn't deliver measurable value.

If we hit a kill criterion, the pivot options are:

- **Pivot to read-only intelligence:** Strip execution entirely. Become the "runbook quality platform" — parsing, staleness detection, coverage dashboards, compliance evidence. Lower risk, lower value, but viable.
- **Pivot to agent policy management:** If agentic AI arrives faster than expected, position dd0c/run as the policy layer that defines what AI agents are allowed to do in production.
- **Absorb into dd0c/portal:** The Contrarian from Party Mode was right about one thing — if dd0c/run can't stand alone, it becomes a feature of the IDP, not a product.

---

## APPENDIX: RESOLVED CONTRADICTIONS ACROSS PHASES

| Contradiction | Brainstorm Position | Party Mode Position | Resolution |
|--------------|---------------------|---------------------|------------|
| Standalone product vs. portal feature | Standalone with ecosystem integration | Contrarian argued it's a portal feature, not a product | **Standalone with kill-criteria pivot to portal.** Launch as standalone to test market demand. If adoption fails, absorb into dd0c/portal as a feature. |
| Per-runbook vs. per-seat pricing | $29/runbook/month | VC advisor recommended per-seat for simplicity | **A/B test during beta.** Per-runbook aligns cost with value; per-seat is simpler. Let data decide. |
| V1 execution scope | Full copilot with 🟢🟡🔴 approval gates | CTO demanded no execution until deterministic validation exists; Bootstrap Founder said copilot-only | **V1 auto-executes 🟢 only. 🟡🔴 require human approval. Deterministic scanner overrides LLM.** Synthesizes both positions. |
| Confluence/Notion crawlers in V1 | Design Thinking included crawlers as V1 | Innovation Strategy said "do not build crawlers; force the user to paste" | **Paste-only in V1. Crawlers are V2.** A solo founder can't maintain integration APIs for V1. Paste is the 5-second wow moment anyway. |
| Cold start solution | Slack Thread Scraper in V1 | Terminal Watcher in V1 | **Slack Thread Distiller in V1. Terminal Watcher deferred to V2.** Slack threads require no agent installation (lower trust barrier). Terminal Watcher requires an agent on the engineer's machine — too much friction for V1. |

---

*This brief synthesizes insights from four prior development phases: Brainstorm (Carson, Venture Architect), Design Thinking (Maya, Design Maestro), Innovation Strategy (Victor, Disruption Oracle), and Party Mode Advisory Board (5-person expert panel). All contradictions have been identified and resolved with rationale.*

*dd0c/run is the most safety-critical module in the dd0c platform. This brief reflects that gravity. Build it paranoid. Assume the AI wants to delete production. Constrain it accordingly. Then ship it — because the 3am pager isn't going to fix itself.*