dd0c/run — Brainstorm Session: AI-Powered Runbook Automation
Facilitator: Carson, Elite Brainstorming Specialist
Date: 2026-02-28
Product: dd0c/run (Product #6 in the dd0c platform)
Phase: "On-Call Savior" (Months 4-6 per brand strategy)
Phase 1: Problem Space (25 ideas)
The graveyard of runbooks is REAL. Let's map every angle of the pain.
Discovery & Awareness
- The Invisible Runbook — On-call engineer gets paged, doesn't know a runbook exists for this exact alert. It's buried in page 47 of a Confluence space nobody bookmarks.
- The "Ask Steve" Problem — The runbook is Steve's brain. Steve is on vacation in Bali. Steve didn't write it down. Steve never will.
- The Wrong Runbook — Engineer finds a runbook but it's for a different version of the service, or a similar-but-different failure mode. They follow it anyway. Things get worse.
- The Search Tax — At 3am, panicking, the engineer spends 12 minutes searching Confluence, Notion, Slack, and Google Docs for the right runbook. MTTR just doubled.
- The Tribal Knowledge Silo — Senior engineers have mental runbooks for every failure mode. They never write them down because "it's faster to just fix it myself."
Runbook Rot & Maintenance
- The Day-One Decay — A runbook is accurate the day it's written. By day 30, the infrastructure has changed and 3 of the 8 steps are wrong.
- The Nobody-Owns-It Problem — Who maintains the runbook? The person who wrote it left 6 months ago. The team that owns the service doesn't know the runbook exists.
- The Copy-Paste Drift — Runbooks get forked, copied, slightly modified. Now there are 4 versions and none are canonical.
- The Screenshot Graveyard — Runbooks full of screenshots of UIs that have been redesigned twice since.
- The "Works on My Machine" Runbook — Steps that assume specific IAM permissions, VPN configs, or CLI versions that the on-call engineer doesn't have.
Cognitive Load & Human Factors
- 3am Brain — Cognitive function drops 30-40% during night pages. Complex multi-step runbooks become impossible to follow accurately.
- The Panic Spiral — Alert fires → engineer panics → skips steps → makes it worse → more alerts fire → more panic. The runbook can't help if the human can't process it.
- Context Switching Hell — Following a runbook means jumping between the doc, the terminal, the AWS console, Datadog, Slack, and PagerDuty. Each switch costs 30 seconds of re-orientation.
- The "Which Step Am I On?" Problem — Engineer gets interrupted by Slack message, loses their place in a 20-step runbook, re-executes step 7 which was supposed to be idempotent but isn't.
- Decision Fatigue at the Fork — Runbook says "If X, do A. If Y, do B. If neither, escalate." Engineer can't tell if it's X or Y. Freezes.
Organizational & Cultural
- The Postmortem Lie — Every postmortem says "Action item: update runbook." It never happens. The Jira ticket sits in the backlog for eternity.
- The Hero Culture — Organizations reward the engineer who heroically fixes the incident, not the one who writes the boring runbook. Incentives are backwards.
- The New Hire Cliff — New on-call engineer's first page. No context, no muscle memory, runbooks assume 2 years of institutional knowledge.
- The Handoff Gap — Shift change during an active incident. The outgoing engineer's context is lost. The incoming engineer starts from scratch.
- The "We Don't Have Runbooks" Admission — Many teams simply don't have runbooks at all. They rely entirely on tribal knowledge and hope.
Economic & Business Impact
- The MTTR Multiplier — Every minute of downtime costs money. A 30-minute MTTR vs 5-minute MTTR on a revenue-critical service can mean $50K+ difference per incident for mid-market companies.
- The Attrition Cost — On-call burnout is the #1 reason SREs quit. Bad runbooks = more stress = more turnover = $150K+ per lost engineer.
- The Compliance Gap — SOC 2 and ISO 27001 require documented incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive an audit.
- The Repeated Incident Tax — Same incident happens monthly. Same engineer fixes it manually each time. Nobody automates it because "it only takes 20 minutes." That's 4 hours/year of senior engineer time per recurring incident.
- The Escalation Cascade — Junior engineer can't follow the runbook → escalates to senior → senior is also paged for 3 other things → everyone's awake, nobody's effective.
Phase 2: Solution Space (78 ideas)
LET'S GO. Every idea is valid. We're building the future of incident response.
Ingestion & Import (Ideas 26-36)
- Confluence Crawler — API integration that discovers and imports all runbook-tagged pages from Confluence spaces. Parses prose into structured steps.
- Notion Sync — Bidirectional sync with Notion databases. Import existing runbooks, push updates back.
- GitHub/GitLab Markdown Ingest — Point at a repo directory of `.md` runbooks. Auto-import on merge to main.
- Slack Thread Scraper — "That time we fixed the database" lives in a Slack thread. AI extracts the resolution steps from the conversation noise.
- Google Docs Connector — Many teams keep runbooks in shared Google Docs. Import and keep synced.
- Video Transcription Import — Senior engineer recorded a Loom/screen recording of fixing an issue. AI transcribes it, extracts the steps, generates a runbook.
- Terminal Session Replay Import — Import asciinema recordings or shell history from past incidents. AI identifies the commands that actually fixed the issue vs. the diagnostic noise.
- Postmortem-to-Runbook Pipeline — Feed in your postmortem doc. AI extracts the resolution steps and generates a draft runbook automatically.
- PagerDuty/OpsGenie Notes Scraper — Engineers often leave resolution notes in the incident timeline. Scrape those into runbook drafts.
- Jira/Linear Ticket Mining — Incident tickets often contain resolution steps in comments. Mine them.
- Clipboard/Paste Import — Zero-friction: just paste the text of your runbook into dd0c/run. AI structures it instantly.
AI Parsing & Understanding (Ideas 37-45)
- Prose-to-Steps Converter — AI takes a wall of text ("First you need to SSH into the bastion, then check the logs for...") and converts it into numbered, executable steps.
- Command Extraction — AI identifies shell commands, API calls, and SQL queries embedded in prose. Tags them as executable.
- Prerequisite Detection — AI identifies implicit prerequisites ("you need kubectl access", "make sure you're on the VPN") and surfaces them as a checklist before execution.
- Conditional Logic Mapping — AI identifies decision points ("if the error is X, do Y; otherwise do Z") and creates branching workflow trees.
- Risk Classification per Step — AI labels each step: 🟢 Safe (read-only), 🟡 Caution (state change, reversible), 🔴 Dangerous (destructive, irreversible). Auto-execute green, prompt for yellow, require explicit approval for red.
- Staleness Detection — AI cross-references runbook commands against current infrastructure (Terraform state, K8s manifests, AWS resource tags). Flags steps that reference resources that no longer exist.
- Ambiguity Highlighter — AI flags vague steps ("check the logs" — which logs? where?) and prompts the author to clarify.
- Multi-Language Support — Parse runbooks written in English, Spanish, Japanese, etc. Incident response is global.
- Diagram/Flowchart Generation — AI generates a visual flowchart from the runbook steps. Engineers can see the whole decision tree at a glance.
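As a concrete seed for the risk-classification idea, here's a minimal heuristic sketch in Python. The regex patterns are illustrative assumptions; the real product would layer LLM judgment and per-team policy on top of a first pass like this:

```python
import re

# Illustrative patterns only: a real classifier would combine heuristics
# like these with LLM judgment and a per-team allowlist.
DANGEROUS = re.compile(r"\b(rm\s+-rf|drop\s+table|terminate-instances|delete)\b", re.I)
MUTATING = re.compile(r"\b(restart|scale|failover|update|kubectl\s+(apply|rollout))\b", re.I)

def classify_step(command: str) -> str:
    """Label a runbook command by its worst-case effect."""
    if DANGEROUS.search(command):
        return "red"     # destructive/irreversible: explicit approval required
    if MUTATING.search(command):
        return "yellow"  # state change, reversible: prompt before running
    return "green"       # read-only: safe to auto-execute

print(classify_step("kubectl get pods -n prod"))            # green
print(classify_step("kubectl rollout restart deploy/api"))  # yellow
print(classify_step("aws ec2 terminate-instances --instance-ids i-1"))  # red
```

Checking the dangerous patterns first matters: a command can match both lists, and the worst-case label must win.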
Execution Modes (Ideas 46-54)
- Full Autopilot — For well-tested, low-risk runbooks: AI executes every step automatically, reports results.
- Copilot Mode (Human-in-the-Loop) — AI suggests each step, pre-fills the command, engineer clicks "Execute" or modifies it first. The default mode.
- Suggestion-Only / Read-Along — AI walks the engineer through the runbook step by step, highlighting the current step, but doesn't execute anything. Training wheels.
- Dry-Run Mode — AI simulates execution of each step, shows what WOULD happen without actually doing it. Perfect for testing runbooks.
- Progressive Trust — Starts in suggestion-only mode. As the team builds confidence, they can promote individual runbooks to copilot or autopilot mode.
- Approval Chains — Dangerous steps require approval from a second engineer or a manager. Integrated with Slack/Teams for quick approvals.
- Rollback-Aware Execution — Every step that changes state also records the rollback command. If things go wrong, one-click undo.
- Parallel Step Execution — Some runbook steps are independent. AI identifies parallelizable steps and executes them simultaneously to reduce MTTR.
- Breakpoint Mode — Engineer sets breakpoints in the runbook like a debugger. Execution pauses at those points for manual inspection.
Alert Integration (Ideas 55-62)
- Auto-Attach Runbook to Incident — When PagerDuty/OpsGenie fires an alert, dd0c/run automatically identifies the most relevant runbook and attaches it to the incident.
- Alert-to-Runbook Matching Engine — ML model that learns which alerts map to which runbooks based on historical resolution patterns.
- Slack Bot Integration — `/ddoc run database-failover` in Slack. The bot walks you through the runbook right in the channel.
- PagerDuty Custom Action — One-click "Run Runbook" button directly in the PagerDuty incident page.
- Pre-Incident Warm-Up — When anomaly detection suggests an incident is LIKELY (but hasn't fired yet), dd0c/run pre-loads the relevant runbook and notifies the on-call.
- Multi-Alert Correlation — When 5 alerts fire simultaneously, AI determines they're all symptoms of one root cause and suggests the single runbook that addresses it.
- Escalation-Aware Routing — If the L1 runbook doesn't resolve the issue within N minutes, automatically escalate to L2 runbook and page the senior engineer with full context.
- Alert Context Injection — When the runbook starts, AI pre-populates variables (affected service, region, customer impact) from the alert payload. No manual lookup needed.
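At its core, the matching engine is nearest-neighbor search over runbook descriptions. Here's a runnable toy sketch that uses bag-of-words cosine similarity as a stand-in for real model embeddings stored in a vector DB (the `embed` function and sample runbooks are illustrative assumptions):

```python
import math
from collections import Counter

# Stand-in embedding: bag-of-words term counts. The real system would use
# model-based embeddings stored in a vector database.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_runbook(alert_text: str, runbooks: dict[str, str]) -> str:
    """Return the runbook whose description is most similar to the alert."""
    a = embed(alert_text)
    return max(runbooks, key=lambda name: cosine(a, embed(runbooks[name])))

runbooks = {
    "rds-failover": "primary database unreachable failover to replica",
    "pod-crashloop": "kubernetes pod restarting crash loop backoff",
}
print(match_runbook("alert: database primary unreachable", runbooks))  # rds-failover
```

In production the alert metadata (service, region, error class) would be injected as additional match features, not just the free-text title.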
Learning Loop & Continuous Improvement (Ideas 63-72)
- Resolution Tracking — Track which runbook steps actually resolved the incident vs. which were skipped or failed. Use this data to improve runbooks.
- Auto-Update Suggestions — After an incident, AI compares what the engineer actually did vs. what the runbook said. Suggests updates for divergences.
- Runbook Effectiveness Score — Each runbook gets a score: success rate, average MTTR when used, skip rate per step. Surface the worst-performing runbooks for review.
- Dead Step Detection — If step 4 is skipped by every engineer every time, it's probably unnecessary. Flag it for removal.
- New Failure Mode Detection — AI notices an incident that doesn't match any existing runbook. Prompts the resolving engineer to create one from their actions.
- A/B Testing Runbooks — Two approaches to fixing the same issue? Run both, track which has better MTTR. Data-driven runbook optimization.
- Seasonal Pattern Learning — "This database issue happens every month-end during batch processing." AI learns temporal patterns and pre-stages runbooks.
- Cross-Team Learning — Anonymized patterns: "Teams using this AWS architecture commonly need this type of runbook." Suggest runbook templates based on infrastructure fingerprint.
- Confidence Decay Model — Runbook confidence score decreases over time since last successful use or last infrastructure change. Triggers review when confidence drops below threshold.
- Incident Replay for Training — Record the full incident timeline (alerts, runbook execution, engineer actions). Replay it for training new on-call engineers.
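The confidence decay idea above is just exponential decay with a penalty term. A minimal sketch, where the 90-day half-life, the 0.8 penalty factor, and the review threshold are all illustrative assumptions rather than tuned values:

```python
def runbook_confidence(days_since_success: float,
                       infra_changes_since: int,
                       half_life_days: float = 90.0) -> float:
    """Confidence decays exponentially since the last successful use, with an
    extra multiplicative penalty per infrastructure change touching the
    service. All constants here are illustrative assumptions."""
    decay = 0.5 ** (days_since_success / half_life_days)
    penalty = 0.8 ** infra_changes_since
    return decay * penalty

REVIEW_THRESHOLD = 0.5  # below this, flag the runbook for human review

print(runbook_confidence(90, 0))   # 0.5 -- one half-life elapsed, no infra changes
print(runbook_confidence(120, 2) < REVIEW_THRESHOLD)  # True -- triggers review
```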
Collaboration & Handoff (Ideas 73-80)
- Multi-Engineer Incident View — Multiple engineers working the same incident can see each other's progress through the runbook in real-time.
- Shift Handoff Package — When shift changes during an incident, dd0c/run generates a context package: what's been tried, what's left, current state.
- War Room Mode — Dedicated incident channel with the runbook pinned, step progress visible, and AI providing real-time suggestions.
- Expert Paging with Context — When escalating, the paged expert receives not just "help needed" but the full runbook execution history, what's been tried, and where it's stuck.
- Async Runbook Contributions — After an incident, any engineer can suggest edits to the runbook. Changes go through a review process like a PR.
- Runbook Comments & Annotations — Engineers can leave inline comments on runbook steps ("This step takes 5 minutes, don't panic if it seems stuck").
- Incident Narration — AI generates a real-time narrative of the incident for stakeholders: "The team is on step 5 of 8. Database failover is in progress. ETA: 10 minutes."
- Cross-Timezone Handoff Intelligence — AI knows which engineers are in which timezone and suggests optimal handoff points.
Runbook Creation & Generation (Ideas 81-90)
- Terminal Watcher — Opt-in agent that watches your terminal during an incident. After resolution, AI generates a runbook from the commands you ran.
- Incident Postmortem → Runbook — Feed in the postmortem. AI generates the runbook. Close the loop that every team promises but never delivers.
- Screen Recording → Runbook — Record your screen while fixing an issue. AI watches, transcribes, and generates a step-by-step runbook.
- Slack Thread → Runbook — Point at a Slack thread where an incident was resolved. AI extracts the signal from the noise and generates a runbook.
- Template Library — Pre-built runbook templates for common scenarios: "AWS RDS failover", "Kubernetes pod crash loop", "Redis memory pressure", "Certificate expiry".
- Infrastructure-Aware Generation — dd0c/run knows your infrastructure (via dd0c/portal integration). When you deploy a new service, it auto-suggests runbook templates based on the tech stack.
- Chaos Engineering Integration — Run a chaos experiment (Gremlin, LitmusChaos). dd0c/run observes the resolution and generates a runbook from it.
- Pair Programming Runbooks — AI and engineer co-author a runbook interactively. AI asks questions ("What do you check first?"), engineer answers, AI structures it.
- Runbook from Architecture Diagram — Feed in your architecture diagram. AI identifies potential failure points and generates skeleton runbooks for each.
- Git-Backed Runbooks — All runbooks stored as code in a Git repo. Version history, PRs for changes, CI/CD for validation. The runbook-as-code movement.
Wild & Visionary Ideas (Ideas 91-103)
- Incident Simulator / Fire Drills — Simulate incidents in a sandbox environment. On-call engineers practice runbooks without real consequences. Gamified with scores and leaderboards.
- Voice-Guided Runbooks — At 3am, reading is hard. AI reads the runbook steps aloud through your headphones while you type commands. Hands-free incident response.
- Runbook Marketplace — Community-contributed runbook templates. "Here's how Stripe handles Redis failover." Anonymized, vetted, rated.
- Predictive Runbook Staging — AI predicts incidents before they happen (based on metrics trends) and pre-stages the relevant runbook, pre-approves safe steps, and alerts the on-call: "Heads up, you might need this in 30 minutes."
- Natural Language Incident Response — Engineer types "the database is slow" in Slack. AI figures out which database, runs diagnostics, identifies the issue, and suggests the right runbook. No alert needed.
- Runbook Dependency Graph — Visualize how runbooks relate to each other. "If Runbook A fails, try Runbook B." "Runbook C is a prerequisite for Runbook D."
- Self-Healing Runbooks — Runbooks that detect their own staleness by periodically dry-running against the current infrastructure and flagging broken steps.
- Customer-Impact Aware Execution — AI knows which customers are affected (via dd0c/portal service catalog) and prioritizes runbook execution based on customer tier/revenue impact.
- Regulatory Compliance Mode — Every runbook execution is logged with full audit trail. Who ran what, when, what changed. Auto-generates compliance evidence for SOC 2/ISO 27001.
- Multi-Cloud Runbook Abstraction — Write one runbook that works across AWS, GCP, and Azure. AI translates cloud-specific commands based on the target environment.
- Runbook Health Dashboard — Single pane of glass: total runbooks, coverage gaps (services without runbooks), staleness scores, usage frequency, effectiveness ratings.
- "What Would Steve Do?" Mode — AI learns from how senior engineers resolve incidents (via terminal watcher + historical data) and can suggest their approach even when they're not available.
- Incident Cost Tracker — Real-time cost counter during incident: "This outage has cost $12,400 so far. Estimated savings if resolved in next 5 minutes: $8,200."
Phase 3: Differentiation & Moat (18 ideas)
Beating Rundeck (Free/OSS)
- UX Superiority — Rundeck's UI is from 2015. dd0c/run is Linear-quality UX. Engineers will pay for beautiful, fast tools.
- Zero Config — Rundeck requires Java, a database, YAML job definitions. dd0c/run: paste your runbook, it works. Time-to-value < 5 minutes.
- AI-Native vs. Bolt-On — Rundeck is a job scheduler with runbook features bolted on. dd0c/run is AI-first. The AI IS the product, not a feature.
- SaaS vs. Self-Hosted Burden — Rundeck requires hosting, patching, upgrading. dd0c/run is managed SaaS. One less thing to maintain.
Beating PagerDuty Automation Actions
- Not Locked to PagerDuty — PagerDuty Automation Actions only works within PagerDuty. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, or any webhook-based alerting.
- Runbook Intelligence vs. Dumb Automation — PagerDuty runs pre-defined scripts. dd0c/run understands the runbook, adapts to context, handles branching logic, and learns.
- Ingestion from Anywhere — PagerDuty can't import your existing Confluence runbooks. dd0c/run can.
- Mid-Market Pricing — PagerDuty's automation is an expensive add-on to an already expensive product. dd0c/run is $15-30/seat standalone.
The Data Moat
- Runbook Corpus — Every runbook ingested makes the AI smarter at parsing and structuring new runbooks. Network effect.
- Resolution Pattern Database — "When alert X fires for service type Y, step Z resolves it 87% of the time." This data is incredibly valuable and compounds over time.
- Infrastructure Fingerprinting — dd0c/run learns common failure patterns for specific tech stacks (EKS + RDS + Redis = these 5 failure modes). New customers with similar stacks get instant runbook suggestions.
- MTTR Benchmarking — "Your MTTR for database incidents is 23 minutes. Similar teams average 8 minutes. Here's what they do differently." Anonymized cross-customer intelligence.
Platform Integration Moat (dd0c Ecosystem)
- dd0c/alert → dd0c/run Pipeline — Alert intelligence identifies the incident, runbook automation resolves it. Together they're 10x more valuable than apart.
- dd0c/portal Service Catalog — dd0c/run knows who owns the service, what it depends on, and who to page. No configuration needed if you're already using portal.
- dd0c/cost Integration — Runbook execution can factor in cost: "This remediation will spin up 3 extra instances costing $X/hour. Approve?"
- dd0c/drift Integration — "This incident was caused by infrastructure drift detected by dd0c/drift. Here's the runbook to remediate AND the drift to fix."
- Unified Audit Trail — All dd0c modules share one audit log. Compliance teams get a single source of truth for incident response, cost decisions, and infrastructure changes.
- The "Last Mile" Advantage — Competitors solve one piece. dd0c solves the whole chain: detect anomaly → correlate alerts → identify runbook → execute resolution → update documentation → generate postmortem.
Phase 4: Anti-Ideas & Red Team (15 ideas)
Time to be brutally honest. Let's stress-test this thing.
Why This Could Fail
- AI Agents Make Runbooks Obsolete — If autonomous AI agents (Pulumi Neo, GitHub Agentic Workflows) can detect and fix infrastructure issues without human intervention, who needs runbooks? Counter: We're 3-5 years from trusting AI to autonomously fix production. Runbooks are the bridge. And even with AI agents, you need runbooks as the "policy" that defines what the agent should do.
- Trust Barrier — Will engineers let an AI run commands in their production environment? The first time dd0c/run makes an incident worse, trust is destroyed forever. Counter: Progressive trust model. Start with suggestion-only. Graduate to copilot. Autopilot only for proven runbooks. Never force it.
- The AI Makes It Worse — AI misinterprets a runbook step, executes the wrong command, cascading failure. Counter: Risk classification per step. Dangerous steps always require human approval. Dry-run mode. Rollback-aware execution.
- Runbook Quality Garbage-In — If the existing runbooks are terrible (and they usually are), AI can't magically make them good. It'll just execute bad steps faster. Counter: AI quality scoring on import. Flag ambiguous, incomplete, or risky runbooks. Suggest improvements before enabling execution.
- Security & Compliance Nightmare — dd0c/run needs access to production systems to execute commands. That's a massive attack surface and compliance concern. Counter: Agent-based architecture (like the dd0c brand strategy specifies). Agent runs in their VPC. SaaS never sees credentials. SOC 2 compliance from day one.
- Small Market? — How many teams actually have runbooks worth automating? Most teams don't have runbooks at all. Counter: That's the opportunity. dd0c/run doesn't just automate existing runbooks — it helps CREATE them. The terminal watcher and postmortem-to-runbook features address teams with zero runbooks.
- Rundeck is Free — Why pay for dd0c/run when Rundeck is open source? Counter: Rundeck is a job scheduler, not an AI runbook engine. It's like comparing Notepad to VS Code. Different products for different eras.
- PagerDuty/Rootly Acqui-Hire the Space — Big players could build or acquire this capability. Counter: PagerDuty is slow-moving and enterprise-focused. Rootly is incident management, not runbook execution. By the time they build it, dd0c/run has the data moat.
- Engineer Resistance — "I don't need an AI to tell me how to do my job." Cultural resistance from senior engineers. Counter: Position it as a tool for the 3am junior engineer, not the senior. Seniors benefit because they stop getting paged for things juniors can now handle.
- Integration Fatigue — Yet another tool to integrate with PagerDuty, Slack, AWS, etc. Counter: dd0c platform handles integrations once. dd0c/run inherits them.
- Latency During Incidents — If dd0c/run adds latency to incident response (loading, parsing, waiting for AI), engineers will bypass it. Counter: Pre-stage runbooks. Cache everything. AI inference must be < 2 seconds. If it's slower than reading the doc, it's useless.
- Liability — If dd0c/run's AI suggestion causes data loss, who's liable? Counter: Clear ToS. AI suggests, human approves (in copilot mode). Audit trail proves the human clicked "Execute."
- Hallucination Risk — AI "invents" a runbook step that doesn't exist in the source material. Counter: Strict grounding. Every suggested step must trace back to the source runbook. Hallucination detection layer. Never generate steps that aren't in the original document unless explicitly in "creative" mode.
- Chicken-and-Egg: No Runbooks = No Product — Teams without runbooks can't use dd0c/run. Counter: Terminal watcher, postmortem mining, Slack thread scraping, and template library all solve cold-start. dd0c/run creates runbooks, not just executes them.
- Pricing Pressure — If the market commoditizes AI runbook execution, margins collapse. Counter: The moat isn't the execution engine — it's the resolution pattern database and the dd0c platform integration. Those compound over time.
Phase 5: Synthesis
Top 10 Ideas (Ranked by Impact × Feasibility)
| Rank | Idea | # | Why |
|---|---|---|---|
| 1 | Copilot Mode (Human-in-the-Loop Execution) | 47 | The core product. AI suggests, human approves. Safe, trustworthy, immediately valuable. This IS dd0c/run. |
| 2 | Auto-Attach Runbook to Incident | 55 | The killer integration. Alert fires → runbook appears. Solves the #1 problem (engineers don't know the runbook exists). |
| 3 | Risk Classification per Step | 41 | The trust enabler. Green/yellow/red labeling makes engineers comfortable letting AI execute safe steps while maintaining control over dangerous ones. |
| 4 | Confluence/Notion/Markdown Ingestion | 26-28 | Meet teams where they are. Zero migration friction. Import existing runbooks in minutes. |
| 5 | Prose-to-Steps AI Converter | 37 | The magic moment. Paste a wall of text, get a structured executable runbook. This is the demo that sells the product. |
| 6 | Terminal Watcher → Auto-Generate Runbook | 81 | Solves the cold-start problem AND the "seniors won't write runbooks" problem. The runbook writes itself. |
| 7 | Resolution Tracking & Auto-Update Suggestions | 63-64 | The learning loop that kills runbook rot. Runbooks get better with every incident instead of decaying. |
| 8 | Slack Bot Integration | 57 | Meet engineers where they already are during incidents. No context switching. /ddoc run in the incident channel. |
| 9 | Runbook Effectiveness Score | 65 | Data-driven runbook management. Surface the worst runbooks, celebrate the best. Gamification of operational excellence. |
| 10 | dd0c/alert → dd0c/run Pipeline | 116 | The platform play. Alert intelligence + runbook automation = the complete incident response stack. This is how you beat point solutions. |
3 Wild Cards 🃏
- 🃏 Incident Simulator / Fire Drills (#91) — Gamified incident response training. Engineers practice runbooks in a sandbox. Leaderboards, scores, team competitions. This could be the viral growth mechanism — "My team's incident response score is 94. What's yours?" Could become a standalone product.
- 🃏 Voice-Guided Runbooks (#92) — At 3am, your eyes are barely open. What if dd0c/run talked you through the incident like a calm co-pilot? "Step 3: SSH into the bastion. The command is ready in your clipboard. Press enter when ready." This is genuinely differentiated — nobody else is doing audio-guided incident response.
- 🃏 "What Would Steve Do?" Mode (#102) — AI learns senior engineers' incident response patterns and can replicate their decision-making. This is the ultimate knowledge capture tool. When Steve leaves the company, his expertise stays. Emotionally compelling pitch for engineering managers worried about bus factor.
Recommended V1 Scope
V1 = "Paste → Parse → Page → Pilot"
The minimum viable product that delivers immediate value:
1. Ingest — Paste a runbook (plain text, markdown) or connect Confluence/Notion. AI parses it into structured, executable steps with risk classification (green/yellow/red).
2. Match — PagerDuty/OpsGenie webhook integration. When an alert fires, dd0c/run matches it to the most relevant runbook using semantic similarity + alert metadata.
3. Execute (Copilot Mode) — Slack bot or web UI walks the on-call engineer through the runbook step-by-step. Auto-executes green (safe) steps. Prompts for approval on yellow/red steps. Pre-fills commands with context from the alert.
4. Learn — Track which steps were executed, skipped, or modified. After incident resolution, suggest runbook updates based on what actually happened vs. what the runbook said.
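The data captured in the Learn step is exactly what feeds the Runbook Effectiveness Score (idea #65). A minimal scoring sketch, where the record fields and the 70/30 weighting are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class IncidentRecord:
    resolved: bool      # did this runbook resolve the incident?
    steps_total: int
    steps_skipped: int  # steps the engineer skipped or replaced

def effectiveness_score(history: list[IncidentRecord]) -> float:
    """0-100 score blending success rate and (inverse) step-skip rate.
    The 70/30 weighting is an illustrative assumption."""
    if not history:
        return 0.0
    success = sum(r.resolved for r in history) / len(history)
    skip = sum(r.steps_skipped for r in history) / sum(r.steps_total for r in history)
    return round(100 * (0.7 * success + 0.3 * (1 - skip)), 1)

history = [
    IncidentRecord(True, 8, 1),
    IncidentRecord(True, 8, 2),
    IncidentRecord(False, 8, 5),
]
print(effectiveness_score(history))  # 66.7 -- worth surfacing for review
```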
What V1 does NOT include:
- Terminal watcher (V2)
- Full autopilot mode (V2 — need trust first)
- Incident simulator (V3)
- Multi-cloud abstraction (V3)
- Runbook marketplace (V4)
V1 Success Metrics:
- Time-to-first-runbook: < 5 minutes (paste and go)
- MTTR reduction: 40%+ for teams using dd0c/run vs. manual runbook following
- Runbook coverage: surface services with zero runbooks, track coverage growth
- NPS from on-call engineers: > 50 (they actually LIKE being on-call now)
V1 Tech Stack:
- Lightweight agent (Rust/Go) runs in customer VPC for command execution
- SaaS dashboard + Slack bot for the UI
- OpenAI/Anthropic for runbook parsing and step generation (use dd0c/route for cost optimization — eat your own dog food)
- PagerDuty + OpsGenie webhooks for alert integration
- PostgreSQL + vector DB for runbook storage and semantic matching
V1 Pricing:
- Free: 3 runbooks, suggestion-only mode
- Pro ($25/seat/month): Unlimited runbooks, copilot execution, Slack bot
- Business ($49/seat/month): Autopilot mode, API access, SSO, audit trail
Total ideas generated: 136

Session complete. Let's build this thing. 🔥