dd0c/run — Brainstorm Session: AI-Powered Runbook Automation
Facilitator: Carson, Elite Brainstorming Specialist
Date: 2026-02-28
Product: dd0c/run (Product #6 in the dd0c platform)
Phase: "On-Call Savior" (Months 4-6 per brand strategy)
Phase 1: Problem Space (25 ideas)
The graveyard of runbooks is REAL. Let's map every angle of the pain.
Discovery & Awareness
- The Invisible Runbook — On-call engineer gets paged, doesn't know a runbook exists for this exact alert. It's buried in page 47 of a Confluence space nobody bookmarks.
- The "Ask Steve" Problem — The runbook is Steve's brain. Steve is on vacation in Bali. Steve didn't write it down. Steve never will.
- The Wrong Runbook — Engineer finds a runbook but it's for a different version of the service, or a similar-but-different failure mode. They follow it anyway. Things get worse.
- The Search Tax — At 3am, panicking, the engineer spends 12 minutes searching Confluence, Notion, Slack, and Google Docs for the right runbook. MTTR just doubled.
- The Tribal Knowledge Silo — Senior engineers have mental runbooks for every failure mode. They never write them down because "it's faster to just fix it myself."
Runbook Rot & Maintenance
- The Day-One Decay — A runbook is accurate the day it's written. By day 30, the infrastructure has changed and 3 of the 8 steps are wrong.
- The Nobody-Owns-It Problem — Who maintains the runbook? The person who wrote it left 6 months ago. The team that owns the service doesn't know the runbook exists.
- The Copy-Paste Drift — Runbooks get forked, copied, slightly modified. Now there are 4 versions and none are canonical.
- The Screenshot Graveyard — Runbooks full of screenshots of UIs that have been redesigned twice since.
- The "Works on My Machine" Runbook — Steps that assume specific IAM permissions, VPN configs, or CLI versions that the on-call engineer doesn't have.
Cognitive Load & Human Factors
- 3am Brain — Cognitive function drops 30-40% during night pages. Complex multi-step runbooks become impossible to follow accurately.
- The Panic Spiral — Alert fires → engineer panics → skips steps → makes it worse → more alerts fire → more panic. The runbook can't help if the human can't process it.
- Context Switching Hell — Following a runbook means jumping between the doc, the terminal, the AWS console, Datadog, Slack, and PagerDuty. Each switch costs 30 seconds of re-orientation.
- The "Which Step Am I On?" Problem — Engineer gets interrupted by Slack message, loses their place in a 20-step runbook, re-executes step 7 which was supposed to be idempotent but isn't.
- Decision Fatigue at the Fork — Runbook says "If X, do A. If Y, do B. If neither, escalate." Engineer can't tell if it's X or Y. Freezes.
Organizational & Cultural
- The Postmortem Lie — Every postmortem says "Action item: update runbook." It never happens. The Jira ticket sits in the backlog for eternity.
- The Hero Culture — Organizations reward the engineer who heroically fixes the incident, not the one who writes the boring runbook. Incentives are backwards.
- The New Hire Cliff — New on-call engineer's first page. No context, no muscle memory, runbooks assume 2 years of institutional knowledge.
- The Handoff Gap — Shift change during an active incident. The outgoing engineer's context is lost. The incoming engineer starts from scratch.
- The "We Don't Have Runbooks" Admission — Many teams simply don't have runbooks at all. They rely entirely on tribal knowledge and hope.
Economic & Business Impact
- The MTTR Multiplier — Every minute of downtime costs money. A 30-minute MTTR vs 5-minute MTTR on a revenue-critical service can mean $50K+ difference per incident for mid-market companies.
- The Attrition Cost — On-call burnout is the #1 reason SREs quit. Bad runbooks = more stress = more turnover = $150K+ per lost engineer.
- The Compliance Gap — SOC 2 and ISO 27001 require documented incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive an audit.
- The Repeated Incident Tax — Same incident happens monthly. Same engineer fixes it manually each time. Nobody automates it because "it only takes 20 minutes." That's 4 hours/year of senior engineer time per recurring incident.
- The Escalation Cascade — Junior engineer can't follow the runbook → escalates to senior → senior is also paged for 3 other things → everyone's awake, nobody's effective.
Phase 2: Solution Space (78 ideas)
LET'S GO. Every idea is valid. We're building the future of incident response.
Ingestion & Import (Ideas 26-36)
- Confluence Crawler — API integration that discovers and imports all runbook-tagged pages from Confluence spaces. Parses prose into structured steps.
- Notion Sync — Bidirectional sync with Notion databases. Import existing runbooks, push updates back.
- GitHub/GitLab Markdown Ingest — Point at a repo directory of `.md` runbooks. Auto-import on merge to main.
- Slack Thread Scraper — "That time we fixed the database" lives in a Slack thread. AI extracts the resolution steps from the conversation noise.
- Google Docs Connector — Many teams keep runbooks in shared Google Docs. Import and keep synced.
- Video Transcription Import — Senior engineer recorded a Loom/screen recording of fixing an issue. AI transcribes it, extracts the steps, generates a runbook.
- Terminal Session Replay Import — Import asciinema recordings or shell history from past incidents. AI identifies the commands that actually fixed the issue vs. the diagnostic noise.
- Postmortem-to-Runbook Pipeline — Feed in your postmortem doc. AI extracts the resolution steps and generates a draft runbook automatically.
- PagerDuty/OpsGenie Notes Scraper — Engineers often leave resolution notes in the incident timeline. Scrape those into runbook drafts.
- Jira/Linear Ticket Mining — Incident tickets often contain resolution steps in comments. Mine them.
- Clipboard/Paste Import — Zero-friction: just paste the text of your runbook into dd0c/run. AI structures it instantly.
AI Parsing & Understanding (Ideas 37-45)
- Prose-to-Steps Converter — AI takes a wall of text ("First you need to SSH into the bastion, then check the logs for...") and converts it into numbered, executable steps.
- Command Extraction — AI identifies shell commands, API calls, and SQL queries embedded in prose. Tags them as executable.
- Prerequisite Detection — AI identifies implicit prerequisites ("you need kubectl access", "make sure you're on the VPN") and surfaces them as a checklist before execution.
- Conditional Logic Mapping — AI identifies decision points ("if the error is X, do Y; otherwise do Z") and creates branching workflow trees.
- Risk Classification per Step — AI labels each step: 🟢 Safe (read-only), 🟡 Caution (state change, reversible), 🔴 Dangerous (destructive, irreversible). Auto-execute green, prompt for yellow, require explicit approval for red.
- Staleness Detection — AI cross-references runbook commands against current infrastructure (Terraform state, K8s manifests, AWS resource tags). Flags steps that reference resources that no longer exist.
- Ambiguity Highlighter — AI flags vague steps ("check the logs" — which logs? where?) and prompts the author to clarify.
- Multi-Language Support — Parse runbooks written in English, Spanish, Japanese, etc. Incident response is global.
- Diagram/Flowchart Generation — AI generates a visual flowchart from the runbook steps. Engineers can see the whole decision tree at a glance.
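As a concrete seed for the risk-classification idea, here's a minimal heuristic sketch in Python. The regex patterns are illustrative assumptions; the real product would layer LLM judgment and per-team policy on top of a first pass like this:

```python
import re

# Illustrative patterns only: a real classifier would combine heuristics
# like these with LLM judgment and a per-team allowlist.
DANGEROUS = re.compile(r"\b(rm\s+-rf|drop\s+table|terminate-instances|delete)\b", re.I)
MUTATING = re.compile(r"\b(restart|scale|failover|update|kubectl\s+(apply|rollout))\b", re.I)

def classify_step(command: str) -> str:
    """Label a runbook command by its worst-case effect."""
    if DANGEROUS.search(command):
        return "red"     # destructive/irreversible: explicit approval required
    if MUTATING.search(command):
        return "yellow"  # state change, reversible: prompt before running
    return "green"       # read-only: safe to auto-execute

print(classify_step("kubectl get pods -n prod"))            # green
print(classify_step("kubectl rollout restart deploy/api"))  # yellow
print(classify_step("aws ec2 terminate-instances --instance-ids i-1"))  # red
```

Checking the dangerous patterns first matters: a command can match both lists, and the worst-case label must win.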
Execution Modes (Ideas 46-54)
- Full Autopilot — For well-tested, low-risk runbooks: AI executes every step automatically, reports results.
- Copilot Mode (Human-in-the-Loop) — AI suggests each step, pre-fills the command, engineer clicks "Execute" or modifies it first. The default mode.
- Suggestion-Only / Read-Along — AI walks the engineer through the runbook step by step, highlighting the current step, but doesn't execute anything. Training wheels.
- Dry-Run Mode — AI simulates execution of each step, shows what WOULD happen without actually doing it. Perfect for testing runbooks.
- Progressive Trust — Starts in suggestion-only mode. As the team builds confidence, they can promote individual runbooks to copilot or autopilot mode.
- Approval Chains — Dangerous steps require approval from a second engineer or a manager. Integrated with Slack/Teams for quick approvals.
- Rollback-Aware Execution — Every step that changes state also records the rollback command. If things go wrong, one-click undo.
- Parallel Step Execution — Some runbook steps are independent. AI identifies parallelizable steps and executes them simultaneously to reduce MTTR.
- Breakpoint Mode — Engineer sets breakpoints in the runbook like a debugger. Execution pauses at those points for manual inspection.
Alert Integration (Ideas 55-62)
- Auto-Attach Runbook to Incident — When PagerDuty/OpsGenie fires an alert, dd0c/run automatically identifies the most relevant runbook and attaches it to the incident.
- Alert-to-Runbook Matching Engine — ML model that learns which alerts map to which runbooks based on historical resolution patterns.
- Slack Bot Integration — `/ddoc run database-failover` in Slack. The bot walks you through the runbook right in the channel.
- PagerDuty Custom Action — One-click "Run Runbook" button directly in the PagerDuty incident page.
- Pre-Incident Warm-Up — When anomaly detection suggests an incident is LIKELY (but hasn't fired yet), dd0c/run pre-loads the relevant runbook and notifies the on-call.
- Multi-Alert Correlation — When 5 alerts fire simultaneously, AI determines they're all symptoms of one root cause and suggests the single runbook that addresses it.
- Escalation-Aware Routing — If the L1 runbook doesn't resolve the issue within N minutes, automatically escalate to L2 runbook and page the senior engineer with full context.
- Alert Context Injection — When the runbook starts, AI pre-populates variables (affected service, region, customer impact) from the alert payload. No manual lookup needed.
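At its core, the matching engine is nearest-neighbor search over runbook descriptions. Here's a runnable toy sketch that uses bag-of-words cosine similarity as a stand-in for real model embeddings stored in a vector DB (the `embed` function and sample runbooks are illustrative assumptions):

```python
import math
from collections import Counter

# Stand-in embedding: bag-of-words term counts. The real system would use
# model-based embeddings stored in a vector database.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_runbook(alert_text: str, runbooks: dict[str, str]) -> str:
    """Return the runbook whose description is most similar to the alert."""
    a = embed(alert_text)
    return max(runbooks, key=lambda name: cosine(a, embed(runbooks[name])))

runbooks = {
    "rds-failover": "primary database unreachable failover to replica",
    "pod-crashloop": "kubernetes pod restarting crash loop backoff",
}
print(match_runbook("alert: database primary unreachable", runbooks))  # rds-failover
```

In production the alert metadata (service, region, error class) would be injected as additional match features, not just the free-text title.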
Learning Loop & Continuous Improvement (Ideas 63-72)
- Resolution Tracking — Track which runbook steps actually resolved the incident vs. which were skipped or failed. Use this data to improve runbooks.
- Auto-Update Suggestions — After an incident, AI compares what the engineer actually did vs. what the runbook said. Suggests updates for divergences.
- Runbook Effectiveness Score — Each runbook gets a score: success rate, average MTTR when used, skip rate per step. Surface the worst-performing runbooks for review.
- Dead Step Detection — If step 4 is skipped by every engineer every time, it's probably unnecessary. Flag it for removal.
- New Failure Mode Detection — AI notices an incident that doesn't match any existing runbook. Prompts the resolving engineer to create one from their actions.
- A/B Testing Runbooks — Two approaches to fixing the same issue? Run both, track which has better MTTR. Data-driven runbook optimization.
- Seasonal Pattern Learning — "This database issue happens every month-end during batch processing." AI learns temporal patterns and pre-stages runbooks.
- Cross-Team Learning — Anonymized patterns: "Teams using this AWS architecture commonly need this type of runbook." Suggest runbook templates based on infrastructure fingerprint.
- Confidence Decay Model — Runbook confidence score decreases over time since last successful use or last infrastructure change. Triggers review when confidence drops below threshold.
- Incident Replay for Training — Record the full incident timeline (alerts, runbook execution, engineer actions). Replay it for training new on-call engineers.
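The confidence decay idea above is just exponential decay with a penalty term. A minimal sketch, where the 90-day half-life, the 0.8 penalty factor, and the review threshold are all illustrative assumptions rather than tuned values:

```python
def runbook_confidence(days_since_success: float,
                       infra_changes_since: int,
                       half_life_days: float = 90.0) -> float:
    """Confidence decays exponentially since the last successful use, with an
    extra multiplicative penalty per infrastructure change touching the
    service. All constants here are illustrative assumptions."""
    decay = 0.5 ** (days_since_success / half_life_days)
    penalty = 0.8 ** infra_changes_since
    return decay * penalty

REVIEW_THRESHOLD = 0.5  # below this, flag the runbook for human review

print(runbook_confidence(90, 0))   # 0.5 -- one half-life elapsed, no infra changes
print(runbook_confidence(120, 2) < REVIEW_THRESHOLD)  # True -- triggers review
```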
Collaboration & Handoff (Ideas 73-80)
- Multi-Engineer Incident View — Multiple engineers working the same incident can see each other's progress through the runbook in real-time.
- Shift Handoff Package — When shift changes during an incident, dd0c/run generates a context package: what's been tried, what's left, current state.
- War Room Mode — Dedicated incident channel with the runbook pinned, step progress visible, and AI providing real-time suggestions.
- Expert Paging with Context — When escalating, the paged expert receives not just "help needed" but the full runbook execution history, what's been tried, and where it's stuck.
- Async Runbook Contributions — After an incident, any engineer can suggest edits to the runbook. Changes go through a review process like a PR.
- Runbook Comments & Annotations — Engineers can leave inline comments on runbook steps ("This step takes 5 minutes, don't panic if it seems stuck").
- Incident Narration — AI generates a real-time narrative of the incident for stakeholders: "The team is on step 5 of 8. Database failover is in progress. ETA: 10 minutes."
- Cross-Timezone Handoff Intelligence — AI knows which engineers are in which timezone and suggests optimal handoff points.
Runbook Creation & Generation (Ideas 81-90)
- Terminal Watcher — Opt-in agent that watches your terminal during an incident. After resolution, AI generates a runbook from the commands you ran.
- Incident Postmortem → Runbook — Feed in the postmortem. AI generates the runbook. Close the loop that every team promises but never delivers.
- Screen Recording → Runbook — Record your screen while fixing an issue. AI watches, transcribes, and generates a step-by-step runbook.
- Slack Thread → Runbook — Point at a Slack thread where an incident was resolved. AI extracts the signal from the noise and generates a runbook.
- Template Library — Pre-built runbook templates for common scenarios: "AWS RDS failover", "Kubernetes pod crash loop", "Redis memory pressure", "Certificate expiry".
- Infrastructure-Aware Generation — dd0c/run knows your infrastructure (via dd0c/portal integration). When you deploy a new service, it auto-suggests runbook templates based on the tech stack.
- Chaos Engineering Integration — Run a chaos experiment (Gremlin, LitmusChaos). dd0c/run observes the resolution and generates a runbook from it.
- Pair Programming Runbooks — AI and engineer co-author a runbook interactively. AI asks questions ("What do you check first?"), engineer answers, AI structures it.
- Runbook from Architecture Diagram — Feed in your architecture diagram. AI identifies potential failure points and generates skeleton runbooks for each.
- Git-Backed Runbooks — All runbooks stored as code in a Git repo. Version history, PRs for changes, CI/CD for validation. The runbook-as-code movement.
Wild & Visionary Ideas (Ideas 91-103)
- Incident Simulator / Fire Drills — Simulate incidents in a sandbox environment. On-call engineers practice runbooks without real consequences. Gamified with scores and leaderboards.
- Voice-Guided Runbooks — At 3am, reading is hard. AI reads the runbook steps aloud through your headphones while you type commands. Hands-free incident response.
- Runbook Marketplace — Community-contributed runbook templates. "Here's how Stripe handles Redis failover." Anonymized, vetted, rated.
- Predictive Runbook Staging — AI predicts incidents before they happen (based on metrics trends) and pre-stages the relevant runbook, pre-approves safe steps, and alerts the on-call: "Heads up, you might need this in 30 minutes."
- Natural Language Incident Response — Engineer types "the database is slow" in Slack. AI figures out which database, runs diagnostics, identifies the issue, and suggests the right runbook. No alert needed.
- Runbook Dependency Graph — Visualize how runbooks relate to each other. "If Runbook A fails, try Runbook B." "Runbook C is a prerequisite for Runbook D."
- Self-Healing Runbooks — Runbooks that detect their own staleness by periodically dry-running against the current infrastructure and flagging broken steps.
- Customer-Impact Aware Execution — AI knows which customers are affected (via dd0c/portal service catalog) and prioritizes runbook execution based on customer tier/revenue impact.
- Regulatory Compliance Mode — Every runbook execution is logged with full audit trail. Who ran what, when, what changed. Auto-generates compliance evidence for SOC 2/ISO 27001.
- Multi-Cloud Runbook Abstraction — Write one runbook that works across AWS, GCP, and Azure. AI translates cloud-specific commands based on the target environment.
- Runbook Health Dashboard — Single pane of glass: total runbooks, coverage gaps (services without runbooks), staleness scores, usage frequency, effectiveness ratings.
- "What Would Steve Do?" Mode — AI learns from how senior engineers resolve incidents (via terminal watcher + historical data) and can suggest their approach even when they're not available.
- Incident Cost Tracker — Real-time cost counter during incident: "This outage has cost $12,400 so far. Estimated savings if resolved in next 5 minutes: $8,200."
Phase 3: Differentiation & Moat (18 ideas)
Beating Rundeck (Free/OSS)
- UX Superiority — Rundeck's UI is from 2015. dd0c/run is Linear-quality UX. Engineers will pay for beautiful, fast tools.
- Zero Config — Rundeck requires Java, a database, YAML job definitions. dd0c/run: paste your runbook, it works. Time-to-value < 5 minutes.
- AI-Native vs. Bolt-On — Rundeck is a job scheduler with runbook features bolted on. dd0c/run is AI-first. The AI IS the product, not a feature.
- SaaS vs. Self-Hosted Burden — Rundeck requires hosting, patching, upgrading. dd0c/run is managed SaaS. One less thing to maintain.
Beating PagerDuty Automation Actions
- Not Locked to PagerDuty — PagerDuty Automation Actions only works within PagerDuty. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, or any webhook-based alerting.
- Runbook Intelligence vs. Dumb Automation — PagerDuty runs pre-defined scripts. dd0c/run understands the runbook, adapts to context, handles branching logic, and learns.
- Ingestion from Anywhere — PagerDuty can't import your existing Confluence runbooks. dd0c/run can.
- Mid-Market Pricing — PagerDuty's automation is an expensive add-on to an already expensive product. dd0c/run is $15-30/seat standalone.
The Data Moat
- Runbook Corpus — Every runbook ingested makes the AI smarter at parsing and structuring new runbooks. Network effect.
- Resolution Pattern Database — "When alert X fires for service type Y, step Z resolves it 87% of the time." This data is incredibly valuable and compounds over time.
- Infrastructure Fingerprinting — dd0c/run learns common failure patterns for specific tech stacks (EKS + RDS + Redis = these 5 failure modes). New customers with similar stacks get instant runbook suggestions.
- MTTR Benchmarking — "Your MTTR for database incidents is 23 minutes. Similar teams average 8 minutes. Here's what they do differently." Anonymized cross-customer intelligence.
Platform Integration Moat (dd0c Ecosystem)
- dd0c/alert → dd0c/run Pipeline — Alert intelligence identifies the incident, runbook automation resolves it. Together they're 10x more valuable than apart.
- dd0c/portal Service Catalog — dd0c/run knows who owns the service, what it depends on, and who to page. No configuration needed if you're already using portal.
- dd0c/cost Integration — Runbook execution can factor in cost: "This remediation will spin up 3 extra instances costing $X/hour. Approve?"
- dd0c/drift Integration — "This incident was caused by infrastructure drift detected by dd0c/drift. Here's the runbook to remediate AND the drift to fix."
- Unified Audit Trail — All dd0c modules share one audit log. Compliance teams get a single source of truth for incident response, cost decisions, and infrastructure changes.
- The "Last Mile" Advantage — Competitors solve one piece. dd0c solves the whole chain: detect anomaly → correlate alerts → identify runbook → execute resolution → update documentation → generate postmortem.
Phase 4: Anti-Ideas & Red Team (15 ideas)
Time to be brutally honest. Let's stress-test this thing.
Why This Could Fail
- AI Agents Make Runbooks Obsolete — If autonomous AI agents (Pulumi Neo, GitHub Agentic Workflows) can detect and fix infrastructure issues without human intervention, who needs runbooks? Counter: We're 3-5 years from trusting AI to autonomously fix production. Runbooks are the bridge. And even with AI agents, you need runbooks as the "policy" that defines what the agent should do.
- Trust Barrier — Will engineers let an AI run commands in their production environment? The first time dd0c/run makes an incident worse, trust is destroyed forever. Counter: Progressive trust model. Start with suggestion-only. Graduate to copilot. Autopilot only for proven runbooks. Never force it.
- The AI Makes It Worse — AI misinterprets a runbook step, executes the wrong command, cascading failure. Counter: Risk classification per step. Dangerous steps always require human approval. Dry-run mode. Rollback-aware execution.
- Runbook Quality Garbage-In — If the existing runbooks are terrible (and they usually are), AI can't magically make them good. It'll just execute bad steps faster. Counter: AI quality scoring on import. Flag ambiguous, incomplete, or risky runbooks. Suggest improvements before enabling execution.
- Security & Compliance Nightmare — dd0c/run needs access to production systems to execute commands. That's a massive attack surface and compliance concern. Counter: Agent-based architecture (like the dd0c brand strategy specifies). Agent runs in their VPC. SaaS never sees credentials. SOC 2 compliance from day one.
- Small Market? — How many teams actually have runbooks worth automating? Most teams don't have runbooks at all. Counter: That's the opportunity. dd0c/run doesn't just automate existing runbooks — it helps CREATE them. The terminal watcher and postmortem-to-runbook features address teams with zero runbooks.
- Rundeck is Free — Why pay for dd0c/run when Rundeck is open source? Counter: Rundeck is a job scheduler, not an AI runbook engine. It's like comparing Notepad to VS Code. Different products for different eras.
- PagerDuty/Rootly Acqui-Hire the Space — Big players could build or acquire this capability. Counter: PagerDuty is slow-moving and enterprise-focused. Rootly is incident management, not runbook execution. By the time they build it, dd0c/run has the data moat.
- Engineer Resistance — "I don't need an AI to tell me how to do my job." Cultural resistance from senior engineers. Counter: Position it as a tool for the 3am junior engineer, not the senior. Seniors benefit because they stop getting paged for things juniors can now handle.
- Integration Fatigue — Yet another tool to integrate with PagerDuty, Slack, AWS, etc. Counter: dd0c platform handles integrations once. dd0c/run inherits them.
- Latency During Incidents — If dd0c/run adds latency to incident response (loading, parsing, waiting for AI), engineers will bypass it. Counter: Pre-stage runbooks. Cache everything. AI inference must be < 2 seconds. If it's slower than reading the doc, it's useless.
- Liability — If dd0c/run's AI suggestion causes data loss, who's liable? Counter: Clear ToS. AI suggests, human approves (in copilot mode). Audit trail proves the human clicked "Execute."
- Hallucination Risk — AI "invents" a runbook step that doesn't exist in the source material. Counter: Strict grounding. Every suggested step must trace back to the source runbook. Hallucination detection layer. Never generate steps that aren't in the original document unless explicitly in "creative" mode.
- Chicken-and-Egg: No Runbooks = No Product — Teams without runbooks can't use dd0c/run. Counter: Terminal watcher, postmortem mining, Slack thread scraping, and template library all solve cold-start. dd0c/run creates runbooks, not just executes them.
- Pricing Pressure — If the market commoditizes AI runbook execution, margins collapse. Counter: The moat isn't the execution engine — it's the resolution pattern database and the dd0c platform integration. Those compound over time.
Phase 5: Synthesis
Top 10 Ideas (Ranked by Impact × Feasibility)
| Rank | Idea | # | Why |
|---|---|---|---|
| 1 | Copilot Mode (Human-in-the-Loop Execution) | 47 | The core product. AI suggests, human approves. Safe, trustworthy, immediately valuable. This IS dd0c/run. |
| 2 | Auto-Attach Runbook to Incident | 55 | The killer integration. Alert fires → runbook appears. Solves the #1 problem (engineers don't know the runbook exists). |
| 3 | Risk Classification per Step | 41 | The trust enabler. Green/yellow/red labeling makes engineers comfortable letting AI execute safe steps while maintaining control over dangerous ones. |
| 4 | Confluence/Notion/Markdown Ingestion | 26-28 | Meet teams where they are. Zero migration friction. Import existing runbooks in minutes. |
| 5 | Prose-to-Steps AI Converter | 37 | The magic moment. Paste a wall of text, get a structured executable runbook. This is the demo that sells the product. |
| 6 | Terminal Watcher → Auto-Generate Runbook | 81 | Solves the cold-start problem AND the "seniors won't write runbooks" problem. The runbook writes itself. |
| 7 | Resolution Tracking & Auto-Update Suggestions | 63-64 | The learning loop that kills runbook rot. Runbooks get better with every incident instead of decaying. |
| 8 | Slack Bot Integration | 57 | Meet engineers where they already are during incidents. No context switching. /ddoc run in the incident channel. |
| 9 | Runbook Effectiveness Score | 65 | Data-driven runbook management. Surface the worst runbooks, celebrate the best. Gamification of operational excellence. |
| 10 | dd0c/alert → dd0c/run Pipeline | 116 | The platform play. Alert intelligence + runbook automation = the complete incident response stack. This is how you beat point solutions. |
3 Wild Cards 🃏
- 🃏 Incident Simulator / Fire Drills (#91) — Gamified incident response training. Engineers practice runbooks in a sandbox. Leaderboards, scores, team competitions. This could be the viral growth mechanism — "My team's incident response score is 94. What's yours?" Could become a standalone product.
- 🃏 Voice-Guided Runbooks (#92) — At 3am, your eyes are barely open. What if dd0c/run talked you through the incident like a calm co-pilot? "Step 3: SSH into the bastion. The command is ready in your clipboard. Press enter when ready." This is genuinely differentiated — nobody else is doing audio-guided incident response.
- 🃏 "What Would Steve Do?" Mode (#102) — AI learns senior engineers' incident response patterns and can replicate their decision-making. This is the ultimate knowledge capture tool. When Steve leaves the company, his expertise stays. Emotionally compelling pitch for engineering managers worried about bus factor.
Recommended V1 Scope
V1 = "Paste → Parse → Page → Pilot"
The minimum viable product that delivers immediate value:
1. Ingest — Paste a runbook (plain text, markdown) or connect Confluence/Notion. AI parses it into structured, executable steps with risk classification (green/yellow/red).
2. Match — PagerDuty/OpsGenie webhook integration. When an alert fires, dd0c/run matches it to the most relevant runbook using semantic similarity + alert metadata.
3. Execute (Copilot Mode) — Slack bot or web UI walks the on-call engineer through the runbook step-by-step. Auto-executes green (safe) steps. Prompts for approval on yellow/red steps. Pre-fills commands with context from the alert.
4. Learn — Track which steps were executed, skipped, or modified. After incident resolution, suggest runbook updates based on what actually happened vs. what the runbook said.
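The data captured in the Learn step is exactly what feeds the Runbook Effectiveness Score (idea #65). A minimal scoring sketch, where the record fields and the 70/30 weighting are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class IncidentRecord:
    resolved: bool      # did this runbook resolve the incident?
    steps_total: int
    steps_skipped: int  # steps the engineer skipped or replaced

def effectiveness_score(history: list[IncidentRecord]) -> float:
    """0-100 score blending success rate and (inverse) step-skip rate.
    The 70/30 weighting is an illustrative assumption."""
    if not history:
        return 0.0
    success = sum(r.resolved for r in history) / len(history)
    skip = sum(r.steps_skipped for r in history) / sum(r.steps_total for r in history)
    return round(100 * (0.7 * success + 0.3 * (1 - skip)), 1)

history = [
    IncidentRecord(True, 8, 1),
    IncidentRecord(True, 8, 2),
    IncidentRecord(False, 8, 5),
]
print(effectiveness_score(history))  # 66.7 -- worth surfacing for review
```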
What V1 does NOT include:
- Terminal watcher (V2)
- Full autopilot mode (V2 — need trust first)
- Incident simulator (V3)
- Multi-cloud abstraction (V3)
- Runbook marketplace (V4)
V1 Success Metrics:
- Time-to-first-runbook: < 5 minutes (paste and go)
- MTTR reduction: 40%+ for teams using dd0c/run vs. manual runbook following
- Runbook coverage: surface services with zero runbooks, track coverage growth
- NPS from on-call engineers: > 50 (they actually LIKE being on-call now)
V1 Tech Stack:
- Lightweight agent (Rust/Go) runs in customer VPC for command execution
- SaaS dashboard + Slack bot for the UI
- OpenAI/Anthropic for runbook parsing and step generation (use dd0c/route for cost optimization — eat your own dog food)
- PagerDuty + OpsGenie webhooks for alert integration
- PostgreSQL + vector DB for runbook storage and semantic matching
V1 Pricing:
- Free: 3 runbooks, suggestion-only mode
- Pro ($25/seat/month): Unlimited runbooks, copilot execution, Slack bot
- Business ($49/seat/month): Autopilot mode, API access, SSO, audit trail
Total ideas generated: 136

Session complete. Let's build this thing. 🔥