# dd0c/run — Design Thinking Session

**Facilitator:** Maya, Design Thinking Maestro — BMad Creative Intelligence Suite

**Date:** 2026-02-28

**Product:** dd0c/run (AI-Powered Runbook Automation)

**Phase:** "On-Call Savior" — Phase 2 of the dd0c platform rollout

---

> *"Design is jazz. You listen before you play. You feel the room before you solo. And the room right now? It's 3am, a phone is buzzing on a nightstand, and someone's stomach just dropped through the floor."*
---

## Phase 1: EMPATHIZE — Personas

The cardinal sin of product design is building for yourself. We're not building for architects who debate system design over craft coffee. We're building for the person whose hands are shaking at 3am, squinting at a Confluence page on a laptop balanced on their knees in bed, while their partner rolls over and sighs.

Let's meet them. Not as "users." As humans.

---
### Persona 1: The On-Call Engineer — "Riley"

**Demographics:** 26 years old. 2 years at the company. Mid-level SRE. Lives alone with a cat named `sudo`. Joined because the team seemed chill. Didn't realize "chill" meant "we don't document anything."

**The Scene:** It's 3:17am on a Tuesday. Riley's phone screams — PagerDuty's default tone, the one that triggers a Pavlovian cortisol spike. The alert says: `CRITICAL: payment-service latency > 5000ms — us-east-1`. Riley's brain is running at maybe 40% capacity. Eyes won't focus. Opens laptop. Tries to remember: is there a runbook for this? Where would it be? Confluence? That Google Doc someone shared in Slack three months ago?

Riley finds something. It's 18 steps long. Step 3 says "SSH into the bastion host." Which bastion host? There are four. Step 7 says "Run the failover script." What failover script? Where? Step 11 references a dashboard that was migrated to Grafana six months ago. Riley copy-pastes a command, holds their breath, hits Enter. Nothing explodes. Probably fine. Probably.

Forty-three minutes later, the alert resolves. Riley doesn't know if it was something they did or if it self-healed. They lie in bed, heart still pounding, staring at the ceiling. The cat is asleep. The cat doesn't care.

#### Empathy Map
| **Says** | **Thinks** |
|----------|------------|
| "I followed the runbook." (They skipped 4 steps.) | "I have no idea if what I just did was right." |
| "The runbook was mostly helpful." (It was 60% wrong.) | "If I break production, everyone will know it was me." |
| "On-call isn't that bad." (It's destroying their sleep.) | "I should have been a dentist." |
| "Can someone update this doc?" (Nobody will.) | "Why am I the one doing this at 3am when Steve wrote this service?" |

| **Does** | **Feels** |
|----------|-----------|
| Opens 7 browser tabs: PagerDuty, Confluence, AWS Console, Grafana, Slack, terminal, Stack Overflow | **Dread** when the phone buzzes |
| Copy-pastes commands without fully understanding them | **Isolation** — alone in the dark with a production incident |
| Asks in Slack: "anyone awake?" (Nobody is.) | **Imposter syndrome** — "a real engineer would know what to do" |
| Resolves the alert, writes a one-line note, goes back to bed | **Relief** that dissolves into **anxiety** about the next page |
| Skips steps that seem confusing or irrelevant | **Resentment** toward the team for not maintaining docs |

#### Pain Points
1. **Can't find the right runbook** — or doesn't know one exists. Search across Confluence, Notion, Google Docs, and Slack is a nightmare at 3am.
2. **Runbook is stale** — references dead resources, old dashboards, deprecated commands. Following it makes things worse.
3. **Cognitive overload** — 20-step runbook with branching logic when your brain is at 40% capacity. Decision fatigue at every fork.
4. **Context switching tax** — jumping between the doc, terminal, AWS console, monitoring, and Slack. Loses their place constantly.
5. **No confidence in actions** — copy-pasting commands without understanding them. No way to know if a step is safe or destructive.
6. **No feedback loop** — did the fix work? Was it the right runbook? Nobody follows up. The postmortem action item ("update runbook") dies in Jira.

#### Current Workarounds

- Bookmarks a personal "incident cheat sheet" in Notion (also goes stale)
- Searches Slack history for "how did we fix this last time"
- Texts the senior engineer directly (feels guilty about it)
- Writes commands in a scratch terminal first to "test" them (not a real test)
- Sets a personal timer: "if I can't fix it in 20 minutes, escalate" (arbitrary)

#### Jobs to Be Done (JTBD)
- **When** I get paged at 3am, **I want to** immediately know what to do and in what order, **so I can** resolve the incident quickly and go back to sleep.
- **When** I'm following a runbook, **I want to** know which steps are safe and which are dangerous, **so I can** act with confidence instead of fear.
- **When** an incident is resolved, **I want to** know that my actions are recorded, **so I can** prove I did the right thing and not get blamed if something else breaks.

---

### Persona 2: The SRE Manager — "Jordan"

**Demographics:** 34 years old. Manages a team of 8 SREs. Former on-call warrior who got promoted because they were good at incidents, not because they wanted to manage people. Has a 2-year-old at home. Sleeps with one eye open because they're still on the escalation path.

**The Scene:** It's Monday morning. Jordan is reviewing last week's incidents. Three pages, two of which were for the same issue — the payment service latency spike that happens every time the batch job runs. There's a runbook for it. Riley didn't use it. Or maybe Riley used the wrong version — there are three copies in Confluence, none of them canonical. The postmortem from last month says "Action: update runbook and add to PagerDuty." The Jira ticket is still in "To Do."

Jordan opens the team's runbook inventory. It's a Confluence space with 47 pages. Last updated: 8 months ago (most of them). Jordan knows at least half are dangerously stale. But there's no time to audit them — the team is underwater with project work, and nobody wants to spend their sprint updating docs that "nobody reads anyway."

The SOC 2 auditor is coming next quarter. They'll ask about incident response procedures. Jordan will point at the Confluence space and pray nobody clicks into the actual pages.

#### Empathy Map
| **Says** | **Thinks** |
|----------|------------|
| "We need to invest in our runbook hygiene." | "I've said this every quarter for two years." |
| "Let's make runbook updates part of the incident process." | "Nobody will actually do it." |
| "Our incident response is mature." (To leadership.) | "We're one bad incident away from a very bad day." |
| "Riley did a great job handling that page." | "Riley has no idea what they did and neither do I." |

| **Does** | **Feels** |
|----------|-----------|
| Reviews incident reports, tries to spot patterns | **Frustration** — knows the problems, can't fix them |
| Assigns runbook update tickets that never get done | **Guilt** — puts junior engineers on-call with bad docs |
| Advocates for "operational excellence" in planning meetings | **Anxiety** — the SOC 2 audit is coming |
| Manually checks in on on-call engineers during major incidents | **Exhaustion** — still gets pulled into escalations |
| Builds spreadsheets tracking runbook coverage and staleness | **Helplessness** — the spreadsheet just shows how bad it is |

#### Pain Points
1. **No visibility into runbook quality** — can't tell which runbooks are current, which are stale, which are actively harmful.
2. **No accountability for runbook maintenance** — ownership is unclear, updates don't happen, and there's no forcing function.
3. **Inconsistent incident response** — same incident, different engineer, wildly different approach and outcome. No standardization.
4. **Can't prove compliance** — SOC 2 requires documented procedures. The docs exist but they're fiction.
5. **On-call burnout causing attrition** — losing engineers because on-call is miserable. Recruiting replacements costs $150K+ each.
6. **No data on what works** — which runbooks actually reduce MTTR? Which steps get skipped? No metrics, just vibes.

#### Current Workarounds

- Quarterly "runbook review" meetings that everyone dreads and nothing comes of
- Pairs junior engineers with seniors for their first on-call rotation (expensive use of senior time)
- Maintains a personal "critical runbooks" list that they keep updated themselves
- Uses MTTR as a proxy for runbook quality (very noisy signal)
- Writes the most critical runbooks personally (doesn't scale)

#### Jobs to Be Done (JTBD)
- **When** an incident occurs, **I want to** know that my team followed a consistent, documented process, **so I can** trust the outcome and satisfy auditors.
- **When** reviewing operational health, **I want to** see which services lack runbooks and which runbooks are stale, **so I can** prioritize improvements with data instead of gut feel.
- **When** onboarding new on-call engineers, **I want to** give them a system that guides them through incidents, **so I can** reduce ramp time and prevent costly mistakes.

---

### Persona 3: The Runbook Author — "Morgan"

**Demographics:** 38 years old. Staff engineer. Has been at the company for 6 years. Built half the infrastructure. Knows where every body is buried. Is tired of being the human runbook. Has started interviewing at other companies but hasn't told anyone yet.

**The Scene:** It's 2pm on a Wednesday. Morgan just got pulled out of deep work — again — because Riley couldn't figure out the payment service issue — again. Morgan fixed it in 4 minutes. It took Riley 43 minutes last night. The difference? Morgan knows that step 7 in the runbook is wrong (the script moved repos 3 months ago), that you need to check the batch job schedule first (not in the runbook), and that the real fix is usually just bouncing the connection pool (also not in the runbook).

Morgan has been meaning to update the runbook for months. But updating a Confluence page feels like shouting into the void. Last time Morgan spent 2 hours writing a detailed runbook with screenshots, nobody used it. They still got paged. The knowledge in Morgan's head is worth more than anything in Confluence, but there's no good way to transfer it.

Morgan is the bus factor. Morgan knows it. Jordan knows it. Nobody talks about it.

#### Empathy Map
| **Says** | **Thinks** |
|----------|------------|
| "I'll update the runbook this sprint." (They won't.) | "Why bother? Nobody reads them." |
| "Just ping me if you get stuck." | "I'm so tired of being pinged." |
| "The runbook covers the basics." | "The runbook covers maybe 30% of what you actually need to know." |
| "We should automate this." | "I don't have time to automate this AND do my actual job." |

| **Does** | **Feels** |
|----------|-----------|
| Fixes incidents in minutes that take others hours | **Pride** mixed with **resentment** — they're good at this, but it shouldn't be their job anymore |
| Writes runbooks in bursts of guilt, then ignores them for months | **Futility** — the docs decay faster than they can maintain them |
| Answers Slack DMs during incidents instead of pointing to docs | **Trapped** — they've become a single point of failure |
| Considers leaving but worries about the team | **Guilt** — "if I leave, who handles the 3am pages?" |
| Hoards knowledge unintentionally — it's just faster to do it themselves | **Loneliness** — nobody else understands the systems at this depth |

#### Pain Points
1. **Writing runbooks feels pointless** — they go stale immediately, nobody follows them correctly, and Morgan still gets paged.
2. **No way to encode judgment** — runbooks capture steps, not the decision-making process. "If the error looks like X but the metrics show Y, it's actually Z" — that nuance doesn't fit in a numbered list.
3. **Maintenance is a second job** — infrastructure changes constantly. Keeping runbooks current is a Sisyphean task with zero recognition.
4. **No feedback on runbook usage** — Morgan doesn't know which runbooks are used, which steps are skipped, or where people get stuck.
5. **Knowledge transfer is broken** — Morgan's expertise is trapped in their head. The company's incident response capability is directly proportional to Morgan's availability.
6. **No incentive structure** — the company rewards feature shipping, not operational documentation. Writing runbooks is invisible work.

#### Current Workarounds

- Keeps a personal wiki with "the real runbooks" (not shared, not maintained)
- Records Loom videos of themselves fixing things ("watch this if I'm not around")
- Writes overly detailed runbooks that nobody reads because they're too long
- Answers questions in Slack threads that become the de facto runbook (unsearchable)
- Has automated some fixes as cron jobs or scripts, but they're undocumented and live on Morgan's laptop

#### Jobs to Be Done (JTBD)
- **When** I fix an incident, **I want to** capture what I did automatically, **so I can** create runbooks without the overhead of writing documentation.
- **When** I write a runbook, **I want to** know that people are actually using it correctly, **so I can** feel like the effort was worthwhile.
- **When** I think about leaving the company, **I want to** know my knowledge has been transferred, **so I can** leave without guilt.

---

> *"Three humans. Three different nightmares. But they're all trapped in the same broken system — a system where knowledge decays, documentation is fiction, and the phone buzzing at 3am is the only thing that's reliable."*
>
> *Riley wants to survive the night. Jordan wants to trust the process. Morgan wants to be free. dd0c/run has to serve all three — or it serves none of them.*

---

## Phase 2: DEFINE — Frame the Problem

> *"In jazz, the silence between notes matters more than the notes themselves. In design, the problem you choose to frame matters more than the solution you build. Frame it wrong and you'll build a beautiful answer to a question nobody asked."*

We've listened. We've felt the room. Now we name the beast.
---

### Point-of-View (POV) Statements

A POV statement isn't a feature request. It's a declaration of human need that's so specific it hurts.

**Riley (The On-Call Engineer):**

> A sleep-deprived on-call engineer who gets paged at 3am **needs a way to** instantly know the exact steps to resolve an incident without searching, interpreting, or guessing **because** cognitive function drops 40% at night, and every minute spent finding and deciphering a stale runbook is a minute of downtime, a minute of cortisol, and a minute closer to burnout.

**Jordan (The SRE Manager):**

> An SRE manager responsible for incident response quality across a team of 8 **needs a way to** ensure consistent, auditable, and continuously improving incident response regardless of which engineer is on-call **because** the current system depends entirely on individual heroics, produces no usable data, and will not survive a compliance audit or the departure of a single senior engineer.

**Morgan (The Runbook Author):**

> A staff engineer who carries six years of institutional knowledge in their head **needs a way to** transfer their expertise into a living system that learns, adapts, and executes without their constant involvement **because** they've become a single point of failure, their documentation efforts feel futile, and they can't leave — or even take a vacation — without the team's incident response capability collapsing.

---

### Key Insights
These are the truths we uncovered in empathy that the market hasn't fully articulated yet. Each one is a design constraint.

1. **The Runbook Is Not the Problem — The Gap Between the Runbook and Reality Is.** Every team has some documentation. The issue is that documentation decays the moment it's written. Infrastructure changes, UIs get redesigned, scripts move repos. The runbook becomes a historical artifact, not an operational tool. *Any solution that doesn't address continuous decay is dead on arrival.*

2. **3am Brain Is a Design Constraint, Not an Edge Case.** Most tools are designed for alert, caffeinated engineers at their desks. But the critical use case — the one where MTTR matters most — is a half-asleep human in bed with a laptop. *If it requires reading comprehension above a 6th-grade level at 3am, it's too complex.*

3. **Trust Is Earned in Millimeters and Lost in Miles.** Engineers will not hand over production command execution to an AI on day one. They shouldn't. But they also won't adopt a tool that's just a fancy document viewer. *The product must have a trust gradient — from "show me" to "do it for me" — and the engineer must control the dial.*

4. **The Author's Incentive Problem Is Upstream of Everything.** If Morgan doesn't write the runbook, Riley has nothing to follow, and Jordan has nothing to audit. But Morgan has zero incentive to write runbooks — it's invisible, thankless work that decays immediately. *The product must make runbook creation a byproduct of incident resolution, not a separate chore.*

5. **Knowledge Lives in Actions, Not Documents.** The most valuable runbook content isn't in Confluence. It's in Morgan's terminal history, in Slack threads at 3am, in the muscle memory of senior engineers. *The product must capture knowledge from where it naturally occurs — the command line, the chat, the incident itself.*

6. **The Emotional Core Is Dread, Not Efficiency.** We're not selling MTTR reduction. We're selling the absence of dread. The moment Riley's phone buzzes, the product's job is to replace "oh god" with "I've got this." *Every design decision should be tested against: does this reduce the dread?*

7. **Consistency Is a Management Superpower.** Jordan doesn't need every incident resolved perfectly. They need every incident resolved *the same way* — so they can measure, improve, and prove compliance. *Standardization of process is as valuable as speed of resolution.*

8. **The Bus Factor Is an Existential Risk Disguised as a Staffing Problem.** Morgan leaving isn't a hiring problem. It's a knowledge extinction event. Companies know this but have no mechanism to prevent it. *dd0c/run's deepest value proposition isn't automation — it's institutional memory.*

---
### "How Might We" Questions

HMWs are the bridge between insight and ideation. Each one opens a door. Some lead to features. Some lead to entirely new product categories.

**Reducing the 3am Dread:**

1. **HMW** eliminate the "which runbook?" problem so the right procedure appears the instant an alert fires?
2. **HMW** reduce the cognitive load of following a runbook to near-zero for a sleep-deprived engineer?
3. **HMW** give the on-call engineer confidence that each step is safe before they execute it?
4. **HMW** make the runbook feel like a calm co-pilot rather than a confusing instruction manual?

**Killing Runbook Rot:**

5. **HMW** make runbooks self-healing — automatically detecting and flagging their own staleness?
6. **HMW** make runbook updates a natural byproduct of incident resolution rather than a separate task?
7. **HMW** create a feedback loop where every incident makes the runbook better instead of leaving it to decay?

**Unlocking Trapped Knowledge:**

8. **HMW** capture Morgan's expertise from their natural workflow (terminal, Slack, screen) without requiring them to write documentation?
9. **HMW** encode not just the *steps* but the *judgment* — the "if it looks like X but is actually Y" knowledge that separates seniors from juniors?
10. **HMW** make runbook authoring feel rewarding instead of futile — so Morgan *wants* to contribute?

**Enabling Management Visibility:**

11. **HMW** give Jordan a single dashboard that shows runbook coverage, quality, and effectiveness across all services?
12. **HMW** generate audit-ready compliance evidence automatically from actual incident response activity?
13. **HMW** standardize incident response across the team without micromanaging individual engineers?

**Building Trust in Automation:**

14. **HMW** let teams gradually increase automation trust — from "show me" to "do it for me" — at their own pace?
15. **HMW** ensure the AI never makes an incident worse, even if the source runbook is flawed?
16. **HMW** make the AI's reasoning transparent so engineers understand *why* it's suggesting a step, not just *what* step?

---
### The Core Tension: Automation vs. Human Judgment

> *"Here's where the jazz gets dissonant. This is the chord that doesn't resolve easily."*

Every runbook automation product faces the same fundamental tension:

**Too much automation** → Engineers don't trust it, don't learn, and when the AI is wrong (and it will be wrong), nobody knows how to recover manually. You've created a new single point of failure — the AI itself.

**Too little automation** → It's just a fancy document viewer. Engineers bypass it because it's slower than their current workflow. You've built a feature, not a product.

The answer isn't a compromise. It's a **spectrum with explicit controls.**

```
┌──────────────────────────────────────────────────────────────┐
│                      THE TRUST GRADIENT                      │
│                                                              │
│   READ-ALONG ──→ COPILOT ──→ AUTOPILOT                       │
│                                                              │
│   "Show me        "Suggest &        "Just handle it,         │
│    the steps"      I'll approve"     wake me if it's red"    │
│                                                              │
│  ● Per-runbook setting (not global)                          │
│  ● Per-step override (green auto, yellow prompt, red block)  │
│  ● Earned through data (10 successful runs → suggest upgrade)│
│  ● Instantly revocable (one bad run → auto-downgrade)        │
└──────────────────────────────────────────────────────────────┘
```
**Design Principles for the Tension:**

1. **Default to caution.** Every new runbook starts in Read-Along mode. Trust is earned, never assumed.
2. **Granularity matters.** Trust isn't per-product or per-team. It's per-runbook and per-step. Step 1 (check logs) can be autopilot while Step 5 (failover database) stays manual.
3. **Transparency is non-negotiable.** The AI must show its work. "I'm suggesting this step because the alert payload contains X and the runbook says when X, do Y." No black boxes.
4. **Rollback is a first-class citizen.** Every automated action must have a recorded undo. If the AI executes Step 3 and it makes things worse, one click to reverse it.
5. **The human always has the kill switch.** At any point, the engineer can pause automation, take manual control, and resume later. The AI adapts, it doesn't insist.
6. **Data drives trust decisions.** After 10 successful copilot runs with zero modifications, the system suggests: "This runbook has a 100% success rate in copilot mode. Promote to autopilot?" The team decides. Not the AI.
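Principles 1 and 6 together describe a small state machine: start cautious, count clean runs, suggest a promotion at a threshold, and downgrade instantly on a failure. A minimal sketch of that logic — the `RunbookTrust` name and the threshold of 10 are illustrative assumptions drawn from the diagram, not a product spec:

```python
from dataclasses import dataclass
from typing import Optional

# Trust levels, ordered from least to most automation.
LEVELS = ["read_along", "copilot", "autopilot"]

@dataclass
class RunbookTrust:
    """Earned trust for a single runbook (per principle 2, the same
    record could just as well be kept per-step)."""
    level: str = "read_along"       # principle 1: default to caution
    clean_runs: int = 0             # consecutive successes with zero edits
    promotion_threshold: int = 10   # assumed from the diagram

    def record_run(self, success: bool, modified: bool) -> Optional[str]:
        """Update counters and return a promotion *suggestion* (or None).
        The team acts on suggestions; the system never promotes itself."""
        if not success:
            # Principle: instantly revocable — one bad run auto-downgrades.
            self.clean_runs = 0
            idx = LEVELS.index(self.level)
            if idx > 0:
                self.level = LEVELS[idx - 1]
            return None
        if modified:
            self.clean_runs = 0   # an edited run doesn't count toward trust
            return None
        self.clean_runs += 1
        idx = LEVELS.index(self.level)
        if self.clean_runs >= self.promotion_threshold and idx < len(LEVELS) - 1:
            return f"promote to {LEVELS[idx + 1]}?"
        return None
```

Note one design choice in the sketch: a downgrade also zeroes the clean-run counter, so trust is re-earned from scratch after a failure rather than resuming where it left off.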
> *"The tension between automation and judgment isn't a problem to solve. It's a dynamic to design for. Like tension in a jazz chord — it's what makes the music interesting. Resolve it too quickly and you get elevator music. Let it ring and you get something people actually feel."*

---
## Phase 3: IDEATE — Generate Solutions

> *"Ideation is a controlled explosion. You light the fuse, let it blow, then walk through the debris looking for gold. No idea is too wild. No idea is too boring. The magic is in the collision between them."*

We've felt the pain. We've named the beast. Now we build the arsenal.

---

### 27 Solution Ideas

**Theme A: Ingestion — "Meet the Runbook Where It Lives"**

1. **Paste & Parse** — Zero-friction entry point. Copy-paste raw text from anywhere — Confluence, Notion, a napkin — and AI instantly structures it into executable steps with risk classification. The "5-second wow" moment.
2. **Confluence Crawler** — OAuth-connected crawler that discovers runbook-tagged pages across Confluence spaces. Periodic re-sync detects changes and flags drift between the source doc and the parsed version.
3. **Notion Bidirectional Sync** — Import from Notion databases, but also push structured updates back. The runbook in dd0c/run becomes the source of truth; Notion becomes the mirror.
4. **Git-Backed Markdown Ingest** — Point at a repo directory. On merge to main, runbooks auto-import. Version history comes free. Engineers who prefer docs-as-code get their workflow.
5. **Slack Thread Distiller** — Paste a Slack thread URL from a past incident. AI separates signal from noise — strips the "anyone awake?" and "trying something" messages, extracts the actual resolution commands and decisions into a draft runbook.
6. **Postmortem-to-Runbook Pipeline** — Feed in a postmortem doc (any format). AI extracts the "what we did to fix it" section and generates a structured runbook draft. Closes the loop that every retro promises and nobody delivers.
7. **Terminal Session Replay Import** — Import asciinema recordings or shell history exports from past incidents. AI identifies the commands that actually fixed the issue vs. the diagnostic noise and `ls` commands. Morgan's muscle memory, captured.
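All seven ingestion paths converge on one structured step format. As a sketch of the target shape, here is a toy structurer that keeps numbered prose lines and pulls out backticked commands — the `Step` schema is a hypothetical stand-in for whatever "Paste & Parse" actually emits, and the real parsing would be LLM-backed rather than regex-based:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    number: int
    text: str
    command: Optional[str]   # extracted shell command, if the step has one

def parse_pasted_runbook(raw: str) -> list:
    """Toy structurer: keep numbered lines, extract backticked commands.
    A real pipeline would add risk levels, branching, and prerequisites."""
    steps = []
    for line in raw.splitlines():
        m = re.match(r"\s*(\d+)[.)]\s+(.*)", line)
        if not m:
            continue  # non-step prose is dropped in this sketch
        text = m.group(2).strip()
        cmd = re.search(r"`([^`]+)`", text)
        steps.append(Step(int(m.group(1)), text, cmd.group(1) if cmd else None))
    return steps
```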
**Theme B: AI Parsing — "The Brain Behind the Runbook"**

8. **Prose-to-Steps Converter** — The core AI capability. Takes a wall of natural language ("First SSH into the bastion, then check the logs for connection timeout errors, if you see more than 50 in the last 5 minutes you need to bounce the connection pool...") and produces numbered, executable, branching steps.
9. **Risk Classification Engine** — Every parsed step gets a traffic light: 🟢 Safe (read-only: check logs, query metrics), 🟡 Caution (state change, reversible: restart service, scale up), 🔴 Dangerous (destructive/irreversible: drop table, failover database). This is the trust foundation.
10. **Prerequisite Detector** — AI identifies implicit requirements buried in prose: "you need kubectl access," "make sure you're on the VPN," "this assumes you have admin IAM role." Surfaces them as a pre-flight checklist before execution begins.
11. **Ambiguity Highlighter** — AI flags vague steps: "check the logs" → *Which logs? Where? What are you looking for?* Prompts the author to clarify before the runbook goes live. Prevents the "works for Morgan, confuses Riley" problem.
12. **Staleness Detector** — Cross-references runbook commands against live infrastructure (Terraform state, K8s manifests, AWS resource tags via dd0c/portal). Flags steps that reference resources, endpoints, or dashboards that no longer exist. The runbook's immune system.
13. **Conditional Logic Mapper** — Identifies decision points in prose and generates visual branching trees. "If error is timeout → path A. If error is OOM → path B. If neither → escalate." Riley sees the whole decision space at a glance instead of parsing nested if-statements in paragraph form.
14. **Variable Extraction & Auto-Fill** — AI identifies placeholders in runbook steps (service name, region, instance ID) and auto-populates them from the alert payload and infrastructure context. Riley doesn't have to look up which region is affected — it's already filled in.
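Idea 14's auto-fill is, at its core, template substitution from the alert payload. A minimal sketch, assuming a `{{name}}` placeholder syntax (an assumption for illustration, not a committed design); anything the payload can't resolve is surfaced back to the engineer instead of silently guessed:

```python
import re

def autofill(step_template: str, alert_payload: dict) -> tuple:
    """Fill {{placeholders}} from the alert payload; report anything
    the engineer still has to supply by hand."""
    missing = []

    def sub(m):
        key = m.group(1)
        if key in alert_payload:
            return str(alert_payload[key])
        missing.append(key)
        return m.group(0)  # leave the unresolved placeholder visible

    filled = re.sub(r"\{\{(\w+)\}\}", sub, step_template)
    return filled, missing
```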
**Theme C: Execution — "The Trust Gradient in Action"**

15. **Copilot Mode** — The default and the core product. AI presents each step with the command pre-filled, context from the alert injected, risk level displayed. Engineer reviews, optionally modifies, clicks Execute. AI shows the output, suggests the next step. A calm voice in the chaos.
16. **Progressive Trust Promotion** — Runbooks start in Read-Along (show steps, no execution). After N successful copilot runs with zero engineer modifications, the system suggests promotion to Autopilot for green steps. Trust is earned through data, not configuration.
17. **Rollback-Aware Execution** — Every state-changing step records its inverse operation. Bounced a service? Here's the command to restart it with the previous config. Scaled up? Here's the scale-down. One-click undo at any point. The safety net that makes engineers brave.
18. **Breakpoint Mode** — Engineers set pause points in the runbook like a debugger. Execution runs automatically through green steps, pauses at breakpoints for manual inspection. For the engineer who trusts the AI for the boring parts but wants to eyeball the critical moments.
19. **Dry-Run Simulation** — Execute the runbook against a simulated environment. Shows what *would* happen at each step without touching production. Perfect for testing new runbooks, training new engineers, and building trust before going live.
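Idea 17's "recorded inverse" pattern is essentially an undo stack: every state-changing step pushes its inverse at execution time, and rollback replays the inverses in reverse order. A sketch with command strings standing in for real executors (`RollbackLog` is a hypothetical name, not an API):

```python
from dataclasses import dataclass

@dataclass
class Action:
    command: str
    inverse: str   # recorded at execution time, e.g. the scale-down for a scale-up

class RollbackLog:
    """Records each state-changing step with its inverse so any point
    in the runbook can be unwound, most recent action first."""

    def __init__(self):
        self._stack = []

    def record(self, command: str, inverse: str) -> None:
        self._stack.append(Action(command, inverse))

    def rollback(self) -> list:
        """Return the inverse commands in reverse execution order."""
        undo = [a.inverse for a in reversed(self._stack)]
        self._stack.clear()
        return undo
```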

**Theme D: Alert Integration — "The Runbook Finds You"**

20. **Auto-Attach on Page** — When PagerDuty/OpsGenie fires an alert, dd0c/run matches it to the most relevant runbook using semantic similarity on alert metadata + historical resolution patterns. The runbook appears in the incident channel before Riley finishes rubbing their eyes. Solves the #1 problem: "I didn't know a runbook existed."

21. **Alert Context Injection** — The matched runbook arrives pre-populated: affected service, region, customer impact tier, recent deploy history, related metrics. No manual lookup. The runbook already knows what's on fire.

22. **Multi-Alert Correlation** — Five alerts fire simultaneously. AI determines they're all symptoms of one root cause (the database failover, not five separate issues) and surfaces the single runbook that addresses the root, not the symptoms. Cuts through the noise.

23. **Escalation-Aware Routing** — If the L1 runbook doesn't resolve within N minutes, automatically escalate: page the senior engineer, attach the L2 runbook, include full context of what's been tried. Morgan gets woken up with information, not just panic.
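For flavor, here's roughly what the V1 fallback matcher behind #20 could look like before any semantic layer exists — exact service match dominates, keyword overlap breaks ties. Every weight and threshold here is an assumption to be tuned, not a spec:

```python
def match_score(alert: dict, runbook: dict) -> float:
    """V1 fallback matching: exact service match carries most weight;
    keyword overlap between the alert summary and the runbook title/tags
    breaks ties. The 0.6/0.4 weights are illustrative knobs."""
    score = 0.6 if alert.get("service") == runbook.get("service") else 0.0
    alert_words = set(alert.get("summary", "").lower().split())
    runbook_words = set(
        (runbook.get("title", "") + " " + " ".join(runbook.get("tags", []))).lower().split()
    )
    if alert_words:
        score += 0.4 * len(alert_words & runbook_words) / len(alert_words)
    return score

def best_runbook(alert: dict, runbooks: list[dict], threshold: float = 0.5):
    """Return the highest-scoring runbook, or None when nothing clears the
    confidence bar (the bot then says 'no runbook matched' honestly)."""
    best = max(runbooks, key=lambda rb: match_score(alert, rb))
    return best if match_score(alert, best) >= threshold else None
```

The threshold matters more than the weights: a confidently wrong runbook at 3am is worse than no runbook at all.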

**Theme E: Learning Loop — "Runbooks That Get Smarter"**

24. **Divergence Detector** — After every incident, AI compares what the engineer actually did vs. what the runbook prescribed. Skipped step 4? Modified the command in step 7? Added an unlisted step between 9 and 10? AI generates a diff and suggests runbook updates. The postmortem action item completes itself.

25. **Runbook Effectiveness Score** — Each runbook gets a living score: success rate, average MTTR when used, step skip rate, modification frequency, confidence decay over time. Jordan's dashboard shows the worst-performing runbooks at the top. Data-driven operational excellence.

26. **Dead Step Pruning** — If step 6 is skipped by 90% of engineers in 90% of incidents, it's dead weight. AI flags it: "This step has been skipped 14 out of 15 times. Remove or revise?" Runbooks get leaner over time instead of accumulating cruft.

27. **New Failure Mode Detection** — AI notices an incident that doesn't match any existing runbook. After resolution, it prompts: "This incident type has no runbook. Based on what you just did, here's a draft. Review and publish?" The cold-start problem solves itself.
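The diff at the heart of #24 doesn't need ML — Python's `difflib` against the ordered step lists already yields skipped and unlisted steps. A sketch:

```python
from difflib import SequenceMatcher

def divergence(prescribed: list[str], executed: list[str]) -> dict:
    """Compare what the runbook said vs. what the engineer did.
    Returns skipped steps (prescribed but not run) and unlisted steps
    (run but not prescribed), preserving order."""
    report = {"skipped": [], "unlisted": []}
    matcher = SequenceMatcher(a=prescribed, b=executed, autojunk=False)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            report["skipped"].extend(prescribed[a1:a2])
        if op in ("insert", "replace"):
            report["unlisted"].extend(executed[b1:b2])
    return report
```

In practice "executed" would come from the copilot's own log, so matching is on normalized step IDs rather than raw strings — but the diff shape is the same.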

---

### Idea Clusters

```
┌─────────────────────────────────────────────────────────────┐
│                    SOLUTION ARCHITECTURE                    │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌─────────────────┐  │
│  │   INGEST     │──→│  UNDERSTAND  │──→│     EXECUTE     │  │
│  │              │   │              │   │                 │  │
│  │ Paste/Import │   │ Parse/Risk/  │   │ Read-Along →    │  │
│  │ Crawl/Sync   │   │ Variables/   │   │ Copilot →       │  │
│  │ Capture      │   │ Branching    │   │ Autopilot       │  │
│  │ (#1-7)       │   │ (#8-14)      │   │ (#15-19)        │  │
│  └──────────────┘   └──────────────┘   └─────────────────┘  │
│         ↑                                       │           │
│         │           ┌──────────────┐            │           │
│         │           │   CONNECT    │            │           │
│         │           │              │            │           │
│         │           │ Alert Match  │←───────────┘           │
│         │           │ Context Fill │                        │
│         │           │ Escalation   │                        │
│         │           │ (#20-23)     │                        │
│         │           └──────────────┘                        │
│         │                  │                                │
│         │           ┌──────────────┐                        │
│         └───────────│    LEARN     │                        │
│                     │              │                        │
│                     │ Divergence   │                        │
│                     │ Scoring      │                        │
│                     │ Pruning      │                        │
│                     │ Generation   │                        │
│                     │ (#24-27)     │                        │
│                     └──────────────┘                        │
└─────────────────────────────────────────────────────────────┘
```

The flywheel: **Ingest → Understand → Connect → Execute → Learn → Ingest** (better). Every incident makes the system smarter. Every runbook gets sharper. The corpus grows. The matching improves. This is the data moat Carson identified in the brainstorm — and it compounds.

---

### Top 5 Concepts with User Flow Sketches

#### Concept 1: "The 3am Lifeline" — Alert-Triggered Copilot Execution

*The core product experience. This is what we demo. This is what sells.*

```
Riley's phone buzzes: CRITICAL — payment-service latency > 5000ms
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│ PagerDuty fires webhook → dd0c/run              │
│ Matches alert to: "Payment Service Latency      │
│ Runbook" (92% confidence)                       │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│ Slack bot posts in #incident-2847:              │
│                                                 │
│ 🔔 Runbook matched: Payment Service Latency     │
│ 📊 Pre-filled: region=us-east-1,                │
│    service=payment-svc, deploy=v2.4.1 (2h ago)  │
│ 🟢🟡🔴 8 steps (4 safe, 3 caution, 1 danger)    │
│                                                 │
│ [▶ Start Copilot]  [📖 View Steps]  [⏭ Skip]    │
└─────────────────────────────────────────────────┘
                        │
                        ▼  Riley clicks "Start Copilot"
┌─────────────────────────────────────────────────┐
│ Step 1/8  🟢 SAFE — Check service logs          │
│                                                 │
│ > kubectl logs -n payments payment-svc-xxx      │
│   --since=10m | grep -i "timeout\|error"        │
│                                                 │
│ ℹ️ Looking for connection timeout patterns      │
│    that indicate database connection pool       │
│    exhaustion.                                  │
│                                                 │
│ [▶ Execute]  [✏️ Edit]  [⏭ Skip]                │
└─────────────────────────────────────────────────┘
                        │
                        ▼  Auto-executes green steps, pauses on yellow
┌─────────────────────────────────────────────────┐
│ Step 5/8  🟡 CAUTION — Bounce connection pool   │
│                                                 │
│ > kubectl rollout restart deployment/           │
│   payment-svc -n payments                       │
│                                                 │
│ ⚠️ This will restart all pods. ~30s downtime.   │
│ ↩️ Rollback: kubectl rollout undo ...           │
│                                                 │
│ [✅ Approve & Execute]  [✏️ Edit]  [⏭ Skip]     │
└─────────────────────────────────────────────────┘
                        │
                        ▼  Incident resolves after step 5
┌─────────────────────────────────────────────────┐
│ ✅ Incident resolved — MTTR: 4m 23s             │
│                                                 │
│ 📝 You skipped steps 6-8. The AI noticed you    │
│    also ran a command not in the runbook:       │
│    `SELECT count(*) FROM pg_stat_activity`      │
│                                                 │
│ Suggested updates:                              │
│ • Remove steps 6-8 (skipped 4/4 last runs)      │
│ • Add DB connection check before step 5         │
│                                                 │
│ [✅ Apply Updates]  [📝 Review First]  [❌ No]   │
└─────────────────────────────────────────────────┘
```

**Why it wins:** Riley went from phone buzz to resolution in 4 minutes instead of 43. They never left Slack. They never searched for a runbook. They never copy-pasted a command they didn't understand. And the runbook got better because they used it.

---

#### Concept 2: "Paste & Parse" — The 5-Second Onboarding

*The first-touch experience. Time-to-value < 5 minutes or we've lost them.*

```
Morgan pastes a wall of text from Confluence:
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│ dd0c/run — New Runbook                          │
│                                                 │
│ ┌─────────────────────────────────────────┐     │
│ │ Paste your runbook here...              │     │
│ │                                         │     │
│ │ "When the payment service latency alert │     │
│ │ fires, first SSH into the bastion host  │     │
│ │ (bastion-prod-east). Check the logs     │     │
│ │ for timeout errors. If you see more     │     │
│ │ than 50 in the last 5 minutes, you      │     │
│ │ need to bounce the connection pool..."  │     │
│ └─────────────────────────────────────────┘     │
│                                                 │
│ [🧠 Parse with AI]                              │
└─────────────────────────────────────────────────┘
                        │
                        ▼  ~3 seconds later
┌─────────────────────────────────────────────────┐
│ ✨ Parsed: 8 steps, 2 decision points           │
│                                                 │
│ 1. 🟢 SSH into bastion-prod-east                │
│ 2. 🟢 Check payment-svc logs (last 5min)        │
│ 3. 🔀 IF timeouts > 50 → step 4                 │
│       ELSE → step 6                             │
│ 4. 🟡 Bounce connection pool                    │
│ 5. 🟢 Verify latency recovery                   │
│ 6. 🟢 Check recent deploys                      │
│ 7. 🟡 Rollback last deploy                      │
│ 8. 🔴 Manual database failover                  │
│                                                 │
│ ⚠️ 2 issues found:                              │
│ • Step 1: "bastion host" — which one?           │
│   (auto-resolved: bastion-prod-east from text)  │
│ • Step 8: References "failover script" but      │
│   no path provided. [Add path?]                 │
│                                                 │
│ [✅ Save]  [✏️ Edit Steps]  [🔗 Link to Alert]  │
└─────────────────────────────────────────────────┘
```

**Why it wins:** Morgan went from a wall of Confluence prose to a structured, risk-classified, executable runbook in under a minute. The AI caught the ambiguity Morgan didn't even notice. And it took less effort than writing a Jira comment.
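The real parse is an LLM call, but the *target structure* is worth making concrete. A deliberately naive sketch — one step per sentence, `If`-prefixed branch detection, keyword-based risk tagging — that produces the same shape of output the editor would render (the keyword lists are illustrative, nowhere near the real classifier):

```python
import re

RISK_KEYWORDS = {  # illustrative heuristic, not the shipped classifier
    "red": ["failover", "delete", "drop"],
    "yellow": ["restart", "bounce", "rollback"],
}

def classify_risk(sentence: str) -> str:
    lower = sentence.lower()
    for level in ("red", "yellow"):
        if any(word in lower for word in RISK_KEYWORDS[level]):
            return level
    return "green"

def parse_runbook(text: str) -> list[dict]:
    """Toy parser: split prose into sentences, flag branch steps, tag risk.
    The real product does this extraction with an LLM via dd0c/route."""
    steps = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        steps.append({
            "text": sentence,
            "branch": sentence.lower().startswith("if "),
            "risk": classify_risk(sentence),
        })
    return steps

steps = parse_runbook(
    "Check the payment-svc logs for timeout errors. "
    "If you see more than 50 timeouts, bounce the connection pool. "
    "Verify latency has recovered."
)
```

Even this toy version demonstrates the contract that matters: unstructured prose in, a list of `{text, branch, risk}` records out, ready for the editor UI.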

---

#### Concept 3: "The Living Scoreboard" — Runbook Health Dashboard

*Jordan's command center. The thing that makes the SOC 2 auditor smile.*

```
┌──────────────────────────────────────────────────────────────┐
│ dd0c/run — Runbook Health                          Jordan ▾  │
│                                                              │
│ Coverage: 34/52 services (65%)  ████████████░░░░░░           │
│ Avg Staleness: 47 days ⚠️                                    │
│ Avg MTTR (with runbook): 6m 12s                              │
│ Avg MTTR (without): 38m 45s  ← this number sells the product │
│                                                              │
│ ┌─ NEEDS ATTENTION ────────────────────────────────────┐     │
│ │ 🔴 payment-service-failover      Staleness: 142 days │     │
│ │    Last used: 3 days ago. 2 steps reference deleted  │     │
│ │    resources. Effectiveness: 34%                     │     │
│ │                                                      │     │
│ │ 🟡 redis-memory-pressure         No runbook exists   │     │
│ │    3 incidents in last 30 days, resolved ad-hoc      │     │
│ │    [🧠 Generate from incident history]               │     │
│ │                                                      │     │
│ │ 🟡 cert-expiry-renewal           Staleness: 89 days  │     │
│ │    Step 4 skip rate: 100%. Suggested: remove step.   │     │
│ └──────────────────────────────────────────────────────┘     │
│                                                              │
│ ┌─ TOP PERFORMERS ─────────────────────────────────────┐     │
│ │ 🟢 k8s-pod-crashloop             MTTR: 2m 14s        │     │
│ │    Effectiveness: 97%. Autopilot-eligible.           │     │
│ │                                                      │     │
│ │ 🟢 rds-connection-spike          MTTR: 3m 41s        │     │
│ │    Updated 2 days ago via auto-suggestion.           │     │
│ └──────────────────────────────────────────────────────┘     │
│                                                              │
│ [📊 Export Compliance Report]  [📈 MTTR Trends]              │
└──────────────────────────────────────────────────────────────┘
```

**Why it wins:** Jordan can finally answer "how good is our incident response?" with data instead of vibes. The 6x MTTR difference (with vs. without runbook) is the number that justifies the budget. The compliance export is the number that satisfies the auditor.

---

#### Concept 4: "Ghost of Morgan" — Automatic Knowledge Capture

*The solution to the bus factor. Morgan's expertise, immortalized.*

```
Morgan opts into Terminal Watcher during an incident:
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│ 🔴 Active Incident: database-connection-storm   │
│ Terminal Watcher: ON (recording)                │
│                                                 │
│ Morgan's terminal:                              │
│ $ kubectl get pods -n payments | grep -v Running│
│ $ kubectl logs payment-svc-abc123 --tail=100    │
│ $ psql -h prod-db -c "SELECT count(*)           │
│     FROM pg_stat_activity WHERE state='idle     │
│     in transaction'"                            │
│ $ kubectl rollout restart deployment/payment-svc│
│ $ watch kubectl get pods -n payments            │
│ ✅ All pods Running. Latency recovered.         │
└─────────────────────────────────────────────────┘
                        │
                        ▼  After incident resolves
┌─────────────────────────────────────────────────┐
│ 🧠 AI generated a runbook from your session:    │
│                                                 │
│ "Database Connection Storm — Payment Service"   │
│                                                 │
│ 1. 🟢 Check for non-running pods                │
│ 2. 🟢 Review pod logs for connection errors     │
│ 3. 🟢 Check DB for idle-in-transaction count    │
│ 4. 🔀 IF idle_txn > 50 → step 5                 │
│ 5. 🟡 Restart payment-svc deployment            │
│ 6. 🟢 Verify pod health and latency recovery    │
│                                                 │
│ AI notes: "Morgan checked the DB connection     │
│ pool before restarting — this diagnostic step   │
│ isn't in the existing runbook but was critical  │
│ for confirming root cause."                     │
│                                                 │
│ [✅ Publish]  [✏️ Edit]  [🔗 Link to Alert]     │
└─────────────────────────────────────────────────┘
```

**Why it wins:** Morgan didn't write a runbook. Morgan fixed an incident. The runbook wrote itself. The knowledge that lives in Morgan's fingers — the instinct to check `pg_stat_activity` before restarting — is now captured, structured, and available to Riley at 3am. When Morgan eventually leaves, their ghost stays.

---

#### Concept 5: "The Feedback Loop" — Self-Improving Runbooks

*The flywheel that turns a static document into a living system.*

```
Over 30 days, dd0c/run observes 12 executions of
"Payment Service Latency Runbook":
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│ 📊 Runbook Intelligence Report                  │
│ "Payment Service Latency" — 12 runs, 30 days    │
│                                                 │
│ Step Analysis:                                  │
│ ├─ Step 1 (check logs):    Executed 12/12 ✅    │
│ ├─ Step 2 (check metrics): Executed 12/12 ✅    │
│ ├─ Step 3 (check deploys): Executed 3/12 ⚠️     │
│ │    → Skipped when logs show "connection pool" │
│ ├─ Step 4 (bounce pool):   Executed 10/12 ✅    │
│ │    → Resolved incident 9/10 times             │
│ ├─ Step 5 (rollback deploy): Executed 2/12      │
│ │    → Only needed when step 4 fails            │
│ ├─ Steps 6-8: Never executed (0/12) 🗑️          │
│ │                                               │
│ │ Unlisted actions detected:                    │
│ │ • 8/12 runs: engineer checked                 │
│ │   pg_stat_activity BEFORE step 4              │
│ │ • 3/12 runs: engineer checked recent          │
│ │   deploy diff in GitHub                       │
│                                                 │
│ 🧠 Suggested Evolution:                         │
│ • Add: "Check pg_stat_activity" before step 4   │
│ • Reorder: Move "check deploys" to branch path  │
│ • Remove: Steps 6-8 (never used)                │
│ • Promote: Steps 1-2 to Autopilot (100% rate)   │
│                                                 │
│ Projected MTTR improvement: 6m 12s → 3m 45s     │
│                                                 │
│ [✅ Apply All]  [📝 Review Each]  [❌ Dismiss]  │
└─────────────────────────────────────────────────┘
```

**Why it wins:** The runbook started as Morgan's brain dump. After 30 days and 12 incidents, it's been refined by the collective behavior of the entire team. Steps nobody uses get pruned. Steps everyone adds get formalized. The runbook evolves toward the platonic ideal of "how to fix this" — not because someone sat down to update it, but because the system learned from every execution.
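The step analysis above reduces to counting. A sketch, with the prune threshold as an illustrative knob:

```python
def step_report(runs: list[list[bool]], prune_below: float = 0.1) -> list[dict]:
    """runs[i][j] is True when step j was executed in run i. Flags steps
    whose execution rate falls below prune_below as prune candidates —
    the 'Dead Step Pruning' rule, threshold illustrative."""
    n_runs = len(runs)
    report = []
    for j in range(len(runs[0])):
        rate = sum(run[j] for run in runs) / n_runs
        report.append({
            "step": j + 1,
            "execution_rate": rate,
            "prune_candidate": rate < prune_below,
        })
    return report
```

The same per-step rates feed the effectiveness score and the "Promote to Autopilot" suggestion (execution rate at 100% with zero modifications), so one table of booleans powers three features.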

> *"Five concepts. One thread connecting them all: the runbook is not a document. It's a living organism. It's born from paste or capture, it grows through execution, it heals through feedback, and it evolves through data. dd0c/run isn't a runbook tool. It's a runbook ecosystem."*

---

## Phase 4: PROTOTYPE — Define the MVP

> *"A prototype isn't a product. It's a question made tangible. The question dd0c/run is asking: 'Can we turn the worst moment of an engineer's day into the most supported one?'"*

### The Mantra: "Paste → Parse → Page → Pilot"

Four words. Four verbs. That's the V1. If you can't explain the product in those four words, you've over-scoped.

- **Paste** — Bring your existing runbook. Any format. Zero migration.
- **Parse** — AI structures it into executable, risk-classified steps.
- **Page** — Alert fires, runbook auto-attaches. The right doc finds the right human.
- **Pilot** — Copilot mode walks you through it. Step by step. Calm in the chaos.

---

### Core User Flows

#### Flow 1: First-Time Setup (Morgan, 5 minutes)

```
Morgan signs up → lands on empty dashboard
                    │
                    ▼
┌─────────────────────────────────────────────┐
│ "No runbooks yet. Let's fix that."          │
│                                             │
│ [📋 Paste a Runbook]                        │
│ [🔗 Connect Confluence]                     │
│ [📁 Import from Markdown/Git]               │
└─────────────────────────────────────────────┘
                    │
                    ▼  Morgan clicks "Paste a Runbook"
┌─────────────────────────────────────────────┐
│ Paste raw text → AI parses in ~3 seconds    │
│ → Review structured steps + risk levels     │
│ → Fix flagged ambiguities                   │
│ → Save + Link to alert pattern              │
└─────────────────────────────────────────────┘
                    │
                    ▼  Morgan connects PagerDuty (OAuth)
┌─────────────────────────────────────────────┐
│ dd0c/run pulls alert history                │
│ → Suggests alert-to-runbook mappings        │
│ → Morgan confirms or adjusts                │
│ → Done. Next page triggers the runbook.     │
└─────────────────────────────────────────────┘

Total time: ~5 minutes. Zero YAML. Zero config files.
```

#### Flow 2: Incident Response (Riley, 3am)

```
PagerDuty alert fires → webhook hits dd0c/run
        │
        ▼
dd0c/run matches alert to runbook (semantic + metadata)
        │
        ▼
Slack bot posts in incident channel:
  "🔔 Matched runbook: [Payment Service Latency]
   8 steps (4🟢 3🟡 1🔴) — pre-filled with alert context
   [▶ Start Copilot]"
        │
        ▼  Riley clicks Start
Step-by-step copilot:
  → 🟢 steps: auto-execute with output shown
  → 🟡 steps: present command, wait for approval
  → 🔴 steps: require explicit confirmation + show rollback
  → Each step shows: command, risk, context, rollback option
        │
        ▼  Incident resolves
Post-incident:
  → Execution log saved (full audit trail)
  → Divergence analysis: "You skipped step 3, added a command"
  → Suggested runbook updates
  → MTTR recorded: 4m 23s
```
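The first hop of this flow — webhook body in, runbook context out — might look like the following. The payload shape here is illustrative only; real PagerDuty and OpsGenie payloads differ, and each provider needs its own adapter:

```python
import json

def extract_context(raw_body: bytes) -> dict:
    """Pull the fields the runbook needs out of an incoming alert webhook.
    Field paths are assumptions for illustration — real providers nest
    these differently, hence per-provider adapters in the receiver."""
    payload = json.loads(raw_body)
    return {
        "service": payload.get("service", {}).get("name"),
        "summary": payload.get("summary", ""),
        "severity": payload.get("severity", "unknown"),
        "region": payload.get("details", {}).get("region"),
    }

# Simulated inbound webhook body (hypothetical shape):
body = json.dumps({
    "summary": "payment-service latency > 5000ms",
    "severity": "critical",
    "service": {"name": "payment-svc"},
    "details": {"region": "us-east-1"},
}).encode()
context = extract_context(body)
```

The extracted dict is exactly what feeds both the matcher (service + summary) and the auto-fill (region, service) — one normalization step, two consumers.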

#### Flow 3: Runbook Health Review (Jordan, Monday morning)

```
Jordan opens dd0c/run dashboard
        │
        ▼
Overview: coverage %, avg staleness, MTTR with/without
        │
        ▼
"Needs Attention" queue:
  → Stale runbooks (infrastructure changed since last edit)
  → Services with no runbook (but recent incidents)
  → Low-effectiveness runbooks (high skip rate, long MTTR)
        │
        ▼
Jordan assigns review tasks, exports compliance report
```

---

### Key Screens / Views (V1)

| Screen | Primary User | Purpose |
|--------|-------------|---------|
| **Paste & Parse** | Morgan | Import and structure runbooks. The onboarding moment. |
| **Runbook Editor** | Morgan | Review/edit parsed steps, set risk levels, link to alerts, add context notes. |
| **Runbook Library** | All | Browse all runbooks. Search, filter by service/team/staleness. |
| **Copilot Execution** (Slack + Web) | Riley | Step-by-step guided execution during incidents. The 3am interface. |
| **Incident Summary** | Riley / Jordan | Post-incident: what ran, what was skipped, divergence analysis, suggested updates. |
| **Health Dashboard** | Jordan | Coverage, staleness, MTTR, effectiveness scores. The management view. |
| **Alert Mappings** | Morgan / Jordan | Configure which alerts trigger which runbooks. Semantic suggestions + manual override. |
| **Compliance Export** | Jordan | One-click SOC 2 / ISO 27001 evidence: timestamped execution logs, approval chains, audit trail. |

---

### What to Fake vs. Build in V1

This is where discipline lives. A solo founder building six products needs to be ruthless about scope.

| Capability | V1 Approach | Why |
|-----------|-------------|-----|
| **Paste & Parse** | BUILD — full AI parsing | This IS the product. The magic moment. Non-negotiable. |
| **Risk Classification** | BUILD — AI-powered 🟢🟡🔴 | The trust foundation. Without this, nobody lets the AI execute anything. |
| **Copilot Execution** | BUILD — Slack bot + web UI | The core value prop. This is what Riley uses at 3am. |
| **PagerDuty/OpsGenie Integration** | BUILD — webhook receiver + OAuth | The "runbook finds you" moment. Critical for adoption. |
| **Alert-to-Runbook Matching** | SEMI-BUILD — keyword + metadata matching, basic semantic similarity | Full ML matching is V2. V1 uses alert service name + keywords + manual mapping as fallback. Good enough. |
| **Confluence/Notion Crawlers** | FAKE — manual paste/import only | Crawlers are integration maintenance nightmares. V1 is paste. If they want bulk import, they paste 10 times. It's 10 minutes. |
| **Terminal Watcher** | DEFER to V2 | Requires an agent on the engineer's machine. Trust barrier too high for V1. |
| **Runbook Effectiveness Scoring** | SEMI-BUILD — basic metrics (MTTR, skip rate) | Full scoring model is V2. V1 tracks the raw data and shows simple stats. |
| **Divergence Detection** | BUILD — compare executed vs. prescribed steps | Low engineering cost, high perceived intelligence. The "wow, it noticed I did something different" moment. |
| **Auto-Update Suggestions** | BUILD — generate diffs from divergence data | Natural extension of divergence detection. Makes the learning loop tangible. |
| **Staleness Detection** | FAKE — time-based only | V1: "Last updated 90 days ago" warning. V2: cross-reference with infrastructure state via dd0c/portal. |
| **Compliance Export** | SEMI-BUILD — PDF/CSV of execution logs | Not pretty, but functional. Jordan needs it for the auditor. |
| **Autopilot Mode** | DEFER to V2 | Trust must be earned first. V1 is copilot-only (with auto-execute for 🟢 steps). |
| **Multi-Alert Correlation** | DEFER to V2 | Requires dd0c/alert integration. V1 handles one alert → one runbook. |
| **Rollback Recording** | BUILD — capture inverse commands | Essential safety net. Engineers won't approve 🟡 steps without a visible undo button. |

---

### Technical Approach (V1)

```
┌─────────────────────────────────────────────────────────────┐
│                  dd0c/run V1 Architecture                   │
│                                                             │
│  ┌───────────┐    ┌──────────────┐    ┌────────────────┐    │
│  │ Slack Bot │───→│   dd0c/run   │───→│   Execution    │    │
│  │  (Bolt)   │←───│  API (Rust)  │←───│  Agent (Rust)  │    │
│  └───────────┘    └──────┬───────┘    └────────────────┘    │
│                          │                     ↑            │
│  ┌───────────┐    ┌──────┴───────┐    ┌────────┴───────┐    │
│  │  Web UI   │───→│  PostgreSQL  │    │  Customer VPC  │    │
│  │  (React)  │    │  + pgvector  │    │  (agent runs   │    │
│  └───────────┘    └──────┬───────┘    │   here, pushes │    │
│                          │            │   results out) │    │
│                   ┌──────┴───────┐    └────────────────┘    │
│                   │  LLM Layer   │                          │
│                   │  (via dd0c/  │                          │
│                   │   route)     │                          │
│                   └──────────────┘                          │
└─────────────────────────────────────────────────────────────┘
```

**Key Technical Decisions:**

1. **Rust API + Agent** — Consistent with dd0c platform strategy. Fast, small binary, deploys anywhere. The agent runs in the customer's VPC and executes commands locally. The SaaS never sees credentials.

2. **PostgreSQL + pgvector** — Runbook storage, execution logs, and semantic search in one database. No separate vector DB needed for V1 scale.

3. **LLM via dd0c/route** — Eat your own dog food. Runbook parsing uses the LLM cost router for model selection and cost optimization. Parsing a runbook doesn't need GPT-4o — a fine-tuned smaller model handles structured extraction.

4. **Slack-First UI** — The 3am interface is Slack, not a web app. Engineers are already in Slack during incidents. The web UI is for setup, review, and dashboards — daytime activities.

5. **Webhook-Based Alert Integration** — PagerDuty and OpsGenie both support outbound webhooks. V1 receives webhooks, matches alerts to runbooks, and posts to Slack. No polling, no complex API integration.

6. **Execution Model** — The agent receives step commands from the API, executes them locally, streams output back. Each step is a discrete unit with timeout, rollback command, and risk level. The agent never executes a 🟡 or 🔴 step without explicit API confirmation (which requires human approval in the Slack/web UI).

**V1 Pricing (from brand strategy):**

- **Free:** 3 runbooks, read-along mode only (no execution)
- **Pro ($25/seat/month):** Unlimited runbooks, copilot execution, Slack bot, basic dashboard
- **Business ($49/seat/month):** API access, SSO, compliance export, audit trail, priority support

---

## Phase 5: TEST — Validation Plan

> *"Testing isn't about proving you're right. It's about finding out where you're wrong before your users do. In jazz, you rehearse so the performance feels effortless. In product, you test so the launch feels inevitable."*

---

### Beta User Acquisition Strategy

**Target: 15-20 beta teams over 6 weeks**

We need three types of beta users — one for each persona:

| Segment | Profile | Where to Find Them | Hook |
|---------|---------|-------------------|------|
| **The Drowning Team** | 5-15 SREs, high incident volume, existing runbooks in Confluence/Notion that they know are stale | SRE Slack communities, r/sre, Hacker News "Who's Hiring" threads (companies mentioning on-call) | "Paste your worst runbook. See it transformed in 5 seconds." |
| **The Compliance-Pressured** | Series A/B startups approaching SOC 2 audit, need documented incident response | YC alumni Slack, startup CTO communities, SOC 2 prep consultants as referral partners | "Generate audit-ready incident response evidence automatically." |
| **The Zero-Runbook Team** | Teams that rely entirely on tribal knowledge, have had a recent painful incident | DevOps Twitter/X, conference hallway conversations, postmortem blog posts (reach out to authors) | "Your senior engineer's brain, captured. Before they leave." |

**Acquisition Channels (Ranked by Expected Yield):**

1. **Direct Outreach via Incident Postmortems** — Companies that publish postmortems (in the Cloudflare/GitLab style) are self-selecting as teams that care about incident response. Find mid-market companies publishing postmortems on their blogs. DM the authors. "I read your postmortem on the Redis incident. We're building something that would have cut your MTTR in half. Want early access?"

2. **SRE Community Seeding** — Post in: Rands Leadership Slack (#ops), SRE Weekly newsletter (sponsor a mention), r/sre, DevOps Discord servers. Not a product pitch — share the design thinking insights. "We interviewed 20 on-call engineers about 3am pages. Here's what we learned." Link to waitlist.

3. **Engineering-as-Marketing** — Release a free, open-source CLI tool: `ddoc-parse`. Paste a runbook, get structured JSON output with risk classification. No account needed. No SaaS. Just the AI parsing magic as a standalone tool. Engineers who love it self-select into the beta.

4. **Conference Lightning Talks** — SREcon, KubeCon, DevOpsDays. 5-minute talk: "The Anatomy of a 3am Page — Why Your Runbooks Are Killing Your Engineers." End with a beta signup QR code.

5. **PagerDuty/OpsGenie Community** — Both have community forums and integration marketplaces. List dd0c/run as an integration partner. Teams browsing for PagerDuty add-ons are pre-qualified.

---

### Success Metrics

**Primary Metrics (Must-Hit for V1 Launch):**

| Metric | Target | How We Measure | Why It Matters |
|--------|--------|---------------|----------------|
| **Time-to-First-Runbook** | < 5 minutes | Timestamp: signup → first runbook saved | If onboarding is slow, nobody gets to the value. This is the Vercel test. |
| **MTTR Reduction** | ≥ 40% vs. baseline | Compare MTTR for incidents with dd0c/run vs. team's historical average (from PagerDuty data) | The headline number. "Teams using dd0c/run resolve incidents 40% faster." |
| **Copilot Adoption Rate** | ≥ 60% of incidents use copilot mode | % of matched runbooks where engineer clicks "Start Copilot" vs. "Skip" | If engineers bypass the copilot, the product isn't trusted or isn't useful. |
| **Runbook Update Acceptance Rate** | ≥ 30% of suggested updates accepted | % of AI-suggested runbook updates that are applied | The learning loop is working. Runbooks are actually improving. |
| **Weekly Active Runbooks** | ≥ 5 per team | Runbooks that were either edited, executed, or reviewed in the past 7 days | The product is alive, not shelfware. |
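The headline metric is a one-liner worth being explicit about — percent reduction against the team's historical baseline pulled from PagerDuty incident history:

```python
from statistics import mean

def mttr_reduction(baseline_minutes: list[float], with_runbook_minutes: list[float]) -> float:
    """Percent MTTR improvement vs. the historical baseline — the 40%
    launch gate. Inputs are per-incident resolution times in minutes."""
    baseline = mean(baseline_minutes)
    current = mean(with_runbook_minutes)
    return 100.0 * (baseline - current) / baseline

# Illustrative numbers echoing the dashboard mock (38m-ish before, 6m-ish after):
reduction = mttr_reduction([38.0, 42.0, 36.0], [6.0, 4.5, 7.5])
```

One caveat for the beta analysis: compare like with like (same services, similar severity mix), or the number measures incident luck rather than product value.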

**Secondary Metrics (Directional):**

| Metric | Target | Signal |
|--------|--------|--------|
| **NPS from On-Call Engineers** | > 50 | Riley actually likes this. Not just tolerates it. |
| **Runbook Coverage Growth** | +20% in first 30 days | Teams are creating new runbooks, not just importing old ones. |
| **Step Skip Rate Trend** | Decreasing over time | Runbooks are getting more accurate (fewer irrelevant steps). |
| **Escalation Rate** | Decreasing over time | Junior engineers are resolving more incidents without escalating. |
| **3am Copilot Usage** | ≥ 70% of nighttime incidents | The product works when it matters most — when brains are at 40%. |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### Beta Interview Questions
|
|||
|
|
|
|||
|
|
**For Riley (On-Call Engineer) — After First Incident with dd0c/run:**
|
|||
|
|
|
|||
|
|
1. Walk me through the moment the alert fired. What did you do first? Did you notice the Slack message from dd0c/run?
2. When the runbook appeared, did you trust it? What made you click "Start Copilot" (or what made you skip it)?
3. Was there a moment during execution where you felt confused, unsafe, or unsure? Which step?
4. Did the risk classification (green/yellow/red) match your intuition? Were any steps mis-classified?
5. How did this compare to the last time you handled a similar incident without dd0c/run?
6. If you could change one thing about the experience, what would it be?
7. *The money question:* Would you want this for your next 3am page?

**For Jordan (SRE Manager) — After 2 Weeks:**

1. Have you looked at the health dashboard? What surprised you?
2. Did the MTTR data match your expectations, or was it better/worse than you thought?
3. Has dd0c/run changed how you think about runbook maintenance? How?
4. Would the compliance export satisfy your auditor as-is, or what's missing?
5. Has on-call sentiment changed? Have engineers mentioned dd0c/run in retros or standups?
6. *The money question:* If I took this away tomorrow, what would you miss most?

**For Morgan (Runbook Author) — After First Runbook Import:**

1. How long did it take to paste and parse your first runbook? Did the AI get it right?
2. Did the ambiguity highlighter catch anything you hadn't noticed?
3. Have you looked at the divergence analysis after an incident? Did the suggested updates make sense?
4. Has this changed your motivation to write or update runbooks? Why or why not?
5. *The money question:* Does this feel like it captures your knowledge, or just your commands?

---

### Success Criteria for V1 Launch

**Green Light (Launch):**

- ≥ 12 of 15 beta teams still active after 4 weeks
- MTTR reduction ≥ 40% for at least 8 teams
- NPS > 50 from on-call engineers
- Zero incidents where dd0c/run made an incident worse (safety is non-negotiable)
- At least 3 teams willing to be named case studies

**Yellow Light (Iterate Before Launch):**

- 8-12 teams active, MTTR reduction 20-40%
- Engineers use copilot but frequently modify commands (parsing quality issue)
- Dashboard is used but compliance export needs work
- Extend beta 2 weeks, focus on parsing accuracy

**Red Light (Pivot or Major Rework):**

- < 8 teams active after 4 weeks
- Engineers skip copilot mode > 50% of the time
- MTTR reduction < 20% or inconsistent
- Any incident where dd0c/run contributed to making things worse
- Fundamental rethink of the execution model or trust gradient

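The three launch bands above collapse into a small gate function. A sketch with the thresholds taken from this document; the aggregation (and the omitted case-study and per-team conditions) are simplifying assumptions:

```python
def launch_gate(active_teams, mttr_reduction_pct, nps, made_incident_worse):
    """Map beta results to green/yellow/red. Case-study count and the
    per-team MTTR breakdown are omitted here for brevity."""
    if made_incident_worse:
        return "red"  # safety is non-negotiable, regardless of other numbers
    if active_teams >= 12 and mttr_reduction_pct >= 40 and nps > 50:
        return "green"
    if active_teams >= 8 and mttr_reduction_pct >= 20:
        return "yellow"  # iterate: extend beta 2 weeks, focus on parsing
    return "red"

print(launch_gate(active_teams=13, mttr_reduction_pct=44, nps=62,
                  made_incident_worse=False))  # → green
```

Note that the safety check short-circuits everything else, which matches the criteria's intent: one harmful incident outweighs any amount of good MTTR data.
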
---

## Phase 6: ITERATE — Next Steps

> *"The first version is never the final version. It's the first note in a long improvisation. You play it, listen to how the room responds, and let the music tell you where to go next."*

---

### V1 → V2 Progression

**V1: "Paste → Parse → Page → Pilot" (Months 4-6)**

The foundation. Prove the core loop works: import runbooks, match to alerts, guide execution, learn from divergence.

**V2: "Watch → Learn → Predict → Protect" (Months 7-9)**

| Feature | Description | Unlocks |
|---------|-------------|---------|
| **Terminal Watcher** | Opt-in agent captures commands during incidents. AI generates runbooks from real actions. | Solves cold-start for teams with zero runbooks. Captures Morgan's expertise passively. |
| **Confluence/Notion Crawlers** | Automated discovery and sync of runbook-tagged pages. | Bulk import for large teams. Eliminates manual paste for 100+ runbook libraries. |
| **Full Autopilot Mode** | Runbooks with proven track records (10+ successful copilot runs, zero modifications) can be promoted to fully autonomous execution for 🟢 steps. | The "sleep through the night" promise. Riley gets paged, dd0c/run handles it, Riley gets a summary in the morning. |
| **dd0c/alert Integration** | Alert intelligence feeds directly into runbook matching. Multi-alert correlation identifies root cause and surfaces the right runbook. | The platform flywheel. Alert + Runbook together are 10x more valuable than apart. |
| **Infrastructure-Aware Staleness** | Cross-reference runbook steps against live Terraform state, K8s manifests, AWS resources (via dd0c/portal). Flag steps that reference deleted or changed resources. | Runbooks that know when they're lying. The immune system gets real teeth. |
| **Runbook Effectiveness ML Model** | Move beyond simple metrics to a trained model that predicts runbook success probability based on alert context, time of day, engineer experience, and historical patterns. | "This runbook has a 94% chance of resolving this incident. Last time it failed was when the root cause was actually in the upstream service." |

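The autopilot promotion rule above ("10+ successful copilot runs, zero modifications") is mechanical enough to sketch. The run-record shape and function name are hypothetical:

```python
def eligible_for_autopilot(runs, required=10):
    """A runbook's 🟢 steps earn autonomous execution only after `required`
    copilot runs that succeeded with zero command modifications.
    Record shape is an illustrative assumption."""
    clean = [r for r in runs if r["success"] and r["modifications"] == 0]
    return len(clean) >= required

history = [{"success": True, "modifications": 0}] * 10
print(eligible_for_autopilot(history))  # → True
```

One open design question is whether a single modified run should reset the counter entirely; this sketch simply excludes such runs from the tally.
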
**V3: "Simulate → Train → Marketplace → Scale" (Months 10-12)**

| Feature | Description | Unlocks |
|---------|-------------|---------|
| **Incident Simulator / Fire Drills** | Sandbox environment where teams practice runbooks against simulated incidents. Gamified with scores and leaderboards. | Viral growth mechanism. "My team's incident response score is 94." Training without risk. |
| **Voice-Guided Runbooks** | AI reads steps aloud at 3am. Hands-free incident response. | Genuine differentiation. Nobody else does audio-guided incident response. The 3am brain can listen even when it can't read. |
| **Runbook Marketplace** | Community-contributed, anonymized runbook templates. "Here's how teams running EKS + RDS + Redis handle connection storms." | Network effect. Every new customer's runbooks make the templates better for everyone. |
| **Multi-Cloud Abstraction** | Write one runbook, execute across AWS/GCP/Azure. AI translates cloud-specific commands. | Enterprise expansion. Teams with multi-cloud architectures get a single runbook layer. |

---

### dd0c/alert Integration Timeline

This is the platform play. The "On-Call Savior" phase depends on alert + runbook working as one system.

```
Month 4:    dd0c/alert launches independently
            dd0c/run launches independently
            Integration: webhook-based (alert fires → run matches)
                 │
Month 5:    Shared alert context
            dd0c/alert passes enriched context to dd0c/run:
              - Correlated alerts (not just the trigger)
              - Affected services + owners (from dd0c/portal)
              - Recent deploy history
              - Anomaly confidence score
                 │
Month 6:    Bidirectional feedback loop
            dd0c/run reports resolution data back to dd0c/alert:
              - Which runbook resolved which alert pattern
              - MTTR per alert type
              - dd0c/alert learns which alerts are "auto-resolvable"
                and adjusts severity/routing accordingly
                 │
Month 7-8:  Unified incident view
            Single pane: alert timeline + runbook execution +
            resolution + postmortem — all in one place
            The incident channel becomes the command center
                 │
Month 9:    Predictive pipeline
            dd0c/alert detects anomaly trending toward incident
            → dd0c/run pre-stages the relevant runbook
            → On-call gets a heads-up: "Incident likely in ~30min.
              Runbook ready. Want to run diagnostics now?"
            The page that never fires. The incident that never happens.
```

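The Month 4 integration is deliberately thin: dd0c/alert POSTs an alert payload, dd0c/run answers with its best-matching runbook. A naive tag-overlap matcher sketches the shape of that lookup (payload fields and scoring are assumptions, not the shipped contract):

```python
def match_runbook(alert, runbooks):
    """Pick the runbook whose tags best overlap the alert's service name and
    summary keywords. Payload shape and scoring are illustrative assumptions."""
    alert_terms = {alert["service"], *alert["summary"].lower().split()}

    def score(rb):
        return len(alert_terms & set(rb["tags"]))

    best = max(runbooks, key=score)
    return best if score(best) > 0 else None  # no overlap → no match

alert = {"service": "payment-service", "summary": "latency above 5000ms in us-east-1"}
runbooks = [
    {"name": "Payment Service Latency", "tags": ["payment-service", "latency"]},
    {"name": "Redis Connection Storm", "tags": ["redis", "connections"]},
]
print(match_runbook(alert, runbooks)["name"])  # → Payment Service Latency
```

In practice the matcher would weight service ownership and historical resolution data far above raw keyword overlap; this only illustrates the webhook's request/response shape.
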
---

### Growth Loops

**Loop 1: The Parsing Flywheel (Product-Led)**

```
Engineer pastes runbook → AI parses it → "Wow, that was fast"
→ Engineer pastes 5 more → Invites teammate → Teammate pastes theirs
→ Team has 20 runbooks in dd0c/run within a week
→ First incident uses copilot → MTTR drops → Team is hooked
```

*Fuel:* The 5-second parse moment. Make it so good that engineers paste runbooks for fun.

**Loop 2: The Incident Evidence Loop (Manager-Led)**

```
Jordan sees MTTR data → Shows leadership → Leadership asks
"Why don't all teams use this?" → Org-wide rollout
→ More teams = more runbooks = better matching = better MTTR
→ Jordan becomes internal champion
```

*Fuel:* The MTTR comparison chart. "With dd0c/run: 6 minutes. Without: 38 minutes."

**Loop 3: The Open-Source Wedge (Community-Led)**

```
`ddoc-parse` CLI released (free, open-source)
→ Engineers use it locally to structure runbooks
→ Some want execution, matching, dashboards
→ They upgrade to dd0c/run SaaS
→ Their runbooks (anonymized) improve the parsing model
→ `ddoc-parse` gets better → More users → More conversions
```

*Fuel:* The free tool that's genuinely useful on its own. Not a crippled demo — a real tool.

**Loop 4: The Knowledge Capture Loop (Retention)**

```
Morgan's expertise captured in dd0c/run
→ Morgan leaves the company (or goes on vacation)
→ Riley handles incident using Morgan's captured knowledge
→ Team realizes dd0c/run IS their institutional memory
→ Switching cost becomes infinite
→ Renewal is automatic
```

*Fuel:* The "Ghost of Morgan" moment. The first time a junior engineer resolves an incident using a runbook generated from a senior's terminal session — that's when the product becomes indispensable.

---

### Key Metrics by Phase

| Phase | North Star Metric | Target |
|-------|------------------|--------|
| **V1 (Months 4-6)** | Teams with ≥ 5 active runbooks | 50 teams |
| **V2 (Months 7-9)** | Incidents resolved via copilot/autopilot per month | 500 incidents/month |
| **V3 (Months 10-12)** | Cross-team runbook template adoption | 30% of new runbooks start from a community template |
| **Revenue** | MRR | $15K (Month 6) → $35K (Month 9) → $60K (Month 12) |

---

### The Emotional Finish

> *"Let me leave you with this image."*
>
> *It's 3:17am. Riley's phone buzzes. But this time, before the dread fully forms, a Slack notification appears: "🔔 Runbook matched: Payment Service Latency. 8 steps ready. Pre-filled with alert context."*
>
> *Riley taps "Start Copilot." The first three steps auto-execute — green, safe, read-only. Results stream in. Step 4 is yellow: "Bounce connection pool. ~30s downtime. Rollback ready." Riley taps Approve. Thirty seconds later: "✅ Incident resolved. MTTR: 3m 47s."*
>
> *Riley puts the phone down. The cat hasn't even woken up.*
>
> *That's the product. That's the promise. Not faster incident response — though it is. Not better documentation — though it is. Not compliance evidence — though it is.*
>
> *It's the absence of dread. It's the 3am page that doesn't ruin your night. It's Morgan's knowledge living on after Morgan moves to that startup in Austin. It's Jordan sleeping through the night because the system works even when the humans are tired.*
>
> *dd0c/run doesn't automate runbooks. It automates trust.*
>
> *Now let's build it.* 🎷

---

**End of Design Thinking Session**

**Facilitator:** Maya, Design Thinking Maestro

**Total Phases:** 6 (Empathize → Define → Ideate → Prototype → Test → Iterate)

**Next Step:** Hand off to product/engineering for V1 sprint planning