dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
Commit 5ee95d8b13 (2026-02-28 17:35:02 +00:00)
51 changed files with 36,935 additions and 0 deletions

# 🧠 Alert Intelligence Layer — Brainstorm Session
**Product:** #3 — Alert Intelligence Layer (dd0c platform)
**Date:** 2026-02-28
**Facilitator:** Carson (Elite Brainstorming Specialist)
**Total Ideas Generated:** 112
---
## Phase 1: Problem Space (22 ideas)
### What Alert Fatigue Actually Feels Like at 3am
1. **The Pavlov's Dog Effect** — Your phone buzzes and your cortisol spikes before you even read it. After 6 months of on-call, the sound of a notification triggers anxiety even on vacation.
2. **The Boy Who Cried Wolf** — After 50 false alarms, you stop reading the details. You just ack and go back to sleep. The 51st one is the real outage.
3. **The Scroll of Doom** — You wake up to 347 unread alerts in Slack. You have to mentally triage which ones matter. By the time you find the real one, it's been firing for 40 minutes.
4. **The Guilt Loop** — You muted a channel because it was too noisy. Now you feel guilty. What if something real fires? You unmute. It's noisy again. Repeat.
5. **The Resignation Trigger** — Alert fatigue is the #1 cited reason engineers leave on-call-heavy roles. It's not the incidents — it's the noise between incidents.
### What Alerts Are Always Noise? What Patterns Exist?
6. **The Cascade** — One root cause (DB goes slow) triggers 47 downstream alerts across 12 services. Every single one is a symptom, not the cause.
7. **The Flapper** — CPU hits 80%, alert fires. CPU drops to 79%, resolves. CPU hits 80% again. 14 times in an hour. Same alert, same non-issue.
8. **The Deployment Storm** — Every deploy triggers a brief spike in error rates. 100% predictable. 100% still alerting.
9. **The Threshold Lie** — "Alert when latency > 500ms" was set 2 years ago when traffic was 10x lower. Nobody updated it. It fires daily now.
10. **The Zombie Alert** — The service it monitors was decommissioned 6 months ago. The alert still fires. Nobody knows who owns it.
11. **The Chatty Neighbor** — One misconfigured service generates 80% of all alerts. Everyone knows. Nobody fixes it because "it's not my team's service."
12. **The Scheduled Noise** — Cron jobs, batch processes, maintenance windows — all generate predictable alerts at predictable times.
13. **The Metric Drift** — Seasonal traffic patterns mean thresholds that work in January fail in December. Static thresholds can't handle dynamic systems.
### Why Do Critical Alerts Get Lost?
14. **Signal Drowning** — When everything is urgent, nothing is urgent. Critical alerts are buried in a sea of warnings.
15. **Channel Overload** — Alerts go to Slack, email, PagerDuty, and SMS simultaneously. Engineers pick ONE channel and ignore the rest. If the critical alert only went to email...
16. **Context Collapse** — "High CPU on prod-web-07" tells you nothing. Is it the one serving 40% of traffic or the one that's being decommissioned?
17. **The Wrong Person** — Alert goes to the on-call for Team A, but the root cause is in Team B's service. 30 minutes of "not my problem" before escalation.
### Cost Analysis
18. **MTTR Tax** — Every minute of alert triage is a minute not spent fixing. Teams with high noise have 3-5x longer MTTR.
19. **The Attrition Cost** — Replacing a senior SRE costs $150-300K (recruiting + ramp). Alert fatigue drives attrition. Do the math.
20. **The Incident That Wasn't Caught** — One missed P1 incident can cost $100K-$10M depending on the business. Alert fatigue makes this inevitable.
21. **Cognitive Load Tax** — Engineers who had a bad on-call night are 40% less productive the next day. That's a hidden cost nobody tracks.
22. **The Compliance Risk** — In regulated industries (fintech, healthcare), missed alerts can mean regulatory fines. Alert fatigue is a compliance risk.
---
## Phase 2: Solution Space (48 ideas)
### AI Deduplication Approaches
23. **Semantic Fingerprinting** — Hash alerts by semantic meaning, not exact text. "High CPU on web-01" and "CPU spike detected on web-01" are the same alert.
24. **Topology-Aware Grouping** — Use service dependency maps to group alerts that share a common upstream cause. DB slow → API slow → Frontend errors = 1 incident, not 47 alerts.
25. **Time-Window Clustering** — Alerts within a 5-minute window affecting related services get auto-grouped into a single incident.
26. **Embedding-Based Similarity** — Use lightweight embeddings (sentence-transformers) to compute similarity scores between alerts. Cluster above threshold.
27. **Template Extraction** — Learn alert templates ("X metric on Y host exceeded Z threshold") and deduplicate by template + parameters.
28. **Cross-Source Dedup** — Same incident triggers alerts in Datadog AND PagerDuty AND Grafana. Deduplicate across sources, not just within.
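The template idea (#27) is simple enough to sketch. A minimal, illustrative Python version, assuming a crude regex mask for parameters (a real system would learn templates rather than hard-code the masking):

```python
import hashlib
import re
from collections import defaultdict

def template_fingerprint(alert_text: str) -> str:
    """Mask the variable parts of an alert (numbers, hosts like web-01,
    values like 500ms), leaving only the template. Alerts that share a
    template dedupe together regardless of parameters."""
    t = alert_text.lower()
    t = re.sub(r"\b[\w.-]*\d[\w.-]*\b", "<p>", t)  # crude parameter mask
    t = re.sub(r"\s+", " ", t).strip()
    return hashlib.sha1(t.encode()).hexdigest()[:12]

def dedupe(alerts: list) -> dict:
    """Group raw alerts by template fingerprint: one group per template."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[template_fingerprint(alert)].append(alert)
    return dict(groups)
```

The lossiness is the point: "latency on api-03 exceeded 500ms" and "latency on api-07 exceeded 900ms" collapse into one group, while "high CPU" and "disk full" on the same host stay separate.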
### Correlation Strategies
29. **Deployment Correlation** — Automatically correlate alert spikes with recent deployments (pull from CI/CD: GitHub Actions, ArgoCD, etc.). "This started 3 minutes after deploy #4521."
30. **Change Correlation** — Beyond deploys: config changes, feature flag flips, infrastructure changes (Terraform applies), DNS changes.
31. **Service Dependency Graph** — Auto-discover or import service maps. When alert fires, show the blast radius and likely root cause.
32. **Temporal Pattern Matching** — "This exact pattern of alerts happened 3 times before. Each time it was caused by X." Learn from history.
33. **Cross-Team Correlation** — Alert in Team A's service + alert in Team B's service = shared dependency issue. Neither team sees the full picture alone.
34. **Infrastructure Event Correlation** — Cloud provider incidents, network blips, AZ failures — correlate with external status pages automatically.
35. **Calendar-Aware Correlation** — Black Friday traffic, end-of-month batch jobs, quarterly reporting — correlate with known business events.
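Deployment correlation (#29) is mostly a time-window join. A sketch, assuming deploy events have already been pulled from CI/CD; the 15-minute window is an assumed default, not a researched number:

```python
from datetime import datetime, timedelta

DEPLOY_WINDOW = timedelta(minutes=15)  # assumption: tune per team

def recent_deploys(alert_time: datetime, deploys: list) -> list:
    """Return (deploy_id, finished_at) pairs that landed inside the
    window before the alert, newest first. `deploys` would come from a
    CI/CD integration (GitHub Actions, ArgoCD, ...) in a real system."""
    hits = [(deploy_id, finished)
            for deploy_id, finished in deploys
            if timedelta(0) <= alert_time - finished <= DEPLOY_WINDOW]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The output feeds the one sentence engineers actually want: "This started 3 minutes after deploy #4521."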
### Priority Scoring
36. **SLO-Based Priority** — If this alert threatens an SLO with < 20% error budget remaining, it's critical. If error budget is 90% full, it can wait.
37. **Business Impact Scoring** — Assign business value to services (revenue-generating, customer-facing, internal-only). Alert priority inherits from service importance.
38. **Historical Resolution Priority** — Alerts that historically required immediate action get high priority. Alerts that were always acked-and-ignored get suppressed.
39. **Blast Radius Scoring** — How many users/services are affected? An alert affecting 1 user vs 1 million users should have very different priorities.
40. **Time-Decay Priority** — An alert that's been firing for 5 minutes is more urgent than one that just started (it's not self-resolving).
41. **Compound Scoring** — Combine multiple signals: SLO impact × business value × blast radius × historical urgency = composite priority score.
42. **Dynamic Thresholds** — Replace static thresholds with ML-based anomaly detection. Alert only when behavior is genuinely anomalous for THIS time of day, THIS day of week.
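Compound scoring (#41) can be stated as a one-line formula. A sketch, assuming all signals are pre-normalized to [0, 1]; the bare product is a placeholder combination rule, not a validated model:

```python
def priority_score(slo_budget_left: float, business_weight: float,
                   blast_radius: float, historical_urgency: float) -> float:
    """Composite score per idea #41. slo_budget_left is the fraction of
    error budget remaining, so SLO impact is its complement (idea #36:
    a nearly spent budget means a near-critical alert)."""
    slo_impact = 1.0 - slo_budget_left
    return slo_impact * business_weight * blast_radius * historical_urgency
```

One design consequence worth noting: a straight product means any zero factor (say, an internal-only service weighted 0) silences the alert entirely, whereas a weighted sum would degrade more gently.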
### Learning Mechanisms
43. **Ack Pattern Learning** — If an alert is acknowledged within 10 seconds 95% of the time, it's probably noise. Learn to auto-suppress.
44. **Resolution Pattern Learning** — Track what actually gets resolved vs what auto-resolves. Focus human attention on alerts that need human action.
45. **Runbook Extraction** — Parse existing runbooks and link them to alert types. If the runbook says "check if it's a deploy," automate that check.
46. **Postmortem Mining** — Analyze incident postmortems to identify which alerts were useful signals and which were noise during real incidents.
47. **Feedback Loops** — Explicit thumbs up/down on alert usefulness. "Was this alert helpful?" Build a labeled dataset from real engineer feedback.
48. **Snooze Intelligence** — Learn from snooze patterns. If everyone snoozes "disk usage > 80%" for 24 hours, maybe the threshold should be 90%.
49. **Team-Specific Learning** — Different teams have different noise profiles. Learn per-team, not globally.
50. **Seasonal Learning** — Recognize that December traffic patterns are different from July. Adjust baselines seasonally.
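Ack-pattern learning (#43) starts as a counting rule before it is ever ML. A sketch; the sample-size and ratio thresholds are assumptions, and the safe move is to flag candidates for human review rather than silently suppress:

```python
def is_suppression_candidate(ack_times_sec: list, min_samples: int = 20,
                             fast_ack_sec: float = 10.0,
                             fast_ratio: float = 0.95) -> bool:
    """Idea #43: an alert acked within ~10 seconds 95% of the time is
    probably reflex-acked noise. Never judge on thin history."""
    if len(ack_times_sec) < min_samples:
        return False  # too little data to call it noise
    fast = sum(1 for t in ack_times_sec if t <= fast_ack_sec)
    return fast / len(ack_times_sec) >= fast_ratio
```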
### Integration Approaches
51. **Webhook Receiver (Primary)** — Accept webhooks from any monitoring tool. Zero-config for tools that support webhook destinations. Lowest friction.
52. **API Polling (Secondary)** — For tools that don't support webhooks well, poll their APIs on a schedule.
53. **Slack Bot Integration** — Live in Slack where engineers already are. Receive alerts, show grouped incidents, allow ack/resolve from Slack.
54. **PagerDuty Bidirectional Sync** — Don't replace PagerDuty — sit in front of it. Filter noise before it hits PagerDuty's on-call rotation.
55. **Terraform Provider** — Configure alert rules, suppression policies, and service maps as code. GitOps-friendly.
56. **OpenTelemetry Collector Plugin** — Tap into the OTel pipeline to correlate alerts with traces and logs.
57. **GitHub/GitLab Integration** — Pull deployment events, PR merges, and config changes for correlation.
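The webhook receiver's (#51) core job is normalization, so cross-source dedup (#28) never has to care where an alert came from. A sketch of the canonical shape; the per-source field names below are invented placeholders, NOT the real Datadog or Grafana webhook schemas (a real receiver maps each vendor's documented payload):

```python
def normalize_event(source: str, payload: dict) -> dict:
    """Collapse tool-specific webhook payloads into one canonical alert
    record. Per-source field names here are ILLUSTRATIVE only."""
    if source == "datadog":
        return {"source": source,
                "title": payload.get("title", ""),
                "severity": payload.get("priority", "unknown")}
    if source == "grafana":
        return {"source": source,
                "title": payload.get("rule_name", ""),
                "severity": payload.get("state", "unknown")}
    raise ValueError(f"no mapper for source: {source}")
```

Failing loudly on unknown sources matters: a silently dropped webhook is exactly the single-point-of-failure nightmare called out in the anti-ideas.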
### UX Ideas
58. **Slack-Native Experience** — Primary interface IS Slack. Threaded incident channels, interactive buttons, slash commands. No new tool to learn.
59. **Mobile-First Dashboard** — On-call engineers are on their phones at 3am. The mobile experience must be exceptional, not an afterthought.
60. **Daily Digest Email** — "Yesterday: 347 alerts fired. 12 were real. Here's what we suppressed and why." Build trust through transparency.
61. **Alert Replay** — Visualize an incident timeline: which alerts fired in what order, how they were grouped, what the AI decided. Full auditability.
62. **Noise Report Card** — Weekly report per team: "Your noisiest alerts, your most-ignored alerts, suggested tuning." Gamify noise reduction.
63. **On-Call Handoff Summary** — Auto-generated summary for shift handoffs: "Here's what happened, what's still open, what to watch."
64. **Service Health Dashboard** — Not another dashboard — a SMART dashboard that only shows what's actually wrong right now, with context.
65. **CLI Tool** — `alert-intel status`, `alert-intel suppress <pattern>`, `alert-intel explain <incident-id>`. For the terminal-native engineers.

### Escalation Intelligence
66. **Smart Routing** — Route to the engineer who last fixed this type of issue, not just whoever is on-call.
67. **Auto-Escalation Rules** — If no ack in 10 minutes AND SLO impact is high, auto-escalate to the next tier. No human needed to press "escalate."
68. **Responder Availability** — Integrate with calendar/Slack status. Don't page someone who's marked as unavailable — find the backup automatically.
69. **Fatigue-Aware Routing** — If the on-call engineer has been paged 5 times tonight, route the next one to the secondary. Prevent burnout in real-time.
70. **Cross-Team Escalation** — When correlation shows the root cause is in another team's service, auto-notify that team with context.
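Fatigue-aware routing (#69) is a tiny policy on top of the rotation. A sketch; the cap of 5 pages is an assumed default, not a researched number:

```python
def pick_responder(escalation_order: list, pages_tonight: dict,
                   fatigue_cap: int = 5) -> str:
    """Idea #69: walk the escalation order and skip anyone already paged
    `fatigue_cap` or more times tonight. Fall back to the primary if
    everyone is saturated (someone has to get the page)."""
    for engineer in escalation_order:
        if pages_tonight.get(engineer, 0) < fatigue_cap:
            return engineer
    return escalation_order[0]
```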
### Wild Ideas
71. **Predictive Alerting** — Use time-series forecasting to predict when a metric WILL breach a threshold. Alert before it happens. "CPU will hit 95% in ~20 minutes based on current trend."
72. **Alert Simulation Mode** — "What if we changed this threshold? Here's what would have happened last month." Simulate before you ship.
73. **Incident Autopilot** — For known, repeatable incidents, execute the runbook automatically. Human just approves. "We've seen this 47 times. Auto-scaling fixes it. Execute? [Yes/No]"
74. **Natural Language Alert Creation** — "Alert me if checkout latency is bad" → AI translates to proper metric query + dynamic threshold.
75. **Alert Debt Score** — Like tech debt but for monitoring. "You have 47 alerts that fire daily and are always ignored. Your alert debt score is 73/100."
76. **Chaos Engineering Integration** — During chaos experiments, automatically suppress expected alerts and highlight unexpected ones.
77. **LLM-Powered Root Cause Analysis** — Feed the AI the alert cluster + recent changes + service graph → get a natural language hypothesis: "Likely cause: memory leak introduced in commit abc123, deployed 12 minutes ago."
78. **Voice Interface for On-Call** — "Hey Alert Intel, what's going on?" at 3am when you can't read your phone screen. Get a spoken summary.
79. **Alert Sound Design** — Different sounds for different severity/types. Your brain learns to distinguish "noise ping" from "real incident alarm" without reading.
80. **Collaborative Incident Chat** — Auto-create a war room channel, pull in relevant people, seed it with context, timeline, and suggested actions.
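Even the wildest idea here, predictive alerting (#71), has a deliberately naive baseline: fit a straight line to recent samples and extrapolate. A sketch, assuming one sample per minute; a real implementation would use a proper time-series forecasting model:

```python
def minutes_to_breach(samples: list, threshold: float):
    """Idea #71, naive version: least-squares line through recent
    per-minute samples, extrapolated to the threshold. Returns minutes
    from the latest sample, or None if the trend never breaches."""
    n = len(samples)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, samples)) / denom
    if slope <= 0:
        return None  # flat or falling: no breach predicted
    intercept = mean_y - slope * mean_x
    breach_at = (threshold - intercept) / slope
    return max(0.0, breach_at - (n - 1))
```

For CPU samples climbing 5 points a minute toward a 95% threshold, this yields the promised "you have ~N minutes before this becomes a problem."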
---
## Phase 3: Differentiation (15 ideas)
### What Makes This Defensible?
81. **Data Moat** — Every alert processed, every ack, every resolve, every feedback signal makes the model smarter. Competitors starting fresh can't match 6 months of learned patterns.
82. **Network Effects Within Org** — The more teams that use it, the better cross-team correlation works. Creates internal pressure to expand.
83. **Integration Depth** — Deep bidirectional integrations with 5-10 monitoring tools create switching costs. You'd have to rewire everything.
84. **Institutional Knowledge Capture** — The system learns what your senior SRE knows intuitively. When that person leaves, the knowledge stays. That's incredibly valuable.
85. **Custom Model Per Customer** — Each customer's model is trained on THEIR patterns. Generic competitors can't match customer-specific intelligence.
### Why Not Just Use PagerDuty's Built-In AI?
86. **Vendor Lock-In** — PagerDuty AIOps only works with PagerDuty. We work across ALL your tools. Most teams use 3-5 monitoring tools.
87. **Price** — PagerDuty AIOps is an expensive add-on to an already expensive product. We're $15-30/seat vs their $50+/seat for AIOps alone.
88. **Independence** — We're tool-agnostic. Switch from Datadog to Grafana? We don't care. Your learned patterns carry over.
89. **Focus** — PagerDuty is an incident management platform that bolted on AI. We're AI-first, purpose-built for alert intelligence.
90. **Transparency** — Big platforms are black boxes. We show exactly why an alert was suppressed, grouped, or escalated. Engineers need to trust it.
### Additional Differentiation
91. **SMB-First Design** — BigPanda needs a 6-month enterprise deployment. We need a webhook URL and 10 minutes.
92. **Open Core Potential** — Open-source the core deduplication engine. Build community. Monetize the hosted service + advanced features.
93. **Developer Experience** — API-first, CLI tools, Terraform provider, GitOps config. Built BY engineers FOR engineers, not by enterprise sales teams.
94. **Time to Value** — Show value in the first hour. "You received 200 alerts today. We would have shown you 23." Instant proof.
95. **Community-Shared Patterns** — Anonymized, opt-in pattern sharing. "Teams using Kubernetes + Istio commonly see this noise pattern. Auto-suppress?" Collective intelligence.
---
## Phase 4: Anti-Ideas (17 ideas)
### Why Would This Fail?
96. **Trust Gap** — Engineers will NOT trust AI to suppress alerts on day one. One missed critical alert and they'll disable it forever. The trust ramp is the hardest problem.
97. **The Datadog Threat** — Datadog has $2B+ revenue and is building AI features aggressively. They could ship "AI Alert Grouping" as a free feature tomorrow.
98. **Integration Maintenance Hell** — Supporting 10+ monitoring tools means 10+ APIs that change, break, and deprecate. Integration maintenance could eat the entire engineering team.
99. **Cold Start Problem** — The AI needs data to learn. Day 1, it's dumb. How do you deliver value before the model has learned anything?
100. **Alert Suppression Liability** — If the AI suppresses a real alert and there's an outage, who's liable? Legal/compliance teams will ask this question.
101. **Small TAM Concern** — Teams of 5-50 engineers at $15-30/seat = $75-$1,500/month per customer. Need thousands of customers to build a real business.
102. **Enterprise Gravity** — Larger companies (where the money is) already have BigPanda or PagerDuty AIOps. SMBs have less budget and higher churn.
103. **The "Good Enough" Problem** — Teams might just... mute channels and deal with it. The pain is real but the workarounds are free.
104. **Security Concerns** — Alert data contains service names, infrastructure details, error messages. Sending this to a third party is a security review.
105. **Champion Risk** — If the one SRE who championed the tool leaves, does the team keep paying? Single-champion products churn hard.
106. **Monitoring Tool Consolidation** — The trend is toward fewer tools (Datadog is eating everything). If teams consolidate to one tool, the "cross-tool" value prop weakens.
107. **AI Hype Fatigue** — "AI-powered" is becoming meaningless. Engineers are skeptical of AI claims. Need to prove it with numbers, not buzzwords.
108. **Open Source Competition** — Someone could build an open-source version. The deduplication algorithms aren't rocket science. The value is in the learned models.
109. **Webhook Reliability** — If our system goes down, alerts don't get processed. We become a single point of failure in the alerting pipeline. That's terrifying.
110. **Feature Creep Temptation** — The pull to become a full incident management platform (competing with PagerDuty) is strong. Must resist and stay focused.
111. **Pricing Pressure** — At $15-30/seat, margins are thin. Infrastructure costs for ML inference could eat profits if not carefully managed.
112. **The "Just Fix Your Alerts" Argument** — Some will say: "If your alerts are noisy, fix your alerts." They're not wrong. We're a band-aid on a deeper problem.
---
## Phase 5: Synthesis
### Top 10 Ideas (Ranked)
| Rank | Idea | Why |
|------|------|-----|
| 1 | **Webhook Receiver + Slack-Native UX** (#51, #58) | Lowest friction entry point. Engineers don't leave Slack. 10-minute setup. |
| 2 | **Topology-Aware Alert Grouping** (#24) | The single highest-impact feature. Turns 47 alerts into 1 incident. Immediate, visible value. |
| 3 | **Deployment Correlation** (#29) | "This started after deploy #4521" is the most useful sentence in incident response. Pulls from GitHub/CI automatically. |
| 4 | **Ack/Resolve Pattern Learning** (#43, #44) | The data moat starts here. Every interaction makes the system smarter. Passive learning, no user effort. |
| 5 | **Daily Noise Report Card** (#62) | Builds trust through transparency. Shows what was suppressed and why. Gamifies noise reduction. |
| 6 | **SLO-Based Priority Scoring** (#36) | Objective, defensible prioritization. Not "AI magic" — math based on your own SLO definitions. |
| 7 | **Time-Window Clustering** (#25) | Simple, effective, explainable. "These 12 alerts all fired within 2 minutes — probably related." |
| 8 | **Feedback Loops (Thumbs Up/Down)** (#47) | Explicit signal to train the model. Engineers feel in control. Builds the labeled dataset for V2 ML. |
| 9 | **Alert Simulation Mode** (#72) | "What would have happened last month?" is the killer demo. Proves value before they commit. |
| 10 | **PagerDuty Bidirectional Sync** (#54) | Don't replace PagerDuty — sit in front of it. Reduces friction. "Keep your existing setup, just add us." |
### 3 Wild Cards
1. **🔮 Predictive Alerting** (#71) — Forecast threshold breaches before they happen. Hard to build, but if it works, it's magic. "You have 20 minutes before this becomes a problem." Game-changing for V2.
2. **🤖 Incident Autopilot** (#73) — Auto-execute known runbooks with human approval. "We've seen this 47 times. Auto-scaling fixes it every time. Execute?" This is where the real money is long-term.
3. **🧠 Community-Shared Patterns** (#95) — Anonymized collective intelligence. "87% of Kubernetes teams suppress this alert. Want to?" Network effects across customers, not just within orgs. Could be the ultimate moat.
### Recommended V1 Scope
**V1: "Smart Alert Funnel"**
Core features (ship in 8-12 weeks):
- **Webhook ingestion** from Datadog, Grafana, PagerDuty, OpsGenie (4 integrations)
- **Time-window clustering** (group alerts within 5-min windows affecting related services)
- **Semantic deduplication** (same alert, different wording = 1 alert)
- **Deployment correlation** (GitHub Actions / GitLab CI integration)
- **Slack bot** as primary UX (grouped incidents, ack/resolve, context)
- **Daily digest** showing noise reduction stats
- **Thumbs up/down feedback** on every grouped incident
- **Alert simulation** ("connect us to your webhook history, we'll show you what V1 would have done")
What V1 is NOT:
- No ML-based anomaly detection (use rule-based grouping first)
- No predictive alerting
- No auto-remediation
- No custom dashboards (Slack IS the dashboard)
- No on-prem deployment
**V1 Success Metric:** Cut the alert volume that reaches humans by 60%+ within the first week, without suppressing a single actionable alert.
**V1 Pricing:** $19/seat/month. Free tier for up to 5 seats (land-and-expand).
**V1 Go-to-Market:**
- Target: DevOps/SRE teams at Series A-C startups (20-200 employees)
- Channel: Dev Twitter/X, Hacker News launch, DevOps subreddit, conference lightning talks
- Hook: "Connect your webhook. See your noise reduction in 60 seconds."
---
*Session complete. 112 ideas generated across 5 phases. Let's build this thing.* 🚀

# 🎷 dd0c/alert — Design Thinking Session
**Product:** Alert Intelligence Layer (dd0c/alert)
**Facilitator:** Maya, Design Thinking Maestro
**Date:** 2026-02-28
**Method:** Full Design Thinking (Empathize → Define → Ideate → Prototype → Test → Iterate)
---
> *"An alert system that cries wolf isn't broken — it's traumatizing the shepherd. We're not fixing alerts. We're restoring trust between humans and their machines."*
> — Maya
---
# Phase 1: EMPATHIZE 🎧
Design is jazz. You don't start by playing — you start by *listening*. And right now, the on-call world is screaming a blues riff at 3am, and nobody's transcribing the melody. Let's sit with these people. Let's feel what they feel before we dare to build anything.
---
## Persona 1: Priya Sharma — The On-Call Engineer
**Age:** 28 | **Role:** Backend Engineer, On-Call Rotation | **Company:** Mid-stage fintech startup, 85 engineers
**Slack Status:** 🔴 "on-call until Thursday, pray for me"
### Empathy Map
**SAYS:**
- "I got paged 6 times last night. Five were nothing."
- "I just ack everything now. I'll look at it in the morning if it's still firing."
- "I used to love this job. Now I dread Tuesdays." (her on-call day)
- "Can someone PLEASE fix the checkout-latency alert? It fires every deploy."
- "I'm not burned out, I'm just... tired." (she's burned out)
**THINKS:**
- "If I mute this channel, will I miss the one real incident?"
- "My manager says on-call is 'shared responsibility' but I've been on rotation 3x more than anyone else this quarter."
- "I wonder if that startup down the street has on-call this bad."
- "What if I just... don't answer? What's the worst that happens?"
- "I'm mass-acking alerts at 3am. This is not engineering. This is whack-a-mole."
**DOES:**
- Sets multiple alarms because she doesn't trust herself to wake up for pages anymore — her brain has learned to sleep through them
- Keeps a personal "ignore list" in a Notion doc — alerts she's learned are always noise
- Spends the first 20 minutes of every incident figuring out if it's real
- Writes angry Slack messages in #sre-gripes at 3:17am
- Checks the deploy log manually every single time an alert fires
- Takes a "recovery day" after bad on-call nights (uses PTO, doesn't tell anyone why)
**FEELS:**
- **Anxiety:** The phantom vibration in her pocket. Even off-call, she flinches when her phone buzzes.
- **Resentment:** Toward the teams that ship noisy services and never fix their alerts.
- **Guilt:** When she mutes channels or acks without investigating.
- **Isolation:** Nobody who isn't on-call understands what it's like. Her partner thinks she's "just checking her phone."
- **Helplessness:** She's filed 12 tickets to fix noisy alerts. 2 got addressed. The rest are "backlog."
### Pain Points
1. **Signal-to-noise ratio is catastrophic** — 80-90% of pages are non-actionable
2. **Context is missing** — Alert says "high latency on prod-api-12" but doesn't say WHY or whether it matters
3. **No correlation** — She has to manually check if a deploy just happened, if other services are affected, if it's a known pattern
4. **Alert ownership is broken** — Nobody owns the noisy alerts. The team that created the service moved on. The alert is an orphan.
5. **Recovery time is invisible** — Management doesn't see the cognitive cost of a bad night. She's 40% less productive the next day but nobody measures that.
6. **Tools fragment her attention** — Alerts come from Datadog, PagerDuty, Slack, and email. She has to context-switch across 4 tools to understand one incident.
### Current Workarounds
- Personal Notion "ignore list" of known-noisy alerts
- Slack keyword muting for specific alert patterns
- A bash script she wrote that checks the deploy log when she gets paged (she's automated her own triage)
- Group chat with other on-call engineers where they share "is this real?" messages at 3am
- Coffee. So much coffee.
### Jobs to Be Done (JTBD)
- **When** I get paged at 3am, **I want to** instantly know if this is real and what to do, **so I can** either fix it fast or go back to sleep.
- **When** I start my on-call shift, **I want to** know what's currently broken and what's just noise, **so I can** mentally prepare and not waste energy on false alarms.
- **When** an incident is happening, **I want to** see all related alerts grouped together with context, **so I can** focus on the root cause instead of chasing symptoms.
- **When** my on-call shift ends, **I want to** hand off cleanly with a summary, **so I can** actually disconnect and recover.
### Day-in-the-Life: Tuesday (On-Call Day)
**6:30 AM** — Alarm goes off. Checks phone immediately. 14 alerts overnight. Scrolls through them in bed. 12 are the usual suspects (checkout-latency, disk-usage-warning, the zombie alert from the decommissioned auth service). 2 look potentially real. Gets up with a knot in her stomach.
**9:15 AM** — Standup. "Anything from on-call?" She says "quiet night" because explaining 14 alerts that were all noise is exhausting and nobody wants to hear it.
**11:42 AM** — Page: "Error rate spike on payment-service." Heart rate jumps. Opens Datadog. Opens PagerDuty. Opens Slack. Checks the deploy log — yes, someone deployed 8 minutes ago. Checks the PR. It's a config change. Error rate is already recovering. Acks the alert. Total time: 12 minutes. Actual action needed: zero.
**2:15 PM** — Trying to do actual feature work. Gets paged again. Same checkout-latency alert that fires every afternoon during peak traffic. Acks in 3 seconds without looking. Goes back to coding. Loses her flow state. Takes 20 minutes to get back into the problem.
**3:47 AM (Wednesday)** — Phone screams. Bolts awake. Heart pounding. "Database connection pool exhausted on prod-db-primary." This one is real. But she doesn't know that yet. Spends 8 minutes triaging — checking if it's the flapper, checking if there was a deploy, checking if other services are affected. By the time she confirms it's real and starts the runbook, it's been 11 minutes. MTTR clock is ticking.
**4:22 AM** — Incident resolved. Wide awake now. Adrenaline. Can't sleep. Opens Twitter. Sees a meme about on-call life. Laughs bitterly. Considers updating her LinkedIn.
**9:00 AM (Wednesday)** — Shows up to work exhausted. Manager asks about the incident. She explains. Manager says "great job." She thinks: "Great job would be not getting paged for garbage 6 times before the real one."
---
## Persona 2: Marcus Chen — The SRE/Platform Lead
**Age:** 34 | **Role:** Senior SRE / Platform Team Lead | **Company:** Series C SaaS company, 140 engineers, 8 SREs
**Slack Status:** 📊 "Reviewing Q1 on-call metrics (they're bad)"
### Empathy Map
**SAYS:**
- "We need to fix our alert hygiene. I've been saying this for two quarters."
- "I can't force product teams to fix their alerts. I can only write guidelines nobody reads."
- "Our MTTR is 34 minutes. Industry benchmark is 15. I know why, but I can't fix it alone."
- "PagerDuty costs us $47/seat and half the alerts it sends are noise."
- "I need a way to show leadership that alert quality is an engineering productivity problem, not just an SRE problem."
**THINKS:**
- "I know which alerts are noise. I've known for months. But fixing them requires buy-in from 6 different teams and none of them prioritize it."
- "If I suppress alerts aggressively, and something breaks, it's MY head on the block."
- "The junior engineers on rotation are getting destroyed. I can see it in their faces. I need to protect them but I don't have the tools."
- "I could build something internally... but I've been saying that for a year and we never have the bandwidth."
- "Am I going to spend my entire career fighting alert noise? Is this really what SRE is?"
**DOES:**
- Runs a monthly "alert review" meeting that nobody wants to attend
- Maintains a spreadsheet tracking alert-to-incident ratios per service (manually updated, always out of date)
- Writes alert rules for other teams because they won't write good ones themselves
- Spends 30% of his time on alert tuning instead of platform work
- Advocates for "alert budgets" per team (like error budgets) — leadership likes the idea but won't enforce it
- Reviews every postmortem looking for "was the right alert the first alert?" (answer is usually no)
**FEELS:**
- **Frustration:** He has the expertise to fix this but not the organizational leverage. He's a platform lead, not a VP.
- **Responsibility:** Every bad on-call night for his team feels like his failure. He set up the rotation. He should have fixed the noise.
- **Exhaustion:** Alert tuning is Sisyphean. Fix 10 noisy alerts, 15 new ones appear because someone shipped a new service with default thresholds.
- **Professional anxiety:** His MTTR metrics look bad. Leadership sees numbers, not the nuance of why.
- **Loneliness:** He's the only one who sees the full picture. Product teams see their alerts. He sees ALL the alerts. The view is terrifying.
### Pain Points
1. **No leverage over alert quality** — He can write guidelines, but can't force teams to follow them. Alert quality is a tragedy of the commons.
2. **Manual correlation is his full-time job** — He's the human correlation engine. When an incident happens, HE connects the dots across services because no tool does it.
3. **Metrics are hard to produce** — Proving that alert noise costs money requires data he has to manually compile. Leadership wants dashboards, not spreadsheets.
4. **Tool sprawl** — His team uses Datadog for metrics, Grafana for some dashboards, PagerDuty for paging, OpsGenie for some teams that refused PagerDuty. He's managing 4 alerting surfaces.
5. **The cold start problem with every new service** — New services launch with terrible default alerts. By the time they're tuned, the team has already suffered through weeks of noise.
6. **Retention risk** — He's lost 2 engineers in the past year who cited on-call burden. Recruiting replacements took 4 months each.
### Current Workarounds
- The spreadsheet. Always the spreadsheet.
- Monthly "alert amnesty" where teams can delete alerts without judgment (attendance: poor)
- A Slack bot he hacked together that counts alerts per channel per day (it breaks constantly)
- Manually tagging alerts as "noise" or "signal" in postmortem docs
- Begging product managers to prioritize alert fixes by framing it as "developer productivity"
- Taking on-call shifts himself to "lead by example" (and to spare his junior engineers)
### Jobs to Be Done (JTBD)
- **When** I'm reviewing on-call health, **I want to** see exactly which alerts are noise and which are signal across all teams, **so I can** prioritize fixes with data instead of gut feel.
- **When** a new service launches, **I want to** automatically apply intelligent alert defaults, **so I can** prevent the cold-start noise problem.
- **When** I'm presenting to leadership, **I want to** show the business cost of alert noise (MTTR impact, engineer hours wasted, attrition risk), **so I can** get budget and priority for fixing it.
- **When** an incident is in progress, **I want to** see correlated alerts across all services and tools in one view, **so I can** guide the response team to the root cause faster.
### Day-in-the-Life: Monday
**8:00 AM** — Opens his alert metrics spreadsheet. Last week: 1,247 alerts across all teams. 89 resulted in actual incidents. That's a 7.1% signal rate. He's been tracking this for 6 months. It's getting worse, not better.
**9:30 AM** — Alert review meeting. 3 of 8 team leads show up. They review the top 10 noisiest alerts. Everyone agrees they should be fixed. Nobody commits to a timeline. Marcus assigns himself 4 of them because nobody else will.
**11:00 AM** — Gets pulled into an incident. Payment service is throwing errors. He immediately checks: was there a deploy? (Yes, 20 minutes ago.) Are other services affected? (He checks 3 dashboards to find out — yes, the downstream notification service is also erroring.) He connects the dots in 6 minutes. The on-call engineer had been looking at the notification service errors for 15 minutes without realizing the root cause was upstream.
**1:30 PM** — Writes a postmortem for last week's P1. In the "what went well / what didn't" section, he writes: "The first alert that fired was a symptom, not the cause. The causal alert fired 4 minutes later but was buried in 23 other alerts." He's written this same sentence in 11 different postmortems.
**3:00 PM** — 1:1 with his manager (Director of Engineering). Manager asks about MTTR. Marcus shows the spreadsheet. Manager says "Can you get this into a dashboard?" Marcus thinks: "With what time?"
**5:30 PM** — Reviewing PagerDuty bill. $47/seat × 40 engineers on rotation = $1,880/month. For a tool that faithfully delivers noise to people's phones at 3am. He wonders if there's something better.
**7:00 PM** — At home. Gets a Slack DM from a junior engineer: "Hey Marcus, I'm on-call tonight. Any tips for the checkout-latency alert? It fired 8 times last night for Priya." He sends her his personal runbook. He thinks about building an internal tool. Again. He opens a beer instead.
---
## Persona 3: Diana Okafor — The VP of Engineering
**Age:** 41 | **Role:** VP of Engineering | **Company:** Same Series C SaaS, 140 engineers, reports to CTO
**Slack Status:** Rarely on Slack. Lives in Google Docs and Zoom.
### Empathy Map
**SAYS:**
- "Our MTTR is 34 minutes. The board wants it under 15. What's the plan?"
- "I keep hearing about alert fatigue but I need data, not anecdotes."
- "We lost two SREs last quarter. Recruiting is taking forever. We need to fix the on-call experience."
- "I'm not going to approve another $50K tool unless someone can show me ROI in the first quarter."
- "Why are we paying Datadog $180K/year and PagerDuty $22K/year and still having these problems?"
**THINKS:**
- "On-call burnout is a retention problem disguised as a tooling problem. Or is it the other way around?"
- "If we have another major incident where the alert was missed because of noise, the CTO is going to ask me hard questions I don't have answers to."
- "Marcus keeps asking for headcount. I believe him that the team is stretched, but I need to justify it with metrics the CFO will accept."
- "The engineers complain about on-call but I don't have visibility into what's actually happening. I see incident counts and MTTR. I don't see the human cost."
- "We're spending $200K+/year on monitoring and alerting tools. Are we getting $200K of value?"
**DOES:**
- Reviews MTTR and incident count dashboards weekly (surface-level metrics that don't capture the real problem)
- Approves tool purchases based on ROI projections and vendor demos (has been burned by tools that demo well but don't deliver)
- Runs quarterly engineering satisfaction surveys — "on-call experience" has been the #1 complaint for 3 consecutive quarters
- Asks Marcus for "the alert noise number" before board meetings (Marcus scrambles to update his spreadsheet)
- Compares their incident metrics to industry benchmarks and doesn't like what she sees
- Has started mentioning "alert fatigue" in leadership meetings because she read a Gartner report about it
**FEELS:**
- **Accountability pressure:** She owns engineering productivity. If MTTR is bad, it's her problem. If engineers quit, it's her problem.
- **Information asymmetry:** She knows something is wrong but can't see the details. She's dependent on Marcus's spreadsheets and anecdotal reports.
- **Budget anxiety:** Every new tool is a line item she has to defend. The CFO questions every SaaS subscription over $10K/year.
- **Empathy (distant):** She was an engineer once. She remembers bad on-call nights. But it's been 8 years since she was in rotation, and the scale of the problem has changed.
- **Strategic concern:** Competitors are shipping faster. If her engineers are spending 30% of their cognitive energy on alert noise, that's 30% less innovation.
### Pain Points
1. **No single metric for alert health** — She has MTTR, incident count, and anecdotes. She needs a "noise score" or "alert quality index" she can track over time and present to the board.
2. **ROI of monitoring tools is unmeasurable** — She's spending $200K+/year on Datadog + PagerDuty + Grafana. She can't quantify what she's getting for that money.
3. **Attrition is expensive and invisible** — Losing an SRE costs $150-300K (recruiting + ramp + lost institutional knowledge). Alert fatigue drives attrition. But the causal chain is hard to prove to a CFO.
4. **Tool fatigue** — Her teams already use too many tools. Adding another one is a hard sell unless it REPLACES something or has undeniable, immediate value.
5. **Compliance risk** — They're in fintech. Missed alerts could mean regulatory issues. She loses sleep over this (ironic, given the product).
6. **No visibility into cross-team alert patterns** — She doesn't know that Team A's noisy alerts are causing Team B's MTTR to spike because of shared dependencies.
### Current Workarounds
- Marcus's spreadsheet (she knows it's manual and incomplete, but it's all she has)
- Quarterly "on-call health" reviews that produce action items nobody follows up on
- Throwing headcount at the problem (hiring more SREs to spread the on-call load)
- Vendor calls with PagerDuty's "customer success" team that result in no meaningful changes
- Asking engineering managers to "prioritize alert hygiene" without giving them dedicated time to do it
### Jobs to Be Done (JTBD)
- **When** I'm preparing for a board meeting, **I want to** show a clear metric for operational health that includes alert quality, **so I can** demonstrate that we're improving (or justify investment if we're not).
- **When** I'm evaluating a new tool, **I want to** see projected ROI based on our actual data within the first week, **so I can** make a fast, confident buy decision.
- **When** an engineer quits citing on-call burden, **I want to** have data showing exactly how bad their on-call experience was, **so I can** fix the systemic issue instead of just backfilling the role.
- **When** I'm allocating engineering time, **I want to** know which teams have the worst alert noise, **so I can** direct investment where it has the most impact.
### Day-in-the-Life: Wednesday
**7:30 AM** — Checks email. CTO forwarded a Gartner report: "AIOps Market to Reach $40B by 2028." Attached note: "Should we be looking at this?" She adds it to her reading list.
**9:00 AM** — Leadership standup. CTO asks about the P1 incident from last week. Diana gives the summary. CTO asks: "Why did it take 34 minutes to respond?" Diana says: "The on-call engineer was triaging other alerts when it fired." CTO's eyebrow goes up. Diana makes a mental note to talk to Marcus.
**10:30 AM** — 1:1 with Marcus. He shows her the spreadsheet: 1,247 alerts last week, 89 real incidents. She does the math: 93% noise. She asks: "Can we get this to 50% noise?" Marcus says: "Not without dedicated engineering time from every team, or a tool that does it for us." She asks him to evaluate options.
**12:00 PM** — Lunch with the Head of Recruiting. They discuss the two open SRE roles. Average time-to-fill for SREs in their market: 67 days. Cost per hire: $45K (agency fees + interview time). Diana thinks about how much cheaper it would be to just not burn out the SREs they have.
**2:00 PM** — Quarterly planning. She's trying to allocate 15% of engineering time to "platform health" but product managers are pushing back. They want features. She needs ammunition — hard data showing that alert noise is costing them feature velocity.
**4:00 PM** — Reviews the engineering satisfaction survey results. On-call experience: 2.1 out of 5. Comments include: "I dread my on-call weeks," "The alerts are mostly useless," and "I'm considering leaving if this doesn't improve." She highlights these for the CTO.
**6:30 PM** — Driving home. Thinks about the $200K monitoring bill. Thinks about the 2 engineers who left. Thinks about the 34-minute MTTR. Thinks: "There has to be a better way." Opens LinkedIn at a red light. Sees an ad for yet another AIOps platform. Closes LinkedIn.
---
# Phase 2: DEFINE 🎯
> *"The problem is never what people say it is. Priya says 'too many alerts.' Marcus says 'bad alert hygiene.' Diana says 'high MTTR.' They're all describing the same elephant from different angles. Our job is to see the whole animal."*
---
## Point-of-View Statements
A POV statement crystallizes the tension: [User] needs [need] because [insight].
### Priya (On-Call Engineer)
**Priya, a dedicated backend engineer on a weekly on-call rotation, needs to instantly distinguish real incidents from noise at 3am because her brain has been conditioned to ignore alerts — and the one time she shouldn't ignore one, she will.**
The deeper insight: Priya's problem isn't volume. It's *trust erosion*. Every false alarm trains her nervous system to stop caring. The alert system is literally conditioning her to fail at the one moment it matters most. This is a Pavlovian tragedy.
### Marcus (SRE/Platform Lead)
**Marcus, an SRE lead responsible for operational health across 8 teams, needs a way to make alert quality visible and actionable across the organization because he currently holds all the correlation knowledge in his head — and that knowledge walks out the door when he goes on vacation.**
The deeper insight: Marcus is a human AIOps engine. He IS the correlation layer. He IS the deduplication algorithm. The organization has outsourced its alert intelligence to one person's brain. That's not a process — that's a single point of failure wearing a hoodie.
### Diana (VP of Engineering)
**Diana, a VP of Engineering accountable for engineering productivity and retention, needs a single, defensible metric for alert health because she's fighting a war she can't measure — and in leadership, what you can't measure, you can't fund.**
The deeper insight: Diana's problem is *translation*. She needs to convert "Priya had a terrible night" into "$47,000 in lost productivity and attrition risk" — a language the CFO and board understand. Without that translation layer, alert fatigue remains an anecdote, not a budget line item.
---
## Key Insights
1. **Trust is the product, not technology.** The #1 barrier to adoption isn't feature gaps — it's that engineers won't trust AI to suppress alerts. One missed critical alert = permanent distrust. The trust ramp IS the product challenge.
2. **Alert fatigue is a tragedy of the commons.** No single team owns the problem. Team A's noisy service creates Team B's on-call nightmare. Without organizational visibility, everyone optimizes locally and the system degrades globally.
3. **The correlation knowledge is trapped in human brains.** Marcus knows that "DB slow + API errors + frontend timeouts = one incident." That knowledge isn't in any tool. When Marcus is unavailable, MTTR doubles because nobody else can connect the dots.
4. **Metrics exist at the wrong altitude.** Diana sees MTTR (too high-level). Priya sees individual alerts (too low-level). Nobody sees the middle layer: alert quality, noise ratios, correlation patterns, cost-per-false-alarm. This middle layer is where decisions should be made.
5. **The economic cost is real but invisible.** Alert fatigue drives attrition ($150-300K per lost SRE), inflates MTTR (3-5x longer with high noise), reduces next-day productivity (40% cognitive tax), and creates compliance risk. But nobody has a dashboard that shows this. The cost hides in plain sight.
6. **Engineers don't want another tool — they want fewer interruptions.** The bar for a new tool is astronomically high. It must live where they already are (Slack), require zero behavior change, and prove value before asking for commitment.
7. **The cold start paradox.** AI needs data to be smart. Day 1, it's dumb. But Day 1 is when you need to prove value. The solution: rule-based intelligence first (time-window clustering, deployment correlation) that works immediately, with ML that gets smarter over time.
8. **Transparency is non-negotiable.** Engineers will reject a black box. Every suppression, every grouping, every priority score must be explainable. "We suppressed this because..." is the most important sentence in the product.
---
## Core Tension
Here's the fundamental tension at the heart of dd0c/alert — the jazz dissonance that makes this product interesting:
**Engineers need FEWER alerts to do their jobs, but they're TERRIFIED of missing the one that matters.**
This is not a feature problem. This is a *psychological safety* problem. The product must simultaneously:
- **Suppress aggressively** (to deliver the 70-90% noise reduction that makes it worth buying)
- **Never miss a critical alert** (to maintain the trust that makes it usable)
- **Prove it's working** (to justify the suppression to skeptical engineers AND budget-conscious VPs)
The resolution of this tension is *graduated trust*. You don't suppress on Day 1. You SHOW what you WOULD suppress. You let the engineer confirm. You build a track record. You earn the right to act autonomously. Like a new musician sitting in with a jazz ensemble — you listen for 3 songs before you solo.
---
## How Might We (HMW) Questions
These are the creative springboards. Each one opens a design space.
### Trust & Transparency
1. **HMW** build trust with on-call engineers who've been burned by every "smart" alerting tool before?
2. **HMW** make alert suppression decisions transparent and reversible, so engineers feel in control even when AI is acting?
3. **HMW** create a "trust score" that grows over time, unlocking more autonomous behavior as the system proves itself?
### Signal vs. Noise
4. **HMW** reduce alert volume by 70%+ without ever suppressing a genuinely critical alert?
5. **HMW** help engineers distinguish "this is noise" from "this is a symptom of something real" in under 10 seconds?
6. **HMW** automatically correlate alerts across multiple monitoring tools so engineers see ONE incident instead of 47 symptoms?
### Organizational Visibility
7. **HMW** give Marcus (SRE lead) a real-time view of alert quality across all teams without requiring manual data collection?
8. **HMW** translate alert noise into business metrics (dollars, hours, attrition risk) that Diana can present to the board?
9. **HMW** create accountability for alert quality without turning it into a blame game between teams?
### Developer Experience
10. **HMW** deliver value in the first 60 seconds of setup, before the AI has learned anything?
11. **HMW** make Slack the primary interface so engineers never have to learn a new tool?
12. **HMW** provide context-rich incident summaries that eliminate the "is this real?" triage phase entirely?
### Learning & Adaptation
13. **HMW** learn from engineer behavior (acks, snoozes, ignores) without requiring explicit feedback?
14. **HMW** handle the cold-start problem so the product is useful on Day 1, not Day 30?
15. **HMW** adapt to each team's unique noise profile instead of applying one-size-fits-all rules?
### Human Cost
16. **HMW** measure and make visible the human cost of alert fatigue (sleep disruption, cognitive load, burnout)?
17. **HMW** protect junior engineers on rotation from the worst of the noise while they're still learning?
18. **HMW** turn on-call from a dreaded obligation into a manageable, even empowering experience?
---

# dd0c/alert — V1 MVP Epics
**Product:** dd0c/alert (Alert Intelligence Platform)
**Phase:** 7 — Epics & Stories
---
## Epic 1: Webhook Ingestion
**Description:** The front door of dd0c/alert. Responsible for receiving alert payloads from monitoring providers via webhooks, validating their authenticity, normalizing them into a canonical schema, and queuing them securely for the Correlation Engine. Must support high burst volume (incident storms) and guarantee zero dropped payloads.
### User Stories
**Story 1.1: Datadog Webhook Ingestion**
* **As a** Platform Engineer, **I want** to send Datadog webhooks to a unique dd0c URL, **so that** my Datadog alerts enter the correlation pipeline.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/datadog`
- Normalizes Datadog JSON (handles arrays/batched alerts) into the Canonical Alert Schema.
- Normalizes Datadog P1-P5 severities into critical/high/medium/low/info.
* **Estimate:** 3 points
**Story 1.2: PagerDuty Webhook Ingestion**
* **As a** Platform Engineer, **I want** to send PagerDuty v3 webhooks to dd0c, **so that** my PD incidents are tracked.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/pagerduty`
- Normalizes PagerDuty JSON into the Canonical Alert Schema.
* **Estimate:** 3 points
**Story 1.3: HMAC Signature Validation**
* **As a** Security Admin, **I want** all incoming webhooks to have their HMAC signatures validated, **so that** bad actors cannot inject fake alerts.
* **Acceptance Criteria:**
- Rejects payloads with missing or invalid `DD-WEBHOOK-SIGNATURE` or `X-PagerDuty-Signature` headers with 401 Unauthorized.
- Compares against the integration secret stored in DynamoDB/Secrets Manager.
* **Estimate:** 3 points
**Story 1.4: Payload Normalization & Deduplication (Fingerprinting)**
* **As an** On-Call Engineer, **I want** identical alerts to be deterministically fingerprinted, **so that** flapping or duplicated payloads are instantly recognized.
* **Acceptance Criteria:**
- Generates a SHA-256 fingerprint based on `tenant_id + provider + service + normalized_title`.
- Pushes canonical alert to SQS FIFO queue with `MessageGroupId=tenant_id`.
- Saves raw payload asynchronously to S3 for audit/replay.
* **Estimate:** 5 points
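The fingerprint rule above reduces to a few lines; the title normalization shown (lowercase, collapse whitespace) is an assumption — production rules would likely also strip timestamps and instance IDs:

```python
import hashlib

def alert_fingerprint(tenant_id: str, provider: str, service: str, title: str) -> str:
    """Deterministic SHA-256 fingerprint: flapping/duplicated alerts hash to the same ID."""
    # Normalize the title so cosmetic differences don't defeat deduplication.
    normalized_title = " ".join(title.lower().split())
    key = "|".join([tenant_id, provider, service, normalized_title])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```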
### Dependencies
- Story 1.3 depends on Tenant/Integration configuration existing (Epic 9).
- Story 1.4 depends on Canonical Alert Schema definition.
### Technical Notes
- **Infra:** API Gateway HTTP API -> Lambda -> SQS FIFO.
- Lambda must return 200 OK to the provider in <100ms. S3 raw payload storage must be non-blocking (async).
- Use ULIDs for `alert_id` so IDs are time-sortable.
## Epic 2: Correlation Engine
**Description:** The intelligence core. Consumes the normalized SQS FIFO queue, groups alerts based on time windows and service dependencies, and outputs correlated incidents.
### User Stories
**Story 2.1: Time-Window Clustering**
* **As an** On-Call Engineer, **I want** alerts firing within a brief time window for the same service to be grouped together, **so that** I don't get paged 10 times for one failure.
* **Acceptance Criteria:**
- Opens a 5-minute (configurable) correlation window in Redis when a new alert fingerprint arrives.
- Groups subsequent alerts for the same tenant/service into the active window.
- Stores the correlation state in ElastiCache Redis.
* **Estimate:** 5 points
**Story 2.2: Cascading Failure Correlation (Service Graph)**
* **As an** On-Call Engineer, **I want** cascading failures across dependent services to be merged into a single incident, **so that** I can see the blast radius of an issue.
* **Acceptance Criteria:**
- Reads explicit service dependencies from DynamoDB (`upstream -> downstream`).
- If a window is open for an upstream service, downstream service alerts are merged into the same window.
* **Estimate:** 8 points
**Story 2.3: Active Window Extension**
* **As an** On-Call Engineer, **I want** the correlation window to automatically extend if alerts are still trickling in, **so that** long-running, cascading incidents are correctly grouped.
* **Acceptance Criteria:**
- If a new alert arrives within the last 30 seconds of a window, the window extends by 2 minutes (max 15 minutes).
- Updates the `closes_at` timestamp in Redis.
* **Estimate:** 3 points
**Story 2.4: Incident Generation & Persistence**
* **As an** On-Call Engineer, **I want** completed time windows to be saved as durable incidents, **so that** I have a permanent record of the correlated event.
* **Acceptance Criteria:**
- When a window closes, it generates an Incident record in DynamoDB.
- Generates an event in TimescaleDB for trend tracking.
- Pushes a `correlation-request` to the Suggestion Engine SQS queue.
* **Estimate:** 5 points
### Dependencies
- Story 2.1 depends on Epic 1 (normalized SQS queue).
- Story 2.2 depends on a basic service dependency mapping (either config or API).
### Technical Notes
- **Infra:** ECS Fargate consuming SQS FIFO.
- Must use Redis Sorted Sets for active window management (`closes_at_epoch` as score).
- The correlation engine must be stateless (relying on Redis) so it can scale horizontally to handle incident storms.
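The open/extend lifecycle from Stories 2.1 and 2.3 can be sketched with an in-memory dict standing in for Redis (in Redis, `closes_at` would be the sorted-set score so a sweeper can pop expired windows with `ZRANGEBYSCORE`):

```python
WINDOW_SECONDS = 5 * 60        # initial correlation window (configurable)
EXTEND_SECONDS = 2 * 60        # extension when alerts are still trickling in
EXTEND_THRESHOLD = 30          # extend only if the alert lands in the final 30s
MAX_WINDOW_SECONDS = 15 * 60   # hard cap so a window can't stay open forever

def on_alert(windows: dict, key: str, now: float) -> dict:
    """Open or extend the correlation window for a (tenant, service) key."""
    w = windows.get(key)
    if w is None or now >= w["closes_at"]:
        # No active window: open a fresh one.
        windows[key] = {"opened_at": now, "closes_at": now + WINDOW_SECONDS, "alerts": 1}
        return windows[key]
    w["alerts"] += 1
    if w["closes_at"] - now <= EXTEND_THRESHOLD:
        # Alert arrived in the last 30s: extend by 2 minutes, capped at 15 total.
        cap = w["opened_at"] + MAX_WINDOW_SECONDS
        w["closes_at"] = min(w["closes_at"] + EXTEND_SECONDS, cap)
    return w
```

The function is pure state-transition logic, which is what keeps the engine itself stateless and horizontally scalable.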
## Epic 3: Noise Analysis
**Description:** The Suggestion Engine. Calculates a noise score (0-100) for correlated incidents and generates observe-only suppression suggestions. It strictly adheres to V1 constraints by *never* taking auto-action.
### User Stories
**Story 3.1: Rule-Based Noise Scoring**
* **As an** On-Call Engineer, **I want** every incident to receive a noise score based on objective data points, **so that** I have a metric to understand if this incident is likely a false positive.
* **Acceptance Criteria:**
- Calculates a 0-100 noise score when an incident is generated.
- Scores based on duplicate fingerprints (flapping), severity distribution (info vs critical), and time of day.
- Cap at 100, floor at 0.
* **Estimate:** 5 points
**Story 3.2: "Never Suppress" Safelist Execution**
* **As a** Platform Engineer, **I want** critical services (databases, billing) to be excluded from high noise scoring regardless of pattern, **so that** I never miss a genuine P1.
* **Acceptance Criteria:**
- Implements a default safelist regex (e.g., `db|rds|payment|billing`).
- Forces the noise score below 50 if the service or title matches the safelist, or if severity is critical.
* **Estimate:** 3 points
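Stories 3.1 and 3.2 combine into a scoring function like the sketch below. The individual weights are illustrative assumptions; only the structure — additive rules, a 0-100 clamp, and a safelist override that forces the score under 50 — comes from the acceptance criteria:

```python
import re

SAFELIST = re.compile(r"db|rds|payment|billing", re.IGNORECASE)

def noise_score(duplicate_count: int, severity: str, hour_utc: int,
                service: str, title: str) -> int:
    """Rule-based 0-100 noise score (weights here are placeholders, not spec)."""
    score = 0
    score += min(duplicate_count * 10, 50)            # flapping fingerprints
    score += {"info": 25, "low": 15, "medium": 5}.get(severity, 0)
    if 9 <= hour_utc <= 17:                           # business hours: likely tuning noise
        score += 10
    score = max(0, min(score, 100))
    # "Never suppress" safelist: critical services/severities are forced below
    # the suggestion threshold regardless of pattern.
    if severity == "critical" or SAFELIST.search(service) or SAFELIST.search(title):
        score = min(score, 49)
    return score
```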
**Story 3.3: Observe-Only Suppression Suggestions**
* **As an** On-Call Engineer, **I want** the system to tell me what it *would* have suppressed, **so that** I can build trust in its intelligence without risking an outage.
* **Acceptance Criteria:**
- If a noise score > 80, the system generates a `suppress` suggestion record in DynamoDB.
- Generates plain-English reasoning for the suggestion (e.g., "This pattern was resolved automatically 4 times this month.").
- `action_taken` is always hardcoded to `none` for V1.
* **Estimate:** 5 points
**Story 3.4: Incident Scoring Metrics Collection**
* **As an** Engineering Manager, **I want** the noise scores and counts to be stored as time-series data, **so that** I can view trends in our alert hygiene over time.
* **Acceptance Criteria:**
- Writes noise score, alert counts, and unique fingerprints to TimescaleDB `alert_timeseries` table.
* **Estimate:** 3 points
### Dependencies
- Story 3.1 depends on Epic 2 for Incident Generation.
- Story 3.3 depends on Epic 5 (Slack Bot) to display the suggestion.
### Technical Notes
- **Infra:** ECS Fargate consuming from the `correlation-request` SQS queue.
- Use PostgreSQL (TimescaleDB) for historical frequency lookups ("how many times has this fired in 7 days?") to inform the score.
## Epic 4: CI/CD Correlation
**Description:** Ingests deployment events and correlates them with alert storms. The "killer feature" mandated by the Party Mode board for V1 MVP, answering "did this break right after a deploy?"
### User Stories
**Story 4.1: GitHub Actions Deploy Ingestion**
* **As a** Platform Engineer, **I want** to connect my GitHub Actions deployment webhooks, **so that** dd0c/alert knows exactly when and who deployed to production.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/github`
- Validates `X-Hub-Signature-256`.
- Normalizes GHA workflow run payload into `DeployEvent` canonical schema.
- Pushes deploy event to SQS FIFO queue (`deploy-event`).
* **Estimate:** 3 points
**Story 4.2: Deploy-to-Alert Correlation**
* **As an** On-Call Engineer, **I want** an alert cluster to be automatically tagged with a recent deployment to that service, **so that** I don't waste 15 minutes checking deploy logs manually.
* **Acceptance Criteria:**
- When the Correlation Engine opens a window, it queries DynamoDB for deployments to the affected service within a configurable lookback window (default 15m for prod, 30m for staging).
- If a match is found, the deploy context (`deploy_pr`, `deploy_author`, `source_url`) is attached to the window state.
* **Estimate:** 8 points
**Story 4.3: Deploy-Weighted Noise Scoring**
* **As an** On-Call Engineer, **I want** alerts that are highly correlated with deployments to be scored as more likely to be noise (if they aren't critical), **so that** feature flags and config refreshes don't wake me up.
* **Acceptance Criteria:**
- If a deploy event is attached to an incident, boost the noise score by 15-30 points.
- Additional +5 points if the PR title matches `config` or `feature-flag`.
* **Estimate:** 2 points
### Dependencies
- Story 4.2 depends on Epic 2 (Correlation Engine) and Epic 3 (Noise Analysis).
- Service name mapping between GitHub and Datadog/PagerDuty (convention-based string matching).
### Technical Notes
- **Infra:** The Deployment Tracker runs as a module within the Correlation Engine ECS Task to avoid network latency.
- DynamoDB needs a Global Secondary Index (GSI): `tenant_id` + `service` + `completed_at` to quickly find recent deploys.
## Epic 5: Slack Bot
**Description:** The primary interface for on-call engineers. Delivers correlated incident summaries, observe-only suppression suggestions, and daily alert digests directly into Slack. Provides interactive buttons for engineers to acknowledge or validate suggestions.
### User Stories
**Story 5.1: Incident Summary Notifications**
* **As an** On-Call Engineer, **I want** to receive a single, concise Slack message when an alert storm is correlated, **so that** I don't get flooded with dozens of individual alert notifications.
* **Acceptance Criteria:**
- Bot sends a formatted Slack Block Kit message to a configured channel.
- Message groups all related alerts under a single incident title.
- Displays the total number of correlated alerts, affected services, and start time.
* **Estimate:** 5 points
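A minimal builder for the message above — the field names (`header`, `section`, `mrkdwn`, `plain_text`) follow Slack's Block Kit JSON, while the copy itself is illustrative:

```python
def incident_summary_blocks(title: str, alert_count: int, services: list[str],
                            started_at: str) -> list[dict]:
    """Build the Block Kit payload for one correlated incident summary."""
    return [
        # Header blocks only accept plain_text.
        {"type": "header", "text": {"type": "plain_text", "text": f"🚨 {title}"}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Correlated alerts:* {alert_count}"},
            {"type": "mrkdwn", "text": f"*Services:* {', '.join(services)}"},
            {"type": "mrkdwn", "text": f"*Started:* {started_at}"},
        ]},
    ]
```

The returned list is what gets posted as the `blocks` field of `chat.postMessage`; suggestion and feedback blocks (Stories 5.2, 5.3) would append to it.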
**Story 5.2: Observe-Only Suppression Suggestions in Slack**
* **As an** On-Call Engineer, **I want** the Slack message to include the system's noise score and suppression recommendation, **so that** I can evaluate its accuracy in real-time.
* **Acceptance Criteria:**
- If noise score > 80, the message includes a specific "Suggestion" block (e.g., "Would have auto-suppressed (noise score: 95/100)").
- Includes the plain-English reasoning generated in Epic 3.
* **Estimate:** 3 points
**Story 5.3: Interactive Feedback Actions**
* **As an** On-Call Engineer, **I want** to click "Good Catch" or "Bad Suggestion" on the Slack message, **so that** I can help train the noise analysis engine for future versions.
* **Acceptance Criteria:**
- Slack message includes interactive buttons for feedback.
- Clicking a button sends a payload back to dd0c/alert to record the user's validation in the database.
- Updates the Slack message to acknowledge the feedback.
* **Estimate:** 5 points
**Story 5.4: Daily Alert Digest**
* **As an** Engineering Manager, **I want** a daily summary of the noisiest services and total incidents dropped into Slack, **so that** my team can prioritize technical debt.
* **Acceptance Criteria:**
- A scheduled job runs daily at 9 AM (configurable timezone).
- Aggregates the previous 24 hours of data from TimescaleDB.
- Posts a summary of "Top 3 Noisiest Services" and "Total Time Saved" (estimated) to the channel.
* **Estimate:** 5 points
### Dependencies
- Story 5.1 depends on Epic 2 (Correlation Engine).
- Story 5.2 depends on Epic 3 (Noise Analysis).
### Technical Notes
- **Infra:** AWS Lambda for handling incoming Slack interactions (buttons) via API Gateway.
- Use Slack's Block Kit Builder for UI consistency.
- Requires storing Slack Workspace and Channel tokens securely in AWS Secrets Manager or DynamoDB.
## Epic 6: Dashboard API
**Description:** The backend REST API that powers the dd0c/alert web dashboard. Provides secure endpoints for authentication, querying historical incidents, analyzing alert volume, and managing tenant configuration.
### User Stories
**Story 6.1: Tenant Authentication & Authorization**
* **As a** Platform Engineer, **I want** to securely log in to the dashboard API, **so that** I can manage my organization's alert data safely.
* **Acceptance Criteria:**
- Implement JWT-based authentication.
- Enforce tenant isolation on all API endpoints (users can only access data for their `tenant_id`).
* **Estimate:** 5 points
**Story 6.2: Incident Query Endpoints**
* **As an** On-Call Engineer, **I want** to fetch a paginated list of historical incidents and their associated alerts, **so that** I can review past outages.
* **Acceptance Criteria:**
- `GET /v1/incidents` supports pagination, time-range filtering, and service filtering.
- `GET /v1/incidents/{incident_id}/alerts` returns the raw alerts correlated into that incident.
* **Estimate:** 5 points
**Story 6.3: Analytics & Noise Score API**
* **As an** Engineering Manager, **I want** to query aggregated metrics about alert noise and volume, **so that** I can populate charts on the dashboard.
* **Acceptance Criteria:**
- `GET /v1/analytics/noise` returns time-series data of average noise scores per service.
- Queries TimescaleDB efficiently using materialized views or continuous aggregates if necessary.
* **Estimate:** 8 points
**Story 6.4: Configuration Management Endpoints**
* **As a** Platform Engineer, **I want** to manage my integration webhooks and routing rules via API, **so that** I can script my onboarding or use the UI.
* **Acceptance Criteria:**
- CRUD endpoints for managing Slack channel destinations.
- Endpoints to generate and rotate inbound webhook secrets for Datadog/PagerDuty.
* **Estimate:** 3 points
### Dependencies
- Story 6.2 and 6.3 depend on TimescaleDB schema and data from Epics 2 and 3.
### Technical Notes
- **Infra:** API Gateway HTTP API -> AWS Lambda (Node.js/Go).
- Strict validation middleware required for tenant isolation.
- Use standard OpenAPI 3.0 specification for documentation.
## Epic 7: Dashboard UI
**Description:** The React Single Page Application (SPA) for dd0c/alert. Gives users a visual interface to view the incident timeline, inspect alert correlation details, and understand the noise scoring.
### User Stories
**Story 7.1: Incident Timeline View**
* **As an** On-Call Engineer, **I want** a main feed showing all correlated incidents chronologically, **so that** I can see the current state of my systems at a glance.
* **Acceptance Criteria:**
- React SPA fetches and displays data from `GET /v1/incidents`.
- Visual distinction between high-noise (suggested suppressed) and low-noise (critical) incidents.
- Real-time updates or auto-refresh every 30 seconds.
* **Estimate:** 8 points
**Story 7.2: Alert Correlation Visualizer**
* **As an** On-Call Engineer, **I want** to click on an incident and see exactly which alerts were grouped together, **so that** I understand why the engine correlated them.
* **Acceptance Criteria:**
- Detail pane showing the timeline of individual alerts within the incident window.
- Displays the deployment context (Epic 4) if applicable.
* **Estimate:** 5 points
**Story 7.3: Noise Score Breakdown**
* **As a** Platform Engineer, **I want** to see the exact factors that contributed to an incident's noise score, **so that** I can trust the engine's reasoning.
* **Acceptance Criteria:**
- UI component displaying the 0-100 noise score gauge.
- Lists the bulleted reasoning (e.g., "+20 points: Occurred 10 times this week", "+15 points: Recent deployment").
* **Estimate:** 3 points
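
The gauge value and the bulleted reasoning can come from one API-side function; a sketch under the assumption that scoring factors arrive as (label, points) pairs:

```python
def noise_score(factors: list[tuple[str, int]]) -> tuple[int, list[str]]:
    """Sum weighted factors into a 0-100 score plus per-factor reasoning."""
    total = sum(points for _, points in factors)
    reasons = [f"+{points} points: {label}" for label, points in factors]
    return min(total, 100), reasons  # clamp so the gauge never exceeds 100
```

For the example in the criteria, `noise_score([("Occurred 10 times this week", 20), ("Recent deployment", 15)])` yields a score of 35 plus the two bulleted strings the UI lists.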
**Story 7.4: Analytics Dashboard**
* **As an** Engineering Manager, **I want** charts showing alert volume and noise trends over the last 30 days, **so that** I can track improvements in our alert hygiene.
* **Acceptance Criteria:**
- Integrates a charting library (e.g., Recharts or Chart.js).
- Displays a bar chart of total alerts vs. correlated incidents to show "noise reduction" value.
* **Estimate:** 5 points
### Dependencies
- Depends entirely on Epic 6 (Dashboard API).
### Technical Notes
- **Infra:** Hosted on AWS S3 + CloudFront or Vercel.
- Framework: React (Next.js or Vite).
- Tailwind CSS for rapid styling.
## Epic 8: Infrastructure & DevOps
**Description:** The foundational cloud infrastructure and deployment pipelines necessary to run dd0c/alert reliably, securely, and with observability.
### User Stories
**Story 8.1: Infrastructure as Code (IaC)**
* **As a** Developer, **I want** all AWS resources defined in code, **so that** I can easily spin up staging and production environments identically.
* **Acceptance Criteria:**
- Terraform or AWS CDK defines VPC, API Gateway, Lambda functions, ECS Fargate clusters, SQS queues, and DynamoDB tables.
- State is stored securely in an S3 backend with DynamoDB locking.
* **Estimate:** 8 points
**Story 8.2: CI/CD Pipelines**
* **As a** Developer, **I want** automated testing and deployment when I push to main, **so that** I can ship features quickly without manual steps.
* **Acceptance Criteria:**
- GitHub Actions workflow runs unit tests and linters on PRs.
- Merges to `main` trigger a deployment to the staging environment, followed by a manual approval for production.
* **Estimate:** 5 points
**Story 8.3: System Monitoring & Logging**
* **As a** System Admin, **I want** central logging and metrics for the dd0c/alert services, **so that** I can debug issues when the platform itself fails.
* **Acceptance Criteria:**
- All Lambda and ECS logs route to CloudWatch Logs.
- CloudWatch Alarms configured for API 5xx errors and SQS Dead Letter Queue (DLQ) messages.
* **Estimate:** 3 points
**Story 8.4: Database Provisioning (Timescale & Redis)**
* **As a** Database Admin, **I want** managed, highly available instances for TimescaleDB and Redis, **so that** the correlation engine runs with low latency and durable storage.
* **Acceptance Criteria:**
- Provisions AWS ElastiCache for Redis (for active window state).
- Provisions RDS for PostgreSQL with TimescaleDB extension, or uses Timescale Cloud.
* **Estimate:** 5 points
### Dependencies
- Blocked until architectural decisions are finalized.
- Blocks Epics 1, 2, 3 from being deployed to production.
### Technical Notes
- Optimize for Solo Founder: Keep infrastructure simple. Managed services over self-hosted.
- Ensure appropriate IAM roles with least privilege access between Lambda/ECS and DynamoDB/SQS.
## Epic 9: Onboarding & PLG
**Description:** Product-Led Growth and the critical 60-second time-to-value flow. Ensures a frictionless setup experience for new users to connect their monitoring tools and Slack workspace immediately.
### User Stories
**Story 9.1: Frictionless Sign-Up**
* **As a** New User, **I want** to sign up using my GitHub or Google account, **so that** I don't have to create and remember a new password.
* **Acceptance Criteria:**
- Implement OAuth2 login (GitHub/Google).
- Automatically provisions a new `tenant_id` and default configuration upon successful first login.
* **Estimate:** 5 points
**Story 9.2: Webhook Setup Wizard**
* **As a** New User, **I want** a step-by-step wizard to configure my Datadog or PagerDuty webhooks, **so that** I can start sending data to dd0c/alert immediately.
* **Acceptance Criteria:**
- UI wizard provides copy-paste ready webhook URLs and secrets.
- Includes a "Waiting for first payload..." state that updates in real-time via WebSockets or polling when the first alert arrives.
* **Estimate:** 8 points
**Story 9.3: Slack App Installation Flow**
* **As a** New User, **I want** a 1-click "Add to Slack" button, **so that** I can authorize dd0c/alert to post in my incident channels.
* **Acceptance Criteria:**
- Implements the standard Slack OAuth v2 flow.
- Allows the user to select the default channel for incident summaries.
* **Estimate:** 5 points
**Story 9.4: Free Tier Limitations**
* **As a** Product Owner, **I want** a free tier that limits the number of processed alerts or retention period, **so that** users can try the product without me incurring massive AWS costs.
* **Acceptance Criteria:**
- Free tier limits enforced at the ingestion API (e.g., max 10,000 alerts/month).
- UI displays a usage quota bar.
- Data retention in TimescaleDB automatically purged after 7 days for free tier tenants.
* **Estimate:** 5 points
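
Quota enforcement at the ingestion API reduces to a cheap pre-check; a minimal sketch, assuming a per-tenant monthly counter is already tracked (the 10,000 limit comes from the acceptance criteria, the names are illustrative):

```python
FREE_TIER_MONTHLY_LIMIT = 10_000  # alerts/month, per the acceptance criteria


def admit_alert(used_this_month: int, plan: str) -> tuple[bool, float]:
    """Return (accept?, quota fraction for the UI usage bar)."""
    if plan != "free":
        return True, 0.0  # paid tiers are not capped here
    fraction = min(used_this_month / FREE_TIER_MONTHLY_LIMIT, 1.0)
    return used_this_month < FREE_TIER_MONTHLY_LIMIT, fraction
```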
### Dependencies
- Depends on Epic 6 (Dashboard API) and Epic 7 (Dashboard UI).
- Story 9.2 depends on Epic 1 (Webhook Ingestion) being live.
### Technical Notes
- Use Auth0, Clerk, or AWS Cognito to minimize authentication development time for the Solo Founder.
- Real-time "Waiting for payload" can be implemented via a lightweight polling endpoint if WebSockets add too much complexity.
---
## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/alert adheres to the 5 Transparent Factory tenets. For an alert intelligence platform, Semantic Observability is paramount — a tool that reasons about alerts must make its own reasoning fully transparent.
### Story 10.1: Atomic Flagging — Feature Flags for Correlation & Scoring Rules
**As a** solo founder, **I want** every new correlation rule, noise scoring algorithm, and suppression behavior behind a feature flag (default: off), **so that** a bad scoring change doesn't silence critical alerts in production.
**Acceptance Criteria:**
- OpenFeature SDK integrated into the alert processing pipeline. V1: env-var or JSON file provider.
- All flags evaluate locally — no network calls in the alert ingestion hot path.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule suppresses >2x the baseline alert volume over 30 minutes, the flag auto-disables and all suppressed alerts are re-emitted.
- Flags required for: new correlation patterns, CI/CD deployment correlation, noise scoring thresholds, notification channel routing.
**Estimate:** 5 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- Circuit breaker is critical here — a bad suppression rule is worse than no suppression. Track suppression counts per flag in Redis with 30-min sliding window.
- Re-emission: suppressed alerts buffered in a dead-letter queue for 1 hour. On circuit break, replay the queue.
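
The circuit breaker above can be modeled as a sliding window of suppression timestamps; this sketch uses an in-process deque where production would use the Redis window described in the notes (class and parameter names are assumptions):

```python
import time
from collections import deque

WINDOW_SECONDS = 30 * 60  # the 30-minute sliding window from the criteria


class SuppressionBreaker:
    """Trip when a flagged rule suppresses >2x its baseline in the window."""

    def __init__(self, baseline_per_window: int, now=time.time):
        self.baseline = baseline_per_window
        self.events: deque = deque()  # Redis sorted set in production
        self.tripped = False
        self.now = now

    def record_suppression(self) -> bool:
        t = self.now()
        self.events.append(t)
        while self.events and self.events[0] < t - WINDOW_SECONDS:
            self.events.popleft()  # drop suppressions outside the window
        if len(self.events) > 2 * self.baseline:
            # Flag auto-disables; buffered alerts replay from the DLQ.
            self.tripped = True
        return self.tripped
```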
### Story 10.2: Elastic Schema — Additive-Only for Alert Event Store
**As a** solo founder, **I want** all alert event schema changes to be strictly additive, **so that** historical alert correlation data remains queryable after any deployment.
**Acceptance Criteria:**
- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use `_v2` suffix for breaking changes. Old fields remain readable.
- All event parsers configured to ignore unknown fields (Pydantic `model_config = {"extra": "ignore"}` or equivalent).
- Dual-write during migration windows within the same DB transaction.
- Every migration includes `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
**Estimate:** 3 points
**Dependencies:** Epic 3 (Event Store)
**Technical Notes:**
- Alert events are append-only by nature — leverage this. Never mutate historical events.
- For correlation metadata (enrichments added post-ingestion), store as separate linked records rather than mutating the original event.
- TimescaleDB compression policies must handle both V1 and V2 column layouts.
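
The CI gate from the acceptance criteria amounts to a pattern scan over migration files; a simplified sketch (a real check would also cover DynamoDB attribute changes and SQL dialect variations):

```python
import re

# Destructive DDL that the Elastic Schema tenet forbids on existing columns.
FORBIDDEN = re.compile(
    r"\b(DROP\s+(COLUMN|TABLE)"
    r"|ALTER\s+COLUMN\s+\w+\s+(SET\s+DATA\s+)?TYPE"
    r"|RENAME\s+(COLUMN|TO))\b",
    re.IGNORECASE,
)


def lint_migration(sql: str) -> list[str]:
    """Return the destructive statements found; CI fails if non-empty."""
    return [m.group(0) for m in FORBIDDEN.finditer(sql)]
```

Additive statements like `ADD COLUMN payload_v2` pass untouched, which is exactly the behavior the tenet wants.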
### Story 10.3: Cognitive Durability — Decision Logs for Scoring Logic
**As a** future maintainer, **I want** every change to noise scoring weights, correlation rules, or suppression thresholds accompanied by a `decision_log.json`, **so that** I can understand why alert X was classified as noise vs. signal.
**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/correlation/`, or `src/suppression/`.
- Cyclomatic complexity cap of 10 enforced in CI. Scoring functions must be decomposable and testable.
- Decision logs in `docs/decisions/`, one per significant logic change.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- Scoring weight changes are especially important to document — "why is deployment correlation weighted 0.7 and not 0.5?"
- Include sample alert scenarios in decision logs showing before/after scoring behavior.
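
The CI check for decision logs is small; a sketch assuming the schema listed in the acceptance criteria:

```python
import json

# Keys required by the decision_log.json schema in the acceptance criteria.
REQUIRED_KEYS = {"prompt", "reasoning", "alternatives_considered",
                 "confidence", "timestamp", "author"}


def validate_decision_log(raw: str) -> list[str]:
    """Return any missing keys; CI blocks the PR when the list is non-empty."""
    return sorted(REQUIRED_KEYS - json.loads(raw).keys())
```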
### Story 10.4: Semantic Observability — AI Reasoning Spans on Alert Classification
**As an** on-call engineer investigating a missed critical alert, **I want** every alert scoring and correlation decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why an alert was scored as noise when it was actually a P1 incident.
**Acceptance Criteria:**
- Every alert ingestion creates a parent `alert_evaluation` span. Child spans for `noise_scoring`, `correlation_matching`, and `suppression_decision`.
- Span attributes: `alert.source`, `alert.noise_score`, `alert.correlation_matches` (JSON array), `alert.suppressed` (bool), `alert.suppression_reason`.
- If AI-assisted classification is used: `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score`, `ai.reasoning_chain` (summarized).
- CI/CD correlation spans include: `alert.deployment_correlation_score`, `alert.deployment_id`, `alert.time_since_deploy_seconds`.
- No PII in spans. Alert payloads are hashed for correlation, not logged raw.
**Estimate:** 3 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- This is the most important tenet for dd0c/alert. If the tool suppresses an alert, the reasoning MUST be traceable.
- Use `opentelemetry-python` with OTLP exporter. Batch span export to avoid per-alert overhead.
- For V1 without AI: `alert.suppression_reason` is the rule name + threshold. When AI scoring is added, the full reasoning chain is captured.
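
Independent of the OpenTelemetry wiring, the span attribute set can be built as plain data; a sketch of the attributes named in the acceptance criteria, hashing the payload rather than logging it (the `alert.payload_hash` attribute name is an assumption, not from the spec):

```python
import hashlib
import json


def alert_span_attributes(alert: dict, noise_score: int, matches: list,
                          suppressed: bool, reason: str) -> dict:
    """Attributes for the `alert_evaluation` span; the raw payload is hashed,
    never logged, so spans stay free of PII."""
    payload_hash = hashlib.sha256(
        json.dumps(alert.get("payload", {}), sort_keys=True).encode()
    ).hexdigest()
    return {
        "alert.source": alert["source"],
        "alert.payload_hash": payload_hash,  # assumed attribute name
        "alert.noise_score": noise_score,
        "alert.correlation_matches": json.dumps(matches),
        "alert.suppressed": suppressed,
        "alert.suppression_reason": reason,
    }
```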
### Story 10.5: Configurable Autonomy — Governance for Alert Suppression
**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/alert can auto-suppress alerts or only annotate them, **so that** customers never lose visibility into their alerts without explicit opt-in.
**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (annotate-only, never suppress) or `audit` (auto-suppress with full logging).
- Default for all new customers: `strict`. Suppression requires explicit opt-in.
- `panic_mode`: when true, all suppression stops immediately. Every alert passes through unmodified. A "panic active" banner appears in the dashboard.
- Per-customer governance override: customers can only be MORE restrictive than system default.
- All policy decisions logged with full context: "Alert X suppressed by audit mode, rule Y, score Z" or "Alert X annotation-only, strict mode active".
**Estimate:** 3 points
**Dependencies:** Epic 4 (Notification Router)
**Technical Notes:**
- `strict` mode is the safe default — dd0c/alert adds value even without suppression by annotating alerts with correlation data and noise scores.
- Panic mode: single Redis key `dd0c:panic`. All suppression checks short-circuit on this key. Triggerable via `POST /admin/panic` or env var.
- Customer override: stored in org settings. Merge: `max_restrictive(system, customer)`.
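
The merge rule and panic short-circuit are a few lines each; a sketch assuming the two modes are ordered by restrictiveness (helper names are illustrative):

```python
RESTRICTIVENESS = {"audit": 0, "strict": 1}  # higher value = more restrictive


def effective_mode(system_default: str, customer_override=None) -> str:
    """Customers may only tighten the governance mode, never loosen it."""
    candidates = [system_default] + ([customer_override] if customer_override else [])
    return max(candidates, key=RESTRICTIVENESS.__getitem__)


def may_suppress(mode: str, panic_active: bool) -> bool:
    # The panic flag (`dd0c:panic` in Redis) short-circuits every check.
    if panic_active:
        return False
    return mode == "audit"
```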
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |

---
# 🎉 dd0c/alert — Party Mode Advisory Board
**Product:** Alert Intelligence Layer (dd0c/alert)
**Date:** 2026-02-28
**Format:** BMad Creative Intelligence Suite - "Party Mode"
---
## Round 1: INDIVIDUAL REVIEWS
**1. The VC (Pattern-Matching Machine)**
* **Excites me:** The wedge. Entering via a webhook bypassing enterprise procurement is a beautiful PLG motion. The $19/seat price point makes it an individual contributor expense swipe. Total addressable market for AIOps is massive.
* **Worries me:** The moat. Sentence-transformers and time-window clustering are commodities now. What's stopping incident.io from adding this to their $16/seat tier tomorrow? What's stopping PagerDuty from dropping their AIOps add-on price?
* **Vote:** CONDITIONAL GO. (Prove you can get 500 teams before the incumbents wake up).
**2. The CTO (20 Years in Infrastructure)**
* **Excites me:** Cross-tool correlation. I have teams on Datadog, teams on Prometheus, and everyone routes to PagerDuty. A centralized intelligence layer that sees the whole topology is a holy grail for reducing MTTR.
* **Worries me:** The "Black Box" of AI. The moment this thing auto-suppresses a critical database failover alert because it "looked like a routine spike," I'm firing the vendor. "Explainability" is easy to put on a slide, incredibly hard to engineer reliably.
* **Vote:** CONDITIONAL GO. (Needs a strict, default-deny "Trust Ramp" and zero auto-suppression in V1).
**3. The Bootstrap Founder (Solo SaaS Veteran)**
* **Excites me:** The unbundling play. You don't need to build a whole incident management platform. You're just building a webhook processor with a Slack bot. That is 100% shippable by a solo dev in 30 days. $19/seat means 500 seats (like 25 mid-sized teams) gets you to $10K MRR.
* **Worries me:** Support burden. When a webhook drops at 4am, you're the one getting paged. Can a solo founder maintain 99.99% uptime on an alert ingestion pipeline while also doing marketing and sales?
* **Vote:** GO. (The math works. Keep the scope ruthlessly small).
**4. The On-Call Engineer (Drowning in 3am Pages)**
* **Excites me:** Finally, someone acknowledges the human cost! The "Noise Report Card" and the idea of translating my 3am suffering into a dollar metric for my VP is brilliant. Also, the deploy correlation—if you can just tell me "this broke right after PR #452," you've saved me 15 minutes of digging.
* **Worries me:** Trusting it. I've used PagerDuty's "Intelligent Alert Grouping" and it routinely groups unrelated things or misses obvious correlations. If I have to double-check the AI's work, it's just adding cognitive load, not removing it.
* **Vote:** CONDITIONAL GO. (Only if it's strictly "suggest-only" until I explicitly train it to auto-suppress).
**5. The Contrarian (The Blind-Spot Finder)**
* **Excites me:** The fact that everyone is so focused on the AI. That means the real value is actually the low-tech stuff: webhook unification, slack-native UI, and basic time-window grouping.
* **Worries me:** You're all treating "alert fatigue" like a software problem. It's an organizational problem. Companies have noisy alerts because their engineering culture is broken and they don't prioritize technical debt. Putting an AI band-aid over a broken culture just gives them permission to keep writing terrible code with bad thresholds.
* **Vote:** NO-GO. (You're a painkiller for a disease that requires surgery. They'll eventually churn when they realize they still don't know how their systems work).
---
## Round 2: CROSS-EXAMINATION
**The VC:** So, Priya... I mean, On-Call Engineer. I see you're suffering. That's great, pain sells. But will your boss actually pay $19/seat for this, or will she just tell you to keep muting channels? Honestly, $19 feels too cheap for a critical B2B tool, but too high if you can't even get budget approval.
**The On-Call Engineer:** $19 a month is two overpriced coffees in San Francisco. My VP of Engineering spends $180K a year on Datadog alone, and we still have a 34-minute MTTR. If I can take the "Noise Report Card," drop it on her desk, and say, "this tool will give us back 40 engineering hours a week," she'll swipe the card. But she won't pay $50K for BigPanda. We're a 140-person engineering org.
**The VC:** That's fair. But why wouldn't Datadog just bundle this? They already ingest your metrics.
**The On-Call Engineer:** Because we don't just use Datadog! We use Grafana, OpsGenie, CloudWatch... Datadog can't see the alerts Grafana is throwing. We need something that sits *across* all of them.
**The Bootstrap Founder:** Let me jump in on that VC pessimism. You're worried about moats and BigPanda. I look at this and see a textbook unbundling play. I don't need to build a $50M/year business. If I hit 500 teams at 20 seats, that's $190K MRR. One guy. Almost 100% margins.
**The VC:** $190K MRR with 10,000 active webhooks firing constantly? As a solo founder? One AWS outage and your whole "alert intelligence layer" goes down. If you're the single point of failure for an SRE team's 3am pages, you are going to get sued into oblivion when you miss a P1.
**The Bootstrap Founder:** That's why the architecture is an *overlay*. We don't replace their PagerDuty webhooks. We sit parallel or upstream. If our ingestion goes down, the fallback is their raw, noisy alert stream. They're no worse off than they were yesterday!
**The CTO:** Hold on. Let's talk about the actual tech. Contrarian, you called this a "band-aid." But let's be real: I've spent 20 years fighting alert hygiene. Every company's culture is "broken" by your definition. Microservices mean no single team understands the whole topology anymore. AI correlation isn't a band-aid, it's the only way to synthesize 500 microservices throwing errors at once.
**The Contrarian:** Synthesis is fine. *Suppression* is the problem. You're putting a black-box LLM in charge of deciding if an alert is real. "Oh, the embedding similarity score is 0.95, it must be the same issue." No, CTO! What if the payment gateway fails *at the exact same time* a frontend deploy goes out? Your "smart" AI correlates them, suppresses the payment alert as a "deploy symptom," and you lose $400K in an hour.
**The CTO:** Which is why the "Trust Ramp" is the only way I'd buy this. V1 cannot auto-suppress. Period. It has to say, "Hey, I grouped these 14 alerts into 1 incident. Thumbs up or thumbs down?" It needs to earn my trust before it ever gets permission to mute a single payload.
**The Contrarian:** But if it doesn't auto-suppress, it hasn't solved the 3am problem! Priya is still getting woken up to press "thumbs down" on a bad grouping! You've just replaced "alert fatigue" with "AI grading fatigue."
**The On-Call Engineer:** Honestly? I'd take grading fatigue over raw alerts. If my phone wakes me up, and instead of 14 separate pages I see *one* grouped incident with a "suspected cause: Deploy #4521" tag... I can go back to sleep in 30 seconds instead of spending 15 minutes correlating it manually.
**The VC:** But where is the retention? Once they spend 6 months using your tool to figure out which alerts are noise, won't they just go fix the underlying alerts and then cancel their $19/seat subscription? You're training them to not need you!
**The Bootstrap Founder:** Have you ever met a software engineer? They will *never* fix the underlying alerts. They'll just keep writing new microservices with new noisy default thresholds. The alert hygiene problem is a treadmill. We're selling them a permanent personal trainer.
---
## Round 3: STRESS TEST
### Threat 1: PagerDuty Ships Native AI Correlation (They're already working on it)
* **The VC Attacks:** PagerDuty is $430M ARR, heavily funded, and literally building this right now into their AIOps tier. If they bundle cross-tool correlation into their enterprise plans or drop the price for the mid-market, your $19/seat standalone tool is dead in the water. Why would anyone pay for a separate ingestion layer?
* **Severity:** 8/10
* **Mitigation (The CTO & Founder):** PagerDuty's strength is its cage. They only deeply correlate what runs *through* PagerDuty. dd0c/alert sits upstream of OpsGenie, Datadog, Grafana Cloud, and Slack natively. Second, our $19/seat price makes us a rounding error. PagerDuty's AIOps is an expensive, clunky add-on. We build for the mid-market who can't justify doubling their PagerDuty bill.
* **Pivot Option:** Double down on *cross-tool visualization* and deployment correlation inside Slack. If they improve grouping, we pivot harder into becoming the "incident context brain" connecting GitHub/CI to PagerDuty.
### Threat 2: AI Suppresses a Critical Alert (The "Outage Liability" Scenario)
* **The On-Call Engineer Attacks:** Month 3. The system gets cocky. A database connection pool exhaust error fires during a routine frontend deploy. The AI thinks, "Ah, deploy noise," and suppresses it. We are down for 4 hours. My VP rips out dd0c/alert the next morning and writes a furious blog post. The company's reputation dies instantly.
* **Severity:** 10/10 (Existential)
* **Mitigation (The Contrarian & CTO):** V1 has ZERO auto-suppression out of the box. The "Trust Ramp" is non-negotiable. We only auto-suppress when specific, user-confirmed correlation templates reach 99% accuracy. Even then, we have a hard-coded "never suppress" safelist (e.g., specific tags like `sev1`, `database`, `billing`). Finally, provide an "Audit Trail" so transparent that even if it *does* make a mistake, the team sees exactly why, and can fix the rule in 5 seconds.
* **Pivot Option:** Drop auto-suppression entirely if the market rejects it. Pivot to pure "Alert Grouping & Context Synthesis" inside Slack. Just grouping 47 pages into 1 reduces the 3am panic significantly, without the liability of muting anything.
### Threat 3: Enterprises Won't Send Alert Data to a 3rd Party
* **The CTO Attacks:** My CISO will never approve sending raw Datadog metrics and error payloads to a solo founder's SaaS app. That data contains user IDs, stack traces, and API keys leaked by juniors. SOC2 takes a year and $50K.
* **Severity:** 7/10
* **Mitigation (The Bootstrap Founder):** We aren't selling to enterprises with 12-page vendor procurement checklists. We are targeting 40-engineer Series B startups where Marcus the SRE can just plug in a webhook on Friday afternoon. For the security-conscious mid-market, we offer "Payload Stripping" mode: the webhook agent runs locally or they configure Datadog to *only* send us metadata (source, timestamp, severity, alert name), stripping the raw logs.
* **Pivot Option:** Open-source the correlation engine (the ingestion worker). Customers run `dd0c-worker` in their own VPC, which computes the ML embeddings locally and only sends anonymous hashes and timing data to our SaaS dashboard.
---
## Round 4: FINAL VERDICT
**The Panel Convenes:**
The room is thick with tension. The On-Call Engineer is scrolling blindly through PagerDuty notifications out of muscle memory. The CTO is drawing network diagrams on the whiteboard. The Bootstrap Founder is checking Stripe. The VC is checking Twitter. The Contrarian is just shaking their head.
**The Decision:**
SPLIT DECISION (4-1 GO).
The Contrarian holds out: "You're selling a painkiller to an organizational culture problem. But I admit, people buy painkillers."
**Revised Priority in the dd0c Lineup:**
This is the wedge. While dd0c/run is the ultimate value, you can't auto-remediate what you can't intelligently correlate. dd0c/alert MUST launch first or simultaneously with dd0c/run. It is the "brain" that feeds the "hands."
**Top 3 Must-Get-Right Items:**
1. **The 60-Second "Wow":** The moment the webhook is pasted, the Slack bot needs to group 50 noisy alerts into 5 actionable incidents. Immediate ROI.
2. **The "Trust Ramp" (No Auto-Suppress in V1):** Engineers must explicitly opt-in to suppression rules. Show what *would* have been suppressed and let them confirm it.
3. **Deployment Correlation:** Pulling CI/CD webhooks to say "This alert spike started 2 minutes after PR #1042 was merged" is the killer feature that none of the legacy AIOps tools do gracefully out of the box.
**The One Kill Condition:**
If the product cannot achieve a verifiable 50% noise reduction for 10 paying beta teams within 90 days without a single false negative (a real alert suppressed or missed), kill the product or strip it back to a pure Slack formatting tool.
**FINAL VERDICT:**
**🟢 GO.**
Alert fatigue is an epidemic. The incumbents are too fat to sell to the mid-market at $19/seat. The PLG webhook motion is pristine. Build the wedge, earn the trust, and then sell them the runbooks. Go build it.

---
# dd0c/alert — Product Brief
### AI-Powered Alert Intelligence for Engineering Teams
**Version:** 1.0 | **Date:** 2026-02-28 | **Author:** dd0c Product | **Status:** Phase 5 — Product Brief
---
## 1. EXECUTIVE SUMMARY
### Elevator Pitch
dd0c/alert is an AI-powered alert intelligence layer that sits upstream of your existing monitoring stack — PagerDuty, OpsGenie, Datadog, Grafana — correlating, deduplicating, and contextualizing alerts across all tools via a single webhook. Slack-first. $19/seat/month. Prove value in 60 seconds.
### Problem Statement
Alert fatigue is an epidemic hiding in plain sight.
The average on-call engineer at a mid-size company receives **4,000+ alerts per month**. Industry data consistently shows **70-90% are non-actionable** — duplicate symptoms, transient spikes, deploy artifacts, and orphaned monitors nobody owns. The consequences are measurable and severe:
- **MTTR inflation:** Engineers spend the first 8-15 minutes of every incident determining if it's real, manually correlating across dashboards, and checking deploy logs. Average MTTR at affected orgs: 34 minutes vs. a 15-minute industry benchmark.
- **Attrition:** On-call satisfaction scores average 2.1/5 at companies with high alert noise. Replacing a single SRE costs $150-300K (recruiting, ramp, lost institutional knowledge). Alert burden is now cited as a top-3 reason for SRE attrition.
- **Invisible cost:** A 140-engineer org with 93% alert noise wastes an estimated 40+ engineering hours per week on false-alarm triage — roughly $300K/year in loaded salary, with zero feature output to show for it.
- **Trust erosion:** Every false alarm trains engineers to ignore alerts. The system conditions its operators to fail at the one moment it matters most — a Pavlovian tragedy playing out nightly across thousands of on-call rotations.
No mid-market solution exists today. BigPanda charges $50K-$500K/year and requires 6-month deployments. PagerDuty's AIOps is locked to PagerDuty-only alerts at $41-59/seat on top of base platform costs. incident.io's alert features are shallow. The 150,000+ engineering teams with 20-500 engineers are completely underserved.
### Solution Overview
dd0c/alert is a cross-tool alert intelligence layer deployed via webhook in under 5 minutes:
1. **Ingest** — Accepts alert webhooks from any monitoring tool (Datadog, Grafana, PagerDuty, OpsGenie, CloudWatch, Prometheus Alertmanager). No agents, no SDKs, no credentials.
2. **Correlate** — Groups related alerts using time-window clustering, service-dependency mapping, and CI/CD deployment correlation. V1 is rule-based; V2 adds ML-based semantic deduplication via sentence-transformer embeddings.
3. **Contextualize** — Enriches each correlated incident with deployment context ("started 2 minutes after PR #1042 merged to payment-service"), affected service topology, historical resolution patterns, and linked runbooks.
4. **Surface** — Delivers grouped, context-rich incident cards to Slack with thumbs-up/down feedback buttons. Engineers see 5 incidents instead of 47 raw alerts.
5. **Learn** — Every ack, snooze, override, and feedback signal trains the model. The system gets smarter with every on-call shift.
**V1 is strictly observe-and-suggest.** No auto-suppression. The system shows what it *would* suppress and lets engineers confirm. Trust is earned through a graduated "Trust Ramp," not assumed.
### Target Customer
**Primary:** Series A-C startups and mid-market companies with 20-200 engineers, running microservices on Kubernetes, using 2+ monitoring tools, with painful on-call rotations. The champion is the SRE lead or senior platform engineer (28-38 years old, 5-10 years experience) who can add a webhook integration without VP approval.
**Secondary:** The VP of Engineering who needs a defensible metric for alert health to present to the board, justify tooling spend, and address attrition driven by on-call burden.
**Anti-ICP:** Enterprises with 500+ engineers requiring SOC2 on Day 1, companies using only one monitoring tool, companies without on-call rotations, companies already running BigPanda.
### Key Differentiators
| Differentiator | Why It Matters |
|---|---|
| **Cross-tool correlation** | The only mid-market product purpose-built to correlate alerts across Datadog + Grafana + PagerDuty + OpsGenie simultaneously. PagerDuty only sees PagerDuty. Datadog only sees Datadog. dd0c/alert sees everything. |
| **60-second time to value** | Paste a webhook URL → see grouped incidents in Slack within 60 seconds. BigPanda takes 6 months. This isn't incremental — it's a category shift. |
| **CI/CD deployment correlation** | Automatic "this alert spike started after deploy X" tagging. The single most valuable piece of context during incident triage, and no legacy AIOps tool does it gracefully for the mid-market. |
| **Transparent, explainable decisions** | Every grouping and suppression decision is logged with plain-English reasoning. No black boxes. Engineers can audit, override, and learn from every decision. |
| **Observe-and-suggest Trust Ramp** | V1 never auto-suppresses. The system earns autonomy through demonstrated accuracy, graduating from observe → suggest-and-confirm → auto-suppress only with explicit engineer opt-in. |
| **$19/seat pricing** | 1/3 to 1/100th the cost of alternatives. Below the "just expense it" threshold ($380/month for a 20-person team). Below the "build internally" threshold (one engineer-day costs more than a year of dd0c/alert for a small team). |
| **Overlay architecture** | Doesn't replace anything. Sits on top of existing tools. Zero-risk adoption: remove the webhook and your existing pipeline is untouched. |
---
## 2. MARKET OPPORTUNITY
### Market Sizing
| Segment | Size | Methodology |
|---|---|---|
| **TAM** | **$5.3B-$16.4B** | Global AIOps market (2024-2025). Alert intelligence/correlation represents ~25-30% = $1.3B-$4.9B. Growing at 17-30% CAGR depending on analyst (Fortune Business Insights, GM Insights, Mordor Intelligence). |
| **SAM** | **~$800M** | Companies with 20-500 engineers, using 2+ monitoring tools, experiencing alert fatigue, willing to adopt SaaS. ~150,000-200,000 such companies globally (Series A through mid-market). Average potential spend: $4,000-$6,000/year at dd0c/alert's price point. |
| **SOM** | **$1.7M-$9.1M ARR (Year 1-2)** | Year 1: 200-500 paying teams × 15 avg seats × $19/seat × 12 months = $684K-$1.71M ARR. Year 2 with expansion: $3M-$9.1M ARR. Bootstrappable without venture capital. |
**The math that matters:** 500 teams × 15 seats × $19/seat × 12 months = $1.71M ARR. At 2,000 teams × 20 seats = $9.12M ARR. The PLG motion and low friction make volume achievable at this price point.
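The SOM bounds above can be reproduced with a few lines of arithmetic (figures taken from the table; the cohort sizes are the document's own assumptions):

```python
def arr(teams: int, seats: int, price_per_seat: int, months: int = 12) -> int:
    """Annual recurring revenue for a cohort of paying teams at a flat per-seat price."""
    return teams * seats * price_per_seat * months

year1_low  = arr(200, 15, 19)    # $684K  (200 teams, 15 avg seats, $19/seat)
year1_high = arr(500, 15, 19)    # $1.71M
year2_high = arr(2000, 20, 19)   # $9.12M (2,000 teams, 20 seats)
```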
### Competitive Landscape
#### Tier 1: Enterprise AIOps Incumbents
| Competitor | Revenue / Funding | Alert Intelligence | Pricing | Threat to dd0c |
|---|---|---|---|---|
| **PagerDuty AIOps** | ~$430M ARR (public) | Medium depth, PagerDuty-only ecosystem | $41–59/seat + base platform | **MEDIUM** — Massive install base but locked to single tool. Mid-market finds it too expensive. Will improve in 12–18 months. |
| **BigPanda** | $196M raised | Deep correlation engine, patent portfolio | $50K–$500K/year, "Contact Sales" | **LOW** — Cannot profitably serve dd0c's target market. 6-month deployments. Different game entirely. |
| **Moogsoft (Dell/BMC)** | Acquired | Deep ML (legacy) | Enterprise pricing | **LOW** — Post-acquisition identity crisis. Innovation stalled. Trapped inside legacy ITSM platform. |
#### Tier 2: Modern Incident Management
| Competitor | Revenue / Funding | Alert Intelligence | Pricing | Threat to dd0c |
|---|---|---|---|---|
| **incident.io** | $57M raised (Series B) | Shallow but growing. Recently added "Alerts" product | ~$16–25/seat | **HIGH** — Same buyer persona, same PLG playbook, same Slack-native approach. Most dangerous competitor. If they build deep alert intelligence, speed becomes existential. |
| **Rootly** | $20M+ raised | Shallow — basic routing rules, not ML | ~$15–20/seat | **MEDIUM** — Could add alert intelligence but DNA is incident response. |
| **FireHydrant** | $70M+ raised | Shallow — checkbox feature | ~$20–35/seat | **MEDIUM** — Broad but shallow. Trying to be everything. |
#### Tier 3: Emerging Threat
| Competitor | Threat | Timeline |
|---|---|---|
| **Datadog** ($2.1B+ ARR) | Will build alert intelligence features. Has the data, ML team, and distribution. But Datadog only works with Datadog — their moat is also their cage. | **HIGH long-term, LOW short-term.** 12–18 month window. |
#### dd0c/alert's Competitive Position
dd0c/alert occupies a blue ocean at the intersection of:
1. **Deep alert intelligence** (like BigPanda/Moogsoft) — not shallow routing rules
2. **At SMB/mid-market pricing** (like incident.io/Rootly) — not enterprise contracts
3. **With instant time-to-value** (like nobody) — 60 seconds, not 6 months
4. **Across all monitoring tools** (like nobody for the mid-market) — not locked to one ecosystem
This combination does not exist today. BigPanda has the intelligence but not the accessibility. incident.io has the accessibility but not the intelligence. dd0c/alert threads the needle between them.
### Timing Thesis: The 18-Month Window
Four structural forces are converging in 2026 that create a once-in-a-cycle entry window:
**1. Alert fatigue has hit critical mass.** The average mid-size company now runs 200–500 microservices, each generating its own alerts. "Alert fatigue" has gone from an SRE inside joke to a board-level retention concern. VPs of Engineering are now *asking* for solutions — they weren't 2 years ago.
**2. AI capabilities have matured, but incumbents haven't shipped.** Embedding models make semantic alert deduplication trivially cheap. LLMs generate useful incident summaries. Inference costs have dropped 10x in 2 years. But incumbents built their ML stacks in 2019–2021 on legacy architectures. A greenfield product built today has a massive technical advantage.
**3. Datadog pricing backlash + tool fragmentation.** Datadog's aggressive pricing has created a revolt. Teams are migrating to Grafana Cloud, self-hosted Prometheus, and alternatives. This fragmentation is *good* for dd0c/alert — the more tools a team uses, the more they need a cross-tool correlation layer.
**4. Regulatory tailwinds.** SOC2, HIPAA, PCI-DSS, and DORA (EU Digital Operational Resilience Act) all require demonstrable incident response capabilities. "How do you ensure critical alerts aren't missed?" is becoming a compliance question. dd0c/alert's transparent audit trail is a compliance feature that black-box AI can't match.
**The window closes in ~18 months.** PagerDuty will ship better native AIOps (12–18 months). incident.io will deepen alert intelligence (6–12 months). Datadog will launch cross-signal correlation (12–18 months). After that, dd0c competes on execution and data moat, not market gap — which is fine, if the moat is built by then.
### Market Trends
- **Microservices proliferation** driving exponential alert volume growth
- **SRE attrition at historic highs** — companies connecting on-call burden to turnover
- **"Build vs. buy" shifting to buy** as AI tooling costs drop below internal development thresholds
- **Platform unbundling** — teams rejecting monolithic platforms in favor of best-of-breed point solutions (Linear unbundled Jira; dd0c/alert unbundles alert intelligence from incident management platforms)
- **AI skepticism rising** — engineers increasingly skeptical of "AI-powered" claims, favoring transparent, explainable tools over black-box magic. dd0c's stoic, anti-hype brand voice is a strategic advantage here
---
## 3. PRODUCT DEFINITION
### Value Proposition
**For on-call engineers:** "You got paged 6 times last night. 5 were noise. We would have let you sleep." dd0c/alert reduces alert volume 70%+ by correlating and deduplicating across all your monitoring tools, delivering context-rich incident cards to Slack instead of raw alert spam.
**For SRE/platform leads:** "What if Marcus's pattern recognition was available to every on-call engineer, 24/7?" dd0c/alert institutionalizes the tribal correlation knowledge trapped in senior engineers' heads — cross-service dependencies, deploy-correlated noise, seasonal patterns — and makes it available to every engineer on rotation.
**For VPs of Engineering:** "Your alert noise costs $300K/year in wasted engineering time and drives your best SREs to quit. Here's the dashboard that proves it — and the tool that fixes it." dd0c/alert translates alert fatigue into business metrics (dollars wasted, hours lost, attrition risk) that justify investment at the board level.
### Personas
#### Priya Sharma — The On-Call Engineer (Primary User)
- 28, backend engineer, weekly on-call rotation at a mid-stage fintech (85 engineers)
- Gets paged 6+ times per night; 80–90% are non-actionable
- Keeps a personal Notion "ignore list" of known-noisy alerts
- Has a bash script that checks deploy logs when she gets paged — she's automated her own triage
- Spends the first 12–20 minutes of every incident figuring out if it's real
- **JTBD:** "When I get paged at 3am, I want to instantly know if this is real and what to do, so I can either fix it fast or go back to sleep."
#### Marcus Chen — The SRE/Platform Lead (Champion / Buyer)
- 34, senior SRE leading a team of 8 at a Series C SaaS company (140 engineers)
- He IS the human correlation engine — connects dots across services because no tool does it
- Maintains a manual spreadsheet tracking alert-to-incident ratios (always out of date)
- Spends 30% of his time on alert tuning instead of platform work
- Lost 2 engineers in the past year who cited on-call burden
- **JTBD:** "When I'm reviewing on-call health, I want to see exactly which alerts are noise and which are signal across all teams, so I can prioritize fixes with data instead of gut feel."
#### Diana Okafor — The VP of Engineering (Economic Buyer)
- 41, VP of Engineering, reports to CTO, accountable for MTTR and retention
- Sees MTTR of 34 minutes vs. 15-minute benchmark; on-call satisfaction at 2.1/5 for 3 consecutive quarters
- Spending $200K+/year on Datadog + PagerDuty + Grafana with no way to quantify ROI
- Needs a single, defensible metric for alert health she can present to the board
- **JTBD:** "When I'm preparing for a board meeting, I want to show a clear metric for operational health that includes alert quality, so I can demonstrate improvement or justify investment."
### Feature Roadmap
#### V1 — MVP: "Observe & Suggest" (Month 1, 30-day build)
**CRITICAL DESIGN DECISION: V1 is strictly observe-and-suggest. No auto-suppression. No auto-muting. The system shows what it *would* do and lets engineers confirm. This resolves contradictions from earlier phases where auto-suppression was discussed — the party mode board unanimously mandated this constraint, and it is non-negotiable for V1.**
| Feature | Description |
|---|---|
| **Webhook ingestion** | Accept alert payloads from Datadog, PagerDuty, OpsGenie, Grafana via webhook URL. No agents, no SDKs. |
| **Payload normalization** | Transform each source's format into a unified alert schema (source, severity, timestamp, service, message). |
| **Time-window clustering** | Group alerts firing within N minutes of each other into correlated incidents. Rule-based, no ML required. |
| **CI/CD deployment correlation** | Connect to GitHub/GitLab webhooks. Tag alert clusters with "started after deploy X" context. Party mode mandated this as a V1 must-have. |
| **Slack bot** | Post grouped incident cards to Slack. Each card shows: grouped alert count, source tools, suspected trigger, severity. Thumbs-up/down feedback buttons. |
| **Daily digest** | Summary of alerts received vs. incidents created, noise ratio, top noisy alerts. |
| **Suppression log** | Every grouping decision logged with plain-English reasoning. Searchable. Auditable. |
| **"What would have happened" view** | Show what dd0c/alert *would* have suppressed — without actually suppressing anything. The core trust-building mechanism. |
**What V1 does NOT include:** ML-based semantic dedup, auto-suppression, SSO/SCIM, custom dashboards, mobile app, API, SOC2 certification.
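As an illustration of how light the V1 engine can be, here is a minimal Python sketch of payload normalization plus time-window clustering. The payload field names, the unified schema, and the 5-minute window are illustrative assumptions — not dd0c's actual schema and not real Datadog/PagerDuty webhook formats:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    source: str       # originating tool, e.g. "datadog"
    severity: str
    timestamp: datetime
    service: str
    message: str

def normalize(source: str, payload: dict) -> Alert:
    """Map each tool's webhook payload onto the unified schema.
    The payload shapes below are invented for this sketch; real
    Datadog/PagerDuty webhook bodies use different fields."""
    if source == "datadog":
        return Alert("datadog", payload["severity"],
                     datetime.fromisoformat(payload["ts"]),
                     payload["service"], payload["title"])
    if source == "pagerduty":
        return Alert("pagerduty", payload["urgency"],
                     datetime.fromisoformat(payload["created_at"]),
                     payload["service_name"], payload["summary"])
    raise ValueError(f"unknown source: {source}")

def cluster(alerts: list[Alert], window_minutes: int = 5) -> list[list[Alert]]:
    """Rule-based time-window clustering, no ML: an alert joins the
    current group if it fired within `window_minutes` of the group's
    most recent alert; otherwise it starts a new incident."""
    window = timedelta(minutes=window_minutes)
    groups: list[list[Alert]] = []
    for a in sorted(alerts, key=lambda x: x.timestamp):
        if groups and a.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(a)
        else:
            groups.append([a])
    return groups
```

Because grouping keys off timestamps in the unified schema, the same loop handles alerts from any number of source tools once they pass through `normalize` — which is the cross-tool correlation story in miniature.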
#### V2 — Intelligence Layer (Months 24)
| Feature | Description |
|---|---|
| **Semantic deduplication** | Sentence-transformer embeddings to group alerts with similar meaning but different wording. |
| **Alert Simulation Mode** | Upload historical PagerDuty/OpsGenie exports → see what dd0c/alert would have done last month. The killer demo: proves value with zero risk, zero commitment. |
| **Noise Report Card** | Weekly per-team report: noise ratios, noisiest alerts, suggested tuning, estimated cost of noise. Gamifies alert hygiene. Creates organizational accountability. |
| **Trust Ramp — Stage 2** | "Suggest-and-confirm" mode. System proposes suppressions; engineer approves/rejects with one click. Auto-suppression unlocked only for specific, user-confirmed patterns reaching 99% accuracy. |
| **"Never suppress" safelist** | Hard-coded defaults (sev1, database, billing, security) that are never suppressed regardless of model confidence. User-configurable. |
| **Business impact dashboard** | Translate noise into dollars: hours wasted, estimated attrition cost, MTTR impact. Diana's board-meeting ammunition. |
| **Additional integrations** | CloudWatch, Prometheus Alertmanager, custom webhook format support. |
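To make the V2 semantic-deduplication idea concrete, here is a hedged sketch of the grouping logic. The product description calls for sentence-transformer embeddings; the bag-of-words `embed` stand-in below exists only so the example runs without a model download, and the 0.6 similarity threshold is an arbitrary assumption:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words token counts. The real pipeline
    would substitute dense sentence-transformer vectors here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(messages: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy semantic grouping: each message joins the first group
    whose representative message is at least `threshold` similar."""
    groups: list[list[str]] = []
    for msg in messages:
        for group in groups:
            if cosine(embed(msg), embed(group[0])) >= threshold:
                group.append(msg)
                break
        else:
            groups.append([msg])
    return groups
```

The point of the sketch: alerts with "similar meaning but different wording" collapse into one group, which time-window clustering alone cannot do.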
#### V3 — Platform & Automation (Months 59)
| Feature | Description |
|---|---|
| **dd0c/run integration** | Alert fires → correlated incident → suggested runbook → one-click execute. The flywheel that makes alert + run 10x more valuable together. |
| **Cross-team correlation** | When multiple teams send alerts, correlate incidents across service boundaries. "Every time Team A's DB alerts fire, Team B's API errors follow 2 minutes later." |
| **Predictive severity scoring** | Historical resolution data predicts incident severity. "This pattern was resolved by 'restart-payment-service' 14 times in 3 months." |
| **Trust Ramp — Stage 3** | Full auto-suppression for patterns with proven track records. Circuit breaker: if accuracy drops below 95%, auto-fallback to pass-through mode. |
| **SSO (SAML/OIDC)** | Required for Business tier and company-wide rollouts. |
| **API access** | Programmatic access to alert data, noise metrics, and suppression rules. |
| **SOC2 Type II** | Certification process started at ~Month 6, completed by Month 9. |
| **Community patterns (future)** | Anonymized cross-customer pattern sharing. "87% of teams running K8s + Istio suppress this pattern." Requires 500+ customers. Architect the data pipeline to support this from Day 1. |
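The Trust Ramp's Stage 3 circuit breaker can be sketched as a small rolling-accuracy tracker. The safelist contents come from the V2 "never suppress" feature; the 100-decision window and the exact API are illustrative assumptions, not dd0c's implementation:

```python
from collections import deque

# Default "never suppress" safelist from the V2 feature list (user-configurable).
NEVER_SUPPRESS = {"sev1", "database", "billing", "security"}

class CircuitBreaker:
    """Tracks rolling suppression accuracy from engineer feedback.
    If accuracy dips below `floor`, auto-fallback to pass-through
    mode: every alert is delivered raw until trust is re-earned."""

    def __init__(self, floor: float = 0.95, window: int = 100):
        self.floor = floor
        self.outcomes = deque(maxlen=window)  # True = suppression was correct

    def record(self, correct: bool) -> None:
        self.outcomes.append(correct)

    @property
    def accuracy(self) -> float:
        # An empty window defaults to 1.0 purely for this sketch.
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def may_suppress(self, tags: set) -> bool:
        # Safelisted alerts are never suppressed, regardless of confidence.
        if tags & NEVER_SUPPRESS:
            return False
        return self.accuracy >= self.floor
```

The safelist check runs before the accuracy check on purpose: no amount of demonstrated model confidence overrides a hard-coded "never suppress" rule.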
### User Journey
```
DISCOVER ACTIVATE ENGAGE EXPAND
─────────────────────────────────────────────────────────────────────────────────────────────
"Alert fatigue sucks" "Paste webhook URL, "See noise reduction "Roll out to all teams,
connect Slack" in 60 seconds" upgrade to Business"
Blog post / HN launch / Free tier signup → Daily digest shows Cross-team correlation
Alert Fatigue Calculator / copy webhook URL → 47 alerts → 8 incidents. value prop triggers
Twitter / conf talk paste into Datadog/PD → Noise Report Card in expansion. VP sees
first alerts flow → weekly SRE review. business impact
Slack bot groups them Thumbs-up/down trains dashboard → mandates
in <60 seconds. the model. Trust grows. company-wide rollout.
"WOW: 47 → 8." dd0c/run cross-sell.
```
**The critical activation metric: Time to First "Wow"**
Target: **60 seconds** from signup to seeing grouped incidents in Slack. This is the party mode board's #1 mandate. The entire PLG motion lives or dies on this number.
The Alert Simulation shortcut for prospects not ready to connect live alerts: upload last 30 days of PagerDuty/OpsGenie export → see "Last month, you received 4,200 alerts. We would have shown you 340 incidents." Proves value with zero risk.
### Pricing
| Tier | Price | Includes | Target |
|---|---|---|---|
| **Free** | $0 | Up to 5 seats, 1,000 alerts/month, 2 integrations, 7-day retention | Solo devs, tiny teams, tire-kickers. Removes cost objection. |
| **Pro** | $19/seat/month | Unlimited alerts, 4 integrations, 90-day retention, Slack bot, daily digest, deployment correlation, Noise Report Card | Teams of 5–50. The beachhead. Credit-card swipe, no procurement. |
| **Business** | $39/seat/month | Everything in Pro + unlimited integrations, 1-year retention, API access, custom suppression rules, priority support, SSO | Teams of 50–200. Expansion tier when VP mandates company-wide rollout. |
| **Enterprise** | Custom | Everything in Business + dedicated instance, SLA, SOC2 report, custom integrations | 200+ seats. Don't build until Year 2. |
**Pricing rationale:**
- $19/seat for a 20-person team = $380/month. Below the "just expense it" threshold (most eng managers can expense <$500/month without VP approval).
- ROI is trivial: one prevented false-alarm page at 3am saves ~$25–33 in engineer productivity. dd0c/alert needs to prevent ONE false page per engineer per month to pay for itself. At 70% noise reduction, ROI is 10–50x.
- Below the "build internally" threshold: one engineer-day building a custom dedup script (~$600) exceeds a year of dd0c/alert for a small team.
- Average blended price across customers: ~$25/seat (mix of Pro and Business tiers).
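The payback claim can be checked with back-of-envelope arithmetic (figures from the rationale above; the per-page savings value below is a midpoint assumption):

```python
# Back-of-envelope ROI for a 20-person team.
seats = 20
monthly_cost = 19 * seats        # $380/month -- below the "just expense it" threshold
pages_avoided = 1 * seats        # one prevented false page per engineer per month
savings_per_page = 29            # assumed midpoint of the per-page productivity estimate
monthly_savings = pages_avoided * savings_per_page
assert monthly_savings > monthly_cost  # pays for itself at one page per engineer
```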
---
## 4. GO-TO-MARKET PLAN
### Launch Strategy
dd0c/alert is Phase 2 of the dd0c platform ("The On-Call Savior," months 4–6 per brand strategy). It launches after dd0c/route and dd0c/cost have established the dd0c brand and are generating ≥$5K MRR — proving the platform resonates before adding a third product.
The GTM motion is **pure PLG via webhook integration.** No sales team. No "Contact Sales." No 6-month POCs. The webhook URL is the distribution channel — the lowest-friction integration mechanism in all of DevOps (copy URL, paste into monitoring tool, done).
### Beachhead: The First 10 Customers
**Ideal First Customer Profile:**
- Series A–C startup, 30–150 engineers
- Running microservices on Kubernetes (AWS EKS or GCP GKE)
- Using at least 2 of: Datadog, Grafana, PagerDuty, OpsGenie
- Dedicated SRE/platform team of 2–8 people
- On-call rotation exists and is painful (verify via public postmortem blogs — companies that publish postmortems have mature-enough incident culture to care about alert quality)
**Champion profile:** The SRE lead or senior platform engineer (28–38, 5–10 years experience), active on Twitter/X or SRE Slack communities, has complained publicly about alert fatigue, and has authority to add a webhook without VP approval.
**Where to find them:**
| Channel | Tactic | Expected Customers |
|---|---|---|
| **SRE Twitter/X** | Search for engineers tweeting about alert fatigue, PagerDuty frustration, on-call burnout. Engage authentically. DM 50 warm leads at launch: "I built something for this. Free for 30 days." 10–15% conversion on warm DMs. | 3–4 |
| **Hacker News** | "Show HN: I was tired of getting paged for garbage at 3am, so I built dd0c/alert." Be technical, be honest, show the architecture. HN loves solo founder stories from senior engineers solving their own pain. 200–500 signups, 2–5% convert. | 2–3 |
| **SRE Slack communities** | Rands Leadership Slack, DevOps Chat, SRE community Slack, Kubernetes Slack. Participate in alert fatigue conversations. Offer free beta access. | 2–3 |
| **Conference lightning talks** | SREcon, KubeCon, DevOpsDays. "How We Reduced Alert Volume 80% With a Webhook and Some Embeddings." Live demo converts attendees that night. | 1–2 |
| **Personal network** | Brian's AWS architect network. First 1–2 customers should be people he knows personally — they'll give honest feedback and forgive V1 bugs. | 1–2 |
**Target: 10 paying customers within 4 weeks of launch.**
### The "Prove Value in 60 Seconds" Onboarding Requirement
The party mode board mandated this as the #1 must-get-right item. The entire PLG funnel depends on it:
1. User signs up (email + company name, nothing else)
2. User gets a webhook URL
3. User pastes webhook URL into Datadog/PagerDuty/Grafana notification settings
4. First alerts start flowing in
5. Within 60 seconds, dd0c/alert shows in Slack: "You've received 47 alerts in the last hour. We identified 8 unique incidents. Here's how we'd group them."
6. **That's the "wow."** 47 → 8. Visible, immediate, undeniable.
**Alert Simulation shortcut** for prospects who want proof before connecting live alerts: "Upload your last 30 days of alert history (CSV export from PagerDuty/OpsGenie). We'll show you what last month would have looked like." This is the killer demo — proves value with zero risk, zero commitment, zero live integration. No competitor offers this.
### Growth Loops
**Loop 1: Noise Report Card → Internal Virality**
Weekly per-team noise report → Marcus shares with Diana → Diana mandates company-wide rollout → more teams adopt → cross-team correlation improves → more value → more sharing. The report card is both a retention feature and an expansion trigger.
**Loop 2: Alert Fatigue Calculator → Lead Gen → Conversion**
Free public web tool (dd0c.com/calculator). Engineers input their alert volume, noise %, team size, salary. Calculator outputs: hours wasted, dollar cost, attrition risk. CTA: "Want to see your actual noise reduction? Connect dd0c/alert free →." Genuinely useful even without dd0c/alert — gets shared in Slack channels, 1:1s, all-hands. Captures and qualifies leads (someone entering "500 alerts/week, 85% noise, 40 engineers" is a perfect customer).
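The calculator's logic is simple enough to sketch. Every constant below (10 minutes of triage per noisy alert, 2,080 work hours per year) is an illustrative assumption, not dd0c's published formula:

```python
def alert_fatigue_cost(alerts_per_week: int, noise_pct: float,
                       team_size: int, avg_salary: float,
                       minutes_per_noise_alert: float = 10.0) -> dict:
    """Estimate what noisy alerts cost a team per year.
    Triage-time and work-hours constants are rough assumptions."""
    hourly_rate = avg_salary / 2080                 # ~work hours per year
    noisy_per_week = alerts_per_week * noise_pct
    hours_wasted = noisy_per_week * 52 * minutes_per_noise_alert / 60
    return {
        "noisy_alerts_per_engineer_per_week": round(noisy_per_week / team_size, 1),
        "hours_wasted_per_year": round(hours_wasted),
        "annual_cost": round(hours_wasted * hourly_rate),
    }
```

Feeding in the qualification example from the text ("500 alerts/week, 85% noise, 40 engineers") at a $150K average salary yields on the order of 3,700 wasted hours and roughly a quarter-million dollars a year — exactly the kind of number that gets shared in all-hands.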
**Loop 3: Cross-Team Expansion**
Land in one team → demonstrate 60% noise reduction → pitch: "Connect all 8 teams and we estimate 85% reduction because we can correlate across service boundaries." Cross-team correlation is the expansion trigger that no single-team tool can match.
**Loop 4: dd0c/alert → dd0c/run Cross-Sell**
Engineers see "Suggested Runbook" placeholders on incident cards → "Want to auto-attach runbooks? Add dd0c/run." Alert intelligence feeds runbook automation; resolution data feeds back into smarter correlation. The flywheel that makes the platform 10x more valuable than either product alone.
### Content Strategy
| Asset | Purpose | Timeline |
|---|---|---|
| **Alert Fatigue Calculator** | Lead gen, SEO, qualification. Long-tail keyword "alert fatigue cost calculator" = high purchase intent, low competition. | Launch day |
| **Engineering blog** | Technical credibility. "The True Cost of Alert Fatigue," "How We Reduced Alert Volume 80%," "The Architecture of dd0c/alert: Semantic Dedup with Sentence Transformers." | Ongoing from launch |
| **Open-source CLI: `dd0c-dedup`** | Engineering-as-marketing. Local tool that analyzes PagerDuty/OpsGenie export files and shows noise patterns. Free sample → SaaS subscription. | Month 1 |
| **"State of Alert Fatigue" annual report** | Survey 500+ SREs. Publish benchmarks. Become the industry reference that journalists and conference speakers cite. dd0c becomes synonymous with "alert intelligence." | Month 6 |
| **Case studies** | Social proof. First case study from earliest customer. "How [Company] reduced alert noise 73% in 2 weeks." | Month 2–3 |
| **Build-in-public Twitter thread** | Authenticity. Share progress, architecture decisions, customer wins. SRE audience respects transparency. | Pre-launch through ongoing |
### Marketplace Partnerships
| Partner | Distribution Value | Priority | Pitch |
|---|---|---|---|
| **PagerDuty Marketplace** | Very High — 28,000+ customers, exact buyer persona | P0 | "We make PagerDuty better. We reduce noise before it hits your platform. Complement, not competitor." |
| **Grafana Plugin Directory** | High — massive open-source community, growing as teams migrate from Datadog | P0 | Natural distribution. Plugin sends Grafana alerts to dd0c/alert. |
| **Datadog Marketplace** | High — growing marketplace | P1 | "We help Datadog customers get more value by correlating Datadog alerts with alerts from other tools." |
| **OpsGenie/Atlassian Marketplace** | Medium — #2 on-call tool, Atlassian distribution | P1 | Atlassian ecosystem reach. |
| **Slack App Directory** | Medium — discovery channel | P1 | Slack-native positioning. |
### 90-Day Launch Timeline
| Period | Actions | Targets |
|---|---|---|
| **Days 1–30: Build MLP** | Core engine (webhook ingestion, normalization, time-window clustering, deployment correlation). Slack bot. Dashboard MVP (Noise Report Card, integration management, suppression log). | Ship V1. First webhook received. |
| **Days 31–60: Launch & Validate** | HN "Show HN" post. Twitter/X announcement. Alert Fatigue Calculator live. SRE Slack community outreach. Personal network DMs. Daily customer conversations. Fix top 3 pain points. | 25–50 free signups. 5–10 paying teams. First case study. |
| **Days 61–90: Prove Flywheel** | Add semantic dedup (sentence-transformer embeddings). Ship Alert Simulation Mode. Submit to PagerDuty Marketplace + Grafana Plugin Directory. Publish first case study. Launch dd0c/alert + dd0c/run integration. | 50–100 free users. 15–25 paying teams. $5K+ MRR. |
---
## 5. BUSINESS MODEL
### Revenue Model
**Primary revenue:** Per-seat SaaS subscription (Pro at $19/seat/month, Business at $39/seat/month).
**Expansion revenue:** Seat expansion within accounts (land with 10 seats, expand to 50+ as more teams adopt) + tier upgrades (Pro → Business when VP mandates company-wide rollout and needs SSO/longer retention) + cross-product upsell (dd0c/alert → dd0c/run bundle).
**Future revenue (Year 2+):** Usage-based pricing tiers for high-volume customers processing >100K alerts/month. Enterprise tier with custom pricing for 200+ seat deployments.
### Unit Economics
| Metric | Value | Notes |
|---|---|---|
| **Average deal size** | $285/month ($19 × 15 seats) | Pro tier, typical mid-market team |
| **Blended ARPU** | ~$375/month | Mix of Pro ($285) and Business ($780) customers |
| **Gross margin** | ~85–90% | Infrastructure costs are minimal: webhook ingestion + embedding computation + Slack API. No agents to host. |
| **CAC (PLG)** | ~$50–150 | Content marketing + community engagement. No paid ads initially. No sales team. |
| **CAC payback** | <1 month | At $285/month ARPU and $150 CAC, payback is immediate. |
| **LTV (at 5% monthly churn)** | ~$5,700 | $285/month × 20-month average lifetime. Improves as data moat reduces churn over time. |
| **LTV:CAC ratio** | 38:1 to 114:1 | Exceptional unit economics enabled by PLG + solo founder cost structure. |
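A quick sketch reproducing the table's arithmetic (Pro-tier figures, with churn and CAC values as stated above):

```python
# Unit economics for a typical Pro-tier customer.
arpu = 19 * 15                           # $285/month average deal (15 seats)
monthly_churn = 0.05
avg_lifetime_months = 1 / monthly_churn  # 20 months
ltv = arpu * avg_lifetime_months         # $5,700
cac_low, cac_high = 50, 150              # PLG acquisition cost range
payback_months = cac_high / arpu         # ~0.5 months even at the high end
ltv_to_cac = (ltv / cac_high, ltv / cac_low)  # 38:1 up to 114:1
```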
**Cost structure advantage:** Zero employees, zero investors, zero burn rate. Profitable from customer #1. BigPanda needs $40M+ in revenue to break even (200+ employees at ~$200K fully loaded). incident.io raised $57M and must move upmarket to satisfy investor returns. dd0c can price at $19/seat and be profitable because the cost structure IS the moat.
### Path to Revenue Milestones
#### $10K MRR (~35 paying teams)
- **Timeline:** Month 3–4 (Grind scenario), Month 2 (Rocket scenario)
- **How:** First 10 customers from launch channels (HN, Twitter, personal network). Next 25 from content marketing, marketplace listings, and word of mouth.
- **Solo founder feasible:** Yes. Product is stable, support is manageable, marketing is content-driven.
#### $50K MRR (~175 paying teams)
- **Timeline:** Month 8–10 (Grind), Month 5 (Rocket)
- **How:** PLG flywheel kicking in. Noise Report Card driving internal expansion. Alert Fatigue Calculator generating steady leads. PagerDuty Marketplace live. First case studies published. dd0c/run cross-sell beginning.
- **Solo founder feasible:** Stretching. Consider first hire (engineer) at $30K MRR to maintain velocity.
#### $100K MRR (~350 paying teams)
- **Timeline:** Month 12–15 (Grind), Month 8 (Rocket)
- **How:** Cross-team expansion driving seat growth. Business tier adoption at 20%+ of customers. dd0c/alert + dd0c/run bundle driving 30–40% of new signups. Community patterns feature (if 500+ customers reached) creating cross-customer network effects.
- **Solo founder feasible:** No. Need a 2–3 person team. First engineer hired at $30K MRR, second at $75K MRR. Hire for infrastructure reliability and ML — the two areas that compound value fastest.
### Solo Founder Constraints & Mitigations
| Constraint | Mitigation |
|---|---|
| **Support burden** | Self-service docs, in-app guides, community Slack channel. Overlay architecture means dd0c going down = fallback to raw alerts (no worse than before). |
| **Uptime expectations** | Multi-region webhook endpoints with failover. Dual-path: webhook for real-time + periodic API polling for reconciliation. Health check monitoring if webhook volume drops to zero. |
| **Feature velocity** | Shared dd0c platform infrastructure (auth, billing, data pipeline) means each new product is incremental, not greenfield. Ruthless scope control. |
| **Burnout / bus factor** | Hire first engineer at $30K MRR, not $100K MRR. Don't wait until drowning. Automate everything automatable. |
### Revenue Scenarios (24-Month Projection)
| Scenario | Probability | Month 6 ARR | Month 12 ARR | Month 24 ARR |
|---|---|---|---|---|
| **Rocket** (everything clicks) | 20% | $342K | $1.64M | $12.5M |
| **Grind** (solid PMF, slower growth) | 50% | $109K | $513K | $3.03M |
| **Pivot** (competitive pressure, stalls) | 30% | $34K | $109K | Pivot to dd0c/run feature |
| **Expected value (weighted)** | — | $138K | $596K | $4.05M |
The expected-value scenario produces a $4M ARR product at Month 24. Even the Grind scenario (most likely) yields $3M ARR — enough to hire a small team and compound growth. This is a real business in every scenario except Pivot, which has defined kill criteria.
---
## 6. RISKS & MITIGATIONS
### Top 5 Risks
#### Risk 1: PagerDuty Ships Native Cross-Tool AI Correlation
- **Probability:** HIGH (80%) | **Impact:** CRITICAL | **Timeline:** 12–18 months
- **Threat:** PagerDuty already has "Event Intelligence." If they ship genuinely good alert intelligence bundled free into existing plans, dd0c's value prop for PagerDuty-only shops evaporates.
- **Mitigation:** dd0c's cross-tool correlation is the hedge — PagerDuty can only improve intelligence for PagerDuty alerts. Speed: be in market with 500+ customers and a trained data moat before they ship. Position as complement: "Keep PagerDuty for on-call. Add dd0c/alert in front to cut noise 70% across ALL your tools."
- **Residual risk:** MEDIUM. PagerDuty-only shops (~30% of TAM) become harder. Multi-tool shops (70% of TAM) unaffected.
- **Pivot option:** Double down on cross-tool visualization and deployment correlation inside Slack. Become the "incident context brain" connecting CI/CD to PagerDuty.
#### Risk 2: AI Suppresses a Real P1 Alert (Existential Trust Event)
- **Probability:** MEDIUM (50%) | **Impact:** CRITICAL | **Timeline:** Ongoing from Day 1
- **Threat:** One suppressed critical alert causing a production outage = permanent distrust. "dd0c/alert suppressed a P1 and we had a 2-hour outage" on Hacker News destroys the brand instantly.
- **Mitigation:** V1 has ZERO auto-suppression (non-negotiable). Trust Ramp: observe → suggest-and-confirm → auto-suppress only with explicit opt-in on patterns reaching 99% accuracy. "Never suppress" safelist (sev1, database, billing, security) — configurable, default-on. Transparent audit trail for every decision. Circuit breaker: if accuracy drops below 95%, auto-fallback to pass-through mode.
- **Residual risk:** MEDIUM. This risk never reaches zero — it's the existential tension of the product. Managing it IS the core competency.
- **Pivot option:** Drop auto-suppression entirely. Pivot to pure "Alert Grouping & Context Synthesis" in Slack. Grouping 47 pages into 1 still reduces 3am panic significantly without suppression liability.
#### Risk 3: Data Privacy — Enterprises Won't Send Alert Data to a Solo Founder's SaaS
- **Probability:** MEDIUM (50%) | **Impact:** HIGH | **Timeline:** From Day 1
- **Threat:** Alert data contains service names, infrastructure details, error messages, sometimes customer data in payloads. CISOs will block adoption.
- **Mitigation:** Target Series B startups where Marcus the SRE can plug in a webhook without procurement review (not Fortune 500). Offer "Payload Stripping" mode: only receive metadata (source, timestamp, severity, alert name), strip raw logs. Publish clear data handling policy. SOC2 Type II by Month 6–9. Architecture transparency: publish diagrams showing encryption in transit (TLS) and at rest (AES-256), no access to monitoring credentials.
- **Residual risk:** MEDIUM. Slows enterprise adoption but doesn't block mid-market PLG.
- **Pivot option:** Open-source the correlation engine (`dd0c-worker`). Customers run it in their own VPC; only anonymous hashes and timing data sent to SaaS dashboard.
#### Risk 4: incident.io Adds Deep Alert Intelligence
- **Probability:** HIGH (70%) | **Impact:** HIGH | **Timeline:** 6–12 months
- **Threat:** Same buyer persona, same PLG motion, same Slack-native approach. $57M raised, 100+ employees. If they invest heavily in ML-based correlation, they offer alert intelligence + incident management in one product.
- **Mitigation:** Speed — be the recognized "alert intelligence" brand before they get there. Depth over breadth — their alert intelligence is one feature among many; dd0c's is the entire product, 10x deeper. The dd0c/alert + dd0c/run flywheel creates compound value they'd need two products to match. Interop positioning: "Use incident.io for incident management. Use dd0c/alert for alert intelligence. They work great together."
- **Residual risk:** MEDIUM-HIGH. This is the biggest competitive threat. Monitor their product roadmap obsessively.
#### Risk 5: Solo Founder Burnout / Bus Factor
- **Probability:** MEDIUM-HIGH (60%) | **Impact:** CRITICAL | **Timeline:** 6–12 months
- **Threat:** Building and supporting multiple dd0c products while doing marketing, sales, and customer support. One person maintaining 99.99% uptime on an alert ingestion pipeline.
- **Mitigation:** Ruthless scope control (V1 is minimal: time-window clustering + deployment correlation + Slack bot). Shared platform infrastructure reduces per-product effort. Overlay architecture means downtime = fallback to raw alerts, not total failure. Hire first engineer at $30K MRR. Automate support via self-service docs and community Slack.
- **Residual risk:** MEDIUM. Solo-founder risk is real and cannot be fully mitigated; disciplined scope control is the only defense.
### Risk Summary Matrix
| # | Risk | Probability | Impact | Residual | Action |
|---|---|---|---|---|---|
| 1 | PagerDuty builds natively | HIGH | CRITICAL | MEDIUM | Outrun. Cross-tool positioning. |
| 2 | AI suppresses real P1 | MEDIUM | CRITICAL | MEDIUM | Engineer. Trust Ramp. Never-suppress safelist. |
| 3 | Data privacy concerns | MEDIUM | HIGH | MEDIUM | Certify. Payload stripping. SOC2. |
| 4 | incident.io adds alert intelligence | HIGH | HIGH | MEDIUM-HIGH | Outrun. Depth + flywheel. |
| 5 | Solo founder burnout | MEDIUM-HIGH | CRITICAL | MEDIUM | Scope ruthlessly. Hire early. |
### Kill Criteria
These are the signals to STOP and redirect resources:
1. **Can't find 10 paying customers in 90 days.** If the pain isn't acute enough for 10 teams to pay $19/seat after a free trial, the market isn't ready. Redirect to dd0c/run or dd0c/portal.
2. **Cannot achieve verifiable 50% noise reduction for 10 paying beta teams within 90 days without a single false-negative** (real alert missed). Kill the product or strip it back to a pure Slack formatting tool.
3. **False positive rate exceeds 5% after 90 days.** If suppression accuracy can't reach 95% within 3 months of real-world data, the technology isn't ready. Go back to R&D.
4. **PagerDuty ships free, cross-tool alert intelligence.** Market position becomes untenable. Pivot dd0c/alert into a feature of dd0c/run.
5. **incident.io launches deep alert intelligence at <$15/seat.** Fighting uphill. Consider folding dd0c/alert into dd0c/run rather than competing standalone.
6. **Monthly customer churn exceeds 10% after Month 3.** Value isn't sticky. Investigate root cause before continuing investment.
7. **Spending >60% of time on support instead of building.** Product isn't self-service enough. Fix UX or reconsider viability as solo-founder venture.
### Pivot Options
| Trigger | Pivot |
|---|---|
| Competitive pressure kills standalone viability | Fold dd0c/alert into dd0c/run as a feature (alert correlation → auto-remediation pipeline) |
| Auto-suppression rejected by market | Pure "Alert Grouping & Context Synthesis" tool — no suppression, just better Slack formatting with deploy context |
| Data privacy blocks SaaS adoption | Open-source the correlation engine; charge for the dashboard/analytics SaaS layer |
| Alert intelligence commoditized | Pivot to deployment correlation as primary value prop — "the CI/CD ↔ incident bridge" |
---
## 7. SUCCESS METRICS
### North Star Metric
**Alerts Correlated Per Month**
Every correlated alert = an engineer who didn't get interrupted by a duplicate or noise alert. It's measurable, meaningful, and grows with both customer count and per-customer value. It captures the core promise: turning alert chaos into actionable signal.
### Leading Indicators (Predict Future Success)
| Metric | Target | Why It Matters |
|---|---|---|
| Time to first webhook | <5 minutes | Activation friction. If this is >30 minutes, the PLG motion is broken. |
| Time to first "wow" (grouped incident in Slack) | <60 seconds after first alert | The party mode mandate. The moment that converts tire-kickers to believers. |
| Thumbs-up/down ratio on Slack cards | >80% thumbs-up | Model accuracy signal. Below 70% = correlation quality is insufficient. |
| Free → Paid conversion rate | >5% | Willingness to pay. Below 2% = value prop isn't landing. |
| Weekly active users / total seats | >60% | Engagement depth. Below 30% = shelfware risk. |
| Integrations per customer | >2 | Multi-tool stickiness. More integrations = higher switching cost = lower churn. |
### Lagging Indicators (Confirm Business Health)
| Metric | Target | Why It Matters |
|---|---|---|
| MRR and MRR growth rate | 15–30% MoM (Stage 1) | Business trajectory. |
| Net revenue retention | >110% | Expansion outpacing churn. Land-and-expand working. |
| Logo churn (monthly) | <5% | Customer satisfaction. >10% = kill criteria triggered. |
| Noise reduction % (customer-reported) | >50% (target 70%+) | Core value delivery. <30% = kill criteria triggered. |
| NPS | >40 | Product-market fit signal. <20 = fundamental problem. |
| Seats per customer (avg) | Growing over time | Internal expansion working. |
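The net revenue retention target above can be made concrete with a quick calculation. The sketch below is illustrative only; the cohort figures and the `CohortMrr` shape are invented for the example:

```typescript
// Illustrative NRR calculation — all numbers are hypothetical.
interface CohortMrr {
  startingMrr: number;  // MRR from this cohort at the start of the period
  expansion: number;    // upgrades / seat growth within the cohort
  contraction: number;  // downgrades
  churn: number;        // MRR lost to cancelled customers
}

function netRevenueRetention(c: CohortMrr): number {
  return (c.startingMrr + c.expansion - c.contraction - c.churn) / c.startingMrr;
}

// A cohort starting at $10K MRR that expands $2K, contracts $300, and churns $500:
const nrr = netRevenueRetention({ startingMrr: 10_000, expansion: 2_000, contraction: 300, churn: 500 });
console.log(`${(nrr * 100).toFixed(0)}%`); // → "112%" — above the >110% target
```

Because NRR excludes new-logo revenue, a value above 100% means the existing base grows on its own, which is what makes the land-and-expand motion compound.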
### 30/60/90 Day Milestones
| Milestone | Day 30 | Day 60 | Day 90 |
|---|---|---|---|
| **Product** | V1 shipped. Webhook ingestion, time-window clustering, deployment correlation, Slack bot live. | Semantic dedup added. Alert Simulation Mode live. Top 3 user pain points fixed. | dd0c/run integration live. PagerDuty Marketplace submitted. |
| **Customers** | First webhook received. First free users. | 25–50 free signups. 5–10 paying teams. | 50–100 free users. 15–25 paying teams. |
| **Revenue** | $0–$1K MRR | $1K–$3K MRR | $3K–$5K+ MRR |
| **Validation** | Time-to-first-webhook <5 min confirmed. | Noise reduction >50% confirmed with real customers. First case study drafted. | Free-to-paid conversion >5%. NPS >40. Kill criteria evaluated. |
### Month 6 Targets
| Metric | Target |
|---|---|
| Paying teams | 100 (Grind) / 250 (Rocket) |
| MRR | $25K (Grind) / $70K (Rocket) |
| Noise reduction (avg across customers) | >65% |
| PagerDuty Marketplace | Live and generating signups |
| SOC2 Type II | Process started |
| dd0c/run cross-sell rate | 15%+ of alert customers |
| Net revenue retention | >110% |
### Month 12 Targets
| Metric | Target |
|---|---|
| Paying teams | 400 (Grind) / 1,000 (Rocket) |
| ARR | $513K (Grind) / $1.64M (Rocket) |
| Noise reduction (avg) | >70% |
| Team size | 2–3 (first engineer hired at $30K MRR) |
| SOC2 Type II | Certified |
| Cross-product adoption (alert + run) | 3040% of customers |
| Community patterns feature | Architected, beta if 500+ customers reached |
| Net revenue retention | >120% |
---
*This product brief synthesizes findings from four prior phases: Brainstorm (200+ ideas), Design Thinking (5 personas, empathy mapping, journey mapping), Innovation Strategy (Christensen disruption analysis, Blue Ocean strategy, Porter's Five Forces, JTBD analysis), and Party Mode (5-person advisory board stress test, 4-1 GO verdict). All contradictions have been resolved in favor of the party mode board's mandates: V1 is observe-and-suggest only, deployment correlation is a V1 must-have, and the product must prove value within 60 seconds of pasting a webhook.*
*dd0c/alert is a classic low-end disruption: BigPanda intelligence at 1/100th the price, for the 150,000 mid-market teams the incumbents can't profitably serve. The 18-month window is open. Build the wedge, earn the trust, sell them the runbooks.*
**All signal. Zero chaos.**
# dd0c/alert — Test Architecture & TDD Strategy
**Product:** dd0c/alert (Alert Intelligence Platform)
**Version:** 2.0 | **Date:** 2026-02-28 | **Phase:** 7 — Test Architecture
**Stack:** TypeScript / Node.js 20 | Vitest | Testcontainers | LocalStack
---
## 1. Testing Philosophy & TDD Workflow
### 1.1 Core Principle
dd0c/alert is an intelligence platform — it makes decisions about what engineers see during incidents. A wrong suppression decision can hide a P1. A wrong correlation can create noise. **Tests are not optional; they are the specification.**
Every behavioral rule in the Correlation Engine, Noise Scorer, and Notification Router must be expressed as a failing test before a single line of implementation is written.
### 1.2 Red-Green-Refactor Cycle
```
RED → Write a failing test that describes the desired behavior.
The test must fail for the right reason (not a compile error).
GREEN → Write the minimum code to make the test pass.
No gold-plating. No "while I'm here" changes.
REFACTOR → Clean up the implementation without breaking tests.
Extract functions, rename for clarity, remove duplication.
Tests stay green throughout.
```
**Strict rule:** No implementation code is written without a failing test first. PRs that add implementation without a corresponding test are blocked by CI.
### 1.3 Test Naming Convention
Tests follow the `given_when_then` pattern using Vitest's `describe`/`it` structure:
```typescript
describe('NoiseScorer', () => {
describe('given a deploy-correlated alert window', () => {
it('should boost noise score by 25 points when deploy is attached', () => { ... });
it('should add 5 additional points when PR title contains "feature-flag"', () => { ... });
it('should not boost score above 50 when service matches never-suppress safelist', () => { ... });
});
});
```
Test file naming: `{module}.test.ts` for unit tests, `{module}.integration.test.ts` for integration tests, `{journey}.e2e.test.ts` for E2E.
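The three behaviors specified in the `NoiseScorer` tests above could be satisfied by an implementation along these lines. This is a sketch only: the `AlertWindow` shape, the base-score input, and the safelisted service names are assumptions, not the shipped design:

```typescript
// Hypothetical NoiseScorer core — shapes and safelist entries are invented.
interface AlertWindow {
  baseScore: number;       // score before deploy-correlation adjustments
  deployAttached: boolean; // a deploy was correlated to this alert window
  prTitle?: string;        // title of the correlated PR, if any
  service: string;
}

const NEVER_SUPPRESS_SAFELIST = new Set(['payments', 'auth']); // example entries
const SAFELIST_SCORE_CAP = 50;

function noiseScore(w: AlertWindow): number {
  let score = w.baseScore;
  if (w.deployAttached) {
    score += 25;                                       // deploy-correlated boost
    if (w.prTitle?.includes('feature-flag')) score += 5; // flag rollouts are noisier
  }
  if (NEVER_SUPPRESS_SAFELIST.has(w.service)) {
    score = Math.min(score, SAFELIST_SCORE_CAP);       // safelisted services stay suppressible-proof
  }
  return score;
}
```

Writing the three `it` blocks first forces decisions like "does the feature-flag bonus apply without a deploy?" to be made in the test, not discovered in review.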
### 1.4 When Tests Lead (TDD Mandatory)
TDD is **mandatory** for:
- All noise scoring logic (`src/scoring/`)
- All correlation rules (`src/correlation/`)
- All suppression decisions (`src/suppression/`)
- HMAC validation per provider
- Canonical schema mapping (every provider parser)
- Feature flag circuit breaker logic
- Governance policy enforcement (`policy.json` evaluation)
- Any function with cyclomatic complexity > 3
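For the HMAC-validation item, the test-first target is usually a small pure function like the sketch below. The SHA-256/hex choices are assumptions — each real provider defines its own signing scheme and header — but the constant-time comparison via `timingSafeEqual` is the part the tests must pin down:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sketch of provider-agnostic webhook signature verification.
// Algorithm (sha256) and hex encoding are illustrative assumptions.
function verifySignature(secret: string, rawBody: string, signatureHex: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(signatureHex, 'utf8');
  const b = Buffer.from(expected, 'utf8');
  // timingSafeEqual throws on length mismatch, so reject unequal lengths first.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```

Tests written first would cover: a valid signature passes, a tampered body fails, a truncated signature fails without throwing, and the comparison is length-guarded.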
TDD is **recommended but not enforced** for:
- Infrastructure glue code (SQS consumers, DynamoDB adapters)
- Slack Block Kit message formatting
- Dashboard API route handlers (covered by integration tests)
### 1.5 Test Ownership
Each epic owns its tests. The Correlation Engine team owns `src/correlation/**/*.test.ts`. No cross-team test ownership. If a test breaks due to a dependency change, the team that changed the dependency fixes the test.
---