dd0c/products/03-alert-intelligence/design-thinking/session.md

🎷 dd0c/alert — Design Thinking Session

Product: Alert Intelligence Layer (dd0c/alert)
Facilitator: Maya, Design Thinking Maestro
Date: 2026-02-28
Method: Full Design Thinking (Empathize → Define → Ideate → Prototype → Test → Iterate)


"An alert system that cries wolf isn't broken — it's traumatizing the shepherd. We're not fixing alerts. We're restoring trust between humans and their machines." — Maya


Phase 1: EMPATHIZE 🎧

Design is jazz. You don't start by playing — you start by listening. And right now, the on-call world is screaming a blues riff at 3am, and nobody's transcribing the melody. Let's sit with these people. Let's feel what they feel before we dare to build anything.


Persona 1: Priya Sharma — The On-Call Engineer

Age: 28 | Role: Backend Engineer, On-Call Rotation | Company: Mid-stage fintech startup, 85 engineers | Slack Status: 🔴 "on-call until Thursday, pray for me"

Empathy Map

SAYS:

  • "I got paged 6 times last night. Five were nothing."
  • "I just ack everything now. I'll look at it in the morning if it's still firing."
  • "I used to love this job. Now I dread Tuesdays." (her on-call day)
  • "Can someone PLEASE fix the checkout-latency alert? It fires every deploy."
  • "I'm not burned out, I'm just... tired." (she's burned out)

THINKS:

  • "If I mute this channel, will I miss the one real incident?"
  • "My manager says on-call is 'shared responsibility' but I've been on rotation 3x more than anyone else this quarter."
  • "I wonder if that startup down the street has on-call this bad."
  • "What if I just... don't answer? What's the worst that happens?"
  • "I'm mass-acking alerts at 3am. This is not engineering. This is whack-a-mole."

DOES:

  • Sets multiple alarms because she doesn't trust herself to wake up for pages anymore — her brain has learned to sleep through them
  • Keeps a personal "ignore list" in a Notion doc — alerts she's learned are always noise
  • Spends the first 20 minutes of every incident figuring out if it's real
  • Writes angry Slack messages in #sre-gripes at 3:17am
  • Checks the deploy log manually every single time an alert fires
  • Takes a "recovery day" after bad on-call nights (uses PTO, doesn't tell anyone why)

FEELS:

  • Anxiety: The phantom vibration in her pocket. Even off-call, she flinches when her phone buzzes.
  • Resentment: Toward the teams that ship noisy services and never fix their alerts.
  • Guilt: When she mutes channels or acks without investigating.
  • Isolation: Nobody who isn't on-call understands what it's like. Her partner thinks she's "just checking her phone."
  • Helplessness: She's filed 12 tickets to fix noisy alerts. 2 got addressed. The rest are "backlog."

Pain Points

  1. Signal-to-noise ratio is catastrophic — 80-90% of pages are non-actionable
  2. Context is missing — Alert says "high latency on prod-api-12" but doesn't say WHY or whether it matters
  3. No correlation — She has to manually check if a deploy just happened, if other services are affected, if it's a known pattern
  4. Alert ownership is broken — Nobody owns the noisy alerts. The team that created the service moved on. The alert is an orphan.
  5. Recovery time is invisible — Management doesn't see the cognitive cost of a bad night. She's 40% less productive the next day but nobody measures that.
  6. Tools fragment her attention — Alerts come from Datadog, PagerDuty, Slack, and email. She has to context-switch across 4 tools to understand one incident.

Current Workarounds

  • Personal Notion "ignore list" of known-noisy alerts
  • Slack keyword muting for specific alert patterns
  • A bash script she wrote that checks the deploy log when she gets paged (she's automated her own triage)
  • Group chat with other on-call engineers where they share "is this real?" messages at 3am
  • Coffee. So much coffee.
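Priya's deploy-log check (the bash script above) can be sketched in a few lines. This is a hypothetical reconstruction under assumed inputs, not her actual script: the `(timestamp, service)` deploy-log format, the 15-minute window, and the `recent_deploys` name are all invented for illustration.

```python
# Hypothetical sketch of Priya's "did a deploy just happen?" check.
# Assumes the deploy log is a list of (timestamp, service) tuples; the
# real script presumably queries the team's deploy tooling instead.
from datetime import datetime, timedelta

def recent_deploys(deploy_log, service, alert_time, window_minutes=15):
    """Return deploys to `service` in the window before the alert fired."""
    window = timedelta(minutes=window_minutes)
    return [
        (ts, svc) for ts, svc in deploy_log
        if svc == service and timedelta(0) <= alert_time - ts <= window
    ]

deploy_log = [
    (datetime(2026, 2, 28, 11, 34), "payment-service"),
    (datetime(2026, 2, 27, 9, 0), "checkout"),
]
alert_time = datetime(2026, 2, 28, 11, 42)
hits = recent_deploys(deploy_log, "payment-service", alert_time)
# hits contains the 11:34 deploy, 8 minutes before the page
```

A non-empty result means the page is very likely deploy-induced: the 12-minute manual check from her 11:42 AM diary entry, compressed to one lookup.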

Jobs to Be Done (JTBD)

  • When I get paged at 3am, I want to instantly know if this is real and what to do, so I can either fix it fast or go back to sleep.
  • When I start my on-call shift, I want to know what's currently broken and what's just noise, so I can mentally prepare and not waste energy on false alarms.
  • When an incident is happening, I want to see all related alerts grouped together with context, so I can focus on the root cause instead of chasing symptoms.
  • When my on-call shift ends, I want to hand off cleanly with a summary, so I can actually disconnect and recover.

Day-in-the-Life: Tuesday (On-Call Day)

6:30 AM — Alarm goes off. Checks phone immediately. 14 alerts overnight. Scrolls through them in bed. 12 are the usual suspects (checkout-latency, disk-usage-warning, the zombie alert from the decommissioned auth service). 2 look potentially real. Gets up with a knot in her stomach.

9:15 AM — Standup. "Anything from on-call?" She says "quiet night" because explaining 14 alerts that were all noise is exhausting and nobody wants to hear it.

11:42 AM — Page: "Error rate spike on payment-service." Heart rate jumps. Opens Datadog. Opens PagerDuty. Opens Slack. Checks the deploy log — yes, someone deployed 8 minutes ago. Checks the PR. It's a config change. Error rate is already recovering. Acks the alert. Total time: 12 minutes. Actual action needed: zero.

2:15 PM — Trying to do actual feature work. Gets paged again. Same checkout-latency alert that fires every afternoon during peak traffic. Acks in 3 seconds without looking. Goes back to coding. Loses her flow state. Takes 20 minutes to get back into the problem.

3:47 AM (Wednesday) — Phone screams. Bolts awake. Heart pounding. "Database connection pool exhausted on prod-db-primary." This one is real. But she doesn't know that yet. Spends 8 minutes triaging — checking if it's the flapper, checking if there was a deploy, checking if other services are affected. By the time she confirms it's real and starts the runbook, it's been 11 minutes. MTTR clock is ticking.

4:22 AM — Incident resolved. Wide awake now. Adrenaline. Can't sleep. Opens Twitter. Sees a meme about on-call life. Laughs bitterly. Considers updating her LinkedIn.

9:00 AM (Wednesday) — Shows up to work exhausted. Manager asks about the incident. She explains. Manager says "great job." She thinks: "Great job would be not getting paged for garbage 6 times before the real one."


Persona 2: Marcus Chen — The SRE/Platform Lead

Age: 34 | Role: Senior SRE / Platform Team Lead | Company: Series C SaaS company, 140 engineers, 8 SREs | Slack Status: 📊 "Reviewing Q1 on-call metrics (they're bad)"

Empathy Map

SAYS:

  • "We need to fix our alert hygiene. I've been saying this for two quarters."
  • "I can't force product teams to fix their alerts. I can only write guidelines nobody reads."
  • "Our MTTR is 34 minutes. Industry benchmark is 15. I know why, but I can't fix it alone."
  • "PagerDuty costs us $47/seat and half the alerts it sends are noise."
  • "I need a way to show leadership that alert quality is an engineering productivity problem, not just an SRE problem."

THINKS:

  • "I know which alerts are noise. I've known for months. But fixing them requires buy-in from 6 different teams and none of them prioritize it."
  • "If I suppress alerts aggressively, and something breaks, it's MY head on the block."
  • "The junior engineers on rotation are getting destroyed. I can see it in their faces. I need to protect them but I don't have the tools."
  • "I could build something internally... but I've been saying that for a year and we never have the bandwidth."
  • "Am I going to spend my entire career fighting alert noise? Is this really what SRE is?"

DOES:

  • Runs a monthly "alert review" meeting that nobody wants to attend
  • Maintains a spreadsheet tracking alert-to-incident ratios per service (manually updated, always out of date)
  • Writes alert rules for other teams because they won't write good ones themselves
  • Spends 30% of his time on alert tuning instead of platform work
  • Advocates for "alert budgets" per team (like error budgets) — leadership likes the idea but won't enforce it
  • Reviews every postmortem looking for "was the right alert the first alert?" (answer is usually no)

FEELS:

  • Frustration: He has the expertise to fix this but not the organizational leverage. He's a platform lead, not a VP.
  • Responsibility: Every bad on-call night for his team feels like his failure. He set up the rotation. He should have fixed the noise.
  • Exhaustion: Alert tuning is Sisyphean. Fix 10 noisy alerts, 15 new ones appear because someone shipped a new service with default thresholds.
  • Professional anxiety: His MTTR metrics look bad. Leadership sees numbers, not the nuance of why.
  • Loneliness: He's the only one who sees the full picture. Product teams see their alerts. He sees ALL the alerts. The view is terrifying.

Pain Points

  1. No leverage over alert quality — He can write guidelines, but can't force teams to follow them. Alert quality is a tragedy of the commons.
  2. Manual correlation is his full-time job — He's the human correlation engine. When an incident happens, HE connects the dots across services because no tool does it.
  3. Metrics are hard to produce — Proving that alert noise costs money requires data he has to manually compile. Leadership wants dashboards, not spreadsheets.
  4. Tool sprawl — His team uses Datadog for metrics, Grafana for some dashboards, PagerDuty for paging, OpsGenie for some teams that refused PagerDuty. He's managing 4 alerting surfaces.
  5. The cold-start problem with every new service — New services launch with terrible default alerts. By the time they're tuned, the team has already suffered through weeks of noise.
  6. Retention risk — He's lost 2 engineers in the past year who cited on-call burden. Recruiting replacements took 4 months each.

Current Workarounds

  • The spreadsheet. Always the spreadsheet.
  • Monthly "alert amnesty" where teams can delete alerts without judgment (attendance: poor)
  • A Slack bot he hacked together that counts alerts per channel per day (it breaks constantly)
  • Manually tagging alerts as "noise" or "signal" in postmortem docs
  • Begging product managers to prioritize alert fixes by framing it as "developer productivity"
  • Taking on-call shifts himself to "lead by example" (and to spare his junior engineers)

Jobs to Be Done (JTBD)

  • When I'm reviewing on-call health, I want to see exactly which alerts are noise and which are signal across all teams, so I can prioritize fixes with data instead of gut feel.
  • When a new service launches, I want to automatically apply intelligent alert defaults, so I can prevent the cold-start noise problem.
  • When I'm presenting to leadership, I want to show the business cost of alert noise (MTTR impact, engineer hours wasted, attrition risk), so I can get budget and priority for fixing it.
  • When an incident is in progress, I want to see correlated alerts across all services and tools in one view, so I can guide the response team to the root cause faster.

Day-in-the-Life: Monday

8:00 AM — Opens his alert metrics spreadsheet. Last week: 1,247 alerts across all teams. 89 resulted in actual incidents. That's a 7.1% signal rate. He's been tracking this for 6 months. It's getting worse, not better.

9:30 AM — Alert review meeting. 3 of 8 team leads show up. They review the top 10 noisiest alerts. Everyone agrees they should be fixed. Nobody commits to a timeline. Marcus assigns himself 4 of them because nobody else will.

11:00 AM — Gets pulled into an incident. Payment service is throwing errors. He immediately checks: was there a deploy? (Yes, 20 minutes ago.) Are other services affected? (He checks 3 dashboards to find out — yes, the downstream notification service is also erroring.) He connects the dots in 6 minutes. The on-call engineer had been looking at the notification service errors for 15 minutes without realizing the root cause was upstream.

1:30 PM — Writes a postmortem for last week's P1. In the "what went well / what didn't" section, he writes: "The first alert that fired was a symptom, not the cause. The causal alert fired 4 minutes later but was buried in 23 other alerts." He's written this same sentence in 11 different postmortems.

3:00 PM — 1:1 with his manager (Director of Engineering). Manager asks about MTTR. Marcus shows the spreadsheet. Manager says "Can you get this into a dashboard?" Marcus thinks: "With what time?"

5:30 PM — Reviewing PagerDuty bill. $47/seat × 40 engineers on rotation = $1,880/month. For a tool that faithfully delivers noise to people's phones at 3am. He wonders if there's something better.

7:00 PM — At home. Gets a Slack DM from a junior engineer: "Hey Marcus, I'm on-call tonight. Any tips for the checkout-latency alert? It fired 8 times last night for Priya." He sends her his personal runbook. He thinks about building an internal tool. Again. He opens a beer instead.


Persona 3: Diana Okafor — The VP of Engineering

Age: 41 | Role: VP of Engineering | Company: Same Series C SaaS, 140 engineers, reports to CTO | Slack Status: Rarely on Slack. Lives in Google Docs and Zoom.

Empathy Map

SAYS:

  • "Our MTTR is 34 minutes. The board wants it under 15. What's the plan?"
  • "I keep hearing about alert fatigue but I need data, not anecdotes."
  • "We lost two SREs last quarter. Recruiting is taking forever. We need to fix the on-call experience."
  • "I'm not going to approve another $50K tool unless someone can show me ROI in the first quarter."
  • "Why are we paying Datadog $180K/year and PagerDuty $22K/year and still having these problems?"

THINKS:

  • "On-call burnout is a retention problem disguised as a tooling problem. Or is it the other way around?"
  • "If we have another major incident where the alert was missed because of noise, the CTO is going to ask me hard questions I don't have answers to."
  • "Marcus keeps asking for headcount. I believe him that the team is stretched, but I need to justify it with metrics the CFO will accept."
  • "The engineers complain about on-call but I don't have visibility into what's actually happening. I see incident counts and MTTR. I don't see the human cost."
  • "We're spending $200K+/year on monitoring and alerting tools. Are we getting $200K of value?"

DOES:

  • Reviews MTTR and incident count dashboards weekly (surface-level metrics that don't capture the real problem)
  • Approves tool purchases based on ROI projections and vendor demos (has been burned by tools that demo well but don't deliver)
  • Runs quarterly engineering satisfaction surveys — "on-call experience" has been the #1 complaint for 3 consecutive quarters
  • Asks Marcus for "the alert noise number" before board meetings (Marcus scrambles to update his spreadsheet)
  • Compares their incident metrics to industry benchmarks and doesn't like what she sees
  • Has started mentioning "alert fatigue" in leadership meetings because she read a Gartner report about it

FEELS:

  • Accountability pressure: She owns engineering productivity. If MTTR is bad, it's her problem. If engineers quit, it's her problem.
  • Information asymmetry: She knows something is wrong but can't see the details. She's dependent on Marcus's spreadsheets and anecdotal reports.
  • Budget anxiety: Every new tool is a line item she has to defend. The CFO questions every SaaS subscription over $10K/year.
  • Empathy (distant): She was an engineer once. She remembers bad on-call nights. But it's been 8 years since she was in rotation, and the scale of the problem has changed.
  • Strategic concern: Competitors are shipping faster. If her engineers are spending 30% of their cognitive energy on alert noise, that's 30% less innovation.

Pain Points

  1. No single metric for alert health — She has MTTR, incident count, and anecdotes. She needs a "noise score" or "alert quality index" she can track over time and present to the board.
  2. ROI of monitoring tools is unmeasurable — She's spending $200K+/year on Datadog + PagerDuty + Grafana. She can't quantify what she's getting for that money.
  3. Attrition is expensive and invisible — Losing an SRE costs $150-300K (recruiting + ramp + lost institutional knowledge). Alert fatigue drives attrition. But the causal chain is hard to prove to a CFO.
  4. Tool fatigue — Her teams already use too many tools. Adding another one is a hard sell unless it REPLACES something or has undeniable, immediate value.
  5. Compliance risk — They're in fintech. Missed alerts could mean regulatory issues. She loses sleep over this (ironic, given the product).
  6. No visibility into cross-team alert patterns — She doesn't know that Team A's noisy alerts are causing Team B's MTTR to spike because of shared dependencies.

Current Workarounds

  • Marcus's spreadsheet (she knows it's manual and incomplete, but it's all she has)
  • Quarterly "on-call health" reviews that produce action items nobody follows up on
  • Throwing headcount at the problem (hiring more SREs to spread the on-call load)
  • Vendor calls with PagerDuty's "customer success" team that result in no meaningful changes
  • Asking engineering managers to "prioritize alert hygiene" without giving them dedicated time to do it

Jobs to Be Done (JTBD)

  • When I'm preparing for a board meeting, I want to show a clear metric for operational health that includes alert quality, so I can demonstrate that we're improving (or justify investment if we're not).
  • When I'm evaluating a new tool, I want to see projected ROI based on our actual data within the first week, so I can make a fast, confident buy decision.
  • When an engineer quits citing on-call burden, I want to have data showing exactly how bad their on-call experience was, so I can fix the systemic issue instead of just backfilling the role.
  • When I'm allocating engineering time, I want to know which teams have the worst alert noise, so I can direct investment where it has the most impact.

Day-in-the-Life: Wednesday

7:30 AM — Checks email. CTO forwarded a Gartner report: "AIOps Market to Reach $40B by 2028." Attached note: "Should we be looking at this?" She adds it to her reading list.

9:00 AM — Leadership standup. CTO asks about the P1 incident from last week. Diana gives the summary. CTO asks: "Why did it take 34 minutes to respond?" Diana says: "The on-call engineer was triaging other alerts when it fired." CTO's eyebrow goes up. Diana makes a mental note to talk to Marcus.

10:30 AM — 1:1 with Marcus. He shows her the spreadsheet: 1,247 alerts last week, 89 real incidents. She does the math: 93% noise. She asks: "Can we get this to 50% noise?" Marcus says: "Not without dedicated engineering time from every team, or a tool that does it for us." She asks him to evaluate options.

12:00 PM — Lunch with the Head of Recruiting. They discuss the two open SRE roles. Average time-to-fill for SREs in their market: 67 days. Cost per hire: $45K (agency fees + interview time). Diana thinks about how much cheaper it would be to just not burn out the SREs they have.

2:00 PM — Quarterly planning. She's trying to allocate 15% of engineering time to "platform health" but product managers are pushing back. They want features. She needs ammunition — hard data showing that alert noise is costing them feature velocity.

4:00 PM — Reviews the engineering satisfaction survey results. On-call experience: 2.1 out of 5. Comments include: "I dread my on-call weeks," "The alerts are mostly useless," and "I'm considering leaving if this doesn't improve." She highlights these for the CTO.

6:30 PM — Driving home. Thinks about the $200K monitoring bill. Thinks about the 2 engineers who left. Thinks about the 34-minute MTTR. Thinks: "There has to be a better way." Opens LinkedIn at a red light. Sees an ad for yet another AIOps platform. Closes LinkedIn.


Phase 2: DEFINE 🎯

"The problem is never what people say it is. Priya says 'too many alerts.' Marcus says 'bad alert hygiene.' Diana says 'high MTTR.' They're all describing the same elephant from different angles. Our job is to see the whole animal."


Point-of-View Statements

A POV statement crystallizes the tension: [User] needs [need] because [insight].

Priya (On-Call Engineer)

Priya, a dedicated backend engineer on a weekly on-call rotation, needs to instantly distinguish real incidents from noise at 3am because her brain has been conditioned to ignore alerts — and the one time she shouldn't ignore one, she will.

The deeper insight: Priya's problem isn't volume. It's trust erosion. Every false alarm trains her nervous system to stop caring. The alert system is literally conditioning her to fail at the one moment it matters most. This is a Pavlovian tragedy.

Marcus (SRE/Platform Lead)

Marcus, an SRE lead responsible for operational health across 8 teams, needs a way to make alert quality visible and actionable across the organization because he currently holds all the correlation knowledge in his head — and that knowledge walks out the door when he goes on vacation.

The deeper insight: Marcus is a human AIOps engine. He IS the correlation layer. He IS the deduplication algorithm. The organization has outsourced its alert intelligence to one person's brain. That's not a process — that's a single point of failure wearing a hoodie.

Diana (VP of Engineering)

Diana, a VP of Engineering accountable for engineering productivity and retention, needs a single, defensible metric for alert health because she's fighting a war she can't measure — and in leadership, what you can't measure, you can't fund.

The deeper insight: Diana's problem is translation. She needs to convert "Priya had a terrible night" into "$47,000 in lost productivity and attrition risk" — a language the CFO and board understand. Without that translation layer, alert fatigue remains an anecdote, not a budget line item.
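Diana's translation layer can be made concrete with a back-of-envelope model. The formula and every rate in it (triage minutes, loaded hourly rate, next-day recovery tax) are illustrative assumptions, not figures from this session:

```python
# Illustrative back-of-envelope model (not from the session) for turning
# alert noise into a dollar figure. Every rate below is an assumption.
def noise_cost_per_week(alerts, signal_rate, triage_minutes=10,
                        loaded_rate_per_hour=120, next_day_tax_hours=3.0,
                        night_pages_per_week=5):
    """Weekly cost of noise: triage time on false alarms plus the
    next-day productivity tax from night pages."""
    noisy = alerts * (1 - signal_rate)
    triage_hours = noisy * triage_minutes / 60
    recovery_hours = night_pages_per_week * next_day_tax_hours
    return (triage_hours + recovery_hours) * loaded_rate_per_hour

# Marcus's week: 1,247 alerts at a 7.1% signal rate
weekly = noise_cost_per_week(alerts=1247, signal_rate=0.071)
# weekly lands on the order of $25K under these assumptions
```

Even with conservative inputs, one team's noise compounds into a number a CFO can act on, which is the whole point of the translation layer.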


Key Insights

  1. Trust, not technology, is the product. The #1 barrier to adoption isn't feature gaps — it's that engineers won't trust AI to suppress alerts. One missed critical alert = permanent distrust. The trust ramp IS the product challenge.

  2. Alert fatigue is a tragedy of the commons. No single team owns the problem. Team A's noisy service creates Team B's on-call nightmare. Without organizational visibility, everyone optimizes locally and the system degrades globally.

  3. The correlation knowledge is trapped in human brains. Marcus knows that "DB slow + API errors + frontend timeouts = one incident." That knowledge isn't in any tool. When Marcus is unavailable, MTTR doubles because nobody else can connect the dots.

  4. Metrics exist at the wrong altitude. Diana sees MTTR (too high-level). Priya sees individual alerts (too low-level). Nobody sees the middle layer: alert quality, noise ratios, correlation patterns, cost-per-false-alarm. This middle layer is where decisions should be made.

  5. The economic cost is real but invisible. Alert fatigue drives attrition ($150-300K per lost SRE), inflates MTTR (3-5x longer with high noise), reduces next-day productivity (40% cognitive tax), and creates compliance risk. But nobody has a dashboard that shows this. The cost hides in plain sight.

  6. Engineers don't want another tool — they want fewer interruptions. The bar for a new tool is astronomically high. It must live where they already are (Slack), require zero behavior change, and prove value before asking for commitment.

  7. The cold-start paradox. AI needs data to be smart. Day 1, it's dumb. But Day 1 is when you need to prove value. The solution: rule-based intelligence first (time-window clustering, deployment correlation) that works immediately, with ML that gets smarter over time.

  8. Transparency is non-negotiable. Engineers will reject a black box. Every suppression, every grouping, every priority score must be explainable. "We suppressed this because..." is the most important sentence in the product.
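The rule-based Day-1 layer from insight 7 needs no ML at all: greedy time-window clustering plus deploy correlation. The dict fields, the 5- and 15-minute windows, and both function names below are assumptions chosen for illustration:

```python
# Sketch of rule-based alert grouping: cluster alerts that fire close
# together, then flag clusters that follow a deploy. Windows and field
# names are illustrative assumptions.
from datetime import datetime, timedelta

def cluster_alerts(alerts, window=timedelta(minutes=5)):
    """Group time-sorted alerts; a gap longer than `window` starts a new group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if groups and alert["time"] - groups[-1][-1]["time"] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups

def correlate_with_deploys(group, deploys, window=timedelta(minutes=15)):
    """Return deploys that landed shortly before the group's first alert."""
    start = group[0]["time"]
    return [d for d in deploys if timedelta(0) <= start - d["time"] <= window]

alerts = [
    {"name": "db-conn-pool", "time": datetime(2026, 2, 28, 3, 47)},
    {"name": "api-errors",   "time": datetime(2026, 2, 28, 3, 49)},
    {"name": "disk-usage",   "time": datetime(2026, 2, 28, 6, 10)},
]
groups = cluster_alerts(alerts)
# groups: [db-conn-pool + api-errors] and, separately, [disk-usage]
```

Priya sees one candidate incident instead of two pages for the 3:47 cluster, and if a deploy landed minutes earlier, the summary can say so — exactly the correlation Marcus currently performs in his head.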


Core Tension

Here's the fundamental tension at the heart of dd0c/alert — the jazz dissonance that makes this product interesting:

Engineers need FEWER alerts to do their jobs, but they're TERRIFIED of missing the one that matters.

This is not a feature problem. This is a psychological safety problem. The product must simultaneously:

  • Suppress aggressively (to deliver the 70-90% noise reduction that makes it worth buying)
  • Never miss a critical alert (to maintain the trust that makes it usable)
  • Prove it's working (to justify the suppression to skeptical engineers AND budget-conscious VPs)

The resolution of this tension is graduated trust. You don't suppress on Day 1. You SHOW what you WOULD suppress. You let the engineer confirm. You build a track record. You earn the right to act autonomously. Like a new musician sitting in with a jazz ensemble — you listen for 3 songs before you solo.
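The graduated-trust ramp can be sketched as a score that rises slowly on confirmed calls and drops sharply on a miss, gating which behaviors the system may take. All thresholds and step sizes here are invented for illustration:

```python
# Minimal sketch of a graduated-trust ramp: the system starts by only
# annotating, earns autonomy as engineers confirm its calls, and loses
# it fast on a miss. Thresholds and step sizes are assumptions.
class TrustRamp:
    def __init__(self):
        self.score = 0.0  # ranges 0.0 to 1.0

    def record(self, prediction_confirmed):
        if prediction_confirmed:
            self.score = min(1.0, self.score + 0.02)  # trust builds slowly
        else:
            self.score = max(0.0, self.score - 0.25)  # and is lost quickly

    @property
    def mode(self):
        if self.score >= 0.8:
            return "auto-suppress"
        if self.score >= 0.4:
            return "suggest"
        return "observe"

ramp = TrustRamp()
for _ in range(25):
    ramp.record(True)   # 25 confirmed calls lift the score to 0.5
# ramp.mode is now "suggest"; a single miss drops it back to "observe"
```

The jazz analogy holds: the system listens (observe), then comments (suggest), and only solos (auto-suppress) after a sustained track record. One wrong note sends it back to listening.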


How Might We (HMW) Questions

These are the creative springboards. Each one opens a design space.

Trust & Transparency

  1. HMW build trust with on-call engineers who've been burned by every "smart" alerting tool before?
  2. HMW make alert suppression decisions transparent and reversible, so engineers feel in control even when AI is acting?
  3. HMW create a "trust score" that grows over time, unlocking more autonomous behavior as the system proves itself?

Signal vs. Noise

  1. HMW reduce alert volume by 70%+ without ever suppressing a genuinely critical alert?
  2. HMW help engineers distinguish "this is noise" from "this is a symptom of something real" in under 10 seconds?
  3. HMW automatically correlate alerts across multiple monitoring tools so engineers see ONE incident instead of 47 symptoms?

Organizational Visibility

  1. HMW give Marcus (SRE lead) a real-time view of alert quality across all teams without requiring manual data collection?
  2. HMW translate alert noise into business metrics (dollars, hours, attrition risk) that Diana can present to the board?
  3. HMW create accountability for alert quality without turning it into a blame game between teams?

Developer Experience

  1. HMW deliver value in the first 60 seconds of setup, before the AI has learned anything?
  2. HMW make Slack the primary interface so engineers never have to learn a new tool?
  3. HMW provide context-rich incident summaries that eliminate the "is this real?" triage phase entirely?

Learning & Adaptation

  1. HMW learn from engineer behavior (acks, snoozes, ignores) without requiring explicit feedback?
  2. HMW handle the cold-start problem so the product is useful on Day 1, not Day 30?
  3. HMW adapt to each team's unique noise profile instead of applying one-size-fits-all rules?
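The implicit-feedback idea in the first question above can be sketched as weighted behavior signals; the event names and weights are assumptions, not a spec:

```python
# Sketch of implicit-feedback learning: engineer behaviors become
# weighted votes on whether an alert is signal or noise, with no
# explicit feedback requested. Names and weights are assumptions.
IMPLICIT_WEIGHTS = {
    "instant_ack": -1.0,    # acked in seconds without opening anything
    "ignored":     -0.5,    # never acknowledged, auto-resolved
    "snoozed":     -0.3,
    "investigated": 2.0,    # dashboards opened, runbook followed
}

def noise_score(events):
    """Higher = more likely signal; strongly negative = probably noise."""
    return sum(IMPLICIT_WEIGHTS.get(e, 0.0) for e in events)

history = ["instant_ack", "instant_ack", "ignored", "snoozed"]
score = noise_score(history)   # sums to about -2.8: a strong noise candidate
```

Priya's 3-second afternoon acks and her Notion ignore list would feed this signal for free, turning her coping behavior into training data.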

Human Cost

  1. HMW measure and make visible the human cost of alert fatigue (sleep disruption, cognitive load, burnout)?
  2. HMW protect junior engineers on rotation from the worst of the noise while they're still learning?
  3. HMW turn on-call from a dreaded obligation into a manageable, even empowering experience?