dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
products/05-aws-cost-anomaly/design-thinking/session.md
# dd0c/cost — Design Thinking Session

## AWS Cost Anomaly Detective

**Facilitator:** Maya, Design Thinking Maestro

**Date:** February 28, 2026

**Product:** dd0c/cost (Product #5 — "The Gateway Drug")

**Philosophy:** Design is about THEM, not us.

---

> *"The best products don't solve problems. They dissolve the anxiety that surrounds them. When a startup CTO sees a bill that's 4x what they expected, the problem isn't the bill — it's the 47 seconds of pure existential dread before they can even begin to understand WHY."*
---

# Phase 1: EMPATHIZE

We're not building a cost tool. We're building an anxiety medication for cloud infrastructure. Let's meet the humans who need it.

---
## Persona 1: The Startup CTO — "Alex"

**Demographics:** 32 years old. Series A startup, 12 engineers. Wears the CTO/VP Eng/DevOps hat simultaneously. Personally signed the AWS Enterprise Agreement. The board sees every line item.

**The Moment That Defines Alex:**
It's Tuesday morning, 7:14 AM. Alex is brushing their teeth when a Slack notification buzzes. The CFO has forwarded the AWS billing alert email: "Your estimated charges for this billing period have exceeded $8,000." Last month was $2,100. Alex's stomach drops. Toothbrush still in mouth. They open the AWS Console on their phone. Cost Explorer takes 11 seconds to load on mobile. The bar chart shows a spike but doesn't say WHERE or WHY. Alex is now going to be late for the 8 AM standup, and they'll spend the entire meeting distracted, mentally running through every possible cause. Was it the new feature deploy? Did someone spin up a big instance? Is it a data transfer thing? They don't know. They won't know for hours.
### Empathy Map

**SAYS:**
- "Can someone check why our AWS bill spiked?"
- "We need to be more careful about resource management"
- "I'll look into it after standup"
- "We can't afford surprises like this"
- "Who launched that instance?"
- "Do we even need that RDS cluster in staging?"

**THINKS:**
- "This is going to come up at the board meeting"
- "I should have set up billing alerts months ago"
- "Is this my fault for not having better guardrails?"
- "What if this keeps happening and we burn through runway?"
- "I don't have time to become a FinOps expert on top of everything else"
- "The investors are going to ask about our burn rate"

**DOES:**
- Opens AWS Cost Explorer (waits for it to load, gets frustrated)
- Manually checks EC2 console, RDS console, Lambda console — one by one
- Searches CloudTrail logs trying to correlate events with cost spikes
- Asks in Slack: "Did anyone spin up anything big recently?"
- Creates a spreadsheet to track monthly costs (abandons it by month 3)
- Sets a billing alarm at 80% of budget (but the alarm fires 48 hours late)

**FEELS:**
- **Panic** — the visceral gut-punch of an unexpected bill
- **Helpless** — AWS gives data but not answers
- **Guilty** — "I should have caught this sooner"
- **Overwhelmed** — too many consoles, too many services, not enough time
- **Exposed** — the board/investors will see this number
- **Alone** — nobody else on the team understands AWS billing
### Pain Points
1. **The 48-hour blindspot** — By the time Cost Explorer shows the spike, thousands are already burned
2. **No attribution** — "EC2 costs went up" tells you nothing about WHICH instance or WHO launched it
3. **Context-switching hell** — Diagnosing a cost issue requires jumping between 5+ AWS consoles
4. **Personal liability** — At a startup, the CTO's name is on the account. The bill feels personal.
5. **Time poverty** — Alex has 47 other priorities. Cost management is important but never urgent — until it's an emergency
6. **Knowledge gap** — Alex is a great engineer but not a FinOps specialist. AWS billing is deliberately opaque.

### Current Workarounds
- AWS Billing Alerts (delayed, no context, email-only)
- Monthly manual review of Cost Explorer (reactive, not proactive)
- Asking in Slack "who did this?" (blame-oriented, unreliable)
- Spreadsheet tracking (abandoned within weeks)
- Hoping for the best (the most common strategy)

### Jobs To Be Done (JTBD)
- **When** I see an unexpected AWS charge, **I want to** instantly understand what caused it and who's responsible, **so I can** fix it before it gets worse and explain it to stakeholders.
- **When** I'm planning our monthly budget, **I want to** confidently predict our AWS spend, **so I can** give the board accurate numbers and not look incompetent.
- **When** a new service or resource is created, **I want to** know immediately if it's going to be expensive, **so I can** intervene before costs accumulate.

### Day-in-the-Life Scenario

**6:45 AM** — Wake up, check phone. No alerts. Good.

**7:14 AM** — CFO Slack: "Why is AWS $8K?" Stomach drops.

**7:15-7:55 AM** — Frantically clicking through AWS Console on laptop. Cost Explorer shows EC2 spike but no details. Check CloudTrail — hundreds of events, no obvious culprit.

**8:00 AM** — Standup. Distracted. Mentions "looking into a billing issue."

**8:30-10:00 AM** — Deep dive. Finally discovers: a developer launched 4x p3.2xlarge GPU instances for an ML experiment on Thursday. They're still running. That's $12.24/hour × 96 hours = $1,175 burned. The developer forgot.

**10:05 AM** — Terminates the instances. Sends a Slack message to the team about resource management. Feels like a hall monitor.

**10:30 AM** — Writes a "cloud cost policy" doc. Nobody will read it.

**11:00 AM** — Back to actual work, 3 hours behind schedule.

**Next month** — It happens again. Different resource. Same panic.
---

## Persona 2: The FinOps Analyst — "Jordan"

**Demographics:** 28 years old. Mid-size SaaS company, 150 engineers, 23 AWS accounts. Jordan's title is "Cloud Financial Analyst" but everyone calls them "the cost person." Reports to VP of Engineering and dotted-line to Finance. The only person in the company who understands AWS billing at a granular level.

**The Moment That Defines Jordan:**
It's the last Thursday of the month. Jordan has spent the past 3 days building the monthly cloud cost report. They have 14 browser tabs open: Cost Explorer for 6 different accounts, 3 spreadsheets, a Confluence page, and the AWS CUR data in Athena. The VP of Engineering wants the report by Friday EOD. The CFO wants it "in a format Finance can understand." Jordan is translating between two worlds — engineering resource names and financial line items — and neither side appreciates how hard that translation is. They just found a $4,200 discrepancy between Cost Explorer and the CUR data and have no idea which one is right.
### Empathy Map

**SAYS:**
- "I need the teams to tag their resources properly"
- "The CUR data doesn't match Cost Explorer — again"
- "Can we get a meeting to discuss the tagging strategy?"
- "This account's spend is 40% over forecast"
- "I've been asking for this data for two weeks"
- "No, I can't tell you the cost per request. We don't have that granularity."

**THINKS:**
- "Nobody takes tagging seriously until the bill is a disaster"
- "I'm a single point of failure for cost visibility in this entire company"
- "If I got hit by a bus, nobody could produce this report"
- "I wish I could automate 80% of what I do"
- "The engineering teams think I'm the cost police. I'm trying to help them."
- "There has to be a better way than 14 spreadsheets"

**DOES:**
- Downloads CUR data daily, loads into Athena, runs custom queries
- Maintains a master spreadsheet mapping AWS accounts → teams → budgets
- Sends weekly cost summaries to team leads (most don't read them)
- Manually investigates anomalies by cross-referencing CUR, CloudTrail, and Cost Explorer
- Attends FinOps Foundation meetups to learn best practices
- Builds custom dashboards in QuickSight (they break every time AWS changes the CUR schema)

**FEELS:**
- **Frustrated** — the tools are inadequate and nobody understands the complexity
- **Undervalued** — cost optimization saves hundreds of thousands but gets no glory
- **Anxious** — one missed anomaly and it's Jordan's fault
- **Isolated** — the only person who speaks both "engineering" and "finance"
- **Exhausted** — the work is repetitive, manual, and never-ending
- **Determined** — genuinely believes cost optimization matters and wants to prove it
### Pain Points
1. **Manual data wrangling** — 60% of Jordan's time is spent collecting, cleaning, and reconciling data, not analyzing it
2. **Tagging chaos** — Teams don't tag consistently. Untagged resources are a black hole of unattributable cost.
3. **Multi-account complexity** — 23 accounts with different owners, different conventions, different levels of maturity
4. **No real-time visibility** — CUR is hourly at best, Cost Explorer is 24-48 hours delayed. Jordan is always looking backward.
5. **Stakeholder translation** — Engineering wants resource-level detail. Finance wants department-level summaries. Jordan manually bridges the gap.
6. **Tool fragmentation** — Uses Cost Explorer + CUR + Athena + QuickSight + spreadsheets + Slack. No single source of truth.

### Current Workarounds
- Custom Athena queries on CUR data (brittle, requires SQL expertise)
- Master spreadsheet updated manually every week (error-prone)
- QuickSight dashboards (break when CUR schema changes)
- Slack reminders to team leads about their budgets (ignored)
- Monthly "cost review" meetings (dreaded by everyone)
- AWS Cost Anomaly Detection (too many false positives, no actionable context)

### Jobs To Be Done (JTBD)
- **When** I'm preparing the monthly cost report, **I want to** automatically aggregate costs by team, environment, and service with accurate attribution, **so I can** deliver the report in hours instead of days.
- **When** an anomaly is detected, **I want to** immediately see the root cause with full context (who, what, when, why), **so I can** resolve it without a 3-hour investigation.
- **When** a team exceeds their budget, **I want to** automatically notify the team lead with specific recommendations, **so I can** scale cost governance without being the bottleneck.
### Day-in-the-Life Scenario

**8:00 AM** — Open laptop. 47 unread emails. 12 are AWS billing notifications from various accounts. Triage: most are noise.

**8:30 AM** — Check yesterday's CUR data in Athena. Run the anomaly detection query Jordan wrote. 3 flagged items. One is a real issue (new RDS instance in account #17), two are false positives (monthly batch job, expected).

**9:00 AM** — Slack the owner of account #17: "Hey, there's a new db.r5.4xlarge in us-west-2. Is this expected?" No response for 2 hours.

**9:15 AM** — Start building the weekly cost summary. Pull data from 6 accounts. Two accounts have untagged resources totaling $3,400. Jordan can't attribute them. Adds them to "Unallocated" with a note.

**10:00 AM** — Meeting with VP Eng about Q1 cloud budget. VP wants to cut 15%. Jordan explains which optimizations are realistic and which are fantasy. VP doesn't fully understand the constraints.

**11:00 AM** — Account #17 owner responds: "Oh yeah, that's for the new analytics pipeline. It's permanent." Jordan updates the forecast spreadsheet. The annual impact is $28,000. Nobody approved this.

**12:00 PM** — Lunch at desk. Reading a FinOps Foundation article about showback vs. chargeback models.

**1:00-4:00 PM** — Deep in spreadsheets. Reconciling CUR data with the finance team's GL codes. Find a $4,200 discrepancy. Spend 90 minutes discovering it's because of a refund that appeared in CUR but not in Cost Explorer.

**4:30 PM** — Team lead asks: "Can you tell me how much our staging environment costs?" Jordan: "Give me 30 minutes." It takes 90 because staging resources aren't consistently tagged.

**6:00 PM** — Leave. Tomorrow: same thing.
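Jordan's 8:30 AM Athena step is worth making concrete. A minimal sketch of the kind of hand-rolled query the current workarounds describe — the CUR column names below are standard CUR columns, but the table name and threshold are hypothetical placeholders, and a flat dollar cutoff stands in for whatever statistical test Jordan actually wrote:

```python
from datetime import date, timedelta

# Hypothetical names — whatever your CUR-to-Athena integration created.
CUR_TABLE = "cur_db.cost_and_usage_report"

def daily_anomaly_query(lookback_days: int = 1, threshold_usd: float = 100.0) -> str:
    """Build an Athena SQL query flagging account/service pairs whose
    recent spend exceeds a flat dollar threshold."""
    since = date.today() - timedelta(days=lookback_days)
    return f"""
        SELECT line_item_usage_account_id,
               line_item_product_code,
               SUM(line_item_unblended_cost) AS cost
        FROM {CUR_TABLE}
        WHERE line_item_usage_start_date >= DATE '{since.isoformat()}'
        GROUP BY 1, 2
        HAVING SUM(line_item_unblended_cost) > {threshold_usd}
        ORDER BY cost DESC
    """

# Submitting it is one boto3 call (requires AWS credentials and a real
# results bucket, both omitted here):
#   import boto3
#   boto3.client("athena").start_query_execution(
#       QueryString=daily_anomaly_query(),
#       ResultConfiguration={"OutputLocation": "s3://..."})
```

Brittle is the right word: the query silently breaks whenever the CUR schema or table name changes, which is exactly Jordan's complaint.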
---

## Persona 3: The DevOps Engineer — "Sam"

**Demographics:** 26 years old. Backend/infrastructure engineer at a 40-person startup. Manages Terraform, CI/CD, and "whatever AWS thing is broken today." Doesn't think about costs — until they cause a problem. Sam's primary metric is uptime, not spend.

**The Moment That Defines Sam:**
It's Friday at 4:47 PM. Sam is about to close the laptop for the weekend when a Slack message from the CTO lands: "Sam, did you launch those GPU instances? Finance says we burned $1,200 on something called p3.2xlarge." Sam's blood runs cold. Last Tuesday, Sam spun up 4 GPU instances to benchmark a new ML model for the data team. The benchmark took 20 minutes. Sam meant to terminate them immediately after. But then there was a production incident, and Sam got pulled away, and the instances... are still running. It's been 4 days. Sam checks: $3.06/hour × 4 instances × 96 hours = $1,175. Sam wants to disappear.
### Empathy Map

**SAYS:**
- "I'll terminate it right after the test"
- "I thought I set it to auto-terminate"
- "Can we get a policy to auto-kill dev resources?"
- "I didn't know NAT Gateways were that expensive"
- "The staging environment? Yeah, it's always running. Should it not be?"
- "I don't have time to learn AWS billing — I have deploys to ship"

**THINKS:**
- "Cost management isn't my job... but it keeps becoming my problem"
- "I should have set a reminder to terminate those instances"
- "AWS makes it way too easy to create expensive things and way too hard to know what they cost"
- "I'm going to get blamed for this even though there's no guardrail to prevent it"
- "Why doesn't AWS just TELL you when something is burning money?"
- "I bet there are other zombie resources I don't even know about"

**DOES:**
- Launches resources via Terraform and CLI, sometimes via console for quick tests
- Forgets to clean up temporary resources (not malicious — just busy)
- Checks costs only when asked by management
- Uses `aws ce get-cost-and-usage` CLI occasionally but finds the output confusing
- Tags resources inconsistently ("I'll add tags later" → never)
- Responds to cost inquiries defensively ("It was just a test!")
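The `aws ce get-cost-and-usage` output Sam finds confusing is the JSON form of the Cost Explorer GetCostAndUsage response. A sketch of the small amount of post-processing that turns it into an answer — the response shape is AWS's documented one (grouped by service); the function name and everything else here is illustrative:

```python
def summarize_costs(response: dict, top_n: int = 5) -> list:
    """Flatten a GetCostAndUsage response (GroupBy=SERVICE) into the
    top-N (service, dollars) pairs the raw JSON buries."""
    totals = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Fetching the response is one boto3 call (requires AWS credentials):
#   import boto3
#   response = boto3.client("ce").get_cost_and_usage(
#       TimePeriod={"Start": "2026-02-01", "End": "2026-02-28"},
#       Granularity="DAILY",
#       Metrics=["UnblendedCost"],
#       GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}])
```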
**FEELS:**
- **Embarrassed** — when caught leaving expensive resources running
- **Defensive** — "There should be a system to catch this, not just blame me"
- **Indifferent** — cost isn't Sam's KPI; uptime and velocity are
- **Overwhelmed** — too many responsibilities, cost management is one more thing
- **Anxious** — fear of making an expensive mistake and getting called out
- **Resentful** — "Why is this my problem? Where are the guardrails?"

### Pain Points
1. **No feedback loop** — Sam creates a resource and gets zero signal about its cost until someone complains weeks later
2. **Easy to create, hard to track** — AWS makes it trivial to launch resources and nearly impossible to understand their cost implications in real-time
3. **No safety net** — There's no automated system to catch forgotten resources. It's all human memory.
4. **Blame culture** — When costs spike, the question is "who did this?" not "how do we prevent this?"
5. **Cost literacy gap** — Sam is an excellent engineer but has no mental model for AWS pricing. NAT Gateway data processing? EBS IOPS charges? It's a foreign language.
6. **Context-switching tax** — Investigating a cost issue means leaving the terminal/IDE and navigating the AWS billing console, which is a completely different mental model.
### Current Workarounds
- Setting personal calendar reminders to terminate test resources (unreliable)
- Using spot instances when remembering to (inconsistent)
- Terraform `destroy` for test stacks (when they remember)
- Asking in Slack before launching anything expensive (social pressure, not a system)
- Nothing. Most of the time, there's just nothing. Hope and prayer.

### Jobs To Be Done (JTBD)
- **When** I spin up a temporary resource for testing, **I want to** be automatically reminded (or have it auto-terminated) after a set period, **so I can** focus on my actual work without worrying about zombie resources.
- **When** I'm about to create something expensive, **I want to** see the estimated cost impact immediately, **so I can** make an informed decision or choose a cheaper alternative.
- **When** a cost anomaly is traced back to my actions, **I want to** fix it with one click from wherever I already am (Slack/terminal), **so I can** resolve it in 30 seconds instead of 15 minutes of console-clicking.
### Day-in-the-Life Scenario

**9:00 AM** — Start day. Check CI/CD pipelines. One failed overnight — flaky test. Re-run it.

**9:30 AM** — Sprint planning. Pick up a ticket to set up a new ECS service for the payments team.

**10:00 AM** — Writing Terraform for the new ECS service. Chooses instance type based on the last service they set up (m5.xlarge). Doesn't check if it's the right size. Doesn't estimate cost.

**11:00 AM** — Data team asks Sam to spin up GPU instances for ML benchmarking. Sam launches 4x p3.2xlarge via CLI. Plans to terminate after lunch.

**11:30 AM** — Production alert: database connection pool exhausted. All hands on deck.

**11:30 AM - 2:00 PM** — Incident response. The GPU instances are completely forgotten.

**2:00 PM** — Incident resolved. Sam is mentally drained. Grabs lunch.

**2:30 PM** — Back to the ECS Terraform. Deploys to staging. Doesn't think about the GPU instances.

**3:00 PM** — Code review for a teammate's Lambda function. Doesn't notice it logs full request payloads at DEBUG level (future CloudWatch cost bomb).

**4:00 PM** — Pushes the ECS service to production. Monitors for 30 minutes. Looks good.

**4:47 PM** — CTO Slack: "Did you launch those GPU instances?" The cold sweat begins.

**4:50 PM** — Terminates the instances. $1,175 burned. Apologizes. Feels terrible.

**5:00 PM** — Closes laptop. Spends the weekend low-key anxious about it.

**Monday** — CTO announces a new "cloud cost policy." Sam knows it's because of them. Nobody will follow it.
---

# Phase 2: DEFINE

> *"A well-defined problem is a problem half-solved. But here's the jazz riff — the problem isn't 'costs are too high.' The problem is 'I'm flying blind in a machine that charges by the millisecond.' That's a fundamentally different design challenge."*

---

## Point-of-View (POV) Statements

### POV 1: The Startup CTO (Alex)
**Alex**, a time-starved startup CTO who is personally accountable for AWS spend, **needs a way to** instantly understand and resolve unexpected cost spikes the moment they happen, **because** the 48-hour delay in current tools means thousands of dollars burn before they even know there's a problem, and every unexplained spike erodes investor confidence and their own credibility.

### POV 2: The FinOps Analyst (Jordan)
**Jordan**, a solo FinOps analyst responsible for cost governance across 23 AWS accounts, **needs a way to** automatically detect, attribute, and communicate cost anomalies without manual data wrangling, **because** they spend 60% of their time collecting and reconciling data instead of analyzing it, and they are a single point of failure for cost visibility in a 150-person engineering org.

### POV 3: The DevOps Engineer (Sam)
**Sam**, a DevOps engineer who accidentally creates expensive zombie resources, **needs a way to** get immediate cost feedback and automatic safety nets when creating or forgetting cloud resources, **because** there is currently zero feedback loop between "I launched a thing" and "that thing is costing $12/hour," and the resulting blame culture makes cost management feel punitive rather than preventive.
---

## Key Insights

1. **The Anxiety Gap** — The real pain isn't the dollar amount. It's the TIME between "something went wrong" and "I understand what happened." AWS's 48-hour delay turns a $200 problem into a $2,000 problem AND a week of anxiety. Speed of detection is speed of relief.

2. **Attribution Is Emotional, Not Just Financial** — "Who did this?" is the first question asked in every cost spike. Current tools can't answer it. This creates blame culture. If dd0c can instantly say "Sam's p3.2xlarge instances from Tuesday," it transforms the conversation from blame to resolution.

3. **Nobody Wakes Up Wanting to Do FinOps** — Alex doesn't want a cost dashboard. Sam doesn't want billing alerts. Jordan doesn't want more spreadsheets. They all want the ABSENCE of cost problems. The best cost tool is one you barely notice — until it saves you.

4. **The Guardrail Deficit** — AWS makes it trivially easy to create expensive resources and provides zero real-time feedback about cost. It's like a highway with no speed limit signs and no guardrails — and the speeding ticket arrives 2 days later. dd0c is the guardrail, not the ticket.

5. **Slack Is the Operating System** — All three personas live in Slack. Alex gets the CFO's panic message there. Jordan sends cost summaries there. Sam gets called out there. The product that wins is the one that meets them where they already are — not behind another login.

6. **The Trust Ladder** — Read-only detection (low trust) → Recommendations (medium trust) → One-click remediation (high trust) → Autonomous action (maximum trust). Users climb this ladder over time. V1 must support the full ladder but default to the bottom rung.

7. **Cost Literacy Is Near Zero** — Even experienced engineers don't understand AWS pricing. NAT Gateway data processing, cross-AZ transfer, EBS IOPS — it's deliberately opaque. dd0c must EXPLAIN costs in human language, not just report numbers.

8. **Waste Regenerates** — Optimization isn't a one-time event. New engineers join, new services launch, configurations drift. The zombie resource problem is perpetual. dd0c's value is continuous, not episodic.
---

## How Might We (HMW) Questions

### Detection & Speed
1. **HMW** detect expensive resource creation in seconds instead of days, so users can intervene before costs accumulate?
2. **HMW** distinguish between expected cost changes (planned deployments) and genuine anomalies, to minimize false positive fatigue?
3. **HMW** make the "cost signal" as immediate and visceral as a production alert, so it gets the same urgency?
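HMW #1 is feasible with plumbing AWS already ships: CloudTrail management events reach EventBridge within seconds of the API call, long before any billing data lands. A minimal sketch of the subscription side — the event pattern is the standard CloudTrail-via-EventBridge shape; the rule and target names are hypothetical placeholders:

```python
import json

# Event pattern matching EC2 RunInstances calls recorded by CloudTrail.
# (Assumes a trail delivering management events in the region.)
EXPENSIVE_CREATE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventName": ["RunInstances"]},
}

# Wiring it up is two boto3 calls (requires AWS credentials; the Lambda
# ARN is an illustrative placeholder, so the calls stay commented out):
#   import boto3
#   events = boto3.client("events")
#   events.put_rule(
#       Name="dd0c-detect-runinstances",
#       EventPattern=json.dumps(EXPENSIVE_CREATE_PATTERN))
#   events.put_targets(
#       Rule="dd0c-detect-runinstances",
#       Targets=[{"Id": "cost-estimator", "Arn": "arn:aws:lambda:..."}])
```

The same pattern generalizes: add `CreateDBInstance`, `CreateNatGateway`, and so on to `eventName` to cover the other expensive create paths.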
### Attribution & Understanding
4. **HMW** automatically attribute every cost spike to a specific person, team, and action — without requiring perfect tagging?
5. **HMW** explain cost anomalies in plain English so that a CTO, a FinOps analyst, AND a junior engineer all understand what happened?
6. **HMW** show the "cost blast radius" of a single action (e.g., "this one CLI command is costing $12.24/hour") at the moment it happens?

### Remediation & Action
7. **HMW** reduce the time from "anomaly detected" to "problem fixed" from hours to seconds?
8. **HMW** make remediation feel safe (not scary) so users actually click the "Stop Instance" button instead of hesitating?
9. **HMW** build automatic safety nets (auto-terminate, auto-schedule) that prevent problems without requiring human vigilance?
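HMW #9's safety net can be sketched as a tag-driven sweep. The `ttl-hours` tag convention is a hypothetical one invented here (not an AWS feature); the expiry decision is the interesting logic, and the sweep itself is two boto3 calls:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_expired(launch_time: datetime, tags: dict,
               default_ttl_hours: float = 8.0,
               now: Optional[datetime] = None) -> bool:
    """True once an instance outlives its ttl-hours tag — the moment
    a temporary resource becomes a zombie."""
    now = now or datetime.now(timezone.utc)
    ttl = float(tags.get("ttl-hours", default_ttl_hours))
    return now - launch_time > timedelta(hours=ttl)

# A scheduled Lambda could sweep and terminate (real boto3 calls,
# requires credentials, so left commented here):
#   import boto3
#   ec2 = boto3.client("ec2")
#   resp = ec2.describe_instances(
#       Filters=[{"Name": "tag-key", "Values": ["ttl-hours"]},
#                {"Name": "instance-state-name", "Values": ["running"]}])
#   ... call is_expired() per instance, then:
#   ec2.terminate_instances(InstanceIds=expired_ids)
```

Under this convention Sam's forgotten GPU instances (96 hours against an 8-hour TTL) would have been reaped automatically on Tuesday evening.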
### Culture & Behavior
10. **HMW** transform cost management from a blame game into a team sport?
11. **HMW** make cost awareness a natural byproduct of daily engineering work, not a separate chore?
12. **HMW** reward cost-conscious behavior instead of only punishing waste?

### Scale & Governance
13. **HMW** give Jordan (the FinOps analyst) their time back by automating the 60% of work that's just data wrangling?
14. **HMW** provide cost governance across 20+ AWS accounts without creating a bottleneck at one person?

---
## The Core Tension: Real-Time Detection vs. Accuracy

Here's the design tension that will define dd0c/cost's soul:

```
FAST ←————————————————————→ ACCURATE

CloudTrail events              CUR line-item data
(instant, but estimated)       (hourly+, but precise)
```

**The Tradeoff:**
- **CloudTrail event detection** tells you "someone just launched a p3.2xlarge" within seconds. You can estimate the cost ($3.06/hour). But you don't have the ACTUAL billed amount yet — there could be reserved instance coverage, savings plans, spot pricing, or marketplace fees that change the real number.
- **CUR/Cost Explorer data** gives you the exact billed amount, with all discounts and credits applied. But it's delayed by hours (CUR) or days (Cost Explorer).
**The Resolution — A Two-Layer Architecture:**

| Layer | Source | Speed | Accuracy | Use Case |
|-------|--------|-------|----------|----------|
| **Layer 1: Event Stream** | CloudTrail + EventBridge | Seconds | Estimated (~85% accurate) | "ALERT: New expensive resource detected" |
| **Layer 2: Billing Reconciliation** | CloudWatch EstimatedCharges + CUR | Minutes to hours | Precise (99%+) | "UPDATE: Confirmed cost impact is $X" |

**The Design Principle:** Alert on Layer 1 (fast, estimated). Reconcile with Layer 2 (slow, precise). Always show the user which layer they're looking at. Never pretend an estimate is exact. Never wait for precision when speed saves money.
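A minimal sketch of the Layer 1 estimator, assuming the instance type and count have already been parsed out of the CloudTrail event upstream; the price table and alert threshold are illustrative stand-ins for a real Price List API lookup:

```python
# Illustrative on-demand rates (USD/hour). A real Layer 1 would pull
# these from the AWS Price List API and track RI/SP coverage — hence
# "estimated, ~85% accurate," never presented as the billed amount.
ON_DEMAND_USD_PER_HOUR = {
    "p3.2xlarge": 3.06,   # the rate quoted earlier in this session
    "m5.xlarge": 0.192,
}

ALERT_THRESHOLD_USD_PER_HOUR = 1.0  # hypothetical tuning knob

def layer1_estimate(instance_type: str, count: int):
    """Return (estimated hourly burn, should_alert) for a RunInstances
    event. Unknown instance types alert by default — fail loud."""
    rate = ON_DEMAND_USD_PER_HOUR.get(instance_type)
    if rate is None:
        return 0.0, True
    hourly = rate * count
    return hourly, hourly >= ALERT_THRESHOLD_USD_PER_HOUR
```

Sam's incident drops straight out: 4 × p3.2xlarge estimates to $12.24/hour, well over the threshold, so the alert fires seconds after launch instead of days later.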
This is like a smoke detector vs. a fire investigation. The smoke detector goes off immediately — it might be burnt toast, it might be a real fire. You don't wait for the fire investigator's report before evacuating. You act on the fast signal, then refine your understanding.

**For each persona, this plays differently:**
- **Alex (CTO):** Wants Layer 1 immediately. "I don't care if it's $1,175 or $1,230 — I need to know NOW that something is burning money." Precision can come later.
- **Jordan (FinOps):** Needs both layers. Layer 1 for real-time awareness, Layer 2 for accurate reporting and forecasting. Jordan will be frustrated if estimates are wildly off.
- **Sam (DevOps):** Wants Layer 1 as a safety net. "Tell me the second I forget to terminate something." Doesn't care about the exact dollar amount — cares about the pattern.

---