dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
2421
products/05-aws-cost-anomaly/architecture/architecture.md
Normal file
File diff suppressed because it is too large
340
products/05-aws-cost-anomaly/brainstorm/session.md
Normal file
@@ -0,0 +1,340 @@
# dd0c/cost — AWS Cost Anomaly Detective

## Brainstorming Session

**Date:** February 28, 2026
**Facilitator:** Carson, Elite Brainstorming Specialist
**Product:** dd0c/cost (Product #5 in the dd0c platform)
**Target:** Teams spending $10K–$500K/mo on AWS who want instant alerts when something spikes

---

## Phase 1: Problem Space (25 ideas)

### The Emotional Gut-Punch

1. **The 3am Stomach Drop** — You open your phone, see a Slack message from finance: "Why is our AWS bill $18K over budget?" Your heart rate spikes. You don't even know where to start looking. AWS Cost Explorer loads in 8 seconds and shows you a bar chart that means nothing.

2. **The Blame Game** — Someone left a GPU instance running over the weekend. Nobody knows who. The CTO asks in the all-hands. Three teams point fingers. The intern who actually did it is too scared to speak up. The political fallout lasts weeks.

3. **The "It Was Just a Test" Excuse** — A developer spun up a 5-node EMR cluster "just to test something real quick." That was 11 days ago. It's been burning $47/hour. Nobody noticed because nobody looks at billing until month-end.

4. **The NAT Gateway Surprise** — The single most rage-inducing line item on any AWS bill. Teams discover they're paying $3K/month for NAT Gateway data processing and have zero idea which service is generating the traffic. AWS gives you no breakdown.

5. **The Data Transfer Black Hole** — Cross-region, cross-AZ, internet egress, VPC endpoints, PrivateLink — data transfer costs are a labyrinth. Even experienced architects can't predict them. They just show up as a lump sum.

6. **The Autoscaling Runaway** — A traffic spike triggers autoscaling. The spike ends. Autoscaling doesn't scale back down because the cooldown period is misconfigured. You're now running 40 instances instead of 4. For three days.

7. **The Reserved Instance Waste** — You bought $50K in reserved instances for m5.xlarge. Six months later, the team migrated to Graviton (m7g). The reservations are burning money on instances nobody uses.

8. **The S3 Lifecycle Policy That Never Was** — "We'll add lifecycle policies later." Later never comes. You're storing 80TB of debug logs from 2023 in S3 Standard at $0.023/GB. That's $1,840/month for data nobody will ever read.

9. **The EBS Snapshot Graveyard** — Hundreds of orphaned EBS snapshots from deleted instances. Each one costs pennies, but collectively they're $400/month. Nobody even knows they exist.

10. **The CloudWatch Log Explosion** — A misconfigured Lambda starts logging every request payload at DEBUG level. CloudWatch ingestion costs go from $50/month to $2,000/month in 48 hours. The default CloudWatch dashboard doesn't show cost impact.

### Why AWS Cost Explorer Sucks

11. **24-48 Hour Delay** — Cost Explorer data is delayed by up to 48 hours. By the time you see the spike, you've already burned thousands. Real-time? AWS doesn't know the meaning.

12. **Terrible Filtering UX** — Want to see costs for a specific team's resources? Hope you tagged everything perfectly. Spoiler: you didn't. Cost Explorer's filter UI is a nightmare of dropdowns and "apply" buttons.

13. **No Actionable Context** — Cost Explorer tells you "EC2 costs went up 300%." It does NOT tell you which specific instances, who launched them, when, or why. You have to cross-reference with CloudTrail manually.

14. **Anomaly Detection is a Joke** — AWS Cost Anomaly Detection exists but: alerts are delayed (same 24-48hr lag), the ML model is a black box you can't tune, false positive rate is absurd, and the notification options are limited to SNS/email (no Slack native).

15. **No Remediation Path** — Even when you find the problem in Cost Explorer, there's no "fix it" button. You have to context-switch to the EC2 console, find the resource, and manually terminate/resize. That's 15 clicks minimum.

16. **Forecasting is Useless** — AWS's cost forecast is a straight-line projection that ignores seasonality, deployment patterns, and common sense. "Based on current trends, your bill will be $∞."

### What Causes Cost Spikes (The Usual Suspects)

17. **Zombie Resources** — EC2 instances, RDS databases, Elastic IPs, Load Balancers, Redshift clusters that are running but serving no traffic. The #1 source of waste. Every AWS account has them.

18. **Right-Sizing Neglect** — Running m5.4xlarge instances that average 8% CPU utilization. Nobody downsizes because "what if we need the headroom?" (They never do.)

19. **Dev/Staging Environments Running 24/7** — Production needs to be always-on. Dev and staging do not. But they run 24/7 because nobody set up a schedule. That's 75% waste on non-prod.

20. **Marketplace AMI Licensing** — Someone launched an instance with a marketplace AMI that costs $2/hour in licensing fees on top of the EC2 cost. The license cost doesn't show up where you'd expect.

21. **Elastic IP Charges** — Allocated but unattached Elastic IPs cost $0.005/hour each. Sounds tiny. 50 orphaned EIPs = $180/month. Death by a thousand cuts.

22. **Lambda Concurrency Explosions** — A recursive Lambda invocation bug or a sudden traffic spike causes thousands of concurrent executions. The per-invocation cost is low but at 10K concurrent, it adds up fast.

23. **DynamoDB On-Demand Pricing Surprises** — Teams choose on-demand for convenience, then discover their read/write patterns would be 80% cheaper with provisioned capacity + auto-scaling.

24. **Multi-Account Sprawl** — Organizations with 20+ AWS accounts lose track of which accounts are active, who owns them, and what's running in them. Consolidated billing hides the details.

25. **Savings Plans Mismatch** — Bought Compute Savings Plans based on last quarter's usage. This quarter's usage shifted to a different instance family/region. Savings Plans don't cover it. You're paying on-demand AND wasting the commitment.

---
## Phase 2: Solution Space (57 ideas)

### Detection Approaches

26. **CloudWatch Billing Metrics (Near Real-Time)** — Poll `EstimatedCharges` metric every 5 minutes. It's the fastest signal AWS provides. Not perfect, but way better than Cost Explorer's 48-hour lag.
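A minimal sketch of this polling loop's two halves, assuming boto3: `EstimatedCharges` is a cumulative month-to-date value published to the `AWS/Billing` namespace in us-east-1 (billing alerts must be enabled), so the useful signal is the delta between samples. Function names here are illustrative, not part of any dd0c API.

```python
from datetime import datetime, timedelta, timezone

def estimate_burn_rate(samples: list[tuple[datetime, float]]) -> float:
    """Given (timestamp, month-to-date USD) samples of the cumulative
    EstimatedCharges metric, return the recent burn rate in USD/hour."""
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[-2], samples[-1]
    hours = (t1 - t0).total_seconds() / 3600
    # The metric resets at the start of each billing month; a negative
    # delta means a rollover, so treat that window as zero burn.
    delta = max(c1 - c0, 0.0)
    return delta / hours if hours > 0 else 0.0

def fetch_estimated_charges(cloudwatch) -> list[tuple[datetime, float]]:
    """Fetch recent cumulative charges. `cloudwatch` is a boto3
    CloudWatch client for us-east-1 (billing metrics live there only)."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        StartTime=now - timedelta(hours=12),
        EndTime=now,
        Period=3600,
        Statistics=["Maximum"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Maximum"]) for p in points]
```

In practice the metric itself only updates a few times a day, so the 5-minute poll bounds detection latency rather than guaranteeing 5-minute-fresh data.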
27. **CloudTrail Event Stream** — Monitor `RunInstances`, `CreateDBInstance`, `CreateFunction`, `CreateNatGateway` etc. in real-time via EventBridge. Detect expensive resource creation the MOMENT it happens, before any cost accrues.
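One way this could look, assuming boto3: CloudTrail management events arrive in EventBridge under the standard "AWS API Call via CloudTrail" detail-type, so a single rule can match the expensive create calls named above. The rule name and target are placeholders.

```python
import json

# EventBridge pattern matching CloudTrail-delivered API calls that
# create potentially expensive resources.
EXPENSIVE_CREATE_PATTERN = {
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": [
            "ec2.amazonaws.com",
            "rds.amazonaws.com",
            "elasticmapreduce.amazonaws.com",
            "lambda.amazonaws.com",
        ],
        "eventName": [
            "RunInstances",
            "CreateDBInstance",
            "RunJobFlow",        # EMR cluster creation
            "CreateFunction",
            "CreateNatGateway",
        ],
    },
}

def create_rule(events_client, rule_name: str, target_arn: str) -> None:
    """Register the rule and point it at a handler (e.g. a Lambda that
    scores the event). `events_client` is a boto3 'events' client."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(EXPENSIVE_CREATE_PATTERN),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "cost-anomaly-handler", "Arn": target_arn}],
    )
```

The handler gets the full CloudTrail record, including the IAM identity that made the call, which is exactly the "who created it" context idea #84 later relies on.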
28. **Cost and Usage Report (CUR) Hourly Parsing** — AWS CUR can be delivered hourly to S3. Parse it with a lightweight Lambda/Fargate job. Gives line-item granularity that Cost Explorer API doesn't.

29. **Hybrid Detection: Events + Billing** — Use CloudTrail for instant "something was created" alerts, then correlate with CUR data for actual cost impact. Best of both worlds.

30. **Tag-Based Cost Boundaries** — Let users define expected cost ranges per tag (e.g., `team:payments` should be $2K-$4K/month). Alert when any tag group exceeds its boundary.
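The boundary check itself is simple; a sketch with illustrative data shapes (not a real dd0c interface):

```python
def check_tag_boundaries(
    spend_by_tag: dict[str, float],
    boundaries: dict[str, tuple[float, float]],
) -> list[tuple[str, float, str]]:
    """Return (tag, spend, reason) alerts for tag groups outside their
    expected monthly range. Tags with no recorded spend count as $0."""
    alerts = []
    for tag, (low, high) in boundaries.items():
        spend = spend_by_tag.get(tag, 0.0)
        if spend > high:
            alerts.append((tag, spend, f"over budget by ${spend - high:,.0f}"))
        elif spend < low:
            # Under-spend can also signal a problem, e.g. a dead service.
            alerts.append((tag, spend, "below expected range"))
    return alerts
```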
31. **Service-Level Baselines** — Automatically learn the "normal" cost pattern for each AWS service in the account. Alert on deviations. No manual threshold setting required.

32. **Account-Level Anomaly Scoring** — Assign each AWS account a daily "anomaly score" (0-100) based on deviation from historical patterns. Dashboard shows accounts ranked by anomaly severity.

### Anomaly Algorithms

33. **Statistical Z-Score Detection** — Simple, explainable. Calculate rolling mean and standard deviation for each service/tag. Alert when current spend exceeds 2σ or 3σ. Users understand "this is 3 standard deviations above normal."
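The z-score rule fits in a few lines of stdlib Python; this sketch assumes daily spend figures and a 3σ default, with a minimum baseline length so a brand-new account doesn't alert on everything:

```python
import statistics

def zscore(history: list[float], current: float) -> float:
    """How many standard deviations `current` sits above the mean of
    `history`. Zero-variance history yields 0 (no meaningful signal)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return (current - mean) / stdev if stdev > 0 else 0.0

def is_anomalous(history: list[float], current: float,
                 threshold: float = 3.0) -> bool:
    # Require at least a week of baseline before alerting at all.
    return len(history) >= 7 and zscore(history, current) > threshold
```

One-sided on purpose: a spend drop is interesting but not pageable.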
34. **Seasonal Decomposition (STL)** — Decompose cost time series into trend + seasonal + residual. Alert on residual spikes. Handles weekly patterns (lower on weekends) and monthly patterns (batch jobs on the 1st).
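A deliberately simplified stand-in for full STL, to show the shape of the idea: remove a per-weekday mean (the weekly seasonal component) and alert on what's left. A production version would use a proper STL implementation and also remove trend.

```python
from collections import defaultdict

def weekday_residuals(daily: list[tuple[int, float]]) -> list[float]:
    """daily: (weekday 0-6, spend) pairs in chronological order.
    Returns residuals after subtracting each weekday's mean spend,
    a crude proxy for the seasonal component of an STL decomposition."""
    by_day: dict[int, list[float]] = defaultdict(list)
    for wd, spend in daily:
        by_day[wd].append(spend)
    seasonal = {wd: sum(v) / len(v) for wd, v in by_day.items()}
    return [spend - seasonal[wd] for wd, spend in daily]
```

Alerting on the residual (e.g. with the z-score rule from idea #33) keeps "weekends are always cheaper" from looking like Monday-morning anomalies.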
35. **Prophet-Style Forecasting** — Use Facebook Prophet or similar for time-series forecasting. Compare actual vs. predicted. Alert on significant positive deviations. Good for accounts with complex seasonality.

36. **Rule-Based Guardrails** — Simple rules that catch 80% of problems: "Alert if any single resource costs >$X/day", "Alert if a new service appears that wasn't used last month", "Alert if daily spend exceeds 150% of 30-day average."
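The three guardrails quoted above could be evaluated together in one pass; a sketch with illustrative parameter names and a placeholder $50/day per-resource cap:

```python
def guardrail_alerts(
    daily_spend: float,
    trailing_30d: list[float],
    services_today: set[str],
    services_last_month: set[str],
    per_resource_costs: dict[str, float],
    max_resource_daily: float = 50.0,
) -> list[str]:
    """Evaluate the three simple guardrail rules and return alert text."""
    alerts = []
    avg = sum(trailing_30d) / len(trailing_30d)
    if daily_spend > 1.5 * avg:
        alerts.append(
            f"daily spend ${daily_spend:,.0f} exceeds 150% of 30-day average"
        )
    for svc in sorted(services_today - services_last_month):
        alerts.append(f"new service appeared: {svc}")
    for rid, cost in per_resource_costs.items():
        if cost > max_resource_daily:
            alerts.append(f"resource {rid} costs ${cost:,.0f}/day")
    return alerts
```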
37. **Peer Comparison** — "Your EC2 spend per engineer is 3x the median for companies your size." Anonymized benchmarking across dd0c customers. Powerful social proof for optimization.

38. **Rate-of-Change Detection** — Don't just look at absolute cost. Look at the derivative. A service going from $10/day to $50/day is a 5x spike even though the absolute number is small. Catch problems early when they're cheap to fix.
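A minimal version of the derivative check: compare the mean of the most recent window against the window before it, so a $10/day to $50/day move reads as 5x regardless of the absolute dollars. Window size is an assumption.

```python
def growth_ratio(series: list[float], window: int = 3) -> float:
    """Ratio of the mean of the last `window` points to the mean of the
    `window` points before them. 5.0 means a 5x ramp."""
    if len(series) < 2 * window:
        return 1.0  # not enough data to judge
    recent = series[-window:]
    prior = series[-2 * window:-window]
    base = sum(prior) / window
    return (sum(recent) / window) / base if base > 0 else float("inf")
```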
39. **Composite Anomaly Detection** — Combine multiple signals: cost spike + new resource creation + unusual API calls = high-confidence anomaly. Single signals = low-confidence (reduce false positives).

### Remediation

40. **One-Click Stop Instance** — See a runaway EC2 instance? Click "Stop" right from the dd0c alert. We execute `StopInstances` via the customer's IAM role. No console context-switching.
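Under the hood this is a cross-account `sts:AssumeRole` followed by `ec2:StopInstances`; a sketch assuming boto3, with the role ARN and session name as placeholders:

```python
def build_stop_request(instance_ids: list[str]) -> dict:
    """Pure helper: the StopInstances parameters to send."""
    return {"InstanceIds": instance_ids}

def stop_instances(role_arn: str, region: str, instance_ids: list[str]):
    """Assume the customer's remediation role, then stop the instances.
    The assumed role needs ec2:StopInstances permission."""
    import boto3  # deferred so the pure helper above needs no AWS SDK
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn, RoleSessionName="dd0c-remediation"
    )["Credentials"]
    ec2 = boto3.client(
        "ec2",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return ec2.stop_instances(**build_stop_request(instance_ids))
```

Keeping the write path behind a separate, opt-in role is also what makes the tiered-permission answer to idea #103 work.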
41. **One-Click Terminate with Snapshot** — For instances that should be killed: terminate but automatically create an EBS snapshot first, so nothing is lost. Safety net built in.

42. **Schedule Non-Prod Shutdown** — "This dev environment runs 24/7 but only gets traffic 9am-6pm ET." One click to create a start/stop schedule. Instant 62% savings.

43. **Right-Size Recommendation with Apply** — "This m5.4xlarge averages 8% CPU. Recommended: m5.large. Estimated savings: $312/month." Click "Apply" to resize. We handle the stop/modify/start.

44. **Auto-Kill Zombie Resources** — Define a policy: "Any EC2 instance with <1% CPU for 7 days gets auto-terminated." dd0c enforces it. Opt-in, with a 24-hour warning notification before termination.

45. **Budget Circuit Breaker** — Set a hard daily/weekly budget. When spend approaches the limit, dd0c automatically stops non-essential resources (tagged as `priority:low`). Like a financial circuit breaker.

46. **Savings Plan Optimizer** — Analyze usage patterns and recommend optimal Savings Plan purchases. Show the exact commitment amount and projected savings. One-click purchase through AWS.

47. **Reserved Instance Exchange Assistant** — Got unused RIs? dd0c finds the optimal exchange path to convert them to instance types you actually use. Handles the RI Marketplace listing if exchange isn't possible.

48. **S3 Lifecycle Policy Generator** — Scan S3 buckets, analyze access patterns, generate optimal lifecycle policies (Standard → IA → Glacier → Delete). One-click apply.

49. **EBS Snapshot Cleanup** — Identify orphaned snapshots, show total cost, one-click bulk delete with a confirmation list.

50. **Approval Workflow for Expensive Actions** — For remediation actions above a cost threshold, require manager/lead approval via Slack. "Max wants to terminate 5 instances saving $2,100/month. Approve?"

### Attribution

51. **Team-Level Cost Dashboard** — Break down costs by team using tags, account mapping, or resource ownership. Each team sees ONLY their costs. Accountability without blame.

52. **PR-Level Cost Attribution** — Track which pull request / deployment caused a cost change. "Costs increased $340/day after PR #1847 was merged (added new ECS service)." Integration with GitHub/GitLab.

53. **Environment-Level Breakdown** — Production vs. Staging vs. Dev vs. QA. Instantly see that staging is costing 60% of production (it shouldn't be).

54. **Service-Level Cost per Request** — Combine cost data with traffic data. "Your payment service costs $0.003 per request. Your search service costs $0.047 per request." Unit economics for infrastructure.

55. **Slack Cost Bot** — `/cost my-team` in Slack returns your team's current month spend, trend, and anomalies. No dashboard needed for quick checks.

### Forecasting

56. **End-of-Month Projection** — "Based on current trajectory, your February bill will be $47,200 (budget: $40,000). You'll exceed budget by $7,200 unless action is taken." Updated daily.
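The V1 version of this is straight arithmetic: average daily burn so far, times days in the month. A sketch (the input figures in the usage note are constructed to reproduce the example numbers above, not taken from real data):

```python
import calendar
from datetime import date

def month_end_projection(today: date, month_to_date: float) -> float:
    """Naive straight-line projection: average daily burn so far times
    the number of days in the current month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date / today.day * days_in_month

def budget_message(today: date, month_to_date: float, budget: float) -> str:
    projected = month_end_projection(today, month_to_date)
    if projected > budget:
        return (f"Projected bill ${projected:,.0f} exceeds budget "
                f"${budget:,.0f} by ${projected - budget:,.0f}.")
    return f"Projected bill ${projected:,.0f} is within budget."
```

For example, $23,600 spent by February 14, 2026 (a 28-day month) projects to $47,200 against a $40,000 budget, a $7,200 overrun.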
57. **What-If Scenarios** — "What if we right-size all oversized instances? Projected savings: $4,200/month." "What if we schedule dev environments? Savings: $2,800/month." Quantify the impact before acting.

58. **Deployment Cost Preview** — Before deploying a new service, estimate its monthly cost based on the Terraform/CloudFormation template. "This deployment will add approximately $1,200/month to your bill." Pre-deploy, not post-mortem.

59. **Trend Analysis with Narrative** — Not just charts. "Your EC2 costs have increased 23% month-over-month for 3 consecutive months, driven primarily by the data-pipeline team's EMR usage. At this rate, EC2 alone will exceed $30K by April."

### Notification

60. **Slack-Native Alerts with Action Buttons** — Alert lands in Slack with context AND action buttons: [Stop Instance] [Snooze 24h] [Assign to Team] [View Details]. No context-switching.
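In Slack terms this is a Block Kit message: a `section` block for context plus an `actions` block of buttons. A sketch; the `action_id` values are illustrative and would map to handlers in the interactivity endpoint:

```python
def alert_blocks(summary: str, instance_id: str) -> list[dict]:
    """Slack Block Kit blocks for an anomaly alert with action buttons."""
    def button(label: str, action_id: str, **extra) -> dict:
        return {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,
            "value": instance_id,
            **extra,
        }

    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
        {
            "type": "actions",
            "elements": [
                button("Stop Instance", "stop_instance", style="danger"),
                button("Snooze 24h", "snooze"),
                button("Assign to Team", "assign_team"),
                button("View Details", "view_details"),
            ],
        },
    ]
```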
61. **PagerDuty Integration for Critical Spikes** — Cost spike >$X/hour? That's an incident. Page the on-call FinOps person (or the team lead if no FinOps role exists).

62. **Daily Digest Email** — Morning email: "Yesterday's spend: $1,423. Trend: ↑12% vs. 7-day average. Top anomaly: NAT Gateway in us-east-1 (+$89). Action needed: 3 zombie instances detected."

63. **SMS for Emergency Spikes** — Configurable threshold. "Your AWS spend exceeded $500 in the last hour. This is 10x your normal hourly rate." For the truly catastrophic events.

64. **Weekly Cost Report for Leadership** — Auto-generated PDF/Slack message for non-technical stakeholders. Plain English. "We spent $38K on AWS this week. That's 5% under budget. Three optimization opportunities worth $2,100/month were identified."

### Visualization

65. **Cost Heatmap** — Calendar heatmap showing daily spend intensity. Instantly spot the expensive days. Click any day to drill down.

66. **Service Treemap** — Treemap visualization where rectangle size = cost. Instantly see which services dominate your bill. Click to drill into sub-categories.

67. **Real-Time Cost Ticker** — A live-updating ticker showing current burn rate: "$1.87/hour | $44.88/day | $1,346/month (projected)". Like a stock ticker for your AWS bill.

68. **Anomaly Timeline** — Horizontal timeline showing detected anomalies as colored dots. Red = unresolved, green = remediated, yellow = acknowledged. Visual history of your cost health.

69. **Cost Diff View** — Side-by-side comparison of any two time periods. "This week vs. last week: +$2,100 total. EC2: +$800, RDS: +$1,100, S3: +$200." Like a git diff for your bill.

70. **Infrastructure Cost Map** — Visual representation of your AWS architecture with cost annotations. See your VPC, subnets, instances, databases — each labeled with their daily cost. Like an AWS architecture diagram that shows you where the money goes.

### Wild Ideas 🔥

71. **"Cost Replay"** — Rewind your AWS bill to any point in time and replay cost changes like a video. See exactly when costs started climbing and correlate with CloudTrail events. A DVR for your cloud spend.

72. **Auto-Negotiate Reserved Instances** — dd0c monitors your usage patterns, identifies RI opportunities, and automatically purchases optimal reservations (with configurable approval thresholds). Fully autonomous FinOps.

73. **Zombie Resource Hunter (Autonomous Agent)** — An AI agent that continuously scans your account for unused resources, calculates waste, and either auto-terminates (if policy allows) or creates a cleanup ticket. It never sleeps.

74. **"Cost Blast Radius" for PRs** — GitHub Action that comments on every PR: "If merged, this change will increase monthly AWS costs by approximately $340 (new ECS task definition with 4 vCPU)." Shift cost awareness left.

75. **Competitive Benchmarking** — "Companies similar to yours (50 engineers, SaaS, Series B) spend a median of $28K/month on AWS. You spend $45K. Here's where you're overspending." Anonymous, aggregated data from dd0c's customer base.

76. **"AWS Bill Explained Like I'm 5"** — AI-generated plain-English explanation of your bill. "You spent $4,200 on EC2 this month. That's like renting 12 computers 24/7. But 4 of them did almost nothing. If you turn those off, you save $1,400."

77. **Cost Gamification** — Leaderboard: "Team Payments reduced their AWS spend by 18% this month! 🏆" Badges for optimization milestones. Make cost optimization fun and competitive.

78. **Automatic Spot Instance Migration** — Identify workloads that are spot-compatible (stateless, fault-tolerant) and automatically migrate them from on-demand to spot. 60-90% savings with zero manual effort.

79. **"What's This Costing Me?" Chrome Extension** — Hover over any resource in the AWS Console and see its monthly cost. Like a price tag on every resource. Because AWS deliberately makes this hard to see.

80. **Multi-Cloud Cost Normalization** — Normalize costs across AWS, GCP, and Azure into a single dashboard. "Your compute costs $X on AWS. The equivalent on GCP would cost $Y." Help teams make informed multi-cloud decisions.

81. **Cost-Aware Autoscaling** — Replace AWS's native autoscaling with a cost-aware version. Instead of just scaling on CPU/memory, factor in cost. "We could scale to 20 instances, but 12 instances + a queue would handle the load at 40% less cost."

82. **Invoice Dispute Assistant** — AI that reviews your AWS bill for billing errors, credits you're owed, and generates dispute emails. AWS makes billing mistakes more often than people think.

---
## Phase 3: Differentiation & Moat (18 ideas)

### Beating AWS Native Cost Anomaly Detection

83. **Speed** — AWS Cost Anomaly Detection has a 24-48 hour delay. dd0c/cost detects anomalies in minutes via CloudTrail events + CloudWatch billing metrics. This alone is a 100x improvement.

84. **Actionability** — AWS tells you "anomaly detected." dd0c tells you "anomaly detected → here's the specific resource → here's who created it → here's the one-click fix." Context + action, not just a notification.

85. **UX That Doesn't Make You Want to Cry** — AWS Cost Anomaly Detection is buried in the Billing console behind 4 clicks. The UI is a table with tiny text. dd0c is a beautiful, purpose-built dashboard with Slack-native alerts.

86. **Tunable Sensitivity** — AWS's ML model is a black box. dd0c lets you tune sensitivity per service, per team, per account. "I expect RDS to fluctuate ±20%, but EC2 should be stable within ±5%."

87. **Remediation Built In** — AWS detects. dd0c detects AND fixes. The gap between "knowing" and "doing" is where all the value is.

### Beating Vantage / CloudHealth

88. **Time-to-Value** — Vantage requires connecting your CUR, waiting for data ingestion, configuring dashboards. dd0c: connect your AWS account, get your first anomaly alert in under 10 minutes. Vercel-speed onboarding.

89. **Pricing Transparency** — CloudHealth/Apptio: "Contact Sales." Vantage: reasonable but still opaque at scale. dd0c: pricing on the website, self-serve signup, no sales calls ever.

90. **Focus** — Vantage is becoming a broad FinOps platform (Kubernetes costs, unit economics, budgets, reports). dd0c/cost does ONE thing: detect anomalies and fix them. Focused tools beat Swiss Army knives.

91. **Developer-First, Not Finance-First** — CloudHealth was built for FinOps teams and CFOs. dd0c is built for the engineer who gets paged when something breaks. Different user, different UX, different value prop.

92. **Real-Time, Not Daily** — Vantage updates costs daily. dd0c provides near-real-time monitoring. For a team burning $100/hour on a runaway resource, daily updates mean $2,400 wasted before you even know.

### Building the Moat

93. **Cross-Module Data Flywheel** — dd0c/cost knows your spend. dd0c/portal knows who owns what. dd0c/alert knows your incident patterns. Together, they create an intelligence layer no single-purpose tool can match. "The payment service owned by Team Alpha had a cost spike correlated with the deployment that triggered 3 alerts."

94. **Anonymized Benchmarking Network** — The more customers dd0c has, the better the benchmarking data. "Your RDS spend per GB is 2x the median." This data is exclusive to dd0c and improves with scale. Classic network effect.

95. **Optimization Intelligence Accumulation** — Every remediation action taken through dd0c trains the system. "When customers see this pattern, they usually do X." Over time, dd0c's recommendations become eerily accurate. Data moat.

96. **Open-Source Agent, Paid Dashboard** — The in-VPC agent is open source. This builds trust (customers can audit the code), creates community contributions, and makes dd0c the default choice. The dashboard/alerting/remediation is the paid layer.

97. **Terraform/Pulumi Provider** — `dd0c_cost_monitor` as a Terraform resource. Define your cost policies as code. This embeds dd0c into the infrastructure-as-code workflow, making it sticky.

98. **Slack-First Architecture** — Most FinOps tools are dashboard-first. dd0c is Slack-first. Engineers live in Slack. Alerts, actions, reports — all in Slack. The dashboard exists for deep dives, but daily interaction is in Slack. This is a UX moat.

99. **Multi-Cloud (Strategic Expansion)** — Start AWS-only (Brian's expertise). Add GCP and Azure in Year 2. Become the cross-cloud cost anomaly layer. No single cloud vendor will build this because it's against their interest.

100. **API-First for Automation** — Full API for everything. Let customers build custom workflows: "When dd0c detects a spike > $500, automatically create a Jira ticket and page the team lead." Programmable FinOps.

---
## Phase 4: Anti-Ideas & Red Team (12 ideas)

### Why This Could Fail

101. **AWS Improves Cost Explorer** — AWS could ship real-time billing, better anomaly detection, and native Slack integration. They have the data advantage (it's their platform). Counter: AWS has had 15 years to make billing UX good and hasn't. Their incentive is for you to SPEND more, not less. They'll never build a great cost reduction tool.

102. **Vantage Eats Our Lunch** — Vantage is well-funded, developer-friendly, and already has momentum. They could add real-time anomaly detection tomorrow. Counter: Vantage is going broad (FinOps platform). We're going deep (anomaly detection + remediation). Different strategies.

103. **IAM Permission Anxiety** — Customers won't give dd0c the IAM permissions needed for remediation (terminate instances, modify resources). Counter: Tiered permissions. Read-only for detection (low trust barrier). Write permissions only for remediation (opt-in). Open-source agent for auditability.

104. **Race to the Bottom on Pricing** — Cost optimization tools compete on price because their value prop is "we save you money." If you charge too much relative to savings, customers leave. Counter: Price as % of savings identified, not flat fee. Align incentives.

105. **False Positive Fatigue** — If dd0c alerts too often on non-issues, users will ignore it (same problem as AWS native). Counter: Composite anomaly scoring, tunable sensitivity, and a "snooze" mechanism. Learn from user feedback to reduce false positives over time.

106. **Small Market Size** — Teams spending $10K-$500K/month is a specific segment. Below $10K, savings aren't worth the tool cost. Above $500K, they have dedicated FinOps teams using enterprise tools. Counter: This segment is actually massive — hundreds of thousands of AWS accounts. And the $500K ceiling can rise as dd0c matures.

107. **Security Breach Risk** — dd0c has read (and optionally write) access to customer AWS accounts. A breach would be catastrophic for trust. Counter: Minimal permissions, open-source agent, SOC 2 compliance from day 1, no storage of sensitive data (only cost metrics).

108. **"We'll Build It Internally"** — Platform teams at mid-size companies might build their own cost monitoring. Counter: They always underestimate the effort. Internal tools get abandoned. dd0c is cheaper than one engineer's time for a month.

109. **AWS Organizations Consolidated Billing Complexity** — Large orgs with complex account structures, SCPs, and consolidated billing make cost attribution incredibly hard. Counter: This is actually a FEATURE opportunity. If dd0c handles multi-account complexity well, it becomes indispensable.

110. **Terraform Cost Estimation Tools (Infracost) Expand** — Infracost could add post-deploy monitoring to complement their pre-deploy estimation. Counter: Different core competency. Infracost is CI/CD-focused. dd0c is runtime-focused. They're complementary, not competitive. Could even integrate.

111. **Economic Downturn Kills Cloud Spend** — If companies cut cloud budgets aggressively, there's less to optimize. Counter: Downturns INCREASE demand for cost optimization tools. When budgets tighten, every dollar matters more.

112. **Customer Churn After Optimization** — Customers use dd0c, optimize their spend, then cancel because there's nothing left to optimize. Counter: Cost drift is continuous. New resources, new team members, new services — waste regenerates. dd0c is a continuous need, not a one-time fix. Also, the monitoring/alerting value persists even after optimization.

---
## Phase 5: Synthesis

### Top 10 Ideas (Ranked by Impact × Feasibility)

| Rank | Idea | Why |
|------|------|-----|
| 1 | **CloudTrail Real-Time Event Detection** (#27, #29) | The single biggest differentiator vs. every competitor. Detect expensive resource creation in seconds, not days. This is the core innovation. |
| 2 | **Slack-Native Alerts with Action Buttons** (#60) | Where engineers live. Alert + context + one-click action in Slack = the entire value prop in one message. This IS the product for most users. |
| 3 | **One-Click Remediation Suite** (#40-44) | Stop, terminate, resize, schedule — all from the alert. Closing the gap between detection and action is the moat. |
| 4 | **Zombie Resource Hunter** (#73, #44) | Autonomous agent that continuously finds and flags waste. Set-and-forget value. This is the "it pays for itself" feature. |
| 5 | **End-of-Month Projection** (#56) | "You'll exceed budget by $7,200 unless you act." Simple, powerful, and something AWS does terribly. |
| 6 | **Team-Level Cost Attribution** (#51) | Accountability without blame. Each team sees their costs. Essential for organizations with 3+ engineering teams. |
| 7 | **Schedule Non-Prod Shutdown** (#42) | The single easiest win for any customer. "Turn off dev at night" = instant 62% savings on non-prod. Proves ROI in week 1. |
| 8 | **Cost Blast Radius for PRs** (#74) | Shift-left cost awareness. GitHub Action that comments estimated cost impact on PRs. Viral distribution mechanism (developers share cool GitHub Actions). |
| 9 | **Real-Time Cost Ticker** (#67) | Emotional hook. A live burn rate counter creates urgency and awareness. Makes cost visceral, not abstract. |
| 10 | **Rule-Based Guardrails** (#36) | Simple rules catch 80% of problems. "Alert if daily spend > 150% of average." Easy to implement, easy to understand, high value. |

### 3 Wild Cards 🃏

| Wild Card | Idea | Why It's Wild |
|-----------|------|---------------|
| 🃏 1 | **"Cost Replay" DVR** (#71) | Rewind your bill like a video. Correlate cost changes with CloudTrail events in a timeline. Nobody has this. It would be a demo-killer at conferences. |
| 🃏 2 | **Competitive Benchmarking Network** (#75, #94) | "Companies like yours spend 30% less on RDS." Anonymized cross-customer data creates a network effect moat that grows with every customer. Requires scale but is defensible. |
| 🃏 3 | **Invoice Dispute Assistant** (#82) | AI that finds AWS billing errors and generates dispute emails. AWS overcharges more than people realize. This would generate incredible word-of-mouth: "dd0c found $2,400 in billing errors on my account." |

### Recommended V1 Scope

**V1 Goal:** Get a customer from "connected AWS account" to "first anomaly detected and remediated" in under 10 minutes.

**V1 Features (4-6 week build):**
1. **AWS Account Connection** — IAM role with read-only billing + CloudTrail access. One CloudFormation template click.
2. **CloudTrail Event Monitoring** — Real-time detection of expensive resource creation (EC2, RDS, EMR, NAT Gateway, EBS volumes).
3. **CloudWatch Billing Polling** — 5-minute polling of EstimatedCharges for account-level anomaly detection.
4. **Statistical Anomaly Detection** — Z-score based, per-service, with configurable sensitivity (low/medium/high).
5. **Slack Integration** — Alerts with context (what, who, when, how much) and action buttons (Stop, Terminate, Snooze, Assign).
6. **Zombie Resource Scanner** — Daily scan for idle EC2 (CPU <5% for 7 days), unattached EBS volumes, orphaned EIPs, unused ELBs.
7. **One-Click Stop/Terminate** — Optional write permissions for direct remediation from Slack.
8. **End-of-Month Forecast** — Simple projection based on current burn rate with budget comparison.
9. **Daily Digest** — Morning Slack message with yesterday's spend, trend, and top anomalies.
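The zombie scanner in feature 6 above reduces to a couple of cheap checks per resource; a sketch assuming boto3, with the CPU decision kept as a pure function (unattached EBS volumes report status `available`):

```python
def is_zombie(daily_avg_cpu: list[float], days: int = 7,
              cpu_threshold: float = 5.0) -> bool:
    """Flag an instance whose daily average CPU stayed below the
    threshold for the last `days` consecutive days."""
    if len(daily_avg_cpu) < days:
        return False  # too young to judge
    return all(c < cpu_threshold for c in daily_avg_cpu[-days:])

def find_unattached_volumes(ec2) -> list[str]:
    """`ec2` is a boto3 EC2 client. Unattached EBS volumes have
    status 'available'."""
    resp = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [v["VolumeId"] for v in resp["Volumes"]]
```

The daily CPU averages would come from CloudWatch `CPUUtilization` statistics; the scanner only decides, and the Slack alert (feature 5) carries the action buttons.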
|
||||

**V1 Does NOT Include:**

- Multi-cloud (AWS only)
- CUR parsing (too complex for V1; use CloudWatch + CloudTrail)
- Savings Plan/RI optimization (Phase 2)
- Team attribution (requires a tagging strategy; Phase 2)
- PR cost estimation (Phase 2; integrate with Infracost instead)
- Dashboard UI (Slack-first for V1; web dashboard in Phase 2)
**V1 Pricing:**

- Free: 1 AWS account, daily anomaly checks only
- Pro ($49/mo): 3 accounts, real-time detection, Slack alerts, remediation
- Business ($149/mo): Unlimited accounts, zombie hunter, forecasting, team features

**V1 Success Metric:** First 10 paying customers within 60 days of launch. Average customer saves >$500/month (10x the Pro price).
---

*Total ideas generated: 112*

*Session complete. Let's build this thing.* 🔥
---

`products/05-aws-cost-anomaly/design-thinking/session.md`

# dd0c/cost — Design Thinking Session
## AWS Cost Anomaly Detective

**Facilitator:** Maya, Design Thinking Maestro
**Date:** February 28, 2026
**Product:** dd0c/cost (Product #5 — "The Gateway Drug")
**Philosophy:** Design is about THEM, not us.

---

> *"The best products don't solve problems. They dissolve the anxiety that surrounds them. When a startup CTO sees a bill that's 4x what they expected, the problem isn't the bill — it's the 47 seconds of pure existential dread before they can even begin to understand WHY."*

---

# Phase 1: EMPATHIZE

We're not building a cost tool. We're building an anxiety medication for cloud infrastructure. Let's meet the humans who need it.

---
## Persona 1: The Startup CTO — "Alex"

**Demographics:** 32 years old. Series A startup, 12 engineers. Wears the CTO, VP Eng, and DevOps hats simultaneously. Personally signed the AWS Enterprise Agreement. The board sees every line item.

**The Moment That Defines Alex:**
It's Tuesday morning, 7:14 AM. Alex is brushing their teeth when a Slack notification buzzes. The CFO has forwarded the AWS billing alert email: "Your estimated charges for this billing period have exceeded $8,000." Last month was $2,100. Alex's stomach drops. Toothbrush still in mouth. They open the AWS Console on their phone. Cost Explorer takes 11 seconds to load on mobile. The bar chart shows a spike but doesn't say WHERE or WHY. Alex is now going to be late for the 8 AM standup and will spend the entire meeting distracted, mentally running through every possible cause. Was it the new feature deploy? Did someone spin up a big instance? Is it a data transfer thing? They don't know. They won't know for hours.
### Empathy Map

**SAYS:**
- "Can someone check why our AWS bill spiked?"
- "We need to be more careful about resource management"
- "I'll look into it after standup"
- "We can't afford surprises like this"
- "Who launched that instance?"
- "Do we even need that RDS cluster in staging?"

**THINKS:**
- "This is going to come up at the board meeting"
- "I should have set up billing alerts months ago"
- "Is this my fault for not having better guardrails?"
- "What if this keeps happening and we burn through runway?"
- "I don't have time to become a FinOps expert on top of everything else"
- "The investors are going to ask about our burn rate"

**DOES:**
- Opens AWS Cost Explorer (waits for it to load, gets frustrated)
- Manually checks the EC2, RDS, and Lambda consoles — one by one
- Searches CloudTrail logs trying to correlate events with cost spikes
- Asks in Slack: "Did anyone spin up anything big recently?"
- Creates a spreadsheet to track monthly costs (abandons it by month 3)
- Sets a billing alarm at 80% of budget (but the alarm fires 48 hours late)
**FEELS:**
- **Panic** — the visceral gut-punch of an unexpected bill
- **Helpless** — AWS gives data but not answers
- **Guilty** — "I should have caught this sooner"
- **Overwhelmed** — too many consoles, too many services, not enough time
- **Exposed** — the board/investors will see this number
- **Alone** — nobody else on the team understands AWS billing
### Pain Points
1. **The 48-hour blind spot** — By the time Cost Explorer shows the spike, thousands are already burned
2. **No attribution** — "EC2 costs went up" tells you nothing about WHICH instance or WHO launched it
3. **Context-switching hell** — Diagnosing a cost issue requires jumping between 5+ AWS consoles
4. **Personal liability** — At a startup, the CTO's name is on the account. The bill feels personal.
5. **Time poverty** — Alex has 47 other priorities. Cost management is important but never urgent — until it's an emergency.
6. **Knowledge gap** — Alex is a great engineer but not a FinOps specialist. AWS billing is deliberately opaque.
### Current Workarounds
- AWS Billing Alerts (delayed, no context, email-only)
- Monthly manual review of Cost Explorer (reactive, not proactive)
- Asking in Slack "who did this?" (blame-oriented, unreliable)
- Spreadsheet tracking (abandoned within weeks)
- Hoping for the best (the most common strategy)
### Jobs To Be Done (JTBD)
- **When** I see an unexpected AWS charge, **I want to** instantly understand what caused it and who's responsible, **so I can** fix it before it gets worse and explain it to stakeholders.
- **When** I'm planning our monthly budget, **I want to** confidently predict our AWS spend, **so I can** give the board accurate numbers and not look incompetent.
- **When** a new service or resource is created, **I want to** know immediately if it's going to be expensive, **so I can** intervene before costs accumulate.
### Day-in-the-Life Scenario

**6:45 AM** — Wake up, check phone. No alerts. Good.
**7:14 AM** — CFO Slack: "Why is AWS $8K?" Stomach drops.
**7:15–7:55 AM** — Frantically clicking through the AWS Console on a laptop. Cost Explorer shows an EC2 spike but no details. Check CloudTrail — hundreds of events, no obvious culprit.
**8:00 AM** — Standup. Distracted. Mentions "looking into a billing issue."
**8:30–10:00 AM** — Deep dive. Finally discovers: a developer launched 4x p3.2xlarge GPU instances for an ML experiment last Friday. They're still running. That's 4 × $3.06/hour × 96 hours = $1,175 burned. The developer forgot.
**10:05 AM** — Terminates the instances. Sends a Slack message to the team about resource management. Feels like a hall monitor.
**10:30 AM** — Writes a "cloud cost policy" doc. Nobody will read it.
**11:00 AM** — Back to actual work, 3 hours behind schedule.
**Next month** — It happens again. Different resource. Same panic.
---

## Persona 2: The FinOps Analyst — "Jordan"

**Demographics:** 28 years old. Mid-size SaaS company, 150 engineers, 23 AWS accounts. Jordan's title is "Cloud Financial Analyst," but everyone calls them "the cost person." Reports to the VP of Engineering and dotted-line to Finance. The only person in the company who understands AWS billing at a granular level.

**The Moment That Defines Jordan:**
It's the last Thursday of the month. Jordan has spent the past 3 days building the monthly cloud cost report. They have 14 browser tabs open: Cost Explorer for 6 different accounts, 3 spreadsheets, a Confluence page, and the AWS CUR data in Athena. The VP of Engineering wants the report by Friday EOD. The CFO wants it "in a format Finance can understand." Jordan is translating between two worlds — engineering resource names and financial line items — and neither side appreciates how hard that translation is. They just found a $4,200 discrepancy between Cost Explorer and the CUR data and have no idea which one is right.
### Empathy Map

**SAYS:**
- "I need the teams to tag their resources properly"
- "The CUR data doesn't match Cost Explorer — again"
- "Can we get a meeting to discuss the tagging strategy?"
- "This account's spend is 40% over forecast"
- "I've been asking for this data for two weeks"
- "No, I can't tell you the cost per request. We don't have that granularity."

**THINKS:**
- "Nobody takes tagging seriously until the bill is a disaster"
- "I'm a single point of failure for cost visibility in this entire company"
- "If I got hit by a bus, nobody could produce this report"
- "I wish I could automate 80% of what I do"
- "The engineering teams think I'm the cost police. I'm trying to help them."
- "There has to be a better way than 14 spreadsheets"

**DOES:**
- Downloads CUR data daily, loads it into Athena, runs custom queries
- Maintains a master spreadsheet mapping AWS accounts → teams → budgets
- Sends weekly cost summaries to team leads (most don't read them)
- Manually investigates anomalies by cross-referencing CUR, CloudTrail, and Cost Explorer
- Attends FinOps Foundation meetups to learn best practices
- Builds custom dashboards in QuickSight (they break every time AWS changes the CUR schema)
**FEELS:**
- **Frustrated** — the tools are inadequate and nobody understands the complexity
- **Undervalued** — cost optimization saves hundreds of thousands but gets no glory
- **Anxious** — one missed anomaly and it's Jordan's fault
- **Isolated** — the only person who speaks both "engineering" and "finance"
- **Exhausted** — the work is repetitive, manual, and never-ending
- **Determined** — genuinely believes cost optimization matters and wants to prove it
### Pain Points
1. **Manual data wrangling** — 60% of Jordan's time is spent collecting, cleaning, and reconciling data, not analyzing it
2. **Tagging chaos** — Teams don't tag consistently. Untagged resources are a black hole of unattributable cost.
3. **Multi-account complexity** — 23 accounts with different owners, different conventions, and different levels of maturity
4. **No real-time visibility** — CUR is hourly at best; Cost Explorer is 24–48 hours delayed. Jordan is always looking backward.
5. **Stakeholder translation** — Engineering wants resource-level detail. Finance wants department-level summaries. Jordan manually bridges the gap.
6. **Tool fragmentation** — Uses Cost Explorer + CUR + Athena + QuickSight + spreadsheets + Slack. No single source of truth.
### Current Workarounds
- Custom Athena queries on CUR data (brittle, requires SQL expertise)
- A master spreadsheet updated manually every week (error-prone)
- QuickSight dashboards (break when the CUR schema changes)
- Slack reminders to team leads about their budgets (ignored)
- Monthly "cost review" meetings (dreaded by everyone)
- AWS Cost Anomaly Detection (too many false positives, no actionable context)
### Jobs To Be Done (JTBD)
- **When** I'm preparing the monthly cost report, **I want to** automatically aggregate costs by team, environment, and service with accurate attribution, **so I can** deliver the report in hours instead of days.
- **When** an anomaly is detected, **I want to** immediately see the root cause with full context (who, what, when, why), **so I can** resolve it without a 3-hour investigation.
- **When** a team exceeds their budget, **I want to** automatically notify the team lead with specific recommendations, **so I can** scale cost governance without being the bottleneck.
### Day-in-the-Life Scenario

**8:00 AM** — Open laptop. 47 unread emails. 12 are AWS billing notifications from various accounts. Triage: most are noise.
**8:30 AM** — Check yesterday's CUR data in Athena. Run the anomaly detection query Jordan wrote. 3 flagged items. One is a real issue (a new RDS instance in account #17); two are false positives (a monthly batch job, expected).
**9:00 AM** — Slack the owner of account #17: "Hey, there's a new db.r5.4xlarge in us-west-2. Is this expected?" No response for 2 hours.
**9:15 AM** — Start building the weekly cost summary. Pull data from 6 accounts. Two accounts have untagged resources totaling $3,400. Jordan can't attribute them. Adds them to "Unallocated" with a note.
**10:00 AM** — Meeting with the VP Eng about the Q1 cloud budget. The VP wants to cut 15%. Jordan explains which optimizations are realistic and which are fantasy. The VP doesn't fully understand the constraints.
**11:00 AM** — Account #17's owner responds: "Oh yeah, that's for the new analytics pipeline. It's permanent." Jordan updates the forecast spreadsheet. The annual impact is $28,000. Nobody approved this.
**12:00 PM** — Lunch at desk. Reading a FinOps Foundation article about showback vs. chargeback models.
**1:00–4:00 PM** — Deep in spreadsheets. Reconciling CUR data with the finance team's GL codes. Find a $4,200 discrepancy. Spend 90 minutes discovering it's caused by a refund that appeared in CUR but not in Cost Explorer.
**4:30 PM** — A team lead asks: "Can you tell me how much our staging environment costs?" Jordan: "Give me 30 minutes." It takes 90, because staging resources aren't consistently tagged.
**6:00 PM** — Leave. Tomorrow: same thing.
---

## Persona 3: The DevOps Engineer — "Sam"

**Demographics:** 26 years old. Backend/infrastructure engineer at a 40-person startup. Manages Terraform, CI/CD, and "whatever AWS thing is broken today." Doesn't think about costs — until they cause a problem. Sam's primary metric is uptime, not spend.

**The Moment That Defines Sam:**
It's Friday at 4:47 PM. Sam is about to close the laptop for the weekend when a Slack message from the CTO lands: "Sam, did you launch those GPU instances? Finance says we burned $1,200 on something called p3.2xlarge." Sam's blood runs cold. On Monday, Sam spun up 4 GPU instances to benchmark a new ML model for the data team. The benchmark took 20 minutes. Sam meant to terminate them immediately after. But then there was a production incident, and Sam got pulled away, and the instances... are still running. It's been 4 days. Sam checks: $3.06/hour × 4 instances × 96 hours = $1,175. Sam wants to disappear.
### Empathy Map

**SAYS:**
- "I'll terminate it right after the test"
- "I thought I set it to auto-terminate"
- "Can we get a policy to auto-kill dev resources?"
- "I didn't know NAT Gateways were that expensive"
- "The staging environment? Yeah, it's always running. Should it not be?"
- "I don't have time to learn AWS billing — I have deploys to ship"

**THINKS:**
- "Cost management isn't my job... but it keeps becoming my problem"
- "I should have set a reminder to terminate those instances"
- "AWS makes it way too easy to create expensive things and way too hard to know what they cost"
- "I'm going to get blamed for this even though there's no guardrail to prevent it"
- "Why doesn't AWS just TELL you when something is burning money?"
- "I bet there are other zombie resources I don't even know about"

**DOES:**
- Launches resources via Terraform and the CLI, sometimes via the console for quick tests
- Forgets to clean up temporary resources (not malicious — just busy)
- Checks costs only when asked by management
- Uses `aws ce get-cost-and-usage` occasionally but finds the output confusing
- Tags resources inconsistently ("I'll add tags later" → never)
- Responds to cost inquiries defensively ("It was just a test!")
**FEELS:**
- **Embarrassed** — when caught leaving expensive resources running
- **Defensive** — "There should be a system to catch this, not just blame me"
- **Indifferent** — cost isn't Sam's KPI; uptime and velocity are
- **Overwhelmed** — too many responsibilities; cost management is one more thing
- **Anxious** — fear of making an expensive mistake and getting called out
- **Resentful** — "Why is this my problem? Where are the guardrails?"
### Pain Points
1. **No feedback loop** — Sam creates a resource and gets zero signal about its cost until someone complains weeks later
2. **Easy to create, hard to track** — AWS makes it trivial to launch resources and nearly impossible to understand their cost implications in real time
3. **No safety net** — There's no automated system to catch forgotten resources. It's all human memory.
4. **Blame culture** — When costs spike, the question is "who did this?" not "how do we prevent this?"
5. **Cost literacy gap** — Sam is an excellent engineer but has no mental model for AWS pricing. NAT Gateway data processing? EBS IOPS charges? It's a foreign language.
6. **Context-switching tax** — Investigating a cost issue means leaving the terminal/IDE and navigating the AWS billing console, which is a completely different mental model.
### Current Workarounds
- Setting personal calendar reminders to terminate test resources (unreliable)
- Using spot instances when remembering to (inconsistent)
- Terraform `destroy` for test stacks (when they remember)
- Asking in Slack before launching anything expensive (social pressure, not a system)
- Nothing. Most of the time, there's just nothing. Hope and prayer.
### Jobs To Be Done (JTBD)
- **When** I spin up a temporary resource for testing, **I want to** be automatically reminded (or have it auto-terminated) after a set period, **so I can** focus on my actual work without worrying about zombie resources.
- **When** I'm about to create something expensive, **I want to** see the estimated cost impact immediately, **so I can** make an informed decision or choose a cheaper alternative.
- **When** a cost anomaly is traced back to my actions, **I want to** fix it with one click from wherever I already am (Slack/terminal), **so I can** resolve it in 30 seconds instead of 15 minutes of console-clicking.
### Day-in-the-Life Scenario

**9:00 AM** — Start the day. Check CI/CD pipelines. One failed overnight — a flaky test. Re-run it.
**9:30 AM** — Sprint planning. Pick up a ticket to set up a new ECS service for the payments team.
**10:00 AM** — Writing Terraform for the new ECS service. Chooses the instance type based on the last service they set up (m5.xlarge). Doesn't check if it's the right size. Doesn't estimate cost.
**11:00 AM** — The data team asks Sam to spin up GPU instances for ML benchmarking. Sam launches 4x p3.2xlarge via the CLI. Plans to terminate them after lunch.
**11:30 AM** — Production alert: database connection pool exhausted. All hands on deck.
**11:30 AM–2:00 PM** — Incident response. The GPU instances are completely forgotten.
**2:00 PM** — Incident resolved. Sam is mentally drained. Grabs lunch.
**2:30 PM** — Back to the ECS Terraform. Deploys to staging. Doesn't think about the GPU instances.
**3:00 PM** — Code review for a teammate's Lambda function. Doesn't notice it logs full request payloads at DEBUG level (a future CloudWatch cost bomb).
**4:00 PM** — Pushes the ECS service to production. Monitors for 30 minutes. Looks good.
**4:47 PM** — CTO Slack: "Did you launch those GPU instances?" The cold sweat begins.
**4:50 PM** — Terminates the instances. $1,175 burned. Apologizes. Feels terrible.
**5:00 PM** — Closes laptop. Spends the weekend low-key anxious about it.
**Monday** — The CTO announces a new "cloud cost policy." Sam knows it's because of them. Nobody will follow it.
---

# Phase 2: DEFINE

> *"A well-defined problem is a problem half-solved. But here's the jazz riff — the problem isn't 'costs are too high.' The problem is 'I'm flying blind in a machine that charges by the millisecond.' That's a fundamentally different design challenge."*

---

## Point-of-View (POV) Statements

### POV 1: The Startup CTO (Alex)
**Alex**, a time-starved startup CTO who is personally accountable for AWS spend, **needs a way to** instantly understand and resolve unexpected cost spikes the moment they happen, **because** the 48-hour delay in current tools means thousands of dollars burn before they even know there's a problem, and every unexplained spike erodes investor confidence and their own credibility.
### POV 2: The FinOps Analyst (Jordan)
**Jordan**, a solo FinOps analyst responsible for cost governance across 23 AWS accounts, **needs a way to** automatically detect, attribute, and communicate cost anomalies without manual data wrangling, **because** they spend 60% of their time collecting and reconciling data instead of analyzing it, and they are a single point of failure for cost visibility in a 150-person engineering org.

### POV 3: The DevOps Engineer (Sam)
**Sam**, a DevOps engineer who accidentally creates expensive zombie resources, **needs a way to** get immediate cost feedback and automatic safety nets when creating or forgetting cloud resources, **because** there is currently zero feedback loop between "I launched a thing" and "that thing is costing $12/hour," and the resulting blame culture makes cost management feel punitive rather than preventive.

---
## Key Insights

1. **The Anxiety Gap** — The real pain isn't the dollar amount. It's the TIME between "something went wrong" and "I understand what happened." AWS's 48-hour delay turns a $200 problem into a $2,000 problem AND a week of anxiety. Speed of detection is speed of relief.

2. **Attribution Is Emotional, Not Just Financial** — "Who did this?" is the first question asked in every cost spike. Current tools can't answer it. This creates blame culture. If dd0c can instantly say "Sam's p3.2xlarge instances from Tuesday," it transforms the conversation from blame to resolution.

3. **Nobody Wakes Up Wanting to Do FinOps** — Alex doesn't want a cost dashboard. Sam doesn't want billing alerts. Jordan doesn't want more spreadsheets. They all want the ABSENCE of cost problems. The best cost tool is one you barely notice — until it saves you.

4. **The Guardrail Deficit** — AWS makes it trivially easy to create expensive resources and provides zero real-time feedback about cost. It's like a highway with no speed limit signs and no guardrails — and the speeding ticket arrives 2 days later. dd0c is the guardrail, not the ticket.

5. **Slack Is the Operating System** — All three personas live in Slack. Alex gets the CFO's panic message there. Jordan sends cost summaries there. Sam gets called out there. The product that wins is the one that meets them where they already are — not behind another login.

6. **The Trust Ladder** — Read-only detection (low trust) → recommendations (medium trust) → one-click remediation (high trust) → autonomous action (maximum trust). Users climb this ladder over time. V1 must support the full ladder but default to the bottom rung.

7. **Cost Literacy Is Near Zero** — Even experienced engineers don't understand AWS pricing. NAT Gateway data processing, cross-AZ transfer, EBS IOPS — it's deliberately opaque. dd0c must EXPLAIN costs in human language, not just report numbers.

8. **Waste Regenerates** — Optimization isn't a one-time event. New engineers join, new services launch, configurations drift. The zombie resource problem is perpetual. dd0c's value is continuous, not episodic.

---
## How Might We (HMW) Questions

### Detection & Speed
1. **HMW** detect expensive resource creation in seconds instead of days, so users can intervene before costs accumulate?
2. **HMW** distinguish between expected cost changes (planned deployments) and genuine anomalies, to minimize false-positive fatigue?
3. **HMW** make the "cost signal" as immediate and visceral as a production alert, so it gets the same urgency?

### Attribution & Understanding
4. **HMW** automatically attribute every cost spike to a specific person, team, and action — without requiring perfect tagging?
5. **HMW** explain cost anomalies in plain English so that a CTO, a FinOps analyst, AND a junior engineer all understand what happened?
6. **HMW** show the "cost blast radius" of a single action (e.g., "this one CLI command is costing $12.24/hour") at the moment it happens?

### Remediation & Action
7. **HMW** reduce the time from "anomaly detected" to "problem fixed" from hours to seconds?
8. **HMW** make remediation feel safe (not scary) so users actually click the "Stop Instance" button instead of hesitating?
9. **HMW** build automatic safety nets (auto-terminate, auto-schedule) that prevent problems without requiring human vigilance?

### Culture & Behavior
10. **HMW** transform cost management from a blame game into a team sport?
11. **HMW** make cost awareness a natural byproduct of daily engineering work, not a separate chore?
12. **HMW** reward cost-conscious behavior instead of only punishing waste?

### Scale & Governance
13. **HMW** give Jordan (the FinOps analyst) their time back by automating the 60% of work that's just data wrangling?
14. **HMW** provide cost governance across 20+ AWS accounts without creating a bottleneck at one person?

---
## The Core Tension: Real-Time Detection vs. Accuracy

Here's the design tension that will define dd0c/cost's soul:

```
FAST ←————————————————————→ ACCURATE
CloudTrail events             CUR line-item data
(instant, but estimated)      (hourly+, but precise)
```

**The Tradeoff:**
- **CloudTrail event detection** tells you "someone just launched a p3.2xlarge" within seconds. You can estimate the cost ($3.06/hour). But you don't have the ACTUAL billed amount yet — there could be reserved instance coverage, savings plans, spot pricing, or marketplace fees that change the real number.
- **CUR/Cost Explorer data** gives you the exact billed amount, with all discounts and credits applied. But it's delayed by hours (CUR) or days (Cost Explorer).

**The Resolution — A Two-Layer Architecture:**

| Layer | Source | Speed | Accuracy | Use Case |
|-------|--------|-------|----------|----------|
| **Layer 1: Event Stream** | CloudTrail + EventBridge | Seconds | Estimated (~85% accurate) | "ALERT: New expensive resource detected" |
| **Layer 2: Billing Reconciliation** | CloudWatch EstimatedCharges + CUR | Minutes to hours | Precise (99%+) | "UPDATE: Confirmed cost impact is $X" |

**The Design Principle:** Alert on Layer 1 (fast, estimated). Reconcile with Layer 2 (slow, precise). Always show the user which layer they're looking at. Never pretend an estimate is exact. Never wait for precision when speed saves money.

This is like a smoke detector vs. a fire investigation. The smoke detector goes off immediately — it might be burnt toast, it might be a real fire. You don't wait for the fire investigator's report before evacuating. You act on the fast signal, then refine your understanding.
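The alert-then-reconcile principle can be sketched as a tiny state transition. Everything here (the `CostAlert` shape, the `render` format) is hypothetical; the only point it illustrates is the design rule that an estimate is always labeled as one and is never silently overwritten.

```typescript
// Hypothetical two-layer alert: Layer 1 emits a labeled estimate immediately;
// Layer 2 attaches the reconciled billed amount when CUR data arrives.
interface CostAlert {
  resourceId: string;
  estimatedHourlyCost: number; // Layer 1: from static pricing tables
  confirmedCost?: number;      // Layer 2: filled in hours later from CUR
  layer: "estimate" | "confirmed";
}

function newAlert(resourceId: string, estimatedHourlyCost: number): CostAlert {
  return { resourceId, estimatedHourlyCost, layer: "estimate" };
}

function reconcile(alert: CostAlert, billedAmount: number): CostAlert {
  // Keep the original estimate alongside the confirmed figure; just flip the label.
  return { ...alert, confirmedCost: billedAmount, layer: "confirmed" };
}

function render(alert: CostAlert): string {
  return alert.layer === "estimate"
    ? `~$${alert.estimatedHourlyCost.toFixed(2)}/hr (estimated)`
    : `$${alert.confirmedCost!.toFixed(2)} (confirmed)`;
}
```

The explicit `layer` field is what lets the Slack message honestly say "~$3.06/hr (estimated)" at detection time and "UPDATE: confirmed" hours later.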
**For each persona, this plays out differently:**
- **Alex (CTO):** Wants Layer 1 immediately. "I don't care if it's $1,175 or $1,230 — I need to know NOW that something is burning money." Precision can come later.
- **Jordan (FinOps):** Needs both layers. Layer 1 for real-time awareness, Layer 2 for accurate reporting and forecasting. Jordan will be frustrated if estimates are wildly off.
- **Sam (DevOps):** Wants Layer 1 as a safety net. "Tell me the second I forget to terminate something." Doesn't care about the exact dollar amount — cares about the pattern.

---
`products/05-aws-cost-anomaly/epics/epics.md`

# dd0c/cost — V1 MVP Epics

This document breaks down the dd0c/cost MVP into implementable Epics and Stories. Stories are sized for a solo founder to complete in 1–3 days (typically 1–5 points).

## Epic 1: CloudTrail Ingestion
**Description:** Build the real-time event pipeline that receives CloudTrail events from customer accounts, filters for cost-relevant actions (EC2, RDS, Lambda), normalizes them into `CostEvents`, and estimates their on-demand cost. This is the foundational data ingestion layer.
### User Stories

**Story 1.1: Cross-Account EventBridge Bus**
- **As a** dd0c system, **I want** to receive CloudTrail events from external customer AWS accounts via EventBridge, **so that** I can process them centrally without running agents in customer accounts.
- **Acceptance Criteria:**
  - `dd0c-cost-bus` created in dd0c's AWS account.
  - Resource policy allows `events:PutEvents` from any AWS account (scoped down by external ID/trust later, but fundamentally open to receive).
  - Test events sent from a separate AWS account successfully arrive on the bus.
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Use AWS CDK. Ensure the bus is configured in `us-east-1`.
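A minimal resource policy for such a bus might look like the following sketch. The account ID and ARN are placeholders; a production policy would scope `Principal` down (or add a `Condition` such as `aws:PrincipalOrgID`) rather than leaving it open, per the caveat in the acceptance criteria.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCustomerPutEvents",
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "events:PutEvents",
      "Resource": "arn:aws:events:us-east-1:111111111111:event-bus/dd0c-cost-bus"
    }
  ]
}
```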
**Story 1.2: SQS Ingestion Queue & Dead-Letter Queue**
- **As a** data pipeline, **I want** events routed from EventBridge to an SQS FIFO queue, **so that** I can process them in order, deduplicate them, and handle bursts without dropping data.
- **Acceptance Criteria:**
  - EventBridge rule routes matching events to the `event-ingestion.fifo` queue.
  - SQS FIFO configured with `MessageGroupId` = accountId and deduplication enabled.
  - DLQ configured after 3 retries.
- **Estimate:** 2
- **Dependencies:** Story 1.1
- **Technical Notes:** CloudTrail can emit duplicates; use `eventID` as the SQS deduplication ID.
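The ordering and dedup scheme above can be sketched as the shape of the `SendMessage` request. The field names follow the SQS API; the helper function itself is illustrative, not part of the spec.

```typescript
// Shape a FIFO SendMessage request: per-account ordering via MessageGroupId,
// CloudTrail duplicate absorption via MessageDeduplicationId.
interface FifoMessageParams {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

function toFifoMessage(
  queueUrl: string,
  accountId: string,
  eventId: string,
  rawEvent: object,
): FifoMessageParams {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(rawEvent),
    MessageGroupId: accountId,       // keeps one account's events in order
    MessageDeduplicationId: eventId, // same eventID => delivered once per 5-minute window
  };
}
```

Note that SQS FIFO deduplication only covers a 5-minute window, which is why Story 1.4 also calls for idempotency at the processor.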
**Story 1.3: Static Pricing Tables**
- **As an** event processor, **I want** local static lookup tables for EC2, RDS, and Lambda on-demand pricing, **so that** I can estimate hourly costs in milliseconds without calling the slow AWS Pricing API.
- **Acceptance Criteria:**
  - JSON/TypeScript dictionaries created for the top 20 instance types for EC2 and RDS, plus Lambda per-GB-second rates.
  - Pricing covers `us-east-1` (with placeholders for other regions if needed).
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Keep it simple for V1. Hardcode the most common instance types. We don't need the entire AWS price list yet.
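A sketch of what such a static table and its lookup might look like. The rates shown are illustrative us-east-1 on-demand prices and should be verified against the AWS price list; the function and table names are ours, not the codebase's.

```typescript
// Tiny static pricing table: us-east-1, on-demand, USD per hour.
const EC2_HOURLY_USD: Record<string, number> = {
  "t3.micro": 0.0104,
  "m5.xlarge": 0.192,
  "p3.2xlarge": 3.06,
};

// Returns null for unknown instance types so the caller can fall back to the
// (slower) AWS Pricing API instead of silently estimating $0.
function estimateHourlyCost(instanceType: string, count = 1): number | null {
  const rate = EC2_HOURLY_USD[instanceType];
  return rate === undefined ? null : rate * count;
}
```

The `null` fallback is the important design choice: an unknown instance type should degrade to a slower lookup, never to a zero-dollar estimate.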
**Story 1.4: Event Processor Lambda**
|
||||
- **As an** event pipeline, **I want** a Lambda function to poll the SQS queue, normalize raw CloudTrail events into `CostEvent` schemas, and write them to DynamoDB, **so that** downstream systems have clean, standardized data.
|
||||
- **Acceptance Criteria:**
|
||||
- Lambda polls SQS (batch size 10).
|
||||
- Parses `RunInstances`, `CreateDBInstance`, `CreateFunction20150331`, etc.
|
||||
- Extracts actor (IAM User/Role ARN), resource ID, region.
|
||||
- Looks up pricing and appends `estimatedHourlyCost`.
|
||||
- Writes `CostEvent` to DynamoDB `dd0c-cost-main` table.
|
||||
- **Estimate:** 5
|
||||
- **Dependencies:** Story 1.2, Story 1.3
|
||||
- **Technical Notes:** Implement idempotency. Use DynamoDB Single-Table Design. Partition key: `ACCOUNT#<id>`, Sort key: `EVENT#<timestamp>#<eventId>`.
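A sketch of the normalization step under the key scheme above. The CloudTrail field names (`eventID`, `eventTime`, `awsRegion`, `recipientAccountId`, `userIdentity.arn`, `responseElements.instancesSet`) match real `RunInstances` records; the `CostEvent` shape itself is an assumption:

```typescript
interface CostEvent {
  pk: string;   // ACCOUNT#<id>
  sk: string;   // EVENT#<timestamp>#<eventId> — same eventID overwrites, giving idempotency
  actorArn: string;
  resourceId: string;
  region: string;
  estimatedHourlyCost?: number;
}

// Normalize a raw CloudTrail RunInstances record into a CostEvent.
function normalize(raw: {
  eventID: string;
  eventTime: string;
  awsRegion: string;
  recipientAccountId: string;
  userIdentity: { arn: string };
  responseElements?: { instancesSet?: { items: { instanceId: string }[] } };
}, hourlyCost?: number): CostEvent {
  return {
    pk: `ACCOUNT#${raw.recipientAccountId}`,
    sk: `EVENT#${raw.eventTime}#${raw.eventID}`,
    actorArn: raw.userIdentity.arn,
    resourceId: raw.responseElements?.instancesSet?.items[0]?.instanceId ?? "unknown",
    region: raw.awsRegion,
    estimatedHourlyCost: hourlyCost,
  };
}
```

Because the sort key embeds `eventID`, a duplicate delivery writes the same item instead of creating a second one.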

## Epic 2: Anomaly Detection Engine

**Description:** Implement the baseline learning and anomaly scoring algorithms. The engine evaluates incoming `CostEvent` records against account-specific, service-specific historical spending baselines to flag unusual spikes, new instance types, or unusual actors.

### User Stories

**Story 2.1: Baseline Storage & Retrieval**

- **As an** anomaly scorer, **I want** to read and write spending baselines per account/service/resource from DynamoDB, **so that** I have a statistical foundation to evaluate new events against.
- **Acceptance Criteria:**
- `Baseline` schema created in DynamoDB (`BASELINE#<account_id>`).
- Read/write logic implemented for running means, standard deviations, max observed, and expected actors/instance types.
- **Estimate:** 3
- **Dependencies:** Story 1.4
- **Technical Notes:** Update baselines with `ADD` expressions in DynamoDB to avoid race conditions.
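One way to make running mean/stddev compatible with DynamoDB `ADD` expressions is to store only additive accumulators (`count`, `sum`, `sumSquares`) and derive the statistics at read time. The field names here are assumptions; the math is standard:

```typescript
// ADD-friendly baseline: concurrent Lambdas can increment these three numbers
// atomically; mean and stddev are computed only when scoring.
interface BaselineAccumulator { count: number; sum: number; sumSquares: number; }

function addObservation(b: BaselineAccumulator, cost: number): BaselineAccumulator {
  return { count: b.count + 1, sum: b.sum + cost, sumSquares: b.sumSquares + cost * cost };
}

function stats(b: BaselineAccumulator): { mean: number; stddev: number } {
  const mean = b.sum / b.count;
  // Clamp at 0 to absorb floating-point noise in sumSquares - mean^2.
  const variance = Math.max(0, b.sumSquares / b.count - mean * mean);
  return { mean, stddev: Math.sqrt(variance) };
}
```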

**Story 2.2: Cold-Start Absolute Thresholds**

- **As a** new customer, **I want** my account to immediately flag highly expensive resources (>$5/hr) even if I have no baseline, **so that** I don't wait 14 days for the system to "learn" a $3,000 mistake.
- **Acceptance Criteria:**
- Implement absolute threshold heuristics: >$0.50/hr = INFO, >$5/hr = WARNING, >$25/hr = CRITICAL.
- Apply this logic when account maturity is `cold-start` (<14 days or <20 events).
- **Estimate:** 2
- **Dependencies:** Story 2.1
- **Technical Notes:** Implement a `scoreAnomaly` function that checks the maturity state of the baseline.
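The threshold heuristic from the acceptance criteria maps directly to a small pure function (the `"none"` value for below-threshold events is an assumption):

```typescript
type Severity = "none" | "info" | "warning" | "critical";

// Cold-start heuristic: absolute hourly-rate thresholds, used only while the
// account has no mature baseline.
function coldStartSeverity(hourlyCost: number): Severity {
  if (hourlyCost > 25) return "critical";
  if (hourlyCost > 5) return "warning";
  if (hourlyCost > 0.5) return "info";
  return "none";
}
```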

**Story 2.3: Statistical Anomaly Scoring**

- **As an** anomaly scorer, **I want** to calculate composite anomaly scores using Z-scores, instance novelty, and actor novelty, **so that** I reduce false positives and only flag truly unusual behavior.
- **Acceptance Criteria:**
- Implement Z-score calculation (event cost vs. baseline mean).
- Implement novelty checks (is this instance type or actor new?).
- Composite score logic computes severity (`info`, `warning`, `critical`).
- Creates an `AnomalyRecord` in DynamoDB if a threshold is crossed.
- **Estimate:** 5
- **Dependencies:** Story 2.1
- **Technical Notes:** Add unit tests covering various edge cases (new actor + cheap instance vs. familiar actor + expensive instance).
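One possible composite scorer, combining the Z-score with flat novelty bonuses. The weights and severity cutoffs are illustrative assumptions, not tuned values:

```typescript
interface Signals { zScore: number; newInstanceType: boolean; newActor: boolean; }

// Composite score: z-score contribution plus novelty bonuses (assumed weights).
function compositeScore(s: Signals): number {
  let score = Math.max(0, s.zScore);
  if (s.newInstanceType) score += 1.5; // never-seen instance family
  if (s.newActor) score += 1.0;        // never-seen IAM principal
  return score;
}

// Map score to severity; null means "below threshold, write no AnomalyRecord".
function severityFor(score: number): "info" | "warning" | "critical" | null {
  if (score >= 5) return "critical";
  if (score >= 3) return "warning";
  if (score >= 1.5) return "info";
  return null;
}
```

This shape makes the edge cases in the technical notes directly testable: a familiar actor launching an expensive instance scores on the Z-score alone, while a new actor with a cheap instance scores mostly on novelty.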

**Story 2.4: Feedback Loop ("Mark as Expected")**

- **As an** anomaly engine, **I want** to update baselines when a user marks an anomaly as expected, **so that** I learn from feedback and stop alerting on normal workflows.
- **Acceptance Criteria:**
- Provide a function to append a resource type and actor to `expectedInstanceTypes` and `expectedActors`.
- Future events matching a suppressed pattern get a reduced anomaly score.
- **Estimate:** 3
- **Dependencies:** Story 2.3
- **Technical Notes:** This API will be called by the Slack action handler.

## Epic 3: Notification Service

**Description:** Build the Slack-first notification engine. Deliver rich Block Kit alerts containing anomaly context, estimated costs, and manual remediation suggestions. This is the product's primary user interface for V1.

### User Stories

**Story 3.1: SQS Alert Queue & Notifier Lambda**

- **As a** notification engine, **I want** to poll an alert queue and trigger a Lambda function for every new anomaly, **so that** I can format and send alerts asynchronously without blocking the ingestion path.
- **Acceptance Criteria:**
- Create a standard SQS `alert-queue` for anomalies.
- Create a `notifier` Lambda that polls the queue.
- SQS retries via visibility timeout on Slack API rate limits (429).
- **Estimate:** 2
- **Dependencies:** Story 2.3
- **Technical Notes:** The scorer Lambda pushes the anomaly ID to this queue.

**Story 3.2: Slack Block Kit Formatting**

- **As a** user, **I want** anomaly alerts formatted nicely in Slack, **so that** I can instantly understand what resource launched, who launched it, the estimated cost, and why it was flagged.
- **Acceptance Criteria:**
- Use Slack Block Kit to design a highly readable card.
- Include: Resource Type, Region, Cost/hr, Actor, Timestamp, and the reason (e.g., "New instance type never seen").
- Test rendering for EC2, RDS, and Lambda anomalies.
- **Estimate:** 3
- **Dependencies:** Story 3.1
- **Technical Notes:** Include a "Why this alert" section detailing the anomaly signals.
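A sketch of the card as Block Kit JSON. The `section` and `context` block types and `mrkdwn` text objects are real Block Kit primitives; the copy and field layout are assumptions:

```typescript
interface Anomaly { resourceType: string; region: string; hourlyCost: number; actor: string; reason: string; }

// Build the Block Kit payload for one anomaly alert.
function buildAlertBlocks(a: Anomaly): object[] {
  return [
    { type: "section", text: { type: "mrkdwn", text: `:rotating_light: *${a.resourceType}* launched in *${a.region}* at ~$${a.hourlyCost.toFixed(2)}/hr` } },
    { type: "section", fields: [
      { type: "mrkdwn", text: `*Actor:*\n${a.actor}` },
      { type: "mrkdwn", text: `*Why this alert:*\n${a.reason}` },
    ] },
    { type: "context", elements: [{ type: "mrkdwn", text: "dd0c/cost anomaly alert" }] },
  ];
}
```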

**Story 3.3: Manual Remediation Suggestions**

- **As a** user, **I want** the Slack alert to include CLI commands to stop or terminate the anomalous resource, **so that** I can fix the issue immediately even before one-click buttons are available.
- **Acceptance Criteria:**
- Block Kit template appends a `Suggested actions` section.
- Generate a valid `aws ec2 stop-instances` or `aws rds stop-db-instance` command based on the resource type and region.
- **Estimate:** 2
- **Dependencies:** Story 3.2
- **Technical Notes:** For V1, no actual remediation API calls are made by dd0c. This prevents accidental deletions and builds trust first.
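The suggestion generator is a simple template over resource type, ID, and region; both CLI commands below are real AWS CLI invocations:

```typescript
// Copy-paste CLI suggestion; dd0c never executes these itself in V1.
function suggestCommand(resourceType: "ec2" | "rds", resourceId: string, region: string): string {
  switch (resourceType) {
    case "ec2":
      return `aws ec2 stop-instances --instance-ids ${resourceId} --region ${region}`;
    case "rds":
      return `aws rds stop-db-instance --db-instance-identifier ${resourceId} --region ${region}`;
  }
}
```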

**Story 3.4: Daily Digest Generator**

- **As a** user, **I want** a daily summary of my spending and any minor anomalies, **so that** I don't get paged for every $0.50 resource but still have visibility.
- **Acceptance Criteria:**
- Create an EventBridge Scheduler rule (e.g., cron at 09:00 UTC).
- Lambda queries the last 24h of anomalies and baseline metrics.
- Sends a digest message (Spend Estimate, Anomalies Resolved vs. Open, Zombie Watch summary).
- **Estimate:** 5
- **Dependencies:** Story 3.2
- **Technical Notes:** Query the DynamoDB GSI for recent anomalies (`ANOMALY#<id>#STATUS#*`).

## Epic 4: Customer Onboarding

**Description:** Automate the 5-minute setup experience. Create the CloudFormation templates and cross-account IAM roles required for dd0c to securely read CloudTrail events and resource metadata without touching customer data or secrets.

### User Stories

**Story 4.1: IAM Read-Only CloudFormation Template**

- **As a** customer, **I want** to deploy a simple, open-source CloudFormation template, **so that** I can grant dd0c secure, read-only access to my AWS account without worrying about compromised credentials.
- **Acceptance Criteria:**
- Create the `dd0c-cost-readonly.yaml` template.
- Role `dd0c-cost-readonly` with an `sts:AssumeRole` trust policy.
- Requires an `ExternalId` parameter.
- Allows `ec2:Describe*`, `rds:Describe*`, `lambda:List*`, `cloudwatch:Get*`, `cloudwatch:List*`, `ce:GetCostAndUsage`, `tag:GetResources` (no blanket `cloudwatch:*` — that would break the read-only promise).
- Hosted on a public S3 bucket (`dd0c-cf-templates`).
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Include an EventBridge rule that forwards `cost-relevant` CloudTrail events to dd0c's EventBridge bus (`arn:aws:events:...:dd0c-cost-bus`).

**Story 4.2: Cognito User Pool Authentication**

- **As a** platform, **I want** a secure identity provider, **so that** users can sign up quickly using GitHub or Google SSO.
- **Acceptance Criteria:**
- Configure an Amazon Cognito User Pool.
- Enable GitHub and Google OIDC providers.
- Provide a login URL and redirect to the dd0c app.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** AWS Cognito is free for the first 50K MAU, keeping V1 costs at zero.

**Story 4.3: Account Setup API Endpoint**

- **As a** new user, **I want** an API that initializes my tenant and generates a secure CloudFormation "quick-create" link, **so that** I can click one button to install the required AWS permissions.
- **Acceptance Criteria:**
- `POST /v1/accounts/setup` created in API Gateway.
- Validates the Cognito JWT.
- Generates a unique UUIDv4 `externalId` per tenant/account.
- Returns a URL pointing to the AWS Console CloudFormation quick-create page with pre-filled parameters.
- **Estimate:** 3
- **Dependencies:** Story 4.1, Story 4.2
- **Technical Notes:** The API Lambda should store the generated `externalId` in DynamoDB under the tenant record.
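A sketch of building the quick-create link. The `#/stacks/quickcreate?templateURL=…&stackName=…&param_<Name>=…` format follows the documented CloudFormation console convention; the bucket and template names are taken from Story 4.1:

```typescript
// Build a CloudFormation quick-create console link with the ExternalId pre-filled.
function quickCreateUrl(externalId: string, region = "us-east-1"): string {
  const templateUrl = encodeURIComponent(
    "https://dd0c-cf-templates.s3.amazonaws.com/dd0c-cost-readonly.yaml"
  );
  return (
    `https://${region}.console.aws.amazon.com/cloudformation/home?region=${region}` +
    `#/stacks/quickcreate?templateURL=${templateUrl}` +
    `&stackName=dd0c-cost-readonly&param_ExternalId=${encodeURIComponent(externalId)}`
  );
}
```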

**Story 4.4: Role Validation & Activation**

- **As a** dd0c system, **I want** to validate a user's AWS account connection by assuming their newly created role, **so that** I know I can receive events and start anomaly detection.
- **Acceptance Criteria:**
- `POST /v1/accounts` API created (receives `awsAccountId`, `roleArn`).
- Calls `sts:AssumeRole` using the `roleArn` and `externalId`.
- On success, updates the account status to `active` in DynamoDB.
- Automatically triggers a "Zombie Resource Scan" on connection.
- **Estimate:** 5
- **Dependencies:** Story 4.3
- **Technical Notes:** This is the critical moment. If the `AssumeRole` fails, return an error explaining the `ExternalId` mismatch or missing permissions.

## Epic 5: Dashboard API

**Description:** Build the REST API for anomaly querying, account management, and basic metrics. V1 relies entirely on Slack for interaction, but a minimal API is needed for account settings and the upcoming V2 dashboard.

### User Stories

**Story 5.1: Account Retrieval API**

- **As a** user, **I want** to see my connected AWS accounts, **so that** I can view their health status and disconnect them if needed.
- **Acceptance Criteria:**
- `GET /v1/accounts` API created (returns `accountId`, status, `baselineMaturity`).
- `DELETE /v1/accounts/{id}` API created.
- Returns `401 Unauthorized` without a valid Cognito JWT.
- Scopes database queries to `tenantId`.
- **Estimate:** 3
- **Dependencies:** Story 4.4
- **Technical Notes:** The disconnect endpoint should mark the account as `disconnecting` and trigger a background Lambda to delete the data within 72 hours.

**Story 5.2: Anomaly Listing API**

- **As a** user, **I want** to view a list of recent anomalies, **so that** I can review past alerts or check if anything was missed.
- **Acceptance Criteria:**
- `GET /v1/anomalies` API created.
- Queries DynamoDB GSI3 (`ANOMALY#<id>#STATUS#*`) for the authenticated account.
- Supports `since`, `status`, and `severity` filters.
- Implements basic pagination.
- **Estimate:** 5
- **Dependencies:** Story 2.3
- **Technical Notes:** Include `slackMessageUrl` if the anomaly triggered a Slack alert.
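The filter semantics can be sketched in memory before being translated into DynamoDB key conditions and filter expressions. The row shape is an assumption; note that ISO-8601 timestamps compare correctly as plain strings:

```typescript
interface AnomalyRow { id: string; createdAt: string; status: string; severity: string; }

// Apply the since/status/severity query filters; each is optional.
function filterAnomalies(rows: AnomalyRow[], q: { since?: string; status?: string; severity?: string }): AnomalyRow[] {
  return rows.filter(r =>
    (!q.since || r.createdAt >= q.since) &&   // ISO-8601 strings sort lexically
    (!q.status || r.status === q.status) &&
    (!q.severity || r.severity === q.severity)
  );
}
```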

**Story 5.3: Baseline Overrides**

- **As a** user, **I want** to adjust anomaly sensitivity for specific services or resource types, **so that** I don't get paged for expected batch processing spikes.
- **Acceptance Criteria:**
- `PATCH /v1/accounts/{id}/baselines/{service}/{type}` API created.
- Modifies the DynamoDB baseline record to update `sensitivityOverride` (`low`, `medium`, `high`).
- **Estimate:** 2
- **Dependencies:** Story 2.1
- **Technical Notes:** Valid values must be enforced by the API schema.

## Epic 6: Dashboard UI

**Description:** Build the initial Next.js/React frontend. While V1 focuses on Slack, the web dashboard handles onboarding, account connection, Slack OAuth, and basic anomaly viewing for users who prefer the web.

### User Stories

**Story 6.1: Next.js Boilerplate & Auth**

- **As a** user, **I want** to sign in to the dd0c/cost portal, **so that** I can configure my account and view my AWS connections.
- **Acceptance Criteria:**
- Initialize a Next.js app with Tailwind CSS.
- Implement AWS Amplify or `next-auth` for Cognito integration.
- Landing page with a `Start Free` button.
- Protect `/dashboard` routes.
- **Estimate:** 3
- **Dependencies:** Story 4.2
- **Technical Notes:** Keep the design clean and Vercel-like. The goal is to get the user authenticated in <10 seconds.

**Story 6.2: Onboarding Flow**

- **As a** new user, **I want** a simple 3-step wizard to connect AWS and Slack, **so that** I don't get lost in documentation.
- **Acceptance Criteria:**
- "Connect AWS Account" screen.
- Generates the CloudFormation quick-create URL.
- Polls `/v1/accounts/{id}/health` for a successful connection.
- "Connect Slack" screen initiates the OAuth flow.
- **Estimate:** 5
- **Dependencies:** Story 4.3, Story 4.4, Story 6.1
- **Technical Notes:** Provide a fallback manual input field if the auto-polling fails or the user closes the AWS Console window early.

**Story 6.3: Basic Dashboard View**

- **As a** user, **I want** a simple dashboard showing my connected accounts, recent anomalies, and estimated monthly cost, **so that** I have a high-level view outside of Slack.
- **Acceptance Criteria:**
- Render an `Account Overview` table.
- Fetch anomalies via `/v1/anomalies` and display them in a simple list or timeline.
- Indicate the account's baseline learning phase (e.g., "14 days left in learning phase").
- **Estimate:** 5
- **Dependencies:** Story 5.1, Story 5.2, Story 6.1
- **Technical Notes:** The V1 UI shouldn't be complex. Avoid graphs or heavy chart libraries for the MVP.

## Epic 7: Slack Bot

**Description:** Build the Slack bot interaction model. This includes the OAuth installation flow, parsing incoming slash commands (`/dd0c status`, `/dd0c anomalies`, `/dd0c digest`), and handling interactive message payloads for actions like snoozing or marking alerts as expected.

### User Stories

**Story 7.1: Slack OAuth Installation Flow**

- **As a** user, **I want** to securely install the dd0c app to my Slack workspace, **so that** the bot can send alerts to my designated channels.
- **Acceptance Criteria:**
- `GET /v1/slack/install` initiates the Slack OAuth v2 flow.
- `GET /v1/slack/oauth_redirect` handles the callback, exchanging the code for a bot token.
- Bot token and workspace details are securely stored in DynamoDB under the tenant's record.
- **Estimate:** 3
- **Dependencies:** Story 4.2
- **Technical Notes:** Request minimum scopes: `chat:write`, `commands`, `incoming-webhook`. Encrypt the Slack bot token at rest.

**Story 7.2: Slash Command Parser & Router**

- **As a** Slack user, **I want** to use commands like `/dd0c status`, **so that** I can interact with the system without leaving my chat window.
- **Acceptance Criteria:**
- `POST /v1/slack/commands` API endpoint created to receive Slack command webhooks.
- Validates Slack request signatures (HMAC-SHA256).
- Routes `/dd0c status`, `/dd0c anomalies`, and `/dd0c digest` to their respective handler functions.
- **Estimate:** 3
- **Dependencies:** Story 7.1
- **Technical Notes:** Slack requires a response within 3 seconds; acknowledge immediately with a `200 OK` and deliver slower results via the `response_url`.
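Slack's documented verification scheme is an HMAC-SHA256 over the basestring `v0:{timestamp}:{rawBody}` using the app's signing secret, compared against the `x-slack-signature` header. A sketch using Node's `crypto`:

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Verify a Slack request signature (v0 scheme).
function verifySlackSignature(signingSecret: string, timestamp: string, rawBody: string, signatureHeader: string): boolean {
  const base = `v0:${timestamp}:${rawBody}`;
  const expected = "v0=" + createHmac("sha256", signingSecret).update(base).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // Constant-time compare; lengths must match or timingSafeEqual throws.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

In production you would also reject requests whose timestamp is more than a few minutes old, to block replay attacks.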

**Story 7.3: Interactive Action Handler**

- **As a** user, **I want** to click buttons on anomaly alerts to snooze them or mark them as expected, **so that** I can tune the system's noise level instantly.
- **Acceptance Criteria:**
- `POST /v1/slack/actions` API endpoint created to receive interactive payloads.
- Validates Slack request signatures.
- Handles the `mark_expected` action by updating the anomaly record and retraining the baseline.
- Handles `snooze_Xh` actions by updating the `snoozeUntil` attribute.
- Updates the original Slack message via the Slack API to reflect the action taken.
- **Estimate:** 5
- **Dependencies:** Story 3.2, Story 7.2
- **Technical Notes:** V1 only implements non-destructive actions (snooze, mark expected). No actual AWS remediation API calls yet.

## Epic 8: Infrastructure & DevOps

**Description:** Define the serverless infrastructure using AWS CDK. This epic covers the deployment of the EventBridge buses, SQS queues, Lambda functions, and DynamoDB tables, and setting up the CI/CD pipeline for automated testing and deployment.

### User Stories

**Story 8.1: Core Serverless Stack (CDK)**

- **As a** developer, **I want** the core ingestion and data storage infrastructure defined as code, **so that** I can deploy the dd0c platform reliably and repeatedly.
- **Acceptance Criteria:**
- AWS CDK (TypeScript) project initialized.
- `dd0c-cost-main` DynamoDB table defined with GSIs and TTL.
- `dd0c-cost-bus` EventBridge bus configured with resource policies allowing external puts.
- `event-ingestion.fifo` and `alert-queue` SQS queues created.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Ensure DynamoDB is set to PAY_PER_REQUEST (on-demand) to minimize baseline costs.

**Story 8.2: Lambda Deployments & Triggers**

- **As a** developer, **I want** to deploy the Lambda functions and connect them to their respective triggers, **so that** the event-driven architecture functions end-to-end.
- **Acceptance Criteria:**
- CDK definitions for `event-processor`, `anomaly-scorer`, `notifier`, and the API handlers.
- SQS event source mappings configured for the processor and notifier Lambdas.
- API Gateway REST API configured with routes pointing to the API handler Lambda.
- **Estimate:** 5
- **Dependencies:** Story 8.1
- **Technical Notes:** Bundle Lambdas using the `NodejsFunction` construct (esbuild) to minimize cold starts. Set explicit memory and timeout values.

**Story 8.3: Observability & Alarms**

- **As an** operator, **I want** automated monitoring of the infrastructure, **so that** I am alerted if ingestion fails or components throttle.
- **Acceptance Criteria:**
- CloudWatch Alarms created for Lambda error rates (>5% in 5 minutes).
- Alarms created for SQS DLQ depth (`ApproximateNumberOfMessagesVisible` > 0).
- Alarms send notifications to an SNS `ops-alerts` topic.
- **Estimate:** 2
- **Dependencies:** Story 8.2
- **Technical Notes:** Keep V1 alarms simple to avoid alert fatigue.

**Story 8.4: CI/CD Pipeline Setup**

- **As a** solo founder, **I want** GitHub Actions to automatically test and deploy my code, **so that** I can push to `main` and have it live in minutes without manual deployment steps.
- **Acceptance Criteria:**
- GitHub Actions workflow created for PRs (lint, test).
- Workflow created for the `main` branch (lint, test, `cdk deploy --require-approval broadening`).
- OIDC provider configured in AWS for passwordless GitHub Actions authentication.
- **Estimate:** 3
- **Dependencies:** Story 8.1
- **Technical Notes:** Use the `aws-actions/configure-aws-credentials` action with `role-to-assume`.

## Epic 9: PLG & Free Tier

**Description:** Implement the product-led growth (PLG) foundations. This involves building a seamless self-serve signup flow, enforcing free tier limits (1 AWS account), and providing the mechanism to upgrade to a paid tier via Stripe.

### User Stories

**Story 9.1: Free Tier Enforcement**

- **As a** platform, **I want** to limit free users to 1 connected AWS account, **so that** I can control infrastructure costs while letting users experience the product's value.
- **Acceptance Criteria:**
- `POST /v1/accounts/setup` checks the tenant's current account count.
- Rejects the request with `403 Forbidden` and an upgrade prompt if the limit (1) is reached on the free tier.
- **Estimate:** 2
- **Dependencies:** Story 4.3
- **Technical Notes:** Check the `TENANT#<id>` metadata record to determine the subscription tier.

**Story 9.2: Stripe Integration & Upgrade Flow**

- **As a** user, **I want** to easily upgrade to a paid subscription, **so that** I can connect multiple AWS accounts and access premium features.
- **Acceptance Criteria:**
- Create a Stripe Checkout session endpoint (`POST /v1/billing/checkout`).
- Configure a Stripe webhook handler to listen for `checkout.session.completed` and `customer.subscription.deleted`.
- Update the tenant's tier to `pro` in DynamoDB upon successful payment.
- **Estimate:** 5
- **Dependencies:** Story 4.2
- **Technical Notes:** The Pro tier is $19/account/month. Use Stripe Billing's per-unit pricing model tied to the number of active AWS accounts.

**Story 9.3: API Key Management (V1 Foundation)**

- **As a** power user, **I want** to generate an API key, **so that** I can programmatically interact with my dd0c account in the future.
- **Acceptance Criteria:**
- `POST /v1/api-keys` endpoint to generate a secure, scoped API key.
- Hash the API key before storing it in DynamoDB (`TENANT#<id>#APIKEY#<hash>`).
- Display the plain-text key only once during creation.
- **Estimate:** 3
- **Dependencies:** Story 5.1
- **Technical Notes:** This lays the groundwork for the V2 Business tier API access.
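The generate-once, store-hashed pattern looks roughly like this; the `ddc_` prefix is a hypothetical convention, and a salted/HMAC digest could be substituted for plain SHA-256:

```typescript
import { createHash, randomBytes } from "crypto";

// Create an API key; return the plaintext exactly once, persist only the hash.
function generateApiKey(): { plaintext: string; storedHash: string } {
  const plaintext = "ddc_" + randomBytes(24).toString("base64url"); // assumed prefix
  const storedHash = createHash("sha256").update(plaintext).digest("hex");
  return { plaintext, storedHash };
}

// Verify a presented key against the stored digest.
function verifyApiKey(presented: string, storedHash: string): boolean {
  return createHash("sha256").update(presented).digest("hex") === storedHash;
}
```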

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/cost adheres to the 5 Transparent Factory tenets. A cost anomaly detector that auto-alerts on spending must itself be governed — false positives erode trust, false negatives cost money.

### Story 10.1: Atomic Flagging — Feature Flags for Anomaly Detection Rules

**As a** solo founder, **I want** every new anomaly scoring algorithm, baseline model, and alert threshold behind a feature flag (default: off), **so that** a bad scoring change doesn't flood customers with false-positive cost alerts.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the anomaly scoring engine. V1: env-var or JSON file provider.
- All flags evaluate locally — no network calls during cost event processing.
- Every flag has an `owner` and a `ttl` (max 14 days). CI blocks if any flag past its `ttl` is still rolled out at 100%.
- Automated circuit breaker: if a flagged scoring rule generates >3x baseline alert volume over 1 hour, the flag auto-disables. Suppressed alerts are buffered in a DLQ for review.
- Flags required for: new baseline algorithms, Z-score thresholds, instance novelty scoring, actor novelty detection, new AWS service parsers.

**Estimate:** 5 points

**Dependencies:** Epic 2 (Anomaly Detection Engine)

**Technical Notes:**

- Circuit breaker tracks the alert-per-account rate in Redis with a 1-hour sliding window.
- DLQ: SQS queue. On circuit break, alerts are replayed once the flag is fixed or removed.
- For the "no baseline" fast-path (>$5/hr resources), this is NOT behind a flag — it's a safety net that's always on.
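The circuit-breaker rule ("trip at >3x baseline volume within one hour") can be sketched as a sliding-window counter; the in-memory array stands in for the Redis window described above:

```typescript
// Sliding-window circuit breaker: trips when alert volume in the last hour
// exceeds 3x the expected baseline rate for a flagged scoring rule.
class AlertCircuitBreaker {
  private timestamps: number[] = [];
  constructor(private baselinePerHour: number, private windowMs = 3_600_000) {}

  // Record one alert at time nowMs; returns true if the flag should auto-disable.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    this.timestamps = this.timestamps.filter(t => nowMs - t < this.windowMs);
    return this.timestamps.length > 3 * this.baselinePerHour;
  }
}
```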

### Story 10.2: Elastic Schema — Additive-Only for Cost Event Tables

**As a** solo founder, **I want** all DynamoDB cost event and TimescaleDB baseline schema changes to be strictly additive, **so that** rollbacks never corrupt historical spending data or break baseline calculations.

**Acceptance Criteria:**

- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes.
- All event parsers ignore unknown fields (Pydantic `extra="ignore"` or the equivalent in other languages).
- Dual-write during migration windows within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).

**Estimate:** 3 points

**Dependencies:** Epic 1 (Event Ingestion Pipeline)

**Technical Notes:**

- `CostEvent` records in DynamoDB are append-only — never mutate historical events.
- Baseline models in TimescaleDB: new algorithm versions write to a new continuous aggregate; the old aggregate remains queryable during the transition.
- GSI changes: add new GSIs, never remove old ones until sunset.
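The CI rejection rule can be sketched as a coarse lint over migration SQL. The regexes are deliberately blunt — a false positive that blocks a merge is acceptable for a safety check, a false negative is not:

```typescript
// Destructive DDL patterns the CI gate rejects (coarse by design).
const FORBIDDEN = [/\bDROP\b/i, /\bALTER\b[\s\S]*?\bTYPE\b/i, /\bRENAME\b/i];

// True iff the migration contains only additive operations.
function isAdditiveMigration(sql: string): boolean {
  return !FORBIDDEN.some(rx => rx.test(sql));
}
```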

### Story 10.3: Cognitive Durability — Decision Logs for Scoring Algorithms

**As a** future maintainer, **I want** every change to anomaly scoring weights, Z-score thresholds, or baseline learning rates accompanied by a `decision_log.json`, **so that** I understand why the system flagged (or missed) a $3,000 EC2 instance.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/baseline/`, or `src/detection/`.
- Cyclomatic complexity cap of 10 enforced in CI.
- Decision logs live in `docs/decisions/`.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- Threshold changes are the highest-risk decisions — document: "Why Z-score > 2.5 and not 2.0? What's the false-positive rate at each threshold?"
- Include sample cost events showing before/after scoring behavior in decision logs.

### Story 10.4: Semantic Observability — AI Reasoning Spans on Anomaly Scoring

**As an** engineer investigating a missed cost anomaly, **I want** every anomaly scoring decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why a $500/hr GPU instance wasn't flagged.

**Acceptance Criteria:**

- Every `CostEvent` evaluation creates an `anomaly_scoring` span.
- Span attributes: `cost.account_id_hash`, `cost.service`, `cost.anomaly_score`, `cost.z_score`, `cost.instance_novelty`, `cost.actor_novelty`, `cost.alert_triggered` (bool), `cost.baseline_days` (how many days of baseline data existed).
- If no baseline exists: `cost.fast_path_triggered` (bool) and `cost.hourly_rate`.
- Spans export via OTLP. No PII — account IDs hashed, actor ARNs hashed.

**Estimate:** 3 points

**Dependencies:** Epic 2 (Anomaly Detection Engine)

**Technical Notes:**

- Use the OpenTelemetry SDK with the OTLP exporter. Batch export — cost events can be high volume.
- The "no baseline fast-path" span is especially important — it's the safety net for new accounts.
- Include `cost.baseline_days` so you can correlate alert accuracy with baseline maturity.

### Story 10.5: Configurable Autonomy — Governance for Cost Alerting

**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/cost can auto-alert customers or only log anomalies internally, **so that** I can validate scoring accuracy before enabling customer-facing notifications.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (log-only, no customer alerts) or `audit` (auto-alert with logging).
- Default for new accounts: `strict` for the first 14 days (baseline learning period), then auto-promote to `audit`.
- `panic_mode`: when true, all alerting stops. Anomalies are still scored and logged but no notifications are sent. The dashboard shows an "alerting paused" banner.
- Per-account governance override: customers can set their own mode, but it can only be MORE restrictive.
- All policy decisions logged: "Alert for account X suppressed by strict mode", "Auto-promoted account Y to audit mode after 14-day baseline".

**Estimate:** 3 points

**Dependencies:** Epic 3 (Notification Service)

**Technical Notes:**

- The 14-day auto-promotion is key — it prevents alert spam during baseline learning while ensuring customers eventually get value.
- Auto-promotion check: daily cron. If an account has ≥14 days of baseline data AND a false-positive rate <10%, promote it to `audit`.
- Panic mode: Redis key `dd0c:panic`. The notification engine short-circuits on this key.
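The notification engine's policy gate reduces to a small pure function over the two policy knobs; the field names mirror `policy.json` and are otherwise assumptions:

```typescript
interface Policy { governanceMode: "strict" | "audit"; panicMode: boolean; }

// Gate customer-facing notifications. Anomalies are always scored and logged
// upstream of this check; only delivery is suppressed.
function shouldNotify(p: Policy): boolean {
  if (p.panicMode) return false;                   // global kill switch
  return p.governanceMode === "audit";             // strict = log-only
}
```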

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |

*(Also in this commit: `products/05-aws-cost-anomaly/innovation-strategy/session.md` — 1,058 lines, diff suppressed because it is too large — and `products/05-aws-cost-anomaly/party-mode/session.md` — 119 lines, shown below.)*

# dd0c/cost — "Party Mode" Advisory Board Review

**Date:** February 28, 2026
**Product:** dd0c/cost (AWS Cost Anomaly Detective)
**Moderator:** Max (The Zoomer) — *Alright nerds, let's tear this apart. You've read the briefs. Real-time CloudTrail analysis, Slack-native remediation, $19/mo. Let's see if this is actually a business or just another dashboard nobody opens.*

---

## Round 1: INDIVIDUAL REVIEWS

### 1. The VC (Pattern-Matcher)

**Excites me:** The wedge. Leading with "we catch the $5,000 GPU instance mistake in 60 seconds" is a visceral, high-conversion pitch. The PLG motion here is frictionless, and the time-to-value is under 10 minutes. I love the "gateway drug" cross-sell with dd0c/route.

**Worries me:** Defensibility. What is the actual moat here? If Datadog decides to build a Slack-first cost alert, they crush you with distribution. And I'm still not convinced AWS won't just wake up and fix their native anomaly detection.

**Vote:** CONDITIONAL GO. (Condition: Prove the CloudTrail event data actually creates a compounding data moat over time.)

### 2. The CTO (Infrastructure Veteran)

**Excites me:** Closing the loop between detection and action. Telling my team an instance is burning money is useless if they have to log into the AWS console to kill it. The one-click Slack remediation is exactly how engineers actually want to work.

**Worries me:** CloudTrail is noisy as hell, and delivery latency isn't zero. Mapping raw `RunInstances` events to accurate pricing (factoring in RIs, Savings Plans, and Spot pricing) in real time is notoriously difficult. If the Slack bot cries wolf with inaccurate pricing three times, my engineers will mute the channel.

**Vote:** CONDITIONAL GO. (Condition: Ship with hyper-conservative alert thresholds to prevent false-positive fatigue.)
### 3. The Bootstrap Founder (Indie Hacker)

**Excites me:** The math is beautiful. $19/month per account is a no-brainer impulse buy for any startup CTO. At $19, you only need ~526 connected accounts to hit $10K MRR. That is incredibly achievable with Hacker News and Reddit distribution.

**Worries me:** Solo founder burnout. Processing real-time event streams at scale is an operational nightmare. You're building a highly available data pipeline. If your ingestion goes down, you miss the anomaly, and you lose trust forever. Can Brian actually support this while building 5 other products?

**Vote:** GO. (Keep the scope violently narrow. No multi-cloud, no dashboards. Just Slack.)
### 4. The FinOps Practitioner (The Enterprise Buyer)

**Excites me:** Nothing, really. I already use Vantage and I have CUR queries for everything else. But I acknowledge I am not the target buyer here. For the 40-person startup without a FinOps team, this is a lifesaver.

**Worries me:** The $19/mo pricing leaves money on the table. A forgotten p4d instance costs $32/hour. If you save a company $2,000, charging them $19 feels like you're underselling the value. Also, attribution is going to be a nightmare without strict tagging, which startups never have.

**Vote:** NO-GO. (Pivot to % of savings pricing, or at least tier it by total AWS spend. $19 is a toy price.)
### 5. The Contrarian (The Red Teamer)

**Excites me:** The fact that everyone thinks cloud cost tools are a solved problem. They aren't. They're all building for the CFO. Building for the on-call DevOps engineer is a genuinely contrarian bet.

**Worries me:** The "Slack-native" premise is a bug, not a feature. Have you seen a startup's `#alerts` channel? It's a graveyard of ignored webhooks. Adding cost alerts to the noise doesn't solve the problem; it just changes the venue of the ignored warning.

**Vote:** CONDITIONAL GO. (Condition: The product must include a "Zombie Resource Auto-Kill" feature. Don't ask them to click a button in Slack. Just kill it and tell them you did it.)
---

## Round 2: CROSS-EXAMINATION

**Max (Moderator):** *Spicy start. I'm hearing some doubts about the price point and the noise. Let's get into it. VC, you think the market is crowded. Bootstrap, you think it's wide open. FinOps is scoffing at $19/mo. Go.*
**1. The VC (to FinOps):** You're voting NO-GO because $19/month is too cheap? Are you crazy? This is a volume play. Startups don't have $500/month for Vantage. $19 is the exact threshold where a CTO whips out the corporate card without asking permission.

**2. The FinOps Practitioner:** And that's exactly why they'll churn! If you charge $19/mo, they'll treat it like a $19 tool. The moment it flags a false positive on a planned EMR cluster deployment, they'll turn it off. If you charge $200, they'll at least adjust the configuration.

**3. The VC:** Wrong. They'll churn if the product sucks. At $19/mo, it's a "set and forget" insurance policy. The real risk is AWS Cost Anomaly Detection waking up and building Slack buttons for free. Datadog could build this over a weekend.

**4. The Bootstrap Founder (to VC):** Datadog has 3,000 engineers and they charge $23 per *host*. They're not going to cannibalize their upsell motion for a $19 product. And AWS hasn't fixed their billing UX in a decade. The market is crowded at the enterprise level, but there's a massive vacuum at the bottom for developers who just want to be left alone.

**5. The VC:** Okay, but what's the moat? Once you get to 500 customers, someone else clones the CloudTrail ingestion script and launches for $9/mo.

**6. The Bootstrap Founder:** The moat is the pattern data! Once dd0c learns your account's seasonal spending spikes and your remediation muscle memory is built into Slack, you don't switch to a $9 clone. And as a solo dev, Brian can run this infrastructure on $200/mo. The margins are insane.
**7. The CTO (to Contrarian):** Speaking of infrastructure, you want to auto-kill zombie resources? Are you out of your mind? If an automated script terminates a production ML training job because it thought it was a "zombie p4d instance," the CTO will literally fire the vendor on the spot.

**8. The Contrarian:** Oh, please. If a developer leaves a p4d running over the weekend without tagging it as production, they deserve the termination. You're trying to build a cost tool, but you're too scared to actually enforce the cost. Slack buttons are a coward's way out. Force them to opt in to auto-termination for dev accounts.

**9. The CTO:** It's not about courage, it's about the reliability of CloudTrail. CloudTrail events are fast, but they don't contain real-time pricing data with Savings Plans and RIs factored in. If you auto-terminate an instance that was already covered by an RI, you just killed a workload for literally zero financial benefit.

**10. The FinOps Practitioner:** The CTO is 100% right. You cannot act on CloudTrail data alone. The CUR data is the only source of truth. If dd0c tells a CTO "this instance is costing $5/hour" but it's actually covered by an RI and costing $0, the CTO will lose all trust in the tool immediately.

**11. The Contrarian (to FinOps/CTO):** You two are entirely missing the point. The CTO doesn't care if it's exactly $5.00 or $4.12. They care that an unsanctioned GPU instance just spun up in `us-east-1` when the entire team is supposed to be in `us-west-2`. The speed of the alert is the product. The exact dollar amount is just decoration.

**12. The Bootstrap Founder:** Exactly. "Estimated cost: $5/hr" is enough to trigger a Slack conversation. If it's covered by an RI, the developer replies "It's fine, we have an RI," clicks the `[Snooze]` button, and goes back to work. That interaction takes 10 seconds. That's worth $19/mo.
---

## Round 3: STRESS TEST

**Max (Moderator):** *Let's break it down to the absolute worst-case scenarios. We have three fatal flaws we need to survive. Attack.*
### Attack 1: AWS Ships Real-Time Cost Anomaly Detection (Faster, Less Noisy)

**The Scenario:** At re:Invent 2026, AWS announces a complete overhaul of their native tool. It's real-time, it has tunable ML models, and they launch a first-party Slack integration with remediation buttons. Oh, and it's bundled for free.

- **Severity (1-10):** 9. This destroys the primary GTM and differentiation.
- **Mitigation:** Your only play is the multi-cloud narrative (AWS + GCP + Azure) or the specific "developer-first" UX. AWS native tools are historically built for enterprise compliance, not startup speed.
- **Pivot Option:** Pivot dd0c/cost into a feature of the broader `dd0c/portal` offering. If it can't survive as a standalone product against free AWS tools, bundle it into an IDP (Internal Developer Portal) where cost is just one widget next to PagerDuty and GitHub metrics.
### Attack 2: Market Consolidation (Datadog Acquires Vantage)

**The Scenario:** Datadog acquires Vantage for $300M, integrating Vantage's FinOps capabilities directly into Datadog's massive footprint. Suddenly, every Datadog customer gets cost anomaly detection out of the box.

- **Severity (1-10):** 7. Datadog's enterprise motion crushes your mid-market aspirations.
- **Mitigation:** Datadog will inevitably raise Vantage's prices or bundle it behind an expensive tier. Emphasize the $19/mo price point and the anti-bloatware positioning. Play the "we are the tool for teams that hate Datadog's pricing model" card.
- **Pivot Option:** Double down on the indie/bootstrapper market. Pivot strictly to a PLG motion for sub-50-person engineering teams where a Datadog contract is unjustifiable.
### Attack 3: False Positive Fatigue

**The Scenario:** CloudTrail is noisy. You alert a CTO three times in one week about a $5/hour cost spike that turns out to be covered by an RI or a Spot Instance request that instantly terminated. The CTO's team mutes the `#dd0c-alerts` Slack channel. You're now just another ignored webhook.

- **Severity (1-10):** 10. If the product loses trust, churn hits 100%. The "boy who cried wolf" is the death of all monitoring tools.
- **Mitigation:** Ship with insanely conservative default thresholds. Require users to opt in to lower sensitivity. Build an immediate feedback loop: every Slack alert needs a `[Mark as Expected]` button that instantly retrains the anomaly baseline for that specific resource tag.
- **Pivot Option:** Pivot from "anomaly detection" to "Zombie Hunter." Stop trying to catch real-time spikes and focus purely on finding unused resources (unattached EBS volumes, empty ELBs, idle EC2s). No false positives there, just pure savings.
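The `[Mark as Expected]` retrain loop in the mitigation above can be sketched in a few lines. This is an illustrative toy, not the shipped dd0c code; the class and method names are hypothetical:

```python
from statistics import mean, stdev

class Baseline:
    """Toy per-resource spend baseline with a 'Mark as Expected' retrain hook."""

    def __init__(self, hourly_history):
        self.history = list(hourly_history)

    def zscore(self, x):
        mu, sd = mean(self.history), stdev(self.history)
        return 0.0 if sd == 0 else (x - mu) / sd

    def mark_expected(self, x):
        # User clicked [Mark as Expected]: fold the spike into the baseline
        # so the same spend level stops screaming for this resource tag.
        self.history.append(x)

b = Baseline([4.0, 4.1, 3.9, 4.0] * 12)   # 48 hours of ~$4/hr spend
before = b.zscore(40.0)                    # the weekly EMR job looks like a huge anomaly
for _ in range(6):
    b.mark_expected(40.0)                  # user confirms it, six weeks running
after = b.zscore(40.0)                     # the baseline has absorbed the pattern
```

After a handful of confirmations the z-score of the recurring spike drops from hundreds to low single digits, which is exactly the "retrains the baseline" behavior the mitigation calls for.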
---

## Round 4: FINAL VERDICT

**Max (Moderator):** *The board has deliberated. It's time for the bloodbath. Unanimous or split decision on `dd0c/cost`? What's the final call? Let's go.*

### The Decision: SPLIT VERDICT (4-1 CONDITIONAL GO)
**The VC:** "If you can make a CTO feel like they have superpowers for 19 bucks a month, I'm in. But you better move fast before re:Invent ruins your life."

**The CTO:** "I'll use it if you don't wake up my engineers with fake alarms. Tune the noise down, and I'll buy 5 licenses right now."

**The Bootstrap Founder:** "The easiest $10K MRR you'll ever build. Don't overcomplicate it. Stay out of the enterprise."

**The Contrarian:** "I hate Slack bots, but auto-killing zombies is a real product. Do it."

**The FinOps Practitioner:** "You are leaving enterprise money on the table, and your attribution sucks. I vote NO-GO."
### Revised Priority in the `dd0c` Lineup

`dd0c/cost` is officially the **Gateway Drug #2**.

It must be launched immediately following `dd0c/route`. The entire GTM strategy depends on this product proving immediate, undeniable monetary ROI in Week 1 to earn the trust required for the rest of the platform.
### Top 3 Must-Get-Right Items

1. **The '10-Minute Aha' Onboarding Flow.** No forms, no manual tagging requirements. A user must connect an AWS account via CloudFormation and get their first real alert (even if it's just an unattached EBS volume) within 10 minutes.
2. **One-Click Remediation UX.** The `[Stop Instance]` Slack button is the entire moat. It has to work flawlessly without forcing a context switch to the AWS console.
3. **Hyper-Conservative Default Alerting.** It is infinitely better to miss a $50 anomaly than to trigger 3 false positives in the first week. The baseline must learn before it screams.
### The One Kill Condition

**If AWS announces real-time Cost Anomaly Detection with native Slack remediation at re:Invent 2026, kill the standalone product.** Pivot the CloudTrail ingestion engine immediately into `dd0c/alert` or `dd0c/drift` as a supplementary feature, and stop selling it as a $19/mo FinOps tool.
### Final Verdict: CONDITIONAL GO

Cloud cost management is a crowded, bloody ocean. But everyone is building for the CFO. The wedge is building for the on-call engineer who just wants to stop a runaway GPU cluster from their phone without finding their YubiKey. The $19/month real-time Slack bot is a hyper-specific, defensible wedge. Build the smoke detector, hand them the fire extinguisher, and get out of their way.

**Max:** *Alright, party's over. Build the damn thing.*
products/05-aws-cost-anomaly/product-brief/brief.md

# dd0c/cost — Product Brief
## AWS Cost Anomaly Detective

**Version:** 1.0
**Date:** February 28, 2026
**Author:** Product Management
**Status:** Conditional GO (4-1 Advisory Board Vote)
**Classification:** Investor-Ready

---
# 1. EXECUTIVE SUMMARY

## Elevator Pitch

dd0c/cost is a real-time AWS billing anomaly detector that catches cost spikes in seconds — not the 24-48 hours that AWS native tools require — and delivers actionable Slack alerts with one-click remediation. At $19/account/month, it's the smoke detector with a fire extinguisher attached: it tells you what happened, who did it, and lets you fix it without leaving Slack.
## Problem Statement

Cloud cost management is broken at the speed layer. AWS customers collectively overspend by an estimated $16B+ annually on idle, forgotten, and misconfigured resources (Flexera State of the Cloud 2025). The average startup discovers cost anomalies 48-72 hours after they begin — by which time a single forgotten GPU instance has burned $1,400+ (p3.2xlarge at $12.24/hr × 4.8 days).

The root cause is architectural: every existing tool — including AWS's own Cost Anomaly Detection — is built on batch-processed Cost and Usage Report (CUR) data. CUR is designed for accounting, not operations. It's like getting your credit card statement a month late and wondering why you're broke.
Three compounding failures make this worse:

1. **No real-time feedback loop.** AWS makes it trivially easy to launch a $98/hour GPU instance and provides zero immediate cost signal. Engineers get no feedback between "I created a thing" and "the bill arrived."
2. **No attribution.** When costs spike, the first question is "who did this?" AWS Cost Explorer answers at the service level ("EC2 went up"), not the human level ("Sam launched 4 GPU instances at 11:02 AM"). This creates blame culture instead of resolution.
3. **No remediation path.** Even when anomalies are detected, fixing them requires navigating 5+ AWS console screens. The gap between "knowing" and "doing" is where money burns.

The AI infrastructure boom has made this exponentially worse. Enterprise AI/ML spend on AWS grew 340% from 2023 to 2025 (Gartner). GPU instances costing $12-$98/hour are now routine. Teams that never worried about AWS costs are suddenly getting $40K bills because someone left a SageMaker endpoint running over a weekend.
## Solution Overview

dd0c/cost replaces the industry's batch-processing paradigm with real-time event-stream analysis:

- **Real-time detection via CloudTrail:** Instead of waiting for CUR data, dd0c processes CloudTrail events through EventBridge as they happen. When someone launches an expensive resource, dd0c knows in seconds — not days.
- **Slack-native alerts with full context:** Every alert includes what happened, who did it, when, estimated cost impact, and a plain-English explanation. No dashboard required.
- **One-click remediation:** Slack action buttons (Stop, Terminate, Schedule Shutdown, Snooze) let engineers fix problems without leaving their workflow. Remediation includes safety nets (automatic EBS snapshots before termination).
- **Zombie resource hunting:** Daily automated scans for idle EC2 instances, unattached EBS volumes, orphaned Elastic IPs, and empty load balancers — the perpetual waste that regenerates as teams grow.
- **Pattern learning:** Anomaly baselines adapt to each account's unique spending patterns over 30-90 days, reducing false positives and increasing detection accuracy over time.
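As a concrete illustration of the CloudTrail-via-EventBridge path, an event pattern that matches EC2 `RunInstances` calls looks roughly like this. The pattern keys follow the shape of CloudTrail management events on EventBridge; the rule wiring described in the comment is a sketch, not the shipped configuration:

```python
import json

# EventBridge event pattern matching CloudTrail-recorded RunInstances calls.
# CloudTrail management events arrive on the event bus with the detail-type
# "AWS API Call via CloudTrail".
run_instances_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["RunInstances"],
    },
}

# In production this JSON would be supplied as the EventPattern of an
# EventBridge rule whose target is a Lambda that reads the instance type
# from the event detail and looks up an on-demand price estimate.
print(json.dumps(run_instances_pattern, indent=2))
```

The same shape extends to other expensive launches (e.g. `CreateCluster` for EMR, `CreateEndpoint` for SageMaker) by swapping the `eventSource` and `eventName` values.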
## Target Customer

**Primary:** Series A/B SaaS startups. 10-50 engineers. 1-5 AWS accounts. $5K-$50K/month AWS spend. No dedicated FinOps team. The CTO or a senior DevOps engineer "owns" the bill as a side responsibility.

**Secondary:** Mid-market engineering teams (50-200 engineers) with a solo FinOps analyst drowning in manual data wrangling across 10-25 AWS accounts.

**Anti-target:** Enterprise organizations with $500K+/month AWS spend, dedicated FinOps teams, and existing CloudHealth/Vantage contracts. These are not our customers in Year 1.
## Key Differentiators

| Dimension | dd0c/cost | Industry Standard |
|-----------|-----------|-------------------|
| Detection speed | Seconds (CloudTrail events) | 24-48 hours (CUR/Cost Explorer) |
| Alert channel | Slack-native with action buttons | Email/SNS, dashboard visits |
| Remediation | One-click from Slack | Manual AWS Console navigation |
| Attribution | Resource + user + action + timestamp | Service-level aggregates |
| Setup time | 5 minutes (one-click CloudFormation) | 15-60 minutes (CUR configuration, dashboard setup) |
| Price | $19/account/month | $100-500+/month or enterprise contracts |
| Explanation quality | Plain English ("Sam launched 4x p3.2xlarge at 11:02am, burning $12.24/hr") | "Anomaly detected in EC2" |
---

# 2. MARKET OPPORTUNITY

## Market Sizing

| Segment | Size | Basis |
|---------|------|-------|
| **TAM** | $16.5B | Global cloud cost management and optimization market, 2026. All providers, all segments, all tool categories. (Gartner, FinOps Foundation, Flexera State of the Cloud 2025). 22% CAGR. |
| **SAM** | $2.1B | AWS-specific cost anomaly detection and optimization for SMB/mid-market. ~340,000 AWS accounts spending $5K-$500K/month. Average willingness-to-pay ~$500/month for cost tooling. |
| **SOM** | $1.0-3.6M ARR (Year 1) | 250 paying accounts at a blended $29/account/month from dd0c/cost alone = ~$7.3K MRR (~$87K ARR). Combined with dd0c/route (the "gateway drug" pair), $1.0-3.6M ARR if execution is sharp. |

**The honest math:** To hit $50K MRR (the platform target), dd0c/cost alone won't get there. At $19/account/month, you need ~2,600 paying accounts for $50K MRR from cost alone. Realistically, dd0c/cost contributes $15-25K MRR and dd0c/route carries the rest. That's the strategy — the gateway drug pair, not a single product.
## Competitive Landscape

### Direct Competitors

**AWS Cost Anomaly Detection (Native)**
- Free. ML-based. 24-48 hour detection delay. Black-box model with legendary false positive rates. No Slack integration. No remediation. UX buried behind 4 clicks in the Billing console. AWS's incentive structure is fundamentally misaligned — they profit when you overspend. They will never build a great cost reduction tool.
- **Threat level:** LOW as a product. HIGH as a "good enough" excuse for prospects to do nothing.

**Vantage**
- Modern FinOps platform. Series A ($13M). Cost reporting, K8s allocation, unit economics. Pricing starts ~$100/month, scales aggressively. Architecture is CUR-based (batch, not real-time). Moving upmarket toward the FinOps analyst persona, not startup CTOs.
- **Threat level:** MEDIUM. Could add real-time detection but would require a data pipeline rebuild (~6-month project). A window exists.
**nOps**
- Automated cloud optimization (RI/SP purchasing, scheduling, spot migration). Enterprise-focused, opaque pricing ("Contact Sales"). Solves "help me save money systematically" — a different JTBD than "tell me the second something goes wrong."
- **Threat level:** LOW-MEDIUM. Different positioning. Potential partner.

**Antimetal**
- Group buying for cloud. Aggregates purchasing power for better RI/SP rates. Visibility features are table stakes. VC-backed, burning cash on a model requiring massive scale.
- **Threat level:** LOW. Different business model entirely.
### Adjacent Competitors (Different Buyer, Overlapping Problem)

**CloudHealth (VMware/Broadcom)** — Enterprise. 6-month implementations. $50K+ annual contracts. Sells to VP of Infrastructure via golf courses. Irrelevant to our beachhead. **NEGLIGIBLE.**

**Kubecost / OpenCost** — K8s-only cost monitoring. Our beachhead customers are mostly running EC2, Lambda, and RDS. Complementary, not competitive. **NEGLIGIBLE.**

**Infracost** — Pre-deploy cost estimation (shift-left). We're runtime (shift-right). "Infracost tells you what it WILL cost. dd0c tells you what it IS costing." **Potential PARTNER.**

**ProsperOps** — Autonomous discount management. Pure savings execution. No anomaly detection. Different JTBD. **NEGLIGIBLE.**
### The Existential Threat

**Datadog**
- Already has agents in customer infrastructure, CloudTrail ingestion, and Slack integrations. Adding real-time cost anomaly detection is a feature for them, not a product. 3,000 engineers.
- **Why we might still win:** Datadog charges $23/host/month for infrastructure monitoring PLUS additional fees for cost management. A 50-host startup pays $1,150/month before cost features. Our $19/account/month is a rounding error. Their cost management is dashboard-first, not Slack-first. Their incentive is upselling more Datadog, not being the best cost tool.
- **Threat level:** HIGH long-term. LOW short-term (enterprise focus, not startups).
### Blue Ocean Positioning

The incumbents cluster around reporting, governance, dashboards, and RI optimization — a Red Ocean of commoditized features. dd0c/cost's Blue Ocean is the quadrant nobody serves well:

```
Factor                    | AWS Native | Vantage | CloudHealth | dd0c/cost
--------------------------|------------|---------|-------------|----------
Detection Speed           | 2          | 4       | 3           | 9
Attribution (Who/What)    | 2          | 6       | 7           | 8
Remediation (Fix It)      | 1          | 2       | 3           | 9
Slack-Native Experience   | 1          | 3       | 1           | 10
Time-to-Value (Setup)     | 6          | 4       | 2           | 9
Pricing Transparency      | 10         | 6       | 1           | 10
Multi-Account Governance  | 4          | 7       | 9           | 3
Reporting/Dashboards      | 5          | 8       | 9           | 2
RI/SP Optimization        | 3          | 6       | 8           | 1
```

We deliberately score LOW on governance, reporting, and RI optimization. We score so high on speed, action, and simplicity that the comparison is absurd. This is textbook Blue Ocean: make the competition irrelevant by competing on different factors.
## Timing Thesis: Why Now

Four converging forces create an exceptional window:

**1. The AI Spend Explosion (2024-2026)**
Enterprise AI/ML infrastructure spend on AWS grew 340% from 2023 to 2025. GPU instances cost $12-$98/hour. A single forgotten ML training job burns $5,000 in a weekend. Teams that never worried about AWS costs are suddenly panicking at $40K bills. This is creating a new generation of buyers who need cost detection urgently.

**2. FinOps Goes Mainstream**
FinOps Foundation membership grew from 5,000 to 31,000+ between 2022 and 2025. "FinOps" job titles increased 4x on LinkedIn. The market is educated — we don't need to explain WHY cost management matters. We need to explain why our approach is better. Much easier sell.

**3. AWS Native Tools Are Still Terrible**
AWS Cost Anomaly Detection launched in 2020. Six years later: 24-48 hour delays, no Slack, no remediation, black-box ML. AWS's billing team is a cost center, not a profit center. They have no incentive to invest heavily. Every year they don't fix this, the third-party market grows. We have 2-3 years minimum before AWS could ship something competitive.

**4. Regulatory Tailwinds**
EU DORA requires financial institutions to monitor cloud spend. SOC 2/ISO 27001 auditors increasingly ask "how do you monitor cloud costs?" ESG/sustainability reporting links cloud efficiency to carbon footprint. FinOps Foundation certification is creating a professional class of buyers who actively seek tools.
---

# 3. PRODUCT DEFINITION

## Value Proposition

**For startup CTOs and DevOps engineers** who are personally accountable for AWS spend but have no time or tools for real-time cost governance, **dd0c/cost is a Slack-native cost anomaly detector** that catches billing spikes in seconds and lets you fix them with one click. **Unlike AWS Cost Anomaly Detection, Vantage, or CloudHealth,** dd0c/cost is built on real-time CloudTrail event streams (not batch CUR data), delivers alerts where engineers already work (Slack, not dashboards), and includes remediation — not just detection — at $19/account/month.

**The core promise:** The 48-hour blindspot between "something went wrong" and "I understand what happened" is eliminated. dd0c/cost turns a $4,700 weekend disaster into a $12 blip caught in 60 seconds.
## Personas

### Persona 1: Alex — The Startup CTO
- **Profile:** 32, Series A startup, 12 engineers. Wears the CTO/VP Eng/DevOps hats simultaneously. Personally signed the AWS Enterprise Agreement. The board sees every line item.
- **Defining moment:** Tuesday 7:14 AM, brushing teeth. The CFO forwards an AWS billing alert: charges exceeded $8,000 (last month was $2,100). Stomach drops. Cost Explorer takes 11 seconds to load on mobile. The bar chart shows a spike but not WHERE or WHY. Alex spends 3 hours diagnosing what dd0c would have caught in 60 seconds.
- **JTBD:** "When I see an unexpected AWS charge, I want to instantly understand what caused it and who's responsible, so I can fix it before it gets worse and explain it to stakeholders."
- **What they hire dd0c for:** Speed of detection, attribution, credibility with investors.
### Persona 2: Sam — The DevOps Engineer
- **Profile:** 26, backend/infrastructure engineer at a 40-person startup. Manages Terraform, CI/CD, and "whatever AWS thing is broken today." Doesn't think about costs until they cause a problem.
- **Defining moment:** Friday 4:47 PM. CTO Slack: "Did you launch those GPU instances?" Sam spun up 4x p3.2xlarge on Tuesday for a 20-minute ML benchmark. A production incident pulled them away. Instances still running. 4 days × $12.24/hr × 4 instances = $4,700. Sam wants to disappear.
- **JTBD:** "When I spin up a temporary resource, I want automatic safety nets so I can focus on my actual work without worrying about zombie resources."
- **What they hire dd0c for:** The safety net they never had. No more blame. No more forgotten instances.
### Persona 3: Jordan — The Solo FinOps Analyst
- **Profile:** 28, mid-size SaaS (150 engineers, 23 AWS accounts). Title is "Cloud Financial Analyst." The only person who understands AWS billing. Reports to VP Eng and dotted-line to Finance.
- **Defining moment:** Last Thursday of the month. 14 browser tabs open. 3 days building the monthly cost report. $4,200 discrepancy between Cost Explorer and CUR data. 60% of time spent collecting and reconciling data, not analyzing it.
- **JTBD:** "When an anomaly is detected, I want to immediately see the root cause with full context, so I can resolve it without a 3-hour investigation."
- **What they hire dd0c for:** Getting their time back. Automated detection replaces manual data wrangling.
## Feature Roadmap

### MVP (V1) — Launch at Day 90

The V1 is ruthlessly scoped to three capabilities: detect, alert, fix.

**Real-Time Anomaly Detection**
- CloudTrail → EventBridge → Lambda pipeline for real-time event ingestion
- Z-score anomaly scoring with configurable sensitivity (default: conservative/high threshold)
- Cost estimation for the top 20 AWS services mapped from CloudTrail events (~85% accuracy)
- Two-layer architecture: Layer 1 (CloudTrail, seconds, estimated) + Layer 2 (CloudWatch EstimatedCharges + CUR, hours, precise)
- Pattern baseline learning over 30-90 days per account
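A minimal sketch of the z-score check, assuming a rolling window of hourly cost estimates per account. The function name, window length, and threshold are illustrative defaults, not the shipped implementation:

```python
from statistics import mean, stdev

def is_anomalous(hourly_costs, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` std-devs above the baseline."""
    if len(hourly_costs) < 24:
        return False                     # too little history: stay quiet (conservative default)
    mu, sigma = mean(hourly_costs), stdev(hourly_costs)
    if sigma == 0:
        return latest > mu               # perfectly flat baseline: any increase is notable
    return (latest - mu) / sigma > threshold

baseline = [4.0, 4.2, 3.9, 4.1] * 12     # 48 hours of ~$4/hr spend
is_anomalous(baseline, 4.3)              # within normal variation
is_anomalous(baseline, 52.0)             # a GPU instance just appeared
```

Raising `threshold` (or requiring several consecutive anomalous readings) is how the "conservative/high threshold" default trades missed small anomalies for fewer false positives.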
**Slack-Native Alerts**
- Block Kit messages: resource type, estimated cost/hour, who created it (IAM user/role), when, plain-English explanation
- Action buttons: Stop Instance, Terminate Instance (with automatic EBS snapshot), Snooze (1hr/4hr/24hr/permanent), Mark as Expected (retrains baseline)
- Daily digest: yesterday's spend summary, top anomalies, zombie resources found
- End-of-month spend forecast
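A sketch of what one of these alerts looks like as a Block Kit payload. The block and button `type` fields follow Slack's Block Kit structure; the alert copy and the `action_id` values are hypothetical:

```python
# Block Kit payload for a real-time anomaly alert, built as a plain dict.
alert = {
    "blocks": [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (":rotating_light: *New expensive resource detected*\n"
                         "Sam launched *4x p3.2xlarge* at 11:02 AM, burning an "
                         "estimated *$12.24/hr* (on-demand, Layer 1 estimate)"),
            },
        },
        {
            "type": "actions",
            "elements": [
                {"type": "button", "style": "danger", "action_id": "stop_instance",
                 "text": {"type": "plain_text", "text": "Stop Instance"}},
                {"type": "button", "action_id": "snooze_4h",
                 "text": {"type": "plain_text", "text": "Snooze 4h"}},
                {"type": "button", "action_id": "mark_expected",
                 "text": {"type": "plain_text", "text": "Mark as Expected"}},
            ],
        },
    ]
}
```

Each `action_id` maps to a remediation handler; "Mark as Expected" is the feedback loop that retrains the baseline for that resource.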
**Zombie Resource Hunter**
- Daily automated scan: idle EC2 instances (CPU <5% for 72+ hours), unattached EBS volumes, orphaned Elastic IPs, empty load balancers, stopped instances with attached EBS
- Slack report with one-click cleanup actions
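The unattached-EBS check is the simplest of these scans. A pure-Python sketch of the filtering step; in the real scanner the `volumes` list would come from paginated EC2 `DescribeVolumes` responses, and the helper name is illustrative:

```python
def unattached_volumes(volumes):
    """EBS volumes whose State is 'available' are attached to nothing: zombies.

    `volumes` uses the shape of EC2 DescribeVolumes response entries.
    """
    return [
        {"id": v["VolumeId"], "size_gb": v["Size"]}
        for v in volumes
        if v.get("State") == "available"
    ]

inventory = [
    {"VolumeId": "vol-0aaa", "Size": 100, "State": "available"},  # forgotten
    {"VolumeId": "vol-0bbb", "Size": 8,   "State": "in-use"},     # attached, fine
]
unattached_volumes(inventory)
```

The idle-EC2 check follows the same shape but needs a CloudWatch CPU-utilization lookup per instance, which is why the scan runs daily rather than in real time.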
**Onboarding**
- One-click CloudFormation template (IAM read-only role, ~90 seconds)
- Slack OAuth integration (~30 seconds)
- Immediate zombie scan on connection (first value in <10 minutes)
- Zero configuration required — opinionated defaults for everything

**What V1 explicitly does NOT include:** No web dashboard. No multi-account governance. No RI/SP optimization. No team attribution. No multi-cloud. No reporting. No forecasting beyond the end-of-month estimate. These are deliberate omissions, not gaps.
### V2 — Months 4-6

- **Web dashboard:** Lightweight cost overview, anomaly history, trend visualization
- **Multi-account support:** Connect multiple AWS accounts, unified alerting
- **Team attribution:** Tag-based cost allocation to teams without requiring perfect tagging (heuristic matching via IAM roles and resource naming patterns)
- **Budget circuit breakers:** Automatic alerts and optional enforcement when spend exceeds configurable thresholds
- **Approval workflows:** Remediation actions on sensitive resources require manager approval via Slack thread
- **Business tier pricing** ($49/account/month) with team features and API access
### V3 — Months 7-12

- **RI/SP optimization recommendations:** Identify savings plan and reserved instance opportunities
- **Spend forecasting:** ML-based monthly and quarterly projections with confidence intervals
- **Benchmarking:** "Companies similar to yours spend X on EC2" — powered by anonymized aggregate data across dd0c customers (requires 500+ customer scale)
- **Custom anomaly rules:** User-defined detection logic beyond statistical baselines
- **Autonomous remediation (opt-in):** Auto-terminate dev/staging zombies after a configurable idle period, with notification
### V4 — Year 2

- **Multi-cloud:** GCP and Azure support (the play if AWS improves native tools)
- **API platform:** Programmatic access for custom integrations and internal tooling
- **dd0c platform integration:** Deep cross-sell with dd0c/route, dd0c/alert, dd0c/run
## User Journey

**AWARENESS** — "Your AWS bill is lying to you" blog post; Show HN / Reddit; the aws-cost-cli OSS tool; the "What's That Spike?" blog series; the Bill Shock Calculator (free, ungated web tool).

**ACTIVATION** — "Start Free" → GitHub/Google SSO (no credit card) → one-click CloudFormation (90 sec) → Slack OAuth (30 sec) → choose channel (10 sec). Done. Total: 3-5 minutes. Zero configuration.

**RETENTION** — First zombie scan alert within 10 minutes of setup. First real-time anomaly alert → one-click fix → "dd0c just saved us $X." Pattern learning kicks in (30-90 days) → fewer false positives → trust deepens → switching cost increases.

**EXPANSION** — Connect a 2nd AWS account ($19/mo each). Upgrade to the Business tier for team attribution. Cross-sell dd0c/route (LLM cost routing), then dd0c/alert and dd0c/portal (platform expansion).
|
||||
|
||||
**Critical conversion points:**

1. **Signup → Connected account:** Must happen in the same session. If they leave, 70% never return.
2. **Connected → First alert:** Must happen within 24 hours (the zombie scan provides this). If no alert in 48 hours, they forget dd0c exists.
3. **First alert → First action:** The moment they click "Stop Instance" in Slack and it works, they're hooked. This is the product's magic moment.
## The Core Technical Tension: Speed vs. Accuracy

dd0c/cost's architecture resolves the fundamental tradeoff that defines the market:

| Layer | Source | Speed | Accuracy | Purpose |
|-------|--------|-------|----------|---------|
| Layer 1: Event Stream | CloudTrail + EventBridge | Seconds | ~85% (estimated, on-demand pricing) | "ALERT: New expensive resource detected" |
| Layer 2: Billing Reconciliation | CloudWatch EstimatedCharges + CUR | Minutes to hours | 99%+ (includes RIs, SPs, Spot) | "UPDATE: Confirmed cost impact is $X" |

**Design principle:** Alert on Layer 1 (fast, estimated). Reconcile with Layer 2 (slow, precise). Always show the user which layer they're seeing. Never pretend an estimate is exact. Never wait for precision when speed saves money.

This is a smoke detector vs. a fire investigation. The smoke detector goes off immediately — it might be burnt toast, it might be a real fire. You don't wait for the fire investigator's report before evacuating.
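The design principle above can be sketched as a two-stage alert lifecycle. This is a minimal illustration, not dd0c's actual implementation: the `Alert` type, function names, and price figures are all hypothetical, and a real system would load prices from the AWS Price List API rather than a hardcoded table.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    resource: str
    layer: int            # 1 = fast estimate, 2 = billing-confirmed
    cost_per_hour: float
    confirmed: bool

# Hypothetical on-demand price table (USD/hour).
ON_DEMAND = {"p3.2xlarge": 3.06, "m5.large": 0.096}

def layer1_alert(instance_id: str, instance_type: str) -> Alert:
    """Fast path (seconds): estimate cost at on-demand rates from the event stream."""
    return Alert(instance_id, layer=1,
                 cost_per_hour=ON_DEMAND.get(instance_type, 0.0),
                 confirmed=False)

def layer2_reconcile(alert: Alert, billed_rate: float) -> Alert:
    """Slow path (minutes to hours): replace the estimate with the billed rate,
    which reflects RIs, Savings Plans, and Spot pricing."""
    return Alert(alert.resource, layer=2, cost_per_hour=billed_rate, confirmed=True)

# A RunInstances event fires an estimated alert immediately...
fast = layer1_alert("i-0abc123", "p3.2xlarge")
print(f"ALERT (estimate): {fast.resource} at ~${fast.cost_per_hour}/hr")

# ...and billing data later confirms the true impact (e.g. a Savings Plan rate).
final = layer2_reconcile(fast, billed_rate=1.84)
print(f"UPDATE (confirmed): {final.resource} at ${final.cost_per_hour}/hr")
```

Note the `confirmed` flag: it is how the UI would "always show the user which layer they're seeing."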
## Pricing

### Tier Structure

| Tier | Price | Includes | Purpose |
|------|-------|----------|---------|
| **Free** | $0/month | 1 AWS account, daily anomaly checks (not real-time), Slack alerts without action buttons, weekly zombie report | Top of funnel. Deliver value. Create upgrade motivation via visible delay. |
| **Pro** | $19/account/month | Real-time CloudTrail detection, Slack alerts WITH action buttons, daily zombie hunter, end-of-month forecast, daily digest, configurable sensitivity | Core product. 80% of revenue. |
| **Business** | $49/account/month (or $399/month flat for ≤20 accounts) | Everything in Pro + team attribution, approval workflows, custom anomaly rules, API access, priority support | Expansion revenue. Launches with V2. |

### Why $19/month

1. **Impulse purchase threshold.** $19 doesn't require approval from anyone. $49 might. The conversion rate difference is typically 2-3x for developer tools.
2. **Multi-account expansion.** 3 accounts = $57/month. 10 accounts = $190/month. Revenue scales naturally with customer growth.
3. **Trivial ROI.** One forgotten GPU instance ($12.24/hr × 48 hr = $587) pays for 2.5 years of dd0c. The ROI story doesn't need a spreadsheet.
4. **Category positioning.** At $19, we're 5-25x cheaper than Vantage. That's not a price difference — it's a category difference. We're not "cheaper Vantage." We're a different thing.
### Free-to-Paid Conversion Mechanics

The free tier is deliberately designed to create upgrade pressure:

- Free gets daily checks. Pro gets real-time. Every free alert includes: "We detected this anomaly 18 hours ago. On Pro, you'd have known in 60 seconds. Estimated cost of the delay: $220."
- Free alerts have NO action buttons. You see the problem but must switch to the AWS Console to fix it. The friction is the upgrade motivation.
- Target conversion rate: 2.5-3.5% (consistent with developer tool benchmarks: Vercel 2.5%, Supabase 3.1%, Railway 2.8%).

---
# 4. GO-TO-MARKET PLAN

## Launch Strategy: Product-Led Growth (PLG)

No sales team. No demos. No "Contact Sales" button. The product sells itself or it doesn't sell at all.

The GTM motion is built on one principle: **time-to-value under 10 minutes.** A startup CTO should go from "I've never heard of dd0c" to "I just got my first anomaly alert in Slack" in a single sitting. If onboarding takes more than 5 minutes, we lose 60% of signups.

### The Onboarding Flow (Critical Path)

```
1. Landing page → "Start Free" (no credit card)
2. Sign up with GitHub or Google (no email/password forms)
3. "Connect Your AWS Account" → One-click CloudFormation template
   → Opens AWS Console with pre-filled CF stack
   → Creates IAM role with read-only permissions
   → Outputs role ARN back to dd0c
   → Total: 90 seconds (including AWS Console login)
4. "Connect Slack" → Standard Slack OAuth flow (30 seconds)
5. "Choose a channel for alerts" → Dropdown (10 seconds)
6. DONE. "We're monitoring your account. First alert incoming."
```

**Immediate value delivery:** The moment the account connects, dd0c runs a zombie resource scan. Most accounts have at least one idle resource. First Slack alert within 5-10 minutes: "We found 3 potentially unused resources costing $127/month." This is the aha moment.
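That first zombie alert with one-click actions can be sketched as a Slack Block Kit payload. Block Kit's `section`/`actions` block types and `action_id` fields are real Slack API concepts; the message text, action IDs, and values here are hypothetical, not dd0c's actual schema.

```python
def zombie_alert_blocks(resource_id: str, monthly_cost: float) -> list:
    """Build a Slack Block Kit message: a summary section plus action buttons.
    The action_id values would be routed to the app's interactivity endpoint."""
    return [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": (f"*Possible zombie resource:* `{resource_id}`\n"
                           f"Estimated waste: *${monthly_cost:,.0f}/month*")}},
        {"type": "actions",
         "elements": [
             {"type": "button", "style": "danger", "action_id": "stop_resource",
              "text": {"type": "plain_text", "text": "Stop"}, "value": resource_id},
             {"type": "button", "action_id": "snooze_alert",
              "text": {"type": "plain_text", "text": "Snooze 7d"}, "value": resource_id},
             {"type": "button", "action_id": "mark_expected",
              "text": {"type": "plain_text", "text": "Mark as Expected"}, "value": resource_id},
         ]},
    ]

blocks = zombie_alert_blocks("i-0abc123", 127)
```

A real app would post this via `chat.postMessage` with the `blocks` field; free-tier messages would simply omit the `actions` block, which is exactly the friction the conversion mechanics above rely on.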
## Beachhead: Startups Burning AWS Credits

### The Ideal First Customer

Series A or B SaaS startup. 10-40 engineers. 1-3 AWS accounts. $5K-$50K/month AWS spend. No FinOps person. The CTO owns the bill as a side responsibility.

**Why this profile works:**

- **Pain is acute and personal.** The CTO's name is on the account. The board sees every line item.
- **Decision cycle is fast.** One person decides. No procurement. No security review committee. Sign up and be live in 10 minutes.
- **$19/month is a non-decision.** Less than one engineer's daily coffee. If dd0c catches ONE forgotten GPU instance, it pays for itself for years.
- **They talk to each other.** Startup CTOs are in Slack communities (Rands Leadership, CTO Craft, YC groups), Twitter/X, and FinOps meetups. One happy customer generates 3 referrals.
- **AWS credits make it free.** YC gives $100K in AWS credits. Via the AWS Marketplace listing, dd0c becomes "free" — paid from credits they'd spend anyway.
### The First 10 Customers Playbook

1. **Customers 1-3: Network.** Brian is a senior AWS architect. Call people running startups on AWS. "I built something. Try it, give me honest feedback." Design partners — free for 6 months in exchange for weekly 15-minute feedback calls.
2. **Customers 4-7: Hacker News + Reddit launch.** "Show HN: I built a real-time AWS cost anomaly detector." Tuesday or Wednesday morning US time. Product polished, landing page sharp, onboarding bulletproof. One shot at a first impression.
3. **Customers 8-10: Referrals from 1-7.** If the first 7 don't refer anyone, the product isn't good enough. Go back to step 1.
## Growth Loops

### Loop 1: Savings-Driven Virality

```
Customer saves $X → Shares "dd0c saved us $4,700" on Twitter/Slack community
→ Peers sign up → They save $X → They share → Repeat
```

Amplifier: Monthly "savings report" email with shareable stats. Make it easy to brag about being smart with money.

### Loop 2: Engineering-as-Marketing (Open Source Tools)

```
Free CLI tool (aws-cost-cli, zombie-hunter) → GitHub stars → Developer awareness
→ "Like this? dd0c does this automatically, in real-time" → Signups → Repeat
```

Each tool solves a small problem and funnels to dd0c for the full solution.

### Loop 3: Content SEO Flywheel

```
"What's That Spike?" blog post → Ranks for "AWS NAT Gateway cost spike"
→ CTO Googles exact problem → Finds post → "dd0c would have caught this in 60 seconds"
→ Signup → Repeat
```

Each post targets a long-tail keyword that the ICP searches when they have the exact problem dd0c solves.

### Loop 4: Cross-Sell from dd0c/route

```
Customer uses dd0c/route (LLM cost routing) → Saves $400/month on OpenAI
→ Sees dd0c/cost in same workspace → "Oh, this monitors AWS too?"
→ Connects AWS account → Finds $800/month in zombies → Platform lock-in deepens
```

This is the "gateway drug" strategy. Money saved on LLM costs earns the right to sell AWS cost monitoring.
## Content Strategy

### Pillar 1: "AWS Bill Shock Calculator" (Lead Generation)

Free, ungated web tool. Input your monthly AWS bill → Output: "Companies your size waste 25-35%. That's $X-$Y/month. Here are the top 5 sources." CTA: "Want to find YOUR specific waste? Connect your AWS account (free)." Shareable, generates organic backlinks.

### Pillar 2: "What's That Spike?" Blog Series (SEO + Authority)

Recurring series dissecting real AWS cost anomalies (anonymized):

- "The NAT Gateway That Ate $3,000"
- "When Autoscaling Doesn't Scale Back"
- "The $5,000 GPU Instance Nobody Remembered"
- "CloudWatch Logs Gone Wild"

Each post targets a specific long-tail SEO keyword that the ICP searches during an active cost crisis.

### Pillar 3: "The Real-Time FinOps Manifesto" (Category Creation)

A single definitive piece establishing "real-time FinOps" as a recognized subcategory. If we define the category, dd0c is the default leader. Target: FinOps Foundation blog, The New Stack, InfoQ.

### Pillar 4: Open-Source Tools (Engineering-as-Marketing)

- **`aws-cost-cli`**: CLI showing current AWS burn rate. `npx aws-cost-cli` → "Current burn rate: $1.87/hour | $44.88/day | $1,346/month."
- **`zombie-hunter`**: CLI scanning for unused AWS resources. `npx zombie-hunter` → "Found 7 zombie resources costing $312/month."
- **CloudFormation billing alerts template**: One-click CF template for proper billing alerts (better than the AWS default). Free, dd0c branded.
## Channel Strategy

| Channel | Tactic | Expected Yield |
|---------|--------|---------------|
| Hacker News | "Show HN" launch post | 500-2,000 signups if front page. 2-5% convert. |
| r/aws, r/devops | Genuine participation + "I built this" | 100-500 signups. Higher conversion (self-selected). |
| Twitter/X | "Your AWS bill is lying to you" thread | Brand awareness. 50-200 signups per viral thread. |
| FinOps Foundation Slack | Community participation, answer questions | 10-30 high-quality leads. Most educated buyers. |
| Dev.to / Hashnode | Technical blog posts | SEO long-tail. 10-30 signups/month ongoing. |
| AWS Marketplace | Listed within 90 days of launch | Pay-with-credits angle. AWS takes 3-5% cut. Worth it. |
| Product Hunt | Same launch week as HN, different day | 200-500 signups. Lower conversion but brand awareness. |
## Partnerships

**AWS Marketplace (Priority: HIGH)** — List within 90 days. Customers pay using existing AWS committed spend/credits. YC startups with $100K in AWS credits can use dd0c for "free." Revenue impact: AWS takes 3-5%, worth it for distribution.

**FinOps Foundation (Priority: HIGH)** — Vendor membership. Contribute to framework documentation (specifically the "Real-Time Cost Management" capability). Speak at the FinOps X conference. Table stakes for credibility.

**Infracost (Priority: MEDIUM)** — Integration: Infracost for pre-deploy estimation + dd0c for post-deploy detection. Complementary products, same buyer. Cross-promotion opportunity.
## 90-Day Launch Timeline

### Days 1-30: Build the Core

- CloudTrail → EventBridge → Lambda pipeline for real-time event ingestion
- Anomaly scoring engine (Z-score, configurable sensitivity)
- Cost estimation library (CloudTrail events → estimated hourly costs, top 20 AWS services)
- Slack app: OAuth, Block Kit alert templates, action handlers (Stop, Terminate, Snooze, Mark Expected)
- Daily digest message
- **Deliverable:** Working product on own AWS accounts. Ugly but functional.
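The Z-score scoring named above can be sketched in a few lines. This illustrates the statistical idea only, not dd0c's engine; the history window and the 3.0 threshold stand in for the "configurable sensitivity."

```python
import statistics

def z_score(observed: float, history: list[float]) -> float:
    """How many standard deviations the observed hourly cost sits from
    the baseline built from recent history."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return 0.0 if observed == mean else float("inf")
    return (observed - mean) / stdev

def is_anomaly(observed: float, history: list[float], threshold: float = 3.0) -> bool:
    """Flag spend that deviates more than `threshold` sigmas from baseline.
    A higher threshold means conservative defaults and fewer false positives."""
    return abs(z_score(observed, history)) >= threshold

# Baseline: hourly EC2 spend hovering around $4/hour...
history = [3.9, 4.1, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]
# ...then a forgotten GPU instance pushes it to $16/hour.
print(is_anomaly(16.0, history))   # anomalous spike
print(is_anomaly(4.3, history))    # normal variation
```

In practice the Risk 3 mitigations below this approach: composite scoring across multiple signals and per-service thresholds, not a single global Z-score.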
### Days 31-60: Polish + Design Partners

- Landing page (one-page, Vercel-style)
- GitHub/Google SSO signup
- One-click CloudFormation onboarding template
- Slack OAuth integration flow
- Immediate zombie scan on account connection
- Recruit 3-5 design partners from network. Free for 6 months, weekly feedback calls.
- Instrument: time-to-first-alert, alert-to-action ratio, false positive rate
- **Deliverable:** 5 real humans using it daily. Onboarding <5 minutes. False positive rate <30%.
### Days 61-90: Public Launch

- Stripe billing integration ($19/account/month, free tier for 1 account)
- First "What's That Spike?" blog post
- `aws-cost-cli` open-source tool released
- AWS Marketplace listing application submitted
- FinOps Foundation vendor membership application
- Show HN + Reddit + Product Hunt + Twitter launch
- Personal outreach to 50 startup CTOs via LinkedIn/Twitter DMs
- **Deliverable:** Product live, publicly available, with paying customers.

---
# 5. BUSINESS MODEL

## Revenue Model

**Primary revenue:** Per-account SaaS subscription. $19/account/month (Pro) and $49/account/month (Business, launching with V2).

**Secondary revenue (future):** dd0c platform bundle pricing. dd0c/route + dd0c/cost bundle at $39/month flat for small teams (a discount vs. buying separately). Creates a pricing anchor that makes each individual product feel cheap.

**Revenue characteristics:**

- Recurring (monthly subscription)
- Usage-correlated (revenue scales with the customer's AWS footprint — more accounts = more revenue)
- Low churn by design (pattern learning + remediation workflows create switching costs over time)
- Expansion-native (customers add accounts as they grow)
## Unit Economics

### Per-Customer Economics (Pro Tier, Single Account)

| Metric | Value | Notes |
|--------|-------|-------|
| Monthly revenue | $19 | Per connected AWS account |
| Infrastructure cost | ~$0.80/month | CloudTrail processing (Lambda), anomaly storage (DynamoDB/Postgres), Slack API calls. Estimated at scale. |
| Gross margin | ~96% | SaaS infrastructure costs are minimal at this price point |
| CAC (PLG) | ~$15-25 | Blended across organic (HN, Reddit, SEO = ~$0) and paid content promotion (~$50-80 per paid signup). PLG means no sales team. |
| Payback period | 1-2 months | At $19/month revenue and $15-25 CAC |
| Target LTV | $190 | 10-month average lifetime at <10% monthly churn |
| LTV:CAC ratio | 7.6-12.7x | Healthy. >3x is the benchmark for sustainable SaaS. |
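These figures are internally consistent and can be checked directly; a sanity check on the table's arithmetic, nothing more:

```python
price = 19.00          # Pro tier, $/account/month
infra = 0.80           # estimated infrastructure cost, $/account/month
churn = 0.10           # monthly churn ceiling

gross_margin = (price - infra) / price          # ~0.96
avg_lifetime_months = 1 / churn                 # 10 months at 10% monthly churn
ltv = price * avg_lifetime_months               # $190

cac_low, cac_high = 15, 25
ratio_best = ltv / cac_low                      # ~12.7x at the low-CAC end
ratio_worst = ltv / cac_high                    # 7.6x at the high-CAC end
```

Note the lifetime assumption: 1 / monthly churn is the mean lifetime under constant churn, which is where the 10-month figure comes from.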
### Multi-Account Expansion Economics

The real unit economics story is expansion revenue. A customer starts with 1 account ($19/month), then connects their staging account ($38/month), then their data account ($57/month). No additional CAC for expansion revenue.

| Accounts | Monthly Revenue | Annual Revenue | Notes |
|----------|----------------|----------------|-------|
| 1 | $19 | $228 | Entry point |
| 3 | $57 | $684 | Typical startup (prod + staging + data) |
| 5 | $95 | $1,140 | Growing startup |
| 10 | $190 | $2,280 | Mid-market entry |
| 20 | $399 (Business flat) | $4,788 | Business tier cap |
## Path to Revenue Milestones

### $10K MRR (~526 paying accounts)

**Timeline:** Month 6-9 (Scenario B, "The Grind")

**How we get there:**

- dd0c/cost: ~300 accounts × $19 = $5,700 MRR
- dd0c/route: contributing the remaining ~$4,300 MRR
- Total: ~$10K MRR from the gateway drug pair

**Requirements:** 2,000+ free signups, 2.5% conversion, steady content marketing cadence, 2-3 "dd0c saved us $X" case studies published.

### $50K MRR (~2,600 paying accounts from cost alone, or blended across the platform)

**Timeline:** Month 12-18

**How we get there (blended):**

- dd0c/cost: ~1,000 accounts × $22 avg (mix of Pro + Business) = $22,000 MRR
- dd0c/route: ~$18,000 MRR
- dd0c/alert (launched Month 6): ~$10,000 MRR
- Total: ~$50K MRR across 3 modules

**Requirements:** Strong PLG flywheel, AWS Marketplace traction, at least one viral content moment, cross-sell motion working between route and cost.
### $100K MRR

**Timeline:** Month 18-24

**How we get there:**

- 4+ dd0c modules live
- Business tier adoption driving higher ARPA
- Platform bundle pricing
- Early mid-market customers (10-25 accounts each)
- Potential: first contractor hire for customer support

**Requirements:** Product-market fit validated across at least 3 modules. Churn <8%. NPS >40. The platform flywheel (modules more valuable together than apart) must be demonstrably working.
## Solo Founder Constraints & Mitigations

| Constraint | Impact | Mitigation |
|-----------|--------|------------|
| No sales team | Can't do enterprise outreach | PLG motion. Product sells itself or doesn't sell. |
| No support team | Support burden scales with customers | Automate everything. Self-service docs. Community Slack. Hire part-time contractor at ~200 customers. |
| No marketing team | Limited content output | Batch content creation. 1 blog post/week. Leverage open-source tools for organic reach. |
| Single point of failure | Bus factor = 1 | Infrastructure as code. CI/CD. Automated testing. Documented runbooks. No manual processes per customer. |
| Cognitive load of 6 products | Risk of building 6 mediocre products | Hard rule: no more than 2 products in active development at any time. dd0c/route + dd0c/cost first. Everything else waits. |
| No fundraising | Limited runway for experimentation | Bootstrap-friendly unit economics. $19/month × 96% gross margin = profitable from customer #1. No burn rate to manage. |
## The "Gateway Drug" Cross-Sell Economics

The dd0c platform strategy depends on the gateway drug pair (route + cost) earning the right to sell everything else:

```
Month 1-2: dd0c/route launches → Customer saves $400/month on LLM costs
Month 2-3: dd0c/cost launches → Same customer saves $800/month on AWS waste
Month 3:   Customer is saving $1,200/month across two dd0c products for ~$60/month total
Month 4-6: dd0c/alert launches → "Save your money AND your sleep"
Month 6+:  dd0c/portal → dd0c owns the developer experience. Switching cost is massive.
```

**Data synergy:** dd0c/route knows which services make LLM API calls and their cost. dd0c/cost knows which AWS resources are running and their cost. Combined: "Your recommendation service is making $3,200/month in GPT-4o calls AND running on a $1,800/month p3.2xlarge. Here's how to cut both by 60%." Single-product competitors can't replicate this.

**Technical synergy:** Both products need AWS account integration, Slack integration, auth, and billing. Building dd0c/cost after dd0c/route means 50% of the infrastructure already exists. The marginal engineering cost of the second product is much lower than the first.

---
# 6. RISKS & MITIGATIONS

## Top 5 Risks

### Risk 1: AWS Ships Real-Time Cost Anomaly Detection with Slack Remediation

- **Likelihood:** MEDIUM (40% within 2 years)
- **Impact:** CRITICAL — Primary differentiator evaporates overnight
- **Analysis:** AWS's billing team is a cost center, not a profit center. Real-time cost detection that helps customers spend LESS is antithetical to AWS's revenue model. They've had 15 years to build this and haven't. Their organizational incentives are structurally misaligned. Even if they improve, it'll be enterprise-focused, console-bound, and half-hearted.
- **Mitigation:** Move fast. Establish brand and switching costs (pattern data, remediation workflows) before AWS can respond. If AWS ships something competitive, pivot to multi-cloud (AWS + GCP + Azure) — something AWS will NEVER build.
- **Kill trigger:** If AWS announces real-time Cost Anomaly Detection with native Slack remediation at re:Invent 2026, kill the standalone product. Pivot the CloudTrail ingestion engine into dd0c/alert or dd0c/drift as a supplementary feature.
### Risk 2: Market Consolidation (Datadog Acquires Vantage or Builds Equivalent)

- **Likelihood:** HIGH (60% within 18 months for Datadog entering the space)
- **Impact:** HIGH — Datadog has 3,000 engineers, $2B+ revenue, and existing customer infrastructure agents
- **Analysis:** Datadog charges $23/host/month. Their cost management is an upsell, not a standalone product. A startup with 50 hosts pays $1,150/month for Datadog before cost features. Our $19/account/month is a completely different price point. Datadog optimizes for enterprise, not startups.
- **Mitigation:** Don't compete on features. Compete on price and simplicity. Position as "the cost tool for teams that can't afford Datadog" or "teams that use Datadog for monitoring but don't want Datadog prices for cost management." If Datadog acquires Vantage, they'll inevitably raise prices or bundle behind expensive tiers. Double down on the anti-bloatware positioning.
- **Pivot option:** Go strictly PLG for sub-50-person engineering teams, where a Datadog contract is unjustifiable.
### Risk 3: False Positive Fatigue Kills Retention

- **Likelihood:** HIGH (70% if not actively managed)
- **Impact:** HIGH — If the product loses trust, churn hits 100%. The "boy who cried wolf" is the death of all monitoring tools.
- **Analysis:** CloudTrail is noisy. Mapping raw `RunInstances` events to accurate pricing (factoring in RIs, Savings Plans, Spot) in real time is notoriously difficult. If the Slack bot cries wolf with inaccurate pricing three times, engineers mute the channel. Game over.
- **Mitigation:**
  1. Ship with hyper-conservative default thresholds (miss $50 anomalies rather than trigger 3 false positives)
  2. Every alert includes a `[Mark as Expected]` button that instantly retrains the baseline
  3. Composite anomaly scoring (multiple signals = high confidence, single signal = low confidence)
  4. User-tunable sensitivity per service
  5. Track alert-to-action ratio as a core product metric. If <20% of alerts result in action, sensitivity is too high.
  6. Be transparent about estimates: "Estimated cost: $X/hour (on-demand pricing; actual may differ with RIs/SPs)."
- **Kill trigger:** Alert-to-action ratio <10% at Month 4.
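Mitigation 2, folding a "Mark as Expected" click back into the baseline, can be sketched as follows. This is an illustrative mechanism, not dd0c's implementation: the marked observation simply joins the rolling history window, so future spend at that level no longer deviates far enough from the mean to alert. The class name, window size, and threshold are all hypothetical.

```python
from collections import deque
import statistics

class Baseline:
    """Rolling per-service cost baseline. 'Mark as Expected' feeds the flagged
    observation into the window, shifting the mean so similar spend stops alerting."""

    def __init__(self, window: int = 168):          # e.g. one week of hourly samples
        self.samples = deque(maxlen=window)

    def observe(self, cost: float) -> None:
        self.samples.append(cost)

    def is_anomalous(self, cost: float, threshold: float = 3.0) -> bool:
        if len(self.samples) < 2:
            return False                            # not enough history to judge
        mean = statistics.fmean(self.samples)
        stdev = statistics.pstdev(self.samples)
        return stdev > 0 and (cost - mean) / stdev > threshold

    def mark_expected(self, cost: float) -> None:
        """User clicked [Mark as Expected]: absorb the value into the baseline."""
        self.samples.append(cost)

b = Baseline()
for c in [4.0, 4.1, 3.9, 4.0, 4.2, 3.8]:
    b.observe(c)
spike = 16.0
assert b.is_anomalous(spike)        # first sighting: alert
b.mark_expected(spike)              # user says this is fine
```

After the click, the same $16/hour observation no longer trips the threshold, which is what "instantly retrains the baseline" means here.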
### Risk 4: Solo Founder Burnout (Bus Factor = 1)

- **Likelihood:** MEDIUM-HIGH (50% within 18 months)
- **Impact:** CRITICAL — Processing real-time event streams at scale is an operational nightmare. If ingestion goes down, you miss the anomaly and lose trust forever.
- **Analysis:** Brian is building 6 products simultaneously. The cognitive load, support burden, and operational complexity of running a multi-product SaaS as a solo founder is extreme. Burnout is the most common startup killer.
- **Mitigation:**
  1. Hard rule: no more than 2 products in active development at any time
  2. Automate everything (IaC, CI/CD, automated testing, automated onboarding)
  3. Hire a part-time support contractor within 6 months of launch
  4. dd0c/cost's Slack-first architecture eliminates 60% of the frontend engineering burden (no dashboard in V1)
- **Kill trigger:** Spending >60% of time on dd0c/cost support instead of building.
### Risk 5: The "Good Enough" Trap — Free Tier Cannibalization

- **Likelihood:** HIGH (the vast majority of signups will stay free)
- **Impact:** MEDIUM — Revenue growth stalls despite strong signup numbers
- **Analysis:** The free tier (daily anomaly checks, 1 account) may be sufficient for many small startups. Daily checks catch most problems, just 24 hours late.
- **Mitigation:**
  1. Make the free-to-paid gap visceral. Every free alert: "We detected this 18 hours ago. On Pro, you'd have known in 60 seconds. Estimated cost of the delay: $220."
  2. Free alerts have NO action buttons. See the problem, can't fix it from Slack. Friction = upgrade motivation.
  3. Accept that most signups never pay; that's normal for PLG. Focus on the 2.5-3.5% who convert. At $19/month, volume matters more than conversion rate.
### Additional Risks (Monitored)

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| IAM permission anxiety blocks adoption | MEDIUM (30%) | MEDIUM | Minimal permissions (read-only), open-source agent, SOC 2 within 12 months |
| AI spend bubble pops | LOW-MEDIUM (20%) | MEDIUM | AI is the hook, not the product. dd0c detects ALL cost anomalies. Core problem persists regardless. |
| Security breach / data incident | LOW (10%) | CATASTROPHIC | Minimize data collection, encrypt everything, no stored credentials (IAM cross-account roles), bug bounty from day 1 |
| "We'll build it internally" | MEDIUM (25%) | LOW | Self-solving. Internal tools get abandoned. Content strategy demonstrates problem depth. $19/month < one engineer's afternoon. |
## Kill Criteria

Non-negotiable triggers to kill dd0c/cost and redirect effort:

1. **< 50 free signups within 30 days of the Show HN launch.** The developer community doesn't care. The problem isn't painful enough or the positioning is wrong.
2. **< 5 paying customers within 90 days of launch.** Product-market fit isn't there at any price.
3. **> 50% of paying customers churn within 60 days.** The product isn't delivering enough value to justify even $19/month.
4. **AWS ships real-time anomaly detection with Slack integration.** The primary differentiator evaporates. Pivot or kill.
5. **> 60% of time spent on support instead of building.** Product complexity is wrong for the solo founder operating model.

If any trigger fires, don't rationalize. Don't "give it one more month." Kill it, learn from it, move on. dd0c has 5 other products.
## Pivot Options

| Trigger | Pivot |
|---------|-------|
| AWS closes the speed gap | Pivot to multi-cloud (AWS + GCP + Azure) — something AWS will never build |
| Standalone product fails | Absorb the CloudTrail engine into dd0c/portal as a cost widget, not a standalone product |
| False positive crisis | Pivot from "anomaly detection" to "Zombie Hunter" — pure unused-resource detection. Zero false positives, pure savings. |
| Market too noisy | Rebrand as "dd0c/guard" — cost governance and guardrails, not detection. Prevention > detection. |

---
# 7. SUCCESS METRICS

## North Star Metric

**Anomalies Resolved** — the number of cost anomalies dd0c detected AND the customer took action on (Stop, Terminate, Snooze, or acknowledged via Mark as Expected).

Not signups. Not MRR. Not DAU. Anomalies resolved is the atomic unit of value. Every anomaly resolved is money saved, trust earned, and retention deepened. Everything else is a proxy.
## Leading Indicators (Predictive)

| Metric | Target | Why It Matters |
|--------|--------|---------------|
| Time-to-first-alert | <10 minutes | If users don't get value fast, they churn before they start |
| Signup → Connected account rate | >60% | Measures onboarding friction. Below 60% = onboarding is broken |
| Alert-to-action ratio | >25% | Product quality signal. Below 20% = false positive crisis |
| Weekly active accounts (WAA) | Growing 10%+ week-over-week | Engagement health. Flat = product isn't sticky |
| Free-to-paid conversion rate | 2.5-3.5% | Revenue efficiency. Below 2% = free tier is too generous or paid value unclear |
## Lagging Indicators (Confirmatory)

| Metric | Target | Why It Matters |
|--------|--------|---------------|
| MRR | Per milestone targets below | Revenue health |
| Monthly churn rate | <8% | Retention. Above 15% = product isn't delivering sustained value |
| NPS | >40 | Customer satisfaction. Below 20 = product problems |
| Organic referral rate | >15% of new signups | Word-of-mouth health. Below 5% = product isn't remarkable enough to share |
| Estimated customer savings | >10x subscription cost | ROI validation. If customers aren't saving 10x what they pay, pricing or detection is wrong |
## 30/60/90 Day Milestones

### Day 30: Core Product Complete

- [ ] CloudTrail → EventBridge → Lambda pipeline operational
- [ ] Anomaly scoring engine functional (Z-score, configurable sensitivity)
- [ ] Slack app: alerts with action buttons (Stop, Terminate, Snooze, Mark Expected)
- [ ] Daily digest message working
- [ ] Tested on 2+ own AWS accounts
- **Gate:** Can detect a manually-created expensive resource and alert in Slack within 120 seconds

### Day 60: Design Partners Active

- [ ] 3-5 design partners using the product daily
- [ ] Onboarding flow complete (CloudFormation + Slack OAuth, <5 minutes)
- [ ] Immediate zombie scan on account connection
- [ ] False positive rate <30%
- [ ] At least 1 design partner has a "dd0c saved us $X" story
- **Gate:** Time-to-first-alert <10 minutes for all design partners

### Day 90: Public Launch

- [ ] Stripe billing live ($19/account/month, free tier)
- [ ] Show HN + Reddit + Product Hunt launched
- [ ] First "What's That Spike?" blog post published
- [ ] `aws-cost-cli` open-source tool released
- [ ] AWS Marketplace listing application submitted
- [ ] 200+ free signups in launch week
- **Gate:** At least 1 paying customer within 2 weeks of launch

### Month 4 Checkpoint

- [ ] 25+ paying accounts
- [ ] $475+ MRR
- [ ] Alert-to-action ratio >25%
- [ ] Monthly churn <10%
- [ ] At least 2 organic referrals
- **Kill trigger review:** If <5 paying accounts, initiate kill criteria evaluation

### Month 6 Checkpoint

- [ ] 100+ paying accounts
- [ ] $1,900+ MRR
- [ ] NPS >40
- [ ] Monthly churn <8%
- [ ] V2 development underway (dashboard, multi-account)
- [ ] Cross-sell motion with dd0c/route initiated
- **Kill trigger review:** If <25 paying accounts or >15% churn, initiate kill criteria evaluation
## Metrics to Track Daily
|
||||
1. New signups (free + paid)
|
||||
2. Accounts connected (signup → connected conversion)
|
||||
3. Anomalies detected (total, by type, by severity)
|
||||
4. Anomalies acted on (stop, terminate, snooze, mark expected)
|
||||
5. Alert-to-action ratio
|
||||
6. Time-to-first-alert
|
||||
7. False positive reports (Mark as Expected / total alerts)
## Metrics to Track Weekly

1. MRR and MRR growth rate
2. Free-to-paid conversion rate
3. Churn rate (accounts disconnected or downgraded)
4. Estimated customer savings (sum of costs avoided via remediation)
5. Support ticket volume (early warning for complexity issues)

---
# APPENDIX: SCENARIO PROJECTIONS

| Scenario | Probability | Month 3 MRR | Month 6 MRR | Month 12 MRR | Description |
|----------|------------|-------------|-------------|--------------|-------------|
| **A: The Rocket** | 20% | $2,850 | $9,500 | $19,000 | HN front page, 2K signups week 1, 3% conversion, strong word-of-mouth |
| **B: The Grind** | 50% | $475 | $950 | $3,800 | Moderate HN traction, 500 signups week 1, slow steady growth via content |
| **C: The Pivot** | 25% | $95 | $285 | — | Lukewarm response, 200 signups, 1.5% conversion. Rebrand as portal feature or kill. |
| **D: The Extinction** | 5% | — | — | — | AWS ships competitive native tool. Kill immediately. Salvage CloudTrail engine for dd0c/alert. |

**Expected value (probability-weighted Month 12 MRR):** ~$5,700 from dd0c/cost alone. Combined with dd0c/route, the gateway drug pair targets $10-15K MRR at Month 12 under the most likely scenario.
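The ~$5,700 figure can be reproduced as a probability-weighted sum over the Month 12 column, with killed scenarios contributing zero:

```typescript
// Scenario probabilities and Month 12 MRR taken from the projection table.
// Killed scenarios (C and D) contribute $0 at Month 12.
const scenarios = [
  { name: 'A: The Rocket', p: 0.20, month12Mrr: 19_000 },
  { name: 'B: The Grind', p: 0.50, month12Mrr: 3_800 },
  { name: 'C: The Pivot', p: 0.25, month12Mrr: 0 },
  { name: 'D: The Extinction', p: 0.05, month12Mrr: 0 },
];

// Expected value: sum of probability x outcome across all scenarios.
const expectedMrr = scenarios.reduce((sum, s) => sum + s.p * s.month12Mrr, 0);
console.log(Math.round(expectedMrr)); // 5700
```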
---

*This brief synthesizes findings from four prior development phases: Brainstorm, Design Thinking, Innovation Strategy, and Party Mode Advisory Board Review. All contradictions between phases have been resolved in favor of the most conservative, execution-focused position. The advisory board voted 4-1 Conditional GO.*

*The bet: real-time CloudTrail analysis is an architectural wedge that incumbents can't easily follow. The condition: ship in 90 days, honor kill criteria, and stay ruthlessly focused on three things — detect fast, alert clearly, fix with one click.*
# dd0c/cost — Test Architecture & TDD Strategy

**Version:** 2.0
**Date:** February 28, 2026
**Status:** Authoritative
**Audience:** Founding engineer, future contributors

---

> **Guiding principle:** A cost anomaly detector that misses a $3,000 GPU instance is worse than useless — it's a liability. A cost anomaly detector that cries wolf 40% of the time gets disabled. Tests are the only way to ship with confidence at solo-founder velocity.

---
## Table of Contents

1. [Testing Philosophy & TDD Workflow](#1-testing-philosophy--tdd-workflow)
2. [Test Pyramid](#2-test-pyramid)
3. [Unit Test Strategy](#3-unit-test-strategy)
4. [Integration Test Strategy](#4-integration-test-strategy)
5. [E2E & Smoke Tests](#5-e2e--smoke-tests)
6. [Performance & Load Testing](#6-performance--load-testing)
7. [CI/CD Pipeline Integration](#7-cicd-pipeline-integration)
8. [Transparent Factory Tenet Testing](#8-transparent-factory-tenet-testing)
9. [Test Data & Fixtures](#9-test-data--fixtures)
10. [TDD Implementation Order](#10-tdd-implementation-order)

---
## 1. Testing Philosophy & TDD Workflow

### Red-Green-Refactor for dd0c/cost

TDD is non-negotiable for the anomaly scoring engine and baseline learning components. A scoring bug that ships to production means either missed anomalies (customers lose money) or false positives (customers disable the product). The cost of a test is minutes. The cost of a scoring bug is churn.

**Where TDD is mandatory:**
- `src/scoring/` — every scoring signal, composite calculation, and severity classification
- `src/baseline/` — all statistical operations (mean, stddev, rolling window, cold-start transitions)
- `src/parsers/` — every CloudTrail event parser (RunInstances, CreateDBInstance, etc.)
- `src/pricing/` — pricing lookup logic and cost estimation
- `src/governance/` — policy.json evaluation, auto-promotion logic, panic mode

**Where TDD is recommended but not mandatory:**
- `src/notifier/` — Slack Block Kit formatting (snapshot tests are sufficient)
- `src/api/` — REST handlers (contract tests cover these)
- `src/infra/` — CDK stacks (CDK assertions cover these)

**Where tests follow implementation:**
- `src/onboarding/` — CloudFormation URL generation, Cognito flows (integration tests only)
- `src/slack/` — OAuth flows, signature verification (integration tests)
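As a taste of what mandatory TDD means for `src/baseline/`, here is a minimal sketch of the kind of statistical helper those tests would pin down: a z-score against a rolling window of hourly costs. The function names and flat-baseline behavior are illustrative assumptions, not the module's actual API:

```typescript
// Hypothetical baseline helpers in the spirit of src/baseline/; names and
// flat-window behavior are illustrative, not the actual module API.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / xs.length);
}

// Z-score of a newly observed hourly cost against a rolling window of past
// hourly costs. A flat window yields 0 (no deviation) or Infinity (any
// deviation stands out), so cold-start edge cases stay explicit and testable.
function zScore(window: number[], observed: number): number {
  const sd = stddev(window);
  if (sd === 0) return observed === mean(window) ? 0 : Infinity;
  return (observed - mean(window)) / sd;
}

// A steady ~$1/hr baseline makes a $12/hr GPU spike trivially detectable.
const baseline = [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9];
console.log(zScore(baseline, 12.0) > 5.0); // true
```

Each branch here (normal window, flat window, cold start) maps to one precisely named test, which is exactly what the naming convention below demands.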
### The Red-Green-Refactor Cycle

```
RED:      Write a failing test that describes the desired behavior.
          Name it precisely: what component, what input, what expected output.
          Run it. Watch it fail. Confirm it fails for the right reason.

GREEN:    Write the minimum code to make the test pass.
          No gold-plating. No "while I'm here" refactors.
          Run the test. Watch it pass.

REFACTOR: Clean up the implementation without changing behavior.
          Extract constants. Rename variables. Simplify logic.
          Tests must still pass after every refactor step.
```
### Test Naming Convention

All tests follow the pattern: `[unit under test] [scenario] [expected outcome]`

```typescript
// ✅ Good — precise, readable, searchable
describe('scoreAnomaly', () => {
  it('returns critical severity when z-score exceeds 5.0 and instance type is novel', () => {});
  it('returns none severity when account is in cold-start and cost is below $0.50/hr', () => {});
  it('returns warning severity when actor is novel but cost is within 2 standard deviations', () => {});
  it('compounds severity when multiple signals fire simultaneously', () => {});
});

// ❌ Bad — vague, not searchable
describe('scoring', () => {
  it('works correctly', () => {});
  it('handles edge cases', () => {});
});
```
### Decision Log Requirement

Per Transparent Factory tenet (Story 10.3), any PR touching `src/scoring/`, `src/baseline/`, or `src/detection/` must include a `docs/decisions/<YYYY-MM-DD>-<slug>.json` file. The test suite validates this in CI.

```json
{
  "prompt": "Should Z-score threshold be 2.5 or 3.0?",
  "reasoning": "At 2.5, false positive rate in design partner data was 28%. At 3.0, it dropped to 18% with only 2 additional missed true positives over 30 days.",
  "alternatives_considered": ["2.0 (too noisy)", "3.5 (misses too many real anomalies)"],
  "confidence": "medium",
  "timestamp": "2026-02-28T10:00:00Z",
  "author": "brian"
}
```
---