dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
---

`dd0c-brand-strategy.md` (new file, 136 lines):

# 0xDD0C: The Anti-Bloatware Platform Strategy

**Prepared by:** Victor, Disruptive Innovation Oracle
**Target Audience:** Brian (Senior AWS Architect & Founder)

Brian. Let us dispense with the pleasantries. You are a senior architect. You know that in infrastructure, complexity is a tax levied on the incompetent. The market is currently drowning in tools that create more work than they eliminate. Datadog drains teams' budgets. Backstage drains their souls. DevOps has mutated into "AllOps," and the industry's response has been to sell teams $100k enterprise subscriptions to manage the chaos.
We are not building six products. We are building one weapon with six calibers.

Here is your strategy to disrupt the 2026 DevOps landscape.

---

## 1. Brand Identity
### What does "dd0c" mean?

You are selling to developers, not MBAs. They smell marketing fluff from a mile away. Here are three interpretations of "dd0c":

1. **The Unix Purist:** `dd` (the relentless Unix command that copies and converts data) + `0c` (zero config / zero chaos).
2. **The Hacker / Memory Address:** `0xdd0c`. It looks like a hex memory address. It implies low-level, foundational, bare-metal truth.
3. **The Acronym:** Developer Driven, Zero Compromise. *(Too corporate. Discard.)*

**The Oracle's Pick:** The Hex/Unix hybrid. **`0xDD0C`**. It is raw. It is unapologetic.
### Brand Positioning Statement

For engineering teams drowning in AllOps and enterprise bloatware, dd0c is the unified infrastructure control plane that replaces fragmented, expensive tools with fast, opinionated workflows. Unlike Datadog or Spacelift, dd0c provides zero-configuration, Vercel-like simplicity for the 99% of teams who aren't Google but are forced to buy tools built for Google.
### Tagline Options (Ranked)

1. **All signal. Zero chaos.** *(The winner. Devs want signal.)*
2. Infrastructure without the inflation.
3. The antidote to AllOps.
4. Run your cloud like you own it.
5. Stop paying to be paged.
### Brand Voice & Visual Identity

* **Voice:** Stoic, precise, slightly cynical about modern DevOps bloat. We do not use exclamation points. We use data. We speak like an elite senior engineer reviewing a junior's over-engineered pull request.
* **Visuals:** Linear meets Vercel. Dark mode default (Obsidian black, stark white, electric terminal green for positive states). Monospaced typography (Geist Mono). High information density. No cartoon mascots. No 3D corporate Memphis art.
---

## 2. Platform Narrative

### The Cohesive Story

The SINGLE problem dd0c solves is **The Enterprise Tax on Small-to-Mid Teams**. Tools today are built to be sold to VP-level buyers via golf-course sales motions, resulting in bloated feature sets that developers hate using. dd0c is built for the practitioner.
* **The Villain:** Alert fatigue, $20k SaaS bills, and the YAML-hell of Backstage.
* **The Aha Moment:** A developer gets paged at 3 AM for a non-issue, logs into AWS to see a $500 cost spike they can't trace, and realizes they are paying three different vendors to be this miserable.
* **The Hero's Journey:** They install one dd0c module (e.g., the LLM Cost Router) in 5 minutes. They save $400 in the first week. They realize the rest of the stack is just as fast. They migrate, reclaim their budget, and sleep through the night.
* **Competitive Positioning:** We are the "Linear for DevOps." We win on speed, developer experience (DX), and transparent pricing.
---

## 3. Product Architecture

If you build six separate data models, you will fail. You are a solo founder. You must build a **Unified Control Plane** (Auth, Billing, OTel Data Lake, RBAC).

### Naming Convention

Keep it Unix-like. Slash notation.
* `dd0c/route` (LLM Cost Router)
* `dd0c/cost` (AWS Cost Anomaly)
* `dd0c/alert` (Alert Intelligence)
* `dd0c/run` (AI Runbooks)
* `dd0c/drift` (IaC Drift)
* `dd0c/portal` (Lightweight IDP)
### The Architecture of Adoption

1. **The Gateway Drugs:** `dd0c/route` and `dd0c/cost`. Why? **Immediate Monetary ROI.** If you save a company $2,000 on OpenAI and AWS bills in week one, you have bought the political capital to sell them anything else.
2. **The Sanity Wedge:** `dd0c/alert`. Once you save their money, save their sleep.
3. **The Expansion (The Sticky Layer):** `dd0c/portal`. The IDP becomes the browser homepage for every engineer. Once `portal` is installed, `dd0c` owns the developer experience. `drift` and `run` are just features inside the portal.
---

## 4. Go-to-Market Strategy

### Launch Sequence

* **Phase 1: The FinOps Wedge (Months 1-3).** Launch `dd0c/route` and `dd0c/cost`. Companies are currently bleeding cash on unoptimized GPT-4o calls and forgotten EC2 instances. Capture that budget.
* **Phase 2: The On-Call Savior (Months 4-6).** Launch `dd0c/alert` and `dd0c/run`. Integrate directly with the Slack channels they already use.
* **Phase 3: The Platform Takeover (Months 7-12).** Launch `dd0c/portal` and `dd0c/drift`. Now that they trust you with their money and their sleep, they will trust you with their internal service catalog.
### Pricing Philosophy

* **FinOps Modules:** Usage-based (e.g., flat % of identified savings, or per 1M LLM tokens routed).
* **Workflow Modules:** Per-seat. $15-$30/engineer/month.
* **The Vercel Playbook:** Time-to-value must be < 5 minutes. Generous free tier for hobbyists. Zero "Contact Sales to see pricing" buttons. Open-source the local agents and proxy layers; charge for the managed control plane and dashboard.
---

## 5. Technical Platform Strategy

* **Shared Infrastructure:** This is your moat against burnout. Build one unified API gateway. Build one unified OpenTelemetry ingest pipeline.
* **The Data Advantage (The Flywheel):**
    * Because `dd0c` has the IDP (`portal`), it knows *who* owns the microservice.
    * When an anomaly is detected (`cost`), it doesn't alert a generic Slack channel; it pages the specific owner directly.
    * When an alert fires (`alert`), it automatically attaches the runbook (`run`) linked in the portal. *The modules are 10x more valuable together than apart.*
* **Deployment:** Cloud-only SaaS for the dashboard, with a lightweight, open-source Rust/Go agent that runs inside their VPC. You never ask for their root AWS creds. The agent pushes telemetry out to you.
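The push-only agent contract above is simple enough to sketch. This is a minimal illustration (in Python for brevity, though the memo proposes Rust/Go); the endpoint URL and payload field names are hypothetical, not a real dd0c API:

```python
import json
import time
import urllib.request

# Hypothetical ingest endpoint -- no real dd0c API is specified in this
# document; the URL and field names here are illustrative only.
DD0C_INGEST_URL = "https://ingest.dd0c.example/v1/telemetry"

def build_payload(module: str, metrics: dict) -> bytes:
    """Serialize one telemetry batch. Only aggregate numbers leave the
    VPC; no credentials or request payloads are ever included."""
    record = {
        "module": module,        # e.g. "cost" or "route"
        "ts": int(time.time()),  # emission timestamp (epoch seconds)
        "metrics": metrics,      # aggregate metrics only
    }
    return json.dumps(record).encode("utf-8")

def push(payload: bytes) -> None:
    """Outbound-only POST from inside the customer's VPC."""
    req = urllib.request.Request(
        DD0C_INGEST_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)

payload = build_payload("cost", {"ec2_idle_instances": 3, "daily_spend_usd": 412.5})
print(json.loads(payload)["module"])  # prints "cost"
```

The key design property is direction: the agent only ever opens connections outward, so the SaaS never needs inbound access or cloud credentials.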
---

## 6. Financial Model Sketch

### Path to $50K MRR (Bootstrap Friendly)

To hit $50k MRR, you do not need to fight Datadog for enterprise contracts. You need **500 mid-sized teams paying you $100/month**.

* **Entry Tier (Free):** Up to $100/mo AI spend routed, 1 AWS account monitored.
* **Pro Tier ($49/mo base + $15/user):** Unlocks all 6 modules for small teams. (Average deal size: $199/mo.)
* **Business Tier ($499/mo):** Unlimited stacks, advanced RBAC, priority AI routing.

*Math:* 200 Pro customers at the ~$199/mo average (≈$40k) + 20 Business customers at $499/mo (≈$10k) = $50k MRR.
### Maximizing Revenue per Engineering Effort

Build `dd0c/route` first. The FinOps Foundation's 2026 report screams that AI workload cost management is the #1 emerging challenge. It is by far the easiest module to build (it's a proxy router with a React dashboard) and has the highest immediate willingness-to-pay.
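The routing core really is small. A minimal sketch of the cheapest-adequate-model decision, assuming an upstream step has already scored task complexity; the model names, prices, and complexity ceilings below are illustrative, not real pricing data:

```python
# Hypothetical model catalog: names, per-1M-token prices, and capability
# ceilings are illustrative placeholders, not real provider pricing.
MODELS = [
    {"name": "mini",     "usd_per_1m_tokens": 0.15, "max_complexity": 2},
    {"name": "mid",      "usd_per_1m_tokens": 2.50, "max_complexity": 5},
    {"name": "frontier", "usd_per_1m_tokens": 10.0, "max_complexity": 10},
]

def route(task_complexity: int) -> dict:
    """Pick the cheapest model whose capability ceiling covers the task.
    Scoring complexity (classifying the prompt) is the hard part of the
    product and is elided here."""
    for model in sorted(MODELS, key=lambda m: m["usd_per_1m_tokens"]):
        if task_complexity <= model["max_complexity"]:
            return model
    return MODELS[-1]  # fall back to the most capable model

print(route(1)["name"])  # prints "mini": trivial tasks go to the cheapest model
print(route(7)["name"])  # prints "frontier": hard tasks escalate
```

Everything defensible about the product lives in the complexity scorer and the accumulated routing data, not in this loop.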
---

## 7. Risk Assessment

As an architect, you must look at the structural failure points. Here are the top 5 existential risks and their mitigations:

1. **The "Six Mediocre Products" Risk (Technical Debt)**
    * *Risk:* Building 6 modules alone means none of them achieve feature parity with single-focus competitors.
    * *Mitigation:* Do not aim for feature parity. Win on UX, integration, and simplicity. Focus relentlessly on the shared core data model.
2. **The Datadog/AWS Sherlock Risk**
    * *Risk:* AWS finally builds a good native Cost Anomaly UX, or Datadog drops prices.
    * *Mitigation:* AWS UX is historically terrible. Datadog cannot drop prices without destroying their market cap. Your moat is cross-platform neutrality and superior developer experience.
3. **The Proxy Trust Barrier**
    * *Risk:* Teams won't route their proprietary LLM prompts through a solo founder's SaaS.
    * *Mitigation:* The `dd0c/route` proxy MUST be open-source and deployable in their VPC. Your SaaS only receives the telemetry (tokens used, latency, cost), never the prompt payloads.
4. **Agentic AI Obsoletes IDPs**
    * *Risk:* If GitHub Agentic Workflows can auto-discover and fix everything, does anyone need an IDP?
    * *Mitigation:* Agents need a source of truth. The IDP becomes the registry for the AI agents, not just the humans. Position `dd0c/portal` as the "Agent Control Plane."
5. **GTM Paralysis**
    * *Risk:* Developers are the hardest demographic to market to. They use adblockers and hate salespeople.
    * *Mitigation:* Engineering-as-marketing. Release free mini-tools (e.g., a CLI that instantly calculates your wasted LLM spend locally).
Brian. The market is begging for simplicity. Build the weapon.

*Checkmate.*

— Victor
---

`devops-opportunities-2026.md` (new file, 177 lines):
# DevOps/AWS Disruption Opportunities — 2026

**Research Date:** February 28, 2026
**Prepared for:** Brian (Senior AWS/Cloud Architect)

---
## 1. Pain Points in DevOps (2025–2026)

### What developers and platform engineers are complaining about RIGHT NOW:

#### "DevOps becomes AllOps"

The #1 complaint on r/devops is scope creep. DevOps engineers are expected to be sysadmins, DBAs, security engineers, network engineers, and on-call firefighters simultaneously. The thread [r/devops: "When DevOps becomes AllOps"](https://www.reddit.com/r/devops/comments/1re5llx/when_devops_becomes_allops/) captures this perfectly — engineers drowning in responsibilities with no clear boundaries.
#### Alert Fatigue & On-Call Burnout

Massive ongoing pain. [r/devops: "Drowning in alerts but critical issues keep slipping through"](https://www.reddit.com/r/devops/comments/1r9qvcd/drowning_in_alerts_but_critical_issues_keep/) — engineers receiving hundreds of alerts/day, critical incidents lost in noise. AI-powered observability tools are "still pretty hit or miss." The consensus: alert fatigue is a symptom of undefined SLOs, but nobody has a great tool to bridge that gap automatically.

#### Datadog Pricing Rage

Datadog is universally acknowledged as powerful but absurdly expensive. Reddit's r/Observability: "Datadog and New Relic are everywhere, but they're starting to feel bloated and expensive for what they deliver." OpenTelemetry is winning hearts as the vendor-neutral standard, but the tooling on top of OTel is still fragmented. Alternatives like SigNoz, Grafana Cloud, and Last9 are gaining traction, but none has nailed the "Datadog experience at 1/10th the price" yet.
#### IaC Fragmentation & Drift

[r/devops: "IaC at scale is dealing with fragmented..."](https://www.reddit.com/r/devops/comments/1r9980m/iac_at_scale_is_dealing_with_fragmented/) — teams running Terraform + Pulumi + CloudFormation + Helm + Kustomize simultaneously. State management is a nightmare. Drift detection is still mostly "run terraform plan and pray." Tools like Spacelift, ControlMonkey (KoMo AI copilot), and driftctl exist but are either expensive enterprise plays or abandoned OSS projects.

#### Terraform/OpenTofu Complexity at Scale

Terraform state management remains painful. State locking, state splitting, cross-stack references, import workflows — all manual and error-prone. The Terraform → OpenTofu fork created confusion. Teams are stuck between ecosystems.

#### Kubernetes Complexity

Still the elephant in the room. K8s is powerful but the operational overhead for small-to-mid teams is brutal. Networking, RBAC, secrets, upgrades, debugging — each requires deep expertise. Many teams are over-Kubernetesed for their actual needs.
#### Backstage / Internal Developer Portal Frustration

Backstage (Spotify's OSS IDP) is the default choice but universally complained about. Gartner explicitly warns against treating it as "ready-to-use." Maintenance costs balloon. Plugins break on upgrades. YAML catalog entries go stale. Commercial alternatives (Port, Cortex, Roadie, Harness IDP) are expensive. There's a massive gap for a lightweight, opinionated IDP that "just works" for teams of 10-100 engineers.

---
## 2. Emerging Gaps & Underserved Markets

### A. AI/LLM Cost Management (FinOps for AI)

The FinOps Foundation's 2026 report shows the #1 emerging challenge is **AI workload cost management**. Teams are burning money on LLM inference with zero visibility. Key stats:

- Uniform model routing (sending everything to GPT-4o) wastes 60%+ on tasks that GPT-4o-mini handles fine
- SaaS sprawl is the new shadow IT — teams signing up for 10 different AI tools with no central cost tracking
- FinOps is shifting from "cloud cost optimization" to "AI + SaaS optimization"
- **Gap:** No good tool exists for small/mid teams to track, route, and optimize LLM API spend across providers (OpenAI, Anthropic, Google, self-hosted). Portkey.ai is emerging but focused on enterprise.
### B. AI/LLM Observability

Production LLM monitoring is a greenfield. Traditional APM tools (Datadog, New Relic) are bolting on AI features but they're clunky. Arize, Langfuse, and Helicone exist but the space is early. **Gap:** A lightweight, developer-friendly tool that monitors LLM calls in production — latency, cost, token usage, hallucination detection, prompt versioning — without requiring a PhD to set up.
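The "lightweight" part of that gap is what a sketch makes obvious: per-call latency, tokens, and cost can be captured with a thin wrapper around any provider client. The model names and per-1M-token prices below are placeholders, and `llm_fn` stands in for a real SDK call:

```python
import time
from dataclasses import dataclass

# Illustrative per-1M-token prices; real prices vary by provider/model.
PRICE_PER_1M = {"small-model": 0.15, "big-model": 10.0}

@dataclass
class CallRecord:
    model: str
    latency_s: float
    tokens: int
    cost_usd: float

RECORDS: list[CallRecord] = []

def instrumented_call(model: str, prompt: str, llm_fn) -> str:
    """Wrap an LLM call, recording latency, token count, and cost.
    `llm_fn` is a placeholder for a real provider client; it must
    return (text, total_tokens)."""
    start = time.monotonic()
    text, tokens = llm_fn(model, prompt)
    RECORDS.append(CallRecord(
        model=model,
        latency_s=time.monotonic() - start,
        tokens=tokens,
        cost_usd=tokens / 1_000_000 * PRICE_PER_1M[model],
    ))
    return text

# Fake provider for demonstration: echoes and reports 1000 tokens.
fake_llm = lambda model, prompt: (prompt.upper(), 1000)
instrumented_call("big-model", "summarize this", fake_llm)
print(f"${RECORDS[0].cost_usd:.4f}")  # prints "$0.0100"
```

Everything harder in the gap (hallucination detection, prompt versioning) layers on top of exactly this kind of per-call record.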
### C. Cloud Repatriation Tooling

Cloud repatriation is accelerating hard in 2026:

- Broadcom/VMware predicts it's moving "from ad-hoc cost cutting to deliberate strategy"
- Gartner: 75% of European/Middle Eastern enterprises will repatriate workloads to sovereign environments by 2030 (up from 5% in 2025)
- Egress fees, compliance (GDPR, data sovereignty), and AI workload costs are driving this
- **Gap:** Migration assessment and execution tooling for cloud-to-colo/self-hosted moves. Most tools focus on cloud migration IN, not OUT. The reverse journey has almost no tooling.

### D. CI/CD Pipeline Security & Supply Chain

GitLab's 2025 DevSecOps Report: 67% of organizations introduce security vulnerabilities during CI/CD due to inconsistent controls. ReversingLabs 2026 report: malware on open-source platforms up 73%, with attacks targeting developer tooling directly. **Gap:** Lightweight CI/CD security scanning that's not enterprise-priced. Snyk, Bridgecrew (Prisma Cloud), and Checkov exist but are complex. A focused, opinionated tool for small teams is missing.
### E. Compliance-as-Code for Startups

SOC 2, HIPAA, and ISO 27001 compliance is increasingly required even for small startups (customers demand it). Tools like Vanta and Drata exist but cost $15K-$50K/year and are designed for compliance teams, not engineers. **Gap:** A developer-first compliance tool that auto-generates evidence from your actual infrastructure (AWS Config, GitHub, Terraform state) without requiring a dedicated compliance hire.

### F. Secrets Management Simplification

HashiCorp Vault is powerful but operationally heavy. AWS Secrets Manager works but is AWS-locked. Most teams end up with secrets scattered across .env files, CI variables, Vault, and AWS SSM Parameter Store. **Gap:** A unified secrets management layer that works across clouds and CI/CD systems without requiring a Vault cluster.

---
## 3. Indie/Bootstrap-Friendly Opportunities ($5K–$50K MRR)

### Opportunity 1: LLM Cost Router & Dashboard

**What:** SaaS that sits between your app and LLM providers. Routes requests to the cheapest adequate model based on task complexity. Dashboard shows spend by team/feature/model.

**Why now:** Every company is integrating AI but nobody tracks the cost. Teams discover $10K/month bills with no attribution.

**Monetization:** Usage-based pricing (% of savings or flat per-request fee). Free tier for <$100/month LLM spend.

**Competitors:** Portkey.ai (enterprise-focused, $$$), Helicone (logging-focused, not routing), LiteLLM (OSS proxy, no SaaS dashboard). None nail the "set up in 5 minutes, save money immediately" pitch.

**Moat:** Accumulate routing intelligence data. The more traffic you see, the better your routing decisions.

**MRR potential:** $10K-$50K+ (every AI-using startup is a customer)
### Opportunity 2: IaC Drift Detection & Auto-Remediation SaaS

**What:** Continuous drift detection for Terraform/OpenTofu/Pulumi with Slack alerts and one-click remediation. Not a full IaC management platform — just the drift piece, done really well.

**Why now:** Spacelift ($$$), env0 ($$$), and Terraform Cloud handle this but are expensive full platforms. driftctl was abandoned by Snyk. ControlMonkey is enterprise. Nobody offers a focused, affordable drift-detection-as-a-service.

**Monetization:** Per-stack pricing. $29/mo for 10 stacks, $99/mo for 50, $299/mo for unlimited.

**Competitors:** Spacelift (starts ~$500/mo), env0 (similar), Terraform Cloud (HashiCorp pricing). All are platforms, not focused tools.

**MRR potential:** $5K-$20K (every Terraform team with >5 stacks needs this)
### Opportunity 3: Alert Intelligence Layer

**What:** Sits on top of existing monitoring (Datadog, Grafana, PagerDuty, OpsGenie). Uses AI to deduplicate, correlate, and prioritize alerts. Learns from your acknowledge/resolve patterns. Reduces alert volume by 70-90%.

**Why now:** Alert fatigue is the #1 on-call complaint. Existing tools generate alerts but don't intelligently filter them. AI is finally good enough to do this reliably.

**Monetization:** Per-seat pricing for on-call engineers. $15-$30/seat/month.

**Competitors:** BigPanda (enterprise, $100K+ deals), Moogsoft (acquired by Dell), PagerDuty AIOps (add-on, expensive). Nothing for teams of 5-50 engineers.

**MRR potential:** $10K-$30K
### Opportunity 4: Lightweight Internal Developer Portal

**What:** Opinionated IDP that auto-discovers services from your cloud provider + GitHub/GitLab. Service catalog, ownership, runbooks, on-call schedules — no YAML configuration needed. Anti-Backstage.

**Why now:** Backstage requires a dedicated platform team to maintain. Port/Cortex cost $20K+/year. Small-to-mid teams (10-100 engineers) have nothing.

**Monetization:** $10/engineer/month. Self-serve signup.

**Competitors:** Backstage (free but painful), Port ($$$), Cortex ($$$), Roadie (managed Backstage, still complex), OpsLevel ($$$).

**MRR potential:** $10K-$50K
### Opportunity 5: AWS Cost Anomaly Detective

**What:** Focused tool that monitors AWS billing in real-time, detects anomalies (runaway Lambda, forgotten EC2 instances, surprise data transfer), and sends actionable Slack alerts with one-click remediation (terminate, resize, reserve).

**Why now:** AWS Cost Explorer is terrible UX. CloudHealth/Apptio are enterprise. Vantage and Infracost are good but don't do real-time anomaly detection well. Most teams discover cost spikes at month-end.

**Monetization:** % of savings identified or flat monthly fee based on AWS spend tier.

**Competitors:** Vantage (good but broad), Infracost (pre-deploy only), AWS Cost Anomaly Detection (native but limited and poorly surfaced), CloudZero (enterprise).

**MRR potential:** $5K-$30K
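A first-pass detector for the anomaly piece is plain statistics over recent daily spend. A sketch under stated assumptions — the z-score threshold of 3 and the 7-day minimum history are common but arbitrary starting points, and real data would come from AWS billing exports rather than a hard-coded list:

```python
from statistics import mean, stdev

def is_spend_anomaly(history: list[float], today: float,
                     z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z_threshold standard
    deviations above the recent daily mean. Requires a short history
    to establish a baseline."""
    if len(history) < 7:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu  # perfectly flat history: any increase is suspicious
    return (today - mu) / sigma > z_threshold

baseline = [410.0, 395.0, 402.0, 420.0, 408.0, 399.0, 415.0]
print(is_spend_anomaly(baseline, 412.0))  # prints False: a normal day
print(is_spend_anomaly(baseline, 900.0))  # prints True: runaway spend
```

The product value is everything around this function: attribution (which service spiked), Slack delivery, and the one-click remediation, none of which a z-score gives you.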
### Opportunity 6: AI-Powered Runbook Automation

**What:** Takes your existing runbooks (Notion, Confluence, markdown) and turns them into executable, AI-assisted incident response workflows. When an alert fires, the AI walks the on-call engineer through the runbook steps, executing safe commands automatically and asking for approval on dangerous ones.

**Why now:** Agentic AI is mature enough. Runbooks exist but nobody follows them at 3am. PagerDuty/Rootly have basic automation but not AI-driven runbook execution.

**Monetization:** Per-incident or per-seat pricing.

**Competitors:** Rootly (incident management, not runbook execution), Shoreline.io (acquired by Cisco, enterprise), Rundeck (OSS, complex). Nobody does "AI reads your runbook and executes it."

**MRR potential:** $10K-$40K
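The safe/dangerous split described above amounts to a command gate in front of the AI executor. A minimal sketch — the allowlist patterns are illustrative policy, not a complete or vetted read-only list:

```python
import re

# Illustrative policy: read-only diagnostics run automatically;
# everything else requires a human approval step before execution.
SAFE_PATTERNS = [
    r"^kubectl get ",
    r"^kubectl logs ",
    r"^aws ec2 describe-",
    r"^terraform plan\b",
]

def classify(command: str) -> str:
    """Return 'auto' for commands matching a read-only pattern,
    'needs_approval' for everything else (default-deny)."""
    if any(re.match(p, command) for p in SAFE_PATTERNS):
        return "auto"
    return "needs_approval"

print(classify("kubectl get pods -n prod"))       # prints "auto"
print(classify("kubectl delete deployment api"))  # prints "needs_approval"
```

Default-deny is the important property: anything the allowlist doesn't recognize, including commands the AI hallucinated, waits for a human.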
---

## 4. Competitive Landscape Summary

| Opportunity | Incumbents | Their Weakness |
|---|---|---|
| LLM Cost Router | Portkey, Helicone, LiteLLM | Enterprise-focused or OSS without SaaS |
| IaC Drift Detection | Spacelift, env0, TF Cloud | Expensive platforms, not focused tools |
| Alert Intelligence | BigPanda, Moogsoft, PagerDuty AIOps | Enterprise pricing, complex setup |
| Lightweight IDP | Backstage, Port, Cortex | Complex/expensive, need dedicated team |
| AWS Cost Anomaly | Vantage, CloudZero, AWS native | Broad focus, poor UX, enterprise pricing |
| Runbook Automation | Rootly, Shoreline, Rundeck | Not AI-driven, enterprise or abandoned |
---

## 5. AI + DevOps Intersection

### Agentic DevOps Is the Big Theme

The term "Agentic DevOps" is everywhere in Feb 2026. Key developments:

- **Pulumi Neo:** AI agent that can provision and manage infrastructure autonomously. Represents the shift from "AI assists" to "AI executes."
- **GitHub Agentic Workflows:** GitHub Actions now supports coding agents that handle triage, documentation, and code quality autonomously. [GitHub Blog: "Automate repository tasks with GitHub Agentic Workflows"](https://github.blog/ai-and-ml/automate-repository-tasks-with-github-agentic-workflows/)
- **HackerNoon: "The End of CI/CD Pipelines: The Dawn of Agentic DevOps"** — GitHub's agent fixed a flaky test in 11 minutes with no human code. But debugging agent failures is harder than debugging pipeline failures.
### Where the Gaps Are:

1. **Agent Observability:** When AI agents make infrastructure changes, who audits them? How do you trace what an agent did and why? This is a brand new problem with no good tooling.
2. **Policy Guardrails for AI Agents:** AI agents need policy boundaries (don't delete production databases). OPA/Rego exists but isn't designed for AI agent workflows. Opportunity for an "AI agent policy engine."
3. **AI-Assisted Incident Response:** The gap between "AI summarizes the alert" and "AI actually fixes the issue" is where the money is. Current tools do the former; the latter is barely explored for infrastructure.
4. **LLM-Powered IaC Generation:** Pulumi Neo is leading but it's Pulumi-only. A tool that generates Terraform/CloudFormation from natural language descriptions, with proper state management and drift detection, doesn't exist as a standalone product.

---
## 6. Trends — Last 30 Days (Feb 2026)

### Hot Right Now:

1. **Agentic DevOps** — The buzzword of the month. Every vendor is slapping "agentic" on their product. Real substance from Pulumi Neo and GitHub.
2. **Cloud Repatriation Acceleration** — Broadcom, CIO.com, and DataBank are all publishing major pieces. Driven by AI workload costs (GPU instances are expensive on cloud), data sovereignty regulations, and egress fee frustration.
3. **FinOps for AI** — The FinOps Foundation's 2026 State of FinOps report shifted focus from traditional cloud cost to AI/SaaS cost management. AI workloads are the new uncontrolled spend category.
4. **Software Supply Chain Security** — ReversingLabs 2026 report: malware on open-source platforms up 73%. Attacks now target developer tooling and AI development pipelines directly.
5. **OpenTelemetry Dominance** — OTel is winning the observability standards war. Vendor lock-in backlash is real. Teams want to own their telemetry pipeline and choose backends.
6. **Self-Healing Infrastructure** — AI-powered auto-remediation is moving from concept to early production. Still mostly vaporware from big vendors, but the demand signal is strong.
7. **Platform Engineering Maturity** — Gartner and others are pushing IDPs as mandatory, not optional. But the tooling gap between "Backstage is too hard" and "Port/Cortex is too expensive" remains wide open.
### New Launches & Tools (Feb 2026):

- **Pulumi Neo** — AI agent for infrastructure management
- **GitHub Agentic Workflows** — AI agents in GitHub Actions
- **ControlMonkey KoMo** — AI copilot for Terraform (tagging, drift, destructive change detection)
- **OneUptime** — Open-source monitoring platform gaining traction as a Datadog alternative
- **SigNoz** — Open-source APM continuing to grow, positioned as "open-source Datadog"
---

## 7. Top 3 Recommendations for Brian

Based on the research, here's where I'd focus if I were building:

### 🥇 #1: LLM Cost Router & Optimization Dashboard

**Why:** Massive, growing pain point with no dominant player. Every company integrating AI needs this. Brian's AWS expertise means he understands cloud cost optimization deeply — this is the AI-native version of that same problem. Bootstrap-friendly because the value prop is immediate and measurable (you save money from day 1). Could start as an open-source proxy with a paid dashboard.
### 🥈 #2: Alert Intelligence Layer (AI-Powered Alert Deduplication)

**Why:** Universal pain point, clear ROI (fewer pages = happier engineers = less churn). The AI/ML component is a genuine moat — your model improves with more data. Integrates with existing tools (non-rip-and-replace). PagerDuty/OpsGenie integration means an instant distribution channel. Small teams will pay $15-30/seat/month without blinking.

### 🥉 #3: Lightweight IDP (Anti-Backstage)

**Why:** The "Backstage is too complex" complaint is universal and getting louder. The market is bifurcated: free-but-painful (Backstage) vs. expensive-and-enterprise (Port, Cortex). A $10/engineer/month self-serve IDP that auto-discovers from AWS/GitHub and requires zero YAML would fill a massive gap. Brian's AWS knowledge is directly applicable to the auto-discovery engine.

---

*Research compiled from Reddit (r/devops, r/aws, r/kubernetes, r/sre, r/selfhosted, r/Observability), Hacker News, Pulumi Blog, GitHub Blog, HackerNoon, FinOps Foundation, ReversingLabs, Gartner references, Broadcom/VMware, CIO.com, Spacelift, ControlMonkey, and various tech blogs. All sources accessed February 28, 2026.*
---

`products/01-llm-cost-router/architecture/architecture.md` (new file, 1881 lines): diff suppressed, file too large.

`products/01-llm-cost-router/brainstorm/session.md` (new file, 324 lines):
# 🧠 LLM Cost Router — Brainstorming Session

**Facilitator:** Carson (Elite Brainstorming Specialist)
**Date:** February 28, 2026
**Product:** LLM Cost Router & Optimization Dashboard
**Target:** Bootstrap SaaS, $5K–$50K MRR
**Session Goal:** 100+ ideas across problem space, solutions, differentiation, and risk

---
## Phase 1: Problem Space Exploration

*"Alright team, let's start with the PAIN. What hurts? What keeps the CFO up at night? What makes the engineer cringe when they open the billing console? No idea is too small — if it stings, say it!"*

### Direct Cost Pain Points (Classic Brainstorm — Free Association)
1. **"GPT-4 for everything" syndrome** — Teams default to the most powerful model for every request, including trivial ones like formatting JSON or generating slugs. 60%+ waste.
2. **No per-feature cost attribution** — The monthly OpenAI bill is $12K but nobody knows if it's the chatbot, the summarizer, or the code review tool burning cash.
3. **No per-team/per-developer attribution** — Engineering teams share API keys. Who's running expensive experiments? Nobody knows.
4. **Surprise billing spikes** — A prompt engineering experiment goes wrong, retries in a loop, $3K gone in an hour. No alerts.
5. **Token estimation is broken** — Developers can't predict cost before sending a request. Tokenizer counts are approximate and vary by model.
6. **Retry storms** — Failed requests retry with exponential backoff but still burn tokens on partial completions. The cost of failures is invisible.
7. **Prompt bloat over time** — System prompts grow as features are added. Nobody audits prompt length. A 4,000-token system prompt on every request adds up fast.
8. **Context window stuffing** — RAG pipelines stuff maximum context "just in case." Sending 100K tokens when 10K would suffice.
9. **Streaming cost opacity** — Streaming responses make it harder to track per-request cost in real-time.
10. **Multi-provider bill reconciliation** — Teams using OpenAI + Anthropic + Google + Cohere get 4 separate bills with different billing models. No unified view.
|
||||||
|
|
||||||
|
### Hidden Costs Nobody Talks About (Reverse Brainstorm — "What costs are we pretending don't exist?")

11. **Latency cost** — Using GPT-4o when GPT-4o-mini would respond 3x faster. The user experience cost of slow responses is real but unmeasured.
12. **Developer time debugging model issues** — Hours spent figuring out why Claude gave a different answer than GPT. The human cost of multi-model chaos.
13. **Opportunity cost of model lock-in** — Teams build around one provider's API quirks. Switching costs grow silently.
14. **Compliance cost of untracked AI usage** — Shadow AI: developers using personal API keys. No audit trail. SOC 2 auditors will ask about this.
15. **Embedding re-computation cost** — Changing embedding models means re-embedding your entire corpus. Nobody budgets for this.
16. **Fine-tuning waste** — Teams fine-tune models that become obsolete in 3 months when a better base model drops.
17. **Testing/staging environment costs** — Running the same expensive models in dev/staging as production. Nobody sets up model tiers per environment.
18. **Prompt iteration cost** — The R&D cost of trying 50 prompt variations against GPT-4o to find the best one, when you could test against a cheap model first.
19. **Cache miss cost** — Identical or near-identical requests hitting the API repeatedly. Semantic caching could eliminate 20-40% of calls.
20. **Overprovisioned rate limits** — Paying for higher rate limit tiers "just in case" when actual usage is 10% of capacity.
### Workflow Waste (SCAMPER — Substitute, Combine, Adapt, Modify, Put to other use, Eliminate, Reverse)

21. **Summarize-then-analyze chains** — Step 1 summarizes with GPT-4o, Step 2 analyzes the summary with GPT-4o. Step 1 could use a tiny model.
22. **Classification tasks on large models** — Binary yes/no classification sent to a $15/M-token model when a $0.10/M-token model gets 98% accuracy.
23. **Batch jobs running synchronously** — Nightly batch processing using real-time API pricing instead of batch API discounts (OpenAI batch API is 50% cheaper).
24. **Redundant safety checks** — Multiple layers of content moderation each calling an LLM, when one dedicated moderation endpoint would suffice.
25. **Verbose output requests** — Asking for detailed explanations when the downstream consumer only needs a JSON object. Paying for output tokens nobody reads.
26. **Translation chains** — Translating content through English as an intermediary when direct translation would be cheaper and better.
### "Zero Waste AI" Vision (Analogy Thinking — "What would a Toyota Production System for AI look like?")

27. **Just-in-time model selection** — Like JIT manufacturing: use exactly the right model at exactly the right time, no inventory (no over-provisioning).
28. **Kanban for AI requests** — Visualize the flow of requests, identify bottlenecks, limit work-in-progress to prevent cost spikes.
29. **Kaizen for prompts** — Continuous improvement: every prompt gets reviewed monthly for token efficiency.
30. **Andon cord for AI spend** — Any team member can pull the cord (trigger an alert) when they notice unusual AI spending.
31. **Value stream mapping for LLM pipelines** — Map every LLM call in a workflow, identify which ones add value vs. waste.
32. **Poka-yoke (mistake-proofing)** — Guardrails that prevent sending a 100K-token request to GPT-4o when the task is simple classification.
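
Idea #32's mistake-proofing can be sketched as a pre-flight check that runs before any provider call. This is a minimal sketch: the model names, task hints, and token threshold are illustrative assumptions, not product defaults.

```python
# Poka-yoke sketch: block obviously mismatched requests before they reach a
# premium model. Model names, hints, and thresholds are illustrative.
PREMIUM_MODELS = {"gpt-4o"}
SIMPLE_TASK_HINTS = ("classify", "yes or no", "extract", "format as json")

def guardrail(model: str, prompt: str, est_tokens: int) -> tuple[bool, str]:
    """Return (allowed, reason). Mistake-proofing, not quality routing."""
    text = prompt.lower()
    looks_simple = any(hint in text for hint in SIMPLE_TASK_HINTS)
    if model in PREMIUM_MODELS and looks_simple and est_tokens > 50_000:
        return False, "simple task with huge context sent to a premium model"
    return True, "ok"

allowed, reason = guardrail("gpt-4o", "Classify this ticket: yes or no?", 100_000)
```

The point of poka-yoke is that the check is cheap and deterministic, so it can sit on the hot path without a model call.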

---

## Phase 2: Solution Space Explosion

*"NOW we're cooking! Phase 1 gave us the pain — Phase 2 is where we go WILD with solutions. I want quantity over quality. Bad ideas welcome. TERRIBLE ideas celebrated. The worst idea in the room often leads to the best one. Let's GO!"*

### Routing Strategies (Classic Brainstorm + SCAMPER)

33. **Complexity-based routing** — Analyze the prompt: simple extraction → cheap model, multi-step reasoning → expensive model. Use a tiny classifier to decide.
34. **Latency-based routing** — User-facing requests get fast models, background jobs get cheap-but-slow models.
35. **Quality-threshold routing** — Define acceptable quality per task type. Route to the cheapest model that meets the threshold.
36. **Cascading routing** — Try the cheapest model first. If confidence is low, escalate to the next tier. Only pay for expensive models when needed.
37. **Time-of-day routing** — Use batch APIs during off-peak hours. Route to providers with lower pricing during their off-peak.
38. **Geographic routing** — Route to the nearest/cheapest regional endpoint. EU requests to EU-hosted models for compliance + cost.
39. **Token-budget routing** — Set a per-request token budget. Router picks the best model that fits within budget.
40. **Ensemble routing** — For critical requests, send to 2 cheap models and compare. Only escalate to expensive model if they disagree.
41. **Historical performance routing** — Track which model performs best for each task type over time. Route based on empirical data, not assumptions.
42. **A/B test routing** — Automatically A/B test models for each task. Converge on the cheapest one that maintains quality metrics.
43. **Fallback chain routing** — Primary model → fallback 1 → fallback 2. Automatic failover on rate limits, outages, or quality drops.
44. **Priority queue routing** — High-priority requests get premium models immediately. Low-priority requests queue for batch processing.
45. **Semantic similarity routing** — If a similar prompt was answered recently, return cached result or route to cheapest model for minor variations.
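
Cascading routing (#36) is simple enough to sketch end to end. In this sketch, `call_model` is a stub standing in for a real provider call that reports a confidence score; the tier names and the 0.75 threshold are assumptions for illustration.

```python
# Cascading routing sketch (idea #36): try the cheapest tier first and only
# escalate when the model reports low confidence. `call_model` is a stub.
TIERS = ["cheap-mini", "mid-tier", "frontier"]

def call_model(model: str, prompt: str) -> tuple[str, float]:
    # Stub: pretend cheaper models lose confidence as prompts grow.
    confidence = 0.9 if model == "frontier" else 0.9 - 0.02 * (len(prompt) // 100)
    return f"{model}-answer", confidence

def cascade(prompt: str, threshold: float = 0.75) -> tuple[str, str]:
    """Return (model_used, answer); pay for expensive tiers only when needed."""
    for model in TIERS:
        answer, confidence = call_model(model, prompt)
        if confidence >= threshold:
            return model, answer
    return TIERS[-1], answer  # the last tier's answer is the final fallback
```

In a real router, confidence would come from logprobs, a verifier model, or task-specific heuristics rather than a stub.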

### Dashboard Features (Mind Map Explosion)

46. **Real-time spend ticker** — Live-updating cost counter like a stock ticker. Per-model, per-team, per-feature.
47. **Cost attribution by feature/endpoint** — Tag each API call with metadata (feature, team, environment). Drill down in dashboard.
48. **Spend forecasting** — ML-based projection: "At current rate, you'll spend $X this month." With confidence intervals.
49. **Anomaly detection alerts** — "Your summarization pipeline cost 400% more than usual today." Slack/email/PagerDuty integration.
50. **Model comparison reports** — "Switching your classification task from GPT-4o to Claude Haiku would save $2,100/month with <1% quality drop."
51. **Prompt efficiency scoring** — Score each prompt template on tokens-per-useful-output. Identify bloated prompts.
52. **Savings leaderboard** — Gamify cost optimization. "Team Backend saved $3,200 this month by switching to cascading routing."
53. **Budget guardrails** — Set hard/soft limits per team, per feature, per day. Auto-throttle or alert when approaching limits.
54. **Invoice reconciliation** — Match your internal tracking against provider invoices. Flag discrepancies.
55. **Carbon footprint tracking** — Estimate CO2 per model per request. ESG reporting for AI usage.
56. **ROI calculator per AI feature** — "Your chatbot costs $4K/month and handles 10K conversations. That's $0.40/conversation vs. $8/conversation for human support."
57. **Token waste heatmap** — Visual heatmap showing where tokens are wasted: long system prompts, verbose outputs, unnecessary context.
58. **Provider health dashboard** — Real-time status of each LLM provider. Latency, error rates, rate limit utilization.
59. **Cost-per-quality scatter plot** — Plot each model's cost vs. quality score for your specific tasks. Find the Pareto frontier.
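
The anomaly-detection idea (#49) can be sketched with a plain z-score over a trailing window of daily spend. The window size and threshold are assumptions; a real system would also handle seasonality and wire the result into alert channels.

```python
# Anomaly-detection sketch (idea #49): flag a day's spend that deviates far
# from the trailing baseline. Threshold and window are illustrative.
from statistics import mean, stdev

def is_spend_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """True if today's spend is a statistical outlier vs. the trailing window."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

baseline = [100.0, 110.0, 95.0, 105.0, 98.0, 102.0, 99.0]
```

A $400 day against this ~$100/day baseline trips the check; a $104 day does not, which is the "400% more than usual" story from the idea list.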

### Developer Experience (Random Word Association — "SDK" + "magic" + "invisible")

60. **OpenAI-compatible proxy** — Drop-in replacement. Change your base URL, everything else stays the same. Zero code changes.
61. **One-line SDK wrapper** — `import { llm } from 'costrouter'; llm.chat(...)` — wraps OpenAI/Anthropic/Google SDKs transparently.
62. **CLI tool** — `costrouter analyze` scans your codebase, finds all LLM calls, estimates monthly cost, suggests optimizations.
63. **VS Code extension** — Inline cost estimates next to LLM API calls. "This call costs ~$0.003 per invocation."
64. **Middleware/interceptor pattern** — Express/FastAPI middleware that automatically wraps outgoing LLM calls.
65. **Terraform/Pulumi provider** — Define routing rules as infrastructure code. Version-controlled cost policies.
66. **GitHub Action** — PR comment: "This change adds a new GPT-4o call in the hot path. Estimated cost impact: +$1,200/month."
67. **Playground/sandbox** — Test prompts against multiple models simultaneously. See cost, latency, and quality side-by-side before deploying.
68. **Auto-generated migration guides** — "To switch from OpenAI to Anthropic for this task, change these 3 lines."
69. **Prompt optimizer** — Automatically compress prompts to use fewer tokens while maintaining output quality.
70. **Request inspector/debugger** — Chrome DevTools-style inspector for LLM requests. See tokens, cost, latency, routing decision for each call.
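
The "change your base URL" adoption story (#60) amounts to a config fragment. The proxy URL and key below are hypothetical placeholders; recent OpenAI SDKs read these environment variables, so the application code itself stays untouched.

```shell
# Drop-in adoption sketch (idea #60). Hypothetical endpoint and key —
# not real values. Application code is unchanged.
export OPENAI_BASE_URL="https://proxy.dd0c.example/v1"
export OPENAI_API_KEY="dd0c-virtual-key"   # proxy-issued key, illustrative
```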

### Business Model Variations (Worst Possible Idea → Invert)

71. **% of savings** — "We saved you $5K this month, we keep 20%." Aligned incentives. Risk: hard to prove counterfactual.
72. **Flat per-request fee** — $0.001 per routed request. Simple, predictable. Scales with usage.
73. **Freemium with usage cap** — Free for <$100/month LLM spend. Paid tiers for higher volume.
74. **Open-core** — OSS proxy (routing engine) + paid dashboard/analytics/team features.
75. **Seat-based** — $29/developer/month. Simple for procurement.
76. **Spend-tier pricing** — Free for <$500 LLM spend, $49/mo for <$5K, $199/mo for <$50K, custom for enterprise.
77. **Reverse auction model** — (Wild!) Providers bid for your traffic. You get the lowest price automatically.
78. **Insurance model** — "Pay us $X/month, we guarantee your LLM costs won't exceed $Y." We eat the risk.
79. **Worst idea → invert: Charge per dollar WASTED** — Actually... a "waste tax" that donates to charity could be a viral marketing hook.

### Integration Approaches (Analogy — "How do CDNs work? Apply that to LLM routing")

80. **CDN-style edge proxy** — Deploy routing logic at the edge (Cloudflare Workers). Lowest latency routing decisions.
81. **DNS-style resolution** — `gpt4o.costrouter.ai` resolves to the cheapest equivalent model. Change DNS, not code.
82. **Service mesh sidecar** — Kubernetes sidecar that intercepts LLM traffic. Zero application changes.
83. **Browser extension for AI tools** — Intercept ChatGPT/Claude web UI usage. Track and optimize even manual usage.
84. **Webhook-based** — Send us your LLM logs via webhook. We analyze and recommend. No proxy needed (analytics-only mode).
85. **LangChain/LlamaIndex plugin** — Native integration with the most popular AI frameworks.
86. **OpenTelemetry collector** — Export LLM telemetry via OTel. Fits into existing observability pipelines.

### Wild Ideas (Lateral Thinking — "What if we had no constraints?")

87. **AI agent that negotiates volume discounts** — Bot that contacts LLM providers, negotiates enterprise pricing based on your aggregated usage across customers.
88. **Semantic response cache** — Cache responses by semantic similarity, not exact match. "What's the capital of France?" and "France's capital city?" return the same cached response.
89. **Predictive pre-computation** — Analyze usage patterns, pre-generate likely responses during off-peak hours at batch pricing.
90. **Model distillation-as-a-service** — Automatically fine-tune a small model on your specific tasks using your GPT-4o outputs. Replace the expensive model with your custom cheap one.
91. **LLM futures market** — (Truly wild) Let companies buy/sell LLM compute futures. Lock in pricing for next quarter.
92. **Cooperative buying group** — Pool small companies' usage to negotiate enterprise pricing collectively.
93. **Response quality bounty** — Users flag bad responses. The system learns which model/prompt combos fail and routes around them.
94. **"Eco mode" for AI** — Like phone battery saver. One toggle: "optimize for cost." Automatically downgrades all non-critical AI calls.
95. **AI spend carbon offset marketplace** — Track AI carbon footprint, automatically purchase offsets. ESG compliance built-in.
96. **Prompt compression engine** — Automatically rewrite prompts to be shorter while preserving intent. Like gzip for prompts.
97. **Multi-turn conversation optimizer** — Detect when a conversation can be summarized and continued with a cheaper model mid-stream.
98. **Self-hosted model recommender** — "Based on your usage, hosting Llama 3 on a $2K/month GPU would save you $8K/month vs. API calls."

---

## Phase 3: Differentiation & Moat

*"Okay beautiful people, we've got a MOUNTAIN of ideas. Now let's get strategic. What makes this thing DEFENSIBLE? What stops someone from cloning it in a weekend? What makes customers STAY? This is where we separate a feature from a business."*

### Data Moats (Analogy — "What's Waze's moat? User-generated traffic data.")

99. **Routing intelligence network effect** — Every request teaches the router which model is best for which task. More customers = better routing = more savings = more customers. Flywheel.
100. **Cross-customer benchmarking** — "Companies like yours typically save 40% by routing classification to Haiku." Anonymized aggregate intelligence.
101. **Task-type performance database** — The world's largest dataset of "model X performs Y% on task type Z at cost W." Nobody else has this.
102. **Prompt efficiency corpus** — Anonymized library of optimized prompts. "Here's a 40% shorter version of your system prompt that performs identically."

### Switching Costs (SCAMPER — What can we Combine to increase stickiness?)

103. **Historical analytics lock-in** — 6 months of cost data, trends, and forecasts. Leaving means losing your analytics history.
104. **Custom routing rules** — Teams invest time configuring routing policies. That configuration is valuable and non-portable.
105. **Team workflows built around alerts** — Budget alerts, anomaly detection, Slack integrations — all wired into team processes.
106. **Compliance audit trail** — SOC 2 auditors accept your cost attribution reports. Switching means rebuilding compliance evidence.

### Technical Moats

107. **Proprietary complexity classifier** — A fast, accurate model that classifies prompt complexity in <5ms. Hard to replicate without massive training data.
108. **Real-time model benchmarking** — Continuously benchmark all models on standardized tasks. Know within hours when a model's quality changes (post-update regressions).
109. **Provider relationship advantages** — Early access to new models, volume discounts passed to customers, beta features.
110. **Multi-cloud routing optimization** — Optimize across AWS Bedrock, Azure OpenAI, Google Vertex, and direct APIs simultaneously. Complex to build, easy to use.

### Brand & Community Moats

111. **"AI FinOps" category creation** — Own the category name. Be the Datadog of AI cost management.
112. **Open-source proxy as top-of-funnel** — OSS routing engine gets adoption. Paid dashboard converts power users. Community contributes routing strategies.
113. **Public AI cost benchmarks** — Publish monthly "State of AI Costs" reports. Become the trusted source. Media coverage → brand → customers.
114. **Developer community & marketplace** — Community-contributed routing strategies, prompt optimizers, integrations. Ecosystem lock-in.
115. **Integration partnerships** — Official partner with LangChain, LlamaIndex, Vercel AI SDK. "Recommended cost optimization tool."

---

## Phase 4: Anti-Ideas & Red Team

*"Time to put on our black hats. I want you to DESTROY this idea. Be ruthless. Be the VC who says no. Be the competitor who wants to crush us. Be the customer who churns. If we can survive this gauntlet, we've got something real."*

### Why This Could FAIL (Reverse Brainstorm — "How do we guarantee failure?")

116. **Race to zero pricing** — LLM providers keep cutting prices. If GPT-4o becomes as cheap as GPT-4o-mini, routing adds no value. The savings disappear.
117. **Provider lock-in by design** — OpenAI, Anthropic, and Google actively discourage multi-provider usage. Proprietary features (function calling formats, vision capabilities) make routing harder.
118. **"Good enough" built-in solutions** — OpenAI launches their own cost dashboard and routing. They have all the data already. Why would they let a third party capture this value?
119. **Latency overhead kills adoption** — Adding a proxy hop adds latency. For real-time chat applications, even 50ms matters. Developers won't accept the tradeoff.
120. **Trust barrier** — "You want me to route ALL my LLM traffic through your proxy? Including my proprietary prompts and customer data?" Security/compliance teams will block this.
121. **Small market initially** — Only companies spending >$1K/month on LLMs care about optimization. That's a smaller market than it seems in 2026.
122. **Open-source competition** — LiteLLM already exists as an OSS proxy. A well-funded OSS project could eat the market before a SaaS gains traction.
123. **Model convergence** — If all models become equally good and equally priced, routing intelligence has no value.

### Biggest Risks

124. **Single point of failure risk** — If the router goes down, ALL LLM calls fail. Customers won't accept this for production workloads without extreme reliability guarantees.
125. **Data privacy liability** — Routing means seeing all prompts and responses. One data breach and the company is dead. GDPR, HIPAA, SOC 2 all apply.
126. **Accuracy of complexity classification** — If the router sends a complex task to a cheap model and it fails, the customer blames you, not the model.
127. **Provider API changes** — OpenAI changes their API format, your proxy breaks. You're now maintaining compatibility layers for 5+ providers. Operational burden grows fast.

### Competitor Kill Strategies

128. **OpenAI launches "Smart Routing"** — Built into their API. Free. Game over for the routing value prop.
129. **Datadog acquires Helicone** — Adds LLM cost tracking to their existing observability platform. Instant distribution to 26K+ customers.
130. **LiteLLM raises $50M** — Goes from OSS project to well-funded SaaS competitor with 10x your engineering team.
131. **AWS Bedrock adds native routing** — Brian's own employer could build this as a platform feature. Free for Bedrock customers.
132. **Price war** — A VC-funded competitor offers the same product for free to gain market share. Burns cash to kill bootstrapped competitors.

### Assumptions That Might Be Wrong

133. **"Teams want multi-provider"** — Maybe most teams are happy with one provider. The multi-provider routing value prop only matters if teams actually use multiple models.
134. **"Cost is the primary concern"** — Maybe quality and reliability matter 10x more than cost. Teams might prefer to overpay for consistency.
135. **"A proxy is the right architecture"** — Maybe an analytics-only approach (no routing, just visibility) is what the market actually wants first.
136. **"Small teams will pay"** — Maybe only enterprises have enough LLM spend to justify a cost optimization tool. The bootstrap-friendly market might be too small.
137. **"Routing decisions can be automated"** — Maybe the task complexity is too nuanced for automated classification. Maybe humans need to define routing rules manually, which reduces the magic.

---

## Phase 5: Synthesis

*"What a session! We generated a LOT of signal. Let me pull together the themes, rank the winners, and highlight the wild cards that could change everything."*

### Top 10 Most Promising Ideas (Ranked)

| Rank | Idea | Why It Wins |
|------|------|-------------|
| 1 | **OpenAI-compatible proxy with zero-code setup** (#60) | Lowest adoption barrier. Change one URL, start saving. This IS the product. |
| 2 | **Cascading routing — try cheap first, escalate on low confidence** (#36) | Elegant, automatic, measurable savings. The core routing innovation. |
| 3 | **Cost attribution by feature/team/environment** (#47) | The dashboard killer feature. Nobody else does this well. Solves the "who's spending?" problem. |
| 4 | **Open-core model — OSS proxy + paid dashboard** (#74) | De-risks adoption, builds community, creates top-of-funnel. LiteLLM proves the model works. |
| 5 | **Semantic response cache** (#88) | 20-40% cost reduction with zero quality impact. Immediate, provable ROI. |
| 6 | **Anomaly detection with Slack/PagerDuty alerts** (#49) | Prevents the "$3K surprise bill" story. Emotional resonance + clear value. |
| 7 | **Spend-tier pricing model** (#76) | Aligns with customer growth. Free tier drives adoption. Simple to understand. |
| 8 | **Routing intelligence flywheel / data moat** (#99) | The strategic moat. More traffic = better routing = more savings = more traffic. |
| 9 | **Model comparison reports with savings estimates** (#50) | "Switch this task to Haiku, save $2,100/month." Actionable, specific, compelling. |
| 10 | **Prompt efficiency scoring & optimization** (#51, #96) | Unique differentiator. Nobody else helps you write cheaper prompts. |

### 3 Wild Card Ideas That Could Be Game-Changers

🃏 **Wild Card 1: Model Distillation-as-a-Service (#90)**

Automatically fine-tune a small, cheap model on your specific tasks using your expensive model's outputs. This turns a cost optimization tool into an AI platform play. If it works, customers save 90%+ and are locked in forever because the distilled model is trained on THEIR data. Massive moat.

🃏 **Wild Card 2: Cooperative Buying Group (#92)**

Pool hundreds of small companies' LLM usage to negotiate enterprise-tier pricing from providers. Like a credit union for AI compute. This creates a network effect that's nearly impossible to replicate and positions the company as the "collective bargaining agent" for the long tail of AI-using startups.

🃏 **Wild Card 3: Self-Hosted Model Recommender (#98)**

"Based on your usage patterns, deploying Llama 3 70B on 2x A100s would save you $14K/month vs. API calls." This extends the value prop beyond routing to infrastructure advisory. It's the natural evolution: first optimize API costs, then help customers graduate to self-hosting when it makes sense. Counter-intuitive (you lose the routing revenue) but builds massive trust and opens up a consulting/managed-service revenue stream.

### Key Themes That Emerged

1. **"Invisible by default"** — The winning product requires zero code changes. A proxy architecture with an OpenAI-compatible API is non-negotiable. Adoption friction kills.
2. **"Show me the money"** — Every feature must connect to a dollar amount. Not "better observability" but "you saved $4,200 this month." The dashboard is a savings scoreboard.
3. **"Trust is the bottleneck"** — Routing all LLM traffic through a third party is a massive trust ask. The product needs SOC 2 from day one, data residency options, and an analytics-only mode for cautious adopters.
4. **"The moat is in the data"** — The routing intelligence flywheel is the only sustainable competitive advantage. Everything else can be cloned. The cross-customer performance database cannot.
5. **"Start narrow, expand wide"** — Start as a cost router. Expand to prompt optimization, model distillation, self-hosted recommendations. The wedge is cost savings; the platform is AI operations.
6. **"Open source is a feature, not a threat"** — LiteLLM proves an OSS proxy works. Don't fight it — embrace it. Open-source the proxy, monetize the intelligence layer.
### Recommended Focus Areas for Product Brief

**Must-Have (V1 — "Save money in 5 minutes"):**
- OpenAI-compatible proxy (drop-in replacement)
- Complexity-based routing with cascading fallback
- Real-time cost dashboard with per-feature attribution
- Anomaly detection + Slack alerts
- Semantic response caching
- Free tier for <$500/month LLM spend

**Should-Have (V1.1 — "Prove the ROI"):**
- Model comparison reports with savings recommendations
- Prompt efficiency scoring
- Budget guardrails (soft/hard limits per team)
- Multi-provider bill reconciliation

**Could-Have (V2 — "Platform play"):**
- A/B test routing for model evaluation
- Prompt compression/optimization engine
- Self-hosted model cost comparison
- OpenTelemetry export for existing observability stacks

**Future Vision (V3+ — "AI FinOps platform"):**
- Model distillation-as-a-service
- Cooperative buying group for volume discounts
- AI agent policy engine (guardrails for agentic workflows)
- Carbon footprint tracking and offset marketplace

---

## Appendix: Idea Count by Phase

| Phase | Target | Actual |
|-------|--------|--------|
| Phase 1: Problem Space | 20+ | 32 |
| Phase 2: Solution Space | 30+ | 66 |
| Phase 3: Differentiation | 15+ | 17 |
| Phase 4: Anti-Ideas | 10+ | 22 |
| **Total unique ideas** | **100+** | **137** |

## Techniques Used

- **Classic Brainstorm (Free Association):** Ideas 1–10, 33–45
- **Reverse Brainstorm:** Ideas 11–20, 116–123
- **SCAMPER:** Ideas 21–26, 103–106
- **Analogy Thinking:** Ideas 27–32 (Toyota Production System), 80–86 (CDN architecture), 99–102 (Waze data moat)
- **Mind Map Explosion:** Ideas 46–59
- **Random Word Association:** Ideas 60–70
- **Worst Possible Idea → Invert:** Ideas 71–79
- **Lateral Thinking:** Ideas 87–98
- **Red Team / Black Hat:** Ideas 124–137

---

*Session complete. 137 ideas generated. The signal is strong: this product has legs. The proxy-first, open-core approach with a data-driven routing moat is the play. Now let's turn this into a product brief.* 🎯
1013 products/01-llm-cost-router/design-thinking/session.md (new file; diff suppressed because it is too large)
340 products/01-llm-cost-router/epics/epics.md (new file)

@@ -0,0 +1,340 @@
# dd0c/route — V1 MVP Epics

This document outlines the core Epics and User Stories for the V1 MVP of dd0c/route, designed for a solo founder to implement in 1-3 day chunks per story.

---

## Epic 1: Proxy Engine

**Description:** Core Rust proxy that sits between the client application and LLM providers. Must maintain strict OpenAI API compatibility, support SSE streaming, and introduce <5ms latency overhead.

### User Stories
- **Story 1.1:** As a developer, I want to swap my `OPENAI_BASE_URL` to the proxy endpoint, so that my existing OpenAI SDK works without code changes.
- **Story 1.2:** As a developer, I want streaming support (SSE) preserved, so that my chat applications remain responsive while using the proxy.
- **Story 1.3:** As a platform engineer, I want the proxy latency overhead to be <5ms, so that intelligent routing doesn't degrade our application's user experience.
- **Story 1.4:** As a developer, I want provider errors (e.g., rate limits) to be passed through transparently, so that my app's existing error handling continues to work.

### Acceptance Criteria
- Implements `POST /v1/chat/completions` for both streaming (`stream: true`) and non-streaming requests.
- Validates the `Authorization: Bearer` header against a Redis cache (falling back to the DB).
- Successfully forwards requests to OpenAI and Anthropic, translating formats if necessary.
- Asynchronously emits telemetry events to an in-memory channel without blocking the hot path.
- P99 latency overhead is measured at <5ms.

### Estimate: 13 points
### Dependencies: None
### Technical Notes:
- Stack: Rust, `tokio`, `hyper`, `axum`.
- Use connection pooling for upstream providers to eliminate TLS handshake overhead.
- For streaming, parse only the first chunk/headers to make a routing decision, then pass the rest through. Count tokens from the final usage-bearing SSE chunk (the `data: [DONE]` sentinel itself carries no payload).

---

## Epic 2: Router Brain
|
||||||
|
**Description:** The intelligence core of dd0c/route embedded within the proxy. It evaluates incoming requests against routing rules, classifies complexity heuristically, checks cost tables, and executes fallback chains.
|
||||||
|
|
||||||
|
### User Stories
|
||||||
|
- **Story 2.1:** As an engineering manager, I want the router to classify the complexity of requests, so that simple extraction tasks are downgraded to cheaper models.
|
||||||
|
- **Story 2.2:** As an engineering manager, I want to configure routing rules (e.g., if feature=classify -> use cheapest from [gpt-4o-mini, claude-haiku]), so that I can automatically save money on predictable workloads.
|
||||||
|
- **Story 2.3:** As an application developer, I want the router to automatically fallback to an alternative model if the primary model fails or rate-limits, so that my application remains highly available.
|
||||||
|
- **Story 2.4:** As an engineering manager, I want cost savings calculated instantly based on up-to-date provider pricing, so that my dashboard data is immediately accurate.
|
||||||
|
|
||||||
|
### Acceptance Criteria
|
||||||
|
- Heuristic complexity classifier runs in <2ms based on token count, task patterns (regex on system prompt), and model hints.
|
||||||
|
- Evaluates first-match routing rules based on request tags (`X-DD0C-Feature`, `X-DD0C-Team`).
|
||||||
|
- Executes "passthrough", "cheapest", "quality-first", and "cascading" routing strategies.
|
||||||
|
- Enforces circuit breakers on downstream providers (e.g., open circuit if error rate > 10%).
|
||||||
|
- Calculates `cost_saved = cost_original - cost_actual` on the fly using in-memory cost tables.
|
||||||
|
|
||||||
|
### Estimate: 8 points
|
||||||
|
### Dependencies: Epic 1 (Proxy Engine)
|
||||||
|
### Technical Notes:
|
||||||
|
- Stack: Rust.
|
||||||
|
- Run purely in-memory on the proxy hot path. No DB queries per request.
|
||||||
|
- Cost tables and routing rules must be loaded at startup and refreshed via a background task every 60s.
|
||||||
|
- Use `serde_json` to inspect the `messages` array for complexity classification but do not persist the prompt.
|
||||||
|
- Circuit breaker state must be shared via Redis so all proxy instances agree on provider health.
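To make the Router Brain's hot-path constraints concrete, here is a minimal sketch of the heuristic complexity classifier and the "cheapest" strategy. All names (`estimate_tokens`, `classify_complexity`, `pick_cheapest`) and the specific weights are illustrative assumptions, not the actual dd0c/route implementation; the point is that everything runs on in-memory data with no I/O.

```rust
/// Rough token estimate: ~4 characters per token for English text.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Classify request complexity from cheap, in-memory signals only
/// (no DB, no network), keeping it well under the 2ms budget.
/// Returns a score in [0.0, 1.0]; low scores are downgrade candidates.
fn classify_complexity(system_prompt: &str, user_chars: usize) -> f32 {
    let mut score: f32 = 0.0;
    // Longer inputs suggest harder tasks.
    score += (estimate_tokens(system_prompt) + user_chars / 4) as f32 / 1000.0;
    // Simple task markers in the system prompt pull the score down.
    let lower = system_prompt.to_lowercase();
    let simple_markers = ["classify", "extract", "summarize in one sentence"];
    if simple_markers.iter().any(|m| lower.contains(*m)) {
        score -= 0.5;
    }
    score.clamp(0.0, 1.0)
}

/// "Cheapest" strategy: pick the lowest-cost model from the rule's
/// candidate list (cost here is USD per 1M input tokens).
fn pick_cheapest<'a>(candidates: &'a [(&'a str, f32)]) -> Option<&'a str> {
    candidates
        .iter()
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(model, _)| *model)
}
```

A cascading strategy would layer on top of this: try `pick_cheapest` first, then walk up the candidate list when the circuit breaker for the chosen provider is open.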
---

## Epic 3: Analytics Pipeline

**Description:** High-throughput logging and aggregation system using TimescaleDB. Focuses on ingesting asynchronous telemetry from the Proxy Engine without blocking request processing.

### User Stories

- **Story 3.1:** As a platform engineer, I want the proxy to emit telemetry without blocking the main request thread, so that our application performance remains unaffected.

- **Story 3.2:** As an engineering manager, I want my dashboard queries to be lightning fast even with millions of rows, so that I can quickly slice and dice our AI spend.

- **Story 3.3:** As an engineering manager, I want historical telemetry to be compressed or aged out automatically, so that database storage costs remain minimal.

### Acceptance Criteria

- Proxy emits a `RequestEvent` over an in-memory `mpsc` channel via `tokio::spawn`.

- A background worker batches events and inserts them into TimescaleDB every 1s or 100 events (whichever comes first) using PostgreSQL's bulk `COPY`.

- Continuous aggregates (`hourly_cost_summary`, `daily_cost_summary`) are created and updated on schedule to pre-calculate `total_cost`, `total_saved`, and `avg_latency`.

- TimescaleDB compression policies achieve 90%+ compression on chunks older than 7 days.

- The proxy must degrade gracefully if the analytics database is unavailable.

### Estimate: 8 points

### Dependencies: Epic 1 (Proxy Engine)

### Technical Notes:

- Stack: Rust (worker), PostgreSQL/TimescaleDB.

- Write the TimescaleDB migration scripts for the hypertable `request_events` and the continuous aggregates.

- Batching must be robust to worker panics (use bounded channels).
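The flush logic described above (100 events or 1s, whichever comes first) can be sketched as follows. This is a synchronous stand-in using `std::sync::mpsc`; the real worker would use a bounded `tokio::sync::mpsc` channel and an async `COPY` insert, but the batching loop is the same shape. The `run_batcher` name and the `flush` callback are illustrative assumptions.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Drain events from `rx`, flushing a batch whenever it reaches 100
/// events or the 1-second timer fires, whichever comes first.
fn run_batcher<T>(rx: Receiver<T>, mut flush: impl FnMut(Vec<T>)) {
    let mut batch = Vec::with_capacity(100);
    loop {
        match rx.recv_timeout(Duration::from_secs(1)) {
            Ok(event) => {
                batch.push(event);
                if batch.len() >= 100 {
                    // Size threshold hit: flush and start a fresh batch.
                    flush(std::mem::take(&mut batch));
                }
            }
            // Timer fired with a partial batch: flush whatever we have.
            Err(RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    flush(std::mem::take(&mut batch));
                }
            }
            // All producers dropped: final flush, then exit cleanly.
            Err(RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() {
                    flush(batch);
                }
                return;
            }
        }
    }
}
```

Because the channel is bounded in the real system, a slow or unavailable database applies backpressure (or drops telemetry) instead of blowing up proxy memory, which is how the "degrade gracefully" criterion is met.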
---

## Epic 4: Dashboard API

**Description:** Axum REST API providing authentication, org/team management, routing rule CRUD, and data endpoints for the frontend dashboard. Focuses on frictionless developer onboarding.

### User Stories

- **Story 4.1:** As an engineering manager, I want to authenticate via GitHub OAuth, so that I can create an organization and get an API key in under 60 seconds without remembering a password.

- **Story 4.2:** As an engineering manager, I want to manage my organization's routing rules and provider API keys securely, so that dd0c/route can successfully broker requests to OpenAI and Anthropic.

- **Story 4.3:** As an engineering manager, I want an endpoint that provides my historical spend and savings summary, so that I can visualize it in the UI.

- **Story 4.4:** As a platform engineer, I want to revoke an active API key, so that compromised credentials are immediately blocked.

### Acceptance Criteria

- Implements the `/api/auth/github` OAuth flow, issuing JWTs and refresh tokens.

- Implements `/api/orgs` CRUD for managing an organization and API keys.

- Implements `/api/dashboard/summary` and `/api/dashboard/treemap` queries hitting the TimescaleDB continuous aggregates.

- Implements `/api/requests` for the request inspector with filters (e.g., `model`, `feature`, `team`).

- Securely stores and encrypts provider API keys in PostgreSQL using an AES-256-GCM Data Encryption Key.

- Enforces an RBAC model (Owner, Member) per organization.

### Estimate: 13 points

### Dependencies: Epic 3 (Analytics Pipeline)

### Technical Notes:

- Stack: Rust (`axum`), PostgreSQL.

- Share a single `tokio` runtime across services to minimize context switches and keep the operational surface small for a solo founder.

- Use the `oauth2` crate for GitHub integration. JWTs are signed with RS256; refresh tokens live in Redis.

- Ensure API keys are hashed (SHA-256) before storage; raw keys are never stored.
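The Owner/Member RBAC criterion above can be captured in a small, exhaustively-matched policy function. The role names come from the acceptance criteria; the `Action` variants and the exact permission split (Members read-only) are assumptions for illustration, not a confirmed dd0c policy.

```rust
/// Per-org roles, matching the acceptance criteria.
enum Role {
    Owner,
    Member,
}

/// Actions exposed by the Dashboard API (illustrative subset).
enum Action {
    ReadDashboard,
    ManageRoutingRules,
    ManageProviderKeys,
    RevokeApiKey,
}

/// Owners can do everything; in this sketch, Members get read-only
/// dashboard access. An exhaustive match means adding a new Action
/// variant forces a conscious permission decision at compile time.
fn is_allowed(role: Role, action: &Action) -> bool {
    match (role, action) {
        (Role::Owner, _) => true,
        (Role::Member, Action::ReadDashboard) => true,
        (Role::Member, _) => false,
    }
}
```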
---

## Epic 5: Dashboard UI

**Description:** The React SPA serving the cost attribution dashboard. Visualizes the AI spend treemap, routing rules editor, real-time ticker, and request inspector. This is the product's primary visual "Aha" moment.

### User Stories

- **Story 5.1:** As an engineering manager, I want to see a treemap of my organization's AI spend broken down by team, feature, and model, so that I can instantly identify the most expensive areas of my application.

- **Story 5.2:** As an engineering manager, I want a real-time counter showing "You saved $X this week," so that I feel confident the tool is paying for itself.

- **Story 5.3:** As a platform engineer, I want an interface to configure routing rules (e.g., drag-to-reorder priority), so that I can instruct the proxy without editing config files.

- **Story 5.4:** As a platform engineer, I want a request inspector that displays metadata, cost, latency, and the specific routing decision for every request, so that I can debug why a certain model was chosen.

### Acceptance Criteria

- React + Vite SPA deployed as static assets to S3 + CloudFront.

- Treemap visualization renders cost aggregations dynamically over selected time periods (7d/30d/90d).

- A routing rules editor allows CRUD operations and priority reordering for a team's rules.

- Request Inspector table displays paginated, filterable (`feature`, `team`, `status`) lists of telemetry without showing prompt content.

- Allows an admin to securely input OpenAI and Anthropic API keys.

### Estimate: 13 points

### Dependencies: Epic 4 (Dashboard API)

### Technical Notes:

- Stack: React, TypeScript, Vite, Tailwind CSS.

- No SSR required for V1 (keep it simple). Use `react-query` or similar for data fetching and caching.

- Build the treemap with a charting library like D3 or Recharts.

- Emphasize speed: data fetches should resolve from continuous aggregates in <200ms.
---

## Epic 6: Shadow Audit CLI

**Description:** The PLG "Shadow Audit" command-line tool (`npx dd0c-scan`). It analyzes a local codebase for LLM API calls, estimates monthly cost based on prompt templates, and projects savings with dd0c/route.

### User Stories

- **Story 6.1:** As a developer, I want a zero-setup CLI tool (`npx dd0c-scan`) that scans my codebase and estimates how much money I'm currently wasting on overqualified LLMs, so that I can convince my manager to use dd0c/route.

- **Story 6.2:** As an engineering manager, I want the CLI to run locally without sending my source code to a third party, so that I can securely audit my own projects.

- **Story 6.3:** As an engineering manager, I want a clean, visually appealing terminal report showing "Top Opportunities" for model downgrades, so that I immediately see the value of routing.

### Acceptance Criteria

- Parses a local directory for OpenAI or Anthropic SDK usage in TypeScript/JavaScript/Python files.

- Identifies the models requested in the code and estimates token usage heuristically based on the strings passed to the SDK.

- Hits `/api/v1/pricing/current` to fetch the latest cost tables and calculates an estimated monthly bill and projected savings.

- Outputs a formatted terminal report showing total potential savings and a breakdown of the highest-impact files.

- An anonymized scan summary is sent to the server only if the user explicitly opts in.

### Estimate: 8 points

### Dependencies: Epic 4 (Dashboard API - Pricing Endpoint)

### Technical Notes:

- Stack: Node.js, `commander`, `chalk`, simple regex parsers for Python/JS SDKs.

- Keep the CLI lightweight, fast, and as dependency-free as possible. No actual LLM parsing; use heuristics (string length/structure) for token estimates.

- Must run completely offline if the pricing table is cached.
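The "string length/structure" heuristic boils down to simple arithmetic per call site. The CLI itself is Node.js, but the estimation logic is language-agnostic; it is sketched here in Rust for consistency with the rest of the stack. The ~4 chars/token ratio, the parameter names, and the per-million-token pricing shape are all assumptions for illustration.

```rust
/// Heuristic monthly cost estimate for one call site found by the scanner.
/// `prompt_chars` is the length of the template string passed to the SDK;
/// call volume and completion length must be guessed or supplied by the user.
fn estimate_monthly_cost(
    prompt_chars: usize,
    est_completion_tokens: usize,
    calls_per_month: u64,
    input_price_per_mtok: f64,  // USD per 1M input tokens
    output_price_per_mtok: f64, // USD per 1M output tokens
) -> f64 {
    // ~4 characters per token is a common rough ratio for English text.
    let input_tokens = (prompt_chars / 4) as f64;
    let cost_per_call = input_tokens * input_price_per_mtok / 1_000_000.0
        + est_completion_tokens as f64 * output_price_per_mtok / 1_000_000.0;
    cost_per_call * calls_per_month as f64
}
```

Projected savings then falls out by re-running the same estimate with a cheaper model's prices and diffing the two totals.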
---

## Epic 7: Slack Integration

**Description:** The primary retention mechanism and anomaly alerting system. An asynchronous worker task dispatches weekly savings digests and threshold-based budget alerts to Slack and Email.

### User Stories

- **Story 7.1:** As an engineering manager, I want an automated weekly digest summarizing my team's AI savings, so that I can easily report to the CFO that our tooling investment is paying off.

- **Story 7.2:** As a platform engineer, I want to configure a budget limit (e.g., alert if daily spend > $100) and receive a Slack webhook notification immediately, so that I can stop a retry storm before the bill gets out of hand.

- **Story 7.3:** As an engineering manager, I want an email version of the weekly digest, so that I can forward it straight to my leadership team.

### Acceptance Criteria

- A standalone asynchronous worker (`dd0c-worker`) evaluates continuous aggregates (via TimescaleDB) every hour.

- Generates a "Monday Morning Digest" email via AWS SES.

- Emits Slack webhook payloads when a threshold alert is triggered (`threshold_amount`, `threshold_pct`).

- Adds an `X-DD0C-Signature` header to outbound webhooks to prevent spoofing.

### Estimate: 8 points

### Dependencies: Epic 3 (Analytics Pipeline), Epic 4 (Dashboard API)

### Technical Notes:

- Stack: Rust (`tokio-cron`), `reqwest` (for webhooks), AWS SES.

- Worker is a singleton container (1 task) running alongside the proxy to avoid lock contention on cron tasks.

- Ensure alerts maintain state (using PostgreSQL `alert_configs` and `last_fired_at`) so users aren't spammed for the same incident.
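The `last_fired_at` dedup logic above reduces to one decision function, sketched here with plain epoch-second timestamps. The function name and the cooldown parameter are illustrative; in the worker, `last_fired_at` would be read from and written back to the `alert_configs` row.

```rust
/// Decide whether a threshold alert should fire now.
/// `last_fired_at` is the epoch-seconds timestamp of the previous firing
/// (None if this alert has never fired).
fn should_fire(
    current_spend: f64,
    threshold: f64,
    last_fired_at: Option<u64>,
    now: u64,
    cooldown_secs: u64,
) -> bool {
    // Not over budget: nothing to do.
    if current_spend <= threshold {
        return false;
    }
    match last_fired_at {
        // Never fired before: fire immediately.
        None => true,
        // Fired recently: suppress, so one incident means one alert.
        Some(t) => now.saturating_sub(t) >= cooldown_secs,
    }
}
```

On a `true` result the worker sends the Slack payload and updates `last_fired_at = now` in the same transaction, so a crash between the two cannot cause duplicate alerts.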
---

## Epic 8: Infrastructure & DevOps

**Description:** Containerized ECS Fargate deployment, AWS native networking, basic monitoring, and fully automated CI/CD for the entire dd0c stack. Essential for a solo founder to deploy safely and frequently.

### User Stories

- **Story 8.1:** As a solo founder, I want to use AWS ECS Fargate, so that I don't have to manage EC2 instances or worry about OS-level patching.

- **Story 8.2:** As a solo founder, I want a GitHub Actions CI/CD pipeline, so that `git push` automatically runs tests, builds containers, and deploys rolling updates with zero downtime.

- **Story 8.3:** As an operator, I want standard AWS CloudWatch alarms (e.g., P99 proxy latency > 50ms) connected to PagerDuty, so that I am only woken up when a critical threshold is breached.

- **Story 8.4:** As a solo founder, I want a strict separation between my configuration (PostgreSQL) and telemetry (TimescaleDB) stores, so that I can scale analytics independently from org/auth state.

### Acceptance Criteria

- Full AWS infrastructure defined via CDK (TypeScript) or Terraform.

- ALB routes `/v1/*` to the proxy container and `/api/*` to the dashboard API container.

- Dashboard static assets deployed to an S3 bucket with CloudFront caching.

- `docker build` produces three optimized images from a single Rust workspace (`dd0c-proxy`, `dd0c-api`, `dd0c-worker`).

- CloudWatch dashboards and minimum alarms configured (CPU >80%, Proxy Error Rate >5%, ALB 5xx Rate).

- `git push main` triggers a GitHub Action to test, lint, build, push to ECR, and update the ECS Fargate services.

### Estimate: 13 points

### Dependencies: Epic 1 (Proxy Engine), Epic 4 (Dashboard API)

### Technical Notes:

- Stack: AWS ECS Fargate, ALB, CloudFront, S3, RDS (PostgreSQL/TimescaleDB), ElastiCache (Redis), GitHub Actions.

- Ensure the ALB utilizes path-based routing correctly and handles TLS termination.

- For cost optimization on AWS, explore consolidating NAT Gateways or utilizing VPC Endpoints for S3/ECR/CloudWatch.
---

## Epic 9: Onboarding & PLG

**Description:** Self-serve signup, free tier, API key management, and a getting-started flow that gets users routing their first LLM call through dd0c/route in under 2 minutes. This is the growth engine.

### User Stories

- **Story 9.1:** As a new user, I want to sign up with GitHub OAuth in one click, so that I can start using dd0c/route without filling out forms.

- **Story 9.2:** As a new user, I want a free tier (up to $50/month in routed LLM spend), so that I can evaluate the product with real traffic before committing.

- **Story 9.3:** As a developer, I want to generate and manage API keys from the dashboard, so that I can integrate dd0c/route into my applications.

- **Story 9.4:** As a new user, I want a guided "First Route" onboarding flow that gives me a working curl command, so that I see cost savings within 2 minutes of signing up.

- **Story 9.5:** As a team lead, I want to invite team members via email, so that my team can share a single org and see aggregated savings.

### Acceptance Criteria

- GitHub OAuth signup creates the org + first API key automatically.

- Free tier enforced at the proxy level: requests beyond $50/month of routed spend return a 429 with an upgrade CTA.

- API key CRUD: create, list, revoke, rotate. Keys are hashed at rest (SHA-256, consistent with Epic 4) and only shown once, on creation.

- Onboarding wizard in 3 steps: (1) copy API key, (2) paste the curl command, (3) see the first request in the dashboard. Completion rate is tracked.

- Team invite sends an email with a magic link. The invited user joins the existing org on signup.

- Stripe Checkout integration for upgrading from free to paid ($49/month base).

### Estimate: 8 points

### Dependencies: Epic 4 (Dashboard API), Epic 5 (Dashboard UI)

### Technical Notes:

- Use Stripe Checkout Sessions for payment; no custom billing UI needed for V1.

- Free tier enforcement happens in the proxy hot path, so it must be an O(1) lookup (Redis counter per org, reset monthly via cron).

- Onboarding completion events tracked via PostHog or simple DB events for funnel analysis.

- Magic link invites use signed JWTs with 72-hour expiry, stored in a `pending_invites` table.
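The O(1) free-tier check can be sketched as a running spend counter per org. A `HashMap` stands in for Redis here (in production this would be an `INCRBYFLOAT` on a per-org key that a monthly cron resets); the struct and method names are illustrative.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the Redis-backed per-org spend counter.
struct SpendCounter {
    routed_spend_by_org: HashMap<String, f64>,
    free_tier_limit: f64, // e.g. $50/month of routed LLM spend
}

impl SpendCounter {
    /// Record a request's cost and report whether the org is still in quota.
    /// A `false` return maps to a 429 with an upgrade CTA at the proxy.
    fn record(&mut self, org: &str, cost: f64) -> bool {
        let spend = self
            .routed_spend_by_org
            .entry(org.to_string())
            .or_insert(0.0);
        *spend += cost;
        *spend <= self.free_tier_limit
    }
}
```

Incrementing before checking means the request that crosses the limit still completes; only subsequent requests are rejected, which is the friendlier behavior for an evaluation tier.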
---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/route adheres to the 5 Transparent Factory architectural tenets: Atomic Flagging, Elastic Schema, Cognitive Durability, Semantic Observability, and Configurable Autonomy. These stories are woven across the existing system — they don't add features, they add engineering discipline.

### Story 10.1: Atomic Flagging — Feature Flag Infrastructure

**As a** solo founder, **I want** every new routing rule, cost threshold, and provider failover behavior wrapped in a feature flag (default: off), **so that** I can deploy code continuously without risking production traffic.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the Rust proxy via a compatible provider (e.g., `flagd` sidecar or env-based provider for V1).

- All flags evaluate locally (in-memory or sidecar) — zero network calls on the hot path.

- Every flag has an `owner` field and a `ttl` (max 14 days). CI blocks deployment if any flag exceeds its TTL at 100% rollout.

- Automated circuit breaker: if a flagged code path increases P99 latency by >5% or error rate by >2%, the flag auto-disables within 30 seconds.

- Flags exist for: model routing strategies, complexity classifier thresholds, provider failover chains, new dashboard features.

**Estimate:** 5 points

**Dependencies:** Epic 1 (Proxy Engine), Epic 2 (Router Brain)

**Technical Notes:**

- Use the OpenFeature Rust SDK. For V1, a simple JSON file or env-var provider is fine — no LaunchDarkly needed.

- Circuit breaker integration: extend the existing Redis-backed circuit breaker to also flip flags.

- Flag cleanup: add a `make flag-audit` target that lists expired flags.
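The `flag-audit` check is a simple filter over flag metadata. The `Flag` struct below is an illustrative stand-in for whatever the flag provider stores; timestamps are plain epoch seconds.

```rust
/// Minimal flag metadata needed for the TTL audit.
struct Flag {
    name: &'static str,
    owner: &'static str, // every flag must have an owner
    created_at: u64,     // epoch seconds
    ttl_secs: u64,       // max 14 days per the acceptance criteria
}

/// List flags that have outlived their TTL; CI fails if this is non-empty
/// for any flag already at 100% rollout.
fn expired_flags(flags: &[Flag], now: u64) -> Vec<&'static str> {
    flags
        .iter()
        .filter(|f| now.saturating_sub(f.created_at) > f.ttl_secs)
        .map(|f| f.name)
        .collect()
}
```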
### Story 10.2: Elastic Schema — Additive-Only Migration Discipline

**As a** solo founder, **I want** all TimescaleDB and Redis schema changes to be strictly additive, **so that** I can roll back any deployment instantly without data loss or broken readers.

**Acceptance Criteria:**

- CI lint step rejects any migration containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns.

- New fields use a `_v2` suffix or a new table when breaking changes are unavoidable.

- All Rust structs rely on serde's default of ignoring unknown fields (i.e., never add `#[serde(deny_unknown_fields)]`) so V1 code tolerates V2 fields.

- Dual-write pattern documented and enforced: during migration windows, the API writes to both old and new schema targets within the same DB transaction.

- Every migration file includes a `sunset_date` comment (max 30 days out). A CI check warns if any migration is past sunset without cleanup.

**Estimate:** 3 points

**Dependencies:** Epic 3 (Analytics Pipeline)

**Technical Notes:**

- Use `sqlx` migration files. Add a pre-commit hook or CI step that greps for forbidden DDL keywords.

- Redis key schema: version keys with a prefix (e.g., `route:v1:config`, `route:v2:config`). Never rename keys.

- For the `request_events` hypertable, new columns are always `NULLABLE` with defaults.
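The forbidden-DDL lint can be sketched as a substring check over each migration file. This is deliberately crude (uppercased substring search, as a grep would be); a production check might tokenize the SQL to avoid false positives in comments. The function name is illustrative.

```rust
/// Return the destructive DDL patterns found in a migration file.
/// CI fails the build if this is non-empty.
fn forbidden_ddl(migration_sql: &str) -> Vec<&'static str> {
    let upper = migration_sql.to_uppercase();
    let mut hits = Vec::new();
    if upper.contains("DROP ") {
        hits.push("DROP");
    }
    // `ALTER ... TYPE` rewrites a column's type in place — not additive.
    if upper.contains("ALTER ") && upper.contains(" TYPE ") {
        hits.push("ALTER ... TYPE");
    }
    if upper.contains("RENAME ") {
        hits.push("RENAME");
    }
    hits
}
```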
### Story 10.3: Cognitive Durability — Decision Logs for Routing Logic

**As a** future maintainer (or future me), **I want** every change to routing algorithms, cost models, or provider selection logic accompanied by a `decision_log.json`, **so that** I can understand *why* a decision was made months later in under 60 seconds.

**Acceptance Criteria:**

- `decision_log.json` schema defined: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.

- CI requires a `decision_log.json` entry for any PR touching `src/router/`, `src/cost/`, or migration files.

- Cyclomatic complexity cap of 10 enforced via `cargo clippy` or a custom lint. PRs exceeding this are blocked.

- Decision logs are committed alongside code in a `docs/decisions/` directory, one file per significant change.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- Use a PR template that prompts for the decision log fields.

- For the complexity cap, enable the `clippy::cognitive_complexity` lint (it lives in clippy's nursery group) and set `cognitive-complexity-threshold = 10` in `clippy.toml`.

- Decision logs for cost table updates should include: source of pricing data, comparison with previous rates, expected savings impact.
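A hypothetical entry matching the schema above, with invented but representative values, might look like this:

```json
{
  "prompt": "Should the cascading strategy retry the same provider once on a 429 before falling back?",
  "reasoning": "A single retry avoids needlessly downgrading quality on transient rate limits; a second 429 means the provider is saturated and fallback is correct.",
  "alternatives_considered": [
    "Immediate fallback on first 429",
    "Exponential backoff with up to 3 retries"
  ],
  "confidence": 0.8,
  "timestamp": "2026-02-28T10:00:00Z",
  "author": "brian"
}
```

One such file per significant change in `docs/decisions/` keeps the rationale greppable without any extra tooling.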
### Story 10.4: Semantic Observability — AI Reasoning Spans on Routing Decisions

**As a** platform engineer debugging a misrouted request, **I want** every proxy routing decision to emit an OpenTelemetry span with structured AI reasoning metadata, **so that** I can trace exactly which model was chosen, why, and what alternatives were rejected.

**Acceptance Criteria:**

- Every `/v1/chat/completions` request generates an `ai_routing_decision` span as a child of the request trace.

- Span attributes include: `ai.model_selected`, `ai.model_alternatives` (JSON array of rejected models + reasons), `ai.cost_delta` (savings vs. default), `ai.complexity_score`, `ai.routing_strategy` (passthrough/cheapest/quality-first/cascading).

- `ai.prompt_hash` (SHA-256 of the first 500 chars of the system prompt) is included for correlation — never raw prompt content.

- Spans export to any OTLP-compatible backend (Grafana Cloud, Jaeger, etc.).

- No PII in any span attribute. Prompt content is hashed, not logged.

**Estimate:** 3 points

**Dependencies:** Epic 1 (Proxy Engine), Epic 2 (Router Brain)

**Technical Notes:**

- Use `tracing` + the `opentelemetry-rust` crate with the OTLP exporter.

- The span should be created *inside* the router decision function, not as middleware — it needs access to the alternatives list.

- For V1, export to stdout in OTLP JSON format. Production: OTLP gRPC to a collector.
### Story 10.5: Configurable Autonomy — Governance Policy for Automated Routing

**As a** solo founder, **I want** a `policy.json` governance file that controls what the system is allowed to do autonomously (e.g., switch models, update cost tables, add providers), **so that** I maintain human oversight as the system grows.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (all changes require manual approval) or `audit` (changes auto-apply but are logged).

- The proxy checks `governance_mode` before applying any runtime config change (routing rule update, cost table refresh, provider addition).

- `panic_mode` flag: when set to `true`, the proxy freezes all routing rules to their last-known-good state, disables auto-failover, and routes everything to a single hardcoded provider.

- Governance drift monitoring: a weekly cron job logs the ratio of auto-applied vs. manually-approved changes. If auto-applied changes exceed 80% while in `audit` mode, an alert fires (in `strict` mode, any auto-applied change is itself an anomaly).

- All policy check decisions logged: "Allowed by audit mode", "Blocked by strict mode", "Panic mode active — frozen".

**Estimate:** 3 points

**Dependencies:** Epic 2 (Router Brain)

**Technical Notes:**

- `policy.json` lives in the repo root and is loaded at startup + watched for changes via the `notify` crate.

- For V1 as a solo founder, start in `audit` mode. `strict` mode is for when you hire or add AI agents to the pipeline.

- Panic mode should be triggerable via a single API call (`POST /admin/panic`) or by setting an env var — whichever is faster in an emergency.
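The policy gate consulted before any runtime config change reduces to a small decision function. The enum and field names mirror `policy.json`; the `check_change` signature and the log-line format are illustrative assumptions.

```rust
/// Mirrors `governance_mode` in policy.json.
enum GovernanceMode {
    Strict, // all changes require manual approval
    Audit,  // changes auto-apply but are logged
}

/// Mirrors the relevant policy.json fields.
struct Policy {
    governance_mode: GovernanceMode,
    panic_mode: bool,
}

/// Returns whether the change may auto-apply, plus the audit log line.
/// Panic mode overrides everything: all changes are frozen.
fn check_change(policy: &Policy, change: &str) -> (bool, String) {
    if policy.panic_mode {
        return (false, format!("{change}: Panic mode active — frozen"));
    }
    match policy.governance_mode {
        GovernanceMode::Audit => (true, format!("{change}: Allowed by audit mode")),
        GovernanceMode::Strict => (false, format!("{change}: Blocked by strict mode")),
    }
}
```

Keeping the gate this small makes the governance-drift cron trivial: it just counts the two log-line variants per week.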
### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |
1122
products/01-llm-cost-router/innovation-strategy/session.md
Normal file
File diff suppressed because it is too large
121
products/01-llm-cost-router/party-mode/session.md
Normal file
@@ -0,0 +1,121 @@
# 🎉 dd0c/route — Advisory Board "Party Mode" Review

**Date:** February 28, 2026
**Product Under Review:** dd0c/route — LLM Cost Router & Optimization Dashboard
**Format:** BMad Creative Intelligence Suite — "Party Mode" (5 Expert Panelists)
---

## Round 1: INDIVIDUAL REVIEWS

### 💸 The VC

**What excites me:** The market math is undeniable. Inference is eating the world, and companies are bleeding cash because developers are lazy and just use `gpt-4o` for everything. The wedge—changing one base URL—is brilliant. If the "Shadow Audit" can actually show a CFO they're wasting $10K a month before they even adopt the tool, that's a PLG motion that prints money.

**What worries me:** You have zero structural moat on day one, and you're competing against the hyperscalers' own roadmaps. Why won't OpenAI just release "Smart Tier" routing tomorrow? Why won't AWS Bedrock bake this into their console? Plus, LiteLLM already has the open-source community mindshare. You're telling me a solo bootstrapped founder is going to out-execute YC-backed teams and hyperscalers based on a "data flywheel" that takes 12 months to spin up?

**Vote: CONDITIONAL GO.** You need to prove the data network effect actually exists. If you can't show that your routing gets demonstrably better with scale by month 6, you're just a wrapper that's waiting to get sherlocked.
### 🏗️ The CTO

**What excites me:** The architectural discipline. Targeting <10ms latency in Rust and integrating with OpenTelemetry right out of the gate shows you actually understand production environments. The fact that you're treating LLM calls like standard infrastructure that needs circuit breakers and fallback chains is exactly how grown-up engineering teams think.

**What worries me:** Trust and scale. You're asking me to take my company's most sensitive data—customer prompts, PII, proprietary business logic—and pipe it through a side-project proxy run by one guy? Absolutely not. Even if you say you don't log the prompts, my compliance team will laugh you out of the room. If this proxy goes down, my entire AI product goes down.

**Vote: CONDITIONAL GO.** The V1 SaaS-only proxy is a non-starter for serious teams. You must offer a VPC-deployable data plane (where you only phone home the telemetry to your SaaS dashboard) from Day 1. If you do that, I'm in.
### 🚀 The Bootstrap Founder

**What excites me:** The pricing and the GTM. $49/month is the magic number—it's an expense report, not a procurement cycle. The "Weekly Savings Digest" is an incredible retention hook. If you actually save a team $500/month, they will never churn. 200 customers gets you to $10K MRR. I've built businesses on much flimsier value props than "I will literally hand you back your own money."

**What worries me:** Scope creep. Victor's strategy doc tells you to build a Rust proxy, a ClickHouse analytics dashboard, a CLI tool, and write weekly thought-leadership content. Bro, you have a day job. You're going to burn out in month two. You cannot fight LiteLLM on features while fighting Portkey on enterprise dashboards as a solo founder.

**Vote: GO.** But only if you aggressively cut scope. Drop the CLI. Drop the ML classifier for V1. Use dumb heuristics. Launch the proxy and the dashboard in 30 days. Get to $1K MRR before you write a single line of ML code.
### 📟 The DevOps Practitioner

**What excites me:** The "Boring Proxy" concept. I am so tired of my team maintaining a janky Node.js script to balance OpenAI rate limits. A drop-in replacement that handles fallbacks, retries, and gives me standard Prometheus/OTel metrics is a dream.

**What worries me:** The maintenance nightmare. OpenAI changes their API schema. Anthropic introduces a new prompt caching header. Google deprecates a model. You, a solo dev, have to patch the proxy within hours or my production traffic breaks. I'm taking a hard dependency on your weekend availability. That's a massive operational risk I don't want to absorb.

**Vote: NO-GO.** LiteLLM has hundreds of contributors fixing API breakages the second they happen. A solo founder cannot keep up with the churn of LLM provider APIs without sacrificing reliability. I wouldn't switch to this.
### 🃏 The Contrarian

**What excites me:** Everyone is entirely focused on the "routing" aspect, but that's actually the least interesting part. The real genius here is the attribution treemap. Nobody cares about saving $400 on API calls if the dashboard can solve the internal political problem of "who is spending all this AI budget?" This is a FinOps tool disguised as a router.

**What worries me:** The core premise is that LLMs will remain expensive enough to care about. They won't. Prices dropped 90% last year. They'll drop another 90%. When 1M tokens cost a penny, nobody will pay $49/month to optimize it. You're building a highly sophisticated coupon-clipper for a world that's moving toward post-scarcity intelligence.

**Vote: GO (BUT PIVOT).** Ditch the proxy entirely. Forget routing. Just ingest existing logs and give Marcus his CFO slide deck. Charge $99/mo for pure AI cost attribution. It removes the latency risk, removes the "proxy going down" risk, and solves the real human pain point (looking stupid in a budget meeting).
|
||||||
|
|
||||||
|
---

### Round 2: CROSS-EXAMINATION

**The VC:** Bootstrap, you're delusional if you think 200 customers at $49/mo is a defensible business in this space. If Datadog or Helicone turns this on for free, your 200 customers churn overnight. You need enterprise contracts to survive the hyperscaler onslaught.

**The Bootstrap Founder:** VC, that's why you lose money on 99% of your bets. I don't need a $1B exit. $10K MRR pays the mortgage. Datadog charges $23/host minimum; they aren't giving LLM cost tracking away for free. And enterprise deals take 9 months to close. Brian needs cash flow on day 30, not a 60-page vendor security questionnaire next year.

**The CTO:** Contrarian, your "post-scarcity intelligence" theory is cute, but mathematically illiterate. Yes, per-token prices drop, but usage is exploding. Agentic workflows use 100x the tokens of standard RAG. The bill isn't going away, it's just shifting from "expensive models" to "massive volume." The routing still matters tremendously.

**The Contrarian:** If volume explodes 100x, CTO, then latency matters 100x more. Do you really think teams will add a third-party Rust proxy hop to an agentic loop running 50 times a second? They'll build the routing logic directly into their clients. The proxy is a dead end architecture for high-volume agents.

**The DevOps Practitioner:** Exactly. The proxy is a dead end because the second Anthropic changes their API schema on a Friday night, Brian is asleep, and my agentic loop is throwing 500s. I'm the one getting paged at 2 AM.

**The Bootstrap Founder:** DevOps, you're projecting your own operational PTSD onto the product. Brian isn't supporting 1,600 models like Portkey. He's supporting OpenAI and Anthropic. Two APIs. They don't make breaking schema changes every Friday. It's totally manageable.

**The DevOps Practitioner:** They literally just introduced prompt caching headers, structured outputs, and vision payloads in the last 6 months! If the proxy doesn't support the new feature on day one, my ML engineers scream at me that they can't use the new toy. I am not putting a bottleneck in front of the fastest-moving APIs in tech.

**The VC:** DevOps is right. The maintenance overhead is brutal. This is why I said there's no moat. You're building a feature, not a platform. The moment OpenAI releases "Smart Tier" routing, you have zero differentiation. Why won't Sam Altman just eat your lunch?

**The Contrarian:** VC, you're missing the point again. Sam Altman doesn't care about attribution by feature, team, and environment. He just wants your total API spend to go up. OpenAI's dashboard will never tell you "Team Backend wasted $400 on the summarizer." That's why I say ditch the proxy and just do the analytics. You own the FinOps dashboard, not the pipe.

**The CTO:** Ditching the proxy kills the "Shadow Audit" and the real-time cost prevention, Contrarian. If you only do analytics, you're just looking in the rearview mirror. You're telling Marcus he crashed the car *after* the bill arrives. The proxy is what stops the bleeding *before* the invoice hits.

**The Bootstrap Founder:** CTO hits the nail on the head. The value prop is "Change this one URL and stop bleeding cash today." You can't do that with a log parser. You need the proxy. Brian just needs to self-host the data plane so DevOps stops hyperventilating about PII and latency.

**The DevOps Practitioner:** I'll stop hyperventilating when Brian provides a certified Helm chart, an OTel collector, and a signed SLA. Until then, it's a weekend toy masquerading as infrastructure.
---

### Round 3: STRESS TEST

#### Stress Test 1: What if OpenAI drops prices 90% tomorrow?

**The Contrarian:** "This is inevitable. GPT-4o-mini is already basically free. When GPT-5 drops, the older models will go to zero. If the delta between the 'expensive' model and the 'cheap' model is pennies, nobody is paying you $49 to route it."

- **Severity (1-10):** 8/10. It fundamentally breaks the core value prop of "save $500/month."
- **Can Brian pivot?** Yes. If per-token cost is negligible, the pain point shifts to *token efficiency* (context window stuffing, prompt bloat, latency).
- **Mitigation:** Shift the positioning from "we pick the cheapest model" to "we optimize your payload." Build semantic caching, prompt compression, and attribution. The dashboard tracking "Who is running these 100K token prompts?" remains valuable even if the tokens are cheap, because a 100K-token prompt still adds massive latency.

#### Stress Test 2: What if a well-funded competitor (Helicone, Portkey) copies the exact feature set?

**The VC:** "Helicone has YC money and mindshare. Portkey has $3M. If you prove the 'Shadow Audit' and the 'Attribution Treemap' work, they will build them in two weeks with a team of 10 engineers."

- **Severity (1-10):** 6/10. It's a risk, but incumbents usually move slower than expected and over-complicate simple features.
- **Can Brian pivot?** Yes. Move down-market. Portkey targets enterprise; Helicone targets observability power users. Brian can own the "Series A Bootstrap" niche.
- **Mitigation:** Rely on the unique GTM. Brian's advantage isn't the feature, it's the lack of friction. If Portkey builds a treemap but still requires a 30-minute sales call to get an API key, Brian wins the developer who just wants to run `npx dd0c-scan` on a Saturday night. Double down on PLG and community trust.

#### Stress Test 3: What if enterprises won't trust a proxy with their API keys and prompts?

**The CTO:** "I've said it already. You cannot send PII through a third-party startup. It's an automatic hard-stop from Infosec."

- **Severity (1-10):** 9/10. This is the single biggest adoption blocker. The "change your base URL" trick only works if you aren't violating GDPR, HIPAA, or SOC 2.
- **Can Brian pivot?** Yes, by splitting the control plane and data plane.
- **Mitigation:** For V1, accept that you won't close enterprise or healthcare customers. Stick to the beachhead: small SaaS startups who don't have compliance teams yet. For V2, build a self-hosted data plane. The Rust proxy runs inside the customer's VPC, strips the prompt payloads, and only sends telemetry (latency, token counts, cost) back to the dd0c SaaS dashboard.
---

### Round 4: FINAL VERDICT

**The Panel Decision:** A SPLIT DECISION leaning heavily toward **GO (WITH MAJOR SCOPE REDUCTIONS)**.

**Reasoning:**

- **The VC (Go):** The market timing is perfect. AI FinOps is the next category.
- **The Bootstrap Founder (Go):** It's a textbook SaaS play with a clear wedge.
- **The Contrarian (Go-Pivot):** The value is in attribution, not routing.
- **The CTO (Conditional Go):** Only if there's a clear path to self-hosted proxies.
- **The DevOps Practitioner (No-Go):** The maintenance overhead of third-party API churn is a trap for a solo dev.

**Revised Priority Ranking:**

This is still **#1 in the dd0c lineup**, assuming Brian focuses on the PLG "Shadow Audit" and simple attribution dashboard rather than trying to build a complex ML routing classifier immediately. It solves a real problem *today* for people with budget authority.

**Top 3 Things Brian MUST Get Right:**

1. **The "Shadow Audit" Wedge:** The CLI that proves they are wasting $1,000s *before* asking them to change their base URL. It's the ultimate sales tool.
2. **The 5-Minute Setup:** Changing the base URL and adding an API key must be flawless. If it takes 2 hours to configure YAML, the PLG motion dies.
3. **The "Weekly Savings Digest":** Marcus needs an email every Monday showing his CFO why the product is paying for itself. This is the only retention moat in Year 1.

**The ONE Thing That Kills This If Wrong:**

**Adding latency or breaking APIs.** If the Rust proxy adds 100ms instead of 10ms, or if an Anthropic API change breaks production workloads for 12 hours while Brian is at his day job, trust is permanently destroyed. Infrastructure must be invisible.

**Final Recommendation:**

**GO (V1: Analytics First, Router Second).**

Brian should build the "Boring Proxy" in Rust (optimizing purely for latency and reliability), but rely heavily on *heuristics* for routing initially. The primary marketing and retention lever should be the **Dashboard Attribution** and the **Shadow Audit**. Launch the cost scan CLI in 30 days to validate demand before building the ML classifier.

437
products/01-llm-cost-router/product-brief/brief.md
Normal file

# dd0c/route — Product Brief

**Product:** dd0c/route — LLM Cost Router & Optimization Dashboard
**Brand:** 0xDD0C — "All signal. Zero chaos."
**Author:** Product Brief synthesized from BMad Creative Intelligence Suite (Phases 1–4)
**Date:** February 28, 2026
**Status:** Investor-Ready Draft
---

## 1. EXECUTIVE SUMMARY

### Elevator Pitch

dd0c/route is an OpenAI-compatible proxy that sits between your application and LLM providers, intelligently routing each API request to the cheapest model that meets quality requirements — saving engineering teams 30–50% on AI costs with a single environment variable change. It pairs this routing engine with an attribution dashboard that answers the question no existing tool can: "Who is spending our AI budget, on what, and is it worth it?"

### Problem Statement

Enterprise and startup LLM spending is exploding — the global LLM market is projected to reach $36.1B by 2030 (Straits Research), with inference costs representing the fastest-growing line item on engineering budgets. Yet the tooling for managing this spend is stuck in 2022:

- **60%+ of LLM API calls use overqualified models.** Teams default to GPT-4o for everything — including trivial tasks like JSON formatting, classification, and extraction — because benchmarking alternatives takes days nobody has.
- **Zero cost attribution exists at the feature level.** Engineering managers receive a single monthly invoice from OpenAI ("$14,000") with no breakdown by feature, team, or environment. Cloud cost tooling solved this for AWS a decade ago. AI cost tooling hasn't caught up.
- **Multi-provider billing is a manual nightmare.** Teams using OpenAI + Anthropic + Google get three separate bills with three different billing models. Reconciliation is a monthly spreadsheet exercise that takes 3–4 hours.
- **Cost spikes are invisible until the invoice arrives.** A retry storm, a prompt engineering experiment gone wrong, or a new feature launch can burn $3K in an hour with no alert.

The result: engineering managers present estimated pie charts to CFOs, ML engineers feel guilty about costs they can't measure, and platform engineers maintain hand-rolled proxy scripts that started as 200 lines and grew to 2,000.

### Solution Overview

dd0c/route is a drop-in proxy (change one environment variable: `OPENAI_BASE_URL`) that provides:

1. **Intelligent Routing:** A complexity classifier analyzes each request and routes it to the cheapest adequate model — GPT-4o requests that are simple extractions get silently downgraded to GPT-4o-mini or Claude Haiku, saving 80–95% per request with negligible quality impact.
2. **Cost Attribution Dashboard:** Real-time treemap visualization showing spend by team → feature → endpoint → model. The "CFO slide deck" that writes itself.
3. **Weekly Savings Digest:** An automated Monday morning email showing exactly how much dd0c/route saved, broken down by category — the retention mechanism and viral loop (managers forward it to leadership).
4. **Budget Guardrails & Anomaly Alerts:** Per-team, per-feature spending limits with Slack/PagerDuty integration. Catches the $3K retry storm before it becomes a $3K invoice.
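
The "change one environment variable" claim rests on the proxy being wire-compatible with the OpenAI API: same path, same payload, same auth header, only the host changes. A minimal sketch of that contract (the proxy hostname and the request-builder helper are illustrative assumptions, not a documented dd0c endpoint or SDK):

```typescript
// Before: OPENAI_BASE_URL=https://api.openai.com/v1
// After:  OPENAI_BASE_URL=https://proxy.dd0c.example/v1   (hypothetical host)
//
// Everything below the base URL stays OpenAI-compatible, so no client
// code changes are needed — only the environment variable.
function chatCompletionRequest(
  baseUrl: string,
  apiKey: string,
  body: { model: string; messages: { role: string; content: string }[] }
) {
  return {
    // Same path the OpenAI API uses.
    url: `${baseUrl.replace(/\/$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(body),
    },
  };
}
```

An application would pass this object straight to `fetch`; the point is that swapping `baseUrl` is the entire integration.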

### Target Customer Profile

**Primary Beachhead:** Series A–B SaaS startups with 10–50 engineers, spending $2K–$15K/month on LLM APIs, with no dedicated ML infrastructure team. The CTO or VP Engineering can approve a $49/month tool via expense report — no procurement process, no 6-month evaluation cycle.

**Why this segment:** They feel the pain acutely ($5K–$15K/month hurts but doesn't justify hiring ML ops), they're technically sophisticated enough to adopt in minutes (they understand API proxies and environment variables), and they talk to each other (startup CTOs share tools in the same Slack communities, meetups, and newsletters).

### Key Differentiators

1. **5-Minute Setup, Zero Code Changes.** Change one environment variable. No SDK migration, no code refactor, no YAML configuration marathon. The fastest time-to-value in the category.
2. **Attribution-First Design.** Competitors focus on observability (what happened). dd0c/route focuses on attribution (who spent what, on which feature, and was it worth it). The treemap dashboard is the product's signature.
3. **"Shadow Audit" Pre-Sale Wedge.** A CLI tool (`npx dd0c-scan`) and passive log analysis mode that proves savings potential *before* asking the customer to route traffic. Value before trust. Evidence before commitment.
4. **Transparent, Flat Pricing.** $49/month Pro tier — an expense-report purchase. No per-seat fees that punish adoption, no usage-based billing that recreates the unpredictability problem we're solving.
5. **Open-Source Proxy Core.** The routing engine is open-source and self-hostable. The SaaS monetizes the intelligence layer (dashboard, analytics, digest, recommendations). Trust through transparency.
---

## 2. MARKET OPPORTUNITY

### TAM / SAM / SOM

| Metric | Value | Basis |
|--------|-------|-------|
| **TAM** | $36.1B by 2030 | Global LLM market (Straits Research). Inference costs are the fastest-growing segment. |
| **SAM** | ~$5.4B | LLM API spend by companies with $1K–$100K/month bills — the segment where third-party cost optimization is viable (not too small to care, not large enough to build in-house). Estimated at ~15% of TAM. |
| **SOM (Year 1)** | $1.8M–$3.6M | 300–600 paying customers at $49–$199/month average. Achievable via PLG in the Series A–B SaaS beachhead. |

The FinOps Foundation's 2026 report identifies AI workload cost management as the #1 emerging challenge. Cloud FinOps is a mature $3B+ category; AI FinOps is its greenfield successor with no dominant player.

### Competitive Landscape

| Competitor | Positioning | Strengths | Weaknesses | dd0c/route Advantage |
|-----------|-------------|-----------|------------|---------------------|
| **LiteLLM** | Open-source LLM proxy framework | 15K+ GitHub stars, broad model support (1,600+), active community | No intelligence layer, no attribution dashboard, no SaaS product — it's a framework, not a solution | Product completeness: proxy + dashboard + digest + attribution |
| **Portkey** | Enterprise AI gateway | $3M funding, enterprise features, broad provider support | Enterprise sales motion, complex setup, overkill for small teams | 5-minute PLG setup vs. enterprise procurement cycle |
| **Helicone** | LLM observability platform | YC-backed, strong developer brand, good logging/tracing | Observability-focused (what happened), not optimization-focused (what to do). No intelligent routing. | Attribution + routing + actionable recommendations vs. passive logging |
| **Martian** | AI model router | Smart routing technology, usage-based pricing | Opaque pricing, routing-only (no dashboard/attribution), limited transparency | Transparent routing + full cost attribution dashboard |
| **OpenRouter** | Multi-model API gateway | Simple unified API, broad model access | 5% markup on all requests, no cost optimization intelligence, no attribution | Flat pricing + intelligent routing that reduces spend |

### Timing Thesis

Three converging forces make Q1 2026 the optimal launch window:

1. **The "AI in Production" Transition.** Companies are moving from experimentation to production deployment. Production AI costs are operational expenses that demand optimization tooling — creating the "tooling gap" dd0c/route fills.
2. **Multi-Model Reality.** The era of "just use OpenAI" is ending. Teams now use OpenAI + Anthropic + Google + open-source models. Multi-provider complexity creates demand for a unified routing and attribution layer.
3. **Agentic AI Volume Explosion.** Agentic workflows make 10–100x more API calls than simple chat. Even as per-token prices drop, total spend increases. The bill isn't going away — it's shifting from "expensive models" to "massive volume."

### Market Trends

- LLM inference costs dropped ~90% in 2024–2025, but total enterprise AI spend increased 3x due to volume growth
- "AI FinOps" is an emerging category with no category leader — the FinOps discipline is expanding from cloud infrastructure to AI workloads
- Developer tooling is consolidating around PLG motions — enterprise sales cycles are shortening for sub-$500/month tools
- Open-source AI infrastructure (LiteLLM, vLLM, Ollama) has normalized the concept of proxy layers between applications and LLM providers
---

## 3. PRODUCT DEFINITION

### Core Value Proposition

> Change one environment variable. See where every AI dollar goes. Start saving automatically.

dd0c/route transforms AI cost management from a monthly guessing game into a real-time, automated discipline. It's the "Linear for AI FinOps" — fast, opinionated, and built for practitioners, not procurement committees.

### User Personas

**Persona 1: Priya Sharma — The ML Engineer (Age 29, Series B fintech)**

- Defaults to GPT-4o for everything because benchmarking alternatives takes days she doesn't have
- Feels guilty about costs but isn't empowered to fix them — no per-call cost visibility exists
- Needs: automatic model selection without workflow disruption, cost feedback at the code level
- dd0c/route value: "I keep writing `model='gpt-4o'` and the router quietly downgrades when it's safe. I stopped feeling guilty."

**Persona 2: Marcus Chen — The Engineering Manager (Age 36, same fintech)**

- Gets one opaque bill from OpenAI ("$14,000") with zero breakdown by feature, team, or environment
- Spends 3–4 hours monthly on manual spreadsheet reconciliation across providers
- Presents estimated pie charts to the CFO and feels like a fraud
- dd0c/route value: "The attribution treemap IS my slide deck. Monday morning digest goes straight to the CFO."

**Persona 3: Jordan Okafor — The Platform Engineer (Age 32, mid-stage SaaS)**

- Maintains a hand-rolled Node.js LLM proxy that started as 200 lines and grew to 2,000
- Gets paged when the proxy breaks; paranoid about it being a single point of failure
- Wants a Helm chart, OTel export, and config-as-code — then to never think about it again
- dd0c/route value: "I deployed it with Helm, pointed the env var, and went back to my actual job."

### Key Features by Release

#### MVP (Month 1–3)

- OpenAI-compatible proxy (Rust, <10ms overhead at p99)
- Rule-based routing with heuristic complexity classifier (token count + keyword patterns)
- Cascading try-cheap-first routing (cheap model → escalate on low confidence)
- Cost attribution dashboard: real-time ticker, treemap by feature/team/model
- Request inspector (tokens, cost, latency, routing decision per call)
- Weekly Savings Digest email (automated Monday morning)
- Budget guardrails with threshold-based anomaly alerts (Slack integration)
- OpenAI + Anthropic support only
- SaaS-hosted proxy
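
The MVP's two routing mechanisms (heuristic complexity classification, then a cheap-first cascade) can be sketched roughly as below. The token threshold, keyword patterns, model names, and the "low confidence" signal are illustrative assumptions, not the actual dd0c/route implementation:

```typescript
// Hypothetical sketch of rule-based routing with a heuristic classifier.
type Model = "gpt-4o" | "gpt-4o-mini";

// Keyword patterns that suggest a request needs the frontier model (assumed).
const COMPLEX_PATTERNS = [/reason/i, /prove/i, /multi-step/i, /legal/i];

// Heuristic complexity classifier: token count + keyword patterns.
function classify(prompt: string): "simple" | "complex" {
  const approxTokens = Math.ceil(prompt.length / 4); // rough 4-chars-per-token rule
  if (approxTokens > 2000) return "complex";
  return COMPLEX_PATTERNS.some((p) => p.test(prompt)) ? "complex" : "simple";
}

// Cascading try-cheap-first routing: call the cheap model, escalate when the
// reply looks low-confidence (here: empty or an explicit hedge — an assumption).
async function route(
  prompt: string,
  call: (model: Model, prompt: string) => Promise<string>
): Promise<{ model: Model; output: string }> {
  if (classify(prompt) === "complex") {
    return { model: "gpt-4o", output: await call("gpt-4o", prompt) };
  }
  const cheap = await call("gpt-4o-mini", prompt);
  const lowConfidence = cheap.trim() === "" || /i'?m not sure/i.test(cheap);
  if (lowConfidence) {
    return { model: "gpt-4o", output: await call("gpt-4o", prompt) };
  }
  return { model: "gpt-4o-mini", output: cheap };
}
```

The design choice the brief implies: the escalation path costs one extra cheap call in the worst case, but the common case (simple request, confident cheap answer) never touches the expensive model.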

#### V2 (Month 4–6)

- Self-hosted data plane (Rust proxy in customer VPC, only telemetry to SaaS)
- Semantic response cache (exact-match V1, semantic similarity V2)
- A/B model testing (split traffic, measure cost/quality/latency, recommend winner)
- OTel export (Datadog, Grafana, Honeycomb integration)
- Google Gemini + Mistral provider support
- Quality threshold profiles ("customer-facing" vs. "internal-tool" vs. "batch-job")
- Prompt efficiency heatmap and optimization recommendations
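
The exact-match V1 of the response cache is small enough to sketch. The keying scheme, TTL behavior, and class shape here are assumptions for illustration, not the planned dd0c design (which would also need eviction and, in V2, semantic similarity):

```typescript
// Hypothetical exact-match response cache keyed on (model, prompt) with a TTL.
interface CacheEntry {
  response: string;
  expiresAt: number; // epoch milliseconds
}

class ExactMatchCache {
  private entries = new Map<string, CacheEntry>();

  constructor(private ttlMs: number) {}

  // NUL separator prevents ("a", "bc") colliding with ("ab", "c").
  private key(model: string, prompt: string): string {
    return `${model}\u0000${prompt}`;
  }

  get(model: string, prompt: string): string | undefined {
    const k = this.key(model, prompt);
    const e = this.entries.get(k);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) {
      this.entries.delete(k); // lazy expiry on read
      return undefined;
    }
    return e.response;
  }

  set(model: string, prompt: string, response: string): void {
    this.entries.set(this.key(model, prompt), {
      response,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```

A cache hit costs zero tokens, which is why even the naive exact-match version pays for itself on retry storms and repeated system prompts.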

#### V3 (Month 7–12)

- ML-based complexity classifier (trained on routing data flywheel)
- GitHub Action: cost impact comments on PRs
- Spend forecasting with confidence intervals
- VS Code extension with inline cost annotations
- SOC 2 Type II certification
- Enterprise features: SSO, RBAC, role-based dashboard views
- Model distillation recommendations ("hosting Llama 3 on a $2K/month GPU would save you $8K/month")

### User Journey

```
AWARENESS                  ACTIVATION                 RETENTION                    EXPANSION
───────────────────────────────────────────────────────────────────────────────────────────────
npx dd0c-scan ./src        Change OPENAI_BASE_URL     Weekly Savings Digest        Team-wide rollout
         ↓                          ↓                           ↓                          ↓
"You're wasting $4K/mo"    First request routed       Marcus forwards to CFO       Budget guardrails
         ↓                          ↓                           ↓                          ↓
Show HN / blog post        Dashboard shows savings    Routing rules refined        Pro → Business tier
         ↓                          ↓                           ↓                          ↓
Free signup                "Aha" in <5 minutes        Attribution data compounds   dd0c/cost cross-sell
```

### Pricing Model

| Tier | Price | Includes | Target |
|------|-------|----------|--------|
| **Free** | $0/month | Up to $500/month LLM spend routed, basic dashboard, 7-day data retention | Individual devs, evaluation |
| **Pro** | $49/month | Up to $15K/month LLM spend, full attribution treemap, weekly digest, 90-day retention, Slack alerts | Series A–B startups (beachhead) |
| **Business** | $199/month | Unlimited spend, self-hosted proxy option, OTel export, RBAC, 1-year retention, priority support | Growth-stage companies |
| **Enterprise** | Custom | SSO, SOC 2 compliance, dedicated support, SLA, custom integrations | Large organizations (V3+) |

**Pricing rationale (resolving Party Mode debate):** The Bootstrap Founder panelist argued for $49 flat; the VC argued for enterprise contracts. Resolution: $49 Pro tier captures the beachhead via expense-report purchases. $199 Business tier captures expansion revenue as teams grow. Enterprise tier deferred to V3 — closing enterprise deals takes 9 months and requires SOC 2, which a solo founder can't prioritize in Year 1. The Contrarian's suggestion to charge $99 for pure analytics (no proxy) is captured in the Free tier's shadow audit mode — prove value first, convert to routing later.
---

## 4. GO-TO-MARKET PLAN

### Launch Strategy

**Phase 1: Engineering-as-Marketing (Days 1–30)**

- Build and ship `npx dd0c-scan` — the CLI that scans a codebase, estimates LLM spend, and shows savings potential. No account needed. No data leaves the machine. This is the top-of-funnel viral tool.
- Dogfood dd0c/route on Brian's own projects. If the founder doesn't use it daily, it's not ready.

**Phase 2: Private Beta (Days 31–60)**

- Invite 10–20 people from Brian's network: AWS colleagues, startup CTO friends, Twitter mutuals.
- Free access in exchange for 15 minutes of weekly feedback.
- Track: time to first route, first "aha" moment, first complaint.
- Milestone: 5+ beta users who say "I would pay for this" unprompted.

**Phase 3: Public Launch (Days 61–90)**

- Show HN post (Tuesday/Wednesday morning US time — highest traffic days)
- First comment: technical architecture, honest limitations, roadmap
- Simultaneous posts: Twitter/X, Reddit (r/MachineLearning, r/devops), relevant Slack communities
- "Why I Built dd0c/route" blog post (personal story, technical architecture, honest tradeoffs)
- "State of AI Costs Q1 2026" report (anonymized data from beta users)
- Target: 500+ signups in week 1, 10–20 paying customers by day 90

### Beachhead Market

Series A–B SaaS startups in the US, spending $2K–$15K/month on LLM APIs, with 10–50 engineers. Specifically:

- Companies building AI-powered features (chatbots, summarization, code review, RAG pipelines)
- No dedicated ML infrastructure team — the platform engineer or CTO manages LLM infrastructure as a side responsibility
- CTO/VP Eng can approve $49/month without procurement
- Active in developer communities (Hacker News, Twitter/X, Discord, Slack groups)

Estimated beachhead size: 5,000–10,000 companies in the US alone.

### Growth Loops & Viral Mechanics

1. **The Savings Digest Loop:** dd0c/route sends a Monday morning email → Marcus (eng manager) sees "$1,847 saved this week" → forwards to CFO → CFO mandates team-wide adoption → more teams onboard → more savings → bigger digest number → more forwards.
2. **The Shadow Audit Loop:** Developer runs `npx dd0c-scan` → sees "$4,200/month wasted" → shares screenshot on Twitter/Slack → other developers try it → some convert to paid.
3. **The "You Could Have Saved" Loop:** Free tier users see a persistent counter: "Estimated savings if you'd used dd0c routing: $X" → the number grows daily → conversion pressure increases naturally.
4. **The Open-Source Loop:** OSS proxy gets GitHub stars → developers discover the project → some self-host (free marketing) → power users convert to SaaS for the dashboard/digest/analytics.

### Content & Community Strategy

- **Weekly newsletter:** "This Week in AI Pricing" — model price changes, benchmark updates, cost optimization tips
- **Monthly report:** "State of AI Costs" — anonymized aggregate data from dd0c/route users. Becomes the industry reference.
- **SEO targets:** High-intent, low-competition keywords first ("LiteLLM alternative," "reduce OpenAI costs," "LLM cost attribution")
- **Guest posts:** The New Stack, Dev.to, InfoQ — backlinks + immediate traffic while SEO compounds
- **Community:** Discord server for users. The best first hire will come from this community.

### Partnership Opportunities

- **Framework integrations:** Official LangChain / LlamaIndex / Vercel AI SDK partner — "recommended cost optimization tool"
- **Cloud marketplaces:** AWS Marketplace listing (Brian's AWS expertise is an unfair advantage here)
- **FinOps community:** FinOps Foundation membership, conference talks, co-authored reports
- **Complementary tools:** Integrate with Datadog, Grafana, PagerDuty — be the AI cost data source that feeds existing observability stacks

### 90-Day Launch Timeline

| Week | Focus | Deliverable |
|------|-------|-------------|
| 1–2 | Build proxy | Working Rust proxy, OpenAI + Anthropic, <10ms overhead |
| 2–3 | Build dashboard | Cost overview, treemap, request inspector |
| 3–4 | Build digest | Automated Monday email with savings breakdown |
| 5–6 | Private beta | 10–20 users routing traffic, collecting feedback |
| 6–7 | Build CLI | `npx dd0c-scan` — the viral top-of-funnel tool |
| 7–8 | Iterate | Fix top 3 complaints, polish onboarding to <5 min |
| 9 | Pre-launch content | Blog post, AI costs report, Show HN draft, landing page |
| 10 | Show HN launch | All-day in comments. Simultaneous Twitter/Reddit/Slack |
| 11–12 | Post-launch | Analyze funnel, fix biggest drop-off, reach out to every paying customer |
---

## 5. BUSINESS MODEL

### Revenue Model & Unit Economics

| Metric | Value | Notes |
|--------|-------|-------|
| **Average Revenue Per Account (ARPA)** | $75/month (blended) | Mix of $49 Pro and $199 Business customers |
| **Gross Margin** | ~85% | Infrastructure cost is minimal — proxy + ClickHouse + API on AWS, ~$150/month total at scale |
| **Monthly infrastructure cost** | $65–$185/month | AWS (proxy + API + analytics), email (Resend), analytics (PostHog free tier) |
| **Marginal cost per customer** | ~$0.50–$2/month | Proxy compute + telemetry storage. Near-zero marginal cost. |

### CAC / LTV Projections

| Metric | Target | Basis |
|--------|--------|-------|
| **CAC (organic/PLG)** | <$50 | Content marketing + Show HN + CLI virality. No paid ads in Year 1. |
| **Average customer lifetime** | 10+ months | Weekly digest drives retention; savings are visible and ongoing |
| **LTV** | >$750 | $75 ARPA × 10 months |
| **LTV:CAC ratio** | >15:1 | Best-in-class for PLG SaaS |
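
The table's targets are internally consistent, as a quick end-to-end check shows (all inputs are the brief's stated targets, not measured data):

```typescript
// Unit-economics sanity check using the brief's target figures.
const arpa = 75;           // $/month, blended Pro + Business
const lifetimeMonths = 10; // average customer lifetime target
const cac = 50;            // organic/PLG acquisition cost target

const ltv = arpa * lifetimeMonths; // $750, matching the LTV row
const ltvToCac = ltv / cac;        // 15, matching the >15:1 ratio row
```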

### Path to Revenue Milestones

| Milestone | Customers Needed | Timeline | What It Means |
|-----------|-----------------|----------|---------------|
| **$1K MRR** | ~20 Pro | Month 3–4 | Product-market fit signal |
| **$5K MRR** | ~80 Pro + 5 Business | Month 6–9 | Sustainable side project. "Should I keep going?" → Yes. |
| **$10K MRR** | ~150 Pro + 10 Business | Month 9–12 | "Should I quit my day job?" territory |
| **$25K MRR** | ~300 Pro + 20 Business | Month 12–18 | Quit the day job. This is a business. |
| **$50K MRR** | ~500 Pro + 40 Business | Month 18–24 | Hire first engineer. |
| **$100K MRR** | ~800 Pro + 80 Business | Month 24–36 | Series A optionality (or stay bootstrapped and profitable) |

### Resource Requirements (Solo Founder Constraints)

**Time budget:** 15 hours/week maximum until $5K MRR. This is a side project until the numbers say otherwise.

| Phase | Product Dev | Content | Community | Customer |
|-------|------------|---------|-----------|----------|
| Months 1–3 | 10 hrs (67%) | 3 hrs (20%) | 1.5 hrs (10%) | 0.5 hrs (3%) |
| Months 4–6 | 7 hrs (47%) | 4 hrs (27%) | 2 hrs (13%) | 2 hrs (13%) |
| Months 7–12 | 5 hrs (33%) | 4 hrs (27%) | 3 hrs (20%) | 3 hrs (20%) |

**Infrastructure budget:** $65–$185/month. Brian's AWS expertise keeps this minimal. The burn rate is essentially zero — patience is a competitive advantage funded startups don't have.
### Key Assumptions & Dependencies

1. **Engineers will route production traffic through a third-party proxy if savings are visible, immediate, and undeniable.** This is the core bet. Probability: 60/40 favorable.

2. **The cost delta between "expensive" and "cheap" models persists.** Frontier models will always command premium pricing; the spread between frontier and commodity persists even as absolute prices drop.

3. **Agentic AI drives volume growth that offsets per-token price declines.** Total LLM spend continues to increase even as unit costs decrease.

4. **PLG distribution works for this category.** The $49 price point and 5-minute setup enable self-serve adoption without a sales team.

5. **Brian can sustain 15 hours/week for 9–12 months.** The discipline of time-boxing is critical to avoiding burnout.
---

## 6. RISKS & MITIGATIONS

### Top 5 Risks

| # | Risk | Severity | Probability | Source |
|---|------|----------|-------------|--------|
| 1 | **OpenAI builds native smart routing** — "Smart Tier" that auto-routes within their models | 8/10 | Medium | VC + Innovation Strategy |
| 2 | **Trust barrier blocks adoption** — Security/compliance teams refuse to route prompts through a startup's proxy | 9/10 | Medium-High | CTO + DevOps panelists |
| 3 | **LLM price race-to-zero** — Cost delta between models shrinks to the point where optimization saves <$100/month | 8/10 | Low-Medium | Contrarian panelist |
| 4 | **Solo founder burnout** — 15 hrs/week + day job + support burden exceeds sustainable capacity | 7/10 | Medium | Bootstrap Founder panelist |
| 5 | **Well-funded competitor copies features** — Helicone/Portkey builds Shadow Audit + Attribution Treemap with a 10-engineer team | 6/10 | Medium | VC panelist |
**Mitigations:**

1. **OpenAI routing:** OpenAI's incentive is to sell the MOST expensive model, not the cheapest — smart routing cannibalizes their revenue. Even if they add it, dd0c/route routes ACROSS providers (OpenAI won't route you to Anthropic). Worst case: pivot to pure "AI FinOps analytics" — the attribution dashboard is valuable even without the proxy.

2. **Trust barrier:** V1 accepts this limitation — stick to the beachhead (startups without compliance teams). V1.5 (month 4–5): self-hosted data plane where the Rust proxy runs in the customer's VPC and only telemetry leaves their environment. Open-source the proxy core so customers can read every line of code. *Resolution note: The Party Mode CTO demanded VPC-deployable from Day 1. The Bootstrap Founder argued SaaS-only for V1 to reduce scope. Resolution: SaaS-only V1 for the beachhead, self-hosted V1.5 for expansion. The beachhead doesn't have compliance teams; the expansion market does.*

3. **Price race-to-zero:** Reposition from "use cheaper models" to "optimize AI spend" — framing survives price changes. Build semantic caching (saves money regardless of per-token pricing). Build prompt optimization features ("your average prompt is 40% longer than necessary"). The attribution dashboard remains valuable even if tokens cost a penny — "who is running these 100K token prompts?" is a latency and efficiency question, not just a cost question.

4. **Solo founder burnout:** Hard rule: no more than 15 hours/week until $5K MRR. Automate everything — zero-ops proxy, static dashboard, Discord community for support (not a ticket queue). The $5K MRR milestone is the "quit or don't" decision point. Build in public — the community becomes unpaid QA, feature prioritization, and emotional support.

5. **Competitor copies:** Rely on GTM speed, not feature moats. If Portkey builds a treemap but still requires a 30-minute sales call, Brian wins the developer who just wants to run `npx dd0c-scan` on a Saturday night. Double down on PLG friction advantage and community trust. Incumbents move slower than expected and over-complicate simple features.
### Kill Criteria

| Criterion | Threshold | Timeline |
|-----------|-----------|----------|
| No product-market fit signal | <50 free signups after Show HN launch | Month 1 |
| No conversion | <5 paying customers after 3 months of availability | Month 4 |
| Revenue plateau | <$2K MRR after 6 months | Month 7 |
| Churn exceeds growth | Net revenue retention <80% for 3 consecutive months | Month 6+ |
| Existential competitor launches | OpenAI/AWS launches free native routing covering 80%+ of dd0c/route's value | Any time |
| Burnout | >20 hrs/week AND below $5K MRR AND affecting day job/health | Month 6+ |
| Market thesis invalidated | Optimization saves <$100/month for the average customer | Any time |
**Walk-away rule:** If 2+ kill criteria are met simultaneously, stop. Not pivot. Stop. Pivoting a side project is how founders waste years.

**Exception:** If qualitative signals are strong (NPS >50, organic word-of-mouth) but quantitative metrics are below threshold, extend by 3 months.

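The walk-away rule is mechanical by design, which is the point: it can be stated as a check that leaves no room for motivated reasoning. A minimal sketch (the criterion names are abbreviated from the kill-criteria table; the function itself is hypothetical, not part of any dd0c tooling):

```python
# Hypothetical check for the walk-away rule: 2+ kill criteria met => stop.
def should_walk_away(criteria_met: dict) -> bool:
    """True when two or more kill criteria are met simultaneously."""
    return sum(bool(v) for v in criteria_met.values()) >= 2

status = {
    "no_pmf_signal": False,        # <50 free signups after Show HN
    "no_conversion": True,         # <5 paying customers after 3 months
    "revenue_plateau": True,       # <$2K MRR after 6 months
    "churn_exceeds_growth": False,
}
print(should_walk_away(status))  # True: two criteria met, so stop
```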
### Pivot Options

1. **Pure AI FinOps Analytics (no proxy):** Ingest existing logs, provide attribution dashboard and CFO reports. Removes latency risk, proxy trust barrier, and API maintenance burden. Charge $99/month. (The Contrarian's recommendation.)

2. **Open-source everything, monetize consulting:** If the SaaS doesn't convert, release the full product as OSS and sell implementation consulting to enterprises at $200–$400/hour.

3. **Vertical specialization:** Instead of horizontal "all AI costs," specialize in one vertical (e.g., "AI cost optimization for healthcare" with HIPAA compliance built in). Smaller market, higher willingness to pay.

---

## 7. SUCCESS METRICS

### North Star Metric

**Monthly Recurring Revenue (MRR).** Everything else is a leading indicator. MRR is the truth.

### Leading Indicators (Track Weekly)

| Metric | Target | Why It Matters |
|--------|--------|---------------|
| Signups | 50+/week post-launch | Top-of-funnel health |
| Activation rate (signup → first routed request) | >40% | Onboarding quality |
| Time to first route | <5 minutes median | The core adoption thesis |
| Weekly active routers | Growing 10%+ week-over-week | Product engagement |
| Savings per customer per month | >$100 average | Value delivery (must exceed subscription cost) |
| Digest email open rate | >50% | Retention mechanism health |
### Lagging Indicators (Track Monthly)

| Metric | Target | Why It Matters |
|--------|--------|---------------|
| Logo churn | <5%/month | Retention health |
| Revenue churn | <3%/month | Revenue health (expansion offsets logo churn) |
| CAC | <$50 (organic) | Acquisition efficiency |
| LTV | >$500 (10+ month lifetime) | Business viability |
| LTV:CAC ratio | >10:1 | PLG efficiency |
| Net revenue retention | >100% | Expansion > churn |
### Milestones

**30 Days:**
- Working proxy + dashboard deployed and dogfooded on Brian's own projects
- `npx dd0c-scan` CLI shipped and tested
- 10–20 private beta users routing traffic

**60 Days:**
- 5+ beta users who would pay unprompted
- Top 3 beta complaints fixed
- Onboarding polished to <5 minutes
- All launch content written

**90 Days:**
- Show HN launched
- 500+ signups
- 10–20 paying customers
- $1K–$2K MRR
- Clear understanding of who converts and why

**Month 6:**
- $5K MRR (80 Pro + 5 Business customers)
- Self-hosted proxy option shipped (V1.5)
- Weekly newsletter established with growing subscriber base
- Content/SEO generating measurable organic traffic
- Data flywheel showing early signs (routing accuracy improving with scale)

**Month 12:**
- $10K–$15K MRR (200 Pro + 20 Business customers)
- Decision point: go full-time or maintain as profitable side project
- OTel integration, A/B testing, and semantic caching shipped
- "State of AI Costs" report established as industry reference
- Community of 500+ developers in Discord

---

## APPENDIX: The Unfair Bet

Every startup has one core belief that, if true, makes everything else work. If false, nothing else matters.

> **"Engineering teams will route production LLM traffic through a third-party proxy if the savings are visible, immediate, and undeniable."**

**Assessment: 60/40 favorable.**

Evidence FOR: Cloudflare/CDNs normalized third-party traffic routing. LiteLLM's 15K+ stars prove developers accept proxy layers. 30–50% savings are a powerful motivator. Multi-model usage makes a routing layer increasingly necessary.

Evidence AGAINST: LLM prompts contain proprietary data — more sensitive than typical web traffic. Security teams are increasingly paranoid about AI data flows. "Just use the cheap model" is free and requires zero trust.

**The mitigation that tips the odds:** The open-source, self-hosted proxy. If it runs in the customer's VPC and only telemetry leaves their environment, the trust barrier drops dramatically.

---

*This brief synthesizes insights from four BMad Creative Intelligence Suite phases: Brainstorming (Carson), Design Thinking (Maya), Innovation Strategy (Victor), and Party Mode Advisory Board Review. Contradictions between phases have been noted and resolved inline.*

*The LLM cost optimization market will produce a $100M+ company in the next 5 years. The question is whether a bootstrapped solo founder can capture enough of that market to build a meaningful business before funded players consolidate. This brief argues yes — if Brian moves fast, stays focused, and lets the savings numbers do the selling.*
2241
products/01-llm-cost-router/test-architecture/test-architecture.md
Normal file
File diff suppressed because it is too large
2032
products/02-iac-drift-detection/architecture/architecture.md
Normal file
File diff suppressed because it is too large
343
products/02-iac-drift-detection/brainstorm/session.md
Normal file
@@ -0,0 +1,343 @@
# 🔥 IaC Drift Detection & Auto-Remediation — BRAINSTORM SESSION 🔥

**Facilitator:** Carson, Elite Brainstorming Specialist
**Date:** February 28, 2026
**Product:** dd0c Product #2 — IaC Drift Detection SaaS
**Energy Level:** ☢️ MAXIMUM ☢️

---

> *"Every piece of infrastructure that drifts from its declared state is a lie your system is telling you. Let's build the lie detector."* — Carson

---

## Phase 1: Problem Space 🎯 (25 Ideas)

### Drift Scenarios That Cause the Most Pain

1. **The "Helpful" Engineer** — Someone SSH'd into prod and tweaked an nginx config "just for now." That was 8 months ago. The Terraform state thinks it's vanilla. It's not. It never was again.

2. **Security Group Roulette** — A developer opens port 22 to 0.0.0.0/0 via the AWS console "for 5 minutes" to debug. Forgets. Drift undetected. You're now on Shodan. Congrats.

3. **The Auto-Scaling Ghost** — ASG scales up, someone manually terminates instances, ASG state and Terraform state disagree. `terraform apply` now wants to destroy your running workload.

4. **IAM Policy Creep** — Someone adds an inline policy via console. Terraform doesn't know. That policy grants `s3:*` to a role that should only read. Compliance audit finds it 6 months later.

5. **The RDS Parameter Drift** — Database parameters changed via console for "performance tuning." Next `terraform apply` reverts them. Production database restarts. At 2pm on a Tuesday. During a demo.

6. **Tag Drift Avalanche** — Cost allocation tags removed or changed manually. FinOps team can't attribute $40K/month in spend. CFO is asking questions. Nobody knows which team owns what.

7. **DNS Record Drift** — Route53 records edited manually during an incident. Never reverted. Terraform state is wrong. Next apply overwrites the fix. Outage #2.

8. **The Terraform Import That Never Happened** — Resources created via console during an emergency. "We'll import them later." Later never comes. They exist outside state. They cost money. Nobody knows they're there.

9. **Cross-Account Drift** — Shared resources (VPC peering, Transit Gateway attachments) modified in one account. The other account's Terraform doesn't know. Networking breaks silently.

10. **The Module Version Mismatch** — Team A upgrades a shared module. Team B doesn't. Their states diverge. Applying either one now has unpredictable blast radius.

### What Happens When Drift Goes Undetected — Horror Stories

11. **The $200K Surprise** — Drifted auto-scaling policies kept spinning up GPU instances nobody asked for. Undetected for 3 weeks. The AWS bill was... educational.

12. **The Compliance Audit Failure** — SOC 2 auditor asks "show me your infrastructure matches your declared state." It doesn't. Audit failed. Customer contract at risk. 6-figure deal on the line.

13. **The Cascading Terraform Destroy** — Engineer runs `terraform apply` on a state that's 4 months stale. Terraform sees 47 resources it doesn't recognize. Proposes destroying them. Engineer hits yes. Half of staging is gone.

14. **The Security Breach Nobody Noticed** — Drifted security group + drifted IAM role = open door. Attacker got in through the gap between declared and actual state. The IaC said it was secure. The cloud said otherwise.

15. **The "It Works On My Machine" of Infrastructure** — Dev environment Terraform matches state. Prod doesn't. "But it works in dev!" Yes, because dev hasn't drifted. Prod has been manually patched 30 times.

### Why Existing Tools Fail

16. **`terraform plan` Is Not Monitoring** — It's a point-in-time check that requires someone to run it. Nobody runs it at 3am when the drift happens. It's a flashlight, not a security camera.

17. **Spacelift/env0 Are Platforms, Not Tools** — You don't want to migrate your entire IaC workflow to detect drift. That's like buying a car to use the cup holder. $500/mo minimum for what should be a focused utility.

18. **driftctl Is Dead** — Snyk acquired it, then abandoned it. The OSS community is orphaned. README still says "beta." Last meaningful commit: ancient history.

19. **Terraform Cloud's Drift Detection Is an Afterthought** — Buried in the UI. Limited to Terraform (no OpenTofu, no Pulumi). Requires full TFC adoption. HashiCorp pricing is... HashiCorp pricing.

20. **ControlMonkey Is Enterprise-Only** — Great product, but they want $50K+ contracts and 6-month sales cycles. A 10-person startup can't even get a demo.

### The Emotional Experience of Drift

21. **2am PagerDuty + Drift = Existential Dread** — You're debugging a production issue. Nothing matches what the code says. You can't trust your own infrastructure definitions. You're flying blind in the dark.

22. **The Trust Erosion** — Every time drift is discovered, the team trusts IaC less. "Why bother with Terraform if the console changes override it anyway?" IaC adoption dies from a thousand drifts.

23. **The Blame Game** — "Who changed this?" Nobody knows. No audit trail. The console doesn't log who clicked what (unless CloudTrail is perfectly configured, which... it's not).

### Hidden Costs of Drift

24. **Debugging Time Multiplier** — Engineers spend 2–5x longer debugging issues when the actual state doesn't match the declared state. You're debugging a phantom. The code says X, reality is Y, and you don't know that.

25. **Compliance Theater** — Teams spend weeks before audits manually reconciling state. Running `terraform plan` across 50 stacks, fixing drift, re-running. This is a full-time job that shouldn't exist.

---

## Phase 2: Solution Space 🚀 (52 Ideas)

### Detection Approaches

26. **Continuous Polling Engine** — Run `terraform plan` (or equivalent) on a schedule. Every 15 min, every hour, every day. Configurable per-stack. The "security camera" approach.

27. **Event-Driven Detection via CloudTrail** — Watch AWS CloudTrail (and Azure Activity Log, GCP Audit Log) for API calls that modify resources tracked in state. Instant drift detection — no polling needed.

28. **State File Diffing** — Compare the current state file against the last known-good state. Detect additions, removals, and modifications without running a full plan. Faster, cheaper, fewer permissions needed.
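Idea 28 is cheap to prototype because Terraform state is JSON: index both documents by resource address and diff the dictionaries. A rough sketch, assuming two already-parsed v4 state documents (the function name and output shape are illustrative, not an existing tool's API):

```python
def diff_states(old: dict, new: dict) -> dict:
    """Diff two parsed Terraform state documents by resource address."""
    def index(state: dict) -> dict:
        # Map "module.type.name" -> instance list for each resource.
        return {
            f"{r.get('module', 'root')}.{r['type']}.{r['name']}": r["instances"]
            for r in state.get("resources", [])
        }
    old_idx, new_idx = index(old), index(new)
    return {
        "added": sorted(new_idx.keys() - old_idx.keys()),
        "removed": sorted(old_idx.keys() - new_idx.keys()),
        "changed": sorted(k for k in old_idx.keys() & new_idx.keys()
                          if old_idx[k] != new_idx[k]),
    }
```

A real detector would normalize computed attributes first, since fields like timestamps change without being drift.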

29. **Git-State Reconciliation** — Compare what's in the git repo (the desired state) against what's in the cloud (actual state). The "source of truth" approach. Works across any IaC tool.

30. **Hybrid Detection** — CloudTrail for real-time alerts on high-risk resources (security groups, IAM), scheduled polling for everything else. Best of both worlds. Cost-efficient.

31. **Resource Fingerprinting** — Hash the configuration of each resource. Compare hashes over time. If the hash changes and there's no corresponding git commit, that's drift. Lightweight and fast.
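Idea 31, sketched: serialize each resource's attributes canonically, hash, and compare against the stored fingerprint. Illustrative only; a production version would strip volatile fields (timestamps, computed IDs) before hashing:

```python
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Stable hash of a resource's configuration.

    sort_keys gives a canonical serialization, so identical configs
    always produce identical digests regardless of key order.
    """
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

stored = fingerprint({"port": 443, "cidr": "10.0.0.0/8"})
# Someone opens the port via the console:
current = fingerprint({"port": 22, "cidr": "0.0.0.0/0"})
print("drift detected" if current != stored else "clean")  # drift detected
```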

32. **Provider API Direct Query** — Skip Terraform entirely. Query AWS/Azure/GCP APIs directly and compare against declared state. Eliminates Terraform plan overhead. Works even if Terraform is broken.

33. **Multi-State Correlation** — Detect drift across multiple state files that reference shared resources. If VPC in state A drifts, alert teams using states B, C, D that reference that VPC.

### Remediation Strategies

34. **One-Click Revert** — "This security group drifted. Click here to revert to declared state." Generates and applies the minimal Terraform change. No full plan needed.

35. **Auto-Generated Fix PR** — Drift detected → automatically generate a PR that updates the Terraform code to match the new reality (when the drift is intentional). "Accept the drift" workflow.

36. **Approval Workflow** — Drift detected → Slack notification → team lead approves remediation → auto-applied. For teams that want human-in-the-loop but don't want to context-switch to a terminal.

37. **Scheduled Remediation Windows** — "Fix all non-critical drift every Sunday at 2am." Batch remediation with automatic rollback if health checks fail.

38. **Selective Auto-Remediation** — Define policies: "Always auto-revert security group changes. Never auto-revert RDS parameter changes. Ask for approval on IAM changes." Risk-based automation.
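Idea 38 maps naturally onto an ordered policy table where the first matching rule wins. A sketch with invented resource-type prefixes and action names — not dd0c's actual policy language:

```python
from enum import Enum

class Action(Enum):
    AUTO_REVERT = "auto_revert"            # revert to declared state immediately
    ALERT_ONLY = "alert_only"              # notify, never auto-revert
    REQUIRE_APPROVAL = "require_approval"  # human-in-the-loop

# First matching prefix wins; the default is to ask a human.
POLICIES = [
    ("aws_security_group", Action.AUTO_REVERT),
    ("aws_db_parameter_group", Action.ALERT_ONLY),
    ("aws_iam_", Action.REQUIRE_APPROVAL),
]

def decide(resource_type: str) -> Action:
    for prefix, action in POLICIES:
        if resource_type.startswith(prefix):
            return action
    return Action.REQUIRE_APPROVAL

print(decide("aws_security_group").value)   # auto_revert
print(decide("aws_iam_role_policy").value)  # require_approval
```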

39. **Drift Quarantine** — When drift is detected on a critical resource, automatically lock it (prevent further manual changes) until the drift is resolved through IaC. Enforced guardrails.

40. **Rollback Snapshots** — Before any remediation, snapshot the current state. If remediation breaks something, one-click rollback to the drifted-but-working state. Safety net.

41. **Import Wizard** — For drift that should be accepted: auto-generate the `terraform import` commands and HCL code to bring the drifted resources into state properly. The "make it official" button.

### Notification & Alerting

42. **Slack-First Alerts** — Rich Slack messages with drift details, blast radius, and action buttons (Revert / Accept / Snooze / Assign). Where engineers already live.

43. **PagerDuty Integration for Critical Drift** — Security group opened to the internet? That's not a Slack message. That's a page. Severity-based routing.

44. **PR Comments** — When a PR is opened that would conflict with existing drift, comment on the PR: "⚠️ Warning: these resources have drifted since your branch was created."

45. **Daily Drift Digest** — Morning email/Slack summary: "You have 3 new drifts, 7 unresolved, 2 auto-remediated overnight. Here's your drift score: 94/100."

46. **Drift Score Dashboard** — Real-time "infrastructure health score" based on % of resources in declared state. Gamify it. Teams compete for 100% drift-free status.
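The drift score in ideas 45–46 can start as the share of resources still matching declared state, weighted by severity. A toy sketch; the 3x weight for critical drift is an invented parameter, not a product decision:

```python
def drift_score(total_resources: int, drifted: dict) -> int:
    """0-100 health score: % of resources in declared state,
    with critical drift (security groups, IAM) weighted 3x."""
    weights = {"critical": 3, "minor": 1}
    penalty = sum(weights[sev] * n for sev, n in drifted.items())
    # Clamp so a burst of critical drift can't push the score below 0.
    return max(0, round(100 * (1 - penalty / total_resources)))

# 200 resources, 2 critical drifts and 6 minor ones:
print(drift_score(200, {"critical": 2, "minor": 6}))  # 94
```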

47. **Compliance Alert Channel** — Separate notification stream for compliance-relevant drift (IAM, encryption, logging). Auto-CC the security team. Generate audit evidence.

48. **ChatOps Remediation** — `/drift fix sg-12345` in Slack. Bot runs the remediation. No need to open a terminal or dashboard.

### Multi-Tool Support

49. **Terraform + OpenTofu Day 1** — These are 95% compatible. Support both from launch. Capture the OpenTofu migration wave.

50. **Pulumi Support** — Pulumi's state format is different but the concept is identical. Second priority. Captures the "modern IaC" crowd.

51. **CloudFormation Read-Only** — Many teams have legacy CFN stacks they can't migrate. At minimum, detect drift on them (CFN has a drift detection API). Don't need to remediate — just alert.

52. **CDK Awareness** — CDK compiles to CloudFormation. Understand the CDK→CFN mapping so drift alerts reference the CDK construct, not the raw CFN resource. Developer-friendly.

53. **Crossplane/Kubernetes** — For teams using Kubernetes-native IaC. Detect drift between desired state (CRDs) and actual cloud state. Niche but growing fast.

### Visualization

54. **Drift Heat Map** — Visual map of your infrastructure colored by drift status. Green = clean, yellow = minor drift, red = critical drift. Instant situational awareness.

55. **Dependency Graph with Drift Overlay** — Show resource dependencies. Highlight drifted resources AND everything that depends on them. "Blast radius" visualization.

56. **Timeline View** — When did each drift occur? Correlate with CloudTrail events. "This security group drifted at 3:47pm when user jsmith made a console change."

57. **Drift Trends Over Time** — Is drift getting better or worse? Weekly/monthly trends. "Your team's drift rate decreased 40% this month." Metrics for engineering leadership.

58. **Stack Health Dashboard** — Per-stack view: resources managed, resources drifted, last check time, remediation history. The "single pane of glass" for IaC health.

### Compliance Angle

59. **SOC 2 Evidence Auto-Generation** — Automatically generate compliance evidence: "100% of infrastructure changes were made through IaC. Here are the 3 exceptions, all remediated within SLA."

60. **Audit Trail Export** — Every drift event, every remediation, every approval — logged and exportable as CSV/PDF for auditors. One-click audit package.

61. **Policy-as-Code Integration** — Integrate with OPA/Rego or Sentinel. "Alert on drift that violates policy X." Not just "something changed" but "something changed AND it's now non-compliant."

62. **Change Window Enforcement** — Detect drift that occurs outside approved change windows. "Someone modified production at 2am on Saturday. That's outside the change freeze."

### Developer Experience

63. **CLI Tool (`drift check`)** — Run locally before pushing. "Your stack has 2 drifts. Fix them before applying." Shift-left drift detection.

64. **GitHub Action** — `uses: dd0c/drift-check@v1` — Run drift detection in CI. Block merges if drift exists. Free tier for public repos.

65. **VS Code Extension** — Inline drift indicators in your .tf files. "⚠️ This resource has drifted" right in the editor. Click to see details.

66. **Terraform Provider** — A Terraform provider that outputs drift status as data sources. `data.driftcheck_status.my_stack.drifted_resources`. Use drift status in your IaC logic.

67. **`drift init`** — One command to connect your stack. Auto-discovers state backend, cloud provider, resources. 60-second setup. No YAML config files.

### 🌶️ Wild Ideas

68. **Predictive Drift Detection** — ML model trained on CloudTrail patterns. "Based on historical patterns, this resource is likely to drift in the next 48 hours." Predict before it happens.

69. **Auto-Generated Fix PRs with AI Explanation** — Not just the code fix, but a natural language explanation: "This security group was opened to 0.0.0.0/0 by jsmith at 3pm. Here's a PR that reverts it and adds a comment explaining why it should stay restricted."

70. **Drift Insurance** — "We guarantee your infrastructure matches your IaC. If drift causes an incident and we didn't catch it, we pay." SLA-backed drift detection. Bold positioning.

71. **Infrastructure Replay** — Record all drift events. Replay them to understand how your infrastructure evolved outside of IaC. "Here's a movie of everything that changed in prod this month that wasn't in git."

72. **Drift-Aware Terraform Plan** — Wrap `terraform plan` to show not just what will change, but what has ALREADY changed (drift) vs what you're ABOUT to change. Split the plan output into "drift remediation" and "new changes."

73. **Cross-Org Drift Benchmarking** — Anonymous, aggregated drift statistics. "Your organization has a 12% drift rate. The industry average is 23%. You're in the top quartile." Competitive benchmarking.

74. **Natural Language Drift Queries** — "Show me all security-related drift in production from the last 7 days" → instant filtered view. ChatGPT for your infrastructure state.

75. **Drift Bounties** — Gamification: assign points for fixing drift. Leaderboard. "Sarah fixed 47 drifts this month. She's the Drift Hunter champion." Make compliance fun.

76. **"Chaos Drift" Testing** — Intentionally introduce drift in staging to test your team's detection and response capabilities. Like chaos engineering but for IaC discipline.

77. **Bi-Directional Sync** — Instead of just detecting drift, offer the option to sync in EITHER direction: revert cloud to match code, OR update code to match cloud. The user chooses which is the source of truth per-resource.

---
|
||||||
|
|
||||||
|
## Phase 3: Differentiation & Moat 🏰 (18 Ideas)
|
||||||
|
|
||||||
|
78. **Focused Tool, Not a Platform** — Spacelift, env0, and TFC are platforms. We're a tool. We do ONE thing — drift detection — and we do it better than anyone. This is our positioning moat. "We're the Stripe of drift detection. Focused. Developer-friendly. Just works."
|
||||||
|
|
||||||
|
79. **Price Disruption** — $29/mo for 10 stacks vs $500/mo for Spacelift. 17x cheaper. Price is the moat for SMBs. Spacelift can't drop to $29 without cannibalizing their enterprise business.
|
||||||
|
|
||||||
|
80. **Open-Source Core** — Open-source the detection engine. Paid SaaS for dashboard, alerting, remediation, and team features. Builds community, trust, and adoption. Hard for competitors to FUD against OSS.
|
||||||
|
|
||||||
|
81. **Multi-Tool from Day 1** — Spacelift is Terraform-first. env0 is Terraform-first. We support Terraform + OpenTofu + Pulumi from launch. The "Switzerland" of drift detection. No vendor lock-in.
|
||||||
|
|
||||||
|
82. **CloudTrail Data Advantage** — The more CloudTrail data we process, the better our drift attribution and prediction models get. Network effect: more customers → better detection → more customers.

83. **Integration Ecosystem** — Deep integrations with Slack, PagerDuty, GitHub, GitLab, Jira, Linear. Become the "drift hub" that connects to everything. Switching cost = reconfiguring all integrations.

84. **Community Drift Patterns Library** — Open-source library of common drift patterns and remediation playbooks. "AWS security group drift → here's the standard remediation." Community-contributed. We host it.

85. **Self-Serve Onboarding** — No sales calls. No demos. Sign up, connect your state backend, get drift alerts in 5 minutes. Spacelift requires a sales conversation. We require a credit card.

86. **Free Tier That's Actually Useful** — 3 stacks free forever. Not a trial. Not limited to 14 days. Actually useful for small teams and side projects. Builds habit and word-of-mouth.

87. **Terraform State as a Service (Adjacent Product)** — Once we're reading state files, we can offer state management (locking, versioning, encryption) as an adjacent product. Expand the surface area.

88. **Compliance Certification Partnerships** — Partner with SOC 2 auditors. "Use dd0c drift detection and your audit evidence is pre-generated." Become the recommended tool in compliance playbooks.

89. **Education Content Moat** — Become THE authority on IaC drift. Blog posts, case studies, "State of Drift" annual report, conference talks. Own the narrative. When people think "drift," they think dd0c.

90. **API-First Architecture** — Everything we do is available via API. Customers build custom workflows on top. Creates switching costs — their automation depends on our API.

91. **Drift SLA Guarantees** — "We detect drift within 15 minutes or your month is free." Nobody else offers this. Bold, measurable, differentiated.

92. **Agent-Ready Architecture** — Build the API so AI agents (Pulumi Neo, GitHub Copilot, custom agents) can query drift status and trigger remediation programmatically. Be the drift detection layer for the agentic DevOps era.
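The agent consumption path reduces to a small parser over a drift-status payload. A minimal sketch — the endpoint path and JSON schema below are illustrative assumptions, not a published dd0c API:

```python
import json


def summarize_drift(payload: str) -> dict:
    """Reduce a (hypothetical) GET /v1/stacks/{id}/drift response to what
    an agent needs before deciding whether to trigger remediation."""
    data = json.loads(payload)
    drifted = [r for r in data["resources"] if r["status"] == "drifted"]
    return {
        "stack": data["stack"],
        "drift_count": len(drifted),
        "needs_approval": any(r.get("severity") == "high" for r in drifted),
    }


# Sample payload in the assumed schema:
sample = json.dumps({
    "stack": "prod-networking",
    "resources": [
        {"id": "aws_security_group.alb", "status": "drifted", "severity": "high"},
        {"id": "aws_lb.main", "status": "in_sync"},
    ],
})
```

The point of the design: an agent never screen-scrapes a dashboard — it gets a machine-readable answer it can act on.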
93. **Embeddable Widget** — Let teams embed a drift status badge in their README, Backstage catalog, or internal wiki. Viral distribution through visibility.

94. **Multi-Cloud Correlation** — Detect drift across AWS + Azure + GCP simultaneously. Correlate cross-cloud dependencies. Nobody does this well.

95. **Acquisition Target Positioning** — Build something so good at drift detection that Spacelift, env0, or HashiCorp wants to acquire it rather than build it. The exit strategy IS the moat — be the best at one thing.

---

## Phase 4: Anti-Ideas & Red Team 💀 (14 Ideas)

96. **HashiCorp Builds It Natively** — Terraform 2.0 (or whatever) ships with built-in continuous drift detection. Risk: MEDIUM. HashiCorp moves slowly and their pricing alienates SMBs. OpenTofu fork means the community is fragmented. Even if they build it, it'll be Terraform-only and expensive.

97. **OpenTofu Builds It Natively** — OpenTofu adds drift detection as a core feature. Risk: LOW-MEDIUM. OpenTofu is community-driven and focused on core IaC, not SaaS features. They'd build the CLI piece, not the dashboard/alerting/remediation layer.

98. **Spacelift Launches a Free Tier** — Risk: MEDIUM-HIGH. Spacelift could offer basic drift detection for free to capture the market. Counter: their platform complexity is a liability. Free tier of a complex platform ≠ simple focused tool.

99. **"Drift Doesn't Matter" Argument** — Some teams argue that if you have good CI/CD and always apply from code, drift is impossible. Risk: LOW. This is theoretically true and practically false. Console access exists. Emergencies happen. Humans are humans.

100. **Cloud Providers Build It In** — CloudFormation already ships native drift detection. What if AWS extends it to Terraform? Risk: LOW. Cloud providers want you on THEIR IaC (CloudFormation, Bicep, Deployment Manager). They won't optimize for third-party tools.

101. **Security Scanners Expand Into Drift** — Prisma Cloud, Wiz, or Orca add drift detection as a feature. Risk: MEDIUM. They have the cloud access and customer base. Counter: they're security tools, not IaC tools. Drift detection would be a checkbox feature, not their core competency.

102. **The "Just Use CI/CD" Objection** — "Just run `terraform plan` in a cron job and parse the output." Risk: This is what most teams do today. It's fragile, doesn't scale, has no UI, no remediation, no audit trail. We're the productized version of this hack.
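The hack in question is easy to reproduce, which is exactly the point. A minimal sketch (assuming `terraform` is on the PATH) uses the documented `-detailed-exitcode` flag, where 0 means no changes, 1 means error, and 2 means the plan found changes:

```python
import subprocess


def classify(returncode: int) -> str:
    """Map `terraform plan -detailed-exitcode` results to a drift status."""
    return {0: "clean", 2: "drifted"}.get(returncode, "error")


def check_stack(stack_dir: str) -> str:
    """The cron-job version of drift detection: one plan, one exit code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=stack_dir, capture_output=True, text=True,
    )
    return classify(result.returncode)
```

Everything the objection concedes is missing — UI, remediation, audit trail, alert routing — starts where this script ends.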
103. **State File Access Is a Blocker** — Reading Terraform state requires access to the backend (S3, Terraform Cloud, etc.). Some security teams won't grant this. Risk: MEDIUM. Counter: offer a "pull" model where the customer's CI runs our agent and pushes results. No state file access needed.

104. **Permissions Anxiety** — "I'm not giving a SaaS tool IAM access to my AWS account." Risk: HIGH. This is the #1 adoption blocker for any cloud security/management tool. Counter: read-only IAM role with minimal permissions. SOC 2 certification. Option to run agent in customer's VPC.

105. **The Market Is Too Small** — Maybe only 10,000 teams worldwide actually need dedicated drift detection. At $99/mo average, that's $12M TAM. Is that enough? Counter: drift detection is the wedge. Expand into state management, policy enforcement, IaC analytics.

106. **Terraform Is Dying** — What if the industry moves to Pulumi, CDK, or AI-generated infrastructure? Risk: LOW-MEDIUM in 3-year horizon. Terraform/OpenTofu has massive inertia. But we should be multi-tool from day 1 to hedge.

107. **AI Makes IaC Obsolete** — What if Pulumi Neo-style agents manage infrastructure directly and IaC files become unnecessary? Risk: LOW in 3 years, MEDIUM in 5 years. Even with AI agents, you need to detect when actual state diverges from intended state. The concept of drift survives even if the tooling changes.

108. **Enterprise Sales Required** — What if SMBs don't pay for drift detection but enterprises do? Then we need a sales team, which kills the bootstrap model. Counter: validate with self-serve SMB customers first. Add enterprise features (SSO, audit logs, SLAs) later.

109. **Open Source Competitor Emerges** — Someone builds an excellent OSS drift detection tool. Risk: MEDIUM. Counter: our moat is the SaaS layer (dashboard, alerting, remediation, team features), not the detection engine. If we open-source our own engine, we control the narrative.

---
## Phase 5: Synthesis 🏆

### Top 10 Ideas — Ranked

| Rank | Idea # | Name | Why It's Top 10 |
|------|--------|------|-----------------|
| 🥇 1 | 30 | **Hybrid Detection (CloudTrail + Polling)** | Best-in-class detection that's both real-time AND comprehensive. This is the technical differentiator. |
| 🥈 2 | 79 | **Price Disruption ($29/mo)** | 17x cheaper than Spacelift. The single most powerful go-to-market weapon. |
| 🥉 3 | 42 | **Slack-First Alerts with Action Buttons** | Meet engineers where they are. Revert/Accept/Snooze without leaving Slack. This IS the product for most users. |
| 4 | 34 | **One-Click Revert** | The killer feature. Detect drift AND fix it in one click. Nobody else does this as a focused tool. |
| 5 | 67 | **`drift init` — 60-Second Setup** | Self-serve onboarding is the growth engine. If setup takes more than 60 seconds, you lose. |
| 6 | 80 | **Open-Source Core** | Builds trust, community, and adoption. Paid SaaS for the good stuff. Proven model (GitLab, Sentry, PostHog). |
| 7 | 86 | **Free Tier (3 Stacks Forever)** | Habit-forming. Word-of-mouth. The developer who uses it on a side project brings it to work. |
| 8 | 38 | **Selective Auto-Remediation Policies** | "Always revert security group drift. Ask for approval on IAM." Risk-based automation is the enterprise unlock. |
| 9 | 49 | **Terraform + OpenTofu + Pulumi from Day 1** | Multi-tool support = "Switzerland" positioning. Captures migration waves in all directions. |
| 10 | 59 | **SOC 2 Evidence Auto-Generation** | Compliance is the budget unlocker. "This tool pays for itself in audit prep time saved." CFO-friendly. |
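The risk-based automation in idea 38 ("always revert security groups, ask for approval on IAM") reduces to a small rule table. The rule format below is a sketch of the concept, not a shipped config schema:

```python
# Hypothetical policy rules: first prefix match wins, default is notify-only.
POLICIES = [
    {"prefix": "aws_security_group", "action": "auto_revert"},
    {"prefix": "aws_iam", "action": "require_approval"},
]


def remediation_action(resource_type: str) -> str:
    """Decide what to do with a drifted resource based on its type."""
    for rule in POLICIES:
        if resource_type.startswith(rule["prefix"]):
            return rule["action"]
    return "notify_only"
```

The design choice worth noting: the default is the safe one (notify), so a missing rule can never auto-revert something dangerous.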

### 3 Wild Cards 🃏

| Wild Card | Idea # | Name | Why It's Wild |
|-----------|--------|------|---------------|
| 🃏 1 | 68 | **Predictive Drift Detection** | ML that predicts drift before it happens. "This resource will drift in 48 hours based on historical patterns." Nobody has this. It's the future. |
| 🃏 2 | 71 | **Infrastructure Replay** | A DVR for your infrastructure. Replay every change that happened outside IaC. Forensics meets compliance meets "holy crap that's cool." |
| 🃏 3 | 70 | **Drift Insurance / SLA Guarantee** | "We guarantee detection within 15 minutes or your month is free." Turns a SaaS tool into a trust contract. Unprecedented in the space. |

### Key Themes

1. **Simplicity Is the Moat** — Every competitor is a platform. We're a tool. The market is screaming for focused, affordable, easy-to-adopt solutions. Don't build a platform. Build a scalpel.

2. **Slack Is the UI** — For 80% of users, the Slack notification with action buttons IS the product. The dashboard is secondary. Design Slack-first, dashboard-second.

3. **Price Is a Feature** — At $29/mo, drift detection becomes a no-brainer expense. No procurement process. No budget approval. Credit card and go. This is how you win SMB.

4. **Compliance Sells to Leadership** — Engineers want drift detection for operational sanity. Leadership wants it for compliance evidence. Sell both stories. The engineer adopts it bottom-up; the compliance angle gets it approved top-down.

5. **Open Source Builds Trust** — Cloud security tools face massive trust barriers ("you want access to my AWS account?!"). Open-source core + SOC 2 certification + minimal permissions = trust equation solved.

6. **Multi-Tool Is Non-Negotiable** — The IaC landscape is fragmented (Terraform, OpenTofu, Pulumi, CDK, CloudFormation). A drift detection tool that only works with one is leaving money on the table.

### Recommended V1 Focus 🎯

**V1 = "Drift Detection That Just Works"**

Ship with:

- ✅ Terraform + OpenTofu support (Pulumi in V1.1)
- ✅ Hybrid detection: CloudTrail real-time + scheduled polling
- ✅ Slack alerts with Revert / Accept / Snooze buttons
- ✅ One-click remediation (revert to declared state)
- ✅ `drift init` CLI for 60-second onboarding
- ✅ Basic web dashboard (drift list, stack health, timeline)
- ✅ Free tier: 3 stacks, daily polling, Slack alerts
- ✅ Paid tier: $29/mo for 10 stacks, 15-min polling, remediation
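The hybrid-detection item in the list above combines two triggers. A sketch of the scheduling decision — this is an assumed design, not shipped code:

```python
from datetime import datetime, timedelta


def next_check(last_poll: datetime, poll_interval: timedelta,
               cloudtrail_event: bool, now: datetime) -> datetime:
    """Hybrid trigger: a relevant CloudTrail event forces an immediate
    targeted check; otherwise fall back to the stack's polling schedule."""
    if cloudtrail_event:
        return now  # real-time path: check this stack immediately
    return last_poll + poll_interval  # comprehensive path: scheduled poll
```

CloudTrail gives the 15-minute SLA its teeth; polling catches whatever CloudTrail misses (eventual consistency, non-AWS resources).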

Do NOT ship with (save for V2+):

- ❌ Pulumi support (V1.1)
- ❌ Predictive drift detection (V2 — needs data)
- ❌ SOC 2 evidence generation (V1.5)
- ❌ VS Code extension (V2)
- ❌ Auto-generated fix PRs (V1.5)
- ❌ Policy-as-code integration (V2)

**The V1 pitch:** *"Connect your Terraform state. Get Slack alerts when something drifts. Fix it in one click. $29/month. Set up in 60 seconds."*

That's it. That's the product. Ship it. 🚀

---

**Total Ideas Generated: 109** 🔥🔥🔥

*Session complete. Carson out. Go build something that makes infrastructure engineers sleep better at night.* ✌️
353
products/02-iac-drift-detection/design-thinking/session.md
Normal file
# 🎷 dd0c/drift — Design Thinking Session

**Facilitator:** Maya, Design Thinking Maestro
**Date:** February 28, 2026
**Product:** dd0c/drift — IaC Drift Detection & Auto-Remediation
**Method:** Full Design Thinking (Empathize → Define → Ideate → Prototype → Test → Iterate)

---

> *"Drift is the jazz of infrastructure — unplanned improvisation that sounds beautiful to nobody. Our job isn't to kill the music. It's to give the musicians a score they actually want to follow."* — Maya

---

## Phase 1: EMPATHIZE 🎭

*Before we design a single pixel, we sit in their chairs. We wear their headphones. We feel the 2am vibration of a PagerDuty alert in our bones. Design is about THEM, not us.*

---

### Persona 1: The Infrastructure Engineer — "Ravi"

**Name:** Ravi Krishnamurthy
**Title:** Senior Infrastructure Engineer
**Company:** Mid-stage SaaS startup, 120 employees, Series B
**Experience:** 6 years in infra, 4 years writing Terraform daily
**Tools:** Terraform, AWS, GitHub Actions, Slack, Datadog, PagerDuty
**Stacks managed:** 23 (across dev, staging, prod, and 3 microservice clusters)

#### Empathy Map

**SAYS:**
- "I'll import that resource into state tomorrow." *(Tomorrow never comes.)*
- "Who changed the security group? Was it you? Was it me? I honestly don't remember."
- "Don't touch prod through the console. I'm serious. PLEASE."
- "I need to run `terraform plan` across all stacks before the release, give me... four hours."
- "The state file is the source of truth." *(Said with the conviction of someone who knows it's a lie.)*
**THINKS:**
- "If I run `terraform apply` right now, will it destroy something that someone manually created last month?"
- "I'm mass-producing YAML and HCL for a living. Is this what I went to school for?"
- "There are at least 15 resources in prod that aren't in any state file. I know it. I can feel it."
- "If this next plan shows 40+ changes I didn't expect, I'm calling in sick."
- "I should automate drift checks. I've been saying that for 8 months."

**DOES:**
- Runs `terraform plan` manually before every apply, scanning the output like a bomb technician reading a wire diagram
- Maintains a mental map of "things that have drifted but I haven't fixed yet" — a cognitive debt ledger that never gets paid down
- Writes Slack messages like "PSA: DO NOT modify anything in the `prod-networking` stack via console" every few weeks
- Spends 30% of debugging time figuring out whether the issue is in the code, the state, or reality
- Keeps a personal spreadsheet of "resources created via console during incidents that need to be imported"
**FEELS:**
- **Anxiety** before every `terraform apply` — the gap between what the code says and what exists is a minefield
- **Guilt** about the drift they know exists but haven't fixed — it's technical debt they can see but can't prioritize
- **Frustration** when colleagues make console changes — it feels like someone scribbling in the margins of a book you're trying to keep clean
- **Loneliness** at 2am when the pager goes off and nothing matches the code — you're debugging a ghost
- **Imposter syndrome** when drift causes an incident — "I should have caught this. Why didn't I catch this?"

#### Pain Points

1. **No continuous visibility** — `terraform plan` is a flashlight, not a security camera. Drift happens between plans.
2. **State file trust erosion** — Every discovered drift chips away at confidence in IaC as a practice.
3. **Manual reconciliation is soul-crushing** — Running plan across 23 stacks, reading output, triaging changes, fixing one by one. This is a full workday that produces zero new value.
4. **Blast radius fear** — Applying to a drifted stack is Russian roulette. Will it fix the drift or destroy the workaround someone built at 3am last Tuesday?
5. **No attribution** — WHO drifted this? WHEN? WHY? CloudTrail exists but correlating it to Terraform resources requires a PhD in AWS forensics.
6. **Context switching tax** — Drift discovery interrupts real work. You're building a new module and suddenly you're spelunking through state files.

#### Current Workarounds

- **Cron job running `terraform plan`** — Fragile bash script that emails output. Nobody reads the emails. The cron job broke 3 weeks ago. Nobody noticed.
- **"Drift Fridays"** — Dedicated time to reconcile state. Gets cancelled when there's a release. Which is every Friday.
- **Console access restrictions** — Tried to remove console access. Got overruled because "we need it for emergencies." The emergency is now permanent.
- **Mental model** — Ravi literally remembers which stacks are "clean" and which are "dirty." This knowledge lives in his head. When he's on vacation, nobody knows.

#### Jobs To Be Done (JTBD)

1. **When** I'm about to run `terraform apply`, **I want to** know exactly what has drifted since my last apply, **so I can** apply with confidence instead of fear.
2. **When** someone makes a console change to a resource I manage, **I want to** be notified immediately with context (who, what, when), **so I can** decide whether to revert or codify it before it becomes invisible debt.
3. **When** I'm debugging a production issue at 2am, **I want to** instantly see whether the resource in question matches its declared state, **so I can** eliminate "drift" as a variable in 30 seconds instead of 30 minutes.
4. **When** it's audit season, **I want to** generate a report showing all drift events and their resolutions, **so I can** prove our infrastructure matches our code without spending a week on manual reconciliation.
#### Day-in-the-Life Scenario: "The 2am Discovery"

*It's 2:17am. Ravi's phone buzzes. PagerDuty. `CRITICAL: API latency > 5s on prod-api-cluster`.*

*He opens his laptop, eyes half-closed. Checks Datadog. The API pods are healthy. Load balancer looks fine. But wait — the target group health checks are failing for two instances. He checks the security group. It should allow traffic on port 8080 from the ALB.*

*It doesn't.*

*The security group has been modified. Port 8080 is restricted to a specific CIDR that... isn't the ALB's subnet. Someone changed this. When? He doesn't know. Who? He doesn't know. Why? He doesn't know.*

*He opens the Terraform code. The code says port 8080 should be open to the ALB security group. The code is right. Reality is wrong. But he can't just `terraform apply` — what if there are OTHER changes in this stack he doesn't know about? What if applying reverts something else that's keeping prod alive?*

*He runs `terraform plan`. It shows 12 changes. TWELVE. He expected one. He doesn't recognize 8 of them. His stomach drops.*

*He spends the next 47 minutes reading plan output, cross-referencing with CloudTrail (which has 4,000 events in the last 24 hours for this account), trying to figure out which of these 12 changes are safe to apply and which will make things worse.*

*At 3:04am, he manually fixes the security group via the console. Adding more drift to fix drift. The irony isn't lost on him. He makes a mental note to "clean this up tomorrow."*

*Tomorrow, he won't. He'll be too tired. And the drift will compound.*

*He goes back to sleep at 3:22am. The alarm is set for 6:30am. He lies awake until 4.*

---
### Persona 2: The Security/Compliance Lead — "Diana"

**Name:** Diana Okafor
**Title:** Head of Security & Compliance
**Company:** B2B SaaS, 200 employees, SOC 2 Type II certified, pursuing HIPAA
**Experience:** 10 years in security, 3 years in cloud compliance
**Tools:** AWS Config, Prisma Cloud, Jira, Confluence, spreadsheets (so many spreadsheets)
**Responsibility:** Ensuring infrastructure matches approved configurations across 4 AWS accounts

#### Empathy Map

**SAYS:**
- "Can you prove that production matches the approved Terraform configuration? I need evidence for the auditor."
- "When was the last time someone verified there's no drift in the PCI-scoped environment?"
- "I need a change log. Not CloudTrail — I need something a human can read."
- "If we fail this audit, we lose the Acme contract. That's $2.3 million ARR."
- "I don't care HOW you fix the drift. I care that it's fixed, documented, and won't happen again."
**THINKS:**
- "I'm building compliance evidence on a foundation of sand. The Terraform code says one thing. I have no idea if the cloud matches."
- "The engineering team says 'everything is in code.' I've been in this industry long enough to know that's never 100% true."
- "If an auditor asks me to demonstrate real-time drift detection and I show them a cron job... we're done."
- "I'm spending 60% of my time on evidence collection that should be automated."
- "One undetected IAM policy change could be the difference between 'compliant' and 'breach notification.'"

**DOES:**
- Maintains a 200-row spreadsheet mapping compliance controls to infrastructure resources — updated manually, always slightly out of date
- Requests quarterly "drift audits" from the infra team — which take 2 weeks and produce a PDF that's outdated by the time it's delivered
- Reviews CloudTrail logs for unauthorized changes — drowning in noise, looking for signal
- Writes compliance narratives that say "all infrastructure changes are made through version-controlled IaC" while knowing there are exceptions
- Schedules monthly meetings with engineering to review "infrastructure hygiene" — attendance drops every month
**FEELS:**
- **Exposed** — she knows there are gaps between declared and actual state, but can't quantify them
- **Dependent** — she relies entirely on the infra team to tell her whether things have drifted, and they're too busy to check regularly
- **Anxious before audits** — the two weeks before an audit are a scramble to reconcile state, fix drift, and generate evidence
- **Frustrated by tooling** — AWS Config gives her compliance rules but not IaC drift. Prisma Cloud gives her security posture but not Terraform state comparison. Nothing connects the dots.
- **Professionally vulnerable** — if a breach happens because of undetected drift, it's her name on the incident report

#### Pain Points

1. **No continuous compliance evidence** — compliance is proven in snapshots, not streams. Between audits, she's flying blind.
2. **IaC drift is invisible to security tools** — Prisma Cloud sees the current state. It doesn't know what the INTENDED state is. Drift is the gap between intent and reality, and no security tool measures it.
3. **Manual evidence collection** — generating audit evidence requires coordinating with 3 engineering teams, running plans, collecting outputs, formatting reports. It's a part-time job.
4. **Change attribution is archaeological** — figuring out who changed what, when, and whether it was approved requires cross-referencing CloudTrail, git history, Jira tickets, and Slack messages.
5. **Compliance theater** — she suspects some "evidence" is aspirational rather than factual. The narrative says "no manual changes" but she can't verify it.
#### Current Workarounds

- **Quarterly manual audits** — engineering runs `terraform plan` across all stacks, documents drift, fixes it, re-runs. Takes 2 weeks. Results are stale immediately.
- **AWS Config rules** — catches some configuration drift but doesn't compare against Terraform intent. Generates hundreds of findings, most irrelevant.
- **Honor system** — relies on engineers to report console changes. They don't. Not maliciously — they just forget, or they think "I'll fix it in code later."
- **Pre-audit fire drill** — 2 weeks before every audit, the entire infra team drops everything to reconcile state. Productivity crater.

#### Jobs To Be Done (JTBD)

1. **When** an auditor asks for evidence that infrastructure matches declared state, **I want to** generate a real-time compliance report in one click, **so I can** demonstrate continuous compliance instead of point-in-time snapshots.
2. **When** a change is made outside of IaC, **I want to** be alerted immediately with full attribution (who, what, when, from where), **so I can** assess the compliance impact and initiate remediation before it becomes a finding.
3. **When** preparing for SOC 2 / HIPAA audits, **I want to** have an automatically maintained audit trail of all drift events and their resolutions, **so I can** eliminate the 2-week pre-audit scramble.
4. **When** evaluating our security posture, **I want to** see a real-time "drift score" across all environments, **so I can** quantify infrastructure hygiene and track improvement over time.
#### Day-in-the-Life Scenario: "The Audit Scramble"

*It's Monday morning, 14 days before the SOC 2 Type II audit. Diana opens her laptop to 3 Slack messages from the auditor: "Please provide evidence for Control CC6.1 — logical access controls match approved configurations."*

*She messages the infra team lead: "I need a full drift report across all production stacks by Wednesday." The response: "We're in the middle of a release. Can it wait until next week?" It cannot wait until next week.*

*She compromises: "Can you at least run plans on the PCI-scoped stacks?" Two days later, she gets a Confluence page with terraform plan output pasted in. It shows 7 drifted resources. Three are tagged "known — will fix later." Two are "not sure what this is." Two are "probably fine."*

*"Probably fine" is not a compliance posture.*

*She spends Thursday manually cross-referencing the drifted resources with CloudTrail to build an attribution timeline. The CloudTrail console search is painfully slow. She exports to CSV. Opens it in Excel. 47,000 rows for the last 30 days. She filters by resource ID. Finds the change. It was made by a service role, not a human. Which service? She doesn't know. More digging.*

*By Friday, she has a draft evidence document that says "7 drift events detected, 5 remediated, 2 accepted with justification." The justifications are thin. She knows the auditor will push back. She rewrites them three times.*

*The audit passes. Barely. The auditor notes "opportunity for improvement in continuous configuration monitoring." Diana knows that means "fix this before next year or you'll get a finding."*

*She adds "evaluate drift detection tooling" to her Q2 OKRs. It's February. Q2 starts in April. The drift continues.*

---
### Persona 3: The DevOps Team Lead — "Marcus"

**Name:** Marcus Chen
**Title:** DevOps Team Lead
**Company:** E-commerce platform, 400 employees, multi-region AWS deployment
**Experience:** 12 years in ops/DevOps, managing a team of 4 infra engineers
**Tools:** Terraform, AWS (3 accounts), GitHub, Slack, Linear, Datadog, PagerDuty
**Stacks managed (team total):** 67 across dev, staging, prod, and disaster recovery

#### Empathy Map

**SAYS:**
- "How much time did we spend on drift remediation this sprint? I need to report that to leadership."
- "I need visibility across all 67 stacks. Not 'run plan on each one.' A dashboard. One screen."
- "We're spending 30% of our time firefighting things that shouldn't have changed. That's not engineering, that's janitorial work."
- "I can't hire another engineer just to babysit state files. The budget isn't there."
- "If I could show leadership a metric — 'drift incidents per week' trending down — I could justify the tooling investment."
**THINKS:**
- "My team is burning out. Ravi hasn't taken a real vacation in 8 months because he's the only one who knows which stacks are 'safe' to apply."
- "I have 67 stacks and zero visibility into which ones have drifted. I'm managing by hope."
- "If we had drift detection, I could turn reactive firefighting into proactive maintenance. That's the difference between a team that ships and a team that survives."
- "Spacelift would solve this but it's $6,000/year and requires migrating our entire workflow. I can't justify that for drift detection alone."
- "One of my engineers is going to quit if I don't reduce the on-call burden. The 2am pages for drift-related issues are killing morale."

**DOES:**
- Runs weekly "stack health" meetings where each engineer reports on their stacks — this is verbal, tribal knowledge transfer disguised as a meeting
- Maintains a Linear board of "known drift" issues that never gets prioritized over feature work
- Advocates to leadership for drift detection tooling — gets told "can't you just write a script?"
- Manually assigns drift remediation during "quiet sprints" — which don't exist
- Reviews every `terraform apply` in prod personally because he doesn't trust the state
#### Pain Points

1. **No aggregate visibility** — he can't see drift across 67 stacks without asking each engineer to run plans individually. There's no "single pane of glass."
2. **Team capacity drain** — drift remediation is unplanned work that displaces planned work. He can't forecast sprint velocity because drift is unpredictable.
3. **Knowledge silos** — each engineer knows their stacks. When someone is sick or on vacation, their stacks are black boxes. Drift knowledge is tribal.
4. **Can't quantify the problem** — leadership asks "how bad is drift?" and he can't answer with data. He has anecdotes and gut feelings. That doesn't unlock budget.
5. **Tool sprawl fatigue** — his team already uses 8+ tools. Adding another platform (Spacelift, env0) means migration, training, and ongoing maintenance. He wants a tool, not a platform.
6. **On-call burnout** — drift-related incidents inflate on-call burden. His team is on a 4-person rotation. One more quit and it's unsustainable.

#### Current Workarounds

- **Weekly manual checks** — each engineer runs `terraform plan` on their stacks Monday morning. Results shared in Slack. Nobody reads each other's results.
- **"Drift budget"** — allocates 20% of each sprint to "infrastructure hygiene." In practice, it's 5% because feature work always wins.
- **Tribal knowledge** — Marcus keeps a mental model of which stacks are "high risk" for drift. He assigns on-call accordingly. This doesn't scale.
- **Post-incident drift audits** — after every drift-related incident, the team does a full audit of related stacks. Reactive, not proactive.
#### Jobs To Be Done (JTBD)

1. **When** planning sprint capacity, **I want to** see a real-time count of drift events across all stacks, **so I can** allocate remediation time accurately instead of guessing.
2. **When** reporting to engineering leadership, **I want to** show drift metrics trending over time (drift rate, mean time to remediation, stacks affected), **so I can** justify tooling investment with data instead of anecdotes.
3. **When** an engineer is on vacation, **I want to** have automated drift monitoring on their stacks, **so I can** eliminate the "bus factor" of tribal knowledge.
4. **When** evaluating new tools, **I want to** adopt something that integrates with our existing workflow (Terraform + GitHub + Slack) without requiring a platform migration, **so I can** get value in days, not months.
5. **When** a drift event occurs, **I want to** route it to the right engineer automatically based on stack ownership, **so I can** reduce my role as a human router of drift information.
|
||||||
|
|
||||||
|
#### Day-in-the-Life Scenario: "The Visibility Gap"
|
||||||
|
|
||||||
|
*It's Wednesday standup. Marcus asks the team: "Any drift issues this week?"*

*Ravi: "I found 3 drifted resources in prod-api yesterday. Fixed two, the third is complicated — someone changed the RDS parameter group and I'm not sure if reverting will restart the database."*

*Priya: "I haven't checked my stacks yet this week. I've been heads-down on the Kubernetes migration."*

*Jordan: "My stacks are clean... I think. I ran plan on Monday. But that was before the incident on Tuesday where someone opened a port via console."*

*Sam: "I'm still fixing drift from last month's audit prep. I have 4 stacks with known drift that I haven't gotten to."*

*Marcus writes in his notebook: "3 confirmed drifts (Ravi), unknown (Priya), possibly new drift (Jordan), 4 known unresolved (Sam)." He has 67 stacks. He has visibility into maybe 15 of them. The other 52 are Schrödinger's infrastructure — simultaneously drifted and not drifted until someone opens the box.*

*After standup, his manager pings him: "The VP of Engineering wants to know our 'infrastructure reliability score' for the board deck. Can you get me a number by Friday?"*

*Marcus stares at the message. He has no number. He has a notebook with question marks. He opens a spreadsheet and starts making one up — educated guesses based on the last time each stack was checked. He knows it's fiction. The VP will present it as fact.*

*He adds "evaluate drift detection tools" to his personal to-do list. It's been there for 5 months. It keeps getting bumped by the next fire.*

---

## Phase 2: DEFINE 🔍

*Now we take all that raw empathy — the 2am dread, the audit scramble, the standup question marks — and we distill it into sharp, actionable problem statements. This is where the jazz improvisation becomes a composed melody. We're not solving everything. We're finding the ONE note that resonates across all three personas.*

---

### Point-of-View (POV) Statements

**POV 1 — Ravi (Infrastructure Engineer):**

Ravi, a senior infrastructure engineer managing 23 Terraform stacks, needs a way to continuously know when his infrastructure has diverged from its declared state, because the gap between code and reality is an invisible minefield that turns every `terraform apply` into an act of faith and every 2am incident into a forensic investigation against phantom configurations.

**POV 2 — Diana (Security/Compliance Lead):**

Diana, a head of security responsible for SOC 2 compliance across 4 AWS accounts, needs a way to continuously prove that infrastructure matches approved configurations, because her current evidence is built on quarterly snapshots and engineer self-reporting — a house of cards that one undetected IAM policy change could collapse, taking a $2.3M customer contract with it.

**POV 3 — Marcus (DevOps Team Lead):**

Marcus, a DevOps lead managing 67 stacks through a team of 4 engineers, needs a way to see aggregate drift status across all stacks in real time, because without it he's managing infrastructure health through tribal knowledge and standup anecdotes — a system that breaks the moment someone takes a vacation, and that produces fiction when leadership asks for a number.

**POV 4 — The Composite (The Organization):**

Engineering organizations that practice Infrastructure as Code need a way to close the loop between declared state and actual state continuously, because the current model — periodic manual checks, tribal knowledge, and reactive firefighting — erodes trust in IaC itself, burns out the engineers who maintain it, and creates compliance gaps that compound silently until they become audit failures or security incidents.

---

### Key Insights

**Insight 1: Drift is a trust problem, not a technical problem.**

Every undetected drift event erodes trust — trust in the state file, trust in IaC as a practice, trust in teammates not to make console changes. When trust erodes far enough, teams abandon IaC discipline entirely. "Why bother with Terraform if reality doesn't match anyway?" dd0c/drift doesn't just detect configuration changes. It restores faith in the system.

**Insight 2: The absence of data is the biggest pain point.**

Ravi can't quantify his anxiety. Diana can't prove her compliance. Marcus can't report his team's infrastructure health. All three personas suffer from the same root cause: there is no continuous, automated measurement of the gap between intent and reality. The first product that provides this number — a simple drift score — wins all three personas simultaneously.

**Insight 3: Remediation without context is dangerous.**

"Just revert it" sounds simple until you realize the drift might be a 3am hotfix that's keeping production alive. The product must present drift with CONTEXT — who changed it, when, what else depends on it, and what happens if you revert. One-click remediation is the killer feature, but one-click destruction is the killer bug.

**Insight 4: The tool/platform divide is real and exploitable.**

Every competitor (Spacelift, env0, Terraform Cloud) bundles drift detection inside a platform that requires workflow migration. Our personas don't want to change how they work. They want to ADD visibility to how they already work. The market gap isn't "better drift detection." It's "drift detection that doesn't require you to change anything else."

**Insight 5: Three buyers, one product, three stories.**

- Ravi buys with his credit card because it eliminates his 2am dread. (Bottom-up, individual pain.)
- Diana approves the budget because it generates audit evidence. (Middle-out, compliance justification.)
- Marcus champions it to leadership because it produces metrics. (Top-down, organizational visibility.)

The product is the same. The value proposition changes per persona. This is the dd0c/drift GTM unlock.

**Insight 6: Slack is the control plane, not the dashboard.**

For Ravi, the Slack alert with action buttons IS the product 80% of the time. He doesn't want to open another dashboard. He wants to see "Security group sg-abc123 drifted. [Revert] [Accept] [Snooze]" in the channel he's already in. The web dashboard exists for Diana and Marcus. Ravi lives in Slack.

---

### Core Tension: Automation vs. Control 🎸

*Here's the jazz tension — the dissonance that makes the music interesting:*

**The Automation Pull:**

- Engineers want drift to be fixed automatically. "If someone opens a port to 0.0.0.0/0, just close it. Don't ask me. I'm sleeping."
- Compliance wants continuous enforcement. "Infrastructure should ALWAYS match declared state. Period."
- Leadership wants zero drift. "Why do we have drift at all? Automate it away."

**The Control Pull:**

- Engineers fear auto-remediation will revert a hotfix that's keeping prod alive. "What if the drift is INTENTIONAL?"
- Security wants approval workflows. "We can't auto-revert IAM changes without understanding the blast radius."
- Operations wants change windows. "Don't auto-remediate during peak traffic. That's a recipe for a different kind of outage."

**The Resolution:**

The product must offer a SPECTRUM of automation, not a binary switch. Per-resource-type policies:

- **Auto-revert:** Security groups opened to 0.0.0.0/0. Always. No questions.
- **Alert + one-click:** IAM policy changes. Show me, let me decide, make it easy.
- **Digest only:** Tag drift. Tell me in the morning summary. I'll get to it.
- **Ignore:** Auto-scaling instance count changes. That's not drift, that's the system working.

This spectrum IS the product's sophistication. It's what separates dd0c/drift from a cron job running `terraform plan`.
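
A policy file could make this spectrum configurable. The schema below is purely illustrative — field names, match syntax, and action keywords are assumptions, not a shipped format:

```yaml
# Hypothetical dd0c/drift policy file — illustrative schema, not a real format.
policies:
  - match: { type: aws_security_group, attribute: ingress }
    action: auto_revert        # close 0.0.0.0/0 immediately, no questions
  - match: { type: aws_iam_policy }
    action: alert_one_click    # show me, let me decide, make it easy
  - match: { attribute: tags }
    action: digest             # tell me in the morning summary
  - match: { type: aws_autoscaling_group, attribute: desired_capacity }
    action: ignore             # that's the system working, not drift
```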

---

### How Might We (HMW) Questions

**Detection:**

1. **HMW** detect drift in real time without requiring engineers to manually run `terraform plan`?
2. **HMW** distinguish between intentional changes (hotfixes, emergency responses) and unintentional drift (mistakes, forgotten console changes)?
3. **HMW** detect drift across 67+ stacks without requiring individual stack-by-stack checks?
4. **HMW** attribute drift to a specific person, time, and action without requiring engineers to become CloudTrail forensics experts?

**Remediation:**

5. **HMW** make drift remediation a 10-second action instead of a 30-minute investigation?
6. **HMW** give engineers confidence that reverting drift won't break something else?
7. **HMW** allow teams to define per-resource automation policies (auto-revert vs. alert vs. ignore) without complex configuration?
8. **HMW** offer "accept the drift" as a first-class workflow — updating code to match reality when the change was intentional?

**Visibility & Reporting:**

9. **HMW** give team leads a single number ("drift score") that represents infrastructure health across all stacks?
10. **HMW** generate SOC 2 / HIPAA compliance evidence automatically from drift detection data?
11. **HMW** show drift trends over time so teams can measure whether their IaC discipline is improving or degrading?
12. **HMW** route drift alerts to the right engineer automatically based on stack ownership?

**Adoption & Trust:**

13. **HMW** get an engineer from zero to first drift alert in under 60 seconds?
14. **HMW** build trust with security-conscious teams who won't give a SaaS tool IAM access to their AWS accounts?
15. **HMW** make drift detection feel like a natural extension of existing workflows (Terraform + GitHub + Slack) rather than a new tool to learn?
16. **HMW** make the free tier valuable enough to create habit and word-of-mouth without giving away the business?

---

products/02-iac-drift-detection/epics/epics.md

# dd0c/drift - V1 MVP Epics

## Epic 1: Drift Agent (Go CLI)

**Description:** The core open-source Go binary that runs in the customer's environment. It parses Terraform state, polls AWS APIs for actual resource configurations, calculates the diff, and scrubs sensitive data before transmission.

### User Stories

#### Story 1.1: Terraform State Parser

**As an** Infrastructure Engineer, **I want** the agent to parse my Terraform state file locally, **so that** it can identify declared resources without uploading my raw state to a third party.

* **Acceptance Criteria:**
    * Successfully parses the Terraform state v4 JSON format.
    * Extracts a list of `managed` resources with their declared attributes.
    * Handles both local `.tfstate` files and AWS S3 remote backend configurations.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** Use standard Go JSON unmarshaling. Create an internal graph representation. Focus exclusively on AWS provider resources.
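
The parsing step can be sketched in a few lines of Go. This is a minimal illustration of the approach the notes describe — the struct names and the small subset of state v4 fields shown here are assumptions, not the agent's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal slice of the Terraform state v4 format (illustrative, not exhaustive).
type State struct {
	Version   int        `json:"version"`
	Resources []Resource `json:"resources"`
}

type Resource struct {
	Mode      string     `json:"mode"` // "managed" or "data"
	Type      string     `json:"type"`
	Name      string     `json:"name"`
	Instances []Instance `json:"instances"`
}

type Instance struct {
	Attributes map[string]any `json:"attributes"`
}

// ManagedResources unmarshals a state file and keeps only mode == "managed"
// entries — data sources cannot drift, so they are skipped.
func ManagedResources(raw []byte) ([]Resource, error) {
	var s State
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	if s.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", s.Version)
	}
	out := make([]Resource, 0, len(s.Resources))
	for _, r := range s.Resources {
		if r.Mode == "managed" {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	state := []byte(`{"version":4,"resources":[
	  {"mode":"managed","type":"aws_security_group","name":"web","instances":[{"attributes":{"id":"sg-123"}}]},
	  {"mode":"data","type":"aws_ami","name":"latest","instances":[]}]}`)
	rs, err := ManagedResources(state)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d managed resource(s); first: %s.%s\n", len(rs), rs[0].Type, rs[0].Name)
}
```

Because the parse happens inside the customer's environment, the raw state (and any secrets inside it) never leaves the machine — only the extracted resource list feeds the later diff step.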

#### Story 1.2: AWS Resource Polling

**As an** Infrastructure Engineer, **I want** the agent to query AWS for the current state of my resources, **so that** it can compare reality against my declared Terraform state.

* **Acceptance Criteria:**
    * Agent uses the customer's local AWS credentials/IAM role to authenticate.
    * Queries AWS APIs for the top 20 MVP resource types (e.g., `ec2:DescribeSecurityGroups`, `iam:GetRole`).
    * Maps Terraform resource IDs to AWS identifiers.
* **Story Points:** 8
* **Dependencies:** Story 1.1
* **Technical Notes:** Use the official AWS Go SDK v2. Map API responses to a standardized internal schema that matches the state parser output. Add simple retry logic for rate limits.

#### Story 1.3: Drift Diff Calculation

**As an** Infrastructure Engineer, **I want** the agent to calculate attribute-level differences between my state file and AWS reality, **so that** I know exactly what changed.

* **Acceptance Criteria:**
    * Compares parsed state attributes with polled AWS attributes.
    * Outputs a structured diff showing `old` (state) and `new` (reality) values.
    * Ignores AWS-generated default attributes that aren't declared in state.
* **Story Points:** 5
* **Dependencies:** Story 1.1, Story 1.2
* **Technical Notes:** Implement a deep compare function. Requires hardcoded ignore lists for known noisy attributes (e.g., AWS-assigned IDs or timestamps).
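
A minimal sketch of that deep compare, under the assumptions above (walk only declared attributes, skip a hardcoded ignore set):

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Change records one attribute-level difference between state and reality.
type Change struct {
	Attr string
	Old  any // declared in state
	New  any // observed in AWS
}

// DiffAttributes walks the declared attributes only, so AWS-generated defaults
// that were never declared are ignored, per the acceptance criteria. Known
// noisy attributes (timestamps, AWS-assigned IDs) are skipped via ignore.
func DiffAttributes(declared, actual map[string]any, ignore map[string]bool) []Change {
	keys := make([]string, 0, len(declared))
	for k := range declared {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	var changes []Change
	for _, k := range keys {
		if ignore[k] {
			continue
		}
		if got, ok := actual[k]; !ok || !reflect.DeepEqual(declared[k], got) {
			changes = append(changes, Change{Attr: k, Old: declared[k], New: got})
		}
	}
	return changes
}

func main() {
	declared := map[string]any{"from_port": 22, "description": "ssh"}
	actual := map[string]any{"from_port": 2222, "description": "ssh", "owner_id": "123456789012"}
	for _, c := range DiffAttributes(declared, actual, nil) {
		fmt.Printf("%s: %v -> %v\n", c.Attr, c.Old, c.New)
	}
}
```

Note that `owner_id` produces no change even though AWS reports it — it was never declared, so it isn't drift.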

#### Story 1.4: Secret Scrubbing Engine

**As a** Security Lead, **I want** the agent to scrub all sensitive data from the drift diffs, **so that** my database passwords and API keys are never transmitted to the dd0c SaaS.

* **Acceptance Criteria:**
    * Strips any attribute marked `sensitive` in the state file.
    * Redacts values matching known secret patterns (e.g., `password`, `secret`, `token`).
    * Replaces redacted values with `[REDACTED]`.
    * Completely strips the `Private` field from state instances.
* **Story Points:** 3
* **Dependencies:** Story 1.3
* **Technical Notes:** Use regex for pattern matching. Validate the scrubber with rigorous unit tests before shipping. Ensure the diff structure remains intact even when values are redacted.
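
A sketch of the scrubber under those criteria — the regex pattern list here is illustrative and a real scrubber would need a broader, well-tested set:

```go
package main

import (
	"fmt"
	"regexp"
)

// secretName matches attribute names that commonly hold credentials.
// Illustrative only; the production pattern list must be broader.
var secretName = regexp.MustCompile(`(?i)(password|secret|token|private_key)`)

const redacted = "[REDACTED]"

// Scrub redacts values whose attribute name matches a secret pattern or is
// flagged sensitive in the state file. The map structure is preserved, so the
// diff shape stays intact even when values are redacted.
func Scrub(attrs map[string]any, sensitive map[string]bool) map[string]any {
	out := make(map[string]any, len(attrs))
	for k, v := range attrs {
		if sensitive[k] || secretName.MatchString(k) {
			out[k] = redacted
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	attrs := map[string]any{"db_password": "hunter2", "port": 5432, "endpoint": "db.internal"}
	// "endpoint" is redacted because the state file marked it sensitive.
	clean := Scrub(attrs, map[string]bool{"endpoint": true})
	fmt.Println(clean["db_password"], clean["port"], clean["endpoint"])
}
```

Redacting by attribute name is deliberately conservative: a false positive costs a little diff fidelity, while a false negative leaks a credential.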

## Epic 2: Agent Communication

**Description:** Secure data transmission between the local Go agent and the dd0c SaaS. Handles authentication, mTLS registration, heartbeat signals, and encrypted drift report uploads.

### User Stories

#### Story 2.1: Agent Registration & Authentication

**As a** DevOps Lead, **I want** the agent to securely register itself with the dd0c SaaS using an API key, **so that** my drift data is securely associated with my organization.

* **Acceptance Criteria:**
    * Agent registers via `POST /v1/agents/register` using a static API key.
    * Generates and exchanges mTLS certificates for subsequent requests.
    * Receives configuration details (e.g., poll interval) from the SaaS.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** The agent needs to store the mTLS cert locally or in memory. Implement robust error handling for unauthorized/revoked API keys.

#### Story 2.2: Encrypted Payload Transmission

**As a** Security Lead, **I want** the agent to transmit drift reports over a secure, encrypted channel, **so that** our infrastructure data cannot be intercepted in transit.

* **Acceptance Criteria:**
    * Agent POSTs scrubbed drift reports to `/v1/drift-reports`.
    * Communication enforces TLS 1.3 and uses the established mTLS client certificate.
    * Payload is compressed (gzip) if over a certain threshold.
* **Story Points:** 3
* **Dependencies:** Story 1.4, Story 2.1
* **Technical Notes:** Ensure the Go HTTP client enforces TLS 1.3. Define a strict JSON schema for the `DriftReport` payload.

#### Story 2.3: Agent Heartbeat

**As a** DevOps Lead, **I want** the agent to send regular heartbeats to the SaaS, **so that** I know if the agent crashes or loses connectivity.

* **Acceptance Criteria:**
    * Agent sends a lightweight heartbeat payload every N minutes.
    * Payload includes uptime, memory usage, and events processed.
    * SaaS API logs the heartbeat to track agent health.
* **Story Points:** 2
* **Dependencies:** Story 2.1
* **Technical Notes:** Run the heartbeat in a separate goroutine with a ticker. Handle transient network errors silently.

## Epic 3: Drift Analysis Engine

**Description:** The SaaS-side Node.js/TypeScript processor that ingests drift reports, classifies severity, calculates stack drift scores, and persists events to the database and event store.

### User Stories

#### Story 3.1: Ingestion & Validation Pipeline

**As a** System Operator, **I want** the SaaS to receive and validate drift reports from agents via SQS, **so that** high volumes of reports are processed reliably without dropping data.

* **Acceptance Criteria:**
    * API Gateway routes valid `POST /v1/drift-reports` to an SQS FIFO queue.
    * Event Processor ECS task consumes from the queue.
    * Validates the report payload against a strict JSON schema.
* **Story Points:** 5
* **Dependencies:** Story 2.2
* **Technical Notes:** Use `zod` for payload validation. Ensure message group IDs use `stack_id` to maintain ordering per stack.

#### Story 3.2: Drift Classification

**As an** Infrastructure Engineer, **I want** the SaaS to classify detected drift by severity and category, **so that** I can prioritize critical security issues over cosmetic tag changes.

* **Acceptance Criteria:**
    * Applies YAML-defined classification rules to incoming drift diffs.
    * Tags events as Critical, High, Medium, or Low severity.
    * Tags events with categories (Security, Configuration, Tags, etc.).
* **Story Points:** 3
* **Dependencies:** Story 3.1
* **Technical Notes:** Implement a fast rule evaluation engine. Default unmatched drift to "Medium/Configuration".
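
First-match rule evaluation is enough for V1. A sketch (in Go for consistency with the agent examples, though this component is described as TypeScript; `Rule` fields and the `"*"` wildcard are assumptions about how the YAML rules might deserialize):

```go
package main

import "fmt"

// Rule is one classification entry; "*" wildcards either match field.
type Rule struct {
	ResourceType string // e.g. "aws_security_group"
	Attribute    string // e.g. "ingress"
	Severity     string
	Category     string
}

// Classify returns the first matching rule's labels, falling back to the
// documented default of Medium/Configuration for unmatched drift.
func Classify(rules []Rule, resourceType, attribute string) (severity, category string) {
	for _, r := range rules {
		typeOK := r.ResourceType == "*" || r.ResourceType == resourceType
		attrOK := r.Attribute == "*" || r.Attribute == attribute
		if typeOK && attrOK {
			return r.Severity, r.Category
		}
	}
	return "Medium", "Configuration"
}

func main() {
	rules := []Rule{
		{"aws_security_group", "ingress", "Critical", "Security"},
		{"*", "tags", "Low", "Tags"},
	}
	fmt.Println(Classify(rules, "aws_security_group", "ingress"))
	fmt.Println(Classify(rules, "aws_s3_bucket", "versioning")) // falls to default
}
```

Rule order doubles as precedence, which keeps the YAML readable: specific security rules first, catch-alls last.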

#### Story 3.3: Persistence & Event Sourcing

**As a** Compliance Lead, **I want** every drift detection event to be stored in an immutable log, **so that** I have a reliable audit trail for SOC 2 compliance.

* **Acceptance Criteria:**
    * Appends the raw drift event to DynamoDB (immutable event store).
    * Upserts the current state of the resource in the PostgreSQL `resources` table.
    * Inserts a new record in the PostgreSQL `drift_events` table for open drift.
* **Story Points:** 8
* **Dependencies:** Story 3.2
* **Technical Notes:** Handle database transactions carefully to keep PostgreSQL and DynamoDB in sync. Ensure Row-Level Security (RLS) is applied on all PostgreSQL inserts.

#### Story 3.4: Drift Score Calculation

**As a** DevOps Lead, **I want** the engine to calculate a drift score for each stack, **so that** I have a high-level metric of infrastructure health.

* **Acceptance Criteria:**
    * Updates the `drift_score` field on the `stacks` table after processing a report.
    * Score is out of 100 (100 = completely clean).
    * Weighted penalization based on severity (Critical heavily impacts the score, Low barely impacts it).
* **Story Points:** 3
* **Dependencies:** Story 3.3
* **Technical Notes:** Define a simple but logical weighting algorithm. Run the calculation synchronously during event processing for V1.
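
One simple-but-logical weighting, sketched below. The specific penalty numbers are a product decision, not a spec — they are shown only to make the algorithm concrete:

```go
package main

import "fmt"

// severityPenalty is one possible weighting; the exact values are assumptions.
var severityPenalty = map[string]float64{
	"Critical": 25, "High": 10, "Medium": 3, "Low": 1,
}

// DriftScore starts every stack at 100 (completely clean) and subtracts a
// severity-weighted penalty per open drift event, clamping at 0.
func DriftScore(openSeverities []string) int {
	score := 100.0
	for _, s := range openSeverities {
		score -= severityPenalty[s]
	}
	if score < 0 {
		return 0
	}
	return int(score)
}

func main() {
	fmt.Println(DriftScore(nil))                                // 100: clean stack
	fmt.Println(DriftScore([]string{"Critical", "Low", "Low"})) // 73: one bad event dominates
}
```

With weights like these, four open Critical events zero the score outright — which matches the intent: a stack with open security drift should look alarming, not merely imperfect.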

## Epic 4: Notification Service

**Description:** A Lambda-based service that formats drift events into actionable Slack messages (Block Kit) and handles delivery routing per stack configuration.

### User Stories

#### Story 4.1: Slack Block Kit Formatting

**As an** Infrastructure Engineer, **I want** drift alerts to arrive as rich Slack messages, **so that** I can easily read the diff and context without leaving Slack.

* **Acceptance Criteria:**
    * Lambda function maps drift events to Slack Block Kit JSON.
    * Message includes Stack Name, Resource Address, Timestamp, Severity, and CloudTrail Attribution.
    * Displays a code block showing the `old` vs `new` attribute diff.
* **Story Points:** 5
* **Dependencies:** Story 3.3
* **Technical Notes:** Build a flexible Block Kit template builder. Ensure diffs that are too long are gracefully truncated to fit Slack's block length limits.

#### Story 4.2: Slack Routing & Fanout

**As a** DevOps Lead, **I want** drift alerts to be routed to specific Slack channels based on the stack, **so that** the right team sees the alert without noise.

* **Acceptance Criteria:**
    * Checks the `stacks` table for custom Slack channel overrides.
    * Falls back to the organization's default Slack channel.
    * Sends the formatted message via the Slack API.
* **Story Points:** 3
* **Dependencies:** Story 4.1
* **Technical Notes:** The Event Processor triggers this Lambda asynchronously via SQS (`notification-fanout` queue).

#### Story 4.3: Action Buttons (Revert/Accept)

**As an** Infrastructure Engineer, **I want** action buttons directly on the Slack alert, **so that** I can quickly trigger remediation workflows.

* **Acceptance Criteria:**
    * Slack message includes interactive buttons: `[Revert]`, `[Accept]`, `[Snooze]`, `[Assign]`.
    * Buttons contain the `drift_event_id` in their payload value.
* **Story Points:** 2
* **Dependencies:** Story 4.1
* **Technical Notes:** This story covers only rendering the buttons. Handling the interactive callbacks is covered under the Slack Bot epic and the Remediation Engine (V1 MVP focus).

#### Story 4.4: Notification Batching (Low Severity)

**As an** Infrastructure Engineer, **I want** low and medium severity drift alerts to be batched into a digest, **so that** my Slack channel isn't spammed with noisy tag changes.

* **Acceptance Criteria:**
    * Critical/High alerts are sent immediately.
    * Medium/Low alerts are held in a DynamoDB table or SQS delay queue and dispatched as a daily/hourly digest.
* **Story Points:** 8
* **Dependencies:** Story 4.2
* **Technical Notes:** Use EventBridge Scheduler or a cron Lambda to process and flush the digest queue periodically.

## Epic 5: Dashboard API

**Description:** The REST API powering the React SPA web dashboard. Handles user authentication (Cognito), organization/stack management, and querying the PostgreSQL data for drift history.

### User Stories

#### Story 5.1: API Authentication & RLS Setup

**As a** System Operator, **I want** the API to enforce authentication and isolate tenant data, **so that** a user from one organization cannot see another organization's data.

* **Acceptance Criteria:**
    * Integrates AWS Cognito JWT validation middleware.
    * API sets `app.current_org_id` on the PostgreSQL connection session for Row-Level Security (RLS).
    * Returns `401/403` for unauthorized requests.
* **Story Points:** 5
* **Dependencies:** Database Schema (RLS)
* **Technical Notes:** Build an Express/Node.js middleware layer. Ensure strict parameterization for all SQL queries beyond RLS to avoid injection.

#### Story 5.2: Stack Management Endpoints

**As a** DevOps Lead, **I want** a set of REST endpoints to view and manage my stacks, **so that** I can configure ownership, check connection health, and monitor the drift score.

* **Acceptance Criteria:**
    * Implements `GET /v1/stacks` (list all stacks with their scores and resource counts).
    * Implements `GET /v1/stacks/:id` (stack details).
    * Implements `PATCH /v1/stacks/:id` (update name, owner, Slack channel).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Support basic pagination (`limit`, `offset`) on list endpoints. Include agent heartbeat status in stack details if applicable.

#### Story 5.3: Drift History & Event Queries

**As a** Security Lead, **I want** endpoints to search and filter drift events, **so that** I can review historical drift, find specific changes, and generate audit reports.

* **Acceptance Criteria:**
    * Implements `GET /v1/drift-events` with filters for `stack_id`, `status` (open/resolved), `severity`, and timestamp ranges.
    * Joins the `drift_events` table with `resources` to return full address paths and diff payloads.
* **Story Points:** 5
* **Dependencies:** Story 5.1
* **Technical Notes:** Expose the JSONB `diff` field cleanly in the response payload. Use index-backed PostgreSQL queries for fast filtering.

#### Story 5.4: Policy Configuration Endpoints

**As a** DevOps Lead, **I want** an API to manage remediation policies, **so that** I can customize how the agent reacts to specific resource drift.

* **Acceptance Criteria:**
    * Implements CRUD operations for stack-level and org-level policies (`/v1/policies`).
    * Validates policy configuration payloads (e.g., action type, valid resource expressions).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** For V1, store policies as JSON fields in a simple `policies` table or map them directly to stacks. Keep validation simple (e.g., regex checks on resource types).

## Epic 6: Dashboard UI

**Description:** The React Single Page Application (SPA) providing a web dashboard for stack overview, drift timeline, and a resource-level diff viewer.

### User Stories

#### Story 6.1: Stack Overview Dashboard

**As a** DevOps Lead, **I want** a main dashboard showing all my stacks and their drift scores, **so that** I can assess infrastructure health at a glance.

* **Acceptance Criteria:**
    * Displays a list/table of all monitored stacks.
    * Shows a visual "Drift Score" indicator (0-100) per stack.
    * Sortable by score, name, and last checked timestamp.
    * Provides visual indicators for agent connection status.
* **Story Points:** 5
* **Dependencies:** Story 5.2
* **Technical Notes:** Build with React and Vite. Use a standard UI library (e.g., Tailwind UI or MUI). Implement efficient data fetching (e.g., React Query).

#### Story 6.2: Stack Detail & Drift Timeline

**As an** Infrastructure Engineer, **I want** to click into a stack and see a timeline of drift events, **so that** I can track when things changed and who changed them.

* **Acceptance Criteria:**
    * Shows a chronological list of drift events for the selected stack.
    * Displays open vs. resolved status.
    * Filters for severity and category.
    * Includes CloudTrail attribution data (who, IP, action).
* **Story Points:** 5
* **Dependencies:** Story 5.3, Story 6.1
* **Technical Notes:** Support pagination/infinite scrolling for the timeline. Use clear icons for event types (Security vs. Tags).

#### Story 6.3: Resource-Level Diff Viewer

**As an** Infrastructure Engineer, **I want** to see the exact attribute changes for a drifted resource, **so that** I know exactly how reality differs from my state file.

* **Acceptance Criteria:**
    * Clicking an event opens a detailed view/modal.
    * Renders a code-diff view (red for old state, green for new reality).
    * Clearly marks redacted sensitive values.
* **Story Points:** 5
* **Dependencies:** Story 6.2
* **Technical Notes:** Use a specialized diff viewing component (e.g., `react-diff-viewer`). Ensure it handles large JSON blocks gracefully.

#### Story 6.4: Auth & User Settings

**As a** User, **I want** to manage my account and view my API keys, **so that** I can deploy the agent and access my organization's dashboard.

* **Acceptance Criteria:**
    * Implements login/signup via Cognito (Email/Password & GitHub OAuth).
    * Provides a settings page displaying the organization's static API key.
    * Displays the current subscription plan (Free tier limits for MVP).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Securely manage JWT storage (HttpOnly cookies or secure local storage). Include a clear "copy to clipboard" for the API key.

## Epic 7: Slack Bot

**Description:** The interactive Slack application that handles user commands (`/drift score`) and processes the interactive action buttons (`[Revert]`, `[Accept]`) from drift alerts.

### User Stories

#### Story 7.1: Interactive Remediation Callbacks (Revert)

**As an** Infrastructure Engineer, **I want** clicking `[Revert]` on a Slack alert to trigger a targeted `terraform apply`, **so that** I can fix drift instantly without leaving Slack.

* **Acceptance Criteria:**
    * SaaS API Gateway (`/v1/slack/interactions`) receives the button click payload.
    * Validates the Slack request signature.
    * Generates a scoped `terraform apply -target` command and queues it for the agent.
    * Updates the Slack message to "Reverting...".
* **Story Points:** 8
* **Dependencies:** Story 4.3
* **Technical Notes:** The actual execution happens via the Remediation Engine (ECS Fargate) dispatching commands to the agent. Requires careful state tracking (Pending -> Executing -> Completed).
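
The command-generation step deserves care: the resource address originates from an external callback payload, so it must be validated before it becomes part of a command. A sketch, assuming `-auto-approve` is acceptable for the revert flow (whether V1 instead runs a confirm-plan step first is a product decision):

```go
package main

import (
	"fmt"
	"regexp"
)

// addrPattern accepts Terraform resource addresses such as
// aws_security_group.web or module.net.aws_subnet.private["a"], and rejects
// anything that could smuggle extra arguments into the agent command.
var addrPattern = regexp.MustCompile(`^[A-Za-z0-9_.\[\]"-]+$`)

// RevertCommand builds the scoped argv the SaaS queues for the agent when an
// engineer clicks [Revert]. It returns an argv slice, never a shell string,
// so the address can never be interpreted by a shell.
func RevertCommand(addr string) ([]string, error) {
	if !addrPattern.MatchString(addr) {
		return nil, fmt.Errorf("invalid resource address: %q", addr)
	}
	return []string{"terraform", "apply", "-target=" + addr, "-auto-approve"}, nil
}

func main() {
	cmd, _ := RevertCommand("aws_security_group.web")
	fmt.Println(cmd)
	if _, err := RevertCommand("x; rm -rf /"); err != nil {
		fmt.Println("rejected malicious address")
	}
}
```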

#### Story 7.2: Interactive Acceptance Callbacks (Accept)

**As an** Infrastructure Engineer, **I want** clicking `[Accept]` to auto-generate a PR that updates my Terraform code to match reality, **so that** the drift becomes the new source of truth.

* **Acceptance Criteria:**
    * SaaS generates a code patch representing the new state.
    * Uses the GitHub API to create a branch and open a PR against the target repo.
    * Updates the Slack message with a link to the PR.
* **Story Points:** 8
* **Dependencies:** Story 7.1
* **Technical Notes:** Requires a GitHub App integration (or OAuth token) to create branches/PRs. The patch generation logic needs robust testing.
|
||||||
|
|
||||||
|
#### Story 7.3: Slack Slash Commands
|
||||||
|
**As a** DevOps Lead, **I want** to use `/drift score` and `/drift status <stack>` in Slack, **so that** I can check my infrastructure health on demand.
|
||||||
|
* **Acceptance Criteria:**
|
||||||
|
* `/drift score` returns the aggregate score for the organization.
|
||||||
|
* `/drift status prod-networking` returns the score, open events, and agent health for a specific stack.
|
||||||
|
* Formats output as a clean Slack Block Kit message visible only to the user.
|
||||||
|
* **Story Points:** 5
|
||||||
|
* **Dependencies:** Story 4.1, Story 5.2
|
||||||
|
* **Technical Notes:** Deploy an API Gateway endpoint specifically for Slack slash commands. Validate the token and use the internal Dashboard API logic to fetch scores.
|
||||||
|
|
||||||
|
#### Story 7.4: Snooze & Assign Callbacks
|
||||||
|
**As an** Infrastructure Engineer, **I want** to click `[Snooze 24h]` or `[Assign]`, **so that** I can manage alert noise or delegate investigation to a teammate.
|
||||||
|
* **Acceptance Criteria:**
|
||||||
|
* `[Snooze]` updates the event status to `snoozed` and schedules a wake-up time.
|
||||||
|
* `[Assign]` opens a Slack modal to select a team member, updating the event owner.
|
||||||
|
* The original Slack message updates to reflect the new state/owner.
|
||||||
|
* **Story Points:** 5
|
||||||
|
* **Dependencies:** Story 7.1
|
||||||
|
* **Technical Notes:** Snooze requires a scheduled EventBridge or cron job to un-snooze. Assign requires interacting with Slack's user selection menus.
|
||||||
|
|
||||||
|
|
||||||
|
## Epic 8: Infrastructure & DevOps

**Description:** The underlying cloud resources for the dd0c SaaS and the CI/CD pipelines to build, test, and release the agent and services.

### User Stories

#### Story 8.1: SaaS Infrastructure (Terraform)

**As a** System Operator, **I want** the SaaS infrastructure to be defined as code, **so that** deployments are repeatable and I can dogfood my own drift detection tool.

* **Acceptance Criteria:**
  * Defines VPC, Subnets, ECS Fargate Clusters, RDS PostgreSQL (Multi-AZ), API Gateway, and SQS FIFO queues.
  * Sets up CloudWatch log groups and IAM roles.
  * Uses Terraform for all configuration.
* **Story Points:** 8
* **Dependencies:** Architecture Design Document
* **Technical Notes:** Build a modular Terraform setup. Use the official AWS provider. Include variables for environment separation (staging vs. prod).

#### Story 8.2: CI/CD Pipeline (GitHub Actions)

**As a** Developer, **I want** a fully automated CI/CD pipeline, **so that** code pushed to `main` is linted, tested, built, and deployed to ECS.

* **Acceptance Criteria:**
  * Runs `golangci-lint`, `go test`, ESLint, and Vitest on PRs.
  * Builds multi-stage Docker images for the Event Processor, Dashboard API, and Remediation Engine.
  * Pushes images to ECR and triggers an ECS rolling deploy.
* **Story Points:** 5
* **Dependencies:** Story 8.1
* **Technical Notes:** Use standard GitHub Actions (e.g., `aws-actions/configure-aws-credentials`). Add Trivy for basic container scanning.

#### Story 8.3: Agent Distribution (Releases & Homebrew)

**As an** Open Source User, **I want** to easily download and install the CLI agent, **so that** I can test drift detection locally without building from source.

* **Acceptance Criteria:**
  * Configures GoReleaser to cross-compile binaries for Linux/macOS/Windows (amd64/arm64).
  * Auto-publishes GitHub Releases when a new tag is pushed.
  * Creates a custom Homebrew tap (`brew install dd0c/tap/drift-cli`).
* **Story Points:** 5
* **Dependencies:** Story 1.1
* **Technical Notes:** Create a dedicated `.github/workflows/release.yml` for GoReleaser.

#### Story 8.4: Agent Terraform Module Publication

**As a** DevOps Lead, **I want** a pre-built Terraform module to deploy the agent in my AWS account, **so that** I don't have to manually configure ECS tasks and EventBridge rules.

* **Acceptance Criteria:**
  * Creates the `dd0c/drift-agent/aws` Terraform module.
  * Provisions an ECS Task, EventBridge rules, SQS, and IAM roles for the customer.
  * Publishes the module to the public Terraform Registry.
* **Story Points:** 8
* **Dependencies:** Story 8.3
* **Technical Notes:** Adhere to Terraform Registry best practices. Ensure the `README.md` clearly explains the required `dd0c_api_key` and state bucket variables.

## Epic 9: Onboarding & PLG (Product-Led Growth)

**Description:** The self-serve funnel that guides users from CLI installation to their first drift alert in under 5 minutes, plus billing and tier management.

### User Stories

#### Story 9.1: Self-Serve Signup & CLI Login

**As an** Engineer, **I want** to easily sign up for the free tier via the CLI, **so that** I don't have to fill out sales forms to test the product.

* **Acceptance Criteria:**
  * Running `drift auth login` opens a browser to an OAuth flow (GitHub/Email).
  * The CLI spins up a local web server to catch the callback token.
  * Successfully provisions an organization and user account in the SaaS.
* **Story Points:** 5
* **Dependencies:** Story 5.4, Story 8.3
* **Technical Notes:** The callback server should listen on `localhost` with a random or standard port (e.g., `8080`).
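
A minimal sketch of that loopback callback server in Go. The `/callback` route and `token` query parameter are assumptions for illustration, not the real dd0c endpoint; passing port `0` lets the OS assign a free port, which sidesteps conflicts on `8080`:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
)

// startCallbackServer listens on loopback (port 0 = OS-assigned) and delivers
// the first "token" query parameter it sees, mimicking the browser redirect
// at the end of `drift auth login`.
func startCallbackServer(port int) (addr string, tokens <-chan string, stop func(), err error) {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return "", nil, nil, err
	}
	ch := make(chan string, 1)
	mux := http.NewServeMux()
	mux.HandleFunc("/callback", func(w http.ResponseWriter, r *http.Request) {
		select {
		case ch <- r.URL.Query().Get("token"):
		default: // ignore duplicate callbacks
		}
		fmt.Fprintln(w, "Login complete. You can close this tab.")
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	return ln.Addr().String(), ch, func() { srv.Shutdown(context.Background()) }, nil
}

// simulateRedirect stands in for the browser hitting the callback URL.
func simulateRedirect(addr, token string) error {
	resp, err := http.Get("http://" + addr + "/callback?token=" + token)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	addr, tokens, stop, err := startCallbackServer(0)
	if err != nil {
		panic(err)
	}
	defer stop()
	if err := simulateRedirect(addr, "demo-token"); err != nil {
		panic(err)
	}
	fmt.Println(<-tokens) // demo-token
}
```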

#### Story 9.2: Auto-Discovery (`drift init`)

**As an** Infrastructure Engineer, **I want** the CLI to auto-discover my Terraform state files, **so that** I can configure my first stack without typing S3 ARNs.

* **Acceptance Criteria:**
  * `drift init` scans the current directory for `*.tf` files.
  * Uses default AWS credentials to query S3 buckets matching common state file patterns.
  * Prompts the user to register discovered stacks to their organization.
* **Story Points:** 8
* **Dependencies:** Story 9.1
* **Technical Notes:** Implement robust fallback to manual input if discovery fails.

#### Story 9.3: Free Tier Enforcement (1 Stack)

**As a** Product Manager, **I want** to enforce a free tier limit of 1 stack, **so that** users get value but are incentivized to upgrade for larger infrastructure needs.

* **Acceptance Criteria:**
  * The API rejects attempts to register more than 1 stack on the Free plan.
  * The Dashboard clearly shows "1/1 Stacks Used".
  * The CLI prompts "Upgrade to Starter ($49/mo)" when trying to add a second stack.
* **Story Points:** 3
* **Dependencies:** Story 9.1
* **Technical Notes:** Enforce limits securely at the API level (e.g., `POST /v1/stacks` should return a `403 Stack Limit` error).
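
The server-side guard is trivial but must be the source of truth (the dashboard "1/1" indicator is advisory only). A sketch; only the Free limit of 1 comes from the story, the Starter/Pro caps are illustrative placeholders:

```go
package main

import "fmt"

// Per-plan stack caps. Only the Free limit (1) is specified by Story 9.3;
// the Starter/Pro numbers are hypothetical.
var stackLimits = map[string]int{"free": 1, "starter": 10, "pro": 100}

// canRegisterStack is the guard POST /v1/stacks runs before inserting a row;
// a false result maps to the 403 "Stack Limit" error in the technical notes.
func canRegisterStack(plan string, currentStacks int) bool {
	limit, ok := stackLimits[plan]
	if !ok {
		return false // unknown plan: fail closed
	}
	return currentStacks < limit
}

func main() {
	fmt.Println(canRegisterStack("free", 0)) // true: first stack is allowed
	fmt.Println(canRegisterStack("free", 1)) // false: second stack triggers the upgrade prompt
}
```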

#### Story 9.4: Stripe Billing Integration

**As a** Solo Founder, **I want** customers to upgrade to paid tiers with a credit card, **so that** I can capture revenue without a sales process.

* **Acceptance Criteria:**
  * Integrates Stripe Checkout for the Starter ($49/mo) and Pro ($149/mo) tiers.
  * Dashboard provides a billing management portal (Stripe Customer Portal).
  * Webhooks listen for successful payments and update the organization's `plan` field in PostgreSQL.
* **Story Points:** 8
* **Dependencies:** Story 5.2
* **Technical Notes:** Must include Stripe webhook signature verification to prevent spoofed upgrades. Store `stripe_customer_id` on the `organizations` table.
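
Stripe's documented scheme signs `"<timestamp>.<raw body>"` with HMAC-SHA256 and sends it in the `Stripe-Signature: t=...,v1=...` header. A stdlib-only sketch of that check (in practice you would use `webhook.ConstructEvent` from the official `stripe-go` library, which also enforces a timestamp tolerance against replayed events):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// signStripe reproduces Stripe's signed payload: HMAC-SHA256 over "<t>.<body>".
func signStripe(secret, ts, body string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(ts + "." + body))
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyStripeSignature checks a "Stripe-Signature: t=...,v1=..." header.
// Multiple v1 entries can appear during secret rotation; any match passes.
func verifyStripeSignature(secret, header, body string) bool {
	var ts string
	var candidates []string
	for _, part := range strings.Split(header, ",") {
		key, val, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok {
			continue
		}
		switch key {
		case "t":
			ts = val
		case "v1":
			candidates = append(candidates, val)
		}
	}
	expected := signStripe(secret, ts, body)
	for _, sig := range candidates {
		if hmac.Equal([]byte(expected), []byte(sig)) {
			return true
		}
	}
	return false
}

func main() {
	secret, ts := "whsec_example", "1700000000" // placeholder secret
	body := `{"type":"checkout.session.completed"}`
	header := "t=" + ts + ",v1=" + signStripe(secret, ts, body)
	fmt.Println(verifyStripeSignature(secret, header, body))            // true
	fmt.Println(verifyStripeSignature(secret, header, body+"tampered")) // false
}
```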

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/drift adheres to the 5 Transparent Factory architectural tenets. For a drift detection product, these tenets are especially critical — the tool that detects infrastructure drift must itself be immune to uncontrolled drift.

### Story 10.1: Atomic Flagging — Feature Flags for Detection Behaviors

**As a** solo founder, **I want** every new drift detection rule, remediation action, and notification behavior wrapped in a feature flag (default: off), **so that** I can ship new detection capabilities without accidentally triggering false-positive alerts for customers.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the Go agent. V1 provider: env-var or JSON file-based (no external service).
- All flags evaluate locally — no network calls during drift scan execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if any flag at 100% rollout exceeds TTL.
- Automated circuit breaker: if a flagged detection rule generates >3x the baseline false-positive rate over 1 hour, the flag auto-disables.
- Flags required for: new IaC provider support (Terraform/Pulumi/CDK), remediation suggestions, Slack notification formats, scan scheduling changes.

**Estimate:** 5 points

**Dependencies:** Epic 1 (Agent Core)

**Technical Notes:**

- Use Go OpenFeature SDK (`go.openfeature.dev/sdk`). JSON file provider for V1.
- Circuit breaker: track false-positive dismissals per rule in Redis. If dismissal rate spikes, disable the flag.
- Flag audit: `make flag-audit` lists all flags with TTL status.

### Story 10.2: Elastic Schema — Additive-Only for Drift State Storage

**As a** solo founder, **I want** all DynamoDB and state file schema changes to be strictly additive, **so that** agent rollbacks never corrupt drift history or lose customer scan results.

**Acceptance Criteria:**

- CI lint rejects any migration or schema change that removes, renames, or changes the type of existing DynamoDB attributes.
- New attributes use a `_v2` suffix when breaking changes are needed. Old attributes remain readable.
- Go structs use `json:",omitempty"` and ignore unknown fields so V1 agents can read V2 state files without crashing.
- Dual-write enforced during migration windows: agent writes to both old and new attribute paths in the same DynamoDB `TransactWriteItems` call.
- Every schema change includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.

**Estimate:** 3 points

**Dependencies:** Epic 2 (State Management)

**Technical Notes:**

- DynamoDB Single Table Design: version items with a `_v` attribute. Agent code uses a factory to select the correct model.
- For Terraform state parsing, use `encoding/json` and do not call `Decoder.DisallowUnknownFields()` — tolerating unknown fields is the default, and that tolerance absorbs upstream state format changes.
- S3 state snapshots: never overwrite — always write new versioned keys.
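
The forward-compatibility claim is worth seeing concretely: `encoding/json` silently drops fields the target struct does not declare, so a rolled-back V1 binary reads a V2 item without error. Field names below are illustrative, not the real dd0c schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// V1 view of a drift event item. encoding/json ignores unknown fields by
// default (strictness is opt-in via Decoder.DisallowUnknownFields), so a
// rolled-back V1 agent can still read items written by a V2 agent.
type DriftEventV1 struct {
	ResourceID string `json:"resource_id"`
	Severity   string `json:"severity,omitempty"`
}

// parseV1 decodes an item as the V1 model, tolerating newer attributes.
func parseV1(data []byte) (DriftEventV1, error) {
	var ev DriftEventV1
	err := json.Unmarshal(data, &ev)
	return ev, err
}

func main() {
	// Additive V2 write: severity_v2 is a new attribute; severity is untouched.
	v2Item := []byte(`{"resource_id":"sg-123","severity":"high","severity_v2":{"score":8}}`)
	ev, err := parseV1(v2Item)
	if err != nil {
		panic(err)
	}
	fmt.Println(ev.ResourceID, ev.Severity) // sg-123 high
}
```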

### Story 10.3: Cognitive Durability — Decision Logs for Detection Logic

**As a** future maintainer, **I want** every change to drift detection algorithms, severity scoring, or remediation logic accompanied by a `decision_log.json`, **so that** I understand why a particular drift pattern is flagged as critical vs. informational.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log entry for PRs touching `pkg/detection/`, `pkg/scoring/`, or `pkg/remediation/`.
- Cyclomatic complexity cap of 10 enforced via `golangci-lint` with the `gocyclo` linter. PRs exceeding this are blocked.
- Decision logs committed in `docs/decisions/`, one per significant logic change.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- PR template includes decision log fields as a checklist.
- For drift scoring changes: document why specific thresholds were chosen (e.g., "security group changes scored critical because X% of breaches start with SG drift").
- `golangci-lint` config: `.golangci.yml` enabling `gocyclo` with `min-complexity: 10` (gocyclo's threshold setting is spelled `min-complexity`, not `max-complexity`).
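
A sketch of the corresponding `.golangci.yml` fragment. Note that golangci-lint's gocyclo setting is `min-complexity`: it reports every function whose cyclomatic complexity meets or exceeds the threshold, which is how the "cap of 10" is enforced in practice:

```yaml
# .golangci.yml (fragment): fail the lint run on overly complex functions
linters:
  enable:
    - gocyclo
linters-settings:
  gocyclo:
    # Report functions with cyclomatic complexity over 10.
    min-complexity: 10
```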

### Story 10.4: Semantic Observability — AI Reasoning Spans on Drift Classification

**As an** SRE debugging a missed drift alert, **I want** every drift classification decision to emit an OpenTelemetry span with structured reasoning metadata, **so that** I can trace why a specific resource change was scored as low-severity when it should have been critical.

**Acceptance Criteria:**

- Every drift scan emits a parent `drift_scan` span. Each resource comparison emits a child `drift_classification` span.
- Span attributes: `drift.resource_type`, `drift.severity_score`, `drift.classification_reason`, `drift.alternatives_considered` (e.g., "considered critical but downgraded because tag-only change").
- If AI-assisted classification is used (future): `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score` included.
- Spans export via OTLP to any compatible backend.
- No PII or customer infrastructure details in spans — resource ARNs are hashed.

**Estimate:** 3 points

**Dependencies:** Epic 1 (Agent Core)

**Technical Notes:**

- Use `go.opentelemetry.io/otel` with the OTLP exporter.
- For V1 without AI classification, the `drift.classification_reason` is the rule name + threshold that triggered.
- ARN hashing: SHA-256 truncated to 12 chars for correlation without exposure.
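
The hashing rule fits in one stdlib function: the truncated digest is stable (so spans from different scans correlate on the same resource) but does not expose the customer's actual identifier. A sketch:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashARN anonymizes a resource ARN for span attributes: SHA-256, hex-encoded,
// truncated to 12 characters. Deterministic for cross-scan correlation, but
// not reversible to the customer's resource identifier.
func hashARN(arn string) string {
	sum := sha256.Sum256([]byte(arn))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	a := hashARN("arn:aws:ec2:us-east-1:123456789012:security-group/sg-0abc")
	b := hashARN("arn:aws:ec2:us-east-1:123456789012:security-group/sg-0abc")
	fmt.Println(len(a), a == b) // 12 true
}
```

One design caveat: ARNs have limited entropy, so a determined attacker who can guess account IDs could brute-force hashes; if that matters, switch to a keyed HMAC with a per-org secret.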

### Story 10.5: Configurable Autonomy — Governance for Auto-Remediation

**As a** solo founder, **I want** a `policy.json` that controls whether the agent can auto-remediate drift or only report it, **so that** customers maintain full control over what the tool is allowed to change in their infrastructure.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (report-only, no remediation) or `audit` (auto-remediate with logging).
- Agent checks policy before every remediation action. In `strict` mode, remediation suggestions are logged but never executed.
- `panic_mode`: when true, agent stops all scans immediately, preserves last-known-good state, and sends a single "paused" notification.
- Per-customer policy override: customers can set their own governance mode via config, which is always more restrictive than the system default (never less).
- All policy decisions logged: "Remediation blocked by strict mode for resource X", "Auto-remediation applied in audit mode".

**Estimate:** 3 points

**Dependencies:** Epic 3 (Remediation Engine)

**Technical Notes:**

- `policy.json` in repo root, loaded at startup, watched via `fsnotify`.
- Customer-level overrides stored in the DynamoDB `org_settings` item. Merge logic: `min(system_policy, customer_policy)` — the customer can only be MORE restrictive.
- Panic mode trigger: `POST /admin/panic` or env var `DD0C_PANIC=true`. Agent drains the current scan and halts.
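
The `min(system_policy, customer_policy)` merge is easiest to express by ranking modes by restrictiveness and keeping the stricter one. A sketch under the assumption that only the two modes above exist:

```go
package main

import "fmt"

// Governance modes ranked by restrictiveness: strict (report-only) outranks
// audit (auto-remediate with logging). Unknown modes rank 0, so a malformed
// customer override can never loosen the system default.
var restrictiveness = map[string]int{"audit": 1, "strict": 2}

// effectiveMode merges the system default with a customer override:
// the customer can only make policy MORE restrictive, never less.
func effectiveMode(system, customer string) string {
	if restrictiveness[customer] > restrictiveness[system] {
		return customer
	}
	return system
}

func main() {
	fmt.Println(effectiveMode("audit", "strict")) // strict: customer tightened policy
	fmt.Println(effectiveMode("strict", "audit")) // strict: loosening attempt ignored
}
```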

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |

---

`products/02-iac-drift-detection/innovation-strategy/session.md`

|
# dd0c/drift — Disruptive Innovation Strategy

**Strategist:** Victor, Former McKinsey Partner & Startup Advisor
**Date:** February 28, 2026
**Product:** dd0c/drift — IaC Drift Detection & Auto-Remediation SaaS
**Verdict:** Conditional GO — with caveats that will make you uncomfortable.

---

> *"The graveyard of DevOps startups is filled with companies that built better mousetraps for mice that had already moved to a different house. Let's make sure the mice are still here."* — Victor

---

## Section 1: MARKET LANDSCAPE

### 1.1 Competitive Analysis — The Battlefield

Let me be precise about who you're fighting, because half of these "competitors" are actually your best friends.

#### Tier 1: The Platforms (Your Real Competition)

**Spacelift** — $40M+ raised. Series B. ~200 employees.

- Pricing: Starts ~$40/month for the Cloud tier (limited). Business tier is custom/enterprise — realistically $500-$2,000+/mo for meaningful usage. Drift detection requires at minimum the Starter+ tier with private workers.
- Drift detection is a *feature*, not the product. It's buried inside their IaC management platform.
- Strength: Deep Terraform integration, policy-as-code, mature RBAC.
- Weakness: Requires full workflow migration. You don't just "add" Spacelift — you rebuild your CI/CD around it. That's a 2-month project minimum.
- Vulnerability: They CANNOT price down to $29/stack without destroying their enterprise ACV. Their sales motion requires $30K+ deals to justify the CAC.
**env0** — $28M+ raised. Series A (large). ~100 employees.

- Pricing: A free tier exists but is crippled. Paid starts ~$35/user/month. For a team of 10 managing 50 stacks, you're looking at $350-$500+/mo before enterprise features.
- Drift detection: Available but secondary to their "Environment as a Service" positioning.
- Strength: Good self-service onboarding, cost estimation features, OpenTofu support.
- Weakness: Trying to be everything — cost management, drift, policy, self-service. Jack of all trades.
- Vulnerability: Same platform migration problem as Spacelift. And their per-user pricing punishes growing teams.

**Terraform Cloud / HCP Terraform** — HashiCorp (IBM). Infinite resources.

- Pricing: Free for up to 500 managed resources. Plus tier at $0.00014/hr per resource (~$1.23/resource/year). Gets expensive fast at scale. Enterprise is custom.
- Drift detection: Exists since 2023. Runs health assessments on a schedule. Terraform-only (obviously). No OpenTofu. No Pulumi.
- Strength: It's HashiCorp. Native integration. Brand trust.
- Weakness: The IBM acquisition created a mass exodus to OpenTofu. Pricing changes alienated the community. The BSL license killed goodwill. Drift detection is basic — no remediation workflows, no Slack-native experience.
- Vulnerability: The HashiCorp-to-OpenTofu migration is YOUR recruiting ground. Every team that leaves TFC needs drift detection and won't go back.

**Pulumi Cloud** — $97M+ raised.

- Pricing: Free for individuals. Team at $50/month for 3 users. Enterprise custom.
- Drift detection: `pulumi refresh` exists but it's manual. No continuous monitoring. No alerting.
- Strength: Modern language support (TypeScript, Python, Go). Developer-loved.
- Weakness: Pulumi-only. Small market share vs Terraform/OpenTofu. Drift is an afterthought.
- Vulnerability: Not a competitor — they're a potential integration partner. Support Pulumi stacks and you capture their underserved users.
#### Tier 2: The Dead and Dying

**driftctl (Snyk)** — DEAD.

- Acquired by Snyk in 2022. Effectively abandoned. Last meaningful development: ancient history. The README still says "beta."
- This is your single greatest market gift. driftctl proved demand exists. Snyk proved that big companies don't care about drift enough to maintain an OSS tool. The community is orphaned and actively looking for a replacement.
- Every GitHub issue on driftctl asking "is this project dead?" is a lead for dd0c/drift.

#### Tier 3: Adjacent Players

**Firefly.ai** — $23M+ raised. Israeli startup.

- Positioning: "Cloud Asset Management" — broader than drift. They do inventory, codification (turning unmanaged resources into IaC), drift detection, and policy.
- Pricing: Enterprise-only. "Contact Sales." Minimum $1,000+/mo based on G2 reviews and AWS Marketplace listings.
- Strength: Comprehensive cloud visibility. Good at the "unmanaged resources" problem.
- Weakness: Enterprise sales motion. No self-serve. No PLG. A 5-person startup can't even get a demo without pretending to be bigger than they are.
- Vulnerability: They're selling to CISOs, not engineers. You're selling to the engineer with a credit card. Different buyer, different motion, no conflict.

**Digger** — Open-source Terraform CI/CD.

- Positioning: "Terraform CI/CD that runs in your CI." Open-source core.
- Drift detection: Basic. Not their focus.
- Strength: Runs in your existing CI (GitHub Actions, GitLab CI). No separate platform.
- Weakness: Small team, limited features, drift is a checkbox not a product.
- Vulnerability: Potential partner, not competitor. Digger users need drift detection. You provide it.

**ControlMonkey** — Enterprise IaC management.

- Pricing: Enterprise-only. $50K+ annual contracts.
- Irrelevant to your beachhead. They're playing a different game at a different price point.
#### Competitive Summary

| Player | Drift Detection | Pricing | Self-Serve | Multi-IaC | Your Advantage |
|--------|----------------|---------|------------|-----------|----------------|
| Spacelift | Feature (good) | $500+/mo | No (sales) | Terraform, OpenTofu, Pulumi, CFN | 17x cheaper, no migration |
| env0 | Feature (basic) | $350+/mo | Partial | Terraform, OpenTofu | Focused tool, not platform |
| HCP Terraform | Feature (basic) | Variable | Yes | Terraform only | Multi-IaC, no vendor lock-in |
| Pulumi Cloud | Manual only | $50+/mo | Yes | Pulumi only | Continuous, multi-IaC |
| driftctl | Dead | Free (OSS) | N/A | Terraform | You exist. They don't. |
| Firefly | Feature (good) | $1,000+/mo | No (sales) | Multi-IaC | 34x cheaper, PLG |
| Digger | Basic | Free (OSS) | N/A | Terraform | Dedicated product |
| **dd0c/drift** | **Core product** | **$29/stack** | **Yes** | **TF + OTF + Pulumi** | **—** |
### 1.2 Market Sizing

Let me be honest about the numbers, because most market sizing is fiction dressed in a suit.

**TAM (Total Addressable Market) — IaC Management & Governance**

- The global IaC market is projected at $2.5-$3.5B by 2027 (various analyst reports, growing 25-30% CAGR).
- But that includes everything: provisioning, CI/CD, policy, cost management. Drift detection is a slice.
- Realistic TAM for "IaC drift detection and remediation" specifically: **$800M-$1.2B** by 2027. This includes the drift features embedded in platforms like Spacelift/env0 plus standalone tools.

**SAM (Serviceable Addressable Market) — Teams Using Terraform/OpenTofu/Pulumi Who Need Drift Detection**

- HashiCorp reported 3,500+ enterprise customers and millions of Terraform users before the IBM acquisition.
- Conservatively, 150,000-200,000 organizations actively use Terraform/OpenTofu in production.
- Of those, ~60% (90,000-120,000) have 10+ stacks and experience meaningful drift.
- At $29/stack/month, with an average of 20 stacks per org: $29 × 20 stacks × 100,000 orgs × 12 months ≈ **$696M/year SAM**.
- More conservatively (only teams with 10-100 stacks, excluding enterprises that will buy Spacelift anyway): **$200-$400M SAM**.

**SOM (Serviceable Obtainable Market) — What You Can Realistically Capture in 24 Months**

- As a solo founder with a PLG motion, targeting SMB/mid-market teams (5-50 engineers, 10-100 stacks).
- Realistic first-year target: 200-500 paying customers.
- At an average $145/mo (5 stacks × $29): **$350K-$870K ARR in Year 1**.
- Year 2 with expansion and word-of-mouth: **$1.5M-$3M ARR**.
- SOM: **$3-$5M** in the 24-month horizon.

**Brian, here's the uncomfortable truth:** The SOM is real but modest. This isn't a venture-scale market as a standalone product. It's a wedge. The dd0c platform strategy (route + cost + alert + drift + portal) is what makes this a $50M+ opportunity. Drift alone is a $3-5M ARR business. That's a great lifestyle business or a great wedge into a bigger platform play. It is NOT a standalone unicorn.
### 1.3 Timing — Why NOW

Four forces are converging that make February 2026 the optimal entry window:

**1. The HashiCorp Exodus (2024-2026)**

IBM's acquisition of HashiCorp and the BSL license change triggered the largest migration event in IaC history. OpenTofu adoption is accelerating. Teams migrating from Terraform Cloud to OpenTofu + GitHub Actions lose their (mediocre) drift detection. They need a replacement. They're actively searching. RIGHT NOW.

**2. driftctl's Death Created a Vacuum**

driftctl was the only focused, open-source drift detection tool. Snyk killed it. The community is orphaned. GitHub issues, Reddit threads, and HN comments are filled with "what do I use instead of driftctl?" There is no answer. You ARE the answer.

**3. IaC Adoption Hit Mainstream (2024-2025)**

IaC is no longer a practice of elite DevOps teams. It's standard. Mid-market companies with 20-50 engineers now have 30+ Terraform stacks. They've graduated from "learning IaC" to "suffering from IaC at scale." Drift is the #1 pain point of IaC at scale. The market of sufferers just 10x'd.

**4. Multi-Tool Reality**

Teams no longer use just Terraform. They use Terraform AND OpenTofu AND Pulumi AND CloudFormation (legacy). No existing tool handles drift across all of them. The first tool that does owns the "Switzerland" position.
### 1.4 Regulatory & Trend Tailwinds

**SOC 2 Type II** — Now table stakes for any B2B SaaS. Auditors are increasingly asking: "How do you ensure your infrastructure matches your declared configuration?" The answer "we run terraform plan sometimes" is no longer acceptable. Continuous drift detection is becoming a compliance requirement, not a nice-to-have.

**HIPAA / HITRUST** — Healthcare SaaS companies managing PHI need to prove infrastructure configurations haven't been tampered with. Drift detection = continuous compliance evidence.

**PCI DSS 4.0** — Effective March 2025. Requirement 1.2.5 requires documentation and review of all allowed services, protocols, and ports. Drift in security groups is now a PCI finding, not just an operational annoyance.

**FedRAMP / StateRAMP** — Government cloud compliance frameworks increasingly require continuous monitoring of configuration state. Drift detection maps directly to NIST 800-53 CM-3 (Configuration Change Control) and CM-6 (Configuration Settings).

**Cyber Insurance** — Insurers are asking more detailed questions about infrastructure configuration management. Companies with continuous drift detection get better rates. This is an emerging but real purchasing driver.

**The Net Effect:** Compliance is transforming drift detection from "engineering nice-to-have" to "business requirement." Diana (the compliance persona from your design thinking session) isn't just a user — she's the budget unlocker. When the auditor says "you need this," the CFO writes the check.

---
## Section 2: COMPETITIVE POSITIONING

### 2.1 Blue Ocean Strategy Canvas

The Blue Ocean framework asks: where are all competitors investing heavily, and where is nobody investing at all? The blue ocean is the uncontested space.

**Factors of Competition in IaC Management:**

```
Factor                    | Spacelift | env0 | TFC  | Firefly | dd0c/drift
--------------------------|-----------|------|------|---------|-----------
Platform Breadth          | 9         | 8    | 8    | 8       | 2
Enterprise Features       | 9         | 7    | 9    | 9       | 2
Drift Detection Depth     | 6         | 4    | 3    | 7       | 10
Remediation Workflows     | 5         | 3    | 2    | 6       | 9
Self-Serve Onboarding     | 3         | 5    | 6    | 2       | 10
Time-to-Value             | 3         | 4    | 5    | 2       | 10
Price Accessibility       | 2         | 3    | 4    | 1       | 10
Multi-IaC Support         | 7         | 6    | 2    | 7       | 8
Slack-Native Experience   | 4         | 3    | 2    | 3       | 10
Compliance Reporting      | 5         | 4    | 5    | 7       | 8
CI/CD Orchestration       | 9         | 8    | 9    | 4       | 0
Policy-as-Code            | 8         | 6    | 7    | 8       | 0
Cost Management           | 3         | 7    | 3    | 5       | 0
```
**The Blue Ocean:** dd0c/drift deliberately scores ZERO on CI/CD orchestration, policy-as-code, and cost management. This is not weakness — it's strategy. Every competitor is fighting over the same red ocean of "IaC platform" features. dd0c/drift creates blue ocean by:

1. **Eliminating** platform features entirely (no CI/CD, no policy engine, no cost tools)
2. **Raising** drift detection and remediation to 10/10 — making it the core product, not a feature
3. **Creating** Slack-native remediation (nobody does this well) and 60-second onboarding (nobody does this at all)
4. **Reducing** price by 17x, making procurement irrelevant (credit card purchase, not enterprise sales)

The strategic canvas shows a clear "crossing pattern" — dd0c/drift's value curve is the INVERSE of every competitor. Where they're high, you're low. Where they're low, you're high. This is textbook Blue Ocean. You're not competing. You're redefining.
### 2.2 Porter's Five Forces

**1. Threat of New Entrants: MEDIUM-HIGH**

- Low technical barriers. Any competent engineer can build a cron job that runs `terraform plan`. The detection engine is not rocket science.
- BUT: the product layer (UX, Slack integration, remediation workflows, compliance reporting, multi-IaC support) is 10x harder than the detection engine. The moat is product, not technology.
- Cloud providers could build native drift detection (AWS Config already does it for CloudFormation). But they won't optimize for third-party IaC tools — it's against their strategic interest.
- Verdict: Easy to enter, hard to win. Your moat is speed-to-market + product quality + community.
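
The first bullet is literal: a cron-style drift probe is a few lines. A minimal Python sketch, assuming a local `terraform` binary; the `-detailed-exitcode` flag makes `plan` exit 0 for no changes, 1 on error, and 2 when changes (i.e. drift) are present:

```python
import subprocess

# Map `terraform plan -detailed-exitcode` exit codes to a drift status:
# 0 = no changes, 1 = error, 2 = changes (drift) present.
def plan_exit_to_status(code: int) -> str:
    return {0: "clean", 1: "error", 2: "drifted"}.get(code, "unknown")

def check_stack(stack_dir: str) -> str:
    """One cron-style drift check for a single stack directory."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=stack_dir, capture_output=True, text=True,
    )
    return plan_exit_to_status(result.returncode)
```

That script is the easy 10%. The second bullet is the point: everything wrapped around this loop (alerting, remediation, reporting) is where the real product lives.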

**2. Bargaining Power of Buyers: HIGH**

- Buyers have alternatives: do nothing (run `terraform plan` manually), use platform features (Spacelift/env0), or build internal tooling.
- Switching costs are low for a drift detection tool — it's read-only access to state files and cloud APIs.
- Price sensitivity is high for SMB/mid-market (your target). They'll compare $29/stack to "free" (manual process) and need to see clear ROI.
- Verdict: You must demonstrate ROI in the first 5 minutes. The "Drift Cost Calculator" content play is essential — show them what drift is costing them BEFORE they sign up.

**3. Bargaining Power of Suppliers: LOW**

- Your "suppliers" are cloud provider APIs (AWS, Azure, GCP) and IaC tool formats (Terraform state, Pulumi state).
- These are open/standardized. No supplier can cut you off.
- The one risk: HashiCorp could make Terraform state format proprietary or restrict access. But the OpenTofu fork means this is a self-defeating move. And the BSL license already pushed the community away.
- Verdict: No supplier risk. You're reading open formats and public APIs.

**4. Threat of Substitutes: HIGH**

- The primary substitute is "do nothing" — manual `terraform plan` runs, tribal knowledge, and hope.
- Secondary substitute: build internal tooling. Many teams have a bash script or GitHub Action that approximates drift detection.
- Tertiary substitute: platform features in Spacelift/env0/TFC that offer "good enough" drift detection as part of a broader purchase.
- Verdict: Your biggest competitor is inertia. The product must be SO easy to adopt and SO immediately valuable that "do nothing" feels irresponsible. The 60-second onboarding is not a nice-to-have — it's existential.

**5. Competitive Rivalry: MEDIUM**

- No one is competing on drift detection as a primary product. Everyone treats it as a feature.
- The "focused drift detection tool" category has exactly one dead player (driftctl) and zero live ones.
- Rivalry will increase if dd0c/drift proves the market — Spacelift/env0 will invest more in their drift features, and new entrants will appear.
- Verdict: You have a 12-18 month window of low rivalry. Use it to build market position and community before the platforms respond.

**Porter's Overall Assessment:** The industry structure is favorable for a focused entrant. High buyer power and high substitute threat mean you must nail the value proposition and onboarding. But low supplier power and medium rivalry give you room to establish position. The key strategic imperative: **move fast, build community, create switching costs through integrations and data before the platforms wake up.**

### 2.3 Value Curve vs. Competitors

The value proposition differs by competitor:

**vs. Spacelift:** "You don't need to migrate your entire CI/CD pipeline to detect drift. dd0c/drift plugs into your existing workflow in 60 seconds. It costs $29/stack instead of $500+/month. And it does drift detection better because that's ALL it does."

**vs. env0:** "env0 is a platform that happens to detect drift. dd0c/drift is a drift detection tool that does nothing else — and does it 10x better. No per-user pricing that punishes growing teams. No platform migration. Just drift detection that works."

**vs. Terraform Cloud:** "You left TFC because of the BSL license and IBM pricing. Why would you go back for mediocre drift detection? dd0c/drift works with Terraform AND OpenTofu AND Pulumi. It's the drift detection TFC should have built but never will."

**vs. Firefly:** "Firefly is enterprise cloud asset management for companies with $50K+ budgets and 6-month procurement cycles. dd0c/drift is for the team that needs drift detection TODAY, for $29/stack, with a credit card. No sales calls. No demos. No SOWs."

**vs. "Do Nothing" (Manual terraform plan):** "You're already running terraform plan manually. You know it doesn't scale. You know things drift between checks. You know the cron job broke 3 weeks ago and nobody noticed. dd0c/drift is the productized version of what you're already trying to do — except it actually works, runs continuously, alerts you in Slack, and lets you fix drift in one click."

### 2.4 Solo Founder Advantages

Brian, being a solo founder against $40M-funded competitors sounds suicidal. It's actually your greatest strategic advantage. Here's why:

**1. The 17x Price Advantage Is Structural, Not Temporary**

Spacelift has ~200 employees. At $150K average fully-loaded cost, that's $30M/year in payroll. They NEED $30K+ enterprise deals to survive. They literally cannot sell a $29/stack product — the unit economics don't support their cost structure. Your cost structure is you, a laptop, and AWS credits. You can profitably serve customers they can't afford to talk to.

**2. Focus Is a Weapon**

Spacelift's product team is split across CI/CD, policy, drift, blueprints, contexts, worker pools, and 50 other features. Their drift detection gets maybe 5% of engineering attention. Your drift detection gets 100%. You will always be 6-12 months ahead on drift-specific features because it's your ONLY product.

**3. Speed of Decision**

You can ship a feature in a day. Spacelift needs a product review, design review, engineering sprint, QA cycle, and release train. When a customer asks for OpenTofu support, you ship it Thursday. They ship it next quarter. In a market where the IaC landscape is shifting rapidly (OpenTofu, Pulumi growth, AI-generated IaC), speed of adaptation is survival.

**4. Authenticity**

You're a senior AWS architect who has LIVED the drift problem. You're not a product manager who read about it in a Gartner report. Your blog posts, HN comments, and conference talks will carry the credibility of someone who's been paged at 2am because of drift. Developers buy from practitioners, not from marketing teams.

**5. No Investor Pressure to "Go Enterprise"**

Bootstrapped means you don't have a board pushing you to hire a sales team and chase $100K deals. You can stay PLG, stay developer-focused, and stay cheap. This is a strategic moat that funded competitors literally cannot replicate — their investors won't let them.

---

## Section 3: DISRUPTION ANALYSIS

### 3.1 Christensen Framework — Classic Low-End Disruption

Clayton Christensen would look at dd0c/drift and smile. This is textbook disruption theory. Let me walk through it precisely, because understanding WHY this works matters more than believing it will.

**The Incumbent's Dilemma:**

Spacelift and env0 are classic sustaining innovators. They started with a focused product, raised venture capital, and are now marching upmarket to justify their valuations. Every quarter, they add more features (policy-as-code, cost estimation, blueprints, self-service portals) to win larger enterprise deals. Their product roadmap is driven by what their biggest customers ask for — which is always "more platform."

This creates the classic Christensen gap at the bottom of the market:

```
Performance
 ▲
 │                     ╱  Spacelift/env0 trajectory
 │                    ╱   (more platform features)
 │                   ╱
 │                  ╱
 │                 ╱  ← Enterprise needs
 │                ╱
 │               ╱─────────────────── SMB needs (drift detection + remediation)
 │              ╱
 │             ╱  ← dd0c/drift enters HERE
 │╱
 └──────────────────────► Time
```

**The Gap:** SMB and mid-market teams (5-50 engineers, 10-100 stacks) need drift detection. They do NOT need CI/CD orchestration, policy-as-code engines, or self-service infrastructure portals. Spacelift/env0 are overshooting their needs and overcharging for the privilege.

**dd0c/drift enters at the low end** — offering "just" drift detection at 17x lower price. Incumbents look at this and think: "That's a feature, not a product. They'll never win enterprise deals." They're right. And that's exactly why they won't respond.

**The Disruption Sequence:**

1. **Year 1:** dd0c/drift captures SMB teams that can't afford or don't need Spacelift. Spacelift ignores this — these aren't their customers anyway.
2. **Year 2:** dd0c/drift adds compliance reporting, team features, and deeper remediation. Mid-market teams start choosing dd0c/drift over Spacelift for drift specifically, while using GitHub Actions for CI/CD.
3. **Year 3:** dd0c/drift's drift detection is so superior (because it's the ONLY thing they build) that even Spacelift's enterprise customers start asking "why is dd0c/drift better at drift than you?" Spacelift can't catch up because drift gets 5% of their engineering attention.
4. **Year 4:** dd0c/drift expands into adjacent features (state management, policy for drift, IaC analytics) — moving upmarket from a position of strength.

**The Key Insight:** Disruption doesn't require you to be better at everything. It requires you to be better at ONE thing that matters, at a price point incumbents can't match. $29/stack for world-class drift detection is that thing.

### 3.2 Jobs-To-Be-Done (JTBD) Competitive Analysis

JTBD theory says customers don't buy products — they "hire" them to do a job. Let's map the jobs and who currently gets hired:

**Job 1: "Help me know when my infrastructure has drifted from code"**

- Current hire: Manual `terraform plan` (free, unreliable, doesn't scale)
- Alternative hire: Spacelift drift detection (expensive, requires platform adoption)
- Alternative hire: Cron job + bash script (free, fragile, no UI, breaks silently)
- dd0c/drift fit: **PERFECT.** This is the core job. 10/10 fit.

**Job 2: "Help me fix drift quickly without breaking things"**

- Current hire: Manual investigation + careful `terraform apply` (slow, risky, requires expertise)
- Alternative hire: Spacelift auto-reconciliation (good but requires full platform)
- Alternative hire: Nothing. Most teams just live with drift.
- dd0c/drift fit: **EXCELLENT.** One-click revert with context and blast radius analysis. 9/10 fit.

**Job 3: "Help me prove to auditors that infrastructure matches code"**

- Current hire: Manual quarterly reconciliation + spreadsheets (expensive in engineer-hours, always stale)
- Alternative hire: AWS Config (partial — doesn't compare against IaC intent)
- Alternative hire: Firefly (good but enterprise-only, $1,000+/mo)
- dd0c/drift fit: **STRONG.** Continuous compliance evidence generation. 8/10 fit.

**Job 4: "Help me see infrastructure health across all stacks"**

- Current hire: Standup meetings + tribal knowledge (doesn't scale, single point of failure)
- Alternative hire: Spacelift dashboard (good but requires full platform)
- Alternative hire: Custom Datadog dashboards (partial, doesn't understand IaC state)
- dd0c/drift fit: **STRONG.** Drift score dashboard, stack health view. 8/10 fit.
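
The document never defines the drift score, so here is one illustrative way such a metric could work: a 0-100 health score that weights drifted resources by an assumed per-type severity. Both the weights and the formula are hypothetical, not the product's actual math:

```python
# Hypothetical severity weights per resource type (assumption, not product spec).
SEVERITY = {"aws_security_group": 3.0, "aws_iam_policy": 3.0, "aws_instance": 1.5}

def drift_score(total_resources: int, drifted: dict[str, int]) -> float:
    """Return 0-100, where 100 means no drift at all."""
    if total_resources == 0:
        return 100.0
    penalty = sum(SEVERITY.get(rtype, 1.0) * n for rtype, n in drifted.items())
    return max(0.0, round(100.0 * (1 - penalty / total_resources), 1))

# 50 resources, 2 drifted security groups and 1 drifted instance -> 85.0
score = drift_score(50, {"aws_security_group": 2, "aws_instance": 1})
```

The useful property for a dashboard is that one risky resource type (a drifted security group) moves the number more than cosmetic drift does.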

**Job 5: "Help me manage my entire IaC workflow (plan, apply, approve, deploy)"**

- Current hire: GitHub Actions + manual process
- Alternative hire: Spacelift, env0, TFC (this is their core job)
- dd0c/drift fit: **ZERO.** Not your job. Don't even think about it. This is where competitors live. Stay away.

**JTBD Strategic Implication:** dd0c/drift is the best hire for Jobs 1-4 and deliberately refuses Job 5. This is the "anti-platform" positioning. You win by doing LESS, not more. Every feature request that smells like Job 5 ("can you also run our terraform apply?") gets a polite "no — use GitHub Actions for that, we integrate with it."

### 3.3 Switching Costs Analysis

Switching costs determine whether customers stay. For dd0c/drift, the analysis is nuanced:

**Switching costs FROM competitors TO dd0c/drift: LOW (your advantage)**

- From Spacelift: Teams frustrated with platform complexity can add dd0c/drift alongside Spacelift (or instead of it for drift). No migration required.
- From env0: Same story. dd0c/drift doesn't replace their CI/CD — it replaces their drift feature.
- From TFC: Teams leaving TFC for OpenTofu need drift detection. dd0c/drift is a natural addition to their new GitHub Actions workflow.
- From "do nothing": The 60-second `drift init` setup means the switching cost from manual process to dd0c/drift is essentially zero.

**Switching costs FROM dd0c/drift TO competitors: BUILD THESE DELIBERATELY**

This is where you need to be strategic. A drift detection tool with low switching costs is a commodity. You must create stickiness:

1. **Integration Depth** — Deep Slack integration (custom channels per stack, approval workflows, remediation history in threads). Reconfiguring all of this in a competitor is painful.
2. **Historical Data** — 12 months of drift history, trend data, and compliance audit trails. This data doesn't export to Spacelift. Leaving means losing your audit history.
3. **Policy Configuration** — Per-resource-type remediation policies (auto-revert security groups, alert on IAM, ignore tags). Rebuilding these policies in another tool takes weeks.
4. **Team Workflows** — Stack ownership assignments, on-call routing, approval chains. These are organizational knowledge encoded in the tool.
5. **Compliance Dependency** — Once your SOC 2 audit evidence references dd0c/drift reports, switching tools means re-establishing evidence chains with your auditor. Nobody wants to do that mid-audit-cycle.
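
Item 3 is the stickiest of these because policies encode judgment. A hedged sketch of what such a policy table could look like; the schema and action names are assumptions, not dd0c/drift's actual configuration format:

```python
# Hypothetical per-resource-type remediation policies (illustrative only).
POLICIES = [
    ("aws_security_group", "auto_revert"),  # revert risky drift automatically
    ("aws_iam_", "alert_only"),             # prefix match covers all IAM types
]
DEFAULT_ACTION = "alert_only"

def action_for(resource_type: str, changed_attrs: list[str]) -> str:
    """Pick the remediation action for one drifted resource."""
    if changed_attrs == ["tags"]:
        return "ignore"  # tag-only drift is noise for most teams
    for prefix, action in POLICIES:
        if resource_type.startswith(prefix):
            return action
    return DEFAULT_ACTION
```

A team that has spent weeks tuning a table like this has encoded organizational judgment into the tool, which is exactly the switching cost the section describes.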

**Net Assessment:** Easy to adopt (low switching costs in), increasingly hard to leave (growing switching costs out). This is the ideal dynamic for a PLG product.

### 3.4 Data Moats — Drift Pattern Intelligence

Here's where it gets interesting. And where the long-term defensibility lives.

**The Data Flywheel:**

Every dd0c/drift customer generates drift data: what resources drift, how often, who causes it, what time of day, what the blast radius is, how long until remediation. Individually, this data is useful for that customer. Aggregated and anonymized across thousands of customers, it becomes an intelligence asset that no competitor can replicate.

**What You Can Build With Aggregate Drift Data:**

1. **Drift Probability Scores** — "Security groups drift 3.2x more often than VPCs. RDS parameter groups drift most frequently on Fridays after deployments." Per-resource-type drift probability, trained on real-world data across thousands of organizations.
2. **Predictive Drift Alerts** — "Based on patterns from 10,000 similar organizations, this resource has a 78% chance of drifting in the next 48 hours." This is the "Wild Card #1" from the brainstorm session — and it's achievable with sufficient data.
3. **Remediation Recommendations** — "When this type of security group drift occurs, 89% of teams revert it. 11% accept it. Here's the most common reason for acceptance." AI-powered remediation suggestions based on what similar teams do.
4. **Industry Benchmarking** — "Your organization has a 12% drift rate. The median for Series B SaaS companies with 20-50 stacks is 18%. You're in the top quartile." Competitive benchmarking that creates FOMO and drives adoption.
5. **Compliance Risk Scoring** — "Organizations with your drift profile have a 34% higher rate of SOC 2 findings related to configuration management." Risk quantification that sells to the Diana persona.
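
The first item is the most mechanical: per-type drift frequency is a simple aggregation once anonymized drift events exist. A sketch, where the event shape (resource type plus anonymized org id) is my assumption:

```python
from collections import Counter

def drift_rate_ratios(events: list[tuple[str, str]], baseline: str) -> dict[str, float]:
    """Drift frequency per resource type, relative to a baseline type."""
    counts = Counter(rtype for rtype, _org in events)
    base = counts[baseline] or 1  # avoid division by zero if baseline unseen
    return {rtype: round(n / base, 1) for rtype, n in counts.items()}

# Toy dataset: 16 security-group drift events vs. 5 VPC drift events
events = (
    [("aws_security_group", f"org{i}") for i in range(16)]
    + [("aws_vpc", f"org{i}") for i in range(5)]
)
ratios = drift_rate_ratios(events, "aws_vpc")  # security groups: 3.2x baseline
```

Headlines like "security groups drift 3.2x more often than VPCs" fall straight out of this kind of aggregation; the hard part is having enough customers to make the numbers mean something.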

**The Moat Mechanics:**

- More customers → more drift data → better predictions → better product → more customers.
- This flywheel takes 12-18 months to become meaningful (you need ~500+ customers generating data).
- Once spinning, it's nearly impossible for a new entrant to replicate — they'd need the same customer base to generate the same data.
- Spacelift/env0 COULD build this, but drift is a feature for them, not the product. They won't invest in drift-specific ML when they're building CI/CD features for enterprise deals.

**The Honest Caveat:** Data moats are real but slow to build. For the first 12 months, your moat is speed, focus, and price. The data moat kicks in around Year 2. Don't over-index on it in your pitch — it's a long-term strategic asset, not a launch differentiator.

---

## Section 4: GO-TO-MARKET STRATEGY

### 4.1 Beachhead — The First 10 Customers

Brian, the first 10 customers define your company. Get them wrong and you'll spend 18 months building features for the wrong audience. Get them right and they become your case studies, your referral engine, and your product advisory board.

**Ideal First Customer Profile:**

- Team size: 5-20 engineers
- Stacks: 10-50 Terraform/OpenTofu stacks
- Cloud: AWS (start single-cloud, expand later)
- Current drift solution: Manual `terraform plan` or nothing
- Budget for drift tooling: $0-$500/mo (can't afford Spacelift, won't get Firefly to return their call)
- Pain trigger: Recent drift-related incident, upcoming SOC 2 audit, or engineer burnout from manual reconciliation
- Decision maker: The infra engineer or DevOps lead (not a VP — no procurement process)

**Where These People Live:**

1. **r/terraform** — 80K+ members. Search for "drift" and you'll find weekly posts asking for solutions. These are your people. They're already in pain.
2. **r/devops** — 300K+ members. Broader audience but drift discussions surface regularly.
3. **Hacker News** — "Show HN" launches for developer tools consistently hit front page. A well-crafted launch post ("I built a $29/mo alternative to Spacelift's drift detection") will generate 200+ comments and 5,000+ site visits.
4. **driftctl GitHub Issues** — The abandoned driftctl repo has open issues from people asking "what do I use instead?" These are pre-qualified leads. Literally people who searched for your product and found a dead project.
5. **HashiCorp Community Forum** — Teams migrating from TFC to OpenTofu are actively discussing tooling gaps. Drift detection is consistently mentioned.
6. **DevOps Slack Communities** — Rands Leadership Slack, DevOps Chat, Kubernetes Slack (#terraform channel). Organic mentions and helpful answers build credibility.
7. **Twitter/X DevOps Community** — DevOps influencers (Kelsey Hightower, Charity Majors, etc.) regularly discuss IaC pain points. A well-timed thread about drift costs gets amplified.

**The First 10 Acquisition Playbook:**

1. **Customers 1-3: Personal network.** Brian, you're a senior AWS architect. You know people who manage Terraform stacks. Call them. "I'm building a drift detection tool. Can I give you free access for 3 months in exchange for feedback?" These are your design partners.
2. **Customers 4-6: Community engagement.** Spend 2 weeks answering drift-related questions on r/terraform and r/devops. Don't pitch. Just help. Then post "Show HN: I built dd0c/drift — $29/mo drift detection for Terraform." The community engagement creates credibility for the launch.
3. **Customers 7-10: Content-driven inbound.** Publish "The True Cost of Infrastructure Drift" blog post with the Drift Cost Calculator. Promote on HN, Reddit, Twitter. Convert readers to free tier users, convert free tier to paid.

**Timeline to First 10 Paying Customers:** 60-90 days from public launch. This is aggressive but achievable with the right community engagement.

### 4.2 Pricing — Is $29/Stack the Right Anchor?

Let me stress-test the $29/stack/month pricing from multiple angles.

**The Bull Case for $29/Stack:**

1. **Credit Card Threshold** — $29 is below the "ask my manager" threshold at most companies. An engineer can expense it. No procurement. No legal review. No 3-month sales cycle. This is the PLG unlock.
2. **Anchoring Against Competitors** — When someone Googles "drift detection pricing" and sees Spacelift at $500+/mo and dd0c/drift at $29/stack, the contrast is visceral. "17x cheaper" is a headline that writes itself.
3. **Expansion Revenue** — A team starts with 5 stacks ($145/mo). As they grow to 20 stacks ($580/mo), then 50 stacks ($1,450/mo), revenue expands naturally without upselling. The pricing model has built-in NDR (Net Dollar Retention) >120%.
4. **Market Positioning** — $29/stack says "this is a utility, not a platform." It positions dd0c/drift as infrastructure (like Datadog per-host pricing) rather than a software platform (like Spacelift per-seat pricing).

**The Bear Case Against $29/Stack:**

1. **Revenue Per Customer Is Low** — Average customer with 15 stacks = $435/mo. You need 115 customers to hit $50K MRR. That's achievable but requires real marketing effort for a solo founder.
2. **Stack Count Is Ambiguous** — What's a "stack"? A Terraform workspace? A state file? A root module? Customers will argue about this. You need a crystal-clear definition.
3. **Penalizes Good Architecture** — Teams that split infrastructure into many small stacks (best practice) pay more than teams with monolithic stacks. This creates a perverse incentive.
4. **Enterprise Sticker Shock** — A team with 200 stacks would pay $5,800/mo. At that point, Spacelift's platform (which includes drift detection PLUS CI/CD) starts looking reasonable. You lose the price advantage at scale.
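
The arithmetic on both sides of the argument is worth keeping in one place, since the same $29 constant drives the bull case's expansion revenue and the bear case's customer-count and sticker-shock numbers:

```python
import math

PRICE_PER_STACK = 29  # $/stack/month, the headline anchor

def mrr(stacks: int) -> int:
    """Monthly recurring revenue from one customer at a given stack count."""
    return PRICE_PER_STACK * stacks

# Bull case 3: expansion revenue as a team grows
assert mrr(5) == 145 and mrr(20) == 580 and mrr(50) == 1_450

# Bear case 1: customers (at 15 stacks each) needed to reach $50K MRR
assert mrr(15) == 435
assert math.ceil(50_000 / mrr(15)) == 115

# Bear case 4: the 200-stack sticker shock
assert mrr(200) == 5_800
```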

**Victor's Pricing Recommendation:**

Don't do pure per-stack pricing. Use a **tiered model with stack bundles:**

| Tier | Price | Stacks | Polling | Features |
|------|-------|--------|---------|----------|
| Free | $0 | 3 stacks | Daily | Slack alerts, basic dashboard |
| Starter | $49/mo | 10 stacks | 15-min | + One-click remediation, stack ownership |
| Pro | $149/mo | 30 stacks | 5-min | + Compliance reports, auto-remediation policies, API |
| Business | $399/mo | 100 stacks | 1-min | + SSO, RBAC, audit trail export, priority support |
| Enterprise | Custom | Unlimited | Real-time | + SLA, dedicated support, custom integrations |
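
The tier boundaries in the table imply a simple upgrade trigger. A sketch of that lookup; the tier names and stack limits come from the table above, while the function itself is illustrative:

```python
# (tier, monthly price in $, max stacks) taken from the pricing table above.
TIERS = [("Free", 0, 3), ("Starter", 49, 10), ("Pro", 149, 30), ("Business", 399, 100)]

def tier_for(stacks: int) -> str:
    """Smallest tier whose stack bundle covers the team; beyond 100, Enterprise."""
    for name, _price, max_stacks in TIERS:
        if stacks <= max_stacks:
            return name
    return "Enterprise"

# Crossing 30 stacks is the natural Pro-to-Business upsell moment
assert tier_for(15) == "Pro"
assert tier_for(31) == "Business"
assert tier_for(250) == "Enterprise"
```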

**Why This Is Better Than Pure Per-Stack:**

1. **Predictable pricing** — Customers know exactly what they'll pay. No surprise bills when they add stacks.
2. **Encourages adoption** — "I'm on the 30-stack plan but only using 15. Let me add more stacks since I'm already paying for them." Drives usage.
3. **Natural upsell** — When a team outgrows the 30-stack Pro plan, they upgrade to Business. Clear upgrade trigger.
4. **Enterprise ceiling** — Business at $399/mo is still 10x cheaper than Spacelift. You maintain the price advantage even at scale.
5. **Free tier is generous** — 3 stacks with daily polling is genuinely useful for small teams and side projects. This is your viral loop.

**The $29/stack messaging still works for marketing** — "Starting at $29/stack" is the headline. The tiered pricing is what they see on the pricing page. The anchor is set.

### 4.3 Channel Strategy — PLG via CLI Onboarding

**The Funnel:**
```
Awareness (Content + Community)
        ↓
Interest (Drift Cost Calculator / Blog)
        ↓
Activation (`drift init` — 60 seconds to first alert)
        ↓
Engagement (Daily Slack alerts, drift score)
        ↓
Revenue (Free → Starter when they hit 4+ stacks or need remediation)
        ↓
Expansion (Starter → Pro → Business as stacks grow)
        ↓
Referral ("This tool saved my team 10 hours/week" → word of mouth)
```

**Channel Breakdown:**

**1. Hacker News (Primary Launch Channel)**

- "Show HN" post: "dd0c/drift — $29/mo drift detection for Terraform/OpenTofu. Set up in 60 seconds."
- HN loves: open-source components, solo founder stories, tools that replace expensive platforms, clear pricing.
- Expected outcome: 200-500 comments, 5,000-15,000 site visits, 100-300 signups, 10-30 paying customers.
- Timing: Launch on a Tuesday or Wednesday morning (US Eastern). Avoid Mondays (crowded) and Fridays (low traffic).

**2. Reddit (Sustained Community Engagement)**

- r/terraform: Weekly engagement answering drift questions. Monthly "how we detect drift" technical posts.
- r/devops: Broader DevOps audience. Focus on the operational pain, not the tool.
- r/aws: AWS-specific drift scenarios (security groups, IAM policies).
- Rule: 10:1 ratio of helpful comments to self-promotion. Build credibility first.

**3. Technical Blog (SEO + Thought Leadership)**

- "The True Cost of Infrastructure Drift" (with calculator)
- "driftctl Is Dead. Here's What to Use Instead."
- "How to Detect Terraform Drift Without Spacelift"
- "SOC 2 and Infrastructure Drift: A Compliance Guide"
- "Terraform vs OpenTofu: Drift Detection Compared"
- Each post targets a specific long-tail keyword. SEO compounds over 6-12 months.

**4. GitHub (Open-Source Lead Gen)**

- Open-source the CLI detection engine (`dd0c/drift-cli`).
- Free, local drift detection. No account needed.
- The CLI outputs: "Found 7 drifted resources. View details and remediate at app.dd0c.dev" — the upsell to SaaS.
- GitHub stars = social proof. Target 1,000 stars in first 3 months.

**5. Conference Talks (Credibility + Reach)**

- HashiConf (if it still exists post-IBM), KubeCon, DevOpsDays, local meetups.
- Talk title: "The Hidden Cost of Infrastructure Drift: Data from 1,000 Terraform Stacks" (once you have the data).
- Conference talks convert at low volume but high quality — the people who approach you after the talk are pre-sold.

### 4.4 Content Strategy — The Drift Cost Calculator

**The "Drift Cost Calculator" is your single most important marketing asset.** Here's why:

Most developer tools market with features: "We detect drift! We have Slack alerts! We do remediation!" Features don't sell. Pain quantification sells.

**The Calculator:**

A simple web tool where an engineer inputs:

- Number of Terraform stacks
- Average team size
- Average engineer salary
- Frequency of manual drift checks
- Number of drift-related incidents per quarter

**Output:**

- "Your team spends approximately **$47,000/year** on manual drift management."
- "At $149/mo for dd0c/drift Pro, your ROI is **26x** in the first year."
- "You're losing **312 engineering hours/year** to drift-related work."
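
A sketch of the calculator's arithmetic. The document specifies the inputs and example outputs but not the formulas, so every coefficient below (five minutes per stack per check, eight hours and two engineers per incident, 2,080 working hours per year) is a placeholder assumption to tune against real data:

```python
# Hypothetical cost model for the Drift Cost Calculator (coefficients assumed).
def drift_cost_per_year(stacks: int, team_size: int, salary: int,
                        checks_per_week: int, incidents_per_quarter: int) -> dict:
    hourly = salary / 2080                                 # ~2,080 work hours/year
    check_hours = checks_per_week * 52 * stacks * 5 / 60   # 5 min per stack per check
    engineers_per_incident = min(2, team_size)             # assumed blast radius
    incident_hours = incidents_per_quarter * 4 * 8 * engineers_per_incident
    hours = check_hours + incident_hours
    return {"hours_lost": round(hours), "annual_cost": round(hours * hourly)}

report = drift_cost_per_year(stacks=30, team_size=8, salary=150_000,
                             checks_per_week=1, incidents_per_quarter=2)
roi = report["annual_cost"] / (149 * 12)   # vs. dd0c/drift Pro at $149/mo
```

Even with conservative coefficients, the output lands in the tens of thousands of dollars per year, which is the shareable headline number this section is after.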

**Why This Works:**

1. It makes the invisible visible. Drift costs are hidden in engineer time, not in a line item.
2. It creates urgency. "$47K/year" is a number a manager can act on. "Drift is annoying" is not.
3. It's shareable. Engineers send the calculator to their managers. "Look what drift is costing us."
4. It captures leads. "Enter your email to get the full report" after showing the headline number.
5. It's content marketing gold. "We analyzed drift costs across 500 teams. The average is $52K/year." Blog post writes itself.

### 4.5 Open-Source as Lead Gen

**The Strategy:** Open-source the drift detection CLI. Charge for the SaaS layer.

**What's Open-Source:**

- `drift-cli` — Local drift detection for Terraform/OpenTofu. Runs `drift check` and outputs drifted resources to stdout.
- Works offline. No account needed. No telemetry.
- Supports single-stack scanning. Multi-stack requires SaaS.

**What's Paid SaaS:**

- Continuous monitoring (scheduled + event-driven)
- Slack/PagerDuty alerts
- One-click remediation
- Dashboard and drift score
- Compliance reports
- Team features (ownership, routing, RBAC)
- Historical data and trends
- Multi-stack aggregate view

**The Conversion Funnel:**

1. Engineer discovers `drift-cli` on GitHub or HN.
2. Runs `drift check` on one stack. Finds 5 drifted resources. "Oh crap."
3. Wants to run it on all 30 stacks continuously. Can't do that locally.
4. Signs up for free tier (3 stacks, daily polling).
5. Gets hooked on Slack alerts. Wants remediation and more stacks.
6. Upgrades to Starter ($49/mo) or Pro ($149/mo).

This is the Sentry/PostHog/GitLab playbook. Open-source core builds trust and adoption. Paid SaaS captures value from teams that need more.
### 4.6 Partnership Strategy
|
||||||
|
|
||||||
|
**HashiCorp/Terraform Ecosystem:**
|
||||||
|
- List on Terraform Registry as a complementary tool.
|
||||||
|
- Write a Terraform provider (`terraform-provider-driftcheck`) that exposes drift status as data sources.
|
||||||
|
- Publish in HashiCorp's partner ecosystem (if they still maintain one post-IBM).
|
||||||
|
- Caveat: Don't depend on HashiCorp goodwill. They may view you as competitive to TFC. Maintain independence.
|
||||||
|
|
||||||
|
**OpenTofu Foundation:**
|
||||||
|
- Become a visible OpenTofu ecosystem partner. Sponsor the project. Contribute to discussions.
|
||||||
|
- Position dd0c/drift as "the drift detection tool for the OpenTofu community."
|
||||||
|
- OpenTofu teams are actively building their toolchain. Be part of it from day one.
|
||||||
|
|
||||||
|
**Slack Marketplace:**
|
||||||
|
- List dd0c/drift as a Slack app. Slack Marketplace is an underrated distribution channel for DevOps tools.
|
||||||
|
- "Install dd0c/drift from Slack" → OAuth → connect state backend → first alert in 5 minutes.
|
||||||
|
|
||||||
|
**AWS Marketplace:**
|
||||||
|
- List on AWS Marketplace for teams that want to pay through their AWS bill (consolidated billing, committed spend credits).
|
||||||
|
- AWS Marketplace listing also provides credibility and discoverability.

---

## Section 5: RISK MATRIX

### 5.1 Top 10 Risks — Scored by Severity × Probability

I'm going to be brutal here. If you can't stomach these risks, don't build this product.

| # | Risk | Severity (1-5) | Probability (1-5) | Score | Timeframe |
|---|------|----------------|-------------------|-------|-----------|
| 1 | HashiCorp/IBM builds native drift detection into TFC that's "good enough" | 5 | 3 | 15 | 12-24 months |
| 2 | Solo founder burnout — you're building 6 products, not 1 | 5 | 4 | 20 | 6-12 months |
| 3 | Spacelift drops drift detection into their free tier to kill you | 4 | 3 | 12 | 12-18 months |
| 4 | OpenTofu fragments the market, slowing IaC adoption overall | 3 | 3 | 9 | 12-24 months |
| 5 | AWS/Azure/GCP build native IaC drift detection into their consoles | 5 | 2 | 10 | 24-36 months |
| 6 | Security concerns prevent teams from granting read access to state/cloud | 4 | 3 | 12 | Immediate |
| 7 | "Good enough" internal tooling (bash scripts, GitHub Actions) prevents adoption | 3 | 4 | 12 | Ongoing |
| 8 | AI-generated IaC reduces drift by making reconciliation trivial | 3 | 2 | 6 | 18-36 months |
| 9 | Pricing pressure from open-source alternatives (someone forks driftctl, builds a better one) | 3 | 3 | 9 | 6-18 months |
| 10 | Customer concentration risk — first 10 customers represent 80%+ of revenue | 3 | 4 | 12 | 0-12 months |
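
The Score column is just severity multiplied by probability; a few lines of Python reproduce the ranking (risk names abbreviated):

```python
# (risk, severity 1-5, probability 1-5) from the matrix above.
risks = [
    ("HashiCorp/IBM ships native TFC drift detection", 5, 3),
    ("Solo founder burnout", 5, 4),
    ("Spacelift makes drift detection free", 4, 3),
    ("OpenTofu fragments the market", 3, 3),
    ("Cloud providers build native drift detection", 5, 2),
    ("Security concerns block state/cloud access", 4, 3),
    ('"Good enough" internal tooling', 3, 4),
    ("AI-generated IaC reduces drift", 3, 2),
    ("Open-source competitor emerges", 3, 3),
    ("Customer concentration", 3, 4),
]

# Sort descending by severity x probability.
ranked = sorted(risks, key=lambda r: r[1] * r[2], reverse=True)
for name, sev, prob in ranked:
    print(f"{sev * prob:>2}  {name}")
```

Sorted this way, solo founder burnout (20) outranks the HashiCorp threat (15), which is exactly the argument Section 5.2 makes.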

### 5.2 Mitigation Strategies

**Risk 1: HashiCorp/IBM Builds Native Drift Detection**

This is the existential risk. Let's be precise about it.

*Why it might happen:* IBM paid $6.4B for HashiCorp. They need to justify the acquisition by expanding TFC's feature set and increasing enterprise revenue. Drift detection is an obvious feature to improve.

*Why it might NOT happen:* IBM is IBM. They move slowly. They'll focus on enterprise features (governance, compliance frameworks, SSO) that justify $70K+ annual contracts. Improving drift detection for the free/starter tier doesn't move the revenue needle. Also, post-BSL, the community is migrating to OpenTofu. IBM may double down on enterprise lock-in rather than community features.

*Mitigation:*

1. **Multi-IaC is your insurance policy.** TFC will only ever support Terraform. dd0c/drift supports Terraform + OpenTofu + Pulumi. Every team using multiple IaC tools is immune to TFC's drift features.
2. **Speed.** You need to be 18 months ahead on drift-specific features by the time IBM responds. That means shipping weekly, not quarterly.
3. **Community lock-in.** If dd0c/drift is the community standard for drift detection (the "driftctl successor"), IBM improving TFC drift won't matter — the community has already chosen you.
4. **Worst case:** TFC drift detection becomes "good enough" for Terraform-only teams on TFC. You lose that segment. But teams on OpenTofu, Pulumi, or multi-IaC are still yours. That's the growing segment.

**Risk 2: Solo Founder Burnout**

This is the risk I'm most worried about. Not because of the market — because of you.

*The math:* dd0c is 6 products. Even if drift is Phase 3, you're building route, cost, and alert first. By the time you get to drift, you'll have 3 products to maintain, support, and market. Adding a 4th is not "building a new product" — it's "adding 25% more work to an already unsustainable workload."

*Mitigation:*

1. **Ruthless prioritization.** If drift is the product with the clearest market gap and the strongest disruption thesis (it is), consider moving it to Phase 1 or Phase 2. Don't wait until you're already exhausted.
2. **Shared infrastructure.** The dd0c platform architecture (shared auth, billing, OTel pipeline) must be built ONCE and reused. If each product has its own backend, you're dead.
3. **AI-assisted development.** You're already using AI tools. Lean harder. Use Cursor/Copilot for 80% of the boilerplate. Reserve your cognitive energy for architecture decisions and customer conversations.
4. **Hire at $30K MRR.** The moment you hit $30K MRR across all dd0c products, hire a part-time contractor for support and bug fixes. Don't try to be a solo founder past $30K MRR — the support burden alone will consume you.

**Risk 3: Spacelift Drops Drift Into Free Tier**

*Why it might happen:* If dd0c/drift gains traction and starts appearing in "Spacelift alternatives" searches, Spacelift's marketing team will notice. The easiest response is to make their basic drift detection free.

*Why it might NOT happen:* Spacelift's drift detection requires private workers, which have infrastructure costs. Making it free erodes their upgrade path. Their investors won't love giving away features that drive enterprise upgrades.

*Mitigation:*

1. **Be better, not just cheaper.** If your drift detection is 10x better (Slack-native, one-click remediation, compliance reports, multi-IaC), "free but mediocre" from Spacelift won't matter. Nobody switched from Figma to free Adobe XD.
2. **Different buyer.** Spacelift's free tier targets teams evaluating their platform. dd0c/drift targets teams who DON'T WANT a platform. Different buyer, different motion.
3. **Speed of innovation.** By the time Spacelift responds, you should be 2 versions ahead with features they haven't thought of (drift prediction, cost-of-drift analytics, compliance automation).

**Risk 4: OpenTofu Fragments the Market**

*Mitigation:* This is actually an opportunity disguised as a risk. Fragmentation means teams use BOTH Terraform and OpenTofu (migration is gradual). They need a tool that works with both. That's you. Fragmentation increases your value proposition.

**Risk 5: Cloud Providers Build Native Drift Detection**

*Mitigation:* AWS CloudFormation has offered basic drift detection for years. It's terrible. Cloud providers optimize for their own IaC tools, not third-party ones. AWS will never build great Terraform drift detection — it's against their strategic interest (they want you on CloudFormation). You're safe for 3+ years.

**Risk 6: Security Concerns Block Adoption**

*Mitigation:*

1. **Read-only access only.** dd0c/drift never needs write access to cloud resources (except for remediation, which is opt-in).
2. **State file access architecture.** Offer multiple modes: (a) SaaS reads state from S3/GCS directly (requires IAM role), (b) Agent runs in customer's VPC and pushes drift data out (no inbound access), (c) CLI mode for air-gapped environments.
3. **SOC 2 certification for dd0c itself.** Eat your own dog food. Get SOC 2 certified. It's expensive ($20-50K) but it eliminates the "can we trust a solo founder's SaaS?" objection.
4. **Open-source the agent.** If the detection agent is open-source, security teams can audit the code. Trust through transparency.
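
Mode (b) is worth sketching: the agent computes the diff inside the customer's network and makes a single outbound HTTPS call, so nothing connects in and no state file contents ever leave the VPC. The endpoint URL and payload shape below are hypothetical illustrations, not a real dd0c API:

```python
import json
import urllib.request

# Hypothetical ingest endpoint — not a real dd0c URL.
INGEST_URL = "https://drift-ingest.example.com/v1/reports"

def build_report(stack: str, drifted: list) -> bytes:
    """Serialize only the computed diff; raw state stays inside the VPC."""
    return json.dumps({"stack": stack, "drifted_resources": drifted}).encode()

def push_report(payload: bytes, token: str) -> int:
    """Outbound-only HTTPS push — the SaaS never initiates a connection in."""
    req = urllib.request.Request(
        INGEST_URL,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Run as a cron job, CI step, or ECS task, this is the architecture the security team can approve: read-only inside, diff-only outbound.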

**Risk 7: "Good Enough" Internal Tooling**

*Mitigation:* This is your biggest competitor — inertia. The Drift Cost Calculator directly attacks this by quantifying the cost of "good enough." When an engineer sees "$47K/year in drift management costs" vs. "$149/mo for dd0c/drift," the bash script suddenly looks expensive.
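
The calculator's math is simple enough to sketch. Every figure below is an illustrative assumption, not product data:

```python
def annual_drift_cost(incidents_per_month: float,
                      hours_per_incident: float,
                      loaded_hourly_rate: float) -> float:
    """Back-of-envelope yearly cost of manually chasing drift."""
    return incidents_per_month * hours_per_incident * loaded_hourly_rate * 12

# Example: 8 drift incidents/month, 5 engineer-hours each, $100/hr loaded rate.
cost = annual_drift_cost(8, 5, 100)
print(f"${cost:,.0f}/year")  # prints $48,000/year — vs. $1,788/year for Pro
```

The pitch writes itself once the prospect plugs in their own incident count and hourly rate.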

**Risk 8: AI-Generated IaC Reduces Drift**

*Mitigation:* AI might make it easier to WRITE IaC, but drift isn't caused by bad code — it's caused by humans making console changes, emergency hotfixes, and auto-scaling events. AI doesn't prevent a panicked engineer from opening a security group at 2am. If anything, AI-generated IaC increases the volume of managed resources, which increases the surface area for drift. This risk is overblown.

**Risk 9: Open-Source Competitor Emerges**

*Mitigation:* Beat them to it. YOUR CLI is open-source. If someone wants to build a free drift detection tool, they'll fork yours rather than building from scratch. You control the ecosystem. The SaaS layer (continuous monitoring, Slack integration, compliance reports, team features) is where you capture value — and that's hard to replicate in OSS.

**Risk 10: Customer Concentration**

*Mitigation:* Standard startup risk. Mitigate by aggressively pursuing PLG (many small customers) rather than a few large ones. Target: no single customer >10% of revenue by month 6.

### 5.3 Kill Criteria — When to Walk Away

Brian, every good strategist defines the conditions under which they retreat. Here are yours:

**Kill the product if ANY of these are true at the 6-month mark:**

1. **< 50 free tier signups** after HN launch + Reddit engagement + blog content. This means the market doesn't care, regardless of what the personas say.
2. **< 5 paying customers** after 90 days of the paid tier being available. Free users who won't pay are a vanity metric.
3. **Free-to-paid conversion < 3%.** Industry benchmark for PLG developer tools is 3-7%. Below 3% means the free tier is too generous or the paid tier isn't compelling.
4. **NPS < 30** from first 20 customers. If early adopters (who are the most forgiving) aren't enthusiastic, the product isn't solving a real problem.
5. **HashiCorp announces "Terraform Cloud Drift Detection Pro"** with continuous monitoring, Slack alerts, and remediation — included in the Plus tier. If this happens before you have 100+ customers, pivot to a different dd0c module.

**Kill the product if ANY of these are true at the 12-month mark:**

1. **< $10K MRR.** At $10K MRR after 12 months, the growth trajectory doesn't support a standalone product. Fold drift detection into dd0c/portal as a feature instead.
2. **Churn > 8% monthly.** Developer tools should have <5% monthly churn. Above 8% means customers try it, find it insufficient, and leave. The product isn't sticky.
3. **CAC payback > 12 months.** If it takes more than 12 months of revenue to recoup the cost of acquiring a customer, the unit economics don't work for a bootstrapped founder.

### 5.4 Scenario Planning — Revenue Projections

**Scenario A: "The Rocket" (20% probability)**

- HN launch goes viral. 500+ signups in week 1.
- driftctl community adopts dd0c/drift as the successor.
- Word-of-mouth drives organic growth.
- Month 6: 100 paying customers, $15K MRR
- Month 12: 350 paying customers, $52K MRR
- Month 24: 1,200 paying customers, $180K MRR
- Outcome: Standalone product. Consider raising seed round to accelerate.

**Scenario B: "The Grind" (50% probability)**

- Steady but unspectacular growth. Community engagement works but slowly.
- Each blog post and Reddit thread brings 5-10 signups.
- Month 6: 40 paying customers, $6K MRR
- Month 12: 150 paying customers, $22K MRR
- Month 24: 500 paying customers, $75K MRR
- Outcome: Solid product within the dd0c platform. Not a standalone business but a strong wedge.

**Scenario C: "The Slog" (25% probability)**

- Market is interested but conversion is low. Free tier gets usage, paid tier struggles.
- Competitors respond faster than expected.
- Month 6: 15 paying customers, $2.2K MRR
- Month 12: 60 paying customers, $9K MRR
- Month 24: 150 paying customers, $22K MRR
- Outcome: Fold into dd0c/portal as a feature. Not viable as standalone.

**Scenario D: "The Flop" (5% probability)**

- Market doesn't materialize. Teams prefer internal tooling or platform features.
- HN launch gets 30 comments and 200 visits.
- Month 6: 5 paying customers, $750 MRR
- Month 12: < $5K MRR
- Outcome: Kill it. Redirect effort to dd0c/route or dd0c/cost.

**Expected Value Calculation:**

Weighted average Month 12 MRR: (0.20 × $52K) + (0.50 × $22K) + (0.25 × $9K) + (0.05 × $5K) = $10.4K + $11K + $2.25K + $0.25K = **$23.9K MRR**
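
The weighted average is easy to sanity-check in a few lines, using the figures from the four scenarios above:

```python
# (probability, Month-12 MRR in $K) per scenario.
scenarios = {
    "Rocket": (0.20, 52),
    "Grind": (0.50, 22),
    "Slog": (0.25, 9),
    "Flop": (0.05, 5),
}

# Expected value = sum of probability-weighted outcomes.
ev_mrr = sum(p * mrr for p, mrr in scenarios.values())
print(f"Expected Month-12 MRR: ${ev_mrr:.1f}K (~${ev_mrr * 12:.0f}K ARR)")
# prints: Expected Month-12 MRR: $23.9K (~$287K ARR)
```

Note the probabilities sum to 1.0, so this is a proper expectation, not just a blended guess.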

That's ~$287K ARR at the 12-month mark. For a bootstrapped solo founder, that's a real business. Not a unicorn. A real business.

---

## Section 6: STRATEGIC RECOMMENDATIONS

### 6.1 The 90-Day Launch Plan

**Days 1-30: Build the Foundation**

- Week 1-2: Build `drift-cli` (open-source). Terraform + OpenTofu support. Single-stack scanning. Output to stdout.
- Week 2-3: Build the SaaS detection engine. Multi-stack continuous monitoring. S3/GCS state backend integration.
- Week 3-4: Build Slack integration. Drift alerts with [Revert] [Accept] [Snooze] buttons. This is the MVP killer feature.
- Week 4: Build the dashboard. Drift score, stack list, drift history. Minimal but functional.
- Deliverable: Working product that can detect drift across multiple Terraform/OpenTofu stacks and alert via Slack.
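
The week 3-4 Slack alert would naturally be built on Slack's Block Kit. A sketch of the payload, with the `action_id` values and resource names purely hypothetical:

```python
import json

def drift_alert_blocks(resource: str, attribute: str,
                       expected: str, actual: str) -> list:
    """Build a Slack Block Kit message: summary section + action buttons."""
    return [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": (f":warning: *Drift detected* on `{resource}`\n"
                           f"`{attribute}`: expected `{expected}`, "
                           f"actual `{actual}`")}},
        {"type": "actions",
         "elements": [
             {"type": "button", "action_id": "drift_revert", "style": "danger",
              "text": {"type": "plain_text", "text": "Revert"}},
             {"type": "button", "action_id": "drift_accept",
              "text": {"type": "plain_text", "text": "Accept"}},
             {"type": "button", "action_id": "drift_snooze",
              "text": {"type": "plain_text", "text": "Snooze"}},
         ]},
    ]

blocks = drift_alert_blocks("aws_security_group.web", "ingress_cidr",
                            "10.0.0.0/8", "0.0.0.0/0")
print(json.dumps(blocks, indent=2))
```

Each button click comes back as a Slack interaction event carrying the `action_id`, which is what the backend routes to a revert PR, an accept, or a snooze timer.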

**Days 31-60: Seed the Community**

- Week 5: Publish `drift-cli` on GitHub. Write a clear README with GIF demos. Target: 100 stars in week 1.
- Week 5-6: Begin daily engagement on r/terraform, r/devops. Answer drift questions. Don't pitch. Build credibility.
- Week 6: Publish "The True Cost of Infrastructure Drift" blog post with Drift Cost Calculator.
- Week 7: Publish "driftctl Is Dead. Here's What to Use Instead." (This will rank on Google for "driftctl alternative.")
- Week 7-8: Recruit 3-5 design partners from personal network. Free access for 3 months. Weekly feedback calls.
- Deliverable: 200+ GitHub stars, 50+ email list signups, 3-5 design partners actively using the product.

**Days 61-90: Launch and Convert**

- Week 9: "Show HN" launch. Prepare the post carefully. Have the landing page, pricing page, and docs ready.
- Week 9-10: Respond to every HN comment. Fix bugs reported by early users within 24 hours. Ship daily.
- Week 10: Launch on Product Hunt (secondary channel, lower priority than HN).
- Week 11: Publish case study from design partner: "How [Company] Reduced Drift by 90% in 2 Weeks."
- Week 12: Enable paid tiers. Convert free users to Starter/Pro.
- Deliverable: 200+ free tier users, 10+ paying customers, $1.5K+ MRR.

### 6.2 Key Metrics and Milestones

**North Star Metric:** Stacks monitored (total across all customers). This measures adoption depth, not just customer count.

**Leading Indicators:**

| Metric | Month 3 Target | Month 6 Target | Month 12 Target |
|--------|---------------|----------------|-----------------|
| GitHub stars (drift-cli) | 500 | 1,500 | 3,000 |
| Free tier users | 200 | 600 | 1,500 |
| Paying customers | 10 | 50 | 150 |
| MRR | $1.5K | $7.5K | $22K |
| Stacks monitored | 300 | 1,500 | 5,000 |
| Free-to-paid conversion | 5% | 5% | 7% |
| Monthly churn | <5% | <5% | <4% |
| NPS | 40+ | 45+ | 50+ |

**Lagging Indicators:**

- Net Dollar Retention (NDR): Target >120% (customers expand as they add stacks)
- CAC Payback: Target <6 months
- LTV:CAC ratio: Target >3:1
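
These targets connect through standard SaaS arithmetic (the simplest LTV model is average revenue per account divided by monthly churn). The numbers below are illustrative assumptions, not projections:

```python
def ltv(arpa_monthly: float, monthly_churn: float) -> float:
    """Simple lifetime value: monthly revenue per account / churn rate."""
    return arpa_monthly / monthly_churn

def cac_payback_months(cac: float, arpa_monthly: float) -> float:
    """Months of revenue needed to recoup acquisition cost."""
    return cac / arpa_monthly

# Illustrative: a $149/mo Pro customer, 4% monthly churn, $600 blended CAC.
print(ltv(149, 0.04))                 # prints 3725.0 — lifetime value
print(ltv(149, 0.04) / 600)           # LTV:CAC ≈ 6.2, above the 3:1 target
print(cac_payback_months(600, 149))   # ≈ 4 months, under the 6-month target
```

The point of the model: at PLG-level CAC, even a single-tier customer clears both targets, which is why the plan can afford many small customers instead of a sales team.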

### 6.3 Open-Source Core Strategy — YES, With Boundaries

**Verdict: YES. Open-source the detection engine. Charge for the operational layer.**

**The Logic:**

1. **Trust.** Security-conscious teams won't run a closed-source agent in their VPC. Open-source eliminates this objection.
2. **Distribution.** GitHub is a distribution channel. Stars, forks, and contributors are free marketing.
3. **Community.** An open-source project attracts contributors who build integrations you don't have time to build (Azure support, GCP support, Pulumi support).
4. **Defensibility.** Counterintuitively, open-source is MORE defensible than closed-source. If someone forks your CLI, they still need to build the SaaS layer (Slack integration, dashboard, compliance reports, team features, continuous monitoring). That's 80% of the value. The detection engine is 20%.

**The Boundary:**

- Open-source: `drift-cli` (detection engine, single-stack scanning, stdout output)
- Proprietary SaaS: Everything else (continuous monitoring, Slack/PagerDuty integration, remediation workflows, dashboard, compliance reports, team features, API, historical data)

**License:** Apache 2.0 for the CLI. Not AGPL (too restrictive, scares enterprises). Not MIT (permissive, but it lacks an explicit patent grant). Apache 2.0 is the sweet spot — permissive enough for adoption, with patent protection.

### 6.4 The "Unfair Bet" — What Makes This Work

Brian, here's the honest assessment. You asked me to make the case or tell you to skip it. I'm going to do both.

**The Case FOR dd0c/drift:**

1. **The driftctl vacuum is real and time-limited.** There is a window — maybe 12-18 months — where the community is actively searching for a driftctl replacement and nobody has filled the gap. If you ship a credible product in that window, you become the default. If you wait, someone else will.

2. **The disruption math works.** $29/stack (or $49-$399/mo tiered) vs. $500+/mo platforms is a 10-17x price advantage. This isn't a marginal improvement — it's a category shift. You're not competing with Spacelift. You're making Spacelift's drift feature irrelevant for 80% of the market.

3. **Compliance is a forcing function.** SOC 2, PCI DSS 4.0, HIPAA — these aren't optional. Auditors are asking for continuous drift detection. This transforms your product from "nice-to-have" to "the auditor said we need this." Compliance-driven purchases have shorter sales cycles and lower churn.

4. **The platform strategy amplifies the bet.** dd0c/drift isn't just a standalone product — it's a wedge into the dd0c platform. A customer who uses drift is a customer who sees dd0c/cost, dd0c/alert, and dd0c/portal in the sidebar. Cross-sell potential is enormous.

5. **Your background is the unfair advantage.** You're a senior AWS architect. You've lived this problem. You can write the blog posts, give the conference talks, and answer the Reddit questions with authentic credibility that no marketing team can manufacture.

**The Case AGAINST dd0c/drift (the uncomfortable part):**

1. **It's Product #4 in a 6-product platform, built by one person.** The brand strategy puts drift in Phase 3 (months 7-12). By then, you'll be maintaining route, cost, and alert. Adding drift means you're running 4 products simultaneously. The burnout risk is not theoretical — it's mathematical.

2. **The standalone TAM is modest.** $3-5M SOM in 24 months is a great lifestyle business but won't attract investors if you ever want to raise. As a platform wedge, it's valuable. As a standalone bet, it's limited.

3. **The "do nothing" competitor is strong.** Most teams tolerate drift. They've been tolerating it for years. Converting "tolerators" to "payers" requires more marketing effort than converting "seekers" to "payers." Your beachhead is seekers (driftctl refugees, post-incident teams, pre-audit teams), but that's a smaller pool than the total market suggests.

4. **State file access is a trust barrier.** Reading Terraform state files means seeing resource configurations, sometimes including sensitive data. Even with read-only access, security teams will scrutinize this. The agent-in-VPC architecture mitigates it but adds deployment complexity.

**Victor's Final Verdict:**

**BUILD IT. But change the sequencing.**

Move dd0c/drift to Phase 2 (months 4-6), not Phase 3. Here's why:

- dd0c/route (LLM cost router) is a good Phase 1 product — immediate ROI, easy to build.
- dd0c/drift has a TIME-SENSITIVE market window (driftctl vacuum). Every month you wait, the window shrinks.
- dd0c/cost (AWS cost anomaly) can wait — the cost management market is crowded and not time-sensitive.
- dd0c/alert can be Phase 3 — alert fatigue is a chronic problem, not an acute one.

**Revised Launch Sequence:**

1. Phase 1 (Months 1-3): `dd0c/route` — The FinOps wedge. Immediate ROI. Easy build.
2. Phase 2 (Months 4-6): `dd0c/drift` — The driftctl vacuum. Time-sensitive. Compliance tailwind.
3. Phase 3 (Months 7-9): `dd0c/alert` — The on-call savior. Builds on route + drift data.
4. Phase 4 (Months 10-12): `dd0c/portal` + `dd0c/cost` + `dd0c/run` — The platform play.

**The Unfair Bet in One Sentence:**

dd0c/drift wins because it's the only product in the market that treats drift detection as the ENTIRE product rather than a feature checkbox — at a price point that makes the decision trivial and an onboarding experience that makes the switch instant — launched into a market vacuum left by driftctl's death, at the exact moment compliance requirements are making drift detection mandatory.

That's not a bet. That's a calculated position with favorable odds.

Ship it, Brian. But ship it in Phase 2, not Phase 3. The window won't wait.

---

*"The best time to plant a tree was 20 years ago. The second best time is now. The worst time is 'after I finish three other products first.'"*

— Victor

---

**Document Status:** COMPLETE

**Confidence Level:** HIGH (conditional on sequencing change)

**Next Step:** Technical architecture session — define the detection engine, state backend integrations, and Slack workflow architecture.

**Recommended Follow-Up:** Competitive intelligence deep-dive on Firefly.ai (they're the closest to your positioning and the least understood).

---

`products/02-iac-drift-detection/party-mode/session.md` (new file, 130 lines)

# 🪩 dd0c/drift — Advisory Board Party Mode 🪩

**Product:** dd0c/drift — IaC Drift Detection & Auto-Remediation SaaS

**Format:** BMad Creative Intelligence Suite "Party Mode"

**Date:** February 28, 2026

---

## 🎙️ The Panel

1. **The VC** — Pattern-matching machine. Seen 1000 AI/DevOps pitches. Asks "what's the moat?"
2. **The CTO** — 20 years building infrastructure. Skeptical of AI hype. Asks "does drift detection actually work reliably at scale?"
3. **The Bootstrap Founder** — Built 3 profitable SaaS products solo. Asks "can one person ship this?"
4. **The DevOps Practitioner** — Platform engineer managing 200+ Terraform stacks. Asks "would I actually pay for this?"
5. **The Contrarian** — Deliberately argues the opposite. Finds the blind spots.

---

## 🥊 ROUND 1: INDIVIDUAL REVIEWS

**The VC**

"Look, I'm looking at the TAM and it's… cute. $3-5M SOM for a standalone tool? My associates wouldn't even take the meeting. BUT. The wedge strategy is brilliant. Capitalizing on the driftctl vacuum and HashiCorp's BSL fallout is classic opportunistic timing. What excites me is the compliance angle—SOC 2 turns this from a 'vitamin' into a 'painkiller.' What worries me is the ceiling. If you stay at $29/stack, you need massive volume. If Spacelift or env0 decide to commoditize drift detection to protect their platform play, your CAC payback goes negative.

**Vote: CONDITIONAL GO (Only if it's a wedge for the broader dd0c platform).**"

**The CTO**

"I've been paged at 2am because someone manually tweaked an ASG and Terraform wiped it. Drift is a real operational nightmare. But let's talk reality: doing continuous detection across AWS, GCP, Azure, Terraform, Tofu, AND Pulumi without causing rate limit chaos or requiring god-mode IAM roles is a hard engineering problem. I like the 'read-only' approach and hybrid CloudTrail + polling idea. What worries me is remediation. Auto-reverting infrastructure state via a Slack button sounds cool until it causes a cascading failure because you didn't understand the blast radius of that emergency hotfix.

**Vote: CONDITIONAL GO (Nail the blast-radius context before enabling write access).**"

**The Bootstrap Founder**

"I love this. It's a scalpel, not a Swiss Army knife. You don't need a sales team to sell a $29/stack tool that engineers can expense. The PLG motion of open-sourcing the CLI to capture the orphaned driftctl community is a proven playbook. Can one person ship this? Yes, if you leverage AI for the boilerplate and keep the UI dead simple. The risk? You're building a multi-tool SaaS. If you try to support TF, Tofu, Pulumi, and CloudFormation on day one, you'll drown in edge cases.

**Vote: GO.**"

**The DevOps Practitioner**

"Finally, someone who understands that I don't want to migrate my entire CI/CD pipeline just to see what drifted. I'm already running GitHub Actions. I just want a Slack alert when a security group changes. The 60-second setup is exactly what I need. Would I pay $29/stack? Honestly, my boss would approve a $149/mo 'Pro' tier instantly if it generates our SOC 2 evidence. My biggest worry is noise. If this thing alerts me every time an ASG scales or a dynamic tag updates, I will mute the channel in 48 hours and churn in a month.

**Vote: GO.**"

**The Contrarian**

"Everyone's talking about how great it is that you're an independent tool. That's your biggest weakness. The market is moving towards consolidated platforms, not fragmented point solutions. Engineers have tool fatigue. Also, your entire premise assumes IaC is the permanent future. What if the future is just AI agents directly managing cloud APIs, making 'state files' obsolete? You're building a better saddle for a horse while Ford is building the Model T. Furthermore, selling to engineers is a nightmare. They'd rather spend 40 hours building a fragile bash script than pay you $29 a month.

**Vote: NO-GO.**"

---

## 🥊 ROUND 2: CROSS-EXAMINATION

**1. The VC:** "Alright, Founder. You love this because it's a 'scalpel.' But a $29/stack utility caps out at what, $3-5M ARR? That's a rounding error. Without the platform play, this is a lifestyle business. Where's the enterprise scale?"

**2. The Bootstrap Founder:** "VC, you're blinded by billion-dollar valuations. A $3M ARR solo business with 90% margins is phenomenal. The goal isn't to IPO; the goal is to capture the 100,000 mid-market teams HashiCorp priced out. You don't need to be a unicorn to be highly profitable."

**3. The CTO:** "Let's talk tech, Practitioner. 'One-click revert' from a Slack message? Have you ever tried to un-drift a complex RDS parameter group via a naive Terraform apply? You'll cause a reboot during peak hours. You can't just slap a 'revert' button on infrastructure without understanding dependencies."

**4. The DevOps Practitioner:** "CTO, you're assuming we're going to auto-revert databases. The MVP isn't fixing RDS parameters. It's reverting the security group someone opened to 0.0.0.0/0 at 3am. For critical state, the button should just generate a PR so I can review the plan output."

**5. The Contrarian:** "You're both missing the point. Practitioner, you're not going to pay $29/stack for PR generation. You have a GitHub Action for that. If this tool doesn't automate the hard remediation, it's just an expensive notification service."

**6. The DevOps Practitioner:** "Wrong, Contrarian. My GitHub Action only runs when I push code. It doesn't run when a developer clicks buttons in the AWS console. The value is the continuous polling and CloudTrail integration, not just the code diff."

**7. The VC:** "Contrarian has a point, though. If Spacelift sees you taking market share, they drop a free drift detection tier. Then what? Your entire '17x cheaper' moat evaporates."

**8. The Bootstrap Founder:** "Spacelift can't afford to give away private workers for continuous drift polling. Their unit economics don't support it. And even if they do, migrating to Spacelift takes two months. Installing dd0c/drift takes 60 seconds. PLG beats sales cycles."

**9. The Contrarian:** "Setup takes 60 seconds? Only if your security team rubber-stamps giving a third-party SaaS tool IAM access to your production AWS account and Terraform state files containing secrets. Spoiler: they won't."

**10. The CTO:** "Contrarian is right about the IAM permissions. If the SaaS requires a cross-account role with `s3:GetObject` on the state bucket, every SOC 2 auditor will flag it. It needs to be an agent running inside their VPC that pushes data out."

**11. The DevOps Practitioner:** "Exactly. An open-source CLI running in our CI/CD cron or as an ECS task, pushing the diffs to the dd0c SaaS. Our security team would approve that in a heartbeat because no inbound access is required."

**12. The Contrarian:** "So now your 60-second setup just became 'deploy an ECS task, configure IAM roles, and manage a new internal agent.' Congratulations, you've just reinvented enterprise software deployment."
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 🥊 ROUND 3: STRESS TEST

### Point 1: What if HashiCorp ships native drift detection in Terraform Cloud?

**The VC:** "Severity: **8/10**. IBM didn't buy HashiCorp to sit on their hands. If they bake real-time drift detection into the TFC Plus tier, they kill your 'HashiCorp exodus' narrative before you scale."

**Mitigations:**

- **The CTO:** "Multi-IaC support is your shield. TFC will *never* support OpenTofu or Pulumi. By the time they ship it, a massive chunk of the market has migrated to Tofu."
- **The DevOps Practitioner:** "And TFC's drift detection historically sucks. It's scheduled, not continuous via CloudTrail. And it's buried in their UI. Your Slack-native integration beats them on UX."

**Pivot Option:** Lean entirely into OpenTofu, Pulumi, and CloudFormation. Become the multi-cloud, multi-tool standard, explicitly targeting teams that are anti-vendor-lock-in.

### Point 2: What if the driftctl community vacuum gets filled by another OSS project?

**The Contrarian:** "Severity: **6/10**. Snyk abandoned driftctl, sure. But all it takes is one bored platform engineer at a series B startup to fork it, slap a new UI on it, and gain 5k GitHub stars. Free always beats $29/mo."

**Mitigations:**

- **The Bootstrap Founder:** "That's why *you* open-source the core CLI from day one! You build the successor. The OSS project detects drift locally; the SaaS layer does the continuous polling, Slack alerts, RBAC, and SOC 2 reporting. Let them fork the CLI. They still need the operational dashboard."
- **The CTO:** "Running an OSS project is a lot of work. The mitigation here is speed. Ship the CLI, get the stars, establish yourself as the de-facto standard before anyone else realizes the vacuum exists."

**Pivot Option:** Turn the CLI into a generic standard that integrates with existing observability tools (Datadog, New Relic) and pivot the SaaS into an enterprise compliance reporting engine.

### Point 3: What if enterprises won't trust a third-party tool with their cloud credentials?

**The CTO:** "Severity: **9/10**. Reading state files means reading secrets. Giving a bootstrapped SaaS tool cross-account access to production AWS and state buckets is a massive red flag for any CISO."

**Mitigations:**

- **The DevOps Practitioner:** "This is the 'push vs. pull' architecture debate. The SaaS *never* pulls from their cloud. The customer runs the open-source CLI/agent in their own secure environment (e.g., a GitHub Actions cron or ECS task) and pushes an encrypted payload of the drift diffs to the dd0c SaaS."
- **The VC:** "You also need to get SOC 2 certified immediately. It's expensive, but you can't sell a compliance tool without passing compliance yourself."

**Pivot Option:** Offer an on-premise/VPC-deployed enterprise version of the dashboard. Instead of SaaS, you sell an appliance they deploy entirely inside their perimeter.

---

## 🥊 ROUND 4: FINAL VERDICT

**The Facilitator:** "The tension is thick. The market is waiting. We've stress-tested the tech, the timing, and the TAM. We need a verdict. Go, No-Go, or Pivot?"

**The Panel Deliberates:**

- **The VC:** "If you launch it as a standalone unicorn pitch, I'm out. If you launch it as a wedge for the broader dd0c suite, I'm in."
- **The CTO:** "Push-architecture for the agent, or you're dead on arrival with enterprise security teams. If you build the push agent, I'm in."
- **The Bootstrap Founder:** "Ship the open-source CLI yesterday. The vacuum is real. I'm a hard GO."
- **The DevOps Practitioner:** "I've already asked my boss for the corporate card. Give me Slack alerts and I'm a GO."
- **The Contrarian:** "I still hate it. You're building an alarm system for a burning house instead of putting out the fire. But fine, the burning house owners have money. PIVOT to a compliance engine, and I guess it's a GO."

### Decision: SPLIT DECISION → CONDITIONAL GO

**Revised Priority in dd0c Lineup:**

Move dd0c/drift from Phase 3 to **Phase 2**. The driftctl vacuum is a time-sensitive, closing window. You must launch the open-source CLI immediately, capitalize on the HashiCorp BSL fallout, and use it as a powerful wedge into mid-market engineering teams.

**Top 3 Must-Get-Right Items:**

1. **Push-Based Architecture:** The SaaS must NEVER ask for inbound read access to a customer's AWS account or Terraform state buckets. The open-source agent runs in the customer's CI/CD or VPC and pushes drift events to the dd0c API.
2. **Slack-Native UX:** The alert must be the UI. Rich messages, diff context, and action buttons (`Revert`, `Snooze`, `Acknowledge`) entirely within Slack.
3. **Multi-IaC Day One:** You must launch with Terraform AND OpenTofu support. Supporting Pulumi shortly after is a massive differentiator against HashiCorp.
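
The push-based contract in item 1 can be sketched in a few lines: the agent computes drift locally and only ever sends data outbound. This is a minimal illustration, assuming a hypothetical ingest endpoint and payload shape; none of these names are the real dd0c API.

```python
import hashlib
import json
import time

# Hypothetical ingest endpoint -- illustrative only, not a real dd0c URL.
DD0C_INGEST_URL = "https://api.dd0c.example/v1/drift-events"

def build_drift_payload(stack: str, drifted: list[dict], api_key: str) -> dict:
    """Package locally-computed drift diffs for an outbound push.

    The agent runs inside the customer's CI/CD or VPC; the SaaS never
    needs inbound access to the cloud account or state bucket.
    """
    body = {
        "stack": stack,
        "detected_at": int(time.time()),
        "drifted_resources": drifted,
    }
    # Integrity tag so the SaaS can verify the payload (sketch only;
    # a real agent would use proper HMAC and transport encryption).
    serialized = json.dumps(body, sort_keys=True)
    body["signature"] = hashlib.sha256((serialized + api_key).encode()).hexdigest()
    # Real agent would then POST outbound, e.g.:
    #   requests.post(DD0C_INGEST_URL, json=body, timeout=10)
    return body

payload = build_drift_payload(
    "prod-network",
    [{"address": "aws_security_group.web", "attribute": "ingress",
      "expected": "443 only", "actual": "0.0.0.0/0:22"}],
    api_key="dummy",
)
```

The security-relevant property is the direction of the arrow: the only network call is outbound from the customer's environment, which is what lets a security team approve it without a cross-account role.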

**The One Kill Condition:**

If you launch the open-source CLI on Hacker News/Reddit and cannot achieve **500 GitHub stars and 50 free-tier SaaS signups within 30 days**, kill the SaaS product. It means the market has either solved the problem internally or the pain isn't acute enough to adopt new tooling.

**Final Verdict: GO.**

*The panel adjourns, grabbing slices of cold pizza while the CTO mutters about IAM policies.*
695
products/02-iac-drift-detection/product-brief/brief.md
Normal file
@@ -0,0 +1,695 @@
# dd0c/drift — Product Brief

**Product:** IaC Drift Detection & Auto-Remediation SaaS

**Author:** Max Mayfield (Product Manager, Phase 5)

**Date:** February 28, 2026

**Status:** Investor-Ready Draft

**Pipeline Phase:** BMad Phase 5 — Product Brief

---

## 1. EXECUTIVE SUMMARY

### Elevator Pitch

dd0c/drift is a focused, developer-first SaaS tool that continuously monitors Terraform and OpenTofu infrastructure (with Pulumi support as a fast follow) for drift from declared state — and lets engineers fix it in one click from Slack. It replaces the fragile cron jobs, manual `terraform plan` runs, and tribal knowledge that teams currently rely on, at one-tenth to one-seventeenth the price of platform competitors like Spacelift or env0.

**The one-liner:** Connect your IaC state. Get Slack alerts when something drifts. Fix it in one click. Set up in 60 seconds.

### Problem Statement

Infrastructure as Code promised a single source of truth. In practice, it's a polite fiction.

Engineers make console changes during 2am incidents. Auto-scaling events mutate state. Emergency hotfixes bypass the IaC pipeline. The result: a growing, invisible gap between what the code declares and what actually exists in the cloud. This gap — drift — is the #1 operational pain point of IaC at scale.

**The data:**

- Engineers spend 2-5x longer debugging issues when actual state doesn't match declared state (design thinking persona research).
- Teams with 20+ stacks report spending 30% of sprint capacity on unplanned drift-related firefighting.
- Pre-audit drift reconciliation consumes 2+ weeks of engineering time per audit cycle — time that produces zero new value.
- A single undetected security group drift (port opened to 0.0.0.0/0) has led to breaches, compliance failures, and six-figure customer contract losses.
- The average mid-market team (20 stacks, 10 engineers) spends an estimated $47,000/year on manual drift management — a cost that's invisible because it's buried in engineer time, not a line item.

There is no focused, affordable, self-serve tool that solves this. The market's only dedicated open-source option — driftctl — was acquired by Snyk and abandoned. Platform vendors (Spacelift, env0, Terraform Cloud) bundle drift detection as a feature inside $500+/mo platforms that require full workflow migration. The result: most teams "solve" drift with bash scripts, tribal knowledge, and hope.

### Solution Overview

dd0c/drift is a standalone drift detection and remediation tool — not a platform. It does one thing and does it better than anyone:

1. **Hybrid Detection Engine** — Combines CloudTrail event-driven detection (real-time for high-risk resources like security groups and IAM) with scheduled polling (comprehensive coverage for everything else). This is the "security camera" approach vs. the industry-standard "flashlight" (`terraform plan`).
2. **Slack-First Remediation** — Rich Slack messages with drift context (who changed it, when, blast radius) and action buttons: `[Revert]` `[Accept]` `[Snooze]` `[Assign]`. For 80% of users, the Slack alert IS the product. No dashboard required.
3. **One-Click Fix** — Revert drift to declared state, or accept it by auto-generating a PR that updates code to match reality. Both directions. The engineer chooses which is the source of truth, per resource.
4. **60-Second Onboarding** — `drift init` auto-discovers state backend, cloud provider, and resources. No YAML config. No platform migration. Plugs into existing Terraform + GitHub + Slack workflows.
5. **Push-Based Architecture** — An open-source agent runs inside the customer's CI/CD or VPC and pushes encrypted drift data to the dd0c SaaS. The SaaS never requires inbound access to customer cloud accounts or state files. This resolves the #1 enterprise adoption blocker (IAM trust).
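
The hybrid engine's routing decision from item 1 can be sketched as a simple classifier: high-blast-radius resource types go through the event-driven "security camera" path, everything else falls back to the scheduled "flashlight" sweep. The resource-type lists below are illustrative assumptions, not dd0c/drift's actual configuration.

```python
# Resource types that warrant real-time, CloudTrail event-driven detection.
# Illustrative assumption: security groups and IAM are the high-risk set.
REALTIME_TYPES = {"aws_security_group", "aws_iam_role", "aws_iam_policy"}

def detection_mode(resource_type: str) -> str:
    """Route a resource type to its detection strategy."""
    if resource_type in REALTIME_TYPES:
        return "cloudtrail-event"   # react within seconds of the API call
    return "scheduled-poll"         # comprehensive sweep on a timer

mode = detection_mode("aws_security_group")
```

The split matters because continuous polling of every resource is expensive, while CloudTrail events are free to receive but only cover changes made through the AWS API; combining both gives real-time coverage where it counts and completeness everywhere else.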

### Target Customer

**Primary:** Mid-market engineering teams (5-50 engineers, 10-100 Terraform/OpenTofu stacks, AWS-first) who experience meaningful drift but can't afford or don't need a full IaC platform. They use GitHub Actions for CI/CD, Slack for communication, and a credit card for tooling purchases under $500/mo.

**Three buyer personas, one product:**

- **The Infrastructure Engineer (Ravi):** Buys with a credit card because it eliminates 2am dread. Bottom-up adoption driven by individual pain.
- **The Security/Compliance Lead (Diana):** Approves the budget because it generates SOC 2 audit evidence automatically. Middle-out adoption driven by compliance requirements.
- **The DevOps Team Lead (Marcus):** Champions it to leadership because it produces drift metrics and eliminates tribal knowledge. Top-down adoption driven by organizational visibility.

### Key Differentiators

| Differentiator | dd0c/drift | Competitors |
|---|---|---|
| **Product focus** | Drift detection IS the product (100% of engineering effort) | Drift is a feature (5% of engineering effort) |
| **Price** | $49-$399/mo (tiered bundles) | $500-$2,000+/mo (platforms) |
| **Onboarding** | 60 seconds, self-serve, credit card | Weeks-to-months, sales calls, platform migration |
| **Multi-IaC** | Terraform + OpenTofu at launch, Pulumi fast-follow | Terraform-only or limited multi-tool |
| **Architecture** | Push-based agent (no inbound cloud access) | Pull-based (requires IAM cross-account roles) |
| **UX paradigm** | Slack-native with action buttons | Dashboard-first, Slack as afterthought |
| **Open-source** | CLI detection engine is OSS (Apache 2.0) | Proprietary |

---

## 2. MARKET OPPORTUNITY

### Market Sizing

**TAM (Total Addressable Market) — IaC Management & Governance:**

The global IaC market is projected at $2.5-$3.5B by 2027 (25-30% CAGR). The drift detection and remediation slice — including drift features embedded in platforms — represents an estimated **$800M-$1.2B** by 2027.

**SAM (Serviceable Addressable Market) — Teams Using Terraform/OpenTofu/Pulumi Who Need Drift Detection:**

- 150,000-200,000 organizations actively use Terraform/OpenTofu in production.
- ~60% (90,000-120,000) have 10+ stacks and experience meaningful drift.
- Conservative estimate targeting teams with 10-100 stacks (excluding enterprises that will buy Spacelift regardless): **$200-$400M SAM**.

**SOM (Serviceable Obtainable Market) — 24-Month Capture:**

- Solo founder with PLG motion, targeting SMB/mid-market (5-50 engineers, 10-100 stacks).
- Year 1 realistic target: 200-500 paying customers at ~$145/mo average = **$350K-$870K ARR**.
- Year 2 with expansion and word-of-mouth: **$1.5M-$3M ARR**.
- 24-month SOM: **$3-$5M**.

**The honest framing:** $3-5M SOM as a standalone product is a strong bootstrapped business, not a venture-scale outcome. The strategic value is as a wedge into the broader dd0c platform (route + cost + alert + drift + portal), which targets a $50M+ opportunity. Drift alone funds the founder; the platform funds the company.

### Competitive Landscape (Top 5)

| Competitor | What They Are | Drift Capability | Pricing | Vulnerability |
|---|---|---|---|---|
| **Spacelift** | IaC management platform ($40M+ raised) | Good — but a feature, not the product. Requires private workers. | $500-$2,000+/mo | Can't price down to $49 without cannibalizing enterprise ACV. Requires full workflow migration. |
| **env0** | "Environment as a Service" platform ($28M+ raised) | Basic — secondary to their core positioning | $350-$500+/mo (per-user) | Jack of all trades. Per-user pricing punishes growing teams. Same migration problem. |
| **HCP Terraform (HashiCorp/IBM)** | Native Terraform Cloud | Basic — scheduled health assessments, no remediation workflows | Variable; gets expensive at scale | IBM acquisition triggered OpenTofu exodus. Terraform-only. BSL license killed community goodwill. |
| **Firefly.ai** | Cloud Asset Management ($23M+ raised) | Good — but bundled in enterprise package | $1,000+/mo, enterprise-only, "Contact Sales" | Sells to CISOs, not engineers. No self-serve. A 5-person startup can't get a demo. |
| **driftctl (Snyk)** | Open-source drift detection CLI | Was good — now dead | Free (abandoned OSS) | Acquired and abandoned. Community orphaned. README still says "beta." **This vacuum is our market entry.** |

**The competitive insight:** Every live competitor treats drift detection as a feature inside a platform. Nobody treats it as the entire product. dd0c/drift's value curve is the inverse of every competitor — zero on CI/CD orchestration and policy engines, 10/10 on drift detection depth, remediation workflows, Slack-native UX, and self-serve onboarding. This is textbook Blue Ocean positioning.

### Timing Thesis — Why February 2026

Four forces are converging that create a 12-18 month window of opportunity:

**1. The HashiCorp Exodus (2024-2026)**

IBM's acquisition of HashiCorp and the BSL license change triggered the largest migration event in IaC history. Teams migrating from Terraform Cloud to OpenTofu + GitHub Actions lose their (mediocre) drift detection. They need a replacement and are actively searching right now.

**2. The driftctl Vacuum**

driftctl was the only focused, open-source drift detection tool. Snyk killed it. GitHub issues, Reddit threads, and HN comments are filled with "what do I use instead of driftctl?" There is no answer. dd0c/drift IS the answer. This vacuum is time-limited — someone will fill it within 12-18 months.

**3. IaC Adoption Hit Mainstream**

IaC is no longer a practice of elite DevOps teams. Mid-market companies with 20-50 engineers now have 30+ Terraform stacks. They've graduated from "learning IaC" to "suffering from IaC at scale." The market of sufferers just 10x'd.

**4. Compliance Is Becoming a Forcing Function**

- **SOC 2 Type II:** Auditors increasingly ask "How do you ensure infrastructure matches declared configuration?" — "we run terraform plan sometimes" is no longer acceptable.
- **PCI DSS 4.0** (effective March 2025): Requirement 1.2.5 requires documentation and review of all allowed services, protocols, and ports. Security group drift is now a PCI finding.
- **HIPAA/HITRUST:** Healthcare SaaS companies need to prove infrastructure configurations haven't been tampered with.
- **FedRAMP/StateRAMP:** Continuous monitoring of configuration state maps directly to NIST 800-53 CM-3 and CM-6.
- **Cyber Insurance:** Insurers are asking detailed questions about infrastructure configuration management. Continuous drift detection improves rates.

Compliance transforms drift detection from "engineering nice-to-have" to "business requirement." When the auditor says "you need this," the CFO writes the check.

### Market Trends

- **Multi-IaC reality:** Teams no longer use just Terraform. They use Terraform AND OpenTofu AND Pulumi AND CloudFormation (legacy). The first tool that handles drift across all of them owns the "Switzerland" position.
- **Platform fatigue:** Engineering teams are fatigued by tool sprawl. They want focused tools that integrate with existing workflows, not new platforms that require migration.
- **AI-assisted infrastructure:** AI agents (Pulumi Neo, GitHub Copilot) are generating more IaC, increasing the volume of managed resources and the surface area for drift. AI doesn't prevent a panicked engineer from opening a security group at 2am.
- **Shift from periodic to continuous:** The industry is moving from point-in-time compliance checks to continuous monitoring. Drift detection is the infrastructure equivalent of this shift.

---

## 3. PRODUCT DEFINITION

### Value Proposition

**For infrastructure engineers:** "Stop dreading `terraform apply`. Know exactly what drifted, who changed it, and fix it in one click — without leaving Slack."

**For compliance leads:** "Generate continuous SOC 2 / HIPAA compliance evidence automatically. Eliminate the 2-week pre-audit scramble."

**For DevOps leads:** "See drift across all stacks in one dashboard. Replace tribal knowledge with data. Show leadership a number, not an anecdote."

**The composite:** dd0c/drift closes the loop between declared state and actual state continuously — restoring trust in IaC as a practice, eliminating reactive firefighting, and turning compliance from a quarterly scramble into an always-on posture.

### Personas

**Persona 1: Ravi — The Infrastructure Engineer**

- Senior infra engineer, 6 years experience, manages 23 Terraform stacks
- Runs `terraform plan` manually before every apply, scanning output like a bomb technician
- Maintains a mental map of "things that have drifted but I haven't fixed yet"
- Feels anxiety before every apply, guilt about known drift, loneliness at 2am when nothing matches the code
- **JTBD:** "When I'm about to run `terraform apply`, I want to know exactly what has drifted so I can apply with confidence instead of fear."
- **Buys because:** Eliminates 2am dread. Credit card purchase. Bottom-up.

**Persona 2: Diana — The Security/Compliance Lead**

- Head of Security, 10 years experience, responsible for SOC 2 Type II across 4 AWS accounts
- Maintains a 200-row spreadsheet mapping compliance controls to infrastructure resources — always slightly out of date
- Spends 60% of her time on evidence collection that should be automated
- **JTBD:** "When an auditor asks for evidence that infrastructure matches declared state, I want to generate a real-time compliance report in one click."
- **Buys because:** Generates audit evidence. Budget approval. Middle-out.

**Persona 3: Marcus — The DevOps Team Lead**

- DevOps lead, 12 years experience, manages 67 stacks through a team of 4 engineers
- Has zero aggregate visibility — manages infrastructure health through standup anecdotes and tribal knowledge
- Team is burning out from on-call burden inflated by drift-related incidents
- **JTBD:** "When reporting to leadership, I want to show drift metrics trending over time so I can justify tooling investment with data."
- **Buys because:** Produces metrics, eliminates bus factor. Champions to leadership. Top-down.

### Feature Roadmap

#### MVP (Month 1 — Launch)

| Feature | Description |
|---|---|
| **Hybrid detection engine** | CloudTrail event-driven (real-time for security groups, IAM) + scheduled polling (comprehensive). The "security camera" vs. "flashlight" approach. |
| **Terraform + OpenTofu support** | Full support for both from Day 1. Multi-IaC is a launch differentiator, not a roadmap item. |
| **Slack-native alerts** | Rich messages with drift context: what changed, who changed it (CloudTrail attribution), when, and blast radius preview. Action buttons: `[Revert]` `[Accept]` `[Snooze]` `[Assign]`. |
| **One-click revert** | Revert drift to declared state via Terraform apply scoped to the drifted resource. Includes blast radius check before execution. |
| **One-click accept** | Accept drift by auto-generating a PR that updates IaC code to match current reality. Both directions — engineer chooses which is the source of truth. |
| **Drift score dashboard** | Single number per stack and aggregate across all stacks. "Your infrastructure is 94% aligned with declared state." Minimal but functional web UI. |
| **Push-based agent** | Open-source CLI/agent runs in customer's CI/CD (GitHub Actions cron) or VPC (ECS task). Pushes encrypted drift data to dd0c SaaS. No inbound access required. |
| **60-second onboarding** | `drift init` auto-discovers state backend, cloud provider, and resources. No YAML config files. |
| **Stack ownership** | Assign stacks to engineers. Route drift alerts to the right person automatically. |
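
The drift score's headline number reads naturally as an alignment ratio. A hypothetical sketch of that arithmetic (the real dashboard may weight resources by risk or stack):

```python
def drift_score(total_resources: int, drifted_resources: int) -> float:
    """Percentage of resources matching declared state.

    Illustrative reading of the dashboard's single number; an empty
    stack is treated as fully aligned.
    """
    if total_resources == 0:
        return 100.0
    aligned = total_resources - drifted_resources
    return round(100.0 * aligned / total_resources, 1)

score = drift_score(100, 6)  # 6 drifted out of 100 -> 94.0
```

A plain unweighted ratio is easy to explain to leadership, which is the point of the metric; weighting by blast radius would be more accurate but harder to communicate.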

#### V2 (Month 3-4)

| Feature | Description |
|---|---|
| **Per-resource automation policies** | Spectrum of automation per resource type: Auto-revert (security groups opened to 0.0.0.0/0), Alert + one-click (IAM changes), Digest only (tag drift), Ignore (ASG instance counts). This spectrum IS the product's sophistication. |
| **Compliance report generation** | One-click SOC 2 / HIPAA evidence reports. Continuous audit trail of all drift events and resolutions. Exportable PDF/CSV. |
| **Pulumi support** | Extend detection engine to Pulumi state. Capture the underserved Pulumi community. |
| **Drift trends & analytics** | Drift rate over time, mean time to remediation, most-drifted resource types, drift by team member. The metrics Marcus needs for leadership. |
| **PagerDuty / OpsGenie integration** | Route critical drift (security groups, IAM) through existing on-call rotation. |
| **Teams & RBAC** | Multi-team support with role-based access. Stack-level permissions. |
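
The per-resource policy spectrum in the table above can be sketched as a first-match rule list. The rule format, matching keys, and default are assumptions for illustration; only the four actions mirror the table.

```python
# First-match policy rules: (condition, action). Conditions are simple
# equality checks on fields of a drift event -- a hypothetical event shape.
POLICIES = [
    ({"type": "aws_security_group", "open_to_world": True}, "auto-revert"),
    ({"type": "aws_iam_role"},                              "alert-one-click"),
    ({"attribute": "tags"},                                 "digest-only"),
    ({"attribute": "desired_capacity"},                     "ignore"),  # ASG counts
]

def policy_for(drift_event: dict) -> str:
    """Pick the automation level for a drift event."""
    for condition, action in POLICIES:
        if all(drift_event.get(key) == value for key, value in condition.items()):
            return action
    # Safe default: notify and let a human decide.
    return "alert-one-click"
```

Putting the dangerous case (a world-open security group) first and defaulting everything unknown to "alert" keeps the automation conservative, which is what makes auto-remediation sellable at all.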

#### V3 (Month 6-9)

| Feature | Description |
|---|---|
| **Drift prediction** | "Based on patterns from N similar organizations, this resource has a 78% chance of drifting in the next 48 hours." Requires aggregate data from 500+ customers. |
| **Industry benchmarking** | "Your drift rate is 12%. The median for Series B SaaS companies is 18%. You're in the top quartile." Competitive FOMO that drives adoption. |
| **Multi-cloud support** | Azure and GCP detection alongside AWS. |
| **CloudFormation support** | Capture legacy stacks that haven't migrated to Terraform/OpenTofu. |
| **SSO / SAML** | Enterprise authentication. Unlocks larger team adoption. |
| **API & webhooks** | Programmatic access to drift data for custom integrations and internal dashboards. |
| **dd0c platform integration** | Drift data feeds into dd0c/alert (intelligent routing), dd0c/portal (service catalog enrichment), and dd0c/run (automated runbooks for drift remediation). Cross-module flywheel. |

### User Journey

```
1. DISCOVER
   Engineer sees "driftctl alternative" blog post, HN launch, or Reddit recommendation.
   Downloads open-source drift-cli. Runs `drift check` on one stack.
   Finds 7 drifted resources. "Oh crap."

2. ACTIVATE (60 seconds)
   Signs up for free tier. Runs `drift init`.
   CLI auto-discovers S3 state backend, AWS account, 3 stacks.
   First Slack alert arrives within 5 minutes.

3. ENGAGE (Week 1)
   Daily Slack alerts become part of the workflow.
   Reverts a security group drift in one click. Accepts a tag drift.
   Checks drift score dashboard — "We're at 87% alignment."

4. CONVERT (Week 2-4)
   Hits 3-stack limit on free tier. Wants to add remaining 12 stacks.
   Upgrades to Starter ($49/mo, 10 stacks) with a credit card.
   No manager approval needed. No procurement.

5. EXPAND (Month 2-6)
   Adds more stacks. Hits 10-stack limit. Upgrades to Pro ($149/mo, 30 stacks).
   Diana (compliance) discovers the compliance report feature.
   Generates SOC 2 evidence in one click. Becomes internal champion.
   Marcus (team lead) sees the drift trends dashboard. Uses it in leadership reports.

6. ADVOCATE (Month 6+)
   Team presents "How we reduced drift by 90%" at internal engineering all-hands.
   Engineer mentions dd0c/drift on r/terraform. Word-of-mouth loop begins.
   Team evaluates dd0c/cost and dd0c/alert — platform expansion.
```

### Pricing — Resolution

**The pricing question:** The brainstorm session proposed $29/stack/month flat pricing. The innovation strategy recommended tiered bundles ($49-$399/mo) over flat per-stack. The party mode panel's DevOps Practitioner said "my boss would approve a $149/mo Pro tier instantly if it generates SOC 2 evidence." The Contrarian argued $29/stack is too low for meaningful revenue.

**Resolution: Tiered bundles win.** Here's why:

Pure per-stack pricing has three fatal flaws:

1. It penalizes good architecture — teams that split into many small stacks (best practice) pay more.
2. It creates enterprise sticker shock — 200 stacks × $29 = $5,800/mo, at which point Spacelift's platform looks reasonable.
3. It's unpredictable — customers can't forecast costs as they add stacks.

Tiered bundles solve all three while preserving the "$29/stack" marketing anchor (Starter tier = $49/mo for 10 stacks ≈ $4.90/stack effective).

**Final Pricing:**

| Tier | Price | Stacks | Polling Frequency | Key Features |
|---|---|---|---|---|
| **Free** | $0/mo | 3 stacks | Daily | Slack alerts, basic dashboard, drift score |
| **Starter** | $49/mo | 10 stacks | 15-minute | + One-click remediation, stack ownership, CloudTrail attribution |
| **Pro** | $149/mo | 30 stacks | 5-minute | + Compliance reports, auto-remediation policies, drift trends, API, PagerDuty |
| **Business** | $399/mo | 100 stacks | 1-minute | + SSO, RBAC, audit trail export, priority support, custom integrations |
| **Enterprise** | Custom | Unlimited | Real-time (CloudTrail) | + SLA, dedicated support, on-prem agent option, custom compliance frameworks |
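
The tier table maps naturally to a lookup. A sketch of tier selection and the effective per-stack arithmetic behind the "less than $5/stack" anchor, using only the published numbers above:

```python
# (name, monthly_price_usd, max_stacks), ordered smallest tier first.
TIERS = [
    ("Free", 0, 3),
    ("Starter", 49, 10),
    ("Pro", 149, 30),
    ("Business", 399, 100),
]

def tier_for(stacks: int):
    """Smallest published tier that covers the stack count."""
    for name, price, max_stacks in TIERS:
        if stacks <= max_stacks:
            return name, price
    return "Enterprise", None  # custom pricing, unlimited stacks

def effective_per_stack(monthly_price: int, stacks: int) -> float:
    """Effective $/stack at full tier utilization."""
    return round(monthly_price / stacks, 2)

tier, price = tier_for(12)            # 12 stacks -> Pro at $149/mo
starter_rate = effective_per_stack(49, 10)  # $4.90/stack at full Starter usage
```

Note the marketing subtlety: the $4.90 figure assumes a full 10-stack Starter tier; a team with 4 stacks on the same plan is paying $12.25/stack, which is why the anchor says "starting at."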

**Pricing justification:**

- **Free tier is genuinely useful** — 3 stacks with daily polling creates habit and word-of-mouth. This is the viral loop.
- **Starter at $49** — Below the "ask my manager" threshold. An engineer can expense this. No procurement. No legal review.
- **Pro at $149** — The sweet spot. Compliance reports unlock Diana's budget. 30 stacks covers most mid-market teams. This is the volume tier.
- **Business at $399** — Still a fraction of Spacelift's price. Covers large teams (100 stacks) with enterprise features. Natural upsell trigger when teams hit 30 stacks.
- **Enterprise at custom** — Exists for the 1% who need unlimited stacks, SLAs, and on-prem. Not the focus. Don't build a sales team for this.

**The $29/stack anchor still works for marketing:** "Starting at less than $5/stack" or "17x cheaper than Spacelift" are the headlines. The tiered pricing is what they see on the pricing page.

---

## 4. GO-TO-MARKET PLAN

### Launch Strategy

dd0c/drift launches as a Phase 2 product in the dd0c suite (months 4-6), following dd0c/route (LLM cost router). Victor's innovation strategy recommended moving drift up from Phase 3 due to the time-sensitive driftctl vacuum. The party mode panel unanimously agreed. This brief confirms: **drift launches in Phase 2.**

The GTM motion is pure PLG (Product-Led Growth). No sales team. No enterprise outbound. No "Contact Sales" buttons. The product sells itself through:

1. An open-source CLI that proves value locally before asking for a signup.
2. A 60-second onboarding flow that converts interest into activation instantly.
3. Slack alerts that deliver value daily, creating habit and dependency.
4. Word-of-mouth from engineers who share their drift score improvements.
|
||||||
|
|
||||||
|
### Beachhead: driftctl Refugees + r/terraform

**Primary beachhead:** Engineers who used driftctl and are actively searching for a replacement. These are pre-qualified leads — they already understand the problem, have budget intent, and are searching for a solution that doesn't exist yet.

**Where they live:**

- **driftctl GitHub Issues** — Open issues from people asking "is this project dead?" and "what do I use instead?" These are literal inbound leads.
- **r/terraform** (80K+ members) — Weekly posts asking for drift solutions. Search "drift" and find your first 50 prospects.
- **r/devops** (300K+ members) — Broader audience, drift discussions surface regularly.
- **Hacker News** — "Show HN" launches for developer tools consistently hit the front page. Solo founder + open-source + clear pricing = HN catnip.
- **HashiCorp Community Forum** — Teams migrating from TFC to OpenTofu discussing tooling gaps. Drift detection is consistently mentioned.
- **DevOps Slack communities** — Rands Leadership Slack, DevOps Chat, Kubernetes Slack (#terraform channel).
- **Twitter/X DevOps community** — DevOps influencers regularly discuss IaC pain points.

**First 10 customer acquisition playbook:**

- **Customers 1-3:** Personal network. Brian is a senior AWS architect — he knows people managing Terraform stacks. Free access for 3 months in exchange for weekly feedback. These are design partners.
- **Customers 4-6:** Community engagement. 2 weeks of answering drift questions on r/terraform and r/devops. Don't pitch. Just help. Build credibility, then launch.
- **Customers 7-10:** Content-driven inbound. "The True Cost of Infrastructure Drift" blog post + Drift Cost Calculator. Convert readers to free tier, free tier to paid.

### Growth Loops

**Loop 1: Open-Source → Free Tier → Paid (Primary)**

```
Engineer discovers drift-cli on GitHub/HN
→ Runs `drift check` locally, finds drift
→ Signs up for free tier (3 stacks)
→ Gets hooked on Slack alerts
→ Hits stack limit, upgrades to Starter/Pro
→ Tells teammate → teammate discovers drift-cli
```

**Loop 2: Compliance → Budget → Expansion**

```
Diana (compliance) discovers drift reports during audit prep
→ Generates SOC 2 evidence in one click (vs. 2-week manual scramble)
→ Becomes internal champion, approves budget increase
→ Team expands to Pro/Business tier
→ Diana mentions dd0c/drift to compliance peers at industry events
```

**Loop 3: Content → SEO → Inbound**

```
Blog post ranks for "terraform drift detection" / "driftctl alternative"
→ Engineer reads post, tries Drift Cost Calculator
→ Sees "$47K/year in drift costs" → downloads CLI
→ Enters Loop 1
```

**Loop 4: Incident → Adoption (Event-Driven)**

```
Team has a drift-related incident (security group change causes outage)
→ Post-mortem action item: "evaluate drift detection tooling"
→ Engineer Googles "terraform drift detection tool"
→ Finds dd0c/drift blog post or GitHub repo
→ Enters Loop 1
```

### Content Strategy

**Pillar content (SEO + thought leadership):**

1. "The True Cost of Infrastructure Drift" — with interactive Drift Cost Calculator. The single most important marketing asset. Quantifies invisible pain.
2. "driftctl Is Dead. Here's What to Use Instead." — Will rank for "driftctl alternative" on Google. Direct capture of orphaned community.
3. "How to Detect Terraform Drift Without Spacelift" — Targets teams evaluating platforms who don't want platform migration.
4. "SOC 2 and Infrastructure Drift: A Compliance Guide" — Targets Diana persona. Compliance-driven purchase justification.
5. "Terraform vs OpenTofu: Drift Detection Compared" — Captures migration-related search traffic.

**The Drift Cost Calculator:**

A web tool where an engineer inputs: number of stacks, team size, average salary, frequency of manual checks, drift incidents per quarter. Output: "Your team spends approximately $47,000/year on manual drift management. At $149/mo for dd0c/drift Pro, your ROI is 26x in the first year." This is shareable — engineers send it to managers. It captures leads. It's content marketing gold.

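As a rough sketch, the arithmetic behind such a calculator could look like the following. The formula, the function name `drift_cost_estimate`, and every input value are illustrative assumptions for this brief, not dd0c's actual model:

```python
def drift_cost_estimate(stacks, team_size, avg_salary, checks_per_week,
                        minutes_per_check, incidents_per_quarter,
                        hours_per_incident=8):
    """Rough annual cost of manual drift management (illustrative only)."""
    hourly_rate = avg_salary / 2080  # ~2,080 working hours per year
    # Routine manual checks: terraform plan runs plus eyeballing the diffs.
    check_hours = stacks * checks_per_week * (minutes_per_check / 60) * 52
    # Firefighting drift-related incidents, whole team involved.
    incident_hours = incidents_per_quarter * 4 * hours_per_incident * team_size
    return (check_hours + incident_hours) * hourly_rate

# Hypothetical mid-size team: the output lands in the tens of thousands,
# which is the order of magnitude the "$47K/year" headline relies on.
annual_cost = drift_cost_estimate(stacks=40, team_size=3, avg_salary=160_000,
                                  checks_per_week=1, minutes_per_check=15,
                                  incidents_per_quarter=2)
roi_multiple = annual_cost / (149 * 12)  # vs. dd0c/drift Pro at $149/mo
```

The design point is that every input is something an engineer already knows off the top of their head, which keeps the tool shareable.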
### Open-Source CLI as Lead Gen

**What's open-source (Apache 2.0):**

- `drift-cli` — Local drift detection for Terraform/OpenTofu. Runs `drift check` and outputs drifted resources to stdout. Works offline. No account needed. No telemetry. Single-stack scanning.

**What's paid SaaS:**

- Continuous monitoring (scheduled + event-driven)
- Slack/PagerDuty alerts with action buttons
- One-click remediation (revert or accept)
- Dashboard, drift score, trends
- Compliance reports
- Team features (ownership, routing, RBAC)
- Historical data
- Multi-stack aggregate view

**The conversion funnel:**

`drift-cli` outputs: "Found 7 drifted resources. View details and remediate at app.dd0c.dev" — the natural upsell. This is the Sentry/PostHog/GitLab playbook. Open-source core builds trust and adoption. Paid SaaS captures value from teams that need operational features.

**Target:** 1,000 GitHub stars in the first 3 months. Stars = social proof = distribution.

### Partnerships

- **OpenTofu Foundation:** Become a visible ecosystem partner. Sponsor the project. Position dd0c/drift as "the drift detection tool for the OpenTofu community." OpenTofu teams are actively building their toolchain — be part of it from Day 1.
- **Slack Marketplace:** List dd0c/drift as a Slack app. "Install from Slack → OAuth → connect state backend → first alert in 5 minutes." Underrated distribution channel.
- **AWS Marketplace:** List for teams that want to pay through their AWS bill (consolidated billing, committed spend credits). Also provides credibility and discoverability.
- **Digger (OSS Terraform CI/CD):** Digger users need drift detection. Integration partnership, not competition.
- **Terraform Registry:** List as a complementary tool. Publish a `terraform-provider-driftcheck` data source.

### 90-Day Launch Timeline

**Days 1-30: Build the Foundation**

- Week 1-2: Build `drift-cli` (open-source). Terraform + OpenTofu support. Single-stack scanning. Output to stdout.
- Week 2-3: Build SaaS detection engine. Multi-stack continuous monitoring. S3/GCS state backend integration.
- Week 3-4: Build Slack integration. Drift alerts with action buttons. This is the MVP killer feature.
- Week 4: Build dashboard. Drift score, stack list, drift history. Minimal but functional.
- **Deliverable:** Working product that detects drift across multiple Terraform/OpenTofu stacks and alerts via Slack.

**Days 31-60: Seed the Community**

- Week 5: Publish `drift-cli` on GitHub. Clear README with GIF demos. Target: 100 stars in week 1.
- Week 5-6: Begin daily engagement on r/terraform, r/devops. Answer drift questions. Don't pitch.
- Week 6: Publish "The True Cost of Infrastructure Drift" blog post with Drift Cost Calculator.
- Week 7: Publish "driftctl Is Dead. Here's What to Use Instead."
- Week 7-8: Recruit 3-5 design partners from personal network. Free access, weekly feedback calls.
- **Deliverable:** 200+ GitHub stars, 50+ email list signups, 3-5 design partners actively using the product.

**Days 61-90: Launch and Convert**

- Week 9: "Show HN" launch. Tuesday or Wednesday morning (US Eastern). Landing page, pricing page, and docs ready.
- Week 9-10: Respond to every HN comment. Fix bugs within 24 hours. Ship daily.
- Week 10: Launch on Product Hunt (secondary channel).
- Week 11: Publish design partner case study: "How [Company] Reduced Drift by 90% in 2 Weeks."
- Week 12: Enable paid tiers. Convert free users to Starter/Pro.
- **Deliverable:** 200+ free tier users, 10+ paying customers, $1.5K+ MRR.

---

## 5. BUSINESS MODEL

### Revenue Model

**Primary revenue:** Tiered SaaS subscriptions (Free / $49 / $149 / $399 / Custom).

**Revenue characteristics:**

- **Recurring:** Monthly subscriptions with annual discount option (2 months free on annual).
- **Expansion-native:** Revenue grows as customers add stacks and upgrade tiers. Built-in NDR (Net Dollar Retention) >120%.
- **Low-touch:** Self-serve signup, credit card billing, no sales team required for Free through Business tiers.
- **Compliance-sticky:** Once SOC 2 audit evidence references dd0c/drift reports, switching tools means re-establishing evidence chains with auditors. Nobody does that mid-audit-cycle.

**Secondary revenue (future):**

- AWS Marketplace transactions (consolidated billing).
- dd0c platform cross-sell (drift customers adopt dd0c/cost, dd0c/alert, dd0c/portal).
- Enterprise on-prem/VPC-deployed dashboard (license fee, not SaaS).

### Unit Economics

**Assumptions:**

- Average customer: Pro tier ($149/mo) — this is the volume tier based on persona analysis.
- Infrastructure cost per customer: ~$8-12/mo (compute for polling, storage for drift history, Slack API calls).
- Gross margin: ~92-95%.
- CAC (blended): ~$150-$300 (PLG motion — content + community + open-source, no paid ads initially).
- CAC payback: 1-2 months at Pro tier.
- LTV (assuming ~4% monthly churn, which implies a ~24-month average lifetime): $149 × 24 = $3,576.
- LTV:CAC ratio: 12-24x (healthy; target >3x).
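These figures can be sanity-checked in a few lines. This sketch uses a margin-adjusted LTV variant (the bullet above uses simple ARPU × months); all inputs are hypothetical midpoints of the ranges listed:

```python
arpu = 149            # Pro tier, $/month
monthly_churn = 0.04  # ~4% monthly churn → average lifetime of 1/0.04 = 25 months
gross_margin = 0.93   # midpoint of the ~92-95% range
cac = 225             # midpoint of the $150-$300 blended CAC estimate

lifetime_months = 1 / monthly_churn
ltv = arpu * gross_margin * lifetime_months   # margin-adjusted lifetime value
payback_months = cac / (arpu * gross_margin)  # months to recoup acquisition cost
ltv_to_cac = ltv / cac
```

Even on the margin-adjusted basis the ratio stays comfortably above the >3x target, which is the point the bullets are making.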

**Revenue mix projection (Month 12):**

| Tier | Customers | MRR | % of MRR |
|---|---|---|---|
| Free | 1,200 | $0 | 0% |
| Starter ($49) | 50 | $2,450 | 11% |
| Pro ($149) | 80 | $11,920 | 54% |
| Business ($399) | 18 | $7,182 | 32% |
| Enterprise | 2 | $600 | 3% |
| **Total** | **1,350 (150 paid)** | **$22,152** | **100%** |

### Path to $10K / $50K / $100K MRR

**$10K MRR — "Ramen Profitable" (Month 6-9)**

- ~67 paying customers at blended $149/mo average.
- Achieved through: HN launch momentum + community engagement + 2-3 blog posts ranking on Google + design partner referrals.
- Solo founder is sustainable at this level. Infrastructure costs ~$1K/mo. Net income ~$9K/mo.
- **Milestone significance:** Validates product-market fit. Proves the market will pay.

**$50K MRR — "Real Business" (Month 15-20)**

- ~335 paying customers at blended $149/mo average.
- Achieved through: SEO compounding + word-of-mouth + Slack Marketplace distribution + first conference talks + compliance-driven purchases accelerating.
- Hire first part-time contractor for support and bug fixes at ~$30K MRR.
- **Milestone significance:** Sustainable solo business. Funds development of dd0c platform expansion.

**$100K MRR — "Platform Inflection" (Month 24-30)**

- ~500 paying customers at blended $200/mo average (mix shifts toward Pro/Business as larger teams adopt).
- Achieved through: dd0c platform cross-sell (drift customers adopt other modules) + enterprise tier traction + AWS Marketplace + potential seed round to accelerate.
- Hire 1-2 full-time engineers. Transition from solo founder to small team.
- **Milestone significance:** dd0c becomes a platform company, not a single-product company.

### Solo Founder Constraints

**What one person can realistically do:**

- Build and maintain the core product (detection engine, Slack integration, dashboard).
- Write 2-4 blog posts per month.
- Engage on Reddit/HN daily (30 min/day).
- Handle support for up to ~100 customers (Slack-based, async).
- Ship weekly releases.

**What one person cannot do:**

- Build enterprise features (SSO, SAML, advanced RBAC) while also shipping core features and doing marketing.
- Handle support for 200+ customers without it consuming all productive time.
- Attend conferences while also shipping code.
- Build multi-cloud support (Azure, GCP) while maintaining AWS quality.

**The constraint strategy:**

- Ruthlessly prioritize AWS + Terraform + OpenTofu. Don't touch Azure/GCP/Pulumi until $30K MRR.
- Use AI-assisted development (Cursor/Copilot) for 80% of boilerplate. Reserve cognitive energy for architecture and customer conversations.
- Hire first contractor at $30K MRR. First full-time hire at $75K MRR.
- Shared dd0c platform infrastructure (auth, billing, OTel pipeline) is built once and reused across all modules. This is the moat against burnout.

### Key Assumptions

1. **The driftctl vacuum persists for 12+ months.** If someone fills it before dd0c/drift launches, the beachhead shrinks significantly.
2. **Engineers will adopt a new tool for drift detection specifically.** The "do nothing" competitor (manual `terraform plan`) is strong. The product must demonstrate ROI in the first 5 minutes.
3. **Compliance requirements continue tightening.** SOC 2, PCI DSS 4.0, and HIPAA are driving drift detection from "nice-to-have" to "required." If compliance pressure plateaus, the Diana persona weakens.
4. **Push-based architecture is acceptable to security teams.** The open-source agent running in customer VPC must satisfy CISO review. If it doesn't, adoption stalls at security-conscious organizations.
5. **PLG motion works for infrastructure tooling.** Bottom-up adoption by individual engineers, expanding to team purchases. If procurement processes block credit card purchases, the self-serve model breaks.
6. **Brian can sustain development velocity across multiple dd0c modules.** Drift is Product #2 in a 6-product suite. If dd0c/route (Phase 1) consumes more time than expected, drift launch delays and the window may close.

---

## 6. RISKS & MITIGATIONS

### Top 5 Risks (from Party Mode Stress Tests)

**Risk 1: HashiCorp/IBM Ships Native Drift Detection in TFC (Severity: 8/10)**

IBM paid $4.6B for HashiCorp. They have infinite resources and strategic motivation to improve TFC's drift features. If they ship continuous monitoring + Slack alerts + remediation in the TFC Plus tier, the "HashiCorp exodus" narrative dies.

*Why it might not happen:* IBM moves slowly. They'll focus on enterprise governance features that justify $70K+ contracts, not improving drift for the free/starter tier. Post-BSL, the community is migrating to OpenTofu — IBM may double down on enterprise lock-in rather than community features.

*Mitigation:*

- Multi-IaC support is the insurance policy. TFC will never support OpenTofu or Pulumi. Every team using multiple IaC tools is immune to TFC's drift features.
- Speed. Be 18 months ahead on drift-specific features by the time IBM responds. Ship weekly, not quarterly.
- Community lock-in. If dd0c/drift is the community standard (the "driftctl successor"), IBM improving TFC drift won't matter — the community has already chosen.

**Risk 2: Solo Founder Burnout (Severity: 9/10, Probability: High)**

This is the risk the party mode panel was most worried about — and so am I. dd0c is 6 products. Even with drift in Phase 2, Brian will be maintaining dd0c/route while building drift. Adding a 4th, 5th, 6th product is not "building new products" — it's adding 25% more work each time to an already unsustainable workload.

*Mitigation:*

- Shared platform infrastructure (auth, billing, OTel pipeline) built once and reused. If each product has its own backend, this fails.
- AI-assisted development for 80% of boilerplate.
- Hire at $30K MRR. Don't try to be solo past that threshold.
- Ruthless scope control. MVP means MVP. No feature creep. No Azure/GCP until $30K MRR.

**Risk 3: Spacelift/env0 Commoditize Drift Detection (Severity: 7/10)**

If dd0c/drift gains traction and appears in "Spacelift alternatives" searches, Spacelift's marketing team will notice. The easiest response: drop basic drift detection into their free tier.

*Why it might not happen:* Spacelift's drift detection requires private workers with infrastructure costs. Making it free erodes their upgrade path. Their investors won't love giving away features that drive enterprise upgrades.

*Mitigation:*

- Be better, not just cheaper. If drift detection is 10x better (Slack-native, one-click remediation, compliance reports, multi-IaC), "free but mediocre" from Spacelift won't matter. Nobody switched from Figma to free Adobe XD.
- Different buyer. Spacelift's free tier targets teams evaluating their platform. dd0c/drift targets teams who don't want a platform. Different buyer, different motion.

**Risk 4: Enterprise Security Teams Block Adoption (Severity: 8/10)**

Reading state files means reading resource configurations, sometimes including sensitive data. Giving a bootstrapped SaaS tool access to production AWS and state buckets is a red flag for any CISO. The party mode CTO called this severity 9/10.

*Mitigation:*

- Push-based architecture is non-negotiable. The SaaS never pulls from customer cloud. The open-source agent runs in their VPC and pushes encrypted drift diffs out.
- Open-source the agent so security teams can audit the code. Trust through transparency.
- Get dd0c SOC 2 certified. Expensive ($20-50K) but eliminates the "can we trust a solo founder's SaaS?" objection. You can't sell a compliance tool without passing compliance yourself.

**Risk 5: "Do Nothing" Inertia (Severity: 6/10, Probability: High)**

Most teams tolerate drift. They've been tolerating it for years. The primary substitute is "do nothing" — manual `terraform plan` runs, tribal knowledge, and hope. Converting tolerators to payers requires more effort than converting seekers to payers.

*Mitigation:*

- The Drift Cost Calculator directly attacks this by quantifying the cost of "good enough." When an engineer sees "$47K/year in drift management costs" vs. "$149/mo for dd0c/drift," the bash script suddenly looks expensive.
- Target seekers first (driftctl refugees, post-incident teams, pre-audit teams), not tolerators. The beachhead is people already in pain.
- Compliance as forcing function. When the auditor says "you need continuous drift detection," inertia loses.

### Kill Criteria

**Kill at 6 months if ANY of these are true:**

1. < 50 free tier signups after HN launch + Reddit engagement + blog content. Market doesn't care.
2. < 5 paying customers after 90 days of paid tier availability. Free users who won't pay are vanity.
3. Free-to-paid conversion < 3%. Industry benchmark for PLG dev tools is 3-7%.
4. NPS < 30 from first 20 customers. If early adopters aren't enthusiastic, the product isn't solving a real problem.
5. HashiCorp announces "TFC Drift Detection Pro" with continuous monitoring, Slack alerts, and remediation included in Plus tier — before dd0c/drift has 100+ customers.

**Kill at 12 months if ANY of these are true:**

1. < $10K MRR. Growth trajectory doesn't support standalone product. Fold drift into dd0c/portal as a feature.
2. Monthly churn > 8%. Dev tools should have <5%. Above 8% means the product isn't sticky.
3. CAC payback > 12 months. Unit economics don't work for a bootstrapped founder.

### Pivot Options

- **Pivot A: Compliance Engine.** If drift detection alone doesn't convert but compliance reports do, pivot to a broader "IaC Compliance Platform" — drift detection becomes a feature feeding compliance evidence generation, audit trail management, and regulatory reporting. Diana becomes the primary buyer, not Ravi.
- **Pivot B: dd0c/portal Feature.** If drift doesn't sustain as a standalone product, fold it into dd0c/portal as the "infrastructure health" module. Drift detection becomes a feature of the IDP, not a product. Reduces standalone revenue pressure.
- **Pivot C: Multi-Tool Standard.** If the multi-IaC angle resonates more than drift specifically, pivot to a generic "IaC state comparison engine" that integrates with existing observability tools (Datadog, New Relic). Become the standard for state comparison, let others build the UX.

---

## 7. SUCCESS METRICS

### North Star Metric

**Stacks monitored** (total across all customers).

This measures adoption depth, not just customer count. A customer monitoring 50 stacks is 10x more engaged (and 10x more likely to retain) than a customer monitoring 5. It also directly correlates with revenue (more stacks = higher tier) and with the data flywheel (more stacks = better drift intelligence).

### Leading Indicators

| Metric | Description | Why It Matters |
|---|---|---|
| **GitHub stars (drift-cli)** | Social proof and top-of-funnel awareness | Stars → downloads → free signups → paid conversions |
| **Free tier signups** | Activation rate of interested engineers | Measures whether the value proposition resonates |
| **Free-to-paid conversion rate** | % of free users who upgrade | Measures whether the product delivers enough value to pay for |
| **Time-to-first-alert** | Minutes from signup to first Slack drift alert | Measures onboarding friction. Target: <5 minutes. |
| **Weekly active stacks** | Stacks with at least one drift check in the past 7 days | Measures engagement depth, not just signup vanity |
| **Slack action rate** | % of drift alerts that receive a Revert/Accept/Snooze action | Measures whether alerts are actionable vs. noise |

### Lagging Indicators

| Metric | Description | Target |
|---|---|---|
| **MRR** | Monthly Recurring Revenue | See milestones below |
| **Net Dollar Retention (NDR)** | Revenue expansion from existing customers | >120% (customers upgrade as they add stacks) |
| **Monthly churn** | % of paying customers lost per month | <5% |
| **CAC payback** | Months to recoup customer acquisition cost | <6 months |
| **LTV:CAC ratio** | Lifetime value vs. acquisition cost | >3:1 (target 10:1+) |
| **NPS** | Net Promoter Score from paying customers | >40 |

### Milestones

**30 Days Post-Launch:**

- 200+ GitHub stars on drift-cli
- 50+ free tier signups
- 3-5 design partners actively using the product
- First Slack alert delivered to a non-Brian user
- Zero critical bugs in production

**60 Days Post-Launch:**

- 500+ GitHub stars
- 150+ free tier signups
- 10+ paying customers
- $1.5K+ MRR
- "driftctl Is Dead" blog post ranking on page 1 for "driftctl alternative"
- First unsolicited mention on r/terraform or r/devops

**90 Days Post-Launch:**

- 1,000+ GitHub stars
- 300+ free tier signups
- 25+ paying customers
- $3.5K+ MRR
- Free-to-paid conversion rate >5%
- First design partner case study published
- NPS >40 from first 20 customers

### Month 6 Targets

| Metric | Target |
|---|---|
| GitHub stars | 1,500 |
| Free tier users | 600 |
| Paying customers | 50 |
| MRR | $7,500 |
| Stacks monitored | 1,500 |
| Monthly churn | <5% |
| NDR | >110% |

### Month 12 Targets

| Metric | Target |
|---|---|
| GitHub stars | 3,000 |
| Free tier users | 1,500 |
| Paying customers | 150 |
| MRR | $22,000 |
| Stacks monitored | 5,000 |
| Monthly churn | <4% |
| NDR | >120% |
| Free-to-paid conversion | 7% |
| NPS | >50 |
| CAC payback | <6 months |
| LTV:CAC | >10:1 |

### Scenario-Weighted Revenue Projection

| Scenario | Probability | Month 6 MRR | Month 12 MRR | Month 24 MRR |
|---|---|---|---|---|
| **Rocket** (viral HN launch, community adopts as driftctl successor) | 20% | $15K | $52K | $180K |
| **Grind** (steady growth, community works but slowly) | 50% | $6K | $22K | $75K |
| **Slog** (interest but low conversion, competitors respond) | 25% | $2.2K | $9K | $22K |
| **Flop** (market doesn't materialize) | 5% | $750 | $5K | $5K |
| **Weighted Expected Value** | — | **$6.6K** | **$23.9K** | **$79.3K** |

Weighted Month 12 MRR of ~$24K = ~$287K ARR. For a bootstrapped solo founder, that's a real business. Not a unicorn. A real business that funds the dd0c platform expansion.

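The weighted row is a plain probability-weighted average of the scenario columns, which can be reproduced in a few lines:

```python
# Scenario probabilities and MRR (in $K) for months 6, 12, and 24.
scenarios = {
    "rocket": (0.20, (15.0, 52.0, 180.0)),
    "grind":  (0.50, (6.0, 22.0, 75.0)),
    "slog":   (0.25, (2.2, 9.0, 22.0)),
    "flop":   (0.05, (0.75, 5.0, 5.0)),
}

# Probabilities must sum to 1 for the expectation to be meaningful.
assert abs(sum(p for p, _ in scenarios.values()) - 1.0) < 1e-9

# Expected MRR at each horizon: sum of probability * scenario MRR.
expected = [sum(p * mrr[i] for p, mrr in scenarios.values()) for i in range(3)]
```

The three results round to roughly $6.6K, $23.9K, and $79.3K.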
---

## APPENDIX: CROSS-PHASE CONTRADICTION RESOLUTION

This brief synthesized four prior phase documents. Key contradictions and their resolutions:

| Contradiction | Resolution |
|---|---|
| **Pricing: $29/stack flat vs. tiered bundles** — Brainstorm proposed $29/stack. Innovation strategy recommended tiers ($49-$399). Party mode practitioner wanted $149 Pro tier. | **Tiered bundles win.** Flat per-stack penalizes good architecture, creates enterprise sticker shock, and is unpredictable. Tiers solve all three while preserving the "$29/stack" marketing anchor. See Section 3 pricing table. |
| **Launch sequencing: Phase 3 (months 7-12) vs. Phase 2 (months 4-6)** — Brand strategy placed drift in Phase 3. Innovation strategy and party mode both recommended Phase 2. | **Phase 2 wins.** The driftctl vacuum is time-sensitive. Every month of delay shrinks the window. dd0c/route (Phase 1) is a faster build; drift follows immediately. |
| **Standalone product vs. platform wedge** — VC panelist said $3-5M SOM isn't venture-scale. Bootstrap founder said $3M ARR solo is phenomenal. | **Both are right.** Drift is a strong standalone bootstrapped business AND a wedge into the dd0c platform. The brief treats it as both: standalone metrics for the first 12 months, platform expansion metrics for months 12-24. No need to choose yet. |
| **Auto-remediation scope** — CTO warned about blast radius of one-click revert. Practitioner said MVP should focus on safe reverts (security groups), not complex state (RDS parameters). | **Spectrum of automation.** Per-resource-type policies: auto-revert for security groups opened to 0.0.0.0/0, alert + one-click for IAM, digest for tags, ignore for ASG scaling. The spectrum IS the product's sophistication. Complex state remediation generates a PR for human review, not a direct apply. |
| **Architecture: SaaS pull vs. push-based agent** — Contrarian and CTO both flagged IAM trust as a blocker. Practitioner proposed push-based agent. | **Push-based is non-negotiable.** The SaaS never pulls from customer cloud. Open-source agent runs in customer VPC, pushes encrypted diffs out. This was unanimous across all phases. |

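The per-resource-type policy spectrum from the auto-remediation resolution could be expressed as a simple lookup. The policy names and resource-type keys below are invented for illustration, not part of any specified design:

```python
# Hypothetical policy map sketching the "spectrum of automation":
# the riskier the revert, the more human involvement required.
REMEDIATION_POLICIES = {
    "aws_security_group":     "auto_revert",      # e.g. port opened to 0.0.0.0/0
    "aws_iam_role_policy":    "alert_one_click",  # a human confirms the revert
    "aws_resource_tags":      "daily_digest",     # low risk, batched into a digest
    "aws_autoscaling_group":  "ignore",           # expected runtime drift
    "aws_db_parameter_group": "open_pr",          # complex state: PR for review, never direct apply
}

def action_for(resource_type: str) -> str:
    # Default to the safest behavior: alert a human, never auto-apply.
    return REMEDIATION_POLICIES.get(resource_type, "alert_one_click")
```

The design choice worth noting is the default: an unknown resource type falls back to alerting rather than acting, which keeps the blast radius bounded.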
---

*"The window won't wait. Ship it."* — Victor

**Document Status:** COMPLETE

**Confidence Level:** HIGH

**Next Step:** Technical architecture session — define the detection engine, state backend integrations, and Slack workflow architecture.

products/03-alert-intelligence/architecture/architecture.md (1279 lines, diff suppressed because it is too large)

products/03-alert-intelligence/brainstorm/session.md (227 lines):

# 🧠 Alert Intelligence Layer — Brainstorm Session

**Product:** #3 — Alert Intelligence Layer (dd0c platform)
**Date:** 2026-02-28
**Facilitator:** Carson (Elite Brainstorming Specialist)
**Total Ideas Generated:** 112

---

## Phase 1: Problem Space (22 ideas)

### What Alert Fatigue Actually Feels Like at 3am

1. **The Pavlov's Dog Effect** — Your phone buzzes and your cortisol spikes before you even read it. After 6 months of on-call, the sound of a notification triggers anxiety even on vacation.
2. **The Boy Who Cried Wolf** — After 50 false alarms, you stop reading the details. You just ack and go back to sleep. The 51st one is the real outage.
3. **The Scroll of Doom** — You wake up to 347 unread alerts in Slack. You have to mentally triage which ones matter. By the time you find the real one, it's been firing for 40 minutes.
4. **The Guilt Loop** — You muted a channel because it was too noisy. Now you feel guilty. What if something real fires? You unmute. It's noisy again. Repeat.
5. **The Resignation Trigger** — Alert fatigue is the #1 cited reason engineers leave on-call-heavy roles. It's not the incidents — it's the noise between incidents.

### What Alerts Are Always Noise? What Patterns Exist?

6. **The Cascade** — One root cause (DB goes slow) triggers 47 downstream alerts across 12 services. Every single one is a symptom, not the cause.

7. **The Flapper** — CPU hits 80%, alert fires. CPU drops to 79%, resolves. CPU hits 80% again. 14 times in an hour. Same alert, same non-issue.

8. **The Deployment Storm** — Every deploy triggers a brief spike in error rates. 100% predictable. 100% still alerting.

9. **The Threshold Lie** — "Alert when latency > 500ms" was set 2 years ago when traffic was 10x lower. Nobody updated it. It fires daily now.

10. **The Zombie Alert** — The service it monitors was decommissioned 6 months ago. The alert still fires. Nobody knows who owns it.

11. **The Chatty Neighbor** — One misconfigured service generates 80% of all alerts. Everyone knows. Nobody fixes it because "it's not my team's service."

12. **The Scheduled Noise** — Cron jobs, batch processes, maintenance windows — all generate predictable alerts at predictable times.

13. **The Metric Drift** — Seasonal traffic patterns mean thresholds that work in January fail in December. Static thresholds can't handle dynamic systems.
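Several of these patterns are mechanical enough to defuse with a few lines of code. The flapper (#7), for instance, disappears once the fire and resolve thresholds are separated. A minimal hysteresis sketch in Python, with illustrative thresholds:

```python
class HysteresisAlert:
    """Anti-flapping sketch for pattern #7: fire at a high watermark, resolve
    only below a lower one, so a value oscillating around a single threshold
    does not generate repeated fire/resolve pairs. Thresholds are illustrative."""

    def __init__(self, fire_at=80.0, resolve_at=70.0):
        self.fire_at, self.resolve_at = fire_at, resolve_at
        self.firing = False

    def update(self, value):
        """Return 'fire' or 'resolve' on state transitions, else None."""
        if not self.firing and value >= self.fire_at:
            self.firing = True
            return "fire"
        if self.firing and value <= self.resolve_at:
            self.firing = False
            return "resolve"
        return None

alert = HysteresisAlert()
events = [alert.update(v) for v in [79, 80, 79, 80, 79, 65]]
# one fire and one resolve, where a naive 80% threshold would have flapped
assert events == [None, "fire", None, None, None, "resolve"]
```

With a single threshold at 80, the same input sequence fires and resolves twice; the gap between the two watermarks is what absorbs the oscillation.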
### Why Do Critical Alerts Get Lost?

14. **Signal Drowning** — When everything is urgent, nothing is urgent. Critical alerts are buried in a sea of warnings.

15. **Channel Overload** — Alerts go to Slack, email, PagerDuty, and SMS simultaneously. Engineers pick ONE channel and ignore the rest. If the critical alert only went to email...

16. **Context Collapse** — "High CPU on prod-web-07" tells you nothing. Is it the one serving 40% of traffic or the one that's being decommissioned?

17. **The Wrong Person** — Alert goes to the on-call for Team A, but the root cause is in Team B's service. 30 minutes of "not my problem" before escalation.
### Cost Analysis

18. **MTTR Tax** — Every minute of alert triage is a minute not spent fixing. Teams with high noise have 3-5x longer MTTR.

19. **The Attrition Cost** — Replacing a senior SRE costs $150-300K (recruiting + ramp). Alert fatigue drives attrition. Do the math.

20. **The Incident That Wasn't Caught** — One missed P1 incident can cost $100K-$10M depending on the business. Alert fatigue makes this inevitable.

21. **Cognitive Load Tax** — Engineers who had a bad on-call night are 40% less productive the next day. That's a hidden cost nobody tracks.

22. **The Compliance Risk** — In regulated industries (fintech, healthcare), missed alerts can mean regulatory fines. Alert fatigue is a compliance risk.
---

## Phase 2: Solution Space (58 ideas)

### AI Deduplication Approaches

23. **Semantic Fingerprinting** — Hash alerts by semantic meaning, not exact text. "High CPU on web-01" and "CPU spike detected on web-01" are the same alert.

24. **Topology-Aware Grouping** — Use service dependency maps to group alerts that share a common upstream cause. DB slow → API slow → Frontend errors = 1 incident, not 47 alerts.

25. **Time-Window Clustering** — Alerts within a 5-minute window affecting related services get auto-grouped into a single incident.

26. **Embedding-Based Similarity** — Use lightweight embeddings (sentence-transformers) to compute similarity scores between alerts. Cluster above threshold.

27. **Template Extraction** — Learn alert templates ("X metric on Y host exceeded Z threshold") and deduplicate by template + parameters.

28. **Cross-Source Dedup** — Same incident triggers alerts in Datadog AND PagerDuty AND Grafana. Deduplicate across sources, not just within.
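Semantic fingerprinting (#23) and template extraction (#27) can be combined in a first-pass deduplicator. A minimal Python sketch; the synonym map and stopword list are illustrative assumptions, and a real system would learn them from data:

```python
import hashlib
import re

# Illustrative normalization tables; a production system would learn these.
SYNONYMS = {"spike": "high", "elevated": "high", "exceeded": "high"}
STOPWORDS = {"detected", "alert", "warning"}

def alert_fingerprint(message: str) -> str:
    """Reduce an alert message to a template fingerprint: lowercase, mask
    numeric parameters, normalize synonyms, drop filler words, then hash the
    sorted token set so wording and word order no longer matter."""
    tokens = re.sub(r"\d+(\.\d+)?", "<num>", message.lower()).split()
    tokens = [SYNONYMS.get(t, t) for t in tokens if t not in STOPWORDS]
    canonical = " ".join(sorted(set(tokens)))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# the two example alerts from idea #23 collapse to one key
assert alert_fingerprint("High CPU on web-01") == \
       alert_fingerprint("CPU spike detected on web-01")
```

Because the fingerprint ignores the tool that emitted the alert, the same function also gives a starting point for cross-source dedup (#28).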
### Correlation Strategies

29. **Deployment Correlation** — Automatically correlate alert spikes with recent deployments (pull from CI/CD: GitHub Actions, ArgoCD, etc.). "This started 3 minutes after deploy #4521."

30. **Change Correlation** — Beyond deploys: config changes, feature flag flips, infrastructure changes (Terraform applies), DNS changes.

31. **Service Dependency Graph** — Auto-discover or import service maps. When alert fires, show the blast radius and likely root cause.

32. **Temporal Pattern Matching** — "This exact pattern of alerts happened 3 times before. Each time it was caused by X." Learn from history.

33. **Cross-Team Correlation** — Alert in Team A's service + alert in Team B's service = shared dependency issue. Neither team sees the full picture alone.

34. **Infrastructure Event Correlation** — Cloud provider incidents, network blips, AZ failures — correlate with external status pages automatically.

35. **Calendar-Aware Correlation** — Black Friday traffic, end-of-month batch jobs, quarterly reporting — correlate with known business events.
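Deployment correlation (#29) reduces to a window query once deploy events are ingested from CI/CD. A sketch, assuming deploys arrive as (id, timestamp) pairs and using an illustrative 15-minute window:

```python
from datetime import datetime, timedelta

def correlate_with_deploys(alert_time, deploys, window_minutes=15):
    """Return (deploy_id, lag) for deploys that happened shortly before the
    alert fired. `deploys` is a list of (deploy_id, timestamp) pairs pulled
    from CI/CD; the 15-minute window is an illustrative default."""
    window = timedelta(minutes=window_minutes)
    return [
        (deploy_id, alert_time - ts)
        for deploy_id, ts in deploys
        if timedelta(0) <= alert_time - ts <= window
    ]

deploys = [
    ("#4520", datetime(2026, 2, 28, 11, 0)),
    ("#4521", datetime(2026, 2, 28, 11, 39)),
]
matches = correlate_with_deploys(datetime(2026, 2, 28, 11, 42), deploys)
# deploy #4521 landed 3 minutes before the alert; #4520 is outside the window
assert [d for d, _ in matches] == ["#4521"]
```

The same shape works for change correlation (#30): feature-flag flips, Terraform applies, and DNS changes are just more event streams fed into the same window query.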
### Priority Scoring

36. **SLO-Based Priority** — If this alert threatens an SLO with < 20% error budget remaining, it's critical. If error budget is 90% full, it can wait.

37. **Business Impact Scoring** — Assign business value to services (revenue-generating, customer-facing, internal-only). Alert priority inherits from service importance.

38. **Historical Resolution Priority** — Alerts that historically required immediate action get high priority. Alerts that were always acked-and-ignored get suppressed.

39. **Blast Radius Scoring** — How many users/services are affected? An alert affecting 1 user vs 1 million users should have very different priorities.

40. **Time-Decay Priority** — An alert that's been firing for 5 minutes is more urgent than one that just started (it's not self-resolving).

41. **Compound Scoring** — Combine multiple signals: SLO impact × business value × blast radius × historical urgency = composite priority score.

42. **Dynamic Thresholds** — Replace static thresholds with ML-based anomaly detection. Alert only when behavior is genuinely anomalous for THIS time of day, THIS day of week.
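Compound scoring (#41) can be as simple as a product of normalized factors. A sketch; the multiplicative form and the example numbers are assumptions, not a tuned model:

```python
def priority_score(slo_budget_remaining, business_value, blast_radius,
                   historical_urgency):
    """Composite priority per idea #41. Every factor is normalized to [0, 1];
    slo_budget_remaining is the fraction of error budget left, so low budget
    means high pressure. The multiplicative combination is illustrative."""
    slo_pressure = 1.0 - slo_budget_remaining
    return slo_pressure * business_value * blast_radius * historical_urgency

# A checkout-path alert with 15% error budget left, high business value,
# wide blast radius, and a history of needing action scores near the top...
critical = priority_score(0.15, 0.9, 0.8, 0.9)
# ...while an internal tool with plenty of budget scores near zero.
ignorable = priority_score(0.9, 0.2, 0.1, 0.1)
assert critical > ignorable
```

The multiplicative form encodes the SLO-based rule from #36 for free: a full error budget drives the whole score toward zero no matter how noisy the other signals are.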
### Learning Mechanisms

43. **Ack Pattern Learning** — If an alert is acknowledged within 10 seconds 95% of the time, it's probably noise. Learn to auto-suppress.

44. **Resolution Pattern Learning** — Track what actually gets resolved vs what auto-resolves. Focus human attention on alerts that need human action.

45. **Runbook Extraction** — Parse existing runbooks and link them to alert types. If the runbook says "check if it's a deploy," automate that check.

46. **Postmortem Mining** — Analyze incident postmortems to identify which alerts were useful signals and which were noise during real incidents.

47. **Feedback Loops** — Explicit thumbs up/down on alert usefulness. "Was this alert helpful?" Build a labeled dataset from real engineer feedback.

48. **Snooze Intelligence** — Learn from snooze patterns. If everyone snoozes "disk usage > 80%" for 24 hours, maybe the threshold should be 90%.

49. **Team-Specific Learning** — Different teams have different noise profiles. Learn per-team, not globally.

50. **Seasonal Learning** — Recognize that December traffic patterns are different from July. Adjust baselines seasonally.
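Ack pattern learning (#43) needs only a history of ack latencies per alert type. A sketch using the thresholds from the idea above (10 seconds, 95%), plus an illustrative minimum sample size so the classifier stays quiet until it has evidence:

```python
def is_probable_noise(ack_latencies, fast_ack_seconds=10,
                      min_samples=20, noise_ratio=0.95):
    """Idea #43 sketch: if an alert type is almost always acked within a few
    seconds, nobody is actually investigating it, so it is a suppression
    candidate. min_samples and the other thresholds are illustrative."""
    if len(ack_latencies) < min_samples:
        return False  # not enough history to judge
    fast = sum(1 for t in ack_latencies if t <= fast_ack_seconds)
    return fast / len(ack_latencies) >= noise_ratio

# 30 acks, 29 of them within 3 seconds: suppression candidate
assert is_probable_noise([3] * 29 + [400]) is True
# mixed ack times: keep paging
assert is_probable_noise([3, 600, 45, 900] * 8) is False
```

Keeping the latencies per team rather than globally gives the per-team learning of #49 with no extra machinery, just a different grouping key.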
### Integration Approaches

51. **Webhook Receiver (Primary)** — Accept webhooks from any monitoring tool. Zero-config for tools that support webhook destinations. Lowest friction.

52. **API Polling (Secondary)** — For tools that don't support webhooks well, poll their APIs on a schedule.

53. **Slack Bot Integration** — Live in Slack where engineers already are. Receive alerts, show grouped incidents, allow ack/resolve from Slack.

54. **PagerDuty Bidirectional Sync** — Don't replace PagerDuty — sit in front of it. Filter noise before it hits PagerDuty's on-call rotation.

55. **Terraform Provider** — Configure alert rules, suppression policies, and service maps as code. GitOps-friendly.

56. **OpenTelemetry Collector Plugin** — Tap into the OTel pipeline to correlate alerts with traces and logs.

57. **GitHub/GitLab Integration** — Pull deployment events, PR merges, and config changes for correlation.
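The webhook receiver (#51) is mostly a normalization layer that maps each tool's payload onto one internal alert shape. A sketch; the field names below are simplified stand-ins, not the actual Datadog or Grafana webhook schemas:

```python
def normalize_webhook(source: str, payload: dict) -> dict:
    """Map a tool-specific webhook payload onto one internal alert shape.
    The payload field names here are hypothetical placeholders; a real
    receiver would follow each vendor's documented webhook schema."""
    if source == "datadog":
        return {"title": payload["title"],
                "severity": payload["alert_type"],
                "source": source}
    if source == "grafana":
        return {"title": payload["ruleName"],
                "severity": payload["state"],
                "source": source}
    raise ValueError(f"unsupported source: {source}")

alert = normalize_webhook("datadog",
                          {"title": "High CPU on web-01", "alert_type": "error"})
assert alert == {"title": "High CPU on web-01",
                 "severity": "error", "source": "datadog"}
```

Everything downstream (dedup, clustering, scoring) then works on one shape, which is what keeps adding a fifth or sixth source cheap.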
### UX Ideas

58. **Slack-Native Experience** — Primary interface IS Slack. Threaded incident channels, interactive buttons, slash commands. No new tool to learn.

59. **Mobile-First Dashboard** — On-call engineers are on their phones at 3am. The mobile experience must be exceptional, not an afterthought.

60. **Daily Digest Email** — "Yesterday: 347 alerts fired. 12 were real. Here's what we suppressed and why." Build trust through transparency.

61. **Alert Replay** — Visualize an incident timeline: which alerts fired in what order, how they were grouped, what the AI decided. Full auditability.

62. **Noise Report Card** — Weekly report per team: "Your noisiest alerts, your most-ignored alerts, suggested tuning." Gamify noise reduction.

63. **On-Call Handoff Summary** — Auto-generated summary for shift handoffs: "Here's what happened, what's still open, what to watch."

64. **Service Health Dashboard** — Not another dashboard — a SMART dashboard that only shows what's actually wrong right now, with context.

65. **CLI Tool** — `alert-intel status`, `alert-intel suppress <pattern>`, `alert-intel explain <incident-id>`. For the terminal-native engineers.
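The CLI (#65) maps naturally onto subcommands. A skeleton for the three commands named above, with the handlers stubbed out; only the argument surface is sketched here:

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    """Skeleton for the `alert-intel` CLI from idea #65. The commands are the
    ones named in the list; their implementations are intentionally omitted."""
    parser = argparse.ArgumentParser(prog="alert-intel")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("status", help="show currently open incidents")
    suppress = sub.add_parser("suppress", help="suppress alerts matching a pattern")
    suppress.add_argument("pattern")
    explain = sub.add_parser("explain", help="explain how an incident was grouped")
    explain.add_argument("incident_id")
    return parser

args = build_cli().parse_args(["suppress", "checkout-latency"])
assert args.command == "suppress" and args.pattern == "checkout-latency"
```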
### Escalation Intelligence

66. **Smart Routing** — Route to the engineer who last fixed this type of issue, not just whoever is on-call.

67. **Auto-Escalation Rules** — If no ack in 10 minutes AND SLO impact is high, auto-escalate to the next tier. No human needed to press "escalate."

68. **Responder Availability** — Integrate with calendar/Slack status. Don't page someone who's marked as unavailable — find the backup automatically.

69. **Fatigue-Aware Routing** — If the on-call engineer has been paged 5 times tonight, route the next one to the secondary. Prevent burnout in real-time.

70. **Cross-Team Escalation** — When correlation shows the root cause is in another team's service, auto-notify that team with context.
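Auto-escalation (#67) is a pure predicate over alert state. A sketch; the 10-minute deadline comes from the idea above, while the SLO-impact threshold is an illustrative assumption:

```python
from datetime import timedelta

def should_auto_escalate(time_since_fired, acked, slo_impact,
                         ack_deadline=timedelta(minutes=10),
                         slo_threshold=0.7):
    """Idea #67 sketch: escalate when an unacked alert with high SLO impact
    (normalized to [0, 1]) passes the ack deadline. slo_threshold is an
    illustrative cut-off, not a product spec."""
    return (not acked) and time_since_fired >= ack_deadline \
        and slo_impact >= slo_threshold

# 12 minutes unacked with high SLO impact: escalate without a human
assert should_auto_escalate(timedelta(minutes=12), acked=False, slo_impact=0.9)
# already acked, or fired too recently: leave it with the current responder
assert not should_auto_escalate(timedelta(minutes=12), acked=True, slo_impact=0.9)
assert not should_auto_escalate(timedelta(minutes=5), acked=False, slo_impact=0.9)
```

Fatigue-aware routing (#69) slots in at the same decision point: the predicate chooses *whether* to escalate, and a separate routing function chooses *who* receives it.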
### Wild Ideas

71. **Predictive Alerting** — Use time-series forecasting to predict when a metric WILL breach a threshold. Alert before it happens. "CPU will hit 95% in ~20 minutes based on current trend."

72. **Alert Simulation Mode** — "What if we changed this threshold? Here's what would have happened last month." Simulate before you ship.

73. **Incident Autopilot** — For known, repeatable incidents, execute the runbook automatically. Human just approves. "We've seen this 47 times. Auto-scaling fixes it. Execute? [Yes/No]"

74. **Natural Language Alert Creation** — "Alert me if checkout latency is bad" → AI translates to proper metric query + dynamic threshold.

75. **Alert Debt Score** — Like tech debt but for monitoring. "You have 47 alerts that fire daily and are always ignored. Your alert debt score is 73/100."

76. **Chaos Engineering Integration** — During chaos experiments, automatically suppress expected alerts and highlight unexpected ones.

77. **LLM-Powered Root Cause Analysis** — Feed the AI the alert cluster + recent changes + service graph → get a natural language hypothesis: "Likely cause: memory leak introduced in commit abc123, deployed 12 minutes ago."

78. **Voice Interface for On-Call** — "Hey Alert Intel, what's going on?" at 3am when you can't read your phone screen. Get a spoken summary.

79. **Alert Sound Design** — Different sounds for different severity/types. Your brain learns to distinguish "noise ping" from "real incident alarm" without reading.

80. **Collaborative Incident Chat** — Auto-create a war room channel, pull in relevant people, seed it with context, timeline, and suggested actions.
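Predictive alerting (#71) in its simplest form is trend extrapolation. A sketch using a least-squares line over recent samples; real forecasting would have to handle seasonality and noise, which this deliberately does not:

```python
def minutes_until_breach(samples, threshold):
    """Idea #71 sketch: fit a straight line to recent (minute, value) samples
    by least squares and extrapolate to the threshold. Returns None when the
    trend is flat or falling. The simplest possible forecaster."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / \
            sum((x - mean_x) ** 2 for x, _ in samples)
    if slope <= 0:
        return None  # no breach predicted
    intercept = mean_y - slope * mean_x
    # minutes from the latest sample until the fitted line crosses the threshold
    return (threshold - intercept) / slope - samples[-1][0]

# CPU rising 2 points/minute from 55%, last sample at minute 3:
# the line crosses 95% at minute 20, i.e. 17 minutes from now
eta = minutes_until_breach([(0, 55), (1, 57), (2, 59), (3, 61)], threshold=95)
assert abs(eta - 17.0) < 1e-9
```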
---

## Phase 3: Differentiation (15 ideas)

### What Makes This Defensible?

81. **Data Moat** — Every alert processed, every ack, every resolve, every feedback signal makes the model smarter. Competitors starting fresh can't match 6 months of learned patterns.

82. **Network Effects Within Org** — The more teams that use it, the better cross-team correlation works. Creates internal pressure to expand.

83. **Integration Depth** — Deep bidirectional integrations with 5-10 monitoring tools create switching costs. You'd have to rewire everything.

84. **Institutional Knowledge Capture** — The system learns what your senior SRE knows intuitively. When that person leaves, the knowledge stays. That's incredibly valuable.

85. **Custom Model Per Customer** — Each customer's model is trained on THEIR patterns. Generic competitors can't match customer-specific intelligence.
### Why Not Just Use PagerDuty's Built-In AI?

86. **Vendor Lock-In** — PagerDuty AIOps only works with PagerDuty. We work across ALL your tools. Most teams use 3-5 monitoring tools.

87. **Price** — PagerDuty AIOps is an expensive add-on to an already expensive product. We're $15-30/seat vs their $50+/seat for AIOps alone.

88. **Independence** — We're tool-agnostic. Switch from Datadog to Grafana? We don't care. Your learned patterns carry over.

89. **Focus** — PagerDuty is an incident management platform that bolted on AI. We're AI-first, purpose-built for alert intelligence.

90. **Transparency** — Big platforms are black boxes. We show exactly why an alert was suppressed, grouped, or escalated. Engineers need to trust it.
### Additional Differentiation

91. **SMB-First Design** — BigPanda needs a 6-month enterprise deployment. We need a webhook URL and 10 minutes.

92. **Open Core Potential** — Open-source the core deduplication engine. Build community. Monetize the hosted service + advanced features.

93. **Developer Experience** — API-first, CLI tools, Terraform provider, GitOps config. Built BY engineers FOR engineers, not by enterprise sales teams.

94. **Time to Value** — Show value in the first hour. "You received 200 alerts today. We would have shown you 23." Instant proof.

95. **Community-Shared Patterns** — Anonymized, opt-in pattern sharing. "Teams using Kubernetes + Istio commonly see this noise pattern. Auto-suppress?" Collective intelligence.
---

## Phase 4: Anti-Ideas (17 ideas)

### Why Would This Fail?

96. **Trust Gap** — Engineers will NOT trust AI to suppress alerts on day one. One missed critical alert and they'll disable it forever. The trust ramp is the hardest problem.

97. **The Datadog Threat** — Datadog has $2B+ revenue and is building AI features aggressively. They could ship "AI Alert Grouping" as a free feature tomorrow.

98. **Integration Maintenance Hell** — Supporting 10+ monitoring tools means 10+ APIs that change, break, and deprecate. Integration maintenance could eat the entire engineering team.

99. **Cold Start Problem** — The AI needs data to learn. Day 1, it's dumb. How do you deliver value before the model has learned anything?

100. **Alert Suppression Liability** — If the AI suppresses a real alert and there's an outage, who's liable? Legal/compliance teams will ask this question.

101. **Small TAM Concern** — Teams of 5-50 engineers at $15-30/seat = $75-$1,500/month per customer. Need thousands of customers to build a real business.

102. **Enterprise Gravity** — Larger companies (where the money is) already have BigPanda or PagerDuty AIOps. SMBs have less budget and higher churn.

103. **The "Good Enough" Problem** — Teams might just... mute channels and deal with it. The pain is real but the workarounds are free.

104. **Security Concerns** — Alert data contains service names, infrastructure details, error messages. Sending this to a third party triggers a security review.

105. **Champion Risk** — If the one SRE who championed the tool leaves, does the team keep paying? Single-champion products churn hard.

106. **Monitoring Tool Consolidation** — The trend is toward fewer tools (Datadog is eating everything). If teams consolidate to one tool, the "cross-tool" value prop weakens.

107. **AI Hype Fatigue** — "AI-powered" is becoming meaningless. Engineers are skeptical of AI claims. Need to prove it with numbers, not buzzwords.

108. **Open Source Competition** — Someone could build an open-source version. The deduplication algorithms aren't rocket science. The value is in the learned models.

109. **Webhook Reliability** — If our system goes down, alerts don't get processed. We become a single point of failure in the alerting pipeline. That's terrifying.

110. **Feature Creep Temptation** — The pull to become a full incident management platform (competing with PagerDuty) is strong. Must resist and stay focused.

111. **Pricing Pressure** — At $15-30/seat, margins are thin. Infrastructure costs for ML inference could eat profits if not carefully managed.

112. **The "Just Fix Your Alerts" Argument** — Some will say: "If your alerts are noisy, fix your alerts." They're not wrong. We're a band-aid on a deeper problem.
---

## Phase 5: Synthesis

### Top 10 Ideas (Ranked)

| Rank | Idea | Why |
|------|------|-----|
| 1 | **Webhook Receiver + Slack-Native UX** (#51, #58) | Lowest friction entry point. Engineers don't leave Slack. 10-minute setup. |
| 2 | **Topology-Aware Alert Grouping** (#24) | The single highest-impact feature. Turns 47 alerts into 1 incident. Immediate, visible value. |
| 3 | **Deployment Correlation** (#29) | "This started after deploy #4521" is the most useful sentence in incident response. Pulls from GitHub/CI automatically. |
| 4 | **Ack/Resolve Pattern Learning** (#43, #44) | The data moat starts here. Every interaction makes the system smarter. Passive learning, no user effort. |
| 5 | **Daily Noise Report Card** (#62) | Builds trust through transparency. Shows what was suppressed and why. Gamifies noise reduction. |
| 6 | **SLO-Based Priority Scoring** (#36) | Objective, defensible prioritization. Not "AI magic" — math based on your own SLO definitions. |
| 7 | **Time-Window Clustering** (#25) | Simple, effective, explainable. "These 12 alerts all fired within 2 minutes — probably related." |
| 8 | **Feedback Loops (Thumbs Up/Down)** (#47) | Explicit signal to train the model. Engineers feel in control. Builds the labeled dataset for V2 ML. |
| 9 | **Alert Simulation Mode** (#72) | "What would have happened last month?" is the killer demo. Proves value before they commit. |
| 10 | **PagerDuty Bidirectional Sync** (#54) | Don't replace PagerDuty — sit in front of it. Reduces friction. "Keep your existing setup, just add us." |
### 3 Wild Cards

1. **🔮 Predictive Alerting** (#71) — Forecast threshold breaches before they happen. Hard to build, but if it works, it's magic. "You have 20 minutes before this becomes a problem." Game-changing for V2.

2. **🤖 Incident Autopilot** (#73) — Auto-execute known runbooks with human approval. "We've seen this 47 times. Auto-scaling fixes it every time. Execute?" This is where the real money is long-term.

3. **🧠 Community-Shared Patterns** (#95) — Anonymized collective intelligence. "87% of Kubernetes teams suppress this alert. Want to?" Network effects across customers, not just within orgs. Could be the ultimate moat.
### Recommended V1 Scope

**V1: "Smart Alert Funnel"**

Core features (ship in 8-12 weeks):

- **Webhook ingestion** from Datadog, Grafana, PagerDuty, OpsGenie (4 integrations)
- **Time-window clustering** (group alerts within 5-min windows affecting related services)
- **Semantic deduplication** (same alert, different wording = 1 alert)
- **Deployment correlation** (GitHub Actions / GitLab CI integration)
- **Slack bot** as primary UX (grouped incidents, ack/resolve, context)
- **Daily digest** showing noise reduction stats
- **Thumbs up/down feedback** on every grouped incident
- **Alert simulation** ("connect us to your webhook history, we'll show you what V1 would have done")
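The time-window clustering feature above is small enough to sketch. A minimal version, assuming alerts arrive as (timestamp, message) pairs and omitting the related-services check for brevity:

```python
from datetime import datetime, timedelta

def cluster_by_window(alerts, window=timedelta(minutes=5)):
    """V1 time-window clustering sketch: sort alerts by time and open a new
    incident whenever the gap to the previous alert exceeds the window.
    A real implementation would also require the alerts to share topology."""
    incidents = []
    for ts, msg in sorted(alerts):
        if incidents and ts - incidents[-1][-1][0] <= window:
            incidents[-1].append((ts, msg))  # close enough: same incident
        else:
            incidents.append([(ts, msg)])    # gap too large: new incident
    return incidents

t = datetime(2026, 2, 28, 3, 47)
alerts = [
    (t, "db pool exhausted"),
    (t + timedelta(minutes=2), "api latency high"),
    (t + timedelta(minutes=3), "frontend 5xx"),
    (t + timedelta(hours=2), "disk usage warning"),
]
incidents = cluster_by_window(alerts)
# the 3am cascade collapses to one incident; the later alert stands alone
assert len(incidents) == 2 and len(incidents[0]) == 3
```

Because grouping is by gap-to-previous rather than fixed buckets, a slow-rolling cascade stays in one incident instead of being split at an arbitrary bucket boundary.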
What V1 is NOT:

- No ML-based anomaly detection (use rule-based grouping first)
- No predictive alerting
- No auto-remediation
- No custom dashboards (Slack IS the dashboard)
- No on-prem deployment
**V1 Success Metric:** Reduce actionable alert volume by 60%+ within the first week.

**V1 Pricing:** $19/seat/month. Free tier for up to 5 seats (land-and-expand).

**V1 Go-to-Market:**

- Target: DevOps/SRE teams at Series A-C startups (20-200 employees)
- Channel: Dev Twitter/X, Hacker News launch, DevOps subreddit, conference lightning talks
- Hook: "Connect your webhook. See your noise reduction in 60 seconds."
---

*Session complete. 112 ideas generated across 5 phases. Let's build this thing.* 🚀

342
products/03-alert-intelligence/design-thinking/session.md
Normal file
# 🎷 dd0c/alert — Design Thinking Session

**Product:** Alert Intelligence Layer (dd0c/alert)
**Facilitator:** Maya, Design Thinking Maestro
**Date:** 2026-02-28
**Method:** Full Design Thinking (Empathize → Define → Ideate → Prototype → Test → Iterate)

---

> *"An alert system that cries wolf isn't broken — it's traumatizing the shepherd. We're not fixing alerts. We're restoring trust between humans and their machines."*
> — Maya
---

# Phase 1: EMPATHIZE 🎧

Design is jazz. You don't start by playing — you start by *listening*. And right now, the on-call world is screaming a blues riff at 3am, and nobody's transcribing the melody. Let's sit with these people. Let's feel what they feel before we dare to build anything.
---

## Persona 1: Priya Sharma — The On-Call Engineer

**Age:** 28 | **Role:** Backend Engineer, On-Call Rotation | **Company:** Mid-stage fintech startup, 85 engineers
**Slack Status:** 🔴 "on-call until Thursday, pray for me"
### Empathy Map

**SAYS:**
- "I got paged 6 times last night. Five were nothing."
- "I just ack everything now. I'll look at it in the morning if it's still firing."
- "I used to love this job. Now I dread Tuesdays." (her on-call day)
- "Can someone PLEASE fix the checkout-latency alert? It fires every deploy."
- "I'm not burned out, I'm just... tired." (she's burned out)
**THINKS:**
- "If I mute this channel, will I miss the one real incident?"
- "My manager says on-call is 'shared responsibility' but I've been on rotation 3x more than anyone else this quarter."
- "I wonder if that startup down the street has on-call this bad."
- "What if I just... don't answer? What's the worst that happens?"
- "I'm mass-acking alerts at 3am. This is not engineering. This is whack-a-mole."
**DOES:**
- Sets multiple alarms because she doesn't trust herself to wake up for pages anymore — her brain has learned to sleep through them
- Keeps a personal "ignore list" in a Notion doc — alerts she's learned are always noise
- Spends the first 20 minutes of every incident figuring out if it's real
- Writes angry Slack messages in #sre-gripes at 3:17am
- Checks the deploy log manually every single time an alert fires
- Takes a "recovery day" after bad on-call nights (uses PTO, doesn't tell anyone why)
**FEELS:**
- **Anxiety:** The phantom vibration in her pocket. Even off-call, she flinches when her phone buzzes.
- **Resentment:** Toward the teams that ship noisy services and never fix their alerts.
- **Guilt:** When she mutes channels or acks without investigating.
- **Isolation:** Nobody who isn't on-call understands what it's like. Her partner thinks she's "just checking her phone."
- **Helplessness:** She's filed 12 tickets to fix noisy alerts. 2 got addressed. The rest are "backlog."
### Pain Points

1. **Signal-to-noise ratio is catastrophic** — 80-90% of pages are non-actionable
2. **Context is missing** — Alert says "high latency on prod-api-12" but doesn't say WHY or whether it matters
3. **No correlation** — She has to manually check if a deploy just happened, if other services are affected, if it's a known pattern
4. **Alert ownership is broken** — Nobody owns the noisy alerts. The team that created the service moved on. The alert is an orphan.
5. **Recovery time is invisible** — Management doesn't see the cognitive cost of a bad night. She's 40% less productive the next day but nobody measures that.
6. **Tools fragment her attention** — Alerts come from Datadog, PagerDuty, Slack, and email. She has to context-switch across 4 tools to understand one incident.
### Current Workarounds

- Personal Notion "ignore list" of known-noisy alerts
- Slack keyword muting for specific alert patterns
- A bash script she wrote that checks the deploy log when she gets paged (she's automated her own triage)
- Group chat with other on-call engineers where they share "is this real?" messages at 3am
- Coffee. So much coffee.
### Jobs to Be Done (JTBD)

- **When** I get paged at 3am, **I want to** instantly know if this is real and what to do, **so I can** either fix it fast or go back to sleep.
- **When** I start my on-call shift, **I want to** know what's currently broken and what's just noise, **so I can** mentally prepare and not waste energy on false alarms.
- **When** an incident is happening, **I want to** see all related alerts grouped together with context, **so I can** focus on the root cause instead of chasing symptoms.
- **When** my on-call shift ends, **I want to** hand off cleanly with a summary, **so I can** actually disconnect and recover.
### Day-in-the-Life: Tuesday (On-Call Day)

**6:30 AM** — Alarm goes off. Checks phone immediately. 14 alerts overnight. Scrolls through them in bed. 12 are the usual suspects (checkout-latency, disk-usage-warning, the zombie alert from the decommissioned auth service). 2 look potentially real. Gets up with a knot in her stomach.

**9:15 AM** — Standup. "Anything from on-call?" She says "quiet night" because explaining 14 alerts that were all noise is exhausting and nobody wants to hear it.

**11:42 AM** — Page: "Error rate spike on payment-service." Heart rate jumps. Opens Datadog. Opens PagerDuty. Opens Slack. Checks the deploy log — yes, someone deployed 8 minutes ago. Checks the PR. It's a config change. Error rate is already recovering. Acks the alert. Total time: 12 minutes. Actual action needed: zero.

**2:15 PM** — Trying to do actual feature work. Gets paged again. Same checkout-latency alert that fires every afternoon during peak traffic. Acks in 3 seconds without looking. Goes back to coding. Loses her flow state. Takes 20 minutes to get back into the problem.

**3:47 AM (Wednesday)** — Phone screams. Bolts awake. Heart pounding. "Database connection pool exhausted on prod-db-primary." This one is real. But she doesn't know that yet. Spends 8 minutes triaging — checking if it's the flapper, checking if there was a deploy, checking if other services are affected. By the time she confirms it's real and starts the runbook, it's been 11 minutes. MTTR clock is ticking.

**4:22 AM** — Incident resolved. Wide awake now. Adrenaline. Can't sleep. Opens Twitter. Sees a meme about on-call life. Laughs bitterly. Considers updating her LinkedIn.

**9:00 AM (Wednesday)** — Shows up to work exhausted. Manager asks about the incident. She explains. Manager says "great job." She thinks: "Great job would be not getting paged for garbage 6 times before the real one."
---

## Persona 2: Marcus Chen — The SRE/Platform Lead

**Age:** 34 | **Role:** Senior SRE / Platform Team Lead | **Company:** Series C SaaS company, 140 engineers, 8 SREs
**Slack Status:** 📊 "Reviewing Q1 on-call metrics (they're bad)"
### Empathy Map

**SAYS:**
- "We need to fix our alert hygiene. I've been saying this for two quarters."
- "I can't force product teams to fix their alerts. I can only write guidelines nobody reads."
- "Our MTTR is 34 minutes. Industry benchmark is 15. I know why, but I can't fix it alone."
- "PagerDuty costs us $47/seat and half the alerts it sends are noise."
- "I need a way to show leadership that alert quality is an engineering productivity problem, not just an SRE problem."
**THINKS:**
- "I know which alerts are noise. I've known for months. But fixing them requires buy-in from 6 different teams and none of them prioritize it."
- "If I suppress alerts aggressively, and something breaks, it's MY head on the block."
- "The junior engineers on rotation are getting destroyed. I can see it in their faces. I need to protect them but I don't have the tools."
- "I could build something internally... but I've been saying that for a year and we never have the bandwidth."
- "Am I going to spend my entire career fighting alert noise? Is this really what SRE is?"
**DOES:**
- Runs a monthly "alert review" meeting that nobody wants to attend
- Maintains a spreadsheet tracking alert-to-incident ratios per service (manually updated, always out of date)
- Writes alert rules for other teams because they won't write good ones themselves
- Spends 30% of his time on alert tuning instead of platform work
- Advocates for "alert budgets" per team (like error budgets) — leadership likes the idea but won't enforce it
- Reviews every postmortem looking for "was the right alert the first alert?" (answer is usually no)
**FEELS:**
- **Frustration:** He has the expertise to fix this but not the organizational leverage. He's a platform lead, not a VP.
- **Responsibility:** Every bad on-call night for his team feels like his failure. He set up the rotation. He should have fixed the noise.
- **Exhaustion:** Alert tuning is Sisyphean. Fix 10 noisy alerts, 15 new ones appear because someone shipped a new service with default thresholds.
- **Professional anxiety:** His MTTR metrics look bad. Leadership sees numbers, not the nuance of why.
- **Loneliness:** He's the only one who sees the full picture. Product teams see their alerts. He sees ALL the alerts. The view is terrifying.
### Pain Points

1. **No leverage over alert quality** — He can write guidelines, but can't force teams to follow them. Alert quality is a tragedy of the commons.
2. **Manual correlation is his full-time job** — He's the human correlation engine. When an incident happens, HE connects the dots across services because no tool does it.
3. **Metrics are hard to produce** — Proving that alert noise costs money requires data he has to manually compile. Leadership wants dashboards, not spreadsheets.
4. **Tool sprawl** — His team uses Datadog for metrics, Grafana for some dashboards, PagerDuty for paging, OpsGenie for some teams that refused PagerDuty. He's managing 4 alerting surfaces.
5. **The cold start problem with every new service** — New services launch with terrible default alerts. By the time they're tuned, the team has already suffered through weeks of noise.
6. **Retention risk** — He's lost 2 engineers in the past year who cited on-call burden. Recruiting replacements took 4 months each.
|
||||||
|
|
||||||
|
### Current Workarounds
|
||||||
|
- The spreadsheet. Always the spreadsheet.
|
||||||
|
- Monthly "alert amnesty" where teams can delete alerts without judgment (attendance: poor)
|
||||||
|
- A Slack bot he hacked together that counts alerts per channel per day (it breaks constantly)
|
||||||
|
- Manually tagging alerts as "noise" or "signal" in postmortem docs
|
||||||
|
- Begging product managers to prioritize alert fixes by framing it as "developer productivity"
|
||||||
|
- Taking on-call shifts himself to "lead by example" (and to spare his junior engineers)
|
||||||
|
|
||||||
|
### Jobs to Be Done (JTBD)
|
||||||
|
- **When** I'm reviewing on-call health, **I want to** see exactly which alerts are noise and which are signal across all teams, **so I can** prioritize fixes with data instead of gut feel.
|
||||||
|
- **When** a new service launches, **I want to** automatically apply intelligent alert defaults, **so I can** prevent the cold-start noise problem.
|
||||||
|
- **When** I'm presenting to leadership, **I want to** show the business cost of alert noise (MTTR impact, engineer hours wasted, attrition risk), **so I can** get budget and priority for fixing it.
|
||||||
|
- **When** an incident is in progress, **I want to** see correlated alerts across all services and tools in one view, **so I can** guide the response team to the root cause faster.
|
||||||
|
|
||||||
|
### Day-in-the-Life: Monday
|
||||||
|
|
||||||
|
**8:00 AM** — Opens his alert metrics spreadsheet. Last week: 1,247 alerts across all teams. 89 resulted in actual incidents. That's a 7.1% signal rate. He's been tracking this for 6 months. It's getting worse, not better.
|
||||||
|
|
||||||
|
**9:30 AM** — Alert review meeting. 3 of 8 team leads show up. They review the top 10 noisiest alerts. Everyone agrees they should be fixed. Nobody commits to a timeline. Marcus assigns himself 4 of them because nobody else will.
|
||||||
|
|
||||||
|
**11:00 AM** — Gets pulled into an incident. Payment service is throwing errors. He immediately checks: was there a deploy? (Yes, 20 minutes ago.) Are other services affected? (He checks 3 dashboards to find out — yes, the downstream notification service is also erroring.) He connects the dots in 6 minutes. The on-call engineer had been looking at the notification service errors for 15 minutes without realizing the root cause was upstream.
|
||||||
|
|
||||||
|
**1:30 PM** — Writes a postmortem for last week's P1. In the "what went well / what didn't" section, he writes: "The first alert that fired was a symptom, not the cause. The causal alert fired 4 minutes later but was buried in 23 other alerts." He's written this same sentence in 11 different postmortems.
|
||||||
|
|
||||||
|
**3:00 PM** — 1:1 with his manager (Director of Engineering). Manager asks about MTTR. Marcus shows the spreadsheet. Manager says "Can you get this into a dashboard?" Marcus thinks: "With what time?"
|
||||||
|
|
||||||
|
**5:30 PM** — Reviewing PagerDuty bill. $47/seat × 40 engineers on rotation = $1,880/month. For a tool that faithfully delivers noise to people's phones at 3am. He wonders if there's something better.
|
||||||
|
|
||||||
|
**7:00 PM** — At home. Gets a Slack DM from a junior engineer: "Hey Marcus, I'm on-call tonight. Any tips for the checkout-latency alert? It fired 8 times last night for Priya." He sends her his personal runbook. He thinks about building an internal tool. Again. He opens a beer instead.
|
||||||
|
|
||||||
|
---

## Persona 3: Diana Okafor — The VP of Engineering

**Age:** 41 | **Role:** VP of Engineering | **Company:** Same Series C SaaS, 140 engineers, reports to CTO

**Slack Status:** Rarely on Slack. Lives in Google Docs and Zoom.

### Empathy Map

**SAYS:**

- "Our MTTR is 34 minutes. The board wants it under 15. What's the plan?"
- "I keep hearing about alert fatigue but I need data, not anecdotes."
- "We lost two SREs last quarter. Recruiting is taking forever. We need to fix the on-call experience."
- "I'm not going to approve another $50K tool unless someone can show me ROI in the first quarter."
- "Why are we paying Datadog $180K/year and PagerDuty $22K/year and still having these problems?"

**THINKS:**

- "On-call burnout is a retention problem disguised as a tooling problem. Or is it the other way around?"
- "If we have another major incident where the alert was missed because of noise, the CTO is going to ask me hard questions I don't have answers to."
- "Marcus keeps asking for headcount. I believe him that the team is stretched, but I need to justify it with metrics the CFO will accept."
- "The engineers complain about on-call but I don't have visibility into what's actually happening. I see incident counts and MTTR. I don't see the human cost."
- "We're spending $200K+/year on monitoring and alerting tools. Are we getting $200K of value?"

**DOES:**

- Reviews MTTR and incident count dashboards weekly (surface-level metrics that don't capture the real problem)
- Approves tool purchases based on ROI projections and vendor demos (has been burned by tools that demo well but don't deliver)
- Runs quarterly engineering satisfaction surveys — "on-call experience" has been the #1 complaint for 3 consecutive quarters
- Asks Marcus for "the alert noise number" before board meetings (Marcus scrambles to update his spreadsheet)
- Compares their incident metrics to industry benchmarks and doesn't like what she sees
- Has started mentioning "alert fatigue" in leadership meetings because she read a Gartner report about it

**FEELS:**

- **Accountability pressure:** She owns engineering productivity. If MTTR is bad, it's her problem. If engineers quit, it's her problem.
- **Information asymmetry:** She knows something is wrong but can't see the details. She's dependent on Marcus's spreadsheets and anecdotal reports.
- **Budget anxiety:** Every new tool is a line item she has to defend. The CFO questions every SaaS subscription over $10K/year.
- **Empathy (distant):** She was an engineer once. She remembers bad on-call nights. But it's been 8 years since she was in rotation, and the scale of the problem has changed.
- **Strategic concern:** Competitors are shipping faster. If her engineers are spending 30% of their cognitive energy on alert noise, that's 30% less innovation.

### Pain Points

1. **No single metric for alert health** — She has MTTR, incident count, and anecdotes. She needs a "noise score" or "alert quality index" she can track over time and present to the board.
2. **ROI of monitoring tools is unmeasurable** — She's spending $200K+/year on Datadog + PagerDuty + Grafana. She can't quantify what she's getting for that money.
3. **Attrition is expensive and invisible** — Losing an SRE costs $150-300K (recruiting + ramp + lost institutional knowledge). Alert fatigue drives attrition. But the causal chain is hard to prove to a CFO.
4. **Tool fatigue** — Her teams already use too many tools. Adding another one is a hard sell unless it REPLACES something or has undeniable, immediate value.
5. **Compliance risk** — They're in fintech. Missed alerts could mean regulatory issues. She loses sleep over this (ironic, given the product).
6. **No visibility into cross-team alert patterns** — She doesn't know that Team A's noisy alerts are causing Team B's MTTR to spike because of shared dependencies.

### Current Workarounds

- Marcus's spreadsheet (she knows it's manual and incomplete, but it's all she has)
- Quarterly "on-call health" reviews that produce action items nobody follows up on
- Throwing headcount at the problem (hiring more SREs to spread the on-call load)
- Vendor calls with PagerDuty's "customer success" team that result in no meaningful changes
- Asking engineering managers to "prioritize alert hygiene" without giving them dedicated time to do it

### Jobs to Be Done (JTBD)

- **When** I'm preparing for a board meeting, **I want to** show a clear metric for operational health that includes alert quality, **so I can** demonstrate that we're improving (or justify investment if we're not).
- **When** I'm evaluating a new tool, **I want to** see projected ROI based on our actual data within the first week, **so I can** make a fast, confident buy decision.
- **When** an engineer quits citing on-call burden, **I want to** have data showing exactly how bad their on-call experience was, **so I can** fix the systemic issue instead of just backfilling the role.
- **When** I'm allocating engineering time, **I want to** know which teams have the worst alert noise, **so I can** direct investment where it has the most impact.

### Day-in-the-Life: Wednesday

**7:30 AM** — Checks email. CTO forwarded a Gartner report: "AIOps Market to Reach $40B by 2028." Attached note: "Should we be looking at this?" She adds it to her reading list.

**9:00 AM** — Leadership standup. CTO asks about the P1 incident from last week. Diana gives the summary. CTO asks: "Why did it take 34 minutes to respond?" Diana says: "The on-call engineer was triaging other alerts when it fired." CTO's eyebrow goes up. Diana makes a mental note to talk to Marcus.

**10:30 AM** — 1:1 with Marcus. He shows her the spreadsheet: 1,247 alerts last week, 89 real incidents. She does the math: 93% noise. She asks: "Can we get this to 50% noise?" Marcus says: "Not without dedicated engineering time from every team, or a tool that does it for us." She asks him to evaluate options.

**12:00 PM** — Lunch with the Head of Recruiting. They discuss the two open SRE roles. Average time-to-fill for SREs in their market: 67 days. Cost per hire: $45K (agency fees + interview time). Diana thinks about how much cheaper it would be to just not burn out the SREs they have.

**2:00 PM** — Quarterly planning. She's trying to allocate 15% of engineering time to "platform health" but product managers are pushing back. They want features. She needs ammunition — hard data showing that alert noise is costing them feature velocity.

**4:00 PM** — Reviews the engineering satisfaction survey results. On-call experience: 2.1 out of 5. Comments include: "I dread my on-call weeks," "The alerts are mostly useless," and "I'm considering leaving if this doesn't improve." She highlights these for the CTO.

**6:30 PM** — Driving home. Thinks about the $200K monitoring bill. Thinks about the 2 engineers who left. Thinks about the 34-minute MTTR. Thinks: "There has to be a better way." Opens LinkedIn at a red light. Sees an ad for yet another AIOps platform. Closes LinkedIn.

---

# Phase 2: DEFINE 🎯

> *"The problem is never what people say it is. Priya says 'too many alerts.' Marcus says 'bad alert hygiene.' Diana says 'high MTTR.' They're all describing the same elephant from different angles. Our job is to see the whole animal."*

---

## Point-of-View Statements

A POV statement crystallizes the tension: [User] needs [need] because [insight].

### Priya (On-Call Engineer)

**Priya, a dedicated backend engineer on a weekly on-call rotation, needs to instantly distinguish real incidents from noise at 3am because her brain has been conditioned to ignore alerts — and the one time she shouldn't ignore one, she will.**

The deeper insight: Priya's problem isn't volume. It's *trust erosion*. Every false alarm trains her nervous system to stop caring. The alert system is literally conditioning her to fail at the one moment it matters most. This is a Pavlovian tragedy.

### Marcus (SRE/Platform Lead)

**Marcus, an SRE lead responsible for operational health across 8 teams, needs a way to make alert quality visible and actionable across the organization because he currently holds all the correlation knowledge in his head — and that knowledge walks out the door when he goes on vacation.**

The deeper insight: Marcus is a human AIOps engine. He IS the correlation layer. He IS the deduplication algorithm. The organization has outsourced its alert intelligence to one person's brain. That's not a process — that's a single point of failure wearing a hoodie.

### Diana (VP of Engineering)

**Diana, a VP of Engineering accountable for engineering productivity and retention, needs a single, defensible metric for alert health because she's fighting a war she can't measure — and in leadership, what you can't measure, you can't fund.**

The deeper insight: Diana's problem is *translation*. She needs to convert "Priya had a terrible night" into "$47,000 in lost productivity and attrition risk" — a language the CFO and board understand. Without that translation layer, alert fatigue remains an anecdote, not a budget line item.

---

## Key Insights

1. **Trust is the product, not technology.** The #1 barrier to adoption isn't feature gaps — it's that engineers won't trust AI to suppress alerts. One missed critical alert = permanent distrust. The trust ramp IS the product challenge.

2. **Alert fatigue is a tragedy of the commons.** No single team owns the problem. Team A's noisy service creates Team B's on-call nightmare. Without organizational visibility, everyone optimizes locally and the system degrades globally.

3. **The correlation knowledge is trapped in human brains.** Marcus knows that "DB slow + API errors + frontend timeouts = one incident." That knowledge isn't in any tool. When Marcus is unavailable, MTTR doubles because nobody else can connect the dots.

4. **Metrics exist at the wrong altitude.** Diana sees MTTR (too high-level). Priya sees individual alerts (too low-level). Nobody sees the middle layer: alert quality, noise ratios, correlation patterns, cost-per-false-alarm. This middle layer is where decisions should be made.

5. **The economic cost is real but invisible.** Alert fatigue drives attrition ($150-300K per lost SRE), inflates MTTR (3-5x longer with high noise), reduces next-day productivity (40% cognitive tax), and creates compliance risk. But nobody has a dashboard that shows this. The cost hides in plain sight.

6. **Engineers don't want another tool — they want fewer interruptions.** The bar for a new tool is astronomically high. It must live where they already are (Slack), require zero behavior change, and prove value before asking for commitment.

7. **The cold start paradox.** AI needs data to be smart. Day 1, it's dumb. But Day 1 is when you need to prove value. The solution: rule-based intelligence first (time-window clustering, deployment correlation) that works immediately, with ML that gets smarter over time.

8. **Transparency is non-negotiable.** Engineers will reject a black box. Every suppression, every grouping, every priority score must be explainable. "We suppressed this because..." is the most important sentence in the product.

---

## Core Tension

Here's the fundamental tension at the heart of dd0c/alert — the jazz dissonance that makes this product interesting:

**Engineers need FEWER alerts to do their jobs, but they're TERRIFIED of missing the one that matters.**

This is not a feature problem. This is a *psychological safety* problem. The product must simultaneously:

- **Suppress aggressively** (to deliver the 70-90% noise reduction that makes it worth buying)
- **Never miss a critical alert** (to maintain the trust that makes it usable)
- **Prove it's working** (to justify the suppression to skeptical engineers AND budget-conscious VPs)

The resolution of this tension is *graduated trust*. You don't suppress on Day 1. You SHOW what you WOULD suppress. You let the engineer confirm. You build a track record. You earn the right to act autonomously. Like a new musician sitting in with a jazz ensemble — you listen for 3 songs before you solo.

---

## How Might We (HMW) Questions

These are the creative springboards. Each one opens a design space.

### Trust & Transparency

1. **HMW** build trust with on-call engineers who've been burned by every "smart" alerting tool before?
2. **HMW** make alert suppression decisions transparent and reversible, so engineers feel in control even when AI is acting?
3. **HMW** create a "trust score" that grows over time, unlocking more autonomous behavior as the system proves itself?

### Signal vs. Noise

4. **HMW** reduce alert volume by 70%+ without ever suppressing a genuinely critical alert?
5. **HMW** help engineers distinguish "this is noise" from "this is a symptom of something real" in under 10 seconds?
6. **HMW** automatically correlate alerts across multiple monitoring tools so engineers see ONE incident instead of 47 symptoms?

### Organizational Visibility

7. **HMW** give Marcus (SRE lead) a real-time view of alert quality across all teams without requiring manual data collection?
8. **HMW** translate alert noise into business metrics (dollars, hours, attrition risk) that Diana can present to the board?
9. **HMW** create accountability for alert quality without turning it into a blame game between teams?

### Developer Experience

10. **HMW** deliver value in the first 60 seconds of setup, before the AI has learned anything?
11. **HMW** make Slack the primary interface so engineers never have to learn a new tool?
12. **HMW** provide context-rich incident summaries that eliminate the "is this real?" triage phase entirely?

### Learning & Adaptation

13. **HMW** learn from engineer behavior (acks, snoozes, ignores) without requiring explicit feedback?
14. **HMW** handle the cold-start problem so the product is useful on Day 1, not Day 30?
15. **HMW** adapt to each team's unique noise profile instead of applying one-size-fits-all rules?

### Human Cost

16. **HMW** measure and make visible the human cost of alert fatigue (sleep disruption, cognitive load, burnout)?
17. **HMW** protect junior engineers on rotation from the worst of the noise while they're still learning?
18. **HMW** turn on-call from a dreaded obligation into a manageable, even empowering experience?

---
480
products/03-alert-intelligence/epics/epics.md
Normal file
# dd0c/alert — V1 MVP Epics

**Product:** dd0c/alert (Alert Intelligence Platform)

**Phase:** 7 — Epics & Stories

---

## Epic 1: Webhook Ingestion

**Description:** The front door of dd0c/alert. Responsible for receiving alert payloads from monitoring providers via webhooks, validating their authenticity, normalizing them into a canonical schema, and queuing them securely for the Correlation Engine. Must support high burst volume (incident storms) and guarantee zero dropped payloads.

### User Stories

**Story 1.1: Datadog Webhook Ingestion**

* **As a** Platform Engineer, **I want** to send Datadog webhooks to a unique dd0c URL, **so that** my Datadog alerts enter the correlation pipeline.
* **Acceptance Criteria:**
    - System exposes `POST /v1/wh/{tenant_id}/datadog`
    - Normalizes Datadog JSON (handles arrays/batched alerts) into the Canonical Alert Schema.
    - Normalizes Datadog P1-P5 severities into critical/high/medium/low/info.
* **Estimate:** 3 points
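The severity normalization in Story 1.1 can be sketched as a simple lookup. This is a minimal illustration, not the product's actual code; the fallback to `"medium"` for unknown priorities is an assumption.

```python
# Illustrative sketch of Story 1.1's severity normalization.
DATADOG_SEVERITY_MAP = {
    "P1": "critical",
    "P2": "high",
    "P3": "medium",
    "P4": "low",
    "P5": "info",
}

def normalize_severity(datadog_priority: str) -> str:
    # Unknown or missing priorities default to "medium" -- an assumption, not spec.
    return DATADOG_SEVERITY_MAP.get(datadog_priority.upper(), "medium")
```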

**Story 1.2: PagerDuty Webhook Ingestion**

* **As a** Platform Engineer, **I want** to send PagerDuty v3 webhooks to dd0c, **so that** my PD incidents are tracked.
* **Acceptance Criteria:**
    - System exposes `POST /v1/wh/{tenant_id}/pagerduty`
    - Normalizes PagerDuty JSON into the Canonical Alert Schema.
* **Estimate:** 3 points

**Story 1.3: HMAC Signature Validation**

* **As a** Security Admin, **I want** all incoming webhooks to have their HMAC signatures validated, **so that** bad actors cannot inject fake alerts.
* **Acceptance Criteria:**
    - Rejects payloads with missing or invalid `DD-WEBHOOK-SIGNATURE` or `X-PagerDuty-Signature` headers with 401 Unauthorized.
    - Compares against the integration secret stored in DynamoDB/Secrets Manager.
* **Estimate:** 3 points
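The core of Story 1.3 is a constant-time HMAC comparison. A minimal sketch, assuming a hex-encoded HMAC-SHA256 signature (providers vary in exact encoding and header prefixes, so treat the format as an assumption):

```python
import hmac
import hashlib

def verify_signature(secret: str, body: bytes, received_sig: str) -> bool:
    """Return True only if the hex HMAC-SHA256 of the raw body matches the header value."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, avoiding timing side channels.
    return hmac.compare_digest(expected, received_sig)
```

A failed check maps to the 401 Unauthorized response in the acceptance criteria.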

**Story 1.4: Payload Normalization & Deduplication (Fingerprinting)**

* **As an** On-Call Engineer, **I want** identical alerts to be deterministically fingerprinted, **so that** flapping or duplicated payloads are instantly recognized.
* **Acceptance Criteria:**
    - Generates a SHA-256 fingerprint based on `tenant_id + provider + service + normalized_title`.
    - Pushes canonical alert to SQS FIFO queue with `MessageGroupId=tenant_id`.
    - Saves raw payload asynchronously to S3 for audit/replay.
* **Estimate:** 5 points
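The fingerprint in Story 1.4 can be sketched as below. The field order, separator, and title lowercasing are illustrative assumptions; the spec only fixes the four inputs and SHA-256.

```python
import hashlib

def alert_fingerprint(tenant_id: str, provider: str,
                      service: str, normalized_title: str) -> str:
    """Deterministic SHA-256 fingerprint over the four fields named in Story 1.4."""
    # Separator and case-folding are assumptions to make the hash stable.
    key = "|".join([tenant_id, provider, service, normalized_title.strip().lower()])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the hash is deterministic, a flapping alert produces the same fingerprint every time it fires, which is what makes cheap deduplication possible downstream.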

### Dependencies

- Story 1.3 depends on Tenant/Integration configuration existing (Epic 9).
- Story 1.4 depends on Canonical Alert Schema definition.

### Technical Notes

- **Infra:** API Gateway HTTP API -> Lambda -> SQS FIFO.
- Lambda must return 200 OK to the provider in <100ms. S3 raw payload storage must be non-blocking (async).
- Use ULIDs for `alert_id` for time-sortability.

## Epic 2: Correlation Engine

**Description:** The intelligence core. Consumes the normalized SQS FIFO queue, groups alerts based on time windows and service dependencies, and outputs correlated incidents.

### User Stories

**Story 2.1: Time-Window Clustering**

* **As an** On-Call Engineer, **I want** alerts firing within a brief time window for the same service to be grouped together, **so that** I don't get paged 10 times for one failure.
* **Acceptance Criteria:**
    - Opens a 5-minute (configurable) correlation window in Redis when a new alert fingerprint arrives.
    - Groups subsequent alerts for the same tenant/service into the active window.
    - Stores the correlation state in ElastiCache Redis.
* **Estimate:** 5 points

**Story 2.2: Cascading Failure Correlation (Service Graph)**

* **As an** On-Call Engineer, **I want** cascading failures across dependent services to be merged into a single incident, **so that** I can see the blast radius of an issue.
* **Acceptance Criteria:**
    - Reads explicit service dependencies from DynamoDB (`upstream -> downstream`).
    - If a window is open for an upstream service, downstream service alerts are merged into the same window.
* **Estimate:** 8 points

**Story 2.3: Active Window Extension**

* **As an** On-Call Engineer, **I want** the correlation window to automatically extend if alerts are still trickling in, **so that** long-running, cascading incidents are correctly grouped.
* **Acceptance Criteria:**
    - If a new alert arrives within the last 30 seconds of a window, the window extends by 2 minutes (max 15 minutes).
    - Updates the `closes_at` timestamp in Redis.
* **Estimate:** 3 points
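Story 2.3's extension rule is small enough to express as a pure function. A sketch of exactly the numbers in the acceptance criteria (30-second trigger, 2-minute extension, 15-minute hard cap), with illustrative names:

```python
MAX_WINDOW_SECONDS = 15 * 60   # hard cap from the acceptance criteria
EXTENSION_SECONDS = 2 * 60     # extend by 2 minutes per trigger
TRIGGER_SECONDS = 30           # only if the alert lands in the final 30s of the window

def maybe_extend(opened_at: float, closes_at: float, alert_at: float) -> float:
    """Return the (possibly extended) closes_at timestamp for an active window."""
    in_final_stretch = closes_at - TRIGGER_SECONDS <= alert_at < closes_at
    if not in_final_stretch:
        return closes_at
    # Never let the window live longer than 15 minutes from when it opened.
    return min(closes_at + EXTENSION_SECONDS, opened_at + MAX_WINDOW_SECONDS)
```

In production the returned value would be written back to the `closes_at` field in Redis, as the second criterion requires.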

**Story 2.4: Incident Generation & Persistence**

* **As an** On-Call Engineer, **I want** completed time windows to be saved as durable incidents, **so that** I have a permanent record of the correlated event.
* **Acceptance Criteria:**
    - When a window closes, it generates an Incident record in DynamoDB.
    - Generates an event in TimescaleDB for trend tracking.
    - Pushes a `correlation-request` to the Suggestion Engine SQS queue.
* **Estimate:** 5 points

### Dependencies

- Story 2.1 depends on Epic 1 (normalized SQS queue).
- Story 2.2 depends on a basic service dependency mapping (either config or API).

### Technical Notes

- **Infra:** ECS Fargate consuming SQS FIFO.
- Must use Redis Sorted Sets for active window management (`closes_at_epoch` as score).
- The correlation engine must be stateless (relying on Redis) so it can scale horizontally to handle incident storms.

## Epic 3: Noise Analysis

**Description:** The Suggestion Engine. Calculates a noise score (0-100) for correlated incidents and generates observe-only suppression suggestions. It strictly adheres to V1 constraints by *never* taking auto-action.

### User Stories

**Story 3.1: Rule-Based Noise Scoring**

* **As an** On-Call Engineer, **I want** every incident to receive a noise score based on objective data points, **so that** I have a metric to understand if this incident is likely a false positive.
* **Acceptance Criteria:**
    - Calculates a 0-100 noise score when an incident is generated.
    - Scores based on duplicate fingerprints (flapping), severity distribution (info vs critical), and time of day.
    - Cap at 100, floor at 0.
* **Estimate:** 5 points

**Story 3.2: "Never Suppress" Safelist Execution**

* **As a** Platform Engineer, **I want** critical services (databases, billing) to be excluded from high noise scoring regardless of pattern, **so that** I never miss a genuine P1.
* **Acceptance Criteria:**
    - Implements a default safelist regex (e.g., `db|rds|payment|billing`).
    - Forces the noise score below 50 if the service or title matches the safelist, or if severity is critical.
* **Estimate:** 3 points
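Stories 3.1 and 3.2 together can be sketched as one scoring function. The safelist regex and the "force below 50" clamp come from the acceptance criteria; the individual score weights are illustrative assumptions, not spec.

```python
import re

# Default safelist regex from Story 3.2.
SAFELIST = re.compile(r"db|rds|payment|billing", re.IGNORECASE)

def noise_score(duplicate_count: int, info_ratio: float, fired_in_business_hours: bool,
                service: str, title: str, severity: str) -> int:
    """Rule-based 0-100 noise score (Story 3.1) with the safelist clamp (Story 3.2)."""
    score = 0
    score += min(duplicate_count * 10, 40)          # flapping fingerprints (weight assumed)
    score += int(info_ratio * 30)                   # mostly info-level alerts (weight assumed)
    score += 10 if fired_in_business_hours else 0   # time-of-day factor (weight assumed)
    score = max(0, min(score, 100))                 # cap at 100, floor at 0
    # Safelisted services/titles and critical severity are never scored as high noise.
    if severity == "critical" or SAFELIST.search(service) or SAFELIST.search(title):
        score = min(score, 49)
    return score
```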

**Story 3.3: Observe-Only Suppression Suggestions**

* **As an** On-Call Engineer, **I want** the system to tell me what it *would* have suppressed, **so that** I can build trust in its intelligence without risking an outage.
* **Acceptance Criteria:**
    - If a noise score > 80, the system generates a `suppress` suggestion record in DynamoDB.
    - Generates plain-English reasoning for the suggestion (e.g., "This pattern was resolved automatically 4 times this month.").
    - `action_taken` is always hardcoded to `none` for V1.
* **Estimate:** 5 points

**Story 3.4: Incident Scoring Metrics Collection**

* **As an** Engineering Manager, **I want** the noise scores and counts to be stored as time-series data, **so that** I can view trends in our alert hygiene over time.
* **Acceptance Criteria:**
    - Writes noise score, alert counts, and unique fingerprints to TimescaleDB `alert_timeseries` table.
* **Estimate:** 3 points

### Dependencies

- Story 3.1 depends on Epic 2 for Incident Generation.
- Story 3.3 depends on Epic 5 (Slack Bot) to display the suggestion.

### Technical Notes

- **Infra:** ECS Fargate consuming from the `correlation-request` SQS queue.
- Use PostgreSQL (TimescaleDB) for historical frequency lookups ("how many times has this fired in 7 days?") to inform the score.

## Epic 4: CI/CD Correlation

**Description:** Ingests deployment events and correlates them with alert storms. The "killer feature" mandated by the Party Mode board for V1 MVP, answering "did this break right after a deploy?"

### User Stories

**Story 4.1: GitHub Actions Deploy Ingestion**

* **As a** Platform Engineer, **I want** to connect my GitHub Actions deployment webhooks, **so that** dd0c/alert knows exactly when and who deployed to production.
* **Acceptance Criteria:**
    - System exposes `POST /v1/wh/{tenant_id}/github`
    - Validates `X-Hub-Signature-256`.
    - Normalizes GHA workflow run payload into `DeployEvent` canonical schema.
    - Pushes deploy event to SQS FIFO queue (`deploy-event`).
* **Estimate:** 3 points

**Story 4.2: Deploy-to-Alert Correlation**

* **As an** On-Call Engineer, **I want** an alert cluster to be automatically tagged with a recent deployment to that service, **so that** I don't waste 15 minutes checking deploy logs manually.
* **Acceptance Criteria:**
    - When the Correlation Engine opens a window, it queries DynamoDB for deployments to the affected service within a configurable lookback window (default 15m for prod, 30m for staging).
    - If a match is found, the deploy context (`deploy_pr`, `deploy_author`, `source_url`) is attached to the window state.
* **Estimate:** 8 points

**Story 4.3: Deploy-Weighted Noise Scoring**

* **As an** On-Call Engineer, **I want** alerts that are highly correlated with deployments to be scored as more likely to be noise (if they aren't critical), **so that** feature flags and config refreshes don't wake me up.
* **Acceptance Criteria:**
    - If a deploy event is attached to an incident, boost the noise score by 15-30 points.
    - Additional +5 points if the PR title matches `config` or `feature-flag`.
* **Estimate:** 2 points
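Story 4.3's boost can be sketched as a pure function. The 15-30 point range and the +5 for `config`/`feature-flag` PR titles come from the acceptance criteria; scaling the boost by a correlation "confidence" value is an assumption about how the range would be picked.

```python
CONFIG_PATTERNS = ("config", "feature-flag")

def deploy_boost(base_score: int, has_recent_deploy: bool,
                 pr_title: str = "", confidence: float = 1.0) -> int:
    """Boost a noise score by 15-30 points when a deploy is attached (Story 4.3)."""
    if not has_recent_deploy:
        return base_score
    # Confidence in [0, 1] maps the boost onto the 15..30 range (an assumption).
    boost = 15 + round(15 * max(0.0, min(confidence, 1.0)))
    if any(p in pr_title.lower() for p in CONFIG_PATTERNS):
        boost += 5
    return min(base_score + boost, 100)   # respect the 100-point cap from Epic 3
```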

### Dependencies

- Story 4.2 depends on Epic 2 (Correlation Engine) and Epic 3 (Noise Analysis).
- Service name mapping between GitHub and Datadog/PagerDuty (convention-based string matching).

### Technical Notes

- **Infra:** The Deployment Tracker runs as a module within the Correlation Engine ECS Task to avoid network latency.
- DynamoDB needs a Global Secondary Index (GSI): `tenant_id` + `service` + `completed_at` to quickly find recent deploys.
|
||||||
|
|
||||||
|
## Epic 5: Slack Bot

**Description:** The primary interface for on-call engineers. Delivers correlated incident summaries, observe-only suppression suggestions, and daily alert digests directly into Slack. Provides interactive buttons for engineers to acknowledge or validate suggestions.

### User Stories

**Story 5.1: Incident Summary Notifications**

* **As an** On-Call Engineer, **I want** to receive a single, concise Slack message when an alert storm is correlated, **so that** I don't get flooded with dozens of individual alert notifications.
* **Acceptance Criteria:**
  - Bot sends a formatted Slack Block Kit message to a configured channel.
  - Message groups all related alerts under a single incident title.
  - Displays the total number of correlated alerts, affected services, and start time.
* **Estimate:** 5 points
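A Block Kit payload for this summary is just a small list of JSON blocks. A sketch with assumed incident field names (the real canonical schema lives in Epic 2):

```python
def incident_summary_blocks(incident: dict) -> list:
    """Build a Slack Block Kit payload for a correlated-incident summary.

    The `incident` field names here are illustrative placeholders.
    """
    return [
        {"type": "header",
         "text": {"type": "plain_text", "text": incident["title"]}},
        {"type": "section",
         "fields": [
             {"type": "mrkdwn",
              "text": f"*Correlated alerts:* {incident['alert_count']}"},
             {"type": "mrkdwn",
              "text": "*Services:* " + ", ".join(incident["services"])},
             {"type": "mrkdwn",
              "text": f"*Started:* {incident['started_at']}"},
         ]},
    ]
```

The resulting list is sent as the `blocks` field of `chat.postMessage`; the Story 5.3 feedback buttons would be appended as an `actions` block.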

**Story 5.2: Observe-Only Suppression Suggestions in Slack**

* **As an** On-Call Engineer, **I want** the Slack message to include the system's noise score and suppression recommendation, **so that** I can evaluate its accuracy in real time.
* **Acceptance Criteria:**
  - If the noise score is > 80, the message includes a specific "Suggestion" block (e.g., "Would have auto-suppressed: 95% noise score").
  - Includes the plain-English reasoning generated in Epic 3.
* **Estimate:** 3 points

**Story 5.3: Interactive Feedback Actions**

* **As an** On-Call Engineer, **I want** to click "Good Catch" or "Bad Suggestion" on the Slack message, **so that** I can help train the noise analysis engine for future versions.
* **Acceptance Criteria:**
  - Slack message includes interactive buttons for feedback.
  - Clicking a button sends a payload back to dd0c/alert to record the user's validation in the database.
  - Updates the Slack message to acknowledge the feedback.
* **Estimate:** 5 points

**Story 5.4: Daily Alert Digest**

* **As an** Engineering Manager, **I want** a daily summary of the noisiest services and total incidents dropped into Slack, **so that** my team can prioritize technical debt.
* **Acceptance Criteria:**
  - A scheduled job runs daily at 9 AM (configurable timezone).
  - Aggregates the previous 24 hours of data from TimescaleDB.
  - Posts a summary of the "Top 3 Noisiest Services" and estimated "Total Time Saved" to the channel.
* **Estimate:** 5 points
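The digest itself is a small aggregation over the previous day's alert rows. A sketch where the in-memory list stands in for the TimescaleDB query result, and the minutes-saved-per-noisy-alert heuristic is an assumed constant, not a product number:

```python
from collections import Counter

def daily_digest(alerts, minutes_saved_per_noise_alert=5):
    """Aggregate 24h of alert rows into the Slack digest payload.

    Services are ranked by summed noise score; the time-saved figure is an
    assumed heuristic (noisy alerts x minutes each).
    """
    noise_by_service = Counter()
    for a in alerts:
        noise_by_service[a["service"]] += a["noise_score"]
    noisy = sum(1 for a in alerts if a["noise_score"] > 80)
    return {
        "top_noisiest_services": [s for s, _ in noise_by_service.most_common(3)],
        "total_alerts": len(alerts),
        "estimated_minutes_saved": noisy * minutes_saved_per_noise_alert,
    }
```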

### Dependencies

- Story 5.1 depends on Epic 2 (Correlation Engine).
- Story 5.2 depends on Epic 3 (Noise Analysis).

### Technical Notes

- **Infra:** AWS Lambda for handling incoming Slack interactions (buttons) via API Gateway.
- Use Slack's Block Kit Builder for UI consistency.
- Requires storing Slack workspace and channel tokens securely in AWS Secrets Manager or DynamoDB.

## Epic 6: Dashboard API

**Description:** The backend REST API that powers the dd0c/alert web dashboard. Provides secure endpoints for authentication, querying historical incidents, analyzing alert volume, and managing tenant configuration.

### User Stories

**Story 6.1: Tenant Authentication & Authorization**

* **As a** Platform Engineer, **I want** to securely log in to the dashboard API, **so that** I can manage my organization's alert data safely.
* **Acceptance Criteria:**
  - Implement JWT-based authentication.
  - Enforce tenant isolation on all API endpoints (users can only access data for their `tenant_id`).
* **Estimate:** 5 points
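Tenant isolation reduces to one invariant: the `tenant_id` claim in the verified token must match the tenant whose data is being requested. A minimal stdlib sketch of HS256 verification plus that check (a real service would use a vetted JWT library and also validate `exp`/`aud`; all names here are illustrative):

```python
import base64
import hashlib
import hmac
import json

def _b64url_decode(part: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def verify_and_scope(token: str, secret: bytes, requested_tenant: str) -> dict:
    """Verify an HS256 JWT and enforce tenant isolation.

    Raises PermissionError on a bad signature or a cross-tenant request.
    """
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("tenant_id") != requested_tenant:
        raise PermissionError("cross-tenant access denied")
    return claims
```

In the Lambda handlers this check would run as middleware before any data access, so no endpoint can forget it.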

**Story 6.2: Incident Query Endpoints**

* **As an** On-Call Engineer, **I want** to fetch a paginated list of historical incidents and their associated alerts, **so that** I can review past outages.
* **Acceptance Criteria:**
  - `GET /v1/incidents` supports pagination, time-range filtering, and service filtering.
  - `GET /v1/incidents/{incident_id}/alerts` returns the raw alerts correlated into that incident.
* **Estimate:** 5 points

**Story 6.3: Analytics & Noise Score API**

* **As an** Engineering Manager, **I want** to query aggregated metrics about alert noise and volume, **so that** I can populate charts on the dashboard.
* **Acceptance Criteria:**
  - `GET /v1/analytics/noise` returns time-series data of average noise scores per service.
  - Queries TimescaleDB efficiently, using materialized views or continuous aggregates if necessary.
* **Estimate:** 8 points

**Story 6.4: Configuration Management Endpoints**

* **As a** Platform Engineer, **I want** to manage my integration webhooks and routing rules via API, **so that** I can script my onboarding or use the UI.
* **Acceptance Criteria:**
  - CRUD endpoints for managing Slack channel destinations.
  - Endpoints to generate and rotate inbound webhook secrets for Datadog/PagerDuty.
* **Estimate:** 3 points

### Dependencies

- Stories 6.2 and 6.3 depend on the TimescaleDB schema and data from Epics 2 and 3.

### Technical Notes

- **Infra:** API Gateway HTTP API -> AWS Lambda (Node.js/Go).
- Strict validation middleware required for tenant isolation.
- Use the standard OpenAPI 3.0 specification for documentation.

## Epic 7: Dashboard UI

**Description:** The React single-page application (SPA) for dd0c/alert. Gives users a visual interface to view the incident timeline, inspect alert correlation details, and understand the noise scoring.

### User Stories

**Story 7.1: Incident Timeline View**

* **As an** On-Call Engineer, **I want** a main feed showing all correlated incidents chronologically, **so that** I can see the current state of my systems at a glance.
* **Acceptance Criteria:**
  - React SPA fetches and displays data from `GET /v1/incidents`.
  - Visual distinction between high-noise (suggested suppressed) and low-noise (critical) incidents.
  - Real-time updates or auto-refresh every 30 seconds.
* **Estimate:** 8 points

**Story 7.2: Alert Correlation Visualizer**

* **As an** On-Call Engineer, **I want** to click on an incident and see exactly which alerts were grouped together, **so that** I understand why the engine correlated them.
* **Acceptance Criteria:**
  - Detail pane showing the timeline of individual alerts within the incident window.
  - Displays the deployment context (Epic 4) if applicable.
* **Estimate:** 5 points

**Story 7.3: Noise Score Breakdown**

* **As a** Platform Engineer, **I want** to see the exact factors that contributed to an incident's noise score, **so that** I can trust the engine's reasoning.
* **Acceptance Criteria:**
  - UI component displaying the 0-100 noise score gauge.
  - Lists the bulleted reasoning (e.g., "+20 points: occurred 10 times this week", "+15 points: recent deployment").
* **Estimate:** 3 points

**Story 7.4: Analytics Dashboard**

* **As an** Engineering Manager, **I want** charts showing alert volume and noise trends over the last 30 days, **so that** I can track improvements in our alert hygiene.
* **Acceptance Criteria:**
  - Integrates a charting library (e.g., Recharts or Chart.js).
  - Displays a bar chart of total alerts vs. correlated incidents to show "noise reduction" value.
* **Estimate:** 5 points

### Dependencies

- Depends entirely on Epic 6 (Dashboard API).

### Technical Notes

- **Infra:** Hosted on AWS S3 + CloudFront, or Vercel.
- Framework: React (Next.js or Vite).
- Tailwind CSS for rapid styling.

## Epic 8: Infrastructure & DevOps

**Description:** The foundational cloud infrastructure and deployment pipelines needed to run dd0c/alert reliably, securely, and with observability.

### User Stories

**Story 8.1: Infrastructure as Code (IaC)**

* **As a** Developer, **I want** all AWS resources defined in code, **so that** I can spin up identical staging and production environments easily.
* **Acceptance Criteria:**
  - Terraform or AWS CDK defines the VPC, API Gateway, Lambda functions, ECS Fargate clusters, SQS queues, and DynamoDB tables.
  - State is stored securely in an S3 backend with DynamoDB locking.
* **Estimate:** 8 points

**Story 8.2: CI/CD Pipelines**

* **As a** Developer, **I want** automated testing and deployment when I push to main, **so that** I can ship features quickly without manual steps.
* **Acceptance Criteria:**
  - GitHub Actions workflow runs unit tests and linters on PRs.
  - Merges to `main` trigger a deployment to the staging environment, followed by a manual approval for production.
* **Estimate:** 5 points

**Story 8.3: System Monitoring & Logging**

* **As a** System Admin, **I want** central logging and metrics for the dd0c/alert services, **so that** I can debug issues when the platform itself fails.
* **Acceptance Criteria:**
  - All Lambda and ECS logs route to CloudWatch Logs.
  - CloudWatch Alarms configured for API 5xx errors and SQS dead-letter queue (DLQ) messages.
* **Estimate:** 3 points

**Story 8.4: Database Provisioning (Timescale & Redis)**

* **As a** Database Admin, **I want** managed, highly available instances for TimescaleDB and Redis, **so that** the correlation engine runs with low latency and durable storage.
* **Acceptance Criteria:**
  - Provisions AWS ElastiCache for Redis (for active window state).
  - Provisions RDS for PostgreSQL with the TimescaleDB extension, or uses Timescale Cloud.
* **Estimate:** 5 points

### Dependencies

- Blocked until architectural decisions are finalized.
- Blocks Epics 1, 2, and 3 from being deployed to production.

### Technical Notes

- Optimize for the solo founder: keep infrastructure simple, and prefer managed services over self-hosted.
- Ensure least-privilege IAM roles between Lambda/ECS and DynamoDB/SQS.

## Epic 9: Onboarding & PLG

**Description:** Product-led growth and the critical 60-second time-to-value flow. Ensures a frictionless setup experience so new users can connect their monitoring tools and Slack workspace immediately.

### User Stories

**Story 9.1: Frictionless Sign-Up**

* **As a** New User, **I want** to sign up using my GitHub or Google account, **so that** I don't have to create and remember a new password.
* **Acceptance Criteria:**
  - Implement OAuth2 login (GitHub/Google).
  - Automatically provisions a new `tenant_id` and default configuration upon successful first login.
* **Estimate:** 5 points

**Story 9.2: Webhook Setup Wizard**

* **As a** New User, **I want** a step-by-step wizard to configure my Datadog or PagerDuty webhooks, **so that** I can start sending data to dd0c/alert immediately.
* **Acceptance Criteria:**
  - UI wizard provides copy-paste-ready webhook URLs and secrets.
  - Includes a "Waiting for first payload..." state that updates in real time via WebSockets or polling when the first alert arrives.
* **Estimate:** 8 points

**Story 9.3: Slack App Installation Flow**

* **As a** New User, **I want** a one-click "Add to Slack" button, **so that** I can authorize dd0c/alert to post in my incident channels.
* **Acceptance Criteria:**
  - Implements the standard Slack OAuth v2 flow.
  - Allows the user to select the default channel for incident summaries.
* **Estimate:** 5 points

**Story 9.4: Free Tier Limitations**

* **As a** Product Owner, **I want** a free tier that limits the number of processed alerts and the retention period, **so that** users can try the product without me incurring massive AWS costs.
* **Acceptance Criteria:**
  - Free-tier limits enforced at the ingestion API (e.g., max 10,000 alerts/month).
  - UI displays a usage quota bar.
  - Data for free-tier tenants is automatically purged from TimescaleDB after 7 days.
* **Estimate:** 5 points
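The quota check belongs in the ingestion hot path and should also feed the dashboard's usage bar. A sketch with the 10,000-alerts/month example limit from the acceptance criteria; the tenant field names and the treatment of paid plans as unmetered are assumptions:

```python
FREE_TIER_MONTHLY_ALERTS = 10_000  # example limit from the acceptance criteria

def check_quota(tenant: dict, incoming: int = 1) -> dict:
    """Decide at the ingestion API whether to accept an alert batch.

    `tenant` carries `plan` and `alerts_this_month` (assumed names).
    Returns the accept/reject decision plus the usage fraction the
    dashboard quota bar can render.
    """
    if tenant.get("plan") != "free":
        return {"accepted": True, "usage": None}
    used = tenant.get("alerts_this_month", 0)
    accepted = used + incoming <= FREE_TIER_MONTHLY_ALERTS
    return {
        "accepted": accepted,
        "usage": min(used / FREE_TIER_MONTHLY_ALERTS, 1.0),
    }
```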

### Dependencies

- Depends on Epic 6 (Dashboard API) and Epic 7 (Dashboard UI).
- Story 9.2 depends on Epic 1 (Webhook Ingestion) being live.

### Technical Notes

- Use Auth0, Clerk, or AWS Cognito to minimize authentication development time for the solo founder.
- The real-time "Waiting for payload" state can be implemented via a lightweight polling endpoint if WebSockets add too much complexity.

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/alert adheres to the five Transparent Factory tenets. For an alert intelligence platform, Semantic Observability is paramount — a tool that reasons about alerts must make its own reasoning fully transparent.

### Story 10.1: Atomic Flagging — Feature Flags for Correlation & Scoring Rules

**As a** solo founder, **I want** every new correlation rule, noise-scoring algorithm, and suppression behavior behind a feature flag (default: off), **so that** a bad scoring change doesn't silence critical alerts in production.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the alert processing pipeline. V1: env-var or JSON-file provider.
- All flags evaluate locally — no network calls in the alert-ingestion hot path.
- Every flag has an `owner` and a `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule suppresses >2x the baseline alert volume over 30 minutes, the flag auto-disables and all suppressed alerts are re-emitted.
- Flags required for: new correlation patterns, CI/CD deployment correlation, noise-scoring thresholds, notification channel routing.
**Estimate:** 5 points

**Dependencies:** Epic 2 (Correlation Engine)

**Technical Notes:**

- The circuit breaker is critical here — a bad suppression rule is worse than no suppression. Track suppression counts per flag in Redis with a 30-minute sliding window.
- Re-emission: suppressed alerts buffered in a dead-letter queue for 1 hour. On circuit break, replay the queue.
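The circuit-breaker logic is straightforward once stated as a sliding-window count. An in-memory sketch (production would keep the window in Redis, per the note above; the class name is illustrative):

```python
import time
from collections import deque

class SuppressionCircuitBreaker:
    """Auto-disable a flagged suppression rule firing far above baseline.

    Trips when suppressions in the last `window_s` seconds exceed
    `ratio` x the expected baseline; after that the flag reports disabled
    and callers should replay the buffered (suppressed) alerts.
    """
    def __init__(self, baseline_per_window, ratio=2.0, window_s=30 * 60):
        self.baseline = baseline_per_window
        self.ratio = ratio
        self.window_s = window_s
        self.events = deque()
        self.tripped = False

    def record_suppression(self, now=None):
        now = time.time() if now is None else now
        self.events.append(now)
        # Evict events that have fallen out of the sliding window.
        while self.events and self.events[0] < now - self.window_s:
            self.events.popleft()
        if len(self.events) > self.ratio * self.baseline:
            self.tripped = True  # flag auto-disables; replay the DLQ buffer

    def flag_enabled(self) -> bool:
        return not self.tripped
```

Note the breaker is one-way by design: once tripped, re-enabling requires a human, which is the safe failure mode for a suppression rule.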

### Story 10.2: Elastic Schema — Additive-Only for Alert Event Store

**As a** solo founder, **I want** all alert event schema changes to be strictly additive, **so that** historical alert correlation data remains queryable after any deployment.

**Acceptance Criteria:**

- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes. Old fields remain readable.
- All event parsers are configured to ignore unknown fields (Pydantic `model_config = {"extra": "ignore"}` or equivalent).
- Dual-writes during migration windows happen within the same DB transaction.
- Every migration includes a `sunset_date` comment (max 30 days out). CI warns on overdue cleanups.

**Estimate:** 3 points

**Dependencies:** Epic 3 (Event Store)

**Technical Notes:**

- Alert events are append-only by nature — leverage this. Never mutate historical events.
- For correlation metadata (enrichments added post-ingestion), store separate linked records rather than mutating the original event.
- TimescaleDB compression policies must handle both V1 and V2 column layouts.
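The CI gate for destructive migrations can start as simple pattern-matching over migration files. A sketch (a production gate would parse the SQL rather than regex it, so this errs toward false positives):

```python
import re

# Destructive patterns the CI gate rejects (simplified).
FORBIDDEN = [
    r"\bDROP\s+(TABLE|COLUMN)\b",
    r"\bALTER\b[\s\S]*?\bTYPE\b",
    r"\bRENAME\b",
]

def migration_violations(sql: str) -> list:
    """Return the forbidden patterns found in a migration file (empty = pass)."""
    return [p for p in FORBIDDEN if re.search(p, sql, re.IGNORECASE)]
```

Additive changes such as `ADD COLUMN noise_score_v2` sail through; anything that drops, retypes, or renames an existing column fails the build.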

### Story 10.3: Cognitive Durability — Decision Logs for Scoring Logic

**As a** future maintainer, **I want** every change to noise-scoring weights, correlation rules, or suppression thresholds accompanied by a `decision_log.json`, **so that** I can understand why alert X was classified as noise vs. signal.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/correlation/`, or `src/suppression/`.
- A cyclomatic-complexity cap of 10 is enforced in CI. Scoring functions must be decomposable and testable.
- Decision logs live in `docs/decisions/`, one per significant logic change.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- Scoring weight changes are especially important to document — "why is deployment correlation weighted 0.7 and not 0.5?"
- Include sample alert scenarios in decision logs showing before/after scoring behavior.
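The CI check for decision logs can validate against the schema in the acceptance criteria. A sketch; the `[0, 1]` bound on `confidence` is an assumed convention, not stated above:

```python
import json

REQUIRED_KEYS = {"prompt", "reasoning", "alternatives_considered",
                 "confidence", "timestamp", "author"}

def validate_decision_log(raw: str) -> list:
    """Return the schema problems in a decision_log.json payload (empty = valid)."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    problems = []
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    conf = doc.get("confidence")
    if isinstance(conf, (int, float)) and not 0 <= conf <= 1:
        problems.append("confidence must be in [0, 1]")  # assumed convention
    return problems
```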

### Story 10.4: Semantic Observability — AI Reasoning Spans on Alert Classification

**As an** on-call engineer investigating a missed critical alert, **I want** every alert scoring and correlation decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why an alert was scored as noise when it was actually a P1 incident.

**Acceptance Criteria:**

- Every alert ingestion creates a parent `alert_evaluation` span, with child spans for `noise_scoring`, `correlation_matching`, and `suppression_decision`.
- Span attributes: `alert.source`, `alert.noise_score`, `alert.correlation_matches` (JSON array), `alert.suppressed` (bool), `alert.suppression_reason`.
- If AI-assisted classification is used: `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score`, `ai.reasoning_chain` (summarized).
- CI/CD correlation spans include: `alert.deployment_correlation_score`, `alert.deployment_id`, `alert.time_since_deploy_seconds`.
- No PII in spans. Alert payloads are hashed for correlation, not logged raw.

**Estimate:** 3 points

**Dependencies:** Epic 2 (Correlation Engine)

**Technical Notes:**

- This is the most important tenet for dd0c/alert. If the tool suppresses an alert, the reasoning MUST be traceable.
- Use `opentelemetry-python` with the OTLP exporter. Batch span export to avoid per-alert overhead.
- For V1 without AI: `alert.suppression_reason` is the rule name + threshold. When AI scoring is added, the full reasoning chain is captured.
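Building the attribute set is independent of the tracing SDK, which keeps it unit-testable. A pure-dict sketch of the attributes named above (with `opentelemetry-python` these would be passed to the span via `set_attributes`; the `alert.payload_hash` key and input field names are assumptions):

```python
import hashlib
import json

def suppression_span_attributes(alert: dict, decision: dict) -> dict:
    """Build the attribute dict for a `suppression_decision` span.

    The raw payload is hashed, never logged, keeping PII out of traces.
    """
    attrs = {
        "alert.source": alert["source"],
        "alert.noise_score": decision["noise_score"],
        "alert.correlation_matches": json.dumps(decision.get("matches", [])),
        "alert.suppressed": decision["suppressed"],
        "alert.suppression_reason": decision["reason"],
        "alert.payload_hash": hashlib.sha256(
            json.dumps(alert["payload"], sort_keys=True).encode()
        ).hexdigest(),
    }
    if "deployment_id" in decision:  # CI/CD correlation attributes
        attrs["alert.deployment_id"] = decision["deployment_id"]
        attrs["alert.time_since_deploy_seconds"] = decision["time_since_deploy_seconds"]
    return attrs
```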

### Story 10.5: Configurable Autonomy — Governance for Alert Suppression

**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/alert can auto-suppress alerts or only annotate them, **so that** customers never lose visibility into their alerts without explicit opt-in.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (annotate-only, never suppress) or `audit` (auto-suppress with full logging).
- Default for all new customers: `strict`. Suppression requires explicit opt-in.
- `panic_mode`: when true, all suppression stops immediately, every alert passes through unmodified, and a "panic active" banner appears in the dashboard.
- Per-customer governance override: customers can only be MORE restrictive than the system default.
- All policy decisions are logged with full context: "Alert X suppressed by audit mode, rule Y, score Z" or "Alert X annotation-only, strict mode active".

**Estimate:** 3 points

**Dependencies:** Epic 4 (Notification Router)

**Technical Notes:**

- `strict` mode is the safe default — dd0c/alert adds value even without suppression by annotating alerts with correlation data and noise scores.
- Panic mode: a single Redis key, `dd0c:panic`. All suppression checks short-circuit on this key. Triggerable via `POST /admin/panic` or an env var.
- Customer override: stored in org settings. Merge: `max_restrictive(system, customer)`.
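The whole governance story collapses into two small pure functions: merging modes by restrictiveness, and a single suppression gate that panic mode short-circuits. A sketch (the 80 noise-score threshold is an assumed default, not specified here):

```python
# Restrictiveness ordering: strict (annotate-only) beats audit (may suppress).
RESTRICTIVENESS = {"audit": 0, "strict": 1}

def effective_mode(system_mode: str, customer_mode: str) -> str:
    """Merge per the technical notes: max_restrictive(system, customer)."""
    return max(system_mode, customer_mode, key=RESTRICTIVENESS.__getitem__)

def may_suppress(mode: str, panic_active: bool, noise_score: int,
                 threshold: int = 80) -> bool:
    """Single gate the notification path consults before muting anything.

    Panic mode (the Redis `dd0c:panic` key in production) short-circuits
    everything; `strict` mode never suppresses.
    """
    if panic_active or mode == "strict":
        return False
    return noise_score >= threshold
```

Because every suppression decision funnels through one gate, the "all policy decisions logged" criterion has exactly one call site to instrument.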

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |

---

**products/03-alert-intelligence/innovation-strategy/session.md** (new file, 1,389 lines; diff suppressed because it is too large)

**products/03-alert-intelligence/party-mode/session.md** (new file, 115 lines):

# 🎉 dd0c/alert — Party Mode Advisory Board

**Product:** Alert Intelligence Layer (dd0c/alert)
**Date:** 2026-02-28
**Format:** BMad Creative Intelligence Suite - "Party Mode"

---

## Round 1: INDIVIDUAL REVIEWS

**1. The VC (Pattern-Matching Machine)**

* **Excites me:** The wedge. Entering via a webhook and bypassing enterprise procurement is a beautiful PLG motion. The $19/seat price point makes it an individual-contributor expense swipe. The total addressable market for AIOps is massive.
* **Worries me:** The moat. Sentence-transformers and time-window clustering are commodities now. What's stopping incident.io from adding this to their $16/seat tier tomorrow? What's stopping PagerDuty from dropping their AIOps add-on price?
* **Vote:** CONDITIONAL GO. (Prove you can get 500 teams before the incumbents wake up.)

**2. The CTO (20 Years in Infrastructure)**

* **Excites me:** Cross-tool correlation. I have teams on Datadog, teams on Prometheus, and everyone routes to PagerDuty. A centralized intelligence layer that sees the whole topology is a holy grail for reducing MTTR.
* **Worries me:** The "black box" of AI. The moment this thing auto-suppresses a critical database failover alert because it "looked like a routine spike," I'm firing the vendor. "Explainability" is easy to put on a slide and incredibly hard to engineer reliably.
* **Vote:** CONDITIONAL GO. (Needs a strict, default-deny "Trust Ramp" and zero auto-suppression in V1.)

**3. The Bootstrap Founder (Solo SaaS Veteran)**

* **Excites me:** The unbundling play. You don't need to build a whole incident management platform. You're just building a webhook processor with a Slack bot. That is 100% shippable by a solo dev in 30 days. At $19/seat, 500 seats (roughly 25 mid-sized teams) gets you to $10K MRR.
* **Worries me:** Support burden. When a webhook drops at 4am, you're the one getting paged. Can a solo founder maintain 99.99% uptime on an alert ingestion pipeline while also doing marketing and sales?
* **Vote:** GO. (The math works. Keep the scope ruthlessly small.)

**4. The On-Call Engineer (Drowning in 3am Pages)**

* **Excites me:** Finally, someone acknowledges the human cost! The "Noise Report Card" and the idea of translating my 3am suffering into a dollar metric for my VP are brilliant. Also, the deploy correlation — if you can just tell me "this broke right after PR #452," you've saved me 15 minutes of digging.
* **Worries me:** Trusting it. I've used PagerDuty's "Intelligent Alert Grouping," and it routinely groups unrelated things or misses obvious correlations. If I have to double-check the AI's work, it's just adding cognitive load, not removing it.
* **Vote:** CONDITIONAL GO. (Only if it's strictly "suggest-only" until I explicitly train it to auto-suppress.)

**5. The Contrarian (The Blind-Spot Finder)**

* **Excites me:** The fact that everyone is so focused on the AI. That means the real value is actually the low-tech stuff: webhook unification, a Slack-native UI, and basic time-window grouping.
* **Worries me:** You're all treating "alert fatigue" like a software problem. It's an organizational problem. Companies have noisy alerts because their engineering culture is broken and they don't prioritize technical debt. Putting an AI band-aid over a broken culture just gives them permission to keep writing terrible code with bad thresholds.
* **Vote:** NO-GO. (You're a painkiller for a disease that requires surgery. They'll eventually churn when they realize they still don't know how their systems work.)

---

## Round 2: CROSS-EXAMINATION

**The VC:** So, Priya... I mean, On-Call Engineer. I see you're suffering. That's great, pain sells. But will your boss actually pay $19/seat for this, or will she just tell you to keep muting channels? Honestly, $19 feels too cheap for a critical B2B tool, but too high if you can't even get budget approval.

**The On-Call Engineer:** $19 a month is two overpriced coffees in San Francisco. My VP of Engineering spends $180K a year on Datadog alone, and we still have a 34-minute MTTR. If I can take the "Noise Report Card," drop it on her desk, and say, "this tool will give us back 40 engineering hours a week," she'll swipe the card. But she won't pay $50K for BigPanda. We're a 140-person engineering org.

**The VC:** That's fair. But why wouldn't Datadog just bundle this? They already ingest your metrics.

**The On-Call Engineer:** Because we don't just use Datadog! We use Grafana, OpsGenie, CloudWatch... Datadog can't see the alerts Grafana is throwing. We need something that sits *across* all of them.

**The Bootstrap Founder:** Let me jump in on that VC pessimism. You're worried about moats and BigPanda. I look at this and see a textbook unbundling play. I don't need to build a $50M/year business. If I hit 500 teams at 20 seats, that's $190K MRR. One guy. Almost 100% margins.

**The VC:** $190K MRR with 10,000 active webhooks firing constantly? As a solo founder? One AWS outage and your whole "alert intelligence layer" goes down. If you're the single point of failure for an SRE team's 3am pages, you are going to get sued into oblivion when you miss a P1.

**The Bootstrap Founder:** That's why the architecture is an *overlay*. We don't replace their PagerDuty webhooks. We sit parallel or upstream. If our ingestion goes down, the fallback is their raw, noisy alert stream. They're no worse off than they were yesterday!

**The CTO:** Hold on. Let's talk about the actual tech. Contrarian, you called this a "band-aid." But let's be real: I've spent 20 years fighting alert hygiene. Every company's culture is "broken" by your definition. Microservices mean no single team understands the whole topology anymore. AI correlation isn't a band-aid; it's the only way to synthesize 500 microservices throwing errors at once.

**The Contrarian:** Synthesis is fine. *Suppression* is the problem. You're putting a black-box LLM in charge of deciding if an alert is real. "Oh, the embedding similarity score is 0.95, it must be the same issue." No, CTO! What if the payment gateway fails *at the exact same time* a frontend deploy goes out? Your "smart" AI correlates them, suppresses the payment alert as a "deploy symptom," and you lose $400K in an hour.

**The CTO:** Which is why the "Trust Ramp" is the only way I'd buy this. V1 cannot auto-suppress. Period. It has to say, "Hey, I grouped these 14 alerts into 1 incident. Thumbs up or thumbs down?" It needs to earn my trust before it ever gets permission to mute a single payload.

**The Contrarian:** But if it doesn't auto-suppress, it hasn't solved the 3am problem! Priya is still getting woken up to press "thumbs down" on a bad grouping! You've just replaced "alert fatigue" with "AI grading fatigue."

**The On-Call Engineer:** Honestly? I'd take grading fatigue over raw alerts. If my phone wakes me up, and instead of 14 separate pages I see *one* grouped incident with a "suspected cause: Deploy #4521" tag... I can go back to sleep in 30 seconds instead of spending 15 minutes correlating it manually.

**The VC:** But where is the retention? Once they spend 6 months using your tool to figure out which alerts are noise, won't they just go fix the underlying alerts and then cancel their $19/seat subscription? You're training them to not need you!

**The Bootstrap Founder:** Have you ever met a software engineer? They will *never* fix the underlying alerts. They'll just keep writing new microservices with new noisy default thresholds. The alert hygiene problem is a treadmill. We're selling them a permanent personal trainer.

---

## Round 3: STRESS TEST

### Threat 1: PagerDuty Ships Native AI Correlation (They're already working on it)

* **The VC Attacks:** PagerDuty is $430M ARR, heavily funded, and literally building this right now into their AIOps tier. If they bundle cross-tool correlation into their enterprise plans or drop the price for the mid-market, your $19/seat standalone tool is dead in the water. Why would anyone pay for a separate ingestion layer?
* **Severity:** 8/10
* **Mitigation (The CTO & Founder):** PagerDuty's strength is its cage. They only deeply correlate what runs *through* PagerDuty. dd0c/alert sits upstream of OpsGenie, Datadog, Grafana Cloud, and Slack natively. Second, our $19/seat price makes us a rounding error. PagerDuty's AIOps is an expensive, clunky add-on. We build for the mid-market that can't justify doubling their PagerDuty bill.
* **Pivot Option:** Double down on *cross-tool visualization* and deployment correlation inside Slack. If they improve grouping, we pivot harder into becoming the "incident context brain" connecting GitHub/CI to PagerDuty.

### Threat 2: AI Suppresses a Critical Alert (The "Outage Liability" Scenario)

* **The On-Call Engineer Attacks:** Month 3. The system gets cocky. A database connection-pool exhaustion error fires during a routine frontend deploy. The AI thinks, "Ah, deploy noise," and suppresses it. We are down for 4 hours. My VP rips out dd0c/alert the next morning and writes a furious blog post. The company's reputation dies instantly.
* **Severity:** 10/10 (Existential)
* **Mitigation (The Contrarian & CTO):** V1 has ZERO auto-suppression out of the box. The "Trust Ramp" is non-negotiable. We only auto-suppress when specific, user-confirmed correlation templates reach 99% accuracy. Even then, we keep a hard-coded "never suppress" safelist (e.g., specific tags like `sev1`, `database`, `billing`). Finally, we provide an audit trail so transparent that even when the system *does* make a mistake, the team sees exactly why and can fix the rule in 5 seconds.
* **Pivot Option:** Drop auto-suppression entirely if the market rejects it. Pivot to pure "Alert Grouping & Context Synthesis" inside Slack. Just grouping 47 pages into 1 reduces the 3am panic significantly, without the liability of muting anything.

### Threat 3: Enterprises Won't Send Alert Data to a 3rd Party

* **The CTO Attacks:** My CISO will never approve sending raw Datadog metrics and error payloads to a solo founder's SaaS app. That data contains user IDs, stack traces, and API keys leaked by juniors. SOC 2 takes a year and $50K.
* **Severity:** 7/10
* **Mitigation (The Bootstrap Founder):** We aren't selling to enterprises with 12-page vendor procurement checklists. We are targeting 40-engineer Series B startups where Marcus the SRE can just plug in a webhook on Friday afternoon. For the security-conscious mid-market, we offer a "Payload Stripping" mode: the webhook agent runs locally, or they configure Datadog to send us *only* metadata (source, timestamp, severity, alert name), stripping the raw logs.
|
||||||
|
* **Pivot Option:** Open-source the correlation engine (the ingestion worker). Customers run `dd0c-worker` in their own VPC, which computes the ML embeddings locally and only sends anonymous hashes and timing data to our SaaS dashboard.
|
||||||
|
|
||||||
|

---

## Round 4: FINAL VERDICT

**The Panel Convenes:**

The room is thick with tension. The On-Call Engineer is scrolling blindly through PagerDuty notifications out of muscle memory. The CTO is drawing network diagrams on the whiteboard. The Bootstrap Founder is checking Stripe. The VC is checking Twitter. The Contrarian is just shaking their head.

**The Decision:**

SPLIT DECISION (4-1 GO).

The Contrarian holds out: "You're selling a painkiller for an organizational culture problem. But I admit, people buy painkillers."

**Revised Priority in the dd0c Lineup:**

This is the wedge. While dd0c/run is the ultimate value, you can't auto-remediate what you can't intelligently correlate. dd0c/alert MUST launch first or simultaneously with dd0c/run. It is the "brain" that feeds the "hands."

**Top 3 Must-Get-Right Items:**

1. **The 60-Second "Wow":** The moment the webhook is pasted, the Slack bot needs to group 50 noisy alerts into 5 actionable incidents. Immediate ROI.
2. **The "Trust Ramp" (No Auto-Suppress in V1):** Engineers must explicitly opt in to suppression rules. Show what *would* have been suppressed and let them confirm it.
3. **Deployment Correlation:** Pulling CI/CD webhooks to say "This alert spike started 2 minutes after PR #1042 was merged" is the killer feature that none of the legacy AIOps tools do gracefully out of the box.

**The One Kill Condition:**

If the product cannot achieve a verifiable 50% noise reduction for 10 paying beta teams within 90 days, without a single false negative (a real alert suppressed or missed), kill the product or strip it back to a pure Slack formatting tool.

**FINAL VERDICT:**

**🟢 GO.**

Alert fatigue is an epidemic. The incumbents are too fat to sell to the mid-market at $19/seat. The PLG webhook motion is pristine. Build the wedge, earn the trust, and then sell them the runbooks. Go build it.

543
products/03-alert-intelligence/product-brief/brief.md
Normal file
@@ -0,0 +1,543 @@

# dd0c/alert — Product Brief

### AI-Powered Alert Intelligence for Engineering Teams

**Version:** 1.0 | **Date:** 2026-02-28 | **Author:** dd0c Product | **Status:** Phase 5 — Product Brief

---

## 1. EXECUTIVE SUMMARY

### Elevator Pitch

dd0c/alert is an AI-powered alert intelligence layer that sits upstream of your existing monitoring stack — PagerDuty, OpsGenie, Datadog, Grafana — correlating, deduplicating, and contextualizing alerts across all tools via a single webhook. Slack-first. $19/seat/month. Prove value in 60 seconds.

### Problem Statement

Alert fatigue is an epidemic hiding in plain sight.

The average on-call engineer at a mid-size company receives **4,000+ alerts per month**. Industry data consistently shows **70–90% are non-actionable** — duplicate symptoms, transient spikes, deploy artifacts, and orphaned monitors nobody owns. The consequences are measurable and severe:

- **MTTR inflation:** Engineers spend the first 8–15 minutes of every incident determining if it's real, manually correlating across dashboards, and checking deploy logs. Average MTTR at affected orgs: 34 minutes vs. a 15-minute industry benchmark.
- **Attrition:** On-call satisfaction scores average 2.1/5 at companies with high alert noise. Replacing a single SRE costs $150–300K (recruiting, ramp, lost institutional knowledge). Alert burden is now cited as a top-3 reason for SRE attrition.
- **Invisible cost:** A 140-engineer org with 93% alert noise wastes an estimated 40+ engineering hours per week on false-alarm triage — roughly $300K/year in loaded salary, with zero feature output to show for it.
- **Trust erosion:** Every false alarm trains engineers to ignore alerts. The system conditions its operators to fail at the one moment it matters most — a Pavlovian tragedy playing out nightly across thousands of on-call rotations.

No mid-market solution exists today. BigPanda charges $50K–$500K/year and requires 6-month deployments. PagerDuty's AIOps is locked to PagerDuty-only alerts at $41–59/seat on top of base platform costs. incident.io's alert features are shallow. The 150,000+ engineering teams between 20–500 engineers are completely underserved.

### Solution Overview

dd0c/alert is a cross-tool alert intelligence layer deployed via webhook in under 5 minutes:

1. **Ingest** — Accepts alert webhooks from any monitoring tool (Datadog, Grafana, PagerDuty, OpsGenie, CloudWatch, Prometheus Alertmanager). No agents, no SDKs, no credentials.
2. **Correlate** — Groups related alerts using time-window clustering, service-dependency mapping, and CI/CD deployment correlation. V1 is rule-based; V2 adds ML-based semantic deduplication via sentence-transformer embeddings.
3. **Contextualize** — Enriches each correlated incident with deployment context ("started 2 minutes after PR #1042 merged to payment-service"), affected service topology, historical resolution patterns, and linked runbooks.
4. **Surface** — Delivers grouped, context-rich incident cards to Slack with thumbs-up/down feedback buttons. Engineers see 5 incidents instead of 47 raw alerts.
5. **Learn** — Every ack, snooze, override, and feedback signal trains the model. The system gets smarter with every on-call shift.
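
The rule-based correlation step above is deliberately simple. A minimal sketch of V1-style time-window clustering, assuming a 5-minute window and illustrative field names (this is not the product's implementation):

```python
from datetime import datetime, timedelta

def cluster_by_time_window(alerts, window_minutes=5):
    """Group alerts whose timestamps fall within `window_minutes` of the
    previous alert in the same cluster. Rule-based, no ML required."""
    alerts = sorted(alerts, key=lambda a: a["timestamp"])
    window = timedelta(minutes=window_minutes)
    clusters = []
    for alert in alerts:
        if clusters and alert["timestamp"] - clusters[-1][-1]["timestamp"] <= window:
            clusters[-1].append(alert)  # extend the currently open incident
        else:
            clusters.append([alert])    # gap too large: start a new incident
    return clusters

raw = [
    {"source": "datadog",   "timestamp": datetime(2026, 2, 28, 3, 0)},
    {"source": "grafana",   "timestamp": datetime(2026, 2, 28, 3, 2)},
    {"source": "pagerduty", "timestamp": datetime(2026, 2, 28, 3, 4)},
    {"source": "datadog",   "timestamp": datetime(2026, 2, 28, 4, 30)},
]
incidents = cluster_by_time_window(raw)
print(len(incidents))  # 4 raw alerts collapse into 2 incidents
```

Chaining off the previous alert (rather than the cluster's first alert) lets a sustained burst stay one incident however long it lasts, which matches the "47 pages, one incident" behavior described above.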

**V1 is strictly observe-and-suggest.** No auto-suppression. The system shows what it *would* suppress and lets engineers confirm. Trust is earned through a graduated "Trust Ramp," not assumed.

### Target Customer

**Primary:** Series A–C startups and mid-market companies with 20–200 engineers, running microservices on Kubernetes, using 2+ monitoring tools, with painful on-call rotations. The champion is the SRE lead or senior platform engineer (28–38 years old, 5–10 years experience) who can add a webhook integration without VP approval.

**Secondary:** The VP of Engineering who needs a defensible metric for alert health to present to the board, justify tooling spend, and address attrition driven by on-call burden.

**Anti-ICP:** Enterprises with 500+ engineers requiring SOC2 on Day 1, companies using only one monitoring tool, companies without on-call rotations, companies already running BigPanda.

### Key Differentiators

| Differentiator | Why It Matters |
|---|---|
| **Cross-tool correlation** | The only mid-market product purpose-built to correlate alerts across Datadog + Grafana + PagerDuty + OpsGenie simultaneously. PagerDuty only sees PagerDuty. Datadog only sees Datadog. dd0c/alert sees everything. |
| **60-second time to value** | Paste a webhook URL → see grouped incidents in Slack within 60 seconds. BigPanda takes 6 months. This isn't incremental — it's a category shift. |
| **CI/CD deployment correlation** | Automatic "this alert spike started after deploy X" tagging. The single most valuable piece of context during incident triage, and no legacy AIOps tool does it gracefully for the mid-market. |
| **Transparent, explainable decisions** | Every grouping and suppression decision is logged with plain-English reasoning. No black boxes. Engineers can audit, override, and learn from every decision. |
| **Observe-and-suggest Trust Ramp** | V1 never auto-suppresses. The system earns autonomy through demonstrated accuracy, graduating from observe → suggest-and-confirm → auto-suppress only with explicit engineer opt-in. |
| **$19/seat pricing** | 1/3 to 1/100th the cost of alternatives. Below the "just expense it" threshold ($380/month for a 20-person team). Below the "build internally" threshold (one engineer-day costs more than a year of dd0c/alert for a small team). |
| **Overlay architecture** | Doesn't replace anything. Sits on top of existing tools. Zero-risk adoption: remove the webhook and your existing pipeline is untouched. |

---

## 2. MARKET OPPORTUNITY

### Market Sizing

| Segment | Size | Methodology |
|---|---|---|
| **TAM** | **$5.3B–$16.4B** | Global AIOps market (2024–2025). Alert intelligence/correlation represents ~25–30% = $1.3B–$4.9B. Growing at 17–30% CAGR depending on analyst (Fortune Business Insights, GM Insights, Mordor Intelligence). |
| **SAM** | **~$800M** | Companies with 20–500 engineers, using 2+ monitoring tools, experiencing alert fatigue, willing to adopt SaaS. ~150,000–200,000 such companies globally (Series A through mid-market). Average potential spend: $4,000–$6,000/year at dd0c/alert's price point. |
| **SOM** | **$1.7M–$9.1M ARR (Year 1–2)** | Year 1: 200–500 paying teams × 15 avg seats × $19/seat × 12 months = $684K–$1.71M ARR. Year 2 with expansion: $3M–$9.1M ARR. Bootstrappable without venture capital. |

**The math that matters:** 500 teams × 15 seats × $19/seat × 12 months = $1.71M ARR. At 2,000 teams × 20 seats = $9.12M ARR. The PLG motion and low friction make volume achievable at this price point.
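
The SOM arithmetic above is worth being able to re-derive; a two-line helper reproduces both headline figures:

```python
def arr(teams, seats_per_team, price_per_seat=19, months=12):
    """Annual recurring revenue for seat-priced SaaS: teams x seats x price x months."""
    return teams * seats_per_team * price_per_seat * months

print(arr(500, 15))    # 1,710,000 -> the $1.71M Year 1 target
print(arr(2000, 20))   # 9,120,000 -> the $9.12M expansion case
```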

### Competitive Landscape

#### Tier 1: Enterprise AIOps Incumbents

| Competitor | Revenue / Funding | Alert Intelligence | Pricing | Threat to dd0c |
|---|---|---|---|---|
| **PagerDuty AIOps** | ~$430M ARR (public) | Medium depth, PagerDuty-only ecosystem | $41–59/seat + base platform | **MEDIUM** — Massive install base but locked to single tool. Mid-market finds it too expensive. Will improve in 12–18 months. |
| **BigPanda** | $196M raised | Deep correlation engine, patent portfolio | $50K–$500K/year, "Contact Sales" | **LOW** — Cannot profitably serve dd0c's target market. 6-month deployments. Different game entirely. |
| **Moogsoft (Dell/BMC)** | Acquired | Deep ML (legacy) | Enterprise pricing | **LOW** — Post-acquisition identity crisis. Innovation stalled. Trapped inside legacy ITSM platform. |

#### Tier 2: Modern Incident Management

| Competitor | Revenue / Funding | Alert Intelligence | Pricing | Threat to dd0c |
|---|---|---|---|---|
| **incident.io** | $57M raised (Series B) | Shallow but growing. Recently added "Alerts" product | ~$16–25/seat | **HIGH** — Same buyer persona, same PLG playbook, same Slack-native approach. Most dangerous competitor. If they build deep alert intelligence, speed becomes existential. |
| **Rootly** | $20M+ raised | Shallow — basic routing rules, not ML | ~$15–20/seat | **MEDIUM** — Could add alert intelligence but DNA is incident response. |
| **FireHydrant** | $70M+ raised | Shallow — checkbox feature | ~$20–35/seat | **MEDIUM** — Broad but shallow. Trying to be everything. |

#### Tier 3: Emerging Threat

| Competitor | Threat | Timeline |
|---|---|---|
| **Datadog** ($2.1B+ ARR) | Will build alert intelligence features. Has the data, ML team, and distribution. But Datadog only works with Datadog — their moat is also their cage. | **HIGH long-term, LOW short-term.** 12–18 month window. |

#### dd0c/alert's Competitive Position

dd0c/alert occupies a blue ocean at the intersection of:

1. **Deep alert intelligence** (like BigPanda/Moogsoft) — not shallow routing rules
2. **At SMB/mid-market pricing** (like incident.io/Rootly) — not enterprise contracts
3. **With instant time-to-value** (like nobody) — 60 seconds, not 6 months
4. **Across all monitoring tools** (like nobody for the mid-market) — not locked to one ecosystem

This combination does not exist today. BigPanda has the intelligence but not the accessibility. incident.io has the accessibility but not the intelligence. dd0c/alert threads the needle between them.

### Timing Thesis: The 18-Month Window

Four structural forces are converging in 2026 that create a once-in-a-cycle entry window:

**1. Alert fatigue has hit critical mass.** The average mid-size company now runs 200–500 microservices, each generating its own alerts. "Alert fatigue" has gone from an SRE inside joke to a board-level retention concern. VPs of Engineering are now *asking* for solutions — they weren't 2 years ago.

**2. AI capabilities have matured, but incumbents haven't shipped.** Embedding models make semantic alert deduplication trivially cheap. LLMs generate useful incident summaries. Inference costs have dropped 10x in 2 years. But incumbents built their ML stacks in 2019–2021 on legacy architectures. A greenfield product built today has a massive technical advantage.

**3. Datadog pricing backlash + tool fragmentation.** Datadog's aggressive pricing has created a revolt. Teams are migrating to Grafana Cloud, self-hosted Prometheus, and alternatives. This fragmentation is *good* for dd0c/alert — the more tools a team uses, the more they need a cross-tool correlation layer.

**4. Regulatory tailwinds.** SOC2, HIPAA, PCI-DSS, and DORA (EU Digital Operational Resilience Act) all require demonstrable incident response capabilities. "How do you ensure critical alerts aren't missed?" is becoming a compliance question. dd0c/alert's transparent audit trail is a compliance feature that black-box AI can't match.

**The window closes in ~18 months.** PagerDuty will ship better native AIOps (12–18 months). incident.io will deepen alert intelligence (6–12 months). Datadog will launch cross-signal correlation (12–18 months). After that, dd0c competes on execution and data moat, not market gap — which is fine, if the moat is built by then.

### Market Trends

- **Microservices proliferation** driving exponential alert volume growth
- **SRE attrition at historic highs** — companies connecting on-call burden to turnover
- **"Build vs. buy" shifting to buy** as AI tooling costs drop below internal development thresholds
- **Platform unbundling** — teams rejecting monolithic platforms in favor of best-of-breed point solutions (Linear unbundled Jira; dd0c/alert unbundles alert intelligence from incident management platforms)
- **AI skepticism rising** — engineers increasingly skeptical of "AI-powered" claims, favoring transparent, explainable tools over black-box magic. dd0c's stoic, anti-hype brand voice is a strategic advantage here

---

## 3. PRODUCT DEFINITION

### Value Proposition

**For on-call engineers:** "You got paged 6 times last night. 5 were noise. We would have let you sleep." dd0c/alert reduces alert volume 70%+ by correlating and deduplicating across all your monitoring tools, delivering context-rich incident cards to Slack instead of raw alert spam.

**For SRE/platform leads:** "What if Marcus's pattern recognition was available to every on-call engineer, 24/7?" dd0c/alert institutionalizes the tribal correlation knowledge trapped in senior engineers' heads — cross-service dependencies, deploy-correlated noise, seasonal patterns — and makes it available to every engineer on rotation.

**For VPs of Engineering:** "Your alert noise costs $300K/year in wasted engineering time and drives your best SREs to quit. Here's the dashboard that proves it — and the tool that fixes it." dd0c/alert translates alert fatigue into business metrics (dollars wasted, hours lost, attrition risk) that justify investment at the board level.

### Personas

#### Priya Sharma — The On-Call Engineer (Primary User)

- 28, backend engineer, weekly on-call rotation at a mid-stage fintech (85 engineers)
- Gets paged 6+ times per night; 80–90% are non-actionable
- Keeps a personal Notion "ignore list" of known-noisy alerts
- Has a bash script that checks deploy logs when she gets paged — she's automated her own triage
- Spends the first 12–20 minutes of every incident figuring out if it's real
- **JTBD:** "When I get paged at 3am, I want to instantly know if this is real and what to do, so I can either fix it fast or go back to sleep."

#### Marcus Chen — The SRE/Platform Lead (Champion / Buyer)

- 34, senior SRE leading a team of 8 at a Series C SaaS company (140 engineers)
- He IS the human correlation engine — connects dots across services because no tool does it
- Maintains a manual spreadsheet tracking alert-to-incident ratios (always out of date)
- Spends 30% of his time on alert tuning instead of platform work
- Lost 2 engineers in the past year who cited on-call burden
- **JTBD:** "When I'm reviewing on-call health, I want to see exactly which alerts are noise and which are signal across all teams, so I can prioritize fixes with data instead of gut feel."

#### Diana Okafor — The VP of Engineering (Economic Buyer)

- 41, VP of Engineering, reports to CTO, accountable for MTTR and retention
- Sees MTTR of 34 minutes vs. 15-minute benchmark; on-call satisfaction at 2.1/5 for 3 consecutive quarters
- Spending $200K+/year on Datadog + PagerDuty + Grafana with no way to quantify ROI
- Needs a single, defensible metric for alert health she can present to the board
- **JTBD:** "When I'm preparing for a board meeting, I want to show a clear metric for operational health that includes alert quality, so I can demonstrate improvement or justify investment."

### Feature Roadmap

#### V1 — MVP: "Observe & Suggest" (Month 1, 30-day build)

**CRITICAL DESIGN DECISION: V1 is strictly observe-and-suggest. No auto-suppression. No auto-muting. The system shows what it *would* do and lets engineers confirm. This resolves contradictions from earlier phases where auto-suppression was discussed — the party mode board unanimously mandated this constraint, and it is non-negotiable for V1.**

| Feature | Description |
|---|---|
| **Webhook ingestion** | Accept alert payloads from Datadog, PagerDuty, OpsGenie, Grafana via webhook URL. No agents, no SDKs. |
| **Payload normalization** | Transform each source's format into a unified alert schema (source, severity, timestamp, service, message). |
| **Time-window clustering** | Group alerts firing within N minutes of each other into correlated incidents. Rule-based, no ML required. |
| **CI/CD deployment correlation** | Connect to GitHub/GitLab webhooks. Tag alert clusters with "started after deploy X" context. Party mode mandated this as a V1 must-have. |
| **Slack bot** | Post grouped incident cards to Slack. Each card shows: grouped alert count, source tools, suspected trigger, severity. Thumbs-up/down feedback buttons. |
| **Daily digest** | Summary of alerts received vs. incidents created, noise ratio, top noisy alerts. |
| **Suppression log** | Every grouping decision logged with plain-English reasoning. Searchable. Auditable. |
| **"What would have happened" view** | Show what dd0c/alert *would* have suppressed — without actually suppressing anything. The core trust-building mechanism. |

**What V1 does NOT include:** ML-based semantic dedup, auto-suppression, SSO/SCIM, custom dashboards, mobile app, API, SOC2 certification.
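
The deployment-correlation row in the table above amounts to a recency lookup: given when an alert cluster started, find the most recent deploy inside a lookback window. A hedged sketch; the 15-minute window and the field names are assumptions, not the product's schema:

```python
from datetime import datetime, timedelta

def correlate_deploy(cluster_start, deploys, lookback_minutes=15):
    """Return the most recent deploy in the lookback window before the
    alert cluster started, or None if no deploy qualifies."""
    window_start = cluster_start - timedelta(minutes=lookback_minutes)
    candidates = [d for d in deploys
                  if window_start <= d["deployed_at"] <= cluster_start]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["deployed_at"])

deploys = [
    {"pr": 1041, "service": "checkout",        "deployed_at": datetime(2026, 2, 28, 2, 10)},
    {"pr": 1042, "service": "payment-service", "deployed_at": datetime(2026, 2, 28, 2, 58)},
]
suspect = correlate_deploy(datetime(2026, 2, 28, 3, 0), deploys)
print(f"started after PR #{suspect['pr']} to {suspect['service']}")
```

Note the old checkout deploy is outside the window and is never blamed; "suspected cause" tags stay conservative by construction.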

#### V2 — Intelligence Layer (Months 2–4)

| Feature | Description |
|---|---|
| **Semantic deduplication** | Sentence-transformer embeddings to group alerts with similar meaning but different wording. |
| **Alert Simulation Mode** | Upload historical PagerDuty/OpsGenie exports → see what dd0c/alert would have done last month. The killer demo: proves value with zero risk, zero commitment. |
| **Noise Report Card** | Weekly per-team report: noise ratios, noisiest alerts, suggested tuning, estimated cost of noise. Gamifies alert hygiene. Creates organizational accountability. |
| **Trust Ramp — Stage 2** | "Suggest-and-confirm" mode. System proposes suppressions; engineer approves/rejects with one click. Auto-suppression unlocked only for specific, user-confirmed patterns reaching 99% accuracy. |
| **"Never suppress" safelist** | Hard-coded defaults (sev1, database, billing, security) that are never suppressed regardless of model confidence. User-configurable. |
| **Business impact dashboard** | Translate noise into dollars: hours wasted, estimated attrition cost, MTTR impact. Diana's board-meeting ammunition. |
| **Additional integrations** | CloudWatch, Prometheus Alertmanager, custom webhook format support. |
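
Semantic deduplication reduces to: embed each alert message, then group messages whose embedding similarity clears a threshold. The sketch below substitutes a toy bag-of-words "embedding" for a real sentence-transformer so it runs standalone; the greedy grouping, the 0.6 threshold, and the embedding choice are all illustrative assumptions:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector. A real system would call a sentence-transformer."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def dedup(messages, threshold=0.6):
    """Greedy grouping: each message joins the first group it is similar enough to."""
    groups = []
    for msg in messages:
        vec = embed(msg)
        for group in groups:
            if cosine(vec, group["centroid"]) >= threshold:
                group["members"].append(msg)
                break
        else:
            groups.append({"centroid": vec, "members": [msg]})
    return groups

alerts = [
    "payment-service error rate high",
    "error rate high on payment-service",
    "disk usage 91% on db-7",
]
groups = dedup(alerts)
print(len(groups))  # the two reworded error-rate alerts collapse into one group
```

The point of the real embedding model is exactly what the toy version shows: "error rate high on payment-service" and "payment-service error rate high" land in one group despite different wording.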

#### V3 — Platform & Automation (Months 5–9)

| Feature | Description |
|---|---|
| **dd0c/run integration** | Alert fires → correlated incident → suggested runbook → one-click execute. The flywheel that makes alert + run 10x more valuable together. |
| **Cross-team correlation** | When multiple teams send alerts, correlate incidents across service boundaries. "Every time Team A's DB alerts fire, Team B's API errors follow 2 minutes later." |
| **Predictive severity scoring** | Historical resolution data predicts incident severity. "This pattern was resolved by 'restart-payment-service' 14 times in 3 months." |
| **Trust Ramp — Stage 3** | Full auto-suppression for patterns with proven track records. Circuit breaker: if accuracy drops below 95%, auto-fallback to pass-through mode. |
| **SSO (SAML/OIDC)** | Required for Business tier and company-wide rollouts. |
| **API access** | Programmatic access to alert data, noise metrics, and suppression rules. |
| **SOC2 Type II** | Certification process started at ~Month 6, completed by Month 9. |
| **Community patterns (future)** | Anonymized cross-customer pattern sharing. "87% of teams running K8s + Istio suppress this pattern." Requires 500+ customers. Architect the data pipeline to support this from Day 1. |
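
The Stage 3 circuit breaker in the table is a rolling accuracy check over recent suppression outcomes. One hypothetical shape for it, a sketch only: the 95% threshold comes from the table, while the 100-outcome window is an assumption:

```python
from collections import deque

class SuppressionCircuitBreaker:
    """Auto-suppression stays enabled only while rolling accuracy meets the
    threshold; one bad stretch trips the breaker back to pass-through mode."""

    def __init__(self, threshold=0.95, window=100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = suppression was correct
        self.suppression_enabled = True

    def record(self, was_correct):
        self.outcomes.append(was_correct)
        accuracy = sum(self.outcomes) / len(self.outcomes)
        if accuracy < self.threshold:
            self.suppression_enabled = False  # trip: fall back to pass-through
        return accuracy

breaker = SuppressionCircuitBreaker(threshold=0.95, window=20)
for _ in range(19):
    breaker.record(True)
breaker.record(False)   # 19/20 = 0.95, exactly at threshold: still enabled
breaker.record(False)   # window slides to 18/20 = 0.90: breaker trips
print(breaker.suppression_enabled)
```

A deliberate design property of this shape: the breaker only re-enables via explicit human action (there is no automatic reset), which matches the Trust Ramp's "autonomy is earned, not assumed" stance.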

### User Journey

```
DISCOVER                    ACTIVATE                   ENGAGE                     EXPAND
──────────────────────────────────────────────────────────────────────────────────────────────────

"Alert fatigue sucks"       "Paste webhook URL,        "See noise reduction       "Roll out to all teams,
                             connect Slack"             in 60 seconds"             upgrade to Business"

Blog post / HN launch /     Free tier signup →         Daily digest shows         Cross-team correlation
Alert Fatigue Calculator /  copy webhook URL →         47 alerts → 8 incidents.   value prop triggers
Twitter / conf talk         paste into Datadog/PD →    Noise Report Card in       expansion. VP sees
                            first alerts flow →        weekly SRE review.         business impact
                            Slack bot groups them      Thumbs-up/down trains      dashboard → mandates
                            in <60 seconds.            the model. Trust grows.    company-wide rollout.
                            "WOW: 47 → 8."                                        dd0c/run cross-sell.
```

**The critical activation metric: Time to First "Wow"**

Target: **60 seconds** from signup to seeing grouped incidents in Slack. This is the party mode board's #1 mandate. The entire PLG motion lives or dies on this number.

The Alert Simulation shortcut for prospects not ready to connect live alerts: upload last 30 days of PagerDuty/OpsGenie export → see "Last month, you received 4,200 alerts. We would have shown you 340 incidents." Proves value with zero risk.

### Pricing

| Tier | Price | Includes | Target |
|---|---|---|---|
| **Free** | $0 | Up to 5 seats, 1,000 alerts/month, 2 integrations, 7-day retention | Solo devs, tiny teams, tire-kickers. Removes cost objection. |
| **Pro** | $19/seat/month | Unlimited alerts, 4 integrations, 90-day retention, Slack bot, daily digest, deployment correlation, Noise Report Card | Teams of 5–50. The beachhead. Credit-card swipe, no procurement. |
| **Business** | $39/seat/month | Everything in Pro + unlimited integrations, 1-year retention, API access, custom suppression rules, priority support, SSO | Teams of 50–200. Expansion tier when VP mandates company-wide rollout. |
| **Enterprise** | Custom | Everything in Business + dedicated instance, SLA, SOC2 report, custom integrations | 200+ seats. Don't build until Year 2. |

**Pricing rationale:**

- $19/seat for a 20-person team = $380/month. Below the "just expense it" threshold (most eng managers can expense <$500/month without VP approval).
- ROI is trivial: one prevented false-alarm page at 3am saves ~$25–33 in engineer productivity. dd0c/alert needs to prevent ONE false page per engineer per month to pay for itself. At 70% noise reduction, ROI is 10–50x.
- Below the "build internally" threshold: a couple of engineer-days (~$600 each, loaded) spent building and maintaining a custom dedup script cost more than a year of dd0c/alert for a small team.
- Average blended price across customers: ~$25/seat (mix of Pro and Business tiers).
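
The ROI bullet above can be checked with arithmetic; the $29 midpoint of the per-page savings and the example team size are the brief's own figures, not new data:

```python
def monthly_roi(seats, pages_prevented_per_seat, cost_per_page=29, price_per_seat=19):
    """Ratio of value recovered (prevented false pages) to subscription spend."""
    saved = seats * pages_prevented_per_seat * cost_per_page
    spend = seats * price_per_seat
    return saved / spend

# Break-even claim: one prevented page per engineer per month already covers the seat.
print(monthly_roi(20, 1))   # ~1.5x on a single prevented page
print(monthly_roi(20, 30))  # ~46x, consistent with the 10-50x claim
```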

---

## 4. GO-TO-MARKET PLAN

### Launch Strategy

dd0c/alert is Phase 2 of the dd0c platform ("The On-Call Savior," months 4–6 per brand strategy). It launches after dd0c/route and dd0c/cost have established the dd0c brand and are generating ≥$5K MRR — proving the platform resonates before adding a third product.

The GTM motion is **pure PLG via webhook integration.** No sales team. No "Contact Sales." No 6-month POCs. The webhook URL is the distribution channel — the lowest-friction integration mechanism in all of DevOps (copy URL, paste into monitoring tool, done).

### Beachhead: The First 10 Customers

**Ideal First Customer Profile:**

- Series A–C startup, 30–150 engineers
- Running microservices on Kubernetes (AWS EKS or GCP GKE)
- Using at least 2 of: Datadog, Grafana, PagerDuty, OpsGenie
- Dedicated SRE/platform team of 2–8 people
- On-call rotation exists and is painful (verify via public postmortem blogs — companies that publish postmortems have mature-enough incident culture to care about alert quality)

**Champion profile:** The SRE lead or senior platform engineer (28–38, 5–10 years experience), active on Twitter/X or SRE Slack communities, has complained publicly about alert fatigue, and has authority to add a webhook without VP approval.

**Where to find them:**

| Channel | Tactic | Expected Customers |
|---|---|---|
| **SRE Twitter/X** | Search for engineers tweeting about alert fatigue, PagerDuty frustration, on-call burnout. Engage authentically. DM 50 warm leads at launch: "I built something for this. Free for 30 days." 10–15% conversion on warm DMs. | 3–4 |
| **Hacker News** | "Show HN: I was tired of getting paged for garbage at 3am, so I built dd0c/alert." Be technical, be honest, show the architecture. HN loves solo founder stories from senior engineers solving their own pain. 200–500 signups, 2–5% convert. | 2–3 |
| **SRE Slack communities** | Rands Leadership Slack, DevOps Chat, SRE community Slack, Kubernetes Slack. Participate in alert fatigue conversations. Offer free beta access. | 2–3 |
| **Conference lightning talks** | SREcon, KubeCon, DevOpsDays. "How We Reduced Alert Volume 80% With a Webhook and Some Embeddings." Live demo converts attendees that night. | 1–2 |
| **Personal network** | Brian's AWS architect network. First 1–2 customers should be people he knows personally — they'll give honest feedback and forgive V1 bugs. | 1–2 |

**Target: 10 paying customers within 4 weeks of launch.**

### The "Prove Value in 60 Seconds" Onboarding Requirement

The party mode board mandated this as the #1 must-get-right item. The entire PLG funnel depends on it:

1. User signs up (email + company name, nothing else)
2. User gets a webhook URL
3. User pastes webhook URL into Datadog/PagerDuty/Grafana notification settings
4. First alerts start flowing in
5. Within 60 seconds, dd0c/alert shows in Slack: "You've received 47 alerts in the last hour. We identified 8 unique incidents. Here's how we'd group them."
6. **That's the "wow."** 47 → 8. Visible, immediate, undeniable.
|
||||||
|
|
||||||
|
**Alert Simulation shortcut** for prospects who want proof before connecting live alerts: "Upload your last 30 days of alert history (CSV export from PagerDuty/OpsGenie). We'll show you what last month would have looked like." This is the killer demo — proves value with zero risk, zero commitment, zero live integration. No competitor offers this.
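The "47 → 8" moment is, at its core, clustering. A minimal sketch of one naive approach (time-window clustering keyed by service); the type shapes and the 60-second window are illustrative assumptions, not the shipped algorithm:

```typescript
// Illustrative time-window clustering: alerts for the same service that
// arrive within `windowMs` of the previous one collapse into one incident.
// Field names (service, receivedAt) are assumptions for this sketch.
interface Alert {
  service: string;
  receivedAt: number; // epoch millis
  title: string;
}

interface Incident {
  service: string;
  alerts: Alert[];
}

export function clusterByWindow(alerts: Alert[], windowMs = 60_000): Incident[] {
  const incidents: Incident[] = [];
  const sorted = [...alerts].sort((a, b) => a.receivedAt - b.receivedAt);
  for (const alert of sorted) {
    // Reuse the most recent open incident for this service if the gap is small.
    const open = [...incidents]
      .reverse()
      .find(
        (i) =>
          i.service === alert.service &&
          alert.receivedAt - i.alerts[i.alerts.length - 1].receivedAt <= windowMs,
      );
    if (open) open.alerts.push(alert);
    else incidents.push({ service: alert.service, alerts: [alert] });
  }
  return incidents;
}
```

The real engine adds deployment correlation and semantic dedup on top, but even this naive pass is enough to collapse a page storm into a handful of incidents.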
### Growth Loops

**Loop 1: Noise Report Card → Internal Virality**

Weekly per-team noise report → Marcus shares with Diana → Diana mandates company-wide rollout → more teams adopt → cross-team correlation improves → more value → more sharing. The report card is both a retention feature and an expansion trigger.

**Loop 2: Alert Fatigue Calculator → Lead Gen → Conversion**

Free public web tool (dd0c.com/calculator). Engineers input their alert volume, noise %, team size, salary. Calculator outputs: hours wasted, dollar cost, attrition risk. CTA: "Want to see your actual noise reduction? Connect dd0c/alert free →." Genuinely useful even without dd0c/alert — gets shared in Slack channels, 1:1s, all-hands. Captures and qualifies leads (someone entering "500 alerts/week, 85% noise, 40 engineers" is a perfect customer).
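The calculator's arithmetic could be as simple as this sketch. The 5-minutes-per-noisy-alert triage cost and 2,080 annual work hours are assumptions, not published benchmarks:

```typescript
// Back-of-envelope math behind the public calculator. The 5-minute triage
// cost per noisy alert and 2,080 work-hours/year are illustrative assumptions.
interface CalculatorInput {
  alertsPerWeek: number;
  noisePercent: number; // 0–100
  teamSize: number;
  avgSalaryUsd: number; // fully loaded, per engineer, per year
}

export function alertFatigueCost(input: CalculatorInput) {
  const noisyPerWeek = input.alertsPerWeek * (input.noisePercent / 100);
  const hoursWastedPerWeek = (noisyPerWeek * 5) / 60; // 5 minutes of triage each
  const hourlyRate = input.avgSalaryUsd / 2080;
  return {
    hoursWastedPerWeek: Math.round(hoursWastedPerWeek * 10) / 10,
    dollarsPerYear: Math.round(hoursWastedPerWeek * 52 * hourlyRate),
    noisyAlertsPerEngineerPerWeek: Math.round((noisyPerWeek / input.teamSize) * 10) / 10,
  };
}
```

Feeding in the "perfect customer" example (500 alerts/week, 85% noise, 40 engineers) yields roughly 35 wasted hours a week, which is the kind of number that gets shared in an all-hands.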
**Loop 3: Cross-Team Expansion**

Land in one team → demonstrate 60% noise reduction → pitch: "Connect all 8 teams and we estimate 85% reduction because we can correlate across service boundaries." Cross-team correlation is the expansion trigger that no single-team tool can match.

**Loop 4: dd0c/alert → dd0c/run Cross-Sell**

Engineers see "Suggested Runbook" placeholders on incident cards → "Want to auto-attach runbooks? Add dd0c/run." Alert intelligence feeds runbook automation; resolution data feeds back into smarter correlation. This is the flywheel that makes the platform 10x more valuable than either product alone.
### Content Strategy

| Asset | Purpose | Timeline |
|---|---|---|
| **Alert Fatigue Calculator** | Lead gen, SEO, qualification. Long-tail keyword "alert fatigue cost calculator" = high purchase intent, low competition. | Launch day |
| **Engineering blog** | Technical credibility. "The True Cost of Alert Fatigue," "How We Reduced Alert Volume 80%," "The Architecture of dd0c/alert: Semantic Dedup with Sentence Transformers." | Ongoing from launch |
| **Open-source CLI: `dd0c-dedup`** | Engineering-as-marketing. Local tool that analyzes PagerDuty/OpsGenie export files and shows noise patterns. Free sample → SaaS subscription. | Month 1 |
| **"State of Alert Fatigue" annual report** | Survey 500+ SREs. Publish benchmarks. Become the industry reference that journalists and conference speakers cite. dd0c becomes synonymous with "alert intelligence." | Month 6 |
| **Case studies** | Social proof. First case study from earliest customer. "How [Company] reduced alert noise 73% in 2 weeks." | Month 2–3 |
| **Build-in-public Twitter thread** | Authenticity. Share progress, architecture decisions, customer wins. SRE audience respects transparency. | Pre-launch through ongoing |
### Marketplace Partnerships

| Partner | Distribution Value | Priority | Pitch |
|---|---|---|---|
| **PagerDuty Marketplace** | Very High — 28,000+ customers, exact buyer persona | P0 | "We make PagerDuty better. We reduce noise before it hits your platform. Complement, not competitor." |
| **Grafana Plugin Directory** | High — massive open-source community, growing as teams migrate from Datadog | P0 | Natural distribution. Plugin sends Grafana alerts to dd0c/alert. |
| **Datadog Marketplace** | High — growing marketplace | P1 | "We help Datadog customers get more value by correlating Datadog alerts with alerts from other tools." |
| **OpsGenie/Atlassian Marketplace** | Medium — #2 on-call tool, Atlassian distribution | P1 | Atlassian ecosystem reach. |
| **Slack App Directory** | Medium — discovery channel | P1 | Slack-native positioning. |
### 90-Day Launch Timeline

| Period | Actions | Targets |
|---|---|---|
| **Days 1–30: Build MLP** | Core engine (webhook ingestion, normalization, time-window clustering, deployment correlation). Slack bot. Dashboard MVP (Noise Report Card, integration management, suppression log). | Ship V1. First webhook received. |
| **Days 31–60: Launch & Validate** | HN "Show HN" post. Twitter/X announcement. Alert Fatigue Calculator live. SRE Slack community outreach. Personal network DMs. Daily customer conversations. Fix top 3 pain points. | 25–50 free signups. 5–10 paying teams. First case study. |
| **Days 61–90: Prove Flywheel** | Add semantic dedup (sentence-transformer embeddings). Ship Alert Simulation Mode. Submit to PagerDuty Marketplace + Grafana Plugin Directory. Publish first case study. Launch dd0c/alert + dd0c/run integration. | 50–100 free users. 15–25 paying teams. $5K+ MRR. |
---

## 5. BUSINESS MODEL

### Revenue Model

**Primary revenue:** Per-seat SaaS subscription (Pro at $19/seat/month, Business at $39/seat/month).

**Expansion revenue:** Seat expansion within accounts (land with 10 seats, expand to 50+ as more teams adopt) + tier upgrades (Pro → Business when a VP mandates company-wide rollout and needs SSO/longer retention) + cross-product upsell (dd0c/alert → dd0c/run bundle).

**Future revenue (Year 2+):** Usage-based pricing tiers for high-volume customers processing >100K alerts/month. Enterprise tier with custom pricing for 200+ seat deployments.
### Unit Economics

| Metric | Value | Notes |
|---|---|---|
| **Average deal size** | $285/month ($19 × 15 seats) | Pro tier, typical mid-market team |
| **Blended ARPU** | ~$375/month | Mix of Pro ($285) and Business ($780) customers |
| **Gross margin** | ~85–90% | Infrastructure costs are minimal: webhook ingestion + embedding computation + Slack API. No agents to host. |
| **CAC (PLG)** | ~$50–150 | Content marketing + community engagement. No paid ads initially. No sales team. |
| **CAC payback** | <1 month | At $285/month ARPU and $150 CAC, payback is immediate. |
| **LTV (at 5% monthly churn)** | ~$5,700 | $285/month × 20-month average lifetime. Improves as data moat reduces churn over time. |
| **LTV:CAC ratio** | 38:1 to 114:1 | Exceptional unit economics enabled by PLG + solo founder cost structure. |
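The LTV row uses the standard simplification LTV ≈ monthly ARPU × (1 / monthly churn). As a quick check on the table's own figures:

```typescript
// LTV ≈ monthly ARPU × average customer lifetime, where lifetime in months
// is approximated as 1 / monthly churn. Inputs below are the table's figures.
export function ltv(monthlyArpu: number, monthlyChurn: number): number {
  return monthlyArpu / monthlyChurn;
}

export function ltvToCacRatio(
  monthlyArpu: number,
  monthlyChurn: number,
  cac: number,
): number {
  return ltv(monthlyArpu, monthlyChurn) / cac;
}

// ltv(285, 0.05) ≈ 5700 (a 20-month lifetime), ltvToCacRatio(285, 0.05, 150) ≈ 38
```

The 114:1 upper bound is the same calculation with the $50 CAC estimate; the true lifetime is longer if churn falls as the data moat deepens, so these are conservative.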
**Cost structure advantage:** Zero employees, zero investors, zero burn rate. Profitable from customer #1. BigPanda needs $40M+ in revenue to break even (200+ employees at ~$200K fully loaded). incident.io raised $57M and must move upmarket to satisfy investor returns. dd0c can price at $19/seat and be profitable because the cost structure IS the moat.

### Path to Revenue Milestones
#### $10K MRR (~35 paying teams)

- **Timeline:** Month 3–4 (Grind scenario), Month 2 (Rocket scenario)
- **How:** First 10 customers from launch channels (HN, Twitter, personal network). Next 25 from content marketing, marketplace listings, and word of mouth.
- **Solo founder feasible:** Yes. Product is stable, support is manageable, marketing is content-driven.

#### $50K MRR (~175 paying teams)

- **Timeline:** Month 8–10 (Grind), Month 5 (Rocket)
- **How:** PLG flywheel kicking in. Noise Report Card driving internal expansion. Alert Fatigue Calculator generating steady leads. PagerDuty Marketplace live. First case studies published. dd0c/run cross-sell beginning.
- **Solo founder feasible:** Stretching. Consider first hire (engineer) at $30K MRR to maintain velocity.

#### $100K MRR (~350 paying teams)

- **Timeline:** Month 12–15 (Grind), Month 8 (Rocket)
- **How:** Cross-team expansion driving seat growth. Business tier adoption at 20%+ of customers. dd0c/alert + dd0c/run bundle driving 30–40% of new signups. Community patterns feature (if 500+ customers reached) creating cross-customer network effects.
- **Solo founder feasible:** No. Need a 2–3 person team. First engineer hired at $30K MRR, second at $75K MRR. Hire for infrastructure reliability and ML — the two areas that compound value fastest.
### Solo Founder Constraints & Mitigations

| Constraint | Mitigation |
|---|---|
| **Support burden** | Self-service docs, in-app guides, community Slack channel. Overlay architecture means dd0c going down = fallback to raw alerts (no worse than before). |
| **Uptime expectations** | Multi-region webhook endpoints with failover. Dual-path: webhook for real-time + periodic API polling for reconciliation. Health check monitoring if webhook volume drops to zero. |
| **Feature velocity** | Shared dd0c platform infrastructure (auth, billing, data pipeline) means each new product is incremental, not greenfield. Ruthless scope control. |
| **Burnout / bus factor** | Hire first engineer at $30K MRR, not $100K MRR. Don't wait until drowning. Automate everything automatable. |
### Revenue Scenarios (24-Month Projection)

| Scenario | Probability | Month 6 ARR | Month 12 ARR | Month 24 ARR |
|---|---|---|---|---|
| **Rocket** (everything clicks) | 20% | $342K | $1.64M | $12.5M |
| **Grind** (solid PMF, slower growth) | 50% | $109K | $513K | $3.03M |
| **Pivot** (competitive pressure, stalls) | 30% | $34K | $109K | Pivot to dd0c/run feature |
| **Expected value (weighted)** | — | $138K | $596K | $4.05M |

The expected-value scenario produces a $4M ARR product at Month 24. Even the Grind scenario (the most likely one) yields $3M ARR — enough to hire a small team and compound growth. This is a real business in every scenario except Pivot, which has defined kill criteria.
---

## 6. RISKS & MITIGATIONS

### Top 5 Risks

#### Risk 1: PagerDuty Ships Native Cross-Tool AI Correlation

- **Probability:** HIGH (80%) | **Impact:** CRITICAL | **Timeline:** 12–18 months
- **Threat:** PagerDuty already has "Event Intelligence." If they ship genuinely good alert intelligence bundled free into existing plans, dd0c's value prop for PagerDuty-only shops evaporates.
- **Mitigation:** dd0c's cross-tool correlation is the hedge — PagerDuty can only improve intelligence for PagerDuty alerts. Speed: be in market with 500+ customers and a trained data moat before they ship. Position as complement: "Keep PagerDuty for on-call. Add dd0c/alert in front to cut noise 70% across ALL your tools."
- **Residual risk:** MEDIUM. PagerDuty-only shops (~30% of TAM) become harder. Multi-tool shops (70% of TAM) unaffected.
- **Pivot option:** Double down on cross-tool visualization and deployment correlation inside Slack. Become the "incident context brain" connecting CI/CD to PagerDuty.
#### Risk 2: AI Suppresses a Real P1 Alert (Existential Trust Event)

- **Probability:** MEDIUM (50%) | **Impact:** CRITICAL | **Timeline:** Ongoing from Day 1
- **Threat:** One suppressed critical alert causing a production outage = permanent distrust. "dd0c/alert suppressed a P1 and we had a 2-hour outage" on Hacker News destroys the brand instantly.
- **Mitigation:** V1 has ZERO auto-suppression (non-negotiable). Trust Ramp: observe → suggest-and-confirm → auto-suppress only with explicit opt-in on patterns reaching 99% accuracy. "Never suppress" safelist (sev1, database, billing, security) — configurable, default-on. Transparent audit trail for every decision. Circuit breaker: if accuracy drops below 95%, auto-fallback to pass-through mode.
- **Residual risk:** MEDIUM. This risk never reaches zero — it's the existential tension of the product. Managing it IS the core competency.
- **Pivot option:** Drop auto-suppression entirely. Pivot to pure "Alert Grouping & Context Synthesis" in Slack. Grouping 47 pages into 1 still reduces 3am panic significantly without suppression liability.
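The circuit breaker in the mitigation could be sketched as a rolling accuracy window that disables suppression below the threshold. The window size and the feedback source (thumbs-up/down on Slack cards) are assumptions here, not the shipped design:

```typescript
// Sketch of the suppression circuit breaker: track recent decision accuracy
// (e.g. from thumbs-up/down feedback on Slack cards) and fall back to
// pass-through mode when it drops below the threshold.
export class SuppressionCircuitBreaker {
  private outcomes: boolean[] = []; // true = decision confirmed correct

  constructor(
    private readonly threshold = 0.95, // the 95% floor from the mitigation
    private readonly windowSize = 200, // rolling window, assumed size
  ) {}

  record(correct: boolean): void {
    this.outcomes.push(correct);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  /** When false, suppression is disabled and all alerts pass through raw. */
  suppressionAllowed(): boolean {
    if (this.outcomes.length < 20) return true; // not enough data to judge yet
    const correct = this.outcomes.filter(Boolean).length;
    return correct / this.outcomes.length >= this.threshold;
  }
}
```

The key property is the failure mode: when the breaker trips, the customer is merely back to the raw-alert world they had before dd0c, never worse.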
#### Risk 3: Data Privacy — Enterprises Won't Send Alert Data to a Solo Founder's SaaS

- **Probability:** MEDIUM (50%) | **Impact:** HIGH | **Timeline:** From Day 1
- **Threat:** Alert data contains service names, infrastructure details, error messages, sometimes customer data in payloads. CISOs will block adoption.
- **Mitigation:** Target Series B startups where Marcus the SRE can plug in a webhook without procurement review (not Fortune 500). Offer "Payload Stripping" mode: only receive metadata (source, timestamp, severity, alert name), strip raw logs. Publish a clear data handling policy. SOC2 Type II by Month 6–9. Architecture transparency: publish diagrams showing encryption in transit (TLS) and at rest (AES-256), no access to monitoring credentials.
- **Residual risk:** MEDIUM. Slows enterprise adoption but doesn't block mid-market PLG.
- **Pivot option:** Open-source the correlation engine (`dd0c-worker`). Customers run it in their own VPC; only anonymous hashes and timing data are sent to the SaaS dashboard.
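Payload Stripping is mechanically simple; a sketch with an assumed canonical alert shape (the field names are illustrative, not the real schema):

```typescript
// Payload Stripping mode (sketch): keep only routing metadata, drop the raw
// provider payload that may contain logs or customer data.
interface InboundAlert {
  source: string;       // e.g. "datadog"
  timestamp: string;    // ISO-8601
  severity: string;
  alertName: string;
  rawPayload?: unknown; // full provider payload, potentially sensitive
}

type StrippedAlert = Omit<InboundAlert, 'rawPayload'>;

export function stripPayload(alert: InboundAlert): StrippedAlert {
  // Explicit destructuring (rather than `delete`) guarantees that only the
  // four metadata fields can ever leave the customer's boundary.
  const { source, timestamp, severity, alertName } = alert;
  return { source, timestamp, severity, alertName };
}
```

Correlation quality degrades somewhat in this mode (no error-message text to embed), which is the honest trade-off to present to security teams.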
#### Risk 4: incident.io Adds Deep Alert Intelligence

- **Probability:** HIGH (70%) | **Impact:** HIGH | **Timeline:** 6–12 months
- **Threat:** Same buyer persona, same PLG motion, same Slack-native approach. $57M raised, 100+ employees. If they invest heavily in ML-based correlation, they offer alert intelligence + incident management in one product.
- **Mitigation:** Speed — be the recognized "alert intelligence" brand before they get there. Depth over breadth — their alert intelligence is one feature among many; dd0c's is the entire product, 10x deeper. The dd0c/alert + dd0c/run flywheel creates compound value they'd need two products to match. Interop positioning: "Use incident.io for incident management. Use dd0c/alert for alert intelligence. They work great together."
- **Residual risk:** MEDIUM-HIGH. This is the biggest competitive threat. Monitor their product roadmap obsessively.
#### Risk 5: Solo Founder Burnout / Bus Factor

- **Probability:** MEDIUM-HIGH (60%) | **Impact:** CRITICAL | **Timeline:** 6–12 months
- **Threat:** Building and supporting multiple dd0c products while doing marketing, sales, and customer support. One person maintaining 99.99% uptime on an alert ingestion pipeline.
- **Mitigation:** Ruthless scope control (V1 is minimal: time-window clustering + deployment correlation + Slack bot). Shared platform infrastructure reduces per-product effort. Overlay architecture means downtime = fallback to raw alerts, not total failure. Hire first engineer at $30K MRR. Automate support via self-service docs and community Slack.
- **Residual risk:** MEDIUM. Solo founder risk is real and doesn't fully mitigate. Discipline about scope is the only defense.
### Risk Summary Matrix

| # | Risk | Probability | Impact | Residual | Action |
|---|---|---|---|---|---|
| 1 | PagerDuty builds natively | HIGH | CRITICAL | MEDIUM | Outrun. Cross-tool positioning. |
| 2 | AI suppresses real P1 | MEDIUM | CRITICAL | MEDIUM | Engineer. Trust Ramp. Never-suppress safelist. |
| 3 | Data privacy concerns | MEDIUM | HIGH | MEDIUM | Certify. Payload stripping. SOC2. |
| 4 | incident.io adds alert intelligence | HIGH | HIGH | MEDIUM-HIGH | Outrun. Depth + flywheel. |
| 5 | Solo founder burnout | MEDIUM-HIGH | CRITICAL | MEDIUM | Scope ruthlessly. Hire early. |
### Kill Criteria

These are the signals to STOP and redirect resources:

1. **Can't find 10 paying customers in 90 days.** If the pain isn't acute enough for 10 teams to pay $19/seat after a free trial, the market isn't ready. Redirect to dd0c/run or dd0c/portal.
2. **Cannot achieve a verifiable 50% noise reduction for 10 paying beta teams within 90 days without a single false negative** (a real alert missed). Kill the product or strip it back to a pure Slack formatting tool.
3. **False positive rate exceeds 5% after 90 days.** If suppression accuracy can't reach 95% within 3 months of real-world data, the technology isn't ready. Go back to R&D.
4. **PagerDuty ships free, cross-tool alert intelligence.** Market position becomes untenable. Pivot dd0c/alert into a feature of dd0c/run.
5. **incident.io launches deep alert intelligence at <$15/seat.** Fighting uphill. Consider folding dd0c/alert into dd0c/run rather than competing standalone.
6. **Monthly customer churn exceeds 10% after Month 3.** Value isn't sticky. Investigate the root cause before continuing investment.
7. **Spending >60% of time on support instead of building.** Product isn't self-service enough. Fix UX or reconsider viability as a solo-founder venture.
### Pivot Options

| Trigger | Pivot |
|---|---|
| Competitive pressure kills standalone viability | Fold dd0c/alert into dd0c/run as a feature (alert correlation → auto-remediation pipeline) |
| Auto-suppression rejected by market | Pure "Alert Grouping & Context Synthesis" tool — no suppression, just better Slack formatting with deploy context |
| Data privacy blocks SaaS adoption | Open-source the correlation engine; charge for the dashboard/analytics SaaS layer |
| Alert intelligence commoditized | Pivot to deployment correlation as primary value prop — "the CI/CD ↔ incident bridge" |
---

## 7. SUCCESS METRICS

### North Star Metric

**Alerts Correlated Per Month**

Every correlated alert = an engineer who didn't get interrupted by a duplicate or noise alert. It's measurable, meaningful, and grows with both customer count and per-customer value. It captures the core promise: turning alert chaos into actionable signal.
### Leading Indicators (Predict Future Success)

| Metric | Target | Why It Matters |
|---|---|---|
| Time to first webhook | <5 minutes | Activation friction. If this is >30 minutes, the PLG motion is broken. |
| Time to first "wow" (grouped incident in Slack) | <60 seconds after first alert | The party mode mandate. The moment that converts tire-kickers to believers. |
| Thumbs-up/down ratio on Slack cards | >80% thumbs-up | Model accuracy signal. Below 70% = correlation quality is insufficient. |
| Free → Paid conversion rate | >5% | Willingness to pay. Below 2% = value prop isn't landing. |
| Weekly active users / total seats | >60% | Engagement depth. Below 30% = shelfware risk. |
| Integrations per customer | >2 | Multi-tool stickiness. More integrations = higher switching cost = lower churn. |
### Lagging Indicators (Confirm Business Health)

| Metric | Target | Why It Matters |
|---|---|---|
| MRR and MRR growth rate | 15–30% MoM (Stage 1) | Business trajectory. |
| Net revenue retention | >110% | Expansion outpacing churn. Land-and-expand working. |
| Logo churn (monthly) | <5% | Customer satisfaction. >10% = kill criteria triggered. |
| Noise reduction % (customer-reported) | >50% (target 70%+) | Core value delivery. <30% = kill criteria triggered. |
| NPS | >40 | Product-market fit signal. <20 = fundamental problem. |
| Seats per customer (avg) | Growing over time | Internal expansion working. |
### 30/60/90 Day Milestones

| Milestone | Day 30 | Day 60 | Day 90 |
|---|---|---|---|
| **Product** | V1 shipped. Webhook ingestion, time-window clustering, deployment correlation, Slack bot live. | Semantic dedup added. Alert Simulation Mode live. Top 3 user pain points fixed. | dd0c/run integration live. PagerDuty Marketplace submitted. |
| **Customers** | First webhook received. First free users. | 25–50 free signups. 5–10 paying teams. | 50–100 free users. 15–25 paying teams. |
| **Revenue** | $0–$1K MRR | $1K–$3K MRR | $3K–$5K+ MRR |
| **Validation** | Time-to-first-webhook <5 min confirmed. | Noise reduction >50% confirmed with real customers. First case study drafted. | Free-to-paid conversion >5%. NPS >40. Kill criteria evaluated. |
### Month 6 Targets

| Metric | Target |
|---|---|
| Paying teams | 100 (Grind) / 250 (Rocket) |
| MRR | $25K (Grind) / $70K (Rocket) |
| Noise reduction (avg across customers) | >65% |
| PagerDuty Marketplace | Live and generating signups |
| SOC2 Type II | Process started |
| dd0c/run cross-sell rate | 15%+ of alert customers |
| Net revenue retention | >110% |
### Month 12 Targets

| Metric | Target |
|---|---|
| Paying teams | 400 (Grind) / 1,000 (Rocket) |
| ARR | $513K (Grind) / $1.64M (Rocket) |
| Noise reduction (avg) | >70% |
| Team size | 2–3 (first engineer hired at $30K MRR) |
| SOC2 Type II | Certified |
| Cross-product adoption (alert + run) | 30–40% of customers |
| Community patterns feature | Architected, beta if 500+ customers reached |
| Net revenue retention | >120% |
---

*This product brief synthesizes findings from four prior phases: Brainstorm (200+ ideas), Design Thinking (5 personas, empathy mapping, journey mapping), Innovation Strategy (Christensen disruption analysis, Blue Ocean strategy, Porter's Five Forces, JTBD analysis), and Party Mode (5-person advisory board stress test, 4-1 GO verdict). All contradictions have been resolved in favor of the party mode board's mandates: V1 is observe-and-suggest only, deployment correlation is a V1 must-have, and the product must prove value within 60 seconds of pasting a webhook.*

*dd0c/alert is a classic low-end disruption: BigPanda intelligence at 1/100th the price, for the 150,000 mid-market teams the incumbents can't profitably serve. The 18-month window is open. Build the wedge, earn the trust, sell them the runbooks.*

**All signal. Zero chaos.**
# dd0c/alert — Test Architecture & TDD Strategy

**Product:** dd0c/alert (Alert Intelligence Platform)

**Version:** 2.0 | **Date:** 2026-02-28 | **Phase:** 7 — Test Architecture

**Stack:** TypeScript / Node.js 20 | Vitest | Testcontainers | LocalStack

---
## 1. Testing Philosophy & TDD Workflow

### 1.1 Core Principle

dd0c/alert is an intelligence platform — it makes decisions about what engineers see during incidents. A wrong suppression decision can hide a P1. A wrong correlation can create noise. **Tests are not optional; they are the specification.**

Every behavioral rule in the Correlation Engine, Noise Scorer, and Notification Router must be expressed as a failing test before a single line of implementation is written.
### 1.2 Red-Green-Refactor Cycle

```
RED      → Write a failing test that describes the desired behavior.
           The test must fail for the right reason (not a compile error).

GREEN    → Write the minimum code to make the test pass.
           No gold-plating. No "while I'm here" changes.

REFACTOR → Clean up the implementation without breaking tests.
           Extract functions, rename for clarity, remove duplication.
           Tests stay green throughout.
```

**Strict rule:** No implementation code is written without a failing test first. PRs that add implementation without a corresponding test are blocked by CI.
### 1.3 Test Naming Convention

Tests follow the `given_when_then` pattern using Vitest's `describe`/`it` structure:

```typescript
describe('NoiseScorer', () => {
  describe('given a deploy-correlated alert window', () => {
    it('should boost noise score by 25 points when deploy is attached', () => { ... });
    it('should add 5 additional points when PR title contains "feature-flag"', () => { ... });
    it('should not boost score above 50 when service matches never-suppress safelist', () => { ... });
  });
});
```

Test file naming: `{module}.test.ts` for unit tests, `{module}.integration.test.ts` for integration tests, `{journey}.e2e.test.ts` for E2E.
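A minimal scorer that would satisfy the example specs above. The point values (+25, +5, cap at 50) come straight from the tests; the type shapes are assumptions, and the safelist echoes the never-suppress list (sev1, database, billing, security) from the product brief:

```typescript
// Hypothetical shapes for this sketch, not the real dd0c/alert types.
interface AlertWindow {
  service: string;
  deployAttached: boolean;
  prTitle?: string;
}

const NEVER_SUPPRESS_SAFELIST = ['sev1', 'database', 'billing', 'security'];

export function noiseScore(window: AlertWindow, baseScore: number): number {
  let score = baseScore;
  if (window.deployAttached) score += 25; // deploy-correlated boost
  if (window.prTitle?.includes('feature-flag')) score += 5; // flag rollouts skew noisy
  // Safelisted services are capped so they can never cross the
  // auto-suppression threshold, no matter how noisy they look.
  if (NEVER_SUPPRESS_SAFELIST.some((s) => window.service.includes(s))) {
    score = Math.min(score, 50);
  }
  return score;
}
```

Written TDD-style, each `it` block above would fail first, then this function is the minimum code that turns all three green.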
### 1.4 When Tests Lead (TDD Mandatory)

TDD is **mandatory** for:

- All noise scoring logic (`src/scoring/`)
- All correlation rules (`src/correlation/`)
- All suppression decisions (`src/suppression/`)
- HMAC validation per provider
- Canonical schema mapping (every provider parser)
- Feature flag circuit breaker logic
- Governance policy enforcement (`policy.json` evaluation)
- Any function with cyclomatic complexity > 3
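For the "HMAC validation per provider" item, a provider-agnostic core might look like the sketch below. Header names and signature encodings vary by provider (Datadog, PagerDuty, and Grafana each do this differently); a hex-encoded HMAC-SHA256 signature over the raw request body is assumed here:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Generic HMAC-SHA256 webhook verification. Per-provider adapters would
// extract the signature from the right header and decode it appropriately;
// this sketch assumes a hex-encoded signature over the raw body.
export function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so reject early instead.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

The `timingSafeEqual` comparison matters: a naive `===` on strings leaks timing information that lets an attacker forge signatures byte by byte.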
TDD is **recommended but not enforced** for:

- Infrastructure glue code (SQS consumers, DynamoDB adapters)
- Slack Block Kit message formatting
- Dashboard API route handlers (covered by integration tests)
### 1.5 Test Ownership

Each epic owns its tests. The Correlation Engine team owns `src/correlation/**/*.test.ts`. No cross-team test ownership. If a test breaks due to a dependency change, the team that changed the dependency fixes the test.

---
2017
products/04-lightweight-idp/architecture/architecture.md
Normal file
File diff suppressed because it is too large

245
products/04-lightweight-idp/brainstorm/session.md
Normal file
# dd0c/portal — Brainstorm Session

**Product:** Lightweight Internal Developer Portal ("The Anti-Backstage")

**Facilitator:** Carson, Elite Brainstorming Specialist

**Date:** 2026-02-28

> *Every idea gets a seat at the table. We filter later. Let's GO.*

---
## Phase 1: Problem Space (25 ideas)

### Why Does Backstage Suck?

1. **YAML Cemetery** — Backstage requires hand-written `catalog-info.yaml` in every repo. Engineers write it once, never update it. Within 6 months your catalog is a graveyard of lies.
2. **Plugin Roulette** — Backstage plugins break on every upgrade. The plugin ecosystem is wide but shallow — half-maintained community plugins that rot.
3. **Dedicated Platform Team Required** — You need 1–2 full-time engineers just to keep Backstage running. For a 30-person team, that's 3–7% of your engineering headcount babysitting a developer portal.
4. **React Monolith From Hell** — Backstage is a massive React app you have to fork, customize, build, and deploy yourself. It's not a product, it's a framework. Spotify built it for Spotify.
5. **Upgrade Treadmill** — Backstage releases constantly. Each upgrade risks breaking your custom plugins and templates. Teams fall behind and get stuck on ancient versions.
6. **Cold Start Problem** — Day 1 of Backstage: empty catalog. You have to manually register every service. Nobody does it. The portal launches to crickets.
7. **No Opinions** — Backstage is infinitely configurable, which means it ships with zero useful defaults. You have to decide everything: what metadata to track, what plugins to install, how to organize the catalog.
8. **Search Is Terrible** — Backstage's built-in search is basic. Finding "who owns the payment service" requires navigating a clunky UI tree.
9. **Authentication Nightmare** — Setting up auth (Okta, GitHub, Google) in Backstage requires custom provider configuration that's poorly documented.
10. **No Auto-Discovery** — Backstage doesn't discover anything. It's a static registry that depends entirely on humans keeping it current. Humans don't.
### What Do Engineers Actually Need? (The 80/20)

11. **"Who owns this?"** — The #1 question. When something breaks at 3 AM, you need to know who to page. That's it. That's the killer feature.
12. **"What does this service do?"** — A one-paragraph description, its dependencies, and its API docs. Not a 40-page Confluence novel.
13. **"Is it healthy right now?"** — Green/yellow/red. Deployment status. Last deploy time. Current error rate. One glance.
14. **"Where's the runbook?"** — When the service is on fire, where do I go? Link to the runbook, the dashboard, the logs.
15. **"What depends on this?"** — Dependency graph. If I change this service, what breaks?
16. **"How do I set up my dev environment for this?"** — README, setup scripts, required env vars. Onboarding in 10 minutes, not 10 days.
### The Pain of NOT Having an IDP
|
||||||
|
|
||||||
|
17. **Tribal Knowledge Monopoly** — "Ask Dave, he built that service 3 years ago." Dave left 6 months ago. Now nobody knows.
|
||||||
|
18. **Confluence Graveyard** — Teams document services in Confluence pages that are 2 years stale. New engineers follow outdated instructions and waste days.
|
||||||
|
19. **Slack Archaeology** — "I think someone posted the architecture diagram in #platform-eng last March?" Engineers spend hours searching Slack history for institutional knowledge.
|
||||||
|
20. **Incident Response Roulette** — Alert fires → nobody knows who owns the service → 30-minute delay finding the right person → MTTR doubles.
|
||||||
|
21. **Onboarding Black Hole** — New engineer joins. Spends first 2 weeks asking "what is this service?" and "who do I talk to about X?" in Slack. Productivity = zero.
|
||||||
|
22. **Duplicate Services** — Without a catalog, Team A builds a notification service. Team B doesn't know it exists. Team B builds another notification service. Now you have two.
23. **Zombie Services** — Services that nobody owns, nobody uses, but nobody is brave enough to turn off. They accumulate like barnacles, costing money and creating security risk.
24. **Compliance Panic** — Auditor asks "show me all services that handle PII and their owners." Without an IDP, this is a multi-week scavenger hunt.
25. **Shadow Architecture** — The actual architecture diverges from every diagram ever drawn. Nobody has a true picture of what's running in production.
---

## Phase 2: Solution Space (42 ideas)

### Auto-Discovery Approaches
26. **AWS Resource Tagger** — Scan AWS accounts via read-only IAM role. Discover EC2, ECS, Lambda, RDS, S3, API Gateway. Map them to services using tags, naming conventions, and CloudFormation stack associations.
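
A minimal sketch of the mapping step, assuming discovery has already produced plain resource dicts; the `service` tag, the CloudFormation stack-name tag, and the `-service` naming fallback are illustrative precedence rules, not a fixed spec:

```python
from collections import defaultdict

def map_resources_to_services(resources):
    """Group discovered AWS resources into catalog services.

    Precedence: explicit `service` tag, then CloudFormation stack name,
    then a naming-convention fallback.
    """
    services = defaultdict(list)
    for res in resources:
        tags = res.get("tags", {})
        if "service" in tags:
            key = tags["service"]
        elif "aws:cloudformation:stack-name" in tags:
            key = tags["aws:cloudformation:stack-name"]
        elif "-service" in res["name"]:
            # "payment-service-worker" -> "payment-service"
            key = res["name"].split("-service")[0] + "-service"
        else:
            key = res["name"]
        services[key].append(res["arn"])
    return dict(services)
```

In practice a scanner would feed this from paginated `Describe*`/`List*` calls made under the read-only role.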
27. **GitHub/GitLab Repo Scanner** — Scan org repos. Infer services from repo names, `Dockerfile` presence, CI/CD configs, deployment manifests. Extract README descriptions automatically.
28. **Kubernetes Label Harvester** — Connect to K8s clusters. Discover deployments, services, ingresses. Map labels (`app`, `team`, `owner`) to catalog entries.
29. **Terraform State Reader** — Parse Terraform state files (S3 backends). Build infrastructure graph from resource relationships. Know exactly what infra each service uses.
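
A sketch of the state-parsing step, assuming the v4 state layout with a top-level `resources` list (real states also carry `instances` and `depends_on`, which a fuller graph builder would walk):

```python
import json

def inventory_from_tfstate(state_json):
    """Return {resource_type: [names]} from a Terraform state document
    (v4 layout: top-level "resources" list), skipping data sources."""
    state = json.loads(state_json)
    inventory = {}
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue
        inventory.setdefault(res["type"], []).append(res["name"])
    return inventory

# Trimmed example state, for illustration only.
sample_state = """{"version": 4, "resources": [
  {"mode": "managed", "type": "aws_lambda_function", "name": "order_processor"},
  {"mode": "data", "type": "aws_caller_identity", "name": "current"}
]}"""
```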
30. **CI/CD Pipeline Analyzer** — Read GitHub Actions / GitLab CI / Jenkins configs. Infer deployment targets, environments, and service relationships from pipeline definitions.
31. **DNS/Route53 Reverse Map** — Scan DNS records to discover all public-facing services and map them back to infrastructure.
32. **CloudFormation Stack Walker** — Parse CF stacks to understand resource groupings and cross-stack references. Build dependency graphs automatically.
33. **Package.json / go.mod / pom.xml Dependency Inference** — Read dependency files to infer internal service-to-service relationships (shared libraries = likely communication).
34. **Git Blame Ownership** — Infer service ownership from git commit history. Who commits most to this repo? That's probably the owner (or at least knows who is).
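
A sketch of the inference, assuming the input is `git shortlog -sn` output (commit count, then author, per line); a production version would weight recent commits more heavily:

```python
import re
from collections import Counter

def infer_owner(shortlog):
    """Guess an owner from `git shortlog -sn` output.
    Returns (top_author, their share of all commits)."""
    counts = Counter()
    for line in shortlog.strip().splitlines():
        match = re.match(r"\s*(\d+)\s+(.+)", line)
        if match:
            counts[match.group(2).strip()] += int(match.group(1))
    if not counts:
        return None, 0.0
    author, top = counts.most_common(1)[0]
    return author, top / sum(counts.values())
```

The share doubles as a confidence signal: a 0.8 share is a strong ownership hint, a 0.3 share means "ask around first".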
35. **PagerDuty/OpsGenie Schedule Import** — Pull on-call schedules to auto-populate "who to page" for each service.
36. **OpenAPI/Swagger Auto-Ingest** — Detect and index API specs from repos. Surface them in the portal as live, searchable API documentation.
37. **Docker Compose Graph** — Parse `docker-compose.yml` files to understand local development service topologies.
### Service Catalog Features
38. **One-Line Service Card** — Every service gets a card: name, owner, health, last deploy, language, repo link. Scannable in 2 seconds.
39. **Dependency Graph Visualization** — Interactive graph showing service-to-service dependencies. Click a node to see details. Highlight blast radius.
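
The blast-radius highlight reduces to a reverse-edge traversal. A sketch, assuming the catalog exposes a simple `{service: [services it calls]}` map:

```python
from collections import deque

def blast_radius(depends_on, failed):
    """Return every service transitively affected when `failed` goes down."""
    dependents = {}  # invert the edges: who calls whom
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)
    affected, queue = set(), deque([failed])
    while queue:
        for svc in dependents.get(queue.popleft(), ()):
            if svc not in affected:
                affected.add(svc)
                queue.append(svc)
    return affected
```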
40. **Health Dashboard** — Aggregate health from multiple sources (CloudWatch, Datadog, Grafana, custom health endpoints). Show unified red/yellow/green.
41. **Ownership Registry** — Team → services mapping. Click a team, see everything they own. Click a service, see the team and on-call rotation.
42. **Runbook Linker** — Auto-detect runbooks in repos (markdown files in `/runbooks`, `/docs`, or linked in README). Surface them on the service card.
43. **Environment Matrix** — Show all environments (dev, staging, prod) for each service. Current version deployed in each. Drift between environments highlighted.
44. **SLO Tracker** — Define SLOs per service. Show current burn rate. Alert when SLO budget is burning too fast. Simple — not a full SLO platform, just visibility.
45. **Cost Attribution** — Pull from dd0c/cost. Show monthly AWS cost per service. "This service costs $847/month." Engineers never see this data today.
46. **Tech Radar Integration** — Tag services with their tech stack. Surface org-wide technology adoption. "We have 47 services on Node 18, 3 still on Node 14."
47. **README Renderer** — Pull and render the repo README directly in the portal. No context switching to GitHub.
48. **Changelog Feed** — Show recent deployments, config changes, and incidents per service. "What happened to this service this week?"
### Developer Experience
49. **Instant Search (Cmd+K)** — Algolia-fast search across all services, teams, APIs, runbooks. The portal IS the search bar.
50. **Slack Bot** — `/dd0c who owns payment-service` → instant answer in Slack. No need to open the portal.
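
A sketch of the command handler's core, with a hypothetical in-memory catalog shape; the Slack plumbing (signing-secret verification, response URLs) is deliberately omitted:

```python
def handle_dd0c_command(text, catalog):
    """Answer `/dd0c who owns <service>` from a catalog dict."""
    parts = text.split()
    if len(parts) == 3 and parts[:2] == ["who", "owns"]:
        entry = catalog.get(parts[2])
        if entry is None:
            return f"No service named '{parts[2]}' in the catalog."
        return f"{parts[2]} is owned by {entry['owner']} (on-call: {entry['on_call']})"
    return "Usage: /dd0c who owns <service>"
```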
51. **CLI Tool** — `dd0c portal search "payment"` → results in terminal. For engineers who live in the terminal.
52. **Browser New Tab** — dd0c/portal as the browser new tab page. Every time an engineer opens a tab, they see their team's services, recent incidents, and deployment status.
53. **VS Code Extension** — Right-click a service import → "View in dd0c/portal" → opens service card. See ownership and docs without leaving the editor.
54. **GitHub PR Enrichment** — Bot comments on PRs with service context: "This PR affects payment-service (owned by @payments-team, 99.9% SLO, last incident 3 days ago)."
55. **Mobile-Friendly View** — When you're on-call and get paged on your phone, the portal should be usable on mobile. Backstage is not.
56. **Deep Links** — Every service, team, runbook, and API has a stable URL. Paste it in Slack, Jira, anywhere. It just works.
### Zero-Config Magic
57. **Convention Over Configuration** — If your repo is named `payment-service`, the service is named `payment-service`. If it has a `Dockerfile`, it's a deployable service. If it has `owner` in CODEOWNERS, that's the owner. Zero YAML needed.
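
The convention rules above can be sketched as a pure function over observed repo facts; the field names here are illustrative, not a schema:

```python
def infer_service(repo):
    """Build a catalog entry from repo conventions alone -- no YAML.
    `repo` holds observed facts: name, file paths, CODEOWNERS entries."""
    files = set(repo.get("files", []))
    owners = repo.get("codeowners", [])
    return {
        "name": repo["name"],
        "deployable": "Dockerfile" in files or "serverless.yml" in files,
        "owner": owners[0] if owners else None,
        "has_runbook": any(path.startswith(("runbooks/", "docs/runbooks/"))
                           for path in files),
    }
```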
58. **Smart Defaults** — First run: connect AWS account + GitHub org. Portal auto-populates with everything it finds. Engineer reviews and corrects, not creates from scratch.
59. **Progressive Enhancement** — Start with auto-discovered data (maybe 60% accurate). Let teams enrich over time. Never require manual entry as a prerequisite.
60. **Confidence Scores** — Show "we're 85% sure @payments-team owns this" based on git history and AWS tags. Let humans confirm or correct. Learn from corrections.
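
One simple way to produce such a score is a weighted sum over independent signals; the signal names and weights below are illustrative, not calibrated:

```python
def ownership_confidence(evidence, weights=None):
    """Fold independent ownership signals into a 0-1 confidence score.
    `evidence` maps signal name -> bool."""
    weights = weights or {"aws_tag": 0.5, "codeowners": 0.3, "git_history": 0.2}
    return round(sum(w for sig, w in weights.items() if evidence.get(sig)), 2)
```

Learning from corrections would then mean adjusting the weights per signal as humans confirm or override guesses.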
61. **Ghost Service Detection** — Find AWS resources that don't map to any known repo or team. Surface them as "orphaned infrastructure" — potential zombie services or cost waste.
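
At its core this is a set difference between discovered infrastructure and the known catalog; a sketch with hypothetical record shapes:

```python
def find_orphans(discovered, catalog_services):
    """Resources whose inferred service is not in the catalog:
    candidate zombies or plain cost waste."""
    known = set(catalog_services)
    return sorted({res["service"] for res in discovered
                   if res["service"] not in known})
```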
### Scorecard / Maturity Model
62. **Production Readiness Score** — Does this service have: health check? Logging? Alerting? Runbook? Score it 0-100. Gamify production readiness.
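
A sketch of the scorer, pairing the 0-100 score with the missing items so the result is actionable rather than just a grade (checks and weights are illustrative):

```python
READINESS_CHECKS = {  # illustrative checks and weights, not a standard
    "health_check": 30,
    "alerting": 25,
    "runbook": 25,
    "structured_logging": 20,
}

def readiness_score(facts):
    """Score a service 0-100 and name what is missing."""
    score = sum(pts for check, pts in READINESS_CHECKS.items() if facts.get(check))
    missing = [check for check in READINESS_CHECKS if not facts.get(check)]
    return score, missing
```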
63. **Documentation Coverage** — Does the repo have a README? API docs? Architecture decision records? Score it.
64. **Security Posture** — Are dependencies up to date? Any known CVEs? Is the Docker image scanned? Secrets in env vars vs. secrets manager?
65. **On-Call Readiness** — Is there an on-call rotation defined? Is the runbook current? Has the team done a recent incident drill?
66. **Leaderboard** — Team-level maturity scores. Friendly competition. "Platform team is at 92%, payments team is at 67%." Gamification drives adoption.
67. **Improvement Suggestions** — "Your service is missing a health check endpoint. Here's a template for Express/FastAPI/Go." Actionable, not just a score.
### dd0c Module Integration
68. **Alert → Owner Routing (dd0c/alert)** — Alert fires → portal knows the owner → alert routes directly to the right person. No more generic #alerts channel.
69. **Drift Visibility (dd0c/drift)** — Service card shows "⚠️ 3 infrastructure drifts detected." Click to see details in dd0c/drift.
70. **Cost Per Service (dd0c/cost)** — Service card shows monthly AWS cost. "This Lambda costs $234/month." Engineers finally see the money.
71. **Runbook Execution (dd0c/run)** — Runbook linked in portal is executable via dd0c/run. "Service is down → click runbook → AI walks you through recovery."
72. **LLM Cost Per Service (dd0c/route)** — If the service uses LLM APIs, show the AI spend. "This service spent $1,200 on GPT-4o last month."
73. **Unified Incident View** — When an incident happens, the portal becomes the war room: service health, owner, runbook, recent changes, cost impact — all on one screen.
### Wild Ideas 🔥
74. **The IDP Is Just a Search Engine** — Forget the catalog UI. The entire product is a search bar. Type anything: service name, team name, API endpoint, error message. Get instant answers. Google for your infrastructure.
75. **AI Agent: "Ask Your Infra"** — Natural language queries: "Who owns the service that handles Stripe webhooks?" "What changed in production in the last 24 hours?" "Which services don't have runbooks?" The AI queries the catalog and answers.
76. **Auto-Generated Architecture Diagrams** — From discovered services and dependencies, auto-generate C4 / system context diagrams. Always up-to-date because they're generated from reality, not drawn by hand.
77. **"New Engineer" Mode** — A guided tour for new hires. "Here are the 10 most important services. Here's who owns what. Here's how to set up your dev environment." Onboarding in 1 hour, not 1 week.
78. **Service DNA** — Every service gets a unique fingerprint: its tech stack, dependencies, deployment pattern, cost profile, health history. Use this to find similar services, suggest best practices, detect anomalies.
79. **Incident Replay** — After an incident, the portal shows a timeline: what changed, what broke, who responded, how it was fixed. Auto-generated post-mortem skeleton.
80. **"What If" Simulator** — "What if we deprecate service X?" Show the blast radius: which services depend on it, which teams are affected, estimated migration effort.
81. **Service Lifecycle Tracker** — Track services from creation → active → deprecated → decommissioned. Prevent zombie services by making the lifecycle visible.
82. **Auto-PR for Missing Metadata** — Portal detects a service is missing an owner tag. Automatically opens a PR to the repo adding a `CODEOWNERS` file with a suggested owner based on git history.
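
A sketch of the file-generation half, assuming the owner was already inferred upstream; actually opening the pull request would go through the forge's API (e.g. GitHub's):

```python
def codeowners_file(repo_name, suggested_owner):
    """Render the CODEOWNERS content the bot would commit in its PR."""
    return (
        f"# Suggested by dd0c/portal for {repo_name}: owner inferred "
        "from git history. Please review before merging.\n"
        f"* {suggested_owner}\n"
    )
```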
83. **Ambient Dashboard (TV Mode)** — Full-screen mode for office TVs. Show team services, health status, recent deploys, SLO burn rates. The engineering floor's heartbeat monitor.
84. **Service Comparison** — Side-by-side comparison of two services: tech stack, cost, health, maturity score. Useful for migration planning or standardization.
---

## Phase 3: Differentiation & Moat (18 ideas)

### How to Beat Backstage
85. **Time-to-Value: 5 Minutes vs. 5 Months** — Backstage takes months to set up. dd0c/portal takes 5 minutes (connect AWS + GitHub, auto-discover, done). This is the entire pitch. Speed kills.
86. **Zero Maintenance** — Backstage is self-hosted and requires constant upgrades. dd0c/portal is SaaS. We handle upgrades, scaling, and plugin compatibility. Your platform team can go back to building platforms.
87. **Auto-Discovery vs. Manual Entry** — Backstage requires humans to write YAML. dd0c/portal discovers everything automatically. The catalog is always current because it's generated from reality, not maintained by humans.
88. **Opinionated > Configurable** — Backstage gives you a blank canvas. dd0c/portal gives you a finished painting. We make the decisions so you don't have to. Convention over configuration.
89. **"Backstage Migrator"** — One-click import from existing Backstage `catalog-info.yaml` files. Lower the switching cost to zero. Eat their lunch.
### How to Beat Port/Cortex/OpsLevel
90. **Price: $10/eng vs. $200+/eng** — Port, Cortex, and OpsLevel charge enterprise prices ($20K+/year). dd0c/portal is $10/engineer/month with self-serve signup. No sales calls. No procurement process.
91. **Self-Serve vs. Sales-Led** — You can start using dd0c/portal today. Port requires a demo call, a POC, and a 6-week evaluation. By the time their sales cycle completes, you've been using dd0c for 2 months.
92. **Simplicity as Feature** — Port and Cortex have massive feature sets designed for 1000+ engineer orgs. dd0c/portal has 20% of the features for 80% of the value. For a 30-person team, less is more.
93. **dd0c Platform Integration** — Port is a standalone IDP. dd0c/portal is part of a unified platform (cost, alerts, drift, runbooks). The IDP that knows your costs, routes your alerts, and executes your runbooks. Nobody else can do this.
### The Moat
94. **Data Network Effect** — The more services discovered, the better the dependency graph, the smarter the ownership inference, the more accurate the health aggregation. Data compounds.
95. **Platform Lock-In (The Good Kind)** — Once dd0c/portal is the browser homepage for every engineer, switching costs are enormous. It's the operating system for your engineering org.
96. **Cross-Module Flywheel** — Portal makes alerts smarter (route to owner). Alerts make portal stickier (engineers open it during incidents). Cost data makes portal indispensable (engineers check service costs). Each module reinforces the others.
97. **AI-Powered Inference Engine** — Over time, dd0c learns patterns across all customers (anonymized): common service architectures, typical ownership structures, standard tech stacks. The AI gets smarter with scale. New customers get better auto-discovery on day 1.
98. **Community Catalog Templates** — Open-source library of service templates (Express API, Lambda function, ECS service). New services created from templates are automatically portal-ready. The community builds the ecosystem.
99. **"Agent Control Plane" Positioning** — As agentic AI grows, AI agents need a source of truth about services. dd0c/portal becomes the registry that AI agents query. "Which service handles payments?" The IDP isn't just for humans anymore — it's for AI agents too.
100. **Compliance Moat** — Once dd0c/portal is the system of record for service ownership and maturity, it becomes compliance infrastructure. SOC 2 auditors love it. Ripping it out means losing your compliance evidence.
101. **Integration Depth** — Build deep integrations with the tools teams already use (GitHub, Slack, PagerDuty, Datadog, AWS). Each integration makes dd0c/portal harder to replace.
102. **Open-Source Discovery Agent** — Open-source the discovery agent (runs in their VPC). Proprietary SaaS dashboard. The OSS agent builds trust and community. The dashboard is the business.
---

## Phase 4: Anti-Ideas & Red Team (14 ideas)

### Why Would This Fail?
103. **"Lightweight" = "Toy"** — Teams might dismiss dd0c/portal as too simple. "We need Backstage because we're a serious engineering org." Perception problem: lightweight sounds like it can't scale.
104. **GitHub Ships a Built-In Catalog** — GitHub already has repository topics, CODEOWNERS, and dependency graphs. If they add a "Service Catalog" tab, dd0c/portal's value proposition evaporates overnight.
105. **Backstage Gets Easy** — Roadie (managed Backstage) is improving. If Backstage 2.0 ships with auto-discovery and zero-config setup, the "Anti-Backstage" positioning dies.
106. **AWS Ships a Good IDP** — AWS has Service Catalog, but it's terrible. If they build a real IDP integrated with their ecosystem, they have distribution dd0c can't match.
107. **Discovery Accuracy Problem** — Auto-discovery sounds magical but might be 60% accurate. If engineers open the portal and see wrong data, they'll never trust it again. First impressions are everything.
108. **Small Teams Don't Need an IDP** — A 15-person team might genuinely not need a service catalog. They all sit in the same room. The TAM might be smaller than expected.
109. **Enterprise Gravity** — As teams grow past 100 engineers, they'll "graduate" to Port or Cortex. dd0c/portal might be a stepping stone, not a destination. High churn at the top end.
110. **Solo Founder Risk** — Building an IDP requires integrations with dozens of tools (AWS, GCP, Azure, GitHub, GitLab, Bitbucket, PagerDuty, OpsGenie, Datadog, Grafana...). That's a massive surface area for one person.
111. **The "Free" Competitor Problem** — Backstage is free. Convincing teams to pay $10/eng/month when a free option exists requires the value gap to be enormous and obvious.
112. **Data Sensitivity** — The portal needs read access to AWS accounts and GitHub orgs. Security teams at larger companies will block this. The trust barrier is real.
113. **Agentic AI Makes IDPs Obsolete** — If AI agents can answer "who owns this service?" by reading git history and Slack in real-time, do you need a static catalog at all?
114. **Platform Engineering Fatigue** — Teams are tired of adopting new tools. "We just finished setting up Backstage, we're not switching." Migration fatigue is real.
115. **The "Good Enough" Spreadsheet** — Many teams track services in a Google Sheet or Notion database. It's ugly but it works. Convincing them to pay for a dedicated tool is harder than it sounds.
116. **Churn from Simplicity** — If the product is truly lightweight, there's less surface area for stickiness. Users might churn because they feel they've "outgrown" it.
---

## Phase 5: Synthesis

### Top 10 Ideas (Ranked)
| Rank | Idea | Why It Wins |
|------|------|-------------|
| 1 | **5-Minute Auto-Discovery Setup** (#57, #58) | THE differentiator. Connect AWS + GitHub → catalog populated. Zero YAML. This is the entire pitch against Backstage. |
| 2 | **Cmd+K Instant Search** (#49, #74) | The portal IS the search bar. "Who owns X?" answered in 2 seconds. This is the daily-use hook that makes it the browser homepage. |
| 3 | **AI "Ask Your Infra" Agent** (#75) | Natural language queries against your service catalog. "What changed in prod today?" This is the 2026 differentiator that no competitor has. |
| 4 | **Ownership Registry + PagerDuty Sync** (#41, #35) | The #1 use case: who owns this, who's on-call. Auto-populated from PagerDuty/OpsGenie + git history. Solves the 3 AM problem. |
| 5 | **dd0c Cross-Module Integration** (#68-73, #96) | Alerts route to owners. Costs attributed to services. Runbooks linked and executable. The platform flywheel that standalone IDPs can't match. |
| 6 | **Production Readiness Scorecard** (#62-67) | Gamified maturity model. Teams compete to improve scores. Drives adoption AND improves engineering practices. Two birds, one stone. |
| 7 | **Slack Bot** (#50) | `/dd0c who owns payment-service` — meet engineers where they already are. Reduces friction to zero. Drives organic adoption. |
| 8 | **Auto-Generated Dependency Graphs** (#39, #76) | Visual blast radius. "If this service goes down, these 12 services are affected." Always current because it's generated from reality. |
| 9 | **Backstage Migrator** (#89) | One-click import from Backstage YAML. Lowers switching cost to zero. Directly targets the frustrated Backstage user base. |
| 10 | **$10/eng Self-Serve Pricing** (#90, #91) | No sales calls. No procurement. Credit card signup. This alone disqualifies Port/Cortex/OpsLevel for 80% of the market. |
### 3 Wild Cards 🃏
| # | Wild Card | Why It's Wild |
|---|-----------|---------------|
| 🃏1 | **"New Engineer" Guided Onboarding Mode** (#77) | Turns the IDP into an onboarding tool. "Welcome to Acme Corp. Here are your 47 services in 5 minutes." HR teams would champion this. Completely different buyer persona. |
| 🃏2 | **Agent Control Plane** (#99) | Position dd0c/portal as the registry that AI agents query, not just humans. As agentic DevOps explodes in 2026, this could be the defining use case. The IDP becomes infrastructure for AI. |
| 🃏3 | **Auto-PR for Missing Metadata** (#82) | The portal doesn't just show gaps — it fixes them. Detects missing CODEOWNERS, opens a PR with suggested owners. The catalog improves itself. Self-healing metadata. |
### Recommended V1 Scope
**Core (Must Ship):**
- AWS auto-discovery (EC2, ECS, Lambda, RDS, API Gateway via read-only IAM role)
- GitHub org scan (repos, languages, CODEOWNERS, README)
- Service cards (name, owner, description, repo, health, last deploy)
- Cmd+K instant search
- Team → services ownership mapping
- PagerDuty/OpsGenie on-call schedule import
- Slack bot (`/dd0c who owns X`)
- Self-serve signup, $10/engineer/month

**V1.1 (Fast Follow):**
- Production readiness scorecard
- Dependency graph visualization
- Kubernetes discovery
- Backstage YAML importer
- dd0c/alert integration (route alerts to service owners)
- dd0c/cost integration (show cost per service)

**V1.2 (Differentiator):**
- AI "Ask Your Infra" natural language queries
- Auto-PR for missing metadata
- New engineer onboarding mode
- Terraform state parsing

**Explicitly NOT V1:**
- Custom plugins/extensions (that's Backstage's trap)
- GCP/Azure support (AWS-first, expand later)
- Software templates / scaffolding (stay focused on catalog)
- Full SLO management (just show basic health)
- Self-hosted option (SaaS only to start)
---
> **Total ideas generated: 116**
> *Session complete. The Anti-Backstage has a blueprint. Now go build it.* 🔥
301
products/04-lightweight-idp/design-thinking/session.md
Normal file
# dd0c/portal — Design Thinking Session
**Product:** Lightweight Internal Developer Portal ("The Anti-Backstage")
**Facilitator:** Maya, Design Thinking Maestro
**Date:** 2026-02-28
> *"Design is jazz. You listen before you play. You feel the room before you solo. And the room right now? It's full of engineers who are lost, frustrated, and pretending they're not."*
---

## Phase 1: EMPATHIZE 🎭
The cardinal sin of developer tooling is building for the architect's ego instead of the practitioner's panic. Before we sketch a single wireframe, we need to sit in three very different chairs and feel the splinters.

---

### Persona 1: The New Hire — "Jordan"
**Profile:** Software Engineer II, 2 years experience, just joined a 60-person engineering org. First day was Monday. It's now Wednesday. Jordan has imposter syndrome and a laptop full of bookmarks that don't work.
#### Empathy Map
| Dimension | Details |
|-----------|---------|
| **SAYS** | "Where's the documentation for this?" · "Who should I ask about the payments service?" · "I don't want to bother anyone with a dumb question" · "The wiki says to use Jenkins but I think we use GitHub Actions now?" · "Can someone add me to the right Slack channels?" |
| **THINKS** | "Everyone else seems to know how this all fits together" · "I'm going to look stupid if I ask this again" · "This architecture diagram is from 2023, is it still accurate?" · "I have no idea what half these services do" · "Am I even looking at the right repo?" |
| **DOES** | Searches Slack history obsessively · Opens 47 browser tabs trying to map the system · Reads READMEs that haven't been updated in 18 months · Asks their buddy (assigned mentor) the same question three different ways · Builds a personal Notion doc trying to map services to teams · Sits in standup not understanding half the service names mentioned |
| **FEELS** | Overwhelmed — the system landscape is a fog · Anxious — "I should be productive by now" · Isolated — everyone's too busy to hand-hold · Frustrated — information exists but it's scattered across 6 tools · Embarrassed — asking "who owns this?" feels like admitting ignorance |
#### Pain Points
1. **The Scavenger Hunt** — Finding who owns what requires asking in Slack, searching Confluence, checking GitHub CODEOWNERS, and sometimes just guessing. There's no single source of truth.
2. **Stale Documentation** — 70% of internal docs are outdated. Jordan follows a setup guide, hits an error on step 3, and has no idea if the guide is wrong or they are.
3. **Invisible Architecture** — No one has drawn an accurate system diagram in over a year. The mental model of "how things connect" lives exclusively in senior engineers' heads.
4. **Social Cost of Questions** — Every question Jordan asks interrupts someone. After the third "hey, quick question" DM in a day, Jordan stops asking and starts guessing.
5. **Tool Sprawl** — Services are documented across GitHub READMEs, Confluence, Notion, Google Docs, and Slack pinned messages. There's no index of indexes.
#### Current Workarounds
- Personal Notion database mapping services → owners → repos (manually maintained, already drifting)
- Screenshot collection of architecture whiteboard photos from onboarding
- Bookmarked Slack threads with "useful" context that's already stale
- Relying heavily on one senior engineer who's becoming a bottleneck
- Reading git blame to figure out who last touched a service
#### Jobs To Be Done (JTBD)
- **When** I join a new company, **I want to** quickly understand the service landscape **so I can** contribute meaningfully without feeling lost.
- **When** I'm assigned a ticket involving an unfamiliar service, **I want to** find the owner and documentation in under 60 seconds **so I can** unblock myself without interrupting teammates.
- **When** I hear a service name in standup, **I want to** look it up instantly **so I can** follow the conversation and not feel like an outsider.
#### Day-in-the-Life Scenario
> **9:00 AM** — Jordan opens Slack. 47 unread messages across 12 channels. Half reference services Jordan has never heard of.
>
> **9:30 AM** — Standup. Tech lead mentions "the inventory-sync Lambda is flaking again." Jordan nods. Has no idea what inventory-sync does or where it lives.
>
> **10:00 AM** — Assigned first real ticket: "Add retry logic to the order-processor." Jordan searches GitHub for "order-processor." Three repos come up. Which one is the right one? The README in the first one says "DEPRECATED — see order-processor-v2." The v2 repo has no README.
>
> **10:45 AM** — Jordan DMs their mentor: "Hey, which repo is the actual order-processor?" Mentor is in a meeting. No response for 2 hours.
>
> **11:00 AM** — Jordan searches Confluence for "order-processor." Finds a page from 2024 with an architecture diagram that references services that no longer exist.
>
> **12:00 PM** — Lunch. Jordan feels behind. Other new hires from the same cohort seem to be figuring things out faster (they're not — they're just better at hiding it).
>
> **2:00 PM** — Mentor responds: "Oh yeah, it's order-processor-v2 but we actually call it order-engine now. The repo name is wrong. Talk to @sarah for the runbook."
>
> **2:30 PM** — Jordan DMs Sarah. Sarah is on PTO until Friday.
>
> **3:00 PM** — Jordan has spent 5 hours and written zero lines of code. The ticket is untouched. The frustration is real.
>
> **5:00 PM** — Jordan updates their personal Notion doc with everything learned today. Tomorrow they'll forget half of it.
---

### Persona 2: The Platform Engineer — "Alex"
**Profile:** Senior Platform Engineer, 6 years experience, has been maintaining the company's Backstage instance for 14 months. Alex is burned out. The Backstage instance is a Frankenstein monster of custom plugins, broken upgrades, and YAML files that nobody updates. Alex fantasizes about deleting the entire thing.
#### Empathy Map
| Dimension | Details |
|-----------|---------|
| **SAYS** | "I spend 40% of my time maintaining Backstage instead of building actual platform tooling" · "Nobody updates their catalog-info.yaml" · "The plugin broke again after the upgrade" · "I've become the Backstage janitor" · "We need to simplify this" |
| **THINKS** | "I didn't become an engineer to babysit a React monolith" · "If I leave, nobody can maintain this" · "Backstage was supposed to save us time, not create more work" · "Maybe we should just use a spreadsheet" · "The catalog is 60% lies at this point" |
| **DOES** | Writes custom Backstage plugins that break on every upgrade · Sends monthly "please update your catalog-info.yaml" Slack messages that everyone ignores · Manually fixes broken service entries · Runs Backstage upgrade migrations that take 2-3 days each · Writes documentation for Backstage that nobody reads · Attends Backstage community calls hoping for solutions that never come |
| **FEELS** | Exhausted — maintaining Backstage is a full-time job on top of their actual job · Resentful — "I'm a platform engineer, not a portal admin" · Trapped — too much invested to abandon Backstage, too broken to love it · Lonely — nobody else on the team understands or cares about the IDP · Skeptical — "every new tool promises to be different" |
#### Pain Points
1. **The Maintenance Tax** — Backstage requires constant care: plugin updates, dependency bumps, custom provider maintenance, auth config changes. It's a pet, not cattle.
2. **The YAML Lie** — `catalog-info.yaml` files are the foundation of Backstage, and they're fiction. Teams write them once during onboarding, never update them. The catalog drifts from reality within weeks.
3. **Plugin Fragility** — Community plugins are maintained by volunteers who disappear. Custom plugins break on Backstage upgrades. The plugin ecosystem is a house of cards.
4. **Zero Adoption Metrics** — Alex has no idea if anyone actually uses Backstage. There's no analytics. The portal might have 3 daily users or 30. Alex suspects it's closer to 3.
|
||||||
|
5. **Upgrade Dread** — Every Backstage version bump is a multi-day migration project. Alex is 4 versions behind because the last upgrade broke 3 plugins and took a week to fix.
|
||||||
|
|
||||||
|
#### Current Workarounds

- Wrote a cron job that checks for missing `catalog-info.yaml` files and posts in Slack (everyone mutes the channel)
- Maintains a "known broken" list of Backstage features they tell people to avoid
- Built a simple internal API that returns service ownership from GitHub CODEOWNERS as a backup
- Runs a quarterly "Backstage cleanup day" that nobody attends
- Has a secret spreadsheet that's actually more accurate than Backstage

#### Jobs To Be Done (JTBD)

- **When** I'm responsible for the developer portal, **I want** it to maintain itself **so I can** focus on building actual platform capabilities.
- **When** the catalog data drifts from reality, **I want** automatic reconciliation from source-of-truth systems **so I can** trust the data without manual intervention.
- **When** leadership asks "how's the IDP going?", **I want** real adoption metrics **so I can** prove value or make the case to change course.

#### Day-in-the-Life Scenario

> **8:30 AM** — Alex opens their laptop. Three Slack DMs overnight: "Backstage is showing the wrong owner for auth-service," "The API docs plugin isn't loading," and "Can you add our new service to the catalog?"
>
> **9:00 AM** — Alex investigates the wrong-owner issue. The `catalog-info.yaml` in the auth-service repo lists the previous team. The team was reorged 4 months ago. Nobody updated the YAML. Alex manually fixes it. This is the third time this month.
>
> **9:30 AM** — The API docs plugin. It broke after last week's Backstage patch update. Alex checks the plugin's GitHub repo. Last commit: 8 months ago. The maintainer hasn't responded to issues in 6 months. Alex starts debugging the plugin source code.
>
> **11:00 AM** — Still debugging the plugin. Alex considers writing a replacement from scratch. Estimates 2 weeks of work. Puts it on the backlog that's already 40 items deep.
>
> **11:30 AM** — "Can you add our new service?" Alex sends the team the `catalog-info.yaml` template for the 50th time. They'll fill it out wrong. Alex will fix it later.
>
> **1:00 PM** — Alex's actual platform work (building a new CI/CD pipeline template) has been untouched for 3 days. The sprint review is tomorrow. Alex has nothing to show.
>
> **3:00 PM** — Engineering Director asks: "How many teams are actively using the portal?" Alex doesn't know. Backstage has no built-in analytics. Alex says "most teams" and hopes nobody asks for specifics.
>
> **5:00 PM** — Alex updates their resume. Not seriously. Mostly seriously.

---

### Persona 3: The Engineering Director — "Priya"

**Profile:** Director of Engineering, manages 8 teams (62 engineers), reports to the VP of Engineering. Priya needs to answer hard questions about service ownership, production readiness, and engineering maturity — and currently can't answer any of them with confidence.

#### Empathy Map

| Dimension | Details |
|-----------|---------|
| **SAYS** | "Which teams own which services? I need a complete map" · "Are we production-ready for the SOC 2 audit?" · "How many services don't have runbooks?" · "I need to justify the platform team's headcount" · "Why did that incident take 45 minutes to route to the right team?" |
| **THINKS** | "I'm flying blind on service maturity" · "If an auditor asks me about ownership, I'm going to look incompetent" · "We're spending $200K/year on a platform engineer to maintain Backstage — is that worth it?" · "I bet there are zombie services costing us money that nobody owns" · "The new hires are taking too long to ramp up" |
| **DOES** | Asks team leads for service ownership spreadsheets (gets different answers from each) · Runs quarterly "production readiness" reviews that are manual and painful · Approves platform team budget without clear ROI metrics · Escalates during incidents when nobody knows who owns the broken service · Presents engineering maturity metrics to the VP that are mostly guesswork |
| **FEELS** | Anxious — lack of visibility into the service landscape is a leadership risk · Frustrated — simple questions ("who owns this?") shouldn't require a research project · Pressured — SOC 2 audit is in 3 months and the compliance evidence is scattered · Guilty — knows the platform team is drowning but can't justify more headcount without data · Impatient — "we've been talking about fixing this for a year" |

#### Pain Points

1. **The Ownership Black Hole** — No single, trustworthy source of service-to-team mapping. During incidents, precious minutes are wasted finding the right responder.
2. **Compliance Anxiety** — SOC 2 auditors will ask "show me every service that handles PII and its owner." Today, answering this requires a multi-week manual audit.
3. **Invisible ROI** — The platform team maintains Backstage, but Priya can't quantify the value. Is it saving time? Is anyone using it? She's spending $200K/year on vibes.
4. **Onboarding Drag** — New engineers take 3-4 weeks to become productive. Priya suspects poor internal tooling is a major factor but can't prove it.
5. **Zombie Infrastructure** — Priya knows there are services and AWS resources that nobody owns. They cost money, create security risk, and nobody's accountable.

#### Current Workarounds

- Quarterly manual audit where each team lead submits a spreadsheet of their services (takes 2 weeks, outdated by the time it's compiled)
- Incident post-mortems that repeatedly cite "unclear ownership" as a contributing factor
- A shared Google Sheet titled "Service Ownership Map" that 4 different people maintain with conflicting data
- Relies on tribal knowledge from senior engineers who've been at the company 3+ years
- Asks the platform engineer (Alex) for ad-hoc reports that take days to compile

#### Jobs To Be Done (JTBD)

- **When** an incident occurs, **I want** instant, authoritative service-to-owner mapping **so that** mean time to resolution isn't inflated by "who owns this?" confusion.
- **When** preparing for a compliance audit, **I want** an always-current inventory of services, owners, and maturity attributes **so that** I can produce evidence in minutes, not weeks.
- **When** justifying platform team investment, **I want** concrete adoption and value metrics **so that** I can defend the budget with data, not anecdotes.

#### Day-in-the-Life Scenario

> **7:30 AM** — Priya checks her phone. PagerDuty alert from 2 AM: "payment-gateway latency spike." The incident channel shows 15 minutes of "does anyone know who owns payment-gateway?" before the right engineer was found. MTTR: 47 minutes. Should have been 15.
>
> **9:00 AM** — Leadership sync. VP asks: "How many of our services meet production readiness standards?" Priya says "most of the critical path services." The VP asks for a number. Priya doesn't have one. She promises a report by Friday.
>
> **10:00 AM** — Priya asks Alex (platform engineer) to generate a production readiness report. Alex says it'll take 3-4 days because the data is scattered across Backstage (partially accurate), GitHub (partially complete), and team leads' heads (partially available).
>
> **11:00 AM** — SOC 2 prep meeting. Compliance team asks: "Can you provide a list of all services that process customer data, their owners, and their security controls?" Priya's stomach drops. She knows this will be a fire drill.
>
> **1:00 PM** — 1:1 with a new hire (Jordan). Jordan mentions it took 3 days to figure out which repo to work in. Priya makes a mental note to "fix onboarding" — the same note she's made every quarter for a year.
>
> **3:00 PM** — Budget review. CFO asks why the platform team needs another engineer. Priya can't quantify the current IDP's value. The headcount request is deferred to next quarter.
>
> **4:30 PM** — Priya opens the "Service Ownership Map" Google Sheet. It was last updated 6 weeks ago. Three services listed don't exist anymore. Two new services aren't listed. She closes the tab and sighs.
>
> **6:00 PM** — Priya drafts an email to her team leads: "Please update the service ownership spreadsheet by EOW." She knows from experience that 3 out of 8 will respond, and the data will be inconsistent.

---

### Cross-Persona Insight: The Emotional Thread

There's a jazz chord that connects all three personas — it's the **anxiety of not knowing**:

- **Jordan** doesn't know where things are and is afraid to ask.
- **Alex** doesn't know if anyone uses what they've built and is afraid to find out.
- **Priya** doesn't know the true state of her engineering org and is afraid to be exposed.

The product that resolves this anxiety — that replaces "I don't know" with "I can look it up in 2 seconds" — wins not just their budget, but their loyalty.

> *"The best products don't just solve problems. They dissolve the anxiety that surrounds the problem. That's the difference between a tool and a companion."*

---

## Phase 2: DEFINE 🎯

Time to distill the empathy into precision. We've sat in the chairs. We've felt the splinters. Now we name the splinters — because you can't pull out what you can't name.

---

### Point-of-View (POV) Statements

A POV statement is a design brief in one sentence. It reframes the problem from the user's emotional reality, not the product team's assumptions.

**POV 1 — The New Hire (Jordan)**

> Jordan, a newly hired engineer drowning in scattered tribal knowledge, needs a way to instantly understand the service landscape and find who owns what — because every hour spent on a scavenger hunt for basic context is an hour of productivity lost and confidence eroded.

**POV 2 — The Platform Engineer (Alex)**

> Alex, a burned-out platform engineer trapped maintaining a Backstage instance that nobody trusts, needs a developer portal that maintains itself from real infrastructure data — because the current model of human-maintained YAML catalogs is a lie that creates more work than it eliminates.

**POV 3 — The Engineering Director (Priya)**

> Priya, an engineering director who can't answer basic questions about service ownership and maturity, needs always-current, authoritative visibility into her engineering org's service landscape — because flying blind on ownership creates incident response delays, compliance risk, and an inability to justify platform investment.

---

### "How Might We" Questions

HMW questions are the bridge between empathy and ideation. Each one is a door. Some lead to hallways. Some lead to ballrooms. Let's open them all.

#### Discoverability & Knowledge

1. **HMW** make the entire service landscape searchable in under 2 seconds, so that "who owns this?" is never a Slack question again?
2. **HMW** eliminate stale documentation by generating service context directly from infrastructure reality (AWS, GitHub, K8s) instead of human-written YAML?
3. **HMW** surface the "invisible architecture" — the actual dependency graph — without requiring anyone to draw or maintain diagrams?
4. **HMW** make a new hire's first week feel like orientation instead of archaeology?

#### Maintenance & Trust

5. **HMW** build a catalog that stays accurate without requiring any engineer to manually update it — ever?
6. **HMW** give platform engineers their time back by replacing portal maintenance with portal auto-healing?
7. **HMW** create confidence scores for catalog data so users know what's verified vs. inferred, rather than treating everything as equally trustworthy (or untrustworthy)?
8. **HMW** make the portal's data freshness visible and transparent, so trust is earned through proof, not promises?

#### Visibility & Accountability

9. **HMW** give engineering leaders a real-time, always-current view of service ownership, maturity, and gaps — without quarterly manual audits?
10. **HMW** turn "who owns this?" from a 15-minute incident delay into a 0-second lookup?
11. **HMW** make production readiness measurable and visible so that teams self-correct without top-down mandates?
12. **HMW** surface zombie services and orphaned infrastructure automatically, so cost waste and security risk don't hide in the shadows?

#### Adoption & Stickiness

13. **HMW** make the portal so useful for daily work that engineers voluntarily set it as their browser homepage?
14. **HMW** meet engineers where they already are (Slack, terminal, VS Code) instead of asking them to visit yet another dashboard?
15. **HMW** create a "5-minute setup to first insight" experience that makes Backstage's months-long onboarding feel absurd?
16. **HMW** design the portal so that using it is faster than NOT using it — so adoption is driven by selfishness, not mandates?

---

### Key Insights

These are the non-obvious truths that emerged from empathy. They're the foundation of everything we design.

**Insight 1: The Real Problem Isn't Missing Data — It's Scattered Data**

The information exists. Ownership is in CODEOWNERS. Descriptions are in READMEs. Health is in CloudWatch. On-call is in PagerDuty. The problem isn't that nobody documented anything — it's that nobody aggregated it. The portal's job isn't to create new data. It's to unify existing data into one searchable surface.

**Insight 2: Manual Maintenance Is a Design Flaw, Not a User Failure**

Backstage blames engineers for not updating YAML. That's like blaming users for not reading the manual. If your product requires humans to do repetitive, low-reward maintenance to function, your product is broken. Auto-discovery isn't a feature — it's the correction of a fundamental design error.

**Insight 3: The New Hire Is the Canary in the Coal Mine**

If a new hire can't find what they need in their first week, the entire org has a knowledge management problem. They're just the ones who feel it most acutely. Solving for Jordan solves for everyone — because every engineer is a "new hire" to the 80% of services they've never touched.

**Insight 4: Trust Is Binary — And It's Earned on First Contact**

If an engineer opens the portal and sees one wrong owner or one stale description, they close the tab and never come back. Trust in a catalog is binary: either you trust it or you don't. There's no "mostly trust." This means accuracy on day one matters more than feature completeness. Ship less, but ship truth.

**Insight 5: The Portal Must Be Faster Than Slack**

The current workaround for "who owns this?" is asking in Slack. If the portal is slower than typing a Slack message and waiting for a response, the portal loses. The bar isn't "better than Backstage." The bar is "faster than asking a human." That's Cmd+K in under 2 seconds.

**Insight 6: Directors Don't Need Dashboards — They Need Answers**

Priya doesn't want a dashboard with 47 widgets. She wants to answer three questions: "Who owns what?", "Are we production-ready?", and "What's falling through the cracks?" If the portal can answer those three questions on demand, Priya becomes the champion who sells it to the VP.

---

### Core Tension: Comprehensiveness vs. Simplicity

Here's the jazz dissonance at the heart of this product. It's the tension that will define every design decision:

```
COMPREHENSIVENESS                         SIMPLICITY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Show me everything about                 "Just tell me who owns
every service: dependencies,              this and where the repo
health, cost, SLOs, runbooks,             is. I don't need a
deployment history, tech stack,           dissertation."
security posture, compliance
status..."

→ This is Backstage/Port/Cortex.          → This is a spreadsheet.
  Powerful but overwhelming.                Simple but insufficient.
  Nobody uses it because it's               Everyone outgrows it
  too much.                                 in 3 months.
```

**The Resolution: Progressive Disclosure**

dd0c/portal must be a spreadsheet on the surface and a platform underneath. The default view is radically simple: service name, owner, health, repo link. One line per service. Scannable in 2 seconds.

But depth is one click away. Click the service → dependencies, cost, deployment history, runbook, scorecard. The complexity exists but it doesn't assault you on arrival.

The design principle: **"Calm surface, deep ocean."**

This is how you beat both Backstage (too complex on arrival) AND the Google Sheet (too shallow to grow into). You start simpler than a spreadsheet and grow deeper than Backstage — but only when the user asks for depth.

> *"In jazz, the notes you don't play matter more than the ones you do. The portal's default state should be silence — clean, calm, just the essentials. The solo comes when you lean in."*

---

477
products/04-lightweight-idp/epics/epics.md
Normal file
# dd0c/portal — V1 MVP Epics & Stories

This document outlines the complete set of Epics for the V1 MVP of dd0c/portal, a Lightweight Internal Developer Portal. The scope is strictly limited to the "5-Minute Miracle" auto-discovery from AWS and GitHub, Cmd+K search, basic catalog UI, Slack bot, and self-serve PLG onboarding. No AI agent, no GitLab, no scorecards in V1.

---

## Epic 1: AWS Discovery Engine

**Description:** Build the core AWS scanning capability using a read-only cross-account IAM role. This engine must enumerate CloudFormation stacks, ECS services, Lambda functions, API Gateway APIs, and RDS instances, and group them into inferred "services" based on naming conventions and tags.

### User Stories

**Story 1.1: Cross-Account Role Assumption**
*As a Platform Engineer, I want the discovery engine to securely assume a read-only IAM role in my AWS account using an External ID, so that my infrastructure data can be scanned without sharing long-lived credentials.*

- **Acceptance Criteria:**
  - System successfully assumes the cross-account role using AWS STS.
  - Role assumption enforces a tenant-specific `ExternalId`.
  - Failure to assume the role surfaces a clear error (e.g., "Role not found" or "Invalid ExternalId").
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Use the `boto3` STS client. Ensure the Step Functions orchestrator passes the correct tenant configuration.
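
The AssumeRole call in Story 1.1 can be sketched as a parameter builder plus a thin STS call. A minimal sketch, assuming `boto3`: the function name, the 900-second session duration, and the session name are illustrative, not the actual implementation.

```python
def build_assume_role_params(role_arn: str, external_id: str,
                             session_name: str = "dd0c-discovery") -> dict:
    """Build STS AssumeRole parameters for one tenant scan.

    The tenant-specific ExternalId guards against the confused-deputy
    problem when customers grant the scanner a cross-account role.

    Usage (not executed here):
        sts = boto3.client("sts")
        creds = sts.assume_role(**build_assume_role_params(arn, ext_id))
    """
    if not external_id:
        # Surface a clear error instead of letting STS reject us opaquely.
        raise ValueError("Invalid ExternalId: a tenant-specific value is required")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "ExternalId": external_id,
        "DurationSeconds": 900,  # short-lived: a scan should finish well within this
    }
```

Keeping the parameter construction pure makes the ExternalId enforcement unit-testable without touching AWS.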

**Story 1.2: CloudFormation & Tag Scanner**
*As a System, I want to scan CloudFormation stacks and tagged Resource Groups, so that I can group individual AWS resources into high-confidence "services".*

- **Acceptance Criteria:**
  - System extracts stack names and maps them to service names (Confidence: 0.95).
  - System extracts `service`, `team`, or `project` tags from resources.
  - Lists resources belonging to each stack/tag group.
- **Estimate:** 3
- **Dependencies:** Story 1.1
- **Technical Notes:** Use `cloudformation:DescribeStacks` and `resourcegroupstaggingapi:GetResources`. Parallelize across regions if specified.
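
The grouping logic in Story 1.2 reduces to a fold over raw resources. A sketch under assumed input shapes (dicts with `arn`, and optional `stack_name` and `tags`): the 0.95 stack confidence comes from the story above, while the 0.85 tag weight is an assumption.

```python
def infer_services(resources):
    """Group raw AWS resources into candidate services.

    CloudFormation stack membership is the strongest signal (0.95);
    an explicit `service` tag is next (0.85, illustrative weight).
    Untagged, stack-less resources are left for the compute scanner.
    """
    services = {}
    for res in resources:
        if res.get("stack_name"):
            name, confidence = res["stack_name"], 0.95
        elif res.get("tags", {}).get("service"):
            name, confidence = res["tags"]["service"], 0.85
        else:
            continue  # handled by Story 1.3's standalone-compute path
        svc = services.setdefault(name, {"confidence": 0.0, "resources": []})
        svc["resources"].append(res["arn"])
        # A service keeps the strongest signal seen for any of its resources.
        svc["confidence"] = max(svc["confidence"], confidence)
    return services
```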

**Story 1.3: Compute Resource Enumeration (ECS & Lambda)**
*As a System, I want to list all ECS services and Lambda functions, so that I can discover deployable compute units.*

- **Acceptance Criteria:**
  - System lists all ECS clusters, services, and task definitions (extracting container images).
  - System lists all Lambda functions and their API Gateway event source mappings.
  - Standalone compute resources without a CFN stack are still captured as potential services.
- **Estimate:** 5
- **Dependencies:** Story 1.1
- **Technical Notes:** Requires pagination handling. Lambda cold starts for this Python scanner are acceptable. Output the mapped payload to SQS for the Reconciler.
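
The pagination note above is the classic NextToken loop. A generic sketch — AWS APIs vary between `NextToken`, `Marker`, and `nextToken`, so the key names are parameters; boto3's built-in paginators cover most services, and this loop is for the ones without.

```python
def paginate(fetch_page, token_key="NextToken", items_key="Items"):
    """Yield every item from a token-paginated List* style API.

    `fetch_page` is any callable that accepts an optional token kwarg
    and returns one page as a dict.
    """
    token = None
    while True:
        page = fetch_page(**({token_key: token} if token else {}))
        yield from page.get(items_key, [])
        token = page.get(token_key)
        if not token:  # last page omits the token
            break
```

A stubbed two-page API is enough to exercise the loop end to end.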

**Story 1.4: Database Enumeration (RDS)**
*As a System, I want to list RDS instances, so that I can associate data stores with their consuming services.*

- **Acceptance Criteria:**
  - System lists RDS instances and their tags.
  - Maps databases to services based on naming prefixes or CFN stack membership.
- **Estimate:** 2
- **Dependencies:** Story 1.1, Story 1.2
- **Technical Notes:** Use `rds:DescribeDBInstances`. These are marked as infrastructure attached to a service, rather than services themselves.

## Epic 2: GitHub Discovery

**Description:** Implement the org-wide GitHub scanning to extract repositories, primary languages, CODEOWNERS, README content, and GitHub Actions deployments. Cross-reference this with the AWS scanner output.

### User Stories

**Story 2.1: Org & Repo Scanner**
*As a System, I want to use the GitHub GraphQL API to list all repositories, their primary language, and commit history, so that I can map the codebase landscape.*

- **Acceptance Criteria:**
  - System lists all active (non-archived, non-forked) repos in the connected org.
  - System extracts primary language and top 5 committers per repo.
  - GraphQL queries are batched to avoid rate limits (up to 100 repos per call).
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Implement using `octokit` in a Node.js Lambda function.
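
The batching in Story 2.1 hinges on one paginated GraphQL query. A sketch of the request shape (the scanner itself is Node/octokit; Python here for consistency with the other sketches): the field names follow GitHub's public GraphQL schema, while the wrapper function is illustrative.

```python
from typing import Optional

REPO_PAGE_QUERY = """
query($org: String!, $first: Int!, $after: String) {
  organization(login: $org) {
    repositories(first: $first, after: $after, isFork: false) {
      pageInfo { hasNextPage endCursor }
      nodes {
        name
        isArchived
        primaryLanguage { name }
      }
    }
  }
}
"""

def build_repo_page_request(org: str, cursor: Optional[str] = None,
                            page_size: int = 100) -> dict:
    """One batched request: up to 100 repos per round-trip keeps the
    scan well under GitHub's rate and node limits."""
    return {"query": REPO_PAGE_QUERY,
            "variables": {"org": org, "first": page_size, "after": cursor}}
```

Filtering archived repos happens client-side on `isArchived`, since the `repositories` connection filters forks but not archives in one pass.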

**Story 2.2: CODEOWNERS & README Extraction**
*As a System, I want to extract and parse the CODEOWNERS and README files from the default branch, so that I can infer service ownership and descriptions.*

- **Acceptance Criteria:**
  - System parses `HEAD:CODEOWNERS` and extracts mapped GitHub teams or individuals.
  - System parses `HEAD:README.md` and extracts the first descriptive paragraph.
  - If a team is listed in CODEOWNERS, it becomes a candidate for service ownership (Weight: 0.40).
- **Estimate:** 3
- **Dependencies:** Story 2.1
- **Technical Notes:** The GraphQL expression `HEAD:CODEOWNERS` retrieves blob text efficiently in the same repo query.
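
CODEOWNERS parsing is simple enough to sketch fully. This follows GitHub's documented format (comments, whitespace-separated owners, last matching rule wins); full gitignore-style glob matching is out of scope for the sketch.

```python
def parse_codeowners(text: str):
    """Parse a CODEOWNERS blob into ordered (pattern, owners) rules.

    Comments and blank lines are dropped; only @user / @org/team
    handles are kept. Order is preserved because, per GitHub
    semantics, the *last* matching rule wins.
    """
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip trailing comments
        if not line:
            continue
        pattern, *rest = line.split()
        owners = [o for o in rest if o.startswith("@")]
        if owners:
            rules.append((pattern, owners))
    return rules
```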

**Story 2.3: CI/CD Target Extraction**
*As a System, I want to scan `.github/workflows/deploy.yml` for deployment targets, so that I can link repositories to specific AWS infrastructure (ECS/Lambda).*

- **Acceptance Criteria:**
  - System parses workflow YAML to find deployment actions (e.g., `aws-actions/amazon-ecs-deploy-task-definition`).
  - System links the repo to a discovered AWS service if the task definition name matches.
- **Estimate:** 5
- **Dependencies:** Story 2.1, Epic 1
- **Technical Notes:** This is crucial for cross-referencing in the Reconciliation Engine.
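
A naive sketch of the target extraction: a production version should parse the workflow YAML properly (e.g., with PyYAML), but a line-based regex is enough to show the repo-to-infrastructure linkage. The `service`, `cluster`, and `task-definition` keys are inputs of the ECS deploy action named in the story.

```python
import re

ECS_DEPLOY_ACTION = "aws-actions/amazon-ecs-deploy-task-definition"

def extract_ecs_targets(workflow_text: str) -> dict:
    """Scan a workflow file for the ECS deploy action and pull the
    inputs that let the Reconciler match this repo to an AWS service."""
    if ECS_DEPLOY_ACTION not in workflow_text:
        return {}
    targets = {}
    for key in ("service", "cluster", "task-definition"):
        m = re.search(rf"^\s*{key}:\s*(\S+)", workflow_text, re.MULTILINE)
        if m:
            targets[key] = m.group(1)
    return targets
```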

**Story 2.4: Reconciliation & Ownership Inference Engine**
*As a System, I want to merge the AWS infrastructure payloads and the GitHub repository payloads into a unified service entity, and calculate a confidence score for ownership.*

- **Acceptance Criteria:**
  - AWS resources are deduplicated and mapped to corresponding GitHub repos based on tags, deploy workflows, or fuzzy naming (e.g., "payment-api" matches "payment-service").
  - Ownership is scored based on CODEOWNERS, Git blame frequency, and tags.
  - Final merged entity is pushed to PostgreSQL as a "Service".
- **Estimate:** 8
- **Dependencies:** Story 1.2, 1.3, 2.1, 2.2, 2.3
- **Technical Notes:** This runs in a Node.js Lambda triggered by the Step Functions orchestrator. It processes batches from the SQS queues holding AWS and GitHub scan results.
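
The fuzzy-naming criterion ("payment-api" matches "payment-service") can be sketched as normalize-then-compare. The suffix list is an assumption; the point is that generic tokens are stripped before comparison so infra and repo names collapse to the same stem.

```python
import re
from difflib import SequenceMatcher

# Illustrative set of suffix tokens that carry no identity.
GENERIC_SUFFIXES = {"service", "svc", "api", "app", "server"}

def normalize(name: str) -> str:
    """Lowercase, split on delimiters, drop trailing generic tokens,
    so 'payment-api' and 'payment-service' both reduce to 'payment'."""
    tokens = [t for t in re.split(r"[-_./]+", name.lower()) if t]
    while len(tokens) > 1 and tokens[-1] in GENERIC_SUFFIXES:
        tokens.pop()
    return "-".join(tokens)

def name_similarity(a: str, b: str) -> float:
    na, nb = normalize(a), normalize(b)
    return 1.0 if na == nb else SequenceMatcher(None, na, nb).ratio()
```

In the Reconciler, a similarity above some threshold (say 0.85, tunable) would propose a repo-to-resource link for the lower-confidence merge path.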

## Epic 3: Service Catalog

**Description:** Implement the primary datastore (Aurora Serverless v2 PostgreSQL), the Service API (CRUD), ownership mapping logic, and manual enrichment endpoints. This is the source of truth for the entire platform.

### User Stories

**Story 3.1: Core Catalog Schema Setup**
*As a System, I want a PostgreSQL relational database to store tenants, services, users, and teams, so that the discovered catalog data is durably stored.*

- **Acceptance Criteria:**
  - Create the schema per the architecture (tenants, users, connections, services, teams, service_ownership, discovery_runs).
  - Apply multi-tenant Row-Level Security (RLS) on all core tables using `tenant_id`.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Use Prisma or raw SQL migrations. Implement API middleware to set `tenant_id` on every request.
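
One way to express the RLS requirement in PostgreSQL, shown for the `services` table — the policy name and the `app.tenant_id` session setting are illustrative; the middleware mentioned above would issue the `SET LOCAL` per request:

```sql
-- Scope every query on services to the caller's tenant.
-- The API middleware sets the tenant per transaction:
--   SET LOCAL app.tenant_id = '<tenant uuid>';
ALTER TABLE services ENABLE ROW LEVEL SECURITY;
ALTER TABLE services FORCE ROW LEVEL SECURITY;  -- applies even to the table owner

CREATE POLICY tenant_isolation ON services
  USING (tenant_id = current_setting('app.tenant_id')::uuid);
```

The same policy would be repeated for each core table listed in the acceptance criteria.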

**Story 3.2: Service Ownership Model Implementation**
*As a System, I want a `service_ownership` table and scoring logic, so that multiple ownership signals (CODEOWNERS vs Git Blame vs CFN tags) can be tracked and weighted.*

- **Acceptance Criteria:**
  - System stores multiple candidate teams per service with ownership types and confidence scores.
  - The highest-confidence team becomes the primary owner.
  - If top scores are tied or under 0.50, flag the service as "ambiguous" or "unowned".
- **Estimate:** 5
- **Dependencies:** Story 3.1
- **Technical Notes:** Implement the scoring algorithm in a Python Lambda and map back to PostgreSQL.
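
The scoring rules above can be sketched directly. The 0.40 CODEOWNERS weight comes from Story 2.2 and the 0.50 ambiguity threshold from the acceptance criteria; everything else (input shape, status strings) is an assumption.

```python
AMBIGUITY_THRESHOLD = 0.50

def resolve_owner(candidates: dict):
    """Resolve the primary owner from per-team weighted signals.

    `candidates` maps team -> {signal: score}, e.g. CODEOWNERS at 0.40.
    Returns (team, status) where status is 'primary', 'ambiguous',
    or 'unowned'.
    """
    if not candidates:
        return None, "unowned"
    totals = {team: sum(signals.values()) for team, signals in candidates.items()}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    top_team, top_score = ranked[0]
    tied = len(ranked) > 1 and abs(ranked[1][1] - top_score) < 1e-9
    if top_score < AMBIGUITY_THRESHOLD or tied:
        return top_team, "ambiguous"
    return top_team, "primary"
```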

**Story 3.3: Manual Correction API**
*As an Engineer, I want to manually override the inferred owner or description of a service, so that I can fix incorrect auto-discovered data.*

- **Acceptance Criteria:**
  - `PATCH /api/v1/services/{service_id}` allows updates to `team_id`, `description`, etc.
  - Manual corrections override inferred data with a confidence score of 1.00 (`source="user_correction"`).
  - The correction is saved in the `corrections` log table.
- **Estimate:** 3
- **Dependencies:** Story 3.1
- **Technical Notes:** The update should trigger an async background job (SQS) to update the Meilisearch index immediately.
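
The override semantics can be sketched as a pure merge: a user correction pins the field at confidence 1.00, and later discovery runs must refuse to overwrite it. The `field_meta` shape is an assumption; the confidence and `source="user_correction"` values come from the acceptance criteria.

```python
def apply_correction(service: dict, field: str, value) -> dict:
    """Apply a user override; user corrections always win (confidence 1.00)."""
    updated = dict(service)
    updated[field] = value
    meta = dict(updated.get("field_meta", {}))
    meta[field] = {"confidence": 1.00, "source": "user_correction"}
    updated["field_meta"] = meta
    return updated

def discovery_may_write(service: dict, field: str) -> bool:
    """The next discovery run must not clobber a user correction."""
    return service.get("field_meta", {}).get(field, {}).get("source") != "user_correction"
```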

**Story 3.4: PagerDuty / OpsGenie Integration**
*As an Engineering Manager, I want to link my team to a PagerDuty schedule, so that the service catalog shows the current on-call engineer for each service.*

- **Acceptance Criteria:**
  - The API allows saving an encrypted PagerDuty API key per tenant.
  - The system maps PagerDuty schedules to inferred Teams.
  - `GET /api/v1/services/{service_id}` returns the active on-call individual.
- **Estimate:** 5
- **Dependencies:** Story 3.1
- **Technical Notes:** Credentials must be stored in AWS Secrets Manager using KMS. Use the PagerDuty `GET /schedules` API endpoint.

## Epic 4: Search Engine

**Description:** Deploy Meilisearch and a Redis cache to support a <100ms Cmd+K search bar for the entire portal. The search must be typo-tolerant and isolate tenant data perfectly.

### User Stories

**Story 4.1: Meilisearch Index Sync**
*As a System, I want to sync service entities from PostgreSQL to Meilisearch, so that I have a fast, full-text index for the UI.*

- **Acceptance Criteria:**
  - On every service upsert in PostgreSQL, publish to SQS.
  - A Lambda consumes SQS and updates Meilisearch via `addDocuments`.
  - The index configuration is applied (searchable attributes, filterable attributes, typo-tolerance enabled).
- **Estimate:** 5
- **Dependencies:** Story 3.1
- **Technical Notes:** The index sync must map relational data to a flat JSON structure.
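
The relational-to-flat mapping in the technical note can be sketched as a single transform. Field names are illustrative; the one hard requirement is that `tenant_id` lands on every document and is configured as a filterable attribute, since Story 4.2's isolation depends on it.

```python
def to_search_document(service: dict) -> dict:
    """Flatten a joined service row (service + team) into the flat
    JSON shape Meilisearch indexes."""
    return {
        "id": service["id"],
        "tenant_id": service["tenant_id"],  # must be filterable for isolation
        "name": service["name"],
        "description": service.get("description") or "",
        "owner_team": (service.get("team") or {}).get("name", ""),
        "language": service.get("language") or "",
        "repo_url": service.get("repo_url") or "",
    }
```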

**Story 4.2: Cmd+K Search Endpoint & Security**
*As an Engineer, I want a fast `/api/v1/services/search` endpoint that queries Meilisearch, so that I can instantly find services.*

- **Acceptance Criteria:**
  - API receives the query and proxies it to Meilisearch.
  - API enforces tenant isolation by injecting a `tenant_id = '{current_tenant}'` filter.
  - Search returns results in <100ms.
- **Estimate:** 3
- **Dependencies:** Story 4.1
- **Technical Notes:** Implement in Node.js with Fastify.
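
The filter injection is the security-critical piece, so it is worth isolating. A sketch (the endpoint itself is Node/Fastify; Python here for consistency): the filter string follows Meilisearch's filter syntax from the acceptance criteria, and the tenant comes from the authenticated session, never from client input.

```python
def build_search_request(query: str, tenant_id: str, limit: int = 10) -> dict:
    """Build the Meilisearch search body with the tenant filter
    injected server-side."""
    if not tenant_id:
        # Fail closed: never run an unscoped search.
        raise ValueError("tenant_id is required: refusing an unscoped search")
    return {"q": query, "limit": limit, "filter": f"tenant_id = '{tenant_id}'"}
```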

**Story 4.3: Prefix Caching with Redis**
*As a System, I want to cache the most common searches in Redis, so that I can return results in <10ms for repeated queries.*

- **Acceptance Criteria:**
  - Cache identical query prefixes per tenant in ElastiCache Redis.
  - Set a TTL of 5 minutes, or invalidate on service upserts.
  - A Redis cache miss falls back to Meilisearch.
- **Estimate:** 2
- **Dependencies:** Story 4.2
- **Technical Notes:** Serverless Redis pricing scales with usage. Use normalized queries as the cache key.
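
The "normalized queries as the cache key" note can be sketched as a small key-derivation function. The `search:` prefix and digest length are illustrative; the point is that trivially different spellings of the same query share one entry, and keys are always tenant-scoped.

```python
import hashlib

def cache_key(tenant_id: str, query: str) -> str:
    """Derive a per-tenant Redis key from a normalized query so that
    'Payment ' and 'payment' hit the same cache entry. Hashing keeps
    keys short and delimiter-safe."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    return f"search:{tenant_id}:{digest}"
```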
|
## Epic 5: Service Cards UI
|
||||||
|
**Description:** Build the React Single Page Application (SPA) providing a fast, scannable catalog interface, Cmd+K search dialog, and detailed service cards.
|
||||||
|
|
||||||
|
### User Stories
|
||||||
|
|
||||||
|
**Story 5.1: SPA Framework & Routing Setup**
|
||||||
|
*As an Engineer, I want the portal built in React with Vite and TailwindCSS, so that the UI is fast and responsive.*
|
||||||
|
- **Acceptance Criteria:**
|
||||||
|
- Setup React + Vite + React Router.
|
||||||
|
- Deploy to S3 and CloudFront with edge caching.
|
||||||
|
- Implement a basic authentication context tied to GitHub OAuth.
|
||||||
|
- **Estimate:** 2
|
||||||
|
- **Dependencies:** None
|
||||||
|
- **Technical Notes:** The frontend state must be minimal, relying heavily on SWR or React Query for data fetching.

**Story 5.2: The Cmd+K Modal**

*As an Engineer, I want to press Cmd+K to open a global search bar, so that I can find a service or team instantly from anywhere.*

- **Acceptance Criteria:**
- Pressing Cmd+K (or Ctrl+K) opens a modal overlay.
- Input debounces keystrokes by 150ms before calling `/api/v1/services/search`.
- Arrow keys navigate search results; Enter selects a service.
- **Estimate:** 5
- **Dependencies:** Story 4.2, Story 5.1
- **Technical Notes:** Use an accessible dialog library like Radix UI or Headless UI. Ensure ARIA labels are correct.

**Story 5.3: Scannable Service List View**

*As an Engineering Director, I want to view all my organization's services in a single-line-per-service table, so that I can quickly review ownership and health.*

- **Acceptance Criteria:**
- The default dashboard displays a paginated or infinite-scroll table of services.
- Columns: Name, Owner (with confidence badge), Health, Last Deploy, Repo Link.
- The table is filterable by team, health, and tech stack.
- **Estimate:** 5
- **Dependencies:** Story 3.1, Story 5.1
- **Technical Notes:** Use a lightweight table component (e.g., TanStack Table).

**Story 5.4: Service Detail View (Progressive Disclosure)**

*As an Engineer, I want to click on a service row to see its full details in a slide-over panel or expanded card, so that I can dive deeper when necessary.*

- **Acceptance Criteria:**
- Clicking a service expands an in-page panel showing full details.
- Tabs separate data: Overview (description, stack), Infra (AWS ARNs), On-Call, Corrections.
- There is a "Correct" button next to the inferred owner or description.
- **Estimate:** 8
- **Dependencies:** Story 3.4, Story 5.1, Story 5.3
- **Technical Notes:** Lazy-load tab contents from the API (`GET /api/v1/services/{service_id}`) to minimize the initial render payload.

## Epic 6: Dashboard & Overview

**Description:** Implement the org-wide and team-specific dashboards that provide aggregate health, ownership matrix, and discovery status.

### User Stories

**Story 6.1: The Director Dashboard**

*As an Engineering Director, I want an org-wide view of my total services, teams, unowned services, and discovery accuracy, so that I can report to leadership and ensure compliance.*

- **Acceptance Criteria:**
- The dashboard displays four summary KPIs: Total Services, Total Teams, Unowned Services, Accuracy Trend (week-over-week).
- Contains a specific card showing "Services needing review".
- A "Recent Activity" feed lists the latest deploys or ownership changes.
- **Estimate:** 5
- **Dependencies:** Story 3.1, Story 5.1
- **Technical Notes:** The API needs a new `/api/v1/dashboard/stats` endpoint to compute these KPIs efficiently using PostgreSQL aggregations.

**Story 6.2: "My Team" Dashboard Focus**

*As an Engineer, I want the dashboard to default to showing services owned by my team, so that I don't have to filter out the noise of the entire org.*

- **Acceptance Criteria:**
- The UI uses the logged-in user's GitHub ID to find their Team.
- A "Your Team" section lists only their services with immediate health status.
- A "Recent" section shows the last 5 services the user viewed.
- **Estimate:** 3
- **Dependencies:** Story 3.1, Story 6.1
- **Technical Notes:** Use browser local storage to save the "recent views".

## Epic 7: Slack Bot

**Description:** Build the Slack integration to allow engineers to query the service catalog using `/dd0c who owns <service>` and `/dd0c oncall <service>`.

### User Stories

**Story 7.1: Slack App & OAuth Setup**

*As an Administrator, I want to add a dd0c Slack app to my workspace, so that engineers can use slash commands.*

- **Acceptance Criteria:**
- Create the Slack App using `@slack/bolt`.
- The API handles the OAuth flow and saves the workspace token to the tenant `connections` table.
- The bot is added to a tenant's workspace and receives slash commands.
- **Estimate:** 3
- **Dependencies:** Story 3.1
- **Technical Notes:** The Slack Bot Lambda must verify Slack request signatures and return a 200 OK within 3 seconds.

**Story 7.2: Slash Command: /dd0c who owns**

*As an Engineer, I want to type `/dd0c who owns <service_name>`, so that the bot instantly replies with the owner, repo, and health.*

- **Acceptance Criteria:**
- The bot receives the command, extracts the service name, and calls `GET /api/v1/services/search`.
- The bot formats the response as a Slack Block Kit message with the service name, owner team, confidence score, repo link, and a link to the portal.
- The response is ephemeral unless the user specifies otherwise.
- **Estimate:** 3
- **Dependencies:** Story 4.2, Story 7.1
- **Technical Notes:** Query Meilisearch directly or via the API. Meilisearch's built-in typo tolerance covers the case where the user misspells the service name.

**Story 7.3: Slash Command: /dd0c oncall**

*As an Engineer, I want to type `/dd0c oncall <service_name>`, so that the bot instantly tells me who is currently on-call for that service.*

- **Acceptance Criteria:**
- The bot receives the command, looks up the service, and queries PagerDuty (via the API).
- The bot returns the active on-call individual and schedule details.
- **Estimate:** 3
- **Dependencies:** Story 3.4, Story 7.2
- **Technical Notes:** If no on-call is configured, the bot returns a friendly error with a link to set it up in the portal.

## Epic 8: Infrastructure & DevOps

**Description:** Implement the AWS infrastructure for the dd0c platform itself using Infrastructure as Code, set up the CI/CD pipeline via GitHub Actions, and deploy the foundational resources including VPC, ECS Fargate clusters, RDS Aurora, and the cross-account IAM role assumption engine.

### User Stories

**Story 8.1: Core AWS Foundation (VPC, RDS, ElastiCache)**

*As a System, I need a secure VPC and data tier, so that the Portal API and Discovery Engine have durable, isolated storage.*

- **Acceptance Criteria:**
- Deploy a VPC with public and private subnets.
- Provision an Aurora Serverless v2 PostgreSQL database in private subnets.
- Provision ElastiCache Redis (Serverless) for caching and sessions.
- Deploy KMS keys for credential encryption.
- **Estimate:** 5
- **Dependencies:** None
- **Technical Notes:** Use AWS CDK or CloudFormation. Ensure all data stores are encrypted at rest using KMS.

**Story 8.2: ECS Fargate Cluster & Portal API Deployment**

*As a System, I need an ECS Fargate cluster running the Portal API and Meilisearch, so that the web application and search engine are highly available.*

- **Acceptance Criteria:**
- Create the ECS cluster.
- Deploy the Portal API Fargate service behind an Application Load Balancer.
- Deploy the Meilisearch Fargate service with an EFS volume for persistent index storage.
- Configure auto-scaling rules based on CPU and request count.
- **Estimate:** 5
- **Dependencies:** Story 8.1
- **Technical Notes:** Meilisearch only needs 1 task initially. Use multi-stage Docker builds to keep image sizes small.

**Story 8.3: Discovery Engine Serverless Deployment**

*As a System, I need the Step Functions orchestrator and Lambda scanners deployed, so that the auto-discovery pipeline can execute.*

- **Acceptance Criteria:**
- Deploy the AWS Scanner, GitHub Scanner, Reconciler, and Inference Lambdas.
- Deploy the Step Functions state machine linking the Lambdas.
- Provision SQS FIFO queues for discovery events.
- **Estimate:** 3
- **Dependencies:** Epic 1, Epic 2, Story 8.1
- **Technical Notes:** Lambdas must have VPC access to write to Aurora, but need a NAT Gateway to reach the GitHub API.

**Story 8.4: CI/CD Pipeline via GitHub Actions**

*As an Engineer, I want automated CI/CD pipelines, so that I can safely build, test, and deploy the platform without manual intervention.*

- **Acceptance Criteria:**
- The CI workflow runs linting, unit tests, and Trivy container scanning on every PR.
- The CD workflow deploys to the staging environment, runs a discovery accuracy smoke test, and waits for manual approval to deploy to production.
- Deployment updates ECS services and Lambda aliases seamlessly.
- **Estimate:** 5
- **Dependencies:** Story 8.2, Story 8.3
- **Technical Notes:** Keep it simple: use GitHub Actions natively and avoid complex external CD tools for V1.

**Story 8.5: Customer IAM Role Template**

*As an Administrator, I want a standardized CloudFormation template to deploy in my AWS account, so that I can easily grant dd0c read-only access.*

- **Acceptance Criteria:**
- The template creates an IAM role with a strict read-only policy mapped to dd0c's required services.
- The trust policy mandates a tenant-specific `ExternalId`.
- The template output provides the Role ARN.
- The template is hosted publicly on S3.
- **Estimate:** 2
- **Dependencies:** Epic 1
- **Technical Notes:** Avoid the `ReadOnlyAccess` managed policy. Explicitly deny IAM, S3 object access, and Secrets Manager.
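The tenant-specific `ExternalId` requirement translates into a trust policy like the following sketch. The `sts:ExternalId` condition key and the overall shape follow AWS's standard cross-account access pattern; the angle-bracket placeholders are values the template would substitute per tenant.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<DD0C_ACCOUNT_ID>:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<TENANT_EXTERNAL_ID>" }
      }
    }
  ]
}
```

Without the `ExternalId` condition, any dd0c tenant who learned another tenant's Role ARN could request access to it (the confused-deputy problem); with it, the assume-role call fails unless both values match.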

## Epic 9: Onboarding & PLG

**Description:** Build the 5-minute self-serve onboarding wizard that drives the Product-Led Growth (PLG) motion. This flow must flawlessly guide the user through GitHub OAuth, AWS connection, and trigger the initial real-time discovery scan.

### User Stories

**Story 9.1: GitHub OAuth & Tenant Creation**

*As a New User, I want to sign up using my GitHub account, so that I don't have to create a new password and the system can instantly identify my organization.*

- **Acceptance Criteria:**
- The user clicks "Sign up with GitHub" and authorizes the dd0c GitHub App.
- The system extracts the user identity and organization context.
- The system creates a new Tenant and User record in PostgreSQL.
- A JWT session is established and the user is routed to the setup wizard.
- **Estimate:** 3
- **Dependencies:** Story 3.1, Story 5.1
- **Technical Notes:** Request minimal scopes initially. Store the installation ID securely.

**Story 9.2: Stripe Billing Integration (Free & Team Tiers)**

*As a New User, I want to select a pricing plan, so that I can start using the product within my budget.*

- **Acceptance Criteria:**
- The wizard prompts the user to select "Free" (up to 10 engineers) or "Team" ($10/engineer/mo).
- If Team is selected, the user is redirected to Stripe Checkout.
- A webhook listener updates the tenant subscription status upon successful payment.
- **Estimate:** 5
- **Dependencies:** Story 9.1
- **Technical Notes:** Use Stripe Checkout sessions. Keep the webhook Lambda extremely fast to avoid Stripe timeouts.

**Story 9.3: AWS Connection Wizard Step**

*As a New User, I want a frictionless way to connect my AWS account, so that the discovery engine can access my infrastructure.*

- **Acceptance Criteria:**
- The UI displays a one-click CloudFormation deployment link pre-populated with a unique `ExternalId`.
- The user pastes the generated Role ARN back into the UI.
- The API validates the role via `sts:AssumeRole` before proceeding.
- **Estimate:** 3
- **Dependencies:** Story 8.5, Story 9.1
- **Technical Notes:** The `sts:AssumeRole` call validates both the ARN and the ExternalId. Give clear error messages if validation fails.

**Story 9.4: Real-Time Discovery WebSocket**

*As a New User, I want to see the discovery engine working in real-time, so that I trust the system and get that "Holy Shit" moment.*

- **Acceptance Criteria:**
- An API Gateway WebSocket API maintains a connection with the onboarding SPA.
- Step Functions and Lambdas push progress events (e.g., "Found 47 AWS resources...") to the UI via the WebSocket.
- When complete, the user is automatically redirected to their populated Service Catalog.
- **Estimate:** 5
- **Dependencies:** Epic 1, Epic 2, Story 9.3
- **Technical Notes:** Implement via the API Gateway WebSocket API and a simple broadcasting Lambda.

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/portal adheres to the 5 Transparent Factory tenets. As an Internal Developer Platform, portal is the control plane for other teams' services — governance and observability are existential requirements, not nice-to-haves.

### Story 10.1: Atomic Flagging — Feature Flags for Discovery & Catalog Behaviors

**As a** solo founder, **I want** every new auto-discovery heuristic, catalog enrichment, and scorecard rule behind a feature flag (default: off), **so that** a bad discovery rule doesn't pollute the service catalog with phantom services.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the catalog service. V1: JSON file provider.
- All flags evaluate locally — no network calls during service discovery scans.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged discovery rule creates >5 unconfirmed services in a single scan, the flag auto-disables and the phantom entries are quarantined (not deleted).
- Flags required for: new discovery sources (GitHub, GitLab, K8s), scorecard criteria, ownership inference rules, template generators.

**Estimate:** 5 points

**Dependencies:** Epic 1 (Service Discovery)

**Technical Notes:**

- Quarantine pattern: flagged phantom services get `status: quarantined` rather than deletion. Allows review before purge.
- Discovery scans are batch operations — the flag check happens once per scan config, not per service.

### Story 10.2: Elastic Schema — Additive-Only for Service Catalog Store

**As a** solo founder, **I want** all DynamoDB catalog schema changes to be strictly additive, **so that** rollbacks never corrupt the service catalog or lose ownership mappings.

**Acceptance Criteria:**

- CI rejects any schema change that removes, renames, or changes the type of existing DynamoDB attributes.
- New attributes use a `_v2` suffix for breaking changes.
- All service catalog parsers ignore unknown fields (`encoding/json` with flexible unmarshalling).
- Dual-write during migration windows within `TransactWriteItems`.
- Every schema change includes a `sunset_date` comment (max 30 days).

**Estimate:** 3 points

**Dependencies:** Epic 2 (Catalog Store)

**Technical Notes:**

- The service catalog is the source of truth for org topology — schema corruption here cascades to scorecards, ownership, and templates.
- DynamoDB Single Table Design: version items with a `_v` attribute. Use item-level versioning, not table duplication.
- Ownership mappings are especially sensitive — never overwrite, always append with a timestamp.

### Story 10.3: Cognitive Durability — Decision Logs for Ownership Inference

**As a** future maintainer, **I want** every change to ownership inference algorithms, scorecard weights, or discovery heuristics accompanied by a `decision_log.json`, **so that** I understand why service X was assigned to team Y.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/discovery/`, `pkg/scoring/`, or `pkg/ownership/`.
- Cyclomatic complexity cap of 10 via `golangci-lint`. PRs exceeding this are blocked.
- Decision logs live in `docs/decisions/`.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- Ownership inference is the highest-risk logic — wrong assignments erode trust in the entire platform.
- Document the signal precedence rationale: why CODEOWNERS outranks git blame frequency, which outranks PR reviewer count, as an ownership signal.
- Scorecard weight changes must include before/after examples showing how scores shift.
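A filled-in example of the schema above may help; every field value here is invented for illustration, only the field names come from the acceptance criteria.

```json
{
  "prompt": "Should CODEOWNERS outrank git blame frequency as an ownership signal?",
  "reasoning": "CODEOWNERS is an explicit declaration of intent; blame frequency is noisy under refactors and formatting commits.",
  "alternatives_considered": ["blame-frequency-first", "PR-reviewer-count-first"],
  "confidence": 0.8,
  "timestamp": "2026-02-28T00:00:00Z",
  "author": "brian"
}
```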

### Story 10.4: Semantic Observability — AI Reasoning Spans on Discovery & Scoring

**As a** platform engineer debugging a wrong ownership assignment, **I want** every discovery and scoring decision to emit an OpenTelemetry span with reasoning metadata, **so that** I can trace why a service was assigned to the wrong team.

**Acceptance Criteria:**

- Every discovery scan emits a parent `catalog_scan` span. Each service evaluation emits child spans: `ownership_inference`, `scorecard_evaluation`.
- Span attributes: `catalog.service_id`, `catalog.ownership_signals` (JSON array of signal sources + weights), `catalog.confidence_score`, `catalog.scorecard_result`.
- If AI-assisted inference is used: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- Spans export via OTLP. No PII — repo names and team names are hashed in spans.

**Estimate:** 3 points

**Dependencies:** Epic 1 (Service Discovery), Epic 3 (Scorecards)

**Technical Notes:**

- Use `go.opentelemetry.io/otel`. Batch export to avoid per-service overhead during large scans.
- Ownership inference spans should include ALL signals considered, not just the winning one — this is critical for debugging.

### Story 10.5: Configurable Autonomy — Governance for Catalog Mutations

**As a** solo founder, **I want** a `policy.json` that controls whether the platform can auto-update ownership, auto-create services, or only suggest changes, **so that** teams maintain control over their catalog entries.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (suggest-only, no auto-mutations) or `audit` (auto-apply with logging).
- Default: `strict`. Auto-discovery populates a "pending review" queue rather than directly mutating the catalog.
- `panic_mode`: when true, all discovery scans halt, the catalog is frozen read-only, and a "maintenance" banner shows in the UI.
- Per-team governance override: teams can lock their services to `strict` even if the system is in `audit` mode.
- All policy decisions are logged: "Service X auto-created in audit mode", "Ownership change for Y queued for review in strict mode".

**Estimate:** 3 points

**Dependencies:** Epic 1 (Service Discovery)

**Technical Notes:**

- The "pending review" queue is a DynamoDB table with `status: pending`. The UI shows a review inbox for platform admins.
- Per-team locks: a `team_settings` item in DynamoDB with a `governance_override` field.
- Panic mode: `POST /admin/panic` or an env var. The catalog API returns 503 for all write operations.

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |

---

**File:** `products/04-lightweight-idp/innovation-strategy/session.md` (1001 lines — diff suppressed because it is too large)

**File:** `products/04-lightweight-idp/party-mode/session.md` (96 lines)

---

# dd0c/portal — Advisory Board "Party Mode" Review

## Round 1: INDIVIDUAL REVIEWS

### 1. The VC

**What excites me:** The wedge. You're attacking the bottom of the market that Port and Cortex are completely ignoring because their VC masters demand $50k enterprise ACVs. The $10/eng price point is a beautiful bottom-up PLG motion.

**What worries me:** The moat. Specifically, GitHub. It's an existential kill shot. If GitHub adds a "Service Catalog" tab that just aggregates CODEOWNERS and Actions, your entire TAM evaporates overnight. Also, building integrations for AWS, GitHub, PagerDuty, and Slack is a whole team's job, not a solo dev's.

**Vote:** CONDITIONAL (Prove the GitHub moat).

### 2. The CTO

**What excites me:** Finally, someone admitting that `catalog-info.yaml` is a lie. Generating the catalog from infrastructure truth (AWS tags, Terraform state) instead of human intentions is exactly how it should work.

**What worries me:** Auto-discovery accuracy. You claim 80% day-one accuracy. I've been doing this 20 years, and I've never seen an AWS environment clean enough to auto-discover accurately. If my devs log in and see 40% garbage data, they will never trust the tool again. Trust is binary here.

**Vote:** CONDITIONAL (Show me the discovery engine works on a messy, real-world AWS account).

### 3. The Bootstrap Founder

**What excites me:** The $10/engineer/month self-serve model. 50 engineers = $500/mo. You only need 20 mid-sized customers to hit $10k MRR. The "Backstage Migrator" and the calculator tool are genius low-friction acquisition channels.

**What worries me:** The surface area. A portal needs constant care and feeding of API integrations. AWS changes an API? Your portal breaks. GitHub rate limits? Portal breaks. Supporting this solo while trying to market it is a fast track to burnout.

**Vote:** GO (If you keep the scope aggressively small).

### 4. The Platform Engineer

**What excites me:** No YAML. Oh my god, no YAML. I spent 14 months babysitting Backstage and fighting community plugins. If I can connect an IAM role and GitHub OAuth and get a 2-second Cmd+K search for my team, I'll put my corporate card in right now.

**What worries me:** Will people actually use it daily? I already have a Confluence page that's *mostly* right. If this is just another dashboard I have to bookmark, it'll die. It has to be faster than asking in Slack.

**Vote:** GO. Take my money and save me from Spotify's monolith.

### 5. The Contrarian

**What excites me:** Nothing. You're building a feature, not a product.

**What worries me:** You are trying to sell a glorified spreadsheet. "We'll map your AWS to your GitHub!" Cool, I can do that in Notion with a Python script on a cron job. You're betting against human nature—if the org is chaotic enough to need this, they won't have the clean AWS tags required for your auto-discovery to work. If they have clean tags, they don't need you.

**Vote:** NO-GO.

## Round 2: CROSS-EXAMINATION

**1. The VC:** Okay, Mr. Bootstrap. You love the $10/eng math. "Just 20 customers to $10k MRR!" But what is the TAM, really? Are there enough 50-engineer teams that will actually pull out a credit card for a service catalog?

**2. The Bootstrap Founder:** The SAM is $840M, easily. A solo founder doesn't need to conquer the world. We just need 100 teams to hit half a mil ARR. The Backstage graveyard is full of teams practically begging for this.

**3. The VC:** Yeah, until those 50-eng teams grow to 200 engineers, realize they need RBAC and compliance reports, and immediately churn to Port or Cortex. You're building a stepping stone, not a billion-dollar company. You have zero enterprise retention.

**4. The Bootstrap Founder:** Who cares about a billion dollars?! $50k MRR with 90% margins, no sales team, and no board meetings is the ultimate dream. Let them churn to Cortex when they're huge. We own the bottom of the funnel. Why feed the VC machine?

**5. The Platform Engineer:** Time out. Let's talk about the real problem. CTO, you said auto-discovery won't work on a messy AWS account. Have you ever actually tried to maintain a Backstage YAML file?

**6. The CTO:** I've been in AWS accounts where every Lambda is named `test-function-final-2` and there isn't a single tag in sight. Good luck inferring service ownership from that garbage. Auto-discovery only works if your infra is already pristine.

**7. The Platform Engineer:** But that's the beauty of it! If the portal shows you the garbage, it forces teams to fix the root cause. We don't have to manually write YAML, we just fix the actual AWS tags, which we should be doing anyway!

**8. The CTO:** You're assuming they'll stick around to fix it. If the first impression is 60% accuracy, developers will bounce immediately. Engineers absolutely despise tools that lie to them. Trust is binary.

**9. The Contrarian:** Why are we even having this debate? If you're a 50-person team, just use a Notion database! It's free, it's already there, and it takes five minutes to update. You're trying to solve a social problem with a $500/month SaaS tool.

**10. The Platform Engineer:** Are you kidding me? A Notion database rots in exactly two weeks! Have you ever tried to keep an engineering org's Confluence page updated while 15 devs are pushing to production every day? It's a full-time job!

**11. The Contrarian:** Oh, please. It takes 5 minutes a month per team to update a table. You want to pay a third party $6,000 a year just so you don't have to type "payments team" next to a repository link? This isn't a startup, it's a weekend project!

**12. The VC:** Actually, the Contrarian is missing the point. Notion doesn't have a PagerDuty integration that dynamically syncs with GitHub CODEOWNERS. When a P1 incident hits at 3 AM, a Cmd+K Slack integration that instantly routes to the right on-call engineer is easily worth $500 a month in saved MTTR.

**13. The Contrarian:** Then write a Zapier webhook for your Notion page. Or better yet, just ask in Slack. If your MTTR is tanking because people don't know who owns what, you have a management problem, not a tooling problem.

**14. The Bootstrap Founder:** Spoken like someone who hasn't built software in 10 years. People pay for convenience. This "weekend project" replaces a 14-month Backstage deployment nightmare. I'd pay $10/head tomorrow.
## Round 3: STRESS TEST

### Threat 1: GitHub Ships a Native Service Catalog

**The Attack:** GitHub already has CODEOWNERS, dependency graphs, and Actions. Microsoft is arguably already building this. If they launch a "Services" tab that renders all of this perfectly for free, the market is dead.

- **Severity:** 10/10. Existential threat.
- **Mitigations:** Multi-cloud and multi-source data fusion. GitHub only knows about code and CI; it doesn't know about PagerDuty on-call schedules, AWS CloudWatch health metrics, or AWS costs. dd0c/portal must integrate deeply with the operational tools GitHub ignores.
- **Pivot Options:** Double down on the dd0c platform flywheel. If GitHub owns the catalog layer natively, pivot dd0c/portal into a free aggregation layer that drives users directly into paid modules like dd0c/cost, dd0c/alert, and dd0c/run.

### Threat 2: Auto-Discovery Accuracy is Only 70% (And Trust Dies)

**The Attack:** The 5-minute magical first run maps AWS and GitHub, but because of messy tags and chaotic monorepos, 30% of the data is garbage. Engineers log in, see a mess, declare it useless, and churn on day 1.

- **Severity:** 8/10. Fatal to adoption and the PLG motion.
- **Mitigations:** Introduce explicit confidence scores immediately. Do not state facts; state probabilities ("We are 75% confident @payments owns this"). Make the first experience a guided calibration wizard rather than a static presentation of broken data.
- **Pivot Options:** Switch the core pitch from "zero-config auto-discovery" to "AI-assisted onboarding." If fully autonomous discovery fails, use LLMs to analyze repos and actively chat with team leads to build the catalog interactively.

### Threat 3: Backstage 2.0 Dramatically Simplifies Setup

**The Attack:** The CNCF gets tired of the complaints, or Roadie open-sources a zero-config setup wizard. The "Backstage takes 6 months" wedge evaporates.

- **Severity:** 6/10. Harmful to the acquisition model, but survivable.
- **Mitigations:** Backstage will never stop being a massive React monolith that requires heavy hosting and maintenance. Its DNA is "unopinionated framework." Lean into the "Opinionated SaaS vs. Self-Hosted Monolith" angle.
- **Pivot Options:** Focus heavily on proprietary features Backstage won't build, like AI natural language querying ("Ask Your Infra") and deep daily-use habits (the Slack bot and Cmd+K shortcut). Backstage is a dashboard you visit quarterly; dd0c is a workflow tool you use daily.

## Round 4: FINAL VERDICT

**The Decision:** SPLIT DECISION (4-1 in favor). The Contrarian dissents, citing "Glorified Spreadsheet."

**Revised Priority in dd0c lineup:**

Portal is the **Hub**, but it should be built **Third**. Launch `dd0c/cost` and `dd0c/alert` first. Use those to capture initial revenue and validate the pain points, then launch `dd0c/portal` as the connective tissue that makes the whole platform sticky. It's the ultimate upsell mechanism, not the first wedge.

**Top 3 Must-Get-Right Items:**

1. **The 5-Minute "Holy Shit" Moment.** If the auto-discovery engine doesn't map 80% of an AWS/GitHub environment on the first try with high confidence, the PLG motion is dead in the water.

2. **Speed > Features.** The Cmd+K search and Slack bot must be sub-500ms. It has to be faster than asking a human in Slack, or the behavioral habit will never form.

3. **The dd0c Flywheel.** It cannot be just a standalone catalog. It must immediately show cross-module value via cost visibility (dd0c/cost) and alert routing (dd0c/alert).

**The One Kill Condition:**

If GitHub announces a native "Service Catalog" at GitHub Universe 2026, and it integrates with PagerDuty and AWS natively, kill `dd0c/portal` as a standalone product immediately. Pivot to making it a free internal feature of the dd0c platform to drive adoption of the paid operational modules.

**Final Verdict:** **GO.**

The IDP space is a graveyard of $30M VC-funded over-engineered monoliths and abandoned Spotify YAML files. There is a massive, starving market of 50-engineer teams who just want to know who is on call for the payment gateway at 3 AM. Stop asking them to write configuration files. Give them the search bar. Take their $10 a month. Build the Anti-Backstage.
|
||||||
813
products/04-lightweight-idp/product-brief/brief.md
Normal file

# dd0c/portal — Product Brief

**Product:** Lightweight Internal Developer Portal ("The Anti-Backstage")

**Version:** 1.0

**Date:** 2026-02-28

**Author:** Product Strategy Team

**Status:** Phase 5 — Product Brief (Investor-Ready)

**Board Decision:** GO (4-1) — Build Third in dd0c Sequence

---

## 1. EXECUTIVE SUMMARY

### Elevator Pitch

dd0c/portal is a zero-configuration internal developer portal that auto-discovers your services from AWS and GitHub in 5 minutes, replaces months of Backstage setup with a single IAM role connection, and gives every engineer a Cmd+K search bar that answers "who owns this?" faster than asking in Slack. $10/engineer/month. No YAML. No dedicated platform team. No lies in your catalog.

### Problem Statement

The internal developer portal market is broken in both directions:

**The Enterprise Trap:** Backstage (open-source, from Spotify) requires 1-2 dedicated engineers to maintain, takes 3-6 months to deploy, and depends on `catalog-info.yaml` files that engineers write once and never update. After 12 months, most Backstage instances have <30% catalog accuracy. Port, Cortex, and OpsLevel charge $25-50/engineer/month with enterprise sales cycles, pricing out the 80% of engineering teams with 20-200 engineers.

**The Spreadsheet Trap:** Teams that can't afford enterprise IDPs track services in Google Sheets and Notion databases that rot within weeks. When a P1 incident fires at 3 AM, precious minutes are wasted in Slack asking "who owns this?" — a question that should take 0 seconds to answer.

**The data tells the story:**

- Engineers spend an average of 3-5 hours/week searching for internal service information (Humanitec 2025 Platform Engineering Survey)
- New hires take 3-4 weeks to become productive, with poor internal tooling cited as the #1 friction point
- 70%+ of internal documentation is stale within 6 months of creation
- Incident MTTR increases by 15-30 minutes when service ownership is unclear
- <5% of organizations have a functioning IDP in 2026 (Gartner), despite 80% recognizing the need
- The Backstage graveyard is growing — community forums and Reddit are filled with teams who invested months and got minimal value

### Solution Overview

dd0c/portal is a SaaS developer portal that generates its service catalog from infrastructure reality instead of human-maintained YAML files:

1. **5-Minute Auto-Discovery:** Connect a read-only AWS IAM role and GitHub OAuth. The discovery engine scans EC2, ECS, Lambda, RDS, API Gateway, CloudFormation stacks, GitHub repos, CODEOWNERS, and README files. Within 30 seconds, the catalog is populated with >80% accuracy — including service names, inferred owners, descriptions, tech stacks, and repo links.

2. **Cmd+K Instant Search:** The entire portal is a search bar. Type any service name, team name, or keyword — results in <500ms. Faster than asking in Slack. This is the daily-use hook that makes the portal the browser homepage.

3. **Zero Maintenance:** The catalog auto-refreshes from infrastructure sources. No YAML to maintain. No platform engineer babysitting plugins. The catalog stays accurate because it's generated from truth, not maintained by humans.

4. **dd0c Platform Integration:** Portal is the hub of the dd0c platform. dd0c/cost shows per-service AWS spend. dd0c/alert routes incidents to the right owner. dd0c/run links executable runbooks. Each module makes the portal richer; the portal makes each module smarter.

### Target Customer

**Primary:** Engineering teams of 20-200 engineers, AWS-primary, using GitHub, who either:

- Tried Backstage and abandoned it (or are limping along with a broken instance)
- Evaluated enterprise IDPs (Port, Cortex) and couldn't justify the $60-150K/year price tag
- Currently rely on Google Sheets, Notion, or Slack tribal knowledge for service ownership

**Buyer Personas:**

- **The Platform Engineer (Alex):** Burned out maintaining Backstage. Wants a portal that maintains itself so they can build actual platform capabilities.
- **The Engineering Director (Priya):** Needs always-current service ownership for incident response, compliance audits, and headcount justification. Can't answer "who owns what?" with confidence today.
- **The New Hire (Jordan):** Drowning in scattered tribal knowledge. Needs to understand the service landscape in hours, not weeks.

**Firmographic Profile:**

- 20-200 engineers (sweet spot: 40-100)
- AWS as primary cloud (70%+ of infrastructure)
- GitHub organization (not GitLab/Bitbucket at launch)
- Series A through Series D startups, or mid-market companies with modern engineering practices
- US/EU-based (timezone alignment for solo founder support)

### Key Differentiators

| Dimension | Backstage | Port/Cortex | dd0c/portal |
|-----------|-----------|-------------|-------------|
| Time to value | 3-6 months | 2-6 weeks | 5 minutes |
| Catalog maintenance | Manual YAML (1-2 FTEs) | Semi-manual (0.25 FTE) | Auto-discovery (0 FTE) |
| Day-1 accuracy | 0% (empty catalog) | 30-50% (import + manual) | 80%+ (auto-discovered) |
| Pricing (50 eng) | $0 + $200-400K labor | $60-150K/year | $6,000/year |
| Daily-use stickiness | Low (nobody visits) | Medium | High (Cmd+K, Slack bot, browser homepage) |
| AI capabilities | None | Basic | Native (Ask Your Infra, V2) |
| Platform integration | Standalone | Standalone | dd0c flywheel (cost, alerts, drift, runbooks) |
| Setup requirement | Fork React monolith, configure plugins, write YAML | Sales call, POC, weeks of configuration | Connect AWS + GitHub, done |

---

## 2. MARKET OPPORTUNITY

### Market Sizing

**Total Addressable Market (TAM): $4.3B/year**

- Global software developers: ~28 million (Evans Data, 2026)
- Developers in organizations with 20+ engineers (IDP-relevant): ~12 million
- Average willingness to pay for developer portal tooling: ~$30/dev/month (blended across tiers)
- TAM = 12M × $30 × 12 = $4.3B

**Serviceable Addressable Market (SAM): $840M/year**

- AWS-primary organizations (dd0c's initial integration scope): ~40% of cloud market
- Teams of 20-500 engineers (too big for a spreadsheet, too small for Port/Cortex)
- Estimated developer population in segment: ~2.8M developers
- At $25/dev/month blended: SAM = 2.8M × $25 × 12 = $840M

**Serviceable Obtainable Market (SOM):**

- Year 1-2 realistic penetration: 0.1-1% of the SAM developer base
- Conservative (0.1%): ~2,800 developers across 50-80 orgs → $336K ARR at $10/eng
- Aggressive (1% by Year 3): ~28,000 developers → $3.36M ARR
- Expected value at Month 12: ~$300K ARR (probability-weighted across scenarios)

**Market growth:** The IDP market is growing at a 30-40% CAGR as platform engineering becomes mainstream. Gartner estimates <5% of organizations have a functioning IDP in 2026, while 80% recognize the need. The gap between awareness and adoption is dd0c's opportunity.
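The sizing arithmetic above is simple enough to sanity-check in a few lines. All inputs are the brief's estimates, not measured data:

```python
def arr(devs: int, price_per_dev_month: float) -> float:
    """Annual recurring revenue for a developer population at a monthly price."""
    return devs * price_per_dev_month * 12

tam = arr(12_000_000, 30)    # 12M IDP-relevant devs at a $30 blended price
sam = arr(2_800_000, 25)     # AWS-primary, 20-500 engineer segment
som_low = arr(2_800, 10)     # 0.1% of the SAM developer base, at $10
som_high = arr(28_000, 10)   # 1% penetration by Year 3

print(f"TAM ${tam / 1e9:.2f}B, SAM ${sam / 1e6:.0f}M, "
      f"SOM ${som_low / 1e3:.0f}K-${som_high / 1e6:.2f}M")
# TAM $4.32B, SAM $840M, SOM $336K-$3.36M
```

The $4.32B result rounds to the $4.3B TAM headline, and the SOM range matches the conservative/aggressive scenarios above.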
### Competitive Landscape

#### Tier 1: Open Source Incumbent — Backstage (Spotify/CNCF)

- 27K+ GitHub stars, CNCF Incubating project
- Free, but requires 1-2 FTEs to maintain ($200-400K/year true cost)
- The `catalog-info.yaml` model is fundamentally broken — it depends on humans doing unpaid maintenance
- Most instances have <30% catalog accuracy after 12 months
- Plugin ecosystem is wide but shallow; community plugins rot
- **dd0c's angle:** "Backstage takes 5 months. We take 5 minutes. Backstage requires YAML. We require nothing."

#### Tier 2: Managed Backstage — Roadie

- Managed Backstage-as-a-Service, ~$30-50/engineer/month
- Removes the hosting burden but is still fundamentally Backstage (still requires YAML, still has the cold-start problem)
- Validates that Backstage maintenance is a real pain point — which is dd0c's thesis
- **Threat level:** Low. Roadie is a better Backstage; dd0c is a different paradigm.

#### Tier 3: Enterprise IDP Platforms — Port ($33M Series A), Cortex ($35M+), OpsLevel

- Enterprise pricing: $25-50/engineer/month, sales-led, minimum contracts $30K+/year
- Feature-rich: self-service workflows, advanced RBAC, compliance reports, custom scorecards
- Built for 500+ engineer orgs with dedicated platform teams and procurement budgets
- Setup takes weeks and requires significant configuration. Pricing excludes small-to-mid teams entirely.
- **dd0c's angle:** 10-20x cheaper, 100x faster to set up, zero maintenance. For the 80% of teams that Port/Cortex ignore.

#### Tier 4: Shadow Competitors (Existential Threats)

- **GitHub:** Already has CODEOWNERS, dependency graphs, repository topics, and Actions. If GitHub ships a "Services" tab, the lightweight IDP market contracts overnight. **Severity: 10/10.**
- **Datadog Service Catalog:** Basic today, but Datadog has $2B+ revenue and massive distribution. If bundled effectively with monitoring, it's "free" for existing customers.
- **Atlassian Compass:** Integrated with Jira/Confluence. Currently mediocre, but Atlassian has massive mid-market distribution.
- **AWS Service Catalog:** Terrible UX today, but AWS has native infrastructure data access.

#### Competitive Positioning Matrix

```
Factor                  | Backstage | Port/Cortex | Roadie  | dd0c/portal
------------------------|-----------|-------------|---------|------------
Setup Time              | Months    | Weeks       | Days    | 5 minutes
Catalog Maintenance     | Manual    | Semi-manual | Manual  | Auto
Day-1 Accuracy          | 0%        | 30-50%      | 0%      | 80%+
Price (50 engineers)    | $200K+    | $60-150K    | $18-30K | $6K
Daily Active Usage      | <10%      | 20-30%      | 15-25%  | Target: 40%+
AI-Native               | No        | Basic       | No      | Core (V2)
Platform Integration    | Plugins   | Standalone  | Plugins | dd0c flywheel
Solo Founder Viable     | No        | No          | No      | Yes
```

### Timing Thesis: Why Now

Four forces are converging to create a window of opportunity:

**1. Backstage Fatigue (2024-2026)**

The Backstage hype cycle has peaked and entered the trough of disillusionment. Teams that adopted Backstage in 2023-2024 are 12-18 months in and discovering that the maintenance burden is unsustainable, catalog accuracy has degraded below 50%, and the platform engineer maintaining it is burned out or has quit. This creates a massive pool of "Backstage refugees" — teams that believe in the IDP concept but are disillusioned with the execution. These are dd0c's first customers.

**2. Platform Engineering Goes Mainstream (2025-2026)**

Platform engineering is no longer a luxury — it's expected. Budget is being allocated specifically for developer experience tooling. The question has shifted from "do we need an IDP?" to "which one?" This means more buyers, more budget, and less evangelism required.

**3. AI-Native Expectations (2026)**

Engineers use Copilot, Cursor, and Claude daily. A developer portal that requires manual YAML maintenance feels archaic. The expectation is: "Why can't it just figure out what services we have?" dd0c's auto-discovery + AI query model aligns with 2026 developer expectations. Backstage's manual model feels like 2020.

**4. FinOps + Platform Engineering Convergence**

Engineering leaders want cost-per-service alongside ownership and health. No standalone IDP does this well. dd0c's platform approach (portal + cost + alerts) is uniquely positioned for this convergence.

**Regulatory Tailwinds:**

- SOC 2 / ISO 27001 adoption is increasing — auditors ask "show me service ownership." An always-current IDP is compliance infrastructure.
- EU DORA (Digital Operational Resilience Act) requires service mapping and incident response capabilities for financial services.
- The US Executive Order on AI requires organizations to inventory AI-powered services and their owners.
- Platform engineering job postings are up 340% since 2023 — the buyer persona is proliferating.

**The window is open. It won't stay open forever.** GitHub could ship a native service catalog. Datadog could invest seriously in theirs. The 12-18 month head start is the entire strategic opportunity.

---

## 3. PRODUCT DEFINITION

### Value Proposition

**For platform engineers** who are drowning in Backstage maintenance, dd0c/portal is a self-maintaining developer portal that generates its catalog from infrastructure reality. Unlike Backstage, which requires manual YAML files that nobody updates, dd0c/portal auto-discovers services from AWS and GitHub with >80% accuracy in 5 minutes and zero ongoing maintenance.

**For engineering directors** who can't answer "who owns what?" with confidence, dd0c/portal provides always-current, authoritative service ownership mapping. Unlike quarterly manual audits or stale Google Sheets, dd0c/portal auto-refreshes from source-of-truth systems and answers compliance questions in seconds, not weeks.

**For every engineer** who wastes time searching for service information, dd0c/portal is a Cmd+K search bar that's faster than asking in Slack. Unlike scattered documentation across Confluence, Notion, and pinned Slack messages, dd0c/portal is one surface with everything: owner, repo, health, on-call, cost, runbook.

**Core design principle: "Calm surface, deep ocean."** The default view is radically simple — service name, owner, health, repo link. One line per service. Scannable in 2 seconds. But depth is one click away: dependencies, cost, deployment history, runbook, scorecard. Start simpler than a spreadsheet, grow deeper than Backstage — but only when the user asks for depth.

### Personas

#### Persona 1: Jordan — The New Hire

- Software Engineer II, 2 years of experience, just joined a 60-person engineering org
- Spends the first week on a scavenger hunt across Slack, Confluence, GitHub, and Google Docs trying to map the service landscape
- Assigned a ticket involving an unfamiliar service — can't find the right repo, the owner, or current documentation
- **JTBD:** "When I join a new company, I want to quickly understand the service landscape so I can contribute meaningfully without feeling lost."
- **dd0c moment:** Jordan opens the portal, hits Cmd+K, types the service name from standup, and instantly sees: owner, repo, description, on-call, last deploy. What used to take 3 days takes 3 seconds.

#### Persona 2: Alex — The Platform Engineer

- Senior Platform Engineer, 6 years of experience, maintaining Backstage for 14 months
- Spends 40% of their time on Backstage maintenance instead of building actual platform tooling
- Sends monthly "please update your catalog-info.yaml" Slack messages that everyone ignores
- Has a secret spreadsheet that's more accurate than Backstage
- **JTBD:** "When I'm responsible for the developer portal, I want it to maintain itself so I can focus on building actual platform capabilities."
- **dd0c moment:** Alex connects the AWS IAM role and GitHub OAuth. 30 seconds later, 147 services appear — with owners, repos, and health status. Alex's spreadsheet is obsolete. The monthly YAML nagging stops forever.

#### Persona 3: Priya — The Engineering Director

- Director of Engineering, manages 8 teams (62 engineers), reports to the VP of Engineering
- Can't answer "which teams own which services?" without a multi-week manual audit
- SOC 2 audit in 3 months — the compliance team needs service ownership evidence she can't produce
- Incident MTTR is inflated by 15+ minutes because nobody knows who to page
- **JTBD:** "When preparing for a compliance audit, I want an always-current inventory of services, owners, and maturity attributes so I can produce evidence in minutes, not weeks."
- **dd0c moment:** Priya opens the portal dashboard. Every service, every owner, every health status — live, accurate, exportable. The SOC 2 evidence that used to take 2 weeks is generated in 30 seconds.

### Feature Roadmap

#### MVP (V1) — "The 5-Minute Miracle" — Months 1-3

The MVP is ruthlessly scoped. It does three things and does them perfectly:

**Auto-Discovery Engine (THE product)**

- AWS discovery via a read-only IAM role: EC2, ECS, Lambda, RDS, API Gateway, CloudFormation stacks
- Infers "services" from CloudFormation stacks, ECS services, tagged resource groups, and naming conventions
- GitHub org scanner: repos, languages, CODEOWNERS, README extraction, deployment workflows
- Cross-references AWS resources ↔ GitHub repos to build the service-to-repo mapping
- Ownership inference from CODEOWNERS, git blame frequency, and team membership
- Confidence scores on every data point: "85% confident @payments-team owns this (source: CODEOWNERS + git history)"
- Auto-refresh on a configurable schedule (default: every 6 hours)
- **Target: >80% accuracy on first run. This is the entire business.**
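To make the confidence-score idea concrete, here is a minimal Python sketch that combines per-source ownership votes into an (owner, confidence) pair. The source weights, function name, and team names are illustrative assumptions, not the product's actual model:

```python
from __future__ import annotations
from collections import Counter

# Illustrative weights: CODEOWNERS is treated as the strongest ownership
# signal, recent git blame and team membership as weaker corroboration.
SOURCE_WEIGHTS = {"codeowners": 0.6, "git_blame": 0.3, "team_membership": 0.1}

def infer_owner(signals: dict[str, str | None]) -> tuple[str | None, float]:
    """Combine per-source ownership votes into (owner, confidence).

    `signals` maps source name -> the team it points at (None if silent).
    Confidence is the winning team's share of the total weight, so
    agreeing sources push it toward 1.0 and conflicts pull it down.
    """
    votes: Counter[str] = Counter()
    total = 0.0
    for source, team in signals.items():
        weight = SOURCE_WEIGHTS.get(source, 0.0)
        if team is not None:
            votes[team] += weight
            total += weight
    if not votes:
        return None, 0.0
    owner, score = votes.most_common(1)[0]
    return owner, round(score / total, 2)

# All sources agree -> full confidence.
print(infer_owner({"codeowners": "@payments", "git_blame": "@payments",
                   "team_membership": "@payments"}))  # ('@payments', 1.0)
# CODEOWNERS and blame disagree -> CODEOWNERS wins, confidence drops.
print(infer_owner({"codeowners": "@payments", "git_blame": "@platform"}))
# ('@payments', 0.67)
```

The same shape generalizes: each discovery source contributes a weighted vote, and user corrections can later re-tune the weights.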
**Service Catalog UI**

- Service cards: name, owner (with confidence), description (extracted from the README), repo link, tech stack, health status, last deploy timestamp
- Cmd+K instant search across all services, teams, and keywords — results in <500ms
- Progressive disclosure: the default view is a one-line-per-service table. Click to expand the full service detail.
- Team directory: which team owns which services, team members, contact info
- Correction UI: one click to fix wrong ownership or add missing data. Corrections feed back into the discovery model.

**Integrations (Minimum Viable)**

- AWS: read-only IAM role (runs in the customer's VPC, pushes metadata to the SaaS)
- GitHub: OAuth app for org access
- PagerDuty/OpsGenie: import on-call schedules, map them to services ("Who's on call for this right now?")
- Slack bot: `/dd0c who owns <service>` — responds in <2 seconds
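The Slack bot's core logic is a thin lookup over the catalog. A framework-free sketch, assuming a hypothetical in-memory catalog and `handle_dd0c` helper; in production this logic would sit behind Slack's slash-command webhook:

```python
import re

# Hypothetical catalog entries; in production these come from auto-discovery.
CATALOG = {
    "auth-service": {"owner": "@identity-team", "on_call": "sam",
                     "repo": "github.com/acme/auth-service"},
}

def handle_dd0c(text: str) -> str:
    """Turn the text after `/dd0c` (e.g. `who owns auth-service`) into a reply."""
    match = re.fullmatch(r"who owns\s+(\S+)", text.strip())
    if not match:
        return "Usage: /dd0c who owns <service>"
    name = match.group(1)
    service = CATALOG.get(name)
    if service is None:
        return f"No service named `{name}` in the catalog."
    return (f"{name} is owned by {service['owner']} "
            f"(on call: {service['on_call']}, repo: {service['repo']})")

print(handle_dd0c("who owns auth-service"))
```

Because the reply is a pure dictionary lookup, the <2-second response budget is dominated by Slack round-trip time, not by the bot itself.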
**Infrastructure**

- Auth: GitHub OAuth (SSO via GitHub org membership)
- Billing: Stripe, $10/engineer/month, self-serve credit card signup
- Onboarding: connect AWS + GitHub in 3 clicks, auto-discovery runs, catalog populated

**What V1 explicitly does NOT include:**

- No scorecards or maturity models
- No dependency graphs
- No software templates or scaffolding
- No custom plugins or extensibility
- No Kubernetes or Terraform discovery (AWS + GitHub only)
- No advanced RBAC (org-level access only)
- No self-hosted option

#### V1.1 — "The Daily Habit" — Months 4-6

**Dependency Visualization**

- Auto-inferred service dependency graph from AWS VPC flow logs, API Gateway routes, and Lambda event sources
- Visual dependency map (click a service → see what it calls and what calls it)
- Impact radius: "If this service goes down, these 5 services are affected"
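The impact-radius feature reduces to a breadth-first traversal over reverse dependency edges. A minimal sketch, assuming a hypothetical `DEPENDENTS` map of the kind the flow-log inference would produce:

```python
from collections import deque

# Hypothetical reverse edges: service -> services that call (depend on) it.
DEPENDENTS = {
    "payment-gateway": ["checkout", "billing-api"],
    "checkout": ["storefront"],
    "billing-api": [],
    "storefront": [],
}

def impact_radius(service: str) -> set[str]:
    """Every service transitively affected if `service` goes down (BFS)."""
    affected: set[str] = set()
    queue = deque([service])
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

print(sorted(impact_radius("payment-gateway")))
# ['billing-api', 'checkout', 'storefront']
```

The `affected` set doubles as a visited set, so the traversal terminates even if the inferred graph contains cycles.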
**Scorecards (Lightweight)**

- Production readiness checklist per service: has a README? Has CODEOWNERS? Has a runbook? Has monitoring? Has an on-call rotation?
- Org-wide scorecard dashboard: "72% of services meet production readiness standards"
- Exportable for compliance evidence (SOC 2, ISO 27001)

**Backstage Migrator**

- One-click import from existing `catalog-info.yaml` files
- Maps Backstage catalog entries to dd0c services and merges them with auto-discovered data
- "Migrate from Backstage in 10 minutes" — the acquisition wedge for Backstage refugees

**Enhanced Discovery**

- Terraform state file parsing (service infrastructure mapping)
- Kubernetes label/annotation discovery (for K8s-based services)
- Improved accuracy via ML-assisted pattern matching on user corrections across customers

#### V2 — "Ask Your Infra" — Months 7-12

**AI Natural Language Querying (The Differentiator)**

- "Ask Your Infra" agent: natural language questions against the service catalog
- Examples:
  - "Which services handle PII data?"
  - "Who owns the services that the checkout flow depends on?"
  - "Show me all services that haven't been deployed in 90 days"
  - "What's the total AWS cost of the payments team's services?"
  - "Which services don't have runbooks?"
- Powered by an LLM with the service catalog as structured context — not hallucinating, but querying verified data
- Available in the portal UI, the Slack bot, and the CLI

**dd0c Platform Integration**

- dd0c/cost integration: per-service AWS cost attribution on every service card
- dd0c/alert integration: alert routing to the service owner, incident history on the service card
- dd0c/run integration: linked runbooks per service, AI-assisted incident response
- Cross-module dashboard: "Service X costs $847/mo, had 3 incidents this month, has 2 IaC drifts"

**Advanced Features**

- Change feed: "What changed in the service landscape this week?" (new services, ownership changes, health status changes)
- Zombie service detection: services with no deployments, no traffic, and no owner for 90+ days
- Cost anomaly detection per service: "Service X's cost jumped 340% this week"
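Zombie detection is essentially a filter over catalog records. A sketch under assumed record fields (`last_deploy`, `requests_90d`, `owner` are illustrative names, not the product's schema):

```python
from datetime import date, timedelta

# Illustrative service records; in the product these come from the catalog.
SERVICES = [
    {"name": "legacy-sms", "last_deploy": date(2025, 9, 1),
     "requests_90d": 0, "owner": None},
    {"name": "auth-service", "last_deploy": date(2026, 2, 20),
     "requests_90d": 1_200_000, "owner": "@identity-team"},
]

def zombies(services, today=date(2026, 2, 28), window_days=90):
    """Services with no deploys, no traffic, and no owner inside the window."""
    cutoff = today - timedelta(days=window_days)
    return [s["name"] for s in services
            if s["last_deploy"] < cutoff
            and s["requests_90d"] == 0
            and s["owner"] is None]

print(zombies(SERVICES))  # ['legacy-sms']
```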
#### V3 — "The Platform" — Months 12-18

**Multi-Cloud**

- GCP discovery (Cloud Run, GKE, Cloud Functions)
- Azure discovery (App Service, AKS, Functions)
- Multi-cloud service catalog with a unified view

**Enterprise Features**

- Advanced RBAC (team-level permissions, service-level visibility controls)
- SSO (Okta, Azure AD) beyond GitHub OAuth
- Audit logs (who viewed/changed what, when)
- Compliance reports (SOC 2 evidence packages, auto-generated)
- API access (programmatic catalog queries for CI/CD integration)

**Agent Control Plane**

- Registry for AI agents operating in the infrastructure
- "Which AI agents have access to which services?"
- Agent activity monitoring and governance
- Position dd0c/portal as the source of truth that AI agents query — not competing with agents, but enabling them

### User Journey: First 30 Minutes

```
Minute 0:00  — Engineer discovers dd0c/portal (blog post, HN, colleague recommendation)
Minute 0:30  — Signs up with GitHub OAuth. Enters credit card. 30 seconds.
Minute 1:00  — Onboarding wizard: "Connect your AWS account." Provides CloudFormation
               template for read-only IAM role. One-click deploy.
Minute 3:00  — AWS connected. GitHub org already connected via OAuth.
               "Starting auto-discovery..."
Minute 3:30  — Discovery complete. "Found 147 services across 89 repos and 12 AWS accounts."
               The "Holy Shit" moment.
Minute 4:00  — Engineer sees the catalog. Services listed with names, owners (with confidence
               scores), repos, health status. Scans the list. "This is... actually right."
Minute 5:00  — Hits Cmd+K. Types "payment." Instant results: payment-gateway, payment-processor,
               payment-webhook. Clicks payment-gateway → sees owner, repo, on-call, last deploy,
               tech stack. All auto-discovered.
Minute 6:00  — Notices one wrong owner. Clicks "Correct" → selects the right team → done.
               System learns. Confidence score on similar services adjusts.
Minute 8:00  — Copies the Slack bot install link. Adds /dd0c to the team Slack.
Minute 10:00 — Types /dd0c who owns auth-service in #engineering. Bot responds instantly
               with owner, on-call, repo link. Three colleagues react with 👀.
Minute 15:00 — Sets dd0c/portal as browser homepage.
Minute 30:00 — Shares screenshot in team Slack: "Look what I just found."
               Three teammates sign up within the hour.
```
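The Minute-5 search experience amounts to a ranked substring match over a small in-memory index; for a catalog of a few thousand services, even a linear scan like this sketch (illustrative data, hypothetical ranking) sits comfortably inside the 500ms budget:

```python
# Hypothetical catalog entries: (service name, owning team).
SERVICES = [
    ("payment-gateway", "@payments"),
    ("payment-processor", "@payments"),
    ("payment-webhook", "@payments"),
    ("auth-service", "@identity-team"),
]

def cmdk_search(query: str, limit: int = 10) -> list[str]:
    """Case-insensitive substring match over names and owners;
    name matches rank above owner-only matches, then alphabetical."""
    q = query.lower()
    hits = [(0 if q in name.lower() else 1, name)
            for name, owner in SERVICES
            if q in name.lower() or q in owner.lower()]
    return [name for _, name in sorted(hits)[:limit]]

print(cmdk_search("payment"))
# ['payment-gateway', 'payment-processor', 'payment-webhook']
```

A production version would add fuzzy matching and keyword fields, but the latency argument is the same: the index fits in memory, so search cost is negligible next to network round trips.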
### Pricing

**Free Tier — "Try It"**

- Up to 10 engineers
- Up to 25 discovered services
- Cmd+K search, basic service cards
- No Slack bot, no PagerDuty integration
- Purpose: let individual engineers try the product without budget approval. They become internal champions.

**Team Tier — $10/engineer/month — "The Sweet Spot"**

- Unlimited services
- Full auto-discovery (AWS + GitHub)
- Cmd+K search, Slack bot, PagerDuty/OpsGenie integration
- Confidence scores, correction UI, auto-refresh
- Scorecards (V1.1+)
- Self-serve signup, credit card, no sales call
- No minimum commitment, cancel anytime
- **Target customer: 20-100 engineers → $200-$1,000/month**

**Business Tier — $25/engineer/month — "The Platform" (Month 12+)**

- Everything in Team, plus:
- dd0c module integrations (cost, alert, drift, run)
- "Ask Your Infra" AI agent
- Dependency graphs
- Advanced RBAC, SSO (Okta/Azure AD)
- Compliance reports (SOC 2 evidence packages)
- Priority support
- **Target customer: 100-500 engineers → $2,500-$12,500/month**

**Pricing rationale:**

- $10/engineer removes the procurement barrier. For a 50-person team, that's $500/month — within most engineering managers' discretionary spending authority. No procurement process, no legal review, no 6-month sales cycle.
- The ROI calculation is trivial: "Does this save each engineer more than 15 minutes per month?" The Cmd+K search alone saves 15 minutes in the first week.
- $10/engineer is structurally impossible for VC-backed competitors to match. Port and Cortex have 50+ employees and $30M+ in funding; their cost structure requires $25-50/engineer. dd0c's cost structure is one person plus cloud hosting.
- Planned pricing evolution: $10 at launch → introduce the Business tier at $25 by Month 12 → effective ARPU rises to $15-20 as customers add modules and upgrade.

---

## 4. GO-TO-MARKET PLAN

### Launch Strategy: Targeting Backstage Refugees

The primary acquisition channel is not paid ads, not outbound sales, and not conference sponsorships. It's content-driven PLG targeting the single most receptive audience in DevOps: **teams that tried Backstage and failed.**

These teams have already self-selected. They believe in the IDP concept. They've invested 3-18 months. They've felt the pain of YAML rot, plugin maintenance, and catalog decay. They are actively searching for alternatives. They are dd0c's first 100 customers.

**Where they congregate:**

- The Backstage Discord and GitHub Discussions (frustrated questions, feature requests that will never ship)
- r/devops and r/platformengineering (posts titled "Backstage alternatives?" appear monthly)
- The Platform Engineering Slack community (~15K members)
- PlatformCon conference attendees and speakers
- Blog posts titled "What We Learned From Our Backstage Implementation" (translation: "Why We Failed")
- Google searches for "Backstage alternatives," "Backstage too complex," "IDP without YAML"

### Content Strategy: Engineering-as-Marketing

Every piece of content serves one of two purposes: (1) capture Backstage refugees, or (2) demonstrate the 5-minute magic.

**Tier 1: Flagship Content (Pre-Launch)**

1. **"I Maintained Backstage for 18 Months. Here's Why I Quit."**
   - Honest, technical post-mortem. Not a hit piece — a relatable story.
   - Target: HN front page, r/devops top post, Twitter/X viral thread
   - CTA: "We built the alternative. Try it in 5 minutes."
   - This single piece of content, if it resonates, generates the first 50-100 signups.

2. **"The Backstage Migration Calculator"**
   - Free web tool: input your Backstage metrics (engineers, hours/week on maintenance, catalog accuracy %)
   - Output: total cost of ownership, comparison to dd0c/portal pricing, projected time savings
   - Lead capture: email required for full report
   - SEO targets: "Backstage cost," "Backstage alternatives," "Backstage vs"
   - This tool validates demand before a single line of portal code is written.

3. **"Is Your IDP Actually Used? A 5-Minute Audit"**
   - Checklist/scorecard format. "How many engineers visited your IDP this week? Is your catalog >80% accurate?"
   - Most teams score poorly → creates urgency → CTA to dd0c
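
The calculator's core arithmetic can be sketched in a few lines. A minimal version, assuming a loaded engineer cost of $100/hour — an illustrative figure, not taken from this brief:

```python
def backstage_tco_per_year(engineers_maintaining: int,
                           hours_per_week: float,
                           hourly_cost: float = 100.0) -> float:
    """Annual cost of keeping Backstage alive: maintenance hours only."""
    return engineers_maintaining * hours_per_week * hourly_cost * 52

def dd0c_cost_per_year(total_engineers: int, per_seat: float = 10.0) -> float:
    """dd0c/portal Team tier at $10/engineer/month."""
    return total_engineers * per_seat * 12

# Example: 2 engineers spend 10 h/week each on Backstage; the org has 50 engineers.
tco = backstage_tco_per_year(2, 10)   # 104,000
dd0c = dd0c_cost_per_year(50)         # 6,000
savings = tco - dd0c                  # 98,000
```

The real tool would also fold in catalog-accuracy losses; maintenance hours alone already make the comparison stark.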

**Tier 2: SEO & Thought Leadership (Launch + Ongoing)**

4. **"Zero-Config Service Discovery: How We Auto-Map Your AWS Infrastructure"** — technical deep-dive
5. **"The Internal Developer Portal Buyer's Guide (2026)"** — honest comparison including dd0c's weaknesses
6. **"Why Your Service Catalog Is Lying to You"** — thought piece on manual vs. auto-discovered catalogs
7. **"How [Company] Replaced Backstage in 5 Minutes"** — customer case study video walkthroughs

**Tier 3: Community & Social Proof (Post-Launch)**

8. Customer case studies (as soon as first 3 customers are live)
9. Monthly "State of the Catalog" reports — anonymized data on discovery accuracy and adoption patterns
10. Open-source the discovery agent — builds trust, enables security audits, creates community contributions

### Growth Loops

**Loop 1: The Slack Bot Viral Loop**

Every time an engineer uses `/dd0c who owns <service>` in a public Slack channel, it's a product demo. Colleagues see the instant response, ask "what is this?", and sign up. The Slack bot is a passive viral mechanism that scales with usage.

**Loop 2: The Screenshot Loop**

The "Holy Shit" moment — 147 services auto-discovered in 30 seconds — is inherently shareable. Engineers screenshot the catalog and share it in team Slack, Twitter/X, and engineering blogs. The visual impact of a populated catalog appearing from nothing is the product's best marketing asset.

**Loop 3: The Org Expansion Loop**

One team adopts dd0c/portal → other teams in the org notice → engineering director sees cross-team adoption → approves org-wide rollout. Bottom-up adoption creates top-down demand. This is the Slack/Figma playbook.

**Loop 4: The dd0c Platform Upsell Loop**

Portal customer sees per-service cost data (dd0c/cost integration) → "Wait, Service X costs $847/month?!" → upgrades to dd0c/cost → sees alert routing (dd0c/alert) → upgrades again. Portal is the top-of-funnel for the entire dd0c platform.

**Viral coefficient target:** k=0.3 (each new user brings 0.3 additional users through Slack sharing and team adoption). At k=0.3, organic growth supplements content marketing but doesn't replace it.
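
Why a sub-1 coefficient "supplements but doesn't replace" content marketing: for k < 1, viral sharing compounds to a fixed multiplier rather than self-sustaining growth. A quick sketch of the steady-state effect:

```python
def viral_multiplier(k: float) -> float:
    """Each acquired user brings k more, who bring k^2 more, and so on.
    For k < 1 the geometric series converges to 1 / (1 - k)."""
    assert 0 <= k < 1, "self-sustaining growth requires k >= 1"
    return 1 / (1 - k)

# At k = 0.3, every 100 content-driven signups become ~143 total users —
# a meaningful boost, but content still does the heavy lifting.
print(round(viral_multiplier(0.3) * 100))  # 143
```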

### Partnership Strategy

**AWS Marketplace (Month 6)**

- List dd0c/portal on AWS Marketplace
- Customers with committed AWS spend (EDPs) can use marketplace credits to pay for dd0c — removes the budget objection entirely
- AWS Marketplace provides credibility signaling and co-marketing opportunities via the ISV Partner program

**GitHub Marketplace (Month 3)**

- List the GitHub App on GitHub Marketplace
- Lower friction than AWS Marketplace, higher discovery volume
- Natural discovery point for GitHub-centric engineering teams

**PagerDuty / OpsGenie Integration Partners**

- Deep integration with on-call tools is a key feature
- Co-marketing: "Route alerts to the right owner automatically"
- PagerDuty's partner ecosystem promotes integrated tools

**Strategic Non-Partnerships:**

- Do NOT partner with Datadog (competing service catalog)
- Do NOT seek AWS investment (maintain independence and future multi-cloud optionality)
- Do NOT pursue SI/consulting partnerships (wrong channel for a $10/eng PLG product)

### 90-Day Launch Timeline

**Weeks 1-4: Build the Core**

| Week | Deliverable |
|------|------------|
| 1 | AWS auto-discovery engine: EC2, ECS, Lambda, RDS via read-only IAM role. Map resources to services using CloudFormation stacks, tags, naming conventions. |
| 2 | GitHub org scanner: repos, languages, CODEOWNERS, README extraction. Cross-reference with AWS to build service-to-repo mapping. |
| 3 | Service catalog UI: service cards, Cmd+K instant search (<300ms). Auth (GitHub OAuth), billing (Stripe). |
| 4 | Onboarding flow (connect AWS + GitHub in 3 clicks). Confidence scores. Correction UI. |

**Weeks 5-8: Polish & Beta**

| Week | Deliverable |
|------|------------|
| 5 | Slack bot: `/dd0c who owns <service>`. Responds in <2 seconds. |
| 6 | PagerDuty/OpsGenie integration: import on-call schedules, map to services. |
| 7 | Backstage YAML importer. Landing page. Waitlist. |
| 8 | Beta launch: invite 20 teams. Personal onboarding call with each. Obsessively collect feedback on discovery accuracy. |

**Weeks 9-12: Launch & First Revenue**

| Week | Deliverable |
|------|------------|
| 9 | Incorporate beta feedback. Fix top 5 discovery accuracy issues. |
| 10 | Publish "I Maintained Backstage for 18 Months" blog post. Ship Backstage Migration Calculator. |
| 11 | Public launch: HN "Show HN," Reddit (r/devops, r/platformengineering, r/aws), Twitter/X thread. Coordinated for maximum simultaneous visibility. |
| 12 | Convert beta users to paid. Target: 10 paying customers by end of week 12. |

**Pre-launch content (before writing portal code):**

- Ship the Backstage Migration Calculator and the "Why I Quit Backstage" blog post first
- Validate demand with content before investing in engineering
- If the content doesn't resonate, the product won't either

---

## 5. BUSINESS MODEL

### Revenue Model

**Primary revenue:** Per-seat SaaS subscription at $10/engineer/month (Team tier).

**Revenue expansion mechanisms:**

1. **Seat expansion:** As customer engineering teams grow, revenue grows automatically
2. **Tier upgrade:** Team ($10/eng) → Business ($25/eng) as customers need RBAC, compliance, AI queries
3. **Module attach:** Portal customers add dd0c/cost, dd0c/alert, dd0c/run — each at $10-15/eng/month
4. **Effective ARPU trajectory:** $10/eng (portal only) → $25/eng (portal + 1 module) → $40/eng (portal + 2 modules + Business tier)

**The revenue model is NOT:**

- Usage-based (predictable revenue > usage volatility for a solo founder)
- Enterprise sales-led (no sales team, no SDRs, no demo calls)
- Freemium-dependent (free tier is acquisition, not the business)

### Unit Economics

**Cost structure (solo founder, Month 6):**

| Cost Item | Monthly |
|-----------|---------|
| Cloud infrastructure (AWS/Vercel) | $500-1,500 |
| Stripe payment processing (2.9% + $0.30) | ~3% of revenue |
| Domain, email, tooling | $200 |
| LLM API costs (discovery + AI queries, V2) | $200-500 |
| Brian's time (opportunity cost) | $15,000 (imputed) |
| **Total operating cost (excl. founder)** | **~$2,000-2,500/month** |

**Gross margin:** ~90% (SaaS infrastructure costs are minimal at this scale)

**Customer Acquisition Cost (CAC):** Near-zero for content-driven PLG. The primary cost is Brian's time writing blog posts and building free tools. Imputed CAC: ~$50-100 per customer (time spent on content ÷ customers acquired).

**Lifetime Value (LTV):** At $500/month average (50 engineers × $10) with 5% monthly churn, average customer lifetime is 20 months → LTV = $10,000. LTV:CAC ratio > 100:1. (This is unusually high because CAC is near-zero for PLG.)
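
The LTV arithmetic above follows the standard SaaS shortcut — expected lifetime ≈ 1 / monthly churn:

```python
def ltv(monthly_arpu: float, monthly_churn: float) -> float:
    """Lifetime value under constant churn: expected lifetime is 1/churn months."""
    lifetime_months = 1 / monthly_churn
    return monthly_arpu * lifetime_months

# 50 engineers x $10 = $500/month ARPU, 5% monthly churn -> 20-month lifetime
print(ltv(500, 0.05))  # 10000.0
```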

**Payback period:** <1 month (self-serve credit card, no sales cycle, immediate revenue).

### Path to Revenue Milestones

**$10K MRR — "Validation" (Target: Month 9-12)**

| Metric | Requirement |
|--------|-------------|
| Customers | 20-25 |
| Avg engineers per customer | 40-50 |
| Avg MRR per customer | $400-500 |
| Monthly churn | <5% |
| **Milestone significance** | Product-market fit confirmed. Solo founder business is viable. |

**$50K MRR — "Sustainability" (Target: Month 15-18)**

| Metric | Requirement |
|--------|-------------|
| Customers | 80-100 |
| Avg engineers per customer | 50-60 |
| Avg MRR per customer | $500-625 |
| Module attach rate | >20% (portal + at least one other dd0c module) |
| Monthly churn | <3% |
| **Milestone significance** | $600K ARR. Hire first contractor (support + integration maintenance). Brian focuses on product + growth. |

**$100K MRR — "Scale" (Target: Month 20-24)**

| Metric | Requirement |
|--------|-------------|
| Customers | 150-200 |
| Avg engineers per customer | 55-65 |
| Avg MRR per customer | $500-667 |
| Business tier adoption | >15% of customers |
| dd0c platform revenue (all modules) | $200K+ MRR combined |
| Monthly churn | <2.5% |
| **Milestone significance** | $1.2M ARR. Hire 1-2 FTEs (engineer + DevRel). Consider seed funding if growth warrants acceleration. |

### Solo Founder Constraints & Sequencing Rationale

The Party Mode board recommended building dd0c/portal **third** in the dd0c launch sequence, after dd0c/route (LLM cost router) and dd0c/cost (AWS cost anomaly). This sequencing is strategically correct for three reasons:

**1. Revenue before platform.**

dd0c/route and dd0c/cost have immediate monetary ROI — they save companies money on LLM and AWS bills in week one. This generates revenue and political capital before portal launches. Portal is a harder sell as a standalone ("pay us to organize your services") but an easy sell as a platform add-on ("you're already saving $2K/month with dd0c — now see which services cost what").

**2. Data before catalog.**

dd0c/cost generates per-service cost data. dd0c/alert generates per-service incident data. When portal launches, it can immediately show cost and incident data on every service card — making the portal 10x more valuable on day one than it would be as a standalone catalog. The modules create the data; the portal visualizes it.

**3. Customers before cold-start.**

By the time portal launches (Month 7-12 of the dd0c platform), dd0c already has paying customers on route and cost. These customers are the portal's first users — no cold-start problem, no "will anyone try this?" uncertainty. They're already in the dd0c ecosystem and the portal is a natural expansion.

**The sequencing risk:** If dd0c/route and dd0c/cost fail to gain traction, portal never launches. This is acceptable — if the FinOps wedge doesn't work, the platform thesis is wrong, and building a standalone IDP at $10/engineer is an even harder path.

**Solo founder capacity allocation (portal phase):**

- 50% — Discovery engine accuracy (the product)
- 20% — UI/UX and integrations
- 15% — Content marketing and community
- 10% — Infrastructure and ops
- 5% — Customer support (automated first, personal for top accounts)

---

## 6. RISKS & MITIGATIONS

### Risk 1: GitHub Ships a Native Service Catalog

**Probability:** 40% within 24 months
**Impact:** Catastrophic (existential)
**Severity:** CRITICAL

GitHub already has CODEOWNERS, dependency graphs, repository topics, and Actions. If GitHub adds a "Services" tab that aggregates these into a searchable catalog with auto-discovery from Actions deployment targets, dd0c/portal's core value proposition evaporates for GitHub-only shops.

**Why it might not happen:**

- GitHub's product strategy is focused on AI (Copilot) and security (Advanced Security). IDP is not their priority.
- Microsoft/GitHub has historically been slow to build platform features (GitHub Projects took years and is still mediocre).
- A real IDP requires cross-platform data (AWS, PagerDuty, Datadog) that GitHub doesn't have and may not want to integrate.

**Mitigations:**

- Build value GitHub can't replicate: cross-platform integration (AWS + GitHub + PagerDuty + Slack), the dd0c module flywheel, and AI-native querying
- Position dd0c as "GitHub + AWS + everything else" — not "GitHub but better"
- If GitHub announces a service catalog, immediately pivot to positioning dd0c as the multi-source aggregation layer that includes GitHub's data alongside operational tools
- **Speed is the primary mitigation.** Establish the beachhead and build switching costs before GitHub moves. Every month of head start is a month of habit formation.

**Kill trigger:** If GitHub announces a native service catalog at GitHub Universe 2026 that integrates with PagerDuty and AWS natively, kill dd0c/portal as a standalone product. Pivot to making it a free feature within the dd0c platform to drive adoption of paid modules.

### Risk 2: Auto-Discovery Accuracy Falls Below 80%

**Probability:** 35%
**Impact:** Critical (fatal to PLG motion)
**Severity:** CRITICAL

The entire product thesis rests on auto-discovery being "good enough" on first run. If engineers connect their AWS account and see 60% accuracy with wrong owners and phantom services, they close the tab and never return. Trust is binary. One bad first impression is fatal.

**Technical challenges:**

- AWS resources don't always map cleanly to "services" (is each Lambda a service? Each ECS task definition?)
- GitHub repos don't always map to deployed services (monorepos, shared libraries, archived repos)
- Ownership inference from git blame is noisy
- Naming conventions vary wildly across organizations

**Mitigations:**

- Confidence scores, not assertions: "85% confident @payments-team owns this (source: CODEOWNERS + git history). Confirm or correct."
- Conservative discovery: show 80 services at 90% accuracy rather than 150 services at 60% accuracy
- Rapid feedback loop: user corrections improve the model. After 10 corrections, accuracy should be >95%.
- Invest 50% of engineering time on discovery accuracy in year 1. This is the product.
- If fully autonomous discovery fails, pivot to "AI-assisted onboarding" — LLMs analyze repos and chat with team leads to build the catalog interactively.
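
To make "confidence scores, not assertions" concrete, here is a minimal sketch of weighted ownership-signal voting. The signal names, weights, and threshold are illustrative assumptions, not the shipped algorithm:

```python
# Hypothetical signal weights (percent): CODEOWNERS is the strongest ownership
# signal; git-commit history and PagerDuty schedules are weaker corroboration.
WEIGHTS = {"codeowners": 60, "git_history": 25, "pagerduty": 15}
THRESHOLD = 50  # below this, show the service as "unowned — please confirm"

def ownership_confidence(signals: dict) -> tuple:
    """Combine per-signal owner votes into (best_owner, confidence_percent)."""
    scores: dict = {}
    for signal, owner in signals.items():
        scores[owner] = scores.get(owner, 0) + WEIGHTS.get(signal, 0)
    if not scores:
        return (None, 0)
    owner, score = max(scores.items(), key=lambda kv: kv[1])
    return (owner, score) if score >= THRESHOLD else (None, score)

# CODEOWNERS and git history agree -> 85% confidence, surfaced for confirmation:
# "85% confident @payments-team owns this. Confirm or correct."
print(ownership_confidence({"codeowners": "@payments-team",
                            "git_history": "@payments-team",
                            "pagerduty": "@platform-team"}))  # ('@payments-team', 85)
```

User corrections would then feed back as a dominant signal, which is how the "after 10 corrections, >95% accuracy" loop closes.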

**Kill trigger:** If accuracy is <60% after 3 months of engineering on diverse real-world AWS accounts, the technical thesis is wrong. Kill.

### Risk 3: Solo Founder Burnout / Capacity Ceiling

**Probability:** 50%
**Impact:** Critical
**Severity:** CRITICAL

Building an IDP with AWS integration, GitHub integration, PagerDuty, a Slack bot, Cmd+K search, an auto-discovery engine, SaaS infrastructure, billing, auth, and customer support — as one person — is an enormous undertaking. The failure mode isn't "Brian can't code it." It's "Brian builds it, gets 30 customers, and drowns in support tickets, bug reports, and integration requests."

**Mitigations:**

- Ruthless scope control: V1 is auto-discovery + service cards + Cmd+K + Slack bot + billing. That's it.
- Architecture for zero maintenance: serverless infrastructure, automated deployments, automated monitoring
- AI-assisted development: Cursor/Copilot/Claude for 50%+ of code generation
- Kill criteria with deadlines: if milestones aren't hit, kill the product. Don't let the sunk-cost fallacy turn a 6-month experiment into a 2-year death march.
- Hire the first contractor at $20K MRR for support and integration maintenance

### Risk 4: Datadog Bundles a Good-Enough Service Catalog

**Probability:** 30%
**Impact:** High
**Severity:** HIGH

Datadog already has a Service Catalog feature. With $2B+ revenue and massive distribution, if they invest seriously, teams already paying for Datadog get an IDP for "free."

**Mitigations:**

- Datadog's catalog is monitoring-centric (it discovers from APM traces, not infrastructure). dd0c discovers from AWS + GitHub, which is more comprehensive.
- Datadog's pricing ($23+/host/month) means their IDP is "free" only for existing $50K+/year Datadog customers. For teams using CloudWatch or Grafana, Datadog's IDP is irrelevant.
- Position dd0c as monitoring-agnostic. It works with Datadog, Grafana, CloudWatch, or nothing.

### Risk 5: Market Is Smaller Than Estimated

**Probability:** 25%
**Impact:** High
**Severity:** HIGH

Do teams of 20-50 engineers actually need — and will they pay for — a service catalog? Many function fine with a Google Sheet and Slack.

**Mitigations:**

- The free tier tests this hypothesis at zero cost. If teams under 30 engineers don't convert, raise the minimum target to 50+.
- The beachhead is "teams that tried Backstage and failed" — already self-selected as needing an IDP.
- If the market is smaller than expected, portal becomes a free feature driving adoption of paid dd0c modules (cost, alerts, drift).

### Kill Criteria Summary

| Milestone | Deadline | Kill Trigger |
|-----------|----------|-------------|
| Auto-discovery >75% accuracy on test accounts | Month 3 | <60% accuracy after 3 months of engineering |
| 10 beta users actively using weekly | Month 4 | Can't get 10 free users |
| 5 paying customers ($10/eng) | Month 6 | Can't convert 5 teams from free to paid |
| 20 paying customers, <10% monthly churn | Month 9 | Churn exceeds 10%/month (novelty, not habit) |
| $10K MRR | Month 12 | Market too small or GTM broken |
| GitHub announces native service catalog | Any time | Assess whether dd0c differentiation survives. If GitHub covers 70%+ of dd0c value, kill the standalone portal. |

### Pivot Options

If dd0c/portal fails as a standalone product, three pivot paths exist:

1. **Free portal → paid modules.** Make portal free. Use it as top-of-funnel for dd0c/cost and dd0c/alert. "Free service catalog → discover your services → see that Service X costs $847/month → upgrade to dd0c/cost." Portal becomes the world's most effective upsell mechanism.

2. **AI-assisted onboarding.** If autonomous discovery fails, pivot from "zero-config auto-discovery" to "AI-assisted catalog building." LLMs analyze repos, Slack history, and AWS resources, then chat with team leads to build the catalog interactively. Still faster than Backstage YAML, but with a human in the loop.

3. **Agent Control Plane.** If AI agents make static catalogs obsolete, pivot dd0c/portal to be the registry and governance layer for AI agents operating in infrastructure. "Which agents have access to which services? What did they do? Who authorized it?" The portal becomes infrastructure for AI, not a competitor to AI.

---

## 7. SUCCESS METRICS

### North Star Metric: Daily Active Users (DAU), Not Seats

Most IDP companies measure seats (how many engineers have accounts). This is a vanity metric. An engineer can have an account and never log in. dd0c/portal's north star is **Daily Active Users as a percentage of total seats (DAU/Seats).**

**Why DAU, not seats:**

- DAU measures habit formation. If engineers open the portal daily, it's their browser homepage. If it's their browser homepage, switching costs are enormous.
- DAU correlates with retention. High DAU = low churn. Low DAU = the product is a novelty that will be forgotten.
- DAU drives the viral loop. Active users use the Slack bot, share screenshots, and recommend the product to colleagues. Inactive seat-holders do nothing.

**Target:** >40% DAU/Seats by Month 6. (For context: Slack achieves ~60% DAU/Seats. Linear achieves ~40%. Most enterprise SaaS tools achieve <15%.)

### Leading Indicators (Weekly Review)

| Metric | Month 3 Target | Month 6 Target | Month 12 Target |
|--------|---------------|----------------|-----------------|
| DAU / Total Seats | >25% | >40% | >45% |
| Cmd+K searches per user per week | >3 | >5 | >7 |
| Slack bot queries per org per week | >5 | >10 | >15 |
| Auto-discovery accuracy (1 - correction rate) | >75% | >85% | >90% |
| Time-to-first-value (signup → first search) | <10 min | <5 min | <5 min |
| Organic signup rate (word-of-mouth %) | >20% | >30% | >40% |

### Lagging Indicators (Monthly Review)

| Metric | Month 6 Target | Month 12 Target |
|--------|---------------|-----------------|
| Paying customers | 10-25 | 50-120 |
| MRR | $4K-12.5K | $25K-72K |
| Net Revenue Retention (NRR) | >105% | >110% |
| Monthly logo churn | <8% | <5% |
| Monthly revenue churn | <6% | <3% |
| Catalog completeness (% of actual services discovered) | >85% | >90% |
| Module attach rate (portal customers adding a 2nd dd0c module) | N/A | >20% |

### Milestones

| Milestone | Target Date | Significance |
|-----------|------------|--------------|
| Auto-discovery engine working (>75% accuracy) | Month 3 | Technical thesis validated |
| 10 beta users with weekly active usage | Month 4 | Product resonates with real users |
| First 5 paying customers | Month 6 | Willingness to pay confirmed |
| "I Quit Backstage" post hits 10K+ views | Month 3-4 | Content-market fit validated |
| 20 paying customers, <10% churn | Month 9 | Retention thesis validated |
| $10K MRR | Month 12 | Solo founder business viable |
| First customer adds a second dd0c module | Month 12 | Platform flywheel activated |
| $50K MRR | Month 18 | Hire first team member |
| >90% auto-discovery accuracy | Month 12 | Technical moat established |
| AWS Marketplace listing live | Month 6 | Distribution channel opened |

### The Anti-Metric: Feature Count

Do NOT measure or celebrate feature count. Every feature added is a maintenance burden for a solo founder. The goal is maximum value from minimum features. Measure value delivered per feature, not features shipped. The product with 10 features used daily beats the product with 100 features used quarterly.

---

## APPENDIX: SCENARIO PROJECTIONS

### Scenario A: "The Rocket" (15% probability)

Everything works. Auto-discovery is accurate. Content goes viral. Backstage refugees flock to dd0c.

| Month | Customers | Avg Engineers | MRR | ARR |
|-------|-----------|--------------|-----|-----|
| 6 | 25 | 50 | $12,500 | $150K |
| 9 | 60 | 55 | $33,000 | $396K |
| 12 | 120 | 60 | $72,000 | $864K |

### Scenario B: "The Grind" (50% probability)

Product works but growth is slower. Content gets moderate traction. Word of mouth builds gradually.

| Month | Customers | Avg Engineers | MRR | ARR |
|-------|-----------|--------------|-----|-----|
| 6 | 10 | 40 | $4,000 | $48K |
| 9 | 25 | 45 | $11,250 | $135K |
| 12 | 50 | 50 | $25,000 | $300K |

### Scenario C: "The Stall" (25% probability)

Discovery accuracy is a persistent challenge. The market is smaller than expected.

| Month | Customers | Avg Engineers | MRR | ARR |
|-------|-----------|--------------|-----|-----|
| 6 | 5 | 35 | $1,750 | $21K |
| 9 | 10 | 35 | $3,500 | $42K |
| 12 | 15 | 40 | $6,000 | $72K |

### Scenario D: "The Kill" (10% probability)

GitHub ships a service catalog, or discovery never reaches acceptable accuracy.

| Month | Action |
|-------|--------|
| 6 | <5 paying customers. Reassess. |
| 9 | No improvement. Kill standalone portal. |
| 9+ | Pivot: portal becomes a free feature within the dd0c platform to drive paid module adoption. |

**Expected Value (Month 12 ARR):**

E(ARR) = (0.15 × $864K) + (0.50 × $300K) + (0.25 × $72K) + (0.10 × $0) = **~$298K**
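
The expected-value figure checks out arithmetically:

```python
scenarios = {  # scenario: (probability, Month-12 ARR in $K)
    "Rocket": (0.15, 864),
    "Grind":  (0.50, 300),
    "Stall":  (0.25, 72),
    "Kill":   (0.10, 0),
}
ev = sum(p * arr for p, arr in scenarios.values())
print(f"E(ARR) = ${ev:.1f}K")  # E(ARR) = $297.6K
```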

---

*This product brief synthesizes findings from four prior phases: Brainstorm, Design Thinking, Innovation Strategy, and Party Mode Advisory Board Review. All contradictions between phases have been resolved in favor of the most conservative, execution-focused position. The brief is designed for investor-ready presentation and solo founder execution planning.*

*Build the portal. Build it third. Build it fast. And build the discovery engine like your business depends on it — because it does.*

---

# dd0c/portal — Test Architecture & TDD Strategy

**Product:** Lightweight Internal Developer Portal
**Phase:** 8 — Test Architecture & TDD Strategy
**Date:** 2026-02-28
**Status:** Draft

---

## 1. Testing Philosophy & TDD Workflow

### Core Principle

dd0c/portal's most critical logic — ownership inference, discovery reconciliation, and confidence scoring — is pure algorithmic code with well-defined inputs and outputs. This is ideal TDD territory. The test suite is the specification.

The product's >80% discovery accuracy target is not a QA metric — it's a product promise. Tests enforce it continuously.

### Red-Green-Refactor Adapted to This Product

```
RED      → Write a failing test that encodes a discovery heuristic or ownership rule
GREEN    → Write the minimum code to pass it (no clever abstractions yet)
REFACTOR → Clean up once the rule is proven correct against real-world fixtures
```

**Adapted cycle for discovery heuristics:**

1. Capture a real-world failure case (e.g., "Lambda functions named `payment-*` were not grouped into a service")
2. Write a unit test encoding the expected grouping behavior using a fixture of that Lambda response
3. Fix the heuristic
4. Add the fixture to the regression suite permanently

This means every production accuracy bug becomes a permanent test. The test suite grows as a living record of every edge case the discovery engine has encountered.
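
A sketch of what step 2 looks like in practice, in pytest style. The fixture shape and the function `group_lambdas_into_services` are illustrative assumptions, not the real codebase:

```python
# Regression fixture captured from the real-world failure: payment-* Lambdas
# were emitted as three separate services instead of one grouped service.
LAMBDA_FIXTURE = [
    {"FunctionName": "payment-authorize"},
    {"FunctionName": "payment-capture"},
    {"FunctionName": "payment-refund"},
    {"FunctionName": "send-weekly-digest"},
]

def group_lambdas_into_services(functions):
    """Toy heuristic: group Lambdas sharing a name prefix into one service."""
    services = {}
    for fn in functions:
        prefix = fn["FunctionName"].split("-")[0]
        services.setdefault(prefix, []).append(fn["FunctionName"])
    return services

def test_payment_lambdas_grouped_into_single_service():
    services = group_lambdas_into_services(LAMBDA_FIXTURE)
    # RED -> GREEN: all three payment-* functions belong to one "payment" service.
    assert sorted(services["payment"]) == [
        "payment-authorize", "payment-capture", "payment-refund"
    ]
    assert "send" in services  # ungrouped functions still surface
```

The fixture file stays in the repo forever, so the heuristic can never silently regress on this account shape.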

### When to Write Tests First vs. When Integration Tests Lead

| Scenario | Approach | Rationale |
|----------|----------|-----------|
| Ownership scoring algorithm | Unit-first TDD | Pure function, deterministic, no I/O |
| Discovery heuristics (CFN → service mapping) | Unit-first TDD | Deterministic logic over fixture data |
| GitHub GraphQL query construction | Unit-first TDD | Query builder logic is pure |
| AWS API pagination handling | Integration-first | Behavior depends on real API shape |
| Meilisearch index sync | Integration-first | Depends on Meilisearch document model |
| DynamoDB schema migrations | Integration-first | Requires real DynamoDB Local behavior |
| WebSocket progress events | E2E-first | Requires the full pipeline to be meaningful |
| Stripe webhook handling | Integration-first | Depends on Stripe event payload shape |

### Test Naming Conventions

All tests follow the pattern: `[unit under test]_[scenario]_[expected outcome]`

**TypeScript/Node.js (Jest):**

```typescript
describe('OwnershipInferenceEngine', () => {
  describe('scoreOwnership', () => {
    it('returns_primary_owner_when_codeowners_present_with_high_confidence', () => {})
    it('marks_service_unowned_when_top_score_below_threshold', () => {})
    it('marks_service_ambiguous_when_top_two_scores_within_tolerance', () => {})
  })
})
```

**Python (pytest):**

```python
class TestOwnershipScorer:
    def test_codeowners_signal_weighted_highest_among_all_signals(self): ...
    def test_git_blame_frequency_used_when_codeowners_absent(self): ...
    def test_confidence_below_threshold_flags_service_as_unowned(self): ...
```

**File naming:**

- Unit tests: `*.test.ts` / `test_*.py`, co-located with source
- Integration tests: `*.integration.test.ts` / `test_*_integration.py` in `tests/integration/`
- E2E tests: `tests/e2e/*.spec.ts` (Playwright)

---

## 2. Test Pyramid

### Recommended Ratio: 70 / 20 / 10

```
        ┌─────────────┐
        │ E2E / Smoke │  10% (~30 tests)
        │ (Playwright)│  Critical user journeys only
        ├─────────────┤
        │ Integration │  20% (~80 tests)
        │ (real deps) │  Service boundaries, API contracts
        ├─────────────┤
        │    Unit     │  70% (~280 tests)
        │ (pure logic)│  All heuristics, scoring, parsing
        └─────────────┘
```

### Unit Test Targets (per component)

| Component | Language | Test Framework | Target Coverage |
|-----------|----------|---------------|----------------|
| AWS Scanner (heuristics) | Python | pytest | 90% |
| GitHub Scanner (parsers) | Node.js | Jest | 90% |
| Reconciliation Engine | Node.js | Jest | 85% |
| Ownership Inference | Python | pytest | 95% |
| Portal API (route handlers) | Node.js | Jest + Supertest | 80% |
| Search proxy + cache logic | Node.js | Jest | 85% |
| Slack Bot command handlers | Node.js | Jest | 80% |
| Feature flag evaluation | Node.js/Python | Jest/pytest | 95% |
| Governance policy engine | Node.js | Jest | 95% |
| Schema migration validators | Node.js | Jest | 100% |

### Integration Test Boundaries
|
||||||
|
|
||||||
|
| Boundary | What to Test | Tool |
|
||||||
|
|----------|-------------|------|
|
||||||
|
| Discovery → GitHub API | GraphQL query shape, pagination, rate limit handling | MSW (mock service worker) or nock |
|
||||||
|
| Discovery → AWS APIs | boto3 call sequences, pagination, error handling | moto (AWS mock library) |
|
||||||
|
| Reconciler → PostgreSQL | Upsert logic, conflict resolution, RLS enforcement | Testcontainers (PostgreSQL) |
|
||||||
|
| Inference → PostgreSQL | Ownership write, confidence update, correction propagation | Testcontainers (PostgreSQL) |
|
||||||
|
| API → Meilisearch | Index sync, search query construction, tenant filter injection | Meilisearch test instance (Docker) |
|
||||||
|
| API → Redis | Cache set/get/invalidation, TTL behavior | ioredis-mock or Testcontainers (Redis) |
|
||||||
|
| Slack Bot → Portal API | Command → search → format response | Supertest against local API |
|
||||||
|
| Stripe webhook → API | Subscription activation, plan change, cancellation | Stripe CLI webhook forwarding |
|
||||||
|
|
||||||
|
### E2E / Smoke Test Scenarios
|
||||||
|
|
||||||
|
1. Full onboarding: GitHub OAuth → AWS connection → discovery trigger → catalog populated
|
||||||
|
2. Cmd+K search returns results in <200ms after discovery
|
||||||
|
3. Ownership correction propagates to similar services
|
||||||
|
4. Slack `/dd0c who owns` returns correct owner
|
||||||
|
5. Discovery accuracy: synthetic org with known ground truth scores >80%
|
||||||
|
6. Governance strict mode: discovery populates pending queue, not catalog directly
|
||||||
|
7. Panic mode: all catalog writes return 503
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Unit Test Strategy (Per Component)
|
||||||
|
|
||||||
|
### 3.1 AWS Scanner (Python / pytest)
|
||||||
|
|
||||||
|
**What to test:**
|
||||||
|
- Resource-to-service grouping heuristics (the core logic)
|
||||||
|
- Confidence score assignment per signal type
|
||||||
|
- Pagination handling for each AWS API
|
||||||
|
- Cross-region scan aggregation
|
||||||
|
- Error handling for throttling, missing permissions, empty accounts
|
||||||
|
|
||||||
|
**Key test cases:**
|
||||||
|
|
||||||
|
```python
|
||||||
|
# tests/unit/test_cfn_scanner.py
|
||||||
|
|
||||||
|
class TestCloudFormationScanner:
|
||||||
|
def test_stack_name_becomes_service_name_with_high_confidence(self):
|
||||||
|
# Given a CFN stack named "payment-api"
|
||||||
|
# Expect service entity with name="payment-api", confidence=0.95
|
||||||
|
|
||||||
|
def test_stack_tags_extracted_as_service_metadata(self):
|
||||||
|
# Given stack with tags {"service": "payment", "team": "payments"}
|
||||||
|
# Expect service.metadata includes both tags
|
||||||
|
|
||||||
|
def test_stacks_in_multiple_regions_deduplicated_by_name(self):
|
||||||
|
# Given same stack name in us-east-1 and us-west-2
|
||||||
|
# Expect single service entity with both regions in infrastructure
|
||||||
|
|
||||||
|
def test_deleted_stacks_excluded_from_results(self):
|
||||||
|
# Given stack with status DELETE_COMPLETE
|
||||||
|
# Expect it is not included in discovered services
|
||||||
|
|
||||||
|
def test_pagination_fetches_all_stacks_beyond_first_page(self):
|
||||||
|
# Given mock returning 2 pages of stacks
|
||||||
|
# Expect all stacks from both pages are processed
|
||||||
|
|
||||||
|
class TestLambdaScanner:
|
||||||
|
def test_lambdas_with_shared_prefix_grouped_into_single_service(self):
|
||||||
|
# Given ["payment-webhook", "payment-processor", "payment-refund"]
|
||||||
|
# Expect single service "payment" with confidence=0.60
|
||||||
|
|
||||||
|
def test_lambda_with_apigw_trigger_gets_higher_confidence(self):
|
||||||
|
# Given Lambda with API Gateway event source mapping
|
||||||
|
# Expect confidence=0.85 (not 0.60)
|
||||||
|
|
||||||
|
def test_standalone_lambda_without_prefix_pattern_kept_as_individual(self):
|
||||||
|
# Given Lambda named "data-export-job" with no siblings
|
||||||
|
# Expect individual service entity, not grouped
|
||||||
|
|
||||||
|
class TestServiceGroupingHeuristics:
|
||||||
|
def test_cfn_stack_takes_priority_over_ecs_service_for_same_name(self):
|
||||||
|
# Given CFN stack "payment-api" AND ECS service "payment-api"
|
||||||
|
# Expect single service entity (not duplicate), source=cloudformation
|
||||||
|
|
||||||
|
def test_explicit_github_repo_tag_overrides_name_matching(self):
|
||||||
|
# Given AWS resource with tag github_repo="acme/payments-v2"
|
||||||
|
# Expect repo_link="acme/payments-v2" with confidence=0.95
|
||||||
|
# (not fuzzy name match result)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Mocking strategy:**
|
||||||
|
- Use `moto` to mock all boto3 calls — no real AWS calls in unit tests
|
||||||
|
- Fixture files in `tests/fixtures/aws/` contain realistic API response payloads
|
||||||
|
- Each fixture named after the scenario: `cfn_stacks_multi_region.json`, `lambda_functions_with_apigw.json`
|
||||||
|
|
||||||
|
```python
|
||||||
|
@pytest.fixture
|
||||||
|
def mock_aws(aws_credentials):
|
||||||
|
with mock_cloudformation(), mock_ecs(), mock_lambda_():
|
||||||
|
yield
|
||||||
|
|
||||||
|
def test_full_scan_produces_expected_service_count(mock_aws, cfn_fixture):
|
||||||
|
setup_mock_cfn_stacks(cfn_fixture)
|
||||||
|
result = AWSScanner(tenant_id="test", role_arn="arn:aws:iam::123:role/test").scan()
|
||||||
|
assert len(result.services) == cfn_fixture["expected_service_count"]
|
||||||
|
```
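
The shared-prefix grouping heuristic exercised by `TestLambdaScanner` can be sketched in a few lines. This is a minimal illustration, not the production scanner: the function name, the first-hyphen prefix rule, and the 0.50 confidence for ungrouped functions are assumptions; the 0.60 group confidence comes from the test cases above.

```python
from collections import defaultdict

def group_lambdas_by_prefix(function_names, min_group_size=2):
    """Group Lambda functions sharing a name prefix into candidate services.

    Hypothetical sketch: the prefix is the text before the first hyphen;
    groups smaller than `min_group_size` stay as individual services.
    """
    groups = defaultdict(list)
    for name in function_names:
        groups[name.split("-")[0]].append(name)

    services = []
    for prefix, members in groups.items():
        if len(members) >= min_group_size:
            # Shared-prefix grouping is a weak signal: confidence 0.60
            services.append({"name": prefix, "members": members, "confidence": 0.60})
        else:
            for name in members:
                # Standalone functions kept as-is (0.50 is an assumed default)
                services.append({"name": name, "members": [name], "confidence": 0.50})
    return services
```

Keeping the heuristic as a pure function like this is what makes the 90% unit coverage target realistic: no boto3 calls, just lists in and dicts out.
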
---

### 3.2 GitHub Scanner (Node.js / Jest)

**What to test:**

- GraphQL query construction and batching
- CODEOWNERS file parsing (all valid formats)
- README first-paragraph extraction
- Deploy workflow target extraction
- Rate limit detection and backoff

**Key test cases:**

```typescript
// tests/unit/github-scanner/codeowners-parser.test.ts

describe('CODEOWNERSParser', () => {
  it('parses_simple_wildcard_ownership_to_team', () => {
    const input = '* @acme/platform-team'
    expect(parse(input)).toEqual([{ pattern: '*', owners: ['@acme/platform-team'] }])
  })

  it('parses_path_specific_ownership', () => {
    const input = '/src/payments/ @acme/payments-team'
    expect(parse(input)).toEqual([{ pattern: '/src/payments/', owners: ['@acme/payments-team'] }])
  })

  it('handles_multiple_owners_per_pattern', () => {
    const input = '*.ts @acme/frontend @acme/platform'
    expect(parse(input)[0].owners).toHaveLength(2)
  })

  it('ignores_comment_lines', () => {
    const input = '# This is a comment\n* @acme/team'
    expect(parse(input)).toHaveLength(1)
  })

  it('returns_empty_array_for_missing_codeowners_file', () => {
    expect(parse(null)).toEqual([])
  })

  it('handles_individual_user_ownership_not_just_teams', () => {
    const input = '* @sarah-chen'
    expect(parse(input)[0].owners[0]).toBe('@sarah-chen')
  })
})

describe('READMEExtractor', () => {
  it('extracts_first_non_heading_non_badge_paragraph', () => {
    const readme = `# Payment Gateway\n\n![build](badge.svg)\n\nHandles Stripe checkout flows.`
    expect(extractDescription(readme)).toBe('Handles Stripe checkout flows.')
  })

  it('returns_null_when_readme_has_only_headings_and_badges', () => {
    const readme = `# Title\n\n![build](badge.svg)`
    expect(extractDescription(readme)).toBeNull()
  })
})

describe('WorkflowTargetExtractor', () => {
  it('extracts_ecs_service_name_from_deploy_workflow', () => {
    const yaml = loadFixture('deploy-workflow-ecs.yml')
    expect(extractDeployTarget(yaml)).toEqual({
      type: 'ecs_service',
      name: 'payment-api',
      cluster: 'production'
    })
  })

  it('extracts_lambda_function_name_from_serverless_deploy', () => {
    const yaml = loadFixture('deploy-workflow-lambda.yml')
    expect(extractDeployTarget(yaml)).toEqual({
      type: 'lambda_function',
      name: 'payment-webhook-handler'
    })
  })
})
```
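
A parser satisfying the CODEOWNERS test cases above is small enough to sketch directly. The production parser is TypeScript; this Python version is an illustration of the same rules (missing file yields an empty list, comments and blank lines are skipped, all tokens after the pattern are owners, whether teams or individuals):

```python
def parse_codeowners(content):
    """Parse CODEOWNERS text into [{'pattern': ..., 'owners': [...]}].

    Sketch only: None (missing file) -> [], '#' comments and blank
    lines skipped, remaining whitespace-separated tokens are owners.
    """
    if content is None:
        return []
    rules = []
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        if owners:
            rules.append({"pattern": pattern, "owners": owners})
    return rules
```

Note that a pattern with no owners is dropped here; GitHub treats such a line as "no owner for matching files", which this sketch approximates by omitting the rule.
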
**Mocking strategy:**

- Use `nock` or `msw` to intercept GitHub GraphQL API calls
- Fixture files in `tests/fixtures/github/` for realistic API responses
- Test the GraphQL query builder separately from the HTTP client

---

### 3.3 Reconciliation Engine (Node.js / Jest)

**What to test:**

- Cross-referencing AWS resources with GitHub repos (all 5 matching rules)
- Deduplication when multiple signals point to the same service
- Conflict resolution when signals disagree
- Batch processing of SQS messages

**Key test cases:**

```typescript
describe('ReconciliationEngine', () => {
  describe('matchAWSToGitHub', () => {
    it('explicit_tag_match_takes_highest_priority', () => {
      const awsService = buildAWSService({ tags: { github_repo: 'acme/payment-gateway' } })
      const ghRepo = buildGHRepo({ name: 'payment-gateway', org: 'acme' })
      const result = reconcile([awsService], [ghRepo])
      expect(result[0].repoLinkSource).toBe('explicit_tag')
      expect(result[0].repoLinkConfidence).toBe(0.95)
    })

    it('deploy_workflow_match_used_when_no_explicit_tag', () => {
      const awsService = buildAWSService({ name: 'payment-api' })
      const ghRepo = buildGHRepo({ deployTarget: 'payment-api' })
      const result = reconcile([awsService], [ghRepo])
      expect(result[0].repoLinkSource).toBe('deploy_workflow')
    })

    it('fuzzy_name_match_used_as_fallback', () => {
      const awsService = buildAWSService({ name: 'payment-service' })
      const ghRepo = buildGHRepo({ name: 'payment-svc' })
      const result = reconcile([awsService], [ghRepo])
      expect(result[0].repoLinkSource).toBe('name_match')
      expect(result[0].repoLinkConfidence).toBe(0.75)
    })

    it('no_match_produces_aws_only_service_entity', () => {
      const awsService = buildAWSService({ name: 'legacy-monolith' })
      const result = reconcile([awsService], [])
      expect(result[0].repoUrl).toBeNull()
      expect(result[0].discoverySources).toContain('cloudformation')
      expect(result[0].discoverySources).not.toContain('github_repo')
    })

    it('deduplicates_cfn_stack_and_ecs_service_with_same_name', () => {
      const cfnService = buildAWSService({ source: 'cloudformation', name: 'payment-api' })
      const ecsService = buildAWSService({ source: 'ecs_service', name: 'payment-api' })
      const result = reconcile([cfnService, ecsService], [])
      expect(result).toHaveLength(1)
      expect(result[0].discoverySources).toContain('cloudformation')
      expect(result[0].discoverySources).toContain('ecs_service')
    })
  })
})
```
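
The matching-rule priority those tests encode — explicit tag (0.95), then deploy workflow target, then fuzzy name match (0.75) — can be sketched as a simple cascade. The real engine is Node.js; this Python sketch is illustrative only. The field names, the `fuzzy_match` normalization, and the 0.85 deploy-workflow confidence are assumptions (the document only states 0.95 and 0.75):

```python
def fuzzy_match(a, b):
    # Crude normalization: treat '-service' / '-svc' suffixes as equivalent.
    norm = lambda s: s.replace("-service", "").replace("-svc", "")
    return norm(a) == norm(b)

def match_repo(aws_service, gh_repos):
    """Pick a repo link for an AWS service, highest-priority rule first."""
    # Rule 1: explicit github_repo tag wins outright
    tag = aws_service.get("tags", {}).get("github_repo")
    if tag:
        return {"repo": tag, "source": "explicit_tag", "confidence": 0.95}
    # Rule 2: a deploy workflow that targets this service name
    for repo in gh_repos:
        if repo.get("deploy_target") == aws_service["name"]:
            return {"repo": repo["name"], "source": "deploy_workflow", "confidence": 0.85}
    # Rule 3: fuzzy name match as the weakest fallback
    for repo in gh_repos:
        if fuzzy_match(aws_service["name"], repo["name"]):
            return {"repo": repo["name"], "source": "name_match", "confidence": 0.75}
    # No match: caller produces an AWS-only service entity
    return None
```

Because each rule is checked in order and returns immediately, the priority tests above reduce to feeding one input per rule and asserting on `source`.
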
---

### 3.4 Ownership Inference Engine (Python / pytest)

This is the highest-value unit test target. Ownership inference is the most complex logic and the most likely source of accuracy failures.

**Key test cases:**

```python
class TestOwnershipScorer:
    def test_codeowners_weighted_highest_at_0_40(self):
        signals = [Signal(type='codeowners', team='payments', raw_score=1.0)]
        result = score_ownership(signals)
        assert result['payments'].weighted_score == pytest.approx(0.40)

    def test_multiple_signals_summed_correctly(self):
        signals = [
            Signal(type='codeowners', team='payments', raw_score=1.0),           # 0.40
            Signal(type='cfn_tag', team='payments', raw_score=1.0),              # 0.20
            Signal(type='git_blame_frequency', team='payments', raw_score=1.0),  # 0.25
        ]
        result = score_ownership(signals)
        assert result['payments'].total_score == pytest.approx(0.85)

    def test_primary_owner_is_highest_scoring_team(self):
        signals = [
            Signal(type='codeowners', team='payments', raw_score=1.0),
            Signal(type='git_blame_frequency', team='platform', raw_score=1.0),
        ]
        result = score_ownership(signals)
        assert result.primary_owner == 'payments'

    def test_service_marked_unowned_when_top_score_below_0_50(self):
        signals = [Signal(type='git_blame_frequency', team='unknown', raw_score=0.3)]
        result = score_ownership(signals)
        assert result.status == 'unowned'

    def test_service_marked_ambiguous_when_top_two_within_0_10(self):
        signals = [
            Signal(type='codeowners', team='payments', raw_score=0.8),
            Signal(type='codeowners', team='platform', raw_score=0.75),
        ]
        result = score_ownership(signals)
        assert result.status == 'ambiguous'

    def test_user_correction_overrides_all_inference_with_score_1_00(self):
        signals = [
            Signal(type='codeowners', team='payments', raw_score=1.0),
            Signal(type='user_correction', team='platform', raw_score=1.0),
        ]
        result = score_ownership(signals)
        assert result.primary_owner == 'platform'
        assert result.primary_confidence == 1.00
        assert result.primary_source == 'user_correction'

    def test_correction_propagation_applies_to_matching_repo_prefix(self):
        correction = Correction(repo='payment-gateway', team='payments')
        candidates = ['payment-processor', 'payment-webhook', 'auth-service']
        propagated = propagate_correction(correction, candidates)
        assert 'payment-processor' in propagated
        assert 'payment-webhook' in propagated
        assert 'auth-service' not in propagated
```
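
For readers following the weighting math, here is a minimal sketch of a scorer that satisfies those tests: codeowners 0.40, git blame frequency 0.25, CFN tag 0.20 (weights taken from the test comments; the remainder of the weight budget for weaker signals is an assumption), a user correction short-circuiting everything at 1.00, ambiguity checked when the top two teams are within 0.10, and anything below 0.50 marked unowned. Dicts stand in for the real `Signal`/result types:

```python
WEIGHTS = {
    "codeowners": 0.40,
    "git_blame_frequency": 0.25,
    "cfn_tag": 0.20,
    # Remaining weight is spread over weaker signals not shown here (assumption).
}

def score_ownership(signals):
    """Sketch of the scoring rules the tests above pin down."""
    # A user correction is authoritative: confidence 1.00, inference ignored.
    for s in signals:
        if s["type"] == "user_correction":
            return {"primary_owner": s["team"], "confidence": 1.00,
                    "source": "user_correction", "status": "owned"}

    totals = {}
    for s in signals:
        weight = WEIGHTS.get(s["type"], 0.0)
        totals[s["team"]] = totals.get(s["team"], 0.0) + weight * s["raw_score"]

    if not totals:
        return {"primary_owner": None, "confidence": 0.0,
                "source": None, "status": "unowned"}

    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    top_team, top_score = ranked[0]
    if len(ranked) > 1 and top_score - ranked[1][1] < 0.10:
        status = "ambiguous"      # top two too close to call
    elif top_score < 0.50:
        status = "unowned"        # nothing scored convincingly
    else:
        status = "owned"
    return {"primary_owner": top_team, "confidence": top_score,
            "source": "inference", "status": status}
```

The ordering matters: ambiguity is checked before the unowned threshold, so two near-tied low scores surface as "ambiguous" (a human should decide) rather than silently "unowned".
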
---

### 3.5 Portal API — Route Handlers (Node.js / Jest + Supertest)

**What to test:**

- Tenant isolation enforcement (tenant_id injected into every query)
- Search endpoint proxies to Meilisearch with a mandatory tenant filter
- PATCH /services enforces correction logging
- Auth middleware rejects unauthenticated requests

```typescript
describe('GET /api/v1/services/search', () => {
  it('injects_tenant_id_filter_into_meilisearch_query', async () => {
    const spy = jest.spyOn(meilisearchClient, 'search')
    await request(app)
      .get('/api/v1/services/search?q=payment')
      .set('Authorization', `Bearer ${tenantAToken}`)
    expect(spy).toHaveBeenCalledWith(expect.objectContaining({
      filter: expect.stringContaining(`tenant_id = '${TENANT_A_ID}'`)
    }))
  })

  it('returns_401_when_no_auth_token_provided', async () => {
    const res = await request(app).get('/api/v1/services/search?q=payment')
    expect(res.status).toBe(401)
  })

  it('tenant_a_cannot_see_tenant_b_services', async () => {
    // Seed Meilisearch with services for both tenants
    // Query as tenant A, assert no tenant B results
  })
})

describe('PATCH /api/v1/services/:id', () => {
  it('stores_correction_in_corrections_table', async () => {
    await request(app)
      .patch(`/api/v1/services/${SERVICE_ID}`)
      .send({ team_id: NEW_TEAM_ID })
      .set('Authorization', `Bearer ${adminToken}`)
    const correction = await db.corrections.findFirst({ where: { service_id: SERVICE_ID } })
    expect(correction).toBeDefined()
    expect(correction.new_value).toMatchObject({ team_id: NEW_TEAM_ID })
  })

  it('sets_confidence_to_1_00_on_user_correction', async () => {
    await request(app).patch(`/api/v1/services/${SERVICE_ID}`).send({ team_id: NEW_TEAM_ID })
    const ownership = await db.service_ownership.findFirst({ where: { service_id: SERVICE_ID } })
    expect(ownership.confidence).toBe(1.00)
    expect(ownership.source).toBe('user_correction')
  })
})
```
### 3.6 Slack Bot Command Handlers (Node.js / Jest)

**What to test:**

- Command parsing (`/dd0c who owns <service>`)
- Typo-tolerant matching (delegated to search, but the bot must handle 0 results)
- Block Kit message formatting
- Error handling (unauthorized workspace, missing service)
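
Command parsing is the kind of pure function worth test-driving first. The real handler is Node.js; this Python sketch shows the idea, with only the `who owns` intent covered (the function name and the `(intent, argument)` return shape are assumptions):

```python
import re

def parse_command(text):
    """Parse the text after `/dd0c`, e.g. 'who owns payment-gateway'.

    Returns (intent, argument), or (None, None) for unrecognized input
    so the handler can reply with usage help instead of crashing.
    """
    match = re.match(r"^\s*who\s+owns\s+(\S+)\s*$", text, re.IGNORECASE)
    if match:
        return ("who_owns", match.group(1))
    return (None, None)
```

With parsing isolated like this, the Jest suite can cover the Block Kit formatting and zero-result paths separately from the grammar.
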
### 3.7 Feature Flags & Governance Policy (Node.js / Jest)

**What to test:**

- Flag evaluation (`openfeature` provider)
- Governance strict vs. audit mode
- Panic mode blocking writes
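
The governance decision itself is a three-way branch, easy to pin down before wiring it to HTTP. The production engine is Node.js; this Python sketch is illustrative, and the return shapes are assumptions (only the 503-on-panic and strict-mode-queues behaviors come from the document):

```python
def handle_catalog_write(policy, discovered_service):
    """Route a discovery result according to tenant governance policy.

    panic mode  -> reject every write (503)
    strict mode -> queue for human review, never write directly
    audit mode  -> write directly, but flag the write for later review
    """
    if policy.get("panic_mode"):
        return {"status": 503, "action": "rejected"}
    if policy.get("mode") == "strict":
        return {"status": 202, "action": "pending_review", "service": discovered_service}
    return {"status": 201, "action": "written", "audited": True, "service": discovered_service}
```

Checking `panic_mode` before anything else is deliberate: the kill switch must win even when a tenant is otherwise in permissive audit mode.
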
---

## 4. Integration Test Strategy

Integration tests verify that our code correctly interacts with external boundaries: databases, caches, search indices, and third-party APIs.

### 4.1 Service Boundary Tests

- **Discovery ↔ GitHub/GitLab:** Use `nock` or `MSW` to mock the GitHub GraphQL endpoint. Assert that the Node.js scanner constructs the correct query and handles rate limits (HTTP 403/429) via retries.
- **Catalog ↔ PostgreSQL:** Use Testcontainers for PostgreSQL to verify complex `upsert` queries, foreign key constraints, and RLS (Row-Level Security) tenant isolation.
- **API ↔ Meilisearch:** Use a Meilisearch Docker container. Assert that document syncing (PostgreSQL -> SQS -> Meilisearch) completes and that search queries with `tenant_id` filters return the expected subset of data.

### 4.2 Git Provider API Contract Tests

- Write scheduled "contract tests" that run against the *live* GitHub API daily using a dedicated test org.
- These detect if GitHub changes its GraphQL schema or rate limit behavior.
- Assert that `HEAD:CODEOWNERS` blob extraction still works.

### 4.3 Testcontainers for Local Infrastructure

- **Database:** `testcontainers-node` spinning up `postgres:15-alpine`.
- **Search:** `getmeili/meilisearch:latest`.
- **Cache:** `redis:7-alpine`.
- Run these in GitHub Actions via Docker-in-Docker.

---

## 5. E2E & Smoke Tests

E2E tests treat the system as a black box, interacting only through the API and the React UI. We keep these fast and focused on the "5-Minute Miracle" critical path.

### 5.1 Critical User Journeys (Playwright)

1. **The Onboarding Flow:** Mock GitHub OAuth login -> connect AWS (mock CFN role ARN validation) -> trigger discovery -> wait for WebSocket completion -> verify 147 services appear in the catalog.
2. **Cmd+K Search:** Open the modal (`Cmd+K`) -> type "pay" -> assert "payment-gateway" is highlighted in < 200ms -> press Enter -> assert the service detail card opens.
3. **Correcting Ownership:** Open service detail -> click "Correct Owner" -> select a new team -> assert the badge changes to 100% confidence -> assert Meilisearch is updated.

### 5.2 The >80% Auto-Discovery Accuracy Validation

- **The "Party Mode" Org:** Maintain a real GitHub org and a mock AWS environment with exactly 100 known services, 10 known teams, and deliberately chaotic naming conventions.
- **The Assertion:** Run discovery. Assert that more than 80 of the services are correctly inferred with the right primary owner and repo link.
- *This is the most important test in the suite. If a PR drops this below 80%, it cannot be merged.*
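
The accuracy gate reduces to a small comparison against the ground-truth manifest. A minimal sketch, assuming both sides are maps from service name to an `{"owner": ..., "repo": ...}` record (the real manifest shape may differ):

```python
def discovery_accuracy(ground_truth, discovered):
    """Fraction of ground-truth services discovered with the correct
    owner and repo link. Missing or wrong entries both count as misses.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for name, expected in ground_truth.items()
        if discovered.get(name) == expected
    )
    return correct / len(ground_truth)
```

In CI this becomes a one-line merge gate: `assert discovery_accuracy(gt, result) > 0.80`.
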
### 5.3 Synthetic Topology Generation

- A script generates `N` mock CFN stacks, `M` ECS services, and `K` GitHub repos to feed the E2E environment without hitting AWS/GitHub rate limits.
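
Such a generator can be a few lines if it is deterministic (seeded), so E2E failures reproduce. A sketch with illustrative shapes, not the real fixture format; the overlapping names are deliberate so the reconciler has deduplication work to do:

```python
import random

def generate_topology(n_stacks, m_ecs, k_repos, seed=0):
    """Generate a deterministic synthetic org: N mock CFN stacks,
    M ECS services, K GitHub repos, with overlapping service names.
    """
    rng = random.Random(seed)  # seeded -> same topology on every run
    names = [f"service-{i:03d}" for i in range(max(n_stacks, m_ecs, k_repos))]
    stacks = [{"source": "cloudformation", "name": n} for n in names[:n_stacks]]
    ecs = [{"source": "ecs_service", "name": n} for n in rng.sample(names, m_ecs)]
    repos = [{"org": "acme", "name": n} for n in names[:k_repos]]
    return {"cfn_stacks": stacks, "ecs_services": ecs, "github_repos": repos}
```
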
---

## 6. Performance & Load Testing

Load tests ensure the serverless architecture scales correctly and that Cmd+K search remains instantaneous.

### 6.1 Discovery Scan Benchmarks

- **Target:** 500 AWS resources + 500 GitHub repos scanned and reconciled in < 120 seconds.
- **Tooling:** k6 or Artillery. Push 5,000 synthetic SQS messages into the Reconciler queue and measure Lambda batch processing throughput.

### 6.2 Catalog Query Latency

- **Target:** The API search endpoint returns in < 100ms at the 99th percentile.
- **Test:** Load Meilisearch with 10,000 service documents. Fire 50 concurrent Cmd+K search requests per second. Assert p99 latency.
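
For the assertion itself, one common choice is the nearest-rank percentile over the collected samples. A sketch; in practice the samples would come from k6/Artillery output rather than being computed by hand:

```python
import math

def p99(latencies_ms):
    """Nearest-rank 99th percentile (1-indexed rank = ceil(0.99 * n))."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]
```

The gate then reads `assert p99(samples) < 100`.
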
### 6.3 Concurrent Scorecard Evaluation

- Ensure the Python inference Lambda can evaluate 1,000 services concurrently without exhausting database connections (using Aurora Serverless v2 connection pooling).

---

## 7. CI/CD Pipeline Integration

The test pyramid is enforced through GitHub Actions.

### 7.1 Test Stages

- **Pre-commit:** Husky runs ESLint, Prettier, and fast unit tests (Jest/pytest) for changed files only.
- **PR Gate:** Runs the full unit and integration test suites. Blocks merge if coverage drops or tests fail.
- **Merge (Main):** Deploys to staging. Runs the E2E critical user journeys and the 80% accuracy validation suite against the Party Mode org.
- **Post-Deploy:** Smoke tests verify health endpoints and ALB routing in production.

### 7.2 Coverage Thresholds

- Global unit test coverage: 80%
- Ownership inference & reconciliation logic: 95%
- Feature flag & governance evaluators: 100%

### 7.3 Test Parallelization

- Jest tests run with `--maxWorkers=50%` locally and `100%` in CI.
- Integration tests using Testcontainers run serially per file to avoid database port conflicts, or use dynamic port binding and separate schemas for parallel execution.

---

## 8. Transparent Factory Tenet Testing

Testing the governance and compliance features of the IDP itself.

### 8.1 Feature Flag Circuit Breakers

- **Test:** Enable a flagged discovery heuristic that generates 10 phantom services.
- **Assert:** The system detects the threshold breach (>5 unconfirmed), auto-disables the flag, and marks the 10 services as `status: quarantined`.
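
The circuit-breaker logic under test can be sketched as a pure function over the services a flagged heuristic produced. Function and field names are hypothetical; the >5-unconfirmed threshold and the `quarantined` status come from the scenario above:

```python
def evaluate_circuit_breaker(flag, services, threshold=5):
    """Auto-disable a flagged heuristic that produced too many
    unconfirmed services, quarantining everything it created.
    """
    unconfirmed = [s for s in services
                   if s.get("flag") == flag and not s.get("confirmed")]
    if len(unconfirmed) > threshold:
        for s in unconfirmed:
            s["status"] = "quarantined"   # hide from the catalog, keep for review
        return {"flag": flag, "enabled": False, "quarantined": len(unconfirmed)}
    return {"flag": flag, "enabled": True, "quarantined": 0}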
### 8.2 Schema Migration Validation

- **Test:** Attempt to apply a PR that drops a column from the `services` table.
- **Assert:** The CI migration validator script fails the build (additive-only rule).
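
An additive-only validator can start as little more than a pattern scan over the migration SQL. A sketch only: the pattern list is illustrative, not an exhaustive rule set, and a production validator would parse the statements rather than regex them:

```python
import re

DESTRUCTIVE = re.compile(
    r"\b(DROP\s+COLUMN|DROP\s+TABLE|ALTER\s+COLUMN\s+\S+\s+TYPE)\b",
    re.IGNORECASE,
)

def validate_migration(sql):
    """Fail on statements that drop or retype existing schema objects."""
    violations = [m.group(0) for m in DESTRUCTIVE.finditer(sql)]
    return {"ok": not violations, "violations": violations}
```

Wired into CI, a non-empty `violations` list fails the build with the offending statements in the log.
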
### 8.3 Decision Log Enforcement

- **Test:** Run a discovery scan where service ownership is inferred from `git blame`.
- **Assert:** A `decision_log` entry is written to PostgreSQL with the prompt/reasoning, alternatives, and confidence.

### 8.4 OTEL Span Assertions

- **Test:** Run the Reconciler Lambda.
- **Assert:** The `catalog_scan` parent span contains child spans for `ownership_inference` with attributes for `catalog.service_id`, `catalog.ownership_signals`, and `catalog.confidence_score`. Use an in-memory OTEL exporter for testing.

### 8.5 Governance Policy Enforcement

- **Test:** Set the tenant policy to `strict` mode. Simulate auto-discovery finding a new service.
- **Assert:** The service is placed in the "pending review" queue and NOT visible in the main catalog.
- **Test:** Set `panic_mode: true`. Attempt a `PATCH /api/v1/services/123`.
- **Assert:** HTTP 503 Service Unavailable.

---

## 9. Test Data & Fixtures

High-quality fixtures are the lifeblood of this TDD strategy.

### 9.1 GitHub/GitLab API Response Factories

- JSON files containing real, obfuscated GraphQL responses for repositories, `CODEOWNERS` blobs, and team memberships.
- Use factories (e.g., `fishery` or custom functions) to easily override fields: `buildGHRepo({ name: 'auth-service', languages: ['Go'] })`.

### 9.2 Synthetic Topology Generators

- Scripts that generate interconnected AWS resources (e.g., a CFN stack containing an API Gateway routing to 3 Lambdas backed by 1 RDS instance).

### 9.3 `CODEOWNERS` and Git Blame Mocks

- Diverse `CODEOWNERS` files covering edge cases: wildcard matching, deep path matching, invalid syntax, user-vs-team owners.

---

## 10. TDD Implementation Order

To bootstrap the platform efficiently, testing and development should follow this sequence, based on Epic dependencies:

1. **Epic 2 (GitHub Parsers):** Write pure unit tests for the `CODEOWNERS` parser and `README` extractor. *Value: high ROI, zero dependencies.*
2. **Epic 1 (AWS Heuristics):** Write unit tests for mapping CFN stacks and tags to Service entities. *Value: core product logic.*
3. **Epic 2 (Ownership Inference):** TDD the scoring algorithm. Build the weighting math. *Value: the brain of the platform.*
4. **Epic 3 (Service Catalog Schema):** Integration tests for PostgreSQL RLS and upserting services. *Value: data durability.*
5. **Epic 2 (Reconciliation):** Unit tests merging AWS and GitHub mock entities. *Value: pipeline glue.*
6. **Epic 4 (Search Sync):** Integration tests for pushing DB updates to Meilisearch.
7. **Epic 5 (API & UI):** E2E test for the Cmd+K search flow.
8. **Epic 10 (Governance & Flags):** Unit tests for feature flag circuit breakers and strict mode.
9. **Epic 9 (Onboarding):** Playwright E2E for the 5-Minute Miracle flow.

This sequence ensures the most complex algorithmic logic is proven before it is wired to databases and APIs.
2421
products/05-aws-cost-anomaly/architecture/architecture.md
Normal file
File diff suppressed because it is too large
340
products/05-aws-cost-anomaly/brainstorm/session.md
Normal file
@@ -0,0 +1,340 @@
# dd0c/cost — AWS Cost Anomaly Detective
|
||||||
|
## Brainstorming Session
|
||||||
|
|
||||||
|
**Date:** February 28, 2026
|
||||||
|
**Facilitator:** Carson, Elite Brainstorming Specialist
|
||||||
|
**Product:** dd0c/cost (Product #5 in the dd0c platform)
|
||||||
|
**Target:** Teams spending $10K–$500K/mo on AWS who want instant alerts when something spikes
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1: Problem Space (25 ideas)
|
||||||
|
|
||||||
|
### The Emotional Gut-Punch
|
||||||
|
|
||||||
|
1. **The 3am Stomach Drop** — You open your phone, see a Slack message from finance: "Why is our AWS bill $18K over budget?" Your heart rate spikes. You don't even know where to start looking. AWS Cost Explorer loads in 8 seconds and shows you a bar chart that means nothing.
|
||||||
|
|
||||||
|
2. **The Blame Game** — Someone left a GPU instance running over the weekend. Nobody knows who. The CTO asks in the all-hands. Three teams point fingers. The intern who actually did it is too scared to speak up. The political fallout lasts weeks.
|
||||||
|
|
||||||
|
3. **The "It Was Just a Test" Excuse** — A developer spun up a 5-node EMR cluster "just to test something real quick." That was 11 days ago. It's been burning $47/hour. Nobody noticed because nobody looks at billing until month-end.
|
||||||
|
|
||||||
|
4. **The NAT Gateway Surprise** — The single most rage-inducing line item on any AWS bill. Teams discover they're paying $3K/month for NAT Gateway data processing and have zero idea which service is generating the traffic. AWS gives you no breakdown.
|
||||||
|
|
||||||
|
5. **The Data Transfer Black Hole** — Cross-region, cross-AZ, internet egress, VPC endpoints, PrivateLink — data transfer costs are a labyrinth. Even experienced architects can't predict them. They just show up as a lump sum.
|
||||||
|
|
||||||
|
6. **The Autoscaling Runaway** — A traffic spike triggers autoscaling. The spike ends. Autoscaling doesn't scale back down because the cooldown period is misconfigured. You're now running 40 instances instead of 4. For three days.
|
||||||
|
|
||||||
|
7. **The Reserved Instance Waste** — You bought $50K in reserved instances for m5.xlarge. Six months later, the team migrated to Graviton (m7g). The reservations are burning money on instances nobody uses.
|
||||||
|
|
||||||
|
8. **The S3 Lifecycle Policy That Never Was** — "We'll add lifecycle policies later." Later never comes. You're storing 80TB of debug logs from 2023 in S3 Standard at $0.023/GB. That's $1,840/month for data nobody will ever read.
9. **The EBS Snapshot Graveyard** — Hundreds of orphaned EBS snapshots from deleted instances. Each one costs pennies, but collectively they're $400/month. Nobody even knows they exist.

10. **The CloudWatch Log Explosion** — A misconfigured Lambda starts logging every request payload at DEBUG level. CloudWatch ingestion costs go from $50/month to $2,000/month in 48 hours. The default CloudWatch dashboard doesn't show cost impact.

### Why AWS Cost Explorer Sucks

11. **24-48 Hour Delay** — Cost Explorer data is delayed by up to 48 hours. By the time you see the spike, you've already burned thousands. Real-time? AWS doesn't know the meaning of the word.

12. **Terrible Filtering UX** — Want to see costs for a specific team's resources? Hope you tagged everything perfectly. Spoiler: you didn't. Cost Explorer's filter UI is a nightmare of dropdowns and "apply" buttons.

13. **No Actionable Context** — Cost Explorer tells you "EC2 costs went up 300%." It does NOT tell you which specific instances, who launched them, when, or why. You have to cross-reference with CloudTrail manually.

14. **Anomaly Detection is a Joke** — AWS Cost Anomaly Detection exists but: alerts are delayed (same 24-48hr lag), the ML model is a black box you can't tune, the false positive rate is absurd, and the notification options are limited to SNS/email (no native Slack).

15. **No Remediation Path** — Even when you find the problem in Cost Explorer, there's no "fix it" button. You have to context-switch to the EC2 console, find the resource, and manually terminate/resize. That's 15 clicks minimum.

16. **Forecasting is Useless** — AWS's cost forecast is a straight-line projection that ignores seasonality, deployment patterns, and common sense. "Based on current trends, your bill will be $∞."

### What Causes Cost Spikes (The Usual Suspects)

17. **Zombie Resources** — EC2 instances, RDS databases, Elastic IPs, Load Balancers, Redshift clusters that are running but serving no traffic. The #1 source of waste. Every AWS account has them.

18. **Right-Sizing Neglect** — Running m5.4xlarge instances that average 8% CPU utilization. Nobody downsizes because "what if we need the headroom?" (They never do.)

19. **Dev/Staging Environments Running 24/7** — Production needs to be always-on. Dev and staging do not. But they run 24/7 because nobody set up a schedule. That's 75% waste on non-prod.

20. **Marketplace AMI Licensing** — Someone launched an instance with a marketplace AMI that costs $2/hour in licensing fees on top of the EC2 cost. The license cost doesn't show up where you'd expect.

21. **Elastic IP Charges** — Allocated but unattached Elastic IPs cost $0.005/hour each. Sounds tiny. 50 orphaned EIPs = $180/month. Death by a thousand cuts.
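A quick check of that math, assuming a 720-hour month:

```python
EIP_IDLE_HOURLY = 0.005  # $/hour per unattached Elastic IP, as quoted above

def monthly_eip_waste(orphaned_count: int, hours_per_month: int = 720) -> float:
    """Monthly spend on Elastic IPs that are allocated but attached to nothing."""
    return orphaned_count * EIP_IDLE_HOURLY * hours_per_month

# 50 orphaned EIPs is roughly $180/month of pure waste
waste = monthly_eip_waste(50)
```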
22. **Lambda Concurrency Explosions** — A recursive Lambda invocation bug or a sudden traffic spike causes thousands of concurrent executions. The per-invocation cost is low, but at 10K concurrent it adds up fast.

23. **DynamoDB On-Demand Pricing Surprises** — Teams choose on-demand for convenience, then discover their read/write patterns would be 80% cheaper with provisioned capacity + auto-scaling.

24. **Multi-Account Sprawl** — Organizations with 20+ AWS accounts lose track of which accounts are active, who owns them, and what's running in them. Consolidated billing hides the details.

25. **Savings Plans Mismatch** — Bought Compute Savings Plans based on last quarter's usage. This quarter's usage shifted to a different instance family/region. Savings Plans don't cover it. You're paying on-demand AND wasting the commitment.

---

## Phase 2: Solution Space (57 ideas)

### Detection Approaches

26. **CloudWatch Billing Metrics (Near Real-Time)** — Poll the `EstimatedCharges` metric every 5 minutes. It's the fastest signal AWS provides. Not perfect, but way better than Cost Explorer's 48-hour lag.
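A sketch of what that poll could look like. Note that `EstimatedCharges` is published to CloudWatch in us-east-1 only, and only after billing alerts are enabled on the account; the function names and window sizes here are illustrative:

```python
from datetime import datetime, timedelta, timezone

def estimated_charges_request(hours_back: int = 6) -> dict:
    """Build a GetMetricStatistics request for the account-level
    EstimatedCharges metric (hourly datapoints, running maximum)."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "StartTime": now - timedelta(hours=hours_back),
        "EndTime": now,
        "Period": 3600,          # one datapoint per hour
        "Statistics": ["Maximum"],
    }

def poll_estimated_charges() -> list:
    """Fetch recent datapoints, oldest first."""
    import boto3  # deferred: only needed when actually hitting AWS
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    resp = cw.get_metric_statistics(**estimated_charges_request())
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```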
27. **CloudTrail Event Stream** — Monitor `RunInstances`, `CreateDBInstance`, `CreateFunction`, `CreateNatGateway`, etc. in real-time via EventBridge. Detect expensive resource creation the MOMENT it happens, before any cost accrues.
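A sketch of the EventBridge side, assuming a CloudTrail trail is delivering management events. The event names are CloudTrail's names for the API calls listed above (Lambda's is the versioned `CreateFunction20150331`); the rule name is hypothetical:

```python
import json

# EventBridge pattern matching "an expensive thing just got created" API calls.
EXPENSIVE_CREATE_PATTERN = {
    "detail-type": ["AWS API Call via CloudTrail"],
    "source": ["aws.ec2", "aws.rds", "aws.lambda"],
    "detail": {
        "eventName": [
            "RunInstances",            # EC2
            "CreateDBInstance",        # RDS
            "CreateNatGateway",        # the $3K/month surprise
            "CreateFunction20150331",  # Lambda (CloudTrail's versioned event name)
        ]
    },
}

def put_detection_rule(rule_name: str = "dd0c-expensive-create") -> None:
    """Register the rule; the deferred import keeps the pattern testable offline."""
    import boto3
    events = boto3.client("events")
    events.put_rule(Name=rule_name, EventPattern=json.dumps(EXPENSIVE_CREATE_PATTERN))
```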
28. **Cost and Usage Report (CUR) Hourly Parsing** — AWS CUR can be delivered hourly to S3. Parse it with a lightweight Lambda/Fargate job. Gives line-item granularity that the Cost Explorer API doesn't.

29. **Hybrid Detection: Events + Billing** — Use CloudTrail for instant "something was created" alerts, then correlate with CUR data for actual cost impact. Best of both worlds.

30. **Tag-Based Cost Boundaries** — Let users define expected cost ranges per tag (e.g., `team:payments` should be $2K-$4K/month). Alert when any tag group exceeds its boundary.

31. **Service-Level Baselines** — Automatically learn the "normal" cost pattern for each AWS service in the account. Alert on deviations. No manual threshold setting required.

32. **Account-Level Anomaly Scoring** — Assign each AWS account a daily "anomaly score" (0-100) based on deviation from historical patterns. Dashboard shows accounts ranked by anomaly severity.

### Anomaly Algorithms

33. **Statistical Z-Score Detection** — Simple, explainable. Calculate rolling mean and standard deviation for each service/tag. Alert when current spend exceeds 2σ or 3σ. Users understand "this is 3 standard deviations above normal."
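A minimal sketch of that rolling z-score check over a window of daily totals:

```python
from statistics import mean, stdev

def spend_zscore(history: list[float], today: float) -> float:
    """How many standard deviations today's spend sits above the rolling history
    (e.g. the last 30 daily totals for one service or tag)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:  # perfectly flat history: any move at all is surprising
        return float("inf") if today != mu else 0.0
    return (today - mu) / sigma

def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """The 3-sigma rule from the text; lower the threshold for higher sensitivity."""
    return spend_zscore(history, today) >= threshold
```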
34. **Seasonal Decomposition (STL)** — Decompose the cost time series into trend + seasonal + residual. Alert on residual spikes. Handles weekly patterns (lower on weekends) and monthly patterns (batch jobs on the 1st).

35. **Prophet-Style Forecasting** — Use Facebook Prophet or similar for time-series forecasting. Compare actual vs. predicted. Alert on significant positive deviations. Good for accounts with complex seasonality.

36. **Rule-Based Guardrails** — Simple rules that catch 80% of problems: "Alert if any single resource costs >$X/day", "Alert if a new service appears that wasn't used last month", "Alert if daily spend exceeds 150% of the 30-day average."
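Those three rules are only a few lines of code. A sketch, with hypothetical thresholds and the 150% rule applied per service rather than account-wide:

```python
def guardrail_alerts(
    spend_today: dict[str, float],    # service -> today's spend
    avg_30d: dict[str, float],        # service -> 30-day daily average
    per_service_cap: float = 200.0,   # hypothetical "$X/day" cap
    spike_ratio: float = 1.5,
) -> list[str]:
    """Evaluate the three example rules above; returns human-readable alerts."""
    alerts = []
    for service, spend in spend_today.items():
        if spend > per_service_cap:
            alerts.append(f"{service}: ${spend:,.0f}/day exceeds the ${per_service_cap:,.0f} cap")
        if service not in avg_30d:
            alerts.append(f"{service}: new service not seen last month")
        elif spend > spike_ratio * avg_30d[service]:
            alerts.append(f"{service}: spend is over {spike_ratio:.0%} of its 30-day average")
    return alerts
```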
37. **Peer Comparison** — "Your EC2 spend per engineer is 3x the median for companies your size." Anonymized benchmarking across dd0c customers. Powerful social proof for optimization.

38. **Rate-of-Change Detection** — Don't just look at absolute cost. Look at the derivative. A service going from $10/day to $50/day is a 5x spike even though the absolute number is small. Catch problems early, when they're cheap to fix.
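A sketch of the day-over-day check; the `floor` parameter is an illustrative guard against divide-by-near-zero noise on tiny baselines:

```python
def growth_multiple(yesterday: float, today: float, floor: float = 1.0) -> float:
    """Day-over-day spend multiple; `floor` damps noise on near-zero baselines."""
    return today / max(yesterday, floor)

# $10/day -> $50/day is a 5x spike long before the absolute number hurts
spike = growth_multiple(10.0, 50.0)
```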
39. **Composite Anomaly Detection** — Combine multiple signals: cost spike + new resource creation + unusual API calls = high-confidence anomaly. Single signals = low-confidence (reduces false positives).

### Remediation

40. **One-Click Stop Instance** — See a runaway EC2 instance? Click "Stop" right from the dd0c alert. We execute `StopInstances` via the customer's IAM role. No console context-switching.
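A sketch of what executing `StopInstances` through a customer-provisioned cross-account role could look like; the role session name and the validation helper are illustrative, not a committed design:

```python
import re

def looks_like_instance_id(s: str) -> bool:
    """Sanity-check before firing a remediation action (8- or 17-char hex suffix)."""
    return re.fullmatch(r"i-[0-9a-f]{8}([0-9a-f]{9})?", s) is not None

def stop_instance(instance_id: str, role_arn: str, region: str) -> dict:
    """One-click stop via the customer's cross-account IAM role."""
    if not looks_like_instance_id(instance_id):
        raise ValueError(f"refusing to act on suspicious id: {instance_id}")
    import boto3  # deferred so the validation above stays testable offline
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="dd0c-remediation"
    )["Credentials"]
    ec2 = boto3.client(
        "ec2", region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return ec2.stop_instances(InstanceIds=[instance_id])
```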
41. **One-Click Terminate with Snapshot** — For instances that should be killed: terminate, but automatically create an EBS snapshot first, so nothing is lost. Safety net built in.

42. **Schedule Non-Prod Shutdown** — "This dev environment runs 24/7 but only gets traffic 9am-6pm ET." One click to create a start/stop schedule. Instant 62% savings.

43. **Right-Size Recommendation with Apply** — "This m5.4xlarge averages 8% CPU. Recommended: m5.large. Estimated savings: $312/month." Click "Apply" to resize. We handle the stop/modify/start.

44. **Auto-Kill Zombie Resources** — Define a policy: "Any EC2 instance with <1% CPU for 7 days gets auto-terminated." dd0c enforces it. Opt-in, with a 24-hour warning notification before termination.

45. **Budget Circuit Breaker** — Set a hard daily/weekly budget. When spend approaches the limit, dd0c automatically stops non-essential resources (tagged as `priority:low`). Like a financial circuit breaker.

46. **Savings Plan Optimizer** — Analyze usage patterns and recommend optimal Savings Plan purchases. Show the exact commitment amount and projected savings. One-click purchase through AWS.

47. **Reserved Instance Exchange Assistant** — Got unused RIs? dd0c finds the optimal exchange path to convert them to instance types you actually use. Handles the RI Marketplace listing if exchange isn't possible.

48. **S3 Lifecycle Policy Generator** — Scan S3 buckets, analyze access patterns, generate optimal lifecycle policies (Standard → IA → Glacier → Delete). One-click apply.

49. **EBS Snapshot Cleanup** — Identify orphaned snapshots, show total cost, one-click bulk delete with a confirmation list.
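The scan itself is straightforward set logic. A sketch over `DescribeSnapshots`-shaped records, with a rough cost estimate at an assumed $0.05/GB-month standard snapshot rate:

```python
def find_orphans(snapshots: list[dict], live_volume_ids: set[str]) -> list[dict]:
    """Snapshots whose source EBS volume no longer exists.
    `snapshots` is shaped like EC2 DescribeSnapshots output (OwnerIds=["self"])."""
    return [s for s in snapshots if s.get("VolumeId") not in live_volume_ids]

def estimated_monthly_cost(snapshots: list[dict], per_gb: float = 0.05) -> float:
    """Rough upper bound: snapshots are incremental, so billed GB is usually lower."""
    return sum(s["VolumeSize"] for s in snapshots) * per_gb
```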
50. **Approval Workflow for Expensive Actions** — For remediation actions above a cost threshold, require manager/lead approval via Slack. "Max wants to terminate 5 instances, saving $2,100/month. Approve?"

### Attribution

51. **Team-Level Cost Dashboard** — Break down costs by team using tags, account mapping, or resource ownership. Each team sees ONLY their costs. Accountability without blame.

52. **PR-Level Cost Attribution** — Track which pull request / deployment caused a cost change. "Costs increased $340/day after PR #1847 was merged (added new ECS service)." Integration with GitHub/GitLab.

53. **Environment-Level Breakdown** — Production vs. Staging vs. Dev vs. QA. Instantly see that staging is costing 60% of production (it shouldn't be).

54. **Service-Level Cost per Request** — Combine cost data with traffic data. "Your payment service costs $0.003 per request. Your search service costs $0.047 per request." Unit economics for infrastructure.

55. **Slack Cost Bot** — `/cost my-team` in Slack returns your team's current month spend, trend, and anomalies. No dashboard needed for quick checks.

### Forecasting

56. **End-of-Month Projection** — "Based on current trajectory, your February bill will be $47,200 (budget: $40,000). You'll exceed budget by $7,200 unless action is taken." Updated daily.
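A naive run-rate version of that projection; the worked numbers below are illustrative (month-to-date spend chosen so a 28-day February lands on the $47,200 figure):

```python
import calendar
from datetime import date

def end_of_month_projection(mtd_spend: float, today: date) -> float:
    """Naive run-rate: scale month-to-date spend to the full month.
    (A production version would weight recent days and seasonality.)"""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return mtd_spend / today.day * days_in_month

# $23,600 spent by Feb 14 of a 28-day month projects to about $47,200
projected = end_of_month_projection(23_600, date(2026, 2, 14))
```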
57. **What-If Scenarios** — "What if we right-size all oversized instances? Projected savings: $4,200/month." "What if we schedule dev environments? Savings: $2,800/month." Quantify the impact before acting.

58. **Deployment Cost Preview** — Before deploying a new service, estimate its monthly cost based on the Terraform/CloudFormation template. "This deployment will add approximately $1,200/month to your bill." Pre-deploy, not post-mortem.

59. **Trend Analysis with Narrative** — Not just charts. "Your EC2 costs have increased 23% month-over-month for 3 consecutive months, driven primarily by the data-pipeline team's EMR usage. At this rate, EC2 alone will exceed $30K by April."

### Notification

60. **Slack-Native Alerts with Action Buttons** — The alert lands in Slack with context AND action buttons: [Stop Instance] [Snooze 24h] [Assign to Team] [View Details]. No context-switching.
61. **PagerDuty Integration for Critical Spikes** — Cost spike >$X/hour? That's an incident. Page the on-call FinOps person (or the team lead if no FinOps role exists).

62. **Daily Digest Email** — Morning email: "Yesterday's spend: $1,423. Trend: ↑12% vs. 7-day average. Top anomaly: NAT Gateway in us-east-1 (+$89). Action needed: 3 zombie instances detected."

63. **SMS for Emergency Spikes** — Configurable threshold. "Your AWS spend exceeded $500 in the last hour. This is 10x your normal hourly rate." For the truly catastrophic events.

64. **Weekly Cost Report for Leadership** — Auto-generated PDF/Slack message for non-technical stakeholders. Plain English. "We spent $38K on AWS this week. That's 5% under budget. Three optimization opportunities worth $2,100/month were identified."

### Visualization

65. **Cost Heatmap** — Calendar heatmap showing daily spend intensity. Instantly spot the expensive days. Click any day to drill down.

66. **Service Treemap** — Treemap visualization where rectangle size = cost. Instantly see which services dominate your bill. Click to drill into sub-categories.

67. **Real-Time Cost Ticker** — A live-updating ticker showing current burn rate: "$1.87/hour | $44.88/day | $1,346/month (projected)". Like a stock ticker for your AWS bill.

68. **Anomaly Timeline** — Horizontal timeline showing detected anomalies as colored dots. Red = unresolved, green = remediated, yellow = acknowledged. A visual history of your cost health.

69. **Cost Diff View** — Side-by-side comparison of any two time periods. "This week vs. last week: +$2,100 total. EC2: +$800, RDS: +$1,100, S3: +$200." Like a git diff for your bill.

70. **Infrastructure Cost Map** — Visual representation of your AWS architecture with cost annotations. See your VPC, subnets, instances, databases — each labeled with its daily cost. Like an AWS architecture diagram that shows you where the money goes.

### Wild Ideas 🔥
71. **"Cost Replay"** — Rewind your AWS bill to any point in time and replay cost changes like a video. See exactly when costs started climbing and correlate with CloudTrail events. A DVR for your cloud spend.

72. **Auto-Negotiate Reserved Instances** — dd0c monitors your usage patterns, identifies RI opportunities, and automatically purchases optimal reservations (with configurable approval thresholds). Fully autonomous FinOps.

73. **Zombie Resource Hunter (Autonomous Agent)** — An AI agent that continuously scans your account for unused resources, calculates waste, and either auto-terminates (if policy allows) or creates a cleanup ticket. It never sleeps.

74. **"Cost Blast Radius" for PRs** — GitHub Action that comments on every PR: "If merged, this change will increase monthly AWS costs by approximately $340 (new ECS task definition with 4 vCPU)." Shift cost awareness left.

75. **Competitive Benchmarking** — "Companies similar to yours (50 engineers, SaaS, Series B) spend a median of $28K/month on AWS. You spend $45K. Here's where you're overspending." Anonymous, aggregated data from dd0c's customer base.

76. **"AWS Bill Explained Like I'm 5"** — AI-generated plain-English explanation of your bill. "You spent $4,200 on EC2 this month. That's like renting 12 computers 24/7. But 4 of them did almost nothing. If you turn those off, you save $1,400."

77. **Cost Gamification** — Leaderboard: "Team Payments reduced their AWS spend by 18% this month! 🏆" Badges for optimization milestones. Make cost optimization fun and competitive.

78. **Automatic Spot Instance Migration** — Identify workloads that are spot-compatible (stateless, fault-tolerant) and automatically migrate them from on-demand to spot. 60-90% savings with zero manual effort.

79. **"What's This Costing Me?" Chrome Extension** — Hover over any resource in the AWS Console and see its monthly cost. Like a price tag on every resource. Because AWS deliberately makes this hard to see.

80. **Multi-Cloud Cost Normalization** — Normalize costs across AWS, GCP, and Azure into a single dashboard. "Your compute costs $X on AWS. The equivalent on GCP would cost $Y." Help teams make informed multi-cloud decisions.

81. **Cost-Aware Autoscaling** — Replace AWS's native autoscaling with a cost-aware version. Instead of just scaling on CPU/memory, factor in cost. "We could scale to 20 instances, but 12 instances + a queue would handle the load at 40% less cost."

82. **Invoice Dispute Assistant** — AI that reviews your AWS bill for billing errors and credits you're owed, and generates dispute emails. AWS makes billing mistakes more often than people think.

---
## Phase 3: Differentiation & Moat (18 ideas)

### Beating AWS Native Cost Anomaly Detection

83. **Speed** — AWS Cost Anomaly Detection has a 24-48 hour delay. dd0c/cost detects anomalies in minutes via CloudTrail events + CloudWatch billing metrics. This alone is a 100x improvement.

84. **Actionability** — AWS tells you "anomaly detected." dd0c tells you "anomaly detected → here's the specific resource → here's who created it → here's the one-click fix." Context + action, not just a notification.

85. **UX That Doesn't Make You Want to Cry** — AWS Cost Anomaly Detection is buried in the Billing console behind 4 clicks. The UI is a table with tiny text. dd0c is a beautiful, purpose-built dashboard with Slack-native alerts.

86. **Tunable Sensitivity** — AWS's ML model is a black box. dd0c lets you tune sensitivity per service, per team, per account. "I expect RDS to fluctuate ±20%, but EC2 should be stable within ±5%."

87. **Remediation Built In** — AWS detects. dd0c detects AND fixes. The gap between "knowing" and "doing" is where all the value is.

### Beating Vantage / CloudHealth

88. **Time-to-Value** — Vantage requires connecting your CUR, waiting for data ingestion, and configuring dashboards. dd0c: connect your AWS account, get your first anomaly alert in under 10 minutes. Vercel-speed onboarding.
89. **Pricing Transparency** — CloudHealth/Apptio: "Contact Sales." Vantage: reasonable but still opaque at scale. dd0c: pricing on the website, self-serve signup, no sales calls ever.

90. **Focus** — Vantage is becoming a broad FinOps platform (Kubernetes costs, unit economics, budgets, reports). dd0c/cost does ONE thing: detect anomalies and fix them. Focused tools beat Swiss Army knives.

91. **Developer-First, Not Finance-First** — CloudHealth was built for FinOps teams and CFOs. dd0c is built for the engineer who gets paged when something breaks. Different user, different UX, different value prop.

92. **Real-Time, Not Daily** — Vantage updates costs daily. dd0c provides near-real-time monitoring. For a team burning $100/hour on a runaway resource, daily updates mean $2,400 wasted before you even know.

### Building the Moat

93. **Cross-Module Data Flywheel** — dd0c/cost knows your spend. dd0c/portal knows who owns what. dd0c/alert knows your incident patterns. Together, they create an intelligence layer no single-purpose tool can match. "The payment service owned by Team Alpha had a cost spike correlated with the deployment that triggered 3 alerts."

94. **Anonymized Benchmarking Network** — The more customers dd0c has, the better the benchmarking data. "Your RDS spend per GB is 2x the median." This data is exclusive to dd0c and improves with scale. Classic network effect.

95. **Optimization Intelligence Accumulation** — Every remediation action taken through dd0c trains the system. "When customers see this pattern, they usually do X." Over time, dd0c's recommendations become eerily accurate. Data moat.

96. **Open-Source Agent, Paid Dashboard** — The in-VPC agent is open source. This builds trust (customers can audit the code), creates community contributions, and makes dd0c the default choice. The dashboard/alerting/remediation is the paid layer.

97. **Terraform/Pulumi Provider** — `dd0c_cost_monitor` as a Terraform resource. Define your cost policies as code. This embeds dd0c into the infrastructure-as-code workflow, making it sticky.

98. **Slack-First Architecture** — Most FinOps tools are dashboard-first. dd0c is Slack-first. Engineers live in Slack. Alerts, actions, reports — all in Slack. The dashboard exists for deep dives, but daily interaction is in Slack. This is a UX moat.

99. **Multi-Cloud (Strategic Expansion)** — Start AWS-only (Brian's expertise). Add GCP and Azure in Year 2. Become the cross-cloud cost anomaly layer. No single cloud vendor will build this because it's against their interest.

100. **API-First for Automation** — Full API for everything. Let customers build custom workflows: "When dd0c detects a spike > $500, automatically create a Jira ticket and page the team lead." Programmable FinOps.
---

## Phase 4: Anti-Ideas & Red Team (12 ideas)

### Why This Could Fail

101. **AWS Improves Cost Explorer** — AWS could ship real-time billing, better anomaly detection, and native Slack integration. They have the data advantage (it's their platform). Counter: AWS has had 15 years to make billing UX good and hasn't. Their incentive is for you to SPEND more, not less. They'll never build a great cost-reduction tool.

102. **Vantage Eats Our Lunch** — Vantage is well-funded, developer-friendly, and already has momentum. They could add real-time anomaly detection tomorrow. Counter: Vantage is going broad (FinOps platform). We're going deep (anomaly detection + remediation). Different strategies.

103. **IAM Permission Anxiety** — Customers won't give dd0c the IAM permissions needed for remediation (terminate instances, modify resources). Counter: Tiered permissions. Read-only for detection (low trust barrier). Write permissions only for remediation (opt-in). Open-source agent for auditability.

104. **Race to the Bottom on Pricing** — Cost optimization tools compete on price because their value prop is "we save you money." If you charge too much relative to savings, customers leave. Counter: Price as a % of savings identified, not a flat fee. Align incentives.

105. **False Positive Fatigue** — If dd0c alerts too often on non-issues, users will ignore it (same problem as AWS native). Counter: Composite anomaly scoring, tunable sensitivity, and a "snooze" mechanism. Learn from user feedback to reduce false positives over time.

106. **Small Market Size** — Teams spending $10K-$500K/month is a specific segment. Below $10K, savings aren't worth the tool cost. Above $500K, they have dedicated FinOps teams using enterprise tools. Counter: This segment is actually massive — hundreds of thousands of AWS accounts. And the $500K ceiling can rise as dd0c matures.

107. **Security Breach Risk** — dd0c has read (and optionally write) access to customer AWS accounts. A breach would be catastrophic for trust. Counter: Minimal permissions, open-source agent, SOC 2 compliance from day 1, no storage of sensitive data (only cost metrics).

108. **"We'll Build It Internally"** — Platform teams at mid-size companies might build their own cost monitoring. Counter: They always underestimate the effort. Internal tools get abandoned. dd0c is cheaper than one engineer's time for a month.

109. **AWS Organizations Consolidated Billing Complexity** — Large orgs with complex account structures, SCPs, and consolidated billing make cost attribution incredibly hard. Counter: This is actually a FEATURE opportunity. If dd0c handles multi-account complexity well, it becomes indispensable.

110. **Terraform Cost Estimation Tools (Infracost) Expand** — Infracost could add post-deploy monitoring to complement their pre-deploy estimation. Counter: Different core competency. Infracost is CI/CD-focused. dd0c is runtime-focused. They're complementary, not competitive. We could even integrate.

111. **Economic Downturn Kills Cloud Spend** — If companies cut cloud budgets aggressively, there's less to optimize. Counter: Downturns INCREASE demand for cost optimization tools. When budgets tighten, every dollar matters more.

112. **Customer Churn After Optimization** — Customers use dd0c, optimize their spend, then cancel because there's nothing left to optimize. Counter: Cost drift is continuous. New resources, new team members, new services — waste regenerates. dd0c is a continuous need, not a one-time fix. Also, the monitoring/alerting value persists even after optimization.

---
## Phase 5: Synthesis

### Top 10 Ideas (Ranked by Impact × Feasibility)

| Rank | Idea | Why |
|------|------|-----|
| 1 | **CloudTrail Real-Time Event Detection** (#27, #29) | The single biggest differentiator vs. every competitor. Detect expensive resource creation in seconds, not days. This is the core innovation. |
| 2 | **Slack-Native Alerts with Action Buttons** (#60) | Where engineers live. Alert + context + one-click action in Slack = the entire value prop in one message. This IS the product for most users. |
| 3 | **One-Click Remediation Suite** (#40-44) | Stop, terminate, resize, schedule — all from the alert. Closing the gap between detection and action is the moat. |
| 4 | **Zombie Resource Hunter** (#73, #44) | Autonomous agent that continuously finds and flags waste. Set-and-forget value. This is the "it pays for itself" feature. |
| 5 | **End-of-Month Projection** (#56) | "You'll exceed budget by $7,200 unless you act." Simple, powerful, and something AWS does terribly. |
| 6 | **Team-Level Cost Attribution** (#51) | Accountability without blame. Each team sees their costs. Essential for organizations with 3+ engineering teams. |
| 7 | **Schedule Non-Prod Shutdown** (#42) | The single easiest win for any customer. "Turn off dev at night" = instant 62% savings on non-prod. Proves ROI in week 1. |
| 8 | **Cost Blast Radius for PRs** (#74) | Shift-left cost awareness. GitHub Action that comments estimated cost impact on PRs. Viral distribution mechanism (developers share cool GitHub Actions). |
| 9 | **Real-Time Cost Ticker** (#67) | Emotional hook. A live burn rate counter creates urgency and awareness. Makes cost visceral, not abstract. |
| 10 | **Rule-Based Guardrails** (#36) | Simple rules catch 80% of problems. "Alert if daily spend > 150% of average." Easy to implement, easy to understand, high value. |

### 3 Wild Cards 🃏

| Wild Card | Idea | Why It's Wild |
|-----------|------|---------------|
| 🃏 1 | **"Cost Replay" DVR** (#71) | Rewind your bill like a video. Correlate cost changes with CloudTrail events in a timeline. Nobody has this. It would be a demo-killer at conferences. |
| 🃏 2 | **Competitive Benchmarking Network** (#75, #94) | "Companies like yours spend 30% less on RDS." Anonymized cross-customer data creates a network effect moat that grows with every customer. Requires scale but is defensible. |
| 🃏 3 | **Invoice Dispute Assistant** (#82) | AI that finds AWS billing errors and generates dispute emails. AWS overcharges more than people realize. This would generate incredible word-of-mouth: "dd0c found $2,400 in billing errors on my account." |
### Recommended V1 Scope

**V1 Goal:** Get a customer from "connected AWS account" to "first anomaly detected and remediated" in under 10 minutes.

**V1 Features (4-6 week build):**

1. **AWS Account Connection** — IAM role with read-only billing + CloudTrail access. One CloudFormation template click.
2. **CloudTrail Event Monitoring** — Real-time detection of expensive resource creation (EC2, RDS, EMR, NAT Gateway, EBS volumes).
3. **CloudWatch Billing Polling** — 5-minute polling of EstimatedCharges for account-level anomaly detection.
4. **Statistical Anomaly Detection** — Z-score based, per-service, with configurable sensitivity (low/medium/high).
5. **Slack Integration** — Alerts with context (what, who, when, how much) and action buttons (Stop, Terminate, Snooze, Assign).
6. **Zombie Resource Scanner** — Daily scan for idle EC2 (CPU <5% for 7 days), unattached EBS volumes, orphaned EIPs, unused ELBs.
7. **One-Click Stop/Terminate** — Optional write permissions for direct remediation from Slack.
8. **End-of-Month Forecast** — Simple projection based on current burn rate with budget comparison.
9. **Daily Digest** — Morning Slack message with yesterday's spend, trend, and top anomalies.

**V1 Does NOT Include:**

- Multi-cloud (AWS only)
- CUR parsing (too complex for V1; use CloudWatch + CloudTrail)
- Savings Plan/RI optimization (Phase 2)
- Team attribution (requires a tagging strategy; Phase 2)
- PR cost estimation (Phase 2; integrate with Infracost instead)
- Dashboard UI (Slack-first for V1; web dashboard in Phase 2)

**V1 Pricing:**

- Free: 1 AWS account, daily anomaly checks only
- Pro ($49/mo): 3 accounts, real-time detection, Slack alerts, remediation
- Business ($149/mo): Unlimited accounts, zombie hunter, forecasting, team features

**V1 Success Metric:** First 10 paying customers within 60 days of launch. Average customer saves >$500/month (10x the Pro price).

---

*Total ideas generated: 112*

*Session complete. Let's build this thing.* 🔥
350
products/05-aws-cost-anomaly/design-thinking/session.md
Normal file
# dd0c/cost — Design Thinking Session
## AWS Cost Anomaly Detective

**Facilitator:** Maya, Design Thinking Maestro
**Date:** February 28, 2026
**Product:** dd0c/cost (Product #5 — "The Gateway Drug")
**Philosophy:** Design is about THEM, not us.

---

> *"The best products don't solve problems. They dissolve the anxiety that surrounds them. When a startup CTO sees a bill that's 4x what they expected, the problem isn't the bill — it's the 47 seconds of pure existential dread before they can even begin to understand WHY."*

---

# Phase 1: EMPATHIZE

We're not building a cost tool. We're building an anxiety medication for cloud infrastructure. Let's meet the humans who need it.
---

## Persona 1: The Startup CTO — "Alex"

**Demographics:** 32 years old. Series A startup, 12 engineers. Wears the CTO/VP Eng/DevOps hat simultaneously. Personally signed the AWS Enterprise Agreement. The board sees every line item.

**The Moment That Defines Alex:**
It's Tuesday morning, 7:14 AM. Alex is brushing their teeth when a Slack notification buzzes. The CFO has forwarded the AWS billing alert email: "Your estimated charges for this billing period have exceeded $8,000." Last month was $2,100. Alex's stomach drops. Toothbrush still in mouth. They open the AWS Console on their phone. Cost Explorer takes 11 seconds to load on mobile. The bar chart shows a spike but doesn't say WHERE or WHY. Alex is now going to be late for the 8 AM standup, and they'll spend the entire meeting distracted, mentally running through every possible cause. Was it the new feature deploy? Did someone spin up a big instance? Is it a data transfer thing? They don't know. They won't know for hours.
|
||||||
|
|
||||||
|
### Empathy Map

**SAYS:**
- "Can someone check why our AWS bill spiked?"
- "We need to be more careful about resource management"
- "I'll look into it after standup"
- "We can't afford surprises like this"
- "Who launched that instance?"
- "Do we even need that RDS cluster in staging?"

**THINKS:**
- "This is going to come up at the board meeting"
- "I should have set up billing alerts months ago"
- "Is this my fault for not having better guardrails?"
- "What if this keeps happening and we burn through runway?"
- "I don't have time to become a FinOps expert on top of everything else"
- "The investors are going to ask about our burn rate"

**DOES:**
- Opens AWS Cost Explorer (waits for it to load, gets frustrated)
- Manually checks EC2 console, RDS console, Lambda console — one by one
- Searches CloudTrail logs trying to correlate events with cost spikes
- Asks in Slack: "Did anyone spin up anything big recently?"
- Creates a spreadsheet to track monthly costs (abandons it by month 3)
- Sets a billing alarm at 80% of budget (but the alarm fires 48 hours late)

**FEELS:**
- **Panic** — the visceral gut-punch of an unexpected bill
- **Helpless** — AWS gives data but not answers
- **Guilty** — "I should have caught this sooner"
- **Overwhelmed** — too many consoles, too many services, not enough time
- **Exposed** — the board/investors will see this number
- **Alone** — nobody else on the team understands AWS billing

### Pain Points

1. **The 48-hour blindspot** — By the time Cost Explorer shows the spike, thousands are already burned
2. **No attribution** — "EC2 costs went up" tells you nothing about WHICH instance or WHO launched it
3. **Context-switching hell** — Diagnosing a cost issue requires jumping between 5+ AWS consoles
4. **Personal liability** — At a startup, the CTO's name is on the account. The bill feels personal.
5. **Time poverty** — Alex has 47 other priorities. Cost management is important but never urgent — until it's an emergency
6. **Knowledge gap** — Alex is a great engineer but not a FinOps specialist. AWS billing is deliberately opaque.

### Current Workarounds
- AWS Billing Alerts (delayed, no context, email-only)
- Monthly manual review of Cost Explorer (reactive, not proactive)
- Asking in Slack "who did this?" (blame-oriented, unreliable)
- Spreadsheet tracking (abandoned within weeks)
- Hoping for the best (the most common strategy)

### Jobs To Be Done (JTBD)
- **When** I see an unexpected AWS charge, **I want to** instantly understand what caused it and who's responsible, **so I can** fix it before it gets worse and explain it to stakeholders.
- **When** I'm planning our monthly budget, **I want to** confidently predict our AWS spend, **so I can** give the board accurate numbers and not look incompetent.
- **When** a new service or resource is created, **I want to** know immediately if it's going to be expensive, **so I can** intervene before costs accumulate.

### Day-in-the-Life Scenario

**6:45 AM** — Wake up, check phone. No alerts. Good.

**7:14 AM** — CFO Slack: "Why is AWS $8K?" Stomach drops.

**7:15-7:55 AM** — Frantically clicking through AWS Console on laptop. Cost Explorer shows EC2 spike but no details. Check CloudTrail — hundreds of events, no obvious culprit.

**8:00 AM** — Standup. Distracted. Mentions "looking into a billing issue."

**8:30-10:00 AM** — Deep dive. Finally discovers: a developer launched 4x p3.2xlarge GPU instances for an ML experiment on Thursday. They're still running. That's 4 × $3.06/hour = $12.24/hour; over 96 hours, that's $1,175 burned. The developer forgot.

**10:05 AM** — Terminates the instances. Sends a Slack message to the team about resource management. Feels like a hall monitor.

**10:30 AM** — Writes a "cloud cost policy" doc. Nobody will read it.

**11:00 AM** — Back to actual work, 3 hours behind schedule.

**Next month** — It happens again. Different resource. Same panic.

---

## Persona 2: The FinOps Analyst — "Jordan"

**Demographics:** 28 years old. Mid-size SaaS company, 150 engineers, 23 AWS accounts. Jordan's title is "Cloud Financial Analyst" but everyone calls them "the cost person." Reports to the VP of Engineering and dotted-line to Finance. The only person in the company who understands AWS billing at a granular level.

**The Moment That Defines Jordan:**

It's the last Thursday of the month. Jordan has spent the past 3 days building the monthly cloud cost report. They have 14 browser tabs open: Cost Explorer for 6 different accounts, 3 spreadsheets, a Confluence page, and the AWS CUR data in Athena. The VP of Engineering wants the report by Friday EOD. The CFO wants it "in a format Finance can understand." Jordan is translating between two worlds — engineering resource names and financial line items — and neither side appreciates how hard that translation is. They just found a $4,200 discrepancy between Cost Explorer and the CUR data and have no idea which one is right.

### Empathy Map

**SAYS:**
- "I need the teams to tag their resources properly"
- "The CUR data doesn't match Cost Explorer — again"
- "Can we get a meeting to discuss the tagging strategy?"
- "This account's spend is 40% over forecast"
- "I've been asking for this data for two weeks"
- "No, I can't tell you the cost per request. We don't have that granularity."

**THINKS:**
- "Nobody takes tagging seriously until the bill is a disaster"
- "I'm a single point of failure for cost visibility in this entire company"
- "If I got hit by a bus, nobody could produce this report"
- "I wish I could automate 80% of what I do"
- "The engineering teams think I'm the cost police. I'm trying to help them."
- "There has to be a better way than 14 spreadsheets"

**DOES:**
- Downloads CUR data daily, loads into Athena, runs custom queries
- Maintains a master spreadsheet mapping AWS accounts → teams → budgets
- Sends weekly cost summaries to team leads (most don't read them)
- Manually investigates anomalies by cross-referencing CUR, CloudTrail, and Cost Explorer
- Attends FinOps Foundation meetups to learn best practices
- Builds custom dashboards in QuickSight (they break every time AWS changes the CUR schema)

**FEELS:**
- **Frustrated** — the tools are inadequate and nobody understands the complexity
- **Undervalued** — cost optimization saves hundreds of thousands but gets no glory
- **Anxious** — one missed anomaly and it's Jordan's fault
- **Isolated** — the only person who speaks both "engineering" and "finance"
- **Exhausted** — the work is repetitive, manual, and never-ending
- **Determined** — genuinely believes cost optimization matters and wants to prove it

### Pain Points

1. **Manual data wrangling** — 60% of Jordan's time is spent collecting, cleaning, and reconciling data, not analyzing it
2. **Tagging chaos** — Teams don't tag consistently. Untagged resources are a black hole of unattributable cost.
3. **Multi-account complexity** — 23 accounts with different owners, different conventions, different levels of maturity
4. **No real-time visibility** — CUR is hourly at best, Cost Explorer is 24-48 hours delayed. Jordan is always looking backward.
5. **Stakeholder translation** — Engineering wants resource-level detail. Finance wants department-level summaries. Jordan manually bridges the gap.
6. **Tool fragmentation** — Uses Cost Explorer + CUR + Athena + QuickSight + spreadsheets + Slack. No single source of truth.

### Current Workarounds
- Custom Athena queries on CUR data (brittle, requires SQL expertise)
- Master spreadsheet updated manually every week (error-prone)
- QuickSight dashboards (break when the CUR schema changes)
- Slack reminders to team leads about their budgets (ignored)
- Monthly "cost review" meetings (dreaded by everyone)
- AWS Cost Anomaly Detection (too many false positives, no actionable context)

### Jobs To Be Done (JTBD)
- **When** I'm preparing the monthly cost report, **I want to** automatically aggregate costs by team, environment, and service with accurate attribution, **so I can** deliver the report in hours instead of days.
- **When** an anomaly is detected, **I want to** immediately see the root cause with full context (who, what, when, why), **so I can** resolve it without a 3-hour investigation.
- **When** a team exceeds their budget, **I want to** automatically notify the team lead with specific recommendations, **so I can** scale cost governance without being the bottleneck.

### Day-in-the-Life Scenario

**8:00 AM** — Open laptop. 47 unread emails. 12 are AWS billing notifications from various accounts. Triage: most are noise.

**8:30 AM** — Check yesterday's CUR data in Athena. Run the anomaly detection query Jordan wrote. 3 flagged items. One is a real issue (new RDS instance in account #17), two are false positives (monthly batch job, expected).

**9:00 AM** — Slack the owner of account #17: "Hey, there's a new db.r5.4xlarge in us-west-2. Is this expected?" No response for 2 hours.

**9:15 AM** — Start building the weekly cost summary. Pull data from 6 accounts. Two accounts have untagged resources totaling $3,400. Jordan can't attribute them. Adds them to "Unallocated" with a note.

**10:00 AM** — Meeting with VP Eng about the Q1 cloud budget. VP wants to cut 15%. Jordan explains which optimizations are realistic and which are fantasy. VP doesn't fully understand the constraints.

**11:00 AM** — Account #17 owner responds: "Oh yeah, that's for the new analytics pipeline. It's permanent." Jordan updates the forecast spreadsheet. The annual impact is $28,000. Nobody approved this.

**12:00 PM** — Lunch at desk. Reading a FinOps Foundation article about showback vs. chargeback models.

**1:00-4:00 PM** — Deep in spreadsheets. Reconciling CUR data with the finance team's GL codes. Find a $4,200 discrepancy. Spend 90 minutes discovering it's because of a refund that appeared in CUR but not in Cost Explorer.

**4:30 PM** — Team lead asks: "Can you tell me how much our staging environment costs?" Jordan: "Give me 30 minutes." It takes 90 because staging resources aren't consistently tagged.

**6:00 PM** — Leave. Tomorrow: same thing.

---

## Persona 3: The DevOps Engineer — "Sam"

**Demographics:** 26 years old. Backend/infrastructure engineer at a 40-person startup. Manages Terraform, CI/CD, and "whatever AWS thing is broken today." Doesn't think about costs — until they cause a problem. Sam's primary metric is uptime, not spend.

**The Moment That Defines Sam:**

It's Friday at 4:47 PM. Sam is about to close the laptop for the weekend when a Slack message from the CTO lands: "Sam, did you launch those GPU instances? Finance says we burned $1,200 on something called p3.2xlarge." Sam's blood runs cold. Last Tuesday, Sam spun up 4 GPU instances to benchmark a new ML model for the data team. The benchmark took 20 minutes. Sam meant to terminate them immediately after. But then there was a production incident, and Sam got pulled away, and the instances... are still running. It's been 4 days. Sam checks: $3.06/hour × 4 instances × 96 hours ≈ $1,175. Sam wants to disappear.

### Empathy Map

**SAYS:**
- "I'll terminate it right after the test"
- "I thought I set it to auto-terminate"
- "Can we get a policy to auto-kill dev resources?"
- "I didn't know NAT Gateways were that expensive"
- "The staging environment? Yeah, it's always running. Should it not be?"
- "I don't have time to learn AWS billing — I have deploys to ship"

**THINKS:**
- "Cost management isn't my job... but it keeps becoming my problem"
- "I should have set a reminder to terminate those instances"
- "AWS makes it way too easy to create expensive things and way too hard to know what they cost"
- "I'm going to get blamed for this even though there's no guardrail to prevent it"
- "Why doesn't AWS just TELL you when something is burning money?"
- "I bet there are other zombie resources I don't even know about"

**DOES:**
- Launches resources via Terraform and CLI, sometimes via console for quick tests
- Forgets to clean up temporary resources (not malicious — just busy)
- Checks costs only when asked by management
- Uses the `aws ce get-cost-and-usage` CLI occasionally but finds the output confusing
- Tags resources inconsistently ("I'll add tags later" → never)
- Responds to cost inquiries defensively ("It was just a test!")

**FEELS:**
- **Embarrassed** — when caught leaving expensive resources running
- **Defensive** — "There should be a system to catch this, not just blame me"
- **Indifferent** — cost isn't Sam's KPI; uptime and velocity are
- **Overwhelmed** — too many responsibilities, cost management is one more thing
- **Anxious** — fear of making an expensive mistake and getting called out
- **Resentful** — "Why is this my problem? Where are the guardrails?"

### Pain Points

1. **No feedback loop** — Sam creates a resource and gets zero signal about its cost until someone complains weeks later
2. **Easy to create, hard to track** — AWS makes it trivial to launch resources and nearly impossible to understand their cost implications in real time
3. **No safety net** — There's no automated system to catch forgotten resources. It's all human memory.
4. **Blame culture** — When costs spike, the question is "who did this?" not "how do we prevent this?"
5. **Cost literacy gap** — Sam is an excellent engineer but has no mental model for AWS pricing. NAT Gateway data processing? EBS IOPS charges? It's a foreign language.
6. **Context-switching tax** — Investigating a cost issue means leaving the terminal/IDE and navigating the AWS billing console, which is a completely different mental model.

### Current Workarounds
- Setting personal calendar reminders to terminate test resources (unreliable)
- Using spot instances when remembering to (inconsistent)
- Terraform `destroy` for test stacks (when they remember)
- Asking in Slack before launching anything expensive (social pressure, not a system)
- Nothing. Most of the time, there's just nothing. Hope and prayer.

### Jobs To Be Done (JTBD)
- **When** I spin up a temporary resource for testing, **I want to** be automatically reminded (or have it auto-terminated) after a set period, **so I can** focus on my actual work without worrying about zombie resources.
- **When** I'm about to create something expensive, **I want to** see the estimated cost impact immediately, **so I can** make an informed decision or choose a cheaper alternative.
- **When** a cost anomaly is traced back to my actions, **I want to** fix it with one click from wherever I already am (Slack/terminal), **so I can** resolve it in 30 seconds instead of 15 minutes of console-clicking.

### Day-in-the-Life Scenario

**9:00 AM** — Start day. Check CI/CD pipelines. One failed overnight — flaky test. Re-run it.

**9:30 AM** — Sprint planning. Pick up a ticket to set up a new ECS service for the payments team.

**10:00 AM** — Writing Terraform for the new ECS service. Chooses instance type based on the last service they set up (m5.xlarge). Doesn't check if it's the right size. Doesn't estimate cost.

**11:00 AM** — Data team asks Sam to spin up GPU instances for ML benchmarking. Sam launches 4x p3.2xlarge via CLI. Plans to terminate after lunch.

**11:30 AM** — Production alert: database connection pool exhausted. All hands on deck.

**11:30 AM - 2:00 PM** — Incident response. The GPU instances are completely forgotten.

**2:00 PM** — Incident resolved. Sam is mentally drained. Grabs lunch.

**2:30 PM** — Back to the ECS Terraform. Deploys to staging. Doesn't think about the GPU instances.

**3:00 PM** — Code review for a teammate's Lambda function. Doesn't notice it logs full request payloads at DEBUG level (future CloudWatch cost bomb).

**4:00 PM** — Pushes the ECS service to production. Monitors for 30 minutes. Looks good.

**4:47 PM** — CTO Slack: "Did you launch those GPU instances?" The cold sweat begins.

**4:50 PM** — Terminates the instances. $1,175 burned. Apologizes. Feels terrible.

**5:00 PM** — Closes laptop. Spends the weekend low-key anxious about it.

**Monday** — CTO announces a new "cloud cost policy." Sam knows it's because of them. Nobody will follow it.

---

# Phase 2: DEFINE

> *"A well-defined problem is a problem half-solved. But here's the jazz riff — the problem isn't 'costs are too high.' The problem is 'I'm flying blind in a machine that charges by the millisecond.' That's a fundamentally different design challenge."*

---

## Point-of-View (POV) Statements

### POV 1: The Startup CTO (Alex)
**Alex**, a time-starved startup CTO who is personally accountable for AWS spend, **needs a way to** instantly understand and resolve unexpected cost spikes the moment they happen, **because** the 48-hour delay in current tools means thousands of dollars burn before they even know there's a problem, and every unexplained spike erodes investor confidence and their own credibility.

### POV 2: The FinOps Analyst (Jordan)
**Jordan**, a solo FinOps analyst responsible for cost governance across 23 AWS accounts, **needs a way to** automatically detect, attribute, and communicate cost anomalies without manual data wrangling, **because** they spend 60% of their time collecting and reconciling data instead of analyzing it, and they are a single point of failure for cost visibility in a 150-person engineering org.

### POV 3: The DevOps Engineer (Sam)
**Sam**, a DevOps engineer who accidentally creates expensive zombie resources, **needs a way to** get immediate cost feedback and automatic safety nets when creating or forgetting cloud resources, **because** there is currently zero feedback loop between "I launched a thing" and "that thing is costing $12/hour," and the resulting blame culture makes cost management feel punitive rather than preventive.

---

## Key Insights

1. **The Anxiety Gap** — The real pain isn't the dollar amount. It's the TIME between "something went wrong" and "I understand what happened." AWS's 48-hour delay turns a $200 problem into a $2,000 problem AND a week of anxiety. Speed of detection is speed of relief.

2. **Attribution Is Emotional, Not Just Financial** — "Who did this?" is the first question asked in every cost spike. Current tools can't answer it. This creates blame culture. If dd0c can instantly say "Sam's p3.2xlarge instances from Tuesday," it transforms the conversation from blame to resolution.

3. **Nobody Wakes Up Wanting to Do FinOps** — Alex doesn't want a cost dashboard. Sam doesn't want billing alerts. Jordan doesn't want more spreadsheets. They all want the ABSENCE of cost problems. The best cost tool is one you barely notice — until it saves you.

4. **The Guardrail Deficit** — AWS makes it trivially easy to create expensive resources and provides zero real-time feedback about cost. It's like a highway with no speed limit signs and no guardrails — and the speeding ticket arrives 2 days later. dd0c is the guardrail, not the ticket.

5. **Slack Is the Operating System** — All three personas live in Slack. Alex gets the CFO's panic message there. Jordan sends cost summaries there. Sam gets called out there. The product that wins is the one that meets them where they already are — not behind another login.

6. **The Trust Ladder** — Read-only detection (low trust) → Recommendations (medium trust) → One-click remediation (high trust) → Autonomous action (maximum trust). Users climb this ladder over time. V1 must support the full ladder but default to the bottom rung.

7. **Cost Literacy Is Near Zero** — Even experienced engineers don't understand AWS pricing. NAT Gateway data processing, cross-AZ transfer, EBS IOPS — it's deliberately opaque. dd0c must EXPLAIN costs in human language, not just report numbers.

8. **Waste Regenerates** — Optimization isn't a one-time event. New engineers join, new services launch, configurations drift. The zombie resource problem is perpetual. dd0c's value is continuous, not episodic.

---

## How Might We (HMW) Questions

### Detection & Speed
1. **HMW** detect expensive resource creation in seconds instead of days, so users can intervene before costs accumulate?
2. **HMW** distinguish between expected cost changes (planned deployments) and genuine anomalies, to minimize false positive fatigue?
3. **HMW** make the "cost signal" as immediate and visceral as a production alert, so it gets the same urgency?

### Attribution & Understanding
4. **HMW** automatically attribute every cost spike to a specific person, team, and action — without requiring perfect tagging?
5. **HMW** explain cost anomalies in plain English so that a CTO, a FinOps analyst, AND a junior engineer all understand what happened?
6. **HMW** show the "cost blast radius" of a single action (e.g., "this one CLI command is costing $12.24/hour") at the moment it happens?

### Remediation & Action
7. **HMW** reduce the time from "anomaly detected" to "problem fixed" from hours to seconds?
8. **HMW** make remediation feel safe (not scary) so users actually click the "Stop Instance" button instead of hesitating?
9. **HMW** build automatic safety nets (auto-terminate, auto-schedule) that prevent problems without requiring human vigilance?

### Culture & Behavior
10. **HMW** transform cost management from a blame game into a team sport?
11. **HMW** make cost awareness a natural byproduct of daily engineering work, not a separate chore?
12. **HMW** reward cost-conscious behavior instead of only punishing waste?

### Scale & Governance
13. **HMW** give Jordan (the FinOps analyst) their time back by automating the 60% of work that's just data wrangling?
14. **HMW** provide cost governance across 20+ AWS accounts without creating a bottleneck at one person?

---

## The Core Tension: Real-Time Detection vs. Accuracy

Here's the design tension that will define dd0c/cost's soul:

```
FAST ←————————————————————→ ACCURATE

CloudTrail events              CUR line-item data
(instant, but estimated)       (hourly+, but precise)
```

**The Tradeoff:**
- **CloudTrail event detection** tells you "someone just launched a p3.2xlarge" within seconds. You can estimate the cost ($3.06/hour). But you don't have the ACTUAL billed amount yet — there could be reserved instance coverage, savings plans, spot pricing, or marketplace fees that change the real number.
- **CUR/Cost Explorer data** gives you the exact billed amount, with all discounts and credits applied. But it's delayed by hours (CUR) or days (Cost Explorer).

**The Resolution — A Two-Layer Architecture:**

| Layer | Source | Speed | Accuracy | Use Case |
|-------|--------|-------|----------|----------|
| **Layer 1: Event Stream** | CloudTrail + EventBridge | Seconds | Estimated (~85% accurate) | "ALERT: New expensive resource detected" |
| **Layer 2: Billing Reconciliation** | CloudWatch EstimatedCharges + CUR | Minutes to hours | Precise (99%+) | "UPDATE: Confirmed cost impact is $X" |

**The Design Principle:** Alert on Layer 1 (fast, estimated). Reconcile with Layer 2 (slow, precise). Always show the user which layer they're looking at. Never pretend an estimate is exact. Never wait for precision when speed saves money.

This is like a smoke detector vs. a fire investigation. The smoke detector goes off immediately — it might be burnt toast, it might be a real fire. You don't wait for the fire investigator's report before evacuating. You act on the fast signal, then refine your understanding.

**For each persona, this plays differently:**
- **Alex (CTO):** Wants Layer 1 immediately. "I don't care if it's $1,175 or $1,230 — I need to know NOW that something is burning money." Precision can come later.
- **Jordan (FinOps):** Needs both layers. Layer 1 for real-time awareness, Layer 2 for accurate reporting and forecasting. Jordan will be frustrated if estimates are wildly off.
- **Sam (DevOps):** Wants Layer 1 as a safety net. "Tell me the second I forget to terminate something." Doesn't care about the exact dollar amount — cares about the pattern.
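The alert-then-reconcile flow above can be sketched in a few lines. This is an illustrative shape, not the product's actual schema; the names (`CostAlert`, `layer1Alert`, `reconcile`) are assumptions:

```typescript
// Sketch of the two-layer principle: alert fast with an estimate,
// reconcile later with the billed truth. Names are illustrative.

type Layer = "estimate" | "reconciled";

interface CostAlert {
  resourceId: string;
  layer: Layer;      // which layer this number came from
  hourlyUsd: number; // estimated (Layer 1) or billed (Layer 2)
  message: string;
}

// Layer 1: fired seconds after the CloudTrail event, using on-demand list price.
function layer1Alert(resourceId: string, listPriceUsd: number): CostAlert {
  return {
    resourceId,
    layer: "estimate",
    hourlyUsd: listPriceUsd,
    message: `ALERT: ~$${listPriceUsd.toFixed(2)}/hr (estimated)`,
  };
}

// Layer 2: hours later, overwrite the estimate with the billed amount
// (savings plans, RIs, or spot pricing may change the number).
function reconcile(alert: CostAlert, billedHourlyUsd: number): CostAlert {
  return {
    ...alert,
    layer: "reconciled",
    hourlyUsd: billedHourlyUsd,
    message: `UPDATE: confirmed $${billedHourlyUsd.toFixed(2)}/hr`,
  };
}
```

The key design choice is carried in the `layer` field: the UI always knows, and shows, whether a number is an estimate or a reconciled fact.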

---

products/05-aws-cost-anomaly/epics/epics.md
# dd0c/cost — V1 MVP Epics

This document breaks down the dd0c/cost MVP into implementable Epics and Stories. Stories are sized for a solo founder to complete in 1-3 days (typically 1-5 points).

## Epic 1: CloudTrail Ingestion
**Description:** Build the real-time event pipeline that receives CloudTrail events from customer accounts, filters for cost-relevant actions (EC2, RDS, Lambda), normalizes them into `CostEvents`, and estimates their on-demand cost. This is the foundational data ingestion layer.

### User Stories

**Story 1.1: Cross-Account EventBridge Bus**
- **As a** dd0c system, **I want** to receive CloudTrail events from external customer AWS accounts via EventBridge, **so that** I can process them centrally without running agents in customer accounts.
- **Acceptance Criteria:**
  - `dd0c-cost-bus` created in dd0c's AWS account.
  - Resource policy allows `events:PutEvents` from any AWS account (scoped by external ID/trust later, but fundamentally open to receive).
  - Test events sent from a separate AWS account successfully arrive on the bus.
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Use AWS CDK. Ensure the bus is configured in `us-east-1`.

**Story 1.2: SQS Ingestion Queue & Dead-Letter Queue**
- **As a** data pipeline, **I want** events routed from EventBridge to an SQS FIFO queue, **so that** I can process them in order, deduplicate them, and handle bursts without dropping data.
- **Acceptance Criteria:**
  - EventBridge rule routes matching events to the `event-ingestion.fifo` queue.
  - SQS FIFO configured with `MessageGroupId` = accountId and deduplication enabled.
  - DLQ configured after 3 retries.
- **Estimate:** 2
- **Dependencies:** Story 1.1
- **Technical Notes:** CloudTrail can emit duplicates; use the CloudTrail `eventID` as the SQS `MessageDeduplicationId`.

**Story 1.3: Static Pricing Tables**
- **As an** event processor, **I want** local static lookup tables for EC2, RDS, and Lambda on-demand pricing, **so that** I can estimate hourly costs in milliseconds without calling the slow AWS Pricing API.
- **Acceptance Criteria:**
  - JSON/TypeScript dicts created for the top 20 instance types for EC2 and RDS, plus Lambda per-GB-second rates.
  - Pricing covers `us-east-1` (and placeholder for others if needed).
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Keep it simple for V1. Hardcode the most common instance types. We don't need the entire AWS price list yet.
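A minimal sketch of what this table and lookup could look like. The prices shown are illustrative samples, not a maintained price list; a real table would be generated offline from the AWS Pricing API and checked in:

```typescript
// Static on-demand price table (us-east-1, USD/hour). Sample values only;
// regenerate from the AWS Pricing API rather than editing by hand.
const EC2_HOURLY_USD: Record<string, number> = {
  "t3.micro": 0.0104,
  "m5.xlarge": 0.192,
  "p3.2xlarge": 3.06, // the GPU instance from the persona stories
};

// Returns the estimated hourly cost, or undefined for unknown types
// (the caller can then fall back to the Pricing API or a default tier).
function estimateHourlyUsd(instanceType: string): number | undefined {
  return EC2_HOURLY_USD[instanceType];
}
```

Returning `undefined` for unknown types keeps the fallback decision (Pricing API call, default tier, or skip) in the caller, where the latency budget is known.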

**Story 1.4: Event Processor Lambda**
- **As an** event pipeline, **I want** a Lambda function to poll the SQS queue, normalize raw CloudTrail events into `CostEvent` schemas, and write them to DynamoDB, **so that** downstream systems have clean, standardized data.
- **Acceptance Criteria:**
  - Lambda polls SQS (batch size 10).
  - Parses `RunInstances`, `CreateDBInstance`, `CreateFunction20150331`, etc.
  - Extracts actor (IAM User/Role ARN), resource ID, region.
  - Looks up pricing and appends `estimatedHourlyCost`.
  - Writes `CostEvent` to the DynamoDB `dd0c-cost-main` table.
- **Estimate:** 5
- **Dependencies:** Story 1.2, Story 1.3
- **Technical Notes:** Implement idempotency. Use DynamoDB Single-Table Design. Partition key: `ACCOUNT#<id>`, Sort key: `EVENT#<timestamp>#<eventId>`.
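A sketch of the normalization step for a `RunInstances` event. The CloudTrail fields used (`eventID`, `eventTime`, `awsRegion`, `recipientAccountId`, `userIdentity.arn`, `requestParameters.instanceType`) are real CloudTrail record fields; the `CostEvent` field names are assumptions matching the key scheme in the technical notes:

```typescript
// Normalize a raw CloudTrail RunInstances record into a CostEvent row
// keyed for the single-table design: ACCOUNT#<id> / EVENT#<ts>#<eventId>.
interface CostEvent {
  pk: string;
  sk: string;
  actorArn: string;
  region: string;
  resourceType: string;
  estimatedHourlyCost: number;
}

function normalizeRunInstances(
  raw: any, // raw CloudTrail record as delivered via EventBridge
  priceLookup: (instanceType: string) => number
): CostEvent {
  const instanceType = raw.requestParameters.instanceType;
  return {
    pk: `ACCOUNT#${raw.recipientAccountId}`,
    sk: `EVENT#${raw.eventTime}#${raw.eventID}`, // sort key doubles as idempotency key
    actorArn: raw.userIdentity.arn,
    region: raw.awsRegion,
    resourceType: instanceType,
    estimatedHourlyCost: priceLookup(instanceType),
  };
}
```

Because `eventID` lands in the sort key, a duplicate delivery overwrites the same item rather than creating a second one, which gives the idempotency the note asks for.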

## Epic 2: Anomaly Detection Engine
**Description:** Implement the baseline learning and anomaly scoring algorithms. The engine evaluates incoming `CostEvent` records against account-specific, service-specific historical spending baselines to flag unusual spikes, new instance types, or unusual actors.

### User Stories

**Story 2.1: Baseline Storage & Retrieval**
- **As an** anomaly scorer, **I want** to read and write spending baselines per account/service/resource from DynamoDB, **so that** I have a statistical foundation to evaluate new events against.
- **Acceptance Criteria:**
  - `Baseline` schema created in DynamoDB (`BASELINE#<account_id>`).
  - Read/write logic implemented for running means, standard deviations, max observed, and expected actors/instance types.
- **Estimate:** 3
- **Dependencies:** Story 1.4
- **Technical Notes:** Update the baseline with `ADD` expressions in DynamoDB to avoid race conditions.
|
||||||
|
|
||||||
|
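The race-free `ADD` update from the technical note can be sketched by accumulating count, sum, and sum of squares atomically, then deriving mean and standard deviation at read time. Table, key, and attribute names below follow the story's conventions but are otherwise assumptions:

```python
def baseline_update_params(account_id: str, service: str, cost: float) -> dict:
    """Build a DynamoDB UpdateItem request (low-level attribute-value format)
    that maintains running sums via ADD, so concurrent Lambdas never clobber
    each other. mean = costSum / eventCount;
    variance = costSumSq / eventCount - mean**2 (derived at read time)."""
    return {
        "TableName": "dd0c-cost-main",
        "Key": {
            "pk": {"S": f"BASELINE#{account_id}"},
            "sk": {"S": f"SERVICE#{service}"},
        },
        "UpdateExpression": "ADD eventCount :one, costSum :c, costSumSq :c2",
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":c": {"N": str(cost)},
            ":c2": {"N": str(cost * cost)},
        },
    }
```

`ADD` on a number attribute is atomic in DynamoDB, which is why the story prefers it over read-modify-write for baselines.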
**Story 2.2: Cold-Start Absolute Thresholds**

- **As a** new customer, **I want** my account to immediately flag highly expensive resources (>$5/hr) even if I have no baseline, **so that** I don't wait 14 days for the system to "learn" a $3,000 mistake.
- **Acceptance Criteria:**
  - Implement absolute threshold heuristics: >$0.50/hr = INFO, >$5/hr = WARNING, >$25/hr = CRITICAL.
  - Apply this logic when account maturity is `cold-start` (<14 days or <20 events).
- **Estimate:** 2
- **Dependencies:** Story 2.1
- **Technical Notes:** Implement a `scoreAnomaly` function that checks the maturity state of the baseline.
**Story 2.3: Statistical Anomaly Scoring**

- **As an** anomaly scorer, **I want** to calculate composite anomaly scores using Z-scores, instance novelty, and actor novelty, **so that** I reduce false positives and only flag truly unusual behavior.
- **Acceptance Criteria:**
  - Implement Z-score calculation (event cost vs. baseline mean).
  - Implement novelty checks (is this instance type or actor new?).
  - Composite score logic computes severity (`info`, `warning`, `critical`).
  - Creates an `AnomalyRecord` in DynamoDB if the threshold is crossed.
- **Estimate:** 5
- **Dependencies:** Story 2.1
- **Technical Notes:** Add unit tests covering various edge cases (new actor + cheap instance vs. familiar actor + expensive instance).
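One way to combine the three signals is an additive score. The weights and severity cut-offs below are illustrative assumptions, not tuned values from the document:

```python
def score_anomaly(cost: float, mean: float, std: float,
                  new_instance_type: bool, new_actor: bool):
    """Composite scoring sketch for Story 2.3: Z-score plus flat novelty
    bonuses. Weights (1.5 / 1.0) and thresholds are placeholders to be
    tuned against real false-positive data."""
    z = (cost - mean) / std if std > 0 else 0.0
    score = max(z, 0.0)
    score += 1.5 if new_instance_type else 0.0
    score += 1.0 if new_actor else 0.0
    if score >= 4.0:
        severity = "critical"
    elif score >= 2.5:
        severity = "warning"
    elif score >= 1.5:
        severity = "info"
    else:
        severity = None  # no AnomalyRecord written
    return score, severity
```

Note the edge cases the technical note calls out: a new actor launching a cheap instance scores low (novelty bonus alone), while a familiar actor launching something five standard deviations above baseline scores high on the Z-term alone.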
**Story 2.4: Feedback Loop ("Mark as Expected")**

- **As an** anomaly engine, **I want** to update baselines when a user marks an anomaly as expected, **so that** I learn from feedback and stop alerting on normal workflows.
- **Acceptance Criteria:**
  - Provide a function to append a resource type and actor to `expectedInstanceTypes` and `expectedActors`.
  - Future events matching a suppressed pattern get a reduced anomaly score.
- **Estimate:** 3
- **Dependencies:** Story 2.3
- **Technical Notes:** This API will be called by the Slack action handler.

## Epic 3: Notification Service

**Description:** Build the Slack-first notification engine. Deliver rich Block Kit alerts containing anomaly context, estimated costs, and manual remediation suggestions. This is the product's primary user interface for V1.

### User Stories

**Story 3.1: SQS Alert Queue & Notifier Lambda**

- **As a** notification engine, **I want** to poll an alert queue and trigger a Lambda function for every new anomaly, **so that** I can format and send alerts asynchronously without blocking the ingestion path.
- **Acceptance Criteria:**
  - Create a standard SQS `alert-queue` for anomalies.
  - Create a `notifier` Lambda that polls the queue.
  - SQS retries via visibility timeout when the Slack API rate-limits (HTTP 429).
- **Estimate:** 2
- **Dependencies:** Story 2.3
- **Technical Notes:** The scorer Lambda pushes the anomaly ID to this queue.

**Story 3.2: Slack Block Kit Formatting**

- **As a** user, **I want** anomaly alerts formatted nicely in Slack, **so that** I can instantly understand what resource launched, who launched it, the estimated cost, and why it was flagged.
- **Acceptance Criteria:**
  - Use Slack Block Kit to design a highly readable card.
  - Include: resource type, region, cost/hr, actor, timestamp, and the reason (e.g., "New instance type never seen").
  - Test rendering for EC2, RDS, and Lambda anomalies.
- **Estimate:** 3
- **Dependencies:** Story 3.1
- **Technical Notes:** Include a "Why this alert" section detailing the anomaly signals.
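The card can be built as a plain list of Block Kit dicts. The block types (`header`, `section` with `fields`, `context`) are real Block Kit elements; the anomaly field names are assumptions:

```python
def anomaly_blocks(a: dict) -> list[dict]:
    """Minimal Block Kit payload for Story 3.2. The input dict shape
    (resourceType, region, hourlyCost, actorArn, timestamp, reason)
    is an assumed internal schema."""
    return [
        {"type": "header", "text": {
            "type": "plain_text",
            "text": f":rotating_light: {a['resourceType']} launched"}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Region:*\n{a['region']}"},
            {"type": "mrkdwn", "text": f"*Cost/hr:*\n${a['hourlyCost']:.2f}"},
            {"type": "mrkdwn", "text": f"*Actor:*\n{a['actorArn']}"},
            {"type": "mrkdwn", "text": f"*When:*\n{a['timestamp']}"},
        ]},
        # "Why this alert" section per the technical note.
        {"type": "context", "elements": [
            {"type": "mrkdwn", "text": f"*Why this alert:* {a['reason']}"}]},
    ]
```

The same builder serves EC2, RDS, and Lambda anomalies since only the field values differ; snapshot-test the JSON per resource type.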
**Story 3.3: Manual Remediation Suggestions**

- **As a** user, **I want** the Slack alert to include CLI commands to stop or terminate the anomalous resource, **so that** I can fix the issue immediately even before one-click buttons are available.
- **Acceptance Criteria:**
  - Block Kit template appends a `Suggested actions` section.
  - Generate a valid `aws ec2 stop-instances` or `aws rds stop-db-instance` command based on the resource type and region.
- **Estimate:** 2
- **Dependencies:** Story 3.2
- **Technical Notes:** For V1, no actual remediation API calls are made by dd0c. This prevents accidental deletions and builds trust first.
**Story 3.4: Daily Digest Generator**

- **As a** user, **I want** a daily summary of my spending and any minor anomalies, **so that** I don't get paged for every $0.50 resource but still have visibility.
- **Acceptance Criteria:**
  - Create an EventBridge Scheduler rule (e.g., cron at 09:00 UTC).
  - Lambda queries the last 24h of anomalies and baseline metrics.
  - Sends a digest message (spend estimate, anomalies resolved vs. open, zombie-watch summary).
- **Estimate:** 5
- **Dependencies:** Story 3.2
- **Technical Notes:** Query the DynamoDB GSI for recent anomalies (`ANOMALY#<id>#STATUS#*`).

## Epic 4: Customer Onboarding

**Description:** Automate the 5-minute setup experience. Create the CloudFormation templates and cross-account IAM roles required for dd0c to securely read CloudTrail events and resource metadata without touching customer data or secrets.

### User Stories

**Story 4.1: IAM Read-Only CloudFormation Template**

- **As a** customer, **I want** to deploy a simple, open-source CloudFormation template, **so that** I can grant dd0c secure, read-only access to my AWS account without worrying about compromised credentials.
- **Acceptance Criteria:**
  - Create the `dd0c-cost-readonly.yaml` template.
  - Role `dd0c-cost-readonly` with an `sts:AssumeRole` trust policy.
  - Requires an `ExternalId` parameter.
  - Allows `ec2:Describe*`, `rds:Describe*`, `lambda:List*`, `cloudwatch:*`, `ce:GetCostAndUsage`, `tag:GetResources`.
  - Hosted on a public S3 bucket (`dd0c-cf-templates`).
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Include an EventBridge rule that forwards cost-relevant CloudTrail events to dd0c's EventBridge bus (`arn:aws:events:...:dd0c-cost-bus`).

**Story 4.2: Cognito User Pool Authentication**

- **As a** platform, **I want** a secure identity provider, **so that** users can sign up quickly using GitHub or Google SSO.
- **Acceptance Criteria:**
  - Configure an Amazon Cognito user pool.
  - Enable GitHub and Google as federated identity providers.
  - Provide a login URL and redirect to the dd0c app.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Cognito's free tier (50k MAUs for direct and social sign-ins) keeps V1 identity costs at effectively zero. Note that GitHub is not a native Cognito provider; it typically needs a thin OIDC wrapper in front of GitHub OAuth.

**Story 4.3: Account Setup API Endpoint**

- **As a** new user, **I want** an API that initializes my tenant and generates a secure CloudFormation "quick-create" link, **so that** I can click one button to install the required AWS permissions.
- **Acceptance Criteria:**
  - `POST /v1/accounts/setup` created in API Gateway.
  - Validates the Cognito JWT.
  - Generates a unique UUIDv4 `externalId` per tenant/account.
  - Returns a URL pointing to the AWS Console CloudFormation quick-create page with pre-filled parameters.
- **Estimate:** 3
- **Dependencies:** Story 4.1, Story 4.2
- **Technical Notes:** The API Lambda should store the generated `externalId` in DynamoDB under the tenant record.
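The quick-create link is just a console URL with the template location and pre-filled parameters. The `param_<ParameterName>` query convention is the one the CloudFormation console uses for quick-create links; the stack name and helper name are assumptions:

```python
from urllib.parse import urlencode

def quick_create_url(template_url: str, external_id: str,
                     region: str = "us-east-1") -> str:
    """Build the CloudFormation quick-create link that
    POST /v1/accounts/setup returns (Story 4.3)."""
    qs = urlencode({
        "templateURL": template_url,           # public S3 template location
        "stackName": "dd0c-cost-readonly",
        "param_ExternalId": external_id,       # pre-fills the ExternalId parameter
    })
    return (f"https://console.aws.amazon.com/cloudformation/home"
            f"?region={region}#/stacks/quickcreate?{qs}")
```

The user clicks the link, reviews the pre-filled stack, and deploys; dd0c then validates the role in Story 4.4.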
**Story 4.4: Role Validation & Activation**

- **As a** dd0c system, **I want** to validate a user's AWS account connection by assuming their newly created role, **so that** I know I can receive events and start anomaly detection.
- **Acceptance Criteria:**
  - `POST /v1/accounts` API created (receives `awsAccountId`, `roleArn`).
  - Calls `sts:AssumeRole` using the `roleArn` and `externalId`.
  - On success, updates the account status to `active` in DynamoDB.
  - Automatically triggers a "Zombie Resource Scan" on connection.
- **Estimate:** 5
- **Dependencies:** Story 4.3
- **Technical Notes:** This is the critical moment. If the `AssumeRole` fails, return an error explaining the `ExternalId` mismatch or missing permissions.

## Epic 5: Dashboard API

**Description:** Build the REST API for anomaly querying, account management, and basic metrics. V1 relies entirely on Slack for interaction, but a minimal API is needed for account settings and the upcoming V2 dashboard.

### User Stories

**Story 5.1: Account Retrieval API**

- **As a** user, **I want** to see my connected AWS accounts, **so that** I can view their health status and disconnect them if needed.
- **Acceptance Criteria:**
  - `GET /v1/accounts` API created (returns `accountId`, status, `baselineMaturity`).
  - `DELETE /v1/accounts/{id}` API created.
  - Returns `401 Unauthorized` without a valid Cognito JWT.
  - Scopes database queries to `tenantId`.
- **Estimate:** 3
- **Dependencies:** Story 4.4
- **Technical Notes:** The disconnect endpoint should mark the account as `disconnecting` and trigger a background Lambda to delete the data within 72 hours.

**Story 5.2: Anomaly Listing API**

- **As a** user, **I want** to view a list of recent anomalies, **so that** I can review past alerts or check if anything was missed.
- **Acceptance Criteria:**
  - `GET /v1/anomalies` API created.
  - Queries DynamoDB GSI3 (`ANOMALY#<id>#STATUS#*`) for the authenticated account.
  - Supports `since`, `status`, and `severity` filters.
  - Implements basic pagination.
- **Estimate:** 5
- **Dependencies:** Story 2.3
- **Technical Notes:** Include `slackMessageUrl` if the anomaly triggered a Slack alert.

**Story 5.3: Baseline Overrides**

- **As a** user, **I want** to adjust anomaly sensitivity for specific services or resource types, **so that** I don't get paged for expected batch processing spikes.
- **Acceptance Criteria:**
  - `PATCH /v1/accounts/{id}/baselines/{service}/{type}` API created.
  - Modifies the DynamoDB baseline record to update `sensitivityOverride` (`low`, `medium`, `high`).
- **Estimate:** 2
- **Dependencies:** Story 2.1
- **Technical Notes:** Valid values must be enforced by the API schema.

## Epic 6: Dashboard UI

**Description:** Build the initial Next.js/React frontend. While V1 focuses on Slack, the web dashboard handles onboarding, account connection, Slack OAuth, and basic anomaly viewing for users who prefer the web.

### User Stories

**Story 6.1: Next.js Boilerplate & Auth**

- **As a** user, **I want** to sign in to the dd0c/cost portal, **so that** I can configure my account and view my AWS connections.
- **Acceptance Criteria:**
  - Initialize a Next.js app with Tailwind CSS.
  - Implement AWS Amplify or `next-auth` for Cognito integration.
  - Landing page with a `Start Free` button.
  - Protect `/dashboard` routes.
- **Estimate:** 3
- **Dependencies:** Story 4.2
- **Technical Notes:** Keep the design clean and Vercel-like. The goal is to get the user authenticated in <10 seconds.

**Story 6.2: Onboarding Flow**

- **As a** new user, **I want** a simple 3-step wizard to connect AWS and Slack, **so that** I don't get lost in documentation.
- **Acceptance Criteria:**
  - "Connect AWS Account" screen.
  - Generates the CloudFormation quick-create URL.
  - Polls `/v1/accounts/{id}/health` for a successful connection.
  - "Connect Slack" screen initiates the OAuth flow.
- **Estimate:** 5
- **Dependencies:** Story 4.3, Story 4.4, Story 6.1
- **Technical Notes:** Provide a fallback manual input field if the auto-polling fails or the user closes the AWS Console window early.

**Story 6.3: Basic Dashboard View**

- **As a** user, **I want** a simple dashboard showing my connected accounts, recent anomalies, and estimated monthly cost, **so that** I have a high-level view outside of Slack.
- **Acceptance Criteria:**
  - Render an `Account Overview` table.
  - Fetch anomalies via `/v1/anomalies` and display them in a simple list or timeline.
  - Indicate the account's baseline learning phase (e.g., "14 days left in learning phase").
- **Estimate:** 5
- **Dependencies:** Story 5.1, Story 5.2, Story 6.1
- **Technical Notes:** The V1 UI shouldn't be complex. Avoid graphs or heavy chart libraries for the MVP.

## Epic 7: Slack Bot

**Description:** Build the Slack bot interaction model. This includes the OAuth installation flow, parsing incoming slash commands (`/dd0c status`, `/dd0c anomalies`, `/dd0c digest`), and handling interactive message payloads for actions like snoozing or marking alerts as expected.

### User Stories

**Story 7.1: Slack OAuth Installation Flow**

- **As a** user, **I want** to securely install the dd0c app to my Slack workspace, **so that** the bot can send alerts to my designated channels.
- **Acceptance Criteria:**
  - `GET /v1/slack/install` initiates the Slack OAuth v2 flow.
  - `GET /v1/slack/oauth_redirect` handles the callback, exchanging the code for a bot token.
  - Bot token and workspace details are securely stored in DynamoDB under the tenant's record.
- **Estimate:** 3
- **Dependencies:** Story 4.2
- **Technical Notes:** Request minimum scopes: `chat:write`, `commands`, `incoming-webhook`. Encrypt the Slack bot token at rest.

**Story 7.2: Slash Command Parser & Router**

- **As a** Slack user, **I want** to use commands like `/dd0c status`, **so that** I can interact with the system without leaving my chat window.
- **Acceptance Criteria:**
  - `POST /v1/slack/commands` API endpoint created to receive Slack command webhooks.
  - Validates Slack request signatures (HMAC-SHA256).
  - Routes `/dd0c status`, `/dd0c anomalies`, and `/dd0c digest` to their respective handler functions.
- **Estimate:** 3
- **Dependencies:** Story 7.1
- **Technical Notes:** Respond within 3 seconds, or defer with a 200 OK and use the `response_url` for delayed execution.
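Slack's request signing works by HMAC-SHA256-ing `v0:<timestamp>:<raw body>` with your signing secret and comparing against the `X-Slack-Signature` header. A minimal verifier (the tolerance default is an assumption; Slack's docs suggest rejecting stale timestamps to block replays):

```python
import hashlib
import hmac
import time

def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, tolerance_s: int = 300) -> bool:
    """Validate a Slack request signature (v0 scheme) for Story 7.2."""
    if abs(time.time() - int(timestamp)) > tolerance_s:
        return False  # stale timestamp: possible replay attack
    basestring = f"v0:{timestamp}:{body}".encode()
    digest = hmac.new(signing_secret.encode(), basestring,
                      hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(f"v0={digest}", signature)
```

Hash the raw request body exactly as received; re-serializing the JSON first will change the bytes and fail verification.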
**Story 7.3: Interactive Action Handler**

- **As a** user, **I want** to click buttons on anomaly alerts to snooze them or mark them as expected, **so that** I can tune the system's noise level instantly.
- **Acceptance Criteria:**
  - `POST /v1/slack/actions` API endpoint created to receive interactive payloads.
  - Validates Slack request signatures.
  - Handles the `mark_expected` action by updating the anomaly record and retraining the baseline.
  - Handles `snooze_Xh` actions by updating the `snoozeUntil` attribute.
  - Updates the original Slack message via the Slack API to reflect the action taken.
- **Estimate:** 5
- **Dependencies:** Story 3.2, Story 7.2
- **Technical Notes:** V1 only implements non-destructive actions (snooze, mark expected). No actual AWS remediation API calls yet.

## Epic 8: Infrastructure & DevOps

**Description:** Define the serverless infrastructure using AWS CDK. This epic covers the deployment of the EventBridge buses, SQS queues, Lambda functions, and DynamoDB tables, and sets up the CI/CD pipeline for automated testing and deployment.

### User Stories

**Story 8.1: Core Serverless Stack (CDK)**

- **As a** developer, **I want** the core ingestion and data storage infrastructure defined as code, **so that** I can deploy the dd0c platform reliably and repeatedly.
- **Acceptance Criteria:**
  - AWS CDK (TypeScript) project initialized.
  - `dd0c-cost-main` DynamoDB table defined with GSIs and TTL.
  - `dd0c-cost-bus` EventBridge bus configured with resource policies allowing external puts.
  - `event-ingestion.fifo` and `alert-queue` SQS queues created.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Ensure DynamoDB is set to PAY_PER_REQUEST (on-demand) to minimize baseline costs.

**Story 8.2: Lambda Deployments & Triggers**

- **As a** developer, **I want** to deploy the Lambda functions and connect them to their respective triggers, **so that** the event-driven architecture functions end-to-end.
- **Acceptance Criteria:**
  - CDK definitions for `event-processor`, `anomaly-scorer`, `notifier`, and the API handlers.
  - SQS event source mappings configured for the processor and notifier Lambdas.
  - API Gateway REST API configured with routes pointing to the API handler Lambda.
- **Estimate:** 5
- **Dependencies:** Story 8.1
- **Technical Notes:** Bundle Lambdas using the `NodejsFunction` construct (esbuild) to minimize cold starts. Set explicit memory and timeout values.

**Story 8.3: Observability & Alarms**

- **As an** operator, **I want** automated monitoring of the infrastructure, **so that** I am alerted if ingestion fails or components throttle.
- **Acceptance Criteria:**
  - CloudWatch alarms created for Lambda error rates (>5% in 5 minutes).
  - Alarms created for SQS DLQ depth (`ApproximateNumberOfMessagesVisible` > 0).
  - Alarms send notifications to an SNS `ops-alerts` topic.
- **Estimate:** 2
- **Dependencies:** Story 8.2
- **Technical Notes:** Keep V1 alarms simple to avoid alert fatigue.

**Story 8.4: CI/CD Pipeline Setup**

- **As a** solo founder, **I want** GitHub Actions to automatically test and deploy my code, **so that** I can push to `main` and have it live in minutes without manual deployment steps.
- **Acceptance Criteria:**
  - GitHub Actions workflow created for PRs (lint, test).
  - Workflow created for the `main` branch (lint, test, `cdk deploy --require-approval broadening`).
  - OIDC provider configured in AWS for passwordless GitHub Actions authentication.
- **Estimate:** 3
- **Dependencies:** Story 8.1
- **Technical Notes:** Use the AWS `configure-aws-credentials` action with a role to assume.

## Epic 9: PLG & Free Tier

**Description:** Implement the product-led growth (PLG) foundations. This involves building a seamless self-serve signup flow, enforcing free tier limits (1 AWS account), and providing the mechanism to upgrade to a paid tier via Stripe.

### User Stories

**Story 9.1: Free Tier Enforcement**

- **As a** platform, **I want** to limit free users to 1 connected AWS account, **so that** I can control infrastructure costs while letting users experience the product's value.
- **Acceptance Criteria:**
  - `POST /v1/accounts/setup` checks the tenant's current account count.
  - Rejects the request with `403 Forbidden` and an upgrade prompt if the limit (1) is reached on the free tier.
- **Estimate:** 2
- **Dependencies:** Story 4.3
- **Technical Notes:** Check the `TENANT#<id>` metadata record to determine the subscription tier.

**Story 9.2: Stripe Integration & Upgrade Flow**

- **As a** user, **I want** to easily upgrade to a paid subscription, **so that** I can connect multiple AWS accounts and access premium features.
- **Acceptance Criteria:**
  - Create a Stripe Checkout session endpoint (`POST /v1/billing/checkout`).
  - Configure a Stripe webhook handler to listen for `checkout.session.completed` and `customer.subscription.deleted`.
  - Update the tenant's tier to `pro` in DynamoDB upon successful payment.
- **Estimate:** 5
- **Dependencies:** Story 4.2
- **Technical Notes:** The Pro tier is $19/account/month. Use Stripe Billing's per-unit pricing model tied to the number of active AWS accounts.

**Story 9.3: API Key Management (V1 Foundation)**

- **As a** power user, **I want** to generate an API key, **so that** I can programmatically interact with my dd0c account in the future.
- **Acceptance Criteria:**
  - `POST /v1/api-keys` endpoint to generate a secure, scoped API key.
  - Hash the API key before storing it in DynamoDB (`TENANT#<id>#APIKEY#<hash>`).
  - Display the plain-text key only once, during creation.
- **Estimate:** 3
- **Dependencies:** Story 5.1
- **Technical Notes:** This lays the groundwork for V2 Business tier API access.
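Story 9.3's generate-then-hash flow fits in a few lines of stdlib Python. The `dd0c_` prefix and helper name are assumptions; the point is that only the SHA-256 digest is persisted, so a database leak never exposes usable keys:

```python
import hashlib
import secrets

def create_api_key(tenant_id: str) -> tuple[str, str]:
    """Generate an API key and the DynamoDB key under which its hash is
    stored (Story 9.3). The plain-text key is returned exactly once and
    never persisted."""
    plain = f"dd0c_{secrets.token_urlsafe(32)}"
    digest = hashlib.sha256(plain.encode()).hexdigest()
    storage_key = f"TENANT#{tenant_id}#APIKEY#{digest}"
    return plain, storage_key
```

On each API request, hash the presented key and look up `TENANT#<id>#APIKEY#<hash>`; a miss means an invalid or revoked key.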
---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/cost adheres to the 5 Transparent Factory tenets. A cost anomaly detector that auto-alerts on spending must itself be governed — false positives erode trust, false negatives cost money.

### Story 10.1: Atomic Flagging — Feature Flags for Anomaly Detection Rules

**As a** solo founder, **I want** every new anomaly scoring algorithm, baseline model, and alert threshold behind a feature flag (default: off), **so that** a bad scoring change doesn't flood customers with false-positive cost alerts.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the anomaly scoring engine. V1: env-var or JSON file provider.
- All flags evaluate locally — no network calls during cost event processing.
- Every flag has an `owner` and a `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule generates >3x baseline alert volume over 1 hour, the flag auto-disables. Suppressed alerts are buffered in a DLQ for review.
- Flags required for: new baseline algorithms, Z-score thresholds, instance novelty scoring, actor novelty detection, and new AWS service parsers.

**Estimate:** 5 points

**Dependencies:** Epic 2 (Anomaly Engine)

**Technical Notes:**

- The circuit breaker tracks the alert-per-account rate in Redis with a 1-hour sliding window.
- DLQ: SQS queue. On circuit break, alerts are replayed once the flag is fixed or removed.
- The "no baseline" fast-path (>$5/hr resources) is NOT behind a flag — it's a safety net that's always on.
### Story 10.2: Elastic Schema — Additive-Only for Cost Event Tables

**As a** solo founder, **I want** all DynamoDB cost event and TimescaleDB baseline schema changes to be strictly additive, **so that** rollbacks never corrupt historical spending data or break baseline calculations.

**Acceptance Criteria:**

- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes.
- All event parsers ignore unknown fields (Pydantic `extra="ignore"` or the Go equivalent).
- Dual-write during migration windows within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).

**Estimate:** 3 points

**Dependencies:** Epic 1 (Data Pipeline)

**Technical Notes:**

- `CostEvent` records in DynamoDB are append-only — never mutate historical events.
- Baseline models in TimescaleDB: new algorithm versions write to a new continuous aggregate; the old aggregate remains queryable during the transition.
- GSI changes: add new GSIs; never remove old ones until sunset.
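The CI guard from the first acceptance criterion can be a simple lint pass over migration SQL. The regex below is a sketch covering the three forbidden shapes named in the story; a real guard would also parse comments and handle dialect quirks:

```python
import re

# Destructive statements Story 10.2 forbids: dropping tables/columns,
# changing a column's type, or renaming anything.
FORBIDDEN = re.compile(
    r"\bDROP\s+(TABLE|COLUMN)\b"
    r"|\bALTER\s+COLUMN\s+\w+\s+TYPE\b"
    r"|\bRENAME\b",
    re.IGNORECASE,
)

def migration_is_additive(sql: str) -> bool:
    """CI guard sketch: True only if the migration contains no
    destructive statements."""
    return FORBIDDEN.search(sql) is None
```

Run it against each file in the migrations directory and fail the build on the first non-additive statement.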
### Story 10.3: Cognitive Durability — Decision Logs for Scoring Algorithms

**As a** future maintainer, **I want** every change to anomaly scoring weights, Z-score thresholds, or baseline learning rates accompanied by a `decision_log.json`, **so that** I understand why the system flagged (or missed) a $3,000 EC2 instance.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/baseline/`, or `src/detection/`.
- A cyclomatic complexity cap of 10 is enforced in CI.
- Decision logs live in `docs/decisions/`.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- Threshold changes are the highest-risk decisions — document: "Why Z-score > 2.5 and not 2.0? What's the false-positive rate at each threshold?"
- Include sample cost events showing before/after scoring behavior in decision logs.

### Story 10.4: Semantic Observability — AI Reasoning Spans on Anomaly Scoring

**As an** engineer investigating a missed cost anomaly, **I want** every anomaly scoring decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why a $500/hr GPU instance wasn't flagged.

**Acceptance Criteria:**

- Every `CostEvent` evaluation creates an `anomaly_scoring` span.
- Span attributes: `cost.account_id_hash`, `cost.service`, `cost.anomaly_score`, `cost.z_score`, `cost.instance_novelty`, `cost.actor_novelty`, `cost.alert_triggered` (bool), `cost.baseline_days` (how many days of baseline data existed).
- If no baseline exists: `cost.fast_path_triggered` (bool) and `cost.hourly_rate`.
- Spans export via OTLP. No PII — account IDs hashed, actor ARNs hashed.

**Estimate:** 3 points

**Dependencies:** Epic 2 (Anomaly Engine)

**Technical Notes:**

- Use the OpenTelemetry Python SDK with the OTLP exporter. Batch export — cost events can be high volume.
- The "no baseline fast-path" span is especially important — it's the safety net for new accounts.
- Include `cost.baseline_days` so you can correlate alert accuracy with baseline maturity.

### Story 10.5: Configurable Autonomy — Governance for Cost Alerting

**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/cost can auto-alert customers or only log anomalies internally, **so that** I can validate scoring accuracy before enabling customer-facing notifications.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (log-only, no customer alerts) or `audit` (auto-alert with logging).
- Default for new accounts: `strict` for the first 14 days (baseline learning period), then auto-promote to `audit`.
- `panic_mode`: when true, all alerting stops. Anomalies are still scored and logged, but no notifications are sent. The dashboard shows an "alerting paused" banner.
- Per-account governance override: customers can set their own mode, but it can only be MORE restrictive.
- All policy decisions are logged: "Alert for account X suppressed by strict mode", "Auto-promoted account Y to audit mode after 14-day baseline".

**Estimate:** 3 points

**Dependencies:** Epic 3 (Notification Service)

**Technical Notes:**

- The 14-day auto-promotion is key — it prevents alert spam during baseline learning while ensuring customers eventually get value.
- Auto-promotion check: daily cron. If the account has ≥14 days of baseline data AND a false-positive rate <10%, promote to `audit`.
- Panic mode: Redis key `dd0c:panic`. The notification engine short-circuits on this key.
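The notification engine's policy gate reduces to a small pure function evaluated before any Slack send. The return shape and reason strings below are assumptions matching the story's logging requirement:

```python
def should_notify(policy: dict, panic_mode: bool) -> tuple[bool, str]:
    """Policy gate sketch for Story 10.5: decide whether to send a
    customer-facing alert, and produce the reason string to log."""
    if panic_mode:
        return False, "suppressed: panic mode active"
    mode = policy.get("governance_mode", "strict")  # strict is the safe default
    if mode == "strict":
        return False, "suppressed: strict mode (baseline learning)"
    return True, "allowed: audit mode"
```

Because anomalies are scored and stored regardless, a suppressed alert is still auditable; only the outbound notification is gated.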
### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |
---

Also in this commit: `products/05-aws-cost-anomaly/innovation-strategy/session.md` (1058 lines; diff suppressed because it is too large) and `products/05-aws-cost-anomaly/party-mode/session.md` (119 lines), shown below.
# dd0c/cost — "Party Mode" Advisory Board Review

**Date:** February 28, 2026
**Product:** dd0c/cost (AWS Cost Anomaly Detective)
**Moderator:** Max (The Zoomer) — *Alright nerds, let's tear this apart. You've read the briefs. Real-time CloudTrail analysis, Slack-native remediation, $19/mo. Let's see if this is actually a business or just another dashboard nobody opens.*

---

## Round 1: INDIVIDUAL REVIEWS

### 1. The VC (Pattern-Matcher)

**Excites me:** The wedge. Leading with "we catch the $5,000 GPU instance mistake in 60 seconds" is a visceral, high-conversion pitch. The PLG motion here is frictionless, and the time-to-value is under 10 minutes. I love the "gateway drug" cross-sell with dd0c/route.

**Worries me:** Defensibility. What is the actual moat here? If Datadog decides to build a Slack-first cost alert, they crush you with distribution. And I'm still not convinced AWS won't just wake up and fix their native anomaly detection.

**Vote:** CONDITIONAL GO. (Condition: Prove the CloudTrail event data actually creates a compounding data moat over time.)

### 2. The CTO (Infrastructure Veteran)

**Excites me:** Closing the loop between detection and action. Telling my team an instance is burning money is useless if they have to log into the AWS console to kill it. The one-click Slack remediation is exactly how engineers actually want to work.

**Worries me:** CloudTrail is noisy as hell, and the latency isn't perfectly zero. Mapping raw `RunInstances` events to accurate pricing (factoring in RIs, Savings Plans, and Spot pricing) in real time is notoriously difficult. If the Slack bot cries wolf with inaccurate pricing three times, my engineers will mute the channel.

**Vote:** CONDITIONAL GO. (Condition: Ship with hyper-conservative alert thresholds to prevent false-positive fatigue.)

### 3. The Bootstrap Founder (Indie Hacker)

**Excites me:** The math is beautiful. $19/month per account is a no-brainer impulse buy for any startup CTO. At $19, you only need ~526 connected accounts to hit $10K MRR. That is incredibly achievable with Hacker News and Reddit distribution.

**Worries me:** Solo founder burnout. Processing real-time event streams at scale is an operational nightmare. You're building a highly available data pipeline. If your ingestion goes down, you miss the anomaly, and you lose trust forever. Can Brian actually support this while building 5 other products?

**Vote:** GO. (Keep the scope violently narrow. No multi-cloud, no dashboards. Just Slack.)

### 4. The FinOps Practitioner (The Enterprise Buyer)

**Excites me:** Nothing, really. I already use Vantage and I have CUR queries for everything else. But I acknowledge I am not the target buyer here. For the 40-person startup without a FinOps team, this is a lifesaver.

**Worries me:** The $19/mo pricing leaves money on the table. A forgotten p4d instance costs $32/hour. If you save a company $2,000, charging them $19 feels like you're underselling the value. Also, attribution is going to be a nightmare without strict tagging, which startups never have.

**Vote:** NO-GO. (Pivot to %-of-savings pricing, or at least tier it by total AWS spend. $19 is a toy price.)

### 5. The Contrarian (The Red Teamer)

**Excites me:** The fact that everyone thinks cloud cost tools are a solved problem. They aren't. They're all building for the CFO. Building for the on-call DevOps engineer is a genuinely contrarian bet.

**Worries me:** The "Slack-native" premise is a bug, not a feature. Have you seen a startup's `#alerts` channel? It's a graveyard of ignored webhooks. Adding cost alerts to the noise doesn't solve the problem; it just changes the venue of the ignored warning.

**Vote:** CONDITIONAL GO. (Condition: The product must include a "Zombie Resource Auto-Kill" feature. Don't ask them to click a button in Slack. Just kill it and tell them you did it.)

---
## Round 2: CROSS-EXAMINATION

**Max (Moderator):** *Spicy start. I'm hearing some doubts about the price point and the noise. Let's get into it. VC, you think the market is crowded. Bootstrap, you think it's wide open. FinOps is scoffing at $19/mo. Go.*

**1. The VC (to FinOps):** You're voting NO-GO because $19/month is too cheap? Are you crazy? This is a volume play. Startups don't have $500/month for Vantage. $19 is the exact threshold where a CTO whips out the corporate card without asking permission.

**2. The FinOps Practitioner:** And that's exactly why they'll churn! If you charge $19/mo, they'll treat it like a $19 tool. The moment it flags a false positive on a planned EMR cluster deployment, they'll turn it off. If you charge $200, they'll at least adjust the configuration.

**3. The VC:** Wrong. They'll churn if the product sucks. At $19/mo, it's a "set and forget" insurance policy. The real risk is AWS Cost Anomaly Detection waking up and building Slack buttons for free. Datadog could build this over a weekend.

**4. The Bootstrap Founder (to VC):** Datadog has 3,000 engineers and they charge $23 per *host*. They're not going to cannibalize their upsell motion for a $19 product. And AWS hasn't fixed their billing UX in a decade. The market is crowded at the enterprise level, but there's a massive vacuum at the bottom for developers who just want to be left alone.

**5. The VC:** Okay, but what's the moat? Once you get to 500 customers, someone else clones the CloudTrail ingestion script and launches for $9/mo.

**6. The Bootstrap Founder:** The moat is the pattern data! Once dd0c learns your account's seasonal spending spikes and your remediation muscle memory is built into Slack, you don't switch to a $9 clone. And as a solo dev, Brian can run this infrastructure on $200/mo. The margins are insane.

**7. The CTO (to Contrarian):** Speaking of infrastructure, you want to auto-kill zombie resources? Are you out of your mind? If an automated script terminates a production ML training job because it thought it was a "zombie p4d instance," the CTO will literally fire the vendor on the spot.

**8. The Contrarian:** Oh, please. If a developer leaves a p4d running over the weekend without tagging it as production, they deserve the termination. You're trying to build a cost tool, but you're too scared to actually enforce the cost. Slack buttons are a coward's way out. Force them to opt in to auto-termination for dev accounts.

**9. The CTO:** It's not about courage, it's about the reliability of CloudTrail. CloudTrail events are fast, but they don't contain real-time pricing data with Savings Plans and RIs factored in. If you auto-terminate an instance that was already covered by an RI, you just killed a workload for literally zero financial benefit.

**10. The FinOps Practitioner:** The CTO is 100% right. You cannot act on CloudTrail data alone. The CUR data is the only source of truth. If dd0c tells a CTO "this instance is costing $5/hour" but it's actually covered by an RI and costing $0, the CTO will lose all trust in the tool immediately.

**11. The Contrarian (to FinOps/CTO):** You two are entirely missing the point. The CTO doesn't care if it's exactly $5.00 or $4.12. They care that an unsanctioned GPU instance just spun up in `us-east-1` when the entire team is supposed to be in `us-west-2`. The speed of the alert is the product. The exact dollar amount is just decoration.

**12. The Bootstrap Founder:** Exactly. "Estimated cost: $5/hr" is enough to trigger a Slack conversation. If it's covered by an RI, the developer replies "It's fine, we have an RI," clicks the `[Snooze]` button, and goes back to work. That interaction takes 10 seconds. That's worth $19/mo.

---
## Round 3: STRESS TEST

**Max (Moderator):** *Let's break it down to the absolute worst-case scenarios. We have three fatal flaws we need to survive. Attack.*

### Attack 1: AWS Ships Real-Time Cost Anomaly Detection (Faster, Less Noisy)

**The Scenario:** At re:Invent 2026, AWS announces a complete overhaul of their native tool. It's real-time, it has tunable ML models, and they launch a first-party Slack integration with remediation buttons. Oh, and it's bundled for free.

- **Severity (1-10):** 9. This destroys the primary GTM and differentiation.
- **Mitigation:** Your only play is the multi-cloud narrative (AWS + GCP + Azure) or the specific "developer-first" UX. AWS native tools are historically built for enterprise compliance, not startup speed.
- **Pivot Option:** Pivot dd0c/cost into a feature of the broader `dd0c/portal` offering. If it can't survive as a standalone product against free AWS tools, bundle it into an IDP (Internal Developer Portal) where cost is just one widget next to PagerDuty and GitHub metrics.

### Attack 2: Market Consolidation (Datadog Acquires Vantage)

**The Scenario:** Datadog acquires Vantage for $300M, integrating Vantage's FinOps capabilities directly into Datadog's massive footprint. Suddenly, every Datadog customer gets cost anomaly detection out of the box.

- **Severity (1-10):** 7. Datadog's enterprise motion crushes your mid-market aspirations.
- **Mitigation:** Datadog will inevitably raise Vantage's prices or bundle it behind an expensive tier. Emphasize the $19/mo price point and the anti-bloatware positioning. Play the "we are the tool for teams that hate Datadog's pricing model" card.
- **Pivot Option:** Double down on the indie/bootstrapper market. Pivot strictly to a PLG motion for sub-50-person engineering teams where a Datadog contract is unjustifiable.

### Attack 3: False Positive Fatigue

**The Scenario:** CloudTrail is noisy. You alert a CTO three times in one week about a $5/hour cost spike that turns out to be covered by an RI or a Spot Instance request that instantly terminated. The CTO's team mutes the `#dd0c-alerts` Slack channel. You're now just another ignored webhook.

- **Severity (1-10):** 10. If the product loses trust, churn hits 100%. The "boy who cried wolf" is the death of all monitoring tools.
- **Mitigation:** Ship with insanely conservative default thresholds. Require users to opt in to lower sensitivity. Build an immediate feedback loop: every Slack alert needs a `[Mark as Expected]` button that instantly retrains the anomaly baseline for that specific resource tag.
- **Pivot Option:** Pivot from "anomaly detection" to "Zombie Hunter." Stop trying to catch real-time spikes and focus purely on finding unused resources (unattached EBS volumes, empty ELBs, idle EC2s). No false positives there, just pure savings.

---
## Round 4: FINAL VERDICT

**Max (Moderator):** *The board has deliberated. It's time for the bloodbath. Unanimous or split decision on `dd0c/cost`? What's the final call? Let's go.*

### The Decision: SPLIT VERDICT (4-1 CONDITIONAL GO)

**The VC:** "If you can make a CTO feel like they have superpowers for 19 bucks a month, I'm in. But you better move fast before re:Invent ruins your life."

**The CTO:** "I'll use it if you don't wake up my engineers with fake alarms. Tune the noise down, and I'll buy 5 licenses right now."

**The Bootstrap Founder:** "The easiest $10K MRR you'll ever build. Don't overcomplicate it. Stay out of the enterprise."

**The Contrarian:** "I hate Slack bots, but auto-killing zombies is a real product. Do it."

**The FinOps Practitioner:** "You are leaving enterprise money on the table, and your attribution sucks. I vote NO-GO."

### Revised Priority in the `dd0c` Lineup

`dd0c/cost` is officially **Gateway Drug #2**.

It must be launched immediately following `dd0c/route`. The entire GTM strategy depends on this product proving immediate, undeniable monetary ROI in Week 1 to earn the trust required for the rest of the platform.

### Top 3 Must-Get-Right Items

1. **The "10-Minute Aha" Onboarding Flow.** No forms, no manual tagging requirements. A user must connect an AWS account via CloudFormation and get their first real alert (even if it's just an unattached EBS volume) within 10 minutes.
2. **One-Click Remediation UX.** The `[Stop Instance]` Slack button is the entire moat. It has to work flawlessly, without forcing a context switch to the AWS console.
3. **Hyper-Conservative Default Alerting.** It is infinitely better to miss a $50 anomaly than to trigger 3 false positives in the first week. The baseline must learn before it screams.

### The One Kill Condition

**If AWS announces real-time Cost Anomaly Detection with native Slack remediation at re:Invent 2026, kill the standalone product.** Pivot the CloudTrail ingestion engine immediately into `dd0c/alert` or `dd0c/drift` as a supplementary feature, and stop selling it as a $19/mo FinOps tool.

### Final Verdict: CONDITIONAL GO

Cloud cost management is a crowded, bloody ocean. But everyone is building for the CFO. The wedge is building for the on-call engineer who just wants to stop a runaway GPU cluster from their phone without finding their YubiKey. The $19/month real-time Slack bot is a hyper-specific, defensible wedge. Build the smoke detector, hand them the fire extinguisher, and get out of their way.

**Max:** *Alright, party's over. Build the damn thing.*
747
products/05-aws-cost-anomaly/product-brief/brief.md
Normal file
@@ -0,0 +1,747 @@
# dd0c/cost — Product Brief
## AWS Cost Anomaly Detective

**Version:** 1.0
**Date:** February 28, 2026
**Author:** Product Management
**Status:** Conditional GO (4-1 Advisory Board Vote)
**Classification:** Investor-Ready

---
# 1. EXECUTIVE SUMMARY

## Elevator Pitch

dd0c/cost is a real-time AWS billing anomaly detector that catches cost spikes in seconds — not the 24-48 hours that AWS native tools require — and delivers actionable Slack alerts with one-click remediation. At $19/account/month, it's the smoke detector with a fire extinguisher attached: it tells you what happened, who did it, and lets you fix it without leaving Slack.

## Problem Statement

Cloud cost management is broken at the speed layer. AWS customers collectively overspend by an estimated $16B+ annually on idle, forgotten, and misconfigured resources (Flexera State of the Cloud 2025). The average startup discovers cost anomalies 48-72 hours after they begin — by which time a single forgotten GPU instance has burned $1,400+ (p3.8xlarge at $12.24/hr × 4.8 days).

The root cause is architectural: every existing tool — including AWS's own Cost Anomaly Detection — is built on batch-processed Cost and Usage Report (CUR) data. CUR is designed for accounting, not operations. It's like getting your credit card statement a month late and wondering why you're broke.

Three compounding failures make this worse:

1. **No real-time feedback loop.** AWS makes it trivially easy to launch a $98/hour GPU instance and provides zero immediate cost signal. Engineers get no feedback between "I created a thing" and "the bill arrived."
2. **No attribution.** When costs spike, the first question is "who did this?" AWS Cost Explorer answers at the service level ("EC2 went up"), not the human level ("Sam launched 4 GPU instances at 11:02 AM"). This creates blame culture instead of resolution.
3. **No remediation path.** Even when anomalies are detected, fixing them requires navigating 5+ AWS console screens. The gap between "knowing" and "doing" is where money burns.

The AI infrastructure boom has made this exponentially worse. Enterprise AI/ML spend on AWS grew 340% from 2023 to 2025 (Gartner). GPU instances costing $12-$98/hour are now routine. Teams that never worried about AWS costs are suddenly getting $40K bills because someone left a SageMaker endpoint running over a weekend.
## Solution Overview

dd0c/cost replaces the industry's batch-processing paradigm with real-time event-stream analysis:

- **Real-time detection via CloudTrail:** Instead of waiting for CUR data, dd0c processes CloudTrail events through EventBridge as they happen. When someone launches an expensive resource, dd0c knows in seconds — not days.
- **Slack-native alerts with full context:** Every alert includes what happened, who did it, when, estimated cost impact, and a plain-English explanation. No dashboard required.
- **One-click remediation:** Slack action buttons (Stop, Terminate, Schedule Shutdown, Snooze) let engineers fix problems without leaving their workflow. Remediation includes safety nets (automatic EBS snapshots before termination).
- **Zombie resource hunting:** Daily automated scans for idle EC2 instances, unattached EBS volumes, orphaned Elastic IPs, and empty load balancers — the perpetual waste that regenerates as teams grow.
- **Pattern learning:** Anomaly baselines adapt to each account's unique spending patterns over 30-90 days, reducing false positives and increasing detection accuracy over time.
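The first bullet's "knows in seconds" step can be sketched as a mapping from a CloudTrail `RunInstances` record to a rough hourly cost. This is a minimal illustration, not the product's pipeline: the hard-coded prices are examples, and a real service would pull rates from the AWS Price List API and reconcile against CUR data later.

```python
# Illustrative on-demand prices only; real rates come from the AWS Price List API.
ON_DEMAND_HOURLY_USD = {
    "p3.8xlarge": 12.24,  # the expensive GPU mistake we want to catch
    "m5.large": 0.096,
}

def estimate_hourly_cost(cloudtrail_event: dict) -> float:
    """Rough $/hr estimate from a RunInstances event's requestParameters."""
    params = cloudtrail_event.get("requestParameters", {})
    instance_type = params.get("instanceType", "")
    count = int(params.get("maxCount", 1))
    return ON_DEMAND_HOURLY_USD.get(instance_type, 0.0) * count
```

The estimate is deliberately crude (it ignores RIs, Savings Plans, and Spot); as the advisory board noted, the speed of the signal matters more than the exact dollar amount.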
## Target Customer

**Primary:** Series A/B SaaS startups. 10-50 engineers. 1-5 AWS accounts. $5K-$50K/month AWS spend. No dedicated FinOps team. The CTO or a senior DevOps engineer "owns" the bill as a side responsibility.

**Secondary:** Mid-market engineering teams (50-200 engineers) with a solo FinOps analyst drowning in manual data wrangling across 10-25 AWS accounts.

**Anti-target:** Enterprise organizations with $500K+/month AWS spend, dedicated FinOps teams, and existing CloudHealth/Vantage contracts. These are not our customers in Year 1.
## Key Differentiators

| Dimension | dd0c/cost | Industry Standard |
|-----------|-----------|-------------------|
| Detection speed | Seconds (CloudTrail events) | 24-48 hours (CUR/Cost Explorer) |
| Alert channel | Slack-native with action buttons | Email/SNS, dashboard visits |
| Remediation | One-click from Slack | Manual AWS Console navigation |
| Attribution | Resource + user + action + timestamp | Service-level aggregates |
| Setup time | 5 minutes (one-click CloudFormation) | 15-60 minutes (CUR configuration, dashboard setup) |
| Price | $19/account/month | $100-500+/month or enterprise contracts |
| Explanation quality | Plain English ("Sam launched 4x p3.8xlarge at 11:02am, burning $12.24/hr") | "Anomaly detected in EC2" |

---
# 2. MARKET OPPORTUNITY

## Market Sizing

| Segment | Size | Basis |
|---------|------|-------|
| **TAM** | $16.5B | Global cloud cost management and optimization market, 2026. All providers, all segments, all tool categories. (Gartner, FinOps Foundation, Flexera State of the Cloud 2025). 22% CAGR. |
| **SAM** | $2.1B | AWS-specific cost anomaly detection and optimization for SMB/mid-market. ~340,000 AWS accounts spending $5K-$500K/month. Average willingness-to-pay ~$500/month for cost tooling. |
| **SOM** | $1.0-3.6M ARR (Year 1) | dd0c/cost alone: ~250 paying accounts at a blended $29/account/month = ~$7.3K MRR (~$87K ARR). Combined with dd0c/route (the "gateway drug" pair), $1.0-3.6M ARR if execution is sharp. |

**The honest math:** To hit $50K MRR (the platform target), dd0c/cost alone won't get there. At $19/account/month, you need ~2,600 paying accounts for $50K MRR from cost alone. Realistically, dd0c/cost contributes $15-25K MRR and dd0c/route carries the rest. That's the strategy — the gateway drug pair, not a single product.
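The account math above can be checked directly (the price is from the brief; `accounts_needed` is an illustrative helper, not product code):

```python
PRICE_PER_ACCOUNT = 19  # USD/account/month, Starter tier

def accounts_needed(target_mrr: int, price: int = PRICE_PER_ACCOUNT) -> int:
    """Paying accounts required to hit an MRR target (ceiling division)."""
    return -(-target_mrr // price)

# The $50K MRR platform target needs ~2,600 accounts from dd0c/cost alone,
# which is why the plan leans on the dd0c/route pairing instead.
```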
## Competitive Landscape

### Direct Competitors

**AWS Cost Anomaly Detection (Native)**

- Free. ML-based. 24-48 hour detection delay. Black-box model with legendary false-positive rates. No Slack integration. No remediation. UX buried behind 4 clicks in the Billing console. AWS's incentive structure is fundamentally misaligned — they profit when you overspend. They will never build a great cost-reduction tool.
- **Threat level:** LOW as a product. HIGH as a "good enough" excuse for prospects to do nothing.

**Vantage**

- Modern FinOps platform. Series A ($13M). Cost reporting, K8s allocation, unit economics. Pricing starts ~$100/month and scales aggressively. Architecture is CUR-based (batch, not real-time). Moving upmarket toward the FinOps analyst persona, not startup CTOs.
- **Threat level:** MEDIUM. Could add real-time detection but would require a data-pipeline rebuild (~6-month project). Window exists.

**nOps**

- Automated cloud optimization (RI/SP purchasing, scheduling, spot migration). Enterprise-focused, opaque pricing ("Contact Sales"). Solves "help me save money systematically" — a different JTBD than "tell me the second something goes wrong."
- **Threat level:** LOW-MEDIUM. Different positioning. Potential partner.

**Antimetal**

- Group buying for cloud. Aggregates purchasing power for better RI/SP rates. Visibility features are table stakes. VC-backed, burning cash on a model requiring massive scale.
- **Threat level:** LOW. Different business model entirely.
### Adjacent Competitors (Different Buyer, Overlapping Problem)

**CloudHealth (VMware/Broadcom)** — Enterprise. 6-month implementations. $50K+ annual contracts. Sells to the VP of Infrastructure via golf courses. Irrelevant to our beachhead. **NEGLIGIBLE.**

**Kubecost / OpenCost** — K8s-only cost monitoring. Our beachhead customers are mostly running EC2, Lambda, and RDS. Complementary, not competitive. **NEGLIGIBLE.**

**Infracost** — Pre-deploy cost estimation (shift-left). We're runtime (shift-right). "Infracost tells you what it WILL cost. dd0c tells you what it IS costing." **Potential PARTNER.**

**ProsperOps** — Autonomous discount management. Pure savings execution. No anomaly detection. Different JTBD. **NEGLIGIBLE.**

### The Existential Threat

**Datadog**

- Already has agents in customer infrastructure, CloudTrail ingestion, and Slack integrations. Adding real-time cost anomaly detection is a feature for them, not a product. 3,000 engineers.
- **Why we might still win:** Datadog charges $23/host/month for infrastructure monitoring, plus more for cost management. A 50-host startup pays $1,150/month before cost features. Our $19/account/month is a rounding error. Their cost management is dashboard-first, not Slack-first. Their incentive is upselling more Datadog, not being the best cost tool.
- **Threat level:** HIGH long-term. LOW short-term (enterprise focus, not startups).
### Blue Ocean Positioning

The incumbents cluster around reporting, governance, dashboards, and RI optimization — a Red Ocean of commoditized features. dd0c/cost's Blue Ocean is the quadrant nobody serves well:

```
Factor                     | AWS Native | Vantage | CloudHealth | dd0c/cost
---------------------------|------------|---------|-------------|----------
Detection Speed            | 2          | 4       | 3           | 9
Attribution (Who/What)     | 2          | 6       | 7           | 8
Remediation (Fix It)       | 1          | 2       | 3           | 9
Slack-Native Experience    | 1          | 3       | 1           | 10
Time-to-Value (Setup)      | 6          | 4       | 2           | 9
Pricing Transparency       | 10         | 6       | 1           | 10
Multi-Account Governance   | 4          | 7       | 9           | 3
Reporting/Dashboards       | 5          | 8       | 9           | 2
RI/SP Optimization         | 3          | 6       | 8           | 1
```

We deliberately score LOW on governance, reporting, and RI optimization. We score so high on speed, action, and simplicity that the comparison is absurd. This is textbook Blue Ocean: make the competition irrelevant by competing on different factors.
## Timing Thesis: Why Now

Four converging forces create an exceptional window:

**1. The AI Spend Explosion (2024-2026)**

Enterprise AI/ML infrastructure spend on AWS grew 340% from 2023 to 2025. GPU instances cost $12-$98/hour. A single forgotten ML training job burns $5,000 in a weekend. Teams that never worried about AWS costs are suddenly panicking at $40K bills. This is creating a new generation of buyers who need cost detection urgently.

**2. FinOps Goes Mainstream**

FinOps Foundation membership grew from 5,000 to 31,000+ between 2022 and 2025. "FinOps" job titles increased 4x on LinkedIn. The market is educated — we don't need to explain WHY cost management matters. We need to explain why our approach is better. Much easier sell.

**3. AWS Native Tools Are Still Terrible**

AWS Cost Anomaly Detection launched in 2020. Six years later: 24-48 hour delays, no Slack, no remediation, black-box ML. AWS's billing team is a cost center, not a profit center. They have no incentive to invest heavily. Every year they don't fix this, the third-party market grows. We have 2-3 years minimum before AWS could ship something competitive.

**4. Regulatory Tailwinds**

EU DORA requires financial institutions to monitor cloud spend. SOC 2/ISO 27001 auditors increasingly ask "how do you monitor cloud costs?" ESG/sustainability reporting links cloud efficiency to carbon footprint. FinOps Foundation certification is creating a professional class of buyers who actively seek tools.
---

# 3. PRODUCT DEFINITION

## Value Proposition

**For startup CTOs and DevOps engineers** who are personally accountable for AWS spend but have no time or tools for real-time cost governance, **dd0c/cost is a Slack-native cost anomaly detector** that catches billing spikes in seconds and lets you fix them with one click. **Unlike AWS Cost Anomaly Detection, Vantage, or CloudHealth,** dd0c/cost is built on real-time CloudTrail event streams (not batch CUR data), delivers alerts where engineers already work (Slack, not dashboards), and includes remediation — not just detection — at $19/account/month.

**The core promise:** The 48-hour blind spot between "something went wrong" and "I understand what happened" is eliminated. dd0c/cost turns a $4,700 weekend disaster into a $12 blip caught in 60 seconds.
## Personas

### Persona 1: Alex — The Startup CTO

- **Profile:** 32, Series A startup, 12 engineers. Wears the CTO/VP Eng/DevOps hats simultaneously. Personally signed the AWS Enterprise Agreement. The board sees every line item.
- **Defining moment:** Tuesday, 7:14 AM, brushing teeth. The CFO forwards an AWS billing alert: charges exceeded $8,000 (last month was $2,100). Stomach drops. Cost Explorer takes 11 seconds to load on mobile. The bar chart shows a spike but not WHERE or WHY. Alex spends 3 hours diagnosing what dd0c would have caught in 60 seconds.
- **JTBD:** "When I see an unexpected AWS charge, I want to instantly understand what caused it and who's responsible, so I can fix it before it gets worse and explain it to stakeholders."
- **What they hire dd0c for:** Speed of detection, attribution, credibility with investors.

### Persona 2: Sam — The DevOps Engineer

- **Profile:** 26, backend/infrastructure engineer at a 40-person startup. Manages Terraform, CI/CD, and "whatever AWS thing is broken today." Doesn't think about costs until they cause a problem.
- **Defining moment:** Friday, 4:47 PM. The CTO on Slack: "Did you launch those GPU instances?" Sam spun up 4x p3.8xlarge on Tuesday for a 20-minute ML benchmark. A production incident pulled them away. The instances are still running. 4 days × $12.24/hr × 4 instances = $4,700. Sam wants to disappear.
- **JTBD:** "When I spin up a temporary resource, I want automatic safety nets so I can focus on my actual work without worrying about zombie resources."
- **What they hire dd0c for:** The safety net they never had. No more blame. No more forgotten instances.

### Persona 3: Jordan — The Solo FinOps Analyst

- **Profile:** 28, mid-size SaaS (150 engineers, 23 AWS accounts). Title is "Cloud Financial Analyst." The only person who understands AWS billing. Reports to the VP Eng and dotted-line to Finance.
- **Defining moment:** The last Thursday of the month. 14 browser tabs open. 3 days building the monthly cost report. A $4,200 discrepancy between Cost Explorer and CUR data. 60% of time spent collecting and reconciling data, not analyzing it.
- **JTBD:** "When an anomaly is detected, I want to immediately see the root cause with full context, so I can resolve it without a 3-hour investigation."
- **What they hire dd0c for:** Getting their time back. Automated detection replaces manual data wrangling.
## Feature Roadmap

### MVP (V1) — Launch at Day 90

The V1 is ruthlessly scoped to three capabilities: detect, alert, fix.

**Real-Time Anomaly Detection**

- CloudTrail → EventBridge → Lambda pipeline for real-time event ingestion
- Z-score anomaly scoring with configurable sensitivity (default: conservative/high threshold)
- Cost estimation for the top 20 AWS services, mapped from CloudTrail events (~85% accuracy)
- Two-layer architecture: Layer 1 (CloudTrail, seconds, estimated) + Layer 2 (CloudWatch EstimatedCharges + CUR, hours, precise)
- Pattern baseline learning over 30-90 days per account
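The z-score step above reduces to a few lines. This is a toy sketch assuming hourly cost estimates per account as the baseline series; the 4-sigma default threshold is an assumption (the brief only says "conservative").

```python
from statistics import mean, stdev

def z_score(baseline_hourly_costs: list[float], new_hourly_cost: float) -> float:
    """How many standard deviations the new estimate sits above the baseline."""
    mu = mean(baseline_hourly_costs)
    sigma = stdev(baseline_hourly_costs)
    if sigma == 0:
        return 0.0
    return (new_hourly_cost - mu) / sigma

def is_anomalous(baseline: list[float], new_cost: float, threshold: float = 4.0) -> bool:
    # Conservative default: only flag large deviations, per the
    # false-positive-fatigue condition from the advisory board.
    return z_score(baseline, new_cost) >= threshold
```

A `[Mark as Expected]` click would feed the flagged value back into the baseline, which is how the 30-90 day pattern learning lowers the false-positive rate over time.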
**Slack-Native Alerts**

- Block Kit messages: resource type, estimated cost/hour, who created it (IAM user/role), when, plain-English explanation
- Action buttons: Stop Instance, Terminate Instance (with automatic EBS snapshot), Snooze (1hr/4hr/24hr/permanent), Mark as Expected (retrains baseline)
- Daily digest: yesterday's spend summary, top anomalies, zombie resources found
- End-of-month spend forecast
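The alert shape follows Slack's Block Kit `section`/`actions` blocks; the message wording and `action_id` values below are illustrative, not the shipped payload.

```python
def build_anomaly_alert(resource_id: str, actor: str, instance_type: str,
                        est_usd_per_hr: float, region: str) -> dict:
    """Block Kit payload for a cost anomaly alert (sent via chat.postMessage)."""
    def button(label: str, action_id: str) -> dict:
        return {"type": "button", "action_id": action_id, "value": resource_id,
                "text": {"type": "plain_text", "text": label}}

    return {"blocks": [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": (f":rotating_light: *{actor}* launched `{instance_type}` "
                           f"(`{resource_id}`) in `{region}` "
                           f"at an estimated *${est_usd_per_hr:.2f}/hr*")}},
        {"type": "actions", "elements": [
            button("Stop Instance", "stop_instance"),
            button("Snooze 24h", "snooze_24h"),
            button("Mark as Expected", "mark_expected"),
        ]},
    ]}
```

Button clicks come back as Slack interaction payloads carrying the `action_id` and `value`, which is where the one-click remediation and baseline retraining hooks attach.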
**Zombie Resource Hunter**

- Daily automated scan: idle EC2 instances (CPU <5% for 72+ hours), unattached EBS volumes, orphaned Elastic IPs, empty load balancers, stopped instances with attached EBS
- Slack report with one-click cleanup actions
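The idle-EC2 rule is pure decision logic once the metrics are in hand; the hourly CPU series would come from CloudWatch's `CPUUtilization` metric, fetched separately. A minimal sketch:

```python
def is_zombie(hourly_cpu_pct: list[float], threshold_pct: float = 5.0,
              window_hours: int = 72) -> bool:
    """Idle-EC2 rule from the scan: CPU below threshold for the whole window."""
    if len(hourly_cpu_pct) < window_hours:
        return False  # not enough history to judge yet
    return max(hourly_cpu_pct[-window_hours:]) < threshold_pct
```

Unattached EBS volumes and orphaned Elastic IPs are even simpler: they are direct filters on describe-call results, with no metric history needed, which is why the zombie scan can deliver its first finding within minutes of onboarding.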
**Onboarding**

- One-click CloudFormation template (IAM read-only role, ~90 seconds)
- Slack OAuth integration (~30 seconds)
- Immediate zombie scan on connection (first value in <10 minutes)
- Zero configuration required — opinionated defaults for everything
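What that read-only role might grant, sketched as an IAM policy document. The action list is an assumption, not the shipped template, and the one-click remediation buttons would additionally need narrowly scoped write actions (e.g. `ec2:StopInstances`) behind an explicit opt-in.

```python
# Illustrative policy for the cross-account role the CloudFormation template
# creates. The exact action list is an assumption, not the shipped template.
DD0C_READONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ec2:Describe*",                   # instances, volumes, Elastic IPs
            "elasticloadbalancing:Describe*",  # empty load balancers
            "cloudtrail:LookupEvents",         # who-did-what attribution
            "cloudwatch:GetMetricStatistics",  # idle-CPU zombie checks
            "ce:GetCostAndUsage",              # Layer 2 reconciliation
        ],
        "Resource": "*",
    }],
}
```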
**What V1 explicitly does NOT include:** No web dashboard. No multi-account governance. No RI/SP optimization. No team attribution. No multi-cloud. No reporting. No forecasting beyond end-of-month estimate. These are deliberate omissions, not gaps.
### V2 — Months 4-6
|
||||||
|
|
||||||
|
- **Web dashboard:** Lightweight cost overview, anomaly history, trend visualization
|
||||||
|
- **Multi-account support:** Connect multiple AWS accounts, unified alerting
|
||||||
|
- **Team attribution:** Tag-based cost allocation to teams without requiring perfect tagging (heuristic matching via IAM roles and resource naming patterns)
|
||||||
|
- **Budget circuit breakers:** Automatic alerts and optional enforcement when spend exceeds configurable thresholds
|
||||||
|
- **Approval workflows:** Remediation actions on sensitive resources require manager approval via Slack thread
|
||||||
|
- **Business tier pricing** ($49/account/month) with team features and API access
### V3 — Months 7-12

- **RI/SP optimization recommendations:** Identify savings plan and reserved instance opportunities
- **Spend forecasting:** ML-based monthly and quarterly projections with confidence intervals
- **Benchmarking:** "Companies similar to yours spend X on EC2" — powered by anonymized aggregate data across dd0c customers (requires 500+ customer scale)
- **Custom anomaly rules:** User-defined detection logic beyond statistical baselines
- **Autonomous remediation (opt-in):** Auto-terminate dev/staging zombies after configurable idle period, with notification
### V4 — Year 2

- **Multi-cloud:** GCP and Azure support (the play if AWS improves native tools)
- **API platform:** Programmatic access for custom integrations and internal tooling
- **dd0c platform integration:** Deep cross-sell with dd0c/route, dd0c/alert, dd0c/run
## User Journey

```
AWARENESS
  - "Your AWS bill is lying to you" blog post
  - Show HN / Reddit / aws-cost-cli OSS tool
  - "What's That Spike?" blog series
  - Bill Shock Calculator (free ungated web tool)

ACTIVATION
  - "Start Free" → GitHub/Google SSO (no credit card)
  - One-click CloudFormation (90 sec) → Slack OAuth (30 sec) → Choose channel (10 sec)
  - DONE. Total: 3-5 minutes. Zero configuration.

RETENTION
  - First zombie scan alert within 10 minutes of setup
  - First real-time anomaly alert → one-click fix → "dd0c just saved us $X"
  - Pattern learning kicks in (30-90 days) → fewer false positives → trust deepens
    → switching cost increases

EXPANSION
  - Connect 2nd AWS account ($19/mo each)
  - Upgrade to Business tier for team attribution
  - Cross-sell dd0c/route (LLM cost routing)
  - dd0c/alert, dd0c/portal (platform expansion)
```
**Critical conversion points:**

1. **Signup → Connected account:** Must happen in the same session. If they leave, 70% never return.
2. **Connected → First alert:** Must happen within 24 hours (the zombie scan provides this). If no alert arrives in 48 hours, they forget dd0c exists.
3. **First alert → First action:** The moment they click "Stop Instance" in Slack and it works, they're hooked. This is the product's magic moment.
## The Core Technical Tension: Speed vs. Accuracy

dd0c/cost's architecture resolves the fundamental tradeoff that defines the market:

| Layer | Source | Speed | Accuracy | Purpose |
|-------|--------|-------|----------|---------|
| Layer 1: Event Stream | CloudTrail + EventBridge | Seconds | ~85% (estimated, on-demand pricing) | "ALERT: New expensive resource detected" |
| Layer 2: Billing Reconciliation | CloudWatch EstimatedCharges + CUR | Minutes to hours | 99%+ (includes RIs, SPs, Spot) | "UPDATE: Confirmed cost impact is $X" |

**Design principle:** Alert on Layer 1 (fast, estimated). Reconcile with Layer 2 (slow, precise). Always show the user which layer they're seeing. Never pretend an estimate is exact. Never wait for precision when speed saves money.
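
A minimal sketch of this two-layer principle in code: the alert carries an explicit accuracy label, and Layer-2 reconciliation updates the record rather than replacing it. The dataclass and field names are illustrative assumptions, not dd0c's actual data model.

```python
# Illustrative sketch: Layer 1 emits an estimate in seconds; Layer 2
# reconciles it against the CUR hours later. Names are assumptions.
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class CostAlert:
    resource_id: str
    estimated_hourly: float                   # Layer 1: CloudTrail + on-demand pricing
    confirmed_total: Optional[float] = None   # Layer 2: CUR reconciliation, when available

    @property
    def label(self) -> str:
        # Always tell the user which layer they are looking at.
        if self.confirmed_total is None:
            return f"ALERT: ~${self.estimated_hourly:.2f}/hr (estimated)"
        return f"UPDATE: confirmed cost impact is ${self.confirmed_total:.2f}"

alert = CostAlert("i-0abc123", estimated_hourly=12.24)   # seconds after the event
reconciled = replace(alert, confirmed_total=587.52)       # hours later, from the CUR
```

Keeping both numbers on the same immutable record is one way to guarantee the UI can never silently present the estimate as the confirmed figure.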
Think of it as a smoke detector versus a fire investigation. The smoke detector goes off immediately — it might be burnt toast, it might be a real fire. You don't wait for the fire investigator's report before evacuating.
## Pricing

### Tier Structure

| Tier | Price | Includes | Purpose |
|------|-------|----------|---------|
| **Free** | $0/month | 1 AWS account, daily anomaly checks (not real-time), Slack alerts without action buttons, weekly zombie report | Top of funnel. Deliver value. Create upgrade motivation via visible delay. |
| **Pro** | $19/account/month | Real-time CloudTrail detection, Slack alerts WITH action buttons, daily zombie hunter, end-of-month forecast, daily digest, configurable sensitivity | Core product. 80% of revenue. |
| **Business** | $49/account/month (or $399/month flat for ≤20 accounts) | Everything in Pro + team attribution, approval workflows, custom anomaly rules, API access, priority support | Expansion revenue. Launches with V2. |
### Why $19/month

1. **Impulse purchase threshold.** $19 doesn't require approval from anyone. $49 might. Conversion rate difference is typically 2-3x for developer tools.
2. **Multi-account expansion.** 3 accounts = $57/month. 10 accounts = $190/month. Revenue scales naturally with customer growth.
3. **Trivial ROI.** One forgotten GPU instance ($12.24/hr × 48hr = $587) pays for 2.5 years of dd0c. The ROI story doesn't need a spreadsheet.
4. **Category positioning.** At $19, we're 5-25x cheaper than Vantage. That's not a price difference — it's a category difference. We're not "cheaper Vantage." We're a different thing.
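
The ROI claim in point 3 checks out with straightforward arithmetic (the instance rate is the document's own example figure):

```python
# Checking the point-3 arithmetic: one forgotten GPU instance versus
# the Pro subscription price. Rates come from the text above.
gpu_hourly = 12.24          # $/hour, p3.2xlarge-class on-demand rate
wasted = gpu_hourly * 48    # forgotten for 48 hours
pro_monthly = 19.0

months_paid_for = wasted / pro_monthly   # roughly 31 months, about 2.5 years
```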
### Free-to-Paid Conversion Mechanics

The free tier is deliberately designed to create upgrade pressure:

- Free gets daily checks. Pro gets real-time. Every free alert includes: "We detected this anomaly 18 hours ago. On Pro, you'd have known in 60 seconds. Estimated cost of the delay: $220."
- Free alerts have NO action buttons. You see the problem but must switch to the AWS Console to fix it. The friction is the upgrade motivation.
- Target conversion rate: 2.5-3.5% (consistent with developer tool benchmarks: Vercel 2.5%, Supabase 3.1%, Railway 2.8%).
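
The "cost of the delay" nudge in the first bullet can be computed mechanically. A hedged sketch — the formula (hourly estimate × detection delay) is my assumed reading of the copy, not a documented dd0c algorithm:

```python
# Sketch of the free-tier upgrade nudge. The delay-cost formula is
# an assumption: estimated hourly burn times hours of detection delay.
def delay_cost_nudge(est_hourly: float, delay_hours: float) -> str:
    delay_cost = est_hourly * delay_hours
    return (
        f"We detected this anomaly {delay_hours:.0f} hours ago. "
        f"On Pro, you'd have known in 60 seconds. "
        f"Estimated cost of the delay: ${delay_cost:,.0f}."
    )

# An 18-hour delay on a $12.24/hr resource reproduces the $220 figure above.
msg = delay_cost_nudge(est_hourly=12.24, delay_hours=18)
```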
---

# 4. GO-TO-MARKET PLAN

## Launch Strategy: Product-Led Growth (PLG)

No sales team. No demos. No "Contact Sales" button. The product sells itself or it doesn't sell at all.

The GTM motion is built on one principle: **time-to-value under 10 minutes.** A startup CTO should go from "I've never heard of dd0c" to "I just got my first anomaly alert in Slack" in a single sitting. If onboarding takes more than 5 minutes, we lose 60% of signups.
### The Onboarding Flow (Critical Path)

```
1. Landing page → "Start Free" (no credit card)
2. Sign up with GitHub or Google (no email/password forms)
3. "Connect Your AWS Account" → One-click CloudFormation template
   → Opens AWS Console with pre-filled CF stack
   → Creates IAM role with read-only permissions
   → Outputs role ARN back to dd0c
   → Total: 90 seconds (including AWS Console login)
4. "Connect Slack" → Standard Slack OAuth flow (30 seconds)
5. "Choose a channel for alerts" → Dropdown (10 seconds)
6. DONE. "We're monitoring your account. First alert incoming."
```

**Immediate value delivery:** The moment the account connects, dd0c runs a zombie resource scan. Most accounts have at least one idle resource. First Slack alert within 5-10 minutes: "We found 3 potentially unused resources costing $127/month." This is the aha moment.
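
The zombie-scan heuristics described earlier (idle EC2, unattached EBS, orphaned Elastic IPs) reduce to simple per-resource predicates. A hedged sketch — the resource dicts are illustrative stand-ins; a real scan would pull this data from the EC2 and CloudWatch APIs:

```python
# Illustrative zombie-scan predicates. Input shapes are assumptions,
# not real AWS API responses.
def is_zombie(resource: dict) -> bool:
    kind = resource["kind"]
    if kind == "ec2":
        # Idle instance: CPU under 5% for 72+ hours (the V1 criterion)
        return resource["avg_cpu_pct"] < 5 and resource["idle_hours"] >= 72
    if kind == "ebs":
        return not resource["attached"]            # unattached volume
    if kind == "eip":
        return resource["association"] is None     # orphaned Elastic IP
    return False

resources = [
    {"kind": "ec2", "avg_cpu_pct": 1.2, "idle_hours": 96},
    {"kind": "ec2", "avg_cpu_pct": 42.0, "idle_hours": 96},   # busy: not a zombie
    {"kind": "ebs", "attached": False},
    {"kind": "eip", "association": None},
]
zombies = [r for r in resources if is_zombie(r)]
```

The point of keeping each rule this simple is that zombie detection, unlike statistical anomaly detection, has essentially zero false positives — which is why it makes a safe first alert.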
## Beachhead: Startups Burning AWS Credits

### The Ideal First Customer

Series A or B SaaS startup. 10-40 engineers. 1-3 AWS accounts. $5K-$50K/month AWS spend. No FinOps person. The CTO owns the bill as a side responsibility.

**Why this profile works:**

- **Pain is acute and personal.** The CTO's name is on the account. The board sees every line item.
- **Decision cycle is fast.** One person decides. No procurement. No security review committee. Sign up and be live in 10 minutes.
- **$19/month is a non-decision.** Less than one engineer's daily coffee. If dd0c catches ONE forgotten GPU instance, it pays for itself for years.
- **They talk to each other.** Startup CTOs are in Slack communities (Rands Leadership, CTO Craft, YC groups), on Twitter/X, and at FinOps meetups. One happy customer generates 3 referrals.
- **AWS credits make it free.** YC gives $100K in AWS credits. Via an AWS Marketplace listing, dd0c becomes "free" — paid from credits they'd spend anyway.
### The First 10 Customers Playbook

1. **Customers 1-3: Network.** Brian is a senior AWS architect. Call people running startups on AWS. "I built something. Try it, give me honest feedback." Design partners — free for 6 months in exchange for weekly 15-minute feedback calls.
2. **Customers 4-7: Hacker News + Reddit launch.** "Show HN: I built a real-time AWS cost anomaly detector." Tuesday or Wednesday morning US time. Product polished, landing page sharp, onboarding bulletproof. One shot at a first impression.
3. **Customers 8-10: Referrals from 1-7.** If the first 7 don't refer anyone, the product isn't good enough. Go back to step 1.
## Growth Loops

### Loop 1: Savings-Driven Virality

```
Customer saves $X → Shares "dd0c saved us $4,700" on Twitter/Slack community
→ Peers sign up → They save $X → They share → Repeat
```

Amplifier: Monthly "savings report" email with shareable stats. Make it easy to brag about being smart with money.

### Loop 2: Engineering-as-Marketing (Open Source Tools)

```
Free CLI tool (aws-cost-cli, zombie-hunter) → GitHub stars → Developer awareness
→ "Like this? dd0c does this automatically, in real-time" → Signups → Repeat
```

Each tool solves a small problem and funnels to dd0c for the full solution.
### Loop 3: Content SEO Flywheel

```
"What's That Spike?" blog post → Ranks for "AWS NAT Gateway cost spike"
→ CTO Googles exact problem → Finds post → "dd0c would have caught this in 60 seconds"
→ Signup → Repeat
```

Each post targets a long-tail keyword that the ICP searches when they have the exact problem dd0c solves.

### Loop 4: Cross-Sell from dd0c/route

```
Customer uses dd0c/route (LLM cost routing) → Saves $400/month on OpenAI
→ Sees dd0c/cost in same workspace → "Oh, this monitors AWS too?"
→ Connects AWS account → Finds $800/month in zombies → Platform lock-in deepens
```

This is the "gateway drug" strategy. Money saved on LLM costs earns the right to sell AWS cost monitoring.
## Content Strategy

### Pillar 1: "AWS Bill Shock Calculator" (Lead Generation)

Free, ungated web tool. Input your monthly AWS bill → Output: "Companies your size waste 25-35%. That's $X-$Y/month. Here are the top 5 sources." CTA: "Want to find YOUR specific waste? Connect your AWS account (free)." Shareable, generates organic backlinks.

### Pillar 2: "What's That Spike?" Blog Series (SEO + Authority)

Recurring series dissecting real AWS cost anomalies (anonymized):

- "The NAT Gateway That Ate $3,000"
- "When Autoscaling Doesn't Scale Back"
- "The $5,000 GPU Instance Nobody Remembered"
- "CloudWatch Logs Gone Wild"

Each post targets a specific long-tail SEO keyword that the ICP searches during an active cost crisis.

### Pillar 3: "The Real-Time FinOps Manifesto" (Category Creation)

A single definitive piece establishing "real-time FinOps" as a recognized subcategory. If we define the category, dd0c is the default leader. Target: FinOps Foundation blog, The New Stack, InfoQ.

### Pillar 4: Open-Source Tools (Engineering-as-Marketing)

- **`aws-cost-cli`**: CLI showing current AWS burn rate. `npx aws-cost-cli` → "Current burn rate: $1.87/hour | $44.88/day | $1,346/month."
- **`zombie-hunter`**: CLI scanning for unused AWS resources. `npx zombie-hunter` → "Found 7 zombie resources costing $312/month."
- **CloudFormation billing alerts template**: One-click CF template for proper billing alerts (better than AWS default). Free, dd0c branded.
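
The `aws-cost-cli` example output is simple extrapolation. A sketch of the arithmetic, assuming the CLI has already derived an hourly rate (a real tool would compute it from Cost Explorer data):

```python
# Reproducing the aws-cost-cli example line: hourly burn extrapolated
# to daily and 30-day-month figures. The hourly rate is given here;
# deriving it from Cost Explorer is out of scope for this sketch.
hourly = 1.87
daily = hourly * 24
monthly = daily * 30

line = f"Current burn rate: ${hourly:.2f}/hour | ${daily:.2f}/day | ${monthly:,.0f}/month"
```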
## Channel Strategy

| Channel | Tactic | Expected Yield |
|---------|--------|---------------|
| Hacker News | "Show HN" launch post | 500-2,000 signups if front page. 2-5% convert. |
| r/aws, r/devops | Genuine participation + "I built this" | 100-500 signups. Higher conversion (self-selected). |
| Twitter/X | "Your AWS bill is lying to you" thread | Brand awareness. 50-200 signups per viral thread. |
| FinOps Foundation Slack | Community participation, answer questions | 10-30 high-quality leads. Most educated buyers. |
| Dev.to / Hashnode | Technical blog posts | SEO long-tail. 10-30 signups/month ongoing. |
| AWS Marketplace | Listed within 90 days of launch | Pay-with-credits angle. AWS takes 3-5% cut. Worth it. |
| Product Hunt | Same launch week as HN, different day | 200-500 signups. Lower conversion but brand awareness. |
## Partnerships

**AWS Marketplace (Priority: HIGH)** — List within 90 days. Customers pay using existing AWS committed spend/credits. YC startups with $100K in AWS credits can use dd0c for "free." Revenue impact: AWS takes 3-5%, worth it for distribution.

**FinOps Foundation (Priority: HIGH)** — Vendor membership. Contribute to framework documentation (specifically the "Real-Time Cost Management" capability). Speak at the FinOps X conference. Table stakes for credibility.

**Infracost (Priority: MEDIUM)** — Integration: Infracost for pre-deploy estimation + dd0c for post-deploy detection. Complementary products, same buyer. Cross-promotion opportunity.
## 90-Day Launch Timeline

### Days 1-30: Build the Core

- CloudTrail → EventBridge → Lambda pipeline for real-time event ingestion
- Anomaly scoring engine (Z-score, configurable sensitivity)
- Cost estimation library (CloudTrail events → estimated hourly costs, top 20 AWS services)
- Slack app: OAuth, Block Kit alert templates, action handlers (Stop, Terminate, Snooze, Mark Expected)
- Daily digest message
- **Deliverable:** Working product on own AWS accounts. Ugly but functional.
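
A minimal sketch of the Lambda stage of that pipeline: extract the instance type from a CloudTrail `RunInstances` event delivered via EventBridge and look up an estimated on-demand rate. The price table is a stand-in (a real implementation would use the AWS Pricing API), but the `detail.responseElements.instancesSet.items` path is CloudTrail's actual record shape for this event.

```python
# Sketch of the Lambda ingestion handler. ON_DEMAND_HOURLY is an
# illustrative stand-in for a real pricing lookup.
ON_DEMAND_HOURLY = {"p3.2xlarge": 12.24, "t3.micro": 0.0104}

def handler(event: dict, context=None) -> dict:
    detail = event["detail"]                      # EventBridge wraps CloudTrail here
    if detail.get("eventName") != "RunInstances":
        return {"interesting": False}
    items = detail["responseElements"]["instancesSet"]["items"]
    instance_type = items[0]["instanceType"]
    return {
        "interesting": True,
        "instance_type": instance_type,
        "est_hourly": ON_DEMAND_HOURLY.get(instance_type),
        "actor": detail["userIdentity"].get("arn"),
    }

sample = {"detail": {
    "eventName": "RunInstances",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:role/ci-deploy"},
    "responseElements": {"instancesSet": {"items": [{"instanceType": "p3.2xlarge"}]}},
}}
result = handler(sample)
```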
### Days 31-60: Polish + Design Partners

- Landing page (one-page, Vercel-style)
- GitHub/Google SSO signup
- One-click CloudFormation onboarding template
- Slack OAuth integration flow
- Immediate zombie scan on account connection
- Recruit 3-5 design partners from network. Free for 6 months, weekly feedback calls.
- Instrument: time-to-first-alert, alert-to-action ratio, false positive rate
- **Deliverable:** 5 real humans using it daily. Onboarding <5 minutes. False positive rate <30%.
### Days 61-90: Public Launch

- Stripe billing integration ($19/account/month, free tier for 1 account)
- First "What's That Spike?" blog post
- `aws-cost-cli` open-source tool released
- AWS Marketplace listing application submitted
- FinOps Foundation vendor membership application
- Show HN + Reddit + Product Hunt + Twitter launch
- Personal outreach to 50 startup CTOs via LinkedIn/Twitter DMs
- **Deliverable:** Product live, publicly available, with paying customers.
---

# 5. BUSINESS MODEL

## Revenue Model

**Primary revenue:** Per-account SaaS subscription. $19/account/month (Pro) and $49/account/month (Business, launching with V2).

**Secondary revenue (future):** dd0c platform bundle pricing. dd0c/route + dd0c/cost bundle at $39/month flat for small teams (a discount vs. buying separately). Creates a pricing anchor that makes each individual product feel cheap.

**Revenue characteristics:**

- Recurring (monthly subscription)
- Usage-correlated (revenue scales with customer's AWS footprint — more accounts = more revenue)
- Low churn by design (pattern learning + remediation workflows create switching costs over time)
- Expansion-native (customers add accounts as they grow)
## Unit Economics

### Per-Customer Economics (Pro Tier, Single Account)

| Metric | Value | Notes |
|--------|-------|-------|
| Monthly revenue | $19 | Per connected AWS account |
| Infrastructure cost | ~$0.80/month | CloudTrail processing (Lambda), anomaly storage (DynamoDB/Postgres), Slack API calls. Estimated at scale. |
| Gross margin | ~96% | SaaS infrastructure costs are minimal at this price point |
| CAC (PLG) | ~$15-25 | Blended across organic (HN, Reddit, SEO = ~$0) and paid content promotion (~$50-80 per paid signup). PLG means no sales team. |
| Payback period | 1-2 months | At $19/month revenue and $15-25 CAC |
| Target LTV | $190 | 10-month average lifetime at <10% monthly churn |
| LTV:CAC ratio | 7.6-12.7x | Healthy. >3x is the benchmark for sustainable SaaS. |
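
The LTV and ratio rows follow from the standard churn-based lifetime formula; checking them:

```python
# Verifying the unit-economics table: LTV = ARPA / monthly churn,
# ratio bounds from the blended CAC range.
arpa = 19.0
monthly_churn = 0.10            # ~10% churn → ~10-month average lifetime
ltv = arpa / monthly_churn      # $190

cac_low, cac_high = 15.0, 25.0
ratio_range = (ltv / cac_high, ltv / cac_low)   # ≈ (7.6, 12.7)
```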
### Multi-Account Expansion Economics

The real unit economics story is expansion revenue. A customer starts with 1 account ($19/month), then connects their staging account ($38/month), then their data account ($57/month). No additional CAC for expansion revenue.

| Accounts | Monthly Revenue | Annual Revenue | Notes |
|----------|----------------|----------------|-------|
| 1 | $19 | $228 | Entry point |
| 3 | $57 | $684 | Typical startup (prod + staging + data) |
| 5 | $95 | $1,140 | Growing startup |
| 10 | $190 | $2,280 | Mid-market entry |
| 20 | $399 (Business flat) | $4,788 | Business tier cap |
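
The pricing rules behind this table reduce to two small functions. A sketch — the crossover logic (the $399 flat rate beating per-account Business pricing from 9 accounts up) is my reading of the tier table, not a published spec:

```python
# Tier pricing as described: Pro is strictly per account; Business is
# per account with a $399 flat cap for up to 20 accounts.
def pro_monthly(accounts: int) -> int:
    return 19 * accounts

def business_monthly(accounts: int) -> int:
    if accounts > 20:
        raise ValueError("flat cap only covers up to 20 accounts")
    return min(49 * accounts, 399)   # flat rate wins at 9+ accounts (49*9 = 441 > 399)

pro_prices = {n: pro_monthly(n) for n in (1, 3, 5, 10)}
```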
## Path to Revenue Milestones

### $10K MRR (~526 paying accounts)

**Timeline:** Month 6-9 (Scenario B, "The Grind")

**How we get there:**

- dd0c/cost: ~300 accounts × $19 = $5,700 MRR
- dd0c/route: contributing the remaining ~$4,300 MRR
- Total: ~$10K MRR from the gateway drug pair

**Requirements:** 2,000+ free signups, 2.5% conversion, a steady content marketing cadence, and 2-3 "dd0c saved us $X" case studies published.
### $50K MRR (~2,600 paying accounts from cost alone, or blended across the platform)

**Timeline:** Month 12-18

**How we get there (blended):**

- dd0c/cost: ~1,000 accounts × $22 avg (mix of Pro + Business) = $22,000 MRR
- dd0c/route: ~$18,000 MRR
- dd0c/alert (launched Month 6): ~$10,000 MRR
- Total: ~$50K MRR across 3 modules

**Requirements:** A strong PLG flywheel, AWS Marketplace traction, at least one viral content moment, and a working cross-sell motion between route and cost.
### $100K MRR

**Timeline:** Month 18-24

**How we get there:**

- 4+ dd0c modules live
- Business tier adoption driving higher ARPA
- Platform bundle pricing
- Early mid-market customers (10-25 accounts each)
- Potential: first contractor hire for customer support

**Requirements:** Product-market fit validated across at least 3 modules. Churn <8%. NPS >40. The platform flywheel (modules more valuable together than apart) must be demonstrably working.
## Solo Founder Constraints & Mitigations

| Constraint | Impact | Mitigation |
|-----------|--------|------------|
| No sales team | Can't do enterprise outreach | PLG motion. Product sells itself or doesn't sell. |
| No support team | Support burden scales with customers | Automate everything. Self-service docs. Community Slack. Hire part-time contractor at ~200 customers. |
| No marketing team | Limited content output | Batch content creation. 1 blog post/week. Leverage open-source tools for organic reach. |
| Single point of failure | Bus factor = 1 | Infrastructure as code. CI/CD. Automated testing. Documented runbooks. No manual processes per customer. |
| Cognitive load of 6 products | Risk of building 6 mediocre products | Hard rule: no more than 2 products in active development at any time. dd0c/route + dd0c/cost first. Everything else waits. |
| No fundraising | Limited runway for experimentation | Bootstrap-friendly unit economics. $19/month × 96% gross margin = profitable from customer #1. No burn rate to manage. |
## The "Gateway Drug" Cross-Sell Economics

The dd0c platform strategy depends on the gateway drug pair (route + cost) earning the right to sell everything else:

```
Month 1-2: dd0c/route launches → Customer saves $400/month on LLM costs
Month 2-3: dd0c/cost launches → Same customer saves $800/month on AWS waste
Month 3:   Customer is saving $1,200/month across two dd0c products for ~$60/month total
Month 4-6: dd0c/alert launches → "Save your money AND your sleep"
Month 6+:  dd0c/portal → dd0c owns the developer experience. Switching cost is massive.
```

**Data synergy:** dd0c/route knows which services make LLM API calls and their cost. dd0c/cost knows which AWS resources are running and their cost. Combined: "Your recommendation service is making $3,200/month in GPT-4o calls AND running on a $1,800/month p3.2xlarge. Here's how to cut both by 60%." Single-product competitors can't replicate this.

**Technical synergy:** Both products need AWS account integration, Slack integration, auth, and billing. Building dd0c/cost after dd0c/route means 50% of the infrastructure already exists. The marginal engineering cost of the second product is much lower than the first.
---

# 6. RISKS & MITIGATIONS

## Top 5 Risks

### Risk 1: AWS Ships Real-Time Cost Anomaly Detection with Slack Remediation

- **Likelihood:** MEDIUM (40% within 2 years)
- **Impact:** CRITICAL — Primary differentiator evaporates overnight
- **Analysis:** AWS's billing team is a cost center, not a profit center. Real-time cost detection that helps customers spend LESS is antithetical to AWS's revenue model. They've had 15 years to build this and haven't. Their organizational incentives are structurally misaligned. Even if they improve, it'll be enterprise-focused, console-bound, and half-hearted.
- **Mitigation:** Move fast. Establish brand and switching costs (pattern data, remediation workflows) before AWS can respond. If AWS ships something competitive, pivot to multi-cloud (AWS + GCP + Azure) — something AWS will NEVER build.
- **Kill trigger:** If AWS announces real-time Cost Anomaly Detection with native Slack remediation at re:Invent 2026, kill the standalone product. Pivot the CloudTrail ingestion engine into dd0c/alert or dd0c/drift as a supplementary feature.
### Risk 2: Market Consolidation (Datadog Acquires Vantage or Builds Equivalent)

- **Likelihood:** HIGH (60% within 18 months for Datadog entering the space)
- **Impact:** HIGH — Datadog has 3,000 engineers, $2B+ revenue, and existing customer infrastructure agents
- **Analysis:** Datadog charges $23/host/month. Their cost management is an upsell, not a standalone product. A startup with 50 hosts pays $1,150/month for Datadog before cost features. Our $19/account/month is a completely different price point. Datadog optimizes for enterprise, not startups.
- **Mitigation:** Don't compete on features. Compete on price and simplicity. Position as "the cost tool for teams that can't afford Datadog" or "teams that use Datadog for monitoring but don't want Datadog prices for cost management." If Datadog acquires Vantage, they'll inevitably raise prices or bundle behind expensive tiers. Double down on the anti-bloatware positioning.
- **Pivot option:** Go strictly PLG for sub-50-person engineering teams where a Datadog contract is unjustifiable.
### Risk 3: False Positive Fatigue Kills Retention

- **Likelihood:** HIGH (70% if not actively managed)
- **Impact:** HIGH — If the product loses trust, churn hits 100%. The "boy who cried wolf" is the death of all monitoring tools.
- **Analysis:** CloudTrail is noisy. Mapping raw `RunInstances` events to accurate pricing (factoring in RIs, Savings Plans, Spot) in real time is notoriously difficult. If the Slack bot cries wolf with inaccurate pricing three times, engineers mute the channel. Game over.
- **Mitigation:**
  1. Ship with hyper-conservative default thresholds (miss $50 anomalies rather than trigger 3 false positives)
  2. Every alert includes a `[Mark as Expected]` button that instantly retrains the baseline
  3. Composite anomaly scoring (multiple signals = high confidence, single signal = low confidence)
  4. User-tunable sensitivity per service
  5. Track alert-to-action ratio as the core product metric. If <20% of alerts result in action, sensitivity is too high.
  6. Be transparent about estimates: "Estimated cost: $X/hour (on-demand pricing; actual may differ with RIs/SPs)."
- **Kill trigger:** Alert-to-action ratio <10% at Month 4.
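
An illustrative sketch of mitigations 1 and 4 together: a z-score anomaly scorer with a user-tunable sensitivity knob, where a higher threshold means a more conservative detector. The specific threshold values and the sensitivity mapping are assumptions, not dd0c's tuned defaults.

```python
# Illustrative z-score scorer with tunable sensitivity. Thresholds
# are assumptions; a production baseline would use a longer window.
import statistics

def anomaly_score(history: list[float], current: float) -> float:
    """Z-score of the current hourly spend against the baseline window."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return 0.0 if current == mean else float("inf")
    return (current - mean) / stdev

# Higher threshold = more conservative = fewer (but surer) alerts.
SENSITIVITY_THRESHOLDS = {"low": 6.0, "default": 4.0, "high": 2.5}

def should_alert(history, current, sensitivity="default") -> bool:
    return anomaly_score(history, current) > SENSITIVITY_THRESHOLDS[sensitivity]

baseline = [10.0, 11.0, 9.5, 10.5, 10.0, 9.0, 10.0]   # hourly spend, $
spike_fires = should_alert(baseline, 55.0)   # clear spike
noise_fires = should_alert(baseline, 11.5)   # normal variation
```

Composite scoring (mitigation 3) would then combine several such signals, e.g. spend z-score plus "new resource type for this account," before crossing the alert threshold.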
### Risk 4: Solo Founder Burnout (Bus Factor = 1)

- **Likelihood:** MEDIUM-HIGH (50% within 18 months)
- **Impact:** CRITICAL — Processing real-time event streams at scale is an operational nightmare. If ingestion goes down, you miss the anomaly and lose trust forever.
- **Analysis:** Brian is building 6 products simultaneously. The cognitive load, support burden, and operational complexity of running a multi-product SaaS as a solo founder are extreme. Burnout is the most common startup killer.
- **Mitigation:**
  1. Hard rule: no more than 2 products in active development at any time
  2. Automate everything (IaC, CI/CD, automated testing, automated onboarding)
  3. Hire a part-time support contractor within 6 months of launch
  4. dd0c/cost's Slack-first architecture eliminates 60% of the frontend engineering burden (no dashboard in V1)
- **Kill trigger:** Spending >60% of time on dd0c/cost support instead of building.
### Risk 5: The "Good Enough" Trap — Free Tier Cannibalization

- **Likelihood:** HIGH (the large majority of signups never pay)
- **Impact:** MEDIUM — Revenue growth stalls despite strong signup numbers
- **Analysis:** The free tier (daily anomaly checks, 1 account) may be sufficient for many small startups. Daily checks catch most problems, just 24 hours late.
- **Mitigation:**
  1. Make the free-to-paid gap visceral. Every free alert: "We detected this 18 hours ago. On Pro, you'd have known in 60 seconds. Estimated cost of the delay: $220."
  2. Free alerts have NO action buttons. See the problem, can't fix it from Slack. Friction = upgrade motivation.
  3. Accept that most signups staying free is normal for PLG. Focus on the 2.5-3.5% who convert. At $19/month, volume matters more than conversion rate.
### Additional Risks (Monitored)

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| IAM permission anxiety blocks adoption | MEDIUM (30%) | MEDIUM | Minimal permissions (read-only), open-source agent, SOC 2 within 12 months |
| AI spend bubble pops | LOW-MEDIUM (20%) | MEDIUM | AI is the hook, not the product. dd0c detects ALL cost anomalies. Core problem persists regardless. |
| Security breach / data incident | LOW (10%) | CATASTROPHIC | Minimize data collection, encrypt everything, no stored credentials (IAM cross-account roles), bug bounty from day 1 |
| "We'll build it internally" | MEDIUM (25%) | LOW | Self-solving. Internal tools get abandoned. Content strategy demonstrates problem depth. $19/month < one engineer's afternoon. |
## Kill Criteria

Non-negotiable triggers to kill dd0c/cost and redirect effort:

1. **< 50 free signups within 30 days of the Show HN launch.** The developer community doesn't care. The problem isn't painful enough or the positioning is wrong.
2. **< 5 paying customers within 90 days of launch.** Product-market fit isn't there at any price.
3. **> 50% of paying customers churn within 60 days.** The product isn't delivering enough value to justify even $19/month.
4. **AWS ships real-time anomaly detection with Slack integration.** The primary differentiator evaporates. Pivot or kill.
5. **> 60% of time spent on support instead of building.** The product's complexity is wrong for a solo-founder operating model.

If any trigger fires, don't rationalize. Don't "give it one more month." Kill it, learn from it, move on. dd0c has 5 other products.
## Pivot Options

| Trigger | Pivot |
|---------|-------|
| AWS closes the speed gap | Pivot to multi-cloud (AWS + GCP + Azure) — something AWS will never build |
| Standalone product fails | Absorb CloudTrail engine into dd0c/portal as a cost widget, not a standalone product |
| False positive crisis | Pivot from "anomaly detection" to "Zombie Hunter" — pure unused resource detection. Zero false positives, pure savings. |
| Market too noisy | Rebrand as "dd0c/guard" — cost governance and guardrails, not detection. Prevention > detection. |
---

# 7. SUCCESS METRICS

## North Star Metric

**Anomalies Resolved** — the number of cost anomalies dd0c detected AND the customer took action on (Stop, Terminate, Schedule, or acknowledged via Mark as Expected).

Not signups. Not MRR. Not DAU. Anomalies resolved is the atomic unit of value. Every anomaly resolved is money saved, trust earned, and retention deepened. Everything else is a proxy.
## Leading Indicators (Predictive)

| Metric | Target | Why It Matters |
|--------|--------|---------------|
| Time-to-first-alert | <10 minutes | If users don't get value fast, they churn before they start |
| Signup → Connected account rate | >60% | Measures onboarding friction. Below 60% = onboarding is broken |
| Alert-to-action ratio | >25% | Product quality signal. Below 20% = false positive crisis |
| Weekly active accounts (WAA) | Growing 10%+ week-over-week | Engagement health. Flat = product isn't sticky |
| Free-to-paid conversion rate | 2.5-3.5% | Revenue efficiency. Below 2% = free tier is too generous or paid value unclear |
|
||||||
|
|
||||||
|
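The alert-to-action thresholds above can be expressed as a simple health check. A minimal sketch, assuming a hypothetical `alertToActionStatus` helper; the real pipeline would read these counts from the metrics store:

```typescript
// Hypothetical sketch: classify the alert-to-action leading indicator.
// Thresholds mirror the table: >25% healthy, <20% signals a false positive crisis.
type IndicatorStatus = 'healthy' | 'watch' | 'crisis';

function alertToActionStatus(alertsSent: number, alertsActedOn: number): IndicatorStatus {
  if (alertsSent === 0) return 'watch'; // no data yet — neither healthy nor failing
  const ratio = alertsActedOn / alertsSent;
  if (ratio > 0.25) return 'healthy';
  if (ratio >= 0.2) return 'watch';
  return 'crisis';
}
```

A check like this can run in the daily metrics job and feed the kill-trigger review directly.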
## Lagging Indicators (Confirmatory)

| Metric | Target | Why It Matters |
|--------|--------|---------------|
| MRR | Per milestone targets below | Revenue health |
| Monthly churn rate | <8% | Retention. Above 15% = product isn't delivering sustained value |
| NPS | >40 | Customer satisfaction. Below 20 = product problems |
| Organic referral rate | >15% of new signups | Word-of-mouth health. Below 5% = product isn't remarkable enough to share |
| Estimated customer savings | >10x subscription cost | ROI validation. If customers aren't saving 10x what they pay, pricing or detection is wrong |

## 30/60/90 Day Milestones

### Day 30: Core Product Complete

- [ ] CloudTrail → EventBridge → Lambda pipeline operational
- [ ] Anomaly scoring engine functional (Z-score, configurable sensitivity)
- [ ] Slack app: alerts with action buttons (Stop, Terminate, Snooze, Mark Expected)
- [ ] Daily digest message working
- [ ] Tested on 2+ own AWS accounts
- **Gate:** Can detect a manually-created expensive resource and alert in Slack within 120 seconds

### Day 60: Design Partners Active

- [ ] 3-5 design partners using the product daily
- [ ] Onboarding flow complete (CloudFormation + Slack OAuth, <5 minutes)
- [ ] Immediate zombie scan on account connection
- [ ] False positive rate <30%
- [ ] At least 1 design partner has a "dd0c saved us $X" story
- **Gate:** Time-to-first-alert <10 minutes for all design partners

### Day 90: Public Launch

- [ ] Stripe billing live ($19/account/month, free tier)
- [ ] Show HN + Reddit + Product Hunt launched
- [ ] First "What's That Spike?" blog post published
- [ ] `aws-cost-cli` open-source tool released
- [ ] AWS Marketplace listing application submitted
- [ ] 200+ free signups in launch week
- **Gate:** At least 1 paying customer within 2 weeks of launch

### Month 4 Checkpoint

- [ ] 25+ paying accounts
- [ ] $475+ MRR
- [ ] Alert-to-action ratio >25%
- [ ] Monthly churn <10%
- [ ] At least 2 organic referrals
- **Kill trigger review:** If <5 paying accounts, initiate kill criteria evaluation

### Month 6 Checkpoint

- [ ] 100+ paying accounts
- [ ] $1,900+ MRR
- [ ] NPS >40
- [ ] Monthly churn <8%
- [ ] V2 development underway (dashboard, multi-account)
- [ ] Cross-sell motion with dd0c/route initiated
- **Kill trigger review:** If <25 paying accounts or >15% churn, initiate kill criteria evaluation

## Metrics to Track Daily

1. New signups (free + paid)
2. Accounts connected (signup → connected conversion)
3. Anomalies detected (total, by type, by severity)
4. Anomalies acted on (stop, terminate, snooze, mark expected)
5. Alert-to-action ratio
6. Time-to-first-alert
7. False positive reports (Mark as Expected / total alerts)

## Metrics to Track Weekly

1. MRR and MRR growth rate
2. Free-to-paid conversion rate
3. Churn rate (accounts disconnected or downgraded)
4. Estimated customer savings (sum of costs avoided via remediation)
5. Support ticket volume (early warning for complexity issues)

---

# APPENDIX: SCENARIO PROJECTIONS

| Scenario | Probability | Month 3 MRR | Month 6 MRR | Month 12 MRR | Description |
|----------|------------|-------------|-------------|--------------|-------------|
| **A: The Rocket** | 20% | $2,850 | $9,500 | $19,000 | HN front page, 2K signups week 1, 3% conversion, strong word-of-mouth |
| **B: The Grind** | 50% | $475 | $950 | $3,800 | Moderate HN traction, 500 signups week 1, slow steady growth via content |
| **C: The Pivot** | 25% | $95 | $285 | — | Lukewarm response, 200 signups, 1.5% conversion. Rebrand as portal feature or kill. |
| **D: The Extinction** | 5% | — | — | — | AWS ships competitive native tool. Kill immediately. Salvage CloudTrail engine for dd0c/alert. |

**Expected value (probability-weighted Month 12 MRR):** ~$5,700 from dd0c/cost alone. Combined with dd0c/route, the gateway drug pair targets $10-15K MRR at Month 12 under the most likely scenario.

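The probability-weighted figure can be reproduced directly from the scenario table. A quick sketch, with values copied from the table (Scenarios C and D contribute no Month 12 MRR, since both end in a pivot or kill):

```typescript
// Expected Month 12 MRR, weighted by scenario probability.
const scenarios = [
  { name: 'A: The Rocket', p: 0.2, month12Mrr: 19_000 },
  { name: 'B: The Grind', p: 0.5, month12Mrr: 3_800 },
  { name: 'C: The Pivot', p: 0.25, month12Mrr: 0 },
  { name: 'D: The Extinction', p: 0.05, month12Mrr: 0 },
];

const expectedMrr = scenarios.reduce((sum, s) => sum + s.p * s.month12Mrr, 0);
// ≈ $5,700, matching the "~$5,700 from dd0c/cost alone" figure
```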
---

*This brief synthesizes findings from four prior development phases: Brainstorm, Design Thinking, Innovation Strategy, and Party Mode Advisory Board Review. All contradictions between phases have been resolved in favor of the most conservative, execution-focused position. The advisory board voted 4-1 Conditional GO.*

*The bet: real-time CloudTrail analysis is an architectural wedge that incumbents can't easily follow. The condition: ship in 90 days, honor kill criteria, and stay ruthlessly focused on three things — detect fast, alert clearly, fix with one click.*

@@ -0,0 +1,103 @@

# dd0c/cost — Test Architecture & TDD Strategy

**Version:** 2.0
**Date:** February 28, 2026
**Status:** Authoritative
**Audience:** Founding engineer, future contributors

---

> **Guiding principle:** A cost anomaly detector that misses a $3,000 GPU instance is worse than useless — it's a liability. A cost anomaly detector that cries wolf 40% of the time gets disabled. Tests are the only way to ship with confidence at solo-founder velocity.

---

## Table of Contents

1. [Testing Philosophy & TDD Workflow](#1-testing-philosophy--tdd-workflow)
2. [Test Pyramid](#2-test-pyramid)
3. [Unit Test Strategy](#3-unit-test-strategy)
4. [Integration Test Strategy](#4-integration-test-strategy)
5. [E2E & Smoke Tests](#5-e2e--smoke-tests)
6. [Performance & Load Testing](#6-performance--load-testing)
7. [CI/CD Pipeline Integration](#7-cicd-pipeline-integration)
8. [Transparent Factory Tenet Testing](#8-transparent-factory-tenet-testing)
9. [Test Data & Fixtures](#9-test-data--fixtures)
10. [TDD Implementation Order](#10-tdd-implementation-order)

---

## 1. Testing Philosophy & TDD Workflow

### Red-Green-Refactor for dd0c/cost

TDD is non-negotiable for the anomaly scoring engine and baseline learning components. A scoring bug that ships to production means either missed anomalies (customers lose money) or false positives (customers disable the product). The cost of a test is minutes. The cost of a scoring bug is churn.

**Where TDD is mandatory:**

- `src/scoring/` — every scoring signal, composite calculation, and severity classification
- `src/baseline/` — all statistical operations (mean, stddev, rolling window, cold-start transitions)
- `src/parsers/` — every CloudTrail event parser (RunInstances, CreateDBInstance, etc.)
- `src/pricing/` — pricing lookup logic and cost estimation
- `src/governance/` — policy.json evaluation, auto-promotion logic, panic mode

**Where TDD is recommended but not mandatory:**

- `src/notifier/` — Slack Block Kit formatting (snapshot tests are sufficient)
- `src/api/` — REST handlers (contract tests cover these)
- `src/infra/` — CDK stacks (CDK assertions cover these)

**Where tests follow implementation:**

- `src/onboarding/` — CloudFormation URL generation, Cognito flows (integration tests only)
- `src/slack/` — OAuth flows, signature verification (integration tests)

### The Red-Green-Refactor Cycle

```
RED:      Write a failing test that describes the desired behavior.
          Name it precisely: what component, what input, what expected output.
          Run it. Watch it fail. Confirm it fails for the right reason.

GREEN:    Write the minimum code to make the test pass.
          No gold-plating. No "while I'm here" refactors.
          Run the test. Watch it pass.

REFACTOR: Clean up the implementation without changing behavior.
          Extract constants. Rename variables. Simplify logic.
          Tests must still pass after every refactor step.
```

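A compressed example of one RED→GREEN pass for the baseline component. Note that `rollingMean` is a hypothetical helper name for illustration, not the real `src/baseline/` API:

```typescript
// RED: the behavior we want, written before the implementation exists.
// (This sketch inlines plain assertions; the real suite uses a test runner.)
function testRollingMeanHandlesEmptyWindow(): void {
  if (rollingMean([]) !== 0) throw new Error('empty window should yield 0');
  if (rollingMean([2, 4, 6]) !== 4) throw new Error('mean of [2,4,6] should be 4');
}

// GREEN: the minimum code to make the test pass — no gold-plating.
function rollingMean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((a, b) => a + b, 0) / values.length;
}

testRollingMeanHandlesEmptyWindow(); // passes after GREEN
```

The REFACTOR step would then clean up names or extract constants while re-running the same test after each change.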
### Test Naming Convention

All tests follow the pattern: `[unit under test] [scenario] [expected outcome]`

```typescript
// ✅ Good — precise, readable, searchable
describe('scoreAnomaly', () => {
  it('returns critical severity when z-score exceeds 5.0 and instance type is novel', () => {});
  it('returns none severity when account is in cold-start and cost is below $0.50/hr', () => {});
  it('returns warning severity when actor is novel but cost is within 2 standard deviations', () => {});
  it('compounds severity when multiple signals fire simultaneously', () => {});
});

// ❌ Bad — vague, not searchable
describe('scoring', () => {
  it('works correctly', () => {});
  it('handles edge cases', () => {});
});
```

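For illustration, the behaviors named in the good tests above imply scoring logic roughly like the following. This is a hypothetical sketch: the field names and thresholds are illustrative, it omits the compounding rule, and it is not the real `src/scoring/` implementation:

```typescript
// Illustrative severity logic matching the test names above.
type Severity = 'none' | 'warning' | 'critical';

interface AnomalySignals {
  zScore: number;             // standard deviations above the learned baseline
  hourlyCost: number;         // estimated $/hr of the resource
  coldStart: boolean;         // account is still learning its baseline
  novelInstanceType: boolean; // instance type never seen in this account
  novelActor: boolean;        // IAM principal never seen creating resources
}

function scoreAnomaly(s: AnomalySignals): Severity {
  if (s.coldStart && s.hourlyCost < 0.5) return 'none';       // cold-start, cheap resource
  if (s.zScore > 5.0 && s.novelInstanceType) return 'critical'; // extreme outlier + novel type
  if (s.novelActor && s.zScore <= 2.0) return 'warning';        // new actor, cost within 2σ
  return s.zScore > 3.0 ? 'warning' : 'none';
}
```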
### Decision Log Requirement

Per Transparent Factory tenet (Story 10.3), any PR touching `src/scoring/`, `src/baseline/`, or `src/detection/` must include a `docs/decisions/<YYYY-MM-DD>-<slug>.json` file. The test suite validates this in CI.

```json
{
  "prompt": "Should Z-score threshold be 2.5 or 3.0?",
  "reasoning": "At 2.5, false positive rate in design partner data was 28%. At 3.0, it dropped to 18% with only 2 additional missed true positives over 30 days.",
  "alternatives_considered": ["2.0 (too noisy)", "3.5 (misses too many real anomalies)"],
  "confidence": "medium",
  "timestamp": "2026-02-28T10:00:00Z",
  "author": "brian"
}
```

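The CI validation described above could be sketched as follows. This is a hypothetical check that assumes the CI job can supply the PR's changed file paths; the actual pipeline wiring and filename pattern are illustrative:

```typescript
// Hypothetical CI gate: PRs touching guarded dirs must ship a decision-log file.
const GUARDED_DIRS = ['src/scoring/', 'src/baseline/', 'src/detection/'];
const DECISION_FILE = /^docs\/decisions\/\d{4}-\d{2}-\d{2}-[\w-]+\.json$/;

function decisionLogGateOk(changedFiles: string[]): boolean {
  const touchesGuarded = changedFiles.some((f) =>
    GUARDED_DIRS.some((dir) => f.startsWith(dir)),
  );
  if (!touchesGuarded) return true; // gate only applies to guarded code paths
  return changedFiles.some((f) => DECISION_FILE.test(f));
}
```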
---

2144
products/06-runbook-automation/architecture/architecture.md
Normal file
File diff suppressed because it is too large

270
products/06-runbook-automation/brainstorm/session.md
Normal file
@@ -0,0 +1,270 @@

# dd0c/run — Brainstorm Session: AI-Powered Runbook Automation

**Facilitator:** Carson, Elite Brainstorming Specialist
**Date:** 2026-02-28
**Product:** dd0c/run (Product #6 in the dd0c platform)
**Phase:** "On-Call Savior" (Months 4-6 per brand strategy)

---

## Phase 1: Problem Space (25 ideas)

The graveyard of runbooks is REAL. Let's map every angle of the pain.

### Discovery & Awareness

1. **The Invisible Runbook** — On-call engineer gets paged, doesn't know a runbook exists for this exact alert. It's buried in page 47 of a Confluence space nobody bookmarks.
2. **The "Ask Steve" Problem** — The runbook is Steve's brain. Steve is on vacation in Bali. Steve didn't write it down. Steve never will.
3. **The Wrong Runbook** — Engineer finds a runbook but it's for a different version of the service, or a similar-but-different failure mode. They follow it anyway. Things get worse.
4. **The Search Tax** — At 3am, panicking, the engineer spends 12 minutes searching Confluence, Notion, Slack, and Google Docs for the right runbook. MTTR just doubled.
5. **The Tribal Knowledge Silo** — Senior engineers have mental runbooks for every failure mode. They never write them down because "it's faster to just fix it myself."

### Runbook Rot & Maintenance

6. **The Day-One Decay** — A runbook is accurate the day it's written. By day 30, the infrastructure has changed and 3 of the 8 steps are wrong.
7. **The Nobody-Owns-It Problem** — Who maintains the runbook? The person who wrote it left 6 months ago. The team that owns the service doesn't know the runbook exists.
8. **The Copy-Paste Drift** — Runbooks get forked, copied, slightly modified. Now there are 4 versions and none are canonical.
9. **The Screenshot Graveyard** — Runbooks full of screenshots of UIs that have been redesigned twice since.
10. **The "Works on My Machine" Runbook** — Steps that assume specific IAM permissions, VPN configs, or CLI versions that the on-call engineer doesn't have.

### Cognitive Load & Human Factors

11. **3am Brain** — Cognitive function drops 30-40% during night pages. Complex multi-step runbooks become impossible to follow accurately.
12. **The Panic Spiral** — Alert fires → engineer panics → skips steps → makes it worse → more alerts fire → more panic. The runbook can't help if the human can't process it.
13. **Context Switching Hell** — Following a runbook means jumping between the doc, the terminal, the AWS console, Datadog, Slack, and PagerDuty. Each switch costs 30 seconds of re-orientation.
14. **The "Which Step Am I On?" Problem** — Engineer gets interrupted by Slack message, loses their place in a 20-step runbook, re-executes step 7 which was supposed to be idempotent but isn't.
15. **Decision Fatigue at the Fork** — Runbook says "If X, do A. If Y, do B. If neither, escalate." Engineer can't tell if it's X or Y. Freezes.

### Organizational & Cultural

16. **The Postmortem Lie** — Every postmortem says "Action item: update runbook." It never happens. The Jira ticket sits in the backlog for eternity.
17. **The Hero Culture** — Organizations reward the engineer who heroically fixes the incident, not the one who writes the boring runbook. Incentives are backwards.
18. **The New Hire Cliff** — New on-call engineer's first page. No context, no muscle memory, runbooks assume 2 years of institutional knowledge.
19. **The Handoff Gap** — Shift change during an active incident. The outgoing engineer's context is lost. The incoming engineer starts from scratch.
20. **The "We Don't Have Runbooks" Admission** — Many teams simply don't have runbooks at all. They rely entirely on tribal knowledge and hope.

### Economic & Business Impact

21. **The MTTR Multiplier** — Every minute of downtime costs money. A 30-minute MTTR vs 5-minute MTTR on a revenue-critical service can mean $50K+ difference per incident for mid-market companies.
22. **The Attrition Cost** — On-call burnout is the #1 reason SREs quit. Bad runbooks = more stress = more turnover = $150K+ per lost engineer.
23. **The Compliance Gap** — SOC 2 and ISO 27001 require documented incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive an audit.
24. **The Repeated Incident Tax** — Same incident happens monthly. Same engineer fixes it manually each time. Nobody automates it because "it only takes 20 minutes." That's 4 hours/year of senior engineer time per recurring incident.
25. **The Escalation Cascade** — Junior engineer can't follow the runbook → escalates to senior → senior is also paged for 3 other things → everyone's awake, nobody's effective.

---

## Phase 2: Solution Space (78 ideas)

LET'S GO. Every idea is valid. We're building the future of incident response.

### Ingestion & Import (Ideas 26-36)

26. **Confluence Crawler** — API integration that discovers and imports all runbook-tagged pages from Confluence spaces. Parses prose into structured steps.
27. **Notion Sync** — Bidirectional sync with Notion databases. Import existing runbooks, push updates back.
28. **GitHub/GitLab Markdown Ingest** — Point at a repo directory of `.md` runbooks. Auto-import on merge to main.
29. **Slack Thread Scraper** — "That time we fixed the database" lives in a Slack thread. AI extracts the resolution steps from the conversation noise.
30. **Google Docs Connector** — Many teams keep runbooks in shared Google Docs. Import and keep synced.
31. **Video Transcription Import** — Senior engineer recorded a Loom/screen recording of fixing an issue. AI transcribes it, extracts the steps, generates a runbook.
32. **Terminal Session Replay Import** — Import asciinema recordings or shell history from past incidents. AI identifies the commands that actually fixed the issue vs. the diagnostic noise.
33. **Postmortem-to-Runbook Pipeline** — Feed in your postmortem doc. AI extracts the resolution steps and generates a draft runbook automatically.
34. **PagerDuty/OpsGenie Notes Scraper** — Engineers often leave resolution notes in the incident timeline. Scrape those into runbook drafts.
35. **Jira/Linear Ticket Mining** — Incident tickets often contain resolution steps in comments. Mine them.
36. **Clipboard/Paste Import** — Zero-friction: just paste the text of your runbook into dd0c/run. AI structures it instantly.

### AI Parsing & Understanding (Ideas 37-45)

37. **Prose-to-Steps Converter** — AI takes a wall of text ("First you need to SSH into the bastion, then check the logs for...") and converts it into numbered, executable steps.
38. **Command Extraction** — AI identifies shell commands, API calls, and SQL queries embedded in prose. Tags them as executable.
39. **Prerequisite Detection** — AI identifies implicit prerequisites ("you need kubectl access", "make sure you're on the VPN") and surfaces them as a checklist before execution.
40. **Conditional Logic Mapping** — AI identifies decision points ("if the error is X, do Y; otherwise do Z") and creates branching workflow trees.
41. **Risk Classification per Step** — AI labels each step: 🟢 Safe (read-only), 🟡 Caution (state change, reversible), 🔴 Dangerous (destructive, irreversible). Auto-execute green, prompt for yellow, require explicit approval for red.
42. **Staleness Detection** — AI cross-references runbook commands against current infrastructure (Terraform state, K8s manifests, AWS resource tags). Flags steps that reference resources that no longer exist.
43. **Ambiguity Highlighter** — AI flags vague steps ("check the logs" — which logs? where?) and prompts the author to clarify.
44. **Multi-Language Support** — Parse runbooks written in English, Spanish, Japanese, etc. Incident response is global.
45. **Diagram/Flowchart Generation** — AI generates a visual flowchart from the runbook steps. Engineers can see the whole decision tree at a glance.

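Idea 41's risk tiers map naturally onto execution policies. A hypothetical sketch of that mapping, using illustrative names rather than any real dd0c/run API:

```typescript
// Illustrative: idea 41's risk-tiered execution policy.
type RiskTier = 'safe' | 'caution' | 'dangerous';
type ExecutionPolicy = 'auto-execute' | 'prompt' | 'explicit-approval';

function policyFor(tier: RiskTier): ExecutionPolicy {
  switch (tier) {
    case 'safe':
      return 'auto-execute'; // 🟢 read-only, no approval needed
    case 'caution':
      return 'prompt'; // 🟡 reversible state change, engineer confirms
    case 'dangerous':
      return 'explicit-approval'; // 🔴 destructive, requires explicit sign-off
  }
}
```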
### Execution Modes (Ideas 46-54)

46. **Full Autopilot** — For well-tested, low-risk runbooks: AI executes every step automatically, reports results.
47. **Copilot Mode (Human-in-the-Loop)** — AI suggests each step, pre-fills the command, engineer clicks "Execute" or modifies it first. The default mode.
48. **Suggestion-Only / Read-Along** — AI walks the engineer through the runbook step by step, highlighting the current step, but doesn't execute anything. Training wheels.
49. **Dry-Run Mode** — AI simulates execution of each step, shows what WOULD happen without actually doing it. Perfect for testing runbooks.
50. **Progressive Trust** — Starts in suggestion-only mode. As the team builds confidence, they can promote individual runbooks to copilot or autopilot mode.
51. **Approval Chains** — Dangerous steps require approval from a second engineer or a manager. Integrated with Slack/Teams for quick approvals.
52. **Rollback-Aware Execution** — Every step that changes state also records the rollback command. If things go wrong, one-click undo.
53. **Parallel Step Execution** — Some runbook steps are independent. AI identifies parallelizable steps and executes them simultaneously to reduce MTTR.
54. **Breakpoint Mode** — Engineer sets breakpoints in the runbook like a debugger. Execution pauses at those points for manual inspection.

### Alert Integration (Ideas 55-62)

55. **Auto-Attach Runbook to Incident** — When PagerDuty/OpsGenie fires an alert, dd0c/run automatically identifies the most relevant runbook and attaches it to the incident.
56. **Alert-to-Runbook Matching Engine** — ML model that learns which alerts map to which runbooks based on historical resolution patterns.
57. **Slack Bot Integration** — `/dd0c run database-failover` in Slack. The bot walks you through the runbook right in the channel.
58. **PagerDuty Custom Action** — One-click "Run Runbook" button directly in the PagerDuty incident page.
59. **Pre-Incident Warm-Up** — When anomaly detection suggests an incident is LIKELY (but hasn't fired yet), dd0c/run pre-loads the relevant runbook and notifies the on-call.
60. **Multi-Alert Correlation** — When 5 alerts fire simultaneously, AI determines they're all symptoms of one root cause and suggests the single runbook that addresses it.
61. **Escalation-Aware Routing** — If the L1 runbook doesn't resolve the issue within N minutes, automatically escalate to L2 runbook and page the senior engineer with full context.
62. **Alert Context Injection** — When the runbook starts, AI pre-populates variables (affected service, region, customer impact) from the alert payload. No manual lookup needed.

### Learning Loop & Continuous Improvement (Ideas 63-72)

63. **Resolution Tracking** — Track which runbook steps actually resolved the incident vs. which were skipped or failed. Use this data to improve runbooks.
64. **Auto-Update Suggestions** — After an incident, AI compares what the engineer actually did vs. what the runbook said. Suggests updates for divergences.
65. **Runbook Effectiveness Score** — Each runbook gets a score: success rate, average MTTR when used, skip rate per step. Surface the worst-performing runbooks for review.
66. **Dead Step Detection** — If step 4 is skipped by every engineer every time, it's probably unnecessary. Flag it for removal.
67. **New Failure Mode Detection** — AI notices an incident that doesn't match any existing runbook. Prompts the resolving engineer to create one from their actions.
68. **A/B Testing Runbooks** — Two approaches to fixing the same issue? Run both, track which has better MTTR. Data-driven runbook optimization.
69. **Seasonal Pattern Learning** — "This database issue happens every month-end during batch processing." AI learns temporal patterns and pre-stages runbooks.
70. **Cross-Team Learning** — Anonymized patterns: "Teams using this AWS architecture commonly need this type of runbook." Suggest runbook templates based on infrastructure fingerprint.
71. **Confidence Decay Model** — Runbook confidence score decreases over time since last successful use or last infrastructure change. Triggers review when confidence drops below threshold.
72. **Incident Replay for Training** — Record the full incident timeline (alerts, runbook execution, engineer actions). Replay it for training new on-call engineers.

### Collaboration & Handoff (Ideas 73-80)

73. **Multi-Engineer Incident View** — Multiple engineers working the same incident can see each other's progress through the runbook in real-time.
74. **Shift Handoff Package** — When shift changes during an incident, dd0c/run generates a context package: what's been tried, what's left, current state.
75. **War Room Mode** — Dedicated incident channel with the runbook pinned, step progress visible, and AI providing real-time suggestions.
76. **Expert Paging with Context** — When escalating, the paged expert receives not just "help needed" but the full runbook execution history, what's been tried, and where it's stuck.
77. **Async Runbook Contributions** — After an incident, any engineer can suggest edits to the runbook. Changes go through a review process like a PR.
78. **Runbook Comments & Annotations** — Engineers can leave inline comments on runbook steps ("This step takes 5 minutes, don't panic if it seems stuck").
79. **Incident Narration** — AI generates a real-time narrative of the incident for stakeholders: "The team is on step 5 of 8. Database failover is in progress. ETA: 10 minutes."
80. **Cross-Timezone Handoff Intelligence** — AI knows which engineers are in which timezone and suggests optimal handoff points.

### Runbook Creation & Generation (Ideas 81-90)

81. **Terminal Watcher** — Opt-in agent that watches your terminal during an incident. After resolution, AI generates a runbook from the commands you ran.
82. **Incident Postmortem → Runbook** — Feed in the postmortem. AI generates the runbook. Close the loop that every team promises but never delivers.
83. **Screen Recording → Runbook** — Record your screen while fixing an issue. AI watches, transcribes, and generates a step-by-step runbook.
84. **Slack Thread → Runbook** — Point at a Slack thread where an incident was resolved. AI extracts the signal from the noise and generates a runbook.
85. **Template Library** — Pre-built runbook templates for common scenarios: "AWS RDS failover", "Kubernetes pod crash loop", "Redis memory pressure", "Certificate expiry".
86. **Infrastructure-Aware Generation** — dd0c/run knows your infrastructure (via dd0c/portal integration). When you deploy a new service, it auto-suggests runbook templates based on the tech stack.
87. **Chaos Engineering Integration** — Run a chaos experiment (Gremlin, LitmusChaos). dd0c/run observes the resolution and generates a runbook from it.
88. **Pair Programming Runbooks** — AI and engineer co-author a runbook interactively. AI asks questions ("What do you check first?"), engineer answers, AI structures it.
89. **Runbook from Architecture Diagram** — Feed in your architecture diagram. AI identifies potential failure points and generates skeleton runbooks for each.
90. **Git-Backed Runbooks** — All runbooks stored as code in a Git repo. Version history, PRs for changes, CI/CD for validation. The runbook-as-code movement.

### Wild & Visionary Ideas (Ideas 91-103)

91. **Incident Simulator / Fire Drills** — Simulate incidents in a sandbox environment. On-call engineers practice runbooks without real consequences. Gamified with scores and leaderboards.
92. **Voice-Guided Runbooks** — At 3am, reading is hard. AI reads the runbook steps aloud through your headphones while you type commands. Hands-free incident response.
93. **Runbook Marketplace** — Community-contributed runbook templates. "Here's how Stripe handles Redis failover." Anonymized, vetted, rated.
94. **Predictive Runbook Staging** — AI predicts incidents before they happen (based on metrics trends) and pre-stages the relevant runbook, pre-approves safe steps, and alerts the on-call: "Heads up, you might need this in 30 minutes."
95. **Natural Language Incident Response** — Engineer types "the database is slow" in Slack. AI figures out which database, runs diagnostics, identifies the issue, and suggests the right runbook. No alert needed.
96. **Runbook Dependency Graph** — Visualize how runbooks relate to each other. "If Runbook A fails, try Runbook B." "Runbook C is a prerequisite for Runbook D."
97. **Self-Healing Runbooks** — Runbooks that detect their own staleness by periodically dry-running against the current infrastructure and flagging broken steps.
98. **Customer-Impact Aware Execution** — AI knows which customers are affected (via dd0c/portal service catalog) and prioritizes runbook execution based on customer tier/revenue impact.
99. **Regulatory Compliance Mode** — Every runbook execution is logged with full audit trail. Who ran what, when, what changed. Auto-generates compliance evidence for SOC 2/ISO 27001.
100. **Multi-Cloud Runbook Abstraction** — Write one runbook that works across AWS, GCP, and Azure. AI translates cloud-specific commands based on the target environment.
101. **Runbook Health Dashboard** — Single pane of glass: total runbooks, coverage gaps (services without runbooks), staleness scores, usage frequency, effectiveness ratings.
102. **"What Would Steve Do?" Mode** — AI learns from how senior engineers resolve incidents (via terminal watcher + historical data) and can suggest their approach even when they're not available.
103. **Incident Cost Tracker** — Real-time cost counter during incident: "This outage has cost $12,400 so far. Estimated savings if resolved in next 5 minutes: $8,200."

---

## Phase 3: Differentiation & Moat (18 ideas)

### Beating Rundeck (Free/OSS)

104. **UX Superiority** — Rundeck's UI is from 2015. dd0c/run is Linear-quality UX. Engineers will pay for beautiful, fast tools.
105. **Zero Config** — Rundeck requires Java, a database, YAML job definitions. dd0c/run: paste your runbook, it works. Time-to-value < 5 minutes.
106. **AI-Native vs. Bolt-On** — Rundeck is a job scheduler with runbook features bolted on. dd0c/run is AI-first. The AI IS the product, not a feature.
107. **SaaS vs. Self-Hosted Burden** — Rundeck requires hosting, patching, upgrading. dd0c/run is managed SaaS. One less thing to maintain.

### Beating PagerDuty Automation Actions

108. **Not Locked to PagerDuty** — PagerDuty Automation Actions only works within PagerDuty. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, or any webhook-based alerting.
109. **Runbook Intelligence vs. Dumb Automation** — PagerDuty runs pre-defined scripts. dd0c/run understands the runbook, adapts to context, handles branching logic, and learns.
110. **Ingestion from Anywhere** — PagerDuty can't import your existing Confluence runbooks. dd0c/run can.
111. **Mid-Market Pricing** — PagerDuty's automation is an expensive add-on to an already expensive product. dd0c/run is $15-30/seat standalone.

### The Data Moat
|
||||||
|
112. **Runbook Corpus** — Every runbook ingested makes the AI smarter at parsing and structuring new runbooks. Network effect.
|
||||||
|
113. **Resolution Pattern Database** — "When alert X fires for service type Y, step Z resolves it 87% of the time." This data is incredibly valuable and compounds over time.
|
||||||
|
114. **Infrastructure Fingerprinting** — dd0c/run learns common failure patterns for specific tech stacks (EKS + RDS + Redis = these 5 failure modes). New customers with similar stacks get instant runbook suggestions.
|
||||||
|
115. **MTTR Benchmarking** — "Your MTTR for database incidents is 23 minutes. Similar teams average 8 minutes. Here's what they do differently." Anonymized cross-customer intelligence.
|
||||||
|
|
||||||
|
### Platform Integration Moat (dd0c Ecosystem)
|
||||||
|
116. **dd0c/alert → dd0c/run Pipeline** — Alert intelligence identifies the incident, runbook automation resolves it. Together they're 10x more valuable than apart.
|
||||||
|
117. **dd0c/portal Service Catalog** — dd0c/run knows who owns the service, what it depends on, and who to page. No configuration needed if you're already using portal.
|
||||||
|
118. **dd0c/cost Integration** — Runbook execution can factor in cost: "This remediation will spin up 3 extra instances costing $X/hour. Approve?"
|
||||||
|
119. **dd0c/drift Integration** — "This incident was caused by infrastructure drift detected by dd0c/drift. Here's the runbook to remediate AND the drift to fix."
|
||||||
|
120. **Unified Audit Trail** — All dd0c modules share one audit log. Compliance teams get a single source of truth for incident response, cost decisions, and infrastructure changes.
|
||||||
|
121. **The "Last Mile" Advantage** — Competitors solve one piece. dd0c solves the whole chain: detect anomaly → correlate alerts → identify runbook → execute resolution → update documentation → generate postmortem.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 4: Anti-Ideas & Red Team (15 ideas)
|
||||||
|
|
||||||
|
Time to be brutally honest. Let's stress-test this thing.
|
||||||
|
|
||||||
|
### Why This Could Fail

122. **AI Agents Make Runbooks Obsolete** — If autonomous AI agents (Pulumi Neo, GitHub Agentic Workflows) can detect and fix infrastructure issues without human intervention, who needs runbooks? *Counter:* We're 3-5 years from trusting AI to autonomously fix production. Runbooks are the bridge. And even with AI agents, you need runbooks as the "policy" that defines what the agent should do.

123. **Trust Barrier** — Will engineers let an AI run commands in their production environment? The first time dd0c/run makes an incident worse, trust is destroyed forever. *Counter:* Progressive trust model. Start with suggestion-only. Graduate to copilot. Autopilot only for proven runbooks. Never force it.

124. **The AI Makes It Worse** — AI misinterprets a runbook step, executes the wrong command, cascading failure. *Counter:* Risk classification per step. Dangerous steps always require human approval. Dry-run mode. Rollback-aware execution.

125. **Runbook Quality Garbage-In** — If the existing runbooks are terrible (and they usually are), AI can't magically make them good. It'll just execute bad steps faster. *Counter:* AI quality scoring on import. Flag ambiguous, incomplete, or risky runbooks. Suggest improvements before enabling execution.

126. **Security & Compliance Nightmare** — dd0c/run needs access to production systems to execute commands. That's a massive attack surface and compliance concern. *Counter:* Agent-based architecture (like the dd0c brand strategy specifies). Agent runs in their VPC. SaaS never sees credentials. SOC 2 compliance from day one.

127. **Small Market?** — How many teams actually have runbooks worth automating? Most teams don't have runbooks at all. *Counter:* That's the opportunity. dd0c/run doesn't just automate existing runbooks — it helps CREATE them. The terminal watcher and postmortem-to-runbook features address teams with zero runbooks.

128. **Rundeck is Free** — Why pay for dd0c/run when Rundeck is open source? *Counter:* Rundeck is a job scheduler, not an AI runbook engine. It's like comparing Notepad to VS Code. Different products for different eras.

129. **PagerDuty/Rootly Acquire the Space** — Big players could build or acquire this capability. *Counter:* PagerDuty is slow-moving and enterprise-focused. Rootly is incident management, not runbook execution. By the time they build it, dd0c/run has the data moat.

130. **Engineer Resistance** — "I don't need an AI to tell me how to do my job." Cultural resistance from senior engineers. *Counter:* Position it as a tool for the 3am junior engineer, not the senior. Seniors benefit because they stop getting paged for things juniors can now handle.

131. **Integration Fatigue** — Yet another tool to integrate with PagerDuty, Slack, AWS, etc. *Counter:* dd0c platform handles integrations once. dd0c/run inherits them.

132. **Latency During Incidents** — If dd0c/run adds latency to incident response (loading, parsing, waiting for AI), engineers will bypass it. *Counter:* Pre-stage runbooks. Cache everything. AI inference must be < 2 seconds. If it's slower than reading the doc, it's useless.

133. **Liability** — If dd0c/run's AI suggestion causes data loss, who's liable? *Counter:* Clear ToS. AI suggests, human approves (in copilot mode). Audit trail proves the human clicked "Execute."

134. **Hallucination Risk** — AI "invents" a runbook step that doesn't exist in the source material. *Counter:* Strict grounding. Every suggested step must trace back to the source runbook. Hallucination detection layer. Never generate steps that aren't in the original document unless explicitly in "creative" mode.

135. **Chicken-and-Egg: No Runbooks = No Product** — Teams without runbooks can't use dd0c/run. *Counter:* Terminal watcher, postmortem mining, Slack thread scraping, and template library all solve cold-start. dd0c/run creates runbooks, not just executes them.

136. **Pricing Pressure** — If the market commoditizes AI runbook execution, margins collapse. *Counter:* The moat isn't the execution engine — it's the resolution pattern database and the dd0c platform integration. Those compound over time.

---

## Phase 5: Synthesis

### Top 10 Ideas (Ranked by Impact × Feasibility)

| Rank | Idea | # | Why |
|------|------|---|-----|
| 1 | **Copilot Mode (Human-in-the-Loop Execution)** | 47 | The core product. AI suggests, human approves. Safe, trustworthy, immediately valuable. This IS dd0c/run. |
| 2 | **Auto-Attach Runbook to Incident** | 55 | The killer integration. Alert fires → runbook appears. Solves the #1 problem (engineers don't know the runbook exists). |
| 3 | **Risk Classification per Step** | 41 | The trust enabler. Green/yellow/red labeling makes engineers comfortable letting AI execute safe steps while maintaining control over dangerous ones. |
| 4 | **Confluence/Notion/Markdown Ingestion** | 26-28 | Meet teams where they are. Zero migration friction. Import existing runbooks in minutes. |
| 5 | **Prose-to-Steps AI Converter** | 37 | The magic moment. Paste a wall of text, get a structured executable runbook. This is the demo that sells the product. |
| 6 | **Terminal Watcher → Auto-Generate Runbook** | 81 | Solves the cold-start problem AND the "seniors won't write runbooks" problem. The runbook writes itself. |
| 7 | **Resolution Tracking & Auto-Update Suggestions** | 63-64 | The learning loop that kills runbook rot. Runbooks get better with every incident instead of decaying. |
| 8 | **Slack Bot Integration** | 57 | Meet engineers where they already are during incidents. No context switching. `/dd0c run` in the incident channel. |
| 9 | **Runbook Effectiveness Score** | 65 | Data-driven runbook management. Surface the worst runbooks, celebrate the best. Gamification of operational excellence. |
| 10 | **dd0c/alert → dd0c/run Pipeline** | 116 | The platform play. Alert intelligence + runbook automation = the complete incident response stack. This is how you beat point solutions. |
### 3 Wild Cards 🃏

1. **🃏 Incident Simulator / Fire Drills (#91)** — Gamified incident response training. Engineers practice runbooks in a sandbox. Leaderboards, scores, team competitions. This could be the viral growth mechanism — "My team's incident response score is 94. What's yours?" Could become a standalone product.

2. **🃏 Voice-Guided Runbooks (#92)** — At 3am, your eyes are barely open. What if dd0c/run talked you through the incident like a calm co-pilot? "Step 3: SSH into the bastion. The command is ready in your clipboard. Press enter when ready." This is genuinely differentiated — nobody else is doing audio-guided incident response.

3. **🃏 "What Would Steve Do?" Mode (#102)** — AI learns senior engineers' incident response patterns and can replicate their decision-making. This is the ultimate knowledge capture tool. When Steve leaves the company, his expertise stays. Emotionally compelling pitch for engineering managers worried about bus factor.

### Recommended V1 Scope

**V1 = "Paste → Parse → Page → Pilot"**

The minimum viable product that delivers immediate value:

1. **Ingest** — Paste a runbook (plain text, markdown) or connect Confluence/Notion. AI parses it into structured, executable steps with risk classification (green/yellow/red).

2. **Match** — PagerDuty/OpsGenie webhook integration. When an alert fires, dd0c/run matches it to the most relevant runbook using semantic similarity + alert metadata.

3. **Execute (Copilot Mode)** — Slack bot or web UI walks the on-call engineer through the runbook step-by-step. Auto-executes green (safe) steps. Prompts for approval on yellow/red steps. Pre-fills commands with context from the alert.

4. **Learn** — Track which steps were executed, skipped, or modified. After incident resolution, suggest runbook updates based on what actually happened vs. what the runbook said.

**What V1 does NOT include:**

- Terminal watcher (V2)
- Full autopilot mode (V2 — need trust first)
- Incident simulator (V3)
- Multi-cloud abstraction (V3)
- Runbook marketplace (V4)

**V1 Success Metrics:**

- Time-to-first-runbook: < 5 minutes (paste and go)
- MTTR reduction: 40%+ for teams using dd0c/run vs. manual runbook following
- Runbook coverage: surface services with zero runbooks, track coverage growth
- NPS from on-call engineers: > 50 (they actually LIKE being on-call now)

**V1 Tech Stack:**

- Lightweight agent (Rust/Go) runs in customer VPC for command execution
- SaaS dashboard + Slack bot for the UI
- OpenAI/Anthropic for runbook parsing and step generation (use dd0c/route for cost optimization — eat your own dog food)
- PagerDuty + OpsGenie webhooks for alert integration
- PostgreSQL + vector DB for runbook storage and semantic matching

**V1 Pricing:**

- Free: 3 runbooks, suggestion-only mode
- Pro ($25/seat/month): Unlimited runbooks, copilot execution, Slack bot
- Business ($49/seat/month): Autopilot mode, API access, SSO, audit trail

---

*Total ideas generated: 136*

*Session complete. Let's build this thing.* 🔥

---

# dd0c/run — V1 MVP Epics

This document breaks down the V1 MVP of dd0c/run into implementation Epics and User Stories. Scope is strictly limited to V1 (read-only execution 🟢, copilot approval for 🟡/🔴, no autopilot, no terminal watcher).

---

## Epic 1: Runbook Parser

**Description:** The "5-second wow moment." Ingest raw unstructured text from a paste event, normalize it, and use a fast LLM (e.g., Claude Haiku via dd0c/route) to extract a structured JSON representation of executable steps, variables, conditional branches, and prerequisites in under 5 seconds.

**Dependencies:** None. This is the foundational data structure.

**Technical Notes:**
- Pure Rust normalizer to strip HTML/Markdown before hitting the LLM to save tokens and latency.
- LLM prompt must enforce a strict JSON schema.
- Idempotent parsing: same text + temperature 0 = same output.
- Must run in < 3.5s to leave room for the Action Classifier within the 5s SLA.
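
The normalizer bullet above is concrete enough to sketch. Below is a minimal, standard-library-only illustration of the idea, not the actual dd0c/run implementation: it drops bullet markers, strips naive inline HTML tags, and discards blank lines. A real normalizer would have to preserve `<placeholder>` tokens, which this naive tag-stripper would eat.

```rust
// Illustrative sketch of a pre-LLM text normalizer. Assumptions: bullet
// markers are '-', '*', or '>'; anything between '<' and '>' is an HTML tag.
pub fn normalize(raw: &str) -> String {
    raw.lines()
        .map(|line| {
            // Drop leading Markdown bullet/blockquote markers and whitespace.
            let line = line.trim().trim_start_matches(&['-', '*', '>'][..]).trim_start();
            // Strip simple inline HTML tags character by character.
            let mut out = String::new();
            let mut in_tag = false;
            for c in line.chars() {
                match c {
                    '<' => in_tag = true,
                    '>' => in_tag = false,
                    _ if !in_tag => out.push(c),
                    _ => {}
                }
            }
            out.trim().to_string()
        })
        .filter(|l| !l.is_empty()) // discard blank lines to save tokens
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "  * Check <b>replica lag</b>\n\n- restart the worker ";
    assert_eq!(normalize(raw), "Check replica lag\nrestart the worker");
}
```

The important property for the SLA is that this pass is pure string manipulation: no allocation beyond the output, no I/O, so its cost is negligible next to the LLM call.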

### User Stories

**Story 1.1: Raw Text Normalization & Ingestion**

*As a runbook author (Morgan), I want to paste raw text from Confluence or Notion into the system, so that I don't have to learn a proprietary YAML DSL.*

- **Acceptance Criteria:**
  - System accepts raw text payload via API.
  - Rust-based normalizer strips HTML tags, markdown formatting, and normalizes whitespace/bullet points.
  - Original raw text and hash are preserved in the DB for audit/re-parsing.
- **Story Points:** 2
- **Dependencies:** None

**Story 1.2: LLM Structured Step Extraction**

*As a system, I want to pass normalized text to a fast LLM to extract an ordered JSON array of steps and commands, so that the runbook becomes machine-readable.*

- **Acceptance Criteria:**
  - Sends normalized text to `dd0c/route` with a strict JSON schema prompt.
  - Correctly extracts step order, natural language description, and embedded CLI/shell commands.
  - P95 latency for extraction is < 3.5 seconds.
  - Rejects/errors gracefully if the text contains no actionable steps.
- **Story Points:** 3
- **Dependencies:** 1.1

**Story 1.3: Variable & Prerequisite Detection**

*As an on-call engineer (Riley), I want the parser to identify implicit prerequisites (like VPN) and variables (like `<instance-id>`), so that I'm fully prepared before the runbook starts.*

- **Acceptance Criteria:**
  - Regex/heuristic scanner identifies common placeholders (`$VAR`, `<var>`, `{var}`).
  - LLM identifies implicit requirements ("ensure you are on VPN", "requires prod AWS profile").
  - Outputs a structured array of `variables` and `prerequisites` in the JSON payload.
- **Story Points:** 2
- **Dependencies:** 1.2
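
Story 1.3's placeholder scanner can be sketched without a regex crate. This is an illustrative heuristic only, assuming the three placeholder styles named in the acceptance criteria; it is not the shipped scanner.

```rust
// Heuristic placeholder scanner: finds `$VAR`, `<var>`, and `{var}` styles.
// Illustrative sketch; the real scanner would also dedupe and validate names.
pub fn find_placeholders(text: &str) -> Vec<String> {
    let mut found = Vec::new();
    let chars: Vec<char> = text.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            '$' => {
                // $VAR: take alphanumerics/underscores after the dollar sign.
                let start = i + 1;
                let mut j = start;
                while j < chars.len() && (chars[j].is_alphanumeric() || chars[j] == '_') {
                    j += 1;
                }
                if j > start {
                    found.push(chars[start..j].iter().collect());
                }
                i = j;
            }
            open @ ('<' | '{') => {
                let close = if open == '<' { '>' } else { '}' };
                if let Some(len) = chars[i + 1..].iter().position(|&c| c == close) {
                    let name: String = chars[i + 1..i + 1 + len].iter().collect();
                    // Accept identifier-like names only, to avoid HTML tags.
                    if !name.is_empty()
                        && name.chars().all(|c| c.is_alphanumeric() || c == '-' || c == '_')
                    {
                        found.push(name);
                    }
                    i += len + 2;
                } else {
                    i += 1;
                }
            }
            _ => i += 1,
        }
    }
    found
}

fn main() {
    let steps = "ssh $BASTION_HOST then reboot <instance-id> in {region}";
    assert_eq!(find_placeholders(steps), vec!["BASTION_HOST", "instance-id", "region"]);
}
```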

**Story 1.4: Branching & Ambiguity Highlighting**

*As a runbook author (Morgan), I want the system to map conditional logic (if/else) and flag vague instructions, so that my runbooks are deterministic and clear.*

- **Acceptance Criteria:**
  - Parser detects simple "if X then Y" statements and maps them to a step execution DAG (Directed Acyclic Graph).
  - Identifies vague steps (e.g., "check the logs" without specifying which logs) and adds them to an `ambiguities` array for author review.
- **Story Points:** 3
- **Dependencies:** 1.2

## Epic 2: Action Classifier

**Description:** The safety-critical core of the system. Classifies every extracted command as 🟢 Safe, 🟡 Caution, 🔴 Dangerous, or ⬜ Unknown. Uses a defense-in-depth "dual-key" architecture: a deterministic compiled Rust scanner overrides an advisory LLM classifier. A misclassification here is an existential risk.

**Dependencies:** Epic 1 (needs parsed steps to classify)

**Technical Notes:**
- The deterministic scanner must use tree-sitter or compiled regex sets.
- Merge rules must be hardcoded in Rust (not configurable).
- LLM classification must run in parallel across steps to meet the 5-second total parse+classify SLA.
- If the LLM confidence is low or unknown, risk escalates. Scanner wins disagreements.

### User Stories

**Story 2.1: Deterministic Safety Scanner**

*As a system, I want to pattern-match commands against a compiled database of known signatures, so that destructive commands are definitively blocked without relying on an LLM.*

- **Acceptance Criteria:**
  - Rust library with compiled regex sets (allowlist 🟢, caution 🟡, blocklist 🔴).
  - Uses tree-sitter to parse SQL/shell ASTs to detect destructive patterns (e.g., `DELETE` without `WHERE`, piped `rm -rf`).
  - Executes in < 1ms per command.
  - Defaults to `Unknown` (🟡 minimum) if no patterns match.
- **Story Points:** 5
- **Dependencies:** None (standalone library)
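
The scanner's tiering logic might look like the following sketch. The pattern lists here are tiny stand-ins for the compiled regex sets, and the tree-sitter AST pass (catching `DELETE` without `WHERE` and similar) is omitted; only the tier precedence and the default-to-Unknown rule are shown.

```rust
// Illustrative first pass of the deterministic scanner. Pattern sets are
// placeholder substrings, not the real compiled signature database.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Risk { Safe, Caution, Dangerous, Unknown }

const SAFE_PREFIXES: &[&str] = &["kubectl get", "aws rds describe", "cat ", "tail "];
const DANGEROUS_PATTERNS: &[&str] = &["rm -rf", "drop table", "terraform destroy"];
const CAUTION_PATTERNS: &[&str] = &["kubectl rollout restart", "systemctl restart"];

pub fn scan(command: &str) -> Risk {
    let cmd = command.to_lowercase();
    // Blocklist wins: a dangerous signature anywhere in the line is decisive,
    // which is what catches piped forms like `... | xargs rm -rf`.
    if DANGEROUS_PATTERNS.iter().any(|p| cmd.contains(p)) {
        return Risk::Dangerous;
    }
    if CAUTION_PATTERNS.iter().any(|p| cmd.contains(p)) {
        return Risk::Caution;
    }
    // Allowlist is prefix-only: read-only commands must start clean.
    if SAFE_PREFIXES.iter().any(|p| cmd.starts_with(p)) {
        return Risk::Safe;
    }
    // No pattern matched: never assume safety.
    Risk::Unknown
}

fn main() {
    assert_eq!(scan("kubectl get pods -n prod"), Risk::Safe);
    assert_eq!(scan("echo ok | xargs rm -rf /tmp/x"), Risk::Dangerous);
    assert_eq!(scan("./custom-script.sh"), Risk::Unknown);
}
```

The design choice worth noting is the asymmetry: the blocklist matches anywhere, the allowlist only at the start of the command. False negatives on the dangerous side are the existential risk the epic describes, so the rules are biased toward escalation.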

**Story 2.2: LLM Contextual Classifier**

*As a system, I want an LLM to assess the command in the context of surrounding steps, so that implicit state changes or custom scripts are caught.*

- **Acceptance Criteria:**
  - Sends the command, surrounding steps, and infrastructure context to an LLM via a strict JSON prompt.
  - Returns a classification (🟢/🟡/🔴), a confidence score, and suggested rollbacks.
  - Escalates the classification to the next risk tier if confidence < 0.9.
- **Story Points:** 3
- **Dependencies:** Epic 1 (requires structured context)

**Story 2.3: Classification Merge Engine**

*As an on-call engineer (Riley), I want the system to safely merge the LLM and deterministic scanner results, so that I can trust the final 🟢/🟡/🔴 label.*

- **Acceptance Criteria:**
  - Implements the 5 hardcoded merge rules exactly as architected (e.g., if Scanner says 🔴, final is 🔴; if Scanner says 🟡 and LLM says 🟢, final is 🟡).
  - The final risk level is `Safe` ONLY if both the Scanner and the LLM agree it is safe.
- **Story Points:** 2
- **Dependencies:** 2.1, 2.2
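
The merge rules quoted in the acceptance criteria reduce to "take the riskier of the two verdicts, after escalating a low-confidence LLM answer." A sketch of that reduction, with the ⬜ Unknown tier folded into 🟡 for brevity; the actual 5-rule table lives in the architecture doc and is not reproduced here.

```rust
// Illustrative merge of scanner and LLM verdicts. Variant order matters:
// the derived ordering makes Safe < Caution < Dangerous.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub enum Risk { Safe, Caution, Dangerous }

fn escalate(r: Risk) -> Risk {
    match r {
        Risk::Safe => Risk::Caution,
        _ => Risk::Dangerous, // Caution escalates; Dangerous stays Dangerous
    }
}

pub fn merge(scanner: Risk, llm: Risk, llm_confidence: f64) -> Risk {
    // Low confidence: treat the LLM's answer as one tier riskier.
    let llm = if llm_confidence < 0.9 { escalate(llm) } else { llm };
    // Final verdict is the riskier of the two, so the result is Safe only
    // when both agree, and the scanner can never be downgraded by the LLM.
    if scanner >= llm { scanner } else { llm }
}

fn main() {
    assert_eq!(merge(Risk::Caution, Risk::Safe, 0.95), Risk::Caution); // scanner wins
    assert_eq!(merge(Risk::Safe, Risk::Safe, 0.95), Risk::Safe);      // both must agree
    assert_eq!(merge(Risk::Safe, Risk::Safe, 0.5), Risk::Caution);    // low confidence escalates
}
```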

**Story 2.4: Immutable Classification Audit**

*As an SRE manager (Jordan), I want every classification decision logged with full context, so that I have a forensic record of why the system made its choice.*

- **Acceptance Criteria:**
  - Emits a `runbook.classified` event to the PostgreSQL database for every step.
  - Log includes: scanner result (matched patterns), LLM reasoning, confidence scores, final classification, and merge rule applied.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 3: Execution Engine

**Description:** A state machine orchestrating the step-by-step runbook execution, enforcing the Trust Gradient (Level 0, 1, 2) at every transition. It never auto-executes 🟡 or 🔴 commands in V1. Handles timeouts, rollbacks, and tracking skipped steps.

**Dependencies:** Epics 1 & 2 (needs classified runbooks to execute)

**Technical Notes:**
- V1 is Copilot-only. State transitions must block 🔴 and prompt for 🟡.
- Engine must communicate with the Agent via gRPC over mTLS.
- Each step execution receives a unique ID to prevent duplicate deliveries.

### User Stories

**Story 3.1: Execution State Machine**

*As a system, I want a state machine to orchestrate step-by-step runbook execution, so that no two commands execute simultaneously and the Trust Gradient is enforced.*

- **Acceptance Criteria:**
  - Starts in `Pending` state upon alert match.
  - Progresses to `AutoExecute` ONLY if the step is 🟢 and trust level allows.
  - Transitions to `AwaitApproval` for 🟡/🔴 steps, blocking execution until approved.
  - Aborts or transitions to `ManualIntervention` on timeout (e.g., 60s for 🟢, 300s for 🔴).
- **Story Points:** 5
- **Dependencies:** 2.3
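
The Copilot-only transition policy above can be condensed into a small match. State and function names are illustrative, the real machine has more states (rollback, manual intervention) and per-risk timeout values:

```rust
// Condensed sketch of the V1 execution state machine. Copilot-only:
// 🟢 may auto-execute, 🟡/🔴 always block on human approval.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Risk { Green, Yellow, Red }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State { Pending, AutoExecute, AwaitApproval, Executing, Aborted }

/// Decide how to handle the next step based on its risk classification.
pub fn next_state(current: State, step_risk: Risk) -> State {
    match (current, step_risk) {
        (State::Pending, Risk::Green) => State::AutoExecute,
        (State::Pending, _) => State::AwaitApproval, // 🟡 and 🔴 both gate on a human
        (s, _) => s, // other transitions are driven by approval/timeout events
    }
}

/// A human approved the pending step.
pub fn on_approval(current: State) -> State {
    if current == State::AwaitApproval { State::Executing } else { current }
}

/// The per-risk timeout expired without a result.
pub fn on_timeout(current: State) -> State {
    match current {
        State::AwaitApproval | State::Executing => State::Aborted,
        s => s,
    }
}

fn main() {
    assert_eq!(next_state(State::Pending, Risk::Red), State::AwaitApproval);
    assert_eq!(on_approval(State::AwaitApproval), State::Executing);
    assert_eq!(on_timeout(State::Executing), State::Aborted);
}
```

Keeping the transition table a pure function of `(state, event)` is what makes the "never auto-execute 🟡/🔴" guarantee testable in isolation, with no agent or Slack involved.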

**Story 3.2: gRPC Agent Communication Protocol**

*As a system, I want to communicate securely with the dd0c Agent in the customer VPC, so that commands are executed safely without inbound firewall rules.*

- **Acceptance Criteria:**
  - Outbound-only gRPC streaming connection initiated by the Agent.
  - Engine sends `ExecuteStep` payload (command, timeout, risk level, environment variables).
  - Agent streams `StepOutput` (stdout/stderr) back to the Engine.
  - Agent returns `StepResult` (exit code, duration, stdout/stderr hashes) on completion.
- **Story Points:** 5
- **Dependencies:** 3.1

**Story 3.3: Rollback Integration**

*As an on-call engineer (Riley), I want every state-changing step to record its inverse command, so that I can undo it with one click if it fails.*

- **Acceptance Criteria:**
  - If a step fails, the Engine transitions to `RollbackAvailable` and awaits human approval.
  - Engine stores the rollback command before executing the forward command.
  - Executing rollback triggers an independent execution step and returns the state machine to `StepReady` or `ManualIntervention`.
- **Story Points:** 3
- **Dependencies:** 3.1

**Story 3.4: Divergence Analysis**

*As an SRE manager (Jordan), I want the system to track what the engineer actually did vs. what the runbook prescribed, so that runbooks get better over time.*

- **Acceptance Criteria:**
  - Post-execution analyzer compares prescribed steps with executed/skipped steps and any unlisted commands detected by the agent.
  - Flags skipped steps, modified commands, and unlisted actions.
  - Emits a `divergence.detected` event with suggested runbook updates.
- **Story Points:** 3
- **Dependencies:** 3.1
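
The prescribed-versus-executed comparison at the heart of Story 3.4 is a set difference. A sketch that reports skipped and unlisted commands; modified-command detection, which needs fuzzier matching than exact equality, is omitted, and all names are illustrative:

```rust
use std::collections::HashSet;

/// Result of comparing a runbook against what was actually run.
pub struct Divergence {
    pub skipped: Vec<String>,  // prescribed but never executed
    pub unlisted: Vec<String>, // executed but not in the runbook
}

pub fn diverge(prescribed: &[&str], executed: &[&str]) -> Divergence {
    let ran: HashSet<&str> = executed.iter().copied().collect();
    let listed: HashSet<&str> = prescribed.iter().copied().collect();
    Divergence {
        // Preserve runbook order for skipped steps so the report reads naturally.
        skipped: prescribed.iter().filter(|c| !ran.contains(*c)).map(|c| c.to_string()).collect(),
        unlisted: executed.iter().filter(|c| !listed.contains(*c)).map(|c| c.to_string()).collect(),
    }
}

fn main() {
    let d = diverge(
        &["check lag", "restart worker", "verify queue"],
        &["check lag", "flush cache", "verify queue"],
    );
    assert_eq!(d.skipped, vec!["restart worker"]);
    assert_eq!(d.unlisted, vec!["flush cache"]);
}
```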

## Epic 4: Slack Bot Copilot

**Description:** The primary 3am interface for on-call engineers. Guides the engineer step-by-step through an active incident, enforcing the Trust Gradient via interactive buttons for 🟡 actions and explicit typed confirmations for 🔴 actions.

**Dependencies:** Epic 3 (needs execution state to drive the UI)

**Technical Notes:**
- Uses Rust + Slack Bolt SDK in Socket Mode (no inbound webhooks).
- Must use Slack Block Kit for formatting.
- Respects Slack's 1 message/sec/channel rate limit by batching rapid updates.
- Uses dedicated threads to keep main channels clean.

### User Stories

**Story 4.1: Alert Matching & Notification**

*As an on-call engineer (Riley), I want a Slack message with the relevant runbook when an alert fires, so that I don't have to search for it.*

- **Acceptance Criteria:**
  - Integrates with PagerDuty/OpsGenie webhook payloads.
  - Matches alert context (service, region, keywords) to the most relevant runbook.
  - Posts a `🔔 Runbook matched` message with a [▶ Start Copilot] button in the incident channel.
- **Story Points:** 3
- **Dependencies:** None

**Story 4.2: Step-by-Step Interactive UI**

*As an on-call engineer (Riley), I want the bot to guide me through the runbook step-by-step in a thread, so that I can focus on one command at a time.*

- **Acceptance Criteria:**
  - Tapping [▶ Start Copilot] opens a thread and triggers the Execution Engine.
  - Shows 🟢 steps auto-executing with live stdout snippets.
  - Shows 🟡/🔴 steps awaiting approval with command details and rollback.
  - Includes [⏭ Skip] and [🛑 Abort] buttons for every step.
- **Story Points:** 5
- **Dependencies:** 3.1, 4.1

**Story 4.3: Risk-Aware Approval Gates**

*As a system, I want to block dangerous commands until explicit confirmation is provided, so that accidental misclicks don't cause outages.*

- **Acceptance Criteria:**
  - 🟡 steps require clicking [✅ Approve] or [✏️ Edit].
  - 🔴 steps require opening a text input modal and typing the resource name exactly.
  - Cannot bulk-approve steps. Each step must be individually gated.
  - Approver's Slack identity is captured and passed to the Execution Engine.
- **Story Points:** 3
- **Dependencies:** 4.2

**Story 4.4: Real-Time Output Streaming**

*As an on-call engineer (Riley), I want to see the execution output of commands in real-time, so that I know what's happening.*

- **Acceptance Criteria:**
  - Slack messages update dynamically with command stdout/stderr.
  - Batches rapid output updates to respect Slack rate limits.
  - Truncates long outputs and provides a link to the full log in the dashboard.
- **Story Points:** 2
- **Dependencies:** 4.2
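
The batching requirement in the acceptance criteria can be met with a simple time-gated buffer. A sketch, assuming a 1-second window per the Slack rate-limit note in the technical notes; names are illustrative, and the real bot would also truncate output and link to the dashboard:

```rust
use std::time::{Duration, Instant};

/// Coalesces rapid stdout chunks so the bot posts at most one Slack
/// message-update per second per channel.
pub struct Batcher {
    buffer: String,
    last_flush: Option<Instant>,
    min_interval: Duration,
}

impl Batcher {
    pub fn new() -> Self {
        Batcher {
            buffer: String::new(),
            last_flush: None,
            min_interval: Duration::from_secs(1),
        }
    }

    /// Append a stdout chunk. Returns Some(batched text) when enough time has
    /// passed to post another update; otherwise the chunk is buffered.
    pub fn push(&mut self, chunk: &str) -> Option<String> {
        self.buffer.push_str(chunk);
        let due = match self.last_flush {
            None => true, // first update posts immediately
            Some(t) => t.elapsed() >= self.min_interval,
        };
        if due {
            self.last_flush = Some(Instant::now());
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}

fn main() {
    let mut b = Batcher::new();
    assert_eq!(b.push("line 1\n").as_deref(), Some("line 1\n"));
    assert_eq!(b.push("line 2\n"), None); // buffered until the window elapses
}
```

A production version would also flush on a timer so a trailing buffered chunk is not stranded when output stops; that plumbing is left out of the sketch.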

## Epic 5: Audit Trail

**Description:** The compliance backbone and forensic record. An append-only, immutable, partitioned PostgreSQL log that tracks every runbook parse, classification, approval, execution, and rollback.

**Dependencies:** Epics 2, 3, 4 (needs events to log)

**Technical Notes:**
- PostgreSQL partitioned table (by month) for query performance.
- Application role must have `INSERT` and `SELECT` only. No `UPDATE` or `DELETE` grants.
- Row-Level Security (RLS) enforces tenant isolation at the database level.
- V1 includes a basic PDF/CSV export for SOC 2 readiness.

### User Stories

**Story 5.1: Append-Only Audit Schema**

*As a system, I want a partitioned database table to store all execution events immutably, so that no one can tamper with the forensic record.*

- **Acceptance Criteria:**
  - Creates a PostgreSQL table partitioned by month.
  - Revokes `UPDATE` and `DELETE` grants from the application role.
  - Schema captures `tenant_id`, `event_type`, `execution_id`, `actor_id`, `event_data` (JSONB), and `created_at`.
- **Story Points:** 2
- **Dependencies:** None

**Story 5.2: Event Ingestion Pipeline**

*As an SRE manager (Jordan), I want every action recorded in real-time, so that I know exactly who did what during an incident.*

- **Acceptance Criteria:**
  - Ingests events from Parser, Classifier, Execution Engine, and Slack Bot.
  - Logs `runbook.parsed`, `runbook.classified`, `step.auto_executed`, `step.approved`, `step.executed`, etc., exactly as architected.
  - Captures the exact command executed, exit code, and stdout/stderr hashes.
- **Story Points:** 3
- **Dependencies:** 5.1
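
The append-only discipline of Stories 5.1 and 5.2 can be mirrored in the application layer: a log type that exposes insert and tenant-scoped read, and nothing else. Type and field names are illustrative, and the in-memory `Vec` stands in for the partitioned PostgreSQL table:

```rust
// Illustrative event record; mirrors the schema fields in Story 5.1.
#[derive(Debug, Clone, PartialEq)]
pub enum EventType {
    RunbookParsed,
    RunbookClassified,
    StepAutoExecuted,
    StepApproved,
    StepExecuted,
}

#[derive(Debug, Clone)]
pub struct AuditEvent {
    pub tenant_id: u64,
    pub event_type: EventType,
    pub execution_id: u64,
    pub actor_id: String,   // e.g. the approver's Slack identity
    pub event_data: String, // JSON payload (JSONB in the real schema)
}

/// Append-only wrapper: no update or delete methods, by design, mirroring
/// the INSERT/SELECT-only database role.
pub struct AuditLog {
    events: Vec<AuditEvent>,
}

impl AuditLog {
    pub fn new() -> Self {
        AuditLog { events: Vec::new() }
    }
    pub fn append(&mut self, e: AuditEvent) {
        self.events.push(e);
    }
    /// Tenant-scoped read, echoing the RLS behavior in Story 5.4: other
    /// tenants see zero rows, not an error.
    pub fn for_tenant(&self, tenant_id: u64) -> Vec<&AuditEvent> {
        self.events.iter().filter(|e| e.tenant_id == tenant_id).collect()
    }
}

fn main() {
    let mut log = AuditLog::new();
    log.append(AuditEvent {
        tenant_id: 1,
        event_type: EventType::StepApproved,
        execution_id: 42,
        actor_id: "U123".into(),
        event_data: "{}".into(),
    });
    assert_eq!(log.for_tenant(1).len(), 1);
    assert_eq!(log.for_tenant(2).len(), 0);
}
```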

**Story 5.3: Compliance Export**

*As an SRE manager (Jordan), I want to export timestamped execution logs, so that I can provide evidence to SOC 2 auditors.*

- **Acceptance Criteria:**
  - Generates a PDF or CSV of an execution run, including the approval chain and audit trail.
  - Includes risk classifications and any divergence/modifications made by the engineer.
  - Links to S3 for full stdout/stderr logs if needed.
- **Story Points:** 3
- **Dependencies:** 5.2

**Story 5.4: Multi-Tenant Data Isolation (RLS)**

*As a system, I want to ensure no tenant can see another tenant's data, so that security boundaries are guaranteed at the database level.*

- **Acceptance Criteria:**
  - Enables Row-Level Security (RLS) on all tenant-scoped tables (`runbooks`, `executions`, `audit_events`).
  - API middleware sets `app.current_tenant_id` session variable on every database connection.
  - Cross-tenant queries return zero rows, not an error.
- **Story Points:** 2
- **Dependencies:** 5.1

## Epic 6: Dashboard API

**Description:** The control plane REST API served by the dd0c API Gateway. Handles runbook CRUD, parsing endpoints, execution history, and integration with the Alert-Runbook Matcher. Provides the backend for the Dashboard UI and external integrations.

**Dependencies:** Epics 1, 3, 5

**Technical Notes:**
- Served via Axum (Rust) and secured with JWT.
- Integrates with the shared dd0c API Gateway.
- Enforces tenant isolation (RLS) on every request.

### User Stories

**Story 6.1: Runbook CRUD & Parsing Endpoints**

*As a runbook author (Morgan), I want to create, read, update, and soft-delete runbooks via a REST API, so that I can manage my team's active runbooks.*

- **Acceptance Criteria:**
  - Implements `POST /runbooks` (paste raw text → auto-parse), `GET /runbooks`, `GET /runbooks/:id/versions`, `PUT /runbooks/:id`, `DELETE /runbooks/:id`.
  - Implements `POST /runbooks/parse-preview` for the 5-second wow moment (parses without saving).
  - Returns 422 if parsing fails.
- **Story Points:** 3
- **Dependencies:** 1.2

**Story 6.2: Execution History & Status Queries**

*As an SRE manager (Jordan), I want to view active and completed execution runs, so that I can monitor incident response times and skipped steps.*

- **Acceptance Criteria:**
  - Implements `POST /executions` to start copilot manually.
  - Implements `GET /executions` and `GET /executions/:id` to retrieve status, steps, exit codes, and durations.
  - Implements `GET /executions/:id/divergence` to get the post-execution analysis (skipped steps, modified commands).
- **Story Points:** 2
- **Dependencies:** 3.1, 5.2

**Story 6.3: Alert-Runbook Matcher Integration**

*As a system, I want to match incoming webhooks (e.g., PagerDuty) to the correct runbook, so that the Slack bot can post the runbook immediately.*

- **Acceptance Criteria:**
  - Implements `POST /webhooks/pagerduty`, `POST /webhooks/opsgenie`, and `POST /webhooks/dd0c-alert`.
  - Uses keyword + metadata matching (and optionally pgvector similarity) to find the best runbook for the alert payload.
  - Generates the `alert_context` and triggers the Slack bot notification (Epic 4).
- **Story Points:** 3
- **Dependencies:** 6.1
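
The keyword + metadata matching in the criteria above amounts to scoring word overlap between the alert payload and each runbook's title and tags, and refusing zero-overlap matches. An illustrative sketch; the pgvector similarity leg is omitted and all names are assumptions:

```rust
/// Minimal runbook metadata used for matching.
pub struct Runbook<'a> {
    pub title: &'a str,
    pub tags: &'a [&'a str],
}

// Lowercase word split on non-alphanumeric boundaries.
fn words(s: &str) -> Vec<String> {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Score each runbook by overlap with the alert text; best score wins.
pub fn best_match<'a>(alert_text: &str, runbooks: &'a [Runbook<'a>]) -> Option<&'a Runbook<'a>> {
    let alert_words = words(alert_text);
    runbooks
        .iter()
        .map(|rb| {
            let mut vocab = words(rb.title);
            vocab.extend(rb.tags.iter().map(|t| t.to_lowercase()));
            let score = alert_words.iter().filter(|w| vocab.contains(w)).count();
            (score, rb)
        })
        .filter(|(score, _)| *score > 0) // never match on zero overlap
        .max_by_key(|(score, _)| *score)
        .map(|(_, rb)| rb)
}

fn main() {
    let runbooks = [
        Runbook { title: "RDS replica lag", tags: &["rds", "database"] },
        Runbook { title: "EKS pod crashloop", tags: &["eks", "kubernetes"] },
    ];
    let m = best_match("ALERT: prod database replica lag high", &runbooks).unwrap();
    assert_eq!(m.title, "RDS replica lag");
}
```

The zero-overlap filter matters more than the scoring: posting a wrong runbook into an incident channel is worse than posting none, so the matcher should return `None` rather than guess.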

**Story 6.4: Classification Query Endpoints**

*As a runbook author (Morgan), I want to query the classification engine directly, so that I can test commands before publishing a runbook.*

- **Acceptance Criteria:**
  - Implements `POST /classify` for testing/debugging.
  - Implements `GET /classifications/:step_id` to retrieve full classification details.
  - Rate-limited to 30 req/min per tenant.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 7: Dashboard UI

**Description:** A React Single-Page Application (SPA) for runbook authors and managers to view runbooks, past executions, and risk classifications. The primary interface for onboarding, reviewing runbooks, and analyzing post-incident data.

**Dependencies:** Epic 6 (needs APIs to consume)

**Technical Notes:**

- React SPA, integrates with the shared dd0c portal.
- Displays the 5-second "wow moment" parse preview.
- Must visually distinguish 🟢 Safe, 🟡 Caution, and 🔴 Dangerous commands.

### User Stories

**Story 7.1: Runbook Paste & Preview UI**

*As a runbook author (Morgan), I want to paste raw text and instantly see the parsed steps, so that I can verify the structure before saving.*

- **Acceptance Criteria:**
  - Large text area for pasting raw runbook text.
  - Calls `POST /runbooks/parse-preview` and displays the structured steps.
  - Highlights variables, prerequisites, and ambiguities.
  - Allows editing the raw text to trigger a re-parse.
- **Story Points:** 3
- **Dependencies:** 6.1

**Story 7.2: Execution Timeline & Divergence View**

*As an SRE manager (Jordan), I want to view a timeline of an incident's execution, so that I can see what was skipped or modified.*

- **Acceptance Criteria:**
  - Displays the execution run with a visual timeline of steps.
  - Shows who approved each step, the exact command executed, and the exit code.
  - Highlights divergence (skipped steps, modified commands, unlisted actions).
  - Provides an "Apply Updates" button to update the runbook based on divergence.
- **Story Points:** 5
- **Dependencies:** 6.2

**Story 7.3: Trust Level & Risk Visualization**

*As a runbook author (Morgan), I want to clearly see the risk level of each step in my runbook, so that I know what requires approval.*

- **Acceptance Criteria:**
  - Color-codes steps: 🟢 Green (Safe), 🟡 Yellow (Caution), 🔴 Red (Dangerous).
  - Clicking a risk badge shows the classification reasoning (scanner rules matched, LLM explanation).
  - Displays the runbook's overall Trust Level (0, 1, 2) and allows authorized users to change it (V1 max = 2).
- **Story Points:** 3
- **Dependencies:** 7.1
**Story 7.4: Basic Health & MTTR Dashboard**

*As an SRE manager (Jordan), I want a high-level view of my team's runbooks and execution stats, so that I can measure the value of dd0c/run.*

- **Acceptance Criteria:**
  - Displays a list of runbooks, their coverage, and average staleness.
  - Shows MTTR (Mean Time To Resolution) for incidents handled via Copilot.
  - Displays the total number of Copilot runs and skipped steps per month.
- **Story Points:** 2
- **Dependencies:** 6.2
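
The MTTR figure on this dashboard reduces to a small aggregation over execution records. The record shape and status values below are assumptions for illustration (the real schema comes from Story 6.2's execution API); the point is that only completed runs count toward the mean.

```python
from statistics import mean

def mttr_minutes(executions):
    """Mean Time To Resolution over completed Copilot runs.

    Averages (resolved_at - started_at) in minutes across executions
    marked "completed"; timestamps are epoch seconds. In-flight runs
    are excluded so they cannot drag the metric down mid-incident.
    """
    durations = [
        (e["resolved_at"] - e["started_at"]) / 60
        for e in executions
        if e.get("status") == "completed"
    ]
    return round(mean(durations), 1) if durations else 0.0
```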
## Epic 8: Infrastructure & DevOps

**Description:** The foundational cloud infrastructure and CI/CD pipelines required to run the dd0c/run SaaS platform and build the customer-facing Agent. Ensures strict security isolation (mTLS), observability, and zero-downtime ECS Fargate deployments.

**Dependencies:** None (can be built in parallel with Epic 1).

**Technical Notes:**

- All infrastructure defined as code (Terraform) and deployed to AWS.
- ECS Fargate for all core services (Parser, Classifier, Engine, APIs).
- Agent binary built for Linux x86_64 and ARM64 via GitHub Actions.
- mTLS certificate generation and rotation pipeline for Agent authentication.

### User Stories

**Story 8.1: Core AWS Infrastructure Provisioning**

*As a system administrator (Brian), I want the base AWS infrastructure provisioned via Terraform, so that the services have a secure, scalable environment to run in.*

- **Acceptance Criteria:**
  - Terraform configures VPC, public/private subnets, NAT Gateway, and ALB.
  - Provisions RDS PostgreSQL 16 (Multi-AZ) with pgvector and S3 buckets for audit/logs.
  - Provisions ECS Fargate cluster and SQS queues.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.2: CI/CD Pipeline & Agent Build**

*As a developer (Brian), I want automated build and deployment pipelines, so that I can ship code to production safely and distribute the Agent binary.*

- **Acceptance Criteria:**
  - GitHub Actions pipeline runs `cargo clippy`, `cargo test`, and the "canary test suite" on every PR.
  - Merges to `main` auto-deploy ECS Fargate services with zero downtime.
  - Release tags compile the Agent binary (x86_64, aarch64), sign it, and publish to GitHub Releases / S3.
- **Story Points:** 3
- **Dependencies:** None
**Story 8.3: Agent mTLS & gRPC Setup**

*As a system, I want secure outbound-only gRPC communication between the Agent and the Execution Engine, so that commands are transmitted safely without VPNs or inbound firewall rules.*

- **Acceptance Criteria:**
  - Internal CA issues tenant-scoped mTLS certificates.
  - API Gateway terminates mTLS and validates the tenant ID in the certificate against the request.
  - Agent successfully establishes a persistent, outbound-only gRPC stream to the Engine.
- **Story Points:** 5
- **Dependencies:** 8.1
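
The gateway-side tenant check in the second acceptance criterion is deliberately trivial, and that is the point: after mTLS termination proves *who* the Agent is, authorization reduces to one strict comparison. The function below is an illustrative sketch (field extraction from the certificate is assumed to happen upstream), not the real gateway code.

```python
def authorize_stream(cert_tenant_id, request_tenant_id):
    """Post-mTLS tenant check at the API Gateway (illustrative).

    The tenant ID embedded in the client certificate must exactly match
    the tenant the request claims to act for; a missing value on either
    side, or any mismatch, rejects the gRPC stream. Fail closed.
    """
    if not cert_tenant_id or not request_tenant_id:
        return False
    return cert_tenant_id == request_tenant_id
```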
**Story 8.4: Observability & Party Mode Alerting**

*As the solo on-call engineer (Brian), I want comprehensive monitoring and alerting, so that I am woken up if a safety invariant is violated (Party Mode).*

- **Acceptance Criteria:**
  - OpenTelemetry (OTEL) collector deployed and routing metrics/traces to Grafana Cloud.
  - PagerDuty integration configured for P1 alerts.
  - Alert fires immediately if the "Party Mode" database flag is set or if Agent heartbeats drop unexpectedly.
- **Story Points:** 2
- **Dependencies:** 8.1
## Epic 9: Onboarding & PLG

**Description:** Product-Led Growth (PLG) and frictionless onboarding. Focuses on the "5-second wow moment" where a user can paste a runbook and instantly see the parsed, classified steps without installing an agent, plus a self-serve tier to capture leads.

**Dependencies:** Epics 1, 6, 7.

**Technical Notes:**

- The demo uses the `POST /runbooks/parse-preview` endpoint (does not persist data).
- Tenant provisioning must be fully automated upon signup.
- Free tier enforced via database limits (no Stripe required initially for free tier).

### User Stories

**Story 9.1: The 5-Second Interactive Demo**

*As a prospective customer, I want to paste my messy runbook into the marketing website and see it parsed instantly, so that I immediately understand the product's value without signing up.*

- **Acceptance Criteria:**
  - Landing page features a prominent text area for pasting a runbook.
  - Submitting calls the unauthenticated parse-preview endpoint (rate-limited by IP).
  - Renders the structured steps with 🟢/🟡/🔴 risk badges dynamically.
- **Story Points:** 3
- **Dependencies:** 6.1, 7.1
**Story 9.2: Self-Serve Signup & Tenant Provisioning**

*As a new user, I want to create an account and get my tenant provisioned instantly, so that I can start using the product without talking to sales.*

- **Acceptance Criteria:**
  - OAuth/Email signup flow creates a User and a Tenant record in PostgreSQL.
  - Automated worker initializes the tenant workspace, generates the initial mTLS certs, and provides the Agent installation command.
  - Limits the free tier to 5 active runbooks and 50 executions/month.
- **Story Points:** 3
- **Dependencies:** 8.1
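
Because the free tier is enforced via database limits rather than a billing system, the enforcement logic is just a counter comparison at write time. A minimal sketch, assuming hypothetical counter fields (`active_runbooks`, `executions_this_month`) maintained on the tenant record:

```python
FREE_TIER = {"max_active_runbooks": 5, "max_executions_per_month": 50}

def check_free_tier(tenant):
    """Return the list of free-tier limits the tenant has already hit.

    Called before creating a runbook or starting an execution; a
    non-empty result blocks the write and prompts an upgrade. No Stripe
    integration is needed because enforcement lives in the data layer.
    """
    violations = []
    if tenant["active_runbooks"] >= FREE_TIER["max_active_runbooks"]:
        violations.append("runbook_limit")
    if tenant["executions_this_month"] >= FREE_TIER["max_executions_per_month"]:
        violations.append("execution_limit")
    return violations
```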
**Story 9.3: Agent Installation Wizard**

*As a new user, I want a simple copy-paste command to install the read-only Agent in my infrastructure, so that I can execute my first runbook within 10 minutes of signup.*

- **Acceptance Criteria:**
  - Dashboard provides a dynamically generated `curl | bash` or `kubectl apply` snippet.
  - Snippet includes the tenant-specific mTLS certificate and Agent binary download.
  - Dashboard shows a live "Waiting for Agent heartbeat..." state that turns green when the Agent connects.
- **Story Points:** 3
- **Dependencies:** 8.3, 9.2

**Story 9.4: First Runbook Copilot Walkthrough**

*As a new user, I want an interactive guide for my first runbook execution, so that I learn how the Copilot approval flow and Trust Gradient work safely.*

- **Acceptance Criteria:**
  - System automatically provisions a "Hello World" runbook containing safe read-only commands (e.g., `kubectl get nodes`).
  - In-app tooltip guides the user to trigger the runbook via the Slack integration.
  - Completing this first execution unlocks the "Runbook Author" badge/status.
- **Story Points:** 2
- **Dependencies:** 4.2, 9.3

---
## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/run adheres to the 5 Transparent Factory tenets. Of all 6 dd0c products, runbook automation carries the highest governance burden — it executes commands in production infrastructure. Every tenet here is load-bearing.
### Story 10.1: Atomic Flagging — Feature Flags for Execution Behaviors

**As a** solo founder, **I want** every new runbook execution capability, command parser, and auto-remediation behavior behind a feature flag (default: off), **so that** a bad parser change never executes the wrong command in a customer's production environment.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the execution engine. V1: JSON file provider.
- All flags evaluate locally — no network calls during runbook execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged execution path fails >2 times in any 10-minute window, the flag auto-disables and all in-flight executions using that path are paused (not killed).
- Flags required for: new command parsers, conditional logic handlers, new infrastructure targets (K8s, AWS, bare metal), auto-retry behaviors, approval bypass rules.
- **Hard rule:** any flag that controls execution of destructive commands (`rm`, `delete`, `terminate`, `drop`) requires a 48-hour bake time at 10% rollout before full enablement.

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine)
**Technical Notes:**

- Circuit breaker threshold is intentionally low (2 failures) because production execution failures are high-severity.
- Pause vs. kill: paused executions hold state and can be resumed after review. Killing mid-execution risks partial state.
- 48-hour bake for destructive flags: enforced via flag metadata `destructive: true` + CI check.
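
The circuit-breaker rule above (">2 failures in any 10-minute sliding window trips the flag") can be sketched directly. This is an illustrative model, not the engine's implementation; the class name and API are assumptions, and the caller is responsible for pausing in-flight executions when the breaker trips.

```python
import time

class FlagCircuitBreaker:
    """Auto-disable a feature flag after repeated execution failures.

    Implements the story's rule: more than 2 failures inside any
    10-minute sliding window disables the flag. Callers then pause
    (not kill) in-flight executions on the flagged path.
    """

    WINDOW_S = 600     # 10-minute sliding window
    MAX_FAILURES = 2   # intentionally low: production failures are high-severity

    def __init__(self):
        self.failures = []   # timestamps of recent failures
        self.enabled = True

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        # Drop failures that have aged out of the window, then count this one.
        self.failures = [t for t in self.failures if now - t < self.WINDOW_S]
        self.failures.append(now)
        if len(self.failures) > self.MAX_FAILURES:
            self.enabled = False  # trip: flag auto-disables
```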
### Story 10.2: Elastic Schema — Additive-Only for Execution Audit Trail

**As a** solo founder, **I want** all execution log and runbook state schema changes to be strictly additive, **so that** the forensic audit trail of every production command ever executed is never corrupted or lost.

**Acceptance Criteria:**

- CI rejects any migration that removes, renames, or changes the type of existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes.
- All execution log parsers ignore unknown fields.
- Dual-write during migration windows within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).
- **Hard rule:** execution audit records are immutable. No `UPDATE` or `DELETE` statements are permitted on the `execution_log` table/collection. Ever.

**Estimate:** 3 points
**Dependencies:** Epic 4 (Audit & Forensics)
**Technical Notes:**

- Execution logs are append-only by design and by policy. Use DynamoDB with no `UpdateItem` IAM permission on the audit table.
- For runbook definition versioning: every edit creates a new version record. Old versions are never mutated.
- Schema for parsed runbook steps: version the parser output format. V1 steps and V2 steps coexist.
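
The CI gate described in the acceptance criteria is a static scan over migration SQL. A minimal sketch, assuming a regex-based check (a production gate would parse the SQL properly; these patterns are illustrative and will miss creative formatting):

```python
import re

# Patterns the additive-only policy forbids, per the story's hard rules.
FORBIDDEN = [
    r"\bDROP\s+COLUMN\b",                      # no column removal
    r"\bRENAME\s+COLUMN\b",                    # no renames
    r"\bALTER\s+COLUMN\b.*\bTYPE\b",           # no type changes
    r"\b(UPDATE|DELETE)\b[^;]*\bexecution_log\b",  # audit trail is immutable
]

def reject_migration(sql):
    """CI gate: return the additive-only rules a migration violates.

    An empty list means the migration is purely additive and may merge.
    """
    return [p for p in FORBIDDEN if re.search(p, sql, re.IGNORECASE)]
```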
### Story 10.3: Cognitive Durability — Decision Logs for Execution Logic

**As a** future maintainer, **I want** every change to runbook parsing, step classification, or execution logic accompanied by a `decision_log.json`, **so that** I understand why the system interpreted step 3 as "destructive" and required approval.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/parser/`, `pkg/classifier/`, `pkg/executor/`, or `pkg/approval/`.
- Cyclomatic complexity cap of 10 enforced via linter. PRs exceeding this are blocked.
- Decision logs live in `docs/decisions/`.
- **Additional requirement:** every change to the "destructive command" classification list requires a decision log entry explaining why the command was added/removed.

**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**

- The destructive command list is the most sensitive config in the entire product. Changes must be documented, reviewed, and logged.
- Parser logic decision logs should include sample runbook snippets showing before/after interpretation.
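
The CI requirement reduces to validating each `decision_log.json` entry against the six-field schema above. A minimal validator sketch (the CI wiring around it is assumed):

```python
import json

# The six required fields from the story's schema.
REQUIRED = {"prompt", "reasoning", "alternatives_considered",
            "confidence", "timestamp", "author"}

def validate_decision_log(raw):
    """Parse a decision_log.json entry and enforce the required schema.

    Missing fields raise ValueError, which the CI job surfaces as a
    blocking check on the PR.
    """
    entry = json.loads(raw)
    missing = REQUIRED - entry.keys()
    if missing:
        raise ValueError(f"decision log missing fields: {sorted(missing)}")
    return entry
```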
### Story 10.4: Semantic Observability — AI Reasoning Spans on Classification & Execution

**As an** SRE manager investigating an incident caused by automated execution, **I want** every runbook classification and execution decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I have a complete forensic record of what the system decided, why, and what it executed.

**Acceptance Criteria:**

- Every runbook execution creates a parent `runbook_execution` span. Child spans for each step: `step_classification`, `step_approval_check`, `step_execution`.
- `step_classification` attributes: `step.text_hash`, `step.classified_as` (safe/destructive/ambiguous), `step.confidence_score`, `step.alternatives_considered`.
- `step_execution` attributes: `step.command_hash`, `step.target_host_hash`, `step.exit_code`, `step.duration_ms`, `step.output_truncated` (first 500 chars, hashed).
- If AI-assisted classification: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- `step_approval_check` attributes: `step.approval_required` (bool), `step.approval_source` (human/policy/auto), `step.approval_latency_ms`.
- No PII. No raw commands in spans — everything hashed. Full commands only in the encrypted audit log.

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine), Epic 4 (Audit & Forensics)
**Technical Notes:**

- This is the highest-effort observability story across all 6 products because execution tracing is forensic-grade.
- Span hierarchy: `runbook_execution` → `step_N` → `[classification, approval, execution]`. Three levels deep.
- Output hashing: SHA-256 of command output for correlation. Raw output only in the encrypted audit store.
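
The "everything hashed, nothing raw" rule for `step_execution` attributes can be sketched as a pure function. This is an illustration of the hashing discipline, not the engine's OTel instrumentation; the attribute keys come from the acceptance criteria above.

```python
import hashlib

def span_execution_attrs(command, output, exit_code, duration_ms):
    """Build `step_execution` span attributes with no raw command data.

    Command and output are SHA-256 hashed, and only the first 500 chars
    of output feed the hash, so spans stay PII-free while remaining
    correlatable against the encrypted audit log (which hashes the same
    truncated output).
    """
    def sha(s):
        return hashlib.sha256(s.encode()).hexdigest()

    return {
        "step.command_hash": sha(command),
        "step.output_truncated": sha(output[:500]),  # first 500 chars, hashed
        "step.exit_code": exit_code,
        "step.duration_ms": duration_ms,
    }
```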
### Story 10.5: Configurable Autonomy — Governance for Production Execution

**As a** solo founder, **I want** a `policy.json` that strictly controls what dd0c/run is allowed to execute autonomously vs. what requires human approval, **so that** no automated system ever runs a destructive command without explicit authorization.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (all steps require human approval) or `audit` (safe steps auto-execute, destructive steps require approval).
- Default for ALL customers: `strict`. There is no "fully autonomous" mode for dd0c/run. Even in `audit` mode, destructive commands always require approval.
- `panic_mode`: when true, ALL execution halts immediately. In-flight steps are paused (not killed). All pending approvals are revoked. System enters read-only forensic mode.
- Governance drift monitoring: weekly report of auto-executed vs. human-approved steps. If the auto-execution ratio exceeds the configured threshold, the system auto-downgrades to `strict`.
- Per-customer, per-runbook governance overrides. Customers can lock specific runbooks to `strict` regardless of system mode.
- **Hard rule:** `panic_mode` must be triggerable in <1 second via API call, CLI command, Slack command, OR a physical hardware button (webhook endpoint).

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine), Epic 5 (Approval Workflow)
**Technical Notes:**

- No "full auto" mode is a deliberate product decision. dd0c/run assists humans; it doesn't replace them for destructive actions.
- Panic mode implementation: Redis key `dd0c:panic` checked at the top of every execution loop iteration. Webhook endpoint `POST /admin/panic` sets the key and broadcasts a halt signal via pub/sub.
- The <1 second requirement means panic cannot depend on a DB write — Redis pub/sub only.
- Governance drift: a cron job queries execution logs weekly. Threshold configurable per-org (default: 70% auto-execution triggers downgrade).
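
The weekly governance-drift check is a ratio computation with a downgrade rule. A minimal sketch of the logic the cron job would run, using the story's 70% default threshold (the function name and signature are illustrative):

```python
def governance_mode_after_review(auto_executed, human_approved, threshold=0.70):
    """Weekly governance-drift check from the story.

    If the share of auto-executed steps exceeds the configured threshold
    (default 70%), the tenant is downgraded to `strict`; otherwise it
    stays in `audit` mode. With no executions at all there is no drift
    evidence, so the current mode is kept.
    """
    total = auto_executed + human_approved
    if total == 0:
        return "audit"
    return "strict" if auto_executed / total > threshold else "audit"
```

Downgrading is one-way by design: the system may tighten governance automatically, but loosening it back to `audit` requires a human decision.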
### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 5 |
| 10.5 | Configurable Autonomy | 5 |
| **Total** | | **20** |
products/06-runbook-automation/innovation-strategy/session.md
# dd0c/run — Innovation Strategy & Disruption Verdict

**Strategist:** Victor, Disruptive Innovation Oracle
**Date:** 2026-02-28

## Section 1: MARKET LANDSCAPE

Let us dispense with the industry hallucinations. The current runbook automation market is a museum of failed promises. We are selling to teams whose "documentation" is a stale Confluence page that actively sabotages their incident response.

**The Incumbent Graveyard (Current State):**

* **Rundeck:** A 2015-era job scheduler masquerading as a modern runbook engine. It requires Java, a database, and YAML definitions. It is the definition of toil.
* **Transposit & Shoreline:** Over-engineered orchestration platforms built for the 1% of engineering orgs that have the bandwidth to learn yet another proprietary DSL. They built jetpacks for people who are currently drowning.
* **Rootly:** Excellent at incident management (the bureaucracy of the outage), but they stop at the boundary of execution. They document the fire; they don't hold the hose.

**Adjacent Markets (The Collision Course):**

* **Incident Management (PagerDuty, OpsGenie):** They own the alert routing but treat the actual resolution as an "exercise left to the reader." Their native automation add-ons are extortionately priced bolt-ons.
* **AIOps:** A buzzword that has historically meant "we will group your 5,000 meaningless alerts into 50 slightly-less-meaningless clusters."
* **Workflow Automation (Zapier/Make for DevOps):** Too generic. They don't understand infrastructure state, blast radius, or the concept of a 3 AM rollback.

**Key Macro Trends (2026):**

1. **Shift from Documentation to Execution:** The era of static text is dead. If a runbook cannot execute its own read-only steps, it is legacy technical debt.
2. **LLM-Powered Ops (The Parsing Revolution):** We finally have models capable of translating ambiguous human intent ("bounce the connection pool") into deterministic shell commands with high reliability.
3. **Agentic Automation:** We are transitioning from "human-in-the-loop" to "human-on-the-loop." The trust gradient is shifting.
## Section 2: DISRUPTION ANALYSIS

The incumbents have built moats out of complexity. They mistake the density of their UI for enterprise value.

**Vulnerabilities of the Old Guard:**

* **The Complexity Tax:** Rundeck and Shoreline require upfront investment. You do not buy them; you marry them. This violates the 5-minute time-to-value constraint.
* **The PagerDuty Extortion:** PagerDuty's native automation is a cynical upsell. It demands thousands of dollars simply to automate the resolution of the alerts it already charges you to receive. They are taxing their own utility.

**The Unowned Gap:**

Nobody owns the bridge between *tribal knowledge* and *automated execution*. The "documentation-to-execution" gap is vast. Teams currently have to write documentation, then write automation code. We eliminate the intermediate step. The documentation *is* the code.

**Why 2026? The Paradigm Shift:**

Two years ago, this was impossible. A 2024 LLM would hallucinate a destructive command or fail to parse implicit prerequisites. Now, we have models capable of rigorous structural extraction and risk classification (🟢🟡🔴). The context windows are large enough to digest a 50-page postmortem and distill the exact terminal commands that fixed it. The inference latency is under 2 seconds. The underlying intelligence has commoditized to the point where we can offer it for $29/runbook/month, destroying the enterprise pricing models of the incumbents.
## Section 3: COMPETITIVE MOAT STRATEGY

You cannot rely on LLM capabilities as a moat. OpenAI or Anthropic will drop prices or release better reasoning models every six months. The moat is your data and your ecosystem. If you do not lock this down, you are just a generic wrapper waiting to be replaced by a Pulumi or GitHub Agentic Workflow.

**The Flywheel of the dd0c Ecosystem:**

* **The Alert/Run Coupling:** `dd0c/alert` identifies the incident pattern; `dd0c/run` provides the resolution. The execution telemetry from `dd0c/run` feeds back into `dd0c/alert`, training the matching engine. It is a closed-loop system of continuous improvement. The data moat compounds daily.

**The Network Effect:**

* **The Template Marketplace:** A company signs up and immediately inherits the collective wisdom of thousands of other engineering teams. A shared template for "AWS RDS Failover" that has been battle-tested and refined across 500 organizations is infinitely more valuable than a blank slate. The value of the platform scales non-linearly with every new user.

**The Data Moat (Execution Telemetry):**

* We log every skipped step, every manual override, every successful rollback. We are building the industry's first database of *what actually works in an incident*. This "Resolution Pattern Database" is an asset no incumbent possesses. They only know what the runbook says; we know what the human actually typed at 3:14 AM.

**Why the Incumbents Cannot Replicate This:**

* PagerDuty and Incident.io cannot simply add a "generate runbook" button. To replicate `dd0c/run`, they would need the deep infrastructure context, the FinOps integration (`dd0c/cost`), and the alert intelligence pipeline (`dd0c/alert`). We have the context. They just have the routing rules.
## Section 4: MARKET SIZING

The numbers must be defensible. Stop inflating them to please imaginary VCs. We are a bootstrapped operation.

**Methodology & Market Sizing:**

* **TAM (Total Addressable Market): $12B+**
  * *Calculation:* There are roughly 26 million software developers globally. Assume 20% are involved in ops/on-call rotations (5.2 million). Assume an average enterprise tooling spend of $200/month per engineer for incident management, AIOps, and automation.
* **SAM (Serviceable Available Market): $1.5B**
  * *Calculation:* Focus on the mid-market (startups to mid-size tech companies, Series A to Series D). These teams have the highest pain-to-budget ratio. They have 10-100 engineers and cannot afford to hire a dedicated SRE team or buy Shoreline. Let's estimate 50,000 such companies, averaging 30 on-call engineers each (1.5 million target seats). At $1,000/year per seat, that's $1.5B.
* **SOM (Serviceable Obtainable Market): $15M (Year 3 Target)**
  * *Calculation:* 1% penetration of the SAM. 500 companies with an average ARR of $30,000.
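
The sizing above checks out arithmetically; the figures below are taken straight from the calculations in the text:

```python
# TAM: 26M developers, 20% on-call, $200/month tooling spend.
developers = 26_000_000
oncall_engineers = developers * 0.20          # 5.2M
tam = oncall_engineers * 200 * 12             # $12.48B, quoted as "$12B+"
assert tam == 12_480_000_000

# SAM: 50,000 mid-market companies x 30 on-call engineers x $1,000/year.
seats = 50_000 * 30                           # 1.5M target seats
sam = seats * 1_000
assert sam == 1_500_000_000                   # $1.5B

# SOM: 1% of the 50,000 SAM companies at $30K average ARR.
som = 500 * 30_000
assert som == 15_000_000                      # $15M
```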
**Beachhead Segment Identification:**

* **The Drowning SRE Team (Series B/C Startups):** Teams of 5-15 SREs supporting 50-200 developers. They have high incident volume and their existing Confluence runbooks are a known liability. They are desperate for a solution that does not require a six-month migration.
* **The Compliance Chasers:** Startups preparing for their first or second SOC 2 audit. They need documented, auditable incident response procedures immediately. We sell them the audit trail masquerading as an automation tool.

**Revenue Projections (Based on $29/runbook/month or equivalent per-seat pricing):**

* **Conservative (Year 1):** $250K ARR. We capture the early adopters from our FinOps wedge (`dd0c/cost`) and convert them to the platform.
* **Moderate (Year 2):** $1.2M ARR. The flywheel engages. The template marketplace drives organic acquisition. `dd0c/alert` and `dd0c/run` are sold as a bundled pair.
* **Aggressive (Year 3):** $5M+ ARR. The platform takes over the incident management budget of 150-200 mid-market companies.
## Section 5: STRATEGIC RISKS

This is where the hallucination stops. This is where you stare down the barrel of the gun. The market is not kind to solo founders.

**Top 5 Existential Risks:**

1. **PagerDuty/Incident.io Building Native AI Automation**
   * **Severity:** Critical
   * **Probability:** High
   * **Mitigation:** They *will* build this, but they will build it as a closed, proprietary upsell for Enterprise tiers. They will not integrate deeply with your AWS cost anomalies (`dd0c/cost`) or your infrastructure drift (`dd0c/drift`). We win on the open ecosystem, the cross-platform nature of the agent, and mid-market pricing. They will sell to the CIO; we sell to the on-call engineer at 3 AM.

2. **LLM Hallucination in Production Runbooks (Safety Critical)**
   * **Severity:** Catastrophic
   * **Probability:** Medium
   * **Mitigation:** The Trust Gradient is non-negotiable. 🟢 (Safe), 🟡 (Caution), 🔴 (Dangerous). We never execute state-changing commands without explicit human approval (Copilot Mode) or a proven track record (Autopilot Mode). We must implement strict grounding techniques; the LLM cannot invent steps not found in the source material unless explicitly requested. Every action must have a recorded rollback command. The first time `dd0c/run` breaks production autonomously, the company is dead.

3. **The "Agentic AI" Obsolescence Event**
   * **Severity:** High
   * **Probability:** Low (in the next 3 years)
   * **Mitigation:** If autonomous AI agents (Devin, GitHub Copilot Workspace, Pulumi Neo) can detect and fix infrastructure issues without human intervention, who needs runbooks? The answer: they need runbooks as the "policy" that defines what the agent *should* do. Runbooks become the bridge between human intent and agent execution. We pivot from "human automation" to "agent policy management."

4. **Solo Founder Scaling Constraints (The Bus Factor)**
   * **Severity:** High
   * **Probability:** High
   * **Mitigation:** Brian, you are building six products. You must rigorously enforce the "Anti-Bloatware Platform Strategy." Share the API gateway, the auth layer, the OpenTelemetry ingest, and the Rust agent architecture across all `dd0c` modules. If you build six separate data models, you will burn out in 14 months. Do not build custom integrations where webhooks will suffice. Do not build crawlers; force the user to paste.

5. **The "No Runbooks at All" Cold Start Problem**
   * **Severity:** Medium
   * **Probability:** High
   * **Mitigation:** Many teams have zero runbooks. They cannot use `dd0c/run` if they have nothing to paste. V1 must include the "Slack Thread Scraper" and "Postmortem-to-Runbook Pipeline." V2 must include the "Terminal Watcher." The product must *create* runbooks, not just execute them.
## Section 6: INNOVATION VERDICT
|
||||||
|
|
||||||
|
This is the final word. The market is saturated with "Ops" products, but it is entirely devoid of execution velocity.
|
||||||
|
|
||||||
|
**Verdict: CONDITIONAL GO.**
|
||||||
|
|
||||||
|
**Timing Recommendation:**
|
||||||
|
Do not build this first. Do not build this second.
|
||||||
|
1. **Month 1-3:** Build `dd0c/route` and `dd0c/cost`. These are the FinOps wedges. They prove immediate, hard-dollar ROI. If you cannot save a company money, you have no right to ask them to trust you with their production environment.
|
||||||
|
2. **Month 4-6:** Build `dd0c/alert` and `dd0c/run` as a bundled pair. The "On-Call Savior" phase. You have saved their budget; now save their sleep.
|
||||||
|
|
||||||
|
**Key Conditions & Kill Criteria:**
|
||||||
|
* **Condition 1:** The "Paste & Parse" AI must take < 5 seconds and perfectly classify destructive vs. read-only commands. If it requires 10 minutes of manual YAML adjustment after pasting, the product is dead. Kill it.
* **Condition 2:** You must secure the `dd0c/alert` integration pipeline. `dd0c/run` is not a standalone product; it is the execution arm of your alert intelligence. If it cannot auto-attach the correct runbook to a PagerDuty webhook with >90% confidence, kill it.
* **Condition 3:** Zero-configuration local execution. The Rust agent must run in their VPC and pull commands from the SaaS. If you require inbound firewall rules or AWS root credentials, the security review will block every sale. Kill it.
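Underneath Condition 1 sits a single primitive: a deterministic pattern check that never defers to the LLM. A minimal Python sketch (the patterns, labels, and command names are illustrative assumptions, not a shipped rule set):

```python
import re

# Illustrative destructive patterns; a real rule set would be far larger
# and validated against an AST-level parse, not regex alone.
DESTRUCTIVE = [
    re.compile(p) for p in (
        r"\bkubectl\s+delete\b",
        r"\bterraform\s+destroy\b",
        r"\bdrop\s+(table|database)\b",
        r"\brm\s+-rf\b",
    )
]
READ_ONLY = [
    re.compile(p) for p in (
        r"^\s*kubectl\s+(get|logs|describe)\b",
        r"^\s*aws\s+\w+\s+(describe|list|get)-",
    )
]

def classify(command: str, llm_label: str) -> str:
    """Deterministic scanner overrides the LLM's guess.

    Returns 'red' for known-destructive, 'green' for provably
    read-only, and otherwise refuses to inherit the LLM's optimism.
    """
    cmd = command.lower()
    if any(p.search(cmd) for p in DESTRUCTIVE):
        return "red"      # destructive: always 🔴, no matter what the LLM said
    if any(p.search(cmd) for p in READ_ONLY):
        return "green"    # provably read-only: 🟢
    # Unknown command: downgrade an optimistic LLM label to 🟡.
    return "yellow" if llm_label == "green" else llm_label
```

The final branch is the whole point: an unknown command can never be 🟢 just because the LLM felt confident.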

**The One Thing That Must Be True:**

Engineers must hate their current 3 AM reality more than they fear giving an LLM the ability to suggest a production terminal command. The dread of the pager must outweigh the skepticism of the AI.
If that is true—and my telemetry suggests it is—then `dd0c/run` is not just a feature. It is the beginning of the end for static documentation.
Build the weapon.
— Victor

---

**File:** `products/06-runbook-automation/party-mode/session.md` (new file, 101 lines)

# dd0c/run — Party Mode Advisory Board Review
**Date:** 2026-02-28
**Product:** dd0c/run (AI-Powered Runbook Automation)
**Phase:** 3
> *The boardroom doors lock. The coffee is black. The whiteboard is empty. Five experts sit around the table to tear dd0c/run apart. If it survives this, it survives production.*
## Round 1: INDIVIDUAL REVIEWS
**The VC (Pattern Matcher):**
"I'll be honest, the initial pitch sounds like a feature PagerDuty will announce at their next user conference. But the *Resolution Pattern Database*—that's a data moat. If you actually capture what engineers type at 3am across 500 companies, you're not building an ops tool, you're building the industry's first operational intelligence graph. What worries me is the go-to-market. $29/runbook/month sounds weird. Just do price per seat. **CONDITIONAL GO**—only if you change the pricing model and prove PagerDuty can't just copy the UI."
**The CTO (Infrastructure Veteran):**
"I have 20 years of scars from production outages. The idea of an LLM piping commands directly into a root shell makes my blood run cold. 'Hallucinate a failover' is a resume-generating event. However, the strict Risk Classification (🟢🟡🔴) and the local Rust agent in the customer's VPC give me a sliver of hope. What worries me is the parsing engine misclassifying a destructive command as 'safe' because the LLM got confused by context. **NO-GO** until I see the deterministic AST-level validation that overrides the LLM's risk guesses."
**The Bootstrap Founder (Solo Shipper):**
"A secure, remote-execution VPC agent plus an LLM parsing layer plus a Slack bot? For a solo founder? Brian, you're going to burn out before month 6. I love the 'Paste & Parse' wedge—that's a viral 5-second time-to-value that doesn't require a sales call. But what worries me is the support burden when a customer's weird custom Kubernetes setup doesn't play nice with your agent. **GO**, but you have to ruthlessly cut scope. V1 is Copilot-only, no Autopilot."
**The On-Call Engineer (3am Survivor):**
"You had me at 'Slack bot that pre-fills the variables.' When I'm paged at 3am, my IQ drops by 40 points. I don't want to read Confluence. I want a button that says 'Check Database Locks' and just does it. What worries me is trust. If this thing lies to me once—if it runs the wrong script or links the wrong runbook—I will uninstall it and tell every SRE I know that it's garbage. The rollback feature is the only reason I'm at the table. **CONDITIONAL GO**—if it actually works in Slack."
**The Contrarian (The Skeptic):**
"You're all missing the point. Automating a runbook is a fundamental anti-pattern. If a process can be documented as a step-by-step runbook, it should be codified as a self-healing script in the infrastructure, not babysat by an AI chatbot! Runbooks are a symptom of engineering failure. You're building a highly profitable band-aid for a bullet wound. Furthermore, this is a feature of `dd0c/portal`, not a standalone product. **NO-GO**."
## Round 2: CROSS-EXAMINATION
**The CTO:** "Let's talk blast radius. What happens when the LLM reads a stale runbook that says `kubectl delete namespace payments-legacy`, but that namespace was repurposed last week for the new billing engine?"
**The On-Call Engineer:** "Wait, if the AI suggests that at 3am, I might just click approve because I'm half asleep. If the UI makes it look 'green' or safe, I'm blindly trusting it."
**The CTO:** "Exactly. An LLM cannot know the current state of the cluster unless it's constantly diffing against live infrastructure. It's just reading a dead document."
**The Contrarian:** "Which is why relying on text documents to manage infrastructure in 2026 is archaic. The product is enforcing bad habits. We should be writing declarative state, not chat scripts."
**The Bootstrap Founder:** "Guys, focus. We're not solving the platonic ideal of engineering, we're solving the reality that 50,000 startups have garbage Confluence pages. CTO, the design doc says V2 includes 'Infrastructure-Aware Staleness' via `dd0c/portal` to catch dead resources."
**The CTO:** "V2 doesn't pay the bills if V1 drops a production database. The risk classifier cannot be pure LLM. It has to be a deterministic regex/AST scanner matching against known destructive patterns."
**The VC:** "Let's pivot to market dynamics. Bootstrap Founder, you think one person can't build this. But if he uses the shared `dd0c` platform architecture—same auth, same billing, same gateway—isn't this just a Slack bot piping to an LLM router?"
**The Bootstrap Founder:** "The SaaS side is easy. The Rust execution agent that runs in the customer's VPC is hard. That has to be bulletproof. If that agent has a CVE, the whole dd0c brand is dead."
**The VC:** "But without the execution, we're just a markdown parser. Incident.io can build a markdown parser in a weekend. They're probably building it right now."
**The On-Call Engineer:** "I don't care about Incident.io. I care that when I get a PagerDuty alert, the Slack thread already has the logs pulled. If `dd0c/run` can do that without even executing a write command, I'd pay for it."
**The Contrarian:** "So you're paying $29 a month for a glorified `grep`?"
**The On-Call Engineer:** "At 3am, a glorified `grep` that knows exactly which pod is failing is worth its weight in gold. You don't get it because you sleep through the night."
**The Bootstrap Founder:** "That's the wedge! The read-only Copilot. V1 doesn't need to execute state changes. V1 just runs the 🟢 safe diagnostic steps automatically and presents the context."
**The CTO:** "I can stomach that. Read-only diagnostics limit the blast radius to zero, while proving the agent works."
## Round 3: STRESS TEST
**1. The Catastrophic Scenario: LLM Hallucination Causes a Production Outage**

* **Severity:** 10/10. An extinction-level event for the company.

* **The Attack:** The runbook says "Restart the pod." The LLM hallucinates and outputs `kubectl delete deployment` instead, classifying it as 🟢 Safe. The tired engineer clicks approve. Production goes down, the customer cancels, and dd0c goes bankrupt from lawsuits.

* **Mitigation:**
    * *Systemic:* Strict UI friction. 🔴 actions require typing the resource name to confirm, not just clicking a button.
    * *Architectural:* The agent must have a deterministic whitelist of allowed commands. Anything outside the whitelist is blocked at the agent level, regardless of the SaaS payload.

* **Pivot Option:** Limit V1 to Read-Only Diagnostic execution. The AI only fetches logs and metrics. State changes must be copy-pasted by the human.
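The architectural mitigation reduces to a whitelist check that lives inside the agent, not the cloud. A Python sketch of the logic the Rust agent would enforce (the binary and subcommand names are illustrative assumptions):

```python
# Agent-side guard: anything not explicitly allowed is blocked,
# regardless of what the SaaS payload claims about its risk class.
ALLOWED = {
    "kubectl": {"get", "logs", "describe", "top"},
    "aws": {"cloudwatch", "logs", "rds"},
}

def agent_permits(command: str) -> bool:
    """Deterministic allow-list: binary AND subcommand must both match."""
    parts = command.split()
    return (
        len(parts) >= 2
        and parts[0] in ALLOWED
        and parts[1] in ALLOWED[parts[0]]
    )
```

The design choice is default-deny: a new command class requires shipping a new agent build, which is exactly the friction you want between an LLM and a root shell.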
**2. The Market Threat: PagerDuty / Incident.io Ships Native Runbook Automation**

* **Severity:** 8/10.

* **The Attack:** PagerDuty acquires a small AI runbook startup and bundles it into their Enterprise tier for "free" (subsidized by their massive license cost), choking off dd0c/run's distribution.

* **Mitigation:**
    * *Positioning:* PagerDuty's automation is locked to PagerDuty. `dd0c/run` must be agnostic (works with OpsGenie, Grafana OnCall, direct webhooks).
    * *Data Moat:* The cross-customer resolution template library. PagerDuty keeps data siloed per enterprise; we anonymize and share the "ideal" runbooks across the mid-market.

* **Pivot Option:** Double down on the `dd0c` ecosystem. `dd0c/run` becomes the execution arm of `dd0c/alert` and `dd0c/drift`, creating a tightly coupled platform that PagerDuty can't touch.

**3. The Cold Start: Teams Don't Have Documented Runbooks**

* **Severity:** 7/10.

* **The Attack:** A prospect signs up, goes to the "Paste Runbook" screen, and realizes they have nothing to paste. Churn happens in 60 seconds.

* **Mitigation:**
    * *Product:* The "Slack Thread Scraper" and "Postmortem-to-Runbook Pipeline" must be in V1. If they have incidents, they have Slack threads. Extract the runbooks from history.
    * *Community:* Pre-seed the platform with 50 high-quality templates for standard infra (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure).

* **Pivot Option:** Shift marketing from "Automate your runbooks" to "Generate your missing runbooks."

## Round 4: FINAL VERDICT
**The Board Vote:** Split Decision (3 GO, 1 CONDITIONAL GO, 1 NO-GO).
**The Verdict:** **PIVOT TO CONDITIONAL GO.**
**Revised Strategy & Timing:**
Confirmed as Phase 3 product (Months 4-6), but ONLY if `dd0c/alert` is functional first. `dd0c/run` cannot stand alone in the market against incumbents; it must be the execution arm of the alert intelligence.
**Top 3 Must-Get-Right Items:**
1. **The Trust Gradient (Read-Only First):** V1 must aggressively limit itself to 🟢 Safe (diagnostic/read-only) commands for auto-execution. 🟡 and 🔴 commands must have extreme UI friction.
2. **The "Paste to Pilot" Vercel Moment:** The onboarding must be under 5 minutes. If parsing fails or requires heavy YAML tweaking, the solo founder loses the GTM battle.
3. **The Agent Security Model:** The Rust VPC agent must be open-source, audited, and strictly outbound-only. If InfoSec teams balk, the sales cycle extends beyond a solo founder's runway.
**The Kill Condition:**
If beta testing shows that the LLM misclassifies a destructive command as "Safe" (🟢) even once, or if the false-positive rate for safe commands exceeds 0.1%, the product must be killed or fundamentally re-architected to remove LLMs from the execution path.
**Closing Remarks from The Board:**
*The board recognizes the immense leverage of this product. If Brian can pull this off, he effectively creates the bridge between static documentation and autonomous AI agents. But the execution risk is astronomical. Build it paranoid. Assume the AI wants to delete production. Constrain it accordingly.*
*Meeting adjourned.*

---

**File:** `products/06-runbook-automation/product-brief/brief.md` (new file, 617 lines)

# dd0c/run — Product Brief
## AI-Powered Runbook Automation
**Version:** 1.0 | **Date:** 2026-02-28 | **Author:** Product Management | **Status:** Investor-Ready Draft
---
## 1. EXECUTIVE SUMMARY
### Elevator Pitch
dd0c/run converts your team's existing runbooks — the stale Confluence pages, the Slack threads, the knowledge trapped in your senior engineer's head — into structured, executable workflows that guide on-call engineers through incidents step by step. Paste a runbook, get an intelligent copilot in under 5 seconds. No YAML. No configuration. No new DSL to learn.
This is the most safety-critical module in the dd0c platform. It touches production. We built it that way on purpose.
### The Problem
**The documentation-to-execution gap is killing engineering teams.**
- The average on-call engineer spends 12+ minutes *finding and interpreting* the right runbook during a 3am incident. Cognitive function drops 30-40% during nighttime pages. Every minute of that search is a minute of downtime, a minute of cortisol, and a minute closer to burnout.
- 60% of runbooks are stale within 30 days of creation. Infrastructure changes, UIs get redesigned, scripts move repos. The runbook becomes a historical artifact that actively sabotages incident response.
- On-call burnout is the #1 reason SREs quit. Replacing a single engineer costs $150K+. The tooling that's supposed to help — Rundeck, PagerDuty Automation Actions, Shoreline — either requires weeks of setup, costs thousands per month, or demands a proprietary DSL that nobody has time to learn.
- SOC 2 and ISO 27001 require documented, auditable incident response procedures. Most teams' "documentation" is a stale wiki page that wouldn't survive a serious audit.

The industry has tools that *route* alerts (PagerDuty), tools that *document* incidents (Rootly, Incident.io), and tools that *schedule* jobs (Rundeck). Nobody owns the bridge between tribal knowledge and automated execution. The runbook sits in Confluence. The terminal sits on the engineer's laptop. The gap between them is where MTTR lives.
### The Solution
dd0c/run is an AI-powered runbook engine that:
1. **Ingests** runbooks from anywhere — paste raw text, connect Confluence/Notion, or point at a Git repo of markdown files.
2. **Parses** prose into structured, executable steps with automatic risk classification (🟢 Safe / 🟡 Caution / 🔴 Dangerous) in under 5 seconds.
3. **Matches** runbooks to incoming alerts via PagerDuty/OpsGenie webhooks, so the right procedure appears in the incident Slack channel before the engineer finishes rubbing their eyes.
4. **Guides** execution in Copilot mode — auto-executing safe diagnostic steps, prompting for approval on state changes, blocking destructive actions without explicit confirmation.
5. **Learns** from every execution — tracking which steps were skipped, modified, or added — and suggests runbook updates automatically. Runbooks get better with every incident instead of decaying.
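Step 3's matching can start far simpler than "semantic similarity" suggests: a weighted overlap score between alert metadata and runbook tags, with embeddings layered on later. A hedged Python sketch (the field names are assumptions, not the shipped schema):

```python
def match_score(alert: dict, runbook: dict) -> float:
    """Toy keyword + metadata scorer. A service match dominates;
    title-keyword overlap breaks ties. Real matching would add
    embedding similarity on top."""
    score = 0.0
    if alert.get("service") == runbook.get("service"):
        score += 3.0
    alert_words = set(alert.get("title", "").lower().split())
    score += len(alert_words & set(runbook.get("keywords", [])))
    return score

def best_runbook(alert: dict, runbooks: list[dict]) -> dict:
    """Pick the highest-scoring runbook for an incoming webhook."""
    return max(runbooks, key=lambda rb: match_score(alert, rb))
```

Even this crude scorer clears the ">90% confidence" bar for teams whose runbooks are tagged by service; the embedding layer exists for the messy long tail.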
### Target Customer
**Primary:** Mid-market engineering teams (Series A through Series D startups, 10-100 engineers) with 5-15 SREs supporting high incident volume. They have existing runbooks in Confluence/Notion that they know are stale, they can't afford a dedicated SRE tooling team, and they're drowning in on-call.
**Secondary:** Startups approaching their first SOC 2 audit who need documented, auditable incident response procedures immediately.
**Beachhead:** Teams already using dd0c/cost or dd0c/alert. We've saved their budget and their sleep. Now we save their production environment.
### Key Differentiators
1. **Zero-Configuration Intelligence.** Paste a runbook. Get structured, risk-classified, executable steps in under 5 seconds. Rundeck requires Java, a database, and YAML definitions. Shoreline requires a proprietary DSL. We require a clipboard.
2. **The Trust Gradient.** We don't ask teams to hand production to an AI on day one. dd0c/run starts in read-only suggestion mode. Trust is earned through data — 10 successful copilot runs with zero modifications before the system even *suggests* promotion to autopilot. Trust is earned in millimeters and lost in miles. We designed for that.
3. **The dd0c Ecosystem Flywheel.** dd0c/alert identifies the incident pattern. dd0c/run provides the resolution. Execution telemetry feeds back into alert intelligence, training the matching engine. dd0c/portal provides service ownership context. dd0c/cost tracks the financial impact. The modules are 10x more valuable together than apart. No point solution can replicate this.
4. **The Resolution Pattern Database.** Every skipped step, every manual override, every successful rollback is logged. We're building the industry's first database of *what actually works in an incident* — not what the runbook says, but what the engineer actually typed at 3:14 AM. This data moat compounds daily.
5. **Agent-Based Security Model.** A lightweight, open-source Rust agent runs inside the customer's VPC. The SaaS never sees credentials. The agent is strictly outbound-only. No inbound firewall rules. No root AWS credentials. InfoSec teams can audit the binary. The execution boundary is the customer's perimeter, not ours.
### The Safety Narrative
Let's be direct: this product executes actions in production environments. An LLM suggesting the wrong command at 3am to a sleep-deprived engineer is not a theoretical risk — it's the scenario that kills the company.
Our entire architecture is built around one principle: **assume the AI wants to delete production. Constrain it accordingly.**
- **V1 is Copilot-only.** No autonomous execution of state-changing commands. Period. The AI suggests. The human approves. The audit trail proves it.
- **Risk classification is deterministic + AI.** The LLM classifies steps, but a deterministic regex/AST scanner validates against known destructive patterns. The scanner overrides the LLM. Always.
- **The agent has a command whitelist.** Anything outside the whitelist is blocked at the agent level, regardless of what the SaaS sends. The agent is the last line of defense, and it doesn't trust the cloud.
- **🔴 Dangerous actions require typing the resource name to confirm.** Not clicking a button. Typing. UI friction is a feature, not a bug.
- **Every state-changing step records its rollback command.** One-click undo at any point. The safety net that makes engineers brave enough to click "Approve."
- **Kill condition:** If beta testing shows the LLM misclassifies a destructive command as Safe (🟢) even once, or if the false-positive rate exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.
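Taken together, these approval rules amount to a small, deterministic gate per step. A Python sketch (the state names are illustrative, not the shipped state machine):

```python
def execution_gate(risk: str, approved: bool = False,
                   typed_name: str = "", resource: str = "") -> str:
    """Decide a step's fate by risk class: 🟢 auto-runs, 🟡 needs an
    explicit approval, 🔴 needs the resource name typed verbatim."""
    if risk == "green":
        return "execute"
    if risk == "yellow":
        return "execute" if approved else "await-approval"
    # 🔴: typed confirmation, compared strictly. At 3am,
    # "payments" vs "payments-legacy" is exactly the difference
    # a tired brain glosses over; strict equality does not.
    if resource and typed_name.strip() == resource:
        return "execute"
    return "blocked"
```

The friction is intentional: the gate never downgrades itself, and every outcome it returns is trivially auditable.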
Trust is built incrementally: read-only diagnostics first, then copilot with approval gates, then — only after proven track records — selective autopilot for safe steps. The engineer always has the kill switch. The AI never insists.
---
## 2. MARKET OPPORTUNITY
### The Documentation-to-Execution Gap
Every engineering team has some form of incident documentation. Confluence pages, Notion databases, Google Docs, Slack threads, senior engineers' brains. And every team has a terminal where incidents get resolved. The gap between those two things — the document and the execution — is where MTTR lives, where burnout festers, and where $12B+ in operational tooling spend fails to deliver.
The current market is segmented into tools that solve *pieces* of the incident lifecycle but leave the critical bridge unbuilt:
| Category | Players | What They Do | What They Don't Do |
|----------|---------|-------------|-------------------|
| Alert Routing | PagerDuty, OpsGenie, Grafana OnCall | Route alerts to the right human | Help that human actually resolve the incident |
| Incident Management | Rootly, Incident.io, FireHydrant | Document the bureaucracy of the outage | Execute the resolution |
| Job Scheduling | Rundeck (OSS) | Run pre-defined jobs via YAML | Parse natural language, classify risk, learn from execution |
| Orchestration Platforms | Shoreline, Transposit | Execute complex remediation workflows | Onboard in under 5 minutes, work without a proprietary DSL |
| AIOps | BigPanda, Moogsoft | Cluster and correlate alerts | Bridge the gap from "we know what's wrong" to "here's how to fix it" |

Nobody owns the bridge from documentation to execution. That's the gap. That's the market.
### Market Sizing
**TAM (Total Addressable Market): $12B+**
26 million software developers globally. ~20% involved in ops/on-call rotations (5.2 million). Average enterprise tooling spend of $200/month per engineer across incident management, AIOps, and automation tooling. The TAM encompasses the full operational tooling budget that dd0c/run competes for or displaces.
**SAM (Serviceable Available Market): $1.5B**
Focus on mid-market: startups to mid-size tech companies (Series A through Series D). These teams have the highest pain-to-budget ratio — 10-100 engineers, can't afford dedicated SRE tooling teams, can't justify Shoreline's enterprise pricing. Estimated 50,000 such companies globally, averaging 30 on-call engineers each (1.5 million target seats). At $1,000/year per seat equivalent, that's $1.5B.
**SOM (Serviceable Obtainable Market): $15M (Year 3 Target)**
1% penetration of SAM. 500 companies with average ARR of $30,000. This is a bootstrapped operation — the numbers must be defensible, not inflated for imaginary VCs.
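The sizing reduces to straightforward arithmetic, reproduced here so every assumption is explicit and auditable (the "$12B+" headline rounds down from the computed figure):

```python
# TAM: on-call developers x average monthly tooling spend, annualized
developers = 26_000_000
on_call_share = 0.20          # ~20% touch ops/on-call rotations
spend_per_month = 200         # $/engineer across the tooling stack
tam = developers * on_call_share * spend_per_month * 12

# SAM: mid-market companies x on-call seats x per-seat-equivalent price
companies = 50_000
seats_per_company = 30
price_per_seat_year = 1_000
sam = companies * seats_per_company * price_per_seat_year

# SOM: 1% of SAM companies at target ARR
som = 500 * 30_000

print(f"TAM ${tam/1e9:.2f}B, SAM ${sam/1e9:.1f}B, SOM ${som/1e6:.0f}M")
# → TAM $12.48B, SAM $1.5B, SOM $15M
```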
### Competitive Landscape
**Rundeck (Open Source / PagerDuty-owned)**
- Strengths: Free, established, large community.
- Weaknesses: 2015-era UX. Requires Java, a database, YAML job definitions. It's a job scheduler masquerading as a runbook engine. Time-to-value is measured in days, not seconds.
- Our advantage: Zero-config AI parsing vs. manual YAML authoring. It's Notepad vs. VS Code — different products for different eras.

**Transposit / Shoreline**
- Strengths: Deep orchestration capabilities, enterprise customers.
- Weaknesses: Over-engineered for the 1% of orgs that have bandwidth to learn a proprietary DSL. They built jetpacks for people who are currently drowning. Pricing is enterprise-only.
- Our advantage: Paste-to-parse in 5 seconds. No DSL. Mid-market pricing. We meet teams where they are, not where Shoreline wishes they were.

**Rootly / Incident.io / FireHydrant**
- Strengths: Excellent incident management workflows, growing fast.
- Weaknesses: They document the fire; they don't hold the hose. They stop at the boundary of execution.
- Our advantage: We start where they stop. And with dd0c/alert integration, we own the full chain from detection to resolution.

**PagerDuty Automation Actions**
- Strengths: Distribution. Every on-call team already has PagerDuty.
- Weaknesses: A cynical upsell — thousands of dollars to automate the resolution of alerts they already charge you to receive. Locked to the PagerDuty ecosystem. No runbook intelligence, just pre-defined script execution.
- Our advantage: Platform-agnostic (works with PagerDuty, OpsGenie, Grafana OnCall, any webhook). AI-native intelligence vs. dumb script execution. 10x cheaper.

**The Real Threat: PagerDuty or Incident.io acquiring an AI runbook startup and bundling it into Enterprise tier.** Mitigation: They will build it as a closed, proprietary upsell. They cannot integrate with dd0c/cost, dd0c/drift, or dd0c/alert. They will sell to the CIO; we sell to the on-call engineer at 3 AM. We win on the open ecosystem, cross-platform nature, and mid-market pricing.
### Timing Thesis: Why 2026
Two years ago, this product was impossible. Three things changed:
1. **LLM Parsing Reliability.** A 2024 model would hallucinate destructive commands or fail to parse implicit prerequisites. 2026 models can perform rigorous structural extraction and risk classification with the accuracy required for production-adjacent tooling. Context windows are large enough to digest a 50-page postmortem and distill the exact terminal commands that fixed it.
2. **Inference Economics.** Inference latency is under 2 seconds. Costs have commoditized to the point where we can offer AI-powered parsing for $29/runbook/month, destroying the enterprise pricing models of incumbents who charge $50-100/seat/month for dumb automation.
3. **The Agentic Shift.** The industry is transitioning from "human-in-the-loop" to "human-on-the-loop." Engineering teams are psychologically ready for AI-assisted operations in a way they weren't 18 months ago. The dread of the 3am pager now outweighs the skepticism of the AI — and that's the inflection point.
**The window:** If we don't build this in the next 12 months, PagerDuty, Incident.io, or a well-funded startup will. The documentation-to-execution gap is too obvious and too painful to remain unowned. First-mover advantage accrues to whoever builds the Resolution Pattern Database first — that data moat compounds daily and is nearly impossible to replicate.
---
## 3. PRODUCT DEFINITION
### Value Proposition
**For on-call engineers:** Replace the 3am panic spiral — searching Confluence, interpreting stale docs, copy-pasting commands you don't understand — with a calm copilot that already knows what's wrong, already has the runbook, and walks you through it step by step.
**For SRE managers:** Replace vibes-based operational health with data. Know which services lack runbooks, which runbooks are stale, which steps get skipped, and what your actual MTTR is — with audit-ready compliance evidence generated automatically.
**For senior engineers (the bus factor):** Stop being the human runbook. Your expertise gets captured from your natural workflow — terminal sessions, Slack threads, incident resolutions — and lives on in the system even when you're on vacation, asleep, or gone.
**One sentence:** dd0c/run turns your team's scattered incident knowledge into a living, learning, executable system that makes every on-call engineer as effective as your best one.
### Personas
| Persona | Name | Role | Primary Need | Key Metric |
|---------|------|------|-------------|------------|
| The On-Call Engineer | Riley | Mid-level SRE, 2 years exp, paged at 3am | Instantly know what to do without searching or guessing | Time-to-resolution, confidence during incidents |
| The SRE Manager | Jordan | Manages 8 SREs, owns incident response quality | Consistent, auditable, measurable incident response | MTTR trends, runbook coverage, compliance readiness |
| The Runbook Author | Morgan | Staff engineer, 6 years, carries institutional knowledge | Transfer expertise without the overhead of writing docs | Knowledge capture rate, runbook usage by others |

### The Trust Gradient — The Core Design Principle
This is the architectural decision that defines dd0c/run. It is non-negotiable.
```
┌───────────────────────────────────────────────────────────┐
│                    THE TRUST GRADIENT                     │
│                                                           │
│  READ-ONLY ──→ SUGGEST ──→ COPILOT ──→ AUTOPILOT          │
│                                                           │
│  "Show me     "Here's what   "I'll do it,    "Handled.    │
│   the steps"   I'd do"        you approve"    Here's      │
│                                               the log"    │
│                                                           │
│  V1 ◄──────────────────►                                  │
│  (V1 scope: Read-Only + Suggest + Copilot for 🟢 only)    │
│                                                           │
│  ● Per-runbook setting (not global)                       │
│  ● Per-step override (🟢 auto, 🟡 prompt, 🔴 block)       │
│  ● Earned through data (10 clean runs → suggest upgrade)  │
│  ● Instantly revocable (one bad run → auto-downgrade)     │
│  ● The human always has the kill switch                   │
└───────────────────────────────────────────────────────────┘
```
**V1 is Read-Only + Suggest + Copilot for 🟢 Safe steps ONLY.** The AI auto-executes read-only diagnostic commands (check logs, query metrics, list pods). All state-changing commands (🟡 Caution, 🔴 Dangerous) require explicit human approval. Full Autopilot mode is V2 — and only for runbooks with a proven track record.
This is not a limitation. This is the product. Trust earned through data is the moat that no competitor can shortcut.
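The promotion and demotion rules behind the gradient are deliberately mechanical. A Python sketch of the policy (thresholds from the gradient; the run-record fields are illustrative assumptions):

```python
def trust_action(history: list[dict]) -> str:
    """Promotion policy sketch: 10 copilot runs with zero failures
    and zero step modifications -> suggest autopilot; any failed
    run -> immediate downgrade; otherwise hold."""
    if any(run["failed"] for run in history):
        return "downgrade"
    unmodified = all(not run["modified"] for run in history)
    if len(history) >= 10 and unmodified:
        return "suggest-autopilot"
    return "hold"
```

Note the asymmetry: promotion needs ten clean runs, demotion needs one bad one. That asymmetry is the "earned in millimeters, lost in miles" principle made executable.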
### Features by Release
#### V1: "Paste → Parse → Page → Pilot" (Months 4-6)
The minimum viable product. Four verbs. If you can't explain it in those four words, it's out of scope.
| Feature | Description | Priority |
|---------|-------------|----------|
| **Paste & Parse** | Copy-paste raw text from anywhere. AI structures it into numbered, executable steps with risk classification (🟢🟡🔴) in < 5 seconds. Zero configuration. | P0 — This IS the product |
| **Risk Classification Engine** | AI + deterministic scanner labels every step. LLM classifies intent; regex/AST scanner validates against known destructive patterns. Scanner overrides LLM. Always. | P0 — Trust foundation |
| **Copilot Execution** | Slack bot + web UI walks the engineer through the runbook step by step. Auto-executes 🟢 steps. Prompts for 🟡. Blocks 🔴 until the engineer types an explicit confirmation. | P0 — Core value prop |
| **Alert-to-Runbook Matching** | PagerDuty/OpsGenie webhook integration. Alert fires → dd0c/run matches it to the most relevant runbook via keyword + metadata + basic semantic similarity. Posts in the incident Slack channel. | P0 — "The runbook finds you" |
| **Alert Context Injection** | Matched runbook arrives pre-populated: affected service, region, recent deploys, related metrics. No manual lookup. | P0 — 3am brain can't look things up |
| **Rollback-Aware Execution** | Every state-changing step records its inverse command. One-click undo at any point. | P0 — Safety net |
| **Divergence Detection** | Post-incident: AI compares what the engineer did vs. what the runbook prescribed. Flags skipped steps, modified commands, unlisted actions. | P1 — Learning loop |
| **Auto-Update Suggestions** | Generates runbook diffs from divergence data. "You skipped steps 6-8 in 4/4 runs. Remove them?" | P1 — Self-improving runbooks |
| **Runbook Health Dashboard** | Coverage %, average staleness, MTTR with/without runbook, step skip rates. Jordan's command center. | P1 — Management visibility |
| **Compliance Export** | PDF/CSV of timestamped execution logs, approval chains, audit trail. Not pretty, but functional. | P1 — SOC 2 readiness |
| **Prerequisite Detection** | AI identifies implicit requirements ("you need kubectl access", "make sure you're on the VPN") and surfaces a pre-flight checklist. | P2 |
| **Ambiguity Highlighter** | AI flags vague steps ("check the logs" — which logs?) and prompts the author to clarify before the runbook goes live. | P2 |

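The alert-to-runbook matching row above is, at its core, a scoring problem. A minimal sketch of the deterministic part — the weights, threshold, and field names are illustrative assumptions, not the shipped design, and the semantic-similarity term is omitted:

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    title: str
    tags: set = field(default_factory=set)      # service/region metadata
    keywords: set = field(default_factory=set)  # terms mined from the body

def match_score(alert_text: str, alert_tags: set, rb: Runbook) -> float:
    """Blend keyword overlap with metadata (tag) overlap.

    A production matcher would add an embedding-similarity term;
    this sketch keeps only the deterministic parts.
    """
    words = set(alert_text.lower().split())
    kw = len(words & rb.keywords) / max(len(rb.keywords), 1)
    tags = len(alert_tags & rb.tags) / max(len(rb.tags), 1)
    return 0.6 * kw + 0.4 * tags  # weights are illustrative

def best_match(alert_text, alert_tags, runbooks, threshold=0.5):
    """Return the highest-scoring runbook, or None below the threshold."""
    scored = [(match_score(alert_text, alert_tags, rb), rb) for rb in runbooks]
    score, rb = max(scored, key=lambda pair: pair[0])
    return (rb, score) if score >= threshold else (None, score)

rb = Runbook("Payment Service Latency",
             tags={"payment-svc", "us-east-1"},
             keywords={"payment", "latency", "timeout"})
alert = "CRITICAL payment-service latency > 5000ms timeout"
match, score = best_match(alert, {"payment-svc"}, [rb])  # match is rb
```

The threshold is what turns "best guess" into "the runbook finds you": below it, dd0c/run stays silent rather than posting a wrong runbook into an incident channel.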
**What V1 does NOT include:** Terminal Watcher (V2), Full Autopilot (V2), Confluence/Notion crawlers (V2 — V1 is paste), Incident Simulator (V3), Runbook Marketplace (V3), Multi-cloud abstraction (V3), Voice-guided runbooks (V3).
#### V2: "Watch → Learn → Predict → Protect" (Months 7-9)
| Feature | Description | Unlocks |
|---------|-------------|---------|
| Terminal Watcher | Opt-in agent captures commands during incidents. AI generates runbooks from real actions. | Solves cold-start. Captures Morgan's expertise passively. |
| Confluence/Notion Crawlers | Automated discovery and sync of runbook-tagged pages. | Bulk import for large teams with 100+ runbooks. |
| Full Autopilot Mode | Runbooks with 10+ successful copilot runs and zero modifications can promote 🟢 steps to autonomous execution. | "Sleep through the night" promise. |
| dd0c/alert Deep Integration | Multi-alert correlation, enriched context passing, bidirectional feedback loop. | The platform flywheel engages. |
| Infrastructure-Aware Staleness | Cross-reference steps against live Terraform state, K8s manifests, AWS resources via dd0c/portal. | Runbooks that know when they're lying. |
| Runbook Effectiveness ML Model | Predict runbook success probability based on alert context, time of day, engineer experience. | Data-driven trust promotion. |

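The Full Autopilot promotion rule above is deliberately a dumb, auditable predicate rather than an ML judgment. A sketch, with illustrative field names and thresholds:

```python
def eligible_for_autopilot(runs: list, min_successes: int = 10) -> bool:
    """A runbook's 🟢 steps may be promoted to autonomous execution only
    after enough successful copilot runs in which the operator changed
    nothing. Any modification anywhere in the history blocks promotion."""
    successes = sum(1 for r in runs if r["outcome"] == "resolved")
    never_modified = all(not r["modified"] for r in runs)
    return successes >= min_successes and never_modified

history = [{"outcome": "resolved", "modified": False}] * 10
eligible_for_autopilot(history)          # meets both conditions
eligible_for_autopilot(history + [{"outcome": "resolved", "modified": True}])
```

The Runbook Effectiveness ML Model in the same table can later tighten this gate (e.g. require a high predicted success probability), but the hard predicate stays as the floor.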
#### V3: "Simulate → Train → Marketplace → Scale" (Months 10-12)

| Feature | Description | Unlocks |
|---------|-------------|---------|
| Incident Simulator / Fire Drills | Sandbox environment for practicing runbooks. Gamified with scores. | Viral growth. "My team's score is 94." |
| Voice-Guided Runbooks | AI reads steps aloud at 3am. Hands-free incident response. | Genuine differentiation nobody else has. |
| Runbook Marketplace | Community-contributed, anonymized templates. "How teams running EKS + RDS handle connection storms." | Network effect. Templates improve with every customer. |
| Predictive Runbook Staging | dd0c/alert detects anomaly trending toward incident → dd0c/run pre-stages runbook → on-call gets heads-up 30 min early. | The incident that never happens. |

### User Journey: Riley's 3am Page
```
3:17 AM — Phone buzzes. PagerDuty: CRITICAL — payment-service latency > 5000ms

3:17 AM — dd0c/run webhook fires. Matches alert to "Payment Service Latency Runbook" (92% confidence).

3:17 AM — Slack bot posts in #incident-2847:
   🔔 Runbook matched: Payment Service Latency
   📊 Pre-filled: region=us-east-1, service=payment-svc, deploy=v2.4.1 (2h ago)
   🟢🟡🔴 8 steps (4 safe, 3 caution, 1 danger)
   [▶ Start Copilot]

3:18 AM — Riley taps Start Copilot. Steps 1-3 (🟢 Safe) auto-execute:
   ✅ Checked pod status — 2/5 pods in CrashLoopBackOff
   ✅ Pulled logs — 847 connection timeout errors in last 5 min
   ✅ Queried pg_stat_activity — 312 idle-in-transaction connections

3:19 AM — Step 4 (🟡 Caution): "Bounce connection pool — kubectl rollout restart"
   ⚠️ This will restart all pods. ~30s downtime.
   ↩️ Rollback: kubectl rollout undo ...
   Riley taps [✅ Approve & Execute]

3:20 AM — Step 5 (🟢 Safe) auto-executes: Verify latency recovery.
   ✅ Latency recovered to baseline. All pods Running.

3:21 AM — ✅ Incident resolved. MTTR: 3m 47s.
   📝 "You skipped steps 6-8. Also ran a command not in the runbook:
       SELECT count(*) FROM pg_stat_activity"
   Suggested updates: Remove steps 6-8, add DB connection check before step 4.
   [✅ Apply Updates]

3:21 AM — Riley applies updates. Goes back to sleep. The cat didn't even wake up.
```
Previous MTTR for this incident type without dd0c/run: 38-45 minutes. With dd0c/run: under 4 minutes. That's the product.
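The 3:21 AM divergence summary in the journey above reduces to a set comparison between prescribed and executed steps. A sketch — exact string matching stands in for the fuzzier command matching a real implementation would need:

```python
def divergence_report(prescribed: list, executed: list) -> dict:
    """Compare what the runbook prescribed with what the engineer ran.
    Output categories mirror the post-incident summary: steps the
    engineer skipped, and commands the runbook never listed."""
    ran = set(executed)
    steps = set(prescribed)
    return {
        "skipped": [s for s in prescribed if s not in ran],
        "unlisted": [c for c in executed if c not in steps],
    }

prescribed = ["check pods", "pull logs", "bounce pool", "page DBA", "escalate"]
executed = ["check pods", "pull logs",
            "SELECT count(*) FROM pg_stat_activity", "bounce pool"]
report = divergence_report(prescribed, executed)
# report["skipped"]  → ["page DBA", "escalate"]
# report["unlisted"] → ["SELECT count(*) FROM pg_stat_activity"]
```

The Auto-Update Suggestions feature is then just aggregation: a step that lands in `skipped` across most runs is a removal candidate; a recurring `unlisted` command is an insertion candidate.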
### Pricing

| Tier | Price | Includes |
|------|-------|----------|
| **Free** | $0 | 3 runbooks, read-along mode only (no execution), basic parsing |
| **Pro** | $29/runbook/month | Unlimited runbooks, copilot execution, Slack bot, PagerDuty/OpsGenie integration, basic dashboard, divergence detection |
| **Business** | $49/seat/month | Everything in Pro + autopilot mode (V2), API access, SSO, compliance export, audit trail, priority support |

**Pricing rationale:** The per-runbook model ($29/runbook/month) aligns cost with value — teams pay for the runbooks they actually use, not empty seats. A team with 10 active runbooks pays $290/month. As they add more runbooks and see MTTR drop, revenue grows with demonstrated value. The per-seat Business tier captures larger teams that want platform features (SSO, compliance, API).
**Note from Party Mode:** The VC advisor recommended switching to pure per-seat pricing for simplicity. This is a valid concern. We will A/B test per-runbook vs. per-seat during beta to determine which model drives faster adoption and lower churn. The per-runbook model has the advantage of a lower entry point and direct value alignment; the per-seat model has the advantage of predictability and simpler billing.
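The A/B test has a simple quantitative backbone: for any team, the two models differ only in which count gets metered. A sketch of the comparison, using the list prices from the table above:

```python
def monthly_cost(runbooks: int, seats: int) -> dict:
    """Compare the two candidate pricing models for one team,
    at the Pro per-runbook and Business per-seat list prices."""
    return {"per_runbook": runbooks * 29, "per_seat": seats * 49}

cost = monthly_cost(runbooks=10, seats=8)
# an 8-seat team with 10 active runbooks: $290 vs. $392 per month
```

Per-runbook stays the cheaper model until a team holds more than 49/29 ≈ 1.7 active runbooks per seat — the runbooks-per-seat ratio of real beta teams is exactly the datum the A/B test should surface.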
---
## 4. GO-TO-MARKET PLAN
### Launch Strategy
dd0c/run is Phase 3 in the dd0c platform rollout (Months 4-6). It does not launch alone. It launches alongside dd0c/alert as the "On-Call Savior" bundle — because a runbook engine without alert intelligence is a document viewer, and alert intelligence without execution is a notification system. Together, they close the loop from detection to resolution.
**Prerequisite:** dd0c/cost and dd0c/route must be live and generating revenue (Phase 1, Months 1-3). These FinOps modules prove immediate, hard-dollar ROI. If we can't save a company money, we have no right to ask them to trust us with their production environment. The FinOps wedge buys the political capital for the operational wedge.
### Beachhead: Teams Drowning in On-Call
The ideal early customer has three characteristics:
1. **High incident volume.** 10+ pages per week across the team. They feel the pain daily.
2. **Existing runbooks that they know are stale.** They've tried to document. They know it's broken. They're looking for a better way.
3. **No dedicated SRE tooling team.** They can't afford to spend 3 months configuring Rundeck or learning Shoreline's DSL. They need something that works in 5 minutes.

This is the Series B/C startup with 5-15 SREs supporting 50-200 developers. They're big enough to have real infrastructure problems, small enough that every engineer feels the on-call burden personally.
**Secondary beachhead:** Compliance chasers — startups preparing for SOC 2 who need documented, auditable incident response procedures yesterday. We sell them the audit trail masquerading as an automation tool.
### The dd0c/alert → dd0c/run Upsell Path
This is the primary growth engine for dd0c/run. The conversion funnel:
```
dd0c/cost user (saves money) → trusts the platform
        │
        ▼
dd0c/alert user (reduces noise, sleeps better) → trusts the intelligence
        │
        ▼
dd0c/alert fires an alert → Slack message includes:
"📋 A runbook exists for this alert pattern. Want dd0c/run to guide you?"
        │
        ▼
Engineer clicks through → lands on Paste & Parse → 5-second wow moment
        │
        ▼
dd0c/run user (resolves incidents faster) → trusts the execution
        │
        ▼
dd0c/portal user (owns the full developer experience) → locked in
```
Every dd0c/alert notification becomes a dd0c/run acquisition channel. The upsell is embedded in the product, not in a sales email.
### Growth Loops
**Loop 1: The Parsing Flywheel (Product-Led)**

Engineer pastes runbook → AI parses in 5 seconds → "Wow" → pastes 5 more → invites teammate → teammate pastes theirs → team has 20 runbooks in a week → first incident uses copilot → MTTR drops → team is hooked.

*Fuel:* The 5-second parse moment must be so good that engineers paste runbooks for fun.

**Loop 2: The Incident Evidence Loop (Manager-Led)**

Jordan sees MTTR data → shows leadership → "With dd0c/run: 6 minutes. Without: 38 minutes." → leadership asks "Why don't all teams use this?" → org-wide rollout → more teams = more runbooks = better matching = better MTTR.

*Fuel:* The MTTR comparison chart. One number that justifies the budget.

**Loop 3: The Open-Source Wedge (Community-Led)**

Release `ddoc-parse` — a free, open-source CLI that parses runbooks locally. No account needed. No SaaS. Engineers who love it self-select into the beta. Their runbooks (anonymized) improve the parsing model. The CLI gets better. More users. More conversions.

*Fuel:* A genuinely useful free tool, not a crippled demo.

**Loop 4: The Knowledge Capture Loop (Retention)**

Morgan's expertise captured in dd0c/run → Morgan leaves → Riley handles incident using Morgan's captured knowledge → team realizes dd0c/run IS their institutional memory → switching cost becomes infinite → renewal is automatic.

*Fuel:* The "Ghost of Morgan" moment — the first time a junior resolves an incident using a runbook generated from a senior's session.

### Content Strategy
**Engineering-as-marketing.** Developers use adblockers and hate salespeople. We don't sell to them. We teach them.

| Content | Channel | Purpose |
|---------|---------|---------|
| "The Anatomy of a 3am Page" — blog post with real data on cognitive impairment during nighttime incidents | Blog, Hacker News, r/sre | Thought leadership. Establishes the problem before pitching the solution. |
| `ddoc-parse` open-source CLI | GitHub, Product Hunt | Free tool that demonstrates AI parsing quality. Acquisition funnel. |
| "Your Runbooks Are Lying to You" — analysis of runbook staleness rates across 100 teams | Blog, SRE Weekly newsletter | Data-driven content that managers share internally. |
| Conference lightning talks (SREcon, KubeCon, DevOpsDays) | In-person | 5-minute talk ending with beta signup QR code. |
| Incident postmortem outreach | Direct DM | Companies publishing postmortems are self-selecting. "I read your Redis incident writeup. We're building something that would have cut your MTTR in half." |
| Pre-seeded runbook templates (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure) | In-product, GitHub | Solve the cold-start problem. Demonstrate value before the user pastes anything. |

### 90-Day Launch Timeline

| Day | Milestone |
|-----|-----------|
| **1-14** | Private alpha with 5 hand-picked teams from dd0c/cost user base. Paste & Parse + basic Copilot in Slack. Gather parsing quality feedback. |
| **15-30** | Iterate on parsing accuracy based on alpha feedback. Add PagerDuty webhook integration. Add risk classification validation (deterministic scanner). Ship divergence detection. |
| **31-45** | Expand to 15-20 beta teams. Launch `ddoc-parse` open-source CLI. Begin collecting MTTR data. Add health dashboard for Jordan persona. |
| **46-60** | Beta teams running in production. First MTTR comparison data available. Begin compliance export feature. Publish "Anatomy of a 3am Page" blog post. |
| **61-75** | Refine based on beta feedback. A/B test pricing models (per-runbook vs. per-seat). Secure 3+ case study commitments. Ship dd0c/alert integration (webhook-based). |
| **76-90** | Public launch. Product Hunt launch. Hacker News "Show HN" post. Conference talk submissions. Convert beta teams to paid. Target: 50 teams with ≥ 5 active runbooks. |

---
## 5. BUSINESS MODEL
### Revenue Model
**Primary:** $29/runbook/month (Pro tier) or $49/seat/month (Business tier).
A team with 10 active runbooks on Pro pays $290/month. A team of 8 SREs on Business pays $392/month. Revenue scales with demonstrated value — more runbooks means more incidents resolved faster, which means higher willingness to pay.
**Secondary:** The dd0c platform bundle. Teams using dd0c/cost + dd0c/alert + dd0c/run together represent an average deal size of $500-800/month. The platform is stickier than any individual module.
### Unit Economics

| Metric | Value | Notes |
|--------|-------|-------|
| **COGS per runbook/month** | ~$3-5 | LLM inference (via dd0c/route, optimized model selection), compute for Rust API + agent coordination, PostgreSQL storage. Parsing is a one-time cost per runbook; execution inference is per-incident. |
| **Gross Margin** | ~85% | SaaS-standard. The Rust stack keeps infrastructure costs low. LLM costs are the primary variable, managed by dd0c/route. |
| **CAC (Target)** | < $500 | Product-led growth via `ddoc-parse` CLI + dd0c/alert upsell. No outbound sales team. Content marketing + community seeding. |
| **LTV (Target)** | > $5,000 | 18+ month retention (institutional memory lock-in). Average $290/month × 18 months = $5,220. |
| **LTV:CAC Ratio** | > 10:1 | Healthy for bootstrapped SaaS. The dd0c/alert upsell path has near-zero incremental CAC. |
| **Payback Period** | < 2 months | At $290/month with $500 CAC, payback in ~1.7 months. |

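The targets in the table are mutually consistent, which is easy to check. A sketch of the arithmetic (simple undiscounted revenue LTV; a margin-adjusted payback is included because the headline ~1.7-month figure ignores gross margin):

```python
def unit_economics(mrr: float, cac: float, months_retained: int,
                   gross_margin: float = 0.85) -> dict:
    """Back-of-envelope SaaS unit economics for one team."""
    ltv = mrr * months_retained              # undiscounted revenue LTV
    return {
        "ltv": ltv,
        "ltv_cac": ltv / cac,
        "payback_months": cac / mrr,                       # revenue payback
        "payback_margin_adj": cac / (mrr * gross_margin),  # gross-profit payback
    }

m = unit_economics(mrr=290, cac=500, months_retained=18)
# m["ltv"] == 5220, matching the LTV target row
```

At $290/month, $500 CAC, and 18-month retention this yields LTV:CAC of 10.44 and revenue payback of ~1.7 months; on an 85% margin basis payback stretches to ~2.0 months, still inside the table's target.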
### Path to Revenue Milestones
**$10K MRR (Month 8 — 4 months post-launch)**

- 35 Pro teams × 10 runbooks × $29 = $10,150/month
- Source: Beta conversions (15-20 teams) + dd0c/alert upsell (10-15 teams) + organic from `ddoc-parse` CLI (5-10 teams)
- Key assumption: 70% beta-to-paid conversion rate

**$50K MRR (Month 14 — 10 months post-launch)**

- 120 Pro teams ($34,800) + 30 Business teams ($14,700) = $49,500/month
- Source: Platform flywheel engaged. dd0c/alert → dd0c/run conversion running at 25%. Community templates driving organic acquisition. First conference talks generating inbound.
- Key assumption: < 5% monthly churn (institutional memory lock-in)

**$100K MRR (Month 20 — 16 months post-launch)**

- 200 Pro teams ($58,000) + 80 Business teams ($39,200) + 5 custom enterprise ($10,000) = $107,200/month
- Source: Runbook Marketplace (V3) creating network effects. Multi-team deployments within companies. SOC 2 compliance driving Business tier upgrades.
- Key assumption: Average expansion revenue of 30% (teams add runbooks and seats over time)
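
Each milestone above decomposes into the same formula. A sketch that checks the arithmetic — 10 runbooks per Pro team and 10 seats per Business team are the working assumptions implied by the figures:

```python
def mrr(pro_teams: int = 0, business_teams: int = 0, enterprise: int = 0,
        runbooks_per_team: int = 10, seats_per_team: int = 10) -> int:
    """Blended MRR across tiers at the list prices from the Pricing section."""
    per_runbook, per_seat = 29, 49
    return (pro_teams * runbooks_per_team * per_runbook
            + business_teams * seats_per_team * per_seat
            + enterprise)

mrr(pro_teams=35)                                         # $10,150 milestone
mrr(pro_teams=120, business_teams=30)                     # $49,500 milestone
mrr(pro_teams=200, business_teams=80, enterprise=10_000)  # $107,200 milestone
```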
### Solo Founder Constraints
This is the hardest product in the dd0c lineup to support as a solo founder. The reasons are structural:
1. **Production safety liability.** If dd0c/run contributes to a production outage, the reputational damage extends to the entire dd0c brand. There is no "move fast and break things" with a product that touches production. Every release must be paranoid.

2. **Support burden.** When a customer's weird custom Kubernetes setup doesn't play nice with the Rust agent, that's a high-urgency, high-complexity support ticket at 3am. Unlike dd0c/cost (where a bug means a wrong number on a dashboard), a dd0c/run bug means a failed incident response.

3. **Security surface area.** The VPC agent is open-source and auditable, but it's still a binary running inside customer infrastructure. A CVE in the agent is an existential event. Security reviews from enterprise customers will be thorough and time-consuming.

**Mitigations:**

- **Shared platform architecture.** One API gateway, one auth layer, one billing system, one OpenTelemetry ingest pipeline across all dd0c modules. If you build six separate data models, you burn out in 14 months.
- **V1 scope discipline.** Copilot-only. No Autopilot. No Terminal Watcher. No crawlers. The smaller the surface area, the smaller the support burden.
- **Community-driven templates.** Pre-seed 50 high-quality templates for standard infrastructure. Let the community maintain and improve them. Reduce the "my setup is unique" support tickets.
- **Aggressive kill criteria.** If the support burden exceeds 10 hours/week within the first 3 months, re-evaluate the agent architecture. Consider a managed-execution model where the SaaS handles execution via customer-provided cloud credentials (higher trust barrier, lower support burden).

---
## 6. RISKS & MITIGATIONS
### Risk 1: LLM Hallucination Causes a Production Outage
**Severity:** 10/10 — Extinction-level event for the company.

**Probability:** Medium (with mitigations), High (without).

**The scenario:** The runbook says "Restart the pod." The LLM hallucinates and outputs `kubectl delete deployment` instead, classifying it as 🟢 Safe. The tired engineer clicks approve. Production goes down. The customer cancels. dd0c goes bankrupt from reputational damage.
**Mitigations:**

- Deterministic regex/AST scanner validates every command against known destructive patterns. The scanner overrides the LLM classification. Always. The LLM is advisory; the scanner is authoritative.
- Agent-level command whitelist. Anything outside the whitelist is blocked at the agent, regardless of what the SaaS sends. Defense in depth.
- 🔴 Dangerous actions require typing the resource name to confirm. Not clicking a button. UI friction is a feature.
- Every state-changing step records its rollback command. One-click undo.
- V1 limits auto-execution to 🟢 Safe (read-only diagnostic) commands only. State changes always require human approval.
- Comprehensive logging of every command suggested, approved, executed, and rolled back. Full audit trail.
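
The rollback mitigation above implies a small amount of bookkeeping at execution time: every state-changing step carries its inverse, and undo replays the inverses in reverse order. A sketch — class, field, and command strings are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    command: str
    inverse: Optional[str]  # None ⇒ read-only step, nothing to undo

class Execution:
    """Records the inverse of every state-changing step so the operator
    always has a one-click undo path."""
    def __init__(self):
        self.undo_stack = []

    def run(self, step: Step):
        # ... the agent would execute step.command here ...
        if step.inverse is not None:
            self.undo_stack.append(step.inverse)

    def rollback(self):
        """Undo in reverse order of execution."""
        return list(reversed(self.undo_stack))

ex = Execution()
ex.run(Step("kubectl rollout restart deploy/payment-svc",
            "kubectl rollout undo deploy/payment-svc"))
ex.run(Step("kubectl get pods", None))  # read-only: nothing recorded
ex.run(Step("kubectl scale deploy/payment-svc --replicas=8",
            "kubectl scale deploy/payment-svc --replicas=5"))
```

A step whose inverse the parser cannot determine should be treated as 🔴 by default: no recorded undo means no safety net.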
**Kill condition:** If the LLM misclassifies a destructive command as Safe (🟢) even once during beta, or if the false-positive rate for safe commands exceeds 0.1%, the product is killed or fundamentally re-architected to remove LLMs from the classification path.
### Risk 2: PagerDuty / Incident.io Ships Native AI Runbook Automation
**Severity:** 8/10.

**Probability:** High — they will build something. The question is when and how good.

**The scenario:** PagerDuty acquires a small AI runbook startup and bundles it into their Enterprise tier for "free" (subsidized by their massive license cost), choking off dd0c/run's distribution.
**Mitigations:**

- Platform agnosticism. dd0c/run works with PagerDuty, OpsGenie, Grafana OnCall, and any webhook source. PagerDuty's automation is locked to PagerDuty.
- Cross-module data advantage. PagerDuty can't integrate with dd0c/cost anomalies, dd0c/drift detection, or dd0c/portal service ownership. We have the context. They have the routing rules.
- Mid-market pricing. PagerDuty's automation is an enterprise upsell ($$$). We sell to the on-call engineer at 3am for $29/runbook.
- The Resolution Pattern Database. PagerDuty keeps data siloed per enterprise. We anonymize and share the "ideal" runbooks across the mid-market. Network effect they can't replicate without cannibalizing their enterprise model.

**Pivot option:** If PagerDuty ships a compelling native solution, double down on the dd0c ecosystem play. dd0c/run becomes the execution arm of dd0c/alert + dd0c/drift + dd0c/portal — a tightly coupled platform that no single-feature bolt-on can touch.
### Risk 3: Teams Don't Have Documented Runbooks (Cold Start Problem)
**Severity:** 7/10.

**Probability:** High — many teams have zero runbooks.

**The scenario:** A prospect signs up, goes to the "Paste Runbook" screen, and realizes they have nothing to paste. Churn happens in 60 seconds.
**Mitigations:**

- Pre-seed the platform with 50 high-quality templates for standard infrastructure (AWS RDS failover, K8s CrashLoopBackOff, Redis memory pressure, cert expiry, etc.). New users see value before they paste anything.
- Slack Thread Distiller (V1): Paste a Slack thread URL from a past incident. AI extracts the resolution commands and generates a draft runbook. If they have incidents, they have Slack threads.
- Postmortem-to-Runbook Pipeline (V1): Feed in a postmortem doc. AI extracts "what we did to fix it" and generates a structured runbook.
- Terminal Watcher (V2): Captures commands during live incidents and generates runbooks automatically.
- Shift marketing from "Automate your runbooks" to "Generate your missing runbooks." The product creates runbooks, not just executes them.

### Risk 4: The "Agentic AI" Obsolescence Event
**Severity:** High.

**Probability:** Low (in the next 3 years).

**The scenario:** Autonomous AI agents (Devin, GitHub Copilot Workspace, Pulumi Neo) can detect and fix infrastructure issues without human intervention. Who needs runbooks?
**Mitigations:**

- Runbooks become the "policy" that defines what the agent *should* do. They're the bridge between human intent and agent execution. We pivot from "human automation" to "agent policy management."
- Position dd0c/run as the control plane for agentic operations — the system that defines, constrains, and audits what AI agents are allowed to do in production.
- The Trust Gradient already models this transition: Read-Only → Copilot → Autopilot is the same spectrum as Human-Driven → Human-on-the-Loop → Agent-Driven.

### Risk 5: Solo Founder Scaling / The Bus Factor
**Severity:** High.

**Probability:** High — Brian is building six products.

**The scenario:** The support burden of a production-safety-critical product overwhelms a solo founder. A critical bug in the VPC agent requires immediate response at 3am. Brian burns out.
**Mitigations:**

- Shared platform architecture reduces per-module engineering overhead by 60%+.
- V1 scope discipline: Copilot-only, no Autopilot, no crawlers. Smallest possible surface area.
- Open-source the Rust agent. Community contributions for edge-case Kubernetes configurations. Community security audits.
- Aggressive automation of support: self-healing agent updates, comprehensive error messages, in-product diagnostics.
- If dd0c/run reaches $50K MRR, hire a dedicated SRE for agent support. This is the first hire, non-negotiable.

### The Catastrophic Scenario and How to Prevent It
**The nightmare:** dd0c/run's AI suggests a destructive command. A sleep-deprived engineer approves it. Production goes down for a major customer. The incident gets posted on Hacker News. The dd0c brand — across all six modules — is destroyed. Not just dd0c/run. Everything.
**Prevention (defense in depth):**

1. **Layer 1 — LLM Classification:** AI labels every step with risk level. This is the first pass, and it's the least trusted.
2. **Layer 2 — Deterministic Scanner:** Regex/AST pattern matching against known destructive commands (`delete`, `drop`, `rm -rf`, `kubectl delete namespace`, etc.). Overrides LLM. Catches hallucinations.
3. **Layer 3 — Agent Whitelist:** The Rust agent maintains a local whitelist of allowed command patterns. Anything not on the whitelist is rejected at the agent level, regardless of what the SaaS sends. The agent doesn't trust the cloud.
4. **Layer 4 — UI Friction:** 🟡 commands require click-to-approve. 🔴 commands require typing the resource name. No "approve all" button. Ever.
5. **Layer 5 — Rollback Recording:** Every state-changing command has a recorded inverse. One-click undo. The safety net.
6. **Layer 6 — Audit Trail:** Every command suggested, approved, modified, executed, and rolled back is logged with timestamps, user identity, and alert context. Full forensic capability.

If all six layers fail simultaneously, the product deserves to die. But they won't fail simultaneously — that's the point of defense in depth.
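
Layers 1-3 compose into a single classification function in which later layers can only make the verdict stricter, never looser. A sketch — the patterns, labels, and rule set are illustrative, not the shipped scanner:

```python
import re

DESTRUCTIVE = [                      # Layer 2: deterministic patterns
    r"\brm\s+-rf\b",
    r"\bkubectl\s+delete\b",
    r"\bdrop\s+(table|database)\b",
]
WHITELIST = [                        # Layer 3: agent-side allow-list
    r"^kubectl\s+(get|describe|logs|top|rollout)\b",
    r"^select\b",
]

def classify(command: str, llm_label: str) -> str:
    """The LLM label (Layer 1) is advisory; the scanner and the agent
    whitelist are authoritative."""
    if any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE):
        return "danger"              # scanner overrides LLM. Always.
    if not any(re.match(p, command, re.IGNORECASE) for p in WHITELIST):
        return "blocked"             # unknown commands die at the agent
    return llm_label                 # the LLM opinion survives only here

# A hallucinated "safe" label on a destructive command is still caught:
classify("kubectl delete deployment payment-svc", llm_label="safe")  # "danger"
```

Note how the two deterministic layers fail differently: the scanner catches known-bad commands, while the whitelist catches everything it has never seen (e.g. `terraform destroy` would be `blocked` here even though no destructive pattern matches it).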
---
## 7. SUCCESS METRICS
### North Star Metric
**Incidents resolved via dd0c/run copilot per month.**
This single metric captures adoption (teams are using it), trust (engineers choose copilot over skipping), and value (incidents are actually getting resolved). If this number grows, everything else follows.
### Leading Indicators

| Metric | Target (Month 6) | Why It Matters |
|--------|------------------|----------------|
| Time-to-First-Runbook | < 5 minutes | If onboarding is slow, nobody reaches the value. The Vercel test. |
| Paste & Parse success rate | > 90% | If parsing fails or requires heavy manual editing, the magic moment is broken. |
| Copilot adoption rate | ≥ 60% of matched incidents | If engineers bypass copilot, the product isn't trusted or isn't useful. |
| Risk classification accuracy | > 99.9% (zero false-safe on destructive commands) | The safety foundation. One misclassification and we're done. |
| Weekly active runbooks per team | ≥ 5 | The product is alive, not shelfware. |
| Runbook update acceptance rate | ≥ 30% of suggestions | The learning loop is working. Runbooks are improving. |

### Lagging Indicators

| Metric | Target (Month 6) | Why It Matters |
|--------|------------------|----------------|
| MTTR reduction | ≥ 40% vs. baseline | The headline number. "Teams using dd0c/run resolve incidents 40% faster." |
| NPS from on-call engineers | > 50 | Riley actually likes this. Not just tolerates it. |
| Monthly churn | < 5% | Institutional memory lock-in is working. |
| Expansion revenue | > 20% | Teams adding runbooks and seats over time. |
| Zero safety incidents | 0 | dd0c/run never made an incident worse. Non-negotiable. |

### 30/60/90 Day Milestones
**Day 30: Prove the Parse**

- 15-20 beta teams onboarded
- Paste & Parse working with > 90% accuracy across diverse runbook formats
- PagerDuty webhook integration live
- Risk classification validated: zero false-safe misclassifications on destructive commands
- First MTTR data points collected
- Success criteria: Engineers say "wow" when they paste their first runbook

**Day 60: Prove the Pilot**

- Beta teams running copilot in production incidents
- MTTR reduction ≥ 30% for at least 8 teams
- Divergence detection generating useful runbook update suggestions
- Health dashboard live for Jordan persona
- dd0c/alert webhook integration functional
- `ddoc-parse` open-source CLI launched
- Success criteria: At least one engineer says "I actually slept through the night because dd0c/run handled the diagnostics"

**Day 90: Prove the Business**

- 50 teams with ≥ 5 active runbooks
- MTTR reduction ≥ 40% for at least 12 teams
- 3+ teams committed as named case studies
- Pricing model validated (per-runbook vs. per-seat A/B test complete)
- Zero safety incidents across all beta teams
- Public launch executed (Product Hunt, Hacker News, conference submissions)
- $10K MRR trajectory confirmed
- Success criteria: Beta-to-paid conversion rate ≥ 70%

### Kill Criteria
The product is killed or fundamentally re-architected if any of the following occur:
1. **Safety failure.** The LLM misclassifies a destructive command as Safe (🟢) during beta. Even once.
2. **Trust failure.** Engineers skip copilot mode > 50% of the time after 30 days. The product isn't trusted.
3. **Parse failure.** Paste & Parse accuracy stays below 80% after 60 days of iteration. The core AI capability doesn't work.
4. **Adoption failure.** Fewer than 8 beta teams active after 4 weeks. The problem isn't painful enough or the solution isn't compelling enough.
5. **MTTR failure.** MTTR reduction < 20% or inconsistent across teams after 60 days. The product doesn't deliver measurable value.
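
Because every criterion above is a hard threshold, the whole list can be evaluated mechanically at the end of each beta week. A sketch — metric keys are illustrative, thresholds are taken directly from the list:

```python
def kill_criteria_breached(metrics: dict) -> list:
    """Return the names of any kill criteria the current beta
    metrics have tripped (empty list ⇒ keep shipping)."""
    checks = {
        "safety":   metrics["false_safe_destructive"] > 0,
        "trust":    metrics["copilot_skip_rate"] > 0.50,
        "parse":    metrics["parse_accuracy"] < 0.80,
        "adoption": metrics["active_beta_teams"] < 8,
        "mttr":     metrics["mttr_reduction"] < 0.20,
    }
    return [name for name, breached in checks.items() if breached]

beta = {"false_safe_destructive": 0, "copilot_skip_rate": 0.35,
        "parse_accuracy": 0.91, "active_beta_teams": 14,
        "mttr_reduction": 0.42}
kill_criteria_breached(beta)  # → [] — no criterion breached this week
```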
If we hit a kill criterion, the pivot options are:

- **Pivot to read-only intelligence:** Strip execution entirely. Become the "runbook quality platform" — parsing, staleness detection, coverage dashboards, compliance evidence. Lower risk, lower value, but viable.
- **Pivot to agent policy management:** If agentic AI arrives faster than expected, position dd0c/run as the policy layer that defines what AI agents are allowed to do in production.
- **Absorb into dd0c/portal:** The Contrarian from Party Mode was right about one thing — if dd0c/run can't stand alone, it becomes a feature of the IDP, not a product.

---
## APPENDIX: RESOLVED CONTRADICTIONS ACROSS PHASES
| Contradiction | Brainstorm Position | Party Mode Position | Resolution |
|--------------|---------------------|---------------------|------------|
| Standalone product vs. portal feature | Standalone with ecosystem integration | Contrarian argued it's a portal feature, not a product | **Standalone with kill-criteria pivot to portal.** Launch as standalone to test market demand. If adoption fails, absorb into dd0c/portal as a feature. |
| Per-runbook vs. per-seat pricing | $29/runbook/month | VC advisor recommended per-seat for simplicity | **A/B test during beta.** Per-runbook aligns cost with value; per-seat is simpler. Let data decide. |
| V1 execution scope | Full copilot with 🟢🟡🔴 approval gates | CTO demanded no execution until deterministic validation exists; Bootstrap Founder said copilot-only | **V1 auto-executes 🟢 only. 🟡🔴 require human approval. Deterministic scanner overrides LLM.** Synthesizes both positions. |
| Confluence/Notion crawlers in V1 | Design Thinking included crawlers as V1 | Innovation Strategy said "do not build crawlers; force the user to paste" | **Paste-only in V1. Crawlers are V2.** Solo founder can't maintain integration APIs for V1. Paste is the 5-second wow moment anyway. |
| Cold start solution | Slack Thread Scraper in V1 | Terminal Watcher in V1 | **Slack Thread Distiller in V1. Terminal Watcher deferred to V2.** Slack threads require no agent installation (lower trust barrier). Terminal Watcher requires an agent on the engineer's machine — too much friction for V1. |
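The "deterministic scanner overrides LLM" resolution for the V1 execution scope can be sketched as an escalate-only rule: the LLM proposes a tier, but a hard-coded pattern scanner can only push a command toward 🔴, never downgrade it. This is a hypothetical illustration — the patterns and function names are ours, not dd0c's actual implementation:

```python
import re

# Deterministic deny-list (illustrative, not exhaustive). A match here
# wins over the LLM's classification, no matter what the LLM said.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+(-\w*\s+)*-\w*r",       # recursive delete variants
    r"\bdrop\s+(table|database)\b",  # SQL drops
    r"\bterraform\s+destroy\b",
    r"\bkubectl\s+delete\b",
]

def classify(command: str, llm_tier: str) -> str:
    """Return the effective tier: 🟢 auto-execute, 🟡/🔴 human approval.

    The scanner can only escalate (anything → 🔴); it never makes a
    command safer than the LLM's proposal.
    """
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return "🔴"   # scanner override: always requires approval
    return llm_tier       # no hit: defer to the LLM's tier

def may_auto_execute(command: str, llm_tier: str) -> bool:
    """V1 policy: only 🟢 runs without a human in the loop."""
    return classify(command, llm_tier) == "🟢"
```

The escalate-only asymmetry is the point: a false 🔴 costs one approval click, while a false 🟢 is a kill criterion.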
---
*This brief synthesizes insights from four prior development phases: Brainstorm (Carson, Venture Architect), Design Thinking (Maya, Design Maestro), Innovation Strategy (Victor, Disruption Oracle), and Party Mode Advisory Board (5-person expert panel). All contradictions have been identified and resolved with rationale.*
*dd0c/run is the most safety-critical module in the dd0c platform. This brief reflects that gravity. Build it paranoid. Assume the AI wants to delete production. Constrain it accordingly. Then ship it — because the 3am pager isn't going to fix itself.*
---

`projectlocker-market-analysis.md`
# ProjectLocker.com — SVN Hosting Market Analysis
**Date:** February 28, 2026
**Prepared for:** Brian
**Classification:** Brutally honest

---
## 1. SVN Market Status (2025–2026)
### Is SVN Still Used?

Yes, but it's a shrinking niche. SVN was the second most popular VCS in the 2022 Stack Overflow Developer Survey, but that's a distant second — Git dominates with 90%+ adoption. The gap widens every year.

**Key inflection point:** GitHub officially removed SVN protocol support on January 8, 2024. This was a loud signal to the industry that SVN is legacy technology. SourceForge publicly pledged to "never end Subversion support" — which tells you everything about who's left in this market.
### Who Still Uses SVN?
Based on Assembla's market research and RhodeCode's enterprise analysis, SVN persists in:

| Industry | Why SVN Sticks |
|---|---|
| **Semiconductors / Chip Design** | Massive binary files (GDSII layouts), centralized locking is essential |
| **Government / Defense** | Compliance mandates, change-averse procurement, classified networks |
| **Aerospace / Automotive** | DO-178C, ISO 26262 compliance — auditors know SVN, not Git |
| **Film / Animation / VFX** | Large binary assets, centralized workflow fits artist pipelines |
| **Video Games** | Large asset repos (art, audio), though Perforce dominates here |
| **Manufacturing** | CAD files, hardware design, legacy toolchains |
| **Embedded Systems** | Legacy codebases, conservative engineering culture |
| **Legacy Enterprise** | "It works, don't touch it" — the most honest reason |
### Market Size & Trajectory
There are no reliable public numbers for "SVN hosting market size." It's too small for analyst firms to track separately. What we know:

- 6sense tracks ~3,800 companies using SVN (as of their latest data), but this count includes self-hosted installations
- The hosted SVN market is a tiny fraction of that
- The market is **definitively shrinking**, not stable — every year, more teams migrate to Git
- The decline is slow, not catastrophic — regulated industries move slowly by design

**Verdict:** SVN is in managed decline. It's not dead, but it's on life support. The remaining users are sticky (compliance, legacy), but they're not growing.
---
## 2. Competitor Landscape

### Active Competitors
| Provider | Pricing | SVN Support | Status |
|---|---|---|---|
| **Assembla** | $16/user/month (Team plan, packs of 5) | SVN + Git + Perforce | **Thriving** — pivoted to Perforce hosting, enterprise focus. Actively marketing SVN-to-Perforce migration. The 800-lb gorilla in this space. |
| **Beanstalk** (Wildbit) | $15–$200/month (flat, not per-user) | SVN + Git | **Stable/coasting** — profitable, debt-free, privately owned. No visible growth push. Classic lifestyle business. |
| **Perforce TeamHub** | Enterprise pricing (free tier killed July 2025) | SVN + Git + Mercurial | **Alive but deprioritized** — Perforce cares about Helix Core, not SVN hosting. TeamHub is a side product. |
| **SourceForge** | Free (open-source projects) | SVN + Git + Mercurial | **Alive** — publicly committed to SVN forever. But only for open-source. Not a competitor for private repos. |
| **ProjectLocker** | $19–$99/month (flat) | SVN + Git | **You're here** |
### Dead / Dying Competitors
| Provider | Status |
|---|---|
| **CloudForge** | **Dead.** Shut down by Perforce. Redirected users to Helix TeamHub. |
| **RiouxSVN** | **Dead.** Domain expired October 2024. SSL cert expired September 2024. Gone. |
| **Unfuddle** | **Effectively dead.** No meaningful updates in years. |
| **GitHub (SVN bridge)** | **Dead.** Removed January 8, 2024. |
### What This Means
The competitive field is consolidating. Smaller players are dying off. Assembla is the clear market leader for hosted SVN and is actively investing in the space (they're smart — they're using SVN as a gateway drug to sell Perforce hosting at $39/user/month). Beanstalk is the other survivor, running a quiet lifestyle business.

**ProjectLocker's position:** You're in a shrinking field where the top competitor (Assembla) has more features, more marketing, and a clear upsell path. But you're also cheaper for small teams.
---
## 3. ProjectLocker Specifically

### Current Offering

From projectlocker.com (fetched Feb 2026):

| Plan | Price | Users | Storage | Projects |
|---|---|---|---|---|
| Venture | $19/mo | 5 | 5GB | 5 |
| Equity | $49/mo | 20 | 10GB | Unlimited |
| IPO | $99/mo | 50 | 25GB | Unlimited |
| Enterprise | Contact | 100+ | 25–100+GB | Unlimited |

**Features:** SVN + Git hosting, Trac (bug tracking, wiki, milestones), automatic deployment, IP-based access restrictions (SVN), fine-grained directory permissions, BuildLocker CI (Equity+), integrations (Basecamp, FogBugz, HipChat — note: some of these are themselves dead products).
### Strengths
- Simple, flat pricing (not per-user like Assembla)
- Trac integration is a differentiator for teams that use it
- Cheaper than Assembla for small teams (5 users = $19/mo vs Assembla's $80/mo minimum)
- Has been around for 15+ years — longevity signals reliability
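The flat-vs-per-user gap above is easy to make concrete. A back-of-envelope sketch in Python, using the tier prices quoted in this analysis and assuming Assembla's $16/user packs-of-5 pricing scales linearly (function names are illustrative):

```python
import math

def projectlocker_monthly(users: int) -> int:
    """Flat tiers quoted above: Venture $19 (<=5), Equity $49 (<=20), IPO $99 (<=50)."""
    if users <= 5:
        return 19
    if users <= 20:
        return 49
    if users <= 50:
        return 99
    raise ValueError("Enterprise tier: contact sales")

def assembla_monthly(users: int) -> int:
    """$16/user/month, sold in packs of 5 (simplifying assumption)."""
    packs = math.ceil(users / 5)
    return 16 * 5 * packs

for users in (5, 20, 50):
    pl, asm = projectlocker_monthly(users), assembla_monthly(users)
    print(f"{users} users: ProjectLocker ${pl}/mo vs Assembla ${asm}/mo")
```

At the top of each tier boundary the gap is roughly 4x to 8x ($19 vs $80 at 5 users, $99 vs $800 at 50), which is the strongest remaining sales argument for small teams.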
### Weaknesses
- Website feels dated (mentions HipChat, Basecamp, FogBugz — products from another era)
- No visible marketing or content strategy
- No Perforce integration (Assembla's big upsell)
- Trac is itself aging technology
- No public API documentation visible
- Integration list is stale
- No SOC 2 or compliance certifications mentioned (critical for the regulated industries that still use SVN)
### Online Reputation
- **Reddit mentions:** Sparse and old. Most references are from 2010–2018. Users mentioned ProjectLocker as a budget option for private SVN/Git repos. No recent (2023+) mentions found.
- **G2/review sites:** Minimal review presence.
- **Hacker News:** No significant mentions found.
- **General sentiment:** Not negative — just invisible. Nobody's complaining, but nobody's recommending it either.

**The silence is the story.** ProjectLocker isn't generating word-of-mouth, positive or negative. It's a ghost in the market.
---
## 4. Growth Paths — Is There ANY Way to Grow?
### Option A: SVN-to-Git Migration Services
- **Opportunity:** Real but limited. GitHub's SVN sunset (Jan 2024) created a one-time migration wave, but that's largely passed.
- **Problem:** Migration is a one-time revenue event, not recurring. And there are plenty of free tools (git-svn, svn2git) and consultancies that already do this.
- **Verdict:** Small opportunity. Not a business builder.
### Option B: Target Regulated Industries (Government, Aerospace, Defense)
- **Opportunity:** This is where the real SVN money is. These industries need:
  - SOC 2 Type II compliance
  - FedRAMP authorization (for US government)
  - ITAR compliance (defense)
  - Detailed audit trails
  - On-prem or GovCloud hosting options
- **Problem:** Getting these certifications costs $50K–$200K+ and takes 6–12 months. The sales cycle for government contracts is brutal. Assembla is already here.
- **Verdict:** The only real growth path, but requires significant investment that may not make sense for a small operation.
### Option C: Compliance-Focused SVN Hosting
- **Opportunity:** Position as "the SVN host for teams that need audit trails and compliance" without going full FedRAMP.
  - SOC 2 certification
  - Detailed commit audit logs
  - Data residency options (EU, US)
  - SSO/SAML integration
- **Problem:** Still requires investment. Assembla already does this.
- **Verdict:** More realistic than Option B, but still requires capital and effort.
### Option D: Niche Down — Game Dev / VFX / Design
- **Opportunity:** These industries deal with large binary files where SVN's centralized model and file locking actually make sense.
- **Problem:** Perforce owns this market. Game studios that care about version control use Perforce, not SVN.
- **Verdict:** Uphill battle against a dominant incumbent.
### Option E: Do Nothing, Ride the Long Tail
- **Opportunity:** SVN isn't dying tomorrow. Existing customers are sticky. Competitors are dying (CloudForge, RiouxSVN gone). As others exit, some of their customers may land on ProjectLocker.
- **Problem:** Revenue will slowly decline as customers eventually migrate to Git.
- **Verdict:** The most realistic option. See below.
---
## 5. Honest Verdict

### The Math

ProjectLocker is a small SVN hosting business in a shrinking market. Let's be real about what that means:
**The good:**
- Competitors are dying off (CloudForge dead, RiouxSVN dead, Perforce TeamHub free tier killed). Every exit potentially sends a few customers your way.
- Existing SVN users in regulated industries are extremely sticky. They won't migrate to Git unless forced.
- Flat pricing is competitive for small teams vs. Assembla's per-user model.
- Infrastructure costs for SVN hosting are low and well-understood.
**The bad:**
- The market is shrinking. Every year, fewer new SVN projects start.
- Assembla is the clear leader and is actively investing in the space.
- ProjectLocker's web presence is dated and invisible.
- No compliance certifications = locked out of the most valuable remaining customers.
### Recommendation
**Let it ride as passive income.** Here's why:

1. **The investment required to grow doesn't justify the returns.** Getting SOC 2 certified, modernizing the platform, and building a sales pipeline for regulated industries would cost $100K+ and 12+ months of effort — for a market that's shrinking 5–10% per year.

2. **The passive income play is actually decent.** As competitors die, you may pick up stragglers. Keep the lights on, keep the service reliable, and let the long tail play out. SVN won't hit zero for another decade.

3. **If you want to invest anything, invest in SEO and a website refresh.** The cheapest growth lever is making sure that when someone Googles "SVN hosting" (and they still do), ProjectLocker shows up and looks credible. Update the integrations list (remove HipChat, add Slack/Teams). Add a "migrate from CloudForge" landing page. This is a weekend project, not a capital investment.

4. **Consider a "migrate to Git" offering as a graceful exit ramp.** Help your existing customers migrate when they're ready, charge a one-time fee, and earn goodwill. This is customer service, not a growth strategy.

5. **Don't pour money into a declining market.** Brian's time and capital are better spent on something with a growth trajectory. ProjectLocker can keep generating passive income for years with minimal maintenance.
### TL;DR
SVN hosting is a slowly melting ice cube. ProjectLocker is a solid ice cube in a warming room. Don't invest in a bigger freezer — just collect the water while it lasts.
---
*Research conducted February 2026. Sources: projectlocker.com, assembla.com, beanstalkapp.com, GitHub Blog, SourceForge, 6sense, RhodeCode, G2, Reddit, Stack Overflow surveys, various industry analyses.*