dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00
commit 5ee95d8b13
51 changed files with 36935 additions and 0 deletions

# 🔥 IaC Drift Detection & Auto-Remediation — BRAINSTORM SESSION 🔥
**Facilitator:** Carson, Elite Brainstorming Specialist
**Date:** February 28, 2026
**Product:** dd0c Product #2 — IaC Drift Detection SaaS
**Energy Level:** ☢️ MAXIMUM ☢️
---
> *"Every piece of infrastructure that drifts from its declared state is a lie your system is telling you. Let's build the lie detector."* — Carson
---
## Phase 1: Problem Space 🎯 (25 Ideas)
### Drift Scenarios That Cause the Most Pain
1. **The "Helpful" Engineer** — Someone SSH'd into prod and tweaked an nginx config "just for now." That was 8 months ago. The Terraform state thinks it's vanilla. It's not. It never was again.
2. **Security Group Roulette** — A developer opens port 22 to 0.0.0.0/0 via the AWS console "for 5 minutes" to debug. Forgets. Drift undetected. You're now on Shodan. Congrats.
3. **The Auto-Scaling Ghost** — ASG scales up, someone manually terminates instances, ASG state and Terraform state disagree. `terraform apply` now wants to destroy your running workload.
4. **IAM Policy Creep** — Someone adds an inline policy via console. Terraform doesn't know. That policy grants `s3:*` to a role that should only read. Compliance audit finds it 6 months later.
5. **The RDS Parameter Drift** — Database parameters changed via console for "performance tuning." Next `terraform apply` reverts them. Production database restarts. At 2pm on a Tuesday. During a demo.
6. **Tag Drift Avalanche** — Cost allocation tags removed or changed manually. FinOps team can't attribute $40K/month in spend. CFO is asking questions. Nobody knows which team owns what.
7. **DNS Record Drift** — Route53 records edited manually during an incident. Never reverted. Terraform state is wrong. Next apply overwrites the fix. Outage #2.
8. **The Terraform Import That Never Happened** — Resources created via console during an emergency. "We'll import them later." Later never comes. They exist outside state. They cost money. Nobody knows they're there.
9. **Cross-Account Drift** — Shared resources (VPC peering, Transit Gateway attachments) modified in one account. The other account's Terraform doesn't know. Networking breaks silently.
10. **The Module Version Mismatch** — Team A upgrades a shared module. Team B doesn't. Their states diverge. Applying either one now has unpredictable blast radius.
### What Happens When Drift Goes Undetected — Horror Stories
11. **The $200K Surprise** — Drifted auto-scaling policies kept spinning up GPU instances nobody asked for. Undetected for 3 weeks. The AWS bill was... educational.
12. **The Compliance Audit Failure** — SOC 2 auditor asks "show me your infrastructure matches your declared state." It doesn't. Audit failed. Customer contract at risk. 6-figure deal on the line.
13. **The Cascading Terraform Destroy** — Engineer runs `terraform apply` on a state that's 4 months stale. Terraform sees 47 resources it doesn't recognize. Proposes destroying them. Engineer hits yes. Half of staging is gone.
14. **The Security Breach Nobody Noticed** — Drifted security group + drifted IAM role = open door. Attacker got in through the gap between declared and actual state. The IaC said it was secure. The cloud said otherwise.
15. **The "It Works On My Machine" of Infrastructure** — Dev environment Terraform matches state. Prod doesn't. "But it works in dev!" Yes, because dev hasn't drifted. Prod has been manually patched 30 times.
### Why Existing Tools Fail
16. **`terraform plan` Is Not Monitoring** — It's a point-in-time check that requires someone to run it. Nobody runs it at 3am when the drift happens. It's a flashlight, not a security camera.
17. **Spacelift/env0 Are Platforms, Not Tools** — You don't want to migrate your entire IaC workflow to detect drift. That's like buying a car to use the cup holder. $500/mo minimum for what should be a focused utility.
18. **driftctl Is Dead** — Snyk acquired it, then abandoned it. The OSS community is orphaned. README still says "beta." Last meaningful commit: ancient history.
19. **Terraform Cloud's Drift Detection Is an Afterthought** — Buried in the UI. Limited to Terraform (no OpenTofu, no Pulumi). Requires full TFC adoption. HashiCorp pricing is... HashiCorp pricing.
20. **ControlMonkey Is Enterprise-Only** — Great product, but they want $50K+ contracts and 6-month sales cycles. A 10-person startup can't even get a demo.
### The Emotional Experience of Drift
21. **2am PagerDuty + Drift = Existential Dread** — You're debugging a production issue. Nothing matches what the code says. You can't trust your own infrastructure definitions. You're flying blind in the dark.
22. **The Trust Erosion** — Every time drift is discovered, the team trusts IaC less. "Why bother with Terraform if the console changes override it anyway?" IaC adoption dies from a thousand drifts.
23. **The Blame Game** — "Who changed this?" Nobody knows. No audit trail. The console doesn't log who clicked what (unless CloudTrail is perfectly configured, which... it's not).
### Hidden Costs of Drift
24. **Debugging Time Multiplier** — Engineers spend 2-5x longer debugging issues when the actual state doesn't match the declared state. You're debugging a phantom. The code says X, reality is Y, and you don't know that.
25. **Compliance Theater** — Teams spend weeks before audits manually reconciling state. Running `terraform plan` across 50 stacks, fixing drift, re-running. This is a full-time job that shouldn't exist.
---
## Phase 2: Solution Space 🚀 (52 Ideas)
### Detection Approaches
26. **Continuous Polling Engine** — Run `terraform plan` (or equivalent) on a schedule. Every 15 min, every hour, every day. Configurable per-stack. The "security camera" approach.
27. **Event-Driven Detection via CloudTrail** — Watch AWS CloudTrail (and Azure Activity Log, GCP Audit Log) for API calls that modify resources tracked in state. Instant drift detection — no polling needed.
28. **State File Diffing** — Compare current state file against last known-good state. Detect additions, removals, and modifications without running a full plan. Faster, cheaper, less permissions needed.
29. **Git-State Reconciliation** — Compare what's in the git repo (the desired state) against what's in the cloud (actual state). The "source of truth" approach. Works across any IaC tool.
30. **Hybrid Detection** — CloudTrail for real-time alerts on high-risk resources (security groups, IAM), scheduled polling for everything else. Best of both worlds. Cost-efficient.
31. **Resource Fingerprinting** — Hash the configuration of each resource. Compare hashes over time. If the hash changes and there's no corresponding git commit, that's drift. Lightweight and fast.
32. **Provider API Direct Query** — Skip Terraform entirely. Query AWS/Azure/GCP APIs directly and compare against declared state. Eliminates Terraform plan overhead. Works even if Terraform is broken.
33. **Multi-State Correlation** — Detect drift across multiple state files that reference shared resources. If VPC in state A drifts, alert teams using states B, C, D that reference that VPC.
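Ideas 28 and 31 can be sketched together: hash each resource's attributes out of the state JSON, then diff the hashes between two snapshots to classify drift without running a full plan. A minimal sketch, assuming Terraform's standard state layout (`resources[].instances[].attributes`); the function names are illustrative, not a real API.

```python
import hashlib
import json

def fingerprint_state(state: dict) -> dict:
    """Map each resource address to a hash of its attributes.

    Assumes the standard Terraform state JSON layout:
    resources[].instances[].attributes.
    """
    prints = {}
    for res in state.get("resources", []):
        for i, inst in enumerate(res.get("instances", [])):
            addr = f'{res["type"]}.{res["name"]}[{i}]'
            # Canonical JSON (sorted keys) so the hash is stable across runs.
            blob = json.dumps(inst.get("attributes", {}), sort_keys=True)
            prints[addr] = hashlib.sha256(blob.encode()).hexdigest()
    return prints

def diff_fingerprints(old: dict, new: dict) -> dict:
    """Classify drift as added, removed, or modified resource addresses."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(a for a in old.keys() & new.keys() if old[a] != new[a]),
    }
```

Because only hashes are compared, a scheduled check like this needs no plan run and only read access to the state backend.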
### Remediation Strategies
34. **One-Click Revert** — "This security group drifted. Click here to revert to declared state." Generates and applies the minimal Terraform change. No full plan needed.
35. **Auto-Generated Fix PR** — Drift detected → automatically generate a PR that updates the Terraform code to match the new reality (when the drift is intentional). "Accept the drift" workflow.
36. **Approval Workflow** — Drift detected → Slack notification → team lead approves remediation → auto-applied. For teams that want human-in-the-loop but don't want to context-switch to a terminal.
37. **Scheduled Remediation Windows** — "Fix all non-critical drift every Sunday at 2am." Batch remediation with automatic rollback if health checks fail.
38. **Selective Auto-Remediation** — Define policies: "Always auto-revert security group changes. Never auto-revert RDS parameter changes. Ask for approval on IAM changes." Risk-based automation.
39. **Drift Quarantine** — When drift is detected on a critical resource, automatically lock it (prevent further manual changes) until the drift is resolved through IaC. Enforced guardrails.
40. **Rollback Snapshots** — Before any remediation, snapshot the current state. If remediation breaks something, one-click rollback to the drifted-but-working state. Safety net.
41. **Import Wizard** — For drift that should be accepted: auto-generate the `terraform import` commands and HCL code to bring the drifted resources into state properly. The "make it official" button.
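Idea 38's risk-based automation boils down to a policy table: resource type in, remediation action out. A toy sketch under hypothetical policy names (`auto_revert`, `require_approval`, `alert_only` are made up for illustration); first matching rule wins, and the safe default is a human in the loop.

```python
# Hypothetical policy table for selective auto-remediation.
# Ordered: the first matching resource-type prefix wins.
POLICIES = [
    ("aws_security_group", "auto_revert"),      # always revert SG drift
    ("aws_iam_", "require_approval"),           # human-in-the-loop for IAM
    ("aws_db_parameter_group", "alert_only"),   # never touch DB params
]

def remediation_action(resource_type: str, default: str = "require_approval") -> str:
    """Pick the remediation action for a drifted resource type."""
    for prefix, action in POLICIES:
        if resource_type.startswith(prefix):
            return action
    return default  # unknown types fall back to asking a human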
### Notification & Alerting
42. **Slack-First Alerts** — Rich Slack messages with drift details, blast radius, and action buttons (Revert / Accept / Snooze / Assign). Where engineers already live.
43. **PagerDuty Integration for Critical Drift** — Security group opened to the internet? That's not a Slack message. That's a page. Severity-based routing.
44. **PR Comments** — When a PR is opened that would conflict with existing drift, comment on the PR: "⚠️ Warning: these resources have drifted since your branch was created."
45. **Daily Drift Digest** — Morning email/Slack summary: "You have 3 new drifts, 7 unresolved, 2 auto-remediated overnight. Here's your drift score: 94/100."
46. **Drift Score Dashboard** — Real-time "infrastructure health score" based on % of resources in declared state. Gamify it. Teams compete for 100% drift-free status.
47. **Compliance Alert Channel** — Separate notification stream for compliance-relevant drift (IAM, encryption, logging). Auto-CC the security team. Generate audit evidence.
48. **ChatOps Remediation** — `/drift fix sg-12345` in Slack. Bot runs the remediation. No need to open a terminal or dashboard.
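Idea 42's "alert with action buttons" maps directly onto Slack's Block Kit: a `section` block for the drift summary, an `actions` block for the buttons. A minimal payload sketch; the `action_id` values and resource names are made up, and the bot that handles the button clicks is out of scope here.

```python
def drift_alert_blocks(resource: str, actor: str) -> list:
    """Build a Slack Block Kit message with Revert/Accept/Snooze buttons."""
    def button(label, action_id, style=None):
        btn = {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,
            "value": resource,  # echoed back when the button is clicked
        }
        if style:
            btn["style"] = style  # "danger" renders the button red
        return btn

    return [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f":warning: *{resource}* drifted (changed by `{actor}`)",
            },
        },
        {
            "type": "actions",
            "elements": [
                button("Revert", "drift_revert", style="danger"),
                button("Accept", "drift_accept"),
                button("Snooze", "drift_snooze"),
            ],
        },
    ]
```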
### Multi-Tool Support
49. **Terraform + OpenTofu Day 1** — These are 95% compatible. Support both from launch. Capture the OpenTofu migration wave.
50. **Pulumi Support** — Pulumi's state format is different but the concept is identical. Second priority. Captures the "modern IaC" crowd.
51. **CloudFormation Read-Only** — Many teams have legacy CFN stacks they can't migrate. At minimum, detect drift on them (CFN has a drift detection API). Don't need to remediate — just alert.
52. **CDK Awareness** — CDK compiles to CloudFormation. Understand the CDK→CFN mapping so drift alerts reference the CDK construct, not the raw CFN resource. Developer-friendly.
53. **Crossplane/Kubernetes** — For teams using Kubernetes-native IaC. Detect drift between desired state (CRDs) and actual cloud state. Niche but growing fast.
### Visualization
54. **Drift Heat Map** — Visual map of your infrastructure colored by drift status. Green = clean, yellow = minor drift, red = critical drift. Instant situational awareness.
55. **Dependency Graph with Drift Overlay** — Show resource dependencies. Highlight drifted resources AND everything that depends on them. "Blast radius" visualization.
56. **Timeline View** — When did each drift occur? Correlate with CloudTrail events. "This security group drifted at 3:47pm when user jsmith made a console change."
57. **Drift Trends Over Time** — Is drift getting better or worse? Weekly/monthly trends. "Your team's drift rate decreased 40% this month." Metrics for engineering leadership.
58. **Stack Health Dashboard** — Per-stack view: resources managed, resources drifted, last check time, remediation history. The "single pane of glass" for IaC health.
### Compliance Angle
59. **SOC 2 Evidence Auto-Generation** — Automatically generate compliance evidence: "100% of infrastructure changes were made through IaC. Here are the 3 exceptions, all remediated within SLA."
60. **Audit Trail Export** — Every drift event, every remediation, every approval — logged and exportable as CSV/PDF for auditors. One-click audit package.
61. **Policy-as-Code Integration** — Integrate with OPA/Rego or Sentinel. "Alert on drift that violates policy X." Not just "something changed" but "something changed AND it's now non-compliant."
62. **Change Window Enforcement** — Detect drift that occurs outside approved change windows. "Someone modified production at 2am on Saturday. That's outside the change freeze."
### Developer Experience
63. **CLI Tool (`drift check`)** — Run locally before pushing. "Your stack has 2 drifts. Fix them before applying." Shift-left drift detection.
64. **GitHub Action** — `uses: dd0c/drift-check@v1`. Run drift detection in CI. Block merges if drift exists. Free tier for public repos.
65. **VS Code Extension** — Inline drift indicators in your .tf files. "⚠️ This resource has drifted" right in the editor. Click to see details.
66. **Terraform Provider** — A Terraform provider that outputs drift status as data sources. `data.driftcheck_status.my_stack.drifted_resources`. Use drift status in your IaC logic.
67. **`drift init`** — One command to connect your stack. Auto-discovers state backend, cloud provider, resources. 60-second setup. No YAML config files.
### 🌶️ Wild Ideas
68. **Predictive Drift Detection** — ML model trained on CloudTrail patterns. "Based on historical patterns, this resource is likely to drift in the next 48 hours." Predict before it happens.
69. **Auto-Generated Fix PRs with AI Explanation** — Not just the code fix, but a natural language explanation: "This security group was opened to 0.0.0.0/0 by jsmith at 3pm. Here's a PR that reverts it and adds a comment explaining why it should stay restricted."
70. **Drift Insurance** — "We guarantee your infrastructure matches your IaC. If drift causes an incident and we didn't catch it, we pay." SLA-backed drift detection. Bold positioning.
71. **Infrastructure Replay** — Record all drift events. Replay them to understand how your infrastructure evolved outside of IaC. "Here's a movie of everything that changed in prod this month that wasn't in git."
72. **Drift-Aware Terraform Plan** — Wrap `terraform plan` to show not just what will change, but what has ALREADY changed (drift) vs what you're ABOUT to change. Split the plan output into "drift remediation" and "new changes."
73. **Cross-Org Drift Benchmarking** — Anonymous, aggregated drift statistics. "Your organization has 12% drift rate. The industry average is 23%. You're in the top quartile." Competitive benchmarking.
74. **Natural Language Drift Queries** — "Show me all security-related drift in production from the last 7 days" → instant filtered view. ChatGPT for your infrastructure state.
75. **Drift Bounties** — Gamification: assign points for fixing drift. Leaderboard. "Sarah fixed 47 drifts this month. She's the Drift Hunter champion." Make compliance fun.
76. **"Chaos Drift" Testing** — Intentionally introduce drift in staging to test your team's detection and response capabilities. Like chaos engineering but for IaC discipline.
77. **Bi-Directional Sync** — Instead of just detecting drift, offer the option to sync in EITHER direction: revert cloud to match code, OR update code to match cloud. The user chooses which is the source of truth per-resource.
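Idea 72 is closer than it sounds: the machine-readable plan from `terraform show -json <planfile>` already separates out-of-band changes (`resource_drift`) from the changes an apply would make (`resource_changes`). A sketch of the split, assuming that JSON shape:

```python
import json

def split_plan(plan_json: str) -> dict:
    """Split a machine-readable Terraform plan into two buckets:
    what has ALREADY changed (drift) vs what an apply is ABOUT to change.

    Assumes the JSON plan representation from `terraform show -json`,
    where each entry carries an "address" and a "change.actions" list.
    """
    plan = json.loads(plan_json)
    drift = [r["address"] for r in plan.get("resource_drift", [])]
    changes = [
        r["address"]
        for r in plan.get("resource_changes", [])
        if r["change"]["actions"] != ["no-op"]  # skip untouched resources
    ]
    return {"already_drifted": drift, "about_to_change": changes}
```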
---
## Phase 3: Differentiation & Moat 🏰 (18 Ideas)
78. **Focused Tool, Not a Platform** — Spacelift, env0, and TFC are platforms. We're a tool. We do ONE thing — drift detection — and we do it better than anyone. This is our positioning moat. "We're the Stripe of drift detection. Focused. Developer-friendly. Just works."
79. **Price Disruption** — $29/mo for 10 stacks vs $500/mo for Spacelift. 17x cheaper. Price is the moat for SMBs. Spacelift can't drop to $29 without cannibalizing their enterprise business.
80. **Open-Source Core** — Open-source the detection engine. Paid SaaS for dashboard, alerting, remediation, and team features. Builds community, trust, and adoption. Hard for competitors to FUD against OSS.
81. **Multi-Tool from Day 1** — Spacelift is Terraform-first. env0 is Terraform-first. We support Terraform + OpenTofu + Pulumi from launch. The "Switzerland" of drift detection. No vendor lock-in.
82. **CloudTrail Data Advantage** — The more CloudTrail data we process, the better our drift attribution and prediction models get. Network effect: more customers → better detection → more customers.
83. **Integration Ecosystem** — Deep integrations with Slack, PagerDuty, GitHub, GitLab, Jira, Linear. Become the "drift hub" that connects to everything. Switching cost = reconfiguring all integrations.
84. **Community Drift Patterns Library** — Open-source library of common drift patterns and remediation playbooks. "AWS security group drift → here's the standard remediation." Community-contributed. We host it.
85. **Self-Serve Onboarding** — No sales calls. No demos. Sign up, connect your state backend, get drift alerts in 5 minutes. Spacelift requires a sales conversation. We require a credit card.
86. **Free Tier That's Actually Useful** — 3 stacks free forever. Not a trial. Not limited to 14 days. Actually useful for small teams and side projects. Builds habit and word-of-mouth.
87. **Terraform State as a Service (Adjacent Product)** — Once we're reading state files, we can offer state management (locking, versioning, encryption) as an adjacent product. Expand the surface area.
88. **Compliance Certification Partnerships** — Partner with SOC 2 auditors. "Use dd0c drift detection and your audit evidence is pre-generated." Become the recommended tool in compliance playbooks.
89. **Education Content Moat** — Become THE authority on IaC drift. Blog posts, case studies, "State of Drift" annual report, conference talks. Own the narrative. When people think "drift," they think dd0c.
90. **API-First Architecture** — Everything we do is available via API. Customers build custom workflows on top. Creates switching costs — their automation depends on our API.
91. **Drift SLA Guarantees** — "We detect drift within 15 minutes or your month is free." Nobody else offers this. Bold, measurable, differentiated.
92. **Agent-Ready Architecture** — Build the API so AI agents (Pulumi Neo, GitHub Copilot, custom agents) can query drift status and trigger remediation programmatically. Be the drift detection layer for the agentic DevOps era.
93. **Embeddable Widget** — Let teams embed a drift status badge in their README, Backstage catalog, or internal wiki. Viral distribution through visibility.
94. **Multi-Cloud Correlation** — Detect drift across AWS + Azure + GCP simultaneously. Correlate cross-cloud dependencies. Nobody does this well.
95. **Acquisition Target Positioning** — Build something so good at drift detection that Spacelift, env0, or HashiCorp wants to acquire it rather than build it. The exit strategy IS the moat — be the best at one thing.
---
## Phase 4: Anti-Ideas & Red Team 💀 (14 Ideas)
96. **HashiCorp Builds It Natively** — Terraform 2.0 (or whatever) ships with built-in continuous drift detection. Risk: MEDIUM. HashiCorp moves slowly and their pricing alienates SMBs. OpenTofu fork means the community is fragmented. Even if they build it, it'll be Terraform-only and expensive.
97. **OpenTofu Builds It Natively** — OpenTofu adds drift detection as a core feature. Risk: LOW-MEDIUM. OpenTofu is community-driven and focused on core IaC, not SaaS features. They'd build the CLI piece, not the dashboard/alerting/remediation layer.
98. **Spacelift Launches a Free Tier** — Risk: MEDIUM-HIGH. Spacelift could offer basic drift detection for free to capture the market. Counter: their platform complexity is a liability. Free tier of a complex platform ≠ simple focused tool.
99. **"Drift Doesn't Matter" Argument** — Some teams argue that if you have good CI/CD and always apply from code, drift is impossible. Risk: LOW. This is theoretically true and practically false. Console access exists. Emergencies happen. Humans are humans.
100. **Cloud Providers Build It In** — AWS Config already does drift detection for CloudFormation. What if they extend it to Terraform? Risk: LOW. Cloud providers want you on THEIR IaC (CloudFormation, Bicep, Deployment Manager). They won't optimize for third-party tools.
101. **Security Scanners Expand Into Drift** — Prisma Cloud, Wiz, or Orca add drift detection as a feature. Risk: MEDIUM. They have the cloud access and customer base. Counter: they're security tools, not IaC tools. Drift detection would be a checkbox feature, not their core competency.
102. **The "Just Use CI/CD" Objection** — "Just run `terraform plan` in a cron job and parse the output." Risk: This is what most teams do today. It's fragile, doesn't scale, has no UI, no remediation, no audit trail. We're the productized version of this hack.
103. **State File Access Is a Blocker** — Reading Terraform state requires access to the backend (S3, Terraform Cloud, etc.). Some security teams won't grant this. Risk: MEDIUM. Counter: offer a "pull" model where the customer's CI runs our agent and pushes results. No state file access needed.
104. **Permissions Anxiety** — "I'm not giving a SaaS tool IAM access to my AWS account." Risk: HIGH. This is the #1 adoption blocker for any cloud security/management tool. Counter: read-only IAM role with minimal permissions. SOC 2 certification. Option to run agent in customer's VPC.
105. **The Market Is Too Small** — Maybe only 10,000 teams worldwide actually need dedicated drift detection. At $99/mo average, that's $12M TAM. Is that enough? Counter: drift detection is the wedge. Expand into state management, policy enforcement, IaC analytics.
106. **Terraform Is Dying** — What if the industry moves to Pulumi, CDK, or AI-generated infrastructure? Risk: LOW-MEDIUM in 3-year horizon. Terraform/OpenTofu has massive inertia. But we should be multi-tool from day 1 to hedge.
107. **AI Makes IaC Obsolete** — What if Pulumi Neo-style agents manage infrastructure directly and IaC files become unnecessary? Risk: LOW in 3 years, MEDIUM in 5 years. Even with AI agents, you need to detect when actual state diverges from intended state. The concept of drift survives even if the tooling changes.
108. **Enterprise Sales Required** — What if SMBs don't pay for drift detection but enterprises do? Then we need a sales team, which kills the bootstrap model. Counter: validate with self-serve SMB customers first. Add enterprise features (SSO, audit logs, SLAs) later.
109. **Open Source Competitor Emerges** — Someone builds an excellent OSS drift detection tool. Risk: MEDIUM. Counter: our moat is the SaaS layer (dashboard, alerting, remediation, team features), not the detection engine. If we open-source our own engine, we control the narrative.
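The cron-job hack in idea 102 usually hinges on `terraform plan -detailed-exitcode`, which does encode drift in its exit status (0 = no changes, 2 = changes pending, anything else = the plan itself failed). A sketch of that interpretation step, the part teams keep reimplementing in fragile bash; everything beyond the documented exit codes is illustrative.

```python
import subprocess

def check_stack(stack_dir: str) -> str:
    """Run `terraform plan -detailed-exitcode` in a stack and classify it."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=stack_dir,
        capture_output=True,
    )
    return classify(proc.returncode)

def classify(exit_code: int) -> str:
    """Map terraform's documented -detailed-exitcode values to a status.

    0 -> plan succeeded, no changes pending
    2 -> plan succeeded, changes pending (possible drift)
    anything else -> the plan itself errored
    """
    return {0: "clean", 2: "drifted"}.get(exit_code, "error")
```

This is roughly the productized-vs-hack gap: the exit code is the easy part; the UI, remediation, and audit trail around it are not.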
---
## Phase 5: Synthesis 🏆
### Top 10 Ideas — Ranked
| Rank | Idea # | Name | Why It's Top 10 |
|------|--------|------|-----------------|
| 🥇 1 | 30 | **Hybrid Detection (CloudTrail + Polling)** | Best-in-class detection that's both real-time AND comprehensive. This is the technical differentiator. |
| 🥈 2 | 79 | **Price Disruption ($29/mo)** | 17x cheaper than Spacelift. The single most powerful go-to-market weapon. |
| 🥉 3 | 42 | **Slack-First Alerts with Action Buttons** | Meet engineers where they are. Revert/Accept/Snooze without leaving Slack. This IS the product for most users. |
| 4 | 34 | **One-Click Revert** | The killer feature. Detect drift AND fix it in one click. Nobody else does this as a focused tool. |
| 5 | 67 | **`drift init` — 60-Second Setup** | Self-serve onboarding is the growth engine. If setup takes more than 60 seconds, you lose. |
| 6 | 80 | **Open-Source Core** | Builds trust, community, and adoption. Paid SaaS for the good stuff. Proven model (GitLab, Sentry, PostHog). |
| 7 | 86 | **Free Tier (3 Stacks Forever)** | Habit-forming. Word-of-mouth. The developer who uses it on a side project brings it to work. |
| 8 | 38 | **Selective Auto-Remediation Policies** | "Always revert security group drift. Ask for approval on IAM." Risk-based automation is the enterprise unlock. |
| 9 | 49 | **Terraform + OpenTofu + Pulumi from Day 1** | Multi-tool support = "Switzerland" positioning. Captures migration waves in all directions. |
| 10 | 59 | **SOC 2 Evidence Auto-Generation** | Compliance is the budget unlocker. "This tool pays for itself in audit prep time saved." CFO-friendly. |
### 3 Wild Cards 🃏
| Wild Card | Idea # | Name | Why It's Wild |
|-----------|--------|------|---------------|
| 🃏 1 | 68 | **Predictive Drift Detection** | ML that predicts drift before it happens. "This resource will drift in 48 hours based on historical patterns." Nobody has this. It's the future. |
| 🃏 2 | 71 | **Infrastructure Replay** | A DVR for your infrastructure. Replay every change that happened outside IaC. Forensics meets compliance meets "holy crap that's cool." |
| 🃏 3 | 70 | **Drift Insurance / SLA Guarantee** | "We guarantee detection within 15 minutes or your month is free." Turns a SaaS tool into a trust contract. Unprecedented in the space. |
### Key Themes
1. **Simplicity Is the Moat** — Every competitor is a platform. We're a tool. The market is screaming for focused, affordable, easy-to-adopt solutions. Don't build a platform. Build a scalpel.
2. **Slack Is the UI** — For 80% of users, the Slack notification with action buttons IS the product. The dashboard is secondary. Design Slack-first, dashboard-second.
3. **Price Is a Feature** — At $29/mo, drift detection becomes a no-brainer expense. No procurement process. No budget approval. Credit card and go. This is how you win SMB.
4. **Compliance Sells to Leadership** — Engineers want drift detection for operational sanity. Leadership wants it for compliance evidence. Sell both stories. The engineer adopts it bottom-up; the compliance angle gets it approved top-down.
5. **Open Source Builds Trust** — Cloud security tools face massive trust barriers ("you want access to my AWS account?!"). Open-source core + SOC 2 certification + minimal permissions = trust equation solved.
6. **Multi-Tool Is Non-Negotiable** — The IaC landscape is fragmented (Terraform, OpenTofu, Pulumi, CDK, CloudFormation). A drift detection tool that only works with one is leaving money on the table.
### Recommended V1 Focus 🎯
**V1 = "Drift Detection That Just Works"**
Ship with:
- ✅ Terraform + OpenTofu support (Pulumi in V1.1)
- ✅ Hybrid detection: CloudTrail real-time + scheduled polling
- ✅ Slack alerts with Revert / Accept / Snooze buttons
- ✅ One-click remediation (revert to declared state)
- ✅ `drift init` CLI for 60-second onboarding
- ✅ Basic web dashboard (drift list, stack health, timeline)
- ✅ Free tier: 3 stacks, daily polling, Slack alerts
- ✅ Paid tier: $29/mo for 10 stacks, 15-min polling, remediation
Do NOT ship with (save for V2+):
- ❌ Pulumi support (V1.1)
- ❌ Predictive drift detection (V2 — needs data)
- ❌ SOC 2 evidence generation (V1.5)
- ❌ VS Code extension (V2)
- ❌ Auto-generated fix PRs (V1.5)
- ❌ Policy-as-code integration (V2)
**The V1 pitch:** *"Connect your Terraform state. Get Slack alerts when something drifts. Fix it in one click. $29/month. Set up in 60 seconds."*
That's it. That's the product. Ship it. 🚀
---
**Total Ideas Generated: 109** 🔥🔥🔥
*Session complete. Carson out. Go build something that makes infrastructure engineers sleep better at night.* ✌️

# 🎷 dd0c/drift — Design Thinking Session
**Facilitator:** Maya, Design Thinking Maestro
**Date:** February 28, 2026
**Product:** dd0c/drift — IaC Drift Detection & Auto-Remediation
**Method:** Full Design Thinking (Empathize → Define → Ideate → Prototype → Test → Iterate)
---
> *"Drift is the jazz of infrastructure — unplanned improvisation that sounds beautiful to nobody. Our job isn't to kill the music. It's to give the musicians a score they actually want to follow."* — Maya
---
## Phase 1: EMPATHIZE 🎭
*Before we design a single pixel, we sit in their chairs. We wear their headphones. We feel the 2am vibration of a PagerDuty alert in our bones. Design is about THEM, not us.*
---
### Persona 1: The Infrastructure Engineer — "Ravi"
**Name:** Ravi Krishnamurthy
**Title:** Senior Infrastructure Engineer
**Company:** Mid-stage SaaS startup, 120 employees, Series B
**Experience:** 6 years in infra, 4 years writing Terraform daily
**Tools:** Terraform, AWS, GitHub Actions, Slack, Datadog, PagerDuty
**Stacks managed:** 23 (across dev, staging, prod, and 3 microservice clusters)
#### Empathy Map
**SAYS:**
- "I'll import that resource into state tomorrow." *(Tomorrow never comes.)*
- "Who changed the security group? Was it you? Was it me? I honestly don't remember."
- "Don't touch prod through the console. I'm serious. PLEASE."
- "I need to run `terraform plan` across all stacks before the release, give me... four hours."
- "The state file is the source of truth." *(Said with the conviction of someone who knows it's a lie.)*
**THINKS:**
- "If I run `terraform apply` right now, will it destroy something that someone manually created last month?"
- "I'm mass-producing YAML and HCL for a living. Is this what I went to school for?"
- "There are at least 15 resources in prod that aren't in any state file. I know it. I can feel it."
- "If this next plan shows 40+ changes I didn't expect, I'm calling in sick."
- "I should automate drift checks. I've been saying that for 8 months."
**DOES:**
- Runs `terraform plan` manually before every apply, scanning the output like a bomb technician reading a wire diagram
- Maintains a mental map of "things that have drifted but I haven't fixed yet" — a cognitive debt ledger that never gets paid down
- Writes Slack messages like "PSA: DO NOT modify anything in the `prod-networking` stack via console" every few weeks
- Spends 30% of debugging time figuring out whether the issue is in the code, the state, or reality
- Keeps a personal spreadsheet of "resources created via console during incidents that need to be imported"
**FEELS:**
- **Anxiety** before every `terraform apply` — the gap between what the code says and what exists is a minefield
- **Guilt** about the drift they know exists but haven't fixed — it's technical debt they can see but can't prioritize
- **Frustration** when colleagues make console changes — it feels like someone scribbling in the margins of a book you're trying to keep clean
- **Loneliness** at 2am when the pager goes off and nothing matches the code — you're debugging a ghost
- **Imposter syndrome** when drift causes an incident — "I should have caught this. Why didn't I catch this?"
#### Pain Points
1. **No continuous visibility** — `terraform plan` is a flashlight, not a security camera. Drift happens between plans.
2. **State file trust erosion** — Every discovered drift chips away at confidence in IaC as a practice.
3. **Manual reconciliation is soul-crushing** — Running plan across 23 stacks, reading output, triaging changes, fixing one by one. This is a full workday that produces zero new value.
4. **Blast radius fear** — Applying to a drifted stack is Russian roulette. Will it fix the drift or destroy the workaround someone built at 3am last Tuesday?
5. **No attribution** — WHO drifted this? WHEN? WHY? CloudTrail exists but correlating it to Terraform resources requires a PhD in AWS forensics.
6. **Context switching tax** — Drift discovery interrupts real work. You're building a new module and suddenly you're spelunking through state files.
#### Current Workarounds
- **Cron job running `terraform plan`** — Fragile bash script that emails output. Nobody reads the emails. The cron job broke 3 weeks ago. Nobody noticed.
- **"Drift Fridays"** — Dedicated time to reconcile state. Gets cancelled when there's a release. Which is every Friday.
- **Console access restrictions** — Tried to remove console access. Got overruled because "we need it for emergencies." The emergency is now permanent.
- **Mental model** — Ravi literally remembers which stacks are "clean" and which are "dirty." This knowledge lives in his head. When he's on vacation, nobody knows.
#### Jobs To Be Done (JTBD)
1. **When** I'm about to run `terraform apply`, **I want to** know exactly what has drifted since my last apply, **so I can** apply with confidence instead of fear.
2. **When** someone makes a console change to a resource I manage, **I want to** be notified immediately with context (who, what, when), **so I can** decide whether to revert or codify it before it becomes invisible debt.
3. **When** I'm debugging a production issue at 2am, **I want to** instantly see whether the resource in question matches its declared state, **so I can** eliminate "drift" as a variable in 30 seconds instead of 30 minutes.
4. **When** it's audit season, **I want to** generate a report showing all drift events and their resolutions, **so I can** prove our infrastructure matches our code without spending a week on manual reconciliation.
#### Day-in-the-Life Scenario: "The 2am Discovery"
*It's 2:17am. Ravi's phone buzzes. PagerDuty. `CRITICAL: API latency > 5s on prod-api-cluster`.*
*He opens his laptop, eyes half-closed. Checks Datadog. The API pods are healthy. Load balancer looks fine. But wait — the target group health checks are failing for two instances. He checks the security group. It should allow traffic on port 8080 from the ALB.*
*It doesn't.*
*The security group has been modified. Port 8080 is restricted to a specific CIDR that... isn't the ALB's subnet. Someone changed this. When? He doesn't know. Who? He doesn't know. Why? He doesn't know.*
*He opens the Terraform code. The code says port 8080 should be open to the ALB security group. The code is right. Reality is wrong. But he can't just `terraform apply` — what if there are OTHER changes in this stack he doesn't know about? What if applying reverts something else that's keeping prod alive?*
*He runs `terraform plan`. It shows 12 changes. TWELVE. He expected one. He doesn't recognize 8 of them. His stomach drops.*
*He spends the next 47 minutes reading plan output, cross-referencing with CloudTrail (which has 4,000 events in the last 24 hours for this account), trying to figure out which of these 12 changes are safe to apply and which will make things worse.*
*At 3:04am, he manually fixes the security group via the console. Adding more drift to fix drift. The irony isn't lost on him. He makes a mental note to "clean this up tomorrow."*
*Tomorrow, he won't. He'll be too tired. And the drift will compound.*
*He goes back to sleep at 3:22am. The alarm is set for 6:30am. He lies awake until 4.*
---
### Persona 2: The Security/Compliance Lead — "Diana"
**Name:** Diana Okafor
**Title:** Head of Security & Compliance
**Company:** B2B SaaS, 200 employees, SOC 2 Type II certified, pursuing HIPAA
**Experience:** 10 years in security, 3 years in cloud compliance
**Tools:** AWS Config, Prisma Cloud, Jira, Confluence, spreadsheets (so many spreadsheets)
**Responsibility:** Ensuring infrastructure matches approved configurations across 4 AWS accounts
#### Empathy Map
**SAYS:**
- "Can you prove that production matches the approved Terraform configuration? I need evidence for the auditor."
- "When was the last time someone verified there's no drift in the PCI-scoped environment?"
- "I need a change log. Not CloudTrail — I need something a human can read."
- "If we fail this audit, we lose the Acme contract. That's $2.3 million ARR."
- "I don't care HOW you fix the drift. I care that it's fixed, documented, and won't happen again."
**THINKS:**
- "I'm building compliance evidence on a foundation of sand. The Terraform code says one thing. I have no idea if the cloud matches."
- "The engineering team says 'everything is in code.' I've been in this industry long enough to know that's never 100% true."
- "If an auditor asks me to demonstrate real-time drift detection and I show them a cron job... we're done."
- "I'm spending 60% of my time on evidence collection that should be automated."
- "One undetected IAM policy change could be the difference between 'compliant' and 'breach notification.'"
**DOES:**
- Maintains a 200-row spreadsheet mapping compliance controls to infrastructure resources — updated manually, always slightly out of date
- Requests quarterly "drift audits" from the infra team — which take 2 weeks and produce a PDF that's outdated by the time it's delivered
- Reviews CloudTrail logs for unauthorized changes — drowning in noise, looking for signal
- Writes compliance narratives that say "all infrastructure changes are made through version-controlled IaC" while knowing there are exceptions
- Schedules monthly meetings with engineering to review "infrastructure hygiene" — attendance drops every month
**FEELS:**
- **Exposed** — she knows there are gaps between declared and actual state, but can't quantify them
- **Dependent** — she relies entirely on the infra team to tell her whether things have drifted, and they're too busy to check regularly
- **Anxious before audits** — the two weeks before an audit are a scramble to reconcile state, fix drift, and generate evidence
- **Frustrated by tooling** — AWS Config gives her compliance rules but not IaC drift. Prisma Cloud gives her security posture but not Terraform state comparison. Nothing connects the dots.
- **Professionally vulnerable** — if a breach happens because of undetected drift, it's her name on the incident report
#### Pain Points
1. **No continuous compliance evidence** — compliance is proven in snapshots, not streams. Between audits, she's flying blind.
2. **IaC drift is invisible to security tools** — Prisma Cloud sees the current state. It doesn't know what the INTENDED state is. Drift is the gap between intent and reality, and no security tool measures it.
3. **Manual evidence collection** — generating audit evidence requires coordinating with 3 engineering teams, running plans, collecting outputs, formatting reports. It's a part-time job.
4. **Change attribution is archaeological** — figuring out who changed what, when, and whether it was approved requires cross-referencing CloudTrail, git history, Jira tickets, and Slack messages.
5. **Compliance theater** — she suspects some "evidence" is aspirational rather than factual. The narrative says "no manual changes" but she can't verify it.
#### Current Workarounds
- **Quarterly manual audits** — engineering runs `terraform plan` across all stacks, documents drift, fixes it, re-runs. Takes 2 weeks. Results are stale immediately.
- **AWS Config rules** — catches some configuration drift but doesn't compare against Terraform intent. Generates hundreds of findings, most irrelevant.
- **Honor system** — relies on engineers to report console changes. They don't. Not maliciously — they just forget, or they think "I'll fix it in code later."
- **Pre-audit fire drill** — 2 weeks before every audit, the entire infra team drops everything to reconcile state. Productivity crater.
#### Jobs To Be Done (JTBD)
1. **When** an auditor asks for evidence that infrastructure matches declared state, **I want to** generate a real-time compliance report in one click, **so I can** demonstrate continuous compliance instead of point-in-time snapshots.
2. **When** a change is made outside of IaC, **I want to** be alerted immediately with full attribution (who, what, when, from where), **so I can** assess the compliance impact and initiate remediation before it becomes a finding.
3. **When** preparing for SOC 2 / HIPAA audits, **I want to** have an automatically maintained audit trail of all drift events and their resolutions, **so I can** eliminate the 2-week pre-audit scramble.
4. **When** evaluating our security posture, **I want to** see a real-time "drift score" across all environments, **so I can** quantify infrastructure hygiene and track improvement over time.
#### Day-in-the-Life Scenario: "The Audit Scramble"
*It's Monday morning, 14 days before the SOC 2 Type II audit. Diana opens her laptop to 3 Slack messages from the auditor: "Please provide evidence for Control CC6.1 — logical access controls match approved configurations."*
*She messages the infra team lead: "I need a full drift report across all production stacks by Wednesday." The response: "We're in the middle of a release. Can it wait until next week?" It cannot wait until next week.*
*She compromises: "Can you at least run plans on the PCI-scoped stacks?" Two days later, she gets a Confluence page with terraform plan output pasted in. It shows 7 drifted resources. Three are tagged "known — will fix later." Two are "not sure what this is." Two are "probably fine."*
*"Probably fine" is not a compliance posture.*
*She spends Thursday manually cross-referencing the drifted resources with CloudTrail to build an attribution timeline. The CloudTrail console search is painfully slow. She exports to CSV. Opens it in Excel. 47,000 rows for the last 30 days. She filters by resource ID. Finds the change. It was made by a service role, not a human. Which service? She doesn't know. More digging.*
*By Friday, she has a draft evidence document that says "7 drift events detected, 5 remediated, 2 accepted with justification." The justifications are thin. She knows the auditor will push back. She rewrites them three times.*
*The audit passes. Barely. The auditor notes "opportunity for improvement in continuous configuration monitoring." Diana knows that means "fix this before next year or you'll get a finding."*
*She adds "evaluate drift detection tooling" to her Q2 OKRs. It's February. Q2 starts in April. The drift continues.*
---
### Persona 3: The DevOps Team Lead — "Marcus"
**Name:** Marcus Chen
**Title:** DevOps Team Lead
**Company:** E-commerce platform, 400 employees, multi-region AWS deployment
**Experience:** 12 years in ops/DevOps, managing a team of 4 infra engineers
**Tools:** Terraform, AWS (3 accounts), GitHub, Slack, Linear, Datadog, PagerDuty
**Stacks managed (team total):** 67 across dev, staging, prod, and disaster recovery
#### Empathy Map
**SAYS:**
- "How much time did we spend on drift remediation this sprint? I need to report that to leadership."
- "I need visibility across all 67 stacks. Not 'run plan on each one.' A dashboard. One screen."
- "We're spending 30% of our time firefighting things that shouldn't have changed. That's not engineering, that's janitorial work."
- "I can't hire another engineer just to babysit state files. The budget isn't there."
- "If I could show leadership a metric — 'drift incidents per week' trending down — I could justify the tooling investment."
**THINKS:**
- "My team is burning out. Ravi hasn't taken a real vacation in 8 months because he's the only one who knows which stacks are 'safe' to apply."
- "I have 67 stacks and zero visibility into which ones have drifted. I'm managing by hope."
- "If we had drift detection, I could turn reactive firefighting into proactive maintenance. That's the difference between a team that ships and a team that survives."
- "Spacelift would solve this but it's $6,000/year and requires migrating our entire workflow. I can't justify that for drift detection alone."
- "One of my engineers is going to quit if I don't reduce the on-call burden. The 2am pages for drift-related issues are killing morale."
**DOES:**
- Runs weekly "stack health" meetings where each engineer reports on their stacks — this is verbal, tribal knowledge transfer disguised as a meeting
- Maintains a Linear board of "known drift" issues that never gets prioritized over feature work
- Advocates to leadership for drift detection tooling — gets told "can't you just write a script?"
- Manually assigns drift remediation during "quiet sprints" — which don't exist
- Reviews every `terraform apply` in prod personally because he doesn't trust the state
#### Pain Points
1. **No aggregate visibility** — he can't see drift across 67 stacks without asking each engineer to run plans individually. There's no "single pane of glass."
2. **Team capacity drain** — drift remediation is unplanned work that displaces planned work. He can't forecast sprint velocity because drift is unpredictable.
3. **Knowledge silos** — each engineer knows their stacks. When someone is sick or on vacation, their stacks are black boxes. Drift knowledge is tribal.
4. **Can't quantify the problem** — leadership asks "how bad is drift?" and he can't answer with data. He has anecdotes and gut feelings. That doesn't unlock budget.
5. **Tool sprawl fatigue** — his team already uses 8+ tools. Adding another platform (Spacelift, env0) means migration, training, and ongoing maintenance. He wants a tool, not a platform.
6. **On-call burnout** — drift-related incidents inflate on-call burden. His team is on a 4-person rotation. One more quit and it's unsustainable.
#### Current Workarounds
- **Weekly manual checks** — each engineer runs `terraform plan` on their stacks Monday morning. Results shared in Slack. Nobody reads each other's results.
- **"Drift budget"** — allocates 20% of each sprint to "infrastructure hygiene." In practice, it's 5% because feature work always wins.
- **Tribal knowledge** — Marcus keeps a mental model of which stacks are "high risk" for drift. He assigns on-call accordingly. This doesn't scale.
- **Post-incident drift audits** — after every drift-related incident, the team does a full audit of related stacks. Reactive, not proactive.
#### Jobs To Be Done (JTBD)
1. **When** planning sprint capacity, **I want to** see a real-time count of drift events across all stacks, **so I can** allocate remediation time accurately instead of guessing.
2. **When** reporting to engineering leadership, **I want to** show drift metrics trending over time (drift rate, mean time to remediation, stacks affected), **so I can** justify tooling investment with data instead of anecdotes.
3. **When** an engineer is on vacation, **I want to** have automated drift monitoring on their stacks, **so I can** eliminate the "bus factor" of tribal knowledge.
4. **When** evaluating new tools, **I want to** adopt something that integrates with our existing workflow (Terraform + GitHub + Slack) without requiring a platform migration, **so I can** get value in days, not months.
5. **When** a drift event occurs, **I want to** route it to the right engineer automatically based on stack ownership, **so I can** reduce my role as a human router of drift information.
#### Day-in-the-Life Scenario: "The Visibility Gap"
*It's Wednesday standup. Marcus asks the team: "Any drift issues this week?"*
*Ravi: "I found 3 drifted resources in prod-api yesterday. Fixed two, the third is complicated — someone changed the RDS parameter group and I'm not sure if reverting will restart the database."*
*Priya: "I haven't checked my stacks yet this week. I've been heads-down on the Kubernetes migration."*
*Jordan: "My stacks are clean... I think. I ran plan on Monday. But that was before the incident on Tuesday where someone opened a port via console."*
*Sam: "I'm still fixing drift from last month's audit prep. I have 4 stacks with known drift that I haven't gotten to."*
*Marcus writes in his notebook: "3 confirmed drifts (Ravi), unknown (Priya), possibly new drift (Jordan), 4 known unresolved (Sam)." He has 67 stacks. He has visibility into maybe 15 of them. The other 52 are Schrödinger's infrastructure — simultaneously drifted and not drifted until someone opens the box.*
*After standup, his manager pings him: "The VP of Engineering wants to know our 'infrastructure reliability score' for the board deck. Can you get me a number by Friday?"*
*Marcus stares at the message. He has no number. He has a notebook with question marks. He opens a spreadsheet and starts making one up — educated guesses based on the last time each stack was checked. He knows it's fiction. The VP will present it as fact.*
*He adds "evaluate drift detection tools" to his personal to-do list. It's been there for 5 months. It keeps getting bumped by the next fire.*
---
## Phase 2: DEFINE 🔍
*Now we take all that raw empathy — the 2am dread, the audit scramble, the standup question marks — and we distill it into sharp, actionable problem statements. This is where the jazz improvisation becomes a composed melody. We're not solving everything. We're finding the ONE note that resonates across all three personas.*
---
### Point-of-View (POV) Statements
**POV 1 — Ravi (Infrastructure Engineer):**
Ravi, a senior infrastructure engineer managing 23 Terraform stacks, needs a way to continuously know when his infrastructure has diverged from its declared state because the gap between code and reality is an invisible minefield that turns every `terraform apply` into an act of faith and every 2am incident into a forensic investigation against phantom configurations.
**POV 2 — Diana (Security/Compliance Lead):**
Diana, a head of security responsible for SOC 2 compliance across 4 AWS accounts, needs a way to continuously prove that infrastructure matches approved configurations because her current evidence is built on quarterly snapshots and engineer self-reporting — a house of cards that one undetected IAM policy change could collapse, taking a $2.3M customer contract with it.
**POV 3 — Marcus (DevOps Team Lead):**
Marcus, a DevOps lead managing 67 stacks through a team of 4 engineers, needs a way to see aggregate drift status across all stacks in real time because without it, he's managing infrastructure health through tribal knowledge and standup anecdotes — a system that breaks the moment someone takes a vacation, and that produces fiction when leadership asks for a number.
**POV 4 — The Composite (The Organization):**
Engineering organizations that practice Infrastructure as Code need a way to close the loop between declared state and actual state continuously because the current model — periodic manual checks, tribal knowledge, and reactive firefighting — erodes trust in IaC itself, burns out the engineers who maintain it, and creates compliance gaps that compound silently until they become audit failures or security incidents.
---
### Key Insights
**Insight 1: Drift is a trust problem, not a technical problem.**
Every undetected drift event erodes trust — trust in the state file, trust in IaC as a practice, trust in teammates not to make console changes. When trust erodes far enough, teams abandon IaC discipline entirely. "Why bother with Terraform if reality doesn't match anyway?" dd0c/drift doesn't just detect configuration changes. It restores faith in the system.
**Insight 2: The absence of data is the biggest pain point.**
Ravi can't quantify his anxiety. Diana can't prove her compliance. Marcus can't report his team's infrastructure health. All three personas suffer from the same root cause: there is no continuous, automated measurement of the gap between intent and reality. The first product that provides this number — a simple drift score — wins all three personas simultaneously.
**Insight 3: Remediation without context is dangerous.**
"Just revert it" sounds simple until you realize the drift might be a 3am hotfix that's keeping production alive. The product must present drift with CONTEXT — who changed it, when, what else depends on it, and what happens if you revert. One-click remediation is the killer feature, but one-click destruction is the killer bug.
**Insight 4: The tool/platform divide is real and exploitable.**
Every competitor (Spacelift, env0, Terraform Cloud) bundles drift detection inside a platform that requires workflow migration. Our personas don't want to change how they work. They want to ADD visibility to how they already work. The market gap isn't "better drift detection." It's "drift detection that doesn't require you to change anything else."
**Insight 5: Three buyers, one product, three stories.**
- Ravi buys with his credit card because it eliminates his 2am dread. (Bottom-up, individual pain.)
- Diana approves the budget because it generates audit evidence. (Middle-out, compliance justification.)
- Marcus champions it to leadership because it produces metrics. (Top-down, organizational visibility.)
The product is the same. The value proposition changes per persona. This is the dd0c/drift GTM unlock.
**Insight 6: Slack is the control plane, not the dashboard.**
For Ravi, the Slack alert with action buttons IS the product 80% of the time. He doesn't want to open another dashboard. He wants to see "Security group sg-abc123 drifted. [Revert] [Accept] [Snooze]" in the channel he's already in. The web dashboard exists for Diana and Marcus. Ravi lives in Slack.
---
### Core Tension: Automation vs. Control 🎸
*Here's the jazz tension — the dissonance that makes the music interesting:*
**The Automation Pull:**
- Engineers want drift to be fixed automatically. "If someone opens a port to 0.0.0.0/0, just close it. Don't ask me. I'm sleeping."
- Compliance wants continuous enforcement. "Infrastructure should ALWAYS match declared state. Period."
- Leadership wants zero drift. "Why do we have drift at all? Automate it away."
**The Control Pull:**
- Engineers fear auto-remediation will revert a hotfix that's keeping prod alive. "What if the drift is INTENTIONAL?"
- Security wants approval workflows. "We can't auto-revert IAM changes without understanding the blast radius."
- Operations wants change windows. "Don't auto-remediate during peak traffic. That's a recipe for a different kind of outage."
**The Resolution:**
The product must offer a SPECTRUM of automation, not a binary switch. Per-resource-type policies:
- **Auto-revert:** Security groups opened to 0.0.0.0/0. Always. No questions.
- **Alert + one-click:** IAM policy changes. Show me, let me decide, make it easy.
- **Digest only:** Tag drift. Tell me in the morning summary. I'll get to it.
- **Ignore:** Auto-scaling instance count changes. That's not drift, that's the system working.
This spectrum IS the product's sophistication. It's what separates dd0c/drift from a cron job running `terraform plan`.
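One hypothetical shape for such a policy file — the behaviors are the four tiers above, but the syntax and keys are invented for illustration:

```yaml
# Hypothetical per-resource-type automation policies.
# Keys and matching syntax are illustrative, not a defined schema.
policies:
  - match: { type: aws_security_group, attribute: "ingress.*.cidr_blocks", value: "0.0.0.0/0" }
    action: auto_revert   # always, no questions
  - match: { type: "aws_iam_*" }
    action: alert         # show me, one-click [Revert] in Slack
  - match: { attribute: "tags.*" }
    action: digest        # morning summary only
  - match: { type: aws_autoscaling_group, attribute: desired_capacity }
    action: ignore        # not drift, the system working
```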
---
### How Might We (HMW) Questions
**Detection:**
1. **HMW** detect drift in real-time without requiring engineers to manually run `terraform plan`?
2. **HMW** distinguish between intentional changes (hotfixes, emergency responses) and unintentional drift (mistakes, forgotten console changes)?
3. **HMW** detect drift across 67+ stacks without requiring individual stack-by-stack checks?
4. **HMW** attribute drift to a specific person, time, and action without requiring engineers to become CloudTrail forensics experts?
**Remediation:**
5. **HMW** make drift remediation a 10-second action instead of a 30-minute investigation?
6. **HMW** give engineers confidence that reverting drift won't break something else?
7. **HMW** allow teams to define per-resource automation policies (auto-revert vs. alert vs. ignore) without complex configuration?
8. **HMW** offer "accept the drift" as a first-class workflow — updating code to match reality when the change was intentional?
**Visibility & Reporting:**
9. **HMW** give team leads a single number ("drift score") that represents infrastructure health across all stacks?
10. **HMW** generate SOC 2 / HIPAA compliance evidence automatically from drift detection data?
11. **HMW** show drift trends over time so teams can measure whether their IaC discipline is improving or degrading?
12. **HMW** route drift alerts to the right engineer automatically based on stack ownership?
**Adoption & Trust:**
13. **HMW** get an engineer from zero to first drift alert in under 60 seconds?
14. **HMW** build trust with security-conscious teams who won't give a SaaS tool IAM access to their AWS accounts?
15. **HMW** make drift detection feel like a natural extension of existing workflows (Terraform + GitHub + Slack) rather than a new tool to learn?
16. **HMW** make the free tier valuable enough to create habit and word-of-mouth without giving away the business?
---

---
# dd0c/drift - V1 MVP Epics
## Epic 1: Drift Agent (Go CLI)
**Description:** The core open-source Go binary that runs in the customer's environment. It parses Terraform state, polls AWS APIs for actual resource configurations, calculates the diff, and scrubs sensitive data before transmission.
### User Stories
#### Story 1.1: Terraform State Parser
**As an** Infrastructure Engineer, **I want** the agent to parse my Terraform state file locally, **so that** it can identify declared resources without uploading my raw state to a third party.
* **Acceptance Criteria:**
* Successfully parses Terraform state v4 JSON format.
* Extracts a list of `managed` resources with their declared attributes.
* Handles both local `.tfstate` files and AWS S3 remote backend configurations.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** Use standard Go JSON unmarshaling. Create an internal graph representation. Focus exclusively on AWS provider resources.
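A minimal sketch of the parse step, assuming Terraform's published state v4 JSON layout (`version`, `resources[].mode/type/name`, `instances[].attributes`); only the fields the agent needs are modeled, and the helper name is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal slice of the Terraform state v4 schema — just enough
// structure to enumerate managed resources locally.
type state struct {
	Version   int        `json:"version"`
	Resources []resource `json:"resources"`
}

type resource struct {
	Mode      string     `json:"mode"` // "managed" or "data"
	Type      string     `json:"type"`
	Name      string     `json:"name"`
	Instances []instance `json:"instances"`
}

type instance struct {
	Attributes map[string]any `json:"attributes"`
}

// managedResources parses raw state JSON and returns only
// mode=="managed" entries, rejecting non-v4 state files.
func managedResources(raw []byte) ([]resource, error) {
	var s state
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	if s.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", s.Version)
	}
	var out []resource
	for _, r := range s.Resources {
		if r.Mode == "managed" {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"version":4,"resources":[
	  {"mode":"managed","type":"aws_security_group","name":"api",
	   "instances":[{"attributes":{"id":"sg-abc123"}}]},
	  {"mode":"data","type":"aws_ami","name":"base","instances":[]}]}`)
	rs, err := managedResources(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rs), rs[0].Type) // prints "1 aws_security_group"
}
```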
#### Story 1.2: AWS Resource Polling
**As an** Infrastructure Engineer, **I want** the agent to query AWS for the current state of my resources, **so that** it can compare reality against my declared Terraform state.
* **Acceptance Criteria:**
* Agent uses customer's local AWS credentials/IAM role to authenticate.
* Queries AWS APIs for the top 20 MVP resource types (e.g., `ec2:DescribeSecurityGroups`, `iam:GetRole`).
* Maps Terraform resource IDs to AWS identifiers.
* **Story Points:** 8
* **Dependencies:** Story 1.1
* **Technical Notes:** Use the official AWS Go SDK v2. Map API responses to a standardized internal schema that matches the state parser output. Add simple retry logic for rate limits.
#### Story 1.3: Drift Diff Calculation
**As an** Infrastructure Engineer, **I want** the agent to calculate attribute-level differences between my state file and AWS reality, **so that** I know exactly what changed.
* **Acceptance Criteria:**
* Compares parsed state attributes with polled AWS attributes.
* Outputs a structured diff showing `old` (state) and `new` (reality) values.
* Ignores AWS-generated default attributes that aren't declared in state.
* **Story Points:** 5
* **Dependencies:** Story 1.1, Story 1.2
* **Technical Notes:** Implement a deep compare function. Requires hardcoded ignore lists for known noisy attributes (e.g., AWS-assigned IDs or timestamps).
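The deep compare can be sketched as below — iterating declared attributes only (so AWS-filled defaults never count as drift), with a hardcoded ignore list whose entries here are placeholder examples of noisy attributes:

```go
package main

import (
	"fmt"
	"reflect"
)

// AttrDiff records one drifted attribute: declared (state) vs actual (AWS).
type AttrDiff struct {
	Key      string
	Old, New any
}

// Placeholder ignore list for known-noisy, AWS-generated attributes.
var ignored = map[string]bool{"arn": true, "owner_id": true}

// diffAttrs compares declared attributes against polled reality.
// Attributes present only in reality are skipped: AWS fills in many
// defaults that were never declared in state.
func diffAttrs(declared, actual map[string]any) []AttrDiff {
	var out []AttrDiff
	for k, want := range declared {
		if ignored[k] {
			continue
		}
		got, ok := actual[k]
		if !ok || !reflect.DeepEqual(want, got) {
			out = append(out, AttrDiff{Key: k, Old: want, New: got})
		}
	}
	return out
}

func main() {
	declared := map[string]any{"from_port": 8080, "cidr": "10.0.0.0/16"}
	actual := map[string]any{"from_port": 8080, "cidr": "0.0.0.0/0", "arn": "..."}
	for _, d := range diffAttrs(declared, actual) {
		fmt.Printf("%s: %v -> %v\n", d.Key, d.Old, d.New) // prints "cidr: 10.0.0.0/16 -> 0.0.0.0/0"
	}
}
```

One wrinkle the sketch glosses over: attributes decoded from state JSON arrive as `float64`/`[]any`, so the real compare needs type normalization before `DeepEqual`.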
#### Story 1.4: Secret Scrubbing Engine
**As a** Security Lead, **I want** the agent to scrub all sensitive data from the drift diffs, **so that** my database passwords and API keys are never transmitted to the dd0c SaaS.
* **Acceptance Criteria:**
* Strips any attribute marked `sensitive` in the state file.
* Redacts values matching known secret patterns (e.g., `password`, `secret`, `token`).
* Replaces redacted values with `[REDACTED]`.
* Completely strips the `Private` field from state instances.
* **Story Points:** 3
* **Dependencies:** Story 1.3
* **Technical Notes:** Use regex for pattern matching. Validate scrubber with rigorous unit tests before shipping. Ensure the diff structure remains intact even when values are redacted.
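A minimal scrubber sketch against the acceptance criteria — the regex covers the key patterns named above plus `api_key`, which is an added assumption; note the diff shape survives redaction:

```go
package main

import (
	"fmt"
	"regexp"
)

// Key patterns that trigger redaction regardless of the state file's
// own "sensitive" marking. api_key is an assumed extra pattern.
var secretKey = regexp.MustCompile(`(?i)(password|secret|token|api_?key)`)

// scrub returns a copy of the attributes with sensitive values replaced
// by "[REDACTED]". Keys are preserved, so the SaaS can still render
// WHICH attribute drifted without ever seeing its value.
func scrub(attrs map[string]any, sensitive map[string]bool) map[string]any {
	out := make(map[string]any, len(attrs))
	for k, v := range attrs {
		if sensitive[k] || secretKey.MatchString(k) {
			out[k] = "[REDACTED]"
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	in := map[string]any{"master_password": "hunter2", "port": 5432}
	fmt.Println(scrub(in, nil))
}
```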
## Epic 2: Agent Communication
**Description:** Secure data transmission between the local Go agent and the dd0c SaaS. Handles authentication, mTLS registration, heartbeat signals, and encrypted drift report uploads.
### User Stories
#### Story 2.1: Agent Registration & Authentication
**As a** DevOps Lead, **I want** the agent to securely register itself with the dd0c SaaS using an API key, **so that** my drift data is securely associated with my organization.
* **Acceptance Criteria:**
* Agent registers via `POST /v1/agents/register` using a static API key.
* Generates and exchanges mTLS certificates for subsequent requests.
* Receives configuration details (e.g., poll interval) from the SaaS.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** The agent needs to store the mTLS cert locally or in memory. Implement robust error handling for unauthorized/revoked API keys.
#### Story 2.2: Encrypted Payload Transmission
**As a** Security Lead, **I want** the agent to transmit drift reports over a secure, encrypted channel, **so that** our infrastructure data cannot be intercepted in transit.
* **Acceptance Criteria:**
* Agent POSTs scrubbed drift reports to `/v1/drift-reports`.
* Communication enforces TLS 1.3 and uses the established mTLS client certificate.
* Payload is compressed (gzip) if over a certain threshold.
* **Story Points:** 3
* **Dependencies:** Story 1.4, Story 2.1
* **Technical Notes:** Ensure HTTP client in Go enforces TLS 1.3. Define the strict JSON schema for the `DriftReport` payload.
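Enforcing TLS 1.3 plus the client certificate takes only the standard library; a sketch (cert file paths are hypothetical, and the registration exchange from Story 2.1 is out of scope here):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newMTLSClient builds an HTTP client that presents the agent's mTLS
// client certificate and refuses anything below TLS 1.3.
func newMTLSClient(certFile, keyFile string) (*http.Client, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert}, // mTLS client cert
				MinVersion:   tls.VersionTLS13,        // enforce TLS 1.3
			},
		},
	}, nil
}

func main() {
	// Paths are placeholders; with no certs on disk this returns an error.
	_, err := newMTLSClient("agent.crt", "agent.key")
	fmt.Println("missing certs produce an error:", err != nil)
}
```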
#### Story 2.3: Agent Heartbeat
**As a** DevOps Lead, **I want** the agent to send regular heartbeats to the SaaS, **so that** I know if the agent crashes or loses connectivity.
* **Acceptance Criteria:**
* Agent sends a lightweight heartbeat payload every N minutes.
* Payload includes uptime, memory usage, and events processed.
* SaaS API logs the heartbeat to track agent health.
* **Story Points:** 2
* **Dependencies:** Story 2.1
* **Technical Notes:** Run heartbeat in a separate Go goroutine with a ticker. Handle transient network errors silently.
## Epic 3: Drift Analysis Engine
**Description:** The SaaS-side Node.js/TypeScript processor that ingests drift reports, classifies severity, calculates stack drift scores, and persists events to the database and event store.
### User Stories
#### Story 3.1: Ingestion & Validation Pipeline
**As a** System Operator, **I want** the SaaS to receive and validate drift reports from agents via SQS, **so that** high volumes of reports are processed reliably without dropping data.
* **Acceptance Criteria:**
* API Gateway routes valid `POST /v1/drift-reports` to an SQS FIFO queue.
* Event Processor ECS task consumes from the queue.
* Validates the report payload against a strict JSON schema.
* **Story Points:** 5
* **Dependencies:** Story 2.2
* **Technical Notes:** Use `zod` for payload validation. Ensure message group IDs use `stack_id` to maintain ordering per stack.
#### Story 3.2: Drift Classification
**As an** Infrastructure Engineer, **I want** the SaaS to classify detected drift by severity and category, **so that** I can prioritize critical security issues over cosmetic tag changes.
* **Acceptance Criteria:**
* Applies YAML-defined classification rules to incoming drift diffs.
* Tags events as Critical, High, Medium, or Low severity.
* Tags events with categories (Security, Configuration, Tags, etc.).
* **Story Points:** 3
* **Dependencies:** Story 3.1
* **Technical Notes:** Implement a fast rule evaluation engine. Default unmatched drift to "Medium/Configuration".
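The story specifies YAML-defined rules; one hypothetical shape for them (rule names, matching syntax, and fields are invented for illustration):

```yaml
# Hypothetical classification rules — the engine's real schema is TBD.
rules:
  - name: security-group-open-to-world
    match:
      resource_type: aws_security_group
      attribute: "ingress.*.cidr_blocks"
      new_value: "0.0.0.0/0"
    severity: critical
    category: security
  - name: tag-change
    match:
      attribute: "tags.*"
    severity: low
    category: tags
# Anything unmatched defaults to medium / configuration.
```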
#### Story 3.3: Persistence & Event Sourcing
**As a** Compliance Lead, **I want** every drift detection event to be stored in an immutable log, **so that** I have a reliable audit trail for SOC 2 compliance.
* **Acceptance Criteria:**
* Appends the raw drift event to DynamoDB (immutable event store).
* Upserts the current state of the resource in the PostgreSQL `resources` table.
* Inserts a new record in the PostgreSQL `drift_events` table for open drift.
* **Story Points:** 8
* **Dependencies:** Story 3.2
* **Technical Notes:** Handle database transactions carefully to keep PostgreSQL and DynamoDB in sync. Ensure Row-Level Security (RLS) is applied on all PostgreSQL inserts.
#### Story 3.4: Drift Score Calculation
**As a** DevOps Lead, **I want** the engine to calculate a drift score for each stack, **so that** I have a high-level metric of infrastructure health.
* **Acceptance Criteria:**
* Updates the `drift_score` field on the `stacks` table after processing a report.
* Score is out of 100 (e.g., 100 = completely clean).
* Severity-weighted penalties (Critical drops the score sharply; Low barely moves it).
* **Story Points:** 3
* **Dependencies:** Story 3.3
* **Technical Notes:** Define a simple but logical weighting algorithm. Run calculation synchronously during event processing for V1.
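One possible "simple but logical" weighting: subtract a per-severity penalty for each open event and floor at zero. The penalty values below are placeholder assumptions, not specified anywhere above.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

// Placeholder weights: Critical heavily impacts the score, Low barely does.
const PENALTY: Record<Severity, number> = { Critical: 25, High: 10, Medium: 3, Low: 1 };

// 100 = completely clean; penalties accumulate per open drift event.
function driftScore(openEvents: Severity[]): number {
  const penalty = openEvents.reduce((sum, s) => sum + PENALTY[s], 0);
  return Math.max(0, 100 - penalty);
}
```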
## Epic 4: Notification Service
**Description:** A Lambda-based service that formats drift events into actionable Slack messages (Block Kit) and handles delivery routing per stack configuration.
### User Stories
#### Story 4.1: Slack Block Kit Formatting
**As an** Infrastructure Engineer, **I want** drift alerts to arrive as rich Slack messages, **so that** I can easily read the diff and context without leaving Slack.
* **Acceptance Criteria:**
* Lambda function maps drift events to Slack Block Kit JSON.
* Message includes Stack Name, Resource Address, Timestamp, Severity, and CloudTrail Attribution.
* Displays a code block showing the `old` vs `new` attribute diff.
* **Story Points:** 5
* **Dependencies:** Story 3.3
* **Technical Notes:** Build a flexible Block Kit template builder. Ensure diffs that are too long are gracefully truncated to fit Slack's block length limits.
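A sketch of the template builder with graceful truncation. It assumes Slack's roughly 3,000-character cap on section block text and reserves room for the code fences before cutting the diff; the alert field names are illustrative.

```typescript
interface DriftAlert {
  stackName: string;
  resourceAddress: string;
  severity: string;
  timestamp: string;
  attribution: string; // CloudTrail who/IP summary (illustrative field)
  diff: string;        // pre-rendered old vs new attribute diff
}

const MAX_SECTION_TEXT = 3000; // Slack caps section block text near 3,000 chars
const FENCE = "`".repeat(3);   // built dynamically to avoid a literal fence in source

function truncate(text: string, limit: number): string {
  const marker = "\n… (truncated)";
  return text.length <= limit ? text : text.slice(0, limit - marker.length) + marker;
}

function toBlocks(alert: DriftAlert): object[] {
  const header =
    `*${alert.severity}* drift on \`${alert.resourceAddress}\`\n` +
    `${alert.timestamp} by ${alert.attribution}\n`;
  // Reserve room for the header and both code fences before truncating the diff.
  const budget = MAX_SECTION_TEXT - header.length - 2 * (FENCE.length + 1);
  const body = truncate(alert.diff, budget);
  return [
    { type: "header", text: { type: "plain_text", text: `Drift detected: ${alert.stackName}` } },
    { type: "section", text: { type: "mrkdwn", text: header + FENCE + "\n" + body + "\n" + FENCE } },
  ];
}
```

Truncating the diff body (rather than the assembled message) keeps the closing fence intact, so the code block always renders.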
#### Story 4.2: Slack Routing & Fanout
**As a** DevOps Lead, **I want** drift alerts to be routed to specific Slack channels based on the stack, **so that** the right team sees the alert without noise.
* **Acceptance Criteria:**
* Checks the `stacks` table for custom Slack channel overrides.
* Falls back to the organization's default Slack channel.
* Sends the formatted message via the Slack API.
* **Story Points:** 3
* **Dependencies:** Story 4.1
* **Technical Notes:** The Event Processor triggers this Lambda asynchronously via SQS (`notification-fanout` queue).
#### Story 4.3: Action Buttons (Revert/Accept)
**As an** Infrastructure Engineer, **I want** action buttons directly on the Slack alert, **so that** I can quickly trigger remediation workflows.
* **Acceptance Criteria:**
* Slack message includes interactive buttons: `[Revert]`, `[Accept]`, `[Snooze]`, `[Assign]`.
* Buttons contain the `drift_event_id` in their payload value.
* **Story Points:** 2
* **Dependencies:** Story 4.1
* **Technical Notes:** This story covers rendering only; handling the interactive callbacks is deferred to the Slack Bot epic and Remediation Engine (V1 MVP focus).
#### Story 4.4: Notification Batching (Low Severity)
**As an** Infrastructure Engineer, **I want** low and medium severity drift alerts to be batched into a digest, **so that** my Slack channel isn't spammed with noisy tag changes.
* **Acceptance Criteria:**
* Critical/High alerts are sent immediately.
* Medium/Low alerts are held in a DynamoDB table or SQS delay queue and dispatched as a daily/hourly digest.
* **Story Points:** 8
* **Dependencies:** Story 4.2
* **Technical Notes:** Use EventBridge Scheduler or a cron Lambda to process and flush the digest queue periodically.
## Epic 5: Dashboard API
**Description:** The REST API powering the React SPA web dashboard. Handles user authentication (Cognito), organization/stack management, and querying the PostgreSQL data for drift history.
### User Stories
#### Story 5.1: API Authentication & RLS Setup
**As a** System Operator, **I want** the API to enforce authentication and isolate tenant data, **so that** a user from one organization cannot see another organization's data.
* **Acceptance Criteria:**
* Integrates AWS Cognito JWT validation middleware.
* API sets `app.current_org_id` on the PostgreSQL connection session for Row-Level Security (RLS).
* Returns `401/403` for unauthorized requests.
* **Story Points:** 5
* **Dependencies:** Database Schema (RLS)
* **Technical Notes:** Build an Express/Node.js middleware layer. Ensure strict parameterization for all SQL queries beyond RLS to avoid injection.
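A sketch of the tenant-pinning step, assuming the org id lives in a Cognito custom claim (the claim name `custom:org_id` is an assumption about the Cognito setup). `set_config(..., true)` scopes the setting to the current transaction, so the RLS context can't leak across pooled connections.

```typescript
// Parameterized statement run after JWT validation, inside the request's
// transaction, so RLS policies can read app.current_org_id.
const RLS_QUERY = "SELECT set_config('app.current_org_id', $1, true)";

interface Claims {
  [claim: string]: unknown;
}

// Extract the org id from already-verified token claims; reject tokens
// that carry no tenant so no query ever runs without an RLS context.
function orgIdFromClaims(claims: Claims): string {
  const org = claims["custom:org_id"];
  if (typeof org !== "string" || org.length === 0) {
    throw new Error("403: token has no organization claim");
  }
  return org;
}

// Usage inside an Express handler (pg client assumed):
//   await client.query("BEGIN");
//   await client.query(RLS_QUERY, [orgIdFromClaims(verifiedClaims)]);
//   ...tenant-scoped, parameterized queries...
//   await client.query("COMMIT");
```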
#### Story 5.2: Stack Management Endpoints
**As a** DevOps Lead, **I want** a set of REST endpoints to view and manage my stacks, **so that** I can configure ownership, check connection health, and monitor the drift score.
* **Acceptance Criteria:**
* Implements `GET /v1/stacks` (list all stacks with their scores and resource counts).
* Implements `GET /v1/stacks/:id` (stack details).
* Implements `PATCH /v1/stacks/:id` (update name, owner, Slack channel).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Support basic pagination (`limit`, `offset`) on list endpoints. Include agent heartbeat status in stack details if applicable.
#### Story 5.3: Drift History & Event Queries
**As a** Security Lead, **I want** endpoints to search and filter drift events, **so that** I can review historical drift, find specific changes, and generate audit reports.
* **Acceptance Criteria:**
* Implements `GET /v1/drift-events` with filters for `stack_id`, `status` (open/resolved), `severity`, and timestamp ranges.
* Joins the `drift_events` table with `resources` to return full address paths and diff payloads.
* **Story Points:** 5
* **Dependencies:** Story 5.1
* **Technical Notes:** Expose the JSONB `diff` field cleanly in the response payload. Use index-backed PostgreSQL queries for fast filtering.
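The filterable list endpoint reduces to building a parameterized WHERE clause. This sketch assembles the query for the `drift_events` and `resources` join; column names like `detected_at` and `resource_id` are assumptions about the schema, and every filter binds a `$n` parameter so the queries stay injection-safe and index-backed.

```typescript
interface DriftEventFilters {
  stack_id?: string;
  status?: "open" | "resolved";
  severity?: string;
  since?: string; // ISO timestamp lower bound
}

// Builds a parameterized query; never interpolates user input into SQL.
function buildDriftEventQuery(f: DriftEventFilters): { sql: string; params: unknown[] } {
  const where: string[] = [];
  const params: unknown[] = [];
  const add = (clause: string, value: unknown) => {
    params.push(value);
    where.push(clause.replace("?", `$${params.length}`));
  };
  if (f.stack_id) add("e.stack_id = ?", f.stack_id);
  if (f.status) add("e.status = ?", f.status);
  if (f.severity) add("e.severity = ?", f.severity);
  if (f.since) add("e.detected_at >= ?", f.since);
  const sql =
    "SELECT e.id, e.severity, e.status, e.diff, r.address " +
    "FROM drift_events e JOIN resources r ON r.id = e.resource_id" +
    (where.length ? " WHERE " + where.join(" AND ") : "") +
    " ORDER BY e.detected_at DESC";
  return { sql, params };
}
```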
#### Story 5.4: Policy Configuration Endpoints
**As a** DevOps Lead, **I want** an API to manage remediation policies, **so that** I can customize how the agent reacts to specific resource drift.
* **Acceptance Criteria:**
* Implements CRUD operations for stack-level and org-level policies (`/v1/policies`).
* Validates policy configuration payloads (e.g., action type, valid resource expressions).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** For V1, store policies as JSON fields in a simple `policies` table or direct mapping to stacks. Keep validation simple (e.g., regex checks on resource types).
## Epic 6: Dashboard UI
**Description:** The React Single Page Application (SPA) providing a web dashboard for stack overview, drift timeline, and resource-level diff viewer.
### User Stories
#### Story 6.1: Stack Overview Dashboard
**As a** DevOps Lead, **I want** a main dashboard showing all my stacks and their drift scores, **so that** I can assess infrastructure health at a glance.
* **Acceptance Criteria:**
* Displays a list/table of all monitored stacks.
* Shows a visual "Drift Score" indicator (0-100) per stack.
* Sortable by score, name, and last checked timestamp.
* Provides visual indicators for agent connection status.
* **Story Points:** 5
* **Dependencies:** Story 5.2
* **Technical Notes:** Build with React and Vite. Use a standard UI library (e.g., Tailwind UI or MUI). Implement efficient data fetching (e.g., React Query).
#### Story 6.2: Stack Detail & Drift Timeline
**As an** Infrastructure Engineer, **I want** to click into a stack and see a timeline of drift events, **so that** I can track when things changed and who changed them.
* **Acceptance Criteria:**
* Shows a chronological list of drift events for the selected stack.
* Displays open vs. resolved status.
* Filters for severity and category.
* Includes CloudTrail attribution data (who, IP, action).
* **Story Points:** 5
* **Dependencies:** Story 5.3, Story 6.1
* **Technical Notes:** Support pagination/infinite scrolling for the timeline. Use clear icons for event types (Security vs. Tags).
#### Story 6.3: Resource-Level Diff Viewer
**As an** Infrastructure Engineer, **I want** to see the exact attribute changes for a drifted resource, **so that** I know exactly how reality differs from my state file.
* **Acceptance Criteria:**
* Clicking an event opens a detailed view/modal.
* Renders a code-diff view (red for old state, green for new reality).
* Clearly marks redacted sensitive values.
* **Story Points:** 5
* **Dependencies:** Story 6.2
* **Technical Notes:** Use a specialized diff viewing component (e.g., `react-diff-viewer`). Ensure it handles large JSON blocks gracefully.
#### Story 6.4: Auth & User Settings
**As a** User, **I want** to manage my account and view my API keys, **so that** I can deploy the agent and access my organization's dashboard.
* **Acceptance Criteria:**
* Implements login/signup via Cognito (Email/Password & GitHub OAuth).
* Provides a settings page displaying the organization's static API key.
* Displays current subscription plan (Free tier limits for MVP).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Securely manage JWT storage (HttpOnly cookies or secure local storage). Include a clear "copy to clipboard" for the API key.
## Epic 7: Slack Bot
**Description:** The interactive Slack application that handles user commands (`/drift score`) and processes the interactive action buttons (`[Revert]`, `[Accept]`) from drift alerts.
### User Stories
#### Story 7.1: Interactive Remediation Callbacks (Revert)
**As an** Infrastructure Engineer, **I want** clicking `[Revert]` on a Slack alert to trigger a targeted `terraform apply`, **so that** I can fix drift instantly without leaving Slack.
* **Acceptance Criteria:**
* SaaS API Gateway (`/v1/slack/interactions`) receives the button click payload.
* Validates the Slack request signature.
* Generates a scoped `terraform plan -target` command and queues it for the agent.
* Updates the Slack message to "Reverting...".
* **Story Points:** 8
* **Dependencies:** Story 4.3
* **Technical Notes:** The actual execution happens via the Remediation Engine (ECS Fargate) dispatching commands to the agent. Requires careful state tracking (Pending -> Executing -> Completed).
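Validating the Slack request signature follows Slack's documented v0 signing scheme: HMAC-SHA256 over `v0:{timestamp}:{rawBody}` with the app's signing secret, compared in constant time against the `X-Slack-Signature` header, with stale timestamps rejected to block replays. A minimal sketch:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a Slack interaction payload per Slack's v0 signing scheme.
function verifySlackSignature(
  signingSecret: string,
  rawBody: string,
  timestamp: string,
  signatureHeader: string,
  nowSeconds: number = Math.floor(Date.now() / 1000)
): boolean {
  // Reject requests older than 5 minutes to prevent replay attacks.
  if (Math.abs(nowSeconds - Number(timestamp)) > 60 * 5) return false;
  const base = `v0:${timestamp}:${rawBody}`;
  const expected = "v0=" + createHmac("sha256", signingSecret).update(base).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // Constant-time compare; lengths must match first or timingSafeEqual throws.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The raw request body must be verified before any JSON/form parsing, so the API Gateway handler needs access to the unparsed payload.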
#### Story 7.2: Interactive Acceptance Callbacks (Accept)
**As an** Infrastructure Engineer, **I want** clicking `[Accept]` to auto-generate a PR that updates my Terraform code to match reality, **so that** the drift becomes the new source of truth.
* **Acceptance Criteria:**
* SaaS generates a code patch representing the new state.
* Uses the GitHub API to create a branch and open a PR against the target repo.
* Updates the Slack message with a link to the PR.
* **Story Points:** 8
* **Dependencies:** Story 7.1
* **Technical Notes:** Will require GitHub App integration (or OAuth token) to create branches/PRs. The patch generation logic needs robust testing.
#### Story 7.3: Slack Slash Commands
**As a** DevOps Lead, **I want** to use `/drift score` and `/drift status <stack>` in Slack, **so that** I can check my infrastructure health on demand.
* **Acceptance Criteria:**
* `/drift score` returns the aggregate score for the organization.
* `/drift status prod-networking` returns the score, open events, and agent health for a specific stack.
* Formats output as a clean Slack Block Kit message visible only to the user.
* **Story Points:** 5
* **Dependencies:** Story 4.1, Story 5.2
* **Technical Notes:** Deploy an API Gateway endpoint specifically for Slack slash commands. Validate the token and use the internal Dashboard API logic to fetch scores.
#### Story 7.4: Snooze & Assign Callbacks
**As an** Infrastructure Engineer, **I want** to click `[Snooze 24h]` or `[Assign]`, **so that** I can manage alert noise or delegate investigation to a teammate.
* **Acceptance Criteria:**
* `[Snooze]` updates the event status to `snoozed` and schedules a wake-up time.
* `[Assign]` opens a Slack modal to select a team member, updating the event owner.
* The original Slack message updates to reflect the new state/owner.
* **Story Points:** 5
* **Dependencies:** Story 7.1
* **Technical Notes:** Snooze requires a scheduled EventBridge or cron job to un-snooze. Assign requires interacting with Slack's user selection menus.
## Epic 8: Infrastructure & DevOps
**Description:** The underlying cloud resources for the dd0c SaaS and the CI/CD pipelines to build, test, and release the agent and services.
### User Stories
#### Story 8.1: SaaS Infrastructure (Terraform)
**As a** System Operator, **I want** the SaaS infrastructure to be defined as code, **so that** deployments are repeatable and I can dogfood my own drift detection tool.
* **Acceptance Criteria:**
* Defines VPC, Subnets, ECS Fargate Clusters, RDS PostgreSQL (Multi-AZ), API Gateway, and SQS FIFO queues.
* Sets up CloudWatch log groups and IAM roles.
* Uses Terraform for all configuration.
* **Story Points:** 8
* **Dependencies:** Architecture Design Document
* **Technical Notes:** Build a modular Terraform setup. Use the official AWS provider. Include variables for environment separation (staging vs. prod).
#### Story 8.2: CI/CD Pipeline (GitHub Actions)
**As a** Developer, **I want** a fully automated CI/CD pipeline, **so that** code pushed to `main` is linted, tested, built, and deployed to ECS.
* **Acceptance Criteria:**
* Runs `golangci-lint`, `go test`, ESLint, and Vitest on PRs.
* Builds multi-stage Docker images for the Event Processor, Dashboard API, and Remediation Engine.
* Pushes images to ECR and triggers an ECS rolling deploy.
* **Story Points:** 5
* **Dependencies:** Story 8.1
* **Technical Notes:** Use standard GitHub Actions (e.g., `aws-actions/configure-aws-credentials`). Add Trivy for basic container scanning.
#### Story 8.3: Agent Distribution (Releases & Homebrew)
**As an** Open Source User, **I want** to easily download and install the CLI agent, **so that** I can test drift detection locally without building from source.
* **Acceptance Criteria:**
* Configures GoReleaser to cross-compile binaries for Linux/macOS/Windows (amd64/arm64).
* Auto-publishes GitHub Releases when a new tag is pushed.
* Creates a custom Homebrew tap (`brew install dd0c/tap/drift-cli`).
* **Story Points:** 5
* **Dependencies:** Story 1.1
* **Technical Notes:** Create a dedicated `.github/workflows/release.yml` for GoReleaser.
#### Story 8.4: Agent Terraform Module Publication
**As a** DevOps Lead, **I want** a pre-built Terraform module to deploy the agent in my AWS account, **so that** I don't have to manually configure ECS tasks and EventBridge rules.
* **Acceptance Criteria:**
* Creates the `dd0c/drift-agent/aws` Terraform module.
* Provisions an ECS Task, EventBridge rules, SQS, and IAM roles for the customer.
* Publishes the module to the public Terraform Registry.
* **Story Points:** 8
* **Dependencies:** Story 8.3
* **Technical Notes:** Adhere to Terraform Registry best practices. Ensure the `README.md` clearly explains the required `dd0c_api_key` and state bucket variables.
## Epic 9: Onboarding & PLG (Product-Led Growth)
**Description:** The self-serve funnel that guides users from CLI installation to their first drift alert in under 5 minutes, plus billing and tier management.
### User Stories
#### Story 9.1: Self-Serve Signup & CLI Login
**As an** Engineer, **I want** to easily sign up for the free tier via the CLI, **so that** I don't have to fill out sales forms to test the product.
* **Acceptance Criteria:**
* Running `drift auth login` opens a browser to an OAuth flow (GitHub/Email).
* The CLI spins up a local web server to catch the callback token.
* Successfully provisions an organization and user account in the SaaS.
* **Story Points:** 5
* **Dependencies:** Story 5.4, Story 8.3
* **Technical Notes:** The callback server should listen on `localhost` with a random or standard port (e.g., `8080`).
#### Story 9.2: Auto-Discovery (`drift init`)
**As an** Infrastructure Engineer, **I want** the CLI to auto-discover my Terraform state files, **so that** I can configure my first stack without typing S3 ARNs.
* **Acceptance Criteria:**
* `drift init` scans the current directory for `*.tf` files.
* Uses default AWS credentials to query S3 buckets matching common state file patterns.
* Prompts the user to register discovered stacks to their organization.
* **Story Points:** 8
* **Dependencies:** Story 9.1
* **Technical Notes:** Implement robust fallback to manual input if discovery fails.
#### Story 9.3: Free Tier Enforcement (1 Stack)
**As a** Product Manager, **I want** to enforce a free tier limit of 1 stack, **so that** users get value but are incentivized to upgrade for larger infrastructure needs.
* **Acceptance Criteria:**
* The API rejects attempts to register more than 1 stack on the Free plan.
* The Dashboard clearly shows "1/1 Stacks Used".
* The CLI prompts "Upgrade to Starter ($49/mo)" when trying to add a second stack.
* **Story Points:** 3
* **Dependencies:** Story 9.1
* **Technical Notes:** Enforce limits securely at the API level (e.g., `POST /v1/stacks` should return a `403 Stack Limit` error).
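The API-level check is a small guard evaluated inside `POST /v1/stacks`. Only the free tier's 1-stack cap is specified above; the starter/pro limits below are placeholder assumptions.

```typescript
// Only the free tier's limit (1) is specified; other values are placeholders.
const PLAN_LIMITS: Record<string, number> = { free: 1, starter: 10, pro: 100 };

interface LimitCheck {
  ok: boolean;
  status?: number;
  message?: string;
}

// Evaluated server-side before inserting a new stack row.
function checkStackLimit(plan: string, currentStackCount: number): LimitCheck {
  const limit = PLAN_LIMITS[plan] ?? 0;
  if (currentStackCount >= limit) {
    return {
      ok: false,
      status: 403,
      message: `Stack limit reached (${currentStackCount}/${limit}). Upgrade to add more stacks.`,
    };
  }
  return { ok: true };
}
```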
#### Story 9.4: Stripe Billing Integration
**As a** Solo Founder, **I want** customers to upgrade to paid tiers with a credit card, **so that** I can capture revenue without a sales process.
* **Acceptance Criteria:**
* Integrates Stripe Checkout for the Starter ($49/mo) and Pro ($149/mo) tiers.
* Dashboard provides a billing management portal (Stripe Customer Portal).
* Webhooks listen for successful payments and update the organization's `plan` field in PostgreSQL.
* **Story Points:** 8
* **Dependencies:** Story 5.2
* **Technical Notes:** Must include Stripe webhook signature verification to prevent spoofed upgrades. Store `stripe_customer_id` on the `organizations` table.
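Stripe's signature verification follows its documented scheme: the `Stripe-Signature` header carries `t=<unix>,v1=<hex>` pairs, and the expected value is HMAC-SHA256 of `<t>.<rawBody>` with the endpoint secret. A minimal sketch (Stripe's official SDK provides `webhooks.constructEvent` for this in production):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a Stripe webhook per the documented Stripe-Signature scheme.
function verifyStripeSignature(endpointSecret: string, rawBody: string, header: string): boolean {
  let t = "";
  const sigs: string[] = [];
  for (const kv of header.split(",")) {
    const [k, v] = kv.split("=", 2);
    if (k === "t") t = v;
    if (k === "v1") sigs.push(v); // header may carry multiple v1 values
  }
  if (!t || sigs.length === 0) return false;
  const expected = createHmac("sha256", endpointSecret).update(`${t}.${rawBody}`).digest("hex");
  return sigs.some((s) => {
    const a = Buffer.from(expected);
    const b = Buffer.from(s);
    return a.length === b.length && timingSafeEqual(a, b);
  });
}
```

Only after verification should the handler update the organization's `plan` field; an unverified webhook is exactly the spoofed-upgrade vector the notes warn about.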
---
## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/drift adheres to the 5 Transparent Factory architectural tenets. For a drift detection product, these tenets are especially critical — the tool that detects infrastructure drift must itself be immune to uncontrolled drift.
### Story 10.1: Atomic Flagging — Feature Flags for Detection Behaviors
**As a** solo founder, **I want** every new drift detection rule, remediation action, and notification behavior wrapped in a feature flag (default: off), **so that** I can ship new detection capabilities without accidentally triggering false-positive alerts for customers.
**Acceptance Criteria:**
- OpenFeature SDK integrated into the Go agent. V1 provider: env-var or JSON file-based (no external service).
- All flags evaluate locally — no network calls during drift scan execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if any flag at 100% rollout exceeds TTL.
- Automated circuit breaker: if a flagged detection rule generates >3x the baseline false-positive rate over 1 hour, the flag auto-disables.
- Flags required for: new IaC provider support (Terraform/Pulumi/CDK), remediation suggestions, Slack notification formats, scan scheduling changes.
**Estimate:** 5 points
**Dependencies:** Epic 1 (Agent Core)
**Technical Notes:**
- Use Go OpenFeature SDK (`go.openfeature.dev/sdk`). JSON file provider for V1.
- Circuit breaker: track false-positive dismissals per rule in Redis. If dismissal rate spikes, disable the flag.
- Flag audit: `make flag-audit` lists all flags with TTL status.
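The real flags live behind the Go OpenFeature SDK; the CI-side TTL audit that `make flag-audit` runs is language-agnostic, so here is a sketch of that check in TypeScript (field names are illustrative): a flag fails the audit when it sits at 100% rollout past its TTL.

```typescript
interface FlagMeta {
  name: string;
  owner: string;
  createdAt: string;      // ISO date the flag was introduced
  ttlDays: number;        // max 14 per the acceptance criteria
  rolloutPercent: number; // 0-100
}

// Returns flags that should block CI: fully rolled out and past their TTL.
function expiredFlags(flags: FlagMeta[], now: Date): FlagMeta[] {
  const DAY_MS = 24 * 60 * 60 * 1000;
  return flags.filter((f) => {
    const ageDays = (now.getTime() - Date.parse(f.createdAt)) / DAY_MS;
    return f.rolloutPercent === 100 && ageDays > f.ttlDays;
  });
}
```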
### Story 10.2: Elastic Schema — Additive-Only for Drift State Storage
**As a** solo founder, **I want** all DynamoDB and state file schema changes to be strictly additive, **so that** agent rollbacks never corrupt drift history or lose customer scan results.
**Acceptance Criteria:**
- CI lint rejects any migration or schema change that removes, renames, or changes the type of existing DynamoDB attributes.
- New attributes use `_v2` suffix when breaking changes are needed. Old attributes remain readable.
- Go structs use `json:",omitempty"` and ignore unknown fields so V1 agents can read V2 state files without crashing.
- Dual-write enforced during migration windows: agent writes to both old and new attribute paths in the same DynamoDB `TransactWriteItems` call.
- Every schema change includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
**Estimate:** 3 points
**Dependencies:** Epic 2 (State Management)
**Technical Notes:**
- DynamoDB Single Table Design: version items with `_v` attribute. Agent code uses a factory to select the correct model.
- For Terraform state parsing, use `encoding/json` without calling `Decoder.DisallowUnknownFields()`, so unknown fields are silently ignored and upstream state format changes are tolerated.
- S3 state snapshots: never overwrite — always write new versioned keys.
### Story 10.3: Cognitive Durability — Decision Logs for Detection Logic
**As a** future maintainer, **I want** every change to drift detection algorithms, severity scoring, or remediation logic accompanied by a `decision_log.json`, **so that** I understand why a particular drift pattern is flagged as critical vs. informational.
**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log entry for PRs touching `pkg/detection/`, `pkg/scoring/`, or `pkg/remediation/`.
- Cyclomatic complexity cap of 10 enforced via `golangci-lint` with `gocyclo` linter. PRs exceeding this are blocked.
- Decision logs committed in `docs/decisions/`, one per significant logic change.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- PR template includes decision log fields as a checklist.
- For drift scoring changes: document why specific thresholds were chosen (e.g., "security group changes scored critical because X% of breaches start with SG drift").
- `golangci-lint` config: `.golangci.yml` with `gocyclo: max-complexity: 10`.
### Story 10.4: Semantic Observability — AI Reasoning Spans on Drift Classification
**As an** SRE debugging a missed drift alert, **I want** every drift classification decision to emit an OpenTelemetry span with structured reasoning metadata, **so that** I can trace why a specific resource change was scored as low-severity when it should have been critical.
**Acceptance Criteria:**
- Every drift scan emits a parent `drift_scan` span. Each resource comparison emits a child `drift_classification` span.
- Span attributes: `drift.resource_type`, `drift.severity_score`, `drift.classification_reason`, `drift.alternatives_considered` (e.g., "considered critical but downgraded because tag-only change").
- If AI-assisted classification is used (future): `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score` included.
- Spans export via OTLP to any compatible backend.
- No PII or customer infrastructure details in spans — resource ARNs are hashed.
**Estimate:** 3 points
**Dependencies:** Epic 1 (Agent Core)
**Technical Notes:**
- Use `go.opentelemetry.io/otel` with OTLP exporter.
- For V1 without AI classification, the `drift.classification_reason` is the rule name + threshold that triggered.
- ARN hashing: SHA-256 truncated to 12 chars for correlation without exposure.
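The agent would do this hashing in Go; the same SHA-256-truncated-to-12-hex-chars transform, sketched in TypeScript: stable enough to correlate spans for the same resource without putting the ARN itself on the wire.

```typescript
import { createHash } from "node:crypto";

// Hash resource ARNs before attaching them to spans: deterministic for
// correlation, non-reversible for exposure, 12 hex chars for brevity.
function hashArn(arn: string): string {
  return createHash("sha256").update(arn).digest("hex").slice(0, 12);
}
```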
### Story 10.5: Configurable Autonomy — Governance for Auto-Remediation
**As a** solo founder, **I want** a `policy.json` that controls whether the agent can auto-remediate drift or only report it, **so that** customers maintain full control over what the tool is allowed to change in their infrastructure.
**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (report-only, no remediation) or `audit` (auto-remediate with logging).
- Agent checks policy before every remediation action. In `strict` mode, remediation suggestions are logged but never executed.
- `panic_mode`: when true, agent stops all scans immediately, preserves last-known-good state, and sends a single "paused" notification.
- Per-customer policy override: customers can set their own governance mode via config, which is always more restrictive than the system default (never less).
- All policy decisions logged: "Remediation blocked by strict mode for resource X", "Auto-remediation applied in audit mode".
**Estimate:** 3 points
**Dependencies:** Epic 3 (Remediation Engine)
**Technical Notes:**
- `policy.json` in repo root, loaded at startup, watched via `fsnotify`.
- Customer-level overrides stored in DynamoDB `org_settings` item. Merge logic: `min(system_policy, customer_policy)` — customer can only be MORE restrictive.
- Panic mode trigger: `POST /admin/panic` or env var `DD0C_PANIC=true`. Agent drains current scan and halts.
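The merge logic and the remediation gate reduce to a few lines. Sketched in TypeScript (the agent itself is Go), with "strict" ranked as more restrictive than "audit", the customer override can only tighten the system default, and panic mode overrides everything:

```typescript
// Lower index = more restrictive. "strict" is report-only;
// "audit" auto-remediates with logging.
const MODES = ["strict", "audit"] as const;
type GovernanceMode = (typeof MODES)[number];

// min(system_policy, customer_policy): the customer can only be MORE
// restrictive than the system default, never less.
function effectiveMode(system: GovernanceMode, customer?: GovernanceMode): GovernanceMode {
  if (!customer) return system;
  return MODES.indexOf(customer) < MODES.indexOf(system) ? customer : system;
}

// Checked before every remediation action; panic mode halts everything.
function canRemediate(mode: GovernanceMode, panicMode: boolean): boolean {
  return !panicMode && mode === "audit";
}
```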
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |

# dd0c/drift — Disruptive Innovation Strategy
**Strategist:** Victor, Former McKinsey Partner & Startup Advisor
**Date:** February 28, 2026
**Product:** dd0c/drift — IaC Drift Detection & Auto-Remediation SaaS
**Verdict:** Conditional GO — with caveats that will make you uncomfortable.
---
> *"The graveyard of DevOps startups is filled with companies that built better mousetraps for mice that had already moved to a different house. Let's make sure the mice are still here."* — Victor
---
## Section 1: MARKET LANDSCAPE
### 1.1 Competitive Analysis — The Battlefield
Let me be precise about who you're fighting, because half of these "competitors" are actually your best friends.
#### Tier 1: The Platforms (Your Real Competition)
**Spacelift** — $40M+ raised. Series B. ~200 employees.
- Pricing: Starts ~$40/month for Cloud tier (limited). Business tier is custom/enterprise — realistically $500-$2,000+/mo for meaningful usage. Drift detection requires at minimum Starter+ tier with private workers.
- Drift detection is a *feature*, not the product. It's buried inside their IaC management platform.
- Strength: Deep Terraform integration, policy-as-code, mature RBAC.
- Weakness: Requires full workflow migration. You don't just "add" Spacelift — you rebuild your CI/CD around it. That's a 2-month project minimum.
- Vulnerability: They CANNOT price down to $29/stack without destroying their enterprise ACV. Their sales motion requires $30K+ deals to justify the CAC.
**env0** — $28M+ raised. Series A (large). ~100 employees.
- Pricing: Free tier exists but is crippled. Paid starts ~$35/user/month. For a team of 10 managing 50 stacks, you're looking at $350-$500+/mo before enterprise features.
- Drift detection: Available but secondary to their "Environment as a Service" positioning.
- Strength: Good self-service onboarding, cost estimation features, OpenTofu support.
- Weakness: Trying to be everything — cost management, drift, policy, self-service. Jack of all trades.
- Vulnerability: Same platform migration problem as Spacelift. And their "per-user" pricing punishes growing teams.
**Terraform Cloud / HCP Terraform** — HashiCorp (IBM). Infinite resources.
- Pricing: Free for up to 500 managed resources. Plus tier at $0.00014/hr per resource (~$1.23/resource/year). Gets expensive fast at scale. Enterprise is custom.
- Drift detection: Exists since 2023. Runs health assessments on a schedule. Terraform-only (obviously). No OpenTofu. No Pulumi.
- Strength: It's HashiCorp. Native integration. Brand trust.
- Weakness: IBM acquisition created mass exodus to OpenTofu. Pricing changes alienated the community. BSL license killed goodwill. Drift detection is basic — no remediation workflows, no Slack-native experience.
- Vulnerability: The HashiCorp-to-OpenTofu migration is YOUR recruiting ground. Every team that leaves TFC needs drift detection and won't go back.
**Pulumi Cloud** — $97M+ raised.
- Pricing: Free for individuals. Team at $50/month for 3 users. Enterprise custom.
- Drift detection: `pulumi refresh` exists but it's manual. No continuous monitoring. No alerting.
- Strength: Modern language support (TypeScript, Python, Go). Developer-loved.
- Weakness: Pulumi-only. Small market share vs Terraform/OpenTofu. Drift is an afterthought.
- Vulnerability: Not a competitor — they're a potential integration partner. Support Pulumi stacks and you capture their underserved users.
#### Tier 2: The Dead and Dying
**driftctl (Snyk)** — DEAD.
- Acquired by Snyk in 2022. Effectively abandoned. Last meaningful development: ancient history. README still says "beta."
- This is your single greatest market gift. driftctl proved demand exists. Snyk proved that big companies don't care about drift enough to maintain an OSS tool. The community is orphaned and actively looking for a replacement.
- Every GitHub issue on driftctl asking "is this project dead?" is a lead for dd0c/drift.
#### Tier 3: Adjacent Players
**Firefly.ai** — $23M+ raised. Israeli startup.
- Positioning: "Cloud Asset Management" — broader than drift. They do inventory, codification (turning unmanaged resources into IaC), drift detection, and policy.
- Pricing: Enterprise-only. "Contact Sales." Minimum $1,000+/mo based on G2 reviews and AWS Marketplace listings.
- Strength: Comprehensive cloud visibility. Good at the "unmanaged resources" problem.
- Weakness: Enterprise sales motion. No self-serve. No PLG. A 5-person startup can't even get a demo without pretending to be bigger than they are.
- Vulnerability: They're selling to CISOs, not engineers. You're selling to the engineer with a credit card. Different buyer, different motion, no conflict.
**Digger** — Open-source Terraform CI/CD.
- Positioning: "Terraform CI/CD that runs in your CI." Open-source core.
- Drift detection: Basic. Not their focus.
- Strength: Runs in your existing CI (GitHub Actions, GitLab CI). No separate platform.
- Weakness: Small team, limited features, drift is a checkbox not a product.
- Vulnerability: Potential partner, not competitor. Digger users need drift detection. You provide it.
**ControlMonkey** — Enterprise IaC management.
- Pricing: Enterprise-only. $50K+ annual contracts.
- Irrelevant to your beachhead. They're playing a different game at a different price point.
#### Competitive Summary
| Player | Drift Detection | Pricing | Self-Serve | Multi-IaC | Your Advantage |
|--------|----------------|---------|------------|-----------|----------------|
| Spacelift | Feature (good) | $500+/mo | No (sales) | Terraform, OpenTofu, Pulumi, CFN | 17x cheaper, no migration |
| env0 | Feature (basic) | $350+/mo | Partial | Terraform, OpenTofu | Focused tool, not platform |
| HCP Terraform | Feature (basic) | Variable | Yes | Terraform only | Multi-IaC, no vendor lock-in |
| Pulumi Cloud | Manual only | $50+/mo | Yes | Pulumi only | Continuous, multi-IaC |
| driftctl | Dead | Free (OSS) | N/A | Terraform | You exist. They don't. |
| Firefly | Feature (good) | $1,000+/mo | No (sales) | Multi-IaC | 34x cheaper, PLG |
| Digger | Basic | Free (OSS) | N/A | Terraform | Dedicated product |
| **dd0c/drift** | **Core product** | **$29/stack** | **Yes** | **TF + OTF + Pulumi** | **—** |
### 1.2 Market Sizing
Let me be honest about the numbers, because most market sizing is fiction dressed in a suit.
**TAM (Total Addressable Market) — IaC Management & Governance**
- The global IaC market is projected at $2.5-$3.5B by 2027 (various analyst reports, growing 25-30% CAGR).
- But that includes everything: provisioning, CI/CD, policy, cost management. Drift detection is a slice.
- Realistic TAM for "IaC drift detection and remediation" specifically: **$800M-$1.2B** by 2027. This includes the drift features embedded in platforms like Spacelift/env0 plus standalone tools.
**SAM (Serviceable Addressable Market) — Teams Using Terraform/OpenTofu/Pulumi Who Need Drift Detection**
- HashiCorp reported 3,500+ enterprise customers and millions of Terraform users before the IBM acquisition.
- Conservatively, 150,000-200,000 organizations actively use Terraform/OpenTofu in production.
- Of those, ~60% (90,000-120,000) have 10+ stacks and experience meaningful drift.
- At $29/stack/month average, with an average of 20 stacks per org: $29 × 20 stacks × 100,000 orgs × 12 months ≈ **$696M/year SAM**.
- More conservatively (only teams with 10-100 stacks, excluding enterprises that will buy Spacelift anyway): **$200-$400M SAM**.
**SOM (Serviceable Obtainable Market) — What You Can Realistically Capture in 24 Months**
- As a solo founder with PLG motion, targeting SMB/mid-market teams (5-50 engineers, 10-100 stacks).
- Realistic first-year target: 200-500 paying customers.
- At average $145/mo (5 stacks × $29): **$350K-$870K ARR in Year 1**.
- Year 2 with expansion and word-of-mouth: **$1.5M-$3M ARR**.
- SOM: **$3-$5M** in the 24-month horizon.
**Brian, here's the uncomfortable truth:** The SOM is real but modest. This isn't a venture-scale market as a standalone product. It's a wedge. The dd0c platform strategy (route + cost + alert + drift + portal) is what makes this a $50M+ opportunity. Drift alone is a $3-5M ARR business. That's a great lifestyle business or a great wedge into a bigger platform play. It is NOT a standalone unicorn.
### 1.3 Timing — Why NOW
Four forces are converging that make February 2026 the optimal entry window:
**1. The HashiCorp Exodus (2024-2026)**
IBM's acquisition of HashiCorp and the BSL license change triggered the largest migration event in IaC history. OpenTofu adoption is accelerating. Teams migrating from Terraform Cloud to OpenTofu + GitHub Actions lose their (mediocre) drift detection. They need a replacement. They're actively searching. RIGHT NOW.
**2. driftctl's Death Created a Vacuum**
driftctl was the only focused, open-source drift detection tool. Snyk killed it. The community is orphaned. GitHub issues, Reddit threads, and HN comments are filled with "what do I use instead of driftctl?" There is no answer. You ARE the answer.
**3. IaC Adoption Hit Mainstream (2024-2025)**
IaC is no longer a practice of elite DevOps teams. It's standard. Mid-market companies with 20-50 engineers now have 30+ Terraform stacks. They've graduated from "learning IaC" to "suffering from IaC at scale." Drift is the #1 pain point of IaC at scale. The market of sufferers just 10x'd.
**4. Multi-Tool Reality**
Teams no longer use just Terraform. They use Terraform AND OpenTofu AND Pulumi AND CloudFormation (legacy). No existing tool handles drift across all of them. The first tool that does owns the "Switzerland" position.
### 1.4 Regulatory & Trend Tailwinds
**SOC 2 Type II** — Now table stakes for any B2B SaaS. Auditors are increasingly asking: "How do you ensure your infrastructure matches your declared configuration?" The answer "we run terraform plan sometimes" is no longer acceptable. Continuous drift detection is becoming a compliance requirement, not a nice-to-have.
**HIPAA / HITRUST** — Healthcare SaaS companies managing PHI need to prove infrastructure configurations haven't been tampered with. Drift detection = continuous compliance evidence.
**PCI DSS 4.0** — Effective March 2025. Requirement 1.2.5 requires documentation and review of all allowed services, protocols, and ports. Drift in security groups is now a PCI finding, not just an operational annoyance.
**FedRAMP / StateRAMP** — Government cloud compliance frameworks increasingly require continuous monitoring of configuration state. Drift detection maps directly to NIST 800-53 CM-3 (Configuration Change Control) and CM-6 (Configuration Settings).
**Cyber Insurance** — Insurers are asking more detailed questions about infrastructure configuration management. Companies with continuous drift detection get better rates. This is an emerging but real purchasing driver.
**The Net Effect:** Compliance is transforming drift detection from "engineering nice-to-have" to "business requirement." Diana (the compliance persona from your design thinking session) isn't just a user — she's the budget unlocker. When the auditor says "you need this," the CFO writes the check.
---
## Section 2: COMPETITIVE POSITIONING
### 2.1 Blue Ocean Strategy Canvas
The Blue Ocean framework asks: where are all competitors investing heavily, and where is nobody investing at all? The blue ocean is the uncontested space.
**Factors of Competition in IaC Management:**
```
Factor | Spacelift | env0 | TFC | Firefly | dd0c/drift
--------------------------|-----------|------|------|---------|----------
Platform Breadth | 9 | 8 | 8 | 8 | 2
Enterprise Features | 9 | 7 | 9 | 9 | 2
Drift Detection Depth | 6 | 4 | 3 | 7 | 10
Remediation Workflows | 5 | 3 | 2 | 6 | 9
Self-Serve Onboarding | 3 | 5 | 6 | 2 | 10
Time-to-Value | 3 | 4 | 5 | 2 | 10
Price Accessibility | 2 | 3 | 4 | 1 | 10
Multi-IaC Support | 7 | 6 | 2 | 7 | 8
Slack-Native Experience | 4 | 3 | 2 | 3 | 10
Compliance Reporting | 5 | 4 | 5 | 7 | 8
CI/CD Orchestration | 9 | 8 | 9 | 4 | 0
Policy-as-Code | 8 | 6 | 7 | 8 | 0
Cost Management | 3 | 7 | 3 | 5 | 0
```
**The Blue Ocean:** dd0c/drift deliberately scores ZERO on CI/CD orchestration, policy-as-code, and cost management. This is not weakness — it's strategy. Every competitor is fighting over the same red ocean of "IaC platform" features. dd0c/drift creates blue ocean by:
1. **Eliminating** platform features entirely (no CI/CD, no policy engine, no cost tools)
2. **Raising** drift detection and remediation to 10/10 — making it the core product, not a feature
3. **Creating** Slack-native remediation (nobody does this well) and 60-second onboarding (nobody does this at all)
4. **Reducing** price by 17x, making procurement irrelevant (credit card purchase, not enterprise sales)
The strategic canvas shows a clear "crossing pattern" — dd0c/drift's value curve is the INVERSE of every competitor. Where they're high, you're low. Where they're low, you're high. This is textbook Blue Ocean. You're not competing. You're redefining.
### 2.2 Porter's Five Forces
**1. Threat of New Entrants: MEDIUM-HIGH**
- Low technical barriers. Any competent engineer can build a cron job that runs `terraform plan`. The detection engine is not rocket science.
- BUT: the product layer (UX, Slack integration, remediation workflows, compliance reporting, multi-IaC support) is 10x harder than the detection engine. The moat is product, not technology.
- Cloud providers could build native drift detection (AWS Config already does it for CloudFormation). But they won't optimize for third-party IaC tools — it's against their strategic interest.
- Verdict: Easy to enter, hard to win. Your moat is speed-to-market + product quality + community.
**2. Bargaining Power of Buyers: HIGH**
- Buyers have alternatives: do nothing (run `terraform plan` manually), use platform features (Spacelift/env0), or build internal tooling.
- Switching costs are low for a drift detection tool — it's read-only access to state files and cloud APIs.
- Price sensitivity is high for SMB/mid-market (your target). They'll compare $29/stack to "free" (manual process) and need to see clear ROI.
- Verdict: You must demonstrate ROI in the first 5 minutes. The "Drift Cost Calculator" content play is essential — show them what drift is costing them BEFORE they sign up.
**3. Bargaining Power of Suppliers: LOW**
- Your "suppliers" are cloud provider APIs (AWS, Azure, GCP) and IaC tool formats (Terraform state, Pulumi state).
- These are open/standardized. No supplier can cut you off.
- The one risk: HashiCorp could make Terraform state format proprietary or restrict access. But OpenTofu fork means this is a self-defeating move. And the BSL license already pushed the community away.
- Verdict: No supplier risk. You're reading open formats and public APIs.
**4. Threat of Substitutes: HIGH**
- The primary substitute is "do nothing" — manual `terraform plan` runs, tribal knowledge, and hope.
- Secondary substitute: build internal tooling. Many teams have a bash script or GitHub Action that approximates drift detection.
- Tertiary substitute: platform features in Spacelift/env0/TFC that offer "good enough" drift detection as part of a broader purchase.
- Verdict: Your biggest competitor is inertia. The product must be SO easy to adopt and SO immediately valuable that "do nothing" feels irresponsible. The 60-second onboarding is not a nice-to-have — it's existential.
**5. Competitive Rivalry: MEDIUM**
- No one is competing on drift detection as a primary product. Everyone treats it as a feature.
- The "focused drift detection tool" category has exactly one dead player (driftctl) and zero live ones.
- Rivalry will increase if dd0c/drift proves the market — Spacelift/env0 will invest more in their drift features, and new entrants will appear.
- Verdict: You have a 12-18 month window of low rivalry. Use it to build market position and community before the platforms respond.
**Porter's Overall Assessment:** The industry structure is favorable for a focused entrant. High buyer power and high substitute threat mean you must nail the value proposition and onboarding. But low supplier power and medium rivalry give you room to establish position. The key strategic imperative: **move fast, build community, create switching costs through integrations and data before the platforms wake up.**
### 2.3 Value Curve vs. Competitors
The value proposition differs by competitor:
**vs. Spacelift:** "You don't need to migrate your entire CI/CD pipeline to detect drift. dd0c/drift plugs into your existing workflow in 60 seconds. It costs $29/stack instead of $500+/month. And it does drift detection better because that's ALL it does."
**vs. env0:** "env0 is a platform that happens to detect drift. dd0c/drift is a drift detection tool that does nothing else — and does it 10x better. No per-user pricing that punishes growing teams. No platform migration. Just drift detection that works."
**vs. Terraform Cloud:** "You left TFC because of the BSL license and IBM pricing. Why would you go back for mediocre drift detection? dd0c/drift works with Terraform AND OpenTofu AND Pulumi. It's the drift detection TFC should have built but never will."
**vs. Firefly:** "Firefly is enterprise cloud asset management for companies with $50K+ budgets and 6-month procurement cycles. dd0c/drift is for the team that needs drift detection TODAY, for $29/stack, with a credit card. No sales calls. No demos. No SOWs."
**vs. "Do Nothing" (Manual terraform plan):** "You're already running terraform plan manually. You know it doesn't scale. You know things drift between checks. You know the cron job broke 3 weeks ago and nobody noticed. dd0c/drift is the productized version of what you're already trying to do — except it actually works, runs continuously, alerts you in Slack, and lets you fix drift in one click."
### 2.4 Solo Founder Advantages
Brian, being a solo founder against $40M-funded competitors sounds suicidal. It's actually your greatest strategic advantage. Here's why:
**1. The 17x Price Advantage Is Structural, Not Temporary**
Spacelift has ~200 employees. At $150K average fully-loaded cost, that's $30M/year in payroll. They NEED $30K+ enterprise deals to survive. They literally cannot sell a $29/stack product — the unit economics don't support their cost structure. Your cost structure is you, a laptop, and AWS credits. You can profitably serve customers they can't afford to talk to.
**2. Focus Is a Weapon**
Spacelift's product team is split across CI/CD, policy, drift, blueprints, contexts, worker pools, and 50 other features. Their drift detection gets maybe 5% of engineering attention. Your drift detection gets 100%. You will always be 6-12 months ahead on drift-specific features because it's your ONLY product.
**3. Speed of Decision**
You can ship a feature in a day. Spacelift needs a product review, design review, engineering sprint, QA cycle, and release train. When a customer asks for OpenTofu support, you ship it Thursday. They ship it next quarter. In a market where the IaC landscape is shifting rapidly (OpenTofu, Pulumi growth, AI-generated IaC), speed of adaptation is survival.
**4. Authenticity**
You're a senior AWS architect who has LIVED the drift problem. You're not a product manager who read about it in a Gartner report. Your blog posts, HN comments, and conference talks will carry the credibility of someone who's been paged at 2am because of drift. Developers buy from practitioners, not from marketing teams.
**5. No Investor Pressure to "Go Enterprise"**
Bootstrapped means you don't have a board pushing you to hire a sales team and chase $100K deals. You can stay PLG, stay developer-focused, and stay cheap. This is a strategic moat that funded competitors literally cannot replicate — their investors won't let them.
---
## Section 3: DISRUPTION ANALYSIS
### 3.1 Christensen Framework — Classic Low-End Disruption
Clayton Christensen would look at dd0c/drift and smile. This is textbook disruption theory. Let me walk through it precisely, because understanding WHY this works matters more than believing it will.
**The Incumbent's Dilemma:**
Spacelift and env0 are classic sustaining innovators. They started with a focused product, raised venture capital, and are now marching upmarket to justify their valuations. Every quarter, they add more features (policy-as-code, cost estimation, blueprints, self-service portals) to win larger enterprise deals. Their product roadmap is driven by what their biggest customers ask for — which is always "more platform."
This creates the classic Christensen gap at the bottom of the market:
```
Performance
    │                      ╱  Spacelift/env0 trajectory
    │                    ╱    (more platform features)
    │                  ╱  ← Enterprise needs
    │                ╱
    │    ╱─────────────────── SMB needs (drift detection + remediation)
    │  ╱  ← dd0c/drift enters HERE
    │╱
    └──────────────────────► Time
```
**The Gap:** SMB and mid-market teams (5-50 engineers, 10-100 stacks) need drift detection. They do NOT need CI/CD orchestration, policy-as-code engines, or self-service infrastructure portals. Spacelift/env0 are overshooting their needs and overcharging for the privilege.
**dd0c/drift enters at the low end** — offering "just" drift detection at 17x lower price. Incumbents look at this and think: "That's a feature, not a product. They'll never win enterprise deals." They're right. And that's exactly why they won't respond.
**The Disruption Sequence:**
1. **Year 1:** dd0c/drift captures SMB teams that can't afford or don't need Spacelift. Spacelift ignores this — these aren't their customers anyway.
2. **Year 2:** dd0c/drift adds compliance reporting, team features, and deeper remediation. Mid-market teams start choosing dd0c/drift over Spacelift for drift specifically, while using GitHub Actions for CI/CD.
3. **Year 3:** dd0c/drift's drift detection is so superior (because it's the ONLY thing they build) that even Spacelift's enterprise customers start asking "why is dd0c/drift better at drift than you?" Spacelift can't catch up because drift gets 5% of their engineering attention.
4. **Year 4:** dd0c/drift expands into adjacent features (state management, policy for drift, IaC analytics) — moving upmarket from a position of strength.
**The Key Insight:** Disruption doesn't require you to be better at everything. It requires you to be better at ONE thing that matters, at a price point incumbents can't match. $29/stack for world-class drift detection is that thing.
### 3.2 Jobs-To-Be-Done (JTBD) Competitive Analysis
JTBD theory says customers don't buy products — they "hire" them to do a job. Let's map the jobs and who currently gets hired:
**Job 1: "Help me know when my infrastructure has drifted from code"**
- Current hire: Manual `terraform plan` (free, unreliable, doesn't scale)
- Alternative hire: Spacelift drift detection (expensive, requires platform adoption)
- Alternative hire: Cron job + bash script (free, fragile, no UI, breaks silently)
- dd0c/drift fit: **PERFECT.** This is the core job. 10/10 fit.
**Job 2: "Help me fix drift quickly without breaking things"**
- Current hire: Manual investigation + careful `terraform apply` (slow, risky, requires expertise)
- Alternative hire: Spacelift auto-reconciliation (good but requires full platform)
- Alternative hire: Nothing. Most teams just live with drift.
- dd0c/drift fit: **EXCELLENT.** One-click revert with context and blast radius analysis. 9/10 fit.
**Job 3: "Help me prove to auditors that infrastructure matches code"**
- Current hire: Manual quarterly reconciliation + spreadsheets (expensive in engineer-hours, always stale)
- Alternative hire: AWS Config (partial — doesn't compare against IaC intent)
- Alternative hire: Firefly (good but enterprise-only, $1,000+/mo)
- dd0c/drift fit: **STRONG.** Continuous compliance evidence generation. 8/10 fit.
**Job 4: "Help me see infrastructure health across all stacks"**
- Current hire: Standup meetings + tribal knowledge (doesn't scale, single point of failure)
- Alternative hire: Spacelift dashboard (good but requires full platform)
- Alternative hire: Custom Datadog dashboards (partial, doesn't understand IaC state)
- dd0c/drift fit: **STRONG.** Drift score dashboard, stack health view. 8/10 fit.
**Job 5: "Help me manage my entire IaC workflow (plan, apply, approve, deploy)"**
- Current hire: GitHub Actions + manual process
- Alternative hire: Spacelift, env0, TFC (this is their core job)
- dd0c/drift fit: **ZERO.** Not your job. Don't even think about it. This is where competitors live. Stay away.
**JTBD Strategic Implication:** dd0c/drift is the best hire for Jobs 1-4 and deliberately refuses Job 5. This is the "anti-platform" positioning. You win by doing LESS, not more. Every feature request that smells like Job 5 ("can you also run our terraform apply?") gets a polite "no — use GitHub Actions for that, we integrate with it."
### 3.3 Switching Costs Analysis
Switching costs determine whether customers stay. For dd0c/drift, the analysis is nuanced:
**Switching costs FROM competitors TO dd0c/drift: LOW (your advantage)**
- From Spacelift: Teams frustrated with platform complexity can add dd0c/drift alongside Spacelift (or instead of it for drift). No migration required.
- From env0: Same story. dd0c/drift doesn't replace their CI/CD — it replaces their drift feature.
- From TFC: Teams leaving TFC for OpenTofu need drift detection. dd0c/drift is a natural addition to their new GitHub Actions workflow.
- From "do nothing": The 60-second `drift init` setup means the switching cost from manual process to dd0c/drift is essentially zero.
**Switching costs FROM dd0c/drift TO competitors: BUILD THESE DELIBERATELY**
This is where you need to be strategic. A drift detection tool with low switching costs is a commodity. You must create stickiness:
1. **Integration Depth** — Deep Slack integration (custom channels per stack, approval workflows, remediation history in threads). Reconfiguring all of this in a competitor is painful.
2. **Historical Data** — 12 months of drift history, trend data, and compliance audit trails. This data doesn't export to Spacelift. Leaving means losing your audit history.
3. **Policy Configuration** — Per-resource-type remediation policies (auto-revert security groups, alert on IAM, ignore tags). Rebuilding these policies in another tool takes weeks.
4. **Team Workflows** — Stack ownership assignments, on-call routing, approval chains. These are organizational knowledge encoded in the tool.
5. **Compliance Dependency** — Once your SOC 2 audit evidence references dd0c/drift reports, switching tools means re-establishing evidence chains with your auditor. Nobody wants to do that mid-audit-cycle.
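As a concrete illustration of the integration-depth point, a drift alert can be rendered as a standard Slack Block Kit message and delivered via an incoming webhook. The payload builder below is a hypothetical sketch: the function name, record fields, and button wiring are assumptions for illustration, not dd0c/drift's actual API.

```python
import json

def build_drift_alert(stack: str, drifted: list[dict]) -> dict:
    """Build a Slack Block Kit payload for a drift alert.

    Hypothetical sketch: field names are illustrative, not the product's API.
    """
    lines = [f"• `{r['address']}`: {r['change']}" for r in drifted]
    return {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text",
                      "text": f"Drift detected in {stack}"}},
            {"type": "section",
             "text": {"type": "mrkdwn", "text": "\n".join(lines)}},
            {"type": "actions",
             "elements": [
                 {"type": "button",
                  "text": {"type": "plain_text", "text": "Revert drift"},
                  "value": f"revert:{stack}"},
             ]},
        ]
    }

payload = build_drift_alert(
    "prod-network",
    [{"address": "aws_security_group.web",
      "change": "ingress 0.0.0.0/0 added"}],
)
print(json.dumps(payload, indent=2))
# POSTing this JSON to a Slack incoming-webhook URL renders the alert;
# the "Revert drift" button would be handled by an interactivity endpoint.
```

Once a team's approval workflows and remediation history live in threads built from payloads like this, reconfiguring it all in a competitor's tool is real friction.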
**Net Assessment:** Easy to adopt (low switching costs in), increasingly hard to leave (growing switching costs out). This is the ideal dynamic for a PLG product.
### 3.4 Data Moats — Drift Pattern Intelligence
Here's where it gets interesting. And where the long-term defensibility lives.
**The Data Flywheel:**
Every dd0c/drift customer generates drift data: what resources drift, how often, who causes it, what time of day, what the blast radius is, how long until remediation. Individually, this data is useful for that customer. Aggregated and anonymized across thousands of customers, it becomes an intelligence asset that no competitor can replicate.
**What You Can Build With Aggregate Drift Data:**
1. **Drift Probability Scores** — "Security groups drift 3.2x more often than VPCs. RDS parameter groups drift most frequently on Fridays after deployments." Per-resource-type drift probability, trained on real-world data across thousands of organizations.
2. **Predictive Drift Alerts** — "Based on patterns from 10,000 similar organizations, this resource has a 78% chance of drifting in the next 48 hours." This is the "Wild Card #1" from the brainstorm session — and it's achievable with sufficient data.
3. **Remediation Recommendations** — "When this type of security group drift occurs, 89% of teams revert it. 11% accept it. Here's the most common reason for acceptance." AI-powered remediation suggestions based on what similar teams do.
4. **Industry Benchmarking** — "Your organization has a 12% drift rate. The median for Series B SaaS companies with 20-50 stacks is 18%. You're in the top quartile." Competitive benchmarking that creates FOMO and drives adoption.
5. **Compliance Risk Scoring** — "Organizations with your drift profile have a 34% higher rate of SOC 2 findings related to configuration management." Risk quantification that sells to the Diana persona.
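The per-resource-type drift probability in point 1 reduces to simple aggregation once drift events are recorded. A minimal sketch, with schema and observations invented for illustration:

```python
from collections import Counter

# Each record: (resource_type, drifted?) for one polling window.
# Observations are invented for illustration.
observations = [
    ("aws_security_group", True), ("aws_security_group", False),
    ("aws_security_group", True), ("aws_vpc", False),
    ("aws_vpc", False), ("aws_db_parameter_group", True),
]

totals, drifts = Counter(), Counter()
for rtype, drifted in observations:
    totals[rtype] += 1
    drifts[rtype] += drifted

rates = {rtype: drifts[rtype] / totals[rtype] for rtype in totals}
for rtype, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{rtype}: {rate:.0%} drift rate")
```

The product-grade version aggregates across thousands of organizations and conditions on deploy events, time of day, and resource attributes, but the flywheel starts from exactly this kind of counting.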
**The Moat Mechanics:**
- More customers → more drift data → better predictions → better product → more customers.
- This flywheel takes 12-18 months to become meaningful (you need ~500+ customers generating data).
- Once spinning, it's nearly impossible for a new entrant to replicate — they'd need the same customer base to generate the same data.
- Spacelift/env0 COULD build this, but drift is a feature for them, not the product. They won't invest in drift-specific ML when they're building CI/CD features for enterprise deals.
**The Honest Caveat:** Data moats are real but slow to build. For the first 12 months, your moat is speed, focus, and price. The data moat kicks in around Year 2. Don't over-index on it in your pitch — it's a long-term strategic asset, not a launch differentiator.
---
## Section 4: GO-TO-MARKET STRATEGY
### 4.1 Beachhead — The First 10 Customers
Brian, the first 10 customers define your company. Get them wrong and you'll spend 18 months building features for the wrong audience. Get them right and they become your case studies, your referral engine, and your product advisory board.
**Ideal First Customer Profile:**
- Team size: 5-20 engineers
- Stacks: 10-50 Terraform/OpenTofu stacks
- Cloud: AWS (start single-cloud, expand later)
- Current drift solution: Manual `terraform plan` or nothing
- Budget for drift tooling: $0-$500/mo (can't afford Spacelift, won't get Firefly to return their call)
- Pain trigger: Recent drift-related incident, upcoming SOC 2 audit, or engineer burnout from manual reconciliation
- Decision maker: The infra engineer or DevOps lead (not a VP — no procurement process)
**Where These People Live:**
1. **r/terraform** — 80K+ members. Search for "drift" and you'll find weekly posts asking for solutions. These are your people. They're already in pain.
2. **r/devops** — 300K+ members. Broader audience but drift discussions surface regularly.
3. **Hacker News** — "Show HN" launches for developer tools consistently hit front page. A well-crafted launch post ("I built a $29/mo alternative to Spacelift's drift detection") will generate 200+ comments and 5,000+ site visits.
4. **driftctl GitHub Issues** — The abandoned driftctl repo has open issues from people asking "what do I use instead?" These are pre-qualified leads. Literally people who searched for your product and found a dead project.
5. **HashiCorp Community Forum** — Teams migrating from TFC to OpenTofu are actively discussing tooling gaps. Drift detection is consistently mentioned.
6. **DevOps Slack Communities** — Rands Leadership Slack, DevOps Chat, Kubernetes Slack (#terraform channel). Organic mentions and helpful answers build credibility.
7. **Twitter/X DevOps Community** — DevOps influencers (Kelsey Hightower, Charity Majors, etc.) regularly discuss IaC pain points. A well-timed thread about drift costs gets amplified.
**The First 10 Acquisition Playbook:**
1. **Customers 1-3: Personal network.** Brian, you're a senior AWS architect. You know people who manage Terraform stacks. Call them. "I'm building a drift detection tool. Can I give you free access for 3 months in exchange for feedback?" These are your design partners.
2. **Customers 4-6: Community engagement.** Spend 2 weeks answering drift-related questions on r/terraform and r/devops. Don't pitch. Just help. Then post "Show HN: I built dd0c/drift — $29/mo drift detection for Terraform." The community engagement creates credibility for the launch.
3. **Customers 7-10: Content-driven inbound.** Publish "The True Cost of Infrastructure Drift" blog post with the Drift Cost Calculator. Promote on HN, Reddit, Twitter. Convert readers to free tier users, convert free tier to paid.
**Timeline to First 10 Paying Customers:** 60-90 days from public launch. This is aggressive but achievable with the right community engagement.
### 4.2 Pricing — Is $29/Stack the Right Anchor?
Let me stress-test the $29/stack/month pricing from multiple angles.
**The Bull Case for $29/Stack:**
1. **Credit Card Threshold** — $29 is below the "ask my manager" threshold at most companies. An engineer can expense it. No procurement. No legal review. No 3-month sales cycle. This is the PLG unlock.
2. **Anchoring Against Competitors** — When someone Googles "drift detection pricing" and sees Spacelift at $500+/mo and dd0c/drift at $29/stack, the contrast is visceral. "17x cheaper" is a headline that writes itself.
3. **Expansion Revenue** — A team starts with 5 stacks ($145/mo). As they grow to 20 stacks ($580/mo), then 50 stacks ($1,450/mo), revenue expands naturally without upselling. The pricing model has built-in NDR (Net Dollar Retention) >120%.
4. **Market Positioning** — $29/stack says "this is a utility, not a platform." It positions dd0c/drift as infrastructure (like Datadog per-host pricing) rather than a software platform (like Spacelift per-seat pricing).
**The Bear Case Against $29/Stack:**
1. **Revenue Per Customer Is Low** — Average customer with 15 stacks = $435/mo. You need 115 customers to hit $50K MRR. That's achievable but requires real marketing effort for a solo founder.
2. **Stack Count Is Ambiguous** — What's a "stack"? A Terraform workspace? A state file? A root module? Customers will argue about this. You need a crystal-clear definition.
3. **Penalizes Good Architecture** — Teams that split infrastructure into many small stacks (best practice) pay more than teams with monolithic stacks. This creates a perverse incentive.
4. **Enterprise Sticker Shock** — A team with 200 stacks would pay $5,800/mo. At that point, Spacelift's platform (which includes drift detection PLUS CI/CD) starts looking reasonable. You lose the price advantage at scale.
**Victor's Pricing Recommendation:**
Don't do pure per-stack pricing. Use a **tiered model with stack bundles:**
| Tier | Price | Stacks | Polling | Features |
|------|-------|--------|---------|----------|
| Free | $0 | 3 stacks | Daily | Slack alerts, basic dashboard |
| Starter | $49/mo | 10 stacks | 15-min | + One-click remediation, stack ownership |
| Pro | $149/mo | 30 stacks | 5-min | + Compliance reports, auto-remediation policies, API |
| Business | $399/mo | 100 stacks | 1-min | + SSO, RBAC, audit trail export, priority support |
| Enterprise | Custom | Unlimited | Real-time | + SLA, dedicated support, custom integrations |
**Why This Is Better Than Pure Per-Stack:**
1. **Predictable pricing** — Customers know exactly what they'll pay. No surprise bills when they add stacks.
2. **Encourages adoption** — "I'm on the 30-stack plan but only using 15. Let me add more stacks since I'm already paying for them." Drives usage.
3. **Natural upsell** — When a team hits 30 stacks, they upgrade to Business. Clear upgrade trigger.
4. **Enterprise ceiling** — Business at $399/mo is still 10x cheaper than Spacelift. You maintain the price advantage even at scale.
5. **Free tier is generous** — 3 stacks with daily polling is genuinely useful for small teams and side projects. This is your viral loop.
**The $29/stack messaging still works for marketing** — "Starting at $29/stack" is the headline. The tiered pricing is what they see on the pricing page. The anchor is set.
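The upgrade trigger in the tiered model can be expressed directly. Tier names and stack limits mirror the table above; the selection function itself is an illustrative sketch, not product code:

```python
# Tier limits mirror the pricing table above.
TIERS = [  # (name, monthly price in $, max stacks; None = unlimited/custom)
    ("Free", 0, 3),
    ("Starter", 49, 10),
    ("Pro", 149, 30),
    ("Business", 399, 100),
    ("Enterprise", None, None),
]

def tier_for(stacks: int) -> str:
    """Smallest tier whose stack limit covers the org's stack count."""
    for name, _price, limit in TIERS:
        if limit is None or stacks <= limit:
            return name
    return "Enterprise"

assert tier_for(3) == "Free"
assert tier_for(15) == "Pro"       # crossing 10 stacks triggers the upsell
assert tier_for(31) == "Business"  # outgrowing Pro is the next clear trigger
```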
### 4.3 Channel Strategy — PLG via CLI Onboarding
**The Funnel:**
```
Awareness (Content + Community)
        ↓
Interest (Drift Cost Calculator / Blog)
        ↓
Activation (`drift init` — 60 seconds to first alert)
        ↓
Engagement (Daily Slack alerts, drift score)
        ↓
Revenue (Free → Starter when they hit 4+ stacks or need remediation)
        ↓
Expansion (Starter → Pro → Business as stacks grow)
        ↓
Referral ("This tool saved my team 10 hours/week" → word of mouth)
```
**Channel Breakdown:**
**1. Hacker News (Primary Launch Channel)**
- "Show HN" post: "dd0c/drift — $29/mo drift detection for Terraform/OpenTofu. Set up in 60 seconds."
- HN loves: open-source components, solo founder stories, tools that replace expensive platforms, clear pricing.
- Expected outcome: 200-500 comments, 5,000-15,000 site visits, 100-300 signups, 10-30 paying customers.
- Timing: Launch on a Tuesday or Wednesday morning (US Eastern). Avoid Mondays (crowded) and Fridays (low traffic).
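The expected-outcome ranges above imply the conversion rates the launch plan is betting on. A quick sanity check (all inputs are the estimates from this section, not measured data):

```python
# Implied conversion rates from the HN launch estimates above.
visits = (5_000, 15_000)
signups = (100, 300)
paying = (10, 30)

visit_to_signup = signups[0] / visits[0]  # both ends imply ~2%
signup_to_paid = paying[0] / signups[0]   # ~10%
print(f"visit -> signup: {visit_to_signup:.0%}")   # 2%
print(f"signup -> paid:  {signup_to_paid:.0%}")    # 10%
```

A ~2% visit-to-signup rate and ~10% signup-to-paid rate are plausible for a developer tool with a free tier, which is why the estimate is credible rather than hopeful.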
**2. Reddit (Sustained Community Engagement)**
- r/terraform: Weekly engagement answering drift questions. Monthly "how we detect drift" technical posts.
- r/devops: Broader DevOps audience. Focus on the operational pain, not the tool.
- r/aws: AWS-specific drift scenarios (security groups, IAM policies).
- Rule: 10:1 ratio of helpful comments to self-promotion. Build credibility first.
**3. Technical Blog (SEO + Thought Leadership)**
- "The True Cost of Infrastructure Drift" (with calculator)
- "driftctl Is Dead. Here's What to Use Instead."
- "How to Detect Terraform Drift Without Spacelift"
- "SOC 2 and Infrastructure Drift: A Compliance Guide"
- "Terraform vs OpenTofu: Drift Detection Compared"
- Each post targets a specific long-tail keyword. SEO compounds over 6-12 months.
**4. GitHub (Open-Source Lead Gen)**
- Open-source the CLI detection engine (`dd0c/drift-cli`).
- Free, local drift detection. No account needed.
- The CLI outputs: "Found 7 drifted resources. View details and remediate at app.dd0c.dev" — the upsell to SaaS.
- GitHub stars = social proof. Target 1,000 stars in first 3 months.
**5. Conference Talks (Credibility + Reach)**
- HashiConf (if it still exists post-IBM), KubeCon, DevOpsDays, local meetups.
- Talk title: "The Hidden Cost of Infrastructure Drift: Data from 1,000 Terraform Stacks" (once you have the data).
- Conference talks convert at low volume but high quality — the people who approach you after the talk are pre-sold.
### 4.4 Content Strategy — The Drift Cost Calculator
**The "Drift Cost Calculator" is your single most important marketing asset.** Here's why:
Most developer tools market with features: "We detect drift! We have Slack alerts! We do remediation!" Features don't sell. Pain quantification sells.
**The Calculator:**
A simple web tool where an engineer inputs:
- Number of Terraform stacks
- Average team size
- Average engineer salary
- Frequency of manual drift checks
- Number of drift-related incidents per quarter
**Output:**
- "Your team spends approximately **$47,000/year** on manual drift management."
- "At $149/mo for dd0c/drift Pro, your ROI is **26x** in the first year."
- "You're losing **312 engineering hours/year** to drift-related work."
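The calculator's core is a few lines of arithmetic. The sketch below uses assumed formulas and example inputs; the real calculator's weighting (minutes per check, hours per incident) is a product decision not specified here:

```python
def drift_cost(stacks: int, salary: int,
               checks_per_month: int, incidents_per_quarter: int) -> dict:
    """Rough annual cost of manual drift management.

    Formulas are illustrative assumptions, not the calculator's actual model:
    - each manual check costs ~15 minutes per stack
    - each drift incident costs ~8 engineer-hours to investigate and fix
    """
    hourly = salary / 2080  # working hours per year
    check_hours = stacks * 0.25 * checks_per_month * 12
    incident_hours = incidents_per_quarter * 4 * 8
    hours = check_hours + incident_hours
    return {"hours_per_year": round(hours),
            "cost_per_year": round(hours * hourly)}

result = drift_cost(stacks=30, salary=160_000,
                    checks_per_month=2, incidents_per_quarter=3)
print(result)
```

Feeding realistic mid-market inputs into even this crude model produces a five-figure annual number, which is exactly the headline the calculator needs.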
**Why This Works:**
1. It makes the invisible visible. Drift costs are hidden in engineer time, not in a line item.
2. It creates urgency. "$47K/year" is a number a manager can act on. "Drift is annoying" is not.
3. It's shareable. Engineers send the calculator to their managers. "Look what drift is costing us."
4. It captures leads. "Enter your email to get the full report" after showing the headline number.
5. It's content marketing gold. "We analyzed drift costs across 500 teams. The average is $52K/year." Blog post writes itself.
### 4.5 Open-Source as Lead Gen
**The Strategy:** Open-source the drift detection CLI. Charge for the SaaS layer.
**What's Open-Source:**
- `drift-cli` — Local drift detection for Terraform/OpenTofu. Runs `drift check` and outputs drifted resources to stdout.
- Works offline. No account needed. No telemetry.
- Supports single-stack scanning. Multi-stack requires SaaS.
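Conceptually, the CLI's core loop is a comparison between attributes recorded in the state file and attributes fetched live from the cloud API. A heavily simplified sketch (the real engine must also handle computed attributes, provider schemas, nested blocks, and ignore rules):

```python
def diff_resource(declared: dict, actual: dict) -> dict:
    """Attributes whose live value differs from the state-file value."""
    return {
        key: {"declared": declared[key], "actual": actual.get(key)}
        for key in declared
        if declared[key] != actual.get(key)
    }

# State-file snapshot vs. live API response (both invented for illustration).
declared = {"instance_type": "t3.medium", "monitoring": True,
            "tags": {"team": "platform"}}
actual = {"instance_type": "t3.large", "monitoring": True,
          "tags": {"team": "platform"}}

drift = diff_resource(declared, actual)
print(drift)
# {'instance_type': {'declared': 't3.medium', 'actual': 't3.large'}}
```

Open-sourcing this layer is cheap precisely because the comparison is simple; the defensible value is everything the SaaS wraps around it.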
**What's Paid SaaS:**
- Continuous monitoring (scheduled + event-driven)
- Slack/PagerDuty alerts
- One-click remediation
- Dashboard and drift score
- Compliance reports
- Team features (ownership, routing, RBAC)
- Historical data and trends
- Multi-stack aggregate view
**The Conversion Funnel:**
1. Engineer discovers `drift-cli` on GitHub or HN.
2. Runs `drift check` on one stack. Finds 5 drifted resources. "Oh crap."
3. Wants to run it on all 30 stacks continuously. Can't do that locally.
4. Signs up for free tier (3 stacks, daily polling).
5. Gets hooked on Slack alerts. Wants remediation and more stacks.
6. Upgrades to Starter ($49/mo) or Pro ($149/mo).
This is the Sentry/PostHog/GitLab playbook. Open-source core builds trust and adoption. Paid SaaS captures value from teams that need more.
### 4.6 Partnership Strategy
**HashiCorp/Terraform Ecosystem:**
- List on Terraform Registry as a complementary tool.
- Write a Terraform provider (`terraform-provider-driftcheck`) that exposes drift status as data sources.
- Publish in HashiCorp's partner ecosystem (if they still maintain one post-IBM).
- Caveat: Don't depend on HashiCorp goodwill. They may view you as competitive to TFC. Maintain independence.
**OpenTofu Foundation:**
- Become a visible OpenTofu ecosystem partner. Sponsor the project. Contribute to discussions.
- Position dd0c/drift as "the drift detection tool for the OpenTofu community."
- OpenTofu teams are actively building their toolchain. Be part of it from day one.
**Slack Marketplace:**
- List dd0c/drift as a Slack app. Slack Marketplace is an underrated distribution channel for DevOps tools.
- "Install dd0c/drift from Slack" → OAuth → connect state backend → first alert in 5 minutes.
**AWS Marketplace:**
- List on AWS Marketplace for teams that want to pay through their AWS bill (consolidated billing, committed spend credits).
- AWS Marketplace listing also provides credibility and discoverability.
---
## Section 5: RISK MATRIX
### 5.1 Top 10 Risks — Scored by Severity × Probability
I'm going to be brutal here. If you can't stomach these risks, don't build this product.
| # | Risk | Severity (1-5) | Probability (1-5) | Score | Timeframe |
|---|------|----------------|-------------------|-------|-----------|
| 1 | HashiCorp/IBM builds native drift detection into TFC that's "good enough" | 5 | 3 | 15 | 12-24 months |
| 2 | Solo founder burnout — you're building 6 products, not 1 | 5 | 4 | 20 | 6-12 months |
| 3 | Spacelift drops drift detection into their free tier to kill you | 4 | 3 | 12 | 12-18 months |
| 4 | OpenTofu fragments the market, slowing IaC adoption overall | 3 | 3 | 9 | 12-24 months |
| 5 | AWS/Azure/GCP build native IaC drift detection into their consoles | 5 | 2 | 10 | 24-36 months |
| 6 | Security concerns prevent teams from granting read access to state/cloud | 4 | 3 | 12 | Immediate |
| 7 | "Good enough" internal tooling (bash scripts, GitHub Actions) prevents adoption | 3 | 4 | 12 | Ongoing |
| 8 | AI-generated IaC reduces drift by making reconciliation trivial | 3 | 2 | 6 | 18-36 months |
| 9 | Pricing pressure from open-source alternatives (someone forks driftctl, builds a better one) | 3 | 3 | 9 | 6-18 months |
| 10 | Customer concentration risk — first 10 customers represent 80%+ of revenue | 3 | 4 | 12 | 0-12 months |
### 5.2 Mitigation Strategies
**Risk 1: HashiCorp/IBM Builds Native Drift Detection**
This is the existential risk. Let's be precise about it.
*Why it might happen:* IBM paid $6.4B for HashiCorp. They need to justify the acquisition by expanding TFC's feature set and increasing enterprise revenue. Drift detection is an obvious feature to improve.
*Why it might NOT happen:* IBM is IBM. They move slowly. They'll focus on enterprise features (governance, compliance frameworks, SSO) that justify $70K+ annual contracts. Improving drift detection for the free/starter tier doesn't move the revenue needle. Also, post-BSL, the community is migrating to OpenTofu. IBM may double down on enterprise lock-in rather than community features.
*Mitigation:*
1. **Multi-IaC is your insurance policy.** TFC will only ever support Terraform. dd0c/drift supports Terraform + OpenTofu + Pulumi. Every team using multiple IaC tools is immune to TFC's drift features.
2. **Speed.** You need to be 18 months ahead on drift-specific features by the time IBM responds. That means shipping weekly, not quarterly.
3. **Community lock-in.** If dd0c/drift is the community standard for drift detection (the "driftctl successor"), IBM improving TFC drift won't matter — the community has already chosen you.
4. **Worst case:** TFC drift detection becomes "good enough" for Terraform-only teams on TFC. You lose that segment. But teams on OpenTofu, Pulumi, or multi-IaC are still yours. That's the growing segment.
**Risk 2: Solo Founder Burnout**
This is the risk I'm most worried about. Not because of the market — because of you.
*The math:* dd0c is 6 products. Even if drift is Phase 3, you're building route, cost, and alert first. By the time you get to drift, you'll have 3 products to maintain, support, and market. Adding a 4th is not "building a new product" — it's "adding a third more work to an already unsustainable workload."
*Mitigation:*
1. **Ruthless prioritization.** If drift is the product with the clearest market gap and the strongest disruption thesis (it is), consider moving it to Phase 1 or Phase 2. Don't wait until you're already exhausted.
2. **Shared infrastructure.** The dd0c platform architecture (shared auth, billing, OTel pipeline) must be built ONCE and reused. If each product has its own backend, you're dead.
3. **AI-assisted development.** You're already using AI tools. Lean harder. Use Cursor/Copilot for 80% of the boilerplate. Reserve your cognitive energy for architecture decisions and customer conversations.
4. **Hire at $30K MRR.** The moment you hit $30K MRR across all dd0c products, hire a part-time contractor for support and bug fixes. Don't try to be a solo founder past $30K MRR — the support burden alone will consume you.
**Risk 3: Spacelift Drops Drift Into Free Tier**
*Why it might happen:* If dd0c/drift gains traction and starts appearing in "Spacelift alternatives" searches, Spacelift's marketing team will notice. The easiest response is to make their basic drift detection free.
*Why it might NOT happen:* Spacelift's drift detection requires private workers, which have infrastructure costs. Making it free erodes their upgrade path. Their investors won't love giving away features that drive enterprise upgrades.
*Mitigation:*
1. **Be better, not just cheaper.** If your drift detection is 10x better (Slack-native, one-click remediation, compliance reports, multi-IaC), "free but mediocre" from Spacelift won't matter. Nobody switched from Figma to free Adobe XD.
2. **Different buyer.** Spacelift's free tier targets teams evaluating their platform. dd0c/drift targets teams who DON'T WANT a platform. Different buyer, different motion.
3. **Speed of innovation.** By the time Spacelift responds, you should be 2 versions ahead with features they haven't thought of (drift prediction, cost-of-drift analytics, compliance automation).
**Risk 4: OpenTofu Fragments the Market**
*Mitigation:* This is actually an opportunity disguised as a risk. Fragmentation means teams use BOTH Terraform and OpenTofu (migration is gradual). They need a tool that works with both. That's you. Fragmentation increases your value proposition.
**Risk 5: Cloud Providers Build Native Drift Detection**
*Mitigation:* AWS CloudFormation has shipped built-in drift detection for years, and AWS Config tracks configuration changes alongside it. Both are clunky and CloudFormation-centric. Cloud providers optimize for their own IaC tools, not third-party ones. AWS will never build great Terraform drift detection — it's against their strategic interest (they want you on CloudFormation). You're safe for 3+ years.
**Risk 6: Security Concerns Block Adoption**
*Mitigation:*
1. **Read-only access only.** dd0c/drift never needs write access to cloud resources (except for remediation, which is opt-in).
2. **State file access architecture.** Offer multiple modes: (a) SaaS reads state from S3/GCS directly (requires IAM role), (b) Agent runs in customer's VPC and pushes drift data out (no inbound access), (c) CLI mode for air-gapped environments.
3. **SOC 2 certification for dd0c itself.** Eat your own dog food. Get SOC 2 certified. It's expensive ($20-50K) but it eliminates the "can we trust a solo founder's SaaS?" objection.
4. **Open-source the agent.** If the detection agent is open-source, security teams can audit the code. Trust through transparency.
**Risk 7: "Good Enough" Internal Tooling**
*Mitigation:* This is your biggest competitor — inertia. The Drift Cost Calculator directly attacks this by quantifying the cost of "good enough." When an engineer sees "$47K/year in drift management costs" vs. "$149/mo for dd0c/drift," the bash script suddenly looks expensive.
**Risk 8: AI-Generated IaC Reduces Drift**
*Mitigation:* AI might make it easier to WRITE IaC, but drift isn't caused by bad code — it's caused by humans making console changes, emergency hotfixes, and auto-scaling events. AI doesn't prevent a panicked engineer from opening a security group at 2am. If anything, AI-generated IaC increases the volume of managed resources, which increases the surface area for drift. This risk is overblown.
**Risk 9: Open-Source Competitor Emerges**
*Mitigation:* Beat them to it. YOUR CLI is open-source. If someone wants to build a free drift detection tool, they'll fork yours rather than building from scratch. You control the ecosystem. The SaaS layer (continuous monitoring, Slack integration, compliance reports, team features) is where you capture value — and that's hard to replicate in OSS.
**Risk 10: Customer Concentration**
*Mitigation:* Standard startup risk. Mitigate by aggressively pursuing PLG (many small customers) rather than a few large ones. Target: no single customer >10% of revenue by month 6.
### 5.3 Kill Criteria — When to Walk Away
Brian, every good strategist defines the conditions under which they retreat. Here are yours:
**Kill the product if ANY of these are true at the 6-month mark:**
1. **< 50 free tier signups** after HN launch + Reddit engagement + blog content. This means the market doesn't care, regardless of what the personas say.
2. **< 5 paying customers** after 90 days of the paid tier being available. Free users who won't pay are a vanity metric.
3. **Free-to-paid conversion < 3%.** Industry benchmark for PLG developer tools is 3-7%. Below 3% means the free tier is too generous or the paid tier isn't compelling.
4. **NPS < 30** from first 20 customers. If early adopters (who are the most forgiving) aren't enthusiastic, the product isn't solving a real problem.
5. **HashiCorp announces "Terraform Cloud Drift Detection Pro"** with continuous monitoring, Slack alerts, and remediation — included in the Plus tier. If this happens before you have 100+ customers, pivot to a different dd0c module.
**Kill the product if ANY of these are true at the 12-month mark:**
1. **< $10K MRR.** At $10K MRR after 12 months, the growth trajectory doesn't support a standalone product. Fold drift detection into dd0c/portal as a feature instead.
2. **Churn > 8% monthly.** Developer tools should have <5% monthly churn. Above 8% means customers try it, find it insufficient, and leave. The product isn't sticky.
3. **CAC payback > 12 months.** If it takes more than 12 months of revenue to recoup the cost of acquiring a customer, the unit economics don't work for a bootstrapped founder.
### 5.4 Scenario Planning — Revenue Projections
**Scenario A: "The Rocket" (20% probability)**
- HN launch goes viral. 500+ signups in week 1.
- driftctl community adopts dd0c/drift as the successor.
- Word-of-mouth drives organic growth.
- Month 6: 100 paying customers, $15K MRR
- Month 12: 350 paying customers, $52K MRR
- Month 24: 1,200 paying customers, $180K MRR
- Outcome: Standalone product. Consider raising seed round to accelerate.
**Scenario B: "The Grind" (50% probability)**
- Steady but unspectacular growth. Community engagement works but slowly.
- Each blog post and Reddit thread brings 5-10 signups.
- Month 6: 40 paying customers, $6K MRR
- Month 12: 150 paying customers, $22K MRR
- Month 24: 500 paying customers, $75K MRR
- Outcome: Solid product within the dd0c platform. Not a standalone business but a strong wedge.
**Scenario C: "The Slog" (25% probability)**
- Market is interested but conversion is low. Free tier gets usage, paid tier struggles.
- Competitors respond faster than expected.
- Month 6: 15 paying customers, $2.2K MRR
- Month 12: 60 paying customers, $9K MRR
- Month 24: 150 paying customers, $22K MRR
- Outcome: Fold into dd0c/portal as a feature. Not viable as standalone.
**Scenario D: "The Flop" (5% probability)**
- Market doesn't materialize. Teams prefer internal tooling or platform features.
- HN launch gets 30 comments and 200 visits.
- Month 6: 5 paying customers, $750 MRR
- Month 12: < $5K MRR
- Outcome: Kill it. Redirect effort to dd0c/route or dd0c/cost.
**Expected Value Calculation:**
Weighted average Month 12 MRR: (0.20 × $52K) + (0.50 × $22K) + (0.25 × $9K) + (0.05 × $5K) = $10.4K + $11K + $2.25K + $0.25K = **$23.9K MRR**
That's ~$287K ARR at the 12-month mark. For a bootstrapped solo founder, that's a real business. Not a unicorn. A real business.
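The weighted-average arithmetic above is easy to double-check:

```python
# Sanity check on the weighted Month-12 MRR across the four scenarios.
scenarios = {"Rocket": (0.20, 52_000), "Grind": (0.50, 22_000),
             "Slog": (0.25, 9_000), "Flop": (0.05, 5_000)}
assert abs(sum(p for p, _ in scenarios.values()) - 1.0) < 1e-9  # probs sum to 1
ev_mrr = sum(p * mrr for p, mrr in scenarios.values())
print(f"Expected Month-12 MRR: ${ev_mrr:,.0f}  (~${ev_mrr * 12:,.0f} ARR)")
# → Expected Month-12 MRR: $23,900  (~$286,800 ARR)
```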
---
## Section 6: STRATEGIC RECOMMENDATIONS
### 6.1 The 90-Day Launch Plan
**Days 1-30: Build the Foundation**
- Week 1-2: Build `drift-cli` (open-source). Terraform + OpenTofu support. Single-stack scanning. Output to stdout.
- Week 2-3: Build the SaaS detection engine. Multi-stack continuous monitoring. S3/GCS state backend integration.
- Week 3-4: Build Slack integration. Drift alerts with [Revert] [Accept] [Snooze] buttons. This is the MVP killer feature.
- Week 4: Build the dashboard. Drift score, stack list, drift history. Minimal but functional.
- Deliverable: Working product that can detect drift across multiple Terraform/OpenTofu stacks and alert via Slack.
**Days 31-60: Seed the Community**
- Week 5: Publish `drift-cli` on GitHub. Write a clear README with GIF demos. Target: 100 stars in week 1.
- Week 5-6: Begin daily engagement on r/terraform, r/devops. Answer drift questions. Don't pitch. Build credibility.
- Week 6: Publish "The True Cost of Infrastructure Drift" blog post with Drift Cost Calculator.
- Week 7: Publish "driftctl Is Dead. Here's What to Use Instead." (This will rank on Google for "driftctl alternative.")
- Week 7-8: Recruit 3-5 design partners from personal network. Free access for 3 months. Weekly feedback calls.
- Deliverable: 200+ GitHub stars, 50+ email list signups, 3-5 design partners actively using the product.
**Days 61-90: Launch and Convert**
- Week 9: "Show HN" launch. Prepare the post carefully. Have the landing page, pricing page, and docs ready.
- Week 9-10: Respond to every HN comment. Fix bugs reported by early users within 24 hours. Ship daily.
- Week 10: Launch on Product Hunt (secondary channel, lower priority than HN).
- Week 11: Publish case study from design partner: "How [Company] Reduced Drift by 90% in 2 Weeks."
- Week 12: Enable paid tiers. Convert free users to Starter/Pro.
- Deliverable: 200+ free tier users, 10+ paying customers, $1.5K+ MRR.
### 6.2 Key Metrics and Milestones
**North Star Metric:** Stacks monitored (total across all customers). This measures adoption depth, not just customer count.
**Leading Indicators:**
| Metric | Month 3 Target | Month 6 Target | Month 12 Target |
|--------|---------------|----------------|-----------------|
| GitHub stars (drift-cli) | 500 | 1,500 | 3,000 |
| Free tier users | 200 | 600 | 1,500 |
| Paying customers | 10 | 50 | 150 |
| MRR | $1.5K | $7.5K | $22K |
| Stacks monitored | 300 | 1,500 | 5,000 |
| Free-to-paid conversion | 5% | 5% | 7% |
| Monthly churn | <5% | <5% | <4% |
| NPS | 40+ | 45+ | 50+ |
**Lagging Indicators:**
- Net Dollar Retention (NDR): Target >120% (customers expand as they add stacks)
- CAC Payback: Target <6 months
- LTV:CAC ratio: Target >3:1
### 6.3 Open-Source Core Strategy — YES, With Boundaries
**Verdict: YES. Open-source the detection engine. Charge for the operational layer.**
**The Logic:**
1. **Trust.** Security-conscious teams won't run a closed-source agent in their VPC. Open-source eliminates this objection.
2. **Distribution.** GitHub is a distribution channel. Stars, forks, and contributors are free marketing.
3. **Community.** An open-source project attracts contributors who build integrations you don't have time to build (Azure support, GCP support, Pulumi support).
4. **Defensibility.** Counterintuitively, open-source is MORE defensible than closed-source. If someone forks your CLI, they still need to build the SaaS layer (Slack integration, dashboard, compliance reports, team features, continuous monitoring). That's 80% of the value. The detection engine is 20%.
**The Boundary:**
- Open-source: `drift-cli` (detection engine, single-stack scanning, stdout output)
- Proprietary SaaS: Everything else (continuous monitoring, Slack/PagerDuty integration, remediation workflows, dashboard, compliance reports, team features, API, historical data)
**License:** Apache 2.0 for the CLI. Not AGPL (too restrictive, scares enterprises). Not MIT (equally permissive, but no explicit patent grant). Apache 2.0 is the sweet spot — permissive enough for adoption, with patent protection.
### 6.4 The "Unfair Bet" — What Makes This Work
Brian, here's the honest assessment. You asked me to make the case or tell you to skip it. I'm going to do both.
**The Case FOR dd0c/drift:**
1. **The driftctl vacuum is real and time-limited.** There is a window — maybe 12-18 months — where the community is actively searching for a driftctl replacement and nobody has filled the gap. If you ship a credible product in that window, you become the default. If you wait, someone else will.
2. **The disruption math works.** $29/stack (or $49-$399/mo tiered) vs. $500+/mo platforms is a 10-17x price advantage. This isn't a marginal improvement — it's a category shift. You're not competing with Spacelift. You're making Spacelift's drift feature irrelevant for 80% of the market.
3. **Compliance is a forcing function.** SOC 2, PCI DSS 4.0, HIPAA — these aren't optional. Auditors are asking for continuous drift detection. This transforms your product from "nice-to-have" to "the auditor said we need this." Compliance-driven purchases have shorter sales cycles and lower churn.
4. **The platform strategy amplifies the bet.** dd0c/drift isn't just a standalone product — it's a wedge into the dd0c platform. A customer who uses drift is a customer who sees dd0c/cost, dd0c/alert, and dd0c/portal in the sidebar. Cross-sell potential is enormous.
5. **Your background is the unfair advantage.** You're a senior AWS architect. You've lived this problem. You can write the blog posts, give the conference talks, and answer the Reddit questions with authentic credibility that no marketing team can manufacture.
**The Case AGAINST dd0c/drift (the uncomfortable part):**
1. **It's Product #4 in a 6-product platform, built by one person.** The brand strategy puts drift in Phase 3 (months 7-12). By then, you'll be maintaining route, cost, and alert. Adding drift means you're running 4 products simultaneously. The burnout risk is not theoretical — it's mathematical.
2. **The standalone TAM is modest.** $3-5M SOM in 24 months is a great lifestyle business but won't attract investors if you ever want to raise. As a platform wedge, it's valuable. As a standalone bet, it's limited.
3. **The "do nothing" competitor is strong.** Most teams tolerate drift. They've been tolerating it for years. Converting "tolerators" to "payers" requires more marketing effort than converting "seekers" to "payers." Your beachhead is seekers (driftctl refugees, post-incident teams, pre-audit teams), but that's a smaller pool than the total market suggests.
4. **State file access is a trust barrier.** Reading Terraform state files means seeing resource configurations, sometimes including sensitive data. Even with read-only access, security teams will scrutinize this. The agent-in-VPC architecture mitigates it but adds deployment complexity.
**Victor's Final Verdict:**
**BUILD IT. But change the sequencing.**
Move dd0c/drift to Phase 2 (months 4-6), not Phase 3. Here's why:
- dd0c/route (LLM cost router) is a good Phase 1 product — immediate ROI, easy to build.
- dd0c/drift has a TIME-SENSITIVE market window (driftctl vacuum). Every month you wait, the window shrinks.
- dd0c/cost (AWS cost anomaly) can wait — the cost management market is crowded and not time-sensitive.
- dd0c/alert can be Phase 3 — alert fatigue is a chronic problem, not an acute one.
**Revised Launch Sequence:**
1. Phase 1 (Months 1-3): `dd0c/route` — The FinOps wedge. Immediate ROI. Easy build.
2. Phase 2 (Months 4-6): `dd0c/drift` — The driftctl vacuum. Time-sensitive. Compliance tailwind.
3. Phase 3 (Months 7-9): `dd0c/alert` — The on-call savior. Builds on route + drift data.
4. Phase 4 (Months 10-12): `dd0c/portal` + `dd0c/cost` + `dd0c/run` — The platform play.
**The Unfair Bet in One Sentence:**
dd0c/drift wins because it's the only product in the market that treats drift detection as the ENTIRE product rather than a feature checkbox — at a price point that makes the decision trivial and an onboarding experience that makes the switch instant — launched into a market vacuum left by driftctl's death, at the exact moment compliance requirements are making drift detection mandatory.
That's not a bet. That's a calculated position with favorable odds.
Ship it, Brian. But ship it in Phase 2, not Phase 3. The window won't wait.
---
*"The best time to plant a tree was 20 years ago. The second best time is now. The worst time is 'after I finish three other products first.'"*
— Victor
---
**Document Status:** COMPLETE
**Confidence Level:** HIGH (conditional on sequencing change)
**Next Step:** Technical architecture session — define the detection engine, state backend integrations, and Slack workflow architecture.
**Recommended Follow-Up:** Competitive intelligence deep-dive on Firefly.ai (they're the closest to your positioning and the least understood).

# 🪩 dd0c/drift — Advisory Board Party Mode 🪩
**Product:** dd0c/drift — IaC Drift Detection & Auto-Remediation SaaS
**Format:** BMad Creative Intelligence Suite "Party Mode"
**Date:** February 28, 2026
---
## 🎙️ The Panel
1. **The VC** — Pattern-matching machine. Seen 1000 AI/DevOps pitches. Asks "what's the moat?"
2. **The CTO** — 20 years building infrastructure. Skeptical of AI hype. Asks "does drift detection actually work reliably at scale?"
3. **The Bootstrap Founder** — Built 3 profitable SaaS products solo. Asks "can one person ship this?"
4. **The DevOps Practitioner** — Platform engineer managing 200+ Terraform stacks. Asks "would I actually pay for this?"
5. **The Contrarian** — Deliberately argues the opposite. Finds the blind spots.
---
## 🥊 ROUND 1: INDIVIDUAL REVIEWS
**The VC**
"Look, I'm looking at the TAM and it's… cute. $3-5M SOM for a standalone tool? My associates wouldn't even take the meeting. BUT. The wedge strategy is brilliant. Capitalizing on the driftctl vacuum and HashiCorp's BSL fallout is classic opportunistic timing. What excites me is the compliance angle—SOC 2 turns this from a 'vitamin' into a 'painkiller.' What worries me is the ceiling. If you stay at $29/stack, you need massive volume. If Spacelift or env0 decide to commoditize drift detection to protect their platform play, your CAC payback goes negative.
**Vote: CONDITIONAL GO (Only if it's a wedge for the broader dd0c platform).**"
**The CTO**
"I've been paged at 2am because someone manually tweaked an ASG and Terraform wiped it. Drift is a real operational nightmare. But let's talk reality: doing continuous detection across AWS, GCP, Azure, Terraform, Tofu, AND Pulumi without causing rate limit chaos or requiring god-mode IAM roles is a hard engineering problem. I like the 'read-only' approach and hybrid CloudTrail + polling idea. What worries me is remediation. Auto-reverting infrastructure state via a Slack button sounds cool until it causes a cascading failure because you didn't understand the blast radius of that emergency hotfix.
**Vote: CONDITIONAL GO (Nail the blast-radius context before enabling write access).**"
**The Bootstrap Founder**
"I love this. It's a scalpel, not a Swiss Army knife. You don't need a sales team to sell a $29/stack tool that engineers can expense. The PLG motion of open-sourcing the CLI to capture the orphaned driftctl community is a proven playbook. Can one person ship this? Yes, if you leverage AI for the boilerplate and keep the UI dead simple. The risk? You're building a multi-tool SaaS. If you try to support TF, Tofu, Pulumi, and CloudFormation on day one, you'll drown in edge cases.
**Vote: GO.**"
**The DevOps Practitioner**
"Finally, someone who understands that I don't want to migrate my entire CI/CD pipeline just to see what drifted. I'm already running GitHub Actions. I just want a Slack alert when a security group changes. The 60-second setup is exactly what I need. Would I pay $29/stack? Honestly, my boss would approve a $149/mo 'Pro' tier instantly if it generates our SOC 2 evidence. My biggest worry is noise. If this thing alerts me every time an ASG scales or a dynamic tag updates, I will mute the channel in 48 hours and churn in a month.
**Vote: GO.**"
**The Contrarian**
"Everyone's talking about how great it is that you're an independent tool. That's your biggest weakness. The market is moving towards consolidated platforms, not fragmented point solutions. Engineers have tool fatigue. Also, your entire premise assumes IaC is the permanent future. What if the future is just AI agents directly managing cloud APIs, making 'state files' obsolete? You're building a better saddle for a horse while Ford is building the Model T. Furthermore, selling to engineers is a nightmare. They'd rather spend 40 hours building a fragile bash script than pay you $29 a month.
**Vote: NO-GO.**"
---
## 🥊 ROUND 2: CROSS-EXAMINATION
**1. The VC:** "Alright, Founder. You love this because it's a 'scalpel.' But a $29/stack utility caps out at what, $3-5M ARR? That's a rounding error. Without the platform play, this is a lifestyle business. Where's the enterprise scale?"
**2. The Bootstrap Founder:** "VC, you're blinded by billion-dollar valuations. A $3M ARR solo business with 90% margins is phenomenal. The goal isn't to IPO; the goal is to capture the 100,000 mid-market teams HashiCorp priced out. You don't need to be a unicorn to be highly profitable."
**3. The CTO:** "Let's talk tech, Practitioner. 'One-click revert' from a Slack message? Have you ever tried to un-drift a complex RDS parameter group via a naive Terraform apply? You'll cause a reboot during peak hours. You can't just slap a 'revert' button on infrastructure without understanding dependencies."
**4. The DevOps Practitioner:** "CTO, you're assuming we're going to auto-revert databases. The MVP isn't fixing RDS parameters. It's reverting the security group someone opened to 0.0.0.0/0 at 3am. For critical state, the button should just generate a PR so I can review the plan output."
**5. The Contrarian:** "You're both missing the point. Practitioner, you're not going to pay $29/stack for PR generation. You have a GitHub Action for that. If this tool doesn't automate the hard remediation, it's just an expensive notification service."
**6. The DevOps Practitioner:** "Wrong, Contrarian. My GitHub Action only runs when I push code. It doesn't run when a developer clicks buttons in the AWS console. The value is the continuous polling and CloudTrail integration, not just the code diff."
**7. The VC:** "Contrarian has a point, though. If Spacelift sees you taking market share, they drop a free drift detection tier. Then what? Your entire '17x cheaper' moat evaporates."
**8. The Bootstrap Founder:** "Spacelift can't afford to give away private workers for continuous drift polling. Their unit economics don't support it. And even if they do, migrating to Spacelift takes two months. Installing dd0c/drift takes 60 seconds. PLG beats sales cycles."
**9. The Contrarian:** "Setup takes 60 seconds? Only if your security team rubber-stamps giving a third-party SaaS tool IAM access to your production AWS account and Terraform state files containing secrets. Spoiler: they won't."
**10. The CTO:** "Contrarian is right about the IAM permissions. If the SaaS requires a cross-account role with `s3:GetObject` on the state bucket, every SOC 2 auditor will flag it. It needs to be an agent running inside their VPC that pushes data out."
**11. The DevOps Practitioner:** "Exactly. An open-source CLI running in our CI/CD cron or as an ECS task, pushing the diffs to the dd0c SaaS. Our security team would approve that in a heartbeat because no inbound access is required."
**12. The Contrarian:** "So now your 60-second setup just became 'deploy an ECS task, configure IAM roles, and manage a new internal agent.' Congratulations, you've just reinvented enterprise software deployment."
---
## 🥊 ROUND 3: STRESS TEST
### Point 1: What if HashiCorp ships native drift detection in Terraform Cloud?
**The VC:** "Severity: **8/10**. IBM didn't buy HashiCorp to sit on their hands. If they bake real-time drift detection into the TFC Plus tier, they kill your 'HashiCorp exodus' narrative before you scale."
**Mitigations:**
- **The CTO:** "Multi-IaC support is your shield. TFC will *never* support OpenTofu or Pulumi. By the time they ship it, a massive chunk of the market has migrated to Tofu."
- **The DevOps Practitioner:** "And TFC's drift detection historically sucks. It's scheduled, not continuous via CloudTrail. And it's buried in their UI. Your Slack-native integration beats them on UX."
**Pivot Option:** Lean entirely into OpenTofu, Pulumi, and CloudFormation. Become the multi-cloud, multi-tool standard, explicitly targeting teams that are anti-vendor-lock-in.
### Point 2: What if the driftctl community vacuum gets filled by another OSS project?
**The Contrarian:** "Severity: **6/10**. Snyk abandoned driftctl, sure. But all it takes is one bored platform engineer at a series B startup to fork it, slap a new UI on it, and gain 5k GitHub stars. Free always beats $29/mo."
**Mitigations:**
- **The Bootstrap Founder:** "That's why *you* open-source the core CLI from day one! You build the successor. The OSS project detects drift locally; the SaaS layer does the continuous polling, Slack alerts, RBAC, and SOC 2 reporting. Let them fork the CLI. They still need the operational dashboard."
- **The CTO:** "Running an OSS project is a lot of work. The mitigation here is speed. Ship the CLI, get the stars, establish yourself as the de-facto standard before anyone else realizes the vacuum exists."
**Pivot Option:** Turn the CLI into a generic standard that integrates with existing observability tools (Datadog, New Relic) and pivot the SaaS into an enterprise compliance reporting engine.
### Point 3: What if enterprises won't trust a third-party tool with their cloud credentials?
**The CTO:** "Severity: **9/10**. Reading state files means reading secrets. Giving a bootstrapped SaaS tool cross-account access to production AWS and state buckets is a massive red flag for any CISO."
**Mitigations:**
- **The DevOps Practitioner:** "This is the 'push vs. pull' architecture debate. The SaaS *never* pulls from their cloud. The customer runs the open-source CLI/agent in their own secure environment (e.g., a GitHub Actions cron or ECS task) and pushes an encrypted payload of the drift diffs to the dd0c SaaS."
- **The VC:** "You also need to get SOC 2 certified immediately. It's expensive, but you can't sell a compliance tool without passing compliance yourself."
**Pivot Option:** Offer an on-premise/VPC-deployed enterprise version of the dashboard. Instead of SaaS, you sell an appliance they deploy entirely inside their perimeter.
---
## 🥊 ROUND 4: FINAL VERDICT
**The Facilitator:** "The tension is thick. The market is waiting. We've stress-tested the tech, the timing, and the TAM. We need a verdict. Go, No-Go, or Pivot?"
**The Panel Deliberates:**
- **The VC:** "If you launch it as a standalone unicorn pitch, I'm out. If you launch it as a wedge for the broader dd0c suite, I'm in."
- **The CTO:** "Push-architecture for the agent, or you're dead on arrival with enterprise security teams. If you build the push agent, I'm in."
- **The Bootstrap Founder:** "Ship the open-source CLI yesterday. The vacuum is real. I'm a hard GO."
- **The DevOps Practitioner:** "I've already asked my boss for the corporate card. Give me Slack alerts and I'm a GO."
- **The Contrarian:** "I still hate it. You're building an alarm system for a burning house instead of putting out the fire. But fine, the burning house owners have money. PIVOT to a compliance engine, and I guess it's a GO."
### Decision: SPLIT DECISION → CONDITIONAL GO
**Revised Priority in dd0c Lineup:**
Move dd0c/drift from Phase 3 to **Phase 2**. The driftctl vacuum is a time-sensitive, closing window. You must launch the open-source CLI immediately, capitalize on the HashiCorp BSL fallout, and use it as a powerful wedge into mid-market engineering teams.
**Top 3 Must-Get-Right Items:**
1. **Push-Based Architecture:** The SaaS must NEVER ask for inbound read access to a customer's AWS account or Terraform state buckets. The open-source agent runs in the customer's CI/CD or VPC and pushes drift events to the dd0c API.
2. **Slack-Native UX:** The alert must be the UI. Rich messages, diff context, and action buttons (`Revert`, `Snooze`, `Acknowledge`) entirely within Slack.
3. **Multi-IaC Day One:** You must launch with Terraform AND OpenTofu support. Supporting Pulumi shortly after is a massive differentiator against HashiCorp.
**The One Kill Condition:**
If you launch the open-source CLI on Hacker News/Reddit and cannot achieve **500 GitHub stars and 50 free-tier SaaS signups within 30 days**, kill the SaaS product. It means the market has either solved the problem internally or the pain isn't acute enough to adopt new tooling.
**Final Verdict: CONDITIONAL GO.**
*The panel adjourns, grabbing slices of cold pizza while the CTO mutters about IAM policies.*

# dd0c/drift — Product Brief
**Product:** IaC Drift Detection & Auto-Remediation SaaS
**Author:** Max Mayfield (Product Manager, Phase 5)
**Date:** February 28, 2026
**Status:** Investor-Ready Draft
**Pipeline Phase:** BMad Phase 5 — Product Brief
---
## 1. EXECUTIVE SUMMARY
### Elevator Pitch
dd0c/drift is a focused, developer-first SaaS tool that continuously monitors Terraform, OpenTofu, and Pulumi infrastructure for drift from declared state, and lets engineers fix it in one click from Slack. It replaces the fragile cron jobs, manual `terraform plan` runs, and tribal knowledge that teams currently rely on, at a price 10-17x lower than platform competitors like Spacelift or env0.
**The one-liner:** Connect your IaC state. Get Slack alerts when something drifts. Fix it in one click. Set up in 60 seconds.
### Problem Statement
Infrastructure as Code promised a single source of truth. In practice, it's a polite fiction.
Engineers make console changes during 2am incidents. Auto-scaling events mutate state. Emergency hotfixes bypass the IaC pipeline. The result: a growing, invisible gap between what the code declares and what actually exists in the cloud. This gap — drift — is the #1 operational pain point of IaC at scale.
**The data:**
- Engineers spend 2-5x longer debugging issues when actual state doesn't match declared state (design thinking persona research).
- Teams with 20+ stacks report spending 30% of sprint capacity on unplanned drift-related firefighting.
- Pre-audit drift reconciliation consumes 2+ weeks of engineering time per audit cycle — time that produces zero new value.
- A single undetected security group drift (port opened to 0.0.0.0/0) has led to breaches, compliance failures, and six-figure customer contract losses.
- The average mid-market team (20 stacks, 10 engineers) spends an estimated $47,000/year on manual drift management — a cost that's invisible because it's buried in engineer time, not a line item.
There is no focused, affordable, self-serve tool that solves this. The market's only dedicated open-source option — driftctl — was acquired by Snyk and abandoned. Platform vendors (Spacelift, env0, Terraform Cloud) bundle drift detection as a feature inside $500+/mo platforms that require full workflow migration. The result: most teams "solve" drift with bash scripts, tribal knowledge, and hope.
### Solution Overview
dd0c/drift is a standalone drift detection and remediation tool — not a platform. It does one thing and does it better than anyone:
1. **Hybrid Detection Engine** — Combines CloudTrail event-driven detection (real-time for high-risk resources like security groups and IAM) with scheduled polling (comprehensive coverage for everything else). This is the "security camera" approach vs. the industry-standard "flashlight" (`terraform plan`).
2. **Slack-First Remediation** — Rich Slack messages with drift context (who changed it, when, blast radius) and action buttons: `[Revert]` `[Accept]` `[Snooze]` `[Assign]`. For 80% of users, the Slack alert IS the product. No dashboard required.
3. **One-Click Fix** — Revert drift to declared state, or accept it by auto-generating a PR that updates code to match reality. Both directions. The engineer chooses which is the source of truth, per resource.
4. **60-Second Onboarding**`drift init` auto-discovers state backend, cloud provider, and resources. No YAML config. No platform migration. Plugs into existing Terraform + GitHub + Slack workflows.
5. **Push-Based Architecture** — An open-source agent runs inside the customer's CI/CD or VPC and pushes encrypted drift data to the dd0c SaaS. The SaaS never requires inbound access to customer cloud accounts or state files. This resolves the #1 enterprise adoption blocker (IAM trust).
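The push model above can be sketched in a few lines. Everything here, the payload shape, the `api.dd0c.dev/v1/drift` endpoint, and the signing scheme, is an illustrative assumption rather than the shipped protocol; the property that matters is that only the computed diff ever leaves the customer's environment:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_drift_payload(stack: str, drifted: list[dict], agent_key: str) -> dict:
    """Assemble the outbound payload: only the drift diff leaves the
    customer's environment; state files and credentials never do."""
    body = {
        "stack": stack,
        "detected_at": datetime.now(timezone.utc).isoformat(),
        # Each entry: declared vs. actual value for one drifted attribute.
        "drifted_resources": drifted,
    }
    raw = json.dumps(body, sort_keys=True).encode()
    # Integrity signature over the payload; a real agent would also
    # encrypt with a per-workspace key before sending.
    body["signature"] = hashlib.sha256(agent_key.encode() + raw).hexdigest()
    return body

payload = build_drift_payload(
    "prod-network",
    [{"address": "aws_security_group.web", "attribute": "ingress.cidr",
      "declared": "10.0.0.0/16", "actual": "0.0.0.0/0"}],
    agent_key="wk_demo",
)
# The agent would then POST this payload outbound, e.g.:
# requests.post("https://api.dd0c.dev/v1/drift", json=payload)
```

The SaaS side only ever receives these pushed diffs, which is why no inbound IAM role or state-bucket access is needed.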
### Target Customer
**Primary:** Mid-market engineering teams (5-50 engineers, 10-100 Terraform/OpenTofu stacks, AWS-first) who experience meaningful drift but can't afford or don't need a full IaC platform. They use GitHub Actions for CI/CD, Slack for communication, and a credit card for tooling purchases under $500/mo.
**Three buyer personas, one product:**
- **The Infrastructure Engineer (Ravi):** Buys with a credit card because it eliminates 2am dread. Bottom-up adoption driven by individual pain.
- **The Security/Compliance Lead (Diana):** Approves the budget because it generates SOC 2 audit evidence automatically. Middle-out adoption driven by compliance requirements.
- **The DevOps Team Lead (Marcus):** Champions it to leadership because it produces drift metrics and eliminates tribal knowledge. Top-down adoption driven by organizational visibility.
### Key Differentiators
| Differentiator | dd0c/drift | Competitors |
|---|---|---|
| **Product focus** | Drift detection IS the product (100% of engineering effort) | Drift is a feature (5% of engineering effort) |
| **Price** | $49-$399/mo (tiered bundles) | $500-$2,000+/mo (platforms) |
| **Onboarding** | 60 seconds, self-serve, credit card | Weeks-to-months, sales calls, platform migration |
| **Multi-IaC** | Terraform + OpenTofu + Pulumi from Day 1 | Terraform-only or limited multi-tool |
| **Architecture** | Push-based agent (no inbound cloud access) | Pull-based (requires IAM cross-account roles) |
| **UX paradigm** | Slack-native with action buttons | Dashboard-first, Slack as afterthought |
| **Open-source** | CLI detection engine is OSS (Apache 2.0) | Proprietary |
---
## 2. MARKET OPPORTUNITY
### Market Sizing
**TAM (Total Addressable Market) — IaC Management & Governance:**
The global IaC market is projected at $2.5-$3.5B by 2027 (25-30% CAGR). The drift detection and remediation slice — including drift features embedded in platforms — represents an estimated **$800M-$1.2B** by 2027.
**SAM (Serviceable Addressable Market) — Teams Using Terraform/OpenTofu/Pulumi Who Need Drift Detection:**
- 150,000-200,000 organizations actively use Terraform/OpenTofu in production.
- ~60% (90,000-120,000) have 10+ stacks and experience meaningful drift.
- Conservative estimate targeting teams with 10-100 stacks (excluding enterprises that will buy Spacelift regardless): **$200-$400M SAM**.
**SOM (Serviceable Obtainable Market) — 24-Month Capture:**
- Solo founder with PLG motion, targeting SMB/mid-market (5-50 engineers, 10-100 stacks).
- Year 1 realistic target: 200-500 paying customers at ~$145/mo average = **$350K-$870K ARR**.
- Year 2 with expansion and word-of-mouth: **$1.5M-$3M ARR**.
- 24-month SOM: **$3-$5M**.
**The honest framing:** $3-5M SOM as a standalone product is a strong bootstrapped business, not a venture-scale outcome. The strategic value is as a wedge into the broader dd0c platform (route + cost + alert + drift + portal), which targets a $50M+ opportunity. Drift alone funds the founder; the platform funds the company.
### Competitive Landscape (Top 5)
| Competitor | What They Are | Drift Capability | Pricing | Vulnerability |
|---|---|---|---|---|
| **Spacelift** | IaC management platform ($40M+ raised) | Good — but a feature, not the product. Requires private workers. | $500-$2,000+/mo | Can't price down to $49 without cannibalizing enterprise ACV. Requires full workflow migration. |
| **env0** | "Environment as a Service" platform ($28M+ raised) | Basic — secondary to their core positioning | $350-$500+/mo (per-user) | Jack of all trades. Per-user pricing punishes growing teams. Same migration problem. |
| **HCP Terraform (HashiCorp/IBM)** | Native Terraform Cloud | Basic — scheduled health assessments, no remediation workflows | Variable; gets expensive at scale | IBM acquisition triggered OpenTofu exodus. Terraform-only. BSL license killed community goodwill. |
| **Firefly.ai** | Cloud Asset Management ($23M+ raised) | Good — but bundled in enterprise package | $1,000+/mo, enterprise-only, "Contact Sales" | Sells to CISOs, not engineers. No self-serve. A 5-person startup can't get a demo. |
| **driftctl (Snyk)** | Open-source drift detection CLI | Was good — now dead | Free (abandoned OSS) | Acquired and abandoned. Community orphaned. README still says "beta." **This vacuum is our market entry.** |
**The competitive insight:** Every live competitor treats drift detection as a feature inside a platform. Nobody treats it as the entire product. dd0c/drift's value curve is the inverse of every competitor — zero on CI/CD orchestration and policy engines, 10/10 on drift detection depth, remediation workflows, Slack-native UX, and self-serve onboarding. This is textbook Blue Ocean positioning.
### Timing Thesis — Why February 2026
Four forces are converging that create a 12-18 month window of opportunity:
**1. The HashiCorp Exodus (2024-2026)**
IBM's acquisition of HashiCorp and the BSL license change triggered the largest migration event in IaC history. Teams migrating from Terraform Cloud to OpenTofu + GitHub Actions lose their (mediocre) drift detection. They need a replacement and are actively searching right now.
**2. The driftctl Vacuum**
driftctl was the only focused, open-source drift detection tool. Snyk killed it. GitHub issues, Reddit threads, and HN comments are filled with "what do I use instead of driftctl?" There is no answer. dd0c/drift IS the answer. This vacuum is time-limited — someone will fill it within 12-18 months.
**3. IaC Adoption Hit Mainstream**
IaC is no longer a practice of elite DevOps teams. Mid-market companies with 20-50 engineers now have 30+ Terraform stacks. They've graduated from "learning IaC" to "suffering from IaC at scale." The market of sufferers just 10x'd.
**4. Compliance Is Becoming a Forcing Function**
- **SOC 2 Type II:** Auditors increasingly ask "How do you ensure infrastructure matches declared configuration?" — "we run terraform plan sometimes" is no longer acceptable.
- **PCI DSS 4.0** (effective March 2025): Requirement 1.2.5 requires documentation and review of all allowed services, protocols, and ports. Security group drift is now a PCI finding.
- **HIPAA/HITRUST:** Healthcare SaaS companies need to prove infrastructure configurations haven't been tampered with.
- **FedRAMP/StateRAMP:** Continuous monitoring of configuration state maps directly to NIST 800-53 CM-3 and CM-6.
- **Cyber Insurance:** Insurers are asking detailed questions about infrastructure configuration management. Continuous drift detection improves rates.
Compliance transforms drift detection from "engineering nice-to-have" to "business requirement." When the auditor says "you need this," the CFO writes the check.
### Market Trends
- **Multi-IaC reality:** Teams no longer use just Terraform. They use Terraform AND OpenTofu AND Pulumi AND CloudFormation (legacy). The first tool that handles drift across all of them owns the "Switzerland" position.
- **Platform fatigue:** Engineering teams are experiencing tool sprawl fatigue. They want focused tools that integrate with existing workflows, not new platforms that require migration.
- **AI-assisted infrastructure:** AI agents (Pulumi Neo, GitHub Copilot) are generating more IaC, increasing the volume of managed resources and the surface area for drift. AI doesn't prevent a panicked engineer from opening a security group at 2am.
- **Shift from periodic to continuous:** The industry is moving from point-in-time compliance checks to continuous monitoring. Drift detection is the infrastructure equivalent of this shift.
---
## 3. PRODUCT DEFINITION
### Value Proposition
**For infrastructure engineers:** "Stop dreading `terraform apply`. Know exactly what drifted, who changed it, and fix it in one click — without leaving Slack."
**For compliance leads:** "Generate continuous SOC 2 / HIPAA compliance evidence automatically. Eliminate the 2-week pre-audit scramble."
**For DevOps leads:** "See drift across all stacks in one dashboard. Replace tribal knowledge with data. Show leadership a number, not an anecdote."
**The composite:** dd0c/drift closes the loop between declared state and actual state continuously — restoring trust in IaC as a practice, eliminating reactive firefighting, and turning compliance from a quarterly scramble into an always-on posture.
### Personas
**Persona 1: Ravi — The Infrastructure Engineer**
- Senior infra engineer, 6 years experience, manages 23 Terraform stacks
- Runs `terraform plan` manually before every apply, scanning output like a bomb technician
- Maintains a mental map of "things that have drifted but I haven't fixed yet"
- Feels anxiety before every apply, guilt about known drift, loneliness at 2am when nothing matches the code
- **JTBD:** "When I'm about to run `terraform apply`, I want to know exactly what has drifted so I can apply with confidence instead of fear."
- **Buys because:** Eliminates 2am dread. Credit card purchase. Bottom-up.
**Persona 2: Diana — The Security/Compliance Lead**
- Head of Security, 10 years experience, responsible for SOC 2 Type II across 4 AWS accounts
- Maintains a 200-row spreadsheet mapping compliance controls to infrastructure resources — always slightly out of date
- Spends 60% of her time on evidence collection that should be automated
- **JTBD:** "When an auditor asks for evidence that infrastructure matches declared state, I want to generate a real-time compliance report in one click."
- **Buys because:** Generates audit evidence. Budget approval. Middle-out.
**Persona 3: Marcus — The DevOps Team Lead**
- DevOps lead, 12 years experience, manages 67 stacks through a team of 4 engineers
- Has zero aggregate visibility — manages infrastructure health through standup anecdotes and tribal knowledge
- Team is burning out from on-call burden inflated by drift-related incidents
- **JTBD:** "When reporting to leadership, I want to show drift metrics trending over time so I can justify tooling investment with data."
- **Buys because:** Produces metrics, eliminates bus factor. Champions to leadership. Top-down.
### Feature Roadmap
#### MVP (Month 1 — Launch)
| Feature | Description |
|---|---|
| **Hybrid detection engine** | CloudTrail event-driven (real-time for security groups, IAM) + scheduled polling (comprehensive). The "security camera" vs. "flashlight" approach. |
| **Terraform + OpenTofu support** | Full support for both from Day 1. Multi-IaC is a launch differentiator, not a roadmap item. |
| **Slack-native alerts** | Rich messages with drift context: what changed, who changed it (CloudTrail attribution), when, and blast radius preview. Action buttons: `[Revert]` `[Accept]` `[Snooze]` `[Assign]`. |
| **One-click revert** | Revert drift to declared state via Terraform apply scoped to the drifted resource. Includes blast radius check before execution. |
| **One-click accept** | Accept drift by auto-generating a PR that updates IaC code to match current reality. Both directions — engineer chooses which is the source of truth. |
| **Drift score dashboard** | Single number per stack and aggregate across all stacks. "Your infrastructure is 94% aligned with declared state." Minimal but functional web UI. |
| **Push-based agent** | Open-source CLI/agent runs in customer's CI/CD (GitHub Actions cron) or VPC (ECS task). Pushes encrypted drift data to dd0c SaaS. No inbound access required. |
| **60-second onboarding** | `drift init` auto-discovers state backend, cloud provider, and resources. No YAML config files. |
| **Stack ownership** | Assign stacks to engineers. Route drift alerts to the right person automatically. |
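As a rough sketch of how the drift score row above could be computed (the exact formula is an assumption; the brief only specifies "a single number per stack and aggregate"):

```python
def drift_score(total_resources: int, drifted: int) -> float:
    """Percentage of resources whose actual state matches declared state."""
    if total_resources == 0:
        return 100.0
    return round(100 * (total_resources - drifted) / total_resources, 1)

def aggregate_score(stacks: dict[str, tuple[int, int]]) -> float:
    """Aggregate across stacks, weighted by resource count.
    Each value is (total_resources, drifted_resources)."""
    total = sum(t for t, _ in stacks.values())
    drifted = sum(d for _, d in stacks.values())
    return drift_score(total, drifted)

stacks = {"network": (120, 3), "compute": (80, 9), "iam": (50, 3)}
print(aggregate_score(stacks))  # 94.0 -> "94% aligned with declared state"
```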
#### V2 (Month 3-4)
| Feature | Description |
|---|---|
| **Per-resource automation policies** | Spectrum of automation per resource type: Auto-revert (security groups opened to 0.0.0.0/0), Alert + one-click (IAM changes), Digest only (tag drift), Ignore (ASG instance counts). This spectrum IS the product's sophistication. |
| **Compliance report generation** | One-click SOC 2 / HIPAA evidence reports. Continuous audit trail of all drift events and resolutions. Exportable PDF/CSV. |
| **Pulumi support** | Extend detection engine to Pulumi state. Capture the underserved Pulumi community. |
| **Drift trends & analytics** | Drift rate over time, mean time to remediation, most-drifted resource types, drift by team member. The metrics Marcus needs for leadership. |
| **PagerDuty / OpsGenie integration** | Route critical drift (security groups, IAM) through existing on-call rotation. |
| **Teams & RBAC** | Multi-team support with role-based access. Stack-level permissions. |
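The per-resource automation spectrum in the table above is essentially a first-match rule table. A minimal sketch, with rule names and defaults invented for illustration (the examples mirror the table: auto-revert open security groups, alert on IAM, digest tag drift, ignore ASG counts):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Policy:
    resource_type: str        # e.g. "aws_security_group"; "*" matches any
    attribute: Optional[str]  # None = any attribute
    action: str               # auto_revert | alert_one_click | digest | ignore

# Ordered rules; first match wins.
POLICIES = [
    Policy("aws_security_group", "ingress.cidr", "auto_revert"),
    Policy("aws_iam_role", None, "alert_one_click"),
    Policy("*", "tags", "digest"),
    Policy("aws_autoscaling_group", "desired_capacity", "ignore"),
]

def route(resource_type: str, attribute: str, policies=POLICIES) -> str:
    for p in policies:
        if p.resource_type in (resource_type, "*") and p.attribute in (attribute, None):
            return p.action
    return "alert_one_click"  # safe default: a human sees it

print(route("aws_security_group", "ingress.cidr"))  # auto_revert
print(route("aws_rds_instance", "tags"))            # digest
```

The "safe default" choice reflects the product's stance: unclassified drift should always surface to a human rather than be silently reverted or ignored.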
#### V3 (Month 6-9)
| Feature | Description |
|---|---|
| **Drift prediction** | "Based on patterns from N similar organizations, this resource has a 78% chance of drifting in the next 48 hours." Requires aggregate data from 500+ customers. |
| **Industry benchmarking** | "Your drift rate is 12%. The median for Series B SaaS companies is 18%. You're in the top quartile." Competitive FOMO that drives adoption. |
| **Multi-cloud support** | Azure and GCP detection alongside AWS. |
| **CloudFormation support** | Capture legacy stacks that haven't migrated to Terraform/OpenTofu. |
| **SSO / SAML** | Enterprise authentication. Unlocks larger team adoption. |
| **API & webhooks** | Programmatic access to drift data for custom integrations and internal dashboards. |
| **dd0c platform integration** | Drift data feeds into dd0c/alert (intelligent routing), dd0c/portal (service catalog enrichment), and dd0c/run (automated runbooks for drift remediation). Cross-module flywheel. |
### User Journey
```
1. DISCOVER
Engineer sees "driftctl alternative" blog post, HN launch, or Reddit recommendation.
Downloads open-source drift-cli. Runs `drift check` on one stack.
Finds 7 drifted resources. "Oh crap."
2. ACTIVATE (60 seconds)
Signs up for free tier. Runs `drift init`.
CLI auto-discovers S3 state backend, AWS account, 3 stacks.
First Slack alert arrives within 5 minutes.
3. ENGAGE (Week 1)
Daily Slack alerts become part of the workflow.
Reverts a security group drift in one click. Accepts a tag drift.
Checks drift score dashboard — "We're at 87% alignment."
4. CONVERT (Week 2-4)
Hits 4-stack limit on free tier. Wants to add remaining 12 stacks.
Upgrades to Starter ($49/mo, 10 stacks) with a credit card.
No manager approval needed. No procurement.
5. EXPAND (Month 2-6)
Adds more stacks. Hits 10-stack limit. Upgrades to Pro ($149/mo, 30 stacks).
Diana (compliance) discovers the compliance report feature.
Generates SOC 2 evidence in one click. Becomes internal champion.
Marcus (team lead) sees the drift trends dashboard. Uses it in leadership reports.
6. ADVOCATE (Month 6+)
Team presents "How we reduced drift by 90%" at internal engineering all-hands.
Engineer mentions dd0c/drift on r/terraform. Word-of-mouth loop begins.
Team evaluates dd0c/cost and dd0c/alert — platform expansion.
```
### Pricing — Resolution
**The pricing question:** The brainstorm session proposed $29/stack/month flat pricing. The innovation strategy recommended tiered bundles ($49-$399/mo) over flat per-stack. The party mode panel's DevOps Practitioner said "my boss would approve a $149/mo Pro tier instantly if it generates SOC 2 evidence." The Contrarian argued $29/stack is too low for meaningful revenue.
**Resolution: Tiered bundles win.** Here's why:
Pure per-stack pricing has three fatal flaws:
1. It penalizes good architecture — teams that split into many small stacks (best practice) pay more.
2. It creates enterprise sticker shock — 200 stacks × $29 = $5,800/mo, at which point Spacelift's platform looks reasonable.
3. It's unpredictable — customers can't forecast costs as they add stacks.
Tiered bundles solve all three while keeping a per-stack price anchor for marketing (Starter tier = $49/mo for 10 stacks ≈ $4.90/stack effective, well under the $29/stack originally proposed).
**Final Pricing:**
| Tier | Price | Stacks | Polling Frequency | Key Features |
|---|---|---|---|---|
| **Free** | $0/mo | 3 stacks | Daily | Slack alerts, basic dashboard, drift score |
| **Starter** | $49/mo | 10 stacks | 15-minute | + One-click remediation, stack ownership, CloudTrail attribution |
| **Pro** | $149/mo | 30 stacks | 5-minute | + Compliance reports, auto-remediation policies, drift trends, API, PagerDuty |
| **Business** | $399/mo | 100 stacks | 1-minute | + SSO, RBAC, audit trail export, priority support, custom integrations |
| **Enterprise** | Custom | Unlimited | Real-time (CloudTrail) | + SLA, dedicated support, on-prem agent option, custom compliance frameworks |
**Pricing justification:**
- **Free tier is genuinely useful** — 3 stacks with daily polling creates habit and word-of-mouth. This is the viral loop.
- **Starter at $49** — Below the "ask my manager" threshold. An engineer can expense this. No procurement. No legal review.
- **Pro at $149** — The sweet spot. Compliance reports unlock Diana's budget. 30 stacks covers most mid-market teams. This is the volume tier.
- **Business at $399** — Still 10x cheaper than Spacelift. Covers large teams (100 stacks) with enterprise features. Natural upsell trigger when teams hit 30 stacks.
- **Enterprise at custom** — Exists for the 1% who need unlimited stacks, SLAs, and on-prem. Not the focus. Don't build a sales team for this.
**The per-stack anchor still works for marketing:** "Starting at less than $5/stack" and "17x cheaper than Spacelift" are the headlines. The tiered pricing is what customers see on the pricing page.
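The tier arithmetic above can be sanity-checked with a small helper (tier data copied from the pricing table; the selection logic itself is an assumption):

```python
# (name, monthly price, stack limit) from the published pricing table.
TIERS = [("Free", 0, 3), ("Starter", 49, 10), ("Pro", 149, 30), ("Business", 399, 100)]

def recommend_tier(stacks: int):
    """Cheapest published tier that covers the stack count; above 100
    stacks the pricing page points at the custom Enterprise tier."""
    for name, price, limit in TIERS:
        if stacks <= limit:
            per_stack = round(price / stacks, 2) if stacks else 0.0
            return name, price, per_stack
    return "Enterprise", None, None

print(recommend_tier(10))  # ('Starter', 49, 4.9)  -> "less than $5/stack"
print(recommend_tier(23))  # ('Pro', 149, 6.48)
```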
---
## 4. GO-TO-MARKET PLAN
### Launch Strategy
dd0c/drift launches as a Phase 2 product in the dd0c suite (months 4-6), following dd0c/route (LLM cost router). Victor's innovation strategy recommended moving drift up from Phase 3 due to the time-sensitive driftctl vacuum. The party mode panel unanimously agreed. This brief confirms: **drift launches in Phase 2.**
The GTM motion is pure PLG (Product-Led Growth). No sales team. No enterprise outbound. No "Contact Sales" buttons. The product sells itself through:
1. An open-source CLI that proves value locally before asking for a signup.
2. A 60-second onboarding flow that converts interest into activation instantly.
3. Slack alerts that deliver value daily, creating habit and dependency.
4. Word-of-mouth from engineers who share their drift score improvements.
### Beachhead: driftctl Refugees + r/terraform
**Primary beachhead:** Engineers who used driftctl and are actively searching for a replacement. These are pre-qualified leads — they already understand the problem, have budget intent, and are searching for a solution that doesn't exist yet.
**Where they live:**
- **driftctl GitHub Issues** — Open issues from people asking "is this project dead?" and "what do I use instead?" These are literal inbound leads.
- **r/terraform** (80K+ members) — Weekly posts asking for drift solutions. Search "drift" and find your first 50 prospects.
- **r/devops** (300K+ members) — Broader audience, drift discussions surface regularly.
- **Hacker News** — "Show HN" launches for developer tools consistently hit front page. Solo founder + open-source + clear pricing = HN catnip.
- **HashiCorp Community Forum** — Teams migrating from TFC to OpenTofu discussing tooling gaps. Drift detection is consistently mentioned.
- **DevOps Slack communities** — Rands Leadership Slack, DevOps Chat, Kubernetes Slack (#terraform channel).
- **Twitter/X DevOps community** — DevOps influencers regularly discuss IaC pain points.
**First 10 customer acquisition playbook:**
- **Customers 1-3:** Personal network. Brian is a senior AWS architect — he knows people managing Terraform stacks. Free access for 3 months in exchange for weekly feedback. These are design partners.
- **Customers 4-6:** Community engagement. 2 weeks of answering drift questions on r/terraform and r/devops. Don't pitch. Just help. Build credibility, then launch.
- **Customers 7-10:** Content-driven inbound. "The True Cost of Infrastructure Drift" blog post + Drift Cost Calculator. Convert readers to free tier, free tier to paid.
### Growth Loops
**Loop 1: Open-Source → Free Tier → Paid (Primary)**
```
Engineer discovers drift-cli on GitHub/HN
→ Runs `drift check` locally, finds drift
→ Signs up for free tier (3 stacks)
→ Gets hooked on Slack alerts
→ Hits stack limit, upgrades to Starter/Pro
→ Tells teammate → teammate discovers drift-cli
```
**Loop 2: Compliance → Budget → Expansion**
```
Diana (compliance) discovers drift reports during audit prep
→ Generates SOC 2 evidence in one click (vs. 2-week manual scramble)
→ Becomes internal champion, approves budget increase
→ Team expands to Pro/Business tier
→ Diana mentions dd0c/drift to compliance peers at industry events
```
**Loop 3: Content → SEO → Inbound**
```
Blog post ranks for "terraform drift detection" / "driftctl alternative"
→ Engineer reads post, tries Drift Cost Calculator
→ Sees "$47K/year in drift costs" → downloads CLI
→ Enters Loop 1
```
**Loop 4: Incident → Adoption (Event-Driven)**
```
Team has a drift-related incident (security group change causes outage)
→ Post-mortem action item: "evaluate drift detection tooling"
→ Engineer Googles "terraform drift detection tool"
→ Finds dd0c/drift blog post or GitHub repo
→ Enters Loop 1
```
### Content Strategy
**Pillar content (SEO + thought leadership):**
1. "The True Cost of Infrastructure Drift" — with interactive Drift Cost Calculator. The single most important marketing asset. Quantifies invisible pain.
2. "driftctl Is Dead. Here's What to Use Instead." — Will rank for "driftctl alternative" on Google. Direct capture of orphaned community.
3. "How to Detect Terraform Drift Without Spacelift" — Targets teams evaluating platforms who don't want platform migration.
4. "SOC 2 and Infrastructure Drift: A Compliance Guide" — Targets Diana persona. Compliance-driven purchase justification.
5. "Terraform vs OpenTofu: Drift Detection Compared" — Captures migration-related search traffic.
**The Drift Cost Calculator:**
A web tool where an engineer inputs: number of stacks, team size, average salary, frequency of manual checks, drift incidents per quarter. Output: "Your team spends approximately $47,000/year on manual drift management. At $149/mo for dd0c/drift Pro, your ROI is 26x in the first year." This is shareable — engineers send it to managers. It captures leads. It's content marketing gold.
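A plausible back-of-envelope version of the calculator's math, assuming a simple time-cost formula (the real calculator's formula is unspecified); with a 20-stack team and a $160K average salary it lands near the brief's $47K and 26x headline figures:

```python
def drift_cost(stacks, avg_salary, checks_per_stack_per_week,
               hours_per_check, incidents_per_quarter, hours_per_incident):
    """Annual cost of manual drift management, in engineer time.
    The formula is an illustrative assumption, not the shipped calculator."""
    hourly = avg_salary / 2080  # ~working hours per year
    check_hours = stacks * checks_per_stack_per_week * hours_per_check * 52
    incident_hours = incidents_per_quarter * 4 * hours_per_incident
    return round((check_hours + incident_hours) * hourly)

annual = drift_cost(stacks=20, avg_salary=160_000,
                    checks_per_stack_per_week=1, hours_per_check=0.5,
                    incidents_per_quarter=2, hours_per_incident=12)
roi = round(annual / (149 * 12), 1)  # vs. dd0c/drift Pro annual cost
print(annual, roi)  # 47385 26.5 -> roughly the "$47K / 26x" headline
```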
### Open-Source CLI as Lead Gen
**What's open-source (Apache 2.0):**
- `drift-cli` — Local drift detection for Terraform/OpenTofu. Runs `drift check` and outputs drifted resources to stdout. Works offline. No account needed. No telemetry. Single-stack scanning.
**What's paid SaaS:**
- Continuous monitoring (scheduled + event-driven)
- Slack/PagerDuty alerts with action buttons
- One-click remediation (revert or accept)
- Dashboard, drift score, trends
- Compliance reports
- Team features (ownership, routing, RBAC)
- Historical data
- Multi-stack aggregate view
**The conversion funnel:**
`drift-cli` outputs: "Found 7 drifted resources. View details and remediate at app.dd0c.dev" — the natural upsell. This is the Sentry/PostHog/GitLab playbook. Open-source core builds trust and adoption. Paid SaaS captures value from teams that need operational features.
**Target:** 1,000 GitHub stars in first 3 months. Stars = social proof = distribution.
### Partnerships
- **OpenTofu Foundation:** Become a visible ecosystem partner. Sponsor the project. Position dd0c/drift as "the drift detection tool for the OpenTofu community." OpenTofu teams are actively building their toolchain — be part of it from Day 1.
- **Slack Marketplace:** List dd0c/drift as a Slack app. "Install from Slack → OAuth → connect state backend → first alert in 5 minutes." Underrated distribution channel.
- **AWS Marketplace:** List for teams that want to pay through their AWS bill (consolidated billing, committed spend credits). Also provides credibility and discoverability.
- **Digger (OSS Terraform CI/CD):** Digger users need drift detection. Integration partnership, not competition.
- **Terraform Registry:** List as a complementary tool. Publish a `terraform-provider-driftcheck` data source.
### 90-Day Launch Timeline
**Days 1-30: Build the Foundation**
- Week 1-2: Build `drift-cli` (open-source). Terraform + OpenTofu support. Single-stack scanning. Output to stdout.
- Week 2-3: Build SaaS detection engine. Multi-stack continuous monitoring. S3/GCS state backend integration.
- Week 3-4: Build Slack integration. Drift alerts with action buttons. This is the MVP killer feature.
- Week 4: Build dashboard. Drift score, stack list, drift history. Minimal but functional.
- **Deliverable:** Working product that detects drift across multiple Terraform/OpenTofu stacks and alerts via Slack.
**Days 31-60: Seed the Community**
- Week 5: Publish `drift-cli` on GitHub. Clear README with GIF demos. Target: 100 stars in week 1.
- Week 5-6: Begin daily engagement on r/terraform, r/devops. Answer drift questions. Don't pitch.
- Week 6: Publish "The True Cost of Infrastructure Drift" blog post with Drift Cost Calculator.
- Week 7: Publish "driftctl Is Dead. Here's What to Use Instead."
- Week 7-8: Recruit 3-5 design partners from personal network. Free access, weekly feedback calls.
- **Deliverable:** 200+ GitHub stars, 50+ email list signups, 3-5 design partners actively using the product.
**Days 61-90: Launch and Convert**
- Week 9: "Show HN" launch. Tuesday or Wednesday morning (US Eastern). Landing page, pricing page, and docs ready.
- Week 9-10: Respond to every HN comment. Fix bugs within 24 hours. Ship daily.
- Week 10: Launch on Product Hunt (secondary channel).
- Week 11: Publish design partner case study: "How [Company] Reduced Drift by 90% in 2 Weeks."
- Week 12: Enable paid tiers. Convert free users to Starter/Pro.
- **Deliverable:** 200+ free tier users, 10+ paying customers, $1.5K+ MRR.
---
## 5. BUSINESS MODEL
### Revenue Model
**Primary revenue:** Tiered SaaS subscriptions (Free / $49 / $149 / $399 / Custom).
**Revenue characteristics:**
- **Recurring:** Monthly subscriptions with annual discount option (2 months free on annual).
- **Expansion-native:** Revenue grows as customers add stacks and upgrade tiers; the model targets NDR (Net Dollar Retention) above 120%.
- **Low-touch:** Self-serve signup, credit card billing, no sales team required for Free through Business tiers.
- **Compliance-sticky:** Once SOC 2 audit evidence references dd0c/drift reports, switching tools means re-establishing evidence chains with auditors. Nobody does that mid-audit-cycle.
**Secondary revenue (future):**
- AWS Marketplace transactions (consolidated billing).
- dd0c platform cross-sell (drift customers adopt dd0c/cost, dd0c/alert, dd0c/portal).
- Enterprise on-prem/VPC-deployed dashboard (license fee, not SaaS).
### Unit Economics
**Assumptions:**
- Average customer: Pro tier ($149/mo) — this is the volume tier based on persona analysis.
- Infrastructure cost per customer: ~$8-12/mo (compute for polling, storage for drift history, Slack API calls).
- Gross margin: ~92-95%.
- CAC (blended): ~$150-$300 (PLG motion — content + community + open-source, no paid ads initially).
- CAC payback: 1-2 months at Pro tier.
- LTV (assuming 5% monthly churn, which implies a ~20-month average lifetime): $149 × 20 = $2,980.
- LTV:CAC ratio: ~10-20x (healthy; target >3x).
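The arithmetic behind these assumptions can be checked with a small sketch; the key relationship is that average customer lifetime is the reciprocal of monthly churn. The function and its inputs are illustrative, not product code:

```python
def unit_economics(arpu: float, monthly_churn: float, cac: float) -> dict:
    # Average customer lifetime in months is the reciprocal of monthly churn.
    lifetime_months = 1 / monthly_churn
    ltv = arpu * lifetime_months
    return {
        "lifetime_months": round(lifetime_months, 1),
        "ltv": round(ltv),
        "ltv_cac": round(ltv / cac, 1),
        "cac_payback_months": round(cac / arpu, 1),  # ignores gross margin for simplicity
    }

print(unit_economics(arpu=149, monthly_churn=0.05, cac=300))
# → {'lifetime_months': 20.0, 'ltv': 2980, 'ltv_cac': 9.9, 'cac_payback_months': 2.0}
```

At the low end of the CAC range ($150), the same inputs give an LTV:CAC near 20x, which brackets the ratio quoted above.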
**Revenue mix projection (Month 12):**
| Tier | Customers | MRR | % of MRR |
|---|---|---|---|
| Free | 1,200 | $0 | 0% |
| Starter ($49) | 50 | $2,450 | 11% |
| Pro ($149) | 80 | $11,920 | 54% |
| Business ($399) | 18 | $7,182 | 32% |
| Enterprise | 2 | $600 | 3% |
| **Total** | **1,350 (150 paid)** | **$22,152** | **100%** |
### Path to $10K / $50K / $100K MRR
**$10K MRR — "Ramen Profitable" (Month 6-9)**
- ~67 paying customers at blended $149/mo average.
- Achieved through: HN launch momentum + community engagement + 2-3 blog posts ranking on Google + design partner referrals.
- Solo founder is sustainable at this level. Infrastructure costs ~$1K/mo. Net income ~$9K/mo.
- **Milestone significance:** Validates product-market fit. Proves the market will pay.
**$50K MRR — "Real Business" (Month 15-20)**
- ~335 paying customers at blended $149/mo average.
- Achieved through: SEO compounding + word-of-mouth + Slack Marketplace distribution + first conference talks + compliance-driven purchases accelerating.
- Hire first part-time contractor for support and bug fixes at ~$30K MRR.
- **Milestone significance:** Sustainable solo business. Funds development of dd0c platform expansion.
**$100K MRR — "Platform Inflection" (Month 24-30)**
- ~500 paying customers at blended $200/mo average (mix shifts toward Pro/Business as larger teams adopt).
- Achieved through: dd0c platform cross-sell (drift customers adopt other modules) + enterprise tier traction + AWS Marketplace + potential seed round to accelerate.
- Hire 1-2 full-time engineers. Transition from solo founder to small team.
- **Milestone significance:** dd0c becomes a platform company, not a single-product company.
### Solo Founder Constraints
**What one person can realistically do:**
- Build and maintain the core product (detection engine, Slack integration, dashboard).
- Write 2-4 blog posts per month.
- Engage on Reddit/HN daily (30 min/day).
- Handle support for up to ~100 customers (Slack-based, async).
- Ship weekly releases.
**What one person cannot do:**
- Build enterprise features (SSO, SAML, advanced RBAC) while also shipping core features and doing marketing.
- Handle support for 200+ customers without it consuming all productive time.
- Attend conferences while also shipping code.
- Build multi-cloud support (Azure, GCP) while maintaining AWS quality.
**The constraint strategy:**
- Ruthlessly prioritize AWS + Terraform + OpenTofu. Don't touch Azure/GCP/Pulumi until $30K MRR.
- Use AI-assisted development (Cursor/Copilot) for 80% of boilerplate. Reserve cognitive energy for architecture and customer conversations.
- Hire first contractor at $30K MRR. First full-time hire at $75K MRR.
- Shared dd0c platform infrastructure (auth, billing, OTel pipeline) is built once and reused across all modules. This is the moat against burnout.
### Key Assumptions
1. **The driftctl vacuum persists for 12+ months.** If someone fills it before dd0c/drift launches, the beachhead shrinks significantly.
2. **Engineers will adopt a new tool for drift detection specifically.** The "do nothing" competitor (manual `terraform plan`) is strong. The product must demonstrate ROI in the first 5 minutes.
3. **Compliance requirements continue tightening.** SOC 2, PCI DSS 4.0, and HIPAA are driving drift detection from "nice-to-have" to "required." If compliance pressure plateaus, the Diana persona weakens.
4. **Push-based architecture is acceptable to security teams.** The open-source agent running in customer VPC must satisfy CISO review. If it doesn't, adoption stalls at security-conscious organizations.
5. **PLG motion works for infrastructure tooling.** Bottom-up adoption by individual engineers, expanding to team purchases. If procurement processes block credit card purchases, the self-serve model breaks.
6. **Brian can sustain development velocity across multiple dd0c modules.** Drift is Product #2 in a 6-product suite. If dd0c/route (Phase 1) consumes more time than expected, drift launch delays and the window may close.
---
## 6. RISKS & MITIGATIONS
### Top 5 Risks (from Party Mode Stress Tests)
**Risk 1: HashiCorp/IBM Ships Native Drift Detection in TFC (Severity: 8/10)**
IBM paid $4.6B for HashiCorp. They have infinite resources and strategic motivation to improve TFC's drift features. If they ship continuous monitoring + Slack alerts + remediation in the TFC Plus tier, the "HashiCorp exodus" narrative dies.
*Why it might not happen:* IBM moves slowly. They'll focus on enterprise governance features that justify $70K+ contracts, not improving drift for the free/starter tier. Post-BSL, the community is migrating to OpenTofu — IBM may double down on enterprise lock-in rather than community features.
*Mitigation:*
- Multi-IaC support is the insurance policy. TFC will never support OpenTofu or Pulumi. Every team using multiple IaC tools is immune to TFC's drift features.
- Speed. Be 18 months ahead on drift-specific features by the time IBM responds. Ship weekly, not quarterly.
- Community lock-in. If dd0c/drift is the community standard (the "driftctl successor"), IBM improving TFC drift won't matter — the community has already chosen.
**Risk 2: Solo Founder Burnout (Severity: 9/10, Probability: High)**
This is the risk the party mode panel was most worried about — and so am I. dd0c is 6 products. Even with drift in Phase 2, Brian will be maintaining dd0c/route while building drift. Adding a 4th, 5th, 6th product is not "building new products" — it's adding 25% more work each time to an already unsustainable workload.
*Mitigation:*
- Shared platform infrastructure (auth, billing, OTel pipeline) built once and reused. If each product has its own backend, this fails.
- AI-assisted development for 80% of boilerplate.
- Hire at $30K MRR. Don't try to be solo past that threshold.
- Ruthless scope control. MVP means MVP. No feature creep. No Azure/GCP until $30K MRR.
**Risk 3: Spacelift/env0 Commoditize Drift Detection (Severity: 7/10)**
If dd0c/drift gains traction and appears in "Spacelift alternatives" searches, Spacelift's marketing team will notice. The easiest response: drop basic drift detection into their free tier.
*Why it might not happen:* Spacelift's drift detection requires private workers with infrastructure costs. Making it free erodes their upgrade path. Their investors won't love giving away features that drive enterprise upgrades.
*Mitigation:*
- Be better, not just cheaper. If drift detection is 10x better (Slack-native, one-click remediation, compliance reports, multi-IaC), "free but mediocre" from Spacelift won't matter. Nobody switched from Figma to free Adobe XD.
- Different buyer. Spacelift's free tier targets teams evaluating their platform. dd0c/drift targets teams who don't want a platform. Different buyer, different motion.
**Risk 4: Enterprise Security Teams Block Adoption (Severity: 8/10)**
Reading state files means reading resource configurations, sometimes including sensitive data. Giving a bootstrapped SaaS tool access to production AWS and state buckets is a red flag for any CISO. The party mode CTO called this severity 9/10.
*Mitigation:*
- Push-based architecture is non-negotiable. The SaaS never pulls from customer cloud. The open-source agent runs in their VPC and pushes encrypted drift diffs out.
- Open-source the agent so security teams can audit the code. Trust through transparency.
- Get dd0c SOC 2 certified. Expensive ($20-50K) but eliminates the "can we trust a solo founder's SaaS?" objection. You can't sell a compliance tool without passing compliance yourself.
**Risk 5: "Do Nothing" Inertia (Severity: 6/10, Probability: High)**
Most teams tolerate drift. They've been tolerating it for years. The primary substitute is "do nothing" — manual `terraform plan` runs, tribal knowledge, and hope. Converting tolerators to payers requires more effort than converting seekers to payers.
*Mitigation:*
- The Drift Cost Calculator directly attacks this by quantifying the cost of "good enough." When an engineer sees "$47K/year in drift management costs" vs. "$149/mo for dd0c/drift," the bash script suddenly looks expensive.
- Target seekers first (driftctl refugees, post-incident teams, pre-audit teams), not tolerators. The beachhead is people already in pain.
- Compliance as forcing function. When the auditor says "you need continuous drift detection," inertia loses.
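A Drift Cost Calculator of this kind reduces to simple arithmetic: manual labor plus incident cost. The formula and all inputs below are hypothetical (the actual calculator's model isn't specified here), but they show the shape of the pitch:

```python
def annual_drift_cost(engineers: int, hours_per_week: float, hourly_rate: float,
                      incidents_per_year: int, cost_per_incident: float) -> float:
    """Rough annual cost of managing drift manually (illustrative formula only)."""
    labor = engineers * hours_per_week * 52 * hourly_rate
    incidents = incidents_per_year * cost_per_incident
    return labor + incidents

# Made-up inputs: 4 engineers spending 2 h/week at $85/h,
# plus 3 drift-related incidents a year at $5K each.
print(annual_drift_cost(4, 2, 85, 3, 5000))  # → 50360
```

Against a number in that range, $149/mo (~$1.8K/year) is the contrast the calculator is designed to surface.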
### Kill Criteria
**Kill at 6 months if ANY of these are true:**
1. < 50 free tier signups after HN launch + Reddit engagement + blog content. Market doesn't care.
2. < 5 paying customers after 90 days of paid tier availability. Free users who won't pay are vanity.
3. Free-to-paid conversion < 3%. Industry benchmark for PLG dev tools is 3-7%.
4. NPS < 30 from first 20 customers. If early adopters aren't enthusiastic, the product isn't solving a real problem.
5. HashiCorp announces "TFC Drift Detection Pro" with continuous monitoring, Slack alerts, and remediation included in Plus tier — before dd0c/drift has 100+ customers.
**Kill at 12 months if ANY of these are true:**
1. < $10K MRR. Growth trajectory doesn't support standalone product. Fold drift into dd0c/portal as a feature.
2. Monthly churn > 8%. Dev tools should have <5%. Above 8% means the product isn't sticky.
3. CAC payback > 12 months. Unit economics don't work for a bootstrapped founder.
### Pivot Options
- **Pivot A: Compliance Engine.** If drift detection alone doesn't convert but compliance reports do, pivot to a broader "IaC Compliance Platform" — drift detection becomes a feature feeding compliance evidence generation, audit trail management, and regulatory reporting. Diana becomes the primary buyer, not Ravi.
- **Pivot B: dd0c/portal Feature.** If drift doesn't sustain as a standalone product, fold it into dd0c/portal as the "infrastructure health" module. Drift detection becomes a feature of the IDP, not a product. Reduces standalone revenue pressure.
- **Pivot C: Multi-Tool Standard.** If the multi-IaC angle resonates more than drift specifically, pivot to a generic "IaC state comparison engine" that integrates with existing observability tools (Datadog, New Relic). Become the standard for state comparison, let others build the UX.
---
## 7. SUCCESS METRICS
### North Star Metric
**Stacks monitored** (total across all customers).
This measures adoption depth, not just customer count. A customer monitoring 50 stacks is 10x more engaged (and 10x more likely to retain) than a customer monitoring 5. It also directly correlates with revenue (more stacks = higher tier) and with the data flywheel (more stacks = better drift intelligence).
### Leading Indicators
| Metric | Description | Why It Matters |
|---|---|---|
| **GitHub stars (drift-cli)** | Social proof and top-of-funnel awareness | Stars → downloads → free signups → paid conversions |
| **Free tier signups** | Activation rate of interested engineers | Measures whether the value proposition resonates |
| **Free-to-paid conversion rate** | % of free users who upgrade | Measures whether the product delivers enough value to pay for |
| **Time-to-first-alert** | Minutes from signup to first Slack drift alert | Measures onboarding friction. Target: <5 minutes. |
| **Weekly active stacks** | Stacks with at least one drift check in the past 7 days | Measures engagement depth, not just signup vanity |
| **Slack action rate** | % of drift alerts that receive a Revert/Accept/Snooze action | Measures whether alerts are actionable vs. noise |
### Lagging Indicators
| Metric | Description | Target |
|---|---|---|
| **MRR** | Monthly Recurring Revenue | See milestones below |
| **Net Dollar Retention (NDR)** | Revenue expansion from existing customers | >120% (customers upgrade as they add stacks) |
| **Monthly churn** | % of paying customers lost per month | <5% |
| **CAC payback** | Months to recoup customer acquisition cost | <6 months |
| **LTV:CAC ratio** | Lifetime value vs. acquisition cost | >3:1 (target 10:1+) |
| **NPS** | Net Promoter Score from paying customers | >40 |
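NDR as used here follows the standard cohort definition: a cohort's starting MRR plus expansion, minus contraction and churned revenue, divided by starting MRR. A minimal sketch with made-up cohort numbers:

```python
def net_dollar_retention(start_mrr: float, expansion: float,
                         contraction: float, churned: float) -> float:
    """NDR over a period, as a percentage of the cohort's starting MRR."""
    return 100 * (start_mrr + expansion - contraction - churned) / start_mrr

# Cohort starts at $10K MRR; upgrades add $3K; downgrades remove $500;
# churned customers remove $1K.
print(net_dollar_retention(10_000, 3_000, 500, 1_000))  # → 115.0
```

An NDR above 100% means the existing base grows on its own, which is why stack-count expansion is the engine behind the >120% target.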
### Milestones
**30 Days Post-Launch:**
- 200+ GitHub stars on drift-cli
- 50+ free tier signups
- 3-5 design partners actively using the product
- First Slack alert delivered to a non-Brian user
- Zero critical bugs in production
**60 Days Post-Launch:**
- 500+ GitHub stars
- 150+ free tier signups
- 10+ paying customers
- $1.5K+ MRR
- "driftctl Is Dead" blog post ranking on page 1 for "driftctl alternative"
- First unsolicited mention on r/terraform or r/devops
**90 Days Post-Launch:**
- 1,000+ GitHub stars
- 300+ free tier signups
- 25+ paying customers
- $3.5K+ MRR
- Free-to-paid conversion rate >5%
- First design partner case study published
- NPS >40 from first 20 customers
### Month 6 Targets
| Metric | Target |
|---|---|
| GitHub stars | 1,500 |
| Free tier users | 600 |
| Paying customers | 50 |
| MRR | $7,500 |
| Stacks monitored | 1,500 |
| Monthly churn | <5% |
| NDR | >110% |
### Month 12 Targets
| Metric | Target |
|---|---|
| GitHub stars | 3,000 |
| Free tier users | 1,500 |
| Paying customers | 150 |
| MRR | $22,000 |
| Stacks monitored | 5,000 |
| Monthly churn | <4% |
| NDR | >120% |
| Free-to-paid conversion | 7% |
| NPS | >50 |
| CAC payback | <6 months |
| LTV:CAC | >10:1 |
### Scenario-Weighted Revenue Projection
| Scenario | Probability | Month 6 MRR | Month 12 MRR | Month 24 MRR |
|---|---|---|---|---|
| **Rocket** (viral HN launch, community adopts as driftctl successor) | 20% | $15K | $52K | $180K |
| **Grind** (steady growth, community works but slowly) | 50% | $6K | $22K | $75K |
| **Slog** (interest but low conversion, competitors respond) | 25% | $2.2K | $9K | $22K |
| **Flop** (market doesn't materialize) | 5% | $750 | $5K | $5K |
| **Weighted Expected Value** | — | **$6.6K** | **$23.9K** | **$79.3K** |
Weighted Month 12 MRR of ~$24K = ~$287K ARR. For a bootstrapped solo founder, that's a real business. Not a unicorn. A real business that funds the dd0c platform expansion.
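The weighted row is a straight probability-weighted sum of the scenario columns; a quick sketch reproducing it from the table's figures:

```python
scenarios = {  # probability, (Month 6, Month 12, Month 24) MRR in $K, from the table above
    "Rocket": (0.20, (15, 52, 180)),
    "Grind":  (0.50, (6, 22, 75)),
    "Slog":   (0.25, (2.2, 9, 22)),
    "Flop":   (0.05, (0.75, 5, 5)),
}

for i, month in enumerate(["Month 6", "Month 12", "Month 24"]):
    ev = sum(p * mrr[i] for p, mrr in scenarios.values())
    print(f"{month}: ${ev:.2f}K")  # expected values: ~$6.59K, $23.90K, $79.25K
```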
---
## APPENDIX: CROSS-PHASE CONTRADICTION RESOLUTION
This brief synthesized four prior phase documents. Key contradictions and their resolutions:
| Contradiction | Resolution |
|---|---|
| **Pricing: $29/stack flat vs. tiered bundles** — Brainstorm proposed $29/stack. Innovation strategy recommended tiers ($49-$399). Party mode practitioner wanted $149 Pro tier. | **Tiered bundles win.** Flat per-stack penalizes good architecture, creates enterprise sticker shock, and is unpredictable. Tiers solve all three while preserving the "$29/stack" marketing anchor. See Section 3 pricing table. |
| **Launch sequencing: Phase 3 (months 7-12) vs. Phase 2 (months 4-6)** — Brand strategy placed drift in Phase 3. Innovation strategy and party mode both recommended Phase 2. | **Phase 2 wins.** The driftctl vacuum is time-sensitive. Every month of delay shrinks the window. dd0c/route (Phase 1) is a faster build; drift follows immediately. |
| **Standalone product vs. platform wedge** — VC panelist said $3-5M SOM isn't venture-scale. Bootstrap founder said $3M ARR solo is phenomenal. | **Both are right.** Drift is a strong standalone bootstrapped business AND a wedge into the dd0c platform. The brief treats it as both: standalone metrics for the first 12 months, platform expansion metrics for months 12-24. No need to choose yet. |
| **Auto-remediation scope** — CTO warned about blast radius of one-click revert. Practitioner said MVP should focus on safe reverts (security groups), not complex state (RDS parameters). | **Spectrum of automation.** Per-resource-type policies: auto-revert for security groups opened to 0.0.0.0/0, alert + one-click for IAM, digest for tags, ignore for ASG scaling. The spectrum IS the product's sophistication. Complex state remediation generates a PR for human review, not a direct apply. |
| **Architecture: SaaS pull vs. push-based agent** — Contrarian and CTO both flagged IAM trust as a blocker. Practitioner proposed push-based agent. | **Push-based is non-negotiable.** The SaaS never pulls from customer cloud. Open-source agent runs in customer VPC, pushes encrypted diffs out. This was unanimous across all phases. |
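The per-resource-type remediation spectrum resolved above can be expressed as a simple lookup with a safe default. The action labels and resource keys below are illustrative, not dd0c's actual schema:

```python
# Per-resource-type remediation policy, mirroring the spectrum described above.
REMEDIATION_POLICY = {
    "aws_security_group": "auto_revert",       # e.g. 0.0.0.0/0 ingress reverted immediately
    "aws_iam_role_policy": "alert_one_click",  # human confirms before revert
    "tags": "daily_digest",                    # batched, low urgency
    "aws_autoscaling_group": "ignore",         # expected runtime variance
}

def action_for(resource_type: str) -> str:
    # Unknown resource types fall back to the safest option: alert a human.
    # Complex state (e.g. RDS parameters) would generate a PR for review,
    # never a direct apply.
    return REMEDIATION_POLICY.get(resource_type, "alert_one_click")
```

The point of the lookup shape is that the spectrum is configuration, not code: customers can tighten or loosen per resource type without touching the engine.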
---
*"The window won't wait. Ship it."* — Victor
**Document Status:** COMPLETE
**Confidence Level:** HIGH
**Next Step:** Technical architecture session — define the detection engine, state backend integrations, and Slack workflow architecture.
