dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00
commit 5ee95d8b13
51 changed files with 36935 additions and 0 deletions

# dd0c/drift - V1 MVP Epics
## Epic 1: Drift Agent (Go CLI)
**Description:** The core open-source Go binary that runs in the customer's environment. It parses Terraform state, polls AWS APIs for actual resource configurations, calculates the diff, and scrubs sensitive data before transmission.
### User Stories
#### Story 1.1: Terraform State Parser
**As an** Infrastructure Engineer, **I want** the agent to parse my Terraform state file locally, **so that** it can identify declared resources without uploading my raw state to a third party.
* **Acceptance Criteria:**
* Successfully parses Terraform state v4 JSON format.
* Extracts a list of `managed` resources with their declared attributes.
* Handles both local `.tfstate` files and AWS S3 remote backend configurations.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** Use standard Go JSON unmarshaling. Create an internal graph representation. Focus exclusively on AWS provider resources.
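A minimal sketch of the parser described above, using standard `encoding/json` unmarshaling. Only the subset of the state v4 format this story needs is modeled (`version`, `resources[].mode/type/name`); real state files carry many more fields, which `encoding/json` ignores by default.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State models the minimal subset of the Terraform state v4 format
// needed to list managed resources; unknown fields are ignored.
type State struct {
	Version   int        `json:"version"`
	Resources []Resource `json:"resources"`
}

type Resource struct {
	Mode      string     `json:"mode"` // "managed" or "data"
	Type      string     `json:"type"`
	Name      string     `json:"name"`
	Instances []Instance `json:"instances"`
}

type Instance struct {
	Attributes map[string]any `json:"attributes"`
}

// ManagedResources parses a state v4 document and returns the
// addresses of all managed (not data-source) resources.
func ManagedResources(raw []byte) ([]string, error) {
	var st State
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	if st.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", st.Version)
	}
	var out []string
	for _, r := range st.Resources {
		if r.Mode == "managed" {
			out = append(out, r.Type+"."+r.Name)
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"version":4,"resources":[
	  {"mode":"managed","type":"aws_security_group","name":"web","instances":[{"attributes":{"id":"sg-123"}}]},
	  {"mode":"data","type":"aws_ami","name":"base","instances":[]}]}`)
	addrs, err := ManagedResources(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs) // [aws_security_group.web]
}
```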
#### Story 1.2: AWS Resource Polling
**As an** Infrastructure Engineer, **I want** the agent to query AWS for the current state of my resources, **so that** it can compare reality against my declared Terraform state.
* **Acceptance Criteria:**
* Agent uses customer's local AWS credentials/IAM role to authenticate.
* Queries AWS APIs for the top 20 MVP resource types (e.g., `ec2:DescribeSecurityGroups`, `iam:GetRole`).
* Maps Terraform resource IDs to AWS identifiers.
* **Story Points:** 8
* **Dependencies:** Story 1.1
* **Technical Notes:** Use the official AWS Go SDK v2. Map API responses to a standardized internal schema that matches the state parser output. Add simple retry logic for rate limits.
#### Story 1.3: Drift Diff Calculation
**As an** Infrastructure Engineer, **I want** the agent to calculate attribute-level differences between my state file and AWS reality, **so that** I know exactly what changed.
* **Acceptance Criteria:**
* Compares parsed state attributes with polled AWS attributes.
* Outputs a structured diff showing `old` (state) and `new` (reality) values.
* Ignores AWS-generated default attributes that aren't declared in state.
* **Story Points:** 5
* **Dependencies:** Story 1.1, Story 1.2
* **Technical Notes:** Implement a deep compare function. Requires hardcoded ignore lists for known noisy attributes (e.g., AWS-assigned IDs or timestamps).
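A sketch of the deep-compare described above. It walks the declared state attributes, skips the hardcoded ignore list, and skips AWS-only attributes never declared in state, emitting `old`/`new` pairs for everything that differs. Field and key names are illustrative.

```go
package main

import (
	"fmt"
	"reflect"
)

// AttrDiff records one drifted attribute: the declared (state)
// value and the observed (AWS) value.
type AttrDiff struct {
	Old, New any
}

// Diff compares declared state attributes against polled AWS
// attributes. Keys in `ignore` (known-noisy, AWS-generated) are
// skipped, as are attributes present only in AWS but never
// declared in state.
func Diff(state, actual map[string]any, ignore map[string]bool) map[string]AttrDiff {
	out := map[string]AttrDiff{}
	for key, declared := range state {
		if ignore[key] {
			continue
		}
		observed, ok := actual[key]
		if !ok || !reflect.DeepEqual(declared, observed) {
			out[key] = AttrDiff{Old: declared, New: observed}
		}
	}
	return out
}

func main() {
	state := map[string]any{"ingress_port": 443, "description": "web sg", "arn": "stale"}
	actual := map[string]any{"ingress_port": 22, "description": "web sg", "arn": "fresh", "owner_id": "123"}
	ignore := map[string]bool{"arn": true}
	fmt.Println(Diff(state, actual, ignore)) // map[ingress_port:{443 22}]
}
```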
#### Story 1.4: Secret Scrubbing Engine
**As a** Security Lead, **I want** the agent to scrub all sensitive data from the drift diffs, **so that** my database passwords and API keys are never transmitted to the dd0c SaaS.
* **Acceptance Criteria:**
* Strips any attribute marked `sensitive` in the state file.
* Redacts values matching known secret patterns (e.g., `password`, `secret`, `token`).
* Replaces redacted values with `[REDACTED]`.
* Completely strips the `Private` field from state instances.
* **Story Points:** 3
* **Dependencies:** Story 1.3
* **Technical Notes:** Use regex for pattern matching. Validate scrubber with rigorous unit tests before shipping. Ensure the diff structure remains intact even when values are redacted.
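A minimal sketch of the scrubber: redact any attribute flagged `sensitive` in state, plus anything whose key matches the secret patterns from the acceptance criteria, replacing values with `[REDACTED]` while keeping the diff structure intact. The exact regex is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// secretKey matches attribute names that commonly hold
// credentials (pattern is illustrative, not exhaustive).
var secretKey = regexp.MustCompile(`(?i)(password|secret|token|api_?key)`)

const redacted = "[REDACTED]"

// Scrub returns a copy of attrs with sensitive values replaced by
// [REDACTED]. A value is redacted when its key is in the state
// file's sensitive-attribute set or matches the secret pattern.
// Keys are preserved so the diff structure survives scrubbing.
func Scrub(attrs map[string]any, sensitive map[string]bool) map[string]any {
	out := make(map[string]any, len(attrs))
	for k, v := range attrs {
		if sensitive[k] || secretKey.MatchString(k) {
			out[k] = redacted
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	attrs := map[string]any{"master_password": "hunter2", "engine": "postgres"}
	fmt.Println(Scrub(attrs, map[string]bool{})) // map[engine:postgres master_password:[REDACTED]]
}
```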
## Epic 2: Agent Communication
**Description:** Secure data transmission between the local Go agent and the dd0c SaaS. Handles authentication, mTLS registration, heartbeat signals, and encrypted drift report uploads.
### User Stories
#### Story 2.1: Agent Registration & Authentication
**As a** DevOps Lead, **I want** the agent to securely register itself with the dd0c SaaS using an API key, **so that** my drift data is securely associated with my organization.
* **Acceptance Criteria:**
* Agent registers via `POST /v1/agents/register` using a static API key.
* Generates and exchanges mTLS certificates for subsequent requests.
* Receives configuration details (e.g., poll interval) from the SaaS.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** The agent needs to store the mTLS cert locally or in memory. Implement robust error handling for unauthorized/revoked API keys.
#### Story 2.2: Encrypted Payload Transmission
**As a** Security Lead, **I want** the agent to transmit drift reports over a secure, encrypted channel, **so that** our infrastructure data cannot be intercepted in transit.
* **Acceptance Criteria:**
* Agent POSTs scrubbed drift reports to `/v1/drift-reports`.
* Communication enforces TLS 1.3 and uses the established mTLS client certificate.
* Payload is compressed (gzip) if over a certain threshold.
* **Story Points:** 3
* **Dependencies:** Story 1.4, Story 2.1
* **Technical Notes:** Ensure HTTP client in Go enforces TLS 1.3. Define the strict JSON schema for the `DriftReport` payload.
#### Story 2.3: Agent Heartbeat
**As a** DevOps Lead, **I want** the agent to send regular heartbeats to the SaaS, **so that** I know if the agent crashes or loses connectivity.
* **Acceptance Criteria:**
* Agent sends a lightweight heartbeat payload every N minutes.
* Payload includes uptime, memory usage, and events processed.
* SaaS API logs the heartbeat to track agent health.
* **Story Points:** 2
* **Dependencies:** Story 2.1
* **Technical Notes:** Run heartbeat in a separate Go goroutine with a ticker. Handle transient network errors silently.
## Epic 3: Drift Analysis Engine
**Description:** The SaaS-side Node.js/TypeScript processor that ingests drift reports, classifies severity, calculates stack drift scores, and persists events to the database and event store.
### User Stories
#### Story 3.1: Ingestion & Validation Pipeline
**As a** System Operator, **I want** the SaaS to receive and validate drift reports from agents via SQS, **so that** high volumes of reports are processed reliably without dropping data.
* **Acceptance Criteria:**
* API Gateway routes valid `POST /v1/drift-reports` to an SQS FIFO queue.
* Event Processor ECS task consumes from the queue.
* Validates the report payload against a strict JSON schema.
* **Story Points:** 5
* **Dependencies:** Story 2.2
* **Technical Notes:** Use `zod` for payload validation. Ensure message group IDs use `stack_id` to maintain ordering per stack.
#### Story 3.2: Drift Classification
**As an** Infrastructure Engineer, **I want** the SaaS to classify detected drift by severity and category, **so that** I can prioritize critical security issues over cosmetic tag changes.
* **Acceptance Criteria:**
* Applies YAML-defined classification rules to incoming drift diffs.
* Tags events as Critical, High, Medium, or Low severity.
* Tags events with categories (Security, Configuration, Tags, etc.).
* **Story Points:** 3
* **Dependencies:** Story 3.1
* **Technical Notes:** Implement a fast rule evaluation engine. Default unmatched drift to "Medium/Configuration".
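A sketch of the rule evaluation, assuming rules match drifted attribute names by prefix (the real matching semantics are a design choice, and rules would be loaded from YAML rather than declared in code). Unmatched drift falls back to Medium/Configuration as required.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is one classification entry; in the real engine these are
// loaded from a YAML rules file.
type Rule struct {
	AttrPrefix string // matches drifted attribute names by prefix (assumption)
	Severity   string
	Category   string
}

// Classify returns severity and category for a drifted attribute,
// defaulting to Medium/Configuration when no rule matches.
func Classify(rules []Rule, attr string) (severity, category string) {
	for _, r := range rules {
		if strings.HasPrefix(attr, r.AttrPrefix) {
			return r.Severity, r.Category
		}
	}
	return "Medium", "Configuration"
}

func main() {
	rules := []Rule{
		{AttrPrefix: "ingress", Severity: "Critical", Category: "Security"},
		{AttrPrefix: "tags", Severity: "Low", Category: "Tags"},
	}
	fmt.Println(Classify(rules, "ingress_port"))  // Critical Security
	fmt.Println(Classify(rules, "instance_type")) // Medium Configuration
}
```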
#### Story 3.3: Persistence & Event Sourcing
**As a** Compliance Lead, **I want** every drift detection event to be stored in an immutable log, **so that** I have a reliable audit trail for SOC 2 compliance.
* **Acceptance Criteria:**
* Appends the raw drift event to DynamoDB (immutable event store).
* Upserts the current state of the resource in the PostgreSQL `resources` table.
* Inserts a new record in the PostgreSQL `drift_events` table for open drift.
* **Story Points:** 8
* **Dependencies:** Story 3.2
* **Technical Notes:** Handle database transactions carefully to keep PostgreSQL and DynamoDB in sync. Ensure Row-Level Security (RLS) is applied on all PostgreSQL inserts.
#### Story 3.4: Drift Score Calculation
**As a** DevOps Lead, **I want** the engine to calculate a drift score for each stack, **so that** I have a high-level metric of infrastructure health.
* **Acceptance Criteria:**
* Updates the `drift_score` field on the `stacks` table after processing a report.
* Score is out of 100 (e.g., 100 = completely clean).
* Weighted penalization based on severity (Critical heavily impacts score, Low barely impacts).
* **Story Points:** 3
* **Dependencies:** Story 3.3
* **Technical Notes:** Define a simple but logical weighting algorithm. Run calculation synchronously during event processing for V1.
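One plausible shape for the weighting algorithm: start each stack at 100 and subtract a per-severity penalty for every open drift event, clamped at 0. The weights below are assumptions, not a product decision from this document.

```go
package main

import "fmt"

// severityWeight is one plausible penalty table: Critical events
// hit the score hard, Low events barely register.
var severityWeight = map[string]float64{
	"Critical": 25, "High": 10, "Medium": 3, "Low": 1,
}

// DriftScore computes a 0-100 stack score from the severities of
// its open drift events (100 = completely clean).
func DriftScore(openEventSeverities []string) int {
	score := 100.0
	for _, sev := range openEventSeverities {
		score -= severityWeight[sev]
	}
	if score < 0 {
		score = 0
	}
	return int(score)
}

func main() {
	fmt.Println(DriftScore(nil))                                // 100
	fmt.Println(DriftScore([]string{"Critical", "Low", "Low"})) // 73
}
```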
## Epic 4: Notification Service
**Description:** A Lambda-based service that formats drift events into actionable Slack messages (Block Kit) and handles delivery routing per stack configuration.
### User Stories
#### Story 4.1: Slack Block Kit Formatting
**As an** Infrastructure Engineer, **I want** drift alerts to arrive as rich Slack messages, **so that** I can easily read the diff and context without leaving Slack.
* **Acceptance Criteria:**
* Lambda function maps drift events to Slack Block Kit JSON.
* Message includes Stack Name, Resource Address, Timestamp, Severity, and CloudTrail Attribution.
* Displays a code block showing the `old` vs `new` attribute diff.
* **Story Points:** 5
* **Dependencies:** Story 3.3
* **Technical Notes:** Build a flexible Block Kit template builder. Ensure diffs that are too long are gracefully truncated to fit Slack's block length limits.
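Slack documents a 3000-character limit on a section block's `text` field, so long diffs need graceful truncation before rendering; a sketch (the truncation marker is an assumption):

```go
package main

import "fmt"

// maxBlockText is Slack's documented limit for a section block's
// text field.
const maxBlockText = 3000

// CodeBlock wraps a diff in a Slack markdown code fence,
// truncating gracefully when it would exceed the block limit.
func CodeBlock(diff string) string {
	const fence = "```"
	const suffix = "\n… (truncated)"
	// Budget leaves room for both fences, the suffix, and newlines.
	budget := maxBlockText - 2*len(fence) - len(suffix) - 2
	if len(diff) > budget {
		diff = diff[:budget] + suffix
	}
	return fence + "\n" + diff + "\n" + fence
}

func main() {
	long := make([]byte, 5000)
	for i := range long {
		long[i] = 'x'
	}
	fmt.Println(len(CodeBlock(string(long))) <= maxBlockText) // true
}
```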
#### Story 4.2: Slack Routing & Fanout
**As a** DevOps Lead, **I want** drift alerts to be routed to specific Slack channels based on the stack, **so that** the right team sees the alert without noise.
* **Acceptance Criteria:**
* Checks the `stacks` table for custom Slack channel overrides.
* Falls back to the organization's default Slack channel.
* Sends the formatted message via the Slack API.
* **Story Points:** 3
* **Dependencies:** Story 4.1
* **Technical Notes:** The Event Processor triggers this Lambda asynchronously via SQS (`notification-fanout` queue).
#### Story 4.3: Action Buttons (Revert/Accept)
**As an** Infrastructure Engineer, **I want** action buttons directly on the Slack alert, **so that** I can quickly trigger remediation workflows.
* **Acceptance Criteria:**
* Slack message includes interactive buttons: `[Revert]`, `[Accept]`, `[Snooze]`, `[Assign]`.
* Buttons contain the `drift_event_id` in their payload value.
* **Story Points:** 2
* **Dependencies:** Story 4.1
* **Technical Notes:** This story covers rendering the buttons only; handling the interactive callbacks is deferred to the Slack Bot epic and the Remediation Engine (both within V1 MVP scope).
#### Story 4.4: Notification Batching (Low Severity)
**As an** Infrastructure Engineer, **I want** low and medium severity drift alerts to be batched into a digest, **so that** my Slack channel isn't spammed with noisy tag changes.
* **Acceptance Criteria:**
* Critical/High alerts are sent immediately.
* Medium/Low alerts are held in a DynamoDB table or SQS delay queue and dispatched as a daily/hourly digest.
* **Story Points:** 8
* **Dependencies:** Story 4.2
* **Technical Notes:** Use EventBridge Scheduler or a cron Lambda to process and flush the digest queue periodically.
## Epic 5: Dashboard API
**Description:** The REST API powering the React SPA web dashboard. Handles user authentication (Cognito), organization/stack management, and querying the PostgreSQL data for drift history.
### User Stories
#### Story 5.1: API Authentication & RLS Setup
**As a** System Operator, **I want** the API to enforce authentication and isolate tenant data, **so that** a user from one organization cannot see another organization's data.
* **Acceptance Criteria:**
* Integrates AWS Cognito JWT validation middleware.
* API sets `app.current_org_id` on the PostgreSQL connection session for Row-Level Security (RLS).
* Returns `401/403` for unauthorized requests.
* **Story Points:** 5
* **Dependencies:** Database Schema (RLS)
* **Technical Notes:** Build an Express/Node.js middleware layer. Ensure strict parameterization for all SQL queries beyond RLS to avoid injection.
#### Story 5.2: Stack Management Endpoints
**As a** DevOps Lead, **I want** a set of REST endpoints to view and manage my stacks, **so that** I can configure ownership, check connection health, and monitor the drift score.
* **Acceptance Criteria:**
* Implements `GET /v1/stacks` (list all stacks with their scores and resource counts).
* Implements `GET /v1/stacks/:id` (stack details).
* Implements `PATCH /v1/stacks/:id` (update name, owner, Slack channel).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Support basic pagination (`limit`, `offset`) on list endpoints. Include agent heartbeat status in stack details if applicable.
#### Story 5.3: Drift History & Event Queries
**As a** Security Lead, **I want** endpoints to search and filter drift events, **so that** I can review historical drift, find specific changes, and generate audit reports.
* **Acceptance Criteria:**
* Implements `GET /v1/drift-events` with filters for `stack_id`, `status` (open/resolved), `severity`, and timestamp ranges.
* Joins the `drift_events` table with `resources` to return full address paths and diff payloads.
* **Story Points:** 5
* **Dependencies:** Story 5.1
* **Technical Notes:** Expose the JSONB `diff` field cleanly in the response payload. Use index-backed PostgreSQL queries for fast filtering.
#### Story 5.4: Policy Configuration Endpoints
**As a** DevOps Lead, **I want** an API to manage remediation policies, **so that** I can customize how the agent reacts to specific resource drift.
* **Acceptance Criteria:**
* Implements CRUD operations for stack-level and org-level policies (`/v1/policies`).
* Validates policy configuration payloads (e.g., action type, valid resource expressions).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** For V1, store policies as JSON fields in a simple `policies` table or direct mapping to stacks. Keep validation simple (e.g., regex checks on resource types).
## Epic 6: Dashboard UI
**Description:** The React Single Page Application (SPA) providing a web dashboard for stack overview, drift timeline, and resource-level diff viewer.
### User Stories
#### Story 6.1: Stack Overview Dashboard
**As a** DevOps Lead, **I want** a main dashboard showing all my stacks and their drift scores, **so that** I can assess infrastructure health at a glance.
* **Acceptance Criteria:**
* Displays a list/table of all monitored stacks.
* Shows a visual "Drift Score" indicator (0-100) per stack.
* Sortable by score, name, and last checked timestamp.
* Provides visual indicators for agent connection status.
* **Story Points:** 5
* **Dependencies:** Story 5.2
* **Technical Notes:** Build with React and Vite. Use a standard UI library (e.g., Tailwind UI or MUI). Implement efficient data fetching (e.g., React Query).
#### Story 6.2: Stack Detail & Drift Timeline
**As an** Infrastructure Engineer, **I want** to click into a stack and see a timeline of drift events, **so that** I can track when things changed and who changed them.
* **Acceptance Criteria:**
* Shows a chronological list of drift events for the selected stack.
* Displays open vs. resolved status.
* Filters for severity and category.
* Includes CloudTrail attribution data (who, IP, action).
* **Story Points:** 5
* **Dependencies:** Story 5.3, Story 6.1
* **Technical Notes:** Support pagination/infinite scrolling for the timeline. Use clear icons for event types (Security vs. Tags).
#### Story 6.3: Resource-Level Diff Viewer
**As an** Infrastructure Engineer, **I want** to see the exact attribute changes for a drifted resource, **so that** I know exactly how reality differs from my state file.
* **Acceptance Criteria:**
* Clicking an event opens a detailed view/modal.
* Renders a code-diff view (red for old state, green for new reality).
* Clearly marks redacted sensitive values.
* **Story Points:** 5
* **Dependencies:** Story 6.2
* **Technical Notes:** Use a specialized diff viewing component (e.g., `react-diff-viewer`). Ensure it handles large JSON blocks gracefully.
#### Story 6.4: Auth & User Settings
**As a** User, **I want** to manage my account and view my API keys, **so that** I can deploy the agent and access my organization's dashboard.
* **Acceptance Criteria:**
* Implements login/signup via Cognito (Email/Password & GitHub OAuth).
* Provides a settings page displaying the organization's static API key.
* Displays current subscription plan (Free tier limits for MVP).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Securely manage JWT storage (HttpOnly cookies or secure local storage). Include a clear "copy to clipboard" for the API key.
## Epic 7: Slack Bot
**Description:** The interactive Slack application that handles user commands (`/drift score`) and processes the interactive action buttons (`[Revert]`, `[Accept]`) from drift alerts.
### User Stories
#### Story 7.1: Interactive Remediation Callbacks (Revert)
**As an** Infrastructure Engineer, **I want** clicking `[Revert]` on a Slack alert to trigger a targeted `terraform apply`, **so that** I can fix drift instantly without leaving Slack.
* **Acceptance Criteria:**
* SaaS API Gateway (`/v1/slack/interactions`) receives the button click payload.
* Validates the Slack request signature.
* Generates a scoped `terraform plan -target` command and queues it for the agent.
* Updates the Slack message to "Reverting...".
* **Story Points:** 8
* **Dependencies:** Story 4.3
* **Technical Notes:** The actual execution happens via the Remediation Engine (ECS Fargate) dispatching commands to the agent. Requires careful state tracking (Pending -> Executing -> Completed).
#### Story 7.2: Interactive Acceptance Callbacks (Accept)
**As an** Infrastructure Engineer, **I want** clicking `[Accept]` to auto-generate a PR that updates my Terraform code to match reality, **so that** the drift becomes the new source of truth.
* **Acceptance Criteria:**
* SaaS generates a code patch representing the new state.
* Uses the GitHub API to create a branch and open a PR against the target repo.
* Updates the Slack message with a link to the PR.
* **Story Points:** 8
* **Dependencies:** Story 7.1
* **Technical Notes:** Will require GitHub App integration (or OAuth token) to create branches/PRs. The patch generation logic needs robust testing.
#### Story 7.3: Slack Slash Commands
**As a** DevOps Lead, **I want** to use `/drift score` and `/drift status <stack>` in Slack, **so that** I can check my infrastructure health on demand.
* **Acceptance Criteria:**
* `/drift score` returns the aggregate score for the organization.
* `/drift status prod-networking` returns the score, open events, and agent health for a specific stack.
* Formats output as a clean Slack Block Kit message visible only to the user.
* **Story Points:** 5
* **Dependencies:** Story 4.1, Story 5.2
* **Technical Notes:** Deploy an API Gateway endpoint specifically for Slack slash commands. Validate the token and use the internal Dashboard API logic to fetch scores.
#### Story 7.4: Snooze & Assign Callbacks
**As an** Infrastructure Engineer, **I want** to click `[Snooze 24h]` or `[Assign]`, **so that** I can manage alert noise or delegate investigation to a teammate.
* **Acceptance Criteria:**
* `[Snooze]` updates the event status to `snoozed` and schedules a wake-up time.
* `[Assign]` opens a Slack modal to select a team member, updating the event owner.
* The original Slack message updates to reflect the new state/owner.
* **Story Points:** 5
* **Dependencies:** Story 7.1
* **Technical Notes:** Snooze requires a scheduled EventBridge or cron job to un-snooze. Assign requires interacting with Slack's user selection menus.
## Epic 8: Infrastructure & DevOps
**Description:** The underlying cloud resources for the dd0c SaaS and the CI/CD pipelines to build, test, and release the agent and services.
### User Stories
#### Story 8.1: SaaS Infrastructure (Terraform)
**As a** System Operator, **I want** the SaaS infrastructure to be defined as code, **so that** deployments are repeatable and I can dogfood my own drift detection tool.
* **Acceptance Criteria:**
* Defines VPC, Subnets, ECS Fargate Clusters, RDS PostgreSQL (Multi-AZ), API Gateway, and SQS FIFO queues.
* Sets up CloudWatch log groups and IAM roles.
* Uses Terraform for all configuration.
* **Story Points:** 8
* **Dependencies:** Architecture Design Document
* **Technical Notes:** Build a modular Terraform setup. Use the official AWS provider. Include variables for environment separation (staging vs. prod).
#### Story 8.2: CI/CD Pipeline (GitHub Actions)
**As a** Developer, **I want** a fully automated CI/CD pipeline, **so that** code pushed to `main` is linted, tested, built, and deployed to ECS.
* **Acceptance Criteria:**
* Runs `golangci-lint`, `go test`, ESLint, and Vitest on PRs.
* Builds multi-stage Docker images for the Event Processor, Dashboard API, and Remediation Engine.
* Pushes images to ECR and triggers an ECS rolling deploy.
* **Story Points:** 5
* **Dependencies:** Story 8.1
* **Technical Notes:** Use standard GitHub Actions (e.g., `aws-actions/configure-aws-credentials`). Add Trivy for basic container scanning.
#### Story 8.3: Agent Distribution (Releases & Homebrew)
**As an** Open Source User, **I want** to easily download and install the CLI agent, **so that** I can test drift detection locally without building from source.
* **Acceptance Criteria:**
* Configures GoReleaser to cross-compile binaries for Linux/macOS/Windows (amd64/arm64).
* Auto-publishes GitHub Releases when a new tag is pushed.
* Creates a custom Homebrew tap (`brew install dd0c/tap/drift-cli`).
* **Story Points:** 5
* **Dependencies:** Story 1.1
* **Technical Notes:** Create a dedicated `.github/workflows/release.yml` for GoReleaser.
#### Story 8.4: Agent Terraform Module Publication
**As a** DevOps Lead, **I want** a pre-built Terraform module to deploy the agent in my AWS account, **so that** I don't have to manually configure ECS tasks and EventBridge rules.
* **Acceptance Criteria:**
* Creates the `dd0c/drift-agent/aws` Terraform module.
* Provisions an ECS Task, EventBridge rules, SQS, and IAM roles for the customer.
* Publishes the module to the public Terraform Registry.
* **Story Points:** 8
* **Dependencies:** Story 8.3
* **Technical Notes:** Adhere to Terraform Registry best practices. Ensure the `README.md` clearly explains the required `dd0c_api_key` and state bucket variables.
## Epic 9: Onboarding & PLG (Product-Led Growth)
**Description:** The self-serve funnel that guides users from CLI installation to their first drift alert in under 5 minutes, plus billing and tier management.
### User Stories
#### Story 9.1: Self-Serve Signup & CLI Login
**As an** Engineer, **I want** to easily sign up for the free tier via the CLI, **so that** I don't have to fill out sales forms to test the product.
* **Acceptance Criteria:**
* Running `drift auth login` opens a browser to an OAuth flow (GitHub/Email).
* The CLI spins up a local web server to catch the callback token.
* Successfully provisions an organization and user account in the SaaS.
* **Story Points:** 5
* **Dependencies:** Story 5.4, Story 8.3
* **Technical Notes:** The callback server should listen on `localhost` with a random or standard port (e.g., `8080`).
#### Story 9.2: Auto-Discovery (`drift init`)
**As an** Infrastructure Engineer, **I want** the CLI to auto-discover my Terraform state files, **so that** I can configure my first stack without typing S3 ARNs.
* **Acceptance Criteria:**
* `drift init` scans the current directory for `*.tf` files.
* Uses default AWS credentials to query S3 buckets matching common state file patterns.
* Prompts the user to register discovered stacks to their organization.
* **Story Points:** 8
* **Dependencies:** Story 9.1
* **Technical Notes:** Implement robust fallback to manual input if discovery fails.
#### Story 9.3: Free Tier Enforcement (1 Stack)
**As a** Product Manager, **I want** to enforce a free tier limit of 1 stack, **so that** users get value but are incentivized to upgrade for larger infrastructure needs.
* **Acceptance Criteria:**
* The API rejects attempts to register more than 1 stack on the Free plan.
* The Dashboard clearly shows "1/1 Stacks Used".
* The CLI prompts "Upgrade to Starter ($49/mo)" when trying to add a second stack.
* **Story Points:** 3
* **Dependencies:** Story 9.1
* **Technical Notes:** Enforce limits securely at the API level (e.g., `POST /v1/stacks` should return a `403 Stack Limit` error).
#### Story 9.4: Stripe Billing Integration
**As a** Solo Founder, **I want** customers to upgrade to paid tiers with a credit card, **so that** I can capture revenue without a sales process.
* **Acceptance Criteria:**
* Integrates Stripe Checkout for the Starter ($49/mo) and Pro ($149/mo) tiers.
* Dashboard provides a billing management portal (Stripe Customer Portal).
* Webhooks listen for successful payments and update the organization's `plan` field in PostgreSQL.
* **Story Points:** 8
* **Dependencies:** Story 5.2
* **Technical Notes:** Must include Stripe webhook signature verification to prevent spoofed upgrades. Store `stripe_customer_id` on the `organizations` table.
---
## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/drift adheres to the 5 Transparent Factory architectural tenets. For a drift detection product, these tenets are especially critical — the tool that detects infrastructure drift must itself be immune to uncontrolled drift.
### Story 10.1: Atomic Flagging — Feature Flags for Detection Behaviors
**As a** solo founder, **I want** every new drift detection rule, remediation action, and notification behavior wrapped in a feature flag (default: off), **so that** I can ship new detection capabilities without accidentally triggering false-positive alerts for customers.
**Acceptance Criteria:**
- OpenFeature SDK integrated into the Go agent. V1 provider: env-var or JSON file-based (no external service).
- All flags evaluate locally — no network calls during drift scan execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if any flag at 100% rollout exceeds TTL.
- Automated circuit breaker: if a flagged detection rule generates >3x the baseline false-positive rate over 1 hour, the flag auto-disables.
- Flags required for: new IaC provider support (Terraform/Pulumi/CDK), remediation suggestions, Slack notification formats, scan scheduling changes.
**Estimate:** 5 points
**Dependencies:** Epic 1 (Agent Core)
**Technical Notes:**
- Use the Go OpenFeature SDK (`github.com/open-feature/go-sdk`). JSON file provider for V1.
- Circuit breaker: track false-positive dismissals per rule in Redis. If dismissal rate spikes, disable the flag.
- Flag audit: `make flag-audit` lists all flags with TTL status.
### Story 10.2: Elastic Schema — Additive-Only for Drift State Storage
**As a** solo founder, **I want** all DynamoDB and state file schema changes to be strictly additive, **so that** agent rollbacks never corrupt drift history or lose customer scan results.
**Acceptance Criteria:**
- CI lint rejects any migration or schema change that removes, renames, or changes the type of existing DynamoDB attributes.
- New attributes use `_v2` suffix when breaking changes are needed. Old attributes remain readable.
- Go structs use `json:",omitempty"` and ignore unknown fields so V1 agents can read V2 state files without crashing.
- Dual-write enforced during migration windows: agent writes to both old and new attribute paths in the same DynamoDB `TransactWriteItems` call.
- Every schema change includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
**Estimate:** 3 points
**Dependencies:** Epic 2 (State Management)
**Technical Notes:**
- DynamoDB Single Table Design: version items with `_v` attribute. Agent code uses a factory to select the correct model.
- For Terraform state parsing, use `encoding/json` without calling `Decoder.DisallowUnknownFields`, so unknown fields are silently ignored and upstream state format changes are tolerated.
- S3 state snapshots: never overwrite — always write new versioned keys.
### Story 10.3: Cognitive Durability — Decision Logs for Detection Logic
**As a** future maintainer, **I want** every change to drift detection algorithms, severity scoring, or remediation logic accompanied by a `decision_log.json`, **so that** I understand why a particular drift pattern is flagged as critical vs. informational.
**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log entry for PRs touching `pkg/detection/`, `pkg/scoring/`, or `pkg/remediation/`.
- Cyclomatic complexity cap of 10 enforced via `golangci-lint` with `gocyclo` linter. PRs exceeding this are blocked.
- Decision logs committed in `docs/decisions/`, one per significant logic change.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- PR template includes decision log fields as a checklist.
- For drift scoring changes: document why specific thresholds were chosen (e.g., "security group changes scored critical because X% of breaches start with SG drift").
- `golangci-lint` config: `.golangci.yml` enabling `gocyclo` with `min-complexity: 10` (the setting names the threshold above which functions are reported).
### Story 10.4: Semantic Observability — AI Reasoning Spans on Drift Classification
**As an** SRE debugging a missed drift alert, **I want** every drift classification decision to emit an OpenTelemetry span with structured reasoning metadata, **so that** I can trace why a specific resource change was scored as low-severity when it should have been critical.
**Acceptance Criteria:**
- Every drift scan emits a parent `drift_scan` span. Each resource comparison emits a child `drift_classification` span.
- Span attributes: `drift.resource_type`, `drift.severity_score`, `drift.classification_reason`, `drift.alternatives_considered` (e.g., "considered critical but downgraded because tag-only change").
- If AI-assisted classification is used (future): `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score` included.
- Spans export via OTLP to any compatible backend.
- No PII or customer infrastructure details in spans — resource ARNs are hashed.
**Estimate:** 3 points
**Dependencies:** Epic 1 (Agent Core)
**Technical Notes:**
- Use `go.opentelemetry.io/otel` with OTLP exporter.
- For V1 without AI classification, the `drift.classification_reason` is the rule name + threshold that triggered.
- ARN hashing: SHA-256 truncated to 12 chars for correlation without exposure.
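The ARN hashing note above is a one-liner in Go: the same ARN always yields the same 12-character token (so spans still correlate), but the raw identifier never leaves the agent.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// HashARN pseudonymizes a resource ARN for span attributes:
// SHA-256, hex-encoded, truncated to 12 characters.
func HashARN(arn string) string {
	sum := sha256.Sum256([]byte(arn))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	a := HashARN("arn:aws:iam::123456789012:role/ci")
	b := HashARN("arn:aws:iam::123456789012:role/ci")
	fmt.Println(len(a) == 12, a == b) // true true
}
```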
### Story 10.5: Configurable Autonomy — Governance for Auto-Remediation
**As a** solo founder, **I want** a `policy.json` that controls whether the agent can auto-remediate drift or only report it, **so that** customers maintain full control over what the tool is allowed to change in their infrastructure.
**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (report-only, no remediation) or `audit` (auto-remediate with logging).
- Agent checks policy before every remediation action. In `strict` mode, remediation suggestions are logged but never executed.
- `panic_mode`: when true, agent stops all scans immediately, preserves last-known-good state, and sends a single "paused" notification.
- Per-customer policy override: customers can set their own governance mode via config, which is always more restrictive than the system default (never less).
- All policy decisions logged: "Remediation blocked by strict mode for resource X", "Auto-remediation applied in audit mode".
**Estimate:** 3 points
**Dependencies:** Epic 3 (Remediation Engine)
**Technical Notes:**
- `policy.json` in repo root, loaded at startup, watched via `fsnotify`.
- Customer-level overrides stored in DynamoDB `org_settings` item. Merge logic: `min(system_policy, customer_policy)` — customer can only be MORE restrictive.
- Panic mode trigger: `POST /admin/panic` or env var `DD0C_PANIC=true`. Agent drains current scan and halts.
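The `min(system_policy, customer_policy)` merge above can be sketched as an ordering over modes, where a lower rank is more restrictive and the more restrictive side always wins. The rank table is an assumption over the two modes this story defines.

```go
package main

import "fmt"

// restrictiveness ranks governance modes; lower rank = safer.
// Mode names come from policy.json in this story.
var restrictiveness = map[string]int{"strict": 0, "audit": 1}

// EffectiveMode merges the system default with a customer
// override. The customer can only tighten the policy, never
// loosen it: the more restrictive mode always wins.
func EffectiveMode(system, customer string) string {
	if customer == "" {
		return system // no override set
	}
	if restrictiveness[customer] < restrictiveness[system] {
		return customer
	}
	return system
}

func main() {
	fmt.Println(EffectiveMode("audit", "strict")) // strict
	fmt.Println(EffectiveMode("strict", "audit")) // strict
	fmt.Println(EffectiveMode("audit", ""))       // audit
}
```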
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |