# dd0c/drift - V1 MVP Epics

## Epic 1: Drift Agent (Go CLI)

**Description:** The core open-source Go binary that runs in the customer's environment. It parses Terraform state, polls AWS APIs for actual resource configurations, calculates the diff, and scrubs sensitive data before transmission.

### User Stories

#### Story 1.1: Terraform State Parser

**As an** Infrastructure Engineer, **I want** the agent to parse my Terraform state file locally, **so that** it can identify declared resources without uploading my raw state to a third party.

* **Acceptance Criteria:**
    * Successfully parses Terraform state v4 JSON format.
    * Extracts a list of `managed` resources with their declared attributes.
    * Handles both local `.tfstate` files and AWS S3 remote backend configurations.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** Use standard Go JSON unmarshaling. Create an internal graph representation. Focus exclusively on AWS provider resources.

#### Story 1.2: AWS Resource Polling

**As an** Infrastructure Engineer, **I want** the agent to query AWS for the current state of my resources, **so that** it can compare reality against my declared Terraform state.

* **Acceptance Criteria:**
    * Agent uses customer's local AWS credentials/IAM role to authenticate.
    * Queries AWS APIs for the top 20 MVP resource types (e.g., `ec2:DescribeSecurityGroups`, `iam:GetRole`).
    * Maps Terraform resource IDs to AWS identifiers.
* **Story Points:** 8
* **Dependencies:** Story 1.1
* **Technical Notes:** Use the official AWS Go SDK v2. Map API responses to a standardized internal schema that matches the state parser output. Add simple retry logic for rate limits.

#### Story 1.3: Drift Diff Calculation

**As an** Infrastructure Engineer, **I want** the agent to calculate attribute-level differences between my state file and AWS reality, **so that** I know exactly what changed.

* **Acceptance Criteria:**
    * Compares parsed state attributes with polled AWS attributes.
    * Outputs a structured diff showing `old` (state) and `new` (reality) values.
    * Ignores AWS-generated default attributes that aren't declared in state.
* **Story Points:** 5
* **Dependencies:** Story 1.1, Story 1.2
* **Technical Notes:** Implement a deep compare function. Requires hardcoded ignore lists for known noisy attributes (e.g., AWS-assigned IDs or timestamps).

#### Story 1.4: Secret Scrubbing Engine

**As a** Security Lead, **I want** the agent to scrub all sensitive data from the drift diffs, **so that** my database passwords and API keys are never transmitted to the dd0c SaaS.

* **Acceptance Criteria:**
    * Strips any attribute marked `sensitive` in the state file.
    * Redacts values matching known secret patterns (e.g., `password`, `secret`, `token`).
    * Replaces redacted values with `[REDACTED]`.
    * Completely strips the `Private` field from state instances.
* **Story Points:** 3
* **Dependencies:** Story 1.3
* **Technical Notes:** Use regex for pattern matching. Validate the scrubber with rigorous unit tests before shipping. Ensure the diff structure remains intact even when values are redacted.

## Epic 2: Agent Communication

**Description:** Secure data transmission between the local Go agent and the dd0c SaaS. Handles authentication, mTLS registration, heartbeat signals, and encrypted drift report uploads.

### User Stories

#### Story 2.1: Agent Registration & Authentication

**As a** DevOps Lead, **I want** the agent to securely register itself with the dd0c SaaS using an API key, **so that** my drift data is securely associated with my organization.

* **Acceptance Criteria:**
    * Agent registers via `POST /v1/agents/register` using a static API key.
    * Generates and exchanges mTLS certificates for subsequent requests.
    * Receives configuration details (e.g., poll interval) from the SaaS.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** The agent needs to store the mTLS cert locally or in memory.
Implement robust error handling for unauthorized/revoked API keys.

#### Story 2.2: Encrypted Payload Transmission

**As a** Security Lead, **I want** the agent to transmit drift reports over a secure, encrypted channel, **so that** our infrastructure data cannot be intercepted in transit.

* **Acceptance Criteria:**
    * Agent POSTs scrubbed drift reports to `/v1/drift-reports`.
    * Communication enforces TLS 1.3 and uses the established mTLS client certificate.
    * Payload is compressed (gzip) if over a certain threshold.
* **Story Points:** 3
* **Dependencies:** Story 1.4, Story 2.1
* **Technical Notes:** Ensure the Go HTTP client enforces TLS 1.3. Define the strict JSON schema for the `DriftReport` payload.

#### Story 2.3: Agent Heartbeat

**As a** DevOps Lead, **I want** the agent to send regular heartbeats to the SaaS, **so that** I know if the agent crashes or loses connectivity.

* **Acceptance Criteria:**
    * Agent sends a lightweight heartbeat payload every N minutes.
    * Payload includes uptime, memory usage, and events processed.
    * SaaS API logs the heartbeat to track agent health.
* **Story Points:** 2
* **Dependencies:** Story 2.1
* **Technical Notes:** Run the heartbeat in a separate goroutine with a ticker. Handle transient network errors silently.

## Epic 3: Drift Analysis Engine

**Description:** The SaaS-side Node.js/TypeScript processor that ingests drift reports, classifies severity, calculates stack drift scores, and persists events to the database and event store.

### User Stories

#### Story 3.1: Ingestion & Validation Pipeline

**As a** System Operator, **I want** the SaaS to receive and validate drift reports from agents via SQS, **so that** high volumes of reports are processed reliably without dropping data.

* **Acceptance Criteria:**
    * API Gateway routes valid `POST /v1/drift-reports` requests to an SQS FIFO queue.
    * The Event Processor ECS task consumes from the queue.
    * Validates the report payload against a strict JSON schema.
* **Story Points:** 5
* **Dependencies:** Story 2.2
* **Technical Notes:** Use `zod` for payload validation. Ensure message group IDs use `stack_id` to maintain ordering per stack.

#### Story 3.2: Drift Classification

**As an** Infrastructure Engineer, **I want** the SaaS to classify detected drift by severity and category, **so that** I can prioritize critical security issues over cosmetic tag changes.

* **Acceptance Criteria:**
    * Applies YAML-defined classification rules to incoming drift diffs.
    * Tags events as Critical, High, Medium, or Low severity.
    * Tags events with categories (Security, Configuration, Tags, etc.).
* **Story Points:** 3
* **Dependencies:** Story 3.1
* **Technical Notes:** Implement a fast rule evaluation engine. Default unmatched drift to "Medium/Configuration".

#### Story 3.3: Persistence & Event Sourcing

**As a** Compliance Lead, **I want** every drift detection event to be stored in an immutable log, **so that** I have a reliable audit trail for SOC 2 compliance.

* **Acceptance Criteria:**
    * Appends the raw drift event to DynamoDB (immutable event store).
    * Upserts the current state of the resource in the PostgreSQL `resources` table.
    * Inserts a new record in the PostgreSQL `drift_events` table for open drift.
* **Story Points:** 8
* **Dependencies:** Story 3.2
* **Technical Notes:** Handle database transactions carefully to keep PostgreSQL and DynamoDB in sync. Ensure Row-Level Security (RLS) is applied on all PostgreSQL inserts.

#### Story 3.4: Drift Score Calculation

**As a** DevOps Lead, **I want** the engine to calculate a drift score for each stack, **so that** I have a high-level metric of infrastructure health.

* **Acceptance Criteria:**
    * Updates the `drift_score` field on the `stacks` table after processing a report.
    * Score is out of 100 (100 = completely clean).
    * Penalizes the score by severity weight (Critical heavily impacts the score, Low barely does).
* **Story Points:** 3
* **Dependencies:** Story 3.3
* **Technical Notes:** Define a simple but logical weighting algorithm. Run the calculation synchronously during event processing for V1.

## Epic 4: Notification Service

**Description:** A Lambda-based service that formats drift events into actionable Slack messages (Block Kit) and handles delivery routing per stack configuration.

### User Stories

#### Story 4.1: Slack Block Kit Formatting

**As an** Infrastructure Engineer, **I want** drift alerts to arrive as rich Slack messages, **so that** I can easily read the diff and context without leaving Slack.

* **Acceptance Criteria:**
    * Lambda function maps drift events to Slack Block Kit JSON.
    * Message includes Stack Name, Resource Address, Timestamp, Severity, and CloudTrail Attribution.
    * Displays a code block showing the `old` vs `new` attribute diff.
* **Story Points:** 5
* **Dependencies:** Story 3.3
* **Technical Notes:** Build a flexible Block Kit template builder. Ensure diffs that are too long are gracefully truncated to fit Slack's block length limits.

#### Story 4.2: Slack Routing & Fanout

**As a** DevOps Lead, **I want** drift alerts to be routed to specific Slack channels based on the stack, **so that** the right team sees the alert without noise.

* **Acceptance Criteria:**
    * Checks the `stacks` table for custom Slack channel overrides.
    * Falls back to the organization's default Slack channel.
    * Sends the formatted message via the Slack API.
* **Story Points:** 3
* **Dependencies:** Story 4.1
* **Technical Notes:** The Event Processor triggers this Lambda asynchronously via SQS (`notification-fanout` queue).

#### Story 4.3: Action Buttons (Revert/Accept)

**As an** Infrastructure Engineer, **I want** action buttons directly on the Slack alert, **so that** I can quickly trigger remediation workflows.

* **Acceptance Criteria:**
    * Slack message includes interactive buttons: `[Revert]`, `[Accept]`, `[Snooze]`, `[Assign]`.
    * Buttons contain the `drift_event_id` in their payload value.
* **Story Points:** 2
* **Dependencies:** Story 4.1
* **Technical Notes:** This story covers rendering the buttons only; handling the interactive callbacks is covered under the Slack Bot epic and the Remediation Engine (V1 MVP focus).

#### Story 4.4: Notification Batching (Low Severity)

**As an** Infrastructure Engineer, **I want** low and medium severity drift alerts to be batched into a digest, **so that** my Slack channel isn't spammed with noisy tag changes.

* **Acceptance Criteria:**
    * Critical/High alerts are sent immediately.
    * Medium/Low alerts are held in a DynamoDB table or SQS delay queue and dispatched as a daily/hourly digest.
* **Story Points:** 8
* **Dependencies:** Story 4.2
* **Technical Notes:** Use EventBridge Scheduler or a cron Lambda to process and flush the digest queue periodically.

## Epic 5: Dashboard API

**Description:** The REST API powering the React SPA web dashboard. Handles user authentication (Cognito), organization/stack management, and querying the PostgreSQL data for drift history.

### User Stories

#### Story 5.1: API Authentication & RLS Setup

**As a** System Operator, **I want** the API to enforce authentication and isolate tenant data, **so that** a user from one organization cannot see another organization's data.

* **Acceptance Criteria:**
    * Integrates AWS Cognito JWT validation middleware.
    * API sets `app.current_org_id` on the PostgreSQL connection session for Row-Level Security (RLS).
    * Returns `401/403` for unauthorized requests.
* **Story Points:** 5
* **Dependencies:** Database Schema (RLS)
* **Technical Notes:** Build an Express/Node.js middleware layer. Ensure strict parameterization for all SQL queries beyond RLS to avoid injection.

#### Story 5.2: Stack Management Endpoints

**As a** DevOps Lead, **I want** a set of REST endpoints to view and manage my stacks, **so that** I can configure ownership, check connection health, and monitor the drift score.
* **Acceptance Criteria:**
    * Implements `GET /v1/stacks` (list all stacks with their scores and resource counts).
    * Implements `GET /v1/stacks/:id` (stack details).
    * Implements `PATCH /v1/stacks/:id` (update name, owner, Slack channel).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Support basic pagination (`limit`, `offset`) on list endpoints. Include agent heartbeat status in stack details if applicable.

#### Story 5.3: Drift History & Event Queries

**As a** Security Lead, **I want** endpoints to search and filter drift events, **so that** I can review historical drift, find specific changes, and generate audit reports.

* **Acceptance Criteria:**
    * Implements `GET /v1/drift-events` with filters for `stack_id`, `status` (open/resolved), `severity`, and timestamp ranges.
    * Joins the `drift_events` table with `resources` to return full address paths and diff payloads.
* **Story Points:** 5
* **Dependencies:** Story 5.1
* **Technical Notes:** Expose the JSONB `diff` field cleanly in the response payload. Use index-backed PostgreSQL queries for fast filtering.

#### Story 5.4: Policy Configuration Endpoints

**As a** DevOps Lead, **I want** an API to manage remediation policies, **so that** I can customize how the agent reacts to specific resource drift.

* **Acceptance Criteria:**
    * Implements CRUD operations for stack-level and org-level policies (`/v1/policies`).
    * Validates policy configuration payloads (e.g., action type, valid resource expressions).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** For V1, store policies as JSON fields in a simple `policies` table or map them directly to stacks. Keep validation simple (e.g., regex checks on resource types).

## Epic 6: Dashboard UI

**Description:** The React Single Page Application (SPA) providing a web dashboard for stack overview, drift timeline, and resource-level diff viewer.
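Story 3.4 asks for a "simple but logical" weighting algorithm without pinning one down. One minimal sketch in Go, where the per-severity penalty values are illustrative assumptions rather than the final design:

```go
package main

import "fmt"

// Illustrative penalty weights per severity. Story 3.4 only requires
// that Critical dominates the score and Low barely registers.
var penalties = map[string]int{
	"critical": 25,
	"high":     10,
	"medium":   3,
	"low":      1,
}

// DriftScore returns a 0-100 health score for a stack given a count of
// open drift events per severity (100 = completely clean).
func DriftScore(openEvents map[string]int) int {
	score := 100
	for severity, count := range openEvents {
		score -= penalties[severity] * count
	}
	if score < 0 {
		score = 0
	}
	return score
}

func main() {
	// One critical event and two low events: 100 - 25 - 2 = 73.
	fmt.Println(DriftScore(map[string]int{"critical": 1, "low": 2})) // 73
}
```

Whatever weights are chosen, clamping at zero keeps the score presentable on the Story 6.1 dashboard even for badly drifted stacks.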
### User Stories

#### Story 6.1: Stack Overview Dashboard

**As a** DevOps Lead, **I want** a main dashboard showing all my stacks and their drift scores, **so that** I can assess infrastructure health at a glance.

* **Acceptance Criteria:**
    * Displays a list/table of all monitored stacks.
    * Shows a visual "Drift Score" indicator (0-100) per stack.
    * Sortable by score, name, and last checked timestamp.
    * Provides visual indicators for agent connection status.
* **Story Points:** 5
* **Dependencies:** Story 5.2
* **Technical Notes:** Build with React and Vite. Use a standard UI library (e.g., Tailwind UI or MUI). Implement efficient data fetching (e.g., React Query).

#### Story 6.2: Stack Detail & Drift Timeline

**As an** Infrastructure Engineer, **I want** to click into a stack and see a timeline of drift events, **so that** I can track when things changed and who changed them.

* **Acceptance Criteria:**
    * Shows a chronological list of drift events for the selected stack.
    * Displays open vs. resolved status.
    * Filters for severity and category.
    * Includes CloudTrail attribution data (who, IP, action).
* **Story Points:** 5
* **Dependencies:** Story 5.3, Story 6.1
* **Technical Notes:** Support pagination/infinite scrolling for the timeline. Use clear icons for event types (Security vs. Tags).

#### Story 6.3: Resource-Level Diff Viewer

**As an** Infrastructure Engineer, **I want** to see the exact attribute changes for a drifted resource, **so that** I know exactly how reality differs from my state file.

* **Acceptance Criteria:**
    * Clicking an event opens a detailed view/modal.
    * Renders a code-diff view (red for old state, green for new reality).
    * Clearly marks redacted sensitive values.
* **Story Points:** 5
* **Dependencies:** Story 6.2
* **Technical Notes:** Use a specialized diff viewing component (e.g., `react-diff-viewer`). Ensure it handles large JSON blocks gracefully.
#### Story 6.4: Auth & User Settings

**As a** User, **I want** to manage my account and view my API keys, **so that** I can deploy the agent and access my organization's dashboard.

* **Acceptance Criteria:**
    * Implements login/signup via Cognito (Email/Password & GitHub OAuth).
    * Provides a settings page displaying the organization's static API key.
    * Displays the current subscription plan (Free tier limits for MVP).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Securely manage JWT storage (HttpOnly cookies or secure local storage). Include a clear "copy to clipboard" action for the API key.

## Epic 7: Slack Bot

**Description:** The interactive Slack application that handles user commands (`/drift score`) and processes the interactive action buttons (`[Revert]`, `[Accept]`) from drift alerts.

### User Stories

#### Story 7.1: Interactive Remediation Callbacks (Revert)

**As an** Infrastructure Engineer, **I want** clicking `[Revert]` on a Slack alert to trigger a targeted `terraform apply`, **so that** I can fix drift instantly without leaving Slack.

* **Acceptance Criteria:**
    * SaaS API Gateway (`/v1/slack/interactions`) receives the button click payload.
    * Validates the Slack request signature.
    * Generates a scoped `terraform plan -target` command and queues it for the agent.
    * Updates the Slack message to "Reverting...".
* **Story Points:** 8
* **Dependencies:** Story 4.3
* **Technical Notes:** The actual execution happens via the Remediation Engine (ECS Fargate) dispatching commands to the agent. Requires careful state tracking (Pending -> Executing -> Completed).

#### Story 7.2: Interactive Acceptance Callbacks (Accept)

**As an** Infrastructure Engineer, **I want** clicking `[Accept]` to auto-generate a PR that updates my Terraform code to match reality, **so that** the drift becomes the new source of truth.

* **Acceptance Criteria:**
    * SaaS generates a code patch representing the new state.
    * Uses the GitHub API to create a branch and open a PR against the target repo.
    * Updates the Slack message with a link to the PR.
* **Story Points:** 8
* **Dependencies:** Story 7.1
* **Technical Notes:** Will require GitHub App integration (or an OAuth token) to create branches/PRs. The patch generation logic needs robust testing.

#### Story 7.3: Slack Slash Commands

**As a** DevOps Lead, **I want** to use `/drift score` and `/drift status <stack>` in Slack, **so that** I can check my infrastructure health on demand.

* **Acceptance Criteria:**
    * `/drift score` returns the aggregate score for the organization.
    * `/drift status prod-networking` returns the score, open events, and agent health for a specific stack.
    * Formats output as a clean Slack Block Kit message visible only to the user.
* **Story Points:** 5
* **Dependencies:** Story 4.1, Story 5.2
* **Technical Notes:** Deploy an API Gateway endpoint specifically for Slack slash commands. Validate the token and use the internal Dashboard API logic to fetch scores.

#### Story 7.4: Snooze & Assign Callbacks

**As an** Infrastructure Engineer, **I want** to click `[Snooze 24h]` or `[Assign]`, **so that** I can manage alert noise or delegate investigation to a teammate.

* **Acceptance Criteria:**
    * `[Snooze]` updates the event status to `snoozed` and schedules a wake-up time.
    * `[Assign]` opens a Slack modal to select a team member, updating the event owner.
    * The original Slack message updates to reflect the new state/owner.
* **Story Points:** 5
* **Dependencies:** Story 7.1
* **Technical Notes:** Snooze requires a scheduled EventBridge or cron job to un-snooze. Assign requires interacting with Slack's user selection menus.

## Epic 8: Infrastructure & DevOps

**Description:** The underlying cloud resources for the dd0c SaaS and the CI/CD pipelines to build, test, and release the agent and services.
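The "Validates the Slack request signature" step in Story 7.1 follows Slack's published v0 signing scheme: HMAC-SHA256 over `v0:<timestamp>:<body>` with the app's signing secret, compared in constant time. A minimal sketch in Go; the secret and payload values are placeholders:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// SignSlackRequest computes the expected X-Slack-Signature value for a
// request body and its X-Slack-Request-Timestamp, per Slack's v0 scheme.
func SignSlackRequest(secret, timestamp, body string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	fmt.Fprintf(mac, "v0:%s:%s", timestamp, body)
	return "v0=" + hex.EncodeToString(mac.Sum(nil))
}

// VerifySlackRequest recomputes the signature and compares it to the
// received header in constant time. Production code should also reject
// stale timestamps to prevent replay attacks.
func VerifySlackRequest(secret, timestamp, body, signature string) bool {
	expected := SignSlackRequest(secret, timestamp, body)
	return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
	sig := SignSlackRequest("placeholder-secret", "1700000000", `{"type":"block_actions"}`)
	fmt.Println(VerifySlackRequest("placeholder-secret", "1700000000", `{"type":"block_actions"}`, sig)) // true
}
```

The same check applies to the slash-command endpoint in Story 7.3, which receives requests from Slack through the same signing mechanism.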
### User Stories

#### Story 8.1: SaaS Infrastructure (Terraform)

**As a** System Operator, **I want** the SaaS infrastructure to be defined as code, **so that** deployments are repeatable and I can dogfood my own drift detection tool.

* **Acceptance Criteria:**
    * Defines VPC, Subnets, ECS Fargate Clusters, RDS PostgreSQL (Multi-AZ), API Gateway, and SQS FIFO queues.
    * Sets up CloudWatch log groups and IAM roles.
    * Uses Terraform for all configuration.
* **Story Points:** 8
* **Dependencies:** Architecture Design Document
* **Technical Notes:** Build a modular Terraform setup. Use the official AWS provider. Include variables for environment separation (staging vs. prod).

#### Story 8.2: CI/CD Pipeline (GitHub Actions)

**As a** Developer, **I want** a fully automated CI/CD pipeline, **so that** code pushed to `main` is linted, tested, built, and deployed to ECS.

* **Acceptance Criteria:**
    * Runs `golangci-lint`, `go test`, ESLint, and Vitest on PRs.
    * Builds multi-stage Docker images for the Event Processor, Dashboard API, and Remediation Engine.
    * Pushes images to ECR and triggers an ECS rolling deploy.
* **Story Points:** 5
* **Dependencies:** Story 8.1
* **Technical Notes:** Use standard GitHub Actions (e.g., `aws-actions/configure-aws-credentials`). Add Trivy for basic container scanning.

#### Story 8.3: Agent Distribution (Releases & Homebrew)

**As an** Open Source User, **I want** to easily download and install the CLI agent, **so that** I can test drift detection locally without building from source.

* **Acceptance Criteria:**
    * Configures GoReleaser to cross-compile binaries for Linux/macOS/Windows (amd64/arm64).
    * Auto-publishes GitHub Releases when a new tag is pushed.
    * Creates a custom Homebrew tap (`brew install dd0c/tap/drift-cli`).
* **Story Points:** 5
* **Dependencies:** Story 1.1
* **Technical Notes:** Create a dedicated `.github/workflows/release.yml` for GoReleaser.
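Story 8.3's release pipeline might start from a GoReleaser config along these lines. This is a sketch, not a verified config: the `./cmd/drift` entry point and tap repository name are assumptions, and field names vary between GoReleaser versions, so check the current GoReleaser docs before using it:

```yaml
# .goreleaser.yml (sketch)
builds:
  - main: ./cmd/drift
    goos: [linux, darwin, windows]
    goarch: [amd64, arm64]
    env:
      - CGO_ENABLED=0
brews:
  - repository:
      owner: dd0c
      name: homebrew-tap
    description: "Drift detection agent for Terraform-managed AWS infrastructure"
```

Running this from the tag-triggered `release.yml` workflow covers all three acceptance criteria: cross-compiled binaries, auto-published GitHub Releases, and the Homebrew tap.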
#### Story 8.4: Agent Terraform Module Publication

**As a** DevOps Lead, **I want** a pre-built Terraform module to deploy the agent in my AWS account, **so that** I don't have to manually configure ECS tasks and EventBridge rules.

* **Acceptance Criteria:**
    * Creates the `dd0c/drift-agent/aws` Terraform module.
    * Provisions an ECS Task, EventBridge rules, SQS, and IAM roles for the customer.
    * Publishes the module to the public Terraform Registry.
* **Story Points:** 8
* **Dependencies:** Story 8.3
* **Technical Notes:** Adhere to Terraform Registry best practices. Ensure the `README.md` clearly explains the required `dd0c_api_key` and state bucket variables.

## Epic 9: Onboarding & PLG (Product-Led Growth)

**Description:** The self-serve funnel that guides users from CLI installation to their first drift alert in under 5 minutes, plus billing and tier management.

### User Stories

#### Story 9.1: Self-Serve Signup & CLI Login

**As an** Engineer, **I want** to easily sign up for the free tier via the CLI, **so that** I don't have to fill out sales forms to test the product.

* **Acceptance Criteria:**
    * Running `drift auth login` opens a browser to an OAuth flow (GitHub/Email).
    * The CLI spins up a local web server to catch the callback token.
    * Successfully provisions an organization and user account in the SaaS.
* **Story Points:** 5
* **Dependencies:** Story 5.4, Story 8.3
* **Technical Notes:** The callback server should listen on `localhost` with a random or standard port (e.g., `8080`).

#### Story 9.2: Auto-Discovery (`drift init`)

**As an** Infrastructure Engineer, **I want** the CLI to auto-discover my Terraform state files, **so that** I can configure my first stack without typing S3 ARNs.

* **Acceptance Criteria:**
    * `drift init` scans the current directory for `*.tf` files.
    * Uses default AWS credentials to query S3 buckets matching common state file patterns.
    * Prompts the user to register discovered stacks to their organization.
* **Story Points:** 8
* **Dependencies:** Story 9.1
* **Technical Notes:** Implement a robust fallback to manual input if discovery fails.

#### Story 9.3: Free Tier Enforcement (1 Stack)

**As a** Product Manager, **I want** to enforce a free tier limit of 1 stack, **so that** users get value but are incentivized to upgrade for larger infrastructure needs.

* **Acceptance Criteria:**
    * The API rejects attempts to register more than 1 stack on the Free plan.
    * The Dashboard clearly shows "1/1 Stacks Used".
    * The CLI prompts "Upgrade to Starter ($49/mo)" when trying to add a second stack.
* **Story Points:** 3
* **Dependencies:** Story 9.1
* **Technical Notes:** Enforce limits securely at the API level (e.g., `POST /v1/stacks` should return a `403 Stack Limit` error).

#### Story 9.4: Stripe Billing Integration

**As a** Solo Founder, **I want** customers to upgrade to paid tiers with a credit card, **so that** I can capture revenue without a sales process.

* **Acceptance Criteria:**
    * Integrates Stripe Checkout for the Starter ($49/mo) and Pro ($149/mo) tiers.
    * Dashboard provides a billing management portal (Stripe Customer Portal).
    * Webhooks listen for successful payments and update the organization's `plan` field in PostgreSQL.
* **Story Points:** 8
* **Dependencies:** Story 5.2
* **Technical Notes:** Must include Stripe webhook signature verification to prevent spoofed upgrades. Store `stripe_customer_id` on the `organizations` table.

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/drift adheres to the 5 Transparent Factory architectural tenets. For a drift detection product, these tenets are especially critical — the tool that detects infrastructure drift must itself be immune to uncontrolled drift.
### Story 10.1: Atomic Flagging — Feature Flags for Detection Behaviors

**As a** solo founder, **I want** every new drift detection rule, remediation action, and notification behavior wrapped in a feature flag (default: off), **so that** I can ship new detection capabilities without accidentally triggering false-positive alerts for customers.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the Go agent. V1 provider: env-var or JSON file-based (no external service).
- All flags evaluate locally — no network calls during drift scan execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if any flag at 100% rollout exceeds TTL.
- Automated circuit breaker: if a flagged detection rule generates >3x the baseline false-positive rate over 1 hour, the flag auto-disables.
- Flags required for: new IaC provider support (Terraform/Pulumi/CDK), remediation suggestions, Slack notification formats, scan scheduling changes.

**Estimate:** 5 points

**Dependencies:** Epic 1 (Agent Core)

**Technical Notes:**

- Use the Go OpenFeature SDK (`go.openfeature.dev/sdk`). JSON file provider for V1.
- Circuit breaker: track false-positive dismissals per rule in Redis. If the dismissal rate spikes, disable the flag.
- Flag audit: `make flag-audit` lists all flags with TTL status.

### Story 10.2: Elastic Schema — Additive-Only for Drift State Storage

**As a** solo founder, **I want** all DynamoDB and state file schema changes to be strictly additive, **so that** agent rollbacks never corrupt drift history or lose customer scan results.

**Acceptance Criteria:**

- CI lint rejects any migration or schema change that removes, renames, or changes the type of existing DynamoDB attributes.
- New attributes use a `_v2` suffix when breaking changes are needed. Old attributes remain readable.
- Go structs use `json:",omitempty"` and ignore unknown fields so V1 agents can read V2 state files without crashing.
- Dual-write enforced during migration windows: the agent writes to both old and new attribute paths in the same DynamoDB `TransactWriteItems` call.
- Every schema change includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.

**Estimate:** 3 points

**Dependencies:** Epic 2 (State Management)

**Technical Notes:**

- DynamoDB Single Table Design: version items with a `_v` attribute. Agent code uses a factory to select the correct model.
- For Terraform state parsing, use `encoding/json` without enabling `DisallowUnknownFields` on the decoder, so upstream state format changes are tolerated.
- S3 state snapshots: never overwrite — always write new versioned keys.

### Story 10.3: Cognitive Durability — Decision Logs for Detection Logic

**As a** future maintainer, **I want** every change to drift detection algorithms, severity scoring, or remediation logic accompanied by a `decision_log.json`, **so that** I understand why a particular drift pattern is flagged as critical vs. informational.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log entry for PRs touching `pkg/detection/`, `pkg/scoring/`, or `pkg/remediation/`.
- Cyclomatic complexity cap of 10 enforced via `golangci-lint` with the `gocyclo` linter. PRs exceeding this are blocked.
- Decision logs committed in `docs/decisions/`, one per significant logic change.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- PR template includes decision log fields as a checklist.
- For drift scoring changes: document why specific thresholds were chosen (e.g., "security group changes scored critical because X% of breaches start with SG drift").
- `golangci-lint` config: `.golangci.yml` with `gocyclo: max-complexity: 10`.
### Story 10.4: Semantic Observability — AI Reasoning Spans on Drift Classification

**As an** SRE debugging a missed drift alert, **I want** every drift classification decision to emit an OpenTelemetry span with structured reasoning metadata, **so that** I can trace why a specific resource change was scored as low-severity when it should have been critical.

**Acceptance Criteria:**

- Every drift scan emits a parent `drift_scan` span. Each resource comparison emits a child `drift_classification` span.
- Span attributes: `drift.resource_type`, `drift.severity_score`, `drift.classification_reason`, `drift.alternatives_considered` (e.g., "considered critical but downgraded because tag-only change").
- If AI-assisted classification is used (future): `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score` included.
- Spans export via OTLP to any compatible backend.
- No PII or customer infrastructure details in spans — resource ARNs are hashed.

**Estimate:** 3 points

**Dependencies:** Epic 1 (Agent Core)

**Technical Notes:**

- Use `go.opentelemetry.io/otel` with the OTLP exporter.
- For V1 without AI classification, `drift.classification_reason` is the rule name plus the threshold that triggered.
- ARN hashing: SHA-256 truncated to 12 chars for correlation without exposure.

### Story 10.5: Configurable Autonomy — Governance for Auto-Remediation

**As a** solo founder, **I want** a `policy.json` that controls whether the agent can auto-remediate drift or only report it, **so that** customers maintain full control over what the tool is allowed to change in their infrastructure.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (report-only, no remediation) or `audit` (auto-remediate with logging).
- Agent checks the policy before every remediation action. In `strict` mode, remediation suggestions are logged but never executed.
- `panic_mode`: when true, the agent stops all scans immediately, preserves the last-known-good state, and sends a single "paused" notification.
- Per-customer policy override: customers can set their own governance mode via config, which is always more restrictive than the system default (never less).
- All policy decisions are logged: "Remediation blocked by strict mode for resource X", "Auto-remediation applied in audit mode".

**Estimate:** 3 points

**Dependencies:** Epic 3 (Remediation Engine)

**Technical Notes:**

- `policy.json` lives in the repo root, is loaded at startup, and is watched via `fsnotify`.
- Customer-level overrides are stored in the DynamoDB `org_settings` item. Merge logic: `min(system_policy, customer_policy)` — the customer can only be MORE restrictive.
- Panic mode trigger: `POST /admin/panic` or the env var `DD0C_PANIC=true`. The agent drains the current scan and halts.

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |
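The `min(system_policy, customer_policy)` merge rule from Story 10.5 can be sketched by ranking governance modes by restrictiveness. The mode names come from the story; the numeric ranking and the fail-safe default are assumptions:

```go
package main

import "fmt"

// Governance modes from Story 10.5, ordered so that a LOWER value is
// MORE restrictive. strict = report-only; audit = auto-remediate + log.
var restrictiveness = map[string]int{
	"strict": 0,
	"audit":  1,
}

// EffectiveMode merges the system default with a customer override.
// The customer can only tighten policy, never loosen it, so the more
// restrictive (lower-ranked) mode always wins. Unrecognized modes fail
// safe to strict.
func EffectiveMode(system, customer string) string {
	s, okSys := restrictiveness[system]
	c, okCust := restrictiveness[customer]
	if !okSys || !okCust {
		return "strict" // fail safe on unrecognized input
	}
	if c < s {
		return customer
	}
	return system
}

func main() {
	fmt.Println(EffectiveMode("audit", "strict")) // strict: customer tightened
	fmt.Println(EffectiveMode("strict", "audit")) // strict: customer cannot loosen
}
```

Failing safe to `strict` on unknown input matches the spirit of the story: a corrupted or half-migrated `org_settings` item should never grant the agent more autonomy than intended.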