dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00
commit 5ee95d8b13
51 changed files with 36935 additions and 0 deletions

# dd0c/drift - V1 MVP Epics
## Epic 1: Drift Agent (Go CLI)
**Description:** The core open-source Go binary that runs in the customer's environment. It parses Terraform state, polls AWS APIs for actual resource configurations, calculates the diff, and scrubs sensitive data before transmission.
### User Stories
#### Story 1.1: Terraform State Parser
**As an** Infrastructure Engineer, **I want** the agent to parse my Terraform state file locally, **so that** it can identify declared resources without uploading my raw state to a third party.
* **Acceptance Criteria:**
* Successfully parses Terraform state v4 JSON format.
* Extracts a list of `managed` resources with their declared attributes.
* Handles both local `.tfstate` files and AWS S3 remote backend configurations.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** Use standard Go JSON unmarshaling. Create an internal graph representation. Focus exclusively on AWS provider resources.
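A minimal sketch of the parser described above, using standard `encoding/json` unmarshaling. Only the subset of the state v4 format this story needs is modeled (`version`, `resources[].mode/type/name`); real state files carry many more fields, which `encoding/json` ignores by default.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State models the minimal subset of the Terraform state v4 format
// needed to list managed resources; unknown fields are ignored.
type State struct {
	Version   int        `json:"version"`
	Resources []Resource `json:"resources"`
}

type Resource struct {
	Mode      string     `json:"mode"` // "managed" or "data"
	Type      string     `json:"type"`
	Name      string     `json:"name"`
	Instances []Instance `json:"instances"`
}

type Instance struct {
	Attributes map[string]any `json:"attributes"`
}

// ManagedResources parses a state v4 document and returns the
// addresses of all managed (not data-source) resources.
func ManagedResources(raw []byte) ([]string, error) {
	var st State
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	if st.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", st.Version)
	}
	var out []string
	for _, r := range st.Resources {
		if r.Mode == "managed" {
			out = append(out, r.Type+"."+r.Name)
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"version":4,"resources":[
	  {"mode":"managed","type":"aws_security_group","name":"web","instances":[{"attributes":{"id":"sg-123"}}]},
	  {"mode":"data","type":"aws_ami","name":"base","instances":[]}]}`)
	addrs, err := ManagedResources(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs) // [aws_security_group.web]
}
```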
#### Story 1.2: AWS Resource Polling
**As an** Infrastructure Engineer, **I want** the agent to query AWS for the current state of my resources, **so that** it can compare reality against my declared Terraform state.
* **Acceptance Criteria:**
* Agent uses customer's local AWS credentials/IAM role to authenticate.
* Queries AWS APIs for the top 20 MVP resource types (e.g., `ec2:DescribeSecurityGroups`, `iam:GetRole`).
* Maps Terraform resource IDs to AWS identifiers.
* **Story Points:** 8
* **Dependencies:** Story 1.1
* **Technical Notes:** Use the official AWS Go SDK v2. Map API responses to a standardized internal schema that matches the state parser output. Add simple retry logic for rate limits.
#### Story 1.3: Drift Diff Calculation
**As an** Infrastructure Engineer, **I want** the agent to calculate attribute-level differences between my state file and AWS reality, **so that** I know exactly what changed.
* **Acceptance Criteria:**
* Compares parsed state attributes with polled AWS attributes.
* Outputs a structured diff showing `old` (state) and `new` (reality) values.
* Ignores AWS-generated default attributes that aren't declared in state.
* **Story Points:** 5
* **Dependencies:** Story 1.1, Story 1.2
* **Technical Notes:** Implement a deep compare function. Requires hardcoded ignore lists for known noisy attributes (e.g., AWS-assigned IDs or timestamps).
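A sketch of the deep-compare described above. It walks the declared state attributes, skips the hardcoded ignore list, and skips AWS-only attributes never declared in state, emitting `old`/`new` pairs for everything that differs. Field and key names are illustrative.

```go
package main

import (
	"fmt"
	"reflect"
)

// AttrDiff records one drifted attribute: the declared (state)
// value and the observed (AWS) value.
type AttrDiff struct {
	Old, New any
}

// Diff compares declared state attributes against polled AWS
// attributes. Keys in `ignore` (known-noisy, AWS-generated) are
// skipped, as are attributes present only in AWS but never
// declared in state.
func Diff(state, actual map[string]any, ignore map[string]bool) map[string]AttrDiff {
	out := map[string]AttrDiff{}
	for key, declared := range state {
		if ignore[key] {
			continue
		}
		observed, ok := actual[key]
		if !ok || !reflect.DeepEqual(declared, observed) {
			out[key] = AttrDiff{Old: declared, New: observed}
		}
	}
	return out
}

func main() {
	state := map[string]any{"ingress_port": 443, "description": "web sg", "arn": "stale"}
	actual := map[string]any{"ingress_port": 22, "description": "web sg", "arn": "fresh", "owner_id": "123"}
	ignore := map[string]bool{"arn": true}
	fmt.Println(Diff(state, actual, ignore)) // map[ingress_port:{443 22}]
}
```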
#### Story 1.4: Secret Scrubbing Engine
**As a** Security Lead, **I want** the agent to scrub all sensitive data from the drift diffs, **so that** my database passwords and API keys are never transmitted to the dd0c SaaS.
* **Acceptance Criteria:**
* Strips any attribute marked `sensitive` in the state file.
* Redacts values matching known secret patterns (e.g., `password`, `secret`, `token`).
* Replaces redacted values with `[REDACTED]`.
* Completely strips the `Private` field from state instances.
* **Story Points:** 3
* **Dependencies:** Story 1.3
* **Technical Notes:** Use regex for pattern matching. Validate scrubber with rigorous unit tests before shipping. Ensure the diff structure remains intact even when values are redacted.
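A minimal sketch of the scrubber: redact any attribute flagged `sensitive` in state, plus anything whose key matches the secret patterns from the acceptance criteria, replacing values with `[REDACTED]` while keeping the diff structure intact. The exact regex is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// secretKey matches attribute names that commonly hold
// credentials (pattern is illustrative, not exhaustive).
var secretKey = regexp.MustCompile(`(?i)(password|secret|token|api_?key)`)

const redacted = "[REDACTED]"

// Scrub returns a copy of attrs with sensitive values replaced by
// [REDACTED]. A value is redacted when its key is in the state
// file's sensitive-attribute set or matches the secret pattern.
// Keys are preserved so the diff structure survives scrubbing.
func Scrub(attrs map[string]any, sensitive map[string]bool) map[string]any {
	out := make(map[string]any, len(attrs))
	for k, v := range attrs {
		if sensitive[k] || secretKey.MatchString(k) {
			out[k] = redacted
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	attrs := map[string]any{"master_password": "hunter2", "engine": "postgres"}
	fmt.Println(Scrub(attrs, map[string]bool{})) // map[engine:postgres master_password:[REDACTED]]
}
```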
## Epic 2: Agent Communication
**Description:** Secure data transmission between the local Go agent and the dd0c SaaS. Handles authentication, mTLS registration, heartbeat signals, and encrypted drift report uploads.
### User Stories
#### Story 2.1: Agent Registration & Authentication
**As a** DevOps Lead, **I want** the agent to securely register itself with the dd0c SaaS using an API key, **so that** my drift data is securely associated with my organization.
* **Acceptance Criteria:**
* Agent registers via `POST /v1/agents/register` using a static API key.
* Generates and exchanges mTLS certificates for subsequent requests.
* Receives configuration details (e.g., poll interval) from the SaaS.
* **Story Points:** 5
* **Dependencies:** None
* **Technical Notes:** The agent needs to store the mTLS cert locally or in memory. Implement robust error handling for unauthorized/revoked API keys.
#### Story 2.2: Encrypted Payload Transmission
**As a** Security Lead, **I want** the agent to transmit drift reports over a secure, encrypted channel, **so that** our infrastructure data cannot be intercepted in transit.
* **Acceptance Criteria:**
* Agent POSTs scrubbed drift reports to `/v1/drift-reports`.
* Communication enforces TLS 1.3 and uses the established mTLS client certificate.
* Payload is compressed (gzip) if over a certain threshold.
* **Story Points:** 3
* **Dependencies:** Story 1.4, Story 2.1
* **Technical Notes:** Ensure HTTP client in Go enforces TLS 1.3. Define the strict JSON schema for the `DriftReport` payload.
#### Story 2.3: Agent Heartbeat
**As a** DevOps Lead, **I want** the agent to send regular heartbeats to the SaaS, **so that** I know if the agent crashes or loses connectivity.
* **Acceptance Criteria:**
* Agent sends a lightweight heartbeat payload every N minutes.
* Payload includes uptime, memory usage, and events processed.
* SaaS API logs the heartbeat to track agent health.
* **Story Points:** 2
* **Dependencies:** Story 2.1
* **Technical Notes:** Run heartbeat in a separate Go goroutine with a ticker. Handle transient network errors silently.
## Epic 3: Drift Analysis Engine
**Description:** The SaaS-side Node.js/TypeScript processor that ingests drift reports, classifies severity, calculates stack drift scores, and persists events to the database and event store.
### User Stories
#### Story 3.1: Ingestion & Validation Pipeline
**As a** System Operator, **I want** the SaaS to receive and validate drift reports from agents via SQS, **so that** high volumes of reports are processed reliably without dropping data.
* **Acceptance Criteria:**
* API Gateway routes valid `POST /v1/drift-reports` to an SQS FIFO queue.
* Event Processor ECS task consumes from the queue.
* Validates the report payload against a strict JSON schema.
* **Story Points:** 5
* **Dependencies:** Story 2.2
* **Technical Notes:** Use `zod` for payload validation. Ensure message group IDs use `stack_id` to maintain ordering per stack.
#### Story 3.2: Drift Classification
**As an** Infrastructure Engineer, **I want** the SaaS to classify detected drift by severity and category, **so that** I can prioritize critical security issues over cosmetic tag changes.
* **Acceptance Criteria:**
* Applies YAML-defined classification rules to incoming drift diffs.
* Tags events as Critical, High, Medium, or Low severity.
* Tags events with categories (Security, Configuration, Tags, etc.).
* **Story Points:** 3
* **Dependencies:** Story 3.1
* **Technical Notes:** Implement a fast rule evaluation engine. Default unmatched drift to "Medium/Configuration".
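A sketch of the rule evaluation, assuming rules match drifted attribute names by prefix (the real matching semantics are a design choice, and rules would be loaded from YAML rather than declared in code). Unmatched drift falls back to Medium/Configuration as required.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is one classification entry; in the real engine these are
// loaded from a YAML rules file.
type Rule struct {
	AttrPrefix string // matches drifted attribute names by prefix (assumption)
	Severity   string
	Category   string
}

// Classify returns severity and category for a drifted attribute,
// defaulting to Medium/Configuration when no rule matches.
func Classify(rules []Rule, attr string) (severity, category string) {
	for _, r := range rules {
		if strings.HasPrefix(attr, r.AttrPrefix) {
			return r.Severity, r.Category
		}
	}
	return "Medium", "Configuration"
}

func main() {
	rules := []Rule{
		{AttrPrefix: "ingress", Severity: "Critical", Category: "Security"},
		{AttrPrefix: "tags", Severity: "Low", Category: "Tags"},
	}
	fmt.Println(Classify(rules, "ingress_port"))  // Critical Security
	fmt.Println(Classify(rules, "instance_type")) // Medium Configuration
}
```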
#### Story 3.3: Persistence & Event Sourcing
**As a** Compliance Lead, **I want** every drift detection event to be stored in an immutable log, **so that** I have a reliable audit trail for SOC 2 compliance.
* **Acceptance Criteria:**
* Appends the raw drift event to DynamoDB (immutable event store).
* Upserts the current state of the resource in the PostgreSQL `resources` table.
* Inserts a new record in the PostgreSQL `drift_events` table for open drift.
* **Story Points:** 8
* **Dependencies:** Story 3.2
* **Technical Notes:** Handle database transactions carefully to keep PostgreSQL and DynamoDB in sync. Ensure Row-Level Security (RLS) is applied on all PostgreSQL inserts.
#### Story 3.4: Drift Score Calculation
**As a** DevOps Lead, **I want** the engine to calculate a drift score for each stack, **so that** I have a high-level metric of infrastructure health.
* **Acceptance Criteria:**
* Updates the `drift_score` field on the `stacks` table after processing a report.
* Score is out of 100 (e.g., 100 = completely clean).
* Weighted penalization based on severity (Critical heavily impacts score, Low barely impacts).
* **Story Points:** 3
* **Dependencies:** Story 3.3
* **Technical Notes:** Define a simple but logical weighting algorithm. Run calculation synchronously during event processing for V1.
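One plausible shape for the weighting algorithm: start each stack at 100 and subtract a per-severity penalty for every open drift event, clamped at 0. The weights below are assumptions, not a product decision from this document.

```go
package main

import "fmt"

// severityWeight is one plausible penalty table: Critical events
// hit the score hard, Low events barely register.
var severityWeight = map[string]float64{
	"Critical": 25, "High": 10, "Medium": 3, "Low": 1,
}

// DriftScore computes a 0-100 stack score from the severities of
// its open drift events (100 = completely clean).
func DriftScore(openEventSeverities []string) int {
	score := 100.0
	for _, sev := range openEventSeverities {
		score -= severityWeight[sev]
	}
	if score < 0 {
		score = 0
	}
	return int(score)
}

func main() {
	fmt.Println(DriftScore(nil))                                // 100
	fmt.Println(DriftScore([]string{"Critical", "Low", "Low"})) // 73
}
```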
## Epic 4: Notification Service
**Description:** A Lambda-based service that formats drift events into actionable Slack messages (Block Kit) and handles delivery routing per stack configuration.
### User Stories
#### Story 4.1: Slack Block Kit Formatting
**As an** Infrastructure Engineer, **I want** drift alerts to arrive as rich Slack messages, **so that** I can easily read the diff and context without leaving Slack.
* **Acceptance Criteria:**
* Lambda function maps drift events to Slack Block Kit JSON.
* Message includes Stack Name, Resource Address, Timestamp, Severity, and CloudTrail Attribution.
* Displays a code block showing the `old` vs `new` attribute diff.
* **Story Points:** 5
* **Dependencies:** Story 3.3
* **Technical Notes:** Build a flexible Block Kit template builder. Ensure diffs that are too long are gracefully truncated to fit Slack's block length limits.
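Slack documents a 3000-character limit on a section block's `text` field, so long diffs need graceful truncation before rendering; a sketch (the truncation marker is an assumption):

```go
package main

import "fmt"

// maxBlockText is Slack's documented limit for a section block's
// text field.
const maxBlockText = 3000

// CodeBlock wraps a diff in a Slack markdown code fence,
// truncating gracefully when it would exceed the block limit.
func CodeBlock(diff string) string {
	const fence = "```"
	const suffix = "\n… (truncated)"
	// Budget leaves room for both fences, the suffix, and newlines.
	budget := maxBlockText - 2*len(fence) - len(suffix) - 2
	if len(diff) > budget {
		diff = diff[:budget] + suffix
	}
	return fence + "\n" + diff + "\n" + fence
}

func main() {
	long := make([]byte, 5000)
	for i := range long {
		long[i] = 'x'
	}
	fmt.Println(len(CodeBlock(string(long))) <= maxBlockText) // true
}
```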
#### Story 4.2: Slack Routing & Fanout
**As a** DevOps Lead, **I want** drift alerts to be routed to specific Slack channels based on the stack, **so that** the right team sees the alert without noise.
* **Acceptance Criteria:**
* Checks the `stacks` table for custom Slack channel overrides.
* Falls back to the organization's default Slack channel.
* Sends the formatted message via the Slack API.
* **Story Points:** 3
* **Dependencies:** Story 4.1
* **Technical Notes:** The Event Processor triggers this Lambda asynchronously via SQS (`notification-fanout` queue).
#### Story 4.3: Action Buttons (Revert/Accept)
**As an** Infrastructure Engineer, **I want** action buttons directly on the Slack alert, **so that** I can quickly trigger remediation workflows.
* **Acceptance Criteria:**
* Slack message includes interactive buttons: `[Revert]`, `[Accept]`, `[Snooze]`, `[Assign]`.
* Buttons contain the `drift_event_id` in their payload value.
* **Story Points:** 2
* **Dependencies:** Story 4.1
* **Technical Notes:** This story covers rendering the buttons only; handling the interactive callbacks is deferred to the Slack Bot epic and the Remediation Engine (both within V1 MVP scope).
#### Story 4.4: Notification Batching (Low Severity)
**As an** Infrastructure Engineer, **I want** low and medium severity drift alerts to be batched into a digest, **so that** my Slack channel isn't spammed with noisy tag changes.
* **Acceptance Criteria:**
* Critical/High alerts are sent immediately.
* Medium/Low alerts are held in a DynamoDB table or SQS delay queue and dispatched as a daily/hourly digest.
* **Story Points:** 8
* **Dependencies:** Story 4.2
* **Technical Notes:** Use EventBridge Scheduler or a cron Lambda to process and flush the digest queue periodically.
## Epic 5: Dashboard API
**Description:** The REST API powering the React SPA web dashboard. Handles user authentication (Cognito), organization/stack management, and querying the PostgreSQL data for drift history.
### User Stories
#### Story 5.1: API Authentication & RLS Setup
**As a** System Operator, **I want** the API to enforce authentication and isolate tenant data, **so that** a user from one organization cannot see another organization's data.
* **Acceptance Criteria:**
* Integrates AWS Cognito JWT validation middleware.
* API sets `app.current_org_id` on the PostgreSQL connection session for Row-Level Security (RLS).
* Returns `401/403` for unauthorized requests.
* **Story Points:** 5
* **Dependencies:** Database Schema (RLS)
* **Technical Notes:** Build an Express/Node.js middleware layer. Ensure strict parameterization for all SQL queries beyond RLS to avoid injection.
#### Story 5.2: Stack Management Endpoints
**As a** DevOps Lead, **I want** a set of REST endpoints to view and manage my stacks, **so that** I can configure ownership, check connection health, and monitor the drift score.
* **Acceptance Criteria:**
* Implements `GET /v1/stacks` (list all stacks with their scores and resource counts).
* Implements `GET /v1/stacks/:id` (stack details).
* Implements `PATCH /v1/stacks/:id` (update name, owner, Slack channel).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Support basic pagination (`limit`, `offset`) on list endpoints. Include agent heartbeat status in stack details if applicable.
#### Story 5.3: Drift History & Event Queries
**As a** Security Lead, **I want** endpoints to search and filter drift events, **so that** I can review historical drift, find specific changes, and generate audit reports.
* **Acceptance Criteria:**
* Implements `GET /v1/drift-events` with filters for `stack_id`, `status` (open/resolved), `severity`, and timestamp ranges.
* Joins the `drift_events` table with `resources` to return full address paths and diff payloads.
* **Story Points:** 5
* **Dependencies:** Story 5.1
* **Technical Notes:** Expose the JSONB `diff` field cleanly in the response payload. Use index-backed PostgreSQL queries for fast filtering.
#### Story 5.4: Policy Configuration Endpoints
**As a** DevOps Lead, **I want** an API to manage remediation policies, **so that** I can customize how the agent reacts to specific resource drift.
* **Acceptance Criteria:**
* Implements CRUD operations for stack-level and org-level policies (`/v1/policies`).
* Validates policy configuration payloads (e.g., action type, valid resource expressions).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** For V1, store policies as JSON fields in a simple `policies` table or direct mapping to stacks. Keep validation simple (e.g., regex checks on resource types).
## Epic 6: Dashboard UI
**Description:** The React Single Page Application (SPA) providing a web dashboard for stack overview, drift timeline, and resource-level diff viewer.
### User Stories
#### Story 6.1: Stack Overview Dashboard
**As a** DevOps Lead, **I want** a main dashboard showing all my stacks and their drift scores, **so that** I can assess infrastructure health at a glance.
* **Acceptance Criteria:**
* Displays a list/table of all monitored stacks.
* Shows a visual "Drift Score" indicator (0-100) per stack.
* Sortable by score, name, and last checked timestamp.
* Provides visual indicators for agent connection status.
* **Story Points:** 5
* **Dependencies:** Story 5.2
* **Technical Notes:** Build with React and Vite. Use a standard UI library (e.g., Tailwind UI or MUI). Implement efficient data fetching (e.g., React Query).
#### Story 6.2: Stack Detail & Drift Timeline
**As an** Infrastructure Engineer, **I want** to click into a stack and see a timeline of drift events, **so that** I can track when things changed and who changed them.
* **Acceptance Criteria:**
* Shows a chronological list of drift events for the selected stack.
* Displays open vs. resolved status.
* Filters for severity and category.
* Includes CloudTrail attribution data (who, IP, action).
* **Story Points:** 5
* **Dependencies:** Story 5.3, Story 6.1
* **Technical Notes:** Support pagination/infinite scrolling for the timeline. Use clear icons for event types (Security vs. Tags).
#### Story 6.3: Resource-Level Diff Viewer
**As an** Infrastructure Engineer, **I want** to see the exact attribute changes for a drifted resource, **so that** I know exactly how reality differs from my state file.
* **Acceptance Criteria:**
* Clicking an event opens a detailed view/modal.
* Renders a code-diff view (red for old state, green for new reality).
* Clearly marks redacted sensitive values.
* **Story Points:** 5
* **Dependencies:** Story 6.2
* **Technical Notes:** Use a specialized diff viewing component (e.g., `react-diff-viewer`). Ensure it handles large JSON blocks gracefully.
#### Story 6.4: Auth & User Settings
**As a** User, **I want** to manage my account and view my API keys, **so that** I can deploy the agent and access my organization's dashboard.
* **Acceptance Criteria:**
* Implements login/signup via Cognito (Email/Password & GitHub OAuth).
* Provides a settings page displaying the organization's static API key.
* Displays current subscription plan (Free tier limits for MVP).
* **Story Points:** 3
* **Dependencies:** Story 5.1
* **Technical Notes:** Securely manage JWT storage (HttpOnly cookies or secure local storage). Include a clear "copy to clipboard" for the API key.
## Epic 7: Slack Bot
**Description:** The interactive Slack application that handles user commands (`/drift score`) and processes the interactive action buttons (`[Revert]`, `[Accept]`) from drift alerts.
### User Stories
#### Story 7.1: Interactive Remediation Callbacks (Revert)
**As an** Infrastructure Engineer, **I want** clicking `[Revert]` on a Slack alert to trigger a targeted `terraform apply`, **so that** I can fix drift instantly without leaving Slack.
* **Acceptance Criteria:**
* SaaS API Gateway (`/v1/slack/interactions`) receives the button click payload.
* Validates the Slack request signature.
* Generates a scoped `terraform plan -target` command and queues it for the agent.
* Updates the Slack message to "Reverting...".
* **Story Points:** 8
* **Dependencies:** Story 4.3
* **Technical Notes:** The actual execution happens via the Remediation Engine (ECS Fargate) dispatching commands to the agent. Requires careful state tracking (Pending -> Executing -> Completed).
#### Story 7.2: Interactive Acceptance Callbacks (Accept)
**As an** Infrastructure Engineer, **I want** clicking `[Accept]` to auto-generate a PR that updates my Terraform code to match reality, **so that** the drift becomes the new source of truth.
* **Acceptance Criteria:**
* SaaS generates a code patch representing the new state.
* Uses the GitHub API to create a branch and open a PR against the target repo.
* Updates the Slack message with a link to the PR.
* **Story Points:** 8
* **Dependencies:** Story 7.1
* **Technical Notes:** Will require GitHub App integration (or OAuth token) to create branches/PRs. The patch generation logic needs robust testing.
#### Story 7.3: Slack Slash Commands
**As a** DevOps Lead, **I want** to use `/drift score` and `/drift status <stack>` in Slack, **so that** I can check my infrastructure health on demand.
* **Acceptance Criteria:**
* `/drift score` returns the aggregate score for the organization.
* `/drift status prod-networking` returns the score, open events, and agent health for a specific stack.
* Formats output as a clean Slack Block Kit message visible only to the user.
* **Story Points:** 5
* **Dependencies:** Story 4.1, Story 5.2
* **Technical Notes:** Deploy an API Gateway endpoint specifically for Slack slash commands. Validate the token and use the internal Dashboard API logic to fetch scores.
#### Story 7.4: Snooze & Assign Callbacks
**As an** Infrastructure Engineer, **I want** to click `[Snooze 24h]` or `[Assign]`, **so that** I can manage alert noise or delegate investigation to a teammate.
* **Acceptance Criteria:**
* `[Snooze]` updates the event status to `snoozed` and schedules a wake-up time.
* `[Assign]` opens a Slack modal to select a team member, updating the event owner.
* The original Slack message updates to reflect the new state/owner.
* **Story Points:** 5
* **Dependencies:** Story 7.1
* **Technical Notes:** Snooze requires a scheduled EventBridge or cron job to un-snooze. Assign requires interacting with Slack's user selection menus.
## Epic 8: Infrastructure & DevOps
**Description:** The underlying cloud resources for the dd0c SaaS and the CI/CD pipelines to build, test, and release the agent and services.
### User Stories
#### Story 8.1: SaaS Infrastructure (Terraform)
**As a** System Operator, **I want** the SaaS infrastructure to be defined as code, **so that** deployments are repeatable and I can dogfood my own drift detection tool.
* **Acceptance Criteria:**
* Defines VPC, Subnets, ECS Fargate Clusters, RDS PostgreSQL (Multi-AZ), API Gateway, and SQS FIFO queues.
* Sets up CloudWatch log groups and IAM roles.
* Uses Terraform for all configuration.
* **Story Points:** 8
* **Dependencies:** Architecture Design Document
* **Technical Notes:** Build a modular Terraform setup. Use the official AWS provider. Include variables for environment separation (staging vs. prod).
#### Story 8.2: CI/CD Pipeline (GitHub Actions)
**As a** Developer, **I want** a fully automated CI/CD pipeline, **so that** code pushed to `main` is linted, tested, built, and deployed to ECS.
* **Acceptance Criteria:**
* Runs `golangci-lint`, `go test`, ESLint, and Vitest on PRs.
* Builds multi-stage Docker images for the Event Processor, Dashboard API, and Remediation Engine.
* Pushes images to ECR and triggers an ECS rolling deploy.
* **Story Points:** 5
* **Dependencies:** Story 8.1
* **Technical Notes:** Use standard GitHub Actions (e.g., `aws-actions/configure-aws-credentials`). Add Trivy for basic container scanning.
#### Story 8.3: Agent Distribution (Releases & Homebrew)
**As an** Open Source User, **I want** to easily download and install the CLI agent, **so that** I can test drift detection locally without building from source.
* **Acceptance Criteria:**
* Configures GoReleaser to cross-compile binaries for Linux/macOS/Windows (amd64/arm64).
* Auto-publishes GitHub Releases when a new tag is pushed.
* Creates a custom Homebrew tap (`brew install dd0c/tap/drift-cli`).
* **Story Points:** 5
* **Dependencies:** Story 1.1
* **Technical Notes:** Create a dedicated `.github/workflows/release.yml` for GoReleaser.
#### Story 8.4: Agent Terraform Module Publication
**As a** DevOps Lead, **I want** a pre-built Terraform module to deploy the agent in my AWS account, **so that** I don't have to manually configure ECS tasks and EventBridge rules.
* **Acceptance Criteria:**
* Creates the `dd0c/drift-agent/aws` Terraform module.
* Provisions an ECS Task, EventBridge rules, SQS, and IAM roles for the customer.
* Publishes the module to the public Terraform Registry.
* **Story Points:** 8
* **Dependencies:** Story 8.3
* **Technical Notes:** Adhere to Terraform Registry best practices. Ensure the `README.md` clearly explains the required `dd0c_api_key` and state bucket variables.
## Epic 9: Onboarding & PLG (Product-Led Growth)
**Description:** The self-serve funnel that guides users from CLI installation to their first drift alert in under 5 minutes, plus billing and tier management.
### User Stories
#### Story 9.1: Self-Serve Signup & CLI Login
**As an** Engineer, **I want** to easily sign up for the free tier via the CLI, **so that** I don't have to fill out sales forms to test the product.
* **Acceptance Criteria:**
* Running `drift auth login` opens a browser to an OAuth flow (GitHub/Email).
* The CLI spins up a local web server to catch the callback token.
* Successfully provisions an organization and user account in the SaaS.
* **Story Points:** 5
* **Dependencies:** Story 5.4, Story 8.3
* **Technical Notes:** The callback server should listen on `localhost` with a random or standard port (e.g., `8080`).
#### Story 9.2: Auto-Discovery (`drift init`)
**As an** Infrastructure Engineer, **I want** the CLI to auto-discover my Terraform state files, **so that** I can configure my first stack without typing S3 ARNs.
* **Acceptance Criteria:**
* `drift init` scans the current directory for `*.tf` files.
* Uses default AWS credentials to query S3 buckets matching common state file patterns.
* Prompts the user to register discovered stacks to their organization.
* **Story Points:** 8
* **Dependencies:** Story 9.1
* **Technical Notes:** Implement robust fallback to manual input if discovery fails.
#### Story 9.3: Free Tier Enforcement (1 Stack)
**As a** Product Manager, **I want** to enforce a free tier limit of 1 stack, **so that** users get value but are incentivized to upgrade for larger infrastructure needs.
* **Acceptance Criteria:**
* The API rejects attempts to register more than 1 stack on the Free plan.
* The Dashboard clearly shows "1/1 Stacks Used".
* The CLI prompts "Upgrade to Starter ($49/mo)" when trying to add a second stack.
* **Story Points:** 3
* **Dependencies:** Story 9.1
* **Technical Notes:** Enforce limits securely at the API level (e.g., `POST /v1/stacks` should return a `403 Stack Limit` error).
#### Story 9.4: Stripe Billing Integration
**As a** Solo Founder, **I want** customers to upgrade to paid tiers with a credit card, **so that** I can capture revenue without a sales process.
* **Acceptance Criteria:**
* Integrates Stripe Checkout for the Starter ($49/mo) and Pro ($149/mo) tiers.
* Dashboard provides a billing management portal (Stripe Customer Portal).
* Webhooks listen for successful payments and update the organization's `plan` field in PostgreSQL.
* **Story Points:** 8
* **Dependencies:** Story 5.2
* **Technical Notes:** Must include Stripe webhook signature verification to prevent spoofed upgrades. Store `stripe_customer_id` on the `organizations` table.
---
## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/drift adheres to the 5 Transparent Factory architectural tenets. For a drift detection product, these tenets are especially critical — the tool that detects infrastructure drift must itself be immune to uncontrolled drift.
### Story 10.1: Atomic Flagging — Feature Flags for Detection Behaviors
**As a** solo founder, **I want** every new drift detection rule, remediation action, and notification behavior wrapped in a feature flag (default: off), **so that** I can ship new detection capabilities without accidentally triggering false-positive alerts for customers.
**Acceptance Criteria:**
- OpenFeature SDK integrated into the Go agent. V1 provider: env-var or JSON file-based (no external service).
- All flags evaluate locally — no network calls during drift scan execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if any flag at 100% rollout exceeds TTL.
- Automated circuit breaker: if a flagged detection rule generates >3x the baseline false-positive rate over 1 hour, the flag auto-disables.
- Flags required for: new IaC provider support (Terraform/Pulumi/CDK), remediation suggestions, Slack notification formats, scan scheduling changes.
**Estimate:** 5 points
**Dependencies:** Epic 1 (Agent Core)
**Technical Notes:**
- Use the Go OpenFeature SDK (`github.com/open-feature/go-sdk`). JSON file provider for V1.
- Circuit breaker: track false-positive dismissals per rule in Redis. If dismissal rate spikes, disable the flag.
- Flag audit: `make flag-audit` lists all flags with TTL status.
### Story 10.2: Elastic Schema — Additive-Only for Drift State Storage
**As a** solo founder, **I want** all DynamoDB and state file schema changes to be strictly additive, **so that** agent rollbacks never corrupt drift history or lose customer scan results.
**Acceptance Criteria:**
- CI lint rejects any migration or schema change that removes, renames, or changes the type of existing DynamoDB attributes.
- New attributes use `_v2` suffix when breaking changes are needed. Old attributes remain readable.
- Go structs use `json:",omitempty"` and ignore unknown fields so V1 agents can read V2 state files without crashing.
- Dual-write enforced during migration windows: agent writes to both old and new attribute paths in the same DynamoDB `TransactWriteItems` call.
- Every schema change includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
**Estimate:** 3 points
**Dependencies:** Epic 2 (State Management)
**Technical Notes:**
- DynamoDB Single Table Design: version items with `_v` attribute. Agent code uses a factory to select the correct model.
- For Terraform state parsing, use `encoding/json` without calling `Decoder.DisallowUnknownFields`, so unknown fields are silently ignored and upstream state format changes are tolerated.
- S3 state snapshots: never overwrite — always write new versioned keys.
### Story 10.3: Cognitive Durability — Decision Logs for Detection Logic
**As a** future maintainer, **I want** every change to drift detection algorithms, severity scoring, or remediation logic accompanied by a `decision_log.json`, **so that** I understand why a particular drift pattern is flagged as critical vs. informational.
**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log entry for PRs touching `pkg/detection/`, `pkg/scoring/`, or `pkg/remediation/`.
- Cyclomatic complexity cap of 10 enforced via `golangci-lint` with `gocyclo` linter. PRs exceeding this are blocked.
- Decision logs committed in `docs/decisions/`, one per significant logic change.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- PR template includes decision log fields as a checklist.
- For drift scoring changes: document why specific thresholds were chosen (e.g., "security group changes scored critical because X% of breaches start with SG drift").
- `golangci-lint` config: `.golangci.yml` enabling `gocyclo` with `min-complexity: 10` (the setting names the threshold above which functions are reported).
### Story 10.4: Semantic Observability — AI Reasoning Spans on Drift Classification
**As an** SRE debugging a missed drift alert, **I want** every drift classification decision to emit an OpenTelemetry span with structured reasoning metadata, **so that** I can trace why a specific resource change was scored as low-severity when it should have been critical.
**Acceptance Criteria:**
- Every drift scan emits a parent `drift_scan` span. Each resource comparison emits a child `drift_classification` span.
- Span attributes: `drift.resource_type`, `drift.severity_score`, `drift.classification_reason`, `drift.alternatives_considered` (e.g., "considered critical but downgraded because tag-only change").
- If AI-assisted classification is used (future): `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score` included.
- Spans export via OTLP to any compatible backend.
- No PII or customer infrastructure details in spans — resource ARNs are hashed.
**Estimate:** 3 points
**Dependencies:** Epic 1 (Agent Core)
**Technical Notes:**
- Use `go.opentelemetry.io/otel` with OTLP exporter.
- For V1 without AI classification, the `drift.classification_reason` is the rule name + threshold that triggered.
- ARN hashing: SHA-256 truncated to 12 chars for correlation without exposure.
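The ARN hashing note above is a one-liner in Go: the same ARN always yields the same 12-character token (so spans still correlate), but the raw identifier never leaves the agent.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// HashARN pseudonymizes a resource ARN for span attributes:
// SHA-256, hex-encoded, truncated to 12 characters.
func HashARN(arn string) string {
	sum := sha256.Sum256([]byte(arn))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	a := HashARN("arn:aws:iam::123456789012:role/ci")
	b := HashARN("arn:aws:iam::123456789012:role/ci")
	fmt.Println(len(a) == 12, a == b) // true true
}
```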
### Story 10.5: Configurable Autonomy — Governance for Auto-Remediation
**As a** solo founder, **I want** a `policy.json` that controls whether the agent can auto-remediate drift or only report it, **so that** customers maintain full control over what the tool is allowed to change in their infrastructure.
**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (report-only, no remediation) or `audit` (auto-remediate with logging).
- Agent checks policy before every remediation action. In `strict` mode, remediation suggestions are logged but never executed.
- `panic_mode`: when true, agent stops all scans immediately, preserves last-known-good state, and sends a single "paused" notification.
- Per-customer policy override: customers can set their own governance mode via config, which is always more restrictive than the system default (never less).
- All policy decisions logged: "Remediation blocked by strict mode for resource X", "Auto-remediation applied in audit mode".
**Estimate:** 3 points
**Dependencies:** Epic 3 (Remediation Engine)
**Technical Notes:**
- `policy.json` in repo root, loaded at startup, watched via `fsnotify`.
- Customer-level overrides stored in DynamoDB `org_settings` item. Merge logic: `min(system_policy, customer_policy)` — customer can only be MORE restrictive.
- Panic mode trigger: `POST /admin/panic` or env var `DD0C_PANIC=true`. Agent drains current scan and halts.
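The `min(system_policy, customer_policy)` merge above can be sketched as an ordering over modes, where a lower rank is more restrictive and the more restrictive side always wins. The rank table is an assumption over the two modes this story defines.

```go
package main

import "fmt"

// restrictiveness ranks governance modes; lower rank = safer.
// Mode names come from policy.json in this story.
var restrictiveness = map[string]int{"strict": 0, "audit": 1}

// EffectiveMode merges the system default with a customer
// override. The customer can only tighten the policy, never
// loosen it: the more restrictive mode always wins.
func EffectiveMode(system, customer string) string {
	if customer == "" {
		return system // no override set
	}
	if restrictiveness[customer] < restrictiveness[system] {
		return customer
	}
	return system
}

func main() {
	fmt.Println(EffectiveMode("audit", "strict")) // strict
	fmt.Println(EffectiveMode("strict", "audit")) // strict
	fmt.Println(EffectiveMode("audit", ""))       // audit
}
```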
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |