dd0c/drift - V1 MVP Epics
Epic 1: Drift Agent (Go CLI)
Description: The core open-source Go binary that runs in the customer's environment. It parses Terraform state, polls AWS APIs for actual resource configurations, calculates the diff, and scrubs sensitive data before transmission.
User Stories
Story 1.1: Terraform State Parser
As an Infrastructure Engineer, I want the agent to parse my Terraform state file locally, so that it can identify declared resources without uploading my raw state to a third party.
- Acceptance Criteria:
- Successfully parses Terraform state v4 JSON format.
- Extracts a list of `managed` resources with their declared attributes.
- Handles both local `.tfstate` files and AWS S3 remote backend configurations.
- Story Points: 5
- Dependencies: None
- Technical Notes: Use standard Go JSON unmarshaling. Create an internal graph representation. Focus exclusively on AWS provider resources.
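As an illustrative sketch of the parsing approach described above (the struct and function names are assumptions, not the final agent code), a minimal v4 parser that extracts only managed resources might look like:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the Terraform state v4 schema needed to list resources.
type stateFile struct {
	Version   int        `json:"version"`
	Resources []resource `json:"resources"`
}

type resource struct {
	Mode      string     `json:"mode"` // "managed" or "data"
	Type      string     `json:"type"`
	Name      string     `json:"name"`
	Instances []instance `json:"instances"`
}

type instance struct {
	Attributes map[string]any `json:"attributes"`
}

// parseManaged unmarshals a v4 state file and returns only managed
// resources, skipping data sources.
func parseManaged(raw []byte) ([]resource, error) {
	var s stateFile
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	if s.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", s.Version)
	}
	var managed []resource
	for _, r := range s.Resources {
		if r.Mode == "managed" {
			managed = append(managed, r)
		}
	}
	return managed, nil
}

func main() {
	state := []byte(`{"version":4,"resources":[
	  {"mode":"managed","type":"aws_iam_role","name":"app","instances":[{"attributes":{"name":"app-role"}}]},
	  {"mode":"data","type":"aws_ami","name":"ubuntu","instances":[]}]}`)
	res, err := parseManaged(state)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d managed resource(s): %s.%s\n", len(res), res[0].Type, res[0].Name)
}
```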
Story 1.2: AWS Resource Polling
As an Infrastructure Engineer, I want the agent to query AWS for the current state of my resources, so that it can compare reality against my declared Terraform state.
- Acceptance Criteria:
- Agent uses customer's local AWS credentials/IAM role to authenticate.
- Queries AWS APIs for the top 20 MVP resource types (e.g., `ec2:DescribeSecurityGroups`, `iam:GetRole`).
- Maps Terraform resource IDs to AWS identifiers.
- Story Points: 8
- Dependencies: Story 1.1
- Technical Notes: Use the official AWS Go SDK v2. Map API responses to a standardized internal schema that matches the state parser output. Add simple retry logic for rate limits.
Story 1.3: Drift Diff Calculation
As an Infrastructure Engineer, I want the agent to calculate attribute-level differences between my state file and AWS reality, so that I know exactly what changed.
- Acceptance Criteria:
- Compares parsed state attributes with polled AWS attributes.
- Outputs a structured diff showing `old` (state) and `new` (reality) values.
- Ignores AWS-generated default attributes that aren't declared in state.
- Story Points: 5
- Dependencies: Story 1.1, Story 1.2
- Technical Notes: Implement a deep compare function. Requires hardcoded ignore lists for known noisy attributes (e.g., AWS-assigned IDs or timestamps).
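A minimal sketch of the deep compare with an ignore list (types and names here are illustrative assumptions, not the agent's actual schema). Because only declared attributes are iterated, AWS-only defaults are ignored automatically:

```go
package main

import (
	"fmt"
	"reflect"
)

// attrDiff records one drifted attribute with its declared and actual values.
type attrDiff struct {
	Attr string
	Old  any // value from the Terraform state
	New  any // value observed in AWS
}

// diffAttrs compares declared vs. actual attributes, skipping keys on the
// hardcoded ignore list (AWS-assigned IDs, timestamps). Attributes that
// exist only in AWS but are not declared in state are never visited.
func diffAttrs(declared, actual map[string]any, ignore map[string]bool) []attrDiff {
	var diffs []attrDiff
	for k, want := range declared {
		if ignore[k] {
			continue
		}
		got, ok := actual[k]
		if !ok || !reflect.DeepEqual(want, got) {
			diffs = append(diffs, attrDiff{Attr: k, Old: want, New: got})
		}
	}
	return diffs
}

func main() {
	declared := map[string]any{"cidr": "10.0.0.0/16", "tags": map[string]any{"env": "prod"}}
	actual := map[string]any{"cidr": "10.0.0.0/16", "tags": map[string]any{"env": "dev"}, "owner_id": "123"}
	for _, d := range diffAttrs(declared, actual, map[string]bool{"owner_id": true}) {
		fmt.Printf("%s: %v -> %v\n", d.Attr, d.Old, d.New)
	}
}
```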
Story 1.4: Secret Scrubbing Engine
As a Security Lead, I want the agent to scrub all sensitive data from the drift diffs, so that my database passwords and API keys are never transmitted to the dd0c SaaS.
- Acceptance Criteria:
- Strips any attribute marked `sensitive` in the state file.
- Redacts values matching known secret patterns (e.g., `password`, `secret`, `token`).
- Replaces redacted values with `[REDACTED]`.
- Completely strips the `Private` field from state instances.
- Story Points: 3
- Dependencies: Story 1.3
- Technical Notes: Use regex for pattern matching. Validate scrubber with rigorous unit tests before shipping. Ensure the diff structure remains intact even when values are redacted.
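A sketch of the scrubbing pass under the rules above (the function shape and pattern list are assumptions for illustration). Note the map structure is preserved so the diff remains intact after redaction:

```go
package main

import (
	"fmt"
	"regexp"
)

const redacted = "[REDACTED]"

// secretKey matches attribute names that commonly hold secrets.
var secretKey = regexp.MustCompile(`(?i)(password|secret|token|api_?key)`)

// scrub returns a copy of attrs with sensitive values replaced: keys marked
// sensitive in the state file, plus keys matching known secret-name
// patterns. Nested blocks are scrubbed recursively; the shape is preserved.
func scrub(attrs map[string]any, sensitive map[string]bool) map[string]any {
	out := make(map[string]any, len(attrs))
	for k, v := range attrs {
		switch {
		case sensitive[k], secretKey.MatchString(k):
			out[k] = redacted
		default:
			if nested, ok := v.(map[string]any); ok {
				out[k] = scrub(nested, sensitive)
			} else {
				out[k] = v
			}
		}
	}
	return out
}

func main() {
	attrs := map[string]any{
		"name":        "prod-db",
		"db_password": "hunter2",
		"config":      map[string]any{"api_token": "abc"},
	}
	fmt.Println(scrub(attrs, map[string]bool{"db_password": true}))
}
```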
Epic 2: Agent Communication
Description: Secure data transmission between the local Go agent and the dd0c SaaS. Handles authentication, mTLS registration, heartbeat signals, and encrypted drift report uploads.
User Stories
Story 2.1: Agent Registration & Authentication
As a DevOps Lead, I want the agent to securely register itself with the dd0c SaaS using an API key, so that my drift data is securely associated with my organization.
- Acceptance Criteria:
- Agent registers via `POST /v1/agents/register` using a static API key.
- Generates and exchanges mTLS certificates for subsequent requests.
- Receives configuration details (e.g., poll interval) from the SaaS.
- Story Points: 5
- Dependencies: None
- Technical Notes: The agent needs to store the mTLS cert locally or in memory. Implement robust error handling for unauthorized/revoked API keys.
Story 2.2: Encrypted Payload Transmission
As a Security Lead, I want the agent to transmit drift reports over a secure, encrypted channel, so that our infrastructure data cannot be intercepted in transit.
- Acceptance Criteria:
- Agent POSTs scrubbed drift reports to `/v1/drift-reports`.
- Communication enforces TLS 1.3 and uses the established mTLS client certificate.
- Payload is compressed (gzip) if over a certain threshold.
- Story Points: 3
- Dependencies: Story 1.4, Story 2.1
- Technical Notes: Ensure the Go HTTP client enforces TLS 1.3. Define the strict JSON schema for the `DriftReport` payload.
Story 2.3: Agent Heartbeat
As a DevOps Lead, I want the agent to send regular heartbeats to the SaaS, so that I know if the agent crashes or loses connectivity.
- Acceptance Criteria:
- Agent sends a lightweight heartbeat payload every N minutes.
- Payload includes uptime, memory usage, and events processed.
- SaaS API logs the heartbeat to track agent health.
- Story Points: 2
- Dependencies: Story 2.1
- Technical Notes: Run heartbeat in a separate Go goroutine with a ticker. Handle transient network errors silently.
Epic 3: Drift Analysis Engine
Description: The SaaS-side Node.js/TypeScript processor that ingests drift reports, classifies severity, calculates stack drift scores, and persists events to the database and event store.
User Stories
Story 3.1: Ingestion & Validation Pipeline
As a System Operator, I want the SaaS to receive and validate drift reports from agents via SQS, so that high volumes of reports are processed reliably without dropping data.
- Acceptance Criteria:
- API Gateway routes valid `POST /v1/drift-reports` requests to an SQS FIFO queue.
- Event Processor ECS task consumes from the queue.
- Validates the report payload against a strict JSON schema.
- Story Points: 5
- Dependencies: Story 2.2
- Technical Notes: Use `zod` for payload validation. Ensure message group IDs use `stack_id` to maintain ordering per stack.
Story 3.2: Drift Classification
As an Infrastructure Engineer, I want the SaaS to classify detected drift by severity and category, so that I can prioritize critical security issues over cosmetic tag changes.
- Acceptance Criteria:
- Applies YAML-defined classification rules to incoming drift diffs.
- Tags events as Critical, High, Medium, or Low severity.
- Tags events with categories (Security, Configuration, Tags, etc.).
- Story Points: 3
- Dependencies: Story 3.1
- Technical Notes: Implement a fast rule evaluation engine. Default unmatched drift to "Medium/Configuration".
Story 3.3: Persistence & Event Sourcing
As a Compliance Lead, I want every drift detection event to be stored in an immutable log, so that I have a reliable audit trail for SOC 2 compliance.
- Acceptance Criteria:
- Appends the raw drift event to DynamoDB (immutable event store).
- Upserts the current state of the resource in the PostgreSQL `resources` table.
- Inserts a new record in the PostgreSQL `drift_events` table for open drift.
- Story Points: 8
- Dependencies: Story 3.2
- Technical Notes: Handle database transactions carefully to keep PostgreSQL and DynamoDB in sync. Ensure Row-Level Security (RLS) is applied on all PostgreSQL inserts.
Story 3.4: Drift Score Calculation
As a DevOps Lead, I want the engine to calculate a drift score for each stack, so that I have a high-level metric of infrastructure health.
- Acceptance Criteria:
- Updates the `drift_score` field on the `stacks` table after processing a report.
- Score is out of 100 (e.g., 100 = completely clean).
- Weighted penalization based on severity (Critical heavily impacts score, Low barely impacts).
- Story Points: 3
- Dependencies: Story 3.3
- Technical Notes: Define a simple but logical weighting algorithm. Run calculation synchronously during event processing for V1.
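The weighting described above could be as simple as this sketch (the per-severity penalties are illustrative assumptions, not decided values):

```go
package main

import "fmt"

// severityWeight is an illustrative penalty per open drift event.
var severityWeight = map[string]float64{
	"Critical": 25, "High": 10, "Medium": 3, "Low": 1,
}

// driftScore starts at 100 (completely clean) and subtracts a weighted
// penalty per open event, clamped at 0.
func driftScore(openEvents []string) float64 {
	score := 100.0
	for _, sev := range openEvents {
		score -= severityWeight[sev]
	}
	if score < 0 {
		return 0
	}
	return score
}

func main() {
	// One Critical, one Medium, one Low open event: 100 - 25 - 3 - 1.
	fmt.Println(driftScore([]string{"Critical", "Medium", "Low"})) // 71
}
```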
Epic 4: Notification Service
Description: A Lambda-based service that formats drift events into actionable Slack messages (Block Kit) and handles delivery routing per stack configuration.
User Stories
Story 4.1: Slack Block Kit Formatting
As an Infrastructure Engineer, I want drift alerts to arrive as rich Slack messages, so that I can easily read the diff and context without leaving Slack.
- Acceptance Criteria:
- Lambda function maps drift events to Slack Block Kit JSON.
- Message includes Stack Name, Resource Address, Timestamp, Severity, and CloudTrail Attribution.
- Displays a code block showing the `old` vs `new` attribute diff.
- Story Points: 5
- Dependencies: Story 3.3
- Technical Notes: Build a flexible Block Kit template builder. Ensure diffs that are too long are gracefully truncated to fit Slack's block length limits.
Story 4.2: Slack Routing & Fanout
As a DevOps Lead, I want drift alerts to be routed to specific Slack channels based on the stack, so that the right team sees the alert without noise.
- Acceptance Criteria:
- Checks the `stacks` table for custom Slack channel overrides.
- Falls back to the organization's default Slack channel.
- Sends the formatted message via the Slack API.
- Story Points: 3
- Dependencies: Story 4.1
- Technical Notes: The Event Processor triggers this Lambda asynchronously via SQS (the `notification-fanout` queue).
Story 4.3: Action Buttons (Revert/Accept)
As an Infrastructure Engineer, I want action buttons directly on the Slack alert, so that I can quickly trigger remediation workflows.
- Acceptance Criteria:
- Slack message includes interactive buttons: `[Revert]`, `[Accept]`, `[Snooze]`, `[Assign]`.
- Buttons contain the `drift_event_id` in their payload value.
- Story Points: 2
- Dependencies: Story 4.1
- Technical Notes: This story covers rendering the buttons only. Handling the interactive callbacks is covered under the Slack Bot epic or the Remediation Engine (V1 MVP focus).
Story 4.4: Notification Batching (Low Severity)
As an Infrastructure Engineer, I want low and medium severity drift alerts to be batched into a digest, so that my Slack channel isn't spammed with noisy tag changes.
- Acceptance Criteria:
- Critical/High alerts are sent immediately.
- Medium/Low alerts are held in a DynamoDB table or SQS delay queue and dispatched as a daily/hourly digest.
- Story Points: 8
- Dependencies: Story 4.2
- Technical Notes: Use EventBridge Scheduler or a cron Lambda to process and flush the digest queue periodically.
Epic 5: Dashboard API
Description: The REST API powering the React SPA web dashboard. Handles user authentication (Cognito), organization/stack management, and querying the PostgreSQL data for drift history.
User Stories
Story 5.1: API Authentication & RLS Setup
As a System Operator, I want the API to enforce authentication and isolate tenant data, so that a user from one organization cannot see another organization's data.
- Acceptance Criteria:
- Integrates AWS Cognito JWT validation middleware.
- API sets `app.current_org_id` on the PostgreSQL connection session for Row-Level Security (RLS).
- Returns `401`/`403` for unauthorized requests.
- Story Points: 5
- Dependencies: Database Schema (RLS)
- Technical Notes: Build an Express/Node.js middleware layer. Ensure strict parameterization for all SQL queries beyond RLS to avoid injection.
Story 5.2: Stack Management Endpoints
As a DevOps Lead, I want a set of REST endpoints to view and manage my stacks, so that I can configure ownership, check connection health, and monitor the drift score.
- Acceptance Criteria:
- Implements `GET /v1/stacks` (list all stacks with their scores and resource counts).
- Implements `GET /v1/stacks/:id` (stack details).
- Implements `PATCH /v1/stacks/:id` (update name, owner, Slack channel).
- Story Points: 3
- Dependencies: Story 5.1
- Technical Notes: Support basic pagination (`limit`, `offset`) on list endpoints. Include agent heartbeat status in stack details if applicable.
Story 5.3: Drift History & Event Queries
As a Security Lead, I want endpoints to search and filter drift events, so that I can review historical drift, find specific changes, and generate audit reports.
- Acceptance Criteria:
- Implements `GET /v1/drift-events` with filters for `stack_id`, `status` (open/resolved), `severity`, and timestamp ranges.
- Joins the `drift_events` table with `resources` to return full address paths and diff payloads.
- Story Points: 5
- Dependencies: Story 5.1
- Technical Notes: Expose the JSONB `diff` field cleanly in the response payload. Use index-backed PostgreSQL queries for fast filtering.
Story 5.4: Policy Configuration Endpoints
As a DevOps Lead, I want an API to manage remediation policies, so that I can customize how the agent reacts to specific resource drift.
- Acceptance Criteria:
- Implements CRUD operations for stack-level and org-level policies (`/v1/policies`).
- Validates policy configuration payloads (e.g., action type, valid resource expressions).
- Story Points: 3
- Dependencies: Story 5.1
- Technical Notes: For V1, store policies as JSON fields in a simple `policies` table or map them directly to stacks. Keep validation simple (e.g., regex checks on resource types).
Epic 6: Dashboard UI
Description: The React Single Page Application (SPA) providing a web dashboard for stack overview, drift timeline, and resource-level diff viewer.
User Stories
Story 6.1: Stack Overview Dashboard
As a DevOps Lead, I want a main dashboard showing all my stacks and their drift scores, so that I can assess infrastructure health at a glance.
- Acceptance Criteria:
- Displays a list/table of all monitored stacks.
- Shows a visual "Drift Score" indicator (0-100) per stack.
- Sortable by score, name, and last checked timestamp.
- Provides visual indicators for agent connection status.
- Story Points: 5
- Dependencies: Story 5.2
- Technical Notes: Build with React and Vite. Use a standard UI library (e.g., Tailwind UI or MUI). Implement efficient data fetching (e.g., React Query).
Story 6.2: Stack Detail & Drift Timeline
As an Infrastructure Engineer, I want to click into a stack and see a timeline of drift events, so that I can track when things changed and who changed them.
- Acceptance Criteria:
- Shows a chronological list of drift events for the selected stack.
- Displays open vs. resolved status.
- Filters for severity and category.
- Includes CloudTrail attribution data (who, IP, action).
- Story Points: 5
- Dependencies: Story 5.3, Story 6.1
- Technical Notes: Support pagination/infinite scrolling for the timeline. Use clear icons for event types (Security vs. Tags).
Story 6.3: Resource-Level Diff Viewer
As an Infrastructure Engineer, I want to see the exact attribute changes for a drifted resource, so that I know exactly how reality differs from my state file.
- Acceptance Criteria:
- Clicking an event opens a detailed view/modal.
- Renders a code-diff view (red for old state, green for new reality).
- Clearly marks redacted sensitive values.
- Story Points: 5
- Dependencies: Story 6.2
- Technical Notes: Use a specialized diff viewing component (e.g., `react-diff-viewer`). Ensure it handles large JSON blocks gracefully.
Story 6.4: Auth & User Settings
As a User, I want to manage my account and view my API keys, so that I can deploy the agent and access my organization's dashboard.
- Acceptance Criteria:
- Implements login/signup via Cognito (Email/Password & GitHub OAuth).
- Provides a settings page displaying the organization's static API key.
- Displays current subscription plan (Free tier limits for MVP).
- Story Points: 3
- Dependencies: Story 5.1
- Technical Notes: Securely manage JWT storage (HttpOnly cookies or secure local storage). Include a clear "copy to clipboard" for the API key.
Epic 7: Slack Bot
Description: The interactive Slack application that handles user commands (/drift score) and processes the interactive action buttons ([Revert], [Accept]) from drift alerts.
User Stories
Story 7.1: Interactive Remediation Callbacks (Revert)
As an Infrastructure Engineer, I want clicking [Revert] on a Slack alert to trigger a targeted `terraform apply`, so that I can fix drift instantly without leaving Slack.
- Acceptance Criteria:
- SaaS API Gateway (`/v1/slack/interactions`) receives the button click payload.
- Validates the Slack request signature.
- Generates a scoped `terraform plan -target` command and queues it for the agent.
- Updates the Slack message to "Reverting...".
- Story Points: 8
- Dependencies: Story 4.3
- Technical Notes: The actual execution happens via the Remediation Engine (ECS Fargate) dispatching commands to the agent. Requires careful state tracking (Pending -> Executing -> Completed).
Story 7.2: Interactive Acceptance Callbacks (Accept)
As an Infrastructure Engineer, I want clicking [Accept] to auto-generate a PR that updates my Terraform code to match reality, so that the drift becomes the new source of truth.
- Acceptance Criteria:
- SaaS generates a code patch representing the new state.
- Uses the GitHub API to create a branch and open a PR against the target repo.
- Updates the Slack message with a link to the PR.
- Story Points: 8
- Dependencies: Story 7.1
- Technical Notes: Will require GitHub App integration (or OAuth token) to create branches/PRs. The patch generation logic needs robust testing.
Story 7.3: Slack Slash Commands
As a DevOps Lead, I want to use `/drift score` and `/drift status <stack>` in Slack, so that I can check my infrastructure health on demand.
- Acceptance Criteria:
- `/drift score` returns the aggregate score for the organization.
- `/drift status prod-networking` returns the score, open events, and agent health for a specific stack.
- Formats output as a clean Slack Block Kit message visible only to the user.
- Story Points: 5
- Dependencies: Story 4.1, Story 5.2
- Technical Notes: Deploy an API Gateway endpoint specifically for Slack slash commands. Validate the token and use the internal Dashboard API logic to fetch scores.
Story 7.4: Snooze & Assign Callbacks
As an Infrastructure Engineer, I want to click [Snooze 24h] or [Assign], so that I can manage alert noise or delegate investigation to a teammate.
- Acceptance Criteria:
- `[Snooze]` updates the event status to `snoozed` and schedules a wake-up time.
- `[Assign]` opens a Slack modal to select a team member, updating the event owner.
- The original Slack message updates to reflect the new state/owner.
- Story Points: 5
- Dependencies: Story 7.1
- Technical Notes: Snooze requires a scheduled EventBridge or cron job to un-snooze. Assign requires interacting with Slack's user selection menus.
Epic 8: Infrastructure & DevOps
Description: The underlying cloud resources for the dd0c SaaS and the CI/CD pipelines to build, test, and release the agent and services.
User Stories
Story 8.1: SaaS Infrastructure (Terraform)
As a System Operator, I want the SaaS infrastructure to be defined as code, so that deployments are repeatable and I can dogfood my own drift detection tool.
- Acceptance Criteria:
- Defines VPC, Subnets, ECS Fargate Clusters, RDS PostgreSQL (Multi-AZ), API Gateway, and SQS FIFO queues.
- Sets up CloudWatch log groups and IAM roles.
- Uses Terraform for all configuration.
- Story Points: 8
- Dependencies: Architecture Design Document
- Technical Notes: Build a modular Terraform setup. Use the official AWS provider. Include variables for environment separation (staging vs. prod).
Story 8.2: CI/CD Pipeline (GitHub Actions)
As a Developer, I want a fully automated CI/CD pipeline, so that code pushed to main is linted, tested, built, and deployed to ECS.
- Acceptance Criteria:
- Runs `golangci-lint`, `go test`, ESLint, and Vitest on PRs.
- Builds multi-stage Docker images for the Event Processor, Dashboard API, and Remediation Engine.
- Pushes images to ECR and triggers an ECS rolling deploy.
- Story Points: 5
- Dependencies: Story 8.1
- Technical Notes: Use standard GitHub Actions (e.g., `aws-actions/configure-aws-credentials`). Add Trivy for basic container scanning.
Story 8.3: Agent Distribution (Releases & Homebrew)
As an Open Source User, I want to easily download and install the CLI agent, so that I can test drift detection locally without building from source.
- Acceptance Criteria:
- Configures GoReleaser to cross-compile binaries for Linux/macOS/Windows (amd64/arm64).
- Auto-publishes GitHub Releases when a new tag is pushed.
- Creates a custom Homebrew tap (`brew install dd0c/tap/drift-cli`).
- Story Points: 5
- Dependencies: Story 1.1
- Technical Notes: Create a dedicated `.github/workflows/release.yml` for GoReleaser.
Story 8.4: Agent Terraform Module Publication
As a DevOps Lead, I want a pre-built Terraform module to deploy the agent in my AWS account, so that I don't have to manually configure ECS tasks and EventBridge rules.
- Acceptance Criteria:
- Creates the `dd0c/drift-agent/aws` Terraform module.
- Provisions an ECS Task, EventBridge rules, SQS, and IAM roles for the customer.
- Publishes the module to the public Terraform Registry.
- Story Points: 8
- Dependencies: Story 8.3
- Technical Notes: Adhere to Terraform Registry best practices. Ensure the `README.md` clearly explains the required `dd0c_api_key` and state bucket variables.
Epic 9: Onboarding & PLG (Product-Led Growth)
Description: The self-serve funnel that guides users from CLI installation to their first drift alert in under 5 minutes, plus billing and tier management.
User Stories
Story 9.1: Self-Serve Signup & CLI Login
As an Engineer, I want to easily sign up for the free tier via the CLI, so that I don't have to fill out sales forms to test the product.
- Acceptance Criteria:
- Running `drift auth login` opens a browser to an OAuth flow (GitHub/Email).
- The CLI spins up a local web server to catch the callback token.
- Successfully provisions an organization and user account in the SaaS.
- Story Points: 5
- Dependencies: Story 5.4, Story 8.3
- Technical Notes: The callback server should listen on `localhost` with a random or standard port (e.g., `8080`).
Story 9.2: Auto-Discovery (drift init)
As an Infrastructure Engineer, I want the CLI to auto-discover my Terraform state files, so that I can configure my first stack without typing S3 ARNs.
- Acceptance Criteria:
- `drift init` scans the current directory for `*.tf` files.
- Uses default AWS credentials to query S3 buckets matching common state file patterns.
- Prompts the user to register discovered stacks to their organization.
- Story Points: 8
- Dependencies: Story 9.1
- Technical Notes: Implement robust fallback to manual input if discovery fails.
Story 9.3: Free Tier Enforcement (1 Stack)
As a Product Manager, I want to enforce a free tier limit of 1 stack, so that users get value but are incentivized to upgrade for larger infrastructure needs.
- Acceptance Criteria:
- The API rejects attempts to register more than 1 stack on the Free plan.
- The Dashboard clearly shows "1/1 Stacks Used".
- The CLI prompts "Upgrade to Starter ($49/mo)" when trying to add a second stack.
- Story Points: 3
- Dependencies: Story 9.1
- Technical Notes: Enforce limits securely at the API level (e.g., `POST /v1/stacks` should return a `403 Stack Limit` error).
Story 9.4: Stripe Billing Integration
As a Solo Founder, I want customers to upgrade to paid tiers with a credit card, so that I can capture revenue without a sales process.
- Acceptance Criteria:
- Integrates Stripe Checkout for the Starter ($49/mo) and Pro ($149/mo) tiers.
- Dashboard provides a billing management portal (Stripe Customer Portal).
- Webhooks listen for successful payments and update the organization's `plan` field in PostgreSQL.
- Story Points: 8
- Dependencies: Story 5.2
- Technical Notes: Must include Stripe webhook signature verification to prevent spoofed upgrades. Store `stripe_customer_id` on the `organizations` table.
Epic 10: Transparent Factory Compliance
Description: Cross-cutting epic ensuring dd0c/drift adheres to the 5 Transparent Factory architectural tenets. For a drift detection product, these tenets are especially critical — the tool that detects infrastructure drift must itself be immune to uncontrolled drift.
Story 10.1: Atomic Flagging — Feature Flags for Detection Behaviors
As a solo founder, I want every new drift detection rule, remediation action, and notification behavior wrapped in a feature flag (default: off), so that I can ship new detection capabilities without accidentally triggering false-positive alerts for customers.
Acceptance Criteria:
- OpenFeature SDK integrated into the Go agent. V1 provider: env-var or JSON file-based (no external service).
- All flags evaluate locally — no network calls during drift scan execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if any flag at 100% rollout exceeds TTL.
- Automated circuit breaker: if a flagged detection rule generates >3x the baseline false-positive rate over 1 hour, the flag auto-disables.
- Flags required for: new IaC provider support (Terraform/Pulumi/CDK), remediation suggestions, Slack notification formats, scan scheduling changes.
- Estimate: 5 points
- Dependencies: Epic 1 (Agent Core)
- Technical Notes:
- Use the Go OpenFeature SDK (`go.openfeature.dev/sdk`). JSON file provider for V1.
- Circuit breaker: track false-positive dismissals per rule in Redis. If the dismissal rate spikes, disable the flag.
- Flag audit: `make flag-audit` lists all flags with TTL status.
Story 10.2: Elastic Schema — Additive-Only for Drift State Storage
As a solo founder, I want all DynamoDB and state file schema changes to be strictly additive, so that agent rollbacks never corrupt drift history or lose customer scan results.
Acceptance Criteria:
- CI lint rejects any migration or schema change that removes, renames, or changes the type of existing DynamoDB attributes.
- New attributes use a `_v2` suffix when breaking changes are needed. Old attributes remain readable.
- Go structs use `json:",omitempty"` and ignore unknown fields so V1 agents can read V2 state files without crashing.
- Dual-write enforced during migration windows: the agent writes to both old and new attribute paths in the same DynamoDB `TransactWriteItems` call.
- Every schema change includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
- Estimate: 3 points
- Dependencies: Epic 2 (State Management)
- Technical Notes:
- DynamoDB Single Table Design: version items with a `_v` attribute. Agent code uses a factory to select the correct model.
- For Terraform state parsing, use `encoding/json` with `DisallowUnknownFields` left off (the default) to tolerate upstream state format changes.
- S3 state snapshots: never overwrite; always write new versioned keys.
Story 10.3: Cognitive Durability — Decision Logs for Detection Logic
As a future maintainer, I want every change to drift detection algorithms, severity scoring, or remediation logic accompanied by a decision_log.json, so that I understand why a particular drift pattern is flagged as critical vs. informational.
Acceptance Criteria:
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log entry for PRs touching `pkg/detection/`, `pkg/scoring/`, or `pkg/remediation/`.
- Cyclomatic complexity cap of 10 enforced via `golangci-lint` with the `gocyclo` linter. PRs exceeding this are blocked.
- Decision logs committed in `docs/decisions/`, one per significant logic change.
- Estimate: 2 points
- Dependencies: None
- Technical Notes:
- PR template includes decision log fields as a checklist.
- For drift scoring changes: document why specific thresholds were chosen (e.g., "security group changes scored critical because X% of breaches start with SG drift").
- `golangci-lint` config: `.golangci.yml` with `gocyclo: max-complexity: 10`.
Story 10.4: Semantic Observability — AI Reasoning Spans on Drift Classification
As an SRE debugging a missed drift alert, I want every drift classification decision to emit an OpenTelemetry span with structured reasoning metadata, so that I can trace why a specific resource change was scored as low-severity when it should have been critical.
Acceptance Criteria:
- Every drift scan emits a parent `drift_scan` span. Each resource comparison emits a child `drift_classification` span.
- Span attributes: `drift.resource_type`, `drift.severity_score`, `drift.classification_reason`, `drift.alternatives_considered` (e.g., "considered critical but downgraded because tag-only change").
- If AI-assisted classification is used (future): `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score` included.
- Spans export via OTLP to any compatible backend.
- No PII or customer infrastructure details in spans; resource ARNs are hashed.
- Estimate: 3 points
- Dependencies: Epic 1 (Agent Core)
- Technical Notes:
- Use `go.opentelemetry.io/otel` with the OTLP exporter.
- For V1 without AI classification, `drift.classification_reason` is the rule name plus the threshold that triggered.
- ARN hashing: SHA-256 truncated to 12 chars for correlation without exposure.
Story 10.5: Configurable Autonomy — Governance for Auto-Remediation
As a solo founder, I want a policy.json that controls whether the agent can auto-remediate drift or only report it, so that customers maintain full control over what the tool is allowed to change in their infrastructure.
Acceptance Criteria:
- `policy.json` defines `governance_mode`: `strict` (report-only, no remediation) or `audit` (auto-remediate with logging).
- Agent checks policy before every remediation action. In `strict` mode, remediation suggestions are logged but never executed.
- `panic_mode`: when true, the agent stops all scans immediately, preserves the last-known-good state, and sends a single "paused" notification.
- Per-customer policy override: customers can set their own governance mode via config, which is always more restrictive than the system default (never less).
- All policy decisions logged: "Remediation blocked by strict mode for resource X", "Auto-remediation applied in audit mode".
- Estimate: 3 points
- Dependencies: Epic 3 (Remediation Engine)
- Technical Notes:
- `policy.json` in repo root, loaded at startup, watched via `fsnotify`.
- Customer-level overrides stored in the DynamoDB `org_settings` item. Merge logic: `min(system_policy, customer_policy)`; the customer can only be MORE restrictive.
- Panic mode trigger: `POST /admin/panic` or env var `DD0C_PANIC=true`. The agent drains the current scan and halts.
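The "customer can only tighten" merge rule can be sketched as a simple rank comparison (the rank values and fallback behavior for unknown modes are illustrative assumptions):

```go
package main

import "fmt"

// Lower rank = more restrictive: strict (report-only) beats audit.
var rank = map[string]int{"strict": 0, "audit": 1}

// effectiveMode merges the system default with a customer override. The
// customer can only tighten the policy, never loosen it; unknown override
// values fall back to the system default.
func effectiveMode(system, customer string) string {
	c, ok := rank[customer]
	if !ok {
		return system
	}
	if c < rank[system] {
		return customer
	}
	return system
}

func main() {
	fmt.Println(effectiveMode("audit", "strict")) // customer tightens -> strict
	fmt.Println(effectiveMode("strict", "audit")) // cannot loosen -> strict
}
```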
Epic 10 Summary
| Story | Tenet | Points |
|---|---|---|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| Total | | 16 |