# dd0c/drift — Technical Architecture
**Architect:** Max Mayfield (Phase 6 — Architecture)
**Date:** February 28, 2026
**Product:** dd0c/drift — IaC Drift Detection & Remediation SaaS
**Status:** Architecture Design Document
---
## Section 1: SYSTEM OVERVIEW
### High-Level Architecture
```mermaid
graph TB
subgraph Customer VPC
CT[AWS CloudTrail] -->|Events| EB[Amazon EventBridge]
EB -->|Rule Match| SQS_C[SQS Queue<br/>customer-side]
SQS_C --> DA[Drift Agent<br/>ECS Task / GitHub Action]
SF[Terraform State<br/>S3 Backend] -->|Read| DA
DA -->|Encrypted Drift Diffs| HTTPS[HTTPS Egress Only]
end
subgraph dd0c SaaS Platform — AWS
HTTPS -->|mTLS| APIGW[API Gateway<br/>Agent Ingestion]
APIGW --> SQS_P[SQS FIFO Queue<br/>Event Ingestion]
SQS_P --> PROC[Event Processor<br/>ECS Fargate]
PROC --> DB[(PostgreSQL RDS<br/>Multi-Tenant)]
PROC --> S3_SNAP[S3<br/>State Snapshots]
PROC --> ES[Event Store<br/>DynamoDB Streams]
PROC --> RE[Remediation Engine<br/>ECS Fargate]
RE -->|Generate Plan| DA
PROC --> NS[Notification Service<br/>Lambda]
NS --> SLACK[Slack API]
NS --> EMAIL[SES Email]
NS --> WH[Webhook Delivery]
DASH[Dashboard API<br/>ECS Fargate] --> DB
DASH --> S3_SNAP
UI[React SPA<br/>CloudFront] --> DASH
AUTH[Auth Service<br/>Cognito + Lambda] --> DASH
AUTH --> APIGW
end
subgraph External Integrations
SLACK --> SLACK_USER[Slack Workspace]
GH[GitHub API] --> PR[Pull Request<br/>Accept Drift]
PD[PagerDuty API] --> ONCALL[On-Call Rotation]
end
```
### Component Inventory
| Component | Responsibility | Runtime | Deployment |
|---|---|---|---|
| **Drift Agent** | Consumes CloudTrail events via EventBridge/SQS, reads Terraform state, computes drift diffs, pushes encrypted results to SaaS | Go binary | Customer VPC: ECS Task, GitHub Action, or standalone binary |
| **API Gateway (Ingestion)** | Authenticates agent connections (mTLS + API key), rate limits, routes to ingestion queue | AWS API Gateway (HTTP API) | SaaS account |
| **Event Processor** | Deserializes drift diffs, classifies severity, persists to DB, triggers notifications and remediation workflows | Node.js / TypeScript | ECS Fargate |
| **State Manager** | Parses Terraform state files (v4 format), builds resource graph, computes resource-level diffs against previous snapshots | Go (shared lib with Agent) | Runs inside Drift Agent + SaaS-side for dashboard queries |
| **Remediation Engine** | Generates scoped `terraform plan` for revert, manages approval workflow, dispatches apply commands back to Agent | Node.js / TypeScript | ECS Fargate |
| **Notification Service** | Formats and delivers Slack Block Kit messages, emails, webhooks; handles Slack interactivity (button callbacks) | Node.js Lambda | Lambda (event-driven, pay-per-invocation) |
| **Dashboard API** | REST API for web dashboard — drift scores, stack list, history, compliance reports, team management | Node.js / TypeScript | ECS Fargate |
| **Web Dashboard** | React SPA — drift score, stack overview, drift timeline, compliance report generator, settings | React + Vite | CloudFront + S3 |
| **Auth Service** | User authentication (email/password, GitHub OAuth, Google OAuth), API key management for agents, RBAC | Cognito + Lambda authorizers | Managed + Lambda |
### Technology Choices
| Decision | Choice | Justification |
|---|---|---|
| **Agent Language** | Go | Single static binary, no runtime dependencies, cross-compiles to Linux/macOS/Windows. Critical for customer-side deployment — zero dependency footprint. |
| **SaaS Backend** | Node.js / TypeScript | Shared language with frontend (React). Fast iteration for a solo founder. Strong AWS SDK support. TypeScript catches bugs at compile time. |
| **Database** | PostgreSQL (RDS) | Relational model fits multi-tenant SaaS (row-level security). JSONB for flexible drift diff storage. Mature, battle-tested. RDS handles backups/failover. |
| **Event Store** | DynamoDB + Streams | Append-only drift event log. DynamoDB Streams enables event sourcing pattern. Cost-effective at low volume, scales linearly. |
| **State Snapshots** | S3 + Glacier lifecycle | State snapshots are large (MB range), write-once-read-rarely. S3 is the obvious choice. Glacier after 90 days for cost. |
| **Queue** | SQS FIFO | Exactly-once processing for drift events. FIFO guarantees ordering per message group (per stack). No operational overhead vs. self-managed Kafka. |
| **Notifications** | Lambda | Event-driven, bursty workload. Pay-per-invocation. A Slack message costs ~$0.0000002 in Lambda compute. |
| **Frontend** | React + Vite + CloudFront | Standard SPA stack. CloudFront for global edge caching. Vite for fast builds. Nothing exotic — solo founder needs boring tech. |
| **Auth** | Cognito | Managed auth with OAuth flows, JWT tokens, user pools. Eliminates building auth from scratch. Cognito is ugly but functional. |
| **IaC for SaaS infra** | Terraform | Dogfooding. The SaaS that detects Terraform drift should be deployed with Terraform. |
### Deployment Model — Push-Based Agent Architecture
This is the non-negotiable architectural decision. The SaaS never pulls from customer infrastructure. The agent pushes out.
```
┌─────────────────────────────────────────────────┐
│ Customer Account │
│ │
│ CloudTrail ──► EventBridge ──► SQS ──► Agent │
│ │ │
│ S3 State Bucket ◄──── (read) ──────────┘ │
│ │ │
│ Drift Diff │
│ (encrypted) │
│ │ │
│ HTTPS OUT ────┼──► dd0c SaaS
│ │
│ ❌ No inbound access from SaaS │
│ ❌ No IAM cross-account role for SaaS │
│ ✅ Agent runs with customer's IAM role │
│ ✅ State file never leaves customer account │
│ ✅ Only drift diffs (no secrets) are transmitted│
└─────────────────────────────────────────────────┘
```
**Agent Deployment Options:**
1. **ECS Fargate Task** (recommended for always-on): Long-running container that subscribes to SQS queue for real-time CloudTrail events and runs scheduled state comparisons. Deployed via Terraform module provided by dd0c.
2. **GitHub Actions Cron** (recommended for getting started): Scheduled workflow that runs `drift check` on a cron (e.g., every 15 minutes). Zero infrastructure to manage. Lowest barrier to entry.
3. **Standalone Binary** (for air-gapped / custom): Download the Go binary, run it anywhere — EC2, Kubernetes pod, on-prem server. Maximum flexibility.
**What the Agent Transmits:**
The agent sends a `DriftReport` payload — NOT the state file. The payload contains:
- Stack identifier (name, backend location hash — not the actual S3 path)
- List of drifted resources: resource type, resource address, attribute-level diff (old value vs. new value)
- CloudTrail attribution: IAM principal, source IP, timestamp, event name
- Drift classification: severity (critical/high/medium/low), category (security/config/tags/scaling)
- Agent metadata: version, heartbeat timestamp, detection method (event-driven vs. scheduled)
**What the Agent Does NOT Transmit:**
- Full state file contents
- Secret values (the agent strips `sensitive` attributes and known secret patterns before transmission)
- Raw CloudTrail events (only correlated attribution data)
- S3 bucket names, account IDs, or other infrastructure identifiers (hashed)
---
## Section 2: CORE COMPONENTS
### 2.1 Drift Agent
The agent is the heart of the push-based architecture. It's a single Go binary that runs inside the customer's environment and does two things: consume CloudTrail events for real-time detection, and periodically compare Terraform state against cloud reality for comprehensive coverage.
**Architecture:**
```mermaid
graph LR
subgraph Drift Agent Process
EC[Event Consumer<br/>CloudTrail via SQS] --> DF[Drift Filter<br/>Resource Matcher]
SC[Scheduled Checker<br/>Cron / Timer] --> SP[State Parser<br/>TF State v4]
SP --> DC[Drift Comparator<br/>Attribute-Level Diff]
DF --> DC
DC --> SS[Secret Scrubber<br/>Strip Sensitive Values]
SS --> TX[Transmitter<br/>HTTPS + mTLS]
TX -->|Encrypted DriftReport| SAAS[dd0c SaaS API]
end
```
**Event Consumer (Real-Time Path):**
CloudTrail delivers events to EventBridge. An EventBridge rule matches IaC-managed resource types and forwards to an SQS queue. The agent polls this queue.
```json
// EventBridge Rule Pattern — matches write API calls on IaC-managed resource types
{
  "source": ["aws.ec2", "aws.iam", "aws.rds", "aws.s3", "aws.lambda", "aws.ecs"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": [{
      "anything-but": {
        "prefix": "Describe"
      }
    }],
    "readOnly": [false]
  }
}
```
When the agent receives a CloudTrail event:
1. Extract the resource identifier from the event (e.g., `sg-abc123` from a `ModifySecurityGroupRules` event)
2. Look up which Terraform state file manages this resource (agent maintains an in-memory resource → state index)
3. Read the current resource attributes from the cloud API (e.g., `ec2:DescribeSecurityGroups`)
4. Compare against the declared attributes in the Terraform state file
5. If drift detected → build `DriftReport`, scrub secrets, transmit to SaaS
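A minimal Go sketch of steps 2-4 (the types, the index shape, and the `liveLookup` callback are illustrative assumptions, not the agent's real interfaces):

```go
package main

import (
	"fmt"
	"reflect"
)

// CloudTrailEvent and DriftFinding are simplified stand-ins for the
// agent's real event and report types.
type CloudTrailEvent struct {
	EventName  string
	ResourceID string // e.g. "sg-abc123"
}

type DriftFinding struct {
	StackName string
	Address   string
	Attribute string
	Old, New  interface{}
}

// handleEvent correlates a CloudTrail event with the in-memory
// resource -> state index (step 2), fetches live attributes via the
// injected liveLookup (step 3, standing in for a Describe* call), and
// diffs against declared state (step 4).
func handleEvent(
	ev CloudTrailEvent,
	resourceIndex map[string]struct{ Stack, Address string },
	declared map[string]map[string]interface{}, // address -> state attributes
	liveLookup func(id string) map[string]interface{},
) []DriftFinding {
	entry, ok := resourceIndex[ev.ResourceID]
	if !ok {
		return nil // not IaC-managed; ignore
	}
	live := liveLookup(ev.ResourceID)
	var findings []DriftFinding
	for attr, want := range declared[entry.Address] {
		if got, exists := live[attr]; !exists || !reflect.DeepEqual(want, got) {
			findings = append(findings, DriftFinding{
				StackName: entry.Stack, Address: entry.Address,
				Attribute: attr, Old: want, New: got,
			})
		}
	}
	return findings
}

func main() {
	index := map[string]struct{ Stack, Address string }{
		"sg-abc123": {"prod-networking", "aws_security_group.api"},
	}
	declared := map[string]map[string]interface{}{
		"aws_security_group.api": {"ingress_cidr": "10.0.0.0/8"},
	}
	live := func(string) map[string]interface{} {
		return map[string]interface{}{"ingress_cidr": "0.0.0.0/0"}
	}
	f := handleEvent(CloudTrailEvent{"ModifySecurityGroupRules", "sg-abc123"}, index, declared, live)
	fmt.Println(len(f), f[0].Attribute) // 1 ingress_cidr
}
```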
**Scheduled Checker (Comprehensive Path):**
The event-driven path catches ~80% of drift in real time. The scheduled path catches everything else — resources modified through means that don't generate standard CloudTrail events, or resources in services not yet covered by the EventBridge rule.
```
Schedule: Configurable per tier
Free: Every 24 hours
Starter: Every 15 minutes
Pro: Every 5 minutes
Business: Every 1 minute
```
The scheduled checker:
1. Reads the Terraform state file from the configured backend (S3, GCS, Terraform Cloud, local)
2. For each resource in state, calls the corresponding cloud API to get current attributes
3. Computes attribute-level diff between state and reality
4. Batches all drifted resources into a single `DriftReport`
5. Transmits to SaaS
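The core of step 3, the attribute-level diff, can be sketched in Go (simplified: real diffs are nested and provider-schema-aware):

```go
package main

import (
	"fmt"
	"reflect"
)

// AttrDiff records one changed attribute. A nil Old means the attribute
// appeared out-of-band; a nil New means it was deleted.
type AttrDiff struct{ Old, New interface{} }

// diffAttributes compares declared state attributes against live cloud
// attributes and returns only the changed keys.
func diffAttributes(declared, live map[string]interface{}) map[string]AttrDiff {
	out := map[string]AttrDiff{}
	for k, want := range declared {
		got, ok := live[k]
		if !ok {
			out[k] = AttrDiff{Old: want, New: nil} // deleted in cloud
		} else if !reflect.DeepEqual(want, got) {
			out[k] = AttrDiff{Old: want, New: got} // changed
		}
	}
	for k, got := range live {
		if _, ok := declared[k]; !ok {
			out[k] = AttrDiff{Old: nil, New: got} // added out-of-band
		}
	}
	return out
}

func main() {
	declared := map[string]interface{}{"instance_type": "t3.micro", "monitoring": true}
	live := map[string]interface{}{"instance_type": "t3.large", "monitoring": true}
	fmt.Println(diffAttributes(declared, live))
}
```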
**State File Parsing:**
The agent parses Terraform state v4 format (JSON). Key structures:
```go
type TerraformState struct {
	Version          int             `json:"version"` // Must be 4
	TerraformVersion string          `json:"terraform_version"`
	Serial           int64           `json:"serial"`
	Lineage          string          `json:"lineage"`
	Resources        []StateResource `json:"resources"`
}

type StateResource struct {
	Module    string             `json:"module,omitempty"`
	Mode      string             `json:"mode"` // "managed" or "data"
	Type      string             `json:"type"` // e.g., "aws_security_group"
	Name      string             `json:"name"`
	Provider  string             `json:"provider"`
	Instances []ResourceInstance `json:"instances"`
}

type ResourceInstance struct {
	SchemaVersion int                    `json:"schema_version"`
	Attributes    map[string]interface{} `json:"attributes"`
	Private       string                 `json:"private"` // Base64 encoded, may contain secrets
}
```
The agent maintains a mapping of Terraform resource types to AWS API calls:
| Terraform Resource Type | AWS Describe API | Key Identifier |
|---|---|---|
| `aws_security_group` | `ec2:DescribeSecurityGroups` | `attributes.id` |
| `aws_iam_role` | `iam:GetRole` | `attributes.name` |
| `aws_iam_policy` | `iam:GetPolicyVersion` | `attributes.arn` |
| `aws_db_instance` | `rds:DescribeDBInstances` | `attributes.identifier` |
| `aws_s3_bucket` | multiple `s3:GetBucket*` calls (composite — no single describe API) | `attributes.bucket` |
| `aws_lambda_function` | `lambda:GetFunction` | `attributes.function_name` |
| `aws_ecs_service` | `ecs:DescribeServices` | `attributes.name` + `attributes.cluster` |
| `aws_route53_record` | `route53:ListResourceRecordSets` | `attributes.zone_id` + `attributes.name` |
MVP covers the top 20 most-drifted resource types (based on community data from driftctl's historical issues). Remaining types added iteratively based on customer demand.
**Secret Scrubbing:**
Before transmitting any drift diff, the agent runs a scrubbing pass:
1. Remove any attribute marked `sensitive` in the Terraform provider schema
2. Redact values matching known secret patterns: `password`, `secret`, `token`, `key`, `private_key`, `connection_string`
3. Redact any value that looks like a credential (regex patterns for AWS keys, database URIs, JWT tokens)
4. Replace redacted values with `[REDACTED]` — the diff still shows "attribute changed" but not the actual values
5. The `Private` field on resource instances is always stripped entirely
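A sketch of the scrubbing pass in Go, with a deliberately short pattern list (the real agent's patterns are schema-driven and far more extensive):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Illustrative patterns only; the production list also covers JWTs,
// generic high-entropy strings, and provider "sensitive" schema flags.
var (
	secretKey   = regexp.MustCompile(`(?i)(password|secret|token|private_key|connection_string|api_key)`)
	awsAccessID = regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)
	dbURI       = regexp.MustCompile(`\b\w+://[^/\s:]+:[^@\s]+@`) // user:pass@ in a URI
)

// scrub walks a flat attribute map: redacts keys flagged sensitive or
// matching secret-name patterns (steps 1-2), redacts credential-shaped
// values (step 3), and strips the private field entirely (step 5).
func scrub(attrs map[string]interface{}, sensitive map[string]bool) map[string]interface{} {
	out := make(map[string]interface{}, len(attrs))
	for k, v := range attrs {
		s, isStr := v.(string)
		switch {
		case sensitive[k], secretKey.MatchString(k):
			out[k] = "[REDACTED]"
		case isStr && (awsAccessID.MatchString(s) || dbURI.MatchString(s)):
			out[k] = "[REDACTED]"
		case strings.EqualFold(k, "private"):
			// stripped entirely, not redacted
		default:
			out[k] = v
		}
	}
	return out
}

func main() {
	attrs := map[string]interface{}{
		"master_password": "hunter2",
		"endpoint":        "db.example.com",
		"conn":            "postgres://app:hunter2@db.example.com/prod",
	}
	fmt.Println(scrub(attrs, nil))
}
```

The diff still records that an attribute changed; only the value is replaced.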
### 2.2 Event Pipeline
The event pipeline is the real-time nervous system. CloudTrail events flow through EventBridge and SQS to the agent — no periodic cloud API scanning, no cron delay.
**Customer-Side Pipeline:**
```
CloudTrail (all regions)
EventBridge (default bus)
├── Rule: drift-agent-ec2 (ec2 write events)
├── Rule: drift-agent-iam (iam write events)
├── Rule: drift-agent-rds (rds write events)
└── Rule: drift-agent-* (per-service rules)
SQS Queue: drift-agent-events
│ (FIFO, dedup by CloudTrail eventID)
│ (visibility timeout: 300s)
│ (DLQ after 3 retries)
Drift Agent (long-poll, batch size 10)
```
**SaaS-Side Pipeline:**
```
Agent HTTPS POST /v1/drift-reports
API Gateway (auth: mTLS + API key header)
SQS FIFO Queue: drift-report-ingestion
│ (message group ID = stack_id → ordering per stack)
│ (dedup ID = report_id → exactly-once)
Event Processor (ECS Fargate, auto-scaling 1-10 tasks)
├──► PostgreSQL (drift_events, resources, stacks)
├──► DynamoDB (event store, append-only)
├──► S3 (state snapshot, if full snapshot report)
└──► Notification Service (Lambda, async invoke)
```
**Why SQS FIFO, not Kafka/Kinesis:**
At MVP scale (10-1,000 stacks), SQS FIFO is the right choice:
- Zero operational overhead (no brokers, no partitions, no ZooKeeper)
- Exactly-once processing via deduplication ID
- Per-stack ordering via message group ID
- Costs roughly $0.50/million requests (FIFO pricing). At 1,000 stacks checking every 5 minutes, that's ~288K messages/day ≈ 8.6M/month ≈ $4.30/month
- If we hit 10,000+ stacks and need streaming analytics, migrate to Kinesis. That's a V3 problem.
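The two FIFO fields the pipeline relies on can be illustrated with a small in-memory simulation (this models SQS FIFO semantics for clarity; it is not SDK code):

```go
package main

import "fmt"

// Message mirrors the two FIFO fields the pipeline relies on:
// MessageGroupId (= stack_id, per-stack ordering) and
// MessageDeduplicationId (= report_id, exactly-once).
type Message struct {
	GroupID string // stack_id
	DedupID string // report_id
	Body    string
}

// deliver simulates FIFO semantics: duplicates (same DedupID) are
// dropped, and messages within one group keep arrival order.
func deliver(in []Message) map[string][]string {
	seen := map[string]bool{}
	out := map[string][]string{}
	for _, m := range in {
		if seen[m.DedupID] {
			continue // e.g. an agent retry; SQS dedups within a 5-minute window
		}
		seen[m.DedupID] = true
		out[m.GroupID] = append(out[m.GroupID], m.Body)
	}
	return out
}

func main() {
	msgs := []Message{
		{"stack-a", "rpt-1", "drift#1"},
		{"stack-a", "rpt-1", "drift#1"}, // duplicate delivery attempt
		{"stack-b", "rpt-2", "drift#2"},
		{"stack-a", "rpt-3", "drift#3"},
	}
	fmt.Println(deliver(msgs))
}
```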
### 2.3 State Manager
The State Manager is a shared library (Go) used by both the Drift Agent and the SaaS-side Event Processor. It handles Terraform state parsing, resource graph construction, and drift classification.
**Resource Graph:**
The State Manager builds a dependency graph from Terraform state. This is critical for blast radius analysis — when a resource drifts, what else might be affected?
```go
type ResourceGraph struct {
	Nodes map[string]*ResourceNode // key: resource address (e.g., "aws_security_group.api")
	Edges []ResourceEdge           // dependency relationships
}

type ResourceNode struct {
	Address    string                 // e.g., "module.networking.aws_security_group.api"
	Type       string                 // e.g., "aws_security_group"
	Provider   string                 // e.g., "registry.terraform.io/hashicorp/aws"
	Attributes map[string]interface{} // current state attributes
	DriftState DriftState             // clean, drifted, unknown
}

type ResourceEdge struct {
	From string // resource address
	To   string // resource address
	Type string // "depends_on", "reference", "implicit"
}
```
The graph is built by analyzing attribute cross-references in state (e.g., `aws_instance.web.vpc_security_group_ids` references `aws_security_group.web.id`). This isn't perfect — Terraform state doesn't store the full dependency graph — but it catches 80%+ of relationships.
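A sketch of that cross-reference pass in Go (flattened attributes and `id`-only matching; the real implementation handles nested structures and many more identifier attributes):

```go
package main

import "fmt"

// inferEdges scans every resource's attribute values: if resource A has
// a value equal to resource B's "id", record an implicit A -> B edge.
func inferEdges(attrs map[string]map[string]interface{}) [][2]string {
	// Index each resource's id attribute.
	idOwner := map[string]string{}
	for addr, a := range attrs {
		if id, ok := a["id"].(string); ok {
			idOwner[id] = addr
		}
	}
	var edges [][2]string
	for addr, a := range attrs {
		for _, v := range a {
			// Handle scalar strings and lists of strings.
			vals, _ := v.([]interface{})
			if s, ok := v.(string); ok {
				vals = []interface{}{s}
			}
			for _, item := range vals {
				if s, ok := item.(string); ok {
					if owner, ok := idOwner[s]; ok && owner != addr {
						edges = append(edges, [2]string{addr, owner})
					}
				}
			}
		}
	}
	return edges
}

func main() {
	state := map[string]map[string]interface{}{
		"aws_security_group.web": {"id": "sg-abc123"},
		"aws_instance.web": {
			"id":                     "i-0ff",
			"vpc_security_group_ids": []interface{}{"sg-abc123"},
		},
	}
	fmt.Println(inferEdges(state))
}
```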
**Drift Classification:**
Every detected drift is classified along two axes:
| Severity | Criteria | Examples |
|---|---|---|
| **Critical** | Security boundary change, IAM escalation, public exposure | Security group opened to 0.0.0.0/0, IAM policy with `*:*`, S3 bucket made public |
| **High** | Configuration change affecting availability or data | RDS parameter change, ECS task definition change, Lambda runtime change |
| **Medium** | Non-critical configuration change | Instance type change, tag modification on critical resources, DNS TTL change |
| **Low** | Cosmetic or expected drift | Tag-only changes, description updates, ASG desired count (auto-scaling) |
Classification rules are defined in a YAML config shipped with the agent:
```yaml
# drift-classification.yaml
rules:
  - resource_type: aws_security_group
    attribute: ingress
    condition: "contains_cidr('0.0.0.0/0')"
    severity: critical
    category: security
  - resource_type: aws_iam_role_policy
    attribute: policy
    severity: high
    category: security
  - resource_type: aws_db_instance
    attribute: parameter_group_name
    severity: high
    category: configuration
  - resource_type: "*"
    attribute: tags
    severity: low
    category: tags
  # Default: anything not matched
  - resource_type: "*"
    attribute: "*"
    severity: medium
    category: configuration
```
Customers can override these rules in their agent config to match their risk tolerance.
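Rule evaluation can be sketched as a first-match scan (an assumption consistent with the catch-all rule appearing last; `condition` expressions are omitted here):

```go
package main

import "fmt"

// Rule mirrors one entry of drift-classification.yaml; "*" is a
// wildcard. Rules are evaluated top to bottom, first match wins.
type Rule struct {
	ResourceType, Attribute, Severity, Category string
}

func classify(rules []Rule, resourceType, attribute string) (severity, category string) {
	match := func(pat, s string) bool { return pat == "*" || pat == s }
	for _, r := range rules {
		if match(r.ResourceType, resourceType) && match(r.Attribute, attribute) {
			return r.Severity, r.Category
		}
	}
	// Unreachable when the rule set ends with the "*"/"*" catch-all.
	return "medium", "configuration"
}

func main() {
	rules := []Rule{
		{"aws_iam_role_policy", "policy", "high", "security"},
		{"*", "tags", "low", "tags"},
		{"*", "*", "medium", "configuration"},
	}
	sev, cat := classify(rules, "aws_s3_bucket", "tags")
	fmt.Println(sev, cat) // low tags
}
```

Customer overrides would simply prepend rules to this list before the shipped defaults.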
### 2.4 Remediation Engine
The Remediation Engine handles two workflows: **Revert** (make cloud match code) and **Accept** (make code match cloud). Both are initiated from Slack action buttons or the dashboard.
**Revert Workflow:**
```mermaid
sequenceDiagram
participant U as Engineer (Slack)
participant NS as Notification Service
participant RE as Remediation Engine
participant DA as Drift Agent
participant TF as Terraform (in customer VPC)
U->>NS: Click [Revert] button
NS->>RE: Initiate revert for resource X in stack Y
RE->>RE: Generate scoped plan (target resource only)
RE->>RE: Compute blast radius (resource graph)
RE->>NS: Send confirmation with blast radius
NS->>U: "Reverting aws_security_group.api will affect 0 other resources. Proceed? [Confirm] [Cancel]"
U->>NS: Click [Confirm]
NS->>RE: Confirmed
RE->>DA: Execute: terraform apply -target=aws_security_group.api -auto-approve
DA->>TF: Run terraform apply (scoped)
TF-->>DA: Apply complete
DA->>RE: Result: success
RE->>NS: Notify success
NS->>U: "✅ aws_security_group.api reverted to declared state. Drift resolved."
```
**Accept Workflow (Code-to-Cloud):**
When the engineer clicks [Accept], the drift is intentional and the code should be updated to match reality:
1. Remediation Engine generates a Terraform code patch that updates the resource definition to match the current cloud state
2. Creates a branch and PR on the connected GitHub repository
3. PR includes: the code change, a description of the drift, CloudTrail attribution, and a link to the drift event in the dd0c dashboard
4. Engineer reviews and merges the PR through normal code review process
5. On merge, the next `terraform apply` in CI/CD is a no-op for this resource (code now matches cloud)
6. Agent detects the state file update and marks the drift as resolved
**Approval Workflow (V2 — Pro/Business tiers):**
For teams that want approval gates before remediation:
```yaml
# remediation-policy.yaml
policies:
  - resource_type: aws_security_group
    action: auto-revert
    condition: "severity == 'critical'"
    # No approval needed — auto-revert critical security drift
  - resource_type: aws_iam_*
    action: require-approval
    approvers: ["@security-team"]
    timeout: 4h
    # IAM changes need security team sign-off
  - resource_type: aws_db_instance
    action: require-approval
    approvers: ["@dba-team", "@infra-lead"]
    timeout: 24h
    # Database changes need DBA approval
  - resource_type: "*"
    attribute: tags
    action: digest
    # Tag drift goes in the daily digest, no action needed
```
### 2.5 Notification Service
The Notification Service is a Lambda function that formats drift events into rich, actionable messages and delivers them to configured channels.
**Slack Block Kit Message (Primary):**
```json
{
  "blocks": [
    {
      "type": "header",
      "text": { "type": "plain_text", "text": "🔴 Critical Drift Detected" }
    },
    {
      "type": "section",
      "fields": [
        { "type": "mrkdwn", "text": "*Stack:*\nprod-networking" },
        { "type": "mrkdwn", "text": "*Resource:*\naws_security_group.api" },
        { "type": "mrkdwn", "text": "*Changed by:*\narn:aws:iam::123456:user/jsmith" },
        { "type": "mrkdwn", "text": "*When:*\n2 minutes ago" }
      ]
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*What changed:*\n```ingress rule added: 0.0.0.0/0:443 (HTTPS from anywhere)```"
      }
    },
    {
      "type": "section",
      "text": { "type": "mrkdwn", "text": "*Blast radius:* 0 dependent resources\n*Owner:* @ravi" }
    },
    {
      "type": "actions",
      "elements": [
        { "type": "button", "text": { "type": "plain_text", "text": "🔄 Revert" }, "style": "danger", "action_id": "drift_revert", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "✅ Accept" }, "action_id": "drift_accept", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "⏰ Snooze 24h" }, "action_id": "drift_snooze", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "👤 Assign" }, "action_id": "drift_assign", "value": "evt_abc123" }
      ]
    }
  ]
}
```
**Notification Routing:**
| Severity | Slack | Email | PagerDuty | Webhook |
|---|---|---|---|---|
| Critical | Immediate (channel + DM to owner) | Immediate | Page on-call (Pro+) | Immediate |
| High | Immediate (channel) | Immediate | Alert (no page) | Immediate |
| Medium | Batched (hourly digest) | Daily digest | — | Batched |
| Low | Daily digest | Weekly digest | — | Batched |
**Slack Interactivity:**
When an engineer clicks a button, Slack sends an interaction payload to our API Gateway endpoint (`POST /v1/slack/interactions`). The Lambda:
1. Verifies the Slack request signature
2. Looks up the drift event by ID
3. Checks the user's permissions (RBAC — can this user remediate this stack?)
4. Initiates the appropriate workflow (revert, accept, snooze, assign)
5. Updates the original Slack message to reflect the action taken
---
## Section 3: DATA ARCHITECTURE
### 3.1 Database Schema (PostgreSQL RDS)
Multi-tenant PostgreSQL with Row-Level Security (RLS). Every table includes `org_id` and all queries are scoped by it. No cross-tenant data leakage by design.
```sql
-- ============================================================
-- ORGANIZATIONS & AUTH
-- ============================================================
CREATE TABLE organizations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
slug TEXT UNIQUE NOT NULL, -- e.g., "acme-corp"
plan TEXT NOT NULL DEFAULT 'free', -- free, starter, pro, business, enterprise
stripe_customer_id TEXT,
max_stacks INT NOT NULL DEFAULT 3,
poll_interval_s INT NOT NULL DEFAULT 86400, -- default: daily (free tier)
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organizations(id),
email TEXT NOT NULL,
name TEXT,
role TEXT NOT NULL DEFAULT 'member', -- owner, admin, member, viewer
cognito_sub TEXT UNIQUE,
slack_user_id TEXT, -- for Slack DM routing
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE(org_id, email)
);
-- ============================================================
-- STACKS & RESOURCES
-- ============================================================
CREATE TABLE stacks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organizations(id),
name TEXT NOT NULL, -- e.g., "prod-networking"
backend_type TEXT NOT NULL DEFAULT 's3', -- s3, gcs, tfc, local
backend_hash TEXT NOT NULL, -- SHA256 of backend config (no raw paths stored)
iac_tool TEXT NOT NULL DEFAULT 'terraform', -- terraform, opentofu, pulumi
environment TEXT, -- prod, staging, dev
owner_user_id UUID REFERENCES users(id),
slack_channel TEXT, -- override notification channel per stack
drift_score REAL NOT NULL DEFAULT 100.0, -- 0-100, 100 = clean
last_check_at TIMESTAMPTZ,
last_drift_at TIMESTAMPTZ,
resource_count INT NOT NULL DEFAULT 0,
drifted_count INT NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE(org_id, backend_hash)
);
CREATE TABLE resources (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organizations(id),
stack_id UUID NOT NULL REFERENCES stacks(id) ON DELETE CASCADE,
address TEXT NOT NULL, -- e.g., "module.vpc.aws_security_group.api"
resource_type TEXT NOT NULL, -- e.g., "aws_security_group"
provider TEXT NOT NULL, -- e.g., "registry.terraform.io/hashicorp/aws"
cloud_id TEXT, -- e.g., "sg-abc123" (for cross-referencing)
drift_state TEXT NOT NULL DEFAULT 'clean', -- clean, drifted, unknown, ignored
last_drift_at TIMESTAMPTZ,
drift_count INT NOT NULL DEFAULT 0, -- lifetime drift count for this resource
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE(stack_id, address)
);
CREATE INDEX idx_resources_type ON resources(org_id, resource_type);
CREATE INDEX idx_resources_drift ON resources(org_id, drift_state) WHERE drift_state = 'drifted';
-- ============================================================
-- DRIFT EVENTS
-- ============================================================
CREATE TABLE drift_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organizations(id),
stack_id UUID NOT NULL REFERENCES stacks(id),
resource_id UUID NOT NULL REFERENCES resources(id),
report_id UUID NOT NULL, -- groups events from same detection run
severity TEXT NOT NULL, -- critical, high, medium, low
category TEXT NOT NULL, -- security, configuration, tags, scaling
detection_method TEXT NOT NULL, -- event_driven, scheduled
-- The drift diff (JSONB for flexible querying)
diff JSONB NOT NULL,
/* Example diff:
{
"attributes": {
"ingress": {
"old": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8"]}],
"new": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8", "0.0.0.0/0"]}]
}
}
}
*/
-- CloudTrail attribution (nullable — scheduled checks don't have this)
attributed_principal TEXT, -- IAM ARN who made the change
attributed_source_ip TEXT, -- source IP
attributed_event_name TEXT, -- e.g., "AuthorizeSecurityGroupIngress"
attributed_at TIMESTAMPTZ, -- when the change was made
-- Resolution
status TEXT NOT NULL DEFAULT 'open', -- open, resolved, accepted, snoozed, ignored
resolved_by UUID REFERENCES users(id),
resolved_at TIMESTAMPTZ,
resolution_type TEXT, -- reverted, accepted, snoozed, auto_reverted
resolution_note TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_drift_events_stack ON drift_events(org_id, stack_id, created_at DESC);
CREATE INDEX idx_drift_events_status ON drift_events(org_id, status) WHERE status = 'open';
CREATE INDEX idx_drift_events_severity ON drift_events(org_id, severity, created_at DESC);
-- ============================================================
-- REMEDIATION PLANS
-- ============================================================
CREATE TABLE remediation_plans (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organizations(id),
drift_event_id UUID NOT NULL REFERENCES drift_events(id),
stack_id UUID NOT NULL REFERENCES stacks(id),
plan_type TEXT NOT NULL, -- revert, accept
status TEXT NOT NULL DEFAULT 'pending', -- pending, approved, executing, completed, failed, cancelled
-- For revert: terraform plan output (scrubbed)
plan_output TEXT,
target_resources TEXT[], -- resource addresses targeted
blast_radius INT NOT NULL DEFAULT 0, -- number of dependent resources affected
-- For accept: generated code patch
code_patch TEXT,
pr_url TEXT, -- GitHub PR URL if created
-- Approval
requested_by UUID REFERENCES users(id),
approved_by UUID REFERENCES users(id),
approved_at TIMESTAMPTZ,
-- Execution
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
error_message TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- ============================================================
-- AGENTS
-- ============================================================
CREATE TABLE agents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES organizations(id),
name TEXT NOT NULL, -- e.g., "prod-agent-1"
api_key_hash TEXT NOT NULL, -- bcrypt hash of the agent API key
status TEXT NOT NULL DEFAULT 'active', -- active, inactive, revoked
last_heartbeat TIMESTAMPTZ,
agent_version TEXT,
deployment_type TEXT, -- ecs, github_action, binary
stacks TEXT[], -- stack IDs this agent monitors
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- ============================================================
-- ROW-LEVEL SECURITY
-- ============================================================
ALTER TABLE stacks ENABLE ROW LEVEL SECURITY;
ALTER TABLE resources ENABLE ROW LEVEL SECURITY;
ALTER TABLE drift_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE remediation_plans ENABLE ROW LEVEL SECURITY;
ALTER TABLE agents ENABLE ROW LEVEL SECURITY;
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
-- All policies follow the same pattern: current_setting('app.current_org_id')
CREATE POLICY org_isolation ON stacks
USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON resources
USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON drift_events
USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON remediation_plans
USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON agents
USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON users
USING (org_id = current_setting('app.current_org_id')::UUID);
```
### 3.2 Event Sourcing (DynamoDB)
Every drift detection result is appended to an immutable event store in DynamoDB. This provides a complete audit trail that can never be modified — critical for compliance evidence.
```
Table: drift-events-log
Partition Key: org_id#stack_id (String)
Sort Key: timestamp#event_id (String)
TTL: expires_at (90 days for free, 1 year for paid, 7 years for enterprise)
Attributes:
- event_type: "drift_detected" | "drift_resolved" | "remediation_started" | "remediation_completed" | "agent_heartbeat" | "stack_registered"
- payload: Full event payload (JSON)
- report_id: Groups events from same detection run
- checksum: SHA256 of payload (tamper detection)
```
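A sketch of how one item might be composed from that key schema (field names follow the table description above; this is illustrative, not the service's actual marshaling code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// eventItem mirrors one row of drift-events-log.
type eventItem struct {
	PK, SK, EventType, Payload, Checksum string
}

// newEventItem composes the composite keys and the tamper-detection
// checksum described in the table schema.
func newEventItem(orgID, stackID, timestamp, eventID, eventType, payload string) eventItem {
	sum := sha256.Sum256([]byte(payload)) // SHA256 of payload
	return eventItem{
		PK:        orgID + "#" + stackID,     // partition key: org_id#stack_id
		SK:        timestamp + "#" + eventID, // sort key: timestamp#event_id
		EventType: eventType,
		Payload:   payload,
		Checksum:  hex.EncodeToString(sum[:]),
	}
}

func main() {
	it := newEventItem("org-1", "stack-9", "2026-02-28T12:00:00Z", "evt-42",
		"drift_detected", `{"severity":"critical"}`)
	fmt.Println(it.PK, it.SK)     // org-1#stack-9 2026-02-28T12:00:00Z#evt-42
	fmt.Println(len(it.Checksum)) // 64
}
```

Using an ISO-8601 timestamp in the sort key keeps lexicographic order equal to chronological order, so per-stack history queries are a single key-condition range scan.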
**Why DynamoDB for the event store (not PostgreSQL):**
1. Append-only workload — DynamoDB excels at high-throughput writes with no read contention
2. TTL-based expiration — automatic cleanup per tier without cron jobs
3. DynamoDB Streams — enables downstream consumers (analytics, compliance report generation) without polling
4. Cost — at 1,000 stacks × 288 checks/day × 365 days = ~105M items/year. DynamoDB on-demand pricing: ~$130/year for writes, ~$25/year for storage. PostgreSQL would need periodic archival to maintain query performance.
**DynamoDB Streams Consumer:**
A Lambda function subscribes to the DynamoDB Stream and:
1. Aggregates drift metrics (drift rate, MTTR, drift-by-resource-type) into a PostgreSQL `drift_metrics` table for dashboard queries
2. Generates daily/weekly compliance digest reports (stored in S3)
3. Feeds the drift score calculation engine
### 3.3 State Snapshot Storage (S3)
When the agent performs a full scheduled check, it can optionally upload a sanitized state snapshot to S3. This enables:
- Historical comparison ("what did the state look like last Tuesday?")
- Compliance evidence ("here's the state at the time of the audit")
- Debugging ("the drift diff looks wrong — let me check the raw state")
```
Bucket: dd0c-state-snapshots-{account_id}
Prefix: {org_id}/{stack_id}/{YYYY}/{MM}/{DD}/{timestamp}-{report_id}.json.gz
Lifecycle:
- Standard: 0-30 days
- Infrequent Access: 30-90 days
- Glacier Instant: 90-365 days
- Glacier Deep: 365+ days (enterprise only)
- Expire: Per tier TTL (90d free, 1yr paid, 7yr enterprise)
Encryption: SSE-S3 (AES-256) + bucket policy enforcing encryption
Versioning: Enabled (tamper protection)
Object Lock: Compliance mode for enterprise tier (WORM — auditors love this)
```
**What's in a state snapshot:**
NOT the raw Terraform state file. The agent sanitizes it:
1. All `sensitive` attributes → `[REDACTED]`
2. All `private` instance data → stripped
3. Backend configuration → hashed
4. Account IDs → replaced with keyed hashes (only the customer's agent holds the key needed to map them back to real IDs)
The snapshot is useful for drift comparison but cannot be used to reconstruct the customer's infrastructure or extract secrets.
### 3.4 Multi-Tenant Data Isolation
Three layers of isolation:
**Layer 1: Application-Level (RLS)**
Every API request sets `app.current_org_id` on the PostgreSQL session before executing queries. Row-Level Security policies ensure queries only return rows matching the org. Even a SQL injection in application queries cannot reach cross-tenant rows, provided the application connects as a dedicated non-superuser role (RLS does not apply to superusers or, unless forced, to table owners).
```typescript
// Middleware: pin a client and set org context for this request's queries.
// Note: SET does not accept bind parameters, so use set_config() instead.
// Run it on a dedicated client — with a shared pool, plain pool.query()
// would set the variable on an arbitrary connection.
async function setOrgContext(req: Request, res: Response, next: NextFunction) {
  const orgId = req.auth.orgId; // from JWT claims
  const client = await pool.connect();
  await client.query("SELECT set_config('app.current_org_id', $1, false)", [orgId]);
  req.db = client; // downstream handlers use req.db and release it when done
  next();
}
```
**Layer 2: Infrastructure-Level (S3 Prefixes + IAM)**
State snapshots are stored under `{org_id}/` prefixes. The SaaS application IAM role scopes object access to the prefix matching the authenticated org. Note that the `s3:prefix` condition key applies only to `s3:ListBucket`; for `s3:GetObject`, the prefix must be enforced in the resource ARN itself:
```json
{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": "arn:aws:s3:::dd0c-state-snapshots-*/${aws:PrincipalTag/org_id}/*"
}
```
**Layer 3: Encryption-Level (Per-Org KMS Keys — Enterprise)**
Enterprise tier customers get a dedicated KMS key for encrypting their data at rest. This enables:
- Customer-controlled key rotation
- Key deletion = cryptographic data destruction (for offboarding)
- CloudTrail logging of all key usage (customer can audit our access)
**Data Residency (V2):**
For EU customers requiring GDPR data residency, deploy a separate RDS instance + S3 bucket in `eu-west-1`. The application routes based on `org.data_region`. This is a V2 feature — MVP runs single-region `us-east-1`.
---
## Section 4: INFRASTRUCTURE
### 4.1 AWS Architecture — SaaS Platform
```mermaid
graph TB
subgraph Public Edge
CF[CloudFront<br/>Web Dashboard CDN]
APIGW[API Gateway<br/>HTTP API]
end
subgraph Compute — ECS Fargate Cluster
EP[Event Processor<br/>1-10 tasks, auto-scale]
RE[Remediation Engine<br/>1-3 tasks, auto-scale]
DASH[Dashboard API<br/>2 tasks, target-tracking]
end
subgraph Serverless
NS[Notification Service<br/>Lambda]
AUTH_L[Auth Authorizer<br/>Lambda]
STREAM_L[DynamoDB Stream<br/>Consumer Lambda]
CRON_L[Cron Jobs<br/>Lambda + EventBridge Scheduler]
end
subgraph Data
RDS[(PostgreSQL 16<br/>RDS db.t4g.medium<br/>Multi-AZ)]
DDB[(DynamoDB<br/>On-Demand<br/>Event Store)]
S3_SNAP[S3<br/>State Snapshots]
S3_WEB[S3<br/>Web Dashboard Assets]
end
subgraph Messaging
SQS_IN[SQS FIFO<br/>drift-report-ingestion]
SQS_REM[SQS Standard<br/>remediation-commands]
SQS_NOTIFY[SQS Standard<br/>notification-fanout]
end
subgraph Auth & Secrets
COG[Cognito User Pool]
SM[Secrets Manager<br/>Slack tokens, DB creds]
KMS[KMS<br/>Encryption keys]
end
subgraph Monitoring
CW[CloudWatch<br/>Logs + Metrics + Alarms]
XR[X-Ray<br/>Distributed Tracing]
end
CF --> S3_WEB
CF --> APIGW
APIGW --> AUTH_L --> COG
APIGW --> SQS_IN
APIGW --> DASH
SQS_IN --> EP
EP --> RDS
EP --> DDB
EP --> S3_SNAP
EP --> SQS_NOTIFY
EP --> SQS_REM
SQS_NOTIFY --> NS
SQS_REM --> RE
RE --> RDS
DASH --> RDS
DASH --> S3_SNAP
DDB --> STREAM_L
STREAM_L --> RDS
```
**VPC Layout:**
```
VPC: 10.0.0.0/16 (us-east-1)

  Public Subnets (NAT Gateway, ALB):
    10.0.1.0/24  (us-east-1a)
    10.0.2.0/24  (us-east-1b)

  Private Subnets (ECS Tasks, Lambda):
    10.0.10.0/24 (us-east-1a)
    10.0.11.0/24 (us-east-1b)

  Isolated Subnets (RDS):
    10.0.20.0/24 (us-east-1a)
    10.0.21.0/24 (us-east-1b)

  VPC Endpoints (no NAT for AWS services):
    - com.amazonaws.us-east-1.sqs
    - com.amazonaws.us-east-1.dynamodb
    - com.amazonaws.us-east-1.s3
    - com.amazonaws.us-east-1.secretsmanager
    - com.amazonaws.us-east-1.kms
    - com.amazonaws.us-east-1.ecr.api
    - com.amazonaws.us-east-1.ecr.dkr
    - com.amazonaws.us-east-1.logs
```
### 4.2 Customer-Side Agent Deployment
The agent is deployed into the customer's AWS account via a Terraform module published to the Terraform Registry.
**Terraform Module: `dd0c/drift-agent/aws`**
```hcl
module "drift_agent" {
  source  = "dd0c/drift-agent/aws"
  version = "~> 1.0"

  # Required
  dd0c_api_key           = var.dd0c_api_key     # From dd0c dashboard
  terraform_state_bucket = "my-terraform-state" # S3 bucket with state files
  terraform_state_keys   = ["prod/*.tfstate"]   # Glob patterns for state files

  # Optional
  deployment_type = "ecs" # "ecs" | "lambda" | "binary"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
  poll_interval   = 300 # seconds (overridden by tier)

  # EventBridge real-time detection
  enable_eventbridge = true
  cloudtrail_name    = "main-trail" # Existing CloudTrail trail name

  # Resource type filter (optional — default: all supported types)
  resource_types = ["aws_security_group", "aws_iam_*", "aws_db_instance"]

  tags = {
    Environment = "production"
    ManagedBy   = "dd0c-drift"
  }
}
```
**What the module creates:**
| Resource | Purpose |
|---|---|
| ECS Task Definition + Service | Runs the drift agent container (Fargate, 0.25 vCPU, 512MB) |
| IAM Role: `dd0c-drift-agent` | Agent execution role with read-only permissions |
| IAM Policy: `dd0c-drift-readonly` | Read access to state bucket + describe APIs for monitored resource types |
| EventBridge Rules | Match CloudTrail write events for monitored resource types |
| SQS Queue: `dd0c-drift-events` | Buffer for EventBridge events consumed by agent |
| SQS DLQ: `dd0c-drift-events-dlq` | Dead letter queue for failed event processing |
| CloudWatch Log Group | Agent logs (retained 30 days) |
| Security Group | Egress-only to dd0c SaaS API endpoint + AWS service endpoints |
**Alternative: GitHub Actions (Zero Infrastructure)**
For teams that don't want to run infrastructure, the agent runs as a GitHub Action:
```yaml
# .github/workflows/drift-check.yml
name: Drift Check

on:
  schedule:
    - cron: '*/15 * * * *' # Every 15 minutes
  workflow_dispatch: {}

jobs:
  check:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # For OIDC auth to AWS
    steps:
      - uses: dd0c/drift-action@v1
        with:
          dd0c-api-key: ${{ secrets.DD0C_API_KEY }}
          aws-role-arn: arn:aws:iam::123456789012:role/dd0c-drift-readonly
          state-bucket: my-terraform-state
          state-keys: "prod/*.tfstate"
```
This approach trades real-time detection (no EventBridge) for zero infrastructure. Good for getting started; teams can upgrade to the ECS agent when they need real-time detection.
### 4.3 Cost Estimates
**SaaS Platform Costs (Monthly):**
| Component | 10 Stacks | 100 Stacks | 1,000 Stacks |
|---|---|---|---|
| RDS db.t4g.medium (Multi-AZ) | $140 | $140 | $280 (db.t4g.large) |
| ECS Fargate (Event Processor) | $15 | $35 | $150 |
| ECS Fargate (Dashboard API) | $30 | $30 | $60 |
| ECS Fargate (Remediation Engine) | $8 | $15 | $50 |
| Lambda (Notifications + Stream) | $1 | $5 | $30 |
| SQS | $1 | $3 | $15 |
| DynamoDB (On-Demand) | $2 | $15 | $130 |
| S3 (State Snapshots) | $1 | $5 | $40 |
| API Gateway | $4 | $15 | $80 |
| CloudFront | $5 | $5 | $10 |
| NAT Gateway | $35 | $35 | $70 |
| Cognito | $0 | $3 | $25 |
| CloudWatch / X-Ray | $10 | $20 | $50 |
| Secrets Manager | $2 | $2 | $5 |
| **Total SaaS Infra** | **~$254/mo** | **~$328/mo** | **~$995/mo** |
**Customer-Side Agent Costs (Per Customer, Monthly):**
| Component | Cost |
|---|---|
| ECS Fargate (0.25 vCPU, 512MB, always-on) | ~$9/mo |
| SQS Queue (EventBridge events) | ~$0.50/mo |
| CloudWatch Logs | ~$1/mo |
| EventBridge Rules | Free (no charge for rules or for events from AWS services) |
| **Total per customer** | **~$10.50/mo** |
This is important for pricing: the customer pays ~$10.50/mo in their own AWS bill to run the agent. The $49/mo Starter tier needs to deliver enough value to justify $49 + $10.50 = ~$60/mo total cost.
**Unit Economics at Scale:**
| Scale | Revenue (est.) | SaaS Infra Cost | Gross Margin |
|---|---|---|---|
| 10 customers (avg $99/mo) | $990/mo | $254/mo | 74% |
| 50 customers (avg $149/mo) | $7,450/mo | $328/mo | 96% |
| 200 customers (avg $199/mo) | $39,800/mo | $995/mo | 97% |
SaaS margins are excellent once past the fixed-cost floor (~$254/mo). The business breaks even at ~3 paying customers.
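The margin figures above follow from simple arithmetic; as a sanity check:

```go
package main

import "fmt"

// grossMargin returns the gross margin percentage for a given monthly
// revenue and SaaS infrastructure cost. Customer-side agent costs are
// excluded because the customer pays them, as in the table above.
func grossMargin(revenue, infraCost float64) float64 {
	return (revenue - infraCost) / revenue * 100
}

func main() {
	// The table rounds these to 74%, 96%, and 97%.
	fmt.Printf("10 customers:  %.1f%%\n", grossMargin(990, 254))
	fmt.Printf("50 customers:  %.1f%%\n", grossMargin(7450, 328))
	fmt.Printf("200 customers: %.1f%%\n", grossMargin(39800, 995))
}
```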
### 4.4 Scaling Strategy
**Phase 1: MVP (0-100 stacks)**
- Single RDS instance (db.t4g.medium, Multi-AZ)
- ECS Fargate with auto-scaling (min 1, max 3 per service)
- DynamoDB on-demand (auto-scales)
- Single region (us-east-1)
**Phase 2: Growth (100-1,000 stacks)**
- RDS read replica for dashboard queries (separate read/write paths)
- ECS auto-scaling up to 10 tasks per service
- SQS batch processing (batch size 10 → higher throughput)
- CloudFront caching for dashboard API (drift scores, stack lists — cache 60s)
**Phase 3: Scale (1,000-10,000 stacks)**
- RDS upgrade to db.r6g.large + read replicas
- Consider migrating event ingestion from SQS FIFO to Kinesis Data Streams (higher throughput, fan-out)
- DynamoDB DAX for hot-path reads (drift score lookups)
- Multi-region deployment (us-east-1 + eu-west-1) for data residency
- Connection pooling via RDS Proxy
**Phase 4: Enterprise (10,000+ stacks)**
- Dedicated RDS instances per large enterprise customer
- Kinesis + Lambda fan-out for event processing
- ElastiCache (Redis) for session management and rate limiting
- This is a "good problem to have" phase — re-architect based on actual bottlenecks
### 4.5 CI/CD Pipeline
```mermaid
graph LR
subgraph Developer
CODE[Push to main] --> GH[GitHub]
end
subgraph CI — GitHub Actions
GH --> LINT[Lint + Type Check]
LINT --> TEST[Unit Tests]
TEST --> INT[Integration Tests<br/>LocalStack]
INT --> BUILD[Docker Build<br/>+ Go Binary]
BUILD --> SCAN[Trivy Container Scan]
SCAN --> PUSH[Push to ECR]
end
subgraph CD — Terraform + ECS
PUSH --> TF_PLAN[Terraform Plan<br/>staging]
TF_PLAN --> APPROVE[Manual Approval<br/>for prod]
APPROVE --> TF_APPLY[Terraform Apply<br/>prod]
TF_APPLY --> ECS_DEPLOY[ECS Rolling Deploy<br/>Blue/Green]
ECS_DEPLOY --> SMOKE[Smoke Tests]
SMOKE --> DONE[✅ Deployed]
end
```
**Pipeline Details:**
| Stage | Tool | Duration |
|---|---|---|
| Lint + Type Check | ESLint + tsc (TypeScript), golangci-lint (Go) | ~30s |
| Unit Tests | Vitest (TypeScript), go test (Go) | ~60s |
| Integration Tests | LocalStack (SQS, DynamoDB, S3 emulation) | ~120s |
| Docker Build | Multi-stage Dockerfile, Go binary cross-compile | ~90s |
| Container Scan | Trivy (CVE scanning) | ~30s |
| ECR Push | Docker push to private ECR | ~20s |
| Terraform Plan | Plan against staging environment | ~30s |
| Manual Approval | GitHub Environment protection rule (prod) | Human |
| Terraform Apply | Apply to prod | ~60s |
| ECS Deploy | Rolling update (min healthy 100%, max 200%) | ~120s |
| Smoke Tests | Hit health endpoints, verify SQS consumption | ~30s |
| **Total (automated)** | | **~9 minutes** |
**Agent Release Pipeline:**
The Go agent binary is released separately:
1. Tag a release on GitHub (`v1.2.3`)
2. GoReleaser builds binaries for linux/amd64, linux/arm64, darwin/amd64, darwin/arm64
3. Docker image pushed to public ECR (for ECS deployment)
4. GitHub Action published to GitHub Marketplace
5. Terraform module version bumped in Terraform Registry
6. Changelog posted to dd0c blog + Slack community
---
## Section 5: SECURITY
### 5.1 IAM Role Design — Customer Accounts
The trust model is the hardest sell: customers are giving dd0c's agent read access to their Terraform state and cloud resource attributes. The architecture must make this as narrow and auditable as possible.
**Agent Execution Role (Customer-Side):**
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadTerraformState",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${state_bucket}",
        "arn:aws:s3:::${state_bucket}/${state_key_prefix}*"
      ]
    },
    {
      "Sid": "DescribeResources",
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "iam:Get*",
        "iam:List*",
        "rds:Describe*",
        "s3:GetBucket*",
        "s3:GetEncryptionConfiguration",
        "s3:GetLifecycleConfiguration",
        "lambda:GetFunction",
        "lambda:GetFunctionConfiguration",
        "lambda:ListTags",
        "ecs:Describe*",
        "ecs:List*",
        "route53:GetHostedZone",
        "route53:ListResourceRecordSets",
        "elasticloadbalancing:Describe*",
        "cloudfront:GetDistribution",
        "sns:GetTopicAttributes",
        "sqs:GetQueueAttributes",
        "dynamodb:DescribeTable",
        "kms:DescribeKey",
        "kms:GetKeyPolicy"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "${monitored_regions}"
        }
      }
    },
    {
      "Sid": "ConsumeEventBridgeQueue",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:*:${account_id}:dd0c-drift-events"
    },
    {
      "Sid": "WriteAgentLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:${account_id}:log-group:/dd0c/drift-agent:*"
    }
  ]
}
```
**Key design decisions:**
1. **No cross-account role for the SaaS.** The SaaS platform NEVER assumes a role in the customer's account. The agent runs with the customer's own IAM role. The SaaS only receives drift reports over HTTPS. This is the fundamental trust boundary.
2. **Read-only by default.** The agent role has zero write permissions. It can describe resources and read state files. It cannot modify anything.
3. **Region-scoped.** The `aws:RequestedRegion` condition limits describe calls to regions the customer explicitly configures. No global enumeration.
4. **State bucket scoped.** S3 access is limited to the specific state bucket and key prefix. Not `s3:*` on `*`.
### 5.2 Remediation IAM Role (Separate, Opt-In)
Remediation requires write access. This is a SEPARATE IAM role that customers opt into explicitly. It is never created by default.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformApply",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "iam:*",
        "rds:*",
        "s3:*",
        "lambda:*",
        "ecs:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "${monitored_regions}"
        },
        "StringLike": {
          "aws:ResourceTag/ManagedBy": "terraform"
        }
      }
    },
    {
      "Sid": "StateLock",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:*:${account_id}:table/${state_lock_table}"
    }
  ]
}
```
**Guardrails on remediation:**
1. **Tag-scoped:** The condition `aws:ResourceTag/ManagedBy = terraform` limits write actions to resources that are tagged as Terraform-managed. Resources created outside Terraform can't be modified.
2. **Approval required:** The SaaS never triggers remediation without explicit human approval (button click in Slack or dashboard). Auto-remediation policies are customer-configured and customer-approved.
3. **Scoped apply:** Remediation always uses `terraform apply -target=<resource>`. Never a full `terraform apply`. Blast radius is minimized.
4. **Audit trail:** Every remediation action is logged in the event store with: who approved it, when, what was changed, and the full terraform plan output.
5. **Kill switch:** Customers can revoke the remediation role at any time via IAM. The agent gracefully degrades to detect-only mode.
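Guardrail 3, sketched as command construction (`BuildRevertCmd` is an illustrative name, not the agent's real API):

```go
package main

import (
	"fmt"
	"os/exec"
)

// BuildRevertCmd builds the scoped terraform apply the remediation
// engine hands to the agent: always -target'ed to the single drifted
// resource address, never a full apply. (Illustrative sketch.)
func BuildRevertCmd(workdir, resourceAddr string) *exec.Cmd {
	cmd := exec.Command("terraform", "apply",
		"-target="+resourceAddr, // limit blast radius to the drifted resource
		"-auto-approve",         // human approval already happened in Slack/dashboard
		"-input=false",          // never block on interactive prompts
	)
	cmd.Dir = workdir
	return cmd
}

func main() {
	cmd := BuildRevertCmd("/work/stacks/prod", "aws_security_group.api")
	fmt.Println(cmd.Args)
}
```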
### 5.3 State File Security
Terraform state files are the crown jewels. They contain resource IDs, configuration, and — critically — secret values (database passwords, API keys, private keys). The architecture must handle this with extreme care.
**Principle: State files never leave the customer's account.**
The agent reads the state file in-memory within the customer's VPC. It extracts resource attributes for drift comparison. Before transmitting anything to the SaaS:
1. **Attribute filtering:** Only attributes relevant to drift detection are included in the report. The agent maintains an allowlist per resource type:
```yaml
# attribute-allowlist.yaml
aws_security_group:
  - ingress
  - egress
  - name
  - description
  - vpc_id
  - tags

aws_iam_role:
  - assume_role_policy
  - max_session_duration
  - path
  - permissions_boundary
  - tags
  # NOT included: inline policies (may contain secrets in conditions)

aws_db_instance:
  - engine
  - engine_version
  - instance_class
  - allocated_storage
  - storage_type
  - multi_az
  - publicly_accessible
  - vpc_security_group_ids
  - db_subnet_group_name
  - parameter_group_name
  - tags
  # NOT included: master_password, endpoint (could be used for targeting)
```
2. **Secret pattern scrubbing:** Even within allowed attributes, values matching secret patterns are redacted:
- AWS access keys (`AKIA...`)
- Database connection strings (`postgres://...`, `mysql://...`)
- Private keys (`-----BEGIN RSA PRIVATE KEY-----`)
- JWT tokens (`eyJ...`)
- Generic patterns: any value for keys containing `password`, `secret`, `token`, `key`, `credential`
3. **In-transit encryption:** All agent-to-SaaS communication uses TLS 1.3 with mTLS (mutual TLS). The agent presents a client certificate issued during registration. The SaaS validates it before accepting any data.
4. **At-rest encryption:** Drift diffs stored in PostgreSQL and DynamoDB are encrypted with KMS (AWS-managed key for standard tiers, customer-managed key for enterprise).
5. **No state file caching:** The agent does not write state file contents to disk. State is read from S3 into memory, processed, and discarded. The Go binary uses `mlock` to prevent state data from being swapped to disk.
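The pattern scrubber in step 2 might look like this (a sketch with an abbreviated pattern list; the production scrubber covers more formats):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Patterns from step 2 above (abbreviated). Applied to every value,
// even within allowlisted attributes.
var secretPatterns = []*regexp.Regexp{
	regexp.MustCompile(`AKIA[0-9A-Z]{16}`),         // AWS access key IDs
	regexp.MustCompile(`(?:postgres|mysql)://\S+`), // DB connection strings
	regexp.MustCompile(`-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----`),
	regexp.MustCompile(`eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+`), // JWTs
}

// Any attribute whose name contains one of these is redacted wholesale.
var keyHints = []string{"password", "secret", "token", "key", "credential"}

// Scrub redacts an attribute value before it leaves the customer VPC.
func Scrub(attr, value string) string {
	lower := strings.ToLower(attr)
	for _, hint := range keyHints {
		if strings.Contains(lower, hint) {
			return "[REDACTED]"
		}
	}
	for _, re := range secretPatterns {
		value = re.ReplaceAllString(value, "[REDACTED]")
	}
	return value
}

func main() {
	fmt.Println(Scrub("connection", "postgres://u:p@db:5432/x"))
	fmt.Println(Scrub("master_password", "hunter2"))
}
```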
### 5.4 SOC 2 Considerations
dd0c/drift will pursue SOC 2 Type II certification. The architecture supports the required controls:
| SOC 2 Criteria | How dd0c Addresses It |
|---|---|
| **CC6.1** Logical access controls | Cognito auth, RBAC, API key auth for agents, RLS in PostgreSQL |
| **CC6.2** Access provisioned/deprovisioned | User management via dashboard, API key rotation, agent revocation |
| **CC6.3** Access restricted to authorized | mTLS for agent connections, JWT validation for dashboard, VPC isolation for data tier |
| **CC7.1** Monitoring for anomalies | CloudWatch alarms, X-Ray tracing, agent heartbeat monitoring |
| **CC7.2** Incident response | Runbook in Confluence, PagerDuty integration, automated rollback via ECS |
| **CC8.1** Change management | Terraform IaC for all infrastructure, GitHub PR reviews, CI/CD pipeline |
| **A1.2** Recovery objectives | RDS Multi-AZ (RPO: 0, RTO: <5min), S3 cross-region replication (enterprise), DynamoDB point-in-time recovery |
| **C1.1** Confidentiality | State files never leave customer VPC, secret scrubbing, KMS encryption, TLS 1.3 |
**Compliance automation:**
The irony is not lost on us: a drift detection product must itself be drift-free. dd0c/drift will run dd0c/drift on its own infrastructure. Dogfooding as compliance evidence.
### 5.5 Trust Model
The trust model is the product's biggest adoption barrier. Here's how we address it at each level:
**Level 1: "I don't trust you with any access"**
→ GitHub Actions mode. The agent runs in the customer's GitHub Actions runner. dd0c only receives drift reports (no IAM role in customer account at all). The customer reviews the agent source code (open-source Go binary).
**Level 2: "I'll give you read-only access"**
→ Standard deployment. Agent runs in customer VPC with read-only IAM role. State files never leave the account. Only sanitized drift diffs are transmitted.
**Level 3: "I trust you to remediate"**
→ Remediation role (opt-in). Separate IAM role with write permissions. Scoped to tagged resources. Requires explicit human approval for every action.
**Level 4: "I trust you to auto-remediate"**
→ Auto-remediation policies (Business/Enterprise tier). Customer defines rules for automatic revert. Still uses the remediation IAM role. Full audit trail. Kill switch available.
**Open-source agent:**
The drift agent Go binary is open-source (Apache 2.0). Customers can:
- Audit the code to verify what data is collected and transmitted
- Build from source if they don't trust pre-built binaries
- Fork and modify for custom requirements
- Run in air-gapped environments with no SaaS connection (detect-only, local output)
This is the trust unlock. Security teams that won't install a closed-source agent will consider an open-source one they can audit.
---
## Section 6: MVP SCOPE
### 6.1 V1 Boundary — What Ships
The MVP is ruthlessly scoped to one IaC tool, one cloud, one notification channel, and one deployment model. Everything else is deferred. The goal is: a solo founder ships a working product in 30 days that detects Terraform drift in AWS and alerts via Slack.
**V1 Feature Matrix:**
| Capability | V1 (Launch) | Status |
|---|---|---|
| **IaC Support** | Terraform + OpenTofu (state v4 format only) | ✅ Ship |
| **Cloud Provider** | AWS only | ✅ Ship |
| **Detection: Scheduled** | Poll state vs. cloud on configurable interval | ✅ Ship |
| **Detection: Event-Driven** | CloudTrail → EventBridge → SQS → Agent | ✅ Ship |
| **Notification: Slack** | Block Kit messages with action buttons | ✅ Ship |
| **Remediation: Revert** | Scoped `terraform apply -target` via agent | ✅ Ship |
| **Remediation: Accept** | Auto-generate PR to update IaC code | ✅ Ship |
| **Dashboard** | Drift score, stack list, event history (minimal React SPA) | ✅ Ship |
| **Agent: ECS** | Terraform module for ECS Fargate deployment | ✅ Ship |
| **Agent: GitHub Actions** | Scheduled workflow, zero infra | ✅ Ship |
| **Onboarding CLI** | `drift init` auto-discovery | ✅ Ship |
| **Auth** | Email/password + GitHub OAuth via Cognito | ✅ Ship |
| **Billing** | Stripe integration, self-serve upgrade | ✅ Ship |
| **Multi-tenant** | RLS-based isolation, org/user/stack model | ✅ Ship |
**V1 Resource Type Coverage (Top 20):**
The agent ships with drift detection for the 20 most commonly drifted AWS resource types. This list is derived from driftctl's historical GitHub issues, r/terraform drift complaints, and CloudTrail event frequency data:
| Priority | Resource Type | Why It Drifts | Detection Complexity |
|---|---|---|---|
| 1 | `aws_security_group` / `aws_security_group_rule` | Emergency port opens during incidents | Low — `ec2:DescribeSecurityGroups` |
| 2 | `aws_iam_role` / `aws_iam_role_policy` | Permission escalation, console edits | Medium — policy document comparison |
| 3 | `aws_iam_policy` / `aws_iam_policy_attachment` | Inline policy edits, attachment changes | Medium — version document diff |
| 4 | `aws_s3_bucket` (config attributes) | Public access toggles, lifecycle changes | Medium — composite describe calls |
| 5 | `aws_db_instance` | Parameter group changes, storage scaling | Low — `rds:DescribeDBInstances` |
| 6 | `aws_instance` | Instance type changes, security group swaps | Low — `ec2:DescribeInstances` |
| 7 | `aws_lambda_function` | Runtime updates, env var changes | Low — `lambda:GetFunction` |
| 8 | `aws_ecs_service` | Task count changes, image tag updates | Low — `ecs:DescribeServices` |
| 9 | `aws_ecs_task_definition` | Container definition edits | Medium — JSON deep comparison |
| 10 | `aws_route53_record` | DNS record changes (manual cutover) | Low — `route53:ListResourceRecordSets` |
| 11 | `aws_lb_listener` / `aws_lb_listener_rule` | Routing rule changes | Low — `elbv2:DescribeListeners` |
| 12 | `aws_autoscaling_group` | Desired capacity (auto-scaling noise) | Low — needs noise filtering |
| 13 | `aws_cloudwatch_metric_alarm` | Threshold tweaks | Low — `cloudwatch:DescribeAlarms` |
| 14 | `aws_sns_topic` / `aws_sqs_queue` | Policy changes, subscription edits | Low — `sns:GetTopicAttributes` |
| 15 | `aws_dynamodb_table` | Capacity mode changes, GSI edits | Medium — `dynamodb:DescribeTable` |
| 16 | `aws_elasticache_cluster` | Node type changes, parameter group | Low — `elasticache:DescribeCacheClusters` |
| 17 | `aws_kms_key` | Key policy changes | Medium — policy document diff |
| 18 | `aws_cloudfront_distribution` | Origin changes, behavior edits | High — complex nested config |
| 19 | `aws_vpc` / `aws_subnet` | CIDR changes, tag drift | Low — `ec2:DescribeVpcs` |
| 20 | `aws_eip` / `aws_nat_gateway` | Association changes | Low — `ec2:DescribeAddresses` |
Resource types beyond the top 20 are detected as "unknown drift" — the agent reports that the resource exists in state but can't compare attributes. Customers can request priority for specific types via GitHub issues.
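The dispatch-plus-fallback behavior described above can be sketched as a type-to-comparator table — a simplified illustration (`Detect`, `compareByKeys`, and the flattened attribute maps are assumptions, not the agent's real data model):

```go
package main

import "fmt"

// compareFn compares a state resource against the live cloud resource
// and returns the names of drifted attributes.
type compareFn func(state, live map[string]string) []string

// comparators maps supported resource types to comparison logic
// (two shown; the real table covers the top 20).
var comparators = map[string]compareFn{
	"aws_security_group": compareByKeys("ingress", "egress", "tags"),
	"aws_db_instance":    compareByKeys("instance_class", "multi_az", "tags"),
}

// compareByKeys builds a comparator that diffs a fixed attribute set.
func compareByKeys(keys ...string) compareFn {
	return func(state, live map[string]string) []string {
		var drifted []string
		for _, k := range keys {
			if state[k] != live[k] {
				drifted = append(drifted, k)
			}
		}
		return drifted
	}
}

// Detect reports drift for a resource, falling back to "unknown drift"
// for types outside the supported set: present in state, not comparable.
func Detect(resourceType string, state, live map[string]string) (drifted []string, known bool) {
	cmp, ok := comparators[resourceType]
	if !ok {
		return nil, false
	}
	return cmp(state, live), true
}

func main() {
	d, known := Detect("aws_security_group",
		map[string]string{"ingress": "443"},
		map[string]string{"ingress": "443,22"})
	fmt.Println(d, known)
}
```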
### 6.2 What's Deferred to V2+
Saying "no" is the only way a solo founder ships in 30 days. Here's what's explicitly deferred and why:
**V2 (Month 3-4):**
| Feature | Why Deferred | Dependency |
|---|---|---|
| **CloudFormation support** | Different state format (stack resources API), different drift detection mechanism (`detect-stack-drift` API). Requires a separate parser and comparator. | New state parser module |
| **Pulumi support** | Pulumi state is stored differently (Pulumi Service backend or S3 with different schema). Requires a new state parser. | New state parser module |
| **Auto-remediation policies** | Per-resource-type automation rules (auto-revert, alert, digest, ignore). Requires policy engine, rule evaluation, and careful UX to avoid accidental auto-reverts. | Policy engine, approval workflow |
| **Compliance report generation** | SOC 2 / HIPAA evidence export (PDF/CSV). Requires report templating, data aggregation, and export pipeline. | DynamoDB event store populated with 30+ days of data |
| **Drift trends & analytics** | Time-series charts (drift rate, MTTR, most-drifted resources). Requires metrics aggregation pipeline and charting frontend. | DynamoDB Streams consumer, charting library |
| **PagerDuty / OpsGenie integration** | Route critical drift through existing on-call. Requires integration auth, event mapping, and escalation logic. | Notification service extension |
| **Teams & RBAC** | Multi-team support, role-based permissions, stack-level access control. Requires authorization layer beyond basic org membership. | Auth service extension |
**V3 (Month 6-9):**
| Feature | Why Deferred |
|---|---|
| **Multi-cloud (Azure, GCP)** | Each cloud requires its own describe API mapping, authentication model, and event pipeline. Triple the agent complexity. |
| **Drift prediction (ML)** | Requires aggregate data from 500+ customers to build meaningful models. Can't do this at launch. |
| **Industry benchmarking** | Same data requirement as prediction. Need critical mass of anonymized drift data. |
| **SSO / SAML** | Enterprise auth. Not needed until enterprise customers appear. Cognito supports it when ready. |
| **Full API & webhooks** | Public API for programmatic access. V1 has internal APIs only. Public API requires versioning, rate limiting, documentation, and SDK generation. |
| **dd0c platform integration** | Cross-module data flow (drift → alert, drift → portal). Requires dd0c/alert and dd0c/portal to exist first. |
**Explicitly NOT building (ever, unless market demands it):**
| Anti-Feature | Why Not |
|---|---|
| CI/CD orchestration | That's Spacelift/env0's game. We detect drift, we don't run pipelines. |
| Policy-as-code engine (OPA/Sentinel) | Adjacent but different problem. Integrate with existing policy tools, don't build one. |
| Cost management | That's dd0c/cost. Separate product, separate concern. |
| Service catalog | That's dd0c/portal. Drift feeds into it, doesn't replace it. |
| Multi-cloud state management | We read state, we don't manage it. No state migration, no state locking, no remote backend hosting. |
### 6.3 Onboarding Flow
The onboarding flow is the product's first impression. It must go from CLI install to first Slack alert in under 5 minutes. Every second of friction is a lost conversion.
**Flow: CLI-First Onboarding**
```
Step 1: Install CLI
─────────────────────────────────────────────
$ brew install dd0c/tap/drift-cli
# or: curl -sSL https://get.dd0c.dev/drift | sh
# or: go install github.com/dd0c/drift-cli@latest
Step 2: Authenticate
─────────────────────────────────────────────
$ drift auth login
→ Opens browser: https://app.dd0c.dev/auth
→ GitHub OAuth or email/password
→ CLI receives token via localhost callback
✅ Authenticated as brian@dd0c.dev (org: dd0c)
Step 3: Auto-Discover State Backends
─────────────────────────────────────────────
$ drift init
🔍 Scanning for Terraform state backends...
Found 3 state backends:
1. s3://acme-terraform-state/prod/networking.tfstate (23 resources)
2. s3://acme-terraform-state/prod/compute.tfstate (47 resources)
3. s3://acme-terraform-state/staging/main.tfstate (31 resources)
Register all 3 stacks? [Y/n]: Y
✅ Registered 3 stacks (101 resources total)
Org plan: Free (3 stacks max) — you're at capacity
Step 4: Configure Slack
─────────────────────────────────────────────
$ drift connect slack
→ Opens browser: Slack OAuth install flow
→ Select workspace and channel (#infrastructure-alerts)
✅ Connected to Slack workspace "Acme Corp" → #infrastructure-alerts
Step 5: First Drift Check
─────────────────────────────────────────────
$ drift check --all
🔍 Checking 3 stacks (101 resources)...
Stack: prod-networking (23 resources)
🔴 CRITICAL aws_security_group.api — ingress rule added (0.0.0.0/0:443)
🟡 MEDIUM aws_route53_record.api — TTL changed (300 → 60)
✅ 21 resources clean
Stack: prod-compute (47 resources)
🟠 HIGH aws_iam_role.lambda_exec — policy document changed
🔵 LOW aws_instance.worker[0] — tags.Environment changed
✅ 45 resources clean
Stack: staging-main (31 resources)
✅ All 31 resources clean
Summary: 4 drifted resources across 3 stacks (96% aligned)
📨 Slack alerts sent to #infrastructure-alerts
Step 6: Deploy Agent (Optional — for continuous monitoring)
─────────────────────────────────────────────
$ drift agent deploy --type=github-action
→ Generates .github/workflows/drift-check.yml
→ Creates GitHub secret DD0C_API_KEY via gh CLI
✅ Agent deployed — drift checks will run every 15 minutes
# OR for ECS deployment:
$ drift agent deploy --type=ecs --vpc-id=vpc-abc123 --subnets=subnet-1,subnet-2
→ Generates Terraform module in ./dd0c-drift-agent/
→ Run: cd dd0c-drift-agent && terraform init && terraform apply
```
**Auto-Discovery Logic:**
The `drift init` command discovers state backends through multiple strategies:
| Strategy | How It Works | Coverage |
|---|---|---|
| **AWS credential chain** | Uses default AWS credentials to scan S3 buckets matching common patterns (`*-terraform-state`, `*-tfstate`, `*-tf-state`) | ~60% of teams |
| **Terraform config scan** | Walks current directory tree for `*.tf` files, parses `backend` blocks | ~80% of teams (if run from repo root) |
| **Environment variables** | Reads `TF_STATE_BUCKET`, `TF_WORKSPACE`, `AWS_DEFAULT_REGION` | ~30% of teams |
| **Terraform Cloud/Enterprise** | Checks `~/.terraform.d/credentials.tfrc.json` for TFC tokens, queries workspace API | ~15% of teams (TFC users) |
| **Interactive fallback** | If auto-discovery finds nothing: "Enter your S3 state bucket name:" | 100% (manual) |
The goal: 80% of users hit `drift init` and see their stacks listed without typing a bucket name.
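The first strategy's bucket-name matching can be sketched as follows (illustrative only; `matchStateBuckets` is hypothetical, and the real CLI also probes bucket contents for `*.tfstate` keys):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Bucket-name patterns from the AWS credential chain strategy above.
var statePatterns = []string{"*-terraform-state", "*-tfstate", "*-tf-state"}

// matchStateBuckets filters a bucket listing (e.g. from s3:ListAllMyBuckets)
// down to likely Terraform state buckets.
func matchStateBuckets(buckets []string) []string {
	var out []string
	for _, b := range buckets {
		for _, p := range statePatterns {
			if ok, _ := path.Match(p, strings.ToLower(b)); ok {
				out = append(out, b)
				break
			}
		}
	}
	return out
}

func main() {
	fmt.Println(matchStateBuckets([]string{"acme-terraform-state", "acme-logs", "acme-tfstate"}))
}
```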
**Onboarding Metrics:**
| Metric | Target | Kill Threshold |
|---|---|---|
| Time from install to first `drift check` output | < 3 minutes | > 10 minutes |
| Time from signup to first Slack alert | < 5 minutes | > 15 minutes |
| `drift init` auto-discovery success rate | > 70% | < 40% |
| Onboarding completion rate (install → Slack connected) | > 60% | < 30% |
| Drop-off at each step | < 15% per step | > 30% at any step |
### 6.4 Technical Debt Budget
A solo founder shipping in 30 days will accumulate technical debt. The key is to accumulate it intentionally, in known locations, with a plan to pay it down.
**Acceptable Debt (Ship With It):**
| Debt Item | Where | Why It's Acceptable | Pay Down By |
|---|---|---|---|
| **Hardcoded resource type mappings** | Agent: `resource_mapper.go` | The top 20 resource types are hardcoded as Go structs with describe API calls. No plugin system. Adding type 21 requires a code change and agent release. | V2: Plugin system or YAML-driven resource type definitions |
| **Single-region SaaS** | Infrastructure: `us-east-1` only | Multi-region adds complexity (RDS replication, S3 cross-region, routing). Not needed until EU customers demand GDPR residency. | V2: `eu-west-1` deployment when first EU enterprise customer signs |
| **No database migrations framework** | SaaS: raw SQL files | At MVP, the schema is small enough to manage with numbered SQL files. No Flyway/Liquibase. | Month 3: Adopt `golang-migrate` or Prisma Migrate before schema exceeds 20 tables |
| **Minimal error handling in CLI** | CLI: `drift init`, `drift check` | Error messages are functional but not polished. Stack traces may leak in edge cases. | Month 2: Error wrapping, user-friendly messages, `--debug` flag for verbose output |
| **No retry logic on Slack API** | Notification Service Lambda | Slack API failures drop the notification silently. No retry queue. At low volume, this is rare. | Month 2: SQS DLQ for failed Slack deliveries, retry with exponential backoff |
| **Dashboard is read-only** | Web SPA | V1 dashboard shows drift scores and event history. Settings, team management, and policy configuration are CLI-only. | V2: Full dashboard CRUD for stacks, policies, team members |
| **No rate limiting on public API** | API Gateway | API Gateway has default throttling (10K req/s) but no per-org rate limiting. At MVP scale, this is fine. | Month 3: Per-org rate limiting via API Gateway usage plans + API keys |
| **Test coverage < 80%** | Agent + SaaS | Integration tests cover the critical path (detect → notify → remediate). Unit test coverage will be ~60% at launch. | Month 2-3: Increase to 80%+ with focus on drift comparator and secret scrubber |
| **No OpenTelemetry** | SaaS services | V1 uses CloudWatch Logs + X-Ray. No custom metrics, no distributed trace correlation across services. | V2: OTel SDK integration, custom metrics (drift detection latency, queue depth, notification delivery rate) |
| **Monorepo without workspace tooling** | Repository structure | Single repo with `agent/`, `saas/`, `dashboard/`, `cli/` directories. No Nx/Turborepo. Builds are sequential. | Month 3: Turborepo or Nx for parallel builds when CI time exceeds 15 minutes |
**Unacceptable Debt (Do NOT Ship With):**
| Debt Item | Why It's Unacceptable |
|---|---|
| **No secret scrubbing** | Transmitting customer secrets to SaaS is a trust-destroying, potentially lawsuit-inducing failure. Secret scrubber ships in V1, fully tested. |
| **No RLS on PostgreSQL** | Cross-tenant data leakage is an existential risk. RLS is enabled from Day 1 with integration tests that verify isolation. |
| **No mTLS on agent connections** | Agent-to-SaaS communication without mutual TLS means anyone with an API key can impersonate an agent. mTLS ships in V1. |
| **No Stripe webhook verification** | Accepting unverified Stripe webhooks enables billing manipulation. Signature verification is a one-liner — no excuse to skip it. |
| **No input validation on drift diffs** | Malicious agents could inject SQL or XSS via crafted drift diffs. Input validation and parameterized queries are non-negotiable. |
| **No CloudTrail event signature verification** | Spoofed EventBridge events could inject fake drift and trigger unwanted remediation. Events are validated against CloudTrail digest files before being trusted. |
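The secret scrubber requirement is concrete enough to sketch. A minimal illustration of key-name and pattern-based redaction, run before any diff leaves the customer VPC; the function and rule names here are assumptions, and the shipped rule set would be broader and fully tested:

```go
package main

import (
	"regexp"
	"strings"
)

// Key substrings that force redaction regardless of value.
// Illustrative only; the real rule set is broader.
var sensitiveKeys = []string{
	"password", "secret", "token", "private_key", "api_key", "credential",
}

// Catches obvious credential material even under an innocuous key name.
var credentialPattern = regexp.MustCompile(`(?i)aws_secret_access_key|-----BEGIN [A-Z ]*PRIVATE KEY-----`)

// scrubValue redacts a single attribute value before it enters a drift diff.
func scrubValue(key, value string) string {
	lower := strings.ToLower(key)
	for _, k := range sensitiveKeys {
		if strings.Contains(lower, k) {
			return "[REDACTED]"
		}
	}
	if credentialPattern.MatchString(value) {
		return "[REDACTED]"
	}
	return value
}
```

The key point is that scrubbing happens agent-side: the SaaS never sees the original value, so a platform breach cannot leak customer secrets.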
**Debt Paydown Schedule:**
```
Month 1 (Launch): Ship with acceptable debt. Focus on working product.
Month 2: Error handling, Slack retry logic, test coverage to 70%
Month 3: Rate limiting, database migrations framework, Turborepo
Month 4 (V2 start): Plugin system for resource types, dashboard CRUD, OTel
Month 6: Multi-region, full test coverage (80%+), performance profiling
```
### 6.5 Solo Founder Operational Model
One person builds, ships, operates, markets, and supports this product. The architecture must minimize operational burden or the founder burns out before reaching $10K MRR.
**Operational Principles:**
1. **Managed services over self-hosted.** RDS over self-managed PostgreSQL. Cognito over self-hosted auth. SQS over self-managed RabbitMQ. Lambda over always-on notification servers. Every managed service is one fewer thing to page about at 3am.
2. **Alerts on business impact, not infrastructure metrics.** Don't alert on CPU > 80%. Alert on: "drift reports stopped arriving from agent X" (customer impact), "Slack notification delivery failed 3x" (customer impact), "RDS storage > 80%" (approaching outage). Fewer alerts = sustainable on-call.
3. **Automate recovery, not just detection.** ECS auto-restarts crashed tasks. Lambda retries on failure. SQS DLQ captures poison messages. RDS Multi-AZ fails over automatically. The founder should wake up to a resolved incident, not an active one.
4. **Weekly maintenance window, not continuous ops.** Sunday evening: review CloudWatch dashboards, check DLQ depth, review error logs, update dependencies, run `terraform plan` to verify no drift (dogfooding). 2 hours/week max.
**On-Call Model:**
```
PagerDuty Configuration:
- Critical alerts (customer-facing outage): Page immediately, 24/7
- High alerts (degraded service): Page during business hours only
- Medium alerts (non-urgent operational): Slack notification, review in weekly maintenance
- Low alerts (informational): CloudWatch dashboard only
Critical Alert Triggers:
- API Gateway 5xx rate > 5% for 5 minutes
- SQS FIFO queue age > 15 minutes (drift reports backing up)
- RDS connection count > 80% of max
- Zero drift reports received in 1 hour (all agents down?)
- Stripe webhook processing failures > 3 consecutive
Estimated Alert Volume:
- Month 1-3: ~2-3 critical alerts/month (new system, bugs)
- Month 3-6: ~1 critical alert/month (stabilized)
- Month 6+: ~0.5 critical alerts/month (mature)
```
**Support Model:**
| Channel | Response Time | Tier |
|---|---|---|
| GitHub Issues (drift-cli) | 24-48 hours | All (open-source community) |
| In-app chat (Intercom) | 24 hours (business days) | Free + Paid |
| Slack community (#dd0c-drift) | Best effort, same day | All |
| Email (support@dd0c.dev) | 24 hours | Paid |
| Priority Slack DM | 4 hours (business days) | Business + Enterprise |
**Time Allocation (Solo Founder, 50 hrs/week):**
```
Week 1-4 (Pre-Launch Build):
├── 35 hrs Engineering (agent + SaaS + dashboard + CLI)
├── 5 hrs Infrastructure (Terraform, CI/CD, monitoring)
├── 5 hrs Content (README, docs, blog draft)
└── 5 hrs Community (Reddit lurking, driftctl issue monitoring)
Week 5-8 (Launch + Seed):
├── 25 hrs Engineering (bug fixes, polish, V1.1 patches)
├── 5 hrs Infrastructure + ops
├── 10 hrs Content (blog posts, Drift Cost Calculator)
└── 10 hrs Community (Reddit engagement, HN launch, design partners)
Week 9-12 (Growth):
├── 20 hrs Engineering (V1.x improvements, V2 planning)
├── 5 hrs Ops + support
├── 10 hrs Content + SEO
├── 10 hrs Community + partnerships
└── 5 hrs Business (metrics review, pricing analysis, investor prep)
Steady State (Month 4+):
├── 20 hrs Engineering
├── 5 hrs Ops + support (scales with customer count)
├── 10 hrs Marketing (content + community + partnerships)
├── 10 hrs Product (customer feedback, roadmap, design)
└── 5 hrs Business (metrics, billing, legal)
```
**Automation That Saves Founder Time:**
| Automation | Time Saved | Implementation |
|---|---|---|
| **Stripe billing** | 5 hrs/week | Self-serve upgrade/downgrade, automatic invoicing, dunning emails |
| **GitHub Actions CI/CD** | 3 hrs/week | Automated test → build → deploy pipeline. No manual deployments. |
| **Intercom chatbot** | 2 hrs/week | FAQ auto-responses for common questions (pricing, setup, supported resources) |
| **CloudWatch auto-remediation** | 1 hr/week | Auto-restart ECS tasks, auto-scale on queue depth, auto-archive old DynamoDB items |
| **Dependabot + Renovate** | 1 hr/week | Automated dependency updates with auto-merge for patch versions |
| **dd0c/drift on dd0c/drift** | 1 hr/week | Dogfooding — drift detection on own infrastructure eliminates manual `terraform plan` runs |
---
## Section 7: API DESIGN
The dd0c/drift API surface is divided into three distinct zones: the highly restricted Agent API (mTLS authenticated, ingestion only), the standard Dashboard API (JWT authenticated, CRUD operations), and the Integration APIs (Webhooks, Slack, and dd0c platform cross-talk).
### 7.1 Agent API (Ingestion & Heartbeat)
This API is exposed strictly for the drift agent running in the customer's environment. All endpoints require mutual TLS (mTLS) combined with a static, org-scoped API key sent via headers.
**Agent Registration & Heartbeat:**
```http
POST /v1/agents/register
Authorization: Bearer dd0c_api_...
Content-Type: application/json
{
"agent_id": "uuid",
"name": "prod-ecs-cluster-agent",
"version": "1.2.3",
"deployment_type": "ecs",
"monitored_stacks": ["prod-networking", "prod-compute"]
}
# Response: 200 OK
{
"status": "active",
"poll_interval_s": 300,
"config_hash": "abc123def456"
}
```
```http
POST /v1/agents/{agent_id}/heartbeat
Authorization: Bearer dd0c_api_...
{
"uptime_s": 86400,
"events_processed": 142,
"memory_mb": 42
}
```
**Drift Report Submission:**
This is the core ingestion endpoint. It accepts batched drift reports (either from event-driven CloudTrail intercepts or scheduled full-state comparisons).
```http
POST /v1/drift-reports
Authorization: Bearer dd0c_api_...
{
"stack_id": "prod-networking",
"report_id": "uuid",
"detection_method": "event_driven",
"timestamp": "2026-02-28T10:00:00Z",
"drifted_resources": [
{
"address": "module.vpc.aws_security_group.api",
"type": "aws_security_group",
"severity": "critical",
"category": "security",
"diff": {
"ingress": {
"old": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8"]}],
"new": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8", "0.0.0.0/0"]}]
}
},
"attribution": {
"principal": "arn:aws:iam::123456:user/jsmith",
"source_ip": "192.168.1.1",
"event_name": "AuthorizeSecurityGroupIngress"
}
}
]
}
```
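For reference, the payload above maps directly onto agent-side Go types. Field names follow the JSON shown; struct and helper names are illustrative, not the agent's actual definitions:

```go
package main

import (
	"encoding/json"
	"time"
)

// DriftReport mirrors the POST /v1/drift-reports body.
type DriftReport struct {
	StackID          string            `json:"stack_id"`
	ReportID         string            `json:"report_id"`
	DetectionMethod  string            `json:"detection_method"` // "event_driven" or "scheduled"
	Timestamp        time.Time         `json:"timestamp"`
	DriftedResources []DriftedResource `json:"drifted_resources"`
}

type DriftedResource struct {
	Address     string                   `json:"address"`
	Type        string                   `json:"type"`
	Severity    string                   `json:"severity"` // critical | high | medium | low
	Category    string                   `json:"category"`
	Diff        map[string]AttributeDiff `json:"diff"`
	Attribution *Attribution             `json:"attribution,omitempty"`
}

// AttributeDiff keeps old/new as raw JSON because attribute values vary
// by resource type (strings, lists, nested blocks).
type AttributeDiff struct {
	Old json.RawMessage `json:"old"`
	New json.RawMessage `json:"new"`
}

type Attribution struct {
	Principal string `json:"principal"`
	SourceIP  string `json:"source_ip"`
	EventName string `json:"event_name"`
}

// parseDriftReport decodes a raw report body.
func parseDriftReport(raw []byte) (*DriftReport, error) {
	var r DriftReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	return &r, nil
}
```

Keeping `old`/`new` as `json.RawMessage` lets the ingestion side validate and store diffs without a schema per resource type.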
### 7.2 Dashboard & Query API
This is the REST API consumed by the React web dashboard. It relies on standard JWT Bearer tokens issued by Cognito. All responses are scoped via RLS to the authenticated user's `org_id`.
**Drift Event Query & Search:**
Allows complex filtering to power the drift history and active drift dashboards.
```http
GET /v1/drift-events
?stack_id=prod-networking
&status=open
&severity=critical,high
&limit=50
&offset=0
# Response: 200 OK
{
"data": [
{
"id": "evt_abc123",
"stack_id": "prod-networking",
"resource_address": "module.vpc.aws_security_group.api",
"severity": "critical",
"status": "open",
"created_at": "2026-02-28T10:00:00Z"
// ... full event details ...
}
],
"pagination": {
"total": 12,
"has_next": false
}
}
```
**Policy Configuration API (V2):**
Manages the auto-remediation and alerting policies for a stack.
```http
POST /v1/stacks/{stack_id}/policies
{
"name": "Auto-revert public security groups",
"resource_type": "aws_security_group",
"condition": "severity == 'critical'",
"action": "auto_revert",
"enabled": true
}
```
### 7.3 Slack Integration
The Slack integration relies on two primary interaction models: Slash commands (user-initiated) and Interactive Actions (button clicks on drift alerts).
**Slash Commands:**
- `/drift status [stack_name]` - Returns the current drift score, number of open drift events, and agent health status for the specified stack (or all stacks if omitted).
- `/drift check [stack_name]` - Dispatches an immediate, out-of-band full-state check to the agent for the specified stack.
- `/drift silence [resource_address] [duration]` - Temporarily mutes alerts for a noisy resource (e.g., `/drift silence aws_autoscaling_group.workers 24h`).
**Interactive Actions (Webhooks from Slack):**
When a user clicks an action button (`[Revert]`, `[Accept]`, `[Snooze]`) in a drift alert message, Slack POSTs a signed payload to the dd0c API Gateway.
```http
POST /v1/slack/interactions
Content-Type: application/x-www-form-urlencoded
X-Slack-Signature: v0=a2114d...
X-Slack-Request-Timestamp: 1614556800
payload={
"type": "block_actions",
"user": { "id": "U123456", "name": "jsmith" },
"actions": [
{
"action_id": "drift_revert",
"value": "evt_abc123"
}
]
}
```
The Notification Service Lambda validates the signature, performs RBAC checks, and initiates the workflow via the Remediation Engine.
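Slack's signing scheme is public: the base string `v0:<timestamp>:<raw body>` is HMAC-SHA256'd with the app's signing secret and compared against `X-Slack-Signature`. A stdlib sketch of the check the Lambda performs (function names are assumptions):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// signSlack computes Slack's v0 signature: HMAC-SHA256 over the base
// string "v0:<timestamp>:<raw body>" with the app's signing secret.
func signSlack(body []byte, timestamp, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	fmt.Fprintf(mac, "v0:%s:%s", timestamp, body)
	return "v0=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySlackSignature recomputes the signature and rejects timestamps
// more than five minutes from nowUnix, blocking replay of captured payloads.
func verifySlackSignature(body []byte, timestamp, signature, secret string, nowUnix int64) bool {
	ts, err := strconv.ParseInt(timestamp, 10, 64)
	if err != nil {
		return false
	}
	if diff := nowUnix - ts; diff > 300 || diff < -300 {
		return false
	}
	// Constant-time compare so the check doesn't leak signature bytes.
	return hmac.Equal([]byte(signSlack(body, timestamp, secret)), []byte(signature))
}
```

The signature must be computed over the raw URL-encoded body exactly as received, before any form parsing.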
### 7.4 Outbound Webhooks
Customers can subscribe to drift events via outbound webhooks to trigger custom internal workflows (e.g., creating Jira tickets, updating custom dashboards).
**Webhook Registration:**
Customers configure an endpoint URL and receive a signing secret in the dashboard.
**Payload Delivery:**
```http
POST /webhook
X-dd0c-Signature: sha256=a1b2c3d4e5f6...
Content-Type: application/json
{
"event_type": "drift.detected",
"event_id": "webhook_evt_789",
"timestamp": "2026-02-28T10:05:00Z",
"data": {
"drift_event_id": "evt_abc123",
"stack_id": "prod-networking",
"resource_address": "module.vpc.aws_security_group.api",
"severity": "critical"
}
}
```
Other event types include `drift.resolved` (when an event is remediated or accepted) and `agent.offline` (when an agent misses its heartbeat threshold).
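A customer's receiver should verify `X-dd0c-Signature` by recomputing HMAC-SHA256 over the raw body with the signing secret from the dashboard. A minimal Go sketch (helper names are illustrative):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// signDelivery reproduces the X-dd0c-Signature value: HMAC-SHA256 over
// the raw request body, hex-encoded, with the "sha256=" prefix.
func signDelivery(body []byte, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifyDelivery is what a customer's webhook receiver runs before
// trusting a payload. Verify against the raw bytes, before JSON parsing.
func verifyDelivery(body []byte, header, secret string) bool {
	// hmac.Equal gives a constant-time comparison.
	return hmac.Equal([]byte(signDelivery(body, secret)), []byte(header))
}
```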
### 7.5 dd0c Platform Integrations
While dd0c/drift works as a standalone SaaS, its value compounds when deployed alongside other modules in the dd0c platform suite.
**dd0c/alert Integration:**
Instead of dd0c/drift connecting directly to Slack or PagerDuty, it can emit events to dd0c/alert via an internal event bus (EventBridge). dd0c/alert handles the intelligent routing based on on-call schedules, deduplication, grouping, and escalation policies.
*API Flow:* `drift Event Processor -> internal EventBridge -> dd0c/alert Ingestion API`
**dd0c/portal Integration:**
dd0c/portal serves as the developer service catalog. When a team views a service in the catalog, it queries the drift dashboard API for infrastructure health.
*API Flow:* `dd0c/portal backend -> GET /v1/stacks/{id}/drift-score -> UI Rendering`
This enriches the service catalog with real-time IaC compliance metrics and allows developers to see their drift score directly next to their service ownership details without opening the drift dashboard.
### 7.6 Rate Limits & Throttling
Rate limits are enforced at the API Gateway layer via usage plans. Limits are tiered by plan and by API zone (Agent vs. Dashboard) to protect the platform from runaway agents and abusive clients while keeping the system responsive for legitimate use.
**Agent API Rate Limits:**
| Tier | Drift Report Submissions | Heartbeats | Agent Registrations |
|---|---|---|---|
| **Free** | 100 req/hour per org | 60 req/hour per agent | 5 req/day per org |
| **Starter** | 500 req/hour per org | 120 req/hour per agent | 20 req/day per org |
| **Pro** | 2,000 req/hour per org | 120 req/hour per agent | 50 req/day per org |
| **Business** | 10,000 req/hour per org | 300 req/hour per agent | 200 req/day per org |
| **Enterprise** | Custom (negotiated) | Custom | Custom |
These limits are generous relative to expected usage. A Pro tier customer with 30 stacks checking every 5 minutes generates ~360 reports/hour — well within the 2,000 limit. The limits exist to catch misconfigured agents stuck in tight loops, not to throttle normal operation.
**Dashboard API Rate Limits:**
| Tier | Read Requests | Write Requests (mutations) |
|---|---|---|
| **Free** | 300 req/min per user | 30 req/min per user |
| **Starter** | 600 req/min per user | 60 req/min per user |
| **Pro** | 1,200 req/min per user | 120 req/min per user |
| **Business** | 3,000 req/min per user | 300 req/min per user |
**Slack Interaction Limits:**
Slack interactions (button clicks, slash commands) are rate-limited at 60 req/min per Slack workspace. This prevents a runaway Slack bot or automated Slack client from overwhelming the remediation engine. Slack's own rate limits (~1 message/sec per channel) provide an additional natural throttle on the notification side.
**Rate Limit Headers:**
All API responses include standard rate limit headers:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1847
X-RateLimit-Reset: 1709110800
Retry-After: 42 # Only present on 429 responses
```
When a client exceeds its rate limit, the API returns `429 Too Many Requests` with a `Retry-After` header indicating seconds until the window resets. The agent is built to respect this header and back off automatically with jitter.
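The backoff-with-jitter behavior can be sketched as a pure function. The five-minute cap and the doubling base are illustrative assumptions, not the agent's exact tuning; the ±20% jitter follows the retry guidance in this document:

```go
package main

// backoffDelay returns the sleep in seconds before retry `attempt`
// (0-indexed). The first delay is the server's Retry-After value, then
// each subsequent attempt doubles it, capped at five minutes. jitterUnit
// is a uniform random number in [0, 1) supplied by the caller; it spreads
// retries by ±20% so a fleet of agents does not resend in lockstep.
func backoffDelay(retryAfterSec float64, attempt int, jitterUnit float64) float64 {
	delay := retryAfterSec * float64(uint64(1)<<uint(attempt))
	if delay > 300 {
		delay = 300 // cap at 5 minutes
	}
	return delay * (0.8 + 0.4*jitterUnit) // ±20% jitter
}
```

In the agent's retry loop this would be called with `jitterUnit` drawn from a per-process random source, then passed to a sleep before resending the request.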
### 7.7 Error Codes & Response Format
All API errors follow a consistent JSON envelope. Every error includes a machine-readable `code`, a human-readable `message`, and an optional `details` object for structured context.
**Error Response Format:**
```json
{
"error": {
"code": "DRIFT_STACK_NOT_FOUND",
"message": "Stack 'prod-networking' does not exist or you do not have access.",
"request_id": "req_a1b2c3d4",
"details": {
"stack_id": "prod-networking"
}
}
}
```
**HTTP Status Codes:**
| Status | When Used |
|---|---|
| `200 OK` | Successful read or mutation |
| `201 Created` | Resource created (agent registration, policy creation, webhook subscription) |
| `202 Accepted` | Async operation queued (drift report ingested, remediation initiated) |
| `204 No Content` | Successful deletion |
| `400 Bad Request` | Malformed payload, missing required fields, invalid filter parameters |
| `401 Unauthorized` | Missing or invalid authentication (expired JWT, bad API key) |
| `403 Forbidden` | Authenticated but insufficient permissions (RBAC violation, wrong org) |
| `404 Not Found` | Resource doesn't exist or is outside the caller's org scope (RLS) |
| `409 Conflict` | Duplicate agent registration, policy name collision, concurrent remediation on same resource |
| `422 Unprocessable Entity` | Semantically invalid request (e.g., policy references a non-existent stack, invalid severity value) |
| `429 Too Many Requests` | Rate limit exceeded — includes `Retry-After` header |
| `500 Internal Server Error` | Unhandled server error — logged, alerted, includes `request_id` for support correlation |
| `502 Bad Gateway` | Upstream dependency failure (Slack API down, GitHub API timeout) |
| `503 Service Unavailable` | Planned maintenance or circuit breaker tripped — includes `Retry-After` |
**Application Error Codes:**
| Code | HTTP Status | Description |
|---|---|---|
| `AGENT_NOT_FOUND` | 404 | Agent ID does not exist or belongs to a different org |
| `AGENT_REVOKED` | 403 | Agent API key has been revoked — re-register required |
| `AGENT_VERSION_UNSUPPORTED` | 422 | Agent version is below the minimum supported version (forces upgrade) |
| `DRIFT_REPORT_DUPLICATE` | 409 | Report with this `report_id` was already ingested (SQS FIFO dedup fallback) |
| `DRIFT_REPORT_INVALID` | 400 | Report payload fails schema validation (missing fields, invalid types) |
| `DRIFT_REPORT_TOO_LARGE` | 400 | Report exceeds 1MB payload limit — split into multiple submissions |
| `DRIFT_STACK_NOT_FOUND` | 404 | Stack does not exist or caller lacks access |
| `DRIFT_STACK_LIMIT` | 403 | Org has reached the maximum stack count for their plan tier |
| `DRIFT_EVENT_NOT_FOUND` | 404 | Drift event ID does not exist |
| `REMEDIATION_IN_PROGRESS` | 409 | A remediation is already running for this resource — wait for completion |
| `REMEDIATION_NOT_PERMITTED` | 403 | User's RBAC role does not allow remediation on this stack |
| `REMEDIATION_AGENT_OFFLINE` | 502 | The agent responsible for this stack has not sent a heartbeat in >5 minutes |
| `POLICY_INVALID` | 422 | Policy condition syntax is invalid or references unsupported resource types |
| `POLICY_LIMIT` | 403 | Org has reached the maximum policy count for their plan tier |
| `SLACK_NOT_CONNECTED` | 422 | Slack workspace is not connected — required for Slack-based actions |
| `SLACK_USER_NOT_MAPPED` | 422 | Slack user ID cannot be mapped to a dd0c user — re-authenticate Slack |
| `WEBHOOK_DELIVERY_FAILED` | N/A | (Async) Webhook endpoint returned non-2xx — retried 3x with exponential backoff, then disabled |
| `AUTH_TOKEN_EXPIRED` | 401 | JWT has expired — refresh via Cognito token endpoint |
| `AUTH_TOKEN_INVALID` | 401 | JWT signature verification failed |
| `RATE_LIMIT_EXCEEDED` | 429 | Request throttled — respect `Retry-After` header |
| `INTERNAL_ERROR` | 500 | Unhandled exception — `request_id` included for support escalation |
**Retry Guidance for Agent Developers:**
The open-source agent implements the following retry strategy, and third-party integrations should follow the same pattern:
| Error Code | Retry? | Strategy |
|---|---|---|
| `429` | Yes | Exponential backoff starting at `Retry-After` value, max 5 retries, jitter ±20% |
| `500` | Yes | Exponential backoff starting at 1s, max 3 retries |
| `502` / `503` | Yes | Exponential backoff starting at 5s, max 5 retries |
| `400` / `422` | No | Fix the payload — retrying the same request will produce the same error |
| `401` | No | Re-authenticate — API key may be rotated or JWT expired |
| `403` | No | Permission issue — check RBAC or plan tier |
| `409` | Conditional | For `DRIFT_REPORT_DUPLICATE`, safe to ignore. For `REMEDIATION_IN_PROGRESS`, poll status and retry after completion. |
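The retry table collapses into a small classifier. A sketch with hypothetical names (`retryClass`, `classifyError`); the open-source agent's actual policy lives in its HTTP transport layer:

```go
package main

// retryClass is the action a client takes after an error response.
type retryClass int

const (
	noRetry          retryClass = iota // fix the request or credentials
	retryWithBackoff                   // transient; back off and resend
	ignoreAsSuccess                    // duplicate; treat as delivered
	pollThenRetry                      // wait for the conflicting operation
)

// classifyError maps an HTTP status plus application error code to the
// retry decision in the table above. Application codes are checked first
// because 409 splits into two different behaviors.
func classifyError(status int, code string) retryClass {
	switch code {
	case "DRIFT_REPORT_DUPLICATE":
		return ignoreAsSuccess
	case "REMEDIATION_IN_PROGRESS":
		return pollThenRetry
	}
	switch status {
	case 429, 500, 502, 503:
		return retryWithBackoff
	default:
		return noRetry // 400/401/403/404/422: retrying reproduces the error
	}
}
```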
---