
dd0c/drift — Technical Architecture

Architect: Max Mayfield (Phase 6 — Architecture)
Date: February 28, 2026
Product: dd0c/drift — IaC Drift Detection & Remediation SaaS
Status: Architecture Design Document


Section 1: SYSTEM OVERVIEW

High-Level Architecture

graph TB
    subgraph Customer VPC
        CT[AWS CloudTrail] -->|Events| EB[Amazon EventBridge]
        EB -->|Rule Match| SQS_C[SQS Queue<br/>customer-side]
        SQS_C --> DA[Drift Agent<br/>ECS Task / GitHub Action]
        SF[Terraform State<br/>S3 Backend] -->|Read| DA
        DA -->|Encrypted Drift Diffs| HTTPS[HTTPS Egress Only]
    end

    subgraph dd0c SaaS Platform — AWS
        HTTPS -->|mTLS| APIGW[API Gateway<br/>Agent Ingestion]
        APIGW --> SQS_P[SQS FIFO Queue<br/>Event Ingestion]
        SQS_P --> PROC[Event Processor<br/>ECS Fargate]
        PROC --> DB[(PostgreSQL RDS<br/>Multi-Tenant)]
        PROC --> S3_SNAP[S3<br/>State Snapshots]
        PROC --> ES[Event Store<br/>DynamoDB Streams]

        PROC --> RE[Remediation Engine<br/>ECS Fargate]
        RE -->|Generate Plan| DA

        PROC --> NS[Notification Service<br/>Lambda]
        NS --> SLACK[Slack API]
        NS --> EMAIL[SES Email]
        NS --> WH[Webhook Delivery]

        DASH[Dashboard API<br/>ECS Fargate] --> DB
        DASH --> S3_SNAP
        UI[React SPA<br/>CloudFront] --> DASH

        AUTH[Auth Service<br/>Cognito + Lambda] --> DASH
        AUTH --> APIGW
    end

    subgraph External Integrations
        SLACK --> SLACK_USER[Slack Workspace]
        GH[GitHub API] --> PR[Pull Request<br/>Accept Drift]
        PD[PagerDuty API] --> ONCALL[On-Call Rotation]
    end

Component Inventory

| Component | Responsibility | Runtime | Deployment |
|---|---|---|---|
| Drift Agent | Consumes CloudTrail events via EventBridge/SQS, reads Terraform state, computes drift diffs, pushes encrypted results to SaaS | Go binary | Customer VPC: ECS Task, GitHub Action, or standalone binary |
| API Gateway (Ingestion) | Authenticates agent connections (mTLS + API key), rate limits, routes to ingestion queue | AWS API Gateway (HTTP API) | SaaS account |
| Event Processor | Deserializes drift diffs, classifies severity, persists to DB, triggers notifications and remediation workflows | Node.js / TypeScript | ECS Fargate |
| State Manager | Parses Terraform state files (v4 format), builds resource graph, computes resource-level diffs against previous snapshots | Go (shared lib with Agent) | Runs inside Drift Agent + SaaS-side for dashboard queries |
| Remediation Engine | Generates scoped terraform plan for revert, manages approval workflow, dispatches apply commands back to Agent | Node.js / TypeScript | ECS Fargate |
| Notification Service | Formats and delivers Slack Block Kit messages, emails, webhooks; handles Slack interactivity (button callbacks) | Node.js Lambda | Lambda (event-driven, pay-per-invocation) |
| Dashboard API | REST API for web dashboard — drift scores, stack list, history, compliance reports, team management | Node.js / TypeScript | ECS Fargate |
| Web Dashboard | React SPA — drift score, stack overview, drift timeline, compliance report generator, settings | React + Vite | CloudFront + S3 |
| Auth Service | User authentication (email/password, GitHub OAuth, Google OAuth), API key management for agents, RBAC | Cognito + Lambda authorizers | Managed + Lambda |

Technology Choices

| Decision | Choice | Justification |
|---|---|---|
| Agent Language | Go | Single static binary, no runtime dependencies, cross-compiles to Linux/macOS/Windows. Critical for customer-side deployment — zero dependency footprint. |
| SaaS Backend | Node.js / TypeScript | Shared language with frontend (React). Fast iteration for a solo founder. Strong AWS SDK support. TypeScript catches bugs at compile time. |
| Database | PostgreSQL (RDS) | Relational model fits multi-tenant SaaS (row-level security). JSONB for flexible drift diff storage. Mature, battle-tested. RDS handles backups/failover. |
| Event Store | DynamoDB + Streams | Append-only drift event log. DynamoDB Streams enables event sourcing pattern. Cost-effective at low volume, scales linearly. |
| State Snapshots | S3 + Glacier lifecycle | State snapshots are large (MB range), write-once-read-rarely. S3 is the obvious choice. Glacier after 90 days for cost. |
| Queue | SQS FIFO | Exactly-once processing for drift events. FIFO guarantees ordering per message group (per stack). No operational overhead vs. self-managed Kafka. |
| Notifications | Lambda | Event-driven, bursty workload. Pay-per-invocation. A Slack message costs ~$0.0000002 in Lambda compute. |
| Frontend | React + Vite + CloudFront | Standard SPA stack. CloudFront for global edge caching. Vite for fast builds. Nothing exotic — solo founder needs boring tech. |
| Auth | Cognito | Managed auth with OAuth flows, JWT tokens, user pools. Eliminates building auth from scratch. Cognito is ugly but functional. |
| IaC for SaaS infra | Terraform | Dogfooding. The SaaS that detects Terraform drift should be deployed with Terraform. |

Deployment Model — Push-Based Agent Architecture

This is the non-negotiable architectural decision. The SaaS never pulls from customer infrastructure. The agent pushes out.

┌─────────────────────────────────────────────────┐
│                 Customer Account                 │
│                                                  │
│  CloudTrail ──► EventBridge ──► SQS ──► Agent   │
│                                          │       │
│  S3 State Bucket ◄──── (read) ──────────┘       │
│                                          │       │
│                                    Drift Diff    │
│                                    (encrypted)   │
│                                          │       │
│                                    HTTPS OUT ────┼──►  dd0c SaaS
│                                                  │
│  ❌ No inbound access from SaaS                  │
│  ❌ No IAM cross-account role for SaaS           │
│  ✅ Agent runs with customer's IAM role          │
│  ✅ State file never leaves customer account     │
│  ✅ Only drift diffs (no secrets) are transmitted│
└─────────────────────────────────────────────────┘

Agent Deployment Options:

  1. ECS Fargate Task (recommended for always-on): Long-running container that subscribes to SQS queue for real-time CloudTrail events and runs scheduled state comparisons. Deployed via Terraform module provided by dd0c.

  2. GitHub Actions Cron (recommended for getting started): Scheduled workflow that runs drift check on a cron (e.g., every 15 minutes). Zero infrastructure to manage. Lowest barrier to entry.

  3. Standalone Binary (for air-gapped / custom): Download the Go binary, run it anywhere — EC2, Kubernetes pod, on-prem server. Maximum flexibility.

What the Agent Transmits:

The agent sends a DriftReport payload — NOT the state file. The payload contains:

  • Stack identifier (name, backend location hash — not the actual S3 path)
  • List of drifted resources: resource type, resource address, attribute-level diff (old value vs. new value)
  • CloudTrail attribution: IAM principal, source IP, timestamp, event name
  • Drift classification: severity (critical/high/medium/low), category (security/config/tags/scaling)
  • Agent metadata: version, heartbeat timestamp, detection method (event-driven vs. scheduled)

What the Agent Does NOT Transmit:

  • Full state file contents
  • Secret values (the agent strips sensitive attributes and known secret patterns before transmission)
  • Raw CloudTrail events (only correlated attribution data)
  • S3 bucket names, account IDs, or other infrastructure identifiers (hashed)
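The payload contract above can be sketched as a Go struct. The field names, JSON keys, and `BackendHash` helper here are illustrative, not the actual wire format; the point is what is present (diffs, attribution, hashes) and what is absent (state, secrets, raw identifiers):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// AttributeDiff is one attribute-level change; values arrive pre-scrubbed.
type AttributeDiff struct {
	Attribute string      `json:"attribute"`
	Old       interface{} `json:"old"`
	New       interface{} `json:"new"`
}

// DriftedResource is one drifted resource within a report.
type DriftedResource struct {
	Address  string          `json:"address"`  // e.g., "aws_security_group.api"
	Type     string          `json:"type"`     // e.g., "aws_security_group"
	Severity string          `json:"severity"` // critical | high | medium | low
	Category string          `json:"category"` // security | config | tags | scaling
	Diffs    []AttributeDiff `json:"diffs"`
}

// DriftReport is the only payload the agent transmits: no state file,
// no raw CloudTrail events, no bucket names or account IDs.
type DriftReport struct {
	StackName           string            `json:"stack_name"`
	BackendHash         string            `json:"backend_hash"`     // hash, never the backend path
	DetectionMethod     string            `json:"detection_method"` // event_driven | scheduled
	AgentVersion        string            `json:"agent_version"`
	AttributedPrincipal string            `json:"attributed_principal,omitempty"`
	Resources           []DriftedResource `json:"resources"`
}

// BackendHash turns a backend location into the opaque stack identifier.
func BackendHash(backendConfig string) string {
	sum := sha256.Sum256([]byte(backendConfig))
	return hex.EncodeToString(sum[:])
}

func main() {
	r := DriftReport{
		StackName:       "prod-networking",
		BackendHash:     BackendHash("s3://example-bucket/net/terraform.tfstate"),
		DetectionMethod: "event_driven",
		AgentVersion:    "0.1.0",
	}
	out, _ := json.Marshal(r)
	fmt.Println(len(out) > 0) // true
}
```

The bucket path exists only inside the customer VPC; the SaaS side ever sees only the SHA-256 digest.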

Section 2: CORE COMPONENTS

2.1 Drift Agent

The agent is the heart of the push-based architecture. It's a single Go binary that runs inside the customer's environment and does two things: consume CloudTrail events for real-time detection, and periodically compare Terraform state against cloud reality for comprehensive coverage.

Architecture:

graph LR
    subgraph Drift Agent Process
        EC[Event Consumer<br/>CloudTrail via SQS] --> DF[Drift Filter<br/>Resource Matcher]
        SC[Scheduled Checker<br/>Cron / Timer] --> SP[State Parser<br/>TF State v4]
        SP --> DC[Drift Comparator<br/>Attribute-Level Diff]
        DF --> DC
        DC --> SS[Secret Scrubber<br/>Strip Sensitive Values]
        SS --> TX[Transmitter<br/>HTTPS + mTLS]
        TX -->|Encrypted DriftReport| SAAS[dd0c SaaS API]
    end

Event Consumer (Real-Time Path):

CloudTrail delivers events to EventBridge. An EventBridge rule matches IaC-managed resource types and forwards to an SQS queue. The agent polls this queue.

// EventBridge Rule Pattern — matches write API calls on IaC-managed resource types
{
  "source": ["aws.ec2", "aws.iam", "aws.rds", "aws.s3", "aws.lambda", "aws.ecs"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": [{
      "anything-but": {
        "prefix": "Describe"
      }
    }],
    "readOnly": [false]
  }
}

When the agent receives a CloudTrail event:

  1. Extract the resource identifier from the event (e.g., sg-abc123 from a ModifySecurityGroupRules event)
  2. Look up which Terraform state file manages this resource (agent maintains an in-memory resource → state index)
  3. Read the current resource attributes from the cloud API (e.g., ec2:DescribeSecurityGroups)
  4. Compare against the declared attributes in the Terraform state file
  5. If drift detected → build DriftReport, scrub secrets, transmit to SaaS

Scheduled Checker (Comprehensive Path):

The event-driven path catches ~80% of drift in real-time. The scheduled path catches everything else — resources modified through means that don't generate standard CloudTrail events, or resources in services not yet covered by the EventBridge rule.

Schedule: Configurable per tier
  Free:     Every 24 hours
  Starter:  Every 15 minutes
  Pro:      Every 5 minutes
  Business: Every 1 minute

The scheduled checker:

  1. Reads the Terraform state file from the configured backend (S3, GCS, Terraform Cloud, local)
  2. For each resource in state, calls the corresponding cloud API to get current attributes
  3. Computes attribute-level diff between state and reality
  4. Batches all drifted resources into a single DriftReport
  5. Transmits to SaaS
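The attribute-level diff in step 3 can be sketched as follows. This is a minimal top-level comparison; real state values are nested, and provider-computed attributes must be excluded before comparing:

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Change records one attribute whose declared value differs from reality.
type Change struct {
	Attribute string
	Old, New  interface{}
}

// DiffAttributes compares declared (state) attributes against actual (cloud)
// attributes and returns the changed set, sorted for stable output.
func DiffAttributes(declared, actual map[string]interface{}) []Change {
	var changes []Change
	for key, oldVal := range declared {
		if newVal, ok := actual[key]; !ok || !reflect.DeepEqual(oldVal, newVal) {
			changes = append(changes, Change{Attribute: key, Old: oldVal, New: newVal})
		}
	}
	// Attributes present in the cloud but absent from state are drift too.
	for key, newVal := range actual {
		if _, ok := declared[key]; !ok {
			changes = append(changes, Change{Attribute: key, Old: nil, New: newVal})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Attribute < changes[j].Attribute })
	return changes
}

func main() {
	declared := map[string]interface{}{"instance_type": "t3.micro", "monitoring": false}
	actual := map[string]interface{}{"instance_type": "t3.large", "monitoring": false}
	for _, c := range DiffAttributes(declared, actual) {
		fmt.Printf("%s: %v -> %v\n", c.Attribute, c.Old, c.New)
		// instance_type: t3.micro -> t3.large
	}
}
```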

State File Parsing:

The agent parses Terraform state v4 format (JSON). Key structures:

type TerraformState struct {
    Version          int              `json:"version"`          // Must be 4
    TerraformVersion string           `json:"terraform_version"`
    Serial           int64            `json:"serial"`
    Lineage          string           `json:"lineage"`
    Resources        []StateResource  `json:"resources"`
}

type StateResource struct {
    Module    string              `json:"module,omitempty"`
    Mode      string              `json:"mode"`      // "managed" or "data"
    Type      string              `json:"type"`       // e.g., "aws_security_group"
    Name      string              `json:"name"`
    Provider  string              `json:"provider"`
    Instances []ResourceInstance   `json:"instances"`
}

type ResourceInstance struct {
    SchemaVersion int                    `json:"schema_version"`
    Attributes    map[string]interface{} `json:"attributes"`
    Private       string                 `json:"private"`  // Base64 encoded, may contain secrets
}

The agent maintains a mapping of Terraform resource types to AWS API calls:

| Terraform Resource Type | AWS Describe API | Key Identifier |
|---|---|---|
| aws_security_group | ec2:DescribeSecurityGroups | attributes.id |
| aws_iam_role | iam:GetRole | attributes.name |
| aws_iam_policy | iam:GetPolicyVersion | attributes.arn |
| aws_db_instance | rds:DescribeDBInstances | attributes.identifier |
| aws_s3_bucket | s3:GetBucket* calls (composite) | attributes.bucket |
| aws_lambda_function | lambda:GetFunction | attributes.function_name |
| aws_ecs_service | ecs:DescribeServices | attributes.name + attributes.cluster |
| aws_route53_record | route53:ListResourceRecordSets | attributes.zone_id + attributes.name |

MVP covers the top 20 most-drifted resource types (based on community data from driftctl's historical issues). Remaining types added iteratively based on customer demand.

Secret Scrubbing:

Before transmitting any drift diff, the agent runs a scrubbing pass:

  1. Remove any attribute marked sensitive in the Terraform provider schema
  2. Redact values matching known secret patterns: password, secret, token, key, private_key, connection_string
  3. Redact any value that looks like a credential (regex patterns for AWS keys, database URIs, JWT tokens)
  4. Replace redacted values with [REDACTED] — the diff still shows "attribute changed" but not the actual values
  5. The Private field on resource instances is always stripped entirely
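A simplified sketch of steps 2 and 3 of that pass. The key-name list comes from the rules above; the value-shape regexes (AWS access key ID, database URI, JWT) are illustrative, and the real agent additionally consults the provider schema's sensitive markers:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const redacted = "[REDACTED]"

// sensitiveKeys: attribute-name substrings that are always redacted.
var sensitiveKeys = []string{"password", "secret", "token", "key", "private_key", "connection_string"}

// credentialPatterns: values that look like credentials regardless of key name.
var credentialPatterns = []*regexp.Regexp{
	regexp.MustCompile(`^AKIA[0-9A-Z]{16}$`),                   // AWS access key ID shape
	regexp.MustCompile(`^postgres(ql)?://\S+$`),                // database URI
	regexp.MustCompile(`^eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.`), // JWT-ish
}

// Scrub redacts sensitive attributes in place, by key name or value shape.
// The diff still records that the attribute changed, never the values.
func Scrub(attrs map[string]interface{}) {
	for key := range attrs {
		lower := strings.ToLower(key)
		for _, k := range sensitiveKeys {
			if strings.Contains(lower, k) {
				attrs[key] = redacted
			}
		}
		if s, ok := attrs[key].(string); ok && s != redacted {
			for _, p := range credentialPatterns {
				if p.MatchString(s) {
					attrs[key] = redacted
				}
			}
		}
	}
}

func main() {
	attrs := map[string]interface{}{
		"master_password": "hunter2",
		"endpoint":        "db.internal:5432",
		"aws_cred":        "AKIAIOSFODNN7EXAMPLE",
	}
	Scrub(attrs)
	fmt.Println(attrs["master_password"], attrs["endpoint"], attrs["aws_cred"])
	// [REDACTED] db.internal:5432 [REDACTED]
}
```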

2.2 Event Pipeline

The event pipeline is the real-time nervous system. CloudTrail events flow through EventBridge and SQS to the agent within seconds of a change; no cron schedule or full-state scan sits on this path (the agent's long-poll on its own SQS queue is the only polling involved).

Customer-Side Pipeline:

CloudTrail (all regions)
    │
    ▼
EventBridge (default bus)
    │
    ├── Rule: drift-agent-ec2 (ec2 write events)
    ├── Rule: drift-agent-iam (iam write events)
    ├── Rule: drift-agent-rds (rds write events)
    └── Rule: drift-agent-* (per-service rules)
    │
    ▼
SQS Queue: drift-agent-events
    │ (FIFO, dedup by CloudTrail eventID)
    │ (visibility timeout: 300s)
    │ (DLQ after 3 retries)
    ▼
Drift Agent (long-poll, batch size 10)

SaaS-Side Pipeline:

Agent HTTPS POST /v1/drift-reports
    │
    ▼
API Gateway (auth: mTLS + API key header)
    │
    ▼
SQS FIFO Queue: drift-report-ingestion
    │ (message group ID = stack_id → ordering per stack)
    │ (dedup ID = report_id → exactly-once)
    ▼
Event Processor (ECS Fargate, auto-scaling 1-10 tasks)
    │
    ├──► PostgreSQL (drift_events, resources, stacks)
    ├──► DynamoDB (event store, append-only)
    ├──► S3 (state snapshot, if full snapshot report)
    └──► Notification Service (Lambda, async invoke)

Why SQS FIFO, not Kafka/Kinesis:

At MVP scale (10-1,000 stacks), SQS FIFO is the right choice:

  • Zero operational overhead (no brokers, no partitions, no ZooKeeper)
  • Exactly-once processing via deduplication ID
  • Per-stack ordering via message group ID
  • Costs ~$0.40/million messages. At 1,000 stacks checking every 5 minutes, that's ~288K messages/day = ~$3.50/month
  • If we hit 10,000+ stacks and need streaming analytics, migrate to Kinesis. That's a V3 problem.
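The quoted figure is easy to sanity-check with a rough model. This counts send requests only; receive and delete requests roughly triple it, still single-digit dollars a month:

```go
package main

import "fmt"

// MonthlySQSCost models send-request cost: checks per stack per day times
// stack count, priced at the quoted ~$0.40 per million requests.
func MonthlySQSCost(stacks, checkIntervalMin int, pricePerMillion float64) (msgsPerDay int, monthly float64) {
	msgsPerDay = stacks * (24 * 60 / checkIntervalMin)
	monthly = float64(msgsPerDay) * 30 / 1e6 * pricePerMillion
	return
}

func main() {
	perDay, monthly := MonthlySQSCost(1000, 5, 0.40)
	fmt.Printf("%d messages/day, ~$%.2f/month\n", perDay, monthly)
	// 288000 messages/day, ~$3.46/month
}
```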

2.3 State Manager

The State Manager is a shared library (Go) used by both the Drift Agent and the SaaS-side Event Processor. It handles Terraform state parsing, resource graph construction, and drift classification.

Resource Graph:

The State Manager builds a dependency graph from Terraform state. This is critical for blast radius analysis — when a resource drifts, what else might be affected?

type ResourceGraph struct {
    Nodes map[string]*ResourceNode  // key: resource address (e.g., "aws_security_group.api")
    Edges []ResourceEdge            // dependency relationships
}

type ResourceNode struct {
    Address    string                 // e.g., "module.networking.aws_security_group.api"
    Type       string                 // e.g., "aws_security_group"
    Provider   string                 // e.g., "registry.terraform.io/hashicorp/aws"
    Attributes map[string]interface{} // current state attributes
    DriftState DriftState             // clean, drifted, unknown
}

type ResourceEdge struct {
    From string  // resource address
    To   string  // resource address
    Type string  // "depends_on", "reference", "implicit"
}

The graph is built by analyzing attribute cross-references in state (e.g., aws_instance.web.vpc_security_group_ids references aws_security_group.web.id). This isn't perfect — Terraform state doesn't store the full dependency graph — but it catches 80%+ of relationships.
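The cross-reference scan reduces to: for each resource, check whether any of its attribute values equals another resource's id. A flat-attribute sketch (real state values are nested and list-valued, so the production version walks those structures too):

```go
package main

import "fmt"

// Res is a minimal resource: address plus flattened string attributes.
type Res struct {
	Address    string
	Attributes map[string]string
}

// Edge is a directed reference: From depends on To.
type Edge struct{ From, To string }

// InferEdges finds implicit references by matching attribute values against
// other resources' "id" attribute.
func InferEdges(resources []Res) []Edge {
	idOwner := make(map[string]string) // cloud id -> resource address
	for _, r := range resources {
		if id, ok := r.Attributes["id"]; ok {
			idOwner[id] = r.Address
		}
	}
	var edges []Edge
	for _, r := range resources {
		for key, val := range r.Attributes {
			if key == "id" {
				continue // a resource's own id is not a reference
			}
			if owner, ok := idOwner[val]; ok && owner != r.Address {
				edges = append(edges, Edge{From: r.Address, To: owner})
			}
		}
	}
	return edges
}

func main() {
	resources := []Res{
		{Address: "aws_security_group.web", Attributes: map[string]string{"id": "sg-123"}},
		{Address: "aws_instance.web", Attributes: map[string]string{"id": "i-456", "vpc_security_group_ids": "sg-123"}},
	}
	for _, e := range InferEdges(resources) {
		fmt.Printf("%s -> %s\n", e.From, e.To)
		// aws_instance.web -> aws_security_group.web
	}
}
```

Blast radius for a drifted resource is then the set of nodes reachable by following edges into it.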

Drift Classification:

Every detected drift is classified along two axes:

| Severity | Criteria | Examples |
|---|---|---|
| Critical | Security boundary change, IAM escalation, public exposure | Security group opened to 0.0.0.0/0, IAM policy with `*:*`, S3 bucket made public |
| High | Configuration change affecting availability or data | RDS parameter change, ECS task definition change, Lambda runtime change |
| Medium | Non-critical configuration change | Instance type change, tag modification on critical resources, DNS TTL change |
| Low | Cosmetic or expected drift | Tag-only changes, description updates, ASG desired count (auto-scaling) |

Classification rules are defined in a YAML config shipped with the agent:

# drift-classification.yaml
rules:
  - resource_type: aws_security_group
    attribute: ingress
    condition: "contains_cidr('0.0.0.0/0')"
    severity: critical
    category: security

  - resource_type: aws_iam_role_policy
    attribute: policy
    severity: high
    category: security

  - resource_type: aws_db_instance
    attribute: parameter_group_name
    severity: high
    category: configuration

  - resource_type: "*"
    attribute: tags
    severity: low
    category: tags

  # Default: anything not matched
  - resource_type: "*"
    attribute: "*"
    severity: medium
    category: configuration

Customers can override these rules in their agent config to match their risk tolerance.
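Evaluation is first-match-wins, which is why the catch-all rule sits last in the YAML. A sketch of the matcher (the `condition` expressions on some rules are omitted here for brevity):

```go
package main

import "fmt"

// Rule mirrors one entry in drift-classification.yaml.
type Rule struct {
	ResourceType string // "*" matches any type
	Attribute    string // "*" matches any attribute
	Severity     string
	Category     string
}

// Classify returns the severity and category of the first matching rule,
// so config order matters: specific rules first, catch-all last.
func Classify(rules []Rule, resourceType, attribute string) (string, string) {
	for _, r := range rules {
		typeOK := r.ResourceType == "*" || r.ResourceType == resourceType
		attrOK := r.Attribute == "*" || r.Attribute == attribute
		if typeOK && attrOK {
			return r.Severity, r.Category
		}
	}
	return "medium", "configuration" // unreachable when a catch-all rule exists
}

func main() {
	rules := []Rule{
		{"aws_security_group", "ingress", "critical", "security"},
		{"aws_db_instance", "parameter_group_name", "high", "configuration"},
		{"*", "tags", "low", "tags"},
		{"*", "*", "medium", "configuration"},
	}
	sev, cat := Classify(rules, "aws_security_group", "ingress")
	fmt.Println(sev, cat) // critical security
	sev, cat = Classify(rules, "aws_lambda_function", "timeout")
	fmt.Println(sev, cat) // medium configuration
}
```

A customer override simply prepends rules to this list, shadowing the defaults.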

2.4 Remediation Engine

The Remediation Engine handles two workflows: Revert (make cloud match code) and Accept (make code match cloud). Both are initiated from Slack action buttons or the dashboard.

Revert Workflow:

sequenceDiagram
    participant U as Engineer (Slack)
    participant NS as Notification Service
    participant RE as Remediation Engine
    participant DA as Drift Agent
    participant TF as Terraform (in customer VPC)

    U->>NS: Click [Revert] button
    NS->>RE: Initiate revert for resource X in stack Y
    RE->>RE: Generate scoped plan (target resource only)
    RE->>RE: Compute blast radius (resource graph)
    RE->>NS: Send confirmation with blast radius
    NS->>U: "Reverting aws_security_group.api will affect 0 other resources. Proceed? [Confirm] [Cancel]"
    U->>NS: Click [Confirm]
    NS->>RE: Confirmed
    RE->>DA: Execute: terraform apply -target=aws_security_group.api -auto-approve
    DA->>TF: Run terraform apply (scoped)
    TF-->>DA: Apply complete
    DA->>RE: Result: success
    RE->>NS: Notify success
    NS->>U: "✅ aws_security_group.api reverted to declared state. Drift resolved."

Accept Workflow (Code-to-Cloud):

When the engineer clicks [Accept], the drift is intentional and the code should be updated to match reality:

  1. Remediation Engine generates a Terraform code patch that updates the resource definition to match the current cloud state
  2. Creates a branch and PR on the connected GitHub repository
  3. PR includes: the code change, a description of the drift, CloudTrail attribution, and a link to the drift event in the dd0c dashboard
  4. Engineer reviews and merges the PR through normal code review process
  5. On merge, the next terraform apply in CI/CD is a no-op for this resource (code now matches cloud)
  6. Agent detects the state file update and marks the drift as resolved

Approval Workflow (V2 — Pro/Business tiers):

For teams that want approval gates before remediation:

# remediation-policy.yaml
policies:
  - resource_type: aws_security_group
    action: auto-revert
    condition: "severity == 'critical'"
    # No approval needed — auto-revert critical security drift

  - resource_type: aws_iam_*
    action: require-approval
    approvers: ["@security-team"]
    timeout: 4h
    # IAM changes need security team sign-off

  - resource_type: aws_db_instance
    action: require-approval
    approvers: ["@dba-team", "@infra-lead"]
    timeout: 24h
    # Database changes need DBA approval

  - resource_type: "*"
    attribute: tags
    action: digest
    # Tag drift goes in the daily digest, no action needed

2.5 Notification Service

The Notification Service is a Lambda function that formats drift events into rich, actionable messages and delivers them to configured channels.

Slack Block Kit Message (Primary):

{
  "blocks": [
    {
      "type": "header",
      "text": {
        "type": "plain_text",
        "text": "🔴 Critical Drift Detected"
      }
    },
    {
      "type": "section",
      "fields": [
        { "type": "mrkdwn", "text": "*Stack:*\nprod-networking" },
        { "type": "mrkdwn", "text": "*Resource:*\naws_security_group.api" },
        { "type": "mrkdwn", "text": "*Changed by:*\narn:aws:iam::123456:user/jsmith" },
        { "type": "mrkdwn", "text": "*When:*\n2 minutes ago" }
      ]
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*What changed:*\n```ingress rule added: 0.0.0.0/0:443 (HTTPS from anywhere)```"
      }
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Blast radius:* 0 dependent resources\n*Owner:* @ravi"
      }
    },
    {
      "type": "actions",
      "elements": [
        { "type": "button", "text": { "type": "plain_text", "text": "🔄 Revert" }, "style": "danger", "action_id": "drift_revert", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "✅ Accept" }, "action_id": "drift_accept", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "⏰ Snooze 24h" }, "action_id": "drift_snooze", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "👤 Assign" }, "action_id": "drift_assign", "value": "evt_abc123" }
      ]
    }
  ]
}

Notification Routing:

| Severity | Slack | Email | PagerDuty | Webhook |
|---|---|---|---|---|
| Critical | Immediate (channel + DM to owner) | Immediate | Page on-call (Pro+) | Immediate |
| High | Immediate (channel) | Immediate | Alert (no page) | Immediate |
| Medium | Batched (hourly digest) | Daily digest | None | Batched |
| Low | Daily digest | Weekly digest | None | Batched |

Slack Interactivity:

When an engineer clicks a button, Slack sends an interaction payload to our API Gateway endpoint (POST /v1/slack/interactions). The Lambda:

  1. Verifies the Slack request signature
  2. Looks up the drift event by ID
  3. Checks the user's permissions (RBAC — can this user remediate this stack?)
  4. Initiates the appropriate workflow (revert, accept, snooze, assign)
  5. Updates the original Slack message to reflect the action taken

Section 3: DATA ARCHITECTURE

3.1 Database Schema (PostgreSQL RDS)

Multi-tenant PostgreSQL with Row-Level Security (RLS). Every table includes org_id and all queries are scoped by it. No cross-tenant data leakage by design.

-- ============================================================
-- ORGANIZATIONS & AUTH
-- ============================================================

CREATE TABLE organizations (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            TEXT NOT NULL,
    slug            TEXT UNIQUE NOT NULL,           -- e.g., "acme-corp"
    plan            TEXT NOT NULL DEFAULT 'free',    -- free, starter, pro, business, enterprise
    stripe_customer_id TEXT,
    max_stacks      INT NOT NULL DEFAULT 3,
    poll_interval_s INT NOT NULL DEFAULT 86400,      -- default: daily (free tier)
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    email           TEXT NOT NULL,
    name            TEXT,
    role            TEXT NOT NULL DEFAULT 'member',  -- owner, admin, member, viewer
    cognito_sub     TEXT UNIQUE,
    slack_user_id   TEXT,                            -- for Slack DM routing
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(org_id, email)
);

-- ============================================================
-- STACKS & RESOURCES
-- ============================================================

CREATE TABLE stacks (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    name            TEXT NOT NULL,                   -- e.g., "prod-networking"
    backend_type    TEXT NOT NULL DEFAULT 's3',       -- s3, gcs, tfc, local
    backend_hash    TEXT NOT NULL,                    -- SHA256 of backend config (no raw paths stored)
    iac_tool        TEXT NOT NULL DEFAULT 'terraform', -- terraform, opentofu, pulumi
    environment     TEXT,                             -- prod, staging, dev
    owner_user_id   UUID REFERENCES users(id),
    slack_channel   TEXT,                             -- override notification channel per stack
    drift_score     REAL NOT NULL DEFAULT 100.0,      -- 0-100, 100 = clean
    last_check_at   TIMESTAMPTZ,
    last_drift_at   TIMESTAMPTZ,
    resource_count  INT NOT NULL DEFAULT 0,
    drifted_count   INT NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(org_id, backend_hash)
);

CREATE TABLE resources (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    stack_id        UUID NOT NULL REFERENCES stacks(id) ON DELETE CASCADE,
    address         TEXT NOT NULL,                   -- e.g., "module.vpc.aws_security_group.api"
    resource_type   TEXT NOT NULL,                   -- e.g., "aws_security_group"
    provider        TEXT NOT NULL,                   -- e.g., "registry.terraform.io/hashicorp/aws"
    cloud_id        TEXT,                            -- e.g., "sg-abc123" (for cross-referencing)
    drift_state     TEXT NOT NULL DEFAULT 'clean',   -- clean, drifted, unknown, ignored
    last_drift_at   TIMESTAMPTZ,
    drift_count     INT NOT NULL DEFAULT 0,          -- lifetime drift count for this resource
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(stack_id, address)
);

CREATE INDEX idx_resources_type ON resources(org_id, resource_type);
CREATE INDEX idx_resources_drift ON resources(org_id, drift_state) WHERE drift_state = 'drifted';

-- ============================================================
-- DRIFT EVENTS
-- ============================================================

CREATE TABLE drift_events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    stack_id        UUID NOT NULL REFERENCES stacks(id),
    resource_id     UUID NOT NULL REFERENCES resources(id),
    report_id       UUID NOT NULL,                   -- groups events from same detection run
    severity        TEXT NOT NULL,                    -- critical, high, medium, low
    category        TEXT NOT NULL,                    -- security, configuration, tags, scaling
    detection_method TEXT NOT NULL,                   -- event_driven, scheduled
    
    -- The drift diff (JSONB for flexible querying)
    diff            JSONB NOT NULL,
    /*  Example diff:
        {
          "attributes": {
            "ingress": {
              "old": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8"]}],
              "new": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8", "0.0.0.0/0"]}]
            }
          }
        }
    */
    
    -- CloudTrail attribution (nullable — scheduled checks don't have this)
    attributed_principal TEXT,                        -- IAM ARN who made the change
    attributed_source_ip TEXT,                        -- source IP
    attributed_event_name TEXT,                       -- e.g., "AuthorizeSecurityGroupIngress"
    attributed_at   TIMESTAMPTZ,                     -- when the change was made
    
    -- Resolution
    status          TEXT NOT NULL DEFAULT 'open',    -- open, resolved, accepted, snoozed, ignored
    resolved_by     UUID REFERENCES users(id),
    resolved_at     TIMESTAMPTZ,
    resolution_type TEXT,                            -- reverted, accepted, snoozed, auto_reverted
    resolution_note TEXT,
    
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_drift_events_stack ON drift_events(org_id, stack_id, created_at DESC);
CREATE INDEX idx_drift_events_status ON drift_events(org_id, status) WHERE status = 'open';
CREATE INDEX idx_drift_events_severity ON drift_events(org_id, severity, created_at DESC);

-- ============================================================
-- REMEDIATION PLANS
-- ============================================================

CREATE TABLE remediation_plans (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    drift_event_id  UUID NOT NULL REFERENCES drift_events(id),
    stack_id        UUID NOT NULL REFERENCES stacks(id),
    plan_type       TEXT NOT NULL,                   -- revert, accept
    status          TEXT NOT NULL DEFAULT 'pending', -- pending, approved, executing, completed, failed, cancelled
    
    -- For revert: terraform plan output (scrubbed)
    plan_output     TEXT,
    target_resources TEXT[],                          -- resource addresses targeted
    blast_radius    INT NOT NULL DEFAULT 0,          -- number of dependent resources affected
    
    -- For accept: generated code patch
    code_patch      TEXT,
    pr_url          TEXT,                             -- GitHub PR URL if created
    
    -- Approval
    requested_by    UUID REFERENCES users(id),
    approved_by     UUID REFERENCES users(id),
    approved_at     TIMESTAMPTZ,
    
    -- Execution
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    error_message   TEXT,
    
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- ============================================================
-- AGENTS
-- ============================================================

CREATE TABLE agents (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    name            TEXT NOT NULL,                    -- e.g., "prod-agent-1"
    api_key_hash    TEXT NOT NULL,                    -- bcrypt hash of the agent API key
    status          TEXT NOT NULL DEFAULT 'active',   -- active, inactive, revoked
    last_heartbeat  TIMESTAMPTZ,
    agent_version   TEXT,
    deployment_type TEXT,                             -- ecs, github_action, binary
    stacks          TEXT[],                           -- stack IDs this agent monitors
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- ============================================================
-- ROW-LEVEL SECURITY
-- ============================================================

ALTER TABLE stacks ENABLE ROW LEVEL SECURITY;
ALTER TABLE resources ENABLE ROW LEVEL SECURITY;
ALTER TABLE drift_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE remediation_plans ENABLE ROW LEVEL SECURITY;
ALTER TABLE agents ENABLE ROW LEVEL SECURITY;
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

-- All policies follow the same pattern: current_setting('app.current_org_id')
CREATE POLICY org_isolation ON stacks
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON resources
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON drift_events
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON remediation_plans
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON agents
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON users
    USING (org_id = current_setting('app.current_org_id')::UUID);
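
The agents.last_heartbeat column above is what distinguishes a live agent from a dead one. A minimal sketch of the staleness check (TypeScript; the 15-minute threshold is an assumption, not a documented default):

```typescript
// Derive an agent's effective status from its last heartbeat.
// staleAfterMs is an assumed threshold -- tune per deployment.
function agentStatus(
  lastHeartbeat: Date | null,
  now: Date,
  staleAfterMs: number = 15 * 60_000,
): "active" | "inactive" {
  if (lastHeartbeat === null) return "inactive"; // never checked in
  const age = now.getTime() - lastHeartbeat.getTime();
  return age > staleAfterMs ? "inactive" : "active";
}
```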

3.2 Event Sourcing (DynamoDB)

Every drift detection result is appended to an immutable event store in DynamoDB. This provides a complete audit trail that can never be modified — critical for compliance evidence.

Table: drift-events-log
Partition Key: org_id#stack_id (String)
Sort Key:      timestamp#event_id (String)
TTL:           expires_at (90 days for free, 1 year for paid, 7 years for enterprise)

Attributes:
  - event_type:     "drift_detected" | "drift_resolved" | "remediation_started" | "remediation_completed" | "agent_heartbeat" | "stack_registered"
  - payload:        Full event payload (JSON)
  - report_id:      Groups events from same detection run
  - checksum:       SHA256 of payload (tamper detection)
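
A sketch of how an event item's composite keys and tamper-detection checksum could be assembled before the PutItem call. Field names mirror the schema above; toDynamoItem itself is illustrative, not the actual ingestion code:

```typescript
import { createHash } from "node:crypto";

interface DriftEvent {
  orgId: string;
  stackId: string;
  eventId: string;
  eventType: string;
  timestamp: string; // ISO 8601
  payload: unknown;
}

// Build the item shape for the drift-events-log table.
function toDynamoItem(e: DriftEvent, ttlDays: number) {
  const payloadJson = JSON.stringify(e.payload);
  return {
    pk: `${e.orgId}#${e.stackId}`,     // partition key: org_id#stack_id
    sk: `${e.timestamp}#${e.eventId}`, // sort key: timestamp#event_id
    event_type: e.eventType,
    payload: payloadJson,
    checksum: createHash("sha256").update(payloadJson).digest("hex"),
    // TTL is epoch seconds; tier determines ttlDays (90 / 365 / ~2555)
    expires_at: Math.floor(Date.parse(e.timestamp) / 1000) + ttlDays * 86_400,
  };
}
```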

Why DynamoDB for the event store (not PostgreSQL):

  1. Append-only workload — DynamoDB excels at high-throughput writes with no read contention
  2. TTL-based expiration — automatic cleanup per tier without cron jobs
  3. DynamoDB Streams — enables downstream consumers (analytics, compliance report generation) without polling
  4. Cost — at 1,000 stacks × 288 checks/day × 365 days = ~105M items/year. DynamoDB on-demand pricing: ~$130/year for writes, ~$25/year for storage. PostgreSQL would need periodic archival to maintain query performance.

DynamoDB Streams Consumer:

A Lambda function subscribes to the DynamoDB Stream and:

  1. Aggregates drift metrics (drift rate, MTTR, drift-by-resource-type) into a PostgreSQL drift_metrics table for dashboard queries
  2. Generates daily/weekly compliance digest reports (stored in S3)
  3. Feeds the drift score calculation engine
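
The aggregation step reduces stream records into per-resource-type drift counts before they land in the drift_metrics table. A pure-function sketch (the record shape is assumed):

```typescript
interface StreamRecord {
  eventType: string;    // e.g. "drift_detected"
  resourceType: string; // e.g. "aws_security_group"
}

// Count drift_detected events per resource type -- the input to the
// drift-by-resource-type dashboard metric.
function driftCountsByType(records: StreamRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    if (r.eventType !== "drift_detected") continue; // ignore heartbeats etc.
    counts.set(r.resourceType, (counts.get(r.resourceType) ?? 0) + 1);
  }
  return counts;
}
```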

3.3 State Snapshot Storage (S3)

When the agent performs a full scheduled check, it can optionally upload a sanitized state snapshot to S3. This enables:

  • Historical comparison ("what did the state look like last Tuesday?")
  • Compliance evidence ("here's the state at the time of the audit")
  • Debugging ("the drift diff looks wrong — let me check the raw state")
Bucket: dd0c-state-snapshots-{account_id}
Prefix: {org_id}/{stack_id}/{YYYY}/{MM}/{DD}/{timestamp}-{report_id}.json.gz

Lifecycle:
  - Standard:           0-30 days
  - Infrequent Access:  30-90 days
  - Glacier Instant:    90-365 days
  - Glacier Deep:       365+ days (enterprise only)
  - Expire:             Per tier TTL (90d free, 1yr paid, 7yr enterprise)

Encryption: SSE-S3 (AES-256) + bucket policy enforcing encryption
Versioning: Enabled (tamper protection)
Object Lock: Compliance mode for enterprise tier (WORM — auditors love this)

What's in a state snapshot:

NOT the raw Terraform state file. The agent sanitizes it:

  1. All sensitive attributes → [REDACTED]
  2. All private instance data → stripped
  3. Backend configuration → hashed
  4. Account IDs → hashed (reversible only by the customer's agent)

The snapshot is useful for drift comparison but cannot be used to reconstruct the customer's infrastructure or extract secrets.
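
The snapshot object key follows the prefix scheme above. A sketch of the key builder (UTC date parts, as the prefix layout implies; the function name is assumed):

```typescript
// Build the S3 object key:
// {org_id}/{stack_id}/{YYYY}/{MM}/{DD}/{timestamp}-{report_id}.json.gz
function snapshotKey(orgId: string, stackId: string, ts: Date, reportId: string): string {
  const yyyy = ts.getUTCFullYear();
  const mm = String(ts.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(ts.getUTCDate()).padStart(2, "0");
  return `${orgId}/${stackId}/${yyyy}/${mm}/${dd}/${ts.toISOString()}-${reportId}.json.gz`;
}
```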

3.4 Multi-Tenant Data Isolation

Three layers of isolation:

Layer 1: Application-Level (RLS) Every API request sets app.current_org_id on the PostgreSQL session before executing queries. Row-Level Security policies ensure queries only return rows matching the org. Even a SQL injection in ordinary query paths cannot reach cross-tenant rows, though an attacker who can run arbitrary SQL could reset the session variable, so RLS complements input validation rather than replacing it.

// Middleware: set org context on every request.
// Note: Postgres SET cannot take bind parameters, so use set_config().
// The setting must live on the same pooled connection that runs the
// request's queries, so check out a dedicated client per request
// (and release it when the response finishes).
async function setOrgContext(req: Request, res: Response, next: NextFunction) {
  const orgId = req.auth.orgId; // from JWT claims
  const client = await pool.connect();
  await client.query("SELECT set_config('app.current_org_id', $1, false)", [orgId]);
  req.db = client; // downstream handlers must query via req.db
  next();
}

Layer 2: Infrastructure-Level (S3 Prefixes + IAM) State snapshots are stored under {org_id}/ prefixes. The SaaS application IAM role scopes object reads to the prefix matching the authenticated org. Because the s3:prefix condition key only applies to ListBucket, object access is constrained through the resource ARN itself:

{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": "arn:aws:s3:::dd0c-state-snapshots-*/${aws:PrincipalTag/org_id}/*"
}

Layer 3: Encryption-Level (Per-Org KMS Keys — Enterprise) Enterprise tier customers get a dedicated KMS key for encrypting their data at rest. This enables:

  • Customer-controlled key rotation
  • Key deletion = cryptographic data destruction (for offboarding)
  • CloudTrail logging of all key usage (customer can audit our access)

Data Residency (V2): For EU customers requiring GDPR data residency, deploy a separate RDS instance + S3 bucket in eu-west-1. The application routes based on org.data_region. This is a V2 feature — MVP runs single-region us-east-1.


Section 4: INFRASTRUCTURE

4.1 AWS Architecture — SaaS Platform

graph TB
    subgraph Public Edge
        CF[CloudFront<br/>Web Dashboard CDN]
        APIGW[API Gateway<br/>HTTP API]
    end

    subgraph Compute — ECS Fargate Cluster
        EP[Event Processor<br/>1-10 tasks, auto-scale]
        RE[Remediation Engine<br/>1-3 tasks, auto-scale]
        DASH[Dashboard API<br/>2 tasks, target-tracking]
    end

    subgraph Serverless
        NS[Notification Service<br/>Lambda]
        AUTH_L[Auth Authorizer<br/>Lambda]
        STREAM_L[DynamoDB Stream<br/>Consumer Lambda]
        CRON_L[Cron Jobs<br/>Lambda + EventBridge Scheduler]
    end

    subgraph Data
        RDS[(PostgreSQL 16<br/>RDS db.t4g.medium<br/>Multi-AZ)]
        DDB[(DynamoDB<br/>On-Demand<br/>Event Store)]
        S3_SNAP[S3<br/>State Snapshots]
        S3_WEB[S3<br/>Web Dashboard Assets]
    end

    subgraph Messaging
        SQS_IN[SQS FIFO<br/>drift-report-ingestion]
        SQS_REM[SQS Standard<br/>remediation-commands]
        SQS_NOTIFY[SQS Standard<br/>notification-fanout]
    end

    subgraph Auth & Secrets
        COG[Cognito User Pool]
        SM[Secrets Manager<br/>Slack tokens, DB creds]
        KMS[KMS<br/>Encryption keys]
    end

    subgraph Monitoring
        CW[CloudWatch<br/>Logs + Metrics + Alarms]
        XR[X-Ray<br/>Distributed Tracing]
    end

    CF --> S3_WEB
    CF --> APIGW
    APIGW --> AUTH_L --> COG
    APIGW --> SQS_IN
    APIGW --> DASH
    SQS_IN --> EP
    EP --> RDS
    EP --> DDB
    EP --> S3_SNAP
    EP --> SQS_NOTIFY
    EP --> SQS_REM
    SQS_NOTIFY --> NS
    SQS_REM --> RE
    RE --> RDS
    DASH --> RDS
    DASH --> S3_SNAP
    DDB --> STREAM_L
    STREAM_L --> RDS

VPC Layout:

VPC: 10.0.0.0/16 (us-east-1)

  Public Subnets (NAT Gateway, ALB):
    10.0.1.0/24  (us-east-1a)
    10.0.2.0/24  (us-east-1b)

  Private Subnets (ECS Tasks, Lambda):
    10.0.10.0/24 (us-east-1a)
    10.0.11.0/24 (us-east-1b)

  Isolated Subnets (RDS):
    10.0.20.0/24 (us-east-1a)
    10.0.21.0/24 (us-east-1b)

  VPC Endpoints (no NAT for AWS services):
    - com.amazonaws.us-east-1.sqs
    - com.amazonaws.us-east-1.dynamodb
    - com.amazonaws.us-east-1.s3
    - com.amazonaws.us-east-1.secretsmanager
    - com.amazonaws.us-east-1.kms
    - com.amazonaws.us-east-1.ecr.api
    - com.amazonaws.us-east-1.ecr.dkr
    - com.amazonaws.us-east-1.logs

4.2 Customer-Side Agent Deployment

The agent is deployed into the customer's AWS account via a Terraform module published to the Terraform Registry.

Terraform Module: dd0c/drift-agent/aws

module "drift_agent" {
  source  = "dd0c/drift-agent/aws"
  version = "~> 1.0"

  # Required
  dd0c_api_key       = var.dd0c_api_key          # From dd0c dashboard
  terraform_state_bucket = "my-terraform-state"   # S3 bucket with state files
  terraform_state_keys   = ["prod/*.tfstate"]     # Glob patterns for state files

  # Optional
  deployment_type    = "ecs"                       # "ecs" | "lambda" | "binary"
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnet_ids
  poll_interval      = 300                         # seconds (overridden by tier)
  
  # EventBridge real-time detection
  enable_eventbridge = true
  cloudtrail_name    = "main-trail"                # Existing CloudTrail trail name
  
  # Resource type filter (optional — default: all supported types)
  resource_types     = ["aws_security_group", "aws_iam_*", "aws_db_instance"]

  tags = {
    Environment = "production"
    ManagedBy   = "dd0c-drift"
  }
}

What the module creates:

| Resource | Purpose |
|---|---|
| ECS Task Definition + Service | Runs the drift agent container (Fargate, 0.25 vCPU, 512MB) |
| IAM Role: dd0c-drift-agent | Agent execution role with read-only permissions |
| IAM Policy: dd0c-drift-readonly | Read access to state bucket + describe APIs for monitored resource types |
| EventBridge Rules | Match CloudTrail write events for monitored resource types |
| SQS Queue: dd0c-drift-events | Buffer for EventBridge events consumed by agent |
| SQS DLQ: dd0c-drift-events-dlq | Dead letter queue for failed event processing |
| CloudWatch Log Group | Agent logs (retained 30 days) |
| Security Group | Egress-only to dd0c SaaS API endpoint + AWS service endpoints |

Alternative: GitHub Actions (Zero Infrastructure)

For teams that don't want to run infrastructure, the agent runs as a GitHub Action:

# .github/workflows/drift-check.yml
name: Drift Check
on:
  schedule:
    - cron: '*/15 * * * *'  # Every 15 minutes
  workflow_dispatch: {}

jobs:
  check:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # For OIDC auth to AWS
    steps:
      - uses: dd0c/drift-action@v1
        with:
          dd0c-api-key: ${{ secrets.DD0C_API_KEY }}
          aws-role-arn: arn:aws:iam::123456789:role/dd0c-drift-readonly
          state-bucket: my-terraform-state
          state-keys: "prod/*.tfstate"

This approach trades real-time detection (no EventBridge) for zero infrastructure. Good for getting started; teams can upgrade to the ECS deployment when they want real-time.

4.3 Cost Estimates

SaaS Platform Costs (Monthly):

| Component | 10 Stacks | 100 Stacks | 1,000 Stacks |
|---|---|---|---|
| RDS db.t4g.medium (Multi-AZ) | $140 | $140 | $280 (db.t4g.large) |
| ECS Fargate (Event Processor) | $15 | $35 | $150 |
| ECS Fargate (Dashboard API) | $30 | $30 | $60 |
| ECS Fargate (Remediation Engine) | $8 | $15 | $50 |
| Lambda (Notifications + Stream) | $1 | $5 | $30 |
| SQS | $1 | $3 | $15 |
| DynamoDB (On-Demand) | $2 | $15 | $130 |
| S3 (State Snapshots) | $1 | $5 | $40 |
| API Gateway | $4 | $15 | $80 |
| CloudFront | $5 | $5 | $10 |
| NAT Gateway | $35 | $35 | $70 |
| Cognito | $0 | $3 | $25 |
| CloudWatch / X-Ray | $10 | $20 | $50 |
| Secrets Manager | $2 | $2 | $5 |
| Total SaaS Infra | ~$254/mo | ~$328/mo | ~$995/mo |

Customer-Side Agent Costs (Per Customer, Monthly):

| Component | Cost |
|---|---|
| ECS Fargate (0.25 vCPU, 512MB, always-on) | ~$9/mo |
| SQS Queue (EventBridge events) | ~$0.50/mo |
| CloudWatch Logs | ~$1/mo |
| EventBridge Rules | Free (rules on the default event bus) |
| Total per customer | ~$10.50/mo |

This is important for pricing: the customer pays ~$10.50/mo in their own AWS bill to run the agent. The $49/mo Starter tier needs to deliver enough value to justify $49 + $10.50 = ~$60/mo total cost.

Unit Economics at Scale:

| Scale | Revenue (est.) | SaaS Infra Cost | Gross Margin |
|---|---|---|---|
| 10 customers (avg $99/mo) | $990/mo | $254/mo | 74% |
| 50 customers (avg $149/mo) | $7,450/mo | $328/mo | 96% |
| 200 customers (avg $199/mo) | $39,800/mo | $995/mo | 97% |

SaaS margins are excellent once past the fixed-cost floor (~$254/mo). The business breaks even at ~3 paying customers.

4.4 Scaling Strategy

Phase 1: MVP (0-100 stacks)

  • Single RDS instance (db.t4g.medium, Multi-AZ)
  • ECS Fargate with auto-scaling (min 1, max 3 per service)
  • DynamoDB on-demand (auto-scales)
  • Single region (us-east-1)

Phase 2: Growth (100-1,000 stacks)

  • RDS read replica for dashboard queries (separate read/write paths)
  • ECS auto-scaling up to 10 tasks per service
  • SQS batch processing (batch size 10 → higher throughput)
  • CloudFront caching for dashboard API (drift scores, stack lists — cache 60s)

Phase 3: Scale (1,000-10,000 stacks)

  • RDS upgrade to db.r6g.large + read replicas
  • Consider migrating event ingestion from SQS FIFO to Kinesis Data Streams (higher throughput, fan-out)
  • DynamoDB DAX for hot-path reads (drift score lookups)
  • Multi-region deployment (us-east-1 + eu-west-1) for data residency
  • Connection pooling via RDS Proxy

Phase 4: Enterprise (10,000+ stacks)

  • Dedicated RDS instances per large enterprise customer
  • Kinesis + Lambda fan-out for event processing
  • ElastiCache (Redis) for session management and rate limiting
  • This is a "good problem to have" phase — re-architect based on actual bottlenecks

4.5 CI/CD Pipeline

graph LR
    subgraph Developer
        CODE[Push to main] --> GH[GitHub]
    end

    subgraph CI — GitHub Actions
        GH --> LINT[Lint + Type Check]
        LINT --> TEST[Unit Tests]
        TEST --> INT[Integration Tests<br/>LocalStack]
        INT --> BUILD[Docker Build<br/>+ Go Binary]
        BUILD --> SCAN[Trivy Container Scan]
        SCAN --> PUSH[Push to ECR]
    end

    subgraph CD — Terraform + ECS
        PUSH --> TF_PLAN[Terraform Plan<br/>staging]
        TF_PLAN --> APPROVE[Manual Approval<br/>for prod]
        APPROVE --> TF_APPLY[Terraform Apply<br/>prod]
        TF_APPLY --> ECS_DEPLOY[ECS Rolling Deploy<br/>Blue/Green]
        ECS_DEPLOY --> SMOKE[Smoke Tests]
        SMOKE --> DONE[✅ Deployed]
    end

Pipeline Details:

| Stage | Tool | Duration |
|---|---|---|
| Lint + Type Check | ESLint + tsc (TypeScript), golangci-lint (Go) | ~30s |
| Unit Tests | Vitest (TypeScript), go test (Go) | ~60s |
| Integration Tests | LocalStack (SQS, DynamoDB, S3 emulation) | ~120s |
| Docker Build | Multi-stage Dockerfile, Go binary cross-compile | ~90s |
| Container Scan | Trivy (CVE scanning) | ~30s |
| ECR Push | Docker push to private ECR | ~20s |
| Terraform Plan | Plan against staging environment | ~30s |
| Manual Approval | GitHub Environment protection rule (prod) | Human |
| Terraform Apply | Apply to prod | ~60s |
| ECS Deploy | Rolling update (min healthy 100%, max 200%) | ~120s |
| Smoke Tests | Hit health endpoints, verify SQS consumption | ~30s |
| Total (automated) | | ~10 minutes |

Agent Release Pipeline:

The Go agent binary is released separately:

  1. Tag a release on GitHub (v1.2.3)
  2. GoReleaser builds binaries for linux/amd64, linux/arm64, darwin/amd64, darwin/arm64
  3. Docker image pushed to public ECR (for ECS deployment)
  4. GitHub Action published to GitHub Marketplace
  5. Terraform module version bumped in Terraform Registry
  6. Changelog posted to dd0c blog + Slack community

Section 5: SECURITY

5.1 IAM Role Design — Customer Accounts

The trust model is the hardest sell: customers are giving dd0c's agent read access to their Terraform state and cloud resource attributes. The architecture must make this as narrow and auditable as possible.

Agent Execution Role (Customer-Side):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadTerraformState",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${state_bucket}",
        "arn:aws:s3:::${state_bucket}/${state_key_prefix}*"
      ]
    },
    {
      "Sid": "DescribeResources",
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "iam:Get*",
        "iam:List*",
        "rds:Describe*",
        "s3:GetBucket*",
        "s3:GetEncryptionConfiguration",
        "s3:GetLifecycleConfiguration",
        "lambda:GetFunction",
        "lambda:GetFunctionConfiguration",
        "lambda:ListTags",
        "ecs:Describe*",
        "ecs:List*",
        "route53:GetHostedZone",
        "route53:ListResourceRecordSets",
        "elasticloadbalancing:Describe*",
        "cloudfront:GetDistribution",
        "sns:GetTopicAttributes",
        "sqs:GetQueueAttributes",
        "dynamodb:DescribeTable",
        "kms:DescribeKey",
        "kms:GetKeyPolicy"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "${monitored_regions}"
        }
      }
    },
    {
      "Sid": "ConsumeEventBridgeQueue",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:*:${account_id}:dd0c-drift-events"
    },
    {
      "Sid": "WriteAgentLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:${account_id}:log-group:/dd0c/drift-agent:*"
    }
  ]
}

Key design decisions:

  1. No cross-account role for the SaaS. The SaaS platform NEVER assumes a role in the customer's account. The agent runs with the customer's own IAM role. The SaaS only receives drift reports over HTTPS. This is the fundamental trust boundary.

  2. Read-only by default. The agent role has zero write permissions. It can describe resources and read state files. It cannot modify anything.

  3. Region-scoped. The aws:RequestedRegion condition limits describe calls to regions the customer explicitly configures. No global enumeration.

  4. State bucket scoped. S3 access is limited to the specific state bucket and key prefix. Not s3:* on *.

5.2 Remediation IAM Role (Separate, Opt-In)

Remediation requires write access. This is a SEPARATE IAM role that customers opt into explicitly. It is never created by default.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformApply",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "iam:*",
        "rds:*",
        "s3:*",
        "lambda:*",
        "ecs:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "${monitored_regions}"
        },
        "StringLike": {
          "aws:ResourceTag/ManagedBy": "terraform"
        }
      }
    },
    {
      "Sid": "StateLock",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:*:${account_id}:table/${state_lock_table}"
    }
  ]
}

Guardrails on remediation:

  1. Tag-scoped: The condition aws:ResourceTag/ManagedBy = terraform limits write actions to resources that are tagged as Terraform-managed. Resources created outside Terraform can't be modified.
  2. Approval required: The SaaS never triggers remediation without explicit human approval (button click in Slack or dashboard). Auto-remediation policies are customer-configured and customer-approved.
  3. Scoped apply: Remediation always uses terraform apply -target=<resource>. Never a full terraform apply. Blast radius is minimized.
  4. Audit trail: Every remediation action is logged in the event store with: who approved it, when, what was changed, and the full terraform plan output.
  5. Kill switch: Customers can revoke the remediation role at any time via IAM. The agent gracefully degrades to detect-only mode.
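
The scoped-apply guardrail is simple to enforce at command-construction time. A sketch (the flag layout is assumed; the real agent may shell out differently):

```typescript
// Build terraform CLI args for a scoped remediation apply.
// Refuses to produce an unscoped (full) apply.
function buildScopedApplyArgs(approvedTargets: string[]): string[] {
  if (approvedTargets.length === 0) {
    throw new Error("refusing unscoped apply: no approved -target addresses");
  }
  return [
    "apply",
    ...approvedTargets.map((t) => `-target=${t}`),
    "-auto-approve", // human approval already happened in Slack/dashboard
  ];
}
```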

5.3 State File Security

Terraform state files are the crown jewels. They contain resource IDs, configuration, and — critically — secret values (database passwords, API keys, private keys). The architecture must handle this with extreme care.

Principle: State files never leave the customer's account.

The agent reads the state file in-memory within the customer's VPC. It extracts resource attributes for drift comparison. Before transmitting anything to the SaaS:

  1. Attribute filtering: Only attributes relevant to drift detection are included in the report. The agent maintains an allowlist per resource type:
# attribute-allowlist.yaml
aws_security_group:
  - ingress
  - egress
  - name
  - description
  - vpc_id
  - tags

aws_iam_role:
  - assume_role_policy
  - max_session_duration
  - path
  - permissions_boundary
  - tags
  # NOT included: inline policies (may contain secrets in conditions)

aws_db_instance:
  - engine
  - engine_version
  - instance_class
  - allocated_storage
  - storage_type
  - multi_az
  - publicly_accessible
  - vpc_security_group_ids
  - db_subnet_group_name
  - parameter_group_name
  - tags
  # NOT included: master_password, endpoint (could be used for targeting)
  2. Secret pattern scrubbing: Even within allowed attributes, values matching secret patterns are redacted:

    • AWS access keys (AKIA...)
    • Database connection strings (postgres://..., mysql://...)
    • Private keys (-----BEGIN RSA PRIVATE KEY-----)
    • JWT tokens (eyJ...)
    • Generic patterns: any value for keys containing password, secret, token, key, credential
  3. In-transit encryption: All agent-to-SaaS communication uses TLS 1.3 with mTLS (mutual TLS). The agent presents a client certificate issued during registration. The SaaS validates it before accepting any data.

  4. At-rest encryption: Drift diffs stored in PostgreSQL and DynamoDB are encrypted with KMS (AWS-managed key for standard tiers, customer-managed key for enterprise).

  5. No state file caching: The agent does not write state file contents to disk. State is read from S3 into memory, processed, and discarded. The Go binary uses mlock to prevent state data from being swapped to disk.
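
The secret pattern scrubbing step amounts to a key-name hint plus a value-pattern pass. A sketch (patterns illustrative, not the agent's actual list):

```typescript
// Redact values that look like secrets, or whose key names suggest secrets.
const SECRET_VALUE_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/,                   // AWS access key IDs
  /(postgres|mysql):\/\/\S+/,           // database connection strings
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/, // PEM private keys
  /eyJ[\w-]+\.[\w-]+\.[\w-]+/,          // JWT-shaped tokens
];
const SECRET_KEY_HINT = /(password|secret|token|key|credential)/i;

function scrubAttributes(attrs: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(attrs)) {
    const suspicious =
      SECRET_KEY_HINT.test(name) || SECRET_VALUE_PATTERNS.some((p) => p.test(value));
    out[name] = suspicious ? "[REDACTED]" : value;
  }
  return out;
}
```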

5.4 SOC 2 Considerations

dd0c/drift will pursue SOC 2 Type II certification. The architecture supports the required controls:

| SOC 2 Criteria | How dd0c Addresses It |
|---|---|
| CC6.1 Logical access controls | Cognito auth, RBAC, API key auth for agents, RLS in PostgreSQL |
| CC6.2 Access provisioned/deprovisioned | User management via dashboard, API key rotation, agent revocation |
| CC6.3 Access restricted to authorized | mTLS for agent connections, JWT validation for dashboard, VPC isolation for data tier |
| CC7.1 Monitoring for anomalies | CloudWatch alarms, X-Ray tracing, agent heartbeat monitoring |
| CC7.2 Incident response | Runbook in Confluence, PagerDuty integration, automated rollback via ECS |
| CC8.1 Change management | Terraform IaC for all infrastructure, GitHub PR reviews, CI/CD pipeline |
| A1.2 Recovery objectives | RDS Multi-AZ (RPO: 0, RTO: <5min), S3 cross-region replication (enterprise), DynamoDB point-in-time recovery |
| C1.1 Confidentiality | State files never leave customer VPC, secret scrubbing, KMS encryption, TLS 1.3 |

Compliance automation:

The irony is not lost on us: a drift detection product must itself be drift-free. dd0c/drift will run dd0c/drift on its own infrastructure. Dogfooding as compliance evidence.

5.5 Trust Model

The trust model is the product's biggest adoption barrier. Here's how we address it at each level:

Level 1: "I don't trust you with any access" → GitHub Actions mode. The agent runs in the customer's GitHub Actions runner. dd0c only receives drift reports (no IAM role in customer account at all). The customer reviews the agent source code (open-source Go binary).

Level 2: "I'll give you read-only access" → Standard deployment. Agent runs in customer VPC with read-only IAM role. State files never leave the account. Only sanitized drift diffs are transmitted.

Level 3: "I trust you to remediate" → Remediation role (opt-in). Separate IAM role with write permissions. Scoped to tagged resources. Requires explicit human approval for every action.

Level 4: "I trust you to auto-remediate" → Auto-remediation policies (Business/Enterprise tier). Customer defines rules for automatic revert. Still uses the remediation IAM role. Full audit trail. Kill switch available.

Open-source agent:

The drift agent Go binary is open-source (Apache 2.0). Customers can:

  • Audit the code to verify what data is collected and transmitted
  • Build from source if they don't trust pre-built binaries
  • Fork and modify for custom requirements
  • Run in air-gapped environments with no SaaS connection (detect-only, local output)

This is the trust unlock. Security teams that won't install a closed-source agent will consider an open-source one they can audit.


Section 6: MVP SCOPE

6.1 V1 Boundary — What Ships

The MVP is ruthlessly scoped to one IaC tool, one cloud, one notification channel, and one deployment model. Everything else is deferred. The goal is: a solo founder ships a working product in 30 days that detects Terraform drift in AWS and alerts via Slack.

V1 Feature Matrix:

| Capability | V1 (Launch) | Status |
|---|---|---|
| IaC Support | Terraform + OpenTofu (state v4 format only) | Ship |
| Cloud Provider | AWS only | Ship |
| Detection: Scheduled | Poll state vs. cloud on configurable interval | Ship |
| Detection: Event-Driven | CloudTrail → EventBridge → SQS → Agent | Ship |
| Notification: Slack | Block Kit messages with action buttons | Ship |
| Remediation: Revert | Scoped terraform apply -target via agent | Ship |
| Remediation: Accept | Auto-generate PR to update IaC code | Ship |
| Dashboard | Drift score, stack list, event history (minimal React SPA) | Ship |
| Agent: ECS | Terraform module for ECS Fargate deployment | Ship |
| Agent: GitHub Actions | Scheduled workflow, zero infra | Ship |
| Onboarding | CLI drift init auto-discovery | Ship |
| Auth | Email/password + GitHub OAuth via Cognito | Ship |
| Billing | Stripe integration, self-serve upgrade | Ship |
| Multi-tenant | RLS-based isolation, org/user/stack model | Ship |

V1 Resource Type Coverage (Top 20):

The agent ships with drift detection for the 20 most commonly drifted AWS resource types. This list is derived from driftctl's historical GitHub issues, r/terraform drift complaints, and CloudTrail event frequency data:

| Priority | Resource Type | Why It Drifts | Detection Complexity |
|---|---|---|---|
| 1 | aws_security_group / aws_security_group_rule | Emergency port opens during incidents | Low — ec2:DescribeSecurityGroups |
| 2 | aws_iam_role / aws_iam_role_policy | Permission escalation, console edits | Medium — policy document comparison |
| 3 | aws_iam_policy / aws_iam_policy_attachment | Inline policy edits, attachment changes | Medium — version document diff |
| 4 | aws_s3_bucket (config attributes) | Public access toggles, lifecycle changes | Medium — composite describe calls |
| 5 | aws_db_instance | Parameter group changes, storage scaling | Low — rds:DescribeDBInstances |
| 6 | aws_instance | Instance type changes, security group swaps | Low — ec2:DescribeInstances |
| 7 | aws_lambda_function | Runtime updates, env var changes | Low — lambda:GetFunction |
| 8 | aws_ecs_service | Task count changes, image tag updates | Low — ecs:DescribeServices |
| 9 | aws_ecs_task_definition | Container definition edits | Medium — JSON deep comparison |
| 10 | aws_route53_record | DNS record changes (manual cutover) | Low — route53:ListResourceRecordSets |
| 11 | aws_lb_listener / aws_lb_listener_rule | Routing rule changes | Low — elbv2:DescribeListeners |
| 12 | aws_autoscaling_group | Desired capacity (auto-scaling noise) | Low — needs noise filtering |
| 13 | aws_cloudwatch_metric_alarm | Threshold tweaks | Low — cloudwatch:DescribeAlarms |
| 14 | aws_sns_topic / aws_sqs_queue | Policy changes, subscription edits | Low — sns:GetTopicAttributes |
| 15 | aws_dynamodb_table | Capacity mode changes, GSI edits | Medium — dynamodb:DescribeTable |
| 16 | aws_elasticache_cluster | Node type changes, parameter group | Low — elasticache:DescribeCacheClusters |
| 17 | aws_kms_key | Key policy changes | Medium — policy document diff |
| 18 | aws_cloudfront_distribution | Origin changes, behavior edits | High — complex nested config |
| 19 | aws_vpc / aws_subnet | CIDR changes, tag drift | Low — ec2:DescribeVpcs |
| 20 | aws_eip / aws_nat_gateway | Association changes | Low — ec2:DescribeAddresses |

Resource types beyond the top 20 are detected as "unknown drift" — the agent reports that the resource exists in state but can't compare attributes. Customers can request priority for specific types via GitHub issues.

6.2 What's Deferred to V2+

Saying "no" is the only way a solo founder ships in 30 days. Here's what's explicitly deferred and why:

V2 (Month 3-4):

| Feature | Why Deferred | Dependency |
|---|---|---|
| CloudFormation support | Different state format (stack resources API), different drift detection mechanism (detect-stack-drift API). Requires a separate parser and comparator. | New state parser module |
| Pulumi support | Pulumi state is stored differently (Pulumi Service backend or S3 with different schema). Requires a new state parser. | New state parser module |
| Auto-remediation policies | Per-resource-type automation rules (auto-revert, alert, digest, ignore). Requires policy engine, rule evaluation, and careful UX to avoid accidental auto-reverts. | Policy engine, approval workflow |
| Compliance report generation | SOC 2 / HIPAA evidence export (PDF/CSV). Requires report templating, data aggregation, and export pipeline. | DynamoDB event store populated with 30+ days of data |
| Drift trends & analytics | Time-series charts (drift rate, MTTR, most-drifted resources). Requires metrics aggregation pipeline and charting frontend. | DynamoDB Streams consumer, charting library |
| PagerDuty / OpsGenie integration | Route critical drift through existing on-call. Requires integration auth, event mapping, and escalation logic. | Notification service extension |
| Teams & RBAC | Multi-team support, role-based permissions, stack-level access control. Requires authorization layer beyond basic org membership. | Auth service extension |

V3 (Month 6-9):

| Feature | Why Deferred |
|---|---|
| Multi-cloud (Azure, GCP) | Each cloud requires its own describe API mapping, authentication model, and event pipeline. Triple the agent complexity. |
| Drift prediction (ML) | Requires aggregate data from 500+ customers to build meaningful models. Can't do this at launch. |
| Industry benchmarking | Same data requirement as prediction. Need critical mass of anonymized drift data. |
| SSO / SAML | Enterprise auth. Not needed until enterprise customers appear. Cognito supports it when ready. |
| Full API & webhooks | Public API for programmatic access. V1 has internal APIs only. Public API requires versioning, rate limiting, documentation, and SDK generation. |
| dd0c platform integration | Cross-module data flow (drift → alert, drift → portal). Requires dd0c/alert and dd0c/portal to exist first. |

Explicitly NOT building (ever, unless market demands it):

| Anti-Feature | Why Not |
|---|---|
| CI/CD orchestration | That's Spacelift/env0's game. We detect drift, we don't run pipelines. |
| Policy-as-code engine (OPA/Sentinel) | Adjacent but different problem. Integrate with existing policy tools, don't build one. |
| Cost management | That's dd0c/cost. Separate product, separate concern. |
| Service catalog | That's dd0c/portal. Drift feeds into it, doesn't replace it. |
| Multi-cloud state management | We read state, we don't manage it. No state migration, no state locking, no remote backend hosting. |

6.3 Onboarding Flow

The onboarding flow is the product's first impression. It must go from CLI install to first Slack alert in under 5 minutes. Every second of friction is a lost conversion.

Flow: CLI-First Onboarding

Step 1: Install CLI
─────────────────────────────────────────────
$ brew install dd0c/tap/drift-cli
# or: curl -sSL https://get.dd0c.dev/drift | sh
# or: go install github.com/dd0c/drift-cli@latest

Step 2: Authenticate
─────────────────────────────────────────────
$ drift auth login
→ Opens browser: https://app.dd0c.dev/auth
→ GitHub OAuth or email/password
→ CLI receives token via localhost callback
✅ Authenticated as brian@dd0c.dev (org: dd0c)

Step 3: Auto-Discover State Backends
─────────────────────────────────────────────
$ drift init
🔍 Scanning for Terraform state backends...

Found 3 state backends:
  1. s3://acme-terraform-state/prod/networking.tfstate    (23 resources)
  2. s3://acme-terraform-state/prod/compute.tfstate       (47 resources)
  3. s3://acme-terraform-state/staging/main.tfstate       (31 resources)

Register all 3 stacks? [Y/n]: Y

✅ Registered 3 stacks (101 resources total)
   Org plan: Free (3 stacks max) — you're at capacity

Step 4: Configure Slack
─────────────────────────────────────────────
$ drift connect slack
→ Opens browser: Slack OAuth install flow
→ Select workspace and channel (#infrastructure-alerts)
✅ Connected to Slack workspace "Acme Corp" → #infrastructure-alerts

Step 5: First Drift Check
─────────────────────────────────────────────
$ drift check --all
🔍 Checking 3 stacks (101 resources)...

Stack: prod-networking (23 resources)
  🔴 CRITICAL  aws_security_group.api — ingress rule added (0.0.0.0/0:443)
  🟡 MEDIUM    aws_route53_record.api — TTL changed (300 → 60)
  ✅ 21 resources clean

Stack: prod-compute (47 resources)
  🟠 HIGH      aws_iam_role.lambda_exec — policy document changed
  🔵 LOW       aws_instance.worker[0] — tags.Environment changed
  ✅ 45 resources clean

Stack: staging-main (31 resources)
  ✅ All 31 resources clean

Summary: 4 drifted resources across 3 stacks (96% aligned)
📨 Slack alerts sent to #infrastructure-alerts

Step 6: Deploy Agent (Optional — for continuous monitoring)
─────────────────────────────────────────────
$ drift agent deploy --type=github-action
→ Generates .github/workflows/drift-check.yml
→ Creates GitHub secret DD0C_API_KEY via gh CLI
✅ Agent deployed — drift checks will run every 15 minutes

# OR for ECS deployment:
$ drift agent deploy --type=ecs --vpc-id=vpc-abc123 --subnets=subnet-1,subnet-2
→ Generates Terraform module in ./dd0c-drift-agent/
→ Run: cd dd0c-drift-agent && terraform init && terraform apply

Auto-Discovery Logic:

The drift init command discovers state backends through multiple strategies:

| Strategy | How It Works | Coverage |
|---|---|---|
| AWS credential chain | Uses default AWS credentials to scan S3 buckets matching common patterns (`*-terraform-state`, `*-tfstate`, `*-tf-state`) | ~60% of teams |
| Terraform config scan | Walks the current directory tree for `*.tf` files, parses `backend` blocks | ~80% of teams (if run from repo root) |
| Environment variables | Reads `TF_STATE_BUCKET`, `TF_WORKSPACE`, `AWS_DEFAULT_REGION` | ~30% of teams |
| Terraform Cloud/Enterprise | Checks `~/.terraform.d/credentials.tfrc.json` for TFC tokens, queries the workspace API | ~15% of teams (TFC users) |
| Interactive fallback | If auto-discovery finds nothing: "Enter your S3 state bucket name:" | 100% (manual) |

The goal: 80% of users hit drift init and see their stacks listed without typing a bucket name.
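As a minimal sketch of the first strategy, the bucket-name matching reduces to a suffix check. The function name `matchesStatePattern` is illustrative, not the shipped agent API:

```go
package main

import (
	"fmt"
	"strings"
)

// matchesStatePattern reports whether an S3 bucket name looks like a
// Terraform state bucket, using the suffix patterns from the
// credential-chain strategy above.
func matchesStatePattern(bucket string) bool {
	suffixes := []string{"-terraform-state", "-tfstate", "-tf-state"}
	for _, s := range suffixes {
		if strings.HasSuffix(bucket, s) {
			return true
		}
	}
	return false
}

func main() {
	for _, b := range []string{"acme-terraform-state", "corp-tfstate", "acme-assets"} {
		fmt.Printf("%s => %v\n", b, matchesStatePattern(b))
	}
}
```

A real scan would list buckets via the S3 API and apply this filter before probing for `.tfstate` objects.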

Onboarding Metrics:

| Metric | Target | Kill Threshold |
|---|---|---|
| Time from install to first drift check output | < 3 minutes | > 10 minutes |
| Time from signup to first Slack alert | < 5 minutes | > 15 minutes |
| `drift init` auto-discovery success rate | > 70% | < 40% |
| Onboarding completion rate (install → Slack connected) | > 60% | < 30% |
| Drop-off at each step | < 15% per step | > 30% at any step |

6.4 Technical Debt Budget

A solo founder shipping in 30 days will accumulate technical debt. The key is to accumulate it intentionally, in known locations, with a plan to pay it down.

Acceptable Debt (Ship With It):

| Debt Item | Where | Why It's Acceptable | Pay Down By |
|---|---|---|---|
| Hardcoded resource type mappings | Agent: `resource_mapper.go` | The top 20 resource types are hardcoded as Go structs with describe API calls. No plugin system. Adding type 21 requires a code change and agent release. | V2: Plugin system or YAML-driven resource type definitions |
| Single-region SaaS | Infrastructure: us-east-1 only | Multi-region adds complexity (RDS replication, S3 cross-region, routing). Not needed until EU customers demand GDPR residency. | V2: eu-west-1 deployment when first EU enterprise customer signs |
| No database migrations framework | SaaS: raw SQL files | At MVP, the schema is small enough to manage with numbered SQL files. No Flyway/Liquibase. | Month 3: Adopt golang-migrate or Prisma Migrate before schema exceeds 20 tables |
| Minimal error handling in CLI | CLI: `drift init`, `drift check` | Error messages are functional but not polished. Stack traces may leak in edge cases. | Month 2: Error wrapping, user-friendly messages, `--debug` flag for verbose output |
| No retry logic on Slack API | Notification Service Lambda | Slack API failures drop the notification silently. No retry queue. At low volume, this is rare. | Month 2: SQS DLQ for failed Slack deliveries, retry with exponential backoff |
| Dashboard is read-only | Web SPA | V1 dashboard shows drift scores and event history. Settings, team management, and policy configuration are CLI-only. | V2: Full dashboard CRUD for stacks, policies, team members |
| No rate limiting on public API | API Gateway | API Gateway has default throttling (10K req/s) but no per-org rate limiting. At MVP scale, this is fine. | Month 3: Per-org rate limiting via API Gateway usage plans + API keys |
| Test coverage < 80% | Agent + SaaS | Integration tests cover the critical path (detect → notify → remediate). Unit test coverage will be ~60% at launch. | Month 2-3: Increase to 80%+ with focus on drift comparator and secret scrubber |
| No OpenTelemetry | SaaS services | V1 uses CloudWatch Logs + X-Ray. No custom metrics, no distributed trace correlation across services. | V2: OTel SDK integration, custom metrics (drift detection latency, queue depth, notification delivery rate) |
| Monorepo without workspace tooling | Repository structure | Single repo with `agent/`, `saas/`, `dashboard/`, `cli/` directories. No Nx/Turborepo. Builds are sequential. | Month 3: Turborepo or Nx for parallel builds when CI time exceeds 15 minutes |

Unacceptable Debt (Do NOT Ship With):

| Debt Item | Why It's Unacceptable |
|---|---|
| No secret scrubbing | Transmitting customer secrets to SaaS is a trust-destroying, potentially lawsuit-inducing failure. Secret scrubber ships in V1, fully tested. |
| No RLS on PostgreSQL | Cross-tenant data leakage is an existential risk. RLS is enabled from Day 1 with integration tests that verify isolation. |
| No mTLS on agent connections | Agent-to-SaaS communication without mutual TLS means anyone with an API key can impersonate an agent. mTLS ships in V1. |
| No Stripe webhook verification | Accepting unverified Stripe webhooks enables billing manipulation. Signature verification is a one-liner — no excuse to skip it. |
| No input validation on drift diffs | Malicious agents could inject SQL or XSS via crafted drift diffs. Input validation and parameterized queries are non-negotiable. |
| No CloudTrail event signature verification | EventBridge events should be validated against CloudTrail digest files to prevent spoofed drift events. |

Debt Paydown Schedule:

Month 1 (Launch):     Ship with acceptable debt. Focus on working product.
Month 2:              Error handling, Slack retry logic, test coverage to 70%
Month 3:              Rate limiting, database migrations framework, Turborepo
Month 4 (V2 start):   Plugin system for resource types, dashboard CRUD, OTel
Month 6:              Multi-region, full test coverage (80%+), performance profiling

6.5 Solo Founder Operational Model

One person builds, ships, operates, markets, and supports this product. The architecture must minimize operational burden or the founder burns out before reaching $10K MRR.

Operational Principles:

  1. Managed services over self-hosted. RDS over self-managed PostgreSQL. Cognito over self-hosted auth. SQS over self-managed RabbitMQ. Lambda over always-on notification servers. Every managed service is one fewer thing to page about at 3am.

  2. Alerts on business impact, not infrastructure metrics. Don't alert on CPU > 80%. Alert on: "drift reports stopped arriving from agent X" (customer impact), "Slack notification delivery failed 3x" (customer impact), "RDS storage > 80%" (approaching outage). Fewer alerts = sustainable on-call.

  3. Automate recovery, not just detection. ECS auto-restarts crashed tasks. Lambda retries on failure. SQS DLQ captures poison messages. RDS Multi-AZ fails over automatically. The founder should wake up to a resolved incident, not an active one.

  4. Weekly maintenance window, not continuous ops. Sunday evening: review CloudWatch dashboards, check DLQ depth, review error logs, update dependencies, run terraform plan to verify no drift (dogfooding). 2 hours/week max.

On-Call Model:

PagerDuty Configuration:
  - Critical alerts (customer-facing outage):     Page immediately, 24/7
  - High alerts (degraded service):               Page during business hours only
  - Medium alerts (non-urgent operational):        Slack notification, review in weekly maintenance
  - Low alerts (informational):                    CloudWatch dashboard only

Critical Alert Triggers:
  - API Gateway 5xx rate > 5% for 5 minutes
  - SQS FIFO queue age > 15 minutes (drift reports backing up)
  - RDS connection count > 80% of max
  - Zero drift reports received in 1 hour (all agents down?)
  - Stripe webhook processing failures > 3 consecutive

Estimated Alert Volume:
  - Month 1-3:   ~2-3 critical alerts/month (new system, bugs)
  - Month 3-6:   ~1 critical alert/month (stabilized)
  - Month 6+:    ~0.5 critical alerts/month (mature)

Support Model:

| Channel | Response Time | Tier |
|---|---|---|
| GitHub Issues (drift-cli) | 24-48 hours | All (open-source community) |
| In-app chat (Intercom) | 24 hours (business days) | Free + Paid |
| Slack community (#dd0c-drift) | Best effort, same day | All |
| Email (support@dd0c.dev) | 24 hours | Paid |
| Priority Slack DM | 4 hours (business days) | Business + Enterprise |

Time Allocation (Solo Founder, 50 hrs/week):

Week 1-4 (Pre-Launch Build):
  ├── 35 hrs  Engineering (agent + SaaS + dashboard + CLI)
  ├── 5 hrs   Infrastructure (Terraform, CI/CD, monitoring)
  ├── 5 hrs   Content (README, docs, blog draft)
  └── 5 hrs   Community (Reddit lurking, driftctl issue monitoring)

Week 5-8 (Launch + Seed):
  ├── 25 hrs  Engineering (bug fixes, polish, V1.1 patches)
  ├── 5 hrs   Infrastructure + ops
  ├── 10 hrs  Content (blog posts, Drift Cost Calculator)
  └── 10 hrs  Community (Reddit engagement, HN launch, design partners)

Week 9-12 (Growth):
  ├── 20 hrs  Engineering (V1.x improvements, V2 planning)
  ├── 5 hrs   Ops + support
  ├── 10 hrs  Content + SEO
  ├── 10 hrs  Community + partnerships
  └── 5 hrs   Business (metrics review, pricing analysis, investor prep)

Steady State (Month 4+):
  ├── 20 hrs  Engineering
  ├── 5 hrs   Ops + support (scales with customer count)
  ├── 10 hrs  Marketing (content + community + partnerships)
  ├── 10 hrs  Product (customer feedback, roadmap, design)
  └── 5 hrs   Business (metrics, billing, legal)

Automation That Saves Founder Time:

| Automation | Time Saved | Implementation |
|---|---|---|
| Stripe billing | 5 hrs/week | Self-serve upgrade/downgrade, automatic invoicing, dunning emails |
| GitHub Actions CI/CD | 3 hrs/week | Automated test → build → deploy pipeline. No manual deployments. |
| Intercom chatbot | 2 hrs/week | FAQ auto-responses for common questions (pricing, setup, supported resources) |
| CloudWatch auto-remediation | 1 hr/week | Auto-restart ECS tasks, auto-scale on queue depth, auto-archive old DynamoDB items |
| Dependabot + Renovate | 1 hr/week | Automated dependency updates with auto-merge for patch versions |
| dd0c/drift on dd0c/drift | 1 hr/week | Dogfooding — drift detection on own infrastructure eliminates manual `terraform plan` runs |

Section 7: API DESIGN

The dd0c/drift API surface is divided into three distinct zones: the highly restricted Agent API (mTLS authenticated, ingestion only), the standard Dashboard API (JWT authenticated, CRUD operations), and the Integration APIs (Webhooks, Slack, and dd0c platform cross-talk).

7.1 Agent API (Ingestion & Heartbeat)

This API is exposed strictly for the drift agent running in the customer's environment. All endpoints require mutual TLS (mTLS) combined with a static, org-scoped API key sent via headers.

Agent Registration & Heartbeat:

POST /v1/agents/register
Authorization: Bearer dd0c_api_...
Content-Type: application/json

{
  "agent_id": "uuid",
  "name": "prod-ecs-cluster-agent",
  "version": "1.2.3",
  "deployment_type": "ecs",
  "monitored_stacks": ["prod-networking", "prod-compute"]
}

# Response: 200 OK
{
  "status": "active",
  "poll_interval_s": 300,
  "config_hash": "abc123def456"
}

POST /v1/agents/{agent_id}/heartbeat
Authorization: Bearer dd0c_api_...

{
  "uptime_s": 86400,
  "events_processed": 142,
  "memory_mb": 42
}

Drift Report Submission:

This is the core ingestion endpoint. It accepts batched drift reports (either from event-driven CloudTrail intercepts or scheduled full-state comparisons).

POST /v1/drift-reports
Authorization: Bearer dd0c_api_...

{
  "stack_id": "prod-networking",
  "report_id": "uuid",
  "detection_method": "event_driven",
  "timestamp": "2026-02-28T10:00:00Z",
  "drifted_resources": [
    {
      "address": "module.vpc.aws_security_group.api",
      "type": "aws_security_group",
      "severity": "critical",
      "category": "security",
      "diff": {
        "ingress": {
          "old": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8"]}],
          "new": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8", "0.0.0.0/0"]}]
        }
      },
      "attribution": {
        "principal": "arn:aws:iam::123456:user/jsmith",
        "source_ip": "192.168.1.1",
        "event_name": "AuthorizeSecurityGroupIngress"
      }
    }
  ]
}

7.2 Dashboard & Query API

This is the REST API consumed by the React web dashboard. It relies on standard JWT Bearer tokens issued by Cognito. All responses are scoped via RLS to the authenticated user's org_id.

Drift Event Query & Search:

Allows complex filtering to power the drift history and active drift dashboards.

GET /v1/drift-events
  ?stack_id=prod-networking
  &status=open
  &severity=critical,high
  &limit=50
  &offset=0

# Response: 200 OK
{
  "data": [
    {
      "id": "evt_abc123",
      "stack_id": "prod-networking",
      "resource_address": "module.vpc.aws_security_group.api",
      "severity": "critical",
      "status": "open",
      "created_at": "2026-02-28T10:00:00Z"
      // ... full event details ...
    }
  ],
  "pagination": {
    "total": 12,
    "has_next": false
  }
}
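A client can drain this endpoint by advancing `offset` until `has_next` is false. A minimal sketch with a stubbed fetcher standing in for the HTTP call (all names are illustrative):

```go
package main

import "fmt"

// page mirrors the response envelope above: a slice of event IDs plus
// the pagination flag. A real fetcher would issue
// GET /v1/drift-events?limit=N&offset=M and decode the JSON body.
type page struct {
	Data    []string
	HasNext bool
}

// fetchAll walks pages until has_next is false, accumulating event IDs.
func fetchAll(fetch func(limit, offset int) page, limit int) []string {
	var all []string
	for offset := 0; ; offset += limit {
		p := fetch(limit, offset)
		all = append(all, p.Data...)
		if !p.HasNext {
			return all
		}
	}
}

func main() {
	events := []string{"evt_1", "evt_2", "evt_3", "evt_4", "evt_5"}
	fake := func(limit, offset int) page {
		if end := offset + limit; end < len(events) {
			return page{Data: events[offset:end], HasNext: true}
		}
		return page{Data: events[offset:], HasNext: false}
	}
	fmt.Println(len(fetchAll(fake, 2))) // → 5
}
```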

Policy Configuration API (V2):

Manages the auto-remediation and alerting policies for a stack.

POST /v1/stacks/{stack_id}/policies

{
  "name": "Auto-revert public security groups",
  "resource_type": "aws_security_group",
  "condition": "severity == 'critical'",
  "action": "auto_revert",
  "enabled": true
}

7.3 Slack Integration

The Slack integration relies on two primary interaction models: Slash commands (user-initiated) and Interactive Actions (button clicks on drift alerts).

Slash Commands:

  • /drift status [stack_name] - Returns the current drift score, number of open drift events, and agent health status for the specified stack (or all stacks if omitted).
  • /drift check [stack_name] - Triggers an immediate, out-of-band drift check on the agent for the specified stack (same code path as a scheduled full-state comparison).
  • /drift silence [resource_address] [duration] - Temporarily mutes alerts for a noisy resource (e.g., /drift silence aws_autoscaling_group.workers 24h).

Interactive Actions (Webhooks from Slack):

When a user clicks an action button ([Revert], [Accept], [Snooze]) in a drift alert message, Slack POSTs a signed payload to the dd0c API Gateway.

POST /v1/slack/interactions
Content-Type: application/x-www-form-urlencoded
X-Slack-Signature: v0=a2114d...
X-Slack-Request-Timestamp: 1614556800

payload={
  "type": "block_actions",
  "user": { "id": "U123456", "name": "jsmith" },
  "actions": [
    {
      "action_id": "drift_revert",
      "value": "evt_abc123"
    }
  ]
}

The Notification Service Lambda validates the signature, performs RBAC checks, and initiates the workflow via the Remediation Engine.

7.4 Outbound Webhooks

Customers can subscribe to drift events via outbound webhooks to trigger custom internal workflows (e.g., creating Jira tickets, updating custom dashboards).

Webhook Registration:

Customers configure an endpoint URL and receive a signing secret in the dashboard.

Payload Delivery:

POST /webhook
X-dd0c-Signature: sha256=a1b2c3d4e5f6...
Content-Type: application/json

{
  "event_type": "drift.detected",
  "event_id": "webhook_evt_789",
  "timestamp": "2026-02-28T10:05:00Z",
  "data": {
    "drift_event_id": "evt_abc123",
    "stack_id": "prod-networking",
    "resource_address": "module.vpc.aws_security_group.api",
    "severity": "critical"
  }
}

Other event types include drift.resolved (when an event is remediated or accepted) and agent.offline (when an agent misses its heartbeat threshold).

7.5 dd0c Platform Integrations

While dd0c/drift works perfectly as a standalone SaaS, its value compounds when deployed alongside other modules in the dd0c platform suite.

dd0c/alert Integration: Instead of dd0c/drift connecting directly to Slack or PagerDuty, it can emit events to dd0c/alert via an internal event bus (EventBridge). dd0c/alert handles the intelligent routing based on on-call schedules, deduplication, grouping, and escalation policies.

API Flow: drift Event Processor -> internal EventBridge -> dd0c/alert Ingestion API

dd0c/portal Integration: dd0c/portal serves as the developer service catalog. When a team views a service in the catalog, it queries the drift dashboard API for infrastructure health.

API Flow: dd0c/portal backend -> GET /v1/stacks/{id}/drift-score -> UI rendering

This enriches the service catalog with real-time IaC compliance metrics and allows developers to see their drift score directly next to their service ownership details without opening the drift dashboard.

7.6 Rate Limits & Throttling

Rate limits are enforced at the API Gateway layer via usage plans. Limits are tiered by plan and by API zone (Agent vs. Dashboard) to protect the platform from runaway agents and abusive clients while keeping the system responsive for legitimate use.

Agent API Rate Limits:

| Tier | Drift Report Submissions | Heartbeats | Agent Registrations |
|---|---|---|---|
| Free | 100 req/hour per org | 60 req/hour per agent | 5 req/day per org |
| Starter | 500 req/hour per org | 120 req/hour per agent | 20 req/day per org |
| Pro | 2,000 req/hour per org | 120 req/hour per agent | 50 req/day per org |
| Business | 10,000 req/hour per org | 300 req/hour per agent | 200 req/day per org |
| Enterprise | Custom (negotiated) | Custom | Custom |

These limits are generous relative to expected usage. A Pro tier customer with 30 stacks checking every 5 minutes generates ~360 reports/hour — well within the 2,000 limit. The limits exist to catch misconfigured agents stuck in tight loops, not to throttle normal operation.

Dashboard API Rate Limits:

| Tier | Read Requests | Write Requests (mutations) |
|---|---|---|
| Free | 300 req/min per user | 30 req/min per user |
| Starter | 600 req/min per user | 60 req/min per user |
| Pro | 1,200 req/min per user | 120 req/min per user |
| Business | 3,000 req/min per user | 300 req/min per user |

Slack Interaction Limits:

Slack interactions (button clicks, slash commands) are rate-limited at 60 req/min per Slack workspace. This prevents a runaway Slack bot or automated Slack client from overwhelming the remediation engine. Slack's own rate limits (~1 message/sec per channel) provide an additional natural throttle on the notification side.

Rate Limit Headers:

All API responses include standard rate limit headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1847
X-RateLimit-Reset: 1709110800
Retry-After: 42          # Only present on 429 responses

When a client exceeds its rate limit, the API returns 429 Too Many Requests with a Retry-After header indicating seconds until the window resets. The agent is built to respect this header and back off automatically with jitter.

7.7 Error Codes & Response Format

All API errors follow a consistent JSON envelope. Every error includes a machine-readable code, a human-readable message, and an optional details object for structured context.

Error Response Format:

{
  "error": {
    "code": "DRIFT_STACK_NOT_FOUND",
    "message": "Stack 'prod-networking' does not exist or you do not have access.",
    "request_id": "req_a1b2c3d4",
    "details": {
      "stack_id": "prod-networking"
    }
  }
}

HTTP Status Codes:

| Status | When Used |
|---|---|
| 200 OK | Successful read or mutation |
| 201 Created | Resource created (agent registration, policy creation, webhook subscription) |
| 202 Accepted | Async operation queued (drift report ingested, remediation initiated) |
| 204 No Content | Successful deletion |
| 400 Bad Request | Malformed payload, missing required fields, invalid filter parameters |
| 401 Unauthorized | Missing or invalid authentication (expired JWT, bad API key) |
| 403 Forbidden | Authenticated but insufficient permissions (RBAC violation, wrong org) |
| 404 Not Found | Resource doesn't exist or is outside the caller's org scope (RLS) |
| 409 Conflict | Duplicate agent registration, policy name collision, concurrent remediation on same resource |
| 422 Unprocessable Entity | Semantically invalid request (e.g., policy references a non-existent stack, invalid severity value) |
| 429 Too Many Requests | Rate limit exceeded — includes Retry-After header |
| 500 Internal Server Error | Unhandled server error — logged, alerted, includes request_id for support correlation |
| 502 Bad Gateway | Upstream dependency failure (Slack API down, GitHub API timeout) |
| 503 Service Unavailable | Planned maintenance or circuit breaker tripped — includes Retry-After |

Application Error Codes:

| Code | HTTP Status | Description |
|---|---|---|
| AGENT_NOT_FOUND | 404 | Agent ID does not exist or belongs to a different org |
| AGENT_REVOKED | 403 | Agent API key has been revoked — re-register required |
| AGENT_VERSION_UNSUPPORTED | 422 | Agent version is below the minimum supported version (forces upgrade) |
| DRIFT_REPORT_DUPLICATE | 409 | Report with this report_id was already ingested (SQS FIFO dedup fallback) |
| DRIFT_REPORT_INVALID | 400 | Report payload fails schema validation (missing fields, invalid types) |
| DRIFT_REPORT_TOO_LARGE | 400 | Report exceeds 1MB payload limit — split into multiple submissions |
| DRIFT_STACK_NOT_FOUND | 404 | Stack does not exist or caller lacks access |
| DRIFT_STACK_LIMIT | 403 | Org has reached the maximum stack count for their plan tier |
| DRIFT_EVENT_NOT_FOUND | 404 | Drift event ID does not exist |
| REMEDIATION_IN_PROGRESS | 409 | A remediation is already running for this resource — wait for completion |
| REMEDIATION_NOT_PERMITTED | 403 | User's RBAC role does not allow remediation on this stack |
| REMEDIATION_AGENT_OFFLINE | 502 | The agent responsible for this stack has not sent a heartbeat in >5 minutes |
| POLICY_INVALID | 422 | Policy condition syntax is invalid or references unsupported resource types |
| POLICY_LIMIT | 403 | Org has reached the maximum policy count for their plan tier |
| SLACK_NOT_CONNECTED | 422 | Slack workspace is not connected — required for Slack-based actions |
| SLACK_USER_NOT_MAPPED | 422 | Slack user ID cannot be mapped to a dd0c user — re-authenticate Slack |
| WEBHOOK_DELIVERY_FAILED | N/A (Async) | Webhook endpoint returned non-2xx — retried 3x with exponential backoff, then disabled |
| AUTH_TOKEN_EXPIRED | 401 | JWT has expired — refresh via Cognito token endpoint |
| AUTH_TOKEN_INVALID | 401 | JWT signature verification failed |
| RATE_LIMIT_EXCEEDED | 429 | Request throttled — respect Retry-After header |
| INTERNAL_ERROR | 500 | Unhandled exception — request_id included for support escalation |

Retry Guidance for Agent Developers:

The open-source agent implements the following retry strategy, and third-party integrations should follow the same pattern:

| Error Code | Retry? | Strategy |
|---|---|---|
| 429 | Yes | Exponential backoff starting at Retry-After value, max 5 retries, jitter ±20% |
| 500 | Yes | Exponential backoff starting at 1s, max 3 retries |
| 502 / 503 | Yes | Exponential backoff starting at 5s, max 5 retries |
| 400 / 422 | No | Fix the payload — retrying the same request will produce the same error |
| 401 | No | Re-authenticate — API key may be rotated or JWT expired |
| 403 | No | Permission issue — check RBAC or plan tier |
| 409 | Conditional | For DRIFT_REPORT_DUPLICATE, safe to ignore. For REMEDIATION_IN_PROGRESS, poll status and retry after completion. |