
dd0c/drift — Technical Architecture

Architect: Max Mayfield (Phase 6 — Architecture)
Date: February 28, 2026
Product: dd0c/drift — IaC Drift Detection & Remediation SaaS
Status: Architecture Design Document


Section 1: SYSTEM OVERVIEW

High-Level Architecture

graph TB
    subgraph Customer VPC
        CT[AWS CloudTrail] -->|Events| EB[Amazon EventBridge]
        EB -->|Rule Match| SQS_C[SQS Queue<br/>customer-side]
        SQS_C --> DA[Drift Agent<br/>ECS Task / GitHub Action]
        SF[Terraform State<br/>S3 Backend] -->|Read| DA
        DA -->|Encrypted Drift Diffs| HTTPS[HTTPS Egress Only]
    end

    subgraph dd0c SaaS Platform — AWS
        HTTPS -->|mTLS| APIGW[API Gateway<br/>Agent Ingestion]
        APIGW --> SQS_P[SQS FIFO Queue<br/>Event Ingestion]
        SQS_P --> PROC[Event Processor<br/>ECS Fargate]
        PROC --> DB[(PostgreSQL RDS<br/>Multi-Tenant)]
        PROC --> S3_SNAP[S3<br/>State Snapshots]
        PROC --> ES[Event Store<br/>DynamoDB Streams]

        PROC --> RE[Remediation Engine<br/>ECS Fargate]
        RE -->|Generate Plan| DA

        PROC --> NS[Notification Service<br/>Lambda]
        NS --> SLACK[Slack API]
        NS --> EMAIL[SES Email]
        NS --> WH[Webhook Delivery]

        DASH[Dashboard API<br/>ECS Fargate] --> DB
        DASH --> S3_SNAP
        UI[React SPA<br/>CloudFront] --> DASH

        AUTH[Auth Service<br/>Cognito + Lambda] --> DASH
        AUTH --> APIGW
    end

    subgraph External Integrations
        SLACK --> SLACK_USER[Slack Workspace]
        GH[GitHub API] --> PR[Pull Request<br/>Accept Drift]
        PD[PagerDuty API] --> ONCALL[On-Call Rotation]
    end

Component Inventory

| Component | Responsibility | Runtime | Deployment |
|---|---|---|---|
| Drift Agent | Consumes CloudTrail events via EventBridge/SQS, reads Terraform state, computes drift diffs, pushes encrypted results to SaaS | Go binary | Customer VPC: ECS Task, GitHub Action, or standalone binary |
| API Gateway (Ingestion) | Authenticates agent connections (mTLS + API key), rate limits, routes to ingestion queue | AWS API Gateway (HTTP API) | SaaS account |
| Event Processor | Deserializes drift diffs, classifies severity, persists to DB, triggers notifications and remediation workflows | Node.js / TypeScript | ECS Fargate |
| State Manager | Parses Terraform state files (v4 format), builds resource graph, computes resource-level diffs against previous snapshots | Go (shared lib with Agent) | Runs inside Drift Agent + SaaS-side for dashboard queries |
| Remediation Engine | Generates scoped terraform plan for revert, manages approval workflow, dispatches apply commands back to Agent | Node.js / TypeScript | ECS Fargate |
| Notification Service | Formats and delivers Slack Block Kit messages, emails, webhooks; handles Slack interactivity (button callbacks) | Node.js Lambda | Lambda (event-driven, pay-per-invocation) |
| Dashboard API | REST API for web dashboard — drift scores, stack list, history, compliance reports, team management | Node.js / TypeScript | ECS Fargate |
| Web Dashboard | React SPA — drift score, stack overview, drift timeline, compliance report generator, settings | React + Vite | CloudFront + S3 |
| Auth Service | User authentication (email/password, GitHub OAuth, Google OAuth), API key management for agents, RBAC | Cognito + Lambda authorizers | Managed + Lambda |

Technology Choices

| Decision | Choice | Justification |
|---|---|---|
| Agent Language | Go | Single static binary, no runtime dependencies, cross-compiles to Linux/macOS/Windows. Critical for customer-side deployment — zero dependency footprint. |
| SaaS Backend | Node.js / TypeScript | Shared language with frontend (React). Fast iteration for a solo founder. Strong AWS SDK support. TypeScript catches bugs at compile time. |
| Database | PostgreSQL (RDS) | Relational model fits multi-tenant SaaS (row-level security). JSONB for flexible drift diff storage. Mature, battle-tested. RDS handles backups/failover. |
| Event Store | DynamoDB + Streams | Append-only drift event log. DynamoDB Streams enables event sourcing pattern. Cost-effective at low volume, scales linearly. |
| State Snapshots | S3 + Glacier lifecycle | State snapshots are large (MB range), write-once-read-rarely. S3 is the obvious choice. Glacier after 90 days for cost. |
| Queue | SQS FIFO | Exactly-once processing for drift events. FIFO guarantees ordering per message group (per stack). No operational overhead vs. self-managed Kafka. |
| Notifications | Lambda | Event-driven, bursty workload. Pay-per-invocation. A Slack message costs ~$0.0000002 in Lambda compute. |
| Frontend | React + Vite + CloudFront | Standard SPA stack. CloudFront for global edge caching. Vite for fast builds. Nothing exotic — solo founder needs boring tech. |
| Auth | Cognito | Managed auth with OAuth flows, JWT tokens, user pools. Eliminates building auth from scratch. Cognito is ugly but functional. |
| IaC for SaaS infra | Terraform | Dogfooding. The SaaS that detects Terraform drift should be deployed with Terraform. |

Deployment Model — Push-Based Agent Architecture

This is the non-negotiable architectural decision. The SaaS never pulls from customer infrastructure. The agent pushes out.

┌─────────────────────────────────────────────────┐
│                 Customer Account                 │
│                                                  │
│  CloudTrail ──► EventBridge ──► SQS ──► Agent   │
│                                          │       │
│  S3 State Bucket ◄──── (read) ──────────┘       │
│                                          │       │
│                                    Drift Diff    │
│                                    (encrypted)   │
│                                          │       │
│                                    HTTPS OUT ────┼──►  dd0c SaaS
│                                                  │
│  ❌ No inbound access from SaaS                  │
│  ❌ No IAM cross-account role for SaaS           │
│  ✅ Agent runs with customer's IAM role          │
│  ✅ State file never leaves customer account     │
│  ✅ Only drift diffs (no secrets) are transmitted│
└─────────────────────────────────────────────────┘

Agent Deployment Options:

  1. ECS Fargate Task (recommended for always-on): Long-running container that subscribes to SQS queue for real-time CloudTrail events and runs scheduled state comparisons. Deployed via Terraform module provided by dd0c.

  2. GitHub Actions Cron (recommended for getting started): Scheduled workflow that runs drift check on a cron (e.g., every 15 minutes). Zero infrastructure to manage. Lowest barrier to entry.

  3. Standalone Binary (for air-gapped / custom): Download the Go binary, run it anywhere — EC2, Kubernetes pod, on-prem server. Maximum flexibility.

What the Agent Transmits:

The agent sends a DriftReport payload — NOT the state file. The payload contains:

  • Stack identifier (name, backend location hash — not the actual S3 path)
  • List of drifted resources: resource type, resource address, attribute-level diff (old value vs. new value)
  • CloudTrail attribution: IAM principal, source IP, timestamp, event name
  • Drift classification: severity (critical/high/medium/low), category (security/config/tags/scaling)
  • Agent metadata: version, heartbeat timestamp, detection method (event-driven vs. scheduled)

What the Agent Does NOT Transmit:

  • Full state file contents
  • Secret values (the agent strips sensitive attributes and known secret patterns before transmission)
  • Raw CloudTrail events (only correlated attribution data)
  • S3 bucket names, account IDs, or other infrastructure identifiers (hashed)
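The payload contract above can be sketched as a Go struct. The field names, JSON keys, and `BackendHash` helper here are illustrative, not the actual wire format; the point is what is present (diffs, attribution, hashes) and what is absent (state, secrets, raw identifiers):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// AttributeDiff is one attribute-level change; values arrive pre-scrubbed.
type AttributeDiff struct {
	Attribute string      `json:"attribute"`
	Old       interface{} `json:"old"`
	New       interface{} `json:"new"`
}

// DriftedResource is one drifted resource within a report.
type DriftedResource struct {
	Address  string          `json:"address"`  // e.g., "aws_security_group.api"
	Type     string          `json:"type"`     // e.g., "aws_security_group"
	Severity string          `json:"severity"` // critical | high | medium | low
	Category string          `json:"category"` // security | config | tags | scaling
	Diffs    []AttributeDiff `json:"diffs"`
}

// DriftReport is the only payload the agent transmits: no state file,
// no raw CloudTrail events, no bucket names or account IDs.
type DriftReport struct {
	StackName           string            `json:"stack_name"`
	BackendHash         string            `json:"backend_hash"`     // hash, never the backend path
	DetectionMethod     string            `json:"detection_method"` // event_driven | scheduled
	AgentVersion        string            `json:"agent_version"`
	AttributedPrincipal string            `json:"attributed_principal,omitempty"`
	Resources           []DriftedResource `json:"resources"`
}

// BackendHash turns a backend location into the opaque stack identifier.
func BackendHash(backendConfig string) string {
	sum := sha256.Sum256([]byte(backendConfig))
	return hex.EncodeToString(sum[:])
}

func main() {
	r := DriftReport{
		StackName:       "prod-networking",
		BackendHash:     BackendHash("s3://example-bucket/net/terraform.tfstate"),
		DetectionMethod: "event_driven",
		AgentVersion:    "0.1.0",
	}
	out, _ := json.Marshal(r)
	fmt.Println(len(out) > 0) // true
}
```

The bucket path exists only inside the customer VPC; the SaaS side ever sees only the SHA-256 digest.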

Section 2: CORE COMPONENTS

2.1 Drift Agent

The agent is the heart of the push-based architecture. It's a single Go binary that runs inside the customer's environment and does two things: consume CloudTrail events for real-time detection, and periodically compare Terraform state against cloud reality for comprehensive coverage.

Architecture:

graph LR
    subgraph Drift Agent Process
        EC[Event Consumer<br/>CloudTrail via SQS] --> DF[Drift Filter<br/>Resource Matcher]
        SC[Scheduled Checker<br/>Cron / Timer] --> SP[State Parser<br/>TF State v4]
        SP --> DC[Drift Comparator<br/>Attribute-Level Diff]
        DF --> DC
        DC --> SS[Secret Scrubber<br/>Strip Sensitive Values]
        SS --> TX[Transmitter<br/>HTTPS + mTLS]
        TX -->|Encrypted DriftReport| SAAS[dd0c SaaS API]
    end

Event Consumer (Real-Time Path):

CloudTrail delivers events to EventBridge. An EventBridge rule matches IaC-managed resource types and forwards to an SQS queue. The agent polls this queue.

// EventBridge Rule Pattern — matches write API calls on IaC-managed resource types
{
  "source": ["aws.ec2", "aws.iam", "aws.rds", "aws.s3", "aws.lambda", "aws.ecs"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": [{
      "anything-but": {
        "prefix": "Describe"
      }
    }],
    "readOnly": [false]
  }
}

When the agent receives a CloudTrail event:

  1. Extract the resource identifier from the event (e.g., sg-abc123 from a ModifySecurityGroupRules event)
  2. Look up which Terraform state file manages this resource (agent maintains an in-memory resource → state index)
  3. Read the current resource attributes from the cloud API (e.g., ec2:DescribeSecurityGroups)
  4. Compare against the declared attributes in the Terraform state file
  5. If drift detected → build DriftReport, scrub secrets, transmit to SaaS

Scheduled Checker (Comprehensive Path):

The event-driven path catches ~80% of drift in real-time. The scheduled path catches everything else — resources modified through means that don't generate standard CloudTrail events, or resources in services not yet covered by the EventBridge rule.

Schedule: Configurable per tier
  Free:     Every 24 hours
  Starter:  Every 15 minutes
  Pro:      Every 5 minutes
  Business: Every 1 minute

The scheduled checker:

  1. Reads the Terraform state file from the configured backend (S3, GCS, Terraform Cloud, local)
  2. For each resource in state, calls the corresponding cloud API to get current attributes
  3. Computes attribute-level diff between state and reality
  4. Batches all drifted resources into a single DriftReport
  5. Transmits to SaaS
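The attribute-level diff in step 3 can be sketched as follows. This is a minimal top-level comparison; real state values are nested, and provider-computed attributes must be excluded before comparing:

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Change records one attribute whose declared value differs from reality.
type Change struct {
	Attribute string
	Old, New  interface{}
}

// DiffAttributes compares declared (state) attributes against actual (cloud)
// attributes and returns the changed set, sorted for stable output.
func DiffAttributes(declared, actual map[string]interface{}) []Change {
	var changes []Change
	for key, oldVal := range declared {
		if newVal, ok := actual[key]; !ok || !reflect.DeepEqual(oldVal, newVal) {
			changes = append(changes, Change{Attribute: key, Old: oldVal, New: newVal})
		}
	}
	// Attributes present in the cloud but absent from state are drift too.
	for key, newVal := range actual {
		if _, ok := declared[key]; !ok {
			changes = append(changes, Change{Attribute: key, Old: nil, New: newVal})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Attribute < changes[j].Attribute })
	return changes
}

func main() {
	declared := map[string]interface{}{"instance_type": "t3.micro", "monitoring": false}
	actual := map[string]interface{}{"instance_type": "t3.large", "monitoring": false}
	for _, c := range DiffAttributes(declared, actual) {
		fmt.Printf("%s: %v -> %v\n", c.Attribute, c.Old, c.New)
		// instance_type: t3.micro -> t3.large
	}
}
```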

State File Parsing:

The agent parses Terraform state v4 format (JSON). Key structures:

type TerraformState struct {
    Version          int              `json:"version"`          // Must be 4
    TerraformVersion string           `json:"terraform_version"`
    Serial           int64            `json:"serial"`
    Lineage          string           `json:"lineage"`
    Resources        []StateResource  `json:"resources"`
}

type StateResource struct {
    Module    string              `json:"module,omitempty"`
    Mode      string              `json:"mode"`      // "managed" or "data"
    Type      string              `json:"type"`       // e.g., "aws_security_group"
    Name      string              `json:"name"`
    Provider  string              `json:"provider"`
    Instances []ResourceInstance   `json:"instances"`
}

type ResourceInstance struct {
    SchemaVersion int                    `json:"schema_version"`
    Attributes    map[string]interface{} `json:"attributes"`
    Private       string                 `json:"private"`  // Base64 encoded, may contain secrets
}

The agent maintains a mapping of Terraform resource types to AWS API calls:

| Terraform Resource Type | AWS Describe API | Key Identifier |
|---|---|---|
| aws_security_group | ec2:DescribeSecurityGroups | attributes.id |
| aws_iam_role | iam:GetRole | attributes.name |
| aws_iam_policy | iam:GetPolicyVersion | attributes.arn |
| aws_db_instance | rds:DescribeDBInstances | attributes.identifier |
| aws_s3_bucket | s3:GetBucket* calls (composite) | attributes.bucket |
| aws_lambda_function | lambda:GetFunction | attributes.function_name |
| aws_ecs_service | ecs:DescribeServices | attributes.name + attributes.cluster |
| aws_route53_record | route53:ListResourceRecordSets | attributes.zone_id + attributes.name |

MVP covers the top 20 most-drifted resource types (based on community data from driftctl's historical issues). Remaining types added iteratively based on customer demand.

Secret Scrubbing:

Before transmitting any drift diff, the agent runs a scrubbing pass:

  1. Remove any attribute marked sensitive in the Terraform provider schema
  2. Redact values matching known secret patterns: password, secret, token, key, private_key, connection_string
  3. Redact any value that looks like a credential (regex patterns for AWS keys, database URIs, JWT tokens)
  4. Replace redacted values with [REDACTED] — the diff still shows "attribute changed" but not the actual values
  5. The Private field on resource instances is always stripped entirely
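A simplified sketch of steps 2 and 3 of that pass. The key-name list comes from the rules above; the value-shape regexes (AWS access key ID, database URI, JWT) are illustrative, and the real agent additionally consults the provider schema's sensitive markers:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const redacted = "[REDACTED]"

// sensitiveKeys: attribute-name substrings that are always redacted.
var sensitiveKeys = []string{"password", "secret", "token", "key", "private_key", "connection_string"}

// credentialPatterns: values that look like credentials regardless of key name.
var credentialPatterns = []*regexp.Regexp{
	regexp.MustCompile(`^AKIA[0-9A-Z]{16}$`),                   // AWS access key ID shape
	regexp.MustCompile(`^postgres(ql)?://\S+$`),                // database URI
	regexp.MustCompile(`^eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.`), // JWT-ish
}

// Scrub redacts sensitive attributes in place, by key name or value shape.
// The diff still records that the attribute changed, never the values.
func Scrub(attrs map[string]interface{}) {
	for key := range attrs {
		lower := strings.ToLower(key)
		for _, k := range sensitiveKeys {
			if strings.Contains(lower, k) {
				attrs[key] = redacted
			}
		}
		if s, ok := attrs[key].(string); ok && s != redacted {
			for _, p := range credentialPatterns {
				if p.MatchString(s) {
					attrs[key] = redacted
				}
			}
		}
	}
}

func main() {
	attrs := map[string]interface{}{
		"master_password": "hunter2",
		"endpoint":        "db.internal:5432",
		"aws_cred":        "AKIAIOSFODNN7EXAMPLE",
	}
	Scrub(attrs)
	fmt.Println(attrs["master_password"], attrs["endpoint"], attrs["aws_cred"])
	// [REDACTED] db.internal:5432 [REDACTED]
}
```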

2.2 Event Pipeline

The event pipeline is the real-time nervous system. CloudTrail events flow through EventBridge and SQS to the agent within seconds of a change; no cron schedule or full-state scan sits on this path (the agent's long-poll on its own SQS queue is the only polling involved).

Customer-Side Pipeline:

CloudTrail (all regions)
    │
    ▼
EventBridge (default bus)
    │
    ├── Rule: drift-agent-ec2 (ec2 write events)
    ├── Rule: drift-agent-iam (iam write events)
    ├── Rule: drift-agent-rds (rds write events)
    └── Rule: drift-agent-* (per-service rules)
    │
    ▼
SQS Queue: drift-agent-events
    │ (FIFO, dedup by CloudTrail eventID)
    │ (visibility timeout: 300s)
    │ (DLQ after 3 retries)
    ▼
Drift Agent (long-poll, batch size 10)

SaaS-Side Pipeline:

Agent HTTPS POST /v1/drift-reports
    │
    ▼
API Gateway (auth: mTLS + API key header)
    │
    ▼
SQS FIFO Queue: drift-report-ingestion
    │ (message group ID = stack_id → ordering per stack)
    │ (dedup ID = report_id → exactly-once)
    ▼
Event Processor (ECS Fargate, auto-scaling 1-10 tasks)
    │
    ├──► PostgreSQL (drift_events, resources, stacks)
    ├──► DynamoDB (event store, append-only)
    ├──► S3 (state snapshot, if full snapshot report)
    └──► Notification Service (Lambda, async invoke)

Why SQS FIFO, not Kafka/Kinesis:

At MVP scale (10-1,000 stacks), SQS FIFO is the right choice:

  • Zero operational overhead (no brokers, no partitions, no ZooKeeper)
  • Exactly-once processing via deduplication ID
  • Per-stack ordering via message group ID
  • Costs ~$0.40/million messages. At 1,000 stacks checking every 5 minutes, that's ~288K messages/day = ~$3.50/month
  • If we hit 10,000+ stacks and need streaming analytics, migrate to Kinesis. That's a V3 problem.
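The quoted figure is easy to sanity-check with a rough model. This counts send requests only; receive and delete requests roughly triple it, still single-digit dollars a month:

```go
package main

import "fmt"

// MonthlySQSCost models send-request cost: checks per stack per day times
// stack count, priced at the quoted ~$0.40 per million requests.
func MonthlySQSCost(stacks, checkIntervalMin int, pricePerMillion float64) (msgsPerDay int, monthly float64) {
	msgsPerDay = stacks * (24 * 60 / checkIntervalMin)
	monthly = float64(msgsPerDay) * 30 / 1e6 * pricePerMillion
	return
}

func main() {
	perDay, monthly := MonthlySQSCost(1000, 5, 0.40)
	fmt.Printf("%d messages/day, ~$%.2f/month\n", perDay, monthly)
	// 288000 messages/day, ~$3.46/month
}
```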

2.3 State Manager

The State Manager is a shared library (Go) used by both the Drift Agent and the SaaS-side Event Processor. It handles Terraform state parsing, resource graph construction, and drift classification.

Resource Graph:

The State Manager builds a dependency graph from Terraform state. This is critical for blast radius analysis — when a resource drifts, what else might be affected?

type ResourceGraph struct {
    Nodes map[string]*ResourceNode  // key: resource address (e.g., "aws_security_group.api")
    Edges []ResourceEdge            // dependency relationships
}

type ResourceNode struct {
    Address    string                 // e.g., "module.networking.aws_security_group.api"
    Type       string                 // e.g., "aws_security_group"
    Provider   string                 // e.g., "registry.terraform.io/hashicorp/aws"
    Attributes map[string]interface{} // current state attributes
    DriftState DriftState             // clean, drifted, unknown
}

type ResourceEdge struct {
    From string  // resource address
    To   string  // resource address
    Type string  // "depends_on", "reference", "implicit"
}

The graph is built by analyzing attribute cross-references in state (e.g., aws_instance.web.vpc_security_group_ids references aws_security_group.web.id). This isn't perfect — Terraform state doesn't store the full dependency graph — but it catches 80%+ of relationships.
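The cross-reference scan reduces to: for each resource, check whether any of its attribute values equals another resource's id. A flat-attribute sketch (real state values are nested and list-valued, so the production version walks those structures too):

```go
package main

import "fmt"

// Res is a minimal resource: address plus flattened string attributes.
type Res struct {
	Address    string
	Attributes map[string]string
}

// Edge is a directed reference: From depends on To.
type Edge struct{ From, To string }

// InferEdges finds implicit references by matching attribute values against
// other resources' "id" attribute.
func InferEdges(resources []Res) []Edge {
	idOwner := make(map[string]string) // cloud id -> resource address
	for _, r := range resources {
		if id, ok := r.Attributes["id"]; ok {
			idOwner[id] = r.Address
		}
	}
	var edges []Edge
	for _, r := range resources {
		for key, val := range r.Attributes {
			if key == "id" {
				continue // a resource's own id is not a reference
			}
			if owner, ok := idOwner[val]; ok && owner != r.Address {
				edges = append(edges, Edge{From: r.Address, To: owner})
			}
		}
	}
	return edges
}

func main() {
	resources := []Res{
		{Address: "aws_security_group.web", Attributes: map[string]string{"id": "sg-123"}},
		{Address: "aws_instance.web", Attributes: map[string]string{"id": "i-456", "vpc_security_group_ids": "sg-123"}},
	}
	for _, e := range InferEdges(resources) {
		fmt.Printf("%s -> %s\n", e.From, e.To)
		// aws_instance.web -> aws_security_group.web
	}
}
```

Blast radius for a drifted resource is then the set of nodes reachable by following edges into it.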

Drift Classification:

Every detected drift is classified along two axes:

| Severity | Criteria | Examples |
|---|---|---|
| Critical | Security boundary change, IAM escalation, public exposure | Security group opened to 0.0.0.0/0, IAM policy with `*:*`, S3 bucket made public |
| High | Configuration change affecting availability or data | RDS parameter change, ECS task definition change, Lambda runtime change |
| Medium | Non-critical configuration change | Instance type change, tag modification on critical resources, DNS TTL change |
| Low | Cosmetic or expected drift | Tag-only changes, description updates, ASG desired count (auto-scaling) |

Classification rules are defined in a YAML config shipped with the agent:

# drift-classification.yaml
rules:
  - resource_type: aws_security_group
    attribute: ingress
    condition: "contains_cidr('0.0.0.0/0')"
    severity: critical
    category: security

  - resource_type: aws_iam_role_policy
    attribute: policy
    severity: high
    category: security

  - resource_type: aws_db_instance
    attribute: parameter_group_name
    severity: high
    category: configuration

  - resource_type: "*"
    attribute: tags
    severity: low
    category: tags

  # Default: anything not matched
  - resource_type: "*"
    attribute: "*"
    severity: medium
    category: configuration

Customers can override these rules in their agent config to match their risk tolerance.
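Evaluation is first-match-wins, which is why the catch-all rule sits last in the YAML. A sketch of the matcher (the `condition` expressions on some rules are omitted here for brevity):

```go
package main

import "fmt"

// Rule mirrors one entry in drift-classification.yaml.
type Rule struct {
	ResourceType string // "*" matches any type
	Attribute    string // "*" matches any attribute
	Severity     string
	Category     string
}

// Classify returns the severity and category of the first matching rule,
// so config order matters: specific rules first, catch-all last.
func Classify(rules []Rule, resourceType, attribute string) (string, string) {
	for _, r := range rules {
		typeOK := r.ResourceType == "*" || r.ResourceType == resourceType
		attrOK := r.Attribute == "*" || r.Attribute == attribute
		if typeOK && attrOK {
			return r.Severity, r.Category
		}
	}
	return "medium", "configuration" // unreachable when a catch-all rule exists
}

func main() {
	rules := []Rule{
		{"aws_security_group", "ingress", "critical", "security"},
		{"aws_db_instance", "parameter_group_name", "high", "configuration"},
		{"*", "tags", "low", "tags"},
		{"*", "*", "medium", "configuration"},
	}
	sev, cat := Classify(rules, "aws_security_group", "ingress")
	fmt.Println(sev, cat) // critical security
	sev, cat = Classify(rules, "aws_lambda_function", "timeout")
	fmt.Println(sev, cat) // medium configuration
}
```

A customer override simply prepends rules to this list, shadowing the defaults.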

2.4 Remediation Engine

The Remediation Engine handles two workflows: Revert (make cloud match code) and Accept (make code match cloud). Both are initiated from Slack action buttons or the dashboard.

Revert Workflow:

sequenceDiagram
    participant U as Engineer (Slack)
    participant NS as Notification Service
    participant RE as Remediation Engine
    participant DA as Drift Agent
    participant TF as Terraform (in customer VPC)

    U->>NS: Click [Revert] button
    NS->>RE: Initiate revert for resource X in stack Y
    RE->>RE: Generate scoped plan (target resource only)
    RE->>RE: Compute blast radius (resource graph)
    RE->>NS: Send confirmation with blast radius
    NS->>U: "Reverting aws_security_group.api will affect 0 other resources. Proceed? [Confirm] [Cancel]"
    U->>NS: Click [Confirm]
    NS->>RE: Confirmed
    RE->>DA: Execute: terraform apply -target=aws_security_group.api -auto-approve
    DA->>TF: Run terraform apply (scoped)
    TF-->>DA: Apply complete
    DA->>RE: Result: success
    RE->>NS: Notify success
    NS->>U: "✅ aws_security_group.api reverted to declared state. Drift resolved."

Accept Workflow (Code-to-Cloud):

When the engineer clicks [Accept], the drift is intentional and the code should be updated to match reality:

  1. Remediation Engine generates a Terraform code patch that updates the resource definition to match the current cloud state
  2. Creates a branch and PR on the connected GitHub repository
  3. PR includes: the code change, a description of the drift, CloudTrail attribution, and a link to the drift event in the dd0c dashboard
  4. Engineer reviews and merges the PR through normal code review process
  5. On merge, the next terraform apply in CI/CD is a no-op for this resource (code now matches cloud)
  6. Agent detects the state file update and marks the drift as resolved

Approval Workflow (V2 — Pro/Business tiers):

For teams that want approval gates before remediation:

# remediation-policy.yaml
policies:
  - resource_type: aws_security_group
    action: auto-revert
    condition: "severity == 'critical'"
    # No approval needed — auto-revert critical security drift

  - resource_type: aws_iam_*
    action: require-approval
    approvers: ["@security-team"]
    timeout: 4h
    # IAM changes need security team sign-off

  - resource_type: aws_db_instance
    action: require-approval
    approvers: ["@dba-team", "@infra-lead"]
    timeout: 24h
    # Database changes need DBA approval

  - resource_type: "*"
    attribute: tags
    action: digest
    # Tag drift goes in the daily digest, no action needed

2.5 Notification Service

The Notification Service is a Lambda function that formats drift events into rich, actionable messages and delivers them to configured channels.

Slack Block Kit Message (Primary):

{
  "blocks": [
    {
      "type": "header",
      "text": {
        "type": "plain_text",
        "text": "🔴 Critical Drift Detected"
      }
    },
    {
      "type": "section",
      "fields": [
        { "type": "mrkdwn", "text": "*Stack:*\nprod-networking" },
        { "type": "mrkdwn", "text": "*Resource:*\naws_security_group.api" },
        { "type": "mrkdwn", "text": "*Changed by:*\narn:aws:iam::123456:user/jsmith" },
        { "type": "mrkdwn", "text": "*When:*\n2 minutes ago" }
      ]
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*What changed:*\n```ingress rule added: 0.0.0.0/0:443 (HTTPS from anywhere)```"
      }
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Blast radius:* 0 dependent resources\n*Owner:* @ravi"
      }
    },
    {
      "type": "actions",
      "elements": [
        { "type": "button", "text": { "type": "plain_text", "text": "🔄 Revert" }, "style": "danger", "action_id": "drift_revert", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "✅ Accept" }, "action_id": "drift_accept", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "⏰ Snooze 24h" }, "action_id": "drift_snooze", "value": "evt_abc123" },
        { "type": "button", "text": { "type": "plain_text", "text": "👤 Assign" }, "action_id": "drift_assign", "value": "evt_abc123" }
      ]
    }
  ]
}

Notification Routing:

| Severity | Slack | Email | PagerDuty | Webhook |
|---|---|---|---|---|
| Critical | Immediate (channel + DM to owner) | Immediate | Page on-call (Pro+) | Immediate |
| High | Immediate (channel) | Immediate | Alert (no page) | Immediate |
| Medium | Batched (hourly digest) | Daily digest | None | Batched |
| Low | Daily digest | Weekly digest | None | Batched |

Slack Interactivity:

When an engineer clicks a button, Slack sends an interaction payload to our API Gateway endpoint (POST /v1/slack/interactions). The Lambda:

  1. Verifies the Slack request signature
  2. Looks up the drift event by ID
  3. Checks the user's permissions (RBAC — can this user remediate this stack?)
  4. Initiates the appropriate workflow (revert, accept, snooze, assign)
  5. Updates the original Slack message to reflect the action taken

Section 3: DATA ARCHITECTURE

3.1 Database Schema (PostgreSQL RDS)

Multi-tenant PostgreSQL with Row-Level Security (RLS). Every table includes org_id and all queries are scoped by it. No cross-tenant data leakage by design.

-- ============================================================
-- ORGANIZATIONS & AUTH
-- ============================================================

CREATE TABLE organizations (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            TEXT NOT NULL,
    slug            TEXT UNIQUE NOT NULL,           -- e.g., "acme-corp"
    plan            TEXT NOT NULL DEFAULT 'free',    -- free, starter, pro, business, enterprise
    stripe_customer_id TEXT,
    max_stacks      INT NOT NULL DEFAULT 3,
    poll_interval_s INT NOT NULL DEFAULT 86400,      -- default: daily (free tier)
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    email           TEXT NOT NULL,
    name            TEXT,
    role            TEXT NOT NULL DEFAULT 'member',  -- owner, admin, member, viewer
    cognito_sub     TEXT UNIQUE,
    slack_user_id   TEXT,                            -- for Slack DM routing
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(org_id, email)
);

-- ============================================================
-- STACKS & RESOURCES
-- ============================================================

CREATE TABLE stacks (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    name            TEXT NOT NULL,                   -- e.g., "prod-networking"
    backend_type    TEXT NOT NULL DEFAULT 's3',       -- s3, gcs, tfc, local
    backend_hash    TEXT NOT NULL,                    -- SHA256 of backend config (no raw paths stored)
    iac_tool        TEXT NOT NULL DEFAULT 'terraform', -- terraform, opentofu, pulumi
    environment     TEXT,                             -- prod, staging, dev
    owner_user_id   UUID REFERENCES users(id),
    slack_channel   TEXT,                             -- override notification channel per stack
    drift_score     REAL NOT NULL DEFAULT 100.0,      -- 0-100, 100 = clean
    last_check_at   TIMESTAMPTZ,
    last_drift_at   TIMESTAMPTZ,
    resource_count  INT NOT NULL DEFAULT 0,
    drifted_count   INT NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(org_id, backend_hash)
);

CREATE TABLE resources (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    stack_id        UUID NOT NULL REFERENCES stacks(id) ON DELETE CASCADE,
    address         TEXT NOT NULL,                   -- e.g., "module.vpc.aws_security_group.api"
    resource_type   TEXT NOT NULL,                   -- e.g., "aws_security_group"
    provider        TEXT NOT NULL,                   -- e.g., "registry.terraform.io/hashicorp/aws"
    cloud_id        TEXT,                            -- e.g., "sg-abc123" (for cross-referencing)
    drift_state     TEXT NOT NULL DEFAULT 'clean',   -- clean, drifted, unknown, ignored
    last_drift_at   TIMESTAMPTZ,
    drift_count     INT NOT NULL DEFAULT 0,          -- lifetime drift count for this resource
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(stack_id, address)
);

CREATE INDEX idx_resources_type ON resources(org_id, resource_type);
CREATE INDEX idx_resources_drift ON resources(org_id, drift_state) WHERE drift_state = 'drifted';

-- ============================================================
-- DRIFT EVENTS
-- ============================================================

CREATE TABLE drift_events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    stack_id        UUID NOT NULL REFERENCES stacks(id),
    resource_id     UUID NOT NULL REFERENCES resources(id),
    report_id       UUID NOT NULL,                   -- groups events from same detection run
    severity        TEXT NOT NULL,                    -- critical, high, medium, low
    category        TEXT NOT NULL,                    -- security, configuration, tags, scaling
    detection_method TEXT NOT NULL,                   -- event_driven, scheduled
    
    -- The drift diff (JSONB for flexible querying)
    diff            JSONB NOT NULL,
    /*  Example diff:
        {
          "attributes": {
            "ingress": {
              "old": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8"]}],
              "new": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8", "0.0.0.0/0"]}]
            }
          }
        }
    */
    
    -- CloudTrail attribution (nullable — scheduled checks don't have this)
    attributed_principal TEXT,                        -- IAM ARN who made the change
    attributed_source_ip TEXT,                        -- source IP
    attributed_event_name TEXT,                       -- e.g., "AuthorizeSecurityGroupIngress"
    attributed_at   TIMESTAMPTZ,                     -- when the change was made
    
    -- Resolution
    status          TEXT NOT NULL DEFAULT 'open',    -- open, resolved, accepted, snoozed, ignored
    resolved_by     UUID REFERENCES users(id),
    resolved_at     TIMESTAMPTZ,
    resolution_type TEXT,                            -- reverted, accepted, snoozed, auto_reverted
    resolution_note TEXT,
    
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_drift_events_stack ON drift_events(org_id, stack_id, created_at DESC);
CREATE INDEX idx_drift_events_status ON drift_events(org_id, status) WHERE status = 'open';
CREATE INDEX idx_drift_events_severity ON drift_events(org_id, severity, created_at DESC);

-- ============================================================
-- REMEDIATION PLANS
-- ============================================================

CREATE TABLE remediation_plans (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    drift_event_id  UUID NOT NULL REFERENCES drift_events(id),
    stack_id        UUID NOT NULL REFERENCES stacks(id),
    plan_type       TEXT NOT NULL,                   -- revert, accept
    status          TEXT NOT NULL DEFAULT 'pending', -- pending, approved, executing, completed, failed, cancelled
    
    -- For revert: terraform plan output (scrubbed)
    plan_output     TEXT,
    target_resources TEXT[],                          -- resource addresses targeted
    blast_radius    INT NOT NULL DEFAULT 0,          -- number of dependent resources affected
    
    -- For accept: generated code patch
    code_patch      TEXT,
    pr_url          TEXT,                             -- GitHub PR URL if created
    
    -- Approval
    requested_by    UUID REFERENCES users(id),
    approved_by     UUID REFERENCES users(id),
    approved_at     TIMESTAMPTZ,
    
    -- Execution
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    error_message   TEXT,
    
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- ============================================================
-- AGENTS
-- ============================================================

CREATE TABLE agents (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL REFERENCES organizations(id),
    name            TEXT NOT NULL,                    -- e.g., "prod-agent-1"
    api_key_hash    TEXT NOT NULL,                    -- bcrypt hash of the agent API key
    status          TEXT NOT NULL DEFAULT 'active',   -- active, inactive, revoked
    last_heartbeat  TIMESTAMPTZ,
    agent_version   TEXT,
    deployment_type TEXT,                             -- ecs, github_action, binary
    stacks          TEXT[],                           -- stack IDs this agent monitors
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- ============================================================
-- ROW-LEVEL SECURITY
-- ============================================================

ALTER TABLE stacks ENABLE ROW LEVEL SECURITY;
ALTER TABLE resources ENABLE ROW LEVEL SECURITY;
ALTER TABLE drift_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE remediation_plans ENABLE ROW LEVEL SECURITY;
ALTER TABLE agents ENABLE ROW LEVEL SECURITY;
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

-- All policies follow the same pattern: current_setting('app.current_org_id')
CREATE POLICY org_isolation ON stacks
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON resources
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON drift_events
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON remediation_plans
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON agents
    USING (org_id = current_setting('app.current_org_id')::UUID);
CREATE POLICY org_isolation ON users
    USING (org_id = current_setting('app.current_org_id')::UUID);
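
The agents.last_heartbeat column above is what distinguishes a live agent from a dead one. A minimal sketch of the staleness check (TypeScript; the 15-minute threshold is an assumption, not a documented default):

```typescript
// Derive an agent's effective status from its last heartbeat.
// staleAfterMs is an assumed threshold -- tune per deployment.
function agentStatus(
  lastHeartbeat: Date | null,
  now: Date,
  staleAfterMs: number = 15 * 60_000,
): "active" | "inactive" {
  if (lastHeartbeat === null) return "inactive"; // never checked in
  const age = now.getTime() - lastHeartbeat.getTime();
  return age > staleAfterMs ? "inactive" : "active";
}
```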

3.2 Event Sourcing (DynamoDB)

Every drift detection result is appended to an immutable event store in DynamoDB. This provides a complete audit trail that can never be modified — critical for compliance evidence.

Table: drift-events-log
Partition Key: org_id#stack_id (String)
Sort Key:      timestamp#event_id (String)
TTL:           expires_at (90 days for free, 1 year for paid, 7 years for enterprise)

Attributes:
  - event_type:     "drift_detected" | "drift_resolved" | "remediation_started" | "remediation_completed" | "agent_heartbeat" | "stack_registered"
  - payload:        Full event payload (JSON)
  - report_id:      Groups events from same detection run
  - checksum:       SHA256 of payload (tamper detection)
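
A sketch of how an event item's composite keys and tamper-detection checksum could be assembled before the PutItem call. Field names mirror the schema above; toDynamoItem itself is illustrative, not the actual ingestion code:

```typescript
import { createHash } from "node:crypto";

interface DriftEvent {
  orgId: string;
  stackId: string;
  eventId: string;
  eventType: string;
  timestamp: string; // ISO 8601
  payload: unknown;
}

// Build the item shape for the drift-events-log table.
function toDynamoItem(e: DriftEvent, ttlDays: number) {
  const payloadJson = JSON.stringify(e.payload);
  return {
    pk: `${e.orgId}#${e.stackId}`,     // partition key: org_id#stack_id
    sk: `${e.timestamp}#${e.eventId}`, // sort key: timestamp#event_id
    event_type: e.eventType,
    payload: payloadJson,
    checksum: createHash("sha256").update(payloadJson).digest("hex"),
    // TTL is epoch seconds; tier determines ttlDays (90 / 365 / ~2555)
    expires_at: Math.floor(Date.parse(e.timestamp) / 1000) + ttlDays * 86_400,
  };
}
```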

Why DynamoDB for the event store (not PostgreSQL):

  1. Append-only workload — DynamoDB excels at high-throughput writes with no read contention
  2. TTL-based expiration — automatic cleanup per tier without cron jobs
  3. DynamoDB Streams — enables downstream consumers (analytics, compliance report generation) without polling
  4. Cost — at 1,000 stacks × 288 checks/day × 365 days = ~105M items/year. DynamoDB on-demand pricing: ~$130/year for writes, ~$25/year for storage. PostgreSQL would need periodic archival to maintain query performance.

DynamoDB Streams Consumer:

A Lambda function subscribes to the DynamoDB Stream and:

  1. Aggregates drift metrics (drift rate, MTTR, drift-by-resource-type) into a PostgreSQL drift_metrics table for dashboard queries
  2. Generates daily/weekly compliance digest reports (stored in S3)
  3. Feeds the drift score calculation engine
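
The aggregation step reduces stream records into per-resource-type drift counts before they land in the drift_metrics table. A pure-function sketch (the record shape is assumed):

```typescript
interface StreamRecord {
  eventType: string;    // e.g. "drift_detected"
  resourceType: string; // e.g. "aws_security_group"
}

// Count drift_detected events per resource type -- the input to the
// drift-by-resource-type dashboard metric.
function driftCountsByType(records: StreamRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    if (r.eventType !== "drift_detected") continue; // ignore heartbeats etc.
    counts.set(r.resourceType, (counts.get(r.resourceType) ?? 0) + 1);
  }
  return counts;
}
```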

3.3 State Snapshot Storage (S3)

When the agent performs a full scheduled check, it can optionally upload a sanitized state snapshot to S3. This enables:

  • Historical comparison ("what did the state look like last Tuesday?")
  • Compliance evidence ("here's the state at the time of the audit")
  • Debugging ("the drift diff looks wrong — let me check the raw state")
Bucket: dd0c-state-snapshots-{account_id}
Prefix: {org_id}/{stack_id}/{YYYY}/{MM}/{DD}/{timestamp}-{report_id}.json.gz

Lifecycle:
  - Standard:           0-30 days
  - Infrequent Access:  30-90 days
  - Glacier Instant:    90-365 days
  - Glacier Deep:       365+ days (enterprise only)
  - Expire:             Per tier TTL (90d free, 1yr paid, 7yr enterprise)

Encryption: SSE-S3 (AES-256) + bucket policy enforcing encryption
Versioning: Enabled (tamper protection)
Object Lock: Compliance mode for enterprise tier (WORM — auditors love this)

What's in a state snapshot:

NOT the raw Terraform state file. The agent sanitizes it:

  1. All sensitive attributes → [REDACTED]
  2. All private instance data → stripped
  3. Backend configuration → hashed
  4. Account IDs → hashed (reversible only by the customer's agent)

The snapshot is useful for drift comparison but cannot be used to reconstruct the customer's infrastructure or extract secrets.
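
The snapshot object key follows the prefix scheme above. A sketch of the key builder (UTC date parts, as the prefix layout implies; the function name is assumed):

```typescript
// Build the S3 object key:
// {org_id}/{stack_id}/{YYYY}/{MM}/{DD}/{timestamp}-{report_id}.json.gz
function snapshotKey(orgId: string, stackId: string, ts: Date, reportId: string): string {
  const yyyy = ts.getUTCFullYear();
  const mm = String(ts.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(ts.getUTCDate()).padStart(2, "0");
  return `${orgId}/${stackId}/${yyyy}/${mm}/${dd}/${ts.toISOString()}-${reportId}.json.gz`;
}
```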

3.4 Multi-Tenant Data Isolation

Three layers of isolation:

Layer 1: Application-Level (RLS) Every API request sets app.current_org_id on the PostgreSQL session before executing queries. Row-Level Security policies ensure queries only return rows matching the org. Even a SQL injection in ordinary query paths cannot reach cross-tenant rows, though an attacker who can run arbitrary SQL could reset the session variable, so RLS complements input validation rather than replacing it.

// Middleware: set org context on every request.
// Note: Postgres SET cannot take bind parameters, so use set_config().
// The setting must live on the same pooled connection that runs the
// request's queries, so check out a dedicated client per request
// (and release it when the response finishes).
async function setOrgContext(req: Request, res: Response, next: NextFunction) {
  const orgId = req.auth.orgId; // from JWT claims
  const client = await pool.connect();
  await client.query("SELECT set_config('app.current_org_id', $1, false)", [orgId]);
  req.db = client; // downstream handlers must query via req.db
  next();
}

Layer 2: Infrastructure-Level (S3 Prefixes + IAM) State snapshots are stored under {org_id}/ prefixes. The SaaS application IAM role scopes object reads to the prefix matching the authenticated org. Because the s3:prefix condition key only applies to ListBucket, object access is constrained through the resource ARN itself:

{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": "arn:aws:s3:::dd0c-state-snapshots-*/${aws:PrincipalTag/org_id}/*"
}

Layer 3: Encryption-Level (Per-Org KMS Keys — Enterprise) Enterprise tier customers get a dedicated KMS key for encrypting their data at rest. This enables:

  • Customer-controlled key rotation
  • Key deletion = cryptographic data destruction (for offboarding)
  • CloudTrail logging of all key usage (customer can audit our access)

Data Residency (V2): For EU customers requiring GDPR data residency, deploy a separate RDS instance + S3 bucket in eu-west-1. The application routes based on org.data_region. This is a V2 feature — MVP runs single-region us-east-1.


Section 4: INFRASTRUCTURE

4.1 AWS Architecture — SaaS Platform

graph TB
    subgraph Public Edge
        CF[CloudFront<br/>Web Dashboard CDN]
        APIGW[API Gateway<br/>HTTP API]
    end

    subgraph Compute — ECS Fargate Cluster
        EP[Event Processor<br/>1-10 tasks, auto-scale]
        RE[Remediation Engine<br/>1-3 tasks, auto-scale]
        DASH[Dashboard API<br/>2 tasks, target-tracking]
    end

    subgraph Serverless
        NS[Notification Service<br/>Lambda]
        AUTH_L[Auth Authorizer<br/>Lambda]
        STREAM_L[DynamoDB Stream<br/>Consumer Lambda]
        CRON_L[Cron Jobs<br/>Lambda + EventBridge Scheduler]
    end

    subgraph Data
        RDS[(PostgreSQL 16<br/>RDS db.t4g.medium<br/>Multi-AZ)]
        DDB[(DynamoDB<br/>On-Demand<br/>Event Store)]
        S3_SNAP[S3<br/>State Snapshots]
        S3_WEB[S3<br/>Web Dashboard Assets]
    end

    subgraph Messaging
        SQS_IN[SQS FIFO<br/>drift-report-ingestion]
        SQS_REM[SQS Standard<br/>remediation-commands]
        SQS_NOTIFY[SQS Standard<br/>notification-fanout]
    end

    subgraph Auth & Secrets
        COG[Cognito User Pool]
        SM[Secrets Manager<br/>Slack tokens, DB creds]
        KMS[KMS<br/>Encryption keys]
    end

    subgraph Monitoring
        CW[CloudWatch<br/>Logs + Metrics + Alarms]
        XR[X-Ray<br/>Distributed Tracing]
    end

    CF --> S3_WEB
    CF --> APIGW
    APIGW --> AUTH_L --> COG
    APIGW --> SQS_IN
    APIGW --> DASH
    SQS_IN --> EP
    EP --> RDS
    EP --> DDB
    EP --> S3_SNAP
    EP --> SQS_NOTIFY
    EP --> SQS_REM
    SQS_NOTIFY --> NS
    SQS_REM --> RE
    RE --> RDS
    DASH --> RDS
    DASH --> S3_SNAP
    DDB --> STREAM_L
    STREAM_L --> RDS

VPC Layout:

VPC: 10.0.0.0/16 (us-east-1)

  Public Subnets (NAT Gateway, ALB):
    10.0.1.0/24  (us-east-1a)
    10.0.2.0/24  (us-east-1b)

  Private Subnets (ECS Tasks, Lambda):
    10.0.10.0/24 (us-east-1a)
    10.0.11.0/24 (us-east-1b)

  Isolated Subnets (RDS):
    10.0.20.0/24 (us-east-1a)
    10.0.21.0/24 (us-east-1b)

  VPC Endpoints (no NAT for AWS services):
    - com.amazonaws.us-east-1.sqs
    - com.amazonaws.us-east-1.dynamodb
    - com.amazonaws.us-east-1.s3
    - com.amazonaws.us-east-1.secretsmanager
    - com.amazonaws.us-east-1.kms
    - com.amazonaws.us-east-1.ecr.api
    - com.amazonaws.us-east-1.ecr.dkr
    - com.amazonaws.us-east-1.logs

4.2 Customer-Side Agent Deployment

The agent is deployed into the customer's AWS account via a Terraform module published to the Terraform Registry.

Terraform Module: dd0c/drift-agent/aws

module "drift_agent" {
  source  = "dd0c/drift-agent/aws"
  version = "~> 1.0"

  # Required
  dd0c_api_key       = var.dd0c_api_key          # From dd0c dashboard
  terraform_state_bucket = "my-terraform-state"   # S3 bucket with state files
  terraform_state_keys   = ["prod/*.tfstate"]     # Glob patterns for state files

  # Optional
  deployment_type    = "ecs"                       # "ecs" | "lambda" | "binary"
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnet_ids
  poll_interval      = 300                         # seconds (overridden by tier)
  
  # EventBridge real-time detection
  enable_eventbridge = true
  cloudtrail_name    = "main-trail"                # Existing CloudTrail trail name
  
  # Resource type filter (optional — default: all supported types)
  resource_types     = ["aws_security_group", "aws_iam_*", "aws_db_instance"]

  tags = {
    Environment = "production"
    ManagedBy   = "dd0c-drift"
  }
}

What the module creates:

| Resource | Purpose |
|---|---|
| ECS Task Definition + Service | Runs the drift agent container (Fargate, 0.25 vCPU, 512MB) |
| IAM Role: dd0c-drift-agent | Agent execution role with read-only permissions |
| IAM Policy: dd0c-drift-readonly | Read access to state bucket + describe APIs for monitored resource types |
| EventBridge Rules | Match CloudTrail write events for monitored resource types |
| SQS Queue: dd0c-drift-events | Buffer for EventBridge events consumed by agent |
| SQS DLQ: dd0c-drift-events-dlq | Dead letter queue for failed event processing |
| CloudWatch Log Group | Agent logs (retained 30 days) |
| Security Group | Egress-only to dd0c SaaS API endpoint + AWS service endpoints |

Alternative: GitHub Actions (Zero Infrastructure)

For teams that don't want to run infrastructure, the agent runs as a GitHub Action:

# .github/workflows/drift-check.yml
name: Drift Check
on:
  schedule:
    - cron: '*/15 * * * *'  # Every 15 minutes
  workflow_dispatch: {}

jobs:
  check:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # For OIDC auth to AWS
    steps:
      - uses: dd0c/drift-action@v1
        with:
          dd0c-api-key: ${{ secrets.DD0C_API_KEY }}
          aws-role-arn: arn:aws:iam::123456789:role/dd0c-drift-readonly
          state-bucket: my-terraform-state
          state-keys: "prod/*.tfstate"

This approach trades real-time detection (no EventBridge) for zero infrastructure. Good for getting started; teams can upgrade to the ECS deployment when they want real-time.

4.3 Cost Estimates

SaaS Platform Costs (Monthly):

| Component | 10 Stacks | 100 Stacks | 1,000 Stacks |
|---|---|---|---|
| RDS db.t4g.medium (Multi-AZ) | $140 | $140 | $280 (db.t4g.large) |
| ECS Fargate (Event Processor) | $15 | $35 | $150 |
| ECS Fargate (Dashboard API) | $30 | $30 | $60 |
| ECS Fargate (Remediation Engine) | $8 | $15 | $50 |
| Lambda (Notifications + Stream) | $1 | $5 | $30 |
| SQS | $1 | $3 | $15 |
| DynamoDB (On-Demand) | $2 | $15 | $130 |
| S3 (State Snapshots) | $1 | $5 | $40 |
| API Gateway | $4 | $15 | $80 |
| CloudFront | $5 | $5 | $10 |
| NAT Gateway | $35 | $35 | $70 |
| Cognito | $0 | $3 | $25 |
| CloudWatch / X-Ray | $10 | $20 | $50 |
| Secrets Manager | $2 | $2 | $5 |
| Total SaaS Infra | ~$254/mo | ~$328/mo | ~$995/mo |

Customer-Side Agent Costs (Per Customer, Monthly):

| Component | Cost |
|---|---|
| ECS Fargate (0.25 vCPU, 512MB, always-on) | ~$9/mo |
| SQS Queue (EventBridge events) | ~$0.50/mo |
| CloudWatch Logs | ~$1/mo |
| EventBridge Rules | Free (rules on the default event bus) |
| Total per customer | ~$10.50/mo |

This is important for pricing: the customer pays ~$10.50/mo in their own AWS bill to run the agent. The $49/mo Starter tier needs to deliver enough value to justify $49 + $10.50 = ~$60/mo total cost.

Unit Economics at Scale:

| Scale | Revenue (est.) | SaaS Infra Cost | Gross Margin |
|---|---|---|---|
| 10 customers (avg $99/mo) | $990/mo | $254/mo | 74% |
| 50 customers (avg $149/mo) | $7,450/mo | $328/mo | 96% |
| 200 customers (avg $199/mo) | $39,800/mo | $995/mo | 97% |

SaaS margins are excellent once past the fixed-cost floor (~$254/mo). The business breaks even at ~3 paying customers.

4.4 Scaling Strategy

Phase 1: MVP (0-100 stacks)

  • Single RDS instance (db.t4g.medium, Multi-AZ)
  • ECS Fargate with auto-scaling (min 1, max 3 per service)
  • DynamoDB on-demand (auto-scales)
  • Single region (us-east-1)

Phase 2: Growth (100-1,000 stacks)

  • RDS read replica for dashboard queries (separate read/write paths)
  • ECS auto-scaling up to 10 tasks per service
  • SQS batch processing (batch size 10 → higher throughput)
  • CloudFront caching for dashboard API (drift scores, stack lists — cache 60s)

Phase 3: Scale (1,000-10,000 stacks)

  • RDS upgrade to db.r6g.large + read replicas
  • Consider migrating event ingestion from SQS FIFO to Kinesis Data Streams (higher throughput, fan-out)
  • DynamoDB DAX for hot-path reads (drift score lookups)
  • Multi-region deployment (us-east-1 + eu-west-1) for data residency
  • Connection pooling via RDS Proxy

Phase 4: Enterprise (10,000+ stacks)

  • Dedicated RDS instances per large enterprise customer
  • Kinesis + Lambda fan-out for event processing
  • ElastiCache (Redis) for session management and rate limiting
  • This is a "good problem to have" phase — re-architect based on actual bottlenecks

4.5 CI/CD Pipeline

graph LR
    subgraph Developer
        CODE[Push to main] --> GH[GitHub]
    end

    subgraph CI — GitHub Actions
        GH --> LINT[Lint + Type Check]
        LINT --> TEST[Unit Tests]
        TEST --> INT[Integration Tests<br/>LocalStack]
        INT --> BUILD[Docker Build<br/>+ Go Binary]
        BUILD --> SCAN[Trivy Container Scan]
        SCAN --> PUSH[Push to ECR]
    end

    subgraph CD — Terraform + ECS
        PUSH --> TF_PLAN[Terraform Plan<br/>staging]
        TF_PLAN --> APPROVE[Manual Approval<br/>for prod]
        APPROVE --> TF_APPLY[Terraform Apply<br/>prod]
        TF_APPLY --> ECS_DEPLOY[ECS Rolling Deploy<br/>Blue/Green]
        ECS_DEPLOY --> SMOKE[Smoke Tests]
        SMOKE --> DONE[✅ Deployed]
    end

Pipeline Details:

| Stage | Tool | Duration |
|---|---|---|
| Lint + Type Check | ESLint + tsc (TypeScript), golangci-lint (Go) | ~30s |
| Unit Tests | Vitest (TypeScript), go test (Go) | ~60s |
| Integration Tests | LocalStack (SQS, DynamoDB, S3 emulation) | ~120s |
| Docker Build | Multi-stage Dockerfile, Go binary cross-compile | ~90s |
| Container Scan | Trivy (CVE scanning) | ~30s |
| ECR Push | Docker push to private ECR | ~20s |
| Terraform Plan | Plan against staging environment | ~30s |
| Manual Approval | GitHub Environment protection rule (prod) | Human |
| Terraform Apply | Apply to prod | ~60s |
| ECS Deploy | Rolling update (min healthy 100%, max 200%) | ~120s |
| Smoke Tests | Hit health endpoints, verify SQS consumption | ~30s |
| Total (automated) | | ~10 minutes |

Agent Release Pipeline:

The Go agent binary is released separately:

  1. Tag a release on GitHub (v1.2.3)
  2. GoReleaser builds binaries for linux/amd64, linux/arm64, darwin/amd64, darwin/arm64
  3. Docker image pushed to public ECR (for ECS deployment)
  4. GitHub Action published to GitHub Marketplace
  5. Terraform module version bumped in Terraform Registry
  6. Changelog posted to dd0c blog + Slack community

Section 5: SECURITY

5.1 IAM Role Design — Customer Accounts

The trust model is the hardest sell: customers are giving dd0c's agent read access to their Terraform state and cloud resource attributes. The architecture must make this as narrow and auditable as possible.

Agent Execution Role (Customer-Side):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadTerraformState",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${state_bucket}",
        "arn:aws:s3:::${state_bucket}/${state_key_prefix}*"
      ]
    },
    {
      "Sid": "DescribeResources",
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "iam:Get*",
        "iam:List*",
        "rds:Describe*",
        "s3:GetBucket*",
        "s3:GetEncryptionConfiguration",
        "s3:GetLifecycleConfiguration",
        "lambda:GetFunction",
        "lambda:GetFunctionConfiguration",
        "lambda:ListTags",
        "ecs:Describe*",
        "ecs:List*",
        "route53:GetHostedZone",
        "route53:ListResourceRecordSets",
        "elasticloadbalancing:Describe*",
        "cloudfront:GetDistribution",
        "sns:GetTopicAttributes",
        "sqs:GetQueueAttributes",
        "dynamodb:DescribeTable",
        "kms:DescribeKey",
        "kms:GetKeyPolicy"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "${monitored_regions}"
        }
      }
    },
    {
      "Sid": "ConsumeEventBridgeQueue",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:*:${account_id}:dd0c-drift-events"
    },
    {
      "Sid": "WriteAgentLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:${account_id}:log-group:/dd0c/drift-agent:*"
    }
  ]
}

Key design decisions:

  1. No cross-account role for the SaaS. The SaaS platform NEVER assumes a role in the customer's account. The agent runs with the customer's own IAM role. The SaaS only receives drift reports over HTTPS. This is the fundamental trust boundary.

  2. Read-only by default. The agent role has zero write permissions. It can describe resources and read state files. It cannot modify anything.

  3. Region-scoped. The aws:RequestedRegion condition limits describe calls to regions the customer explicitly configures. No global enumeration.

  4. State bucket scoped. S3 access is limited to the specific state bucket and key prefix. Not s3:* on *.

5.2 Remediation IAM Role (Separate, Opt-In)

Remediation requires write access. This is a SEPARATE IAM role that customers opt into explicitly. It is never created by default.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformApply",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "iam:*",
        "rds:*",
        "s3:*",
        "lambda:*",
        "ecs:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "${monitored_regions}"
        },
        "StringLike": {
          "aws:ResourceTag/ManagedBy": "terraform"
        }
      }
    },
    {
      "Sid": "StateLock",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:*:${account_id}:table/${state_lock_table}"
    }
  ]
}

Guardrails on remediation:

  1. Tag-scoped: The condition aws:ResourceTag/ManagedBy = terraform limits write actions to resources that are tagged as Terraform-managed. Resources created outside Terraform can't be modified.
  2. Approval required: The SaaS never triggers remediation without explicit human approval (button click in Slack or dashboard). Auto-remediation policies are customer-configured and customer-approved.
  3. Scoped apply: Remediation always uses terraform apply -target=<resource>. Never a full terraform apply. Blast radius is minimized.
  4. Audit trail: Every remediation action is logged in the event store with: who approved it, when, what was changed, and the full terraform plan output.
  5. Kill switch: Customers can revoke the remediation role at any time via IAM. The agent gracefully degrades to detect-only mode.
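
The scoped-apply guardrail is simple to enforce at command-construction time. A sketch (the flag layout is assumed; the real agent may shell out differently):

```typescript
// Build terraform CLI args for a scoped remediation apply.
// Refuses to produce an unscoped (full) apply.
function buildScopedApplyArgs(approvedTargets: string[]): string[] {
  if (approvedTargets.length === 0) {
    throw new Error("refusing unscoped apply: no approved -target addresses");
  }
  return [
    "apply",
    ...approvedTargets.map((t) => `-target=${t}`),
    "-auto-approve", // human approval already happened in Slack/dashboard
  ];
}
```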

5.3 State File Security

Terraform state files are the crown jewels. They contain resource IDs, configuration, and — critically — secret values (database passwords, API keys, private keys). The architecture must handle this with extreme care.

Principle: State files never leave the customer's account.

The agent reads the state file in-memory within the customer's VPC. It extracts resource attributes for drift comparison. Before transmitting anything to the SaaS:

  1. Attribute filtering: Only attributes relevant to drift detection are included in the report. The agent maintains an allowlist per resource type:
# attribute-allowlist.yaml
aws_security_group:
  - ingress
  - egress
  - name
  - description
  - vpc_id
  - tags

aws_iam_role:
  - assume_role_policy
  - max_session_duration
  - path
  - permissions_boundary
  - tags
  # NOT included: inline policies (may contain secrets in conditions)

aws_db_instance:
  - engine
  - engine_version
  - instance_class
  - allocated_storage
  - storage_type
  - multi_az
  - publicly_accessible
  - vpc_security_group_ids
  - db_subnet_group_name
  - parameter_group_name
  - tags
  # NOT included: master_password, endpoint (could be used for targeting)
  2. Secret pattern scrubbing: Even within allowed attributes, values matching secret patterns are redacted:

    • AWS access keys (AKIA...)
    • Database connection strings (postgres://..., mysql://...)
    • Private keys (-----BEGIN RSA PRIVATE KEY-----)
    • JWT tokens (eyJ...)
    • Generic patterns: any value for keys containing password, secret, token, key, credential
  3. In-transit encryption: All agent-to-SaaS communication uses TLS 1.3 with mTLS (mutual TLS). The agent presents a client certificate issued during registration. The SaaS validates it before accepting any data.

  4. At-rest encryption: Drift diffs stored in PostgreSQL and DynamoDB are encrypted with KMS (AWS-managed key for standard tiers, customer-managed key for enterprise).

  5. No state file caching: The agent does not write state file contents to disk. State is read from S3 into memory, processed, and discarded. The Go binary uses mlock to prevent state data from being swapped to disk.
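
The secret pattern scrubbing step amounts to a key-name hint plus a value-pattern pass. A sketch (patterns illustrative, not the agent's actual list):

```typescript
// Redact values that look like secrets, or whose key names suggest secrets.
const SECRET_VALUE_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/,                   // AWS access key IDs
  /(postgres|mysql):\/\/\S+/,           // database connection strings
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/, // PEM private keys
  /eyJ[\w-]+\.[\w-]+\.[\w-]+/,          // JWT-shaped tokens
];
const SECRET_KEY_HINT = /(password|secret|token|key|credential)/i;

function scrubAttributes(attrs: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(attrs)) {
    const suspicious =
      SECRET_KEY_HINT.test(name) || SECRET_VALUE_PATTERNS.some((p) => p.test(value));
    out[name] = suspicious ? "[REDACTED]" : value;
  }
  return out;
}
```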

5.4 SOC 2 Considerations

dd0c/drift will pursue SOC 2 Type II certification. The architecture supports the required controls:

| SOC 2 Criteria | How dd0c Addresses It |
|---|---|
| CC6.1 Logical access controls | Cognito auth, RBAC, API key auth for agents, RLS in PostgreSQL |
| CC6.2 Access provisioned/deprovisioned | User management via dashboard, API key rotation, agent revocation |
| CC6.3 Access restricted to authorized | mTLS for agent connections, JWT validation for dashboard, VPC isolation for data tier |
| CC7.1 Monitoring for anomalies | CloudWatch alarms, X-Ray tracing, agent heartbeat monitoring |
| CC7.2 Incident response | Runbook in Confluence, PagerDuty integration, automated rollback via ECS |
| CC8.1 Change management | Terraform IaC for all infrastructure, GitHub PR reviews, CI/CD pipeline |
| A1.2 Recovery objectives | RDS Multi-AZ (RPO: 0, RTO: <5min), S3 cross-region replication (enterprise), DynamoDB point-in-time recovery |
| C1.1 Confidentiality | State files never leave customer VPC, secret scrubbing, KMS encryption, TLS 1.3 |

Compliance automation:

The irony is not lost on us: a drift detection product must itself be drift-free. dd0c/drift will run dd0c/drift on its own infrastructure. Dogfooding as compliance evidence.

5.5 Trust Model

The trust model is the product's biggest adoption barrier. Here's how we address it at each level:

Level 1: "I don't trust you with any access" → GitHub Actions mode. The agent runs in the customer's GitHub Actions runner. dd0c only receives drift reports (no IAM role in customer account at all). The customer reviews the agent source code (open-source Go binary).

Level 2: "I'll give you read-only access" → Standard deployment. Agent runs in customer VPC with read-only IAM role. State files never leave the account. Only sanitized drift diffs are transmitted.

Level 3: "I trust you to remediate" → Remediation role (opt-in). Separate IAM role with write permissions. Scoped to tagged resources. Requires explicit human approval for every action.

Level 4: "I trust you to auto-remediate" → Auto-remediation policies (Business/Enterprise tier). Customer defines rules for automatic revert. Still uses the remediation IAM role. Full audit trail. Kill switch available.

Open-source agent:

The drift agent Go binary is open-source (Apache 2.0). Customers can:

  • Audit the code to verify what data is collected and transmitted
  • Build from source if they don't trust pre-built binaries
  • Fork and modify for custom requirements
  • Run in air-gapped environments with no SaaS connection (detect-only, local output)

This is the trust unlock. Security teams that won't install a closed-source agent will consider an open-source one they can audit.


Section 6: MVP SCOPE

6.1 V1 Boundary — What Ships

The MVP is ruthlessly scoped to one IaC tool, one cloud, one notification channel, and one deployment model. Everything else is deferred. The goal is: a solo founder ships a working product in 30 days that detects Terraform drift in AWS and alerts via Slack.

V1 Feature Matrix:

| Capability | V1 (Launch) | Status |
|---|---|---|
| IaC Support | Terraform + OpenTofu (state v4 format only) | Ship |
| Cloud Provider | AWS only | Ship |
| Detection: Scheduled | Poll state vs. cloud on configurable interval | Ship |
| Detection: Event-Driven | CloudTrail → EventBridge → SQS → Agent | Ship |
| Notification: Slack | Block Kit messages with action buttons | Ship |
| Remediation: Revert | Scoped terraform apply -target via agent | Ship |
| Remediation: Accept | Auto-generate PR to update IaC code | Ship |
| Dashboard | Drift score, stack list, event history (minimal React SPA) | Ship |
| Agent: ECS | Terraform module for ECS Fargate deployment | Ship |
| Agent: GitHub Actions | Scheduled workflow, zero infra | Ship |
| Onboarding | CLI drift init auto-discovery | Ship |
| Auth | Email/password + GitHub OAuth via Cognito | Ship |
| Billing | Stripe integration, self-serve upgrade | Ship |
| Multi-tenant | RLS-based isolation, org/user/stack model | Ship |

V1 Resource Type Coverage (Top 20):

The agent ships with drift detection for the 20 most commonly drifted AWS resource types. This list is derived from driftctl's historical GitHub issues, r/terraform drift complaints, and CloudTrail event frequency data:

| Priority | Resource Type | Why It Drifts | Detection Complexity |
|---|---|---|---|
| 1 | aws_security_group / aws_security_group_rule | Emergency port opens during incidents | Low — ec2:DescribeSecurityGroups |
| 2 | aws_iam_role / aws_iam_role_policy | Permission escalation, console edits | Medium — policy document comparison |
| 3 | aws_iam_policy / aws_iam_policy_attachment | Inline policy edits, attachment changes | Medium — version document diff |
| 4 | aws_s3_bucket (config attributes) | Public access toggles, lifecycle changes | Medium — composite describe calls |
| 5 | aws_db_instance | Parameter group changes, storage scaling | Low — rds:DescribeDBInstances |
| 6 | aws_instance | Instance type changes, security group swaps | Low — ec2:DescribeInstances |
| 7 | aws_lambda_function | Runtime updates, env var changes | Low — lambda:GetFunction |
| 8 | aws_ecs_service | Task count changes, image tag updates | Low — ecs:DescribeServices |
| 9 | aws_ecs_task_definition | Container definition edits | Medium — JSON deep comparison |
| 10 | aws_route53_record | DNS record changes (manual cutover) | Low — route53:ListResourceRecordSets |
| 11 | aws_lb_listener / aws_lb_listener_rule | Routing rule changes | Low — elbv2:DescribeListeners |
| 12 | aws_autoscaling_group | Desired capacity (auto-scaling noise) | Low — needs noise filtering |
| 13 | aws_cloudwatch_metric_alarm | Threshold tweaks | Low — cloudwatch:DescribeAlarms |
| 14 | aws_sns_topic / aws_sqs_queue | Policy changes, subscription edits | Low — sns:GetTopicAttributes |
| 15 | aws_dynamodb_table | Capacity mode changes, GSI edits | Medium — dynamodb:DescribeTable |
| 16 | aws_elasticache_cluster | Node type changes, parameter group | Low — elasticache:DescribeCacheClusters |
| 17 | aws_kms_key | Key policy changes | Medium — policy document diff |
| 18 | aws_cloudfront_distribution | Origin changes, behavior edits | High — complex nested config |
| 19 | aws_vpc / aws_subnet | CIDR changes, tag drift | Low — ec2:DescribeVpcs |
| 20 | aws_eip / aws_nat_gateway | Association changes | Low — ec2:DescribeAddresses |

Resource types beyond the top 20 are detected as "unknown drift" — the agent reports that the resource exists in state but can't compare attributes. Customers can request priority for specific types via GitHub issues.

6.2 What's Deferred to V2+

Saying "no" is the only way a solo founder ships in 30 days. Here's what's explicitly deferred and why:

V2 (Month 3-4):

| Feature | Why Deferred | Dependency |
|---|---|---|
| CloudFormation support | Different state format (stack resources API), different drift detection mechanism (detect-stack-drift API). Requires a separate parser and comparator. | New state parser module |
| Pulumi support | Pulumi state is stored differently (Pulumi Service backend or S3 with different schema). Requires a new state parser. | New state parser module |
| Auto-remediation policies | Per-resource-type automation rules (auto-revert, alert, digest, ignore). Requires policy engine, rule evaluation, and careful UX to avoid accidental auto-reverts. | Policy engine, approval workflow |
| Compliance report generation | SOC 2 / HIPAA evidence export (PDF/CSV). Requires report templating, data aggregation, and export pipeline. | DynamoDB event store populated with 30+ days of data |
| Drift trends & analytics | Time-series charts (drift rate, MTTR, most-drifted resources). Requires metrics aggregation pipeline and charting frontend. | DynamoDB Streams consumer, charting library |
| PagerDuty / OpsGenie integration | Route critical drift through existing on-call. Requires integration auth, event mapping, and escalation logic. | Notification service extension |
| Teams & RBAC | Multi-team support, role-based permissions, stack-level access control. Requires authorization layer beyond basic org membership. | Auth service extension |

V3 (Month 6-9):

| Feature | Why Deferred |
|---|---|
| Multi-cloud (Azure, GCP) | Each cloud requires its own describe API mapping, authentication model, and event pipeline. Triple the agent complexity. |
| Drift prediction (ML) | Requires aggregate data from 500+ customers to build meaningful models. Can't do this at launch. |
| Industry benchmarking | Same data requirement as prediction. Need critical mass of anonymized drift data. |
| SSO / SAML | Enterprise auth. Not needed until enterprise customers appear. Cognito supports it when ready. |
| Full API & webhooks | Public API for programmatic access. V1 has internal APIs only. Public API requires versioning, rate limiting, documentation, and SDK generation. |
| dd0c platform integration | Cross-module data flow (drift → alert, drift → portal). Requires dd0c/alert and dd0c/portal to exist first. |

Explicitly NOT building (ever, unless market demands it):

| Anti-Feature | Why Not |
|---|---|
| CI/CD orchestration | That's Spacelift/env0's game. We detect drift, we don't run pipelines. |
| Policy-as-code engine (OPA/Sentinel) | Adjacent but different problem. Integrate with existing policy tools, don't build one. |
| Cost management | That's dd0c/cost. Separate product, separate concern. |
| Service catalog | That's dd0c/portal. Drift feeds into it, doesn't replace it. |
| Multi-cloud state management | We read state, we don't manage it. No state migration, no state locking, no remote backend hosting. |

6.3 Onboarding Flow

The onboarding flow is the product's first impression. It must go from CLI install to first Slack alert in under 5 minutes. Every second of friction is a lost conversion.

Flow: CLI-First Onboarding

Step 1: Install CLI
─────────────────────────────────────────────
$ brew install dd0c/tap/drift-cli
# or: curl -sSL https://get.dd0c.dev/drift | sh
# or: go install github.com/dd0c/drift-cli@latest

Step 2: Authenticate
─────────────────────────────────────────────
$ drift auth login
→ Opens browser: https://app.dd0c.dev/auth
→ GitHub OAuth or email/password
→ CLI receives token via localhost callback
✅ Authenticated as brian@dd0c.dev (org: dd0c)

Step 3: Auto-Discover State Backends
─────────────────────────────────────────────
$ drift init
🔍 Scanning for Terraform state backends...

Found 3 state backends:
  1. s3://acme-terraform-state/prod/networking.tfstate    (23 resources)
  2. s3://acme-terraform-state/prod/compute.tfstate       (47 resources)
  3. s3://acme-terraform-state/staging/main.tfstate       (31 resources)

Register all 3 stacks? [Y/n]: Y

✅ Registered 3 stacks (101 resources total)
   Org plan: Free (3 stacks max) — you're at capacity

Step 4: Configure Slack
─────────────────────────────────────────────
$ drift connect slack
→ Opens browser: Slack OAuth install flow
→ Select workspace and channel (#infrastructure-alerts)
✅ Connected to Slack workspace "Acme Corp" → #infrastructure-alerts

Step 5: First Drift Check
─────────────────────────────────────────────
$ drift check --all
🔍 Checking 3 stacks (101 resources)...

Stack: prod-networking (23 resources)
  🔴 CRITICAL  aws_security_group.api — ingress rule added (0.0.0.0/0:443)
  🟡 MEDIUM    aws_route53_record.api — TTL changed (300 → 60)
  ✅ 21 resources clean

Stack: prod-compute (47 resources)
  🟠 HIGH      aws_iam_role.lambda_exec — policy document changed
  🔵 LOW       aws_instance.worker[0] — tags.Environment changed
  ✅ 45 resources clean

Stack: staging-main (31 resources)
  ✅ All 31 resources clean

Summary: 4 drifted resources across 3 stacks (96% aligned)
📨 Slack alerts sent to #infrastructure-alerts

Step 6: Deploy Agent (Optional — for continuous monitoring)
─────────────────────────────────────────────
$ drift agent deploy --type=github-action
→ Generates .github/workflows/drift-check.yml
→ Creates GitHub secret DD0C_API_KEY via gh CLI
✅ Agent deployed — drift checks will run every 15 minutes

# OR for ECS deployment:
$ drift agent deploy --type=ecs --vpc-id=vpc-abc123 --subnets=subnet-1,subnet-2
→ Generates Terraform module in ./dd0c-drift-agent/
→ Run: cd dd0c-drift-agent && terraform init && terraform apply

Auto-Discovery Logic:

The drift init command discovers state backends through multiple strategies:

| Strategy | How It Works | Coverage |
|---|---|---|
| AWS credential chain | Uses default AWS credentials to scan S3 buckets matching common patterns (`*-terraform-state`, `*-tfstate`, `*-tf-state`) | ~60% of teams |
| Terraform config scan | Walks the current directory tree for `*.tf` files, parses `backend` blocks | ~80% of teams (if run from repo root) |
| Environment variables | Reads `TF_STATE_BUCKET`, `TF_WORKSPACE`, `AWS_DEFAULT_REGION` | ~30% of teams |
| Terraform Cloud/Enterprise | Checks `~/.terraform.d/credentials.tfrc.json` for TFC tokens, queries the workspace API | ~15% of teams (TFC users) |
| Interactive fallback | If auto-discovery finds nothing: "Enter your S3 state bucket name:" | 100% (manual) |

The goal: 80% of users hit drift init and see their stacks listed without typing a bucket name.
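As a minimal sketch of the first strategy, the bucket-name matching reduces to a suffix check. The function name `matchesStatePattern` is illustrative, not the shipped agent API:

```go
package main

import (
	"fmt"
	"strings"
)

// matchesStatePattern reports whether an S3 bucket name looks like a
// Terraform state bucket, using the suffix patterns from the
// credential-chain strategy above.
func matchesStatePattern(bucket string) bool {
	suffixes := []string{"-terraform-state", "-tfstate", "-tf-state"}
	for _, s := range suffixes {
		if strings.HasSuffix(bucket, s) {
			return true
		}
	}
	return false
}

func main() {
	for _, b := range []string{"acme-terraform-state", "corp-tfstate", "acme-assets"} {
		fmt.Printf("%s => %v\n", b, matchesStatePattern(b))
	}
}
```

A real scan would list buckets via the S3 API and apply this filter before probing for `.tfstate` objects.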

Onboarding Metrics:

| Metric | Target | Kill Threshold |
|---|---|---|
| Time from install to first drift check output | < 3 minutes | > 10 minutes |
| Time from signup to first Slack alert | < 5 minutes | > 15 minutes |
| `drift init` auto-discovery success rate | > 70% | < 40% |
| Onboarding completion rate (install → Slack connected) | > 60% | < 30% |
| Drop-off at each step | < 15% per step | > 30% at any step |

6.4 Technical Debt Budget

A solo founder shipping in 30 days will accumulate technical debt. The key is to accumulate it intentionally, in known locations, with a plan to pay it down.

Acceptable Debt (Ship With It):

| Debt Item | Where | Why It's Acceptable | Pay Down By |
|---|---|---|---|
| Hardcoded resource type mappings | Agent: `resource_mapper.go` | The top 20 resource types are hardcoded as Go structs with describe API calls. No plugin system. Adding type 21 requires a code change and agent release. | V2: Plugin system or YAML-driven resource type definitions |
| Single-region SaaS | Infrastructure: us-east-1 only | Multi-region adds complexity (RDS replication, S3 cross-region, routing). Not needed until EU customers demand GDPR residency. | V2: eu-west-1 deployment when first EU enterprise customer signs |
| No database migrations framework | SaaS: raw SQL files | At MVP, the schema is small enough to manage with numbered SQL files. No Flyway/Liquibase. | Month 3: Adopt golang-migrate or Prisma Migrate before schema exceeds 20 tables |
| Minimal error handling in CLI | CLI: `drift init`, `drift check` | Error messages are functional but not polished. Stack traces may leak in edge cases. | Month 2: Error wrapping, user-friendly messages, `--debug` flag for verbose output |
| No retry logic on Slack API | Notification Service Lambda | Slack API failures drop the notification silently. No retry queue. At low volume, this is rare. | Month 2: SQS DLQ for failed Slack deliveries, retry with exponential backoff |
| Dashboard is read-only | Web SPA | V1 dashboard shows drift scores and event history. Settings, team management, and policy configuration are CLI-only. | V2: Full dashboard CRUD for stacks, policies, team members |
| No rate limiting on public API | API Gateway | API Gateway has default throttling (10K req/s) but no per-org rate limiting. At MVP scale, this is fine. | Month 3: Per-org rate limiting via API Gateway usage plans + API keys |
| Test coverage < 80% | Agent + SaaS | Integration tests cover the critical path (detect → notify → remediate). Unit test coverage will be ~60% at launch. | Month 2-3: Increase to 80%+ with focus on drift comparator and secret scrubber |
| No OpenTelemetry | SaaS services | V1 uses CloudWatch Logs + X-Ray. No custom metrics, no distributed trace correlation across services. | V2: OTel SDK integration, custom metrics (drift detection latency, queue depth, notification delivery rate) |
| Monorepo without workspace tooling | Repository structure | Single repo with `agent/`, `saas/`, `dashboard/`, `cli/` directories. No Nx/Turborepo. Builds are sequential. | Month 3: Turborepo or Nx for parallel builds when CI time exceeds 15 minutes |

Unacceptable Debt (Do NOT Ship With):

| Debt Item | Why It's Unacceptable |
|---|---|
| No secret scrubbing | Transmitting customer secrets to SaaS is a trust-destroying, potentially lawsuit-inducing failure. Secret scrubber ships in V1, fully tested. |
| No RLS on PostgreSQL | Cross-tenant data leakage is an existential risk. RLS is enabled from Day 1 with integration tests that verify isolation. |
| No mTLS on agent connections | Agent-to-SaaS communication without mutual TLS means anyone with an API key can impersonate an agent. mTLS ships in V1. |
| No Stripe webhook verification | Accepting unverified Stripe webhooks enables billing manipulation. Signature verification is a one-liner — no excuse to skip it. |
| No input validation on drift diffs | Malicious agents could inject SQL or XSS via crafted drift diffs. Input validation and parameterized queries are non-negotiable. |
| No CloudTrail event signature verification | EventBridge events should be validated against CloudTrail digest files to prevent spoofed drift events. |

Debt Paydown Schedule:

Month 1 (Launch):     Ship with acceptable debt. Focus on working product.
Month 2:              Error handling, Slack retry logic, test coverage to 70%
Month 3:              Rate limiting, database migrations framework, Turborepo
Month 4 (V2 start):   Plugin system for resource types, dashboard CRUD, OTel
Month 6:              Multi-region, full test coverage (80%+), performance profiling

6.5 Solo Founder Operational Model

One person builds, ships, operates, markets, and supports this product. The architecture must minimize operational burden or the founder burns out before reaching $10K MRR.

Operational Principles:

  1. Managed services over self-hosted. RDS over self-managed PostgreSQL. Cognito over self-hosted auth. SQS over self-managed RabbitMQ. Lambda over always-on notification servers. Every managed service is one fewer thing to page about at 3am.

  2. Alerts on business impact, not infrastructure metrics. Don't alert on CPU > 80%. Alert on: "drift reports stopped arriving from agent X" (customer impact), "Slack notification delivery failed 3x" (customer impact), "RDS storage > 80%" (approaching outage). Fewer alerts = sustainable on-call.

  3. Automate recovery, not just detection. ECS auto-restarts crashed tasks. Lambda retries on failure. SQS DLQ captures poison messages. RDS Multi-AZ fails over automatically. The founder should wake up to a resolved incident, not an active one.

  4. Weekly maintenance window, not continuous ops. Sunday evening: review CloudWatch dashboards, check DLQ depth, review error logs, update dependencies, run terraform plan to verify no drift (dogfooding). 2 hours/week max.

On-Call Model:

PagerDuty Configuration:
  - Critical alerts (customer-facing outage):     Page immediately, 24/7
  - High alerts (degraded service):               Page during business hours only
  - Medium alerts (non-urgent operational):        Slack notification, review in weekly maintenance
  - Low alerts (informational):                    CloudWatch dashboard only

Critical Alert Triggers:
  - API Gateway 5xx rate > 5% for 5 minutes
  - SQS FIFO queue age > 15 minutes (drift reports backing up)
  - RDS connection count > 80% of max
  - Zero drift reports received in 1 hour (all agents down?)
  - Stripe webhook processing failures > 3 consecutive

Estimated Alert Volume:
  - Month 1-3:   ~2-3 critical alerts/month (new system, bugs)
  - Month 3-6:   ~1 critical alert/month (stabilized)
  - Month 6+:    ~0.5 critical alerts/month (mature)

Support Model:

| Channel | Response Time | Tier |
|---|---|---|
| GitHub Issues (drift-cli) | 24-48 hours | All (open-source community) |
| In-app chat (Intercom) | 24 hours (business days) | Free + Paid |
| Slack community (#dd0c-drift) | Best effort, same day | All |
| Email (support@dd0c.dev) | 24 hours | Paid |
| Priority Slack DM | 4 hours (business days) | Business + Enterprise |

Time Allocation (Solo Founder, 50 hrs/week):

Week 1-4 (Pre-Launch Build):
  ├── 35 hrs  Engineering (agent + SaaS + dashboard + CLI)
  ├── 5 hrs   Infrastructure (Terraform, CI/CD, monitoring)
  ├── 5 hrs   Content (README, docs, blog draft)
  └── 5 hrs   Community (Reddit lurking, driftctl issue monitoring)

Week 5-8 (Launch + Seed):
  ├── 25 hrs  Engineering (bug fixes, polish, V1.1 patches)
  ├── 5 hrs   Infrastructure + ops
  ├── 10 hrs  Content (blog posts, Drift Cost Calculator)
  └── 10 hrs  Community (Reddit engagement, HN launch, design partners)

Week 9-12 (Growth):
  ├── 20 hrs  Engineering (V1.x improvements, V2 planning)
  ├── 5 hrs   Ops + support
  ├── 10 hrs  Content + SEO
  ├── 10 hrs  Community + partnerships
  └── 5 hrs   Business (metrics review, pricing analysis, investor prep)

Steady State (Month 4+):
  ├── 20 hrs  Engineering
  ├── 5 hrs   Ops + support (scales with customer count)
  ├── 10 hrs  Marketing (content + community + partnerships)
  ├── 10 hrs  Product (customer feedback, roadmap, design)
  └── 5 hrs   Business (metrics, billing, legal)

Automation That Saves Founder Time:

| Automation | Time Saved | Implementation |
|---|---|---|
| Stripe billing | 5 hrs/week | Self-serve upgrade/downgrade, automatic invoicing, dunning emails |
| GitHub Actions CI/CD | 3 hrs/week | Automated test → build → deploy pipeline. No manual deployments. |
| Intercom chatbot | 2 hrs/week | FAQ auto-responses for common questions (pricing, setup, supported resources) |
| CloudWatch auto-remediation | 1 hr/week | Auto-restart ECS tasks, auto-scale on queue depth, auto-archive old DynamoDB items |
| Dependabot + Renovate | 1 hr/week | Automated dependency updates with auto-merge for patch versions |
| dd0c/drift on dd0c/drift | 1 hr/week | Dogfooding — drift detection on own infrastructure eliminates manual `terraform plan` runs |

Section 7: API DESIGN

The dd0c/drift API surface is divided into three distinct zones: the highly restricted Agent API (mTLS authenticated, ingestion only), the standard Dashboard API (JWT authenticated, CRUD operations), and the Integration APIs (Webhooks, Slack, and dd0c platform cross-talk).

7.1 Agent API (Ingestion & Heartbeat)

This API is exposed strictly for the drift agent running in the customer's environment. All endpoints require mutual TLS (mTLS) combined with a static, org-scoped API key sent via headers.

Agent Registration & Heartbeat:

POST /v1/agents/register
Authorization: Bearer dd0c_api_...
Content-Type: application/json

{
  "agent_id": "uuid",
  "name": "prod-ecs-cluster-agent",
  "version": "1.2.3",
  "deployment_type": "ecs",
  "monitored_stacks": ["prod-networking", "prod-compute"]
}

# Response: 200 OK
{
  "status": "active",
  "poll_interval_s": 300,
  "config_hash": "abc123def456"
}

POST /v1/agents/{agent_id}/heartbeat
Authorization: Bearer dd0c_api_...

{
  "uptime_s": 86400,
  "events_processed": 142,
  "memory_mb": 42
}

Drift Report Submission:

This is the core ingestion endpoint. It accepts batched drift reports (either from event-driven CloudTrail intercepts or scheduled full-state comparisons).

POST /v1/drift-reports
Authorization: Bearer dd0c_api_...

{
  "stack_id": "prod-networking",
  "report_id": "uuid",
  "detection_method": "event_driven",
  "timestamp": "2026-02-28T10:00:00Z",
  "drifted_resources": [
    {
      "address": "module.vpc.aws_security_group.api",
      "type": "aws_security_group",
      "severity": "critical",
      "category": "security",
      "diff": {
        "ingress": {
          "old": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8"]}],
          "new": [{"from_port": 443, "cidr_blocks": ["10.0.0.0/8", "0.0.0.0/0"]}]
        }
      },
      "attribution": {
        "principal": "arn:aws:iam::123456:user/jsmith",
        "source_ip": "192.168.1.1",
        "event_name": "AuthorizeSecurityGroupIngress"
      }
    }
  ]
}

7.2 Dashboard & Query API

This is the REST API consumed by the React web dashboard. It relies on standard JWT Bearer tokens issued by Cognito. All responses are scoped via RLS to the authenticated user's org_id.

Drift Event Query & Search:

Allows complex filtering to power the drift history and active drift dashboards.

GET /v1/drift-events
  ?stack_id=prod-networking
  &status=open
  &severity=critical,high
  &limit=50
  &offset=0

# Response: 200 OK
{
  "data": [
    {
      "id": "evt_abc123",
      "stack_id": "prod-networking",
      "resource_address": "module.vpc.aws_security_group.api",
      "severity": "critical",
      "status": "open",
      "created_at": "2026-02-28T10:00:00Z"
      // ... full event details ...
    }
  ],
  "pagination": {
    "total": 12,
    "has_next": false
  }
}
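A client can drain this endpoint by advancing `offset` until `has_next` is false. A minimal sketch with a stubbed fetcher standing in for the HTTP call (all names are illustrative):

```go
package main

import "fmt"

// page mirrors the response envelope above: a slice of event IDs plus
// the pagination flag. A real fetcher would issue
// GET /v1/drift-events?limit=N&offset=M and decode the JSON body.
type page struct {
	Data    []string
	HasNext bool
}

// fetchAll walks pages until has_next is false, accumulating event IDs.
func fetchAll(fetch func(limit, offset int) page, limit int) []string {
	var all []string
	for offset := 0; ; offset += limit {
		p := fetch(limit, offset)
		all = append(all, p.Data...)
		if !p.HasNext {
			return all
		}
	}
}

func main() {
	events := []string{"evt_1", "evt_2", "evt_3", "evt_4", "evt_5"}
	fake := func(limit, offset int) page {
		if end := offset + limit; end < len(events) {
			return page{Data: events[offset:end], HasNext: true}
		}
		return page{Data: events[offset:], HasNext: false}
	}
	fmt.Println(len(fetchAll(fake, 2))) // → 5
}
```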

Policy Configuration API (V2):

Manages the auto-remediation and alerting policies for a stack.

POST /v1/stacks/{stack_id}/policies

{
  "name": "Auto-revert public security groups",
  "resource_type": "aws_security_group",
  "condition": "severity == 'critical'",
  "action": "auto_revert",
  "enabled": true
}

7.3 Slack Integration

The Slack integration relies on two primary interaction models: Slash commands (user-initiated) and Interactive Actions (button clicks on drift alerts).

Slash Commands:

  • /drift status [stack_name] - Returns the current drift score, number of open drift events, and agent health status for the specified stack (or all stacks if omitted).
  • /drift check [stack_name] - Triggers an immediate, out-of-band drift check on the agent for the specified stack (same code path as a scheduled full-state comparison).
  • /drift silence [resource_address] [duration] - Temporarily mutes alerts for a noisy resource (e.g., /drift silence aws_autoscaling_group.workers 24h).

Interactive Actions (Webhooks from Slack):

When a user clicks an action button ([Revert], [Accept], [Snooze]) in a drift alert message, Slack POSTs a signed payload to the dd0c API Gateway.

POST /v1/slack/interactions
Content-Type: application/x-www-form-urlencoded
X-Slack-Signature: v0=a2114d...
X-Slack-Request-Timestamp: 1614556800

payload={
  "type": "block_actions",
  "user": { "id": "U123456", "name": "jsmith" },
  "actions": [
    {
      "action_id": "drift_revert",
      "value": "evt_abc123"
    }
  ]
}

The Notification Service Lambda validates the signature, performs RBAC checks, and initiates the workflow via the Remediation Engine.

7.4 Outbound Webhooks

Customers can subscribe to drift events via outbound webhooks to trigger custom internal workflows (e.g., creating Jira tickets, updating custom dashboards).

Webhook Registration:

Customers configure an endpoint URL and receive a signing secret in the dashboard.

Payload Delivery:

POST /webhook
X-dd0c-Signature: sha256=a1b2c3d4e5f6...
Content-Type: application/json

{
  "event_type": "drift.detected",
  "event_id": "webhook_evt_789",
  "timestamp": "2026-02-28T10:05:00Z",
  "data": {
    "drift_event_id": "evt_abc123",
    "stack_id": "prod-networking",
    "resource_address": "module.vpc.aws_security_group.api",
    "severity": "critical"
  }
}

Other event types include drift.resolved (when an event is remediated or accepted) and agent.offline (when an agent misses its heartbeat threshold).

7.5 dd0c Platform Integrations

While dd0c/drift works perfectly as a standalone SaaS, its value compounds when deployed alongside other modules in the dd0c platform suite.

dd0c/alert Integration: Instead of dd0c/drift connecting directly to Slack or PagerDuty, it can emit events to dd0c/alert via an internal event bus (EventBridge). dd0c/alert handles the intelligent routing based on on-call schedules, deduplication, grouping, and escalation policies.

API Flow: drift Event Processor -> internal EventBridge -> dd0c/alert Ingestion API

dd0c/portal Integration: dd0c/portal serves as the developer service catalog. When a team views a service in the catalog, it queries the drift dashboard API for infrastructure health.

API Flow: dd0c/portal backend -> GET /v1/stacks/{id}/drift-score -> UI rendering

This enriches the service catalog with real-time IaC compliance metrics and allows developers to see their drift score directly next to their service ownership details without opening the drift dashboard.

7.6 Rate Limits & Throttling

Rate limits are enforced at the API Gateway layer via usage plans. Limits are tiered by plan and by API zone (Agent vs. Dashboard) to protect the platform from runaway agents and abusive clients while keeping the system responsive for legitimate use.

Agent API Rate Limits:

| Tier | Drift Report Submissions | Heartbeats | Agent Registrations |
|---|---|---|---|
| Free | 100 req/hour per org | 60 req/hour per agent | 5 req/day per org |
| Starter | 500 req/hour per org | 120 req/hour per agent | 20 req/day per org |
| Pro | 2,000 req/hour per org | 120 req/hour per agent | 50 req/day per org |
| Business | 10,000 req/hour per org | 300 req/hour per agent | 200 req/day per org |
| Enterprise | Custom (negotiated) | Custom | Custom |

These limits are generous relative to expected usage. A Pro tier customer with 30 stacks checking every 5 minutes generates ~360 reports/hour — well within the 2,000 limit. The limits exist to catch misconfigured agents stuck in tight loops, not to throttle normal operation.

Dashboard API Rate Limits:

| Tier | Read Requests | Write Requests (mutations) |
|---|---|---|
| Free | 300 req/min per user | 30 req/min per user |
| Starter | 600 req/min per user | 60 req/min per user |
| Pro | 1,200 req/min per user | 120 req/min per user |
| Business | 3,000 req/min per user | 300 req/min per user |

Slack Interaction Limits:

Slack interactions (button clicks, slash commands) are rate-limited at 60 req/min per Slack workspace. This prevents a runaway Slack bot or automated Slack client from overwhelming the remediation engine. Slack's own rate limits (~1 message/sec per channel) provide an additional natural throttle on the notification side.

Rate Limit Headers:

All API responses include standard rate limit headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1847
X-RateLimit-Reset: 1709110800
Retry-After: 42          # Only present on 429 responses

When a client exceeds its rate limit, the API returns 429 Too Many Requests with a Retry-After header indicating seconds until the window resets. The agent is built to respect this header and back off automatically with jitter.

7.7 Error Codes & Response Format

All API errors follow a consistent JSON envelope. Every error includes a machine-readable code, a human-readable message, and an optional details object for structured context.

Error Response Format:

{
  "error": {
    "code": "DRIFT_STACK_NOT_FOUND",
    "message": "Stack 'prod-networking' does not exist or you do not have access.",
    "request_id": "req_a1b2c3d4",
    "details": {
      "stack_id": "prod-networking"
    }
  }
}

HTTP Status Codes:

| Status | When Used |
|---|---|
| 200 OK | Successful read or mutation |
| 201 Created | Resource created (agent registration, policy creation, webhook subscription) |
| 202 Accepted | Async operation queued (drift report ingested, remediation initiated) |
| 204 No Content | Successful deletion |
| 400 Bad Request | Malformed payload, missing required fields, invalid filter parameters |
| 401 Unauthorized | Missing or invalid authentication (expired JWT, bad API key) |
| 403 Forbidden | Authenticated but insufficient permissions (RBAC violation, wrong org) |
| 404 Not Found | Resource doesn't exist or is outside the caller's org scope (RLS) |
| 409 Conflict | Duplicate agent registration, policy name collision, concurrent remediation on same resource |
| 422 Unprocessable Entity | Semantically invalid request (e.g., policy references a non-existent stack, invalid severity value) |
| 429 Too Many Requests | Rate limit exceeded — includes Retry-After header |
| 500 Internal Server Error | Unhandled server error — logged, alerted, includes request_id for support correlation |
| 502 Bad Gateway | Upstream dependency failure (Slack API down, GitHub API timeout) |
| 503 Service Unavailable | Planned maintenance or circuit breaker tripped — includes Retry-After |

Application Error Codes:

| Code | HTTP Status | Description |
|---|---|---|
| AGENT_NOT_FOUND | 404 | Agent ID does not exist or belongs to a different org |
| AGENT_REVOKED | 403 | Agent API key has been revoked — re-register required |
| AGENT_VERSION_UNSUPPORTED | 422 | Agent version is below the minimum supported version (forces upgrade) |
| DRIFT_REPORT_DUPLICATE | 409 | Report with this report_id was already ingested (SQS FIFO dedup fallback) |
| DRIFT_REPORT_INVALID | 400 | Report payload fails schema validation (missing fields, invalid types) |
| DRIFT_REPORT_TOO_LARGE | 400 | Report exceeds 1MB payload limit — split into multiple submissions |
| DRIFT_STACK_NOT_FOUND | 404 | Stack does not exist or caller lacks access |
| DRIFT_STACK_LIMIT | 403 | Org has reached the maximum stack count for their plan tier |
| DRIFT_EVENT_NOT_FOUND | 404 | Drift event ID does not exist |
| REMEDIATION_IN_PROGRESS | 409 | A remediation is already running for this resource — wait for completion |
| REMEDIATION_NOT_PERMITTED | 403 | User's RBAC role does not allow remediation on this stack |
| REMEDIATION_AGENT_OFFLINE | 502 | The agent responsible for this stack has not sent a heartbeat in >5 minutes |
| POLICY_INVALID | 422 | Policy condition syntax is invalid or references unsupported resource types |
| POLICY_LIMIT | 403 | Org has reached the maximum policy count for their plan tier |
| SLACK_NOT_CONNECTED | 422 | Slack workspace is not connected — required for Slack-based actions |
| SLACK_USER_NOT_MAPPED | 422 | Slack user ID cannot be mapped to a dd0c user — re-authenticate Slack |
| WEBHOOK_DELIVERY_FAILED | N/A (Async) | Webhook endpoint returned non-2xx — retried 3x with exponential backoff, then disabled |
| AUTH_TOKEN_EXPIRED | 401 | JWT has expired — refresh via Cognito token endpoint |
| AUTH_TOKEN_INVALID | 401 | JWT signature verification failed |
| RATE_LIMIT_EXCEEDED | 429 | Request throttled — respect Retry-After header |
| INTERNAL_ERROR | 500 | Unhandled exception — request_id included for support escalation |

Retry Guidance for Agent Developers:

The open-source agent implements the following retry strategy, and third-party integrations should follow the same pattern:

| Error Code | Retry? | Strategy |
|---|---|---|
| 429 | Yes | Exponential backoff starting at Retry-After value, max 5 retries, jitter ±20% |
| 500 | Yes | Exponential backoff starting at 1s, max 3 retries |
| 502 / 503 | Yes | Exponential backoff starting at 5s, max 5 retries |
| 400 / 422 | No | Fix the payload — retrying the same request will produce the same error |
| 401 | No | Re-authenticate — API key may be rotated or JWT expired |
| 403 | No | Permission issue — check RBAC or plan tier |
| 409 | Conditional | For DRIFT_REPORT_DUPLICATE, safe to ignore. For REMEDIATION_IN_PROGRESS, poll status and retry after completion. |