
dd0c/cost — Technical Architecture

AWS Cost Anomaly Detective

Version: 1.0
Date: February 28, 2026
Author: Architecture (Phase 6)
Status: Draft
Audience: Senior AWS architects, founding engineers


1. SYSTEM OVERVIEW

High-Level Architecture

graph TB
    subgraph "Customer AWS Account"
        CT[CloudTrail] -->|Events| EB[EventBridge Rule]
        CUR_S3[CUR Export → S3 Bucket]
        IAM_RO[IAM Role: dd0c-cost-readonly]
        IAM_REM[IAM Role: dd0c-cost-remediate<br/>opt-in]
    end

    subgraph "dd0c/cost Platform (dd0c AWS Account)"
        subgraph "Layer 1: Real-Time Event Stream"
            EB -->|Cross-account EventBridge| EB_TARGET[EventBridge Target Bus]
            EB_TARGET --> SQS_INGEST[SQS: event-ingestion<br/>FIFO, dedup]
            SQS_INGEST --> LAMBDA_PROC[Lambda: event-processor<br/>CloudTrail → CostEvent normalization]
            LAMBDA_PROC --> DDB_BASELINE[DynamoDB: baselines<br/>per-account, per-service]
            LAMBDA_PROC --> ANOMALY[Lambda: anomaly-scorer<br/>Z-score + heuristics]
            ANOMALY --> SQS_ALERT[SQS: alert-queue]
        end

        subgraph "Layer 2: CUR Reconciliation (V2)"
            CUR_S3 -->|S3 Replication or<br/>cross-account read| S3_CUR[S3: cur-data-lake]
            S3_CUR --> ATHENA[Athena: CUR queries]
            ATHENA --> LAMBDA_RECON[Lambda: reconciler<br/>daily batch]
            LAMBDA_RECON --> DDB_BASELINE
        end

        subgraph "Notification & Remediation"
            SQS_ALERT --> LAMBDA_NOTIFY[Lambda: notifier]
            LAMBDA_NOTIFY --> SLACK[Slack API<br/>Block Kit messages]
            SLACK -->|Interactive payload| APIGW[API Gateway]
            APIGW --> LAMBDA_ACTION[Lambda: action-handler<br/>Stop/Terminate/Snooze]
            LAMBDA_ACTION -->|AssumeRole| IAM_REM
        end

        subgraph "Data Layer"
            DDB_ACCOUNTS[DynamoDB: accounts<br/>tenant config, Slack tokens]
            DDB_ANOMALIES[DynamoDB: anomalies<br/>event log, status]
            DDB_BASELINE
        end

        subgraph "API & Onboarding"
            APIGW_REST[API Gateway: REST API]
            LAMBDA_API[Lambda: api-handlers]
            CF_TEMPLATE[S3: CloudFormation templates]
        end

        subgraph "Scheduled Jobs"
            EB_CRON[EventBridge Scheduler]
            EB_CRON --> LAMBDA_ZOMBIE[Lambda: zombie-hunter<br/>daily scan]
            EB_CRON --> LAMBDA_DIGEST[Lambda: daily-digest]
            EB_CRON --> LAMBDA_RECON
        end
    end

    LAMBDA_ZOMBIE -->|DescribeInstances, etc.| IAM_RO
    LAMBDA_ACTION -->|StopInstances, etc.| IAM_REM

    style CT fill:#ff9900,color:#000
    style EB fill:#ff9900,color:#000
    style CUR_S3 fill:#ff9900,color:#000
    style SLACK fill:#4a154b,color:#fff

Component Inventory

Component Responsibility AWS Service Justification
Event Ingestion Receive CloudTrail events in real-time, normalize to CostEvent schema EventBridge + SQS FIFO + Lambda EventBridge for cross-account event routing; SQS FIFO for ordered, deduplicated processing; Lambda for stateless event transformation
Anomaly Scorer Compare incoming CostEvents against baselines, flag anomalies Lambda Stateless scoring function. Sub-second execution. No persistent compute needed.
Baseline Store Per-account, per-service spending pattern storage DynamoDB Single-digit ms reads for hot-path scoring. On-demand capacity. Pay-per-request at low scale.
Anomaly Log Immutable record of all detected anomalies and their resolution status DynamoDB Queryable by account, time range, severity. TTL for automatic retention enforcement.
Account Registry Tenant configuration, Slack tokens, IAM role ARNs, preferences DynamoDB Low-volume, high-read. Single-table design with account_id partition key.
Notifier Format and deliver Slack Block Kit messages Lambda + Slack API Stateless. Slack rate limits handled via SQS backpressure.
Action Handler Process Slack interactive payloads (Stop, Terminate, Snooze) API Gateway + Lambda API Gateway receives Slack webhook POST, Lambda executes remediation via cross-account AssumeRole.
Zombie Hunter Daily scan for idle/orphaned resources across connected accounts Lambda (scheduled) EventBridge Scheduler triggers daily. Scans EC2, EBS, EIP, ELB via DescribeInstances/Volumes/Addresses.
Daily Digest Compile and send daily spend summary + anomaly recap Lambda (scheduled) Aggregates DynamoDB anomaly data, formats Slack digest.
CUR Reconciler Process CUR data for ground-truth billing validation (V2) S3 + Athena + Lambda Athena for serverless SQL over CUR Parquet files. Lambda orchestrates daily query + baseline update.
REST API Account onboarding, anomaly queries, configuration API Gateway + Lambda Standard REST. API Gateway handles auth (Cognito JWT or API key).
CloudFormation Templates Customer onboarding IAM role provisioning S3-hosted CF templates One-click deploy. Pre-signed URL from onboarding flow.

Technology Choices

Decision Choice Alternatives Considered Rationale
Compute AWS Lambda ECS Fargate, EC2 Lambda is the only sane choice for a solo founder. Zero ops. Pay-per-invocation. CloudTrail events are bursty — Lambda scales to zero between bursts. At 100 accounts, we're looking at ~50K-500K events/day — well within Lambda concurrency limits. ECS only makes sense at 1000+ accounts when Lambda cold starts or 15-min timeout become constraints.
Event Bus EventBridge SNS, Kinesis Data Streams EventBridge supports cross-account event routing natively (critical for receiving customer CloudTrail events). Content-based filtering reduces Lambda invocations to only cost-relevant events. SNS lacks filtering granularity. Kinesis is overkill at V1 scale and adds shard management overhead.
Queue SQS FIFO SQS Standard, Kinesis FIFO provides exactly-once processing and message deduplication (CloudTrail can emit duplicate events). Message group ID = account_id ensures per-account ordering. Standard SQS risks duplicate processing and out-of-order anomaly scoring.
Database DynamoDB PostgreSQL (RDS/Aurora), TimescaleDB DynamoDB eliminates all database ops. No patching, no connection pooling, no vacuum. Single-table design handles accounts, baselines, and anomalies. On-demand pricing means $0 at zero traffic. PostgreSQL would be better for complex queries (V2 dashboard) but adds operational burden a solo founder can't afford. Migrate to Aurora when dashboard launches.
CUR Processing S3 + Athena Redshift, BigQuery, PostgreSQL Athena is serverless SQL over S3. Zero infrastructure. Pay per query (~$5/TB scanned). CUR data is already in Parquet on S3. Redshift is overkill and expensive. This is a daily batch job, not real-time analytics.
API Layer API Gateway (REST) ALB + ECS, AppSync API Gateway + Lambda is the standard serverless REST pattern. No servers to manage. Built-in throttling, API keys, and Cognito integration. AppSync (GraphQL) is unnecessary complexity for V1's simple CRUD operations.
Auth Cognito User Pools Auth0, Clerk, custom JWT Cognito is free for <50K MAU. Native API Gateway integration. Supports GitHub/Google federation via OIDC. Auth0/Clerk are better products but add vendor dependency and cost. At $19/account/month, every dollar of infrastructure cost matters.
IaC AWS CDK (TypeScript) Terraform, SAM, CloudFormation raw CDK generates CloudFormation but with TypeScript type safety and constructs. Same language as Lambda handlers (TypeScript). SAM is too limited for cross-account EventBridge patterns. Terraform adds a state management burden.
Language TypeScript (Node.js 20) Python, Go, Rust TypeScript for Lambda cold start performance (faster than Python), type safety across the stack (CDK + Lambda + API), and npm ecosystem for Slack SDK. Go would be faster but slower to iterate for a solo founder. Rust is overkill.

Two-Layer Architecture: Speed + Accuracy

The core architectural insight is that no single data source provides both real-time speed and billing accuracy. dd0c/cost resolves this with two complementary layers:

                 Layer 1: Event Stream                      Layer 2: CUR Reconciliation
Data Source      CloudTrail events via EventBridge          AWS Cost & Usage Report (CUR 2.0)
Latency          5-60 seconds from resource creation        12-24 hours (CUR delivery to S3)
Accuracy         ~85% (on-demand pricing estimate)          99%+ (includes RIs, SPs, Spot, credits)
Granularity      Individual API call (RunInstances,         Line-item billing with amortized costs
                 CreateDBInstance)
V1 Status        Core product                               Deferred to V2
Purpose          "ALERT: Someone just launched              "UPDATE: Confirmed 48hr cost was $1,175
                 4× p3.2xlarge"                             (not $1,411 — Savings Plan applied)"

Why both layers matter:

Layer 1 catches the fire in real-time. It uses on-demand pricing as the cost estimate because that's the worst-case scenario and the only price available at event time. If the customer has Reserved Instances or Savings Plans covering the resource, the actual cost is lower — but you'd rather over-alert on a covered resource than miss an uncovered one.

Layer 2 reconciles with ground truth. When CUR data arrives, dd0c updates the anomaly record with the actual billed amount. If Layer 1 estimated $1,411 but the actual cost was $1,175 (Savings Plan discount), the anomaly record is updated and the customer sees the correction. This builds trust over time — the system gets more accurate as it learns which resources are covered by commitments.

V1 ships Layer 1 only. Layer 2 is deferred to V2 because CUR setup requires additional customer configuration (enabling CUR export, S3 bucket policy for cross-account access) which adds onboarding friction. V1's goal is 5-minute onboarding. CUR adds 15-20 minutes of AWS Console clicking. Not worth it for launch.


2. CORE COMPONENTS

2.1 CloudTrail Ingestion Pipeline

The ingestion pipeline is the heart of dd0c/cost. It transforms raw CloudTrail events into normalized CostEvents in real-time.

Architecture

sequenceDiagram
    participant CT as Customer CloudTrail
    participant EB_C as Customer EventBridge
    participant EB_D as dd0c EventBridge
    participant SQS as SQS FIFO
    participant LP as Lambda: event-processor
    participant DDB as DynamoDB
    participant AS as Lambda: anomaly-scorer

    CT->>EB_C: CloudTrail event (e.g. RunInstances)
    EB_C->>EB_D: Cross-account event (EventBridge rule)
    EB_D->>SQS: Filtered event → FIFO queue
    SQS->>LP: Batch poll (up to 10 messages)
    LP->>LP: Normalize to CostEvent schema
    LP->>LP: Estimate hourly cost (pricing lookup)
    LP->>DDB: Write CostEvent
    LP->>AS: Invoke anomaly scorer (async)
    AS->>DDB: Read baseline for account+service
    AS->>AS: Z-score calculation
    AS-->>SQS: If anomaly → alert-queue

CloudTrail Events That Signal Cost Anomalies

Not all CloudTrail events are cost-relevant. dd0c filters at the EventBridge level to process only events that create, modify, or scale billable resources. This is critical — a busy AWS account generates 10,000-100,000+ CloudTrail events/day. We need to process <1% of them.

V1 Monitored Events (EC2 + RDS + Lambda):

Service CloudTrail Event Cost Signal Estimated Impact
EC2 RunInstances New instance(s) launched Instance type → hourly rate. p3.2xlarge = $3.06/hr, p4d.24xlarge = $32.77/hr
EC2 StartInstances Stopped instance restarted Same as RunInstances — billing resumes
EC2 ModifyInstanceAttribute (instanceType change) Instance resized Delta between old and new instance type hourly rate
EC2 CreateNatGateway NAT Gateway created $0.045/hr + $0.045/GB processed. Silent cost bomb.
EC2 AllocateAddress Elastic IP allocated $0.005/hr if unattached. Small but zombie indicator.
EC2 CreateVolume EBS volume created gp3: $0.08/GB-month. io2: up to $0.125/GB-month + $0.065/IOPS-month
EC2 RunScheduledInstances Scheduled instance launched Same pricing model as RunInstances
RDS CreateDBInstance New database instance db.r5.4xlarge = $2.016/hr. Multi-AZ doubles it.
RDS ModifyDBInstance (class change) Database resized Delta between old and new instance class
RDS RestoreDBInstanceFromDBSnapshot Database restored from snapshot Same as CreateDBInstance — new billable instance
RDS CreateDBCluster Aurora cluster created Writer + reader instances, per-instance pricing
Lambda CreateFunction20150331 / UpdateFunctionConfiguration20150331v2 Function created/config changed Memory × duration × invocations. Alert only if memory >1GB or timeout >60s (high-cost config)
Lambda PutProvisionedConcurrencyConfig Provisioned concurrency set $0.0000041667/GB-second provisioned. Can be expensive at scale.
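As a rough worked example of the "memory × duration × invocations" formula from the Lambda row above (using the published $0.0000166667/GB-second x86 compute rate; per-request charges and the free tier are omitted, and `lambdaMonthlyCost` is a hypothetical helper name):

```typescript
// Rough Lambda compute cost: memory (GB) × duration (s) × invocations × rate.
// $0.0000166667/GB-second is the published x86 on-demand compute rate;
// per-request charges and free tier are omitted for simplicity.
const LAMBDA_GB_SECOND = 0.0000166667;

function lambdaMonthlyCost(
  memoryMB: number,
  avgDurationMs: number,
  invocationsPerMonth: number,
): number {
  const gbSeconds = (memoryMB / 1024) * (avgDurationMs / 1000) * invocationsPerMonth;
  return gbSeconds * LAMBDA_GB_SECOND;
}

// A 3GB function running 10s per invocation, 100K times/month:
// 3 × 10 × 100,000 = 3,000,000 GB-seconds ≈ $50/month — the kind of
// high-memory config the alert rule above is designed to catch.
```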

V2 Expansion Targets:

Service Key Events Why Deferred
ECS/Fargate CreateService, UpdateService (desiredCount) Requires parsing task definition for resource allocation
SageMaker CreateEndpoint, CreateNotebookInstance, CreateTrainingJob High-value targets but lower customer prevalence in V1 beachhead
ElastiCache CreateCacheCluster, ModifyReplicationGroup Moderate cost, lower urgency
Redshift CreateCluster, ResizeCluster Enterprise service, outside V1 beachhead
EKS CreateNodegroup, UpdateNodegroupConfig Requires K8s-level cost attribution (complex)
OpenSearch CreateDomain, UpdateDomainConfig Moderate cost, lower urgency

EventBridge Rule (Customer-Side)

The CloudFormation template deploys this EventBridge rule in the customer's account:

{
  "source": ["aws.ec2", "aws.rds", "aws.lambda"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": [
      "ec2.amazonaws.com",
      "rds.amazonaws.com",
      "lambda.amazonaws.com"
    ],
    "eventName": [
      "RunInstances",
      "StartInstances",
      "ModifyInstanceAttribute",
      "CreateNatGateway",
      "AllocateAddress",
      "CreateVolume",
      "CreateDBInstance",
      "ModifyDBInstance",
      "RestoreDBInstanceFromDBSnapshot",
      "CreateDBCluster",
      "CreateFunction20150331",
      "UpdateFunctionConfiguration20150331v2",
      "PutProvisionedConcurrencyConfig"
    ]
  }
}

The rule target is dd0c's EventBridge bus in dd0c's AWS account (cross-account event bus policy). This means:

  • Customer CloudTrail events matching the filter are forwarded in real-time
  • Only cost-relevant events leave the customer's account (not all CloudTrail)
  • No agent or daemon runs in the customer's account
  • EventBridge cross-account delivery is near-instant (<5 seconds typical)
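EventBridge evaluates this pattern server-side, so none of this logic runs in dd0c code. Purely as an illustration of the matching semantics (a field matches when the event's value appears in the pattern's list of allowed values), a minimal TypeScript sketch — the interfaces are illustrative, not the EventBridge API:

```typescript
// Illustration only: EventBridge performs this filtering server-side.
// A field matches when the event's value appears in the pattern's allowed list.
interface EventPattern {
  source: string[];
  detailType: string[];
  detail: { eventSource: string[]; eventName: string[] };
}

interface IncomingEvent {
  source: string;
  detailType: string;
  detail: { eventSource: string; eventName: string };
}

function matchesPattern(event: IncomingEvent, pattern: EventPattern): boolean {
  return (
    pattern.source.includes(event.source) &&
    pattern.detailType.includes(event.detailType) &&
    pattern.detail.eventSource.includes(event.detail.eventSource) &&
    pattern.detail.eventName.includes(event.detail.eventName)
  );
}
```

This is why a busy account's 10,000+ daily events shrink to a handful: anything whose eventName isn't in the list never leaves the customer's account.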

Cost Estimation Engine

When a CostEvent arrives, dd0c estimates the hourly cost using a static pricing table:

// pricing/ec2-on-demand.ts
// Updated monthly from AWS Price List API (bulk JSON)
// Keyed by region + instance type
const EC2_PRICING: Record<string, Record<string, number>> = {
  "us-east-1": {
    "t3.micro": 0.0104,
    "t3.medium": 0.0416,
    "m5.xlarge": 0.192,
    "m5.2xlarge": 0.384,
    "c5.2xlarge": 0.34,
    "r5.4xlarge": 1.008,
    "p3.2xlarge": 3.06,
    "p3.8xlarge": 12.24,
    "p4d.24xlarge": 32.7726,
    "g5.xlarge": 1.006,
    "g5.12xlarge": 5.672,
    // ... full table from AWS Price List API
  },
  // ... all regions
};

interface CostEstimate {
  hourlyRate: number;        // On-demand $/hr
  dailyRate: number;         // hourlyRate × 24
  monthlyRate: number;       // hourlyRate × 730
  confidence: "on-demand";   // V1 always on-demand. V2 adds RI/SP awareness.
  disclaimer: string;        // "Estimated on-demand pricing. Actual cost may differ with RIs/Savings Plans."
}

Why static pricing tables, not real-time Price List API calls:

  • AWS Price List API is slow (~2-5 seconds for a single query). Unacceptable in the hot path.
  • Pricing changes infrequently (quarterly at most for most instance types).
  • A monthly cron job pulls the full Price List bulk JSON (~1.5GB), extracts EC2/RDS/Lambda pricing, and writes to a DynamoDB table or bundled JSON file in the Lambda deployment package.
  • At V1 scale (<100 accounts), the pricing table fits in Lambda memory (~50MB for all regions + services).
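With the table bundled in memory, the hot-path lookup reduces to a plain object access. A sketch, assuming the EC2_PRICING shape above (estimateCost is a hypothetical helper name; the 730-hour month matches the CostEstimate interface):

```typescript
// Hypothetical hot-path lookup against the bundled pricing table.
// Returns null when the region/instance type is missing so the caller can
// flag "unknown pricing" instead of alerting with a bad number.
const EC2_PRICING: Record<string, Record<string, number>> = {
  "us-east-1": { "t3.micro": 0.0104, "p3.2xlarge": 3.06 },
};

interface CostEstimate {
  hourlyRate: number;
  dailyRate: number;
  monthlyRate: number;
  confidence: "on-demand";
  disclaimer: string;
}

function estimateCost(
  region: string,
  instanceType: string,
  quantity: number,
): CostEstimate | null {
  const hourly = EC2_PRICING[region]?.[instanceType];
  if (hourly === undefined) return null;
  const hourlyRate = hourly * quantity;
  return {
    hourlyRate,
    dailyRate: hourlyRate * 24,
    monthlyRate: hourlyRate * 730, // 730 hours ≈ one month, per the interface above
    confidence: "on-demand",
    disclaimer: "Estimated on-demand pricing. Actual cost may differ with RIs/Savings Plans.",
  };
}
```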

Event Processor Lambda

// functions/event-processor/handler.ts
interface CloudTrailEvent {
  detail: {
    eventSource: string;
    eventName: string;
    awsRegion: string;
    userIdentity: {
      type: string;
      arn: string;
      userName?: string;
      sessionContext?: {
        sessionIssuer?: { userName: string };
      };
    };
    requestParameters: Record<string, any>;
    responseElements: Record<string, any>;
    eventTime: string;
    eventID: string;
  };
}

interface CostEvent {
  pk: string;              // ACCOUNT#<account_id>
  sk: string;              // EVENT#<timestamp>#<event_id>
  accountId: string;
  service: "ec2" | "rds" | "lambda";
  action: string;          // "RunInstances", "CreateDBInstance", etc.
  resourceType: string;    // "EC2 Instance", "RDS Instance", etc.
  resourceId: string;      // i-xxx, db-xxx
  resourceSpec: string;    // "p3.2xlarge", "db.r5.4xlarge"
  region: string;
  actor: string;           // IAM user/role that performed the action
  actorArn: string;
  quantity: number;         // Number of instances (RunInstances can launch multiple)
  estimatedHourlyCost: number;
  estimatedDailyCost: number;
  estimatedMonthlyCost: number;
  eventTime: string;       // ISO 8601
  cloudTrailEventId: string;
  rawEvent: object;        // Original CloudTrail event (stored for debugging, TTL'd)
  ttl: number;             // DynamoDB TTL — 90 days
}

The processor extracts the actor (who did it), the resource spec (what they created), estimates the cost, and writes a normalized CostEvent to DynamoDB. This normalization is critical — downstream components (anomaly scorer, notifier, API) never touch raw CloudTrail.

Actor extraction logic:

  • userIdentity.type === "IAMUser"userIdentity.userName (e.g., "sam@company.com")
  • userIdentity.type === "AssumedRole"sessionContext.sessionIssuer.userName (e.g., "terraform-deploy-role")
  • userIdentity.type === "Root" → "Root account" (flag as high-severity regardless of cost)
  • For assumed roles, dd0c also extracts the sourceIdentity or principalId to trace back to the human behind the role when possible.
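The rules above can be sketched as a small function (extractActor is a hypothetical name; the sourceIdentity/principalId fallback for assumed roles is omitted for brevity):

```typescript
// Sketch of actor extraction following the rules above.
interface UserIdentity {
  type: string;
  arn: string;
  userName?: string;
  sessionContext?: { sessionIssuer?: { userName: string } };
}

function extractActor(identity: UserIdentity): { actor: string; highSeverity: boolean } {
  switch (identity.type) {
    case "IAMUser":
      return { actor: identity.userName ?? identity.arn, highSeverity: false };
    case "AssumedRole":
      // Fall back to the ARN when the session issuer is absent.
      return {
        actor: identity.sessionContext?.sessionIssuer?.userName ?? identity.arn,
        highSeverity: false,
      };
    case "Root":
      // Root actions are flagged high-severity regardless of cost.
      return { actor: "Root account", highSeverity: true };
    default:
      return { actor: identity.arn, highSeverity: false };
  }
}
```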

2.2 Anomaly Detection Engine

V1 uses simple statistical heuristics. No ML. No neural networks. Just math that a solo founder can debug at 2 AM.

Baseline Learning

For each (account_id, service, resource_type) tuple, dd0c maintains a rolling baseline:

interface Baseline {
  pk: string;                    // BASELINE#<account_id>
  sk: string;                    // <service>#<resource_type>  e.g., "ec2#instance"
  accountId: string;
  service: string;
  resourceType: string;
  // Rolling statistics (updated on every CostEvent)
  mean: number;                  // Mean hourly cost for this service
  stddev: number;                // Standard deviation
  sampleCount: number;           // Number of events in the window
  maxObserved: number;           // Highest single-event cost ever seen
  // Time-windowed (last 30 days)
  windowStart: string;           // ISO 8601
  windowEvents: number[];        // Array of hourly costs (last 30 days, compacted daily)
  // Learned patterns
  expectedInstanceTypes: string[]; // Instance types seen >3 times (e.g., ["t3.medium", "m5.xlarge"])
  expectedActors: string[];       // IAM users/roles that regularly create resources
  // User overrides
  sensitivityOverride?: "low" | "medium" | "high";
  suppressedResourceTypes?: string[];
  updatedAt: string;
}

Cold start problem: A new account has no baseline. For the first 14 days, dd0c uses absolute thresholds instead of statistical baselines:

Severity Threshold Example
INFO Any new resource >$0.50/hr t3.large ($0.0832/hr) — no alert. m5.2xlarge ($0.384/hr) — no alert. r5.4xlarge ($1.008/hr) — INFO.
WARNING Any new resource >$5.00/hr p3.2xlarge ($3.06/hr) — INFO. p3.8xlarge ($12.24/hr) — WARNING.
CRITICAL Any new resource >$25.00/hr p4d.24xlarge ($32.77/hr) — CRITICAL.
CRITICAL Any root account action creating billable resources Always CRITICAL regardless of cost.

After 14 days with ≥20 events, the system transitions to statistical scoring.
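The cold-start thresholds above reduce to a few comparisons. A sketch using the table's dollar cut-offs (coldStartSeverity is a hypothetical name):

```typescript
// Absolute-threshold severity used during the first 14 days / <20 events,
// before statistical baselines take over.
type Severity = "none" | "info" | "warning" | "critical";

function coldStartSeverity(hourlyCost: number, isRootActor: boolean): Severity {
  if (isRootActor) return "critical"; // Root actions are always CRITICAL
  if (hourlyCost > 25.0) return "critical";
  if (hourlyCost > 5.0) return "warning";
  if (hourlyCost > 0.5) return "info";
  return "none";
}
```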

Anomaly Scoring Algorithm

function scoreAnomaly(event: CostEvent, baseline: Baseline): AnomalyScore {
  const scores: number[] = [];

  // Signal 1: Z-score against baseline mean
  if (baseline.sampleCount >= 20) {
    const zScore = (event.estimatedHourlyCost - baseline.mean) / Math.max(baseline.stddev, 0.01);
    scores.push(zScore);
  }

  // Signal 2: Instance type novelty
  // New instance type never seen before in this account = suspicious
  if (!baseline.expectedInstanceTypes.includes(event.resourceSpec)) {
    scores.push(3.0); // Equivalent to 3-sigma event
  }

  // Signal 3: Actor novelty
  // New actor creating expensive resources = suspicious
  if (!baseline.expectedActors.includes(event.actor)) {
    scores.push(2.0); // Moderate suspicion
  }

  // Signal 4: Absolute cost threshold
  // Regardless of baseline, very expensive resources always flag
  if (event.estimatedHourlyCost > 10.0) scores.push(4.0);
  if (event.estimatedHourlyCost > 25.0) scores.push(6.0);

  // Signal 5: Quantity anomaly
  // Launching 10 instances at once when baseline is 1-2
  if (event.quantity > 3) scores.push(2.5);

  // Signal 6: Time-of-day anomaly
  // Resource creation at 2 AM local time = suspicious
  const hour = new Date(event.eventTime).getUTCHours();
  // TODO: Convert to account's local timezone
  if (hour >= 0 && hour <= 5) scores.push(1.5);

  // Composite score: weighted average of all signals
  const compositeScore = scores.length > 0
    ? scores.reduce((a, b) => a + b, 0) / scores.length
    : 0;

  // Multiple signals compound confidence
  const confidenceMultiplier = Math.min(scores.length / 3, 2.0);
  const finalScore = compositeScore * confidenceMultiplier;

  return {
    score: finalScore,
    severity: classifySeverity(finalScore),
    signals: scores.length,
    breakdown: { /* individual signal details for alert context */ },
  };
}

function classifySeverity(score: number): "none" | "info" | "warning" | "critical" {
  if (score < 1.5) return "none";      // Below threshold — no alert
  if (score < 3.0) return "info";       // Mild anomaly — daily digest only
  if (score < 5.0) return "warning";    // Significant — immediate Slack alert
  return "critical";                     // Severe — immediate Slack alert + @channel mention
}

Why composite scoring matters: A single signal (e.g., high Z-score) might be a false positive. But high Z-score + novel instance type + novel actor + off-hours = almost certainly a real anomaly. The composite approach dramatically reduces false positives while maintaining sensitivity to genuine threats.

Sensitivity tuning: Users can override per-service sensitivity via Slack command or API:

  • LOW: Only CRITICAL alerts (>$25/hr or composite score >5.0)
  • MEDIUM (default): WARNING + CRITICAL
  • HIGH: INFO + WARNING + CRITICAL (noisy, for accounts that want maximum visibility)
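A sketch of how the sensitivity setting gates immediate alerts (shouldAlert and ALERT_FLOOR are hypothetical names; at the default, INFO still lands in the daily digest rather than alerting immediately):

```typescript
type AlertSeverity = "info" | "warning" | "critical";
type Sensitivity = "low" | "medium" | "high";

// Severities that trigger an immediate Slack alert at each sensitivity level.
const ALERT_FLOOR: Record<Sensitivity, AlertSeverity[]> = {
  low: ["critical"],
  medium: ["warning", "critical"],
  high: ["info", "warning", "critical"],
};

function shouldAlert(severity: AlertSeverity, sensitivity: Sensitivity): boolean {
  return ALERT_FLOOR[sensitivity].includes(severity);
}
```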

Feedback Loop: "Mark as Expected"

When a user clicks [Mark as Expected] on a Slack alert:

  1. The anomaly record is updated with status: "expected"
  2. The resource spec and actor are added to the baseline's expectedInstanceTypes and expectedActors
  3. Future events matching this pattern score lower
  4. After 3 "Mark as Expected" clicks for the same pattern, the pattern is auto-suppressed with a notification: "We've auto-suppressed alerts for m5.2xlarge launches by terraform-deploy-role. You can re-enable in settings."

This is the primary mechanism for reducing false positives over time. The system learns what's normal for each account.
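Steps 2-4 can be sketched in memory as follows; expectedCounts and suppressedPatterns are hypothetical fields standing in for however the click counter is actually persisted on the baseline record:

```typescript
// In-memory sketch of the "Mark as Expected" feedback loop.
interface BaselinePatterns {
  expectedInstanceTypes: string[];
  expectedActors: string[];
  expectedCounts: Record<string, number>; // hypothetical per-pattern click counter
  suppressedPatterns: string[];           // hypothetical auto-suppressed pattern keys
}

// Learn the pattern; after the third click on the same (spec, actor) pair,
// auto-suppress it (and let the caller send the notification message).
function markAsExpected(
  b: BaselinePatterns,
  resourceSpec: string,
  actor: string,
): { autoSuppressed: boolean } {
  if (!b.expectedInstanceTypes.includes(resourceSpec)) b.expectedInstanceTypes.push(resourceSpec);
  if (!b.expectedActors.includes(actor)) b.expectedActors.push(actor);
  const key = `${resourceSpec}#${actor}`;
  b.expectedCounts[key] = (b.expectedCounts[key] ?? 0) + 1;
  if (b.expectedCounts[key] >= 3 && !b.suppressedPatterns.includes(key)) {
    b.suppressedPatterns.push(key);
    return { autoSuppressed: true };
  }
  return { autoSuppressed: false };
}
```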

2.3 CUR Reconciliation (V2)

Deferred to V2 but architecturally planned now to avoid rework.

How It Works

  1. Customer enables CUR 2.0 export to an S3 bucket in their account (or dd0c's bucket via cross-account policy)
  2. CUR data arrives as Parquet files, typically within 12-24 hours of usage
  3. dd0c's daily reconciler Lambda:
    • Queries CUR via Athena for the previous day's line items
    • Matches CUR line items to Layer 1 CostEvents by resource ID + timestamp
    • Updates CostEvent records with actual billed amounts
    • Adjusts baselines with ground-truth cost data (replacing on-demand estimates)
    • If Layer 1 estimated $1,411 but actual was $1,175 (Savings Plan), updates the anomaly record

CUR Athena Query Pattern

-- Daily reconciliation: get actual costs for resources flagged by Layer 1
SELECT
  line_item_resource_id,
  line_item_usage_start_date,
  line_item_unblended_cost,
  line_item_blended_cost,
  savings_plan_savings_plan_effective_cost,
  reservation_effective_cost,
  pricing_term,  -- "OnDemand", "Reserved", "Spot"
  product_instance_type,
  line_item_usage_account_id
FROM cur_database.cur_table
WHERE line_item_usage_start_date >= DATE_ADD('day', -1, CURRENT_DATE)
  AND line_item_resource_id IN (
    -- Resource IDs from Layer 1 anomalies in the last 48 hours
    'i-0abc123def456', 'i-0xyz789ghi012'
  )
  AND line_item_line_item_type IN ('Usage', 'SavingsPlanCoveredUsage', 'DiscountedUsage')
ORDER BY line_item_usage_start_date;

Value of Reconciliation

  • Accuracy correction: Layer 1 estimates assume on-demand pricing. CUR reveals actual cost after RI/SP/Spot discounts. Over time, this trains the baseline to use realistic costs, not worst-case estimates.
  • False positive reduction: If an account has 80% Savings Plan coverage, Layer 1 will over-estimate costs by ~40%. CUR reconciliation corrects this, reducing future false positives for that account.
  • Billing validation: CUR is the source of truth for AWS billing. Customers who need accurate cost reporting (Jordan persona) require this layer.
  • Zombie cost validation: Layer 1 detects resource creation. CUR confirms ongoing cost. A resource that was created but immediately covered by a Reserved Instance isn't actually costing extra — CUR reveals this.
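The correction step itself is simple once a CUR line item is matched to a Layer 1 record. A sketch (applyReconciliation and the AnomalyRecord fields are illustrative, not the actual schema):

```typescript
// Replace the Layer 1 on-demand estimate with the CUR-billed amount and
// record the delta, so the customer sees the correction.
interface AnomalyRecord {
  resourceId: string;
  estimatedCost: number;   // Layer 1 on-demand estimate
  actualCost?: number;     // filled in by reconciliation
  correctionPct?: number;  // signed % difference vs. the estimate
}

function applyReconciliation(record: AnomalyRecord, curBilledCost: number): AnomalyRecord {
  return {
    ...record,
    actualCost: curBilledCost,
    correctionPct: ((curBilledCost - record.estimatedCost) / record.estimatedCost) * 100,
  };
}

// e.g. estimated $1,411, CUR says $1,175 → a correction of about -16.7%
```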

2.4 Remediation Engine

One-click remediation from Slack is the product's magic moment. The gap between "knowing" and "doing" is where money burns.

Remediation Actions (V1)

Action Slack Button AWS API Call Safety Guardrail
Stop Instance [Stop Instance] ec2:StopInstances Confirmation dialog: "Stop i-0abc123? This will halt the instance but preserve data. EBS volumes remain attached."
Terminate Instance [Terminate + Snapshot] ec2:CreateSnapshotec2:TerminateInstances Always creates EBS snapshot first. Confirmation dialog with instance details. 30-second undo window via Slack.
Snooze Alert [Snooze 1h/4h/24h] None (internal) Suppresses re-alerting for the specified duration. Anomaly remains in log.
Mark as Expected [Expected ✓] None (internal) Updates baseline. Adds pattern to expected list. See feedback loop above.

V2 Remediation (deferred):

  • Scale Down: ec2:ModifyInstanceAttribute to change instance type (requires stop/start)
  • Schedule Shutdown: EventBridge Scheduler rule to stop instance at specified time
  • Delete EBS Volume: ec2:DeleteVolume for unattached volumes
  • Release Elastic IP: ec2:ReleaseAddress
  • RDS Stop: rds:StopDBInstance (auto-restarts after 7 days — must warn user)

Safety Architecture

Remediation is the highest-risk feature. A bug that terminates a production database is an extinction-level event for dd0c.

Guardrails:

  1. Separate IAM role. Remediation uses dd0c-cost-remediate role, which is separate from the read-only dd0c-cost-readonly role. Customers opt-in to remediation by deploying an additional CloudFormation stack. The read-only role is deployed at onboarding; the remediation role is offered later, after trust is established.

  2. Explicit action scoping. The remediation IAM role allows ONLY:

    {
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:StopInstances",
            "ec2:TerminateInstances",
            "ec2:CreateSnapshot"
          ],
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "aws:ResourceTag/dd0c-remediation": "enabled"
            }
          }
        },
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeInstances",
            "ec2:DescribeVolumes"
          ],
          "Resource": "*"
        }
      ]
    }


    In V1, remediation only works on resources tagged with dd0c-remediation: enabled. This is a deliberate friction — customers must explicitly tag resources they're comfortable having dd0c act on. Production databases won't have this tag. The Describe actions live in a separate statement without the tag condition because EC2 Describe calls don't support resource-level conditions — attaching the condition to them would make the dry-run and snapshot lookups fail even on tagged resources.

  3. Confirmation dialogs. Every destructive action shows a Slack modal with resource details, estimated impact, and a confirm/cancel button. No single-click termination.

  4. Automatic EBS snapshots. Before any TerminateInstances call, dd0c creates snapshots of all attached EBS volumes. The snapshot IDs are included in the Slack confirmation message.

  5. Audit log. Every remediation action is logged to DynamoDB with: who clicked the button (Slack user ID), what action was taken, which resource, timestamp, and the result (success/failure). This is the customer's audit trail.

  6. Dry-run first. Before executing, dd0c calls the AWS API with DryRun: true to verify permissions and resource state. If the dry-run fails (e.g., instance already stopped, insufficient permissions), the user sees an error instead of a silent failure.

Slack Interactive Message Flow

sequenceDiagram
    participant S as Slack
    participant AG as API Gateway
    participant LH as Lambda: action-handler
    participant DDB as DynamoDB
    participant AWS_C as Customer AWS Account

    S->>AG: POST /slack/actions (interactive payload)
    AG->>LH: Invoke with payload
    LH->>LH: Verify Slack signature (HMAC-SHA256)
    LH->>DDB: Look up anomaly record + account config
    LH->>LH: Validate action is allowed for this resource
    LH->>AWS_C: sts:AssumeRole (dd0c-cost-remediate)
    LH->>AWS_C: ec2:StopInstances (DryRun=true)
    alt DryRun succeeds
        LH->>S: Open confirmation modal (Block Kit)
        S->>AG: User confirms
        AG->>LH: Execute action
        LH->>AWS_C: ec2:CreateSnapshot (if terminate)
        LH->>AWS_C: ec2:StopInstances / TerminateInstances
        LH->>DDB: Log remediation action
        LH->>S: Update original message: "✅ Stopped by @sam at 2:34 PM"
    else DryRun fails
        LH->>S: Error message: "Can't stop this instance: [reason]"
    end
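The signature verification in the first step can be sketched with Node's built-in crypto module, following Slack's documented v0 signing scheme: HMAC-SHA256 over "v0:<timestamp>:<body>" keyed with the app's signing secret, a constant-time comparison, and stale-timestamp rejection to block replays:

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Verify Slack's X-Slack-Signature header per the v0 signing scheme.
// rawBody must be the unparsed request body exactly as received.
function verifySlackSignature(
  signingSecret: string,
  timestamp: string,   // X-Slack-Request-Timestamp header
  rawBody: string,
  signature: string,   // e.g. "v0=ab12..."
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  // Reject requests older than 5 minutes (replay protection).
  if (Math.abs(nowSeconds - Number(timestamp)) > 60 * 5) return false;
  const expected =
    "v0=" +
    createHmac("sha256", signingSecret)
      .update(`v0:${timestamp}:${rawBody}`)
      .digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```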

2.5 Notification Service

Slack Alert Format (Block Kit)

Anomaly alerts use Slack Block Kit for rich, actionable messages:

🔴 CRITICAL: Expensive Resource Detected

*4× p3.2xlarge instances* launched in us-east-1
├─ Estimated cost: *$12.24/hr* ($293.76/day)
├─ Who: sam@company.com (IAM User)
├─ When: Today at 11:02 AM UTC
├─ Account: 123456789012 (production)
└─ Why this alert: New instance type never seen in this account.
   Cost is 8.2× your average EC2 hourly spend.

[Stop Instances]  [Terminate + Snapshot]  [Snooze 4h]  [Expected ✓]

ℹ️ Cost is estimated using on-demand pricing.
   Actual cost may be lower with Reserved Instances or Savings Plans.

Alert Severity → Notification Behavior

Severity Slack Behavior Digest Inclusion
INFO No immediate alert. Included in daily digest only.
WARNING Immediate Slack message to configured channel. No @mention.
CRITICAL Immediate Slack message with <!channel> mention.

Daily Digest

Sent at 9:00 AM in the customer's configured timezone (default: UTC). Compiled by the daily-digest Lambda:

📊 dd0c Daily Digest — Feb 28, 2026

*Yesterday's Spend Estimate:* $487.22 (+12% vs. 7-day avg)

*Anomalies Detected:* 3
├─ 🔴 4× p3.2xlarge (sam@company.com) — $12.24/hr — RESOLVED ✅
├─ 🟡 New NAT Gateway in us-west-2 — $1.08/hr — OPEN
└─ 🔵 Lambda memory increased to 3GB (deploy-role) — $0.12/hr — Expected ✓

*Zombie Watch:* 🧟
├─ i-0abc123 (t3.medium, us-east-1) — idle 6 days — $0.0416/hr ($5.99 wasted)
├─ vol-0def456 (100GB gp3, unattached) — 14 days — $8.00/month
└─ eipalloc-0ghi789 (unattached) — 31 days — $3.60/month

*End-of-Month Forecast:* $14,230 (vs. $12,100 last month, +17.6%)

[View Details]  [Adjust Sensitivity]

Slack Rate Limiting

Slack's rate limits: ~1 message/second per channel, 20K messages/day per workspace. At V1 scale (<100 accounts), this is not a concern. The SQS alert queue provides natural backpressure — if Slack returns 429, the Lambda retries with exponential backoff via SQS visibility timeout.
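A sketch of that backoff, assuming the notifier controls redelivery by extending the message's visibility timeout (the function name and cap are illustrative):

```typescript
// Compute how long a throttled Slack message should stay invisible in SQS
// before redelivery. Honors Slack's Retry-After header when present;
// otherwise exponential backoff capped at 5 minutes.
function nextVisibilityTimeout(receiveCount: number, retryAfterSec?: number): number {
  if (retryAfterSec !== undefined) return retryAfterSec;
  return Math.min(2 ** receiveCount * 5, 300); // 10s, 20s, 40s, ... ≤ 300s
}
```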

Weekly Digest (V2)

Deferred. Will include: week-over-week spend comparison, top anomalies, remediation summary, savings achieved, and a "dd0c saved you $X this week" callout for the cross-sell/retention narrative.


3. DATA ARCHITECTURE

3.1 Event Schema

All data flows through a normalized CostEvent schema. Raw CloudTrail events are transformed at ingestion and never exposed to downstream components.

CostEvent (Primary Entity)

// The atomic unit of data in dd0c/cost
interface CostEvent {
  // DynamoDB keys
  pk: string;              // "ACCOUNT#<account_id>"
  sk: string;              // "EVENT#<iso_timestamp>#<event_id>"
  
  // GSI1: Query by service + time
  gsi1pk: string;          // "ACCOUNT#<account_id>#SERVICE#<service>"
  gsi1sk: string;          // "<iso_timestamp>"
  
  // GSI2: Query by actor
  gsi2pk: string;          // "ACCOUNT#<account_id>#ACTOR#<actor_hash>"
  gsi2sk: string;          // "<iso_timestamp>"

  // Core fields
  accountId: string;       // Customer AWS account ID
  tenantId: string;        // dd0c tenant ID (maps to billing entity)
  service: string;         // "ec2" | "rds" | "lambda"
  action: string;          // CloudTrail eventName
  resourceType: string;    // "instance" | "nat-gateway" | "db-instance" | "volume" | ...
  resourceId: string;      // AWS resource ID (i-xxx, db-xxx, vol-xxx)
  resourceSpec: string;    // Instance type / config (p3.2xlarge, db.r5.4xlarge)
  region: string;          // AWS region
  
  // Attribution
  actor: string;           // Human-readable: "sam@company.com" or "terraform-deploy-role"
  actorArn: string;        // Full IAM ARN
  actorType: string;       // "IAMUser" | "AssumedRole" | "Root"
  
  // Cost estimation (Layer 1)
  quantity: number;
  estimatedHourlyCost: number;
  estimatedDailyCost: number;
  estimatedMonthlyCost: number;
  pricingBasis: "on-demand";  // V1 always on-demand
  
  // Cost reconciliation (Layer 2 — V2, nullable in V1)
  actualHourlyCost?: number;
  actualPricingTerm?: "OnDemand" | "Reserved" | "SavingsPlan" | "Spot";
  reconciled: boolean;     // false until CUR reconciliation runs
  reconciledAt?: string;
  
  // Anomaly scoring
  anomalyScore: number;
  anomalySeverity: "none" | "info" | "warning" | "critical";
  anomalySignals: number;  // How many scoring signals fired
  
  // Status tracking
  status: "open" | "resolved" | "expected" | "snoozed";
  resolvedAction?: string; // "stopped" | "terminated" | "snoozed" | "marked-expected"
  resolvedBy?: string;     // Slack user ID who took action
  resolvedAt?: string;
  
  // Metadata
  cloudTrailEventId: string;
  eventTime: string;       // Original CloudTrail event time
  ingestedAt: string;      // When dd0c processed it
  ttl: number;             // DynamoDB TTL epoch — 90 days from ingestedAt
}
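A sketch of how the key fields above might be derived from the core fields — the actor hash keeps raw ARNs out of GSI2 partition keys (the djb2 hash here is illustrative, not the live implementation):

```typescript
// Derive DynamoDB keys for a CostEvent from its core fields.
function deriveKeys(e: {
  accountId: string;
  service: string;
  actorArn: string;
  eventTime: string;
  cloudTrailEventId: string;
}) {
  // djb2 string hash — stands in for whatever real hash the actor key uses.
  const actorHash = e.actorArn
    .split("")
    .reduce((h, c) => (h * 33 + c.charCodeAt(0)) >>> 0, 5381)
    .toString(16);
  return {
    pk: `ACCOUNT#${e.accountId}`,
    sk: `EVENT#${e.eventTime}#${e.cloudTrailEventId}`,
    gsi1pk: `ACCOUNT#${e.accountId}#SERVICE#${e.service}`,
    gsi1sk: e.eventTime,
    gsi2pk: `ACCOUNT#${e.accountId}#ACTOR#${actorHash}`,
    gsi2sk: e.eventTime,
  };
}
```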

Anomaly Record (Derived from CostEvent)

When a CostEvent scores above the alert threshold, an Anomaly record is created:

interface AnomalyRecord {
  pk: string;              // "ANOMALY#<account_id>"
  sk: string;              // "<iso_timestamp>#<anomaly_id>"
  
  // GSI: Query open anomalies
  gsi3pk: string;          // "ANOMALY#<account_id>#STATUS#<status>"
  gsi3sk: string;          // "<iso_timestamp>"
  
  anomalyId: string;       // ULID
  accountId: string;
  tenantId: string;
  
  // Anomaly details
  severity: "info" | "warning" | "critical";
  score: number;
  signalBreakdown: {
    zScore?: number;
    instanceTypeNovelty?: boolean;
    actorNovelty?: boolean;
    absoluteCostThreshold?: boolean;
    quantityAnomaly?: boolean;
    timeOfDayAnomaly?: boolean;
  };
  
  // The triggering event(s)
  triggerEventIds: string[];  // One or more CostEvent IDs
  
  // Human-readable summary (pre-computed for Slack)
  title: string;           // "4× p3.2xlarge launched in us-east-1"
  description: string;     // "sam@company.com launched 4 GPU instances..."
  estimatedImpact: string; // "$12.24/hr ($293.76/day)"
  
  // Notification tracking
  slackMessageTs?: string; // Slack message timestamp (for updating)
  slackChannelId?: string;
  notifiedAt?: string;
  
  // Resolution
  status: "open" | "resolved" | "expected" | "snoozed";
  snoozeUntil?: string;
  resolvedAction?: string;
  resolvedBy?: string;
  resolvedAt?: string;
  
  // Remediation audit
  remediationLog: RemediationEntry[];
  
  ttl: number;             // 90 days
}

interface RemediationEntry {
  action: string;          // "stop" | "terminate" | "snapshot" | "snooze"
  executedBy: string;      // Slack user ID
  executedAt: string;
  targetResourceId: string;
  result: "success" | "failure";
  errorMessage?: string;
  snapshotId?: string;     // If EBS snapshot was created
  dryRunPassed: boolean;
}

3.2 Baseline / Threshold Storage

interface Baseline {
  pk: string;              // "BASELINE#<account_id>"
  sk: string;              // "<service>#<resource_type>"
  
  accountId: string;
  service: string;
  resourceType: string;
  
  // Rolling statistics
  mean: number;
  stddev: number;
  sampleCount: number;
  maxObserved: number;
  minObserved: number;
  p95: number;             // 95th percentile hourly cost
  
  // Time-windowed data (30-day rolling)
  windowDays: number;      // Default 30
  dailyAggregates: {       // Last 30 days, one entry per day
    date: string;          // "2026-02-28"
    totalCost: number;
    eventCount: number;
    maxSingleEvent: number;
  }[];
  
  // Learned patterns
  expectedInstanceTypes: string[];
  expectedActors: string[];
  typicalHourRange: [number, number]; // e.g., [8, 18] — resources usually created 8am-6pm
  
  // User configuration
  sensitivityOverride?: "low" | "medium" | "high";
  suppressedPatterns: {
    resourceSpec?: string;
    actor?: string;
    suppressedAt: string;
    suppressedBy: string;  // Slack user ID
    reason: string;        // "Marked as expected 3 times"
  }[];
  
  // State
  maturityState: "cold-start" | "learning" | "mature";
  // cold-start: <14 days or <20 events → absolute thresholds
  // learning: 14-30 days → statistical + absolute hybrid
  // mature: >30 days and >50 events → full statistical scoring
  
  updatedAt: string;
  createdAt: string;
}

Baseline update strategy: Baselines are updated on every CostEvent ingestion. The update is an atomic DynamoDB UpdateItem with ADD and SET expressions — no read-modify-write race conditions. Daily aggregates are compacted by a nightly Lambda that rolls individual events into daily summaries and trims the window to 30 days.
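Since ADD cannot maintain `mean` and `stddev` directly, one way to keep the update atomic is to accumulate count, sum, and sum-of-squares on write and derive the statistics at read time — a sketch under that assumption (attribute names are illustrative, not the live schema):

```typescript
// Writer side: one atomic UpdateItem — no read-modify-write.
const baselineUpdate = (cost: number) => ({
  UpdateExpression:
    "ADD eventCount :one, costSum :c, costSumSq :c2 SET updatedAt = :now",
  ExpressionAttributeValues: {
    ":one": 1,
    ":c": cost,
    ":c2": cost * cost,
    ":now": new Date().toISOString(),
  },
});

// Reader side: derive the rolling statistics from the accumulators.
function deriveStats(count: number, sum: number, sumSq: number) {
  const mean = sum / count;
  const variance = Math.max(sumSq / count - mean * mean, 0); // clamp FP noise
  return { mean, stddev: Math.sqrt(variance) };
}
```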

3.3 DynamoDB Single-Table Design

All entities live in one DynamoDB table (dd0c-cost-main) with a single-table design:

┌─────────────────────────────────┬──────────────────────────────────────────┐
│ PK                              │ SK                                       │
├─────────────────────────────────┼──────────────────────────────────────────┤
│ TENANT#<tenant_id>              │ METADATA                                 │  → Tenant config
│ TENANT#<tenant_id>              │ ACCOUNT#<account_id>                     │  → Account registration
│ ACCOUNT#<account_id>            │ EVENT#<timestamp>#<event_id>             │  → CostEvent
│ ACCOUNT#<account_id>            │ CONFIG                                   │  → Account settings
│ BASELINE#<account_id>           │ <service>#<resource_type>                │  → Baseline
│ ANOMALY#<account_id>            │ <timestamp>#<anomaly_id>                 │  → Anomaly record
│ SLACK#<workspace_id>            │ INSTALL                                  │  → Slack OAuth tokens
│ SLACK#<workspace_id>            │ CHANNEL#<channel_id>                     │  → Channel config
└─────────────────────────────────┴──────────────────────────────────────────┘

GSI1 (Service queries):
  PK: ACCOUNT#<account_id>#SERVICE#<service>
  SK: <timestamp>

GSI2 (Actor queries):
  PK: ACCOUNT#<account_id>#ACTOR#<actor_hash>
  SK: <timestamp>

GSI3 (Open anomalies):
  PK: ANOMALY#<account_id>#STATUS#<status>
  SK: <timestamp>

GSI4 (Tenant lookups):
  PK: TENANT#<tenant_id>
  SK: (same as table SK)

Why single-table: At V1 scale, a single DynamoDB table with GSIs handles all access patterns. No cross-table joins needed. Simplifies IaC, monitoring, and backup. When the V2 dashboard requires complex queries (aggregations, time-series), we add Aurora PostgreSQL as a read replica — DynamoDB Streams → Lambda → Aurora for the dashboard's query layer. The real-time path stays on DynamoDB.

3.4 CUR Data Warehouse (V2)

┌─────────────────────────────────────────────────────────┐
│ S3: dd0c-cur-datalake                                   │
│                                                         │
│ s3://dd0c-cur-datalake/                                 │
│   ├── raw/                                              │
│   │   └── <account_id>/                                 │
│   │       └── year=2026/month=02/                       │
│   │           └── cur-00001.parquet                     │
│   ├── processed/                                        │
│   │   └── <account_id>/                                 │
│   │       └── year=2026/month=02/day=28/                │
│   │           └── daily-summary.parquet                 │
│   └── athena-results/                                   │
│       └── query-<id>/                                   │
│           └── results.csv                               │
│                                                         │
│ Athena Database: dd0c_cur                               │
│ ├── Table: raw_cur (partitioned by account_id, year,    │
│ │   month — Parquet, Snappy compression)                │
│ └── Table: daily_summary (materialized by reconciler)   │
│                                                         │
│ Glue Crawler: runs daily, updates partitions            │
└─────────────────────────────────────────────────────────┘

CUR ingestion options (customer choice):

  1. Cross-account S3 replication: Customer's CUR bucket replicates to dd0c's S3 bucket. Simplest but requires S3 replication rule setup.
  2. Cross-account Athena query: dd0c queries the customer's CUR bucket directly via cross-account S3 access. No data copy. More secure but slower (cross-account S3 reads).
  3. CUR export to dd0c bucket: Customer configures CUR 2.0 to export directly to dd0c's S3 bucket with a customer-specific prefix. Cleanest but requires CUR reconfiguration.

V2 will support option 1 (replication) as the default, with option 2 as the "security-conscious" alternative.

3.5 Multi-Tenant Data Isolation

dd0c/cost is a multi-tenant SaaS. Customer data isolation is non-negotiable.

Isolation model: Logical isolation with partition-key enforcement.

  • Every DynamoDB item includes accountId and tenantId in the partition key
  • All Lambda functions receive tenantId from the authenticated session (Cognito JWT claim)
  • DynamoDB queries are ALWAYS scoped to a partition key that includes the tenant's account ID — there is no "scan all accounts" operation exposed to any customer-facing code path
  • S3 CUR data is prefixed by account_id — S3 bucket policies enforce prefix-level access
  • Athena queries include WHERE line_item_usage_account_id = '<account_id>' — enforced at the query construction layer, not user input

Why not per-tenant tables/databases: At V1 scale (<100 tenants), per-tenant DynamoDB tables would mean 100+ tables to manage, monitor, and back up. Single-table with partition-key isolation is the standard pattern for DynamoDB multi-tenancy at this scale. If we hit 10,000+ tenants and need stronger isolation (e.g., for SOC 2 or a large enterprise customer), we can migrate specific tenants to dedicated tables — DynamoDB Streams makes this a non-disruptive migration.
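The scoping rule above can be enforced structurally: every query builder takes the account ID from the authenticated session and bakes it into the partition key, so an unscoped query is unrepresentable. A sketch (function and index names are illustrative):

```typescript
// Build a GSI3 query for one tenant's anomalies. The partition key always
// embeds the caller's account ID — there is no variant without it.
function anomaliesByStatus(
  accountId: string,
  status: "open" | "resolved" | "expected" | "snoozed",
) {
  if (!/^\d{12}$/.test(accountId)) throw new Error("invalid AWS account id");
  return {
    IndexName: "GSI3",
    KeyConditionExpression: "gsi3pk = :pk",
    ExpressionAttributeValues: { ":pk": `ANOMALY#${accountId}#STATUS#${status}` },
  };
}
```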

Cross-tenant data access (internal only):

  • The anomaly scoring engine reads baselines for a single account — never cross-account
  • The daily digest reads anomalies for a single account
  • The only cross-tenant operation is internal analytics (aggregate metrics across all tenants for product health monitoring). This runs on a separate IAM role with read-only access and is never exposed via API.

3.6 Retention Policies

| Data Type | Retention | Mechanism | Rationale |
|---|---|---|---|
| CostEvents | 90 days | DynamoDB TTL | Sufficient for baseline learning (30-day window) plus an investigation buffer. Older events are summarized in baselines. |
| Anomaly Records | 1 year | DynamoDB TTL | Customers need anomaly history for trend analysis and SOC 2 audit evidence. |
| Baselines | Indefinite (while account active) | No TTL | Baselines are the product's memory. Deleting them resets learning. Cleaned up on account disconnection. |
| Remediation Audit Log | 2 years | DynamoDB TTL | Compliance requirement. Customers need proof of who did what and when. |
| CUR Data (S3) | 13 months | S3 Lifecycle Policy | Matches AWS's own CUR retention. Enables year-over-year comparison in V3. |
| Slack OAuth Tokens | Indefinite (while connected) | Manual cleanup on disconnect | Required for ongoing Slack integration. Encrypted at rest (KMS). |
| Raw CloudTrail Events | 7 days | DynamoDB TTL on `rawEvent` field | Stored for debugging only. The normalized CostEvent is the source of truth. |
| Athena Query Results | 7 days | S3 Lifecycle Policy | Ephemeral. Re-queryable from source data. |

Data deletion on account disconnection: When a customer disconnects their AWS account or deletes their dd0c account:

  1. All CostEvents, Anomalies, and Baselines for that account are marked for deletion
  2. A background Lambda processes deletion in batches (DynamoDB BatchWriteItem, 25 items/batch)
  3. S3 CUR data for that account prefix is deleted via S3 Lifecycle rule (immediate expiration)
  4. Slack tokens are revoked and deleted
  5. Deletion is confirmed to the customer via email
  6. Timeline: complete within 72 hours (GDPR-compliant)
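Step 2's batching can be sketched as a plain chunker — BatchWriteItem rejects requests with more than 25 items, so collected keys are split before dispatch:

```typescript
// Split collected item keys into BatchWriteItem-sized groups (max 25).
function chunk<T>(items: T[], size = 25): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```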

4. INFRASTRUCTURE

4.1 AWS Architecture

All dd0c/cost infrastructure runs in a single AWS account (dd0c-platform) in us-east-1 (primary) with no multi-region in V1. The entire stack is serverless — zero EC2 instances, zero containers, zero servers to patch.

graph TB
    subgraph "dd0c-platform AWS Account (us-east-1)"
        subgraph "Ingestion"
            EB[EventBridge: dd0c-cost-bus<br/>Cross-account event target]
            SQS_I[SQS FIFO: event-ingestion<br/>MessageGroupId=accountId<br/>Dedup=cloudTrailEventId]
            L_PROC[Lambda: event-processor<br/>128MB, 30s timeout<br/>Batch size: 10]
        end

        subgraph "Scoring & Alerting"
            L_SCORE[Lambda: anomaly-scorer<br/>256MB, 10s timeout]
            SQS_A[SQS Standard: alert-queue<br/>DLQ after 3 retries]
            L_NOTIFY[Lambda: notifier<br/>128MB, 15s timeout]
        end

        subgraph "Remediation"
            APIGW_SLACK[API Gateway: /slack/actions<br/>POST only, Slack signature verification]
            L_ACTION[Lambda: action-handler<br/>256MB, 30s timeout]
        end

        subgraph "Scheduled"
            EBS[EventBridge Scheduler]
            L_ZOMBIE[Lambda: zombie-hunter<br/>512MB, 5min timeout<br/>Daily 06:00 UTC]
            L_DIGEST[Lambda: daily-digest<br/>256MB, 2min timeout<br/>Daily 09:00 UTC per TZ]
            L_PRICING[Lambda: pricing-updater<br/>1024MB, 5min timeout<br/>Weekly Sunday 00:00 UTC]
        end

        subgraph "API"
            APIGW_REST[API Gateway: REST API<br/>/v1/accounts, /v1/anomalies, etc.]
            L_API[Lambda: api-handlers<br/>256MB, 30s timeout]
            COGNITO[Cognito User Pool<br/>GitHub + Google OIDC]
        end

        subgraph "Data"
            DDB[DynamoDB: dd0c-cost-main<br/>On-demand capacity<br/>Point-in-time recovery: ON<br/>Encryption: AWS-managed KMS]
            S3_CF[S3: dd0c-cf-templates<br/>CloudFormation templates<br/>Public read]
            S3_CUR[S3: dd0c-cur-datalake<br/>V2 — CUR storage<br/>SSE-S3 encryption]
        end

        subgraph "Observability"
            CW[CloudWatch Logs + Metrics]
            CW_ALARM[CloudWatch Alarms<br/>Lambda errors, SQS DLQ depth,<br/>DDB throttles]
            SNS_OPS[SNS: ops-alerts<br/>→ Brian's phone]
        end
    end

    EB --> SQS_I --> L_PROC --> L_SCORE --> SQS_A --> L_NOTIFY
    EBS --> L_ZOMBIE
    EBS --> L_DIGEST
    EBS --> L_PRICING
    APIGW_SLACK --> L_ACTION
    APIGW_REST --> L_API
    L_API --> COGNITO
    L_PROC --> DDB
    L_SCORE --> DDB
    L_ACTION --> DDB
    L_ZOMBIE --> DDB
    L_DIGEST --> DDB
    CW_ALARM --> SNS_OPS

Lambda Function Inventory

| Function | Memory | Timeout | Trigger | Concurrency | Est. Invocations/Day (10 accounts) |
|---|---|---|---|---|---|
| event-processor | 128 MB | 30s | SQS FIFO (batch 10) | 5 reserved | 500-5,000 |
| anomaly-scorer | 256 MB | 10s | Async invoke from processor | 5 reserved | 500-5,000 |
| notifier | 128 MB | 15s | SQS Standard | 2 reserved | 10-50 (anomalies only) |
| action-handler | 256 MB | 30s | API Gateway | Unreserved | 5-20 (user-initiated) |
| zombie-hunter | 512 MB | 5 min | EventBridge Scheduler | 1 | 1 (daily) |
| daily-digest | 256 MB | 2 min | EventBridge Scheduler | 1 | 1 (daily) |
| pricing-updater | 1024 MB | 5 min | EventBridge Scheduler | 1 | 0.14 (weekly) |
| api-handlers | 256 MB | 30s | API Gateway | Unreserved | 50-500 |

Reserved concurrency on the ingestion path prevents a burst of CloudTrail events from consuming all Lambda concurrency in the account and starving the API/notification functions.

4.2 Customer-Side Infrastructure

The customer deploys a CloudFormation stack that creates:

Read-Only Stack (Required — deployed at onboarding)

# dd0c-cost-readonly.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'dd0c/cost — Read-only monitoring (CloudTrail events + resource describe)'

Parameters:
  Dd0cAccountId:
    Type: String
    Default: '111122223333'  # dd0c's AWS account ID
  ExternalId:
    Type: String
    Description: 'Unique external ID for cross-account role assumption'

Resources:
  # IAM Role for dd0c to read CloudTrail events and describe resources
  Dd0cCostReadOnlyRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-cost-readonly
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${Dd0cAccountId}:root'
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: !Ref ExternalId
      Policies:
        - PolicyName: dd0c-cost-readonly-policy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # Describe resources for zombie hunting and context
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeVolumes
                  - ec2:DescribeAddresses
                  - ec2:DescribeNatGateways
                  - ec2:DescribeSnapshots
                  - elasticloadbalancing:DescribeLoadBalancers
                  - elasticloadbalancing:DescribeTargetGroups
                  - rds:DescribeDBInstances
                  - rds:DescribeDBClusters
                  - lambda:ListFunctions
                  - lambda:GetFunction
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:GetMetricData
                  - ce:GetCostAndUsage
                  - tag:GetResources
                Resource: '*'
              # CloudWatch billing metrics for end-of-month forecast
              - Effect: Allow
                Action:
                  - cloudwatch:GetMetricStatistics
                Resource: '*'
                Condition:
                  StringEquals:
                    cloudwatch:namespace: 'AWS/Billing'

  # EventBridge rule to forward cost-relevant CloudTrail events
  Dd0cCostEventRule:
    Type: AWS::Events::Rule
    Properties:
      Name: dd0c-cost-forward
      Description: 'Forward cost-relevant CloudTrail events to dd0c'
      State: ENABLED
      EventPattern:
        source:
          - aws.ec2
          - aws.rds
          - aws.lambda
        detail-type:
          - 'AWS API Call via CloudTrail'
        detail:
          eventSource:
            - ec2.amazonaws.com
            - rds.amazonaws.com
            - lambda.amazonaws.com
          eventName:
            - RunInstances
            - StartInstances
            - ModifyInstanceAttribute
            - CreateNatGateway
            - AllocateAddress
            - CreateVolume
            - CreateDBInstance
            - ModifyDBInstance
            - RestoreDBInstanceFromDBSnapshot
            - CreateDBCluster
            - CreateFunction20150331
            - UpdateFunctionConfiguration20150331v2
            - PutProvisionedConcurrencyConfig
      Targets:
        - Id: dd0c-cost-bus
          Arn: !Sub 'arn:aws:events:${AWS::Region}:${Dd0cAccountId}:event-bus/dd0c-cost-bus'
          RoleArn: !GetAtt EventBridgeForwardRole.Arn

  # IAM Role for EventBridge to forward events cross-account
  EventBridgeForwardRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-cost-eventbridge-forward
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: forward-to-dd0c
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: events:PutEvents
                Resource: !Sub 'arn:aws:events:${AWS::Region}:${Dd0cAccountId}:event-bus/dd0c-cost-bus'

Outputs:
  RoleArn:
    Value: !GetAtt Dd0cCostReadOnlyRole.Arn
    Description: 'Provide this ARN to dd0c to complete setup'
  ExternalId:
    Value: !Ref ExternalId
    Description: 'External ID for secure cross-account access'

Remediation Stack (Optional — deployed after trust is established)

# dd0c-cost-remediate.yaml (separate stack, opt-in)
Resources:
  Dd0cCostRemediateRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-cost-remediate
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${Dd0cAccountId}:root'
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: !Ref ExternalId
      Policies:
        - PolicyName: dd0c-cost-remediate-policy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ec2:StopInstances
                  - ec2:TerminateInstances
                  - ec2:CreateSnapshot
                Resource: '*'
                Condition:
                  StringEquals:
                    'aws:ResourceTag/dd0c-remediation': 'enabled'

4.3 Cost Estimate

dd0c Platform Infrastructure Cost

| Component | 1 Account | 10 Accounts | 100 Accounts | Pricing Basis |
|---|---|---|---|---|
| Lambda (ingestion) | $0.02/mo | $0.15/mo | $1.50/mo | ~500-50K invocations/day, 128 MB, 200ms avg |
| Lambda (scoring) | $0.01/mo | $0.10/mo | $1.00/mo | Same invocation count, 256 MB, 50ms avg |
| Lambda (notifier) | $0.001/mo | $0.01/mo | $0.10/mo | 10-1,000 anomalies/day |
| Lambda (scheduled) | $0.01/mo | $0.01/mo | $0.05/mo | 3 daily + 1 weekly, fixed |
| Lambda (API) | $0.005/mo | $0.05/mo | $0.50/mo | 50-5,000 API calls/day |
| DynamoDB | $0.50/mo | $2.00/mo | $15.00/mo | On-demand: ~$1.25/million writes, ~$0.25/million reads. PITR adds ~25%. |
| SQS | $0.01/mo | $0.05/mo | $0.50/mo | First 1M requests free, then $0.40/million |
| EventBridge | $0.01/mo | $0.05/mo | $0.50/mo | $1.00/million events |
| API Gateway | $0.10/mo | $0.10/mo | $0.50/mo | $3.50/million requests, first 1M free |
| Cognito | $0.00/mo | $0.00/mo | $0.00/mo | Free <50K MAU |
| S3 | $0.01/mo | $0.05/mo | $0.50/mo | CF templates + CUR storage (V2) |
| CloudWatch | $0.50/mo | $0.50/mo | $2.00/mo | Logs ingestion + custom metrics |
| **Total** | **~$1.18/mo** | **~$3.07/mo** | **~$22.15/mo** | |
| **Revenue** | **$19/mo** | **$190/mo** | **$1,900/mo** | |
| **Gross Margin** | **93.8%** | **98.4%** | **98.8%** | |

Key insight: Infrastructure cost is negligible at all scales. The $19/account/month price point is almost pure margin. The cost driver at scale will be engineering time (Brian's time), not AWS bills. This is the beauty of serverless — costs scale linearly with usage, and CloudTrail event processing is computationally trivial.

Customer-Side Cost (what the customer pays AWS for dd0c's infrastructure in their account)

| Component | Cost | Notes |
|---|---|---|
| EventBridge rule | $0.00 | Rules are free. Events forwarded: $1.00/million. At ~5K events/day ≈ $0.15/month. |
| IAM roles | $0.00 | IAM is free. |
| CloudTrail | $0.00 (if already enabled) | Most accounts already have CloudTrail enabled. If not, the first trail is free. |
| **Total customer-side cost** | **~$0.15/month** | Negligible. Important for the sales conversation: "dd0c adds ~$0.15/month to your AWS bill." |

4.4 Scaling Strategy

CloudTrail Event Volume Estimates

| Account Activity | Events/Day (all CloudTrail) | Cost-Relevant Events/Day | dd0c Processes |
|---|---|---|---|
| Small startup (10 engineers) | 5,000-20,000 | 50-200 | 50-200 |
| Medium startup (50 engineers) | 50,000-200,000 | 200-1,000 | 200-1,000 |
| Mid-market (200 engineers) | 500,000-2,000,000 | 1,000-5,000 | 1,000-5,000 |
| CI/CD heavy (Terraform runs) | 1,000,000+ | 5,000-20,000 | 5,000-20,000 |

The EventBridge filter is the key scaling lever. By filtering at the EventBridge rule level (customer-side), only cost-relevant events cross the account boundary. A busy account generating 2M CloudTrail events/day sends only ~5K-20K to dd0c. This is well within Lambda's processing capacity.

Scaling Bottlenecks and Mitigations

| Scale | Bottleneck | Mitigation |
|---|---|---|
| 100 accounts | None. | Lambda + DynamoDB on-demand handles this trivially. |
| 500 accounts | SQS FIFO throughput (300 msg/sec per queue without batching; 3,000 msg/sec with batching). With `accountId` as the message group ID, ordering is per-account and the queue limit is shared — still ample at this scale. | Monitor SQS `ApproximateAgeOfOldestMessage`. If >60s, add a second FIFO queue with hash-based routing. |
| 1,000 accounts | Lambda concurrent executions. 5 reserved for ingestion × 10 batch size = 50 events/sec. May need to increase reserved concurrency. | Increase reserved concurrency to 20. Monitor the `ConcurrentExecutions` metric. |
| 5,000 accounts | DynamoDB write throughput. On-demand scales automatically but costs increase. Single-table hot-partition risk if one account generates disproportionate events. | Monitor `ConsumedWriteCapacityUnits`. If a hot partition is detected, add account-level write sharding (append a shard suffix to the PK). |
| 10,000+ accounts | Lambda cold starts become noticeable. The EventBridge cross-account bus may hit account-level limits. | Migrate ingestion to ECS Fargate (long-running containers, no cold starts). Replace EventBridge with Kinesis Data Streams for higher throughput. This is a V3/V4 concern. |

The honest assessment: dd0c/cost's architecture comfortably handles 1,000 accounts without any changes. At 5,000+ accounts, targeted optimizations are needed. At 10,000+, a partial re-architecture (Lambda → ECS for ingestion) is warranted. Given the business plan targets 100 accounts at Month 6 and ~2,600 at $50K MRR, the V1 architecture has 4-10x headroom before scaling work is needed.
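The write-sharding mitigation from the table above can be sketched as follows — a deterministic hash (rather than a purely random suffix) lets readers fan out over a known set of partitions; the shard count and hash are illustrative:

```typescript
const SHARDS = 8; // illustrative shard count

// Writer: spread one noisy account's events across N suffixed partitions.
function shardedPk(accountId: string, eventId: string): string {
  const shard =
    eventId.split("").reduce((h, c) => h + c.charCodeAt(0), 0) % SHARDS;
  return `ACCOUNT#${accountId}#${shard}`;
}

// Reader: query every shard and merge results client-side.
function allShardPks(accountId: string): string[] {
  return Array.from({ length: SHARDS }, (_, s) => `ACCOUNT#${accountId}#${s}`);
}
```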

4.5 CI/CD Pipeline

graph LR
    subgraph "Developer"
        CODE[TypeScript Code] --> GIT[Git Push → GitHub]
    end

    subgraph "GitHub Actions"
        GIT --> LINT[ESLint + Prettier]
        LINT --> TEST[Vitest Unit Tests]
        TEST --> BUILD[CDK Synth<br/>Generate CloudFormation]
        BUILD --> DIFF[CDK Diff<br/>Show changes]
    end

    subgraph "Deployment (main branch)"
        DIFF -->|main branch merge| DEPLOY_STG[CDK Deploy → Staging]
        DEPLOY_STG --> SMOKE[Smoke Tests<br/>Deploy CF stack to test account<br/>Trigger test event<br/>Verify Slack alert]
        SMOKE -->|pass| DEPLOY_PROD[CDK Deploy → Production]
        DEPLOY_PROD --> MONITOR[CloudWatch Alarms<br/>5-min error rate check]
        MONITOR -->|alarm| ROLLBACK[CDK Deploy → Previous Version]
    end

Stack:

  • Source control: GitHub (private repo)
  • CI/CD: GitHub Actions (free for private repos up to 2,000 min/month)
  • IaC: AWS CDK v2 (TypeScript)
  • Testing: Vitest for unit tests, custom smoke test script for integration
  • Environments: Staging (dd0c-staging account) + Production (dd0c-platform account)
  • Deployment: CDK deploy with --require-approval never for staging, --require-approval broadening for production (alerts on IAM/security changes)

Deployment cadence: Continuous deployment to staging on every push. Production deploys on merge to main after smoke tests pass. Rollback is automatic if CloudWatch error rate alarm fires within 5 minutes of deploy.

Solo founder optimization: No manual approval gates. No staging-to-prod promotion ceremony. Push to main → it's live in 10 minutes. If it breaks, CloudWatch catches it and rolls back. Brian's time is too valuable for deployment theater.


5. SECURITY

5.1 IAM Role Design: Customer AWS Accounts

dd0c/cost requires cross-account access to customer AWS accounts. This is the most security-sensitive aspect of the architecture. The design principle: minimum privilege, maximum transparency, separate roles for separate risk levels.

Role 1: dd0c-cost-readonly (Required)

Deployed at onboarding. Read-only. Cannot modify any customer resources.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeComputeResources",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeAddresses",
        "ec2:DescribeNatGateways",
        "ec2:DescribeSnapshots",
        "ec2:DescribeRegions",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeTargetGroups",
        "rds:DescribeDBInstances",
        "rds:DescribeDBClusters",
        "lambda:ListFunctions",
        "lambda:GetFunction"
      ],
      "Resource": "*"
    },
    {
      "Sid": "CloudWatchMetrics",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    },
    {
      "Sid": "CostExplorerReadOnly",
      "Effect": "Allow",
      "Action": [
        "ce:GetCostAndUsage",
        "ce:GetCostForecast"
      ],
      "Resource": "*"
    },
    {
      "Sid": "TagReadOnly",
      "Effect": "Allow",
      "Action": [
        "tag:GetResources",
        "tag:GetTagKeys",
        "tag:GetTagValues"
      ],
      "Resource": "*"
    }
  ]
}

Trust policy with external ID:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "dd0c-cost-<unique-per-customer-uuid>"
        }
      }
    }
  ]
}

Why external ID matters: Without an external ID, any AWS account that knows dd0c's account ID could trick dd0c into assuming a role in a victim's account (the "confused deputy" problem). The external ID is a unique, per-customer secret generated at onboarding and stored in dd0c's database. It's included in the CloudFormation template as a parameter.

What this role explicitly CANNOT do:

  • Create, modify, or delete any resource
  • Read S3 bucket contents
  • Read secrets, parameters, or configuration
  • Access IAM users, roles, or policies
  • Read CloudTrail logs directly (events come via EventBridge, not API)
  • Access any networking configuration (VPCs, security groups, etc.)

Role 2: dd0c-cost-remediate (Optional, Opt-In)

Deployed separately, only when the customer explicitly enables one-click remediation. Scoped to tagged resources only.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RemediateTaggedEC2Only",
      "Effect": "Allow",
      "Action": [
        "ec2:StopInstances",
        "ec2:TerminateInstances",
        "ec2:CreateSnapshot"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/dd0c-remediation": "enabled"
        }
      }
    },
    {
      "Sid": "DescribeForDryRun",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes"
      ],
      "Resource": "*"
    }
  ]
}

Tag-based scoping: Remediation only works on resources tagged dd0c-remediation: enabled. This means:

  • Production databases without the tag are untouchable — even if dd0c has a bug
  • Customers control exactly which resources dd0c can act on
  • The tag can be applied via Terraform/CloudFormation as part of dev/staging resource definitions
  • Production resources should never have this tag (and dd0c's onboarding docs will say so explicitly)
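
The same tag gate can be enforced in application code as defense in depth, so a remediation request never reaches AWS for an untagged resource even if the IAM policy were misconfigured. A minimal sketch (the helper name is illustrative, not dd0c's actual code):

```typescript
interface ResourceTag {
  Key: string;
  Value: string;
}

// Mirrors the IAM condition aws:ResourceTag/dd0c-remediation == "enabled".
// Returns true only when the opt-in tag is present on the resource.
function isRemediationAllowed(tags: ResourceTag[]): boolean {
  return tags.some(
    (t) => t.Key === "dd0c-remediation" && t.Value === "enabled"
  );
}
```

Checking the tag in both layers means a bug in either one still fails closed.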

Role Assumption Flow

dd0c Lambda → sts:AssumeRole(
  RoleArn: "arn:aws:iam::<customer_account>:role/dd0c-cost-readonly",
  ExternalId: "dd0c-cost-<customer-uuid>",
  DurationSeconds: 900  // 15-minute session, minimum viable
) → Temporary credentials → ec2:DescribeInstances → Discard credentials

  • Session duration: 15 minutes (minimum practical). Credentials are never cached beyond a single Lambda invocation.
  • No long-lived credentials are stored anywhere. Every cross-account call uses fresh STS temporary credentials.
  • The Lambda execution role in dd0c's account has sts:AssumeRole permission scoped to arn:aws:iam::*:role/dd0c-cost-readonly and arn:aws:iam::*:role/dd0c-cost-remediate — it can only assume roles with these exact names.
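
The flow above can be sketched as a small request builder. This is an illustrative assumption (the function and session names are not from dd0c's codebase); it captures the invariants described here: a pinned role name, a 900-second session, and the per-customer external ID.

```typescript
interface AssumeRoleParams {
  RoleArn: string;
  RoleSessionName: string;
  ExternalId: string;
  DurationSeconds: number;
}

// Builds the sts:AssumeRole request for the read-only role. The role
// name is pinned, matching the Lambda execution role's sts:AssumeRole
// resource scoping.
function buildReadonlyAssumeRole(
  awsAccountId: string,
  externalId: string
): AssumeRoleParams {
  if (!/^\d{12}$/.test(awsAccountId)) {
    throw new Error(`invalid AWS account ID: ${awsAccountId}`);
  }
  return {
    RoleArn: `arn:aws:iam::${awsAccountId}:role/dd0c-cost-readonly`,
    RoleSessionName: "dd0c-cost",
    ExternalId: externalId,
    DurationSeconds: 900, // minimum session length; credentials never cached
  };
}
```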

5.2 Customer Data Sensitivity

CloudTrail events contain sensitive information. dd0c must handle this responsibly.

What CloudTrail Events Reveal

| Data Element | Sensitivity | dd0c Usage | Storage |
|---|---|---|---|
| AWS Account ID | Medium | Required for multi-tenant routing | Stored (encrypted at rest) |
| IAM User/Role ARN | Medium | Attribution ("who did it") | Stored (encrypted at rest) |
| API action + parameters | High | Cost estimation (instance type, count) | Normalized fields stored. Raw event stored 7 days then TTL'd. |
| Source IP address | High | Not used by dd0c | Stripped at ingestion. Never stored. |
| User agent string | Low | Not used by dd0c | Stripped at ingestion. Never stored. |
| Request/response bodies | High | Instance type, count extracted. Rest discarded. | Only extracted fields stored. Raw TTL'd at 7 days. |
| Error responses | Low | Used to filter failed API calls (no cost impact) | Not stored. Failed events are dropped at ingestion. |

Data minimization principle: dd0c extracts exactly 8 fields from each CloudTrail event (service, action, resource type, resource ID, resource spec, region, actor, timestamp) and discards the rest. The raw event is stored for 7 days for debugging purposes only, then automatically deleted via DynamoDB TTL.
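
A sketch of that minimization step (field names and the resource-type mapping are simplified assumptions, shown here for a RunInstances event):

```typescript
interface CostEvent {
  service: string;
  action: string;
  resourceType: string;
  resourceId: string;
  resourceSpec: string;
  region: string;
  actor: string;
  timestamp: string;
}

// Keeps exactly the eight normalized fields and drops everything else
// (sourceIPAddress, userAgent, full request/response bodies).
function minimizeCloudTrailEvent(raw: any): CostEvent {
  return {
    service: String(raw.eventSource).split(".")[0], // "ec2.amazonaws.com" → "ec2"
    action: raw.eventName,
    resourceType: "instance", // the real mapper derives this from eventName
    resourceId:
      raw.responseElements?.instancesSet?.items?.[0]?.instanceId ?? "",
    resourceSpec: raw.requestParameters?.instanceType ?? "",
    region: raw.awsRegion,
    actor: raw.userIdentity?.arn ?? "unknown",
    timestamp: raw.eventTime,
  };
}
```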

Data in Transit

  • Customer → dd0c: EventBridge cross-account delivery uses AWS's internal network. Events never traverse the public internet.
  • dd0c → Customer (remediation): STS AssumeRole + API calls use HTTPS (TLS 1.2+) over AWS's internal network.
  • dd0c → Slack: HTTPS (TLS 1.3) to Slack's API endpoints.
  • User → dd0c API: HTTPS (TLS 1.2+) via API Gateway with enforced minimum TLS version.

Data at Rest

| Data Store | Encryption | Key Management |
|---|---|---|
| DynamoDB | AES-256 (AWS-managed KMS key) | AWS manages key rotation. Upgrade to CMK if enterprise customers require it. |
| S3 (CUR data) | SSE-S3 (AES-256) | AWS-managed. Upgrade to SSE-KMS with CMK for V2. |
| S3 (CF templates) | Public read (non-sensitive) | N/A |
| Slack OAuth tokens | DynamoDB encryption + application-level AES-256 | Application key stored in AWS Secrets Manager. Tokens are double-encrypted. |
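
The application-level layer of that double encryption could look like the following sketch, assuming AES-256-GCM with the key held in Secrets Manager (DynamoDB's own encryption provides the at-rest layer; function names are illustrative):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

// Encrypts a Slack token with AES-256-GCM. The 12-byte IV and 16-byte
// auth tag are stored alongside the ciphertext in one base64 blob.
function encryptToken(key: Buffer, plaintext: string): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]).toString("base64");
}

// Decrypts the blob; throws if the key is wrong or the data was tampered with.
function decryptToken(key: Buffer, blob: string): string {
  const buf = Buffer.from(blob, "base64");
  const decipher = createDecipheriv("aes-256-gcm", key, buf.subarray(0, 12));
  decipher.setAuthTag(buf.subarray(12, 28));
  return Buffer.concat([
    decipher.update(buf.subarray(28)),
    decipher.final(),
  ]).toString("utf8");
}
```

GCM is authenticated, so a swapped or corrupted blob fails decryption instead of yielding garbage.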

5.3 SOC 2 Considerations

SOC 2 Type II is the target within 12 months of launch. It's table stakes for selling to any company with a security team.

SOC 2 Trust Service Criteria Mapping

| Criteria | dd0c/cost Implementation |
|---|---|
| Security | IAM least-privilege, encryption at rest/transit, Cognito auth, API Gateway throttling, Slack signature verification |
| Availability | Lambda auto-scaling, DynamoDB on-demand, SQS durability, CloudWatch alarms with auto-rollback |
| Processing Integrity | SQS FIFO deduplication, idempotent Lambda handlers, DynamoDB conditional writes, remediation audit log |
| Confidentiality | Data minimization (strip source IPs, raw events TTL'd), encryption, per-tenant partition isolation, no cross-tenant data access |
| Privacy | No PII collected (IAM ARNs are not PII under most frameworks). Data deletion on account disconnection within 72 hours. |

Pre-SOC 2 Security Checklist (Ship with V1)

  • All data encrypted at rest (DynamoDB, S3)
  • All data encrypted in transit (TLS 1.2+)
  • No long-lived credentials stored (STS temporary credentials only)
  • External ID on all cross-account roles (confused deputy prevention)
  • Remediation scoped to tagged resources only
  • Remediation audit log (who, what, when, result)
  • Slack signature verification on all interactive payloads (HMAC-SHA256)
  • Cognito JWT validation on all API requests
  • API Gateway throttling (1,000 req/sec default, per-API-key limits)
  • CloudWatch alarms on Lambda errors, SQS DLQ depth, DynamoDB throttles
  • DynamoDB Point-in-Time Recovery enabled
  • GitHub branch protection (require PR review — even if reviewing your own PRs as solo founder)
  • Bug bounty program (launch within 30 days of public launch)
  • Penetration test (schedule within 90 days of launch)
  • SOC 2 Type I audit (Month 6-9)
  • SOC 2 Type II audit (Month 12-15)

5.4 The Trust Model

dd0c asks customers to grant read access to their CloudTrail events and resource metadata. This is a significant trust ask. The architecture must earn and maintain that trust.

Trust-Building Measures

  1. Open-source the CloudFormation templates. Customers can read exactly what IAM permissions they're granting. No hidden permissions. The templates are hosted on a public S3 bucket and linked from the docs.

  2. Open-source the event processor. The Lambda function that processes CloudTrail events can be published as open source. Customers can audit exactly what data is extracted and what's discarded. (The anomaly scoring algorithm and business logic remain proprietary.)

  3. Minimal permissions, clearly documented. The docs page for "What permissions does dd0c need?" lists every IAM action with a plain-English explanation of why it's needed. No * actions. No iam:*. No s3:*.

  4. Separate remediation role. Read-only access is the default. Write access (remediation) is a separate, opt-in deployment. Customers who never deploy the remediation stack can use dd0c for detection and alerting only — dd0c can never modify their resources.

  5. External ID rotation. Customers can rotate their external ID at any time via the dd0c dashboard. This invalidates dd0c's ability to assume the role until the new external ID is configured — an emergency kill switch.

  6. Self-hosted agent option (V3). For customers who refuse to send CloudTrail events to dd0c's account, a self-hosted agent runs in the customer's VPC. It processes events locally and sends only anonymized anomaly summaries (severity, estimated cost, resource type — no ARNs, no account IDs) to dd0c's SaaS for the dashboard. This is the nuclear option for security-paranoid customers.

Threat Model

| Threat | Likelihood | Impact | Mitigation |
|---|---|---|---|
| dd0c's AWS account compromised | Low | Critical | MFA on root, SCPs restricting dangerous actions, CloudTrail on dd0c's own account, AWS GuardDuty enabled |
| Attacker assumes customer role via dd0c | Very Low | Critical | External ID prevents confused deputy. STS sessions are 15 minutes. Lambda execution role is scoped to specific role names. |
| dd0c employee (Brian) goes rogue | N/A (solo founder) | Critical | Open-source templates + audit logs provide transparency. SOC 2 audit provides external verification. |
| Slack token compromise | Low | High | Tokens double-encrypted (DynamoDB + application-level). Slack token rotation supported. Tokens scoped to minimum bot permissions. |
| DynamoDB data breach | Very Low | High | Encryption at rest. No public endpoints. VPC endpoints for DynamoDB access from Lambda (V2). IAM policies restrict access to dd0c's Lambda execution roles only. |

6. MVP SCOPE

6.1 V1 Boundary: What Ships at Day 90

V1 is scoped to three services, one notification channel, and zero dashboards. Every feature that doesn't directly serve "detect → alert → fix" is cut.

V1 Feature Matrix

| Feature | V1 (Day 90) | V2 (Month 4-6) | V3 (Month 7-12) |
|---|---|---|---|
| EC2 anomaly detection | ✅ | ✅ | ✅ |
| RDS anomaly detection | ✅ | ✅ | ✅ |
| Lambda anomaly detection | ✅ | ✅ | ✅ |
| ECS/Fargate detection | – | ✅ | ✅ |
| SageMaker detection | – | – | ✅ |
| Slack alerts (Block Kit) | ✅ | ✅ | ✅ |
| Manual remediation suggestions | ✅ | ✅ | ✅ |
| One-click remediation (Slack buttons) | – | ✅ | ✅ |
| Zombie resource hunter | ✅ (daily) | ✅ (daily) | ✅ (continuous) |
| Daily digest | ✅ | ✅ | ✅ |
| Weekly digest | – | ✅ | ✅ |
| End-of-month forecast | ✅ (basic) | ✅ (improved) | ✅ (ML-based) |
| CUR reconciliation | – | ✅ | ✅ |
| Web dashboard | – | ✅ (basic) | ✅ (full) |
| Multi-account support | ✅ (1 account) | ✅ | ✅ |
| Team attribution | – | – | ✅ |
| Custom anomaly rules | – | – | ✅ |
| Autonomous remediation (opt-in) | – | – | ✅ |
| API access (Business tier) | – | – | ✅ |
| dd0c/route integration | – | ✅ (shared accounts) | ✅ (deep) |

V1 Remediation: Suggestions, Not Buttons

Critical V1 scoping decision: V1 ships with remediation suggestions in Slack alerts, not one-click action buttons.

Why:

  1. Remediation is the highest-risk feature. A bug that stops a production instance is catastrophic for a product with zero brand trust. V1 needs to build trust through accurate detection before earning the right to take action.
  2. The remediation IAM role doubles onboarding complexity. V1 onboarding deploys one CloudFormation stack (read-only). Adding a second stack for remediation adds friction and IAM anxiety.
  3. Suggestions still deliver 80% of the value. A Slack alert that says "Stop this instance: aws ec2 stop-instances --instance-ids i-0abc123 --region us-east-1" with a copy-paste CLI command is almost as fast as a button. The user runs it in their terminal in 5 seconds.

V1 alert format (suggestions, not buttons):

🟡 WARNING: Unusual EC2 Activity

*2× p3.2xlarge instances* launched in us-east-1
├─ Estimated cost: *$6.12/hr* ($146.88/day)
├─ Who: sam@company.com (IAM User)
├─ When: Today at 11:02 AM UTC
├─ Account: 123456789012
└─ Why: Instance type never seen in this account.
   Cost is 4.1× your average EC2 hourly spend.

💡 *Suggested actions:*
• Stop instances: `aws ec2 stop-instances --instance-ids i-0abc123 i-0def456 --region us-east-1`
• Check if needed: `aws ec2 describe-instances --instance-ids i-0abc123 i-0def456 --region us-east-1 --query 'Reservations[].Instances[].{State:State.Name,Launch:LaunchTime,Tags:Tags}'`

[Mark as Expected ✓]  [Snooze 4h]

ℹ️ Cost estimated using on-demand pricing.

The only interactive buttons in V1 are [Mark as Expected] and [Snooze] — both are internal to dd0c (no cross-account API calls, zero risk).
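
In Block Kit terms, the V1 actions block carries only those two internal buttons. A sketch (the payload shape follows Slack's Block Kit; the builder function itself is illustrative):

```typescript
// Builds the actions block for a V1 alert. The action_id values match
// the interactive-payload handlers described in the API design section.
function v1ActionsBlock(anomalyId: string) {
  return {
    type: "actions",
    block_id: `anomaly_${anomalyId}`,
    elements: [
      {
        type: "button",
        action_id: "mark_expected",
        text: { type: "plain_text", text: "Mark as Expected ✓" },
        value: anomalyId,
      },
      {
        type: "button",
        action_id: "snooze_4h",
        text: { type: "plain_text", text: "Snooze 4h" },
        value: anomalyId,
      },
    ],
  };
}
```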

One-click remediation buttons ship in V2 after:

  • 30+ days of accurate detection builds customer trust
  • The remediation IAM role and tag-based scoping are battle-tested internally
  • At least 3 design partners have opted into the remediation stack

6.2 The Onboarding Flow (Technical)

The onboarding flow is the most critical user journey. Every second of friction costs signups.

sequenceDiagram
    participant U as User (Browser)
    participant DD as dd0c Web App
    participant COG as Cognito
    participant AWS_C as Customer AWS Console
    participant CF as CloudFormation
    participant DD_API as dd0c API
    participant SL as Slack

    U->>DD: Click "Start Free"
    DD->>COG: Redirect to Cognito hosted UI
    COG->>COG: GitHub/Google OAuth
    COG->>DD: JWT token (id_token + access_token)
    DD->>DD_API: POST /v1/accounts/setup (init tenant)
    DD_API->>DD_API: Generate unique external_id (UUID v4)
    DD_API->>U: Return CloudFormation quick-create URL

    Note over U,AWS_C: User clicks CF link → opens AWS Console
    U->>AWS_C: Click link (pre-filled CF stack)
    AWS_C->>CF: Create stack (dd0c-cost-readonly)
    CF->>CF: Create IAM role + EventBridge rule (~60-90 sec)
    CF-->>AWS_C: Stack complete → outputs Role ARN

    U->>DD: Paste Role ARN (or auto-detected via CF callback)
    DD->>DD_API: POST /v1/accounts (roleArn, externalId)
    DD_API->>DD_API: sts:AssumeRole (validate access)
    DD_API->>U: ✅ Account connected

    U->>DD: Click "Connect Slack"
    DD->>SL: Slack OAuth flow (bot scopes: chat:write, commands)
    SL->>DD: OAuth callback (bot token, workspace ID)
    DD->>DD_API: POST /v1/slack/install (token, workspace, channel)

    DD_API->>DD_API: Trigger immediate zombie scan
    DD_API->>SL: First alert: "Found 3 zombie resources costing $127/mo"

    Note over U: Total time: 3-5 minutes. First value: <10 minutes.

CloudFormation Quick-Create URL

The magic of fast onboarding is the CloudFormation quick-create URL. Instead of asking users to download a template and upload it, dd0c generates a URL that opens the AWS Console with the stack pre-configured:

https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate
  ?templateURL=https://dd0c-cf-templates.s3.amazonaws.com/dd0c-cost-readonly-v1.yaml
  &stackName=dd0c-cost-monitoring
  &param_Dd0cAccountId=111122223333
  &param_ExternalId=dd0c-cost-a1b2c3d4-e5f6-7890-abcd-ef1234567890

The user sees a pre-filled CloudFormation page. They check "I acknowledge that AWS CloudFormation might create IAM resources" and click "Create stack." Done in 90 seconds.
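
Generating the link is a few lines. A sketch (the template bucket and account ID are the example values above; the function name is illustrative):

```typescript
// Builds the CloudFormation quick-create URL for a given customer's
// external ID. The parameters after the # fragment are what the console reads.
function quickCreateUrl(externalId: string, region = "us-east-1"): string {
  const params = new URLSearchParams({
    templateURL:
      "https://dd0c-cf-templates.s3.amazonaws.com/dd0c-cost-readonly-v1.yaml",
    stackName: "dd0c-cost-monitoring",
    param_Dd0cAccountId: "111122223333",
    param_ExternalId: externalId,
  });
  return (
    `https://${region}.console.aws.amazon.com/cloudformation/home` +
    `?region=${region}#/stacks/quickcreate?${params.toString()}`
  );
}
```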

Auto-Detection of Role ARN (V1 Enhancement)

Instead of asking users to copy-paste the Role ARN, dd0c can poll for the role:

  1. After the user clicks the CF link, dd0c starts polling sts:AssumeRole with the expected role ARN (arn:aws:iam::<account_id>:role/dd0c-cost-readonly) every 10 seconds
  2. The account ID is extracted from the Cognito JWT (if the user signed up with an AWS-linked identity) or entered manually
  3. When the role becomes assumable (CF stack complete), dd0c auto-detects it and skips the "paste ARN" step
  4. Fallback: manual ARN entry if auto-detection fails after 5 minutes
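
The polling loop might be structured like this sketch, with the assume-role probe injected so the loop itself stays testable (names are illustrative):

```typescript
// Polls until the role becomes assumable or the timeout elapses.
// In production, `probe` would attempt sts:AssumeRole and resolve
// true on success.
async function pollForRole(
  probe: () => Promise<boolean>,
  intervalMs = 10_000,
  timeoutMs = 300_000 // 5 minutes, then fall back to manual ARN entry
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe()) return true; // auto-detected; skip the paste step
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```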

6.3 False Positive Rate Target

Target: <20% false positive rate by Day 60 (design partner phase).

Definition: A "false positive" is an alert where the user clicks [Mark as Expected] or [Snooze permanently]. An alert that the user ignores is NOT counted as a false positive (it might be a true positive they chose not to act on).

Measurement

// Calculated daily per account
const falsePositiveRate =
  markedAsExpected / (markedAsExpected + actedOn);

// Where:
// markedAsExpected = alerts where the user clicked "Mark as Expected"
// actedOn = alerts where the user clicked Stop/Terminate/Snooze (temporary)
// Alerts still open after 48 hours are ambiguous and are excluded from
// both the numerator and the denominator.

False Positive Reduction Strategy

| Phase | FP Rate Target | Strategy |
|---|---|---|
| Day 1-14 (cold start) | <40% | Absolute thresholds only. Conservative defaults. Miss small anomalies rather than cry wolf. |
| Day 14-30 (learning) | <30% | Statistical baselines kick in. Composite scoring reduces single-signal false positives. |
| Day 30-60 (design partners) | <20% | Feedback loop active. "Mark as Expected" retrains baselines. Per-account patterns learned. |
| Day 60-90 (launch) | <15% | Sensitivity tuning based on design partner data. Default thresholds calibrated to real-world patterns. |
| Month 3+ (mature) | <10% | Mature baselines. Suppressed patterns. CUR reconciliation (V2) corrects pricing estimates. |

Alert-to-Action Ratio

The complementary metric to false positive rate. Measures what percentage of alerts result in a meaningful action.

Target: >25% alert-to-action ratio.

If <20% of alerts result in action (stop, terminate, snooze, or investigate), the product is too noisy. This is the "boy who cried wolf" metric — if it drops below 20%, dd0c has 30 days to fix it or trigger the kill criteria.

6.4 Technical Debt Budget

V1 will accumulate technical debt. That's fine — it's a 90-day sprint. But the debt must be tracked and bounded.

Acceptable V1 Technical Debt

| Debt Item | Impact | Payoff Timeline |
|---|---|---|
| Static pricing tables (no RI/SP awareness) | Over-estimates costs for accounts with commitments. Higher false positive rate. | V2 (CUR reconciliation) |
| Single-region deployment (us-east-1) | Higher latency for non-US customers. Single point of failure. | V3 (multi-region) |
| No web dashboard | Customers can't view anomaly history outside Slack. | V2 |
| Single DynamoDB table for everything | Will hit GSI limits and hot partition issues at scale. | V2 (add Aurora read replica for dashboard queries) |
| Hardcoded Slack as only notification channel | Can't support Teams, Discord, email, PagerDuty. | V2 (notification abstraction layer) |
| No automated integration tests | Relies on manual smoke testing + unit tests. | Month 2 (add integration test suite) |
| Lambda cold starts on infrequent paths | Zombie hunter and digest Lambdas cold-start every invocation. | V2 (provisioned concurrency if needed, or migrate to ARM for faster cold starts) |
| No rate limiting per tenant | A single noisy account could consume disproportionate Lambda concurrency. | V2 (per-account SQS message group + concurrency limits) |

Unacceptable Technical Debt (Must Not Ship)

  • Storing long-lived AWS credentials (must use STS temporary credentials)
  • Unencrypted data at rest
  • Missing Slack signature verification
  • Missing external ID on cross-account roles
  • No CloudWatch alarms on critical paths
  • No DynamoDB Point-in-Time Recovery
  • Hardcoded secrets in code (must use Secrets Manager or environment variables from CDK)

6.5 Solo Founder Operational Model

Brian is building and operating this alone. The architecture must minimize operational burden.

Operational Runbook (What Can Go Wrong)

| Scenario | Detection | Response | Automation Level |
|---|---|---|---|
| Lambda ingestion errors spike | CloudWatch alarm: event-processor error rate >5% in 5 min | Check CloudWatch Logs. Common causes: malformed CloudTrail event, DynamoDB throttle, STS AssumeRole failure. | Alert → Brian's phone via SNS. Manual investigation. |
| SQS DLQ messages accumulate | CloudWatch alarm: DLQ ApproximateNumberOfMessagesVisible > 0 | Inspect DLQ messages. Replay after fixing root cause. | Alert → Brian's phone. Manual replay via CLI script. |
| DynamoDB throttling | CloudWatch alarm: ThrottledRequests > 0 | On-demand capacity should auto-scale. If persistent, check for hot partition (single account generating disproportionate events). | Alert → Brian's phone. Usually self-resolving. |
| Slack API rate limited | Lambda retry with exponential backoff via SQS visibility timeout | SQS handles retry automatically. If persistent, check for alert storm (noisy account). | Fully automated. |
| Customer CloudFormation stack deleted | EventBridge events stop arriving. Detected by daily health check (no events in 24h for active account). | Notify customer via Slack: "We stopped receiving events from account X. Did you remove the dd0c stack?" | Semi-automated. Health check Lambda detects, sends Slack notification. |
| Deployment breaks production | CloudWatch alarm: error rate >5% within 5 min of deploy | Automatic rollback to previous Lambda version via CDK. | Fully automated rollback. |

Time Budget (Brian's Weekly Hours on dd0c/cost)

| Activity | Hours/Week (V1 build) | Hours/Week (Post-launch) |
|---|---|---|
| Feature development | 20-25 | 10-15 |
| Bug fixes | 5-10 | 5-10 |
| Customer support | 0 | 2-5 |
| Ops/monitoring | 1-2 | 2-3 |
| Content marketing | 2-3 | 3-5 |
| **Total** | **28-40** | **22-38** |

The constraint: Brian is also building dd0c/route simultaneously. Total available hours: ~50-60/week across both products. dd0c/cost gets ~50% of time during the build phase, dropping to ~40% post-launch as dd0c/route matures.

Automation imperative: Every hour spent on ops is an hour not spent on features or marketing. The architecture must be self-healing:

  • Lambda auto-scales and auto-retries
  • SQS provides durability and backpressure
  • DynamoDB on-demand eliminates capacity planning
  • CloudWatch alarms catch problems before customers notice
  • CDK deploys are one command (cdk deploy --all)
  • No SSH. No servers. No patching. No 3 AM pages (unless Lambda error rate spikes, which means something is fundamentally broken).

7. API DESIGN

7.1 Account Registration & Onboarding API

All API endpoints are served via API Gateway at https://api.dd0c.dev/v1. Authentication is via Cognito JWT in the Authorization: Bearer <token> header.

POST /v1/accounts/setup

Initialize a new tenant and generate onboarding artifacts.

Request:
  Headers:
    Authorization: Bearer <cognito_jwt>
  Body: (none — tenant derived from JWT)

Response: 201 Created
{
  "tenantId": "tn_01HXYZ...",
  "externalId": "dd0c-cost-a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "cloudFormationUrl": "https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate?templateURL=https://dd0c-cf-templates.s3.amazonaws.com/dd0c-cost-readonly-v1.yaml&stackName=dd0c-cost-monitoring&param_Dd0cAccountId=111122223333&param_ExternalId=dd0c-cost-a1b2c3d4...",
  "status": "pending_aws_connection"
}

POST /v1/accounts

Register a connected AWS account after CloudFormation stack deployment.

Request:
  Headers:
    Authorization: Bearer <cognito_jwt>
  Body:
  {
    "awsAccountId": "123456789012",
    "roleArn": "arn:aws:iam::123456789012:role/dd0c-cost-readonly",
    "region": "us-east-1",
    "friendlyName": "production"  // optional
  }

Response: 201 Created
{
  "accountId": "acc_01HXYZ...",
  "awsAccountId": "123456789012",
  "status": "validating",
  "validation": {
    "assumeRole": "pending",
    "eventBridge": "pending",
    "firstEvent": "pending"
  }
}

// dd0c immediately:
// 1. Attempts sts:AssumeRole to validate access
// 2. Triggers zombie scan
// 3. Starts listening for EventBridge events
// Validation status updates via polling or webhook

GET /v1/accounts

List all connected AWS accounts for the authenticated tenant.

Response: 200 OK
{
  "accounts": [
    {
      "accountId": "acc_01HXYZ...",
      "awsAccountId": "123456789012",
      "friendlyName": "production",
      "status": "active",
      "connectedAt": "2026-02-28T10:00:00Z",
      "lastEventAt": "2026-02-28T14:32:00Z",
      "baselineMaturity": "learning",  // "cold-start" | "learning" | "mature"
      "remediationEnabled": false,
      "stats": {
        "anomaliesLast7d": 3,
        "estimatedMonthlyCost": 14230.00,
        "zombieResourceCount": 5,
        "zombieEstimatedMonthlyCost": 127.40
      }
    }
  ]
}

DELETE /v1/accounts/{accountId}

Disconnect an AWS account. Triggers data deletion pipeline.

Response: 202 Accepted
{
  "accountId": "acc_01HXYZ...",
  "status": "disconnecting",
  "dataDeletedBy": "2026-03-03T10:00:00Z"  // 72-hour GDPR window
}

GET /v1/accounts/{accountId}/health

Check the health of a connected account (EventBridge events flowing, role assumable, etc.).

Response: 200 OK
{
  "accountId": "acc_01HXYZ...",
  "health": "healthy",  // "healthy" | "degraded" | "disconnected"
  "checks": {
    "roleAssumable": { "status": "pass", "lastChecked": "2026-02-28T14:00:00Z" },
    "eventsFlowing": { "status": "pass", "lastEventAt": "2026-02-28T14:32:00Z" },
    "baselinePopulated": { "status": "pass", "maturity": "learning", "sampleCount": 47 }
  }
}

7.2 Anomaly Query & Search API

GET /v1/anomalies

Query anomalies across all connected accounts.

Query Parameters:
  accountId    (optional) — filter by account
  severity     (optional) — "info" | "warning" | "critical"
  status       (optional) — "open" | "resolved" | "expected" | "snoozed"
  service      (optional) — "ec2" | "rds" | "lambda"
  since        (optional) — ISO 8601 timestamp (default: 7 days ago)
  until        (optional) — ISO 8601 timestamp (default: now)
  limit        (optional) — 1-100 (default: 50)
  cursor       (optional) — pagination cursor

Response: 200 OK
{
  "anomalies": [
    {
      "anomalyId": "an_01HXYZ...",
      "accountId": "acc_01HXYZ...",
      "awsAccountId": "123456789012",
      "severity": "warning",
      "score": 4.2,
      "status": "open",
      "title": "2× p3.2xlarge launched in us-east-1",
      "description": "sam@company.com launched 2 GPU instances at 11:02 AM UTC. Instance type never seen in this account. Cost is 4.1× average EC2 hourly spend.",
      "estimatedHourlyCost": 6.12,
      "estimatedDailyCost": 146.88,
      "service": "ec2",
      "resourceType": "instance",
      "resourceIds": ["i-0abc123", "i-0def456"],
      "resourceSpec": "p3.2xlarge",
      "region": "us-east-1",
      "actor": "sam@company.com",
      "detectedAt": "2026-02-28T11:02:15Z",
      "signals": {
        "zScore": 4.1,
        "instanceTypeNovelty": true,
        "actorNovelty": false,
        "absoluteCostThreshold": false,
        "quantityAnomaly": false,
        "timeOfDayAnomaly": false
      },
      "suggestedActions": [
        {
          "action": "stop",
          "command": "aws ec2 stop-instances --instance-ids i-0abc123 i-0def456 --region us-east-1",
          "risk": "low"
        }
      ],
      "slackMessageUrl": "https://company.slack.com/archives/C01ABC/p1234567890"
    }
  ],
  "cursor": "eyJsYXN0S2V5Ijo...",
  "total": 12
}

GET /v1/anomalies/{anomalyId}

Get full details for a single anomaly, including remediation audit log.

Response: 200 OK
{
  // ... all fields from list response, plus:
  "remediationLog": [
    {
      "action": "stop",
      "executedBy": "U01SLACK_USER",
      "executedByName": "Sam Chen",
      "executedAt": "2026-02-28T11:15:00Z",
      "targetResourceId": "i-0abc123",
      "result": "success",
      "dryRunPassed": true
    }
  ],
  "triggerEvents": [
    {
      "cloudTrailEventId": "abc123-def456-...",
      "eventTime": "2026-02-28T11:02:12Z",
      "action": "RunInstances",
      "rawParameters": {
        "instanceType": "p3.2xlarge",
        "minCount": 2,
        "maxCount": 2,
        "imageId": "ami-0abc123..."
      }
    }
  ],
  "reconciliation": {
    "reconciled": false,
    "actualCost": null,
    "pricingTerm": null,
    "reconciledAt": null
  }
}

PATCH /v1/anomalies/{anomalyId}

Update anomaly status (mark as expected, snooze, resolve).

Request:
{
  "status": "expected",  // or "snoozed" with snoozeUntil
  "snoozeUntil": "2026-02-28T15:00:00Z"  // only for "snoozed"
}

Response: 200 OK
{ "anomalyId": "an_01HXYZ...", "status": "expected", "updatedAt": "..." }

7.3 Baseline Configuration API

GET /v1/accounts/{accountId}/baselines

View current baselines for an account.

Response: 200 OK
{
  "baselines": [
    {
      "service": "ec2",
      "resourceType": "instance",
      "maturity": "learning",
      "sampleCount": 47,
      "mean": 0.48,
      "stddev": 0.92,
      "p95": 1.92,
      "maxObserved": 3.06,
      "expectedInstanceTypes": ["t3.medium", "m5.xlarge", "c5.2xlarge"],
      "expectedActors": ["sam@company.com", "terraform-deploy-role"],
      "sensitivity": "medium",
      "suppressedPatterns": []
    },
    {
      "service": "rds",
      "resourceType": "db-instance",
      "maturity": "cold-start",
      "sampleCount": 3,
      "mean": 0.416,
      "stddev": 0.0,
      "expectedInstanceTypes": ["db.t3.medium"],
      "expectedActors": ["terraform-deploy-role"],
      "sensitivity": "medium",
      "suppressedPatterns": []
    }
  ]
}
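
The running mean/stddev shown in these responses can be maintained without storing raw samples. One standard way to do that is Welford's online algorithm; this is an illustrative assumption, not necessarily how dd0c's scorer is implemented:

```typescript
interface RunningBaseline {
  sampleCount: number;
  mean: number;
  m2: number; // running sum of squared deviations from the mean
}

// Folds one observed hourly cost into the baseline (Welford's update).
function updateBaseline(
  b: RunningBaseline,
  hourlyCost: number
): RunningBaseline {
  const n = b.sampleCount + 1;
  const delta = hourlyCost - b.mean;
  const mean = b.mean + delta / n;
  const m2 = b.m2 + delta * (hourlyCost - mean);
  return { sampleCount: n, mean, m2 };
}

// Sample standard deviation; 0 until there are at least two samples.
function stddev(b: RunningBaseline): number {
  return b.sampleCount > 1 ? Math.sqrt(b.m2 / (b.sampleCount - 1)) : 0;
}
```

A single DynamoDB item per (service, resourceType) then only needs three numbers to serve both the baseline API and the z-score calculation.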

PATCH /v1/accounts/{accountId}/baselines/{service}/{resourceType}

Override baseline sensitivity or suppress patterns.

Request:
{
  "sensitivity": "low",  // "low" | "medium" | "high"
  "suppressPattern": {   // optional — add a suppressed pattern
    "resourceSpec": "t3.medium",
    "actor": null,        // null = any actor
    "reason": "We launch t3.mediums constantly for CI"
  }
}

Response: 200 OK
{ "updated": true }

POST /v1/accounts/{accountId}/baselines/reset

Reset baselines for an account (re-enter cold-start mode). Useful if the account's usage pattern has fundamentally changed.

Request:
{
  "service": "ec2",       // optional — reset specific service. Omit for all.
  "resourceType": "instance"  // optional
}

Response: 200 OK
{ "reset": true, "newMaturity": "cold-start" }

7.4 Slack Bot Commands & Interactive Payloads

Slash Commands

| Command | Description | Response |
|---|---|---|
| `/dd0c status` | Show connected accounts and health | Ephemeral message with account list, health status, and last event time |
| `/dd0c anomalies` | Show open anomalies | Ephemeral message with top 5 open anomalies, sorted by severity |
| `/dd0c zombies` | Trigger on-demand zombie scan | "Scanning... results in ~60 seconds" → followed by zombie report |
| `/dd0c sensitivity <service> <low\|medium\|high>` | Adjust anomaly sensitivity | "EC2 sensitivity set to LOW. You'll only see critical anomalies." |
| `/dd0c digest` | Trigger on-demand daily digest | Sends the daily digest message immediately |
| `/dd0c help` | Show available commands | Command reference |
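
Slack delivers everything after /dd0c as a single text string, so command dispatch starts with a small parser. A sketch (the parser shape is an assumption):

```typescript
interface ParsedCommand {
  command: string;
  args: string[];
}

// Slack sends only the text after the command name, e.g.
// "sensitivity ec2 low". Empty input defaults to the help command.
function parseSlashCommand(text: string): ParsedCommand {
  const [command = "help", ...args] = text.trim().split(/\s+/).filter(Boolean);
  return { command, args };
}
```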

Interactive Message Payloads

Slack sends interactive payloads to POST https://api.dd0c.dev/v1/slack/actions when users click buttons in alert messages.

Incoming payload structure (from Slack):

{
  "type": "block_actions",
  "user": { "id": "U01ABC", "name": "sam" },
  "team": { "id": "T01ABC" },
  "channel": { "id": "C01ABC" },
  "message": { "ts": "1234567890.123456" },
  "actions": [
    {
      "action_id": "mark_expected",
      "block_id": "anomaly_an_01HXYZ",
      "value": "an_01HXYZ..."
    }
  ]
}

Action handling:

| action_id | Handler | Side Effects |
|---|---|---|
| `mark_expected` | Update anomaly status → retrain baseline | Update original Slack message: "✅ Marked as expected by @sam" |
| `snooze_1h` | Set snoozeUntil = now + 1h | Update message: "💤 Snoozed for 1 hour by @sam" |
| `snooze_4h` | Set snoozeUntil = now + 4h | Update message: "💤 Snoozed for 4 hours by @sam" |
| `snooze_24h` | Set snoozeUntil = now + 24h | Update message: "💤 Snoozed for 24 hours by @sam" |
| `stop_instance` (V2) | Open confirmation modal → execute ec2:StopInstances | Update message: "🛑 Stopped by @sam at 2:34 PM" |
| `terminate_instance` (V2) | Open confirmation modal → snapshot → execute ec2:TerminateInstances | Update message: "💀 Terminated by @sam. Snapshot: snap-0abc123" |
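
The three snooze actions differ only in duration, so the handler can derive snoozeUntil directly from the action_id. A sketch (the helper name is illustrative):

```typescript
// Maps "snooze_1h" / "snooze_4h" / "snooze_24h" to an absolute
// snoozeUntil timestamp; throws for non-snooze action_ids.
function snoozeUntil(actionId: string, now: Date): Date {
  const match = /^snooze_(\d+)h$/.exec(actionId);
  if (!match) throw new Error(`not a snooze action: ${actionId}`);
  return new Date(now.getTime() + Number(match[1]) * 3_600_000);
}
```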

Slack signature verification (critical security):

Every incoming request is verified using Slack's signing secret:

import crypto from 'crypto';

function verifySlackSignature(
  signingSecret: string,
  requestBody: string,
  timestamp: string,
  signature: string
): boolean {
  // Reject requests older than 5 minutes (replay attack prevention)
  const fiveMinutesAgo = Math.floor(Date.now() / 1000) - 300;
  if (parseInt(timestamp, 10) < fiveMinutesAgo) return false;

  const sigBasestring = `v0:${timestamp}:${requestBody}`;
  const mySignature = 'v0=' + crypto
    .createHmac('sha256', signingSecret)
    .update(sigBasestring)
    .digest('hex');

  // timingSafeEqual throws if the buffers differ in length, so a
  // malformed signature must be rejected before the comparison.
  const expected = Buffer.from(mySignature);
  const received = Buffer.from(signature);
  if (expected.length !== received.length) return false;

  return crypto.timingSafeEqual(expected, received);
}

7.5 Dashboard REST API (V2)

V2 introduces a lightweight web dashboard. The API supports it.

GET /v1/dashboard/summary

Overview data for the dashboard home screen.

Response: 200 OK
{
  "period": "last_30_days",
  "totalEstimatedSpend": 14230.00,
  "spendTrend": "+17.6%",
  "anomaliesDetected": 12,
  "anomaliesResolved": 9,
  "anomaliesOpen": 3,
  "estimatedSavings": 4720.00,  // Cost avoided via remediation actions
  "zombieResources": 5,
  "zombieEstimatedMonthlyCost": 127.40,
  "topAnomalies": [ /* top 5 by severity */ ],
  "spendByService": [
    { "service": "ec2", "estimatedCost": 8540.00, "percentage": 60.0 },
    { "service": "rds", "estimatedCost": 3420.00, "percentage": 24.0 },
    { "service": "lambda", "estimatedCost": 2270.00, "percentage": 16.0 }
  ],
  "dailySpend": [
    { "date": "2026-02-01", "estimatedCost": 412.00 },
    { "date": "2026-02-02", "estimatedCost": 398.00 },
    // ... 30 days
  ]
}

GET /v1/dashboard/timeline

Anomaly timeline for visualization.

Query Parameters:
  since  (optional) — default 30 days
  until  (optional) — default now

Response: 200 OK
{
  "events": [
    {
      "timestamp": "2026-02-28T11:02:15Z",
      "type": "anomaly",
      "severity": "warning",
      "title": "2× p3.2xlarge launched",
      "estimatedHourlyCost": 6.12,
      "status": "resolved"
    },
    {
      "timestamp": "2026-02-27T09:00:00Z",
      "type": "digest",
      "title": "Daily digest sent"
    }
  ]
}

7.6 Integration Points: dd0c/route Cross-Sell

dd0c/cost and dd0c/route are the "gateway drug pair." Their integration creates compound value that neither product delivers alone.

Shared Infrastructure

| Component | Shared? | Details |
|---|---|---|
| Cognito User Pool | Shared | Single sign-on across dd0c products. One login, access to both products. |
| Tenant/Account Registry | Shared | TENANT#<tenant_id> in DynamoDB is the same entity across products. A customer who uses dd0c/route and adds dd0c/cost doesn't create a new account. |
| Slack Integration | Shared | One Slack app (dd0c) with scopes for both products. Alerts from both products go to the same (or different) channels. Single OAuth flow. |
| Billing (Stripe) | Shared | One Stripe customer. One invoice. Bundle pricing applied automatically. |
| API Gateway | Shared | api.dd0c.dev/v1/cost/* and api.dd0c.dev/v1/route/* on the same API Gateway. |
| CloudFormation Templates | Separate | dd0c/cost needs CloudTrail + resource describe. dd0c/route needs different permissions (if any AWS integration). Separate stacks, separate IAM roles. |
| Data Stores | Separate | Different DynamoDB tables. Different data schemas. No cross-product data access in V1. |

Cross-Sell Triggers

// In dd0c/route's notification service:
// When a dd0c/route customer saves money on LLM routing,
// check if they also use dd0c/cost.

interface CrossSellTrigger {
  trigger: string;
  condition: string;
  message: string;
}

const CROSS_SELL_TRIGGERS: CrossSellTrigger[] = [
  {
    trigger: "route_savings_milestone",
    condition: "dd0c/route customer saves >$500/month AND does not have dd0c/cost",
    message: "🎉 dd0c/route saved you $X on LLM costs this month. Want to find savings on your AWS bill too? dd0c/cost monitors your AWS account for cost anomalies in real-time. [Try dd0c/cost →]"
  },
  {
    trigger: "cost_onboarding_complete",
    condition: "dd0c/cost customer completes onboarding AND does not have dd0c/route",
    message: "✅ dd0c/cost is monitoring your AWS account. If your team uses OpenAI, Anthropic, or other LLM APIs, dd0c/route can cut those costs by 30-50% with intelligent model routing. [Try dd0c/route →]"
  },
  {
    trigger: "combined_savings_report",
    condition: "Customer uses both products AND it's the 1st of the month",
    message: "📊 dd0c Monthly Savings Report\n• AWS cost anomalies caught: $X (dd0c/cost)\n• LLM routing savings: $Y (dd0c/route)\n• Total saved: $Z\n• dd0c subscription cost: $W\n• Net savings: $(Z-W) 🚀"
  }
];
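An illustrative evaluation of the first trigger; the customer-state shape and field names are assumptions for the sketch, not the production model:

```typescript
// Assumed per-customer state for evaluating cross-sell conditions.
interface CustomerState {
  hasRoute: boolean;
  hasCost: boolean;
  routeMonthlySavings: number; // USD saved by dd0c/route this month
}

// "dd0c/route customer saves >$500/month AND does not have dd0c/cost"
function shouldSendRouteSavingsMilestone(c: CustomerState): boolean {
  return c.hasRoute && !c.hasCost && c.routeMonthlySavings > 500;
}
```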

Future Integration: Combined Cost Intelligence (V3+)

When a customer uses both products, dd0c has a unique data advantage:

  • dd0c/route knows: which services make LLM API calls, how much they spend on each model, and which calls could be routed to cheaper models
  • dd0c/cost knows: which AWS resources are running, their cost, and which are anomalous or idle

Combined insight example:

"Your recommendation-service (ECS, us-east-1) costs $1,800/month in compute AND makes $3,200/month in GPT-4o API calls. dd0c/route can cut the API costs to $1,100/month by routing 60% of calls to Claude Haiku. And the ECS service is over-provisioned — you're running 4 tasks but CPU never exceeds 30%. Scaling to 2 tasks saves $900/month. Total potential savings: $3,000/month."

No single-product competitor can deliver this insight. It requires both infrastructure cost data AND application-level API cost data. This is the platform moat.
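The arithmetic behind that example, spelled out:

```typescript
// Numbers from the combined-insight example above.
const apiSavings = 3200 - 1100; // GPT-4o spend cut by routing 60% of calls to Claude Haiku
const computeSavings = 900;     // 4 ECS tasks scaled down to 2
const totalMonthlySavings = apiSavings + computeSavings; // $3,000/month
```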

API: Cross-Product Account Linking

POST /v1/accounts/{accountId}/link
{
  "product": "route",
  "routeAccountId": "rt_01HXYZ..."
}

Response: 200 OK
{
  "linked": true,
  "sharedTenantId": "tn_01HXYZ...",
  "enabledFeatures": ["combined_savings_report", "cross_sell_suppressed"]
}

When accounts are linked, cross-sell messages are suppressed (the customer already uses both products) and combined reporting is enabled.


APPENDIX A: DECISION LOG

| # | Decision | Alternatives | Rationale | Revisit Trigger |
|---|---|---|---|---|
| 1 | Lambda over ECS for all compute | ECS Fargate | Zero ops, pay-per-invocation, auto-scale to zero. Solo founder can't afford container management. | >5,000 accounts or Lambda cold starts >2s on hot path |
| 2 | DynamoDB single-table over PostgreSQL | Aurora PostgreSQL, Aurora Serverless | No connection pooling, no vacuum, no patching. On-demand pricing = $0 at zero traffic. | V2 dashboard needs complex aggregation queries → add Aurora as read replica |
| 3 | EventBridge over Kinesis for ingestion | Kinesis Data Streams | Native cross-account event routing. Content-based filtering at source. Kinesis requires shard management. | >10,000 accounts or need sub-second ordering guarantees |
| 4 | SQS FIFO over Standard | SQS Standard | Exactly-once processing prevents duplicate anomaly alerts. Message group per account ensures ordering. | FIFO throughput limit (3,000 msg/sec) becomes bottleneck |
| 5 | Static pricing tables over real-time Price List API | AWS Price List API per-request | Price List API is 2-5s per query. Unacceptable in hot path. Pricing changes quarterly. Weekly batch update is sufficient. | AWS introduces hourly pricing changes (unlikely) |
| 6 | V1 ships suggestions, not one-click remediation | Ship remediation in V1 | Trust must be earned before taking action on customer resources. Suggestions deliver 80% of value at 0% risk. | Design partners request buttons AND false positive rate <15% |
| 7 | TypeScript over Python | Python, Go | Same language for CDK + Lambda + API. Faster cold starts than Python. Type safety across stack. | Team grows and prefers Python (unlikely for solo founder) |
| 8 | Cognito over Auth0/Clerk | Auth0, Clerk, Supabase Auth | Free <50K MAU. Native API Gateway integration. No vendor dependency. | Cognito UX becomes a conversion bottleneck (common complaint) → migrate to Clerk |
| 9 | Single-region (us-east-1) for V1 | Multi-region | Simplicity. EventBridge cross-account works within a region. Multi-region adds complexity for zero benefit at <100 accounts. | Non-US customers >30% of base OR availability SLA requirement |
| 10 | No web dashboard in V1 | Ship basic dashboard | Slack-first means no dashboard needed for core workflow. Dashboard is engineering time that doesn't improve detection or alerting. | >50% of design partners request anomaly history outside Slack |

APPENDIX B: CLOUDTRAIL EVENT REFERENCE

Quick reference for CloudTrail event structures used by dd0c/cost's event processor.

EC2 RunInstances

{
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "requestParameters": {
    "instanceType": "p3.2xlarge",
    "minCount": 2,
    "maxCount": 2,
    "imageId": "ami-0abc123..."
  },
  "responseElements": {
    "instancesSet": {
      "items": [
        { "instanceId": "i-0abc123..." },
        { "instanceId": "i-0def456..." }
      ]
    }
  }
}

Extraction: instanceType from requestParameters, instanceId from responseElements.instancesSet.items[*], count from array length.
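The extraction rule can be sketched against a trimmed record; `extractRunInstances` and the `Ec2Launch` shape are illustrative, not the production processor:

```typescript
interface Ec2Launch {
  instanceType: string;
  instanceIds: string[];
  count: number;
}

// Pull instanceType from requestParameters and the launched instance IDs
// from responseElements.instancesSet.items; count is the array length.
function extractRunInstances(event: any): Ec2Launch {
  const items: Array<{ instanceId: string }> =
    event.responseElements?.instancesSet?.items ?? [];
  return {
    instanceType: event.requestParameters.instanceType,
    instanceIds: items.map((i) => i.instanceId),
    count: items.length,
  };
}

const launch = extractRunInstances({
  eventSource: "ec2.amazonaws.com",
  eventName: "RunInstances",
  requestParameters: { instanceType: "p3.2xlarge", minCount: 2, maxCount: 2 },
  responseElements: {
    instancesSet: { items: [{ instanceId: "i-0abc" }, { instanceId: "i-0def" }] },
  },
});
```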

RDS CreateDBInstance

{
  "eventSource": "rds.amazonaws.com",
  "eventName": "CreateDBInstance",
  "requestParameters": {
    "dBInstanceIdentifier": "my-database",
    "dBInstanceClass": "db.r5.4xlarge",
    "engine": "postgres",
    "multiAZ": true,
    "allocatedStorage": 100
  }
}

Extraction: dBInstanceClass for pricing lookup. multiAZ: true doubles the cost. allocatedStorage for storage cost estimate.
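The multi-AZ doubling rule, as a sketch; the hourly rate parameter is a placeholder that would come from the static pricing tables:

```typescript
// Estimate hourly RDS instance cost from a CreateDBInstance event.
// Multi-AZ runs a synchronous standby, so the instance rate doubles.
function estimateRdsHourly(event: any, instanceHourlyRate: number): number {
  return event.requestParameters.multiAZ ? instanceHourlyRate * 2 : instanceHourlyRate;
}

const hourly = estimateRdsHourly(
  { requestParameters: { dBInstanceClass: "db.r5.4xlarge", multiAZ: true } },
  2.0 // placeholder $/hr for db.r5.4xlarge; illustrative only
);
```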

EC2 CreateNatGateway

{
  "eventSource": "ec2.amazonaws.com",
  "eventName": "CreateNatGateway",
  "requestParameters": {
    "subnetId": "subnet-0abc123...",
    "allocationId": "eipalloc-0def456..."
  },
  "responseElements": {
    "CreateNatGatewayResponse": {
      "natGateway": {
        "natGatewayId": "nat-0ghi789..."
      }
    }
  }
}

Extraction: NAT Gateway pricing is flat ($0.045/hr + $0.045/GB). No instance type to look up. Alert on creation because NAT Gateways are notorious silent cost bombs.
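The flat-rate math as a sketch, using the figures above; a forgotten gateway left running for a 730-hour month bills roughly $32.85 before any data processing:

```typescript
const NAT_HOURLY = 0.045; // $/hr while the gateway exists
const NAT_PER_GB = 0.045; // $/GB processed

// Monthly cost of a NAT Gateway: hours it existed plus GB it processed.
function natMonthlyCost(hoursRunning: number, gbProcessed: number): number {
  return NAT_HOURLY * hoursRunning + NAT_PER_GB * gbProcessed;
}

const idleMonth = natMonthlyCost(730, 0); // idle gateway, zero traffic
```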


This architecture document is a living artifact. It will be updated as V1 development reveals implementation realities, design partner feedback reshapes priorities, and scaling demands evolve. The core architectural bet — real-time CloudTrail event stream processing as the speed layer, with CUR reconciliation as the accuracy layer — is the foundation that everything else builds on. If that bet is wrong, the product is wrong. If it's right, dd0c/cost has an architectural moat that batch-processing competitors can't easily cross.