dd0c/cost — Technical Architecture
AWS Cost Anomaly Detective
Version: 1.0
Date: February 28, 2026
Author: Architecture (Phase 6)
Status: Draft
Audience: Senior AWS architects, founding engineers
1. SYSTEM OVERVIEW
High-Level Architecture
graph TB
subgraph "Customer AWS Account"
CT[CloudTrail] -->|Events| EB[EventBridge Rule]
CUR_S3[CUR Export → S3 Bucket]
IAM_RO[IAM Role: dd0c-cost-readonly]
IAM_REM[IAM Role: dd0c-cost-remediate<br/>opt-in]
end
subgraph "dd0c/cost Platform (dd0c AWS Account)"
subgraph "Layer 1: Real-Time Event Stream"
EB -->|Cross-account EventBridge| EB_TARGET[EventBridge Target Bus]
EB_TARGET --> SQS_INGEST[SQS: event-ingestion<br/>FIFO, dedup]
SQS_INGEST --> LAMBDA_PROC[Lambda: event-processor<br/>CloudTrail → CostEvent normalization]
LAMBDA_PROC --> DDB_BASELINE[DynamoDB: baselines<br/>per-account, per-service]
LAMBDA_PROC --> ANOMALY[Lambda: anomaly-scorer<br/>Z-score + heuristics]
ANOMALY --> SQS_ALERT[SQS: alert-queue]
end
subgraph "Layer 2: CUR Reconciliation (V2)"
CUR_S3 -->|S3 Replication or<br/>cross-account read| S3_CUR[S3: cur-data-lake]
S3_CUR --> ATHENA[Athena: CUR queries]
ATHENA --> LAMBDA_RECON[Lambda: reconciler<br/>daily batch]
LAMBDA_RECON --> DDB_BASELINE
end
subgraph "Notification & Remediation"
SQS_ALERT --> LAMBDA_NOTIFY[Lambda: notifier]
LAMBDA_NOTIFY --> SLACK[Slack API<br/>Block Kit messages]
SLACK -->|Interactive payload| APIGW[API Gateway]
APIGW --> LAMBDA_ACTION[Lambda: action-handler<br/>Stop/Terminate/Snooze]
LAMBDA_ACTION -->|AssumeRole| IAM_REM
end
subgraph "Data Layer"
DDB_ACCOUNTS[DynamoDB: accounts<br/>tenant config, Slack tokens]
DDB_ANOMALIES[DynamoDB: anomalies<br/>event log, status]
DDB_BASELINE
end
subgraph "API & Onboarding"
APIGW_REST[API Gateway: REST API]
LAMBDA_API[Lambda: api-handlers]
CF_TEMPLATE[S3: CloudFormation templates]
end
subgraph "Scheduled Jobs"
EB_CRON[EventBridge Scheduler]
EB_CRON --> LAMBDA_ZOMBIE[Lambda: zombie-hunter<br/>daily scan]
EB_CRON --> LAMBDA_DIGEST[Lambda: daily-digest]
EB_CRON --> LAMBDA_RECON
end
end
LAMBDA_ZOMBIE -->|DescribeInstances, etc.| IAM_RO
LAMBDA_ACTION -->|StopInstances, etc.| IAM_REM
style CT fill:#ff9900,color:#000
style EB fill:#ff9900,color:#000
style CUR_S3 fill:#ff9900,color:#000
style SLACK fill:#4a154b,color:#fff
Component Inventory
| Component | Responsibility | AWS Service | Justification |
|---|---|---|---|
| Event Ingestion | Receive CloudTrail events in real-time, normalize to CostEvent schema | EventBridge + SQS FIFO + Lambda | EventBridge for cross-account event routing; SQS FIFO for ordered, deduplicated processing; Lambda for stateless event transformation |
| Anomaly Scorer | Compare incoming CostEvents against baselines, flag anomalies | Lambda | Stateless scoring function. Sub-second execution. No persistent compute needed. |
| Baseline Store | Per-account, per-service spending pattern storage | DynamoDB | Single-digit ms reads for hot-path scoring. On-demand capacity. Pay-per-request at low scale. |
| Anomaly Log | Immutable record of all detected anomalies and their resolution status | DynamoDB | Queryable by account, time range, severity. TTL for automatic retention enforcement. |
| Account Registry | Tenant configuration, Slack tokens, IAM role ARNs, preferences | DynamoDB | Low-volume, high-read. Single-table design with account_id partition key. |
| Notifier | Format and deliver Slack Block Kit messages | Lambda + Slack API | Stateless. Slack rate limits handled via SQS backpressure. |
| Action Handler | Process Slack interactive payloads (Stop, Terminate, Snooze) | API Gateway + Lambda | API Gateway receives Slack webhook POST, Lambda executes remediation via cross-account AssumeRole. |
| Zombie Hunter | Daily scan for idle/orphaned resources across connected accounts | Lambda (scheduled) | EventBridge Scheduler triggers daily. Scans EC2, EBS, EIP, ELB via DescribeInstances/Volumes/Addresses. |
| Daily Digest | Compile and send daily spend summary + anomaly recap | Lambda (scheduled) | Aggregates DynamoDB anomaly data, formats Slack digest. |
| CUR Reconciler | Process CUR data for ground-truth billing validation (V2) | S3 + Athena + Lambda | Athena for serverless SQL over CUR Parquet files. Lambda orchestrates daily query + baseline update. |
| REST API | Account onboarding, anomaly queries, configuration | API Gateway + Lambda | Standard REST. API Gateway handles auth (Cognito JWT or API key). |
| CloudFormation Templates | Customer onboarding IAM role provisioning | S3-hosted CF templates | One-click deploy. Pre-signed URL from onboarding flow. |
Technology Choices
| Decision | Choice | Alternatives Considered | Rationale |
|---|---|---|---|
| Compute | AWS Lambda | ECS Fargate, EC2 | Lambda is the only sane choice for a solo founder. Zero ops. Pay-per-invocation. CloudTrail events are bursty — Lambda scales to zero between bursts. At 100 accounts, we're looking at ~50K-500K events/day — well within Lambda concurrency limits. ECS only makes sense at 1000+ accounts when Lambda cold starts or 15-min timeout become constraints. |
| Event Bus | EventBridge | SNS, Kinesis Data Streams | EventBridge supports cross-account event routing natively (critical for receiving customer CloudTrail events). Content-based filtering reduces Lambda invocations to only cost-relevant events. SNS lacks filtering granularity. Kinesis is overkill at V1 scale and adds shard management overhead. |
| Queue | SQS FIFO | SQS Standard, Kinesis | FIFO provides exactly-once processing and message deduplication (CloudTrail can emit duplicate events). Message group ID = account_id ensures per-account ordering. Standard SQS risks duplicate processing and out-of-order anomaly scoring. |
| Database | DynamoDB | PostgreSQL (RDS/Aurora), TimescaleDB | DynamoDB eliminates all database ops. No patching, no connection pooling, no vacuum. Single-table design handles accounts, baselines, and anomalies. On-demand pricing means $0 at zero traffic. PostgreSQL would be better for complex queries (V2 dashboard) but adds operational burden a solo founder can't afford. Migrate to Aurora when dashboard launches. |
| CUR Processing | S3 + Athena | Redshift, BigQuery, PostgreSQL | Athena is serverless SQL over S3. Zero infrastructure. Pay per query (~$5/TB scanned). CUR data is already in Parquet on S3. Redshift is overkill and expensive. This is a daily batch job, not real-time analytics. |
| API Layer | API Gateway (REST) | ALB + ECS, AppSync | API Gateway + Lambda is the standard serverless REST pattern. No servers to manage. Built-in throttling, API keys, and Cognito integration. AppSync (GraphQL) is unnecessary complexity for V1's simple CRUD operations. |
| Auth | Cognito User Pools | Auth0, Clerk, custom JWT | Cognito is free for <50K MAU. Native API Gateway integration. Supports GitHub/Google federation via OIDC. Auth0/Clerk are better products but add vendor dependency and cost. At $19/account/month, every dollar of infrastructure cost matters. |
| IaC | AWS CDK (TypeScript) | Terraform, SAM, CloudFormation raw | CDK generates CloudFormation but with TypeScript type safety and constructs. Same language as Lambda handlers (TypeScript). SAM is too limited for cross-account EventBridge patterns. Terraform adds a state management burden. |
| Language | TypeScript (Node.js 20) | Python, Go, Rust | TypeScript for Lambda cold start performance (faster than Python), type safety across the stack (CDK + Lambda + API), and npm ecosystem for Slack SDK. Go would be faster but slower to iterate for a solo founder. Rust is overkill. |
Two-Layer Architecture: Speed + Accuracy
The core architectural insight is that no single data source provides both real-time speed and billing accuracy. dd0c/cost resolves this with two complementary layers:
| | Layer 1: Event Stream | Layer 2: CUR Reconciliation |
|---|---|---|
| Data Source | CloudTrail events via EventBridge | AWS Cost & Usage Report (CUR 2.0) |
| Latency | 5-60 seconds from resource creation | 12-24 hours (CUR delivery to S3) |
| Accuracy | ~85% (on-demand pricing estimate) | 99%+ (includes RIs, SPs, Spot, credits) |
| Granularity | Individual API call (RunInstances, CreateDBInstance) | Line-item billing with amortized costs |
| V1 Status | ✅ Core product | ❌ Deferred to V2 |
| Purpose | "ALERT: Someone just launched 4x p3.2xlarge" | "UPDATE: Confirmed 48hr cost was $1,175 (not $1,411 — Savings Plan applied)" |
Why both layers matter:
Layer 1 catches the fire in real-time. It uses on-demand pricing as the cost estimate because that's the worst-case scenario and the only price available at event time. If the customer has Reserved Instances or Savings Plans covering the resource, the actual cost is lower — but you'd rather over-alert on a covered resource than miss an uncovered one.
Layer 2 reconciles with ground truth. When CUR data arrives, dd0c updates the anomaly record with the actual billed amount. If Layer 1 estimated $1,411 but the actual cost was $1,175 (Savings Plan discount), the anomaly record is updated and the customer sees the correction. This builds trust over time — the system gets more accurate as it learns which resources are covered by commitments.
V1 ships Layer 1 only. Layer 2 is deferred to V2 because CUR setup requires additional customer configuration (enabling CUR export, S3 bucket policy for cross-account access) which adds onboarding friction. V1's goal is 5-minute onboarding. CUR adds 15-20 minutes of AWS Console clicking. Not worth it for launch.
2. CORE COMPONENTS
2.1 CloudTrail Ingestion Pipeline
The ingestion pipeline is the heart of dd0c/cost. It transforms raw CloudTrail events into normalized CostEvents in real-time.
Architecture
sequenceDiagram
participant CT as Customer CloudTrail
participant EB_C as Customer EventBridge
participant EB_D as dd0c EventBridge
participant SQS as SQS FIFO
participant LP as Lambda: event-processor
participant DDB as DynamoDB
participant AS as Lambda: anomaly-scorer
CT->>EB_C: CloudTrail event (e.g. RunInstances)
EB_C->>EB_D: Cross-account event (EventBridge rule)
EB_D->>SQS: Filtered event → FIFO queue
SQS->>LP: Batch poll (up to 10 messages)
LP->>LP: Normalize to CostEvent schema
LP->>LP: Estimate hourly cost (pricing lookup)
LP->>DDB: Write CostEvent
LP->>AS: Invoke anomaly scorer (async)
AS->>DDB: Read baseline for account+service
AS->>AS: Z-score calculation
AS-->>SQS: If anomaly → alert-queue
CloudTrail Events That Signal Cost Anomalies
Not all CloudTrail events are cost-relevant. dd0c filters at the EventBridge level to process only events that create, modify, or scale billable resources. This is critical — a busy AWS account generates 10,000-100,000+ CloudTrail events/day. We need to process <1% of them.
V1 Monitored Events (EC2 + RDS + Lambda):
| Service | CloudTrail Event | Cost Signal | Estimated Impact |
|---|---|---|---|
| EC2 | `RunInstances` | New instance(s) launched | Instance type → hourly rate. p3.2xlarge = $3.06/hr, p4d.24xlarge = $32.77/hr |
| EC2 | `StartInstances` | Stopped instance restarted | Same as RunInstances — billing resumes |
| EC2 | `ModifyInstanceAttribute` (instanceType change) | Instance resized | Delta between old and new instance type hourly rate |
| EC2 | `CreateNatGateway` | NAT Gateway created | $0.045/hr + $0.045/GB processed. Silent cost bomb. |
| EC2 | `AllocateAddress` | Elastic IP allocated | $0.005/hr per public IPv4 address (billed whether attached or not since February 2024). Small but zombie indicator. |
| EC2 | `CreateVolume` | EBS volume created | gp3: $0.08/GB-month. io2: up to $0.125/GB-month + $0.065/IOPS-month |
| EC2 | `RunScheduledInstances` | Scheduled instance launched | Same pricing model as RunInstances |
| RDS | `CreateDBInstance` | New database instance | db.r5.4xlarge = $2.016/hr. Multi-AZ doubles it. |
| RDS | `ModifyDBInstance` (class change) | Database resized | Delta between old and new instance class |
| RDS | `RestoreDBInstanceFromDBSnapshot` | Database restored from snapshot | Same as CreateDBInstance — new billable instance |
| RDS | `CreateDBCluster` | Aurora cluster created | Writer + reader instances, per-instance pricing |
| Lambda | `CreateFunction20150331` / `UpdateFunctionConfiguration20150331v2` | Function created/config changed | Memory × duration × invocations. Alert only if memory >1GB or timeout >60s (high-cost config) |
| Lambda | `PutProvisionedConcurrencyConfig` | Provisioned concurrency set | $0.0000041667/GB-second provisioned. Can be expensive at scale. |
V2 Expansion Targets:
| Service | Key Events | Why Deferred |
|---|---|---|
| ECS/Fargate | `CreateService`, `UpdateService` (desiredCount) | Requires parsing task definition for resource allocation |
| SageMaker | `CreateEndpoint`, `CreateNotebookInstance`, `CreateTrainingJob` | High-value targets but lower customer prevalence in V1 beachhead |
| ElastiCache | `CreateCacheCluster`, `ModifyReplicationGroup` | Moderate cost, lower urgency |
| Redshift | `CreateCluster`, `ResizeCluster` | Enterprise service, outside V1 beachhead |
| EKS | `CreateNodegroup`, `UpdateNodegroupConfig` | Requires K8s-level cost attribution (complex) |
| OpenSearch | `CreateDomain`, `UpdateDomainConfig` | Moderate cost, lower urgency |
EventBridge Rule (Customer-Side)
The CloudFormation template deploys this EventBridge rule in the customer's account:
{
"source": ["aws.ec2", "aws.rds", "aws.lambda"],
"detail-type": ["AWS API Call via CloudTrail"],
"detail": {
"eventSource": [
"ec2.amazonaws.com",
"rds.amazonaws.com",
"lambda.amazonaws.com"
],
"eventName": [
"RunInstances",
"StartInstances",
"ModifyInstanceAttribute",
"CreateNatGateway",
"AllocateAddress",
"CreateVolume",
"CreateDBInstance",
"ModifyDBInstance",
"RestoreDBInstanceFromDBSnapshot",
"CreateDBCluster",
"CreateFunction20150331",
"UpdateFunctionConfiguration20150331v2",
"PutProvisionedConcurrencyConfig"
]
}
}
The rule target is dd0c's EventBridge bus in dd0c's AWS account (cross-account event bus policy). This means:
- Customer CloudTrail events matching the filter are forwarded in real-time
- Only cost-relevant events leave the customer's account (not all CloudTrail)
- No agent or daemon runs in the customer's account
- EventBridge cross-account delivery is near-instant (<5 seconds typical)
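The filtering behavior can be illustrated with a small local matcher. This is a simplified sketch, not EventBridge's matching engine: it only handles exact-value lists, which is the only operator this rule uses. The names `IncomingEvent`, `COST_EVENT_NAMES`, and `matchesCostRule` are hypothetical and exist only for illustration.

```typescript
// Trimmed shape of an EventBridge-delivered CloudTrail event (assumed for this sketch).
interface IncomingEvent {
  source: string;
  "detail-type": string;
  detail: { eventSource: string; eventName: string };
}

const COST_SOURCES = ["aws.ec2", "aws.rds", "aws.lambda"];
const COST_EVENT_SOURCES = ["ec2.amazonaws.com", "rds.amazonaws.com", "lambda.amazonaws.com"];
const COST_EVENT_NAMES = [
  "RunInstances", "StartInstances", "ModifyInstanceAttribute", "CreateNatGateway",
  "AllocateAddress", "CreateVolume", "CreateDBInstance", "ModifyDBInstance",
  "RestoreDBInstanceFromDBSnapshot", "CreateDBCluster", "CreateFunction20150331",
  "UpdateFunctionConfiguration20150331v2", "PutProvisionedConcurrencyConfig",
];

// An event matches only if every field appears in its allowed list.
function matchesCostRule(event: IncomingEvent): boolean {
  return (
    COST_SOURCES.indexOf(event.source) !== -1 &&
    event["detail-type"] === "AWS API Call via CloudTrail" &&
    COST_EVENT_SOURCES.indexOf(event.detail.eventSource) !== -1 &&
    COST_EVENT_NAMES.indexOf(event.detail.eventName) !== -1
  );
}
```

A matcher like this is mainly useful in unit tests, to confirm that read-only calls such as `DescribeInstances` never reach the ingestion queue.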
Cost Estimation Engine
When a CostEvent arrives, dd0c estimates the hourly cost using a static pricing table:
// pricing/ec2-on-demand.ts
// Updated monthly from AWS Price List API (bulk JSON)
// Keyed by region + instance type
const EC2_PRICING: Record<string, Record<string, number>> = {
"us-east-1": {
"t3.micro": 0.0104,
"t3.medium": 0.0416,
"m5.xlarge": 0.192,
"m5.2xlarge": 0.384,
"c5.2xlarge": 0.34,
"r5.4xlarge": 1.008,
"p3.2xlarge": 3.06,
"p3.8xlarge": 12.24,
"p4d.24xlarge": 32.7726,
"g5.xlarge": 1.006,
"g5.12xlarge": 5.672,
// ... full table from AWS Price List API
},
// ... all regions
};
interface CostEstimate {
hourlyRate: number; // On-demand $/hr
dailyRate: number; // hourlyRate × 24
monthlyRate: number; // hourlyRate × 730
confidence: "on-demand"; // V1 always on-demand. V2 adds RI/SP awareness.
disclaimer: string; // "Estimated on-demand pricing. Actual cost may differ with RIs/Savings Plans."
}
Why static pricing tables, not real-time Price List API calls:
- AWS Price List API is slow (~2-5 seconds for a single query). Unacceptable in the hot path.
- Pricing changes infrequently (quarterly at most for most instance types).
- A monthly cron job pulls the full Price List bulk JSON (~1.5GB), extracts EC2/RDS/Lambda pricing, and writes to a DynamoDB table or bundled JSON file in the Lambda deployment package.
- At V1 scale (<100 accounts), the pricing table fits in Lambda memory (~50MB for all regions + services).
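A minimal sketch of the lookup itself, assuming the table is already loaded in memory. The function name `estimateEc2Cost` and the choice to return `null` for an unknown instance type are assumptions made here, not something the design specifies. The three prices are the us-east-1 values quoted above.

```typescript
interface CostEstimate {
  hourlyRate: number;
  dailyRate: number;
  monthlyRate: number;
  confidence: "on-demand";
  disclaimer: string;
}

// Tiny excerpt of the static table, keyed by region then instance type.
const EC2_PRICING: Record<string, Record<string, number>> = {
  "us-east-1": { "t3.medium": 0.0416, "m5.2xlarge": 0.384, "p3.2xlarge": 3.06 },
};

// Hypothetical helper: returns null when the region or type is not in the table,
// leaving it to the caller to decide how to treat unpriced resources.
function estimateEc2Cost(region: string, instanceType: string, quantity: number): CostEstimate | null {
  const regionTable = EC2_PRICING[region];
  const unitPrice = regionTable ? regionTable[instanceType] : undefined;
  if (unitPrice === undefined) return null;
  const hourlyRate = unitPrice * quantity;
  return {
    hourlyRate,
    dailyRate: hourlyRate * 24,
    monthlyRate: hourlyRate * 730,
    confidence: "on-demand",
    disclaimer: "Estimated on-demand pricing. Actual cost may differ with RIs/Savings Plans.",
  };
}
```

For four p3.2xlarge instances in us-east-1 this gives roughly $12.24/hr and $293.76/day, the figures used in the alert example later in this document.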
Event Processor Lambda
// functions/event-processor/handler.ts
interface CloudTrailEvent {
detail: {
eventSource: string;
eventName: string;
awsRegion: string;
userIdentity: {
type: string;
arn: string;
userName?: string;
sessionContext?: {
sessionIssuer?: { userName: string };
};
};
requestParameters: Record<string, any>;
responseElements: Record<string, any>;
eventTime: string;
eventID: string;
};
}
interface CostEvent {
pk: string; // ACCOUNT#<account_id>
sk: string; // EVENT#<timestamp>#<event_id>
accountId: string;
service: "ec2" | "rds" | "lambda";
action: string; // "RunInstances", "CreateDBInstance", etc.
resourceType: string; // "EC2 Instance", "RDS Instance", etc.
resourceId: string; // i-xxx, db-xxx
resourceSpec: string; // "p3.2xlarge", "db.r5.4xlarge"
region: string;
actor: string; // IAM user/role that performed the action
actorArn: string;
quantity: number; // Number of instances (RunInstances can launch multiple)
estimatedHourlyCost: number;
estimatedDailyCost: number;
estimatedMonthlyCost: number;
eventTime: string; // ISO 8601
cloudTrailEventId: string;
rawEvent: object; // Original CloudTrail event (stored for debugging, TTL'd)
ttl: number; // DynamoDB TTL — 90 days
}
The processor extracts the actor (who did it), the resource spec (what they created), estimates the cost, and writes a normalized CostEvent to DynamoDB. This normalization is critical — downstream components (anomaly scorer, notifier, API) never touch raw CloudTrail.
Actor extraction logic:
- `userIdentity.type === "IAMUser"` → `userIdentity.userName` (e.g., "sam@company.com")
- `userIdentity.type === "AssumedRole"` → `sessionContext.sessionIssuer.userName` (e.g., "terraform-deploy-role")
- `userIdentity.type === "Root"` → "Root account" (flag as high-severity regardless of cost)
- For assumed roles, dd0c also extracts the `sourceIdentity` or `principalId` to trace back to the human behind the role when possible.
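The extraction rules above could be sketched roughly as follows. The `Actor` shape, the `forceHighSeverity` flag, and the fallback to the ARN when a name is missing are assumptions added for illustration; the `sourceIdentity`/`principalId` tracing is left out.

```typescript
interface UserIdentity {
  type: string;
  arn: string;
  userName?: string;
  sessionContext?: { sessionIssuer?: { userName: string } };
}

// Hypothetical result shape: a display name plus a flag for root activity.
interface Actor {
  name: string;
  forceHighSeverity: boolean;
}

function extractActor(identity: UserIdentity): Actor {
  switch (identity.type) {
    case "IAMUser":
      return { name: identity.userName ?? identity.arn, forceHighSeverity: false };
    case "AssumedRole": {
      const issuer = identity.sessionContext?.sessionIssuer?.userName;
      return { name: issuer ?? identity.arn, forceHighSeverity: false };
    }
    case "Root":
      // Root activity is flagged regardless of estimated cost.
      return { name: "Root account", forceHighSeverity: true };
    default:
      // Other identity types (e.g. federated users) fall back to the raw ARN.
      return { name: identity.arn, forceHighSeverity: false };
  }
}
```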
2.2 Anomaly Detection Engine
V1 uses simple statistical heuristics. No ML. No neural networks. Just math that a solo founder can debug at 2 AM.
Baseline Learning
For each (account_id, service, resource_type) tuple, dd0c maintains a rolling baseline:
interface Baseline {
pk: string; // BASELINE#<account_id>
sk: string; // <service>#<resource_type> e.g., "ec2#instance"
accountId: string;
service: string;
resourceType: string;
// Rolling statistics (updated on every CostEvent)
mean: number; // Mean hourly cost for this service
stddev: number; // Standard deviation
sampleCount: number; // Number of events in the window
maxObserved: number; // Highest single-event cost ever seen
// Time-windowed (last 30 days)
windowStart: string; // ISO 8601
windowEvents: number[]; // Array of hourly costs (last 30 days, compacted daily)
// Learned patterns
expectedInstanceTypes: string[]; // Instance types seen >3 times (e.g., ["t3.medium", "m5.xlarge"])
expectedActors: string[]; // IAM users/roles that regularly create resources
// User overrides
sensitivityOverride?: "low" | "medium" | "high";
suppressedResourceTypes?: string[];
updatedAt: string;
}
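The design does not say how `mean` and `stddev` are kept current on every CostEvent. One common option is Welford's online algorithm, sketched below as an assumption rather than as dd0c's actual method. The `m2` field (running sum of squared deviations) is hypothetical and not part of `Baseline`, and the sketch ignores the 30-day window, which would need separate handling.

```typescript
// Running statistics that can be updated in O(1) per event.
interface RunningStats {
  sampleCount: number;
  mean: number;
  m2: number; // sum of squared deviations from the current mean
}

// Welford's update: fold one new hourly cost into the running statistics.
function updateStats(stats: RunningStats, hourlyCost: number): RunningStats {
  const sampleCount = stats.sampleCount + 1;
  const delta = hourlyCost - stats.mean;
  const mean = stats.mean + delta / sampleCount;
  const m2 = stats.m2 + delta * (hourlyCost - mean);
  return { sampleCount, mean, m2 };
}

// Sample standard deviation; returns 0 until there are at least two samples.
function sampleStddev(stats: RunningStats): number {
  return stats.sampleCount > 1 ? Math.sqrt(stats.m2 / (stats.sampleCount - 1)) : 0;
}
```

The appeal for a hot-path Lambda is that no history needs to be read back: one DynamoDB item holds everything required for the next update.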
Cold start problem: A new account has no baseline. For the first 14 days, dd0c uses absolute thresholds instead of statistical baselines:
| Severity | Threshold | Example |
|---|---|---|
| INFO | Any new resource >$0.50/hr | t3.large ($0.0832/hr) — no alert. m5.2xlarge ($0.384/hr) — no alert. r5.4xlarge ($1.008/hr) — INFO. |
| WARNING | Any new resource >$5.00/hr | p3.2xlarge ($3.06/hr) — INFO. p3.8xlarge ($12.24/hr) — WARNING. |
| CRITICAL | Any new resource >$25.00/hr | p4d.24xlarge ($32.77/hr) — CRITICAL. |
| CRITICAL | Any root account action creating billable resources | Always CRITICAL regardless of cost. |
After 14 days with ≥20 events, the system transitions to statistical scoring.
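The cold-start table translates into a short threshold ladder. The sketch below assumes the thresholds are strict (`>`), as written in the table; the function names and the `accountAgeDays` parameter are hypothetical.

```typescript
type Severity = "none" | "info" | "warning" | "critical";

// Absolute thresholds used while an account has no usable baseline.
function coldStartSeverity(hourlyCost: number, actorType: string): Severity {
  if (actorType === "Root") return "critical"; // root creating billable resources
  if (hourlyCost > 25.0) return "critical";
  if (hourlyCost > 5.0) return "warning";
  if (hourlyCost > 0.5) return "info";
  return "none";
}

// Switch to statistical scoring only once both conditions hold.
function useStatisticalScoring(accountAgeDays: number, sampleCount: number): boolean {
  return accountAgeDays >= 14 && sampleCount >= 20;
}
```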
Anomaly Scoring Algorithm
function scoreAnomaly(event: CostEvent, baseline: Baseline): AnomalyScore {
const scores: number[] = [];
// Signal 1: Z-score against baseline mean
if (baseline.sampleCount >= 20) {
const zScore = (event.estimatedHourlyCost - baseline.mean) / Math.max(baseline.stddev, 0.01);
scores.push(zScore);
}
// Signal 2: Instance type novelty
// New instance type never seen before in this account = suspicious
if (!baseline.expectedInstanceTypes.includes(event.resourceSpec)) {
scores.push(3.0); // Equivalent to 3-sigma event
}
// Signal 3: Actor novelty
// New actor creating expensive resources = suspicious
if (!baseline.expectedActors.includes(event.actor)) {
scores.push(2.0); // Moderate suspicion
}
// Signal 4: Absolute cost threshold
// Regardless of baseline, very expensive resources always add a strong signal.
// Above $25/hr both tiers fire, so the event contributes two signals (4.0 and 6.0).
if (event.estimatedHourlyCost > 10.0) scores.push(4.0);
if (event.estimatedHourlyCost > 25.0) scores.push(6.0);
// Signal 5: Quantity anomaly
// Launching 10 instances at once when baseline is 1-2
if (event.quantity > 3) scores.push(2.5);
// Signal 6: Time-of-day anomaly
// Resource creation at 2 AM local time = suspicious
const hour = new Date(event.eventTime).getUTCHours();
// TODO: Convert to account's local timezone
if (hour >= 0 && hour <= 5) scores.push(1.5);
// Composite score: unweighted mean of all signals that fired
const compositeScore = scores.length > 0
? scores.reduce((a, b) => a + b, 0) / scores.length
: 0;
// Multiple signals compound confidence
const confidenceMultiplier = Math.min(scores.length / 3, 2.0);
const finalScore = compositeScore * confidenceMultiplier;
return {
score: finalScore,
severity: classifySeverity(finalScore),
signals: scores.length,
breakdown: { /* individual signal details for alert context */ },
};
}
function classifySeverity(score: number): "none" | "info" | "warning" | "critical" {
if (score < 1.5) return "none"; // Below threshold — no alert
if (score < 3.0) return "info"; // Mild anomaly — daily digest only
if (score < 5.0) return "warning"; // Significant — immediate Slack alert
return "critical"; // Severe — immediate Slack alert + @channel mention
}
Why composite scoring matters: A single signal (e.g., high Z-score) might be a false positive. But high Z-score + novel instance type + novel actor + off-hours = almost certainly a real anomaly. The composite approach dramatically reduces false positives while maintaining sensitivity to genuine threats.
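To make the arithmetic concrete, here is the combination step from `scoreAnomaly` pulled out on its own and applied to two cases. The helper name `combineSignals` is introduced here for illustration only; the formula is the one shown above.

```typescript
// Mean of the fired signals, scaled by a confidence multiplier that grows
// with the number of signals and is capped at 2.0.
function combineSignals(scores: number[]): number {
  if (scores.length === 0) return 0;
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const confidenceMultiplier = Math.min(scores.length / 3, 2.0);
  return mean * confidenceMultiplier;
}

// One signal alone (novel instance type, 3.0):
// mean 3.0 × multiplier 1/3 ≈ 1.0
const single = combineSignals([3.0]);

// Four signals (Z-score 4.0, novel type 3.0, novel actor 2.0, off-hours 1.5):
// mean 2.625 × multiplier 4/3 ≈ 3.5
const combined = combineSignals([4.0, 3.0, 2.0, 1.5]);
```

Under `classifySeverity`, the single-signal case (about 1.0) stays below the 1.5 alert threshold, while the four-signal case (about 3.5) lands in the warning band. The same multiplier also damps the absolute cost signal, so an expensive resource with no other signals can score low.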
Sensitivity tuning: Users can override per-service sensitivity via Slack command or API:
- LOW: Only CRITICAL alerts (>$25/hr or composite score >5.0)
- MEDIUM (default): WARNING + CRITICAL
- HIGH: INFO + WARNING + CRITICAL (noisy, for accounts that want maximum visibility)
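The three sensitivity levels amount to a minimum severity that must be reached before a notification goes out. A minimal sketch, with the rank tables and the function name `passesSensitivity` as assumptions:

```typescript
type Severity = "none" | "info" | "warning" | "critical";
type Sensitivity = "low" | "medium" | "high";

const SEVERITY_RANK: Record<Severity, number> = { none: 0, info: 1, warning: 2, critical: 3 };

// Lowest severity that still gets through at each sensitivity level.
const MIN_RANK: Record<Sensitivity, number> = { low: 3, medium: 2, high: 1 };

function passesSensitivity(severity: Severity, sensitivity: Sensitivity = "medium"): boolean {
  return SEVERITY_RANK[severity] >= MIN_RANK[sensitivity];
}
```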
Feedback Loop: "Mark as Expected"
When a user clicks [Mark as Expected] on a Slack alert:
- The anomaly record is updated with `status: "expected"`
- The resource spec and actor are added to the baseline's `expectedInstanceTypes` and `expectedActors`
- Future events matching this pattern score lower
- After 3 "Mark as Expected" clicks for the same pattern, the pattern is auto-suppressed with a notification: "We've auto-suppressed alerts for m5.2xlarge launches by terraform-deploy-role. You can re-enable in settings."
This is the primary mechanism for reducing false positives over time. The system learns what's normal for each account.
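A rough sketch of the feedback update, assuming a pattern is identified by resource spec plus actor. The `expectedCounts` and `suppressedPatterns` fields are hypothetical bookkeeping that the `Baseline` interface above does not define.

```typescript
interface LearnedBaseline {
  expectedInstanceTypes: string[];
  expectedActors: string[];
  expectedCounts: Record<string, number>; // hypothetical: clicks per "spec|actor" pattern
  suppressedPatterns: string[];           // hypothetical: auto-suppressed patterns
}

const AUTO_SUPPRESS_AFTER = 3;

// Append a value only if it is not already present.
function addUnique(list: string[], value: string): string[] {
  return list.indexOf(value) === -1 ? list.concat(value) : list;
}

// Returns a new baseline reflecting one "Mark as Expected" click.
function markAsExpected(baseline: LearnedBaseline, resourceSpec: string, actor: string): LearnedBaseline {
  const pattern = resourceSpec + "|" + actor;
  const clicks = (baseline.expectedCounts[pattern] || 0) + 1;
  return {
    expectedInstanceTypes: addUnique(baseline.expectedInstanceTypes, resourceSpec),
    expectedActors: addUnique(baseline.expectedActors, actor),
    expectedCounts: { ...baseline.expectedCounts, [pattern]: clicks },
    suppressedPatterns:
      clicks >= AUTO_SUPPRESS_AFTER
        ? addUnique(baseline.suppressedPatterns, pattern)
        : baseline.suppressedPatterns,
  };
}
```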
2.3 CUR Reconciliation (V2)
Deferred to V2 but architecturally planned now to avoid rework.
How It Works
- Customer enables CUR 2.0 export to an S3 bucket in their account (or dd0c's bucket via cross-account policy)
- CUR data arrives as Parquet files, typically within 12-24 hours of usage
- dd0c's daily reconciler Lambda:
  - Queries CUR via Athena for the previous day's line items
  - Matches CUR line items to Layer 1 CostEvents by resource ID + timestamp
  - Updates CostEvent records with actual billed amounts
  - Adjusts baselines with ground-truth cost data (replacing on-demand estimates)
  - If Layer 1 estimated $1,411 but actual was $1,175 (Savings Plan), updates the anomaly record
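The matching step might look something like the sketch below. It is deliberately simplified: it matches on resource ID only (the design also uses the timestamp), and it assumes each line item already carries a single `effectiveCost` chosen from the appropriate CUR column. All type and function names are hypothetical.

```typescript
interface CurLineItem {
  resourceId: string;
  effectiveCost: number; // assumed to be pre-selected from the relevant CUR cost column
}

interface EstimatedEvent {
  resourceId: string;
  estimatedCost: number;
  actualCost?: number;
  reconciled: boolean;
}

// Sum CUR costs per resource, then attach the total to matching Layer 1 events.
function reconcile(events: EstimatedEvent[], lineItems: CurLineItem[]): EstimatedEvent[] {
  const actualByResource: Record<string, number> = {};
  for (const item of lineItems) {
    actualByResource[item.resourceId] = (actualByResource[item.resourceId] || 0) + item.effectiveCost;
  }
  return events.map((event) => {
    const actual = actualByResource[event.resourceId];
    // Events with no CUR data yet stay unreconciled and are retried on the next run.
    return actual === undefined ? event : { ...event, actualCost: actual, reconciled: true };
  });
}
```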
CUR Athena Query Pattern
-- Daily reconciliation: get actual costs for resources flagged by Layer 1
SELECT
line_item_resource_id,
line_item_usage_start_date,
line_item_unblended_cost,
line_item_blended_cost,
savings_plan_savings_plan_effective_cost,
reservation_effective_cost,
pricing_term, -- "OnDemand", "Reserved", "Spot"
product_instance_type,
line_item_usage_account_id
FROM cur_database.cur_table
WHERE line_item_usage_start_date >= DATE_ADD('day', -1, CURRENT_DATE)
AND line_item_resource_id IN (
-- Resource IDs from Layer 1 anomalies in the last 48 hours
'i-0abc123def456', 'i-0xyz789ghi012'
)
AND line_item_line_item_type IN ('Usage', 'SavingsPlanCoveredUsage', 'DiscountedUsage')
ORDER BY line_item_usage_start_date;
Value of Reconciliation
- Accuracy correction: Layer 1 estimates assume on-demand pricing. CUR reveals actual cost after RI/SP/Spot discounts. Over time, this trains the baseline to use realistic costs, not worst-case estimates.
- False positive reduction: If an account has 80% Savings Plan coverage, Layer 1 will over-estimate costs by ~40%. CUR reconciliation corrects this, reducing future false positives for that account.
- Billing validation: CUR is the source of truth for AWS billing. Customers who need accurate cost reporting (Jordan persona) require this layer.
- Zombie cost validation: Layer 1 detects resource creation. CUR confirms ongoing cost. A resource that was created but immediately covered by a Reserved Instance isn't actually costing extra — CUR reveals this.
2.4 Remediation Engine
One-click remediation from Slack is the product's magic moment. The gap between "knowing" and "doing" is where money burns.
Remediation Actions (V1)
| Action | Slack Button | AWS API Call | Safety Guardrail |
|---|---|---|---|
| Stop Instance | `[Stop Instance]` | `ec2:StopInstances` | Confirmation dialog: "Stop i-0abc123? This will halt the instance but preserve data. EBS volumes remain attached." |
| Terminate Instance | `[Terminate + Snapshot]` | `ec2:CreateSnapshot` → `ec2:TerminateInstances` | Always creates EBS snapshot first. Confirmation dialog with instance details. 30-second undo window via Slack. |
| Snooze Alert | `[Snooze 1h/4h/24h]` | None (internal) | Suppresses re-alerting for the specified duration. Anomaly remains in log. |
| Mark as Expected | `[Expected ✓]` | None (internal) | Updates baseline. Adds pattern to expected list. See feedback loop above. |
V2 Remediation (deferred):
- Scale Down: `ec2:ModifyInstanceAttribute` to change instance type (requires stop/start)
- Schedule Shutdown: EventBridge Scheduler rule to stop instance at specified time
- Delete EBS Volume: `ec2:DeleteVolume` for unattached volumes
- Release Elastic IP: `ec2:ReleaseAddress`
- RDS Stop: `rds:StopDBInstance` (auto-restarts after 7 days — must warn user)
Safety Architecture
Remediation is the highest-risk feature. A bug that terminates a production database is an extinction-level event for dd0c.
Guardrails:
- Separate IAM role. Remediation uses the `dd0c-cost-remediate` role, which is separate from the read-only `dd0c-cost-readonly` role. Customers opt in to remediation by deploying an additional CloudFormation stack. The read-only role is deployed at onboarding; the remediation role is offered later, after trust is established.

- Explicit action scoping. The remediation IAM role allows ONLY:

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": ["ec2:StopInstances", "ec2:TerminateInstances", "ec2:CreateSnapshot"],
            "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*"],
            "Condition": {
              "StringEquals": { "aws:ResourceTag/dd0c-remediation": "enabled" }
            }
          },
          {
            "Effect": "Allow",
            "Action": "ec2:CreateSnapshot",
            "Resource": "arn:aws:ec2:*::snapshot/*"
          },
          {
            "Effect": "Allow",
            "Action": ["ec2:DescribeInstances", "ec2:DescribeVolumes"],
            "Resource": "*"
          }
        ]
      }

  The tag condition applies only to the mutating actions. EC2 Describe calls do not support resource-level permissions or tag conditions, and the snapshot being created carries no tags yet, so both sit in their own statements. In V1, remediation only works on resources tagged with `dd0c-remediation: enabled`. This is a deliberate friction — customers must explicitly tag resources they're comfortable having dd0c act on. Production databases won't have this tag.

- Confirmation dialogs. Every destructive action shows a Slack modal with resource details, estimated impact, and a confirm/cancel button. No single-click termination.

- Automatic EBS snapshots. Before any `TerminateInstances` call, dd0c creates snapshots of all attached EBS volumes. The snapshot IDs are included in the Slack confirmation message.

- Audit log. Every remediation action is logged to DynamoDB with: who clicked the button (Slack user ID), what action was taken, which resource, timestamp, and the result (success/failure). This is the customer's audit trail.

- Dry-run first. Before executing, dd0c calls the AWS API with `DryRun: true` to verify permissions and resource state. If the dry-run fails (e.g., instance already stopped, insufficient permissions), the user sees an error instead of a silent failure.
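One detail worth spelling out for the dry-run guardrail: an EC2 call made with `DryRun` set never returns a normal success. It fails with error code `DryRunOperation` when the caller has the required permissions and with `UnauthorizedOperation` when it does not. A handler therefore has to treat one specific error as the success case. The sketch below takes the error code as a plain string so it stays independent of any SDK; `interpretDryRun` and `DryRunOutcome` are hypothetical names.

```typescript
type DryRunOutcome = { ok: true } | { ok: false; reason: string };

// Map the error code from a DryRun=true EC2 call onto a user-facing outcome.
function interpretDryRun(errorCode: string | undefined, message?: string): DryRunOutcome {
  if (errorCode === "DryRunOperation") {
    return { ok: true }; // the real call would have been authorized
  }
  if (errorCode === "UnauthorizedOperation") {
    return { ok: false, reason: "dd0c does not have permission to act on this resource." };
  }
  if (errorCode === undefined) {
    // A dry run should always produce an error; no error is unexpected.
    return { ok: false, reason: "Dry run returned no error code; refusing to proceed." };
  }
  return { ok: false, reason: message || errorCode };
}
```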
Slack Interactive Message Flow
sequenceDiagram
participant S as Slack
participant AG as API Gateway
participant LH as Lambda: action-handler
participant DDB as DynamoDB
participant AWS_C as Customer AWS Account
S->>AG: POST /slack/actions (interactive payload)
AG->>LH: Invoke with payload
LH->>LH: Verify Slack signature (HMAC-SHA256)
LH->>DDB: Look up anomaly record + account config
LH->>LH: Validate action is allowed for this resource
LH->>AWS_C: sts:AssumeRole (dd0c-cost-remediate)
LH->>AWS_C: ec2:StopInstances (DryRun=true)
alt DryRun succeeds
LH->>S: Open confirmation modal (Block Kit)
S->>AG: User confirms
AG->>LH: Execute action
LH->>AWS_C: ec2:CreateSnapshot (if terminate)
LH->>AWS_C: ec2:StopInstances / TerminateInstances
LH->>DDB: Log remediation action
LH->>S: Update original message: "✅ Stopped by @sam at 2:34 PM"
else DryRun fails
LH->>S: Error message: "Can't stop this instance: [reason]"
end
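The signature check in the flow above follows Slack's v0 signing scheme: HMAC-SHA256 over `v0:<timestamp>:<raw body>` with the app's signing secret, compared against the `X-Slack-Signature` header, with the timestamp taken from `X-Slack-Request-Timestamp`. A minimal sketch using Node's built-in crypto module; the function name and the explicit `nowSeconds` parameter (passed in to keep the function testable) are choices made here.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Reject requests older than five minutes to limit replay attacks.
const MAX_SKEW_SECONDS = 60 * 5;

function verifySlackSignature(
  signingSecret: string,
  timestamp: string,   // X-Slack-Request-Timestamp header
  rawBody: string,     // unparsed request body, exactly as received
  signature: string,   // X-Slack-Signature header
  nowSeconds: number,
): boolean {
  const ts = Number(timestamp);
  if (!isFinite(ts) || Math.abs(nowSeconds - ts) > MAX_SKEW_SECONDS) return false;

  const expected =
    "v0=" + createHmac("sha256", signingSecret).update("v0:" + timestamp + ":" + rawBody).digest("hex");

  // Constant-time comparison; lengths must match before timingSafeEqual is called.
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(signature, "utf8");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The raw body matters: if API Gateway or a framework re-serializes the payload before it reaches the handler, the computed signature will not match.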
2.5 Notification Service
Slack Alert Format (Block Kit)
Anomaly alerts use Slack Block Kit for rich, actionable messages:
🔴 CRITICAL: Expensive Resource Detected
*4× p3.2xlarge instances* launched in us-east-1
├─ Estimated cost: *$12.24/hr* ($293.76/day)
├─ Who: sam@company.com (IAM User)
├─ When: Today at 11:02 AM UTC
├─ Account: 123456789012 (production)
└─ Why this alert: New instance type never seen in this account.
Cost is 8.2× your average EC2 hourly spend.
[Stop Instances] [Terminate + Snapshot] [Snooze 4h] [Expected ✓]
ℹ️ Cost is estimated using on-demand pricing.
Actual cost may be lower with Reserved Instances or Savings Plans.
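In Block Kit terms, the alert above is roughly a `section` block for the text, an `actions` block for the buttons, and a `context` block for the pricing note. The builder below is a sketch: the `AlertInput` shape and the `action_id` values are hypothetical, and only a subset of the alert fields is rendered.

```typescript
interface AlertInput {
  anomalyId: string;
  quantity: number;
  resourceSpec: string;
  region: string;
  hourlyCost: number;
  actor: string;
}

function buildAlertBlocks(alert: AlertInput): object[] {
  const daily = alert.hourlyCost * 24;
  // Each button carries the anomaly ID so the action handler can look it up.
  const button = (label: string, actionId: string) => ({
    type: "button",
    text: { type: "plain_text", text: label },
    action_id: actionId,
    value: alert.anomalyId,
  });
  return [
    {
      type: "section",
      text: {
        type: "mrkdwn",
        text:
          `*${alert.quantity}× ${alert.resourceSpec} instances* launched in ${alert.region}\n` +
          `Estimated cost: *$${alert.hourlyCost.toFixed(2)}/hr* ($${daily.toFixed(2)}/day)\n` +
          `Who: ${alert.actor}`,
      },
    },
    {
      type: "actions",
      elements: [
        button("Stop Instances", "stop_instances"),
        button("Terminate + Snapshot", "terminate_snapshot"),
        button("Snooze 4h", "snooze_4h"),
        button("Expected ✓", "mark_expected"),
      ],
    },
    {
      type: "context",
      elements: [{ type: "mrkdwn", text: "Cost is estimated using on-demand pricing." }],
    },
  ];
}
```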
Alert Severity → Notification Behavior
| Severity | Slack Behavior | Digest Inclusion |
|---|---|---|
| INFO | No immediate alert. Included in daily digest only. | ✅ |
| WARNING | Immediate Slack message to configured channel. No @mention. | ✅ |
| CRITICAL | Immediate Slack message with `<!channel>` mention. | ✅ |
Daily Digest
Sent at 9:00 AM in the customer's configured timezone (default: UTC). Compiled by the daily-digest Lambda:
📊 dd0c Daily Digest — Feb 28, 2026
*Yesterday's Spend Estimate:* $487.22 (+12% vs. 7-day avg)
*Anomalies Detected:* 3
├─ 🔴 4× p3.2xlarge (sam@company.com) — $12.24/hr — RESOLVED ✅
├─ 🟡 New NAT Gateway in us-west-2 — $0.045/hr ($1.08/day) — OPEN
└─ 🔵 Lambda memory increased to 3GB (deploy-role) — $0.12/hr — Expected ✓
*Zombie Watch:* 🧟
├─ i-0abc123 (t3.medium, us-east-1) — idle 6 days — $0.0416/hr ($5.99 wasted)
├─ vol-0def456 (100GB gp3, unattached) — 14 days — $8.00/month
└─ eipalloc-0ghi789 (unattached) — 31 days — $3.60/month
*End-of-Month Forecast:* $14,230 (vs. $12,100 last month, +17.6%)
[View Details] [Adjust Sensitivity]
Slack Rate Limiting
Slack's rate limits: ~1 message/second per channel, 20K messages/day per workspace. At V1 scale (<100 accounts), this is not a concern. The SQS alert queue provides natural backpressure — if Slack returns 429, the Lambda retries with exponential backoff via SQS visibility timeout.
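One way to realize "exponential backoff via SQS visibility timeout" is to compute a new timeout from the message's receive count and apply it with `ChangeMessageVisibility` when Slack answers 429. The sketch below only does the calculation. The 30-second base is an assumption; the 12-hour cap is the SQS maximum visibility timeout, and Slack's 429 responses carry a `Retry-After` value that should be respected when it is larger than the computed delay.

```typescript
const SQS_MAX_VISIBILITY_SECONDS = 43200; // 12 hours, the SQS upper limit

// receiveCount: how many times SQS has delivered this message (starts at 1).
// retryAfterSeconds: value of Slack's Retry-After header, or 0 if absent.
function nextVisibilityTimeout(receiveCount: number, retryAfterSeconds = 0, baseSeconds = 30): number {
  const attempt = Math.max(receiveCount, 1);
  const backoff = baseSeconds * Math.pow(2, attempt - 1);
  return Math.min(Math.max(backoff, retryAfterSeconds), SQS_MAX_VISIBILITY_SECONDS);
}
```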
Weekly Digest (V2)
Deferred. Will include: week-over-week spend comparison, top anomalies, remediation summary, savings achieved, and a "dd0c saved you $X this week" callout for the cross-sell/retention narrative.
3. DATA ARCHITECTURE
3.1 Event Schema
All data flows through a normalized CostEvent schema. Raw CloudTrail events are transformed at ingestion and never exposed to downstream components.
CostEvent (Primary Entity)
// The atomic unit of data in dd0c/cost
interface CostEvent {
// DynamoDB keys
pk: string; // "ACCOUNT#<account_id>"
sk: string; // "EVENT#<iso_timestamp>#<event_id>"
// GSI1: Query by service + time
gsi1pk: string; // "ACCOUNT#<account_id>#SERVICE#<service>"
gsi1sk: string; // "<iso_timestamp>"
// GSI2: Query by actor
gsi2pk: string; // "ACCOUNT#<account_id>#ACTOR#<actor_hash>"
gsi2sk: string; // "<iso_timestamp>"
// Core fields
accountId: string; // Customer AWS account ID
tenantId: string; // dd0c tenant ID (maps to billing entity)
service: string; // "ec2" | "rds" | "lambda"
action: string; // CloudTrail eventName
resourceType: string; // "instance" | "nat-gateway" | "db-instance" | "volume" | ...
resourceId: string; // AWS resource ID (i-xxx, db-xxx, vol-xxx)
resourceSpec: string; // Instance type / config (p3.2xlarge, db.r5.4xlarge)
region: string; // AWS region
// Attribution
actor: string; // Human-readable: "sam@company.com" or "terraform-deploy-role"
actorArn: string; // Full IAM ARN
actorType: string; // "IAMUser" | "AssumedRole" | "Root"
// Cost estimation (Layer 1)
quantity: number;
estimatedHourlyCost: number;
estimatedDailyCost: number;
estimatedMonthlyCost: number;
pricingBasis: "on-demand"; // V1 always on-demand
// Cost reconciliation (Layer 2 — V2, nullable in V1)
actualHourlyCost?: number;
actualPricingTerm?: "OnDemand" | "Reserved" | "SavingsPlan" | "Spot";
reconciled: boolean; // false until CUR reconciliation runs
reconciledAt?: string;
// Anomaly scoring
anomalyScore: number;
anomalySeverity: "none" | "info" | "warning" | "critical";
anomalySignals: number; // How many scoring signals fired
// Status tracking
status: "open" | "resolved" | "expected" | "snoozed";
resolvedAction?: string; // "stopped" | "terminated" | "snoozed" | "marked-expected"
resolvedBy?: string; // Slack user ID who took action
resolvedAt?: string;
// Metadata
cloudTrailEventId: string;
eventTime: string; // Original CloudTrail event time
ingestedAt: string; // When dd0c processed it
ttl: number; // DynamoDB TTL epoch — 90 days from ingestedAt
}
Anomaly Record (Derived from CostEvent)
When a CostEvent scores above the alert threshold, an Anomaly record is created:
interface AnomalyRecord {
pk: string; // "ANOMALY#<account_id>"
sk: string; // "<iso_timestamp>#<anomaly_id>"
// GSI: Query open anomalies
gsi3pk: string; // "ANOMALY#<account_id>#STATUS#<status>"
gsi3sk: string; // "<iso_timestamp>"
anomalyId: string; // ULID
accountId: string;
tenantId: string;
// Anomaly details
severity: "info" | "warning" | "critical";
score: number;
signalBreakdown: {
zScore?: number;
instanceTypeNovelty?: boolean;
actorNovelty?: boolean;
absoluteCostThreshold?: boolean;
quantityAnomaly?: boolean;
timeOfDayAnomaly?: boolean;
};
// The triggering event(s)
triggerEventIds: string[]; // One or more CostEvent IDs
// Human-readable summary (pre-computed for Slack)
title: string; // "4× p3.2xlarge launched in us-east-1"
description: string; // "sam@company.com launched 4 GPU instances..."
estimatedImpact: string; // "$12.24/hr ($293.76/day)"
// Notification tracking
slackMessageTs?: string; // Slack message timestamp (for updating)
slackChannelId?: string;
notifiedAt?: string;
// Resolution
status: "open" | "resolved" | "expected" | "snoozed";
snoozeUntil?: string;
resolvedAction?: string;
resolvedBy?: string;
resolvedAt?: string;
// Remediation audit
remediationLog: RemediationEntry[];
ttl: number; // DynamoDB TTL epoch; 1 year per the retention policy in 3.6
}
interface RemediationEntry {
action: string; // "stop" | "terminate" | "snapshot" | "snooze"
executedBy: string; // Slack user ID
executedAt: string;
targetResourceId: string;
result: "success" | "failure";
errorMessage?: string;
snapshotId?: string; // If EBS snapshot was created
dryRunPassed: boolean;
}
3.2 Baseline / Threshold Storage
interface Baseline {
pk: string; // "BASELINE#<account_id>"
sk: string; // "<service>#<resource_type>"
accountId: string;
service: string;
resourceType: string;
// Rolling statistics
mean: number;
stddev: number;
sampleCount: number;
maxObserved: number;
minObserved: number;
p95: number; // 95th percentile hourly cost
// Time-windowed data (30-day rolling)
windowDays: number; // Default 30
dailyAggregates: { // Last 30 days, one entry per day
date: string; // "2026-02-28"
totalCost: number;
eventCount: number;
maxSingleEvent: number;
}[];
// Learned patterns
expectedInstanceTypes: string[];
expectedActors: string[];
typicalHourRange: [number, number]; // e.g., [8, 18] — resources usually created 8am-6pm
// User configuration
sensitivityOverride?: "low" | "medium" | "high";
suppressedPatterns: {
resourceSpec?: string;
actor?: string;
suppressedAt: string;
suppressedBy: string; // Slack user ID
reason: string; // "Marked as expected 3 times"
}[];
// State
maturityState: "cold-start" | "learning" | "mature";
// cold-start: <14 days or <20 events → absolute thresholds
// learning: 14-30 days → statistical + absolute hybrid
// mature: >30 days and >50 events → full statistical scoring
updatedAt: string;
createdAt: string;
}
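The maturityState comments translate into a small classifier. The sketch below is one reading of them; the comments do not say what happens at exactly 30 days, or after day 30 with 50 or fewer events, so those cases are assumed to stay in "learning". The function name is hypothetical.

```typescript
type MaturityState = "cold-start" | "learning" | "mature";

// Assumption: cases the comments leave open (exactly 30 days, or 20-50 events
// after day 30) fall into "learning". The source does not specify them.
function classifyMaturity(daysObserved: number, eventCount: number): MaturityState {
  if (daysObserved < 14 || eventCount < 20) return "cold-start";
  if (daysObserved > 30 && eventCount > 50) return "mature";
  return "learning";
}
```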
Baseline update strategy: Baselines are updated on every CostEvent ingestion. The update is an atomic DynamoDB UpdateItem with ADD and SET expressions — no read-modify-write race conditions. Daily aggregates are compacted by a nightly Lambda that rolls individual events into daily summaries and trims the window to 30 days.
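ADD can increment counters atomically, but it cannot maintain mean or stddev directly. One way to keep the update free of read-modify-write is to accumulate running sums and derive the statistics on read. The sketch below assumes two extra attributes, sumCost and sumCostSq, that are not part of the Baseline interface above; the input object is shaped like a DynamoDB document-client update request, and the function names are hypothetical.

```typescript
// Sketch only: sumCost and sumCostSq are hypothetical attributes added so that
// mean and stddev can be derived without a read-modify-write cycle.
interface UpdateItemInput {
  TableName: string;
  Key: { pk: string; sk: string };
  UpdateExpression: string;
  ExpressionAttributeValues: Record<string, number | string>;
}

function buildBaselineUpdate(
  accountId: string,
  service: string,
  resourceType: string,
  hourlyCost: number,
  nowIso: string,
): UpdateItemInput {
  return {
    TableName: "dd0c-cost-main",
    Key: { pk: `BASELINE#${accountId}`, sk: `${service}#${resourceType}` },
    // ADD increments atomically on the server; SET overwrites the timestamp.
    UpdateExpression: "ADD sampleCount :one, sumCost :c, sumCostSq :c2 SET updatedAt = :now",
    ExpressionAttributeValues: {
      ":one": 1,
      ":c": hourlyCost,
      ":c2": hourlyCost * hourlyCost,
      ":now": nowIso,
    },
  };
}

// Population mean and standard deviation recovered from the running sums.
function deriveStats(sampleCount: number, sumCost: number, sumCostSq: number) {
  if (sampleCount === 0) return { mean: 0, stddev: 0 };
  const mean = sumCost / sampleCount;
  // Clamp at zero: floating-point error can push the variance slightly negative.
  const variance = Math.max(0, sumCostSq / sampleCount - mean * mean);
  return { mean, stddev: Math.sqrt(variance) };
}
```

The sum-of-squares form loses precision when costs are large and nearly constant, so a production version may prefer to recompute statistics from the daily aggregates during the nightly compaction.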
3.3 DynamoDB Single-Table Design
All entities live in one DynamoDB table (dd0c-cost-main) with a single-table design:
┌─────────────────────────────────┬──────────────────────────────────────────┐
│ PK │ SK │
├─────────────────────────────────┼──────────────────────────────────────────┤
│ TENANT#<tenant_id> │ METADATA │ → Tenant config
│ TENANT#<tenant_id> │ ACCOUNT#<account_id> │ → Account registration
│ ACCOUNT#<account_id> │ EVENT#<timestamp>#<event_id> │ → CostEvent
│ ACCOUNT#<account_id> │ CONFIG │ → Account settings
│ BASELINE#<account_id> │ <service>#<resource_type> │ → Baseline
│ ANOMALY#<account_id> │ <timestamp>#<anomaly_id> │ → Anomaly record
│ SLACK#<workspace_id> │ INSTALL │ → Slack OAuth tokens
│ SLACK#<workspace_id> │ CHANNEL#<channel_id> │ → Channel config
└─────────────────────────────────┴──────────────────────────────────────────┘
GSI1 (Service queries):
PK: ACCOUNT#<account_id>#SERVICE#<service>
SK: <timestamp>
GSI2 (Actor queries):
PK: ACCOUNT#<account_id>#ACTOR#<actor_hash>
SK: <timestamp>
GSI3 (Open anomalies):
PK: ANOMALY#<account_id>#STATUS#<status>
SK: <timestamp>
GSI4 (Tenant lookups):
PK: TENANT#<tenant_id>
SK: (same as table SK)
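The key layout above can be captured in a few builder functions, so that no call site assembles key strings by hand. This is a sketch; the function names are illustrative, and the range query assumes all timestamps are ISO-8601 UTC strings of the same precision.

```typescript
// Key builders mirroring the table layout above. Function names are illustrative.
const costEventKey = (accountId: string, isoTimestamp: string, eventId: string) => ({
  pk: `ACCOUNT#${accountId}`,
  sk: `EVENT#${isoTimestamp}#${eventId}`,
});

const baselineKey = (accountId: string, service: string, resourceType: string) => ({
  pk: `BASELINE#${accountId}`,
  sk: `${service}#${resourceType}`,
});

const anomalyStatusGsi3Pk = (accountId: string, status: string) =>
  `ANOMALY#${accountId}#STATUS#${status}`;

// Query condition for "all CostEvents for one account in a time range".
// ISO-8601 UTC timestamps sort lexicographically, so a string range on SK
// is also a chronological range.
function eventRangeCondition(accountId: string, fromIso: string, toIso: string) {
  return {
    KeyConditionExpression: "pk = :pk AND sk BETWEEN :from AND :to",
    ExpressionAttributeValues: {
      ":pk": `ACCOUNT#${accountId}`,
      ":from": `EVENT#${fromIso}`,
      // "\uffff" sorts after any event_id suffix, so events stamped exactly
      // at toIso are included in the range.
      ":to": `EVENT#${toIso}#\uffff`,
    },
  };
}
```

Because every builder takes accountId as its first argument, the tenant scoping described in 3.5 is visible at each call site.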
Why single-table: At V1 scale, a single DynamoDB table with GSIs handles all access patterns. No cross-table joins needed. Simplifies IaC, monitoring, and backup. When the V2 dashboard requires complex queries (aggregations, time-series), we add Aurora PostgreSQL as a read-optimized copy of the data, fed by DynamoDB Streams → Lambda → Aurora, to serve the dashboard's query layer. The real-time path stays on DynamoDB.
3.4 CUR Data Warehouse (V2)
┌─────────────────────────────────────────────────────────┐
│ S3: dd0c-cur-datalake │
│ │
│ s3://dd0c-cur-datalake/ │
│ ├── raw/ │
│ │ └── <account_id>/ │
│ │ └── year=2026/month=02/ │
│ │ └── cur-00001.parquet │
│ ├── processed/ │
│ │ └── <account_id>/ │
│ │ └── year=2026/month=02/day=28/ │
│ │ └── daily-summary.parquet │
│ └── athena-results/ │
│ └── query-<id>/ │
│ └── results.csv │
│ │
│ Athena Database: dd0c_cur │
│ ├── Table: raw_cur (partitioned by account_id, year, │
│ │ month — Parquet, Snappy compression) │
│ └── Table: daily_summary (materialized by reconciler) │
│ │
│ Glue Crawler: runs daily, updates partitions │
└─────────────────────────────────────────────────────────┘
CUR ingestion options (customer choice):
- Cross-account S3 replication: Customer's CUR bucket replicates to dd0c's S3 bucket. Simplest but requires S3 replication rule setup.
- Cross-account Athena query: dd0c queries the customer's CUR bucket directly via cross-account S3 access. No data copy. More secure but slower (cross-account S3 reads).
- CUR export to dd0c bucket: Customer configures CUR 2.0 to export directly to dd0c's S3 bucket with a customer-specific prefix. Cleanest but requires CUR reconfiguration.
V2 will support cross-account S3 replication as the default, with cross-account Athena query as the "security-conscious" alternative.
3.5 Multi-Tenant Data Isolation
dd0c/cost is a multi-tenant SaaS. Customer data isolation is non-negotiable.
Isolation model: Logical isolation with partition-key enforcement.
- Every DynamoDB item includes accountId and tenantId in the partition key
- All Lambda functions receive tenantId from the authenticated session (Cognito JWT claim)
- DynamoDB queries are ALWAYS scoped to a partition key that includes the tenant's account ID; there is no "scan all accounts" operation exposed to any customer-facing code path
- S3 CUR data is prefixed by account_id; S3 bucket policies enforce prefix-level access
- Athena queries include WHERE line_item_usage_account_id = '<account_id>', enforced at the query construction layer, not user input
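Enforcing the account filter "at the query construction layer" could look like the sketch below: the only way to obtain CUR SQL is through a builder that validates the account ID and always emits the WHERE clause. The function name is hypothetical, the selected columns are standard CUR column names chosen for illustration, and the string-typed year and month partition columns are an assumption.

```typescript
// Hypothetical query builder. The database and table names come from section 3.4;
// the selected columns are standard CUR column names used here for illustration.
const ACCOUNT_ID_PATTERN = /^\d{12}$/;

function buildDailyCostQuery(accountId: string, year: number, month: number): string {
  // The account ID comes from the authenticated tenant record, but validate it
  // anyway so a malformed value can never alter the SQL.
  if (!ACCOUNT_ID_PATTERN.test(accountId)) {
    throw new Error(`Invalid AWS account ID: ${accountId}`);
  }
  if (!Number.isInteger(year) || !Number.isInteger(month) || month < 1 || month > 12) {
    throw new Error("Invalid billing period");
  }
  const mm = String(month).padStart(2, "0");
  return [
    "SELECT line_item_product_code, SUM(line_item_unblended_cost) AS cost",
    "FROM dd0c_cur.raw_cur",
    `WHERE line_item_usage_account_id = '${accountId}'`,
    // Assumption: year and month partition columns are stored as strings.
    `  AND year = '${year}' AND month = '${mm}'`,
    "GROUP BY line_item_product_code",
  ].join("\n");
}
```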
Why not per-tenant tables/databases: At V1 scale (<100 tenants), per-tenant DynamoDB tables would mean 100+ tables to manage, monitor, and back up. Single-table with partition-key isolation is the standard pattern for DynamoDB multi-tenancy at this scale. If we hit 10,000+ tenants and need stronger isolation (e.g., for SOC 2 or a large enterprise customer), we can migrate specific tenants to dedicated tables — DynamoDB Streams makes this a non-disruptive migration.
Cross-tenant data access (internal only):
- The anomaly scoring engine reads baselines for a single account — never cross-account
- The daily digest reads anomalies for a single account
- The only cross-tenant operation is internal analytics (aggregate metrics across all tenants for product health monitoring). This runs on a separate IAM role with read-only access and is never exposed via API.
3.6 Retention Policies
| Data Type | Retention | Mechanism | Rationale |
|---|---|---|---|
| CostEvents | 90 days | DynamoDB TTL | Sufficient for baseline learning (30-day window) + investigation buffer. Older events are summarized in baselines. |
| Anomaly Records | 1 year | DynamoDB TTL | Customers need anomaly history for trend analysis and SOC 2 audit evidence. |
| Baselines | Indefinite (while account active) | No TTL | Baselines are the product's memory. Deleting them resets learning. Cleaned up on account disconnection. |
| Remediation Audit Log | 2 years | DynamoDB TTL | Compliance requirement. Customers need proof of who did what and when. |
| CUR Data (S3) | 13 months | S3 Lifecycle Policy | Matches AWS's own CUR retention. Enables year-over-year comparison in V3. |
| Slack OAuth Tokens | Indefinite (while connected) | Manual cleanup on disconnect | Required for ongoing Slack integration. Encrypted at rest (KMS). |
| Raw CloudTrail Events | 7 days | DynamoDB TTL on rawEvent field | Stored for debugging only. The normalized CostEvent is the source of truth. |
| Athena Query Results | 7 days | S3 Lifecycle Policy | Ephemeral. Re-queryable from source data. |
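For the rows that rely on DynamoDB TTL, the ttl attribute must hold a Unix epoch time in seconds. A minimal sketch of computing it from the retention table follows; the map keys and function name are illustrative, and "1 year" is assumed to mean 365 days.

```typescript
// Retention periods from the table above, in days. The map keys are illustrative
// names, and 365 days for "1 year" is an assumption.
const RETENTION_DAYS = {
  costEvent: 90,
  anomalyRecord: 365,
  rawCloudTrailEvent: 7,
} as const;

const SECONDS_PER_DAY = 86400;

// DynamoDB TTL expects a Number attribute holding a Unix epoch time in seconds.
function ttlEpochSeconds(ingestedAtIso: string, kind: keyof typeof RETENTION_DAYS): number {
  const ingestedMs = Date.parse(ingestedAtIso);
  if (Number.isNaN(ingestedMs)) throw new Error(`Invalid timestamp: ${ingestedAtIso}`);
  return Math.floor(ingestedMs / 1000) + RETENTION_DAYS[kind] * SECONDS_PER_DAY;
}
```

TTL deletion is a background process and expired items can remain readable for some time, so queries that must respect the retention boundary should also filter on ttl.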
Data deletion on account disconnection: When a customer disconnects their AWS account or deletes their dd0c account:
- All CostEvents, Anomalies, and Baselines for that account are marked for deletion
- A background Lambda processes deletion in batches (DynamoDB BatchWriteItem, 25 items/batch)
- S3 CUR data for that account prefix is deleted via S3 Lifecycle rule (immediate expiration)
- Slack tokens are revoked and deleted
- Deletion is confirmed to the customer via email
- Timeline: complete within 72 hours (GDPR-compliant)
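The batch step in the deletion flow above can be sketched as a pure chunking function that turns item keys into BatchWriteItem-shaped delete requests. The function and type names are hypothetical; the 25-item limit and table name come from the document.

```typescript
// BatchWriteItem accepts at most 25 put/delete requests per call.
const BATCH_SIZE = 25;

interface TableKey {
  pk: string;
  sk: string;
}

interface DeleteBatch {
  RequestItems: Record<string, { DeleteRequest: { Key: TableKey } }[]>;
}

// Splits keys into BatchWriteItem-sized groups of delete requests.
function toDeleteBatches(keys: TableKey[], tableName = "dd0c-cost-main"): DeleteBatch[] {
  const batches: DeleteBatch[] = [];
  for (let i = 0; i < keys.length; i += BATCH_SIZE) {
    const chunk = keys.slice(i, i + BATCH_SIZE);
    batches.push({
      RequestItems: { [tableName]: chunk.map((Key) => ({ DeleteRequest: { Key } })) },
    });
  }
  return batches;
}
```

BatchWriteItem can return UnprocessedItems under throttling, so the Lambda that sends these batches would need to resubmit whatever comes back before reporting the deletion as complete.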
4. INFRASTRUCTURE
4.1 AWS Architecture
All dd0c/cost infrastructure runs in a single AWS account (dd0c-platform) in us-east-1 (primary) with no multi-region in V1. The entire stack is serverless — zero EC2 instances, zero containers, zero servers to patch.
graph TB
subgraph "dd0c-platform AWS Account (us-east-1)"
subgraph "Ingestion"
EB[EventBridge: dd0c-cost-bus<br/>Cross-account event target]
SQS_I[SQS FIFO: event-ingestion<br/>MessageGroupId=accountId<br/>Dedup=cloudTrailEventId]
L_PROC[Lambda: event-processor<br/>128MB, 30s timeout<br/>Batch size: 10]
end
subgraph "Scoring & Alerting"
L_SCORE[Lambda: anomaly-scorer<br/>256MB, 10s timeout]
SQS_A[SQS Standard: alert-queue<br/>DLQ after 3 retries]
L_NOTIFY[Lambda: notifier<br/>128MB, 15s timeout]
end
subgraph "Remediation"
APIGW_SLACK[API Gateway: /slack/actions<br/>POST only, Slack signature verification]
L_ACTION[Lambda: action-handler<br/>256MB, 30s timeout]
end
subgraph "Scheduled"
EBS[EventBridge Scheduler]
L_ZOMBIE[Lambda: zombie-hunter<br/>512MB, 5min timeout<br/>Daily 06:00 UTC]
L_DIGEST[Lambda: daily-digest<br/>256MB, 2min timeout<br/>Daily 09:00 UTC per TZ]
L_PRICING[Lambda: pricing-updater<br/>1024MB, 5min timeout<br/>Weekly Sunday 00:00 UTC]
end
subgraph "API"
APIGW_REST[API Gateway: REST API<br/>/v1/accounts, /v1/anomalies, etc.]
L_API[Lambda: api-handlers<br/>256MB, 30s timeout]
COGNITO[Cognito User Pool<br/>GitHub + Google OIDC]
end
subgraph "Data"
DDB[DynamoDB: dd0c-cost-main<br/>On-demand capacity<br/>Point-in-time recovery: ON<br/>Encryption: AWS-managed KMS]
S3_CF[S3: dd0c-cf-templates<br/>CloudFormation templates<br/>Public read]
S3_CUR[S3: dd0c-cur-datalake<br/>V2 — CUR storage<br/>SSE-S3 encryption]
end
subgraph "Observability"
CW[CloudWatch Logs + Metrics]
CW_ALARM[CloudWatch Alarms<br/>Lambda errors, SQS DLQ depth,<br/>DDB throttles]
SNS_OPS[SNS: ops-alerts<br/>→ Brian's phone]
end
end
EB --> SQS_I --> L_PROC --> L_SCORE --> SQS_A --> L_NOTIFY
EBS --> L_ZOMBIE
EBS --> L_DIGEST
EBS --> L_PRICING
APIGW_SLACK --> L_ACTION
APIGW_REST --> L_API
L_API --> COGNITO
L_PROC --> DDB
L_SCORE --> DDB
L_ACTION --> DDB
L_ZOMBIE --> DDB
L_DIGEST --> DDB
CW_ALARM --> SNS_OPS
Lambda Function Inventory
| Function | Memory | Timeout | Trigger | Concurrency | Est. Invocations/day (10 accounts) |
|---|---|---|---|---|---|
| event-processor | 128 MB | 30s | SQS FIFO (batch 10) | 5 reserved | 500-5,000 |
| anomaly-scorer | 256 MB | 10s | Async invoke from processor | 5 reserved | 500-5,000 |
| notifier | 128 MB | 15s | SQS Standard | 2 reserved | 10-50 (only anomalies) |
| action-handler | 256 MB | 30s | API Gateway | Unreserved | 5-20 (user-initiated) |
| zombie-hunter | 512 MB | 5 min | EventBridge Scheduler | 1 | 1 (daily) |
| daily-digest | 256 MB | 2 min | EventBridge Scheduler | 1 | 1 (daily) |
| pricing-updater | 1024 MB | 5 min | EventBridge Scheduler | 1 | 0.14 (weekly) |
| api-handlers | 256 MB | 30s | API Gateway | Unreserved | 50-500 |
Reserved concurrency on the ingestion path prevents a burst of CloudTrail events from consuming all Lambda concurrency in the account and starving the API/notification functions.
4.2 Customer-Side Infrastructure
The customer deploys a CloudFormation stack that creates:
Read-Only Stack (Required — deployed at onboarding)
# dd0c-cost-readonly.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'dd0c/cost — Read-only monitoring (CloudTrail events + resource describe)'
Parameters:
  Dd0cAccountId:
    Type: String
    Default: '111122223333' # dd0c's AWS account ID
  ExternalId:
    Type: String
    Description: 'Unique external ID for cross-account role assumption'
Resources:
  # IAM Role for dd0c to read CloudTrail events and describe resources
  Dd0cCostReadOnlyRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-cost-readonly
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${Dd0cAccountId}:root'
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: !Ref ExternalId
      Policies:
        - PolicyName: dd0c-cost-readonly-policy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # Describe resources for zombie hunting and context
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeVolumes
                  - ec2:DescribeAddresses
                  - ec2:DescribeNatGateways
                  - ec2:DescribeSnapshots
                  - elasticloadbalancing:DescribeLoadBalancers
                  - elasticloadbalancing:DescribeTargetGroups
                  - rds:DescribeDBInstances
                  - rds:DescribeDBClusters
                  - lambda:ListFunctions
                  - lambda:GetFunction
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:GetMetricData
                  - ce:GetCostAndUsage
                  - tag:GetResources
                Resource: '*'
              # CloudWatch billing metrics for end-of-month forecast
              - Effect: Allow
                Action:
                  - cloudwatch:GetMetricStatistics
                Resource: '*'
                Condition:
                  StringEquals:
                    cloudwatch:namespace: 'AWS/Billing'
  # EventBridge rule to forward cost-relevant CloudTrail events
  Dd0cCostEventRule:
    Type: AWS::Events::Rule
    Properties:
      Name: dd0c-cost-forward
      Description: 'Forward cost-relevant CloudTrail events to dd0c'
      State: ENABLED
      EventPattern:
        source:
          - aws.ec2
          - aws.rds
          - aws.lambda
        detail-type:
          - 'AWS API Call via CloudTrail'
        detail:
          eventSource:
            - ec2.amazonaws.com
            - rds.amazonaws.com
            - lambda.amazonaws.com
          eventName:
            - RunInstances
            - StartInstances
            - ModifyInstanceAttribute
            - CreateNatGateway
            - AllocateAddress
            - CreateVolume
            - CreateDBInstance
            - ModifyDBInstance
            - RestoreDBInstanceFromDBSnapshot
            - CreateDBCluster
            - CreateFunction20150331
            - UpdateFunctionConfiguration20150331v2
            - PutProvisionedConcurrencyConfig
      Targets:
        - Id: dd0c-cost-bus
          Arn: !Sub 'arn:aws:events:${AWS::Region}:${Dd0cAccountId}:event-bus/dd0c-cost-bus'
          RoleArn: !GetAtt EventBridgeForwardRole.Arn
  # IAM Role for EventBridge to forward events cross-account
  EventBridgeForwardRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-cost-eventbridge-forward
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: forward-to-dd0c
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: events:PutEvents
                Resource: !Sub 'arn:aws:events:${AWS::Region}:${Dd0cAccountId}:event-bus/dd0c-cost-bus'
Outputs:
  RoleArn:
    Value: !GetAtt Dd0cCostReadOnlyRole.Arn
    Description: 'Provide this ARN to dd0c to complete setup'
  ExternalId:
    Value: !Ref ExternalId
    Description: 'External ID for secure cross-account access'
Remediation Stack (Optional — deployed after trust is established)
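# dd0c-cost-remediate.yaml (separate stack, opt-in)
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  # Declared here because the role below references both values
  Dd0cAccountId:
    Type: String
    Default: '111122223333' # dd0c's AWS account ID
  ExternalId:
    Type: String
    Description: 'Unique external ID for cross-account role assumption'
Resources:
  Dd0cCostRemediateRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-cost-remediate
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${Dd0cAccountId}:root'
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: !Ref ExternalId
      Policies:
        - PolicyName: dd0c-cost-remediate-policy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # Only resources tagged dd0c-remediation=enabled can be acted on
              - Effect: Allow
                Action:
                  - ec2:StopInstances
                  - ec2:TerminateInstances
                  - ec2:CreateSnapshot
                Resource: '*'
                Condition:
                  StringEquals:
                    'aws:ResourceTag/dd0c-remediation': 'enabled'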
4.3 Cost Estimate
dd0c Platform Infrastructure Cost
| Component | 1 Account | 10 Accounts | 100 Accounts | Pricing Basis |
|---|---|---|---|---|
| Lambda (ingestion) | $0.02/mo | $0.15/mo | $1.50/mo | ~500-50K invocations/day, 128MB, 200ms avg |
| Lambda (scoring) | $0.01/mo | $0.10/mo | $1.00/mo | Same invocation count, 256MB, 50ms avg |
| Lambda (notifier) | $0.001/mo | $0.01/mo | $0.10/mo | 10-1000 anomalies/day |
| Lambda (scheduled) | $0.01/mo | $0.01/mo | $0.05/mo | 3 daily + 1 weekly, fixed |
| Lambda (API) | $0.005/mo | $0.05/mo | $0.50/mo | 50-5000 API calls/day |
| DynamoDB | $0.50/mo | $2.00/mo | $15.00/mo | On-demand: ~$1.25/million writes, ~$0.25/million reads. PITR adds ~25%. |
| SQS | $0.01/mo | $0.05/mo | $0.50/mo | First 1M requests free, then $0.40/million |
| EventBridge | $0.01/mo | $0.05/mo | $0.50/mo | $1.00/million events |
| API Gateway | $0.10/mo | $0.10/mo | $0.50/mo | $3.50/million requests, first 1M free |
| Cognito | $0.00/mo | $0.00/mo | $0.00/mo | Free <50K MAU |
| S3 | $0.01/mo | $0.05/mo | $0.50/mo | CF templates + CUR storage (V2) |
| CloudWatch | $0.50/mo | $0.50/mo | $2.00/mo | Logs ingestion + custom metrics |
| Total | ~$1.17/mo | ~$3.07/mo | ~$22.15/mo | |
| Revenue | $19/mo | $190/mo | $1,900/mo | |
| Gross Margin | 93.8% | 98.4% | 98.8% | |
Key insight: Infrastructure cost is negligible at all scales. The $19/account/month price point is almost pure margin. The cost driver at scale will be engineering time (Brian's time), not AWS bills. This is the beauty of serverless — costs scale linearly with usage, and CloudTrail event processing is computationally trivial.
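The margin figures follow directly from the table. The sketch below reproduces the 1-account and 100-account columns; the function name is illustrative, and the only inputs are the $19 price and the monthly totals listed above.

```typescript
// Figures copied from the cost table above; price is $19 per account per month.
const PRICE_PER_ACCOUNT = 19;

// Gross margin as a percentage, rounded to one decimal place.
function grossMarginPercent(accounts: number, monthlyInfraCost: number): number {
  const revenue = accounts * PRICE_PER_ACCOUNT;
  return Math.round(((revenue - monthlyInfraCost) / revenue) * 1000) / 10;
}

console.log(grossMarginPercent(1, 1.17)); // 93.8
console.log(grossMarginPercent(100, 22.15)); // 98.8
```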
Customer-Side Cost (what the customer pays AWS for dd0c's infrastructure in their account)
| Component | Cost | Notes |
|---|---|---|
| EventBridge rule | $0.00 | Rules are free. Events forwarded: $1.00/million. At ~5K events/day = $0.15/month. |
| IAM roles | $0.00 | IAM is free. |
| CloudTrail | $0.00 (if already enabled) | Most accounts already have CloudTrail enabled. If not, first trail is free. |
| Total customer-side cost | ~$0.15/month | Negligible. Important for the sales conversation: "dd0c adds ~$0.15/month to your AWS bill." |
4.4 Scaling Strategy
CloudTrail Event Volume Estimates
| Account Activity | Events/Day (all CloudTrail) | Cost-Relevant Events/Day | dd0c Processes |
|---|---|---|---|
| Small startup (10 engineers) | 5,000-20,000 | 50-200 | 50-200 |
| Medium startup (50 engineers) | 50,000-200,000 | 200-1,000 | 200-1,000 |
| Mid-market (200 engineers) | 500,000-2,000,000 | 1,000-5,000 | 1,000-5,000 |
| CI/CD heavy (Terraform runs) | 1,000,000+ | 5,000-20,000 | 5,000-20,000 |
The EventBridge filter is the key scaling lever. By filtering at the EventBridge rule level (customer-side), only cost-relevant events cross the account boundary. A busy account generating 2M CloudTrail events/day sends only ~5K-20K to dd0c. This is well within Lambda's processing capacity.
Scaling Bottlenecks and Mitigations
| Scale | Bottleneck | Mitigation |
|---|---|---|
| 100 accounts | None. Lambda + DynamoDB on-demand handles this trivially. | — |
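| 500 accounts | SQS FIFO throughput. Without high-throughput mode, a FIFO queue supports 300 API calls/sec per action, or up to 3,000 msg/sec with 10-message batches, shared across all message groups. Using accountId as the message group preserves per-account ordering, and expected volume stays well below the queue limit (500 accounts × 20,000 events/day ≈ 116 msg/sec at the high end). | Monitor SQS ApproximateAgeOfOldestMessage. If >60s, add a second FIFO queue with hash-based routing. |
| 1,000 accounts | Lambda concurrent executions. 5 reserved for ingestion × 10 batch size = 50 events in flight at once (roughly 250 events/sec at the 200ms average invocation time assumed in 4.3). May need to increase reserved concurrency. | Increase reserved concurrency to 20. Monitor ConcurrentExecutions metric. |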
| 5,000 accounts | DynamoDB write throughput. On-demand scales automatically but costs increase. Single-table hot partition risk if one account generates disproportionate events. | Monitor ConsumedWriteCapacityUnits. If hot partition detected, add account-level write sharding (append random suffix to PK). |
| 10,000+ accounts | Lambda cold starts become noticeable. EventBridge cross-account bus may hit account-level limits. | Migrate ingestion to ECS Fargate (long-running containers, no cold starts). Replace EventBridge with Kinesis Data Streams for higher throughput. This is a V3/V4 concern. |
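The write-sharding mitigation in the 5,000-account row trades simpler writes for fan-out reads. A minimal sketch, assuming a fixed shard count per hot account (8 is an arbitrary example) and hypothetical function names:

```typescript
// Assumption: a fixed shard count per hot account; 8 is an arbitrary example.
const SHARD_COUNT = 8;

// Write path: pick a shard at random (rng is injectable for testing).
function shardedEventPk(accountId: string, rng: () => number = Math.random): string {
  const shard = Math.floor(rng() * SHARD_COUNT) % SHARD_COUNT;
  return `ACCOUNT#${accountId}#${shard}`;
}

// Read path: a query for the account must fan out across every shard
// and merge the results by sort key.
function allShardPks(accountId: string): string[] {
  return Array.from({ length: SHARD_COUNT }, (_, shard) => `ACCOUNT#${accountId}#${shard}`);
}
```

Since every time-range query then costs SHARD_COUNT requests, sharding is best applied only to accounts that actually show a hot partition.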
The honest assessment: dd0c/cost's architecture comfortably handles 1,000 accounts without any changes. At 5,000+ accounts, targeted optimizations are needed. At 10,000+, a partial re-architecture (Lambda → ECS for ingestion) is warranted. The business plan targets 100 accounts at Month 6 and ~2,600 at $50K MRR, so the V1 architecture has roughly 10x headroom over the Month 6 target, and reaching the $50K MRR target needs only the reserved-concurrency increase from the table above, not a re-architecture.
4.5 CI/CD Pipeline
graph LR
subgraph "Developer"
CODE[TypeScript Code] --> GIT[Git Push → GitHub]
end
subgraph "GitHub Actions"
GIT --> LINT[ESLint + Prettier]
LINT --> TEST[Vitest Unit Tests]
TEST --> BUILD[CDK Synth<br/>Generate CloudFormation]
BUILD --> DIFF[CDK Diff<br/>Show changes]
end
subgraph "Deployment (main branch)"
DIFF -->|main branch merge| DEPLOY_STG[CDK Deploy → Staging]
DEPLOY_STG --> SMOKE[Smoke Tests<br/>Deploy CF stack to test account<br/>Trigger test event<br/>Verify Slack alert]
SMOKE -->|pass| DEPLOY_PROD[CDK Deploy → Production]
DEPLOY_PROD --> MONITOR[CloudWatch Alarms<br/>5-min error rate check]
MONITOR -->|alarm| ROLLBACK[CDK Deploy → Previous Version]
end
Stack:
- Source control: GitHub (private repo)
- CI/CD: GitHub Actions (free for private repos up to 2,000 min/month)
- IaC: AWS CDK v2 (TypeScript)
- Testing: Vitest for unit tests, custom smoke test script for integration
- Environments: Staging (dd0c-staging account) + Production (dd0c-platform account)
- Deployment: CDK deploy with --require-approval never for staging, --require-approval broadening for production (alerts on IAM/security changes)
Deployment cadence: Continuous deployment to staging on every push. Production deploys on merge to main after smoke tests pass. Rollback is automatic if CloudWatch error rate alarm fires within 5 minutes of deploy.
Solo founder optimization: No manual approval gates. No staging-to-prod promotion ceremony. Push to main → it's live in 10 minutes. If it breaks, CloudWatch catches it and rolls back. Brian's time is too valuable for deployment theater.
5. SECURITY
5.1 IAM Role Design: Customer AWS Accounts
dd0c/cost requires cross-account access to customer AWS accounts. This is the most security-sensitive aspect of the architecture. The design principle: minimum privilege, maximum transparency, separate roles for separate risk levels.
Role 1: dd0c-cost-readonly (Required)
Deployed at onboarding. Read-only. Cannot modify any customer resources.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DescribeComputeResources",
"Effect": "Allow",
"Action": [
"ec2:DescribeInstances",
"ec2:DescribeVolumes",
"ec2:DescribeAddresses",
"ec2:DescribeNatGateways",
"ec2:DescribeSnapshots",
"ec2:DescribeRegions",
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeTargetGroups",
"rds:DescribeDBInstances",
"rds:DescribeDBClusters",
"lambda:ListFunctions",
"lambda:GetFunction"
],
"Resource": "*"
},
{
"Sid": "CloudWatchMetrics",
"Effect": "Allow",
"Action": [
"cloudwatch:GetMetricStatistics",
"cloudwatch:GetMetricData",
"cloudwatch:ListMetrics"
],
"Resource": "*"
},
{
"Sid": "CostExplorerReadOnly",
"Effect": "Allow",
"Action": [
"ce:GetCostAndUsage",
"ce:GetCostForecast"
],
"Resource": "*"
},
{
"Sid": "TagReadOnly",
"Effect": "Allow",
"Action": [
"tag:GetResources",
"tag:GetTagKeys",
"tag:GetTagValues"
],
"Resource": "*"
}
]
}
Trust policy with external ID:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::111122223333:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "dd0c-cost-<unique-per-customer-uuid>"
}
}
}
]
}
Why external ID matters: Without an external ID, another party who knows or guesses a customer's role ARN could register it with dd0c and trick dd0c into assuming that role on their behalf (the "confused deputy" problem). The external ID is a unique, per-customer value generated at onboarding and stored in dd0c's database. It's included in the CloudFormation template as a parameter.
What this role explicitly CANNOT do:
- ❌ Create, modify, or delete any resource
- ❌ Read S3 bucket contents
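- ❌ Read Secrets Manager secrets or SSM parameters (note: lambda:GetFunction does return a function's configuration, including its environment variables)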
- ❌ Access IAM users, roles, or policies
- ❌ Read CloudTrail logs directly (events come via EventBridge, not API)
- ❌ Access any networking configuration (VPCs, security groups, etc.)
Role 2: dd0c-cost-remediate (Optional, Opt-In)
Deployed separately, only when the customer explicitly enables one-click remediation. Scoped to tagged resources only.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RemediateTaggedEC2Only",
"Effect": "Allow",
"Action": [
"ec2:StopInstances",
"ec2:TerminateInstances",
"ec2:CreateSnapshot"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/dd0c-remediation": "enabled"
}
}
},
{
"Sid": "DescribeForDryRun",
"Effect": "Allow",
"Action": [
"ec2:DescribeInstances",
"ec2:DescribeVolumes"
],
"Resource": "*"
}
]
}
Tag-based scoping: Remediation only works on resources tagged dd0c-remediation: enabled. This means:
- Production databases without the tag are untouchable — even if dd0c has a bug
- Customers control exactly which resources dd0c can act on
- The tag can be applied via Terraform/CloudFormation as part of dev/staging resource definitions
- Production resources should never have this tag (and dd0c's onboarding docs will say so explicitly)
Role Assumption Flow
dd0c Lambda → sts:AssumeRole(
RoleArn: "arn:aws:iam::<customer_account>:role/dd0c-cost-readonly",
ExternalId: "dd0c-cost-<customer-uuid>",
DurationSeconds: 900 // 15-minute session, minimum viable
) → Temporary credentials → ec2:DescribeInstances → Discard credentials
- Session duration: 15 minutes (the minimum STS allows). Credentials are never cached beyond a single Lambda invocation.
- No long-lived credentials are stored anywhere. Every cross-account call uses fresh STS temporary credentials.
- The Lambda execution role in dd0c's account has sts:AssumeRole permission scoped to arn:aws:iam::*:role/dd0c-cost-readonly and arn:aws:iam::*:role/dd0c-cost-remediate, so it can only assume roles with these exact names.
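The role assumption flow above can be sketched as a builder for the AssumeRole request. It only constructs and validates the input; the actual STS call is left out so the example stays self-contained. The function name, the session-name format, and the invocationId parameter are assumptions.

```typescript
// Role names are from the document; the function and type names are illustrative.
const ALLOWED_ROLE_NAMES = ["dd0c-cost-readonly", "dd0c-cost-remediate"] as const;
type Dd0cRoleName = (typeof ALLOWED_ROLE_NAMES)[number];

interface AssumeRoleInput {
  RoleArn: string;
  RoleSessionName: string;
  ExternalId: string;
  DurationSeconds: number;
}

function buildAssumeRoleInput(
  customerAccountId: string,
  roleName: Dd0cRoleName,
  externalId: string,
  invocationId: string,
): AssumeRoleInput {
  if (!/^\d{12}$/.test(customerAccountId)) {
    throw new Error(`Invalid AWS account ID: ${customerAccountId}`);
  }
  // Runtime check as well as the type: values may arrive from untyped storage.
  if (!ALLOWED_ROLE_NAMES.includes(roleName)) {
    throw new Error(`Refusing to assume unexpected role: ${roleName}`);
  }
  return {
    RoleArn: `arn:aws:iam::${customerAccountId}:role/${roleName}`,
    // Assumed format: session names show up in the customer's CloudTrail,
    // so tying them to the invocation helps the customer's own audits.
    RoleSessionName: `dd0c-${invocationId}`.slice(0, 64),
    ExternalId: externalId,
    DurationSeconds: 900, // 15 minutes, the minimum STS allows
  };
}
```

Validating the role name in code duplicates the IAM scoping on the Lambda execution role, which is intentional: a bug in one layer is still caught by the other.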
5.2 Customer Data Sensitivity
CloudTrail events contain sensitive information. dd0c must handle this responsibly.
What CloudTrail Events Reveal
| Data Element | Sensitivity | dd0c Usage | Storage |
|---|---|---|---|
| AWS Account ID | Medium | Required for multi-tenant routing | Stored (encrypted at rest) |
| IAM User/Role ARN | Medium | Attribution ("who did it") | Stored (encrypted at rest) |
| API action + parameters | High | Cost estimation (instance type, count) | Normalized fields stored. Raw event stored 7 days then TTL'd. |
| Source IP address | High | Not used by dd0c | Stripped at ingestion. Never stored. |
| User agent string | Low | Not used by dd0c | Stripped at ingestion. Never stored. |
| Request/response bodies | High | Instance type, count extracted. Rest discarded. | Only extracted fields stored. Raw TTL'd at 7 days. |
| Error responses | Low | Used to filter failed API calls (no cost impact) | Not stored. Failed events are dropped at ingestion. |
Data minimization principle: dd0c extracts exactly 8 fields from each CloudTrail event (service, action, resource type, resource ID, resource spec, region, actor, timestamp) and discards the rest. The raw event is stored for 7 days for debugging purposes only, then automatically deleted via DynamoDB TTL.
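A sketch of what the eight-field extraction could look like for a single event type. The input field names follow the CloudTrail record format; the output type, the function name, and the "unknown" fallback for events other than RunInstances are assumptions made to keep the example short.

```typescript
// Subset of the CloudTrail record fields this sketch reads.
interface CloudTrailRecord {
  eventID: string;
  eventTime: string;
  eventSource: string; // e.g. "ec2.amazonaws.com"
  eventName: string;
  awsRegion: string;
  sourceIPAddress?: string;
  userAgent?: string;
  errorCode?: string;
  userIdentity: { type: string; arn?: string; userName?: string };
  requestParameters?: Record<string, unknown> | null;
  responseElements?: Record<string, unknown> | null;
}

// The eight fields named in the data minimization principle.
interface ExtractedFields {
  service: string;
  action: string;
  resourceType: string;
  resourceId: string;
  resourceSpec: string;
  region: string;
  actor: string;
  timestamp: string;
}

function extractFields(record: CloudTrailRecord): ExtractedFields | null {
  if (record.errorCode) return null; // failed call: no cost impact, dropped
  // Only RunInstances is handled here; other events would need their own mapping.
  const isRunInstances = record.eventName === "RunInstances";
  const params = (record.requestParameters ?? {}) as { instanceType?: string };
  const response = (record.responseElements ?? {}) as {
    instancesSet?: { items?: { instanceId?: string }[] };
  };
  // sourceIPAddress and userAgent are deliberately never copied.
  return {
    service: record.eventSource.split(".")[0],
    action: record.eventName,
    resourceType: isRunInstances ? "instance" : "unknown",
    resourceId: (isRunInstances && response.instancesSet?.items?.[0]?.instanceId) || "unknown",
    resourceSpec: (isRunInstances && params.instanceType) || "unknown",
    region: record.awsRegion,
    actor: record.userIdentity.userName ?? record.userIdentity.arn ?? "unknown",
    timestamp: record.eventTime,
  };
}
```

Building the output field by field, rather than copying the record and deleting keys, means a new CloudTrail field can never leak into storage by default.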
Data in Transit
- Customer → dd0c: EventBridge cross-account delivery uses AWS's internal network. Events never traverse the public internet.
- dd0c → Customer (remediation): STS AssumeRole + API calls use HTTPS (TLS 1.2+) over AWS's internal network.
- dd0c → Slack: HTTPS (TLS 1.3) to Slack's API endpoints.
- User → dd0c API: HTTPS (TLS 1.2+) via API Gateway with enforced minimum TLS version.
Data at Rest
| Data Store | Encryption | Key Management |
|---|---|---|
| DynamoDB | AES-256 (AWS-managed KMS key) | AWS manages key rotation. Upgrade to CMK if enterprise customers require it. |
| S3 (CUR data) | SSE-S3 (AES-256) | AWS-managed. Upgrade to SSE-KMS with CMK for V2. |
| S3 (CF templates) | Public read (non-sensitive) | N/A |
| Slack OAuth tokens | DynamoDB encryption + application-level AES-256 | Application key stored in AWS Secrets Manager. Tokens are double-encrypted. |
5.3 SOC 2 Considerations
SOC 2 Type II is the target within 12 months of launch. It's table stakes for selling to any company with a security team.
SOC 2 Trust Service Criteria Mapping
| Criteria | dd0c/cost Implementation |
|---|---|
| Security | IAM least-privilege, encryption at rest/transit, Cognito auth, API Gateway throttling, Slack signature verification |
| Availability | Lambda auto-scaling, DynamoDB on-demand, SQS durability, CloudWatch alarms with auto-rollback |
| Processing Integrity | SQS FIFO deduplication, idempotent Lambda handlers, DynamoDB conditional writes, remediation audit log |
| Confidentiality | Data minimization (strip source IPs, raw events TTL'd), encryption, per-tenant partition isolation, no cross-tenant data access |
| Privacy | No PII collected beyond actor identity: IAM ARNs are not PII under most frameworks, though actor names can contain email addresses (e.g., sam@company.com). Data deletion on account disconnection within 72 hours. |
Pre-SOC 2 Security Checklist (Ship with V1)
- All data encrypted at rest (DynamoDB, S3)
- All data encrypted in transit (TLS 1.2+)
- No long-lived credentials stored (STS temporary credentials only)
- External ID on all cross-account roles (confused deputy prevention)
- Remediation scoped to tagged resources only
- Remediation audit log (who, what, when, result)
- Slack signature verification on all interactive payloads (HMAC-SHA256)
- Cognito JWT validation on all API requests
- API Gateway throttling (1,000 req/sec default, per-API-key limits)
- CloudWatch alarms on Lambda errors, SQS DLQ depth, DynamoDB throttles
- DynamoDB Point-in-Time Recovery enabled
- GitHub branch protection (require PR review — even if reviewing your own PRs as solo founder)
- Bug bounty program (launch within 30 days of public launch)
- Penetration test (schedule within 90 days of launch)
- SOC 2 Type I audit (Month 6-9)
- SOC 2 Type II audit (Month 12-15)
5.4 The Trust Model
dd0c asks customers to grant read access to their CloudTrail events and resource metadata. This is a significant trust ask. The architecture must earn and maintain that trust.
Trust-Building Measures
- Open-source the CloudFormation templates. Customers can read exactly what IAM permissions they're granting. No hidden permissions. The templates are hosted on a public S3 bucket and linked from the docs.
- Open-source the event processor. The Lambda function that processes CloudTrail events can be published as open source. Customers can audit exactly what data is extracted and what's discarded. (The anomaly scoring algorithm and business logic remain proprietary.)
- Minimal permissions, clearly documented. The docs page for "What permissions does dd0c need?" lists every IAM action with a plain-English explanation of why it's needed. No `*` actions. No `iam:*`. No `s3:*`.
- Separate remediation role. Read-only access is the default. Write access (remediation) is a separate, opt-in deployment. Customers who never deploy the remediation stack can use dd0c for detection and alerting only, and dd0c can never modify their resources.
- External ID rotation. Customers can rotate their external ID at any time via the dd0c dashboard. This invalidates dd0c's ability to assume the role until the new external ID is configured, which makes rotation an emergency kill switch.
- Self-hosted agent option (V3). For customers who refuse to send CloudTrail events to dd0c's account, a self-hosted agent runs in the customer's VPC. It processes events locally and sends only anonymized anomaly summaries (severity, estimated cost, resource type; no ARNs, no account IDs) to dd0c's SaaS for the dashboard. This is the nuclear option for security-paranoid customers.
Threat Model
| Threat | Likelihood | Impact | Mitigation |
|---|---|---|---|
| dd0c's AWS account compromised | Low | Critical | MFA on root, SCPs restricting dangerous actions, CloudTrail on dd0c's own account, AWS GuardDuty enabled |
| Attacker assumes customer role via dd0c | Very Low | Critical | External ID prevents confused deputy. STS sessions are 15 minutes. Lambda execution role is scoped to specific role names. |
| dd0c employee (Brian) goes rogue | N/A (solo founder) | Critical | Open-source templates + audit logs provide transparency. SOC 2 audit provides external verification. |
| Slack token compromise | Low | High | Tokens double-encrypted (DynamoDB + application-level). Slack token rotation supported. Tokens scoped to minimum bot permissions. |
| DynamoDB data breach | Very Low | High | Encryption at rest. No public endpoints. VPC endpoints for DynamoDB access from Lambda (V2). IAM policies restrict access to dd0c's Lambda execution roles only. |
6. MVP SCOPE
6.1 V1 Boundary: What Ships at Day 90
V1 is scoped to three services, one notification channel, and zero dashboards. Every feature that doesn't directly serve "detect → alert → fix" is cut.
V1 Feature Matrix
| Feature | V1 (Day 90) | V2 (Month 4-6) | V3 (Month 7-12) |
|---|---|---|---|
| EC2 anomaly detection | ✅ | ✅ | ✅ |
| RDS anomaly detection | ✅ | ✅ | ✅ |
| Lambda anomaly detection | ✅ | ✅ | ✅ |
| ECS/Fargate detection | ❌ | ✅ | ✅ |
| SageMaker detection | ❌ | ❌ | ✅ |
| Slack alerts (Block Kit) | ✅ | ✅ | ✅ |
| Manual remediation suggestions | ✅ | ✅ | ✅ |
| One-click remediation (Slack buttons) | ❌ | ✅ | ✅ |
| Zombie resource hunter | ✅ (daily) | ✅ (daily) | ✅ (continuous) |
| Daily digest | ✅ | ✅ | ✅ |
| Weekly digest | ❌ | ✅ | ✅ |
| End-of-month forecast | ✅ (basic) | ✅ (improved) | ✅ (ML-based) |
| CUR reconciliation | ❌ | ✅ | ✅ |
| Web dashboard | ❌ | ✅ (basic) | ✅ (full) |
| Multi-account support | ❌ (1 account) | ✅ | ✅ |
| Team attribution | ❌ | ✅ | ✅ |
| Custom anomaly rules | ❌ | ❌ | ✅ |
| Autonomous remediation | ❌ | ❌ | ✅ (opt-in) |
| API access | ❌ | ✅ (Business tier) | ✅ |
| dd0c/route integration | ❌ | ✅ (shared accounts) | ✅ (deep) |
V1 Remediation: Suggestions, Not Buttons
Critical V1 scoping decision: V1 ships with remediation suggestions in Slack alerts, not one-click action buttons.
Why:
- Remediation is the highest-risk feature. A bug that stops a production instance is catastrophic for a product with zero brand trust. V1 needs to build trust through accurate detection before earning the right to take action.
- The remediation IAM role doubles onboarding complexity. V1 onboarding deploys one CloudFormation stack (read-only). Adding a second stack for remediation adds friction and IAM anxiety.
- Suggestions still deliver 80% of the value. A Slack alert that says "Stop this instance: `aws ec2 stop-instances --instance-ids i-0abc123 --region us-east-1`" with a copy-paste CLI command is almost as fast as a button. The user runs it in their terminal in 5 seconds.
V1 alert format (suggestions, not buttons):
🟡 WARNING: Unusual EC2 Activity
*2× p3.2xlarge instances* launched in us-east-1
├─ Estimated cost: *$6.12/hr* ($146.88/day)
├─ Who: sam@company.com (IAM User)
├─ When: Today at 11:02 AM UTC
├─ Account: 123456789012
└─ Why: Instance type never seen in this account.
Cost is 4.1× your average EC2 hourly spend.
💡 *Suggested actions:*
• Stop instances: `aws ec2 stop-instances --instance-ids i-0abc123 i-0def456 --region us-east-1`
• Check if needed: `aws ec2 describe-instances --instance-ids i-0abc123 i-0def456 --region us-east-1 --query 'Reservations[].Instances[].{State:State.Name,Launch:LaunchTime,Tags:Tags}'`
[Mark as Expected ✓] [Snooze 4h]
ℹ️ Cost estimated using on-demand pricing.
The only interactive buttons in V1 are [Mark as Expected] and [Snooze] — both are internal to dd0c (no cross-account API calls, zero risk).
One-click remediation buttons ship in V2 after:
- 30+ days of accurate detection builds customer trust
- The remediation IAM role and tag-based scoping are battle-tested internally
- At least 3 design partners have opted into the remediation stack
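Because V1 asks users to paste a command into their terminal, the suggestion text itself deserves care. The sketch below shows one way the suggested commands from the alert format above might be generated. The `Ec2Anomaly` shape and the function name are hypothetical, and the ID and region patterns are deliberately loose approximations rather than AWS's official formats.

```typescript
// Hypothetical shape: field names are assumptions for illustration.
interface Ec2Anomaly {
  instanceIds: string[];
  region: string;
}

// Loose sanity checks. Validating before interpolation keeps a malformed value
// from ending up inside a copy-paste shell command.
const INSTANCE_ID = /^i-[0-9a-f]+$/;
const REGION = /^[a-z]{2}(?:-[a-z]+)+-\d+$/;

function suggestedCommands(anomaly: Ec2Anomaly): string[] {
  const idsOk =
    anomaly.instanceIds.length > 0 && anomaly.instanceIds.every((id) => INSTANCE_ID.test(id));
  if (!idsOk || !REGION.test(anomaly.region)) {
    return []; // Prefer no suggestion over an unsafe one.
  }
  const ids = anomaly.instanceIds.join(" ");
  return [
    `aws ec2 stop-instances --instance-ids ${ids} --region ${anomaly.region}`,
    `aws ec2 describe-instances --instance-ids ${ids} --region ${anomaly.region}`,
  ];
}
```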
6.2 The Onboarding Flow (Technical)
The onboarding flow is the most critical user journey. Every second of friction costs signups.
sequenceDiagram
participant U as User (Browser)
participant DD as dd0c Web App
participant COG as Cognito
participant AWS_C as Customer AWS Console
participant CF as CloudFormation
participant DD_API as dd0c API
participant SL as Slack
U->>DD: Click "Start Free"
DD->>COG: Redirect to Cognito hosted UI
COG->>COG: GitHub/Google OAuth
COG->>DD: JWT token (id_token + access_token)
DD->>DD_API: POST /v1/accounts/setup (init tenant)
DD_API->>DD_API: Generate unique external_id (UUID v4)
DD_API->>U: Return CloudFormation quick-create URL
Note over U,AWS_C: User clicks CF link → opens AWS Console
U->>AWS_C: Click link (pre-filled CF stack)
AWS_C->>CF: Create stack (dd0c-cost-readonly)
CF->>CF: Create IAM role + EventBridge rule (~60-90 sec)
CF-->>AWS_C: Stack complete → outputs Role ARN
U->>DD: Paste Role ARN (or auto-detected via CF callback)
DD->>DD_API: POST /v1/accounts (roleArn, externalId)
DD_API->>DD_API: sts:AssumeRole (validate access)
DD_API->>U: ✅ Account connected
U->>DD: Click "Connect Slack"
DD->>SL: Slack OAuth flow (bot scopes: chat:write, commands)
SL->>DD: OAuth callback (bot token, workspace ID)
DD->>DD_API: POST /v1/slack/install (token, workspace, channel)
DD_API->>DD_API: Trigger immediate zombie scan
DD_API->>SL: First alert: "Found 3 zombie resources costing $127/mo"
Note over U: Total time: 3-5 minutes. First value: <10 minutes.
CloudFormation Quick-Create URL
The magic of fast onboarding is the CloudFormation quick-create URL. Instead of asking users to download a template and upload it, dd0c generates a URL that opens the AWS Console with the stack pre-configured:
https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate
?templateURL=https://dd0c-cf-templates.s3.amazonaws.com/dd0c-cost-readonly-v1.yaml
&stackName=dd0c-cost-monitoring
&param_Dd0cAccountId=111122223333
&param_ExternalId=dd0c-cost-a1b2c3d4-e5f6-7890-abcd-ef1234567890
The user sees a pre-filled CloudFormation page. They check "I acknowledge that AWS CloudFormation might create IAM resources" and click "Create stack." Done in 90 seconds.
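A minimal sketch of how the setup endpoint might assemble that URL. The helper name and parameter object are hypothetical, and the sketch assumes the console accepts URL-encoded values in the fragment query string.

```typescript
// Hypothetical helper; the query keys match the example URL above.
interface QuickCreateParams {
  region: string;
  templateUrl: string;
  stackName: string;
  dd0cAccountId: string;
  externalId: string;
}

function buildQuickCreateUrl(p: QuickCreateParams): string {
  // The stack settings live after the "#", so they are encoded by hand
  // rather than through the URL query-string APIs.
  const pairs: Array<[string, string]> = [
    ["templateURL", p.templateUrl],
    ["stackName", p.stackName],
    ["param_Dd0cAccountId", p.dd0cAccountId],
    ["param_ExternalId", p.externalId],
  ];
  const query = pairs.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  return (
    `https://${p.region}.console.aws.amazon.com/cloudformation/home` +
    `?region=${p.region}#/stacks/quickcreate?${query}`
  );
}
```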
Auto-Detection of Role ARN (V1 Enhancement)
Instead of asking users to copy-paste the Role ARN, dd0c can poll for the role:
- After the user clicks the CF link, dd0c starts polling `sts:AssumeRole` with the expected role ARN (`arn:aws:iam::<account_id>:role/dd0c-cost-readonly`) every 10 seconds
- The account ID is extracted from the Cognito JWT (if the user signed up with an AWS-linked identity) or entered manually
- When the role becomes assumable (CF stack complete), dd0c auto-detects it and skips the "paste ARN" step
- Fallback: manual ARN entry if auto-detection fails after 5 minutes
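The polling steps above can be sketched as a small loop. Everything here is an assumption for illustration: `tryAssume` stands in for the real `sts:AssumeRole` call, and `sleep` is injectable so the loop can be tested without waiting.

```typescript
// Hypothetical sketch: `tryAssume` wraps the real sts:AssumeRole call and resolves
// to true once the role is assumable.
type TryAssume = (roleArn: string) => Promise<boolean>;

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForRole(
  awsAccountId: string,
  tryAssume: TryAssume,
  sleep: (ms: number) => Promise<void> = realSleep,
  intervalMs = 10_000, // poll every 10 seconds
  timeoutMs = 5 * 60_000, // give up after 5 minutes
): Promise<string | null> {
  const roleArn = `arn:aws:iam::${awsAccountId}:role/dd0c-cost-readonly`;
  const maxAttempts = Math.floor(timeoutMs / intervalMs);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (await tryAssume(roleArn)) return roleArn; // Stack finished: skip the "paste ARN" step.
    await sleep(intervalMs);
  }
  return null; // Caller falls back to manual ARN entry.
}
```

Returning `null` instead of throwing keeps the fallback path an ordinary branch in the onboarding handler.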
6.3 False Positive Rate Target
Target: <20% false positive rate by Day 60 (design partner phase).
Definition: A "false positive" is an alert where the user clicks [Mark as Expected] or [Snooze permanently]. An alert that the user ignores is NOT counted as a false positive (it might be a true positive they chose not to act on).
Measurement
// Calculated daily per account
const falsePositiveRate =
markedAsExpected / (markedAsExpected + actedOn + openAfter48Hours);
// Where:
// markedAsExpected = alerts where user clicked "Mark as Expected"
// actedOn = alerts where user clicked Stop/Terminate/Snooze (temporary)
// openAfter48Hours = alerts still open after 48 hours (ambiguous: never counted as a false positive, but still included in the denominator)
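One detail the formula leaves open is an account with no alerts in the window, where the denominator is zero. A minimal sketch of a guarded version follows; the `AlertCounts` type and the choice to return `null` are assumptions.

```typescript
// Hypothetical input type; field names mirror the formula above.
interface AlertCounts {
  markedAsExpected: number;
  actedOn: number;
  openAfter48Hours: number;
}

// Returns null when there were no alerts, so "no data" is not mistaken for a 0% rate.
function falsePositiveRate(c: AlertCounts): number | null {
  const total = c.markedAsExpected + c.actedOn + c.openAfter48Hours;
  return total === 0 ? null : c.markedAsExpected / total;
}
```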
False Positive Reduction Strategy
| Phase | FP Rate Target | Strategy |
|---|---|---|
| Day 1-14 (cold start) | <40% | Absolute thresholds only. Conservative defaults. Miss small anomalies rather than cry wolf. |
| Day 14-30 (learning) | <30% | Statistical baselines kick in. Composite scoring reduces single-signal false positives. |
| Day 30-60 (design partners) | <20% | Feedback loop active. "Mark as Expected" retrains baselines. Per-account patterns learned. |
| Day 60-90 (launch) | <15% | Sensitivity tuning based on design partner data. Default thresholds calibrated to real-world patterns. |
| Month 3+ (mature) | <10% | Mature baselines. Suppressed patterns. CUR reconciliation (V2) corrects pricing estimates. |
Alert-to-Action Ratio
The complementary metric to false positive rate. Measures what percentage of alerts result in a meaningful action.
Target: >25% alert-to-action ratio.
If <20% of alerts result in action (stop, terminate, snooze, or investigate), the product is too noisy. This is the "boy who cried wolf" metric — if it drops below 20%, dd0c has 30 days to fix it or trigger the kill criteria.
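The two thresholds above (a 25% target and a 20% floor) can be turned into a simple status check. This is a sketch under assumptions: the status labels and function name are hypothetical, and how "investigate" is measured is left to the caller.

```typescript
// Hypothetical status labels for the daily metrics job.
type NoiseStatus = "on-target" | "below-target" | "kill-criteria-clock";

function alertToActionStatus(alertsSent: number, alertsActedOn: number): NoiseStatus | null {
  if (alertsSent === 0) return null; // No alerts: the ratio is undefined.
  const ratio = alertsActedOn / alertsSent;
  if (ratio < 0.2) return "kill-criteria-clock"; // Below the floor: 30 days to fix.
  if (ratio > 0.25) return "on-target";
  return "below-target"; // Between floor and target: noisy, but no clock yet.
}
```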
6.4 Technical Debt Budget
V1 will accumulate technical debt. That's fine — it's a 90-day sprint. But the debt must be tracked and bounded.
Acceptable V1 Technical Debt
| Debt Item | Impact | Payoff Timeline |
|---|---|---|
| Static pricing tables (no RI/SP awareness) | Over-estimates costs for accounts with commitments. Higher false positive rate. | V2 (CUR reconciliation) |
| Single-region deployment (us-east-1) | Higher latency for non-US customers. Single point of failure. | V3 (multi-region) |
| No web dashboard | Customers can't view anomaly history outside Slack. | V2 |
| Single DynamoDB table for everything | Will hit GSI limits and hot partition issues at scale. | V2 (add Aurora read replica for dashboard queries) |
| Hardcoded Slack as only notification channel | Can't support Teams, Discord, email, PagerDuty. | V2 (notification abstraction layer) |
| No automated integration tests | Relies on manual smoke testing + unit tests. | Month 2 (add integration test suite) |
| Lambda cold starts on infrequent paths | Zombie hunter and digest Lambdas cold-start every invocation. | V2 (provisioned concurrency if needed, or migrate to ARM for faster cold starts) |
| No rate limiting per tenant | A single noisy account could consume disproportionate Lambda concurrency. | V2 (per-account SQS message group + concurrency limits) |
Unacceptable Technical Debt (Must Not Ship)
- ❌ Storing long-lived AWS credentials (must use STS temporary credentials)
- ❌ Unencrypted data at rest
- ❌ Missing Slack signature verification
- ❌ Missing external ID on cross-account roles
- ❌ No CloudWatch alarms on critical paths
- ❌ No DynamoDB Point-in-Time Recovery
- ❌ Hardcoded secrets in code (must use Secrets Manager or environment variables from CDK)
6.5 Solo Founder Operational Model
Brian is building and operating this alone. The architecture must minimize operational burden.
Operational Runbook (What Can Go Wrong)
| Scenario | Detection | Response | Automation Level |
|---|---|---|---|
| Lambda ingestion errors spike | CloudWatch alarm: event-processor error rate >5% in 5 min | Check CloudWatch Logs. Common causes: malformed CloudTrail event, DynamoDB throttle, STS AssumeRole failure. | Alert → Brian's phone via SNS. Manual investigation. |
| SQS DLQ messages accumulate | CloudWatch alarm: DLQ ApproximateNumberOfMessagesVisible > 0 | Inspect DLQ messages. Replay after fixing root cause. | Alert → Brian's phone. Manual replay via CLI script. |
| DynamoDB throttling | CloudWatch alarm: ThrottledRequests > 0 | On-demand capacity should auto-scale. If persistent, check for hot partition (single account generating disproportionate events). | Alert → Brian's phone. Usually self-resolving. |
| Slack API rate limited | Lambda retry with exponential backoff via SQS visibility timeout | SQS handles retry automatically. If persistent, check for alert storm (noisy account). | Fully automated. |
| Customer CloudFormation stack deleted | EventBridge events stop arriving. Detected by daily health check (no events in 24h for active account). | Notify customer via Slack: "We stopped receiving events from account X. Did you remove the dd0c stack?" | Semi-automated. Health check Lambda detects, sends Slack notification. |
| Deployment breaks production | CloudWatch alarm: error rate >5% within 5 min of deploy | Automatic rollback to previous Lambda version via CDK. | Fully automated rollback. |
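The "no events in 24h for active account" check in the runbook is simple enough to sketch. The `AccountActivity` shape is hypothetical; the `lastEventAt` field name follows the account API responses in this document.

```typescript
// Hypothetical record read by the daily health-check Lambda.
interface AccountActivity {
  accountId: string;
  status: string; // only "active" accounts are checked
  lastEventAt: string; // ISO 8601 timestamp of the most recent event
}

function findSilentAccounts(
  accounts: AccountActivity[],
  now: Date,
  maxSilenceHours = 24,
): string[] {
  const cutoff = now.getTime() - maxSilenceHours * 3_600_000;
  return accounts
    .filter((a) => a.status === "active" && Date.parse(a.lastEventAt) < cutoff)
    .map((a) => a.accountId);
}
```

Each returned account ID would then trigger the "We stopped receiving events" Slack notification.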
Time Budget (Brian's Weekly Hours on dd0c/cost)
| Activity | Hours/Week (V1 build) | Hours/Week (Post-launch) |
|---|---|---|
| Feature development | 20-25 | 10-15 |
| Bug fixes | 5-10 | 5-10 |
| Customer support | 0 | 2-5 |
| Ops/monitoring | 1-2 | 2-3 |
| Content marketing | 2-3 | 3-5 |
| Total | 28-40 | 22-38 |
The constraint: Brian is also building dd0c/route simultaneously. Total available hours: ~50-60/week across both products. dd0c/cost gets ~50% of time during the build phase, dropping to ~40% post-launch as dd0c/route matures.
Automation imperative: Every hour spent on ops is an hour not spent on features or marketing. The architecture must be self-healing:
- Lambda auto-scales and auto-retries
- SQS provides durability and backpressure
- DynamoDB on-demand eliminates capacity planning
- CloudWatch alarms catch problems before customers notice
- CDK deploys are one command (`cdk deploy --all`)
- No SSH. No servers. No patching. No 3 AM pages (unless Lambda error rate spikes, which means something is fundamentally broken).
7. API DESIGN
7.1 Account Registration & Onboarding API
All API endpoints are served via API Gateway at https://api.dd0c.dev/v1. Authentication is via Cognito JWT in the Authorization: Bearer <token> header.
POST /v1/accounts/setup
Initialize a new tenant and generate onboarding artifacts.
Request:
Headers:
Authorization: Bearer <cognito_jwt>
Body: (none — tenant derived from JWT)
Response: 201 Created
{
"tenantId": "tn_01HXYZ...",
"externalId": "dd0c-cost-a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"cloudFormationUrl": "https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate?templateURL=https://dd0c-cf-templates.s3.amazonaws.com/dd0c-cost-readonly-v1.yaml&stackName=dd0c-cost-monitoring&param_Dd0cAccountId=111122223333&param_ExternalId=dd0c-cost-a1b2c3d4...",
"status": "pending_aws_connection"
}
POST /v1/accounts
Register a connected AWS account after CloudFormation stack deployment.
Request:
Headers:
Authorization: Bearer <cognito_jwt>
Body:
{
"awsAccountId": "123456789012",
"roleArn": "arn:aws:iam::123456789012:role/dd0c-cost-readonly",
"region": "us-east-1",
"friendlyName": "production" // optional
}
Response: 201 Created
{
"accountId": "acc_01HXYZ...",
"awsAccountId": "123456789012",
"status": "validating",
"validation": {
"assumeRole": "pending",
"eventBridge": "pending",
"firstEvent": "pending"
}
}
// dd0c immediately:
// 1. Attempts sts:AssumeRole to validate access
// 2. Triggers zombie scan
// 3. Starts listening for EventBridge events
// Validation status updates via polling or webhook
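Before spending an `sts:AssumeRole` call, the handler can cheaply reject bodies where `roleArn` and `awsAccountId` disagree. A sketch, assuming the standard `aws` partition and the fixed role name from the read-only stack; the function name and error strings are hypothetical.

```typescript
// Matches only the read-only role name deployed by the stack; partition assumed to be "aws".
const ROLE_ARN = /^arn:aws:iam::(\d{12}):role\/dd0c-cost-readonly$/;

// Returns an error message for a 400 response, or null if the body looks consistent.
function validateRegistration(body: { awsAccountId: string; roleArn: string }): string | null {
  const match = ROLE_ARN.exec(body.roleArn);
  if (!match) {
    return "roleArn must be arn:aws:iam::<12-digit account>:role/dd0c-cost-readonly";
  }
  if (match[1] !== body.awsAccountId) {
    return "roleArn account does not match awsAccountId";
  }
  return null; // Format is consistent; sts:AssumeRole remains the real check.
}
```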
GET /v1/accounts
List all connected AWS accounts for the authenticated tenant.
Response: 200 OK
{
"accounts": [
{
"accountId": "acc_01HXYZ...",
"awsAccountId": "123456789012",
"friendlyName": "production",
"status": "active",
"connectedAt": "2026-02-28T10:00:00Z",
"lastEventAt": "2026-02-28T14:32:00Z",
"baselineMaturity": "learning", // "cold-start" | "learning" | "mature"
"remediationEnabled": false,
"stats": {
"anomaliesLast7d": 3,
"estimatedMonthlyCost": 14230.00,
"zombieResourceCount": 5,
"zombieEstimatedMonthlyCost": 127.40
}
}
]
}
DELETE /v1/accounts/{accountId}
Disconnect an AWS account. Triggers data deletion pipeline.
Response: 202 Accepted
{
"accountId": "acc_01HXYZ...",
"status": "disconnecting",
"dataDeletedBy": "2026-03-03T10:00:00Z" // 72-hour deletion window after disconnection
}
GET /v1/accounts/{accountId}/health
Check the health of a connected account (EventBridge events flowing, role assumable, etc.).
Response: 200 OK
{
"accountId": "acc_01HXYZ...",
"health": "healthy", // "healthy" | "degraded" | "disconnected"
"checks": {
"roleAssumable": { "status": "pass", "lastChecked": "2026-02-28T14:00:00Z" },
"eventsFlowing": { "status": "pass", "lastEventAt": "2026-02-28T14:32:00Z" },
"baselinePopulated": { "status": "pass", "maturity": "learning", "sampleCount": 47 }
}
}
7.2 Anomaly Query & Search API
GET /v1/anomalies
Query anomalies across all connected accounts.
Query Parameters:
accountId (optional) — filter by account
severity (optional) — "info" | "warning" | "critical"
status (optional) — "open" | "resolved" | "expected" | "snoozed"
service (optional) — "ec2" | "rds" | "lambda"
since (optional) — ISO 8601 timestamp (default: 7 days ago)
until (optional) — ISO 8601 timestamp (default: now)
limit (optional) — 1-100 (default: 50)
cursor (optional) — pagination cursor
Response: 200 OK
{
"anomalies": [
{
"anomalyId": "an_01HXYZ...",
"accountId": "acc_01HXYZ...",
"awsAccountId": "123456789012",
"severity": "warning",
"score": 4.2,
"status": "open",
"title": "2× p3.2xlarge launched in us-east-1",
"description": "sam@company.com launched 2 GPU instances at 11:02 AM UTC. Instance type never seen in this account. Cost is 4.1× average EC2 hourly spend.",
"estimatedHourlyCost": 6.12,
"estimatedDailyCost": 146.88,
"service": "ec2",
"resourceType": "instance",
"resourceIds": ["i-0abc123", "i-0def456"],
"resourceSpec": "p3.2xlarge",
"region": "us-east-1",
"actor": "sam@company.com",
"detectedAt": "2026-02-28T11:02:15Z",
"signals": {
"zScore": 4.1,
"instanceTypeNovelty": true,
"actorNovelty": false,
"absoluteCostThreshold": false,
"quantityAnomaly": false,
"timeOfDayAnomaly": false
},
"suggestedActions": [
{
"action": "stop",
"command": "aws ec2 stop-instances --instance-ids i-0abc123 i-0def456 --region us-east-1",
"risk": "low"
}
],
"slackMessageUrl": "https://company.slack.com/archives/C01ABC/p1234567890"
}
],
"cursor": "eyJsYXN0S2V5Ijo...",
"total": 12
}
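The example cursor above begins with the base64 encoding of `{"lastKey":`, which suggests an opaque base64-wrapped JSON key. The sketch below assumes exactly that format; the helper names and the `LastKey` type are hypothetical.

```typescript
// Hypothetical cursor format: base64-encoded JSON wrapping the last evaluated key.
type LastKey = Record<string, string | number>;

function encodeCursor(lastKey: LastKey): string {
  return Buffer.from(JSON.stringify({ lastKey }), "utf8").toString("base64");
}

function decodeCursor(cursor: string): LastKey | null {
  try {
    const parsed: unknown = JSON.parse(Buffer.from(cursor, "base64").toString("utf8"));
    if (typeof parsed === "object" && parsed !== null && "lastKey" in parsed) {
      const key = (parsed as { lastKey: unknown }).lastKey;
      if (typeof key === "object" && key !== null) return key as LastKey;
    }
    return null;
  } catch {
    return null; // Malformed cursor: treat as a 400, not a 500.
  }
}
```

A production version would also want to sign or tenant-check the decoded key so a client cannot page into another tenant's partition.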
GET /v1/anomalies/{anomalyId}
Get full details for a single anomaly, including remediation audit log.
Response: 200 OK
{
// ... all fields from list response, plus:
"remediationLog": [
{
"action": "stop",
"executedBy": "U01SLACK_USER",
"executedByName": "Sam Chen",
"executedAt": "2026-02-28T11:15:00Z",
"targetResourceId": "i-0abc123",
"result": "success",
"dryRunPassed": true
}
],
"triggerEvents": [
{
"cloudTrailEventId": "abc123-def456-...",
"eventTime": "2026-02-28T11:02:12Z",
"action": "RunInstances",
"rawParameters": {
"instanceType": "p3.2xlarge",
"minCount": 2,
"maxCount": 2,
"imageId": "ami-0abc123..."
}
}
],
"reconciliation": {
"reconciled": false,
"actualCost": null,
"pricingTerm": null,
"reconciledAt": null
}
}
PATCH /v1/anomalies/{anomalyId}
Update anomaly status (mark as expected, snooze, resolve).
Request:
{
"status": "expected", // or "snoozed" with snoozeUntil
"snoozeUntil": "2026-02-28T15:00:00Z" // only for "snoozed"
}
Response: 200 OK
{ "anomalyId": "an_01HXYZ...", "status": "expected", "updatedAt": "..." }
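The coupling between `status` and `snoozeUntil` is easy to get wrong, so a small validator helps. This is a sketch; the allowed status list is inferred from the endpoint description (mark as expected, snooze, resolve), and the error strings are hypothetical.

```typescript
// Hypothetical request body type for PATCH /v1/anomalies/{anomalyId}.
interface StatusUpdate {
  status: string;
  snoozeUntil?: string;
}

// Returns an error message for a 400 response, or null if the update is acceptable.
function validateStatusUpdate(body: StatusUpdate, now: Date): string | null {
  const allowed = ["expected", "snoozed", "resolved"];
  if (!allowed.includes(body.status)) {
    return `status must be one of: ${allowed.join(", ")}`;
  }
  if (body.status === "snoozed") {
    const until = Date.parse(body.snoozeUntil ?? "");
    if (Number.isNaN(until)) return "snoozeUntil must be an ISO 8601 timestamp";
    if (until <= now.getTime()) return "snoozeUntil must be in the future";
  } else if (body.snoozeUntil !== undefined) {
    return "snoozeUntil is only valid when status is snoozed";
  }
  return null;
}
```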
7.3 Baseline Configuration API
GET /v1/accounts/{accountId}/baselines
View current baselines for an account.
Response: 200 OK
{
"baselines": [
{
"service": "ec2",
"resourceType": "instance",
"maturity": "learning",
"sampleCount": 47,
"mean": 0.48,
"stddev": 0.92,
"p95": 1.92,
"maxObserved": 3.06,
"expectedInstanceTypes": ["t3.medium", "m5.xlarge", "c5.2xlarge"],
"expectedActors": ["sam@company.com", "terraform-deploy-role"],
"sensitivity": "medium",
"suppressedPatterns": []
},
{
"service": "rds",
"resourceType": "db-instance",
"maturity": "cold-start",
"sampleCount": 3,
"mean": 0.416,
"stddev": 0.0,
"expectedInstanceTypes": ["db.t3.medium"],
"expectedActors": ["terraform-deploy-role"],
"sensitivity": "medium",
"suppressedPatterns": []
}
]
}
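The `rds` baseline in this response has `stddev: 0.0` and only 3 samples, which would make a naive Z-score divide by zero. A minimal sketch of a guarded calculation follows. The `minSamples` default of 20 is a made-up placeholder, and falling back to absolute thresholds is an assumption based on the cold-start strategy described earlier.

```typescript
// Subset of the baseline fields shown in the response above.
interface Baseline {
  mean: number;
  stddev: number;
  sampleCount: number;
}

// Returns null when the baseline cannot support a Z-score (too few samples or zero
// variance); the caller would then rely on absolute thresholds instead.
function zScore(hourlyCost: number, b: Baseline, minSamples = 20): number | null {
  if (b.sampleCount < minSamples || b.stddev === 0) return null;
  return (hourlyCost - b.mean) / b.stddev;
}
```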
PATCH /v1/accounts/{accountId}/baselines/{service}/{resourceType}
Override baseline sensitivity or suppress patterns.
Request:
{
"sensitivity": "low", // "low" | "medium" | "high"
"suppressPattern": { // optional — add a suppressed pattern
"resourceSpec": "t3.medium",
"actor": null, // null = any actor
"reason": "We launch t3.mediums constantly for CI"
}
}
Response: 200 OK
{ "updated": true }
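How a suppressed pattern is matched against an incoming event might look like the sketch below. The request documents `actor: null` as "any actor"; treating a null `resourceSpec` the same way is an assumption, as are the type and function names.

```typescript
// Hypothetical stored form of a suppressed pattern.
interface SuppressedPattern {
  resourceSpec: string | null; // null = any spec (assumption)
  actor: string | null; // null = any actor (documented above)
  reason: string;
}

interface CostEventLike {
  resourceSpec: string;
  actor: string;
}

// An event is suppressed if any pattern matches on every non-null field.
function isSuppressed(event: CostEventLike, patterns: SuppressedPattern[]): boolean {
  return patterns.some(
    (p) =>
      (p.resourceSpec === null || p.resourceSpec === event.resourceSpec) &&
      (p.actor === null || p.actor === event.actor),
  );
}
```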
POST /v1/accounts/{accountId}/baselines/reset
Reset baselines for an account (re-enter cold-start mode). Useful if the account's usage pattern has fundamentally changed.
Request:
{
"service": "ec2", // optional — reset specific service. Omit for all.
"resourceType": "instance" // optional
}
Response: 200 OK
{ "reset": true, "newMaturity": "cold-start" }
7.4 Slack Bot Commands & Interactive Payloads
Slash Commands
| Command | Description | Response |
|---|---|---|
| `/dd0c status` | Show connected accounts and health | Ephemeral message with account list, health status, and last event time |
| `/dd0c anomalies` | Show open anomalies | Ephemeral message with top 5 open anomalies, sorted by severity |
| `/dd0c zombies` | Trigger on-demand zombie scan | "Scanning... results in ~60 seconds" → followed by zombie report |
| `/dd0c sensitivity <service> <low\|medium\|high>` | Adjust anomaly sensitivity | "EC2 sensitivity set to LOW. You'll only see critical anomalies." |
| `/dd0c digest` | Trigger on-demand daily digest | Sends the daily digest message immediately |
| `/dd0c help` | Show available commands | Command reference |
Interactive Message Payloads
Slack sends interactive payloads to POST https://api.dd0c.dev/v1/slack/actions when users click buttons in alert messages.
Incoming payload structure (from Slack):
{
"type": "block_actions",
"user": { "id": "U01ABC", "name": "sam" },
"team": { "id": "T01ABC" },
"channel": { "id": "C01ABC" },
"message": { "ts": "1234567890.123456" },
"actions": [
{
"action_id": "mark_expected",
"block_id": "anomaly_an_01HXYZ",
"value": "an_01HXYZ..."
}
]
}
Action handling:
| `action_id` | Handler | Side Effects |
|---|---|---|
| `mark_expected` | Update anomaly status → retrain baseline | Update original Slack message: "✅ Marked as expected by @sam" |
| `snooze_1h` | Set `snoozeUntil = now + 1h` | Update message: "💤 Snoozed for 1 hour by @sam" |
| `snooze_4h` | Set `snoozeUntil = now + 4h` | Update message: "💤 Snoozed for 4 hours by @sam" |
| `snooze_24h` | Set `snoozeUntil = now + 24h` | Update message: "💤 Snoozed for 24 hours by @sam" |
| `stop_instance` (V2) | Open confirmation modal → execute `ec2:StopInstances` | Update message: "🛑 Stopped by @sam at 2:34 PM" |
| `terminate_instance` (V2) | Open confirmation modal → snapshot → execute `ec2:TerminateInstances` | Update message: "💀 Terminated by @sam. Snapshot: snap-0abc123" |
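The three snooze actions differ only in duration, so the handler can resolve them from one lookup. A sketch, with a hypothetical function name; unknown `action_id` values return `null` rather than defaulting to some duration.

```typescript
// Snooze durations in hours, keyed by the action_id values in the table above.
const SNOOZE_HOURS = new Map<string, number>([
  ["snooze_1h", 1],
  ["snooze_4h", 4],
  ["snooze_24h", 24],
]);

// Returns the ISO 8601 snoozeUntil timestamp, or null if the action is not a snooze.
function snoozeUntilFor(actionId: string, now: Date): string | null {
  const hours = SNOOZE_HOURS.get(actionId);
  if (hours === undefined) return null;
  return new Date(now.getTime() + hours * 3_600_000).toISOString();
}
```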
Slack signature verification (critical security):
Every incoming request is verified using Slack's signing secret:
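import crypto from 'crypto';
function verifySlackSignature(
  signingSecret: string,
  requestBody: string,
  timestamp: string,
  signature: string
): boolean {
  // Reject non-numeric timestamps and requests older than 5 minutes (replay attack prevention)
  const requestTime = Number.parseInt(timestamp, 10);
  const fiveMinutesAgo = Math.floor(Date.now() / 1000) - 300;
  if (Number.isNaN(requestTime) || requestTime < fiveMinutesAgo) return false;
  // Recompute the signature over the raw (unparsed) request body
  const sigBasestring = `v0:${timestamp}:${requestBody}`;
  const mySignature = 'v0=' + crypto
    .createHmac('sha256', signingSecret)
    .update(sigBasestring)
    .digest('hex');
  // timingSafeEqual throws on a length mismatch, so compare lengths first
  const expected = Buffer.from(mySignature);
  const received = Buffer.from(signature);
  if (expected.length !== received.length) return false;
  return crypto.timingSafeEqual(expected, received);
}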
7.5 Dashboard REST API (V2)
V2 introduces a lightweight web dashboard. The API supports it.
GET /v1/dashboard/summary
Overview data for the dashboard home screen.
Response: 200 OK
{
"period": "last_30_days",
"totalEstimatedSpend": 14230.00,
"spendTrend": "+17.6%",
"anomaliesDetected": 12,
"anomaliesResolved": 9,
"anomaliesOpen": 3,
"estimatedSavings": 4720.00, // Cost avoided via remediation actions
"zombieResources": 5,
"zombieEstimatedMonthlyCost": 127.40,
"topAnomalies": [ /* top 5 by severity */ ],
"spendByService": [
{ "service": "ec2", "estimatedCost": 8540.00, "percentage": 60.0 },
{ "service": "rds", "estimatedCost": 3420.00, "percentage": 24.0 },
{ "service": "lambda", "estimatedCost": 2270.00, "percentage": 16.0 }
],
"dailySpend": [
{ "date": "2026-02-01", "estimatedCost": 412.00 },
{ "date": "2026-02-02", "estimatedCost": 398.00 },
// ... 30 days
]
}
GET /v1/dashboard/timeline
Anomaly timeline for visualization.
Query Parameters:
since (optional) — default 30 days
until (optional) — default now
Response: 200 OK
{
"events": [
{
"timestamp": "2026-02-28T11:02:15Z",
"type": "anomaly",
"severity": "warning",
"title": "2× p3.2xlarge launched",
"estimatedHourlyCost": 6.12,
"status": "resolved"
},
{
"timestamp": "2026-02-27T09:00:00Z",
"type": "digest",
"title": "Daily digest sent"
}
]
}
7.6 Integration Points: dd0c/route Cross-Sell
dd0c/cost and dd0c/route are the "gateway drug pair." Their integration creates compound value that neither product delivers alone.
Shared Infrastructure
| Component | Shared? | Details |
|---|---|---|
| Cognito User Pool | ✅ Shared | Single sign-on across dd0c products. One login, access both products. |
| Tenant/Account Registry | ✅ Shared | TENANT#<tenant_id> in DynamoDB is the same entity across products. A customer who uses dd0c/route and adds dd0c/cost doesn't create a new account. |
| Slack Integration | ✅ Shared | One Slack app (dd0c) with scopes for both products. Alerts from both products go to the same (or different) channels. Single OAuth flow. |
| Billing (Stripe) | ✅ Shared | One Stripe customer. One invoice. Bundle pricing applied automatically. |
| API Gateway | ✅ Shared | api.dd0c.dev/v1/cost/* and api.dd0c.dev/v1/route/* on the same API Gateway. |
| CloudFormation Templates | ❌ Separate | dd0c/cost needs CloudTrail + resource describe. dd0c/route needs different permissions (if any AWS integration). Separate stacks, separate IAM roles. |
| Data Stores | ❌ Separate | Different DynamoDB tables. Different data schemas. No cross-product data access in V1. |
Cross-Sell Triggers
// In dd0c/route's notification service:
// When a dd0c/route customer saves money on LLM routing,
// check if they also use dd0c/cost.
interface CrossSellTrigger {
trigger: string;
condition: string;
message: string;
}
const CROSS_SELL_TRIGGERS: CrossSellTrigger[] = [
{
trigger: "route_savings_milestone",
condition: "dd0c/route customer saves >$500/month AND does not have dd0c/cost",
message: "🎉 dd0c/route saved you $X on LLM costs this month. Want to find savings on your AWS bill too? dd0c/cost monitors your AWS account for cost anomalies in real-time. [Try dd0c/cost →]"
},
{
trigger: "cost_onboarding_complete",
condition: "dd0c/cost customer completes onboarding AND does not have dd0c/route",
message: "✅ dd0c/cost is monitoring your AWS account. If your team uses OpenAI, Anthropic, or other LLM APIs, dd0c/route can cut those costs by 30-50% with intelligent model routing. [Try dd0c/route →]"
},
{
trigger: "combined_savings_report",
condition: "Customer uses both products AND it's the 1st of the month",
message: "📊 dd0c Monthly Savings Report\n• AWS cost anomalies caught: $X (dd0c/cost)\n• LLM routing savings: $Y (dd0c/route)\n• Total saved: $Z\n• dd0c subscription cost: $W\n• Net savings: $(Z-W) 🚀"
}
];
Future Integration: Combined Cost Intelligence (V3+)
When a customer uses both products, dd0c has a unique data advantage:
- dd0c/route knows: which services make LLM API calls, how much they spend on each model, and which calls could be routed to cheaper models
- dd0c/cost knows: which AWS resources are running, their cost, and which are anomalous or idle
Combined insight example:
"Your `recommendation-service` (ECS, us-east-1) costs $1,800/month in compute AND makes $3,200/month in GPT-4o API calls. dd0c/route can cut the API costs to $1,100/month by routing 60% of calls to Claude Haiku. And the ECS service is over-provisioned: you're running 4 tasks but CPU never exceeds 30%. Scaling to 2 tasks saves $900/month. Total potential savings: $3,000/month."
No single-product competitor can deliver this insight. It requires both infrastructure cost data AND application-level API cost data. This is the platform moat.
API: Cross-Product Account Linking
POST /v1/accounts/{accountId}/link
{
"product": "route",
"routeAccountId": "rt_01HXYZ..."
}
Response: 200 OK
{
"linked": true,
"sharedTenantId": "tn_01HXYZ...",
"enabledFeatures": ["combined_savings_report", "cross_sell_suppressed"]
}
When accounts are linked, cross-sell messages are suppressed (the customer already uses both products) and combined reporting is enabled.
APPENDIX A: DECISION LOG
| # | Decision | Alternatives | Rationale | Revisit Trigger |
|---|---|---|---|---|
| 1 | Lambda over ECS for all compute | ECS Fargate | Zero ops, pay-per-invocation, auto-scale to zero. Solo founder can't afford container management. | >5,000 accounts or Lambda cold starts >2s on hot path |
| 2 | DynamoDB single-table over PostgreSQL | Aurora PostgreSQL, Aurora Serverless | No connection pooling, no vacuum, no patching. On-demand pricing = $0 at zero traffic. | V2 dashboard needs complex aggregation queries → add Aurora as read replica |
| 3 | EventBridge over Kinesis for ingestion | Kinesis Data Streams | Native cross-account event routing. Content-based filtering at source. Kinesis requires shard management. | >10,000 accounts or need sub-second ordering guarantees |
| 4 | SQS FIFO over Standard | SQS Standard | Exactly-once processing prevents duplicate anomaly alerts. Message group per account ensures ordering. | FIFO throughput limit (3,000 msg/sec with batching, 300 msg/sec without) becomes a bottleneck |
| 5 | Static pricing tables over real-time Price List API | AWS Price List API per-request | Price List API is 2-5s per query. Unacceptable in hot path. Pricing changes quarterly. Weekly batch update is sufficient. | AWS introduces hourly pricing changes (unlikely) |
| 6 | V1 ships suggestions, not one-click remediation | Ship remediation in V1 | Trust must be earned before taking action on customer resources. Suggestions deliver 80% of value at 0% risk. | Design partners request buttons AND false positive rate <15% |
| 7 | TypeScript over Python | Python, Go | Same language for CDK + Lambda + API. Faster cold starts than Python. Type safety across stack. | Team grows and prefers Python (unlikely for solo founder) |
| 8 | Cognito over Auth0/Clerk | Auth0, Clerk, Supabase Auth | Free <50K MAU. Native API Gateway integration. No vendor dependency. | Cognito UX becomes a conversion bottleneck (common complaint) → migrate to Clerk |
| 9 | Single-region (us-east-1) for V1 | Multi-region | Simplicity. EventBridge cross-account works within a region. Multi-region adds complexity for zero benefit at <100 accounts. | Non-US customers >30% of base OR availability SLA requirement |
| 10 | No web dashboard in V1 | Ship basic dashboard | Slack-first means no dashboard needed for core workflow. Dashboard is engineering time that doesn't improve detection or alerting. | >50% of design partners request anomaly history outside Slack |
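Decision 4 depends on two FIFO features: one message group per account for ordering, and a deduplication ID so that a redelivered event does not produce a second alert. The following is a minimal sketch of how an ingestion message might be shaped, assuming the CloudTrail eventID serves as the deduplication key. `IngestEvent`, `FifoMessage`, and `toFifoMessage` are hypothetical names used only for illustration; the output field names mirror the SQS SendMessage request parameters.

```typescript
// Hypothetical shape of a CloudTrail event entering the pipeline.
interface IngestEvent {
  accountId: string; // customer AWS account ID
  eventId: string;   // CloudTrail eventID (assumed here to be the dedup key)
  payload: unknown;  // raw CloudTrail record
}

// Subset of the SQS SendMessage request fields that matter for FIFO queues.
interface FifoMessage {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

// Build the message: ordering is scoped to the account, and the same
// eventID sent twice within the dedup window is accepted only once.
function toFifoMessage(queueUrl: string, event: IngestEvent): FifoMessage {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(event.payload),
    MessageGroupId: event.accountId,
    MessageDeduplicationId: event.eventId,
  };
}
```

Grouping by account keeps one noisy tenant from blocking another, since FIFO ordering only serializes messages within the same group.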
APPENDIX B: CLOUDTRAIL EVENT REFERENCE
Quick reference for CloudTrail event structures used by dd0c/cost's event processor.
EC2 RunInstances
{
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "requestParameters": {
    "instanceType": "p3.2xlarge",
    "minCount": 2,
    "maxCount": 2,
    "imageId": "ami-0abc123..."
  },
  "responseElements": {
    "instancesSet": {
      "items": [
        { "instanceId": "i-0abc123..." },
        { "instanceId": "i-0def456..." }
      ]
    }
  }
}
Extraction: instanceType from requestParameters, instanceId from responseElements.instancesSet.items[*], count from array length.
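The extraction rule above can be sketched as a small pure function. This is an illustrative sketch, not the event processor's actual code: `RunInstancesEvent`, `LaunchSummary`, and `extractRunInstances` are hypothetical names, and the interface covers only the fields shown in the sample. The sketch also assumes that some records may lack `responseElements` (for example, a call that did not succeed) and treats those as zero launched instances.

```typescript
// Only the fields from the sample event that the extraction needs.
interface RunInstancesEvent {
  eventSource: string;
  eventName: string;
  requestParameters?: { instanceType?: string };
  responseElements?: {
    instancesSet?: { items?: Array<{ instanceId?: string }> };
  } | null;
}

interface LaunchSummary {
  instanceType: string;
  instanceIds: string[];
  count: number;
}

// Returns null when the record is not a usable RunInstances event.
function extractRunInstances(event: RunInstancesEvent): LaunchSummary | null {
  if (event.eventSource !== "ec2.amazonaws.com" || event.eventName !== "RunInstances") {
    return null;
  }
  const instanceType = event.requestParameters?.instanceType;
  if (!instanceType) return null;

  // Missing or null responseElements yields an empty list.
  const items = event.responseElements?.instancesSet?.items ?? [];
  const instanceIds = items
    .map((item) => item.instanceId)
    .filter((id): id is string => typeof id === "string");

  // Count comes from the array length, as described above.
  return { instanceType, instanceIds, count: instanceIds.length };
}
```

Counting from `responseElements` rather than `minCount`/`maxCount` reflects what was actually launched, which is what drives cost.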
RDS CreateDBInstance
{
  "eventSource": "rds.amazonaws.com",
  "eventName": "CreateDBInstance",
  "requestParameters": {
    "dBInstanceIdentifier": "my-database",
    "dBInstanceClass": "db.r5.4xlarge",
    "engine": "postgres",
    "multiAZ": true,
    "allocatedStorage": 100
  }
}
Extraction: dBInstanceClass for pricing lookup. multiAZ: true roughly doubles the cost, because a standby instance and its storage are provisioned. allocatedStorage (in GiB) for storage cost estimate.
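A minimal sketch of how these three fields could feed a monthly estimate from a static pricing table (Decision 5). All names here (`CreateDbInstanceParams`, `RDS_HOURLY_USD`, `estimateRdsMonthlyUsd`) are hypothetical, and the prices are placeholder values chosen for easy arithmetic, not real AWS rates. The 730 hours/month convention and the 2× Multi-AZ multiplier are simplifying assumptions.

```typescript
interface CreateDbInstanceParams {
  dBInstanceClass: string;
  multiAZ?: boolean;
  allocatedStorage?: number; // GiB
}

// PLACEHOLDER prices for illustration only; NOT real AWS rates.
const RDS_HOURLY_USD: Record<string, number> = { "db.r5.4xlarge": 2.0 };
const RDS_STORAGE_USD_PER_GIB_MONTH = 0.1;
const RDS_HOURS_PER_MONTH = 730; // common monthly approximation

// Returns an estimated monthly cost, or null if the class is not in the table.
function estimateRdsMonthlyUsd(params: CreateDbInstanceParams): number | null {
  const hourly = RDS_HOURLY_USD[params.dBInstanceClass];
  if (hourly === undefined) return null;

  const instanceCost = hourly * RDS_HOURS_PER_MONTH;
  const storageCost = (params.allocatedStorage ?? 0) * RDS_STORAGE_USD_PER_GIB_MONTH;

  // Assumption: Multi-AZ doubles both instance and storage cost.
  const multiplier = params.multiAZ ? 2 : 1;
  return (instanceCost + storageCost) * multiplier;
}
```

Returning null for an unknown instance class lets the caller decide whether to skip the alert or flag a stale pricing table, rather than silently estimating $0.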
EC2 CreateNatGateway
{
  "eventSource": "ec2.amazonaws.com",
  "eventName": "CreateNatGateway",
  "requestParameters": {
    "subnetId": "subnet-0abc123...",
    "allocationId": "eipalloc-0def456..."
  },
  "responseElements": {
    "CreateNatGatewayResponse": {
      "natGateway": {
        "natGatewayId": "nat-0ghi789..."
      }
    }
  }
}
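Extraction: natGatewayId from responseElements.CreateNatGatewayResponse.natGateway. NAT Gateway pricing has no size tiers ($0.045/hr plus $0.045 per GB processed), so there is no instance type to look up. Alert on creation because NAT Gateways are notorious silent cost bombs: the hourly charge accrues even when idle, and the per-GB charge scales with traffic.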
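Because the rate has no size dimension, the estimate reduces to simple arithmetic. The sketch below uses the $0.045/hr and $0.045/GB figures quoted above; `CreateNatGatewayEvent`, `extractNatGatewayId`, and `estimateNatMonthlyUsd` are hypothetical names, the 730 hours/month convention is an assumption, and the data volume is a caller-supplied guess, since it cannot be known at creation time.

```typescript
// Only the fields from the sample event that the extraction needs.
interface CreateNatGatewayEvent {
  eventName: string;
  responseElements?: {
    CreateNatGatewayResponse?: { natGateway?: { natGatewayId?: string } };
  } | null;
}

const NAT_HOURLY_USD = 0.045;      // figure quoted in this appendix
const NAT_PER_GB_USD = 0.045;      // figure quoted in this appendix
const NAT_HOURS_PER_MONTH = 730;   // assumed monthly approximation

// Pull the gateway ID from the response, or null if it is absent.
function extractNatGatewayId(event: CreateNatGatewayEvent): string | null {
  if (event.eventName !== "CreateNatGateway") return null;
  return event.responseElements?.CreateNatGatewayResponse?.natGateway?.natGatewayId ?? null;
}

// Fixed hourly charge plus a data-processing charge for the assumed volume.
function estimateNatMonthlyUsd(assumedGbProcessed: number): number {
  return NAT_HOURLY_USD * NAT_HOURS_PER_MONTH + NAT_PER_GB_USD * assumedGbProcessed;
}
```

Under these assumptions an idle gateway costs about $32.85 per month, and each 1,000 GB processed adds roughly $45, which is why the data charge usually dominates.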
This architecture document is a living artifact. It will be updated as V1 development reveals implementation realities, design partner feedback reshapes priorities, and scaling demands evolve. The core architectural bet — real-time CloudTrail event stream processing as the speed layer, with CUR reconciliation as the accuracy layer — is the foundation that everything else builds on. If that bet is wrong, the product is wrong. If it's right, dd0c/cost has an architectural moat that batch-processing competitors can't easily cross.