# dd0c/cost — BDD Acceptance Test Specifications

**Phase 3: Given/When/Then per Story**
**Date:** March 1, 2026

---

## Epic 1: CloudTrail Ingestion

### Story 1.1: EventBridge Cross-Account Rules

```gherkin
Feature: CloudTrail Event Ingestion

  Scenario: EC2 RunInstances event ingested successfully
    Given a customer AWS account has EventBridge rules forwarding to dd0c
    When a RunInstances CloudTrail event fires in the customer account
    Then the event arrives in the dd0c SQS FIFO queue within 5 seconds
    And the event contains userIdentity, eventSource, and requestParameters

  Scenario: Non-cost-generating event is filtered at EventBridge
    Given EventBridge rules filter for cost-relevant API calls only
    When a DescribeInstances CloudTrail event fires
    Then the event is NOT forwarded to the SQS queue
```

### Story 1.2: SQS FIFO Deduplication

```gherkin
Feature: CloudTrail Event Deduplication

  Scenario: Duplicate CloudTrail event is deduplicated by SQS FIFO
    Given a RunInstances event with eventID "evt-abc-123" was already processed
    When the same event arrives again (CloudTrail duplicate delivery)
    Then SQS FIFO deduplicates based on the eventID
    And only one CostEvent is written to DynamoDB

  Scenario: Same resource type from different events is allowed
    Given two different RunInstances events for the same instance type
    But with different eventIDs
    When both arrive in the SQS queue
    Then both are processed as separate CostEvents
```

### Story 1.3: Lambda Normalizer

```gherkin
Feature: CloudTrail Event Normalization

  Scenario: EC2 RunInstances normalized to CostEvent schema
    Given a raw CloudTrail RunInstances event
    When the Lambda normalizer processes it
    Then a CostEvent is created with: service="ec2", instanceType="m5.xlarge", region, actor ARN, hourly cost
    And the hourly cost is looked up from the static pricing table

  Scenario: RDS CreateDBInstance normalized correctly
    Given a raw CloudTrail CreateDBInstance event
    When the Lambda normalizer processes it
    Then a CostEvent is created with:
      service="rds", engine, storageType, multiAZ flag

  Scenario: Unknown instance type uses fallback pricing
    Given a CloudTrail event with instanceType "z99.mega" not in the pricing table
    When the normalizer looks up pricing
    Then fallback pricing of $0 is applied
    And a warning is logged: "Unknown instance type: z99.mega"

  Scenario: Malformed CloudTrail JSON does not crash the Lambda
    Given a CloudTrail event with invalid JSON structure
    When the Lambda normalizer processes it
    Then the event is sent to the DLQ
    And no CostEvent is written
    And an error metric is emitted

  Scenario: Batched RunInstances (multiple instances) creates multiple CostEvents
    Given a RunInstances event launching 5 instances simultaneously
    When the normalizer processes it
    Then 5 separate CostEvents are created (one per instance)
```

### Story 1.4: Cost Precision

```gherkin
Feature: Cost Calculation Precision

  Scenario: Hourly cost calculated with 4 decimal places
    Given instance type "m5.xlarge" costs $0.192/hr
    When the normalizer calculates hourly cost
    Then the cost is stored as 0.1920 (4 decimal places)

  Scenario: Sub-cent Lambda costs handled correctly
    Given a Lambda invocation costing $0.0000002 per request
    When the normalizer calculates cost
    Then the cost is stored without floating-point precision loss
```

---

## Epic 2: Anomaly Detection

### Story 2.1: Z-Score Calculation

```gherkin
Feature: Z-Score Anomaly Scoring

  Scenario: Event matching baseline mean scores zero
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.25
    Then the Z-score is 0.0
    And the severity is "info"

  Scenario: Event 3 standard deviations above mean scores critical
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.70 (Z=3.0)
    Then the Z-score is 3.0
    And the severity is "critical"

  Scenario: Zero standard deviation does not cause division by zero
    Given a baseline with mean=$1.00/hr and stddev=$0.00 (all identical observations)
    When a CostEvent
      arrives with hourly cost $1.50
    Then the scorer handles the zero-stddev case gracefully
    And the severity is based on absolute cost delta, not Z-score

  Scenario: Single data point baseline (stddev undefined)
    Given a baseline with only 1 observation
    When a CostEvent arrives
    Then the scorer uses the cold-start fast-path instead of Z-score
```

### Story 2.2: Novelty Scoring

```gherkin
Feature: Actor and Instance Novelty Detection

  Scenario: New IAM role triggers actor novelty penalty
    Given the baseline has observed actors: ["arn:aws:iam::123:role/ci"]
    When a CostEvent arrives from actor "arn:aws:iam::123:role/unknown-script"
    Then an actor novelty penalty is added to the composite score

  Scenario: Known actor and known instance type have no novelty penalty
    Given the baseline has observed actors and instance types matching the event
    When the CostEvent is scored
    Then the novelty component is 0
```

### Story 2.3: Cold-Start Fast Path

```gherkin
Feature: Cold-Start Fast Path for Expensive Resources

  Scenario: $25/hr instance triggers immediate critical (bypasses baseline)
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a p4d.24xlarge at $32.77/hr
    Then the fast-path triggers immediately
    And severity is "critical" regardless of baseline state

  Scenario: Low-cost instance ignored during cold-start
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a t3.nano at $0.0052/hr
    Then the fast-path does NOT trigger
    And the event is scored normally (likely "info")

  Scenario: Fast-path transitions to statistical scoring at maturity
    Given the account baseline has 20+ events AND 14+ days
    When a CostEvent arrives for a $10/hr instance
    Then the Z-score path is used (not fast-path)
```

### Story 2.4: Composite Scoring

```gherkin
Feature: Composite Severity Classification

  Scenario: Composite score below 30 classified as info
    Given Z-score contribution is 15 and novelty contribution is 10
    When the composite score is calculated
      (25)
    Then severity is "info"

  Scenario: Composite score above 60 classified as critical
    Given Z-score contribution is 45 and novelty contribution is 20
    When the composite score is calculated (65)
    Then severity is "critical"

  Scenario: Events below $0.50/hr never classified as critical
    Given a CostEvent with hourly cost $0.30 and Z-score of 5.0
    When the composite score is calculated
    Then severity is capped at "warning" (never critical for cheap resources)
```

### Story 2.5: Feedback Loop

```gherkin
Feature: Mark-as-Expected Feedback

  Scenario: Marking an anomaly as expected reduces future scores
    Given an anomaly was triggered for actor "arn:aws:iam::123:role/batch-job"
    When the user clicks "Mark as Expected" in Slack
    Then the actor is added to the expected_actors list
    And future events from this actor receive a reduced score

  Scenario: Expected actor still flagged if cost is 10x above baseline
    Given actor "batch-job" is in the expected_actors list
    And the baseline mean is $1.00/hr
    When a CostEvent arrives from "batch-job" at $15.00/hr
    Then the anomaly is still flagged (10x override)
```

---

## Epic 3: Zombie Hunter

```gherkin
Feature: Idle Resource Detection

  Scenario: EC2 instance running 7+ days with <5% CPU detected as zombie
    Given an EC2 instance "i-abc123" has been running for 10 days
    And CloudWatch average CPU utilization is 2%
    When the daily Zombie Hunter scan runs
    Then "i-abc123" is flagged as a zombie
    And cumulative waste cost is calculated (10 days × hourly rate)

  Scenario: RDS instance with 0 connections for 3+ days detected
    Given an RDS instance "prod-db-unused" has 0 connections for 5 days
    When the Zombie Hunter scan runs
    Then "prod-db-unused" is flagged as a zombie

  Scenario: Unattached EBS volume older than 7 days detected
    Given an EBS volume "vol-xyz" has been unattached for 14 days
    When the Zombie Hunter scan runs
    Then "vol-xyz" is flagged as a zombie
    And the waste cost includes the daily storage charge

  Scenario: Instance tagged dd0c:ignore is excluded
    Given an EC2 instance has tag "dd0c:ignore=true"
    When the Zombie Hunter scan runs
    Then the instance is skipped

  Scenario: Zombie scan respects read-only permissions
    Given the dd0c IAM role has only read permissions
    When the Zombie Hunter scan runs
    Then no modify/stop/terminate API calls are made
```

---

## Epic 4: Notification & Remediation

### Story 4.1: Slack Block Kit Alerts

```gherkin
Feature: Slack Anomaly Alerts

  Scenario: Critical anomaly sends Slack alert with Block Kit
    Given an anomaly with severity "critical" for a p4d.24xlarge in us-east-1
    When the notification engine processes it
    Then a Slack message is sent with Block Kit formatting
    And the message includes: resource type, region, hourly cost, actor, "Why this alert" section
    And the message includes buttons: Snooze 1h, Snooze 24h, Mark Expected, Stop Instance

  Scenario: Info-level anomalies are NOT sent to Slack
    Given an anomaly with severity "info"
    When the notification engine processes it
    Then no Slack message is sent
    And the anomaly is only logged to DynamoDB
```

### Story 4.2: Interactive Remediation

```gherkin
Feature: One-Click Remediation via Slack

  Scenario: User clicks "Stop Instance" and remediation succeeds
    Given a Slack alert for EC2 instance "i-abc123" in account "123456789012"
    When the user clicks "Stop Instance"
    Then the system assumes the cross-account remediation role via STS
    And ec2:StopInstances is called for "i-abc123"
    And the Slack message is updated to "Remediation Successful ✅"

  Scenario: Cross-account IAM role has been revoked
    Given the customer revoked the dd0c remediation IAM role
    When the user clicks "Stop Instance"
    Then STS AssumeRole fails
    And the Slack message is updated to "Remediation Failed: IAM role not found"
    And no instance is stopped

  Scenario: Snooze suppresses alerts for the specified duration
    Given the user clicks "Snooze 24h" on an anomaly
    When a similar anomaly fires within 24 hours
    Then no Slack alert is sent
    And the anomaly is still logged to DynamoDB

  Scenario:
  Mark Expected updates the baseline
    Given the user clicks "Mark Expected"
    When the feedback is processed
    Then the actor is added to expected_actors
    And the false-positive counter is incremented for governance scoring
```

### Story 4.3: Slack Signature Validation

```gherkin
Feature: Slack Interactive Payload Security

  Scenario: Valid Slack signature accepted
    Given a Slack interactive payload with correct HMAC-SHA256 signature
    When the API receives the payload
    Then the action is processed

  Scenario: Invalid signature rejected
    Given a payload with an incorrect X-Slack-Signature header
    When the API receives the payload
    Then HTTP 401 is returned
    And no action is taken

  Scenario: Expired timestamp rejected (replay attack)
    Given a payload with X-Slack-Request-Timestamp older than 5 minutes
    When the API receives the payload
    Then HTTP 401 is returned
```

### Story 4.4: Daily Digest

```gherkin
Feature: Daily Anomaly Digest

  Scenario: Daily digest aggregates 24h of anomalies
    Given 15 anomalies were detected in the last 24 hours
    When the daily digest cron fires
    Then a Slack message is sent with: total anomalies, top 3 costliest, total estimated spend, zombie count

  Scenario: No digest sent if zero anomalies
    Given zero anomalies in the last 24 hours
    When the daily digest cron fires
    Then no Slack message is sent
```

---

## Epic 5: Onboarding & PLG

```gherkin
Feature: AWS Account Onboarding

  Scenario: CloudFormation quick-create deploys IAM role
    Given the user clicks the quick-create URL
    When CloudFormation deploys the stack
    Then an IAM role is created with the correct permissions and ExternalId
    And an EventBridge rule is created for cost-relevant CloudTrail events

  Scenario: Role validation succeeds with correct ExternalId
    Given the IAM role exists with ExternalId "dd0c-tenant-123"
    When POST /v1/accounts validates the role
    Then STS AssumeRole succeeds
    And the account is marked as "active"
    And the initial Zombie Hunter scan is triggered

  Scenario: Role validation fails with wrong
  ExternalId
    Given the IAM role exists but ExternalId does not match
    When POST /v1/accounts validates the role
    Then STS AssumeRole fails with AccessDenied
    And a clear error message is returned

  Scenario: Free tier allows 1 account
    Given the tenant is on the free tier with 0 connected accounts
    When they connect their first AWS account
    Then the connection succeeds

  Scenario: Free tier rejects 2nd account
    Given the tenant is on the free tier with 1 connected account
    When they attempt to connect a second account
    Then HTTP 403 is returned with an upgrade prompt

  Scenario: Stripe upgrade unlocks multiple accounts
    Given the tenant upgrades via Stripe Checkout
    When the checkout.session.completed webhook fires
    Then the tenant tier is updated to "pro"
    And they can connect additional accounts

  Scenario: Stripe webhook signature validated
    Given a Stripe webhook payload
    When the signature does not match the signing secret
    Then HTTP 401 is returned
```

---

## Epic 6: Dashboard API

```gherkin
Feature: Dashboard API with Tenant Isolation

  Scenario: GET /v1/anomalies returns anomalies for authenticated tenant
    Given tenant "t-123" has 50 anomalies
    And tenant "t-456" has 30 anomalies
    When tenant "t-123" calls GET /v1/anomalies
    Then only tenant "t-123" anomalies are returned
    And zero anomalies from "t-456" are included

  Scenario: Cursor-based pagination
    Given tenant "t-123" has 200 anomalies
    When GET /v1/anomalies?limit=20 is called
    Then 20 results are returned with a cursor token
    And calling with the cursor returns the next 20

  Scenario: Filter by severity
    Given anomalies exist with severity "critical", "warning", and "info"
    When GET /v1/anomalies?severity=critical is called
    Then only critical anomalies are returned

  Scenario: Baseline override
    Given tenant "t-123" wants to adjust sensitivity for EC2 m5.xlarge
    When PATCH /v1/accounts/{id}/baselines/ec2/m5.xlarge with sensitivity=0.8
    Then the baseline sensitivity is updated
    And future scoring uses the adjusted threshold

  Scenario: Missing Cognito
  JWT returns 401
    Given no Authorization header is present
    When any API endpoint is called
    Then HTTP 401 is returned
```

---

## Epic 7: Dashboard UI

```gherkin
Feature: Dashboard UI

  Scenario: Anomaly feed renders with severity badges
    Given the user is logged in
    When the dashboard loads
    Then anomalies are listed with color-coded severity badges (red/yellow/blue)

  Scenario: Baseline learning progress displayed
    Given an account connected 5 days ago with 30 events
    When the user views the account detail
    Then a progress bar shows "5/14 days, 30/20 events" toward maturity

  Scenario: Zombie resource list shows waste estimate
    Given 3 zombie resources were detected
    When the user views the Zombie Hunter tab
    Then each zombie shows: resource type, age, cumulative waste cost
```

---

## Epic 8: Infrastructure

```gherkin
Feature: CDK Infrastructure

  Scenario: CDK deploys all required resources
    Given the CDK stack is synthesized
    When cdk deploy is executed
    Then EventBridge rules, SQS FIFO queues, Lambda functions, DynamoDB tables, and Cognito user pool are created

  Scenario: CI/CD pipeline runs on push to main
    Given code is pushed to the main branch
    When GitHub Actions triggers
    Then unit tests, integration tests, and CDK synth run
    And Lambda functions are deployed
    And zero downtime is maintained
```

---

## Epic 9: Multi-Account Management

```gherkin
Feature: Multi-Account Management

  Scenario: Link additional AWS account
    Given the tenant is on the pro tier
    When they POST /v1/accounts with a new role ARN
    Then the account is validated and linked
    And CloudTrail events from the new account begin flowing

  Scenario: Consolidated anomaly view across accounts
    Given the tenant has 3 linked accounts
    When they GET /v1/anomalies (no account filter)
    Then anomalies from all 3 accounts are returned
    And each anomaly includes the source account ID

  Scenario: Disconnect account stops event processing
    Given account "123456789012" is linked
    When DELETE /v1/accounts/{id} is called
    Then the account is marked as
      "disconnecting"
    And EventBridge rules are cleaned up
    And no new events are processed from that account
```

---

## Epic 10: Transparent Factory Compliance

### Story 10.1: Atomic Flagging

```gherkin
Feature: Alert Volume Circuit Breaker

  Scenario: Circuit breaker trips at 3x baseline alert volume
    Given the baseline alert rate is 5 alerts/hour
    When 16 alerts fire in the current hour (>3x)
    Then the circuit breaker trips
    And the scoring flag is auto-disabled
    And suppressed anomalies are buffered in the DLQ

  Scenario: Fast-path alerts are exempt from circuit breaker
    Given the circuit breaker is tripped
    When a $30/hr instance triggers the cold-start fast-path
    Then the alert is still sent (fast-path bypasses breaker)

  Scenario: CI blocks expired flags
    Given a feature flag with an expired TTL and rollout=100%
    When CI runs the flag audit
    Then the build fails
```

### Story 10.2: Elastic Schema

```gherkin
Feature: DynamoDB Additive Schema

  Scenario: New attribute added without breaking V1 readers
    Given a DynamoDB item has a new "anomaly_score_v2" attribute
    When V1 code reads the item
    Then deserialization succeeds (unknown attributes ignored)

  Scenario: Key schema modifications are rejected
    Given a migration attempts to change the partition key structure
    When CI runs the schema lint
    Then the build fails with "Key schema modification detected"
```

### Story 10.3: Cognitive Durability

```gherkin
Feature: Decision Logs for Scoring Changes

  Scenario: PR modifying Z-score thresholds requires decision log
    Given a PR changes files in src/scoring/
    And no decision_log.json is included
    When CI runs the decision log check
    Then the build fails

  Scenario: Cyclomatic complexity enforced
    Given a scoring function has complexity 12
    When the linter runs with threshold 10
    Then the lint fails
```

### Story 10.4: Semantic Observability

```gherkin
Feature: OTEL Spans on Scoring Decisions

  Scenario: Every anomaly scoring decision emits an OTEL span
    Given a CostEvent is processed by the Anomaly Scorer
    When
      scoring completes
    Then an "anomaly_scoring" span is created
    And span attributes include: cost.z_score, cost.anomaly_score, cost.baseline_days
    And cost.fast_path_triggered is set to true/false

  Scenario: Account ID is hashed in spans (PII protection)
    Given a scoring span is emitted
    When the span attributes are inspected
    Then the AWS account ID is SHA-256 hashed
    And no raw account ID appears in any attribute
```

### Story 10.5: Configurable Autonomy (14-Day Governance)

```gherkin
Feature: 14-Day Auto-Promotion Governance

  Scenario: New account starts in strict mode (log-only)
    Given a new AWS account is connected
    When the first anomaly is detected
    Then the anomaly is logged to DynamoDB
    And NO Slack alert is sent (strict mode)

  Scenario: Account auto-promotes to audit mode after 14 days
    Given an account has been connected for 15 days
    And the false-positive rate is 5% (<10% threshold)
    When the daily governance cron runs
    Then the account mode is changed from "strict" to "audit"
    And a log entry "Auto-promoted to audit mode" is written

  Scenario: Account does NOT promote if false-positive rate is high
    Given an account has been connected for 15 days
    And the false-positive rate is 15% (>10% threshold)
    When the daily governance cron runs
    Then the account remains in "strict" mode

  Scenario: Panic mode halts all alerting in <1 second
    Given panic mode is triggered via POST /admin/panic
    When a new critical anomaly is detected
    Then NO Slack alert is sent
    And the anomaly is still scored and logged
    And the dashboard API returns an "alerting paused" header

  Scenario: Panic mode requires manual clearance
    Given panic mode is active
    When 24 hours pass without manual intervention
    Then panic mode is still active (no auto-clear)
    And only POST /admin/panic with panic=false clears it
```

---

*End of dd0c/cost BDD Acceptance Specifications — 10 Epics, 55+ Scenarios*