Max Mayfield c1484426cc Phase 3: BDD acceptance specs for P1 (route) and P5 (cost)
P1: 50+ scenarios across 10 epics, all stories covered
P5: 55+ scenarios across 10 epics, written manually (Sonnet credential failures)
Remaining P2/P3/P4/P6 in progress via subagents
2026-03-01 01:50:30 +00:00


dd0c/cost — BDD Acceptance Test Specifications

Phase 3: Given/When/Then per Story
Date: March 1, 2026


Epic 1: CloudTrail Ingestion

Story 1.1: EventBridge Cross-Account Rules

Feature: CloudTrail Event Ingestion

  Scenario: EC2 RunInstances event ingested successfully
    Given a customer AWS account has EventBridge rules forwarding to dd0c
    When a RunInstances CloudTrail event fires in the customer account
    Then the event arrives in the dd0c SQS FIFO queue within 5 seconds
    And the event contains userIdentity, eventSource, and requestParameters

  Scenario: Non-cost-generating event is filtered at EventBridge
    Given EventBridge rules filter for cost-relevant API calls only
    When a DescribeInstances CloudTrail event fires
    Then the event is NOT forwarded to the SQS queue

Story 1.2: SQS FIFO Deduplication

Feature: CloudTrail Event Deduplication

  Scenario: Duplicate CloudTrail event is deduplicated by SQS FIFO
    Given a RunInstances event with eventID "evt-abc-123" was already processed
    When the same event arrives again (CloudTrail duplicate delivery)
    Then SQS FIFO deduplicates based on the eventID
    And only one CostEvent is written to DynamoDB

  Scenario: Same resource type from different events is allowed
    Given two different RunInstances events for the same instance type
    But with different eventIDs
    When both arrive in the SQS queue
    Then both are processed as separate CostEvents
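A minimal sketch of the producer side of this deduplication, assuming the event arrives in the EventBridge CloudTrail envelope (payload under `detail`). Only the `send_message` kwargs are built here; the queue URL and boto3 client are omitted, and `build_sqs_entry` is an illustrative name, not a confirmed function in the codebase:

```python
import json

def build_sqs_entry(raw_event: dict) -> dict:
    """Build send_message kwargs for the dd0c FIFO queue.

    MessageDeduplicationId = CloudTrail eventID, so a duplicate delivery
    of the same event inside SQS FIFO's 5-minute dedup window is dropped
    and only one CostEvent is ever written downstream. Two distinct
    events with different eventIDs pass through as separate messages.
    """
    detail = raw_event["detail"]
    return {
        "MessageBody": json.dumps(raw_event),
        "MessageDeduplicationId": detail["eventID"],
        # Group by account so events from one account stay ordered.
        "MessageGroupId": detail["recipientAccountId"],
    }
```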

Story 1.3: Lambda Normalizer

Feature: CloudTrail Event Normalization

  Scenario: EC2 RunInstances normalized to CostEvent schema
    Given a raw CloudTrail RunInstances event
    When the Lambda normalizer processes it
    Then a CostEvent is created with: service="ec2", instanceType="m5.xlarge", region, actor ARN, hourly cost
    And the hourly cost is looked up from the static pricing table

  Scenario: RDS CreateDBInstance normalized correctly
    Given a raw CloudTrail CreateDBInstance event
    When the Lambda normalizer processes it
    Then a CostEvent is created with: service="rds", engine, storageType, multiAZ flag

  Scenario: Unknown instance type uses fallback pricing
    Given a CloudTrail event with instanceType "z99.mega" not in the pricing table
    When the normalizer looks up pricing
    Then fallback pricing of $0 is applied
    And a warning is logged: "Unknown instance type: z99.mega"

  Scenario: Malformed CloudTrail JSON does not crash the Lambda
    Given a CloudTrail event with invalid JSON structure
    When the Lambda normalizer processes it
    Then the event is sent to the DLQ
    And no CostEvent is written
    And an error metric is emitted

  Scenario: Batched RunInstances (multiple instances) creates multiple CostEvents
    Given a RunInstances event launching 5 instances simultaneously
    When the normalizer processes it
    Then 5 separate CostEvents are created (one per instance)
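The normalizer behavior above (fan-out per instance, static-table lookup, $0 fallback with a warning) can be sketched as follows. The `responseElements.instancesSet.items` path matches the real CloudTrail RunInstances record shape, but the `PRICING` table and function name are stand-ins for illustration:

```python
import logging

logger = logging.getLogger("normalizer")

# Stand-in static pricing table; the real one covers all supported types.
PRICING = {"m5.xlarge": 0.192, "t3.nano": 0.0052}

def normalize_run_instances(detail: dict) -> list:
    """Turn one RunInstances CloudTrail record into one CostEvent per instance."""
    events = []
    for item in detail["responseElements"]["instancesSet"]["items"]:
        itype = item["instanceType"]
        hourly = PRICING.get(itype)
        if hourly is None:
            logger.warning("Unknown instance type: %s", itype)
            hourly = 0.0  # fallback pricing per spec
        events.append({
            "service": "ec2",
            "instanceType": itype,
            "region": detail["awsRegion"],
            "actor": detail["userIdentity"]["arn"],
            "hourlyCost": hourly,
        })
    return events
```

A batched launch of 5 instances yields 5 CostEvents; an unknown type yields one event at $0 plus the logged warning.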

Story 1.4: Cost Precision

Feature: Cost Calculation Precision

  Scenario: Hourly cost calculated with 4 decimal places
    Given instance type "m5.xlarge" costs $0.192/hr
    When the normalizer calculates hourly cost
    Then the cost is stored as 0.1920 (4 decimal places)

  Scenario: Sub-cent Lambda costs handled correctly
    Given a Lambda invocation costing $0.0000002 per request
    When the normalizer calculates cost
    Then the cost is stored without floating-point precision loss
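One way to satisfy both precision scenarios, assuming prices arrive as strings: `decimal.Decimal` constructed from a string (never a float) carries exact values, with quantization to 4 decimal places applied only to hourly costs. The helper names are illustrative:

```python
from decimal import Decimal

FOUR_DP = Decimal("0.0001")

def hourly_cost(price: str) -> Decimal:
    """Parse a price from string (never float) and fix 4 decimal places."""
    return Decimal(price).quantize(FOUR_DP)

def per_request_cost(price: str) -> Decimal:
    # Sub-cent Lambda costs skip quantization to keep full precision.
    return Decimal(price)
```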

Epic 2: Anomaly Detection

Story 2.1: Z-Score Calculation

Feature: Z-Score Anomaly Scoring

  Scenario: Event matching baseline mean scores zero
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.25
    Then the Z-score is 0.0
    And the severity is "info"

  Scenario: Event 3 standard deviations above mean scores critical
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.70 (Z=3.0)
    Then the Z-score is 3.0
    And the severity is "critical"

  Scenario: Zero standard deviation does not cause division by zero
    Given a baseline with mean=$1.00/hr and stddev=$0.00 (all identical observations)
    When a CostEvent arrives with hourly cost $1.50
    Then the scorer handles the zero-stddev case gracefully
    And the severity is based on absolute cost delta, not Z-score

  Scenario: Single data point baseline (stddev undefined)
    Given a baseline with only 1 observation
    When a CostEvent arrives
    Then the scorer uses the cold-start fast-path instead of Z-score
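A sketch of the scorer guarding both degenerate cases above. The zero-stddev fallback threshold ($0.50 absolute delta) and the 2.0 warning band are illustrative assumptions, not values taken from the spec:

```python
def score_event(cost: float, mean: float, stddev: float, n_obs: int):
    """Return (z_score, severity); z_score is None when the statistical
    path is unusable (single observation or zero spread)."""
    if n_obs < 2 or stddev == 0.0:
        # Absolute-delta fallback instead of dividing by zero.
        return None, ("warning" if cost - mean >= 0.50 else "info")
    z = (cost - mean) / stddev
    if z >= 3.0:
        return z, "critical"
    if z >= 2.0:
        return z, "warning"
    return z, "info"
```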

Story 2.2: Novelty Scoring

Feature: Actor and Instance Novelty Detection

  Scenario: New IAM role triggers actor novelty penalty
    Given the baseline has observed actors: ["arn:aws:iam::123:role/ci"]
    When a CostEvent arrives from actor "arn:aws:iam::123:role/unknown-script"
    Then an actor novelty penalty is added to the composite score

  Scenario: Known actor and known instance type incur no novelty penalty
    Given the baseline has observed actors and instance types matching the event
    When the CostEvent is scored
    Then the novelty component is 0


Story 2.3: Cold-Start Fast Path

Feature: Cold-Start Fast Path for Expensive Resources

  Scenario: $25/hr instance triggers immediate critical (bypasses baseline)
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a p4d.24xlarge at $32.77/hr
    Then the fast-path triggers immediately
    And severity is "critical" regardless of baseline state

  Scenario: $0.10/hr instance ignored during cold-start
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a t3.nano at $0.0052/hr
    Then the fast-path does NOT trigger
    And the event is scored normally (likely "info")

  Scenario: Fast-path transitions to statistical scoring at maturity
    Given the account baseline has 20+ events AND 14+ days
    When a CostEvent arrives for a $10/hr instance
    Then the Z-score path is used (not fast-path)

Story 2.4: Composite Scoring

Feature: Composite Severity Classification

  Scenario: Composite score below 30 classified as info
    Given Z-score contribution is 15 and novelty contribution is 10
    When the composite score is calculated (25)
    Then severity is "info"

  Scenario: Composite score above 60 classified as critical
    Given Z-score contribution is 45 and novelty contribution is 20
    When the composite score is calculated (65)
    Then severity is "critical"

  Scenario: Events below $0.50/hr never classified as critical
    Given a CostEvent with hourly cost $0.30 and Z-score of 5.0
    When the composite score is calculated
    Then severity is capped at "warning" (never critical for cheap resources)
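The three composite rules above (info below 30, critical above 60, cheap-resource cap below $0.50/hr) can be sketched as one classifier. Behavior for scores between 30 and 60 is not pinned down by the scenarios, so treating that band as "warning" is an assumption:

```python
CRITICAL_COST_FLOOR = 0.50  # $/hr; cheap resources never page as critical

def classify(z_component: float, novelty_component: float,
             hourly_cost: float) -> str:
    score = z_component + novelty_component
    if score < 30:
        return "info"
    if score > 60 and hourly_cost >= CRITICAL_COST_FLOOR:
        return "critical"
    # Mid-band scores, and high scores on cheap resources, cap at warning.
    return "warning"
```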

Story 2.5: Feedback Loop

Feature: Mark-as-Expected Feedback

  Scenario: Marking an anomaly as expected reduces future scores
    Given an anomaly was triggered for actor "arn:aws:iam::123:role/batch-job"
    When the user clicks "Mark as Expected" in Slack
    Then the actor is added to the expected_actors list
    And future events from this actor receive a reduced score

  Scenario: Expected actor still flagged if cost is 10x above baseline
    Given actor "batch-job" is in the expected_actors list
    And the baseline mean is $1.00/hr
    When a CostEvent arrives from "batch-job" at $15.00/hr
    Then the anomaly is still flagged (10x override)

Epic 3: Zombie Hunter

Feature: Idle Resource Detection

  Scenario: EC2 instance running 7+ days with <5% CPU detected as zombie
    Given an EC2 instance "i-abc123" has been running for 10 days
    And CloudWatch average CPU utilization is 2%
    When the daily Zombie Hunter scan runs
    Then "i-abc123" is flagged as a zombie
    And cumulative waste cost is calculated (10 days × 24 h × hourly rate)

  Scenario: RDS instance with 0 connections for 3+ days detected
    Given an RDS instance "prod-db-unused" has 0 connections for 5 days
    When the Zombie Hunter scan runs
    Then "prod-db-unused" is flagged as a zombie

  Scenario: Unattached EBS volume older than 7 days detected
    Given an EBS volume "vol-xyz" has been unattached for 14 days
    When the Zombie Hunter scan runs
    Then "vol-xyz" is flagged as a zombie
    And the waste cost includes the daily storage charge

  Scenario: Instance tagged dd0c:ignore is excluded
    Given an EC2 instance has tag "dd0c:ignore=true"
    When the Zombie Hunter scan runs
    Then the instance is skipped

  Scenario: Zombie scan respects read-only permissions
    Given the dd0c IAM role has only read permissions
    When the Zombie Hunter scan runs
    Then no modify/stop/terminate API calls are made
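The EC2 zombie rules above reduce to a pure, read-only predicate plus a waste formula. The thresholds (7 days, 5% CPU) come from the scenarios; the function names are illustrative, and the CloudWatch metric fetch that supplies `avg_cpu_percent` is omitted:

```python
def is_zombie_ec2(running_days: float, avg_cpu_percent: float,
                  tags: dict) -> bool:
    """Classify only; the scanner never calls stop/terminate APIs."""
    if tags.get("dd0c:ignore") == "true":
        return False
    return running_days >= 7 and avg_cpu_percent < 5.0

def waste_cost(running_days: float, hourly_rate: float) -> float:
    # Cumulative waste: days running × 24 hours × hourly rate.
    return running_days * 24 * hourly_rate
```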

Epic 4: Notification & Remediation

Story 4.1: Slack Block Kit Alerts

Feature: Slack Anomaly Alerts

  Scenario: Critical anomaly sends Slack alert with Block Kit
    Given an anomaly with severity "critical" for a p4d.24xlarge in us-east-1
    When the notification engine processes it
    Then a Slack message is sent with Block Kit formatting
    And the message includes: resource type, region, hourly cost, actor, "Why this alert" section
    And the message includes buttons: Snooze 1h, Snooze 24h, Mark Expected, Stop Instance

  Scenario: Info-level anomalies are NOT sent to Slack
    Given an anomaly with severity "info"
    When the notification engine processes it
    Then no Slack message is sent
    And the anomaly is only logged to DynamoDB

Story 4.2: Interactive Remediation

Feature: One-Click Remediation via Slack

  Scenario: User clicks "Stop Instance" and remediation succeeds
    Given a Slack alert for EC2 instance "i-abc123" in account "123456789012"
    When the user clicks "Stop Instance"
    Then the system assumes the cross-account remediation role via STS
    And ec2:StopInstances is called for "i-abc123"
    And the Slack message is updated to "Remediation Successful ✅"

  Scenario: Cross-account IAM role has been revoked
    Given the customer revoked the dd0c remediation IAM role
    When the user clicks "Stop Instance"
    Then STS AssumeRole fails
    And the Slack message is updated to "Remediation Failed: IAM role not found"
    And no instance is stopped

  Scenario: Snooze suppresses alerts for the specified duration
    Given the user clicks "Snooze 24h" on an anomaly
    When a similar anomaly fires within 24 hours
    Then no Slack alert is sent
    And the anomaly is still logged to DynamoDB

  Scenario: Mark Expected updates the baseline
    Given the user clicks "Mark Expected"
    When the feedback is processed
    Then the actor is added to expected_actors
    And the false-positive counter is incremented for governance scoring

Story 4.3: Slack Signature Validation

Feature: Slack Interactive Payload Security

  Scenario: Valid Slack signature accepted
    Given a Slack interactive payload with correct HMAC-SHA256 signature
    When the API receives the payload
    Then the action is processed

  Scenario: Invalid signature rejected
    Given a payload with an incorrect X-Slack-Signature header
    When the API receives the payload
    Then HTTP 401 is returned
    And no action is taken

  Scenario: Expired timestamp rejected (replay attack)
    Given a payload with X-Slack-Request-Timestamp older than 5 minutes
    When the API receives the payload
    Then HTTP 401 is returned
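The three scenarios above map directly onto Slack's documented v0 signing scheme: HMAC-SHA256 over `v0:{timestamp}:{body}` with the signing secret, a constant-time comparison, and a 5-minute staleness window. A self-contained sketch (the `now` parameter exists only to make the check testable):

```python
import hashlib
import hmac
import time

FIVE_MINUTES = 60 * 5

def verify_slack_signature(signing_secret: str, timestamp: str,
                           body: str, signature: str, now=None) -> bool:
    """Validate Slack's v0 request signature with replay protection."""
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > FIVE_MINUTES:
        return False  # stale X-Slack-Request-Timestamp: possible replay
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), base,
                                hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, signature)
```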

Story 4.4: Daily Digest

Feature: Daily Anomaly Digest

  Scenario: Daily digest aggregates 24h of anomalies
    Given 15 anomalies were detected in the last 24 hours
    When the daily digest cron fires
    Then a Slack message is sent with: total anomalies, top 3 costliest, total estimated spend, zombie count

  Scenario: No digest sent if zero anomalies
    Given zero anomalies in the last 24 hours
    When the daily digest cron fires
    Then no Slack message is sent

Epic 5: Onboarding & PLG

Feature: AWS Account Onboarding

  Scenario: CloudFormation quick-create deploys IAM role
    Given the user clicks the quick-create URL
    When CloudFormation deploys the stack
    Then an IAM role is created with the correct permissions and ExternalId
    And an EventBridge rule is created for cost-relevant CloudTrail events

  Scenario: Role validation succeeds with correct ExternalId
    Given the IAM role exists with ExternalId "dd0c-tenant-123"
    When POST /v1/accounts validates the role
    Then STS AssumeRole succeeds
    And the account is marked as "active"
    And the initial Zombie Hunter scan is triggered

  Scenario: Role validation fails with wrong ExternalId
    Given the IAM role exists but ExternalId does not match
    When POST /v1/accounts validates the role
    Then STS AssumeRole fails with AccessDenied
    And a clear error message is returned

  Scenario: Free tier allows 1 account
    Given the tenant is on the free tier with 0 connected accounts
    When they connect their first AWS account
    Then the connection succeeds

  Scenario: Free tier rejects 2nd account
    Given the tenant is on the free tier with 1 connected account
    When they attempt to connect a second account
    Then HTTP 403 is returned with an upgrade prompt

  Scenario: Stripe upgrade unlocks multiple accounts
    Given the tenant upgrades via Stripe Checkout
    When checkout.session.completed webhook fires
    Then the tenant tier is updated to "pro"
    And they can connect additional accounts

  Scenario: Stripe webhook signature validated
    Given a Stripe webhook payload
    When the signature does not match the signing secret
    Then HTTP 401 is returned

Epic 6: Dashboard API

Feature: Dashboard API with Tenant Isolation

  Scenario: GET /v1/anomalies returns anomalies for authenticated tenant
    Given tenant "t-123" has 50 anomalies
    And tenant "t-456" has 30 anomalies
    When tenant "t-123" calls GET /v1/anomalies
    Then only tenant "t-123" anomalies are returned
    And zero anomalies from "t-456" are included

  Scenario: Cursor-based pagination
    Given tenant "t-123" has 200 anomalies
    When GET /v1/anomalies?limit=20 is called
    Then 20 results are returned with a cursor token
    And calling with the cursor returns the next 20
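One common way to implement the cursor token, assuming DynamoDB-backed storage: serialize the `LastEvaluatedKey` returned by a Query into an opaque base64 string, and decode it back into `ExclusiveStartKey` on the next request. A sketch (the key attribute names here are hypothetical):

```python
import base64
import json

def encode_cursor(last_evaluated_key: dict) -> str:
    """Opaque cursor token from DynamoDB's LastEvaluatedKey."""
    raw = json.dumps(last_evaluated_key, sort_keys=True).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> dict:
    """Recover ExclusiveStartKey for the next Query page."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))
```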

  Scenario: Filter by severity
    Given anomalies exist with severity "critical", "warning", and "info"
    When GET /v1/anomalies?severity=critical is called
    Then only critical anomalies are returned

  Scenario: Baseline override
    Given tenant "t-123" wants to adjust sensitivity for EC2 m5.xlarge
    When PATCH /v1/accounts/{id}/baselines/ec2/m5.xlarge with sensitivity=0.8
    Then the baseline sensitivity is updated
    And future scoring uses the adjusted threshold

  Scenario: Missing Cognito JWT returns 401
    Given no Authorization header is present
    When any API endpoint is called
    Then HTTP 401 is returned

Epic 7: Dashboard UI

Feature: Dashboard UI

  Scenario: Anomaly feed renders with severity badges
    Given the user is logged in
    When the dashboard loads
    Then anomalies are listed with color-coded severity badges (red/yellow/blue)

  Scenario: Baseline learning progress displayed
    Given an account connected 5 days ago with 30 events
    When the user views the account detail
    Then a progress bar shows "5/14 days, 30/20 events" toward maturity

  Scenario: Zombie resource list shows waste estimate
    Given 3 zombie resources were detected
    When the user views the Zombie Hunter tab
    Then each zombie shows: resource type, age, cumulative waste cost

Epic 8: Infrastructure

Feature: CDK Infrastructure

  Scenario: CDK deploys all required resources
    Given the CDK stack is synthesized
    When cdk deploy is executed
    Then EventBridge rules, SQS FIFO queues, Lambda functions, DynamoDB tables, and Cognito user pool are created

  Scenario: CI/CD pipeline runs on push to main
    Given code is pushed to the main branch
    When GitHub Actions triggers
    Then unit tests, integration tests, and CDK synth run
    And Lambda functions are deployed
    And zero downtime is maintained

Epic 9: Multi-Account Management

Feature: Multi-Account Management

  Scenario: Link additional AWS account
    Given the tenant is on the pro tier
    When they POST /v1/accounts with a new role ARN
    Then the account is validated and linked
    And CloudTrail events from the new account begin flowing

  Scenario: Consolidated anomaly view across accounts
    Given the tenant has 3 linked accounts
    When they GET /v1/anomalies (no account filter)
    Then anomalies from all 3 accounts are returned
    And each anomaly includes the source account ID

  Scenario: Disconnect account stops event processing
    Given account "123456789012" is linked
    When DELETE /v1/accounts/{id} is called
    Then the account is marked as "disconnecting"
    And EventBridge rules are cleaned up
    And no new events are processed from that account

Epic 10: Transparent Factory Compliance

Story 10.1: Atomic Flagging

Feature: Alert Volume Circuit Breaker

  Scenario: Circuit breaker trips at 3x baseline alert volume
    Given the baseline alert rate is 5 alerts/hour
    When 16 alerts fire in the current hour (>3x)
    Then the circuit breaker trips
    And the scoring flag is auto-disabled
    And suppressed anomalies are buffered in the DLQ

  Scenario: Fast-path alerts are exempt from circuit breaker
    Given the circuit breaker is tripped
    When a $30/hr instance triggers the cold-start fast-path
    Then the alert is still sent (fast-path bypasses breaker)
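The breaker decision in these two scenarios is a small pure function: suppress when the current hour's alert count exceeds 3× the baseline rate, except for fast-path alerts, which always go through. A sketch with an illustrative name:

```python
def should_send_alert(current_hour_alerts: int, baseline_rate: float,
                      fast_path: bool) -> bool:
    """Alert-volume circuit breaker.

    Suppresses alerts once volume exceeds 3x the baseline rate; the
    cold-start fast-path is exempt so expensive resources still page.
    """
    if fast_path:
        return True
    return current_hour_alerts <= 3 * baseline_rate
```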

  Scenario: CI blocks expired flags
    Given a feature flag with TTL expired and rollout=100%
    When CI runs the flag audit
    Then the build fails

Story 10.2: Elastic Schema

Feature: DynamoDB Additive Schema

  Scenario: New attribute added without breaking V1 readers
    Given a DynamoDB item has a new "anomaly_score_v2" attribute
    When V1 code reads the item
    Then deserialization succeeds (unknown attributes ignored)

  Scenario: Key schema modifications are rejected
    Given a migration attempts to change the partition key structure
    When CI runs the schema lint
    Then the build fails with "Key schema modification detected"

Story 10.3: Cognitive Durability

Feature: Decision Logs for Scoring Changes

  Scenario: PR modifying Z-score thresholds requires decision log
    Given a PR changes files in src/scoring/
    And no decision_log.json is included
    When CI runs the decision log check
    Then the build fails

  Scenario: Cyclomatic complexity enforced
    Given a scoring function has complexity 12
    When the linter runs with threshold 10
    Then the lint fails

Story 10.4: Semantic Observability

Feature: OTEL Spans on Scoring Decisions

  Scenario: Every anomaly scoring decision emits an OTEL span
    Given a CostEvent is processed by the Anomaly Scorer
    When scoring completes
    Then an "anomaly_scoring" span is created
    And span attributes include: cost.z_score, cost.anomaly_score, cost.baseline_days
    And cost.fast_path_triggered is set to true/false

  Scenario: Account ID is hashed in spans (PII protection)
    Given a scoring span is emitted
    When the span attributes are inspected
    Then the AWS account ID is SHA-256 hashed
    And no raw account ID appears in any attribute
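The PII scenario above amounts to one transformation before the account ID reaches any span attribute; the function name is illustrative:

```python
import hashlib

def span_account_attribute(account_id: str) -> str:
    """SHA-256 hex digest of the AWS account ID for OTEL span
    attributes, so the raw account ID never appears in telemetry."""
    return hashlib.sha256(account_id.encode()).hexdigest()
```

The hash is deterministic, so spans from the same account remain correlatable without exposing the ID itself.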

Story 10.5: Configurable Autonomy (14-Day Governance)

Feature: 14-Day Auto-Promotion Governance

  Scenario: New account starts in strict mode (log-only)
    Given a new AWS account is connected
    When the first anomaly is detected
    Then the anomaly is logged to DynamoDB
    And NO Slack alert is sent (strict mode)

  Scenario: Account auto-promotes to audit mode after 14 days
    Given an account has been connected for 15 days
    And the false-positive rate is 5% (<10% threshold)
    When the daily governance cron runs
    Then the account mode is changed from "strict" to "audit"
    And a log entry "Auto-promoted to audit mode" is written

  Scenario: Account does NOT promote if false-positive rate is high
    Given an account has been connected for 15 days
    And the false-positive rate is 15% (>10% threshold)
    When the daily governance cron runs
    Then the account remains in "strict" mode

  Scenario: Panic mode halts all alerting in <1 second
    Given panic mode is triggered via POST /admin/panic
    When a new critical anomaly is detected
    Then NO Slack alert is sent
    And the anomaly is still scored and logged
    And the dashboard API returns an "alerting paused" header

  Scenario: Panic mode requires manual clearance
    Given panic mode is active
    When 24 hours pass without manual intervention
    Then panic mode is still active (no auto-clear)
    And only POST /admin/panic with panic=false clears it

End of dd0c/cost BDD Acceptance Specifications — 10 Epics, 55+ Scenarios