Phase 3: BDD acceptance specs for P1 (route) and P5 (cost)

P1: 50+ scenarios across 10 epics, all stories covered
P5: 55+ scenarios across 10 epics, written manually (Sonnet credential failures)
Remaining P2/P3/P4/P6 in progress via subagents
2026-03-01 01:50:30 +00:00
parent 03bfe931fc
commit c1484426cc
2 changed files with 1295 additions and 0 deletions
--- a/products/05-aws-cost-anomaly/acceptance-specs/acceptance-specs.md
+++ b/products/05-aws-cost-anomaly/acceptance-specs/acceptance-specs.md
@@ -0,0 +1,610 @@
+# dd0c/cost — BDD Acceptance Test Specifications
+
+**Phase 3: Given/When/Then per Story**
+**Date:** March 1, 2026
+
+---
+
+## Epic 1: CloudTrail Ingestion
+
+### Story 1.1: EventBridge Cross-Account Rules
+
+```gherkin
+Feature: CloudTrail Event Ingestion
+
+  Scenario: EC2 RunInstances event ingested successfully
+    Given a customer AWS account has EventBridge rules forwarding to dd0c
+    When a RunInstances CloudTrail event fires in the customer account
+    Then the event arrives in the dd0c SQS FIFO queue within 5 seconds
+    And the event contains userIdentity, eventSource, and requestParameters
+
+  Scenario: Non-cost-generating event is filtered at EventBridge
+    Given EventBridge rules filter for cost-relevant API calls only
+    When a DescribeInstances CloudTrail event fires
+    Then the event is NOT forwarded to the SQS queue
+```
+
+### Story 1.2: SQS FIFO Deduplication
+
+```gherkin
+Feature: CloudTrail Event Deduplication
+
+  Scenario: Duplicate CloudTrail event is deduplicated by SQS FIFO
+    Given a RunInstances event with eventID "evt-abc-123" was already processed
+    When the same event arrives again within the 5-minute SQS FIFO deduplication interval (CloudTrail duplicate delivery)
+    Then SQS FIFO deduplicates based on the eventID
+    And only one CostEvent is written to DynamoDB
+
+  Scenario: Same resource type from different events is allowed
+    Given two different RunInstances events for the same instance type
+    But with different eventIDs
+    When both arrive in the SQS queue
+    Then both are processed as separate CostEvents
+```
+
+### Story 1.3: Lambda Normalizer
+
+```gherkin
+Feature: CloudTrail Event Normalization
+
+  Scenario: EC2 RunInstances normalized to CostEvent schema
+    Given a raw CloudTrail RunInstances event
+    When the Lambda normalizer processes it
+    Then a CostEvent is created with: service="ec2", instanceType="m5.xlarge", region, actor ARN, hourly cost
+    And the hourly cost is looked up from the static pricing table
+
+  Scenario: RDS CreateDBInstance normalized correctly
+    Given a raw CloudTrail CreateDBInstance event
+    When the Lambda normalizer processes it
+    Then a CostEvent is created with: service="rds", engine, storageType, multiAZ flag
+
+  Scenario: Unknown instance type uses fallback pricing
+    Given a CloudTrail event with instanceType "z99.mega" not in the pricing table
+    When the normalizer looks up pricing
+    Then fallback pricing of $0 is applied
+    And a warning is logged: "Unknown instance type: z99.mega"
+
+  Scenario: Malformed CloudTrail JSON does not crash the Lambda
+    Given a CloudTrail event with invalid JSON structure
+    When the Lambda normalizer processes it
+    Then the event is sent to the DLQ
+    And no CostEvent is written
+    And an error metric is emitted
+
+  Scenario: Batched RunInstances (multiple instances) creates multiple CostEvents
+    Given a RunInstances event launching 5 instances simultaneously
+    When the normalizer processes it
+    Then 5 separate CostEvents are created (one per instance)
+```
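A minimal sketch of the normalizer behaviour described in Story 1.3, assuming the record carries `awsRegion`, `userIdentity.arn`, `requestParameters.instanceType`, and `responseElements.instancesSet.items`. The `CostEvent` fields, the `PRICING` table, and the function name are hypothetical; only the m5.xlarge rate and the warning text come from the scenarios.

```python
import logging
from dataclasses import dataclass
from decimal import Decimal

logger = logging.getLogger("normalizer")

# Hypothetical static pricing table (USD per hour).
# Only the m5.xlarge rate appears in the spec.
PRICING = {"m5.xlarge": Decimal("0.192")}


@dataclass(frozen=True)
class CostEvent:
    service: str
    instance_type: str
    region: str
    actor_arn: str
    hourly_cost: Decimal


def normalize_run_instances(event: dict) -> list:
    """Turn one RunInstances record into one CostEvent per launched instance."""
    instance_type = event["requestParameters"]["instanceType"]
    price = PRICING.get(instance_type)
    if price is None:
        # Fallback pricing of $0 plus a warning, as in the unknown-type scenario.
        logger.warning("Unknown instance type: %s", instance_type)
        price = Decimal("0")
    launched = event["responseElements"]["instancesSet"]["items"]
    return [
        CostEvent(
            service="ec2",
            instance_type=instance_type,
            region=event["awsRegion"],
            actor_arn=event["userIdentity"]["arn"],
            hourly_cost=price,
        )
        for _ in launched
    ]
```

A record that fails these key lookups raises `KeyError`; in this sketch the caller would be the place to catch that and route the message to the DLQ.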
+
+### Story 1.4: Cost Precision
+
+```gherkin
+Feature: Cost Calculation Precision
+
+  Scenario: Hourly cost calculated with 4 decimal places
+    Given instance type "m5.xlarge" costs $0.192/hr
+    When the normalizer calculates hourly cost
+    Then the cost is stored as 0.1920 (4 decimal places)
+
+  Scenario: Sub-cent Lambda costs handled correctly
+    Given a Lambda invocation costing $0.0000002 per request
+    When the normalizer calculates cost
+    Then the cost is stored without floating-point precision loss
+```
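One way to satisfy both precision scenarios is to keep prices as `decimal.Decimal` values built from strings. This is a sketch under that assumption; the function names are hypothetical, and the spec does not say how the real normalizer stores costs.

```python
from decimal import Decimal, ROUND_HALF_UP

FOUR_PLACES = Decimal("0.0001")


def hourly_cost(rate: str) -> Decimal:
    """Quantize an hourly rate to exactly 4 decimal places."""
    return Decimal(rate).quantize(FOUR_PLACES, rounding=ROUND_HALF_UP)


def request_cost(price_per_request: str, requests: int) -> Decimal:
    """Multiply a sub-cent unit price exactly, with no binary float rounding."""
    return Decimal(price_per_request) * requests
```

Note that the sub-cent scenario cannot use the 4-decimal quantization: $0.0000002 would round to zero, so per-request prices need to stay unquantized until they are aggregated.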
+
+---
+
+## Epic 2: Anomaly Detection
+
+### Story 2.1: Z-Score Calculation
+
+```gherkin
+Feature: Z-Score Anomaly Scoring
+
+  Scenario: Event matching baseline mean scores zero
+    Given a baseline with mean=$1.25/hr and stddev=$0.15
+    When a CostEvent arrives with hourly cost $1.25
+    Then the Z-score is 0.0
+    And the severity is "info"
+
+  Scenario: Event 3 standard deviations above mean scores critical
+    Given a baseline with mean=$1.25/hr and stddev=$0.15
+    When a CostEvent arrives with hourly cost $1.70 (Z=3.0)
+    Then the Z-score is 3.0
+    And the severity is "critical"
+
+  Scenario: Zero standard deviation does not cause division by zero
+    Given a baseline with mean=$1.00/hr and stddev=$0.00 (all identical observations)
+    When a CostEvent arrives with hourly cost $1.50
+    Then the scorer handles the zero-stddev case gracefully
+    And the severity is based on absolute cost delta, not Z-score
+
+  Scenario: Single data point baseline (stddev undefined)
+    Given a baseline with only 1 observation
+    When a CostEvent arrives
+    Then the scorer uses the cold-start fast-path instead of Z-score
+```
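The Z-score scenarios reduce to one formula, z = (cost - mean) / stddev, plus a guard for a degenerate baseline. The sketch below returns `None` when no Z-score can be computed so the caller can fall back to another rule; that return convention and the function name are assumptions, not part of the spec.

```python
from typing import Optional


def z_score(cost: float, mean: float, stddev: float) -> Optional[float]:
    """Return the Z-score, or None when the baseline has no spread."""
    if stddev <= 0:
        # All observations identical (or only one observation):
        # dividing would fail, so signal "no Z-score available".
        return None
    return (cost - mean) / stddev
```

With mean=$1.25 and stddev=$0.15, a $1.70 event gives (1.70 - 1.25) / 0.15 = 3.0, matching the critical scenario up to floating-point rounding.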
+
+### Story 2.2: Novelty Scoring
+
+```gherkin
+Feature: Actor and Instance Novelty Detection
+
+  Scenario: New IAM role triggers actor novelty penalty
+    Given the baseline has observed actors: ["arn:aws:iam::123:role/ci"]
+    When a CostEvent arrives from actor "arn:aws:iam::123:role/unknown-script"
+    Then an actor novelty penalty is added to the composite score
+
+  Scenario: Known actor and known instance type has no novelty penalty
+    Given the baseline has observed actors and instance types matching the event
+    When the CostEvent is scored
+    Then the novelty component is 0
+```
+
+### Story 2.3: Cold-Start Fast Path
+
+```gherkin
+Feature: Cold-Start Fast Path for Expensive Resources
+
+  Scenario: Instance above $25/hr triggers immediate critical (bypasses baseline)
+    Given the account baseline has fewer than 14 days of data
+    When a CostEvent arrives for a p4d.24xlarge at $32.77/hr
+    Then the fast-path triggers immediately
+    And severity is "critical" regardless of baseline state
+
+  Scenario: Instance under $0.10/hr ignored during cold-start
+    Given the account baseline has fewer than 14 days of data
+    When a CostEvent arrives for a t3.nano at $0.0052/hr
+    Then the fast-path does NOT trigger
+    And the event is scored normally (likely "info")
+
+  Scenario: Fast-path transitions to statistical scoring at maturity
+    Given the account baseline has 20+ events AND 14+ days
+    When a CostEvent arrives for a $10/hr instance
+    Then the Z-score path is used (not fast-path)
+```
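The three cold-start scenarios can be read as a small routing decision. In the sketch below, the $25/hr threshold is taken from the first scenario title and treated as an assumption, maturity is read as 14+ days AND 20+ events, and the path labels are hypothetical.

```python
FAST_PATH_THRESHOLD = 25.0  # USD/hr; assumed from the scenario title
MIN_DAYS = 14
MIN_EVENTS = 20


def is_mature(baseline_days: int, baseline_events: int) -> bool:
    """A baseline is mature once it has both enough days and enough events."""
    return baseline_days >= MIN_DAYS and baseline_events >= MIN_EVENTS


def scoring_path(hourly_cost: float, baseline_days: int, baseline_events: int) -> str:
    """Pick which scorer handles the event."""
    if is_mature(baseline_days, baseline_events):
        return "z_score"
    if hourly_cost >= FAST_PATH_THRESHOLD:
        return "fast_path"  # immediate critical, regardless of baseline
    return "default"        # cheap resource during cold-start
```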
+
+### Story 2.4: Composite Scoring
+
+```gherkin
+Feature: Composite Severity Classification
+
+  Scenario: Composite score below 30 classified as info
+    Given Z-score contribution is 15 and novelty contribution is 10
+    When the composite score is calculated (25)
+    Then severity is "info"
+
+  Scenario: Composite score above 60 classified as critical
+    Given Z-score contribution is 45 and novelty contribution is 20
+    When the composite score is calculated (65)
+    Then severity is "critical"
+
+  Scenario: Events below $0.50/hr never classified as critical
+    Given a CostEvent with hourly cost $0.30 and Z-score of 5.0
+    When the composite score is calculated
+    Then severity is capped at "warning" (never critical for cheap resources)
+```
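A possible reading of the composite rules as code. The scenarios only fix "below 30 is info" and "above 60 is critical"; treating 30 through 60 inclusive as "warning" is an assumption, as is the function name.

```python
def classify(z_contribution: float, novelty_contribution: float, hourly_cost: float) -> str:
    """Map a composite score to a severity, capping cheap resources at warning."""
    score = z_contribution + novelty_contribution
    if score > 60:
        severity = "critical"
    elif score < 30:
        severity = "info"
    else:
        severity = "warning"  # assumed band for scores from 30 to 60
    # Events below $0.50/hr are never critical.
    if severity == "critical" and hourly_cost < 0.50:
        severity = "warning"
    return severity
```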
+
+### Story 2.5: Feedback Loop
+
+```gherkin
+Feature: Mark-as-Expected Feedback
+
+  Scenario: Marking an anomaly as expected reduces future scores
+    Given an anomaly was triggered for actor "arn:aws:iam::123:role/batch-job"
+    When the user clicks "Mark as Expected" in Slack
+    Then the actor is added to the expected_actors list
+    And future events from this actor receive a reduced score
+
+  Scenario: Expected actor still flagged if cost is 10x above baseline
+    Given actor "batch-job" is in the expected_actors list
+    And the baseline mean is $1.00/hr
+    When a CostEvent arrives from "batch-job" at $15.00/hr
+    Then the anomaly is still flagged (10x override)
+```
+
+---
+
+## Epic 3: Zombie Hunter
+
+```gherkin
+Feature: Idle Resource Detection
+
+  Scenario: EC2 instance running 7+ days with <5% CPU detected as zombie
+    Given an EC2 instance "i-abc123" has been running for 10 days
+    And CloudWatch average CPU utilization is 2%
+    When the daily Zombie Hunter scan runs
+    Then "i-abc123" is flagged as a zombie
+    And cumulative waste cost is calculated (10 days × 24 hours × hourly rate)
+
+  Scenario: RDS instance with 0 connections for 3+ days detected
+    Given an RDS instance "prod-db-unused" has 0 connections for 5 days
+    When the Zombie Hunter scan runs
+    Then "prod-db-unused" is flagged as a zombie
+
+  Scenario: Unattached EBS volume older than 7 days detected
+    Given an EBS volume "vol-xyz" has been unattached for 14 days
+    When the Zombie Hunter scan runs
+    Then "vol-xyz" is flagged as a zombie
+    And the waste cost includes the daily storage charge
+
+  Scenario: Instance tagged dd0c:ignore is excluded
+    Given an EC2 instance has tag "dd0c:ignore=true"
+    When the Zombie Hunter scan runs
+    Then the instance is skipped
+
+  Scenario: Zombie scan respects read-only permissions
+    Given the dd0c IAM role has only read permissions
+    When the Zombie Hunter scan runs
+    Then no modify/stop/terminate API calls are made
+```
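The EC2 zombie rule and its waste estimate can be sketched as two pure functions. The thresholds come from the scenario titles; the function names and the tag representation (a plain dict) are hypothetical.

```python
from decimal import Decimal

MIN_DAYS_RUNNING = 7
MAX_AVG_CPU_PERCENT = 5.0


def is_ec2_zombie(days_running: int, avg_cpu_percent: float, tags: dict) -> bool:
    """Flag long-running, near-idle instances unless they opt out by tag."""
    if tags.get("dd0c:ignore") == "true":
        return False
    return days_running >= MIN_DAYS_RUNNING and avg_cpu_percent < MAX_AVG_CPU_PERCENT


def cumulative_waste(days_running: int, hourly_rate: Decimal) -> Decimal:
    """Waste in USD: days x 24 hours x hourly rate."""
    return hourly_rate * 24 * days_running
```

For example, 10 days at $0.192/hr comes to 10 × 24 × 0.192 = $46.08.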
+
+---
+
+## Epic 4: Notification & Remediation
+
+### Story 4.1: Slack Block Kit Alerts
+
+```gherkin
+Feature: Slack Anomaly Alerts
+
+  Scenario: Critical anomaly sends Slack alert with Block Kit
+    Given an anomaly with severity "critical" for a p4d.24xlarge in us-east-1
+    When the notification engine processes it
+    Then a Slack message is sent with Block Kit formatting
+    And the message includes: resource type, region, hourly cost, actor, "Why this alert" section
+    And the message includes buttons: Snooze 1h, Snooze 24h, Mark Expected, Stop Instance
+
+  Scenario: Info-level anomalies are NOT sent to Slack
+    Given an anomaly with severity "info"
+    When the notification engine processes it
+    Then no Slack message is sent
+    And the anomaly is only logged to DynamoDB
+```
+
+### Story 4.2: Interactive Remediation
+
+```gherkin
+Feature: One-Click Remediation via Slack
+
+  Scenario: User clicks "Stop Instance" and remediation succeeds
+    Given a Slack alert for EC2 instance "i-abc123" in account "123456789012"
+    When the user clicks "Stop Instance"
+    Then the system assumes the cross-account remediation role via STS
+    And ec2:StopInstances is called for "i-abc123"
+    And the Slack message is updated to "Remediation Successful ✅"
+
+  Scenario: Cross-account IAM role has been revoked
+    Given the customer revoked the dd0c remediation IAM role
+    When the user clicks "Stop Instance"
+    Then STS AssumeRole fails
+    And the Slack message is updated to "Remediation Failed: IAM role not found"
+    And no instance is stopped
+
+  Scenario: Snooze suppresses alerts for the specified duration
+    Given the user clicks "Snooze 24h" on an anomaly
+    When a similar anomaly fires within 24 hours
+    Then no Slack alert is sent
+    And the anomaly is still logged to DynamoDB
+
+  Scenario: Mark Expected updates the baseline
+    Given the user clicks "Mark Expected"
+    When the feedback is processed
+    Then the actor is added to expected_actors
+    And the false-positive counter is incremented for governance scoring
+```
+
+### Story 4.3: Slack Signature Validation
+
+```gherkin
+Feature: Slack Interactive Payload Security
+
+  Scenario: Valid Slack signature accepted
+    Given a Slack interactive payload with correct HMAC-SHA256 signature
+    When the API receives the payload
+    Then the action is processed
+
+  Scenario: Invalid signature rejected
+    Given a payload with an incorrect X-Slack-Signature header
+    When the API receives the payload
+    Then HTTP 401 is returned
+    And no action is taken
+
+  Scenario: Expired timestamp rejected (replay attack)
+    Given a payload with X-Slack-Request-Timestamp older than 5 minutes
+    When the API receives the payload
+    Then HTTP 401 is returned
+```
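The three scenarios follow Slack's documented signing scheme: the signature is `v0=` plus the hex HMAC-SHA256 of `v0:{timestamp}:{body}`, keyed with the app's signing secret. A minimal verification sketch is below; the function name and the injectable `now` parameter are illustrative choices, and the caller is assumed to map a `False` result to HTTP 401.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_AGE_SECONDS = 5 * 60  # reject requests older than 5 minutes


def verify_slack_request(
    signing_secret: str,
    timestamp: str,   # value of X-Slack-Request-Timestamp
    body: bytes,      # raw request body, before any parsing
    signature: str,   # value of X-Slack-Signature
    now: Optional[float] = None,
) -> bool:
    """Return True only for a fresh request with a matching signature."""
    current = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(current - sent_at) > MAX_AGE_SECONDS:
        return False  # replay protection
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```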
+
+### Story 4.4: Daily Digest
+
+```gherkin
+Feature: Daily Anomaly Digest
+
+  Scenario: Daily digest aggregates 24h of anomalies
+    Given 15 anomalies were detected in the last 24 hours
+    When the daily digest cron fires
+    Then a Slack message is sent with: total anomalies, top 3 costliest, total estimated spend, zombie count
+
+  Scenario: No digest sent if zero anomalies
+    Given zero anomalies in the last 24 hours
+    When the daily digest cron fires
+    Then no Slack message is sent
+```
+
+---
+
+## Epic 5: Onboarding & PLG
+
+```gherkin
+Feature: AWS Account Onboarding
+
+  Scenario: CloudFormation quick-create deploys IAM role
+    Given the user clicks the quick-create URL
+    When CloudFormation deploys the stack
+    Then an IAM role is created with the correct permissions and ExternalId
+    And an EventBridge rule is created for cost-relevant CloudTrail events
+
+  Scenario: Role validation succeeds with correct ExternalId
+    Given the IAM role exists with ExternalId "dd0c-tenant-123"
+    When POST /v1/accounts validates the role
+    Then STS AssumeRole succeeds
+    And the account is marked as "active"
+    And the initial Zombie Hunter scan is triggered
+
+  Scenario: Role validation fails with wrong ExternalId
+    Given the IAM role exists but ExternalId does not match
+    When POST /v1/accounts validates the role
+    Then STS AssumeRole fails with AccessDenied
+    And a clear error message is returned
+
+  Scenario: Free tier allows 1 account
+    Given the tenant is on the free tier with 0 connected accounts
+    When they connect their first AWS account
+    Then the connection succeeds
+
+  Scenario: Free tier rejects 2nd account
+    Given the tenant is on the free tier with 1 connected account
+    When they attempt to connect a second account
+    Then HTTP 403 is returned with an upgrade prompt
+
+  Scenario: Stripe upgrade unlocks multiple accounts
+    Given the tenant upgrades via Stripe Checkout
+    When checkout.session.completed webhook fires
+    Then the tenant tier is updated to "pro"
+    And they can connect additional accounts
+
+  Scenario: Stripe webhook signature validated
+    Given a Stripe webhook payload
+    When the signature does not match the signing secret
+    Then HTTP 401 is returned
+```
+
+---
+
+## Epic 6: Dashboard API
+
+```gherkin
+Feature: Dashboard API with Tenant Isolation
+
+  Scenario: GET /v1/anomalies returns anomalies for authenticated tenant
+    Given tenant "t-123" has 50 anomalies
+    And tenant "t-456" has 30 anomalies
+    When tenant "t-123" calls GET /v1/anomalies
+    Then only tenant "t-123" anomalies are returned
+    And zero anomalies from "t-456" are included
+
+  Scenario: Cursor-based pagination
+    Given tenant "t-123" has 200 anomalies
+    When GET /v1/anomalies?limit=20 is called
+    Then 20 results are returned with a cursor token
+    And calling with the cursor returns the next 20
+
+  Scenario: Filter by severity
+    Given anomalies exist with severity "critical", "warning", and "info"
+    When GET /v1/anomalies?severity=critical is called
+    Then only critical anomalies are returned
+
+  Scenario: Baseline override
+    Given tenant "t-123" wants to adjust sensitivity for EC2 m5.xlarge
+    When PATCH /v1/accounts/{id}/baselines/ec2/m5.xlarge is called with sensitivity=0.8
+    Then the baseline sensitivity is updated
+    And future scoring uses the adjusted threshold
+
+  Scenario: Missing Cognito JWT returns 401
+    Given no Authorization header is present
+    When any API endpoint is called
+    Then HTTP 401 is returned
+```
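Tenant isolation and cursor pagination can be illustrated with an in-memory stand-in for the anomaly store. Everything here is hypothetical: the real API presumably pages with the database's own continuation key, while this sketch encodes a plain offset as an opaque token.

```python
import base64
import json
from typing import List, Optional, Tuple


def encode_cursor(offset: int) -> str:
    """Wrap the next offset in an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps({"o": offset}).encode()).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["o"]


def list_anomalies(
    store: List[dict], tenant_id: str, limit: int = 20, cursor: Optional[str] = None
) -> Tuple[List[dict], Optional[str]]:
    """Return one page of the caller's anomalies plus a cursor for the next page."""
    # Filter by tenant first so another tenant's rows can never reach a page.
    rows = [a for a in store if a["tenant_id"] == tenant_id]
    start = decode_cursor(cursor) if cursor else 0
    page = rows[start:start + limit]
    has_more = start + limit < len(rows)
    return page, (encode_cursor(start + limit) if has_more else None)
```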
+
+---
+
+## Epic 7: Dashboard UI
+
+```gherkin
+Feature: Dashboard UI
+
+  Scenario: Anomaly feed renders with severity badges
+    Given the user is logged in
+    When the dashboard loads
+    Then anomalies are listed with color-coded severity badges (red/yellow/blue)
+
+  Scenario: Baseline learning progress displayed
+    Given an account connected 5 days ago with 30 events
+    When the user views the account detail
+    Then a progress bar shows "5/14 days, 30/20 events" toward maturity
+
+  Scenario: Zombie resource list shows waste estimate
+    Given 3 zombie resources were detected
+    When the user views the Zombie Hunter tab
+    Then each zombie shows: resource type, age, cumulative waste cost
+```
+
+---
+
+## Epic 8: Infrastructure
+
+```gherkin
+Feature: CDK Infrastructure
+
+  Scenario: CDK deploys all required resources
+    Given the CDK stack is synthesized
+    When cdk deploy is executed
+    Then EventBridge rules, SQS FIFO queues, Lambda functions, DynamoDB tables, and Cognito user pool are created
+
+  Scenario: CI/CD pipeline runs on push to main
+    Given code is pushed to the main branch
+    When GitHub Actions triggers
+    Then unit tests, integration tests, and CDK synth run
+    And Lambda functions are deployed
+    And zero downtime is maintained
+```
+
+---
+
+## Epic 9: Multi-Account Management
+
+```gherkin
+Feature: Multi-Account Management
+
+  Scenario: Link additional AWS account
+    Given the tenant is on the pro tier
+    When they POST /v1/accounts with a new role ARN
+    Then the account is validated and linked
+    And CloudTrail events from the new account begin flowing
+
+  Scenario: Consolidated anomaly view across accounts
+    Given the tenant has 3 linked accounts
+    When they GET /v1/anomalies (no account filter)
+    Then anomalies from all 3 accounts are returned
+    And each anomaly includes the source account ID
+
+  Scenario: Disconnect account stops event processing
+    Given account "123456789012" is linked
+    When DELETE /v1/accounts/{id} is called
+    Then the account is marked as "disconnecting"
+    And EventBridge rules are cleaned up
+    And no new events are processed from that account
+```
+
+---
+
+## Epic 10: Transparent Factory Compliance
+
+### Story 10.1: Atomic Flagging
+
+```gherkin
+Feature: Alert Volume Circuit Breaker
+
+  Scenario: Circuit breaker trips at 3x baseline alert volume
+    Given the baseline alert rate is 5 alerts/hour
+    When 16 alerts fire in the current hour (>3x)
+    Then the circuit breaker trips
+    And the scoring flag is auto-disabled
+    And suppressed anomalies are buffered in the DLQ
+
+  Scenario: Fast-path alerts are exempt from circuit breaker
+    Given the circuit breaker is tripped
+    When a $30/hr instance triggers the cold-start fast-path
+    Then the alert is still sent (fast-path bypasses breaker)
+
+  Scenario: CI blocks expired flags
+    Given a feature flag with TTL expired and rollout=100%
+    When CI runs the flag audit
+    Then the build fails
+```
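The breaker scenarios boil down to a threshold check and a send decision. In this sketch the comparison is strictly greater than 3x, which matches the 16-versus-15 example but is otherwise an assumption; the function names are hypothetical.

```python
def breaker_should_trip(alerts_this_hour: int, baseline_per_hour: float, multiplier: float = 3.0) -> bool:
    """Trip once the current hour exceeds the baseline rate by the multiplier."""
    return alerts_this_hour > baseline_per_hour * multiplier


def should_send_alert(breaker_tripped: bool, fast_path: bool) -> bool:
    """Fast-path alerts bypass the breaker; everything else is suppressed while tripped."""
    return fast_path or not breaker_tripped
```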
+
+### Story 10.2: Elastic Schema
+
+```gherkin
+Feature: DynamoDB Additive Schema
+
+  Scenario: New attribute added without breaking V1 readers
+    Given a DynamoDB item has a new "anomaly_score_v2" attribute
+    When V1 code reads the item
+    Then deserialization succeeds (unknown attributes ignored)
+
+  Scenario: Key schema modifications are rejected
+    Given a migration attempts to change the partition key structure
+    When CI runs the schema lint
+    Then the build fails with "Key schema modification detected"
+```
+
+### Story 10.3: Cognitive Durability
+
+```gherkin
+Feature: Decision Logs for Scoring Changes
+
+  Scenario: PR modifying Z-score thresholds requires decision log
+    Given a PR changes files in src/scoring/
+    And no decision_log.json is included
+    When CI runs the decision log check
+    Then the build fails
+
+  Scenario: Cyclomatic complexity enforced
+    Given a scoring function has complexity 12
+    When the linter runs with threshold 10
+    Then the lint fails
+```
+
+### Story 10.4: Semantic Observability
+
+```gherkin
+Feature: OTEL Spans on Scoring Decisions
+
+  Scenario: Every anomaly scoring decision emits an OTEL span
+    Given a CostEvent is processed by the Anomaly Scorer
+    When scoring completes
+    Then an "anomaly_scoring" span is created
+    And span attributes include: cost.z_score, cost.anomaly_score, cost.baseline_days
+    And cost.fast_path_triggered is set to true/false
+
+  Scenario: Account ID is hashed in spans (PII protection)
+    Given a scoring span is emitted
+    When the span attributes are inspected
+    Then the AWS account ID is SHA-256 hashed
+    And no raw account ID appears in any attribute
+```
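A sketch of how the span attributes could be assembled with the account ID hashed before it reaches any attribute. The `cost.*` keys below are the ones named in the scenario, except `cost.account_id_hash`, which is a hypothetical key; the helper names are also assumptions, and no tracing SDK is involved here.

```python
import hashlib


def hash_account_id(account_id: str) -> str:
    """SHA-256 hex digest of the raw AWS account ID."""
    return hashlib.sha256(account_id.encode("utf-8")).hexdigest()


def scoring_span_attributes(
    account_id: str,
    z_score: float,
    anomaly_score: float,
    baseline_days: int,
    fast_path_triggered: bool,
) -> dict:
    """Build the attribute dict for an "anomaly_scoring" span."""
    return {
        "cost.account_id_hash": hash_account_id(account_id),  # hypothetical key
        "cost.z_score": z_score,
        "cost.anomaly_score": anomaly_score,
        "cost.baseline_days": baseline_days,
        "cost.fast_path_triggered": fast_path_triggered,
    }
```

Because account IDs are 12 decimal digits, an unsalted SHA-256 can be reversed by enumeration; a keyed hash such as HMAC would resist that if it matters for the threat model.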
+
+### Story 10.5: Configurable Autonomy (14-Day Governance)
+
+```gherkin
+Feature: 14-Day Auto-Promotion Governance
+
+  Scenario: New account starts in strict mode (log-only)
+    Given a new AWS account is connected
+    When the first anomaly is detected
+    Then the anomaly is logged to DynamoDB
+    And NO Slack alert is sent (strict mode)
+
+  Scenario: Account auto-promotes to audit mode after 14 days
+    Given an account has been connected for 15 days
+    And the false-positive rate is 5% (<10% threshold)
+    When the daily governance cron runs
+    Then the account mode is changed from "strict" to "audit"
+    And a log entry "Auto-promoted to audit mode" is written
+
+  Scenario: Account does NOT promote if false-positive rate is high
+    Given an account has been connected for 15 days
+    And the false-positive rate is 15% (>10% threshold)
+    When the daily governance cron runs
+    Then the account remains in "strict" mode
+
+  Scenario: Panic mode halts all alerting in <1 second
+    Given panic mode is triggered via POST /admin/panic
+    When a new critical anomaly is detected
+    Then NO Slack alert is sent
+    And the anomaly is still scored and logged
+    And the dashboard API returns an "alerting paused" header
+
+  Scenario: Panic mode requires manual clearance
+    Given panic mode is active
+    When 24 hours pass without manual intervention
+    Then panic mode is still active (no auto-clear)
+    And only POST /admin/panic with panic=false clears it
+```
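The governance scenarios can be expressed as a promotion rule and an alert gate. This is a sketch only: the function names are hypothetical, and "after 14 days" is read here as 14 or more days connected.

```python
MIN_DAYS_CONNECTED = 14
MAX_FALSE_POSITIVE_RATE = 0.10


def next_mode(mode: str, days_connected: int, false_positive_rate: float) -> str:
    """Promote strict -> audit only when the account is old enough and quiet enough."""
    if (
        mode == "strict"
        and days_connected >= MIN_DAYS_CONNECTED
        and false_positive_rate < MAX_FALSE_POSITIVE_RATE
    ):
        return "audit"
    return mode


def should_send_slack(mode: str, panic: bool) -> bool:
    """Strict mode is log-only, and panic mode silences every alert."""
    return not panic and mode != "strict"
```

Panic state is passed in rather than derived from time, which matches the requirement that it never clears on its own.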
+
+---
+
+*End of dd0c/cost BDD Acceptance Specifications — 10 Epics, 55+ Scenarios*