Phase 3: BDD acceptance specs for P1 (route) and P5 (cost)

P1: 50+ scenarios across 10 epics, all stories covered
P5: 55+ scenarios across 10 epics, written manually (Sonnet credential failures)
Remaining P2/P3/P4/P6 in progress via subagents
2026-03-01 01:50:30 +00:00
parent 03bfe931fc
commit c1484426cc
2 changed files with 1295 additions and 0 deletions


@@ -0,0 +1,610 @@
# dd0c/cost — BDD Acceptance Test Specifications
**Phase 3: Given/When/Then per Story**
**Date:** March 1, 2026
---
## Epic 1: CloudTrail Ingestion
### Story 1.1: EventBridge Cross-Account Rules
```gherkin
Feature: CloudTrail Event Ingestion
Scenario: EC2 RunInstances event ingested successfully
Given a customer AWS account has EventBridge rules forwarding to dd0c
When a RunInstances CloudTrail event fires in the customer account
Then the event arrives in the dd0c SQS FIFO queue within 5 seconds
And the event contains userIdentity, eventSource, and requestParameters
Scenario: Non-cost-generating event is filtered at EventBridge
Given EventBridge rules filter for cost-relevant API calls only
When a DescribeInstances CloudTrail event fires
Then the event is NOT forwarded to the SQS queue
```
### Story 1.2: SQS FIFO Deduplication
```gherkin
Feature: CloudTrail Event Deduplication
Scenario: Duplicate CloudTrail event is deduplicated by SQS FIFO
Given a RunInstances event with eventID "evt-abc-123" was already processed
When the same event arrives again within the 5-minute deduplication window (CloudTrail duplicate delivery)
Then SQS FIFO deduplicates it using the eventID as the MessageDeduplicationId
And only one CostEvent is written to DynamoDB
Scenario: Same resource type from different events is allowed
Given two different RunInstances events for the same instance type
But with different eventIDs
When both arrive in the SQS queue
Then both are processed as separate CostEvents
```
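The deduplication scenarios above depend on how the deduplication ID is set when a message is enqueued. A minimal sketch, assuming a forwarder that publishes each CloudTrail record itself; the helper name `build_fifo_message` and the use of the account ID as `MessageGroupId` are hypothetical, not taken from the dd0c codebase. SQS FIFO only deduplicates within a 5-minute window, so duplicates arriving later would need an idempotent write downstream.

```python
import json


def build_fifo_message(event: dict) -> dict:
    """Build keyword arguments for an SQS FIFO send_message call (hypothetical helper)."""
    return {
        "MessageBody": json.dumps(event, sort_keys=True),
        # Assumption: one message group per source account keeps per-account ordering.
        "MessageGroupId": event["recipientAccountId"],
        # Assumption: the CloudTrail eventID doubles as the deduplication ID,
        # so a redelivered record maps to the same ID and is dropped by SQS.
        "MessageDeduplicationId": event["eventID"],
    }
```

Two different RunInstances events carry different `eventID` values, so they produce different deduplication IDs and are both delivered.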
### Story 1.3: Lambda Normalizer
```gherkin
Feature: CloudTrail Event Normalization
Scenario: EC2 RunInstances normalized to CostEvent schema
Given a raw CloudTrail RunInstances event
When the Lambda normalizer processes it
Then a CostEvent is created with: service="ec2", instanceType="m5.xlarge", region, actor ARN, hourly cost
And the hourly cost is looked up from the static pricing table
Scenario: RDS CreateDBInstance normalized correctly
Given a raw CloudTrail CreateDBInstance event
When the Lambda normalizer processes it
Then a CostEvent is created with: service="rds", engine, storageType, multiAZ flag
Scenario: Unknown instance type uses fallback pricing
Given a CloudTrail event with instanceType "z99.mega" not in the pricing table
When the normalizer looks up pricing
Then fallback pricing of $0 is applied
And a warning is logged: "Unknown instance type: z99.mega"
Scenario: Malformed CloudTrail JSON does not crash the Lambda
Given a CloudTrail event with invalid JSON structure
When the Lambda normalizer processes it
Then the event is sent to the DLQ
And no CostEvent is written
And an error metric is emitted
Scenario: Batched RunInstances (multiple instances) creates multiple CostEvents
Given a RunInstances event launching 5 instances simultaneously
When the normalizer processes it
Then 5 separate CostEvents are created (one per instance)
```
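The normalization scenarios above could be sketched roughly as follows. This is an illustration under assumptions: the `PRICING` table, the `normalize_run_instances` name, and the CostEvent field names `resourceId` and `hourlyCost` are hypothetical. The CloudTrail fields read here (`requestParameters.instanceType`, `responseElements.instancesSet.items`, `userIdentity.arn`, `awsRegion`) follow the usual shape of a RunInstances record.

```python
import logging
from decimal import Decimal

logger = logging.getLogger("normalizer")

# Hypothetical static pricing table (USD per hour).
PRICING = {"m5.xlarge": Decimal("0.192")}


def normalize_run_instances(event: dict) -> list[dict]:
    """Return one CostEvent dict per launched instance."""
    instance_type = event["requestParameters"]["instanceType"]
    hourly = PRICING.get(instance_type)
    if hourly is None:
        # Fallback pricing of $0 plus a warning, as in the unknown-type scenario.
        logger.warning("Unknown instance type: %s", instance_type)
        hourly = Decimal("0")
    items = event["responseElements"]["instancesSet"]["items"]
    return [
        {
            "service": "ec2",
            "instanceType": instance_type,
            "region": event["awsRegion"],
            "actor": event["userIdentity"]["arn"],
            "resourceId": item["instanceId"],
            "hourlyCost": hourly,
        }
        for item in items
    ]
```

A malformed record raises `KeyError` here; the assumption is that the Lambda handler catches it, routes the message to the DLQ, and emits the error metric.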
### Story 1.4: Cost Precision
```gherkin
Feature: Cost Calculation Precision
Scenario: Hourly cost calculated with 4 decimal places
Given instance type "m5.xlarge" costs $0.192/hr
When the normalizer calculates hourly cost
Then the cost is stored as 0.1920 (4 decimal places)
Scenario: Sub-cent Lambda costs handled correctly
Given a Lambda invocation costing $0.0000002 per request
When the normalizer calculates cost
Then the cost is stored without floating-point precision loss
```
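One way to meet both precision scenarios is to keep prices as `Decimal` values built from strings rather than floats. A minimal sketch; the function names are hypothetical and the rounding mode is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

FOUR_PLACES = Decimal("0.0001")


def hourly_cost(rate: str) -> Decimal:
    """Parse an hourly rate and store it with exactly 4 decimal places."""
    return Decimal(rate).quantize(FOUR_PLACES, rounding=ROUND_HALF_UP)


def request_cost(per_request: str, requests: int) -> Decimal:
    """Multiply a sub-cent per-request price without float rounding error."""
    return Decimal(per_request) * requests
```

With these helpers, `hourly_cost("0.192")` is stored as `0.1920`, and one million requests at `0.0000002` each sum to exactly `0.2`.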
---
## Epic 2: Anomaly Detection
### Story 2.1: Z-Score Calculation
```gherkin
Feature: Z-Score Anomaly Scoring
Scenario: Event matching baseline mean scores zero
Given a baseline with mean=$1.25/hr and stddev=$0.15
When a CostEvent arrives with hourly cost $1.25
Then the Z-score is 0.0
And the severity is "info"
Scenario: Event 3 standard deviations above mean scores critical
Given a baseline with mean=$1.25/hr and stddev=$0.15
When a CostEvent arrives with hourly cost $1.70 (Z=3.0)
Then the Z-score is 3.0
And the severity is "critical"
Scenario: Zero standard deviation does not cause division by zero
Given a baseline with mean=$1.00/hr and stddev=$0.00 (all identical observations)
When a CostEvent arrives with hourly cost $1.50
Then the scorer handles the zero-stddev case gracefully
And the severity is based on absolute cost delta, not Z-score
Scenario: Single data point baseline (stddev undefined)
Given a baseline with only 1 observation
When a CostEvent arrives
Then the scorer uses the cold-start fast-path instead of Z-score
```
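The Z-score scenarios reduce to `(cost - mean) / stddev` with a guard for the zero-stddev case. A sketch under assumptions: returning `None` to signal "fall back to another scoring path" is an illustrative choice, not the documented behaviour of the scorer.

```python
from typing import Optional


def z_score(cost: float, mean: float, stddev: float) -> Optional[float]:
    """Return the Z-score, or None when stddev is zero and the score is undefined."""
    if stddev == 0:
        # The caller is expected to score on absolute cost delta instead.
        return None
    return (cost - mean) / stddev
```

For the baseline in the scenarios (mean 1.25, stddev 0.15), a cost of 1.70 yields a value of 3.0 up to floating-point rounding, so assertions should compare with a tolerance.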
### Story 2.2: Novelty Scoring
```gherkin
Feature: Actor and Instance Novelty Detection
Scenario: New IAM role triggers actor novelty penalty
Given the baseline has observed actors: ["arn:aws:iam::123:role/ci"]
When a CostEvent arrives from actor "arn:aws:iam::123:role/unknown-script"
Then an actor novelty penalty is added to the composite score
Scenario: Known actor and known instance type has no novelty penalty
Given the baseline has observed actors and instance types matching the event
When the CostEvent is scored
Then the novelty component is 0
```
### Story 2.3: Cold-Start Fast Path
```gherkin
Feature: Cold-Start Fast Path for Expensive Resources
Scenario: $25/hr instance triggers immediate critical (bypasses baseline)
Given the account baseline has fewer than 14 days of data
When a CostEvent arrives for a p4d.24xlarge at $32.77/hr
Then the fast-path triggers immediately
And severity is "critical" regardless of baseline state
Scenario: $0.10/hr instance ignored during cold-start
Given the account baseline has fewer than 14 days of data
When a CostEvent arrives for a t3.nano at $0.0052/hr
Then the fast-path does NOT trigger
And the event is scored normally (likely "info")
Scenario: Fast-path transitions to statistical scoring at maturity
Given the account baseline has 20+ events AND 14+ days
When a CostEvent arrives for a $10/hr instance
Then the Z-score path is used (not fast-path)
```
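The choice between the fast path and statistical scoring could be sketched as below. The maturity rule (14+ days and 20+ events) comes from the scenarios; the $25/hr fast-path threshold is an assumption inferred from the first scenario title, and the function names and return labels are hypothetical.

```python
def is_mature(days: int, events: int) -> bool:
    """A baseline is mature once it has 14+ days AND 20+ events."""
    return days >= 14 and events >= 20


def scoring_path(hourly_cost: float, days: int, events: int,
                 fast_path_threshold: float = 25.0) -> str:
    """Pick the scoring path for a CostEvent (threshold is an assumed value)."""
    if is_mature(days, events):
        return "z-score"
    if hourly_cost >= fast_path_threshold:
        return "fast-path"  # immediate critical, regardless of baseline state
    return "normal"
```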
### Story 2.4: Composite Scoring
```gherkin
Feature: Composite Severity Classification
Scenario: Composite score below 30 classified as info
Given Z-score contribution is 15 and novelty contribution is 10
When the composite score is calculated (25)
Then severity is "info"
Scenario: Composite score above 60 classified as critical
Given Z-score contribution is 45 and novelty contribution is 20
When the composite score is calculated (65)
Then severity is "critical"
Scenario: Events below $0.50/hr never classified as critical
Given a CostEvent with hourly cost $0.30 and Z-score of 5.0
When the composite score is calculated
Then severity is capped at "warning" (never critical for cheap resources)
```
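The severity rules above fit in a small classifier. A sketch, assuming scores from 30 to 60 inclusive map to "warning" (the scenarios only pin down "below 30" and "above 60"); the function name is hypothetical.

```python
def classify(composite: float, hourly_cost: float) -> str:
    """Map a composite score to a severity, capping cheap resources at warning."""
    if composite < 30:
        severity = "info"
    elif composite > 60:
        severity = "critical"
    else:
        severity = "warning"  # assumption: 30..60 inclusive
    # Resources under $0.50/hr are never classified as critical.
    if severity == "critical" and hourly_cost < 0.50:
        severity = "warning"
    return severity
```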
### Story 2.5: Feedback Loop
```gherkin
Feature: Mark-as-Expected Feedback
Scenario: Marking an anomaly as expected reduces future scores
Given an anomaly was triggered for actor "arn:aws:iam::123:role/batch-job"
When the user clicks "Mark as Expected" in Slack
Then the actor is added to the expected_actors list
And future events from this actor receive a reduced score
Scenario: Expected actor still flagged if cost is 10x above baseline
Given actor "batch-job" is in the expected_actors list
And the baseline mean is $1.00/hr
When a CostEvent arrives from "batch-job" at $15.00/hr
Then the anomaly is still flagged (10x override)
```
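The feedback rule and its 10x override might look like this. The 0.5 reduction factor is purely an assumption, since the scenarios only say the score is "reduced"; `adjusted_score` is a hypothetical name.

```python
def adjusted_score(score: float, actor: str, hourly_cost: float,
                   baseline_mean: float, expected_actors: set,
                   reduction: float = 0.5) -> float:
    """Reduce the score for expected actors unless cost is 10x above baseline."""
    if actor in expected_actors and hourly_cost < 10 * baseline_mean:
        return score * reduction  # assumed reduction factor
    return score
```

In the override scenario, `batch-job` at $15.00/hr against a $1.00/hr mean exceeds the 10x limit, so the score is left unchanged and the anomaly is still flagged.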
---
## Epic 3: Zombie Hunter
```gherkin
Feature: Idle Resource Detection
Scenario: EC2 instance running 7+ days with <5% CPU detected as zombie
Given an EC2 instance "i-abc123" has been running for 10 days
And CloudWatch average CPU utilization is 2%
When the daily Zombie Hunter scan runs
Then "i-abc123" is flagged as a zombie
And cumulative waste cost is calculated (10 days × 24 hours × hourly rate)
Scenario: RDS instance with 0 connections for 3+ days detected
Given an RDS instance "prod-db-unused" has 0 connections for 5 days
When the Zombie Hunter scan runs
Then "prod-db-unused" is flagged as a zombie
Scenario: Unattached EBS volume older than 7 days detected
Given an EBS volume "vol-xyz" has been unattached for 14 days
When the Zombie Hunter scan runs
Then "vol-xyz" is flagged as a zombie
And the waste cost includes the daily storage charge
Scenario: Instance tagged dd0c:ignore is excluded
Given an EC2 instance has tag "dd0c:ignore=true"
When the Zombie Hunter scan runs
Then the instance is skipped
Scenario: Zombie scan respects read-only permissions
Given the dd0c IAM role has only read permissions
When the Zombie Hunter scan runs
Then no modify/stop/terminate API calls are made
```
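The EC2 zombie rule and the waste calculation can be sketched as pure functions. The thresholds (7+ days, under 5% CPU, the `dd0c:ignore` tag) come from the scenarios; the function names and the use of an example rate of $0.192/hr are illustrative.

```python
from decimal import Decimal


def is_zombie_ec2(days_running: int, avg_cpu_pct: float, tags: dict) -> bool:
    """Flag instances running 7+ days with under 5% average CPU, unless ignored."""
    if tags.get("dd0c:ignore") == "true":
        return False
    return days_running >= 7 and avg_cpu_pct < 5.0


def cumulative_waste(days_running: int, hourly_rate: Decimal) -> Decimal:
    """Waste in USD: days x 24 hours x hourly rate."""
    return Decimal(days_running) * 24 * hourly_rate
```

For example, an instance idle for 10 days at $0.192/hr accumulates 240 billable hours, or $46.08 of waste.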
---
## Epic 4: Notification & Remediation
### Story 4.1: Slack Block Kit Alerts
```gherkin
Feature: Slack Anomaly Alerts
Scenario: Critical anomaly sends Slack alert with Block Kit
Given an anomaly with severity "critical" for a p4d.24xlarge in us-east-1
When the notification engine processes it
Then a Slack message is sent with Block Kit formatting
And the message includes: resource type, region, hourly cost, actor, "Why this alert" section
And the message includes buttons: Snooze 1h, Snooze 24h, Mark Expected, Stop Instance
Scenario: Info-level anomalies are NOT sent to Slack
Given an anomaly with severity "info"
When the notification engine processes it
Then no Slack message is sent
And the anomaly is only logged to DynamoDB
```
### Story 4.2: Interactive Remediation
```gherkin
Feature: One-Click Remediation via Slack
Scenario: User clicks "Stop Instance" and remediation succeeds
Given a Slack alert for EC2 instance "i-abc123" in account "123456789012"
When the user clicks "Stop Instance"
Then the system assumes the cross-account remediation role via STS
And ec2:StopInstances is called for "i-abc123"
And the Slack message is updated to "Remediation Successful"
Scenario: Cross-account IAM role has been revoked
Given the customer revoked the dd0c remediation IAM role
When the user clicks "Stop Instance"
Then STS AssumeRole fails
And the Slack message is updated to "Remediation Failed: IAM role not found"
And no instance is stopped
Scenario: Snooze suppresses alerts for the specified duration
Given the user clicks "Snooze 24h" on an anomaly
When a similar anomaly fires within 24 hours
Then no Slack alert is sent
And the anomaly is still logged to DynamoDB
Scenario: Mark Expected updates the baseline
Given the user clicks "Mark Expected"
When the feedback is processed
Then the actor is added to expected_actors
And the false-positive counter is incremented for governance scoring
```
### Story 4.3: Slack Signature Validation
```gherkin
Feature: Slack Interactive Payload Security
Scenario: Valid Slack signature accepted
Given a Slack interactive payload with correct HMAC-SHA256 signature
When the API receives the payload
Then the action is processed
Scenario: Invalid signature rejected
Given a payload with an incorrect X-Slack-Signature header
When the API receives the payload
Then HTTP 401 is returned
And no action is taken
Scenario: Expired timestamp rejected (replay attack)
Given a payload with X-Slack-Request-Timestamp older than 5 minutes
When the API receives the payload
Then HTTP 401 is returned
```
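The three signature scenarios correspond to Slack's documented signing scheme: an HMAC-SHA256 over `v0:<timestamp>:<raw body>`, hex-encoded and prefixed with `v0=`. A minimal sketch; the function name and the injectable `now` parameter (added for testability) are assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 5 * 60  # reject requests older than 5 minutes


def verify_slack_request(signing_secret: str, timestamp: str, body: bytes,
                         signature: str, now: Optional[float] = None) -> bool:
    """Return True only for a fresh request with a matching signature."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    if abs(now - ts) > MAX_SKEW_SECONDS:
        return False  # possible replay
    basestring = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), basestring, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The API handler would return HTTP 401 whenever this function returns `False`, before any action is dispatched.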
### Story 4.4: Daily Digest
```gherkin
Feature: Daily Anomaly Digest
Scenario: Daily digest aggregates 24h of anomalies
Given 15 anomalies were detected in the last 24 hours
When the daily digest cron fires
Then a Slack message is sent with: total anomalies, top 3 costliest, total estimated spend, zombie count
Scenario: No digest sent if zero anomalies
Given zero anomalies in the last 24 hours
When the daily digest cron fires
Then no Slack message is sent
```
---
## Epic 5: Onboarding & PLG
```gherkin
Feature: AWS Account Onboarding
Scenario: CloudFormation quick-create deploys IAM role
Given the user clicks the quick-create URL
When CloudFormation deploys the stack
Then an IAM role is created with the correct permissions and ExternalId
And an EventBridge rule is created for cost-relevant CloudTrail events
Scenario: Role validation succeeds with correct ExternalId
Given the IAM role exists with ExternalId "dd0c-tenant-123"
When POST /v1/accounts validates the role
Then STS AssumeRole succeeds
And the account is marked as "active"
And the initial Zombie Hunter scan is triggered
Scenario: Role validation fails with wrong ExternalId
Given the IAM role exists but ExternalId does not match
When POST /v1/accounts validates the role
Then STS AssumeRole fails with AccessDenied
And a clear error message is returned
Scenario: Free tier allows 1 account
Given the tenant is on the free tier with 0 connected accounts
When they connect their first AWS account
Then the connection succeeds
Scenario: Free tier rejects 2nd account
Given the tenant is on the free tier with 1 connected account
When they attempt to connect a second account
Then HTTP 403 is returned with an upgrade prompt
Scenario: Stripe upgrade unlocks multiple accounts
Given the tenant upgrades via Stripe Checkout
When checkout.session.completed webhook fires
Then the tenant tier is updated to "pro"
And they can connect additional accounts
Scenario: Stripe webhook signature validated
Given a Stripe webhook payload
When the signature does not match the signing secret
Then HTTP 401 is returned
```
---
## Epic 6: Dashboard API
```gherkin
Feature: Dashboard API with Tenant Isolation
Scenario: GET /v1/anomalies returns anomalies for authenticated tenant
Given tenant "t-123" has 50 anomalies
And tenant "t-456" has 30 anomalies
When tenant "t-123" calls GET /v1/anomalies
Then only tenant "t-123" anomalies are returned
And zero anomalies from "t-456" are included
Scenario: Cursor-based pagination
Given tenant "t-123" has 200 anomalies
When GET /v1/anomalies?limit=20 is called
Then 20 results are returned with a cursor token
And calling with the cursor returns the next 20
Scenario: Filter by severity
Given anomalies exist with severity "critical", "warning", and "info"
When GET /v1/anomalies?severity=critical is called
Then only critical anomalies are returned
Scenario: Baseline override
Given tenant "t-123" wants to adjust sensitivity for EC2 m5.xlarge
When PATCH /v1/accounts/{id}/baselines/ec2/m5.xlarge is called with sensitivity=0.8
Then the baseline sensitivity is updated
And future scoring uses the adjusted threshold
Scenario: Missing Cognito JWT returns 401
Given no Authorization header is present
When any API endpoint is called
Then HTTP 401 is returned
```
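Tenant isolation and cursor-based pagination can be illustrated with an in-memory store. This is a sketch only: the real API presumably pages through DynamoDB, and the cursor format (base64-encoded JSON holding the last seen ID) and the names `list_anomalies`, `encode_cursor`, and `decode_cursor` are hypothetical.

```python
import base64
import json
from typing import Optional


def encode_cursor(last_id: int) -> str:
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor))["after"]


def list_anomalies(store: list, tenant_id: str, limit: int = 20,
                   cursor: Optional[str] = None):
    """Return (page, next_cursor) containing only the caller's anomalies."""
    # Filter by tenant first so no other tenant's rows can leak into a page.
    rows = sorted((a for a in store if a["tenant_id"] == tenant_id),
                  key=lambda a: a["id"])
    if cursor is not None:
        after = decode_cursor(cursor)
        rows = [a for a in rows if a["id"] > after]
    page = rows[:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if len(rows) > limit else None
    return page, next_cursor
```

Keying the cursor on the last returned ID rather than an offset keeps pages stable when new anomalies are inserted between requests.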
---
## Epic 7: Dashboard UI
```gherkin
Feature: Dashboard UI
Scenario: Anomaly feed renders with severity badges
Given the user is logged in
When the dashboard loads
Then anomalies are listed with color-coded severity badges (red/yellow/blue)
Scenario: Baseline learning progress displayed
Given an account connected 5 days ago with 30 events
When the user views the account detail
Then a progress bar shows "5/14 days, 30/20 events" toward maturity
Scenario: Zombie resource list shows waste estimate
Given 3 zombie resources were detected
When the user views the Zombie Hunter tab
Then each zombie shows: resource type, age, cumulative waste cost
```
---
## Epic 8: Infrastructure
```gherkin
Feature: CDK Infrastructure
Scenario: CDK deploys all required resources
Given the CDK stack is synthesized
When cdk deploy is executed
Then EventBridge rules, SQS FIFO queues, Lambda functions, DynamoDB tables, and Cognito user pool are created
Scenario: CI/CD pipeline runs on push to main
Given code is pushed to the main branch
When GitHub Actions triggers
Then unit tests, integration tests, and CDK synth run
And Lambda functions are deployed
And zero downtime is maintained
```
---
## Epic 9: Multi-Account Management
```gherkin
Feature: Multi-Account Management
Scenario: Link additional AWS account
Given the tenant is on the pro tier
When they POST /v1/accounts with a new role ARN
Then the account is validated and linked
And CloudTrail events from the new account begin flowing
Scenario: Consolidated anomaly view across accounts
Given the tenant has 3 linked accounts
When they call GET /v1/anomalies with no account filter
Then anomalies from all 3 accounts are returned
And each anomaly includes the source account ID
Scenario: Disconnect account stops event processing
Given account "123456789012" is linked
When DELETE /v1/accounts/{id} is called
Then the account is marked as "disconnecting"
And EventBridge rules are cleaned up
And no new events are processed from that account
```
---
## Epic 10: Transparent Factory Compliance
### Story 10.1: Atomic Flagging
```gherkin
Feature: Alert Volume Circuit Breaker
Scenario: Circuit breaker trips at 3x baseline alert volume
Given the baseline alert rate is 5 alerts/hour
When 16 alerts fire in the current hour (>3x)
Then the circuit breaker trips
And the scoring flag is auto-disabled
And suppressed anomalies are buffered in the DLQ
Scenario: Fast-path alerts are exempt from circuit breaker
Given the circuit breaker is tripped
When a $30/hr instance triggers the cold-start fast-path
Then the alert is still sent (fast-path bypasses breaker)
Scenario: CI blocks expired flags
Given a feature flag with TTL expired and rollout=100%
When CI runs the flag audit
Then the build fails
```
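The circuit-breaker rule and the fast-path exemption can be expressed as two small predicates. A sketch with hypothetical names; the 3x multiplier is taken from the scenario, and "strictly greater than 3x" is an assumption based on the `>3x` note.

```python
def breaker_tripped(alerts_this_hour: int, baseline_rate: float,
                    multiplier: float = 3.0) -> bool:
    """Trip when the hourly alert volume exceeds multiplier x baseline."""
    return alerts_this_hour > multiplier * baseline_rate


def should_send(tripped: bool, fast_path: bool) -> bool:
    """Fast-path alerts bypass the breaker; everything else is suppressed."""
    return fast_path or not tripped
```

With a baseline of 5 alerts/hour, 15 alerts leaves the breaker closed and the 16th trips it.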
### Story 10.2: Elastic Schema
```gherkin
Feature: DynamoDB Additive Schema
Scenario: New attribute added without breaking V1 readers
Given a DynamoDB item has a new "anomaly_score_v2" attribute
When V1 code reads the item
Then deserialization succeeds (unknown attributes ignored)
Scenario: Key schema modifications are rejected
Given a migration attempts to change the partition key structure
When CI runs the schema lint
Then the build fails with "Key schema modification detected"
```
### Story 10.3: Cognitive Durability
```gherkin
Feature: Decision Logs for Scoring Changes
Scenario: PR modifying Z-score thresholds requires decision log
Given a PR changes files in src/scoring/
And no decision_log.json is included
When CI runs the decision log check
Then the build fails
Scenario: Cyclomatic complexity enforced
Given a scoring function has complexity 12
When the linter runs with threshold 10
Then the lint fails
```
### Story 10.4: Semantic Observability
```gherkin
Feature: OTEL Spans on Scoring Decisions
Scenario: Every anomaly scoring decision emits an OTEL span
Given a CostEvent is processed by the Anomaly Scorer
When scoring completes
Then an "anomaly_scoring" span is created
And span attributes include: cost.z_score, cost.anomaly_score, cost.baseline_days
And cost.fast_path_triggered is set to true/false
Scenario: Account ID is hashed in spans (PII protection)
Given a scoring span is emitted
When the span attributes are inspected
Then the AWS account ID is SHA-256 hashed
And no raw account ID appears in any attribute
```
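The span attributes above could be assembled as a plain dictionary before being attached to the span. A sketch without an OTEL dependency; the attribute name `cost.account_id_hash` and the function name are hypothetical, while the other attribute names come from the scenario.

```python
import hashlib


def scoring_span_attributes(account_id: str, z_score: float, anomaly_score: float,
                            baseline_days: int, fast_path: bool) -> dict:
    """Build span attributes with the account ID replaced by its SHA-256 hex digest."""
    return {
        "cost.account_id_hash": hashlib.sha256(account_id.encode()).hexdigest(),
        "cost.z_score": z_score,
        "cost.anomaly_score": anomaly_score,
        "cost.baseline_days": baseline_days,
        "cost.fast_path_triggered": fast_path,
    }
```

Because AWS account IDs are 12-digit numbers, an unsalted hash can be reversed by enumeration; a keyed hash may be worth considering if that matters for the threat model.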
### Story 10.5: Configurable Autonomy (14-Day Governance)
```gherkin
Feature: 14-Day Auto-Promotion Governance
Scenario: New account starts in strict mode (log-only)
Given a new AWS account is connected
When the first anomaly is detected
Then the anomaly is logged to DynamoDB
And NO Slack alert is sent (strict mode)
Scenario: Account auto-promotes to audit mode after 14 days
Given an account has been connected for 15 days
And the false-positive rate is 5% (<10% threshold)
When the daily governance cron runs
Then the account mode is changed from "strict" to "audit"
And a log entry "Auto-promoted to audit mode" is written
Scenario: Account does NOT promote if false-positive rate is high
Given an account has been connected for 15 days
And the false-positive rate is 15% (>10% threshold)
When the daily governance cron runs
Then the account remains in "strict" mode
Scenario: Panic mode halts all alerting in <1 second
Given panic mode is triggered via POST /admin/panic
When a new critical anomaly is detected
Then NO Slack alert is sent
And the anomaly is still scored and logged
And the dashboard API returns an "alerting paused" header
Scenario: Panic mode requires manual clearance
Given panic mode is active
When 24 hours pass without manual intervention
Then panic mode is still active (no auto-clear)
And only POST /admin/panic with panic=false clears it
```
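The promotion rule run by the daily governance cron might be sketched as below. The 14-day and 10% thresholds come from the scenarios; treating the boundaries as "14 or more days" and "strictly under 10%" is an assumption, and `next_mode` is a hypothetical name.

```python
def next_mode(mode: str, days_connected: int, false_positive_rate: float) -> str:
    """Promote strict -> audit after 14+ days with a false-positive rate under 10%."""
    if mode == "strict" and days_connected >= 14 and false_positive_rate < 0.10:
        return "audit"
    return mode
```

Panic mode is deliberately left out of this function, since the scenarios require that it only clears through an explicit `POST /admin/panic` call.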
---
*End of dd0c/cost BDD Acceptance Specifications — 10 Epics, 55+ Scenarios*