Phase 3: BDD acceptance specs for P1 (route) and P5 (cost)

P1: 50+ scenarios across 10 epics, all stories covered. P5: 55+ scenarios across 10 epics, written manually (Sonnet credential failures). Remaining P2/P3/P4/P6 are in progress via subagents.

`products/01-llm-cost-router/acceptance-specs/acceptance-specs.md` (new file, 685 lines)
# dd0c/route — BDD Acceptance Test Specifications

**Phase 3: Given/When/Then per Story**

**Date:** March 1, 2026

---

## Epic 1: Proxy Engine

### Story 1.1: OpenAI SDK Drop-In Compatibility

```gherkin
Feature: OpenAI SDK Drop-In Compatibility

  Scenario: Non-streaming request proxied successfully
    Given a valid API key "sk-test-123" exists in Redis cache
    And the upstream OpenAI endpoint is healthy
    When I send POST /v1/chat/completions with model "gpt-4o" and stream=false
    Then the response status is 200
    And the response body matches the OpenAI ChatCompletion schema
    And the response includes usage.prompt_tokens and usage.completion_tokens

  Scenario: Request with invalid API key is rejected
    Given no API key "sk-invalid" exists in Redis or PostgreSQL
    When I send POST /v1/chat/completions with Authorization "Bearer sk-invalid"
    Then the response status is 401
    And the response body contains error "Invalid API key"

  Scenario: API key validated from Redis cache (fast path)
    Given API key "sk-cached" exists in Redis cache
    When I send POST /v1/chat/completions with Authorization "Bearer sk-cached"
    Then the key is validated without querying PostgreSQL
    And the request is forwarded to the upstream provider

  Scenario: API key falls back to PostgreSQL on Redis cache miss
    Given API key "sk-db-only" exists in PostgreSQL but NOT in Redis
    When I send POST /v1/chat/completions with Authorization "Bearer sk-db-only"
    Then the key is validated via PostgreSQL
    And the key is written back to Redis cache for future requests
```
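
The two-tier lookup above (Redis fast path, PostgreSQL fallback with cache write-back) can be sketched as follows. The dict-backed stores are hypothetical stand-ins for the real Redis and PostgreSQL clients, used only to illustrate the control flow.

```python
# Sketch of the two-tier API key validation described above.
# Plain dicts stand in for the Redis cache and the PostgreSQL key table.

def validate_key(key: str, redis: dict, postgres: dict) -> bool:
    """Return True if the key is valid; check Redis first, then PostgreSQL."""
    if key in redis:                 # fast path: no database round-trip
        return True
    if key in postgres:              # cache miss: fall back to the database
        redis[key] = postgres[key]   # write back so future requests hit Redis
        return True
    return False                     # unknown key -> caller returns 401

redis, pg = {}, {"sk-db-only": {"org": "acme"}}
assert validate_key("sk-db-only", redis, pg)   # validated via PostgreSQL
assert "sk-db-only" in redis                   # written back to the cache
assert not validate_key("sk-invalid", redis, pg)
```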

### Story 1.2: SSE Streaming Support

```gherkin
Feature: SSE Streaming Passthrough

  Scenario: Streaming response is proxied chunk-by-chunk
    Given a valid API key and upstream OpenAI is healthy
    When I send POST /v1/chat/completions with stream=true
    Then the response Content-Type is "text/event-stream"
    And I receive multiple SSE chunks ending with "data: [DONE]"
    And each chunk matches the OpenAI streaming delta schema

  Scenario: Client disconnects mid-stream
    Given a streaming request is in progress
    When the client drops the TCP connection after 5 chunks
    Then the proxy aborts the upstream provider connection
    And no further chunks are buffered in memory
```

### Story 1.3: <5ms Latency Overhead

```gherkin
Feature: Proxy Latency Budget

  Scenario: P99 latency overhead under 5ms
    Given 1000 non-streaming requests are sent sequentially
    When I measure the delta between proxy response time and upstream response time
    Then the P99 delta is less than 5 milliseconds

  Scenario: Telemetry emission does not block the hot path
    Given TimescaleDB is experiencing 5-second write latency
    When I send a request through the proxy
    Then the proxy responds within the 5ms overhead budget
    And the telemetry event is queued in the mpsc channel (not dropped)
```

### Story 1.4: Transparent Error Passthrough

```gherkin
Feature: Provider Error Passthrough

  Scenario: Rate limit error passed through transparently
    Given the upstream provider returns HTTP 429 with Retry-After header
    When the proxy receives this response
    Then the proxy returns HTTP 429 to the client
    And the Retry-After header is preserved

  Scenario: Provider 500 error passed through
    Given the upstream provider returns HTTP 500
    When the proxy receives this response
    Then the proxy returns HTTP 500 to the client
    And the original error body is preserved
```

---

## Epic 2: Router Brain

### Story 2.1: Complexity Classification

```gherkin
Feature: Request Complexity Classification

  Scenario: Simple extraction task classified as low complexity
    Given a request with system prompt "Extract the name from this text"
    And token count is under 500
    When the complexity classifier evaluates the request
    Then the complexity score is "low"
    And classification completes in under 2ms

  Scenario: Multi-turn reasoning classified as high complexity
    Given a request with 10+ messages in the conversation
    And system prompt contains "analyze", "reason", or "compare"
    When the complexity classifier evaluates the request
    Then the complexity score is "high"

  Scenario: Unknown pattern defaults to medium complexity
    Given a request with no recognizable task pattern
    When the complexity classifier evaluates the request
    Then the complexity score is "medium"
```
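
A minimal heuristic classifier matching the three scenarios above might look like this. The keyword list, the `extract` prefix check, and the exact thresholds are illustrative assumptions; only the 10-message, 500-token, and keyword signals come from the spec.

```python
# Heuristic complexity classifier sketch for the scenarios above.
# Keyword list and prefix check are illustrative assumptions.

REASONING_KEYWORDS = ("analyze", "reason", "compare")

def classify(messages: list[dict], token_count: int) -> str:
    # Pull the first system prompt, if any, lowercased for matching.
    system = next((m["content"].lower() for m in messages if m["role"] == "system"), "")
    if len(messages) >= 10 and any(k in system for k in REASONING_KEYWORDS):
        return "high"        # long multi-turn conversation with reasoning cues
    if token_count < 500 and system.startswith("extract"):
        return "low"         # short, simple extraction task
    return "medium"          # unknown patterns default to medium

assert classify([{"role": "system", "content": "Extract the name from this text"}], 200) == "low"
msgs = [{"role": "system", "content": "Please analyze these options"}] + \
       [{"role": "user", "content": "..."}] * 9
assert classify(msgs, 4000) == "high"
assert classify([{"role": "user", "content": "hi"}], 50) == "medium"
```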

### Story 2.2: Routing Rules

```gherkin
Feature: Configurable Routing Rules

  Scenario: First-match routing rule applied
    Given routing rule "if feature=classify -> cheapest from [gpt-4o-mini, claude-haiku]"
    And the request includes header X-DD0C-Feature: classify
    When the router evaluates the request
    Then the request is routed to the cheapest model in the rule set
    And the routing decision is logged with strategy "cheapest"

  Scenario: No matching rule falls through to default
    Given routing rules exist but none match the request headers
    When the router evaluates the request
    Then the request is routed using the "passthrough" strategy
    And the original model in the request body is used
```

### Story 2.3: Automatic Fallback

```gherkin
Feature: Provider Fallback Chain

  Scenario: Primary provider fails, fallback succeeds
    Given routing rule specifies fallback chain [openai, anthropic]
    And OpenAI returns HTTP 503
    When the proxy processes the request
    Then the request is retried against Anthropic
    And the response is returned successfully
    And the routing decision log shows "fallback triggered"

  Scenario: Circuit breaker opens after sustained failures
    Given OpenAI error rate exceeds 10% over the last 60 seconds
    When a new request arrives targeting OpenAI
    Then the circuit breaker is OPEN for OpenAI
    And the request is immediately routed to the fallback provider
    And no request is sent to OpenAI
```

### Story 2.4: Real-Time Cost Savings

```gherkin
Feature: Cost Savings Calculation

  Scenario: Savings calculated when model is downgraded
    Given the original request specified model "gpt-4o" ($15/1M input tokens)
    And the router downgraded to "gpt-4o-mini" ($0.15/1M input tokens)
    And the request used 1000 input tokens
    When the cost calculator runs
    Then cost_original is $0.015
    And cost_actual is $0.00015
    And cost_saved is $0.01485

  Scenario: Zero savings when passthrough (no routing)
    Given the request was routed with strategy "passthrough"
    When the cost calculator runs
    Then cost_saved is $0.00
```
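
The arithmetic in the downgrade scenario checks out exactly; a sketch using `Decimal` (to avoid binary-float drift on per-token prices) reproduces the three expected values. The `cost` helper is an illustrative function, not the product's actual calculator.

```python
# The savings arithmetic from the scenario above, with Decimal to keep
# per-token prices exact.
from decimal import Decimal

def cost(tokens: int, price_per_million: Decimal) -> Decimal:
    return tokens * price_per_million / Decimal(1_000_000)

cost_original = cost(1000, Decimal("15"))    # gpt-4o at $15/1M input tokens
cost_actual = cost(1000, Decimal("0.15"))    # gpt-4o-mini at $0.15/1M
cost_saved = cost_original - cost_actual

assert cost_original == Decimal("0.015")
assert cost_actual == Decimal("0.00015")
assert cost_saved == Decimal("0.01485")
```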

---

## Epic 3: Analytics Pipeline

### Story 3.1: Non-Blocking Telemetry

```gherkin
Feature: Asynchronous Telemetry Emission

  Scenario: Telemetry emitted without blocking request
    Given the proxy processes a request successfully
    When the response is sent to the client
    Then a RequestEvent is emitted to the mpsc channel
    And the channel send completes in under 1ms

  Scenario: Bounded channel drops events when full
    Given the mpsc channel is at capacity (1000 events)
    When a new telemetry event is emitted
    Then the event is dropped (not blocking the proxy)
    And a "telemetry_dropped" counter is incremented
```
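
The drop-on-full behaviour above can be sketched with a bounded queue; here `queue.Queue` stands in for the Rust mpsc channel, and the class shape is illustrative rather than the real telemetry API.

```python
# Drop-on-full telemetry sketch mirroring the bounded-channel scenario above.
# queue.Queue stands in for the Rust mpsc channel.
import queue

class Telemetry:
    def __init__(self, capacity: int = 1000):
        self.channel = queue.Queue(maxsize=capacity)
        self.dropped = 0  # the "telemetry_dropped" counter

    def emit(self, event: dict) -> None:
        try:
            self.channel.put_nowait(event)  # never blocks the hot path
        except queue.Full:
            self.dropped += 1               # drop rather than block the proxy

t = Telemetry(capacity=2)
for i in range(3):
    t.emit({"request_id": i})
assert t.channel.qsize() == 2  # channel held at capacity
assert t.dropped == 1          # third event dropped and counted
```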

### Story 3.2: Fast Dashboard Queries

```gherkin
Feature: Continuous Aggregates for Dashboard

  Scenario: Hourly cost summary is pre-calculated
    Given 10,000 request events exist in the last hour
    When I query GET /api/dashboard/summary?period=1h
    Then the response returns in under 200ms
    And total_cost, total_saved, and avg_latency are present

  Scenario: Treemap data returns cost breakdown by team and model
    Given request events tagged with team="payments" and team="search"
    When I query GET /api/dashboard/treemap?period=7d
    Then the response includes cost breakdowns per team per model
```

### Story 3.3: Automatic Data Compression

```gherkin
Feature: TimescaleDB Compression

  Scenario: Chunks older than 7 days are compressed
    Given request_events data exists from 10 days ago
    When the compression policy runs
    Then chunks older than 7 days are compressed by 90%+
    And compressed data is still queryable via continuous aggregates
```

---

## Epic 4: Dashboard API

### Story 4.1: GitHub OAuth Signup

```gherkin
Feature: GitHub OAuth Authentication

  Scenario: New user signs up via GitHub OAuth
    Given the user has a valid GitHub account
    When they complete the /api/auth/github OAuth flow
    Then an organization is created automatically
    And a JWT access token is returned
    And a refresh token is stored in Redis

  Scenario: Existing user logs in via GitHub OAuth
    Given the user already has an organization
    When they complete the /api/auth/github OAuth flow
    Then a new JWT is issued for the existing organization
    And no duplicate organization is created
```

### Story 4.2: Routing Rules & Provider Keys CRUD

```gherkin
Feature: Routing Rules Management

  Scenario: Create a new routing rule
    Given I am authenticated as an Owner
    When I POST /api/orgs/{id}/routing-rules with a valid rule body
    Then the rule is created and returned with an ID
    And the rule is loaded into the proxy within 60 seconds

  Scenario: Provider API key is encrypted at rest
    Given I POST /api/orgs/{id}/provider-keys with key "sk-live-abc123"
    When the key is stored in PostgreSQL
    Then the stored value is AES-256-GCM encrypted
    And decrypting with the correct KMS key returns "sk-live-abc123"
```

### Story 4.3: Dashboard Data Endpoints

```gherkin
Feature: Dashboard Summary & Treemap

  Scenario: Summary endpoint returns aggregated metrics
    Given request events exist for the authenticated org
    When I GET /api/dashboard/summary?period=30d
    Then the response includes total_cost, total_saved, request_count, avg_latency

  Scenario: Request inspector with filters
    Given 500 request events exist for the org
    When I GET /api/requests?feature=classify&limit=20
    Then 20 results are returned matching feature="classify"
    And no prompt content is included in the response
```

### Story 4.4: API Key Revocation

```gherkin
Feature: API Key Revocation

  Scenario: Revoked key is immediately blocked
    Given API key "sk-compromised" is active
    When I DELETE /api/orgs/{id}/api-keys/{key_id}
    Then the key is removed from Redis cache immediately
    And subsequent requests with "sk-compromised" return 401
```

---

## Epic 5: Dashboard UI

### Story 5.1: Cost Treemap

```gherkin
Feature: AI Spend Treemap Visualization

  Scenario: Treemap renders cost breakdown
    Given the user is logged into the dashboard
    And cost data exists for teams "payments" and "search"
    When the dashboard loads with period=7d
    Then a treemap visualization renders
    And the "payments" block is proportionally larger if it spent more
```

### Story 5.2: Real-Time Savings Counter

```gherkin
Feature: Savings Counter

  Scenario: Weekly savings counter updates
    Given the user is on the dashboard
    And $127.50 has been saved this week
    When the dashboard loads
    Then the counter displays "You saved $127.50 this week"
```

### Story 5.3: Routing Rules Editor

```gherkin
Feature: Routing Rules Editor UI

  Scenario: Create rule via drag-and-drop interface
    Given the user navigates to Settings > Routing Rules
    When they create a new rule with feature="classify" and strategy="cheapest"
    And drag it to priority position 1
    Then the rule is saved via the API
    And the rule list reflects the new priority order
```

### Story 5.4: Request Inspector

```gherkin
Feature: Request Inspector

  Scenario: Inspect routing decision for a specific request
    Given the user opens the Request Inspector
    When they click on a specific request row
    Then the detail panel shows: model_selected, model_alternatives, cost_delta, routing_strategy
    And no prompt content is displayed
```

---

## Epic 6: Shadow Audit CLI

### Story 6.1: Zero-Setup Codebase Scan

```gherkin
Feature: Shadow Audit CLI Scan

  Scenario: Scan TypeScript project for LLM usage
    Given a TypeScript project with 3 files calling openai.chat.completions.create()
    When I run npx dd0c-scan in the project root
    Then the CLI detects 3 LLM API call sites
    And estimates monthly cost based on heuristic token counts
    And displays projected savings with dd0c/route

  Scenario: Scan runs completely offline with cached pricing
    Given the pricing table was cached from a previous run
    And the network is unavailable
    When I run npx dd0c-scan
    Then the scan completes using cached pricing data
    And a warning is shown that pricing may be stale
```
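
A sketch of the kind of heuristic cost estimate the CLI describes. The per-call token count and monthly call volume below are invented example inputs, and the pricing table is a cached lookup like the one the spec mentions; none of these numbers are the real heuristics.

```python
# Illustrative monthly-cost heuristic for the CLI scan.
# Token counts and call volume are invented example assumptions.

PRICING_PER_1M_INPUT = {"gpt-4o": 15.0, "gpt-4o-mini": 0.15}  # cached pricing table

def estimate_monthly_cost(model: str, avg_tokens_per_call: int,
                          calls_per_month: int) -> float:
    return calls_per_month * avg_tokens_per_call * PRICING_PER_1M_INPUT[model] / 1_000_000

current = estimate_monthly_cost("gpt-4o", avg_tokens_per_call=800, calls_per_month=50_000)
routed = estimate_monthly_cost("gpt-4o-mini", avg_tokens_per_call=800, calls_per_month=50_000)
print(f"current ${current:.2f}/mo, projected savings ${current - routed:.2f}/mo")
```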

### Story 6.2: No Source Code Exfiltration

```gherkin
Feature: Local-Only Scan

  Scenario: No source code sent to server
    Given I run npx dd0c-scan without --opt-in flag
    When the scan completes
    Then zero HTTP requests are made to any dd0c server
    And the report is rendered entirely in the terminal
```

### Story 6.3: Terminal Report

```gherkin
Feature: Terminal Report Output

  Scenario: Top Opportunities report
    Given the scan found 5 files with LLM calls
    When the report renders
    Then it shows "Top Opportunities" sorted by potential savings
    And each entry includes: file path, current model, suggested model, estimated monthly savings
```

---

## Epic 7: Slack Integration

### Story 7.1: Weekly Savings Digest

```gherkin
Feature: Monday Morning Digest

  Scenario: Weekly digest sent via email
    Given it is Monday at 9:00 AM UTC
    And the org has cost data from the previous week
    When the dd0c-worker cron fires
    Then an email is sent via AWS SES to the org admin
    And the email includes: total_saved, top_model_savings, week_over_week_trend

  Scenario: No digest sent if no activity
    Given the org had zero requests last week
    When the cron fires
    Then no email is sent
```

### Story 7.2: Budget Alert via Slack

```gherkin
Feature: Budget Threshold Alerts

  Scenario: Daily spend exceeds configured threshold
    Given the org configured alert threshold at $100/day
    And today's spend reaches $101
    When the dd0c-worker evaluates thresholds
    Then a Slack webhook is fired with the alert payload
    And the payload includes X-DD0C-Signature header
    And last_fired_at is updated to prevent duplicate alerts

  Scenario: Alert not re-fired for same incident
    Given an alert was already fired for today's threshold breach
    When the worker evaluates thresholds again
    Then no duplicate Slack webhook is sent
```

---

## Epic 8: Infrastructure & DevOps

### Story 8.1: ECS Fargate Deployment

```gherkin
Feature: Containerized Deployment

  Scenario: CDK deploys all services to ECS Fargate
    Given the CDK stack is synthesized
    When cdk deploy is executed
    Then ECS services are created for proxy, api, and worker
    And ALB routes /v1/* to proxy and /api/* to api
    And CloudFront serves static dashboard assets from S3
```

### Story 8.2: CI/CD Pipeline

```gherkin
Feature: GitHub Actions CI/CD

  Scenario: Push to main triggers full pipeline
    Given code is pushed to the main branch
    When GitHub Actions triggers
    Then tests run (unit + integration + canary suite)
    And Docker images are built and pushed to ECR
    And ECS services are updated with rolling deployment
    And zero downtime is maintained during deployment
```

### Story 8.3: CloudWatch Alarms

```gherkin
Feature: Monitoring & Alerting

  Scenario: P99 latency alarm fires
    Given proxy P99 latency exceeds 50ms for 5 minutes
    When CloudWatch evaluates the alarm
    Then a PagerDuty incident is created
```

---

## Epic 9: Onboarding & PLG

### Story 9.1: One-Click GitHub Signup

```gherkin
Feature: Frictionless Signup

  Scenario: New user completes signup in under 60 seconds
    Given the user clicks "Sign up with GitHub"
    When they authorize the OAuth app
    Then an org is created and an API key is generated
    And the user lands on the onboarding wizard
    And total elapsed time is under 60 seconds
```

### Story 9.2: Free Tier Enforcement

```gherkin
Feature: Free Tier ($50/month routed spend)

  Scenario: Free tier user within limit
    Given the org is on the free tier
    And this month's routed spend is $45
    When a new request is proxied
    Then the request is processed normally

  Scenario: Free tier user exceeds limit
    Given the org is on the free tier
    And this month's routed spend is $50.01
    When a new request is proxied
    Then the proxy returns HTTP 429
    And the response body includes an upgrade CTA with Stripe Checkout link

  Scenario: Monthly counter resets on the 1st
    Given the org used $50 last month
    When the calendar rolls to the 1st of the new month
    Then the Redis counter is reset to $0
    And requests are processed normally again
```
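
The three free-tier scenarios above reduce to a small gate around a monthly counter. This sketch uses an in-memory `Decimal` counter as a stand-in for the Redis counter in the spec; the class and method names are illustrative assumptions.

```python
# Free-tier gate sketch: $50/month routed spend, reset on the 1st.
# The in-memory counter stands in for the Redis counter.
from decimal import Decimal

FREE_TIER_LIMIT = Decimal("50.00")

class SpendCounter:
    def __init__(self):
        self.month_spend = Decimal("0")

    def allow_request(self) -> bool:
        return self.month_spend <= FREE_TIER_LIMIT

    def record(self, cost: Decimal) -> None:
        self.month_spend += cost

    def monthly_reset(self) -> None:  # fired when the calendar rolls to the 1st
        self.month_spend = Decimal("0")

c = SpendCounter()
c.record(Decimal("45"))
assert c.allow_request()          # $45 is within the limit
c.record(Decimal("5.01"))         # total is now $50.01
assert not c.allow_request()      # proxy returns 429 + upgrade CTA
c.monthly_reset()
assert c.allow_request()          # processed normally again
```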

### Story 9.3: API Key Management

```gherkin
Feature: API Key CRUD

  Scenario: Generate new API key
    Given I am authenticated as an Owner
    When I POST /api/orgs/{id}/api-keys
    Then a new API key is returned in plaintext (shown once)
    And the key is stored as a bcrypt hash in PostgreSQL
    And the key is cached in Redis for fast auth

  Scenario: Rotate API key
    Given API key "sk-old" is active
    When I POST /api/orgs/{id}/api-keys/{key_id}/rotate
    Then a new key "sk-new" is returned
    And "sk-old" is immediately invalidated
```

### Story 9.4: First Route Onboarding

```gherkin
Feature: 2-Minute First Route

  Scenario: User completes onboarding wizard
    Given the user just signed up
    When they copy the API key (step 1)
    And paste the provided curl command (step 2)
    And the request appears in the dashboard (step 3)
    Then the onboarding is marked complete
    And a PostHog event "routing.savings.first_dollar" is tracked (if savings occurred)
```

### Story 9.5: Team Invites

```gherkin
Feature: Team Member Invites

  Scenario: Invite team member via email
    Given I am an Owner of org "acme"
    When I POST /api/orgs/{id}/invites with email "dev@acme.com"
    Then an email with a magic link is sent
    And the magic link JWT expires in 72 hours

  Scenario: Invited user joins existing org
    Given "dev@acme.com" received an invite magic link
    When they click the link and complete GitHub OAuth
    Then they are added to org "acme" as a Member (not Owner)
    And no new org is created
```

---

## Epic 10: Transparent Factory Compliance

### Story 10.1: Atomic Flagging

```gherkin
Feature: Feature Flag Infrastructure

  Scenario: New routing strategy behind a flag (default off)
    Given feature flag "enable_cascading_router" exists with default=off
    When a request arrives that would trigger cascading routing
    Then the passthrough strategy is used instead
    And the flag evaluation completes without network calls

  Scenario: Flag auto-disables on latency regression
    Given flag "enable_new_classifier" is at 50% rollout
    And P99 latency increased by 8% since flag was enabled
    When the circuit breaker evaluates flag health
    Then the flag is auto-disabled within 30 seconds
    And an alert is fired

  Scenario: CI blocks deployment with expired flag
    Given flag "old_experiment" has TTL expired and rollout=100%
    When CI runs the flag audit
    Then the build fails with "Expired flag at full rollout: old_experiment"
```

### Story 10.2: Elastic Schema

```gherkin
Feature: Additive-Only Schema Migrations

  Scenario: Migration with DROP COLUMN is rejected
    Given a migration file contains "ALTER TABLE request_events DROP COLUMN old_field"
    When CI runs the schema lint
    Then the build fails with "Destructive schema change detected"

  Scenario: V1 code ignores V2 fields
    Given a request_events row contains a new "routing_v2" column
    When V1 Rust code deserializes the row
    Then deserialization succeeds (unknown fields ignored)
    And no error is logged
```

### Story 10.3: Cognitive Durability

```gherkin
Feature: Decision Logs for Routing Logic

  Scenario: PR touching router requires decision log
    Given a PR modifies files in src/router/
    And no decision_log.json is included
    When CI runs the decision log check
    Then the build fails with "Decision log required for routing changes"

  Scenario: Cyclomatic complexity exceeds cap
    Given a function in src/router/ has cyclomatic complexity 12
    When cargo clippy runs with cognitive_complexity threshold 10
    Then the lint fails
```

### Story 10.4: Semantic Observability

```gherkin
Feature: AI Reasoning Spans

  Scenario: Routing decision emits OTEL span
    Given a request is routed from gpt-4o to gpt-4o-mini
    When the routing decision completes
    Then an "ai_routing_decision" span is created
    And span attributes include: ai.model_selected, ai.cost_delta, ai.complexity_score
    And ai.prompt_hash is a SHA-256 hash (not raw content)

  Scenario: No PII in any span
    Given a request with user email in the prompt
    When the span is emitted
    Then no span attribute contains the email address
    And ai.prompt_hash is the only prompt-related attribute
```

### Story 10.5: Configurable Autonomy

```gherkin
Feature: Governance Policy

  Scenario: Strict mode blocks auto-applied config changes
    Given governance_mode is "strict"
    When the background task attempts to refresh routing rules
    Then the refresh is blocked
    And a log entry "Blocked by strict mode" is written

  Scenario: Panic mode freezes to last-known-good
    Given panic_mode is triggered via POST /admin/panic
    When a new request arrives
    Then routing uses the frozen last-known-good configuration
    And auto-failover is disabled
    And the response header includes "X-DD0C-Panic: active"
```

---

*End of dd0c/route BDD Acceptance Specifications — 10 Epics, 50+ Scenarios*

# dd0c/cost — BDD Acceptance Test Specifications

**Phase 3: Given/When/Then per Story**

**Date:** March 1, 2026

---

## Epic 1: CloudTrail Ingestion

### Story 1.1: EventBridge Cross-Account Rules

```gherkin
Feature: CloudTrail Event Ingestion

  Scenario: EC2 RunInstances event ingested successfully
    Given a customer AWS account has EventBridge rules forwarding to dd0c
    When a RunInstances CloudTrail event fires in the customer account
    Then the event arrives in the dd0c SQS FIFO queue within 5 seconds
    And the event contains userIdentity, eventSource, and requestParameters

  Scenario: Non-cost-generating event is filtered at EventBridge
    Given EventBridge rules filter for cost-relevant API calls only
    When a DescribeInstances CloudTrail event fires
    Then the event is NOT forwarded to the SQS queue
```

### Story 1.2: SQS FIFO Deduplication

```gherkin
Feature: CloudTrail Event Deduplication

  Scenario: Duplicate CloudTrail event is deduplicated by SQS FIFO
    Given a RunInstances event with eventID "evt-abc-123" was already processed
    When the same event arrives again (CloudTrail duplicate delivery)
    Then SQS FIFO deduplicates based on the eventID
    And only one CostEvent is written to DynamoDB

  Scenario: Same resource type from different events is allowed
    Given two different RunInstances events for the same instance type
    But with different eventIDs
    When both arrive in the SQS queue
    Then both are processed as separate CostEvents
```
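
The deduplication semantics above (drop repeated eventIDs, keep distinct IDs even for identical resources) can be illustrated with a seen-set; in production SQS FIFO enforces this via the deduplication ID, so the set here is only a sketch of the behaviour.

```python
# Application-level sketch of the eventID deduplication guarantee above.
# In production SQS FIFO handles this via the deduplication ID.

def process_events(events: list[dict], seen: set) -> list[dict]:
    cost_events = []
    for event in events:
        if event["eventID"] in seen:   # duplicate CloudTrail delivery
            continue
        seen.add(event["eventID"])
        cost_events.append({"source_event": event["eventID"]})
    return cost_events

batch = [
    {"eventID": "evt-abc-123", "type": "RunInstances"},
    {"eventID": "evt-abc-123", "type": "RunInstances"},  # duplicate delivery
    {"eventID": "evt-def-456", "type": "RunInstances"},  # same type, new ID
]
written = process_events(batch, seen=set())
assert len(written) == 2  # one CostEvent per unique eventID
```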

### Story 1.3: Lambda Normalizer

```gherkin
Feature: CloudTrail Event Normalization

  Scenario: EC2 RunInstances normalized to CostEvent schema
    Given a raw CloudTrail RunInstances event
    When the Lambda normalizer processes it
    Then a CostEvent is created with: service="ec2", instanceType="m5.xlarge", region, actor ARN, hourly cost
    And the hourly cost is looked up from the static pricing table

  Scenario: RDS CreateDBInstance normalized correctly
    Given a raw CloudTrail CreateDBInstance event
    When the Lambda normalizer processes it
    Then a CostEvent is created with: service="rds", engine, storageType, multiAZ flag

  Scenario: Unknown instance type uses fallback pricing
    Given a CloudTrail event with instanceType "z99.mega" not in the pricing table
    When the normalizer looks up pricing
    Then fallback pricing of $0 is applied
    And a warning is logged: "Unknown instance type: z99.mega"

  Scenario: Malformed CloudTrail JSON does not crash the Lambda
    Given a CloudTrail event with invalid JSON structure
    When the Lambda normalizer processes it
    Then the event is sent to the DLQ
    And no CostEvent is written
    And an error metric is emitted

  Scenario: Batched RunInstances (multiple instances) creates multiple CostEvents
    Given a RunInstances event launching 5 instances simultaneously
    When the normalizer processes it
    Then 5 separate CostEvents are created (one per instance)
```
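
The fallback-pricing behavior above can be sketched as follows. The table contents and the `lookup_hourly_cost` name are illustrative assumptions, not the real normalizer's module layout.

```python
import logging

logger = logging.getLogger("normalizer")

# Hypothetical static pricing table (USD/hour); the real table is larger.
HOURLY_PRICING = {
    "m5.xlarge": 0.192,
    "t3.nano": 0.0052,
    "p4d.24xlarge": 32.77,
}

def lookup_hourly_cost(instance_type: str) -> float:
    """Return the hourly cost; unknown types fall back to $0 with a warning."""
    try:
        return HOURLY_PRICING[instance_type]
    except KeyError:
        logger.warning("Unknown instance type: %s", instance_type)
        return 0.0
```

The $0 fallback keeps the pipeline flowing on unseen SKUs while the warning makes the pricing-table gap visible.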

### Story 1.4: Cost Precision

```gherkin
Feature: Cost Calculation Precision

  Scenario: Hourly cost calculated with 4 decimal places
    Given instance type "m5.xlarge" costs $0.192/hr
    When the normalizer calculates hourly cost
    Then the cost is stored as 0.1920 (4 decimal places)

  Scenario: Sub-cent Lambda costs handled correctly
    Given a Lambda invocation costing $0.0000002 per request
    When the normalizer calculates cost
    Then the cost is stored without floating-point precision loss
```
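
One way to satisfy both precision scenarios is to carry costs as `Decimal` end to end and quantize only hourly figures at the storage boundary. A minimal sketch; the function names are assumptions, not the actual codebase:

```python
from decimal import Decimal

def parse_cost(raw: str) -> Decimal:
    """Parse a price string into an exact Decimal (never float),
    so $0.0000002 survives with no binary rounding error."""
    return Decimal(raw)

def stored_hourly_cost(raw: str) -> Decimal:
    """Hourly costs are quantized to 4 decimal places at storage time."""
    return parse_cost(raw).quantize(Decimal("0.0001"))
```

DynamoDB's number type accepts `Decimal` natively via boto3, so no float conversion is needed on the write path.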

---

## Epic 2: Anomaly Detection

### Story 2.1: Z-Score Calculation

```gherkin
Feature: Z-Score Anomaly Scoring

  Scenario: Event matching baseline mean scores zero
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.25
    Then the Z-score is 0.0
    And the severity is "info"

  Scenario: Event 3 standard deviations above mean scores critical
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.70 (Z=3.0)
    Then the Z-score is 3.0
    And the severity is "critical"

  Scenario: Zero standard deviation does not cause division by zero
    Given a baseline with mean=$1.00/hr and stddev=$0.00 (all identical observations)
    When a CostEvent arrives with hourly cost $1.50
    Then the scorer handles the zero-stddev case gracefully
    And the severity is based on absolute cost delta, not Z-score

  Scenario: Single data point baseline (stddev undefined)
    Given a baseline with only 1 observation
    When a CostEvent arrives
    Then the scorer uses the cold-start fast-path instead of Z-score
```

### Story 2.2: Novelty Scoring

```gherkin
Feature: Actor and Instance Novelty Detection

  Scenario: New IAM role triggers actor novelty penalty
    Given the baseline has observed actors: ["arn:aws:iam::123:role/ci"]
    When a CostEvent arrives from actor "arn:aws:iam::123:role/unknown-script"
    Then an actor novelty penalty is added to the composite score

  Scenario: Known actor and known instance type has no novelty penalty
    Given the baseline has observed actors and instance types matching the event
    When the CostEvent is scored
    Then the novelty component is 0
```

### Story 2.3: Cold-Start Fast Path

```gherkin
Feature: Cold-Start Fast Path for Expensive Resources

  Scenario: $25/hr instance triggers immediate critical (bypasses baseline)
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a p4d.24xlarge at $32.77/hr
    Then the fast-path triggers immediately
    And severity is "critical" regardless of baseline state

  Scenario: $0.10/hr instance ignored during cold-start
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a t3.nano at $0.0052/hr
    Then the fast-path does NOT trigger
    And the event is scored normally (likely "info")

  Scenario: Fast-path transitions to statistical scoring at maturity
    Given the account baseline has 20+ events AND 14+ days
    When a CostEvent arrives for a $10/hr instance
    Then the Z-score path is used (not fast-path)
```
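
The three scenarios together imply a predicate like the following. The $25/hr threshold and the 14-day/20-event maturity cutoffs are taken from the scenarios; the function name is hypothetical:

```python
FAST_PATH_HOURLY_THRESHOLD = 25.0  # USD/hr, per the scenario above
MATURITY_DAYS = 14
MATURITY_EVENTS = 20

def use_fast_path(hourly_cost: float, baseline_days: int,
                  baseline_events: int) -> bool:
    """Fast-path fires only while the baseline is immature AND the
    resource is expensive; mature baselines use the Z-score path."""
    immature = (baseline_days < MATURITY_DAYS
                or baseline_events < MATURITY_EVENTS)
    return immature and hourly_cost >= FAST_PATH_HOURLY_THRESHOLD
```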

### Story 2.4: Composite Scoring

```gherkin
Feature: Composite Severity Classification

  Scenario: Composite score below 30 classified as info
    Given Z-score contribution is 15 and novelty contribution is 10
    When the composite score is calculated (25)
    Then severity is "info"

  Scenario: Composite score above 60 classified as critical
    Given Z-score contribution is 45 and novelty contribution is 20
    When the composite score is calculated (65)
    Then severity is "critical"

  Scenario: Events below $0.50/hr never classified as critical
    Given a CostEvent with hourly cost $0.30 and Z-score of 5.0
    When the composite score is calculated
    Then severity is capped at "warning" (never critical for cheap resources)
```
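
A classifier consistent with these scenarios might look like this. The 30/60 cutoffs and the $0.50 floor come from the scenarios; treating the middle band as "warning" is an assumption:

```python
CRITICAL_COST_FLOOR = 0.50  # USD/hr; cheap resources never go critical

def classify(composite: float, hourly_cost: float) -> str:
    """Map a composite score plus hourly cost to a severity label."""
    if composite < 30:
        return "info"
    # Cap at "warning" for sub-$0.50/hr resources regardless of score.
    if composite <= 60 or hourly_cost < CRITICAL_COST_FLOOR:
        return "warning"
    return "critical"
```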

### Story 2.5: Feedback Loop

```gherkin
Feature: Mark-as-Expected Feedback

  Scenario: Marking an anomaly as expected reduces future scores
    Given an anomaly was triggered for actor "arn:aws:iam::123:role/batch-job"
    When the user clicks "Mark as Expected" in Slack
    Then the actor is added to the expected_actors list
    And future events from this actor receive a reduced score

  Scenario: Expected actor still flagged if cost is 10x above baseline
    Given actor "batch-job" is in the expected_actors list
    And the baseline mean is $1.00/hr
    When a CostEvent arrives from "batch-job" at $15.00/hr
    Then the anomaly is still flagged (10x override)
```
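
The 10x override can be expressed as a small suppression predicate. The multiplier is from the scenario; the function name and signature are illustrative:

```python
OVERRIDE_MULTIPLIER = 10.0  # expected actors still flag above 10x baseline

def suppress_for_expected_actor(actor: str, hourly_cost: float,
                                baseline_mean: float,
                                expected_actors: set[str]) -> bool:
    """True when the alert should be suppressed: the actor is expected
    AND the cost stays below the 10x-baseline override threshold."""
    if actor not in expected_actors:
        return False
    return hourly_cost < baseline_mean * OVERRIDE_MULTIPLIER
```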

---

## Epic 3: Zombie Hunter

```gherkin
Feature: Idle Resource Detection

  Scenario: EC2 instance running 7+ days with <5% CPU detected as zombie
    Given an EC2 instance "i-abc123" has been running for 10 days
    And CloudWatch average CPU utilization is 2%
    When the daily Zombie Hunter scan runs
    Then "i-abc123" is flagged as a zombie
    And cumulative waste cost is calculated (10 days × hourly rate)

  Scenario: RDS instance with 0 connections for 3+ days detected
    Given an RDS instance "prod-db-unused" has 0 connections for 5 days
    When the Zombie Hunter scan runs
    Then "prod-db-unused" is flagged as a zombie

  Scenario: Unattached EBS volume older than 7 days detected
    Given an EBS volume "vol-xyz" has been unattached for 14 days
    When the Zombie Hunter scan runs
    Then "vol-xyz" is flagged as a zombie
    And the waste cost includes the daily storage charge

  Scenario: Instance tagged dd0c:ignore is excluded
    Given an EC2 instance has tag "dd0c:ignore=true"
    When the Zombie Hunter scan runs
    Then the instance is skipped

  Scenario: Zombie scan respects read-only permissions
    Given the dd0c IAM role has only read permissions
    When the Zombie Hunter scan runs
    Then no modify/stop/terminate API calls are made
```

---

## Epic 4: Notification & Remediation

### Story 4.1: Slack Block Kit Alerts

```gherkin
Feature: Slack Anomaly Alerts

  Scenario: Critical anomaly sends Slack alert with Block Kit
    Given an anomaly with severity "critical" for a p4d.24xlarge in us-east-1
    When the notification engine processes it
    Then a Slack message is sent with Block Kit formatting
    And the message includes: resource type, region, hourly cost, actor, "Why this alert" section
    And the message includes buttons: Snooze 1h, Snooze 24h, Mark Expected, Stop Instance

  Scenario: Info-level anomalies are NOT sent to Slack
    Given an anomaly with severity "info"
    When the notification engine processes it
    Then no Slack message is sent
    And the anomaly is only logged to DynamoDB
```

### Story 4.2: Interactive Remediation

```gherkin
Feature: One-Click Remediation via Slack

  Scenario: User clicks "Stop Instance" and remediation succeeds
    Given a Slack alert for EC2 instance "i-abc123" in account "123456789012"
    When the user clicks "Stop Instance"
    Then the system assumes the cross-account remediation role via STS
    And ec2:StopInstances is called for "i-abc123"
    And the Slack message is updated to "Remediation Successful ✅"

  Scenario: Cross-account IAM role has been revoked
    Given the customer revoked the dd0c remediation IAM role
    When the user clicks "Stop Instance"
    Then STS AssumeRole fails
    And the Slack message is updated to "Remediation Failed: IAM role not found"
    And no instance is stopped

  Scenario: Snooze suppresses alerts for the specified duration
    Given the user clicks "Snooze 24h" on an anomaly
    When a similar anomaly fires within 24 hours
    Then no Slack alert is sent
    And the anomaly is still logged to DynamoDB

  Scenario: Mark Expected updates the baseline
    Given the user clicks "Mark Expected"
    When the feedback is processed
    Then the actor is added to expected_actors
    And the false-positive counter is incremented for governance scoring
```

### Story 4.3: Slack Signature Validation

```gherkin
Feature: Slack Interactive Payload Security

  Scenario: Valid Slack signature accepted
    Given a Slack interactive payload with correct HMAC-SHA256 signature
    When the API receives the payload
    Then the action is processed

  Scenario: Invalid signature rejected
    Given a payload with an incorrect X-Slack-Signature header
    When the API receives the payload
    Then HTTP 401 is returned
    And no action is taken

  Scenario: Expired timestamp rejected (replay attack)
    Given a payload with X-Slack-Request-Timestamp older than 5 minutes
    When the API receives the payload
    Then HTTP 401 is returned
```

### Story 4.4: Daily Digest

```gherkin
Feature: Daily Anomaly Digest

  Scenario: Daily digest aggregates 24h of anomalies
    Given 15 anomalies were detected in the last 24 hours
    When the daily digest cron fires
    Then a Slack message is sent with: total anomalies, top 3 costliest, total estimated spend, zombie count

  Scenario: No digest sent if zero anomalies
    Given zero anomalies in the last 24 hours
    When the daily digest cron fires
    Then no Slack message is sent
```

---

## Epic 5: Onboarding & PLG

```gherkin
Feature: AWS Account Onboarding

  Scenario: CloudFormation quick-create deploys IAM role
    Given the user clicks the quick-create URL
    When CloudFormation deploys the stack
    Then an IAM role is created with the correct permissions and ExternalId
    And an EventBridge rule is created for cost-relevant CloudTrail events

  Scenario: Role validation succeeds with correct ExternalId
    Given the IAM role exists with ExternalId "dd0c-tenant-123"
    When POST /v1/accounts validates the role
    Then STS AssumeRole succeeds
    And the account is marked as "active"
    And the initial Zombie Hunter scan is triggered

  Scenario: Role validation fails with wrong ExternalId
    Given the IAM role exists but ExternalId does not match
    When POST /v1/accounts validates the role
    Then STS AssumeRole fails with AccessDenied
    And a clear error message is returned

  Scenario: Free tier allows 1 account
    Given the tenant is on the free tier with 0 connected accounts
    When they connect their first AWS account
    Then the connection succeeds

  Scenario: Free tier rejects 2nd account
    Given the tenant is on the free tier with 1 connected account
    When they attempt to connect a second account
    Then HTTP 403 is returned with an upgrade prompt

  Scenario: Stripe upgrade unlocks multiple accounts
    Given the tenant upgrades via Stripe Checkout
    When checkout.session.completed webhook fires
    Then the tenant tier is updated to "pro"
    And they can connect additional accounts

  Scenario: Stripe webhook signature validated
    Given a Stripe webhook payload
    When the signature does not match the signing secret
    Then HTTP 401 is returned
```

---

## Epic 6: Dashboard API

```gherkin
Feature: Dashboard API with Tenant Isolation

  Scenario: GET /v1/anomalies returns anomalies for authenticated tenant
    Given tenant "t-123" has 50 anomalies
    And tenant "t-456" has 30 anomalies
    When tenant "t-123" calls GET /v1/anomalies
    Then only tenant "t-123" anomalies are returned
    And zero anomalies from "t-456" are included

  Scenario: Cursor-based pagination
    Given tenant "t-123" has 200 anomalies
    When GET /v1/anomalies?limit=20 is called
    Then 20 results are returned with a cursor token
    And calling with the cursor returns the next 20

  Scenario: Filter by severity
    Given anomalies exist with severity "critical", "warning", and "info"
    When GET /v1/anomalies?severity=critical is called
    Then only critical anomalies are returned

  Scenario: Baseline override
    Given tenant "t-123" wants to adjust sensitivity for EC2 m5.xlarge
    When PATCH /v1/accounts/{id}/baselines/ec2/m5.xlarge with sensitivity=0.8
    Then the baseline sensitivity is updated
    And future scoring uses the adjusted threshold

  Scenario: Missing Cognito JWT returns 401
    Given no Authorization header is present
    When any API endpoint is called
    Then HTTP 401 is returned
```

---

## Epic 7: Dashboard UI

```gherkin
Feature: Dashboard UI

  Scenario: Anomaly feed renders with severity badges
    Given the user is logged in
    When the dashboard loads
    Then anomalies are listed with color-coded severity badges (red/yellow/blue)

  Scenario: Baseline learning progress displayed
    Given an account connected 5 days ago with 30 events
    When the user views the account detail
    Then a progress bar shows "5/14 days, 30/20 events" toward maturity

  Scenario: Zombie resource list shows waste estimate
    Given 3 zombie resources were detected
    When the user views the Zombie Hunter tab
    Then each zombie shows: resource type, age, cumulative waste cost
```

---

## Epic 8: Infrastructure

```gherkin
Feature: CDK Infrastructure

  Scenario: CDK deploys all required resources
    Given the CDK stack is synthesized
    When cdk deploy is executed
    Then EventBridge rules, SQS FIFO queues, Lambda functions, DynamoDB tables, and Cognito user pool are created

  Scenario: CI/CD pipeline runs on push to main
    Given code is pushed to the main branch
    When GitHub Actions triggers
    Then unit tests, integration tests, and CDK synth run
    And Lambda functions are deployed
    And zero downtime is maintained
```

---

## Epic 9: Multi-Account Management

```gherkin
Feature: Multi-Account Management

  Scenario: Link additional AWS account
    Given the tenant is on the pro tier
    When they POST /v1/accounts with a new role ARN
    Then the account is validated and linked
    And CloudTrail events from the new account begin flowing

  Scenario: Consolidated anomaly view across accounts
    Given the tenant has 3 linked accounts
    When they GET /v1/anomalies (no account filter)
    Then anomalies from all 3 accounts are returned
    And each anomaly includes the source account ID

  Scenario: Disconnect account stops event processing
    Given account "123456789012" is linked
    When DELETE /v1/accounts/{id} is called
    Then the account is marked as "disconnecting"
    And EventBridge rules are cleaned up
    And no new events are processed from that account
```

---

## Epic 10: Transparent Factory Compliance

### Story 10.1: Atomic Flagging

```gherkin
Feature: Alert Volume Circuit Breaker

  Scenario: Circuit breaker trips at 3x baseline alert volume
    Given the baseline alert rate is 5 alerts/hour
    When 16 alerts fire in the current hour (>3x)
    Then the circuit breaker trips
    And the scoring flag is auto-disabled
    And suppressed anomalies are buffered in the DLQ

  Scenario: Fast-path alerts are exempt from circuit breaker
    Given the circuit breaker is tripped
    When a $30/hr instance triggers the cold-start fast-path
    Then the alert is still sent (fast-path bypasses breaker)

  Scenario: CI blocks expired flags
    Given a feature flag with TTL expired and rollout=100%
    When CI runs the flag audit
    Then the build fails
```

### Story 10.2: Elastic Schema

```gherkin
Feature: DynamoDB Additive Schema

  Scenario: New attribute added without breaking V1 readers
    Given a DynamoDB item has a new "anomaly_score_v2" attribute
    When V1 code reads the item
    Then deserialization succeeds (unknown attributes ignored)

  Scenario: Key schema modifications are rejected
    Given a migration attempts to change the partition key structure
    When CI runs the schema lint
    Then the build fails with "Key schema modification detected"
```

### Story 10.3: Cognitive Durability

```gherkin
Feature: Decision Logs for Scoring Changes

  Scenario: PR modifying Z-score thresholds requires decision log
    Given a PR changes files in src/scoring/
    And no decision_log.json is included
    When CI runs the decision log check
    Then the build fails

  Scenario: Cyclomatic complexity enforced
    Given a scoring function has complexity 12
    When the linter runs with threshold 10
    Then the lint fails
```

### Story 10.4: Semantic Observability

```gherkin
Feature: OTEL Spans on Scoring Decisions

  Scenario: Every anomaly scoring decision emits an OTEL span
    Given a CostEvent is processed by the Anomaly Scorer
    When scoring completes
    Then an "anomaly_scoring" span is created
    And span attributes include: cost.z_score, cost.anomaly_score, cost.baseline_days
    And cost.fast_path_triggered is set to true/false

  Scenario: Account ID is hashed in spans (PII protection)
    Given a scoring span is emitted
    When the span attributes are inspected
    Then the AWS account ID is SHA-256 hashed
    And no raw account ID appears in any attribute
```
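
The PII scenario amounts to hashing the account ID before it is attached as a span attribute. A sketch; the helper name is hypothetical:

```python
import hashlib

def span_safe_account_id(account_id: str) -> str:
    """SHA-256 the AWS account ID before attaching it as a span
    attribute, so no raw account ID reaches the telemetry backend.
    The digest is stable, so spans remain correlatable per account."""
    return hashlib.sha256(account_id.encode()).hexdigest()
```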

### Story 10.5: Configurable Autonomy (14-Day Governance)

```gherkin
Feature: 14-Day Auto-Promotion Governance

  Scenario: New account starts in strict mode (log-only)
    Given a new AWS account is connected
    When the first anomaly is detected
    Then the anomaly is logged to DynamoDB
    And NO Slack alert is sent (strict mode)

  Scenario: Account auto-promotes to audit mode after 14 days
    Given an account has been connected for 15 days
    And the false-positive rate is 5% (<10% threshold)
    When the daily governance cron runs
    Then the account mode is changed from "strict" to "audit"
    And a log entry "Auto-promoted to audit mode" is written

  Scenario: Account does NOT promote if false-positive rate is high
    Given an account has been connected for 15 days
    And the false-positive rate is 15% (>10% threshold)
    When the daily governance cron runs
    Then the account remains in "strict" mode

  Scenario: Panic mode halts all alerting in <1 second
    Given panic mode is triggered via POST /admin/panic
    When a new critical anomaly is detected
    Then NO Slack alert is sent
    And the anomaly is still scored and logged
    And the dashboard API returns "alerting paused" header

  Scenario: Panic mode requires manual clearance
    Given panic mode is active
    When 24 hours pass without manual intervention
    Then panic mode is still active (no auto-clear)
    And only POST /admin/panic with panic=false clears it
```
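
The promotion rules reduce to a small pure function the governance cron could call per account. The 14-day and 10% numbers come from the scenarios; the function name and mode strings beyond "strict"/"audit" are assumptions:

```python
PROMOTION_MIN_DAYS = 14
FP_RATE_CEILING = 0.10  # promote only below a 10% false-positive rate

def next_mode(current_mode: str, connected_days: int, fp_rate: float) -> str:
    """Daily governance decision: strict -> audit after 14 days with a
    low false-positive rate; otherwise the mode is unchanged (there is
    no automatic demotion)."""
    if (current_mode == "strict"
            and connected_days >= PROMOTION_MIN_DAYS
            and fp_rate < FP_RATE_CEILING):
        return "audit"
    return current_mode
```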

---

*End of dd0c/cost BDD Acceptance Specifications — 10 Epics, 55+ Scenarios*