diff --git a/products/01-llm-cost-router/acceptance-specs/acceptance-specs.md b/products/01-llm-cost-router/acceptance-specs/acceptance-specs.md new file mode 100644 index 0000000..771cceb --- /dev/null +++ b/products/01-llm-cost-router/acceptance-specs/acceptance-specs.md @@ -0,0 +1,685 @@ +# dd0c/route — BDD Acceptance Test Specifications + +**Phase 3: Given/When/Then per Story** +**Date:** March 1, 2026 + +--- + +## Epic 1: Proxy Engine + +### Story 1.1: OpenAI SDK Drop-In Compatibility + +```gherkin +Feature: OpenAI SDK Drop-In Compatibility + + Scenario: Non-streaming request proxied successfully + Given a valid API key "sk-test-123" exists in Redis cache + And the upstream OpenAI endpoint is healthy + When I send POST /v1/chat/completions with model "gpt-4o" and stream=false + Then the response status is 200 + And the response body matches the OpenAI ChatCompletion schema + And the response includes usage.prompt_tokens and usage.completion_tokens + + Scenario: Request with invalid API key is rejected + Given no API key "sk-invalid" exists in Redis or PostgreSQL + When I send POST /v1/chat/completions with Authorization "Bearer sk-invalid" + Then the response status is 401 + And the response body contains error "Invalid API key" + + Scenario: API key validated from Redis cache (fast path) + Given API key "sk-cached" exists in Redis cache + When I send POST /v1/chat/completions with Authorization "Bearer sk-cached" + Then the key is validated without querying PostgreSQL + And the request is forwarded to the upstream provider + + Scenario: API key falls back to PostgreSQL when Redis is unavailable + Given API key "sk-db-only" exists in PostgreSQL but NOT in Redis + When I send POST /v1/chat/completions with Authorization "Bearer sk-db-only" + Then the key is validated via PostgreSQL + And the key is written back to Redis cache for future requests +``` + +### Story 1.2: SSE Streaming Support + +```gherkin +Feature: SSE Streaming Passthrough + + Scenario: 
Streaming response is proxied chunk-by-chunk + Given a valid API key and upstream OpenAI is healthy + When I send POST /v1/chat/completions with stream=true + Then the response Content-Type is "text/event-stream" + And I receive multiple SSE chunks ending with "data: [DONE]" + And each chunk matches the OpenAI streaming delta schema + + Scenario: Client disconnects mid-stream + Given a streaming request is in progress + When the client drops the TCP connection after 5 chunks + Then the proxy aborts the upstream provider connection + And no further chunks are buffered in memory +``` + +### Story 1.3: <5ms Latency Overhead + +```gherkin +Feature: Proxy Latency Budget + + Scenario: P99 latency overhead under 5ms + Given 1000 non-streaming requests are sent sequentially + When I measure the delta between proxy response time and upstream response time + Then the P99 delta is less than 5 milliseconds + + Scenario: Telemetry emission does not block the hot path + Given TimescaleDB is experiencing 5-second write latency + When I send a request through the proxy + Then the proxy responds within the 5ms overhead budget + And the telemetry event is queued in the mpsc channel (not dropped) +``` + +### Story 1.4: Transparent Error Passthrough + +```gherkin +Feature: Provider Error Passthrough + + Scenario: Rate limit error passed through transparently + Given the upstream provider returns HTTP 429 with Retry-After header + When the proxy receives this response + Then the proxy returns HTTP 429 to the client + And the Retry-After header is preserved + + Scenario: Provider 500 error passed through + Given the upstream provider returns HTTP 500 + When the proxy receives this response + Then the proxy returns HTTP 500 to the client + And the original error body is preserved +``` + +--- + +## Epic 2: Router Brain + +### Story 2.1: Complexity Classification + +```gherkin +Feature: Request Complexity Classification + + Scenario: Simple extraction task classified as low complexity + 
Given a request with system prompt "Extract the name from this text" + And token count is under 500 + When the complexity classifier evaluates the request + Then the complexity score is "low" + And classification completes in under 2ms + + Scenario: Multi-turn reasoning classified as high complexity + Given a request with 10+ messages in the conversation + And system prompt contains "analyze", "reason", or "compare" + When the complexity classifier evaluates the request + Then the complexity score is "high" + + Scenario: Unknown pattern defaults to medium complexity + Given a request with no recognizable task pattern + When the complexity classifier evaluates the request + Then the complexity score is "medium" +``` + +### Story 2.2: Routing Rules + +```gherkin +Feature: Configurable Routing Rules + + Scenario: First-match routing rule applied + Given routing rule "if feature=classify -> cheapest from [gpt-4o-mini, claude-haiku]" + And the request includes header X-DD0C-Feature: classify + When the router evaluates the request + Then the request is routed to the cheapest model in the rule set + And the routing decision is logged with strategy "cheapest" + + Scenario: No matching rule falls through to default + Given routing rules exist but none match the request headers + When the router evaluates the request + Then the request is routed using the "passthrough" strategy + And the original model in the request body is used +``` + +### Story 2.3: Automatic Fallback + +```gherkin +Feature: Provider Fallback Chain + + Scenario: Primary provider fails, fallback succeeds + Given routing rule specifies fallback chain [openai, anthropic] + And OpenAI returns HTTP 503 + When the proxy processes the request + Then the request is retried against Anthropic + And the response is returned successfully + And the routing decision log shows "fallback triggered" + + Scenario: Circuit breaker opens after sustained failures + Given OpenAI error rate exceeds 10% over the last 60 seconds 
+ When a new request arrives targeting OpenAI + Then the circuit breaker is OPEN for OpenAI + And the request is immediately routed to the fallback provider + And no request is sent to OpenAI +``` + +### Story 2.4: Real-Time Cost Savings + +```gherkin +Feature: Cost Savings Calculation + + Scenario: Savings calculated when model is downgraded + Given the original request specified model "gpt-4o" ($15/1M input tokens) + And the router downgraded to "gpt-4o-mini" ($0.15/1M input tokens) + And the request used 1000 input tokens + When the cost calculator runs + Then cost_original is $0.015 + And cost_actual is $0.00015 + And cost_saved is $0.01485 + + Scenario: Zero savings when passthrough (no routing) + Given the request was routed with strategy "passthrough" + When the cost calculator runs + Then cost_saved is $0.00 +``` + +--- + +## Epic 3: Analytics Pipeline + +### Story 3.1: Non-Blocking Telemetry + +```gherkin +Feature: Asynchronous Telemetry Emission + + Scenario: Telemetry emitted without blocking request + Given the proxy processes a request successfully + When the response is sent to the client + Then a RequestEvent is emitted to the mpsc channel + And the channel send completes in under 1ms + + Scenario: Bounded channel drops events when full + Given the mpsc channel is at capacity (1000 events) + When a new telemetry event is emitted + Then the event is dropped (not blocking the proxy) + And a "telemetry_dropped" counter is incremented +``` + +### Story 3.2: Fast Dashboard Queries + +```gherkin +Feature: Continuous Aggregates for Dashboard + + Scenario: Hourly cost summary is pre-calculated + Given 10,000 request events exist in the last hour + When I query GET /api/dashboard/summary?period=1h + Then the response returns in under 200ms + And total_cost, total_saved, and avg_latency are present + + Scenario: Treemap data returns cost breakdown by team and model + Given request events tagged with team="payments" and team="search" + When I query GET 
/api/dashboard/treemap?period=7d + Then the response includes cost breakdowns per team per model +``` + +### Story 3.3: Automatic Data Compression + +```gherkin +Feature: TimescaleDB Compression + + Scenario: Chunks older than 7 days are compressed + Given request_events data exists from 10 days ago + When the compression policy runs + Then chunks older than 7 days are compressed by 90%+ + And compressed data is still queryable via continuous aggregates +``` + +--- + +## Epic 4: Dashboard API + +### Story 4.1: GitHub OAuth Signup + +```gherkin +Feature: GitHub OAuth Authentication + + Scenario: New user signs up via GitHub OAuth + Given the user has a valid GitHub account + When they complete the /api/auth/github OAuth flow + Then an organization is created automatically + And a JWT access token is returned + And a refresh token is stored in Redis + + Scenario: Existing user logs in via GitHub OAuth + Given the user already has an organization + When they complete the /api/auth/github OAuth flow + Then a new JWT is issued for the existing organization + And no duplicate organization is created +``` + +### Story 4.2: Routing Rules & Provider Keys CRUD + +```gherkin +Feature: Routing Rules Management + + Scenario: Create a new routing rule + Given I am authenticated as an Owner + When I POST /api/orgs/{id}/routing-rules with a valid rule body + Then the rule is created and returned with an ID + And the rule is loaded into the proxy within 60 seconds + + Scenario: Provider API key is encrypted at rest + Given I POST /api/orgs/{id}/provider-keys with key "sk-live-abc123" + When the key is stored in PostgreSQL + Then the stored value is AES-256-GCM encrypted + And decrypting with the correct KMS key returns "sk-live-abc123" +``` + +### Story 4.3: Dashboard Data Endpoints + +```gherkin +Feature: Dashboard Summary & Treemap + + Scenario: Summary endpoint returns aggregated metrics + Given request events exist for the authenticated org + When I GET 
/api/dashboard/summary?period=30d + Then the response includes total_cost, total_saved, request_count, avg_latency + + Scenario: Request inspector with filters + Given 500 request events exist for the org + When I GET /api/requests?feature=classify&limit=20 + Then 20 results are returned matching feature="classify" + And no prompt content is included in the response +``` + +### Story 4.4: API Key Revocation + +```gherkin +Feature: API Key Revocation + + Scenario: Revoked key is immediately blocked + Given API key "sk-compromised" is active + When I DELETE /api/orgs/{id}/api-keys/{key_id} + Then the key is removed from Redis cache immediately + And subsequent requests with "sk-compromised" return 401 +``` + +--- + +## Epic 5: Dashboard UI + +### Story 5.1: Cost Treemap + +```gherkin +Feature: AI Spend Treemap Visualization + + Scenario: Treemap renders cost breakdown + Given the user is logged into the dashboard + And cost data exists for teams "payments" and "search" + When the dashboard loads with period=7d + Then a treemap visualization renders + And the "payments" block is proportionally larger if it spent more +``` + +### Story 5.2: Real-Time Savings Counter + +```gherkin +Feature: Savings Counter + + Scenario: Weekly savings counter updates + Given the user is on the dashboard + And $127.50 has been saved this week + When the dashboard loads + Then the counter displays "You saved $127.50 this week" +``` + +### Story 5.3: Routing Rules Editor + +```gherkin +Feature: Routing Rules Editor UI + + Scenario: Create rule via drag-and-drop interface + Given the user navigates to Settings > Routing Rules + When they create a new rule with feature="classify" and strategy="cheapest" + And drag it to priority position 1 + Then the rule is saved via the API + And the rule list reflects the new priority order +``` + +### Story 5.4: Request Inspector + +```gherkin +Feature: Request Inspector + + Scenario: Inspect routing decision for a specific request + Given the user opens 
the Request Inspector + When they click on a specific request row + Then the detail panel shows: model_selected, model_alternatives, cost_delta, routing_strategy + And no prompt content is displayed +``` + +--- + +## Epic 6: Shadow Audit CLI + +### Story 6.1: Zero-Setup Codebase Scan + +```gherkin +Feature: Shadow Audit CLI Scan + + Scenario: Scan TypeScript project for LLM usage + Given a TypeScript project with 3 files calling openai.chat.completions.create() + When I run npx dd0c-scan in the project root + Then the CLI detects 3 LLM API call sites + And estimates monthly cost based on heuristic token counts + And displays projected savings with dd0c/route + + Scenario: Scan runs completely offline with cached pricing + Given the pricing table was cached from a previous run + And the network is unavailable + When I run npx dd0c-scan + Then the scan completes using cached pricing data + And a warning is shown that pricing may be stale +``` + +### Story 6.2: No Source Code Exfiltration + +```gherkin +Feature: Local-Only Scan + + Scenario: No source code sent to server + Given I run npx dd0c-scan without --opt-in flag + When the scan completes + Then zero HTTP requests are made to any dd0c server + And the report is rendered entirely in the terminal +``` + +### Story 6.3: Terminal Report + +```gherkin +Feature: Terminal Report Output + + Scenario: Top Opportunities report + Given the scan found 5 files with LLM calls + When the report renders + Then it shows "Top Opportunities" sorted by potential savings + And each entry includes: file path, current model, suggested model, estimated monthly savings +``` + +--- + +## Epic 7: Slack Integration + +### Story 7.1: Weekly Savings Digest + +```gherkin +Feature: Monday Morning Digest + + Scenario: Weekly digest sent via email + Given it is Monday at 9:00 AM UTC + And the org has cost data from the previous week + When the dd0c-worker cron fires + Then an email is sent via AWS SES to the org admin + And the email includes: 
total_saved, top_model_savings, week_over_week_trend + + Scenario: No digest sent if no activity + Given the org had zero requests last week + When the cron fires + Then no email is sent +``` + +### Story 7.2: Budget Alert via Slack + +```gherkin +Feature: Budget Threshold Alerts + + Scenario: Daily spend exceeds configured threshold + Given the org configured alert threshold at $100/day + And today's spend reaches $101 + When the dd0c-worker evaluates thresholds + Then a Slack webhook is fired with the alert payload + And the payload includes X-DD0C-Signature header + And last_fired_at is updated to prevent duplicate alerts + + Scenario: Alert not re-fired for same incident + Given an alert was already fired for today's threshold breach + When the worker evaluates thresholds again + Then no duplicate Slack webhook is sent +``` + +--- + +## Epic 8: Infrastructure & DevOps + +### Story 8.1: ECS Fargate Deployment + +```gherkin +Feature: Containerized Deployment + + Scenario: CDK deploys all services to ECS Fargate + Given the CDK stack is synthesized + When cdk deploy is executed + Then ECS services are created for proxy, api, and worker + And ALB routes /v1/* to proxy and /api/* to api + And CloudFront serves static dashboard assets from S3 +``` + +### Story 8.2: CI/CD Pipeline + +```gherkin +Feature: GitHub Actions CI/CD + + Scenario: Push to main triggers full pipeline + Given code is pushed to the main branch + When GitHub Actions triggers + Then tests run (unit + integration + canary suite) + And Docker images are built and pushed to ECR + And ECS services are updated with rolling deployment + And zero downtime is maintained during deployment +``` + +### Story 8.3: CloudWatch Alarms + +```gherkin +Feature: Monitoring & Alerting + + Scenario: P99 latency alarm fires + Given proxy P99 latency exceeds 50ms for 5 minutes + When CloudWatch evaluates the alarm + Then a PagerDuty incident is created +``` + +--- + +## Epic 9: Onboarding & PLG + +### Story 9.1: 
One-Click GitHub Signup + +```gherkin +Feature: Frictionless Signup + + Scenario: New user completes signup in under 60 seconds + Given the user clicks "Sign up with GitHub" + When they authorize the OAuth app + Then an org is created, an API key is generated + And the user lands on the onboarding wizard + And total elapsed time is under 60 seconds +``` + +### Story 9.2: Free Tier Enforcement + +```gherkin +Feature: Free Tier ($50/month routed spend) + + Scenario: Free tier user within limit + Given the org is on the free tier + And this month's routed spend is $45 + When a new request is proxied + Then the request is processed normally + + Scenario: Free tier user exceeds limit + Given the org is on the free tier + And this month's routed spend is $50.01 + When a new request is proxied + Then the proxy returns HTTP 429 + And the response body includes an upgrade CTA with Stripe Checkout link + + Scenario: Monthly counter resets on the 1st + Given the org used $50 last month + When the calendar rolls to the 1st of the new month + Then the Redis counter is reset to $0 + And requests are processed normally again +``` + +### Story 9.3: API Key Management + +```gherkin +Feature: API Key CRUD + + Scenario: Generate new API key + Given I am authenticated as an Owner + When I POST /api/orgs/{id}/api-keys + Then a new API key is returned in plaintext (shown once) + And the key is stored as a bcrypt hash in PostgreSQL + And the key is cached in Redis for fast auth + + Scenario: Rotate API key + Given API key "sk-old" is active + When I POST /api/orgs/{id}/api-keys/{key_id}/rotate + Then a new key "sk-new" is returned + And "sk-old" is immediately invalidated +``` + +### Story 9.4: First Route Onboarding + +```gherkin +Feature: 2-Minute First Route + + Scenario: User completes onboarding wizard + Given the user just signed up + When they copy the API key (step 1) + And paste the provided curl command (step 2) + And the request appears in the dashboard (step 3) + Then the 
onboarding is marked complete + And a PostHog event "routing.savings.first_dollar" is tracked (if savings occurred) +``` + +### Story 9.5: Team Invites + +```gherkin +Feature: Team Member Invites + + Scenario: Invite team member via email + Given I am an Owner of org "acme" + When I POST /api/orgs/{id}/invites with email "dev@acme.com" + Then an email with a magic link is sent + And the magic link JWT expires in 72 hours + + Scenario: Invited user joins existing org + Given "dev@acme.com" received an invite magic link + When they click the link and complete GitHub OAuth + Then they are added to org "acme" as a Member (not Owner) + And no new org is created +``` + +--- + +## Epic 10: Transparent Factory Compliance + +### Story 10.1: Atomic Flagging + +```gherkin +Feature: Feature Flag Infrastructure + + Scenario: New routing strategy behind a flag (default off) + Given feature flag "enable_cascading_router" exists with default=off + When a request arrives that would trigger cascading routing + Then the passthrough strategy is used instead + And the flag evaluation completes without network calls + + Scenario: Flag auto-disables on latency regression + Given flag "enable_new_classifier" is at 50% rollout + And P99 latency increased by 8% since flag was enabled + When the circuit breaker evaluates flag health + Then the flag is auto-disabled within 30 seconds + And an alert is fired + + Scenario: CI blocks deployment with expired flag + Given flag "old_experiment" has TTL expired and rollout=100% + When CI runs the flag audit + Then the build fails with "Expired flag at full rollout: old_experiment" +``` + +### Story 10.2: Elastic Schema + +```gherkin +Feature: Additive-Only Schema Migrations + + Scenario: Migration with DROP COLUMN is rejected + Given a migration file contains "ALTER TABLE request_events DROP COLUMN old_field" + When CI runs the schema lint + Then the build fails with "Destructive schema change detected" + + Scenario: V1 code ignores V2 fields + 
Given a request_events row contains a new "routing_v2" column + When V1 Rust code deserializes the row + Then deserialization succeeds (unknown fields ignored) + And no error is logged +``` + +### Story 10.3: Cognitive Durability + +```gherkin +Feature: Decision Logs for Routing Logic + + Scenario: PR touching router requires decision log + Given a PR modifies files in src/router/ + And no decision_log.json is included + When CI runs the decision log check + Then the build fails with "Decision log required for routing changes" + + Scenario: Cyclomatic complexity exceeds cap + Given a function in src/router/ has cyclomatic complexity 12 + When cargo clippy runs with cognitive_complexity threshold 10 + Then the lint fails +``` + +### Story 10.4: Semantic Observability + +```gherkin +Feature: AI Reasoning Spans + + Scenario: Routing decision emits OTEL span + Given a request is routed from gpt-4o to gpt-4o-mini + When the routing decision completes + Then an "ai_routing_decision" span is created + And span attributes include: ai.model_selected, ai.cost_delta, ai.complexity_score + And ai.prompt_hash is a SHA-256 hash (not raw content) + + Scenario: No PII in any span + Given a request with user email in the prompt + When the span is emitted + Then no span attribute contains the email address + And ai.prompt_hash is the only prompt-related attribute +``` + +### Story 10.5: Configurable Autonomy + +```gherkin +Feature: Governance Policy + + Scenario: Strict mode blocks auto-applied config changes + Given governance_mode is "strict" + When the background task attempts to refresh routing rules + Then the refresh is blocked + And a log entry "Blocked by strict mode" is written + + Scenario: Panic mode freezes to last-known-good + Given panic_mode is triggered via POST /admin/panic + When a new request arrives + Then routing uses the frozen last-known-good configuration + And auto-failover is disabled + And the response header includes "X-DD0C-Panic: active" +``` + +--- + 
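The savings arithmetic in Story 2.4 is worth pinning down, since floating-point drift matters at sub-cent magnitudes. A minimal sketch (illustrative Python, not the dd0c implementation) using the per-million-token prices stated in that story:

```python
from decimal import Decimal

def cost_usd(tokens: int, price_per_million: str) -> Decimal:
    # Decimal keeps sub-cent amounts exact; float would drift here.
    return Decimal(tokens) * Decimal(price_per_million) / Decimal(1_000_000)

# Prices posited in Story 2.4: gpt-4o $15/1M input, gpt-4o-mini $0.15/1M.
cost_original = cost_usd(1000, "15.00")   # $0.015
cost_actual = cost_usd(1000, "0.15")      # $0.00015
cost_saved = cost_original - cost_actual  # $0.01485
```

The three values match the expected outcomes of the "Savings calculated when model is downgraded" scenario exactly, which is only guaranteed with decimal (not binary float) arithmetic.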
+*End of dd0c/route BDD Acceptance Specifications — 10 Epics, 50+ Scenarios* diff --git a/products/05-aws-cost-anomaly/acceptance-specs/acceptance-specs.md b/products/05-aws-cost-anomaly/acceptance-specs/acceptance-specs.md new file mode 100644 index 0000000..477985c --- /dev/null +++ b/products/05-aws-cost-anomaly/acceptance-specs/acceptance-specs.md @@ -0,0 +1,610 @@ +# dd0c/cost — BDD Acceptance Test Specifications + +**Phase 3: Given/When/Then per Story** +**Date:** March 1, 2026 + +--- + +## Epic 1: CloudTrail Ingestion + +### Story 1.1: EventBridge Cross-Account Rules + +```gherkin +Feature: CloudTrail Event Ingestion + + Scenario: EC2 RunInstances event ingested successfully + Given a customer AWS account has EventBridge rules forwarding to dd0c + When a RunInstances CloudTrail event fires in the customer account + Then the event arrives in the dd0c SQS FIFO queue within 5 seconds + And the event contains userIdentity, eventSource, and requestParameters + + Scenario: Non-cost-generating event is filtered at EventBridge + Given EventBridge rules filter for cost-relevant API calls only + When a DescribeInstances CloudTrail event fires + Then the event is NOT forwarded to the SQS queue +``` + +### Story 1.2: SQS FIFO Deduplication + +```gherkin +Feature: CloudTrail Event Deduplication + + Scenario: Duplicate CloudTrail event is deduplicated by SQS FIFO + Given a RunInstances event with eventID "evt-abc-123" was already processed + When the same event arrives again (CloudTrail duplicate delivery) + Then SQS FIFO deduplicates based on the eventID + And only one CostEvent is written to DynamoDB + + Scenario: Same resource type from different events is allowed + Given two different RunInstances events for the same instance type + But with different eventIDs + When both arrive in the SQS queue + Then both are processed as separate CostEvents +``` + +### Story 1.3: Lambda Normalizer + +```gherkin +Feature: CloudTrail Event Normalization + + Scenario: EC2 RunInstances 
normalized to CostEvent schema + Given a raw CloudTrail RunInstances event + When the Lambda normalizer processes it + Then a CostEvent is created with: service="ec2", instanceType="m5.xlarge", region, actor ARN, hourly cost + And the hourly cost is looked up from the static pricing table + + Scenario: RDS CreateDBInstance normalized correctly + Given a raw CloudTrail CreateDBInstance event + When the Lambda normalizer processes it + Then a CostEvent is created with: service="rds", engine, storageType, multiAZ flag + + Scenario: Unknown instance type uses fallback pricing + Given a CloudTrail event with instanceType "z99.mega" not in the pricing table + When the normalizer looks up pricing + Then fallback pricing of $0 is applied + And a warning is logged: "Unknown instance type: z99.mega" + + Scenario: Malformed CloudTrail JSON does not crash the Lambda + Given a CloudTrail event with invalid JSON structure + When the Lambda normalizer processes it + Then the event is sent to the DLQ + And no CostEvent is written + And an error metric is emitted + + Scenario: Batched RunInstances (multiple instances) creates multiple CostEvents + Given a RunInstances event launching 5 instances simultaneously + When the normalizer processes it + Then 5 separate CostEvents are created (one per instance) +``` + +### Story 1.4: Cost Precision + +```gherkin +Feature: Cost Calculation Precision + + Scenario: Hourly cost calculated with 4 decimal places + Given instance type "m5.xlarge" costs $0.192/hr + When the normalizer calculates hourly cost + Then the cost is stored as 0.1920 (4 decimal places) + + Scenario: Sub-cent Lambda costs handled correctly + Given a Lambda invocation costing $0.0000002 per request + When the normalizer calculates cost + Then the cost is stored without floating-point precision loss +``` + +--- + +## Epic 2: Anomaly Detection + +### Story 2.1: Z-Score Calculation + +```gherkin +Feature: Z-Score Anomaly Scoring + + Scenario: Event matching baseline mean 
scores zero + Given a baseline with mean=$1.25/hr and stddev=$0.15 + When a CostEvent arrives with hourly cost $1.25 + Then the Z-score is 0.0 + And the severity is "info" + + Scenario: Event 3 standard deviations above mean scores critical + Given a baseline with mean=$1.25/hr and stddev=$0.15 + When a CostEvent arrives with hourly cost $1.70 (Z=3.0) + Then the Z-score is 3.0 + And the severity is "critical" + + Scenario: Zero standard deviation does not cause division by zero + Given a baseline with mean=$1.00/hr and stddev=$0.00 (all identical observations) + When a CostEvent arrives with hourly cost $1.50 + Then the scorer handles the zero-stddev case gracefully + And the severity is based on absolute cost delta, not Z-score + + Scenario: Single data point baseline (stddev undefined) + Given a baseline with only 1 observation + When a CostEvent arrives + Then the scorer uses the cold-start fast-path instead of Z-score +``` + +### Story 2.2: Novelty Scoring + +```gherkin +Feature: Actor and Instance Novelty Detection + + Scenario: New IAM role triggers actor novelty penalty + Given the baseline has observed actors: ["arn:aws:iam::123:role/ci"] + When a CostEvent arrives from actor "arn:aws:iam::123:role/unknown-script" + Then an actor novelty penalty is added to the composite score + + Scenario: Known actor and known instance type has no novelty penalty + Given the baseline has observed actors and instance types matching the event + When the CostEvent is scored + Then the novelty component is 0 +``` + +### Story 2.3: Cold-Start Fast Path + +```gherkin +Feature: Cold-Start Fast Path for Expensive Resources + + Scenario: $25/hr instance triggers immediate critical (bypasses baseline) + Given the account baseline has fewer than 14 days of data + When a CostEvent arrives for a p4d.24xlarge at $32.77/hr + Then the fast-path triggers immediately + And severity is "critical" regardless of baseline state + + Scenario: $0.10/hr instance ignored during cold-start + Given 
the account baseline has fewer than 14 days of data + When a CostEvent arrives for a t3.nano at $0.0052/hr + Then the fast-path does NOT trigger + And the event is scored normally (likely "info") + + Scenario: Fast-path transitions to statistical scoring at maturity + Given the account baseline has 20+ events AND 14+ days + When a CostEvent arrives for a $10/hr instance + Then the Z-score path is used (not fast-path) +``` + +### Story 2.4: Composite Scoring + +```gherkin +Feature: Composite Severity Classification + + Scenario: Composite score below 30 classified as info + Given Z-score contribution is 15 and novelty contribution is 10 + When the composite score is calculated (25) + Then severity is "info" + + Scenario: Composite score above 60 classified as critical + Given Z-score contribution is 45 and novelty contribution is 20 + When the composite score is calculated (65) + Then severity is "critical" + + Scenario: Events below $0.50/hr never classified as critical + Given a CostEvent with hourly cost $0.30 and Z-score of 5.0 + When the composite score is calculated + Then severity is capped at "warning" (never critical for cheap resources) +``` + +### Story 2.5: Feedback Loop + +```gherkin +Feature: Mark-as-Expected Feedback + + Scenario: Marking an anomaly as expected reduces future scores + Given an anomaly was triggered for actor "arn:aws:iam::123:role/batch-job" + When the user clicks "Mark as Expected" in Slack + Then the actor is added to the expected_actors list + And future events from this actor receive a reduced score + + Scenario: Expected actor still flagged if cost is 10x above baseline + Given actor "batch-job" is in the expected_actors list + And the baseline mean is $1.00/hr + When a CostEvent arrives from "batch-job" at $15.00/hr + Then the anomaly is still flagged (10x override) +``` + +--- + +## Epic 3: Zombie Hunter + +```gherkin +Feature: Idle Resource Detection + + Scenario: EC2 instance running 7+ days with <5% CPU detected as zombie + 
Given an EC2 instance "i-abc123" has been running for 10 days + And CloudWatch average CPU utilization is 2% + When the daily Zombie Hunter scan runs + Then "i-abc123" is flagged as a zombie + And cumulative waste cost is calculated (10 days × hourly rate) + + Scenario: RDS instance with 0 connections for 3+ days detected + Given an RDS instance "prod-db-unused" has 0 connections for 5 days + When the Zombie Hunter scan runs + Then "prod-db-unused" is flagged as a zombie + + Scenario: Unattached EBS volume older than 7 days detected + Given an EBS volume "vol-xyz" has been unattached for 14 days + When the Zombie Hunter scan runs + Then "vol-xyz" is flagged as a zombie + And the waste cost includes the daily storage charge + + Scenario: Instance tagged dd0c:ignore is excluded + Given an EC2 instance has tag "dd0c:ignore=true" + When the Zombie Hunter scan runs + Then the instance is skipped + + Scenario: Zombie scan respects read-only permissions + Given the dd0c IAM role has only read permissions + When the Zombie Hunter scan runs + Then no modify/stop/terminate API calls are made +``` + +--- + +## Epic 4: Notification & Remediation + +### Story 4.1: Slack Block Kit Alerts + +```gherkin +Feature: Slack Anomaly Alerts + + Scenario: Critical anomaly sends Slack alert with Block Kit + Given an anomaly with severity "critical" for a p4d.24xlarge in us-east-1 + When the notification engine processes it + Then a Slack message is sent with Block Kit formatting + And the message includes: resource type, region, hourly cost, actor, "Why this alert" section + And the message includes buttons: Snooze 1h, Snooze 24h, Mark Expected, Stop Instance + + Scenario: Info-level anomalies are NOT sent to Slack + Given an anomaly with severity "info" + When the notification engine processes it + Then no Slack message is sent + And the anomaly is only logged to DynamoDB +``` + +### Story 4.2: Interactive Remediation + +```gherkin +Feature: One-Click Remediation via Slack + + Scenario: 
User clicks "Stop Instance" and remediation succeeds + Given a Slack alert for EC2 instance "i-abc123" in account "123456789012" + When the user clicks "Stop Instance" + Then the system assumes the cross-account remediation role via STS + And ec2:StopInstances is called for "i-abc123" + And the Slack message is updated to "Remediation Successful ✅" + + Scenario: Cross-account IAM role has been revoked + Given the customer revoked the dd0c remediation IAM role + When the user clicks "Stop Instance" + Then STS AssumeRole fails + And the Slack message is updated to "Remediation Failed: IAM role not found" + And no instance is stopped + + Scenario: Snooze suppresses alerts for the specified duration + Given the user clicks "Snooze 24h" on an anomaly + When a similar anomaly fires within 24 hours + Then no Slack alert is sent + And the anomaly is still logged to DynamoDB + + Scenario: Mark Expected updates the baseline + Given the user clicks "Mark Expected" + When the feedback is processed + Then the actor is added to expected_actors + And the false-positive counter is incremented for governance scoring +``` + +### Story 4.3: Slack Signature Validation + +```gherkin +Feature: Slack Interactive Payload Security + + Scenario: Valid Slack signature accepted + Given a Slack interactive payload with correct HMAC-SHA256 signature + When the API receives the payload + Then the action is processed + + Scenario: Invalid signature rejected + Given a payload with an incorrect X-Slack-Signature header + When the API receives the payload + Then HTTP 401 is returned + And no action is taken + + Scenario: Expired timestamp rejected (replay attack) + Given a payload with X-Slack-Request-Timestamp older than 5 minutes + When the API receives the payload + Then HTTP 401 is returned +``` + +### Story 4.4: Daily Digest + +```gherkin +Feature: Daily Anomaly Digest + + Scenario: Daily digest aggregates 24h of anomalies + Given 15 anomalies were detected in the last 24 hours + When the daily 
digest cron fires + Then a Slack message is sent with: total anomalies, top 3 costliest, total estimated spend, zombie count + + Scenario: No digest sent if zero anomalies + Given zero anomalies in the last 24 hours + When the daily digest cron fires + Then no Slack message is sent +``` + +--- + +## Epic 5: Onboarding & PLG + +```gherkin +Feature: AWS Account Onboarding + + Scenario: CloudFormation quick-create deploys IAM role + Given the user clicks the quick-create URL + When CloudFormation deploys the stack + Then an IAM role is created with the correct permissions and ExternalId + And an EventBridge rule is created for cost-relevant CloudTrail events + + Scenario: Role validation succeeds with correct ExternalId + Given the IAM role exists with ExternalId "dd0c-tenant-123" + When POST /v1/accounts validates the role + Then STS AssumeRole succeeds + And the account is marked as "active" + And the initial Zombie Hunter scan is triggered + + Scenario: Role validation fails with wrong ExternalId + Given the IAM role exists but ExternalId does not match + When POST /v1/accounts validates the role + Then STS AssumeRole fails with AccessDenied + And a clear error message is returned + + Scenario: Free tier allows 1 account + Given the tenant is on the free tier with 0 connected accounts + When they connect their first AWS account + Then the connection succeeds + + Scenario: Free tier rejects 2nd account + Given the tenant is on the free tier with 1 connected account + When they attempt to connect a second account + Then HTTP 403 is returned with an upgrade prompt + + Scenario: Stripe upgrade unlocks multiple accounts + Given the tenant upgrades via Stripe Checkout + When checkout.session.completed webhook fires + Then the tenant tier is updated to "pro" + And they can connect additional accounts + + Scenario: Stripe webhook signature validated + Given a Stripe webhook payload + When the signature does not match the signing secret + Then HTTP 401 is returned +``` + 
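The webhook signature scenarios above (Story 4.3 for Slack, and the Stripe webhook check in this epic) share one HMAC pattern: recompute the signature from a shared secret and the raw request body, compare in constant time, and reject stale timestamps. A minimal Python sketch of Slack's documented v0 signing scheme — the function name and the `now`/`tolerance` parameters are illustrative, not the product's actual code:

```python
import hashlib
import hmac
import time
from typing import Optional

def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, now: Optional[float] = None,
                           tolerance: int = 300) -> bool:
    """Validate an X-Slack-Signature header against the raw request body."""
    now = time.time() if now is None else now
    # Reject timestamps older than `tolerance` seconds (replay protection,
    # matching the "Expired timestamp rejected" scenario).
    if abs(now - int(timestamp)) > tolerance:
        return False
    # Slack signs the string "v0:{timestamp}:{body}" with HMAC-SHA256.
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), base,
                                hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature)
```

Stripe's `Stripe-Signature` header follows an analogous scheme (an HMAC-SHA256 over `"{timestamp}.{body}"`), so the same three cases — valid, tampered, and replayed — cover both endpoints.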
+--- + +## Epic 6: Dashboard API + +```gherkin +Feature: Dashboard API with Tenant Isolation + + Scenario: GET /v1/anomalies returns anomalies for authenticated tenant + Given tenant "t-123" has 50 anomalies + And tenant "t-456" has 30 anomalies + When tenant "t-123" calls GET /v1/anomalies + Then only tenant "t-123" anomalies are returned + And zero anomalies from "t-456" are included + + Scenario: Cursor-based pagination + Given tenant "t-123" has 200 anomalies + When GET /v1/anomalies?limit=20 is called + Then 20 results are returned with a cursor token + And calling with the cursor returns the next 20 + + Scenario: Filter by severity + Given anomalies exist with severity "critical", "warning", and "info" + When GET /v1/anomalies?severity=critical is called + Then only critical anomalies are returned + + Scenario: Baseline override + Given tenant "t-123" wants to adjust sensitivity for EC2 m5.xlarge + When PATCH /v1/accounts/{id}/baselines/ec2/m5.xlarge with sensitivity=0.8 + Then the baseline sensitivity is updated + And future scoring uses the adjusted threshold + + Scenario: Missing Cognito JWT returns 401 + Given no Authorization header is present + When any API endpoint is called + Then HTTP 401 is returned +``` + +--- + +## Epic 7: Dashboard UI + +```gherkin +Feature: Dashboard UI + + Scenario: Anomaly feed renders with severity badges + Given the user is logged in + When the dashboard loads + Then anomalies are listed with color-coded severity badges (red/yellow/blue) + + Scenario: Baseline learning progress displayed + Given an account connected 5 days ago with 30 events + When the user views the account detail + Then a progress bar shows "5/14 days, 30/20 events" toward maturity + + Scenario: Zombie resource list shows waste estimate + Given 3 zombie resources were detected + When the user views the Zombie Hunter tab + Then each zombie shows: resource type, age, cumulative waste cost +``` + +--- + +## Epic 8: Infrastructure + +```gherkin +Feature: CDK 
Infrastructure + + Scenario: CDK deploys all required resources + Given the CDK stack is synthesized + When cdk deploy is executed + Then EventBridge rules, SQS FIFO queues, Lambda functions, DynamoDB tables, and Cognito user pool are created + + Scenario: CI/CD pipeline runs on push to main + Given code is pushed to the main branch + When GitHub Actions triggers + Then unit tests, integration tests, and CDK synth run + And Lambda functions are deployed + And zero downtime is maintained +``` + +--- + +## Epic 9: Multi-Account Management + +```gherkin +Feature: Multi-Account Management + + Scenario: Link additional AWS account + Given the tenant is on the pro tier + When they POST /v1/accounts with a new role ARN + Then the account is validated and linked + And CloudTrail events from the new account begin flowing + + Scenario: Consolidated anomaly view across accounts + Given the tenant has 3 linked accounts + When they GET /v1/anomalies (no account filter) + Then anomalies from all 3 accounts are returned + And each anomaly includes the source account ID + + Scenario: Disconnect account stops event processing + Given account "123456789012" is linked + When DELETE /v1/accounts/{id} is called + Then the account is marked as "disconnecting" + And EventBridge rules are cleaned up + And no new events are processed from that account +``` + +--- + +## Epic 10: Transparent Factory Compliance + +### Story 10.1: Atomic Flagging + +```gherkin +Feature: Alert Volume Circuit Breaker + + Scenario: Circuit breaker trips at 3x baseline alert volume + Given the baseline alert rate is 5 alerts/hour + When 16 alerts fire in the current hour (>3x) + Then the circuit breaker trips + And the scoring flag is auto-disabled + And suppressed anomalies are buffered in the DLQ + + Scenario: Fast-path alerts are exempt from circuit breaker + Given the circuit breaker is tripped + When a $30/hr instance triggers the cold-start fast-path + Then the alert is still sent (fast-path bypasses 
breaker) + + Scenario: CI blocks expired flags + Given a feature flag with TTL expired and rollout=100% + When CI runs the flag audit + Then the build fails +``` + +### Story 10.2: Elastic Schema + +```gherkin +Feature: DynamoDB Additive Schema + + Scenario: New attribute added without breaking V1 readers + Given a DynamoDB item has a new "anomaly_score_v2" attribute + When V1 code reads the item + Then deserialization succeeds (unknown attributes ignored) + + Scenario: Key schema modifications are rejected + Given a migration attempts to change the partition key structure + When CI runs the schema lint + Then the build fails with "Key schema modification detected" +``` + +### Story 10.3: Cognitive Durability + +```gherkin +Feature: Decision Logs for Scoring Changes + + Scenario: PR modifying Z-score thresholds requires decision log + Given a PR changes files in src/scoring/ + And no decision_log.json is included + When CI runs the decision log check + Then the build fails + + Scenario: Cyclomatic complexity enforced + Given a scoring function has complexity 12 + When the linter runs with threshold 10 + Then the lint fails +``` + +### Story 10.4: Semantic Observability + +```gherkin +Feature: OTEL Spans on Scoring Decisions + + Scenario: Every anomaly scoring decision emits an OTEL span + Given a CostEvent is processed by the Anomaly Scorer + When scoring completes + Then an "anomaly_scoring" span is created + And span attributes include: cost.z_score, cost.anomaly_score, cost.baseline_days + And cost.fast_path_triggered is set to true/false + + Scenario: Account ID is hashed in spans (PII protection) + Given a scoring span is emitted + When the span attributes are inspected + Then the AWS account ID is SHA-256 hashed + And no raw account ID appears in any attribute +``` + +### Story 10.5: Configurable Autonomy (14-Day Governance) + +```gherkin +Feature: 14-Day Auto-Promotion Governance + + Scenario: New account starts in strict mode (log-only) + Given a new AWS 
account is connected + When the first anomaly is detected + Then the anomaly is logged to DynamoDB + And NO Slack alert is sent (strict mode) + + Scenario: Account auto-promotes to audit mode after 14 days + Given an account has been connected for 15 days + And the false-positive rate is 5% (<10% threshold) + When the daily governance cron runs + Then the account mode is changed from "strict" to "audit" + And a log entry "Auto-promoted to audit mode" is written + + Scenario: Account does NOT promote if false-positive rate is high + Given an account has been connected for 15 days + And the false-positive rate is 15% (>10% threshold) + When the daily governance cron runs + Then the account remains in "strict" mode + + Scenario: Panic mode halts all alerting in <1 second + Given panic mode is triggered via POST /admin/panic + When a new critical anomaly is detected + Then NO Slack alert is sent + And the anomaly is still scored and logged + And the dashboard API returns "alerting paused" header + + Scenario: Panic mode requires manual clearance + Given panic mode is active + When 24 hours pass without manual intervention + Then panic mode is still active (no auto-clear) + And only POST /admin/panic with panic=false clears it +``` + +--- + +*End of dd0c/cost BDD Acceptance Specifications — 10 Epics, 55+ Scenarios*