Phase 3: BDD acceptance specs for P1 (route) and P5 (cost)

P1: 50+ scenarios across 10 epics, all stories covered. P5: 55+ scenarios across 10 epics, written manually (Sonnet credential failures). Remaining P2/P3/P4/P6 are in progress via subagents.

`products/01-llm-cost-router/acceptance-specs/acceptance-specs.md` (new file, 685 lines)
# dd0c/route — BDD Acceptance Test Specifications

**Phase 3: Given/When/Then per Story**

**Date:** March 1, 2026

---

## Epic 1: Proxy Engine

### Story 1.1: OpenAI SDK Drop-In Compatibility

```gherkin
Feature: OpenAI SDK Drop-In Compatibility

  Scenario: Non-streaming request proxied successfully
    Given a valid API key "sk-test-123" exists in Redis cache
    And the upstream OpenAI endpoint is healthy
    When I send POST /v1/chat/completions with model "gpt-4o" and stream=false
    Then the response status is 200
    And the response body matches the OpenAI ChatCompletion schema
    And the response includes usage.prompt_tokens and usage.completion_tokens

  Scenario: Request with invalid API key is rejected
    Given no API key "sk-invalid" exists in Redis or PostgreSQL
    When I send POST /v1/chat/completions with Authorization "Bearer sk-invalid"
    Then the response status is 401
    And the response body contains error "Invalid API key"

  Scenario: API key validated from Redis cache (fast path)
    Given API key "sk-cached" exists in Redis cache
    When I send POST /v1/chat/completions with Authorization "Bearer sk-cached"
    Then the key is validated without querying PostgreSQL
    And the request is forwarded to the upstream provider

  Scenario: API key falls back to PostgreSQL on Redis cache miss
    Given API key "sk-db-only" exists in PostgreSQL but NOT in Redis
    When I send POST /v1/chat/completions with Authorization "Bearer sk-db-only"
    Then the key is validated via PostgreSQL
    And the key is written back to Redis cache for future requests
```
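
The two-tier lookup above (Redis fast path, PostgreSQL fallback with cache write-back) can be sketched as follows. The dict-backed stores are hypothetical stand-ins for the real Redis and PostgreSQL clients, used only to illustrate the control flow.

```python
# Sketch of the two-tier API key validation described above.
# Plain dicts stand in for the Redis cache and the PostgreSQL key table.

def validate_key(key: str, redis: dict, postgres: dict) -> bool:
    """Return True if the key is valid; check Redis first, then PostgreSQL."""
    if key in redis:                 # fast path: no database round-trip
        return True
    if key in postgres:              # cache miss: fall back to the database
        redis[key] = postgres[key]   # write back so future requests hit Redis
        return True
    return False                     # unknown key -> caller returns 401

redis, pg = {}, {"sk-db-only": {"org": "acme"}}
assert validate_key("sk-db-only", redis, pg)   # validated via PostgreSQL
assert "sk-db-only" in redis                   # written back to the cache
assert not validate_key("sk-invalid", redis, pg)
```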

### Story 1.2: SSE Streaming Support

```gherkin
Feature: SSE Streaming Passthrough

  Scenario: Streaming response is proxied chunk-by-chunk
    Given a valid API key and upstream OpenAI is healthy
    When I send POST /v1/chat/completions with stream=true
    Then the response Content-Type is "text/event-stream"
    And I receive multiple SSE chunks ending with "data: [DONE]"
    And each chunk matches the OpenAI streaming delta schema

  Scenario: Client disconnects mid-stream
    Given a streaming request is in progress
    When the client drops the TCP connection after 5 chunks
    Then the proxy aborts the upstream provider connection
    And no further chunks are buffered in memory
```

### Story 1.3: <5ms Latency Overhead

```gherkin
Feature: Proxy Latency Budget

  Scenario: P99 latency overhead under 5ms
    Given 1000 non-streaming requests are sent sequentially
    When I measure the delta between proxy response time and upstream response time
    Then the P99 delta is less than 5 milliseconds

  Scenario: Telemetry emission does not block the hot path
    Given TimescaleDB is experiencing 5-second write latency
    When I send a request through the proxy
    Then the proxy responds within the 5ms overhead budget
    And the telemetry event is queued in the mpsc channel (not dropped)
```

### Story 1.4: Transparent Error Passthrough

```gherkin
Feature: Provider Error Passthrough

  Scenario: Rate limit error passed through transparently
    Given the upstream provider returns HTTP 429 with Retry-After header
    When the proxy receives this response
    Then the proxy returns HTTP 429 to the client
    And the Retry-After header is preserved

  Scenario: Provider 500 error passed through
    Given the upstream provider returns HTTP 500
    When the proxy receives this response
    Then the proxy returns HTTP 500 to the client
    And the original error body is preserved
```

---

## Epic 2: Router Brain

### Story 2.1: Complexity Classification

```gherkin
Feature: Request Complexity Classification

  Scenario: Simple extraction task classified as low complexity
    Given a request with system prompt "Extract the name from this text"
    And token count is under 500
    When the complexity classifier evaluates the request
    Then the complexity score is "low"
    And classification completes in under 2ms

  Scenario: Multi-turn reasoning classified as high complexity
    Given a request with 10+ messages in the conversation
    And system prompt contains "analyze", "reason", or "compare"
    When the complexity classifier evaluates the request
    Then the complexity score is "high"

  Scenario: Unknown pattern defaults to medium complexity
    Given a request with no recognizable task pattern
    When the complexity classifier evaluates the request
    Then the complexity score is "medium"
```
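
A minimal heuristic classifier matching the three scenarios above might look like this. The keyword list, the `extract` prefix check, and the exact thresholds are illustrative assumptions; only the 10-message, 500-token, and keyword signals come from the spec.

```python
# Heuristic complexity classifier sketch for the scenarios above.
# Keyword list and prefix check are illustrative assumptions.

REASONING_KEYWORDS = ("analyze", "reason", "compare")

def classify(messages: list[dict], token_count: int) -> str:
    # Pull the first system prompt, if any, lowercased for matching.
    system = next((m["content"].lower() for m in messages if m["role"] == "system"), "")
    if len(messages) >= 10 and any(k in system for k in REASONING_KEYWORDS):
        return "high"        # long multi-turn conversation with reasoning cues
    if token_count < 500 and system.startswith("extract"):
        return "low"         # short, simple extraction task
    return "medium"          # unknown patterns default to medium

assert classify([{"role": "system", "content": "Extract the name from this text"}], 200) == "low"
msgs = [{"role": "system", "content": "Please analyze these options"}] + \
       [{"role": "user", "content": "..."}] * 9
assert classify(msgs, 4000) == "high"
assert classify([{"role": "user", "content": "hi"}], 50) == "medium"
```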

### Story 2.2: Routing Rules

```gherkin
Feature: Configurable Routing Rules

  Scenario: First-match routing rule applied
    Given routing rule "if feature=classify -> cheapest from [gpt-4o-mini, claude-haiku]"
    And the request includes header X-DD0C-Feature: classify
    When the router evaluates the request
    Then the request is routed to the cheapest model in the rule set
    And the routing decision is logged with strategy "cheapest"

  Scenario: No matching rule falls through to default
    Given routing rules exist but none match the request headers
    When the router evaluates the request
    Then the request is routed using the "passthrough" strategy
    And the original model in the request body is used
```

### Story 2.3: Automatic Fallback

```gherkin
Feature: Provider Fallback Chain

  Scenario: Primary provider fails, fallback succeeds
    Given routing rule specifies fallback chain [openai, anthropic]
    And OpenAI returns HTTP 503
    When the proxy processes the request
    Then the request is retried against Anthropic
    And the response is returned successfully
    And the routing decision log shows "fallback triggered"

  Scenario: Circuit breaker opens after sustained failures
    Given OpenAI error rate exceeds 10% over the last 60 seconds
    When a new request arrives targeting OpenAI
    Then the circuit breaker is OPEN for OpenAI
    And the request is immediately routed to the fallback provider
    And no request is sent to OpenAI
```

### Story 2.4: Real-Time Cost Savings

```gherkin
Feature: Cost Savings Calculation

  Scenario: Savings calculated when model is downgraded
    Given the original request specified model "gpt-4o" ($15/1M input tokens)
    And the router downgraded to "gpt-4o-mini" ($0.15/1M input tokens)
    And the request used 1000 input tokens
    When the cost calculator runs
    Then cost_original is $0.015
    And cost_actual is $0.00015
    And cost_saved is $0.01485

  Scenario: Zero savings when passthrough (no routing)
    Given the request was routed with strategy "passthrough"
    When the cost calculator runs
    Then cost_saved is $0.00
```
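
The arithmetic in the downgrade scenario checks out exactly; a sketch using `Decimal` (to avoid binary-float drift on per-token prices) reproduces the three expected values. The `cost` helper is an illustrative function, not the product's actual calculator.

```python
# The savings arithmetic from the scenario above, with Decimal to keep
# per-token prices exact.
from decimal import Decimal

def cost(tokens: int, price_per_million: Decimal) -> Decimal:
    return tokens * price_per_million / Decimal(1_000_000)

cost_original = cost(1000, Decimal("15"))    # gpt-4o at $15/1M input tokens
cost_actual = cost(1000, Decimal("0.15"))    # gpt-4o-mini at $0.15/1M
cost_saved = cost_original - cost_actual

assert cost_original == Decimal("0.015")
assert cost_actual == Decimal("0.00015")
assert cost_saved == Decimal("0.01485")
```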

---

## Epic 3: Analytics Pipeline

### Story 3.1: Non-Blocking Telemetry

```gherkin
Feature: Asynchronous Telemetry Emission

  Scenario: Telemetry emitted without blocking request
    Given the proxy processes a request successfully
    When the response is sent to the client
    Then a RequestEvent is emitted to the mpsc channel
    And the channel send completes in under 1ms

  Scenario: Bounded channel drops events when full
    Given the mpsc channel is at capacity (1000 events)
    When a new telemetry event is emitted
    Then the event is dropped (not blocking the proxy)
    And a "telemetry_dropped" counter is incremented
```
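
The drop-on-full behaviour above can be sketched with a bounded queue; here `queue.Queue` stands in for the Rust mpsc channel, and the class shape is illustrative rather than the real telemetry API.

```python
# Drop-on-full telemetry sketch mirroring the bounded-channel scenario above.
# queue.Queue stands in for the Rust mpsc channel.
import queue

class Telemetry:
    def __init__(self, capacity: int = 1000):
        self.channel = queue.Queue(maxsize=capacity)
        self.dropped = 0  # the "telemetry_dropped" counter

    def emit(self, event: dict) -> None:
        try:
            self.channel.put_nowait(event)  # never blocks the hot path
        except queue.Full:
            self.dropped += 1               # drop rather than block the proxy

t = Telemetry(capacity=2)
for i in range(3):
    t.emit({"request_id": i})
assert t.channel.qsize() == 2  # channel held at capacity
assert t.dropped == 1          # third event dropped and counted
```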

### Story 3.2: Fast Dashboard Queries

```gherkin
Feature: Continuous Aggregates for Dashboard

  Scenario: Hourly cost summary is pre-calculated
    Given 10,000 request events exist in the last hour
    When I query GET /api/dashboard/summary?period=1h
    Then the response returns in under 200ms
    And total_cost, total_saved, and avg_latency are present

  Scenario: Treemap data returns cost breakdown by team and model
    Given request events tagged with team="payments" and team="search"
    When I query GET /api/dashboard/treemap?period=7d
    Then the response includes cost breakdowns per team per model
```

### Story 3.3: Automatic Data Compression

```gherkin
Feature: TimescaleDB Compression

  Scenario: Chunks older than 7 days are compressed
    Given request_events data exists from 10 days ago
    When the compression policy runs
    Then chunks older than 7 days are compressed by 90%+
    And compressed data is still queryable via continuous aggregates
```

---

## Epic 4: Dashboard API

### Story 4.1: GitHub OAuth Signup

```gherkin
Feature: GitHub OAuth Authentication

  Scenario: New user signs up via GitHub OAuth
    Given the user has a valid GitHub account
    When they complete the /api/auth/github OAuth flow
    Then an organization is created automatically
    And a JWT access token is returned
    And a refresh token is stored in Redis

  Scenario: Existing user logs in via GitHub OAuth
    Given the user already has an organization
    When they complete the /api/auth/github OAuth flow
    Then a new JWT is issued for the existing organization
    And no duplicate organization is created
```

### Story 4.2: Routing Rules & Provider Keys CRUD

```gherkin
Feature: Routing Rules Management

  Scenario: Create a new routing rule
    Given I am authenticated as an Owner
    When I POST /api/orgs/{id}/routing-rules with a valid rule body
    Then the rule is created and returned with an ID
    And the rule is loaded into the proxy within 60 seconds

  Scenario: Provider API key is encrypted at rest
    Given I POST /api/orgs/{id}/provider-keys with key "sk-live-abc123"
    When the key is stored in PostgreSQL
    Then the stored value is AES-256-GCM encrypted
    And decrypting with the correct KMS key returns "sk-live-abc123"
```

### Story 4.3: Dashboard Data Endpoints

```gherkin
Feature: Dashboard Summary & Treemap

  Scenario: Summary endpoint returns aggregated metrics
    Given request events exist for the authenticated org
    When I GET /api/dashboard/summary?period=30d
    Then the response includes total_cost, total_saved, request_count, avg_latency

  Scenario: Request inspector with filters
    Given 500 request events exist for the org
    When I GET /api/requests?feature=classify&limit=20
    Then 20 results are returned matching feature="classify"
    And no prompt content is included in the response
```

### Story 4.4: API Key Revocation

```gherkin
Feature: API Key Revocation

  Scenario: Revoked key is immediately blocked
    Given API key "sk-compromised" is active
    When I DELETE /api/orgs/{id}/api-keys/{key_id}
    Then the key is removed from Redis cache immediately
    And subsequent requests with "sk-compromised" return 401
```

---

## Epic 5: Dashboard UI

### Story 5.1: Cost Treemap

```gherkin
Feature: AI Spend Treemap Visualization

  Scenario: Treemap renders cost breakdown
    Given the user is logged into the dashboard
    And cost data exists for teams "payments" and "search"
    When the dashboard loads with period=7d
    Then a treemap visualization renders
    And the "payments" block is proportionally larger if it spent more
```

### Story 5.2: Real-Time Savings Counter

```gherkin
Feature: Savings Counter

  Scenario: Weekly savings counter updates
    Given the user is on the dashboard
    And $127.50 has been saved this week
    When the dashboard loads
    Then the counter displays "You saved $127.50 this week"
```

### Story 5.3: Routing Rules Editor

```gherkin
Feature: Routing Rules Editor UI

  Scenario: Create rule via drag-and-drop interface
    Given the user navigates to Settings > Routing Rules
    When they create a new rule with feature="classify" and strategy="cheapest"
    And drag it to priority position 1
    Then the rule is saved via the API
    And the rule list reflects the new priority order
```

### Story 5.4: Request Inspector

```gherkin
Feature: Request Inspector

  Scenario: Inspect routing decision for a specific request
    Given the user opens the Request Inspector
    When they click on a specific request row
    Then the detail panel shows: model_selected, model_alternatives, cost_delta, routing_strategy
    And no prompt content is displayed
```

---

## Epic 6: Shadow Audit CLI

### Story 6.1: Zero-Setup Codebase Scan

```gherkin
Feature: Shadow Audit CLI Scan

  Scenario: Scan TypeScript project for LLM usage
    Given a TypeScript project with 3 files calling openai.chat.completions.create()
    When I run npx dd0c-scan in the project root
    Then the CLI detects 3 LLM API call sites
    And estimates monthly cost based on heuristic token counts
    And displays projected savings with dd0c/route

  Scenario: Scan runs completely offline with cached pricing
    Given the pricing table was cached from a previous run
    And the network is unavailable
    When I run npx dd0c-scan
    Then the scan completes using cached pricing data
    And a warning is shown that pricing may be stale
```
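
A sketch of the kind of heuristic cost estimate the CLI describes. The per-call token count and monthly call volume below are invented example inputs, and the pricing table is a cached lookup like the one the spec mentions; none of these numbers are the real heuristics.

```python
# Illustrative monthly-cost heuristic for the CLI scan.
# Token counts and call volume are invented example assumptions.

PRICING_PER_1M_INPUT = {"gpt-4o": 15.0, "gpt-4o-mini": 0.15}  # cached pricing table

def estimate_monthly_cost(model: str, avg_tokens_per_call: int,
                          calls_per_month: int) -> float:
    return calls_per_month * avg_tokens_per_call * PRICING_PER_1M_INPUT[model] / 1_000_000

current = estimate_monthly_cost("gpt-4o", avg_tokens_per_call=800, calls_per_month=50_000)
routed = estimate_monthly_cost("gpt-4o-mini", avg_tokens_per_call=800, calls_per_month=50_000)
print(f"current ${current:.2f}/mo, projected savings ${current - routed:.2f}/mo")
```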

### Story 6.2: No Source Code Exfiltration

```gherkin
Feature: Local-Only Scan

  Scenario: No source code sent to server
    Given I run npx dd0c-scan without --opt-in flag
    When the scan completes
    Then zero HTTP requests are made to any dd0c server
    And the report is rendered entirely in the terminal
```

### Story 6.3: Terminal Report

```gherkin
Feature: Terminal Report Output

  Scenario: Top Opportunities report
    Given the scan found 5 files with LLM calls
    When the report renders
    Then it shows "Top Opportunities" sorted by potential savings
    And each entry includes: file path, current model, suggested model, estimated monthly savings
```

---

## Epic 7: Slack Integration

### Story 7.1: Weekly Savings Digest

```gherkin
Feature: Monday Morning Digest

  Scenario: Weekly digest sent via email
    Given it is Monday at 9:00 AM UTC
    And the org has cost data from the previous week
    When the dd0c-worker cron fires
    Then an email is sent via AWS SES to the org admin
    And the email includes: total_saved, top_model_savings, week_over_week_trend

  Scenario: No digest sent if no activity
    Given the org had zero requests last week
    When the cron fires
    Then no email is sent
```

### Story 7.2: Budget Alert via Slack

```gherkin
Feature: Budget Threshold Alerts

  Scenario: Daily spend exceeds configured threshold
    Given the org configured alert threshold at $100/day
    And today's spend reaches $101
    When the dd0c-worker evaluates thresholds
    Then a Slack webhook is fired with the alert payload
    And the payload includes X-DD0C-Signature header
    And last_fired_at is updated to prevent duplicate alerts

  Scenario: Alert not re-fired for same incident
    Given an alert was already fired for today's threshold breach
    When the worker evaluates thresholds again
    Then no duplicate Slack webhook is sent
```

---

## Epic 8: Infrastructure & DevOps

### Story 8.1: ECS Fargate Deployment

```gherkin
Feature: Containerized Deployment

  Scenario: CDK deploys all services to ECS Fargate
    Given the CDK stack is synthesized
    When cdk deploy is executed
    Then ECS services are created for proxy, api, and worker
    And ALB routes /v1/* to proxy and /api/* to api
    And CloudFront serves static dashboard assets from S3
```

### Story 8.2: CI/CD Pipeline

```gherkin
Feature: GitHub Actions CI/CD

  Scenario: Push to main triggers full pipeline
    Given code is pushed to the main branch
    When GitHub Actions triggers
    Then tests run (unit + integration + canary suite)
    And Docker images are built and pushed to ECR
    And ECS services are updated with rolling deployment
    And zero downtime is maintained during deployment
```

### Story 8.3: CloudWatch Alarms

```gherkin
Feature: Monitoring & Alerting

  Scenario: P99 latency alarm fires
    Given proxy P99 latency exceeds 50ms for 5 minutes
    When CloudWatch evaluates the alarm
    Then a PagerDuty incident is created
```

---

## Epic 9: Onboarding & PLG

### Story 9.1: One-Click GitHub Signup

```gherkin
Feature: Frictionless Signup

  Scenario: New user completes signup in under 60 seconds
    Given the user clicks "Sign up with GitHub"
    When they authorize the OAuth app
    Then an org is created and an API key is generated
    And the user lands on the onboarding wizard
    And total elapsed time is under 60 seconds
```

### Story 9.2: Free Tier Enforcement

```gherkin
Feature: Free Tier ($50/month routed spend)

  Scenario: Free tier user within limit
    Given the org is on the free tier
    And this month's routed spend is $45
    When a new request is proxied
    Then the request is processed normally

  Scenario: Free tier user exceeds limit
    Given the org is on the free tier
    And this month's routed spend is $50.01
    When a new request is proxied
    Then the proxy returns HTTP 429
    And the response body includes an upgrade CTA with Stripe Checkout link

  Scenario: Monthly counter resets on the 1st
    Given the org used $50 last month
    When the calendar rolls to the 1st of the new month
    Then the Redis counter is reset to $0
    And requests are processed normally again
```
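
The three free-tier scenarios above reduce to a small gate around a monthly counter. This sketch uses an in-memory `Decimal` counter as a stand-in for the Redis counter in the spec; the class and method names are illustrative assumptions.

```python
# Free-tier gate sketch: $50/month routed spend, reset on the 1st.
# The in-memory counter stands in for the Redis counter.
from decimal import Decimal

FREE_TIER_LIMIT = Decimal("50.00")

class SpendCounter:
    def __init__(self):
        self.month_spend = Decimal("0")

    def allow_request(self) -> bool:
        return self.month_spend <= FREE_TIER_LIMIT

    def record(self, cost: Decimal) -> None:
        self.month_spend += cost

    def monthly_reset(self) -> None:  # fired when the calendar rolls to the 1st
        self.month_spend = Decimal("0")

c = SpendCounter()
c.record(Decimal("45"))
assert c.allow_request()          # $45 is within the limit
c.record(Decimal("5.01"))         # total is now $50.01
assert not c.allow_request()      # proxy returns 429 + upgrade CTA
c.monthly_reset()
assert c.allow_request()          # processed normally again
```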

### Story 9.3: API Key Management

```gherkin
Feature: API Key CRUD

  Scenario: Generate new API key
    Given I am authenticated as an Owner
    When I POST /api/orgs/{id}/api-keys
    Then a new API key is returned in plaintext (shown once)
    And the key is stored as a bcrypt hash in PostgreSQL
    And the key is cached in Redis for fast auth

  Scenario: Rotate API key
    Given API key "sk-old" is active
    When I POST /api/orgs/{id}/api-keys/{key_id}/rotate
    Then a new key "sk-new" is returned
    And "sk-old" is immediately invalidated
```

### Story 9.4: First Route Onboarding

```gherkin
Feature: 2-Minute First Route

  Scenario: User completes onboarding wizard
    Given the user just signed up
    When they copy the API key (step 1)
    And paste the provided curl command (step 2)
    And the request appears in the dashboard (step 3)
    Then the onboarding is marked complete
    And a PostHog event "routing.savings.first_dollar" is tracked (if savings occurred)
```

### Story 9.5: Team Invites

```gherkin
Feature: Team Member Invites

  Scenario: Invite team member via email
    Given I am an Owner of org "acme"
    When I POST /api/orgs/{id}/invites with email "dev@acme.com"
    Then an email with a magic link is sent
    And the magic link JWT expires in 72 hours

  Scenario: Invited user joins existing org
    Given "dev@acme.com" received an invite magic link
    When they click the link and complete GitHub OAuth
    Then they are added to org "acme" as a Member (not Owner)
    And no new org is created
```

---

## Epic 10: Transparent Factory Compliance

### Story 10.1: Atomic Flagging

```gherkin
Feature: Feature Flag Infrastructure

  Scenario: New routing strategy behind a flag (default off)
    Given feature flag "enable_cascading_router" exists with default=off
    When a request arrives that would trigger cascading routing
    Then the passthrough strategy is used instead
    And the flag evaluation completes without network calls

  Scenario: Flag auto-disables on latency regression
    Given flag "enable_new_classifier" is at 50% rollout
    And P99 latency increased by 8% since flag was enabled
    When the circuit breaker evaluates flag health
    Then the flag is auto-disabled within 30 seconds
    And an alert is fired

  Scenario: CI blocks deployment with expired flag
    Given flag "old_experiment" has TTL expired and rollout=100%
    When CI runs the flag audit
    Then the build fails with "Expired flag at full rollout: old_experiment"
```

### Story 10.2: Elastic Schema

```gherkin
Feature: Additive-Only Schema Migrations

  Scenario: Migration with DROP COLUMN is rejected
    Given a migration file contains "ALTER TABLE request_events DROP COLUMN old_field"
    When CI runs the schema lint
    Then the build fails with "Destructive schema change detected"

  Scenario: V1 code ignores V2 fields
    Given a request_events row contains a new "routing_v2" column
    When V1 Rust code deserializes the row
    Then deserialization succeeds (unknown fields ignored)
    And no error is logged
```

### Story 10.3: Cognitive Durability

```gherkin
Feature: Decision Logs for Routing Logic

  Scenario: PR touching router requires decision log
    Given a PR modifies files in src/router/
    And no decision_log.json is included
    When CI runs the decision log check
    Then the build fails with "Decision log required for routing changes"

  Scenario: Cyclomatic complexity exceeds cap
    Given a function in src/router/ has cyclomatic complexity 12
    When cargo clippy runs with cognitive_complexity threshold 10
    Then the lint fails
```

### Story 10.4: Semantic Observability

```gherkin
Feature: AI Reasoning Spans

  Scenario: Routing decision emits OTEL span
    Given a request is routed from gpt-4o to gpt-4o-mini
    When the routing decision completes
    Then an "ai_routing_decision" span is created
    And span attributes include: ai.model_selected, ai.cost_delta, ai.complexity_score
    And ai.prompt_hash is a SHA-256 hash (not raw content)

  Scenario: No PII in any span
    Given a request with user email in the prompt
    When the span is emitted
    Then no span attribute contains the email address
    And ai.prompt_hash is the only prompt-related attribute
```

### Story 10.5: Configurable Autonomy

```gherkin
Feature: Governance Policy

  Scenario: Strict mode blocks auto-applied config changes
    Given governance_mode is "strict"
    When the background task attempts to refresh routing rules
    Then the refresh is blocked
    And a log entry "Blocked by strict mode" is written

  Scenario: Panic mode freezes to last-known-good
    Given panic_mode is triggered via POST /admin/panic
    When a new request arrives
    Then routing uses the frozen last-known-good configuration
    And auto-failover is disabled
    And the response header includes "X-DD0C-Panic: active"
```

---

*End of dd0c/route BDD Acceptance Specifications — 10 Epics, 50+ Scenarios*

# dd0c/cost — BDD Acceptance Test Specifications

**Phase 3: Given/When/Then per Story**

**Date:** March 1, 2026

---

## Epic 1: CloudTrail Ingestion

### Story 1.1: EventBridge Cross-Account Rules

```gherkin
Feature: CloudTrail Event Ingestion

  Scenario: EC2 RunInstances event ingested successfully
    Given a customer AWS account has EventBridge rules forwarding to dd0c
    When a RunInstances CloudTrail event fires in the customer account
    Then the event arrives in the dd0c SQS FIFO queue within 5 seconds
    And the event contains userIdentity, eventSource, and requestParameters

  Scenario: Non-cost-generating event is filtered at EventBridge
    Given EventBridge rules filter for cost-relevant API calls only
    When a DescribeInstances CloudTrail event fires
    Then the event is NOT forwarded to the SQS queue
```

### Story 1.2: SQS FIFO Deduplication

```gherkin
Feature: CloudTrail Event Deduplication

  Scenario: Duplicate CloudTrail event is deduplicated by SQS FIFO
    Given a RunInstances event with eventID "evt-abc-123" was already processed
    When the same event arrives again (CloudTrail duplicate delivery)
    Then SQS FIFO deduplicates based on the eventID
    And only one CostEvent is written to DynamoDB

  Scenario: Same resource type from different events is allowed
    Given two different RunInstances events for the same instance type
    But with different eventIDs
    When both arrive in the SQS queue
    Then both are processed as separate CostEvents
```
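
The deduplication semantics above (drop repeated eventIDs, keep distinct IDs even for identical resources) can be illustrated with a seen-set; in production SQS FIFO enforces this via the deduplication ID, so the set here is only a sketch of the behaviour.

```python
# Application-level sketch of the eventID deduplication guarantee above.
# In production SQS FIFO handles this via the deduplication ID.

def process_events(events: list[dict], seen: set) -> list[dict]:
    cost_events = []
    for event in events:
        if event["eventID"] in seen:   # duplicate CloudTrail delivery
            continue
        seen.add(event["eventID"])
        cost_events.append({"source_event": event["eventID"]})
    return cost_events

batch = [
    {"eventID": "evt-abc-123", "type": "RunInstances"},
    {"eventID": "evt-abc-123", "type": "RunInstances"},  # duplicate delivery
    {"eventID": "evt-def-456", "type": "RunInstances"},  # same type, new ID
]
written = process_events(batch, seen=set())
assert len(written) == 2  # one CostEvent per unique eventID
```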

### Story 1.3: Lambda Normalizer

```gherkin
Feature: CloudTrail Event Normalization

  Scenario: EC2 RunInstances normalized to CostEvent schema
    Given a raw CloudTrail RunInstances event
    When the Lambda normalizer processes it
    Then a CostEvent is created with: service="ec2", instanceType="m5.xlarge", region, actor ARN, hourly cost
    And the hourly cost is looked up from the static pricing table

  Scenario: RDS CreateDBInstance normalized correctly
    Given a raw CloudTrail CreateDBInstance event
    When the Lambda normalizer processes it
    Then a CostEvent is created with: service="rds", engine, storageType, multiAZ flag

  Scenario: Unknown instance type uses fallback pricing
    Given a CloudTrail event with instanceType "z99.mega" not in the pricing table
    When the normalizer looks up pricing
    Then fallback pricing of $0 is applied
    And a warning is logged: "Unknown instance type: z99.mega"

  Scenario: Malformed CloudTrail JSON does not crash the Lambda
    Given a CloudTrail event with invalid JSON structure
    When the Lambda normalizer processes it
    Then the event is sent to the DLQ
    And no CostEvent is written
    And an error metric is emitted

  Scenario: Batched RunInstances (multiple instances) creates multiple CostEvents
    Given a RunInstances event launching 5 instances simultaneously
    When the normalizer processes it
    Then 5 separate CostEvents are created (one per instance)
```
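
The fallback-pricing behavior above can be sketched as follows. The table contents and the `lookup_hourly_cost` name are illustrative assumptions, not the real normalizer's module layout.

```python
import logging

logger = logging.getLogger("normalizer")

# Hypothetical static pricing table (USD/hour); the real table is larger.
HOURLY_PRICING = {
    "m5.xlarge": 0.192,
    "t3.nano": 0.0052,
    "p4d.24xlarge": 32.77,
}

def lookup_hourly_cost(instance_type: str) -> float:
    """Return the hourly cost; unknown types fall back to $0 with a warning."""
    try:
        return HOURLY_PRICING[instance_type]
    except KeyError:
        logger.warning("Unknown instance type: %s", instance_type)
        return 0.0
```

The $0 fallback keeps the pipeline flowing on unseen SKUs while the warning makes the pricing-table gap visible.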

### Story 1.4: Cost Precision

```gherkin
Feature: Cost Calculation Precision

  Scenario: Hourly cost calculated with 4 decimal places
    Given instance type "m5.xlarge" costs $0.192/hr
    When the normalizer calculates hourly cost
    Then the cost is stored as 0.1920 (4 decimal places)

  Scenario: Sub-cent Lambda costs handled correctly
    Given a Lambda invocation costing $0.0000002 per request
    When the normalizer calculates cost
    Then the cost is stored without floating-point precision loss
```
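
One way to satisfy both precision scenarios is to carry costs as `Decimal` end to end and quantize only hourly figures at the storage boundary. A minimal sketch; the function names are assumptions, not the actual codebase:

```python
from decimal import Decimal

def parse_cost(raw: str) -> Decimal:
    """Parse a price string into an exact Decimal (never float),
    so $0.0000002 survives with no binary rounding error."""
    return Decimal(raw)

def stored_hourly_cost(raw: str) -> Decimal:
    """Hourly costs are quantized to 4 decimal places at storage time."""
    return parse_cost(raw).quantize(Decimal("0.0001"))
```

DynamoDB's number type accepts `Decimal` natively via boto3, so no float conversion is needed on the write path.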

---

## Epic 2: Anomaly Detection

### Story 2.1: Z-Score Calculation

```gherkin
Feature: Z-Score Anomaly Scoring

  Scenario: Event matching baseline mean scores zero
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.25
    Then the Z-score is 0.0
    And the severity is "info"

  Scenario: Event 3 standard deviations above mean scores critical
    Given a baseline with mean=$1.25/hr and stddev=$0.15
    When a CostEvent arrives with hourly cost $1.70 (Z=3.0)
    Then the Z-score is 3.0
    And the severity is "critical"

  Scenario: Zero standard deviation does not cause division by zero
    Given a baseline with mean=$1.00/hr and stddev=$0.00 (all identical observations)
    When a CostEvent arrives with hourly cost $1.50
    Then the scorer handles the zero-stddev case gracefully
    And the severity is based on absolute cost delta, not Z-score

  Scenario: Single data point baseline (stddev undefined)
    Given a baseline with only 1 observation
    When a CostEvent arrives
    Then the scorer uses the cold-start fast-path instead of Z-score
```

### Story 2.2: Novelty Scoring

```gherkin
Feature: Actor and Instance Novelty Detection

  Scenario: New IAM role triggers actor novelty penalty
    Given the baseline has observed actors: ["arn:aws:iam::123:role/ci"]
    When a CostEvent arrives from actor "arn:aws:iam::123:role/unknown-script"
    Then an actor novelty penalty is added to the composite score

  Scenario: Known actor and known instance type has no novelty penalty
    Given the baseline has observed actors and instance types matching the event
    When the CostEvent is scored
    Then the novelty component is 0
```

### Story 2.3: Cold-Start Fast Path

```gherkin
Feature: Cold-Start Fast Path for Expensive Resources

  Scenario: $25/hr instance triggers immediate critical (bypasses baseline)
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a p4d.24xlarge at $32.77/hr
    Then the fast-path triggers immediately
    And severity is "critical" regardless of baseline state

  Scenario: $0.10/hr instance ignored during cold-start
    Given the account baseline has fewer than 14 days of data
    When a CostEvent arrives for a t3.nano at $0.0052/hr
    Then the fast-path does NOT trigger
    And the event is scored normally (likely "info")

  Scenario: Fast-path transitions to statistical scoring at maturity
    Given the account baseline has 20+ events AND 14+ days
    When a CostEvent arrives for a $10/hr instance
    Then the Z-score path is used (not fast-path)
```
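
The three scenarios together imply a predicate like the following. The $25/hr threshold and the 14-day/20-event maturity cutoffs are taken from the scenarios; the function name is hypothetical:

```python
FAST_PATH_HOURLY_THRESHOLD = 25.0  # USD/hr, per the scenario above
MATURITY_DAYS = 14
MATURITY_EVENTS = 20

def use_fast_path(hourly_cost: float, baseline_days: int,
                  baseline_events: int) -> bool:
    """Fast-path fires only while the baseline is immature AND the
    resource is expensive; mature baselines use the Z-score path."""
    immature = (baseline_days < MATURITY_DAYS
                or baseline_events < MATURITY_EVENTS)
    return immature and hourly_cost >= FAST_PATH_HOURLY_THRESHOLD
```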

### Story 2.4: Composite Scoring

```gherkin
Feature: Composite Severity Classification

  Scenario: Composite score below 30 classified as info
    Given Z-score contribution is 15 and novelty contribution is 10
    When the composite score is calculated (25)
    Then severity is "info"

  Scenario: Composite score above 60 classified as critical
    Given Z-score contribution is 45 and novelty contribution is 20
    When the composite score is calculated (65)
    Then severity is "critical"

  Scenario: Events below $0.50/hr never classified as critical
    Given a CostEvent with hourly cost $0.30 and Z-score of 5.0
    When the composite score is calculated
    Then severity is capped at "warning" (never critical for cheap resources)
```
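
A classifier consistent with these scenarios might look like this. The 30/60 cutoffs and the $0.50 floor come from the scenarios; treating the middle band as "warning" is an assumption:

```python
CRITICAL_COST_FLOOR = 0.50  # USD/hr; cheap resources never go critical

def classify(composite: float, hourly_cost: float) -> str:
    """Map a composite score plus hourly cost to a severity label."""
    if composite < 30:
        return "info"
    # Cap at "warning" for sub-$0.50/hr resources regardless of score.
    if composite <= 60 or hourly_cost < CRITICAL_COST_FLOOR:
        return "warning"
    return "critical"
```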

### Story 2.5: Feedback Loop

```gherkin
Feature: Mark-as-Expected Feedback

  Scenario: Marking an anomaly as expected reduces future scores
    Given an anomaly was triggered for actor "arn:aws:iam::123:role/batch-job"
    When the user clicks "Mark as Expected" in Slack
    Then the actor is added to the expected_actors list
    And future events from this actor receive a reduced score

  Scenario: Expected actor still flagged if cost is 10x above baseline
    Given actor "batch-job" is in the expected_actors list
    And the baseline mean is $1.00/hr
    When a CostEvent arrives from "batch-job" at $15.00/hr
    Then the anomaly is still flagged (10x override)
```
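
The 10x override can be expressed as a small suppression predicate. The multiplier is from the scenario; the function name and signature are illustrative:

```python
OVERRIDE_MULTIPLIER = 10.0  # expected actors still flag above 10x baseline

def suppress_for_expected_actor(actor: str, hourly_cost: float,
                                baseline_mean: float,
                                expected_actors: set[str]) -> bool:
    """True when the alert should be suppressed: the actor is expected
    AND the cost stays below the 10x-baseline override threshold."""
    if actor not in expected_actors:
        return False
    return hourly_cost < baseline_mean * OVERRIDE_MULTIPLIER
```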

---

## Epic 3: Zombie Hunter

```gherkin
Feature: Idle Resource Detection

  Scenario: EC2 instance running 7+ days with <5% CPU detected as zombie
    Given an EC2 instance "i-abc123" has been running for 10 days
    And CloudWatch average CPU utilization is 2%
    When the daily Zombie Hunter scan runs
    Then "i-abc123" is flagged as a zombie
    And cumulative waste cost is calculated (10 days × hourly rate)

  Scenario: RDS instance with 0 connections for 3+ days detected
    Given an RDS instance "prod-db-unused" has 0 connections for 5 days
    When the Zombie Hunter scan runs
    Then "prod-db-unused" is flagged as a zombie

  Scenario: Unattached EBS volume older than 7 days detected
    Given an EBS volume "vol-xyz" has been unattached for 14 days
    When the Zombie Hunter scan runs
    Then "vol-xyz" is flagged as a zombie
    And the waste cost includes the daily storage charge

  Scenario: Instance tagged dd0c:ignore is excluded
    Given an EC2 instance has tag "dd0c:ignore=true"
    When the Zombie Hunter scan runs
    Then the instance is skipped

  Scenario: Zombie scan respects read-only permissions
    Given the dd0c IAM role has only read permissions
    When the Zombie Hunter scan runs
    Then no modify/stop/terminate API calls are made
```

---

## Epic 4: Notification & Remediation

### Story 4.1: Slack Block Kit Alerts

```gherkin
Feature: Slack Anomaly Alerts

  Scenario: Critical anomaly sends Slack alert with Block Kit
    Given an anomaly with severity "critical" for a p4d.24xlarge in us-east-1
    When the notification engine processes it
    Then a Slack message is sent with Block Kit formatting
    And the message includes: resource type, region, hourly cost, actor, "Why this alert" section
    And the message includes buttons: Snooze 1h, Snooze 24h, Mark Expected, Stop Instance

  Scenario: Info-level anomalies are NOT sent to Slack
    Given an anomaly with severity "info"
    When the notification engine processes it
    Then no Slack message is sent
    And the anomaly is only logged to DynamoDB
```

### Story 4.2: Interactive Remediation

```gherkin
Feature: One-Click Remediation via Slack

  Scenario: User clicks "Stop Instance" and remediation succeeds
    Given a Slack alert for EC2 instance "i-abc123" in account "123456789012"
    When the user clicks "Stop Instance"
    Then the system assumes the cross-account remediation role via STS
    And ec2:StopInstances is called for "i-abc123"
    And the Slack message is updated to "Remediation Successful ✅"

  Scenario: Cross-account IAM role has been revoked
    Given the customer revoked the dd0c remediation IAM role
    When the user clicks "Stop Instance"
    Then STS AssumeRole fails
    And the Slack message is updated to "Remediation Failed: IAM role not found"
    And no instance is stopped

  Scenario: Snooze suppresses alerts for the specified duration
    Given the user clicks "Snooze 24h" on an anomaly
    When a similar anomaly fires within 24 hours
    Then no Slack alert is sent
    And the anomaly is still logged to DynamoDB

  Scenario: Mark Expected updates the baseline
    Given the user clicks "Mark Expected"
    When the feedback is processed
    Then the actor is added to expected_actors
    And the false-positive counter is incremented for governance scoring
```

### Story 4.3: Slack Signature Validation

```gherkin
Feature: Slack Interactive Payload Security

  Scenario: Valid Slack signature accepted
    Given a Slack interactive payload with correct HMAC-SHA256 signature
    When the API receives the payload
    Then the action is processed

  Scenario: Invalid signature rejected
    Given a payload with an incorrect X-Slack-Signature header
    When the API receives the payload
    Then HTTP 401 is returned
    And no action is taken

  Scenario: Expired timestamp rejected (replay attack)
    Given a payload with X-Slack-Request-Timestamp older than 5 minutes
    When the API receives the payload
    Then HTTP 401 is returned
```

### Story 4.4: Daily Digest

```gherkin
Feature: Daily Anomaly Digest

  Scenario: Daily digest aggregates 24h of anomalies
    Given 15 anomalies were detected in the last 24 hours
    When the daily digest cron fires
    Then a Slack message is sent with: total anomalies, top 3 costliest, total estimated spend, zombie count

  Scenario: No digest sent if zero anomalies
    Given zero anomalies in the last 24 hours
    When the daily digest cron fires
    Then no Slack message is sent
```

---

## Epic 5: Onboarding & PLG

```gherkin
Feature: AWS Account Onboarding

  Scenario: CloudFormation quick-create deploys IAM role
    Given the user clicks the quick-create URL
    When CloudFormation deploys the stack
    Then an IAM role is created with the correct permissions and ExternalId
    And an EventBridge rule is created for cost-relevant CloudTrail events

  Scenario: Role validation succeeds with correct ExternalId
    Given the IAM role exists with ExternalId "dd0c-tenant-123"
    When POST /v1/accounts validates the role
    Then STS AssumeRole succeeds
    And the account is marked as "active"
    And the initial Zombie Hunter scan is triggered

  Scenario: Role validation fails with wrong ExternalId
    Given the IAM role exists but ExternalId does not match
    When POST /v1/accounts validates the role
    Then STS AssumeRole fails with AccessDenied
    And a clear error message is returned

  Scenario: Free tier allows 1 account
    Given the tenant is on the free tier with 0 connected accounts
    When they connect their first AWS account
    Then the connection succeeds

  Scenario: Free tier rejects 2nd account
    Given the tenant is on the free tier with 1 connected account
    When they attempt to connect a second account
    Then HTTP 403 is returned with an upgrade prompt

  Scenario: Stripe upgrade unlocks multiple accounts
    Given the tenant upgrades via Stripe Checkout
    When checkout.session.completed webhook fires
    Then the tenant tier is updated to "pro"
    And they can connect additional accounts

  Scenario: Stripe webhook signature validated
    Given a Stripe webhook payload
    When the signature does not match the signing secret
    Then HTTP 401 is returned
```

---

## Epic 6: Dashboard API

```gherkin
Feature: Dashboard API with Tenant Isolation

  Scenario: GET /v1/anomalies returns anomalies for authenticated tenant
    Given tenant "t-123" has 50 anomalies
    And tenant "t-456" has 30 anomalies
    When tenant "t-123" calls GET /v1/anomalies
    Then only tenant "t-123" anomalies are returned
    And zero anomalies from "t-456" are included

  Scenario: Cursor-based pagination
    Given tenant "t-123" has 200 anomalies
    When GET /v1/anomalies?limit=20 is called
    Then 20 results are returned with a cursor token
    And calling with the cursor returns the next 20

  Scenario: Filter by severity
    Given anomalies exist with severity "critical", "warning", and "info"
    When GET /v1/anomalies?severity=critical is called
    Then only critical anomalies are returned

  Scenario: Baseline override
    Given tenant "t-123" wants to adjust sensitivity for EC2 m5.xlarge
    When PATCH /v1/accounts/{id}/baselines/ec2/m5.xlarge with sensitivity=0.8
    Then the baseline sensitivity is updated
    And future scoring uses the adjusted threshold

  Scenario: Missing Cognito JWT returns 401
    Given no Authorization header is present
    When any API endpoint is called
    Then HTTP 401 is returned
```

---

## Epic 7: Dashboard UI

```gherkin
Feature: Dashboard UI

  Scenario: Anomaly feed renders with severity badges
    Given the user is logged in
    When the dashboard loads
    Then anomalies are listed with color-coded severity badges (red/yellow/blue)

  Scenario: Baseline learning progress displayed
    Given an account connected 5 days ago with 30 events
    When the user views the account detail
    Then a progress bar shows "5/14 days, 30/20 events" toward maturity

  Scenario: Zombie resource list shows waste estimate
    Given 3 zombie resources were detected
    When the user views the Zombie Hunter tab
    Then each zombie shows: resource type, age, cumulative waste cost
```

---

## Epic 8: Infrastructure

```gherkin
Feature: CDK Infrastructure

  Scenario: CDK deploys all required resources
    Given the CDK stack is synthesized
    When cdk deploy is executed
    Then EventBridge rules, SQS FIFO queues, Lambda functions, DynamoDB tables, and Cognito user pool are created

  Scenario: CI/CD pipeline runs on push to main
    Given code is pushed to the main branch
    When GitHub Actions triggers
    Then unit tests, integration tests, and CDK synth run
    And Lambda functions are deployed
    And zero downtime is maintained
```

---

## Epic 9: Multi-Account Management

```gherkin
Feature: Multi-Account Management

  Scenario: Link additional AWS account
    Given the tenant is on the pro tier
    When they POST /v1/accounts with a new role ARN
    Then the account is validated and linked
    And CloudTrail events from the new account begin flowing

  Scenario: Consolidated anomaly view across accounts
    Given the tenant has 3 linked accounts
    When they GET /v1/anomalies (no account filter)
    Then anomalies from all 3 accounts are returned
    And each anomaly includes the source account ID

  Scenario: Disconnect account stops event processing
    Given account "123456789012" is linked
    When DELETE /v1/accounts/{id} is called
    Then the account is marked as "disconnecting"
    And EventBridge rules are cleaned up
    And no new events are processed from that account
```

---

## Epic 10: Transparent Factory Compliance

### Story 10.1: Atomic Flagging

```gherkin
Feature: Alert Volume Circuit Breaker

  Scenario: Circuit breaker trips at 3x baseline alert volume
    Given the baseline alert rate is 5 alerts/hour
    When 16 alerts fire in the current hour (>3x)
    Then the circuit breaker trips
    And the scoring flag is auto-disabled
    And suppressed anomalies are buffered in the DLQ

  Scenario: Fast-path alerts are exempt from circuit breaker
    Given the circuit breaker is tripped
    When a $30/hr instance triggers the cold-start fast-path
    Then the alert is still sent (fast-path bypasses breaker)

  Scenario: CI blocks expired flags
    Given a feature flag with TTL expired and rollout=100%
    When CI runs the flag audit
    Then the build fails
```

### Story 10.2: Elastic Schema

```gherkin
Feature: DynamoDB Additive Schema

  Scenario: New attribute added without breaking V1 readers
    Given a DynamoDB item has a new "anomaly_score_v2" attribute
    When V1 code reads the item
    Then deserialization succeeds (unknown attributes ignored)

  Scenario: Key schema modifications are rejected
    Given a migration attempts to change the partition key structure
    When CI runs the schema lint
    Then the build fails with "Key schema modification detected"
```

### Story 10.3: Cognitive Durability

```gherkin
Feature: Decision Logs for Scoring Changes

  Scenario: PR modifying Z-score thresholds requires decision log
    Given a PR changes files in src/scoring/
    And no decision_log.json is included
    When CI runs the decision log check
    Then the build fails

  Scenario: Cyclomatic complexity enforced
    Given a scoring function has complexity 12
    When the linter runs with threshold 10
    Then the lint fails
```

### Story 10.4: Semantic Observability

```gherkin
Feature: OTEL Spans on Scoring Decisions

  Scenario: Every anomaly scoring decision emits an OTEL span
    Given a CostEvent is processed by the Anomaly Scorer
    When scoring completes
    Then an "anomaly_scoring" span is created
    And span attributes include: cost.z_score, cost.anomaly_score, cost.baseline_days
    And cost.fast_path_triggered is set to true/false

  Scenario: Account ID is hashed in spans (PII protection)
    Given a scoring span is emitted
    When the span attributes are inspected
    Then the AWS account ID is SHA-256 hashed
    And no raw account ID appears in any attribute
```
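
The PII scenario amounts to hashing the account ID before it is attached as a span attribute. A sketch; the helper name is hypothetical:

```python
import hashlib

def span_safe_account_id(account_id: str) -> str:
    """SHA-256 the AWS account ID before attaching it as a span
    attribute, so no raw account ID reaches the telemetry backend.
    The digest is stable, so spans remain correlatable per account."""
    return hashlib.sha256(account_id.encode()).hexdigest()
```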

### Story 10.5: Configurable Autonomy (14-Day Governance)

```gherkin
Feature: 14-Day Auto-Promotion Governance

  Scenario: New account starts in strict mode (log-only)
    Given a new AWS account is connected
    When the first anomaly is detected
    Then the anomaly is logged to DynamoDB
    And NO Slack alert is sent (strict mode)

  Scenario: Account auto-promotes to audit mode after 14 days
    Given an account has been connected for 15 days
    And the false-positive rate is 5% (<10% threshold)
    When the daily governance cron runs
    Then the account mode is changed from "strict" to "audit"
    And a log entry "Auto-promoted to audit mode" is written

  Scenario: Account does NOT promote if false-positive rate is high
    Given an account has been connected for 15 days
    And the false-positive rate is 15% (>10% threshold)
    When the daily governance cron runs
    Then the account remains in "strict" mode

  Scenario: Panic mode halts all alerting in <1 second
    Given panic mode is triggered via POST /admin/panic
    When a new critical anomaly is detected
    Then NO Slack alert is sent
    And the anomaly is still scored and logged
    And the dashboard API returns "alerting paused" header

  Scenario: Panic mode requires manual clearance
    Given panic mode is active
    When 24 hours pass without manual intervention
    Then panic mode is still active (no auto-clear)
    And only POST /admin/panic with panic=false clears it
```
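
The promotion rules reduce to a small pure function the governance cron could call per account. The 14-day and 10% numbers come from the scenarios; the function name and mode strings beyond "strict"/"audit" are assumptions:

```python
PROMOTION_MIN_DAYS = 14
FP_RATE_CEILING = 0.10  # promote only below a 10% false-positive rate

def next_mode(current_mode: str, connected_days: int, fp_rate: float) -> str:
    """Daily governance decision: strict -> audit after 14 days with a
    low false-positive rate; otherwise the mode is unchanged (there is
    no automatic demotion)."""
    if (current_mode == "strict"
            and connected_days >= PROMOTION_MIN_DAYS
            and fp_rate < FP_RATE_CEILING):
        return "audit"
    return current_mode
```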

---

*End of dd0c/cost BDD Acceptance Specifications — 10 Epics, 55+ Scenarios*