Phase 3: BDD acceptance specs for P1 (route) and P5 (cost)

P1: 50+ scenarios across 10 epics, all stories covered
P5: 55+ scenarios across 10 epics, written manually (Sonnet credential failures)
Remaining P2/P3/P4/P6 in progress via subagents
2026-03-01 01:50:30 +00:00
parent 03bfe931fc
commit c1484426cc
2 changed files with 1295 additions and 0 deletions


@@ -0,0 +1,685 @@
# dd0c/route — BDD Acceptance Test Specifications
**Phase 3: Given/When/Then per Story**
**Date:** March 1, 2026
---
## Epic 1: Proxy Engine
### Story 1.1: OpenAI SDK Drop-In Compatibility
```gherkin
Feature: OpenAI SDK Drop-In Compatibility
Scenario: Non-streaming request proxied successfully
Given a valid API key "sk-test-123" exists in Redis cache
And the upstream OpenAI endpoint is healthy
When I send POST /v1/chat/completions with model "gpt-4o" and stream=false
Then the response status is 200
And the response body matches the OpenAI ChatCompletion schema
And the response includes usage.prompt_tokens and usage.completion_tokens
Scenario: Request with invalid API key is rejected
Given no API key "sk-invalid" exists in Redis or PostgreSQL
When I send POST /v1/chat/completions with Authorization "Bearer sk-invalid"
Then the response status is 401
And the response body contains error "Invalid API key"
Scenario: API key validated from Redis cache (fast path)
Given API key "sk-cached" exists in Redis cache
When I send POST /v1/chat/completions with Authorization "Bearer sk-cached"
Then the key is validated without querying PostgreSQL
And the request is forwarded to the upstream provider
Scenario: API key falls back to PostgreSQL when Redis is unavailable
Given API key "sk-db-only" exists in PostgreSQL but NOT in Redis
When I send POST /v1/chat/completions with Authorization "Bearer sk-db-only"
Then the key is validated via PostgreSQL
And the key is written back to Redis cache for future requests
```
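
As a companion to the last two scenarios, here is a minimal sketch of the cache-aside lookup, assuming a `redis::aio::ConnectionManager` and an `sqlx::PgPool`. The `validate_api_key` name, key prefix, and `api_keys` table are illustrative, and per Story 9.3 the stored value would actually be a bcrypt hash rather than the plaintext compared here.
```rust
use redis::AsyncCommands;
use sqlx::PgPool;

/// Cache-aside validation: Redis first, PostgreSQL on miss, write-back so the
/// next request takes the fast path. (Sketch only; the real schema stores a
/// bcrypt hash, so the DB step would verify a hash instead of comparing plaintext.)
async fn validate_api_key(
    redis: &mut redis::aio::ConnectionManager,
    db: &PgPool,
    key: &str,
) -> Result<bool, Box<dyn std::error::Error>> {
    // Fast path: a cache hit means the key was validated recently.
    // Redis errors are treated like a miss, so an outage degrades to the DB path.
    if matches!(redis.exists::<_, bool>(format!("apikey:{key}")).await, Ok(true)) {
        return Ok(true);
    }
    // Slow path: fall back to PostgreSQL.
    let row: Option<(i64,)> =
        sqlx::query_as("SELECT id FROM api_keys WHERE key = $1 AND revoked_at IS NULL")
            .bind(key)
            .fetch_optional(db)
            .await?;
    if row.is_some() {
        // Best-effort write-back (short TTL); a failure here just means the
        // next request hits the database again.
        let _: Result<(), redis::RedisError> =
            redis.set_ex(format!("apikey:{key}"), 1, 300).await;
        return Ok(true);
    }
    Ok(false)
}
```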
### Story 1.2: SSE Streaming Support
```gherkin
Feature: SSE Streaming Passthrough
Scenario: Streaming response is proxied chunk-by-chunk
Given a valid API key and upstream OpenAI is healthy
When I send POST /v1/chat/completions with stream=true
Then the response Content-Type is "text/event-stream"
And I receive multiple SSE chunks ending with "data: [DONE]"
And each chunk matches the OpenAI streaming delta schema
Scenario: Client disconnects mid-stream
Given a streaming request is in progress
When the client drops the TCP connection after 5 chunks
Then the proxy aborts the upstream provider connection
And no further chunks are buffered in memory
```
### Story 1.3: <5ms Latency Overhead
```gherkin
Feature: Proxy Latency Budget
Scenario: P99 latency overhead under 5ms
Given 1000 non-streaming requests are sent sequentially
When I measure the delta between proxy response time and upstream response time
Then the P99 delta is less than 5 milliseconds
Scenario: Telemetry emission does not block the hot path
Given TimescaleDB is experiencing 5-second write latency
When I send a request through the proxy
Then the proxy responds within the 5ms overhead budget
And the telemetry event is queued in the mpsc channel (not dropped)
```
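
The P99 assertion in the first scenario reduces to a percentile over per-request deltas. A small helper like the following (a sketch, not the project's actual test harness) is enough to express the budget check:
```rust
use std::time::Duration;

/// Returns the P99 value from a set of per-request latency deltas
/// (proxy response time minus upstream response time).
fn p99(mut deltas: Vec<Duration>) -> Duration {
    assert!(!deltas.is_empty());
    deltas.sort();
    // Index of the 99th percentile, rounding up so small samples stay conservative.
    let idx = ((deltas.len() as f64) * 0.99).ceil() as usize - 1;
    deltas[idx.min(deltas.len() - 1)]
}

#[test]
fn p99_overhead_is_under_budget() {
    // In the real suite the deltas would come from 1000 measured requests;
    // this vector is a stand-in.
    let deltas = vec![Duration::from_micros(800); 1000];
    assert!(p99(deltas) < Duration::from_millis(5));
}
```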
### Story 1.4: Transparent Error Passthrough
```gherkin
Feature: Provider Error Passthrough
Scenario: Rate limit error passed through transparently
Given the upstream provider returns HTTP 429 with Retry-After header
When the proxy receives this response
Then the proxy returns HTTP 429 to the client
And the Retry-After header is preserved
Scenario: Provider 500 error passed through
Given the upstream provider returns HTTP 500
When the proxy receives this response
Then the proxy returns HTTP 500 to the client
And the original error body is preserved
```
---
## Epic 2: Router Brain
### Story 2.1: Complexity Classification
```gherkin
Feature: Request Complexity Classification
Scenario: Simple extraction task classified as low complexity
Given a request with system prompt "Extract the name from this text"
And token count is under 500
When the complexity classifier evaluates the request
Then the complexity score is "low"
And classification completes in under 2ms
Scenario: Multi-turn reasoning classified as high complexity
Given a request with 10+ messages in the conversation
And system prompt contains "analyze", "reason", or "compare"
When the complexity classifier evaluates the request
Then the complexity score is "high"
Scenario: Unknown pattern defaults to medium complexity
Given a request with no recognizable task pattern
When the complexity classifier evaluates the request
Then the complexity score is "medium"
```
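
One possible shape for the classifier implied by these scenarios is a pure keyword-and-size heuristic, which also keeps it comfortably inside the 2ms budget. The keyword list and thresholds below are assumptions lifted from the scenario text, not a fixed design:
```rust
#[derive(Debug, PartialEq)]
enum Complexity {
    Low,
    Medium,
    High,
}

/// Heuristic classification: pure CPU work over the prompt and message count.
/// Keywords and thresholds are illustrative, taken from the scenarios above.
fn classify(system_prompt: &str, message_count: usize, token_count: usize) -> Complexity {
    let prompt = system_prompt.to_lowercase();
    let reasoning = ["analyze", "reason", "compare"];

    if message_count >= 10 && reasoning.iter().any(|k| prompt.contains(k)) {
        Complexity::High
    } else if token_count < 500 && prompt.contains("extract") {
        Complexity::Low
    } else {
        // No recognizable pattern: default to medium.
        Complexity::Medium
    }
}
```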
### Story 2.2: Routing Rules
```gherkin
Feature: Configurable Routing Rules
Scenario: First-match routing rule applied
Given routing rule "if feature=classify -> cheapest from [gpt-4o-mini, claude-haiku]"
And the request includes header X-DD0C-Feature: classify
When the router evaluates the request
Then the request is routed to the cheapest model in the rule set
And the routing decision is logged with strategy "cheapest"
Scenario: No matching rule falls through to default
Given routing rules exist but none match the request headers
When the router evaluates the request
Then the request is routed using the "passthrough" strategy
And the original model in the request body is used
```
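
A sketch of first-match evaluation consistent with these scenarios follows; the struct fields and the `route` function are assumed names, not the crate's real API:
```rust
#[derive(Clone)]
enum Strategy {
    Cheapest,
    Passthrough,
}

/// A routing rule as implied by the scenarios; field names are illustrative.
struct RoutingRule {
    feature: String,               // matched against the X-DD0C-Feature header
    strategy: Strategy,
    candidate_models: Vec<String>, // e.g. ["gpt-4o-mini", "claude-haiku"]
}

struct Decision {
    strategy: Strategy,
    candidates: Vec<String>,
}

/// First-match evaluation: rules are checked in priority order and the first
/// one whose feature matches the request header wins; anything else falls
/// through to passthrough with the model named in the request body.
fn route(rules: &[RoutingRule], feature_header: Option<&str>) -> Decision {
    if let Some(feature) = feature_header {
        if let Some(rule) = rules.iter().find(|r| r.feature == feature) {
            return Decision {
                strategy: rule.strategy.clone(),
                candidates: rule.candidate_models.clone(),
            };
        }
    }
    Decision { strategy: Strategy::Passthrough, candidates: Vec::new() }
}
```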
### Story 2.3: Automatic Fallback
```gherkin
Feature: Provider Fallback Chain
Scenario: Primary provider fails, fallback succeeds
Given routing rule specifies fallback chain [openai, anthropic]
And OpenAI returns HTTP 503
When the proxy processes the request
Then the request is retried against Anthropic
And the response is returned successfully
And the routing decision log shows "fallback triggered"
Scenario: Circuit breaker opens after sustained failures
Given OpenAI error rate exceeds 10% over the last 60 seconds
When a new request arrives targeting OpenAI
Then the circuit breaker is OPEN for OpenAI
And the request is immediately routed to the fallback provider
And no request is sent to OpenAI
```
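
The circuit-breaker behaviour can be sketched as a sliding-window error-rate check per provider. Only the 10% threshold and 60-second window come from the scenario; the bookkeeping below is illustrative:
```rust
use std::time::{Duration, Instant};

/// Minimal sliding-window circuit breaker for a single provider.
struct CircuitBreaker {
    window: Duration,
    samples: Vec<(Instant, bool)>, // (when, was_error)
}

impl CircuitBreaker {
    fn new() -> Self {
        Self { window: Duration::from_secs(60), samples: Vec::new() }
    }

    fn record(&mut self, was_error: bool) {
        self.samples.push((Instant::now(), was_error));
    }

    /// OPEN when the error rate over the last 60 seconds exceeds 10%.
    fn is_open(&mut self) -> bool {
        let cutoff = Instant::now() - self.window;
        self.samples.retain(|(t, _)| *t >= cutoff);
        if self.samples.is_empty() {
            return false;
        }
        let errors = self.samples.iter().filter(|(_, e)| *e).count();
        errors as f64 / self.samples.len() as f64 > 0.10
    }
}
```
When `is_open` returns true for the primary provider, the router skips it and goes straight to the next entry in the fallback chain, which is exactly what the second scenario asserts.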
### Story 2.4: Real-Time Cost Savings
```gherkin
Feature: Cost Savings Calculation
Scenario: Savings calculated when model is downgraded
Given the original request specified model "gpt-4o" ($15/1M input tokens)
And the router downgraded to "gpt-4o-mini" ($0.15/1M input tokens)
And the request used 1000 input tokens
When the cost calculator runs
Then cost_original is $0.015
And cost_actual is $0.00015
And cost_saved is $0.01485
Scenario: Zero savings when passthrough (no routing)
Given the request was routed with strategy "passthrough"
When the cost calculator runs
Then cost_saved is $0.00
```
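
The arithmetic in the first scenario is a per-million-token price applied to the input token count. A sketch (using `f64` for readability, where a real implementation would likely prefer a fixed-point money type):
```rust
/// Cost of `input_tokens` at a per-million-token price.
fn input_cost(price_per_million: f64, input_tokens: u64) -> f64 {
    price_per_million * input_tokens as f64 / 1_000_000.0
}

fn main() {
    // Numbers from the scenario: gpt-4o at $15/1M vs gpt-4o-mini at $0.15/1M,
    // 1000 input tokens.
    let cost_original = input_cost(15.0, 1000); // 0.015
    let cost_actual = input_cost(0.15, 1000);   // 0.00015
    let cost_saved = cost_original - cost_actual; // 0.01485
    println!("original={cost_original} actual={cost_actual} saved={cost_saved}");
}
```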
---
## Epic 3: Analytics Pipeline
### Story 3.1: Non-Blocking Telemetry
```gherkin
Feature: Asynchronous Telemetry Emission
Scenario: Telemetry emitted without blocking request
Given the proxy processes a request successfully
When the response is sent to the client
Then a RequestEvent is emitted to the mpsc channel
And the channel send completes in under 1ms
Scenario: Bounded channel drops events when full
Given the mpsc channel is at capacity (1000 events)
When a new telemetry event is emitted
Then the event is dropped (not blocking the proxy)
And a "telemetry_dropped" counter is incremented
```
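
A non-blocking emit matching both scenarios can be built on `tokio::sync::mpsc::Sender::try_send` against a channel bounded at 1000: the call returns immediately, and a full channel only increments a counter. The `RequestEvent` shape and counter name are placeholders:
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc::{self, error::TrySendError};

static TELEMETRY_DROPPED: AtomicU64 = AtomicU64::new(0);

struct RequestEvent { /* model, tokens, cost, latency, ... */ }

/// Emit a telemetry event without ever blocking the request path:
/// try_send returns immediately, and a full (or closed) channel just bumps a counter.
fn emit(tx: &mpsc::Sender<RequestEvent>, event: RequestEvent) {
    match tx.try_send(event) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) | Err(TrySendError::Closed(_)) => {
            TELEMETRY_DROPPED.fetch_add(1, Ordering::Relaxed);
        }
    }
}

// Channel creation, bounded at 1000 events as in the second scenario:
// let (tx, rx) = mpsc::channel::<RequestEvent>(1000);
```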
### Story 3.2: Fast Dashboard Queries
```gherkin
Feature: Continuous Aggregates for Dashboard
Scenario: Hourly cost summary is pre-calculated
Given 10,000 request events exist in the last hour
When I query GET /api/dashboard/summary?period=1h
Then the response returns in under 200ms
And total_cost, total_saved, and avg_latency are present
Scenario: Treemap data returns cost breakdown by team and model
Given request events tagged with team="payments" and team="search"
When I query GET /api/dashboard/treemap?period=7d
Then the response includes cost breakdowns per team per model
```
### Story 3.3: Automatic Data Compression
```gherkin
Feature: TimescaleDB Compression
Scenario: Chunks older than 7 days are compressed
Given request_events data exists from 10 days ago
When the compression policy runs
Then chunks older than 7 days are compressed by 90%+
And compressed data is still queryable via continuous aggregates
```
---
## Epic 4: Dashboard API
### Story 4.1: GitHub OAuth Signup
```gherkin
Feature: GitHub OAuth Authentication
Scenario: New user signs up via GitHub OAuth
Given the user has a valid GitHub account
When they complete the /api/auth/github OAuth flow
Then an organization is created automatically
And a JWT access token is returned
And a refresh token is stored in Redis
Scenario: Existing user logs in via GitHub OAuth
Given the user already has an organization
When they complete the /api/auth/github OAuth flow
Then a new JWT is issued for the existing organization
And no duplicate organization is created
```
### Story 4.2: Routing Rules & Provider Keys CRUD
```gherkin
Feature: Routing Rules Management
Scenario: Create a new routing rule
Given I am authenticated as an Owner
When I POST /api/orgs/{id}/routing-rules with a valid rule body
Then the rule is created and returned with an ID
And the rule is loaded into the proxy within 60 seconds
Scenario: Provider API key is encrypted at rest
Given I POST /api/orgs/{id}/provider-keys with key "sk-live-abc123"
When the key is stored in PostgreSQL
Then the stored value is AES-256-GCM encrypted
And decrypting with the correct KMS key returns "sk-live-abc123"
```
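
For the encryption scenario, a self-contained sketch with the `aes-gcm` crate looks like the following; in the product the data key would come from KMS, whereas here it is generated locally just to keep the example runnable:
```rust
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm,
};

fn main() -> Result<(), aes_gcm::Error> {
    // In production the data key would be supplied/unwrapped by KMS;
    // generating one locally keeps the sketch self-contained.
    let key = Aes256Gcm::generate_key(OsRng);
    let cipher = Aes256Gcm::new(&key);

    // A fresh random nonce per encryption; it is stored alongside the ciphertext.
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let ciphertext = cipher.encrypt(&nonce, b"sk-live-abc123".as_ref())?;

    // Decrypting with the same key and nonce returns the original provider key.
    let plaintext = cipher.decrypt(&nonce, ciphertext.as_ref())?;
    assert_eq!(plaintext, b"sk-live-abc123");
    Ok(())
}
```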
### Story 4.3: Dashboard Data Endpoints
```gherkin
Feature: Dashboard Summary & Treemap
Scenario: Summary endpoint returns aggregated metrics
Given request events exist for the authenticated org
When I GET /api/dashboard/summary?period=30d
Then the response includes total_cost, total_saved, request_count, avg_latency
Scenario: Request inspector with filters
Given 500 request events exist for the org
When I GET /api/requests?feature=classify&limit=20
Then 20 results are returned matching feature="classify"
And no prompt content is included in the response
```
### Story 4.4: API Key Revocation
```gherkin
Feature: API Key Revocation
Scenario: Revoked key is immediately blocked
Given API key "sk-compromised" is active
When I DELETE /api/orgs/{id}/api-keys/{key_id}
Then the key is removed from Redis cache immediately
And subsequent requests with "sk-compromised" return 401
```
---
## Epic 5: Dashboard UI
### Story 5.1: Cost Treemap
```gherkin
Feature: AI Spend Treemap Visualization
Scenario: Treemap renders cost breakdown
Given the user is logged into the dashboard
And cost data exists for teams "payments" and "search"
When the dashboard loads with period=7d
Then a treemap visualization renders
And the "payments" block is proportionally larger if it spent more
```
### Story 5.2: Real-Time Savings Counter
```gherkin
Feature: Savings Counter
Scenario: Weekly savings counter updates
Given the user is on the dashboard
And $127.50 has been saved this week
When the dashboard loads
Then the counter displays "You saved $127.50 this week"
```
### Story 5.3: Routing Rules Editor
```gherkin
Feature: Routing Rules Editor UI
Scenario: Create rule via drag-and-drop interface
Given the user navigates to Settings > Routing Rules
When they create a new rule with feature="classify" and strategy="cheapest"
And drag it to priority position 1
Then the rule is saved via the API
And the rule list reflects the new priority order
```
### Story 5.4: Request Inspector
```gherkin
Feature: Request Inspector
Scenario: Inspect routing decision for a specific request
Given the user opens the Request Inspector
When they click on a specific request row
Then the detail panel shows: model_selected, model_alternatives, cost_delta, routing_strategy
And no prompt content is displayed
```
---
## Epic 6: Shadow Audit CLI
### Story 6.1: Zero-Setup Codebase Scan
```gherkin
Feature: Shadow Audit CLI Scan
Scenario: Scan TypeScript project for LLM usage
Given a TypeScript project with 3 files calling openai.chat.completions.create()
When I run npx dd0c-scan in the project root
Then the CLI detects 3 LLM API call sites
And estimates monthly cost based on heuristic token counts
And displays projected savings with dd0c/route
Scenario: Scan runs completely offline with cached pricing
Given the pricing table was cached from a previous run
And the network is unavailable
When I run npx dd0c-scan
Then the scan completes using cached pricing data
And a warning is shown that pricing may be stale
```
### Story 6.2: No Source Code Exfiltration
```gherkin
Feature: Local-Only Scan
Scenario: No source code sent to server
Given I run npx dd0c-scan without --opt-in flag
When the scan completes
Then zero HTTP requests are made to any dd0c server
And the report is rendered entirely in the terminal
```
### Story 6.3: Terminal Report
```gherkin
Feature: Terminal Report Output
Scenario: Top Opportunities report
Given the scan found 5 files with LLM calls
When the report renders
Then it shows "Top Opportunities" sorted by potential savings
And each entry includes: file path, current model, suggested model, estimated monthly savings
```
---
## Epic 7: Slack Integration
### Story 7.1: Weekly Savings Digest
```gherkin
Feature: Monday Morning Digest
Scenario: Weekly digest sent via email
Given it is Monday at 9:00 AM UTC
And the org has cost data from the previous week
When the dd0c-worker cron fires
Then an email is sent via AWS SES to the org admin
And the email includes: total_saved, top_model_savings, week_over_week_trend
Scenario: No digest sent if no activity
Given the org had zero requests last week
When the cron fires
Then no email is sent
```
### Story 7.2: Budget Alert via Slack
```gherkin
Feature: Budget Threshold Alerts
Scenario: Daily spend exceeds configured threshold
Given the org configured alert threshold at $100/day
And today's spend reaches $101
When the dd0c-worker evaluates thresholds
Then a Slack webhook is fired with the alert payload
And the payload includes X-DD0C-Signature header
And last_fired_at is updated to prevent duplicate alerts
Scenario: Alert not re-fired for same incident
Given an alert was already fired for today's threshold breach
When the worker evaluates thresholds again
Then no duplicate Slack webhook is sent
```
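
The spec fixes the X-DD0C-Signature header name and the last_fired_at dedup field but not the signature scheme. The sketch below assumes HMAC-SHA256 over the raw JSON body (via the `hmac`, `sha2`, `hex`, and `chrono` crates) and a per-UTC-day dedup window:
```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

/// Sign the alert payload so the receiver can verify it came from dd0c.
/// HMAC-SHA256 over the raw JSON body is an assumption; only the
/// X-DD0C-Signature header name comes from the spec.
fn sign_payload(secret: &[u8], payload: &[u8]) -> String {
    let mut mac = Hmac::<Sha256>::new_from_slice(secret).expect("HMAC accepts any key length");
    mac.update(payload);
    hex::encode(mac.finalize().into_bytes())
}

/// Fire the alert only once per incident: skip if last_fired_at already falls
/// inside the current breach window (treated here as the same UTC day).
fn should_fire(
    last_fired_at: Option<chrono::DateTime<chrono::Utc>>,
    now: chrono::DateTime<chrono::Utc>,
) -> bool {
    match last_fired_at {
        Some(prev) => prev.date_naive() != now.date_naive(),
        None => true,
    }
}
```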
---
## Epic 8: Infrastructure & DevOps
### Story 8.1: ECS Fargate Deployment
```gherkin
Feature: Containerized Deployment
Scenario: CDK deploys all services to ECS Fargate
Given the CDK stack is synthesized
When cdk deploy is executed
Then ECS services are created for proxy, api, and worker
And ALB routes /v1/* to proxy and /api/* to api
And CloudFront serves static dashboard assets from S3
```
### Story 8.2: CI/CD Pipeline
```gherkin
Feature: GitHub Actions CI/CD
Scenario: Push to main triggers full pipeline
Given code is pushed to the main branch
When GitHub Actions triggers
Then tests run (unit + integration + canary suite)
And Docker images are built and pushed to ECR
And ECS services are updated with rolling deployment
And zero downtime is maintained during deployment
```
### Story 8.3: CloudWatch Alarms
```gherkin
Feature: Monitoring & Alerting
Scenario: P99 latency alarm fires
Given proxy P99 latency exceeds 50ms for 5 minutes
When CloudWatch evaluates the alarm
Then a PagerDuty incident is created
```
---
## Epic 9: Onboarding & PLG
### Story 9.1: One-Click GitHub Signup
```gherkin
Feature: Frictionless Signup
Scenario: New user completes signup in under 60 seconds
Given the user clicks "Sign up with GitHub"
When they authorize the OAuth app
Then an org is created and an API key is generated
And the user lands on the onboarding wizard
And total elapsed time is under 60 seconds
```
### Story 9.2: Free Tier Enforcement
```gherkin
Feature: Free Tier ($50/month routed spend)
Scenario: Free tier user within limit
Given the org is on the free tier
And this month's routed spend is $45
When a new request is proxied
Then the request is processed normally
Scenario: Free tier user exceeds limit
Given the org is on the free tier
And this month's routed spend is $50.01
When a new request is proxied
Then the proxy returns HTTP 429
And the response body includes an upgrade CTA with Stripe Checkout link
Scenario: Monthly counter resets on the 1st
Given the org used $50 last month
When the calendar rolls to the 1st of the new month
Then the Redis counter is reset to $0
And requests are processed normally again
```
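
The enforcement itself is a comparison against the $50 limit; the sketch below also notes one assumed way to get the monthly reset by keying the Redis counter per calendar month:
```rust
/// Free-tier gate: compare this month's routed spend (tracked in Redis,
/// keyed per org) against the $50 limit.
const FREE_TIER_LIMIT_USD: f64 = 50.0;

enum TierDecision {
    Proceed,
    RejectWithUpgradeCta, // proxy returns HTTP 429 plus a Stripe Checkout link
}

fn check_free_tier(monthly_routed_spend_usd: f64) -> TierDecision {
    if monthly_routed_spend_usd > FREE_TIER_LIMIT_USD {
        TierDecision::RejectWithUpgradeCta
    } else {
        TierDecision::Proceed
    }
}

// A counter key such as "spend:{org_id}:{YYYY-MM}" would effectively reset on
// the 1st, because the new month reads a fresh key (scheme assumed, not from the spec).
```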
### Story 9.3: API Key Management
```gherkin
Feature: API Key CRUD
Scenario: Generate new API key
Given I am authenticated as an Owner
When I POST /api/orgs/{id}/api-keys
Then a new API key is returned in plaintext (shown once)
And the key is stored as a bcrypt hash in PostgreSQL
And the key is cached in Redis for fast auth
Scenario: Rotate API key
Given API key "sk-old" is active
When I POST /api/orgs/{id}/api-keys/{key_id}/rotate
Then a new key "sk-new" is returned
And "sk-old" is immediately invalidated
```
### Story 9.4: First Route Onboarding
```gherkin
Feature: 2-Minute First Route
Scenario: User completes onboarding wizard
Given the user just signed up
When they copy the API key (step 1)
And paste the provided curl command (step 2)
And the request appears in the dashboard (step 3)
Then the onboarding is marked complete
And a PostHog event "routing.savings.first_dollar" is tracked (if savings occurred)
```
### Story 9.5: Team Invites
```gherkin
Feature: Team Member Invites
Scenario: Invite team member via email
Given I am an Owner of org "acme"
When I POST /api/orgs/{id}/invites with email "dev@acme.com"
Then an email with a magic link is sent
And the magic link JWT expires in 72 hours
Scenario: Invited user joins existing org
Given "dev@acme.com" received an invite magic link
When they click the link and complete GitHub OAuth
Then they are added to org "acme" as a Member (not Owner)
And no new org is created
```
---
## Epic 10: Transparent Factory Compliance
### Story 10.1: Atomic Flagging
```gherkin
Feature: Feature Flag Infrastructure
Scenario: New routing strategy behind a flag (default off)
Given feature flag "enable_cascading_router" exists with default=off
When a request arrives that would trigger cascading routing
Then the passthrough strategy is used instead
And the flag evaluation completes without network calls
Scenario: Flag auto-disables on latency regression
Given flag "enable_new_classifier" is at 50% rollout
And P99 latency increased by 8% since flag was enabled
When the circuit breaker evaluates flag health
Then the flag is auto-disabled within 30 seconds
And an alert is fired
Scenario: CI blocks deployment with expired flag
Given flag "old_experiment" has TTL expired and rollout=100%
When CI runs the flag audit
Then the build fails with "Expired flag at full rollout: old_experiment"
```
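
The "no network calls" requirement in the first scenario implies flag evaluation is an in-process lookup over a store refreshed out of band. A minimal sketch (the store shape is assumed):
```rust
use std::collections::HashMap;

/// In-process flag store: evaluation is a map lookup, so it never makes a
/// network call on the request path (flags are refreshed out of band).
struct FlagStore {
    flags: HashMap<String, bool>,
}

impl FlagStore {
    /// Unknown flags evaluate to their default, which is off.
    fn is_enabled(&self, name: &str) -> bool {
        *self.flags.get(name).unwrap_or(&false)
    }
}

fn main() {
    let store = FlagStore { flags: HashMap::new() };
    // "enable_cascading_router" has not been turned on, so the proxy
    // falls back to the passthrough strategy.
    assert!(!store.is_enabled("enable_cascading_router"));
}
```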
### Story 10.2: Elastic Schema
```gherkin
Feature: Additive-Only Schema Migrations
Scenario: Migration with DROP COLUMN is rejected
Given a migration file contains "ALTER TABLE request_events DROP COLUMN old_field"
When CI runs the schema lint
Then the build fails with "Destructive schema change detected"
Scenario: V1 code ignores V2 fields
Given a request_events row contains a new "routing_v2" column
When V1 Rust code deserializes the row
Then deserialization succeeds (unknown fields ignored)
And no error is logged
```
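
The V1-ignores-V2 scenario relies on serde's default behaviour of skipping unknown fields unless `#[serde(deny_unknown_fields)]` is present. A small demonstration with assumed field names, shown over JSON (the same property applies to whichever deserialization path V1 actually uses):
```rust
use serde::Deserialize;

// V1 view of a request_events row. serde ignores unknown fields by default
// (no #[serde(deny_unknown_fields)]), so a payload carrying the newer
// "routing_v2" column still deserializes cleanly.
#[derive(Deserialize)]
struct RequestEventV1 {
    model: String,
    cost_usd: f64,
}

fn main() {
    let row = r#"{ "model": "gpt-4o-mini", "cost_usd": 0.00015,
                   "routing_v2": { "strategy": "cascade" } }"#;
    let event: RequestEventV1 = serde_json::from_str(row).expect("unknown fields are ignored");
    assert_eq!(event.model, "gpt-4o-mini");
}
```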
### Story 10.3: Cognitive Durability
```gherkin
Feature: Decision Logs for Routing Logic
Scenario: PR touching router requires decision log
Given a PR modifies files in src/router/
And no decision_log.json is included
When CI runs the decision log check
Then the build fails with "Decision log required for routing changes"
Scenario: Cyclomatic complexity exceeds cap
Given a function in src/router/ has cyclomatic complexity 12
When cargo clippy runs with cognitive_complexity threshold 10
Then the lint fails
```
### Story 10.4: Semantic Observability
```gherkin
Feature: AI Reasoning Spans
Scenario: Routing decision emits OTEL span
Given a request is routed from gpt-4o to gpt-4o-mini
When the routing decision completes
Then an "ai_routing_decision" span is created
And span attributes include: ai.model_selected, ai.cost_delta, ai.complexity_score
And ai.prompt_hash is a SHA-256 hash (not raw content)
Scenario: No PII in any span
Given a request with user email in the prompt
When the span is emitted
Then no span attribute contains the email address
And ai.prompt_hash is the only prompt-related attribute
```
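
A sketch of the span emission using `sha2`, `hex`, and `tracing`; the attribute names come from the scenario, while routing the span through `tracing` (with an OTLP layer) rather than the OpenTelemetry SDK directly is an assumption about the wiring:
```rust
use sha2::{Digest, Sha256};

/// Hash the prompt so spans carry a stable identifier but never raw content.
fn prompt_hash(prompt: &str) -> String {
    hex::encode(Sha256::digest(prompt.as_bytes()))
}

fn record_routing_decision(prompt: &str, selected: &str, cost_delta: f64, complexity: &str) {
    // Attribute names mirror the scenario; no raw prompt or PII is attached.
    let span = tracing::info_span!(
        "ai_routing_decision",
        ai.model_selected = selected,
        ai.cost_delta = cost_delta,
        ai.complexity_score = complexity,
        ai.prompt_hash = %prompt_hash(prompt),
    );
    let _guard = span.enter();
    // ... routing work happens inside the span ...
}
```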
### Story 10.5: Configurable Autonomy
```gherkin
Feature: Governance Policy
Scenario: Strict mode blocks auto-applied config changes
Given governance_mode is "strict"
When the background task attempts to refresh routing rules
Then the refresh is blocked
And a log entry "Blocked by strict mode" is written
Scenario: Panic mode freezes to last-known-good
Given panic_mode is triggered via POST /admin/panic
When a new request arrives
Then routing uses the frozen last-known-good configuration
And auto-failover is disabled
And the response header includes "X-DD0C-Panic: active"
```
---
*End of dd0c/route BDD Acceptance Specifications — 10 Epics, 50+ Scenarios*