# dd0c/route — Test Architecture & TDD Strategy

**Product:** dd0c/route — LLM Cost Router & Optimization Dashboard

**Author:** Test Architecture Phase

**Date:** February 28, 2026

**Status:** V1 MVP — Solo Founder Scope

---

## Section 1: Testing Philosophy & TDD Workflow

### 1.1 Core Philosophy

dd0c/route is a **latency-sensitive proxy** with correctness requirements that compound: a wrong routing decision costs money, a wrong cost calculation misleads customers, and a wrong auth check is a security incident. Tests are not optional — they are the specification.

The guiding principle: **tests describe behavior, not implementation**. A test that breaks when you rename a private function is a bad test. A test that breaks when you accidentally route a complex request to a cheap model is a good test.

For a solo founder, the test suite is also the **second developer** — it catches regressions when Brian is moving fast and hasn't slept enough.

### 1.2 Red-Green-Refactor Adapted to dd0c

The standard TDD cycle applies, but with product-specific adaptations:

```
RED      → Write a failing test that describes the desired behavior
           (e.g., "a request tagged feature=classify should route to gpt-4o-mini")

GREEN    → Write the minimum code to make it pass
           (no premature optimization — just make it work)

REFACTOR → Clean up without breaking tests
           (extract the complexity classifier into its own module,
           add the proptest property suite, optimize the hot path)
```

**When to write tests first (strict TDD):**

- All Router Brain logic (routing rules, complexity classifier, cost calculations)
- All auth/security paths (key validation, JWT issuance, RBAC checks)
- All cost calculation formulas (`cost_saved = cost_original - cost_actual`)
- All circuit breaker state transitions
- All schema migration validators

**When integration tests lead (test-after, then harden):**

- Provider format translation (OpenAI ↔ Anthropic) — write the translator, then write contract tests against real response fixtures
- TimescaleDB continuous aggregate queries — create the schema, run queries against Testcontainers, then lock in the behavior
- SSE streaming passthrough — implement the stream relay, then write tests that assert chunk ordering and `[DONE]` handling

**When E2E tests lead:**

- The "first route" onboarding journey — define the happy path first, then build backward
- The Shadow Audit CLI output format — define the expected terminal output, then build the parser

### 1.3 Test Naming Conventions

All tests follow the **Given-When-Then** naming pattern expressed as a single descriptive string:

```rust
// Rust unit tests
#[test]
fn complexity_classifier_returns_low_for_short_extraction_prompts() { ... }

#[test]
fn router_selects_cheapest_model_when_strategy_is_cheapest_and_complexity_is_low() { ... }

#[test]
fn circuit_breaker_opens_after_threshold_error_rate_exceeded() { ... }

#[test]
fn cost_calculation_returns_zero_savings_when_requested_and_used_model_are_identical() { ... }
```

```rust
// Integration tests (in tests/ directory)
#[tokio::test]
async fn proxy_forwards_streaming_request_to_openai_and_returns_sse_chunks() { ... }

#[tokio::test]
async fn proxy_returns_401_when_api_key_is_revoked() { ... }
```

```typescript
// TypeScript (Dashboard UI / CLI)
describe('CostTreemap', () => {
  it('renders spend breakdown by feature tag when data is loaded', () => { ... });
  it('shows empty state when no requests exist for the selected period', () => { ... });
});

describe('dd0c-scan CLI', () => {
  it('detects gpt-4o usage in TypeScript files and estimates monthly cost', () => { ... });
});
```

**Rules:**

- No `test_` prefix in Rust (redundant inside `#[cfg(test)]`)
- No `should_` prefix (verbose, adds no information)
- Use `_` as word separator in Rust, camelCase in TypeScript
- Name describes the **observable outcome**, not the internal mechanism
- If you can't name the test without saying "and", split it into two tests

---

## Section 2: Test Pyramid

### 2.1 Recommended Ratio

```
          ┌─────────────────┐
          │   E2E / Smoke   │    ~5%  (~20 tests)
          │  (Playwright,   │
          │   k6 journeys)  │
        ──┴─────────────────┴──
        ┌───────────────────────┐
        │   Integration Tests   │   ~20%  (~80 tests)
        │   (Testcontainers,    │
        │    contract tests)    │
      ──┴───────────────────────┴──
      ┌─────────────────────────────┐
      │         Unit Tests          │  ~75% (~300 tests)
      │  (#[cfg(test)], proptest,   │
      │       mockall, vitest)      │
      └─────────────────────────────┘
```

**Target: ~400 tests at V1 launch.** A fast feedback loop is more valuable than exhaustive coverage at this stage. The entire suite must complete in CI in under 5 minutes.

### 2.2 Unit Test Targets (per component)

| Component | Target Test Count | Key Focus |
|-----------|-------------------|-----------|
| Router Brain (rule engine) | ~60 | Rule matching, strategy execution, edge cases |
| Complexity Classifier | ~40 | Token count thresholds, regex patterns, confidence scores |
| Cost Calculator | ~30 | Formula correctness, precision, zero-savings edge cases |
| Circuit Breaker | ~25 | State transitions, threshold logic, Redis key format |
| Auth (key validation, JWT) | ~30 | Valid/invalid/revoked keys, JWT claims, RBAC |
| Provider Translators | ~30 | OpenAI↔Anthropic format mapping, streaming chunks |
| Analytics Pipeline (batch logic) | ~20 | Batching thresholds, flush triggers, error handling |
| Dashboard API handlers | ~40 | Request validation, response shape, error codes |
| Shadow Audit CLI parser | ~25 | File detection, token estimation, report formatting |
| **Total** | **~300** | |

### 2.3 Integration Test Boundaries

| Boundary | Test Type | Tool |
|----------|-----------|------|
| Proxy → TimescaleDB | DB integration | Testcontainers (TimescaleDB image) |
| Proxy → Redis | Cache integration | Testcontainers (Redis image) |
| Proxy → PostgreSQL | DB integration | Testcontainers (PostgreSQL image) |
| Proxy → OpenAI API | Contract test | Recorded fixtures (no live calls in CI) |
| Proxy → Anthropic API | Contract test | Recorded fixtures |
| Dashboard API → PostgreSQL | DB integration | Testcontainers |
| Dashboard API → TimescaleDB | DB integration | Testcontainers |
| Worker → TimescaleDB | DB integration | Testcontainers |
| Worker → SES | Mock integration | `wiremock-rs` |
| Worker → Slack webhooks | Mock integration | `wiremock-rs` |

### 2.4 E2E / Smoke Test Scenarios

| Scenario | Priority | Tool |
|----------|----------|------|
| New user signs up via GitHub OAuth and gets API key | P0 | Playwright |
| Developer swaps base URL and first request routes correctly | P0 | curl / k6 |
| Routing rule created in UI takes effect on next proxy request | P0 | Playwright + k6 |
| Budget alert fires when threshold is crossed | P1 | k6 + webhook receiver |
| `npx dd0c-scan` runs on sample repo and produces report | P1 | Node.js test runner |
| Dashboard treemap renders after 100 synthetic requests | P1 | Playwright |
| Proxy continues routing when TimescaleDB is unavailable | P0 | Chaos (kill container) |

---

## Section 3: Unit Test Strategy (Per Component)

### 3.1 Proxy Engine (`crates/proxy`)

**What to test:**

- Request parsing: extraction of model, messages, headers, stream flag
- Auth middleware: Redis cache hit, cache miss → PG fallback, revoked key, malformed key
- Response header injection: `X-DD0C-Model`, `X-DD0C-Cost`, `X-DD0C-Saved` values
- SSE chunk passthrough: ordering, `[DONE]` detection, token count extraction from final chunk
- Graceful degradation: telemetry channel full → drop event, don't block request
- Rate limiting: per-key counter increment, 429 response when exceeded

**Key test cases:**

```rust
#[cfg(test)]
mod proxy_tests {
    use super::*;
    use mockall::predicate::*;

    #[test]
    fn parse_request_extracts_model_and_stream_flag() {
        let body = r#"{"model":"gpt-4o","messages":[{"role":"user","content":"hi"}],"stream":true}"#;
        let req = ProxyRequest::parse(body).unwrap();
        assert_eq!(req.model, "gpt-4o");
        assert!(req.stream);
    }

    #[test]
    fn parse_request_extracts_dd0c_feature_tag_from_headers() {
        let headers = make_headers([("X-DD0C-Feature", "classify")]);
        let tags = extract_tags(&headers);
        assert_eq!(tags.feature, Some("classify".to_string()));
    }

    #[tokio::test]
    async fn auth_middleware_returns_401_for_unknown_key() {
        let mut mock_cache = MockKeyCache::new();
        mock_cache.expect_get().returning(|_| Ok(None));
        let mut mock_db = MockKeyStore::new();
        mock_db.expect_lookup().returning(|_| Ok(None));

        let result = validate_api_key("dd0c_sk_live_unknown", &mock_cache, &mock_db).await;
        assert_eq!(result, Err(AuthError::InvalidKey));
    }

    #[tokio::test]
    async fn auth_middleware_caches_valid_key_after_db_lookup() {
        let mut mock_cache = MockKeyCache::new();
        mock_cache.expect_get().returning(|_| Ok(None));
        mock_cache.expect_set().times(1).returning(|_, _| Ok(()));
        let mut mock_db = MockKeyStore::new();
        mock_db.expect_lookup().returning(|_| Ok(Some(make_api_key())));

        validate_api_key("dd0c_sk_live_valid", &mock_cache, &mock_db).await.unwrap();
    }

    #[test]
    fn telemetry_emitter_drops_event_when_channel_is_full_without_blocking() {
        let (tx, _rx) = tokio::sync::mpsc::channel(1);
        tx.try_send(make_event()).unwrap(); // fill the channel
        let result = try_emit_telemetry(&tx, make_event());
        assert!(result.is_ok()); // graceful drop, no panic
    }
}
```

**Mocking strategy:**

- `MockKeyCache` and `MockKeyStore` via `mockall` for auth tests
- `MockLlmProvider` for dispatch tests — returns canned responses without network
- Bounded `mpsc` channels to test backpressure behavior

**Property-based tests (`proptest`):**

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn api_key_hash_is_deterministic(key in "[a-zA-Z0-9]{32}") {
        let h1 = hash_api_key(&key);
        let h2 = hash_api_key(&key);
        prop_assert_eq!(h1, h2);
    }

    #[test]
    fn response_headers_never_contain_prompt_content(
        prompt in ".{1,500}",
        model in "gpt-4o|gpt-4o-mini|claude-3-haiku"
    ) {
        // The generated prompt must flow into the event under test,
        // otherwise this property is vacuous.
        let event = make_event_with_prompt(&prompt);
        let headers = build_response_headers(&make_routing_decision(&model), &event);
        for (_, value) in &headers {
            prop_assert!(!value.to_str().unwrap_or("").contains(&prompt));
        }
    }
}
```

---

### 3.2 Router Brain (`crates/shared/router`)

This is the highest-value test target. Routing logic directly affects customer savings — bugs here cost money.

**What to test:**

- Rule matching: first-match-wins, tag matching, model matching, complexity matching
- Strategy execution: `passthrough`, `cheapest`, `quality_first`, `cascading`
- Budget enforcement: hard limit reached → throttle to cheapest or reject
- Complexity classifier: token count thresholds, regex pattern matching, confidence output
- Cost calculation: formula correctness, floating-point precision, zero-savings case
- Circuit breaker: CLOSED→OPEN→HALF_OPEN→CLOSED transitions, Redis key format

**Key test cases:**

```rust
#[cfg(test)]
mod router_tests {
    use super::*;

    #[test]
    fn rule_engine_returns_first_matching_rule_by_priority() {
        let rules = vec![
            make_rule(0, "classify", RoutingStrategy::Cheapest),
            make_rule(1, "classify", RoutingStrategy::Passthrough),
        ];
        let req = make_request_with_feature("classify");
        let decision = evaluate_rules(&rules, &req, &cost_tables());
        assert_eq!(decision.strategy, RoutingStrategy::Cheapest);
    }

    #[test]
    fn rule_engine_falls_through_to_passthrough_when_no_rules_match() {
        let rules = vec![make_rule(0, "summarize", RoutingStrategy::Cheapest)];
        let req = make_request_with_feature("classify");
        let decision = evaluate_rules(&rules, &req, &cost_tables());
        assert_eq!(decision.strategy, RoutingStrategy::Passthrough);
        assert_eq!(decision.target_model, req.model);
    }

    #[test]
    fn cheapest_strategy_selects_lowest_cost_model_from_chain() {
        let chain = vec!["gpt-4o", "gpt-4o-mini", "claude-3-haiku"];
        let costs = cost_tables_with(&[
            ("gpt-4o",         2.50, 10.00),
            ("gpt-4o-mini",    0.15,  0.60),
            ("claude-3-haiku", 0.25,  1.25),
        ]);
        let model = select_cheapest(&chain, &costs, 500, 100);
        assert_eq!(model, "gpt-4o-mini");
    }

    #[test]
    fn classifier_returns_low_for_short_extraction_system_prompt() {
        let messages = vec![
            system("Extract the sentiment. Reply with one word."),
            user("The product is great!"),
        ];
        let result = classify_complexity(&messages, "gpt-4o");
        assert_eq!(result.level, ComplexityLevel::Low);
        assert!(result.confidence > 0.7);
    }

    #[test]
    fn classifier_returns_high_for_code_generation_prompt() {
        let messages = vec![
            system("You are an expert software engineer. Write production-quality code."),
            user("Implement a binary search tree with insertion, deletion, and traversal."),
        ];
        let result = classify_complexity(&messages, "gpt-4o");
        assert_eq!(result.level, ComplexityLevel::High);
    }

    #[test]
    fn cost_saved_is_zero_when_requested_and_used_model_are_identical() {
        let event = make_event("gpt-4o-mini", "gpt-4o-mini", 1000, 200);
        assert_eq!(calculate_cost_saved(&event, &cost_tables()), 0.0);
    }

    #[test]
    fn cost_saved_is_positive_when_routed_to_cheaper_model() {
        let costs = cost_tables_with(&[
            ("gpt-4o",      2.50, 10.00),
            ("gpt-4o-mini", 0.15,  0.60),
        ]);
        let event = make_event("gpt-4o", "gpt-4o-mini", 1_000_000, 200_000);
        let saved = calculate_cost_saved(&event, &costs);
        // (2.50 - 0.15) * 1 + (10.00 - 0.60) * 0.2 = 2.35 + 1.88 = 4.23
        assert!((saved - 4.23).abs() < 0.01);
    }

    #[test]
    fn circuit_breaker_transitions_to_open_after_error_threshold() {
        // args: error-rate threshold, sliding-window length
        let mut cb = CircuitBreaker::new(0.10, Duration::from_secs(60));
        for _ in 0..9 { cb.record_success(); }
        cb.record_failure(); // 10% error rate — exactly at threshold
        assert_eq!(cb.state(), CircuitState::Open);
    }

    #[test]
    fn circuit_breaker_transitions_to_half_open_after_cooldown() {
        let mut cb = CircuitBreaker::new(0.10, Duration::from_secs(60));
        cb.force_open();
        cb.advance_time(Duration::from_secs(31));
        assert_eq!(cb.state(), CircuitState::HalfOpen);
    }
}
```

**Property-based tests:**

```rust
proptest! {
    #[test]
    fn cheapest_strategy_never_selects_more_expensive_model(
        input_tokens in 1u32..1_000_000u32,
        output_tokens in 1u32..100_000u32,
    ) {
        let chain = vec!["gpt-4o", "gpt-4o-mini", "claude-3-haiku"];
        let costs = cost_tables();
        let selected = select_cheapest(&chain, &costs, input_tokens, output_tokens);
        let selected_cost = compute_cost(&selected, input_tokens, output_tokens, &costs);
        for model in &chain {
            let model_cost = compute_cost(model, input_tokens, output_tokens, &costs);
            prop_assert!(selected_cost <= model_cost);
        }
    }

    #[test]
    fn complexity_classifier_never_panics_on_arbitrary_input(
        system_prompt in ".*",
        user_message in ".*",
        model in "gpt-4o|gpt-4o-mini|claude-3-haiku",
    ) {
        let messages = vec![system(&system_prompt), user(&user_message)];
        let result = classify_complexity(&messages, &model);
        prop_assert!(result.confidence >= 0.0 && result.confidence <= 1.0);
    }

    #[test]
    fn cost_saved_is_never_negative(
        input_tokens in 1u32..1_000_000u32,
        output_tokens in 1u32..100_000u32,
    ) {
        let costs = cost_tables();
        for (requested, used) in routable_model_pairs() {
            let event = make_event(requested, used, input_tokens, output_tokens);
            prop_assert!(calculate_cost_saved(&event, &costs) >= 0.0);
        }
    }
}
```

---

### 3.3 Analytics Pipeline (telemetry worker)

**What to test:**

- Batch collector flushes at 100 events OR 1 second, whichever comes first
- Handles worker panic without losing buffered events (bounded channel survives)
- `RequestEvent` serializes correctly to PostgreSQL COPY format
- Graceful degradation: DB unavailable → events dropped, proxy continues unaffected

```rust
#[tokio::test]
async fn batch_collector_flushes_after_100_events_before_timeout() {
    let (tx, rx) = mpsc::channel(1000);
    let flush_count = Arc::new(AtomicU32::new(0));
    let mock_db = MockTelemetryDb::counting(flush_count.clone());
    let worker = spawn_batch_worker(rx, mock_db, 100, Duration::from_secs(10));

    for _ in 0..100 { tx.send(make_event()).await.unwrap(); }
    tokio::time::sleep(Duration::from_millis(50)).await;

    assert_eq!(flush_count.load(Ordering::SeqCst), 1);
    worker.abort();
}

#[tokio::test]
async fn batch_collector_flushes_partial_batch_after_interval() {
    let (tx, rx) = mpsc::channel(1000);
    let flush_count = Arc::new(AtomicU32::new(0));
    let mock_db = MockTelemetryDb::counting(flush_count.clone());
    let worker = spawn_batch_worker(rx, mock_db, 100, Duration::from_secs(1));

    tx.send(make_event()).await.unwrap(); // only 1 event
    tokio::time::sleep(Duration::from_millis(1100)).await;

    assert_eq!(flush_count.load(Ordering::SeqCst), 1);
    worker.abort();
}

#[tokio::test]
async fn proxy_continues_routing_when_telemetry_db_is_unavailable() {
    let failing_db = MockTelemetryDb::always_failing();
    let (tx, rx) = mpsc::channel(1000);
    spawn_batch_worker(rx, failing_db, 1, Duration::from_millis(10));

    // Proxy should still be able to send events without blocking
    for _ in 0..200 {
        let _ = tx.try_send(make_event()); // may drop when full — that's fine
    }
    // No panic, no deadlock
}
```

---

### 3.4 Dashboard API (`crates/api`)

**What to test:**

- GitHub OAuth: state parameter validation, code exchange, user upsert
- JWT issuance: claims (sub, org_id, role, exp), RS256 signature verification
- RBAC: member cannot modify routing rules, owner can do everything
- API key CRUD: create returns full key once, list returns prefix only, revoke invalidates
- Provider credential encryption: stored value differs from plaintext input
- Request inspector: pagination cursor, filter application, no prompt content in response

```rust
#[tokio::test]
async fn create_api_key_returns_full_key_only_once() {
    let app = test_app().await;
    let resp = app.post("/api/orgs/test-org/keys")
        .json(&json!({"name": "production"})).send().await;

    assert_eq!(resp.status(), 201);
    let body: Value = resp.json().await;
    assert!(body["key"].as_str().unwrap().starts_with("dd0c_sk_live_"));

    // Listing must NOT return the full key
    let list: Value = app.get("/api/orgs/test-org/keys").send().await.json().await;
    assert!(list["data"][0]["key"].is_null());
    assert!(list["data"][0]["key_prefix"].as_str().unwrap().len() < 20);
}

#[tokio::test]
async fn member_role_cannot_create_routing_rules() {
    let app = test_app_with_role(Role::Member).await;
    let resp = app.post("/api/orgs/test-org/routing/rules")
        .json(&make_rule_payload()).send().await;
    assert_eq!(resp.status(), 403);
}

#[tokio::test]
async fn request_inspector_never_returns_prompt_content() {
    let app = test_app_with_events(100).await;
    let body: Value = app.get("/api/orgs/test-org/requests").send().await.json().await;
    for event in body["data"].as_array().unwrap() {
        assert!(event.get("messages").is_none());
        assert!(event.get("prompt").is_none());
        assert!(event.get("content").is_none());
    }
}

#[tokio::test]
async fn provider_credential_is_stored_encrypted() {
    let app = test_app().await;
    app.put("/api/orgs/test-org/providers/openai")
        .json(&json!({"api_key": "sk-plaintext-key"})).send().await;

    let stored = fetch_raw_credential_from_db("test-org", "openai").await;
    assert_ne!(stored.encrypted_key, b"sk-plaintext-key");
    assert!(stored.encrypted_key.len() > 16); // has GCM nonce + ciphertext
}
```

---

### 3.5 Shadow Audit CLI (`cli/`)

**What to test:**

- File scanner detects OpenAI/Anthropic SDK usage in `.ts`, `.js`, `.py`
- Model extractor parses model string from SDK call arguments
- Token estimator produces non-zero estimate for non-empty prompts
- Report formatter includes savings percentage, top opportunities, sign-up CTA
- Offline mode works when pricing cache exists on disk

```typescript
describe('FileScanner', () => {
  it('detects openai SDK usage in TypeScript files', () => {
    const code = `const r = await client.chat.completions.create({ model: 'gpt-4o' })`;
    const calls = scanFile('service.ts', code);
    expect(calls).toHaveLength(1);
    expect(calls[0].model).toBe('gpt-4o');
  });

  it('detects anthropic SDK usage in Python files', () => {
    const code = `client.messages.create(model="claude-3-opus-20240229")`;
    const calls = scanFile('service.py', code);
    expect(calls[0].model).toBe('claude-3-opus-20240229');
  });

  it('ignores commented-out SDK calls', () => {
    const code = `// client.chat.completions.create({ model: 'gpt-4o' })`;
    expect(scanFile('service.ts', code)).toHaveLength(0);
  });
});

describe('SavingsReport', () => {
  it('calculates positive savings when cheaper model is available', () => {
    const calls = [{ model: 'gpt-4o', estimatedMonthlyTokens: 10_000_000 }];
    const report = generateReport(calls, mockPricingTable);
    expect(report.totalSavings).toBeGreaterThan(0);
    expect(report.savingsPercentage).toBeGreaterThan(0);
  });

  it('includes sign-up CTA in formatted output', () => {
    const output = formatReport(mockReport);
    expect(output).toContain('route.dd0c.dev');
  });
});
```

---

## Section 4: Integration Test Strategy

### 4.1 Service Boundary Tests

Integration tests live in `tests/` at the crate root and use **Testcontainers** to spin up real dependencies. No mocks at the service boundary — if it talks to a database, it talks to a real one.

**Dependency:** `testcontainers` crate + Docker daemon in CI.

```toml
# Cargo.toml (dev-dependencies)
[dev-dependencies]
testcontainers = "0.15"
testcontainers-modules = { version = "0.3", features = ["postgres", "redis"] }
tokio = { version = "1", features = ["full", "test-util"] }
wiremock = "0.6"
```

#### Proxy ↔ TimescaleDB

```rust
// tests/analytics_integration.rs
use testcontainers::clients::Cli;
use testcontainers_modules::postgres::Postgres;

#[tokio::test]
async fn batch_worker_inserts_events_into_timescaledb_hypertable() {
    let docker = Cli::default();
    // Note: the stock Postgres image covers schema-level assertions;
    // hypertable behavior requires a TimescaleDB-enabled image.
    let pg = docker.run(Postgres::default().with_tag("15-alpine"));
    let db_url = format!(
        "postgres://postgres:postgres@localhost:{}/postgres",
        pg.get_host_port_ipv4(5432)
    );

    run_migrations(&db_url).await;
    enable_timescaledb(&db_url).await;

    let (tx, rx) = mpsc::channel(100);
    let worker = spawn_batch_worker(rx, db_url.clone(), 10, Duration::from_millis(100));

    for _ in 0..10 {
        tx.send(make_event()).await.unwrap();
    }
    tokio::time::sleep(Duration::from_millis(200)).await;

    let count: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM request_events")
        .fetch_one(&pool(&db_url).await).await.unwrap();
    assert_eq!(count, 10);
    worker.abort();
}

#[tokio::test]
async fn continuous_aggregate_reflects_inserted_events_after_refresh() {
    // ... setup TimescaleDB, insert 100 events, trigger aggregate refresh,
    // assert hourly_cost_summary has correct totals
}
```

#### Proxy ↔ Redis

```rust
// tests/cache_integration.rs
use testcontainers_modules::redis::Redis;

#[tokio::test]
async fn api_key_cache_stores_and_retrieves_key_within_ttl() {
    let docker = Cli::default();
    let redis = docker.run(Redis::default());
    let client = connect_redis(redis.get_host_port_ipv4(6379)).await;

    let key = make_api_key();
    cache_api_key(&client, &key, Duration::from_secs(60)).await.unwrap();

    let retrieved = get_cached_key(&client, &key.hash).await.unwrap();
    assert_eq!(retrieved.unwrap().org_id, key.org_id);
}

#[tokio::test]
async fn circuit_breaker_state_is_shared_across_two_proxy_instances() {
    let docker = Cli::default();
    let redis = docker.run(Redis::default());
    let client1 = connect_redis(redis.get_host_port_ipv4(6379)).await;
    let client2 = connect_redis(redis.get_host_port_ipv4(6379)).await;

    let cb1 = RedisCircuitBreaker::new("openai", client1);
    let cb2 = RedisCircuitBreaker::new("openai", client2);

    cb1.force_open().await.unwrap();

    // Instance 2 should see the open circuit set by instance 1
    assert_eq!(cb2.state().await.unwrap(), CircuitState::Open);
}

#[tokio::test]
async fn rate_limit_counter_increments_and_enforces_limit() {
    let docker = Cli::default();
    let redis = docker.run(Redis::default());
    let client = connect_redis(redis.get_host_port_ipv4(6379)).await;

    // args: limit = 5 requests per 60-second window
    let limiter = RateLimiter::new(client, 5, Duration::from_secs(60));
    for _ in 0..5 {
        assert!(limiter.check_and_increment("key_abc").await.unwrap());
    }
    // 6th request should be rejected
    assert!(!limiter.check_and_increment("key_abc").await.unwrap());
}
```

#### Dashboard API ↔ PostgreSQL

```rust
// tests/api_db_integration.rs

#[tokio::test]
async fn create_org_and_api_key_persists_to_postgres() {
    let docker = Cli::default();
    let pg = docker.run(Postgres::default());
    let pool = setup_test_db(pg.get_host_port_ipv4(5432)).await;

    let org = create_organization(&pool, "Acme Corp").await.unwrap();
    let (key, raw) = create_api_key(&pool, org.id, "production").await.unwrap();

    // Raw key is never stored
    let stored: ApiKey = sqlx::query_as("SELECT * FROM api_keys WHERE id = $1")
        .bind(key.id).fetch_one(&pool).await.unwrap();
    assert_ne!(stored.key_hash, raw); // hash != raw key
    assert!(stored.key_prefix.starts_with("dd0c_sk_"));
}

#[tokio::test]
async fn routing_rules_are_returned_in_priority_order() {
    let pool = test_pool().await;
    let org_id = seed_org(&pool).await;

    // args: (pool, org_id, priority, name)
    insert_rule(&pool, org_id, 10, "low priority").await;
    insert_rule(&pool, org_id, 1, "high priority").await;
    insert_rule(&pool, org_id, 5, "mid priority").await;

    let rules = get_routing_rules(&pool, org_id).await.unwrap();
    assert_eq!(rules[0].name, "high priority");
    assert_eq!(rules[1].name, "mid priority");
    assert_eq!(rules[2].name, "low priority");
}
```

---

### 4.2 Contract Tests for OpenAI API Compatibility

The proxy's core promise is drop-in OpenAI compatibility. Contract tests verify this using **recorded fixtures** — real OpenAI/Anthropic responses captured once and replayed in CI without live API calls.

**Fixture capture workflow:**

1. Run `cargo test --features=record-fixtures` once against live APIs (requires real keys)
2. Fixtures saved to `tests/fixtures/openai/` and `tests/fixtures/anthropic/`
3. CI always uses recorded fixtures — no live API calls, no flakiness, no cost
|
||
|
|
|
||
|
|
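The `load_fixture` / `load_sse_fixture` helpers used by the contract tests can stay trivial: resolve a path against this layout, read the file, and panic loudly so a forgotten `record-fixtures` run fails fast in CI. A minimal sketch, assuming the directory layout above (`fixture_root` and `load_fixture_raw` are illustrative names, not the shipped helpers; JSON parsing happens at the call site):

```rust
use std::fs;
use std::path::PathBuf;

/// Root of the recorded fixtures, relative to the crate being tested.
fn fixture_root() -> PathBuf {
    std::env::var("CARGO_MANIFEST_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(|_| PathBuf::from("."))
        .join("tests")
        .join("fixtures")
}

/// Load a recorded fixture as raw text; callers parse JSON responses
/// or replay SSE bodies byte-for-byte. Panics on a missing file so a
/// forgotten capture run fails fast instead of silently skipping.
fn load_fixture_raw(relative: &str) -> String {
    let path = fixture_root().join(relative);
    fs::read_to_string(&path)
        .unwrap_or_else(|e| panic!("missing fixture {}: {e}", path.display()))
}
```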
```rust
// tests/contract_openai.rs
use serde_json::{json, Value};
use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};

#[tokio::test]
async fn proxy_response_matches_openai_response_schema() {
    let fixture = load_fixture("openai/chat_completions_non_streaming.json");
    let mock_provider = MockServer::start().await;
    Mock::given(method("POST"))
        .and(path("/v1/chat/completions"))
        .respond_with(ResponseTemplate::new(200).set_body_json(&fixture))
        .mount(&mock_provider).await;

    let proxy = start_test_proxy(mock_provider.uri()).await;
    let response = proxy.post("/v1/chat/completions")
        .header("Authorization", "Bearer dd0c_sk_live_test")
        .json(&standard_chat_request())
        .send().await;

    assert_eq!(response.status(), 200);
    let body: Value = response.json().await;
    // Assert OpenAI schema compliance
    assert!(body["id"].as_str().unwrap().starts_with("chatcmpl-"));
    assert_eq!(body["object"], "chat.completion");
    assert!(body["choices"][0]["message"]["content"].is_string());
    assert!(body["usage"]["prompt_tokens"].is_number());
}

#[tokio::test]
async fn proxy_preserves_sse_chunk_ordering_for_streaming_requests() {
    let fixture_chunks = load_sse_fixture("openai/chat_completions_streaming.txt");
    let mock_provider = MockServer::start().await;
    Mock::given(method("POST"))
        .respond_with(ResponseTemplate::new(200)
            .set_body_raw(fixture_chunks, "text/event-stream"))
        .mount(&mock_provider).await;

    let proxy = start_test_proxy(mock_provider.uri()).await;
    let chunks = collect_sse_chunks(proxy, streaming_chat_request()).await;

    // Verify chunk ordering and [DONE] termination
    assert!(chunks.last().unwrap().contains("[DONE]"));
    let content: String = chunks.iter()
        .filter_map(|c| extract_delta_content(c))
        .collect();
    assert!(!content.is_empty());
}

#[tokio::test]
async fn proxy_translates_anthropic_response_to_openai_format() {
    let anthropic_fixture = load_fixture("anthropic/messages_response.json");
    let mock_anthropic = MockServer::start().await;
    Mock::given(method("POST"))
        .and(path("/v1/messages"))
        .respond_with(ResponseTemplate::new(200).set_body_json(&anthropic_fixture))
        .mount(&mock_anthropic).await;

    let proxy = start_test_proxy_with_anthropic(mock_anthropic.uri()).await;
    let response: Value = proxy.post("/v1/chat/completions")
        .json(&chat_request_routed_to_anthropic())
        .send().await.json().await;

    // Response must look like OpenAI even though it came from Anthropic
    assert_eq!(response["object"], "chat.completion");
    assert!(response["choices"][0]["message"]["content"].is_string());
    assert!(response["usage"]["prompt_tokens"].is_number());
}

#[tokio::test]
async fn proxy_passes_through_provider_429_with_original_body() {
    let mock_provider = MockServer::start().await;
    Mock::given(method("POST"))
        .respond_with(ResponseTemplate::new(429)
            .set_body_json(&json!({"error": {"message": "Rate limit exceeded", "type": "rate_limit_error"}})))
        .mount(&mock_provider).await;

    let proxy = start_test_proxy(mock_provider.uri()).await;
    let response = proxy.post("/v1/chat/completions")
        .json(&standard_chat_request()).send().await;

    assert_eq!(response.status(), 429);
    assert_eq!(response.headers()["X-DD0C-Provider-Error"], "true");
    let body: Value = response.json().await;
    assert_eq!(body["error"]["type"], "rate_limit_error");
}
```
---

### 4.3 Worker Integration Tests

```rust
// tests/worker_integration.rs

#[tokio::test]
async fn weekly_digest_worker_queries_correct_date_range() {
    let pool = test_timescaledb_pool().await;
    seed_events_for_last_7_days(&pool, /* org_id */ "test-org", /* count */ 500).await;

    let mock_ses = MockServer::start().await;
    Mock::given(method("POST"))
        .and(path("/v2/email/outbound-emails"))
        .respond_with(ResponseTemplate::new(200))
        .mount(&mock_ses).await;

    run_weekly_digest("test-org", &pool, mock_ses.uri()).await.unwrap();

    let requests = mock_ses.received_requests().await.unwrap();
    assert_eq!(requests.len(), 1);
    let email_body: Value = serde_json::from_slice(&requests[0].body).unwrap();
    assert!(email_body["subject"].as_str().unwrap().contains("savings"));
}

#[tokio::test]
async fn budget_alert_fires_exactly_once_when_threshold_crossed() {
    let pool = test_pool().await;
    let alert = seed_alert(&pool, /* threshold */ 100.0).await;
    seed_spend(&pool, /* org_id */ alert.org_id, /* amount */ 105.0).await;

    let mock_slack = MockServer::start().await;
    Mock::given(method("POST")).respond_with(ResponseTemplate::new(200))
        .mount(&mock_slack).await;

    // Run evaluator twice — alert should only fire once
    evaluate_alerts(&pool, mock_slack.uri()).await.unwrap();
    evaluate_alerts(&pool, mock_slack.uri()).await.unwrap();

    let requests = mock_slack.received_requests().await.unwrap();
    assert_eq!(requests.len(), 1); // not 2 — deduplication works
}
```
---

## Section 5: E2E & Smoke Tests

### 5.1 Critical User Journeys

These are the flows that must work on every deploy. If any of these break, the product is broken.

#### Journey 1: First Route (P0)

```
1. Developer signs up via GitHub OAuth
2. Org + API key created automatically
3. Developer copies curl command from onboarding wizard
4. curl request hits proxy with dd0c key
5. Request routes to correct model per default rules
6. Response headers contain X-DD0C-Model, X-DD0C-Cost, X-DD0C-Saved
7. Request appears in dashboard request inspector within 5 seconds
```

**Playwright test:**

```typescript
test('first route onboarding journey completes in under 2 minutes', async ({ page }) => {
  await page.goto('https://staging.route.dd0c.dev');
  await page.click('[data-testid="github-signin"]');
  // ... OAuth mock in staging
  await expect(page.locator('[data-testid="api-key-display"]')).toBeVisible();

  const apiKey = await page.locator('[data-testid="api-key-value"]').textContent();
  expect(apiKey).toMatch(/^dd0c_sk_live_/);

  // Simulate the curl command
  const response = await fetch('https://proxy.staging.route.dd0c.dev/v1/chat/completions', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'gpt-4o', messages: [{ role: 'user', content: 'Say hello' }] })
  });
  expect(response.status).toBe(200);
  expect(response.headers.get('X-DD0C-Model-Used')).toBeTruthy();

  // Request should appear in inspector
  await page.goto(`https://staging.route.dd0c.dev/dashboard/requests`);
  await expect(page.locator('[data-testid="request-row"]').first()).toBeVisible({ timeout: 10000 });
});
```
#### Journey 2: Routing Rule Takes Effect (P0)

```
1. User creates routing rule: feature=classify → cheapest from [gpt-4o-mini, claude-haiku]
2. Sends request with X-DD0C-Feature: classify header requesting gpt-4o
3. Proxy routes to gpt-4o-mini (cheapest)
4. Response header X-DD0C-Model-Used = gpt-4o-mini
5. Dashboard shows savings for this request
```
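Step 3's "cheapest from the candidate list" decision is a pure function, which keeps it unit-testable without the proxy in the loop. A minimal sketch under assumed names (`pick_cheapest` and the per-1M-token prices are illustrative, not the shipped rule engine):

```rust
use std::collections::HashMap;

/// Pick the cheapest candidate model by price per 1M tokens.
/// Returns None when no candidate has a known price, so the caller
/// can fall back to passthrough rather than guessing.
fn pick_cheapest<'a>(
    candidates: &[&'a str],
    cost_per_mtok: &HashMap<&'a str, f64>,
) -> Option<&'a str> {
    candidates
        .iter()
        .filter_map(|m| cost_per_mtok.get(m).map(|price| (*m, *price)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(model, _)| model)
}

fn main() {
    // Illustrative prices only; real values come from the live cost tables.
    let costs = HashMap::from([
        ("gpt-4o", 5.00),
        ("gpt-4o-mini", 0.30),
        ("claude-haiku", 0.50),
    ]);
    assert_eq!(pick_cheapest(&["gpt-4o-mini", "claude-haiku"], &costs), Some("gpt-4o-mini"));
    assert_eq!(pick_cheapest(&["unknown-model"], &costs), None);
}
```

Keeping the selection pure means the Journey 2 E2E run only has to verify wiring (header in, header out), not pricing arithmetic.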
#### Journey 3: Graceful Degradation (P0)

```
1. TimescaleDB container is killed
2. Proxy continues accepting and routing requests
3. Requests return 200 with correct routing
4. No 500 errors from proxy
5. When TimescaleDB recovers, telemetry resumes
```

**k6 chaos test:**

```javascript
// tests/e2e/chaos_timescaledb.js
import http from 'k6/http';
import { check } from 'k6';

export let options = { vus: 10, duration: '60s' };

export default function () {
  const res = http.post('https://proxy.staging.route.dd0c.dev/v1/chat/completions',
    JSON.stringify({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'ping' }] }),
    { headers: { 'Authorization': 'Bearer dd0c_sk_test_...', 'Content-Type': 'application/json' } }
  );
  check(res, {
    'status is 200': (r) => r.status === 200,
    'routing header present': (r) => r.headers['X-DD0C-Model-Used'] !== undefined,
  });
}
// Run this while: docker stop dd0c-timescaledb
```
### 5.2 Staging Environment Requirements

| Requirement | Detail |
|-------------|--------|
| Isolated AWS account | Separate from prod — no shared RDS, no shared Redis |
| GitHub OAuth app | Separate OAuth app pointing to staging callback URL |
| Synthetic LLM providers | `wiremock` or `mockoon` containers replacing real OpenAI/Anthropic |
| Seeded data | 10K synthetic `request_events` pre-loaded for dashboard testing |
| Feature flags | All flags default-off in staging; tests explicitly enable them |
| Teardown | Staging DB wiped and re-seeded on each E2E run |
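The seeded `request_events` are easiest to work with when generated deterministically, so each wipe-and-reseed reproduces identical dashboards and E2E assertions stay stable. A sketch of such a generator (the field names and xorshift PRNG are illustrative; the real seeder would write these rows through sqlx):

```rust
/// Minimal synthetic event shape for staging seeds (illustrative fields).
#[derive(Debug, PartialEq)]
struct SyntheticEvent {
    model_used: &'static str,
    prompt_tokens: u32,
    completion_tokens: u32,
}

/// Deterministic xorshift64 generator: the same seed always yields the
/// same events, so a wipe-and-reseed never shifts dashboard numbers.
fn seed_events(count: usize, mut state: u64) -> Vec<SyntheticEvent> {
    const MODELS: [&str; 3] = ["gpt-4o", "gpt-4o-mini", "claude-haiku"];
    (0..count)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            SyntheticEvent {
                model_used: MODELS[(state % 3) as usize],
                prompt_tokens: 50 + (state % 900) as u32,
                completion_tokens: 20 + (state % 400) as u32,
            }
        })
        .collect()
}

fn main() {
    let events = seed_events(10_000, 42);
    assert_eq!(events.len(), 10_000);
    // Re-seeding with the same seed reproduces identical data.
    assert_eq!(events[..5], seed_events(5, 42)[..]);
}
```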
### 5.3 Synthetic Traffic Generation

For dashboard and performance tests, a traffic generator seeds realistic request patterns:

```rust
// tools/traffic-gen/src/main.rs
// Generates realistic request distributions matching real usage patterns

struct TrafficProfile {
    requests_per_second: f64,
    feature_distribution: HashMap<String, f64>, // {"classify": 0.4, "summarize": 0.3, ...}
    model_distribution: HashMap<String, f64>,   // {"gpt-4o": 0.6, "gpt-4o-mini": 0.4}
    streaming_ratio: f64,                       // 0.3 = 30% streaming
}

// Usage: cargo run --bin traffic-gen -- --profile realistic --duration 60s --target staging
```
---

## Section 6: Performance & Load Testing

### 6.1 Latency Budget Tests (<5ms proxy overhead)

The <5ms overhead SLA is the product's core technical promise. It must be continuously validated.

**Benchmark setup:** Use `criterion` for micro-benchmarks on the hot path components.

```toml
# Cargo.toml
[[bench]]
name = "hot_path"
harness = false

[dev-dependencies]
criterion = { version = "0.5", features = ["async_tokio"] }
```
```rust
// benches/hot_path.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_complexity_classifier(c: &mut Criterion) {
    let messages = vec![
        system("Extract the sentiment. Reply with one word."),
        user("The product is great!"),
    ];
    c.bench_function("complexity_classifier_short_prompt", |b| {
        b.iter(|| classify_complexity(&messages, "gpt-4o"))
    });
    // Target: <500µs (well within the 2ms budget)
}

fn bench_rule_engine_10_rules(c: &mut Criterion) {
    let rules = make_rules(10);
    let req = make_request_with_feature("classify");
    let costs = cost_tables();
    c.bench_function("rule_engine_10_rules", |b| {
        b.iter(|| evaluate_rules(&rules, &req, &costs))
    });
    // Target: <1ms
}

fn bench_api_key_hash_lookup(c: &mut Criterion) {
    let key = "dd0c_sk_live_a3f2b8c9d4e5f6a7b8c9d4e5f6a7b8c9";
    c.bench_function("api_key_sha256_hash", |b| {
        b.iter(|| hash_api_key(key))
    });
    // Target: <100µs
}

criterion_group!(benches, bench_complexity_classifier, bench_rule_engine_10_rules, bench_api_key_hash_lookup);
criterion_main!(benches);
```
**CI gate:** If any benchmark regresses by >20% vs. the baseline, the PR is blocked.

```yaml
# .github/workflows/bench.yml
- name: Run benchmarks
  run: cargo bench -- --output-format bencher | tee bench_output.txt
- name: Compare with baseline
  uses: benchmark-action/github-action-benchmark@v1
  with:
    tool: cargo
    output-file-path: bench_output.txt
    alert-threshold: '120%'
    fail-on-alert: true
```
### 6.2 Throughput Benchmarks

**k6 load test — sustained throughput:**

```javascript
// tests/load/throughput.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const proxyOverhead = new Trend('proxy_overhead_ms');
const errorRate = new Rate('errors');

export let options = {
  stages: [
    { duration: '2m', target: 50 },  // ramp up
    { duration: '5m', target: 50 },  // sustained load
    { duration: '2m', target: 100 }, // peak load
    { duration: '1m', target: 0 },   // ramp down
  ],
  thresholds: {
    'proxy_overhead_ms': ['p(99)<5'],   // THE SLA
    'http_req_duration': ['p(99)<500'], // total including LLM
    'errors': ['rate<0.01'],            // <1% error rate
  },
};

export default function () {
  const res = http.post(
    `${__ENV.PROXY_URL}/v1/chat/completions`,
    JSON.stringify({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'ping' }] }),
    { headers: { 'Authorization': `Bearer ${__ENV.DD0C_KEY}`, 'Content-Type': 'application/json' } }
  );

  // Overhead comes from the proxy's own header, not wall-clock timing.
  const overhead = parseInt(res.headers['X-DD0C-Latency-Overhead-Ms'] || '999');
  proxyOverhead.add(overhead);
  errorRate.add(res.status !== 200);

  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(0.1);
}
```
**Targets:**

| Metric | Target | Blocking |
|--------|--------|---------|
| Proxy overhead P99 | <5ms | Yes — blocks deploy |
| Proxy overhead P50 | <2ms | No — informational |
| Total request P99 | <500ms (excl. LLM time) | Yes |
| Error rate | <1% | Yes |
| Throughput | >500 req/s per proxy task | No — informational |

### 6.3 Chaos & Fault Injection

| Scenario | Tool | Expected Behavior | Pass Criteria |
|----------|------|-------------------|---------------|
| Kill TimescaleDB | `docker stop` | Proxy continues routing, telemetry dropped | 0 proxy 5xx errors |
| Kill Redis | `docker stop` | Auth falls back to PG, rate limiting disabled | <10% latency increase |
| OpenAI returns 429 | WireMock | Fallback to Anthropic within 1 retry | Request succeeds, `was_fallback=true` |
| Anthropic returns 500 | WireMock | Circuit opens, fallback to gpt-4o | Request succeeds or 503 with header |
| All providers return 500 | WireMock | 503 with `X-DD0C-Fallback-Exhausted` | Correct error code, no panic |
| Network partition (50% packet loss) | `tc netem` | Increased latency, no crashes | P99 < 2x normal |
| Proxy OOM | `--memory 256m` Docker limit | ECS restarts task, ALB routes to healthy | <30s recovery |

```bash
#!/bin/bash
# tests/chaos/run_chaos.sh (chaos test runner script)

echo "=== Chaos Test: TimescaleDB Failure ==="
docker stop dd0c-timescaledb-test
sleep 5
k6 run --env PROXY_URL=http://localhost:8080 --duration 30s tests/load/throughput.js
docker start dd0c-timescaledb-test
echo "TimescaleDB recovered"
```
---

## Section 7: CI/CD Pipeline Integration

### 7.1 Test Stages

```
┌─────────────────────────────────────────────────────────────────┐
│ git commit (pre-commit hook)                                    │
│   ├─ cargo fmt --check                                          │
│   ├─ cargo clippy -- -D warnings                                │
│   ├─ grep for forbidden DDL keywords in new migration files     │
│   └─ check decision_log.json present if router/ files changed   │
└─────────────────────────────────────────────────────────────────┘
        │ push
┌─────────────────────────────────────────────────────────────────┐
│ PR / push to branch                                             │
│   ├─ cargo test --workspace (unit tests only, no Docker)        │
│   ├─ cargo bench (regression check vs. baseline)                │
│   ├─ vitest --run (UI unit tests)                               │
│   ├─ eslint + tsc --noEmit (UI type check)                      │
│   └─ cargo audit (dependency vulnerability scan)                │
│   Target: <3 minutes                                            │
└─────────────────────────────────────────────────────────────────┘
        │ PR approved
┌─────────────────────────────────────────────────────────────────┐
│ merge to main                                                   │
│   ├─ All PR checks (re-run)                                     │
│   ├─ Integration tests (Testcontainers — requires Docker)       │
│   ├─ Contract tests (fixture-based, no live APIs)               │
│   ├─ Coverage report (tarpaulin) — gate at 70%                  │
│   └─ Flag TTL audit (fail if any flag > 14 days at 100%)        │
│   Target: <8 minutes                                            │
└─────────────────────────────────────────────────────────────────┘
        │ tests pass
┌─────────────────────────────────────────────────────────────────┐
│ deploy to staging                                               │
│   ├─ docker build + push to ECR                                 │
│   ├─ sqlx migrate run (staging DB)                              │
│   ├─ ECS rolling deploy                                         │
│   ├─ Smoke tests (k6, 60s, 10 VUs)                              │
│   └─ Playwright E2E (critical journeys only)                    │
│   Target: <15 minutes total                                     │
└─────────────────────────────────────────────────────────────────┘
        │ staging green
┌─────────────────────────────────────────────────────────────────┐
│ deploy to production                                            │
│   ├─ ECS rolling deploy                                         │
│   ├─ Synthetic canary (1 req/min via CloudWatch Synthetics)     │
│   └─ Rollback trigger: error rate >5% for 3 minutes             │
└─────────────────────────────────────────────────────────────────┘
```
### 7.2 GitHub Actions Configuration

```yaml
# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: clippy, rustfmt
      - uses: Swatinem/rust-cache@v2
      - run: cargo fmt --check
      - run: cargo clippy --workspace -- -D warnings
      - run: cargo test --workspace --lib  # unit tests only (no integration)
      - run: cd ui && npm ci && npx vitest --run
      - run: cd cli && npm ci && npx vitest --run

  integration-tests:
    # Docker is preinstalled on ubuntu-latest; Testcontainers needs no extra service
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: cargo test --workspace --test '*'  # integration tests in tests/

  coverage:
    runs-on: ubuntu-latest
    needs: integration-tests
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo install cargo-tarpaulin
      - run: cargo tarpaulin --workspace --out Xml --output-dir coverage/
      - uses: codecov/codecov-action@v4
        with:
          files: coverage/cobertura.xml
          fail_ci_if_error: true
          # the 70% merge gate itself lives in codecov.yml (project status threshold)

  benchmarks:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: cargo bench -- --output-format bencher | tee bench_output.txt
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: cargo
          output-file-path: bench_output.txt
          alert-threshold: '120%'
          fail-on-alert: true
          github-token: ${{ secrets.GITHUB_TOKEN }}
          auto-push: ${{ github.ref == 'refs/heads/main' }}
```
### 7.3 Coverage Thresholds

| Crate | Minimum Coverage | Rationale |
|-------|-----------------|-----------|
| `crates/shared` (router, cost) | 85% | Core business logic — high confidence required |
| `crates/proxy` | 75% | Hot path — streaming paths are hard to unit test |
| `crates/api` | 75% | Auth and RBAC paths must be covered |
| `crates/worker` | 65% | Async scheduling is harder to test deterministically |
| `cli/` | 70% | Parser logic must be covered |
| `ui/` | 60% | UI components — visual testing supplements unit tests |

Coverage is measured by `cargo-tarpaulin` for Rust and `vitest --coverage` for TypeScript. Coverage gates block merges but do not block deploys (a deploy with lower coverage is better than a rollback).

### 7.4 Test Parallelization

```toml
# .cargo/config.toml
[test]
# Run unit tests in parallel (default), integration tests sequentially
# Integration tests use Testcontainers — each gets its own container
```
```yaml
# GitHub Actions matrix for parallel integration test suites
integration-tests:
  strategy:
    matrix:
      suite: [proxy, api, worker, analytics]
  steps:
    - run: cargo test --test ${{ matrix.suite }}_integration
```

Each integration test suite spins up its own Testcontainers instances — no shared state, no port conflicts, fully parallelizable.
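Within a single Postgres container, the same isolation can be had by giving every suite a uniquely named database; a sketch of the naming helper (the `dd0c_test_` convention is an assumption, not the shipped code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

static SUITE_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Produce a database name unique per suite, per run, and per call,
/// so parallel suites never observe each other's state.
fn isolated_db_name(suite: &str) -> String {
    let n = SUITE_COUNTER.fetch_add(1, Ordering::Relaxed);
    let ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before 1970")
        .as_millis();
    format!("dd0c_test_{suite}_{ms}_{n}")
}

fn main() {
    let a = isolated_db_name("proxy");
    let b = isolated_db_name("proxy");
    assert_ne!(a, b); // the counter breaks ties within the same millisecond
    assert!(a.starts_with("dd0c_test_proxy_"));
}
```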
---

## Section 8: Transparent Factory Tenet Testing

### 8.1 Atomic Flagging — Feature Flag Behavior (Story 10.1)

Every flag must be testable in three states: off (default), on, and auto-disabled (circuit tripped).

```rust
// tests/feature_flags.rs

#[tokio::test]
async fn routing_strategy_uses_passthrough_when_flag_is_off() {
    let flags = FlagProvider::from_json(json!({
        "cascading_routing": { "enabled": false }
    }));
    let req = make_request_with_feature("classify");
    let decision = route_with_flags(&req, &flags, &cost_tables()).await;
    assert_eq!(decision.strategy, RoutingStrategy::Passthrough);
}

#[tokio::test]
async fn routing_strategy_uses_cascading_when_flag_is_on() {
    let flags = FlagProvider::from_json(json!({
        "cascading_routing": { "enabled": true }
    }));
    let req = make_request_with_feature("classify");
    let decision = route_with_flags(&req, &flags, &cost_tables()).await;
    assert_eq!(decision.strategy, RoutingStrategy::Cascading);
}

#[tokio::test(start_paused = true)] // paused clock: the 30s health window elapses instantly
async fn flag_auto_disables_when_p99_latency_increases_by_more_than_5_percent() {
    let flags = Arc::new(Mutex::new(FlagProvider::from_json(json!({
        "new_complexity_classifier": { "enabled": true, "owner": "brian", "ttl_days": 7 }
    }))));

    let monitor = FlagHealthMonitor::new(flags.clone(), /* baseline_p99_ms */ 4.0);

    // Simulate latency spike
    for _ in 0..100 {
        monitor.record_latency(4.3); // 7.5% above baseline
    }

    tokio::time::advance(Duration::from_secs(31)).await;

    let current_flags = flags.lock().await;
    assert!(!current_flags.is_enabled("new_complexity_classifier"),
        "flag should have auto-disabled due to latency regression");
}

#[test]
fn flag_with_expired_ttl_fails_ci_audit() {
    let flags = vec![
        FlagDefinition {
            name: "old_feature".to_string(),
            rollout_pct: 100,
            created_at: Utc::now() - Duration::days(20),
            ttl_days: 14,
            owner: "brian".to_string(),
        }
    ];
    let violations = audit_flag_ttls(&flags);
    assert_eq!(violations.len(), 1);
    assert_eq!(violations[0].flag_name, "old_feature");
}
```

**Flag test matrix** — every flag must have tests for all three states:

| Flag | Off behavior | On behavior | Auto-disable trigger |
|------|-------------|-------------|---------------------|
| `cascading_routing` | Passthrough | Try cheapest, escalate on error | P99 >5% regression |
| `complexity_classifier_v2` | Use heuristic v1 | Use ML classifier | Error rate >2% |
| `provider_failover_anthropic` | No Anthropic fallback | Anthropic in fallback chain | Anthropic error rate >10% |
| `cost_table_auto_refresh` | Manual refresh only | Background 60s refresh | N/A |
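For rollouts between fully-off and fully-on, bucketing organizations deterministically keeps all three matrix states reproducible in tests. A sketch of such a check (the FNV-1a hash and bucketing scheme are illustrative, not the shipped `FlagProvider`):

```rust
/// Stable FNV-1a hash so the same org always lands in the same rollout bucket.
fn fnv1a(input: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in input.as_bytes() {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// rollout_pct=0 is fully off, 100 fully on; in between, the org's
/// bucket (0..100) decides. Deterministic, so a test can pin any org
/// to a known state without mocking randomness.
fn is_enabled(flag: &str, org_id: &str, rollout_pct: u8) -> bool {
    if rollout_pct == 0 {
        return false;
    }
    if rollout_pct >= 100 {
        return true;
    }
    (fnv1a(&format!("{flag}:{org_id}")) % 100) < rollout_pct as u64
}

fn main() {
    assert!(!is_enabled("cascading_routing", "org-1", 0));
    assert!(is_enabled("cascading_routing", "org-1", 100));
    // Same inputs always bucket the same way.
    assert_eq!(
        is_enabled("cascading_routing", "org-1", 50),
        is_enabled("cascading_routing", "org-1", 50)
    );
}
```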
---

### 8.2 Elastic Schema — Migration Validation (Story 10.2)

CI must reject any migration containing destructive DDL.

```rust
// tools/migration-lint/src/main.rs
use regex::Regex;

const FORBIDDEN_PATTERNS: &[&str] = &[
    r"DROP\s+TABLE",
    r"DROP\s+COLUMN",
    r"ALTER\s+TABLE\s+\w+\s+RENAME",
    r"ALTER\s+COLUMN\s+\w+\s+TYPE",
    r"TRUNCATE",
];

pub fn lint_migration(sql: &str) -> Vec<LintViolation> {
    FORBIDDEN_PATTERNS.iter()
        .filter_map(|pattern| {
            let re = Regex::new(pattern).unwrap();
            if re.is_match(&sql.to_uppercase()) {
                Some(LintViolation { pattern: pattern.to_string(), sql: sql.to_string() })
            } else {
                None
            }
        })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lint_rejects_drop_table() {
        let sql = "DROP TABLE request_events;";
        assert!(!lint_migration(sql).is_empty());
    }

    #[test]
    fn lint_rejects_alter_column_type() {
        let sql = "ALTER TABLE request_events ALTER COLUMN latency_ms TYPE BIGINT;";
        assert!(!lint_migration(sql).is_empty());
    }

    #[test]
    fn lint_accepts_add_nullable_column() {
        let sql = "ALTER TABLE request_events ADD COLUMN cache_key VARCHAR(64) NULL;";
        assert!(lint_migration(sql).is_empty());
    }

    #[test]
    fn lint_accepts_create_index() {
        let sql = "CREATE INDEX CONCURRENTLY idx_re_model ON request_events(model_used);";
        assert!(lint_migration(sql).is_empty());
    }

    #[test]
    fn migration_file_includes_sunset_date_comment() {
        let sql = "-- sunset_date: 2026-03-30\nALTER TABLE orgs ADD COLUMN tier_v2 VARCHAR(20) NULL;";
        assert!(has_sunset_date_comment(sql));
    }

    #[test]
    fn migration_without_sunset_date_fails_lint() {
        let sql = "ALTER TABLE orgs ADD COLUMN tier_v2 VARCHAR(20) NULL;";
        assert!(!has_sunset_date_comment(sql));
    }
}
```
**Dual-write pattern test:**

```rust
#[tokio::test]
async fn dual_write_writes_to_both_old_and_new_schema_in_same_transaction() {
    let pool = test_pool().await;
    // Simulate migration window: both `plan` (old) and `plan_v2` (new) columns exist
    sqlx::query("ALTER TABLE organizations ADD COLUMN plan_v2 VARCHAR(30) NULL")
        .execute(&pool).await.unwrap();

    let org_id = create_org_dual_write(&pool, /* plan */ "pro").await.unwrap();

    let row = sqlx::query("SELECT plan, plan_v2 FROM organizations WHERE id = $1")
        .bind(org_id).fetch_one(&pool).await.unwrap();
    assert_eq!(row.get::<String, _>("plan"), "pro");
    assert_eq!(row.get::<String, _>("plan_v2"), "pro"); // written to both
}
```
---

### 8.3 Cognitive Durability — Decision Log Validation (Story 10.3)

CI enforces that PRs touching routing or cost logic include a `decision_log.json` entry.

```python
# tools/decision-log-check/check.py
# Run as: python check.py --changed-files <list>

import json
from pathlib import Path

GUARDED_PATHS = ["src/router/", "src/cost/", "migrations/"]
REQUIRED_FIELDS = ["prompt", "reasoning", "alternatives_considered", "confidence", "timestamp", "author"]

def check_decision_log(changed_files: list[str]) -> list[str]:
    errors = []
    touches_guarded = any(
        any(f.startswith(p) for p in GUARDED_PATHS)
        for f in changed_files
    )
    if not touches_guarded:
        return []

    log_files = list(Path("docs/decisions").glob("*.json"))
    if not log_files:
        return ["No decision_log.json found in docs/decisions/ for changes to guarded paths"]

    # Check the most recently modified log file
    latest = max(log_files, key=lambda p: p.stat().st_mtime)
    try:
        log = json.loads(latest.read_text())
        for field in REQUIRED_FIELDS:
            if field not in log:
                errors.append(f"decision_log missing required field: {field}")
    except json.JSONDecodeError as e:
        errors.append(f"decision_log.json is not valid JSON: {e}")

    return errors

# Tests for the checker itself
def test_check_passes_when_log_present_with_all_fields():
    ...  # test implementation

def test_check_fails_when_log_missing_reasoning_field():
    ...  # test implementation
```

**Cyclomatic complexity enforcement:**

```toml
# .clippy.toml
cognitive-complexity-threshold = 10
```

```yaml
# CI step
- run: cargo clippy --workspace -- -W clippy::cognitive_complexity -D warnings
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### 8.4 Semantic Observability — OTEL Span Assertion Tests (Story 10.4)
|
||
|
|
|
||
|
|
Tests verify that routing decisions emit correctly structured OpenTelemetry spans.
|
||
|
|
|
||
|
|
```rust
// tests/observability.rs
use opentelemetry_sdk::testing::trace::InMemorySpanExporter;

#[tokio::test]
async fn routing_decision_emits_ai_routing_decision_span() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    let req = make_request("gpt-4o", /* feature */ "classify");
    let _decision = route_request_with_tracing(&req, &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    let routing_span = spans.iter()
        .find(|s| s.name == "ai_routing_decision")
        .expect("ai_routing_decision span must be emitted");

    // Assert required attributes
    let attrs = span_attrs_as_map(routing_span);
    assert!(attrs.contains_key("ai.model_selected"));
    assert!(attrs.contains_key("ai.model_alternatives"));
    assert!(attrs.contains_key("ai.cost_delta"));
    assert!(attrs.contains_key("ai.complexity_score"));
    assert!(attrs.contains_key("ai.routing_strategy"));
    assert!(attrs.contains_key("ai.prompt_hash"));
}

#[tokio::test]
async fn routing_span_never_contains_raw_prompt_content() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    let secret_prompt = "My secret quarterly revenue is $4.2M";
    let req = make_request_with_prompt("gpt-4o", secret_prompt);
    route_request_with_tracing(&req, &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    for span in &spans {
        for (key, value) in span_attrs_as_map(span) {
            assert!(!format!("{:?}", value).contains(secret_prompt),
                "span attribute '{}' contains raw prompt content", key);
        }
        for event in &span.events {
            assert!(!event.name.contains(secret_prompt));
        }
    }
}

#[tokio::test]
async fn prompt_hash_is_sha256_of_first_500_chars_of_system_prompt() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    let system_prompt = "You are a helpful assistant.";
    let req = make_request_with_system_prompt("gpt-4o", system_prompt);
    route_request_with_tracing(&req, &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    let routing_span = spans.iter().find(|s| s.name == "ai_routing_decision").unwrap();
    let attrs = span_attrs_as_map(routing_span);

    let expected_hash = sha256_hex(&system_prompt[..system_prompt.len().min(500)]);
    assert_eq!(attrs["ai.prompt_hash"].as_str().unwrap(), expected_hash);
}

#[tokio::test]
async fn routing_span_is_child_of_request_trace() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    route_request_with_tracing(&make_request("gpt-4o", /* feature */ "test"), &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    let request_span = spans.iter().find(|s| s.name == "proxy_request").unwrap();
    let routing_span = spans.iter().find(|s| s.name == "ai_routing_decision").unwrap();

    assert_eq!(routing_span.parent_span_id, request_span.span_context.span_id());
}
```

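The hashing contract asserted in `prompt_hash_is_sha256_of_first_500_chars_of_system_prompt` is simple enough to pin down in a few lines. A sketch of the assumed rule (SHA-256 hex digest over the first 500 characters; `prompt_hash` here is an illustrative name, not the Rust helper):

```python
import hashlib

def prompt_hash(system_prompt: str) -> str:
    # Only the first 500 characters are hashed, so the attribute stays
    # stable for long prompts and never embeds full prompt content.
    return hashlib.sha256(system_prompt[:500].encode("utf-8")).hexdigest()

digest = prompt_hash("You are a helpful assistant.")
assert len(digest) == 64  # SHA-256 hex is always 64 chars
# Truncation means prompts identical in their first 500 chars collide by design:
assert prompt_hash("x" * 500) == prompt_hash("x" * 900)
```

The deliberate collision property is worth a test of its own: it documents that the hash identifies a prompt *family* for grouping, not the exact prompt.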
---

### 8.5 Configurable Autonomy — Governance Policy Tests (Story 10.5)

```rust
// tests/governance.rs

#[tokio::test]
async fn strict_mode_blocks_automatic_routing_rule_update() {
    let policy = Policy { governance_mode: GovernanceMode::Strict, panic_mode: false };
    let result = apply_routing_rule_update(&policy, make_rule_update()).await;
    assert_eq!(result, Err(GovernanceError::BlockedByStrictMode));
}

#[tokio::test]
async fn audit_mode_applies_change_and_logs_it() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, panic_mode: false };
    let log = Arc::new(Mutex::new(vec![]));
    let result = apply_routing_rule_update_with_log(&policy, make_rule_update(), log.clone()).await;

    assert!(result.is_ok());
    let entries = log.lock().await;
    assert_eq!(entries.len(), 1);
    assert!(entries[0].contains("Allowed by audit mode"));
}

#[tokio::test]
async fn panic_mode_freezes_all_routing_to_hardcoded_provider() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, panic_mode: true };
    let req = make_request_with_feature("classify"); // would normally route to gpt-4o-mini

    let decision = route_with_policy(&req, &policy, &cost_tables()).await;

    assert_eq!(decision.strategy, RoutingStrategy::Passthrough);
    assert_eq!(decision.target_provider, Provider::OpenAI); // hardcoded fallback
    assert!(decision.reason.contains("panic mode"));
}

#[tokio::test]
async fn panic_mode_disables_auto_failover() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, panic_mode: true };
    // Even if the primary provider fails, panic mode should not auto-failover
    let mock_openai = MockServer::start().await;
    Mock::given(method("POST")).respond_with(ResponseTemplate::new(500))
        .mount(&mock_openai).await;

    let result = dispatch_with_policy(&policy, mock_openai.uri()).await;
    // Should return the provider error, not silently failover
    assert_eq!(result.unwrap_err(), DispatchError::ProviderError(500));
}

#[tokio::test]
async fn policy_file_changes_are_hot_reloaded_within_5_seconds() {
    let policy_path = temp_policy_file(GovernanceMode::Audit);
    let watcher = PolicyWatcher::new(&policy_path);

    // Change to strict mode
    write_policy_file(&policy_path, GovernanceMode::Strict);
    tokio::time::sleep(Duration::from_secs(5)).await;

    assert_eq!(watcher.current_mode(), GovernanceMode::Strict);
}
```

---

## Section 9: Test Data & Fixtures

### 9.1 Factory Patterns for Test Data

All test data is created via factory functions — no raw struct literals scattered across tests. Factories provide sensible defaults with override capability.

```rust
// crates/shared/src/testing/factories.rs
// Feature-gated: only compiled in test builds

#[cfg(any(test, feature = "test-utils"))]
pub mod factories {
    use crate::models::*;
    use uuid::Uuid;
    use chrono::{Duration, Utc};

    pub struct OrgFactory {
        name: String,
        plan: String,
        monthly_spend_limit: Option<f64>,
    }

    impl OrgFactory {
        pub fn new() -> Self {
            Self {
                name: format!("Test Org {}", &Uuid::new_v4().to_string()[..8]),
                plan: "free".to_string(),
                monthly_spend_limit: None,
            }
        }
        pub fn pro(mut self) -> Self { self.plan = "pro".to_string(); self }
        pub fn with_spend_limit(mut self, limit: f64) -> Self {
            self.monthly_spend_limit = Some(limit); self
        }
        pub fn build(self) -> Organization {
            Organization {
                id: Uuid::new_v4(),
                name: self.name,
                slug: slugify(&self.name),
                plan: self.plan,
                monthly_llm_spend_limit: self.monthly_spend_limit,
                created_at: Utc::now(),
                updated_at: Utc::now(),
                ..Default::default()
            }
        }
    }

    pub struct RequestEventFactory {
        org_id: Uuid,
        model_requested: String,
        model_used: String,
        feature_tag: Option<String>,
        input_tokens: u32,
        output_tokens: u32,
        cost_actual: f64,
        cost_original: f64,
        latency_ms: u32,
        status_code: u16,
    }

    impl RequestEventFactory {
        pub fn new(org_id: Uuid) -> Self {
            Self {
                org_id,
                model_requested: "gpt-4o".to_string(),
                model_used: "gpt-4o-mini".to_string(),
                feature_tag: Some("classify".to_string()),
                input_tokens: 500,
                output_tokens: 50,
                cost_actual: 0.000083,
                cost_original: 0.001375,
                latency_ms: 3,
                status_code: 200,
            }
        }
        pub fn with_model(mut self, requested: &str, used: &str) -> Self {
            self.model_requested = requested.to_string();
            self.model_used = used.to_string();
            self
        }
        pub fn with_feature(mut self, feature: &str) -> Self {
            self.feature_tag = Some(feature.to_string()); self
        }
        pub fn with_tokens(mut self, input: u32, output: u32) -> Self {
            self.input_tokens = input;
            self.output_tokens = output;
            self
        }
        pub fn failed(mut self) -> Self {
            self.status_code = 500; self
        }
        pub fn build(self) -> RequestEvent {
            RequestEvent {
                id: Uuid::new_v4(),
                org_id: self.org_id,
                timestamp: Utc::now(),
                model_requested: self.model_requested,
                model_used: self.model_used,
                feature_tag: self.feature_tag,
                input_tokens: self.input_tokens,
                output_tokens: self.output_tokens,
                cost_actual: self.cost_actual,
                cost_original: self.cost_original,
                cost_saved: self.cost_original - self.cost_actual,
                latency_ms: self.latency_ms,
                status_code: self.status_code,
                ..Default::default()
            }
        }
    }

    pub struct RoutingRuleFactory {
        org_id: Uuid,
        priority: i32,
        strategy: RoutingStrategy,
        match_feature: Option<String>,
        model_chain: Vec<String>,
    }

    impl RoutingRuleFactory {
        pub fn cheapest(org_id: Uuid) -> Self {
            Self {
                org_id,
                priority: 0,
                strategy: RoutingStrategy::Cheapest,
                match_feature: None,
                model_chain: vec!["gpt-4o-mini".to_string(), "claude-3-haiku".to_string()],
            }
        }
        pub fn for_feature(mut self, feature: &str) -> Self {
            self.match_feature = Some(feature.to_string()); self
        }
        pub fn with_priority(mut self, p: i32) -> Self { self.priority = p; self }
        pub fn build(self) -> RoutingRule { /* ... */ }
    }

    // Convenience helpers
    pub fn make_event(org_id: Uuid) -> RequestEvent {
        RequestEventFactory::new(org_id).build()
    }

    pub fn make_events(org_id: Uuid, count: usize) -> Vec<RequestEvent> {
        (0..count).map(|_| make_event(org_id)).collect()
    }

    pub fn make_events_spread_over_days(org_id: Uuid, count: usize, days: u32) -> Vec<RequestEvent> {
        (0..count).map(|i| {
            let mut event = make_event(org_id);
            event.timestamp = Utc::now() - Duration::days((i % days as usize) as i64);
            event
        }).collect()
    }
}
```

**TypeScript factories for UI and CLI tests:**

```typescript
// ui/src/testing/factories.ts

export const makeOrg = (overrides: Partial<Organization> = {}): Organization => ({
  id: crypto.randomUUID(),
  name: 'Test Org',
  slug: 'test-org',
  plan: 'free',
  createdAt: new Date().toISOString(),
  ...overrides,
});

export const makeDashboardSummary = (overrides: Partial<DashboardSummary> = {}): DashboardSummary => ({
  period: '7d',
  totalRequests: 42850,
  totalCost: 127.43,
  totalCostWithoutRouting: 891.20,
  totalSaved: 763.77,
  savingsPercentage: 85.7,
  avgLatencyMs: 4.2,
  ...overrides,
});

export const makeRequestEvent = (overrides: Partial<RequestEvent> = {}): RequestEvent => ({
  id: `req_${Math.random().toString(36).slice(2, 10)}`,
  timestamp: new Date().toISOString(),
  modelRequested: 'gpt-4o',
  modelUsed: 'gpt-4o-mini',
  provider: 'openai',
  featureTag: 'classify',
  inputTokens: 142,
  outputTokens: 8,
  cost: 0.000026,
  costWithoutRouting: 0.000435,
  saved: 0.000409,
  latencyMs: 245,
  complexity: 'LOW',
  status: 200,
  ...overrides,
});

export const makeTreemapData = (): TreemapNode[] => [
  { name: 'classify', value: 450.20, children: [
    { name: 'gpt-4o-mini', value: 320.10 },
    { name: 'claude-3-haiku', value: 130.10 },
  ]},
  { name: 'summarize', value: 280.50, children: [
    { name: 'gpt-4o', value: 280.50 },
  ]},
];
```

---

### 9.2 Provider Response Mocks (OpenAI & Anthropic)

Recorded fixtures live in `tests/fixtures/`. They are captured once from real APIs and committed to the repo.

```
tests/fixtures/
├── openai/
│   ├── chat_completions_non_streaming.json
│   ├── chat_completions_streaming.txt            # raw SSE stream
│   ├── chat_completions_streaming_with_usage.txt
│   ├── chat_completions_tool_call.json
│   ├── embeddings_response.json
│   ├── error_rate_limit_429.json
│   ├── error_invalid_api_key_401.json
│   └── error_server_error_500.json
├── anthropic/
│   ├── messages_response.json
│   ├── messages_streaming.txt
│   ├── error_overloaded_529.json
│   └── error_rate_limit_429.json
└── dd0c/
    ├── routing_decision_cheapest.json    # expected routing decision output
    ├── routing_decision_cascading.json
    └── request_event_full.json           # full RequestEvent with all fields
```

**OpenAI non-streaming fixture:**
```json
// tests/fixtures/openai/chat_completions_non_streaming.json
{
  "id": "chatcmpl-test123",
  "object": "chat.completion",
  "created": 1709251200,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "This is a billing inquiry." },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 8,
    "total_tokens": 50
  }
}
```

**OpenAI streaming fixture:**
```
// tests/fixtures/openai/chat_completions_streaming.txt
data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{"content":"This"},"finish_reason":null}]}

data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}

data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":42,"completion_tokens":3,"total_tokens":45}}

data: [DONE]
```

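Tests that replay this fixture need to reassemble the stream. A minimal sketch of the assumed SSE handling (ignore non-`data:` lines, stop at `[DONE]`, read token usage off the final chunk), using an abbreviated transcript inline; `parse_sse_fixture` is an illustrative helper, not part of the codebase:

```python
import json

def parse_sse_fixture(raw: str) -> list[dict]:
    """Yield parsed chunk objects from an OpenAI-style SSE transcript."""
    chunks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunks.append(json.loads(payload))
    return chunks

raw = """
data: {"choices":[{"index":0,"delta":{"content":"This"},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":42,"completion_tokens":3,"total_tokens":45}}

data: [DONE]
"""

chunks = parse_sse_fixture(raw)
text = "".join(c["choices"][0]["delta"].get("content", "") for c in chunks)
usage = chunks[-1].get("usage")
print(text, usage["total_tokens"])  # → This is 45
```

The last non-`[DONE]` chunk carrying `usage` is exactly what the proxy's cost calculation must latch onto when streaming.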
**Fixture loader utility:**
```rust
// tests/common/fixtures.rs
use std::fs;
use std::path::Path;

pub fn load_fixture(path: &str) -> serde_json::Value {
    let fixture_path = Path::new(env!("CARGO_MANIFEST_DIR"))
        .join("tests/fixtures")
        .join(path);
    let content = fs::read_to_string(&fixture_path)
        .unwrap_or_else(|_| panic!("fixture not found: {}", fixture_path.display()));
    serde_json::from_str(&content)
        .unwrap_or_else(|_| panic!("fixture is not valid JSON: {}", path))
}

pub fn load_sse_fixture(path: &str) -> Vec<u8> {
    let fixture_path = Path::new(env!("CARGO_MANIFEST_DIR"))
        .join("tests/fixtures")
        .join(path);
    fs::read(&fixture_path)
        .unwrap_or_else(|_| panic!("SSE fixture not found: {}", fixture_path.display()))
}
```

---

### 9.3 Cost Table Fixtures

```rust
// crates/shared/src/testing/cost_tables.rs

#[cfg(any(test, feature = "test-utils"))]
pub fn cost_tables() -> CostTables {
    CostTables::from_vec(vec![
        ModelCost {
            provider: Provider::OpenAI,
            model_id: "gpt-4o-2024-11-20".to_string(),
            model_alias: "gpt-4o".to_string(),
            input_cost_per_m: 2.50,
            output_cost_per_m: 10.00,
            quality_tier: QualityTier::Frontier,
            max_context: 128_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: true,
        },
        ModelCost {
            provider: Provider::OpenAI,
            model_id: "gpt-4o-mini-2024-07-18".to_string(),
            model_alias: "gpt-4o-mini".to_string(),
            input_cost_per_m: 0.15,
            output_cost_per_m: 0.60,
            quality_tier: QualityTier::Economy,
            max_context: 128_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: true,
        },
        ModelCost {
            provider: Provider::Anthropic,
            model_id: "claude-3-haiku-20240307".to_string(),
            model_alias: "claude-3-haiku".to_string(),
            input_cost_per_m: 0.25,
            output_cost_per_m: 1.25,
            quality_tier: QualityTier::Economy,
            max_context: 200_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: false,
        },
        ModelCost {
            provider: Provider::Anthropic,
            model_id: "claude-3-5-sonnet-20241022".to_string(),
            model_alias: "claude-3-5-sonnet".to_string(),
            input_cost_per_m: 3.00,
            output_cost_per_m: 15.00,
            quality_tier: QualityTier::Frontier,
            max_context: 200_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: true,
        },
    ])
}

/// Returns all valid (requested, used) pairs where routing makes sense
#[cfg(any(test, feature = "test-utils"))]
pub fn routable_model_pairs() -> Vec<(&'static str, &'static str)> {
    vec![
        ("gpt-4o", "gpt-4o-mini"),
        ("gpt-4o", "claude-3-haiku"),
        ("claude-3-5-sonnet", "gpt-4o-mini"),
        ("claude-3-5-sonnet", "claude-3-haiku"),
        // Same model (zero savings)
        ("gpt-4o-mini", "gpt-4o-mini"),
        ("gpt-4o", "gpt-4o"),
    ]
}
```

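These per-million rates make expected values in cost tests computable by hand. A sketch of the assumed cost formula (input and output tokens priced at their respective per-million rates; `RATES` and `request_cost` are illustrative names, not project code):

```python
RATES = {  # USD per million tokens, from the fixture table above
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    r = RATES[model]
    return (input_tokens / 1_000_000 * r["input"]
            + output_tokens / 1_000_000 * r["output"])

# Routing 500 input / 50 output tokens from gpt-4o down to gpt-4o-mini:
original = request_cost("gpt-4o", 500, 50)       # ≈ $0.001750
actual = request_cost("gpt-4o-mini", 500, 50)    # ≈ $0.000105
saved = original - actual
print(f"{original:.6f} {actual:.6f} {saved:.6f}")
```

Pinning a few such hand-computed pairs as unit test expectations guards the cost math against silent regressions when rates are updated.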
---

## Section 10: TDD Implementation Order

### 10.1 Bootstrap Sequence

Before writing any product tests, the test infrastructure itself must be bootstrapped. This is the meta-TDD step.

```
Week 0 (before Epic 1 code):

Day 1: Test infrastructure setup
├─ Add dev-dependencies: mockall, proptest, testcontainers, wiremock, criterion
├─ Create crates/shared/src/testing/ module (factories, cost_tables, helpers)
├─ Create tests/common/ (fixture loader, test app builder, DB setup helpers)
├─ Write and pass: "test infrastructure compiles and factories produce valid structs"
└─ Set up cargo-tarpaulin and confirm coverage reporting works

Day 2: CI pipeline skeleton
├─ Create .github/workflows/ci.yml with unit test job (no tests yet — just passes)
├─ Add benchmark job with baseline capture
├─ Add migration lint script and test it against a sample migration
└─ Confirm: `git push` → CI green (trivially)
```

### 10.2 Epic-by-Epic TDD Order

Tests must be written in dependency order — you can't test the Router Brain without the cost table fixtures, and you can't test the Analytics Pipeline without the proxy event schema.

```
Phase 1: Foundation (Epic 1 — Proxy Engine)
─────────────────────────────────────────────
WRITE FIRST (before any proxy code):
1. test: parse_request_extracts_model_and_stream_flag
2. test: auth_middleware_returns_401_for_unknown_key
3. test: auth_middleware_caches_valid_key_after_db_lookup
4. test: response_headers_contain_routing_metadata
5. test: telemetry_emitter_drops_event_when_channel_is_full

THEN implement proxy core to make them pass.

THEN add property tests:
6. proptest: api_key_hash_is_deterministic
7. proptest: response_headers_never_contain_prompt_content

THEN add integration tests (requires Docker):
8. integration: proxy_forwards_request_to_mock_openai_and_returns_200
9. integration: proxy_returns_401_for_revoked_key_after_cache_invalidation
10. contract: proxy_response_matches_openai_response_schema
11. contract: proxy_preserves_sse_chunk_ordering_for_streaming_requests
12. contract: proxy_translates_anthropic_response_to_openai_format

Phase 2: Intelligence (Epic 2 — Router Brain)
──────────────────────────────────────────────
WRITE FIRST:
13. test: rule_engine_returns_first_matching_rule_by_priority
14. test: rule_engine_falls_through_to_passthrough_when_no_rules_match
15. test: cheapest_strategy_selects_lowest_cost_model_from_chain
16. test: classifier_returns_low_for_short_extraction_system_prompt
17. test: classifier_returns_high_for_code_generation_prompt
18. test: cost_saved_is_zero_when_models_are_identical
19. test: cost_saved_is_positive_when_routed_to_cheaper_model
20. test: circuit_breaker_transitions_to_open_after_error_threshold
21. test: circuit_breaker_transitions_to_half_open_after_cooldown

THEN implement Router Brain.

THEN add property tests:
22. proptest: cheapest_strategy_never_selects_more_expensive_model
23. proptest: complexity_classifier_never_panics_on_arbitrary_input
24. proptest: cost_saved_is_never_negative

THEN add integration tests:
25. integration: circuit_breaker_state_is_shared_across_two_proxy_instances
26. integration: routing_rule_loaded_from_db_takes_effect_on_next_request

Phase 3: Data (Epic 3 — Analytics Pipeline)
─────────────────────────────────────────────
WRITE FIRST:
27. test: batch_collector_flushes_after_100_events_before_timeout
28. test: batch_collector_flushes_partial_batch_after_interval
29. test: proxy_continues_routing_when_telemetry_db_is_unavailable

THEN implement analytics worker.

THEN add integration tests:
30. integration: batch_worker_inserts_events_into_timescaledb_hypertable
31. integration: continuous_aggregate_reflects_inserted_events_after_refresh

Phase 4: Control Plane (Epic 4 — Dashboard API)
─────────────────────────────────────────────────
WRITE FIRST:
32. test: create_api_key_returns_full_key_only_once
33. test: member_role_cannot_create_routing_rules
34. test: request_inspector_never_returns_prompt_content
35. test: provider_credential_is_stored_encrypted
36. test: revoked_api_key_returns_401_on_next_proxy_request

THEN implement Dashboard API.

THEN add integration tests:
37. integration: create_org_and_api_key_persists_to_postgres
38. integration: routing_rules_are_returned_in_priority_order
39. integration: dashboard_summary_query_returns_correct_aggregates

Phase 5: Transparent Factory (Epic 10 — Cross-cutting)
────────────────────────────────────────────────────────
These tests are written ALONGSIDE the features they govern, not after.

40. test: routing_strategy_uses_passthrough_when_flag_is_off (with Epic 2)
41. test: flag_auto_disables_when_p99_latency_increases_by_5_percent (with Epic 2)
42. test: lint_rejects_drop_table (before any migration)
43. test: routing_decision_emits_ai_routing_decision_span (with Epic 2)
44. test: routing_span_never_contains_raw_prompt_content (with Epic 2)
45. test: strict_mode_blocks_automatic_routing_rule_update (with Epic 2)
46. test: panic_mode_freezes_all_routing_to_hardcoded_provider (with Epic 2)

Phase 6: UI & CLI (Epics 5 & 6)
─────────────────────────────────
47. vitest: CostTreemap renders spend breakdown by feature tag
48. vitest: RoutingRulesEditor allows drag-to-reorder priority
49. vitest: RequestInspector filters by feature tag
50. vitest: dd0c-scan detects gpt-4o usage in TypeScript files
51. vitest: SavingsReport calculates positive savings

Phase 7: E2E (after staging environment is live)
──────────────────────────────────────────────────
52. playwright: first_route_onboarding_journey_completes_in_under_2_minutes
53. playwright: routing_rule_created_in_ui_takes_effect_on_next_request
54. k6: proxy_overhead_p99_is_under_5ms_at_50_concurrent_users
55. k6: proxy_continues_routing_when_timescaledb_is_killed
```

### 10.3 Test Count Milestones

| Milestone | Tests Written | Coverage Target | Gate |
|-----------|--------------|-----------------|------|
| End of Epic 1 | ~50 | 60% proxy crate | CI green |
| End of Epic 2 | ~120 | 80% shared/router | CI green |
| End of Epic 3 | ~150 | 70% worker | CI green |
| End of Epic 4 | ~220 | 75% api crate | CI green |
| End of Epic 10 | ~280 | 80% overall | CI green |
| End of Epic 5+6 | ~320 | 75% overall | CI green |
| V1 Launch | ~400 | 75% overall | Deploy gate |

### 10.4 The "Test It First" Checklist

Before writing any new function, ask:

```
□ Does this function have a clear, testable contract?
  (If not, the function is probably doing too much — split it)

□ Can I write the test without knowing the implementation?
  (If not, the abstraction is wrong — redesign the interface)

□ Does this function touch the hot path?
  → Add a criterion benchmark

□ Does this function handle money (cost calculations)?
  → Add proptest property tests

□ Does this function touch auth or security?
  → Add tests for the invalid/revoked/malformed cases explicitly

□ Does this function emit telemetry or spans?
  → Add an OTEL span assertion test

□ Does this function change routing behavior?
  → Add a feature flag test (off/on/auto-disabled)

□ Does this function modify the database schema?
  → Add a migration lint test and a dual-write test
```

---

## Appendix: Test Toolchain Summary

| Tool | Language | Purpose | Config |
|------|----------|---------|--------|
| `cargo test` | Rust | Unit + integration test runner | `Cargo.toml` |
| `mockall` | Rust | Mock generation for traits | `#[automock]` attribute |
| `proptest` | Rust | Property-based testing | `proptest!` macro |
| `criterion` | Rust | Micro-benchmarks | `[[bench]]` in Cargo.toml |
| `testcontainers` | Rust | Real DB/Redis in tests | Docker required |
| `wiremock` | Rust | HTTP mock server | `MockServer::start().await` |
| `cargo-tarpaulin` | Rust | Code coverage | `cargo tarpaulin` |
| `cargo-audit` | Rust | Dependency vulnerability scan | `cargo audit` |
| `vitest` | TypeScript | Unit tests for UI + CLI | `vitest.config.ts` |
| `@testing-library/react` | TypeScript | React component tests | With vitest |
| `Playwright` | TypeScript | E2E browser tests | `playwright.config.ts` |
| `k6` | JavaScript | Load + chaos tests | `k6 run` |
| `migration-lint` | Python/Bash | DDL safety checks | Pre-commit + CI |
| `decision-log-check` | Python | Cognitive durability enforcement | CI only |
| `benchmark-action` | GitHub Actions | Benchmark regression detection | `.github/workflows/` |

---

*Test Architecture document generated for dd0c/route V1 MVP.*
*Total estimated test count at V1 launch: ~400 tests.*
*Target CI runtime: <8 minutes (unit + integration), <15 minutes (full pipeline with E2E).*