dd0c/route — Test Architecture & TDD Strategy

Product: dd0c/route — LLM Cost Router & Optimization Dashboard
Author: Test Architecture Phase
Date: February 28, 2026
Status: V1 MVP — Solo Founder Scope


Section 1: Testing Philosophy & TDD Workflow

1.1 Core Philosophy

dd0c/route is a latency-sensitive proxy with correctness requirements that compound: a wrong routing decision costs money, a wrong cost calculation misleads customers, and a wrong auth check is a security incident. Tests are not optional — they are the specification.

The guiding principle: tests describe behavior, not implementation. A test that breaks when you rename a private function is a bad test. A test that breaks when you accidentally route a complex request to a cheap model is a good test.

For a solo founder, the test suite is also the second developer — it catches regressions when Brian is moving fast and hasn't slept enough.

1.2 Red-Green-Refactor Adapted to dd0c

The standard TDD cycle applies, but with product-specific adaptations:

RED   → Write a failing test that describes the desired behavior
         (e.g., "a request tagged feature=classify should route to gpt-4o-mini")

GREEN → Write the minimum code to make it pass
         (no premature optimization — just make it work)

REFACTOR → Clean up without breaking tests
            (extract the complexity classifier into its own module,
             add the proptest property suite, optimize the hot path)
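
Applied to the routing example above, the first RED/GREEN pass might look like this sketch (`route_for_feature` is a hypothetical seam for illustration, not the real Router Brain API):

```rust
// GREEN-step minimum: hard-code exactly the mapping the failing test demanded.
// No rule engine, no classifier — that arrives in the REFACTOR step.
fn route_for_feature(feature: &str, requested_model: &str) -> String {
    match feature {
        "classify" => "gpt-4o-mini".to_string(), // cheapest adequate model
        _ => requested_model.to_string(),        // passthrough fallback
    }
}

fn main() {
    // RED: this assertion was written first and failed until the match arm existed
    assert_eq!(route_for_feature("classify", "gpt-4o"), "gpt-4o-mini");
    // Unmatched tags pass through the requested model untouched
    assert_eq!(route_for_feature("summarize", "gpt-4o"), "gpt-4o");
    println!("ok");
}
```

The point of the sketch is the ordering: the assertion exists before the `match` arm does, and refactoring later replaces the hard-coded mapping without touching the assertion.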

When to write tests first (strict TDD):

  • All Router Brain logic (routing rules, complexity classifier, cost calculations)
  • All auth/security paths (key validation, JWT issuance, RBAC checks)
  • All cost calculation formulas (cost_saved = cost_original - cost_actual)
  • All circuit breaker state transitions
  • All schema migration validators

When integration tests lead (test-after, then harden):

  • Provider format translation (OpenAI ↔ Anthropic) — write the translator, then write contract tests against real response fixtures
  • TimescaleDB continuous aggregate queries — create the schema, run queries against Testcontainers, then lock in the behavior
  • SSE streaming passthrough — implement the stream relay, then write tests that assert chunk ordering and [DONE] handling

When E2E tests lead:

  • The "first route" onboarding journey — define the happy path first, then build backward
  • The Shadow Audit CLI output format — define the expected terminal output, then build the parser

1.3 Test Naming Conventions

All tests follow the Given-When-Then naming pattern expressed as a single descriptive string:

// Rust unit tests
#[test]
fn complexity_classifier_returns_low_for_short_extraction_prompts() { ... }

#[test]
fn router_selects_cheapest_model_when_strategy_is_cheapest_and_complexity_is_low() { ... }

#[test]
fn circuit_breaker_opens_after_threshold_error_rate_exceeded() { ... }

#[test]
fn cost_calculation_returns_zero_savings_when_requested_and_used_model_are_identical() { ... }
// Integration tests (in tests/ directory)
#[tokio::test]
async fn proxy_forwards_streaming_request_to_openai_and_returns_sse_chunks() { ... }

#[tokio::test]
async fn proxy_returns_401_when_api_key_is_revoked() { ... }
// TypeScript (Dashboard UI / CLI)
describe('CostTreemap', () => {
  it('renders spend breakdown by feature tag when data is loaded', () => { ... });
  it('shows empty state when no requests exist for the selected period', () => { ... });
});

describe('dd0c-scan CLI', () => {
  it('detects gpt-4o usage in TypeScript files and estimates monthly cost', () => { ... });
});

Rules:

  • No test_ prefix in Rust (the #[test] attribute already marks it as a test)
  • No should_ prefix (verbose, adds no information)
  • Use _ as word separator in Rust; in TypeScript, it() takes a plain descriptive sentence
  • Name describes the observable outcome, not the internal mechanism
  • If you can't name the test without saying "and", split it into two tests

Section 2: Test Pyramid

         ┌─────────────────┐
         │   E2E / Smoke   │  ~5%  (~20 tests)
         │   (Playwright,  │
         │    k6 journeys) │
        ─┴─────────────────┴─
       ┌───────────────────────┐
       │   Integration Tests   │  ~20%  (~80 tests)
       │   (Testcontainers,    │
       │    contract tests)    │
      ─┴───────────────────────┴─
    ┌─────────────────────────────┐
    │        Unit Tests           │  ~75%  (~300 tests)
    │  (#[cfg(test)], proptest,   │
    │   mockall, vitest)          │
    └─────────────────────────────┘

Target: ~400 tests at V1 launch. A fast feedback loop is more valuable than exhaustive coverage at this stage. The full suite must complete in CI in under 5 minutes.

2.2 Unit Test Targets (per component)

| Component | Target Test Count | Key Focus |
|---|---|---|
| Router Brain (rule engine) | ~60 | Rule matching, strategy execution, edge cases |
| Complexity Classifier | ~40 | Token count thresholds, regex patterns, confidence scores |
| Cost Calculator | ~30 | Formula correctness, precision, zero-savings edge cases |
| Circuit Breaker | ~25 | State transitions, threshold logic, Redis key format |
| Auth (key validation, JWT) | ~30 | Valid/invalid/revoked keys, JWT claims, RBAC |
| Provider Translators | ~30 | OpenAI↔Anthropic format mapping, streaming chunks |
| Analytics Pipeline (batch logic) | ~20 | Batching thresholds, flush triggers, error handling |
| Dashboard API handlers | ~40 | Request validation, response shape, error codes |
| Shadow Audit CLI parser | ~25 | File detection, token estimation, report formatting |
| **Total** | **~300** | |

2.3 Integration Test Boundaries

| Boundary | Test Type | Tool |
|---|---|---|
| Proxy → TimescaleDB | DB integration | Testcontainers (TimescaleDB image) |
| Proxy → Redis | Cache integration | Testcontainers (Redis image) |
| Proxy → PostgreSQL | DB integration | Testcontainers (PostgreSQL image) |
| Proxy → OpenAI API | Contract test | Recorded fixtures (no live calls in CI) |
| Proxy → Anthropic API | Contract test | Recorded fixtures |
| Dashboard API → PostgreSQL | DB integration | Testcontainers |
| Dashboard API → TimescaleDB | DB integration | Testcontainers |
| Worker → TimescaleDB | DB integration | Testcontainers |
| Worker → SES | Mock integration | wiremock-rs |
| Worker → Slack webhooks | Mock integration | wiremock-rs |

2.4 E2E / Smoke Test Scenarios

| Scenario | Priority | Tool |
|---|---|---|
| New user signs up via GitHub OAuth and gets API key | P0 | Playwright |
| Developer swaps base URL and first request routes correctly | P0 | curl / k6 |
| Routing rule created in UI takes effect on next proxy request | P0 | Playwright + k6 |
| Budget alert fires when threshold is crossed | P1 | k6 + webhook receiver |
| npx dd0c-scan runs on sample repo and produces report | P1 | Node.js test runner |
| Dashboard treemap renders after 100 synthetic requests | P1 | Playwright |
| Proxy continues routing when TimescaleDB is unavailable | P0 | Chaos (kill container) |

Section 3: Unit Test Strategy (Per Component)

3.1 Proxy Engine (crates/proxy)

What to test:

  • Request parsing: extraction of model, messages, headers, stream flag
  • Auth middleware: Redis cache hit, cache miss → PG fallback, revoked key, malformed key
  • Response header injection: X-DD0C-Model, X-DD0C-Cost, X-DD0C-Saved values
  • SSE chunk passthrough: ordering, [DONE] detection, token count extraction from final chunk
  • Graceful degradation: telemetry channel full → drop event, don't block request
  • Rate limiting: per-key counter increment, 429 response when exceeded

Key test cases:

#[cfg(test)]
mod proxy_tests {
    use super::*;
    use mockall::predicate::*;

    #[test]
    fn parse_request_extracts_model_and_stream_flag() {
        let body = r#"{"model":"gpt-4o","messages":[{"role":"user","content":"hi"}],"stream":true}"#;
        let req = ProxyRequest::parse(body).unwrap();
        assert_eq!(req.model, "gpt-4o");
        assert!(req.stream);
    }

    #[test]
    fn parse_request_extracts_dd0c_feature_tag_from_headers() {
        let headers = make_headers([("X-DD0C-Feature", "classify")]);
        let tags = extract_tags(&headers);
        assert_eq!(tags.feature, Some("classify".to_string()));
    }

    #[tokio::test]
    async fn auth_middleware_returns_401_for_unknown_key() {
        let mut mock_cache = MockKeyCache::new();
        mock_cache.expect_get().returning(|_| Ok(None));
        let mut mock_db = MockKeyStore::new();
        mock_db.expect_lookup().returning(|_| Ok(None));

        let result = validate_api_key("dd0c_sk_live_unknown", &mock_cache, &mock_db).await;
        assert_eq!(result, Err(AuthError::InvalidKey));
    }

    #[tokio::test]
    async fn auth_middleware_caches_valid_key_after_db_lookup() {
        let mut mock_cache = MockKeyCache::new();
        mock_cache.expect_get().returning(|_| Ok(None));
        mock_cache.expect_set().times(1).returning(|_, _| Ok(()));
        let mut mock_db = MockKeyStore::new();
        mock_db.expect_lookup().returning(|_| Ok(Some(make_api_key())));

        validate_api_key("dd0c_sk_live_valid", &mock_cache, &mock_db).await.unwrap();
    }

    #[test]
    fn telemetry_emitter_drops_event_when_channel_is_full_without_blocking() {
        let (tx, _rx) = tokio::sync::mpsc::channel(1);
        tx.try_send(make_event()).unwrap(); // fill the channel
        let result = try_emit_telemetry(&tx, make_event());
        assert!(result.is_ok()); // graceful drop, no panic
    }
}

Mocking strategy:

  • MockKeyCache and MockKeyStore via mockall for auth tests
  • MockLlmProvider for dispatch tests — returns canned responses without network
  • Bounded mpsc channels to test backpressure behavior

Property-based tests (proptest):

use proptest::prelude::*;

proptest! {
    #[test]
    fn api_key_hash_is_deterministic(key in "[a-zA-Z0-9]{32}") {
        let h1 = hash_api_key(&key);
        let h2 = hash_api_key(&key);
        prop_assert_eq!(h1, h2);
    }

    #[test]
    fn response_headers_never_contain_prompt_content(
        prompt in ".{20,500}", // long enough that incidental substring hits are implausible
        model in "gpt-4o|gpt-4o-mini|claude-3-haiku"
    ) {
        // The event must actually carry the prompt, or the property is vacuous
        let headers = build_response_headers(&make_routing_decision(&model), &make_event_with_prompt(&prompt));
        for (_, value) in &headers {
            prop_assert!(!value.to_str().unwrap_or("").contains(&prompt));
        }
    }
}
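
The header-injection contract those properties protect can be sketched as follows (a simplified stand-in; the real builder operates on HTTP header maps, and the names and formats below are illustrative):

```rust
// Routing metadata travels only in short X-DD0C-* response headers —
// model identifiers and dollar amounts, never prompt content.
fn build_response_headers(model_used: &str, cost_usd: f64, saved_usd: f64) -> Vec<(String, String)> {
    vec![
        ("X-DD0C-Model".to_string(), model_used.to_string()),
        ("X-DD0C-Cost".to_string(), format!("{:.6}", cost_usd)),
        ("X-DD0C-Saved".to_string(), format!("{:.6}", saved_usd)),
    ]
}

fn main() {
    for (name, value) in build_response_headers("gpt-4o-mini", 0.000423, 0.003807) {
        println!("{}: {}", name, value);
    }
}
```

Because the builder only ever receives the routing decision and cost figures, prompt content cannot leak into headers by construction — which is exactly what the property test guards against regressing.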

3.2 Router Brain (crates/shared/router)

This is the highest-value test target. Routing logic directly affects customer savings — bugs here cost money.

What to test:

  • Rule matching: first-match-wins, tag matching, model matching, complexity matching
  • Strategy execution: passthrough, cheapest, quality_first, cascading
  • Budget enforcement: hard limit reached → throttle to cheapest or reject
  • Complexity classifier: token count thresholds, regex pattern matching, confidence output
  • Cost calculation: formula correctness, floating-point precision, zero-savings case
  • Circuit breaker: CLOSED→OPEN→HALF_OPEN→CLOSED transitions, Redis key format

Key test cases:

#[cfg(test)]
mod router_tests {
    use super::*;

    #[test]
    fn rule_engine_returns_first_matching_rule_by_priority() {
        let rules = vec![
            make_rule(0, "classify", RoutingStrategy::Cheapest),   // (priority, feature, strategy)
            make_rule(1, "classify", RoutingStrategy::Passthrough),
        ];
        let req = make_request_with_feature("classify");
        let decision = evaluate_rules(&rules, &req, &cost_tables());
        assert_eq!(decision.strategy, RoutingStrategy::Cheapest);
    }

    #[test]
    fn rule_engine_falls_through_to_passthrough_when_no_rules_match() {
        let rules = vec![make_rule(0, "summarize", RoutingStrategy::Cheapest)]; // (priority, feature, strategy)
        let req = make_request_with_feature("classify");
        let decision = evaluate_rules(&rules, &req, &cost_tables());
        assert_eq!(decision.strategy, RoutingStrategy::Passthrough);
        assert_eq!(decision.target_model, req.model);
    }

    #[test]
    fn cheapest_strategy_selects_lowest_cost_model_from_chain() {
        let chain = vec!["gpt-4o", "gpt-4o-mini", "claude-3-haiku"];
        let costs = cost_tables_with(&[
            ("gpt-4o",        2.50, 10.00),
            ("gpt-4o-mini",   0.15,  0.60),
            ("claude-3-haiku",0.25,  1.25),
        ]);
        let model = select_cheapest(&chain, &costs, 500, 100);
        assert_eq!(model, "gpt-4o-mini");
    }

    #[test]
    fn classifier_returns_low_for_short_extraction_system_prompt() {
        let messages = vec![
            system("Extract the sentiment. Reply with one word."),
            user("The product is great!"),
        ];
        let result = classify_complexity(&messages, "gpt-4o");
        assert_eq!(result.level, ComplexityLevel::Low);
        assert!(result.confidence > 0.7);
    }

    #[test]
    fn classifier_returns_high_for_code_generation_prompt() {
        let messages = vec![
            system("You are an expert software engineer. Write production-quality code."),
            user("Implement a binary search tree with insertion, deletion, and traversal."),
        ];
        let result = classify_complexity(&messages, "gpt-4o");
        assert_eq!(result.level, ComplexityLevel::High);
    }

    #[test]
    fn cost_saved_is_zero_when_requested_and_used_model_are_identical() {
        let event = make_event("gpt-4o-mini", "gpt-4o-mini", 1000, 200);
        assert_eq!(calculate_cost_saved(&event, &cost_tables()), 0.0);
    }

    #[test]
    fn cost_saved_is_positive_when_routed_to_cheaper_model() {
        let costs = cost_tables_with(&[
            ("gpt-4o",      2.50, 10.00),
            ("gpt-4o-mini", 0.15,  0.60),
        ]);
        let event = make_event("gpt-4o", "gpt-4o-mini", 1_000_000, 200_000);
        let saved = calculate_cost_saved(&event, &costs);
        // (2.50-0.15)*1 + (10.00-0.60)*0.2 = 2.35 + 1.88 = 4.23
        assert!((saved - 4.23).abs() < 0.01);
    }

    #[test]
    fn circuit_breaker_transitions_to_open_after_error_threshold() {
        let mut cb = CircuitBreaker::new(0.10, 60); // threshold 10%, window 60s
        for _ in 0..9 { cb.record_success(); }
        cb.record_failure(); // 1 failure in 10 calls = 10% — meets the threshold (opens at >=)
        assert_eq!(cb.state(), CircuitState::Open);
    }

    #[test]
    fn circuit_breaker_transitions_to_half_open_after_cooldown() {
        let mut cb = CircuitBreaker::new(0.10, 60); // threshold 10%, window 60s
        cb.force_open();
        cb.advance_time(Duration::from_secs(31)); // past the cooldown (assumed 30s)
        assert_eq!(cb.state(), CircuitState::HalfOpen);
    }
}
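
The savings formula those tests lock in can be sketched as follows (prices per 1M tokens, matching the fixtures above; `calculate_cost_saved` here is an illustrative stand-in, not the production signature):

```rust
use std::collections::HashMap;

// Prices are (input, output) dollars per 1M tokens, as in the test fixtures.
fn compute_cost(prices: &HashMap<&str, (f64, f64)>, model: &str, input_tokens: u32, output_tokens: u32) -> f64 {
    let (input_price, output_price) = prices[model];
    input_price * input_tokens as f64 / 1_000_000.0
        + output_price * output_tokens as f64 / 1_000_000.0
}

/// cost_saved = cost(requested model) - cost(model actually used), floored at zero —
/// the floor enforces the "cost_saved is never negative" property below.
fn calculate_cost_saved(prices: &HashMap<&str, (f64, f64)>, requested: &str, used: &str, input_tokens: u32, output_tokens: u32) -> f64 {
    let saved = compute_cost(prices, requested, input_tokens, output_tokens)
        - compute_cost(prices, used, input_tokens, output_tokens);
    saved.max(0.0)
}

fn main() {
    let prices = HashMap::from([
        ("gpt-4o", (2.50, 10.00)),
        ("gpt-4o-mini", (0.15, 0.60)),
    ]);
    // Same numbers as the unit test: (2.50-0.15)*1 + (10.00-0.60)*0.2 = 4.23
    let saved = calculate_cost_saved(&prices, "gpt-4o", "gpt-4o-mini", 1_000_000, 200_000);
    println!("{:.2}", saved);
}
```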

Property-based tests:

proptest! {
    #[test]
    fn cheapest_strategy_never_selects_more_expensive_model(
        input_tokens in 1u32..1_000_000u32,
        output_tokens in 1u32..100_000u32,
    ) {
        let chain = vec!["gpt-4o", "gpt-4o-mini", "claude-3-haiku"];
        let costs = cost_tables();
        let selected = select_cheapest(&chain, &costs, input_tokens, output_tokens);
        let selected_cost = compute_cost(&selected, input_tokens, output_tokens, &costs);
        for model in &chain {
            let model_cost = compute_cost(model, input_tokens, output_tokens, &costs);
            prop_assert!(selected_cost <= model_cost);
        }
    }

    #[test]
    fn complexity_classifier_never_panics_on_arbitrary_input(
        system_prompt in ".*",
        user_message in ".*",
        model in "gpt-4o|gpt-4o-mini|claude-3-haiku",
    ) {
        let messages = vec![system(&system_prompt), user(&user_message)];
        let result = classify_complexity(&messages, &model);
        prop_assert!(result.confidence >= 0.0 && result.confidence <= 1.0);
    }

    #[test]
    fn cost_saved_is_never_negative(
        input_tokens in 1u32..1_000_000u32,
        output_tokens in 1u32..100_000u32,
    ) {
        let costs = cost_tables();
        for (requested, used) in routable_model_pairs() {
            let event = make_event(requested, used, input_tokens, output_tokens);
            prop_assert!(calculate_cost_saved(&event, &costs) >= 0.0);
        }
    }
}
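
The circuit-breaker state machine those tests exercise can be sketched in memory as follows (a simplified stand-in: an event-count window instead of a rolling time window, and local state where production shares it across instances via Redis):

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Clone, Copy)]
enum CircuitState { Closed, Open, HalfOpen }

struct CircuitBreaker {
    state: CircuitState,
    successes: u32,
    failures: u32,
    threshold: f64,               // error-rate threshold, e.g. 0.10
    opened_at: Option<Instant>,   // set when the circuit opens
    cooldown: Duration,           // how long to stay open before probing
}

impl CircuitBreaker {
    fn new(threshold: f64, cooldown: Duration) -> Self {
        Self { state: CircuitState::Closed, successes: 0, failures: 0, threshold, opened_at: None, cooldown }
    }
    fn record_success(&mut self) { self.successes += 1; }
    fn record_failure(&mut self) {
        self.failures += 1;
        let total = (self.successes + self.failures) as f64;
        if self.failures as f64 / total >= self.threshold {
            self.state = CircuitState::Open;           // CLOSED → OPEN at >= threshold
            self.opened_at = Some(Instant::now());
        }
    }
    fn state(&mut self) -> CircuitState {
        // After the cooldown, an open circuit lets one probe through: OPEN → HALF_OPEN
        if self.state == CircuitState::Open {
            if let Some(t) = self.opened_at {
                if t.elapsed() >= self.cooldown { self.state = CircuitState::HalfOpen; }
            }
        }
        self.state
    }
}

fn main() {
    let mut cb = CircuitBreaker::new(0.10, Duration::from_secs(30));
    for _ in 0..9 { cb.record_success(); }
    cb.record_failure(); // 1 failure in 10 = 10%, meets the threshold
    println!("{:?}", cb.state()); // Open
}
```

The HALF_OPEN → CLOSED leg (a successful probe resets the counters) is omitted for brevity; the unit tests above cover it via `advance_time`, a test-only clock hook.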

3.3 Analytics Pipeline (telemetry worker)

What to test:

  • Batch collector flushes at 100 events OR 1 second, whichever comes first
  • Handles worker panic without losing buffered events (bounded channel survives)
  • RequestEvent serializes correctly to PostgreSQL COPY format
  • Graceful degradation: DB unavailable → events dropped, proxy continues unaffected
#[tokio::test]
async fn batch_collector_flushes_after_100_events_before_timeout() {
    let (tx, rx) = mpsc::channel(1000);
    let flush_count = Arc::new(AtomicU32::new(0));
    let mock_db = MockTelemetryDb::counting(flush_count.clone());
    let worker = spawn_batch_worker(rx, mock_db, 100, Duration::from_secs(10));

    for _ in 0..100 { tx.send(make_event()).await.unwrap(); }
    tokio::time::sleep(Duration::from_millis(50)).await;

    assert_eq!(flush_count.load(Ordering::SeqCst), 1);
    worker.abort();
}

#[tokio::test]
async fn batch_collector_flushes_partial_batch_after_interval() {
    let (tx, rx) = mpsc::channel(1000);
    let flush_count = Arc::new(AtomicU32::new(0));
    let mock_db = MockTelemetryDb::counting(flush_count.clone());
    let worker = spawn_batch_worker(rx, mock_db, 100, Duration::from_secs(1));

    tx.send(make_event()).await.unwrap(); // only 1 event
    tokio::time::sleep(Duration::from_millis(1100)).await;

    assert_eq!(flush_count.load(Ordering::SeqCst), 1);
    worker.abort();
}

#[tokio::test]
async fn proxy_continues_routing_when_telemetry_db_is_unavailable() {
    let failing_db = MockTelemetryDb::always_failing();
    let (tx, rx) = mpsc::channel(1000);
    spawn_batch_worker(rx, failing_db, 1, Duration::from_millis(10));

    // Proxy should still be able to send events without blocking
    for _ in 0..200 {
        let _ = tx.try_send(make_event()); // may drop when full — that's fine
    }
    // No panic, no deadlock
}
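
The flush-on-size-or-interval loop those tests pin down can be sketched with std channels (an illustrative stand-in — the real worker drains a tokio mpsc receiver into a COPY-based sink):

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::{Duration, Instant};

// Flush when the batch reaches `max_batch` events OR `interval` elapses,
// whichever comes first — the behavior the two tests above assert.
fn run_batch_loop(rx: mpsc::Receiver<String>, max_batch: usize, interval: Duration, flush: &mut dyn FnMut(Vec<String>)) {
    let mut batch = Vec::new();
    let mut deadline = Instant::now() + interval;
    loop {
        let timeout = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(timeout) {
            Ok(event) => {
                batch.push(event);
                if batch.len() >= max_batch {
                    flush(std::mem::take(&mut batch)); // size trigger
                    deadline = Instant::now() + interval;
                }
            }
            Err(RecvTimeoutError::Timeout) => {
                if !batch.is_empty() { flush(std::mem::take(&mut batch)); } // interval trigger
                deadline = Instant::now() + interval;
            }
            Err(RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() { flush(std::mem::take(&mut batch)); } // drain on shutdown
                return;
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..5 { tx.send(format!("event-{}", i)).unwrap(); }
    drop(tx); // sender gone: loop flushes the partial batch and exits
    let mut flushes = 0;
    run_batch_loop(rx, 100, Duration::from_secs(1), &mut |batch| {
        flushes += 1;
        println!("flushed {} events", batch.len());
    });
    assert_eq!(flushes, 1);
}
```

Note that a failing `flush` simply drops the batch in this sketch; per the graceful-degradation requirement, the production sink must never propagate DB errors back to the channel.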

3.4 Dashboard API (crates/api)

What to test:

  • GitHub OAuth: state parameter validation, code exchange, user upsert
  • JWT issuance: claims (sub, org_id, role, exp), RS256 signature verification
  • RBAC: member cannot modify routing rules, owner can do everything
  • API key CRUD: create returns full key once, list returns prefix only, revoke invalidates
  • Provider credential encryption: stored value differs from plaintext input
  • Request inspector: pagination cursor, filter application, no prompt content in response
#[tokio::test]
async fn create_api_key_returns_full_key_only_once() {
    let app = test_app().await;
    let resp = app.post("/api/orgs/test-org/keys")
        .json(&json!({"name": "production"})).send().await;

    assert_eq!(resp.status(), 201);
    let body: Value = resp.json().await;
    assert!(body["key"].as_str().unwrap().starts_with("dd0c_sk_live_"));

    // Listing must NOT return the full key
    let list: Value = app.get("/api/orgs/test-org/keys").send().await.json().await;
    assert!(list["data"][0]["key"].is_null());
    assert!(list["data"][0]["key_prefix"].as_str().unwrap().len() < 20);
}

#[tokio::test]
async fn member_role_cannot_create_routing_rules() {
    let app = test_app_with_role(Role::Member).await;
    let resp = app.post("/api/orgs/test-org/routing/rules")
        .json(&make_rule_payload()).send().await;
    assert_eq!(resp.status(), 403);
}

#[tokio::test]
async fn request_inspector_never_returns_prompt_content() {
    let app = test_app_with_events(100).await;
    let body: Value = app.get("/api/orgs/test-org/requests").send().await.json().await;
    for event in body["data"].as_array().unwrap() {
        assert!(event.get("messages").is_none());
        assert!(event.get("prompt").is_none());
        assert!(event.get("content").is_none());
    }
}

#[tokio::test]
async fn provider_credential_is_stored_encrypted() {
    let app = test_app().await;
    app.put("/api/orgs/test-org/providers/openai")
        .json(&json!({"api_key": "sk-plaintext-key"})).send().await;

    let stored = fetch_raw_credential_from_db("test-org", "openai").await;
    assert_ne!(stored.encrypted_key, b"sk-plaintext-key");
    assert!(stored.encrypted_key.len() > 16); // has GCM nonce + ciphertext
}

3.5 Shadow Audit CLI (cli/)

What to test:

  • File scanner detects OpenAI/Anthropic SDK usage in .ts, .js, .py
  • Model extractor parses model string from SDK call arguments
  • Token estimator produces non-zero estimate for non-empty prompts
  • Report formatter includes savings percentage, top opportunities, sign-up CTA
  • Offline mode works when pricing cache exists on disk
describe('FileScanner', () => {
  it('detects openai SDK usage in TypeScript files', () => {
    const code = `const r = await client.chat.completions.create({ model: 'gpt-4o' })`;
    const calls = scanFile('service.ts', code);
    expect(calls).toHaveLength(1);
    expect(calls[0].model).toBe('gpt-4o');
  });

  it('detects anthropic SDK usage in Python files', () => {
    const code = `client.messages.create(model="claude-3-opus-20240229")`;
    const calls = scanFile('service.py', code);
    expect(calls[0].model).toBe('claude-3-opus-20240229');
  });

  it('ignores commented-out SDK calls', () => {
    const code = `// client.chat.completions.create({ model: 'gpt-4o' })`;
    expect(scanFile('service.ts', code)).toHaveLength(0);
  });
});

describe('SavingsReport', () => {
  it('calculates positive savings when cheaper model is available', () => {
    const calls = [{ model: 'gpt-4o', estimatedMonthlyTokens: 10_000_000 }];
    const report = generateReport(calls, mockPricingTable);
    expect(report.totalSavings).toBeGreaterThan(0);
    expect(report.savingsPercentage).toBeGreaterThan(0);
  });

  it('includes sign-up CTA in formatted output', () => {
    const output = formatReport(mockReport);
    expect(output).toContain('route.dd0c.dev');
  });
});

Section 4: Integration Test Strategy

4.1 Service Boundary Tests

Integration tests live in tests/ at the crate root and use Testcontainers to spin up real dependencies. No mocks at the service boundary — if it talks to a database, it talks to a real one.

Dependency: testcontainers crate + Docker daemon in CI.

# Cargo.toml (dev-dependencies)
[dev-dependencies]
testcontainers = "0.15"
testcontainers-modules = { version = "0.3", features = ["postgres", "redis"] }
tokio = { version = "1", features = ["full", "test-util"] }
wiremock = "0.6"

Proxy ↔ TimescaleDB

// tests/analytics_integration.rs
use testcontainers::clients::Cli;
use testcontainers_modules::postgres::Postgres;

#[tokio::test]
async fn batch_worker_inserts_events_into_timescaledb_hypertable() {
    let docker = Cli::default();
    let pg = docker.run(Postgres::default().with_tag("15-alpine"));
    let db_url = format!("postgres://postgres:postgres@localhost:{}/postgres", pg.get_host_port_ipv4(5432));

    run_migrations(&db_url).await;
    enable_timescaledb(&db_url).await;

    let (tx, rx) = mpsc::channel(100);
    let worker = spawn_batch_worker(rx, db_url.clone(), 10, Duration::from_millis(100));

    for _ in 0..10 {
        tx.send(make_event()).await.unwrap();
    }
    tokio::time::sleep(Duration::from_millis(200)).await;

    let count: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM request_events")
        .fetch_one(&pool(&db_url).await).await.unwrap();
    assert_eq!(count, 10);
    worker.abort();
}

#[tokio::test]
async fn continuous_aggregate_reflects_inserted_events_after_refresh() {
    // ... setup TimescaleDB, insert 100 events, trigger aggregate refresh,
    // assert hourly_cost_summary has correct totals
}

Proxy ↔ Redis

// tests/cache_integration.rs
use testcontainers_modules::redis::Redis;

#[tokio::test]
async fn api_key_cache_stores_and_retrieves_key_within_ttl() {
    let docker = Cli::default();
    let redis = docker.run(Redis::default());
    let client = connect_redis(redis.get_host_port_ipv4(6379)).await;

    let key = make_api_key();
    cache_api_key(&client, &key, Duration::from_secs(60)).await.unwrap();

    let retrieved = get_cached_key(&client, &key.hash).await.unwrap();
    assert_eq!(retrieved.unwrap().org_id, key.org_id);
}

#[tokio::test]
async fn circuit_breaker_state_is_shared_across_two_proxy_instances() {
    let docker = Cli::default();
    let redis = docker.run(Redis::default());
    let client1 = connect_redis(redis.get_host_port_ipv4(6379)).await;
    let client2 = connect_redis(redis.get_host_port_ipv4(6379)).await;

    let cb1 = RedisCircuitBreaker::new("openai", client1);
    let cb2 = RedisCircuitBreaker::new("openai", client2);

    cb1.force_open().await.unwrap();

    // Instance 2 should see the open circuit set by instance 1
    assert_eq!(cb2.state().await.unwrap(), CircuitState::Open);
}

#[tokio::test]
async fn rate_limit_counter_increments_and_enforces_limit() {
    let docker = Cli::default();
    let redis = docker.run(Redis::default());
    let client = connect_redis(redis.get_host_port_ipv4(6379)).await;

    let limiter = RateLimiter::new(client, 5, Duration::from_secs(60)); // limit 5 per 60s window
    for _ in 0..5 {
        assert!(limiter.check_and_increment("key_abc").await.unwrap());
    }
    // 6th request should be rejected
    assert!(!limiter.check_and_increment("key_abc").await.unwrap());
}
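
The check_and_increment semantics can be sketched with an in-memory fixed-window counter (production presumably uses a Redis INCR with an EXPIRE set on the first increment of each window — an assumption, since the doc only specifies "per-key counter increment"):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// In-memory stand-in for the Redis fixed-window counter:
// key -> (window start, count within window).
struct RateLimiter {
    limit: u32,
    window: Duration,
    counters: HashMap<String, (Instant, u32)>,
}

impl RateLimiter {
    fn new(limit: u32, window: Duration) -> Self {
        Self { limit, window, counters: HashMap::new() }
    }
    // Returns true if the request is allowed, false once the limit is hit.
    fn check_and_increment(&mut self, key: &str) -> bool {
        let now = Instant::now();
        let entry = self.counters.entry(key.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window { *entry = (now, 0); } // window rolled over
        if entry.1 >= self.limit { return false; } // over the limit → caller returns 429
        entry.1 += 1;
        true
    }
}

fn main() {
    let mut limiter = RateLimiter::new(5, Duration::from_secs(60));
    let allowed: Vec<bool> = (0..6).map(|_| limiter.check_and_increment("key_abc")).collect();
    println!("{:?}", allowed); // first 5 allowed, 6th rejected
}
```

The fixed-window shape matches the integration test above (5 allowed, 6th rejected); whether production uses fixed or sliding windows is an implementation detail the test deliberately does not pin down.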

Dashboard API ↔ PostgreSQL

// tests/api_db_integration.rs

#[tokio::test]
async fn create_org_and_api_key_persists_to_postgres() {
    let docker = Cli::default();
    let pg = docker.run(Postgres::default());
    let pool = setup_test_db(pg.get_host_port_ipv4(5432)).await;

    let org = create_organization(&pool, "Acme Corp").await.unwrap();
    let (key, raw) = create_api_key(&pool, org.id, "production").await.unwrap();

    // Raw key is never stored
    let stored: ApiKey = sqlx::query_as("SELECT * FROM api_keys WHERE id = $1")
        .bind(key.id).fetch_one(&pool).await.unwrap();
    assert_ne!(stored.key_hash, raw); // hash != raw key
    assert!(stored.key_prefix.starts_with("dd0c_sk_"));
}

#[tokio::test]
async fn routing_rules_are_returned_in_priority_order() {
    let pool = test_pool().await;
    let org_id = seed_org(&pool).await;

    insert_rule(&pool, org_id, 10, "low priority").await;  // (priority, name)
    insert_rule(&pool, org_id, 1,  "high priority").await;
    insert_rule(&pool, org_id, 5,  "mid priority").await;

    let rules = get_routing_rules(&pool, org_id).await.unwrap();
    assert_eq!(rules[0].name, "high priority");
    assert_eq!(rules[1].name, "mid priority");
    assert_eq!(rules[2].name, "low priority");
}

4.2 Contract Tests for OpenAI API Compatibility

The proxy's core promise is drop-in OpenAI compatibility. Contract tests verify this using recorded fixtures — real OpenAI/Anthropic responses captured once and replayed in CI without live API calls.

Fixture capture workflow:

  1. Run cargo test --features=record-fixtures once against live APIs (requires real keys)
  2. Fixtures saved to tests/fixtures/openai/ and tests/fixtures/anthropic/
  3. CI always uses recorded fixtures — no live API calls, no flakiness, no cost
// tests/contract_openai.rs

#[tokio::test]
async fn proxy_response_matches_openai_response_schema() {
    let fixture = load_fixture("openai/chat_completions_non_streaming.json");
    let mock_provider = WireMock::start().await;
    Mock::given(method("POST"))
        .and(path("/v1/chat/completions"))
        .respond_with(ResponseTemplate::new(200).set_body_json(&fixture))
        .mount(&mock_provider).await;

    let proxy = start_test_proxy(mock_provider.uri()).await;
    let response = proxy.post("/v1/chat/completions")
        .header("Authorization", "Bearer dd0c_sk_live_test")
        .json(&standard_chat_request())
        .send().await;

    assert_eq!(response.status(), 200);
    let body: Value = response.json().await;
    // Assert OpenAI schema compliance
    assert!(body["id"].as_str().unwrap().starts_with("chatcmpl-"));
    assert_eq!(body["object"], "chat.completion");
    assert!(body["choices"][0]["message"]["content"].is_string());
    assert!(body["usage"]["prompt_tokens"].is_number());
}

#[tokio::test]
async fn proxy_preserves_sse_chunk_ordering_for_streaming_requests() {
    let fixture_chunks = load_sse_fixture("openai/chat_completions_streaming.txt");
    let mock_provider = WireMock::start().await;
    Mock::given(method("POST"))
        .respond_with(ResponseTemplate::new(200)
            .set_body_raw(fixture_chunks, "text/event-stream"))
        .mount(&mock_provider).await;

    let proxy = start_test_proxy(mock_provider.uri()).await;
    let chunks = collect_sse_chunks(proxy, streaming_chat_request()).await;

    // Verify chunk ordering and [DONE] termination
    assert!(chunks.last().unwrap().contains("[DONE]"));
    let content: String = chunks.iter()
        .filter_map(|c| extract_delta_content(c))
        .collect();
    assert!(!content.is_empty());
}

#[tokio::test]
async fn proxy_translates_anthropic_response_to_openai_format() {
    let anthropic_fixture = load_fixture("anthropic/messages_response.json");
    let mock_anthropic = WireMock::start().await;
    Mock::given(method("POST"))
        .and(path("/v1/messages"))
        .respond_with(ResponseTemplate::new(200).set_body_json(&anthropic_fixture))
        .mount(&mock_anthropic).await;

    let proxy = start_test_proxy_with_anthropic(mock_anthropic.uri()).await;
    let response: Value = proxy.post("/v1/chat/completions")
        .json(&chat_request_routed_to_anthropic())
        .send().await.json().await;

    // Response must look like OpenAI even though it came from Anthropic
    assert_eq!(response["object"], "chat.completion");
    assert!(response["choices"][0]["message"]["content"].is_string());
    assert!(response["usage"]["prompt_tokens"].is_number());
}

#[tokio::test]
async fn proxy_passes_through_provider_429_with_original_body() {
    let mock_provider = WireMock::start().await;
    Mock::given(method("POST"))
        .respond_with(ResponseTemplate::new(429)
            .set_body_json(&json!({"error": {"message": "Rate limit exceeded", "type": "rate_limit_error"}})))
        .mount(&mock_provider).await;

    let proxy = start_test_proxy(mock_provider.uri()).await;
    let response = proxy.post("/v1/chat/completions")
        .json(&standard_chat_request()).send().await;

    assert_eq!(response.status(), 429);
    assert_eq!(response.headers()["X-DD0C-Provider-Error"], "true");
    let body: Value = response.json().await;
    assert_eq!(body["error"]["type"], "rate_limit_error");
}

4.3 Worker Integration Tests

// tests/worker_integration.rs

#[tokio::test]
async fn weekly_digest_worker_queries_correct_date_range() {
    let pool = test_timescaledb_pool().await;
    seed_events_for_last_7_days(&pool, "test-org", 500).await; // (org_id, event count)

    let mock_ses = WireMock::start().await;
    Mock::given(method("POST"))
        .and(path("/v2/email/outbound-emails"))
        .respond_with(ResponseTemplate::new(200))
        .mount(&mock_ses).await;

    run_weekly_digest("test-org", &pool, mock_ses.uri()).await.unwrap();

    let requests = mock_ses.received_requests().await.unwrap();
    assert_eq!(requests.len(), 1);
    let email_body: Value = serde_json::from_slice(&requests[0].body).unwrap();
    assert!(email_body["subject"].as_str().unwrap().contains("savings"));
}

#[tokio::test]
async fn budget_alert_fires_exactly_once_when_threshold_crossed() {
    let pool = test_pool().await;
    let alert = seed_alert(&pool, 100.0).await;   // threshold: $100
    seed_spend(&pool, alert.org_id, 105.0).await; // spend crosses the threshold

    let mock_slack = WireMock::start().await;
    Mock::given(method("POST")).respond_with(ResponseTemplate::new(200))
        .mount(&mock_slack).await;

    // Run evaluator twice — alert should only fire once
    evaluate_alerts(&pool, mock_slack.uri()).await.unwrap();
    evaluate_alerts(&pool, mock_slack.uri()).await.unwrap();

    let requests = mock_slack.received_requests().await.unwrap();
    assert_eq!(requests.len(), 1); // not 2 — deduplication works
}
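The fire-once behavior this test asserts hinges on a deduplication key of alert plus billing period. A minimal stdlib sketch of that logic (type and method names hypothetical; the real evaluator would persist fired alerts in Postgres so deduplication survives worker restarts):

```rust
use std::collections::HashSet;

/// Tracks which (alert_id, period) pairs have already fired.
pub struct AlertDeduper {
    fired: HashSet<(u64, String)>,
}

impl AlertDeduper {
    pub fn new() -> Self {
        Self { fired: HashSet::new() }
    }

    /// Returns true only the first time an alert crosses its threshold
    /// within a billing period; repeat evaluations are suppressed.
    pub fn should_fire(&mut self, alert_id: u64, period: &str, spend: f64, threshold: f64) -> bool {
        if spend < threshold {
            return false;
        }
        // HashSet::insert returns false if the key was already present.
        self.fired.insert((alert_id, period.to_string()))
    }
}
```

Running the evaluator twice with the same spend then exercises exactly the `requests.len() == 1` assertion above.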

Section 5: E2E & Smoke Tests

5.1 Critical User Journeys

These are the flows that must work on every deploy. If any of these break, the product is broken.

Journey 1: First Route (P0)

1. Developer signs up via GitHub OAuth
2. Org + API key created automatically
3. Developer copies curl command from onboarding wizard
4. curl request hits proxy with dd0c key
5. Request routes to correct model per default rules
6. Response headers contain X-DD0C-Model-Used, X-DD0C-Cost, X-DD0C-Saved
7. Request appears in dashboard request inspector within 5 seconds

Playwright test:

test('first route onboarding journey completes in under 2 minutes', async ({ page }) => {
  await page.goto('https://staging.route.dd0c.dev');
  await page.click('[data-testid="github-signin"]');
  // ... OAuth mock in staging
  await expect(page.locator('[data-testid="api-key-display"]')).toBeVisible();

  const apiKey = await page.locator('[data-testid="api-key-value"]').textContent();
  expect(apiKey).toMatch(/^dd0c_sk_live_/);

  // Simulate the curl command
  const response = await fetch('https://proxy.staging.route.dd0c.dev/v1/chat/completions', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'gpt-4o', messages: [{ role: 'user', content: 'Say hello' }] })
  });
  expect(response.status).toBe(200);
  expect(response.headers.get('X-DD0C-Model-Used')).toBeTruthy();

  // Request should appear in inspector
  await page.goto(`https://staging.route.dd0c.dev/dashboard/requests`);
  await expect(page.locator('[data-testid="request-row"]').first()).toBeVisible({ timeout: 10000 });
});

Journey 2: Routing Rule Takes Effect (P0)

1. User creates routing rule: feature=classify → cheapest from [gpt-4o-mini, claude-haiku]
2. Sends request with X-DD0C-Feature: classify header requesting gpt-4o
3. Proxy routes to gpt-4o-mini (cheapest)
4. Response header X-DD0C-Model-Used = gpt-4o-mini
5. Dashboard shows savings for this request
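Step 3's decision reduces to picking the lowest-priced candidate from the rule's model list. A stdlib sketch under illustrative prices (the cost figures and helper names below are placeholders, not the product's real cost tables):

```rust
use std::collections::HashMap;

/// Illustrative blended $ / 1K tokens. Real prices come from the
/// refreshed cost tables, not constants.
fn cost_table() -> HashMap<&'static str, f64> {
    HashMap::from([
        ("gpt-4o", 0.00875),
        ("gpt-4o-mini", 0.000375),
        ("claude-haiku", 0.000650),
    ])
}

/// Choose the cheapest candidate with a known price; if none is
/// priced, fall back to the originally requested model.
fn pick_cheapest<'a>(
    candidates: &[&'a str],
    costs: &HashMap<&str, f64>,
    requested: &'a str,
) -> &'a str {
    candidates
        .iter()
        .filter_map(|&m| costs.get(m).map(|&c| (m, c)))
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(m, _)| m)
        .unwrap_or(requested)
}
```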

Journey 3: Graceful Degradation (P0)

1. TimescaleDB container is killed
2. Proxy continues accepting and routing requests
3. Requests return 200 with correct routing
4. No 500 errors from proxy
5. When TimescaleDB recovers, telemetry resumes

k6 chaos test:

// tests/e2e/chaos_timescaledb.js
import http from 'k6/http';
import { check } from 'k6';

export let options = { vus: 10, duration: '60s' };

export default function () {
  const res = http.post('https://proxy.staging.route.dd0c.dev/v1/chat/completions',
    JSON.stringify({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'ping' }] }),
    { headers: { 'Authorization': 'Bearer dd0c_sk_test_...', 'Content-Type': 'application/json' } }
  );
  check(res, {
    'status is 200': (r) => r.status === 200,
    'routing header present': (r) => r.headers['X-DD0C-Model-Used'] !== undefined,
  });
}
// Run this while: docker stop dd0c-timescaledb

5.2 Staging Environment Requirements

| Requirement | Detail |
| --- | --- |
| Isolated AWS account | Separate from prod — no shared RDS, no shared Redis |
| GitHub OAuth app | Separate OAuth app pointing to staging callback URL |
| Synthetic LLM providers | wiremock or mockoon containers replacing real OpenAI/Anthropic |
| Seeded data | 10K synthetic request_events pre-loaded for dashboard testing |
| Feature flags | All flags default-off in staging; tests explicitly enable them |
| Teardown | Staging DB wiped and re-seeded on each E2E run |

5.3 Synthetic Traffic Generation

For dashboard and performance tests, a traffic generator seeds realistic request patterns:

// tools/traffic-gen/src/main.rs
// Generates realistic request distributions matching real usage patterns

struct TrafficProfile {
    requests_per_second: f64,
    feature_distribution: HashMap<String, f64>, // {"classify": 0.4, "summarize": 0.3, ...}
    model_distribution: HashMap<String, f64>,   // {"gpt-4o": 0.6, "gpt-4o-mini": 0.4}
    streaming_ratio: f64,                        // 0.3 = 30% streaming
}

// Usage: cargo run --bin traffic-gen -- --profile realistic --duration 60s --target staging
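To turn feature_distribution into concrete requests the generator needs weighted sampling. A deterministic stdlib sketch of cumulative-weight selection; the real binary would presumably feed it PRNG draws in [0, 1):

```rust
/// Pick a key from a weighted distribution given a roll in [0, 1).
/// Weights are assumed to sum to roughly 1.0; the last entry absorbs
/// floating-point rounding.
fn sample_weighted<'a>(dist: &[(&'a str, f64)], roll: f64) -> &'a str {
    let mut cumulative = 0.0;
    for &(key, weight) in dist {
        cumulative += weight;
        if roll < cumulative {
            return key;
        }
    }
    dist.last().expect("distribution must be non-empty").0
}
```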

Section 6: Performance & Load Testing

6.1 Latency Budget Tests (<5ms proxy overhead)

The <5ms overhead SLA is the product's core technical promise. It must be continuously validated.

Benchmark setup: Use criterion for micro-benchmarks on the hot path components.

# Cargo.toml
[[bench]]
name = "hot_path"
harness = false

[dev-dependencies]
criterion = { version = "0.5", features = ["async_tokio"] }
// benches/hot_path.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_complexity_classifier(c: &mut Criterion) {
    let messages = vec![
        system("Extract the sentiment. Reply with one word."),
        user("The product is great!"),
    ];
    c.bench_function("complexity_classifier_short_prompt", |b| {
        b.iter(|| classify_complexity(&messages, "gpt-4o"))
    });
    // Target: <500µs (well within the 2ms budget)
}

fn bench_rule_engine_10_rules(c: &mut Criterion) {
    let rules = make_rules(10);
    let req = make_request_with_feature("classify");
    let costs = cost_tables();
    c.bench_function("rule_engine_10_rules", |b| {
        b.iter(|| evaluate_rules(&rules, &req, &costs))
    });
    // Target: <1ms
}

fn bench_api_key_hash_lookup(c: &mut Criterion) {
    let key = "dd0c_sk_live_a3f2b8c9d4e5f6a7b8c9d4e5f6a7b8c9";
    c.bench_function("api_key_sha256_hash", |b| {
        b.iter(|| hash_api_key(key))
    });
    // Target: <100µs
}

criterion_group!(benches, bench_complexity_classifier, bench_rule_engine_10_rules, bench_api_key_hash_lookup);
criterion_main!(benches);

CI gate: If any benchmark regresses by >20% vs. the baseline, the PR is blocked.

# .github/workflows/bench.yml
- name: Run benchmarks
  run: cargo bench -- --output-format bencher | tee bench_output.txt
- name: Compare with baseline
  uses: benchmark-action/github-action-benchmark@v1
  with:
    tool: cargo
    output-file-path: bench_output.txt
    alert-threshold: '120%'
    fail-on-alert: true

6.2 Throughput Benchmarks

k6 load test — sustained throughput:

// tests/load/throughput.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const proxyOverhead = new Trend('proxy_overhead_ms');
const errorRate = new Rate('errors');

export let options = {
  stages: [
    { duration: '2m', target: 50 },   // ramp up
    { duration: '5m', target: 50 },   // sustained load
    { duration: '2m', target: 100 },  // peak load
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    'proxy_overhead_ms': ['p(99)<5'],  // THE SLA
    'http_req_duration': ['p(99)<500'], // total including LLM
    'errors': ['rate<0.01'],            // <1% error rate
  },
};

export default function () {
  const start = Date.now();
  const res = http.post(
    `${__ENV.PROXY_URL}/v1/chat/completions`,
    JSON.stringify({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'ping' }] }),
    { headers: { 'Authorization': `Bearer ${__ENV.DD0C_KEY}`, 'Content-Type': 'application/json' } }
  );

  const overhead = parseInt(res.headers['X-DD0C-Latency-Overhead-Ms'] || '999');
  proxyOverhead.add(overhead);
  errorRate.add(res.status !== 200);

  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(0.1);
}

Targets:

| Metric | Target | Blocking |
| --- | --- | --- |
| Proxy overhead P99 | <5ms | Yes — blocks deploy |
| Proxy overhead P50 | <2ms | No — informational |
| Total request P99 | <500ms (excl. LLM time) | Yes |
| Error rate | <1% | Yes |
| Throughput | >500 req/s per proxy task | No — informational |

6.3 Chaos & Fault Injection

| Scenario | Tool | Expected Behavior | Pass Criteria |
| --- | --- | --- | --- |
| Kill TimescaleDB | docker stop | Proxy continues routing, telemetry dropped | 0 proxy 5xx errors |
| Kill Redis | docker stop | Auth falls back to PG, rate limiting disabled | <10% latency increase |
| OpenAI returns 429 | WireMock | Fallback to Anthropic within 1 retry | Request succeeds, was_fallback=true |
| Anthropic returns 500 | WireMock | Circuit opens, fallback to gpt-4o | Request succeeds or 503 with header |
| All providers return 500 | WireMock | 503 with X-DD0C-Fallback-Exhausted | Correct error code, no panic |
| Network partition (50% packet loss) | tc netem | Increased latency, no crashes | P99 < 2x normal |
| Proxy OOM | --memory 256m Docker limit | ECS restarts task, ALB routes to healthy | <30s recovery |
# Chaos test runner script
#!/bin/bash
# tests/chaos/run_chaos.sh

echo "=== Chaos Test: TimescaleDB Failure ==="
docker stop dd0c-timescaledb-test
sleep 5
k6 run --env PROXY_URL=http://localhost:8080 tests/load/throughput.js --duration 30s
docker start dd0c-timescaledb-test
echo "TimescaleDB recovered"
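The provider-failure scenarios in the chaos table assume a failure-counting circuit breaker in front of each provider. A minimal sketch of that state machine (the threshold is illustrative, and a production breaker would also half-open after a cool-down rather than stay open):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Circuit {
    Closed,
    Open,
}

/// Opens after `threshold` consecutive failures; any success while
/// closed resets the count.
struct Breaker {
    state: Circuit,
    consecutive_failures: u32,
    threshold: u32,
}

impl Breaker {
    fn new(threshold: u32) -> Self {
        Self { state: Circuit::Closed, consecutive_failures: 0, threshold }
    }

    fn record(&mut self, success: bool) {
        if success {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.threshold {
                self.state = Circuit::Open;
            }
        }
    }

    /// While open, the router skips this provider and moves to the
    /// next model in the fallback chain.
    fn allows_requests(&self) -> bool {
        self.state == Circuit::Closed
    }
}
```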

Section 7: CI/CD Pipeline Integration

7.1 Test Stages

┌─────────────────────────────────────────────────────────────────┐
│  git commit (pre-commit hook)                                    │
│  ├─ cargo fmt --check                                           │
│  ├─ cargo clippy -- -D warnings                                 │
│  ├─ grep for forbidden DDL keywords in new migration files      │
│  └─ check decision_log.json present if router/ files changed   │
└─────────────────────────────────────────────────────────────────┘
                          │ push
┌─────────────────────────────────────────────────────────────────┐
│  PR / push to branch                                            │
│  ├─ cargo test --workspace (unit tests only, no Docker)        │
│  ├─ cargo bench (regression check vs. baseline)                │
│  ├─ vitest --run (UI unit tests)                               │
│  ├─ eslint + tsc --noEmit (UI type check)                      │
│  └─ cargo audit (dependency vulnerability scan)                │
│  Target: <3 minutes                                             │
└─────────────────────────────────────────────────────────────────┘
                          │ PR approved
┌─────────────────────────────────────────────────────────────────┐
│  merge to main                                                  │
│  ├─ All PR checks (re-run)                                     │
│  ├─ Integration tests (Testcontainers — requires Docker)       │
│  ├─ Contract tests (fixture-based, no live APIs)               │
│  ├─ Coverage report (tarpaulin) — gate at 70%                  │
│  └─ Flag TTL audit (fail if any flag > 14 days at 100%)        │
│  Target: <8 minutes                                             │
└─────────────────────────────────────────────────────────────────┘
                          │ tests pass
┌─────────────────────────────────────────────────────────────────┐
│  deploy to staging                                              │
│  ├─ docker build + push to ECR                                 │
│  ├─ sqlx migrate run (staging DB)                              │
│  ├─ ECS rolling deploy                                         │
│  ├─ Smoke tests (k6, 60s, 10 VUs)                             │
│  └─ Playwright E2E (critical journeys only)                    │
│  Target: <15 minutes total                                      │
└─────────────────────────────────────────────────────────────────┘
                          │ staging green
┌─────────────────────────────────────────────────────────────────┐
│  deploy to production                                           │
│  ├─ ECS rolling deploy                                         │
│  ├─ Synthetic canary (1 req/min via CloudWatch Synthetics)     │
│  └─ Rollback trigger: error rate >5% for 3 minutes            │
└─────────────────────────────────────────────────────────────────┘
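The rollback trigger in the last stage amounts to a trailing-window error-rate check. A stdlib sketch using the diagram's 5% / 3-minute figures (the function name and sample shape are hypothetical):

```rust
/// Samples are (seconds_since_deploy, was_error) pairs, e.g. one per
/// canary request. Roll back when the error rate over the trailing
/// window exceeds `threshold`.
fn should_rollback(samples: &[(u64, bool)], now_s: u64, window_s: u64, threshold: f64) -> bool {
    let cutoff = now_s.saturating_sub(window_s);
    let recent: Vec<_> = samples.iter().filter(|(t, _)| *t >= cutoff).collect();
    if recent.is_empty() {
        return false; // no data is not evidence of failure
    }
    let errors = recent.iter().filter(|(_, e)| *e).count();
    errors as f64 / recent.len() as f64 > threshold
}
```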

7.2 GitHub Actions Configuration

# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: clippy, rustfmt
      - uses: Swatinem/rust-cache@v2
      - run: cargo fmt --check
      - run: cargo clippy --workspace -- -D warnings
      - run: cargo test --workspace --lib  # unit tests only (no integration)
      - run: cd ui && npm ci && npx vitest --run
      - run: cd cli && npm ci && npx vitest --run

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    # Docker is preinstalled on ubuntu-latest runners;
    # Testcontainers talks to the host daemon directly (no dind service needed).
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: cargo test --workspace --test '*'  # integration tests in tests/

  coverage:
    runs-on: ubuntu-latest
    needs: integration-tests
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo install cargo-tarpaulin
      - run: cargo tarpaulin --workspace --out Xml --output-dir coverage/
      - uses: codecov/codecov-action@v4
        with:
          fail_ci_if_error: true
          # The 70% merge gate itself is configured in codecov.yml
          # (coverage.status.project.default.target: 70%)

  benchmarks:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: cargo bench -- --output-format bencher | tee bench_output.txt
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: cargo
          output-file-path: bench_output.txt
          alert-threshold: '120%'
          fail-on-alert: true
          github-token: ${{ secrets.GITHUB_TOKEN }}
          auto-push: ${{ github.ref == 'refs/heads/main' }}

7.3 Coverage Thresholds

| Crate | Minimum Coverage | Rationale |
| --- | --- | --- |
| crates/shared (router, cost) | 85% | Core business logic — high confidence required |
| crates/proxy | 75% | Hot path — streaming paths are hard to unit test |
| crates/api | 75% | Auth and RBAC paths must be covered |
| crates/worker | 65% | Async scheduling is harder to test deterministically |
| cli/ | 70% | Parser logic must be covered |
| ui/ | 60% | UI components — visual testing supplements unit tests |

Coverage is measured by cargo-tarpaulin for Rust and vitest --coverage for TypeScript. Coverage gates block merges but are not re-checked at deploy time: shipping an urgent fix with slightly lower coverage beats being unable to deploy at all.
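tarpaulin emits one report per run, so enforcing per-crate minimums implies a small comparison step over per-crate figures. A stdlib sketch of that gate (crate names and minimums from the table; the function name is hypothetical):

```rust
use std::collections::HashMap;

/// Compare measured per-crate coverage against the table's minimums,
/// returning a sorted list of crates that fall short.
fn coverage_violations(
    measured: &HashMap<&str, f64>,
    minimums: &HashMap<&str, f64>,
) -> Vec<String> {
    let mut violations: Vec<String> = minimums
        .iter()
        .filter_map(|(krate, min)| match measured.get(krate) {
            Some(actual) if actual >= min => None,
            Some(actual) => Some(format!("{krate}: {actual:.1}% < {min:.1}%")),
            None => Some(format!("{krate}: no coverage data")),
        })
        .collect();
    violations.sort();
    violations
}
```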

7.4 Test Parallelization

# Unit tests run in parallel via cargo's default test harness.
# Integration tests use Testcontainers; each suite gets its own containers,
# so suites can fan out as a GitHub Actions matrix:

# .github/workflows/ci.yml (excerpt)
  integration-tests:
    strategy:
      matrix:
        suite: [proxy, api, worker, analytics]
    steps:
      - run: cargo test --test ${{ matrix.suite }}_integration

Each integration test suite spins up its own Testcontainers instances — no shared state, no port conflicts, fully parallelizable.


Section 8: Transparent Factory Tenet Testing

8.1 Atomic Flagging — Feature Flag Behavior (Story 10.1)

Every flag must be testable in three states: off (default), on, and auto-disabled (circuit tripped).

// tests/feature_flags.rs

#[tokio::test]
async fn routing_strategy_uses_passthrough_when_flag_is_off() {
    let flags = FlagProvider::from_json(json!({
        "cascading_routing": { "enabled": false }
    }));
    let req = make_request_with_feature("classify");
    let decision = route_with_flags(&req, &flags, &cost_tables()).await;
    assert_eq!(decision.strategy, RoutingStrategy::Passthrough);
}

#[tokio::test]
async fn routing_strategy_uses_cascading_when_flag_is_on() {
    let flags = FlagProvider::from_json(json!({
        "cascading_routing": { "enabled": true }
    }));
    let req = make_request_with_feature("classify");
    let decision = route_with_flags(&req, &flags, &cost_tables()).await;
    assert_eq!(decision.strategy, RoutingStrategy::Cascading);
}

#[tokio::test]
async fn flag_auto_disables_when_p99_latency_increases_by_more_than_5_percent() {
    let flags = Arc::new(Mutex::new(FlagProvider::from_json(json!({
        "new_complexity_classifier": { "enabled": true, "owner": "brian", "ttl_days": 7 }
    }))));

    let monitor = FlagHealthMonitor::new(flags.clone(), 4.0); // baseline P99: 4.0ms

    // Simulate latency spike
    for _ in 0..100 {
        monitor.record_latency(4.3); // 7.5% above baseline
    }

    tokio::time::sleep(Duration::from_secs(31)).await; // let one 30s evaluation tick elapse

    let current_flags = flags.lock().await;
    assert!(!current_flags.is_enabled("new_complexity_classifier"),
        "flag should have auto-disabled due to latency regression");
}

#[test]
fn flag_with_expired_ttl_fails_ci_audit() {
    let flags = vec![
        FlagDefinition {
            name: "old_feature".to_string(),
            rollout_pct: 100,
            created_at: Utc::now() - Duration::days(20),
            ttl_days: 14,
            owner: "brian".to_string(),
        }
    ];
    let violations = audit_flag_ttls(&flags);
    assert_eq!(violations.len(), 1);
    assert_eq!(violations[0].flag_name, "old_feature");
}

Flag test matrix — every flag must have tests for all three states:

| Flag | Off behavior | On behavior | Auto-disable trigger |
| --- | --- | --- | --- |
| cascading_routing | Passthrough | Try cheapest, escalate on error | P99 >5% regression |
| complexity_classifier_v2 | Use heuristic v1 | Use ML classifier | Error rate >2% |
| provider_failover_anthropic | No Anthropic fallback | Anthropic in fallback chain | Anthropic error rate >10% |
| cost_table_auto_refresh | Manual refresh only | Background 60s refresh | N/A |
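Each auto-disable trigger in the matrix compares a rolling statistic against a baseline. A stdlib sketch of the P99 regression check (naive full-sort percentile, which is fine for the small windows a flag monitor keeps; names hypothetical):

```rust
/// Naive P99: sort the window and index. A production monitor might
/// use a streaming sketch (t-digest, HDRHistogram) instead.
fn p99(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((sorted.len() as f64) * 0.99).ceil() as usize;
    sorted[idx.saturating_sub(1).min(sorted.len() - 1)]
}

/// True when observed P99 exceeds baseline by more than `pct`
/// (0.05 encodes the matrix's "P99 >5% regression" trigger).
fn should_auto_disable(baseline_p99_ms: f64, samples: &[f64], pct: f64) -> bool {
    !samples.is_empty() && p99(samples) > baseline_p99_ms * (1.0 + pct)
}
```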

8.2 Elastic Schema — Migration Validation (Story 10.2)

CI must reject any migration containing destructive DDL.

// tools/migration-lint/src/main.rs

const FORBIDDEN_PATTERNS: &[&str] = &[
    r"DROP\s+TABLE",
    r"DROP\s+COLUMN",
    r"ALTER\s+TABLE\s+\w+\s+RENAME",
    r"ALTER\s+COLUMN\s+\w+\s+TYPE",
    r"TRUNCATE",
];

pub fn lint_migration(sql: &str) -> Vec<LintViolation> {
    FORBIDDEN_PATTERNS.iter()
        .filter_map(|pattern| {
            let re = Regex::new(pattern).unwrap();
            if re.is_match(&sql.to_uppercase()) {
                Some(LintViolation { pattern: pattern.to_string(), sql: sql.to_string() })
            } else {
                None
            }
        })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lint_rejects_drop_table() {
        let sql = "DROP TABLE request_events;";
        assert!(!lint_migration(sql).is_empty());
    }

    #[test]
    fn lint_rejects_alter_column_type() {
        let sql = "ALTER TABLE request_events ALTER COLUMN latency_ms TYPE BIGINT;";
        assert!(!lint_migration(sql).is_empty());
    }

    #[test]
    fn lint_accepts_add_nullable_column() {
        let sql = "ALTER TABLE request_events ADD COLUMN cache_key VARCHAR(64) NULL;";
        assert!(lint_migration(sql).is_empty());
    }

    #[test]
    fn lint_accepts_create_index() {
        let sql = "CREATE INDEX CONCURRENTLY idx_re_model ON request_events(model_used);";
        assert!(lint_migration(sql).is_empty());
    }

    #[test]
    fn migration_file_includes_sunset_date_comment() {
        let sql = "-- sunset_date: 2026-03-30\nALTER TABLE orgs ADD COLUMN tier_v2 VARCHAR(20) NULL;";
        assert!(has_sunset_date_comment(sql));
    }

    #[test]
    fn migration_without_sunset_date_fails_lint() {
        let sql = "ALTER TABLE orgs ADD COLUMN tier_v2 VARCHAR(20) NULL;";
        assert!(!has_sunset_date_comment(sql));
    }
}

Dual-write pattern test:

#[tokio::test]
async fn dual_write_writes_to_both_old_and_new_schema_in_same_transaction() {
    let pool = test_pool().await;
    // Simulate migration window: both `plan` (old) and `plan_v2` (new) columns exist
    sqlx::query("ALTER TABLE organizations ADD COLUMN plan_v2 VARCHAR(30) NULL")
        .execute(&pool).await.unwrap();

    let org_id = create_org_dual_write(&pool, "pro").await.unwrap();

    let row = sqlx::query("SELECT plan, plan_v2 FROM organizations WHERE id = $1")
        .bind(org_id).fetch_one(&pool).await.unwrap();
    assert_eq!(row.get::<String, _>("plan"), "pro");
    assert_eq!(row.get::<String, _>("plan_v2"), "pro"); // written to both
}

8.3 Cognitive Durability — Decision Log Validation (Story 10.3)

CI enforces that PRs touching routing or cost logic include a decision_log.json entry.

# tools/decision-log-check/check.py
# Run as: python check.py --changed-files <list>

import json, sys, re
from pathlib import Path

GUARDED_PATHS = ["src/router/", "src/cost/", "migrations/"]
REQUIRED_FIELDS = ["prompt", "reasoning", "alternatives_considered", "confidence", "timestamp", "author"]

def check_decision_log(changed_files: list[str]) -> list[str]:
    errors = []
    touches_guarded = any(
        any(f.startswith(p) for p in GUARDED_PATHS)
        for f in changed_files
    )
    if not touches_guarded:
        return []

    log_files = list(Path("docs/decisions").glob("*.json"))
    if not log_files:
        return ["No decision_log.json found in docs/decisions/ for changes to guarded paths"]

    # Check the most recently modified log file
    latest = max(log_files, key=lambda p: p.stat().st_mtime)
    try:
        log = json.loads(latest.read_text())
        for field in REQUIRED_FIELDS:
            if field not in log:
                errors.append(f"decision_log missing required field: {field}")
    except json.JSONDecodeError as e:
        errors.append(f"decision_log.json is not valid JSON: {e}")

    return errors

# Tests for the checker itself
def test_check_passes_when_log_present_with_all_fields():
    ...  # test implementation

def test_check_fails_when_log_missing_reasoning_field():
    ...  # test implementation

Cyclomatic complexity enforcement:

# .clippy.toml
cognitive-complexity-threshold = 10
# CI step
- run: cargo clippy --workspace -- -W clippy::cognitive_complexity -D warnings

8.4 Semantic Observability — OTEL Span Assertion Tests (Story 10.4)

Tests verify that routing decisions emit correctly structured OpenTelemetry spans.

// tests/observability.rs
use opentelemetry_sdk::testing::trace::InMemorySpanExporter;

#[tokio::test]
async fn routing_decision_emits_ai_routing_decision_span() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    let req = make_request("gpt-4o", "classify"); // model, feature tag
    let _decision = route_request_with_tracing(&req, &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    let routing_span = spans.iter()
        .find(|s| s.name == "ai_routing_decision")
        .expect("ai_routing_decision span must be emitted");

    // Assert required attributes
    let attrs = span_attrs_as_map(routing_span);
    assert!(attrs.contains_key("ai.model_selected"));
    assert!(attrs.contains_key("ai.model_alternatives"));
    assert!(attrs.contains_key("ai.cost_delta"));
    assert!(attrs.contains_key("ai.complexity_score"));
    assert!(attrs.contains_key("ai.routing_strategy"));
    assert!(attrs.contains_key("ai.prompt_hash"));
}

#[tokio::test]
async fn routing_span_never_contains_raw_prompt_content() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    let secret_prompt = "My secret quarterly revenue is $4.2M";
    let req = make_request_with_prompt("gpt-4o", secret_prompt);
    route_request_with_tracing(&req, &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    for span in &spans {
        for (key, value) in span_attrs_as_map(span) {
            assert!(!format!("{:?}", value).contains(secret_prompt),
                "span attribute '{}' contains raw prompt content", key);
        }
        for event in &span.events {
            assert!(!event.name.contains(secret_prompt));
        }
    }
}

#[tokio::test]
async fn prompt_hash_is_sha256_of_first_500_chars_of_system_prompt() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    let system_prompt = "You are a helpful assistant.";
    let req = make_request_with_system_prompt("gpt-4o", system_prompt);
    route_request_with_tracing(&req, &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    let routing_span = spans.iter().find(|s| s.name == "ai_routing_decision").unwrap();
    let attrs = span_attrs_as_map(routing_span);

    let expected_hash = sha256_hex(&system_prompt[..system_prompt.len().min(500)]);
    assert_eq!(attrs["ai.prompt_hash"].as_str().unwrap(), expected_hash);
}

#[tokio::test]
async fn routing_span_is_child_of_request_trace() {
    let exporter = InMemorySpanExporter::default();
    let tracer = setup_test_tracer(exporter.clone());

    route_request_with_tracing(&make_request("gpt-4o", "test"), &tracer, &cost_tables()).await;

    let spans = exporter.get_finished_spans().unwrap();
    let request_span = spans.iter().find(|s| s.name == "proxy_request").unwrap();
    let routing_span = spans.iter().find(|s| s.name == "ai_routing_decision").unwrap();

    assert_eq!(routing_span.parent_span_id, request_span.span_context.span_id());
}

8.5 Configurable Autonomy — Governance Policy Tests (Story 10.5)

// tests/governance.rs

#[tokio::test]
async fn strict_mode_blocks_automatic_routing_rule_update() {
    let policy = Policy { governance_mode: GovernanceMode::Strict, panic_mode: false };
    let result = apply_routing_rule_update(&policy, make_rule_update()).await;
    assert_eq!(result, Err(GovernanceError::BlockedByStrictMode));
}

#[tokio::test]
async fn audit_mode_applies_change_and_logs_it() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, panic_mode: false };
    let log = Arc::new(Mutex::new(vec![]));
    let result = apply_routing_rule_update_with_log(&policy, make_rule_update(), log.clone()).await;

    assert!(result.is_ok());
    let entries = log.lock().await;
    assert_eq!(entries.len(), 1);
    assert!(entries[0].contains("Allowed by audit mode"));
}

#[tokio::test]
async fn panic_mode_freezes_all_routing_to_hardcoded_provider() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, panic_mode: true };
    let req = make_request_with_feature("classify"); // would normally route to gpt-4o-mini

    let decision = route_with_policy(&req, &policy, &cost_tables()).await;

    assert_eq!(decision.strategy, RoutingStrategy::Passthrough);
    assert_eq!(decision.target_provider, Provider::OpenAI); // hardcoded fallback
    assert!(decision.reason.contains("panic mode"));
}

#[tokio::test]
async fn panic_mode_disables_auto_failover() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, panic_mode: true };
    // Even if primary provider fails, panic mode should not auto-failover
    let mock_openai = MockServer::start().await;
    Mock::given(method("POST")).respond_with(ResponseTemplate::new(500))
        .mount(&mock_openai).await;

    let result = dispatch_with_policy(&policy, mock_openai.uri()).await;
    // Should return the provider error, not silently failover
    assert_eq!(result.unwrap_err(), DispatchError::ProviderError(500));
}

#[tokio::test]
async fn policy_file_changes_are_hot_reloaded_within_5_seconds() {
    let policy_path = temp_policy_file(GovernanceMode::Audit);
    let watcher = PolicyWatcher::new(&policy_path);

    // Change to strict mode
    write_policy_file(&policy_path, GovernanceMode::Strict);
    tokio::time::sleep(Duration::from_secs(5)).await;

    assert_eq!(watcher.current_mode(), GovernanceMode::Strict);
}

Section 9: Test Data & Fixtures

9.1 Factory Patterns for Test Data

All test data is created via factory functions — no raw struct literals scattered across tests. Factories provide sensible defaults with override capability.

// crates/shared/src/testing/factories.rs
// Feature-gated: only compiled in test builds

#[cfg(any(test, feature = "test-utils"))]
pub mod factories {
    use crate::models::*;
    use uuid::Uuid;
    use chrono::{Duration, Utc};

    pub struct OrgFactory {
        name: String,
        plan: String,
        monthly_spend_limit: Option<f64>,
    }

    impl OrgFactory {
        pub fn new() -> Self {
            Self {
                name: format!("Test Org {}", &Uuid::new_v4().to_string()[..8]),
                plan: "free".to_string(),
                monthly_spend_limit: None,
            }
        }
        pub fn pro(mut self) -> Self { self.plan = "pro".to_string(); self }
        pub fn with_spend_limit(mut self, limit: f64) -> Self {
            self.monthly_spend_limit = Some(limit); self
        }
        pub fn build(self) -> Organization {
            Organization {
                id: Uuid::new_v4(),
                name: self.name,
                slug: slugify(&self.name),
                plan: self.plan,
                monthly_llm_spend_limit: self.monthly_spend_limit,
                created_at: Utc::now(),
                updated_at: Utc::now(),
                ..Default::default()
            }
        }
    }

    pub struct RequestEventFactory {
        org_id: Uuid,
        model_requested: String,
        model_used: String,
        feature_tag: Option<String>,
        input_tokens: u32,
        output_tokens: u32,
        cost_actual: f64,
        cost_original: f64,
        latency_ms: u32,
        status_code: u16,
    }

    impl RequestEventFactory {
        pub fn new(org_id: Uuid) -> Self {
            Self {
                org_id,
                model_requested: "gpt-4o".to_string(),
                model_used: "gpt-4o-mini".to_string(),
                feature_tag: Some("classify".to_string()),
                input_tokens: 500,
                output_tokens: 50,
                cost_actual: 0.000083,
                cost_original: 0.001375,
                latency_ms: 3,
                status_code: 200,
            }
        }
        pub fn with_model(mut self, requested: &str, used: &str) -> Self {
            self.model_requested = requested.to_string();
            self.model_used = used.to_string();
            self
        }
        pub fn with_feature(mut self, feature: &str) -> Self {
            self.feature_tag = Some(feature.to_string()); self
        }
        pub fn with_tokens(mut self, input: u32, output: u32) -> Self {
            self.input_tokens = input;
            self.output_tokens = output;
            self
        }
        pub fn failed(mut self) -> Self {
            self.status_code = 500; self
        }
        pub fn build(self) -> RequestEvent {
            RequestEvent {
                id: Uuid::new_v4(),
                org_id: self.org_id,
                timestamp: Utc::now(),
                model_requested: self.model_requested,
                model_used: self.model_used,
                feature_tag: self.feature_tag,
                input_tokens: self.input_tokens,
                output_tokens: self.output_tokens,
                cost_actual: self.cost_actual,
                cost_original: self.cost_original,
                cost_saved: self.cost_original - self.cost_actual,
                latency_ms: self.latency_ms,
                status_code: self.status_code,
                ..Default::default()
            }
        }
    }

    pub struct RoutingRuleFactory {
        org_id: Uuid,
        priority: i32,
        strategy: RoutingStrategy,
        match_feature: Option<String>,
        model_chain: Vec<String>,
    }

    impl RoutingRuleFactory {
        pub fn cheapest(org_id: Uuid) -> Self {
            Self {
                org_id,
                priority: 0,
                strategy: RoutingStrategy::Cheapest,
                match_feature: None,
                model_chain: vec!["gpt-4o-mini".to_string(), "claude-3-haiku".to_string()],
            }
        }
        pub fn for_feature(mut self, feature: &str) -> Self {
            self.match_feature = Some(feature.to_string()); self
        }
        pub fn with_priority(mut self, p: i32) -> Self { self.priority = p; self }
        pub fn build(self) -> RoutingRule { /* ... */ }
    }

    // Convenience helpers
    pub fn make_event(org_id: Uuid) -> RequestEvent {
        RequestEventFactory::new(org_id).build()
    }

    pub fn make_events(org_id: Uuid, count: usize) -> Vec<RequestEvent> {
        (0..count).map(|_| make_event(org_id)).collect()
    }

    pub fn make_events_spread_over_days(org_id: Uuid, count: usize, days: u32) -> Vec<RequestEvent> {
        (0..count).map(|i| {
            let mut event = make_event(org_id);
            event.timestamp = Utc::now() - Duration::days((i % days as usize) as i64);
            event
        }).collect()
    }
}

TypeScript factories for UI and CLI tests:

// ui/src/testing/factories.ts

export const makeOrg = (overrides: Partial<Organization> = {}): Organization => ({
  id: crypto.randomUUID(),
  name: 'Test Org',
  slug: 'test-org',
  plan: 'free',
  createdAt: new Date().toISOString(),
  ...overrides,
});

export const makeDashboardSummary = (overrides: Partial<DashboardSummary> = {}): DashboardSummary => ({
  period: '7d',
  totalRequests: 42850,
  totalCost: 127.43,
  totalCostWithoutRouting: 891.20,
  totalSaved: 763.77,
  savingsPercentage: 85.7,
  avgLatencyMs: 4.2,
  ...overrides,
});

export const makeRequestEvent = (overrides: Partial<RequestEvent> = {}): RequestEvent => ({
  id: `req_${Math.random().toString(36).slice(2, 10)}`,
  timestamp: new Date().toISOString(),
  modelRequested: 'gpt-4o',
  modelUsed: 'gpt-4o-mini',
  provider: 'openai',
  featureTag: 'classify',
  inputTokens: 142,
  outputTokens: 8,
  cost: 0.000026,
  costWithoutRouting: 0.000435,
  saved: 0.000409,
  latencyMs: 245,
  complexity: 'LOW',
  status: 200,
  ...overrides,
});

export const makeTreemapData = (): TreemapNode[] => [
  { name: 'classify', value: 450.20, children: [
    { name: 'gpt-4o-mini', value: 320.10 },
    { name: 'claude-3-haiku', value: 130.10 },
  ]},
  { name: 'summarize', value: 280.50, children: [
    { name: 'gpt-4o', value: 280.50 },
  ]},
];

9.2 Provider Response Mocks (OpenAI & Anthropic)

Recorded fixtures live in tests/fixtures/. They are captured once from real APIs and committed to the repo.

tests/fixtures/
├── openai/
│   ├── chat_completions_non_streaming.json
│   ├── chat_completions_streaming.txt          # raw SSE stream
│   ├── chat_completions_streaming_with_usage.txt
│   ├── chat_completions_tool_call.json
│   ├── embeddings_response.json
│   ├── error_rate_limit_429.json
│   ├── error_invalid_api_key_401.json
│   └── error_server_error_500.json
├── anthropic/
│   ├── messages_response.json
│   ├── messages_streaming.txt
│   ├── error_overloaded_529.json
│   └── error_rate_limit_429.json
└── dd0c/
    ├── routing_decision_cheapest.json          # expected routing decision output
    ├── routing_decision_cascading.json
    └── request_event_full.json                 # full RequestEvent with all fields

OpenAI non-streaming fixture:

// tests/fixtures/openai/chat_completions_non_streaming.json
{
  "id": "chatcmpl-test123",
  "object": "chat.completion",
  "created": 1709251200,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "This is a billing inquiry." },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 8,
    "total_tokens": 50
  }
}

OpenAI streaming fixture:

// tests/fixtures/openai/chat_completions_streaming.txt
data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{"content":"This"},"finish_reason":null}]}

data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}

data: {"id":"chatcmpl-test123","object":"chat.completion.chunk","created":1709251200,"model":"gpt-4o-mini-2024-07-18","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":42,"completion_tokens":3,"total_tokens":45}}

data: [DONE]

Fixture loader utility:

// tests/common/fixtures.rs

pub fn load_fixture(path: &str) -> serde_json::Value {
    let fixture_path = Path::new(env!("CARGO_MANIFEST_DIR"))
        .join("tests/fixtures")
        .join(path);
    let content = fs::read_to_string(&fixture_path)
        .unwrap_or_else(|_| panic!("fixture not found: {}", fixture_path.display()));
    serde_json::from_str(&content)
        .unwrap_or_else(|_| panic!("fixture is not valid JSON: {}", path))
}

pub fn load_sse_fixture(path: &str) -> Vec<u8> {
    let fixture_path = Path::new(env!("CARGO_MANIFEST_DIR"))
        .join("tests/fixtures")
        .join(path);
    fs::read(&fixture_path)
        .unwrap_or_else(|_| panic!("SSE fixture not found: {}", fixture_path.display()))
}

9.3 Cost Table Fixtures

// crates/shared/src/testing/cost_tables.rs

#[cfg(any(test, feature = "test-utils"))]
pub fn cost_tables() -> CostTables {
    CostTables::from_vec(vec![
        ModelCost {
            provider: Provider::OpenAI,
            model_id: "gpt-4o-2024-11-20".to_string(),
            model_alias: "gpt-4o".to_string(),
            input_cost_per_m: 2.50,
            output_cost_per_m: 10.00,
            quality_tier: QualityTier::Frontier,
            max_context: 128_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: true,
        },
        ModelCost {
            provider: Provider::OpenAI,
            model_id: "gpt-4o-mini-2024-07-18".to_string(),
            model_alias: "gpt-4o-mini".to_string(),
            input_cost_per_m: 0.15,
            output_cost_per_m: 0.60,
            quality_tier: QualityTier::Economy,
            max_context: 128_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: true,
        },
        ModelCost {
            provider: Provider::Anthropic,
            model_id: "claude-3-haiku-20240307".to_string(),
            model_alias: "claude-3-haiku".to_string(),
            input_cost_per_m: 0.25,
            output_cost_per_m: 1.25,
            quality_tier: QualityTier::Economy,
            max_context: 200_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: false,
        },
        ModelCost {
            provider: Provider::Anthropic,
            model_id: "claude-3-5-sonnet-20241022".to_string(),
            model_alias: "claude-3-5-sonnet".to_string(),
            input_cost_per_m: 3.00,
            output_cost_per_m: 15.00,
            quality_tier: QualityTier::Frontier,
            max_context: 200_000,
            supports_streaming: true,
            supports_tools: true,
            supports_vision: true,
        },
    ])
}

/// Returns all valid (requested, used) pairs where routing makes sense
#[cfg(any(test, feature = "test-utils"))]
pub fn routable_model_pairs() -> Vec<(&'static str, &'static str)> {
    vec![
        ("gpt-4o", "gpt-4o-mini"),
        ("gpt-4o", "claude-3-haiku"),
        ("claude-3-5-sonnet", "gpt-4o-mini"),
        ("claude-3-5-sonnet", "claude-3-haiku"),
        // Same model (zero savings)
        ("gpt-4o-mini", "gpt-4o-mini"),
        ("gpt-4o", "gpt-4o"),
    ]
}

Section 10: TDD Implementation Order

10.1 Bootstrap Sequence

Before writing any product tests, the test infrastructure itself must be bootstrapped. This is the meta-TDD step.

Week 0 (before Epic 1 code):

Day 1: Test infrastructure setup
  ├─ Add dev-dependencies: mockall, proptest, testcontainers, wiremock, criterion
  ├─ Create crates/shared/src/testing/ module (factories, cost_tables, helpers)
  ├─ Create tests/common/ (fixture loader, test app builder, DB setup helpers)
  ├─ Write and pass: "test infrastructure compiles and factories produce valid structs"
  └─ Set up cargo-tarpaulin and confirm coverage reporting works

Day 2: CI pipeline skeleton
  ├─ Create .github/workflows/ci.yml with unit test job (no tests yet — just passes)
  ├─ Add benchmark job with baseline capture
  ├─ Add migration lint script and test it against a sample migration
  └─ Confirm: `git push` → CI green (trivially)

10.2 Epic-by-Epic TDD Order

Tests must be written in dependency order — you can't test the Router Brain without the cost table fixtures, and you can't test the Analytics Pipeline without the proxy event schema.

Phase 1: Foundation (Epic 1 — Proxy Engine)
─────────────────────────────────────────────
WRITE FIRST (before any proxy code):
  1. test: parse_request_extracts_model_and_stream_flag
  2. test: auth_middleware_returns_401_for_unknown_key
  3. test: auth_middleware_caches_valid_key_after_db_lookup
  4. test: response_headers_contain_routing_metadata
  5. test: telemetry_emitter_drops_event_when_channel_is_full

THEN implement proxy core to make them pass.

THEN add property tests:
  6. proptest: api_key_hash_is_deterministic
  7. proptest: response_headers_never_contain_prompt_content

THEN add integration tests (requires Docker):
  8. integration: proxy_forwards_request_to_mock_openai_and_returns_200
  9. integration: proxy_returns_401_for_revoked_key_after_cache_invalidation
  10. contract: proxy_response_matches_openai_response_schema
  11. contract: proxy_preserves_sse_chunk_ordering_for_streaming_requests
  12. contract: proxy_translates_anthropic_response_to_openai_format

Phase 2: Intelligence (Epic 2 — Router Brain)
──────────────────────────────────────────────
WRITE FIRST:
  13. test: rule_engine_returns_first_matching_rule_by_priority
  14. test: rule_engine_falls_through_to_passthrough_when_no_rules_match
  15. test: cheapest_strategy_selects_lowest_cost_model_from_chain
  16. test: classifier_returns_low_for_short_extraction_system_prompt
  17. test: classifier_returns_high_for_code_generation_prompt
  18. test: cost_saved_is_zero_when_models_are_identical
  19. test: cost_saved_is_positive_when_routed_to_cheaper_model
  20. test: circuit_breaker_transitions_to_open_after_error_threshold
  21. test: circuit_breaker_transitions_to_half_open_after_cooldown

THEN implement Router Brain.

THEN add property tests:
  22. proptest: cheapest_strategy_never_selects_more_expensive_model
  23. proptest: complexity_classifier_never_panics_on_arbitrary_input
  24. proptest: cost_saved_is_never_negative

THEN add integration tests:
  25. integration: circuit_breaker_state_is_shared_across_two_proxy_instances
  26. integration: routing_rule_loaded_from_db_takes_effect_on_next_request

Phase 3: Data (Epic 3 — Analytics Pipeline)
─────────────────────────────────────────────
WRITE FIRST:
  27. test: batch_collector_flushes_after_100_events_before_timeout
  28. test: batch_collector_flushes_partial_batch_after_interval
  29. test: proxy_continues_routing_when_telemetry_db_is_unavailable

THEN implement analytics worker.

THEN add integration tests:
  30. integration: batch_worker_inserts_events_into_timescaledb_hypertable
  31. integration: continuous_aggregate_reflects_inserted_events_after_refresh

Phase 4: Control Plane (Epic 4 — Dashboard API)
─────────────────────────────────────────────────
WRITE FIRST:
  32. test: create_api_key_returns_full_key_only_once
  33. test: member_role_cannot_create_routing_rules
  34. test: request_inspector_never_returns_prompt_content
  35. test: provider_credential_is_stored_encrypted
  36. test: revoked_api_key_returns_401_on_next_proxy_request

THEN implement Dashboard API.

THEN add integration tests:
  37. integration: create_org_and_api_key_persists_to_postgres
  38. integration: routing_rules_are_returned_in_priority_order
  39. integration: dashboard_summary_query_returns_correct_aggregates

Phase 5: Transparent Factory (Epic 10 — Cross-cutting)
────────────────────────────────────────────────────────
These tests are written ALONGSIDE the features they govern, not after.

  40. test: routing_strategy_uses_passthrough_when_flag_is_off          (with Epic 2)
  41. test: flag_auto_disables_when_p99_latency_increases_by_5_percent  (with Epic 2)
  42. test: lint_rejects_drop_table                                      (before any migration)
  43. test: routing_decision_emits_ai_routing_decision_span             (with Epic 2)
  44. test: routing_span_never_contains_raw_prompt_content              (with Epic 2)
  45. test: strict_mode_blocks_automatic_routing_rule_update            (with Epic 2)
  46. test: panic_mode_freezes_all_routing_to_hardcoded_provider        (with Epic 2)

Phase 6: UI & CLI (Epics 5 & 6)
─────────────────────────────────
  47. vitest: CostTreemap renders spend breakdown by feature tag
  48. vitest: RoutingRulesEditor allows drag-to-reorder priority
  49. vitest: RequestInspector filters by feature tag
  50. vitest: dd0c-scan detects gpt-4o usage in TypeScript files
  51. vitest: SavingsReport calculates positive savings

Phase 7: E2E (after staging environment is live)
──────────────────────────────────────────────────
  52. playwright: first_route_onboarding_journey_completes_in_under_2_minutes
  53. playwright: routing_rule_created_in_ui_takes_effect_on_next_request
  54. k6: proxy_overhead_p99_is_under_5ms_at_50_concurrent_users
  55. k6: proxy_continues_routing_when_timescaledb_is_killed

10.3 Test Count Milestones

| Milestone | Tests Written | Coverage Target | Gate |
|---|---|---|---|
| End of Epic 1 | ~50 | 60% proxy crate | CI green |
| End of Epic 2 | ~120 | 80% shared/router | CI green |
| End of Epic 3 | ~150 | 70% worker | CI green |
| End of Epic 4 | ~220 | 75% api crate | CI green |
| End of Epic 10 | ~280 | 80% overall | CI green |
| End of Epic 5+6 | ~320 | 75% overall | CI green |
| V1 Launch | ~400 | 75% overall | Deploy gate |

10.4 The "Test It First" Checklist

Before writing any new function, ask:

□ Does this function have a clear, testable contract?
  (If not, the function is probably doing too much — split it)

□ Can I write the test without knowing the implementation?
  (If not, the abstraction is wrong — redesign the interface)

□ Does this function touch the hot path?
  → Add a criterion benchmark

□ Does this function handle money (cost calculations)?
  → Add proptest property tests

□ Does this function touch auth or security?
  → Add tests for the invalid/revoked/malformed cases explicitly

□ Does this function emit telemetry or spans?
  → Add an OTEL span assertion test

□ Does this function change routing behavior?
  → Add a feature flag test (off/on/auto-disabled)

□ Does this function modify the database schema?
  → Add a migration lint test and a dual-write test

Appendix: Test Toolchain Summary

| Tool | Language | Purpose | Config |
|---|---|---|---|
| cargo test | Rust | Unit + integration test runner | Cargo.toml |
| mockall | Rust | Mock generation for traits | #[automock] attribute |
| proptest | Rust | Property-based testing | proptest! macro |
| criterion | Rust | Micro-benchmarks | [[bench]] in Cargo.toml |
| testcontainers | Rust | Real DB/Redis in tests | Docker required |
| wiremock | Rust | HTTP mock server | MockServer::start().await |
| cargo-tarpaulin | Rust | Code coverage | cargo tarpaulin |
| cargo-audit | Rust | Dependency vulnerability scan | cargo audit |
| vitest | TypeScript | Unit tests for UI + CLI | vitest.config.ts |
| @testing-library/react | TypeScript | React component tests | With vitest |
| Playwright | TypeScript | E2E browser tests | playwright.config.ts |
| k6 | JavaScript | Load + chaos tests | k6 run |
| migration-lint | Python/Bash | DDL safety checks | Pre-commit + CI |
| decision-log-check | Python | Cognitive durability enforcement | CI only |
| benchmark-action | GitHub Actions | Benchmark regression detection | .github/workflows/ |

Test Architecture document generated for dd0c/route V1 MVP. Total estimated test count at V1 launch: ~400 tests. Target CI runtime: <8 minutes (unit + integration), <15 minutes (full pipeline with E2E).


Section 11: Review Remediation Addendum (Post-Gemini Review)

11.1 Replace MockKeyCache/MockKeyStore with Testcontainers

// BEFORE (anti-pattern — mocks hide real latency):
// let cache = MockKeyCache::new();
// let store = MockKeyStore::new();

// AFTER: Use Testcontainers for hot-path auth tests
#[tokio::test]
async fn auth_middleware_validates_key_under_5ms_with_real_redis() {
    let redis = TestcontainersRedis::start().await;
    let pg = TestcontainersPostgres::start().await;
    let cache = RedisKeyCache::new(redis.connection_string());
    let store = PgKeyStore::new(pg.connection_string());

    let start = Instant::now();
    let result = auth_middleware(&cache, &store, "sk-valid-key").await;
    assert!(start.elapsed() < Duration::from_millis(5));
    assert!(result.is_ok());
}

#[tokio::test]
async fn auth_middleware_handles_redis_connection_pool_exhaustion() {
    // Exhaust all Redis connections, verify fallback to PG
    let redis = TestcontainersRedis::start().await;
    let pg = TestcontainersPostgres::start().await;
    let cache = RedisKeyCache::with_pool_size(redis.connection_string(), 1);
    let pg_store = PgKeyStore::new(pg.connection_string());
    // Hold the single connection
    let _held = cache.raw_connection().await;
    // Auth must still work via the PG fallback
    let result = auth_middleware(&cache, &pg_store, "sk-valid-key").await;
    assert!(result.is_ok());
}

11.2 Fix Encryption Test (Decrypt, Don't Just Assert Non-Plaintext)

// BEFORE (anti-pattern — passes if stored as random garbage):
// assert_ne!(stored.encrypted_key, b"sk-plaintext-key");

// AFTER: Full round-trip encryption test
#[tokio::test]
async fn provider_credential_encrypts_and_decrypts_correctly() {
    let pg = TestcontainersPostgres::start().await;
    let kms = LocalStackKMS::start().await;
    let key_id = kms.create_key().await;
    let store = CredentialStore::new(pg.pool(), kms.client(), key_id);

    let original = "sk-live-abc123xyz";
    store.save_credential("org-1", "openai", original).await.unwrap();

    // Read raw from DB — must NOT be plaintext
    let raw = pg.query_raw("SELECT encrypted_key FROM credentials LIMIT 1").await;
    assert!(!String::from_utf8_lossy(&raw).contains(original));

    // Decrypt via the store — must match original
    let decrypted = store.get_credential("org-1", "openai").await.unwrap();
    assert_eq!(decrypted, original);
}

#[tokio::test]
async fn kms_key_rotation_old_deks_still_decrypt_old_credentials() {
    let pg = TestcontainersPostgres::start().await;
    let kms = LocalStackKMS::start().await;
    let key_id = kms.create_key().await;
    let store = CredentialStore::new(pg.pool(), kms.client(), key_id);

    // Save with original key
    store.save_credential("org-1", "openai", "sk-old").await.unwrap();

    // Rotate KMS key
    kms.rotate_key(key_id).await;

    // Old credential must still decrypt
    let decrypted = store.get_credential("org-1", "openai").await.unwrap();
    assert_eq!(decrypted, "sk-old");

    // New credential uses new DEK
    store.save_credential("org-1", "anthropic", "sk-new").await.unwrap();
    let decrypted_new = store.get_credential("org-1", "anthropic").await.unwrap();
    assert_eq!(decrypted_new, "sk-new");
}

11.3 Slow Dependency Chaos Test

#[tokio::test]
async fn chaos_slow_db_does_not_block_proxy_hot_path() {
    let stack = E2EStack::start().await;

    // Inject 5-second network delay on TimescaleDB port via tc netem
    stack.inject_latency("timescaledb", Duration::from_secs(5)).await;

    // Proxy must still route requests within SLA
    let start = Instant::now();
    let resp = stack.proxy()
        .post("/v1/chat/completions")
        .header("Authorization", "Bearer sk-valid")
        .json(&chat_request())
        .send().await;
    let latency = start.elapsed();

    assert_eq!(resp.status(), 200);
    // Telemetry is dropped, but routing works
    assert!(latency < Duration::from_millis(50),
        "Proxy blocked by slow DB: {:?}", latency);
}

#[tokio::test]
async fn chaos_slow_redis_falls_back_to_pg_for_auth() {
    let stack = E2EStack::start().await;
    stack.inject_latency("redis", Duration::from_secs(3)).await;

    let resp = stack.proxy()
        .post("/v1/chat/completions")
        .header("Authorization", "Bearer sk-valid")
        .json(&chat_request())
        .send().await;
    assert_eq!(resp.status(), 200);
}

11.4 IDOR / Cross-Tenant Test Suite

// tests/integration/idor_test.rs

#[tokio::test]
async fn idor_org_a_cannot_read_org_b_routing_rules() {
    let stack = E2EStack::start().await;
    let org_a_token = stack.create_org_and_token("org-a").await;
    let org_b_token = stack.create_org_and_token("org-b").await;

    // Org B creates a routing rule
    let rule = stack.api()
        .post("/v1/routing-rules")
        .bearer_auth(&org_b_token)
        .json(&json!({ "name": "secret-rule", "model": "gpt-4" }))
        .send().await.json::<RoutingRule>().await;

    // Org A tries to read it
    let resp = stack.api()
        .get(&format!("/v1/routing-rules/{}", rule.id))
        .bearer_auth(&org_a_token)
        .send().await;
    assert_eq!(resp.status(), 404); // Not 403 — don't leak existence
}

#[tokio::test]
async fn idor_org_a_cannot_read_org_b_api_keys() {
    // Same pattern — create key as org B, attempt read as org A
}

#[tokio::test]
async fn idor_org_a_cannot_read_org_b_telemetry() {
    // Same pattern — query org B's telemetry as org A, expect an empty result or 404
}

#[tokio::test]
async fn idor_org_a_cannot_mutate_org_b_routing_rules() {
    // Same pattern — PATCH/DELETE org B's rule as org A, expect 404 and no change
}

11.5 SSE Connection Drop / Billing Leak Test

#[tokio::test]
async fn sse_client_disconnect_aborts_upstream_provider_request() {
    let stack = E2EStack::start().await;
    let mock_provider = stack.mock_provider();

    // Configure provider to stream slowly (1 token/sec for 60 tokens)
    mock_provider.configure_slow_stream(60, Duration::from_secs(1));

    // Start streaming request
    let mut stream = stack.proxy()
        .post("/v1/chat/completions")
        .json(&json!({ "stream": true, "model": "gpt-4" }))
        .send().await
        .bytes_stream();

    // Read 5 tokens then drop the connection
    for _ in 0..5 {
        stream.next().await;
    }
    drop(stream);

    // Wait briefly for cleanup
    tokio::time::sleep(Duration::from_millis(500)).await;

    // Provider connection must be aborted — not still streaming
    assert_eq!(mock_provider.active_connections(), 0);

    // Billing: customer should only be charged for 5 tokens, not 60
    let usage = stack.get_last_usage_record().await;
    assert!(usage.completion_tokens <= 10); // Some buffer for in-flight
}

11.6 Concurrent Circuit Breaker Race Condition

#[tokio::test]
async fn circuit_breaker_handles_50_concurrent_failures_cleanly() {
    let redis = TestcontainersRedis::start().await;
    let breaker = RedisCircuitBreaker::new(redis.connection_string(), "openai", 10);

    let mut handles = vec![];
    for _ in 0..50 {
        let b = breaker.clone();
        handles.push(tokio::spawn(async move {
            b.record_failure().await;
        }));
    }
    futures::future::join_all(handles).await;

    // Breaker must be open — no race condition leaving it closed
    assert_eq!(breaker.state().await, CircuitState::Open);
    // Failure count must be exactly 50 (atomic increments)
    assert_eq!(breaker.failure_count().await, 50);
}

11.7 Trace Context Propagation

#[tokio::test]
async fn otel_trace_propagates_from_client_through_proxy_to_provider() {
    let stack = E2EStack::start().await;
    let tracer = stack.in_memory_tracer();

    let resp = stack.proxy()
        .post("/v1/chat/completions")
        .header("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
        .json(&chat_request())
        .send().await;

    let spans = tracer.finished_spans();
    let proxy_span = spans.iter().find(|s| s.name == "proxy.route").unwrap();

    // Proxy span must be child of the incoming trace
    assert_eq!(proxy_span.trace_id, "4bf92f3577b34da6a3ce929d0e0e4736");

    // Provider request must carry the same trace_id
    let provider_req = stack.mock_provider().last_request();
    assert!(provider_req.headers["traceparent"].contains("4bf92f3577b34da6a3ce929d0e0e4736"));
}

11.8 Flag Provider Fallback Test

#[test]
fn flag_provider_unreachable_falls_back_to_safe_default() {
    // Simulate a missing flag config file (malformed JSON is covered below)
    let provider = JsonFileProvider::new("/nonexistent/flags.json");
    let result = provider.evaluate("enable_new_router", false);
    // Must return the safe default (false), not panic or error
    assert_eq!(result, false);
}

#[test]
fn flag_provider_malformed_json_falls_back_to_safe_default() {
    let provider = JsonFileProvider::from_string("{ invalid json }}}");
    let result = provider.evaluate("enable_new_router", false);
    assert_eq!(result, false);
}

11.9 24-Hour Soak Test Spec

// tests/soak/long_running_latency.rs
// Run manually: cargo test --test soak -- --ignored

#[tokio::test]
#[ignore] // Only run in nightly CI
async fn soak_24h_proxy_latency_stays_under_5ms_p99() {
    // k6 config: 10 RPS sustained for 24 hours
    // Assert: p99 < 5ms, no memory growth > 50MB, no connection leaks
    // This catches memory fragmentation and connection pool exhaustion
}

11.10 Panic Mode Authorization

#[tokio::test]
async fn panic_mode_requires_owner_role() {
    let stack = E2EStack::start().await;
    let viewer_token = stack.create_token_with_role("org-1", Role::Viewer).await;

    let resp = stack.api()
        .post("/admin/panic")
        .bearer_auth(&viewer_token)
        .send().await;
    assert_eq!(resp.status(), 403);
}

#[tokio::test]
async fn panic_mode_allowed_for_owner_role() {
    let stack = E2EStack::start().await;
    let owner_token = stack.create_token_with_role("org-1", Role::Owner).await;
    let resp = stack.api()
        .post("/admin/panic")
        .bearer_auth(&owner_token)
        .send().await;
    assert_eq!(resp.status(), 200);
}

End of P1 Review Remediation Addendum