dd0c/run — Test Architecture & TDD Strategy

Version: 1.0 | Date: 2026-02-28 | Status: Draft


1. Testing Philosophy & TDD Workflow

1.1 Core Principle: Safety-First Testing

dd0c/run executes commands in production infrastructure. A misclassification that allows a destructive command to auto-execute is an existential risk. This shapes every testing decision:

If it touches execution, tests come first. No exceptions.

The TDD discipline here is not a process preference — it is a safety mechanism. Writing tests first for the Action Classifier and Execution Engine forces the developer to enumerate failure modes before writing code that could cause them.

1.2 Red-Green-Refactor Adapted for dd0c/run

The standard Red-Green-Refactor cycle applies with one critical modification: for any component that classifies or executes commands, the "Red" phase must include at least one test for a known-destructive command being correctly blocked.

Standard TDD:   Red → Green → Refactor

dd0c/run TDD:   Red          (write failing test)
                ↓
                Red-Safety   (add: "rm -rf / must be 🔴")
                ↓
                Green        (make ALL tests pass)
                ↓
                Refactor
                ↓
                Canary-Check (run canary suite — must stay green)

The Canary Suite is a mandatory gate: a curated set of ~50 known-destructive commands that must always be classified as 🔴. It runs after every refactor. If any canary is classified as anything other than 🔴, the build fails.
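The canary gate can be sketched as a single loop over known-destructive commands. This is a minimal illustration with a hypothetical `classify` stand-in (a substring blocklist); the real suite loads ~50 commands from a fixture file and calls the production Scanner:

```rust
// Minimal sketch of the canary gate. `RiskLevel` and `classify` are
// illustrative stand-ins, not the production Scanner API.
#[derive(Debug, PartialEq)]
enum RiskLevel {
    Safe,
    Caution,
    Dangerous,
}

// Stand-in classifier for illustration: a substring blocklist only.
fn classify(command: &str) -> RiskLevel {
    const BLOCKLIST: &[&str] = &["rm -rf", "drop table", "terraform destroy"];
    let lowered = command.to_lowercase();
    if BLOCKLIST.iter().any(|p| lowered.contains(p)) {
        RiskLevel::Dangerous
    } else {
        // Unknown commands never default to Safe.
        RiskLevel::Caution
    }
}

fn main() {
    // Every canary MUST classify as Dangerous; a panic here fails the build.
    let canaries = [
        "rm -rf /",
        "DROP TABLE users;",
        "terraform destroy -auto-approve",
    ];
    for cmd in canaries {
        assert_eq!(classify(cmd), RiskLevel::Dangerous, "canary escaped: {cmd}");
    }
    println!("all canaries blocked");
}
```

Note the fallback branch: a canary suite only catches regressions on known patterns, which is why unknown commands must never default to 🟢.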

1.3 When to Write Tests First vs. Integration Tests Lead

| Scenario | Approach | Rationale |
|---|---|---|
| Deterministic Safety Scanner | Unit tests FIRST, always | Pure function. No I/O. Exhaustive coverage of destructive patterns is mandatory before any code. |
| Classification Merge Rules | Unit tests FIRST, always | Hardcoded logic. Tests define the spec. |
| Execution Engine state machine | Unit tests FIRST, always | State transitions are safety-critical. Tests enumerate all valid/invalid transitions. |
| Approval workflow | Unit tests FIRST | Approval bypass is a threat vector. Tests must prove it's impossible. |
| Runbook Parser (LLM extraction) | Integration tests lead | LLM behavior is non-deterministic. Integration tests with recorded fixtures define expected behavior. |
| Slack Bot UI flows | Integration tests lead | Slack API interactions are I/O-heavy. Mock Slack API, test message shapes. |
| Alert-Runbook Matcher | Integration tests lead | Matching logic depends on DB state. Testcontainers + fixture data. |
| Audit Trail ingestion | Unit tests first for schema, integration for pipeline | Schema is deterministic; pipeline has I/O. |

1.4 Test Naming Conventions

All tests follow the pattern: <component>_<scenario>_<expected_outcome>

// Unit tests
#[test]
fn scanner_rm_rf_root_classifies_as_dangerous() { ... }

#[test]
fn scanner_kubectl_get_pods_classifies_as_safe() { ... }

#[test]
fn merge_engine_scanner_dangerous_overrides_llm_safe() { ... }

#[test]
fn state_machine_caution_step_transitions_to_await_approval() { ... }

// Integration tests
#[tokio::test]
async fn parser_confluence_html_extracts_ordered_steps() { ... }

#[tokio::test]
async fn execution_engine_approval_timeout_does_not_auto_approve() { ... }

// E2E tests
#[tokio::test]
async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() { ... }

Prohibited naming patterns:

  • test_thing() — too vague
  • it_works() — meaningless
  • test1(), test2() — no context

1.5 Safety-First Rule: Destructive Command Tests Are Mandatory Pre-Code

Hard Rule: No code may be written for the Action Classifier, Execution Engine, or Agent-side Scanner without first writing tests that prove destructive commands are blocked.

This is enforced via CI: the pkg/classifier/ and pkg/executor/ directories require a minimum of 95% test coverage before any PR can merge. The canary test suite must pass on every commit touching these packages.


2. Test Pyramid

2.1 Overall Distribution

         /\
        /E2E\          ~10% — Critical user journeys, chaos scenarios
       /──────\
      / Integ  \        ~20% — Service boundaries, DB, Slack, gRPC
     /──────────\
    /    Unit    \      ~70% — Pure logic, state machines, classifiers
   /______________\

For the Action Classifier and Execution Engine specifically, the ratio shifts:

Unit:        80%  (exhaustive pattern coverage, state machine transitions)
Integration: 15%  (scanner ↔ classifier merge, engine ↔ agent gRPC)
E2E:          5%  (full execution journeys with sandboxed infra)

2.2 Unit Test Targets (per component)

| Component | Unit Test Focus | Coverage Target |
|---|---|---|
| Deterministic Safety Scanner | Pattern matching, AST parsing, all risk categories | 100% |
| Classification Merge Engine | All 5 merge rules + edge cases | 100% |
| Execution Engine state machine | All state transitions, trust level enforcement | 95% |
| Runbook Parser (normalizer) | HTML stripping, markdown normalization, whitespace | 90% |
| Variable Detector | Placeholder regex patterns, env ref detection | 90% |
| Branch Mapper | DAG construction, if/else detection | 85% |
| Approval Workflow | Approval gates, typed confirmation, timeout behavior | 95% |
| Audit Trail schema | Event type validation, immutability constraints | 90% |
| Alert-Runbook Matcher | Keyword matching, similarity scoring | 85% |
| Trust Level Enforcement | Level checks per risk level, auto-downgrade | 95% |
| Panic Mode | Trigger conditions, halt sequence, Redis key check | 95% |
| Feature Flag Circuit Breaker | 2-failure threshold, 48h bake enforcement | 95% |

2.3 Integration Test Boundaries

| Boundary | Test Type | Infrastructure |
|---|---|---|
| Parser ↔ LLM Gateway (dd0c/route) | Contract test with recorded responses | WireMock / recorded fixtures |
| Classifier ↔ PostgreSQL (audit write) | Integration test | Testcontainers (PostgreSQL) |
| Execution Engine ↔ Agent (gRPC) | Integration test | In-process gRPC server mock |
| Execution Engine ↔ Slack Bot | Integration test | Slack API mock |
| Approval Workflow ↔ Slack | Integration test | Slack API mock |
| Audit Trail ↔ PostgreSQL | Integration test | Testcontainers (PostgreSQL) |
| Alert Matcher ↔ PostgreSQL + pgvector | Integration test | Testcontainers (PostgreSQL + pgvector) |
| Webhook Receiver ↔ PagerDuty/OpsGenie | Contract test | Recorded webhook payloads |
| RLS enforcement | Integration test | Testcontainers (PostgreSQL with RLS enabled) |

2.4 E2E / Smoke Test Scenarios

| Scenario | Priority | Infrastructure |
|---|---|---|
| Full journey: paste → parse → classify → approve → execute → audit | P0 | Docker Compose sandbox |
| Destructive command blocked at all trust levels | P0 | Docker Compose sandbox |
| Panic mode triggered and halts in-flight execution | P0 | Docker Compose sandbox |
| Approval timeout does not auto-approve | P0 | Docker Compose sandbox |
| Cross-tenant data isolation (RLS) | P0 | Testcontainers |
| Agent reconnect after network partition | P1 | Docker Compose sandbox |
| Mid-execution failure triggers rollback flow | P1 | Docker Compose sandbox |
| Feature flag circuit breaker halts execution after 2 failures | P1 | Docker Compose sandbox |

3. Unit Test Strategy (Per Component)

3.1 Deterministic Safety Scanner

What to test: Every pattern category. Every edge case. This component has 100% coverage as a hard requirement.

Key test cases:

// pkg/classifier/scanner/tests.rs

// ── BLOCKLIST (🔴 Dangerous) ──────────────────────────────────────────

#[test] fn scanner_kubectl_delete_namespace_is_dangerous() {}
#[test] fn scanner_kubectl_delete_deployment_is_dangerous() {}
#[test] fn scanner_kubectl_delete_pvc_is_dangerous() {}
#[test] fn scanner_kubectl_delete_all_is_dangerous() {}
#[test] fn scanner_drop_table_is_dangerous() {}
#[test] fn scanner_drop_database_is_dangerous() {}
#[test] fn scanner_truncate_table_is_dangerous() {}
#[test] fn scanner_delete_without_where_is_dangerous() {}
#[test] fn scanner_rm_rf_is_dangerous() {}
#[test] fn scanner_rm_rf_root_is_dangerous() {}
#[test] fn scanner_rm_rf_slash_is_dangerous() {}
#[test] fn scanner_aws_ec2_terminate_instances_is_dangerous() {}
#[test] fn scanner_aws_rds_delete_db_instance_is_dangerous() {}
#[test] fn scanner_terraform_destroy_is_dangerous() {}
#[test] fn scanner_dd_if_dev_zero_is_dangerous() {}
#[test] fn scanner_mkfs_is_dangerous() {}
#[test] fn scanner_sudo_rm_is_dangerous() {}
#[test] fn scanner_chmod_777_recursive_is_dangerous() {}
#[test] fn scanner_kubectl_create_clusterrolebinding_is_dangerous() {}
#[test] fn scanner_aws_iam_create_user_is_dangerous() {}
#[test] fn scanner_pipe_to_xargs_rm_is_dangerous() {}
#[test] fn scanner_delete_with_where_but_no_condition_value_is_dangerous() {}

// ── CAUTION LIST (🟡) ────────────────────────────────────────────────

#[test] fn scanner_kubectl_rollout_restart_is_caution() {}
#[test] fn scanner_kubectl_scale_is_caution() {}
#[test] fn scanner_aws_ec2_stop_instances_is_caution() {}
#[test] fn scanner_aws_ec2_start_instances_is_caution() {}
#[test] fn scanner_systemctl_restart_is_caution() {}
#[test] fn scanner_update_with_where_clause_is_caution() {}
#[test] fn scanner_insert_into_is_caution() {}
#[test] fn scanner_docker_restart_is_caution() {}
#[test] fn scanner_aws_autoscaling_set_desired_capacity_is_caution() {}

// ── ALLOWLIST (🟢 Safe) ──────────────────────────────────────────────

#[test] fn scanner_kubectl_get_pods_is_safe() {}
#[test] fn scanner_kubectl_describe_deployment_is_safe() {}
#[test] fn scanner_kubectl_logs_is_safe() {}
#[test] fn scanner_aws_ec2_describe_instances_is_safe() {}
#[test] fn scanner_aws_s3_ls_is_safe() {}
#[test] fn scanner_select_query_is_safe() {}
#[test] fn scanner_explain_query_is_safe() {}
#[test] fn scanner_curl_get_is_safe() {}
#[test] fn scanner_cat_file_is_safe() {}
#[test] fn scanner_grep_is_safe() {}
#[test] fn scanner_tail_f_is_safe() {}
#[test] fn scanner_docker_ps_is_safe() {}
#[test] fn scanner_terraform_plan_is_safe() {}
#[test] fn scanner_dig_is_safe() {}
#[test] fn scanner_nslookup_is_safe() {}

// ── UNKNOWN / EDGE CASES ─────────────────────────────────────────────

#[test] fn scanner_unknown_command_defaults_to_unknown_not_safe() {}
#[test] fn scanner_empty_command_defaults_to_unknown() {}
#[test] fn scanner_custom_script_path_defaults_to_unknown() {}
#[test] fn scanner_select_into_is_dangerous_not_safe() {} // SELECT INTO is a write
#[test] fn scanner_delete_with_where_is_caution_not_dangerous() {}
#[test] fn scanner_curl_post_is_caution_not_safe() {} // POST has side effects
#[test] fn scanner_pipe_chain_with_destructive_segment_is_dangerous() {}
#[test] fn scanner_command_substitution_with_rm_is_dangerous() {}
#[test] fn scanner_multiline_command_with_destructive_line_is_dangerous() {}

// ── AST PARSING (tree-sitter) ────────────────────────────────────────

#[test] fn scanner_sql_ast_delete_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_update_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_drop_statement_is_dangerous() {}
#[test] fn scanner_shell_ast_piped_rm_is_dangerous() {}
#[test] fn scanner_shell_ast_subshell_with_destructive_is_dangerous() {}

// ── PERFORMANCE ──────────────────────────────────────────────────────

#[test] fn scanner_classifies_in_under_1ms() {}
#[test] fn scanner_classifies_100_commands_in_under_10ms() {}

Mocking strategy: None. The scanner is a pure function with no I/O. All tests are synchronous, no mocks needed.

Language-specific patterns (Rust):

  • Use #[test] for synchronous unit tests
  • Use the criterion crate for performance benchmarks
  • Compile regex sets once via lazy_static! or once_cell::sync::Lazy so patterns are not recompiled per test
  • Use rstest for parameterized test cases across command variants

// Parameterized test example using rstest
#[rstest]
#[case("kubectl delete namespace production", RiskLevel::Dangerous)]
#[case("kubectl delete deployment payment-svc", RiskLevel::Dangerous)]
#[case("kubectl delete pod payment-abc123", RiskLevel::Dangerous)]
#[case("kubectl delete all --all", RiskLevel::Dangerous)]
fn scanner_kubectl_delete_variants_are_dangerous(
    #[case] command: &str,
    #[case] expected: RiskLevel,
) {
    let scanner = Scanner::new();
    assert_eq!(scanner.classify(command).risk, expected);
}

3.2 Classification Merge Engine

What to test: All 5 merge rules, including every combination of scanner/LLM results.

// pkg/classifier/merge/tests.rs

// Rule 1: Scanner 🔴 → always 🔴
#[test] fn merge_scanner_dangerous_llm_safe_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_caution_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_dangerous_yields_dangerous() {}

// Rule 2: Scanner 🟡, LLM 🟢 → 🟡 (scanner wins)
#[test] fn merge_scanner_caution_llm_safe_yields_caution() {}

// Rule 3: Both 🟢 → 🟢 (only path to safe)
#[test] fn merge_scanner_safe_llm_safe_yields_safe() {}

// Rule 4: Scanner Unknown → 🟡 minimum
#[test] fn merge_scanner_unknown_llm_safe_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_caution_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_dangerous_yields_dangerous() {}

// Rule 5: LLM confidence < 0.9 → escalate one level
#[test] fn merge_low_confidence_safe_escalates_to_caution() {}
#[test] fn merge_low_confidence_caution_escalates_to_dangerous() {}
#[test] fn merge_high_confidence_safe_does_not_escalate() {}

// LLM escalation overrides scanner
#[test] fn merge_scanner_safe_llm_caution_yields_caution() {}
#[test] fn merge_scanner_safe_llm_dangerous_yields_dangerous() {}
#[test] fn merge_scanner_caution_llm_dangerous_yields_dangerous() {}

// Merge rule is logged
#[test] fn merge_result_includes_applied_rule_identifier() {}
#[test] fn merge_result_includes_both_scanner_and_llm_inputs() {}

Mocking strategy: LLM results are passed as plain structs — no mocking needed. The merge engine is a pure function.
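Because the merge engine is pure, the five rules above can be sketched as a single function. Names here (`Risk`, `LlmResult`, `merge`) are illustrative, not the actual pkg/classifier/merge API:

```rust
// Sketch of the five merge rules as a pure function (illustrative types).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Risk {
    Safe,
    Caution,
    Dangerous,
}

struct LlmResult {
    risk: Risk,
    confidence: f64,
}

// `scanner` is None when the scanner returned Unknown.
fn merge(scanner: Option<Risk>, llm: LlmResult) -> Risk {
    // Rule 5: LLM confidence < 0.9 escalates the LLM result one level.
    let llm_risk = if llm.confidence < 0.9 {
        match llm.risk {
            Risk::Safe => Risk::Caution,
            _ => Risk::Dangerous,
        }
    } else {
        llm.risk
    };
    match scanner {
        Some(Risk::Dangerous) => Risk::Dangerous,           // Rule 1: scanner 🔴 always wins
        Some(Risk::Caution) => llm_risk.max(Risk::Caution), // Rule 2: scanner 🟡 is a floor
        Some(Risk::Safe) => llm_risk,                       // Rule 3: 🟢 only if both agree
        None => llm_risk.max(Risk::Caution),                // Rule 4: Unknown → 🟡 minimum
    }
}

fn main() {
    let high = |risk| LlmResult { risk, confidence: 0.95 };
    // Scanner 🔴 overrides an LLM 🟢.
    assert_eq!(merge(Some(Risk::Dangerous), high(Risk::Safe)), Risk::Dangerous);
    // Both 🟢 is the only path to 🟢.
    assert_eq!(merge(Some(Risk::Safe), high(Risk::Safe)), Risk::Safe);
    // Low-confidence 🟢 escalates to 🟡.
    assert_eq!(
        merge(Some(Risk::Safe), LlmResult { risk: Risk::Safe, confidence: 0.5 }),
        Risk::Caution
    );
}
```

Ordering the enum Safe < Caution < Dangerous lets rules 2 and 4 fall out of a plain `max`, which keeps every escalation path monotonic: no rule can ever lower a risk level.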

3.3 Execution Engine State Machine

What to test: Every valid state transition, every invalid transition (must be rejected), trust level enforcement at each transition.

// pkg/executor/state_machine/tests.rs

// Valid transitions
#[test] fn engine_pending_to_preflight_on_start() {}
#[test] fn engine_preflight_to_step_ready_on_prerequisites_met() {}
#[test] fn engine_step_ready_to_auto_execute_for_safe_step_at_copilot_level() {}
#[test] fn engine_step_ready_to_await_approval_for_caution_step() {}
#[test] fn engine_step_ready_to_blocked_for_dangerous_step_at_copilot_level() {}
#[test] fn engine_await_approval_to_executing_on_human_approval() {}
#[test] fn engine_await_approval_to_skipped_on_human_skip() {}
#[test] fn engine_executing_to_step_complete_on_success() {}
#[test] fn engine_executing_to_step_failed_on_error() {}
#[test] fn engine_executing_to_timed_out_on_timeout() {}
#[test] fn engine_step_complete_to_step_ready_when_more_steps() {}
#[test] fn engine_step_complete_to_runbook_complete_when_last_step() {}
#[test] fn engine_step_failed_to_rollback_available_when_rollback_exists() {}
#[test] fn engine_step_failed_to_manual_intervention_when_no_rollback() {}
#[test] fn engine_rollback_available_to_rolling_back_on_approval() {}
#[test] fn engine_rolling_back_to_step_ready_on_rollback_success() {}
#[test] fn engine_rolling_back_to_manual_intervention_on_rollback_failure() {}
#[test] fn engine_runbook_complete_to_divergence_analysis() {}

// Invalid transitions (must be rejected)
#[test] fn engine_cannot_skip_preflight_state() {}
#[test] fn engine_cannot_auto_execute_caution_step_at_copilot_level() {}
#[test] fn engine_cannot_auto_execute_dangerous_step_at_any_v1_level() {}
#[test] fn engine_cannot_transition_from_completed_to_executing() {}

// Trust level enforcement
#[test] fn engine_safe_step_blocked_at_read_only_trust_level() {}
#[test] fn engine_caution_step_requires_approval_at_copilot_level() {}
#[test] fn engine_dangerous_step_blocked_at_copilot_level_v1() {}
#[test] fn engine_trust_level_checked_per_step_not_per_runbook() {}

// Timeout behavior
#[test] fn engine_safe_step_times_out_after_60_seconds() {}
#[test] fn engine_caution_step_times_out_after_120_seconds() {}
#[test] fn engine_dangerous_step_times_out_after_300_seconds() {}
#[test] fn engine_approval_timeout_does_not_auto_approve() {}
#[test] fn engine_approval_timeout_marks_execution_as_stalled() {}

// Idempotency
#[test] fn engine_duplicate_step_execution_id_is_rejected() {}
#[test] fn engine_duplicate_approval_for_same_step_is_idempotent() {}

// Panic mode
#[test] fn engine_checks_panic_mode_before_each_step() {}
#[test] fn engine_pauses_in_flight_execution_when_panic_mode_set() {}
#[test] fn engine_does_not_kill_executing_step_on_panic_mode() {}

Mocking strategy:

  • Mock the Agent gRPC client using a trait object (MockAgentClient)
  • Mock the Slack notification sender
  • Mock the database using an in-memory state store for pure state machine tests
  • Use tokio::time::pause() for timeout tests (no real waiting)
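One way the transition tests above stay exhaustive is to express the valid transitions as a single allow-list over an enum, so the "invalid transition" tests reduce to negative lookups. A minimal sketch with illustrative state names (a subset of the states in the tests, not the engine's actual types):

```rust
// Sketch: valid transitions as an exhaustive allow-list (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Pending,
    Preflight,
    StepReady,
    AwaitApproval,
    Executing,
    StepComplete,
    Blocked,
    RunbookComplete,
}

fn can_transition(from: State, to: State) -> bool {
    use State::*;
    matches!(
        (from, to),
        (Pending, Preflight)
            | (Preflight, StepReady)
            | (StepReady, Executing)     // Safe step at Copilot level
            | (StepReady, AwaitApproval) // Caution step
            | (StepReady, Blocked)       // Dangerous step in v1
            | (AwaitApproval, Executing)
            | (Executing, StepComplete)
            | (StepComplete, StepReady)
            | (StepComplete, RunbookComplete)
    )
}

fn main() {
    assert!(can_transition(State::StepReady, State::AwaitApproval));
    // Invalid transitions are rejected by construction:
    assert!(!can_transition(State::Pending, State::Executing)); // cannot skip preflight
    assert!(!can_transition(State::RunbookComplete, State::Executing));
}
```

Anything not named in the allow-list is rejected by default, which matches the safety posture: a forgotten transition fails closed, not open.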

3.4 Runbook Parser

What to test: Normalization correctness, LLM output parsing, variable detection, branch mapping.

// pkg/parser/tests.rs

// Normalizer
#[test] fn normalizer_strips_html_tags() {}
#[test] fn normalizer_strips_confluence_macros() {}
#[test] fn normalizer_normalizes_bullet_styles_to_numbered() {}
#[test] fn normalizer_preserves_code_blocks() {}
#[test] fn normalizer_normalizes_whitespace() {}
#[test] fn normalizer_handles_empty_input() {}
#[test] fn normalizer_handles_unicode_content() {}

// LLM output parsing (using recorded fixtures)
#[test] fn parser_extracts_ordered_steps_from_llm_response() {}
#[test] fn parser_handles_llm_returning_empty_steps_array() {}
#[test] fn parser_rejects_llm_response_missing_required_fields() {}
#[test] fn parser_handles_llm_timeout_gracefully() {}
#[test] fn parser_is_idempotent_same_input_same_output() {}
#[test] fn parser_risk_level_is_null_in_output() {} // Parser never classifies

// Variable detection
#[test] fn variable_detector_finds_dollar_sign_vars() {}
#[test] fn variable_detector_finds_angle_bracket_placeholders() {}
#[test] fn variable_detector_finds_curly_brace_templates() {}
#[test] fn variable_detector_identifies_alert_context_sources() {}
#[test] fn variable_detector_identifies_vpn_prerequisite() {}

// Branch mapping
#[test] fn branch_mapper_detects_if_else_conditional() {}
#[test] fn branch_mapper_produces_valid_dag() {}
#[test] fn branch_mapper_handles_nested_conditionals() {}

// Ambiguity detection
#[test] fn ambiguity_highlighter_flags_vague_check_logs_step() {}
#[test] fn ambiguity_highlighter_flags_restart_service_without_name() {}

Mocking strategy: Mock the LLM gateway (dd0c/route) using recorded response fixtures. Use wiremock-rs or a trait-based mock. Never call real LLM in unit tests.
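The trait-based variant can be sketched as follows. All names here are illustrative (the real Parser calls dd0c/route over HTTP, and the "parser" below is a toy that counts fixture lines), but the shape is the point: the test double replays a recorded fixture and never reaches a real LLM:

```rust
// Trait-based LLM-gateway double (illustrative names, not the real API).
trait LlmGateway {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

// Test double: returns a recorded fixture, never calls a real LLM.
struct FixtureGateway {
    response: String,
}

impl LlmGateway for FixtureGateway {
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Ok(self.response.clone())
    }
}

// Toy stand-in for the parser: treats each fixture line as one step.
fn extract_step_count(gateway: &dyn LlmGateway) -> usize {
    gateway
        .complete("normalized runbook text")
        .map(|r| r.lines().count())
        .unwrap_or(0)
}

fn main() {
    let gw = FixtureGateway {
        response: "1. kubectl get pods\n2. kubectl rollout restart".to_string(),
    };
    assert_eq!(extract_step_count(&gw), 2);
}
```

Because the parser depends on the trait rather than a concrete client, the same code path is exercised with fixtures in unit tests and with wiremock-rs in integration tests.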

3.5 Approval Workflow

What to test: Approval gates cannot be bypassed. Typed confirmation is enforced for 🔴 steps.

// pkg/approval/tests.rs

#[test] fn approval_caution_step_requires_button_click_not_auto() {}
#[test] fn approval_dangerous_step_requires_typed_resource_name() {}
#[test] fn approval_dangerous_step_rejects_wrong_resource_name() {}
#[test] fn approval_dangerous_step_rejects_generic_yes_confirmation() {}
#[test] fn approval_cannot_bulk_approve_multiple_steps() {}
#[test] fn approval_captures_approver_slack_identity() {}
#[test] fn approval_captures_approval_timestamp() {}
#[test] fn approval_modification_logs_original_command() {}
#[test] fn approval_timeout_30min_marks_as_stalled_not_approved() {}
#[test] fn approval_skip_logs_step_as_skipped_with_actor() {}
#[test] fn approval_abort_halts_all_remaining_steps() {}
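The typed-confirmation rule the tests above enforce can be reduced to one predicate. This is a sketch with a hypothetical function name (the real check runs inside the Slack modal handler), assuming trimmed, case-sensitive exact match:

```rust
// Sketch of the typed-confirmation rule for 🔴 steps (illustrative name).
fn typed_confirmation_valid(expected_resource: &str, typed: &str) -> bool {
    // Exact match only: surrounding whitespace is trimmed, but the check
    // stays case-sensitive, so "Payments-DB" does not approve "payments-db".
    typed.trim() == expected_resource
}

fn main() {
    assert!(typed_confirmation_valid("payments-db", "payments-db"));
    assert!(typed_confirmation_valid("payments-db", " payments-db "));
    // Generic confirmations and near-misses are rejected.
    assert!(!typed_confirmation_valid("payments-db", "yes"));
    assert!(!typed_confirmation_valid("payments-db", "Payments-DB"));
}
```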

3.6 Audit Trail

What to test: Schema validation, immutability enforcement, event completeness.

// pkg/audit/tests.rs

#[test] fn audit_event_requires_tenant_id() {}
#[test] fn audit_event_requires_event_type() {}
#[test] fn audit_event_requires_actor_id_and_type() {}
#[test] fn audit_all_execution_event_types_are_valid_enum_values() {}
#[test] fn audit_step_executed_event_includes_command_hash_not_plaintext() {}
#[test] fn audit_step_executed_event_includes_exit_code() {}
#[test] fn audit_classification_event_includes_both_scanner_and_llm_results() {}
#[test] fn audit_classification_event_includes_merge_rule_applied() {}
#[test] fn audit_hash_chain_each_event_references_previous_hash() {}
#[test] fn audit_hash_chain_modification_breaks_chain_verification() {}
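The hash-chain property the last two tests verify can be sketched in a few lines. std's DefaultHasher is used here purely for illustration; a real audit trail would use a cryptographic hash such as SHA-256:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of the audit hash chain: each event's hash covers its payload
// plus the previous event's hash (non-cryptographic hasher, demo only).
fn chain_hash(prev_hash: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev_hash.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

fn main() {
    let h1 = chain_hash(0, r#"{"event_type":"execution.started"}"#);
    let h2 = chain_hash(h1, r#"{"event_type":"step.executed"}"#);
    // Tampering with an earlier payload changes its hash, which breaks
    // verification of every later link in the chain.
    let h1_tampered = chain_hash(0, r#"{"event_type":"tampered"}"#);
    assert_ne!(h1_tampered, h1);
    assert_ne!(chain_hash(h1_tampered, r#"{"event_type":"step.executed"}"#), h2);
}
```

The chain makes modification detectable, not impossible; the database-level no-UPDATE permission (tested in section 4.1) is what makes it enforceable.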

4. Integration Test Strategy

4.1 Service Boundary Tests

All integration tests use Testcontainers for database dependencies and WireMock (via wiremock-rs) for external HTTP services. gRPC boundaries use in-process test servers.

Parser ↔ LLM Gateway (dd0c/route)

// tests/integration/parser_llm_test.rs

#[tokio::test]
async fn parser_sends_normalized_text_to_llm_with_correct_schema_prompt() {
    let mock_route = MockServer::start().await;
    Mock::given(method("POST"))
        .and(path("/v1/completions"))
        .respond_with(ResponseTemplate::new(200).set_body_json(fixture_llm_response()))
        .mount(&mock_route)
        .await;

    let parser = Parser::new(mock_route.uri());
    let result = parser.parse("1. kubectl get pods -n payments").await.unwrap();
    assert_eq!(result.steps.len(), 1);
    assert_eq!(result.steps[0].risk_level, None); // Parser never classifies
}

#[tokio::test]
async fn parser_retries_on_llm_timeout_up_to_3_times() { ... }

#[tokio::test]
async fn parser_returns_error_when_llm_returns_invalid_json() { ... }

#[tokio::test]
async fn parser_handles_llm_returning_no_actionable_steps() { ... }

Runbook format parsing tests:

#[tokio::test]
async fn parser_confluence_html_extracts_steps_correctly() {
    // Load fixture: tests/fixtures/runbooks/confluence_payment_latency.html
    let raw = include_str!("../fixtures/runbooks/confluence_payment_latency.html");
    let result = parser.parse(raw).await.unwrap();
    assert!(!result.steps.is_empty());
    assert!(!result.prerequisites.is_empty());
}

#[tokio::test]
async fn parser_notion_export_markdown_extracts_steps_correctly() { ... }

#[tokio::test]
async fn parser_plain_markdown_numbered_list_extracts_steps_correctly() { ... }

#[tokio::test]
async fn parser_confluence_with_code_blocks_preserves_commands() { ... }

#[tokio::test]
async fn parser_notion_with_callout_blocks_extracts_prerequisites() { ... }

Classifier ↔ PostgreSQL (Audit Write)

// tests/integration/classifier_audit_test.rs

#[tokio::test]
async fn classifier_writes_classification_event_to_audit_log() {
    let pg = TestcontainersPostgres::start().await;
    let classifier = Classifier::new(pg.connection_string());

    let result = classifier.classify(&step_fixture()).await.unwrap();

    let events: Vec<AuditEvent> = pg.query(
        "SELECT * FROM audit_events WHERE event_type = 'runbook.classified'"
    ).await;
    assert_eq!(events.len(), 1);
    assert_eq!(events[0].event_data["final_classification"], "safe");
}

#[tokio::test]
async fn classifier_audit_record_is_immutable_no_update_permitted() {
    let pg = TestcontainersPostgres::start().await;
    // Attempt UPDATE on audit_events — must fail with a permission error
    let result = pg.execute("UPDATE audit_events SET event_type = 'tampered'").await;
    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("permission denied"));
}

Execution Engine ↔ Agent (gRPC)

// tests/integration/engine_agent_grpc_test.rs

#[tokio::test]
async fn engine_sends_execute_step_payload_with_correct_fields() {
    let mock_agent = MockAgentServer::start().await;
    let engine = ExecutionEngine::new(mock_agent.address());

    engine.execute_step(&safe_step_fixture()).await.unwrap();

    let received = mock_agent.last_received_command().await;
    assert!(received.execution_id.is_some());
    assert!(received.step_execution_id.is_some());
    assert_eq!(received.risk_level, RiskLevel::Safe);
}

#[tokio::test]
async fn engine_streams_stdout_from_agent_to_slack_in_realtime() { ... }

#[tokio::test]
async fn engine_handles_agent_disconnect_mid_execution() { ... }

#[tokio::test]
async fn engine_rejects_duplicate_step_execution_id_from_agent() { ... }

#[tokio::test]
async fn engine_validates_command_hash_before_sending_to_agent() { ... }

Approvals ↔ Slack

// tests/integration/approval_slack_test.rs

#[tokio::test]
async fn approval_caution_step_posts_block_kit_message_with_approve_button() { ... }

#[tokio::test]
async fn approval_dangerous_step_posts_modal_requiring_typed_confirmation() { ... }

#[tokio::test]
async fn approval_button_click_captures_slack_user_id_as_approver() { ... }

#[tokio::test]
async fn approval_respects_slack_rate_limit_1_message_per_second() { ... }

#[tokio::test]
async fn approval_batches_rapid_output_updates_to_avoid_rate_limit() { ... }

4.2 Testcontainers Setup

// tests/common/testcontainers.rs

pub struct TestDb {
    container: ContainerAsync<Postgres>,
    pub pool: PgPool,
}

impl TestDb {
    pub async fn start() -> Self {
        let container = Postgres::default()
            .with_env_var("POSTGRES_DB", "dd0c_test")
            .start()
            .await
            .unwrap();

        let pool = PgPool::connect(&container.connection_string()).await.unwrap();

        // Run migrations
        sqlx::migrate!("./migrations").run(&pool).await.unwrap();

        // Apply RLS policies
        sqlx::query_file!("./tests/fixtures/sql/apply_rls.sql")
            .execute(&pool)
            .await
            .unwrap();

        Self { container, pool }
    }

    pub async fn with_tenant(&self, tenant_id: Uuid) -> TenantScopedDb {
        // Sets app.current_tenant_id for RLS enforcement
        TenantScopedDb::new(&self.pool, tenant_id)
    }
}

4.3 Sandboxed Execution Environment Tests

For testing actual command execution without touching real infrastructure, use Docker-in-Docker (DinD) or a minimal sandbox container.

// tests/integration/sandbox_execution_test.rs

/// Uses a minimal Alpine container as the execution target.
/// The agent connects to this container instead of real infrastructure.
#[tokio::test]
async fn sandbox_safe_command_executes_and_returns_stdout() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::connect_to(sandbox.socket_path()).await;

    let result = agent.execute("ls /tmp").await.unwrap();
    assert_eq!(result.exit_code, 0);
    assert!(!result.stdout.is_empty());
}

#[tokio::test]
async fn sandbox_agent_rejects_dangerous_command_before_execution() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::connect_to(sandbox.socket_path()).await;

    let result = agent.execute("rm -rf /").await;
    assert!(result.is_err());
    assert_eq!(result.unwrap_err(), AgentError::CommandRejectedByScanner);
    // Verify nothing was deleted
    assert!(sandbox.path_exists("/etc").await);
}

#[tokio::test]
async fn sandbox_command_timeout_kills_process_and_returns_error() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::with_timeout(Duration::from_secs(2))
        .connect_to(sandbox.socket_path())
        .await;

    let result = agent.execute("sleep 60").await;
    assert_eq!(result.unwrap_err(), AgentError::Timeout);
}

#[tokio::test]
async fn sandbox_no_shell_injection_via_command_argument() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::connect_to(sandbox.socket_path()).await;

    // This should execute `echo` with the literal argument, not a shell
    let result = agent.execute("echo $(rm -rf /)").await.unwrap();
    assert_eq!(result.stdout.trim(), "$(rm -rf /)"); // Literal, not executed
    assert!(sandbox.path_exists("/etc").await);
}

4.4 Multi-Tenant RLS Integration Tests

// tests/integration/rls_test.rs

#[tokio::test]
async fn rls_tenant_a_cannot_see_tenant_b_runbooks() {
    let db = TestDb::start().await;
    let tenant_a = Uuid::new_v4();
    let tenant_b = Uuid::new_v4();

    // Insert runbook for tenant B
    db.insert_runbook(tenant_b, "Tenant B Runbook").await;

    // Query as tenant A — must return zero rows, not an error
    let db_a = db.with_tenant(tenant_a).await;
    let runbooks = db_a.query("SELECT * FROM runbooks").await.unwrap();
    assert_eq!(runbooks.len(), 0);
}

#[tokio::test]
async fn rls_cross_tenant_audit_query_returns_zero_rows() { ... }

#[tokio::test]
async fn rls_cross_tenant_execution_query_returns_zero_rows() { ... }

5. E2E & Smoke Tests

5.1 Critical User Journeys

E2E tests run against a full Docker Compose stack with sandboxed infrastructure. No real AWS, no real Kubernetes — all targets are containerized mocks.

Docker Compose E2E Stack:

# tests/e2e/docker-compose.yml
services:
  postgres:     # Real PostgreSQL with migrations applied
  redis:        # Real Redis for panic mode key
  parser:       # Real Parser service
  classifier:   # Real Classifier service
  engine:       # Real Execution Engine
  slack-mock:   # WireMock simulating Slack API
  llm-mock:     # WireMock with recorded LLM responses
  agent:        # Real dd0c Agent binary
  sandbox-host: # Alpine container as execution target

Journey 1: Full Happy Path (P0)

// tests/e2e/happy_path_test.rs

#[tokio::test]
async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() {
    let stack = E2EStack::start().await;

    // Step 1: Paste runbook
    let parse_resp = stack.api()
        .post("/v1/run/runbooks/parse-preview")
        .json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN }))
        .send().await;
    assert_eq!(parse_resp.status(), 200);
    let parsed = parse_resp.json::<ParsePreviewResponse>().await;
    assert!(parsed.steps.iter().any(|s| s.risk_level == "safe"));
    assert!(parsed.steps.iter().any(|s| s.risk_level == "caution"));

    // Step 2: Save runbook
    let runbook = stack.api().post("/v1/run/runbooks")
        .json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN, "title": "E2E Test" }))
        .send().await.json::<Runbook>().await;

    // Step 3: Start execution
    let execution = stack.api().post("/v1/run/executions")
        .json(&json!({ "runbook_id": runbook.id, "agent_id": stack.agent_id() }))
        .send().await.json::<Execution>().await;

    // Step 4: Safe steps auto-execute
    stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;

    // Step 5: Approve caution step
    stack.api()
        .post(format!("/v1/run/executions/{}/steps/{}/approve", execution.id, caution_step_id))
        .send().await;

    // Step 6: Wait for completion
    let completed = stack.wait_for_execution_state(&execution.id, "completed").await;
    assert_eq!(completed.steps_executed, 4);
    assert_eq!(completed.steps_failed, 0);

    // Step 7: Verify audit trail
    let audit_events = stack.db()
        .query("SELECT event_type FROM audit_events WHERE execution_id = $1", &[&execution.id])
        .await;
    let event_types: Vec<&str> = audit_events.iter().map(|e| e.event_type.as_str()).collect();
    assert!(event_types.contains(&"execution.started"));
    assert!(event_types.contains(&"step.auto_executed"));
    assert!(event_types.contains(&"step.approved"));
    assert!(event_types.contains(&"step.executed"));
    assert!(event_types.contains(&"execution.completed"));
}

Journey 2: Destructive Command Blocked at All Levels (P0)

#[tokio::test]
async fn e2e_dangerous_command_blocked_at_copilot_trust_level() {
    let stack = E2EStack::start().await;

    let runbook = stack.create_runbook_with_dangerous_step().await;
    let execution = stack.start_execution(&runbook.id).await;

    // Engine must transition to Blocked, not AwaitApproval or AutoExecute
    let step_status = stack.wait_for_step_state(
        &execution.id, &dangerous_step_id, "blocked"
    ).await;
    assert_eq!(step_status, "blocked");

    // Verify audit event logged the block
    let events = stack.audit_events_for_execution(&execution.id).await;
    assert!(events.iter().any(|e| e.event_type == "step.blocked_by_trust_level"));
}

Journey 3: Panic Mode Halts In-Flight Execution (P0)

#[tokio::test]
async fn e2e_panic_mode_halts_in_flight_execution_within_1_second() {
    let stack = E2EStack::start().await;

    // Start a long-running execution
    let execution = stack.start_execution_with_slow_steps().await;
    stack.wait_for_execution_state(&execution.id, "running").await;

    let panic_triggered_at = Instant::now();

    // Trigger panic mode
    stack.api().post("/v1/run/admin/panic").send().await;

    // Execution must be paused within 1 second
    stack.wait_for_execution_state(&execution.id, "paused").await;
    assert!(panic_triggered_at.elapsed() < Duration::from_secs(1));

    // Verify execution is paused, not killed
    let exec = stack.get_execution(&execution.id).await;
    assert_eq!(exec.status, "paused");
    assert_ne!(exec.status, "aborted");
}

Journey 4: Approval Timeout Does Not Auto-Approve (P0)

#[tokio::test]
async fn e2e_approval_timeout_marks_stalled_not_approved() {
    let stack = E2EStack::with_approval_timeout(Duration::from_secs(5)).start().await;

    let execution = stack.start_execution_with_caution_step().await;
    stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;

    // Wait for timeout to expire — do NOT approve
    tokio::time::sleep(Duration::from_secs(6)).await;

    let exec = stack.get_execution(&execution.id).await;
    assert_eq!(exec.status, "stalled");
    assert_ne!(exec.status, "completed"); // Must NOT have auto-approved
}

5.2 Chaos Scenarios

// tests/e2e/chaos_test.rs

#[tokio::test]
async fn chaos_agent_disconnects_mid_execution_engine_pauses_and_alerts() {
    let stack = E2EStack::start().await;
    let execution = stack.start_long_running_execution().await;

    // Kill agent network mid-execution
    stack.disconnect_agent().await;

    let exec = stack.wait_for_execution_state(&execution.id, "paused").await;
    assert_eq!(exec.status, "paused");

    // Reconnect agent — execution should be resumable
    stack.reconnect_agent().await;
    stack.resume_execution(&execution.id).await;
    stack.wait_for_execution_state(&execution.id, "completed").await;
}

#[tokio::test]
async fn chaos_database_failover_engine_resumes_from_last_committed_step() {
    // Simulate RDS failover — engine must reconnect and resume
}

#[tokio::test]
async fn chaos_llm_gateway_down_classification_falls_back_to_scanner_only() {
    // LLM unavailable — scanner-only mode, all unknowns become 🟡
}
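
The scanner-only fallback stubbed above reduces to a small merge rule. A minimal sketch, assuming illustrative names (`Classification`, `classify_with_fallback`) rather than the real classifier API:

```rust
// Classification levels: Green = 🟢 safe, Yellow = 🟡 caution, Red = 🔴 destructive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Classification { Green, Yellow, Red }

/// Merge the scanner verdict with an optional LLM verdict. When the LLM
/// gateway is down (`llm` is `None`), fall back to the scanner alone and
/// demote anything the scanner could not match at all to Yellow —
/// unknowns must never silently pass as Green.
fn classify_with_fallback(
    scanner: Option<Classification>,
    llm: Option<Classification>,
) -> Classification {
    match (scanner, llm) {
        // Scanner says Red: always Red, regardless of the LLM.
        (Some(Classification::Red), _) => Classification::Red,
        // Both available: take the more restrictive of the two.
        (Some(s), Some(l)) => {
            if l == Classification::Red {
                Classification::Red
            } else if s == Classification::Yellow || l == Classification::Yellow {
                Classification::Yellow
            } else {
                Classification::Green
            }
        }
        // LLM down, scanner has a verdict: keep it.
        (Some(s), None) => s,
        // Scanner matched no pattern at all: unknown ⇒ Yellow.
        (None, _) => Classification::Yellow,
    }
}
```

The key invariant: with the LLM gateway down, nothing the scanner cannot positively clear may pass as 🟢.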

#[tokio::test]
async fn chaos_slack_api_outage_execution_pauses_awaiting_approval_channel() {
    // Slack down — approval requests queue, no auto-approve
}

#[tokio::test]
async fn chaos_mid_execution_step_failure_triggers_rollback_flow() {
    let stack = E2EStack::start().await;
    let execution = stack.start_execution_with_failing_step().await;

    stack.wait_for_execution_state(&execution.id, "rollback_available").await;

    // Approve rollback for the step that failed
    let failed_step_id = stack.failed_step_id(&execution.id).await;
    stack.approve_rollback(&execution.id, &failed_step_id).await;

    let exec = stack.wait_for_execution_state(&execution.id, "completed").await;
    let events = stack.audit_events_for_execution(&execution.id).await;
    assert!(events.iter().any(|e| e.event_type == "step.rolled_back"));
}

6. Performance & Load Testing

6.1 Parser Throughput

// benches/parser_bench.rs (criterion)

fn bench_normalizer_small_runbook(c: &mut Criterion) {
    let input = include_str!("../fixtures/runbooks/small_10_steps.md");
    c.bench_function("normalizer_small", |b| {
        b.iter(|| Normalizer::new().normalize(black_box(input)))
    });
}

fn bench_normalizer_large_runbook(c: &mut Criterion) {
    // 500-step runbook, heavy HTML from Confluence
    let input = include_str!("../fixtures/runbooks/large_500_steps.html");
    c.bench_function("normalizer_large", |b| {
        b.iter(|| Normalizer::new().normalize(black_box(input)))
    });
}

fn bench_scanner_100_commands(c: &mut Criterion) {
    let commands = fixture_100_mixed_commands();
    let scanner = Scanner::new();
    c.bench_function("scanner_100_commands", |b| {
        b.iter(|| {
            for cmd in &commands {
                black_box(scanner.classify(cmd));
            }
        })
    });
}

Performance targets:

  • Normalizer: < 10ms for a 500-step Confluence page
  • Scanner: < 1ms per command, < 10ms for 100 commands in batch
  • Full parse + classify pipeline: < 5s p95 (including LLM call)
  • Classification merge: < 1ms per step

6.2 Concurrent Execution Stress Tests

Use k6 or cargo-based load tests for concurrent execution scenarios:

// tests/load/concurrent_executions.js (k6)

export const options = {
  scenarios: {
    concurrent_executions: {
      executor: 'constant-vus',
      vus: 50,           // 50 concurrent execution sessions
      duration: '5m',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'], // API responses < 500ms p95
    http_req_failed: ['rate<0.01'],   // < 1% error rate
  },
};

export default function () {
  // Start execution, poll status, approve steps, verify completion
  const execution = startExecution(FIXTURE_RUNBOOK_ID);
  waitForApprovalGate(execution.id);
  approveStep(execution.id, execution.pending_step_id);
  waitForCompletion(execution.id);
}

Stress test targets:

  • 50 concurrent execution sessions: all complete without errors
  • Approval workflow: < 200ms p95 latency for approval API calls
  • Audit trail ingestion: handles 1000 events/second without data loss
  • Scanner: handles 10,000 classifications/second (batch mode)

6.3 Approval Workflow Latency Under Load

// tests/load/approval_latency_test.rs

#[tokio::test]
async fn approval_workflow_p95_latency_under_100_concurrent_approvals() {
    let stack = E2EStack::start().await;
    let mut handles = vec![];

    for _ in 0..100 {
        let stack = stack.clone();
        handles.push(tokio::spawn(async move {
            let execution = stack.start_execution_with_caution_step().await;
            stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
            let start = Instant::now();
            stack.approve_step(&execution.id, &execution.pending_step_id).await;
            start.elapsed()
        }));
    }

    let latencies: Vec<Duration> = futures::future::join_all(handles)
        .await.into_iter().map(|r| r.unwrap()).collect();

    let p95 = percentile(&latencies, 95);
    assert!(p95 < Duration::from_millis(200), "p95 approval latency: {:?}", p95);
}
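
The `percentile` helper above is assumed rather than shown; a minimal nearest-rank sketch over `Duration` samples:

```rust
use std::time::Duration;

/// Nearest-rank percentile: sort ascending, take the value at
/// ceil(p/100 * n) - 1. Panicking on an empty slice or out-of-range
/// `p` is acceptable in test code.
fn percentile(samples: &[Duration], p: u32) -> Duration {
    assert!(!samples.is_empty() && (1..=100).contains(&p));
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Integer ceil of p * n / 100
    let rank = (p as usize * sorted.len() + 99) / 100;
    sorted[rank - 1]
}
```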

7. CI/CD Pipeline Integration

7.1 Test Stages

┌─────────────────────────────────────────────────────────────────────┐
│                    CI/CD TEST PIPELINE                               │
│                                                                      │
│  PRE-COMMIT (local, < 30s)                                          │
│  ├── cargo fmt --check                                               │
│  ├── cargo clippy -- -D warnings                                     │
│  ├── Unit tests for changed packages only                            │
│  └── Canary suite (50 known-destructive commands must stay 🔴)      │
│                                                                      │
│  PR GATE (CI, < 10 min)                                             │
│  ├── Full unit test suite (all packages)                             │
│  ├── Canary suite (mandatory — build fails if any canary is 🟢)    │
│  ├── Integration tests (Testcontainers)                              │
│  ├── Coverage check (see thresholds below)                           │
│  ├── Decision log check (PRs touching classifier/executor/parser     │
│  │   must include decision_log.json)                                 │
│  └── Expired feature flag check (CI blocks if flag TTL exceeded)    │
│                                                                      │
│  MERGE TO MAIN (CI, < 20 min)                                       │
│  ├── Full unit + integration suite                                   │
│  ├── E2E smoke tests (Docker Compose stack)                          │
│  ├── Performance regression check (criterion baselines)              │
│  └── Schema migration validation                                     │
│                                                                      │
│  DEPLOY TO STAGING (post-merge, < 30 min)                           │
│  ├── E2E full suite against staging environment                      │
│  ├── Chaos scenarios (agent disconnect, DB failover)                 │
│  └── Load test (50 concurrent executions, 5 min)                    │
│                                                                      │
│  DEPLOY TO PRODUCTION (manual gate after staging)                   │
│  ├── Smoke test: parse-preview endpoint responds < 5s               │
│  ├── Smoke test: agent heartbeat received                            │
│  └── Smoke test: audit trail write succeeds                         │
└─────────────────────────────────────────────────────────────────────┘

7.2 Coverage Thresholds

# .cargo/coverage.toml (enforced via cargo-tarpaulin in CI)

[thresholds]
# Safety-critical components — highest bar
"pkg/classifier/scanner"   = 100   # Every pattern must be tested
"pkg/classifier/merge"     = 100   # Every merge rule must be tested
"pkg/executor/state_machine" = 95  # Every state transition
"pkg/executor/trust"       = 95    # Trust level enforcement
"pkg/approval"             = 95    # Approval gates

# Core components
"pkg/parser"               = 90
"pkg/audit"                = 90
"pkg/agent/scanner"        = 100   # Agent-side scanner: same as SaaS-side

# Supporting components
"pkg/matcher"              = 85
"pkg/slack"                = 80
"pkg/api"                  = 80

# Overall project minimum
"overall"                  = 85

CI enforcement:

# .github/workflows/ci.yml (excerpt)
- name: Check coverage thresholds
  run: |
    cargo tarpaulin --out Json --output-dir coverage/
    python scripts/check_coverage_thresholds.py coverage/tarpaulin-report.json

- name: Run canary suite (MANDATORY)
  run: cargo test --package dd0c-classifier canary_suite -- --nocapture
  # This job failing blocks ALL other jobs

- name: Check decision logs for safety-critical PRs
  run: |
    CHANGED=$(git diff --name-only origin/main)
    if echo "$CHANGED" | grep -qE "pkg/(parser|classifier|executor|approval)/"; then
      python scripts/check_decision_log.py
    fi
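
The logic inside `scripts/check_coverage_thresholds.py` is simple threshold comparison; sketched here in Rust for illustration (the `failing_packages` name and map shapes are hypothetical, mirroring the config above):

```rust
use std::collections::HashMap;

/// Compare measured per-package coverage (percent) against thresholds.
/// Returns the packages that fall short; an empty list means pass.
/// A package missing from the report counts as 0% — absence must fail,
/// never silently pass.
fn failing_packages(
    thresholds: &HashMap<&str, f64>,
    measured: &HashMap<&str, f64>,
) -> Vec<String> {
    let mut failures: Vec<String> = thresholds
        .iter()
        .filter(|(pkg, min)| measured.get(*pkg).copied().unwrap_or(0.0) < **min)
        .map(|(pkg, _)| pkg.to_string())
        .collect();
    failures.sort();
    failures
}
```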

7.3 Test Parallelization Strategy

# GitHub Actions matrix strategy
jobs:
  unit-tests:
    strategy:
      matrix:
        package:
          - dd0c-classifier    # Runs first — safety-critical
          - dd0c-executor      # Runs first — safety-critical
          - dd0c-parser
          - dd0c-audit
          - dd0c-approval
          - dd0c-matcher
          - dd0c-slack
          - dd0c-api
    steps:
      - run: cargo test --package ${{ matrix.package }}

  canary-suite:
    # No `needs` key — runs in parallel with unit tests
    steps:
      - run: cargo test --package dd0c-classifier canary_suite

  integration-tests:
    needs: [unit-tests, canary-suite]   # Only after unit tests pass
    strategy:
      matrix:
        suite:
          - parser-llm
          - classifier-audit
          - engine-agent
          - approval-slack
          - rls-isolation
    steps:
      - run: cargo test --test ${{ matrix.suite }}

  e2e-tests:
    needs: [integration-tests]
    steps:
      - run: docker compose -f tests/e2e/docker-compose.yml up -d
      - run: cargo test --test e2e

Parallelization rules:

  • Canary suite runs in parallel with unit tests — never blocked
  • Integration tests only start after ALL unit tests pass
  • E2E tests only start after ALL integration tests pass
  • Each test package runs in its own job (parallel matrix)
  • Testcontainers instances are per-test, not shared (no state leakage)

8. Transparent Factory Tenet Testing

8.1 Feature Flag Tests (Atomic Flagging — Story 10.1)

// pkg/flags/tests.rs

// Basic flag evaluation
#[test] fn flag_evaluates_locally_no_network_call() {}
#[test] fn flag_disabled_by_default_for_new_execution_paths() {}
#[test] fn flag_requires_owner_field_or_validation_fails() {}
#[test] fn flag_requires_ttl_field_or_validation_fails() {}
#[test] fn flag_expired_ttl_is_treated_as_disabled() {}
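
The TTL stub above pins the fail-closed rule; a minimal sketch with an illustrative `Flag` shape and an injected epoch-seconds clock (the real evaluator presumably reads wall time):

```rust
/// A flag past its TTL evaluates as disabled regardless of its stored
/// state — expiry fails closed. Time is passed in as epoch seconds so
/// the sketch stays deterministic and free of clock dependencies.
struct Flag {
    enabled: bool,
    expires_at: u64, // TTL deadline, epoch seconds
}

fn is_enabled(flag: &Flag, now: u64) -> bool {
    // Local, synchronous check: no network call in the hot path.
    flag.enabled && now < flag.expires_at
}
```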

// Destructive flag 48-hour bake enforcement
#[test] fn flag_destructive_true_requires_48h_bake_before_full_rollout() {
    let flag = FeatureFlag {
        name: "enable_kubectl_delete_execution",
        destructive: true,
        rollout_percentage: 100,
        bake_started_at: Some(Utc::now() - Duration::hours(24)), // Only 24h
        ..Default::default()
    };
    let result = FlagValidator::validate(&flag);
    assert!(result.is_err());
    assert!(result.unwrap_err().contains("48-hour bake required for destructive flags"));
}

#[test] fn flag_destructive_true_at_10_percent_before_48h_is_valid() {
    let flag = FeatureFlag {
        destructive: true,
        rollout_percentage: 10,
        bake_started_at: Some(Utc::now() - Duration::hours(12)),
        ..Default::default()
    };
    assert!(FlagValidator::validate(&flag).is_ok());
}

#[test] fn flag_destructive_true_at_100_percent_after_48h_is_valid() {
    let flag = FeatureFlag {
        destructive: true,
        rollout_percentage: 100,
        bake_started_at: Some(Utc::now() - Duration::hours(49)),
        ..Default::default()
    };
    assert!(FlagValidator::validate(&flag).is_ok());
}

// Circuit breaker
#[test] fn circuit_breaker_triggers_after_2_failures_in_10_minute_window() {
    let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
    breaker.record_failure();
    assert_eq!(breaker.state(), CircuitState::Closed);
    breaker.record_failure();
    assert_eq!(breaker.state(), CircuitState::Open); // Trips on 2nd failure
}

#[test] fn circuit_breaker_open_disables_flag_immediately() {
    let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
    breaker.record_failure();
    breaker.record_failure();

    let flag_store = FlagStore::with_breaker(breaker);
    assert!(!flag_store.is_enabled("enable_new_parser"));
}

#[test] fn circuit_breaker_pauses_in_flight_executions_not_kills() {
    // Verify executions are paused (status=paused), not aborted (status=aborted)
}

#[test] fn circuit_breaker_resets_after_window_expires() {
    let mut breaker = CircuitBreaker::new("flag", 2, Duration::minutes(10));
    breaker.record_failure();
    breaker.record_failure();
    // Advance time past window
    breaker.advance_time(Duration::minutes(11));
    assert_eq!(breaker.state(), CircuitState::Closed);
}
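
The breaker behaviour exercised above can be sketched std-only. The constructor signature differs from the tests because a monotonic epoch-seconds clock is injected instead of wall time, keeping the sliding-window logic deterministic:

```rust
#[derive(Debug, PartialEq)]
enum CircuitState { Closed, Open }

/// Trips open once `max_failures` failures land inside `window_secs`.
struct CircuitBreaker {
    max_failures: usize,
    window_secs: u64,
    failures: Vec<u64>, // timestamps (epoch seconds) of recorded failures
}

impl CircuitBreaker {
    fn new(max_failures: usize, window_secs: u64) -> Self {
        Self { max_failures, window_secs, failures: Vec::new() }
    }

    fn record_failure(&mut self, now: u64) {
        self.failures.push(now);
    }

    fn state(&self, now: u64) -> CircuitState {
        // Only failures inside the sliding window count toward the trip.
        let recent = self.failures.iter()
            .filter(|&&t| now.saturating_sub(t) < self.window_secs)
            .count();
        if recent >= self.max_failures { CircuitState::Open } else { CircuitState::Closed }
    }
}
```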

#[test] fn ci_blocks_if_flag_at_100_percent_past_ttl() {
    // This test validates the CI check script logic
    let flags = vec![
        FeatureFlag { name: "old_flag", rollout_percentage: 100, ttl: expired_ttl(), ..Default::default() }
    ];
    let result = CiValidator::check_expired_flags(&flags);
    assert!(result.is_err());
}

8.2 Schema Migration Validation Tests (Elastic Schema — Story 10.2)

// tests/migrations/schema_validation_test.rs

#[tokio::test]
async fn migration_does_not_remove_existing_columns() {
    let db = TestDb::start().await;
    let columns_before = db.get_column_names("audit_events").await;

    // Apply all pending migrations
    sqlx::migrate!("./migrations").run(&db.pool).await.unwrap();

    let columns_after = db.get_column_names("audit_events").await;

    // Every column that existed before must still exist
    for col in &columns_before {
        assert!(columns_after.contains(col),
            "Migration removed column '{}' from audit_events — FORBIDDEN", col);
    }
}

#[tokio::test]
async fn migration_does_not_change_existing_column_types() {
    // Verify no type changes on existing columns
}

#[tokio::test]
async fn migration_does_not_rename_existing_columns() {
    // Verify column names are stable
}
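
The three rules above can also be linted statically over the migration SQL; a naive, deliberately over-broad sketch (`forbidden_operations` is a hypothetical helper, not the real validator):

```rust
/// Returns the forbidden operations found in a migration file.
/// DROP COLUMN, column renames, type changes, and DROP TABLE are never
/// additive, so any occurrence fails the migration lint outright.
fn forbidden_operations(sql: &str) -> Vec<&'static str> {
    const FORBIDDEN: &[&str] = &[
        "DROP COLUMN",
        "RENAME COLUMN",
        "ALTER COLUMN", // covers TYPE changes and NOT NULL tightening
        "DROP TABLE",
    ];
    // Case-insensitive scan: SQL keywords may be written in any case.
    let upper = sql.to_uppercase();
    FORBIDDEN.iter().copied().filter(|p| upper.contains(p)).collect()
}
```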

#[tokio::test]
async fn audit_events_table_has_no_update_permission_for_app_role() {
    let db = TestDb::start().await;
    let result = db.as_app_role()
        .execute("UPDATE audit_events SET event_type = 'tampered' WHERE 1=1")
        .await;
    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("permission denied"));
}

#[tokio::test]
async fn audit_events_table_has_no_delete_permission_for_app_role() {
    let db = TestDb::start().await;
    let result = db.as_app_role()
        .execute("DELETE FROM audit_events WHERE 1=1")
        .await;
    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("permission denied"));
}

#[tokio::test]
async fn execution_log_parsers_ignore_unknown_fields() {
    // Simulate a future schema with extra fields — old parser must not fail
    let event_json = json!({
        "id": Uuid::new_v4(),
        "event_type": "step.executed",
        "unknown_future_field": "some_value",  // New field old parser doesn't know
        "tenant_id": Uuid::new_v4(),
        "created_at": Utc::now(),
    });
    let result = AuditEvent::from_json(&event_json);
    assert!(result.is_ok()); // Must not fail on unknown fields
}

#[tokio::test]
async fn migration_includes_sunset_date_comment() {
    // Parse migration files and verify each has a sunset_date comment
    let migrations = read_migration_files("./migrations");
    for migration in &migrations {
        if migration.is_additive() {
            assert!(migration.content.contains("-- sunset_date:"),
                "Migration {} missing sunset_date comment", migration.name);
        }
    }
}

8.3 Decision Log Format Validation (Cognitive Durability — Story 10.3)

// tests/decisions/decision_log_test.rs

#[test]
fn decision_log_schema_requires_all_mandatory_fields() {
    let incomplete_log = json!({
        "prompt": "Why is rm -rf dangerous?",
        // Missing: reasoning, alternatives_considered, confidence, timestamp, author
    });
    let result = DecisionLog::validate(&incomplete_log);
    assert!(result.is_err());
}

#[test]
fn decision_log_confidence_must_be_between_0_and_1() {
    let log = DecisionLog { confidence: 1.5, ..valid_decision_log() };
    assert!(DecisionLog::validate(&log).is_err());
}
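
A sketch of the validation contract those two tests pin down (the struct shape is illustrative; `timestamp` is omitted for brevity):

```rust
/// Mandatory fields from the schema test, plus the 0..=1 confidence
/// bound. Validation returns the first problem found.
struct DecisionLog {
    prompt: String,
    reasoning: String,
    alternatives_considered: Vec<String>,
    confidence: f64,
    author: String,
}

fn validate(log: &DecisionLog) -> Result<(), String> {
    if log.prompt.is_empty() { return Err("prompt is required".into()); }
    if log.reasoning.is_empty() { return Err("reasoning is required".into()); }
    if log.alternatives_considered.is_empty() {
        return Err("alternatives_considered is required".into());
    }
    if !(0.0..=1.0).contains(&log.confidence) {
        return Err("confidence must be between 0 and 1".into());
    }
    if log.author.is_empty() { return Err("author is required".into()); }
    Ok(())
}
```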

#[test]
fn decision_log_destructive_command_list_change_requires_reasoning() {
    // Any PR adding/removing from the destructive command list must have
    // a decision log explaining why
    let change = DestructiveCommandListChange {
        command: "kubectl drain",
        action: ChangeAction::Add,
        decision_log: None, // Missing
    };
    let result = DestructiveCommandChangeValidator::validate(&change);
    assert!(result.is_err());
    assert!(result.unwrap_err().contains("decision log required for destructive command list changes"));
}

#[test]
fn all_existing_decision_logs_are_valid_json() {
    let logs = glob::glob("docs/decisions/*.json").unwrap();
    for log_path in logs {
        let content = std::fs::read_to_string(log_path.unwrap()).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(&content)
            .expect("Decision log must be valid JSON");
        assert!(DecisionLog::validate(&parsed).is_ok());
    }
}

#[test]
fn cognitive_complexity_cap_enforced_at_10() {
    // This is validated by the clippy lint in CI:
    // #![deny(clippy::cognitive_complexity)]
    // The test here validates that the lint config is present
    let clippy_config = std::fs::read_to_string(".clippy.toml").unwrap();
    assert!(clippy_config.contains("cognitive-complexity-threshold = 10"));
}

8.4 OTEL Span Assertion Tests (Semantic Observability — Story 10.4)

// tests/observability/otel_spans_test.rs

#[tokio::test]
async fn otel_runbook_execution_creates_parent_span() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());

    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let spans = tracer.finished_spans();
    let parent = spans.iter().find(|s| s.name == "runbook_execution");
    assert!(parent.is_some(), "runbook_execution parent span must exist");
}

#[tokio::test]
async fn otel_each_step_creates_child_spans_3_levels_deep() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());

    engine.execute_runbook(&fixture_runbook_with_one_step()).await.unwrap();

    let spans = tracer.finished_spans();
    let step_classification = spans.iter().find(|s| s.name == "step_classification");
    let step_approval = spans.iter().find(|s| s.name == "step_approval_check");
    let step_execution = spans.iter().find(|s| s.name == "step_execution");

    assert!(step_classification.is_some());
    assert!(step_approval.is_some());
    assert!(step_execution.is_some());

    // Verify parent-child hierarchy
    let parent_id = spans.iter().find(|s| s.name == "runbook_execution").unwrap().span_id;
    assert_eq!(step_classification.unwrap().parent_span_id, Some(parent_id));
}

#[tokio::test]
async fn otel_step_classification_span_has_required_attributes() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let span = tracer.span_by_name("step_classification").unwrap();
    assert!(span.attributes.contains_key("step.text_hash"));
    assert!(span.attributes.contains_key("step.classified_as"));
    assert!(span.attributes.contains_key("step.confidence_score"));
    assert!(span.attributes.contains_key("step.alternatives_considered"));
}

#[tokio::test]
async fn otel_step_execution_span_has_required_attributes() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let span = tracer.span_by_name("step_execution").unwrap();
    assert!(span.attributes.contains_key("step.command_hash"));
    assert!(span.attributes.contains_key("step.target_host_hash"));
    assert!(span.attributes.contains_key("step.exit_code"));
    assert!(span.attributes.contains_key("step.duration_ms"));
    // Must NOT contain raw command or PII
    assert!(!span.attributes.contains_key("step.command_raw"));
    assert!(!span.attributes.contains_key("step.stdout_raw"));
}

#[tokio::test]
async fn otel_step_approval_span_has_required_attributes() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let span = tracer.span_by_name("step_approval_check").unwrap();
    assert!(span.attributes.contains_key("step.approval_required"));
    assert!(span.attributes.contains_key("step.approval_source"));
    assert!(span.attributes.contains_key("step.approval_latency_ms"));
}

#[tokio::test]
async fn otel_no_pii_in_any_span_attributes() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    // Scan all span attributes for patterns that look like PII or raw commands
    let spans = tracer.finished_spans();
    for span in &spans {
        for (key, value) in &span.attributes {
            assert!(!key.ends_with("_raw"), "Raw data in span attribute: {}", key);
            assert!(!looks_like_email(value), "PII in span: {} = {}", key, value);
        }
    }
}
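
`looks_like_email` is an assumed helper; one plausible std-only heuristic, deliberately biased toward false positives since the scan should err toward flagging:

```rust
/// Crude email detector for span-attribute scanning: one '@' with a
/// non-empty local part, no whitespace, and a dot somewhere in the
/// domain. Not RFC-accurate by design — a real PII scanner would use
/// proper patterns.
fn looks_like_email(value: &str) -> bool {
    match value.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty()
                && !domain.is_empty()
                && domain.contains('.')
                && !domain.starts_with('.')
                && !value.contains(char::is_whitespace)
        }
        None => false,
    }
}
```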

#[tokio::test]
async fn otel_ai_classification_span_includes_model_metadata() {
    let span = tracer.span_by_name("step_classification").unwrap();
    assert!(span.attributes.contains_key("ai.prompt_hash"));
    assert!(span.attributes.contains_key("ai.model_version"));
    assert!(span.attributes.contains_key("ai.reasoning_chain"));
}

8.5 Governance Policy Enforcement Tests (Configurable Autonomy — Story 10.5)

// pkg/governance/tests.rs

// Strict mode: all steps require approval
#[test]
fn governance_strict_mode_requires_approval_for_safe_steps() {
    let policy = Policy { governance_mode: GovernanceMode::Strict, ..Default::default() };
    let engine = ExecutionEngine::with_policy(policy);
    let next = engine.next_state(&safe_step(), TrustLevel::Copilot);
    assert_eq!(next, State::AwaitApproval); // Even safe steps need approval in strict mode
}

// Audit mode: safe steps auto-execute, destructive always require approval
#[test]
fn governance_audit_mode_auto_executes_safe_steps() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
    let engine = ExecutionEngine::with_policy(policy);
    assert_eq!(engine.next_state(&safe_step(), TrustLevel::Copilot), State::AutoExecute);
}

#[test]
fn governance_audit_mode_still_requires_approval_for_dangerous_steps() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
    let engine = ExecutionEngine::with_policy(policy);
    assert_eq!(engine.next_state(&dangerous_step(), TrustLevel::Copilot), State::AwaitApproval);
}

#[test]
fn governance_no_fully_autonomous_mode_exists() {
    // There is no GovernanceMode::FullAuto variant
    // This test verifies the enum only has Strict and Audit
    let modes = GovernanceMode::all_variants();
    assert!(!modes.contains(&GovernanceMode::FullAuto));
    assert_eq!(modes.len(), 2);
}
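
The four mode tests above reduce to a small decision table. Sketched as a pure function with illustrative enums; trust-level blocking of 🔴 steps (Journey 2) is a separate gate and not modelled here:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GovernanceMode { Strict, Audit } // deliberately no FullAuto variant

#[derive(Debug, Clone, Copy, PartialEq)]
enum Risk { Safe, Caution, Dangerous } // 🟢 / 🟡 / 🔴

#[derive(Debug, PartialEq)]
enum Next { AutoExecute, AwaitApproval }

/// Strict mode: every step, even 🟢, gates on a human.
/// Audit mode: only 🟢 auto-executes; 🟡 and 🔴 still require approval.
fn next_state(mode: GovernanceMode, risk: Risk) -> Next {
    match (mode, risk) {
        (GovernanceMode::Strict, _) => Next::AwaitApproval,
        (GovernanceMode::Audit, Risk::Safe) => Next::AutoExecute,
        (GovernanceMode::Audit, _) => Next::AwaitApproval,
    }
}
```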

// Panic mode
#[tokio::test]
async fn governance_panic_mode_halts_all_executions_within_1_second() {
    // Verified in E2E section — unit test verifies Redis key check
    let redis = MockRedis::new();
    let engine = ExecutionEngine::with_redis(redis.clone());

    redis.set("dd0c:panic", "1").await;

    let result = engine.check_panic_mode().await;
    assert_eq!(result, PanicModeStatus::Active);
}

#[tokio::test]
async fn governance_panic_mode_uses_redis_not_database() {
    // Panic mode must NOT do a DB query — Redis only for <1s requirement
    let db = TrackingDb::new();
    let redis = MockRedis::new();
    redis.set("dd0c:panic", "1").await;

    let engine = ExecutionEngine::with_db(db.clone()).with_redis(redis);
    engine.check_panic_mode().await;

    assert_eq!(db.query_count(), 0, "Panic mode check must not query the database");
}

#[tokio::test(start_paused = true)] // paused test clock so advance() works
async fn governance_panic_mode_requires_manual_clearance() {
    // Panic mode cannot be auto-cleared — only a manual API call resets it
    let engine = ExecutionEngine::new();
    engine.trigger_panic_mode().await;

    // Simulate 24 hours passing on the paused test clock
    tokio::time::advance(Duration::from_secs(24 * 60 * 60)).await;

    // Must still be in panic mode
    assert_eq!(engine.check_panic_mode().await, PanicModeStatus::Active);
}

// Governance drift monitoring
#[tokio::test]
async fn governance_drift_auto_downgrades_when_auto_execution_exceeds_threshold() {
    let db = TestDb::start().await;
    // Insert execution history: 80% auto-executed (threshold is 70%)
    insert_execution_history_with_auto_ratio(&db, 0.80).await;

    let drift_monitor = GovernanceDriftMonitor::new(&db.pool);
    let result = drift_monitor.check().await;

    assert_eq!(result.action, DriftAction::DowngradeToStrict);
}

// Per-runbook governance override
#[test]
fn governance_runbook_locked_to_strict_ignores_system_audit_mode() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
    let runbook = Runbook { governance_override: Some(GovernanceMode::Strict), ..Default::default() };
    let engine = ExecutionEngine::with_policy(policy);

    // Even in audit mode, this runbook requires approval for safe steps
    assert_eq!(engine.next_state_for_runbook(&safe_step(), &runbook), State::AwaitApproval);
}

9. Test Data & Fixtures

9.1 Runbook Format Factories

// tests/fixtures/runbooks/mod.rs

pub fn fixture_runbook_markdown() -> &'static str {
    include_str!("markdown_basic.md")
}

pub fn fixture_runbook_confluence_html() -> &'static str {
    include_str!("confluence_export.html")
}

pub fn fixture_runbook_notion_export() -> &'static str {
    include_str!("notion_export.md")
}

pub fn fixture_runbook_with_ambiguities() -> &'static str {
    include_str!("ambiguous_steps.md")
}

pub fn fixture_runbook_with_variables() -> &'static str {
    include_str!("variables_and_placeholders.md")
}

9.2 Step Classification Fixtures

// tests/fixtures/commands/mod.rs

pub fn fixture_safe_commands() -> Vec<&'static str> {
    vec![
        "kubectl get pods -n kube-system",
        "aws ec2 describe-instances --region us-east-1",
        "cat /var/log/syslog | grep error",
        "SELECT count(*) FROM users",
    ]
}

pub fn fixture_caution_commands() -> Vec<&'static str> {
    vec![
        "kubectl rollout restart deployment/api",
        "systemctl restart nginx",
        "aws ec2 stop-instances --instance-ids i-1234567890abcdef0",
        "UPDATE users SET status = 'active' WHERE id = 123",
    ]
}

pub fn fixture_destructive_commands() -> Vec<&'static str> {
    vec![
        "kubectl delete namespace prod",
        "rm -rf /var/lib/postgresql/data",
        "DROP TABLE customers",
        "aws rds delete-db-instance --db-instance-identifier prod-db",
        "terraform destroy -auto-approve",
    ]
}

pub fn fixture_ambiguous_commands() -> Vec<&'static str> {
    vec![
        "restart the service",
        "./cleanup.sh",
        "python script.py",
        "curl -X POST http://internal-api/reset",
    ]
}

9.3 Infrastructure Target Mocks

We mock infrastructure targets for execution tests using isolated containers or HTTP mock servers.

// tests/fixtures/infra/mod.rs

/// Spawns a lightweight k3s container for testing kubectl commands safely
pub async fn mock_k8s_cluster() -> K3sContainer {
    K3sContainer::start().await
}

/// Spawns LocalStack for testing AWS CLI commands
pub async fn mock_aws_env() -> LocalStackContainer {
    LocalStackContainer::start().await
}

/// Spawns a bare Alpine container with SSH access
pub async fn mock_bare_metal() -> SshContainer {
    SshContainer::start("alpine:latest").await
}

9.4 Approval Workflow Scenario Fixtures

// tests/fixtures/approvals/mod.rs

pub fn fixture_slack_approval_payload(step_id: &str, user_id: &str) -> serde_json::Value {
    json!({
        "type": "block_actions",
        "user": { "id": user_id, "username": "riley.oncall" },
        "actions": [{
            "action_id": "approve_step",
            "value": step_id
        }]
    })
}

pub fn fixture_slack_typed_confirmation_payload(step_id: &str, resource_name: &str) -> serde_json::Value {
    json!({
        "type": "view_submission",
        "user": { "id": "U123456" },
        "view": {
            "state": {
                "values": {
                    "confirm_block": {
                        "resource_input": { "value": resource_name }
                    }
                }
            },
            "private_metadata": step_id
        }
    })
}

10. TDD Implementation Order

To maintain the safety-first invariant, components must be built and tested in a specific order. Execution code cannot be written until the safety constraints are proven.

10.1 Bootstrap Sequence (Test Infrastructure First)

  1. Testcontainers Setup: Establish TestDb with migrations and RLS policies. Prove cross-tenant isolation fails closed.
  2. OTEL Test Tracer: Implement InMemoryTracer to assert span creation and attributes.
  3. Canary Suite Harness: Create the canary_suite test target that runs a hardcoded list of destructive commands and fails if any return 🟢.

10.2 Epic Dependency Order

| Phase | Component | TDD Rule | Rationale |
|-------|-----------|----------|-----------|
| 1 | Deterministic Safety Scanner | Unit tests FIRST | Foundation of safety. Exhaustive pattern tests must exist before any parser or execution logic. |
| 2 | Merge Engine | Unit tests FIRST | Hardcoded rules. Prove 🔴 overrides 🟢 before integrating LLMs. |
| 3 | Execution Engine State Machine | Unit tests FIRST | CRITICAL: Prove trust level boundaries and approval gates block 🔴/🟡 steps before writing any code that actually executes commands. |
| 4 | Agent-Side Scanner | Unit tests FIRST | Port SaaS scanner logic to Agent binary. Prove the Agent rejects rm -rf independently. |
| 5 | Agent gRPC & Command Execution | Integration tests FIRST | Use sandbox containers. Prove timeout kills processes and shell injection fails. |
| 6 | Runbook Parser | Integration tests lead | Use LLM fixtures. The parser is safe because the classifier catches its mistakes. |
| 7 | Audit Trail | Unit tests FIRST | Prove schema immutability and hash chaining. |
| 8 | Slack Bot & API | Integration tests lead | UI and routing. |

10.3 The Execution Engine Testing Mandate

Execution engine tests MUST be written before any execution code.

Before writing the impl ExecutionEngine { pub async fn execute(...) } function, the following tests must exist and fail (Red phase):

  1. engine_dangerous_step_blocked_at_copilot_level
  2. engine_caution_step_requires_approval_at_copilot_level
  3. engine_safe_step_blocked_at_read_only_trust_level
  4. engine_duplicate_step_execution_id_is_rejected
  5. engine_pauses_in_flight_execution_when_panic_mode_set

Only once these tests are defined can the state machine be implemented to make them pass (Green phase). This ensures no execution path can bypass the Trust Gradient.