# dd0c/run: Test Architecture & TDD Strategy

**Version:** 1.0 | **Date:** 2026-02-28 | **Status:** Draft

---

## 1. Testing Philosophy & TDD Workflow

### 1.1 Core Principle: Safety-First Testing

dd0c/run executes commands in production infrastructure. A misclassification that allows a destructive command to auto-execute is an existential risk. This shapes every testing decision:

> **If it touches execution, tests come first. No exceptions.**

The TDD discipline here is not a process preference; it is a safety mechanism. Writing tests first for the Action Classifier and Execution Engine forces the developer to enumerate failure modes before writing code that could cause them.

### 1.2 Red-Green-Refactor Adapted for dd0c/run

The standard Red-Green-Refactor cycle applies with one critical modification: **for any component that classifies or executes commands, the "Red" phase must include at least one test for a known-destructive command being correctly blocked.**

```
Standard TDD:        dd0c/run TDD:

  Red                  Red (write failing test)
   ↓                    ↓
  Green                Red-Safety (add: "rm -rf / must be 🔴")
   ↓                    ↓
  Refactor             Green (make ALL tests pass)
                        ↓
                       Refactor
                        ↓
                       Canary-Check (run canary suite; must stay green)
```

The **Canary Suite** is a mandatory gate: a curated set of ~50 known-destructive commands that must always be classified as 🔴. It runs after every refactor. If any canary is classified as anything other than 🔴, the build fails.

### 1.3 When to Write Tests First vs. Integration Tests Lead

| Scenario | Approach | Rationale |
|----------|----------|-----------|
| Deterministic Safety Scanner | Unit tests FIRST, always | Pure function. No I/O. Exhaustive coverage of destructive patterns is mandatory before any code. |
| Classification Merge Rules | Unit tests FIRST, always | Hardcoded logic. Tests define the spec. |
| Execution Engine state machine | Unit tests FIRST, always | State transitions are safety-critical. Tests enumerate all valid/invalid transitions. |
| Approval workflow | Unit tests FIRST | Approval bypass is a threat vector. Tests must prove it's impossible. |
| Runbook Parser (LLM extraction) | Integration tests lead | LLM behavior is non-deterministic. Integration tests with recorded fixtures define expected behavior. |
| Slack Bot UI flows | Integration tests lead | Slack API interactions are I/O-heavy. Mock Slack API, test message shapes. |
| Alert-Runbook Matcher | Integration tests lead | Matching logic depends on DB state. Testcontainers + fixture data. |
| Audit Trail ingestion | Unit tests first for schema, integration for pipeline | Schema is deterministic; pipeline has I/O. |

### 1.4 Test Naming Conventions

All tests follow the pattern `<component>_<scenario>_<expected_outcome>`:

```rust
// Unit tests
#[test] fn scanner_rm_rf_root_classifies_as_dangerous() { ... }
#[test] fn scanner_kubectl_get_pods_classifies_as_safe() { ... }
#[test] fn merge_engine_scanner_dangerous_overrides_llm_safe() { ... }
#[test] fn state_machine_caution_step_transitions_to_await_approval() { ... }

// Integration tests
#[tokio::test] async fn parser_confluence_html_extracts_ordered_steps() { ... }
#[tokio::test] async fn execution_engine_approval_timeout_does_not_auto_approve() { ... }

// E2E tests
#[tokio::test] async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() { ... }
```

**Prohibited naming patterns:**

- `test_thing()`: too vague
- `it_works()`: meaningless
- `test1()`, `test2()`: no context

### 1.5 Safety-First Rule: Destructive Command Tests Are Mandatory Pre-Code

> **Hard Rule:** No code may be written for the Action Classifier, Execution Engine, or Agent-side Scanner without first writing tests that prove destructive commands are blocked.

This is enforced via CI: the `pkg/classifier/` and `pkg/executor/` directories require a minimum of 95% test coverage before any PR can merge. The canary test suite must pass on every commit touching these packages.

---

## 2. Test Pyramid

### 2.1 Recommended Ratio

```
        /\
       /E2E\          ~10%: Critical user journeys, chaos scenarios
      /──────\
     / Integ  \       ~20%: Service boundaries, DB, Slack, gRPC
    /──────────\
   /    Unit    \     ~70%: Pure logic, state machines, classifiers
  /______________\
```

For the **Action Classifier** and **Execution Engine** specifically, the ratio shifts:

```
Unit:        80%  (exhaustive pattern coverage, state machine transitions)
Integration: 15%  (scanner ↔ classifier merge, engine ↔ agent gRPC)
E2E:          5%  (full execution journeys with sandboxed infra)
```

### 2.2 Unit Test Targets (per component)

| Component | Unit Test Focus | Coverage Target |
|-----------|----------------|-----------------|
| Deterministic Safety Scanner | Pattern matching, AST parsing, all risk categories | **100%** |
| Classification Merge Engine | All 5 merge rules + edge cases | **100%** |
| Execution Engine state machine | All state transitions, trust level enforcement | **95%** |
| Runbook Parser (normalizer) | HTML stripping, markdown normalization, whitespace | **90%** |
| Variable Detector | Placeholder regex patterns, env ref detection | **90%** |
| Branch Mapper | DAG construction, if/else detection | **85%** |
| Approval Workflow | Approval gates, typed confirmation, timeout behavior | **95%** |
| Audit Trail schema | Event type validation, immutability constraints | **90%** |
| Alert-Runbook Matcher | Keyword matching, similarity scoring | **85%** |
| Trust Level Enforcement | Level checks per risk level, auto-downgrade | **95%** |
| Panic Mode | Trigger conditions, halt sequence, Redis key check | **95%** |
| Feature Flag Circuit Breaker | 2-failure threshold, 48h bake enforcement | **95%** |

### 2.3 Integration Test Boundaries

| Boundary | Test Type | Infrastructure |
|----------|-----------|----------------|
| Parser ↔ LLM Gateway (dd0c/route) | Contract test with recorded responses | WireMock / recorded fixtures |
| Classifier ↔ PostgreSQL (audit write) | Integration test | Testcontainers (PostgreSQL) |
| Execution Engine ↔ Agent (gRPC) | Integration test | In-process gRPC server mock |
| Execution Engine ↔ Slack Bot | Integration test | Slack API mock |
| Approval Workflow ↔ Slack | Integration test | Slack API mock |
| Audit Trail ↔ PostgreSQL | Integration test | Testcontainers (PostgreSQL) |
| Alert Matcher ↔ PostgreSQL + pgvector | Integration test | Testcontainers (PostgreSQL + pgvector) |
| Webhook Receiver ↔ PagerDuty/OpsGenie | Contract test | Recorded webhook payloads |
| RLS enforcement | Integration test | Testcontainers (PostgreSQL with RLS enabled) |

### 2.4 E2E / Smoke Test Scenarios

| Scenario | Priority | Infrastructure |
|----------|----------|----------------|
| Full journey: paste → parse → classify → approve → execute → audit | P0 | Docker Compose sandbox |
| Destructive command blocked at all trust levels | P0 | Docker Compose sandbox |
| Panic mode triggered and halts in-flight execution | P0 | Docker Compose sandbox |
| Approval timeout does not auto-approve | P0 | Docker Compose sandbox |
| Cross-tenant data isolation (RLS) | P0 | Testcontainers |
| Agent reconnect after network partition | P1 | Docker Compose sandbox |
| Mid-execution failure triggers rollback flow | P1 | Docker Compose sandbox |
| Feature flag circuit breaker halts execution after 2 failures | P1 | Docker Compose sandbox |

---

## 3. Unit Test Strategy (Per Component)

### 3.1 Deterministic Safety Scanner

**What to test:** Every pattern category. Every edge case. This component has 100% coverage as a hard requirement.
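To make the target of these tests concrete, here is a minimal sketch of a pure classification function, assuming simple prefix tables and naive splitting on shell separators. Everything in it (`classify`, the prefix lists, the `sudo`/`xargs` unwrapping) is hypothetical and chosen for illustration; the real scanner is specified to use AST parsing, which this sketch does not attempt.

```rust
// Hypothetical sketch: prefix tables and function names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskLevel {
    Safe,
    Caution,
    Dangerous,
    Unknown,
}

const DANGEROUS: &[&str] = &["rm -rf", "kubectl delete", "terraform destroy", "mkfs"];
const CAUTION: &[&str] = &["kubectl rollout restart", "kubectl scale", "systemctl restart"];
const SAFE: &[&str] = &["kubectl get", "kubectl describe", "kubectl logs", "terraform plan", "grep"];

/// Classifies one segment of a command line (no pipes or separators inside).
fn classify_segment(segment: &str) -> RiskLevel {
    let mut seg = segment.trim();
    // Look through wrapper commands so `sudo rm ...` and `xargs rm ...` are seen as `rm ...`.
    loop {
        if let Some(rest) = seg.strip_prefix("sudo ") {
            seg = rest.trim_start();
            continue;
        }
        if let Some(rest) = seg.strip_prefix("xargs ") {
            seg = rest.trim_start();
            continue;
        }
        break;
    }
    if DANGEROUS.iter().any(|p| seg.starts_with(*p)) {
        RiskLevel::Dangerous
    } else if CAUTION.iter().any(|p| seg.starts_with(*p)) {
        RiskLevel::Caution
    } else if SAFE.iter().any(|p| seg.starts_with(*p)) {
        RiskLevel::Safe
    } else {
        RiskLevel::Unknown
    }
}

/// Pure function, no I/O: any dangerous segment makes the whole command dangerous;
/// anything unrecognized yields Unknown, never Safe.
pub fn classify(command: &str) -> RiskLevel {
    let mut saw_any = false;
    let mut saw_unknown = false;
    let mut saw_caution = false;
    for segment in command.split(|c: char| matches!(c, '|' | ';' | '&' | '\n')) {
        if segment.trim().is_empty() {
            continue;
        }
        saw_any = true;
        match classify_segment(segment) {
            RiskLevel::Dangerous => return RiskLevel::Dangerous,
            RiskLevel::Unknown => saw_unknown = true,
            RiskLevel::Caution => saw_caution = true,
            RiskLevel::Safe => {}
        }
    }
    if !saw_any || saw_unknown {
        RiskLevel::Unknown
    } else if saw_caution {
        RiskLevel::Caution
    } else {
        RiskLevel::Safe
    }
}
```

Prefix matching of this kind cannot see inside command substitution such as `$(rm -rf /)`, which is one reason the test list includes dedicated AST cases.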
**Key test cases:**

```rust
// pkg/classifier/scanner/tests.rs

// ── BLOCKLIST (🔴 Dangerous) ──────────────────────────────────────────
#[test] fn scanner_kubectl_delete_namespace_is_dangerous() {}
#[test] fn scanner_kubectl_delete_deployment_is_dangerous() {}
#[test] fn scanner_kubectl_delete_pvc_is_dangerous() {}
#[test] fn scanner_kubectl_delete_all_is_dangerous() {}
#[test] fn scanner_drop_table_is_dangerous() {}
#[test] fn scanner_drop_database_is_dangerous() {}
#[test] fn scanner_truncate_table_is_dangerous() {}
#[test] fn scanner_delete_without_where_is_dangerous() {}
#[test] fn scanner_rm_rf_is_dangerous() {}
#[test] fn scanner_rm_rf_root_is_dangerous() {}
#[test] fn scanner_rm_rf_slash_is_dangerous() {}
#[test] fn scanner_aws_ec2_terminate_instances_is_dangerous() {}
#[test] fn scanner_aws_rds_delete_db_instance_is_dangerous() {}
#[test] fn scanner_terraform_destroy_is_dangerous() {}
#[test] fn scanner_dd_if_dev_zero_is_dangerous() {}
#[test] fn scanner_mkfs_is_dangerous() {}
#[test] fn scanner_sudo_rm_is_dangerous() {}
#[test] fn scanner_chmod_777_recursive_is_dangerous() {}
#[test] fn scanner_kubectl_create_clusterrolebinding_is_dangerous() {}
#[test] fn scanner_aws_iam_create_user_is_dangerous() {}
#[test] fn scanner_pipe_to_xargs_rm_is_dangerous() {}
#[test] fn scanner_delete_with_where_but_no_condition_value_is_dangerous() {}

// ── CAUTION LIST (🟡) ────────────────────────────────────────────────
#[test] fn scanner_kubectl_rollout_restart_is_caution() {}
#[test] fn scanner_kubectl_scale_is_caution() {}
#[test] fn scanner_aws_ec2_stop_instances_is_caution() {}
#[test] fn scanner_aws_ec2_start_instances_is_caution() {}
#[test] fn scanner_systemctl_restart_is_caution() {}
#[test] fn scanner_update_with_where_clause_is_caution() {}
#[test] fn scanner_insert_into_is_caution() {}
#[test] fn scanner_docker_restart_is_caution() {}
#[test] fn scanner_aws_autoscaling_set_desired_capacity_is_caution() {}

// ── ALLOWLIST (🟢 Safe) ──────────────────────────────────────────────
#[test] fn scanner_kubectl_get_pods_is_safe() {}
#[test] fn scanner_kubectl_describe_deployment_is_safe() {}
#[test] fn scanner_kubectl_logs_is_safe() {}
#[test] fn scanner_aws_ec2_describe_instances_is_safe() {}
#[test] fn scanner_aws_s3_ls_is_safe() {}
#[test] fn scanner_select_query_is_safe() {}
#[test] fn scanner_explain_query_is_safe() {}
#[test] fn scanner_curl_get_is_safe() {}
#[test] fn scanner_cat_file_is_safe() {}
#[test] fn scanner_grep_is_safe() {}
#[test] fn scanner_tail_f_is_safe() {}
#[test] fn scanner_docker_ps_is_safe() {}
#[test] fn scanner_terraform_plan_is_safe() {}
#[test] fn scanner_dig_is_safe() {}
#[test] fn scanner_nslookup_is_safe() {}

// ── UNKNOWN / EDGE CASES ─────────────────────────────────────────────
#[test] fn scanner_unknown_command_defaults_to_unknown_not_safe() {}
#[test] fn scanner_empty_command_defaults_to_unknown() {}
#[test] fn scanner_custom_script_path_defaults_to_unknown() {}
#[test] fn scanner_select_into_is_dangerous_not_safe() {} // SELECT INTO is a write
#[test] fn scanner_delete_with_where_is_caution_not_dangerous() {}
#[test] fn scanner_curl_post_is_caution_not_safe() {} // POST has side effects
#[test] fn scanner_pipe_chain_with_destructive_segment_is_dangerous() {}
#[test] fn scanner_command_substitution_with_rm_is_dangerous() {}
#[test] fn scanner_multiline_command_with_destructive_line_is_dangerous() {}

// ── AST PARSING (tree-sitter) ────────────────────────────────────────
#[test] fn scanner_sql_ast_delete_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_update_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_drop_statement_is_dangerous() {}
#[test] fn scanner_shell_ast_piped_rm_is_dangerous() {}
#[test] fn scanner_shell_ast_subshell_with_destructive_is_dangerous() {}

// ── PERFORMANCE ──────────────────────────────────────────────────────
#[test] fn scanner_classifies_in_under_1ms() {}
#[test] fn scanner_classifies_100_commands_in_under_10ms() {}
```

**Mocking strategy:** None. The scanner is a pure function with no I/O. All tests are synchronous, no mocks needed.

**Language-specific patterns (Rust):**

- Use `#[test]` for synchronous unit tests
- Use the `criterion` crate for performance benchmarks
- Compile regex sets once per test binary using `lazy_static!` or `once_cell`
- Use `rstest` for parameterized test cases across command variants

```rust
// Parameterized test example using rstest
use rstest::rstest;

#[rstest]
#[case("kubectl delete namespace production", RiskLevel::Dangerous)]
#[case("kubectl delete deployment payment-svc", RiskLevel::Dangerous)]
#[case("kubectl delete pod payment-abc123", RiskLevel::Dangerous)]
#[case("kubectl delete all --all", RiskLevel::Dangerous)]
fn scanner_kubectl_delete_variants_are_dangerous(
    #[case] command: &str,
    #[case] expected: RiskLevel,
) {
    let scanner = Scanner::new();
    assert_eq!(scanner.classify(command).risk, expected);
}
```

### 3.2 Classification Merge Engine

**What to test:** All 5 merge rules, including every combination of scanner/LLM results.
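The five rules enumerated in the test list for this section can be read as "take the more severe of the scanner floor and the confidence-adjusted LLM verdict". A minimal sketch under that reading follows; the type names, the `merge` signature, and the `escalate` helper are assumptions for illustration, and the sketch omits the rule identifier that the real merge result is expected to carry.

```rust
// Hypothetical sketch: types and names are illustrative, not the project's API.
// Variant order matters: the derived `Ord` gives Safe < Caution < Dangerous.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Risk {
    Safe,
    Caution,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScannerVerdict {
    Known(Risk),
    Unknown,
}

#[derive(Debug, Clone, Copy)]
pub struct LlmVerdict {
    pub risk: Risk,
    pub confidence: f64,
}

/// Threshold from Rule 5: below this, the LLM verdict is escalated one level.
const CONFIDENCE_THRESHOLD: f64 = 0.9;

fn escalate(risk: Risk) -> Risk {
    match risk {
        Risk::Safe => Risk::Caution,
        Risk::Caution | Risk::Dangerous => Risk::Dangerous,
    }
}

pub fn merge(scanner: ScannerVerdict, llm: LlmVerdict) -> Risk {
    // Rule 5: low-confidence LLM results are escalated before merging.
    let llm_risk = if llm.confidence < CONFIDENCE_THRESHOLD {
        escalate(llm.risk)
    } else {
        llm.risk
    };
    // Rule 4: an Unknown scanner result can never yield less than Caution.
    let scanner_floor = match scanner {
        ScannerVerdict::Known(risk) => risk,
        ScannerVerdict::Unknown => Risk::Caution,
    };
    // Rules 1-3: the more severe verdict wins, so Safe requires both inputs to be Safe.
    scanner_floor.max(llm_risk)
}
```

Because the result is a maximum over an ordered enum, neither input can lower the other's verdict, which is the property the "only path to safe" test pins down.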
```rust
// pkg/classifier/merge/tests.rs

// Rule 1: Scanner 🔴 → always 🔴
#[test] fn merge_scanner_dangerous_llm_safe_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_caution_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_dangerous_yields_dangerous() {}

// Rule 2: Scanner 🟡, LLM 🟢 → 🟡 (scanner wins)
#[test] fn merge_scanner_caution_llm_safe_yields_caution() {}

// Rule 3: Both 🟢 → 🟢 (only path to safe)
#[test] fn merge_scanner_safe_llm_safe_yields_safe() {}

// Rule 4: Scanner Unknown → 🟡 minimum
#[test] fn merge_scanner_unknown_llm_safe_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_caution_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_dangerous_yields_dangerous() {}

// Rule 5: LLM confidence < 0.9 → escalate one level
#[test] fn merge_low_confidence_safe_escalates_to_caution() {}
#[test] fn merge_low_confidence_caution_escalates_to_dangerous() {}
#[test] fn merge_high_confidence_safe_does_not_escalate() {}

// LLM escalation overrides scanner
#[test] fn merge_scanner_safe_llm_caution_yields_caution() {}
#[test] fn merge_scanner_safe_llm_dangerous_yields_dangerous() {}
#[test] fn merge_scanner_caution_llm_dangerous_yields_dangerous() {}

// Merge rule is logged
#[test] fn merge_result_includes_applied_rule_identifier() {}
#[test] fn merge_result_includes_both_scanner_and_llm_inputs() {}
```

**Mocking strategy:** LLM results are passed as plain structs, so no mocking is needed. The merge engine is a pure function.

### 3.3 Execution Engine State Machine

**What to test:** Every valid state transition, every invalid transition (must be rejected), trust level enforcement at each transition.
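As an illustration of what "trust level enforcement at each transition" could look like in code, the sketch below gates a ready step on its risk and the current trust level, and checks a small subset of transitions. The enum variants and both function names are assumptions derived from the test names in this section; the real state machine has more states than are shown here.

```rust
// Hypothetical sketch: states and names are inferred from test names, not from the engine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskLevel {
    Safe,
    Caution,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustLevel {
    ReadOnly,
    Copilot,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepState {
    StepReady,
    AutoExecute,
    AwaitApproval,
    Blocked,
    Executing,
    Skipped,
    StepComplete,
    StepFailed,
    TimedOut,
}

/// Decides where a ready step goes next. Evaluated per step, not per runbook.
pub fn gate_ready_step(risk: RiskLevel, trust: TrustLevel) -> StepState {
    match (trust, risk) {
        // Read-only trust never executes anything, not even safe steps.
        (TrustLevel::ReadOnly, _) => StepState::Blocked,
        (TrustLevel::Copilot, RiskLevel::Safe) => StepState::AutoExecute,
        (TrustLevel::Copilot, RiskLevel::Caution) => StepState::AwaitApproval,
        (TrustLevel::Copilot, RiskLevel::Dangerous) => StepState::Blocked,
    }
}

/// Allow-list of transitions; anything not listed is rejected.
/// Only the subset named in the unit tests is covered here.
pub fn is_valid_transition(from: StepState, to: StepState) -> bool {
    use StepState::*;
    matches!(
        (from, to),
        (StepReady, AutoExecute)
            | (StepReady, AwaitApproval)
            | (StepReady, Blocked)
            | (AwaitApproval, Executing)
            | (AwaitApproval, Skipped)
            | (Executing, StepComplete)
            | (Executing, StepFailed)
            | (Executing, TimedOut)
    )
}
```

Expressing valid transitions as an allow-list means a forgotten case fails closed: an unlisted transition is rejected rather than silently permitted.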
```rust
// pkg/executor/state_machine/tests.rs

// Valid transitions
#[test] fn engine_pending_to_preflight_on_start() {}
#[test] fn engine_preflight_to_step_ready_on_prerequisites_met() {}
#[test] fn engine_step_ready_to_auto_execute_for_safe_step_at_copilot_level() {}
#[test] fn engine_step_ready_to_await_approval_for_caution_step() {}
#[test] fn engine_step_ready_to_blocked_for_dangerous_step_at_copilot_level() {}
#[test] fn engine_await_approval_to_executing_on_human_approval() {}
#[test] fn engine_await_approval_to_skipped_on_human_skip() {}
#[test] fn engine_executing_to_step_complete_on_success() {}
#[test] fn engine_executing_to_step_failed_on_error() {}
#[test] fn engine_executing_to_timed_out_on_timeout() {}
#[test] fn engine_step_complete_to_step_ready_when_more_steps() {}
#[test] fn engine_step_complete_to_runbook_complete_when_last_step() {}
#[test] fn engine_step_failed_to_rollback_available_when_rollback_exists() {}
#[test] fn engine_step_failed_to_manual_intervention_when_no_rollback() {}
#[test] fn engine_rollback_available_to_rolling_back_on_approval() {}
#[test] fn engine_rolling_back_to_step_ready_on_rollback_success() {}
#[test] fn engine_rolling_back_to_manual_intervention_on_rollback_failure() {}
#[test] fn engine_runbook_complete_to_divergence_analysis() {}

// Invalid transitions (must be rejected)
#[test] fn engine_cannot_skip_preflight_state() {}
#[test] fn engine_cannot_auto_execute_caution_step_at_copilot_level() {}
#[test] fn engine_cannot_auto_execute_dangerous_step_at_any_v1_level() {}
#[test] fn engine_cannot_transition_from_completed_to_executing() {}

// Trust level enforcement
#[test] fn engine_safe_step_blocked_at_read_only_trust_level() {}
#[test] fn engine_caution_step_requires_approval_at_copilot_level() {}
#[test] fn engine_dangerous_step_blocked_at_copilot_level_v1() {}
#[test] fn engine_trust_level_checked_per_step_not_per_runbook() {}

// Timeout behavior
#[test] fn engine_safe_step_times_out_after_60_seconds() {}
#[test] fn engine_caution_step_times_out_after_120_seconds() {}
#[test] fn engine_dangerous_step_times_out_after_300_seconds() {}
#[test] fn engine_approval_timeout_does_not_auto_approve() {}
#[test] fn engine_approval_timeout_marks_execution_as_stalled() {}

// Idempotency
#[test] fn engine_duplicate_step_execution_id_is_rejected() {}
#[test] fn engine_duplicate_approval_for_same_step_is_idempotent() {}

// Panic mode
#[test] fn engine_checks_panic_mode_before_each_step() {}
#[test] fn engine_pauses_in_flight_execution_when_panic_mode_set() {}
#[test] fn engine_does_not_kill_executing_step_on_panic_mode() {}
```

**Mocking strategy:**

- Mock the Agent gRPC client using a trait object (`MockAgentClient`)
- Mock the Slack notification sender
- Mock the database using an in-memory state store for pure state machine tests
- Use `tokio::time::pause()` for timeout tests (no real waiting)

### 3.4 Runbook Parser

**What to test:** Normalization correctness, LLM output parsing, variable detection, branch mapping.
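Variable detection is the most mechanical of these concerns, so a small sketch may help pin down what the tests assert. It assumes three placeholder syntaxes, `$NAME`, `<name>` and `{{name}}`; the double-brace form, the `detect_variables` name and the allowed name characters are all assumptions, and the sketch uses plain string scanning rather than the regex patterns the coverage table mentions.

```rust
// Hypothetical sketch: delimiter set and function names are assumptions.
/// (opening delimiter, closing delimiter); `$NAME` has no closing delimiter.
const DELIMITERS: &[(&str, &str)] = &[("{{", "}}"), ("<", ">"), ("$", "")];

/// Returns the longest leading run of ASCII letters, digits and underscores.
fn take_name(s: &str) -> &str {
    let end = s
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(s.len());
    &s[..end]
}

/// Finds placeholder names in first-seen order, without duplicates.
/// Intended to run on normalized text, after HTML tags have been stripped.
pub fn detect_variables(text: &str) -> Vec<String> {
    let mut found: Vec<String> = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Try each delimiter pair at the current position.
        let matched = DELIMITERS.iter().find_map(|(open, close)| {
            let after_open = rest.strip_prefix(*open)?;
            let name = take_name(after_open);
            let after_name = &after_open[name.len()..];
            if !name.is_empty() && after_name.starts_with(*close) {
                Some((name, open.len() + name.len() + close.len()))
            } else {
                None
            }
        });
        match matched {
            Some((name, consumed)) => {
                if !found.iter().any(|f| f.as_str() == name) {
                    found.push(name.to_string());
                }
                rest = &rest[consumed..];
            }
            None => {
                // No placeholder here: advance by one full character.
                let step = rest.chars().next().map_or(1, char::len_utf8);
                rest = &rest[step..];
            }
        }
    }
    found
}
```

Shell command substitution such as `$(date)` is deliberately not reported, because the character after `$` is not a valid name character.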
```rust
// pkg/parser/tests.rs

// Normalizer
#[test] fn normalizer_strips_html_tags() {}
#[test] fn normalizer_strips_confluence_macros() {}
#[test] fn normalizer_normalizes_bullet_styles_to_numbered() {}
#[test] fn normalizer_preserves_code_blocks() {}
#[test] fn normalizer_normalizes_whitespace() {}
#[test] fn normalizer_handles_empty_input() {}
#[test] fn normalizer_handles_unicode_content() {}

// LLM output parsing (using recorded fixtures)
#[test] fn parser_extracts_ordered_steps_from_llm_response() {}
#[test] fn parser_handles_llm_returning_empty_steps_array() {}
#[test] fn parser_rejects_llm_response_missing_required_fields() {}
#[test] fn parser_handles_llm_timeout_gracefully() {}
#[test] fn parser_is_idempotent_same_input_same_output() {}
#[test] fn parser_risk_level_is_null_in_output() {} // Parser never classifies

// Variable detection
#[test] fn variable_detector_finds_dollar_sign_vars() {}
#[test] fn variable_detector_finds_angle_bracket_placeholders() {}
#[test] fn variable_detector_finds_curly_brace_templates() {}
#[test] fn variable_detector_identifies_alert_context_sources() {}
#[test] fn variable_detector_identifies_vpn_prerequisite() {}

// Branch mapping
#[test] fn branch_mapper_detects_if_else_conditional() {}
#[test] fn branch_mapper_produces_valid_dag() {}
#[test] fn branch_mapper_handles_nested_conditionals() {}

// Ambiguity detection
#[test] fn ambiguity_highlighter_flags_vague_check_logs_step() {}
#[test] fn ambiguity_highlighter_flags_restart_service_without_name() {}
```

**Mocking strategy:** Mock the LLM gateway (`dd0c/route`) using recorded response fixtures. Use `wiremock-rs` or a trait-based mock. Never call a real LLM in unit tests.

### 3.5 Approval Workflow

**What to test:** Approval gates cannot be bypassed. Typed confirmation is enforced for 🔴 steps.
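The typed-confirmation rule can be sketched as a small pure check: a generic affirmative is rejected outright, and anything other than the exact resource name is a mismatch. The function name, the error type and the list of generic answers below are assumptions made for illustration.

```rust
// Hypothetical sketch: names and the generic-answer list are illustrative.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfirmationError {
    /// The approver typed a generic affirmative instead of the resource name.
    GenericConfirmation,
    /// The typed text does not exactly match the resource name.
    ResourceMismatch,
}

const GENERIC_ANSWERS: &[&str] = &["yes", "y", "ok", "confirm", "approve"];

/// Accepts only the exact resource name (case-sensitive, surrounding whitespace ignored).
pub fn verify_typed_confirmation(
    typed: &str,
    resource_name: &str,
) -> Result<(), ConfirmationError> {
    let typed = typed.trim();
    if GENERIC_ANSWERS.iter().any(|g| typed.eq_ignore_ascii_case(*g)) {
        return Err(ConfirmationError::GenericConfirmation);
    }
    if typed == resource_name {
        Ok(())
    } else {
        Err(ConfirmationError::ResourceMismatch)
    }
}
```

Returning distinct errors for the two failure modes lets the Slack modal explain why a confirmation was refused, and gives the audit trail something more specific to record than a plain rejection.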
```rust
// pkg/approval/tests.rs

#[test] fn approval_caution_step_requires_button_click_not_auto() {}
#[test] fn approval_dangerous_step_requires_typed_resource_name() {}
#[test] fn approval_dangerous_step_rejects_wrong_resource_name() {}
#[test] fn approval_dangerous_step_rejects_generic_yes_confirmation() {}
#[test] fn approval_cannot_bulk_approve_multiple_steps() {}
#[test] fn approval_captures_approver_slack_identity() {}
#[test] fn approval_captures_approval_timestamp() {}
#[test] fn approval_modification_logs_original_command() {}
#[test] fn approval_timeout_30min_marks_as_stalled_not_approved() {}
#[test] fn approval_skip_logs_step_as_skipped_with_actor() {}
#[test] fn approval_abort_halts_all_remaining_steps() {}
```

### 3.6 Audit Trail

**What to test:** Schema validation, immutability enforcement, event completeness.

```rust
// pkg/audit/tests.rs

#[test] fn audit_event_requires_tenant_id() {}
#[test] fn audit_event_requires_event_type() {}
#[test] fn audit_event_requires_actor_id_and_type() {}
#[test] fn audit_all_execution_event_types_are_valid_enum_values() {}
#[test] fn audit_step_executed_event_includes_command_hash_not_plaintext() {}
#[test] fn audit_step_executed_event_includes_exit_code() {}
#[test] fn audit_classification_event_includes_both_scanner_and_llm_results() {}
#[test] fn audit_classification_event_includes_merge_rule_applied() {}
#[test] fn audit_hash_chain_each_event_references_previous_hash() {}
#[test] fn audit_hash_chain_modification_breaks_chain_verification() {}
```

---

## 4. Integration Test Strategy

### 4.1 Service Boundary Tests

All integration tests use **Testcontainers** for database dependencies and **WireMock** (via `wiremock-rs`) for external HTTP services. gRPC boundaries use in-process test servers.
#### Parser ↔ LLM Gateway (dd0c/route)

```rust
// tests/integration/parser_llm_test.rs

#[tokio::test]
async fn parser_sends_normalized_text_to_llm_with_correct_schema_prompt() {
    let mock_route = MockServer::start().await;
    Mock::given(method("POST"))
        .and(path("/v1/completions"))
        .respond_with(ResponseTemplate::new(200).set_body_json(fixture_llm_response()))
        .mount(&mock_route)
        .await;

    let parser = Parser::new(mock_route.uri());
    let result = parser.parse("1. kubectl get pods -n payments").await.unwrap();

    assert_eq!(result.steps.len(), 1);
    assert_eq!(result.steps[0].risk_level, None); // Parser never classifies
}

#[tokio::test] async fn parser_retries_on_llm_timeout_up_to_3_times() { ... }
#[tokio::test] async fn parser_returns_error_when_llm_returns_invalid_json() { ... }
#[tokio::test] async fn parser_handles_llm_returning_no_actionable_steps() { ... }
```

**Runbook format parsing tests:**

```rust
#[tokio::test]
async fn parser_confluence_html_extracts_steps_correctly() {
    // Load fixture: tests/fixtures/runbooks/confluence_payment_latency.html
    let raw = include_str!("../fixtures/runbooks/confluence_payment_latency.html");
    let result = parser.parse(raw).await.unwrap();

    assert!(result.steps.len() > 0);
    assert!(result.prerequisites.len() > 0);
}

#[tokio::test] async fn parser_notion_export_markdown_extracts_steps_correctly() { ... }
#[tokio::test] async fn parser_plain_markdown_numbered_list_extracts_steps_correctly() { ... }
#[tokio::test] async fn parser_confluence_with_code_blocks_preserves_commands() { ... }
#[tokio::test] async fn parser_notion_with_callout_blocks_extracts_prerequisites() { ... }
```

#### Classifier ↔ PostgreSQL (Audit Write)

```rust
// tests/integration/classifier_audit_test.rs

#[tokio::test]
async fn classifier_writes_classification_event_to_audit_log() {
    let pg = TestcontainersPostgres::start().await;
    let classifier = Classifier::new(pg.connection_string());

    let result = classifier.classify(&step_fixture()).await.unwrap();

    // `AuditEvent` is an assumed row type; the element type was lost from the original listing.
    let events: Vec<AuditEvent> = pg.query(
        "SELECT * FROM audit_events WHERE event_type = 'runbook.classified'"
    ).await;

    assert_eq!(events.len(), 1);
    assert_eq!(events[0].event_data["final_classification"], "safe");
}

#[tokio::test]
async fn classifier_audit_record_is_immutable_no_update_permitted() {
    let pg = TestcontainersPostgres::start().await;

    // Attempt UPDATE on audit_events: must fail with a permission error
    let result = pg.execute("UPDATE audit_events SET event_type = 'tampered'").await;

    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("permission denied"));
}
```

#### Execution Engine ↔ Agent (gRPC)

```rust
// tests/integration/engine_agent_grpc_test.rs

#[tokio::test]
async fn engine_sends_execute_step_payload_with_correct_fields() {
    let mock_agent = MockAgentServer::start().await;
    let engine = ExecutionEngine::new(mock_agent.address());

    engine.execute_step(&safe_step_fixture()).await.unwrap();

    let received = mock_agent.last_received_command().await;
    assert!(received.execution_id.is_some());
    assert!(received.step_execution_id.is_some());
    assert_eq!(received.risk_level, RiskLevel::Safe);
}

#[tokio::test] async fn engine_streams_stdout_from_agent_to_slack_in_realtime() { ... }
#[tokio::test] async fn engine_handles_agent_disconnect_mid_execution() { ... }
#[tokio::test] async fn engine_rejects_duplicate_step_execution_id_from_agent() { ... }
#[tokio::test] async fn engine_validates_command_hash_before_sending_to_agent() { ... }
```

#### Approvals ↔ Slack

```rust
// tests/integration/approval_slack_test.rs

#[tokio::test] async fn approval_caution_step_posts_block_kit_message_with_approve_button() { ... }
#[tokio::test] async fn approval_dangerous_step_posts_modal_requiring_typed_confirmation() { ... }
#[tokio::test] async fn approval_button_click_captures_slack_user_id_as_approver() { ... }
#[tokio::test] async fn approval_respects_slack_rate_limit_1_message_per_second() { ... }
#[tokio::test] async fn approval_batches_rapid_output_updates_to_avoid_rate_limit() { ... }
```

### 4.2 Testcontainers Setup

```rust
// tests/common/testcontainers.rs

pub struct TestDb {
    container: ContainerAsync<Postgres>,
    pub pool: PgPool,
}

impl TestDb {
    pub async fn start() -> Self {
        let container = Postgres::default()
            .with_env_var("POSTGRES_DB", "dd0c_test")
            .start()
            .await
            .unwrap();

        let pool = PgPool::connect(&container.connection_string()).await.unwrap();

        // Run migrations
        sqlx::migrate!("./migrations").run(&pool).await.unwrap();

        // Apply RLS policies
        sqlx::query_file!("./tests/fixtures/sql/apply_rls.sql")
            .execute(&pool)
            .await
            .unwrap();

        Self { container, pool }
    }

    pub async fn with_tenant(&self, tenant_id: Uuid) -> TenantScopedDb {
        // Sets app.current_tenant_id for RLS enforcement
        TenantScopedDb::new(&self.pool, tenant_id)
    }
}
```

### 4.3 Sandboxed Execution Environment Tests

For testing actual command execution without touching real infrastructure, use Docker-in-Docker (DinD) or a minimal sandbox container.

```rust
// tests/integration/sandbox_execution_test.rs

/// Uses a minimal Alpine container as the execution target.
/// The agent connects to this container instead of real infrastructure.
#[tokio::test]
async fn sandbox_safe_command_executes_and_returns_stdout() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::connect_to(sandbox.socket_path()).await;

    let result = agent.execute("ls /tmp").await.unwrap();

    assert_eq!(result.exit_code, 0);
    assert!(!result.stdout.is_empty());
}

#[tokio::test]
async fn sandbox_agent_rejects_dangerous_command_before_execution() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::connect_to(sandbox.socket_path()).await;

    let result = agent.execute("rm -rf /").await;

    assert!(result.is_err());
    assert_eq!(result.unwrap_err(), AgentError::CommandRejectedByScanner);

    // Verify nothing was deleted
    assert!(sandbox.path_exists("/etc").await);
}

#[tokio::test]
async fn sandbox_command_timeout_kills_process_and_returns_error() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::with_timeout(Duration::from_secs(2))
        .connect_to(sandbox.socket_path())
        .await;

    let result = agent.execute("sleep 60").await;

    assert_eq!(result.unwrap_err(), AgentError::Timeout);
}

#[tokio::test]
async fn sandbox_no_shell_injection_via_command_argument() {
    let sandbox = SandboxContainer::start("alpine:3.19").await;
    let agent = TestAgent::connect_to(sandbox.socket_path()).await;

    // This should execute `echo` with the literal argument, not a shell
    let result = agent.execute("echo $(rm -rf /)").await.unwrap();

    assert_eq!(result.stdout.trim(), "$(rm -rf /)"); // Literal, not executed
    assert!(sandbox.path_exists("/etc").await);
}
```

### 4.4 Multi-Tenant RLS Integration Tests

```rust
// tests/integration/rls_test.rs

#[tokio::test]
async fn rls_tenant_a_cannot_see_tenant_b_runbooks() {
    let db = TestDb::start().await;
    let tenant_a = Uuid::new_v4();
    let tenant_b = Uuid::new_v4();

    // Insert runbook for tenant B
    db.insert_runbook(tenant_b, "Tenant B Runbook").await;

    // Query as tenant A: must return zero rows, not an error
    let db_a = db.with_tenant(tenant_a).await;
    let runbooks = db_a.query("SELECT * FROM runbooks").await.unwrap();

    assert_eq!(runbooks.len(), 0);
}

#[tokio::test] async fn rls_cross_tenant_audit_query_returns_zero_rows() { ... }
#[tokio::test] async fn rls_cross_tenant_execution_query_returns_zero_rows() { ... }
```

---

## 5. E2E & Smoke Tests

### 5.1 Critical User Journeys

E2E tests run against a full Docker Compose stack with sandboxed infrastructure. No real AWS, no real Kubernetes: all targets are containerized mocks.

**Docker Compose E2E Stack:**

```yaml
# tests/e2e/docker-compose.yml
services:
  postgres:      # Real PostgreSQL with migrations applied
  redis:         # Real Redis for panic mode key
  parser:        # Real Parser service
  classifier:    # Real Classifier service
  engine:        # Real Execution Engine
  slack-mock:    # WireMock simulating Slack API
  llm-mock:      # WireMock with recorded LLM responses
  agent:         # Real dd0c Agent binary
  sandbox-host:  # Alpine container as execution target
```

#### Journey 1: Full Happy Path (P0)

```rust
// tests/e2e/happy_path_test.rs
// NOTE: the response types `ParsePreviewResponse`, `Runbook` and `Execution` are assumed
// names; the type parameters were lost from the original listing.

#[tokio::test]
async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() {
    let stack = E2EStack::start().await;

    // Step 1: Paste runbook
    let parse_resp = stack.api()
        .post("/v1/run/runbooks/parse-preview")
        .json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN }))
        .send().await;
    assert_eq!(parse_resp.status(), 200);

    let parsed = parse_resp.json::<ParsePreviewResponse>().await;
    assert!(parsed.steps.iter().any(|s| s.risk_level == "safe"));
    assert!(parsed.steps.iter().any(|s| s.risk_level == "caution"));

    // Step 2: Save runbook
    let runbook = stack.api().post("/v1/run/runbooks")
        .json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN, "title": "E2E Test" }))
        .send().await.json::<Runbook>().await;

    // Step 3: Start execution
    let execution = stack.api().post("/v1/run/executions")
        .json(&json!({ "runbook_id": runbook.id, "agent_id": stack.agent_id() }))
        .send().await.json::<Execution>().await;

    // Step 4: Safe steps auto-execute
    stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;

    // Step 5: Approve caution step
    stack.api()
        .post(format!("/v1/run/executions/{}/steps/{}/approve", execution.id, caution_step_id))
        .send().await;

    // Step 6: Wait for completion
    let completed = stack.wait_for_execution_state(&execution.id, "completed").await;
    assert_eq!(completed.steps_executed, 4);
    assert_eq!(completed.steps_failed, 0);

    // Step 7: Verify audit trail
    let audit_events = stack.db()
        .query("SELECT event_type FROM audit_events WHERE execution_id = $1", &[&execution.id])
        .await;
    let event_types: Vec<&str> = audit_events.iter().map(|e| e.event_type.as_str()).collect();

    assert!(event_types.contains(&"execution.started"));
    assert!(event_types.contains(&"step.auto_executed"));
    assert!(event_types.contains(&"step.approved"));
    assert!(event_types.contains(&"step.executed"));
    assert!(event_types.contains(&"execution.completed"));
}
```

#### Journey 2: Destructive Command Blocked at All Levels (P0)

```rust
#[tokio::test]
async fn e2e_dangerous_command_blocked_at_copilot_trust_level() {
    let stack = E2EStack::start().await;
    let runbook = stack.create_runbook_with_dangerous_step().await;
    let execution = stack.start_execution(&runbook.id).await;

    // Engine must transition to Blocked, not AwaitApproval or AutoExecute
    let step_status = stack.wait_for_step_state(
        &execution.id, &dangerous_step_id, "blocked"
    ).await;
    assert_eq!(step_status, "blocked");

    // Verify audit event logged the block
    let events = stack.audit_events_for_execution(&execution.id).await;
    assert!(events.iter().any(|e| e.event_type == "step.blocked_by_trust_level"));
}
```

#### Journey 3: Panic Mode Halts In-Flight Execution (P0)

```rust
#[tokio::test]
async fn e2e_panic_mode_halts_in_flight_execution_within_1_second() {
    let stack = E2EStack::start().await;

    // Start a long-running execution
    let execution = stack.start_execution_with_slow_steps().await;
    stack.wait_for_execution_state(&execution.id, "running").await;

    let panic_triggered_at = Instant::now();

    // Trigger panic mode
stack.api().post("/v1/run/admin/panic").send().await; // Execution must be paused within 1 second stack.wait_for_execution_state(&execution.id, "paused").await; assert!(panic_triggered_at.elapsed() < Duration::from_secs(1)); // Verify execution is paused, not killed let exec = stack.get_execution(&execution.id).await; assert_eq!(exec.status, "paused"); assert_ne!(exec.status, "aborted"); } ``` #### Journey 4: Approval Timeout Does Not Auto-Approve (P0) ```rust #[tokio::test] async fn e2e_approval_timeout_marks_stalled_not_approved() { let stack = E2EStack::with_approval_timeout(Duration::from_secs(5)).start().await; let execution = stack.start_execution_with_caution_step().await; stack.wait_for_execution_state(&execution.id, "awaiting_approval").await; // Wait for timeout to expire — do NOT approve tokio::time::sleep(Duration::from_secs(6)).await; let exec = stack.get_execution(&execution.id).await; assert_eq!(exec.status, "stalled"); assert_ne!(exec.status, "completed"); // Must NOT have auto-approved } ``` ### 5.2 Chaos Scenarios ```rust // tests/e2e/chaos_test.rs #[tokio::test] async fn chaos_agent_disconnects_mid_execution_engine_pauses_and_alerts() { let stack = E2EStack::start().await; let execution = stack.start_long_running_execution().await; // Kill agent network mid-execution stack.disconnect_agent().await; let exec = stack.wait_for_execution_state(&execution.id, "paused").await; assert_eq!(exec.status, "paused"); // Reconnect agent — execution should be resumable stack.reconnect_agent().await; stack.resume_execution(&execution.id).await; stack.wait_for_execution_state(&execution.id, "completed").await; } #[tokio::test] async fn chaos_database_failover_engine_resumes_from_last_committed_step() { // Simulate RDS failover — engine must reconnect and resume } #[tokio::test] async fn chaos_llm_gateway_down_classification_falls_back_to_scanner_only() { // LLM unavailable — scanner-only mode, all unknowns become 🟡 } #[tokio::test] async fn 
chaos_slack_api_outage_execution_pauses_awaiting_approval_channel() {
    // Slack down: approval requests queue, no auto-approve
}

#[tokio::test]
async fn chaos_mid_execution_step_failure_triggers_rollback_flow() {
    let stack = E2EStack::start().await;
    let execution = stack.start_execution_with_failing_step().await;

    let exec = stack.wait_for_execution_state(&execution.id, "rollback_available").await;
    // Assumption: the execution record exposes the ID of the step that failed.
    let failed_step_id = exec.failed_step_id.clone();

    // Approve rollback
    stack.approve_rollback(&execution.id, &failed_step_id).await;
    stack.wait_for_execution_state(&execution.id, "completed").await;

    let events = stack.audit_events_for_execution(&execution.id).await;
    assert!(events.iter().any(|e| e.event_type == "step.rolled_back"));
}
```

---

## 6. Performance & Load Testing

### 6.1 Parser Throughput

```rust
// benches/parser_bench.rs (criterion)
use criterion::{black_box, Criterion};

fn bench_normalizer_small_runbook(c: &mut Criterion) {
    let input = include_str!("../fixtures/runbooks/small_10_steps.md");
    c.bench_function("normalizer_small", |b| {
        b.iter(|| Normalizer::new().normalize(black_box(input)))
    });
}

fn bench_normalizer_large_runbook(c: &mut Criterion) {
    // 500-step runbook, heavy HTML from Confluence
    let input = include_str!("../fixtures/runbooks/large_500_steps.html");
    c.bench_function("normalizer_large", |b| {
        b.iter(|| Normalizer::new().normalize(black_box(input)))
    });
}

fn bench_scanner_100_commands(c: &mut Criterion) {
    let commands = fixture_100_mixed_commands();
    let scanner = Scanner::new();
    c.bench_function("scanner_100_commands", |b| {
        b.iter(|| {
            for cmd in &commands {
                black_box(scanner.classify(cmd));
            }
        })
    });
}
```

**Performance targets:**
- Normalizer: < 10ms for a 500-step Confluence page
- Scanner: < 1ms per command, < 10ms for 100 commands in batch
- Full parse + classify pipeline: < 5s p95 (including LLM call)
- Classification merge: < 1ms per step

### 6.2 Concurrent Execution Stress Tests

Use `k6` or `cargo`-based load tests for concurrent execution scenarios:

```javascript
// tests/load/concurrent_executions.js (k6)
export const
options = {
  scenarios: {
    concurrent_executions: {
      executor: 'constant-vus',
      vus: 50, // 50 concurrent execution sessions
      duration: '5m',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<500'], // API responses < 500ms p95
    http_req_failed: ['rate<0.01'],   // < 1% error rate
  },
};

export default function () {
  // Start execution, poll status, approve steps, verify completion
  const execution = startExecution(FIXTURE_RUNBOOK_ID);
  waitForApprovalGate(execution.id);
  approveStep(execution.id, execution.pending_step_id);
  waitForCompletion(execution.id);
}
```

**Stress test targets:**
- 50 concurrent execution sessions: all complete without errors
- Approval workflow: < 200ms p95 latency for approval API calls
- Audit trail ingestion: handles 1000 events/second without data loss
- Scanner: handles 10,000 classifications/second (batch mode)

### 6.3 Approval Workflow Latency Under Load

```rust
// tests/load/approval_latency_test.rs

#[tokio::test]
async fn approval_workflow_p95_latency_under_100_concurrent_approvals() {
    let stack = E2EStack::start().await;
    let mut handles = vec![];

    for _ in 0..100 {
        let stack = stack.clone();
        handles.push(tokio::spawn(async move {
            let execution = stack.start_execution_with_caution_step().await;
            stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
            let start = Instant::now();
            stack.approve_step(&execution.id, &execution.pending_step_id).await;
            start.elapsed()
        }));
    }

    let latencies: Vec<Duration> = futures::future::join_all(handles)
        .await.into_iter().map(|r| r.unwrap()).collect();
    let p95 = percentile(&latencies, 95);
    assert!(p95 < Duration::from_millis(200), "p95 approval latency: {:?}", p95);
}
```

---

## 7. CI/CD Pipeline Integration

### 7.1 Test Stages

```
┌─────────────────────────────────────────────────────────────────────┐
│                         CI/CD TEST PIPELINE                         │
│                                                                     │
│  PRE-COMMIT (local, < 30s)                                          │
│  ├── cargo fmt --check                                              │
│  ├── cargo clippy -- -D warnings                                    │
│  ├── Unit tests for changed packages only                           │
│  └── Canary suite (50 known-destructive commands must stay 🔴)      │
│                                                                     │
│  PR GATE (CI, < 10 min)                                             │
│  ├── Full unit test suite (all packages)                            │
│  ├── Canary suite (mandatory: build fails if any canary is 🟢)      │
│  ├── Integration tests (Testcontainers)                             │
│  ├── Coverage check (see thresholds below)                          │
│  ├── Decision log check (PRs touching classifier/executor/parser    │
│  │   must include decision_log.json)                                │
│  └── Expired feature flag check (CI blocks if flag TTL exceeded)    │
│                                                                     │
│  MERGE TO MAIN (CI, < 20 min)                                       │
│  ├── Full unit + integration suite                                  │
│  ├── E2E smoke tests (Docker Compose stack)                         │
│  ├── Performance regression check (criterion baselines)             │
│  └── Schema migration validation                                    │
│                                                                     │
│  DEPLOY TO STAGING (post-merge, < 30 min)                           │
│  ├── E2E full suite against staging environment                     │
│  ├── Chaos scenarios (agent disconnect, DB failover)                │
│  └── Load test (50 concurrent executions, 5 min)                    │
│                                                                     │
│  DEPLOY TO PRODUCTION (manual gate after staging)                   │
│  ├── Smoke test: parse-preview endpoint responds < 5s               │
│  ├── Smoke test: agent heartbeat received                           │
│  └── Smoke test: audit trail write succeeds                         │
└─────────────────────────────────────────────────────────────────────┘
```

### 7.2 Coverage Thresholds

```toml
# .cargo/coverage.toml (enforced via cargo-tarpaulin in CI)

[thresholds]
# Safety-critical components: highest bar
"pkg/classifier/scanner" = 100     # Every pattern must be tested
"pkg/classifier/merge" = 100       # Every merge rule must be tested
"pkg/executor/state_machine" = 95  # Every state transition
"pkg/executor/trust" = 95          # Trust level enforcement
"pkg/approval" = 95                # Approval gates

# Core components
"pkg/parser" = 90
"pkg/audit" = 90
"pkg/agent/scanner" = 100          # Agent-side scanner: same as
#   SaaS-side

# Supporting components
"pkg/matcher" = 85
"pkg/slack" = 80
"pkg/api" = 80

# Overall project minimum
"overall" = 85
```

**CI enforcement:**

```yaml
# .github/workflows/ci.yml (excerpt)
- name: Check coverage thresholds
  run: |
    cargo tarpaulin --out Json --output-dir coverage/
    python scripts/check_coverage_thresholds.py coverage/tarpaulin-report.json

- name: Run canary suite (MANDATORY)
  run: cargo test --package dd0c-classifier canary_suite -- --nocapture
  # This job failing blocks ALL other jobs

- name: Check decision logs for safety-critical PRs
  run: |
    CHANGED=$(git diff --name-only origin/main)
    if echo "$CHANGED" | grep -qE "pkg/(parser|classifier|executor|approval)/"; then
      python scripts/check_decision_log.py
    fi
```

### 7.3 Test Parallelization Strategy

```yaml
# GitHub Actions matrix strategy
jobs:
  unit-tests:
    strategy:
      matrix:
        package:
          - dd0c-classifier   # Safety-critical
          - dd0c-executor     # Safety-critical
          - dd0c-parser
          - dd0c-audit
          - dd0c-approval
          - dd0c-matcher
          - dd0c-slack
          - dd0c-api
    steps:
      - run: cargo test --package ${{ matrix.package }}

  canary-suite:
    needs: []  # Runs in parallel with unit tests
    steps:
      - run: cargo test --package dd0c-classifier canary_suite

  integration-tests:
    needs: [unit-tests, canary-suite]  # Only after unit tests and canaries pass
    strategy:
      matrix:
        suite:
          - parser-llm
          - classifier-audit
          - engine-agent
          - approval-slack
          - rls-isolation
    steps:
      - run: cargo test --test ${{ matrix.suite }}

  e2e-tests:
    needs: [integration-tests]
    steps:
      - run: docker compose -f tests/e2e/docker-compose.yml up -d
      - run: cargo test --test e2e
```

**Parallelization rules:**
- Canary suite runs in parallel with unit tests, never blocked
- Integration tests only start after ALL unit tests pass
- E2E tests only start after ALL integration tests pass
- Each test package runs in its own job (parallel matrix)
- Testcontainers instances are per-test, not shared (no state leakage)

---

## 8. Transparent Factory Tenet Testing

### 8.1 Feature Flag Tests (Atomic Flagging — Story 10.1)

```rust
// pkg/flags/tests.rs

// Basic flag evaluation
#[test] fn flag_evaluates_locally_no_network_call() {}
#[test] fn flag_disabled_by_default_for_new_execution_paths() {}
#[test] fn flag_requires_owner_field_or_validation_fails() {}
#[test] fn flag_requires_ttl_field_or_validation_fails() {}
#[test] fn flag_expired_ttl_is_treated_as_disabled() {}

// Destructive flag 48-hour bake enforcement
#[test]
fn flag_destructive_true_requires_48h_bake_before_full_rollout() {
    let flag = FeatureFlag {
        name: "enable_kubectl_delete_execution",
        destructive: true,
        rollout_percentage: 100,
        bake_started_at: Some(Utc::now() - Duration::hours(24)), // Only 24h
        ..Default::default()
    };
    let result = FlagValidator::validate(&flag);
    assert!(result.is_err());
    assert!(result.unwrap_err().contains("48-hour bake required for destructive flags"));
}

#[test]
fn flag_destructive_true_at_10_percent_before_48h_is_valid() {
    let flag = FeatureFlag {
        destructive: true,
        rollout_percentage: 10,
        bake_started_at: Some(Utc::now() - Duration::hours(12)),
        ..Default::default()
    };
    assert!(FlagValidator::validate(&flag).is_ok());
}

#[test]
fn flag_destructive_true_at_100_percent_after_48h_is_valid() {
    let flag = FeatureFlag {
        destructive: true,
        rollout_percentage: 100,
        bake_started_at: Some(Utc::now() - Duration::hours(49)),
        ..Default::default()
    };
    assert!(FlagValidator::validate(&flag).is_ok());
}

// Circuit breaker
#[test]
fn circuit_breaker_triggers_after_2_failures_in_10_minute_window() {
    let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
    breaker.record_failure();
    assert_eq!(breaker.state(), CircuitState::Closed);
    breaker.record_failure();
    assert_eq!(breaker.state(), CircuitState::Open); // Trips on 2nd failure
}

#[test]
fn circuit_breaker_open_disables_flag_immediately() {
    let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
    breaker.record_failure();
    breaker.record_failure();
    let flag_store = FlagStore::with_breaker(breaker);
    assert!(!flag_store.is_enabled("enable_new_parser"));
}

#[test]
fn circuit_breaker_pauses_in_flight_executions_not_kills() {
    // Verify executions are paused (status=paused), not aborted (status=aborted)
}

#[test]
fn circuit_breaker_resets_after_window_expires() {
    let mut breaker = CircuitBreaker::new("flag", 2, Duration::minutes(10));
    breaker.record_failure();
    breaker.record_failure();
    // Advance time past window
    breaker.advance_time(Duration::minutes(11));
    assert_eq!(breaker.state(), CircuitState::Closed);
}

#[test]
fn ci_blocks_if_flag_at_100_percent_past_ttl() {
    // This test validates the CI check script logic
    let flags = vec![
        FeatureFlag {
            name: "old_flag",
            rollout_percentage: 100, // same field name as the bake tests above
            ttl: expired_ttl(),
            ..Default::default()
        }
    ];
    let result = CiValidator::check_expired_flags(&flags);
    assert!(result.is_err());
}
```

### 8.2 Schema Migration Validation Tests (Elastic Schema — Story 10.2)

```rust
// tests/migrations/schema_validation_test.rs

#[tokio::test]
async fn migration_does_not_remove_existing_columns() {
    let db = TestDb::start().await;
    let columns_before = db.get_column_names("audit_events").await;

    // Apply all pending migrations
    sqlx::migrate!("./migrations").run(&db.pool).await.unwrap();

    let columns_after = db.get_column_names("audit_events").await;

    // Every column that existed before must still exist
    for col in &columns_before {
        assert!(columns_after.contains(col),
            "Migration removed column '{}' from audit_events (FORBIDDEN)", col);
    }
}

#[tokio::test]
async fn migration_does_not_change_existing_column_types() {
    // Verify no type changes on existing columns
}

#[tokio::test]
async fn migration_does_not_rename_existing_columns() {
    // Verify column names are stable
}

#[tokio::test]
async fn audit_events_table_has_no_update_permission_for_app_role() {
    let db = TestDb::start().await;
    let result = db.as_app_role()
        .execute("UPDATE audit_events SET event_type = 'tampered'
            WHERE 1=1")
        .await;
    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("permission denied"));
}

#[tokio::test]
async fn audit_events_table_has_no_delete_permission_for_app_role() {
    let db = TestDb::start().await;
    let result = db.as_app_role()
        .execute("DELETE FROM audit_events WHERE 1=1")
        .await;
    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("permission denied"));
}

#[tokio::test]
async fn execution_log_parsers_ignore_unknown_fields() {
    // Simulate a future schema with extra fields; old parser must not fail
    let event_json = json!({
        "id": Uuid::new_v4(),
        "event_type": "step.executed",
        "unknown_future_field": "some_value", // New field old parser doesn't know
        "tenant_id": Uuid::new_v4(),
        "created_at": Utc::now(),
    });
    let result = AuditEvent::from_json(&event_json);
    assert!(result.is_ok()); // Must not fail on unknown fields
}

#[tokio::test]
async fn migration_includes_sunset_date_comment() {
    // Parse migration files and verify each has a sunset_date comment
    let migrations = read_migration_files("./migrations");
    for migration in &migrations {
        if migration.is_additive() {
            assert!(migration.content.contains("-- sunset_date:"),
                "Migration {} missing sunset_date comment", migration.name);
        }
    }
}
```

### 8.3 Decision Log Format Validation (Cognitive Durability — Story 10.3)

```rust
// tests/decisions/decision_log_test.rs

#[test]
fn decision_log_schema_requires_all_mandatory_fields() {
    let incomplete_log = json!({
        "prompt": "Why is rm -rf dangerous?",
        // Missing: reasoning, alternatives_considered, confidence, timestamp, author
    });
    let result = DecisionLog::validate(&incomplete_log);
    assert!(result.is_err());
}

#[test]
fn decision_log_confidence_must_be_between_0_and_1() {
    let log = DecisionLog { confidence: 1.5, ..valid_decision_log() };
    assert!(DecisionLog::validate(&log).is_err());
}

#[test]
fn decision_log_destructive_command_list_change_requires_reasoning() {
    // Any PR adding/removing from the destructive command
    // list must have
    // a decision log explaining why
    let change = DestructiveCommandListChange {
        command: "kubectl drain",
        action: ChangeAction::Add,
        decision_log: None, // Missing
    };
    let result = DestructiveCommandChangeValidator::validate(&change);
    assert!(result.is_err());
    assert!(result.unwrap_err().contains("decision log required for destructive command list changes"));
}

#[test]
fn all_existing_decision_logs_are_valid_json() {
    let logs = glob::glob("docs/decisions/*.json").unwrap();
    for log_path in logs {
        let content = std::fs::read_to_string(log_path.unwrap()).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(&content)
            .expect("Decision log must be valid JSON");
        assert!(DecisionLog::validate(&parsed).is_ok());
    }
}

#[test]
fn cyclomatic_complexity_cap_enforced_at_10() {
    // This is validated by the clippy lint in CI:
    // #![deny(clippy::cognitive_complexity)]
    // Test here validates the lint config is present
    let clippy_config = std::fs::read_to_string(".clippy.toml").unwrap();
    assert!(clippy_config.contains("cognitive-complexity-threshold = 10"));
}
```

### 8.4 OTEL Span Assertion Tests (Semantic Observability — Story 10.4)

```rust
// tests/observability/otel_spans_test.rs

#[tokio::test]
async fn otel_runbook_execution_creates_parent_span() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let spans = tracer.finished_spans();
    let parent = spans.iter().find(|s| s.name == "runbook_execution");
    assert!(parent.is_some(), "runbook_execution parent span must exist");
}

#[tokio::test]
async fn otel_each_step_creates_child_spans_3_levels_deep() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook_with_one_step()).await.unwrap();

    let spans = tracer.finished_spans();
    let step_classification = spans.iter().find(|s| s.name == "step_classification");
    let step_approval =
        spans.iter().find(|s| s.name == "step_approval_check");
    let step_execution = spans.iter().find(|s| s.name == "step_execution");

    assert!(step_classification.is_some());
    assert!(step_approval.is_some());
    assert!(step_execution.is_some());

    // Verify parent-child hierarchy
    let parent_id = spans.iter().find(|s| s.name == "runbook_execution").unwrap().span_id;
    assert_eq!(step_classification.unwrap().parent_span_id, Some(parent_id));
}

#[tokio::test]
async fn otel_step_classification_span_has_required_attributes() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let span = tracer.span_by_name("step_classification").unwrap();
    assert!(span.attributes.contains_key("step.text_hash"));
    assert!(span.attributes.contains_key("step.classified_as"));
    assert!(span.attributes.contains_key("step.confidence_score"));
    assert!(span.attributes.contains_key("step.alternatives_considered"));
}

#[tokio::test]
async fn otel_step_execution_span_has_required_attributes() {
    // Same tracer setup as the tests above
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let span = tracer.span_by_name("step_execution").unwrap();
    assert!(span.attributes.contains_key("step.command_hash"));
    assert!(span.attributes.contains_key("step.target_host_hash"));
    assert!(span.attributes.contains_key("step.exit_code"));
    assert!(span.attributes.contains_key("step.duration_ms"));
    // Must NOT contain raw command or PII
    assert!(!span.attributes.contains_key("step.command_raw"));
    assert!(!span.attributes.contains_key("step.stdout_raw"));
}

#[tokio::test]
async fn otel_step_approval_span_has_required_attributes() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let span = tracer.span_by_name("step_approval_check").unwrap();
    assert!(span.attributes.contains_key("step.approval_required"));
    assert!(span.attributes.contains_key("step.approval_source"));
    assert!(span.attributes.contains_key("step.approval_latency_ms"));
}

#[tokio::test]
async fn otel_no_pii_in_any_span_attributes() {
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    // Scan all span attributes for patterns that look like PII or raw commands
    let spans = tracer.finished_spans();
    for span in &spans {
        for (key, value) in &span.attributes {
            assert!(!key.ends_with("_raw"), "Raw data in span attribute: {}", key);
            assert!(!looks_like_email(value), "PII in span: {} = {}", key, value);
        }
    }
}

#[tokio::test]
async fn otel_ai_classification_span_includes_model_metadata() {
    // Same tracer setup as the tests above
    let tracer = InMemoryTracer::new();
    let engine = ExecutionEngine::with_tracer(tracer.clone());
    engine.execute_runbook(&fixture_runbook()).await.unwrap();

    let span = tracer.span_by_name("step_classification").unwrap();
    assert!(span.attributes.contains_key("ai.prompt_hash"));
    assert!(span.attributes.contains_key("ai.model_version"));
    assert!(span.attributes.contains_key("ai.reasoning_chain"));
}
```

### 8.5 Governance Policy Enforcement Tests (Configurable Autonomy — Story 10.5)

```rust
// pkg/governance/tests.rs

// Strict mode: all steps require approval
#[test]
fn governance_strict_mode_requires_approval_for_safe_steps() {
    let policy = Policy { governance_mode: GovernanceMode::Strict, ..Default::default() };
    let engine = ExecutionEngine::with_policy(policy);
    let next = engine.next_state(&safe_step(), TrustLevel::Copilot);
    assert_eq!(next, State::AwaitApproval); // Even safe steps need approval in strict mode
}

// Audit mode: safe steps auto-execute, destructive always require approval
#[test]
fn governance_audit_mode_auto_executes_safe_steps() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
    let engine = ExecutionEngine::with_policy(policy);
    assert_eq!(engine.next_state(&safe_step(), TrustLevel::Copilot), State::AutoExecute);
}

#[test]
fn governance_audit_mode_still_requires_approval_for_dangerous_steps() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
    let engine = ExecutionEngine::with_policy(policy);
    assert_eq!(engine.next_state(&dangerous_step(), TrustLevel::Copilot), State::AwaitApproval);
}

#[test]
fn governance_no_fully_autonomous_mode_exists() {
    // There is no GovernanceMode::FullAuto variant
    // This test verifies the enum only has Strict and Audit
    let modes = GovernanceMode::all_variants();
    // Naming a non-existent FullAuto variant would not compile, so assert
    // on the exact set of variants instead.
    assert_eq!(modes.len(), 2);
    assert!(modes.contains(&GovernanceMode::Strict));
    assert!(modes.contains(&GovernanceMode::Audit));
}

// Panic mode
#[tokio::test]
async fn governance_panic_mode_halts_all_executions_within_1_second() {
    // Verified in E2E section; unit test verifies Redis key check
    let redis = MockRedis::new();
    let engine = ExecutionEngine::with_redis(redis.clone());

    redis.set("dd0c:panic", "1").await;

    let result = engine.check_panic_mode().await;
    assert_eq!(result, PanicModeStatus::Active);
}

#[tokio::test]
async fn governance_panic_mode_uses_redis_not_database() {
    // Panic mode must NOT do a DB query; Redis only for <1s requirement
    let db = TrackingDb::new();
    let redis = MockRedis::new();
    redis.set("dd0c:panic", "1").await;

    let engine = ExecutionEngine::with_db(db.clone()).with_redis(redis);
    engine.check_panic_mode().await;

    assert_eq!(db.query_count(), 0, "Panic mode check must not query the database");
}

// tokio::time::advance panics unless the clock is paused (tokio "test-util" feature)
#[tokio::test(start_paused = true)]
async fn governance_panic_mode_requires_manual_clearance() {
    // Panic mode cannot be auto-cleared; only manual API call
    let engine = ExecutionEngine::new();
    engine.trigger_panic_mode().await;

    // Simulate 24 hours passing (advance takes a std::time::Duration)
    tokio::time::advance(std::time::Duration::from_secs(24 * 60 * 60)).await;

    // Must still be in panic mode
    assert_eq!(engine.check_panic_mode().await, PanicModeStatus::Active);
}

// Governance drift monitoring
#[tokio::test]
async fn governance_drift_auto_downgrades_when_auto_execution_exceeds_threshold() {
    let db = TestDb::start().await;
    // Insert execution history: 80% auto-executed (threshold is 70%)
    insert_execution_history_with_auto_ratio(&db, 0.80).await;

    let drift_monitor = GovernanceDriftMonitor::new(&db.pool);
    let result = drift_monitor.check().await;

    assert_eq!(result.action, DriftAction::DowngradeToStrict);
}

// Per-runbook governance override
#[test]
fn governance_runbook_locked_to_strict_ignores_system_audit_mode() {
    let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
    let runbook = Runbook {
        governance_override: Some(GovernanceMode::Strict),
        ..Default::default()
    };
    let engine = ExecutionEngine::with_policy(policy);
    // Even in audit mode, this runbook requires approval for safe steps
    assert_eq!(engine.next_state_for_runbook(&safe_step(), &runbook), State::AwaitApproval);
}
```

---

## 9. Test Data & Fixtures

### 9.1 Runbook Format Factories

```rust
// tests/fixtures/runbooks/mod.rs

pub fn fixture_runbook_markdown() -> &'static str {
    include_str!("markdown_basic.md")
}

pub fn fixture_runbook_confluence_html() -> &'static str {
    include_str!("confluence_export.html")
}

pub fn fixture_runbook_notion_export() -> &'static str {
    include_str!("notion_export.md")
}

pub fn fixture_runbook_with_ambiguities() -> &'static str {
    include_str!("ambiguous_steps.md")
}

pub fn fixture_runbook_with_variables() -> &'static str {
    include_str!("variables_and_placeholders.md")
}
```

### 9.2 Step Classification Fixtures

```rust
// tests/fixtures/commands/mod.rs

pub fn fixture_safe_commands() -> Vec<&'static str> {
    vec![
        "kubectl get pods -n kube-system",
        "aws ec2 describe-instances --region us-east-1",
        "cat /var/log/syslog | grep error",
        "SELECT count(*) FROM users",
    ]
}

pub fn fixture_caution_commands() -> Vec<&'static str> {
    vec![
        "kubectl rollout restart deployment/api",
        "systemctl restart nginx",
        "aws ec2 stop-instances --instance-ids i-1234567890abcdef0",
        "UPDATE users SET status = 'active' WHERE id = 123",
    ]
}

pub fn fixture_destructive_commands() -> Vec<&'static str> {
    vec![
        "kubectl delete namespace prod",
        "rm -rf /var/lib/postgresql/data",
        "DROP TABLE customers",
        "aws rds delete-db-instance --db-instance-identifier prod-db",
        "terraform destroy -auto-approve",
    ]
}

pub fn fixture_ambiguous_commands() -> Vec<&'static str> {
    vec![
        "restart the service",
        "./cleanup.sh",
        "python script.py",
        "curl -X POST http://internal-api/reset",
    ]
}
```

### 9.3 Infrastructure Target Mocks

We mock infrastructure targets for execution tests using isolated containers or HTTP mock servers.
```rust
// tests/fixtures/infra/mod.rs

/// Spawns a lightweight k3s container for testing kubectl commands safely
pub async fn mock_k8s_cluster() -> K3sContainer {
    K3sContainer::start().await
}

/// Spawns LocalStack for testing AWS CLI commands
pub async fn mock_aws_env() -> LocalStackContainer {
    LocalStackContainer::start().await
}

/// Spawns a bare Alpine container with SSH access
pub async fn mock_bare_metal() -> SshContainer {
    SshContainer::start("alpine:latest").await
}
```

### 9.4 Approval Workflow Scenario Fixtures

```rust
// tests/fixtures/approvals/mod.rs
use serde_json::json;

pub fn fixture_slack_approval_payload(step_id: &str, user_id: &str) -> serde_json::Value {
    json!({
        "type": "block_actions",
        "user": { "id": user_id, "username": "riley.oncall" },
        "actions": [{ "action_id": "approve_step", "value": step_id }]
    })
}

pub fn fixture_slack_typed_confirmation_payload(step_id: &str, resource_name: &str) -> serde_json::Value {
    json!({
        "type": "view_submission",
        "user": { "id": "U123456" },
        "view": {
            "state": {
                "values": {
                    "confirm_block": { "resource_input": { "value": resource_name } }
                }
            },
            "private_metadata": step_id
        }
    })
}
```

---

## 10. TDD Implementation Order

To maintain the safety-first invariant, components must be built and tested in a specific order. Execution code cannot be written until the safety constraints are proven.

### 10.1 Bootstrap Sequence (Test Infrastructure First)

1. **Testcontainers Setup**: Establish `TestDb` with migrations and RLS policies. Prove cross-tenant isolation fails closed.
2. **OTEL Test Tracer**: Implement `InMemoryTracer` to assert span creation and attributes.
3. **Canary Suite Harness**: Create the `canary_suite` test target that runs a hardcoded list of destructive commands and fails if any return 🟢.

### 10.2 Epic Dependency Order

| Phase | Component | TDD Rule | Rationale |
|-------|-----------|----------|-----------|
| **1** | **Deterministic Safety Scanner** | Unit tests FIRST | Foundation of safety. Exhaustive pattern tests must exist before any parser or execution logic. |
| **2** | **Merge Engine** | Unit tests FIRST | Hardcoded rules. Prove 🔴 overrides 🟢 before integrating LLMs. |
| **3** | **Execution Engine State Machine** | Unit tests FIRST | **CRITICAL:** Prove trust level boundaries and approval gates block 🔴/🟡 steps *before* writing any code that actually executes commands. |
| **4** | **Agent-Side Scanner** | Unit tests FIRST | Port SaaS scanner logic to Agent binary. Prove the Agent rejects `rm -rf` independently. |
| **5** | **Agent gRPC & Command Execution** | Integration tests FIRST | Use sandbox containers. Prove timeout kills processes and shell injection fails. |
| **6** | **Runbook Parser** | Integration tests lead | Use LLM fixtures. The parser is safe because the classifier catches its mistakes. |
| **7** | **Audit Trail** | Unit tests FIRST | Prove schema immutability and hash chaining. |
| **8** | **Slack Bot & API** | Integration tests lead | UI and routing. |

### 10.3 The Execution Engine Testing Mandate

> **Execution engine tests MUST be written before any execution code.**

Before writing the `impl ExecutionEngine { pub async fn execute(...) }` function, the following tests must exist and fail (Red phase):

1. `engine_dangerous_step_blocked_at_copilot_level_v1`
2. `engine_caution_step_requires_approval_at_copilot_level`
3. `engine_safe_step_blocked_at_read_only_trust_level`
4. `engine_duplicate_step_execution_id_is_rejected`
5. `engine_pauses_in_flight_execution_when_panic_mode_set`

Only once these tests are defined can the state machine be implemented to make them pass (Green phase). This ensures no execution path can bypass the Trust Gradient.
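
As a minimal sketch of how the first three mandated tests could pin down the state machine before any execution code exists, the block below models risk class, trust level, and next state as plain enums. Every name here (`Risk`, `TrustLevel`, `State`, `next_state`) is hypothetical and chosen for illustration. The real engine's types, and how it layers governance modes on top, may differ.

```rust
// Hypothetical sketch: these types are illustrative and are not taken from
// the dd0c/run codebase.

/// Risk class assigned to a step by the classifier (🟢 / 🟡 / 🔴).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Risk {
    Safe,
    Caution,
    Dangerous,
}

/// Trust level the execution runs under (assumed two-level gradient).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustLevel {
    ReadOnly,
    Copilot,
}

/// What the engine is allowed to do next with a step.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    AutoExecute,
    AwaitApproval,
    Blocked,
}

/// Pure transition function: no I/O, so it can be tested exhaustively.
pub fn next_state(risk: Risk, trust: TrustLevel) -> State {
    match (trust, risk) {
        // Read-only trust never executes anything, not even safe steps.
        (TrustLevel::ReadOnly, _) => State::Blocked,
        // Copilot trust: dangerous steps are blocked outright.
        (TrustLevel::Copilot, Risk::Dangerous) => State::Blocked,
        // Copilot trust: caution steps wait for a human.
        (TrustLevel::Copilot, Risk::Caution) => State::AwaitApproval,
        // Copilot trust: safe steps may run automatically.
        (TrustLevel::Copilot, Risk::Safe) => State::AutoExecute,
    }
}

// Red-phase tests, named after items 1-3 of the mandate list.
#[test]
fn engine_dangerous_step_blocked_at_copilot_level_v1() {
    assert_eq!(next_state(Risk::Dangerous, TrustLevel::Copilot), State::Blocked);
}

#[test]
fn engine_caution_step_requires_approval_at_copilot_level() {
    assert_eq!(next_state(Risk::Caution, TrustLevel::Copilot), State::AwaitApproval);
}

#[test]
fn engine_safe_step_blocked_at_read_only_trust_level() {
    assert_eq!(next_state(Risk::Safe, TrustLevel::ReadOnly), State::Blocked);
}
```

Because `next_state` is a pure function over an exhaustive `match`, adding a new trust level or risk class fails to compile until its transition has been decided, which fits the tests-first rule for the engine.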