dd0c/run — Test Architecture & TDD Strategy
Version: 1.0 | Date: 2026-02-28 | Status: Draft
1. Testing Philosophy & TDD Workflow
1.1 Core Principle: Safety-First Testing
dd0c/run executes commands in production infrastructure. A misclassification that allows a destructive command to auto-execute is an existential risk. This shapes every testing decision:
If it touches execution, tests come first. No exceptions.
The TDD discipline here is not a process preference — it is a safety mechanism. Writing tests first for the Action Classifier and Execution Engine forces the developer to enumerate failure modes before writing code that could cause them.
1.2 Red-Green-Refactor Adapted for dd0c/run
The standard Red-Green-Refactor cycle applies with one critical modification: for any component that classifies or executes commands, the "Red" phase must include at least one test for a known-destructive command being correctly blocked.
Standard TDD:          dd0c/run TDD:

Red                    Red (write failing test)
  ↓                      ↓
Green                  Red-Safety (add: "rm -rf / must be 🔴")
  ↓                      ↓
Refactor               Green (make ALL tests pass)
                         ↓
                       Refactor
                         ↓
                       Canary-Check (run canary suite — must stay green)
The Canary Suite is a mandatory gate: a curated set of ~50 known-destructive commands that must always be classified as 🔴. It runs after every refactor. If any canary command is classified as anything other than 🔴, the build fails.
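As a sketch, the gate reduces to a single assertion loop over the curated list. The `Scanner`, `classify`, and `RiskLevel` names below are illustrative stand-ins, not the real classifier API:

```rust
// Illustrative sketch only: `Scanner`, `classify`, and `RiskLevel` are
// hypothetical stand-ins for the real classifier API.
#[derive(Debug, PartialEq, Clone, Copy)]
#[allow(dead_code)]
enum RiskLevel {
    Safe,
    Caution,
    Dangerous,
    Unknown,
}

struct Scanner;

impl Scanner {
    /// Stub classifier: the real scanner uses pattern and AST analysis.
    fn classify(&self, cmd: &str) -> RiskLevel {
        const DESTRUCTIVE: [&str; 3] = ["rm -rf /", "terraform destroy", "DROP TABLE"];
        if DESTRUCTIVE.iter().any(|p| cmd.contains(p)) {
            RiskLevel::Dangerous
        } else {
            RiskLevel::Unknown
        }
    }
}

/// The canary gate: every curated destructive command must come back 🔴.
/// Any other verdict fails the whole suite (and therefore the build).
fn canary_gate(scanner: &Scanner, canaries: &[&str]) -> Result<(), String> {
    for cmd in canaries {
        let risk = scanner.classify(cmd);
        if risk != RiskLevel::Dangerous {
            return Err(format!("canary failed: {cmd:?} classified as {risk:?}"));
        }
    }
    Ok(())
}

fn main() {
    let canaries = ["rm -rf /", "terraform destroy", "DROP TABLE users;"];
    assert!(canary_gate(&Scanner, &canaries).is_ok());
    // A command the scanner does not flag as 🔴 must fail the gate.
    assert!(canary_gate(&Scanner, &["kubectl get pods"]).is_err());
}
```

Wiring this loop into CI as a separate job (rather than folding it into the regular unit suite) keeps the gate visible: a canary failure reads as a safety regression, not a flaky test.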
1.3 When Unit Tests Come First vs. When Integration Tests Lead
| Scenario | Approach | Rationale |
|---|---|---|
| Deterministic Safety Scanner | Unit tests FIRST, always | Pure function. No I/O. Exhaustive coverage of destructive patterns is mandatory before any code. |
| Classification Merge Rules | Unit tests FIRST, always | Hardcoded logic. Tests define the spec. |
| Execution Engine state machine | Unit tests FIRST, always | State transitions are safety-critical. Tests enumerate all valid/invalid transitions. |
| Approval workflow | Unit tests FIRST | Approval bypass is a threat vector. Tests must prove it's impossible. |
| Runbook Parser (LLM extraction) | Integration tests lead | LLM behavior is non-deterministic. Integration tests with recorded fixtures define expected behavior. |
| Slack Bot UI flows | Integration tests lead | Slack API interactions are I/O-heavy. Mock Slack API, test message shapes. |
| Alert-Runbook Matcher | Integration tests lead | Matching logic depends on DB state. Testcontainers + fixture data. |
| Audit Trail ingestion | Unit tests first for schema, integration for pipeline | Schema is deterministic; pipeline has I/O. |
1.4 Test Naming Conventions
All tests follow the pattern: <component>_<scenario>_<expected_outcome>
// Unit tests
#[test]
fn scanner_rm_rf_root_classifies_as_dangerous() { ... }
#[test]
fn scanner_kubectl_get_pods_classifies_as_safe() { ... }
#[test]
fn merge_engine_scanner_dangerous_overrides_llm_safe() { ... }
#[test]
fn state_machine_caution_step_transitions_to_await_approval() { ... }
// Integration tests
#[tokio::test]
async fn parser_confluence_html_extracts_ordered_steps() { ... }
#[tokio::test]
async fn execution_engine_approval_timeout_does_not_auto_approve() { ... }
// E2E tests
#[tokio::test]
async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() { ... }
Prohibited naming patterns:
- test_thing() — too vague
- it_works() — meaningless
- test1(), test2() — no context
1.5 Safety-First Rule: Destructive Command Tests Are Mandatory Pre-Code
Hard Rule: No code may be written for the Action Classifier, Execution Engine, or Agent-side Scanner without first writing tests that prove destructive commands are blocked.
This is enforced via CI: the pkg/classifier/ and pkg/executor/ directories require a minimum of 95% test coverage before any PR can merge. The canary test suite must pass on every commit touching these packages.
2. Test Pyramid
2.1 Recommended Ratio
        /\
       /E2E\          ~10% — Critical user journeys, chaos scenarios
      /──────\
     / Integ  \       ~20% — Service boundaries, DB, Slack, gRPC
    /──────────\
   /    Unit    \     ~70% — Pure logic, state machines, classifiers
  /______________\
For the Action Classifier and Execution Engine specifically, the ratio shifts:
Unit: 80% (exhaustive pattern coverage, state machine transitions)
Integration: 15% (scanner ↔ classifier merge, engine ↔ agent gRPC)
E2E: 5% (full execution journeys with sandboxed infra)
2.2 Unit Test Targets (per component)
| Component | Unit Test Focus | Coverage Target |
|---|---|---|
| Deterministic Safety Scanner | Pattern matching, AST parsing, all risk categories | 100% |
| Classification Merge Engine | All 5 merge rules + edge cases | 100% |
| Execution Engine state machine | All state transitions, trust level enforcement | 95% |
| Runbook Parser (normalizer) | HTML stripping, markdown normalization, whitespace | 90% |
| Variable Detector | Placeholder regex patterns, env ref detection | 90% |
| Branch Mapper | DAG construction, if/else detection | 85% |
| Approval Workflow | Approval gates, typed confirmation, timeout behavior | 95% |
| Audit Trail schema | Event type validation, immutability constraints | 90% |
| Alert-Runbook Matcher | Keyword matching, similarity scoring | 85% |
| Trust Level Enforcement | Level checks per risk level, auto-downgrade | 95% |
| Panic Mode | Trigger conditions, halt sequence, Redis key check | 95% |
| Feature Flag Circuit Breaker | 2-failure threshold, 48h bake enforcement | 95% |
2.3 Integration Test Boundaries
| Boundary | Test Type | Infrastructure |
|---|---|---|
| Parser ↔ LLM Gateway (dd0c/route) | Contract test with recorded responses | WireMock / recorded fixtures |
| Classifier ↔ PostgreSQL (audit write) | Integration test | Testcontainers (PostgreSQL) |
| Execution Engine ↔ Agent (gRPC) | Integration test | In-process gRPC server mock |
| Execution Engine ↔ Slack Bot | Integration test | Slack API mock |
| Approval Workflow ↔ Slack | Integration test | Slack API mock |
| Audit Trail ↔ PostgreSQL | Integration test | Testcontainers (PostgreSQL) |
| Alert Matcher ↔ PostgreSQL + pgvector | Integration test | Testcontainers (PostgreSQL + pgvector) |
| Webhook Receiver ↔ PagerDuty/OpsGenie | Contract test | Recorded webhook payloads |
| RLS enforcement | Integration test | Testcontainers (PostgreSQL with RLS enabled) |
2.4 E2E / Smoke Test Scenarios
| Scenario | Priority | Infrastructure |
|---|---|---|
| Full journey: paste → parse → classify → approve → execute → audit | P0 | Docker Compose sandbox |
| Destructive command blocked at all trust levels | P0 | Docker Compose sandbox |
| Panic mode triggered and halts in-flight execution | P0 | Docker Compose sandbox |
| Approval timeout does not auto-approve | P0 | Docker Compose sandbox |
| Cross-tenant data isolation (RLS) | P0 | Testcontainers |
| Agent reconnect after network partition | P1 | Docker Compose sandbox |
| Mid-execution failure triggers rollback flow | P1 | Docker Compose sandbox |
| Feature flag circuit breaker halts execution after 2 failures | P1 | Docker Compose sandbox |
3. Unit Test Strategy (Per Component)
3.1 Deterministic Safety Scanner
What to test: Every pattern category. Every edge case. This component has 100% coverage as a hard requirement.
Key test cases:
// pkg/classifier/scanner/tests.rs
// ── BLOCKLIST (🔴 Dangerous) ──────────────────────────────────────────
#[test] fn scanner_kubectl_delete_namespace_is_dangerous() {}
#[test] fn scanner_kubectl_delete_deployment_is_dangerous() {}
#[test] fn scanner_kubectl_delete_pvc_is_dangerous() {}
#[test] fn scanner_kubectl_delete_all_is_dangerous() {}
#[test] fn scanner_drop_table_is_dangerous() {}
#[test] fn scanner_drop_database_is_dangerous() {}
#[test] fn scanner_truncate_table_is_dangerous() {}
#[test] fn scanner_delete_without_where_is_dangerous() {}
#[test] fn scanner_rm_rf_is_dangerous() {}
#[test] fn scanner_rm_rf_root_is_dangerous() {}
#[test] fn scanner_rm_rf_slash_is_dangerous() {}
#[test] fn scanner_aws_ec2_terminate_instances_is_dangerous() {}
#[test] fn scanner_aws_rds_delete_db_instance_is_dangerous() {}
#[test] fn scanner_terraform_destroy_is_dangerous() {}
#[test] fn scanner_dd_if_dev_zero_is_dangerous() {}
#[test] fn scanner_mkfs_is_dangerous() {}
#[test] fn scanner_sudo_rm_is_dangerous() {}
#[test] fn scanner_chmod_777_recursive_is_dangerous() {}
#[test] fn scanner_kubectl_create_clusterrolebinding_is_dangerous() {}
#[test] fn scanner_aws_iam_create_user_is_dangerous() {}
#[test] fn scanner_pipe_to_xargs_rm_is_dangerous() {}
#[test] fn scanner_delete_with_where_but_no_condition_value_is_dangerous() {}
// ── CAUTION LIST (🟡) ────────────────────────────────────────────────
#[test] fn scanner_kubectl_rollout_restart_is_caution() {}
#[test] fn scanner_kubectl_scale_is_caution() {}
#[test] fn scanner_aws_ec2_stop_instances_is_caution() {}
#[test] fn scanner_aws_ec2_start_instances_is_caution() {}
#[test] fn scanner_systemctl_restart_is_caution() {}
#[test] fn scanner_update_with_where_clause_is_caution() {}
#[test] fn scanner_insert_into_is_caution() {}
#[test] fn scanner_docker_restart_is_caution() {}
#[test] fn scanner_aws_autoscaling_set_desired_capacity_is_caution() {}
// ── ALLOWLIST (🟢 Safe) ──────────────────────────────────────────────
#[test] fn scanner_kubectl_get_pods_is_safe() {}
#[test] fn scanner_kubectl_describe_deployment_is_safe() {}
#[test] fn scanner_kubectl_logs_is_safe() {}
#[test] fn scanner_aws_ec2_describe_instances_is_safe() {}
#[test] fn scanner_aws_s3_ls_is_safe() {}
#[test] fn scanner_select_query_is_safe() {}
#[test] fn scanner_explain_query_is_safe() {}
#[test] fn scanner_curl_get_is_safe() {}
#[test] fn scanner_cat_file_is_safe() {}
#[test] fn scanner_grep_is_safe() {}
#[test] fn scanner_tail_f_is_safe() {}
#[test] fn scanner_docker_ps_is_safe() {}
#[test] fn scanner_terraform_plan_is_safe() {}
#[test] fn scanner_dig_is_safe() {}
#[test] fn scanner_nslookup_is_safe() {}
// ── UNKNOWN / EDGE CASES ─────────────────────────────────────────────
#[test] fn scanner_unknown_command_defaults_to_unknown_not_safe() {}
#[test] fn scanner_empty_command_defaults_to_unknown() {}
#[test] fn scanner_custom_script_path_defaults_to_unknown() {}
#[test] fn scanner_select_into_is_dangerous_not_safe() {} // SELECT INTO is a write
#[test] fn scanner_delete_with_where_is_caution_not_dangerous() {}
#[test] fn scanner_curl_post_is_caution_not_safe() {} // POST has side effects
#[test] fn scanner_pipe_chain_with_destructive_segment_is_dangerous() {}
#[test] fn scanner_command_substitution_with_rm_is_dangerous() {}
#[test] fn scanner_multiline_command_with_destructive_line_is_dangerous() {}
// ── AST PARSING (tree-sitter) ────────────────────────────────────────
#[test] fn scanner_sql_ast_delete_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_update_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_drop_statement_is_dangerous() {}
#[test] fn scanner_shell_ast_piped_rm_is_dangerous() {}
#[test] fn scanner_shell_ast_subshell_with_destructive_is_dangerous() {}
// ── PERFORMANCE ──────────────────────────────────────────────────────
#[test] fn scanner_classifies_in_under_1ms() {}
#[test] fn scanner_classifies_100_commands_in_under_10ms() {}
Mocking strategy: None. The scanner is a pure function with no I/O. All tests are synchronous, no mocks needed.
Language-specific patterns (Rust):
- Use #[test] for synchronous unit tests
- Use the criterion crate for performance benchmarks
- Compile regex sets at test time using lazy_static! or once_cell
- Use rstest for parameterized test cases across command variants
// Parameterized test example using rstest
#[rstest]
#[case("kubectl delete namespace production", RiskLevel::Dangerous)]
#[case("kubectl delete deployment payment-svc", RiskLevel::Dangerous)]
#[case("kubectl delete pod payment-abc123", RiskLevel::Dangerous)]
#[case("kubectl delete all --all", RiskLevel::Dangerous)]
fn scanner_kubectl_delete_variants_are_dangerous(
#[case] command: &str,
#[case] expected: RiskLevel,
) {
let scanner = Scanner::new();
assert_eq!(scanner.classify(command).risk, expected);
}
3.2 Classification Merge Engine
What to test: All 5 merge rules, including every combination of scanner/LLM results.
// pkg/classifier/merge/tests.rs
// Rule 1: Scanner 🔴 → always 🔴
#[test] fn merge_scanner_dangerous_llm_safe_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_caution_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_dangerous_yields_dangerous() {}
// Rule 2: Scanner 🟡, LLM 🟢 → 🟡 (scanner wins)
#[test] fn merge_scanner_caution_llm_safe_yields_caution() {}
// Rule 3: Both 🟢 → 🟢 (only path to safe)
#[test] fn merge_scanner_safe_llm_safe_yields_safe() {}
// Rule 4: Scanner Unknown → 🟡 minimum
#[test] fn merge_scanner_unknown_llm_safe_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_caution_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_dangerous_yields_dangerous() {}
// Rule 5: LLM confidence < 0.9 → escalate one level
#[test] fn merge_low_confidence_safe_escalates_to_caution() {}
#[test] fn merge_low_confidence_caution_escalates_to_dangerous() {}
#[test] fn merge_high_confidence_safe_does_not_escalate() {}
// LLM escalation overrides scanner
#[test] fn merge_scanner_safe_llm_caution_yields_caution() {}
#[test] fn merge_scanner_safe_llm_dangerous_yields_dangerous() {}
#[test] fn merge_scanner_caution_llm_dangerous_yields_dangerous() {}
// Merge rule is logged
#[test] fn merge_result_includes_applied_rule_identifier() {}
#[test] fn merge_result_includes_both_scanner_and_llm_inputs() {}
Mocking strategy: LLM results are passed as plain structs — no mocking needed. The merge engine is a pure function.
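The five rules above can be sketched as one pure function. The types and the 0.9 confidence threshold mirror the test names; the real engine's signatures may differ:

```rust
// Sketch of the merge rules; `Risk` and `ScannerVerdict` are illustrative.
// Variant order gives Safe < Caution < Dangerous via derived PartialOrd.
#[derive(Debug, PartialEq, PartialOrd, Clone, Copy)]
enum Risk {
    Safe,
    Caution,
    Dangerous,
}

#[derive(Debug, Clone, Copy)]
enum ScannerVerdict {
    Known(Risk),
    Unknown,
}

fn escalate(r: Risk) -> Risk {
    match r {
        Risk::Safe => Risk::Caution,
        Risk::Caution | Risk::Dangerous => Risk::Dangerous,
    }
}

fn merge(scanner: ScannerVerdict, llm: Risk, llm_confidence: f64) -> Risk {
    // Rule 5: a low-confidence LLM verdict escalates one level before merging.
    let llm = if llm_confidence < 0.9 { escalate(llm) } else { llm };
    // Rules 1, 2, 4: the scanner sets a floor; Unknown floors at 🟡 Caution.
    let floor = match scanner {
        ScannerVerdict::Known(r) => r,
        ScannerVerdict::Unknown => Risk::Caution,
    };
    // Rule 3 falls out naturally: 🟢 is only reachable when both sides say 🟢.
    // The LLM may escalate beyond the floor but can never relax it.
    if llm > floor { llm } else { floor }
}

fn main() {
    assert_eq!(merge(ScannerVerdict::Known(Risk::Dangerous), Risk::Safe, 1.0), Risk::Dangerous); // Rule 1
    assert_eq!(merge(ScannerVerdict::Known(Risk::Caution), Risk::Safe, 1.0), Risk::Caution);     // Rule 2
    assert_eq!(merge(ScannerVerdict::Known(Risk::Safe), Risk::Safe, 1.0), Risk::Safe);           // Rule 3
    assert_eq!(merge(ScannerVerdict::Unknown, Risk::Safe, 1.0), Risk::Caution);                  // Rule 4
    assert_eq!(merge(ScannerVerdict::Known(Risk::Safe), Risk::Safe, 0.8), Risk::Caution);        // Rule 5
}
```

Keeping the merge as a total function over plain values is what makes the exhaustive rule-combination tests above cheap to write.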
3.3 Execution Engine State Machine
What to test: Every valid state transition, every invalid transition (must be rejected), trust level enforcement at each transition.
// pkg/executor/state_machine/tests.rs
// Valid transitions
#[test] fn engine_pending_to_preflight_on_start() {}
#[test] fn engine_preflight_to_step_ready_on_prerequisites_met() {}
#[test] fn engine_step_ready_to_auto_execute_for_safe_step_at_copilot_level() {}
#[test] fn engine_step_ready_to_await_approval_for_caution_step() {}
#[test] fn engine_step_ready_to_blocked_for_dangerous_step_at_copilot_level() {}
#[test] fn engine_await_approval_to_executing_on_human_approval() {}
#[test] fn engine_await_approval_to_skipped_on_human_skip() {}
#[test] fn engine_executing_to_step_complete_on_success() {}
#[test] fn engine_executing_to_step_failed_on_error() {}
#[test] fn engine_executing_to_timed_out_on_timeout() {}
#[test] fn engine_step_complete_to_step_ready_when_more_steps() {}
#[test] fn engine_step_complete_to_runbook_complete_when_last_step() {}
#[test] fn engine_step_failed_to_rollback_available_when_rollback_exists() {}
#[test] fn engine_step_failed_to_manual_intervention_when_no_rollback() {}
#[test] fn engine_rollback_available_to_rolling_back_on_approval() {}
#[test] fn engine_rolling_back_to_step_ready_on_rollback_success() {}
#[test] fn engine_rolling_back_to_manual_intervention_on_rollback_failure() {}
#[test] fn engine_runbook_complete_to_divergence_analysis() {}
// Invalid transitions (must be rejected)
#[test] fn engine_cannot_skip_preflight_state() {}
#[test] fn engine_cannot_auto_execute_caution_step_at_copilot_level() {}
#[test] fn engine_cannot_auto_execute_dangerous_step_at_any_v1_level() {}
#[test] fn engine_cannot_transition_from_completed_to_executing() {}
// Trust level enforcement
#[test] fn engine_safe_step_blocked_at_read_only_trust_level() {}
#[test] fn engine_caution_step_requires_approval_at_copilot_level() {}
#[test] fn engine_dangerous_step_blocked_at_copilot_level_v1() {}
#[test] fn engine_trust_level_checked_per_step_not_per_runbook() {}
// Timeout behavior
#[test] fn engine_safe_step_times_out_after_60_seconds() {}
#[test] fn engine_caution_step_times_out_after_120_seconds() {}
#[test] fn engine_dangerous_step_times_out_after_300_seconds() {}
#[test] fn engine_approval_timeout_does_not_auto_approve() {}
#[test] fn engine_approval_timeout_marks_execution_as_stalled() {}
// Idempotency
#[test] fn engine_duplicate_step_execution_id_is_rejected() {}
#[test] fn engine_duplicate_approval_for_same_step_is_idempotent() {}
// Panic mode
#[test] fn engine_checks_panic_mode_before_each_step() {}
#[test] fn engine_pauses_in_flight_execution_when_panic_mode_set() {}
#[test] fn engine_does_not_kill_executing_step_on_panic_mode() {}
Mocking strategy:
- Mock the Agent gRPC client using a trait object (MockAgentClient)
- Mock the Slack notification sender
- Mock the database using an in-memory state store for pure state machine tests
- Use tokio::time::pause() for timeout tests (no real waiting)
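The valid/invalid transition tests can then be driven by a single total function over (from, to) pairs. This sketch uses an illustrative subset of the states; the real machine has more (Blocked, TimedOut, RollingBack, ...), but the shape is the same:

```rust
// Illustrative subset of the engine's states; names are assumptions
// based on the transition tests above, not the real engine's types.
#[derive(Debug, PartialEq, Clone, Copy)]
enum State {
    Pending,
    Preflight,
    StepReady,
    AwaitApproval,
    Executing,
    StepComplete,
    RunbookComplete,
}

/// A transition is legal only if it is explicitly listed here; everything
/// else is rejected, which is what the "invalid transition" tests assert.
fn is_valid_transition(from: State, to: State) -> bool {
    use State::*;
    matches!(
        (from, to),
        (Pending, Preflight)
            | (Preflight, StepReady)
            | (StepReady, AwaitApproval)
            | (StepReady, Executing)
            | (AwaitApproval, Executing)
            | (Executing, StepComplete)
            | (StepComplete, StepReady)
            | (StepComplete, RunbookComplete)
    )
}

fn main() {
    use State::*;
    assert!(is_valid_transition(Pending, Preflight));
    assert!(is_valid_transition(AwaitApproval, Executing));
    // Preflight can never be skipped, and completed runs can never resume.
    assert!(!is_valid_transition(Pending, StepReady));
    assert!(!is_valid_transition(RunbookComplete, Executing));
}
```

An allowlist of transitions (rather than a denylist) means any state added later is non-executable until a test explicitly legalizes its edges.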
3.4 Runbook Parser
What to test: Normalization correctness, LLM output parsing, variable detection, branch mapping.
// pkg/parser/tests.rs
// Normalizer
#[test] fn normalizer_strips_html_tags() {}
#[test] fn normalizer_strips_confluence_macros() {}
#[test] fn normalizer_normalizes_bullet_styles_to_numbered() {}
#[test] fn normalizer_preserves_code_blocks() {}
#[test] fn normalizer_normalizes_whitespace() {}
#[test] fn normalizer_handles_empty_input() {}
#[test] fn normalizer_handles_unicode_content() {}
// LLM output parsing (using recorded fixtures)
#[test] fn parser_extracts_ordered_steps_from_llm_response() {}
#[test] fn parser_handles_llm_returning_empty_steps_array() {}
#[test] fn parser_rejects_llm_response_missing_required_fields() {}
#[test] fn parser_handles_llm_timeout_gracefully() {}
#[test] fn parser_is_idempotent_same_input_same_output() {}
#[test] fn parser_risk_level_is_null_in_output() {} // Parser never classifies
// Variable detection
#[test] fn variable_detector_finds_dollar_sign_vars() {}
#[test] fn variable_detector_finds_angle_bracket_placeholders() {}
#[test] fn variable_detector_finds_curly_brace_templates() {}
#[test] fn variable_detector_identifies_alert_context_sources() {}
#[test] fn variable_detector_identifies_vpn_prerequisite() {}
// Branch mapping
#[test] fn branch_mapper_detects_if_else_conditional() {}
#[test] fn branch_mapper_produces_valid_dag() {}
#[test] fn branch_mapper_handles_nested_conditionals() {}
// Ambiguity detection
#[test] fn ambiguity_highlighter_flags_vague_check_logs_step() {}
#[test] fn ambiguity_highlighter_flags_restart_service_without_name() {}
Mocking strategy: Mock the LLM gateway (dd0c/route) using recorded response fixtures. Use wiremock-rs or a trait-based mock. Never call real LLM in unit tests.
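As a toy illustration of the three placeholder styles the variable-detector tests cover ($VAR, <placeholder>, {{template}}), a character-scan detector might look like the following; the real detector would use proper regex or AST parsing, and this sketch assumes ASCII input:

```rust
// Hypothetical toy detector for the three placeholder styles; not the
// real Variable Detector, which the tests above specify via regex patterns.
fn detect_variables(text: &str) -> Vec<String> {
    let mut vars = Vec::new();
    let bytes = text.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // $VAR: dollar sign followed by an identifier.
            b'$' => {
                let start = i + 1;
                let mut end = start;
                while end < bytes.len()
                    && (bytes[end].is_ascii_alphanumeric() || bytes[end] == b'_')
                {
                    end += 1;
                }
                if end > start {
                    vars.push(format!("${}", &text[start..end]));
                }
                i = end.max(i + 1);
            }
            // <placeholder>: capture through the closing angle bracket.
            b'<' => {
                if let Some(close) = text[i + 1..].find('>') {
                    vars.push(text[i..i + 2 + close].to_string());
                    i += close + 2;
                } else {
                    i += 1;
                }
            }
            // {{template}}: capture through the closing double brace.
            b'{' if i + 1 < bytes.len() && bytes[i + 1] == b'{' => {
                if let Some(close) = text[i + 2..].find("}}") {
                    vars.push(text[i..i + 4 + close].to_string());
                    i += close + 4;
                } else {
                    i += 2;
                }
            }
            _ => i += 1,
        }
    }
    vars
}

fn main() {
    let vars = detect_variables("kubectl scale --replicas=$COUNT deploy/<service> -n {{namespace}}");
    assert_eq!(vars, vec!["$COUNT", "<service>", "{{namespace}}"]);
}
```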
3.5 Approval Workflow
What to test: Approval gates cannot be bypassed. Typed confirmation is enforced for 🔴 steps.
// pkg/approval/tests.rs
#[test] fn approval_caution_step_requires_button_click_not_auto() {}
#[test] fn approval_dangerous_step_requires_typed_resource_name() {}
#[test] fn approval_dangerous_step_rejects_wrong_resource_name() {}
#[test] fn approval_dangerous_step_rejects_generic_yes_confirmation() {}
#[test] fn approval_cannot_bulk_approve_multiple_steps() {}
#[test] fn approval_captures_approver_slack_identity() {}
#[test] fn approval_captures_approval_timestamp() {}
#[test] fn approval_modification_logs_original_command() {}
#[test] fn approval_timeout_30min_marks_as_stalled_not_approved() {}
#[test] fn approval_skip_logs_step_as_skipped_with_actor() {}
#[test] fn approval_abort_halts_all_remaining_steps() {}
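The typed-confirmation rules above can be sketched as a small predicate. The function name and the generic-affirmation list are illustrative, not the real approval module's API:

```rust
// Sketch of typed-confirmation validation for 🔴 steps; names and the
// rejected-affirmation list are assumptions for illustration.
fn typed_confirmation_valid(expected_resource: &str, typed: &str) -> bool {
    let typed = typed.trim();
    // Generic affirmations are rejected outright (see the tests above).
    const GENERIC: [&str; 5] = ["yes", "y", "ok", "approve", "confirm"];
    if GENERIC.contains(&typed.to_ascii_lowercase().as_str()) {
        return false;
    }
    // Only an exact match of the target resource name counts as consent.
    typed == expected_resource
}

fn main() {
    assert!(typed_confirmation_valid("payments-db-prod", "payments-db-prod"));
    assert!(!typed_confirmation_valid("payments-db-prod", "yes"));
    assert!(!typed_confirmation_valid("payments-db-prod", "payments-db"));
}
```

Requiring the exact resource name forces the approver to read what they are about to destroy, which is the property the tests are protecting.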
3.6 Audit Trail
What to test: Schema validation, immutability enforcement, event completeness.
// pkg/audit/tests.rs
#[test] fn audit_event_requires_tenant_id() {}
#[test] fn audit_event_requires_event_type() {}
#[test] fn audit_event_requires_actor_id_and_type() {}
#[test] fn audit_all_execution_event_types_are_valid_enum_values() {}
#[test] fn audit_step_executed_event_includes_command_hash_not_plaintext() {}
#[test] fn audit_step_executed_event_includes_exit_code() {}
#[test] fn audit_classification_event_includes_both_scanner_and_llm_results() {}
#[test] fn audit_classification_event_includes_merge_rule_applied() {}
#[test] fn audit_hash_chain_each_event_references_previous_hash() {}
#[test] fn audit_hash_chain_modification_breaks_chain_verification() {}
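The hash-chain tests above can be exercised against a minimal chain implementation. This sketch uses std's DefaultHasher purely for illustration; a real audit trail would chain with a cryptographic hash such as SHA-256:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative hash-chain check; DefaultHasher is NOT cryptographically
// secure and stands in for SHA-256 only to keep the sketch dependency-free.
struct AuditEvent {
    payload: String,
    prev_hash: u64,
    hash: u64,
}

fn event_hash(payload: &str, prev_hash: u64) -> u64 {
    let mut h = DefaultHasher::new();
    payload.hash(&mut h);
    prev_hash.hash(&mut h);
    h.finish()
}

/// Append an event, chaining it to the hash of the previous event.
fn append(chain: &mut Vec<AuditEvent>, payload: &str) {
    let prev_hash = chain.last().map_or(0, |e| e.hash);
    let hash = event_hash(payload, prev_hash);
    chain.push(AuditEvent { payload: payload.to_string(), prev_hash, hash });
}

/// Verification fails if any event was modified after being written.
fn verify(chain: &[AuditEvent]) -> bool {
    let mut prev = 0u64;
    for e in chain {
        if e.prev_hash != prev || e.hash != event_hash(&e.payload, e.prev_hash) {
            return false;
        }
        prev = e.hash;
    }
    true
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, "execution.started");
    append(&mut chain, "step.executed");
    assert!(verify(&chain));
    // Tampering with any earlier event breaks verification.
    chain[0].payload = "tampered".to_string();
    assert!(!verify(&chain));
}
```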
4. Integration Test Strategy
4.1 Service Boundary Tests
All integration tests use Testcontainers for database dependencies and WireMock (via wiremock-rs) for external HTTP services. gRPC boundaries use in-process test servers.
Parser ↔ LLM Gateway (dd0c/route)
// tests/integration/parser_llm_test.rs
#[tokio::test]
async fn parser_sends_normalized_text_to_llm_with_correct_schema_prompt() {
let mock_route = MockServer::start().await;
Mock::given(method("POST"))
.and(path("/v1/completions"))
.respond_with(ResponseTemplate::new(200).set_body_json(fixture_llm_response()))
.mount(&mock_route)
.await;
let parser = Parser::new(mock_route.uri());
let result = parser.parse("1. kubectl get pods -n payments").await.unwrap();
assert_eq!(result.steps.len(), 1);
assert_eq!(result.steps[0].risk_level, None); // Parser never classifies
}
#[tokio::test]
async fn parser_retries_on_llm_timeout_up_to_3_times() { ... }
#[tokio::test]
async fn parser_returns_error_when_llm_returns_invalid_json() { ... }
#[tokio::test]
async fn parser_handles_llm_returning_no_actionable_steps() { ... }
Runbook format parsing tests:
#[tokio::test]
async fn parser_confluence_html_extracts_steps_correctly() {
// Load fixture: tests/fixtures/runbooks/confluence_payment_latency.html
let raw = include_str!("../fixtures/runbooks/confluence_payment_latency.html");
let result = parser.parse(raw).await.unwrap();
assert!(result.steps.len() > 0);
assert!(result.prerequisites.len() > 0);
}
#[tokio::test]
async fn parser_notion_export_markdown_extracts_steps_correctly() { ... }
#[tokio::test]
async fn parser_plain_markdown_numbered_list_extracts_steps_correctly() { ... }
#[tokio::test]
async fn parser_confluence_with_code_blocks_preserves_commands() { ... }
#[tokio::test]
async fn parser_notion_with_callout_blocks_extracts_prerequisites() { ... }
Classifier ↔ PostgreSQL (Audit Write)
// tests/integration/classifier_audit_test.rs
#[tokio::test]
async fn classifier_writes_classification_event_to_audit_log() {
let pg = TestcontainersPostgres::start().await;
let classifier = Classifier::new(pg.connection_string());
let result = classifier.classify(&step_fixture()).await.unwrap();
let events: Vec<AuditEvent> = pg.query(
"SELECT * FROM audit_events WHERE event_type = 'runbook.classified'"
).await;
assert_eq!(events.len(), 1);
assert_eq!(events[0].event_data["final_classification"], "safe");
}
#[tokio::test]
async fn classifier_audit_record_is_immutable_no_update_permitted() {
// Attempt UPDATE on audit_events — must fail with permission error
let result = pg.execute("UPDATE audit_events SET event_type = 'tampered'").await;
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("permission denied"));
}
Execution Engine ↔ Agent (gRPC)
// tests/integration/engine_agent_grpc_test.rs
#[tokio::test]
async fn engine_sends_execute_step_payload_with_correct_fields() {
let mock_agent = MockAgentServer::start().await;
let engine = ExecutionEngine::new(mock_agent.address());
engine.execute_step(&safe_step_fixture()).await.unwrap();
let received = mock_agent.last_received_command().await;
assert!(received.execution_id.is_some());
assert!(received.step_execution_id.is_some());
assert_eq!(received.risk_level, RiskLevel::Safe);
}
#[tokio::test]
async fn engine_streams_stdout_from_agent_to_slack_in_realtime() { ... }
#[tokio::test]
async fn engine_handles_agent_disconnect_mid_execution() { ... }
#[tokio::test]
async fn engine_rejects_duplicate_step_execution_id_from_agent() { ... }
#[tokio::test]
async fn engine_validates_command_hash_before_sending_to_agent() { ... }
Approvals ↔ Slack
// tests/integration/approval_slack_test.rs
#[tokio::test]
async fn approval_caution_step_posts_block_kit_message_with_approve_button() { ... }
#[tokio::test]
async fn approval_dangerous_step_posts_modal_requiring_typed_confirmation() { ... }
#[tokio::test]
async fn approval_button_click_captures_slack_user_id_as_approver() { ... }
#[tokio::test]
async fn approval_respects_slack_rate_limit_1_message_per_second() { ... }
#[tokio::test]
async fn approval_batches_rapid_output_updates_to_avoid_rate_limit() { ... }
4.2 Testcontainers Setup
// tests/common/testcontainers.rs
pub struct TestDb {
container: ContainerAsync<Postgres>,
pub pool: PgPool,
}
impl TestDb {
pub async fn start() -> Self {
let container = Postgres::default()
.with_env_var("POSTGRES_DB", "dd0c_test")
.start()
.await
.unwrap();
let pool = PgPool::connect(&container.connection_string()).await.unwrap();
// Run migrations
sqlx::migrate!("./migrations").run(&pool).await.unwrap();
// Apply RLS policies
sqlx::query_file!("./tests/fixtures/sql/apply_rls.sql")
.execute(&pool)
.await
.unwrap();
Self { container, pool }
}
pub async fn with_tenant(&self, tenant_id: Uuid) -> TenantScopedDb {
// Sets app.current_tenant_id for RLS enforcement
TenantScopedDb::new(&self.pool, tenant_id)
}
}
4.3 Sandboxed Execution Environment Tests
For testing actual command execution without touching real infrastructure, use Docker-in-Docker (DinD) or a minimal sandbox container.
// tests/integration/sandbox_execution_test.rs
/// Uses a minimal Alpine container as the execution target.
/// The agent connects to this container instead of real infrastructure.
#[tokio::test]
async fn sandbox_safe_command_executes_and_returns_stdout() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("ls /tmp").await.unwrap();
assert_eq!(result.exit_code, 0);
assert!(!result.stdout.is_empty());
}
#[tokio::test]
async fn sandbox_agent_rejects_dangerous_command_before_execution() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("rm -rf /").await;
assert!(result.is_err());
assert_eq!(result.unwrap_err(), AgentError::CommandRejectedByScanner);
// Verify nothing was deleted
assert!(sandbox.path_exists("/etc").await);
}
#[tokio::test]
async fn sandbox_command_timeout_kills_process_and_returns_error() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::with_timeout(Duration::from_secs(2))
.connect_to(sandbox.socket_path())
.await;
let result = agent.execute("sleep 60").await;
assert_eq!(result.unwrap_err(), AgentError::Timeout);
}
#[tokio::test]
async fn sandbox_no_shell_injection_via_command_argument() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
// This should execute `echo` with the literal argument, not a shell
let result = agent.execute("echo $(rm -rf /)").await.unwrap();
assert_eq!(result.stdout.trim(), "$(rm -rf /)"); // Literal, not executed
assert!(sandbox.path_exists("/etc").await);
}
4.4 Multi-Tenant RLS Integration Tests
// tests/integration/rls_test.rs
#[tokio::test]
async fn rls_tenant_a_cannot_see_tenant_b_runbooks() {
let db = TestDb::start().await;
let tenant_a = Uuid::new_v4();
let tenant_b = Uuid::new_v4();
// Insert runbook for tenant B
db.insert_runbook(tenant_b, "Tenant B Runbook").await;
// Query as tenant A — must return zero rows, not an error
let db_a = db.with_tenant(tenant_a).await;
let runbooks = db_a.query("SELECT * FROM runbooks").await.unwrap();
assert_eq!(runbooks.len(), 0);
}
#[tokio::test]
async fn rls_cross_tenant_audit_query_returns_zero_rows() { ... }
#[tokio::test]
async fn rls_cross_tenant_execution_query_returns_zero_rows() { ... }
5. E2E & Smoke Tests
5.1 Critical User Journeys
E2E tests run against a full Docker Compose stack with sandboxed infrastructure. No real AWS, no real Kubernetes — all targets are containerized mocks.
Docker Compose E2E Stack:
# tests/e2e/docker-compose.yml
services:
postgres: # Real PostgreSQL with migrations applied
redis: # Real Redis for panic mode key
parser: # Real Parser service
classifier: # Real Classifier service
engine: # Real Execution Engine
slack-mock: # WireMock simulating Slack API
llm-mock: # WireMock with recorded LLM responses
agent: # Real dd0c Agent binary
sandbox-host: # Alpine container as execution target
Journey 1: Full Happy Path (P0)
// tests/e2e/happy_path_test.rs
#[tokio::test]
async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() {
let stack = E2EStack::start().await;
// Step 1: Paste runbook
let parse_resp = stack.api()
.post("/v1/run/runbooks/parse-preview")
.json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN }))
.send().await;
assert_eq!(parse_resp.status(), 200);
let parsed = parse_resp.json::<ParsePreviewResponse>().await;
assert!(parsed.steps.iter().any(|s| s.risk_level == "safe"));
assert!(parsed.steps.iter().any(|s| s.risk_level == "caution"));
// Step 2: Save runbook
let runbook = stack.api().post("/v1/run/runbooks")
.json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN, "title": "E2E Test" }))
.send().await.json::<Runbook>().await;
// Step 3: Start execution
let execution = stack.api().post("/v1/run/executions")
.json(&json!({ "runbook_id": runbook.id, "agent_id": stack.agent_id() }))
.send().await.json::<Execution>().await;
// Step 4: Safe steps auto-execute
stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
// Step 5: Approve caution step
stack.api()
.post(format!("/v1/run/executions/{}/steps/{}/approve", execution.id, caution_step_id))
.send().await;
// Step 6: Wait for completion
let completed = stack.wait_for_execution_state(&execution.id, "completed").await;
assert_eq!(completed.steps_executed, 4);
assert_eq!(completed.steps_failed, 0);
// Step 7: Verify audit trail
let audit_events = stack.db()
.query("SELECT event_type FROM audit_events WHERE execution_id = $1", &[&execution.id])
.await;
let event_types: Vec<&str> = audit_events.iter().map(|e| e.event_type.as_str()).collect();
assert!(event_types.contains(&"execution.started"));
assert!(event_types.contains(&"step.auto_executed"));
assert!(event_types.contains(&"step.approved"));
assert!(event_types.contains(&"step.executed"));
assert!(event_types.contains(&"execution.completed"));
}
Journey 2: Destructive Command Blocked at All Levels (P0)
#[tokio::test]
async fn e2e_dangerous_command_blocked_at_copilot_trust_level() {
let stack = E2EStack::start().await;
let runbook = stack.create_runbook_with_dangerous_step().await;
let execution = stack.start_execution(&runbook.id).await;
// Engine must transition to Blocked, not AwaitApproval or AutoExecute
let step_status = stack.wait_for_step_state(
&execution.id, &dangerous_step_id, "blocked"
).await;
assert_eq!(step_status, "blocked");
// Verify audit event logged the block
let events = stack.audit_events_for_execution(&execution.id).await;
assert!(events.iter().any(|e| e.event_type == "step.blocked_by_trust_level"));
}
Journey 3: Panic Mode Halts In-Flight Execution (P0)
#[tokio::test]
async fn e2e_panic_mode_halts_in_flight_execution_within_1_second() {
let stack = E2EStack::start().await;
// Start a long-running execution
let execution = stack.start_execution_with_slow_steps().await;
stack.wait_for_execution_state(&execution.id, "running").await;
let panic_triggered_at = Instant::now();
// Trigger panic mode
stack.api().post("/v1/run/admin/panic").send().await;
// Execution must be paused within 1 second
stack.wait_for_execution_state(&execution.id, "paused").await;
assert!(panic_triggered_at.elapsed() < Duration::from_secs(1));
// Verify execution is paused, not killed
let exec = stack.get_execution(&execution.id).await;
assert_eq!(exec.status, "paused");
assert_ne!(exec.status, "aborted");
}
Journey 4: Approval Timeout Does Not Auto-Approve (P0)
#[tokio::test]
async fn e2e_approval_timeout_marks_stalled_not_approved() {
let stack = E2EStack::with_approval_timeout(Duration::from_secs(5)).start().await;
let execution = stack.start_execution_with_caution_step().await;
stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
// Wait for timeout to expire — do NOT approve
tokio::time::sleep(Duration::from_secs(6)).await;
let exec = stack.get_execution(&execution.id).await;
assert_eq!(exec.status, "stalled");
assert_ne!(exec.status, "completed"); // Must NOT have auto-approved
}
5.2 Chaos Scenarios
// tests/e2e/chaos_test.rs
#[tokio::test]
async fn chaos_agent_disconnects_mid_execution_engine_pauses_and_alerts() {
let stack = E2EStack::start().await;
let execution = stack.start_long_running_execution().await;
// Kill agent network mid-execution
stack.disconnect_agent().await;
let exec = stack.wait_for_execution_state(&execution.id, "paused").await;
assert_eq!(exec.status, "paused");
// Reconnect agent — execution should be resumable
stack.reconnect_agent().await;
stack.resume_execution(&execution.id).await;
stack.wait_for_execution_state(&execution.id, "completed").await;
}
#[tokio::test]
async fn chaos_database_failover_engine_resumes_from_last_committed_step() {
// Simulate RDS failover — engine must reconnect and resume
}
#[tokio::test]
async fn chaos_llm_gateway_down_classification_falls_back_to_scanner_only() {
// LLM unavailable — scanner-only mode, all unknowns become 🟡
}
#[tokio::test]
async fn chaos_slack_api_outage_execution_pauses_awaiting_approval_channel() {
// Slack down — approval requests queue, no auto-approve
}
#[tokio::test]
async fn chaos_mid_execution_step_failure_triggers_rollback_flow() {
let stack = E2EStack::start().await;
let execution = stack.start_execution_with_failing_step().await;
stack.wait_for_execution_state(&execution.id, "rollback_available").await;
// Approve rollback
stack.approve_rollback(&execution.id, &execution.failed_step_id).await;
stack.wait_for_execution_state(&execution.id, "completed").await;
let events = stack.audit_events_for_execution(&execution.id).await;
assert!(events.iter().any(|e| e.event_type == "step.rolled_back"));
}
6. Performance & Load Testing
6.1 Parser Throughput
// benches/parser_bench.rs (criterion)
fn bench_normalizer_small_runbook(c: &mut Criterion) {
let input = include_str!("../fixtures/runbooks/small_10_steps.md");
c.bench_function("normalizer_small", |b| {
b.iter(|| Normalizer::new().normalize(black_box(input)))
});
}
fn bench_normalizer_large_runbook(c: &mut Criterion) {
// 500-step runbook, heavy HTML from Confluence
let input = include_str!("../fixtures/runbooks/large_500_steps.html");
c.bench_function("normalizer_large", |b| {
b.iter(|| Normalizer::new().normalize(black_box(input)))
});
}
fn bench_scanner_100_commands(c: &mut Criterion) {
let commands = fixture_100_mixed_commands();
let scanner = Scanner::new();
c.bench_function("scanner_100_commands", |b| {
b.iter(|| {
for cmd in &commands {
black_box(scanner.classify(cmd));
}
})
});
}
Performance targets:
- Normalizer: < 10ms for a 500-step Confluence page
- Scanner: < 1ms per command, < 10ms for 100 commands in batch
- Full parse + classify pipeline: < 5s p95 (including LLM call)
- Classification merge: < 1ms per step
6.2 Concurrent Execution Stress Tests
Use k6 or cargo-based load tests for concurrent execution scenarios:
// tests/load/concurrent_executions.js (k6)
export const options = {
scenarios: {
concurrent_executions: {
executor: 'constant-vus',
vus: 50, // 50 concurrent execution sessions
duration: '5m',
},
},
thresholds: {
http_req_duration: ['p(95)<500'], // API responses < 500ms p95
http_req_failed: ['rate<0.01'], // < 1% error rate
},
};
export default function () {
// Start execution, poll status, approve steps, verify completion
const execution = startExecution(FIXTURE_RUNBOOK_ID);
waitForApprovalGate(execution.id);
approveStep(execution.id, execution.pending_step_id);
waitForCompletion(execution.id);
}
Stress test targets:
- 50 concurrent execution sessions: all complete without errors
- Approval workflow: < 200ms p95 latency for approval API calls
- Audit trail ingestion: handles 1000 events/second without data loss
- Scanner: handles 10,000 classifications/second (batch mode)
6.3 Approval Workflow Latency Under Load
// tests/load/approval_latency_test.rs
#[tokio::test]
async fn approval_workflow_p95_latency_under_100_concurrent_approvals() {
let stack = E2EStack::start().await;
let mut handles = vec![];
for _ in 0..100 {
let stack = stack.clone();
handles.push(tokio::spawn(async move {
let execution = stack.start_execution_with_caution_step().await;
stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
let start = Instant::now();
stack.approve_step(&execution.id, &execution.pending_step_id).await;
start.elapsed()
}));
}
let latencies: Vec<Duration> = futures::future::join_all(handles)
.await.into_iter().map(|r| r.unwrap()).collect();
let p95 = percentile(&latencies, 95);
assert!(p95 < Duration::from_millis(200), "p95 approval latency: {:?}", p95);
}
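The `percentile` helper is assumed by the test above. A minimal nearest-rank sketch over `Duration` samples could look like this (the real project may use a different interpolation):

```rust
use std::time::Duration;

/// Nearest-rank percentile over latency samples.
/// Assumption: mirrors the `percentile(&latencies, 95)` call in the
/// load test; this is a sketch, not the project's actual helper.
pub fn percentile(samples: &[Duration], pct: u32) -> Duration {
    assert!(!samples.is_empty() && pct <= 100);
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest-rank: ceil(pct/100 * n), clamped to at least the first sample.
    let rank = ((pct as usize * sorted.len()).div_ceil(100)).max(1);
    sorted[rank - 1]
}
```

Nearest-rank avoids interpolation artifacts on small sample counts, which matters when only 100 approvals are measured.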
7. CI/CD Pipeline Integration
7.1 Test Stages
┌─────────────────────────────────────────────────────────────────────┐
│ CI/CD TEST PIPELINE │
│ │
│ PRE-COMMIT (local, < 30s) │
│ ├── cargo fmt --check │
│ ├── cargo clippy -- -D warnings │
│ ├── Unit tests for changed packages only │
│ └── Canary suite (50 known-destructive commands must stay 🔴) │
│ │
│ PR GATE (CI, < 10 min) │
│ ├── Full unit test suite (all packages) │
│ ├── Canary suite (mandatory — build fails if any canary is 🟢) │
│ ├── Integration tests (Testcontainers) │
│ ├── Coverage check (see thresholds below) │
│ ├── Decision log check (PRs touching classifier/executor/parser │
│ │ must include decision_log.json) │
│ └── Expired feature flag check (CI blocks if flag TTL exceeded) │
│ │
│ MERGE TO MAIN (CI, < 20 min) │
│ ├── Full unit + integration suite │
│ ├── E2E smoke tests (Docker Compose stack) │
│ ├── Performance regression check (criterion baselines) │
│ └── Schema migration validation │
│ │
│ DEPLOY TO STAGING (post-merge, < 30 min) │
│ ├── E2E full suite against staging environment │
│ ├── Chaos scenarios (agent disconnect, DB failover) │
│ └── Load test (50 concurrent executions, 5 min) │
│ │
│ DEPLOY TO PRODUCTION (manual gate after staging) │
│ ├── Smoke test: parse-preview endpoint responds < 5s │
│ ├── Smoke test: agent heartbeat received │
│ └── Smoke test: audit trail write succeeds │
└─────────────────────────────────────────────────────────────────────┘
7.2 Coverage Thresholds
# .cargo/coverage.toml (enforced via cargo-tarpaulin in CI)
[thresholds]
# Safety-critical components — highest bar
"pkg/classifier/scanner" = 100 # Every pattern must be tested
"pkg/classifier/merge" = 100 # Every merge rule must be tested
"pkg/executor/state_machine" = 95 # Every state transition
"pkg/executor/trust" = 95 # Trust level enforcement
"pkg/approval" = 95 # Approval gates
# Core components
"pkg/parser" = 90
"pkg/audit" = 90
"pkg/agent/scanner" = 100 # Agent-side scanner: same as SaaS-side
# Supporting components
"pkg/matcher" = 85
"pkg/slack" = 80
"pkg/api" = 80
# Overall project minimum
"overall" = 85
CI enforcement:
# .github/workflows/ci.yml (excerpt)
- name: Check coverage thresholds
run: |
cargo tarpaulin --out Json --output-dir coverage/
python scripts/check_coverage_thresholds.py coverage/tarpaulin-report.json
- name: Run canary suite (MANDATORY)
run: cargo test --package dd0c-classifier canary_suite -- --nocapture
# This job failing blocks ALL other jobs
- name: Check decision logs for safety-critical PRs
run: |
CHANGED=$(git diff --name-only origin/main)
if echo "$CHANGED" | grep -qE "pkg/(parser|classifier|executor|approval)/"; then
python scripts/check_decision_log.py
fi
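At its core, the check that scripts/check_coverage_thresholds.py performs is a fail-closed comparison of per-package coverage against the table above. A sketch of that logic (the real script additionally parses tarpaulin's JSON report; the function name and map shapes here are illustrative assumptions):

```rust
use std::collections::HashMap;

/// Fail-closed threshold check. A package listed in the thresholds but
/// missing from the coverage report fails, same as a package below its bar.
pub fn check_thresholds(
    report: &HashMap<&str, f64>,
    thresholds: &HashMap<&str, f64>,
) -> Result<(), String> {
    for (pkg, min) in thresholds {
        match report.get(pkg) {
            // Missing coverage data fails closed rather than passing silently.
            None => return Err(format!("{pkg}: no coverage data")),
            Some(cov) if cov < min => {
                return Err(format!("{pkg}: {cov:.1}% < required {min:.1}%"))
            }
            _ => {}
        }
    }
    Ok(())
}
```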
7.3 Test Parallelization Strategy
# GitHub Actions matrix strategy
jobs:
unit-tests:
strategy:
matrix:
package:
- dd0c-classifier # Runs first — safety-critical
- dd0c-executor # Runs first — safety-critical
- dd0c-parser
- dd0c-audit
- dd0c-approval
- dd0c-matcher
- dd0c-slack
- dd0c-api
steps:
- run: cargo test --package ${{ matrix.package }}
canary-suite:
needs: [] # Runs in parallel with unit tests
steps:
- run: cargo test --package dd0c-classifier canary_suite
integration-tests:
needs: [unit-tests, canary-suite] # Only after unit tests pass
strategy:
matrix:
suite:
- parser-llm
- classifier-audit
- engine-agent
- approval-slack
- rls-isolation
steps:
- run: cargo test --test ${{ matrix.suite }}
e2e-tests:
needs: [integration-tests]
steps:
- run: docker compose -f tests/e2e/docker-compose.yml up -d
- run: cargo test --test e2e
Parallelization rules:
- Canary suite runs in parallel with unit tests — never blocked
- Integration tests only start after ALL unit tests pass
- E2E tests only start after ALL integration tests pass
- Each test package runs in its own job (parallel matrix)
- Testcontainers instances are per-test, not shared (no state leakage)
8. Transparent Factory Tenet Testing
8.1 Feature Flag Tests (Atomic Flagging — Story 10.1)
// pkg/flags/tests.rs
// Basic flag evaluation
#[test] fn flag_evaluates_locally_no_network_call() {}
#[test] fn flag_disabled_by_default_for_new_execution_paths() {}
#[test] fn flag_requires_owner_field_or_validation_fails() {}
#[test] fn flag_requires_ttl_field_or_validation_fails() {}
#[test] fn flag_expired_ttl_is_treated_as_disabled() {}
// Destructive flag 48-hour bake enforcement
#[test] fn flag_destructive_true_requires_48h_bake_before_full_rollout() {
let flag = FeatureFlag {
name: "enable_kubectl_delete_execution",
destructive: true,
rollout_percentage: 100,
bake_started_at: Some(Utc::now() - Duration::hours(24)), // Only 24h
..Default::default()
};
let result = FlagValidator::validate(&flag);
assert!(result.is_err());
assert!(result.unwrap_err().contains("48-hour bake required for destructive flags"));
}
#[test] fn flag_destructive_true_at_10_percent_before_48h_is_valid() {
let flag = FeatureFlag {
destructive: true,
rollout_percentage: 10,
bake_started_at: Some(Utc::now() - Duration::hours(12)),
..Default::default()
};
assert!(FlagValidator::validate(&flag).is_ok());
}
#[test] fn flag_destructive_true_at_100_percent_after_48h_is_valid() {
let flag = FeatureFlag {
destructive: true,
rollout_percentage: 100,
bake_started_at: Some(Utc::now() - Duration::hours(49)),
..Default::default()
};
assert!(FlagValidator::validate(&flag).is_ok());
}
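The three bake-time tests above reduce to a single predicate. A minimal sketch, with field names assumed to mirror the FeatureFlag struct and the bake duration pre-computed in hours rather than derived from `bake_started_at`:

```rust
/// Simplified flag carrying only the fields the bake rule needs.
/// Assumption: `bake_hours` stands in for `Utc::now() - bake_started_at`.
pub struct Flag {
    pub destructive: bool,
    pub rollout_percentage: u8,
    pub bake_hours: u64,
}

/// A destructive flag may only reach 100% rollout after a 48-hour bake.
/// Partial rollouts and non-destructive flags are unaffected.
pub fn validate_bake(flag: &Flag) -> Result<(), String> {
    if flag.destructive && flag.rollout_percentage >= 100 && flag.bake_hours < 48 {
        return Err("48-hour bake required for destructive flags".into());
    }
    Ok(())
}
```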
// Circuit breaker
#[test] fn circuit_breaker_triggers_after_2_failures_in_10_minute_window() {
let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
breaker.record_failure();
assert_eq!(breaker.state(), CircuitState::Closed);
breaker.record_failure();
assert_eq!(breaker.state(), CircuitState::Open); // Trips on 2nd failure
}
#[test] fn circuit_breaker_open_disables_flag_immediately() {
let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
breaker.record_failure();
breaker.record_failure();
let flag_store = FlagStore::with_breaker(breaker);
assert!(!flag_store.is_enabled("enable_new_parser"));
}
#[test] fn circuit_breaker_pauses_in_flight_executions_not_kills() {
// Verify executions are paused (status=paused), not aborted (status=aborted)
}
#[test] fn circuit_breaker_resets_after_window_expires() {
let mut breaker = CircuitBreaker::new("flag", 2, Duration::minutes(10));
breaker.record_failure();
breaker.record_failure();
// Advance time past window
breaker.advance_time(Duration::minutes(11));
assert_eq!(breaker.state(), CircuitState::Closed);
}
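A minimal sketch of the windowed breaker these tests describe, using a manually advanced clock so `advance_time` can be exercised without real waiting. The flag-name parameter is omitted and all names are assumptions about the real type:

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum CircuitState { Closed, Open }

/// Trips open once `threshold` failures land inside the sliding `window`.
pub struct CircuitBreaker {
    threshold: usize,
    window: Duration,
    now: Duration,           // fake clock, advanced manually in tests
    failures: Vec<Duration>, // timestamps of recorded failures
}

impl CircuitBreaker {
    pub fn new(threshold: usize, window: Duration) -> Self {
        Self { threshold, window, now: Duration::ZERO, failures: Vec::new() }
    }
    pub fn record_failure(&mut self) {
        self.failures.push(self.now);
    }
    pub fn advance_time(&mut self, by: Duration) {
        self.now += by;
    }
    pub fn state(&self) -> CircuitState {
        // Only failures still inside the window count toward the trip.
        let cutoff = self.now.saturating_sub(self.window);
        let recent = self.failures.iter().filter(|t| **t >= cutoff).count();
        if recent >= self.threshold { CircuitState::Open } else { CircuitState::Closed }
    }
}
```

Deriving state from timestamps (rather than mutating a counter) makes the window-expiry reset in the last test fall out for free.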
#[test] fn ci_blocks_if_flag_at_100_percent_past_ttl() {
// This test validates the CI check script logic
let flags = vec![
FeatureFlag { name: "old_flag", rollout_percentage: 100, ttl: expired_ttl(), ..Default::default() }
];
let result = CiValidator::check_expired_flags(&flags);
assert!(result.is_err());
}
8.2 Schema Migration Validation Tests (Elastic Schema — Story 10.2)
// tests/migrations/schema_validation_test.rs
#[tokio::test]
async fn migration_does_not_remove_existing_columns() {
let db = TestDb::start().await;
let columns_before = db.get_column_names("audit_events").await;
// Apply all pending migrations
sqlx::migrate!("./migrations").run(&db.pool).await.unwrap();
let columns_after = db.get_column_names("audit_events").await;
// Every column that existed before must still exist
for col in &columns_before {
assert!(columns_after.contains(col),
"Migration removed column '{}' from audit_events — FORBIDDEN", col);
}
}
#[tokio::test]
async fn migration_does_not_change_existing_column_types() {
// Verify no type changes on existing columns
}
#[tokio::test]
async fn migration_does_not_rename_existing_columns() {
// Verify column names are stable
}
#[tokio::test]
async fn audit_events_table_has_no_update_permission_for_app_role() {
let db = TestDb::start().await;
let result = db.as_app_role()
.execute("UPDATE audit_events SET event_type = 'tampered' WHERE 1=1")
.await;
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("permission denied"));
}
#[tokio::test]
async fn audit_events_table_has_no_delete_permission_for_app_role() {
let db = TestDb::start().await;
let result = db.as_app_role()
.execute("DELETE FROM audit_events WHERE 1=1")
.await;
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("permission denied"));
}
#[tokio::test]
async fn execution_log_parsers_ignore_unknown_fields() {
// Simulate a future schema with extra fields — old parser must not fail
let event_json = json!({
"id": Uuid::new_v4(),
"event_type": "step.executed",
"unknown_future_field": "some_value", // New field old parser doesn't know
"tenant_id": Uuid::new_v4(),
"created_at": Utc::now(),
});
let result = AuditEvent::from_json(&event_json);
assert!(result.is_ok()); // Must not fail on unknown fields
}
#[tokio::test]
async fn migration_includes_sunset_date_comment() {
// Parse migration files and verify each has a sunset_date comment
let migrations = read_migration_files("./migrations");
for migration in &migrations {
if migration.is_additive() {
assert!(migration.content.contains("-- sunset_date:"),
"Migration {} missing sunset_date comment", migration.name);
}
}
}
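The same invariants can also be pre-checked by linting migration SQL before it ever reaches a database. A coarse textual sketch; the live-schema tests above remain the authoritative check, since string matching cannot understand every dialect:

```rust
/// Reject migration SQL that is not purely additive.
/// Assumption: a deliberately blunt uppercase substring scan; false
/// positives are acceptable because failing closed is the goal.
pub fn lint_additive_only(sql: &str) -> Result<(), String> {
    let upper = sql.to_uppercase();
    for forbidden in ["DROP COLUMN", "DROP TABLE", "RENAME COLUMN", "ALTER COLUMN", "RENAME TO"] {
        if upper.contains(forbidden) {
            return Err(format!("non-additive statement: {forbidden}"));
        }
    }
    Ok(())
}
```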
8.3 Decision Log Format Validation (Cognitive Durability — Story 10.3)
// tests/decisions/decision_log_test.rs
#[test]
fn decision_log_schema_requires_all_mandatory_fields() {
let incomplete_log = json!({
"prompt": "Why is rm -rf dangerous?",
// Missing: reasoning, alternatives_considered, confidence, timestamp, author
});
let result = DecisionLog::validate(&incomplete_log);
assert!(result.is_err());
}
#[test]
fn decision_log_confidence_must_be_between_0_and_1() {
let log = DecisionLog { confidence: 1.5, ..valid_decision_log() };
assert!(DecisionLog::validate(&log).is_err());
}
#[test]
fn decision_log_destructive_command_list_change_requires_reasoning() {
// Any PR adding/removing from the destructive command list must have
// a decision log explaining why
let change = DestructiveCommandListChange {
command: "kubectl drain",
action: ChangeAction::Add,
decision_log: None, // Missing
};
let result = DestructiveCommandChangeValidator::validate(&change);
assert!(result.is_err());
assert!(result.unwrap_err().contains("decision log required for destructive command list changes"));
}
#[test]
fn all_existing_decision_logs_are_valid_json() {
let logs = glob::glob("docs/decisions/*.json").unwrap();
for log_path in logs {
let content = std::fs::read_to_string(log_path.unwrap()).unwrap();
let parsed: serde_json::Value = serde_json::from_str(&content)
.expect("Decision log must be valid JSON");
assert!(DecisionLog::validate(&parsed).is_ok());
}
}
#[test]
fn cyclomatic_complexity_cap_enforced_at_10() {
// This is validated by the clippy lint in CI:
// #![deny(clippy::cognitive_complexity)]
// Test here validates the lint config is present
let clippy_config = std::fs::read_to_string(".clippy.toml").unwrap();
assert!(clippy_config.contains("cognitive-complexity-threshold = 10"));
}
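The validation rules these tests pin down (mandatory fields present, confidence within [0, 1]) can be sketched as follows. The field list is taken from the comment in the first test and is an assumption about the real schema:

```rust
/// Minimal decision-log record; field names follow the mandatory set
/// listed in the tests above and are assumptions about the real type.
#[derive(Clone)]
pub struct DecisionLog {
    pub prompt: String,
    pub reasoning: String,
    pub alternatives_considered: Vec<String>,
    pub confidence: f64,
    pub timestamp: String,
    pub author: String,
}

pub fn validate(log: &DecisionLog) -> Result<(), String> {
    // Every mandatory string field must be non-empty.
    for (name, value) in [
        ("prompt", &log.prompt),
        ("reasoning", &log.reasoning),
        ("timestamp", &log.timestamp),
        ("author", &log.author),
    ] {
        if value.trim().is_empty() {
            return Err(format!("missing mandatory field: {name}"));
        }
    }
    if !(0.0..=1.0).contains(&log.confidence) {
        return Err("confidence must be between 0 and 1".into());
    }
    Ok(())
}
```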
8.4 OTEL Span Assertion Tests (Semantic Observability — Story 10.4)
// tests/observability/otel_spans_test.rs
#[tokio::test]
async fn otel_runbook_execution_creates_parent_span() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let spans = tracer.finished_spans();
let parent = spans.iter().find(|s| s.name == "runbook_execution");
assert!(parent.is_some(), "runbook_execution parent span must exist");
}
#[tokio::test]
async fn otel_each_step_creates_child_spans_3_levels_deep() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook_with_one_step()).await.unwrap();
let spans = tracer.finished_spans();
let step_classification = spans.iter().find(|s| s.name == "step_classification");
let step_approval = spans.iter().find(|s| s.name == "step_approval_check");
let step_execution = spans.iter().find(|s| s.name == "step_execution");
assert!(step_classification.is_some());
assert!(step_approval.is_some());
assert!(step_execution.is_some());
// Verify parent-child hierarchy
let parent_id = spans.iter().find(|s| s.name == "runbook_execution").unwrap().span_id;
assert_eq!(step_classification.unwrap().parent_span_id, Some(parent_id));
}
#[tokio::test]
async fn otel_step_classification_span_has_required_attributes() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_classification").unwrap();
assert!(span.attributes.contains_key("step.text_hash"));
assert!(span.attributes.contains_key("step.classified_as"));
assert!(span.attributes.contains_key("step.confidence_score"));
assert!(span.attributes.contains_key("step.alternatives_considered"));
}
#[tokio::test]
async fn otel_step_execution_span_has_required_attributes() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_execution").unwrap();
assert!(span.attributes.contains_key("step.command_hash"));
assert!(span.attributes.contains_key("step.target_host_hash"));
assert!(span.attributes.contains_key("step.exit_code"));
assert!(span.attributes.contains_key("step.duration_ms"));
// Must NOT contain raw command or PII
assert!(!span.attributes.contains_key("step.command_raw"));
assert!(!span.attributes.contains_key("step.stdout_raw"));
}
#[tokio::test]
async fn otel_step_approval_span_has_required_attributes() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_approval_check").unwrap();
assert!(span.attributes.contains_key("step.approval_required"));
assert!(span.attributes.contains_key("step.approval_source"));
assert!(span.attributes.contains_key("step.approval_latency_ms"));
}
#[tokio::test]
async fn otel_no_pii_in_any_span_attributes() {
// Scan all span attributes for patterns that look like PII or raw commands
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let spans = tracer.finished_spans();
for span in &spans {
for (key, value) in &span.attributes {
assert!(!key.ends_with("_raw"), "Raw data in span attribute: {}", key);
assert!(!looks_like_email(value), "PII in span: {} = {}", key, value);
}
}
}
#[tokio::test]
async fn otel_ai_classification_span_includes_model_metadata() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_classification").unwrap();
assert!(span.attributes.contains_key("ai.prompt_hash"));
assert!(span.attributes.contains_key("ai.model_version"));
assert!(span.attributes.contains_key("ai.reasoning_chain"));
}
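The InMemoryTracer used throughout this section only needs a small query surface. A sketch covering just the methods these tests call; `record` and the string-keyed attribute map are assumptions, and the real tracer would hook the OTEL SDK rather than store spans directly:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, Debug)]
pub struct Span {
    pub name: String,
    pub parent: Option<String>,
    pub attributes: HashMap<String, String>,
}

/// In-process span recorder for tests. Cloning shares the same buffer,
/// so the engine and the test both see every finished span.
#[derive(Clone, Default)]
pub struct InMemoryTracer {
    spans: Arc<Mutex<Vec<Span>>>,
}

impl InMemoryTracer {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn record(&self, span: Span) {
        self.spans.lock().unwrap().push(span);
    }
    pub fn finished_spans(&self) -> Vec<Span> {
        self.spans.lock().unwrap().clone()
    }
    pub fn span_by_name(&self, name: &str) -> Option<Span> {
        self.finished_spans().into_iter().find(|s| s.name == name)
    }
}
```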
8.5 Governance Policy Enforcement Tests (Configurable Autonomy — Story 10.5)
// pkg/governance/tests.rs
// Strict mode: all steps require approval
#[test]
fn governance_strict_mode_requires_approval_for_safe_steps() {
let policy = Policy { governance_mode: GovernanceMode::Strict, ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
let next = engine.next_state(&safe_step(), TrustLevel::Copilot);
assert_eq!(next, State::AwaitApproval); // Even safe steps need approval in strict mode
}
// Audit mode: safe steps auto-execute, destructive always require approval
#[test]
fn governance_audit_mode_auto_executes_safe_steps() {
let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
assert_eq!(engine.next_state(&safe_step(), TrustLevel::Copilot), State::AutoExecute);
}
#[test]
fn governance_audit_mode_still_requires_approval_for_dangerous_steps() {
let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
assert_eq!(engine.next_state(&dangerous_step(), TrustLevel::Copilot), State::AwaitApproval);
}
#[test]
fn governance_no_fully_autonomous_mode_exists() {
// A GovernanceMode::FullAuto variant does not exist; referencing it
// would fail to compile. This test pins the enum to exactly two modes.
let modes = GovernanceMode::all_variants();
assert_eq!(modes, vec![GovernanceMode::Strict, GovernanceMode::Audit]);
}
// Panic mode
#[tokio::test]
async fn governance_panic_mode_halts_all_executions_within_1_second() {
// Verified in E2E section — unit test verifies Redis key check
let redis = MockRedis::new();
let engine = ExecutionEngine::with_redis(redis.clone());
redis.set("dd0c:panic", "1").await;
let result = engine.check_panic_mode().await;
assert_eq!(result, PanicModeStatus::Active);
}
#[tokio::test]
async fn governance_panic_mode_uses_redis_not_database() {
// Panic mode must NOT do a DB query — Redis only for <1s requirement
let db = TrackingDb::new();
let redis = MockRedis::new();
redis.set("dd0c:panic", "1").await;
let engine = ExecutionEngine::with_db(db.clone()).with_redis(redis);
engine.check_panic_mode().await;
assert_eq!(db.query_count(), 0, "Panic mode check must not query the database");
}
#[tokio::test(start_paused = true)]
async fn governance_panic_mode_requires_manual_clearance() {
// Panic mode cannot be auto-cleared — only manual API call
let engine = ExecutionEngine::new();
engine.trigger_panic_mode().await;
// Advance the paused tokio clock by 24 hours
tokio::time::advance(Duration::from_secs(24 * 3600)).await;
// Must still be in panic mode
assert_eq!(engine.check_panic_mode().await, PanicModeStatus::Active);
}
// Governance drift monitoring
#[tokio::test]
async fn governance_drift_auto_downgrades_when_auto_execution_exceeds_threshold() {
let db = TestDb::start().await;
// Insert execution history: 80% auto-executed (threshold is 70%)
insert_execution_history_with_auto_ratio(&db, 0.80).await;
let drift_monitor = GovernanceDriftMonitor::new(&db.pool);
let result = drift_monitor.check().await;
assert_eq!(result.action, DriftAction::DowngradeToStrict);
}
// Per-runbook governance override
#[test]
fn governance_runbook_locked_to_strict_ignores_system_audit_mode() {
let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
let runbook = Runbook { governance_override: Some(GovernanceMode::Strict), ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
// Even in audit mode, this runbook requires approval for safe steps
assert_eq!(engine.next_state_for_runbook(&safe_step(), &runbook), State::AwaitApproval);
}
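The mode-resolution rules these tests pin down reduce to two small pure functions. A sketch with assumed names; the real engine also layers trust-level checks on top, which are omitted here:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum GovernanceMode { Strict, Audit } // no FullAuto variant exists

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Classification { Safe, Caution, Dangerous }

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum State { AutoExecute, AwaitApproval }

/// A per-runbook override wins over the system policy.
pub fn effective_mode(
    system: GovernanceMode,
    runbook_override: Option<GovernanceMode>,
) -> GovernanceMode {
    runbook_override.unwrap_or(system)
}

/// Strict: everything gates on approval. Audit: only Safe auto-executes;
/// Caution and Dangerous still require a human.
pub fn next_state(mode: GovernanceMode, class: Classification) -> State {
    match (mode, class) {
        (GovernanceMode::Audit, Classification::Safe) => State::AutoExecute,
        _ => State::AwaitApproval,
    }
}
```

Note the match has no arm that auto-executes anything other than a Safe step in Audit mode, which is the structural expression of "no fully autonomous mode".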
9. Test Data & Fixtures
9.1 Runbook Format Factories
// tests/fixtures/runbooks/mod.rs
pub fn fixture_runbook_markdown() -> &'static str {
include_str!("markdown_basic.md")
}
pub fn fixture_runbook_confluence_html() -> &'static str {
include_str!("confluence_export.html")
}
pub fn fixture_runbook_notion_export() -> &'static str {
include_str!("notion_export.md")
}
pub fn fixture_runbook_with_ambiguities() -> &'static str {
include_str!("ambiguous_steps.md")
}
pub fn fixture_runbook_with_variables() -> &'static str {
include_str!("variables_and_placeholders.md")
}
9.2 Step Classification Fixtures
// tests/fixtures/commands/mod.rs
pub fn fixture_safe_commands() -> Vec<&'static str> {
vec![
"kubectl get pods -n kube-system",
"aws ec2 describe-instances --region us-east-1",
"cat /var/log/syslog | grep error",
"SELECT count(*) FROM users",
]
}
pub fn fixture_caution_commands() -> Vec<&'static str> {
vec![
"kubectl rollout restart deployment/api",
"systemctl restart nginx",
"aws ec2 stop-instances --instance-ids i-1234567890abcdef0",
"UPDATE users SET status = 'active' WHERE id = 123",
]
}
pub fn fixture_destructive_commands() -> Vec<&'static str> {
vec![
"kubectl delete namespace prod",
"rm -rf /var/lib/postgresql/data",
"DROP TABLE customers",
"aws rds delete-db-instance --db-instance-identifier prod-db",
"terraform destroy -auto-approve",
]
}
pub fn fixture_ambiguous_commands() -> Vec<&'static str> {
vec![
"restart the service",
"./cleanup.sh",
"python script.py",
"curl -X POST http://internal-api/reset",
]
}
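These fixtures imply the shape of the deterministic scanner: destructive patterns are checked first so 🔴 always wins, and unknown commands fall to 🟡 rather than 🟢. A sketch with an illustrative pattern list; the real scanner's patterns and matching strategy are assumed, not reproduced:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Risk { Safe, Caution, Destructive }

/// Deterministic substring scan over a known pattern list.
/// Assumption: patterns here are a tiny illustrative subset.
pub fn scan(cmd: &str) -> Risk {
    const DESTRUCTIVE: &[&str] = &["rm -rf", "drop table", "delete namespace",
        "terraform destroy", "delete-db-instance"];
    const SAFE: &[&str] = &["kubectl get", "describe-instances", "select ", "cat "];
    let lower = cmd.to_lowercase();
    // Destructive patterns checked first: 🔴 overrides everything.
    if DESTRUCTIVE.iter().any(|&p| lower.contains(p)) {
        return Risk::Destructive;
    }
    if SAFE.iter().any(|&p| lower.contains(p)) {
        return Risk::Safe;
    }
    // Ambiguous commands like "restart the service" land here: 🟡, never 🟢.
    Risk::Caution
}
```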
9.3 Infrastructure Target Mocks
We mock infrastructure targets for execution tests using isolated containers or HTTP mock servers.
// tests/fixtures/infra/mod.rs
/// Spawns a lightweight k3s container for testing kubectl commands safely
pub async fn mock_k8s_cluster() -> K3sContainer {
K3sContainer::start().await
}
/// Spawns LocalStack for testing AWS CLI commands
pub async fn mock_aws_env() -> LocalStackContainer {
LocalStackContainer::start().await
}
/// Spawns a bare Alpine container with SSH access
pub async fn mock_bare_metal() -> SshContainer {
SshContainer::start("alpine:latest").await
}
9.4 Approval Workflow Scenario Fixtures
// tests/fixtures/approvals/mod.rs
pub fn fixture_slack_approval_payload(step_id: &str, user_id: &str) -> serde_json::Value {
json!({
"type": "block_actions",
"user": { "id": user_id, "username": "riley.oncall" },
"actions": [{
"action_id": "approve_step",
"value": step_id
}]
})
}
pub fn fixture_slack_typed_confirmation_payload(step_id: &str, resource_name: &str) -> serde_json::Value {
json!({
"type": "view_submission",
"user": { "id": "U123456" },
"view": {
"state": {
"values": {
"confirm_block": {
"resource_input": { "value": resource_name }
}
}
},
"private_metadata": step_id
}
})
}
10. TDD Implementation Order
To maintain the safety-first invariant, components must be built and tested in a specific order. Execution code cannot be written until the safety constraints are proven.
10.1 Bootstrap Sequence (Test Infrastructure First)
- Testcontainers Setup: Establish TestDb with migrations and RLS policies. Prove cross-tenant isolation fails closed.
- OTEL Test Tracer: Implement InMemoryTracer to assert span creation and attributes.
- Canary Suite Harness: Create the canary_suite test target that runs a hardcoded list of destructive commands and fails if any return 🟢.
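The canary harness in step 3 reduces to iterating a hardcoded destructive list against whatever classifier is under test and failing if anything escapes 🔴. A sketch; the list here is a small excerpt standing in for the real ~50 canaries, and the closure parameter is an assumption about how the classifier is injected:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Risk { Safe, Caution, Destructive }

/// Every command in the hardcoded canary list must classify as Destructive.
/// Returns the list of commands that escaped, so the CI log names offenders.
pub fn run_canary_suite(classify: impl Fn(&str) -> Risk) -> Result<(), Vec<String>> {
    const CANARIES: &[&str] = &[
        "rm -rf /",
        "DROP TABLE customers",
        "kubectl delete namespace prod",
        "terraform destroy -auto-approve",
    ];
    let mut escaped = Vec::new();
    for &cmd in CANARIES {
        if classify(cmd) != Risk::Destructive {
            escaped.push(cmd.to_string());
        }
    }
    if escaped.is_empty() { Ok(()) } else { Err(escaped) }
}
```

Returning all escapees at once, rather than failing on the first, keeps a single CI run informative when a refactor regresses several patterns together.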
10.2 Epic Dependency Order
| Phase | Component | TDD Rule | Rationale |
|---|---|---|---|
| 1 | Deterministic Safety Scanner | Unit tests FIRST | Foundation of safety. Exhaustive pattern tests must exist before any parser or execution logic. |
| 2 | Merge Engine | Unit tests FIRST | Hardcoded rules. Prove 🔴 overrides 🟢 before integrating LLMs. |
| 3 | Execution Engine State Machine | Unit tests FIRST | CRITICAL: Prove trust level boundaries and approval gates block 🔴/🟡 steps before writing any code that actually executes commands. |
| 4 | Agent-Side Scanner | Unit tests FIRST | Port SaaS scanner logic to Agent binary. Prove the Agent rejects rm -rf independently. |
| 5 | Agent gRPC & Command Execution | Integration tests FIRST | Use sandbox containers. Prove timeout kills processes and shell injection fails. |
| 6 | Runbook Parser | Integration tests lead | Use LLM fixtures. The parser is safe because the classifier catches its mistakes. |
| 7 | Audit Trail | Unit tests FIRST | Prove schema immutability and hash chaining. |
| 8 | Slack Bot & API | Integration tests lead | UI and routing. |
10.3 The Execution Engine Testing Mandate
Execution engine tests MUST be written before any execution code.
Before implementing ExecutionEngine::execute(...), the following tests must exist and fail (Red phase):
- engine_dangerous_step_blocked_at_copilot_level_v1
- engine_caution_step_requires_approval_at_copilot_level
- engine_safe_step_blocked_at_read_only_trust_level
- engine_duplicate_step_execution_id_is_rejected
- engine_pauses_in_flight_execution_when_panic_mode_set
Only once these tests are defined can the state machine be implemented to make them pass (Green phase). This ensures no execution path can bypass the Trust Gradient.