dd0c/run — Test Architecture & TDD Strategy
Version: 1.0 | Date: 2026-02-28 | Status: Draft
1. Testing Philosophy & TDD Workflow
1.1 Core Principle: Safety-First Testing
dd0c/run executes commands in production infrastructure. A misclassification that allows a destructive command to auto-execute is an existential risk. This shapes every testing decision:
If it touches execution, tests come first. No exceptions.
The TDD discipline here is not a process preference — it is a safety mechanism. Writing tests first for the Action Classifier and Execution Engine forces the developer to enumerate failure modes before writing code that could cause them.
1.2 Red-Green-Refactor Adapted for dd0c/run
The standard Red-Green-Refactor cycle applies with one critical modification: for any component that classifies or executes commands, the "Red" phase must include at least one test for a known-destructive command being correctly blocked.
Standard TDD:          dd0c/run TDD:

Red                    Red (write failing test)
 ↓                      ↓
Green                  Red-Safety (add: "rm -rf / must be 🔴")
 ↓                      ↓
Refactor               Green (make ALL tests pass)
                        ↓
                       Refactor
                        ↓
                       Canary-Check (run canary suite — must stay green)
The Canary Suite is a mandatory gate: a curated set of ~50 known-destructive commands that must always be classified as 🔴. It runs after every refactor. If any canary is classified as anything other than 🔴, the build fails.
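The gate itself can be one parameterized check over the canary list. A minimal sketch, with a stand-in classifier: the real scanner's pattern logic, `RiskLevel` shape, and the full canary list live elsewhere.

```rust
#[derive(Debug, PartialEq)]
enum RiskLevel {
    Dangerous,
    Unknown,
}

// Stand-in for the real scanner: substring checks only.
fn classify(cmd: &str) -> RiskLevel {
    const DESTRUCTIVE: &[&str] = &["rm -rf", "drop table", "terraform destroy"];
    let lowered = cmd.to_lowercase();
    if DESTRUCTIVE.iter().any(|p| lowered.contains(p)) {
        RiskLevel::Dangerous
    } else {
        RiskLevel::Unknown
    }
}

/// The canary gate: every known-destructive command must classify as 🔴.
/// Any other verdict fails the gate, and with it the build.
fn canary_gate(canaries: &[&str]) -> Result<(), String> {
    for cmd in canaries {
        if classify(cmd) != RiskLevel::Dangerous {
            return Err(format!("canary regression: {cmd:?} did not classify as 🔴"));
        }
    }
    Ok(())
}
```

In CI this runs as an ordinary test, so a regression surfaces as a red build rather than a silent reclassification.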
1.3 When to Write Tests First vs. Integration Tests Lead
| Scenario | Approach | Rationale |
|---|---|---|
| Deterministic Safety Scanner | Unit tests FIRST, always | Pure function. No I/O. Exhaustive coverage of destructive patterns is mandatory before any code. |
| Classification Merge Rules | Unit tests FIRST, always | Hardcoded logic. Tests define the spec. |
| Execution Engine state machine | Unit tests FIRST, always | State transitions are safety-critical. Tests enumerate all valid/invalid transitions. |
| Approval workflow | Unit tests FIRST | Approval bypass is a threat vector. Tests must prove it's impossible. |
| Runbook Parser (LLM extraction) | Integration tests lead | LLM behavior is non-deterministic. Integration tests with recorded fixtures define expected behavior. |
| Slack Bot UI flows | Integration tests lead | Slack API interactions are I/O-heavy. Mock Slack API, test message shapes. |
| Alert-Runbook Matcher | Integration tests lead | Matching logic depends on DB state. Testcontainers + fixture data. |
| Audit Trail ingestion | Unit tests first for schema, integration for pipeline | Schema is deterministic; pipeline has I/O. |
1.4 Test Naming Conventions
All tests follow the pattern: <component>_<scenario>_<expected_outcome>
// Unit tests
#[test]
fn scanner_rm_rf_root_classifies_as_dangerous() { ... }
#[test]
fn scanner_kubectl_get_pods_classifies_as_safe() { ... }
#[test]
fn merge_engine_scanner_dangerous_overrides_llm_safe() { ... }
#[test]
fn state_machine_caution_step_transitions_to_await_approval() { ... }
// Integration tests
#[tokio::test]
async fn parser_confluence_html_extracts_ordered_steps() { ... }
#[tokio::test]
async fn execution_engine_approval_timeout_does_not_auto_approve() { ... }
// E2E tests
#[tokio::test]
async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() { ... }
Prohibited naming patterns:
- test_thing() — too vague
- it_works() — meaningless
- test1(), test2() — no context
1.5 Safety-First Rule: Destructive Command Tests Are Mandatory Pre-Code
Hard Rule: No code may be written for the Action Classifier, Execution Engine, or Agent-side Scanner without first writing tests that prove destructive commands are blocked.
This is enforced via CI: the pkg/classifier/ and pkg/executor/ directories require a minimum of 95% test coverage before any PR can merge. The canary test suite must pass on every commit touching these packages.
2. Test Pyramid
2.1 Recommended Ratio
        /\
       /E2E\          ~10% — Critical user journeys, chaos scenarios
      /─────\
     / Integ \        ~20% — Service boundaries, DB, Slack, gRPC
    /─────────\
   /   Unit    \      ~70% — Pure logic, state machines, classifiers
  /_____________\
For the Action Classifier and Execution Engine specifically, the ratio shifts:
Unit: 80% (exhaustive pattern coverage, state machine transitions)
Integration: 15% (scanner ↔ classifier merge, engine ↔ agent gRPC)
E2E: 5% (full execution journeys with sandboxed infra)
2.2 Unit Test Targets (per component)
| Component | Unit Test Focus | Coverage Target |
|---|---|---|
| Deterministic Safety Scanner | Pattern matching, AST parsing, all risk categories | 100% |
| Classification Merge Engine | All 5 merge rules + edge cases | 100% |
| Execution Engine state machine | All state transitions, trust level enforcement | 95% |
| Runbook Parser (normalizer) | HTML stripping, markdown normalization, whitespace | 90% |
| Variable Detector | Placeholder regex patterns, env ref detection | 90% |
| Branch Mapper | DAG construction, if/else detection | 85% |
| Approval Workflow | Approval gates, typed confirmation, timeout behavior | 95% |
| Audit Trail schema | Event type validation, immutability constraints | 90% |
| Alert-Runbook Matcher | Keyword matching, similarity scoring | 85% |
| Trust Level Enforcement | Level checks per risk level, auto-downgrade | 95% |
| Panic Mode | Trigger conditions, halt sequence, Redis key check | 95% |
| Feature Flag Circuit Breaker | 2-failure threshold, 48h bake enforcement | 95% |
2.3 Integration Test Boundaries
| Boundary | Test Type | Infrastructure |
|---|---|---|
| Parser ↔ LLM Gateway (dd0c/route) | Contract test with recorded responses | WireMock / recorded fixtures |
| Classifier ↔ PostgreSQL (audit write) | Integration test | Testcontainers (PostgreSQL) |
| Execution Engine ↔ Agent (gRPC) | Integration test | In-process gRPC server mock |
| Execution Engine ↔ Slack Bot | Integration test | Slack API mock |
| Approval Workflow ↔ Slack | Integration test | Slack API mock |
| Audit Trail ↔ PostgreSQL | Integration test | Testcontainers (PostgreSQL) |
| Alert Matcher ↔ PostgreSQL + pgvector | Integration test | Testcontainers (PostgreSQL + pgvector) |
| Webhook Receiver ↔ PagerDuty/OpsGenie | Contract test | Recorded webhook payloads |
| RLS enforcement | Integration test | Testcontainers (PostgreSQL with RLS enabled) |
2.4 E2E / Smoke Test Scenarios
| Scenario | Priority | Infrastructure |
|---|---|---|
| Full journey: paste → parse → classify → approve → execute → audit | P0 | Docker Compose sandbox |
| Destructive command blocked at all trust levels | P0 | Docker Compose sandbox |
| Panic mode triggered and halts in-flight execution | P0 | Docker Compose sandbox |
| Approval timeout does not auto-approve | P0 | Docker Compose sandbox |
| Cross-tenant data isolation (RLS) | P0 | Testcontainers |
| Agent reconnect after network partition | P1 | Docker Compose sandbox |
| Mid-execution failure triggers rollback flow | P1 | Docker Compose sandbox |
| Feature flag circuit breaker halts execution after 2 failures | P1 | Docker Compose sandbox |
3. Unit Test Strategy (Per Component)
3.1 Deterministic Safety Scanner
What to test: Every pattern category. Every edge case. This component has 100% coverage as a hard requirement.
Key test cases:
// pkg/classifier/scanner/tests.rs
// ── BLOCKLIST (🔴 Dangerous) ──────────────────────────────────────────
#[test] fn scanner_kubectl_delete_namespace_is_dangerous() {}
#[test] fn scanner_kubectl_delete_deployment_is_dangerous() {}
#[test] fn scanner_kubectl_delete_pvc_is_dangerous() {}
#[test] fn scanner_kubectl_delete_all_is_dangerous() {}
#[test] fn scanner_drop_table_is_dangerous() {}
#[test] fn scanner_drop_database_is_dangerous() {}
#[test] fn scanner_truncate_table_is_dangerous() {}
#[test] fn scanner_delete_without_where_is_dangerous() {}
#[test] fn scanner_rm_rf_is_dangerous() {}
#[test] fn scanner_rm_rf_root_is_dangerous() {}
#[test] fn scanner_rm_rf_slash_is_dangerous() {}
#[test] fn scanner_aws_ec2_terminate_instances_is_dangerous() {}
#[test] fn scanner_aws_rds_delete_db_instance_is_dangerous() {}
#[test] fn scanner_terraform_destroy_is_dangerous() {}
#[test] fn scanner_dd_if_dev_zero_is_dangerous() {}
#[test] fn scanner_mkfs_is_dangerous() {}
#[test] fn scanner_sudo_rm_is_dangerous() {}
#[test] fn scanner_chmod_777_recursive_is_dangerous() {}
#[test] fn scanner_kubectl_create_clusterrolebinding_is_dangerous() {}
#[test] fn scanner_aws_iam_create_user_is_dangerous() {}
#[test] fn scanner_pipe_to_xargs_rm_is_dangerous() {}
#[test] fn scanner_delete_with_where_but_no_condition_value_is_dangerous() {}
// ── CAUTION LIST (🟡) ────────────────────────────────────────────────
#[test] fn scanner_kubectl_rollout_restart_is_caution() {}
#[test] fn scanner_kubectl_scale_is_caution() {}
#[test] fn scanner_aws_ec2_stop_instances_is_caution() {}
#[test] fn scanner_aws_ec2_start_instances_is_caution() {}
#[test] fn scanner_systemctl_restart_is_caution() {}
#[test] fn scanner_update_with_where_clause_is_caution() {}
#[test] fn scanner_insert_into_is_caution() {}
#[test] fn scanner_docker_restart_is_caution() {}
#[test] fn scanner_aws_autoscaling_set_desired_capacity_is_caution() {}
// ── ALLOWLIST (🟢 Safe) ──────────────────────────────────────────────
#[test] fn scanner_kubectl_get_pods_is_safe() {}
#[test] fn scanner_kubectl_describe_deployment_is_safe() {}
#[test] fn scanner_kubectl_logs_is_safe() {}
#[test] fn scanner_aws_ec2_describe_instances_is_safe() {}
#[test] fn scanner_aws_s3_ls_is_safe() {}
#[test] fn scanner_select_query_is_safe() {}
#[test] fn scanner_explain_query_is_safe() {}
#[test] fn scanner_curl_get_is_safe() {}
#[test] fn scanner_cat_file_is_safe() {}
#[test] fn scanner_grep_is_safe() {}
#[test] fn scanner_tail_f_is_safe() {}
#[test] fn scanner_docker_ps_is_safe() {}
#[test] fn scanner_terraform_plan_is_safe() {}
#[test] fn scanner_dig_is_safe() {}
#[test] fn scanner_nslookup_is_safe() {}
// ── UNKNOWN / EDGE CASES ─────────────────────────────────────────────
#[test] fn scanner_unknown_command_defaults_to_unknown_not_safe() {}
#[test] fn scanner_empty_command_defaults_to_unknown() {}
#[test] fn scanner_custom_script_path_defaults_to_unknown() {}
#[test] fn scanner_select_into_is_dangerous_not_safe() {} // SELECT INTO is a write
#[test] fn scanner_delete_with_where_is_caution_not_dangerous() {}
#[test] fn scanner_curl_post_is_caution_not_safe() {} // POST has side effects
#[test] fn scanner_pipe_chain_with_destructive_segment_is_dangerous() {}
#[test] fn scanner_command_substitution_with_rm_is_dangerous() {}
#[test] fn scanner_multiline_command_with_destructive_line_is_dangerous() {}
// ── AST PARSING (tree-sitter) ────────────────────────────────────────
#[test] fn scanner_sql_ast_delete_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_update_without_where_is_dangerous() {}
#[test] fn scanner_sql_ast_drop_statement_is_dangerous() {}
#[test] fn scanner_shell_ast_piped_rm_is_dangerous() {}
#[test] fn scanner_shell_ast_subshell_with_destructive_is_dangerous() {}
// ── PERFORMANCE ──────────────────────────────────────────────────────
#[test] fn scanner_classifies_in_under_1ms() {}
#[test] fn scanner_classifies_100_commands_in_under_10ms() {}
Mocking strategy: None. The scanner is a pure function with no I/O. All tests are synchronous, no mocks needed.
Language-specific patterns (Rust):
- Use #[test] for synchronous unit tests
- Use the criterion crate for performance benchmarks
- Compile regex sets once via lazy_static! or once_cell
- Use rstest for parameterized test cases across command variants
// Parameterized test example using rstest
#[rstest]
#[case("kubectl delete namespace production", RiskLevel::Dangerous)]
#[case("kubectl delete deployment payment-svc", RiskLevel::Dangerous)]
#[case("kubectl delete pod payment-abc123", RiskLevel::Dangerous)]
#[case("kubectl delete all --all", RiskLevel::Dangerous)]
fn scanner_kubectl_delete_variants_are_dangerous(
#[case] command: &str,
#[case] expected: RiskLevel,
) {
let scanner = Scanner::new();
assert_eq!(scanner.classify(command).risk, expected);
}
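The pipe-chain edge cases above reduce to one rule: a chained command classifies as its most dangerous segment, never safer. A hedged sketch of that fold, with a toy per-segment classifier standing in for the real scanner:

```rust
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum Risk {
    Safe,
    Caution,
    Dangerous,
}

// Toy per-segment classifier; the real scanner uses pattern sets and ASTs.
fn classify_segment(seg: &str) -> Risk {
    let seg = seg.trim();
    if seg.contains("rm ") {
        Risk::Dangerous
    } else if seg.starts_with("kubectl get") || seg.starts_with("grep") {
        Risk::Safe
    } else {
        Risk::Caution
    }
}

// A pipeline is classified as the maximum risk across its segments.
fn classify_pipeline(cmd: &str) -> Risk {
    cmd.split('|')
        .map(classify_segment)
        .fold(Risk::Safe, |acc, r| if r > acc { r } else { acc })
}
```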
3.2 Classification Merge Engine
What to test: All 5 merge rules, including every combination of scanner/LLM results.
// pkg/classifier/merge/tests.rs
// Rule 1: Scanner 🔴 → always 🔴
#[test] fn merge_scanner_dangerous_llm_safe_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_caution_yields_dangerous() {}
#[test] fn merge_scanner_dangerous_llm_dangerous_yields_dangerous() {}
// Rule 2: Scanner 🟡, LLM 🟢 → 🟡 (scanner wins)
#[test] fn merge_scanner_caution_llm_safe_yields_caution() {}
// Rule 3: Both 🟢 → 🟢 (only path to safe)
#[test] fn merge_scanner_safe_llm_safe_yields_safe() {}
// Rule 4: Scanner Unknown → 🟡 minimum
#[test] fn merge_scanner_unknown_llm_safe_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_caution_yields_caution() {}
#[test] fn merge_scanner_unknown_llm_dangerous_yields_dangerous() {}
// Rule 5: LLM confidence < 0.9 → escalate one level
#[test] fn merge_low_confidence_safe_escalates_to_caution() {}
#[test] fn merge_low_confidence_caution_escalates_to_dangerous() {}
#[test] fn merge_high_confidence_safe_does_not_escalate() {}
// LLM escalation overrides scanner
#[test] fn merge_scanner_safe_llm_caution_yields_caution() {}
#[test] fn merge_scanner_safe_llm_dangerous_yields_dangerous() {}
#[test] fn merge_scanner_caution_llm_dangerous_yields_dangerous() {}
// Merge rule is logged
#[test] fn merge_result_includes_applied_rule_identifier() {}
#[test] fn merge_result_includes_both_scanner_and_llm_inputs() {}
Mocking strategy: LLM results are passed as plain structs — no mocking needed. The merge engine is a pure function.
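Because the merge engine is pure, the five rules fit in a single function. A sketch under assumed type shapes (the real scanner and LLM result structs carry more fields than shown here):

```rust
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum Risk {
    Safe,
    Caution,
    Dangerous,
}

#[derive(Debug, Clone, Copy)]
enum ScanResult {
    Known(Risk),
    Unknown,
}

// Escalate one level: Safe -> Caution, anything else -> Dangerous.
fn escalate(r: Risk) -> Risk {
    match r {
        Risk::Safe => Risk::Caution,
        _ => Risk::Dangerous,
    }
}

fn merge(scanner: ScanResult, llm: Risk, llm_confidence: f64) -> Risk {
    // Rule 5: low-confidence LLM verdicts escalate one level first.
    let llm = if llm_confidence < 0.9 { escalate(llm) } else { llm };
    match scanner {
        // Rules 1-3 plus LLM escalation: the stricter verdict wins.
        ScanResult::Known(s) => {
            if s > llm { s } else { llm }
        }
        // Rule 4: scanner Unknown floors the result at 🟡.
        ScanResult::Unknown => {
            if llm > Risk::Caution { llm } else { Risk::Caution }
        }
    }
}
```

Note that the only path to 🟢 is `Known(Safe)` with a high-confidence 🟢 from the LLM, which is exactly what Rule 3's test asserts.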
3.3 Execution Engine State Machine
What to test: Every valid state transition, every invalid transition (must be rejected), trust level enforcement at each transition.
// pkg/executor/state_machine/tests.rs
// Valid transitions
#[test] fn engine_pending_to_preflight_on_start() {}
#[test] fn engine_preflight_to_step_ready_on_prerequisites_met() {}
#[test] fn engine_step_ready_to_auto_execute_for_safe_step_at_copilot_level() {}
#[test] fn engine_step_ready_to_await_approval_for_caution_step() {}
#[test] fn engine_step_ready_to_blocked_for_dangerous_step_at_copilot_level() {}
#[test] fn engine_await_approval_to_executing_on_human_approval() {}
#[test] fn engine_await_approval_to_skipped_on_human_skip() {}
#[test] fn engine_executing_to_step_complete_on_success() {}
#[test] fn engine_executing_to_step_failed_on_error() {}
#[test] fn engine_executing_to_timed_out_on_timeout() {}
#[test] fn engine_step_complete_to_step_ready_when_more_steps() {}
#[test] fn engine_step_complete_to_runbook_complete_when_last_step() {}
#[test] fn engine_step_failed_to_rollback_available_when_rollback_exists() {}
#[test] fn engine_step_failed_to_manual_intervention_when_no_rollback() {}
#[test] fn engine_rollback_available_to_rolling_back_on_approval() {}
#[test] fn engine_rolling_back_to_step_ready_on_rollback_success() {}
#[test] fn engine_rolling_back_to_manual_intervention_on_rollback_failure() {}
#[test] fn engine_runbook_complete_to_divergence_analysis() {}
// Invalid transitions (must be rejected)
#[test] fn engine_cannot_skip_preflight_state() {}
#[test] fn engine_cannot_auto_execute_caution_step_at_copilot_level() {}
#[test] fn engine_cannot_auto_execute_dangerous_step_at_any_v1_level() {}
#[test] fn engine_cannot_transition_from_completed_to_executing() {}
// Trust level enforcement
#[test] fn engine_safe_step_blocked_at_read_only_trust_level() {}
#[test] fn engine_caution_step_requires_approval_at_copilot_level() {}
#[test] fn engine_dangerous_step_blocked_at_copilot_level_v1() {}
#[test] fn engine_trust_level_checked_per_step_not_per_runbook() {}
// Timeout behavior
#[test] fn engine_safe_step_times_out_after_60_seconds() {}
#[test] fn engine_caution_step_times_out_after_120_seconds() {}
#[test] fn engine_dangerous_step_times_out_after_300_seconds() {}
#[test] fn engine_approval_timeout_does_not_auto_approve() {}
#[test] fn engine_approval_timeout_marks_execution_as_stalled() {}
// Idempotency
#[test] fn engine_duplicate_step_execution_id_is_rejected() {}
#[test] fn engine_duplicate_approval_for_same_step_is_idempotent() {}
// Panic mode
#[test] fn engine_checks_panic_mode_before_each_step() {}
#[test] fn engine_pauses_in_flight_execution_when_panic_mode_set() {}
#[test] fn engine_does_not_kill_executing_step_on_panic_mode() {}
Mocking strategy:
- Mock the Agent gRPC client behind a trait object (MockAgentClient)
- Mock the Slack notification sender
- Use an in-memory state store in place of the database for pure state machine tests
- Use tokio::time::pause() for timeout tests (no real waiting)
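The invalid-transition tests are cheap to enumerate when transition validity is a pure lookup. A sketch over a subset of the states (names assumed from the test listing above; the full machine has more states and trust-level guards):

```rust
#[derive(Debug, Clone, Copy)]
enum State {
    Pending,
    Preflight,
    StepReady,
    AwaitApproval,
    Executing,
    StepComplete,
    RunbookComplete,
}

// Transition validity as a pure table; anything not listed is rejected.
fn is_valid(from: State, to: State) -> bool {
    use State::*;
    matches!(
        (from, to),
        (Pending, Preflight)
            | (Preflight, StepReady)
            | (StepReady, AwaitApproval)
            | (StepReady, Executing)
            | (AwaitApproval, Executing)
            | (Executing, StepComplete)
            | (StepComplete, StepReady)
            | (StepComplete, RunbookComplete)
    )
}
```

Keeping the table explicit means "engine_cannot_skip_preflight_state" and "engine_cannot_transition_from_completed_to_executing" are one-line assertions against it.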
3.4 Runbook Parser
What to test: Normalization correctness, LLM output parsing, variable detection, branch mapping.
// pkg/parser/tests.rs
// Normalizer
#[test] fn normalizer_strips_html_tags() {}
#[test] fn normalizer_strips_confluence_macros() {}
#[test] fn normalizer_normalizes_bullet_styles_to_numbered() {}
#[test] fn normalizer_preserves_code_blocks() {}
#[test] fn normalizer_normalizes_whitespace() {}
#[test] fn normalizer_handles_empty_input() {}
#[test] fn normalizer_handles_unicode_content() {}
// LLM output parsing (using recorded fixtures)
#[test] fn parser_extracts_ordered_steps_from_llm_response() {}
#[test] fn parser_handles_llm_returning_empty_steps_array() {}
#[test] fn parser_rejects_llm_response_missing_required_fields() {}
#[test] fn parser_handles_llm_timeout_gracefully() {}
#[test] fn parser_is_idempotent_same_input_same_output() {}
#[test] fn parser_risk_level_is_null_in_output() {} // Parser never classifies
// Variable detection
#[test] fn variable_detector_finds_dollar_sign_vars() {}
#[test] fn variable_detector_finds_angle_bracket_placeholders() {}
#[test] fn variable_detector_finds_curly_brace_templates() {}
#[test] fn variable_detector_identifies_alert_context_sources() {}
#[test] fn variable_detector_identifies_vpn_prerequisite() {}
// Branch mapping
#[test] fn branch_mapper_detects_if_else_conditional() {}
#[test] fn branch_mapper_produces_valid_dag() {}
#[test] fn branch_mapper_handles_nested_conditionals() {}
// Ambiguity detection
#[test] fn ambiguity_highlighter_flags_vague_check_logs_step() {}
#[test] fn ambiguity_highlighter_flags_restart_service_without_name() {}
Mocking strategy: Mock the LLM gateway (dd0c/route) using recorded response fixtures. Use wiremock-rs or a trait-based mock. Never call real LLM in unit tests.
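Two of the placeholder shapes the Variable Detector tests target, <angle-bracket> and {{curly-brace}}, can be exercised with a generic delimiter scan. A toy sketch only; the real detector compiles a regex set:

```rust
// Collect the contents of every open..close delimited span in `text`.
fn find_delimited(text: &str, open: &str, close: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(open) {
        let after = &rest[start + open.len()..];
        match after.find(close) {
            Some(end) => {
                found.push(after[..end].to_string());
                rest = &after[end + close.len()..];
            }
            None => break,
        }
    }
    found
}
```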
3.5 Approval Workflow
What to test: Approval gates cannot be bypassed. Typed confirmation is enforced for 🔴 steps.
// pkg/approval/tests.rs
#[test] fn approval_caution_step_requires_button_click_not_auto() {}
#[test] fn approval_dangerous_step_requires_typed_resource_name() {}
#[test] fn approval_dangerous_step_rejects_wrong_resource_name() {}
#[test] fn approval_dangerous_step_rejects_generic_yes_confirmation() {}
#[test] fn approval_cannot_bulk_approve_multiple_steps() {}
#[test] fn approval_captures_approver_slack_identity() {}
#[test] fn approval_captures_approval_timestamp() {}
#[test] fn approval_modification_logs_original_command() {}
#[test] fn approval_timeout_30min_marks_as_stalled_not_approved() {}
#[test] fn approval_skip_logs_step_as_skipped_with_actor() {}
#[test] fn approval_abort_halts_all_remaining_steps() {}
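The typed-confirmation rule for 🔴 steps reduces to an exact-match check that also rejects generic affirmations. A minimal sketch; the production check would additionally bind the confirmation to the approver identity and the specific step:

```rust
// Exact-match typed confirmation for 🔴 steps. Generic affirmations are
// rejected before the match, so "yes" never approves anything.
fn confirm_dangerous(typed: &str, resource_name: &str) -> bool {
    let typed = typed.trim();
    if matches!(typed.to_ascii_lowercase().as_str(), "yes" | "y" | "ok" | "approve") {
        return false;
    }
    typed == resource_name
}
```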
3.6 Audit Trail
What to test: Schema validation, immutability enforcement, event completeness.
// pkg/audit/tests.rs
#[test] fn audit_event_requires_tenant_id() {}
#[test] fn audit_event_requires_event_type() {}
#[test] fn audit_event_requires_actor_id_and_type() {}
#[test] fn audit_all_execution_event_types_are_valid_enum_values() {}
#[test] fn audit_step_executed_event_includes_command_hash_not_plaintext() {}
#[test] fn audit_step_executed_event_includes_exit_code() {}
#[test] fn audit_classification_event_includes_both_scanner_and_llm_results() {}
#[test] fn audit_classification_event_includes_merge_rule_applied() {}
#[test] fn audit_hash_chain_each_event_references_previous_hash() {}
#[test] fn audit_hash_chain_modification_breaks_chain_verification() {}
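The hash-chain property under test, where each event commits to its predecessor so any in-place edit breaks verification from that point on, can be sketched with std hashing. Illustrative only: the real trail would use a cryptographic hash such as SHA-256 and store hex digests in PostgreSQL.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Event {
    payload: String,
    prev_hash: u64,
    hash: u64,
}

// Append an event whose hash covers its payload plus the previous hash.
fn append(chain: &mut Vec<Event>, payload: &str) {
    let prev_hash = chain.last().map_or(0, |e| e.hash);
    let mut h = DefaultHasher::new();
    (payload, prev_hash).hash(&mut h);
    chain.push(Event { payload: payload.to_string(), prev_hash, hash: h.finish() });
}

// Recompute every link; any tampered payload or broken link fails.
fn verify(chain: &[Event]) -> bool {
    let mut prev = 0u64;
    for e in chain {
        let mut h = DefaultHasher::new();
        (e.payload.as_str(), prev).hash(&mut h);
        if e.prev_hash != prev || e.hash != h.finish() {
            return false;
        }
        prev = e.hash;
    }
    true
}
```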
4. Integration Test Strategy
4.1 Service Boundary Tests
All integration tests use Testcontainers for database dependencies and WireMock (via wiremock-rs) for external HTTP services. gRPC boundaries use in-process test servers.
Parser ↔ LLM Gateway (dd0c/route)
// tests/integration/parser_llm_test.rs
#[tokio::test]
async fn parser_sends_normalized_text_to_llm_with_correct_schema_prompt() {
let mock_route = MockServer::start().await;
Mock::given(method("POST"))
.and(path("/v1/completions"))
.respond_with(ResponseTemplate::new(200).set_body_json(fixture_llm_response()))
.mount(&mock_route)
.await;
let parser = Parser::new(mock_route.uri());
let result = parser.parse("1. kubectl get pods -n payments").await.unwrap();
assert_eq!(result.steps.len(), 1);
assert_eq!(result.steps[0].risk_level, None); // Parser never classifies
}
#[tokio::test]
async fn parser_retries_on_llm_timeout_up_to_3_times() { ... }
#[tokio::test]
async fn parser_returns_error_when_llm_returns_invalid_json() { ... }
#[tokio::test]
async fn parser_handles_llm_returning_no_actionable_steps() { ... }
Runbook format parsing tests:
#[tokio::test]
async fn parser_confluence_html_extracts_steps_correctly() {
// Load fixture: tests/fixtures/runbooks/confluence_payment_latency.html
let raw = include_str!("../fixtures/runbooks/confluence_payment_latency.html");
let result = parser.parse(raw).await.unwrap();
assert!(result.steps.len() > 0);
assert!(result.prerequisites.len() > 0);
}
#[tokio::test]
async fn parser_notion_export_markdown_extracts_steps_correctly() { ... }
#[tokio::test]
async fn parser_plain_markdown_numbered_list_extracts_steps_correctly() { ... }
#[tokio::test]
async fn parser_confluence_with_code_blocks_preserves_commands() { ... }
#[tokio::test]
async fn parser_notion_with_callout_blocks_extracts_prerequisites() { ... }
Classifier ↔ PostgreSQL (Audit Write)
// tests/integration/classifier_audit_test.rs
#[tokio::test]
async fn classifier_writes_classification_event_to_audit_log() {
let pg = TestcontainersPostgres::start().await;
let classifier = Classifier::new(pg.connection_string());
let result = classifier.classify(&step_fixture()).await.unwrap();
let events: Vec<AuditEvent> = pg.query(
"SELECT * FROM audit_events WHERE event_type = 'runbook.classified'"
).await;
assert_eq!(events.len(), 1);
assert_eq!(events[0].event_data["final_classification"], "safe");
}
#[tokio::test]
async fn classifier_audit_record_is_immutable_no_update_permitted() {
// Attempt UPDATE on audit_events — must fail with permission error
let result = pg.execute("UPDATE audit_events SET event_type = 'tampered'").await;
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("permission denied"));
}
Execution Engine ↔ Agent (gRPC)
// tests/integration/engine_agent_grpc_test.rs
#[tokio::test]
async fn engine_sends_execute_step_payload_with_correct_fields() {
let mock_agent = MockAgentServer::start().await;
let engine = ExecutionEngine::new(mock_agent.address());
engine.execute_step(&safe_step_fixture()).await.unwrap();
let received = mock_agent.last_received_command().await;
assert!(received.execution_id.is_some());
assert!(received.step_execution_id.is_some());
assert_eq!(received.risk_level, RiskLevel::Safe);
}
#[tokio::test]
async fn engine_streams_stdout_from_agent_to_slack_in_realtime() { ... }
#[tokio::test]
async fn engine_handles_agent_disconnect_mid_execution() { ... }
#[tokio::test]
async fn engine_rejects_duplicate_step_execution_id_from_agent() { ... }
#[tokio::test]
async fn engine_validates_command_hash_before_sending_to_agent() { ... }
Approvals ↔ Slack
// tests/integration/approval_slack_test.rs
#[tokio::test]
async fn approval_caution_step_posts_block_kit_message_with_approve_button() { ... }
#[tokio::test]
async fn approval_dangerous_step_posts_modal_requiring_typed_confirmation() { ... }
#[tokio::test]
async fn approval_button_click_captures_slack_user_id_as_approver() { ... }
#[tokio::test]
async fn approval_respects_slack_rate_limit_1_message_per_second() { ... }
#[tokio::test]
async fn approval_batches_rapid_output_updates_to_avoid_rate_limit() { ... }
4.2 Testcontainers Setup
// tests/common/testcontainers.rs
pub struct TestDb {
container: ContainerAsync<Postgres>,
pub pool: PgPool,
}
impl TestDb {
pub async fn start() -> Self {
let container = Postgres::default()
.with_env_var("POSTGRES_DB", "dd0c_test")
.start()
.await
.unwrap();
let pool = PgPool::connect(&container.connection_string()).await.unwrap();
// Run migrations
sqlx::migrate!("./migrations").run(&pool).await.unwrap();
// Apply RLS policies
sqlx::query_file!("./tests/fixtures/sql/apply_rls.sql")
.execute(&pool)
.await
.unwrap();
Self { container, pool }
}
pub async fn with_tenant(&self, tenant_id: Uuid) -> TenantScopedDb {
// Sets app.current_tenant_id for RLS enforcement
TenantScopedDb::new(&self.pool, tenant_id)
}
}
4.3 Sandboxed Execution Environment Tests
For testing actual command execution without touching real infrastructure, use Docker-in-Docker (DinD) or a minimal sandbox container.
// tests/integration/sandbox_execution_test.rs
/// Uses a minimal Alpine container as the execution target.
/// The agent connects to this container instead of real infrastructure.
#[tokio::test]
async fn sandbox_safe_command_executes_and_returns_stdout() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("ls /tmp").await.unwrap();
assert_eq!(result.exit_code, 0);
assert!(!result.stdout.is_empty());
}
#[tokio::test]
async fn sandbox_agent_rejects_dangerous_command_before_execution() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("rm -rf /").await;
assert!(result.is_err());
assert_eq!(result.unwrap_err(), AgentError::CommandRejectedByScanner);
// Verify nothing was deleted
assert!(sandbox.path_exists("/etc").await);
}
#[tokio::test]
async fn sandbox_command_timeout_kills_process_and_returns_error() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::with_timeout(Duration::from_secs(2))
.connect_to(sandbox.socket_path())
.await;
let result = agent.execute("sleep 60").await;
assert_eq!(result.unwrap_err(), AgentError::Timeout);
}
#[tokio::test]
async fn sandbox_no_shell_injection_via_command_argument() {
let sandbox = SandboxContainer::start("alpine:3.19").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
// This should execute `echo` with the literal argument, not a shell
let result = agent.execute("echo $(rm -rf /)").await.unwrap();
assert_eq!(result.stdout.trim(), "$(rm -rf /)"); // Literal, not executed
assert!(sandbox.path_exists("/etc").await);
}
4.4 Multi-Tenant RLS Integration Tests
// tests/integration/rls_test.rs
#[tokio::test]
async fn rls_tenant_a_cannot_see_tenant_b_runbooks() {
let db = TestDb::start().await;
let tenant_a = Uuid::new_v4();
let tenant_b = Uuid::new_v4();
// Insert runbook for tenant B
db.insert_runbook(tenant_b, "Tenant B Runbook").await;
// Query as tenant A — must return zero rows, not an error
let db_a = db.with_tenant(tenant_a).await;
let runbooks = db_a.query("SELECT * FROM runbooks").await.unwrap();
assert_eq!(runbooks.len(), 0);
}
#[tokio::test]
async fn rls_cross_tenant_audit_query_returns_zero_rows() { ... }
#[tokio::test]
async fn rls_cross_tenant_execution_query_returns_zero_rows() { ... }
5. E2E & Smoke Tests
5.1 Critical User Journeys
E2E tests run against a full Docker Compose stack with sandboxed infrastructure. No real AWS, no real Kubernetes — all targets are containerized mocks.
Docker Compose E2E Stack:
# tests/e2e/docker-compose.yml
services:
postgres: # Real PostgreSQL with migrations applied
redis: # Real Redis for panic mode key
parser: # Real Parser service
classifier: # Real Classifier service
engine: # Real Execution Engine
slack-mock: # WireMock simulating Slack API
llm-mock: # WireMock with recorded LLM responses
agent: # Real dd0c Agent binary
sandbox-host: # Alpine container as execution target
Journey 1: Full Happy Path (P0)
// tests/e2e/happy_path_test.rs
#[tokio::test]
async fn e2e_paste_runbook_classify_approve_execute_audit_full_journey() {
let stack = E2EStack::start().await;
// Step 1: Paste runbook
let parse_resp = stack.api()
.post("/v1/run/runbooks/parse-preview")
.json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN }))
.send().await;
assert_eq!(parse_resp.status(), 200);
let parsed = parse_resp.json::<ParsePreviewResponse>().await;
assert!(parsed.steps.iter().any(|s| s.risk_level == "safe"));
assert!(parsed.steps.iter().any(|s| s.risk_level == "caution"));
// Step 2: Save runbook
let runbook = stack.api().post("/v1/run/runbooks")
.json(&json!({ "raw_text": FIXTURE_RUNBOOK_MARKDOWN, "title": "E2E Test" }))
.send().await.json::<Runbook>().await;
// Step 3: Start execution
let execution = stack.api().post("/v1/run/executions")
.json(&json!({ "runbook_id": runbook.id, "agent_id": stack.agent_id() }))
.send().await.json::<Execution>().await;
// Step 4: Safe steps auto-execute
stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
// Step 5: Approve caution step
stack.api()
.post(format!("/v1/run/executions/{}/steps/{}/approve", execution.id, caution_step_id))
.send().await;
// Step 6: Wait for completion
let completed = stack.wait_for_execution_state(&execution.id, "completed").await;
assert_eq!(completed.steps_executed, 4);
assert_eq!(completed.steps_failed, 0);
// Step 7: Verify audit trail
let audit_events = stack.db()
.query("SELECT event_type FROM audit_events WHERE execution_id = $1", &[&execution.id])
.await;
let event_types: Vec<&str> = audit_events.iter().map(|e| e.event_type.as_str()).collect();
assert!(event_types.contains(&"execution.started"));
assert!(event_types.contains(&"step.auto_executed"));
assert!(event_types.contains(&"step.approved"));
assert!(event_types.contains(&"step.executed"));
assert!(event_types.contains(&"execution.completed"));
}
Journey 2: Destructive Command Blocked at All Levels (P0)
#[tokio::test]
async fn e2e_dangerous_command_blocked_at_copilot_trust_level() {
let stack = E2EStack::start().await;
let runbook = stack.create_runbook_with_dangerous_step().await;
let execution = stack.start_execution(&runbook.id).await;
// Engine must transition to Blocked, not AwaitApproval or AutoExecute
let step_status = stack.wait_for_step_state(
&execution.id, &dangerous_step_id, "blocked"
).await;
assert_eq!(step_status, "blocked");
// Verify audit event logged the block
let events = stack.audit_events_for_execution(&execution.id).await;
assert!(events.iter().any(|e| e.event_type == "step.blocked_by_trust_level"));
}
Journey 3: Panic Mode Halts In-Flight Execution (P0)
#[tokio::test]
async fn e2e_panic_mode_halts_in_flight_execution_within_1_second() {
let stack = E2EStack::start().await;
// Start a long-running execution
let execution = stack.start_execution_with_slow_steps().await;
stack.wait_for_execution_state(&execution.id, "running").await;
let panic_triggered_at = Instant::now();
// Trigger panic mode
stack.api().post("/v1/run/admin/panic").send().await;
// Execution must be paused within 1 second
stack.wait_for_execution_state(&execution.id, "paused").await;
assert!(panic_triggered_at.elapsed() < Duration::from_secs(1));
// Verify execution is paused, not killed
let exec = stack.get_execution(&execution.id).await;
assert_eq!(exec.status, "paused");
assert_ne!(exec.status, "aborted");
}
Journey 4: Approval Timeout Does Not Auto-Approve (P0)
#[tokio::test]
async fn e2e_approval_timeout_marks_stalled_not_approved() {
let stack = E2EStack::with_approval_timeout(Duration::from_secs(5)).start().await;
let execution = stack.start_execution_with_caution_step().await;
stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
// Wait for timeout to expire — do NOT approve
tokio::time::sleep(Duration::from_secs(6)).await;
let exec = stack.get_execution(&execution.id).await;
assert_eq!(exec.status, "stalled");
assert_ne!(exec.status, "completed"); // Must NOT have auto-approved
}
5.2 Chaos Scenarios
// tests/e2e/chaos_test.rs
#[tokio::test]
async fn chaos_agent_disconnects_mid_execution_engine_pauses_and_alerts() {
let stack = E2EStack::start().await;
let execution = stack.start_long_running_execution().await;
// Kill agent network mid-execution
stack.disconnect_agent().await;
let exec = stack.wait_for_execution_state(&execution.id, "paused").await;
assert_eq!(exec.status, "paused");
// Reconnect agent — execution should be resumable
stack.reconnect_agent().await;
stack.resume_execution(&execution.id).await;
stack.wait_for_execution_state(&execution.id, "completed").await;
}
#[tokio::test]
async fn chaos_database_failover_engine_resumes_from_last_committed_step() {
// Simulate RDS failover — engine must reconnect and resume
}
#[tokio::test]
async fn chaos_llm_gateway_down_classification_falls_back_to_scanner_only() {
// LLM unavailable — scanner-only mode, all unknowns become 🟡
}
#[tokio::test]
async fn chaos_slack_api_outage_execution_pauses_awaiting_approval_channel() {
// Slack down — approval requests queue, no auto-approve
}
#[tokio::test]
async fn chaos_mid_execution_step_failure_triggers_rollback_flow() {
let stack = E2EStack::start().await;
let execution = stack.start_execution_with_failing_step().await;
stack.wait_for_execution_state(&execution.id, "rollback_available").await;
// Approve rollback
stack.approve_rollback(&execution.id, &execution.failed_step_id).await;
let exec = stack.wait_for_execution_state(&execution.id, "completed").await;
let events = stack.audit_events_for_execution(&execution.id).await;
assert!(events.iter().any(|e| e.event_type == "step.rolled_back"));
}
6. Performance & Load Testing
6.1 Parser Throughput
// benches/parser_bench.rs (criterion)
fn bench_normalizer_small_runbook(c: &mut Criterion) {
let input = include_str!("../fixtures/runbooks/small_10_steps.md");
c.bench_function("normalizer_small", |b| {
b.iter(|| Normalizer::new().normalize(black_box(input)))
});
}
fn bench_normalizer_large_runbook(c: &mut Criterion) {
// 500-step runbook, heavy HTML from Confluence
let input = include_str!("../fixtures/runbooks/large_500_steps.html");
c.bench_function("normalizer_large", |b| {
b.iter(|| Normalizer::new().normalize(black_box(input)))
});
}
fn bench_scanner_100_commands(c: &mut Criterion) {
let commands = fixture_100_mixed_commands();
let scanner = Scanner::new();
c.bench_function("scanner_100_commands", |b| {
b.iter(|| {
for cmd in &commands {
black_box(scanner.classify(cmd));
}
})
});
}
Performance targets:
- Normalizer: < 10ms for a 500-step Confluence page
- Scanner: < 1ms per command, < 10ms for 100 commands in batch
- Full parse + classify pipeline: < 5s p95 (including LLM call)
- Classification merge: < 1ms per step
6.2 Concurrent Execution Stress Tests
Use k6 or cargo-based load tests for concurrent execution scenarios:
// tests/load/concurrent_executions.js (k6)
export const options = {
scenarios: {
concurrent_executions: {
executor: 'constant-vus',
vus: 50, // 50 concurrent execution sessions
duration: '5m',
},
},
thresholds: {
http_req_duration: ['p(95)<500'], // API responses < 500ms p95
http_req_failed: ['rate<0.01'], // < 1% error rate
},
};
export default function () {
// Start execution, poll status, approve steps, verify completion
const execution = startExecution(FIXTURE_RUNBOOK_ID);
waitForApprovalGate(execution.id);
approveStep(execution.id, execution.pending_step_id);
waitForCompletion(execution.id);
}
Stress test targets:
- 50 concurrent execution sessions: all complete without errors
- Approval workflow: < 200ms p95 latency for approval API calls
- Audit trail ingestion: handles 1000 events/second without data loss
- Scanner: handles 10,000 classifications/second (batch mode)
6.3 Approval Workflow Latency Under Load
// tests/load/approval_latency_test.rs
#[tokio::test]
async fn approval_workflow_p95_latency_under_100_concurrent_approvals() {
let stack = E2EStack::start().await;
let mut handles = vec![];
for _ in 0..100 {
let stack = stack.clone();
handles.push(tokio::spawn(async move {
let execution = stack.start_execution_with_caution_step().await;
stack.wait_for_execution_state(&execution.id, "awaiting_approval").await;
let start = Instant::now();
stack.approve_step(&execution.id, &execution.pending_step_id).await;
start.elapsed()
}));
}
let latencies: Vec<Duration> = futures::future::join_all(handles)
.await.into_iter().map(|r| r.unwrap()).collect();
let p95 = percentile(&latencies, 95);
assert!(p95 < Duration::from_millis(200), "p95 approval latency: {:?}", p95);
}
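The `percentile` helper used above is not defined elsewhere in this document; a minimal nearest-rank sketch (name and signature assumed from the call site):

```rust
use std::time::Duration;

// Nearest-rank percentile over latency samples: sort a copy, then index at
// ceil(p/100 * n) - 1. Assumes a non-empty slice and 1 <= p <= 100.
fn percentile(latencies: &[Duration], p: usize) -> Duration {
    assert!(!latencies.is_empty(), "need at least one sample");
    assert!((1..=100).contains(&p), "p must be in 1..=100");
    let mut sorted = latencies.to_vec();
    sorted.sort();
    let rank = (p * sorted.len() + 99) / 100; // integer ceil(p/100 * n)
    sorted[rank - 1]
}
```

With 100 evenly spaced samples, `percentile(&samples, 95)` returns the 95th-smallest value, which is what the 200ms assertion above compares against.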
7. CI/CD Pipeline Integration
7.1 Test Stages
┌─────────────────────────────────────────────────────────────────────┐
│ CI/CD TEST PIPELINE │
│ │
│ PRE-COMMIT (local, < 30s) │
│ ├── cargo fmt --check │
│ ├── cargo clippy -- -D warnings │
│ ├── Unit tests for changed packages only │
│ └── Canary suite (50 known-destructive commands must stay 🔴) │
│ │
│ PR GATE (CI, < 10 min) │
│ ├── Full unit test suite (all packages) │
│ ├── Canary suite (mandatory — build fails if any canary is 🟢) │
│ ├── Integration tests (Testcontainers) │
│ ├── Coverage check (see thresholds below) │
│ ├── Decision log check (PRs touching classifier/executor/parser │
│ │ must include decision_log.json) │
│ └── Expired feature flag check (CI blocks if flag TTL exceeded) │
│ │
│ MERGE TO MAIN (CI, < 20 min) │
│ ├── Full unit + integration suite │
│ ├── E2E smoke tests (Docker Compose stack) │
│ ├── Performance regression check (criterion baselines) │
│ └── Schema migration validation │
│ │
│ DEPLOY TO STAGING (post-merge, < 30 min) │
│ ├── E2E full suite against staging environment │
│ ├── Chaos scenarios (agent disconnect, DB failover) │
│ └── Load test (50 concurrent executions, 5 min) │
│ │
│ DEPLOY TO PRODUCTION (manual gate after staging) │
│ ├── Smoke test: parse-preview endpoint responds < 5s │
│ ├── Smoke test: agent heartbeat received │
│ └── Smoke test: audit trail write succeeds │
└─────────────────────────────────────────────────────────────────────┘
7.2 Coverage Thresholds
# .cargo/coverage.toml (enforced via cargo-tarpaulin in CI)
[thresholds]
# Safety-critical components — highest bar
"pkg/classifier/scanner" = 100 # Every pattern must be tested
"pkg/classifier/merge" = 100 # Every merge rule must be tested
"pkg/executor/state_machine" = 95 # Every state transition
"pkg/executor/trust" = 95 # Trust level enforcement
"pkg/approval" = 95 # Approval gates
# Core components
"pkg/parser" = 90
"pkg/audit" = 90
"pkg/agent/scanner" = 100 # Agent-side scanner: same as SaaS-side
# Supporting components
"pkg/matcher" = 85
"pkg/slack" = 80
"pkg/api" = 80
# Overall project minimum
"overall" = 85
CI enforcement:
# .github/workflows/ci.yml (excerpt)
- name: Check coverage thresholds
run: |
cargo tarpaulin --out Json --output-dir coverage/
python scripts/check_coverage_thresholds.py coverage/tarpaulin-report.json
- name: Run canary suite (MANDATORY)
run: cargo test --package dd0c-classifier canary_suite -- --nocapture
# This job failing blocks ALL other jobs
- name: Check decision logs for safety-critical PRs
run: |
CHANGED=$(git diff --name-only origin/main)
if echo "$CHANGED" | grep -qE "pkg/(parser|classifier|executor|approval)/"; then
python scripts/check_decision_log.py
fi
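The logic inside `scripts/check_coverage_thresholds.py` reduces to a per-package comparison against the table above. A sketch of that check (here in Rust for consistency with the rest of this document; the real script is Python, and all names are illustrative):

```rust
use std::collections::HashMap;

// Compares measured per-package coverage against required minimums.
// A package missing from the report counts as 0% covered: the check fails closed.
fn failing_packages(
    thresholds: &HashMap<&str, f64>,
    measured: &HashMap<&str, f64>,
) -> Vec<String> {
    let mut failures: Vec<String> = thresholds
        .iter()
        .filter_map(|(pkg, min)| {
            let actual = measured.get(pkg).copied().unwrap_or(0.0);
            (actual < *min).then(|| format!("{pkg}: {actual:.1}% < {min:.1}%"))
        })
        .collect();
    failures.sort();
    failures
}
```

A non-empty result is what makes the CI step exit non-zero and block the PR.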
7.3 Test Parallelization Strategy
# GitHub Actions matrix strategy
jobs:
unit-tests:
strategy:
matrix:
package:
- dd0c-classifier # Runs first — safety-critical
- dd0c-executor # Runs first — safety-critical
- dd0c-parser
- dd0c-audit
- dd0c-approval
- dd0c-matcher
- dd0c-slack
- dd0c-api
steps:
- run: cargo test --package ${{ matrix.package }}
canary-suite:
# No needs key: runs in parallel with unit tests (an empty needs list is invalid in GitHub Actions)
steps:
- run: cargo test --package dd0c-classifier canary_suite
integration-tests:
needs: [unit-tests, canary-suite] # Only after unit tests pass
strategy:
matrix:
suite:
- parser-llm
- classifier-audit
- engine-agent
- approval-slack
- rls-isolation
steps:
- run: cargo test --test ${{ matrix.suite }}
e2e-tests:
needs: [integration-tests]
steps:
- run: docker compose -f tests/e2e/docker-compose.yml up -d
- run: cargo test --test e2e
Parallelization rules:
- Canary suite runs in parallel with unit tests — never blocked
- Integration tests only start after ALL unit tests pass
- E2E tests only start after ALL integration tests pass
- Each test package runs in its own job (parallel matrix)
- Testcontainers instances are per-test, not shared (no state leakage)
8. Transparent Factory Tenet Testing
8.1 Feature Flag Tests (Atomic Flagging — Story 10.1)
// pkg/flags/tests.rs
// Basic flag evaluation
#[test] fn flag_evaluates_locally_no_network_call() {}
#[test] fn flag_disabled_by_default_for_new_execution_paths() {}
#[test] fn flag_requires_owner_field_or_validation_fails() {}
#[test] fn flag_requires_ttl_field_or_validation_fails() {}
#[test] fn flag_expired_ttl_is_treated_as_disabled() {}
// Destructive flag 48-hour bake enforcement
#[test] fn flag_destructive_true_requires_48h_bake_before_full_rollout() {
let flag = FeatureFlag {
name: "enable_kubectl_delete_execution",
destructive: true,
rollout_percentage: 100,
bake_started_at: Some(Utc::now() - Duration::hours(24)), // Only 24h
..Default::default()
};
let result = FlagValidator::validate(&flag);
assert!(result.is_err());
assert!(result.unwrap_err().contains("48-hour bake required for destructive flags"));
}
#[test] fn flag_destructive_true_at_10_percent_before_48h_is_valid() {
let flag = FeatureFlag {
destructive: true,
rollout_percentage: 10,
bake_started_at: Some(Utc::now() - Duration::hours(12)),
..Default::default()
};
assert!(FlagValidator::validate(&flag).is_ok());
}
#[test] fn flag_destructive_true_at_100_percent_after_48h_is_valid() {
let flag = FeatureFlag {
destructive: true,
rollout_percentage: 100,
bake_started_at: Some(Utc::now() - Duration::hours(49)),
..Default::default()
};
assert!(FlagValidator::validate(&flag).is_ok());
}
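Read together, these three tests pin down the bake rule: a destructive flag may sit at a small canary percentage immediately, but full rollout requires 48 hours of bake. A minimal sketch of that validator (field names follow the tests; treating anything above 10% as "full rollout" is an assumption):

```rust
// Validates the destructive-flag bake rule implied by the tests above.
// Assumption: any rollout above the 10% canary requires the full 48h bake.
fn validate_bake(destructive: bool, rollout_percentage: u8, bake_hours: i64) -> Result<(), String> {
    if destructive && rollout_percentage > 10 && bake_hours < 48 {
        return Err("48-hour bake required for destructive flags".to_string());
    }
    Ok(())
}
```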
// Circuit breaker
#[test] fn circuit_breaker_triggers_after_2_failures_in_10_minute_window() {
let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
breaker.record_failure();
assert_eq!(breaker.state(), CircuitState::Closed);
breaker.record_failure();
assert_eq!(breaker.state(), CircuitState::Open); // Trips on 2nd failure
}
#[test] fn circuit_breaker_open_disables_flag_immediately() {
let mut breaker = CircuitBreaker::new("enable_new_parser", 2, Duration::minutes(10));
breaker.record_failure();
breaker.record_failure();
let flag_store = FlagStore::with_breaker(breaker);
assert!(!flag_store.is_enabled("enable_new_parser"));
}
#[test] fn circuit_breaker_pauses_in_flight_executions_not_kills() {
// Verify executions are paused (status=paused), not aborted (status=aborted)
}
#[test] fn circuit_breaker_resets_after_window_expires() {
let mut breaker = CircuitBreaker::new("flag", 2, Duration::minutes(10));
breaker.record_failure();
breaker.record_failure();
// Advance time past window
breaker.advance_time(Duration::minutes(11));
assert_eq!(breaker.state(), CircuitState::Closed);
}
#[test] fn ci_blocks_if_flag_at_100_percent_past_ttl() {
// This test validates the CI check script logic
let flags = vec![
FeatureFlag { name: "old_flag", rollout_percentage: 100, ttl: expired_ttl(), ..Default::default() }
];
let result = CiValidator::check_expired_flags(&flags);
assert!(result.is_err());
}
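A sliding-window breaker satisfying the tests above can be sketched as follows (the `advance_time` test hook is modeled here by passing `now` in explicitly; struct and method names are assumptions):

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum CircuitState { Closed, Open }

// Opens once `threshold` failures fall inside the trailing `window`;
// failures older than the window age out, which is the reset behavior.
struct CircuitBreaker {
    threshold: usize,
    window: Duration,
    failures: Vec<Instant>,
}

impl CircuitBreaker {
    fn new(threshold: usize, window: Duration) -> Self {
        Self { threshold, window, failures: Vec::new() }
    }

    fn record_failure_at(&mut self, at: Instant) {
        self.failures.push(at);
    }

    fn state_at(&self, now: Instant) -> CircuitState {
        let recent = self
            .failures
            .iter()
            .filter(|t| now.duration_since(**t) <= self.window)
            .count();
        if recent >= self.threshold { CircuitState::Open } else { CircuitState::Closed }
    }
}
```

Because the state is derived from timestamps rather than a stored flag, no explicit reset transition is needed: the breaker closes itself once the failures age out.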
8.2 Schema Migration Validation Tests (Elastic Schema — Story 10.2)
// tests/migrations/schema_validation_test.rs
#[tokio::test]
async fn migration_does_not_remove_existing_columns() {
let db = TestDb::start().await;
let columns_before = db.get_column_names("audit_events").await;
// Apply all pending migrations
sqlx::migrate!("./migrations").run(&db.pool).await.unwrap();
let columns_after = db.get_column_names("audit_events").await;
// Every column that existed before must still exist
for col in &columns_before {
assert!(columns_after.contains(col),
"Migration removed column '{}' from audit_events — FORBIDDEN", col);
}
}
#[tokio::test]
async fn migration_does_not_change_existing_column_types() {
// Verify no type changes on existing columns
}
#[tokio::test]
async fn migration_does_not_rename_existing_columns() {
// Verify column names are stable
}
#[tokio::test]
async fn audit_events_table_has_no_update_permission_for_app_role() {
let db = TestDb::start().await;
let result = db.as_app_role()
.execute("UPDATE audit_events SET event_type = 'tampered' WHERE 1=1")
.await;
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("permission denied"));
}
#[tokio::test]
async fn audit_events_table_has_no_delete_permission_for_app_role() {
let db = TestDb::start().await;
let result = db.as_app_role()
.execute("DELETE FROM audit_events WHERE 1=1")
.await;
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("permission denied"));
}
#[tokio::test]
async fn execution_log_parsers_ignore_unknown_fields() {
// Simulate a future schema with extra fields — old parser must not fail
let event_json = json!({
"id": Uuid::new_v4(),
"event_type": "step.executed",
"unknown_future_field": "some_value", // New field old parser doesn't know
"tenant_id": Uuid::new_v4(),
"created_at": Utc::now(),
});
let result = AuditEvent::from_json(&event_json);
assert!(result.is_ok()); // Must not fail on unknown fields
}
#[tokio::test]
async fn migration_includes_sunset_date_comment() {
// Parse migration files and verify each has a sunset_date comment
let migrations = read_migration_files("./migrations");
for migration in &migrations {
if migration.is_additive() {
assert!(migration.content.contains("-- sunset_date:"),
"Migration {} missing sunset_date comment", migration.name);
}
}
}
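The additive-only invariant in `migration_does_not_remove_existing_columns` boils down to a set difference over column names. A standalone sketch of that comparison:

```rust
use std::collections::HashSet;

// Returns every pre-migration column that no longer exists afterwards.
// An empty result means the migration was purely additive.
fn removed_columns<'a>(before: &[&'a str], after: &[&str]) -> Vec<&'a str> {
    let surviving: HashSet<&str> = after.iter().copied().collect();
    before
        .iter()
        .copied()
        .filter(|col| !surviving.contains(col))
        .collect()
}
```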
8.3 Decision Log Format Validation (Cognitive Durability — Story 10.3)
// tests/decisions/decision_log_test.rs
#[test]
fn decision_log_schema_requires_all_mandatory_fields() {
let incomplete_log = json!({
"prompt": "Why is rm -rf dangerous?",
// Missing: reasoning, alternatives_considered, confidence, timestamp, author
});
let result = DecisionLog::validate(&incomplete_log);
assert!(result.is_err());
}
#[test]
fn decision_log_confidence_must_be_between_0_and_1() {
let log = DecisionLog { confidence: 1.5, ..valid_decision_log() };
assert!(DecisionLog::validate(&log).is_err());
}
#[test]
fn decision_log_destructive_command_list_change_requires_reasoning() {
// Any PR adding/removing from the destructive command list must have
// a decision log explaining why
let change = DestructiveCommandListChange {
command: "kubectl drain",
action: ChangeAction::Add,
decision_log: None, // Missing
};
let result = DestructiveCommandChangeValidator::validate(&change);
assert!(result.is_err());
assert!(result.unwrap_err().contains("decision log required for destructive command list changes"));
}
#[test]
fn all_existing_decision_logs_are_valid_json() {
let logs = glob::glob("docs/decisions/*.json").unwrap();
for log_path in logs {
let content = std::fs::read_to_string(log_path.unwrap()).unwrap();
let parsed: serde_json::Value = serde_json::from_str(&content)
.expect("Decision log must be valid JSON");
assert!(DecisionLog::validate(&parsed).is_ok());
}
}
#[test]
fn cyclomatic_complexity_cap_enforced_at_10() {
// This is validated by the clippy lint in CI:
// #![deny(clippy::cognitive_complexity)]
// Test here validates the lint config is present
let clippy_config = std::fs::read_to_string(".clippy.toml").unwrap();
assert!(clippy_config.contains("cognitive-complexity-threshold = 10"));
}
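The validation behavior these tests specify can be sketched with plain stdlib types, using Option to model fields that may be absent in the raw JSON (the struct shape is an assumption; the real DecisionLog is likely deserialized via serde):

```rust
// Option models "field may be missing" without pulling in a JSON crate.
#[derive(Default)]
struct DecisionLogDraft {
    prompt: Option<String>,
    reasoning: Option<String>,
    alternatives_considered: Option<Vec<String>>,
    confidence: Option<f64>,
    timestamp: Option<String>,
    author: Option<String>,
}

// Mandatory-field presence plus the [0, 1] confidence range from the tests.
fn validate(log: &DecisionLogDraft) -> Result<(), String> {
    let all_present = log.prompt.is_some()
        && log.reasoning.is_some()
        && log.alternatives_considered.is_some()
        && log.timestamp.is_some()
        && log.author.is_some();
    if !all_present {
        return Err("missing mandatory field".to_string());
    }
    match log.confidence {
        None => Err("missing mandatory field".to_string()),
        Some(c) if !(0.0..=1.0).contains(&c) => {
            Err("confidence must be between 0 and 1".to_string())
        }
        Some(_) => Ok(()),
    }
}
```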
8.4 OTEL Span Assertion Tests (Semantic Observability — Story 10.4)
// tests/observability/otel_spans_test.rs
#[tokio::test]
async fn otel_runbook_execution_creates_parent_span() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let spans = tracer.finished_spans();
let parent = spans.iter().find(|s| s.name == "runbook_execution");
assert!(parent.is_some(), "runbook_execution parent span must exist");
}
#[tokio::test]
async fn otel_each_step_creates_child_spans_3_levels_deep() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook_with_one_step()).await.unwrap();
let spans = tracer.finished_spans();
let step_classification = spans.iter().find(|s| s.name == "step_classification");
let step_approval = spans.iter().find(|s| s.name == "step_approval_check");
let step_execution = spans.iter().find(|s| s.name == "step_execution");
assert!(step_classification.is_some());
assert!(step_approval.is_some());
assert!(step_execution.is_some());
// Verify parent-child hierarchy
let parent_id = spans.iter().find(|s| s.name == "runbook_execution").unwrap().span_id;
assert_eq!(step_classification.unwrap().parent_span_id, Some(parent_id));
}
#[tokio::test]
async fn otel_step_classification_span_has_required_attributes() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_classification").unwrap();
assert!(span.attributes.contains_key("step.text_hash"));
assert!(span.attributes.contains_key("step.classified_as"));
assert!(span.attributes.contains_key("step.confidence_score"));
assert!(span.attributes.contains_key("step.alternatives_considered"));
}
#[tokio::test]
async fn otel_step_execution_span_has_required_attributes() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_execution").unwrap();
assert!(span.attributes.contains_key("step.command_hash"));
assert!(span.attributes.contains_key("step.target_host_hash"));
assert!(span.attributes.contains_key("step.exit_code"));
assert!(span.attributes.contains_key("step.duration_ms"));
// Must NOT contain raw command or PII
assert!(!span.attributes.contains_key("step.command_raw"));
assert!(!span.attributes.contains_key("step.stdout_raw"));
}
#[tokio::test]
async fn otel_step_approval_span_has_required_attributes() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_approval_check").unwrap();
assert!(span.attributes.contains_key("step.approval_required"));
assert!(span.attributes.contains_key("step.approval_source"));
assert!(span.attributes.contains_key("step.approval_latency_ms"));
}
#[tokio::test]
async fn otel_no_pii_in_any_span_attributes() {
// Scan all span attributes for patterns that look like PII or raw commands
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let spans = tracer.finished_spans();
for span in &spans {
for (key, value) in &span.attributes {
assert!(!key.ends_with("_raw"), "Raw data in span attribute: {}", key);
assert!(!looks_like_email(value), "PII in span: {} = {}", key, value);
}
}
}
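The `looks_like_email` helper this scan relies on can be a cheap heuristic rather than a full RFC parser, since it only needs to catch obvious leaks (a sketch; the real check may add more patterns):

```rust
// Flags anything shaped like local@domain.tld. Deliberately loose: a false
// positive here fails the build, which is the safe direction for a PII check.
fn looks_like_email(value: &str) -> bool {
    match value.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty()
                && domain.contains('.')
                && !domain.starts_with('.')
                && !domain.ends_with('.')
        }
        None => false,
    }
}
```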
#[tokio::test]
async fn otel_ai_classification_span_includes_model_metadata() {
let tracer = InMemoryTracer::new();
let engine = ExecutionEngine::with_tracer(tracer.clone());
engine.execute_runbook(&fixture_runbook()).await.unwrap();
let span = tracer.span_by_name("step_classification").unwrap();
assert!(span.attributes.contains_key("ai.prompt_hash"));
assert!(span.attributes.contains_key("ai.model_version"));
assert!(span.attributes.contains_key("ai.reasoning_chain"));
}
8.5 Governance Policy Enforcement Tests (Configurable Autonomy — Story 10.5)
// pkg/governance/tests.rs
// Strict mode: all steps require approval
#[test]
fn governance_strict_mode_requires_approval_for_safe_steps() {
let policy = Policy { governance_mode: GovernanceMode::Strict, ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
let next = engine.next_state(&safe_step(), TrustLevel::Copilot);
assert_eq!(next, State::AwaitApproval); // Even safe steps need approval in strict mode
}
// Audit mode: safe steps auto-execute, destructive always require approval
#[test]
fn governance_audit_mode_auto_executes_safe_steps() {
let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
assert_eq!(engine.next_state(&safe_step(), TrustLevel::Copilot), State::AutoExecute);
}
#[test]
fn governance_audit_mode_still_requires_approval_for_dangerous_steps() {
let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
assert_eq!(engine.next_state(&dangerous_step(), TrustLevel::Copilot), State::AwaitApproval);
}
#[test]
fn governance_no_fully_autonomous_mode_exists() {
// GovernanceMode deliberately has no FullAuto variant. This test pins the
// enum to exactly Strict and Audit, so adding a third mode fails CI loudly.
let modes = GovernanceMode::all_variants();
assert_eq!(modes, [GovernanceMode::Strict, GovernanceMode::Audit]);
}
// Panic mode
#[tokio::test]
async fn governance_panic_mode_halts_all_executions_within_1_second() {
// Verified in E2E section — unit test verifies Redis key check
let redis = MockRedis::new();
let engine = ExecutionEngine::with_redis(redis.clone());
redis.set("dd0c:panic", "1").await;
let result = engine.check_panic_mode().await;
assert_eq!(result, PanicModeStatus::Active);
}
#[tokio::test]
async fn governance_panic_mode_uses_redis_not_database() {
// Panic mode must NOT do a DB query — Redis only for <1s requirement
let db = TrackingDb::new();
let redis = MockRedis::new();
redis.set("dd0c:panic", "1").await;
let engine = ExecutionEngine::with_db(db.clone()).with_redis(redis);
engine.check_panic_mode().await;
assert_eq!(db.query_count(), 0, "Panic mode check must not query the database");
}
#[tokio::test(start_paused = true)] // paused tokio clock so advance() works
async fn governance_panic_mode_requires_manual_clearance() {
// Panic mode cannot be auto-cleared — only manual API call
let engine = ExecutionEngine::new();
engine.trigger_panic_mode().await;
// Simulate time passing
tokio::time::advance(Duration::from_secs(24 * 60 * 60)).await;
// Must still be in panic mode
assert_eq!(engine.check_panic_mode().await, PanicModeStatus::Active);
}
// Governance drift monitoring
#[tokio::test]
async fn governance_drift_auto_downgrades_when_auto_execution_exceeds_threshold() {
let db = TestDb::start().await;
// Insert execution history: 80% auto-executed (threshold is 70%)
insert_execution_history_with_auto_ratio(&db, 0.80).await;
let drift_monitor = GovernanceDriftMonitor::new(&db.pool);
let result = drift_monitor.check().await;
assert_eq!(result.action, DriftAction::DowngradeToStrict);
}
// Per-runbook governance override
#[test]
fn governance_runbook_locked_to_strict_ignores_system_audit_mode() {
let policy = Policy { governance_mode: GovernanceMode::Audit, ..Default::default() };
let runbook = Runbook { governance_override: Some(GovernanceMode::Strict), ..Default::default() };
let engine = ExecutionEngine::with_policy(policy);
// Even in audit mode, this runbook requires approval for safe steps
assert_eq!(engine.next_state_for_runbook(&safe_step(), &runbook), State::AwaitApproval);
}
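The governance tests above collapse to a small decision table: audit mode auto-executes only 🟢 steps, and nothing ever auto-executes a 🔴. A sketch of that core (enum and state names follow the tests; the trust-level dimension is elided):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum GovernanceMode { Strict, Audit } // deliberately no FullAuto variant

#[derive(Clone, Copy, Debug, PartialEq)]
enum Risk { Safe, Caution, Dangerous }

#[derive(Debug, PartialEq)]
enum State { AutoExecute, AwaitApproval }

// Audit mode lets only 🟢 steps through without a human; every other
// combination, including all of strict mode, lands at the approval gate.
fn next_state(mode: GovernanceMode, risk: Risk) -> State {
    match (mode, risk) {
        (GovernanceMode::Audit, Risk::Safe) => State::AutoExecute,
        _ => State::AwaitApproval,
    }
}
```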
9. Test Data & Fixtures
9.1 Runbook Format Factories
// tests/fixtures/runbooks/mod.rs
pub fn fixture_runbook_markdown() -> &'static str {
include_str!("markdown_basic.md")
}
pub fn fixture_runbook_confluence_html() -> &'static str {
include_str!("confluence_export.html")
}
pub fn fixture_runbook_notion_export() -> &'static str {
include_str!("notion_export.md")
}
pub fn fixture_runbook_with_ambiguities() -> &'static str {
include_str!("ambiguous_steps.md")
}
pub fn fixture_runbook_with_variables() -> &'static str {
include_str!("variables_and_placeholders.md")
}
9.2 Step Classification Fixtures
// tests/fixtures/commands/mod.rs
pub fn fixture_safe_commands() -> Vec<&'static str> {
vec![
"kubectl get pods -n kube-system",
"aws ec2 describe-instances --region us-east-1",
"cat /var/log/syslog | grep error",
"SELECT count(*) FROM users",
]
}
pub fn fixture_caution_commands() -> Vec<&'static str> {
vec![
"kubectl rollout restart deployment/api",
"systemctl restart nginx",
"aws ec2 stop-instances --instance-ids i-1234567890abcdef0",
"UPDATE users SET status = 'active' WHERE id = 123",
]
}
pub fn fixture_destructive_commands() -> Vec<&'static str> {
vec![
"kubectl delete namespace prod",
"rm -rf /var/lib/postgresql/data",
"DROP TABLE customers",
"aws rds delete-db-instance --db-instance-identifier prod-db",
"terraform destroy -auto-approve",
]
}
pub fn fixture_ambiguous_commands() -> Vec<&'static str> {
vec![
"restart the service",
"./cleanup.sh",
"python script.py",
"curl -X POST http://internal-api/reset",
]
}
9.3 Infrastructure Target Mocks
We mock infrastructure targets for execution tests using isolated containers or HTTP mock servers.
// tests/fixtures/infra/mod.rs
/// Spawns a lightweight k3s container for testing kubectl commands safely
pub async fn mock_k8s_cluster() -> K3sContainer {
K3sContainer::start().await
}
/// Spawns LocalStack for testing AWS CLI commands
pub async fn mock_aws_env() -> LocalStackContainer {
LocalStackContainer::start().await
}
/// Spawns a bare Alpine container with SSH access
pub async fn mock_bare_metal() -> SshContainer {
SshContainer::start("alpine:latest").await
}
9.4 Approval Workflow Scenario Fixtures
// tests/fixtures/approvals/mod.rs
pub fn fixture_slack_approval_payload(step_id: &str, user_id: &str) -> serde_json::Value {
json!({
"type": "block_actions",
"user": { "id": user_id, "username": "riley.oncall" },
"actions": [{
"action_id": "approve_step",
"value": step_id
}]
})
}
pub fn fixture_slack_typed_confirmation_payload(step_id: &str, resource_name: &str) -> serde_json::Value {
json!({
"type": "view_submission",
"user": { "id": "U123456" },
"view": {
"state": {
"values": {
"confirm_block": {
"resource_input": { "value": resource_name }
}
}
},
"private_metadata": step_id
}
})
}
10. TDD Implementation Order
To maintain the safety-first invariant, components must be built and tested in a specific order. Execution code cannot be written until the safety constraints are proven.
10.1 Bootstrap Sequence (Test Infrastructure First)
- Testcontainers Setup: Establish TestDb with migrations and RLS policies. Prove cross-tenant isolation fails closed.
- OTEL Test Tracer: Implement InMemoryTracer to assert span creation and attributes.
- Canary Suite Harness: Create the canary_suite test target that runs a hardcoded list of destructive commands and fails if any return 🟢.
10.2 Epic Dependency Order
| Phase | Component | TDD Rule | Rationale |
|---|---|---|---|
| 1 | Deterministic Safety Scanner | Unit tests FIRST | Foundation of safety. Exhaustive pattern tests must exist before any parser or execution logic. |
| 2 | Merge Engine | Unit tests FIRST | Hardcoded rules. Prove 🔴 overrides 🟢 before integrating LLMs. |
| 3 | Execution Engine State Machine | Unit tests FIRST | CRITICAL: Prove trust level boundaries and approval gates block 🔴/🟡 steps before writing any code that actually executes commands. |
| 4 | Agent-Side Scanner | Unit tests FIRST | Port SaaS scanner logic to Agent binary. Prove the Agent rejects rm -rf independently. |
| 5 | Agent gRPC & Command Execution | Integration tests FIRST | Use sandbox containers. Prove timeout kills processes and shell injection fails. |
| 6 | Runbook Parser | Integration tests lead | Use LLM fixtures. The parser is safe because the classifier catches its mistakes. |
| 7 | Audit Trail | Unit tests FIRST | Prove schema immutability and hash chaining. |
| 8 | Slack Bot & API | Integration tests lead | UI and routing. |
10.3 The Execution Engine Testing Mandate
Execution engine tests MUST be written before any execution code.
Before writing the impl ExecutionEngine { pub async fn execute(...) } function, the following tests must exist and fail (Red phase):
- engine_dangerous_step_blocked_at_copilot_level_v1
- engine_caution_step_requires_approval_at_copilot_level
- engine_safe_step_blocked_at_read_only_trust_level
- engine_duplicate_step_execution_id_is_rejected
- engine_pauses_in_flight_execution_when_panic_mode_set
Only once these tests are defined can the state machine be implemented to make them pass (Green phase). This ensures no execution path can bypass the Trust Gradient.
11. Review Remediation Addendum (Post-Gemini Review)
The following sections address all gaps identified in the TDD review. These are net-new test specifications that must be integrated into the relevant sections above during implementation.
11.1 Missing Epic Coverage
Epic 3.4: Divergence Analysis
// pkg/executor/divergence/tests.rs
#[test] fn divergence_detects_extra_command_not_in_runbook() {}
#[test] fn divergence_detects_modified_command_vs_prescribed() {}
#[test] fn divergence_detects_skipped_step_not_marked_as_skipped() {}
#[test] fn divergence_report_includes_diff_of_prescribed_vs_actual() {}
#[test] fn divergence_flags_env_var_changes_made_during_execution() {}
#[test] fn divergence_ignores_whitespace_differences_in_commands() {}
#[test] fn divergence_analysis_runs_automatically_after_execution_completes() {}
#[test] fn divergence_report_written_to_audit_trail() {}
#[tokio::test]
async fn integration_divergence_analysis_detects_agent_side_extra_commands() {
// Agent executes an extra `whoami` not in the runbook
// Divergence analyzer must flag it
}
Epic 5.3: Compliance Export
// pkg/audit/export/tests.rs
#[tokio::test] async fn export_generates_valid_csv_for_date_range() {}
#[tokio::test] async fn export_generates_valid_pdf_with_execution_summary() {}
#[tokio::test] async fn export_uploads_to_s3_and_returns_presigned_url() {}
#[tokio::test] async fn export_presigned_url_expires_after_24_hours() {}
#[tokio::test] async fn export_scoped_to_tenant_via_rls() {}
#[tokio::test] async fn export_includes_hash_chain_verification_status() {}
#[tokio::test] async fn export_redacts_command_output_but_includes_hashes() {}
Epic 6.4: Classification Query API Rate Limiting
// tests/integration/api_rate_limit_test.rs
#[tokio::test]
async fn api_rate_limit_30_requests_per_minute_per_tenant() {
let stack = E2EStack::start().await;
for _ in 0..30 {
let resp = stack.api().get("/v1/run/classifications").send().await;
assert_eq!(resp.status(), 200);
}
// 31st request must be rate-limited
let resp = stack.api().get("/v1/run/classifications").send().await;
assert_eq!(resp.status(), 429);
}
#[tokio::test]
async fn api_rate_limit_resets_after_60_seconds() {}
#[tokio::test]
async fn api_rate_limit_is_per_tenant_not_global() {
// Tenant A hitting limit must not affect Tenant B
}
#[tokio::test]
async fn api_rate_limit_returns_retry_after_header() {}
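The per-tenant behavior these tests pin down fits a fixed-window counter keyed by tenant. A minimal sketch (time is passed in explicitly so the 60-second reset stays testable; struct and method names are assumptions):

```rust
use std::collections::HashMap;

// Fixed-window limiter: `limit` requests per tenant per 60-second window.
// Keying the counter by tenant is what makes the limit per-tenant, not global.
struct RateLimiter {
    limit: u32,
    windows: HashMap<String, (u64, u32)>, // tenant -> (window index, count)
}

impl RateLimiter {
    fn new(limit: u32) -> Self {
        Self { limit, windows: HashMap::new() }
    }

    // Returns true if this request is allowed in the current window.
    fn allow(&mut self, tenant: &str, now_secs: u64) -> bool {
        let window = now_secs / 60;
        let entry = self.windows.entry(tenant.to_string()).or_insert((window, 0));
        if entry.0 != window {
            *entry = (window, 0); // window rolled over: counter resets
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

A denial would map to HTTP 429 with a Retry-After header computed from the seconds remaining in the current window.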
Epic 7: Dashboard UI (Playwright)
// tests/e2e/ui/dashboard.spec.ts
test('parse preview renders within 5 seconds of paste', async ({ page }) => {
await page.goto('/dashboard/runbooks/new');
await page.fill('[data-testid="runbook-input"]', FIXTURE_RUNBOOK);
const preview = page.locator('[data-testid="parse-preview"]');
await expect(preview).toBeVisible({ timeout: 5000 });
await expect(preview.locator('.step-card')).toHaveCount(4);
});
test('trust level visualization shows correct colors per step', async ({ page }) => {
// 🟢 safe = green, 🟡 caution = yellow, 🔴 dangerous = red
});
test('MTTR dashboard loads and displays chart', async ({ page }) => {
await page.goto('/dashboard/analytics');
await expect(page.locator('[data-testid="mttr-chart"]')).toBeVisible();
});
test('execution timeline shows real-time step progress', async ({ page }) => {});
test('approval modal requires typed confirmation for dangerous steps', async ({ page }) => {});
test('panic mode banner appears when panic is active', async ({ page }) => {});
Epic 9: Onboarding & PLG
// pkg/onboarding/tests.rs
#[test] fn free_tier_allows_5_runbooks() {}
#[test] fn free_tier_allows_50_executions_per_month() {}
#[test] fn free_tier_rejects_6th_runbook_with_upgrade_prompt() {}
#[test] fn free_tier_rejects_51st_execution_with_upgrade_prompt() {}
#[test] fn free_tier_counter_resets_monthly() {}
#[test] fn agent_install_snippet_includes_correct_api_key() {}
#[test] fn agent_install_snippet_includes_correct_gateway_url() {}
#[test] fn agent_install_snippet_is_valid_bash() {}
#[tokio::test] async fn stripe_checkout_creates_session_with_correct_pricing() {}
#[tokio::test] async fn stripe_webhook_checkout_completed_upgrades_tenant() {}
#[tokio::test] async fn stripe_webhook_subscription_deleted_downgrades_tenant() {}
#[tokio::test] async fn stripe_webhook_validates_signature() {}
11.2 Agent-Side Security Tests (Zero-Trust Environment)
The Agent runs in customer VPCs — untrusted territory. These tests prove the Agent defends itself independently of the SaaS backend.
// pkg/agent/security/tests.rs
// Agent-side deterministic blocking (mirrors SaaS scanner)
#[test] fn agent_scanner_blocks_rm_rf_independently_of_saas() {}
#[test] fn agent_scanner_blocks_kubectl_delete_namespace_independently() {}
#[test] fn agent_scanner_blocks_drop_table_independently() {}
#[test] fn agent_scanner_rejects_command_even_if_saas_says_safe() {
// Simulates compromised SaaS sending a "safe" classification for rm -rf
let saas_classification = Classification { risk: RiskLevel::Safe, ..Default::default() };
let agent_result = agent_scanner.classify("rm -rf /", Some(&saas_classification));
assert_eq!(agent_result.risk, RiskLevel::Dangerous);
// Agent MUST override SaaS classification
}
// Binary integrity
#[test] fn agent_validates_binary_checksum_on_startup() {}
#[test] fn agent_refuses_to_start_if_checksum_mismatch() {}
// Payload tampering
#[tokio::test] async fn agent_rejects_grpc_payload_with_invalid_hmac() {}
#[tokio::test] async fn agent_rejects_grpc_payload_with_expired_timestamp() {}
#[tokio::test] async fn agent_rejects_grpc_payload_with_mismatched_execution_id() {}
// Local fallback when SaaS is unreachable
#[tokio::test] async fn agent_falls_back_to_scanner_only_when_saas_disconnected() {}
#[tokio::test] async fn agent_in_fallback_mode_treats_all_unknowns_as_caution() {}
#[tokio::test] async fn agent_reconnects_automatically_when_saas_returns() {}
11.3 Realistic Sandbox Matrix
The Alpine-only sandbox is replaced with a matrix of realistic execution targets.
// tests/integration/sandbox_matrix_test.rs
#[rstest]
#[case("ubuntu:22.04")]
#[case("amazonlinux:2023")]
#[case("alpine:3.19")]
#[tokio::test] // rstest async cases also need the runtime attribute
async fn sandbox_safe_command_executes_on_all_targets(#[case] image: &str) {
let sandbox = SandboxContainer::start(image).await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("ls /tmp").await.unwrap();
assert_eq!(result.exit_code, 0);
}
#[rstest]
#[case("ubuntu:22.04")]
#[case("amazonlinux:2023")]
#[tokio::test] // rstest async cases also need the runtime attribute
async fn sandbox_dangerous_command_blocked_on_all_targets(#[case] image: &str) {
let sandbox = SandboxContainer::start(image).await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("rm -rf /").await;
assert!(result.is_err());
}
// Non-root execution
#[tokio::test]
async fn sandbox_agent_runs_as_non_root_user() {
let sandbox = SandboxContainer::start_as_user("ubuntu:22.04", "dd0c-agent").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("whoami").await.unwrap();
assert_eq!(result.stdout.trim(), "dd0c-agent");
}
#[tokio::test]
async fn sandbox_non_root_agent_cannot_escalate_to_root() {
let sandbox = SandboxContainer::start_as_user("ubuntu:22.04", "dd0c-agent").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
let result = agent.execute("sudo cat /etc/shadow").await;
assert!(result.is_err() || result.unwrap().exit_code != 0);
}
// RBAC-restricted K3s
#[tokio::test]
async fn sandbox_k3s_rbac_denies_kubectl_delete_namespace() {
let k3s = K3sContainer::start_with_rbac("read-only-role").await;
let agent = TestAgent::with_kubeconfig(k3s.kubeconfig()).await;
let result = agent.execute("kubectl delete namespace default").await;
// Should be blocked by BOTH scanner AND K8s RBAC
assert!(result.is_err());
}
11.4 Advanced Command Injection Tests
// pkg/classifier/scanner/injection_tests.rs
// Semicolon injection
#[test] fn scanner_semicolon_rm_rf_is_dangerous() {
assert_dangerous("echo hello; rm -rf /");
}
// Pipe injection
#[test] fn scanner_pipe_to_rm_is_dangerous() {
assert_dangerous("find / -name '*.log' | xargs rm -rf");
}
// Backtick injection
#[test] fn scanner_backtick_rm_is_dangerous() {
assert_dangerous("echo `rm -rf /`");
}
// $() substitution (already tested, but more variants)
#[test] fn scanner_nested_substitution_is_dangerous() {
assert_dangerous("echo $(echo $(rm -rf /))");
}
// Newline injection
#[test] fn scanner_newline_injection_is_dangerous() {
assert_dangerous("echo safe\nrm -rf /");
}
// Null byte injection
#[test] fn scanner_null_byte_injection_is_dangerous() {
assert_dangerous("echo safe\0rm -rf /");
}
// Unicode homoglyph attack
#[test] fn scanner_unicode_homoglyph_rm_is_dangerous() {
// Cyrillic 'м' (U+043C) stands in for Latin 'm'
assert_dangerous("rм -rf /"); // normalization must still catch this
}
// Base64 encoded payload
#[test] fn scanner_base64_decode_pipe_bash_is_dangerous() {
assert_dangerous("echo cm0gLXJmIC8= | base64 -d | bash");
}
// Heredoc injection
#[test] fn scanner_heredoc_with_destructive_is_dangerous() {
assert_dangerous("cat << EOF | bash\nrm -rf /\nEOF");
}
// Environment variable expansion
#[test] fn scanner_env_var_expansion_to_rm_is_dangerous() {
assert_dangerous("$CMD"); // Opaque expansion could resolve to any command; never treat as safe
}
11.5 Privilege Escalation Tests
// pkg/classifier/scanner/escalation_tests.rs
#[test] fn scanner_sudo_anything_is_at_least_caution() {
assert_at_least_caution("sudo systemctl restart nginx");
}
#[test] fn scanner_sudo_rm_is_dangerous() {
assert_dangerous("sudo rm -rf /var/log");
}
#[test] fn scanner_su_root_is_dangerous() {
assert_dangerous("su - root -c 'rm -rf /'");
}
#[test] fn scanner_chmod_suid_is_dangerous() {
assert_dangerous("chmod u+s /usr/bin/find");
}
#[test] fn scanner_chown_root_is_caution() {
assert_at_least_caution("chown root:root /tmp/exploit");
}
#[test] fn scanner_nsenter_is_dangerous() {
assert_dangerous("nsenter --target 1 --mount --uts --ipc --net --pid");
}
#[test] fn scanner_docker_run_privileged_is_dangerous() {
assert_dangerous("docker run --privileged -v /:/host ubuntu");
}
#[test] fn scanner_kubectl_exec_as_root_is_caution() {
assert_at_least_caution("kubectl exec -it pod -- /bin/bash");
}
11.6 Rollback Failure & Nested Failure Tests
// pkg/executor/rollback/tests.rs
#[test] fn rollback_failure_transitions_to_manual_intervention() {
let mut engine = ExecutionEngine::new();
engine.transition(State::RollingBack);
engine.report_rollback_failure("rollback command timed out");
assert_eq!(engine.state(), State::ManualIntervention);
}
#[test] fn rollback_failure_does_not_retry_automatically() {
// Rollback failures are terminal — no auto-retry
}
#[test] fn rollback_timeout_kills_rollback_process_after_300s() {}
#[tokio::test(start_paused = true)] // tokio::time::advance requires a paused async runtime
async fn rollback_hanging_indefinitely_triggers_manual_intervention_after_timeout() {
let mut engine = ExecutionEngine::with_rollback_timeout(Duration::from_secs(5));
engine.transition(State::RollingBack);
// Simulate rollback that never completes
tokio::time::advance(Duration::from_secs(6)).await;
assert_eq!(engine.state(), State::ManualIntervention);
}
#[test] fn manual_intervention_state_sends_slack_alert_to_oncall() {}
#[test] fn manual_intervention_state_logs_full_context_to_audit() {}
11.7 Double Execution & Network Partition Tests
// pkg/executor/idempotency/tests.rs
#[tokio::test]
async fn agent_reconnect_after_partition_resyncs_already_executed_step() {
let stack = E2EStack::start().await;
let execution = stack.start_execution().await;
// Agent executes the step successfully
let step_id = execution.steps[0].id.clone();
stack.wait_for_step_state(&execution.id, &step_id, "executing").await;
// Network partition AFTER execution but BEFORE ACK
stack.partition_agent().await;
// Agent reconnects
stack.heal_partition().await;
// Engine must recognize step was already executed — no double execution
let step = stack.get_step(&execution.id, &step_id).await;
assert_eq!(step.execution_count, 1); // Exactly once
}
#[tokio::test]
async fn engine_does_not_re_send_command_after_agent_reconnect_if_step_completed() {}
#[tokio::test]
async fn engine_re_sends_command_if_agent_never_started_execution_before_partition() {}
11.8 Slack Payload Forgery Tests
// tests/integration/slack_security_test.rs
#[tokio::test]
async fn slack_approval_webhook_rejects_missing_signature() {
let resp = stack.api()
.post("/v1/run/slack/actions")
.json(&fixture_approval_payload())
// No X-Slack-Signature header
.send().await;
assert_eq!(resp.status(), 401);
}
#[tokio::test]
async fn slack_approval_webhook_rejects_invalid_signature() {
let resp = stack.api()
.post("/v1/run/slack/actions")
.header("X-Slack-Signature", "v0=invalid_hmac")
.header("X-Slack-Request-Timestamp", &now_timestamp())
.json(&fixture_approval_payload())
.send().await;
assert_eq!(resp.status(), 401);
}
#[tokio::test]
async fn slack_approval_webhook_rejects_replayed_timestamp() {
// Timestamp older than 5 minutes
let resp = stack.api()
.post("/v1/run/slack/actions")
.header("X-Slack-Signature", &valid_signature_for_old_timestamp())
.header("X-Slack-Request-Timestamp", &five_minutes_ago())
.json(&fixture_approval_payload())
.send().await;
assert_eq!(resp.status(), 401);
}
#[tokio::test]
async fn slack_approval_webhook_rejects_cross_tenant_approval() {
// Tenant A's user trying to approve Tenant B's execution
}
11.9 Audit Log Encryption Tests
// tests/integration/audit_encryption_test.rs
#[tokio::test]
async fn audit_log_command_field_is_encrypted_at_rest() {
let db = TestDb::start().await;
// Insert an audit event with a command
insert_audit_event(&db, "kubectl get pods").await;
// Read raw bytes from PostgreSQL — must NOT contain plaintext command
let raw = db.query_raw_bytes("SELECT command FROM audit_events LIMIT 1").await;
assert!(!String::from_utf8_lossy(&raw).contains("kubectl get pods"),
"Command stored in plaintext — must be encrypted");
}
#[tokio::test]
async fn audit_log_output_field_is_encrypted_at_rest() {
let db = TestDb::start().await;
insert_audit_event_with_output(&db, "sensitive output data").await;
let raw = db.query_raw_bytes("SELECT output FROM audit_events LIMIT 1").await;
assert!(!String::from_utf8_lossy(&raw).contains("sensitive output data"));
}
#[tokio::test]
async fn audit_log_decryption_requires_kms_key() {
// Verify the app role can decrypt using the KMS key
let db = TestDb::start().await;
insert_audit_event(&db, "kubectl get pods").await;
let decrypted = db.as_app_role()
.query("SELECT decrypt_command(command) FROM audit_events LIMIT 1").await;
assert_eq!(decrypted, "kubectl get pods");
}
11.10 gRPC Output Buffer Limits
// pkg/agent/streaming/tests.rs
#[tokio::test]
async fn agent_truncates_stdout_at_10mb() {
let sandbox = SandboxContainer::start("ubuntu:22.04").await;
let agent = TestAgent::connect_to(sandbox.socket_path()).await;
// Generate 50MB of output
let result = agent.execute("dd if=/dev/urandom bs=1M count=50 | base64").await.unwrap();
// Agent must truncate, not OOM
assert!(result.stdout.len() <= 10 * 1024 * 1024);
assert!(result.truncated);
}
#[tokio::test]
async fn agent_streams_output_in_chunks_not_buffered() {
// Verify output arrives incrementally, not all at once after completion
}
#[tokio::test]
async fn agent_memory_stays_under_256mb_during_large_output() {
// Memory profiling test — agent must not OOM on `cat /dev/urandom`
}
#[tokio::test]
async fn engine_handles_truncated_output_gracefully() {
// Engine receives truncated flag and logs warning
}
11.11 Parse SLA End-to-End Benchmark
// benches/parse_sla_bench.rs
#[tokio::test]
async fn parse_plus_classify_pipeline_under_5s_p95() {
let stack = E2EStack::start().await;
let mut latencies = vec![];
for _ in 0..100 {
let start = Instant::now();
stack.api()
.post("/v1/run/runbooks/parse-preview")
.json(&json!({ "raw_text": FIXTURE_RUNBOOK_10_STEPS }))
.send().await;
latencies.push(start.elapsed());
}
let p95 = percentile(&latencies, 95);
assert!(p95 < Duration::from_secs(5),
"Parse+Classify p95 latency: {:?} — exceeds 5s SLA", p95);
}
11.12 Updated Test Pyramid (Post-Review)
The Execution Engine ratio shifts from 80/15/5 to 60/30/10 (unit/integration/E2E), per the review recommendation:
| Component | Unit | Integration | E2E |
|---|---|---|---|
| Safety Scanner | 80% | 15% | 5% |
| Merge Engine | 90% | 10% | 0% |
| Execution Engine | 60% | 30% | 10% |
| Parser | 50% | 40% | 10% |
| Approval Workflow | 70% | 20% | 10% |
| Audit Trail | 60% | 35% | 5% |
| Agent | 50% | 35% | 15% |
| Dashboard API | 40% | 50% | 10% |
End of Review Remediation Addendum