Add Gemini TDD reviews for all 6 products

P1, P2, P3, P4, P6 reviewed by Gemini subagents.
P5 reviewed manually (Gemini credential errors).
All reviews flag coverage gaps, anti-patterns, and Transparent Factory tenet gaps.
This commit is contained in:
2026-03-01 00:29:24 +00:00
parent 1101fef096
commit 2fe0ed856e
6 changed files with 501 additions and 0 deletions


# dd0c/run - Test Architecture Review
**Reviewer:** Max Mayfield (Senior TDD Consultant)
**Status:** Requires Remediation 🔴
Look, this is the highest-risk product in the suite. You are literally executing commands in production environments. The test architecture is a solid start, and the "Safety-First TDD" mandate is great in theory, but there are massive blind spots here. If this ships as-is, you're going to have an existential incident.
Here is the rigorous breakdown of where this architecture falls short and what needs fixing immediately.
## 1. Coverage Gaps Per Epic
You've focused heavily on the core state machine but completely ghosted several critical epics:
* **Epic 3 (Execution Engine):** Story 3.4 (Divergence Analysis) is completely missing. No tests for the post-execution analyzer comparing prescribed vs. executed commands or flagging unlisted actions.
* **Epic 5 (Audit Trail):** Story 5.3 (Compliance Export) is absent. Where are the tests for generating the SOC 2 PDF/CSV exports and verifying the S3 links?
* **Epic 6 (Dashboard API):** Story 6.4 (Classification Query API) requires rate-limiting to 30 req/min per tenant. There are zero rate-limiting integration tests specified.
* **Epic 7 (Dashboard UI):** SPA testing is totally ignored. You need Cypress or Playwright tests for the "5-second wow moment" parse preview, Trust Level visualizations, and MTTR dashboard.
* **Epic 9 (Onboarding & PLG):** The entire self-serve tenant provisioning, free-tier limit enforcement (5 runbooks/50 executions), and Agent installation snippet generation lack test coverage.
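To make the Epic 6.4 gap concrete, here is a minimal sketch of the per-tenant fixed-window check the missing integration tests should exercise. The class name, key shape, and the 30 req/min default mirror the stated requirement; everything else is a hypothetical stand-in, not the product's gateway code.

```python
import time
from collections import defaultdict

class TenantRateLimiter:
    """Hypothetical fixed-window limiter for the 30 req/min per-tenant rule."""
    def __init__(self, limit=30, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(list)  # tenant_id -> request timestamps

    def allow(self, tenant_id, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that fell out of the window, then count what's left.
        hits = [t for t in self._hits[tenant_id] if now - t < self.window_s]
        self._hits[tenant_id] = hits
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The integration test then asserts three things at once: the 31st request in a window is rejected, other tenants are unaffected, and the limit resets once the window slides.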
## 2. "Safety-First" TDD Enforcement
The policy is strong (Canary suite, red-safety, 95% threshold), but it's overly centralized on the SaaS side.
* **The Gap:** The Agent runs in customer VPCs (untrusted territory). If the gRPC payload is intercepted or the API gateway is compromised, the SaaS scanner means nothing. You mentioned the Agent-Side Scanner in the thresholds, but there is no explicit TDD mandate for testing Agent-side deterministic blocking, binary integrity, or defense against payload tampering.
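What an Agent-side deterministic test target could look like, as a sketch: a local deny-list scan that needs no network and ignores any SaaS verdict, so a compromised gateway cannot approve a destructive command. The pattern list and function name are illustrative, not the actual scanner.

```python
import re

# Illustrative agent-side deny patterns; a real list would ship signed with the binary.
DENY_PATTERNS = [
    re.compile(r"\brm\s+(-[a-zA-Z]*\s+)*/(\s|$)"),   # rm targeting the filesystem root
    re.compile(r"\bmkfs\b"),                          # filesystem creation
    re.compile(r"\bdd\b.*\bof=/dev/"),                # raw writes to block devices
]

def agent_side_scan(command: str) -> bool:
    """Return True if the command may run. Deterministic: no network, no SaaS input."""
    return not any(p.search(command) for p in DENY_PATTERNS)
```

The TDD mandate would be: this function must block its deny list even when the test harness simulates a gRPC payload that arrives pre-approved.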
## 3. Test Pyramid Evaluation
Your execution engine ratio is `80% Unit, 15% Integration, 5% E2E`.
* **The Reality Check:** That unit ratio is too high for the *orchestration* layer. The state machine in-memory is pure logic, but distributed state is where things break. You need to shift the Execution Engine to at least `60% Unit, 30% Integration, 10% E2E`. You need more integration tests proving state persistence during RDS failovers, Redis evictions, and gRPC disconnects. Pure unit tests won't catch a distributed race condition.
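The kind of integration test that matters here can be sketched with in-process fakes (the engine and write ordering below are hypothetical, not the product's actual persistence layer): prove a run survives a full Redis eviction because the durable store was written first.

```python
class ExecutionEngine:
    """Toy engine: Redis is a cache, the RDS store is the source of truth."""
    def __init__(self, redis, rds):
        self.redis, self.rds = redis, rds

    def checkpoint(self, run_id, step):
        self.rds[run_id] = step      # durable write first...
        self.redis[run_id] = step    # ...then the cache

    def resume_from(self, run_id):
        # Fast path via cache, fall back to the durable store on a miss.
        return self.redis.get(run_id) or self.rds[run_id]
```

A unit test on the in-memory state machine would never exercise the fallback branch; an eviction test does.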
## 4. Anti-Patterns & Sandboxing
You specified an Alpine DinD container for sandboxed execution tests. This is a massive anti-pattern.
* **Why it's bad:** Alpine doesn't reflect real-world AWS AMIs, Ubuntu instances, or RBAC-restricted Kubernetes environments. Testing `echo $(rm -rf /)` in Alpine proves nothing about how your agent handles missing binaries, shell variations (bash vs zsh vs sh), or K8s permission denied errors.
* **The Fix:** Sandbox tests must use realistic targets: Mocked EC2 AMIs, RBAC-configured K3s clusters (not just a blanket k3s mock), and varying user privileges (root vs. limited user).
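A sketch of what that matrix looks like as a test fixture generator. The image tags, user names, and shell list are illustrative assumptions; the point is that the sandbox dimensions are enumerated in one place and every combination gets a run.

```python
from itertools import product

# Hypothetical sandbox dimensions; tags are illustrative, not pinned standards.
IMAGES = ["ubuntu:22.04", "amazonlinux:2023", "rancher/k3s:latest"]
USERS = ["root", "svc-runner"]   # root vs. limited user
SHELLS = ["bash", "sh"]          # shell variation coverage

def sandbox_matrix():
    """Yield one docker invocation prefix per (image, user, shell) combination."""
    for image, user, shell in product(IMAGES, USERS, SHELLS):
        yield ["docker", "run", "--rm", "--user", user, image, shell, "-c"]
```

Each yielded prefix gets the command under test appended; a missing-binary or permission-denied result in one cell of the matrix is exactly the signal plain Alpine hides.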
## 5. Transparent Factory Gaps
You hit most of the tenets, but you missed some critical enforcement mechanics:
* **Atomic Flagging:** You test the 48-hour bake logic, but does the CI script *actually* fail the build if a destructive flag bypasses it? You need an integration test against the CI validation script itself.
* **Elastic Schema:** You test that app roles can't `UPDATE`/`DELETE` the audit log. But Semantic Observability requires "encrypted audit logs." There are **zero tests** verifying data-at-rest encryption (KMS or pg_crypto) for the execution logs.
* **Configurable Autonomy:** You test the Redis logic for Panic Mode, but Story 10.5 requires a webhook endpoint `POST /admin/panic` that works in <1s. You must test the *entire* path from HTTP request to gRPC stream pause, not just the Redis key state.
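The full-path Panic Mode test can be sketched with in-process fakes (all names here are hypothetical stand-ins for the real handler, Redis client, and gRPC stream): the assertion is on the elapsed time of the whole handler path and on the stream actually pausing, not on the Redis key alone.

```python
import time

class FakeRedis:
    def __init__(self): self._kv = {}
    def set(self, k, v): self._kv[k] = v
    def get(self, k): return self._kv.get(k)

class FakeAgentStream:
    """Stand-in for the gRPC command stream; checks the panic flag before each send."""
    def __init__(self, redis):
        self.redis, self.paused = redis, False
    def send(self, step):
        if self.redis.get("panic:global") == "1":
            self.paused = True
            return "PAUSED"
        return "SENT"

def post_admin_panic(redis):
    """Hypothetical POST /admin/panic handler body; returns its own elapsed seconds."""
    start = time.monotonic()
    redis.set("panic:global", "1")
    return time.monotonic() - start
```

In the real test the fakes are replaced by a live HTTP call and a running agent connection, with the same three assertions: commands flow before, the handler finishes under 1s, and the next send pauses.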
## 6. Performance Gaps
* **Parse SLA:** Epic 1 dictates a 3.5s SLA for LLM extraction, but your performance benchmarks only cover the Rust normalizer (<10ms). You need an end-to-end integration benchmark proving the 5s total SLA (Parser + Classifier) using recorded LLM latency percentiles.
* **Agent Output Buffering:** You test Slack rate limiting, but what if a user runs `cat /var/log/syslog` and dumps 500MB of text? There are no memory profiling tests or truncation enforcement tests on the Agent side before it attempts to stream over gRPC.
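The truncation-enforcement test is straightforward to specify. A sketch, assuming a 1 MB agent-side cap (the limit and function name are illustrative): output is accumulated chunk by chunk and hard-stopped at the cap before anything reaches gRPC, with truncation flagged so the engine can surface it.

```python
MAX_OUTPUT_BYTES = 1_000_000  # hypothetical agent-side cap before gRPC streaming

def truncate_output(chunks, limit=MAX_OUTPUT_BYTES):
    """Accumulate stdout chunks but hard-stop at the cap, flagging truncation."""
    buf, total, truncated = [], 0, False
    for chunk in chunks:
        if total + len(chunk) > limit:
            buf.append(chunk[: limit - total])  # keep only what fits
            truncated = True
            break
        buf.append(chunk)
        total += len(chunk)
    return b"".join(buf), truncated
```

The accompanying memory-profiling test feeds a 500MB generator through this path and asserts resident memory stays bounded, which the naive read-it-all approach fails.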
## 7. Missing Threat Scenarios
Your chaos and security tests are missing these highly probable vectors:
* **Advanced Command Injection:** Testing `echo $(rm)` is elementary. Where are the tests for `; rm -rf /`, `| rm -rf /`, backticks, and newline injections?
* **Privilege Escalation:** You block `sudo`, but what if the Agent is run as root by the customer? Tests must cover execution under different user contexts to prevent horizontal escalation.
* **Rollback Failures:** You test mid-execution failure triggering a rollback. What happens if the *rollback command itself* fails or hangs indefinitely? You need a test for nested/fatal failure states.
* **Double Execution (Partial Failure):** The Agent executes a step successfully but the network partitions *before* it sends the success ACK to the engine. Your idempotency test covers duplicate IDs, but does it handle agent reconnections re-syncing an already-executed state?
* **Slack Payload Forgery:** You test that approval timeouts mark as stalled, but what if someone curls the Slack approval webhook directly with a forged payload? You must test Slack signature verification.
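Slack's v0 signing scheme is public, so the forgery test target can be sketched directly: HMAC-SHA256 over `v0:{timestamp}:{body}` with the signing secret, constant-time comparison, and a staleness window to kill replays. The handler wiring around this function is assumed; the signature math is Slack's documented scheme.

```python
import hashlib, hmac, time

def verify_slack_signature(signing_secret, timestamp, body, signature, now=None):
    """Slack v0 request verification: rejects forged and replayed approval callbacks."""
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > 60 * 5:  # stale timestamp => replay attempt
        return False
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The test suite then curls the approval endpoint with a valid signature, a forged one, and a replayed-but-valid one, asserting only the first is accepted.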
## 8. Top 5 Recommendations
1. **Mandate Agent-Side Threat Testing:** The Agent is in a zero-trust environment. Write explicit E2E tests for agent binary tampering, payload forgery, and local fallback scanning if the SaaS connection drops.
2. **Upgrade the Execution Sandbox:** Ditch the plain Alpine container. Implement a matrix of execution sandboxes (Ubuntu, Amazon Linux 2023, RBAC-restricted K3s) and test execution with non-root users.
3. **Close the Epic Coverage Blackholes:** Write test architecture specs for Divergence Analysis (Epic 3.4), Compliance Exports (Epic 5.3), API Rate Limiting (Epic 6.4), and PLG Onboarding limits (Epic 9).
4. **Test the Unknowns of Rollbacks & Double Executions:** Add explicit chaos tests for when rollback commands fail, and simulate network drops occurring exactly between command completion and gRPC acknowledgment.
5. **Enforce Cryptographic Audit Tests:** Add tests verifying that raw commands and outputs in the PostgreSQL audit log are actually encrypted at rest using pg_crypto or AWS KMS, as mandated by the observability tenets.
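For recommendation 5, the cheapest at-rest assertion is a negative one, sketched below: pull the raw stored bytes for an execution and prove the plaintext command never appears verbatim. The query string and column name are hypothetical; the real test would run against the pgcrypto- or KMS-backed schema.

```python
# Hypothetical query; assumes a pgcrypto-style bytea column for the audit payload.
AUDIT_QUERY = "SELECT command_ciphertext FROM audit_log WHERE execution_id = %s"

def leaks_plaintext(raw_column_bytes: bytes, plaintext_command: str) -> bool:
    """True if the stored bytes contain the command verbatim (i.e. not encrypted)."""
    return plaintext_command.encode() in raw_column_bytes
```

It's a crude check (it won't catch weak encryption), but it makes "we encrypt the audit log" a failing test instead of a claim, which is the point of the tenet.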
Fix these gaps before you write a single line of execution code, or you're asking for a breach.