Add Gemini TDD reviews for all 6 products
P1, P2, P3, P4, P6 reviewed by Gemini subagents. P5 reviewed manually (Gemini credential errors). All reviews flag coverage gaps, anti-patterns, and Transparent Factory tenet gaps.
products/06-runbook-automation/test-architecture/review.md (new file, 56 lines)
# dd0c/run - Test Architecture Review
**Reviewer:** Max Mayfield (Senior TDD Consultant)
**Status:** Requires Remediation 🔴
Look, this is the highest-risk product in the suite. You are literally executing commands in production environments. The test architecture is a solid start, and the "Safety-First TDD" mandate is great in theory, but there are massive blind spots here. If this ships as-is, you're going to have an existential incident.
Here is the rigorous breakdown of where this architecture falls short and what needs fixing immediately.
## 1. Coverage Gaps Per Epic
You've focused heavily on the core state machine but completely ghosted several critical epics:
* **Epic 3 (Execution Engine):** Story 3.4 (Divergence Analysis) is completely missing. No tests for the post-execution analyzer comparing prescribed vs. executed commands or flagging unlisted actions.
* **Epic 5 (Audit Trail):** Story 5.3 (Compliance Export) is absent. Where are the tests for generating the SOC 2 PDF/CSV exports and verifying the S3 links?
* **Epic 6 (Dashboard API):** Story 6.4 (Classification Query API) requires rate-limiting to 30 req/min per tenant. There are zero rate-limiting integration tests specified.
* **Epic 7 (Dashboard UI):** SPA testing is totally ignored. You need Cypress or Playwright tests for the "5-second wow moment" parse preview, Trust Level visualizations, and MTTR dashboard.
* **Epic 9 (Onboarding & PLG):** The entire self-serve tenant provisioning, free-tier limit enforcement (5 runbooks/50 executions), and Agent installation snippet generation lack test coverage.
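To be concrete about the Epic 6.4 gap: here is a minimal sketch of the behavior the missing rate-limiting test must pin down. This is a toy in-process fixed-window limiter; the real enforcement presumably lives in the API gateway or Redis, and `TenantRateLimiter` is purely illustrative.

```python
import time

class TenantRateLimiter:
    """Fixed-window limiter: `limit` requests per `window` seconds, per tenant."""
    def __init__(self, limit=30, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._windows = {}  # tenant_id -> (window_start, count)

    def allow(self, tenant_id):
        now = self.clock()
        start, count = self._windows.get(tenant_id, (now, 0))
        if now - start >= self.window:   # window expired: reset the counter
            start, count = now, 0
        if count >= self.limit:
            return False                 # 31st request in the minute is rejected
        self._windows[tenant_id] = (start, count + 1)
        return True

# The missing integration test, in miniature: 30 requests pass, the 31st
# is rejected, and a second tenant is unaffected.
limiter = TenantRateLimiter(limit=30, window=60.0)
results = [limiter.allow("tenant-a") for _ in range(31)]
assert results[:30] == [True] * 30
assert results[30] is False
assert limiter.allow("tenant-b") is True
```

The real test should also cover window reset (inject a fake clock, as the constructor allows here) and per-tenant isolation under concurrent load.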
## 2. "Safety-First" TDD Enforcement
The policy is strong (Canary suite, red-safety, 95% threshold), but it’s overly centralized on the SaaS side.
* **The Gap:** The Agent runs in customer VPCs (untrusted territory). If the gRPC payload is intercepted or the API gateway is compromised, the SaaS scanner means nothing. You mentioned the Agent-Side Scanner in the thresholds, but there is no explicit TDD mandate for testing Agent-side deterministic blocking, binary integrity, or defense against payload tampering.
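To illustrate the property the missing Agent-side tests should assert: a tampered payload must be rejected *by the Agent* before execution, regardless of what the SaaS scanner did. The real Agent is presumably not Python, transport security is TLS/mTLS on top of this, and `SHARED_KEY` plus the payload shape are invented for the example.

```python
import hashlib
import hmac

SHARED_KEY = b"per-agent-secret"  # hypothetical: provisioned at agent install time

def sign_payload(payload: bytes, key: bytes = SHARED_KEY) -> str:
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def agent_accepts(payload: bytes, signature: str, key: bytes = SHARED_KEY) -> bool:
    # Constant-time comparison so the check itself isn't a timing oracle.
    return hmac.compare_digest(sign_payload(payload, key), signature)

command = b'{"step": 1, "cmd": "systemctl restart nginx"}'
sig = sign_payload(command)
assert agent_accepts(command, sig)

# An attacker who intercepts the stream and rewrites the command
# cannot produce a valid signature without the agent's key.
tampered = command.replace(b"restart nginx", b"rm -rf /")
assert not agent_accepts(tampered, sig)
```

The E2E version of this test forges the gRPC payload at the wire level and asserts the Agent refuses to execute, logs the rejection, and alerts.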
## 3. Test Pyramid Evaluation
Your execution engine ratio is `80% Unit, 15% Integration, 5% E2E`.
* **The Reality Check:** That unit ratio is too high for the *orchestration* layer. The in-memory state machine is pure logic, but distributed state is where things break. Shift the Execution Engine to at least `60% Unit, 30% Integration, 10% E2E`, with more integration tests proving state persistence through RDS failovers, Redis evictions, and gRPC disconnects. Pure unit tests won't catch a distributed race condition.
## 4. Anti-Patterns & Sandboxing
You specified an Alpine DinD container for sandboxed execution tests. This is a massive anti-pattern.
* **Why it's bad:** Alpine doesn't reflect real-world AWS AMIs, Ubuntu instances, or RBAC-restricted Kubernetes environments. Testing `echo $(rm -rf /)` in Alpine proves nothing about how your agent handles missing binaries, shell variations (bash vs zsh vs sh), or K8s permission denied errors.
* **The Fix:** Sandbox tests must use realistic targets: Mocked EC2 AMIs, RBAC-configured K3s clusters (not just a blanket k3s mock), and varying user privileges (root vs. limited user).
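A sketch of what that matrix could look like as test parametrization. The image tags, the `svc-runbook` limited user, and `agent_step.sh` are placeholders, not your actual fixtures; the point is that every (image, user, shell) combination gets its own sandbox case.

```python
import itertools

# Hypothetical sandbox matrix for illustration.
IMAGES = ["ubuntu:22.04", "amazonlinux:2023", "rancher/k3s:latest"]
USERS = ["root", "svc-runbook"]   # root vs. limited user
SHELLS = ["bash", "sh"]           # exercise shell variations

def docker_cmd(image, user, shell, script="./agent_step.sh"):
    """Build the `docker run` invocation a sandbox test case would execute."""
    return ["docker", "run", "--rm", "--user", user, image, shell, "-c", script]

MATRIX = list(itertools.product(IMAGES, USERS, SHELLS))
assert len(MATRIX) == 12          # 3 images x 2 users x 2 shells, no case skipped

cmd = docker_cmd(*MATRIX[0])
assert cmd[:3] == ["docker", "run", "--rm"]
```

In a real suite this feeds `pytest.mark.parametrize` (or your runner's equivalent), and the K3s cases add an RBAC-restricted service account rather than a blanket mock.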
## 5. Transparent Factory Gaps
You hit most of the tenets, but you missed some critical enforcement mechanics:
* **Atomic Flagging:** You test the 48-hour bake logic, but does the CI script *actually* fail the build if a destructive flag bypasses it? You need an integration test against the CI validation script itself.
* **Elastic Schema:** You test that app roles can't `UPDATE`/`DELETE` the audit log. But Semantic Observability requires "encrypted audit logs." There are **zero tests** verifying data-at-rest encryption (KMS or pg_crypto) for the execution logs.
* **Configurable Autonomy:** You test the Redis logic for Panic Mode, but Story 10.5 requires a webhook endpoint `POST /admin/panic` that works in <1s. You must test the *entire* path from HTTP request to gRPC stream pause, not just the Redis key state.
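For the Atomic Flagging point, the core of the CI validation check is small enough to sketch directly. The flag-registry shape here is assumed, not your actual format; the real script loads the registry, runs this check, and exits nonzero on any violation so the build fails.

```python
from datetime import datetime, timedelta, timezone

BAKE_PERIOD = timedelta(hours=48)

def validate_flags(flags, now=None):
    """Return the names of destructive flags violating the 48-hour bake rule."""
    now = now or datetime.now(timezone.utc)
    violations = []
    for flag in flags:
        if flag["destructive"] and flag["enabled_in_prod"]:
            if now - flag["created_at"] < BAKE_PERIOD:
                violations.append(flag["name"])
    return violations

now = datetime(2024, 1, 3, tzinfo=timezone.utc)
flags = [
    {"name": "safe-ui-tweak", "destructive": False, "enabled_in_prod": True,
     "created_at": now - timedelta(hours=1)},    # non-destructive: exempt
    {"name": "auto-rollback-v2", "destructive": True, "enabled_in_prod": True,
     "created_at": now - timedelta(hours=12)},   # only 12h baked: must fail CI
]
assert validate_flags(flags, now=now) == ["auto-rollback-v2"]
```

The integration test you're missing runs the *actual* CI script against a registry containing exactly this kind of violation and asserts the build fails.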
## 6. Performance Gaps
* **Parse SLA:** Epic 1 dictates a 3.5s SLA for LLM extraction, but your performance benchmarks only cover the Rust normalizer (<10ms). You need an end-to-end integration benchmark proving the 5s total SLA (Parser + Classifier) using recorded LLM latency percentiles.
* **Agent Output Buffering:** You test Slack rate limiting, but what if a user runs `cat /var/log/syslog` and dumps 500MB of text? There are no memory profiling tests or truncation enforcement tests on the Agent side before it attempts to stream over gRPC.
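A sketch of the truncation enforcement the Agent needs before anything is streamed over gRPC. The byte budget and the `OutputBuffer` API are invented for illustration; the property that matters is that memory stays bounded no matter how much the command emits.

```python
MAX_OUTPUT_BYTES = 1 * 1024 * 1024  # hypothetical per-step budget

class OutputBuffer:
    """Caps captured command output before it is streamed over gRPC."""
    def __init__(self, limit=MAX_OUTPUT_BYTES):
        self.limit = limit
        self.chunks = []
        self.size = 0
        self.truncated = False

    def write(self, data: bytes):
        if self.truncated:
            return                       # discard everything past the cap
        room = self.limit - self.size
        if len(data) > room:
            self.chunks.append(data[:room])
            self.size = self.limit
            self.truncated = True        # mark so the UI can show "output truncated"
        else:
            self.chunks.append(data)
            self.size += len(data)

# Simulate a runaway `cat /var/log/syslog` dumping far more than the budget.
buf = OutputBuffer(limit=1024)
for _ in range(1000):
    buf.write(b"x" * 4096)
assert buf.size == 1024 and buf.truncated
assert sum(len(c) for c in buf.chunks) == 1024   # memory use stays bounded
```

The accompanying memory-profiling test asserts the Agent's RSS stays flat while a command streams hundreds of megabytes.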
## 7. Missing Threat Scenarios
Your chaos and security tests are missing these highly probable vectors:
* **Advanced Command Injection:** Testing `echo $(rm)` is elementary. Where are the tests for `; rm -rf /`, `| rm -rf /`, backticks, and newline injections?
* **Privilege Escalation:** You block `sudo`, but what if the Agent is run as root by the customer? Tests must cover execution under different user contexts to prevent horizontal escalation.
* **Rollback Failures:** You test mid-execution failure triggering a rollback. What happens if the *rollback command itself* fails or hangs indefinitely? You need a test for nested/fatal failure states.
* **Double Execution (Partial Failure):** The Agent executes a step successfully but the network partitions *before* it sends the success ACK to the engine. Your idempotency test covers duplicate IDs, but does it handle agent reconnections re-syncing an already-executed state?
* **Slack Payload Forgery:** You test that approval timeouts mark as stalled, but what if someone curls the Slack approval webhook directly with a forged payload? You must test Slack signature verification.
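The Slack forgery case is cheap to test deterministically, because Slack's documented request-signing scheme is just HMAC-SHA256 over `v0:{timestamp}:{body}` with your signing secret, plus a recommended 5-minute replay window. Here's the shape of the verification the webhook handler needs (the secret and payload are made up):

```python
import hashlib
import hmac
import time

SIGNING_SECRET = b"slack-signing-secret"  # from the Slack app config

def verify_slack_request(body: bytes, timestamp: str, signature: str,
                         secret: bytes = SIGNING_SECRET, now=None) -> bool:
    # Reject replays: discard requests whose timestamp is >5 minutes off.
    now = now if now is not None else time.time()
    if abs(now - int(timestamp)) > 60 * 5:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(secret, base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

ts = "1700000000"
body = b'payload={"approve":true}'
good = "v0=" + hmac.new(SIGNING_SECRET, b"v0:" + ts.encode() + b":" + body,
                        hashlib.sha256).hexdigest()
assert verify_slack_request(body, ts, good, now=1700000100)
assert not verify_slack_request(body + b"x", ts, good, now=1700000100)  # forged body
assert not verify_slack_request(body, ts, good, now=1700009999)         # stale replay
```

The integration test curls the real approval endpoint with a well-formed but unsigned payload and asserts a 401, not an approved execution.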
## 8. Top 5 Recommendations
1. **Mandate Agent-Side Threat Testing:** The Agent is in a zero-trust environment. Write explicit E2E tests for agent binary tampering, payload forgery, and local fallback scanning if the SaaS connection drops.
2. **Upgrade the Execution Sandbox:** Ditch the plain Alpine container. Implement a matrix of execution sandboxes (Ubuntu, Amazon Linux 2023, RBAC-restricted K3s) and test execution with non-root users.
3. **Close the Epic Coverage Blackholes:** Write test architecture specs for Divergence Analysis (Epic 3.4), Compliance Exports (Epic 5.3), API Rate Limiting (Epic 6.4), and PLG Onboarding limits (Epic 9).
4. **Test the Unknowns of Rollbacks & Double Executions:** Add explicit chaos tests for when rollback commands fail, and simulate network drops occurring exactly between command completion and gRPC acknowledgment.
5. **Enforce Cryptographic Audit Tests:** Add tests verifying that raw commands and outputs in the PostgreSQL audit log are actually encrypted at rest using pg_crypto or AWS KMS, as mandated by the observability tenets.
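Recommendation 4's double-execution scenario reduces to an idempotency property you can test without any network at all. A minimal sketch (the real engine presumably keys this in Postgres or Redis rather than an in-memory set, and `ExecutionEngine` is illustrative):

```python
class ExecutionEngine:
    """Tracks completed steps so re-delivered completions after a reconnect are no-ops."""
    def __init__(self):
        self.completed = set()   # (execution_id, step_id) pairs already applied
        self.apply_count = 0     # how many times side effects actually ran

    def on_step_completed(self, execution_id: str, step_id: int) -> bool:
        key = (execution_id, step_id)
        if key in self.completed:
            return False         # agent is re-syncing an already-executed step: ignore
        self.completed.add(key)
        self.apply_count += 1    # advance the state machine, write the audit row, etc.
        return True

engine = ExecutionEngine()
assert engine.on_step_completed("exec-42", 3) is True
# Network partitions before the ACK reaches the agent; on reconnect the
# agent re-sends the same completion.
assert engine.on_step_completed("exec-42", 3) is False
assert engine.apply_count == 1   # the step's effects were applied exactly once
```

The chaos test wraps this in a real partition: kill the gRPC stream between command completion and ACK, reconnect, and assert the audit log records exactly one execution.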
Fix these gaps before you write a single line of execution code, or you're asking for a breach.