Max Mayfield 2fe0ed856e Add Gemini TDD reviews for all 6 products
P1, P2, P3, P4, P6 reviewed by Gemini subagents.
P5 reviewed manually (Gemini credential errors).
All reviews flag coverage gaps, anti-patterns, and Transparent Factory tenet gaps.
2026-03-01 00:29:24 +00:00


dd0c/run - Test Architecture Review

Reviewer: Max Mayfield (Senior TDD Consultant)
Status: Requires Remediation 🔴

Look, this is the highest-risk product in the suite. You are literally executing commands in production environments. The test architecture is a solid start, and the "Safety-First TDD" mandate is great in theory, but there are massive blind spots here. If this ships as-is, you're going to have an existential incident.

Here is the rigorous breakdown of where this architecture falls short and what needs fixing immediately.

1. Coverage Gaps Per Epic

You've focused heavily on the core state machine but completely ghosted several critical epics:

  • Epic 3 (Execution Engine): Story 3.4 (Divergence Analysis) is completely missing. No tests for the post-execution analyzer comparing prescribed vs. executed commands or flagging unlisted actions.
  • Epic 5 (Audit Trail): Story 5.3 (Compliance Export) is absent. Where are the tests for generating the SOC 2 PDF/CSV exports and verifying the S3 links?
  • Epic 6 (Dashboard API): Story 6.4 (Classification Query API) requires rate-limiting to 30 req/min per tenant. There are zero rate-limiting integration tests specified.
  • Epic 7 (Dashboard UI): SPA testing is totally ignored. You need Cypress or Playwright tests for the "5-second wow moment" parse preview, Trust Level visualizations, and MTTR dashboard.
  • Epic 9 (Onboarding & PLG): The entire self-serve tenant provisioning, free-tier limit enforcement (5 runbooks/50 executions), and Agent installation snippet generation lack test coverage.
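To make the Epic 6.4 gap concrete, here is a minimal sketch of a rate-limit test. The RateLimiter class is an illustrative in-process stub, not the product's API; a real integration test would drive the actual API gateway with the same assertions.

```python
import time

class RateLimiter:
    """Illustrative fixed-window limiter: 30 requests per 60s window per tenant."""
    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # tenant_id -> timestamps of recent requests

    def allow(self, tenant_id, now=None):
        now = time.monotonic() if now is None else now
        window_start = now - self.window
        hits = [t for t in self.hits.get(tenant_id, []) if t > window_start]
        if len(hits) >= self.limit:
            self.hits[tenant_id] = hits
            return False
        hits.append(now)
        self.hits[tenant_id] = hits
        return True

limiter = RateLimiter()
results = [limiter.allow("tenant-a", now=0.0) for _ in range(31)]
assert results[:30] == [True] * 30         # first 30 requests in the window pass
assert results[30] is False                # the 31st is throttled
assert limiter.allow("tenant-b", now=0.0)  # other tenants are unaffected
```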

2. "Safety-First" TDD Enforcement

The policy is strong (Canary suite, red-safety, 95% threshold), but it's overly centralized on the SaaS side.

  • The Gap: The Agent runs in customer VPCs (untrusted territory). If the gRPC payload is intercepted or the API gateway is compromised, the SaaS scanner means nothing. You mentioned the Agent-Side Scanner in the thresholds, but there is no explicit TDD mandate for testing Agent-side deterministic blocking, binary integrity, or defense against payload tampering.
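What Agent-side deterministic blocking could look like as a first failing test: a dependency-free scanner that runs inside the customer VPC even if the SaaS connection is down. The patterns and function name below are illustrative, not the product's actual denylist.

```python
import re

# Hypothetical local denylist: destructive binaries and shell-injection metacharacters.
DESTRUCTIVE = re.compile(r"\brm\s+-rf\b|\bmkfs\b|\bdd\s+if=")
INJECTION = re.compile(r"[;&\n`]|\$\(")

def agent_scan(command: str) -> bool:
    """Return True if the command may execute, False to block it locally,
    regardless of what the SaaS-side scanner said."""
    if DESTRUCTIVE.search(command):
        return False
    if INJECTION.search(command):
        return False
    return True

assert agent_scan("systemctl status nginx")
assert not agent_scan("echo $(rm -rf /)")    # subshell injection
assert not agent_scan("true; rm -rf /")      # chained destructive command
assert not agent_scan("echo `whoami`")       # backtick injection
```

The point of the test is that it passes with the SaaS scanner mocked out entirely: the Agent must fail safe on its own.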

3. Test Pyramid Evaluation

Your execution engine ratio is 80% Unit, 15% Integration, 5% E2E.

  • The Reality Check: That unit ratio is too high for the orchestration layer. The in-memory state machine is pure logic, but distributed state is where things break. You need to shift the Execution Engine to at least 60% Unit, 30% Integration, 10% E2E, with more integration tests proving state persistence during RDS failovers, Redis evictions, and gRPC disconnects. Pure unit tests won't catch a distributed race condition.

4. Anti-Patterns & Sandboxing

You specified an Alpine DinD container for sandboxed execution tests. This is a massive anti-pattern.

  • Why it's bad: Alpine doesn't reflect real-world AWS AMIs, Ubuntu instances, or RBAC-restricted Kubernetes environments. Testing echo $(rm -rf /) in Alpine proves nothing about how your agent handles missing binaries, shell variations (bash vs zsh vs sh), or K8s permission denied errors.
  • The Fix: Sandbox tests must use realistic targets: Mocked EC2 AMIs, RBAC-configured K3s clusters (not just a blanket k3s mock), and varying user privileges (root vs. limited user).
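As a sketch of what the sandbox matrix could look like: the image tags, user names, and shell list below are examples only, not the product's configuration. A real suite would feed each tuple into pytest.mark.parametrize and run the case in a disposable container.

```python
import itertools

# Illustrative execution-sandbox matrix: realistic images x privilege levels x shells.
IMAGES = ["ubuntu:24.04", "amazonlinux:2023", "rancher/k3s"]  # tags are examples
USERS = ["root", "svc-runner"]   # root vs. a limited service account
SHELLS = ["bash", "sh"]

MATRIX = list(itertools.product(IMAGES, USERS, SHELLS))

def case_id(image, user, shell):
    """Stable test IDs so a failure names its exact environment."""
    return f"{image}|{user}|{shell}"

assert len(MATRIX) == 12  # 3 images x 2 users x 2 shells
assert case_id(*MATRIX[0]) == "ubuntu:24.04|root|bash"
```

Every injection and rollback scenario then runs 12 times, which is exactly where "missing binary" and "permission denied" behaviors diverge from the Alpine happy path.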

5. Transparent Factory Gaps

You hit most of the tenets, but you missed some critical enforcement mechanics:

  • Atomic Flagging: You test the 48-hour bake logic, but does the CI script actually fail the build if a destructive flag bypasses it? You need an integration test against the CI validation script itself.
  • Elastic Schema: You test that app roles can't UPDATE/DELETE the audit log. But Semantic Observability requires "encrypted audit logs." There are zero tests verifying data-at-rest encryption (KMS or pg_crypto) for the execution logs.
  • Configurable Autonomy: You test the Redis logic for Panic Mode, but Story 10.5 requires a webhook endpoint POST /admin/panic that works in <1s. You must test the entire path from HTTP request to gRPC stream pause, not just the Redis key state.
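The Panic Mode point deserves a concrete shape. A minimal full-path sketch using in-process fakes (FakeRedis and FakeAgentStream are stand-ins for the real infrastructure; only the POST /admin/panic endpoint and the <1s budget come from Story 10.5):

```python
import time

class FakeRedis:
    def __init__(self):
        self.store = {}
    def set(self, key, value):
        self.store[key] = value

class FakeAgentStream:
    def __init__(self):
        self.paused = False
    def pause(self):
        self.paused = True

def handle_panic_request(redis, streams):
    """What the POST /admin/panic handler must do end-to-end:
    flip the flag AND pause every live gRPC stream."""
    redis.set("panic_mode", "1")
    for s in streams:
        s.pause()

redis = FakeRedis()
streams = [FakeAgentStream() for _ in range(3)]
start = time.monotonic()
handle_panic_request(redis, streams)
elapsed = time.monotonic() - start

assert redis.store["panic_mode"] == "1"
assert all(s.paused for s in streams)  # the Redis key alone proves nothing
assert elapsed < 1.0                   # Story 10.5: whole path in under 1s
```

The real test replaces the fakes with a live HTTP client and a running Agent connection, but the three assertions stay identical.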

6. Performance Gaps

  • Parse SLA: Epic 1 dictates a 3.5s SLA for LLM extraction, but your performance benchmarks only cover the Rust normalizer (<10ms). You need an end-to-end integration benchmark proving the 5s total SLA (Parser + Classifier) using recorded LLM latency percentiles.
  • Agent Output Buffering: You test Slack rate limiting, but what if a user runs cat /var/log/syslog and dumps 500MB of text? There are no memory profiling tests or truncation enforcement tests on the Agent side before it attempts to stream over gRPC.
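The output-buffering test reduces to one invariant: the Agent never holds more than a fixed cap in memory before streaming. A sketch, where MAX_OUTPUT_BYTES is an illustrative limit rather than a product constant:

```python
import io

MAX_OUTPUT_BYTES = 1 * 1024 * 1024  # illustrative 1 MiB cap per step

def read_capped(stream, limit=MAX_OUTPUT_BYTES, chunk_size=64 * 1024):
    """Read at most `limit` bytes from a child process stream;
    report whether the output had to be truncated."""
    buf = bytearray()
    truncated = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        remaining = limit - len(buf)
        if len(chunk) >= remaining:
            buf += chunk[:remaining]
            truncated = True
            break
        buf += chunk
    return bytes(buf), truncated

# A 500 MB `cat /var/log/syslog` must never be buffered whole;
# 5 MiB stands in for it here.
big = io.BytesIO(b"x" * (5 * 1024 * 1024))
out, truncated = read_capped(big)
assert len(out) == MAX_OUTPUT_BYTES
assert truncated
```

The memory-profiling version of this test asserts peak RSS stays flat while the input grows, which is what actually protects the Agent process.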

7. Missing Threat Scenarios

Your chaos and security tests are missing these highly probable vectors:

  • Advanced Command Injection: Testing echo $(rm) is elementary. Where are the tests for ; rm -rf /, | rm -rf /, backticks, and newline injections?
  • Privilege Escalation: You block sudo, but what if the Agent is run as root by the customer? Tests must cover execution under different user contexts to prevent horizontal escalation.
  • Rollback Failures: You test mid-execution failure triggering a rollback. What happens if the rollback command itself fails or hangs indefinitely? You need a test for nested/fatal failure states.
  • Double Execution (Partial Failure): The Agent executes a step successfully but the network partitions before it sends the success ACK to the engine. Your idempotency test covers duplicate IDs, but does it handle agent reconnections re-syncing an already-executed state?
  • Slack Payload Forgery: You test that approval timeouts mark as stalled, but what if someone curls the Slack approval webhook directly with a forged payload? You must test Slack signature verification.

8. Top 5 Recommendations

  1. Mandate Agent-Side Threat Testing: The Agent is in a zero-trust environment. Write explicit E2E tests for agent binary tampering, payload forgery, and local fallback scanning if the SaaS connection drops.
  2. Upgrade the Execution Sandbox: Ditch the plain Alpine container. Implement a matrix of execution sandboxes (Ubuntu, Amazon Linux 2023, RBAC-restricted K3s) and test execution with non-root users.
  3. Close the Epic Coverage Blackholes: Write test architecture specs for Divergence Analysis (Epic 3.4), Compliance Exports (Epic 5.3), API Rate Limiting (Epic 6.4), and PLG Onboarding limits (Epic 9).
  4. Test the Unknowns of Rollbacks & Double Executions: Add explicit chaos tests for when rollback commands fail, and simulate network drops occurring exactly between command completion and gRPC acknowledgment.
  5. Enforce Cryptographic Audit Tests: Add tests verifying that raw commands and outputs in the PostgreSQL audit log are actually encrypted at rest using pg_crypto or AWS KMS, as mandated by the observability tenets.

Fix these gaps before you write a single line of execution code, or you're asking for a breach.