BMad code reviews complete for all 6 products

P1 route: Gemini — 'Ship the proxy, stop writing tests for the tests' P2 drift: Gemini — mTLS revocation, state lock corruption, RLS pool leak P3 alert: Gemini — replay attacks, trace propagation, SQS claim-check P4 portal: Manual — discovery reliability is existential risk P5 cost: Manual — concurrent baselines, remediation RBAC, pricing staleness P6 run: Gemini — policy update loophole, AST parsing, audit streaming
2026-03-01 02:09:19 +00:00
parent 9cc5aeaa03
commit b24cfa7c0d
4 changed files with 220 additions and 0 deletions
--- a/products/02-iac-drift-detection/test-architecture/bmad-review.md
+++ b/products/02-iac-drift-detection/test-architecture/bmad-review.md
@@ -0,0 +1,56 @@
+# dd0c/drift — BMad Code Review
+
+**Reviewer:** BMad Code Review Agent (Gemini)
+**Date:** March 1, 2026
+**Verdict:** "Over 2,000 lines of tests for a V1? You're either a genius or you're over-engineering this to death."
+
+---
+
+## Severity-Rated Findings
+
+### 🔴 Critical
+
+1. **mTLS Revocation Gap.** Certificate rotation tests are fine, but revocation is notoriously hard. Do tests verify CRLs or OCSP stapling? If an agent goes rogue and you revoke the cert, does the SaaS *actively* drop the existing mTLS connection, or do cached sessions stay alive? Also missing: payload replay attacks — if an attacker captures a legitimate drift payload and spams the Event Processor, does the mTLS setup drop it via strict nonce/timestamp validation?
+
+2. **Remediation State Lock Corruption.** Panic mode mid-remediation is a nightmare. If you send a kill signal to an agent while it's midway through a `terraform apply`, you don't just stop the job — you potentially corrupt the remote state file. Tests MUST verify that IaC state locks (like DynamoDB tfstate locks) are safely released upon panic. If hitting "Panic" leaves their infrastructure in an irrecoverable zombie state, nobody will ever trust your product.
+
+3. **RLS Connection Pool Leak.** Postgres RLS is the right move, but tests must simulate connection pooling (PgBouncer). If a worker doesn't clear the `SET LOCAL role` or tenant ID context between SQS messages, Tenant A's drift report ends up in Tenant B's dashboard. Cross-tenant leakage via dirty pooled connections is a classic trap.
+
+### 🟡 Important
+
+4. **Secret Scrubber 100% Coverage is a Vanity Metric.** It just means you successfully tested the regexes you already knew to write. Bypass vectors: Base64-encoded AWS keys hiding in JSON blocks, multi-line RSA private keys dumped in Pulumi outputs, high-entropy custom tokens. Need Shannon entropy scanning, not just regex matching.
+
+5. **pgmq vs SQS Visibility Timeout Mismatch.** SQS FIFO gives guaranteed ordering per `MessageGroupId` and exactly-once processing. pgmq operates inside Postgres transactions with different retry and visibility timeout semantics. If a scan takes 5 minutes but pgmq visibility timeout is 2 minutes, pgmq will hand the job to a second worker — triggering the "concurrent scan during remediation" race condition. Must explicitly test long-running worker timeout edge cases on BOTH queue backends.
+
+6. **Concurrent Scan During Remediation.** If a drift scan runs while a remediation (`terraform apply`) is in progress, the scan will see a half-applied state and report phantom drifts. Tests must verify the system either (a) locks the stack during remediation, or (b) skips scanning with a "remediation in progress" status.
+
+### 🟢 Nice-to-Have
+
+7. **Distributed tracing cross-boundary tests.** Basic structured logs and correlation IDs are fine for V1.
+8. **Backward compatibility serialization (Elastic Schema).** If you break the schema in V1, just force agents to auto-update.
+9. **Noisy neighbor fair-share processing.** Rate-limit at the API gateway for now.
+
+---
+
+## V1 Cut List (Skip for MVP)
+
+- ✂️ Distributed tracing cross-boundary tests — structured logs + correlation IDs are fine
+- ✂️ Backward compatibility serialization — force agent auto-update in V1
+- ✂️ Noisy neighbor fair-share processing — rate-limit at API gateway
+- ✂️ Dashboard UI Playwright tests — test the API, visually verify the UI
+- ✂️ Terratest for full infra validation — CDK synth assertions are sufficient for V1
+- ✂️ Memory profiling for large drift reports — premature optimization
+
+## Must-Have Before Launch
+
+1. **mTLS revocation test** — prove a banned agent is instantly locked out, not cached
+2. **Remediation state-lock recovery** — prove panic mode safely releases terraform state locks
+3. **RLS connection pool leak test** — prove PgBouncer doesn't leak tenant context between requests
+4. **Secret scrubber entropy scanning** — add Shannon entropy detection beyond regex patterns
+5. **pgmq visibility timeout test** — prove long-running scans don't trigger duplicate processing
+6. **Free tier enforcement** — 3 stack limit, upgrade prompt
+7. **Cross-tenant isolation negative tests** — insert data for Tenant A and B, query as A, assert B is invisible
+
+---
+
+*"If hitting Panic leaves their infrastructure in an irrecoverable zombie state, nobody will ever trust your product."*