From b24cfa7c0d1f4a66aca9a6600caa56b55a4d0492 Mon Sep 17 00:00:00 2001 From: Max Mayfield Date: Sun, 1 Mar 2026 02:09:19 +0000 Subject: [PATCH] BMad code reviews complete for all 6 products MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 route: Gemini — 'Ship the proxy, stop writing tests for the tests' P2 drift: Gemini — mTLS revocation, state lock corruption, RLS pool leak P3 alert: Gemini — replay attacks, trace propagation, SQS claim-check P4 portal: Manual — discovery reliability is existential risk P5 cost: Manual — concurrent baselines, remediation RBAC, pricing staleness P6 run: Gemini — policy update loophole, AST parsing, audit streaming --- .../test-architecture/bmad-review.md | 64 +++++++++++++++++++ .../test-architecture/bmad-review.md | 56 ++++++++++++++++ .../test-architecture/bmad-review.md | 47 ++++++++++++++ .../test-architecture/bmad-review.md | 53 +++++++++++++++ 4 files changed, 220 insertions(+) create mode 100644 products/01-llm-cost-router/test-architecture/bmad-review.md create mode 100644 products/02-iac-drift-detection/test-architecture/bmad-review.md create mode 100644 products/03-alert-intelligence/test-architecture/bmad-review.md create mode 100644 products/06-runbook-automation/test-architecture/bmad-review.md diff --git a/products/01-llm-cost-router/test-architecture/bmad-review.md b/products/01-llm-cost-router/test-architecture/bmad-review.md new file mode 100644 index 0000000..4148e4d --- /dev/null +++ b/products/01-llm-cost-router/test-architecture/bmad-review.md @@ -0,0 +1,64 @@ +# dd0c/route — BMad Code Review + +**Reviewer:** BMad Code Review Agent (Gemini) +**Date:** March 1, 2026 +**Verdict:** "Ship the proxy. Stop writing tests for the tests." + +--- + +## The Reality Check + +### Completeness +10 epics covered with property-based tests, but missing: +- **Provider failovers and token math.** What happens when OpenAI gives 529 and Anthropic takes 40s to timeout? 
Where are edge-case tokenization tests (weird unicode in `cl100k_base` vs Anthropic's tokenizer) that will screw up billing? + +### Realism +Can a solo founder implement this? **Absolutely not** as written. 2,553 lines of test architecture for a V1 proxy is top-tier procrastination. Ruthlessly prioritize the hot path (routing & billing). Everything else is a distraction. + +### Anti-Patterns +- Replacing mocks with Testcontainers is correct, but Testcontainers are slow. If CI takes 20 minutes per PR, you'll stop running tests. +- Testing Dashboard UI via E2E is a trap. UI churns daily in V1. Those tests will flake and get ignored. + +--- + +## Severity-Rated Findings + +### 🔴 Critical + +1. **Latency benchmarks (<5ms) in shared CI.** GitHub Actions has variable CPU/IO. Latency tests will flake constantly. Move to dedicated runners or drop hard thresholds in CI. + +2. **No tests for API key redaction in logging/panic traces.** The proxy handles raw OpenAI/Anthropic keys. A panic in axum handlers could dump the bearer token into TimescaleDB or application logs. This is a security incident waiting to happen. + +### 🟡 Important + +3. **Token calculation edge cases and provider failover timeouts** missing from proxy engine epic. + +4. **Self-hosted mode needs memory-constraint and custom CA cert validation.** Enterprise users behind corporate firewalls with SSL MITM. Postgres and TimescaleDB will eat all RAM if not tuned — need a test verifying containers start with strict `deploy.resources.limits.memory`. + +5. **Transparent Factory tenet tests are checkbox theater.** Checking `assert(span.name == "proxy_request")` is useless. Tests must validate spans contain `tenant_id` and `cost_usd` attributes needed for the billing pipeline. + +### 🟢 Nice-to-Have + +6. **Chaos tests (slow DB/Redis).** Cool story, unnecessary for V1. + +--- + +## V1 Cut List (Skip for MVP) + +- **Dashboard UI E2E Tests** — Axe them. Test the API thoroughly. If the UI breaks, you'll see it. 
+- **Chaos Tests (Slow DB/Redis)** — If Redis is slow, your proxy will be slow. Accept it for V1. +- **OTEL Trace Propagation Tests** — Visually verify in Jaeger/Honeycomb once and move on. +- **24-hour Soak Test** — Your users are your soak test. +- **Slack Integration Epic** — Manually test it. + +## Must-Have Before Launch + +1. **SSE Connection Drop Billing Leak Tests** — If a user drops an SSE stream mid-generation and you don't bill for tokens already streamed, you go bankrupt. Biggest financial risk. +2. **IDOR & Tenant Isolation** — Cross the streams on user data or billing, you're dead. +3. **Cost Calculation Property Tests** — Random payloads, verify `prompt_tokens * rate + comp_tokens * rate` always holds. +4. **Testcontainer Hot Path** — Full integration: `Request → Axum → Redis (rate limit) → Provider → Axum → Postgres (billing)` using real containers. +5. **API Key Redaction** — Prove that panics, error logs, and telemetry never contain raw provider API keys. + +--- + +*"Ship the proxy. Stop writing tests for the tests."* diff --git a/products/02-iac-drift-detection/test-architecture/bmad-review.md b/products/02-iac-drift-detection/test-architecture/bmad-review.md new file mode 100644 index 0000000..fbc4c17 --- /dev/null +++ b/products/02-iac-drift-detection/test-architecture/bmad-review.md @@ -0,0 +1,56 @@ +# dd0c/drift — BMad Code Review + +**Reviewer:** BMad Code Review Agent (Gemini) +**Date:** March 1, 2026 +**Verdict:** "Over 2,000 lines of tests for a V1? You're either a genius or you're over-engineering this to death." + +--- + +## Severity-Rated Findings + +### 🔴 Critical + +1. **mTLS Revocation Gap.** Certificate rotation tests are fine, but revocation is notoriously hard. Do tests verify CRLs or OCSP stapling? If an agent goes rogue and you revoke the cert, does the SaaS *actively* drop the existing mTLS connection, or do cached sessions stay alive? 
Also missing: payload replay attacks — if an attacker captures a legitimate drift payload and spams the Event Processor, does the mTLS setup drop it via strict nonce/timestamp validation? + +2. **Remediation State Lock Corruption.** Panic mode mid-remediation is a nightmare. If you send a kill signal to an agent while it's midway through a `terraform apply`, you don't just stop the job — you potentially corrupt the remote state file. Tests MUST verify that IaC state locks (like DynamoDB tfstate locks) are safely released upon panic. If hitting "Panic" leaves their infrastructure in an irrecoverable zombie state, nobody will ever trust your product. + +3. **RLS Connection Pool Leak.** Postgres RLS is the right move, but tests must simulate connection pooling (PgBouncer). If a worker doesn't clear the `SET LOCAL role` or tenant ID context between SQS messages, Tenant A's drift report ends up in Tenant B's dashboard. Cross-tenant leakage via dirty pooled connections is a classic trap. + +### 🟡 Important + +4. **Secret Scrubber 100% Coverage is a Vanity Metric.** It just means you successfully tested the regexes you already knew to write. Bypass vectors: Base64-encoded AWS keys hiding in JSON blocks, multi-line RSA private keys dumped in Pulumi outputs, high-entropy custom tokens. Need Shannon entropy scanning, not just regex matching. + +5. **pgmq vs SQS Visibility Timeout Mismatch.** SQS FIFO gives guaranteed ordering per `MessageGroupId` and exactly-once processing. pgmq operates inside Postgres transactions with different retry and visibility timeout semantics. If a scan takes 5 minutes but pgmq visibility timeout is 2 minutes, pgmq will hand the job to a second worker — triggering the "concurrent scan during remediation" race condition. Must explicitly test long-running worker timeout edge cases on BOTH queue backends. + +6. 
**Concurrent Scan During Remediation.** If a drift scan runs while a remediation (`terraform apply`) is in progress, the scan will see a half-applied state and report phantom drifts. Tests must verify the system either (a) locks the stack during remediation, or (b) skips scanning with a "remediation in progress" status. + +### 🟢 Nice-to-Have + +7. **Distributed tracing cross-boundary tests.** Basic structured logs and correlation IDs are fine for V1. +8. **Backward compatibility serialization (Elastic Schema).** If you break the schema in V1, just force agents to auto-update. +9. **Noisy neighbor fair-share processing.** Rate-limit at the API gateway for now. + +--- + +## V1 Cut List (Skip for MVP) + +- ✂️ Distributed tracing cross-boundary tests — structured logs + correlation IDs are fine +- ✂️ Backward compatibility serialization — force agent auto-update in V1 +- ✂️ Noisy neighbor fair-share processing — rate-limit at API gateway +- ✂️ Dashboard UI Playwright tests — test the API, visually verify the UI +- ✂️ Terratest for full infra validation — CDK synth assertions are sufficient for V1 +- ✂️ Memory profiling for large drift reports — premature optimization + +## Must-Have Before Launch + +1. **mTLS revocation test** — prove a banned agent is instantly locked out, not cached +2. **Remediation state-lock recovery** — prove panic mode safely releases terraform state locks +3. **RLS connection pool leak test** — prove PgBouncer doesn't leak tenant context between requests +4. **Secret scrubber entropy scanning** — add Shannon entropy detection beyond regex patterns +5. **pgmq visibility timeout test** — prove long-running scans don't trigger duplicate processing +6. **Free tier enforcement** — 3 stack limit, upgrade prompt +7. 
**Cross-tenant isolation negative tests** — insert data for Tenant A and B, query as A, assert B is invisible + +--- + +*"If hitting Panic leaves their infrastructure in an irrecoverable zombie state, nobody will ever trust your product."* diff --git a/products/03-alert-intelligence/test-architecture/bmad-review.md b/products/03-alert-intelligence/test-architecture/bmad-review.md new file mode 100644 index 0000000..b010da1 --- /dev/null +++ b/products/03-alert-intelligence/test-architecture/bmad-review.md @@ -0,0 +1,47 @@ +# dd0c/alert — BMad Code Review + +**Reviewer:** BMad Code Review Agent (Gemini) +**Date:** March 1, 2026 + +--- + +## Severity-Rated Findings + +### 🔴 Critical + +1. **Ingestion Security: Replay Attack Vulnerability.** HMAC tests validate signatures but don't enforce timestamp freshness. Datadog and PagerDuty have timestamp headers, but OpsGenie doesn't always package it cleanly. Without rejecting payloads older than ~5 minutes, an attacker can capture a valid webhook and spam the ingestion endpoint, blowing up SQS queues and Redis windows. + +2. **Trace Propagation: Cross-SQS CI Verification.** Checking that `traceparent` is attached to SQS MessageAttributes isn't enough. AWS SDKs often drop or mangle these when crossing into ECS. CI needs to assert on the *reassembled* trace tree — verify the ECS span registers as a child of the Lambda ingestion span, not a disconnected root span. + +3. **Multi-tenancy: Confused Deputy Vulnerabilities.** Partition key enforcement tests only check the happy path for a single tenant. Need explicit negative tests: insert data for Tenant A and Tenant B, query using Tenant A's JWT, explicitly assert Tenant B's data is `undefined` in the result set. + +### 🟡 Important + +4. **Correlation Quality: Out-of-Order Delivery.** Tests likely simulate perfect chronological delivery. Distributed monitoring is messy. 
What happens if an alert from T-minus-30s arrives *after* the correlation window has closed and shipped the incident to SQS? Does it trigger a duplicate Slack ping, or correctly attach to the existing incident timeline? + +5. **SQS 256KB: Claim-Check Edge Cases.** Compression + S3 pointers is standard, but tests must cover: (a) S3 put/get latency causing Lambda timeouts, (b) orphaned S3 pointers without lifecycle rules, (c) ECS Fargate failing to fetch payload due to IAM boundary issues. + +6. **Self-Hosted Mode: Behavioral Gaps.** DynamoDB scales connections and handles TTL natively. PostgreSQL requires explicit connection pooling (PgBouncer) and manual partitioning/pruning. Lambda recycles memory; Fastify is persistent — a tiny memory leak in the correlation engine will crash Fastify but go unnoticed in Lambda. Test suite doesn't simulate long-running process memory limits or Postgres connection exhaustion. + +--- + +## V1 Cut List + +- Self-hosted mode / DB abstractions — pick AWS SaaS and commit. Supporting Postgres/Fastify doubles testing surface for zero immediate revenue. +- Dashboard UI E2E (Playwright) — test the API thoroughly, visually verify the UI. +- OTEL trace propagation tests — visually verify in Jaeger once. +- DLQ replay with backpressure — manual replay is fine for V1. +- Slack circuit breaker — if Slack is down, alerts queue. Accept it. + +## Must-Have Before Launch + +1. **HMAC timestamp validation** — reject payloads older than 5 minutes (all 4 sources). +2. **Cross-tenant negative tests** — explicitly assert data isolation between tenants. +3. **Correlation window edge cases** — out-of-order delivery, late arrivals after window close. +4. **SQS 256KB S3 pointer round-trip** — prove the claim-check pattern works end-to-end. +5. **Free tier enforcement** — 10K alerts/month counter, 7-day retention purge. +6. **Slack signature validation** — timing-safe HMAC for interactive payloads. 
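Must-haves 1 and 6 reduce to the same two checks: timestamp freshness first, then a timing-safe compare. A minimal Python sketch — `verify_webhook` is a hypothetical helper and the `timestamp.body` signing string is illustrative; each real source has its own header names and scheme:

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 300  # reject anything older (or future-dated) beyond ~5 minutes


def verify_webhook(secret: bytes, body: bytes, timestamp: str,
                   signature_hex: str, now=None) -> bool:
    """Freshness check first, then a timing-safe HMAC compare."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except ValueError:
        return False  # missing/garbled timestamp header: reject, don't guess
    if abs(now - ts) > MAX_SKEW_SECONDS:
        return False  # stale payload — replay territory
    # Sign the timestamp together with the body so a captured signature
    # can't be re-attached to a replayed payload with a fresh header.
    expected = hmac.new(secret, f"{ts}.".encode() + body,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

Slack's v0 scheme works roughly this way: because the timestamp is inside the MAC, an attacker can't refresh a captured payload by rewriting a header, and `compare_digest` closes the timing side-channel.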
+ +--- + +*"If you aren't testing for leakage, it will leak."* diff --git a/products/06-runbook-automation/test-architecture/bmad-review.md b/products/06-runbook-automation/test-architecture/bmad-review.md new file mode 100644 index 0000000..efecd53 --- /dev/null +++ b/products/06-runbook-automation/test-architecture/bmad-review.md @@ -0,0 +1,53 @@ +# dd0c/run — BMad Code Review + +**Reviewer:** BMad Code Review Agent (Gemini) +**Date:** March 1, 2026 +**Verdict:** "Executing shell commands in customer infrastructure is playing with fire, and this architecture has serious blind spots." + +--- + +## Severity-Rated Findings + +### 🔴 Critical + +1. **The Policy Update Loophole.** If the SaaS is compromised, an attacker doesn't need to send a malicious command — they push a malicious *policy update* or agent binary update that disables the Scanner RED layer. If the agent trusts the SaaS for updates without an offline, customer-controlled root of trust (like a locally provisioned public key), "zero-trust" is security theater. + +2. **Obfuscation Beyond Base64.** Testing for semicolons, backticks, null bytes, and base64 is basic. Missing: environment variable concatenation (`X=rm; Y=-rf; $X $Y`), hex encoding (`\x2f\x62\x69\x6e`), and `eval`. The Scanner RED needs to parse the AST of the shell command (use `shfmt` or `mvdan.cc/sh`), not just regex it, or it will get bypassed. + +3. **Rollback Failure Blackhole.** "Rollback failure → manual intervention" will wait indefinitely if the network is severed or the box is locked up. Need a hard TTL/timeout on the intervention state that automatically fails-closed and alerts out-of-band. + +4. **Root Can Delete Audit History.** "Audit log encryption" is useless if the attacker `rm -rf`s the local logs before the agent flushes the gRPC buffer. If the SaaS is compromised and gives an attacker local root, they can wipe their tracks. 
Need forward-secrecy and append-only cryptographic hash chains shipped *immediately* (streaming, not batching). + +### 🟡 Important + +5. **Idempotency vs. Network Partitions.** Double execution prevention is great, but if a partition happens *during* a non-idempotent command (like `INSERT` or `DROP`), the agent might think it failed, retry, and nuke the database. Agent needs to enforce idempotency at the script level or fail-stop on partition. + +6. **Concurrent Execution Locks.** If two agents (or a resurrected ghost process) try to execute the state machine simultaneously on the same host, race conditions. Need aggressive PID locking or atomic file locks. + +7. **gRPC 10MB Buffer Truncation Attack.** If an attacker spams STDOUT to hit the 10MB limit, do you truncate the logs, drop the connection, or crash? If you truncate, they can hide their malicious payload at the end of a 10MB garbage string. + +### 🟢 Nice-to-Have + +8. **Dual-layer classifier is smart.** Scanner overriding LLM is the right call. Never trust an LLM with root access. + +--- + +## Must-Have Before Launch + +1. **Cryptographic signatures for policy/agent updates.** Agent must verify updates against a key stored locally on the customer's box, completely separate from the SaaS database. +2. **Streaming append-only audit logs.** Do not buffer logs locally. Stream them with hash chains so if the box is compromised, the attacker can't alter the past. +3. **Strict shell AST parsing.** Regex for dangerous commands is a losing game. Use a real shell parser (`shfmt` or `mvdan.cc/sh`) to evaluate the syntax tree for the Canary Suite. +4. **Intervention deadlock TTLs.** Hard fail-close timeouts for any manual intervention states. +5. **Canary Suite on every commit.** 50 known-destructive commands must ALWAYS be 🔴. + +## V1 Cut List + +- Slack payload forgery protection — rely on standard TLS/Auth and webhook secrets. +- Unicode homoglyph detection — edge case, not V1. 
+- K3s RBAC sandbox matrix — start with Ubuntu/Amazon Linux only. +- Dashboard UI Playwright tests — manual testing is sufficient. +- Compliance export (CSV/PDF) — manual export is fine for V1. + +--- + +*"If your agent trusts the SaaS for updates without an offline root of trust, your zero-trust is just security theater."*
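The hash-chain requirement in must-have 2 is cheap to prototype. A minimal, illustrative Python sketch (`chain_append`/`chain_verify` are hypothetical names, not the agent's API) — each record's hash commits to the previous one, so edits or reordering are detectable by replay:

```python
import hashlib
import json


def chain_append(log: list, prev_hash: str, event: dict) -> str:
    """Append an event whose hash commits to the previous entry."""
    record = {"prev": prev_hash, "event": event}
    digest = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    record["hash"] = digest
    log.append(record)
    return digest  # stream this off-host immediately in a real agent


def chain_verify(log: list, genesis: str = "0" * 64) -> bool:
    """Replay the chain; any in-place edit or reorder breaks a link."""
    prev = genesis
    for record in log:
        if record["prev"] != prev:
            return False
        body = {"prev": record["prev"], "event": record["event"]}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if digest != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Note the limit: truncating the tail is only caught if the latest hash has already left the box, which is exactly why the review insists on streaming records immediately rather than batching.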