dd0c/alert — BMad Code Review
Reviewer: BMad Code Review Agent (Gemini)
Date: March 1, 2026
Severity-Rated Findings
🔴 Critical
- Ingestion Security: Replay Attack Vulnerability. HMAC tests validate signatures but don't enforce timestamp freshness. Datadog and PagerDuty send timestamp headers, but OpsGenie doesn't always package one cleanly. Without rejecting payloads older than ~5 minutes, an attacker can capture a valid webhook and replay it against the ingestion endpoint, blowing up SQS queues and Redis windows.
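A minimal sketch of the missing freshness check, assuming a `timestamp.body` signing scheme — the helper name, header format, and hex signature encoding are illustrative, not any vendor's actual contract:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_MS = 5 * 60 * 1000; // reject anything older than ~5 minutes

// Hypothetical verifier. Signing `timestamp.body` (not just the body) means
// the timestamp itself is covered by the HMAC and cannot be swapped on replay.
function verifyWebhook(
  secret: string,
  body: string,
  timestampHeader: string, // epoch seconds, as a string
  signatureHex: string,
  now: number = Date.now(),
): boolean {
  const ts = Number(timestampHeader) * 1000;
  if (!Number.isFinite(ts) || Math.abs(now - ts) > MAX_SKEW_MS) {
    return false; // stale or malformed timestamp → treat as a replay
  }
  const expected = createHmac("sha256", secret)
    .update(`${timestampHeader}.${body}`)
    .digest();
  const received = Buffer.from(signatureHex, "hex");
  // Length check first: timingSafeEqual throws on mismatched lengths.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The freshness check runs before the HMAC comparison, so captured-and-replayed payloads are rejected cheaply even though their signatures are still valid.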
- Trace Propagation: Cross-SQS CI Verification. Checking that `traceparent` is attached to SQS MessageAttributes isn't enough. AWS SDKs often drop or mangle these attributes when crossing into ECS. CI needs to assert on the reassembled trace tree: verify the ECS span registers as a child of the Lambda ingestion span, not as a disconnected root span.
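The trace-tree assertion can be sketched against exported span data; the `Span` shape and the span names here are illustrative stand-ins for whatever the OTLP exporter actually emits:

```typescript
// Minimal exported-span shape (OTLP-ish fields; names are illustrative).
type Span = { name: string; traceId: string; spanId: string; parentSpanId?: string };

// Assert the consumer span is a child of the producer span — i.e. the
// traceparent survived the SQS hop instead of starting a disconnected root.
function assertChildOf(spans: Span[], childName: string, parentName: string): void {
  const parent = spans.find((s) => s.name === parentName);
  const child = spans.find((s) => s.name === childName);
  if (!parent || !child) throw new Error("expected span missing from export");
  if (child.traceId !== parent.traceId) throw new Error(`${childName} started a new trace`);
  if (child.parentSpanId !== parent.spanId) throw new Error(`${childName} is a disconnected root span`);
}
```

In CI this would run against spans collected from a test collector after a full Lambda → SQS → ECS round trip, rather than against hand-built fixtures.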
- Multi-tenancy: Confused Deputy Vulnerabilities. Partition key enforcement tests only cover the happy path for a single tenant. Add explicit negative tests: insert data for Tenant A and Tenant B, query using Tenant A's JWT, and explicitly assert that Tenant B's data is `undefined` in the result set.
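The negative test shape might look like this, using an in-memory stand-in for the partition-keyed table — `queryAlerts` and the tenant IDs are hypothetical:

```typescript
type Alert = { tenantId: string; id: string };

// Seed rows for two tenants.
const table: Alert[] = [
  { tenantId: "tenant-a", id: "alert-1" },
  { tenantId: "tenant-b", id: "alert-2" },
];

// The data-access layer under test: must filter by the JWT's tenant claim.
function queryAlerts(jwtTenantId: string): Alert[] {
  return table.filter((a) => a.tenantId === jwtTenantId);
}

// Negative assertion: Tenant A's credentials must never surface Tenant B's rows.
const visible = queryAlerts("tenant-a");
if (visible.find((a) => a.id === "alert-2") !== undefined) {
  throw new Error("cross-tenant leak: tenant-b data visible to tenant-a");
}
```

The point of the shape is that the test fails loudly if filtering is ever dropped — the real version would run against DynamoDB (or Postgres RLS in self-hosted mode) with real JWTs, not an array filter.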
🟡 Important
- Correlation Quality: Out-of-Order Delivery. Tests likely simulate perfect chronological delivery; distributed monitoring is messy. What happens if an alert from T-minus-30s arrives after the correlation window has closed and shipped the incident to SQS? Does it trigger a duplicate Slack ping, or correctly attach to the existing incident timeline?
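One way to pin down the late-arrival behavior is a unit-level sketch of the correlation store; `ingest`, `closeWindow`, and the incident shape are all illustrative assumptions about the engine's internals:

```typescript
type Incident = { id: string; alertIds: string[]; windowClosed: boolean; notified: number };

const incidents = new Map<string, Incident>(); // keyed by correlation key

// First alert for a key opens an incident and sends exactly one notification.
function ingest(correlationKey: string, alertId: string): Incident {
  const existing = incidents.get(correlationKey);
  if (existing) {
    // Late arrival — even after window close, attach to the timeline
    // instead of opening a duplicate incident or re-notifying Slack.
    existing.alertIds.push(alertId);
    return existing;
  }
  const incident: Incident = {
    id: `inc-${correlationKey}`,
    alertIds: [alertId],
    windowClosed: false,
    notified: 1,
  };
  incidents.set(correlationKey, incident);
  return incident;
}

// Window close ships the incident downstream; the key stays resolvable.
function closeWindow(correlationKey: string): void {
  const inc = incidents.get(correlationKey);
  if (inc) inc.windowClosed = true;
}
```

The test to write against this: ingest an alert, close the window, then ingest a T-minus-30s straggler and assert one incident, two timeline entries, one notification.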
- SQS 256KB: Claim-Check Edge Cases. Compression + S3 pointers is the standard pattern, but tests must cover: (a) S3 put/get latency causing Lambda timeouts, (b) orphaned S3 pointers left behind without lifecycle rules, (c) ECS Fargate failing to fetch the payload due to IAM boundary issues.
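A testable sketch of the claim-check decision, with the uploader injected so tests can simulate S3 latency and IAM failures — the bucket name, key scheme, and size check are assumptions, not the real implementation:

```typescript
import { gzipSync } from "node:zlib";

const SQS_LIMIT = 256 * 1024; // SQS maximum message size

type SqsBody =
  | { kind: "inline"; data: string }
  | { kind: "s3-pointer"; bucket: string; key: string };

// If the compressed, encoded payload fits in an SQS message, send it inline;
// otherwise upload to S3 and send a pointer. A lifecycle rule on the bucket
// should expire orphaned keys (edge case (b) above).
function prepareMessage(
  payload: string,
  upload: (key: string, data: Buffer) => void,
): SqsBody {
  const compressed = gzipSync(Buffer.from(payload));
  const encoded = compressed.toString("base64"); // what actually rides in the SQS body
  if (encoded.length <= SQS_LIMIT) {
    return { kind: "inline", data: encoded };
  }
  const key = `claim-check/${Date.now()}.json.gz`;
  upload(key, compressed); // caller must bound this call, or the Lambda can time out
  return { kind: "s3-pointer", bucket: "alert-payloads", key };
}
```

Injecting `upload` is the design choice that matters: a test double can sleep past the Lambda deadline for case (a) or throw `AccessDenied` for case (c) without touching real AWS.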
- Self-Hosted Mode: Behavioral Gaps. DynamoDB scales connections and handles TTL natively; PostgreSQL requires explicit connection pooling (PgBouncer) and manual partitioning/pruning. Lambda recycles memory, but Fastify is a persistent process: a tiny memory leak in the correlation engine will eventually crash Fastify while going unnoticed in Lambda. The test suite doesn't simulate long-running process memory limits or Postgres connection exhaustion.
V1 Cut List
- Self-hosted mode / DB abstractions — pick AWS SaaS and commit. Supporting Postgres/Fastify doubles testing surface for zero immediate revenue.
- Dashboard UI E2E (Playwright) — test the API thoroughly, visually verify the UI.
- OTEL trace propagation tests — visually verify in Jaeger once.
- DLQ replay with backpressure — manual replay is fine for V1.
- Slack circuit breaker — if Slack is down, alerts queue. Accept it.
Must-Have Before Launch
- HMAC timestamp validation — reject payloads older than 5 minutes (all 4 sources).
- Cross-tenant negative tests — explicitly assert data isolation between tenants.
- Correlation window edge cases — out-of-order delivery, late arrivals after window close.
- SQS 256KB S3 pointer round-trip — prove the claim-check pattern works end-to-end.
- Free tier enforcement — 10K alerts/month counter, 7-day retention purge.
- Slack signature validation — timing-safe HMAC for interactive payloads.
"If you aren't testing for leakage, it will leak."