dd0c/alert — BMad Code Review
Reviewer: BMad Code Review Agent (Gemini)
Date: March 1, 2026
Severity-Rated Findings
🔴 Critical
- Ingestion Security: Replay Attack Vulnerability. HMAC tests validate signatures but don't enforce timestamp freshness. Datadog and PagerDuty send timestamp headers, but OpsGenie doesn't always package one cleanly. Without rejecting payloads older than ~5 minutes, an attacker can capture a valid webhook and replay it against the ingestion endpoint, blowing up SQS queues and Redis windows.
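A minimal sketch of the missing freshness check, assuming a `timestamp.body` signing scheme — the helper name, header format, and hex signature encoding are illustrative, not any vendor's actual contract:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_MS = 5 * 60 * 1000; // reject anything older than ~5 minutes

// Hypothetical verifier. Signing `timestamp.body` (not just the body) means
// the timestamp itself is covered by the HMAC and cannot be swapped on replay.
function verifyWebhook(
  secret: string,
  body: string,
  timestampHeader: string, // epoch seconds, as a string
  signatureHex: string,
  now: number = Date.now(),
): boolean {
  const ts = Number(timestampHeader) * 1000;
  if (!Number.isFinite(ts) || Math.abs(now - ts) > MAX_SKEW_MS) {
    return false; // stale or malformed timestamp → treat as a replay
  }
  const expected = createHmac("sha256", secret)
    .update(`${timestampHeader}.${body}`)
    .digest();
  const received = Buffer.from(signatureHex, "hex");
  // Length check first: timingSafeEqual throws on mismatched lengths.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The freshness check runs before the HMAC comparison, so captured-and-replayed payloads are rejected cheaply even though their signatures are still valid.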
- Trace Propagation: Cross-SQS CI Verification. Checking that `traceparent` is attached to SQS MessageAttributes isn't enough. AWS SDKs often drop or mangle these attributes when crossing into ECS. CI needs to assert on the reassembled trace tree: verify the ECS span registers as a child of the Lambda ingestion span, not as a disconnected root span.
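The trace-tree assertion can be sketched against exported span data; the `Span` shape and the span names here are illustrative stand-ins for whatever the OTLP exporter actually emits:

```typescript
// Minimal exported-span shape (OTLP-ish fields; names are illustrative).
type Span = { name: string; traceId: string; spanId: string; parentSpanId?: string };

// Assert the consumer span is a child of the producer span — i.e. the
// traceparent survived the SQS hop instead of starting a disconnected root.
function assertChildOf(spans: Span[], childName: string, parentName: string): void {
  const parent = spans.find((s) => s.name === parentName);
  const child = spans.find((s) => s.name === childName);
  if (!parent || !child) throw new Error("expected span missing from export");
  if (child.traceId !== parent.traceId) throw new Error(`${childName} started a new trace`);
  if (child.parentSpanId !== parent.spanId) throw new Error(`${childName} is a disconnected root span`);
}
```

In CI this would run against spans collected from a test collector after a full Lambda → SQS → ECS round trip, rather than against hand-built fixtures.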
- Multi-tenancy: Confused Deputy Vulnerabilities. Partition key enforcement tests only cover the happy path for a single tenant. Add explicit negative tests: insert data for Tenant A and Tenant B, query using Tenant A's JWT, and explicitly assert that Tenant B's data is `undefined` in the result set.
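The negative test shape might look like this, using an in-memory stand-in for the partition-keyed table — `queryAlerts` and the tenant IDs are hypothetical:

```typescript
type Alert = { tenantId: string; id: string };

// Seed rows for two tenants.
const table: Alert[] = [
  { tenantId: "tenant-a", id: "alert-1" },
  { tenantId: "tenant-b", id: "alert-2" },
];

// The data-access layer under test: must filter by the JWT's tenant claim.
function queryAlerts(jwtTenantId: string): Alert[] {
  return table.filter((a) => a.tenantId === jwtTenantId);
}

// Negative assertion: Tenant A's credentials must never surface Tenant B's rows.
const visible = queryAlerts("tenant-a");
if (visible.find((a) => a.id === "alert-2") !== undefined) {
  throw new Error("cross-tenant leak: tenant-b data visible to tenant-a");
}
```

The point of the shape is that the test fails loudly if filtering is ever dropped — the real version would run against DynamoDB (or Postgres RLS in self-hosted mode) with real JWTs, not an array filter.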
🟡 Important
- Correlation Quality: Out-of-Order Delivery. Tests likely simulate perfect chronological delivery; distributed monitoring is messy. What happens if an alert from T-minus-30s arrives after the correlation window has closed and shipped the incident to SQS? Does it trigger a duplicate Slack ping, or correctly attach to the existing incident timeline?
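One way to pin down the late-arrival behavior is a unit-level sketch of the correlation store; `ingest`, `closeWindow`, and the incident shape are all illustrative assumptions about the engine's internals:

```typescript
type Incident = { id: string; alertIds: string[]; windowClosed: boolean; notified: number };

const incidents = new Map<string, Incident>(); // keyed by correlation key

// First alert for a key opens an incident and sends exactly one notification.
function ingest(correlationKey: string, alertId: string): Incident {
  const existing = incidents.get(correlationKey);
  if (existing) {
    // Late arrival — even after window close, attach to the timeline
    // instead of opening a duplicate incident or re-notifying Slack.
    existing.alertIds.push(alertId);
    return existing;
  }
  const incident: Incident = {
    id: `inc-${correlationKey}`,
    alertIds: [alertId],
    windowClosed: false,
    notified: 1,
  };
  incidents.set(correlationKey, incident);
  return incident;
}

// Window close ships the incident downstream; the key stays resolvable.
function closeWindow(correlationKey: string): void {
  const inc = incidents.get(correlationKey);
  if (inc) inc.windowClosed = true;
}
```

The test to write against this: ingest an alert, close the window, then ingest a T-minus-30s straggler and assert one incident, two timeline entries, one notification.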
- SQS 256KB: Claim-Check Edge Cases. Compression + S3 pointers is the standard pattern, but tests must cover: (a) S3 put/get latency causing Lambda timeouts, (b) orphaned S3 pointers left behind without lifecycle rules, (c) ECS Fargate failing to fetch the payload due to IAM boundary issues.
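A testable sketch of the claim-check decision, with the uploader injected so tests can simulate S3 latency and IAM failures — the bucket name, key scheme, and size check are assumptions, not the real implementation:

```typescript
import { gzipSync } from "node:zlib";

const SQS_LIMIT = 256 * 1024; // SQS maximum message size

type SqsBody =
  | { kind: "inline"; data: string }
  | { kind: "s3-pointer"; bucket: string; key: string };

// If the compressed, encoded payload fits in an SQS message, send it inline;
// otherwise upload to S3 and send a pointer. A lifecycle rule on the bucket
// should expire orphaned keys (edge case (b) above).
function prepareMessage(
  payload: string,
  upload: (key: string, data: Buffer) => void,
): SqsBody {
  const compressed = gzipSync(Buffer.from(payload));
  const encoded = compressed.toString("base64"); // what actually rides in the SQS body
  if (encoded.length <= SQS_LIMIT) {
    return { kind: "inline", data: encoded };
  }
  const key = `claim-check/${Date.now()}.json.gz`;
  upload(key, compressed); // caller must bound this call, or the Lambda can time out
  return { kind: "s3-pointer", bucket: "alert-payloads", key };
}
```

Injecting `upload` is the design choice that matters: a test double can sleep past the Lambda deadline for case (a) or throw `AccessDenied` for case (c) without touching real AWS.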
- Self-Hosted Mode: Behavioral Gaps. DynamoDB scales connections and handles TTL natively; PostgreSQL requires explicit connection pooling (PgBouncer) and manual partitioning/pruning. Lambda recycles memory, but Fastify is a persistent process: a tiny memory leak in the correlation engine will eventually crash Fastify while going unnoticed in Lambda. The test suite doesn't simulate long-running process memory limits or Postgres connection exhaustion.
V1 Cut List
- Self-hosted mode / DB abstractions — pick AWS SaaS and commit. Supporting Postgres/Fastify doubles testing surface for zero immediate revenue.
- Dashboard UI E2E (Playwright) — test the API thoroughly, visually verify the UI.
- OTEL trace propagation tests — visually verify in Jaeger once.
- DLQ replay with backpressure — manual replay is fine for V1.
- Slack circuit breaker — if Slack is down, alerts queue. Accept it.
Must-Have Before Launch
- HMAC timestamp validation — reject payloads older than 5 minutes (all 4 sources).
- Cross-tenant negative tests — explicitly assert data isolation between tenants.
- Correlation window edge cases — out-of-order delivery, late arrivals after window close.
- SQS 256KB S3 pointer round-trip — prove the claim-check pattern works end-to-end.
- Free tier enforcement — 10K alerts/month counter, 7-day retention purge.
- Slack signature validation — timing-safe HMAC for interactive payloads.
"If you aren't testing for leakage, it will leak."