# dd0c/alert: BMad Code Review

**Reviewer:** BMad Code Review Agent (Gemini)

**Date:** March 1, 2026

---

## Severity-Rated Findings

### 🔴 Critical

1. **Ingestion Security: Replay Attack Vulnerability.** HMAC tests validate signatures but don't enforce timestamp freshness. Datadog and PagerDuty have timestamp headers, but OpsGenie doesn't always package it cleanly. Without rejecting payloads older than ~5 minutes, an attacker can capture a valid webhook and spam the ingestion endpoint, blowing up SQS queues and Redis windows.
2. **Trace Propagation: Cross-SQS CI Verification.** Checking that `traceparent` is attached to SQS MessageAttributes isn't enough. AWS SDKs often drop or mangle these when crossing into ECS. CI needs to assert on the *reassembled* trace tree: verify that the ECS span registers as a child of the Lambda ingestion span, not as a disconnected root span.
3. **Multi-tenancy: Confused Deputy Vulnerabilities.** Partition key enforcement tests only check the happy path for a single tenant. Need explicit negative tests: insert data for Tenant A and Tenant B, query using Tenant A's JWT, and explicitly assert that Tenant B's data is `undefined` in the result set.

### 🟡 Important

4. **Correlation Quality: Out-of-Order Delivery.** Tests likely simulate perfect chronological delivery. Distributed monitoring is messy. What happens if an alert from T-minus-30s arrives *after* the correlation window has closed and shipped the incident to SQS? Does it trigger a duplicate Slack ping, or correctly attach to the existing incident timeline?
5. **SQS 256KB: Claim-Check Edge Cases.** Compression + S3 pointers is standard, but tests must cover: (a) S3 put/get latency causing Lambda timeouts, (b) orphaned S3 pointers without lifecycle rules, (c) ECS Fargate failing to fetch the payload due to IAM boundary issues.
6. **Self-Hosted Mode: Behavioral Gaps.** DynamoDB scales connections and handles TTL natively. PostgreSQL requires explicit connection pooling (PgBouncer) and manual partitioning/pruning. Lambda recycles memory; Fastify is persistent, so a tiny memory leak in the correlation engine will crash Fastify but go unnoticed in Lambda. The test suite doesn't simulate long-running process memory limits or Postgres connection exhaustion.

---

## V1 Cut List

- Self-hosted mode / DB abstractions: pick AWS SaaS and commit. Supporting Postgres/Fastify doubles the testing surface for zero immediate revenue.
- Dashboard UI E2E (Playwright): test the API thoroughly, visually verify the UI.
- OTEL trace propagation tests: visually verify in Jaeger once.
- DLQ replay with backpressure: manual replay is fine for V1.
- Slack circuit breaker: if Slack is down, alerts queue. Accept it.

## Must-Have Before Launch

1. **HMAC timestamp validation**: reject payloads older than 5 minutes (all 4 sources).
2. **Cross-tenant negative tests**: explicitly assert data isolation between tenants.
3. **Correlation window edge cases**: out-of-order delivery, late arrivals after window close.
4. **SQS 256KB S3 pointer round-trip**: prove the claim-check pattern works end-to-end.
5. **Free tier enforcement**: 10K alerts/month counter, 7-day retention purge.
6. **Slack signature validation**: timing-safe HMAC for interactive payloads.

---

*"If you aren't testing for leakage, it will leak."*
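
As a rough illustration of must-have item 1 (HMAC timestamp validation), the sketch below rejects payloads outside a 5-minute window and compares signatures in constant time. It is not the project's actual implementation: `sign`, `verifyWebhook`, `MAX_SKEW_MS`, and the `"<timestamp>.<body>"` signing format are all hypothetical, and each real source (Datadog, PagerDuty, OpsGenie, and so on) has its own header names and signing scheme that would need a per-source adapter.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed tolerance, taken from the ~5 minute figure in the review.
const MAX_SKEW_MS = 5 * 60 * 1000;

export type VerifyResult =
  | { ok: true }
  | { ok: false; reason: "bad_timestamp" | "stale" | "bad_signature" };

// Hypothetical scheme: sign "<timestamp>.<body>" so the timestamp cannot be
// altered without invalidating the signature.
export function sign(secret: string, timestampMs: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestampMs}.${body}`).digest("hex");
}

export function verifyWebhook(
  secret: string,
  timestampHeader: string,
  signatureHex: string,
  body: string,
  nowMs: number = Date.now(), // injectable clock keeps the tests deterministic
): VerifyResult {
  const timestampMs = Number(timestampHeader);
  if (!Number.isFinite(timestampMs)) return { ok: false, reason: "bad_timestamp" };

  // Reject anything outside the freshness window, in either direction.
  if (Math.abs(nowMs - timestampMs) > MAX_SKEW_MS) return { ok: false, reason: "stale" };

  const expected = Buffer.from(sign(secret, timestampMs, body), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== received.length || !timingSafeEqual(expected, received)) {
    return { ok: false, reason: "bad_signature" };
  }
  return { ok: true };
}
```

A replay test would then capture a valid `(timestamp, signature, body)` triple, advance the injected clock by six minutes, and assert that the result is `{ ok: false, reason: "stale" }`. Binding the timestamp into the signed string matters: if only the body were signed, an attacker could resend a captured payload with a fresh timestamp header.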