Add Gemini TDD reviews for all 6 products
P1, P2, P3, P4, P6 reviewed by Gemini subagents. P5 reviewed manually (Gemini credential errors). All reviews flag coverage gaps, anti-patterns, and Transparent Factory tenet gaps.
**New file:** products/01-llm-cost-router/test-architecture/review.md (45 lines)
## 4. Anti-Patterns Identified
1. **Testing Implementation instead of Behavior (Over-Mocking):**
* **Section 3.1 & 3.4:** `MockKeyCache` and `MockKeyStore` are used to test auth middleware. Mocks hide serialization overhead, connection pool limits, and timeout behavior. A Redis cache test should hit an ephemeral Redis instance via Testcontainers.
2. **Brittle Security Tests:**
* **Section 3.4 (`provider_credential_is_stored_encrypted`):** The assertion `assert_ne!(stored.encrypted_key, b"sk-plaintext-key")` and checking the length is a classic anti-pattern. This test passes if you accidentally overwrite the key with random garbage or an empty GCM nonce. You must write a test that *decrypts* the key and verifies it matches the original plaintext, using the actual KMS envelope decryption flow.
3. **Missing Concurrency in Circuit Breaker Tests:**
* **Section 3.2 & 4.1:** The Redis-backed circuit breaker test `circuit_breaker_state_is_shared_across_two_proxy_instances` tests manual `force_open`, but not concurrent threshold tripping. What if 50 requests fail simultaneously across 5 proxy instances? Does the circuit breaker transition cleanly or enter a race condition?
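To make the race concrete: a minimal Python sketch of the missing test shape. The in-memory `CircuitBreaker` here is a hypothetical stand-in for the Redis-backed implementation, not the product's actual type; the point is the assertion that fifty simultaneous failures produce exactly one state transition.

```python
import threading

class CircuitBreaker:
    """In-memory stand-in for the Redis-backed breaker: trips open after a failure threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0
        self.state = "closed"
        self.transitions = 0
        self._lock = threading.Lock()

    def record_failure(self):
        with self._lock:
            self.failures += 1
            if self.failures >= self.threshold and self.state == "closed":
                self.state = "open"
                self.transitions += 1  # must happen exactly once, even under races

breaker = CircuitBreaker(threshold=10)
# 50 failures land simultaneously, as if from 5 proxy instances.
threads = [threading.Thread(target=breaker.record_failure) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert breaker.state == "open"
assert breaker.transitions == 1, "breaker must transition exactly once, not once per failure"
```

Against the real Redis-backed breaker, the same two assertions would run after firing concurrent failures from multiple proxy processes.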
## 5. Transparent Factory Gaps
Section 8 attempts to map tests to the 5 tenets, but misses critical scenarios:
* **Atomic Flagging (Story 10.1):** Tests check off/on/auto-disable. **Gap:** What happens when the flag provider is unreachable? Is the default fallback to `passthrough` safe? No test asserts this fail-open/fail-closed behavior.
* **Elastic Schema (Story 10.2):** Tests DDL regex. **Gap:** No tests for long-running transaction locks during migrations. A test must verify that `ALTER TABLE` concurrent index creation doesn't block proxy auth reads.
* **Semantic Observability (Story 10.4):** Tests span emission. **Gap:** Context propagation. Does the trace ID correctly propagate from the incoming request (e.g., `traceparent` header) to the downstream LLM provider? If the client passes a trace ID, the proxy must stitch it.
* **Configurable Autonomy (Story 10.5):** Tests strict/audit/panic modes. **Gap:** Panic mode API security. Who can trigger panic mode? There is no test verifying that only an Owner/Admin can invoke the `POST /admin/panic` endpoint.
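The fail-open/fail-closed gap in Atomic Flagging is cheap to close. A hedged Python sketch of the missing test (`FlagClient`, `DeadProvider`, and the `routing_mode` flag are illustrative names, not the product's OpenFeature API):

```python
class FlagClient:
    """Wraps a flag provider and falls back to a safe default when it is unreachable."""
    def __init__(self, provider, defaults):
        self.provider = provider
        self.defaults = defaults

    def get(self, flag):
        try:
            return self.provider.get(flag)
        except ConnectionError:
            # Fail-safe: if flags are unreachable, routing degrades to passthrough.
            return self.defaults[flag]

class DeadProvider:
    """Simulates an unreachable flag service."""
    def get(self, flag):
        raise ConnectionError("flag service unreachable")

client = FlagClient(DeadProvider(), defaults={"routing_mode": "passthrough"})
assert client.get("routing_mode") == "passthrough"
```

The test that matters is the one above: the provider is down, and the asserted value is the *safe* default, chosen deliberately rather than whatever the library happens to return.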
## 6. Performance Test Gaps
The latency budget (<5ms proxy overhead) is tested via `criterion` and k6, but the edge cases are completely ignored.
* **Missing Chaos Scenarios:** Section 6.3 lists "Kill TimescaleDB" and "Kill Redis". It is missing **"Slow DB"**. A dead DB fails fast (ECONNREFUSED). A slow DB (e.g., hanging connections, 5-second inserts) causes connection pool exhaustion and memory bloat. You need a `tc netem` delay test on the DB port to prove the bounded `mpsc` channel correctly drops telemetry without blocking the proxy hot-path.
* **Memory Fragmentation / Garbage Collection:** The p99 latency target is highly sensitive to memory reallocation over time. A 60-second k6 test will not catch memory fragmentation. There must be a 24-hour soak test running at a low but sustained TPS to prove the <5ms SLA holds true in a long-lived process.
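The bounded-channel behavior the slow-DB test must prove can be modeled in a few lines (Python stand-in for the Rust bounded `mpsc`; queue size and event counts are arbitrary):

```python
import queue

telemetry = queue.Queue(maxsize=100)  # bounded, mirrors the bounded mpsc channel
dropped = 0

def record(event):
    """Hot-path telemetry write: never blocks, drops when the sink is slow."""
    global dropped
    try:
        telemetry.put_nowait(event)
    except queue.Full:
        dropped += 1

# Simulate a stalled DB consumer: 10_000 events arrive, none are drained.
for i in range(10_000):
    record({"req": i})

assert telemetry.qsize() == 100   # channel is full but bounded
assert dropped == 9_900           # excess telemetry dropped, proxy never blocked
```

In the real chaos test, the `tc netem` delay replaces the "stalled consumer" simulation, and the assertion becomes: proxy p99 stays under 5ms while the drop counter climbs.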
## 7. Missing Critical Test Scenarios
These specific scenarios are missing and represent severe risks to the product:
1. **Strict Multi-Tenancy Cross-Talk:** There is no explicit test proving Org A cannot access Org B's routing rules, API keys, or request history via the Dashboard API. An IDOR test suite is mandatory.
2. **SSE Connection Drops (Billing Leak):** What happens if a client drops the connection mid-stream? Does the proxy continue downloading from the provider, or does it abort the upstream request? Does the customer get billed for the partial tokens received so far? No test covers this financial risk.
3. **KMS Key Rotation Verification:** No test verifies that the KMS DEK envelope encryption actually rotates properly, or that old DEKs can still decrypt old keys after a rotation event.
4. **Rate Limiter Burst/Race Condition:** The Redis rate limiter test uses a single loop. It needs a concurrent load test to ensure the atomic `INCR` commands enforce limits correctly under a massive traffic burst without allowing overage.
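For scenario 4, the burst-test shape looks like this (Python sketch; the locked counter stands in for Redis `INCR` atomicity, and the limit/burst sizes are arbitrary):

```python
import threading

class AtomicLimiter:
    """Stand-in for the Redis INCR-based limiter: atomically counts requests per window."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def allow(self):
        with self._lock:  # models the atomicity of Redis INCR
            self.count += 1
            return self.count <= self.limit

limiter = AtomicLimiter(limit=100)
allowed = []

def burst():
    if limiter.allow():
        allowed.append(1)

# 500 requests hit the limiter simultaneously.
threads = [threading.Thread(target=burst) for _ in range(500)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(allowed) == 100, "exactly the limit, zero overage, under a concurrent burst"
```

The single-loop test in the architecture would pass even with a racy check-then-increment limiter; only the concurrent version above exposes overage.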
## 8. Recommendations (Top 5 Prioritized Improvements)
1. **Ditch Mockall for Hot-Path Integration Tests:** Remove `MockKeyCache` and `MockKeyStore` from the proxy auth and routing tests. Replace them with Testcontainers Redis/Postgres. For a <5ms SLA proxy, testing the implementation of a mock is useless. You must test the real DB driver latency and connection pool behavior. (Reference: Section 3.1 & 4.1).
2. **Add a "Slow Dependency" Chaos Test:** Add a test that injects a 5-second artificial network delay to TimescaleDB and Redis. Verify that the proxy degrades gracefully (dropping telemetry, falling back to PG for auth) without blocking the main event loop or violating the <5ms SLA. (Reference: Section 6.3).
3. **TDD Provider Translation via Fixtures:** Mandate test-first development for the OpenAI/Anthropic translation layer using the official JSON schemas and recorded fixtures *before* implementation. Do not write translators "test-after". (Reference: Section 1.2 & Epic 1).
4. **Implement an Explicit IDOR Test Suite:** Add a test suite for the Dashboard API that specifically authenticates as Org A and attempts to fetch/mutate resources belonging to Org B (API keys, routing rules, telemetry). (Reference: Section 3.4 & Epic 4).
5. **Test Decryption, Not Just Encryption:** Rewrite the `provider_credential_is_stored_encrypted` test. It currently passes if the DB stores random garbage. It must encrypt the key, store it, fetch it, *decrypt it via the KMS flow*, and assert it matches the original plaintext. (Reference: Section 3.4).
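Recommendation 5 as a concrete test shape (Python sketch; the XOR `encrypt`/`decrypt` pair is an insecure stand-in for the real KMS envelope flow, used only to show why the round-trip assertion is stronger than the inequality check):

```python
import os

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    # Stand-in for KMS envelope encryption: XOR keystream (illustrative only, NOT secure).
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))

def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    return encrypt(ciphertext, key)  # XOR is its own inverse

key = os.urandom(32)
original = b"sk-plaintext-key"
stored = encrypt(original, key)

# Weak assertion (the anti-pattern): also passes if `stored` were random garbage.
assert stored != original

# Strong assertion: the full round trip must recover the exact plaintext.
assert decrypt(stored, key) == original
```

In the real test, `encrypt`/`decrypt` are the actual KMS envelope calls and `stored` is fetched back from the database, not held in memory.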
---
*Review generated by TDD Review Subagent. Ensure all critical paths are addressed before approving the V1 Architecture.*
**New file:** products/02-iac-drift-detection/test-architecture/review.md (85 lines)
# Test Architecture Review: dd0c/drift
**Reviewer:** Max Mayfield (Senior TDD Consultant)
**Date:** February 28, 2026
Alright, let's get real. I read through the `test-architecture.md` and the `epics.md` documents. You talk a big game about "correctness is non-negotiable" and the secret scrubber being untouchable. I respect that. But when we actually look at how this maps to the product epics, the cracks start showing. You're hiding behind "too many moving parts" to avoid unit testing the hard stuff, and your UI testing strategy is completely MIA.
Here is my reality check on your testing strategy.
---
## 1. Coverage Gaps Per Epic
I cross-referenced your architecture with the V1 Epics. Here’s what you conveniently forgot to test:
* **Epic 6 (Dashboard UI):** You wrote an entire testing architecture document and completely ignored the frontend. Where are the React Testing Library specs? Where is Playwright/Cypress for the SPA? You have zero UI test targets in Section 2.2. If the frontend can't render the diff correctly or the OAuth flow breaks, your pristine Go agent doesn't matter.
* **Epic 9 (Onboarding & PLG):**
* **Story 9.4 (Stripe Billing):** No mention of testing Stripe webhooks. How are you validating signature verification locally? Where is the `stripe-cli` mock in your Docker Compose setup?
* **Story 9.2 (`drift init`):** How are you unit testing local filesystem traversal across Windows, Mac, and Linux without making it a flaky mess?
* **Epic 8 (Infrastructure):** Story 8.1 defines your infrastructure in Terraform. Where is `Terratest`? You're building an IaC drift detection tool but not testing your own IaC. That's ironic.
* **Epic 2 (Agent Communication):** Story 2.1 mentions mTLS certificate generation and exchange. You haven't documented a single test for certificate expiration, rotation, or revocation handling.
## 2. TDD Workflow Critique
Your adaptation of TDD is pragmatic, but you're giving yourself too many free passes.
* **"Too many moving parts to unit test first":** You explicitly called out the Remediation Round-Trip (Slack button → agent apply) and the Onboarding flow as things you'll E2E test first. That's a classic excuse for coupled architecture. If an orchestrator has too many moving parts to unit test, *decouple the orchestrator*. You should be able to unit test the remediation dispatcher with stubs before firing up LocalStack.
* **CI/CD Pipeline Configuration:** "Validate by running it, not by unit testing YAML." Fair, but at least use something like `actionlint` for GitHub Actions. Don't just push and pray.
* **Test-After for DynamoDB:** Writing the schema and then integration testing against DynamoDB Local is fine, but you should still unit test your repository interfaces using mocks to verify domain logic independently of the DB.
## 3. Test Pyramid Balance
You’ve got a 10% E2E, 20% Integration, and 70% Unit split. Targeting ~500 tests.
* **The Problem with 50 E2E Tests:** Running 50 E2E tests against LocalStack and a real PostgreSQL DB inside your CI pipeline will be brutally slow. LocalStack is heavy. You won't hit that `< 5 min` PR promise with 50 E2E flows unless you heavily parallelize, and if you do, your LocalStack instance will probably buckle under the load or you'll hit state contention.
* **The Fix:** You need a "Smoke" tier separated from E2E. You only need 5-10 E2E critical paths (like the Onboarding and Revert flows). The rest of those E2E tests should be pushed down to Integration or UI component tests. Stop counting tests as a vanity metric and focus on coverage.
## 4. Anti-patterns
I saw some code smells in your examples that need addressing before they become a nightmare:
* **Table-Driven Tests:** Your Go table-driven pattern in Section 3.1 is standard, but you're missing `t.Parallel()` inside the `t.Run()` closure. Without it, those tests run sequentially, wasting time.
* **Shared LocalStack State:** You noted `go test -p 4 ./... (4 packages in parallel)` but then said integration tests must not be parallelized within packages. If Package A and Package B both talk to LocalStack S3 at the same time, they will step on each other unless buckets and resources are dynamically named per-test. You haven't mentioned dynamic naming. That's a one-way ticket to flaky CI.
* **Time-Based Polling:** `waitForRemediationPlan(t, driftEvent.ID, 5*time.Second)` — Polling is better than hard sleeping, but relying on real time in E2E tests is a smell. Use deterministic event triggers or WebSockets/SQS mocks to know exactly when the plan is ready.
## 5. Transparent Factory Tenet Gaps
You talked a lot about Epic 10 compliance, but you missed some of the hardest parts to test.
* **Atomic Flagging (10.1):** You test that the circuit breaker trips on false positives, but how is the distributed state (Redis/Dynamo) of those dismissals tested across multiple agent instances? You only showed an in-memory OpenFeature provider.
* **Elastic Schema (10.2):** Validating additive schema changes via CI linting is cute. But you aren't testing the actual agent reading V1 and V2 items concurrently. What happens when an old agent parses a new DynamoDB item with `_v2` attributes? Where is the backward-compatibility serialization test?
* **Semantic Observability (10.4):** You assert that OTEL spans are emitted per resource (good), but there’s no trace continuity test. Does the `drift_scan` parent span correctly link to the `POST /v1/drift-reports` HTTP trace and then the SQS consumer trace? That's the entire point of distributed tracing, and your test architecture completely ignores the boundary crossings.
* **Configurable Autonomy (10.5):** Panic mode is tested (good). But what about race conditions when panic mode triggers mid-remediation? Are there tests for in-flight command abortion?
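The 10.2 backward-compatibility test is straightforward to express (Python sketch; the field names are invented for illustration, not taken from the product's schema):

```python
def parse_drift_item_v1(item: dict) -> dict:
    """V1 reader: picks only the fields it knows and ignores additive _v2 attributes."""
    known = {"resource_id", "drift_type", "detected_at"}
    missing = known - item.keys()
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {k: item[k] for k in known}

# A V2 item with additive attributes must still parse cleanly under V1 code.
v2_item = {
    "resource_id": "arn:aws:s3:::example-bucket",
    "drift_type": "tag_drift",
    "detected_at": "2026-02-28T00:00:00Z",
    "severity_v2": "high",               # additive: V1 must ignore, not crash
    "remediation_plan_v2": {"steps": []},
}
parsed = parse_drift_item_v1(v2_item)

assert "severity_v2" not in parsed
assert parsed["resource_id"] == "arn:aws:s3:::example-bucket"
```

The test worth having in CI is exactly this: serialize with the new writer, deserialize with the *old* reader, assert no crash and no silent field corruption.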
## 6. Performance Test Gaps
Your section 6 metrics are aggressive (`< 50ms` for 100 resources), but you missed the boat on the backend.
* **Database Connection Pool Exhaustion:** What happens to PostgreSQL when 1,000 agents heartbeat at the same time? You need a load test specifically for the RDS proxy or connection pooler.
* **SQS FIFO Limits:** SQS FIFO standard throughput is 300 messages/sec. Your k6 load test target is 50 concurrent agents. What happens at 500 agents? You need a stress test that hits the SaaS with a thundering herd of drift reports, not just a gradual ramp-up.
* **SaaS API Memory Profile:** You benchmarked the Go agent, but the Node.js/TypeScript Event Processor is a black box. Where are the V8 memory profiling tests to ensure it doesn't OOM while processing a 1MB drift report payload?
## 7. Missing Scenarios
You overlooked a few critical edge cases that keep SREs up at night:
* **Security (RBAC/IAM forgery):** You test RLS for multi-tenancy (excellent). But what about an agent attempting to impersonate another agent by forging a `stack_id` in its payload, while using a valid API key? The API must cross-check the API key's `org_id` against the payload's `stack_id` ownership.
* **Security (Replay Attacks):** What prevents a malicious actor from capturing a drift report payload and replaying it endlessly to DoS your Event Processor or skew the drift score?
* **Multi-tenancy Noisy Neighbor:** If Org A has 10,000 drifted resources and Org B has 10, does Org A delay Org B's SQS processing? You need test cases for priority queueing or fair-share processing.
* **Race Conditions (Duplicate Processing):** FIFO deduplication helps, but exact-once processing logic inside the Event Processor DB transaction isn't explicitly tested for unique constraints. What if the same `report_id` makes it past SQS? Does PostgreSQL block it?
* **Race Conditions (Remediation vs. Polling):** The agent starts a remediation apply (Epic 7.1), but the scheduled drift scan (Epic 1.2) kicks off concurrently. Who wins? Does the scan report half-applied state?
## 8. Top 5 Recommendations
This architecture is solid overall. It has a high bar for correctness, and the 100% scrubber coverage rule is perfect. But to fix the gaps, do this immediately:
1. **Stop Ignoring the UI:** Add a dedicated UI testing section (Epic 6) to your architecture using React Testing Library for components and Playwright for the critical path (Auth -> View Stack -> Diff Viewer). If the dashboard is broken, your agent's perfect diff is useless.
2. **Mock the Orchestrator for TDD:** Your excuse for not unit-testing the remediation round-trip and onboarding flow is weak. Use stubs for AWS and GitHub API boundaries so you can TDD the orchestrator logic *before* hitting the slow LocalStack E2E layer.
3. **Trim the Fat on E2E Tests:** Cap your E2E suite at 10 high-value flows. If you push 50 LocalStack integration tests into every PR, your CI will crawl and your engineers will start ignoring failures. Push the rest down to integration tests with mock HTTP services.
4. **Test the Infrastructure Code:** You are literally building an IaC drift detection tool. Epic 8.1 needs Terratest or an equivalent to validate your own Terraform. Dogfooding starts at home.
5. **Fix Distributed Tracing Gaps:** Add cross-boundary assertions for Epic 10.4. Ensure the `drift_scan` parent span generated in Go correctly passes its W3C trace context to the TypeScript API, through SQS, and into the Event Processor.
This is Max, signing off. Make it work.
**New file:** products/03-alert-intelligence/test-architecture/review.md (75 lines)
# TDD Consultant Review: dd0c/alert Test Architecture
**Reviewer:** Max Mayfield (Senior TDD Consultant)
**Date:** Feb 28, 2026
Alright, I read through the test architecture and the epics. You've got a lot of buzzwords in here, and the general structure isn't terrible, but there are some gaping holes you're casually ignoring. A safety-critical observability tool that "never eats alerts" shouldn't have blind spots this big.
Here is my brutal, unfiltered breakdown.
---
## 1. Coverage Gaps Per Epic
Let's compare your `test-architecture.md` to `epics.md`. You conveniently "forgot" to write tests for half the product.
* **Epic 1 (Webhook Ingestion) & Epic 2:** Story 1.4 requires saving the raw payload asynchronously to S3 for audit/replay. I see *zero* mention of S3 storage testing in your architecture. If that async save fails, you lose the audit trail.
* **Epic 5 (Slack Bot):** Story 5.3 introduces interactive feedback actions (clicking buttons in Slack). Section 3.6 tests the *formatting* of these buttons, but there are no tests for the API Gateway endpoint that receives the Slack interaction payload, validates it, and updates the DB.
* **Epic 6 (Dashboard API) & Epic 7 (Dashboard UI):** Completely AWOL. You have a whole API with JWT auth, pagination, and analytics endpoints (Stories 6.1 - 6.4), plus a React SPA (Stories 7.1 - 7.4). Your test architecture stops at Slack. How are you testing JWT validation? Tenant isolation on API reads? React component rendering? Total ghost town.
* **Epic 9 (Onboarding/PLG):** Story 9.4 mandates free tier limitations (max 10,000 alerts/month, purging data after 7 days). Where are the tests proving the ingestion API drops payloads at the quota limit? Missing.
## 2. TDD Workflow Critique
Your Section 1.2 "Red-Green-Refactor" adaptation is mostly fine until you start making excuses.
* **"When integration tests lead (test-after)"**: Classic trap. Writing parsers against real payloads and "locking them in" later means your design is tightly coupled to the implementation from day one. You should still write the failing unit test for the parser behavior *first*, using the fixture as the input.
* **"When E2E tests lead"**: Building backward from a 60-second TTV happy path isn't TDD, it's ATDD/BDD. That's fine, but call it what it is. Don't pretend you're doing strict TDD here.
* **Mocking Pure Functions:** You claim parsers are pure functions in 3.1, which is great. But then in your contract tests (4.1), you spin up LocalStack and fire HTTP requests just to see if the parser output hits SQS. You're bleeding domain logic into integration tests.
## 3. Test Pyramid Balance
You claim 70% Unit, 20% Integration, 10% E2E. For a "solo founder," 20 E2E tests and 100 Integration tests running on LocalStack + Testcontainers (Redis, DynamoDB, Timescale, WireMock) is going to be an anchor around your neck.
* Your CI pipeline (7.2) claims the PR Gate takes <5 mins with Testcontainers spinning up. I'll believe that when I see it. Java/Node + 4 heavy Docker containers pulling and starting will easily push you past 10 minutes unless your caching is god-tier.
* You are over-relying on heavy integration tests for things that should be tested in-memory with repository fakes. You have `ioredis-mock` in unit tests (3.4) but then test the exact same TTL logic in integration (4.2). Pick a lane.
## 4. Anti-Patterns
You've got a few nasty anti-patterns mixed in with your "best practices."
* **Mocking Pure Logic:** You say `ioredis-mock` is for "unit tests" (3.4). Redis *is* infrastructure. If you're unit testing the correlation engine with `ioredis-mock`, you aren't doing unit testing; you're doing flaky integration testing with a half-baked mock. A true unit test should use an in-memory repository fake implementing a `WindowStore` interface. Save Redis for the integration adapter test.
* **Time Manipulation with Sinon:** Section 3.4 explicitly calls out `sinon.useFakeTimers()` for the correlation engine. This is a disaster waiting to happen across async boundaries, especially with `ioredis-mock`. Pass a `Clock` interface or a `getCurrentTime()` function into your correlation engine instead of monkey-patching the Node.js event loop.
* **Testing Slack Rate Limits with WireMock:** In 4.5, you test Slack rate limits (1 msg/sec) using WireMock. WireMock won't accurately simulate Slack's complex rate limiting (bucket depletion and `Retry-After` headers) unless you script it perfectly. This will give you a false sense of security.
* **Table-Driven Tests in Name Only:** You claim heavy use of table-driven tests (3.4), but your pseudo-code is just standard `describe/it` blocks. If you are doing TDD for scoring, use `.each` arrays mapping input state to expected score.
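On the clock-injection point, here is the shape I mean (a Python sketch of the pattern, not the product's TypeScript engine; all names are illustrative):

```python
class FakeClock:
    """Injected clock: tests advance time explicitly instead of monkey-patching timers."""
    def __init__(self, start=0.0):
        self.t = start

    def now(self) -> float:
        return self.t

    def advance(self, seconds: float):
        self.t += seconds

class CorrelationWindow:
    """Groups alerts arriving within `window` seconds, using the injected clock."""
    def __init__(self, clock, window=300.0):
        self.clock, self.window = clock, window
        self.opened_at = None
        self.alerts = []

    def add(self, alert):
        now = self.clock.now()
        if self.opened_at is None or now - self.opened_at > self.window:
            self.opened_at, self.alerts = now, []  # window expired: open a new one
        self.alerts.append(alert)
        return len(self.alerts)

clock = FakeClock()
win = CorrelationWindow(clock)
assert win.add("cpu_high") == 1
clock.advance(299)
assert win.add("disk_full") == 2   # 299s in: still the same 5-minute window
clock.advance(2)
assert win.add("oom_kill") == 1    # 301s after open: window rolled over
```

No fake timers, no event-loop patching: the engine takes a clock, and the test drives it deterministically across "boundaries" that would be async in production.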
## 5. Transparent Factory Tenet Gaps
Your TF Tenets (Section 8) look nice on paper, but I found the holes.
* **Atomic Flagging (8.1):** You test the circuit breaker, which is good. But where is the test for the OpenFeature provider itself? How do you guarantee the feature flags fall back to the safe default (off) when the JSON file provider or env vars are missing/malformed? Missing.
* **Elastic Schema (8.2):** Migration linting is cool, but what about the dual-write rollback scenarios? If V2 deployment fails, can V1 code still read the alerts written during the 5-minute canary? No tests for backward compatibility during the migration window.
* **Semantic Observability (8.4):** You assert spans are created, but you completely miss **Trace Context Propagation**. If the `trace_id` isn't passed from API Gateway -> Lambda -> SQS -> ECS Correlation Engine, your observability is dead on arrival. You need a test proving that the span parent-child relationship crosses the SQS boundary correctly.
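The missing propagation test reduces to an inject/extract pair around the queue boundary (Python sketch with plain dicts standing in for SQS message attributes; real code would go through the OTel propagator API rather than these hand-rolled helpers):

```python
def inject_trace(message: dict, traceparent: str) -> dict:
    """Producer side: copy W3C trace context into message attributes before enqueueing."""
    attrs = dict(message.get("attributes", {}))
    attrs["traceparent"] = traceparent
    return {**message, "attributes": attrs}

def extract_trace(message: dict):
    """Consumer side: resume the trace from the message instead of starting a new root span."""
    return message.get("attributes", {}).get("traceparent")

# traceparent format: version-traceid-parentid-flags
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
msg = inject_trace({"body": {"alert_id": "a1"}}, incoming)

# The assertion the architecture is missing: the trace id survives the SQS boundary.
resumed = extract_trace(msg)
assert resumed == incoming
assert resumed.split("-")[1] == "4bf92f3577b34da6a3ce929d0e0e4736"
```

The integration version of this test publishes through real SQS (or LocalStack) and asserts that the consumer's span reports the *same* trace id and a parent-child link to the producer span.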
## 6. Performance Test Gaps
You wrote tests for 1000 webhooks/second (6.1). Cool. But you forgot the real world.
* **Payload Size Spikes:** You're testing with `makeAlertPayload()`, which generates a tiny JSON. What happens when a Datadog monitor with 50 tags and a huge custom message drops a 2MB payload on your API Gateway? Does Lambda choke? Does SQS reject it (SQS limit is 256KB)? Missing test.
* **Database Growth Over Time:** Your TimescaleDB test (6.2) assumes a clean slate. What's the query latency after 30 days of continuous aggregates and millions of rows?
* **DynamoDB Cold Starts:** No tests for DynamoDB throttling under burst. A 500-alert storm hitting a new tenant partition will get rate-limited.
## 7. Missing Scenarios (The Reality Check)
A system like dd0c/alert isn't going to break on the happy path. It's going to fail on edge cases you haven't considered.
* **HMAC Bypass (3.2):** What if the signature header is sent as `dd-webhook-signature` vs `DD-WEBHOOK-SIGNATURE`? What if it's completely omitted? You claim to test missing headers, but you missed testing timing attacks on the string comparison itself. Use `crypto.timingSafeEqual` and write a test that enforces it's not a generic `===` comparison.
* **Multi-Tenancy Bleed (4.3):** You test that Redis windows isolate per tenant (4.2), but you forgot to test DynamoDB queries. What happens if a bad `partition_key` logic leaks `tenant_A` incidents into `tenant_B`'s dashboard? You need a strict assertion enforcing Tenant ID injection on all Data Access Objects.
* **DLQ Overflow (8.1):** Your circuit breaker replays from the DLQ. What if the DLQ has 100,000 alerts? Replaying them all at once into the correlation engine will just trip the breaker again or OOM the ECS task. You need a backpressure or batching test for the replay mechanism.
* **Slack Rate Limits (4.5):** You test retries on 429, but what if Slack totally blocks your app due to sustained 429s (like 1,000 requests in a minute)? You don't have a test for a circuit breaker *to* Slack.
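The HMAC point deserves code. A sketch of the constant-time verifier plus the negative cases to pin down (Python; `SECRET` and the header handling are illustrative, not the product's implementation):

```python
import hashlib
import hmac

SECRET = b"webhook-signing-secret"  # illustrative; load from config in real code

def verify_signature(payload: bytes, header) -> bool:
    """Constant-time HMAC check; a missing or malformed header is rejected, not an error."""
    if not header:
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)  # never `==` on secret material

payload = b'{"alert": "cpu_high"}'
good = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

assert verify_signature(payload, good)
assert not verify_signature(payload, None)        # header omitted entirely
assert not verify_signature(payload, "deadbeef")  # forged signature
assert not verify_signature(b"tampered", good)    # body changed after signing
```

A test can also enforce the mechanism, not just the behavior: grep or monkey-patch to assert the verifier calls `hmac.compare_digest` rather than plain string equality.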
---
## 8. Top 5 Recommendations
Before you write another line of code, fix these.
1. **Stop Ignoring Half the App:** You need an Epic 6 and Epic 7 section in this architecture document. Test the API JWT auth, pagination, tenant isolation, and the React SPA. Also, test the OAuth signup and free tier limits from Epic 9. A solo founder can't afford a broken onboarding flow.
2. **Fix the Anti-Pattern in Correlation Engine Tests:** Drop `ioredis-mock` and `sinon.useFakeTimers()`. Create a `WindowStore` interface and use an in-memory implementation for unit tests. Inject a clock. Save the real Redis and Testcontainers for a single integration test class.
3. **Add Trace Context Propagation Tests:** If the trace ID gets lost between API Gateway, Lambda, SQS, and the ECS Fargate worker, your Semantic Observability is useless. Write a test asserting that the span parent-child relationship crosses the SQS boundary correctly.
4. **Handle the Edge Cases (HMAC & DLQ):** Write a test specifically enforcing `crypto.timingSafeEqual` for HMAC. For the DLQ replay, add a test that asserts backpressure—replay the alerts in manageable batches, not a 100k message firehose.
5. **Test Large Payloads (The 256KB SQS Limit):** SQS has a hard 256KB limit. Write a test where Datadog sends a 500KB payload (which happens with large stack traces or massive tag arrays). Your ingestion Lambda needs to compress it, chunk it, or strip unnecessary fields *before* hitting SQS.
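Recommendation 5 sketched (Python; the size guard and gzip fallback are one plausible shape, and the S3 claim-check fallback is indicated but stubbed):

```python
import base64
import gzip
import json

SQS_LIMIT = 256 * 1024  # SQS hard cap on message size

def prepare_for_sqs(payload: dict) -> dict:
    """Compress oversized payloads before enqueueing; offload to S3 as the last resort."""
    raw = json.dumps(payload).encode()
    if len(raw) <= SQS_LIMIT:
        return {"encoding": "json", "body": raw.decode()}
    packed = base64.b64encode(gzip.compress(raw)).decode()
    if len(packed) <= SQS_LIMIT:
        return {"encoding": "gzip+b64", "body": packed}
    # Still too big: store the body in S3 and enqueue a pointer (claim-check pattern).
    return {"encoding": "s3-pointer", "body": "<s3 object key goes here>"}

# A Datadog alert with a huge message and a fat tag array: ~500KB raw.
big = {"message": "x" * 500_000, "tags": ["env:prod"] * 50}
msg = prepare_for_sqs(big)

assert msg["encoding"] == "gzip+b64"
assert len(msg["body"]) <= SQS_LIMIT
```

The test worth writing asserts the guard fires *before* the SQS client is called, so an oversized payload can never produce a hard `MessageTooLong` failure in production.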
You're trying to move fast, which is fine, but you're missing the exact things that will wake you up at 3 AM. Fix these, and you'll actually have a "safety-critical observability tool."
— Max
**New file:** products/04-lightweight-idp/test-architecture/review.md (105 lines)
# Test Architecture & TDD Strategy Review: dd0c/portal
**Reviewer:** Senior TDD Consultant
**Target:** V1 MVP Test Architecture (test-architecture.md, epics.md, architecture.md)
This review provides a blunt, critical assessment of the proposed test architecture for the lightweight IDP. The current test plan has solid foundations in ownership inference and governance testing, but suffers from critical blind spots, architectural contradictions, and a test pyramid that will drown a solo founder in mock maintenance.
---
## 1. Coverage Analysis vs. Epics
While core discovery engines (Epics 1, 2) and ownership logic (Epic 4) are well-covered, several epics are virtually ignored in the test architecture:
* **Epic 3 (Service Catalog) - Missing Integrations:** Story 3.4 (PagerDuty/OpsGenie Integration) is entirely absent from the test plan. There are no tests for credential encryption, integration caching, or mapping PD schedules to teams.
* **Epic 4 (Search Engine) - Missing Caching:** Story 4.3 requires Redis prefix caching for <10ms Cmd+K responses. Section 4 has zero tests validating Redis cache hit/miss logic or cache invalidation on catalog updates.
* **Epic 5 & 6 (UI & Dashboards) - Missing Frontend Tests:** Section 5 (E2E) only tests API responses for the "Miracle" and Cmd+K. There is no mention of Vite/React component testing, testing the progressive disclosure UI (Story 5.4), or verifying dashboard KPI aggregation (Stories 6.1, 6.2).
* **Epic 8 (Infrastructure) - Missing IaC Tests:** No tests exist for the AWS CDK/CloudFormation deployments, including IAM least-privilege assertions for the customer role (Story 8.5).
* **Epic 9 (Onboarding & PLG) - Critical Flow Untested:** The most important business flows are untested. There are no tests for the Stripe webhook handling (Story 9.2) or the real-time WebSocket discovery progress (Story 9.4). You cannot ship a PLG motion without testing the payment and activation flows.
## 2. TDD Workflow Critique
The stated TDD strategy correctly identifies ownership inference (Section 3.4) and reconciliation (Section 3.3) as the "Strict TDD" targets. This is the correct move—the domain logic here is pure computation, highly risky, and easily testable without side effects.
However, the "Integration-led" workflow for the scanners (Section 1.2) is contradictory in practice given the sheer volume of unit tests planned for them (Section 3.1, 3.2). Writing 90 unit tests using `moto` or `responses` for the AWS and GitHub scanners is not "Integration-led," it is mock-led.
For a solo founder, Red-Green-Refactor against synthetic API responses for third-party systems is a recipe for maintenance nightmares. If the AWS API contract changes or the GitHub GraphQL schema evolves, the mocked unit tests will stay green while production burns.
## 3. Test Pyramid Balance
**The 70/20/10 ratio (300 unit tests / 85 integration / 15 E2E) is the wrong shape for an IDP.**
dd0c/portal is fundamentally a data integration engine (gluing AWS, GitHub, PostgreSQL, and Meilisearch). The pyramid should be an **Integration Honeycomb**.
* **Scanners (Epic 1 & 2):** Shift the balance. Ditch the 90+ scanner unit tests using `moto`. Use real LocalStack/WireMock integration tests as the primary verification mechanism. Testing an AWS client wrapper against a mock AWS library just proves you can mock `boto3`.
* **The Database (Epic 3):** The API heavily relies on PostgreSQL for searches and tenant isolation (RLS). Mocking the DB in unit tests is useless here. The 30+ Catalog API tests should hit Testcontainers directly.
* **Recommendation:** Aim for a 30/60/10 ratio. Keep the 30% unit tests strictly for the Reconciliation and Ownership Inference algorithms. Push the 60% integration tests to the boundaries (Scanners -> LocalStack/WireMock, API -> Postgres/Redis/Meilisearch).
## 4. Anti-Patterns Flagged
**Anti-Pattern 1: E2E Tests aren't End-to-End (Section 5.4)**
Testing the "5-Minute Miracle" via `docker-compose.e2e.yml` with LocalStack and WireMock is just an over-bloated integration test. True E2E requires a dedicated AWS Sandbox account and a live GitHub Test Org to validate real IAM policies, real GitHub App permissions, real rate limiting, and actual WebSocket streams. Mocks hide integration drift.
**Anti-Pattern 2: Over-Reliance on Synthetic Factories (Section 9.2)**
The use of `make_aws_service()` and `make_github_repo()` factories that generate `fake.word()` is dangerous. Generating random structure instead of capturing real API payloads leads to tests passing on fake data that real AWS/GitHub APIs never return. VCR/record-replay patterns are mandatory for this product.
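To make the contrast concrete, here is a minimal sketch of the record-replay shape this review is asking for. The cassette payload and the `parse_repos` helper are hypothetical; in practice the cassette would be recorded from the live GitHub API with a VCR-style tool and stored on disk, not inlined.

```python
import json

# A trimmed, hand-captured GitHub GraphQL response (hypothetical cassette;
# a real one would be recorded once against the live API and replayed in CI).
RECORDED_CASSETTE = json.dumps({
    "data": {
        "organization": {
            "repositories": {
                "pageInfo": {"hasNextPage": True, "endCursor": "Y3Vyc29yOnYyOpHO"},
                "nodes": [
                    {"name": "payments-api", "isArchived": False,
                     "primaryLanguage": {"name": "Python"}},
                    {"name": "legacy-ingest", "isArchived": True,
                     "primaryLanguage": None},  # real APIs return nulls; factories rarely do
                ],
            }
        }
    }
})

def parse_repos(raw: str) -> list[dict]:
    """Hypothetical scanner parser: must survive real-world nulls and paging."""
    body = json.loads(raw)["data"]["organization"]["repositories"]
    return [
        {"name": n["name"],
         "archived": n["isArchived"],
         "language": (n["primaryLanguage"] or {}).get("name")}
        for n in body["nodes"]
    ]

repos = parse_repos(RECORDED_CASSETTE)
assert repos[1]["language"] is None  # the null-language case a fake.word() factory never generates
```

The point of the test is the fixture's provenance: a `fake.word()` factory would never produce the `"primaryLanguage": None` shape that real repos return, so the parser bug it exposes stays hidden until production.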
**Anti-Pattern 3: State Corruption on Failed Scans (Section 3.3)**
The test `test_marks_previously_discovered_service_as_stale_when_missing` implies a destructive update. If a GitHub GraphQL query times out and returns a partial repo list, does the reconciler mark the rest of the org's services as stale? The test architecture doesn't define resilience against partial API failures.
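The resilience property worth pinning down in a test looks roughly like this. This is a sketch, not the product's actual reconciler; the function signature and the `scan_complete` flag are illustrative assumptions.

```python
def reconcile(catalog: dict[str, dict], scanned: set[str], scan_complete: bool) -> dict[str, dict]:
    """Mark services missing from a scan as stale -- but only when the scan
    finished cleanly. A partial scan (timeout, 500, truncated page) must
    never destructively mutate the catalog."""
    for service_id, record in catalog.items():
        if service_id in scanned:
            record["stale"] = False
        elif scan_complete:
            record["stale"] = True
        # else: partial scan -> leave the record untouched
    return catalog

catalog = {"svc-a": {"stale": False}, "svc-b": {"stale": False}}
# The GitHub query timed out after returning only svc-a:
reconcile(catalog, scanned={"svc-a"}, scan_complete=False)
assert catalog["svc-b"]["stale"] is False  # not marked stale on partial data
```

Whatever the real implementation looks like, the test architecture should assert exactly this invariant: incomplete scan results never flip untouched services to stale.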
## 5. Transparent Factory Gaps
Section 8 attempts to map the Transparent Factory tenets to the test architecture, but it exposes a critical, foundational flaw in the product's architecture.
**The Elastic Schema Contradiction (Section 8.2 vs. Epic 10.2):**
Epic 10.2 explicitly mandates: "all DynamoDB catalog schema changes... DynamoDB Single Table Design".
However, Architecture.md and Test Architecture Section 8.2 explicitly build around **PostgreSQL (Aurora Serverless v2)** and test for `drop_column` in `.sql` migrations.
* The test architecture tests a SQL database.
* The epic demands a NoSQL database.
This is a fatal misalignment. You must decide whether the core catalog is relational or single-table, and align the elastic schema testing accordingly. The current SQL tests in Section 8.2 do nothing to enforce Epic 10.2's DynamoDB rules.
Beyond this, Section 8.1 (Atomic Flagging) covers the phantom quarantine circuit breaker, but doesn't test the TTL expiration logic or OpenFeature integration. The CI check blocking expired flags (Story 10.1) is untested.
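The missing CI check is small enough to sketch directly. The registry shape below is an assumption; a real project would load flag metadata from OpenFeature config or YAML, but the gate's logic is the same.

```python
from datetime import date

# Hypothetical flag registry; in practice loaded from the flag provider's config.
FLAGS = [
    {"name": "phantom_quarantine", "owner": "alice", "expires": date(2026, 6, 1)},
    {"name": "legacy_scan_path", "owner": "bob", "expires": date(2025, 1, 15)},
]

def expired_flags(flags, today):
    """CI gate: return flags past their TTL so the build can fail loudly."""
    return [f["name"] for f in flags if f["expires"] < today]

stale = expired_flags(FLAGS, today=date(2026, 3, 1))
assert stale == ["legacy_scan_path"]  # CI should exit non-zero when this list is non-empty
```

A test for Story 10.1 then only needs to assert that the CI script exits non-zero when this list is non-empty, and that TTL-less flags are rejected at registration time.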
## 6. Performance Test Gaps
Section 6.1 (Discovery Scan Benchmarks) is insufficient for the target persona. A 500+ repo org will routinely have >10,000 AWS resources (CFN Stacks, Lambdas, ECS tasks, IAM roles, API Gateways, ALBs) across multiple regions.
The benchmark `aws_scan_completes_within_3min_for_500_resources` exercises a toy account, not a 500-repo org. Validating the 5-minute miracle only at this scale guarantees an architectural collapse at 10x the size, due to:
* Step Functions Payload Limits (256KB max event size between states).
* API Gateway WebSocket Timeout (2 hours max, 30s idle).
* Lambda execution timeouts for massive pagination operations across 10+ regions.
Performance tests must validate the pipeline's limits up to 10,000 resources.
## 7. Missing Test Scenarios
The test architecture neglects the harsh reality of distributed systems and external APIs.
* **GitHub API Rate Limits (Concurrency):** Section 3.2 mentions handling `Retry-After`. However, the GitHub GraphQL API rate limit is per installation *and* concurrent points. If 5 tenants in the same GitHub App installation run discovery concurrently, or 2 tenants with 1,000 repos each run at the same time, the GitHub GraphQL API will severely throttle. The test architecture has no cross-tenant concurrent throttling tests.
* **Stale Service Detection (Partial Failures):** Section 3.3 handles stale services when missing, but what if the GitHub GraphQL scanner only completes 50% of the query due to a timeout or 500 error? Does the system mark the other 50% as stale/deleted? There must be tests validating that partial discovery scans *do not* destructively mutate the catalog.
* **Ownership Conflicts:** Section 3.4 tests ambiguous ownership, but misses the scenario where multiple services claim the same repository through conflicting tags or deployment workflows.
* **Concurrent Discovery Scans (Same Tenant):** If a user clicks "Rescan" 5 times in 10 seconds, does the system spawn 5 concurrent Step Functions? Does it queue them? The `test_concurrent_scans_for_different_tenants_dont_conflict` doesn't test the same tenant race condition.
* **Meilisearch Index Corruption:** What happens when the SQS -> Meilisearch sync Lambda fails to map a new document structure? The system needs tests validating index rebuilding and graceful handling of mapping errors without disrupting search for existing services.
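The same-tenant rescan race above has a well-understood test shape. The class below is a deliberately simplified in-memory stand-in; in production this guard would be a Redis `SETNX` or a DynamoDB conditional write, and all names here are illustrative.

```python
class ScanScheduler:
    """Illustrative guard against same-tenant scan storms. Production would
    back this with Redis SETNX or a DynamoDB conditional write, not a set."""

    def __init__(self):
        self._running: set[str] = set()

    def request_scan(self, tenant_id: str) -> bool:
        if tenant_id in self._running:
            return False  # coalesce: a scan is already in flight for this tenant
        self._running.add(tenant_id)
        return True

    def finish_scan(self, tenant_id: str) -> None:
        self._running.discard(tenant_id)

sched = ScanScheduler()
results = [sched.request_scan("tenant-1") for _ in range(5)]  # user clicks "Rescan" x5
assert results == [True, False, False, False, False]          # 1 scan, 4 coalesced
assert sched.request_scan("tenant-2") is True                  # other tenants unaffected
```

The test architecture should assert exactly this behavior against the real store, plus lock expiry so a crashed scan does not block rescans forever.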
## 8. Recommendations (Top 5 Prioritized)
Don't be nice to the initial plan. This test architecture is over-engineered where it shouldn't be and under-tests the riskiest parts of the product.
1. **Resolve the Database Misalignment (Epic 3.1 & 10.2 vs. Section 8.2):**
Decide immediately if the core catalog is PostgreSQL (Aurora) or DynamoDB. If DynamoDB, rewrite the entire Test Architecture Section 8.2, Section 3.5, and Section 4.3. If PostgreSQL, update Epic 10.2 to test SQL migrations (`add column`, `create table`) and drop the DynamoDB NoSQL references entirely.
2. **Invert the Test Pyramid (Section 2.1):**
A 70% unit test ratio using `moto` and `responses` for integration-heavy glue code is an anti-pattern. Shift to an Integration Honeycomb. Drop 50% of the scanner unit tests and replace them with integration tests using VCR (for GitHub) and LocalStack (for AWS). Mocking external APIs in unit tests will break the 5-Minute Miracle on day one of a real schema change.
3. **Mandate Real E2E Infrastructure (Section 5.4):**
Replace LocalStack in the "5-Minute Miracle" (Section 5.1) with a dedicated AWS Sandbox account and real GitHub Org. Testing the PLG motion against Wiremock is just a glorified integration test. The real risks (IAM assumption, GraphQL schema, real STS) are hidden by `docker-compose.e2e.yml`.
4. **Cover the Business-Critical Missing Epics (Epics 9, 3, 5):**
Add explicit test suites for:
* Stripe Checkout webhooks & billing state (Story 9.2).
* WebSocket real-time progress streams (Story 9.4).
* PagerDuty schedule mapping and credentials (Story 3.4).
* Vite/React frontend tests for Service Cards & Dashboards (Stories 5.4, 6.1).
5. **Scale Performance Benchmarks 10x (Section 6.1):**
Benchmarking 500 resources is completely inadequate for the target persona. Mandate tests validating discovery against 10,000 AWS resources and 1,000 GitHub repos to test Step Functions payload constraints, API Gateway limits, and concurrent rate limiting.
135
products/05-aws-cost-anomaly/test-architecture/review.md
Normal file
@@ -0,0 +1,135 @@
# dd0c/cost — Test Architecture Review
**Reviewer:** TDD Consultant (Manual Review)
**Date:** March 1, 2026
**Verdict:** 🔴 NEEDS SIGNIFICANT WORK — This is the weakest test architecture of all 6 products.
---
## 1. Coverage Analysis
This document is 232 lines. For comparison, P1 (route) is 2,241 lines and P6 (run) is 1,762 lines. The coverage gaps are massive.
| Epic | Coverage Status | Notes |
|------|----------------|-------|
| Epic 1: CloudTrail Ingestion | ⚠️ Partial | Section 3.1 has 5 test cases for the normalizer. Missing: SQS FIFO deduplication tests, DLQ retry behavior, EventBridge cross-account rule tests, S3 raw event archival. Story 1.2 (SQS + DLQ) has zero dedicated tests. |
| Epic 2: Anomaly Detection | ✅ Decent | Section 3.2 covers Z-score, novelty, cold-start. But missing: composite score weighting tests, edge cases (zero stddev, negative costs, NaN handling), baseline maturity transition tests. |
| Epic 3: Zombie Hunter | ❌ Missing | Zero test cases. The daily scan for idle/stopped resources that are still costing money has no tests at all. |
| Epic 4: Notification & Remediation | ⚠️ Thin | Section 4.2 has 3 integration tests for cross-account actions. Missing: Slack Block Kit formatting tests, daily digest aggregation, snooze/dismiss logic, interactive payload signature validation. |
| Epic 5: Onboarding & PLG | ❌ Missing | Zero test cases. CloudFormation template generation, Stripe billing, free tier enforcement — none tested. |
| Epic 6: Dashboard API | ❌ Missing | Zero test cases. REST API endpoints, tenant isolation, query performance — nothing. |
| Epic 7: Dashboard UI | ❌ Missing | Zero test cases. |
| Epic 8: Infrastructure (CDK) | ❌ Missing | Zero test cases. No CDK snapshot tests, no infrastructure drift detection (ironic). |
| Epic 9: Multi-Account Management | ❌ Missing | Zero test cases. Account linking, bulk scanning, cross-account permissions — nothing. |
| Epic 10: Transparent Factory | 🔴 Skeletal | Section 8 has exactly 3 test cases total across 2 of 5 tenets. Elastic Schema, Cognitive Durability, and Semantic Observability have zero tests. |
**Bottom line:** 5 of 10 epics have zero test coverage in this document. This is a skeleton, not a test architecture.
---
## 2. TDD Workflow Critique
The philosophy in Section 1 is sound — "test the math first" is correct for an anomaly detection product. But the execution is incomplete:
- The "strict TDD" list correctly identifies scoring and governance as test-first. Good.
- The "integration tests lead" for CloudTrail ingestion is acceptable.
- **Missing:** No guidance on testing the Welford algorithm implementation. This is a numerical algorithm with known floating-point edge cases (catastrophic cancellation with large values). The test architecture should mandate property-based testing (e.g., `fast-check`) for the baseline calculator, not just 3 example-based tests.
- **Missing:** No guidance on testing the 14-day auto-promotion state machine. This is a time-dependent state transition that needs fake clock testing.
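The Welford edge cases called out above fit in a few lines of pure code, which is exactly why they should be tested exhaustively. This is a sketch under stated assumptions: the class and attribute names are illustrative, not the product's actual baseline calculator.

```python
import math

class WelfordBaseline:
    """Streaming mean/variance via Welford's algorithm, with the review's
    edge cases handled explicitly (zero stddev, single data point)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def z_score(self, x: float) -> float:
        sd = self.stddev()
        if sd == 0.0:  # zero stddev / single data point: no division by zero
            return 0.0 if x == self.mean else math.inf
        return (x - self.mean) / sd

b = WelfordBaseline()
b.add(100.0)                          # single data point
assert b.z_score(100.0) == 0.0        # not a ZeroDivisionError
for _ in range(13):
    b.add(100.0)                      # perfectly flat baseline -> stddev == 0
assert b.stddev() == 0.0
assert b.z_score(250.0) == math.inf   # spike on a flat baseline is flagged, not crashed
```

A property-based suite would then generate arbitrary cost streams and assert invariants (stddev is never negative, mean stays within min/max of inputs, scoring never raises) rather than relying on three hand-picked examples.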
---
## 3. Test Pyramid Balance
The 70/20/10 ratio is stated but not justified. For dd0c/cost:
- **Unit tests should be higher (80%)** — the anomaly scoring engine is pure math. It should have exhaustive property-based tests, not just 50 example tests.
- **Integration tests (15%)** — DynamoDB Single-Table patterns, EventBridge→SQS→Lambda pipeline, cross-account STS.
- **E2E (5%)** — two journeys is fine for V1, but they need to be more detailed.
The current Section 6 (Performance) has exactly 2 test cases. For a product that processes CloudTrail events at scale, this is dangerously thin.
---
## 4. Anti-Patterns
1. **Section 3.3 — Welford Algorithm:** Only 3 tests for a numerical algorithm. This is the "happy path only" anti-pattern. Missing: what happens when stddev is 0 (division by zero in Z-score)? What happens with a single data point? What happens with extremely large values (float overflow)?
2. **Section 4.1 — DynamoDB Transaction Test:** "writes CostEvent and updates Baseline in single transaction" — this tests the happy path. Where's the test for transaction failure? DynamoDB transactions can fail due to conflicts, and the system must handle partial writes.
3. **Section 5 — E2E Journeys:** Journey 2 tests "Stop Instance" remediation but doesn't test what happens when the customer's IAM role has been revoked between alert and remediation click. This is a real-world race condition.
4. **No negative tests anywhere.** What happens when CloudTrail sends malformed JSON? What happens when the pricing table doesn't have the instance type? (Section 3.1 mentions "fallback pricing" but there's only 1 test for it.)
---
## 5. Transparent Factory Gaps
Section 8 is the biggest problem. It has 3 test cases across 2 tenets. Here's what's missing:
### Atomic Flagging (1 test → needs ~10)
- Missing: flag default state (off), flag TTL enforcement, flag owner metadata, local evaluation (no network calls), CI block on expired flags, multiple concurrent flags.
- The single circuit breaker test uses ">10 alerts/hour" but Epic 10.1 specifies ">3x baseline" — inconsistency.
### Elastic Schema (0 tests → needs ~8)
- Zero tests. Need: migration lint (no DROP/RENAME/TYPE), additive-only DynamoDB attribute changes, V1 code ignoring V2 attributes, sunset date enforcement, dual-write during migration.
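The migration lint is the easiest of these to start with. A sketch of the test shape, phrased against SQL migrations for concreteness (the same additive-only rule applies to DynamoDB attribute changes); the regex and function name are assumptions, and a real gate would parse the statements rather than pattern-match:

```python
import re

# Destructive DDL the Elastic Schema tenet forbids: DROP, RENAME, TYPE changes.
FORBIDDEN = re.compile(
    r"\b(DROP\s+(TABLE|COLUMN)|RENAME\s+(TO|COLUMN)|ALTER\s+COLUMN\s+\w+\s+TYPE)\b",
    re.IGNORECASE,
)

def lint_migration(sql: str) -> list[str]:
    """Additive-only migration lint: return the destructive statements found,
    so CI can fail when the list is non-empty."""
    return [m.group(0) for m in FORBIDDEN.finditer(sql)]

assert lint_migration("ALTER TABLE cost_events ADD COLUMN region TEXT;") == []
violations = lint_migration("ALTER TABLE cost_events DROP COLUMN region;")
assert violations  # CI fails: destructive change detected
```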
### Cognitive Durability (0 tests → needs ~5)
- Zero tests. Need: decision log schema validation, CI enforcement for scoring PRs, cyclomatic complexity gate, decision log presence check.
### Semantic Observability (0 tests → needs ~8)
- Zero tests. Need: OTEL span emission on every anomaly scoring decision, span attributes (cost.anomaly_score, cost.z_score, cost.baseline_days), PII protection (account ID hashing), fast-path span attributes.
### Configurable Autonomy (2 tests → needs ~8)
- The 14-day auto-promotion tests are good but incomplete. Missing: panic mode activation (<1s), panic mode stops all alerting, per-account governance override, policy decision logging, governance drift monitoring.
---
## 6. Performance Test Gaps
Section 6 has 2 tests. For a real-time cost monitoring product, this is inadequate:
- **Missing:** Burst ingestion (what happens when 1000 CloudTrail events arrive in 1 second during an auto-scaling event?)
- **Missing:** Baseline calculation performance with 90 days of historical data per account
- **Missing:** Anomaly scoring latency under concurrent multi-account evaluation
- **Missing:** DynamoDB hot partition detection (all events for one account hitting the same partition key)
- **Missing:** SQS FIFO throughput limits (300 msg/s per message group — what happens when a large account exceeds this?)
- **Missing:** Lambda cold start impact on end-to-end latency
---
## 7. Missing Test Scenarios
### Security
- **CloudTrail event forgery:** What if someone sends fake CloudTrail events to the EventBridge bus? HMAC/signature validation?
- **Slack interactive payload signature:** Slack sends a signing secret with interactive payloads. No test validates this.
- **Cross-account IAM role revocation:** Customer revokes the dd0c role between alert and remediation click.
- **Remediation authorization:** Who can click "Terminate"? No RBAC tests for remediation actions.
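The Slack signature gap is worth a concrete test, because the verification scheme is small and well documented: Slack signs `v0:{timestamp}:{body}` with HMAC-SHA256 under the app's signing secret. A sketch of the check and the three tests it needs (valid, forged, replayed); the helper name and 5-minute replay window are conventional choices, not the product's code:

```python
import hashlib, hmac, time

def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           received_sig: str, now=None) -> bool:
    """Slack v0 request signing: reject stale timestamps (replay protection)
    and compare HMAC-SHA256 signatures in constant time."""
    if abs((now or time.time()) - int(timestamp)) > 60 * 5:
        return False  # older than 5 minutes: possible replay
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)

secret, ts, body = "test-secret", "1700000000", '{"actions":[{"value":"dismiss"}]}'
good = "v0=" + hmac.new(secret.encode(), f"v0:{ts}:{body}".encode(), hashlib.sha256).hexdigest()
assert verify_slack_signature(secret, ts, body, good, now=1700000100.0)        # valid
assert not verify_slack_signature(secret, ts, body, "v0=forged", now=1700000100.0)  # forged
assert not verify_slack_signature(secret, ts, body, good, now=1700009999.0)    # replayed
```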
### Data Integrity
- **CloudTrail event deduplication:** CloudTrail can send duplicate events. SQS FIFO dedup is mentioned in Epic 1.2 but has zero tests.
- **Baseline corruption recovery:** What if a DynamoDB write partially fails and corrupts the running mean/stddev? No recovery tests.
- **Pricing table staleness:** Static pricing tables will become stale. No test validates that the system handles unknown instance types gracefully beyond the single "fallback pricing" test.
- **Cost calculation precision:** Floating-point arithmetic on money. No tests for rounding behavior or currency precision.
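The precision point deserves a test on day one, because the fix is cheap: do cost math in `Decimal` with an explicit rounding mode, never in `float`. A minimal sketch (the helper and the sample rate are illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP

def hourly_cost_usd(rate_per_hour: str, hours: int) -> Decimal:
    """Cost math on Decimal, never float. Quantize to the cent with a fixed
    rounding mode so totals are reproducible across runs."""
    return (Decimal(rate_per_hour) * hours).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Float arithmetic drifts; Decimal does not:
assert 0.1 + 0.2 != 0.3                                  # the classic float trap
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
assert hourly_cost_usd("0.0416", 730) == Decimal("30.37")  # t3.medium-ish rate x 1 month
```

Tests should pin the rounding mode explicitly; `ROUND_HALF_UP` vs. banker's rounding produces different monthly totals, and silent drift here erodes customer trust in every anomaly alert.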
### Operational
- **DLQ overflow:** What happens when the DLQ fills up? No backpressure tests.
- **Multi-tenant isolation:** No tests ensuring one tenant's anomalies don't leak to another tenant's Slack channel.
- **Account onboarding race condition:** What if CloudTrail events arrive before the account is fully onboarded?
---
## 8. Top 5 Recommendations (Prioritized)
1. **Expand to cover all 10 epics.** 5 epics have zero tests. At minimum, add unit test stubs for Zombie Hunter (Epic 3), Onboarding (Epic 5), and Dashboard API (Epic 6). These are customer-facing features.
2. **Rewrite Section 8 (Transparent Factory) from scratch.** 3 tests across 2 tenets is unacceptable. Every tenet needs 5-10 tests. The Elastic Schema and Semantic Observability sections are completely empty.
3. **Add property-based testing for the anomaly math.** The Welford algorithm, Z-score calculation, and composite scoring are numerical — they need `fast-check` or equivalent, not just example-based tests. Test edge cases: zero stddev, single data point, NaN, Infinity, negative costs.
4. **Add security tests.** Slack payload signature validation, CloudTrail event authenticity, cross-account IAM revocation handling, remediation RBAC. This product executes `StopInstances` and `DeleteDBInstance` — security testing is non-negotiable.
5. **Expand performance section to 10+ tests.** Burst ingestion, baseline calculation at scale, DynamoDB hot partitions, SQS FIFO throughput limits, Lambda cold starts. The current 2 tests give zero confidence in production readiness.
---
*This document needs a complete rewrite before it can guide TDD implementation. The scoring engine tests are a good start, but everything else is a placeholder.*
56
products/06-runbook-automation/test-architecture/review.md
Normal file
@@ -0,0 +1,56 @@
# dd0c/run - Test Architecture Review
**Reviewer:** Max Mayfield (Senior TDD Consultant)
**Status:** Requires Remediation 🔴
Look, this is the highest-risk product in the suite. You are literally executing commands in production environments. The test architecture is a solid start, and the "Safety-First TDD" mandate is great in theory, but there are massive blind spots here. If this ships as-is, you're going to have an existential incident.
Here is the rigorous breakdown of where this architecture falls short and what needs fixing immediately.
## 1. Coverage Gaps Per Epic
You've focused heavily on the core state machine but completely ghosted several critical epics:
* **Epic 3 (Execution Engine):** Story 3.4 (Divergence Analysis) is completely missing. No tests for the post-execution analyzer comparing prescribed vs. executed commands or flagging unlisted actions.
* **Epic 5 (Audit Trail):** Story 5.3 (Compliance Export) is absent. Where are the tests for generating the SOC 2 PDF/CSV exports and verifying the S3 links?
* **Epic 6 (Dashboard API):** Story 6.4 (Classification Query API) requires rate-limiting to 30 req/min per tenant. There are zero rate-limiting integration tests specified.
* **Epic 7 (Dashboard UI):** SPA testing is totally ignored. You need Cypress or Playwright tests for the "5-second wow moment" parse preview, Trust Level visualizations, and MTTR dashboard.
* **Epic 9 (Onboarding & PLG):** The entire self-serve tenant provisioning, free-tier limit enforcement (5 runbooks/50 executions), and Agent installation snippet generation lack test coverage.
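To show how cheap the Story 6.4 rate-limit tests are to specify, here is a sliding-window sketch. Production would back this with Redis rather than an in-process deque; the class name and API are assumptions, but the three assertions at the bottom are exactly the integration tests the epic needs.

```python
from collections import deque

class PerTenantRateLimiter:
    """Sliding-window limiter sketch for the 30 req/min-per-tenant rule.
    `now` is injected so tests control the clock."""

    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self._hits: dict[str, deque] = {}

    def allow(self, tenant: str, now: float) -> bool:
        hits = self._hits.setdefault(tenant, deque())
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()            # evict requests outside the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

rl = PerTenantRateLimiter(limit=30, window_s=60.0)
assert all(rl.allow("t1", now=0.0) for _ in range(30))
assert not rl.allow("t1", now=1.0)   # 31st request inside the window: 429
assert rl.allow("t2", now=1.0)       # other tenants are unaffected
assert rl.allow("t1", now=61.0)      # window slides, t1 recovers
```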
## 2. "Safety-First" TDD Enforcement
The policy is strong (Canary suite, red-safety, 95% threshold), but it’s overly centralized on the SaaS side.
* **The Gap:** The Agent runs in customer VPCs (untrusted territory). If the gRPC payload is intercepted or the API gateway is compromised, the SaaS scanner means nothing. You mentioned the Agent-Side Scanner in the thresholds, but there is no explicit TDD mandate for testing Agent-side deterministic blocking, binary integrity, or defense against payload tampering.
## 3. Test Pyramid Evaluation
Your execution engine ratio is `80% Unit, 15% Integration, 5% E2E`.
* **The Reality Check:** That unit ratio is too high for the *orchestration* layer. The state machine in-memory is pure logic, but distributed state is where things break. You need to shift the Execution Engine to at least `60% Unit, 30% Integration, 10% E2E`. You need more integration tests proving state persistence during RDS failovers, Redis evictions, and gRPC disconnects. Pure unit tests won't catch a distributed race condition.
## 4. Anti-Patterns & Sandboxing
You specified an Alpine DinD container for sandboxed execution tests. This is a massive anti-pattern.
* **Why it's bad:** Alpine doesn't reflect real-world AWS AMIs, Ubuntu instances, or RBAC-restricted Kubernetes environments. Testing `echo $(rm -rf /)` in Alpine proves nothing about how your agent handles missing binaries, shell variations (bash vs zsh vs sh), or K8s permission denied errors.
* **The Fix:** Sandbox tests must use realistic targets: Mocked EC2 AMIs, RBAC-configured K3s clusters (not just a blanket k3s mock), and varying user privileges (root vs. limited user).
## 5. Transparent Factory Gaps
You hit most of the tenets, but you missed some critical enforcement mechanics:
* **Atomic Flagging:** You test the 48-hour bake logic, but does the CI script *actually* fail the build if a destructive flag bypasses it? You need an integration test against the CI validation script itself.
* **Elastic Schema:** You test that app roles can't `UPDATE`/`DELETE` the audit log. But Semantic Observability requires "encrypted audit logs." There are **zero tests** verifying data-at-rest encryption (KMS or pg_crypto) for the execution logs.
* **Configurable Autonomy:** You test the Redis logic for Panic Mode, but Story 10.5 requires a webhook endpoint `POST /admin/panic` that works in <1s. You must test the *entire* path from HTTP request to gRPC stream pause, not just the Redis key state.
## 6. Performance Gaps
* **Parse SLA:** Epic 1 dictates a 3.5s SLA for LLM extraction, but your performance benchmarks only cover the Rust normalizer (<10ms). You need an end-to-end integration benchmark proving the 5s total SLA (Parser + Classifier) using recorded LLM latency percentiles.
* **Agent Output Buffering:** You test Slack rate limiting, but what if a user runs `cat /var/log/syslog` and dumps 500MB of text? There are no memory profiling tests or truncation enforcement tests on the Agent side before it attempts to stream over gRPC.
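The truncation test is straightforward to specify because the invariant is a pure function of the byte stream. A sketch, assuming a hypothetical 1 MiB per-step cap (the real limit and marker text are product decisions):

```python
MAX_OUTPUT_BYTES = 1 * 1024 * 1024  # illustrative per-step cap

def truncate_output(chunks, limit=MAX_OUTPUT_BYTES):
    """Bound command output before it leaves the Agent. Consume chunks lazily,
    stop at the cap, and append a marker so the SaaS side knows data was cut."""
    kept, total = [], 0
    for chunk in chunks:
        if total + len(chunk) > limit:
            kept.append(chunk[: limit - total])
            kept.append(b"\n[output truncated by agent]\n")
            break
        kept.append(chunk)
        total += len(chunk)
    return b"".join(kept)

# A ~500 MB `cat` must never be buffered whole: the generator below is lazy,
# and only enough chunks to hit the cap are ever consumed.
huge = (b"x" * 65536 for _ in range(8000))
out = truncate_output(huge)
assert len(out) <= MAX_OUTPUT_BYTES + len(b"\n[output truncated by agent]\n")
assert out.endswith(b"[output truncated by agent]\n")
```

The memory-profiling companion test would assert the Agent's RSS stays flat while this runs, which the lazy-generator structure here is designed to make possible.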
## 7. Missing Threat Scenarios
Your chaos and security tests are missing these highly probable vectors:
* **Advanced Command Injection:** Testing `echo $(rm)` is elementary. Where are the tests for `; rm -rf /`, `| rm -rf /`, backticks, and newline injections?
* **Privilege Escalation:** You block `sudo`, but what if the Agent is run as root by the customer? Tests must cover execution under different user contexts to prevent horizontal escalation.
* **Rollback Failures:** You test mid-execution failure triggering a rollback. What happens if the *rollback command itself* fails or hangs indefinitely? You need a test for nested/fatal failure states.
* **Double Execution (Partial Failure):** The Agent executes a step successfully but the network partitions *before* it sends the success ACK to the engine. Your idempotency test covers duplicate IDs, but does it handle agent reconnections re-syncing an already-executed state?
* **Slack Payload Forgery:** You test that approval timeouts mark as stalled, but what if someone curls the Slack approval webhook directly with a forged payload? You must test Slack signature verification.
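The injection vectors above translate directly into a test table. The deny-list below is a sketch only; a real scanner should parse with `shlex` or, better, execute via argv arrays with no shell at all. The point is the vector list the suite must cover beyond `echo $(rm)`:

```python
import re

# Illustrative deny-list for shell metacharacter injection.
INJECTION_PATTERNS = [
    re.compile(r";"),            # command chaining:   safe_cmd; rm -rf /
    re.compile(r"\|"),           # piping:             safe_cmd | rm -rf /
    re.compile(r"`[^`]*`"),      # backtick subshell:  `rm -rf /`
    re.compile(r"\$\([^)]*\)"),  # $() subshell
    re.compile(r"[\r\n]"),       # newline injection
]

def is_suspicious(cmd: str) -> bool:
    return any(p.search(cmd) for p in INJECTION_PATTERNS)

vectors = [
    "ls; rm -rf /",
    "cat f | rm -rf /",
    "echo `rm -rf /`",
    "echo $(rm -rf /)",
    "ls\nrm -rf /",
]
assert all(is_suspicious(v) for v in vectors)
assert not is_suspicious("systemctl status nginx")  # benign commands still pass
```

Whatever the production scanner looks like, every one of these vectors belongs in the canary suite, on both the SaaS and Agent sides.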
## 8. Top 5 Recommendations
1. **Mandate Agent-Side Threat Testing:** The Agent is in a zero-trust environment. Write explicit E2E tests for agent binary tampering, payload forgery, and local fallback scanning if the SaaS connection drops.
2. **Upgrade the Execution Sandbox:** Ditch the plain Alpine container. Implement a matrix of execution sandboxes (Ubuntu, Amazon Linux 2023, RBAC-restricted K3s) and test execution with non-root users.
3. **Close the Epic Coverage Blackholes:** Write test architecture specs for Divergence Analysis (Epic 3.4), Compliance Exports (Epic 5.3), API Rate Limiting (Epic 6.4), and PLG Onboarding limits (Epic 9).
4. **Test the Unknowns of Rollbacks & Double Executions:** Add explicit chaos tests for when rollback commands fail, and simulate network drops occurring exactly between command completion and gRPC acknowledgment.
5. **Enforce Cryptographic Audit Tests:** Add tests verifying that raw commands and outputs in the PostgreSQL audit log are actually encrypted at rest using pg_crypto or AWS KMS, as mandated by the observability tenets.
Fix these gaps before you write a single line of execution code, or you're asking for a breach.