# dd0c/cost — Test Architecture Review

**Reviewer:** TDD Consultant (Manual Review)
**Date:** March 1, 2026
**Verdict:** 🔴 NEEDS SIGNIFICANT WORK — This is the weakest test architecture of all 6 products.

---

## 1. Coverage Analysis

This document is 232 lines. For comparison, P1 (route) is 2,241 lines and P6 (run) is 1,762 lines. The coverage gaps are massive.

| Epic | Coverage Status | Notes |
|------|----------------|-------|
| Epic 1: CloudTrail Ingestion | ⚠️ Partial | Section 3.1 has 5 test cases for the normalizer. Missing: SQS FIFO deduplication tests, DLQ retry behavior, EventBridge cross-account rule tests, S3 raw event archival. Story 1.2 (SQS + DLQ) has zero dedicated tests. |
| Epic 2: Anomaly Detection | ✅ Decent | Section 3.2 covers Z-score, novelty, and cold-start. But missing: composite score weighting tests, edge cases (zero stddev, negative costs, NaN handling), baseline maturity transition tests. |
| Epic 3: Zombie Hunter | ❌ Missing | Zero test cases. The daily scan for idle/stopped resources that are still costing money has no tests at all. |
| Epic 4: Notification & Remediation | ⚠️ Thin | Section 4.2 has 3 integration tests for cross-account actions. Missing: Slack Block Kit formatting tests, daily digest aggregation, snooze/dismiss logic, interactive payload signature validation. |
| Epic 5: Onboarding & PLG | ❌ Missing | Zero test cases. CloudFormation template generation, Stripe billing, free tier enforcement — none tested. |
| Epic 6: Dashboard API | ❌ Missing | Zero test cases. REST API endpoints, tenant isolation, query performance — nothing. |
| Epic 7: Dashboard UI | ❌ Missing | Zero test cases. |
| Epic 8: Infrastructure (CDK) | ❌ Missing | Zero test cases. No CDK snapshot tests, no infrastructure drift detection (ironic). |
| Epic 9: Multi-Account Management | ❌ Missing | Zero test cases. Account linking, bulk scanning, cross-account permissions — nothing. |
| Epic 10: Transparent Factory | 🔴 Skeletal | Section 8 has exactly 3 test cases total across 2 of 5 tenets. Elastic Schema, Cognitive Durability, and Semantic Observability have zero tests. |

**Bottom line:** 5 of 10 epics have zero test coverage in this document. This is a skeleton, not a test architecture.

---

## 2. TDD Workflow Critique

The philosophy in Section 1 is sound — "test the math first" is correct for an anomaly detection product. But the execution is incomplete:

- The "strict TDD" list correctly identifies scoring and governance as test-first. Good.
- The "integration tests lead" approach for CloudTrail ingestion is acceptable.
- **Missing:** No guidance on testing the Welford algorithm implementation. This is a numerical algorithm with known floating-point edge cases (catastrophic cancellation with large values). The test architecture should mandate property-based testing (e.g., `fast-check`) for the baseline calculator, not just 3 example-based tests.
- **Missing:** No guidance on testing the 14-day auto-promotion state machine. This is a time-dependent state transition that needs fake-clock testing.

---

## 3. Test Pyramid Balance

The 70/20/10 ratio is stated but not justified. For dd0c/cost:

- **Unit tests should be higher (80%)** — the anomaly scoring engine is pure math. It should have exhaustive property-based tests, not just 50 example tests.
- **Integration tests (15%)** — DynamoDB single-table patterns, the EventBridge→SQS→Lambda pipeline, cross-account STS.
- **E2E (5%)** — two journeys is fine for V1, but they need more detail.

The current Section 6 (Performance) has exactly 2 test cases. For a product that processes CloudTrail events at scale, this is dangerously thin.

---

## 4. Anti-Patterns

1. **Section 3.3 — Welford Algorithm:** Only 3 tests for a numerical algorithm. This is the "happy path only" anti-pattern. Missing: what happens when stddev is 0 (division by zero in the Z-score)? What happens with a single data point?
What happens with extremely large values (float overflow)?

2. **Section 4.1 — DynamoDB Transaction Test:** "writes CostEvent and updates Baseline in single transaction" tests only the happy path. Where is the test for transaction failure? DynamoDB transactions can fail due to conflicts, and the system must handle partial writes.
3. **Section 5 — E2E Journeys:** Journey 2 tests "Stop Instance" remediation but doesn't test what happens when the customer's IAM role has been revoked between the alert and the remediation click. This is a real-world race condition.
4. **No negative tests anywhere.** What happens when CloudTrail sends malformed JSON? What happens when the pricing table doesn't have the instance type? (Section 3.1 mentions "fallback pricing" but there is only 1 test for it.)

---

## 5. Transparent Factory Gaps

Section 8 is the biggest problem. It has 3 test cases across 2 tenets. Here's what's missing:

### Atomic Flagging (1 test → needs ~10)

- Missing: flag default state (off), flag TTL enforcement, flag owner metadata, local evaluation (no network calls), CI block on expired flags, multiple concurrent flags.
- The single circuit-breaker test uses ">10 alerts/hour" but Epic 10.1 specifies ">3x baseline" — an inconsistency.

### Elastic Schema (0 tests → needs ~8)

- Zero tests. Need: migration lint (no DROP/RENAME/TYPE), additive-only DynamoDB attribute changes, V1 code ignoring V2 attributes, sunset date enforcement, dual-write during migration.

### Cognitive Durability (0 tests → needs ~5)

- Zero tests. Need: decision log schema validation, CI enforcement for scoring PRs, a cyclomatic complexity gate, decision log presence checks.

### Semantic Observability (0 tests → needs ~8)

- Zero tests. Need: OTEL span emission on every anomaly scoring decision, span attributes (`cost.anomaly_score`, `cost.z_score`, `cost.baseline_days`), PII protection (account ID hashing), fast-path span attributes.
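The Semantic Observability checks above lend themselves to a small contract test. A minimal sketch, assuming a plain in-memory recorder standing in for a real OTEL test exporter; `scoreWithSpan`, `hashAccountId`, and the span name are hypothetical illustrations built around the attributes listed above, not the product's actual instrumentation:

```typescript
import { createHash } from "node:crypto";

// Hypothetical span shape; a real suite would capture spans via an
// OTEL in-memory exporter instead of this array.
interface ScoringSpan {
  name: string;
  attributes: Record<string, string | number>;
}

const recordedSpans: ScoringSpan[] = [];

// PII protection: emit a short digest, never the raw account ID.
function hashAccountId(accountId: string): string {
  return createHash("sha256").update(accountId).digest("hex").slice(0, 16);
}

// Hypothetical scoring wrapper that emits one span per decision.
function scoreWithSpan(
  accountId: string,
  anomalyScore: number,
  zScore: number,
  baselineDays: number,
): void {
  recordedSpans.push({
    name: "cost.anomaly.scoring",
    attributes: {
      "cost.anomaly_score": anomalyScore,
      "cost.z_score": zScore,
      "cost.baseline_days": baselineDays,
      "cost.account_hash": hashAccountId(accountId),
    },
  });
}

// Contract check: exactly one span per decision, all required attributes
// present, and the raw account ID never appearing in any attribute value.
function assertSpanContract(accountId: string, expectedSpans: number): void {
  if (recordedSpans.length !== expectedSpans) {
    throw new Error(`expected ${expectedSpans} spans, got ${recordedSpans.length}`);
  }
  const required = [
    "cost.anomaly_score",
    "cost.z_score",
    "cost.baseline_days",
    "cost.account_hash",
  ];
  for (const span of recordedSpans) {
    for (const key of required) {
      if (!(key in span.attributes)) throw new Error(`missing attribute ${key}`);
    }
    if (Object.values(span.attributes).includes(accountId)) {
      throw new Error("raw account ID leaked into span attributes");
    }
  }
}
```

The value of writing the assertion against a recorded span, rather than the scoring function's return value, is that the same check extends unchanged to the fast-path attributes once real instrumentation exists.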
### Configurable Autonomy (2 tests → needs ~8)

- The 14-day auto-promotion tests are good but incomplete. Missing: panic mode activation (<1s), panic mode stopping all alerting, per-account governance overrides, policy decision logging, governance drift monitoring.

---

## 6. Performance Test Gaps

Section 6 has 2 tests. For a real-time cost monitoring product, this is inadequate:

- **Missing:** Burst ingestion (what happens when 1,000 CloudTrail events arrive in 1 second during an auto-scaling event?)
- **Missing:** Baseline calculation performance with 90 days of historical data per account
- **Missing:** Anomaly scoring latency under concurrent multi-account evaluation
- **Missing:** DynamoDB hot-partition detection (all events for one account hitting the same partition key)
- **Missing:** SQS FIFO throughput limits (300 msg/s per message group — what happens when a large account exceeds this?)
- **Missing:** Lambda cold-start impact on end-to-end latency

---

## 7. Missing Test Scenarios

### Security

- **CloudTrail event forgery:** What if someone sends fake CloudTrail events to the EventBridge bus? Is there HMAC/signature validation?
- **Slack interactive payload signature:** Slack signs interactive payloads using a signing secret. No test validates the signature.
- **Cross-account IAM role revocation:** The customer revokes the dd0c role between the alert and the remediation click.
- **Remediation authorization:** Who can click "Terminate"? No RBAC tests for remediation actions.

### Data Integrity

- **CloudTrail event deduplication:** CloudTrail can send duplicate events. SQS FIFO dedup is mentioned in Epic 1.2 but has zero tests.
- **Baseline corruption recovery:** What if a DynamoDB write partially fails and corrupts the running mean/stddev? No recovery tests.
- **Pricing table staleness:** Static pricing tables will become stale. No test validates that the system handles unknown instance types gracefully beyond the single "fallback pricing" test.
- **Cost calculation precision:** Floating-point arithmetic on money. No tests for rounding behavior or currency precision.

### Operational

- **DLQ overflow:** What happens when the DLQ fills up? No backpressure tests.
- **Multi-tenant isolation:** No tests ensuring one tenant's anomalies don't leak to another tenant's Slack channel.
- **Account onboarding race condition:** What if CloudTrail events arrive before the account is fully onboarded?

---

## 8. Top 5 Recommendations (Prioritized)

1. **Expand to cover all 10 epics.** 5 epics have zero tests. At minimum, add unit test stubs for Zombie Hunter (Epic 3), Onboarding (Epic 5), and the Dashboard API (Epic 6). These are customer-facing features.
2. **Rewrite Section 8 (Transparent Factory) from scratch.** 3 tests across 2 tenets is unacceptable. Every tenet needs 5-10 tests. The Elastic Schema and Semantic Observability sections are completely empty.
3. **Add property-based testing for the anomaly math.** The Welford algorithm, Z-score calculation, and composite scoring are numerical — they need `fast-check` or an equivalent, not just example-based tests. Test the edge cases: zero stddev, a single data point, NaN, Infinity, negative costs.
4. **Add security tests.** Slack payload signature validation, CloudTrail event authenticity, cross-account IAM revocation handling, remediation RBAC. This product executes `StopInstances` and `DeleteDBInstance` — security testing is non-negotiable.
5. **Expand the performance section to 10+ tests.** Burst ingestion, baseline calculation at scale, DynamoDB hot partitions, SQS FIFO throughput limits, Lambda cold starts. The current 2 tests give zero confidence in production readiness.

---

*This document needs a complete rewrite before it can guide TDD implementation. The scoring engine tests are a good start, but everything else is a placeholder.*
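To make Recommendation 3 concrete, a minimal sketch of an edge-case-safe baseline calculator and Z-score. All names (`BaselineState`, `update`, `zScore`) are hypothetical, plain assertions stand in for `fast-check` properties, and the zero-stddev and single-data-point behaviors shown are assumed policies for the team to confirm, not the product's actual ones:

```typescript
// Running statistics in Welford form: count, mean, and the sum of
// squared deviations (m2). Numerically stable for streaming updates.
interface BaselineState {
  count: number;
  mean: number;
  m2: number;
}

function emptyBaseline(): BaselineState {
  return { count: 0, mean: 0, m2: 0 };
}

// One Welford update step; avoids the catastrophic cancellation that a
// naive sum-of-squares implementation hits with large cost values.
function update(state: BaselineState, cost: number): BaselineState {
  const count = state.count + 1;
  const delta = cost - state.mean;
  const mean = state.mean + delta / count;
  const m2 = state.m2 + delta * (cost - mean);
  return { count, mean, m2 };
}

// Sample standard deviation; variance is undefined below two samples,
// so report 0 rather than dividing by zero (assumed policy).
function stddev(state: BaselineState): number {
  return state.count < 2 ? 0 : Math.sqrt(state.m2 / (state.count - 1));
}

// Z-score with an explicit guard for the flat-baseline case that
// Section 4 flags as untested (division by zero otherwise).
function zScore(state: BaselineState, cost: number): number {
  const sd = stddev(state);
  if (sd === 0) {
    // Assumed policy: on a flat baseline, no deviation scores 0 and
    // any deviation is maximally anomalous.
    return cost === state.mean ? 0 : Number.POSITIVE_INFINITY;
  }
  return (cost - state.mean) / sd;
}
```

A property-based suite would generate arbitrary cost streams and assert invariants (`m2 >= 0`, mean bounded by min/max of inputs, `zScore` finite whenever `stddev > 0`) instead of the fixed examples used here.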