Test Architecture & TDD Strategy Review: dd0c/portal
Reviewer: Senior TDD Consultant
Target: V1 MVP Test Architecture (test-architecture.md, epics.md, architecture.md)
This review provides a blunt, critical assessment of the proposed test architecture for the lightweight IDP. The current test plan has solid foundations in ownership inference and governance testing, but suffers from critical blind spots, architectural contradictions, and a test pyramid that will drown a solo founder in mock maintenance.
1. Coverage Analysis vs. Epics
While core discovery engines (Epics 1, 2) and ownership logic (Epic 4) are well-covered, several epics are virtually ignored in the test architecture:
- Epic 3 (Service Catalog) - Missing Integrations: Story 3.4 (PagerDuty/OpsGenie Integration) is entirely absent from the test plan. There are no tests for credential encryption, integration caching, or mapping PD schedules to teams.
- Epic 4 (Search Engine) - Missing Caching: Story 4.3 requires Redis prefix caching for <10ms Cmd+K responses. Section 4 has zero tests validating Redis cache hit/miss logic or cache invalidation on catalog updates.
- Epic 5 & 6 (UI & Dashboards) - Missing Frontend Tests: Section 5 (E2E) only tests API responses for the "Miracle" and Cmd+K. There is no mention of Vite/React component testing, testing the progressive disclosure UI (Story 5.4), or verifying dashboard KPI aggregation (Stories 6.1, 6.2).
- Epic 8 (Infrastructure) - Missing IaC Tests: No tests exist for the AWS CDK/CloudFormation deployments, including IAM least-privilege assertions for the customer role (Story 8.5).
- Epic 9 (Onboarding & PLG) - Critical Flow Untested: The most important business flows are untested. There are no tests for the Stripe webhook handling (Story 9.2) or the real-time WebSocket discovery progress (Story 9.4). You cannot ship a PLG motion without testing the payment and activation flows.
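As a concrete illustration of the missing Story 9.2 coverage, here is a minimal sketch of the kind of webhook test the plan should mandate: signature verification over the raw payload, then a billing-state transition. The handler names, secret, and state shape are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of a Stripe-style webhook handler test (Story 9.2).
# Handler names and state shape are assumptions; the secret is test-only.
import hashlib
import hmac
import json

WEBHOOK_SECRET = b"whsec_test"  # test fixture, never a real credential

def verify_signature(payload: bytes, signature: str) -> bool:
    """Constant-time check of an HMAC-SHA256 signature over the raw payload."""
    expected = hmac.new(WEBHOOK_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def handle_webhook(payload: bytes, signature: str, billing_state: dict) -> dict:
    """Apply a checkout event to tenant billing state, rejecting bad signatures."""
    if not verify_signature(payload, signature):
        raise PermissionError("invalid webhook signature")
    event = json.loads(payload)
    if event["type"] == "checkout.session.completed":
        billing_state[event["data"]["tenant_id"]] = "active"
    return billing_state

# Test: a correctly signed checkout.session.completed event activates the tenant.
payload = json.dumps(
    {"type": "checkout.session.completed", "data": {"tenant_id": "t-1"}}
).encode()
sig = hmac.new(WEBHOOK_SECRET, payload, hashlib.sha256).hexdigest()
state = handle_webhook(payload, sig, {})
assert state == {"t-1": "active"}
```

Even this small a test would catch the two failure modes Stripe integrations actually hit: unverified payloads and state transitions applied to the wrong tenant.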
2. TDD Workflow Critique
The stated TDD strategy correctly identifies ownership inference (Section 3.4) and reconciliation (Section 3.3) as the "Strict TDD" targets. This is the correct move—the domain logic here is pure computation, highly risky, and easily testable without side effects.
However, the "Integration-led" workflow for the scanners (Section 1.2) is contradictory in practice given the sheer volume of unit tests planned for them (Section 3.1, 3.2). Writing 90 unit tests using moto or responses for the AWS and GitHub scanners is not "Integration-led," it is mock-led.
For a solo founder, Red-Green-Refactor against synthetic API responses for third-party systems is a recipe for maintenance nightmares. If the AWS API contract changes or the GitHub GraphQL schema evolves, the mocked unit tests will stay green while production burns.
3. Test Pyramid Balance
The 70/20/10 ratio (300 unit tests / 85 integration / 15 E2E) is the wrong shape for an IDP.
dd0c/portal is fundamentally a data integration engine (gluing AWS, GitHub, PostgreSQL, and Meilisearch). The pyramid should be an Integration Honeycomb.
- Scanners (Epics 1 & 2): Shift the balance. Ditch the 90+ scanner unit tests using moto. Use real LocalStack/WireMock integration tests as the primary verification mechanism. Testing an AWS client wrapper against a mock AWS library only proves you can mock boto3.
- The Database (Epic 3): The API relies heavily on PostgreSQL for search and tenant isolation (RLS). Mocking the DB in unit tests is useless here. The 30+ Catalog API tests should hit Testcontainers directly.
- Recommendation: Aim for a 30/60/10 ratio. Keep the 30% unit tests strictly for the Reconciliation and Ownership Inference algorithms. Push the 60% integration tests to the boundaries (Scanners -> LocalStack/WireMock, API -> Postgres/Redis/Meilisearch).
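To make the tenant-isolation point concrete, this is the shape the RLS assertion should take. In the real suite the session would come from a Testcontainers-managed Postgres with row-level security enabled; here an in-memory stub stands in so the assertion shape is visible, and every name is an assumption.

```python
# Illustrative shape of a tenant-isolation (RLS) test. A Testcontainers-backed
# Postgres session is replaced by an in-memory stub here; names are assumptions.
class TenantScopedCatalog:
    """Mimics a Postgres session with RLS: every query is tenant-filtered."""
    def __init__(self, rows):
        self._rows = rows
    def services_for(self, tenant_id):
        return [r for r in self._rows if r["tenant_id"] == tenant_id]

rows = [
    {"tenant_id": "t-1", "name": "payments-api"},
    {"tenant_id": "t-2", "name": "search-worker"},
]
catalog = TenantScopedCatalog(rows)

def test_tenant_cannot_see_other_tenants_services():
    visible = catalog.services_for("t-1")
    assert [r["name"] for r in visible] == ["payments-api"]
    assert all(r["tenant_id"] == "t-1" for r in visible)

test_tenant_cannot_see_other_tenants_services()
```

The point of running this against a real Postgres container rather than the stub is exactly the review's argument: RLS policies are database behavior, and mocking the database asserts nothing about them.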
4. Anti-Patterns Flagged
Anti-Pattern 1: E2E Tests aren't End-to-End (Section 5.4)
Testing the "5-Minute Miracle" via docker-compose.e2e.yml with LocalStack and WireMock is just an over-bloated integration test. True E2E requires a dedicated AWS Sandbox account and a live GitHub Test Org to validate real IAM policies, real GitHub App permissions, real rate limiting, and actual WebSocket streams. Mocks hide integration drift.
Anti-Pattern 2: Over-Reliance on Synthetic Factories (Section 9.2)
The use of make_aws_service() and make_github_repo() factories that generate fake.word() is dangerous. Generating random structure instead of capturing real API payloads leads to tests passing on fake data that real AWS/GitHub APIs never return. VCR/record-replay patterns are mandatory for this product.
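The record-replay mechanism behind tools like vcrpy can be sketched in a few lines: capture a real payload once, then serve exactly that payload in tests, so fixtures can never drift into shapes the real API would not return. This stub only illustrates the pattern; the Cassette class and URLs are assumptions, not vcrpy's actual API.

```python
# Minimal sketch of the record-replay idea behind VCR-style testing.
# Class and method names are illustrative assumptions, not vcrpy's API.
class Cassette:
    def __init__(self, store: dict):
        self._store = store  # url -> recorded JSON body

    def record(self, url: str, live_fetch):
        """First run: hit the real API once and persist the exact payload."""
        if url not in self._store:
            self._store[url] = live_fetch(url)
        return self._store[url]

    def replay(self, url: str):
        """Test runs: serve the recorded payload, never synthetic data."""
        return self._store[url]

# Stand-in for a one-time live GitHub API call during cassette recording.
def fake_live_fetch(url):
    return {"name": "portal", "default_branch": "main", "archived": False}

store = {}
cassette = Cassette(store)
url = "https://api.github.com/repos/dd0c/portal"
recorded = cassette.record(url, fake_live_fetch)
replayed = cassette.replay(url)
assert replayed == recorded  # tests consume the real shape, not fake.word()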
Anti-Pattern 3: State Corruption on Failed Scans (Section 3.3)
The test test_marks_previously_discovered_service_as_stale_when_missing implies a destructive update. If a GitHub GraphQL query times out and returns a partial repo list, does the reconciler mark the rest of the org's services as stale? The test architecture doesn't define resilience against partial API failures.
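The resilience property being asked for can be stated as a small executable sketch: a reconciler that refuses to mark services stale unless the scan reports itself complete. The reconciler API below is an assumption for illustration, not the project's actual code.

```python
# Hedged sketch: staleness marking guarded by scan completeness.
# The reconcile() signature and result shape are assumptions.
def reconcile(catalog: dict, scan_result: dict) -> dict:
    """Mark known services stale only if a *complete* scan omitted them."""
    if not scan_result["complete"]:
        return catalog  # partial scan: never destructively mutate
    seen = set(scan_result["services"])
    return {name: ("active" if name in seen else "stale") for name in catalog}

catalog = {"payments-api": "active", "search-worker": "active"}

# Partial scan (GraphQL timeout halfway through): nothing goes stale.
partial = reconcile(catalog, {"complete": False, "services": ["payments-api"]})
assert partial == catalog

# Complete scan genuinely missing a service: staleness is now legitimate.
full = reconcile(catalog, {"complete": True, "services": ["payments-api"]})
assert full["search-worker"] == "stale"
```

A test suite built around this property would have surfaced the destructive-update risk in test_marks_previously_discovered_service_as_stale_when_missing immediately.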
5. Transparent Factory Gaps
Section 8 attempts to map the Transparent Factory tenets to the test architecture, but it exposes a critical, foundational flaw in the product's architecture.
The Elastic Schema Contradiction (Section 8.2 vs. Epic 10.2):
Epic 10.2 explicitly mandates: "all DynamoDB catalog schema changes... DynamoDB Single Table Design".
However, architecture.md and Test Architecture Section 8.2 explicitly build around PostgreSQL (Aurora Serverless v2) and test for drop_column in .sql migrations.
- The test architecture tests a SQL database.
- The epic demands a NoSQL database.

This is a fatal misalignment. You must decide whether the core catalog is relational or single-table, and align the elastic schema testing accordingly. The current SQL tests in Section 8.2 do nothing to enforce Epic 10.2's DynamoDB rules.
Beyond this, Section 8.1 (Atomic Flagging) covers the phantom quarantine circuit breaker, but doesn't test the TTL expiration logic or OpenFeature integration. The CI check blocking expired flags (Story 10.1) is untested.
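The missing CI gate for Story 10.1 is cheap to specify. A hedged sketch, assuming a simple flag registry with an expiry date per flag (the registry format and flag names are invented for illustration):

```python
# Sketch of a CI check that blocks the build on expired feature flags
# (Story 10.1). The registry format and flag names are assumptions.
from datetime import date

FLAGS = {
    "phantom-quarantine": {"expires": date(2026, 1, 1)},
    "new-search-ranking": {"expires": date(2024, 6, 1)},
}

def expired_flags(flags: dict, today: date) -> list[str]:
    """Return flags whose TTL has lapsed; CI fails if this list is non-empty."""
    return sorted(name for name, meta in flags.items() if meta["expires"] < today)

stale = expired_flags(FLAGS, today=date(2025, 1, 15))
assert stale == ["new-search-ranking"]  # CI would fail, naming this flag
```

Wiring this into CI as a hard failure is a one-afternoon task, which makes its absence from the test architecture harder to excuse.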
6. Performance Test Gaps
Section 6.1 (Discovery Scan Benchmarks) is insufficient for the target persona. A 500+ repo org will routinely have >10,000 AWS resources (CFN Stacks, Lambdas, ECS tasks, IAM roles, API Gateways, ALBs) across multiple regions.
The benchmark aws_scan_completes_within_3min_for_500_resources measures a toy account, not a 500-repo org. Validating the 5-Minute Miracle only at this scale guarantees an architectural collapse at 10x the size, due to:
- Step Functions Payload Limits (256KB max event size between states).
- API Gateway WebSocket limits (2-hour maximum connection duration, 10-minute idle timeout, 29-second integration timeout).
- Lambda execution timeouts for massive pagination operations across 10+ regions.
Performance tests must validate the pipeline's limits up to 10,000 resources.
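One of those limits can be enforced with a trivially cheap guard test: assert that the state handed between Step Functions steps stays a pointer (e.g. an S3 key) rather than the full resource list, even at 10,000 resources. The payload shape below is an assumption for illustration.

```python
# Sketch of a limit-guard test for the 256 KiB Step Functions payload ceiling.
# The state payload shape and S3 key are illustrative assumptions.
import json

STEP_FUNCTIONS_PAYLOAD_LIMIT = 256 * 1024  # bytes

def build_state_payload(resource_count: int) -> dict:
    """Pass a reference to the scan results, never the results themselves."""
    return {
        "scan_id": "scan-123",
        "resource_count": resource_count,
        "results_s3_key": "scans/scan-123/resources.json",  # pointer, not data
    }

payload = build_state_payload(resource_count=10_000)
size = len(json.dumps(payload).encode())
assert size < STEP_FUNCTIONS_PAYLOAD_LIMIT
```

A guard like this turns an AWS hard limit into a failing test long before a large tenant turns it into a production incident.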
7. Missing Test Scenarios
The test architecture neglects the harsh reality of distributed systems and external APIs.
- GitHub API Rate Limits (Concurrency): Section 3.2 mentions handling Retry-After. However, the GitHub GraphQL API rate limit is point-based and scoped per installation. If 5 tenants under the same GitHub App installation run discovery concurrently, or 2 tenants with 1,000 repos each run at the same time, the GitHub GraphQL API will throttle severely. The test architecture has no cross-tenant concurrent throttling tests.
- Stale Service Detection (Partial Failures): Section 3.3 handles services that go missing, but what if the GitHub GraphQL scanner completes only 50% of the query due to a timeout or a 500 error? Does the system mark the other 50% as stale/deleted? There must be tests validating that partial discovery scans do not destructively mutate the catalog.
- Ownership Conflicts: Section 3.4 tests ambiguous ownership, but misses the scenario where multiple services claim the same repository through conflicting tags or deployment workflows.
- Concurrent Discovery Scans (Same Tenant): If a user clicks "Rescan" 5 times in 10 seconds, does the system spawn 5 concurrent Step Functions executions? Does it queue them? The test_concurrent_scans_for_different_tenants_dont_conflict test doesn't cover the same-tenant race condition.
- Meilisearch Index Corruption: What happens when the SQS -> Meilisearch sync Lambda fails to map a new document structure? The system needs a test validating index rebuilding and handling of mapping errors without disrupting search for existing services.
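The same-tenant race can be pinned down with a coalescing scheduler test: repeat "Rescan" clicks should fold into the single in-flight scan rather than fan out into five executions. The scheduler API here is an assumption sketched for illustration.

```python
# Sketch of the missing same-tenant concurrency test: rapid repeat rescans
# coalesce into one in-flight scan. The ScanScheduler API is an assumption.
class ScanScheduler:
    def __init__(self):
        self._in_flight = set()
        self.started = 0

    def request_scan(self, tenant_id: str) -> bool:
        """Start a scan unless one is already running for this tenant."""
        if tenant_id in self._in_flight:
            return False  # coalesced into the already-running scan
        self._in_flight.add(tenant_id)
        self.started += 1
        return True

    def finish(self, tenant_id: str):
        self._in_flight.discard(tenant_id)

scheduler = ScanScheduler()
results = [scheduler.request_scan("t-1") for _ in range(5)]  # 5 clicks in 10s
assert results == [True, False, False, False, False]
assert scheduler.started == 1  # exactly one Step Functions execution
```

Whether the product ultimately coalesces, queues, or rejects duplicate requests matters less than having a test that forces the decision to be made explicitly.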
8. Recommendations (Top 5 Prioritized)
Don't be nice to the initial plan. This test architecture is over-engineered where it shouldn't be and under-tests the riskiest parts of the product.
- Resolve the Database Misalignment (Epics 3.1 & 10.2 vs. Section 8.2): Decide immediately whether the core catalog is PostgreSQL (Aurora) or DynamoDB. If DynamoDB, rewrite Test Architecture Sections 8.2, 3.5, and 4.3 in their entirety. If PostgreSQL, update Epic 10.2 to test SQL migrations (add column, create table) and drop the DynamoDB single-table references entirely.
- Invert the Test Pyramid (Section 2.1): A 70% unit test ratio built on moto and responses for integration-heavy glue code is an anti-pattern. Shift to an Integration Honeycomb: drop 50% of the scanner unit tests and replace them with integration tests using VCR (for GitHub) and LocalStack (for AWS). Mocking external APIs in unit tests will break the 5-Minute Miracle on day one of a real schema change.
- Mandate Real E2E Infrastructure (Section 5.4): Replace LocalStack in the "5-Minute Miracle" (Section 5.1) with a dedicated AWS Sandbox account and a real GitHub Org. Testing the PLG motion against WireMock is a glorified integration test; the real risks (IAM role assumption, GraphQL schema, real STS) stay hidden behind docker-compose.e2e.yml.
- Cover the Business-Critical Missing Epics (Epics 9, 3, 5): Add explicit test suites for:
- Stripe Checkout webhooks & billing state (Story 9.2).
- WebSocket real-time progress streams (Story 9.4).
- PagerDuty schedule mapping and credentials (Story 3.4).
- Vite/React frontend tests for Service Cards & Dashboards (Stories 5.4, 6.1).
- Scale Performance Benchmarks 10x (Section 6.1): Benchmarking 500 resources is completely inadequate for the target persona. Mandate tests validating discovery against 10,000 AWS resources and 1,000 GitHub repos to test Step Functions payload constraints, API Gateway limits, and concurrent rate limiting.