dd0c/drift — Test Architecture & TDD Strategy

Author: Max Mayfield (Test Architect)
Date: February 28, 2026
Product: dd0c/drift — IaC Drift Detection & Remediation SaaS
Status: Test Architecture Design Document

Section 1: Testing Philosophy & TDD Workflow

1.1 Core Philosophy

dd0c/drift is a security-critical product. A missed drift event or a false positive in the remediation engine can cause real infrastructure damage. The testing strategy reflects this: correctness is non-negotiable; speed is a constraint, not a goal.

Three principles guide every testing decision:

Tests are the first customer. Before writing a single line of production code, the test defines the contract. If you can't write a test for it, you don't understand the requirement well enough to build it.
The secret scrubber and RLS are untouchable. These two components — the agent's secret scrubbing engine and the SaaS's PostgreSQL Row-Level Security — have 100% test coverage requirements. No exceptions. A bug in either is a trust-destroying incident.
Drift detection logic is pure functions. The comparator, scorer, and classifier take inputs and return outputs with no side effects. This makes them trivially testable and means the test suite runs fast even at high coverage.

1.2 Red-Green-Refactor Adapted for dd0c/drift

The standard TDD cycle applies, but with domain-specific adaptations:

RED   → Write a failing test that describes a drift scenario
         e.g., "security group ingress rule added to 0.0.0.0/0 → severity: critical"

GREEN → Write the minimum code to make it pass
         e.g., add the classification rule to the YAML config + evaluator

REFACTOR → Clean up without breaking the test
            e.g., extract the CIDR check into a reusable predicate
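
The refactor step above can be sketched as a small predicate. This is a minimal illustration, not the project's code: the name isPublicCIDR and the choice to treat malformed input as "not public" are assumptions.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isPublicCIDR (hypothetical name) reports whether cidr covers every address,
// i.e. 0.0.0.0/0 for IPv4 or ::/0 for IPv6. Malformed input is treated as
// not public here; a stricter version might return the parse error instead.
func isPublicCIDR(cidr string) bool {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false
	}
	// A prefix length of zero matches all addresses of that family.
	return prefix.Bits() == 0
}

func main() {
	for _, c := range []string{"0.0.0.0/0", "::/0", "10.0.0.0/8", "not-a-cidr"} {
		fmt.Printf("%-12s public=%v\n", c, isPublicCIDR(c))
	}
}
```

Because the predicate has no dependency on the classifier, the security group rule and any future rule for other resource types can share it.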

When to write tests first (strict TDD):

All drift detection logic (comparator, classifier, scorer)
Secret scrubbing engine — write tests for every secret pattern BEFORE writing the regex
API request/response contracts — write schema validation tests before implementing handlers
Remediation policy evaluation — write policy enforcement tests before the engine
Feature flag evaluation logic (Epic 10.1)

When integration tests lead (test-after acceptable):

AWS SDK wiring (agent ↔ EC2/IAM/RDS describe calls) — mock the SDK first, integration test confirms the wiring
DynamoDB persistence — write the schema, then integration tests against DynamoDB Local
Slack Block Kit formatting — render the block, visually verify, then snapshot test
CI/CD pipeline configuration — validate by running it, not by unit testing YAML

When E2E tests lead:

Onboarding flow (drift init → drift check → Slack alert) — the happy path must work end-to-end before any unit tests are written for the CLI
Remediation round-trip (Slack button → agent apply → resolution) — too many moving parts to unit test first

1.3 Test Naming Conventions

Go (Agent, State Manager):

// Pattern: Test<Component>_<Scenario>_<ExpectedOutcome>
func TestDriftComparator_SecurityGroupIngressAdded_ReturnsCriticalDrift(t *testing.T)
func TestSecretScrubber_PasswordAttribute_ReturnsRedacted(t *testing.T)
func TestStateParser_V4Format_ExtractsManagedResources(t *testing.T)

// Table-driven test naming: use descriptive name field
tests := []struct {
    name string
    // ...
}{
    {name: "security group with public CIDR → critical"},
    {name: "tag-only change → low severity"},
    {name: "IAM policy document changed → high severity"},
}

TypeScript (SaaS, Dashboard API):

// Pattern: describe("<Component>") > describe("<method/scenario>") > it("<expected behavior>")
describe("DriftClassifier", () => {
  describe("classify()", () => {
    it("returns critical severity for security group with 0.0.0.0/0 ingress")
    it("returns low severity for tag-only changes")
    it("falls back to medium/configuration for unmatched resource types")
  })
})

Integration & E2E:

// File naming: <component>_integration_test.go / <flow>_e2e_test.go
agent_dynamodb_integration_test.go
drift_report_ingestion_integration_test.go
onboarding_flow_e2e_test.go
remediation_roundtrip_e2e_test.go

Section 2: Test Pyramid

2.1 Recommended Ratio

         ┌─────────────────┐
         │   E2E / Smoke   │  ~10%  (~50 tests)
         │  (LocalStack,   │
         │   real flows)   │
         ├─────────────────┤
         │  Integration    │  ~20%  (~100 tests)
         │  (boundaries,   │
         │   real DBs)     │
         ├─────────────────┤
         │   Unit Tests    │  ~70%  (~350 tests)
         │  (pure logic,   │
         │   fast, mocked) │
         └─────────────────┘

Target: ~500 tests total at V1 launch, growing to ~1,000 by month 3.

2.2 Unit Test Targets (Per Component)

Component	Language	Target Coverage	Key Test Count
State Parser (TF v4)	Go	95%	~40 tests
Drift Comparator	Go	95%	~60 tests
Drift Classifier	Go	90%	~30 tests
Secret Scrubber	Go	100%	~50 tests
Drift Scorer	Go/TS	90%	~20 tests
Event Processor (ingestion)	TypeScript	85%	~30 tests
Notification Formatter	TypeScript	85%	~25 tests
Remediation Engine	TypeScript	85%	~30 tests
Dashboard API handlers	TypeScript	80%	~40 tests
Feature Flag evaluator	Go	90%	~20 tests
Policy engine	Go/TS	95%	~30 tests

2.3 Integration Test Boundaries

Boundary	Test Type	Infrastructure
Agent ↔ AWS EC2/IAM/RDS APIs	Integration	LocalStack or recorded HTTP fixtures
Agent ↔ SaaS API (drift report POST)	Integration	Real HTTP server (test instance)
Event Processor ↔ DynamoDB	Integration	DynamoDB Local (Testcontainers)
Event Processor ↔ PostgreSQL	Integration	PostgreSQL (Testcontainers)
Event Processor ↔ SQS	Integration	LocalStack SQS
Notification Service ↔ Slack API	Integration	Slack API mock server
Remediation Engine ↔ Agent	Integration	Agent stub server
Dashboard API ↔ PostgreSQL (RLS)	Integration	PostgreSQL (Testcontainers) — multi-tenant isolation tests

2.4 E2E / Smoke Test Scenarios

Scenario	Priority	Infrastructure
Install agent → run `drift check` → detect drift → Slack alert	P0	LocalStack + Slack mock
Agent heartbeat → SaaS records it → dashboard shows "online"	P0	LocalStack
Click [Revert] in Slack → agent executes terraform apply → event resolved	P0	LocalStack + agent stub
Click [Accept] → GitHub PR created with code patch	P1	GitHub API mock
Free tier stack limit enforcement (register 2nd stack → 403)	P1	Real SaaS test env
Secret scrubbing end-to-end (state with password → report has [REDACTED])	P0	Agent + SaaS test env
Multi-tenant isolation (org A cannot see org B drift events)	P0	PostgreSQL + RLS
Agent offline detection (no heartbeat → Slack "agent offline" alert)	P1	LocalStack

Section 3: Unit Test Strategy (Per Component)

3.1 State Parser (Go — Epic 1, Story 1.1)

What to test:

Correct extraction of managed resources (skip data sources)
Module-prefixed addresses (module.vpc.aws_security_group.api)
Multi-instance resources (aws_instance.worker[0], aws_instance.worker[1])
Graceful handling of unknown/future resource types
Rejection of non-v4 state format versions
Empty state file (zero resources)
State file with only data sources (zero managed resources)
private field stripped from all instances before returning

Key test cases:

func TestStateParser_V4Format_ExtractsManagedResources(t *testing.T) {}
func TestStateParser_DataSourceResources_AreExcluded(t *testing.T) {}
func TestStateParser_ModulePrefixedAddress_ParsedCorrectly(t *testing.T) {}
func TestStateParser_MultiInstanceResource_AllInstancesExtracted(t *testing.T) {}
func TestStateParser_UnsupportedVersion_ReturnsError(t *testing.T) {}
func TestStateParser_EmptyState_ReturnsEmptyResourceList(t *testing.T) {}
func TestStateParser_PrivateField_IsStrippedFromAttributes(t *testing.T) {}

Mocking strategy: None — pure function over a JSON byte slice. Fixtures in testdata/states/.

Table-driven pattern:

func TestStateParser_ResourceExtraction(t *testing.T) {
    tests := []struct {
        name          string
        fixtureFile   string
        wantCount     int
        wantAddresses []string
        wantErr       bool
    }{
        {name: "single managed resource", fixtureFile: "testdata/states/single_sg.tfstate", wantCount: 1},
        {name: "state v3 format", fixtureFile: "testdata/states/v3_format.tfstate", wantErr: true},
        {name: "module-nested resources", fixtureFile: "testdata/states/module_nested.tfstate", wantCount: 5},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            data, err := os.ReadFile(tt.fixtureFile)
            require.NoError(t, err, "fixture must be readable")
            got, err := ParseState(data)
            if tt.wantErr { require.Error(t, err); return }
            require.NoError(t, err)
            assert.Len(t, got.Resources, tt.wantCount)
        })
    }
}
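
For orientation, a parser that would satisfy the cases listed above might look roughly like this. It is a sketch under assumptions: the type names and error messages are hypothetical, and only the Terraform v4 state fields the tests touch are modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical decode targets; only the v4 fields the parser needs are modelled.
type rawState struct {
	Version   int           `json:"version"`
	Resources []rawResource `json:"resources"`
}

type rawResource struct {
	Module    string        `json:"module"`
	Mode      string        `json:"mode"`
	Type      string        `json:"type"`
	Name      string        `json:"name"`
	Instances []rawInstance `json:"instances"`
}

// The "private" field is deliberately not modelled, so it is dropped on decode.
type rawInstance struct {
	IndexKey   any            `json:"index_key"`
	Attributes map[string]any `json:"attributes"`
}

type Resource struct {
	Address    string
	Attributes map[string]any
}

type ParsedState struct{ Resources []Resource }

// ParseState extracts managed resources from a v4 state file.
func ParseState(data []byte) (*ParsedState, error) {
	var raw rawState
	if err := json.Unmarshal(data, &raw); err != nil {
		return nil, fmt.Errorf("decode state: %w", err)
	}
	if raw.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", raw.Version)
	}
	out := &ParsedState{}
	for _, r := range raw.Resources {
		if r.Mode != "managed" {
			continue // data sources are not drift candidates
		}
		base := r.Type + "." + r.Name
		if r.Module != "" {
			base = r.Module + "." + base
		}
		for _, inst := range r.Instances {
			addr := base
			switch k := inst.IndexKey.(type) {
			case float64: // count index; JSON numbers decode as float64
				addr = fmt.Sprintf("%s[%d]", base, int(k))
			case string: // for_each key
				addr = fmt.Sprintf("%s[%q]", base, k)
			}
			out.Resources = append(out.Resources, Resource{Address: addr, Attributes: inst.Attributes})
		}
	}
	return out, nil
}

func main() {
	state := []byte(`{"version":4,"resources":[
	  {"mode":"data","type":"aws_ami","name":"ubuntu","instances":[{"attributes":{"id":"ami-1"}}]},
	  {"module":"module.vpc","mode":"managed","type":"aws_security_group","name":"api",
	   "instances":[{"attributes":{"id":"sg-1"},"private":"opaque"}]}]}`)
	parsed, err := ParseState(state)
	if err != nil {
		panic(err)
	}
	for _, r := range parsed.Resources {
		fmt.Println(r.Address)
	}
}
```

Leaving private out of the decode struct means the field never enters memory as a parsed value, which is one way to meet the "stripped before returning" requirement without a separate pass.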

3.2 Drift Comparator (Go — Epic 1, Story 1.3)

What to test:

Attribute added in cloud (not in state) → drift detected
Attribute removed from cloud (in state, not in cloud) → drift detected
Attribute value changed → correct old/new values in diff
Attribute unchanged → no drift
Nested attribute changes (ingress rules array)
Ignored attributes (AWS-generated IDs, timestamps, computed fields) → no drift
Null vs. empty string → treated as no drift
Boolean drift (true → false)
Numeric drift (port numbers, counts)

Key test cases:

func TestDriftComparator_AttributeAdded_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_AttributeRemoved_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_AttributeUnchanged_ReturnsNoDrift(t *testing.T) {}
func TestDriftComparator_NestedIngressRuleAdded_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_IgnoredAttribute_ReturnsNoDrift(t *testing.T) {}
func TestDriftComparator_NullVsEmptyString_TreatedAsNoDrift(t *testing.T) {}
func TestDriftComparator_ComputedTimestamp_IsIgnored(t *testing.T) {}

Mocking strategy: None — pure function. State and cloud attributes are both map[string]interface{}.
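
A comparator with the behaviours listed above could be sketched as follows. The names CompareAttributes and AttributeDiff are hypothetical, and the sketch assumes both maps were decoded from JSON, so numbers on both sides are float64.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// AttributeDiff (hypothetical) records one drifted attribute.
type AttributeDiff struct {
	Key      string
	Old, New any
}

// normalize maps "" to nil so that null vs. empty string is not reported as
// drift. A missing key also reads as nil, so missing vs. "" is no drift either.
func normalize(v any) any {
	if s, ok := v.(string); ok && s == "" {
		return nil
	}
	return v
}

// CompareAttributes returns the attributes that differ between state and
// cloud, skipping keys in ignored. Output is sorted by key so results are
// deterministic and snapshot-friendly.
func CompareAttributes(state, cloud map[string]any, ignored map[string]bool) []AttributeDiff {
	keys := map[string]struct{}{}
	for k := range state {
		keys[k] = struct{}{}
	}
	for k := range cloud {
		keys[k] = struct{}{}
	}
	var diffs []AttributeDiff
	for k := range keys {
		if ignored[k] {
			continue
		}
		// DeepEqual handles nested slices and maps such as ingress rule arrays.
		if !reflect.DeepEqual(normalize(state[k]), normalize(cloud[k])) {
			diffs = append(diffs, AttributeDiff{Key: k, Old: state[k], New: cloud[k]})
		}
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Key < diffs[j].Key })
	return diffs
}

func main() {
	state := map[string]any{"description": "", "ingress": []any{}, "arn": "arn:old"}
	cloud := map[string]any{
		"description": nil,
		"ingress":     []any{map[string]any{"cidr": "0.0.0.0/0", "port": float64(22)}},
		"arn":         "arn:new",
	}
	for _, d := range CompareAttributes(state, cloud, map[string]bool{"arn": true}) {
		fmt.Printf("%s: %v -> %v\n", d.Key, d.Old, d.New)
	}
}
```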

3.3 Drift Classifier (Go — Epic 3, Story 3.2)

What to test:

Security group with 0.0.0.0/0 ingress → critical/security
IAM role policy document changed → high/security
RDS parameter group changed → high/configuration
Tag-only change → low/tags
Unmatched resource type → medium/configuration (default fallback)
Customer override rules take precedence over defaults
Rule evaluation order (first match wins)
Invalid YAML config → error at startup, not at classification time

func TestDriftClassifier_PublicCIDRIngress_ReturnsCriticalSecurity(t *testing.T) {}
func TestDriftClassifier_IAMPolicyChanged_ReturnsHighSecurity(t *testing.T) {}
func TestDriftClassifier_TagOnlyChange_ReturnsLowTags(t *testing.T) {}
func TestDriftClassifier_UnmatchedResource_ReturnsMediumConfiguration(t *testing.T) {}
func TestDriftClassifier_CustomerOverride_TakesPrecedence(t *testing.T) {}
func TestDriftClassifier_InvalidYAML_ReturnsErrorOnLoad(t *testing.T) {}
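
The precedence and fallback rules above can be illustrated without the YAML layer. In this sketch the rules are plain Go values; the type names, the Drift fields, and the example customer rule are all hypothetical.

```go
package main

import "fmt"

// Drift is a hypothetical, simplified view of one detected change.
type Drift struct {
	ResourceType string
	Attribute    string
	NewValue     string
}

type Classification struct{ Severity, Category string }

type Rule struct {
	Name   string
	Match  func(Drift) bool
	Result Classification
}

type Classifier struct{ rules []Rule }

// NewClassifier places customer rules ahead of the defaults. Combined with
// first-match-wins evaluation, a customer override always takes precedence.
func NewClassifier(customer, defaults []Rule) *Classifier {
	rules := append(append([]Rule{}, customer...), defaults...)
	return &Classifier{rules: rules}
}

// Classify returns the result of the first matching rule, or the
// medium/configuration fallback when nothing matches.
func (c *Classifier) Classify(d Drift) Classification {
	for _, r := range c.rules {
		if r.Match(d) {
			return r.Result
		}
	}
	return Classification{Severity: "medium", Category: "configuration"}
}

var defaultRules = []Rule{
	{
		Name: "public-ingress",
		Match: func(d Drift) bool {
			return d.ResourceType == "aws_security_group" && d.Attribute == "ingress" && d.NewValue == "0.0.0.0/0"
		},
		Result: Classification{Severity: "critical", Category: "security"},
	},
	{
		Name:   "tag-only",
		Match:  func(d Drift) bool { return d.Attribute == "tags" },
		Result: Classification{Severity: "low", Category: "tags"},
	},
}

func main() {
	c := NewClassifier(nil, defaultRules)
	fmt.Println(c.Classify(Drift{ResourceType: "aws_security_group", Attribute: "ingress", NewValue: "0.0.0.0/0"}))
	fmt.Println(c.Classify(Drift{ResourceType: "aws_lambda_function", Attribute: "timeout", NewValue: "30"}))
}
```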

3.4 Secret Scrubber (Go — Epic 1, Story 1.4) — 100% Coverage Required

Every secret pattern is a security requirement. No table-driven shortcuts — each pattern gets its own named test.

Key test cases:

func TestSecretScrubber_PasswordKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_SecretKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_TokenKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PrivateKeyKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_ConnectionStringKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_AWSAccessKeyPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PostgresURIPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PEMPrivateKeyPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_JWTTokenPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_SensitiveFlag_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PrivateField_IsStrippedEntirely(t *testing.T) {}
func TestSecretScrubber_NonSensitiveAttribute_PreservesValue(t *testing.T) {}
func TestSecretScrubber_NestedSensitiveKey_RedactsNestedValue(t *testing.T) {}
func TestSecretScrubber_ArrayWithSensitiveValues_AllElementsChecked(t *testing.T) {}
func TestSecretScrubber_RedactedPlaceholder_IsLiteralREDACTEDString(t *testing.T) {}
func TestSecretScrubber_DiffStructureIntact_AfterScrubbing(t *testing.T) {}
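
As a rough picture of what these tests pin down, a scrubber could combine key-name matching, value-pattern matching, and recursion into nested maps and arrays. The key list and regular expressions below are illustrative assumptions, not the product's actual pattern set.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const redacted = "[REDACTED]"

// Illustrative patterns only. They are compiled once at package init,
// not on every call.
var (
	sensitiveKeys = []string{"password", "secret", "token", "private_key", "connection_string"}
	valuePatterns = []*regexp.Regexp{
		regexp.MustCompile(`AKIA[0-9A-Z]{16}`),                   // AWS access key ID
		regexp.MustCompile(`postgres(ql)?://\S+`),                // Postgres URI
		regexp.MustCompile(`-----BEGIN [A-Z ]*PRIVATE KEY-----`), // PEM private key
	}
)

func isSensitiveKey(key string) bool {
	k := strings.ToLower(key)
	for _, s := range sensitiveKeys {
		if strings.Contains(k, s) {
			return true
		}
	}
	return false
}

// scrubValue recurses into maps and arrays and redacts strings that match
// a value pattern.
func scrubValue(v any) any {
	switch val := v.(type) {
	case map[string]any:
		return ScrubSecrets(val)
	case []any:
		out := make([]any, len(val))
		for i, e := range val {
			out[i] = scrubValue(e)
		}
		return out
	case string:
		for _, p := range valuePatterns {
			if p.MatchString(val) {
				return redacted
			}
		}
	}
	return v
}

// ScrubSecrets returns a scrubbed copy; the input map is not modified.
func ScrubSecrets(in map[string]any) map[string]any {
	out := make(map[string]any, len(in))
	for k, v := range in {
		switch {
		case k == "private":
			continue // stripped entirely, not redacted
		case isSensitiveKey(k):
			out[k] = redacted
		default:
			out[k] = scrubValue(v)
		}
	}
	return out
}

func main() {
	diff := map[string]any{
		"master_password": "supersecret123",
		"instance_class":  "db.t3.large",
		"endpoint":        "postgres://user:pw@host/db",
	}
	fmt.Println(ScrubSecrets(diff))
}
```

Note that in this sketch a sensitive key whose value is itself a map is replaced wholesale by the placeholder; whether that satisfies the "diff structure intact" test is a decision for the real implementation.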

3.5 Drift Scorer (TypeScript — Epic 3, Story 3.4)

describe("DriftScorer", () => {
  it("returns 100 for a stack with no drift")
  it("applies heavy penalty for critical severity drift")
  it("applies minimal penalty for low severity drift")
  it("produces weighted score for mixed severity drift")
  it("recalculates upward when drift event is resolved")
  it("handles zero-resource stack without divide-by-zero")
  it("caps score at 0 for catastrophically drifted stacks")
})
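
The scoring behaviours named above can be made concrete with a small worked formula. This sketch is in Go for consistency with the other examples even though the scorer itself is TypeScript; the penalty weights and the formula (100 minus summed penalty points divided by resource count, clamped at 0) are assumptions chosen only to exercise the listed cases.

```go
package main

import "fmt"

// Hypothetical weights: penalty points per open drift event, before dividing
// by the number of resources in the stack.
var penaltyPoints = map[string]int{"critical": 500, "high": 200, "medium": 100, "low": 20}

// Score returns a 0-100 health score for a stack. Resolving a drift event
// simply removes it from openDriftSeverities, so the next call scores higher.
func Score(totalResources int, openDriftSeverities []string) int {
	if totalResources <= 0 {
		return 100 // nothing to drift; also avoids dividing by zero
	}
	total := 0
	for _, s := range openDriftSeverities {
		total += penaltyPoints[s] // unknown severities contribute 0 in this sketch
	}
	score := 100 - total/totalResources
	if score < 0 {
		return 0
	}
	return score
}

func main() {
	fmt.Println(Score(10, nil))
	fmt.Println(Score(10, []string{"critical"}))
	fmt.Println(Score(10, []string{"high", "low"}))
}
```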

3.6 Event Processor — Ingestion & Validation (TypeScript — Epic 3, Story 3.1)

What to test:

Valid drift report → accepted, returns 202
Missing stack_id → 400 DRIFT_REPORT_INVALID
Duplicate report_id → 409 DRIFT_REPORT_DUPLICATE
Payload > 1MB → 400 DRIFT_REPORT_TOO_LARGE
Invalid severity value → 400
Unknown agent ID → 404 AGENT_NOT_FOUND
Revoked agent API key → 403 AGENT_REVOKED
SQS message group ID equals stack_id
SQS deduplication ID equals report_id

Mocking strategy: Mock @aws-sdk/client-sqs. Mock PostgreSQL pool. Use zod schema directly in tests.

3.7 Notification Formatter (TypeScript — Epic 4, Story 4.1)

What to test:

Critical drift → header 🔴 Critical Drift Detected
Diff block truncated at Slack's 3000-char block limit
CloudTrail attribution present → "Changed by: "
CloudTrail attribution absent → "Changed by: Unknown (scheduled scan)"
All four action buttons present (drift_revert, drift_accept, drift_snooze, drift_assign)
[REDACTED] values rendered as-is
Low severity digest format → no [Revert] button

Mocking strategy: None — pure function. Use snapshot tests for Block Kit JSON output.

3.8 Remediation Engine (TypeScript — Epic 7, Stories 7.1–7.2)

What to test:

Revert: generates correct terraform apply -target=<address> command
Blast radius: resource with 3 dependents → blast_radius = 3
Blast radius: isolated resource → blast_radius = 0
require-approval policy → status pending, not executing
auto-revert policy for critical → executes without approval gate
Accept: generates correct code patch for changed attribute
Accept: creates PR with correct branch name and description
Agent heartbeat stale → REMEDIATION_AGENT_OFFLINE
Concurrent revert on same resource → REMEDIATION_IN_PROGRESS
Panic mode active → all remediation blocked

Mocking strategy: Mock agent command dispatcher. Mock GitHub API client (@octokit/rest). Mock PostgreSQL for plan persistence.

3.9 Feature Flag Evaluator (Go — Epic 10, Story 10.1)

func TestFeatureFlag_EnabledFlag_ExecutesFeature(t *testing.T) {}
func TestFeatureFlag_DisabledFlag_SkipsFeatureWithNoSideEffects(t *testing.T) {}
func TestFeatureFlag_UnknownFlag_ReturnsDefaultOff(t *testing.T) {}
func TestFeatureFlag_EnvVarOverride_TakesPrecedenceOverJSONFile(t *testing.T) {}
func TestFeatureFlag_CircuitBreaker_DisablesFlagOnFalsePositiveSpike(t *testing.T) {}
func TestFeatureFlag_ExpiredTTL_CILintDetectsIt(t *testing.T) {} // lint test, not runtime

3.10 Policy Engine (Go — Epic 10, Story 10.5)

func TestPolicyEngine_StrictMode_BlocksAllRemediation(t *testing.T) {}
func TestPolicyEngine_AuditMode_ExecutesAndLogs(t *testing.T) {}
func TestPolicyEngine_CustomerMoreRestrictive_CustomerPolicyWins(t *testing.T) {}
func TestPolicyEngine_CustomerLessRestrictive_SystemPolicyWins(t *testing.T) {}
func TestPolicyEngine_PanicMode_HaltsAllScans(t *testing.T) {}
func TestPolicyEngine_PanicMode_SendsSingleNotification(t *testing.T) {}
func TestPolicyEngine_PolicyDecision_IsLogged(t *testing.T) {}
func TestPolicyEngine_FileReload_NewPolicyTakesEffect(t *testing.T) {}
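
The "more restrictive policy wins" rule is easiest to test when restrictiveness is an ordered value. The sketch below assumes three modes and their ordering; only strict and audit appear in the test names above, and require-approval is borrowed from the remediation engine section.

```go
package main

import "fmt"

// Mode is ordered from least to most restrictive (assumed ordering).
type Mode int

const (
	Audit           Mode = iota // execute remediation and log it
	RequireApproval             // hold remediation until a human approves
	Strict                      // block all remediation
)

type Decision struct {
	Execute       bool
	NeedsApproval bool
	Reason        string
}

// Effective returns whichever of the two policies is more restrictive, so a
// customer can tighten the system policy but never loosen it.
func Effective(system, customer Mode) Mode {
	if customer > system {
		return customer
	}
	return system
}

// Decide evaluates a remediation request. Panic mode overrides every policy.
func Decide(system, customer Mode, panicMode bool) Decision {
	if panicMode {
		return Decision{Reason: "panic mode active"}
	}
	switch Effective(system, customer) {
	case Strict:
		return Decision{Reason: "strict mode blocks remediation"}
	case RequireApproval:
		return Decision{NeedsApproval: true, Reason: "approval required"}
	default:
		return Decision{Execute: true, Reason: "audit mode: execute and log"}
	}
}

func main() {
	fmt.Printf("%+v\n", Decide(Audit, Strict, false))
	fmt.Printf("%+v\n", Decide(Audit, Audit, false))
	fmt.Printf("%+v\n", Decide(Audit, Audit, true))
}
```

Returning a Reason with every decision gives the "policy decision is logged" test a concrete value to assert on.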

Section 4: Integration Test Strategy

4.1 Agent ↔ Cloud Provider APIs

Goal: Verify the agent correctly maps Terraform resource types to AWS describe calls and handles API responses.

Approach: Use recorded HTTP fixtures (via go-vcr or httpmock) for unit-speed integration tests. Use LocalStack for full integration runs in CI.

Key test cases:

// pkg/agent/integration/aws_polling_test.go
func TestAWSPolling_SecurityGroup_MapsToDescribeSecurityGroups(t *testing.T) {}
func TestAWSPolling_IAMRole_MapsToGetRole(t *testing.T) {}
func TestAWSPolling_RDSInstance_MapsToDescribeDBInstances(t *testing.T) {}
func TestAWSPolling_ResourceNotFound_ReturnsUnknownDriftState(t *testing.T) {}
func TestAWSPolling_RateLimitResponse_RetriesWithBackoff(t *testing.T) {}
func TestAWSPolling_CredentialError_ReturnsDescriptiveError(t *testing.T) {}
func TestAWSPolling_RegionScopedRequest_UsesConfiguredRegion(t *testing.T) {}

Fixture strategy:

testdata/
  aws-responses/
    ec2_describe_security_groups_clean.json      # cloud matches state
    ec2_describe_security_groups_drifted.json    # ingress rule added
    iam_get_role_policy_changed.json
    rds_describe_db_instances_clean.json
    ec2_describe_security_groups_not_found.json  # resource deleted from cloud

4.2 Agent ↔ SaaS API (Drift Report Submission)

Goal: Verify the agent correctly serializes and transmits DriftReport payloads, handles auth errors, and respects rate limit responses.

Setup: Spin up a lightweight HTTP test server in Go (httptest.NewServer) that mimics the SaaS ingestion endpoint.

func TestTransmitter_ValidReport_Returns202(t *testing.T) {}
func TestTransmitter_InvalidAPIKey_Returns401AndStopsRetrying(t *testing.T) {}
func TestTransmitter_RevokedAPIKey_Returns403AndStopsRetrying(t *testing.T) {}
func TestTransmitter_RateLimited_RespectsRetryAfterHeader(t *testing.T) {}
func TestTransmitter_ServerError_RetriesWithExponentialBackoff(t *testing.T) {}
func TestTransmitter_PayloadCompressed_WhenOverThreshold(t *testing.T) {}
func TestTransmitter_mTLSCertPresented_OnEveryRequest(t *testing.T) {}
func TestTransmitter_NetworkTimeout_RetriesUpToMaxAttempts(t *testing.T) {}

4.3 Event Processor ↔ DynamoDB (Testcontainers)

Goal: Verify event sourcing writes, TTL attribute setting, and checksum generation against a real DynamoDB Local instance.

Setup:

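// Use testcontainers-go to spin up DynamoDB Local.
// Imports assumed: aws, config, credentials, dynamodb (aws-sdk-go-v2), testcontainers-go, wait, testify/require.
func setupDynamoDBLocal(t *testing.T) *dynamodb.Client {
    t.Helper()
    ctx := context.Background()
    container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "amazon/dynamodb-local:latest",
            ExposedPorts: []string{"8000/tcp"},
            WaitingFor:   wait.ForListeningPort("8000/tcp"),
        },
        Started: true,
    })
    require.NoError(t, err)
    t.Cleanup(func() { _ = container.Terminate(ctx) })

    // Resolve the randomly mapped host port for the container's 8000/tcp
    host, err := container.Host(ctx)
    require.NoError(t, err)
    port, err := container.MappedPort(ctx, "8000/tcp")
    require.NoError(t, err)

    // DynamoDB Local accepts any credentials; region and keys are placeholders
    cfg, err := config.LoadDefaultConfig(ctx,
        config.WithRegion("us-east-1"),
        config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider("test", "test", "")),
    )
    require.NoError(t, err)
    return dynamodb.NewFromConfig(cfg, func(o *dynamodb.Options) {
        o.BaseEndpoint = aws.String(fmt.Sprintf("http://%s:%s", host, port.Port()))
    })
}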

Key test cases:

func TestDynamoDBEventStore_AppendDriftEvent_PersistsWithCorrectPK(t *testing.T) {}
func TestDynamoDBEventStore_AppendDriftEvent_SetsChecksumAttribute(t *testing.T) {}
func TestDynamoDBEventStore_AppendDriftEvent_SetsTTLPerTier(t *testing.T) {}
func TestDynamoDBEventStore_QueryByStackID_ReturnsChronologicalOrder(t *testing.T) {}
func TestDynamoDBEventStore_DuplicateEventID_IsIdempotent(t *testing.T) {}
func TestDynamoDBEventStore_FreeTier_TTL90Days(t *testing.T) {}
func TestDynamoDBEventStore_EnterpriseTier_TTL7Years(t *testing.T) {}

4.4 Event Processor ↔ PostgreSQL (Testcontainers + RLS)

Goal: Verify multi-tenant data isolation via Row-Level Security. This is the most critical integration test suite.

Setup:

// Use testcontainers for Node.js to spin up PostgreSQL 16
// Apply full schema migrations before each test suite
// Create two test orgs: orgA and orgB

Key test cases:

describe("PostgreSQL RLS Integration", () => {
  it("org A cannot read org B drift events via direct query")
  it("org A cannot read org B stacks via direct query")
  it("setting app.current_org_id scopes all queries correctly")
  it("missing app.current_org_id returns zero rows (not an error)")
  it("drift event insert without org_id fails FK constraint")
  it("drift score update is scoped to correct org's stack")
  it("concurrent inserts from two orgs do not cross-contaminate")
})

Critical test — cross-tenant isolation:

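it("org A cannot read org B drift events", async () => {
  // Insert drift event for orgB
  await insertDriftEvent(orgBPool, orgBEvent)

  // Pin one connection: the setting is per-session, and pool.query() may
  // pick a different connection for each call
  const client = await orgAPool.connect()
  try {
    // SET does not accept bind parameters; set_config() does
    await client.query("SELECT set_config('app.current_org_id', $1, false)", [orgA.id])
    // Query as orgA — should return empty, not orgB's data
    const result = await client.query("SELECT * FROM drift_events")
    expect(result.rows).toHaveLength(0)
  } finally {
    client.release()
  }
})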

4.5 IaC State File Parsing — Multi-Backend Integration

Goal: Verify the agent correctly reads state files from different backends (S3, local file, Terraform Cloud).

Setup: LocalStack S3 for S3 backend tests. Real file system for local backend. WireMock for Terraform Cloud API.

func TestStateBackend_S3_ReadsStateFileFromBucket(t *testing.T) {}
func TestStateBackend_S3_HandlesVersionedBucket(t *testing.T) {}
func TestStateBackend_LocalFile_ReadsFromFilesystem(t *testing.T) {}
func TestStateBackend_TerraformCloud_AuthenticatesAndFetchesState(t *testing.T) {}
func TestStateBackend_S3_AccessDenied_ReturnsDescriptiveError(t *testing.T) {}
func TestStateBackend_S3_BucketNotFound_ReturnsDescriptiveError(t *testing.T) {}

4.6 Notification Service ↔ Slack API

Goal: Verify Slack message delivery, request signature validation, and interactive callback handling.

Setup: WireMock or a custom Go HTTP mock server simulating the Slack API.

describe("Slack Integration", () => {
  it("delivers Block Kit message to configured channel")
  it("falls back to org default channel when stack channel not set")
  it("validates Slack request signature on interaction callbacks")
  it("rejects interaction callback with invalid signature → 401")
  it("updates original message after [Revert] button click")
  it("handles Slack API rate limit (429) with retry")
  it("handles Slack API 500 — logs error, does not crash Lambda")
})

4.7 Terraform State File Parsing — Real Fixture Files

Goal: Verify the parser handles real-world Terraform state files from different provider versions and configurations.

Fixture files sourced from:

Terraform AWS provider v4.x, v5.x state outputs
OpenTofu state files (should be identical format)
State files with modules, count, for_each
State files with workspace prefixes

func TestStateParser_RealWorldAWSProviderV5_ParsesCorrectly(t *testing.T) {}
func TestStateParser_OpenTofuStateFile_ParsesCorrectly(t *testing.T) {}
func TestStateParser_ForEachResources_AllInstancesExtracted(t *testing.T) {}
func TestStateParser_WorkspacePrefixedState_ParsesCorrectly(t *testing.T) {}
func TestStateParser_LargeStateFile_500Resources_CompletesUnder2Seconds(t *testing.T) {}

Section 5: E2E & Smoke Tests

5.1 Infrastructure Setup

All E2E tests run against LocalStack (AWS service simulation) and a real PostgreSQL instance. The test environment is defined as a Docker Compose stack:

# docker-compose.test.yml
services:
  localstack:
    image: localstack/localstack:3.x  # placeholder: pin a concrete 3.x release tag
    environment:
      SERVICES: s3,sqs,dynamodb,iam,ec2,lambda,eventbridge
      DEBUG: 0
    ports:
      - "4566:4566"

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: drift_test
      POSTGRES_USER: drift
      POSTGRES_PASSWORD: test
    ports:
      - "5432:5432"

  slack-mock:
    image: wiremock/wiremock:latest
    volumes:
      - ./testdata/wiremock/slack:/home/wiremock/mappings
    ports:
      - "8080:8080"

  github-mock:
    image: wiremock/wiremock:latest
    volumes:
      - ./testdata/wiremock/github:/home/wiremock/mappings
    ports:
      - "8081:8080"

Synthetic drift generation: A helper CLI tool (testdata/tools/drift-injector) modifies LocalStack EC2/IAM resources to simulate real drift scenarios without touching real AWS.

5.2 Critical User Journey: Install → Detect → Notify

Journey: Agent installed → drift check run → drift detected → Slack alert delivered

// e2e/onboarding_flow_test.go
func TestE2E_OnboardingFlow_InstallToFirstSlackAlert(t *testing.T) {
    // 1. Register org and agent via API
    org := createTestOrg(t)
    agent := registerAgent(t, org.APIKey)

    // 2. Upload a Terraform state file to LocalStack S3
    uploadStateFixture(t, "testdata/states/prod_networking.tfstate", org.StateBucket)

    // 3. Inject drift into LocalStack EC2 (add 0.0.0.0/0 ingress rule)
    injectSecurityGroupDrift(t, "sg-abc123")

    // 4. Run drift check
    result := runDriftCheck(t, agent, org.StateBucket)
    require.Equal(t, 1, result.DriftedResourceCount)
    require.Equal(t, "critical", result.DriftedResources[0].Severity)

    // 5. Verify Slack mock received the Block Kit message
    slackRequests := getSlackMockRequests(t)
    require.Len(t, slackRequests, 1)
    assert.Contains(t, slackRequests[0].Body, "Critical Drift Detected")
    assert.Contains(t, slackRequests[0].Body, "aws_security_group")

    // 6. Verify drift event persisted in PostgreSQL
    event := getDriftEvent(t, org.ID, result.DriftedResources[0].Address)
    assert.Equal(t, "open", event.Status)
    assert.Equal(t, "critical", event.Severity)
}

5.3 Critical User Journey: Revert Workflow

Journey: Slack [Revert] button clicked → remediation engine queues command → agent executes → event resolved

func TestE2E_RemediationRevert_SlackButtonToResolution(t *testing.T) {
    // Setup: existing open drift event
    org, driftEvent := setupOpenDriftEvent(t, "critical")

    // 1. Simulate Slack [Revert] button click
    payload := buildSlackInteractionPayload("drift_revert", driftEvent.ID, org.SlackUserID)
    resp := postSlackInteraction(t, payload, validSlackSignature(payload))
    assert.Equal(t, 200, resp.StatusCode)

    // 2. Verify Slack message updated to "Reverting..."
    slackUpdates := getSlackMockUpdates(t)
    assert.Contains(t, slackUpdates[0].Body, "Reverting")

    // 3. Verify remediation plan created in DB
    plan := waitForRemediationPlan(t, driftEvent.ID, 5*time.Second)
    assert.Equal(t, "executing", plan.Status)
    assert.Contains(t, plan.TargetResources, driftEvent.ResourceAddress)

    // 4. Simulate agent completing the apply
    reportRemediationComplete(t, plan.ID, "success")

    // 5. Verify drift event resolved
    event := getDriftEvent(t, org.ID, driftEvent.ResourceAddress)
    assert.Equal(t, "resolved", event.Status)
    assert.Equal(t, "reverted", event.ResolutionType)

    // 6. Verify final Slack message shows success
    finalUpdate := getLastSlackUpdate(t)
    assert.Contains(t, finalUpdate.Body, "reverted to declared state")
}

5.4 Critical User Journey: Secret Scrubbing End-to-End

Journey: State file with secrets → agent processes → drift report transmitted → NO secrets in SaaS database

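func TestE2E_SecretScrubbing_NoSecretsReachSaaS(t *testing.T) {
    // Setup: register org and agent (same helpers as the onboarding flow test)
    org := createTestOrg(t)
    agent := registerAgent(t, org.APIKey)

    // State file contains: master_password = "supersecret123", db_endpoint = "postgres://..."
    uploadStateFixture(t, "testdata/states/rds_with_secrets.tfstate", org.StateBucket)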

    // Inject RDS drift (instance class changed)
    injectRDSDrift(t, "mydb", "db.t3.medium", "db.t3.large")

    // Run drift check
    runDriftCheck(t, agent, org.StateBucket)

    // Verify drift event in DB — no secret values
    event := getDriftEventByResource(t, org.ID, "aws_db_instance.mydb")
    diffJSON, err := json.Marshal(event.Diff)
    require.NoError(t, err)
    assert.NotContains(t, string(diffJSON), "supersecret123")
    assert.NotContains(t, string(diffJSON), "postgres://")
    assert.Contains(t, string(diffJSON), "[REDACTED]")

    // Verify instance class drift IS present (non-secret attribute preserved)
    assert.Contains(t, string(diffJSON), "db.t3.large")
}

5.5 Smoke Tests (Post-Deploy)

Smoke tests run after every production deployment. They hit real endpoints with minimal side effects.

smoke/
  health_check_test.go         # GET /health → 200 on all services
  agent_registration_test.go   # Register a smoke-test agent → 200
  heartbeat_test.go            # Send heartbeat → 200
  drift_report_ingestion_test.go # POST minimal drift report → 202
  dashboard_api_test.go        # GET /v1/stacks (smoke org) → 200
  slack_connectivity_test.go   # Verify Slack OAuth token still valid

Smoke tests use a dedicated smoke-test organization in production with a pre-provisioned API key. They never write to real customer data.

Section 6: Performance & Load Testing

6.1 Scan Duration Benchmarks

Tool: Go's built-in testing.B for agent benchmarks. k6 for SaaS API load tests.

Targets:

Scenario	Stack Size	Target Duration	Kill Threshold
Full state parse	100 resources	< 50ms	> 200ms
Full state parse	500 resources	< 200ms	> 1s
Full drift check (parse + poll + compare)	20 resources	< 5s	> 30s
Full drift check	100 resources	< 30s	> 120s
Drift report ingestion (SaaS)	single report	< 200ms p99	> 1s p99
Drift report ingestion (SaaS)	100 concurrent	< 500ms p99	> 2s p99

Go benchmark tests:

// pkg/agent/bench_test.go
func BenchmarkStateParser_100Resources(b *testing.B) {
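    data, err := os.ReadFile("testdata/states/100_resources.tfstate")
    if err != nil {
        b.Fatal(err) // a missing fixture would otherwise benchmark the error path
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := ParseState(data); err != nil {
            b.Fatal(err)
        }
    }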
}

func BenchmarkDriftComparator_100Resources(b *testing.B) {
    stateResources := loadStateFixture("testdata/states/100_resources.tfstate")
    cloudResources := loadCloudFixture("testdata/cloud/100_resources_clean.json")
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = CompareDrift(stateResources, cloudResources)
    }
}

func BenchmarkSecretScrubber_LargeDiff(b *testing.B) {
    diff := loadDiffFixture("testdata/diffs/large_diff_50_attributes.json")
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = ScrubSecrets(diff)
    }
}

6.2 Memory & CPU Profiling

Goal: Ensure the agent stays within its ECS task allocation (0.25 vCPU, 512MB) even for large state files.

Profile targets:

State parser memory allocation for 500-resource state files
Drift comparator heap usage during deep JSON comparison
Secret scrubber regex compilation (should be compiled once, not per-call)

// Run with: go test -memprofile=mem.out -cpuprofile=cpu.out -bench=.
// Analyze with: go tool pprof mem.out

func TestMemoryProfile_LargeStateFile_Under100MB(t *testing.T) {
    if testing.Short() { t.Skip("skipping memory profile in short mode") }
    
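    data, err := os.ReadFile("testdata/states/500_resources.tfstate")
    require.NoError(t, err)

    var before, after runtime.MemStats
    runtime.GC() // settle the heap so the baseline is stable
    runtime.ReadMemStats(&before)

    state, err := ParseState(data)
    require.NoError(t, err)

    runtime.ReadMemStats(&after)
    runtime.KeepAlive(state)

    // TotalAlloc is cumulative, so this subtraction cannot underflow if the GC runs mid-parse.
    // It counts every byte allocated during the parse, an upper bound on live heap growth.
    allocatedMB := float64(after.TotalAlloc-before.TotalAlloc) / 1024 / 1024
    assert.Less(t, allocatedMB, 100.0, "state parser should allocate < 100MB for 500 resources")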
}

Regex pre-compilation check:

func TestSecretScrubber_RegexPrecompiled_NotCompiledPerCall(t *testing.T) {
    // Call scrubber 1000 times — if regex is compiled per call, this will be slow
    diff := map[string]interface{}{"password": "test123"}
    start := time.Now()
    for i := 0; i < 1000; i++ {
        ScrubSecrets(diff)
    }
    elapsed := time.Since(start)
    assert.Less(t, elapsed, 100*time.Millisecond, "1000 scrub calls should complete in < 100ms")
}

6.3 Concurrent Scan Stress Tests

Goal: Verify the agent handles concurrent scans (multiple stacks) without race conditions or goroutine leaks.

func TestConcurrentScans_MultipleStacks_NoRaceConditions(t *testing.T) {
    // Run with: go test -race ./...
    const numStacks = 10
    var wg sync.WaitGroup
    errCh := make(chan error, numStacks) // named errCh to avoid shadowing the errors package

    for i := 0; i < numStacks; i++ {
        wg.Add(1)
        go func(stackIdx int) {
            defer wg.Done()
            stateFile := fmt.Sprintf("testdata/states/stack_%d.tfstate", stackIdx)
            _, err := RunDriftCheck(stateFile, mockAWSClient)
            if err != nil { errCh <- err }
        }(i)
    }

    wg.Wait()
    close(errCh)
    for err := range errCh {
        t.Errorf("concurrent scan error: %v", err)
    }
}
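
The race detector covers the first half of the goal; goroutine leaks need a separate check. A rough sketch using only the standard library: compare `runtime.NumGoroutine()` before and after the scans, allowing a short grace period for workers to exit. `runScans`, `leakedGoroutines`, and `leakGrace` are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// leakGrace is how long we wait for goroutines to wind down (assumed value).
const leakGrace = 200 * time.Millisecond

// runScans launches one goroutine per stack and waits for all of them.
// scan stands in for the real per-stack drift check.
func runScans(n int, scan func(stack int)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(stack int) {
			defer wg.Done()
			scan(stack)
		}(i)
	}
	wg.Wait()
}

// leakedGoroutines runs fn and reports how many goroutines beyond the
// starting count are still alive once the grace period has elapsed.
func leakedGoroutines(fn func()) int {
	before := runtime.NumGoroutine()
	fn()
	deadline := time.Now().Add(leakGrace)
	for {
		extra := runtime.NumGoroutine() - before
		if extra <= 0 {
			return 0
		}
		if time.Now().After(deadline) {
			return extra
		}
		time.Sleep(5 * time.Millisecond)
	}
}

func main() {
	leaked := leakedGoroutines(func() {
		runScans(10, func(stack int) { time.Sleep(time.Millisecond) })
	})
	fmt.Println("leaked goroutines:", leaked)
}
```

Counting goroutines is coarse. A dedicated checker such as go.uber.org/goleak may be a better fit if stack traces of the leaked goroutines are needed.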

SaaS load test (k6):

// load-tests/drift-report-ingestion.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 50 },   // ramp up to 50 concurrent agents
    { duration: '60s', target: 50 },   // hold
    { duration: '10s', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99th percentile < 500ms
    http_req_failed: ['rate<0.01'],    // < 1% error rate
  },
};

export default function () {
  const payload = JSON.stringify(buildDriftReport());
  const res = http.post(`${__ENV.API_URL}/v1/drift-reports`, payload, {
    headers: { 'Authorization': `Bearer ${__ENV.API_KEY}`, 'Content-Type': 'application/json' },
  });
  check(res, { 'status is 202': (r) => r.status === 202 });
}

Section 7: CI/CD Pipeline Integration

7.1 Test Stages

┌─────────────────────────────────────────────────────────────────┐
│  PRE-COMMIT (local, < 30s)                                      │
│  • golangci-lint (Go)                                           │
│  • eslint + tsc --noEmit (TypeScript)                           │
│  • go test -short ./... (unit tests only, no I/O)               │
│  • Feature flag TTL audit (make flag-audit)                     │
│  • Decision log presence check (PRs touching pkg/detection/)    │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│  PR (GitHub Actions, < 5 min)                                   │
│  • Full unit test suite (Go + TypeScript)                       │
│  • go test -race ./... (race detector)                          │
│  • Coverage gate: fail if < 80% overall, < 100% on scrubber     │
│  • Schema migration lint (no destructive changes)               │
│  • Snapshot test diff check (Block Kit formatter)               │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│  MERGE TO MAIN (GitHub Actions, < 10 min)                       │
│  • All unit tests                                               │
│  • Integration tests (Testcontainers: PostgreSQL + DynamoDB)    │
│  • LocalStack integration tests (S3, SQS, EC2 mock)             │
│  • RLS isolation tests (multi-tenant)                           │
│  • Docker build + Trivy scan                                    │
│  • Go benchmark regression check (fail if > 20% slower)         │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│  STAGING DEPLOY (< 15 min)                                      │
│  • E2E test suite against staging environment                   │
│  • Smoke tests (all health endpoints)                           │
│  • Secret scrubbing E2E test                                    │
│  • Multi-tenant isolation E2E test                              │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼ (manual approval gate)
┌─────────────────────────────────────────────────────────────────┐
│  PRODUCTION DEPLOY                                              │
│  • Smoke tests post-deploy                                      │
│  • Canary: route 5% traffic to new version for 10 min           │
│  • Auto-rollback if smoke tests fail                            │
└─────────────────────────────────────────────────────────────────┘
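
The benchmark regression gate in the merge stage (fail if more than 20% slower) could be sketched as below. This assumes ns/op values have already been parsed from `go test -bench` output into maps keyed by benchmark name; `findRegressions` and the sample numbers are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// maxRegressionPercent mirrors the 20% gate from the pipeline diagram.
const maxRegressionPercent = 20.0

// regressionPercent reports how much slower current is than baseline,
// as a percentage. Negative values mean the benchmark got faster.
func regressionPercent(baselineNs, currentNs float64) float64 {
	return (currentNs - baselineNs) / baselineNs * 100
}

// findRegressions returns the sorted names of benchmarks whose ns/op grew
// by more than maxRegressionPercent relative to the baseline run.
func findRegressions(baseline, current map[string]float64) []string {
	var failed []string
	for name, base := range baseline {
		cur, ok := current[name]
		if !ok || base <= 0 {
			continue // benchmark removed or baseline unusable; not judged here
		}
		if regressionPercent(base, cur) > maxRegressionPercent {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

func main() {
	baseline := map[string]float64{"BenchmarkScrubSecrets": 1000, "BenchmarkParseState": 2000}
	current := map[string]float64{"BenchmarkScrubSecrets": 1250, "BenchmarkParseState": 2100}
	fmt.Println(findRegressions(baseline, current))
}
```

Single benchmark runs are noisy on shared CI runners, so comparing several runs (for example with benchstat) is likely needed before this gate is trustworthy.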

7.2 Coverage Thresholds & Gates

# .github/workflows/test.yml (coverage gate step)
- name: Check coverage thresholds
  run: |
    # Go agent
    go test -coverprofile=coverage.out ./...
    go tool cover -func=coverage.out | grep "total:" | awk '{print $3}' | \
      awk -F'%' '{if ($1 < 80) {print "FAIL: Go coverage " $1 "% < 80%"; exit 1}}'
    
    # Secret scrubber must be 100% (the percentage is the last field of each line)
    go tool cover -func=coverage.out | grep "scrubber" | \
      awk '{pct = $NF; sub(/%/, "", pct); if (pct + 0 < 100) {print "FAIL: Scrubber coverage " pct "% < 100% (" $1 ")"; exit 1}}'
    
    # TypeScript SaaS
    npx vitest run --coverage
    # vitest.config.ts enforces: lines: 80, branches: 75, functions: 80

vitest.config.ts coverage config:

import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        lines: 80,
        branches: 75,
        functions: 80,
        statements: 80,
        // Enforce the thresholds for every file, not only for the aggregate
        perFile: true,
      },
    },
  },
})

7.3 Test Parallelization Strategy

Go: Tests are parallelized at the package level by default (go test ./...). Mark individual tests with t.Parallel() where safe. Integration tests that share LocalStack state must NOT be parallelized — use build tags to separate them.

// Unit tests: always parallel
func TestDriftComparator_AttributeAdded_ReturnsDrift(t *testing.T) {
    t.Parallel()
    // ...
}

// Integration tests: sequential within package, parallel across packages
// go test -p 4 ./... (4 packages in parallel)

Build tags for test separation:

//go:build integration
// +build integration

// Run with: go test -tags=integration ./...
// Unit only: go test ./... (no tag)

GitHub Actions matrix:

strategy:
  matrix:
    test-suite:
      - unit-go
      - unit-ts
      - integration-go
      - integration-ts
      - e2e
  fail-fast: false  # don't cancel other suites on first failure

Section 8: Transparent Factory Tenet Testing

8.1 Feature Flag Behavior (Epic 10, Story 10.1)

Testing OpenFeature Go SDK integration:

// pkg/flags/flags_test.go

// Test 1: Flag gates new detection rule
func TestFeatureFlag_NewDetectionRule_GatedByFlag(t *testing.T) {
    // Set up: flag "pulumi-support" = false
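    // memprovider is github.com/open-feature/go-sdk/openfeature/memprovider
    provider := memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
        "pulumi-support": {State: memprovider.Enabled, DefaultVariant: "off", Variants: map[string]interface{}{"off": false, "on": true}},
    })
    require.NoError(t, openfeature.SetProviderAndWait(provider))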

    result := RunDriftCheck(pulumiStateFixture)
    assert.ErrorIs(t, result.Err, ErrIaCToolNotSupported)
    assert.Equal(t, 0, result.DriftedResourceCount)
}

// Test 2: Flag enabled — feature executes
func TestFeatureFlag_NewDetectionRule_ExecutesWhenEnabled(t *testing.T) {
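    // memprovider is github.com/open-feature/go-sdk/openfeature/memprovider
    provider := memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
        "pulumi-support": {State: memprovider.Enabled, DefaultVariant: "on", Variants: map[string]interface{}{"off": false, "on": true}},
    })
    require.NoError(t, openfeature.SetProviderAndWait(provider))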

    result := RunDriftCheck(pulumiStateFixture)
    require.NoError(t, result.Err)
    assert.Greater(t, result.ResourceCount, 0)
}

// Test 3: Circuit breaker disables flag on false-positive spike
func TestFeatureFlag_CircuitBreaker_TripsOnFalsePositiveSpike(t *testing.T) {
    flag := NewFeatureFlag("new-sg-rule", circuitBreakerConfig{Threshold: 3.0, Window: time.Hour})
    
    // Simulate 10 dismissals in 1 hour (3x baseline of ~3)
    for i := 0; i < 10; i++ {
        flag.RecordDismissal()
    }
    
    assert.False(t, flag.IsEnabled(), "circuit breaker should have tripped")
}
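
One way the circuit breaker behind Test 3 might work, as a sketch under stated assumptions: the breaker knows a baseline dismissal count per window, keeps a sliding window of dismissal timestamps, and trips once the count exceeds baseline times threshold. The `circuitBreaker` type, its explicit baseline parameter, and `demoBreaker` are hypothetical and differ from the `NewFeatureFlag` signature used in the test.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// circuitBreaker trips when dismissals inside the window exceed
// baseline * threshold. All fields are illustrative assumptions.
type circuitBreaker struct {
	mu         sync.Mutex
	baseline   float64 // expected dismissals per window
	threshold  float64 // multiplier that counts as a spike
	window     time.Duration
	dismissals []time.Time
	tripped    bool
}

func newCircuitBreaker(baseline, threshold float64, window time.Duration) *circuitBreaker {
	return &circuitBreaker{baseline: baseline, threshold: threshold, window: window}
}

// RecordDismissal registers a dismissal at the given time, drops entries
// that fell out of the window, and trips the breaker on a spike.
func (cb *circuitBreaker) RecordDismissal(at time.Time) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.dismissals = append(cb.dismissals, at)
	cutoff := at.Add(-cb.window)
	kept := cb.dismissals[:0]
	for _, ts := range cb.dismissals {
		if !ts.Before(cutoff) {
			kept = append(kept, ts)
		}
	}
	cb.dismissals = kept
	if float64(len(cb.dismissals)) > cb.baseline*cb.threshold {
		cb.tripped = true // stays tripped until a human resets it
	}
}

// IsEnabled reports whether the guarded flag may still be evaluated as on.
func (cb *circuitBreaker) IsEnabled() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	return !cb.tripped
}

// demoStart and demoBreaker provide fixed inputs for the example.
var demoStart = time.Date(2026, time.March, 1, 12, 0, 0, 0, time.UTC)

func demoBreaker() *circuitBreaker { return newCircuitBreaker(3, 3.0, time.Hour) }

func main() {
	cb := demoBreaker()
	for i := 0; i < 10; i++ {
		cb.RecordDismissal(demoStart.Add(time.Duration(i) * time.Minute))
	}
	fmt.Println("enabled:", cb.IsEnabled())
}
```

With a baseline of 3 and a threshold of 3.0, nine dismissals leave the flag enabled and the tenth trips it, which matches the scenario in Test 3.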

TTL lint test (CI enforcement):

func TestFeatureFlags_NoExpiredTTLs(t *testing.T) {
    flags := LoadAllFlags("../../config/flags.json")
    for _, flag := range flags {
        if flag.Rollout == 100 {
            assert.True(t, time.Now().Before(flag.TTL),
                "flag %q is at 100%% rollout and past TTL %v — clean it up", flag.Name, flag.TTL)
        }
    }
}
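
A possible shape for the flag loader behind the TTL lint, assuming `config/flags.json` is a JSON array with `name`, `rollout`, and an RFC 3339 `ttl` field. This schema is an assumption for illustration; the real file format is not specified in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// flagDef is a hypothetical schema for one entry in config/flags.json.
type flagDef struct {
	Name    string    `json:"name"`
	Rollout int       `json:"rollout"` // percentage, 0-100
	TTL     time.Time `json:"ttl"`     // RFC 3339 timestamp
}

// sampleFlagsJSON and auditTime are fixed inputs for the example.
const sampleFlagsJSON = `[
  {"name": "pulumi-support", "rollout": 100, "ttl": "2026-01-15T00:00:00Z"},
  {"name": "new-sg-rule",    "rollout": 25,  "ttl": "2026-01-15T00:00:00Z"},
  {"name": "digest-mode",    "rollout": 100, "ttl": "2026-06-01T00:00:00Z"}
]`

var auditTime = time.Date(2026, time.March, 1, 0, 0, 0, 0, time.UTC)

// parseFlags decodes the flag definitions from JSON.
func parseFlags(data []byte) ([]flagDef, error) {
	var flags []flagDef
	if err := json.Unmarshal(data, &flags); err != nil {
		return nil, fmt.Errorf("parse flags: %w", err)
	}
	return flags, nil
}

// expiredFlags lists flags that are fully rolled out and past their TTL,
// i.e. the ones the audit wants cleaned up.
func expiredFlags(flags []flagDef, now time.Time) []string {
	var expired []string
	for _, f := range flags {
		if f.Rollout == 100 && !now.Before(f.TTL) {
			expired = append(expired, f.Name)
		}
	}
	return expired
}

func main() {
	flags, err := parseFlags([]byte(sampleFlagsJSON))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(expiredFlags(flags, auditTime))
}
```

Passing `now` in explicitly keeps the audit deterministic in tests; the lint test above uses `time.Now()` directly, which is fine for CI enforcement.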

8.2 Schema Migration Validation (Epic 10, Story 10.2)

Goal: CI blocks any migration that removes, renames, or changes the type of existing DynamoDB attributes.

// tools/schema-lint/main_test.go

func TestSchemaMigration_AddNewAttribute_IsAllowed(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeAdd, AttributeName: "new_field_v2", AttributeType: "S"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.NoError(t, err)
}

func TestSchemaMigration_RemoveAttribute_IsRejected(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeRemove, AttributeName: "event_type"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.ErrorContains(t, err, "destructive schema change: cannot remove attribute 'event_type'")
}

func TestSchemaMigration_RenameAttribute_IsRejected(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeRename, OldName: "payload", NewName: "event_payload"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.ErrorContains(t, err, "destructive schema change: cannot rename attribute")
}

func TestSchemaMigration_ChangeAttributeType_IsRejected(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeModify, AttributeName: "timestamp", OldType: "S", NewType: "N"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.ErrorContains(t, err, "destructive schema change: cannot change type of attribute 'timestamp'")
}
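
The four tests above pin down an additive-only rule. A minimal sketch of a validator that would satisfy them, assuming the current schema is a map from attribute name to DynamoDB type. The lowercase types and field names here are hypothetical and simplified relative to the `Migration` and `SchemaChange` types used in the tests.

```go
package main

import "fmt"

type changeType string

const (
	changeAdd    changeType = "add"
	changeRemove changeType = "remove"
	changeRename changeType = "rename"
	changeModify changeType = "modify"
)

// schemaChange is a simplified, hypothetical version of SchemaChange.
type schemaChange struct {
	Type      changeType
	Attribute string
	NewName   string // only used for renames
	NewType   string // only used for type changes
}

// validateMigration allows additive changes only. Anything that removes,
// renames, or retypes an attribute is rejected as destructive.
func validateMigration(changes []schemaChange, schema map[string]string) error {
	for _, c := range changes {
		_, exists := schema[c.Attribute]
		switch c.Type {
		case changeAdd:
			if exists {
				return fmt.Errorf("attribute '%s' already exists", c.Attribute)
			}
		case changeRemove:
			return fmt.Errorf("destructive schema change: cannot remove attribute '%s'", c.Attribute)
		case changeRename:
			return fmt.Errorf("destructive schema change: cannot rename attribute '%s' to '%s'", c.Attribute, c.NewName)
		case changeModify:
			return fmt.Errorf("destructive schema change: cannot change type of attribute '%s'", c.Attribute)
		default:
			return fmt.Errorf("unknown change type %q", c.Type)
		}
	}
	return nil
}

func main() {
	schema := map[string]string{"event_type": "S", "payload": "S", "timestamp": "S"}
	err := validateMigration([]schemaChange{{Type: changeRemove, Attribute: "event_type"}}, schema)
	fmt.Println(err)
}
```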

8.3 Decision Log Format Validation (Epic 10, Story 10.3)

// tools/decision-log-lint/main_test.go

func TestDecisionLog_ValidFormat_PassesValidation(t *testing.T) {
    log := DecisionLog{
        Prompt:                 "Why is security group drift classified as critical?",
        Reasoning:              "SG drift is the #1 vector for cloud breaches...",
        AlternativesConsidered: []string{"classify as high", "require manual review"},
        Confidence:             0.9,
        Timestamp:              time.Now(),
        Author:                 "max@dd0c.dev",
    }
    assert.NoError(t, ValidateDecisionLog(log))
}

func TestDecisionLog_MissingReasoning_FailsValidation(t *testing.T) {
    log := DecisionLog{Prompt: "Why?", Confidence: 0.8}
    err := ValidateDecisionLog(log)
    assert.ErrorContains(t, err, "reasoning is required")
}

func TestDecisionLog_ConfidenceOutOfRange_FailsValidation(t *testing.T) {
    log := DecisionLog{Prompt: "Why?", Reasoning: "Because.", Confidence: 1.5}
    err := ValidateDecisionLog(log)
    assert.ErrorContains(t, err, "confidence must be between 0 and 1")
}

// CI check: PRs touching pkg/detection/ must include a decision log
func TestCI_DetectionPackageChange_RequiresDecisionLog(t *testing.T) {
    changedFiles := getChangedFilesInPR()
    touchesDetection := slices.ContainsFunc(changedFiles, func(f string) bool {
        return strings.HasPrefix(f, "pkg/detection/")
    })
    if touchesDetection {
        decisionLogs := findDecisionLogsInPR()
        assert.NotEmpty(t, decisionLogs, "PRs touching pkg/detection/ require a decision log entry")
    }
}
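
A validator consistent with the format tests above might look like the following. It is a sketch that checks only the three rules the tests name (prompt present, reasoning present, confidence in range) and returns the first failure; the lowercase `decisionLog` type is a trimmed, hypothetical stand-in for `DecisionLog`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// decisionLog is a trimmed, hypothetical stand-in for DecisionLog.
type decisionLog struct {
	Prompt                 string
	Reasoning              string
	AlternativesConsidered []string
	Confidence             float64
	Author                 string
}

// validateDecisionLog returns the first rule violation it finds, or nil.
func validateDecisionLog(l decisionLog) error {
	if strings.TrimSpace(l.Prompt) == "" {
		return errors.New("prompt is required")
	}
	if strings.TrimSpace(l.Reasoning) == "" {
		return errors.New("reasoning is required")
	}
	if l.Confidence < 0 || l.Confidence > 1 {
		return errors.New("confidence must be between 0 and 1")
	}
	return nil
}

func main() {
	fmt.Println(validateDecisionLog(decisionLog{Prompt: "Why?", Confidence: 0.8}))
	fmt.Println(validateDecisionLog(decisionLog{Prompt: "Why?", Reasoning: "Because.", Confidence: 1.5}))
}
```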

8.4 OTEL Span Assertion Tests (Epic 10, Story 10.4)

Goal: Verify that drift classification emits the correct OpenTelemetry spans with required attributes.

// pkg/observability/spans_test.go

func TestOTELSpans_DriftScan_EmitsParentSpan(t *testing.T) {
    exporter := tracetest.NewInMemoryExporter()
    tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
    otel.SetTracerProvider(tp)

    RunDriftScan(testStateFixture, mockAWSClient)

    spans := exporter.GetSpans()
    parentSpans := filterSpansByName(spans, "drift_scan")
    require.Len(t, parentSpans, 1)
}

func TestOTELSpans_DriftClassification_EmitsChildSpanPerResource(t *testing.T) {
    exporter := tracetest.NewInMemoryExporter()
    // ... setup ...

    RunDriftScan(stateWith3Resources, mockAWSClient)

    classificationSpans := filterSpansByName(exporter.GetSpans(), "drift_classification")
    assert.Len(t, classificationSpans, 3) // one per resource
}

func TestOTELSpans_ClassificationSpan_HasRequiredAttributes(t *testing.T) {
    // ... run scan ...
    span := getClassificationSpan(exporter, "aws_security_group.api")

    attrs := span.Attributes()
    assert.Equal(t, "aws_security_group", getAttr(attrs, "drift.resource_type"))
    assert.NotEmpty(t, getAttr(attrs, "drift.severity_score"))
    assert.NotEmpty(t, getAttr(attrs, "drift.classification_reason"))
    // No PII: resource ARN must be hashed, not raw
    assert.NotContains(t, getAttr(attrs, "drift.resource_id"), "arn:aws:")
}

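func TestOTELSpans_NoCustomerPII_InAnySpan(t *testing.T) {
    exporter := tracetest.NewInMemoryExporter()
    tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
    otel.SetTracerProvider(tp)

    // Run scan with a state file containing real-looking ARNs
    RunDriftScan(stateWithRealARNs, mockAWSClient)

    // GetSpans returns SpanStubs: Name and Attributes are fields, not methods
    for _, span := range exporter.GetSpans() {
        for _, attr := range span.Attributes {
            // Emit stringifies any value type; AsString returns "" for non-string values.
            // The pattern allows digits in the service (ec2) and an empty region (iam).
            assert.NotRegexp(t, `arn:aws:[a-z0-9-]+:[a-z0-9-]*:\d{12}:`, attr.Value.Emit(),
                "span %q contains unhashed ARN in attribute %q", span.Name, attr.Key)
        }
    }
}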
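
The span tests require resource identifiers to be hashed before they are attached as attributes. A small sketch of one option, assuming a plain SHA-256 digest with a `sha256:` prefix; `hashResourceID` and `rawARNPattern` are hypothetical names, and the real instrumentation may use a different scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

// rawARNPattern matches an unhashed ARN that carries a 12-digit account ID.
var rawARNPattern = regexp.MustCompile(`arn:aws:[a-z0-9-]+:[a-z0-9-]*:\d{12}:`)

// hashResourceID replaces a raw identifier with a stable digest so spans can
// still be correlated per resource without exposing the ARN itself.
func hashResourceID(id string) string {
	sum := sha256.Sum256([]byte(id))
	return "sha256:" + hex.EncodeToString(sum[:])
}

func main() {
	arn := "arn:aws:ec2:us-east-1:123456789012:security-group/sg-0abc"
	hashed := hashResourceID(arn)
	fmt.Println(hashed)
	fmt.Println("still looks like an ARN:", rawARNPattern.MatchString(hashed))
}
```

An unkeyed hash of a guessable identifier can be reversed by enumeration, so a keyed construction (HMAC with a per-tenant key) may be preferable if the digests leave the trust boundary.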

8.5 Governance Policy Enforcement Tests (Epic 10, Story 10.5)

func TestGovernance_StrictMode_RemediationNeverExecutes(t *testing.T) {
    engine := NewRemediationEngine(Policy{GovernanceMode: "strict"})
    
    result, err := engine.Revert(criticalDriftEvent)
    
    require.NoError(t, err) // not an error — just blocked
    assert.Equal(t, "blocked_by_policy", result.Status)
    assert.Contains(t, result.Log, "Remediation blocked by strict mode")
    assert.False(t, mockAgentDispatcher.WasCalled())
}

func TestGovernance_CustomerCannotEscalateAboveSystemPolicy(t *testing.T) {
    systemPolicy := Policy{GovernanceMode: "strict"}
    customerPolicy := Policy{GovernanceMode: "audit"} // customer wants less restriction
    
    merged := MergePolicies(systemPolicy, customerPolicy)
    assert.Equal(t, "strict", merged.GovernanceMode, "customer cannot override system to be less restrictive")
}

func TestGovernance_PanicMode_HaltsAllScansImmediately(t *testing.T) {
    agent := NewDriftAgent(Policy{PanicMode: true})
    
    result := agent.RunScan(testStateFixture)
    
    assert.ErrorIs(t, result.Err, ErrPanicModeActive)
    assert.False(t, mockAWSClient.WasCalled(), "no AWS API calls should be made in panic mode")
}

func TestGovernance_PanicMode_SendsExactlyOneNotification(t *testing.T) {
    agent := NewDriftAgent(Policy{PanicMode: true})
    
    // Run scan 3 times — should only notify once
    for i := 0; i < 3; i++ {
        agent.RunScan(testStateFixture)
    }
    
    assert.Equal(t, 1, mockNotifier.CallCount(), "panic mode should send exactly one notification")
}
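
Two of the behaviors tested above lend themselves to very small implementations. The sketch below assumes only two governance modes, with "strict" ranked above "audit", and uses `sync.Once` for the single panic-mode notification. The `restrictiveness` ranking, the lowercase `policy` type, and `panicNotifier` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// restrictiveness ranks governance modes; higher wins on merge.
// The ordering is an assumption for illustration.
var restrictiveness = map[string]int{"audit": 1, "strict": 2}

// policy is a trimmed, hypothetical stand-in for Policy.
type policy struct {
	GovernanceMode string
}

// mergePolicies keeps the system policy unless the customer asks for
// something more restrictive. A customer can tighten, never loosen.
func mergePolicies(system, customer policy) policy {
	merged := system
	if restrictiveness[customer.GovernanceMode] > restrictiveness[system.GovernanceMode] {
		merged.GovernanceMode = customer.GovernanceMode
	}
	return merged
}

// panicNotifier sends its notification at most once, no matter how many
// scans hit panic mode.
type panicNotifier struct {
	once sync.Once
	send func()
}

func (n *panicNotifier) Notify() { n.once.Do(n.send) }

func main() {
	merged := mergePolicies(policy{GovernanceMode: "strict"}, policy{GovernanceMode: "audit"})
	fmt.Println("merged mode:", merged.GovernanceMode)

	sent := 0
	notifier := &panicNotifier{send: func() { sent++ }}
	for i := 0; i < 3; i++ {
		notifier.Notify()
	}
	fmt.Println("notifications sent:", sent)
}
```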

Section 9: Test Data & Fixtures

9.1 Directory Structure

testdata/
  states/
    # Terraform state v4 fixtures
    single_sg.tfstate                    # 1 resource: aws_security_group
    single_rds.tfstate                   # 1 resource: aws_db_instance (with secrets)
    prod_networking.tfstate              # 23 resources: VPC, SGs, subnets, routes
    prod_compute.tfstate                 # 47 resources: EC2, IAM, Lambda, ECS
    100_resources.tfstate                # benchmark fixture
    500_resources.tfstate                # benchmark fixture
    module_nested.tfstate                # module-prefixed addresses
    for_each_resources.tfstate           # for_each instances
    v3_format.tfstate                    # invalid: old format (should error)
    rds_with_secrets.tfstate             # contains master_password, connection strings
    opentofu_state.tfstate               # OpenTofu-generated state

  aws-responses/
    # Recorded AWS API responses (go-vcr cassettes)
    ec2/
      describe_sg_clean.json             # cloud matches state
      describe_sg_ingress_added.json     # 0.0.0.0/0 rule added
      describe_sg_ingress_removed.json   # rule removed
      describe_sg_not_found.json         # resource deleted from cloud
    iam/
      get_role_clean.json
      get_role_policy_changed.json
      get_role_not_found.json
    rds/
      describe_db_instances_clean.json
      describe_db_instances_class_changed.json
      describe_db_instances_publicly_accessible.json  # critical: made public

  diffs/
    # Pre-computed drift diff fixtures
    sg_ingress_added_critical.json
    iam_policy_changed_high.json
    rds_class_changed_high.json
    tag_only_change_low.json
    large_diff_50_attributes.json        # benchmark fixture

  wiremock/
    slack/
      post_message_success.json
      post_message_rate_limited.json
      post_message_channel_not_found.json
      interactions_revert_payload.json
    github/
      create_branch_success.json
      create_pr_success.json
      create_pr_repo_not_found.json

  policies/
    strict_mode.json
    audit_mode.json
    auto_revert_critical.json
    require_approval_iam.json

9.2 State File Factory (Go)

A factory package generates synthetic Terraform state files for tests. This avoids brittle fixture files that break when the state format changes.

// testutil/statefactory/factory.go

type StateFactory struct {
    version          int
    terraformVersion string
    resources        []StateResource
}

func NewStateFactory() *StateFactory {
    return &StateFactory{version: 4, terraformVersion: "1.7.0"}
}

func (f *StateFactory) WithSecurityGroup(name, vpcID string, ingress []IngressRule) *StateFactory {
    f.resources = append(f.resources, StateResource{
        Mode:     "managed",
        Type:     "aws_security_group",
        Name:     name,
        Provider: "registry.terraform.io/hashicorp/aws",
        Instances: []ResourceInstance{{
            Attributes: map[string]interface{}{
                "id":          fmt.Sprintf("sg-%s", randID()),
                "name":        name,
                "vpc_id":      vpcID,
                "ingress":     ingress,
                "egress":      defaultEgressRules(),
                "tags":        map[string]string{"ManagedBy": "terraform"},
            },
        }},
    })
    return f
}

func (f *StateFactory) WithIAMRole(name, assumeRolePolicy string) *StateFactory { /* ... */ }
func (f *StateFactory) WithRDSInstance(id, instanceClass string) *StateFactory { /* ... */ }
func (f *StateFactory) WithSecret(key, value string) *StateFactory { /* injects secret into last resource */ }
func (f *StateFactory) Build() []byte { /* marshals to JSON */ }

// Usage in tests:
state := NewStateFactory().
    WithSecurityGroup("api", "vpc-abc123", []IngressRule{{Port: 443, CIDR: "10.0.0.0/8"}}).
    WithIAMRole("lambda-exec", assumeRolePolicyJSON).
    Build()

9.3 Cloud Response Factory (Go)

Mirrors the state factory but for AWS API responses. Used to simulate clean vs. drifted cloud state.

// testutil/cloudfactory/factory.go
// Uses AWS SDK for Go v2: ec2types is github.com/aws/aws-sdk-go-v2/service/ec2/types

type CloudResponseFactory struct{}

// SGOption mutates a security group to inject drift.
type SGOption func(*ec2types.SecurityGroup)

func (f *CloudResponseFactory) SecurityGroup(id string, opts ...SGOption) *ec2types.SecurityGroup {
    sg := &ec2types.SecurityGroup{GroupId: aws.String(id), /* defaults */}
    for _, opt := range opts { opt(sg) }
    return sg
}

// Options for injecting drift:
func WithPublicIngress(port int) SGOption {
    return func(sg *ec2types.SecurityGroup) {
        sg.IpPermissions = append(sg.IpPermissions, ec2types.IpPermission{
            IpProtocol: aws.String("tcp"),
            FromPort:   aws.Int32(int32(port)),
            ToPort:     aws.Int32(int32(port)),
            IpRanges:   []ec2types.IpRange{{CidrIp: aws.String("0.0.0.0/0")}},
        })
    }
}

func WithInstanceClassChanged(newClass string) RDSOption { /* ... */ }
func WithPolicyDocumentChanged(newPolicy string) IAMOption { /* ... */ }

9.4 Drift Scenario Fixtures

Pre-built scenarios covering the most common real-world drift patterns. Each scenario includes: state file, cloud response, expected diff, expected severity.

Scenario	State Fixture	Cloud Response	Expected Severity	Category
Security group: public HTTPS ingress added	`sg_private.tfstate`	`sg_public_443.json`	critical	security
Security group: SSH port opened to world	`sg_no_ssh.tfstate`	`sg_ssh_open.json`	critical	security
IAM role: `*:*` policy attached	`iam_role_scoped.tfstate`	`iam_role_star_star.json`	critical	security
S3 bucket: public access enabled	`s3_private.tfstate`	`s3_public.json`	critical	security
RDS: made publicly accessible	`rds_private.tfstate`	`rds_public.json`	critical	security
Lambda: runtime changed (python3.8 → python3.12)	`lambda_py38.tfstate`	`lambda_py312.json`	high	configuration
ECS service: task count changed (2 → 5)	`ecs_2tasks.tfstate`	`ecs_5tasks.json`	low	scaling
EC2 instance: instance type changed	`ec2_t3medium.tfstate`	`ec2_t3large.json`	high	configuration
Route53: TTL changed (300 → 60)	`r53_ttl300.tfstate`	`r53_ttl60.json`	medium	configuration
Tags: Environment tag changed	`tags_prod.tfstate`	`tags_staging.json`	low	tags
Resource deleted from cloud	`sg_exists.tfstate`	`sg_not_found.json`	high	configuration

9.5 TypeScript Test Helpers

// test/helpers/factories.ts
import { randomUUID } from 'node:crypto'

export const buildDriftEvent = (overrides: Partial<DriftEvent> = {}): DriftEvent => ({
  id: `evt_${randomUUID()}`,
  orgId: 'org_test_001',
  stackId: 'stack_prod_networking',
  resourceAddress: 'aws_security_group.api',
  resourceType: 'aws_security_group',
  severity: 'critical',
  category: 'security',
  status: 'open',
  diff: {
    ingress: {
      old: [{ from_port: 443, cidr_blocks: ['10.0.0.0/8'] }],
      new: [{ from_port: 443, cidr_blocks: ['10.0.0.0/8', '0.0.0.0/0'] }],
    },
  },
  attribution: {
    principal: 'arn:aws:iam::123456789012:user/jsmith',
    sourceIp: '192.168.1.1',
    eventName: 'AuthorizeSecurityGroupIngress',
    attributedAt: new Date().toISOString(),
  },
  createdAt: new Date().toISOString(),
  ...overrides,
})

export const buildOrg = (overrides: Partial<Organization> = {}): Organization => ({
  id: `org_${randomUUID()}`,
  name: 'Test Org',
  slug: 'test-org',
  plan: 'starter',
  maxStacks: 10,
  pollIntervalS: 300,
  ...overrides,
})

export const buildStack = (orgId: string, overrides: Partial<Stack> = {}): Stack => ({
  id: `stack_${randomUUID()}`,
  orgId,
  name: 'prod-networking',
  backendType: 's3',
  backendHash: 'abc123def456',
  iacTool: 'terraform',
  environment: 'prod',
  driftScore: 100.0,
  resourceCount: 23,
  driftedCount: 0,
  ...overrides,
})

Section 10: TDD Implementation Order

10.1 Bootstrap Sequence (Test Infrastructure First)

Before writing a single product test, the test infrastructure itself must be bootstrapped. This is the meta-TDD step.

Week 0 — Test Infrastructure Bootstrap
────────────────────────────────────────
1. Set up Go test project structure
   • testutil/ package with state factory, cloud factory
   • testdata/ directory with initial fixture files
   • golangci-lint config (.golangci.yml)
   • go test -race baseline (should pass with zero tests)

2. Set up TypeScript test project
   • vitest.config.ts with coverage thresholds
   • test/helpers/factories.ts with builder functions
   • ESLint + tsc --noEmit in CI

3. Set up Docker Compose test environment
   • docker-compose.test.yml (LocalStack, PostgreSQL, WireMock)
   • Makefile targets: make test-unit, make test-integration, make test-e2e

4. Set up CI pipeline skeleton
   • GitHub Actions workflow with test stages
   • Coverage reporting (codecov or similar)
   • Feature flag TTL lint check

10.2 Epic-by-Epic TDD Order

The implementation order follows epic dependencies. Tests are written before code at each step.

Phase 1: Agent Core (Weeks 1–2)
────────────────────────────────
Write tests first, then implement:

1. TestStateParser_* (Epic 1, Story 1.1)
   → Implement StateParser
   → Fixture: single_sg.tfstate, module_nested.tfstate

2. TestDriftComparator_* (Epic 1, Story 1.3)
   → Implement DriftComparator
   → Depends on: StateParser (need parsed state to compare)

3. TestSecretScrubber_* (Epic 1, Story 1.4) ← ALL 16 tests before any code
   → Implement SecretScrubber
   → This is the highest-risk component. Write every test case first.

4. TestDriftClassifier_* (Epic 3, Story 3.2)
   → Implement DriftClassifier with YAML rules
   → Depends on: DriftComparator output format

5. TestAWSPolling_* (Epic 1, Story 1.2) ← Integration tests lead here
   → Implement AWS resource polling for top 5 resource types
   → Use recorded HTTP fixtures (go-vcr)
   → Add remaining 15 resource types iteratively

Phase 2: Agent Communication (Week 2)
───────────────────────────────────────
6. TestTransmitter_* (Epic 2, Story 2.2)
   → Implement HTTPS transmitter with mTLS
   → Depends on: SecretScrubber (scrub before transmit)

7. TestAgentRegistration_* (Epic 2, Story 2.1)
   → Implement agent registration flow
   → Depends on: Transmitter

8. TestHeartbeat_* (Epic 2, Story 2.3)
   → Implement heartbeat goroutine
   → Depends on: AgentRegistration

Phase 3: SaaS Ingestion Pipeline (Week 2–3)
─────────────────────────────────────────────
9. TestEventProcessor_Validation_* (Epic 3, Story 3.1)
   → Implement zod schema validation
   → Write tests for every invalid payload shape

10. TestDynamoDBEventStore_* (Epic 3, Story 3.3) ← Integration tests with Testcontainers
    → Implement DynamoDB persistence
    → Depends on: DynamoDB Local container running

11. TestPostgreSQL_RLS_* (Epic 3, Story 3.3) ← Integration tests with Testcontainers
    → Apply schema migrations
    → Write multi-tenant isolation tests BEFORE any API handlers

12. TestDriftScorer_* (Epic 3, Story 3.4)
    → Implement drift score calculation
    → Depends on: PostgreSQL schema (reads/writes stacks table)

Phase 4: Notifications (Week 3)
─────────────────────────────────
13. TestNotificationFormatter_* (Epic 4, Story 4.1)
    → Implement Block Kit formatter
    → Snapshot tests for output JSON

14. TestSlackDelivery_* (Epic 4, Story 4.2) ← Integration with WireMock
    → Implement Slack API client
    → Depends on: Formatter output

15. TestNotificationBatching_* (Epic 4, Story 4.4)
    → Implement digest queue logic
    → Depends on: Slack delivery working

Phase 5: Dashboard API (Week 3–4)
───────────────────────────────────
16. TestDashboardAuth_* (Epic 5, Story 5.1)
    → Implement Cognito JWT middleware
    → RLS context-setting middleware
    → Write auth tests before any route handlers

17. TestStackEndpoints_* (Epic 5, Story 5.2)
    → Implement GET/PATCH /v1/stacks
    → Depends on: Auth middleware + PostgreSQL

18. TestDriftEventEndpoints_* (Epic 5, Story 5.3)
    → Implement GET /v1/drift-events with filters
    → Depends on: Stack endpoints

Phase 6: Slack Bot & Remediation (Week 4)
───────────────────────────────────────────
19. TestSlackInteraction_SignatureValidation_* (Epic 7, Story 7.1)
    → Implement signature verification FIRST
    → Write tests for valid and invalid signatures before any callback logic

20. TestRemediationEngine_* (Epic 7, Stories 7.1–7.2)
    → Implement revert and accept workflows
    → Depends on: Slack interaction handler, PostgreSQL remediation_plans table

21. TestPolicyEngine_* (Epic 10, Story 10.5)
    → Implement governance policy enforcement
    → Wrap remediation engine with policy checks

Phase 7: Transparent Factory Tenets (Week 4, parallel)
────────────────────────────────────────────────────────
22. TestFeatureFlag_* (Epic 10, Story 10.1)
    → Integrate OpenFeature SDK
    → Write flag tests alongside each new feature (not at the end)

23. TestOTELSpans_* (Epic 10, Story 10.4)
    → Add OTEL instrumentation to drift scan
    → Write span assertion tests

24. TestSchemaMigration_* (Epic 10, Story 10.2)
    → Implement schema lint tool
    → Add to CI pipeline

25. TestDecisionLog_* (Epic 10, Story 10.3)
    → Implement decision log validator
    → Add PR template check to CI

Phase 8: E2E & Performance (Week 4–5)
───────────────────────────────────────
26. E2E: Onboarding flow (install → detect → notify)
    → Requires all Phase 1–4 components working
    → First E2E test written after unit + integration tests pass

27. E2E: Remediation round-trip (Slack → apply → resolve)
    → Requires Phase 5–6 components

28. Performance benchmarks
    → Run after correctness is established
    → Fail CI if regression > 20%
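
Step 19 puts Slack signature verification ahead of any callback logic. A sketch of that check, following Slack's documented v0 scheme (HMAC-SHA256 over `v0:<timestamp>:<body>` with the signing secret, compared against the `X-Slack-Signature` header, with a five-minute replay window). The function names are hypothetical, and the current time is passed in as Unix seconds to keep the example deterministic.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// maxSkewSeconds rejects requests whose timestamp is more than five minutes
// away from the server clock, which limits replay of captured requests.
const maxSkewSeconds = 5 * 60

// signSlackRequest computes the v0 signature for a timestamp and raw body.
func signSlackRequest(signingSecret, timestamp string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(signingSecret))
	mac.Write([]byte("v0:" + timestamp + ":"))
	mac.Write(body)
	return "v0=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySlackSignature checks the timestamp window first, then compares the
// expected and received signatures in constant time.
func verifySlackSignature(signingSecret, timestamp, signature string, body []byte, nowUnix int64) bool {
	ts, err := strconv.ParseInt(timestamp, 10, 64)
	if err != nil {
		return false
	}
	skew := nowUnix - ts
	if skew < 0 {
		skew = -skew
	}
	if skew > maxSkewSeconds {
		return false
	}
	expected := signSlackRequest(signingSecret, timestamp, body)
	return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
	body := []byte("payload=%7B%22type%22%3A%22block_actions%22%7D")
	sig := signSlackRequest("test-signing-secret", "1700000000", body)
	fmt.Println("valid:", verifySlackSignature("test-signing-secret", "1700000000", sig, body, 1700000010))
	fmt.Println("tampered:", verifySlackSignature("test-signing-secret", "1700000000", sig, []byte("x"), 1700000010))
}
```

The signature must be computed over the raw request body, so the handler has to read the bytes before any form or JSON parsing touches them.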

10.3 Test Dependency Graph

StateParser ──────────────────────────────────────────────────────┐
     │                                                             │
     ▼                                                             ▼
DriftComparator ──► SecretScrubber ──► Transmitter ──► E2E: Onboarding
     │
     ▼
DriftClassifier ──► DriftScorer ──► DynamoDB EventStore ──► Dashboard API
                                         │
                                         ▼
                                   PostgreSQL RLS ──► Auth Middleware
                                                           │
                                                           ▼
                                                    Slack Formatter
                                                           │
                                                           ▼
                                                    Slack Delivery
                                                           │
                                                           ▼
                                                  Remediation Engine ──► E2E: Revert
                                                           │
                                                           ▼
                                                    Policy Engine

10.4 "Never Ship Without" Checklist

Before any code ships to production, these tests must be green:

□ TestSecretScrubber_* — all 16 tests passing (100% coverage)
□ TestPostgreSQL_RLS_CrossTenantIsolation — org A cannot read org B data
□ TestTransmitter_mTLSCertPresented_OnEveryRequest
□ TestGovernance_StrictMode_RemediationNeverExecutes
□ TestE2E_SecretScrubbing_NoSecretsReachSaaS
□ TestE2E_MultiTenantIsolation_OrgACannotSeeOrgBEvents
□ go test -race ./... — zero race conditions
□ Coverage gate: ≥ 80% overall, 100% on scrubber
□ Schema migration lint: no destructive changes
□ Feature flag TTL audit: no expired flags at 100% rollout

Document complete. Total estimated test count at V1 launch: ~500 tests. Target by month 3: ~1,000 tests.

Section 11: Review Remediation Addendum (Post-Gemini Review)

11.1 Missing Epic Coverage

Epic 6: Dashboard UI (React Testing Library + Playwright)

// tests/ui/components/DiffViewer.test.tsx
describe('DiffViewer Component', () => {
  it('renders added lines in green', () => {});
  it('renders removed lines in red', () => {});
  it('renders unchanged lines in default color', () => {});
  it('collapses large diffs with "Show more" toggle', () => {});
  it('highlights HCL syntax in diff blocks', () => {});
  it('shows resource type icon next to each drift item', () => {});
});

describe('StackOverview Component', () => {
  it('renders drift count badge per stack', () => {});
  it('sorts stacks by drift severity (critical first)', () => {});
  it('shows last scan timestamp', () => {});
  it('shows agent health indicator (green/yellow/red)', () => {});
});

// tests/e2e/ui/dashboard.spec.ts (Playwright)
import { test, expect } from '@playwright/test';

test('OAuth login redirects to Cognito and back', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page).toHaveURL(/cognito/);
});

test('stack list renders with drift counts', async ({ page }) => {
  await page.goto('/dashboard/stacks');
  await expect(page.locator('[data-testid="stack-card"]').first()).toBeVisible();
});

test('diff viewer renders inline diff for Terraform resource', async ({ page }) => {
  await page.goto('/dashboard/stacks/stack-1/drifts/drift-1');
  await expect(page.locator('[data-testid="diff-viewer"]')).toBeVisible();
  await expect(page.locator('.diff-added').first()).toBeVisible();
});

test('revert button triggers confirmation modal', async ({ page }) => {
  await page.goto('/dashboard/stacks/stack-1/drifts/drift-1');
  await page.click('[data-testid="revert-btn"]');
  await expect(page.locator('[data-testid="confirm-modal"]')).toBeVisible();
});

Epic 9: Onboarding & PLG (Stripe + drift init)

// pkg/onboarding/stripe_test.go

func TestStripeWebhookCheckoutCompleted_UpgradesTenant(t *testing.T) {}
func TestStripeWebhookSubscriptionDeleted_DowngradesTenant(t *testing.T) {}
func TestStripeWebhookInvalidSignature_Returns401(t *testing.T) {}
func TestStripeWebhookReplayedEvent_IsIdempotent(t *testing.T) {}

// pkg/agent/init_test.go

func TestDriftInit_DetectsTerraformInCurrentDir(t *testing.T) {}
func TestDriftInit_DetectsCloudFormationInCurrentDir(t *testing.T) {}
func TestDriftInit_DetectsPulumiInCurrentDir(t *testing.T) {}
func TestDriftInit_GeneratesValidYAMLConfig(t *testing.T) {}
func TestDriftInit_HandlesWindowsPaths(t *testing.T) {}
func TestDriftInit_HandlesMacPaths(t *testing.T) {}
func TestDriftInit_HandlesLinuxPaths(t *testing.T) {}
func TestDriftInit_FailsGracefullyOnEmptyDir(t *testing.T) {}
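
The invalid-signature case above depends on checking the webhook signature before any tenant state changes. A minimal sketch of that check, assuming Stripe's `t=<timestamp>,v1=<hex HMAC-SHA256>` header scheme; `sign`, `verifySignature`, `webhookStatus`, and the secret value are hypothetical names for illustration, and a real handler would more likely rely on the official Stripe library and also enforce a timestamp tolerance:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"strings"
)

// sign computes the hex HMAC-SHA256 of "<timestamp>.<payload>".
func sign(secret, timestamp, payload string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp + "." + payload))
	return hex.EncodeToString(mac.Sum(nil))
}

// verifySignature parses a "t=...,v1=..." header and compares every v1
// candidate against the expected signature in constant time.
func verifySignature(header, payload, secret string) bool {
	var timestamp string
	var candidates []string
	for _, part := range strings.Split(header, ",") {
		key, value, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok {
			continue
		}
		switch key {
		case "t":
			timestamp = value
		case "v1":
			candidates = append(candidates, value)
		}
	}
	if timestamp == "" {
		return false
	}
	expected := sign(secret, timestamp, payload)
	for _, c := range candidates {
		if hmac.Equal([]byte(c), []byte(expected)) {
			return true
		}
	}
	return false
}

// webhookStatus is a hypothetical stand-in for the handler's response code.
func webhookStatus(header, payload, secret string) int {
	if !verifySignature(header, payload, secret) {
		return http.StatusUnauthorized
	}
	return http.StatusOK
}

func main() {
	const secret = "whsec_test_secret" // made-up value for the sketch
	payload := `{"type":"checkout.session.completed"}`
	header := "t=1700000000,v1=" + sign(secret, "1700000000", payload)

	fmt.Println(webhookStatus(header, payload, secret))                     // valid signature
	fmt.Println(webhookStatus(header, payload+" ", secret))                 // tampered body
	fmt.Println(webhookStatus("t=1700000000,v1=deadbeef", payload, secret)) // forged header
}
```

This sketch covers only the signature check; the replayed-event case would additionally need the event ID recorded somewhere durable.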

Epic 8: Infrastructure (Terratest)

// tests/infra/terraform_test.go

func TestTerraformPlan_CreatesExpectedResources(t *testing.T) {
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../../infra/terraform",
    })
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndPlan(t, terraformOptions)
}

func TestTerraformApply_SQSFIFOQueueCreated(t *testing.T) {}
func TestTerraformApply_RDSInstanceCreated(t *testing.T) {}
func TestTerraformApply_IAMRolesHaveLeastPrivilege(t *testing.T) {
    // Verify no IAM policy has Action: "*"
}
func TestTerraformApply_VPCSecurityGroupsRestrictIngress(t *testing.T) {}
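
The least-privilege check above ("no IAM policy has Action: \"*\"") could be backed by a small helper that inspects a policy document. The following is a sketch under stated assumptions: `hasWildcardAction` is a hypothetical name, the policy is already available as a JSON string, and `Statement` is a list (a single-object `Statement` is not handled here).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement mirrors the subset of an IAM policy statement this check needs.
// Action may be a single string or a list, so it is decoded lazily.
type statement struct {
	Effect string          `json:"Effect"`
	Action json.RawMessage `json:"Action"`
}

type policyDocument struct {
	Statement []statement `json:"Statement"`
}

// hasWildcardAction reports whether any Allow statement grants Action "*".
func hasWildcardAction(policyJSON string) (bool, error) {
	var doc policyDocument
	if err := json.Unmarshal([]byte(policyJSON), &doc); err != nil {
		return false, err
	}
	for _, st := range doc.Statement {
		if st.Effect != "Allow" || len(st.Action) == 0 {
			continue // Deny statements and NotAction-only statements are skipped
		}
		var single string
		if err := json.Unmarshal(st.Action, &single); err == nil {
			if single == "*" {
				return true, nil
			}
			continue
		}
		var many []string
		if err := json.Unmarshal(st.Action, &many); err != nil {
			return false, err
		}
		for _, action := range many {
			if action == "*" {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	broad := `{"Statement":[{"Effect":"Allow","Action":"*","Resource":"*"}]}`
	scoped := `{"Statement":[{"Effect":"Allow","Action":["sqs:SendMessage","s3:GetObject"],"Resource":"*"}]}`

	for _, policy := range []string{broad, scoped} {
		wildcard, err := hasWildcardAction(policy)
		fmt.Println(wildcard, err)
	}
}
```

Service-level wildcards such as `s3:*` are deliberately not flagged; whether those count as a violation is a policy decision the source does not state.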

Epic 2: mTLS Certificate Lifecycle

// pkg/agent/mtls_test.go

func TestMTLS_CertificateGeneration_ValidX509(t *testing.T) {}
func TestMTLS_CertificateExpiration_AgentRejectsExpiredCert(t *testing.T) {}
func TestMTLS_CertificateRotation_NewCertAcceptedMidConnection(t *testing.T) {}
func TestMTLS_CertificateRevocation_RevokedCertRejected(t *testing.T) {}
func TestMTLS_SelfSignedCert_RejectedBySaaS(t *testing.T) {}
func TestMTLS_CertificateChain_IntermediateCAValidated(t *testing.T) {}
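
As an illustration of how the expiration case might be exercised without any network, the sketch below generates a throwaway certificate with the standard library and verifies it at two points in time. All helper names are hypothetical, and a self-signed certificate placed in its own root pool stands in for the real CA chain, which the source does not describe.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCert builds a throwaway certificate valid between the given times.
func selfSignedCert(notBefore, notAfter time.Time) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "drift-agent-test"},
		NotBefore:             notBefore,
		NotAfter:              notAfter,
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAt validates cert against a pool containing only itself, as of now.
func verifyAt(cert *x509.Certificate, now time.Time) error {
	roots := x509.NewCertPool()
	roots.AddCert(cert)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:       roots,
		CurrentTime: now,
		KeyUsages:   []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}

// isExpired reports whether err is an x509 validity-period failure.
func isExpired(err error) bool {
	var invalid x509.CertificateInvalidError
	return errors.As(err, &invalid) && invalid.Reason == x509.Expired
}

// checkExpiryHandling verifies one certificate inside and outside its validity window.
func checkExpiryHandling() (acceptedWhileValid, rejectedAfterExpiry bool, err error) {
	now := time.Now()
	cert, err := selfSignedCert(now.Add(-time.Hour), now.Add(time.Hour))
	if err != nil {
		return false, false, err
	}
	acceptedWhileValid = verifyAt(cert, now) == nil
	rejectedAfterExpiry = isExpired(verifyAt(cert, now.Add(2*time.Hour)))
	return acceptedWhileValid, rejectedAfterExpiry, nil
}

func main() {
	fmt.Println(checkExpiryHandling())
}
```

Injecting `CurrentTime` keeps the test deterministic: nothing has to sleep until a short-lived certificate actually expires.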

11.2 Add t.Parallel() to Table-Driven Tests

// BEFORE (sequential — wastes CI time):
func TestSecretScrubber(t *testing.T) {
    tests := []struct{ name, input, expected string }{...}
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // runs sequentially
        })
    }
}

// AFTER (parallel):
func TestSecretScrubber(t *testing.T) {
    t.Parallel()
    tests := []struct{ name, input, expected string }{...}
    for _, tt := range tests {
        tt := tt // capture range variable (not needed from Go 1.22 onward)
        t.Run(tt.name, func(t *testing.T) {
            t.Parallel()
            // runs in parallel
        })
    }
}

11.3 Dynamic Resource Naming for LocalStack

// BEFORE (shared state — flaky):
// bucket := "drift-reports"

// AFTER (per-test isolation):
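var invalidBucketChars = regexp.MustCompile(`[^a-z0-9-]+`)

func uniqueBucket(t *testing.T) string {
    // S3 bucket names are limited to 3-63 characters of lowercase letters, digits,
    // dots, and hyphens, so t.Name() (mixed case, may contain "/" or "_") is
    // sanitized and truncated before the unique suffix is appended.
    name := invalidBucketChars.ReplaceAllString(strings.ToLower(t.Name()), "-")
    if len(name) > 30 {
        name = name[:30]
    }
    return fmt.Sprintf("drift-%s-%d", name, time.Now().UnixNano())
}

func TestDriftReportUpload(t *testing.T) {
    t.Parallel()
    bucket := uniqueBucket(t)
    _, err := s3Client.CreateBucket(ctx, &s3.CreateBucketInput{Bucket: &bucket})
    require.NoError(t, err)
    // Test uses an isolated bucket, so there is no cross-test contamination
}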

11.4 Distributed Tracing Cross-Boundary Tests

// tests/integration/trace_propagation_test.go

func TestTraceContext_AgentToSaaS_SpanParentChain(t *testing.T) {
    // Agent generates drift_scan span with trace_id
    // POST /v1/drift-reports carries traceparent header
    // SaaS Event Processor creates child span
    // Verify parent-child relationship across HTTP boundary
    
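    // Register the in-memory exporter so spans emitted in-process are captured
    // (assumes the Event Processor under test uses the global tracer provider)
    exporter := tracetest.NewInMemoryExporter()
    tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
    otel.SetTracerProvider(tp)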
    
    // Fire drift report with traceparent
    traceID := "4bf92f3577b34da6a3ce929d0e0e4736"
    resp := postDriftReport(t, stack, traceID)
    assert.Equal(t, 200, resp.StatusCode)
    
    spans := exporter.GetSpans()
    eventProcessorSpan := findSpan(spans, "drift_report.process")
    assert.Equal(t, traceID, eventProcessorSpan.SpanContext().TraceID().String())
}

func TestTraceContext_SQSBoundary_PreservesTraceID(t *testing.T) {
    // Verify SQS message attributes contain traceparent
    // Verify consumer extracts and continues the trace
}

func TestTraceContext_AgentScan_CreatesParentSpan(t *testing.T) {
    // Verify agent drift_scan span has correct attributes:
    // drift.stack_id, drift.resource_count, drift.duration_ms
}
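
Both boundary tests above rely on the `traceparent` value surviving the hop intact. A minimal sketch of parsing and validating that header, assuming the W3C Trace Context version `00` layout (`version-traceid-parentid-flags`); `parseTraceParent` is a hypothetical helper, and in practice the OpenTelemetry propagator would normally do this work:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the four fields of a version-00 traceparent header.
type traceParent struct {
	Version, TraceID, ParentID, Flags string
}

// isLowerHex reports whether s contains only lowercase hexadecimal digits.
func isLowerHex(s string) bool {
	for _, r := range s {
		if !(r >= '0' && r <= '9') && !(r >= 'a' && r <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceParent validates field count, lengths, hex encoding, and the
// all-zero IDs that the specification treats as invalid.
func parseTraceParent(header string) (traceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 {
		return traceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	tp := traceParent{Version: parts[0], TraceID: parts[1], ParentID: parts[2], Flags: parts[3]}
	switch {
	case tp.Version != "00":
		return traceParent{}, errors.New("traceparent: unsupported version")
	case len(tp.TraceID) != 32 || !isLowerHex(tp.TraceID) || tp.TraceID == strings.Repeat("0", 32):
		return traceParent{}, errors.New("traceparent: invalid trace-id")
	case len(tp.ParentID) != 16 || !isLowerHex(tp.ParentID) || tp.ParentID == strings.Repeat("0", 16):
		return traceParent{}, errors.New("traceparent: invalid parent-id")
	case len(tp.Flags) != 2 || !isLowerHex(tp.Flags):
		return traceParent{}, errors.New("traceparent: invalid trace-flags")
	}
	return tp, nil
}

func main() {
	// The trace ID matches the one used in the test above; the parent ID is illustrative.
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, err)

	_, err = parseTraceParent("00-00000000000000000000000000000000-00f067aa0ba902b7-01")
	fmt.Println(err)
}
```

The SQS test could carry the same string in a message attribute and run it through the same validation on the consumer side.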

11.5 Backward Compatibility Serialization (Elastic Schema)

// tests/schema/backward_compat_test.go

func TestOldAgent_ParsesNewDynamoDBItem_WithV2Attributes(t *testing.T) {
    // Simulate V2 DynamoDB item with new _v2 fields
    item := map[string]types.AttributeValue{
        "PK":              &types.AttributeValueMemberS{Value: "STACK#123"},
        "drift_score":     &types.AttributeValueMemberN{Value: "85"},
        "drift_score_v2":  &types.AttributeValueMemberN{Value: "92"}, // New field
        "remediation_v2":  &types.AttributeValueMemberS{Value: "auto"}, // New field
    }
    
    // V1 parser must ignore unknown fields
    result, err := ParseDriftItem(item)
    assert.NoError(t, err)
    assert.Equal(t, 85, result.DriftScore) // Uses V1 field
}

func TestV1Code_ReadsV2Writes_DuringMigrationWindow(t *testing.T) {
    // V2 writes both drift_score and drift_score_v2
    // V1 reads drift_score (ignores _v2)
    // Verify no data loss
}

11.6 Security: RBAC Forgery & Replay Attacks

// tests/integration/security_test.go

func TestAgentCannotForgeStackID(t *testing.T) {
    // Agent with API key for org-A sends drift report claiming stack belongs to org-B
    orgAKey := createAPIKey(t, "org-a")
    report := makeDriftReport("org-b-stack-id") // Wrong org
    
    resp := postDriftReportWithKey(t, report, orgAKey)
    assert.Equal(t, 403, resp.StatusCode)
}

func TestReplayAttack_DuplicateReportID_Rejected(t *testing.T) {
    report := makeDriftReport("stack-1")
    resp1 := postDriftReport(t, report)
    assert.Equal(t, 200, resp1.StatusCode)
    
    // Replay exact same report
    resp2 := postDriftReport(t, report)
    assert.Equal(t, 409, resp2.StatusCode) // Conflict — already processed
}

func TestReplayAttack_OldTimestamp_Rejected(t *testing.T) {
    report := makeDriftReport("stack-1")
    report.Timestamp = time.Now().Add(-10 * time.Minute) // 10 min old
    
    resp := postDriftReport(t, report)
    assert.Equal(t, 400, resp.StatusCode) // Stale report
}
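
The two replay tests imply a guard that tracks report IDs and enforces a freshness window. A minimal in-memory sketch, assuming a 5-minute window (the source only shows that a 10-minute-old report is rejected); `replayGuard`, `submit`, and `defaultMaxAge` are hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// defaultMaxAge is an assumed freshness window, not a value from the design.
const defaultMaxAge = 5 * time.Minute

var (
	errDuplicate = errors.New("report already processed")
	errStale     = errors.New("report timestamp outside accepted window")
)

// replayGuard remembers report IDs and rejects stale or repeated submissions.
type replayGuard struct {
	mu     sync.Mutex
	maxAge time.Duration
	seen   map[string]struct{}
}

func newReplayGuard(maxAge time.Duration) *replayGuard {
	return &replayGuard{maxAge: maxAge, seen: make(map[string]struct{})}
}

// check rejects reports older than maxAge, then reports whose ID was already seen.
func (g *replayGuard) check(reportID string, sentAt, now time.Time) error {
	if now.Sub(sentAt) > g.maxAge {
		return errStale
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.seen[reportID]; ok {
		return errDuplicate
	}
	g.seen[reportID] = struct{}{}
	return nil
}

// statusFor maps guard results onto the status codes the tests above expect.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errDuplicate):
		return http.StatusConflict
	case errors.Is(err, errStale):
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}

// submit checks a report whose timestamp is `age` in the past.
func submit(g *replayGuard, reportID string, age time.Duration) int {
	now := time.Now()
	return statusFor(g.check(reportID, now.Add(-age), now))
}

func main() {
	g := newReplayGuard(defaultMaxAge)
	fmt.Println(submit(g, "report-1", 0))              // first delivery
	fmt.Println(submit(g, "report-1", 0))              // replay of the same ID
	fmt.Println(submit(g, "report-2", 10*time.Minute)) // stale timestamp
}
```

The map here grows without bound; a production guard would need to expire IDs once they fall outside the freshness window, and would have to live in shared storage if more than one instance accepts reports.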

11.7 Noisy Neighbor & Fair-Share Processing

// tests/integration/fair_share_test.go

func TestNoisyNeighbor_LargeOrgDoesNotStarveSmallOrg(t *testing.T) {
    // Org A: 10,000 drifted resources
    // Org B: 10 drifted resources
    // Both submit reports simultaneously
    
    seedDriftReports(t, "org-a", 10000)
    seedDriftReports(t, "org-b", 10)
    
    // Org B's reports must be processed within 30 seconds
    // (not queued behind all 10K of Org A's)
    start := time.Now()
    waitForProcessed(t, "org-b", 10, 30*time.Second)
    assert.Less(t, time.Since(start), 30*time.Second)
}
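
One scheduling policy that would satisfy the fair-share test is round-robin across per-org queues. The sketch below is an in-memory illustration of that policy only; `fairQueue` and its methods are hypothetical, and the source does not say how the Event Processor actually schedules work.

```go
package main

import "fmt"

// report is a minimal stand-in for a queued drift report.
type report struct {
	Org string
	Seq int
}

// fairQueue keeps one FIFO per org and serves orgs in round-robin order.
type fairQueue struct {
	order  []string            // orgs with pending work, in service order
	queues map[string][]report // pending reports per org
}

func newFairQueue() *fairQueue {
	return &fairQueue{queues: make(map[string][]report)}
}

// push enqueues a report and registers the org if it had nothing pending.
func (q *fairQueue) push(r report) {
	if len(q.queues[r.Org]) == 0 {
		q.order = append(q.order, r.Org)
	}
	q.queues[r.Org] = append(q.queues[r.Org], r)
}

// pop serves the org at the front of the line, then moves it to the back
// if it still has pending reports.
func (q *fairQueue) pop() (report, bool) {
	if len(q.order) == 0 {
		return report{}, false
	}
	org := q.order[0]
	q.order = q.order[1:]
	pending := q.queues[org]
	r := pending[0]
	if len(pending) > 1 {
		q.queues[org] = pending[1:]
		q.order = append(q.order, org)
	} else {
		delete(q.queues, org)
	}
	return r, true
}

// popsUntilServed counts how many pops it takes to serve `want` reports for org.
func popsUntilServed(q *fairQueue, org string, want int) int {
	pops, served := 0, 0
	for served < want {
		r, ok := q.pop()
		if !ok {
			break
		}
		pops++
		if r.Org == org {
			served++
		}
	}
	return pops
}

func main() {
	q := newFairQueue()
	for i := 0; i < 10000; i++ {
		q.push(report{Org: "org-a", Seq: i})
	}
	for i := 0; i < 10; i++ {
		q.push(report{Org: "org-b", Seq: i})
	}
	fmt.Println(popsUntilServed(q, "org-b", 10))
}
```

With strict alternation, the small org's backlog drains after a number of pops proportional to its own size rather than to the large org's 10,000 reports.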

11.8 Panic Mode Mid-Remediation Race Condition

// tests/integration/panic_remediation_test.go

func TestPanicMode_AbortsInFlightRemediation(t *testing.T) {
    // Start a remediation (terraform apply)
    execID := startRemediation(t, "stack-1", "drift-1")
    waitForState(t, execID, "applying")
    
    // Trigger panic mode
    triggerPanicMode(t)
    
    // Remediation must be aborted, not completed
    state := waitForState(t, execID, "aborted")
    assert.Equal(t, "aborted", state)
    
    // Verify terraform state is not corrupted
    // (agent should have run terraform state pull to verify)
}

func TestPanicMode_DoesNotAbortReadOnlyScans(t *testing.T) {
    // Drift scans (read-only) should continue during panic
    // Only write operations (remediation) are halted
    scanID := startDriftScan(t, "stack-1")
    triggerPanicMode(t)
    
    state := waitForState(t, scanID, "completed")
    assert.Equal(t, "completed", state) // Scan finishes normally
}
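
The split between halted writes and uninterrupted reads can be modelled with context cancellation: remediation steps share a cancellable context, scans do not. This is a deterministic sketch with hypothetical names (`panicSwitch`, `runSteps`, `simulate`); it checks for cancellation between steps and does not model interrupting a running `terraform apply` process.

```go
package main

import (
	"context"
	"fmt"
)

// panicSwitch owns the context that every write operation derives from.
type panicSwitch struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func newPanicSwitch() *panicSwitch {
	ctx, cancel := context.WithCancel(context.Background())
	return &panicSwitch{ctx: ctx, cancel: cancel}
}

// trigger cancels the shared context, halting write operations.
func (p *panicSwitch) trigger() { p.cancel() }

// runSteps executes steps in order and refuses to start a new step once
// ctx has been cancelled.
func runSteps(ctx context.Context, steps []func()) string {
	for _, step := range steps {
		select {
		case <-ctx.Done():
			return "aborted"
		default:
		}
		step()
	}
	return "completed"
}

// simulate fires panic mode while the first remediation step is in flight,
// then runs a read-only scan on an independent context.
func simulate() (remediation string, stepsApplied int, scan string) {
	sw := newPanicSwitch()
	remediationSteps := []func(){
		func() { stepsApplied++; sw.trigger() }, // panic arrives mid-remediation
		func() { stepsApplied++ },               // must never run
	}
	remediation = runSteps(sw.ctx, remediationSteps)

	scanSteps := []func(){func() {}, func() {}}
	scan = runSteps(context.Background(), scanSteps)
	return remediation, stepsApplied, scan
}

func main() {
	fmt.Println(simulate())
}
```

Triggering the switch from inside the first step avoids goroutines and sleeps, so the abort point is the same on every run.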

11.9 Remediation vs. Concurrent Scan Race Condition

func TestConcurrentScanDuringRemediation_DoesNotReportHalfAppliedState(t *testing.T) {
    // Start remediation (terraform apply — takes ~30s)
    execID := startRemediation(t, "stack-1", "drift-1")
    waitForState(t, execID, "applying")
    
    // Trigger a drift scan while remediation is in progress
    scanID := startDriftScan(t, "stack-1")
    
    // Scan must either:
    // a) Wait for remediation to complete, OR
    // b) Skip the stack with "remediation in progress" status
    scanResult := waitForScanComplete(t, scanID)
    assert.NotEqual(t, "half-applied", scanResult.Status)
    // Must be either "skipped_remediation_in_progress" or show post-remediation state
}
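
Option (b) above, skipping the stack while a remediation is running, can be sketched with a per-stack lock and a non-blocking acquire. The names below are hypothetical and the lock is in-process only; the agent and SaaS would need a shared lock (which the source does not specify) for this to hold across machines.

```go
package main

import (
	"fmt"
	"sync"
)

// stackLocks hands out one mutex per stack ID.
type stackLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newStackLocks() *stackLocks {
	return &stackLocks{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for stackID, creating it on first use.
func (s *stackLocks) lockFor(stackID string) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	l, ok := s.locks[stackID]
	if !ok {
		l = &sync.Mutex{}
		s.locks[stackID] = l
	}
	return l
}

// beginRemediation takes the stack lock and returns a function that releases it.
func (s *stackLocks) beginRemediation(stackID string) (release func()) {
	l := s.lockFor(stackID)
	l.Lock()
	return l.Unlock
}

// scan skips the stack instead of observing a half-applied state.
func (s *stackLocks) scan(stackID string) string {
	l := s.lockFor(stackID)
	if !l.TryLock() { // TryLock requires Go 1.18 or later
		return "skipped_remediation_in_progress"
	}
	defer l.Unlock()
	return "completed"
}

func main() {
	locks := newStackLocks()

	release := locks.beginRemediation("stack-1")
	fmt.Println(locks.scan("stack-1")) // remediation holds the lock
	fmt.Println(locks.scan("stack-2")) // other stacks are unaffected

	release()
	fmt.Println(locks.scan("stack-1")) // scan sees post-remediation state
}
```

Option (a), waiting for the remediation to finish, would replace `TryLock` with a blocking `Lock`, at the cost of delaying the scan for the full apply duration.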

11.10 SaaS API Memory Profiling

// tests/load/memory_profile_test.go

func TestEventProcessor_DoesNotOOM_On1MB_DriftReport(t *testing.T) {
    // Generate a 1MB drift report (1000 resources with large diffs)
    report := makeLargeDriftReport(1000)
    assert.Greater(t, len(report), 1024*1024)
    
    var memBefore, memAfter runtime.MemStats
    runtime.ReadMemStats(&memBefore)
    
    processReport(t, report)
    
    runtime.ReadMemStats(&memAfter)
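    // Alloc is unsigned and can shrink if a GC cycle runs, so guard against wraparound
    var growth uint64
    if memAfter.Alloc > memBefore.Alloc {
        growth = memAfter.Alloc - memBefore.Alloc
    }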
    assert.Less(t, growth, uint64(50*1024*1024)) // <50MB growth
}

11.11 Trim E2E to Smoke Tier

Per the review recommendation, E2E is capped at 10 critical paths. The remaining 40 tests are demoted to integration:

| E2E (Keep, 10 max) | Demoted to Integration |
| --- | --- |
| Onboarding: init → connect → first scan | Agent heartbeat variations |
| First drift detected → Slack alert | Individual parser format tests |
| Revert flow: Slack → agent apply → verify | Secret scrubber edge cases |
| Panic mode halts remediation | DynamoDB access pattern tests |
| Cross-tenant isolation | Individual webhook format tests |
| OAuth login → dashboard → view diff | Notification batching |
| Free tier limit enforcement | Agent config reload |
| Agent disconnect → reconnect → resume | Baseline score calculations |
| mTLS cert rotation mid-scan | Individual API endpoint tests |
| Stripe upgrade → unlock features | Cache invalidation patterns |

11.12 Updated Test Pyramid (Post-Review)

| Level | Original | Revised | Rationale |
| --- | --- | --- | --- |
| Unit | 70% (~350) | 65% (~350) | Add t.Parallel(), keep count but add UI component tests |
| Integration | 20% (~100) | 28% (~150) | Terratest, mTLS, trace propagation, fair-share, security |
| E2E/Smoke | 10% (~50) | 7% (~35) | Capped at 10 true E2E + 25 Playwright UI tests |

End of P2 Review Remediation Addendum
