
# dd0c/drift — Test Architecture & TDD Strategy
**Author:** Max Mayfield (Test Architect)
**Date:** February 28, 2026
**Product:** dd0c/drift — IaC Drift Detection & Remediation SaaS
**Status:** Test Architecture Design Document

---

## Section 1: Testing Philosophy & TDD Workflow

### 1.1 Core Philosophy

dd0c/drift is a security-critical product. A missed drift event or a false positive in the remediation engine can cause real infrastructure damage. The testing strategy reflects this: **correctness is non-negotiable, speed is a constraint, not a goal**.

Three principles guide every testing decision:

1. **Tests are the first customer.** Before writing a single line of production code, the test defines the contract. If you can't write a test for it, you don't understand the requirement well enough to build it.

2. **The secret scrubber and RLS are untouchable.** These two components — the agent's secret scrubbing engine and the SaaS's PostgreSQL Row-Level Security — have 100% test coverage requirements. No exceptions. A bug in either is a trust-destroying incident.

3. **Drift detection logic is pure functions.** The comparator, scorer, and classifier take inputs and return outputs with no side effects. This makes them trivially testable and means the test suite runs fast even at high coverage.

### 1.2 Red-Green-Refactor Adapted for dd0c/drift

The standard TDD cycle applies, but with domain-specific adaptations:

```
RED   → Write a failing test that describes a drift scenario
         e.g., "security group ingress rule added to 0.0.0.0/0 → severity: critical"

GREEN → Write the minimum code to make it pass
         e.g., add the classification rule to the YAML config + evaluator

REFACTOR → Clean up without breaking the test
            e.g., extract the CIDR check into a reusable predicate
```
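
The refactor step above mentions extracting the CIDR check into a reusable predicate. A minimal sketch of such a predicate follows, assuming the classifier receives CIDR strings as they appear in security group rules. The name `isPublicCIDR` is hypothetical and not taken from the codebase.

```go
package main

import "net/netip"

// isPublicCIDR reports whether cidr covers the whole IPv4 or IPv6 address
// space (0.0.0.0/0 or ::/0). Unparseable input returns false here; a real
// classifier may prefer to surface that as an error instead.
func isPublicCIDR(cidr string) bool {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false
	}
	// A prefix length of zero matches every address in the family.
	return prefix.Bits() == 0
}
```

Keeping the check as a pure function means the RED test can target it directly, and the classification rule only has to call it.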

**When to write tests first (strict TDD):**
- All drift detection logic (comparator, classifier, scorer)
- Secret scrubbing engine — write tests for every secret pattern BEFORE writing the regex
- API request/response contracts — write schema validation tests before implementing handlers
- Remediation policy evaluation — write policy enforcement tests before the engine
- Feature flag evaluation logic (Epic 10.1)

**When integration tests lead (test-after acceptable):**
- AWS SDK wiring (agent ↔ EC2/IAM/RDS describe calls) — mock the SDK first, integration test confirms the wiring
- DynamoDB persistence — write the schema, then integration tests against DynamoDB Local
- Slack Block Kit formatting — render the block, visually verify, then snapshot test
- CI/CD pipeline configuration — validate by running it, not by unit testing YAML

**When E2E tests lead:**
- Onboarding flow (`drift init` → `drift check` → Slack alert) — the happy path must work end-to-end before any unit tests are written for the CLI
- Remediation round-trip (Slack button → agent apply → resolution) — too many moving parts to unit test first

### 1.3 Test Naming Conventions

**Go (Agent, State Manager):**
```go
// Pattern: Test<Component>_<Scenario>_<ExpectedOutcome>
func TestDriftComparator_SecurityGroupIngressAdded_ReturnsCriticalDrift(t *testing.T)
func TestSecretScrubber_PasswordAttribute_ReturnsRedacted(t *testing.T)
func TestStateParser_V4Format_ExtractsManagedResources(t *testing.T)

// Table-driven test naming: use descriptive name field
tests := []struct {
    name string
    // ...
}{
    {name: "security group with public CIDR → critical"},
    {name: "tag-only change → low severity"},
    {name: "IAM policy document changed → high severity"},
}
```

**TypeScript (SaaS, Dashboard API):**
```typescript
// Pattern: describe("<Component>") > describe("<method/scenario>") > it("<expected behavior>")
describe("DriftClassifier", () => {
  describe("classify()", () => {
    it("returns critical severity for security group with 0.0.0.0/0 ingress")
    it("returns low severity for tag-only changes")
    it("falls back to medium/configuration for unmatched resource types")
  })
})
```

**Integration & E2E:**
```
// File naming: <component>_integration_test.go / <flow>_e2e_test.go
agent_dynamodb_integration_test.go
drift_report_ingestion_integration_test.go
onboarding_flow_e2e_test.go
remediation_roundtrip_e2e_test.go
```

---

## Section 2: Test Pyramid

### 2.1 Recommended Ratio

```
         ┌─────────────────┐
         │   E2E / Smoke   │  ~10%  (~50 tests)
         │  (LocalStack,   │
         │   real flows)   │
         ├─────────────────┤
         │  Integration    │  ~20%  (~100 tests)
         │  (boundaries,   │
         │   real DBs)     │
         ├─────────────────┤
         │   Unit Tests    │  ~70%  (~350 tests)
         │  (pure logic,   │
         │   fast, mocked) │
         └─────────────────┘
```

Target: **~500 tests total at V1 launch**, growing to ~1,000 by month 3.

### 2.2 Unit Test Targets (Per Component)

| Component | Language | Target Coverage | Key Test Count |
|---|---|---|---|
| State Parser (TF v4) | Go | 95% | ~40 tests |
| Drift Comparator | Go | 95% | ~60 tests |
| Drift Classifier | Go | 90% | ~30 tests |
| Secret Scrubber | Go | 100% | ~50 tests |
| Drift Scorer | Go/TS | 90% | ~20 tests |
| Event Processor (ingestion) | TypeScript | 85% | ~30 tests |
| Notification Formatter | TypeScript | 85% | ~25 tests |
| Remediation Engine | TypeScript | 85% | ~30 tests |
| Dashboard API handlers | TypeScript | 80% | ~40 tests |
| Feature Flag evaluator | Go | 90% | ~20 tests |
| Policy engine | Go/TS | 95% | ~30 tests |

### 2.3 Integration Test Boundaries

| Boundary | Test Type | Infrastructure |
|---|---|---|
| Agent ↔ AWS EC2/IAM/RDS APIs | Integration | LocalStack or recorded HTTP fixtures |
| Agent ↔ SaaS API (drift report POST) | Integration | Real HTTP server (test instance) |
| Event Processor ↔ DynamoDB | Integration | DynamoDB Local (Testcontainers) |
| Event Processor ↔ PostgreSQL | Integration | PostgreSQL (Testcontainers) |
| Event Processor ↔ SQS | Integration | LocalStack SQS |
| Notification Service ↔ Slack API | Integration | Slack API mock server |
| Remediation Engine ↔ Agent | Integration | Agent stub server |
| Dashboard API ↔ PostgreSQL (RLS) | Integration | PostgreSQL (Testcontainers) — multi-tenant isolation tests |

### 2.4 E2E / Smoke Test Scenarios

| Scenario | Priority | Infrastructure |
|---|---|---|
| Install agent → run `drift check` → detect drift → Slack alert | P0 | LocalStack + Slack mock |
| Agent heartbeat → SaaS records it → dashboard shows "online" | P0 | LocalStack |
| Click [Revert] in Slack → agent executes terraform apply → event resolved | P0 | LocalStack + agent stub |
| Click [Accept] → GitHub PR created with code patch | P1 | GitHub API mock |
| Free tier stack limit enforcement (register 2nd stack → 403) | P1 | Real SaaS test env |
| Secret scrubbing end-to-end (state with password → report has [REDACTED]) | P0 | Agent + SaaS test env |
| Multi-tenant isolation (org A cannot see org B drift events) | P0 | PostgreSQL + RLS |
| Agent offline detection (no heartbeat → Slack "agent offline" alert) | P1 | LocalStack |

---

## Section 3: Unit Test Strategy (Per Component)

### 3.1 State Parser (Go — Epic 1, Story 1.1)

**What to test:**
- Correct extraction of `managed` resources (skip `data` sources)
- Module-prefixed addresses (`module.vpc.aws_security_group.api`)
- Multi-instance resources (`aws_instance.worker[0]`, `aws_instance.worker[1]`)
- Graceful handling of unknown/future resource types
- Rejection of non-v4 state format versions
- Empty state file (zero resources)
- State file with only data sources (zero managed resources)
- `private` field stripped from all instances before returning

**Key test cases:**
```go
func TestStateParser_V4Format_ExtractsManagedResources(t *testing.T) {}
func TestStateParser_DataSourceResources_AreExcluded(t *testing.T) {}
func TestStateParser_ModulePrefixedAddress_ParsedCorrectly(t *testing.T) {}
func TestStateParser_MultiInstanceResource_AllInstancesExtracted(t *testing.T) {}
func TestStateParser_UnsupportedVersion_ReturnsError(t *testing.T) {}
func TestStateParser_EmptyState_ReturnsEmptyResourceList(t *testing.T) {}
func TestStateParser_PrivateField_IsStrippedFromAttributes(t *testing.T) {}
```

**Mocking strategy:** None — pure function over a JSON byte slice. Fixtures in `testdata/states/`.

**Table-driven pattern:**
```go
func TestStateParser_ResourceExtraction(t *testing.T) {
    tests := []struct {
        name          string
        fixtureFile   string
        wantCount     int
        wantAddresses []string
        wantErr       bool
    }{
        {name: "single managed resource", fixtureFile: "testdata/states/single_sg.tfstate", wantCount: 1},
        {name: "state v3 format", fixtureFile: "testdata/states/v3_format.tfstate", wantErr: true},
        {name: "module-nested resources", fixtureFile: "testdata/states/module_nested.tfstate", wantCount: 5},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            data, err := os.ReadFile(tt.fixtureFile)
            require.NoError(t, err, "fixture must be readable")
            got, err := ParseState(data)
            if tt.wantErr { require.Error(t, err); return }
            require.NoError(t, err)
            assert.Len(t, got.Resources, tt.wantCount)
        })
    }
}
```
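
For orientation, the extraction rules listed above (managed resources only, module prefixes, per-instance addresses, v4 only) could be sketched as follows. This is an illustration under assumptions, not the agent's actual `ParseState`: the function name `managedAddresses` and the struct names are hypothetical, and only the state fields needed for address construction are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type stateFile struct {
	Version   int             `json:"version"`
	Resources []stateResource `json:"resources"`
}

type stateResource struct {
	Module    string          `json:"module"`
	Mode      string          `json:"mode"`
	Type      string          `json:"type"`
	Name      string          `json:"name"`
	Instances []stateInstance `json:"instances"`
}

// stateInstance deliberately declares no "private" field, so that field is
// dropped during decoding.
type stateInstance struct {
	IndexKey   interface{}            `json:"index_key"`
	Attributes map[string]interface{} `json:"attributes"`
}

// managedAddresses returns one address per managed resource instance.
func managedAddresses(data []byte) ([]string, error) {
	var sf stateFile
	if err := json.Unmarshal(data, &sf); err != nil {
		return nil, fmt.Errorf("decode state: %w", err)
	}
	if sf.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", sf.Version)
	}
	addrs := []string{}
	for _, r := range sf.Resources {
		if r.Mode != "managed" {
			continue // skip data sources
		}
		base := r.Type + "." + r.Name
		if r.Module != "" {
			base = r.Module + "." + base
		}
		for _, inst := range r.Instances {
			switch k := inst.IndexKey.(type) {
			case float64: // count index; JSON numbers decode as float64
				addrs = append(addrs, fmt.Sprintf("%s[%d]", base, int(k)))
			case string: // for_each key
				addrs = append(addrs, fmt.Sprintf("%s[%q]", base, k))
			default: // single instance without an index
				addrs = append(addrs, base)
			}
		}
	}
	return addrs, nil
}
```

A state containing one data source, one module-nested security group, and a two-instance `aws_instance.worker` would yield three addresses, with the data source omitted.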

---

### 3.2 Drift Comparator (Go — Epic 1, Story 1.3)

**What to test:**
- Attribute added in cloud (not in state) → drift detected
- Attribute removed from cloud (in state, not in cloud) → drift detected
- Attribute value changed → correct old/new values in diff
- Attribute unchanged → no drift
- Nested attribute changes (ingress rules array)
- Ignored attributes (AWS-generated IDs, timestamps, computed fields) → no drift
- Null vs. empty string → treated as no drift
- Boolean drift (`true` → `false`)
- Numeric drift (port numbers, counts)

**Key test cases:**
```go
func TestDriftComparator_AttributeAdded_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_AttributeRemoved_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_AttributeUnchanged_ReturnsNoDrift(t *testing.T) {}
func TestDriftComparator_NestedIngressRuleAdded_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_IgnoredAttribute_ReturnsNoDrift(t *testing.T) {}
func TestDriftComparator_NullVsEmptyString_TreatedAsNoDrift(t *testing.T) {}
func TestDriftComparator_ComputedTimestamp_IsIgnored(t *testing.T) {}
```

**Mocking strategy:** None — pure function. State and cloud attributes are both `map[string]interface{}`.
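
To make the comparison rules concrete, here is a minimal sketch of a flat attribute comparator. It assumes both maps were decoded from JSON (so numbers are `float64`), handles only top-level keys, and uses the hypothetical names `compareAttrs`, `attrChange`, and `normalize`. Nested structures are compared wholesale with `reflect.DeepEqual` rather than diffed element by element.

```go
package main

import (
	"reflect"
	"sort"
)

// attrChange records one drifted attribute with its state and cloud values.
type attrChange struct {
	Key      string
	Old, New interface{}
}

// normalize maps "" to nil so a null in state and an empty string from the
// cloud API compare as equal.
func normalize(v interface{}) interface{} {
	if s, ok := v.(string); ok && s == "" {
		return nil
	}
	return v
}

// compareAttrs returns the attributes that differ between state and cloud,
// skipping ignored keys. A key missing on one side is treated as nil there.
func compareAttrs(state, cloud map[string]interface{}, ignored map[string]bool) []attrChange {
	keys := map[string]bool{}
	for k := range state {
		keys[k] = true
	}
	for k := range cloud {
		keys[k] = true
	}
	var changes []attrChange
	for k := range keys {
		if ignored[k] {
			continue
		}
		oldV, newV := normalize(state[k]), normalize(cloud[k])
		if !reflect.DeepEqual(oldV, newV) {
			changes = append(changes, attrChange{Key: k, Old: oldV, New: newV})
		}
	}
	// Sort by key so the output is deterministic despite map iteration order.
	sort.Slice(changes, func(i, j int) bool { return changes[i].Key < changes[j].Key })
	return changes
}
```

Sorting the result keeps test assertions and snapshot output stable across runs.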

---

### 3.3 Drift Classifier (Go — Epic 3, Story 3.2)

**What to test:**
- Security group with `0.0.0.0/0` ingress → `critical/security`
- IAM role policy document changed → `high/security`
- RDS parameter group changed → `high/configuration`
- Tag-only change → `low/tags`
- Unmatched resource type → `medium/configuration` (default fallback)
- Customer override rules take precedence over defaults
- Rule evaluation order (first match wins)
- Invalid YAML config → error at startup, not at classification time

```go
func TestDriftClassifier_PublicCIDRIngress_ReturnsCriticalSecurity(t *testing.T) {}
func TestDriftClassifier_IAMPolicyChanged_ReturnsHighSecurity(t *testing.T) {}
func TestDriftClassifier_TagOnlyChange_ReturnsLowTags(t *testing.T) {}
func TestDriftClassifier_UnmatchedResource_ReturnsMediumConfiguration(t *testing.T) {}
func TestDriftClassifier_CustomerOverride_TakesPrecedence(t *testing.T) {}
func TestDriftClassifier_InvalidYAML_ReturnsErrorOnLoad(t *testing.T) {}
```
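
The precedence rules above (customer overrides before defaults, first match wins, `medium/configuration` fallback) can be sketched as a small evaluator. The rule shape below is an assumption for illustration: real rules are loaded from YAML and may match on more than resource type and attribute name. All type and function names are hypothetical.

```go
package main

// change is the minimal input needed by this sketch.
type change struct {
	ResourceType string
	Attribute    string
}

type classification struct {
	Severity string
	Category string
}

// rule matches on exact resource type and attribute; an empty field matches anything.
type rule struct {
	ResourceType string
	Attribute    string
	Result       classification
}

// classify evaluates customer rules before defaults and returns the first
// match. Unmatched changes fall back to medium/configuration.
func classify(c change, customer, defaults []rule) classification {
	for _, rules := range [][]rule{customer, defaults} {
		for _, r := range rules {
			typeOK := r.ResourceType == "" || r.ResourceType == c.ResourceType
			attrOK := r.Attribute == "" || r.Attribute == c.Attribute
			if typeOK && attrOK {
				return r.Result
			}
		}
	}
	return classification{Severity: "medium", Category: "configuration"}
}
```

Because evaluation stops at the first match, rule order in the config is part of the contract and deserves its own test.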

---

### 3.4 Secret Scrubber (Go — Epic 1, Story 1.4) — **100% Coverage Required**

Every secret pattern is a security requirement. No table-driven shortcuts — each pattern gets its own named test.

**Key test cases:**
```go
func TestSecretScrubber_PasswordKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_SecretKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_TokenKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PrivateKeyKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_ConnectionStringKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_AWSAccessKeyPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PostgresURIPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PEMPrivateKeyPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_JWTTokenPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_SensitiveFlag_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PrivateField_IsStrippedEntirely(t *testing.T) {}
func TestSecretScrubber_NonSensitiveAttribute_PreservesValue(t *testing.T) {}
func TestSecretScrubber_NestedSensitiveKey_RedactsNestedValue(t *testing.T) {}
func TestSecretScrubber_ArrayWithSensitiveValues_AllElementsChecked(t *testing.T) {}
func TestSecretScrubber_RedactedPlaceholder_IsLiteralREDACTEDString(t *testing.T) {}
func TestSecretScrubber_DiffStructureIntact_AfterScrubbing(t *testing.T) {}
```
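
As a rough picture of what these tests exercise, the sketch below combines key-name matching, value-pattern matching, and recursion into nested maps and arrays. The key list and regular expressions are illustrative assumptions, not the agent's actual pattern set, and the name `scrub` is hypothetical. Patterns are compiled once at package level rather than per call.

```go
package main

import "regexp"

const redacted = "[REDACTED]"

// Compiled once at package initialization, never inside scrub.
var (
	sensitiveKey  = regexp.MustCompile(`(?i)password|secret|token|private_key|connection_string`)
	valuePatterns = []*regexp.Regexp{
		regexp.MustCompile(`AKIA[0-9A-Z]{16}`),                   // AWS access key ID
		regexp.MustCompile(`^postgres(ql)?://`),                  // PostgreSQL connection URI
		regexp.MustCompile(`-----BEGIN [A-Z ]*PRIVATE KEY-----`), // PEM private key header
	}
)

// scrub returns a copy of v with sensitive values replaced by the placeholder.
// The input is not modified.
func scrub(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			if k == "private" {
				continue // strip the provider-private blob entirely
			}
			if sensitiveKey.MatchString(k) {
				out[k] = redacted
				continue
			}
			out[k] = scrub(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, val := range t {
			out[i] = scrub(val)
		}
		return out
	case string:
		for _, p := range valuePatterns {
			if p.MatchString(t) {
				return redacted
			}
		}
		return t
	default:
		return v
	}
}
```

Returning a copy keeps the function free of side effects, which makes the "diff structure intact" and "non-sensitive value preserved" tests straightforward to assert.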

---

### 3.5 Drift Scorer (TypeScript — Epic 3, Story 3.4)

```typescript
describe("DriftScorer", () => {
  it("returns 100 for a stack with no drift")
  it("applies heavy penalty for critical severity drift")
  it("applies minimal penalty for low severity drift")
  it("produces weighted score for mixed severity drift")
  it("recalculates upward when drift event is resolved")
  it("handles zero-resource stack without divide-by-zero")
  it("clamps score to a minimum of 0 for catastrophically drifted stacks")
})
```

---

### 3.6 Event Processor — Ingestion & Validation (TypeScript — Epic 3, Story 3.1)

**What to test:**
- Valid drift report → accepted, returns 202
- Missing `stack_id` → 400 `DRIFT_REPORT_INVALID`
- Duplicate `report_id` → 409 `DRIFT_REPORT_DUPLICATE`
- Payload > 1MB → 400 `DRIFT_REPORT_TOO_LARGE`
- Invalid severity value → 400
- Unknown agent ID → 404 `AGENT_NOT_FOUND`
- Revoked agent API key → 403 `AGENT_REVOKED`
- SQS message group ID equals `stack_id`
- SQS deduplication ID equals `report_id`

**Mocking strategy:** Mock `@aws-sdk/client-sqs`. Mock PostgreSQL pool. Use `zod` schema directly in tests.

---

### 3.7 Notification Formatter (TypeScript — Epic 4, Story 4.1)

**What to test:**
- Critical drift → header `🔴 Critical Drift Detected`
- Diff block truncated at Slack's 3000-char block limit
- CloudTrail attribution present → "Changed by: <IAM ARN>"
- CloudTrail attribution absent → "Changed by: Unknown (scheduled scan)"
- All four action buttons present (`drift_revert`, `drift_accept`, `drift_snooze`, `drift_assign`)
- `[REDACTED]` values rendered as-is
- Low severity digest format → no `[Revert]` button

**Mocking strategy:** None — pure function. Use snapshot tests for Block Kit JSON output.
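
The truncation rule is easy to get subtly wrong with multi-byte characters, so a sketch may help. The formatter itself is TypeScript; the Go version below only illustrates the logic, and `truncateDiff` and the marker text are hypothetical. It counts runes rather than bytes so that a multi-byte character is never split.

```go
package main

// slackSectionTextLimit is the 3000-character block limit referenced above.
const slackSectionTextLimit = 3000

// truncateDiff shortens diff to at most limit runes, appending a marker when
// content was cut so readers know the diff is incomplete.
func truncateDiff(diff string, limit int) string {
	const marker = "\n... (truncated)"
	runes := []rune(diff)
	if len(runes) <= limit {
		return diff
	}
	markerRunes := []rune(marker)
	if limit <= len(markerRunes) {
		// Not enough room for the marker; hard cut.
		return string(runes[:limit])
	}
	return string(runes[:limit-len(markerRunes)]) + marker
}
```

A boundary test at exactly 3000 and 3001 characters covers the off-by-one cases.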

---

### 3.8 Remediation Engine (TypeScript — Epic 7, Stories 7.1–7.2)

**What to test:**
- Revert: generates correct `terraform apply -target=<address>` command
- Blast radius: resource with 3 dependents → `blast_radius = 3`
- Blast radius: isolated resource → `blast_radius = 0`
- `require-approval` policy → status `pending`, not `executing`
- `auto-revert` policy for critical → executes without approval gate
- Accept: generates correct code patch for changed attribute
- Accept: creates PR with correct branch name and description
- Agent heartbeat stale → `REMEDIATION_AGENT_OFFLINE`
- Concurrent revert on same resource → `REMEDIATION_IN_PROGRESS`
- Panic mode active → all remediation blocked

**Mocking strategy:** Mock agent command dispatcher. Mock GitHub API client (`@octokit/rest`). Mock PostgreSQL for plan persistence.
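
The blast radius cases above leave open whether dependents are counted directly or transitively. The sketch below assumes transitive counting over a dependency map and is written in Go purely for illustration (the engine itself is TypeScript). The name `blastRadius` and the input shape are hypothetical.

```go
package main

// blastRadius counts the distinct resources that depend on target, directly
// or transitively. dependsOn maps a resource address to the addresses it
// depends on. Counting transitively is an assumption of this sketch.
func blastRadius(target string, dependsOn map[string][]string) int {
	// Invert the graph: for each resource, who depends on it?
	dependents := map[string][]string{}
	for res, deps := range dependsOn {
		for _, d := range deps {
			dependents[d] = append(dependents[d], res)
		}
	}
	// Breadth-first walk over dependents; seen guards against cycles.
	seen := map[string]bool{target: true}
	queue := []string{target}
	count := 0
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range dependents[cur] {
			if !seen[next] {
				seen[next] = true
				count++
				queue = append(queue, next)
			}
		}
	}
	return count
}
```

If the product definition turns out to be direct dependents only, the walk reduces to `len(dependents[target])`.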

---

### 3.9 Feature Flag Evaluator (Go — Epic 10, Story 10.1)

```go
func TestFeatureFlag_EnabledFlag_ExecutesFeature(t *testing.T) {}
func TestFeatureFlag_DisabledFlag_SkipsFeatureWithNoSideEffects(t *testing.T) {}
func TestFeatureFlag_UnknownFlag_ReturnsDefaultOff(t *testing.T) {}
func TestFeatureFlag_EnvVarOverride_TakesPrecedenceOverJSONFile(t *testing.T) {}
func TestFeatureFlag_CircuitBreaker_DisablesFlagOnFalsePositiveSpike(t *testing.T) {}
func TestFeatureFlag_ExpiredTTL_CILintDetectsIt(t *testing.T) {} // lint test, not runtime
```
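
The precedence tested above (environment variable over JSON file, unknown flags default to off) might look like the sketch below. The `DD0C_FLAG_` prefix, the flat JSON shape, and all names are assumptions for illustration. The environment lookup is injected so unit tests do not need to mutate the process environment; production code would pass `os.LookupEnv`.

```go
package main

import (
	"encoding/json"
	"strconv"
	"strings"
)

// flagSet holds file-defined flags plus an injectable environment lookup.
type flagSet struct {
	file   map[string]bool
	lookup func(string) (string, bool)
}

// newFlagSet parses a flat JSON object of flag name to boolean.
func newFlagSet(jsonData []byte, lookup func(string) (string, bool)) (*flagSet, error) {
	fs := &flagSet{file: map[string]bool{}, lookup: lookup}
	if err := json.Unmarshal(jsonData, &fs.file); err != nil {
		return nil, err
	}
	return fs, nil
}

// Enabled checks the environment override first, then the file. Flags that
// appear in neither place are off. Unparseable overrides are ignored.
func (fs *flagSet) Enabled(name string) bool {
	envKey := "DD0C_FLAG_" + strings.ToUpper(strings.ReplaceAll(name, "-", "_"))
	if raw, ok := fs.lookup(envKey); ok {
		if v, err := strconv.ParseBool(raw); err == nil {
			return v
		}
	}
	return fs.file[name]
}
```

Injecting the lookup keeps `TestFeatureFlag_EnvVarOverride_TakesPrecedenceOverJSONFile` hermetic and safe to run in parallel.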

---

### 3.10 Policy Engine (Go — Epic 10, Story 10.5)

```go
func TestPolicyEngine_StrictMode_BlocksAllRemediation(t *testing.T) {}
func TestPolicyEngine_AuditMode_ExecutesAndLogs(t *testing.T) {}
func TestPolicyEngine_CustomerMoreRestrictive_CustomerPolicyWins(t *testing.T) {}
func TestPolicyEngine_CustomerLessRestrictive_SystemPolicyWins(t *testing.T) {}
func TestPolicyEngine_PanicMode_HaltsAllScans(t *testing.T) {}
func TestPolicyEngine_PanicMode_SendsSingleNotification(t *testing.T) {}
func TestPolicyEngine_PolicyDecision_IsLogged(t *testing.T) {}
func TestPolicyEngine_FileReload_NewPolicyTakesEffect(t *testing.T) {}
```
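
The two precedence tests above amount to "the more restrictive policy wins". A minimal sketch, assuming modes can be placed on a single restrictiveness scale; the ordering and the names `mode` and `effectiveMode` are assumptions, not the engine's actual model.

```go
package main

// mode is ordered from least to most restrictive. This ordering is an
// assumption of the sketch.
type mode int

const (
	modeAudit           mode = iota // executes remediation and logs it
	modeRequireApproval             // waits for human approval
	modeStrict                      // blocks all remediation
)

// effectiveMode returns the more restrictive of the system and customer modes,
// so a customer can tighten the system policy but never loosen it.
func effectiveMode(system, customer mode) mode {
	if customer > system {
		return customer
	}
	return system
}
```

Expressing the rule as a pure function over two values makes both precedence tests one-line assertions.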

---

## Section 4: Integration Test Strategy

### 4.1 Agent ↔ Cloud Provider APIs

**Goal:** Verify the agent correctly maps Terraform resource types to AWS describe calls and handles API responses.

**Approach:** Use recorded HTTP fixtures (via `go-vcr` or `httpmock`) for unit-speed integration tests. Use LocalStack for full integration runs in CI.

**Key test cases:**
```go
// pkg/agent/integration/aws_polling_test.go
func TestAWSPolling_SecurityGroup_MapsToDescribeSecurityGroups(t *testing.T) {}
func TestAWSPolling_IAMRole_MapsToGetRole(t *testing.T) {}
func TestAWSPolling_RDSInstance_MapsToDescribeDBInstances(t *testing.T) {}
func TestAWSPolling_ResourceNotFound_ReturnsUnknownDriftState(t *testing.T) {}
func TestAWSPolling_RateLimitResponse_RetriesWithBackoff(t *testing.T) {}
func TestAWSPolling_CredentialError_ReturnsDescriptiveError(t *testing.T) {}
func TestAWSPolling_RegionScopedRequest_UsesConfiguredRegion(t *testing.T) {}
```

**Fixture strategy:**
```
testdata/
  aws-responses/
    ec2_describe_security_groups_clean.json      # cloud matches state
    ec2_describe_security_groups_drifted.json    # ingress rule added
    iam_get_role_policy_changed.json
    rds_describe_db_instances_clean.json
    ec2_describe_security_groups_not_found.json  # resource deleted from cloud
```

---

### 4.2 Agent ↔ SaaS API (Drift Report Submission)

**Goal:** Verify the agent correctly serializes and transmits `DriftReport` payloads, handles auth errors, and respects rate limit responses.

**Setup:** Spin up a lightweight HTTP test server in Go (`httptest.NewServer`) that mimics the SaaS ingestion endpoint.

```go
func TestTransmitter_ValidReport_Returns202(t *testing.T) {}
func TestTransmitter_InvalidAPIKey_Returns401AndStopsRetrying(t *testing.T) {}
func TestTransmitter_RevokedAPIKey_Returns403AndStopsRetrying(t *testing.T) {}
func TestTransmitter_RateLimited_RespectsRetryAfterHeader(t *testing.T) {}
func TestTransmitter_ServerError_RetriesWithExponentialBackoff(t *testing.T) {}
func TestTransmitter_PayloadCompressed_WhenOverThreshold(t *testing.T) {}
func TestTransmitter_mTLSCertPresented_OnEveryRequest(t *testing.T) {}
func TestTransmitter_NetworkTimeout_RetriesUpToMaxAttempts(t *testing.T) {}
```
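
The retry behaviour these tests describe can be isolated into a pure decision function, which is then testable without a server. The sketch below is an assumption about how the transmitter might be structured; `retryDelay` and its parameters are hypothetical. It handles only the delta-seconds form of `Retry-After` and does not cover network timeouts, which produce no status code.

```go
package main

import (
	"net/http"
	"strconv"
	"time"
)

// retryDelay decides whether a failed request should be retried and how long
// to wait. attempt is the zero-based index of the attempt that just failed.
func retryDelay(status int, retryAfter string, attempt, maxAttempts int, base time.Duration) (time.Duration, bool) {
	if attempt+1 >= maxAttempts {
		return 0, false // retry budget exhausted
	}
	switch {
	case status == http.StatusUnauthorized || status == http.StatusForbidden:
		return 0, false // invalid or revoked key; retrying cannot succeed
	case status == http.StatusTooManyRequests:
		if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
			return time.Duration(secs) * time.Second, true
		}
		return backoff(base, attempt), true
	case status >= 500:
		return backoff(base, attempt), true
	default:
		return 0, false // success or a non-retryable client error
	}
}

// backoff returns base * 2^attempt.
func backoff(base time.Duration, attempt int) time.Duration {
	return base << uint(attempt)
}
```

Production code would typically add jitter and an upper bound on the delay; both are omitted here to keep the decision logic visible.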

---

### 4.3 Event Processor ↔ DynamoDB (Testcontainers)

**Goal:** Verify event sourcing writes, TTL attribute setting, and checksum generation against a real DynamoDB Local instance.

**Setup:**
```go
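// Use testcontainers-go to spin up DynamoDB Local
func setupDynamoDBLocal(t *testing.T) *dynamodb.Client {
    ctx := context.Background()
    container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: testcontainers.ContainerRequest{
            Image:        "amazon/dynamodb-local:latest",
            ExposedPorts: []string{"8000/tcp"},
            WaitingFor:   wait.ForListeningPort("8000/tcp"),
        },
        Started: true,
    })
    require.NoError(t, err)
    t.Cleanup(func() { _ = container.Terminate(ctx) })

    // Point the SDK at the container's mapped port. Region and credentials are
    // dummy values (assumed); DynamoDB Local does not validate them.
    host, err := container.Host(ctx)
    require.NoError(t, err)
    port, err := container.MappedPort(ctx, "8000/tcp")
    require.NoError(t, err)
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"),
        config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider("test", "test", "")))
    require.NoError(t, err)
    return dynamodb.NewFromConfig(cfg, func(o *dynamodb.Options) {
        o.BaseEndpoint = aws.String("http://" + host + ":" + port.Port())
    })
}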
```

**Key test cases:**
```go
func TestDynamoDBEventStore_AppendDriftEvent_PersistsWithCorrectPK(t *testing.T) {}
func TestDynamoDBEventStore_AppendDriftEvent_SetsChecksumAttribute(t *testing.T) {}
func TestDynamoDBEventStore_AppendDriftEvent_SetsTTLPerTier(t *testing.T) {}
func TestDynamoDBEventStore_QueryByStackID_ReturnsChronologicalOrder(t *testing.T) {}
func TestDynamoDBEventStore_DuplicateEventID_IsIdempotent(t *testing.T) {}
func TestDynamoDBEventStore_FreeTier_TTL90Days(t *testing.T) {}
func TestDynamoDBEventStore_EnterpriseTier_TTL7Years(t *testing.T) {}
```
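
DynamoDB TTL attributes hold an expiry time in Unix epoch seconds, so the per-tier tests reduce to arithmetic on the creation timestamp. A minimal sketch, assuming only the two tiers named above and approximating seven years as 7 × 365 days; `ttlEpoch` and `retentionByTier` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// retentionByTier lists the retention periods named in the tests above.
// Other tiers are not specified in this document and are left out.
var retentionByTier = map[string]time.Duration{
	"free":       90 * 24 * time.Hour,
	"enterprise": 7 * 365 * 24 * time.Hour, // leap days ignored in this sketch
}

// ttlEpoch returns the expiry time in Unix epoch seconds for an event created
// at createdUnix.
func ttlEpoch(tier string, createdUnix int64) (int64, error) {
	retention, ok := retentionByTier[tier]
	if !ok {
		return 0, fmt.Errorf("unknown tier %q", tier)
	}
	return createdUnix + int64(retention/time.Second), nil
}
```

Taking the creation time as a parameter, rather than calling `time.Now()` inside, keeps the tier tests deterministic.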

---

### 4.4 Event Processor ↔ PostgreSQL (Testcontainers + RLS)

**Goal:** Verify multi-tenant data isolation via Row-Level Security. This is the most critical integration test suite.

**Setup:**
```typescript
// Use testcontainers for Node.js to spin up PostgreSQL 16
// Apply full schema migrations before each test suite
// Create two test orgs: orgA and orgB
```

**Key test cases:**
```typescript
describe("PostgreSQL RLS Integration", () => {
  it("org A cannot read org B drift events via direct query")
  it("org A cannot read org B stacks via direct query")
  it("setting app.current_org_id scopes all queries correctly")
  it("missing app.current_org_id returns zero rows (not an error)")
  it("drift event insert without org_id fails FK constraint")
  it("drift score update is scoped to correct org's stack")
  it("concurrent inserts from two orgs do not cross-contaminate")
})
```

**Critical test — cross-tenant isolation:**
```typescript
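it("org A cannot read org B drift events", async () => {
  // Insert drift event for orgB
  await insertDriftEvent(orgBPool, orgBEvent)

  // Pin a single connection: session settings do not carry across pooled connections
  const client = await orgAPool.connect()
  try {
    // SET does not accept bind parameters; set_config() does
    await client.query("SELECT set_config('app.current_org_id', $1, false)", [orgA.id])
    // Query as orgA: should return empty, not orgB's data
    const result = await client.query("SELECT * FROM drift_events")
    expect(result.rows).toHaveLength(0)
  } finally {
    client.release()
  }
})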
```

---

### 4.5 IaC State File Parsing — Multi-Backend Integration

**Goal:** Verify the agent correctly reads state files from different backends (S3, local file, Terraform Cloud).

**Setup:** LocalStack S3 for S3 backend tests. Real file system for local backend. WireMock for Terraform Cloud API.

```go
func TestStateBackend_S3_ReadsStateFileFromBucket(t *testing.T) {}
func TestStateBackend_S3_HandlesVersionedBucket(t *testing.T) {}
func TestStateBackend_LocalFile_ReadsFromFilesystem(t *testing.T) {}
func TestStateBackend_TerraformCloud_AuthenticatesAndFetchesState(t *testing.T) {}
func TestStateBackend_S3_AccessDenied_ReturnsDescriptiveError(t *testing.T) {}
func TestStateBackend_S3_BucketNotFound_ReturnsDescriptiveError(t *testing.T) {}
```

---

### 4.6 Notification Service ↔ Slack API

**Goal:** Verify Slack message delivery, request signature validation, and interactive callback handling.

**Setup:** WireMock or a custom Go HTTP mock server simulating the Slack API.

```typescript
describe("Slack Integration", () => {
  it("delivers Block Kit message to configured channel")
  it("falls back to org default channel when stack channel not set")
  it("validates Slack request signature on interaction callbacks")
  it("rejects interaction callback with invalid signature → 401")
  it("updates original message after [Revert] button click")
  it("handles Slack API rate limit (429) with retry")
  it("handles Slack API 500 — logs error, does not crash Lambda")
})
```

---

### 4.7 Terraform State File Parsing — Real Fixture Files

**Goal:** Verify the parser handles real-world Terraform state files from different provider versions and configurations.

Fixture files sourced from:
- Terraform AWS provider v4.x, v5.x state outputs
- OpenTofu state files (should be identical format)
- State files with modules, count, for_each
- State files with workspace prefixes

```go
func TestStateParser_RealWorldAWSProviderV5_ParsesCorrectly(t *testing.T) {}
func TestStateParser_OpenTofuStateFile_ParsesCorrectly(t *testing.T) {}
func TestStateParser_ForEachResources_AllInstancesExtracted(t *testing.T) {}
func TestStateParser_WorkspacePrefixedState_ParsesCorrectly(t *testing.T) {}
func TestStateParser_LargeStateFile_500Resources_CompletesUnder2Seconds(t *testing.T) {}
```

---

## Section 5: E2E & Smoke Tests

### 5.1 Infrastructure Setup

All E2E tests run against LocalStack (AWS service simulation) and a real PostgreSQL instance. The test environment is defined as a Docker Compose stack:

```yaml
# docker-compose.test.yml
services:
  localstack:
    image: localstack/localstack:3.x  # placeholder: pin a concrete 3.x release tag
    environment:
      SERVICES: s3,sqs,dynamodb,iam,ec2,lambda,eventbridge
      DEBUG: 0
    ports:
      - "4566:4566"

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: drift_test
      POSTGRES_USER: drift
      POSTGRES_PASSWORD: test
    ports:
      - "5432:5432"

  slack-mock:
    image: wiremock/wiremock:latest
    volumes:
      - ./testdata/wiremock/slack:/home/wiremock/mappings
    ports:
      - "8080:8080"

  github-mock:
    image: wiremock/wiremock:latest
    volumes:
      - ./testdata/wiremock/github:/home/wiremock/mappings
    ports:
      - "8081:8080"
```

**Synthetic drift generation:** A helper CLI tool (`testdata/tools/drift-injector`) modifies LocalStack EC2/IAM resources to simulate real drift scenarios without touching real AWS.

---

### 5.2 Critical User Journey: Install → Detect → Notify

**Journey:** Agent installed → `drift check` run → drift detected → Slack alert delivered

```go
// e2e/onboarding_flow_e2e_test.go
func TestE2E_OnboardingFlow_InstallToFirstSlackAlert(t *testing.T) {
    // 1. Register org and agent via API
    org := createTestOrg(t)
    agent := registerAgent(t, org.APIKey)

    // 2. Upload a Terraform state file to LocalStack S3
    uploadStateFixture(t, "testdata/states/prod_networking.tfstate", org.StateBucket)

    // 3. Inject drift into LocalStack EC2 (add 0.0.0.0/0 ingress rule)
    injectSecurityGroupDrift(t, "sg-abc123")

    // 4. Run drift check
    result := runDriftCheck(t, agent, org.StateBucket)
    require.Equal(t, 1, result.DriftedResourceCount)
    require.Equal(t, "critical", result.DriftedResources[0].Severity)

    // 5. Verify Slack mock received the Block Kit message
    slackRequests := getSlackMockRequests(t)
    require.Len(t, slackRequests, 1)
    assert.Contains(t, slackRequests[0].Body, "Critical Drift Detected")
    assert.Contains(t, slackRequests[0].Body, "aws_security_group")

    // 6. Verify drift event persisted in PostgreSQL
    event := getDriftEvent(t, org.ID, result.DriftedResources[0].Address)
    assert.Equal(t, "open", event.Status)
    assert.Equal(t, "critical", event.Severity)
}
```

---

### 5.3 Critical User Journey: Revert Workflow

**Journey:** Slack [Revert] button clicked → remediation engine queues command → agent executes → event resolved

```go
func TestE2E_RemediationRevert_SlackButtonToResolution(t *testing.T) {
    // Setup: existing open drift event
    org, driftEvent := setupOpenDriftEvent(t, "critical")

    // 1. Simulate Slack [Revert] button click
    payload := buildSlackInteractionPayload("drift_revert", driftEvent.ID, org.SlackUserID)
    resp := postSlackInteraction(t, payload, validSlackSignature(payload))
    assert.Equal(t, 200, resp.StatusCode)

    // 2. Verify Slack message updated to "Reverting..."
    slackUpdates := getSlackMockUpdates(t)
    assert.Contains(t, slackUpdates[0].Body, "Reverting")

    // 3. Verify remediation plan created in DB
    plan := waitForRemediationPlan(t, driftEvent.ID, 5*time.Second)
    assert.Equal(t, "executing", plan.Status)
    assert.Contains(t, plan.TargetResources, driftEvent.ResourceAddress)

    // 4. Simulate agent completing the apply
    reportRemediationComplete(t, plan.ID, "success")

    // 5. Verify drift event resolved
    event := getDriftEvent(t, org.ID, driftEvent.ResourceAddress)
    assert.Equal(t, "resolved", event.Status)
    assert.Equal(t, "reverted", event.ResolutionType)

    // 6. Verify final Slack message shows success
    finalUpdate := getLastSlackUpdate(t)
    assert.Contains(t, finalUpdate.Body, "reverted to declared state")
}
```

---

### 5.4 Critical User Journey: Secret Scrubbing End-to-End

**Journey:** State file with secrets → agent processes → drift report transmitted → NO secrets in SaaS database

```go
func TestE2E_SecretScrubbing_NoSecretsReachSaaS(t *testing.T) {
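    // Setup: register org and agent (same helpers as the onboarding flow test)
    org := createTestOrg(t)
    agent := registerAgent(t, org.APIKey)

    // State file contains: master_password = "supersecret123", db_endpoint = "postgres://..."
    uploadStateFixture(t, "testdata/states/rds_with_secrets.tfstate", org.StateBucket)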

    // Inject RDS drift (instance class changed)
    injectRDSDrift(t, "mydb", "db.t3.medium", "db.t3.large")

    // Run drift check
    runDriftCheck(t, agent, org.StateBucket)

    // Verify drift event in DB — no secret values
    event := getDriftEventByResource(t, org.ID, "aws_db_instance.mydb")
    diffJSON, err := json.Marshal(event.Diff)
    require.NoError(t, err)
    assert.NotContains(t, string(diffJSON), "supersecret123")
    assert.NotContains(t, string(diffJSON), "postgres://")
    assert.Contains(t, string(diffJSON), "[REDACTED]")

    // Verify instance class drift IS present (non-secret attribute preserved)
    assert.Contains(t, string(diffJSON), "db.t3.large")
}
```

---

### 5.5 Smoke Tests (Post-Deploy)

Smoke tests run after every production deployment. They hit real endpoints with minimal side effects.

```
smoke/
  health_check_test.go         # GET /health → 200 on all services
  agent_registration_test.go   # Register a smoke-test agent → 200
  heartbeat_test.go            # Send heartbeat → 200
  drift_report_ingestion_test.go # POST minimal drift report → 202
  dashboard_api_test.go        # GET /v1/stacks (smoke org) → 200
  slack_connectivity_test.go   # Verify Slack OAuth token still valid
```

Smoke tests use a dedicated `smoke-test` organization in production with a pre-provisioned API key. They never write to real customer data.

---

## Section 6: Performance & Load Testing

### 6.1 Scan Duration Benchmarks

**Tool:** Go's built-in `testing.B` for agent benchmarks. `k6` for SaaS API load tests.

**Targets:**

| Scenario | Stack Size | Target Duration | Kill Threshold |
|---|---|---|---|
| Full state parse | 100 resources | < 50ms | > 200ms |
| Full state parse | 500 resources | < 200ms | > 1s |
| Full drift check (parse + poll + compare) | 20 resources | < 5s | > 30s |
| Full drift check | 100 resources | < 30s | > 120s |
| Drift report ingestion (SaaS) | single report | < 200ms p99 | > 1s p99 |
| Drift report ingestion (SaaS) | 100 concurrent | < 500ms p99 | > 2s p99 |

**Go benchmark tests:**
```go
// pkg/agent/bench_test.go
func BenchmarkStateParser_100Resources(b *testing.B) {
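    data, err := os.ReadFile("testdata/states/100_resources.tfstate")
    if err != nil {
        b.Fatal(err) // a missing fixture would otherwise benchmark the error path
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := ParseState(data); err != nil {
            b.Fatal(err)
        }
    }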
}

func BenchmarkDriftComparator_100Resources(b *testing.B) {
    stateResources := loadStateFixture("testdata/states/100_resources.tfstate")
    cloudResources := loadCloudFixture("testdata/cloud/100_resources_clean.json")
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = CompareDrift(stateResources, cloudResources)
    }
}

func BenchmarkSecretScrubber_LargeDiff(b *testing.B) {
    diff := loadDiffFixture("testdata/diffs/large_diff_50_attributes.json")
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = ScrubSecrets(diff)
    }
}
```

---

### 6.2 Memory & CPU Profiling

**Goal:** Ensure the agent stays within its ECS task allocation (0.25 vCPU, 512MB) even for large state files.

**Profile targets:**
- State parser memory allocation for 500-resource state files
- Drift comparator heap usage during deep JSON comparison
- Secret scrubber regex compilation (should be compiled once, not per-call)

```go
// Run with: go test -memprofile=mem.out -cpuprofile=cpu.out -bench=.
// Analyze with: go tool pprof mem.out

func TestMemoryProfile_LargeStateFile_Under100MB(t *testing.T) {
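    if testing.Short() { t.Skip("skipping memory profile in short mode") }

    // Read the fixture first so file I/O buffers are not counted against the parser
    data, err := os.ReadFile("testdata/states/500_resources.tfstate")
    require.NoError(t, err)

    var before, after runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&before)

    _, err = ParseState(data)
    require.NoError(t, err)

    runtime.ReadMemStats(&after)
    // TotalAlloc is cumulative, so the difference cannot underflow if a GC cycle
    // runs mid-test (HeapAlloc can shrink). It is an upper bound on peak heap use.
    allocatedMB := float64(after.TotalAlloc-before.TotalAlloc) / 1024 / 1024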
    assert.Less(t, allocatedMB, 100.0, "state parser should use < 100MB for 500 resources")
}
```

**Regex pre-compilation check:**
```go
func TestSecretScrubber_RegexPrecompiled_NotCompiledPerCall(t *testing.T) {
    // Call scrubber 1000 times — if regex is compiled per call, this will be slow
    diff := map[string]interface{}{"password": "test123"}
    start := time.Now()
    for i := 0; i < 1000; i++ {
        ScrubSecrets(diff)
    }
    elapsed := time.Since(start)
    assert.Less(t, elapsed, 100*time.Millisecond, "1000 scrub calls should complete in < 100ms")
}
```
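A wall-clock bound like the one above can be noisy on shared CI runners. As a hedged alternative, allocation counts tend to be more stable: a scrubber that recompiles its pattern on each call allocates far more per call than one that reuses a package-level pattern. The sketch below is illustrative only; `secretKeyExpr`, `scrubPrecompiled`, `scrubPerCall`, and `allocsPerScrub` are hypothetical names, and the pattern is not the real scrubber rule set.

```go
package main

import (
	"regexp"
	"testing"
)

// Hypothetical pattern for illustration; the real scrubber rules are not shown here.
const secretKeyExpr = `(?i)(password|secret|token)`

// Compiled once at package initialization and shared by every call.
var secretKeyPattern = regexp.MustCompile(secretKeyExpr)

// scrubPrecompiled redacts secret-looking keys using the shared pattern.
func scrubPrecompiled(diff map[string]interface{}) map[string]interface{} {
	return scrubWith(secretKeyPattern, diff)
}

// scrubPerCall recompiles the pattern on every call (the anti-pattern).
func scrubPerCall(diff map[string]interface{}) map[string]interface{} {
	return scrubWith(regexp.MustCompile(secretKeyExpr), diff)
}

// scrubWith returns a copy of diff with values of matching keys redacted.
func scrubWith(re *regexp.Regexp, diff map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(diff))
	for k, v := range diff {
		if re.MatchString(k) {
			out[k] = "[REDACTED]"
		} else {
			out[k] = v
		}
	}
	return out
}

// allocsPerScrub reports the average number of heap allocations per call of fn.
func allocsPerScrub(fn func(map[string]interface{}) map[string]interface{}) float64 {
	diff := map[string]interface{}{"password": "test123", "port": 443}
	return testing.AllocsPerRun(100, func() { _ = fn(diff) })
}
```

A test could then assert that the production scrubber's allocation count stays below a fixed ceiling, or below the per-call variant, instead of asserting on elapsed time.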

---

### 6.3 Concurrent Scan Stress Tests

**Goal:** Verify the agent handles concurrent scans (multiple stacks) without race conditions or goroutine leaks.

```go
func TestConcurrentScans_MultipleStacks_NoRaceConditions(t *testing.T) {
    // Run with: go test -race ./...
    const numStacks = 10
    var wg sync.WaitGroup
    errCh := make(chan error, numStacks) // named errCh to avoid shadowing the errors package

    for i := 0; i < numStacks; i++ {
        wg.Add(1)
        go func(stackIdx int) {
            defer wg.Done()
            stateFile := fmt.Sprintf("testdata/states/stack_%d.tfstate", stackIdx)
            _, err := RunDriftCheck(stateFile, mockAWSClient)
            if err != nil { errCh <- err }
        }(i)
    }

    wg.Wait()
    close(errCh)
    for err := range errCh {
        t.Errorf("concurrent scan error: %v", err)
    }
}
```

**SaaS load test (k6):**
```javascript
// load-tests/drift-report-ingestion.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 50 },   // ramp up to 50 concurrent agents
    { duration: '60s', target: 50 },   // hold
    { duration: '10s', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99th percentile < 500ms
    http_req_failed: ['rate<0.01'],    // < 1% error rate
  },
};

export default function () {
  const payload = JSON.stringify(buildDriftReport());
  const res = http.post(`${__ENV.API_URL}/v1/drift-reports`, payload, {
    headers: { 'Authorization': `Bearer ${__ENV.API_KEY}`, 'Content-Type': 'application/json' },
  });
  check(res, { 'status is 202': (r) => r.status === 202 });
}
```

---

## Section 7: CI/CD Pipeline Integration

### 7.1 Test Stages

```
┌─────────────────────────────────────────────────────────────────┐
│  PRE-COMMIT (local, < 30s)                                      │
│  • golangci-lint (Go)                                           │
│  • eslint + tsc --noEmit (TypeScript)                           │
│  • go test -short ./... (unit tests only, no I/O)               │
│  • Feature flag TTL audit (make flag-audit)                     │
│  • Decision log presence check (PRs touching pkg/detection/)    │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│  PR (GitHub Actions, < 5 min)                                   │
│  • Full unit test suite (Go + TypeScript)                       │
│  • go test -race ./... (race detector)                          │
│  • Coverage gate: fail if < 80% overall, < 100% on scrubber     │
│  • Schema migration lint (no destructive changes)               │
│  • Snapshot test diff check (Block Kit formatter)               │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│  MERGE TO MAIN (GitHub Actions, < 10 min)                       │
│  • All unit tests                                               │
│  • Integration tests (Testcontainers: PostgreSQL + DynamoDB)    │
│  • LocalStack integration tests (S3, SQS, EC2 mock)             │
│  • RLS isolation tests (multi-tenant)                           │
│  • Docker build + Trivy scan                                    │
│  • Go benchmark regression check (fail if > 20% slower)        │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│  STAGING DEPLOY (< 15 min)                                      │
│  • E2E test suite against staging environment                   │
│  • Smoke tests (all health endpoints)                           │
│  • Secret scrubbing E2E test                                    │
│  • Multi-tenant isolation E2E test                              │
└─────────────────────────────────────────────────────────────────┘
           │
           ▼ (manual approval gate)
┌─────────────────────────────────────────────────────────────────┐
│  PRODUCTION DEPLOY                                              │
│  • Smoke tests post-deploy                                      │
│  • Canary: route 5% traffic to new version for 10 min          │
│  • Auto-rollback if smoke tests fail                            │
└─────────────────────────────────────────────────────────────────┘
```
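The merge stage's benchmark regression check ("fail if > 20% slower") reduces to a comparison of ns/op values between a stored baseline and the current run. A minimal sketch of that arithmetic follows; `regressed` is a hypothetical helper, and how the baseline is stored and parsed is left open here.

```go
package main

// regressed reports whether the current result is slower than the baseline by
// more than tolerance (0.20 means 20%). Both values are ns/op figures as
// printed by `go test -bench`.
func regressed(baselineNsPerOp, currentNsPerOp, tolerance float64) bool {
	if baselineNsPerOp <= 0 {
		return false // no usable baseline, so there is nothing to compare against
	}
	return currentNsPerOp > baselineNsPerOp*(1+tolerance)
}
```

For example, with a baseline of 1000 ns/op and a tolerance of 0.20, a current result of 1250 ns/op fails the gate while 1190 ns/op passes. Single benchmark runs are noisy, so averaging several runs before comparing is likely to reduce false failures.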

---

### 7.2 Coverage Thresholds & Gates

```yaml
# .github/workflows/test.yml (coverage gate step)
- name: Check coverage thresholds
  run: |
    # Go agent
    go test -coverprofile=coverage.out ./...
    go tool cover -func=coverage.out | grep "total:" | awk '{print $3}' | \
      awk -F'%' '{if ($1 < 80) {print "FAIL: Go coverage " $1 "% < 80%"; exit 1}}'
    
    # Secret scrubber must be 100% (checked per function; the percentage is the last field)
    go tool cover -func=coverage.out | grep "scrubber" | \
      awk '{pct = $NF; sub(/%/, "", pct); if (pct + 0 < 100) {print "FAIL: Scrubber coverage " $0; bad = 1}} END {exit bad}'
    
    # TypeScript SaaS
    npx vitest run --coverage
    # vitest.config.ts enforces: lines: 80, branches: 75, functions: 80
```

**`vitest.config.ts` coverage config:**
```typescript
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        lines: 80,
        branches: 75,
        functions: 80,
        statements: 80,
        // Enforce the thresholds for each file, not only the aggregate
        perFile: true,
      },
    },
  },
})
```

---

### 7.3 Test Parallelization Strategy

**Go:** Tests are parallelized at the package level by default (`go test ./...`). Mark individual tests with `t.Parallel()` where safe. Integration tests that share LocalStack state must NOT be parallelized — use build tags to separate them.

```go
// Unit tests: always parallel
func TestDriftComparator_AttributeAdded_ReturnsDrift(t *testing.T) {
    t.Parallel()
    // ...
}

// Integration tests: sequential within a package, parallel across packages
// go test -p 4 ./... (up to 4 packages in parallel)
// Packages that share LocalStack state need -p 1 or isolated resources per package.
```

**Build tags for test separation:**
```go
//go:build integration
// +build integration

// Run with: go test -tags=integration ./...
// Unit only: go test ./... (no tag)
```

**GitHub Actions matrix:**
```yaml
strategy:
  matrix:
    test-suite:
      - unit-go
      - unit-ts
      - integration-go
      - integration-ts
      - e2e
  fail-fast: false  # don't cancel other suites on first failure
```

---

## Section 8: Transparent Factory Tenet Testing

### 8.1 Feature Flag Behavior (Epic 10, Story 10.1)

**Testing OpenFeature Go SDK integration:**

```go
// pkg/flags/flags_test.go

// Test 1: Flag gates new detection rule
// The in-memory provider lives in github.com/open-feature/go-sdk/openfeature/memprovider.
func TestFeatureFlag_NewDetectionRule_GatedByFlag(t *testing.T) {
    // Set up: flag "pulumi-support" = false
    provider := memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
        "pulumi-support": {State: memprovider.Enabled, DefaultVariant: "off", Variants: map[string]interface{}{"off": false, "on": true}},
    })
    // SetProviderAndWait blocks until the provider is ready, so the flag is visible to the scan.
    require.NoError(t, openfeature.SetProviderAndWait(provider))

    result := RunDriftCheck(pulumiStateFixture)
    assert.ErrorIs(t, result.Err, ErrIaCToolNotSupported)
    assert.Equal(t, 0, result.DriftedResourceCount)
}

// Test 2: Flag enabled — feature executes
func TestFeatureFlag_NewDetectionRule_ExecutesWhenEnabled(t *testing.T) {
    provider := memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
        "pulumi-support": {State: memprovider.Enabled, DefaultVariant: "on", Variants: map[string]interface{}{"off": false, "on": true}},
    })
    require.NoError(t, openfeature.SetProviderAndWait(provider))

    result := RunDriftCheck(pulumiStateFixture)
    require.NoError(t, result.Err)
    assert.Greater(t, result.ResourceCount, 0)
}

// Test 3: Circuit breaker disables flag on false-positive spike
func TestFeatureFlag_CircuitBreaker_TripsOnFalsePositiveSpike(t *testing.T) {
    flag := NewFeatureFlag("new-sg-rule", circuitBreakerConfig{Threshold: 3.0, Window: time.Hour})
    
    // Simulate 10 dismissals in 1 hour (more than 3x the baseline of ~3 per hour)
    for i := 0; i < 10; i++ {
        flag.RecordDismissal()
    }
    
    assert.False(t, flag.IsEnabled(), "circuit breaker should have tripped")
}
```

**TTL lint test (CI enforcement):**
```go
func TestFeatureFlags_NoExpiredTTLs(t *testing.T) {
    flags := LoadAllFlags("../../config/flags.json")
    for _, flag := range flags {
        if flag.Rollout == 100 {
            assert.True(t, time.Now().Before(flag.TTL),
                "flag %q is at 100%% rollout and past TTL %v — clean it up", flag.Name, flag.TTL)
        }
    }
}
```

---

### 8.2 Schema Migration Validation (Epic 10, Story 10.2)

**Goal:** CI blocks any migration that removes, renames, or changes the type of existing DynamoDB attributes.

```go
// tools/schema-lint/main_test.go

func TestSchemaMigration_AddNewAttribute_IsAllowed(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeAdd, AttributeName: "new_field_v2", AttributeType: "S"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.NoError(t, err)
}

func TestSchemaMigration_RemoveAttribute_IsRejected(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeRemove, AttributeName: "event_type"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.ErrorContains(t, err, "destructive schema change: cannot remove attribute 'event_type'")
}

func TestSchemaMigration_RenameAttribute_IsRejected(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeRename, OldName: "payload", NewName: "event_payload"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.ErrorContains(t, err, "destructive schema change: cannot rename attribute")
}

func TestSchemaMigration_ChangeAttributeType_IsRejected(t *testing.T) {
    migration := Migration{
        Changes: []SchemaChange{
            {Type: ChangeTypeModify, AttributeName: "timestamp", OldType: "S", NewType: "N"},
        },
    }
    err := ValidateMigration(migration, currentSchema)
    assert.ErrorContains(t, err, "destructive schema change: cannot change type of attribute 'timestamp'")
}
```
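A validator that would satisfy the four tests above could look roughly like the following. This is a sketch under assumptions: the type and constant names mirror the tests, the schema is modeled as a plain map from attribute name to DynamoDB type, and rejecting an "add" that collides with an existing attribute is an extra rule not stated in the tests.

```go
package main

import "fmt"

// ChangeType enumerates the kinds of schema change a migration may contain.
type ChangeType int

const (
	ChangeTypeAdd ChangeType = iota
	ChangeTypeRemove
	ChangeTypeRename
	ChangeTypeModify
)

// SchemaChange is trimmed to the fields this sketch needs.
type SchemaChange struct {
	Type          ChangeType
	AttributeName string
	OldName       string
	NewName       string
}

// Migration is an ordered list of schema changes.
type Migration struct{ Changes []SchemaChange }

// ValidateMigration returns an error for the first change that alters an
// attribute already present in schema; additions of new names pass.
func ValidateMigration(m Migration, schema map[string]string) error {
	for _, c := range m.Changes {
		switch c.Type {
		case ChangeTypeAdd:
			// Assumption: re-adding an existing name would silently redefine it.
			if _, exists := schema[c.AttributeName]; exists {
				return fmt.Errorf("destructive schema change: attribute '%s' already exists", c.AttributeName)
			}
		case ChangeTypeRemove:
			return fmt.Errorf("destructive schema change: cannot remove attribute '%s'", c.AttributeName)
		case ChangeTypeRename:
			return fmt.Errorf("destructive schema change: cannot rename attribute '%s'", c.OldName)
		case ChangeTypeModify:
			return fmt.Errorf("destructive schema change: cannot change type of attribute '%s'", c.AttributeName)
		default:
			return fmt.Errorf("unknown schema change type %d", c.Type)
		}
	}
	return nil
}
```

Failing on unknown change types keeps the linter closed by default: a new kind of change has to be classified explicitly before CI will accept it.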

---

### 8.3 Decision Log Format Validation (Epic 10, Story 10.3)

```go
// tools/decision-log-lint/main_test.go

func TestDecisionLog_ValidFormat_PassesValidation(t *testing.T) {
    log := DecisionLog{
        Prompt:                 "Why is security group drift classified as critical?",
        Reasoning:              "SG drift is the #1 vector for cloud breaches...",
        AlternativesConsidered: []string{"classify as high", "require manual review"},
        Confidence:             0.9,
        Timestamp:              time.Now(),
        Author:                 "max@dd0c.dev",
    }
    assert.NoError(t, ValidateDecisionLog(log))
}

func TestDecisionLog_MissingReasoning_FailsValidation(t *testing.T) {
    log := DecisionLog{Prompt: "Why?", Confidence: 0.8}
    err := ValidateDecisionLog(log)
    assert.ErrorContains(t, err, "reasoning is required")
}

func TestDecisionLog_ConfidenceOutOfRange_FailsValidation(t *testing.T) {
    log := DecisionLog{Prompt: "Why?", Reasoning: "Because.", Confidence: 1.5}
    err := ValidateDecisionLog(log)
    assert.ErrorContains(t, err, "confidence must be between 0 and 1")
}

// CI check: PRs touching pkg/detection/ must include a decision log
func TestCI_DetectionPackageChange_RequiresDecisionLog(t *testing.T) {
    changedFiles := getChangedFilesInPR()
    touchesDetection := slices.ContainsFunc(changedFiles, func(f string) bool {
        return strings.HasPrefix(f, "pkg/detection/")
    })
    if touchesDetection {
        decisionLogs := findDecisionLogsInPR()
        assert.NotEmpty(t, decisionLogs, "PRs touching pkg/detection/ require a decision log entry")
    }
}
```
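For reference, a validator consistent with the three format tests above might be as small as the sketch below. The struct mirrors the fields used in the tests; which fields are mandatory beyond `Prompt` and `Reasoning` is an assumption, since the tests only exercise reasoning and confidence.

```go
package main

import (
	"errors"
	"strings"
	"time"
)

// DecisionLog mirrors the fields used in the tests above.
type DecisionLog struct {
	Prompt                 string
	Reasoning              string
	AlternativesConsidered []string
	Confidence             float64
	Timestamp              time.Time
	Author                 string
}

// ValidateDecisionLog checks required text fields and the confidence range.
// Timestamp and Author are not enforced here because the tests above pass
// partially filled logs and expect only the reasoning or confidence error.
func ValidateDecisionLog(l DecisionLog) error {
	if strings.TrimSpace(l.Prompt) == "" {
		return errors.New("prompt is required")
	}
	if strings.TrimSpace(l.Reasoning) == "" {
		return errors.New("reasoning is required")
	}
	if l.Confidence < 0 || l.Confidence > 1 {
		return errors.New("confidence must be between 0 and 1")
	}
	return nil
}
```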

---

### 8.4 OTEL Span Assertion Tests (Epic 10, Story 10.4)

**Goal:** Verify that drift classification emits the correct OpenTelemetry spans with required attributes.

```go
// pkg/observability/spans_test.go

func TestOTELSpans_DriftScan_EmitsParentSpan(t *testing.T) {
    exporter := tracetest.NewInMemoryExporter()
    tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
    otel.SetTracerProvider(tp)

    RunDriftScan(testStateFixture, mockAWSClient)

    spans := exporter.GetSpans()
    parentSpans := filterSpansByName(spans, "drift_scan")
    require.Len(t, parentSpans, 1)
}

func TestOTELSpans_DriftClassification_EmitsChildSpanPerResource(t *testing.T) {
    exporter := tracetest.NewInMemoryExporter()
    // ... setup ...

    RunDriftScan(stateWith3Resources, mockAWSClient)

    classificationSpans := filterSpansByName(exporter.GetSpans(), "drift_classification")
    assert.Len(t, classificationSpans, 3) // one per resource
}

func TestOTELSpans_ClassificationSpan_HasRequiredAttributes(t *testing.T) {
    // ... run scan ...
    span := getClassificationSpan(exporter, "aws_security_group.api")

    attrs := span.Attributes()
    assert.Equal(t, "aws_security_group", getAttr(attrs, "drift.resource_type"))
    assert.NotEmpty(t, getAttr(attrs, "drift.severity_score"))
    assert.NotEmpty(t, getAttr(attrs, "drift.classification_reason"))
    // No PII: resource ARN must be hashed, not raw
    assert.NotContains(t, getAttr(attrs, "drift.resource_id"), "arn:aws:")
}

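func TestOTELSpans_NoCustomerPII_InAnySpan(t *testing.T) {
    exporter := tracetest.NewInMemoryExporter()
    tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
    otel.SetTracerProvider(tp)

    // Run scan with a state file containing real-looking ARNs
    RunDriftScan(stateWithRealARNs, mockAWSClient)

    // Service names may contain digits (ec2, s3); region and account may be empty (IAM, S3).
    const rawARN = `arn:aws[a-z-]*:[a-z0-9-]+:[a-z0-9-]*:(\d{12})?:`

    // GetSpans returns SpanStub values: Name and Attributes are fields, not methods.
    for _, span := range exporter.GetSpans() {
        for _, attr := range span.Attributes {
            // Emit renders any attribute type as a string; AsString is empty for non-strings.
            assert.NotRegexp(t, rawARN, attr.Value.Emit(),
                "span %q contains unhashed ARN in attribute %q", span.Name, attr.Key)
        }
    }
}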
```
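The tests above require that `drift.resource_id` carries a hashed identifier rather than a raw ARN. A minimal way to produce such a value, assuming plain SHA-256 is acceptable, is sketched below; `hashResourceID` is a hypothetical helper name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// hashResourceID returns the hex-encoded SHA-256 digest of the raw identifier,
// so the span attribute is stable per resource without exposing the ARN.
func hashResourceID(rawID string) string {
	sum := sha256.Sum256([]byte(rawID))
	return hex.EncodeToString(sum[:])
}
```

Because ARNs follow a predictable structure, an unkeyed hash can in principle be reversed by guessing inputs. If that matters for the threat model, a keyed hash (for example HMAC with a per-tenant key) may be the safer choice.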

---

### 8.5 Governance Policy Enforcement Tests (Epic 10, Story 10.5)

```go
func TestGovernance_StrictMode_RemediationNeverExecutes(t *testing.T) {
    engine := NewRemediationEngine(Policy{GovernanceMode: "strict"})
    
    result, err := engine.Revert(criticalDriftEvent)
    
    require.NoError(t, err) // not an error — just blocked
    assert.Equal(t, "blocked_by_policy", result.Status)
    assert.Contains(t, result.Log, "Remediation blocked by strict mode")
    assert.False(t, mockAgentDispatcher.WasCalled())
}

func TestGovernance_CustomerCannotEscalateAboveSystemPolicy(t *testing.T) {
    systemPolicy := Policy{GovernanceMode: "strict"}
    customerPolicy := Policy{GovernanceMode: "audit"} // customer wants less restriction
    
    merged := MergePolicies(systemPolicy, customerPolicy)
    assert.Equal(t, "strict", merged.GovernanceMode, "customer cannot override system to be less restrictive")
}

func TestGovernance_PanicMode_HaltsAllScansImmediately(t *testing.T) {
    agent := NewDriftAgent(Policy{PanicMode: true})
    
    result := agent.RunScan(testStateFixture)
    
    assert.ErrorIs(t, result.Err, ErrPanicModeActive)
    assert.False(t, mockAWSClient.WasCalled(), "no AWS API calls should be made in panic mode")
}

func TestGovernance_PanicMode_SendsExactlyOneNotification(t *testing.T) {
    agent := NewDriftAgent(Policy{PanicMode: true})
    
    // Run scan 3 times — should only notify once
    for i := 0; i < 3; i++ {
        agent.RunScan(testStateFixture)
    }
    
    assert.Equal(t, 1, mockNotifier.CallCount(), "panic mode should send exactly one notification")
}
```
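The "customer cannot escalate" rule above amounts to taking the more restrictive of two modes. A sketch of that merge follows, assuming an ordering in which `strict` outranks `audit`. Only those two modes appear in the tests, so the ranking table, the helper `rank`, and the treatment of unknown modes are assumptions.

```go
package main

// Policy is trimmed to the one field this sketch needs.
type Policy struct{ GovernanceMode string }

// restrictiveness ranks governance modes; a higher value is stricter.
// The ordering is an assumption made for illustration.
var restrictiveness = map[string]int{"audit": 1, "strict": 2}

// rank returns 0 for unknown modes, so an unrecognized customer value can
// never replace the system policy.
func rank(mode string) int { return restrictiveness[mode] }

// MergePolicies starts from the system policy and adopts the customer's mode
// only when it is stricter than the system's.
func MergePolicies(system, customer Policy) Policy {
	merged := system
	if rank(customer.GovernanceMode) > rank(system.GovernanceMode) {
		merged.GovernanceMode = customer.GovernanceMode
	}
	return merged
}
```

Under this rule a customer may tighten governance (audit to strict) but never loosen it, which is the behavior the test name describes.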

---

## Section 9: Test Data & Fixtures

### 9.1 Directory Structure

```
testdata/
  states/
    # Terraform state v4 fixtures
    single_sg.tfstate                    # 1 resource: aws_security_group
    single_rds.tfstate                   # 1 resource: aws_db_instance (with secrets)
    prod_networking.tfstate              # 23 resources: VPC, SGs, subnets, routes
    prod_compute.tfstate                 # 47 resources: EC2, IAM, Lambda, ECS
    100_resources.tfstate                # benchmark fixture
    500_resources.tfstate                # benchmark fixture
    module_nested.tfstate                # module-prefixed addresses
    for_each_resources.tfstate           # for_each instances
    v3_format.tfstate                    # invalid: old format (should error)
    rds_with_secrets.tfstate             # contains master_password, connection strings
    opentofu_state.tfstate               # OpenTofu-generated state

  aws-responses/
    # Recorded AWS API responses (go-vcr cassettes)
    ec2/
      describe_sg_clean.json             # cloud matches state
      describe_sg_ingress_added.json     # 0.0.0.0/0 rule added
      describe_sg_ingress_removed.json   # rule removed
      describe_sg_not_found.json         # resource deleted from cloud
    iam/
      get_role_clean.json
      get_role_policy_changed.json
      get_role_not_found.json
    rds/
      describe_db_instances_clean.json
      describe_db_instances_class_changed.json
      describe_db_instances_publicly_accessible.json  # critical: made public

  diffs/
    # Pre-computed drift diff fixtures
    sg_ingress_added_critical.json
    iam_policy_changed_high.json
    rds_class_changed_high.json
    tag_only_change_low.json
    large_diff_50_attributes.json        # benchmark fixture

  wiremock/
    slack/
      post_message_success.json
      post_message_rate_limited.json
      post_message_channel_not_found.json
      interactions_revert_payload.json
    github/
      create_branch_success.json
      create_pr_success.json
      create_pr_repo_not_found.json

  policies/
    strict_mode.json
    audit_mode.json
    auto_revert_critical.json
    require_approval_iam.json
```

---

### 9.2 State File Factory (Go)

A factory package generates synthetic Terraform state files for tests. This avoids brittle fixture files that break when the state format changes.

```go
// testutil/statefactory/factory.go

type StateFactory struct {
    version          int
    terraformVersion string
    resources        []StateResource
}

func NewStateFactory() *StateFactory {
    return &StateFactory{version: 4, terraformVersion: "1.7.0"}
}

func (f *StateFactory) WithSecurityGroup(name, vpcID string, ingress []IngressRule) *StateFactory {
    f.resources = append(f.resources, StateResource{
        Mode:     "managed",
        Type:     "aws_security_group",
        Name:     name,
        Provider: `provider["registry.terraform.io/hashicorp/aws"]`, // v4 state wraps the source address in provider["..."]
        Instances: []ResourceInstance{{
            Attributes: map[string]interface{}{
                "id":          fmt.Sprintf("sg-%s", randID()),
                "name":        name,
                "vpc_id":      vpcID,
                "ingress":     ingress,
                "egress":      defaultEgressRules(),
                "tags":        map[string]string{"ManagedBy": "terraform"},
            },
        }},
    })
    return f
}

func (f *StateFactory) WithIAMRole(name, assumeRolePolicy string) *StateFactory { /* ... */ }
func (f *StateFactory) WithRDSInstance(id, instanceClass string) *StateFactory { /* ... */ }
func (f *StateFactory) WithSecret(key, value string) *StateFactory { /* injects secret into last resource */ }
func (f *StateFactory) Build() []byte { /* marshals to JSON */ }

// Usage in tests:
state := NewStateFactory().
    WithSecurityGroup("api", "vpc-abc123", []IngressRule{{Port: 443, CIDR: "10.0.0.0/8"}}).
    WithIAMRole("lambda-exec", assumeRolePolicyJSON).
    Build()
```

---

### 9.3 Cloud Response Factory (Go)

Mirrors the state factory but for AWS API responses. Used to simulate clean vs. drifted cloud state.

```go
// testutil/cloudfactory/factory.go

type CloudResponseFactory struct{}

// ec2types is github.com/aws/aws-sdk-go-v2/service/ec2/types (SDK v2 throughout).
type SGOption func(*ec2types.SecurityGroup)

func (f *CloudResponseFactory) SecurityGroup(id string, opts ...SGOption) *ec2types.SecurityGroup {
    sg := &ec2types.SecurityGroup{GroupId: aws.String(id), /* defaults */}
    for _, opt := range opts { opt(sg) }
    return sg
}

// Options for injecting drift:
func WithPublicIngress(port int) SGOption {
    return func(sg *ec2types.SecurityGroup) {
        sg.IpPermissions = append(sg.IpPermissions, ec2types.IpPermission{
            IpProtocol: aws.String("tcp"),
            FromPort:   aws.Int32(int32(port)),
            ToPort:     aws.Int32(int32(port)),
            IpRanges:   []ec2types.IpRange{{CidrIp: aws.String("0.0.0.0/0")}},
        })
    }
}

func WithInstanceClassChanged(newClass string) RDSOption { /* ... */ }
func WithPolicyDocumentChanged(newPolicy string) IAMOption { /* ... */ }
```

---

### 9.4 Drift Scenario Fixtures

Pre-built scenarios covering the most common real-world drift patterns. Each scenario includes: state file, cloud response, expected diff, expected severity.

| Scenario | State Fixture | Cloud Response | Expected Severity | Category |
|---|---|---|---|---|
| Security group: public HTTPS ingress added | `sg_private.tfstate` | `sg_public_443.json` | critical | security |
| Security group: SSH port opened to world | `sg_no_ssh.tfstate` | `sg_ssh_open.json` | critical | security |
| IAM role: `*:*` policy attached | `iam_role_scoped.tfstate` | `iam_role_star_star.json` | critical | security |
| S3 bucket: public access enabled | `s3_private.tfstate` | `s3_public.json` | critical | security |
| RDS: made publicly accessible | `rds_private.tfstate` | `rds_public.json` | critical | security |
| Lambda: runtime changed (python3.8 → python3.12) | `lambda_py38.tfstate` | `lambda_py312.json` | high | configuration |
| ECS service: task count changed (2 → 5) | `ecs_2tasks.tfstate` | `ecs_5tasks.json` | low | scaling |
| EC2 instance: instance type changed | `ec2_t3medium.tfstate` | `ec2_t3large.json` | high | configuration |
| Route53: TTL changed (300 → 60) | `r53_ttl300.tfstate` | `r53_ttl60.json` | medium | configuration |
| Tags: Environment tag changed | `tags_prod.tfstate` | `tags_staging.json` | low | tags |
| Resource deleted from cloud | `sg_exists.tfstate` | `sg_not_found.json` | high | configuration |

---

### 9.5 TypeScript Test Helpers

```typescript
// test/helpers/factories.ts
import { randomUUID } from 'node:crypto'

export const buildDriftEvent = (overrides: Partial<DriftEvent> = {}): DriftEvent => ({
  id: `evt_${randomUUID()}`,
  orgId: 'org_test_001',
  stackId: 'stack_prod_networking',
  resourceAddress: 'aws_security_group.api',
  resourceType: 'aws_security_group',
  severity: 'critical',
  category: 'security',
  status: 'open',
  diff: {
    ingress: {
      old: [{ from_port: 443, cidr_blocks: ['10.0.0.0/8'] }],
      new: [{ from_port: 443, cidr_blocks: ['10.0.0.0/8', '0.0.0.0/0'] }],
    },
  },
  attribution: {
    principal: 'arn:aws:iam::123456789012:user/jsmith',
    sourceIp: '192.168.1.1',
    eventName: 'AuthorizeSecurityGroupIngress',
    attributedAt: new Date().toISOString(),
  },
  createdAt: new Date().toISOString(),
  ...overrides,
})

export const buildOrg = (overrides: Partial<Organization> = {}): Organization => ({
  id: `org_${randomUUID()}`,
  name: 'Test Org',
  slug: 'test-org',
  plan: 'starter',
  maxStacks: 10,
  pollIntervalS: 300,
  ...overrides,
})

export const buildStack = (orgId: string, overrides: Partial<Stack> = {}): Stack => ({
  id: `stack_${randomUUID()}`,
  orgId,
  name: 'prod-networking',
  backendType: 's3',
  backendHash: 'abc123def456',
  iacTool: 'terraform',
  environment: 'prod',
  driftScore: 100.0,
  resourceCount: 23,
  driftedCount: 0,
  ...overrides,
})
```

---

## Section 10: TDD Implementation Order

### 10.1 Bootstrap Sequence (Test Infrastructure First)

Before writing a single product test, the test infrastructure itself must be bootstrapped. This is the meta-TDD step.

```
Week 0 — Test Infrastructure Bootstrap
────────────────────────────────────────
1. Set up Go test project structure
   • testutil/ package with state factory, cloud factory
   • testdata/ directory with initial fixture files
   • golangci-lint config (.golangci.yml)
   • go test -race baseline (should pass with zero tests)

2. Set up TypeScript test project
   • vitest.config.ts with coverage thresholds
   • test/helpers/factories.ts with builder functions
   • ESLint + tsc --noEmit in CI

3. Set up Docker Compose test environment
   • docker-compose.test.yml (LocalStack, PostgreSQL, WireMock)
   • Makefile targets: make test-unit, make test-integration, make test-e2e

4. Set up CI pipeline skeleton
   • GitHub Actions workflow with test stages
   • Coverage reporting (codecov or similar)
   • Feature flag TTL lint check
```

---

### 10.2 Epic-by-Epic TDD Order

The implementation order follows epic dependencies. Tests are written before code at each step.

```
Phase 1: Agent Core (Weeks 1–2)
────────────────────────────────
Write tests first, then implement:

1. TestStateParser_* (Epic 1, Story 1.1)
   → Implement StateParser
   → Fixture: single_sg.tfstate, module_nested.tfstate

2. TestDriftComparator_* (Epic 1, Story 1.3)
   → Implement DriftComparator
   → Depends on: StateParser (need parsed state to compare)

3. TestSecretScrubber_* (Epic 1, Story 1.4) ← ALL 16 tests before any code
   → Implement SecretScrubber
   → This is the highest-risk component. Write every test case first.

4. TestDriftClassifier_* (Epic 3, Story 3.2)
   → Implement DriftClassifier with YAML rules
   → Depends on: DriftComparator output format

5. TestAWSPolling_* (Epic 1, Story 1.2) ← Integration tests lead here
   → Implement AWS resource polling for top 5 resource types
   → Use recorded HTTP fixtures (go-vcr)
   → Add remaining 15 resource types iteratively

Phase 2: Agent Communication (Week 2)
───────────────────────────────────────
6. TestTransmitter_* (Epic 2, Story 2.2)
   → Implement HTTPS transmitter with mTLS
   → Depends on: SecretScrubber (scrub before transmit)

7. TestAgentRegistration_* (Epic 2, Story 2.1)
   → Implement agent registration flow
   → Depends on: Transmitter

8. TestHeartbeat_* (Epic 2, Story 2.3)
   → Implement heartbeat goroutine
   → Depends on: AgentRegistration

Phase 3: SaaS Ingestion Pipeline (Weeks 2–3)
─────────────────────────────────────────────
9. TestEventProcessor_Validation_* (Epic 3, Story 3.1)
   → Implement zod schema validation
   → Write tests for every invalid payload shape

10. TestDynamoDBEventStore_* (Epic 3, Story 3.3) ← Integration tests with Testcontainers
    → Implement DynamoDB persistence
    → Depends on: DynamoDB Local container running

11. TestPostgreSQL_RLS_* (Epic 3, Story 3.3) ← Integration tests with Testcontainers
    → Apply schema migrations
    → Write multi-tenant isolation tests BEFORE any API handlers

12. TestDriftScorer_* (Epic 3, Story 3.4)
    → Implement drift score calculation
    → Depends on: PostgreSQL schema (reads/writes stacks table)

Phase 4: Notifications (Week 3)
─────────────────────────────────
13. TestNotificationFormatter_* (Epic 4, Story 4.1)
    → Implement Block Kit formatter
    → Snapshot tests for output JSON

14. TestSlackDelivery_* (Epic 4, Story 4.2) ← Integration with WireMock
    → Implement Slack API client
    → Depends on: Formatter output

15. TestNotificationBatching_* (Epic 4, Story 4.4)
    → Implement digest queue logic
    → Depends on: Slack delivery working

Phase 5: Dashboard API (Weeks 3–4)
───────────────────────────────────
16. TestDashboardAuth_* (Epic 5, Story 5.1)
    → Implement Cognito JWT middleware
    → RLS context-setting middleware
    → Write auth tests before any route handlers

17. TestStackEndpoints_* (Epic 5, Story 5.2)
    → Implement GET/PATCH /v1/stacks
    → Depends on: Auth middleware + PostgreSQL

18. TestDriftEventEndpoints_* (Epic 5, Story 5.3)
    → Implement GET /v1/drift-events with filters
    → Depends on: Stack endpoints

Phase 6: Slack Bot & Remediation (Week 4)
───────────────────────────────────────────
19. TestSlackInteraction_SignatureValidation_* (Epic 7, Story 7.1)
    → Implement signature verification FIRST
    → Write tests for valid and invalid signatures before any callback logic

20. TestRemediationEngine_* (Epic 7, Stories 7.1–7.2)
    → Implement revert and accept workflows
    → Depends on: Slack interaction handler, PostgreSQL remediation_plans table

21. TestPolicyEngine_* (Epic 10, Story 10.5)
    → Implement governance policy enforcement
    → Wrap remediation engine with policy checks

Phase 7: Transparent Factory Tenets (Week 4, parallel)
────────────────────────────────────────────────────────
22. TestFeatureFlag_* (Epic 10, Story 10.1)
    → Integrate OpenFeature SDK
    → Write flag tests alongside each new feature (not at the end)

23. TestOTELSpans_* (Epic 10, Story 10.4)
    → Add OTEL instrumentation to drift scan
    → Write span assertion tests

24. TestSchemaMigration_* (Epic 10, Story 10.2)
    → Implement schema lint tool
    → Add to CI pipeline

25. TestDecisionLog_* (Epic 10, Story 10.3)
    → Implement decision log validator
    → Add PR template check to CI

Phase 8: E2E & Performance (Weeks 4–5)
───────────────────────────────────────
26. E2E: Onboarding flow (install → detect → notify)
    → Requires all Phase 1–4 components working
    → First E2E test written after unit + integration tests pass

27. E2E: Remediation round-trip (Slack → apply → resolve)
    → Requires Phase 5–6 components

28. Performance benchmarks
    → Run after correctness is established
    → Fail CI if regression > 20%
```

---

### 10.3 Test Dependency Graph

```
StateParser ──────────────────────────────────────────────────────┐
     │                                                             │
     ▼                                                             ▼
DriftComparator ──► SecretScrubber ──► Transmitter ──► E2E: Onboarding
     │
     ▼
DriftClassifier ──► DriftScorer ──► DynamoDB EventStore ──► Dashboard API
                                         │
                                         ▼
                                   PostgreSQL RLS ──► Auth Middleware
                                                           │
                                                           ▼
                                                    Slack Formatter
                                                           │
                                                           ▼
                                                    Slack Delivery
                                                           │
                                                           ▼
                                                  Remediation Engine ──► E2E: Revert
                                                           │
                                                           ▼
                                                    Policy Engine
```

---

### 10.4 "Never Ship Without" Checklist

Before any code ships to production, these tests must be green:

```
□ TestSecretScrubber_* — all 16 tests passing (100% coverage)
□ TestPostgreSQL_RLS_CrossTenantIsolation — org A cannot read org B data
□ TestTransmitter_mTLSCertPresented_OnEveryRequest
□ TestGovernance_StrictMode_RemediationNeverExecutes
□ TestE2E_SecretScrubbing_NoSecretsReachSaaS
□ TestE2E_MultiTenantIsolation_OrgACannotSeeOrgBEvents
□ go test -race ./... — zero race conditions
□ Coverage gate: ≥ 80% overall, 100% on scrubber
□ Schema migration lint: no destructive changes
□ Feature flag TTL audit: no expired flags at 100% rollout
```

---

*Document complete. Total estimated test count at V1 launch: ~500 tests. Target by month 3: ~1,000 tests.*