# dd0c/drift — Test Architecture & TDD Strategy
**Author:** Max Mayfield (Test Architect)
**Date:** February 28, 2026
**Product:** dd0c/drift — IaC Drift Detection & Remediation SaaS
**Status:** Test Architecture Design Document
---
## Section 1: Testing Philosophy & TDD Workflow
### 1.1 Core Philosophy
dd0c/drift is a security-critical product. A missed drift event or a false positive in the remediation engine can cause real infrastructure damage. The testing strategy reflects this: **correctness is non-negotiable, speed is a constraint, not a goal**.
Three principles guide every testing decision:
1. **Tests are the first customer.** Before writing a single line of production code, the test defines the contract. If you can't write a test for it, you don't understand the requirement well enough to build it.
2. **The secret scrubber and RLS are untouchable.** These two components — the agent's secret scrubbing engine and the SaaS's PostgreSQL Row-Level Security — have 100% test coverage requirements. No exceptions. A bug in either is a trust-destroying incident.
3. **Drift detection logic consists of pure functions.** The comparator, scorer, and classifier take inputs and return outputs with no side effects. This makes them trivially testable and means the test suite runs fast even at high coverage.
### 1.2 Red-Green-Refactor Adapted for dd0c/drift
The standard TDD cycle applies, but with domain-specific adaptations:
```
RED → Write a failing test that describes a drift scenario
e.g., "security group ingress rule added to 0.0.0.0/0 → severity: critical"
GREEN → Write the minimum code to make it pass
e.g., add the classification rule to the YAML config + evaluator
REFACTOR → Clean up without breaking the test
e.g., extract the CIDR check into a reusable predicate
```
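As an illustration of the REFACTOR step, the extracted CIDR check might look like the minimal sketch below. The names `isPublicCIDR` and `hasPublicIngress` are hypothetical and not taken from the codebase; the sketch assumes ingress CIDRs arrive as plain strings and treats both `0.0.0.0/0` and `::/0` as public.

```go
package main

import "net/netip"

// isPublicCIDR reports whether cidr covers the whole IPv4 or IPv6 address
// space (prefix length 0). Unparseable input is treated as not public.
// Hypothetical helper for illustration only.
func isPublicCIDR(cidr string) bool {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false
	}
	return prefix.Bits() == 0
}

// hasPublicIngress reports whether any of the given ingress CIDRs is open
// to the world.
func hasPublicIngress(cidrs []string) bool {
	for _, c := range cidrs {
		if isPublicCIDR(c) {
			return true
		}
	}
	return false
}
```

Because the predicate is a pure function, the classification rule and any future rule that needs the same check can share it without extra test setup.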
**When to write tests first (strict TDD):**
- All drift detection logic (comparator, classifier, scorer)
- Secret scrubbing engine — write tests for every secret pattern BEFORE writing the regex
- API request/response contracts — write schema validation tests before implementing handlers
- Remediation policy evaluation — write policy enforcement tests before the engine
- Feature flag evaluation logic (Epic 10.1)
**When integration tests lead (test-after acceptable):**
- AWS SDK wiring (agent ↔ EC2/IAM/RDS describe calls) — mock the SDK first, integration test confirms the wiring
- DynamoDB persistence — write the schema, then integration tests against DynamoDB Local
- Slack Block Kit formatting — render the block, visually verify, then snapshot test
- CI/CD pipeline configuration — validate by running it, not by unit testing YAML
**When E2E tests lead:**
- Onboarding flow (`drift init` → `drift check` → Slack alert) — the happy path must work end-to-end before any unit tests are written for the CLI
- Remediation round-trip (Slack button → agent apply → resolution) — too many moving parts to unit test first
### 1.3 Test Naming Conventions
**Go (Agent, State Manager):**
```go
// Pattern: Test<Component>_<Scenario>_<ExpectedOutcome>
func TestDriftComparator_SecurityGroupIngressAdded_ReturnsCriticalDrift(t *testing.T)
func TestSecretScrubber_PasswordAttribute_ReturnsRedacted(t *testing.T)
func TestStateParser_V4Format_ExtractsManagedResources(t *testing.T)
// Table-driven test naming: use descriptive name field
tests := []struct {
name string
// ...
}{
{name: "security group with public CIDR → critical"},
{name: "tag-only change → low severity"},
{name: "IAM policy document changed → high severity"},
}
```
**TypeScript (SaaS, Dashboard API):**
```typescript
// Pattern: describe("<Component>") > describe("<method/scenario>") > it("<expected behavior>")
describe("DriftClassifier", () => {
describe("classify()", () => {
it("returns critical severity for security group with 0.0.0.0/0 ingress")
it("returns low severity for tag-only changes")
it("falls back to medium/configuration for unmatched resource types")
})
})
```
**Integration & E2E:**
```
// File naming: <component>_integration_test.go / <flow>_e2e_test.go
agent_dynamodb_integration_test.go
drift_report_ingestion_integration_test.go
onboarding_flow_e2e_test.go
remediation_roundtrip_e2e_test.go
```
---
## Section 2: Test Pyramid
### 2.1 Recommended Ratio
```
┌─────────────────┐
│ E2E / Smoke │ ~10% (~50 tests)
│ (LocalStack, │
│ real flows) │
├─────────────────┤
│ Integration │ ~20% (~100 tests)
│ (boundaries, │
│ real DBs) │
├─────────────────┤
│ Unit Tests │ ~70% (~350 tests)
│ (pure logic, │
│ fast, mocked) │
└─────────────────┘
```
Target: **~500 tests total at V1 launch**, growing to ~1,000 by month 3.
### 2.2 Unit Test Targets (Per Component)
| Component | Language | Target Coverage | Key Test Count |
|---|---|---|---|
| State Parser (TF v4) | Go | 95% | ~40 tests |
| Drift Comparator | Go | 95% | ~60 tests |
| Drift Classifier | Go | 90% | ~30 tests |
| Secret Scrubber | Go | 100% | ~50 tests |
| Drift Scorer | Go/TS | 90% | ~20 tests |
| Event Processor (ingestion) | TypeScript | 85% | ~30 tests |
| Notification Formatter | TypeScript | 85% | ~25 tests |
| Remediation Engine | TypeScript | 85% | ~30 tests |
| Dashboard API handlers | TypeScript | 80% | ~40 tests |
| Feature Flag evaluator | Go | 90% | ~20 tests |
| Policy engine | Go/TS | 95% | ~30 tests |
### 2.3 Integration Test Boundaries
| Boundary | Test Type | Infrastructure |
|---|---|---|
| Agent ↔ AWS EC2/IAM/RDS APIs | Integration | LocalStack or recorded HTTP fixtures |
| Agent ↔ SaaS API (drift report POST) | Integration | Real HTTP server (test instance) |
| Event Processor ↔ DynamoDB | Integration | DynamoDB Local (Testcontainers) |
| Event Processor ↔ PostgreSQL | Integration | PostgreSQL (Testcontainers) |
| Event Processor ↔ SQS | Integration | LocalStack SQS |
| Notification Service ↔ Slack API | Integration | Slack API mock server |
| Remediation Engine ↔ Agent | Integration | Agent stub server |
| Dashboard API ↔ PostgreSQL (RLS) | Integration | PostgreSQL (Testcontainers) — multi-tenant isolation tests |
### 2.4 E2E / Smoke Test Scenarios
| Scenario | Priority | Infrastructure |
|---|---|---|
| Install agent → run `drift check` → detect drift → Slack alert | P0 | LocalStack + Slack mock |
| Agent heartbeat → SaaS records it → dashboard shows "online" | P0 | LocalStack |
| Click [Revert] in Slack → agent executes terraform apply → event resolved | P0 | LocalStack + agent stub |
| Click [Accept] → GitHub PR created with code patch | P1 | GitHub API mock |
| Free tier stack limit enforcement (register 2nd stack → 403) | P1 | Real SaaS test env |
| Secret scrubbing end-to-end (state with password → report has [REDACTED]) | P0 | Agent + SaaS test env |
| Multi-tenant isolation (org A cannot see org B drift events) | P0 | PostgreSQL + RLS |
| Agent offline detection (no heartbeat → Slack "agent offline" alert) | P1 | LocalStack |
---
## Section 3: Unit Test Strategy (Per Component)
### 3.1 State Parser (Go — Epic 1, Story 1.1)
**What to test:**
- Correct extraction of `managed` resources (skip `data` sources)
- Module-prefixed addresses (`module.vpc.aws_security_group.api`)
- Multi-instance resources (`aws_instance.worker[0]`, `aws_instance.worker[1]`)
- Graceful handling of unknown/future resource types
- Rejection of non-v4 state format versions
- Empty state file (zero resources)
- State file with only data sources (zero managed resources)
- `private` field stripped from all instances before returning
**Key test cases:**
```go
func TestStateParser_V4Format_ExtractsManagedResources(t *testing.T) {}
func TestStateParser_DataSourceResources_AreExcluded(t *testing.T) {}
func TestStateParser_ModulePrefixedAddress_ParsedCorrectly(t *testing.T) {}
func TestStateParser_MultiInstanceResource_AllInstancesExtracted(t *testing.T) {}
func TestStateParser_UnsupportedVersion_ReturnsError(t *testing.T) {}
func TestStateParser_EmptyState_ReturnsEmptyResourceList(t *testing.T) {}
func TestStateParser_PrivateField_IsStrippedFromAttributes(t *testing.T) {}
```
**Mocking strategy:** None — pure function over a JSON byte slice. Fixtures in `testdata/states/`.
**Table-driven pattern:**
```go
func TestStateParser_ResourceExtraction(t *testing.T) {
	tests := []struct {
		name          string
		fixtureFile   string
		wantCount     int
		wantAddresses []string
		wantErr       bool
	}{
		{name: "single managed resource", fixtureFile: "testdata/states/single_sg.tfstate", wantCount: 1},
		{name: "state v3 format", fixtureFile: "testdata/states/v3_format.tfstate", wantErr: true},
		{name: "module-nested resources", fixtureFile: "testdata/states/module_nested.tfstate", wantCount: 5},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Fail fast on a missing fixture; otherwise a wantErr case could
			// pass for the wrong reason (parsing nil bytes).
			data, err := os.ReadFile(tt.fixtureFile)
			require.NoError(t, err)
			got, err := ParseState(data)
			if tt.wantErr {
				require.Error(t, err)
				return
			}
			require.NoError(t, err)
			assert.Len(t, got.Resources, tt.wantCount)
		})
	}
}
```
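For orientation, the behaviours listed above could be satisfied by a parser along the lines of the following sketch. It is not the project's `ParseState`: the function name `parseStateSketch` and the `managedResource` type are hypothetical, and only the parts of the v4 state layout needed for the example (`version`, `resources[].mode`, `module`, `instances[].index_key`, `attributes`) are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// managedResource is a hypothetical output type for this sketch.
type managedResource struct {
	Address    string
	Attributes map[string]interface{}
}

// rawState mirrors only the subset of the Terraform v4 state layout used here.
type rawState struct {
	Version   int `json:"version"`
	Resources []struct {
		Module    string `json:"module"`
		Mode      string `json:"mode"`
		Type      string `json:"type"`
		Name      string `json:"name"`
		Instances []struct {
			IndexKey   interface{}            `json:"index_key"`
			Attributes map[string]interface{} `json:"attributes"`
			// "private" is deliberately not decoded, so it never leaves the parser.
		} `json:"instances"`
	} `json:"resources"`
}

func parseStateSketch(data []byte) ([]managedResource, error) {
	var st rawState
	if err := json.Unmarshal(data, &st); err != nil {
		return nil, fmt.Errorf("decode state: %w", err)
	}
	if st.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", st.Version)
	}
	out := []managedResource{}
	for _, r := range st.Resources {
		if r.Mode != "managed" {
			continue // skip data sources
		}
		base := r.Type + "." + r.Name
		if r.Module != "" {
			base = r.Module + "." + base
		}
		for _, inst := range r.Instances {
			addr := base
			switch k := inst.IndexKey.(type) {
			case float64: // count index (JSON numbers decode as float64)
				addr = fmt.Sprintf("%s[%d]", base, int(k))
			case string: // for_each key
				addr = fmt.Sprintf("%s[%q]", base, k)
			}
			out = append(out, managedResource{Address: addr, Attributes: inst.Attributes})
		}
	}
	return out, nil
}
```

Dropping `private` by never decoding it keeps the stripping rule structural rather than relying on a later cleanup pass.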
---
### 3.2 Drift Comparator (Go — Epic 1, Story 1.3)
**What to test:**
- Attribute added in cloud (not in state) → drift detected
- Attribute removed from cloud (in state, not in cloud) → drift detected
- Attribute value changed → correct old/new values in diff
- Attribute unchanged → no drift
- Nested attribute changes (ingress rules array)
- Ignored attributes (AWS-generated IDs, timestamps, computed fields) → no drift
- Null vs. empty string → treated as no drift
- Boolean drift (`true` → `false`)
- Numeric drift (port numbers, counts)
**Key test cases:**
```go
func TestDriftComparator_AttributeAdded_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_AttributeRemoved_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_AttributeUnchanged_ReturnsNoDrift(t *testing.T) {}
func TestDriftComparator_NestedIngressRuleAdded_ReturnsDrift(t *testing.T) {}
func TestDriftComparator_IgnoredAttribute_ReturnsNoDrift(t *testing.T) {}
func TestDriftComparator_NullVsEmptyString_TreatedAsNoDrift(t *testing.T) {}
func TestDriftComparator_ComputedTimestamp_IsIgnored(t *testing.T) {}
```
**Mocking strategy:** None — pure function. State and cloud attributes are both `map[string]interface{}`.
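A comparison over two attribute maps might be sketched as follows. This is an assumption-laden illustration, not the real comparator: `compareAttrs`, `attrDiff`, and the `ignored` set are hypothetical names, nested values are compared wholesale with `reflect.DeepEqual`, and `nil` and `""` are both treated as "unset" to match the null-vs-empty rule above.

```go
package main

import (
	"reflect"
	"sort"
)

// attrDiff is a hypothetical record of one drifted attribute.
type attrDiff struct {
	Key      string
	Old, New interface{}
}

// isBlank treats nil and the empty string as the same "unset" value.
func isBlank(v interface{}) bool {
	if v == nil {
		return true
	}
	s, ok := v.(string)
	return ok && s == ""
}

// compareAttrs returns the attributes that differ between state and cloud,
// skipping keys in ignored. Results are sorted by key for stable output.
func compareAttrs(state, cloud map[string]interface{}, ignored map[string]bool) []attrDiff {
	var diffs []attrDiff
	for k, old := range state {
		if ignored[k] {
			continue
		}
		cur := cloud[k] // nil when the attribute was removed in the cloud
		if isBlank(old) && isBlank(cur) {
			continue
		}
		if !reflect.DeepEqual(old, cur) {
			diffs = append(diffs, attrDiff{Key: k, Old: old, New: cur})
		}
	}
	for k, cur := range cloud {
		if _, inState := state[k]; inState || ignored[k] || isBlank(cur) {
			continue
		}
		diffs = append(diffs, attrDiff{Key: k, Old: nil, New: cur}) // added in cloud
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Key < diffs[j].Key })
	return diffs
}
```

Sorting the result matters for tests: Go map iteration order is randomized, so unsorted output would make snapshot-style assertions flaky.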
---
### 3.3 Drift Classifier (Go — Epic 3, Story 3.2)
**What to test:**
- Security group with `0.0.0.0/0` ingress → `critical/security`
- IAM role policy document changed → `high/security`
- RDS parameter group changed → `high/configuration`
- Tag-only change → `low/tags`
- Unmatched resource type → `medium/configuration` (default fallback)
- Customer override rules take precedence over defaults
- Rule evaluation order (first match wins)
- Invalid YAML config → error at startup, not at classification time
```go
func TestDriftClassifier_PublicCIDRIngress_ReturnsCriticalSecurity(t *testing.T) {}
func TestDriftClassifier_IAMPolicyChanged_ReturnsHighSecurity(t *testing.T) {}
func TestDriftClassifier_TagOnlyChange_ReturnsLowTags(t *testing.T) {}
func TestDriftClassifier_UnmatchedResource_ReturnsMediumConfiguration(t *testing.T) {}
func TestDriftClassifier_CustomerOverride_TakesPrecedence(t *testing.T) {}
func TestDriftClassifier_InvalidYAML_ReturnsErrorOnLoad(t *testing.T) {}
```
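The precedence rules above (customer overrides first, then defaults, first match wins, `medium/configuration` as fallback) can be sketched as a small evaluator. The `rule` shape, its field names, and the prefix-matching semantics are assumptions made for this example; the actual YAML schema is not shown in this document.

```go
package main

import "strings"

// rule is a hypothetical in-memory form of one classification rule.
type rule struct {
	ResourceType    string // exact Terraform type, or "*" for any
	AttributePrefix string // "" matches any attribute
	Severity        string
	Category        string
}

type classification struct {
	Severity string
	Category string
}

// classify checks customer overrides before defaults; within each list the
// first matching rule wins. Unmatched input falls back to medium/configuration.
func classify(resourceType, attribute string, overrides, defaults []rule) classification {
	for _, rules := range [][]rule{overrides, defaults} {
		for _, r := range rules {
			typeOK := r.ResourceType == "*" || r.ResourceType == resourceType
			attrOK := r.AttributePrefix == "" || strings.HasPrefix(attribute, r.AttributePrefix)
			if typeOK && attrOK {
				return classification{Severity: r.Severity, Category: r.Category}
			}
		}
	}
	return classification{Severity: "medium", Category: "configuration"}
}
```

Because order decides the outcome, rule-order tests should include at least one input that matches two rules.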
---
### 3.4 Secret Scrubber (Go — Epic 1, Story 1.4) — **100% Coverage Required**
Every secret pattern is a security requirement. No table-driven shortcuts — each pattern gets its own named test.
**Key test cases:**
```go
func TestSecretScrubber_PasswordKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_SecretKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_TokenKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PrivateKeyKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_ConnectionStringKey_RedactsValue(t *testing.T) {}
func TestSecretScrubber_AWSAccessKeyPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PostgresURIPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PEMPrivateKeyPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_JWTTokenPattern_RedactsValue(t *testing.T) {}
func TestSecretScrubber_SensitiveFlag_RedactsValue(t *testing.T) {}
func TestSecretScrubber_PrivateField_IsStrippedEntirely(t *testing.T) {}
func TestSecretScrubber_NonSensitiveAttribute_PreservesValue(t *testing.T) {}
func TestSecretScrubber_NestedSensitiveKey_RedactsNestedValue(t *testing.T) {}
func TestSecretScrubber_ArrayWithSensitiveValues_AllElementsChecked(t *testing.T) {}
func TestSecretScrubber_RedactedPlaceholder_IsLiteralREDACTEDString(t *testing.T) {}
func TestSecretScrubber_DiffStructureIntact_AfterScrubbing(t *testing.T) {}
```
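To make the cases above concrete, here is a minimal sketch of a recursive scrubber. It is illustrative only: the key and value patterns are a small, assumed subset (the real pattern list is not given in this document), and `scrub` is a hypothetical name. Patterns are compiled once at package level, and the input is copied rather than mutated so the diff structure stays intact.

```go
package main

import "regexp"

const redacted = "[REDACTED]"

// Compiled once at package initialization, never per call.
var (
	sensitiveKey    = regexp.MustCompile(`(?i)(password|secret|token|private_key|connection_string)`)
	sensitiveValues = []*regexp.Regexp{
		regexp.MustCompile(`AKIA[0-9A-Z]{16}`),                   // AWS access key ID
		regexp.MustCompile(`postgres(ql)?://\S+`),                // PostgreSQL URI
		regexp.MustCompile(`-----BEGIN [A-Z ]*PRIVATE KEY-----`), // PEM private key
	}
)

// scrub returns a redacted copy of v. Maps and slices are walked recursively;
// the "private" key is dropped entirely.
func scrub(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			if k == "private" {
				continue
			}
			if sensitiveKey.MatchString(k) {
				out[k] = redacted
				continue
			}
			out[k] = scrub(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, val := range t {
			out[i] = scrub(val)
		}
		return out
	case string:
		for _, re := range sensitiveValues {
			if re.MatchString(t) {
				return redacted
			}
		}
		return t
	default:
		return v
	}
}
```

Each pattern in such a list would still get its own named test, as required above; the sketch only shows how key-based and value-based redaction compose.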
---
### 3.5 Drift Scorer (TypeScript — Epic 3, Story 3.4)
```typescript
describe("DriftScorer", () => {
it("returns 100 for a stack with no drift")
it("applies heavy penalty for critical severity drift")
it("applies minimal penalty for low severity drift")
it("produces weighted score for mixed severity drift")
it("recalculates upward when drift event is resolved")
it("handles zero-resource stack without divide-by-zero")
it("caps score at 0 for catastrophically drifted stacks")
})
```
---
### 3.6 Event Processor — Ingestion & Validation (TypeScript — Epic 3, Story 3.1)
**What to test:**
- Valid drift report → accepted, returns 202
- Missing `stack_id` → 400 `DRIFT_REPORT_INVALID`
- Duplicate `report_id` → 409 `DRIFT_REPORT_DUPLICATE`
- Payload > 1MB → 400 `DRIFT_REPORT_TOO_LARGE`
- Invalid severity value → 400
- Unknown agent ID → 404 `AGENT_NOT_FOUND`
- Revoked agent API key → 403 `AGENT_REVOKED`
- SQS message group ID equals `stack_id`
- SQS deduplication ID equals `report_id`
**Mocking strategy:** Mock `@aws-sdk/client-sqs`. Mock PostgreSQL pool. Use `zod` schema directly in tests.
---
### 3.7 Notification Formatter (TypeScript — Epic 4, Story 4.1)
**What to test:**
- Critical drift → header `🔴 Critical Drift Detected`
- Diff block truncated at Slack's 3000-char block limit
- CloudTrail attribution present → `Changed by: <IAM ARN>`
- CloudTrail attribution absent → `Changed by: Unknown (scheduled scan)`
- All four action buttons present (`drift_revert`, `drift_accept`, `drift_snooze`, `drift_assign`)
- `[REDACTED]` values rendered as-is
- Low severity digest format → no `[Revert]` button
**Mocking strategy:** None — pure function. Use snapshot tests for Block Kit JSON output.
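The truncation rule is easy to get wrong at the boundary, so a sketch may help. The formatter itself is TypeScript; the example below is written in Go for consistency with the agent code in this document. `truncateForSlack` and the marker text are hypothetical, and the sketch assumes the limit is counted in characters (runes) rather than bytes.

```go
package main

// slackSectionTextLimit is the 3000-character block limit referenced above.
const slackSectionTextLimit = 3000

// truncateForSlack shortens diff so that the result, including the trailing
// marker, never exceeds the limit. Counting runes avoids cutting a
// multi-byte character in half.
func truncateForSlack(diff string) string {
	const marker = "\n... (truncated)"
	runes := []rune(diff)
	if len(runes) <= slackSectionTextLimit {
		return diff
	}
	keep := slackSectionTextLimit - len([]rune(marker))
	return string(runes[:keep]) + marker
}
```

Tests for this behaviour are most useful at exactly 3000, 3001, and well above the limit.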
---
### 3.8 Remediation Engine (TypeScript — Epic 7, Stories 7.1-7.2)
**What to test:**
- Revert: generates correct `terraform apply -target=<address>` command
- Blast radius: resource with 3 dependents → `blast_radius = 3`
- Blast radius: isolated resource → `blast_radius = 0`
- `require-approval` policy → status `pending`, not `executing`
- `auto-revert` policy for critical → executes without approval gate
- Accept: generates correct code patch for changed attribute
- Accept: creates PR with correct branch name and description
- Agent heartbeat stale → `REMEDIATION_AGENT_OFFLINE`
- Concurrent revert on same resource → `REMEDIATION_IN_PROGRESS`
- Panic mode active → all remediation blocked
**Mocking strategy:** Mock agent command dispatcher. Mock GitHub API client (`@octokit/rest`). Mock PostgreSQL for plan persistence.
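The blast radius cases can be illustrated with a small graph walk. The engine is TypeScript; this sketch uses Go for consistency with the agent code in this document. It assumes blast radius counts every resource that depends on the target directly or transitively, which this document does not state explicitly, and `blastRadius` and the `dependents` map are hypothetical names.

```go
package main

// blastRadius counts the resources that depend on target, directly or
// transitively. dependents maps a resource address to the addresses that
// depend on it directly. Cycles are handled via the visited set.
func blastRadius(dependents map[string][]string, target string) int {
	visited := map[string]bool{target: true}
	queue := []string{target}
	count := 0
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, d := range dependents[cur] {
			if !visited[d] {
				visited[d] = true
				count++
				queue = append(queue, d)
			}
		}
	}
	return count
}
```

If the product defines blast radius as direct dependents only, the loop reduces to `len(dependents[target])`; the test cases above pass under either definition only when the three dependents are all direct.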
---
### 3.9 Feature Flag Evaluator (Go — Epic 10, Story 10.1)
```go
func TestFeatureFlag_EnabledFlag_ExecutesFeature(t *testing.T) {}
func TestFeatureFlag_DisabledFlag_SkipsFeatureWithNoSideEffects(t *testing.T) {}
func TestFeatureFlag_UnknownFlag_ReturnsDefaultOff(t *testing.T) {}
func TestFeatureFlag_EnvVarOverride_TakesPrecedenceOverJSONFile(t *testing.T) {}
func TestFeatureFlag_CircuitBreaker_DisablesFlagOnFalsePositiveSpike(t *testing.T) {}
func TestFeatureFlag_ExpiredTTL_CILintDetectsIt(t *testing.T) {} // lint test, not runtime
```
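The precedence these tests describe (environment variable over JSON file, unknown flags default to off) might be implemented roughly as below. This is a sketch under assumptions: the `DRIFT_FLAG_<NAME>` variable naming, the flat JSON object format, and the `flagSet` type are all invented for the example. The environment lookup is injected so tests do not need to modify the real process environment; production code could pass `os.LookupEnv`.

```go
package main

import (
	"encoding/json"
	"strconv"
	"strings"
)

// flagSet is a hypothetical flag store: file values plus an env lookup.
type flagSet struct {
	fromFile  map[string]bool
	lookupEnv func(string) (string, bool)
}

// newFlagSet parses a flat JSON object such as {"auto-revert": true}.
func newFlagSet(fileJSON []byte, lookupEnv func(string) (string, bool)) (*flagSet, error) {
	flags := map[string]bool{}
	if err := json.Unmarshal(fileJSON, &flags); err != nil {
		return nil, err
	}
	return &flagSet{fromFile: flags, lookupEnv: lookupEnv}, nil
}

// Enabled resolves a flag: a parseable env override wins, then the file
// value, and unknown flags are off.
func (f *flagSet) Enabled(name string) bool {
	envKey := "DRIFT_FLAG_" + strings.ToUpper(strings.ReplaceAll(name, "-", "_"))
	if raw, ok := f.lookupEnv(envKey); ok {
		if v, err := strconv.ParseBool(raw); err == nil {
			return v
		}
	}
	return f.fromFile[name] // missing key yields false
}
```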
---
### 3.10 Policy Engine (Go — Epic 10, Story 10.5)
```go
func TestPolicyEngine_StrictMode_BlocksAllRemediation(t *testing.T) {}
func TestPolicyEngine_AuditMode_ExecutesAndLogs(t *testing.T) {}
func TestPolicyEngine_CustomerMoreRestrictive_CustomerPolicyWins(t *testing.T) {}
func TestPolicyEngine_CustomerLessRestrictive_SystemPolicyWins(t *testing.T) {}
func TestPolicyEngine_PanicMode_HaltsAllScans(t *testing.T) {}
func TestPolicyEngine_PanicMode_SendsSingleNotification(t *testing.T) {}
func TestPolicyEngine_PolicyDecision_IsLogged(t *testing.T) {}
func TestPolicyEngine_FileReload_NewPolicyTakesEffect(t *testing.T) {}
```
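The two precedence tests imply that the effective policy is always the more restrictive of the system and customer policies. A minimal sketch of that rule follows; the ordering of modes (audit, then require-approval, then strict) and all identifiers are assumptions made for illustration.

```go
package main

// policyMode is ordered from least to most restrictive (assumed ordering).
type policyMode int

const (
	modeAudit policyMode = iota
	modeRequireApproval
	modeStrict
)

// effectiveMode returns the more restrictive of the two policies, so a
// customer can tighten but never loosen the system policy.
func effectiveMode(system, customer policyMode) policyMode {
	if customer > system {
		return customer
	}
	return system
}

// allowsRemediation reports whether any remediation may run at all.
// Panic mode overrides every policy.
func allowsRemediation(mode policyMode, panicMode bool) bool {
	if panicMode {
		return false
	}
	return mode != modeStrict
}
```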
---
## Section 4: Integration Test Strategy
### 4.1 Agent ↔ Cloud Provider APIs
**Goal:** Verify the agent correctly maps Terraform resource types to AWS describe calls and handles API responses.
**Approach:** Use recorded HTTP fixtures (via `go-vcr` or `httpmock`) for unit-speed integration tests. Use LocalStack for full integration runs in CI.
**Key test cases:**
```go
// pkg/agent/integration/aws_polling_test.go
func TestAWSPolling_SecurityGroup_MapsToDescribeSecurityGroups(t *testing.T) {}
func TestAWSPolling_IAMRole_MapsToGetRole(t *testing.T) {}
func TestAWSPolling_RDSInstance_MapsToDescribeDBInstances(t *testing.T) {}
func TestAWSPolling_ResourceNotFound_ReturnsUnknownDriftState(t *testing.T) {}
func TestAWSPolling_RateLimitResponse_RetriesWithBackoff(t *testing.T) {}
func TestAWSPolling_CredentialError_ReturnsDescriptiveError(t *testing.T) {}
func TestAWSPolling_RegionScopedRequest_UsesConfiguredRegion(t *testing.T) {}
```
**Fixture strategy:**
```
testdata/
aws-responses/
ec2_describe_security_groups_clean.json # cloud matches state
ec2_describe_security_groups_drifted.json # ingress rule added
iam_get_role_policy_changed.json
rds_describe_db_instances_clean.json
ec2_describe_security_groups_not_found.json # resource deleted from cloud
```
---
### 4.2 Agent ↔ SaaS API (Drift Report Submission)
**Goal:** Verify the agent correctly serializes and transmits `DriftReport` payloads, handles auth errors, and respects rate limit responses.
**Setup:** Spin up a lightweight HTTP test server in Go (`httptest.NewServer`) that mimics the SaaS ingestion endpoint.
```go
func TestTransmitter_ValidReport_Returns202(t *testing.T) {}
func TestTransmitter_InvalidAPIKey_Returns401AndStopsRetrying(t *testing.T) {}
func TestTransmitter_RevokedAPIKey_Returns403AndStopsRetrying(t *testing.T) {}
func TestTransmitter_RateLimited_RespectsRetryAfterHeader(t *testing.T) {}
func TestTransmitter_ServerError_RetriesWithExponentialBackoff(t *testing.T) {}
func TestTransmitter_PayloadCompressed_WhenOverThreshold(t *testing.T) {}
func TestTransmitter_mTLSCertPresented_OnEveryRequest(t *testing.T) {}
func TestTransmitter_NetworkTimeout_RetriesUpToMaxAttempts(t *testing.T) {}
```
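The retry behaviour these tests pin down can be separated into two pure decisions: whether to retry, and how long to wait. The sketch below is one possible shape, with hypothetical names (`shouldRetry`, `retryDelay`); it handles only the delta-seconds form of `Retry-After`, not the HTTP-date form.

```go
package main

import (
	"net/http"
	"strconv"
	"time"
)

// shouldRetry returns true for rate limiting and server errors. Auth
// failures (401, 403) and other client errors are never retried.
func shouldRetry(status int) bool {
	return status == http.StatusTooManyRequests || status >= 500
}

// retryDelay prefers the server's Retry-After value (in seconds). Without
// one it falls back to exponential backoff: base * 2^attempt, capped at max.
func retryDelay(retryAfter string, attempt int, base, max time.Duration) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}
```

Keeping these as pure functions lets the `httptest.NewServer` tests focus on wiring, while the timing logic is covered by fast unit tests with no sleeps.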
---
### 4.3 Event Processor ↔ DynamoDB (Testcontainers)
**Goal:** Verify event sourcing writes, TTL attribute setting, and checksum generation against a real DynamoDB Local instance.
**Setup:**
```go
// Use testcontainers-go to spin up DynamoDB Local
func setupDynamoDBLocal(t *testing.T) *dynamodb.Client {
ctx := context.Background()
container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
ContainerRequest: testcontainers.ContainerRequest{
Image: "amazon/dynamodb-local:latest",
ExposedPorts: []string{"8000/tcp"},
WaitingFor: wait.ForListeningPort("8000/tcp"),
},
Started: true,
})
require.NoError(t, err)
t.Cleanup(func() { container.Terminate(ctx) })
// ... return configured client
}
```
**Key test cases:**
```go
func TestDynamoDBEventStore_AppendDriftEvent_PersistsWithCorrectPK(t *testing.T) {}
func TestDynamoDBEventStore_AppendDriftEvent_SetsChecksumAttribute(t *testing.T) {}
func TestDynamoDBEventStore_AppendDriftEvent_SetsTTLPerTier(t *testing.T) {}
func TestDynamoDBEventStore_QueryByStackID_ReturnsChronologicalOrder(t *testing.T) {}
func TestDynamoDBEventStore_DuplicateEventID_IsIdempotent(t *testing.T) {}
func TestDynamoDBEventStore_FreeTier_TTL90Days(t *testing.T) {}
func TestDynamoDBEventStore_EnterpriseTier_TTL7Years(t *testing.T) {}
```
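DynamoDB TTL expects a Number attribute holding a Unix epoch timestamp in seconds, so the per-tier tests reduce to checking one small calculation. The sketch below covers only the two tiers named in the tests above; the function name, the tier keys, and the 365-day year are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// retentionByTier lists only the tiers named in the tests above.
var retentionByTier = map[string]time.Duration{
	"free":       90 * 24 * time.Hour,
	"enterprise": 7 * 365 * 24 * time.Hour, // assumes 365-day years
}

// ttlEpochSeconds returns the expiry time, in Unix seconds, for an event
// created at createdAtUnix. Unknown tiers are an error rather than a
// silent default, so a typo cannot shorten retention.
func ttlEpochSeconds(tier string, createdAtUnix int64) (int64, error) {
	retention, ok := retentionByTier[tier]
	if !ok {
		return 0, fmt.Errorf("unknown tier %q", tier)
	}
	return createdAtUnix + int64(retention/time.Second), nil
}
```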
---
### 4.4 Event Processor ↔ PostgreSQL (Testcontainers + RLS)
**Goal:** Verify multi-tenant data isolation via Row-Level Security. This is the most critical integration test suite.
**Setup:**
```typescript
// Use testcontainers for Node.js to spin up PostgreSQL 16
// Apply full schema migrations before each test suite
// Create two test orgs: orgA and orgB
```
**Key test cases:**
```typescript
describe("PostgreSQL RLS Integration", () => {
it("org A cannot read org B drift events via direct query")
it("org A cannot read org B stacks via direct query")
it("setting app.current_org_id scopes all queries correctly")
it("missing app.current_org_id returns zero rows (not an error)")
it("drift event insert without org_id fails FK constraint")
it("drift score update is scoped to correct org's stack")
it("concurrent inserts from two orgs do not cross-contaminate")
})
```
**Critical test — cross-tenant isolation:**
```typescript
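it("org A cannot read org B drift events", async () => {
  // Insert drift event for orgB
  await insertDriftEvent(orgBPool, orgBEvent)
  // Pin one connection: pool.query() may run each statement on a different
  // client, which would lose the session setting.
  const client = await orgAPool.connect()
  try {
    // SET does not accept bind parameters, so use set_config() instead
    await client.query("SELECT set_config('app.current_org_id', $1, false)", [orgA.id])
    // Query as orgA: should return empty, not orgB's data
    const result = await client.query("SELECT * FROM drift_events")
    expect(result.rows).toHaveLength(0)
  } finally {
    client.release()
  }
})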
```
---
### 4.5 IaC State File Parsing — Multi-Backend Integration
**Goal:** Verify the agent correctly reads state files from different backends (S3, local file, Terraform Cloud).
**Setup:** LocalStack S3 for S3 backend tests. Real file system for local backend. WireMock for Terraform Cloud API.
```go
func TestStateBackend_S3_ReadsStateFileFromBucket(t *testing.T) {}
func TestStateBackend_S3_HandlesVersionedBucket(t *testing.T) {}
func TestStateBackend_LocalFile_ReadsFromFilesystem(t *testing.T) {}
func TestStateBackend_TerraformCloud_AuthenticatesAndFetchesState(t *testing.T) {}
func TestStateBackend_S3_AccessDenied_ReturnsDescriptiveError(t *testing.T) {}
func TestStateBackend_S3_BucketNotFound_ReturnsDescriptiveError(t *testing.T) {}
```
---
### 4.6 Notification Service ↔ Slack API
**Goal:** Verify Slack message delivery, request signature validation, and interactive callback handling.
**Setup:** WireMock or a custom Go HTTP mock server simulating the Slack API.
```typescript
describe("Slack Integration", () => {
it("delivers Block Kit message to configured channel")
it("falls back to org default channel when stack channel not set")
it("validates Slack request signature on interaction callbacks")
it("rejects interaction callback with invalid signature → 401")
it("updates original message after [Revert] button click")
it("handles Slack API rate limit (429) with retry")
it("handles Slack API 500 — logs error, does not crash Lambda")
})
```
---
### 4.7 Terraform State File Parsing — Real Fixture Files
**Goal:** Verify the parser handles real-world Terraform state files from different provider versions and configurations.
Fixture files sourced from:
- Terraform AWS provider v4.x, v5.x state outputs
- OpenTofu state files (should be identical format)
- State files with modules, count, for_each
- State files with workspace prefixes
```go
func TestStateParser_RealWorldAWSProviderV5_ParsesCorrectly(t *testing.T) {}
func TestStateParser_OpenTofuStateFile_ParsesCorrectly(t *testing.T) {}
func TestStateParser_ForEachResources_AllInstancesExtracted(t *testing.T) {}
func TestStateParser_WorkspacePrefixedState_ParsesCorrectly(t *testing.T) {}
func TestStateParser_LargeStateFile_500Resources_CompletesUnder2Seconds(t *testing.T) {}
```
---
## Section 5: E2E & Smoke Tests
### 5.1 Infrastructure Setup
All E2E tests run against LocalStack (AWS service simulation) and a real PostgreSQL instance. The test environment is defined as a Docker Compose stack:
```yaml
# docker-compose.test.yml
services:
  localstack:
    image: localstack/localstack:3.x
    environment:
      SERVICES: s3,sqs,dynamodb,iam,ec2,lambda,eventbridge
      DEBUG: 0
    ports:
      - "4566:4566"
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: drift_test
      POSTGRES_USER: drift
      POSTGRES_PASSWORD: test
    ports:
      - "5432:5432"
  slack-mock:
    image: wiremock/wiremock:latest
    volumes:
      - ./testdata/wiremock/slack:/home/wiremock/mappings
    ports:
      - "8080:8080"
  github-mock:
    image: wiremock/wiremock:latest
    volumes:
      - ./testdata/wiremock/github:/home/wiremock/mappings
    ports:
      - "8081:8080"
```
**Synthetic drift generation:** A helper CLI tool (`testdata/tools/drift-injector`) modifies LocalStack EC2/IAM resources to simulate real drift scenarios without touching real AWS.
---
### 5.2 Critical User Journey: Install → Detect → Notify
**Journey:** Agent installed → `drift check` run → drift detected → Slack alert delivered
```go
// e2e/onboarding_flow_test.go
func TestE2E_OnboardingFlow_InstallToFirstSlackAlert(t *testing.T) {
// 1. Register org and agent via API
org := createTestOrg(t)
agent := registerAgent(t, org.APIKey)
// 2. Upload a Terraform state file to LocalStack S3
uploadStateFixture(t, "testdata/states/prod_networking.tfstate", org.StateBucket)
// 3. Inject drift into LocalStack EC2 (add 0.0.0.0/0 ingress rule)
injectSecurityGroupDrift(t, "sg-abc123")
// 4. Run drift check
result := runDriftCheck(t, agent, org.StateBucket)
require.Equal(t, 1, result.DriftedResourceCount)
require.Equal(t, "critical", result.DriftedResources[0].Severity)
// 5. Verify Slack mock received the Block Kit message
slackRequests := getSlackMockRequests(t)
require.Len(t, slackRequests, 1)
assert.Contains(t, slackRequests[0].Body, "Critical Drift Detected")
assert.Contains(t, slackRequests[0].Body, "aws_security_group")
// 6. Verify drift event persisted in PostgreSQL
event := getDriftEvent(t, org.ID, result.DriftedResources[0].Address)
assert.Equal(t, "open", event.Status)
assert.Equal(t, "critical", event.Severity)
}
```
---
### 5.3 Critical User Journey: Revert Workflow
**Journey:** Slack [Revert] button clicked → remediation engine queues command → agent executes → event resolved
```go
func TestE2E_RemediationRevert_SlackButtonToResolution(t *testing.T) {
// Setup: existing open drift event
org, driftEvent := setupOpenDriftEvent(t, "critical")
// 1. Simulate Slack [Revert] button click
payload := buildSlackInteractionPayload("drift_revert", driftEvent.ID, org.SlackUserID)
resp := postSlackInteraction(t, payload, validSlackSignature(payload))
assert.Equal(t, 200, resp.StatusCode)
// 2. Verify Slack message updated to "Reverting..."
slackUpdates := getSlackMockUpdates(t)
assert.Contains(t, slackUpdates[0].Body, "Reverting")
// 3. Verify remediation plan created in DB
plan := waitForRemediationPlan(t, driftEvent.ID, 5*time.Second)
assert.Equal(t, "executing", plan.Status)
assert.Contains(t, plan.TargetResources, driftEvent.ResourceAddress)
// 4. Simulate agent completing the apply
reportRemediationComplete(t, plan.ID, "success")
// 5. Verify drift event resolved
event := getDriftEvent(t, org.ID, driftEvent.ResourceAddress)
assert.Equal(t, "resolved", event.Status)
assert.Equal(t, "reverted", event.ResolutionType)
// 6. Verify final Slack message shows success
finalUpdate := getLastSlackUpdate(t)
assert.Contains(t, finalUpdate.Body, "reverted to declared state")
}
```
---
### 5.4 Critical User Journey: Secret Scrubbing End-to-End
**Journey:** State file with secrets → agent processes → drift report transmitted → NO secrets in SaaS database
```go
func TestE2E_SecretScrubbing_NoSecretsReachSaaS(t *testing.T) {
	// Setup: register org and agent, as in the onboarding flow test
	org := createTestOrg(t)
	agent := registerAgent(t, org.APIKey)
	// State file contains: master_password = "supersecret123", db_endpoint = "postgres://..."
	uploadStateFixture(t, "testdata/states/rds_with_secrets.tfstate", org.StateBucket)
	// Inject RDS drift (instance class changed)
	injectRDSDrift(t, "mydb", "db.t3.medium", "db.t3.large")
	// Run drift check
	runDriftCheck(t, agent, org.StateBucket)
	// Verify drift event in DB: no secret values
	event := getDriftEventByResource(t, org.ID, "aws_db_instance.mydb")
	diffJSON, err := json.Marshal(event.Diff)
	require.NoError(t, err)
	assert.NotContains(t, string(diffJSON), "supersecret123")
	assert.NotContains(t, string(diffJSON), "postgres://")
	assert.Contains(t, string(diffJSON), "[REDACTED]")
	// Verify instance class drift IS present (non-secret attribute preserved)
	assert.Contains(t, string(diffJSON), "db.t3.large")
}
```
---
### 5.5 Smoke Tests (Post-Deploy)
Smoke tests run after every production deployment. They hit real endpoints with minimal side effects.
```
smoke/
health_check_test.go # GET /health → 200 on all services
agent_registration_test.go # Register a smoke-test agent → 200
heartbeat_test.go # Send heartbeat → 200
drift_report_ingestion_test.go # POST minimal drift report → 202
dashboard_api_test.go # GET /v1/stacks (smoke org) → 200
slack_connectivity_test.go # Verify Slack OAuth token still valid
```
Smoke tests use a dedicated `smoke-test` organization in production with a pre-provisioned API key. They never write to real customer data.
---
## Section 6: Performance & Load Testing
### 6.1 Scan Duration Benchmarks
**Tool:** Go's built-in `testing.B` for agent benchmarks. `k6` for SaaS API load tests.
**Targets:**
| Scenario | Stack Size | Target Duration | Kill Threshold |
|---|---|---|---|
| Full state parse | 100 resources | < 50ms | > 200ms |
| Full state parse | 500 resources | < 200ms | > 1s |
| Full drift check (parse + poll + compare) | 20 resources | < 5s | > 30s |
| Full drift check | 100 resources | < 30s | > 120s |
| Drift report ingestion (SaaS) | single report | < 200ms p99 | > 1s p99 |
| Drift report ingestion (SaaS) | 100 concurrent | < 500ms p99 | > 2s p99 |
**Go benchmark tests:**
```go
// pkg/agent/bench_test.go
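func BenchmarkStateParser_100Resources(b *testing.B) {
	// Abort if the fixture is missing; otherwise the benchmark would
	// silently time ParseState on nil input.
	data, err := os.ReadFile("testdata/states/100_resources.tfstate")
	if err != nil {
		b.Fatalf("read fixture: %v", err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_, _ = ParseState(data)
	}
}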
func BenchmarkDriftComparator_100Resources(b *testing.B) {
stateResources := loadStateFixture("testdata/states/100_resources.tfstate")
cloudResources := loadCloudFixture("testdata/cloud/100_resources_clean.json")
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = CompareDrift(stateResources, cloudResources)
}
}
func BenchmarkSecretScrubber_LargeDiff(b *testing.B) {
diff := loadDiffFixture("testdata/diffs/large_diff_50_attributes.json")
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = ScrubSecrets(diff)
}
}
```
---
### 6.2 Memory & CPU Profiling
**Goal:** Ensure the agent stays within its ECS task allocation (0.25 vCPU, 512MB) even for large state files.
**Profile targets:**
- State parser memory allocation for 500-resource state files
- Drift comparator heap usage during deep JSON comparison
- Secret scrubber regex compilation (should be compiled once, not per-call)
```go
// Run with: go test -memprofile=mem.out -cpuprofile=cpu.out -bench=.
// Analyze with: go tool pprof mem.out
func TestMemoryProfile_LargeStateFile_Under100MB(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping memory profile in short mode")
	}
	data, err := os.ReadFile("testdata/states/500_resources.tfstate")
	require.NoError(t, err)
	// TotalAlloc is cumulative and never decreases, so the subtraction cannot
	// underflow if a GC cycle runs between the two readings (HeapAlloc can).
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	_, err = ParseState(data)
	require.NoError(t, err)
	runtime.ReadMemStats(&after)
	allocatedMB := float64(after.TotalAlloc-before.TotalAlloc) / 1024 / 1024
	assert.Less(t, allocatedMB, 100.0, "state parser should allocate < 100MB for 500 resources")
}
```
**Regex pre-compilation check:**
```go
func TestSecretScrubber_RegexPrecompiled_NotCompiledPerCall(t *testing.T) {
// Call scrubber 1000 times — if regex is compiled per call, this will be slow
diff := map[string]interface{}{"password": "test123"}
start := time.Now()
for i := 0; i < 1000; i++ {
ScrubSecrets(diff)
}
elapsed := time.Since(start)
assert.Less(t, elapsed, 100*time.Millisecond, "1000 scrub calls should complete in < 100ms")
}
```
---
### 6.3 Concurrent Scan Stress Tests
**Goal:** Verify the agent handles concurrent scans (multiple stacks) without race conditions or goroutine leaks.
```go
func TestConcurrentScans_MultipleStacks_NoRaceConditions(t *testing.T) {
// Run with: go test -race ./...
const numStacks = 10
var wg sync.WaitGroup
errors := make(chan error, numStacks)
for i := 0; i < numStacks; i++ {
wg.Add(1)
go func(stackIdx int) {
defer wg.Done()
stateFile := fmt.Sprintf("testdata/states/stack_%d.tfstate", stackIdx)
_, err := RunDriftCheck(stateFile, mockAWSClient)
if err != nil { errors <- err }
}(i)
}
wg.Wait()
close(errors)
for err := range errors {
t.Errorf("concurrent scan error: %v", err)
}
}
```
**SaaS load test (k6):**
```javascript
// load-tests/drift-report-ingestion.js
import http from 'k6/http';
import { check } from 'k6';
export const options = {
stages: [
{ duration: '30s', target: 50 }, // ramp up to 50 concurrent agents
{ duration: '60s', target: 50 }, // hold
{ duration: '10s', target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ['p(99)<500'], // 99th percentile < 500ms
http_req_failed: ['rate<0.01'], // < 1% error rate
},
};
export default function () {
const payload = JSON.stringify(buildDriftReport());
const res = http.post(`${__ENV.API_URL}/v1/drift-reports`, payload, {
headers: { 'Authorization': `Bearer ${__ENV.API_KEY}`, 'Content-Type': 'application/json' },
});
check(res, { 'status is 202': (r) => r.status === 202 });
}
```
---
## Section 7: CI/CD Pipeline Integration
### 7.1 Test Stages
```
┌─────────────────────────────────────────────────────────────────┐
│ PRE-COMMIT (local, < 30s) │
│ • golangci-lint (Go) │
│ • eslint + tsc --noEmit (TypeScript) │
│ • go test -short ./... (unit tests only, no I/O) │
│ • Feature flag TTL audit (make flag-audit) │
│ • Decision log presence check (PRs touching pkg/detection/) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PR (GitHub Actions, < 5 min) │
│ • Full unit test suite (Go + TypeScript) │
│ • go test -race ./... (race detector) │
│ • Coverage gate: fail if < 80% overall, < 100% on scrubber │
│ • Schema migration lint (no destructive changes) │
│ • Snapshot test diff check (Block Kit formatter) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ MERGE TO MAIN (GitHub Actions, < 10 min) │
│ • All unit tests │
│ • Integration tests (Testcontainers: PostgreSQL + DynamoDB) │
│ • LocalStack integration tests (S3, SQS, EC2 mock) │
│ • RLS isolation tests (multi-tenant) │
│ • Docker build + Trivy scan │
│ • Go benchmark regression check (fail if > 20% slower) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STAGING DEPLOY (< 15 min) │
│ • E2E test suite against staging environment │
│ • Smoke tests (all health endpoints) │
│ • Secret scrubbing E2E test │
│ • Multi-tenant isolation E2E test │
└─────────────────────────────────────────────────────────────────┘
▼ (manual approval gate)
┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION DEPLOY │
│ • Smoke tests post-deploy │
│ • Canary: route 5% traffic to new version for 10 min │
│ • Auto-rollback if smoke tests fail │
└─────────────────────────────────────────────────────────────────┘
```
---
### 7.2 Coverage Thresholds & Gates
```yaml
# .github/workflows/test.yml (coverage gate step)
- name: Check coverage thresholds
  run: |
    # Go agent
    go test -coverprofile=coverage.out ./...
    go tool cover -func=coverage.out | grep "total:" | awk '{print $3}' | \
      awk -F'%' '{if ($1 + 0 < 80) {print "FAIL: Go coverage " $1 "% < 80%"; exit 1}}'
    # Secret scrubber must be 100%.
    # The percentage is the last whitespace-separated field; strip "%" and force a
    # numeric comparison (splitting on "%" would compare the whole line as a string).
    go tool cover -func=coverage.out | grep "scrubber" | \
      awk '{pct = $NF; sub(/%/, "", pct); if (pct + 0 < 100) {print "FAIL: Scrubber coverage " $NF " < 100% in " $1; exit 1}}'
    # TypeScript SaaS
    npx vitest run --coverage
    # vitest.config.ts enforces: lines: 80, branches: 75, functions: 80
```
**`vitest.config.ts` coverage config:**
```typescript
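import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        lines: 80,
        branches: 75,
        functions: 80,
        statements: 80,
        // Apply the thresholds to every file, not only the aggregate.
        // Stricter limits for critical modules would need their own threshold entries.
        perFile: true,
      },
    },
  },
})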
```
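Parsing `go tool cover -func` output with awk is easy to get subtly wrong, so the gate could also be a small Go program. The sketch below is one possible shape, not the project's actual tooling; `belowThreshold` is a hypothetical helper, and it assumes the usual report layout in which the percentage is the last field of each line.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// belowThreshold scans `go tool cover -func` output and returns a message for
// every line whose first field contains match and whose coverage is below minPct.
func belowThreshold(report, match string, minPct float64) []string {
	var failures []string
	for _, line := range strings.Split(report, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 || !strings.Contains(fields[0], match) {
			continue
		}
		last := strings.TrimSuffix(fields[len(fields)-1], "%")
		pct, err := strconv.ParseFloat(last, 64)
		if err != nil {
			continue // not a coverage line
		}
		if pct < minPct {
			failures = append(failures, fmt.Sprintf("%s %.1f%% < %.1f%%", fields[0], pct, minPct))
		}
	}
	return failures
}

func main() {
	// Sample report text; the paths are made up for illustration.
	report := "dd0c/pkg/scrubber/scrub.go:10:\tScrubSecrets\t100.0%\n" +
		"dd0c/pkg/scrubber/scrub.go:42:\tscrubNested\t91.7%\n" +
		"total:\t\t\t(statements)\t84.2%\n"
	fmt.Println("scrubber failures:", belowThreshold(report, "scrubber", 100))
	fmt.Println("overall failures:", belowThreshold(report, "total:", 80))
}
```

In CI the program would read `coverage.out` through `go tool cover -func` and exit non-zero when either slice is non-empty.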
---
### 7.3 Test Parallelization Strategy
**Go:** Tests are parallelized at the package level by default (`go test ./...`). Mark individual tests with `t.Parallel()` where safe. Integration tests that share LocalStack state must NOT be parallelized — use build tags to separate them.
```go
// Unit tests: always parallel
func TestDriftComparator_AttributeAdded_ReturnsDrift(t *testing.T) {
t.Parallel()
// ...
}
// Integration tests: sequential within package, parallel across packages
// go test -p 4 ./... (4 packages in parallel)
```
**Build tags for test separation:**
```go
//go:build integration
// +build integration
// Run with: go test -tags=integration ./...
// Unit only: go test ./... (no tag)
```
**GitHub Actions matrix:**
```yaml
strategy:
  matrix:
    test-suite:
      - unit-go
      - unit-ts
      - integration-go
      - integration-ts
      - e2e
  fail-fast: false # don't cancel other suites on first failure
```
---
## Section 8: Transparent Factory Tenet Testing
### 8.1 Feature Flag Behavior (Epic 10, Story 10.1)
**Testing OpenFeature Go SDK integration:**
```go
// pkg/flags/flags_test.go
// The in-memory provider lives in the SDK's memprovider package:
// github.com/open-feature/go-sdk/openfeature/memprovider
// Test 1: Flag gates new detection rule
func TestFeatureFlag_NewDetectionRule_GatedByFlag(t *testing.T) {
	// Set up: flag "pulumi-support" = false
	provider := memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
		"pulumi-support": {DefaultVariant: "off", Variants: map[string]interface{}{"off": false, "on": true}},
	})
	require.NoError(t, openfeature.SetProvider(provider))
	result := RunDriftCheck(pulumiStateFixture)
	assert.ErrorIs(t, result.Err, ErrIaCToolNotSupported)
	assert.Equal(t, 0, result.DriftedResourceCount)
}
// Test 2: Flag enabled, feature executes
func TestFeatureFlag_NewDetectionRule_ExecutesWhenEnabled(t *testing.T) {
	provider := memprovider.NewInMemoryProvider(map[string]memprovider.InMemoryFlag{
		"pulumi-support": {DefaultVariant: "on", Variants: map[string]interface{}{"off": false, "on": true}},
	})
	require.NoError(t, openfeature.SetProvider(provider))
	result := RunDriftCheck(pulumiStateFixture)
	require.NoError(t, result.Err)
	assert.Greater(t, result.ResourceCount, 0)
}
// Test 3: Circuit breaker disables flag on false-positive spike
func TestFeatureFlag_CircuitBreaker_TripsOnFalsePositiveSpike(t *testing.T) {
flag := NewFeatureFlag("new-sg-rule", circuitBreakerConfig{Threshold: 3.0, Window: time.Hour})
// Simulate 10 dismissals in 1 hour (3x baseline of ~3)
for i := 0; i < 10; i++ {
flag.RecordDismissal()
}
assert.False(t, flag.IsEnabled(), "circuit breaker should have tripped")
}
```
**TTL lint test (CI enforcement):**
```go
func TestFeatureFlags_NoExpiredTTLs(t *testing.T) {
flags := LoadAllFlags("../../config/flags.json")
for _, flag := range flags {
if flag.Rollout == 100 {
assert.True(t, time.Now().Before(flag.TTL),
"flag %q is at 100%% rollout and past TTL %v — clean it up", flag.Name, flag.TTL)
}
}
}
```
---
### 8.2 Schema Migration Validation (Epic 10, Story 10.2)
**Goal:** CI blocks any migration that removes, renames, or changes the type of existing DynamoDB attributes.
```go
// tools/schema-lint/main_test.go
func TestSchemaMigration_AddNewAttribute_IsAllowed(t *testing.T) {
migration := Migration{
Changes: []SchemaChange{
{Type: ChangeTypeAdd, AttributeName: "new_field_v2", AttributeType: "S"},
},
}
err := ValidateMigration(migration, currentSchema)
assert.NoError(t, err)
}
func TestSchemaMigration_RemoveAttribute_IsRejected(t *testing.T) {
migration := Migration{
Changes: []SchemaChange{
{Type: ChangeTypeRemove, AttributeName: "event_type"},
},
}
err := ValidateMigration(migration, currentSchema)
assert.ErrorContains(t, err, "destructive schema change: cannot remove attribute 'event_type'")
}
func TestSchemaMigration_RenameAttribute_IsRejected(t *testing.T) {
migration := Migration{
Changes: []SchemaChange{
{Type: ChangeTypeRename, OldName: "payload", NewName: "event_payload"},
},
}
err := ValidateMigration(migration, currentSchema)
assert.ErrorContains(t, err, "destructive schema change: cannot rename attribute")
}
func TestSchemaMigration_ChangeAttributeType_IsRejected(t *testing.T) {
migration := Migration{
Changes: []SchemaChange{
{Type: ChangeTypeModify, AttributeName: "timestamp", OldType: "S", NewType: "N"},
},
}
err := ValidateMigration(migration, currentSchema)
assert.ErrorContains(t, err, "destructive schema change: cannot change type of attribute 'timestamp'")
}
```
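The four tests above pin down the validator's contract closely enough to sketch it. The version below is one possible implementation under stated assumptions: the type and field names are taken from the tests, while representing the current schema as a `map[string]string` and rejecting an add that collides with an existing attribute are choices made here for illustration.

```go
package main

import "fmt"

type ChangeType int

const (
	ChangeTypeAdd ChangeType = iota
	ChangeTypeRemove
	ChangeTypeRename
	ChangeTypeModify
)

type SchemaChange struct {
	Type          ChangeType
	AttributeName string
	AttributeType string
	OldName       string
	NewName       string
	OldType       string
	NewType       string
}

type Migration struct {
	Changes []SchemaChange
}

// ValidateMigration returns the first destructive change it finds.
// current maps attribute names to DynamoDB type codes (assumed shape).
func ValidateMigration(m Migration, current map[string]string) error {
	for _, c := range m.Changes {
		switch c.Type {
		case ChangeTypeAdd:
			if _, exists := current[c.AttributeName]; exists {
				return fmt.Errorf("attribute '%s' already exists", c.AttributeName)
			}
		case ChangeTypeRemove:
			return fmt.Errorf("destructive schema change: cannot remove attribute '%s'", c.AttributeName)
		case ChangeTypeRename:
			return fmt.Errorf("destructive schema change: cannot rename attribute '%s' to '%s'", c.OldName, c.NewName)
		case ChangeTypeModify:
			return fmt.Errorf("destructive schema change: cannot change type of attribute '%s'", c.AttributeName)
		default:
			return fmt.Errorf("unknown change type %d", c.Type)
		}
	}
	return nil
}

func main() {
	current := map[string]string{"event_type": "S", "payload": "S", "timestamp": "S"}
	add := Migration{Changes: []SchemaChange{{Type: ChangeTypeAdd, AttributeName: "new_field_v2", AttributeType: "S"}}}
	remove := Migration{Changes: []SchemaChange{{Type: ChangeTypeRemove, AttributeName: "event_type"}}}
	fmt.Println(ValidateMigration(add, current))
	fmt.Println(ValidateMigration(remove, current))
}
```

Only additive changes pass, so any "rename" has to be expressed as adding a new attribute and dual-writing, which is the pattern the backward-compatibility tests later in this document rely on.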
---
### 8.3 Decision Log Format Validation (Epic 10, Story 10.3)
```go
// tools/decision-log-lint/main_test.go
func TestDecisionLog_ValidFormat_PassesValidation(t *testing.T) {
log := DecisionLog{
Prompt: "Why is security group drift classified as critical?",
Reasoning: "SG drift is the #1 vector for cloud breaches...",
AlternativesConsidered: []string{"classify as high", "require manual review"},
Confidence: 0.9,
Timestamp: time.Now(),
Author: "max@dd0c.dev",
}
assert.NoError(t, ValidateDecisionLog(log))
}
func TestDecisionLog_MissingReasoning_FailsValidation(t *testing.T) {
log := DecisionLog{Prompt: "Why?", Confidence: 0.8}
err := ValidateDecisionLog(log)
assert.ErrorContains(t, err, "reasoning is required")
}
func TestDecisionLog_ConfidenceOutOfRange_FailsValidation(t *testing.T) {
log := DecisionLog{Prompt: "Why?", Reasoning: "Because.", Confidence: 1.5}
err := ValidateDecisionLog(log)
assert.ErrorContains(t, err, "confidence must be between 0 and 1")
}
// CI check: PRs touching pkg/detection/ must include a decision log
func TestCI_DetectionPackageChange_RequiresDecisionLog(t *testing.T) {
changedFiles := getChangedFilesInPR()
touchesDetection := slices.ContainsFunc(changedFiles, func(f string) bool {
return strings.HasPrefix(f, "pkg/detection/")
})
if touchesDetection {
decisionLogs := findDecisionLogsInPR()
assert.NotEmpty(t, decisionLogs, "PRs touching pkg/detection/ require a decision log entry")
}
}
```
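As a rough illustration of what the format tests expect, a validator could look like the sketch below. The struct fields come from the tests; which fields beyond `Prompt`, `Reasoning`, and `Confidence` are mandatory is not stated in this document, so only those three are checked here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

type DecisionLog struct {
	Prompt                 string
	Reasoning              string
	AlternativesConsidered []string
	Confidence             float64
	Timestamp              time.Time
	Author                 string
}

// ValidateDecisionLog checks the fields exercised by the tests above.
// Whether Author and Timestamp are also required is left open (assumption).
func ValidateDecisionLog(l DecisionLog) error {
	if strings.TrimSpace(l.Prompt) == "" {
		return errors.New("prompt is required")
	}
	if strings.TrimSpace(l.Reasoning) == "" {
		return errors.New("reasoning is required")
	}
	if l.Confidence < 0 || l.Confidence > 1 {
		return errors.New("confidence must be between 0 and 1")
	}
	return nil
}

func main() {
	fmt.Println(ValidateDecisionLog(DecisionLog{Prompt: "Why?", Confidence: 0.8}))
	fmt.Println(ValidateDecisionLog(DecisionLog{Prompt: "Why?", Reasoning: "Because.", Confidence: 1.5}))
}
```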
---
### 8.4 OTEL Span Assertion Tests (Epic 10, Story 10.4)
**Goal:** Verify that drift classification emits the correct OpenTelemetry spans with required attributes.
```go
// pkg/observability/spans_test.go
func TestOTELSpans_DriftScan_EmitsParentSpan(t *testing.T) {
exporter := tracetest.NewInMemoryExporter()
tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
otel.SetTracerProvider(tp)
RunDriftScan(testStateFixture, mockAWSClient)
spans := exporter.GetSpans()
parentSpans := filterSpansByName(spans, "drift_scan")
require.Len(t, parentSpans, 1)
}
func TestOTELSpans_DriftClassification_EmitsChildSpanPerResource(t *testing.T) {
exporter := tracetest.NewInMemoryExporter()
// ... setup ...
RunDriftScan(stateWith3Resources, mockAWSClient)
classificationSpans := filterSpansByName(exporter.GetSpans(), "drift_classification")
assert.Len(t, classificationSpans, 3) // one per resource
}
func TestOTELSpans_ClassificationSpan_HasRequiredAttributes(t *testing.T) {
// ... run scan ...
span := getClassificationSpan(exporter, "aws_security_group.api")
attrs := span.Attributes()
assert.Equal(t, "aws_security_group", getAttr(attrs, "drift.resource_type"))
assert.NotEmpty(t, getAttr(attrs, "drift.severity_score"))
assert.NotEmpty(t, getAttr(attrs, "drift.classification_reason"))
// No PII: resource ARN must be hashed, not raw
assert.NotContains(t, getAttr(attrs, "drift.resource_id"), "arn:aws:")
}
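func TestOTELSpans_NoCustomerPII_InAnySpan(t *testing.T) {
	exporter := tracetest.NewInMemoryExporter()
	// ... setup (register the exporter as in TestOTELSpans_DriftScan_EmitsParentSpan) ...
	// Run scan with a state file containing real-looking ARNs
	RunDriftScan(stateWithRealARNs, mockAWSClient)
	// GetSpans returns SpanStubs, whose Name and Attributes are struct fields.
	for _, span := range exporter.GetSpans() {
		for _, attr := range span.Attributes {
			// The service segment may contain digits (ec2, s3) and the region may be
			// empty (iam). Emit() stringifies non-string attribute values as well.
			assert.NotRegexp(t, `arn:aws:[a-z0-9-]+:[a-z0-9-]*:\d{12}:`, attr.Value.Emit(),
				"span %q contains unhashed ARN in attribute %q", span.Name, attr.Key)
		}
	}
}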
```
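The span tests require `drift.resource_id` to carry a hashed identifier instead of a raw ARN. One way that could be done is sketched below, assuming a salted SHA-256 truncated to 8 bytes; the `hashResourceID` name, the `res_` prefix, and the per-tenant salt are illustrative assumptions, not the product's documented scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

// rawARN matches ARNs that embed a 12-digit account ID. The service segment
// may contain digits (ec2, s3) and the region may be empty (iam).
var rawARN = regexp.MustCompile(`arn:aws:[a-z0-9-]+:[a-z0-9-]*:\d{12}:`)

// hashResourceID returns a stable, non-reversible identifier for an ARN.
// salt is a hypothetical per-tenant value that keeps IDs from being
// correlated across organizations.
func hashResourceID(salt, arn string) string {
	sum := sha256.Sum256([]byte(salt + "|" + arn))
	return "res_" + hex.EncodeToString(sum[:8])
}

func main() {
	arn := "arn:aws:ec2:us-east-1:123456789012:security-group/sg-0abc"
	id := hashResourceID("org_test_001", arn)
	fmt.Println("hashed id:", id)
	fmt.Println("still looks like an ARN:", rawARN.MatchString(id))
}
```

The same input always yields the same ID, so spans for one resource can still be grouped in the tracing backend without exposing the account number.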
---
### 8.5 Governance Policy Enforcement Tests (Epic 10, Story 10.5)
```go
func TestGovernance_StrictMode_RemediationNeverExecutes(t *testing.T) {
engine := NewRemediationEngine(Policy{GovernanceMode: "strict"})
result, err := engine.Revert(criticalDriftEvent)
require.NoError(t, err) // not an error — just blocked
assert.Equal(t, "blocked_by_policy", result.Status)
assert.Contains(t, result.Log, "Remediation blocked by strict mode")
assert.False(t, mockAgentDispatcher.WasCalled())
}
func TestGovernance_CustomerCannotEscalateAboveSystemPolicy(t *testing.T) {
systemPolicy := Policy{GovernanceMode: "strict"}
customerPolicy := Policy{GovernanceMode: "audit"} // customer wants less restriction
merged := MergePolicies(systemPolicy, customerPolicy)
assert.Equal(t, "strict", merged.GovernanceMode, "customer cannot override system to be less restrictive")
}
func TestGovernance_PanicMode_HaltsAllScansImmediately(t *testing.T) {
agent := NewDriftAgent(Policy{PanicMode: true})
result := agent.RunScan(testStateFixture)
assert.ErrorIs(t, result.Err, ErrPanicModeActive)
assert.False(t, mockAWSClient.WasCalled(), "no AWS API calls should be made in panic mode")
}
func TestGovernance_PanicMode_SendsExactlyOneNotification(t *testing.T) {
agent := NewDriftAgent(Policy{PanicMode: true})
// Run scan 3 times — should only notify once
for i := 0; i < 3; i++ {
agent.RunScan(testStateFixture)
}
assert.Equal(t, 1, mockNotifier.CallCount(), "panic mode should send exactly one notification")
}
```
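`TestGovernance_CustomerCannotEscalateAboveSystemPolicy` implies a "most restrictive wins" merge. A minimal sketch of that rule follows, assuming only the two modes named in this section and ranking `strict` above `audit`; the `PanicMode` OR-merge and the handling of unknown modes are assumptions added for illustration.

```go
package main

import "fmt"

type Policy struct {
	GovernanceMode string
	PanicMode      bool
}

// restrictiveness ranks governance modes; a higher value is more restrictive.
// Only the modes named in this document are listed.
var restrictiveness = map[string]int{"audit": 1, "strict": 2}

// rank returns 0 for unknown modes, so an unrecognized customer value can
// never override the system policy. A real implementation would likely
// reject unknown modes outright.
func rank(mode string) int {
	return restrictiveness[mode]
}

// MergePolicies starts from the system policy and lets the customer policy
// tighten it, never loosen it.
func MergePolicies(system, customer Policy) Policy {
	merged := system
	if rank(customer.GovernanceMode) > rank(system.GovernanceMode) {
		merged.GovernanceMode = customer.GovernanceMode
	}
	merged.PanicMode = system.PanicMode || customer.PanicMode
	return merged
}

func main() {
	system := Policy{GovernanceMode: "strict"}
	customer := Policy{GovernanceMode: "audit"}
	fmt.Println(MergePolicies(system, customer).GovernanceMode)
}
```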
---
## Section 9: Test Data & Fixtures
### 9.1 Directory Structure
```
testdata/
  states/
    # Terraform state v4 fixtures
    single_sg.tfstate                 # 1 resource: aws_security_group
    single_rds.tfstate                # 1 resource: aws_db_instance (with secrets)
    prod_networking.tfstate           # 23 resources: VPC, SGs, subnets, routes
    prod_compute.tfstate              # 47 resources: EC2, IAM, Lambda, ECS
    100_resources.tfstate             # benchmark fixture
    500_resources.tfstate             # benchmark fixture
    module_nested.tfstate             # module-prefixed addresses
    for_each_resources.tfstate        # for_each instances
    v3_format.tfstate                 # invalid: old format (should error)
    rds_with_secrets.tfstate          # contains master_password, connection strings
    opentofu_state.tfstate            # OpenTofu-generated state
  aws-responses/
    # Recorded AWS API responses (go-vcr cassettes)
    ec2/
      describe_sg_clean.json              # cloud matches state
      describe_sg_ingress_added.json      # 0.0.0.0/0 rule added
      describe_sg_ingress_removed.json    # rule removed
      describe_sg_not_found.json          # resource deleted from cloud
    iam/
      get_role_clean.json
      get_role_policy_changed.json
      get_role_not_found.json
    rds/
      describe_db_instances_clean.json
      describe_db_instances_class_changed.json
      describe_db_instances_publicly_accessible.json  # critical: made public
  diffs/
    # Pre-computed drift diff fixtures
    sg_ingress_added_critical.json
    iam_policy_changed_high.json
    rds_class_changed_high.json
    tag_only_change_low.json
    large_diff_50_attributes.json     # benchmark fixture
  wiremock/
    slack/
      post_message_success.json
      post_message_rate_limited.json
      post_message_channel_not_found.json
      interactions_revert_payload.json
    github/
      create_branch_success.json
      create_pr_success.json
      create_pr_repo_not_found.json
  policies/
    strict_mode.json
    audit_mode.json
    auto_revert_critical.json
    require_approval_iam.json
```
---
### 9.2 State File Factory (Go)
A factory package generates synthetic Terraform state files for tests. This avoids brittle fixture files that break when the state format changes.
```go
// testutil/statefactory/factory.go
type StateFactory struct {
version int
terraformVersion string
resources []StateResource
}
func NewStateFactory() *StateFactory {
return &StateFactory{version: 4, terraformVersion: "1.7.0"}
}
func (f *StateFactory) WithSecurityGroup(name, vpcID string, ingress []IngressRule) *StateFactory {
f.resources = append(f.resources, StateResource{
Mode: "managed",
Type: "aws_security_group",
Name: name,
Provider: "registry.terraform.io/hashicorp/aws",
Instances: []ResourceInstance{{
Attributes: map[string]interface{}{
"id": fmt.Sprintf("sg-%s", randID()),
"name": name,
"vpc_id": vpcID,
"ingress": ingress,
"egress": defaultEgressRules(),
"tags": map[string]string{"ManagedBy": "terraform"},
},
}},
})
return f
}
func (f *StateFactory) WithIAMRole(name, assumeRolePolicy string) *StateFactory { /* ... */ }
func (f *StateFactory) WithRDSInstance(id, instanceClass string) *StateFactory { /* ... */ }
func (f *StateFactory) WithSecret(key, value string) *StateFactory { /* injects secret into last resource */ }
func (f *StateFactory) Build() []byte { /* marshals to JSON */ }
// Usage in tests:
state := NewStateFactory().
WithSecurityGroup("api", "vpc-abc123", []IngressRule{{Port: 443, CIDR: "10.0.0.0/8"}}).
WithIAMRole("lambda-exec", assumeRolePolicyJSON).
Build()
```
---
### 9.3 Cloud Response Factory (Go)
Mirrors the state factory but for AWS API responses. Used to simulate clean vs. drifted cloud state.
```go
// testutil/cloudfactory/factory.go
type CloudResponseFactory struct{}
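// ec2types is github.com/aws/aws-sdk-go-v2/service/ec2/types; in SDK v2 the
// SecurityGroup struct lives there, alongside IpPermission and IpRange.
func (f *CloudResponseFactory) SecurityGroup(id string, opts ...SGOption) *ec2types.SecurityGroup {
	sg := &ec2types.SecurityGroup{GroupId: aws.String(id), /* defaults */}
	for _, opt := range opts {
		opt(sg)
	}
	return sg
}
// Options for injecting drift:
func WithPublicIngress(port int) SGOption {
	return func(sg *ec2types.SecurityGroup) {
		sg.IpPermissions = append(sg.IpPermissions, ec2types.IpPermission{
			IpProtocol: aws.String("tcp"),
			FromPort:   aws.Int32(int32(port)),
			ToPort:     aws.Int32(int32(port)), // single-port rule
			IpRanges:   []ec2types.IpRange{{CidrIp: aws.String("0.0.0.0/0")}},
		})
	}
}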
func WithInstanceClassChanged(newClass string) RDSOption { /* ... */ }
func WithPolicyDocumentChanged(newPolicy string) IAMOption { /* ... */ }
```
---
### 9.4 Drift Scenario Fixtures
Pre-built scenarios covering the most common real-world drift patterns. Each scenario includes: state file, cloud response, expected diff, expected severity.
| Scenario | State Fixture | Cloud Response | Expected Severity | Category |
|---|---|---|---|---|
| Security group: public HTTPS ingress added | `sg_private.tfstate` | `sg_public_443.json` | critical | security |
| Security group: SSH port opened to world | `sg_no_ssh.tfstate` | `sg_ssh_open.json` | critical | security |
| IAM role: `*:*` policy attached | `iam_role_scoped.tfstate` | `iam_role_star_star.json` | critical | security |
| S3 bucket: public access enabled | `s3_private.tfstate` | `s3_public.json` | critical | security |
| RDS: made publicly accessible | `rds_private.tfstate` | `rds_public.json` | critical | security |
| Lambda: runtime changed (python3.8 → python3.12) | `lambda_py38.tfstate` | `lambda_py312.json` | high | configuration |
| ECS service: task count changed (2 → 5) | `ecs_2tasks.tfstate` | `ecs_5tasks.json` | low | scaling |
| EC2 instance: instance type changed | `ec2_t3medium.tfstate` | `ec2_t3large.json` | high | configuration |
| Route53: TTL changed (300 → 60) | `r53_ttl300.tfstate` | `r53_ttl60.json` | medium | configuration |
| Tags: Environment tag changed | `tags_prod.tfstate` | `tags_staging.json` | low | tags |
| Resource deleted from cloud | `sg_exists.tfstate` | `sg_not_found.json` | high | configuration |
---
### 9.5 TypeScript Test Helpers
```typescript
// test/helpers/factories.ts
import { randomUUID } from 'node:crypto'
export const buildDriftEvent = (overrides: Partial<DriftEvent> = {}): DriftEvent => ({
id: `evt_${randomUUID()}`,
orgId: 'org_test_001',
stackId: 'stack_prod_networking',
resourceAddress: 'aws_security_group.api',
resourceType: 'aws_security_group',
severity: 'critical',
category: 'security',
status: 'open',
diff: {
ingress: {
old: [{ from_port: 443, cidr_blocks: ['10.0.0.0/8'] }],
new: [{ from_port: 443, cidr_blocks: ['10.0.0.0/8', '0.0.0.0/0'] }],
},
},
attribution: {
principal: 'arn:aws:iam::123456789012:user/jsmith', // AWS account IDs are 12 digits
sourceIp: '192.168.1.1',
eventName: 'AuthorizeSecurityGroupIngress',
attributedAt: new Date().toISOString(),
},
createdAt: new Date().toISOString(),
...overrides,
})
export const buildOrg = (overrides: Partial<Organization> = {}): Organization => ({
id: `org_${randomUUID()}`,
name: 'Test Org',
slug: 'test-org',
plan: 'starter',
maxStacks: 10,
pollIntervalS: 300,
...overrides,
})
export const buildStack = (orgId: string, overrides: Partial<Stack> = {}): Stack => ({
id: `stack_${randomUUID()}`,
orgId,
name: 'prod-networking',
backendType: 's3',
backendHash: 'abc123def456',
iacTool: 'terraform',
environment: 'prod',
driftScore: 100.0,
resourceCount: 23,
driftedCount: 0,
...overrides,
})
```
---
## Section 10: TDD Implementation Order
### 10.1 Bootstrap Sequence (Test Infrastructure First)
Before writing a single product test, the test infrastructure itself must be bootstrapped. This is the meta-TDD step.
```
Week 0 — Test Infrastructure Bootstrap
────────────────────────────────────────
1. Set up Go test project structure
• testutil/ package with state factory, cloud factory
• testdata/ directory with initial fixture files
• golangci-lint config (.golangci.yml)
• go test -race baseline (should pass with zero tests)
2. Set up TypeScript test project
• vitest.config.ts with coverage thresholds
• test/helpers/factories.ts with builder functions
• ESLint + tsc --noEmit in CI
3. Set up Docker Compose test environment
• docker-compose.test.yml (LocalStack, PostgreSQL, WireMock)
• Makefile targets: make test-unit, make test-integration, make test-e2e
4. Set up CI pipeline skeleton
• GitHub Actions workflow with test stages
• Coverage reporting (codecov or similar)
• Feature flag TTL lint check
```
---
### 10.2 Epic-by-Epic TDD Order
The implementation order follows epic dependencies. Tests are written before code at each step.
```
Phase 1: Agent Core (Weeks 1-2)
────────────────────────────────
Write tests first, then implement:
1. TestStateParser_* (Epic 1, Story 1.1)
→ Implement StateParser
→ Fixture: single_sg.tfstate, module_nested.tfstate
2. TestDriftComparator_* (Epic 1, Story 1.3)
→ Implement DriftComparator
→ Depends on: StateParser (need parsed state to compare)
3. TestSecretScrubber_* (Epic 1, Story 1.4) ← ALL 16 tests before any code
→ Implement SecretScrubber
→ This is the highest-risk component. Write every test case first.
4. TestDriftClassifier_* (Epic 3, Story 3.2)
→ Implement DriftClassifier with YAML rules
→ Depends on: DriftComparator output format
5. TestAWSPolling_* (Epic 1, Story 1.2) ← Integration tests lead here
→ Implement AWS resource polling for top 5 resource types
→ Use recorded HTTP fixtures (go-vcr)
→ Add remaining 15 resource types iteratively
Phase 2: Agent Communication (Week 2)
───────────────────────────────────────
6. TestTransmitter_* (Epic 2, Story 2.2)
→ Implement HTTPS transmitter with mTLS
→ Depends on: SecretScrubber (scrub before transmit)
7. TestAgentRegistration_* (Epic 2, Story 2.1)
→ Implement agent registration flow
→ Depends on: Transmitter
8. TestHeartbeat_* (Epic 2, Story 2.3)
→ Implement heartbeat goroutine
→ Depends on: AgentRegistration
Phase 3: SaaS Ingestion Pipeline (Weeks 2-3)
─────────────────────────────────────────────
9. TestEventProcessor_Validation_* (Epic 3, Story 3.1)
→ Implement zod schema validation
→ Write tests for every invalid payload shape
10. TestDynamoDBEventStore_* (Epic 3, Story 3.3) ← Integration tests with Testcontainers
→ Implement DynamoDB persistence
→ Depends on: DynamoDB Local container running
11. TestPostgreSQL_RLS_* (Epic 3, Story 3.3) ← Integration tests with Testcontainers
→ Apply schema migrations
→ Write multi-tenant isolation tests BEFORE any API handlers
12. TestDriftScorer_* (Epic 3, Story 3.4)
→ Implement drift score calculation
→ Depends on: PostgreSQL schema (reads/writes stacks table)
Phase 4: Notifications (Week 3)
─────────────────────────────────
13. TestNotificationFormatter_* (Epic 4, Story 4.1)
→ Implement Block Kit formatter
→ Snapshot tests for output JSON
14. TestSlackDelivery_* (Epic 4, Story 4.2) ← Integration with WireMock
→ Implement Slack API client
→ Depends on: Formatter output
15. TestNotificationBatching_* (Epic 4, Story 4.4)
→ Implement digest queue logic
→ Depends on: Slack delivery working
Phase 5: Dashboard API (Weeks 3-4)
───────────────────────────────────
16. TestDashboardAuth_* (Epic 5, Story 5.1)
→ Implement Cognito JWT middleware
→ RLS context-setting middleware
→ Write auth tests before any route handlers
17. TestStackEndpoints_* (Epic 5, Story 5.2)
→ Implement GET/PATCH /v1/stacks
→ Depends on: Auth middleware + PostgreSQL
18. TestDriftEventEndpoints_* (Epic 5, Story 5.3)
→ Implement GET /v1/drift-events with filters
→ Depends on: Stack endpoints
Phase 6: Slack Bot & Remediation (Week 4)
───────────────────────────────────────────
19. TestSlackInteraction_SignatureValidation_* (Epic 7, Story 7.1)
→ Implement signature verification FIRST
→ Write tests for valid and invalid signatures before any callback logic
20. TestRemediationEngine_* (Epic 7, Stories 7.1-7.2)
→ Implement revert and accept workflows
→ Depends on: Slack interaction handler, PostgreSQL remediation_plans table
21. TestPolicyEngine_* (Epic 10, Story 10.5)
→ Implement governance policy enforcement
→ Wrap remediation engine with policy checks
Phase 7: Transparent Factory Tenets (Week 4, parallel)
────────────────────────────────────────────────────────
22. TestFeatureFlag_* (Epic 10, Story 10.1)
→ Integrate OpenFeature SDK
→ Write flag tests alongside each new feature (not at the end)
23. TestOTELSpans_* (Epic 10, Story 10.4)
→ Add OTEL instrumentation to drift scan
→ Write span assertion tests
24. TestSchemaMigration_* (Epic 10, Story 10.2)
→ Implement schema lint tool
→ Add to CI pipeline
25. TestDecisionLog_* (Epic 10, Story 10.3)
→ Implement decision log validator
→ Add PR template check to CI
Phase 8: E2E & Performance (Weeks 4-5)
───────────────────────────────────────
26. E2E: Onboarding flow (install → detect → notify)
→ Requires all Phase 1-4 components working
→ First E2E test written after unit + integration tests pass
27. E2E: Remediation round-trip (Slack → apply → resolve)
→ Requires Phase 5-6 components
28. Performance benchmarks
→ Run after correctness is established
→ Fail CI if regression > 20%
```
---
### 10.3 Test Dependency Graph
```
StateParser ──────────────────────────────────────────────────────┐
│ │
▼ ▼
DriftComparator ──► SecretScrubber ──► Transmitter ──► E2E: Onboarding
DriftClassifier ──► DriftScorer ──► DynamoDB EventStore ──► Dashboard API
PostgreSQL RLS ──► Auth Middleware
Slack Formatter
Slack Delivery
Remediation Engine ──► E2E: Revert
Policy Engine
```
---
### 10.4 "Never Ship Without" Checklist
Before any code ships to production, these tests must be green:
```
□ TestSecretScrubber_* — all 16 tests passing (100% coverage)
□ TestPostgreSQL_RLS_CrossTenantIsolation — org A cannot read org B data
□ TestTransmitter_mTLSCertPresented_OnEveryRequest
□ TestGovernance_StrictMode_RemediationNeverExecutes
□ TestE2E_SecretScrubbing_NoSecretsReachSaaS
□ TestE2E_MultiTenantIsolation_OrgACannotSeeOrgBEvents
□ go test -race ./... — zero race conditions
□ Coverage gate: ≥ 80% overall, 100% on scrubber
□ Schema migration lint: no destructive changes
□ Feature flag TTL audit: no expired flags at 100% rollout
```
---
*Document complete. Total estimated test count at V1 launch: ~500 tests. Target by month 3: ~1,000 tests.*
---
## Section 11: Review Remediation Addendum (Post-Gemini Review)
### 11.1 Missing Epic Coverage
#### Epic 6: Dashboard UI (React Testing Library + Playwright)
```typescript
// tests/ui/components/DiffViewer.test.tsx
describe('DiffViewer Component', () => {
it('renders added lines in green', () => {});
it('renders removed lines in red', () => {});
it('renders unchanged lines in default color', () => {});
it('collapses large diffs with "Show more" toggle', () => {});
it('highlights HCL syntax in diff blocks', () => {});
it('shows resource type icon next to each drift item', () => {});
});
describe('StackOverview Component', () => {
it('renders drift count badge per stack', () => {});
it('sorts stacks by drift severity (critical first)', () => {});
it('shows last scan timestamp', () => {});
it('shows agent health indicator (green/yellow/red)', () => {});
});
// tests/e2e/ui/dashboard.spec.ts (Playwright)
import { test, expect } from '@playwright/test';

test('OAuth login redirects to Cognito and back', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page).toHaveURL(/cognito/);
});
test('stack list renders with drift counts', async ({ page }) => {
  await page.goto('/dashboard/stacks');
  // Playwright has no "count greater than" matcher; waiting for the first
  // match to be visible asserts that at least one card rendered.
  await expect(page.locator('[data-testid="stack-card"]').first()).toBeVisible();
});
test('diff viewer renders inline diff for Terraform resource', async ({ page }) => {
  await page.goto('/dashboard/stacks/stack-1/drifts/drift-1');
  await expect(page.locator('[data-testid="diff-viewer"]')).toBeVisible();
  await expect(page.locator('.diff-added').first()).toBeVisible();
});
test('revert button triggers confirmation modal', async ({ page }) => {
await page.goto('/dashboard/stacks/stack-1/drifts/drift-1');
await page.click('[data-testid="revert-btn"]');
await expect(page.locator('[data-testid="confirm-modal"]')).toBeVisible();
});
```
#### Epic 9: Onboarding & PLG (Stripe + drift init)
```go
// pkg/onboarding/stripe_test.go
func TestStripeWebhookCheckoutCompleted_UpgradesTenant(t *testing.T) {}
func TestStripeWebhookSubscriptionDeleted_DowngradesTenant(t *testing.T) {}
func TestStripeWebhookInvalidSignature_Returns401(t *testing.T) {}
func TestStripeWebhookReplayedEvent_IsIdempotent(t *testing.T) {}
// pkg/agent/init_test.go
func TestDriftInit_DetectsTerraformInCurrentDir(t *testing.T) {}
func TestDriftInit_DetectsCloudFormationInCurrentDir(t *testing.T) {}
func TestDriftInit_DetectsPulumiInCurrentDir(t *testing.T) {}
func TestDriftInit_GeneratesValidYAMLConfig(t *testing.T) {}
func TestDriftInit_HandlesWindowsPaths(t *testing.T) {}
func TestDriftInit_HandlesMacPaths(t *testing.T) {}
func TestDriftInit_HandlesLinuxPaths(t *testing.T) {}
func TestDriftInit_FailsGracefullyOnEmptyDir(t *testing.T) {}
```
#### Epic 8: Infrastructure (Terratest)
```go
// tests/infra/terraform_test.go
func TestTerraformPlan_CreatesExpectedResources(t *testing.T) {
terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
TerraformDir: "../../infra/terraform",
})
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndPlan(t, terraformOptions)
}
func TestTerraformApply_SQSFIFOQueueCreated(t *testing.T) {}
func TestTerraformApply_RDSInstanceCreated(t *testing.T) {}
func TestTerraformApply_IAMRolesHaveLeastPrivilege(t *testing.T) {
// Verify no IAM policy has Action: "*"
}
func TestTerraformApply_VPCSecurityGroupsRestrictIngress(t *testing.T) {}
```
#### Epic 2: mTLS Certificate Lifecycle
```go
// pkg/agent/mtls_test.go
func TestMTLS_CertificateGeneration_ValidX509(t *testing.T) {}
func TestMTLS_CertificateExpiration_AgentRejectsExpiredCert(t *testing.T) {}
func TestMTLS_CertificateRotation_NewCertAcceptedMidConnection(t *testing.T) {}
func TestMTLS_CertificateRevocation_RevokedCertRejected(t *testing.T) {}
func TestMTLS_SelfSignedCert_RejectedBySaaS(t *testing.T) {}
func TestMTLS_CertificateChain_IntermediateCAValidated(t *testing.T) {}
```
### 11.2 Add t.Parallel() to Table-Driven Tests
```go
// BEFORE (sequential — wastes CI time):
func TestSecretScrubber(t *testing.T) {
tests := []struct{ name, input, expected string }{...}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// runs sequentially
})
}
}
// AFTER (parallel):
func TestSecretScrubber(t *testing.T) {
t.Parallel()
tests := []struct{ name, input, expected string }{...}
for _, tt := range tests {
tt := tt // capture range variable
t.Run(tt.name, func(t *testing.T) {
t.Parallel()
// runs in parallel
})
}
}
```
### 11.3 Dynamic Resource Naming for LocalStack
```go
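// BEFORE (shared state, flaky):
// bucket := "drift-reports"
// AFTER (per-test isolation):
// S3 bucket names allow only lowercase letters, digits, and hyphens (3-63 chars),
// so t.Name() (mixed case, "/" in subtests) is hashed instead of embedded verbatim.
func uniqueBucket(t *testing.T) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s-%d", t.Name(), time.Now().UnixNano())))
	return fmt.Sprintf("drift-reports-%x", sum[:8])
}
func TestDriftReportUpload(t *testing.T) {
	t.Parallel()
	bucket := uniqueBucket(t)
	_, err := s3Client.CreateBucket(ctx, &s3.CreateBucketInput{Bucket: &bucket})
	require.NoError(t, err)
	// Test uses an isolated bucket: no cross-test contamination
}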
```
### 11.4 Distributed Tracing Cross-Boundary Tests
```go
// tests/integration/trace_propagation_test.go
func TestTraceContext_AgentToSaaS_SpanParentChain(t *testing.T) {
    // Agent generates drift_scan span with trace_id
    // POST /v1/drift-reports carries traceparent header
    // SaaS Event Processor creates child span
    // Verify parent-child relationship across HTTP boundary

    // The exporter only sees spans if it is registered with the Event Processor's
    // TracerProvider (e.g. sdktrace.WithSyncer(exporter)) before the request is sent.
    exporter := tracetest.NewInMemoryExporter()

    // Fire drift report with traceparent
    traceID := "4bf92f3577b34da6a3ce929d0e0e4736"
    resp := postDriftReport(t, stack, traceID)
    assert.Equal(t, 200, resp.StatusCode)

    spans := exporter.GetSpans()
    eventProcessorSpan := findSpan(spans, "drift_report.process")
    assert.Equal(t, traceID, eventProcessorSpan.SpanContext().TraceID().String())
}

func TestTraceContext_SQSBoundary_PreservesTraceID(t *testing.T) {
    // Verify SQS message attributes contain traceparent
    // Verify consumer extracts and continues the trace
}

func TestTraceContext_AgentScan_CreatesParentSpan(t *testing.T) {
    // Verify agent drift_scan span has correct attributes:
    // drift.stack_id, drift.resource_count, drift.duration_ms
}
```
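The tests above pass a bare trace ID, while the header on the wire is a full `traceparent` value. As a minimal sketch, assuming the W3C Trace Context version-00 layout, the header can be built and validated as follows. The helper names `buildTraceparent` and `parseTraceparent` are hypothetical and not part of the dd0c codebase; a real service would more likely rely on the OpenTelemetry propagator.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// traceparentRe matches the version-00 layout from W3C Trace Context:
// version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex).
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// buildTraceparent (hypothetical helper) formats a version-00 header value
// with the sampled flag set.
func buildTraceparent(traceID, spanID string) string {
	return fmt.Sprintf("00-%s-%s-01", traceID, spanID)
}

// parseTraceparent (hypothetical helper) returns the trace ID and parent span ID,
// rejecting malformed values and the all-zero IDs that the spec defines as invalid.
func parseTraceparent(header string) (traceID, parentID string, err error) {
	m := traceparentRe.FindStringSubmatch(header)
	if m == nil {
		return "", "", fmt.Errorf("malformed traceparent: %q", header)
	}
	if m[1] == strings.Repeat("0", 32) || m[2] == strings.Repeat("0", 16) {
		return "", "", fmt.Errorf("all-zero ID in traceparent: %q", header)
	}
	return m[1], m[2], nil
}

func main() {
	header := buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
	traceID, parentID, err := parseTraceparent(header)
	fmt.Println(header, traceID, parentID, err)
}
```

A test helper such as `postDriftReport` could use the builder to set the header, and the SQS boundary test could apply the parser to the message attribute before asserting on the trace ID.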
### 11.5 Backward Compatibility Serialization (Elastic Schema)
```go
// tests/schema/backward_compat_test.go
func TestOldAgent_ParsesNewDynamoDBItem_WithV2Attributes(t *testing.T) {
    // Simulate V2 DynamoDB item with new _v2 fields
    item := map[string]types.AttributeValue{
        "PK":             &types.AttributeValueMemberS{Value: "STACK#123"},
        "drift_score":    &types.AttributeValueMemberN{Value: "85"},
        "drift_score_v2": &types.AttributeValueMemberN{Value: "92"},   // New field
        "remediation_v2": &types.AttributeValueMemberS{Value: "auto"}, // New field
    }

    // V1 parser must ignore unknown fields
    result, err := ParseDriftItem(item)
    assert.NoError(t, err)
    assert.Equal(t, 85, result.DriftScore) // Uses V1 field
}

func TestV1Code_ReadsV2Writes_DuringMigrationWindow(t *testing.T) {
    // V2 writes both drift_score and drift_score_v2
    // V1 reads drift_score (ignores _v2)
    // Verify no data loss
}
```
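The property under test is that a V1 reader picks out only the attributes it knows and tolerates everything else. A minimal sketch of such a parser follows. It uses a plain `map[string]string` as a simplified stand-in for a DynamoDB item, and the names `driftItemV1` and `parseDriftItemV1` are hypothetical; the real `ParseDriftItem` may be structured differently.

```go
package main

import (
	"fmt"
	"strconv"
)

// driftItemV1 holds only the attributes the V1 code knows about (hypothetical type).
type driftItemV1 struct {
	PK         string
	DriftScore int
}

// parseDriftItemV1 reads the known attributes and never inspects the rest,
// so attributes added by newer writers cannot break it.
func parseDriftItemV1(item map[string]string) (driftItemV1, error) {
	pk, ok := item["PK"]
	if !ok {
		return driftItemV1{}, fmt.Errorf("missing PK")
	}
	raw, ok := item["drift_score"]
	if !ok {
		return driftItemV1{}, fmt.Errorf("missing drift_score")
	}
	score, err := strconv.Atoi(raw)
	if err != nil {
		return driftItemV1{}, fmt.Errorf("invalid drift_score %q: %w", raw, err)
	}
	return driftItemV1{PK: pk, DriftScore: score}, nil
}

func main() {
	item := map[string]string{
		"PK":             "STACK#123",
		"drift_score":    "85",
		"drift_score_v2": "92",   // unknown to V1
		"remediation_v2": "auto", // unknown to V1
	}
	parsed, err := parseDriftItemV1(item)
	fmt.Println(parsed, err)
}
```

The design choice is an allow-list of known keys rather than a strict schema check: a strict parser that errors on unknown attributes would fail as soon as V2 writers start adding `_v2` fields during the migration window.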
### 11.6 Security: RBAC Forgery & Replay Attacks
```go
// tests/integration/security_test.go
func TestAgentCannotForgeStackID(t *testing.T) {
    // Agent with API key for org-A sends drift report claiming stack belongs to org-B
    orgAKey := createAPIKey(t, "org-a")
    report := makeDriftReport("org-b-stack-id") // Wrong org
    resp := postDriftReportWithKey(t, report, orgAKey)
    assert.Equal(t, 403, resp.StatusCode)
}

func TestReplayAttack_DuplicateReportID_Rejected(t *testing.T) {
    report := makeDriftReport("stack-1")
    resp1 := postDriftReport(t, report)
    assert.Equal(t, 200, resp1.StatusCode)

    // Replay exact same report
    resp2 := postDriftReport(t, report)
    assert.Equal(t, 409, resp2.StatusCode) // Conflict: already processed
}

func TestReplayAttack_OldTimestamp_Rejected(t *testing.T) {
    report := makeDriftReport("stack-1")
    report.Timestamp = time.Now().Add(-10 * time.Minute) // 10 min old
    resp := postDriftReport(t, report)
    assert.Equal(t, 400, resp.StatusCode) // Stale report
}
```
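The two replay tests imply a server-side check that combines a freshness window with report-ID deduplication. The following is a minimal in-memory sketch of that idea. The type `replayGuard`, the error values, and the 5-minute window are all assumptions made for illustration; the source only establishes that a 10-minute-old report is rejected, and a production version would need shared storage rather than a process-local map.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	errStale     = errors.New("stale report")     // a handler might map this to 400
	errDuplicate = errors.New("duplicate report") // a handler might map this to 409
)

// replayGuard (hypothetical) rejects reports outside the freshness window
// and report IDs it has already accepted.
type replayGuard struct {
	mu     sync.Mutex
	maxAge time.Duration
	seen   map[string]time.Time // report ID -> report timestamp
}

func newReplayGuard(maxAge time.Duration) *replayGuard {
	return &replayGuard{maxAge: maxAge, seen: make(map[string]time.Time)}
}

func (g *replayGuard) check(reportID string, sentAt, now time.Time) error {
	// Reject timestamps too far in the past or in the future.
	if age := now.Sub(sentAt); age > g.maxAge || age < -g.maxAge {
		return errStale
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	// Evict IDs whose timestamps have aged out; replaying them would
	// already fail the freshness check above.
	for id, ts := range g.seen {
		if now.Sub(ts) > g.maxAge {
			delete(g.seen, id)
		}
	}
	if _, dup := g.seen[reportID]; dup {
		return errDuplicate
	}
	g.seen[reportID] = sentAt
	return nil
}

// replayDemo submits a fresh report, replays it, then submits a 10-minute-old one.
func replayDemo() []error {
	g := newReplayGuard(5 * time.Minute) // window length is an assumption
	now := time.Now()
	return []error{
		g.check("report-1", now, now),
		g.check("report-1", now, now),
		g.check("report-2", now.Add(-10*time.Minute), now),
	}
}

func main() {
	for _, err := range replayDemo() {
		fmt.Println(err)
	}
}
```

Tying eviction to the same window as the freshness check keeps the seen-set bounded: once an ID is old enough to forget, a replay of it is already rejected as stale.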
### 11.7 Noisy Neighbor & Fair-Share Processing
```go
// tests/integration/fair_share_test.go
func TestNoisyNeighbor_LargeOrgDoesNotStarveSmallOrg(t *testing.T) {
    // Org A: 10,000 drifted resources
    // Org B: 10 drifted resources
    // Both submit reports simultaneously
    seedDriftReports(t, "org-a", 10000)
    seedDriftReports(t, "org-b", 10)

    // Org B's reports must be processed within 30 seconds
    // (not queued behind all 10K of Org A's)
    start := time.Now()
    waitForProcessed(t, "org-b", 10, 30*time.Second)
    assert.Less(t, time.Since(start), 30*time.Second)
}
```
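One way to satisfy this test is per-tenant round-robin dequeuing, so a large backlog from one org only delays others by one item per rotation. The sketch below illustrates that technique under stated assumptions; `fairQueue` and its methods are hypothetical names, it is not safe for concurrent use, and the source does not say which scheduling policy dd0c/drift actually uses.

```go
package main

import "fmt"

// fairQueue (hypothetical) keeps one FIFO per org and serves orgs in rotation.
type fairQueue struct {
	order  []string            // orgs with pending work, in rotation order
	queues map[string][]string // per-org FIFO of report IDs
}

func newFairQueue() *fairQueue {
	return &fairQueue{queues: make(map[string][]string)}
}

// push appends a report to its org's FIFO and adds the org to the rotation
// if it had no pending work.
func (q *fairQueue) push(org, reportID string) {
	if len(q.queues[org]) == 0 {
		q.order = append(q.order, org)
	}
	q.queues[org] = append(q.queues[org], reportID)
}

// pop takes one report from the org at the front of the rotation, then moves
// that org to the back if it still has work.
func (q *fairQueue) pop() (org, reportID string, ok bool) {
	if len(q.order) == 0 {
		return "", "", false
	}
	org = q.order[0]
	q.order = q.order[1:]
	items := q.queues[org]
	reportID = items[0]
	if len(items) > 1 {
		q.queues[org] = items[1:]
		q.order = append(q.order, org)
	} else {
		delete(q.queues, org)
	}
	return org, reportID, true
}

func main() {
	q := newFairQueue()
	for i := 0; i < 10000; i++ {
		q.push("org-a", fmt.Sprintf("a-%d", i))
	}
	for i := 0; i < 10; i++ {
		q.push("org-b", fmt.Sprintf("b-%d", i))
	}
	pops := 0
	for remainingB := 10; remainingB > 0; pops++ {
		if org, _, _ := q.pop(); org == "org-b" {
			remainingB--
		}
	}
	fmt.Println("pops needed to drain org-b:", pops)
}
```

With two active orgs the rotation alternates between them, so Org B's latency depends on its own queue depth rather than on the size of Org A's backlog.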
### 11.8 Panic Mode Mid-Remediation Race Condition
```go
// tests/integration/panic_remediation_test.go
func TestPanicMode_AbortsInFlightRemediation(t *testing.T) {
    // Start a remediation (terraform apply)
    execID := startRemediation(t, "stack-1", "drift-1")
    waitForState(t, execID, "applying")

    // Trigger panic mode
    triggerPanicMode(t)

    // Remediation must be aborted, not completed
    state := waitForState(t, execID, "aborted")
    assert.Equal(t, "aborted", state)

    // Verify terraform state is not corrupted
    // (agent should have run terraform state pull to verify)
}

func TestPanicMode_DoesNotAbortReadOnlyScans(t *testing.T) {
    // Drift scans (read-only) should continue during panic
    // Only write operations (remediation) are halted
    scanID := startDriftScan(t, "stack-1")
    triggerPanicMode(t)
    state := waitForState(t, scanID, "completed")
    assert.Equal(t, "completed", state) // Scan finishes normally
}
```
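The behavior these tests describe (write operations stop, read-only scans continue) maps naturally onto context cancellation, where only remediation work receives the cancellable context. The following is a sketch of that pattern, not the agent's actual mechanism; `runRemediation` and `panicDemo` are hypothetical names, and the steps are stand-ins for real apply phases.

```go
package main

import (
	"context"
	"fmt"
)

// runRemediation (hypothetical) executes steps in order and checks for
// cancellation before each one, so a panic-mode signal stops the run at the
// next step boundary.
func runRemediation(ctx context.Context, steps []func()) string {
	for _, step := range steps {
		if ctx.Err() != nil {
			return "aborted"
		}
		step()
	}
	return "completed"
}

// panicDemo runs three steps and triggers panic mode during the second one.
// It returns the final state and the number of steps that actually ran.
func panicDemo() (string, int) {
	ctx, triggerPanic := context.WithCancel(context.Background())
	defer triggerPanic()
	applied := 0
	steps := []func(){
		func() { applied++ },
		func() { applied++; triggerPanic() }, // panic mode fires mid-run
		func() { applied++ },                 // skipped once the context is cancelled
	}
	return runRemediation(ctx, steps), applied
}

func main() {
	state, applied := panicDemo()
	fmt.Println(state, applied)

	// A read-only scan runs under a context that panic mode does not cancel.
	fmt.Println(runRemediation(context.Background(), []func(){func() {}}))
}
```

Checking only at step boundaries means an in-progress step always finishes, which is one way to avoid leaving state half-written; whether that granularity is acceptable for a long `terraform apply` is a separate design question.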
### 11.9 Remediation vs. Concurrent Scan Race Condition
```go
func TestConcurrentScanDuringRemediation_DoesNotReportHalfAppliedState(t *testing.T) {
    // Start remediation (terraform apply, takes ~30s)
    execID := startRemediation(t, "stack-1", "drift-1")
    waitForState(t, execID, "applying")

    // Trigger a drift scan while remediation is in progress
    scanID := startDriftScan(t, "stack-1")

    // Scan must either:
    // a) Wait for remediation to complete, OR
    // b) Skip the stack with "remediation in progress" status
    scanResult := waitForScanComplete(t, scanID)
    assert.NotEqual(t, "half-applied", scanResult.Status)
    // Must be either "skipped_remediation_in_progress" or show post-remediation state
}
```
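Option (b) above can be sketched as a per-stack guard that remediation sets and scans consult before reading. This is an in-process illustration only; `stackGuard` and its method names are hypothetical, the "scan" return value is invented for the example, and a multi-agent deployment would need the flag in shared storage.

```go
package main

import (
	"fmt"
	"sync"
)

// stackGuard (hypothetical) tracks which stacks have a remediation in flight.
type stackGuard struct {
	mu          sync.Mutex
	remediating map[string]bool
}

func newStackGuard() *stackGuard {
	return &stackGuard{remediating: make(map[string]bool)}
}

// beginRemediation marks the stack busy. It returns false if a remediation
// is already running for that stack.
func (g *stackGuard) beginRemediation(stackID string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.remediating[stackID] {
		return false
	}
	g.remediating[stackID] = true
	return true
}

// endRemediation clears the busy mark.
func (g *stackGuard) endRemediation(stackID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.remediating, stackID)
}

// scanDecision reports whether a scan may read the stack right now.
func (g *stackGuard) scanDecision(stackID string) string {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.remediating[stackID] {
		return "skipped_remediation_in_progress"
	}
	return "scan"
}

func main() {
	g := newStackGuard()
	g.beginRemediation("stack-1")
	fmt.Println(g.scanDecision("stack-1")) // remediation in flight
	fmt.Println(g.scanDecision("stack-2")) // unrelated stack
	g.endRemediation("stack-1")
	fmt.Println(g.scanDecision("stack-1")) // remediation finished
}
```

This covers a scan that starts during a remediation. A remediation that starts while a scan is already running is a separate case that the guard above does not address.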
### 11.10 SaaS API Memory Profiling
```go
// tests/load/memory_profile_test.go
func TestEventProcessor_DoesNotOOM_On1MB_DriftReport(t *testing.T) {
    // Generate a 1MB drift report (1000 resources with large diffs)
    report := makeLargeDriftReport(1000)
    assert.Greater(t, len(report), 1024*1024)

    var memBefore, memAfter runtime.MemStats
    runtime.GC() // settle the heap so earlier garbage does not skew the baseline
    runtime.ReadMemStats(&memBefore)
    processReport(t, report)
    runtime.ReadMemStats(&memAfter)

    // Alloc can shrink if a GC cycle runs during processing, so compare as signed
    // values; unsigned subtraction would wrap around and fail spuriously.
    growth := int64(memAfter.Alloc) - int64(memBefore.Alloc)
    assert.Less(t, growth, int64(50*1024*1024)) // <50MB growth
}
```
### 11.11 Trim E2E to Smoke Tier
Per the review recommendation, E2E is capped at 10 critical paths. The remaining 40 tests are demoted to the integration tier:
| E2E (Keep — 10 max) | Demoted to Integration |
|---------------------|----------------------|
| Onboarding: init → connect → first scan | Agent heartbeat variations |
| First drift detected → Slack alert | Individual parser format tests |
| Revert flow: Slack → agent apply → verify | Secret scrubber edge cases |
| Panic mode halts remediation | DynamoDB access pattern tests |
| Cross-tenant isolation | Individual webhook format tests |
| OAuth login → dashboard → view diff | Notification batching |
| Free tier limit enforcement | Agent config reload |
| Agent disconnect → reconnect → resume | Baseline score calculations |
| mTLS cert rotation mid-scan | Individual API endpoint tests |
| Stripe upgrade → unlock features | Cache invalidation patterns |
### 11.12 Updated Test Pyramid (Post-Review)
| Level | Original | Revised | Rationale |
|-------|----------|---------|-----------|
| Unit | 70% (~350) | 65% (~350) | Add t.Parallel(); count stays at ~350 while adding UI component tests, so the share falls as other tiers grow |
| Integration | 20% (~100) | 28% (~150) | Terratest, mTLS, trace propagation, fair-share, security |
| E2E/Smoke | 10% (~50) | 7% (~35) | Capped at 10 true E2E + 25 Playwright UI tests |
*End of P2 Review Remediation Addendum*