# dd0c/alert — Test Architecture & TDD Strategy
**Product:** dd0c/alert — Alert Intelligence Platform
**Author:** Test Architecture Phase
**Date:** February 28, 2026
**Status:** V1 MVP — Solo Founder Scope
---
## Section 1: Testing Philosophy & TDD Workflow
### 1.1 Core Philosophy
dd0c/alert is a **safety-critical observability tool** — a bug that silently suppresses a real alert during an incident is worse than having no tool at all. The test suite is the contract that guarantees "we will never eat your alerts."
Guiding principle: **tests describe observable behavior from the on-call engineer's perspective**. If a test can't be explained as "when X happens, the engineer sees Y," it's testing implementation, not behavior.
For a solo founder, the test suite is also the **regression safety net** — it catches the subtle scoring bugs that would erode customer trust over weeks.
### 1.2 Red-Green-Refactor Adapted to dd0c/alert
```
RED → Write a failing test that describes the desired behavior
(e.g., "3 Datadog alerts for the same service within 5 minutes
should produce 1 correlated incident")
GREEN → Write the minimum code to make it pass
(hardcode the window, just make it work)
REFACTOR → Clean up without breaking tests
(extract the window manager, add Redis backing,
optimize the fingerprinting)
```
**When to write tests first (strict TDD):**
- All correlation logic (time-window clustering, service graph traversal, deploy correlation)
- All noise scoring algorithms (rule-based scoring, threshold calculations)
- All HMAC signature validation (security-critical)
- All fingerprinting/deduplication logic
- All suppression governance (strict vs. audit mode)
- All circuit breaker state transitions (suppression DLQ replay)
**When integration tests lead (test-after, then harden):**
- Provider webhook parsers — implement against real payload samples, then lock in with contract tests
- SQS FIFO message ordering — test against LocalStack after implementation
- Slack message formatting — build the blocks, then snapshot test the output
**When E2E tests lead:**
- The 60-second time-to-value journey — define the happy path first, build backward
- Weekly noise digest generation — define expected output, then build the aggregation
### 1.3 Test Naming Conventions
```typescript
// Unit tests (vitest)
describe('CorrelationEngine', () => {
it('groups alerts for same service within 5min window into single incident', () => {});
it('extends window by 2min when alert arrives in last 30 seconds', () => {});
it('caps window extension at 15 minutes total', () => {});
it('merges downstream service alerts when upstream window is active', () => {});
});
describe('NoiseScorer', () => {
it('scores deploy-correlated alerts higher when deploy is within 10min', () => {});
it('returns zero noise score for first-ever alert from a service', () => {});
it('adds 5 points when PR title matches config or feature-flag', () => {});
});
describe('HmacValidator', () => {
it('rejects Datadog webhook with missing DD-WEBHOOK-SIGNATURE header', () => {});
it('rejects PagerDuty webhook with tampered body', () => {});
it('accepts valid signature and passes payload through', () => {});
});
```
**Rules:**
- Describe the **observable outcome**, not the internal mechanism
- Use present tense ("groups", "rejects", "scores")
- If you need "and" in the name, split into two tests
- Group by component in `describe` blocks
---
## Section 2: Test Pyramid
### 2.1 Ratio
| Level | Target | Count (V1) | Runtime |
|-------|--------|------------|---------|
| Unit | 70% | ~350 tests | <30s |
| Integration | 20% | ~100 tests | <5min |
| E2E/Smoke | 10% | ~20 tests | <10min |
### 2.2 Unit Test Targets (per component)
| Component | Key Behaviors | Est. Tests |
|-----------|--------------|------------|
| Webhook Parsers (Datadog, PD, OpsGenie, Grafana) | Payload normalization, field mapping, batch handling | 60 |
| HMAC Validator | Signature verification per provider, rejection paths | 20 |
| Fingerprint Generator | Deterministic hashing, dedup detection | 15 |
| Correlation Engine | Time-window open/close/extend, service graph merge, deploy correlation | 80 |
| Noise Scorer | Rule-based scoring, deploy proximity weighting, threshold calculations | 60 |
| Suggestion Engine | Suppression recommendations, "what would have happened" calculations | 30 |
| Notification Formatter | Slack block formatting, digest generation, in-place message updates | 25 |
| Governance Policy | Strict/audit mode enforcement, panic mode, per-customer overrides | 30 |
| Feature Flags | Circuit breaker on suppression volume, flag lifecycle | 15 |
| Canonical Schema Mapper | Provider → canonical field mapping, severity normalization | 15 |
### 2.3 Integration Test Boundaries
| Boundary | What's Tested | Infrastructure |
|----------|--------------|----------------|
| Lambda → SQS FIFO | Message ordering, dedup, tenant partitioning | LocalStack |
| SQS → Correlation Engine | Consumer polling, batch processing, error handling | LocalStack |
| Correlation Engine → Redis | Window CRUD, sorted set operations, TTL expiry | Testcontainers Redis |
| Correlation Engine → DynamoDB | Incident persistence, tenant config reads | Testcontainers DynamoDB Local |
| Correlation Engine → TimescaleDB | Time-series writes, continuous aggregate queries | Testcontainers PostgreSQL + TimescaleDB |
| Notification Service → Slack | Block formatting, rate limiting, message update | WireMock |
| API Gateway → Lambda | Webhook routing, auth, throttling | LocalStack |
### 2.4 E2E/Smoke Scenarios
1. **60-Second TTV Journey**: Webhook received → alert in Slack within 60s
2. **Alert Storm Correlation**: 50 alerts in 2 minutes → grouped into 1 incident
3. **Deploy Correlation**: Deploy event + alert storm → deploy identified as trigger
4. **Noise Digest**: 7 days of alerts → weekly Slack digest with noise stats
5. **Multi-Provider Merge**: Datadog + PagerDuty alerts for same service → single incident
6. **Panic Mode**: Enable panic → all suppression stops → alerts pass through raw
---
## Section 3: Unit Test Strategy
### 3.1 Webhook Parsers
Each provider parser is a pure function: payload in, canonical alert(s) out. No side effects, no DB calls.
```typescript
// tests/unit/parsers/datadog.test.ts
describe('DatadogParser', () => {
it('normalizes single alert payload to canonical schema', () => {});
it('normalizes batched alert array into multiple canonical alerts', () => {});
it('maps Datadog P1 to critical, P5 to info', () => {});
it('extracts service name from tags array', () => {});
it('handles missing optional fields without throwing', () => {});
it('generates stable fingerprint from tenant + provider + service + title', () => {});
});
// tests/unit/parsers/pagerduty.test.ts
describe('PagerDutyParser', () => {
it('normalizes incident.triggered event to canonical alert', () => {});
it('normalizes incident.resolved event with resolution metadata', () => {});
it('ignores incident.acknowledged events (not alerts)', () => {});
it('maps PD urgency high to critical, low to info', () => {});
});
// tests/unit/parsers/opsgenie.test.ts
describe('OpsGenieParser', () => {
it('normalizes alert.created action to canonical alert', () => {});
it('extracts priority P1-P5 and maps to severity', () => {});
it('handles custom fields in details object', () => {});
});
// tests/unit/parsers/grafana.test.ts
describe('GrafanaParser', () => {
it('normalizes Grafana Alertmanager webhook payload', () => {});
it('handles multiple alerts in single webhook (Grafana batches)', () => {});
it('extracts dashboard URL as context link', () => {});
});
```
**Mocking strategy:** None needed — parsers are pure functions. Use recorded payload fixtures from `fixtures/webhooks/{provider}/`.
**Fixture structure:**
```
fixtures/webhooks/
datadog/
single-alert.json
batched-alerts.json
monitor-recovered.json
pagerduty/
incident-triggered.json
incident-resolved.json
incident-acknowledged.json
opsgenie/
alert-created.json
alert-closed.json
grafana/
single-firing.json
multi-firing.json
resolved.json
```
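A parser in this style can be sketched as a plain function over a recorded payload. The shapes below (`DatadogLikePayload`, `CanonicalAlert`) and the function name are illustrative assumptions, not the real Datadog webhook schema or the project's canonical schema. Only the P1 → critical and P5 → info endpoints come from the test names above; the middle of the mapping and the fallback are assumed.

```typescript
// Hypothetical canonical shape; the real schema lives elsewhere in the project.
type Severity = 'critical' | 'high' | 'medium' | 'low' | 'info';

interface CanonicalAlert {
  provider: 'datadog';
  title: string;
  service: string | null;
  severity: Severity;
}

// Assumed input shape, for illustration only.
interface DatadogLikePayload {
  title: string;
  priority?: string; // e.g. "P1".."P5"
  tags?: string[];   // e.g. ["service:payment-api", "env:prod"]
}

const PRIORITY_TO_SEVERITY: Record<string, Severity> = {
  P1: 'critical',
  P2: 'high',   // assumed
  P3: 'medium', // assumed
  P4: 'low',    // assumed
  P5: 'info',
};

// Pure function: no I/O, no clock, so fixtures alone are enough to test it.
function parseDatadogLike(payload: DatadogLikePayload): CanonicalAlert {
  const serviceTag = (payload.tags ?? []).find((t) => t.startsWith('service:'));
  return {
    provider: 'datadog',
    title: payload.title,
    service: serviceTag ? serviceTag.slice('service:'.length) : null,
    // Missing or unknown priority falls back to "medium" (an assumption).
    severity: PRIORITY_TO_SEVERITY[payload.priority ?? ''] ?? 'medium',
  };
}
```

Because the function takes the payload as its only input, each fixture file maps to one assertion on the returned object.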
### 3.2 HMAC Validator
```typescript
describe('HmacValidator', () => {
// Datadog uses hex-encoded HMAC-SHA256
it('validates correct Datadog DD-WEBHOOK-SIGNATURE header', () => {});
it('rejects Datadog webhook with wrong signature', () => {});
it('rejects Datadog webhook with missing signature header', () => {});
// PagerDuty uses v1= prefix with HMAC-SHA256
it('validates correct PagerDuty X-PagerDuty-Signature header', () => {});
it('rejects PagerDuty webhook with tampered body', () => {});
// OpsGenie uses different header name
it('validates correct OpsGenie X-OpsGenie-Signature header', () => {});
// Edge cases
it('rejects empty body with any signature', () => {});
it('uses timing-safe comparison to prevent timing attacks', () => {});
});
```
**Mocking strategy:** None — crypto operations are deterministic. Use known secret + body + expected signature triples.
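A minimal sketch of the validation these tests exercise, assuming a hex-encoded HMAC-SHA256 over the raw body. The function names are hypothetical, and the per-provider details (header names, the `v1=` prefix) are left out.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Computes a hex-encoded HMAC-SHA256 over the raw request body.
function computeHexHmac(rawBody: string, secret: string): string {
  return createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex');
}

// Returns true only when the header is present, the body is non-empty,
// and the signature matches under a constant-time comparison.
function isValidSignature(rawBody: string, header: string | undefined, secret: string): boolean {
  if (!header || rawBody.length === 0) return false;
  const expected = Buffer.from(computeHexHmac(rawBody, secret), 'utf8');
  const received = Buffer.from(header, 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

A known triple is then just `isValidSignature(body, computeHexHmac(body, secret), secret)`, with the rejection paths covered by altering one element at a time.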
### 3.3 Fingerprint Generator
```typescript
describe('FingerprintGenerator', () => {
it('generates deterministic SHA-256 from tenant_id + provider + service + title', () => {});
it('produces same fingerprint for identical alerts regardless of timestamp', () => {});
it('produces different fingerprints when service differs', () => {});
it('normalizes title whitespace before hashing', () => {});
it('handles unicode characters in title consistently', () => {});
});
```
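One way to satisfy these tests, sketched under assumptions: the field names and the JSON-array encoding of the hash input are illustrative choices, not the project's confirmed implementation. Unicode is normalized to NFC so composed and decomposed forms of the same title hash identically.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical input shape for illustration.
interface FingerprintInput {
  tenantId: string;
  provider: string;
  service: string;
  title: string;
}

// Applies Unicode NFC, trims, and collapses whitespace runs to one space.
function normalizeTitle(title: string): string {
  return title.normalize('NFC').trim().replace(/\s+/g, ' ');
}

// Timestamp is deliberately not part of the input, so repeats of the same alert collide.
function fingerprint(input: FingerprintInput): string {
  // JSON-encoding the parts avoids ambiguity between e.g. ("ab", "c") and ("a", "bc").
  const parts = [input.tenantId, input.provider, input.service, normalizeTitle(input.title)];
  return createHash('sha256').update(JSON.stringify(parts), 'utf8').digest('hex');
}
```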
### 3.4 Correlation Engine
The most complex component. Heavy use of table-driven tests.
```typescript
describe('CorrelationEngine', () => {
describe('Time-Window Management', () => {
it('opens new 5min window on first alert for a service', () => {});
it('adds subsequent alerts to existing open window', () => {});
it('extends window by 2min when alert arrives in last 30 seconds', () => {});
it('caps total window duration at 15 minutes', () => {});
it('closes window after timeout with no new alerts', () => {});
it('generates incident record when window closes', () => {});
});
describe('Service Graph Correlation', () => {
it('merges downstream alerts into upstream window when dependency exists', () => {});
it('does not merge alerts for unrelated services', () => {});
it('handles circular dependencies without infinite loop', () => {});
it('traverses multi-level dependency chains (A→B→C)', () => {});
});
describe('Deploy Correlation', () => {
it('tags incident with deploy_id when deploy event within 10min of first alert', () => {});
it('does not correlate deploy older than 10 minutes', () => {});
it('correlates deploy to correct service even with multiple recent deploys', () => {});
it('adds deploy correlation score boost to noise calculation', () => {});
});
describe('Multi-Tenant Isolation', () => {
it('never correlates alerts across different tenants', () => {});
it('maintains separate windows per tenant', () => {});
it('handles concurrent alerts from multiple tenants', () => {});
});
});
```
**Mocking strategy:**
- Mock Redis client (`ioredis-mock`) for window state
- Mock DynamoDB client for service dependency reads
- Mock SQS for downstream message publishing
- Use Vitest's `vi.useFakeTimers()` for time-window testing
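The window rules named in the time-window tests (5-minute window, 2-minute extension when an alert lands in the last 30 seconds, 15-minute total cap) can be sketched as pure functions over an injected `now`. The type and function names are hypothetical, and the Redis backing is omitted.

```typescript
const INITIAL_WINDOW_MS = 5 * 60_000;
const EXTENSION_MS = 2 * 60_000;
const EXTEND_THRESHOLD_MS = 30_000;
const MAX_WINDOW_MS = 15 * 60_000;

interface CorrelationWindow {
  openedAt: number; // epoch ms
  closesAt: number; // epoch ms
  alertCount: number;
}

function openWindow(now: number): CorrelationWindow {
  return { openedAt: now, closesAt: now + INITIAL_WINDOW_MS, alertCount: 1 };
}

// Returns the updated window, or null when the window has already closed
// and the caller should open a new one.
function addAlert(w: CorrelationWindow, now: number): CorrelationWindow | null {
  if (now >= w.closesAt) return null;
  const inLastThirtySeconds = w.closesAt - now <= EXTEND_THRESHOLD_MS;
  const hardCap = w.openedAt + MAX_WINDOW_MS;
  const closesAt = inLastThirtySeconds
    ? Math.min(w.closesAt + EXTENSION_MS, hardCap)
    : w.closesAt;
  return { ...w, closesAt, alertCount: w.alertCount + 1 };
}
```

Passing `now` explicitly keeps the logic testable without fake timers; fake timers are then only needed where the engine reads the clock itself.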
### 3.5 Noise Scorer
```typescript
describe('NoiseScorer', () => {
describe('Rule-Based Scoring', () => {
it('returns 0 for first-ever alert from a service (no history)', () => {});
it('scores higher when alert has fired >5 times in 24 hours', () => {});
it('scores higher when alert auto-resolved within 5 minutes', () => {});
it('adds deploy correlation bonus (+15 points) when deploy is recent', () => {});
it('adds feature-flag bonus (+5 points) when PR title matches config/feature-flag', () => {});
it('caps total score at 100', () => {});
it('never scores critical severity alerts above 80 (safety cap)', () => {});
});
describe('Threshold Calculations', () => {
it('classifies score 0-30 as signal (keep)', () => {});
it('classifies score 31-70 as review (annotate)', () => {});
it('classifies score 71-100 as noise (suggest suppress)', () => {});
it('uses tenant-specific thresholds when configured', () => {});
});
describe('What-Would-Have-Happened', () => {
it('calculates suppression count for historical window', () => {});
it('reports zero false negatives when no suppressed alert was critical', () => {});
it('flags false negative when suppressed alert was later escalated', () => {});
});
});
```
**Mocking strategy:** Mock the alert history store (DynamoDB queries). Scorer logic itself is pure calculation.
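The scoring rules above can be sketched as a pure calculation. The +15 deploy bonus, +5 feature-flag bonus, caps, and classification bands come from the test names; the weights for frequency and auto-resolve (`FREQUENT_POINTS`, `AUTO_RESOLVE_POINTS`) are placeholder assumptions, as are all type and function names.

```typescript
type Classification = 'signal' | 'review' | 'noise';

interface ScoringInput {
  severity: 'critical' | 'high' | 'medium' | 'low' | 'info';
  hasHistory: boolean;
  firesLast24h: number;
  autoResolvedWithin5Min: boolean;
  deployCorrelated: boolean;
  prTitleMatchesConfigOrFlag: boolean;
}

const FREQUENT_POINTS = 40;     // assumed weight
const AUTO_RESOLVE_POINTS = 30; // assumed weight
const DEPLOY_POINTS = 15;
const FEATURE_FLAG_POINTS = 5;
const MAX_SCORE = 100;
const CRITICAL_CAP = 80;

function scoreNoise(input: ScoringInput): number {
  // First-ever alert from a service: nothing to compare against, so treat as signal.
  if (!input.hasHistory) return 0;
  let score = 0;
  if (input.firesLast24h > 5) score += FREQUENT_POINTS;
  if (input.autoResolvedWithin5Min) score += AUTO_RESOLVE_POINTS;
  if (input.deployCorrelated) score += DEPLOY_POINTS;
  if (input.prTitleMatchesConfigOrFlag) score += FEATURE_FLAG_POINTS;
  score = Math.min(score, MAX_SCORE);
  // Safety cap: critical alerts never score above 80.
  return input.severity === 'critical' ? Math.min(score, CRITICAL_CAP) : score;
}

// Default bands: 0-30 signal, 31-70 review, 71-100 noise; tenants may override.
function classify(score: number, bands = { signalMax: 30, reviewMax: 70 }): Classification {
  if (score <= bands.signalMax) return 'signal';
  if (score <= bands.reviewMax) return 'review';
  return 'noise';
}
```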
### 3.6 Notification Formatter
```typescript
describe('NotificationFormatter', () => {
describe('Slack Blocks', () => {
it('formats single-alert notification with service, title, severity', () => {});
it('formats correlated incident with alert count and sources', () => {});
it('includes deploy trigger when deploy correlation exists', () => {});
it('includes noise score badge (🟢 signal / 🟡 review / 🔴 noise)', () => {});
it('includes feedback buttons (👍 Helpful / 👎 Not helpful)', () => {});
it('formats in-place update message (replaces initial alert)', () => {});
});
describe('Weekly Digest', () => {
it('aggregates 7 days of incidents into summary stats', () => {});
it('highlights top 3 noisiest services', () => {});
it('shows suppression savings ("would have saved X pages")', () => {});
});
});
```
**Mocking strategy:** Snapshot tests — render the Slack blocks to JSON and compare against golden fixtures.
### 3.7 Governance Policy Engine
```typescript
describe('GovernancePolicy', () => {
describe('Mode Enforcement', () => {
it('in strict mode: annotates alerts but never suppresses', () => {});
it('in audit mode: auto-suppresses with full logging', () => {});
it('defaults new tenants to strict mode', () => {});
});
describe('Panic Mode', () => {
it('when panic=true: all suppression stops immediately', () => {});
it('when panic=true: all alerts pass through unmodified', () => {});
it('panic mode activatable via Redis key check', () => {});
it('panic mode shows banner in dashboard API response', () => {});
});
describe('Per-Customer Override', () => {
it('customer can set stricter mode than system default', () => {});
it('customer cannot set less restrictive mode than system default', () => {});
it('merge logic: max_restrictive(system, customer)', () => {});
});
describe('Policy Decision Logging', () => {
it('logs "suppressed by audit mode" with full context', () => {});
it('logs "annotation-only, strict mode active" for strict tenants', () => {});
it('logs "panic mode active — all alerts passing through"', () => {});
});
});
```
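The `max_restrictive(system, customer)` merge and the panic override can be sketched as below. The names are hypothetical, and the handling of non-noise alerts in audit mode (annotate rather than pass through) is an assumption, since the tests above do not pin it down.

```typescript
type GovernanceMode = 'strict' | 'audit';
type Decision = 'pass-through' | 'annotate' | 'suppress';

// Higher rank = more restrictive. Strict never suppresses, so it outranks audit.
const RESTRICTIVENESS: Record<GovernanceMode, number> = { audit: 0, strict: 1 };

// A customer override can only tighten the system default, never loosen it.
function maxRestrictive(system: GovernanceMode, customer?: GovernanceMode): GovernanceMode {
  if (!customer) return system;
  return RESTRICTIVENESS[customer] > RESTRICTIVENESS[system] ? customer : system;
}

interface PolicyInput {
  system: GovernanceMode;
  customer?: GovernanceMode;
  panic: boolean;
  scoredAsNoise: boolean;
}

function decide(input: PolicyInput): Decision {
  // Panic mode wins over everything: alerts pass through unmodified.
  if (input.panic) return 'pass-through';
  const mode = maxRestrictive(input.system, input.customer);
  if (mode === 'strict') return 'annotate';
  // Audit mode: suppress only what was scored as noise (assumed behavior otherwise).
  return input.scoredAsNoise ? 'suppress' : 'annotate';
}
```

Keeping the decision a pure function of its inputs means each logged policy decision can be reproduced from the logged context alone.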
### 3.8 Feature Flag Circuit Breaker
```typescript
describe('SuppressionCircuitBreaker', () => {
it('allows suppression when volume is within 2x baseline', () => {});
it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
it('auto-disables the scoring flag when breaker trips', () => {});
it('replays suppressed alerts from DLQ when breaker trips', () => {});
it('resets breaker after manual flag re-enable', () => {});
it('tracks suppression count per flag in Redis sliding window', () => {});
});
```
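The trip condition (suppression volume above 2x baseline within a 30-minute sliding window) can be sketched in memory as follows. This is an illustration under assumptions: the class name and API are hypothetical, and the Redis-backed counter, flag auto-disable, and DLQ replay are not modeled.

```typescript
const BREAKER_WINDOW_MS = 30 * 60_000;
const TRIP_MULTIPLIER = 2;

class SuppressionBreaker {
  private readonly baselinePerWindow: number;
  private timestamps: number[] = [];
  private tripped = false;

  constructor(baselinePerWindow: number) {
    this.baselinePerWindow = baselinePerWindow;
  }

  // Records one suppression at `now` (epoch ms) and returns whether
  // suppression is still allowed afterwards.
  recordSuppression(now: number): boolean {
    if (this.tripped) return false;
    this.timestamps.push(now);
    // Drop entries that have fallen out of the 30-minute sliding window.
    const cutoff = now - BREAKER_WINDOW_MS;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length > TRIP_MULTIPLIER * this.baselinePerWindow) {
      this.tripped = true;
    }
    return !this.tripped;
  }

  isTripped(): boolean {
    return this.tripped;
  }

  // Corresponds to a manual flag re-enable.
  reset(): void {
    this.tripped = false;
    this.timestamps = [];
  }
}
```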
---
## Section 4: Integration Test Strategy
### 4.1 Webhook Contract Tests
Each provider integration gets a contract test suite that validates the full path: HTTP request → Lambda → SQS message.
```typescript
// tests/integration/webhooks/datadog.contract.test.ts
describe('Datadog Webhook Contract', () => {
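let localstack: StartedLocalStackContainer;
let sqsClient: SQSClient;
beforeAll(async () => {
// Image tag mirrors docker-compose.e2e.yml; recent @testcontainers/localstack releases expect an explicit image.
localstack = await new LocalStackContainer('localstack/localstack:3').start();
sqsClient = new SQSClient({
endpoint: localstack.getConnectionUri(),
region: 'us-east-1',
credentials: { accessKeyId: 'test', secretAccessKey: 'test' }, // dummy values accepted by LocalStack
});
// Create SQS FIFO queue
await sqsClient.send(new CreateQueueCommand({
QueueName: 'alert-ingested.fifo',
Attributes: { FifoQueue: 'true', ContentBasedDeduplication: 'true' }
}));
});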
it('accepts valid Datadog webhook and produces canonical SQS message', async () => {
const payload = loadFixture('webhooks/datadog/single-alert.json');
const signature = computeHmac(payload, TEST_SECRET);
const res = await request(app)
.post('/v1/wh/tenant-123/datadog')
.set('DD-WEBHOOK-SIGNATURE', signature)
.send(payload);
expect(res.status).toBe(200);
const messages = await pollSqs(sqsClient, 'alert-ingested.fifo');
expect(messages).toHaveLength(1);
expect(messages[0].body).toMatchObject({
tenant_id: 'tenant-123',
provider: 'datadog',
severity: expect.stringMatching(/^(critical|high|medium|low|info)$/),
fingerprint: expect.stringMatching(/^[a-f0-9]{64}$/),
});
});
it('rejects webhook with invalid HMAC and produces no SQS message', async () => {
const payload = loadFixture('webhooks/datadog/single-alert.json');
const res = await request(app)
.post('/v1/wh/tenant-123/datadog')
.set('DD-WEBHOOK-SIGNATURE', 'bad-signature')
.send(payload);
expect(res.status).toBe(401);
const messages = await pollSqs(sqsClient, 'alert-ingested.fifo', { waitMs: 1000 });
expect(messages).toHaveLength(0);
});
});
```
Repeat pattern for PagerDuty, OpsGenie, Grafana — each with provider-specific signature headers and payload formats.
### 4.2 Correlation Engine → Redis Integration
```typescript
// tests/integration/correlation/redis-windows.test.ts
describe('Correlation Engine + Redis', () => {
let redis: StartedTestContainer;
let redisClient: Redis;
beforeAll(async () => {
redis = await new GenericContainer('redis:7-alpine')
.withExposedPorts(6379)
.start();
redisClient = new Redis({ host: redis.getHost(), port: redis.getMappedPort(6379) });
});
it('opens window in Redis sorted set with correct TTL', async () => {
await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
const windows = await redisClient.zrange('windows:tenant-123', 0, -1, 'WITHSCORES');
expect(windows).toHaveLength(2); // [windowId, closesAtEpoch]
const ttl = await redisClient.ttl('window:tenant-123:payment-api');
expect(ttl).toBeGreaterThan(280); // ~5min minus processing time
});
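it('extends window when alert arrives in last 30 seconds', async () => {
// Fake only Date so ioredis socket timers keep working. Redis expires keys on its own
// clock, so this assumes the engine derives each TTL from Date.now().
vi.useFakeTimers({ toFake: ['Date'] });
// Open window, advance clock to T+4m31s, send another alert
await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
vi.advanceTimersByTime(4 * 60 * 1000 + 31 * 1000);
await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
const ttl = await redisClient.ttl('window:tenant-123:payment-api');
vi.useRealTimers();
expect(ttl).toBeGreaterThan(100); // Extended by ~2min: 29s remaining + 120s
expect(ttl).toBeLessThanOrEqual(150); // An unextended key would still report ~300s
});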
it('isolates windows between tenants', async () => {
await correlationEngine.processAlert(makeAlert({ tenant: 'A', service: 'api' }));
await correlationEngine.processAlert(makeAlert({ tenant: 'B', service: 'api' }));
const windowsA = await redisClient.zrange('windows:A', 0, -1);
const windowsB = await redisClient.zrange('windows:B', 0, -1);
expect(windowsA).toHaveLength(1);
expect(windowsB).toHaveLength(1);
expect(windowsA[0]).not.toBe(windowsB[0]);
});
});
```
### 4.3 Correlation Engine → DynamoDB Integration
```typescript
// tests/integration/correlation/dynamodb-incidents.test.ts
describe('Correlation Engine + DynamoDB', () => {
let dynamodb: StartedTestContainer;
beforeAll(async () => {
dynamodb = await new GenericContainer('amazon/dynamodb-local:latest')
.withExposedPorts(8000)
.start();
// Create tables: alerts, incidents, tenant_config, service_dependencies
});
it('persists incident record when correlation window closes', async () => {
await correlationEngine.processAlert(makeAlert({ service: 'api' }));
await correlationEngine.processAlert(makeAlert({ service: 'api' }));
await correlationEngine.closeExpiredWindows();
const incidents = await queryIncidents('tenant-123');
expect(incidents).toHaveLength(1);
expect(incidents[0].alert_count).toBe(2);
expect(incidents[0].services).toContain('api');
});
it('reads service dependencies for cascading correlation', async () => {
await putServiceDependency('tenant-123', 'api', 'database');
await correlationEngine.processAlert(makeAlert({ service: 'database' }));
await correlationEngine.processAlert(makeAlert({ service: 'api' }));
// Both should be in the same window
const windows = await getActiveWindows('tenant-123');
expect(windows).toHaveLength(1);
expect(windows[0].services).toEqual(expect.arrayContaining(['api', 'database']));
});
});
```
### 4.4 Correlation Engine → TimescaleDB Integration
```typescript
// tests/integration/correlation/timescaledb-trends.test.ts
describe('Correlation Engine + TimescaleDB', () => {
let pg: StartedTestContainer;
beforeAll(async () => {
pg = await new GenericContainer('timescale/timescaledb:latest-pg16')
.withExposedPorts(5432)
.withEnvironment({ POSTGRES_PASSWORD: 'test' })
.start();
// Run migrations: create hypertables, continuous aggregates
});
it('writes alert frequency data to hypertable', async () => {
await correlationEngine.recordAlertEvent(makeAlert({ service: 'api' }));
const rows = await query('SELECT * FROM alert_events WHERE service = $1', ['api']);
expect(rows).toHaveLength(1);
});
it('continuous aggregate calculates hourly alert counts', async () => {
// Insert 10 alerts spread over 2 hours
await insertAlertEvents(10, { spreadHours: 2 });
await refreshContinuousAggregate('hourly_alert_summary');
const summary = await query('SELECT * FROM hourly_alert_summary');
expect(summary).toHaveLength(2);
expect(summary.reduce((s, r) => s + Number(r.alert_count), 0)).toBe(10); // bigint counts come back from node-postgres as strings
});
});
```
### 4.5 Notification Service → Slack (WireMock)
```typescript
// tests/integration/notifications/slack.test.ts
describe('Notification Service + Slack', () => {
let wiremock: WireMockContainer;
beforeAll(async () => {
wiremock = await new WireMockContainer().start();
wiremock.stub({
request: { method: 'POST', urlPath: '/api/chat.postMessage' },
response: { status: 200, body: JSON.stringify({ ok: true, ts: '1234.5678' }) }
});
wiremock.stub({
request: { method: 'POST', urlPath: '/api/chat.update' },
response: { status: 200, body: JSON.stringify({ ok: true }) }
});
});
it('sends initial alert notification to correct Slack channel', async () => {});
it('updates message in-place when correlation completes', async () => {});
it('respects Slack rate limits (1 msg/sec per channel)', async () => {});
it('retries on 429 with exponential backoff', async () => {});
it('includes feedback buttons in correlated incident message', async () => {});
});
```
---
## Section 5: E2E & Smoke Tests
### 5.1 Critical User Journeys
**Journey 1: 60-Second Time-to-Value**
The defining test for dd0c/alert. Validates the entire pipeline from webhook to Slack notification.
```typescript
// tests/e2e/journeys/sixty-second-ttv.test.ts
describe('60-Second Time-to-Value', () => {
it('delivers first correlated incident to Slack within 60 seconds of webhook', async () => {
const start = Date.now();
// 1. Send Datadog webhook
await sendWebhook('datadog', fixtures.datadog.singleAlert, { tenant: 'e2e-tenant' });
// 2. Wait for Slack message
const slackMessage = await waitForSlackMessage('e2e-channel', { timeoutMs: 60_000 });
const elapsed = Date.now() - start;
expect(elapsed).toBeLessThan(60_000);
expect(slackMessage.text).toContain('New alert');
expect(slackMessage.blocks).toBeDefined();
}, 70_000); // waits up to 60s; Vitest's default test timeout is 5s
});
```
**Journey 2: Alert Storm Correlation**
```typescript
// tests/e2e/journeys/alert-storm.test.ts
describe('Alert Storm Correlation', () => {
it('groups 50 alerts in 2 minutes into a single correlated incident', async () => {
// Fire 50 alerts for same service over 2 minutes
for (let i = 0; i < 50; i++) {
await sendWebhook('datadog', makeAlertPayload({
service: 'payment-api',
title: `High latency on payment-api (${i})`,
}));
await sleep(2400); // ~50 alerts in 2 min
}
// Wait for correlation window to close
await sleep(5 * 60 * 1000 + 30_000); // 5min window + buffer
const slackMessages = await getSlackMessages('e2e-channel');
const incidentMessages = slackMessages.filter(m => m.text.includes('Incident'));
expect(incidentMessages).toHaveLength(1);
expect(incidentMessages[0].text).toContain('50 alerts grouped');
}, 10 * 60_000); // ~7.5min of sleeps; Vitest's default test timeout is 5s
});
```
**Journey 3: Deploy Correlation**
```typescript
// tests/e2e/journeys/deploy-correlation.test.ts
describe('Deploy Correlation', () => {
it('identifies deploy as trigger when alerts follow within 10 minutes', async () => {
// 1. Send deploy event
await sendWebhook('github-actions', makeDeployPayload({
service: 'payment-api',
commit: 'abc123',
pr_title: 'feat: add retry logic',
}));
// 2. Wait 2 minutes, then fire alerts
await sleep(2 * 60 * 1000);
await sendWebhook('datadog', makeAlertPayload({ service: 'payment-api' }));
await sendWebhook('pagerduty', makeAlertPayload({ service: 'payment-api' }));
// 3. Wait for correlation
await sleep(6 * 60 * 1000);
const slackMessage = await getLatestSlackMessage('e2e-channel');
expect(slackMessage.text).toContain('Deploy #');
expect(slackMessage.text).toContain('abc123');
}, 10 * 60_000); // ~8min of sleeps; Vitest's default test timeout is 5s
});
```
**Journey 4: Panic Mode**
```typescript
// tests/e2e/journeys/panic-mode.test.ts
describe('Panic Mode', () => {
  it('stops all suppression immediately when panic mode is activated', async () => {
    // 1. Enable audit mode, verify suppression works
    await setGovernanceMode('e2e-tenant', 'audit');
    await sendNoisyAlerts(10);
    const beforePanic = await getSlackMessages('e2e-channel');
    const suppressedBefore = beforePanic.filter(m => m.text.includes('suppressed'));
    expect(suppressedBefore.length).toBeGreaterThan(0);
    // 2. Activate panic mode. Node's fetch needs an absolute URL; APP_BASE_URL is an assumed constant.
    await fetch(`${APP_BASE_URL}/admin/panic`, { method: 'POST' });
    // 3. Send more alerts: all should pass through
    await sendNoisyAlerts(10);
    // Only look at messages posted after panic (assumes getSlackMessages returns oldest first)
    const afterPanic = (await getSlackMessages('e2e-channel')).slice(beforePanic.length);
    const rawAlerts = afterPanic.filter(m => !m.text.includes('suppressed'));
    expect(rawAlerts.length).toBeGreaterThanOrEqual(10);
  });
});
```
### 5.2 E2E Infrastructure
```yaml
# docker-compose.e2e.yml
services:
  localstack:
    image: localstack/localstack:3
    environment:
      SERVICES: sqs,s3,dynamodb,apigateway,lambda
    ports: ["4566:4566"]
  timescaledb:
    image: timescale/timescaledb:latest-pg16
    environment:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: test  # TIMESCALE_URL below connects to the "test" database
    ports: ["5432:5432"]
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
  wiremock:
    image: wiremock/wiremock:3
    ports: ["8080:8080"]
    volumes:
      - ./fixtures/wiremock:/home/wiremock/mappings
  app:
    build: .
    environment:
      AWS_ENDPOINT: http://localstack:4566
      REDIS_URL: redis://redis:6379
      TIMESCALE_URL: postgres://postgres:test@timescaledb:5432/test
      SLACK_API_URL: http://wiremock:8080
    depends_on: [localstack, timescaledb, redis, wiremock]
```
### 5.3 Synthetic Alert Generation
```typescript
// tests/e2e/helpers/alert-generator.ts
import { faker } from '@faker-js/faker';
import { ulid } from 'ulid';

export function makeAlertPayload(overrides: Partial<AlertPayload> = {}): DatadogWebhookPayload {
  // service and severity are canonical fields: map them into Datadog's shape
  // instead of spreading them into the payload as unknown keys.
  const { service, severity, ...rest } = overrides;
  return {
    id: ulid(),
    title: `Alert: ${faker.hacker.phrase()}`,
    text: faker.lorem.sentence(),
    date_happened: Math.floor(Date.now() / 1000),
    priority: 'normal',
    tags: [`service:${service ?? 'test-service'}`],
    alert_type: severity ?? 'warning',
    ...rest, // title, priority, and any other Datadog-shaped overrides win over the defaults
  };
}

export async function sendNoisyAlerts(count: number, opts?: { service?: string }) {
  for (let i = 0; i < count; i++) {
    await sendWebhook('datadog', makeAlertPayload({
      service: opts?.service ?? 'noisy-service',
      title: `Flapping alert #${i}`,
    }));
  }
}
```
---
## Section 6: Performance & Load Testing
### 6.1 Alert Ingestion Throughput
```typescript
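// tests/perf/ingestion-throughput.test.ts
describe('Ingestion Throughput', () => {
  it('processes 1000 webhooks/second without dropping payloads', async () => {
    // The script below runs inside k6's own runtime, so the payload and
    // signature are built here and inlined into it.
    const body = JSON.stringify(makeAlertPayload());
    const validSig = computeHmac(body, TEST_SECRET);
    // `k6.run` is assumed to be a project helper that shells out to the k6 CLI.
    const results = await k6.run({
      vus: 100,
      duration: '30s',
      thresholds: {
        http_req_duration: ['p(95)<200'], // 200ms p95
        http_req_failed: ['rate<0.001'],  // <0.1% failure
      },
      script: `
        import http from 'k6/http';
        export default function() {
          http.post('${WEBHOOK_URL}/v1/wh/perf-tenant/datadog',
            ${JSON.stringify(body)},
            { headers: { 'DD-WEBHOOK-SIGNATURE': '${validSig}', 'Content-Type': 'application/json' } }
          );
        }
      `,
    });
    expect(results.metrics.http_req_failed.rate).toBeLessThan(0.001);
  }, 60_000); // 30s load run exceeds Vitest's 5s default timeout
});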
```
### 6.2 Correlation Latency Under Alert Storms
```typescript
describe('Correlation Storm Performance', () => {
it('correlates 500 alerts across 10 services within 30 seconds', async () => {
const start = Date.now();
// Simulate incident storm: 500 alerts, 10 services, 2 minutes
await generateAlertStorm({ alerts: 500, services: 10, durationMs: 120_000 });
// Wait for all windows to close
await waitForIncidents('perf-tenant', { minCount: 1, timeoutMs: 30_000 });
const elapsed = Date.now() - start - 120_000; // subtract generation time
expect(elapsed).toBeLessThan(30_000);
});
it('Redis memory stays under 50MB during 10K active windows', async () => {
// Open 10K windows across 100 tenants
for (let t = 0; t < 100; t++) {
for (let s = 0; s < 100; s++) {
await correlationEngine.processAlert(makeAlert({
tenant: `tenant-${t}`,
service: `service-${s}`,
}));
}
}
const memoryUsage = await redisClient.info('memory');
const usedMb = parseRedisMemory(memoryUsage);
expect(usedMb).toBeLessThan(50);
});
});
```
### 6.3 Noise Scoring Latency
```typescript
describe('Noise Scoring Performance', () => {
it('scores a correlated incident with 50 alerts in <100ms', async () => {
const incident = makeIncident({ alertCount: 50, withHistory: true });
const start = performance.now();
const score = await noiseScorer.score(incident);
const elapsed = performance.now() - start;
expect(elapsed).toBeLessThan(100);
expect(score).toBeGreaterThanOrEqual(0);
expect(score).toBeLessThanOrEqual(100);
});
});
```
### 6.4 Memory Pressure During High-Cardinality Correlation
```typescript
describe('Memory Pressure', () => {
it('ECS task stays under 512MB with 1000 concurrent correlation windows', async () => {
// Approximates task memory via heap growth in the test process; ECS task metrics are not read here
const memBefore = process.memoryUsage().heapUsed;
await processHighCardinalityAlerts({ tenants: 100, servicesPerTenant: 10 });
const memAfter = process.memoryUsage().heapUsed;
const deltaMb = (memAfter - memBefore) / 1024 / 1024;
expect(deltaMb).toBeLessThan(256); // Leave headroom in 512MB task
});
});
```
---
## Section 7: CI/CD Pipeline Integration
### 7.1 Pipeline Stages
```
┌─────────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Pre-Commit  │───▶│ PR Gate  │───▶│  Merge   │───▶│ Staging  │───▶│   Prod   │
│   (local)   │    │   (CI)   │    │   (CI)   │    │   (CD)   │    │   (CD)   │
└─────────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
 lint + format      unit tests      full suite      E2E + perf      smoke + canary
 type check         integration     coverage gate   LocalStack      deploy event
 <10s               <5min           <10min          <15min          self-dogfood
```
### 7.2 Stage Details
**Pre-Commit (local, <10s):**
- `eslint` + `prettier` format check
- `tsc --noEmit` type check
- Affected unit tests only (`vitest run --changed`, so the hook exits instead of entering watch mode)
**PR Gate (CI, <5min):**
- Full unit test suite
- Integration tests (Testcontainers spin up in CI)
- Schema migration lint (no DROP/RENAME/TYPE changes)
- Decision log presence check for scoring/correlation PRs
- Coverage diff: new code must have ≥80% coverage
**Merge to Main (CI, <10min):**
- Full test suite (unit + integration)
- Coverage gate: overall ≥80%, scoring engine ≥90%
- CDK synth + diff (infrastructure changes)
- Security scan (`npm audit`, `trivy`)
**Staging (CD, <15min):**
- Deploy to staging environment
- E2E journey tests against LocalStack
- Performance benchmarks (ingestion throughput, correlation latency)
- Synthetic alert generation + validation
**Production (CD):**
- Canary deploy (10% traffic for 5 minutes)
- Smoke tests (send test webhook, verify Slack delivery)
- dd0c/alert dogfoods itself: deploy event sent to own webhook
- Automated rollback if error rate >1% during canary
### 7.3 Coverage Thresholds
| Component | Minimum | Target |
|-----------|---------|--------|
| Webhook Parsers | 90% | 95% |
| HMAC Validator | 95% | 100% |
| Correlation Engine | 85% | 90% |
| Noise Scorer | 90% | 95% |
| Governance Policy | 90% | 95% |
| Notification Formatter | 75% | 85% |
| Overall | 80% | 85% |
### 7.4 Test Parallelization
```yaml
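# .github/workflows/test.yml
jobs:
  unit:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx vitest run --shard=${{ matrix.shard }}/4
  integration:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        suite: [webhooks, correlation, notifications, storage]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      # Vitest has no --grep flag; a positional filter matches test file paths containing the suite name
      - run: npx vitest run --project=integration ${{ matrix.suite }}
  e2e:
    needs: [unit, integration]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: docker compose -f docker-compose.e2e.yml up -d
      - run: npx vitest run --project=e2e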
```
---
## Section 8: Transparent Factory Tenet Testing
### 8.1 Atomic Flagging — Suppression Circuit Breaker
```typescript
describe('Atomic Flagging', () => {
describe('Flag Lifecycle', () => {
it('new scoring rule flag defaults to false (off)', () => {});
it('flag has owner and ttl metadata', () => {});
it('CI blocks when flag at 100% exceeds 14-day TTL', () => {});
});
describe('Circuit Breaker on Suppression Volume', () => {
it('allows suppression when volume is within 2x baseline', () => {});
it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
it('auto-disables the flag when breaker trips', () => {});
it('buffers suppressed alerts in DLQ during normal operation', () => {});
it('replays DLQ alerts when breaker trips', async () => {
// 1. Enable scoring flag, suppress 20 alerts
// 2. Trip the breaker by spiking suppression rate
// 3. Verify all 20 suppressed alerts are re-emitted from DLQ
// 4. Verify flag is now disabled
});
it('DLQ retains alerts for 1 hour before expiry', () => {});
});
describe('Local Evaluation', () => {
it('flag evaluation does not make network calls', () => {});
it('flag state is cached in-memory and refreshed every 60s', () => {});
});
});
```
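The breaker behavior these tests describe can be sketched in memory as follows. This is a minimal illustration under assumptions, not the production design: `SuppressionBreaker`, `SuppressedAlert`, and the constructor parameters are hypothetical names, the baseline is passed in as a fixed count per 30-minute window, and the DLQ is a plain array rather than a queue service.

```typescript
// Hypothetical sketch: class and field names are assumptions, not the product's API.
interface SuppressedAlert {
  alertId: string;
  suppressedAt: number; // epoch ms
}

class SuppressionBreaker {
  private dlq: SuppressedAlert[] = [];
  flagEnabled = true;

  constructor(
    private readonly baselinePerWindow: number,
    private readonly windowMs = 30 * 60 * 1000,
    private readonly dlqTtlMs = 60 * 60 * 1000,
  ) {}

  /** Records a suppression. Returns the alerts the caller must emit now. */
  suppress(alertId: string, now: number): SuppressedAlert[] {
    // Flag off: nothing is suppressed, the alert passes straight through.
    if (!this.flagEnabled) return [{ alertId, suppressedAt: now }];

    // Expire DLQ entries older than the retention period (1 hour by default).
    this.dlq = this.dlq.filter(a => now - a.suppressedAt < this.dlqTtlMs);
    this.dlq.push({ alertId, suppressedAt: now });

    const inWindow = this.dlq.filter(a => now - a.suppressedAt < this.windowMs).length;
    if (inWindow > 2 * this.baselinePerWindow) {
      this.flagEnabled = false; // auto-disable the scoring flag
      const replay = this.dlq;
      this.dlq = [];
      return replay; // everything that was held back is re-emitted
    }
    return [];
  }
}
```

With a baseline of 10, the first 20 suppressions are buffered and the 21st trips the breaker, returning all 21 buffered alerts for replay and leaving the flag disabled.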
### 8.2 Elastic Schema — Migration Validation
```typescript
describe('Elastic Schema', () => {
describe('Migration Lint', () => {
it('rejects migration with DROP COLUMN statement', () => {
const migration = 'ALTER TABLE alert_events DROP COLUMN old_field;';
expect(lintMigration(migration)).toContainError('DROP not allowed');
});
it('rejects migration with ALTER COLUMN TYPE', () => {
const migration = 'ALTER TABLE alert_events ALTER COLUMN severity TYPE integer;';
expect(lintMigration(migration)).toContainError('TYPE change not allowed');
});
it('rejects migration with RENAME COLUMN', () => {});
it('accepts migration with ADD COLUMN (nullable)', () => {
const migration = 'ALTER TABLE alert_events ADD COLUMN noise_score_v2 integer;';
expect(lintMigration(migration)).toBeValid();
});
it('accepts migration with new table creation', () => {});
});
describe('DynamoDB Schema', () => {
it('rejects attribute type change in table definition', () => {});
it('accepts new attribute addition', () => {});
it('V1 code ignores V2 attributes without error', () => {});
});
describe('Sunset Enforcement', () => {
it('every migration file contains sunset_date comment', () => {
const migrations = glob.sync('migrations/*.sql');
for (const m of migrations) {
const content = fs.readFileSync(m, 'utf-8');
expect(content).toMatch(/-- sunset_date: \d{4}-\d{2}-\d{2}/);
}
});
it('CI warns when migration is past sunset date', () => {});
});
});
```
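One way `lintMigration` could be built is a small list of forbidden patterns applied to the SQL text. The sketch below is an assumption about its shape (the tests above only fix the error messages): `LintResult` and `FORBIDDEN` are hypothetical names, and regex matching is a deliberate simplification compared with parsing the SQL.

```typescript
// Hypothetical sketch: the real lintMigration may differ; regexes are deliberately simple.
interface LintResult {
  valid: boolean;
  errors: string[];
}

const FORBIDDEN: Array<{ pattern: RegExp; message: string }> = [
  { pattern: /\bDROP\s+(COLUMN|TABLE)\b/i, message: 'DROP not allowed' },
  { pattern: /\bALTER\s+COLUMN\s+\S+\s+(SET\s+DATA\s+)?TYPE\b/i, message: 'TYPE change not allowed' },
  { pattern: /\bRENAME\s+(COLUMN|TO)\b/i, message: 'RENAME not allowed' },
];

function lintMigration(sql: string): LintResult {
  // Strip line comments so a "-- DROP COLUMN ..." note does not trip the lint.
  const code = sql.replace(/--.*$/gm, '');
  const errors = FORBIDDEN
    .filter(rule => rule.pattern.test(code))
    .map(rule => rule.message);
  return { valid: errors.length === 0, errors };
}
```

Pattern matching will miss statements hidden in dynamic SQL or block comments, so it works as a cheap PR gate rather than a guarantee.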
### 8.3 Cognitive Durability — Decision Log Validation
```typescript
describe('Cognitive Durability', () => {
  it('decision_log.json exists for every PR touching scoring/', () => {
    // CI hook: check git diff for files in src/scoring/
    // If touched, require docs/decisions/*.json in the same PR
  });

  it('decision log has required fields', () => {
    const logs = glob.sync('docs/decisions/*.json');
    for (const log of logs) {
      const entry = JSON.parse(fs.readFileSync(log, 'utf-8'));
      expect(entry).toHaveProperty('reasoning');
      expect(entry).toHaveProperty('alternatives_considered');
      expect(entry).toHaveProperty('confidence');
      expect(entry).toHaveProperty('timestamp');
      expect(entry).toHaveProperty('author');
    }
  });

  it('cyclomatic complexity stays at or below 10 for all scoring functions', () => {
    // execSync returns stdout and throws when eslint exits non-zero,
    // so "does not throw" is the assertion
    expect(() =>
      execSync('eslint src/scoring/ --rule "complexity: [error, 10]"', { stdio: 'pipe' })
    ).not.toThrow();
  });
});
```
### 8.4 Semantic Observability — OTEL Span Assertions
```typescript
describe('Semantic Observability', () => {
let spanExporter: InMemorySpanExporter;
beforeEach(() => {
spanExporter = new InMemorySpanExporter();
// Configure OTEL with in-memory exporter for testing
});
describe('Alert Evaluation Spans', () => {
it('emits parent alert_evaluation span for each alert', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const evalSpan = spans.find(s => s.name === 'alert_evaluation');
expect(evalSpan).toBeDefined();
});
it('emits child noise_scoring span with score attributes', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const scoreSpan = spans.find(s => s.name === 'noise_scoring');
expect(scoreSpan).toBeDefined();
expect(scoreSpan.attributes['alert.noise_score']).toBeGreaterThanOrEqual(0);
expect(scoreSpan.attributes['alert.noise_score']).toBeLessThanOrEqual(100);
});
it('emits child correlation_matching span with match data', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const corrSpan = spans.find(s => s.name === 'correlation_matching');
expect(corrSpan).toBeDefined();
expect(corrSpan.attributes).toHaveProperty('alert.correlation_matches');
});
it('emits suppression_decision span with reason', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const suppSpan = spans.find(s => s.name === 'suppression_decision');
expect(suppSpan.attributes).toHaveProperty('alert.suppressed');
expect(suppSpan.attributes).toHaveProperty('alert.suppression_reason');
});
});
describe('PII Protection', () => {
it('never includes raw alert payload in span attributes', async () => {
await processAlert(makeAlert({ title: 'User john@example.com failed login' }));
const spans = spanExporter.getFinishedSpans();
for (const span of spans) {
const attrs = JSON.stringify(span.attributes);
expect(attrs).not.toContain('john@example.com');
}
});
it('uses hashed alert source identifier, not raw', async () => {
await processAlert(makeAlert({ source: 'prod-payment-api' }));
const spans = spanExporter.getFinishedSpans();
const evalSpan = spans.find(s => s.name === 'alert_evaluation');
expect(evalSpan.attributes['alert.source']).toMatch(/^[a-f0-9]+$/);
});
});
});
```
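The PII tests imply that span attributes are built from an allow-list, with identifiers hashed before they leave the process. The sketch below shows that idea; `hashIdentifier` and `safeSpanAttributes` are hypothetical helpers, and SHA-256 truncated to 16 hex characters is an assumed choice, not something the source specifies.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helpers: attribute names mirror the tests above, everything else is assumed.
function hashIdentifier(raw: string, salt = ''): string {
  // SHA-256 hex digest, truncated to keep span attributes small.
  return createHash('sha256').update(salt + raw).digest('hex').slice(0, 16);
}

function safeSpanAttributes(alert: { source: string; title: string; noiseScore: number }) {
  return {
    'alert.source': hashIdentifier(alert.source),
    'alert.noise_score': alert.noiseScore,
    // The title is deliberately omitted: free text may contain PII.
  };
}
```

Service names are low-entropy, so an unsalted hash can be reversed by guessing. A per-tenant salt passed as the second argument would reduce that risk.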
### 8.5 Configurable Autonomy — Governance Policy Tests
```typescript
describe('Configurable Autonomy', () => {
describe('Governance Mode Enforcement', () => {
it('strict mode: annotates but never suppresses', async () => {
setPolicy({ governance_mode: 'strict' });
const result = await processNoisyAlert(makeAlert({ noiseScore: 95 }));
expect(result.suppressed).toBe(false);
expect(result.annotation).toContain('noise_score: 95');
});
it('audit mode: auto-suppresses with logging', async () => {
setPolicy({ governance_mode: 'audit' });
const result = await processNoisyAlert(makeAlert({ noiseScore: 95 }));
expect(result.suppressed).toBe(true);
expect(result.log).toContain('suppressed by audit mode');
});
});
describe('Panic Mode', () => {
it('activates in <1 second via API call', async () => {
const start = Date.now();
// Node's fetch rejects relative URLs; API_BASE_URL is an assumed test-environment constant
await fetch(`${API_BASE_URL}/admin/panic`, { method: 'POST' });
const panicActive = await redisClient.get('dd0c:panic');
expect(Date.now() - start).toBeLessThan(1000);
expect(panicActive).toBe('true');
});
it('stops all suppression when active', async () => {
await activatePanic();
const results = await Promise.all(
Array.from({ length: 10 }, () => processNoisyAlert(makeAlert({ noiseScore: 99 })))
);
expect(results.every(r => r.suppressed === false)).toBe(true);
});
});
describe('Per-Customer Override', () => {
it('customer strict overrides system audit', async () => {
setPolicy({ governance_mode: 'audit' });
setCustomerPolicy('tenant-123', { governance_mode: 'strict' });
const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
expect(result.suppressed).toBe(false);
});
it('customer cannot downgrade from system strict to audit', async () => {
setPolicy({ governance_mode: 'strict' });
setCustomerPolicy('tenant-123', { governance_mode: 'audit' });
const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
expect(result.suppressed).toBe(false); // System strict wins
});
});
});
```
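The precedence rules in these tests reduce to: panic beats everything, and between system and customer policy the stricter mode wins. A minimal sketch of that resolution follows; the function names and the default threshold of 90 are assumptions for illustration.

```typescript
// Hypothetical sketch: type and function names are assumptions.
type GovernanceMode = 'strict' | 'audit';

interface Decision {
  suppressed: boolean;
  reason: string;
}

function effectiveMode(system: GovernanceMode, customer?: GovernanceMode): GovernanceMode {
  // The stricter mode always wins: a customer can tighten the policy, never loosen it.
  return system === 'strict' || customer === 'strict' ? 'strict' : 'audit';
}

function decideSuppression(
  noiseScore: number,
  opts: { system: GovernanceMode; customer?: GovernanceMode; panic: boolean; threshold?: number },
): Decision {
  if (opts.panic) return { suppressed: false, reason: 'panic mode active' };
  const mode = effectiveMode(opts.system, opts.customer);
  if (mode === 'strict') {
    return { suppressed: false, reason: `annotated only, noise_score: ${noiseScore}` };
  }
  const threshold = opts.threshold ?? 90; // assumed cut-off, not taken from the source
  return noiseScore >= threshold
    ? { suppressed: true, reason: 'suppressed by audit mode' }
    : { suppressed: false, reason: 'below suppression threshold' };
}
```

Keeping the resolution in a pure function makes the override matrix cheap to test exhaustively without Redis or a policy store.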
---
## Section 9: Test Data & Fixtures
### 9.1 Directory Structure
```
tests/
  fixtures/
    webhooks/
      datadog/
        single-alert.json
        batched-alerts.json
        monitor-recovered.json
        high-priority.json
      pagerduty/
        incident-triggered.json
        incident-resolved.json
        incident-acknowledged.json
      opsgenie/
        alert-created.json
        alert-closed.json
      grafana/
        single-firing.json
        multi-firing.json
        resolved.json
    deploys/
      github-actions-success.json
      github-actions-failure.json
      gitlab-ci-pipeline.json
      argocd-sync.json
    scenarios/
      alert-storm-50-alerts.json
      cascading-failure-3-services.json
      flapping-alert-10-cycles.json
      maintenance-window-suppression.json
      deploy-correlated-incident.json
    slack/
      initial-alert-blocks.json
      correlated-incident-blocks.json
      weekly-digest-blocks.json
    schemas/
      canonical-alert.json
      incident-record.json
      tenant-config.json
```
### 9.2 Alert Payload Factory
```typescript
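// tests/helpers/factories.ts
import * as crypto from 'node:crypto';
import { faker } from '@faker-js/faker';
import { ulid } from 'ulid';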
export function makeCanonicalAlert(overrides: Partial<CanonicalAlert> = {}): CanonicalAlert {
return {
alert_id: ulid(),
tenant_id: overrides.tenant_id ?? 'test-tenant',
provider: overrides.provider ?? 'datadog',
service: overrides.service ?? 'test-service',
title: overrides.title ?? `Alert: ${faker.hacker.phrase()}`,
severity: overrides.severity ?? 'warning',
fingerprint: overrides.fingerprint ?? crypto.randomBytes(32).toString('hex'),
timestamp: overrides.timestamp ?? new Date().toISOString(),
raw_payload_s3_key: overrides.raw_payload_s3_key ?? `raw/${ulid()}.json`,
metadata: overrides.metadata ?? {},
...overrides,
};
}
export function makeIncident(overrides: Partial<Incident> = {}): Incident {
const alertCount = overrides.alert_count ?? 5;
return {
incident_id: ulid(),
tenant_id: overrides.tenant_id ?? 'test-tenant',
services: overrides.services ?? ['test-service'],
alert_count: alertCount,
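// Child alerts share the incident's tenant so tenant-scoped assertions stay consistent
alerts: Array.from({ length: alertCount }, () => makeCanonicalAlert({ tenant_id: overrides.tenant_id ?? 'test-tenant' })),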
noise_score: overrides.noise_score ?? 0,
deploy_correlation: overrides.deploy_correlation ?? null,
window_opened_at: overrides.window_opened_at ?? new Date().toISOString(),
window_closed_at: overrides.window_closed_at ?? new Date().toISOString(),
...overrides,
};
}
export function makeDeployEvent(overrides: Partial<DeployEvent> = {}): DeployEvent {
return {
deploy_id: ulid(),
tenant_id: overrides.tenant_id ?? 'test-tenant',
service: overrides.service ?? 'test-service',
commit_sha: overrides.commit_sha ?? faker.git.commitSha(),
pr_title: overrides.pr_title ?? faker.git.commitMessage(),
deployed_at: overrides.deployed_at ?? new Date().toISOString(),
provider: overrides.provider ?? 'github-actions',
...overrides,
};
}
```
### 9.3 Noise Scenario Fixtures
```typescript
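// tests/helpers/scenarios.ts
import { makeCanonicalAlert, makeDeployEvent } from './factories';
// Timestamp helper used by cascadingFailure; the argument is assumed to be seconds after a shared base time
const BASE_TIME = Date.now();
const t = (offsetSeconds: number) => new Date(BASE_TIME + offsetSeconds * 1000).toISOString();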
export const NOISE_SCENARIOS = {
alertStorm: {
description: '50 alerts for same service in 2 minutes',
alerts: Array.from({ length: 50 }, (_, i) => makeCanonicalAlert({
service: 'payment-api',
title: `High latency variant ${i}`,
timestamp: new Date(Date.now() + i * 2400).toISOString(),
})),
expectedIncidents: 1,
expectedNoiseScore: { min: 70, max: 95 },
},
flappingAlert: {
description: 'Alert fires and resolves 10 times in 1 hour',
alerts: Array.from({ length: 20 }, (_, i) => makeCanonicalAlert({
service: 'health-check',
title: 'Health check failed',
severity: i % 2 === 0 ? 'warning' : 'info', // alternating fire/resolve
timestamp: new Date(Date.now() + i * 3 * 60 * 1000).toISOString(),
})),
expectedNoiseScore: { min: 80, max: 100 },
},
cascadingFailure: {
description: 'Database fails, then API, then frontend',
alerts: [
makeCanonicalAlert({ service: 'database', severity: 'critical', timestamp: t(0) }),
makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(30) }),
makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(45) }),
makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(60) }),
makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(90) }),
],
serviceDependencies: [['api', 'database'], ['frontend', 'api']],
expectedIncidents: 1, // All merged via dependency graph
expectedNoiseScore: { min: 0, max: 30 }, // Real incident, not noise
},
deployCorrelated: {
description: 'Deploy followed by alert storm',
deploy: makeDeployEvent({ service: 'payment-api', pr_title: 'feat: add retry logic' }),
alerts: Array.from({ length: 8 }, () => makeCanonicalAlert({
service: 'payment-api',
severity: 'high',
})),
deployToAlertGapMs: 2 * 60 * 1000, // 2 minutes after deploy
expectedNoiseScore: { min: 50, max: 85 }, // Deploy correlation boosts noise score
},
};
```
---
## Section 10: TDD Implementation Order
### 10.1 Bootstrap Sequence
The test infrastructure itself must be built before any product code. This is the order:
```
Phase 0: Test Infrastructure (Week 0)
├── 0.1 vitest config + TypeScript setup
├── 0.2 Testcontainers helper (Redis, DynamoDB Local, TimescaleDB)
├── 0.3 LocalStack helper (SQS, S3, API Gateway)
├── 0.4 Fixture loader utility
├── 0.5 Factory functions (makeCanonicalAlert, makeIncident, makeDeployEvent)
├── 0.6 WireMock Slack stub
└── 0.7 CI pipeline with test stages
```
### 10.2 Epic-by-Epic TDD Order
```
Phase 1: Webhook Ingestion (Epic 1) — Tests First
├── 1.1 RED: HMAC validator tests (all providers)
├── 1.2 GREEN: Implement HMAC validation
├── 1.3 RED: Datadog parser tests (single + batch)
├── 1.4 GREEN: Implement Datadog parser
├── 1.5 RED: PagerDuty parser tests
├── 1.6 GREEN: Implement PagerDuty parser
├── 1.7 RED: Fingerprint generator tests
├── 1.8 GREEN: Implement fingerprinting
├── 1.9 INTEGRATION: Lambda → SQS contract test
└── 1.10 REFACTOR: Extract provider parser interface
Phase 2: Correlation Engine (Epic 2) — Tests First
├── 2.1 RED: Time-window open/close/extend tests
├── 2.2 GREEN: Implement window manager
├── 2.3 RED: Service graph correlation tests
├── 2.4 GREEN: Implement dependency traversal
├── 2.5 RED: Deploy correlation tests
├── 2.6 GREEN: Implement deploy tracker
├── 2.7 INTEGRATION: Correlation → Redis window tests
├── 2.8 INTEGRATION: Correlation → DynamoDB incident persistence
└── 2.9 INTEGRATION: Correlation → TimescaleDB trend writes
Phase 3: Noise Analysis (Epic 3) — Tests First
├── 3.1 RED: Rule-based noise scoring tests (all rules)
├── 3.2 GREEN: Implement scorer
├── 3.3 RED: Threshold classification tests
├── 3.4 GREEN: Implement classifier
├── 3.5 RED: "What would have happened" calculation tests
├── 3.6 GREEN: Implement historical analysis
└── 3.7 REFACTOR: Extract scoring rules into configurable pipeline
Phase 4: Notifications (Epic 4) — Integration Tests Lead
├── 4.1 Implement Slack block formatter
├── 4.2 RED: Snapshot tests for all message formats
├── 4.3 INTEGRATION: Notification → Slack (WireMock)
├── 4.4 RED: Rate limiting tests
└── 4.5 GREEN: Implement rate limiter
Phase 5: Governance (Epic 10) — Tests First
├── 5.1 RED: Strict/audit mode enforcement tests
├── 5.2 GREEN: Implement policy engine
├── 5.3 RED: Panic mode tests (<1s activation)
├── 5.4 GREEN: Implement panic mode
├── 5.5 RED: Circuit breaker + DLQ replay tests
├── 5.6 GREEN: Implement circuit breaker
├── 5.7 RED: OTEL span assertion tests
└── 5.8 GREEN: Instrument all components
Phase 6: E2E Validation
├── 6.1 60-second TTV journey
├── 6.2 Alert storm correlation journey
├── 6.3 Deploy correlation journey
├── 6.4 Panic mode journey
└── 6.5 Performance benchmarks
```
### 10.3 "Never Ship Without" Checklist
Before any release, these tests must pass:
- [ ] All HMAC validation tests (security gate)
- [ ] All correlation window tests (correctness gate)
- [ ] All noise scoring tests (safety gate — never eat real alerts)
- [ ] All governance policy tests (compliance gate)
- [ ] Circuit breaker DLQ replay test (safety net gate)
- [ ] 60-second TTV E2E journey (product promise gate)
- [ ] PII protection span tests (privacy gate)
- [ ] Schema migration lint (no breaking changes)
- [ ] Coverage ≥80% overall, ≥90% on scoring engine
---
*End of dd0c/alert Test Architecture*

---
## Section 11: Review Remediation Addendum (Post-Gemini Review)
### 11.1 Missing Epic Coverage
#### Epic 6: Dashboard API
```typescript
describe('Dashboard API', () => {
describe('Authentication', () => {
it('returns 401 for missing Cognito JWT', async () => {});
it('returns 401 for expired JWT', async () => {});
it('returns 401 for JWT signed by wrong issuer', async () => {});
it('extracts tenantId from JWT claims', async () => {});
});
describe('Incident Listing (GET /v1/incidents)', () => {
it('returns paginated incidents for authenticated tenant', async () => {});
it('supports cursor-based pagination', async () => {});
it('filters by status (open, acknowledged, resolved)', async () => {});
it('filters by severity (critical, warning, info)', async () => {});
it('filters by time range (since, until)', async () => {});
it('returns empty array for tenant with no incidents', async () => {});
});
describe('Incident Detail (GET /v1/incidents/:id)', () => {
it('returns full incident with correlated alerts', async () => {});
it('returns 404 for incident belonging to different tenant', async () => {});
it('includes timeline of state transitions', async () => {});
});
describe('Analytics (GET /v1/analytics)', () => {
it('returns MTTR for last 7/30/90 days', async () => {});
it('returns alert volume by source', async () => {});
it('returns noise reduction percentage', async () => {});
it('scopes all analytics to authenticated tenant', async () => {});
});
describe('Tenant Isolation', () => {
it('tenant A cannot read tenant B incidents via API', async () => {});
it('tenant A cannot read tenant B analytics', async () => {});
it('all DynamoDB queries include tenantId partition key', async () => {});
});
});
```
#### Epic 7: Dashboard UI (Playwright)
```typescript
// tests/e2e/ui/dashboard.spec.ts
import { test, expect } from '@playwright/test';
test('login redirects to Cognito hosted UI', async ({ page }) => {
await page.goto('/dashboard');
await expect(page).toHaveURL(/cognito/);
});
test('incident list renders with correct severity badges', async ({ page }) => {
await page.goto('/dashboard/incidents');
await expect(page.locator('[data-testid="incident-card"]')).toHaveCount(5);
await expect(page.locator('.severity-critical')).toBeVisible();
});
test('incident detail shows correlated alert timeline', async ({ page }) => {
await page.goto('/dashboard/incidents/inc-123');
await expect(page.locator('[data-testid="alert-timeline"]')).toBeVisible();
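// Playwright has no "count greater than" matcher; a visible second element proves there are at least two events
await expect(page.locator('.timeline-event').nth(1)).toBeVisible();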
});
test('MTTR chart renders with real data', async ({ page }) => {
await page.goto('/dashboard/analytics');
await expect(page.locator('[data-testid="mttr-chart"]')).toBeVisible();
});
test('noise reduction percentage displays correctly', async ({ page }) => {
await page.goto('/dashboard/analytics');
const noise = page.locator('[data-testid="noise-reduction"]');
await expect(noise).toContainText('%');
});
test('webhook setup wizard generates correct URL', async ({ page }) => {
await page.goto('/dashboard/settings/integrations');
await page.click('[data-testid="add-datadog"]');
const url = await page.locator('[data-testid="webhook-url"]').textContent();
expect(url).toMatch(/\/v1\/webhooks\/ingest\/.+/);
});
```
#### Epic 9: Onboarding & PLG
```typescript
describe('Free Tier Enforcement', () => {
it('allows up to 10,000 alerts/month on free tier', async () => {});
it('returns 429 with upgrade prompt at 10,001st alert', async () => {});
it('resets counter on first of each month', async () => {});
it('purges alert data older than 7 days on free tier', async () => {});
it('retains alert data for 90 days on pro tier', async () => {});
});
describe('OAuth Signup', () => {
it('creates tenant record on first Cognito login', async () => {});
it('assigns free tier by default', async () => {});
it('generates unique webhook URL per tenant', async () => {});
});
describe('Stripe Integration', () => {
it('creates checkout session with correct pricing', async () => {});
it('upgrades tenant on checkout.session.completed webhook', async () => {});
it('downgrades tenant on subscription.deleted webhook', async () => {});
it('validates Stripe webhook signature', async () => {});
});
```
#### Epic 5.3: Slack Feedback Endpoint
```typescript
describe('Slack Interactive Actions Endpoint', () => {
it('validates Slack request signature (HMAC-SHA256)', async () => {});
it('rejects request with invalid signature', async () => {});
it('handles "helpful" feedback — updates incident quality score', async () => {});
it('handles "noise" feedback — adds to suppression training data', async () => {});
it('handles "escalate" action — triggers PagerDuty/OpsGenie', async () => {});
it('updates original Slack message after action', async () => {});
it('scopes action to correct tenant', async () => {});
});
```
#### Epic 1.4: S3 Raw Payload Archival
```typescript
describe('Raw Payload Archival', () => {
it('saves raw webhook payload to S3 asynchronously', async () => {});
it('S3 key includes tenantId, source, and timestamp', async () => {});
it('archival failure does not block alert processing', async () => {});
it('archived payload is retrievable for replay', async () => {});
it('S3 lifecycle policy deletes after retention period', async () => {});
});
```
### 11.2 Anti-Pattern Fixes
#### Replace ioredis-mock with WindowStore Interface
```typescript
// BEFORE (anti-pattern):
// import RedisMock from 'ioredis-mock';
// const engine = new CorrelationEngine(new RedisMock());

// AFTER (correct):
interface WindowStore {
  addEvent(tenantId: string, key: string, event: Alert, ttlMs: number): Promise<void>;
  getWindow(tenantId: string, key: string): Promise<Alert[]>;
  clearWindow(tenantId: string, key: string): Promise<void>;
}

class InMemoryWindowStore implements WindowStore {
  private store = new Map<string, { events: Alert[]; expiresAt: number }>();

  async addEvent(tenantId: string, key: string, event: Alert, ttlMs: number) {
    const fullKey = `${tenantId}:${key}`;
    const existing = this.store.get(fullKey) || { events: [], expiresAt: Date.now() + ttlMs };
    existing.events.push(event);
    this.store.set(fullKey, existing);
  }

  async getWindow(tenantId: string, key: string): Promise<Alert[]> {
    const fullKey = `${tenantId}:${key}`;
    const entry = this.store.get(fullKey);
    if (!entry || entry.expiresAt < Date.now()) return [];
    return entry.events;
  }

  // Completes the WindowStore contract declared above
  async clearWindow(tenantId: string, key: string): Promise<void> {
    this.store.delete(`${tenantId}:${key}`);
  }
}
// Unit tests use InMemoryWindowStore — no Redis dependency
// Integration tests use RedisWindowStore with Testcontainers
```
#### Replace sinon.useFakeTimers with Clock Interface
```typescript
// BEFORE (anti-pattern):
// sinon.useFakeTimers(new Date('2026-03-01T00:00:00Z'));
// AFTER (correct):
interface Clock {
now(): number;
advanceBy(ms: number): void;
}
class FakeClock implements Clock {
private current: number;
constructor(start: Date = new Date()) { this.current = start.getTime(); }
now() { return this.current; }
advanceBy(ms: number) { this.current += ms; }
}
class SystemClock implements Clock {
now() { return Date.now(); }
advanceBy() { throw new Error('Cannot advance system clock'); }
}
// Inject into CorrelationEngine:
const engine = new CorrelationEngine(new InMemoryWindowStore(), new FakeClock());
```
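The two fixes combine naturally: a window store that reads time from the injected clock instead of `Date.now()` makes TTL expiry fully deterministic in unit tests. The sketch below is one possible shape under assumptions; `ClockedWindowStore` and `Entry` are hypothetical names, the methods are synchronous for brevity, and the clock is typed structurally so any object with `now()` (including the `FakeClock` above) fits.

```typescript
// Hypothetical sketch: a window store that reads time from an injected clock.
type Entry<E> = { events: E[]; expiresAt: number };

class ClockedWindowStore<E> {
  private store = new Map<string, Entry<E>>();

  constructor(private readonly clock: { now(): number }) {}

  addEvent(tenantId: string, key: string, event: E, ttlMs: number): void {
    const fullKey = `${tenantId}:${key}`;
    const now = this.clock.now();
    const existing = this.store.get(fullKey);
    // Start a fresh window if none exists or the previous one has expired.
    const entry: Entry<E> =
      existing && existing.expiresAt > now ? existing : { events: [], expiresAt: now + ttlMs };
    entry.events.push(event);
    this.store.set(fullKey, entry);
  }

  getWindow(tenantId: string, key: string): E[] {
    const entry = this.store.get(`${tenantId}:${key}`);
    return entry && entry.expiresAt > this.clock.now() ? entry.events : [];
  }
}
```

A test can then advance the fake clock past the TTL and assert that the window is empty, with no real waiting and no global timer patching.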
### 11.3 Trace Context Propagation Tests
```typescript
describe('Trace Context Propagation', () => {
it('API Gateway passes trace_id to Lambda via X-Amzn-Trace-Id', async () => {});
it('Lambda propagates trace_id into SQS message attributes', async () => {
// Verify SQS message has MessageAttribute 'traceparent' with W3C format
const msg = await getLastSQSMessage(localstack, 'alert-queue');
expect(msg.MessageAttributes.traceparent).toBeDefined();
expect(msg.MessageAttributes.traceparent.StringValue).toMatch(
/^00-[0-9a-f]{32}-[0-9a-f]{16}-0[01]$/
);
});
it('ECS Correlation Engine extracts trace_id from SQS message', async () => {
// Verify the correlation span has the correct parent from SQS
const spans = inMemoryExporter.getFinishedSpans();
const correlationSpan = spans.find(s => s.name === 'alert.correlation');
const ingestSpan = spans.find(s => s.name === 'webhook.ingest');
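expect(correlationSpan.parentSpanId).toBeDefined();
// The correlation span must belong to the same trace as the original ingest span
expect(correlationSpan.spanContext().traceId).toBe(ingestSpan.spanContext().traceId);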
});
it('end-to-end trace spans webhook → SQS → correlation → notification', async () => {
// Fire a webhook, wait for Slack notification, verify all spans share trace_id
const traceId = await fireWebhookAndGetTraceId();
const spans = await getSpansByTraceId(traceId);
const spanNames = spans.map(s => s.name);
expect(spanNames).toContain('webhook.ingest');
expect(spanNames).toContain('alert.normalize');
expect(spanNames).toContain('alert.correlation');
expect(spanNames).toContain('notification.slack');
});
});
```
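The `traceparent` value asserted above follows the W3C Trace Context layout: version, trace ID, parent span ID, flags. A minimal sketch of formatting and parsing it is shown below; the helper names and the `TraceContext` shape are hypothetical, and in practice an OpenTelemetry propagator would normally do this work.

```typescript
// Hypothetical helpers; the header layout follows W3C Trace Context, version 00.
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string; // 16 lowercase hex chars
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header);
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// SQS message attribute in the shape the propagation test reads back.
function toSqsAttribute(ctx: TraceContext) {
  return { traceparent: { DataType: 'String', StringValue: formatTraceparent(ctx) } };
}
```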
### 11.4 HMAC Security Hardening
```typescript
describe('HMAC Signature Validation (Hardened)', () => {
it('uses crypto.timingSafeEqual, not === comparison', () => {
// Inspect the source to verify timing-safe comparison
const source = fs.readFileSync('src/ingestion/hmac.ts', 'utf8');
expect(source).toContain('timingSafeEqual');
expect(source).not.toMatch(/signature\s*===\s*/);
});
it('handles case-insensitive header names (dd-webhook-signature vs DD-WEBHOOK-SIGNATURE)', async () => {
const payload = makeAlertPayload('datadog');
const sig = computeHMAC(payload, DATADOG_SECRET);
// Lowercase header
const resp1 = await ingest(payload, { 'dd-webhook-signature': sig });
expect(resp1.status).toBe(200);
// Uppercase header
const resp2 = await ingest(payload, { 'DD-WEBHOOK-SIGNATURE': sig });
expect(resp2.status).toBe(200);
});
it('rejects completely missing signature header', async () => {
const resp = await ingest(makeAlertPayload('datadog'), {});
expect(resp.status).toBe(401);
});
it('rejects empty signature header', async () => {
const resp = await ingest(makeAlertPayload('datadog'), { 'dd-webhook-signature': '' });
expect(resp.status).toBe(401);
});
});
```
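A validator that would satisfy these hardened tests can be sketched with Node's `crypto` module. Treat this as an illustration under assumptions: hex-encoded HMAC-SHA256 over the raw body is assumed, and `computeSignature`, `verifySignature`, and `getHeader` are hypothetical names. Each provider's actual signing scheme needs to be checked against its documentation.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical sketch: hex encoding and helper names are assumptions.
function computeSignature(body: string, secret: string): string {
  return createHmac('sha256', secret).update(body, 'utf8').digest('hex');
}

function verifySignature(body: string, secret: string, received: string | undefined): boolean {
  if (!received) return false; // missing or empty header
  const expected = Buffer.from(computeSignature(body, secret), 'hex');
  const actual = Buffer.from(received, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (actual.length !== expected.length) return false;
  return timingSafeEqual(actual, expected);
}

// HTTP header names are case-insensitive: normalize before lookup.
function getHeader(headers: Record<string, string>, name: string): string | undefined {
  const wanted = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === wanted) return value;
  }
  return undefined;
}
```

The length check leaks only the digest length, which is public for a fixed hash function, so it does not weaken the constant-time comparison.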
### 11.5 SQS 256KB Payload Limit
```typescript
describe('Large Payload Handling', () => {
it('compresses payloads >200KB before sending to SQS', async () => {
const largePayload = makeLargeAlertPayload(300 * 1024); // 300KB
const resp = await ingest(largePayload);
expect(resp.status).toBe(200);
const msg = await getLastSQSMessage(localstack, 'alert-queue');
// Payload must be compressed or use S3 pointer; the SQS limit counts bytes, not characters
expect(Buffer.byteLength(msg.Body, 'utf8')).toBeLessThan(256 * 1024);
});
it('uses S3 pointer for payloads >256KB after compression', async () => {
const hugePayload = makeLargeAlertPayload(500 * 1024); // 500KB
const resp = await ingest(hugePayload);
expect(resp.status).toBe(200);
const msg = await getLastSQSMessage(localstack, 'alert-queue');
const body = JSON.parse(msg.Body);
expect(body.s3Pointer).toBeDefined();
expect(body.s3Pointer).toMatch(/^s3:\/\/dd0c-alert-overflow\//);
});
it('strips unnecessary fields from Datadog payload before SQS', async () => {
const payload = makeDatadogPayloadWithLargeTags(100); // 100 tags
const resp = await ingest(payload);
expect(resp.status).toBe(200);
const msg = await getLastSQSMessage(localstack, 'alert-queue');
const normalized = JSON.parse(msg.Body);
// Only essential fields should remain
expect(normalized.tags.length).toBeLessThanOrEqual(20);
});
it('rejects payloads >2MB at API Gateway level', async () => {
const massive = makeLargeAlertPayload(3 * 1024 * 1024);
const resp = await ingest(massive);
expect(resp.status).toBe(413);
});
});
```
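The routing these tests expect (inline, then compressed, then S3 pointer) can be sketched as a single decision function. The thresholds come from the tests above; the `Envelope` shape, `buildEnvelope`, and the base64 encoding of the gzip output are assumptions made for illustration.

```typescript
import { gzipSync } from 'node:zlib';

// Hypothetical sketch: thresholds follow the tests above; the envelope shape is an assumption.
const COMPRESS_ABOVE_BYTES = 200 * 1024;
const SQS_LIMIT_BYTES = 256 * 1024;

type Envelope =
  | { kind: 'inline'; body: string }
  | { kind: 'gzip'; body: string } // base64-encoded gzip
  | { kind: 's3'; s3Pointer: string };

function buildEnvelope(payload: string, s3KeyFor: (payload: string) => string): Envelope {
  if (Buffer.byteLength(payload, 'utf8') <= COMPRESS_ABOVE_BYTES) {
    return { kind: 'inline', body: payload };
  }
  const candidate: Envelope = { kind: 'gzip', body: gzipSync(payload).toString('base64') };
  // Measure the encoded message: base64 adds roughly a third to the gzip output.
  if (Buffer.byteLength(JSON.stringify(candidate), 'utf8') <= SQS_LIMIT_BYTES) {
    return candidate;
  }
  // Still too large: the caller uploads the payload and sends only a pointer.
  return { kind: 's3', s3Pointer: s3KeyFor(payload) };
}
```

Measuring the serialized envelope rather than the raw gzip output avoids a case where the compressed bytes fit under the limit but the encoded message does not.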
### 11.6 DLQ Backpressure & Replay
```typescript
describe('DLQ Replay with Backpressure', () => {
it('replays DLQ messages in batches of 100', async () => {
await seedDLQ(10000); // 10K messages
const replayer = new DLQReplayer({ batchSize: 100, delayBetweenBatchesMs: 500 });
await replayer.start();
// Verify batched processing
expect(replayer.batchesProcessed).toBeGreaterThan(0);
expect(replayer.maxConcurrentMessages).toBeLessThanOrEqual(100);
});
it('pauses replay if correlation engine error rate exceeds 10%', async () => {
await seedDLQ(1000);
const replayer = new DLQReplayer({ batchSize: 100, errorThreshold: 0.1 });
// Simulate correlation engine returning errors
mockCorrelationEngine.failRate = 0.15;
await replayer.start();
expect(replayer.state).toBe('paused');
expect(replayer.pauseReason).toContain('error rate exceeded');
});
it('does not replay if circuit breaker is currently tripped', async () => {
await seedDLQ(100);
await tripCircuitBreaker();
const replayer = new DLQReplayer();
await replayer.start();
expect(replayer.messagesReplayed).toBe(0);
expect(replayer.state).toBe('blocked_by_circuit_breaker');
});
it('tracks replay progress for resumability', async () => {
await seedDLQ(500);
const replayer = new DLQReplayer({ batchSize: 50 });
// Process 3 batches then stop
await replayer.processNBatches(3);
expect(replayer.checkpoint).toBe(150);
// Resume from checkpoint
const replayer2 = new DLQReplayer({ resumeFrom: replayer.checkpoint });
await replayer2.start();
expect(replayer2.startedFrom).toBe(150);
});
});
```
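The batching, pause, and checkpoint behavior can be illustrated with a synchronous, in-memory stand-in for `DLQReplayer`. This is a sketch under assumptions: `replayInBatches` and `ReplayResult` are hypothetical, the handler reports failure by returning `false`, and the delay between batches and the circuit breaker check are left out.

```typescript
// Hypothetical sketch: a synchronous, in-memory stand-in for the DLQReplayer above.
interface ReplayResult {
  state: 'completed' | 'paused';
  checkpoint: number; // index of the next message to replay
  pauseReason?: string;
}

function replayInBatches<T>(
  messages: T[],
  handle: (message: T) => boolean, // returns false when processing fails
  opts: { batchSize: number; errorThreshold: number; resumeFrom?: number },
): ReplayResult {
  if (opts.batchSize <= 0) throw new Error('batchSize must be positive');
  let checkpoint = opts.resumeFrom ?? 0;
  while (checkpoint < messages.length) {
    const batch = messages.slice(checkpoint, checkpoint + opts.batchSize);
    const failures = batch.filter(message => !handle(message)).length;
    if (failures / batch.length > opts.errorThreshold) {
      // Backpressure: stop feeding the engine. The checkpoint stays at the start
      // of the failing batch, so a resume retries it (handlers must be idempotent).
      return { state: 'paused', checkpoint, pauseReason: 'error rate exceeded threshold' };
    }
    checkpoint += batch.length;
  }
  return { state: 'completed', checkpoint };
}
```

Leaving the checkpoint at the failing batch trades duplicate delivery for no message loss, which fits a tool whose core promise is never dropping an alert.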
### 11.7 Multi-Tenancy Isolation (DynamoDB)
```typescript
describe('DynamoDB Tenant Isolation', () => {
it('all DAO methods require tenantId parameter', () => {
// Compile-time check: DAO interface has tenantId as first param
const daoSource = fs.readFileSync('src/data/incident-dao.ts', 'utf8');
const methods = extractPublicMethods(daoSource);
for (const method of methods) {
expect(method.params[0].name).toBe('tenantId');
}
});
it('query for tenant A returns zero results for tenant B data', async () => {
const dao = new IncidentDAO(dynamoClient);
await dao.create('tenant-A', makeIncident());
await dao.create('tenant-B', makeIncident());
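const results = await dao.list('tenant-A');
// Guard against a vacuous pass: every() is true for an empty array
expect(results.length).toBeGreaterThan(0);
expect(results.every(r => r.tenantId === 'tenant-A')).toBe(true);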
});
it('partition key always includes tenantId prefix', async () => {
const dao = new IncidentDAO(dynamoClient);
await dao.create('tenant-X', makeIncident());
// Read raw DynamoDB item
const item = await dynamoClient.scan({ TableName: 'dd0c-alert-main' });
expect(item.Items[0].PK.S).toMatch(/^TENANT#tenant-X/);
});
});
```
### 11.8 Slack Circuit Breaker
```typescript
describe('Slack Notification Circuit Breaker', () => {
it('opens circuit after 10 consecutive 429s from Slack', async () => {
const slackClient = new SlackClient({ circuitBreakerThreshold: 10 });
for (let i = 0; i < 10; i++) {
mockSlack.respondWith(429);
await slackClient.send(makeMessage()).catch(() => {});
}
expect(slackClient.circuitState).toBe('open');
});
it('queues notifications while circuit is open', async () => {
slackClient.openCircuit();
await slackClient.send(makeMessage());
expect(slackClient.queuedMessages).toBe(1);
});
it('half-opens circuit after 60 seconds', async () => {
slackClient.openCircuit();
clock.advanceBy(61000);
expect(slackClient.circuitState).toBe('half-open');
});
it('drains queue on successful half-open probe', async () => {
slackClient.openCircuit();
slackClient.queue(makeMessage());
slackClient.queue(makeMessage());
clock.advanceBy(61000);
mockSlack.respondWith(200);
await slackClient.probe();
expect(slackClient.circuitState).toBe('closed');
expect(slackClient.queuedMessages).toBe(0);
});
});
```
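The state transitions these tests exercise (closed, open after 10 consecutive failures, half-open after 60 seconds, closed again on a successful probe) can be sketched as a small state machine driven by an injected time source. `NotificationBreaker` and its method names are hypothetical, and message queueing is omitted to keep the transitions visible.

```typescript
// Hypothetical sketch: state names follow the tests above; the class itself is an assumption.
type CircuitState = 'closed' | 'open' | 'half-open';

class NotificationBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: CircuitState = 'closed';

  constructor(
    private readonly now: () => number,
    private readonly threshold = 10,
    private readonly cooldownMs = 60 * 1000,
  ) {}

  get circuitState(): CircuitState {
    // The open -> half-open transition is derived from elapsed time on read.
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open';
    }
    return this.state;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe in half-open re-opens immediately; otherwise wait for the threshold.
    if (this.circuitState === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  recordSuccess(): void {
    // Any success resets the streak, so only consecutive failures count.
    this.failures = 0;
    this.state = 'closed';
  }
}
```

Deriving half-open lazily from the clock avoids a background timer, which keeps the breaker deterministic under a fake clock.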
### 11.9 Updated Test Pyramid (Post-Review)
| Level | Original | Revised | Rationale |
|-------|----------|---------|-----------|
| Unit | 70% (~140) | 65% (~180) | More tests total, but integration share grows |
| Integration | 20% (~40) | 25% (~70) | Dashboard API, tenant isolation, trace propagation |
| E2E | 10% (~20) | 10% (~28) | Dashboard UI (Playwright), onboarding flow |

*End of P3 Review Remediation Addendum*

---
## Section 12: BMad Review Implementation (Must-Have Before Launch)
### 12.1 HMAC Timestamp Freshness (Replay Attack Prevention)
```typescript
describe('HMAC Replay Attack Prevention', () => {
  it('rejects Datadog webhook with timestamp older than 5 minutes', async () => {
    const payload = makeDatadogPayload();
    const staleTimestamp = Math.floor(Date.now() / 1000) - 301; // 5min + 1s
    const sig = computeDatadogHMAC(payload, staleTimestamp);
    const resp = await ingest(payload, {
      'dd-webhook-timestamp': staleTimestamp.toString(),
      'dd-webhook-signature': sig,
    });
    expect(resp.status).toBe(401);
    expect(resp.body.error).toContain('stale timestamp');
  });
  it('rejects PagerDuty webhook with missing timestamp', async () => {
    const payload = makePagerDutyPayload();
    const sig = computePagerDutyHMAC(payload);
    const resp = await ingest(payload, {
      'x-pagerduty-signature': sig,
      // No timestamp header
    });
    expect(resp.status).toBe(401);
  });
  it('rejects OpsGenie webhook replayed after 5 minutes', async () => {
    // OpsGenie doesn't always package timestamp cleanly
    // Must extract from payload body and validate
    const payload = makeOpsGeniePayload({ timestamp: fiveMinutesAgo() });
    const sig = computeOpsGenieHMAC(payload);
    const resp = await ingest(payload, { 'x-opsgenie-signature': sig });
    expect(resp.status).toBe(401);
  });
  it('accepts fresh webhook within 5-minute window', async () => {
    const payload = makeDatadogPayload();
    const freshTimestamp = Math.floor(Date.now() / 1000);
    const sig = computeDatadogHMAC(payload, freshTimestamp);
    const resp = await ingest(payload, {
      'dd-webhook-timestamp': freshTimestamp.toString(),
      'dd-webhook-signature': sig,
    });
    expect(resp.status).toBe(200);
  });
});
```
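The freshness rule these tests exercise can be sketched as a pure function that runs alongside signature verification. This is a sketch under assumptions: the function name `checkTimestampFreshness`, the reason strings other than `stale timestamp`, and the rejection of far-future timestamps are illustrative choices, not confirmed behavior of the ingestion service.

```typescript
const MAX_AGE_SECONDS = 5 * 60;

// Returns null when the timestamp is acceptable, otherwise a rejection reason.
// Hypothetical helper: only the 'stale timestamp' wording appears in the tests above.
function checkTimestampFreshness(
  header: string | undefined,
  nowSeconds: number,
  maxAgeSeconds: number = MAX_AGE_SECONDS,
): string | null {
  if (header === undefined || header.trim() === '') return 'missing timestamp';
  const timestamp = Number(header);
  if (!isFinite(timestamp)) return 'malformed timestamp';
  // Older than the allowed window: treat as a replay.
  if (nowSeconds - timestamp > maxAgeSeconds) return 'stale timestamp';
  // Assumption: timestamps too far in the future are rejected as well.
  if (timestamp - nowSeconds > maxAgeSeconds) return 'future timestamp';
  return null;
}
```

A freshness check only prevents replay if the timestamp is itself covered by the HMAC, as `computeDatadogHMAC(payload, staleTimestamp)` suggests. Otherwise an attacker could resend an old body with a new timestamp header.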
### 12.2 Cross-Tenant Negative Isolation Tests
```typescript
describe('DynamoDB Tenant Isolation (Negative Tests)', () => {
  it('Tenant A cannot read Tenant B incidents', async () => {
    // Seed data for both tenants
    await createIncident('tenant-a', { title: 'A incident' });
    await createIncident('tenant-b', { title: 'B incident' });
    // Query as Tenant A
    const results = await dao.listIncidents('tenant-a');
    // Explicitly assert Tenant B data is absent
    const tenantIds = results.map(r => r.tenantId);
    expect(tenantIds).not.toContain('tenant-b');
    expect(results.every(r => r.tenantId === 'tenant-a')).toBe(true);
  });
  it('Tenant A cannot read Tenant B analytics', async () => {
    await seedAnalytics('tenant-a', { alertCount: 100 });
    await seedAnalytics('tenant-b', { alertCount: 200 });
    const analytics = await dao.getAnalytics('tenant-a');
    expect(analytics.alertCount).toBe(100); // Not 300 (combined)
  });
  it('API returns 404 (not 403) for cross-tenant incident access', async () => {
    const incident = await createIncident('tenant-b', { title: 'secret' });
    const resp = await api.get(`/v1/incidents/${incident.id}`)
      .set('Authorization', `Bearer ${tenantAToken}`);
    // 404, not 403: don't leak existence
    expect(resp.status).toBe(404);
  });
});
```
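One way to make these negative tests pass by construction is to have every read take the caller's tenant and filter on a tenant-prefixed partition key. The sketch below uses an in-memory array as a stand-in for the table. `TenantScopedStoreSketch` and `statusForLookup` are hypothetical names, and only the `TENANT#` prefix is taken from the document.

```typescript
interface IncidentRecord {
  tenantId: string;
  id: string;
  title: string;
}

// In-memory stand-in for a table whose partition key is `TENANT#<tenantId>`.
class TenantScopedStoreSketch {
  private rows: { pk: string; incident: IncidentRecord }[] = [];

  put(incident: IncidentRecord): void {
    this.rows.push({ pk: 'TENANT#' + incident.tenantId, incident: incident });
  }

  // Every read is scoped to the caller's partition before anything else happens.
  list(tenantId: string): IncidentRecord[] {
    const pk = 'TENANT#' + tenantId;
    return this.rows.filter((row) => row.pk === pk).map((row) => row.incident);
  }

  // A foreign incident is simply not found inside the caller's partition.
  get(tenantId: string, incidentId: string): IncidentRecord | undefined {
    return this.list(tenantId).filter((incident) => incident.id === incidentId)[0];
  }
}

// Missing and foreign incidents look identical to the caller: both map to 404.
function statusForLookup(incident: IncidentRecord | undefined): number {
  return incident === undefined ? 404 : 200;
}
```

Because the lookup never leaves the caller's partition, there is no code path that could return 403, which is why the API test expects 404.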
### 12.3 Correlation Window Edge Cases
```typescript
describe('Out-of-Order Alert Delivery', () => {
  it('late alert attaches to existing incident (not duplicate)', async () => {
    const clock = new FakeClock();
    const engine = new CorrelationEngine(new InMemoryWindowStore(), clock);
    // Alert 1 arrives at T=0
    const alert1 = makeAlert({ service: 'auth', fingerprint: 'cpu-high', timestamp: 0 });
    const incident1 = await engine.process(alert1);
    // Window closes at T=5min, incident shipped
    clock.advanceBy(5 * 60 * 1000);
    await engine.flushWindows();
    // Late alert arrives at T=6min with timestamp T=2min (within original window)
    clock.advanceBy(60 * 1000);
    const lateAlert = makeAlert({ service: 'auth', fingerprint: 'cpu-high', timestamp: 2 * 60 * 1000 });
    const result = await engine.process(lateAlert);
    // Must attach to existing incident, not create new one
    expect(result.incidentId).toBe(incident1.incidentId);
    expect(result.action).toBe('attached_to_existing');
  });
  it('very late alert (>2x window) creates new incident', async () => {
    const clock = new FakeClock();
    const engine = new CorrelationEngine(new InMemoryWindowStore(), clock);
    const alert1 = makeAlert({ service: 'auth', fingerprint: 'cpu-high' });
    const incident1 = await engine.process(alert1);
    // 15 minutes later (3x the 5-min window)
    clock.advanceBy(15 * 60 * 1000);
    const lateAlert = makeAlert({ service: 'auth', fingerprint: 'cpu-high' });
    const result = await engine.process(lateAlert);
    expect(result.incidentId).not.toBe(incident1.incidentId);
    expect(result.action).toBe('new_incident');
  });
});
```
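The two cases above leave the exact attachment rule open. One possible reading, sketched here as an assumption, is that a late alert attaches when its event time falls inside the original window and it arrives no later than twice the window length after the window opened. The names `routeAlert` and `ShippedIncident` are hypothetical, and the real `CorrelationEngine` may decide differently.

```typescript
interface ShippedIncident {
  incidentId: string;
  correlationKey: string; // e.g. service plus fingerprint
  windowStartMs: number;
}

interface LateAlertDecision {
  action: 'attached_to_existing' | 'new_incident';
  incidentId: string | null; // null means the caller allocates a new incident
}

const WINDOW_MS = 5 * 60 * 1000;

// Assumed rule: attach only if the event belongs to the original window
// AND the alert arrived within a grace period of 2x the window length.
function routeAlert(
  correlationKey: string,
  eventTimeMs: number,
  arrivalTimeMs: number,
  shipped: ShippedIncident[],
  windowMs: number = WINDOW_MS,
): LateAlertDecision {
  const matches = shipped.filter((incident) =>
    incident.correlationKey === correlationKey &&
    eventTimeMs >= incident.windowStartMs &&
    eventTimeMs <= incident.windowStartMs + windowMs &&
    arrivalTimeMs <= incident.windowStartMs + 2 * windowMs
  );
  const match = matches[0];
  if (match !== undefined) {
    return { action: 'attached_to_existing', incidentId: match.incidentId };
  }
  return { action: 'new_incident', incidentId: null };
}
```

Separating event time from arrival time is the key design point: the first test only passes if the engine compares the alert's own timestamp, not the clock reading at delivery, against the window.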
### 12.4 SQS Claim-Check Round-Trip
```typescript
describe('SQS 256KB Claim-Check End-to-End', () => {
  it('large payload round-trips through S3 pointer', async () => {
    const largePayload = makeLargeAlertPayload(300 * 1024); // 300KB
    // Ingestion compresses and stores in S3
    const resp = await ingest(largePayload);
    expect(resp.status).toBe(200);
    // SQS message contains S3 pointer
    const sqsMsg = await getLastSQSMessage(localstack, 'alert-queue');
    const body = JSON.parse(sqsMsg.Body);
    expect(body.s3Pointer).toBeDefined();
    // Correlation engine fetches from S3 and processes
    const incident = await waitForIncidentCreated(5000);
    expect(incident).toBeDefined();
    expect(incident.sourceAlertCount).toBeGreaterThan(0);
  });
  it('S3 fetch timeout does not crash correlation engine', async () => {
    // Inject S3 latency (10 second delay)
    mockS3.setLatency(10000);
    const largePayload = makeLargeAlertPayload(300 * 1024);
    await ingest(largePayload);
    // Correlation engine should timeout and send to DLQ
    const dlqMsg = await getDLQMessage(localstack, 'alert-dlq', 15000);
    expect(dlqMsg).toBeDefined();
    // Engine is still healthy
    const health = await api.get('/health');
    expect(health.status).toBe(200);
  });
});
```
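The producer side of the claim-check pattern can be sketched as a pure planning step that decides whether a payload travels inline or as a pointer. This sketch assumes no message attributes and skips the compression the test comment mentions. `planClaimCheck`, `QueueEnvelope`, and the object key are hypothetical names, and only the `s3Pointer` field comes from the test above.

```typescript
const SQS_LIMIT_BYTES = 256 * 1024;

interface QueueEnvelope {
  body?: string;      // inline payload when it fits
  s3Pointer?: string; // object key when the payload was offloaded
}

interface ClaimCheckPlan {
  envelope: QueueEnvelope;
  upload: { key: string; body: string } | null; // what to store before enqueueing
}

// Counts UTF-8 bytes, since queue limits are in bytes rather than characters.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair
        i += 1;
      } else {
        bytes += 3; // lone surrogate, encoded as U+FFFD
      }
    } else bytes += 3;
  }
  return bytes;
}

function planClaimCheck(
  payload: string,
  objectKey: string,
  limitBytes: number = SQS_LIMIT_BYTES,
): ClaimCheckPlan {
  // Measure the serialized envelope, not the raw payload: JSON escaping adds bytes.
  const inline: QueueEnvelope = { body: payload };
  if (utf8ByteLength(JSON.stringify(inline)) <= limitBytes) {
    return { envelope: inline, upload: null };
  }
  return { envelope: { s3Pointer: objectKey }, upload: { key: objectKey, body: payload } };
}
```

The caller would perform `upload` first and enqueue `envelope` only after the object write succeeds, so a consumer never receives a pointer to an object that does not exist.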
### 12.5 Free Tier Enforcement
```typescript
describe('Free Tier (10K alerts/month, 7-day retention)', () => {
  it('accepts the 10,000th alert (counter at 9,999)', async () => {
    await setAlertCounter('tenant-free', 9999);
    const resp = await ingestAsTenant('tenant-free', makeAlert());
    expect(resp.status).toBe(200);
  });
  it('rejects the 10,001st alert with upgrade prompt', async () => {
    await setAlertCounter('tenant-free', 10000);
    const resp = await ingestAsTenant('tenant-free', makeAlert());
    expect(resp.status).toBe(429);
    expect(resp.body.upgrade_url).toContain('stripe');
  });
  it('counter resets on first of month', async () => {
    await setAlertCounter('tenant-free', 10000);
    clock.advanceToFirstOfNextMonth();
    await runMonthlyReset();
    const resp = await ingestAsTenant('tenant-free', makeAlert());
    expect(resp.status).toBe(200);
  });
  it('purges data older than 7 days on free tier', async () => {
    await createIncident('tenant-free', { createdAt: eightDaysAgo() });
    await runRetentionPurge();
    const incidents = await dao.listIncidents('tenant-free');
    expect(incidents).toHaveLength(0);
  });
  it('retains data for 90 days on pro tier', async () => {
    // 30 days old is well inside the 90-day pro retention window
    await createIncident('tenant-pro', { createdAt: thirtyDaysAgo() });
    await runRetentionPurge();
    const incidents = await dao.listIncidents('tenant-pro');
    expect(incidents).toHaveLength(1);
  });
});
```
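The boundary arithmetic behind these tests is easy to get off by one, so a small sketch may help. It assumes the counter holds the number of alerts already accepted this month. `admitFreeTierAlert` and `isPastRetention` are hypothetical helpers, and the limits (10,000 alerts, 7 and 90 days) are taken from the test names above.

```typescript
type Tier = 'free' | 'pro';

const FREE_TIER_MONTHLY_LIMIT = 10000;
const RETENTION_DAYS: { free: number; pro: number } = { free: 7, pro: 90 };
const DAY_MS = 24 * 60 * 60 * 1000;

interface AdmissionResult {
  status: number;   // 200 when accepted, 429 when over quota
  newCount: number; // counter value after this decision
}

// `currentCount` is the number of alerts already accepted this month.
function admitFreeTierAlert(
  currentCount: number,
  limit: number = FREE_TIER_MONTHLY_LIMIT,
): AdmissionResult {
  if (currentCount >= limit) return { status: 429, newCount: currentCount };
  return { status: 200, newCount: currentCount + 1 };
}

// True when a record is strictly older than the tier's retention window.
function isPastRetention(createdAtMs: number, nowMs: number, tier: Tier): boolean {
  return nowMs - createdAtMs > RETENTION_DAYS[tier] * DAY_MS;
}
```

In a real store, the compare and increment would need to be a single atomic operation (for example a conditional update), otherwise concurrent ingests near the limit could both be admitted.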
*End of P3 BMad Implementation*