dd0c/alert — Test Architecture & TDD Strategy
Product: dd0c/alert — Alert Intelligence Platform
Author: Test Architecture Phase
Date: February 28, 2026
Status: V1 MVP — Solo Founder Scope
Section 1: Testing Philosophy & TDD Workflow
1.1 Core Philosophy
dd0c/alert is a safety-critical observability tool — a bug that silently suppresses a real alert during an incident is worse than having no tool at all. The test suite is the contract that guarantees "we will never eat your alerts."
Guiding principle: tests describe observable behavior from the on-call engineer's perspective. If a test can't be explained as "when X happens, the engineer sees Y," it's testing implementation, not behavior.
For a solo founder, the test suite is also the regression safety net — it catches the subtle scoring bugs that would erode customer trust over weeks.
1.2 Red-Green-Refactor Adapted to dd0c/alert
RED → Write a failing test that describes the desired behavior
(e.g., "3 Datadog alerts for the same service within 5 minutes
should produce 1 correlated incident")
GREEN → Write the minimum code to make it pass
(hardcode the window, just make it work)
REFACTOR → Clean up without breaking tests
(extract the window manager, add Redis backing,
optimize the fingerprinting)
When to write tests first (strict TDD):
- All correlation logic (time-window clustering, service graph traversal, deploy correlation)
- All noise scoring algorithms (rule-based scoring, threshold calculations)
- All HMAC signature validation (security-critical)
- All fingerprinting/deduplication logic
- All suppression governance (strict vs. audit mode)
- All circuit breaker state transitions (suppression DLQ replay)
When integration tests lead (test-after, then harden):
- Provider webhook parsers — implement against real payload samples, then lock in with contract tests
- SQS FIFO message ordering — test against LocalStack after implementation
- Slack message formatting — build the blocks, then snapshot test the output
When E2E tests lead:
- The 60-second time-to-value journey — define the happy path first, build backward
- Weekly noise digest generation — define expected output, then build the aggregation
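As a sketch of the RED and GREEN steps for the correlation example above, the failing test's assertions plus the minimal hardcoded implementation might look like this (class and method names such as `CorrelationEngine.processAlert` are illustrative, not the real module API):

```typescript
// Illustrative GREEN-step sketch: the simplest code that makes
// "3 alerts, same service, within 5 minutes → 1 incident" pass.
type Alert = { service: string; receivedAt: number };
type Incident = { service: string; alertCount: number };

class CorrelationEngine {
  private windows = new Map<string, { openedAt: number; alerts: Alert[] }>();
  private incidents: Incident[] = [];
  private static WINDOW_MS = 5 * 60 * 1000; // hardcoded for GREEN; extract in REFACTOR

  processAlert(alert: Alert): void {
    const w = this.windows.get(alert.service);
    if (w && alert.receivedAt - w.openedAt < CorrelationEngine.WINDOW_MS) {
      w.alerts.push(alert); // existing open window absorbs the alert
    } else {
      this.windows.set(alert.service, { openedAt: alert.receivedAt, alerts: [alert] });
    }
  }

  closeExpiredWindows(now: number): Incident[] {
    for (const [service, w] of this.windows) {
      if (now - w.openedAt >= CorrelationEngine.WINDOW_MS) {
        this.incidents.push({ service, alertCount: w.alerts.length });
        this.windows.delete(service);
      }
    }
    return this.incidents;
  }
}

// RED-style assertion: 3 alerts within 5 minutes → 1 incident with 3 alerts
const engine = new CorrelationEngine();
const t0 = Date.now();
engine.processAlert({ service: 'payment-api', receivedAt: t0 });
engine.processAlert({ service: 'payment-api', receivedAt: t0 + 60_000 });
engine.processAlert({ service: 'payment-api', receivedAt: t0 + 120_000 });
const incidents = engine.closeExpiredWindows(t0 + 6 * 60_000);
console.log(incidents.length, incidents[0].alertCount); // 1 3
```

The REFACTOR step would then extract the window duration into config and move window state into Redis without touching the test.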
1.3 Test Naming Conventions
// Unit tests (vitest)
describe('CorrelationEngine', () => {
it('groups alerts for same service within 5min window into single incident', () => {});
it('extends window by 2min when alert arrives in last 30 seconds', () => {});
it('caps window extension at 15 minutes total', () => {});
it('merges downstream service alerts when upstream window is active', () => {});
});
describe('NoiseScorer', () => {
it('scores deploy-correlated alerts higher when deploy is within 10min', () => {});
it('returns zero noise score for first-ever alert from a service', () => {});
it('adds 5 points when PR title matches config or feature-flag', () => {});
});
describe('HmacValidator', () => {
it('rejects Datadog webhook with missing DD-WEBHOOK-SIGNATURE header', () => {});
it('rejects PagerDuty webhook with tampered body', () => {});
it('accepts valid signature and passes payload through', () => {});
});
Rules:
- Describe the observable outcome, not the internal mechanism
- Use present tense ("groups", "rejects", "scores")
- If you need "and" in the name, split into two tests
- Group by component in `describe` blocks
Section 2: Test Pyramid
2.1 Ratio
| Level | Target | Count (V1) | Runtime |
|---|---|---|---|
| Unit | 70% | ~350 tests | <30s |
| Integration | 20% | ~100 tests | <5min |
| E2E/Smoke | 10% | ~20 tests | <10min |
2.2 Unit Test Targets (per component)
| Component | Key Behaviors | Est. Tests |
|---|---|---|
| Webhook Parsers (Datadog, PD, OpsGenie, Grafana) | Payload normalization, field mapping, batch handling | 60 |
| HMAC Validator | Signature verification per provider, rejection paths | 20 |
| Fingerprint Generator | Deterministic hashing, dedup detection | 15 |
| Correlation Engine | Time-window open/close/extend, service graph merge, deploy correlation | 80 |
| Noise Scorer | Rule-based scoring, deploy proximity weighting, threshold calculations | 60 |
| Suggestion Engine | Suppression recommendations, "what would have happened" calculations | 30 |
| Notification Formatter | Slack block formatting, digest generation, in-place message updates | 25 |
| Governance Policy | Strict/audit mode enforcement, panic mode, per-customer overrides | 30 |
| Feature Flags | Circuit breaker on suppression volume, flag lifecycle | 15 |
| Canonical Schema Mapper | Provider → canonical field mapping, severity normalization | 15 |
2.3 Integration Test Boundaries
| Boundary | What's Tested | Infrastructure |
|---|---|---|
| Lambda → SQS FIFO | Message ordering, dedup, tenant partitioning | LocalStack |
| SQS → Correlation Engine | Consumer polling, batch processing, error handling | LocalStack |
| Correlation Engine → Redis | Window CRUD, sorted set operations, TTL expiry | Testcontainers Redis |
| Correlation Engine → DynamoDB | Incident persistence, tenant config reads | Testcontainers DynamoDB Local |
| Correlation Engine → TimescaleDB | Time-series writes, continuous aggregate queries | Testcontainers PostgreSQL + TimescaleDB |
| Notification Service → Slack | Block formatting, rate limiting, message update | WireMock |
| API Gateway → Lambda | Webhook routing, auth, throttling | LocalStack |
2.4 E2E/Smoke Scenarios
- 60-Second TTV Journey: Webhook received → alert in Slack within 60s
- Alert Storm Correlation: 50 alerts in 2 minutes → grouped into 1 incident
- Deploy Correlation: Deploy event + alert storm → deploy identified as trigger
- Noise Digest: 7 days of alerts → weekly Slack digest with noise stats
- Multi-Provider Merge: Datadog + PagerDuty alerts for same service → single incident
- Panic Mode: Enable panic → all suppression stops → alerts pass through raw
Section 3: Unit Test Strategy
3.1 Webhook Parsers
Each provider parser is a pure function: payload in, canonical alert(s) out. No side effects, no DB calls.
// tests/unit/parsers/datadog.test.ts
describe('DatadogParser', () => {
it('normalizes single alert payload to canonical schema', () => {});
it('normalizes batched alert array into multiple canonical alerts', () => {});
it('maps Datadog P1 to critical, P5 to info', () => {});
it('extracts service name from tags array', () => {});
it('handles missing optional fields without throwing', () => {});
it('generates stable fingerprint from title + service + tenant', () => {});
});
// tests/unit/parsers/pagerduty.test.ts
describe('PagerDutyParser', () => {
it('normalizes incident.triggered event to canonical alert', () => {});
it('normalizes incident.resolved event with resolution metadata', () => {});
it('ignores incident.acknowledged events (not alerts)', () => {});
it('maps PD urgency high to critical, low to info', () => {});
});
// tests/unit/parsers/opsgenie.test.ts
describe('OpsGenieParser', () => {
it('normalizes alert.created action to canonical alert', () => {});
it('extracts priority P1-P5 and maps to severity', () => {});
it('handles custom fields in details object', () => {});
});
// tests/unit/parsers/grafana.test.ts
describe('GrafanaParser', () => {
it('normalizes Grafana Alertmanager webhook payload', () => {});
it('handles multiple alerts in single webhook (Grafana batches)', () => {});
it('extracts dashboard URL as context link', () => {});
});
Mocking strategy: None needed — parsers are pure functions. Use recorded payload fixtures from fixtures/webhooks/{provider}/.
Fixture structure:
fixtures/webhooks/
datadog/
single-alert.json
batched-alerts.json
monitor-recovered.json
pagerduty/
incident-triggered.json
incident-resolved.json
incident-acknowledged.json
opsgenie/
alert-created.json
alert-closed.json
grafana/
single-firing.json
multi-firing.json
resolved.json
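The pure-function shape these parsers take can be sketched as follows; the payload fields and canonical schema here are simplified assumptions for illustration, not the exact Datadog webhook contract:

```typescript
// Sketch of a pure parser: payload in, canonical alert out, no side effects.
// Field names and the severity map are assumptions to verify against fixtures.
type DatadogPayload = { id: string; title: string; priority: string; tags: string[] };
type CanonicalAlert = { provider: 'datadog'; title: string; service: string; severity: string };

const SEVERITY_MAP: Record<string, string> = {
  P1: 'critical', P2: 'high', P3: 'medium', P4: 'low', P5: 'info',
};

function parseDatadog(payload: DatadogPayload): CanonicalAlert {
  // Service name is carried in the tags array as "service:<name>"
  const serviceTag = payload.tags.find((t) => t.startsWith('service:'));
  return {
    provider: 'datadog',
    title: payload.title.trim(),
    service: serviceTag ? serviceTag.slice('service:'.length) : 'unknown',
    severity: SEVERITY_MAP[payload.priority] ?? 'medium', // illustrative default for unmapped priorities
  };
}
```

Because the function is pure, every unit test is just a fixture file fed through it with assertions on the returned object.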
3.2 HMAC Validator
describe('HmacValidator', () => {
// Datadog uses hex-encoded HMAC-SHA256
it('validates correct Datadog DD-WEBHOOK-SIGNATURE header', () => {});
it('rejects Datadog webhook with wrong signature', () => {});
it('rejects Datadog webhook with missing signature header', () => {});
// PagerDuty uses v1= prefix with HMAC-SHA256
it('validates correct PagerDuty X-PagerDuty-Signature header', () => {});
it('rejects PagerDuty webhook with tampered body', () => {});
// OpsGenie uses different header name
it('validates correct OpsGenie X-OpsGenie-Signature header', () => {});
// Edge cases
it('rejects empty body with any signature', () => {});
it('handles timing-safe comparison to prevent timing attacks', () => {});
});
Mocking strategy: None — crypto operations are deterministic. Use known secret + body + expected signature triples.
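A minimal sketch of the timing-safe check these tests describe, for hex-encoded HMAC-SHA256 signatures (the Datadog style noted above); per-provider header names and encodings differ and should be confirmed against each provider's documentation:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Timing-safe HMAC-SHA256 validation sketch for hex-encoded signatures.
function isValidSignature(body: string, secret: string, signatureHex: string): boolean {
  if (body.length === 0) return false; // reject empty body regardless of signature
  const expected = createHmac('sha256', secret).update(body).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // Length check first: timingSafeEqual requires equal-length buffers,
  // and malformed hex yields a shorter buffer
  if (received.length !== expected.length) return false;
  return timingSafeEqual(expected, received);
}
```

The early length check leaks only the signature's length, not its content; the byte-by-byte comparison itself is constant-time.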
3.3 Fingerprint Generator
describe('FingerprintGenerator', () => {
it('generates deterministic SHA-256 from tenant_id + provider + service + title', () => {});
it('produces same fingerprint for identical alerts regardless of timestamp', () => {});
it('produces different fingerprints when service differs', () => {});
it('normalizes title whitespace before hashing', () => {});
it('handles unicode characters in title consistently', () => {});
});
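A minimal sketch of the generator these tests describe, hashing tenant, provider, service, and a whitespace-normalized title; the exact field order and the NUL delimiter are illustrative assumptions:

```typescript
import { createHash } from 'node:crypto';

// Deterministic fingerprint sketch: SHA-256 over the identity fields,
// with the timestamp deliberately excluded so repeats deduplicate.
function fingerprint(tenantId: string, provider: string, service: string, title: string): string {
  const normalizedTitle = title.trim().replace(/\s+/g, ' ');
  return createHash('sha256')
    .update([tenantId, provider, service, normalizedTitle].join('\u0000')) // NUL delimiter avoids field-boundary collisions
    .digest('hex');
}
```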
3.4 Correlation Engine
The most complex component. Heavy use of table-driven tests.
describe('CorrelationEngine', () => {
describe('Time-Window Management', () => {
it('opens new 5min window on first alert for a service', () => {});
it('adds subsequent alerts to existing open window', () => {});
it('extends window by 2min when alert arrives in last 30 seconds', () => {});
it('caps total window duration at 15 minutes', () => {});
it('closes window after timeout with no new alerts', () => {});
it('generates incident record when window closes', () => {});
});
describe('Service Graph Correlation', () => {
it('merges downstream alerts into upstream window when dependency exists', () => {});
it('does not merge alerts for unrelated services', () => {});
it('handles circular dependencies without infinite loop', () => {});
it('traverses multi-level dependency chains (A→B→C)', () => {});
});
describe('Deploy Correlation', () => {
it('tags incident with deploy_id when deploy event within 10min of first alert', () => {});
it('does not correlate deploy older than 10 minutes', () => {});
it('correlates deploy to correct service even with multiple recent deploys', () => {});
it('adds deploy correlation score boost to noise calculation', () => {});
});
describe('Multi-Tenant Isolation', () => {
it('never correlates alerts across different tenants', () => {});
it('maintains separate windows per tenant', () => {});
it('handles concurrent alerts from multiple tenants', () => {});
});
});
Mocking strategy:
- Mock Redis client (`ioredis-mock`) for window state
- Mock DynamoDB client for service dependency reads
- Mock SQS for downstream message publishing
- Use vitest fake timers (`vi.useFakeTimers()`) for time-window testing
3.5 Noise Scorer
describe('NoiseScorer', () => {
describe('Rule-Based Scoring', () => {
it('returns 0 for first-ever alert from a service (no history)', () => {});
it('scores higher when alert has fired >5 times in 24 hours', () => {});
it('scores higher when alert auto-resolved within 5 minutes', () => {});
it('adds deploy correlation bonus (+15 points) when deploy is recent', () => {});
it('adds feature-flag bonus (+5 points) when PR title matches config/feature-flag', () => {});
it('caps total score at 100', () => {});
it('never scores critical severity alerts above 80 (safety cap)', () => {});
});
describe('Threshold Calculations', () => {
it('classifies score 0-30 as signal (keep)', () => {});
it('classifies score 31-70 as review (annotate)', () => {});
it('classifies score 71-100 as noise (suggest suppress)', () => {});
it('uses tenant-specific thresholds when configured', () => {});
});
describe('What-Would-Have-Happened', () => {
it('calculates suppression count for historical window', () => {});
it('reports zero false negatives when no suppressed alert was critical', () => {});
it('flags false negative when suppressed alert was later escalated', () => {});
});
});
Mocking strategy: Mock the alert history store (DynamoDB queries). Scorer logic itself is pure calculation.
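The scoring and classification rules named in these tests can be sketched as a pure calculation; the +15 deploy bonus, +5 feature-flag bonus, 100 cap, and 80 critical-severity cap come from the tests above, while the remaining weights are illustrative placeholders:

```typescript
// Pure scoring sketch. Only the inputs are mocked in tests (history store);
// the calculation itself needs no mocks.
type ScoreInput = {
  firesLast24h: number;
  autoResolvedUnder5min: boolean;
  deployWithin10min: boolean;
  prTitleMatchesConfigChange: boolean;
  severity: 'critical' | 'high' | 'medium' | 'low' | 'info';
};

function noiseScore(input: ScoreInput): number {
  let score = 0;
  if (input.firesLast24h > 5) score += 40;          // illustrative weight
  if (input.autoResolvedUnder5min) score += 30;     // illustrative weight
  if (input.deployWithin10min) score += 15;         // bonus from the tests above
  if (input.prTitleMatchesConfigChange) score += 5; // bonus from the tests above
  score = Math.min(score, 100);                     // cap total at 100
  if (input.severity === 'critical') score = Math.min(score, 80); // safety cap
  return score;
}

function classify(score: number): 'signal' | 'review' | 'noise' {
  if (score <= 30) return 'signal'; // 0-30: keep
  if (score <= 70) return 'review'; // 31-70: annotate
  return 'noise';                   // 71-100: suggest suppress
}
```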
3.6 Notification Formatter
describe('NotificationFormatter', () => {
describe('Slack Blocks', () => {
it('formats single-alert notification with service, title, severity', () => {});
it('formats correlated incident with alert count and sources', () => {});
it('includes deploy trigger when deploy correlation exists', () => {});
it('includes noise score badge (🟢 signal / 🟡 review / 🔴 noise)', () => {});
it('includes feedback buttons (👍 Helpful / 👎 Not helpful)', () => {});
it('formats in-place update message (replaces initial alert)', () => {});
});
describe('Weekly Digest', () => {
it('aggregates 7 days of incidents into summary stats', () => {});
it('highlights top 3 noisiest services', () => {});
it('shows suppression savings ("would have saved X pages")', () => {});
});
});
Mocking strategy: Snapshot tests — render the Slack blocks to JSON and compare against golden fixtures.
3.7 Governance Policy Engine
describe('GovernancePolicy', () => {
describe('Mode Enforcement', () => {
it('in strict mode: annotates alerts but never suppresses', () => {});
it('in audit mode: auto-suppresses with full logging', () => {});
it('defaults new tenants to strict mode', () => {});
});
describe('Panic Mode', () => {
it('when panic=true: all suppression stops immediately', () => {});
it('when panic=true: all alerts pass through unmodified', () => {});
it('panic mode activatable via Redis key check', () => {});
it('panic mode shows banner in dashboard API response', () => {});
});
describe('Per-Customer Override', () => {
it('customer can set stricter mode than system default', () => {});
it('customer cannot set less restrictive mode than system default', () => {});
it('merge logic: max_restrictive(system, customer)', () => {});
});
describe('Policy Decision Logging', () => {
it('logs "suppressed by audit mode" with full context', () => {});
it('logs "annotation-only, strict mode active" for strict tenants', () => {});
it('logs "panic mode active — all alerts passing through"', () => {});
});
});
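The `max_restrictive(system, customer)` merge named above can be sketched in a few lines; the ordering is an assumption consistent with the mode descriptions (strict annotates only, audit auto-suppresses, so strict is the more restrictive mode):

```typescript
// Merge sketch: a customer override can tighten but never loosen
// the system default.
const RESTRICTIVENESS = { audit: 0, strict: 1 } as const;
type Mode = keyof typeof RESTRICTIVENESS;

function effectiveMode(systemDefault: Mode, customerOverride?: Mode): Mode {
  if (customerOverride === undefined) return systemDefault;
  // Take whichever mode is more restrictive
  return RESTRICTIVENESS[customerOverride] >= RESTRICTIVENESS[systemDefault]
    ? customerOverride
    : systemDefault;
}
```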
3.8 Feature Flag Circuit Breaker
describe('SuppressionCircuitBreaker', () => {
it('allows suppression when volume is within baseline', () => {});
it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
it('auto-disables the scoring flag when breaker trips', () => {});
it('replays suppressed alerts from DLQ when breaker trips', () => {});
it('resets breaker after manual flag re-enable', () => {});
it('tracks suppression count per flag in Redis sliding window', () => {});
});
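The 2x-baseline trip rule over a 30-minute sliding window can be sketched with an injected clock so tests can use fake timers; the in-memory array stands in for the Redis sliding window named in the tests, and the DLQ replay side effect is left as a comment:

```typescript
// Circuit breaker sketch: trips when suppression volume exceeds
// 2x baseline within the sliding window.
class SuppressionCircuitBreaker {
  private suppressionTimes: number[] = [];
  private tripped = false;

  constructor(
    private baselinePer30min: number,
    private now: () => number = Date.now,
    private windowMs = 30 * 60 * 1000,
  ) {}

  recordSuppression(): void {
    const t = this.now();
    this.suppressionTimes.push(t);
    // Drop entries that have fallen out of the sliding window
    this.suppressionTimes = this.suppressionTimes.filter((ts) => t - ts < this.windowMs);
    if (this.suppressionTimes.length > 2 * this.baselinePer30min) {
      this.tripped = true; // real system: also disable the flag and replay the DLQ
    }
  }

  allowsSuppression(): boolean {
    return !this.tripped;
  }

  reset(): void { // manual flag re-enable path
    this.tripped = false;
    this.suppressionTimes = [];
  }
}
```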
Section 4: Integration Test Strategy
4.1 Webhook Contract Tests
Each provider integration gets a contract test suite that validates the full path: HTTP request → Lambda → SQS message.
// tests/integration/webhooks/datadog.contract.test.ts
describe('Datadog Webhook Contract', () => {
let localstack: LocalStackContainer;
let sqsClient: SQSClient;
beforeAll(async () => {
localstack = await new LocalStackContainer().start();
sqsClient = new SQSClient({ endpoint: localstack.getEndpoint() });
// Create SQS FIFO queue
await sqsClient.send(new CreateQueueCommand({
QueueName: 'alert-ingested.fifo',
Attributes: { FifoQueue: 'true', ContentBasedDeduplication: 'true' }
}));
});
it('accepts valid Datadog webhook and produces canonical SQS message', async () => {
const payload = loadFixture('webhooks/datadog/single-alert.json');
const signature = computeHmac(payload, TEST_SECRET);
const res = await request(app)
.post('/v1/wh/tenant-123/datadog')
.set('DD-WEBHOOK-SIGNATURE', signature)
.send(payload);
expect(res.status).toBe(200);
const messages = await pollSqs(sqsClient, 'alert-ingested.fifo');
expect(messages).toHaveLength(1);
expect(messages[0].body).toMatchObject({
tenant_id: 'tenant-123',
provider: 'datadog',
severity: expect.stringMatching(/critical|high|medium|low|info/),
fingerprint: expect.stringMatching(/^[a-f0-9]{64}$/),
});
});
it('rejects webhook with invalid HMAC and produces no SQS message', async () => {
const payload = loadFixture('webhooks/datadog/single-alert.json');
const res = await request(app)
.post('/v1/wh/tenant-123/datadog')
.set('DD-WEBHOOK-SIGNATURE', 'bad-signature')
.send(payload);
expect(res.status).toBe(401);
const messages = await pollSqs(sqsClient, 'alert-ingested.fifo', { waitMs: 1000 });
expect(messages).toHaveLength(0);
});
});
Repeat pattern for PagerDuty, OpsGenie, Grafana — each with provider-specific signature headers and payload formats.
4.2 Correlation Engine → Redis Integration
// tests/integration/correlation/redis-windows.test.ts
describe('Correlation Engine + Redis', () => {
let redis: StartedTestContainer;
let redisClient: Redis;
beforeAll(async () => {
redis = await new GenericContainer('redis:7-alpine')
.withExposedPorts(6379)
.start();
redisClient = new Redis({ host: redis.getHost(), port: redis.getMappedPort(6379) });
});
it('opens window in Redis sorted set with correct TTL', async () => {
await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
const windows = await redisClient.zrange('windows:tenant-123', 0, -1, 'WITHSCORES');
expect(windows).toHaveLength(2); // [windowId, closesAtEpoch]
const ttl = await redisClient.ttl('window:tenant-123:payment-api');
expect(ttl).toBeGreaterThan(280); // ~5min minus processing time
});
it('extends window when alert arrives in last 30 seconds', async () => {
// Open window, advance clock to T+4m31s, send another alert
await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
vi.advanceTimersByTime(4 * 60 * 1000 + 31 * 1000);
await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
const ttl = await redisClient.ttl('window:tenant-123:payment-api');
expect(ttl).toBeGreaterThan(100); // Extended by ~2min
});
it('isolates windows between tenants', async () => {
await correlationEngine.processAlert(makeAlert({ tenant: 'A', service: 'api' }));
await correlationEngine.processAlert(makeAlert({ tenant: 'B', service: 'api' }));
const windowsA = await redisClient.zrange('windows:A', 0, -1);
const windowsB = await redisClient.zrange('windows:B', 0, -1);
expect(windowsA).toHaveLength(1);
expect(windowsB).toHaveLength(1);
expect(windowsA[0]).not.toBe(windowsB[0]);
});
});
4.3 Correlation Engine → DynamoDB Integration
// tests/integration/correlation/dynamodb-incidents.test.ts
describe('Correlation Engine + DynamoDB', () => {
let dynamodb: StartedTestContainer;
beforeAll(async () => {
dynamodb = await new GenericContainer('amazon/dynamodb-local:latest')
.withExposedPorts(8000)
.start();
// Create tables: alerts, incidents, tenant_config, service_dependencies
});
it('persists incident record when correlation window closes', async () => {
await correlationEngine.processAlert(makeAlert({ service: 'api' }));
await correlationEngine.processAlert(makeAlert({ service: 'api' }));
await correlationEngine.closeExpiredWindows();
const incidents = await queryIncidents('tenant-123');
expect(incidents).toHaveLength(1);
expect(incidents[0].alert_count).toBe(2);
expect(incidents[0].services).toContain('api');
});
it('reads service dependencies for cascading correlation', async () => {
await putServiceDependency('tenant-123', 'api', 'database');
await correlationEngine.processAlert(makeAlert({ service: 'database' }));
await correlationEngine.processAlert(makeAlert({ service: 'api' }));
// Both should be in the same window
const windows = await getActiveWindows('tenant-123');
expect(windows).toHaveLength(1);
expect(windows[0].services).toEqual(expect.arrayContaining(['api', 'database']));
});
});
4.4 Correlation Engine → TimescaleDB Integration
// tests/integration/correlation/timescaledb-trends.test.ts
describe('Correlation Engine + TimescaleDB', () => {
let pg: StartedTestContainer;
beforeAll(async () => {
pg = await new GenericContainer('timescale/timescaledb:latest-pg16')
.withExposedPorts(5432)
.withEnvironment({ POSTGRES_PASSWORD: 'test' })
.start();
// Run migrations: create hypertables, continuous aggregates
});
it('writes alert frequency data to hypertable', async () => {
await correlationEngine.recordAlertEvent(makeAlert({ service: 'api' }));
const rows = await query('SELECT * FROM alert_events WHERE service = $1', ['api']);
expect(rows).toHaveLength(1);
});
it('continuous aggregate calculates hourly alert counts', async () => {
// Insert 10 alerts spread over 2 hours
await insertAlertEvents(10, { spreadHours: 2 });
await refreshContinuousAggregate('hourly_alert_summary');
const summary = await query('SELECT * FROM hourly_alert_summary');
expect(summary).toHaveLength(2);
expect(summary.reduce((s, r) => s + r.alert_count, 0)).toBe(10);
});
});
4.5 Notification Service → Slack (WireMock)
// tests/integration/notifications/slack.test.ts
describe('Notification Service + Slack', () => {
let wiremock: WireMockContainer;
beforeAll(async () => {
wiremock = await new WireMockContainer().start();
wiremock.stub({
request: { method: 'POST', urlPath: '/api/chat.postMessage' },
response: { status: 200, body: JSON.stringify({ ok: true, ts: '1234.5678' }) }
});
wiremock.stub({
request: { method: 'POST', urlPath: '/api/chat.update' },
response: { status: 200, body: JSON.stringify({ ok: true }) }
});
});
it('sends initial alert notification to correct Slack channel', async () => {});
it('updates message in-place when correlation completes', async () => {});
it('respects Slack rate limits (1 msg/sec per channel)', async () => {});
it('retries on 429 with exponential backoff', async () => {});
it('includes feedback buttons in correlated incident message', async () => {});
});
Section 5: E2E & Smoke Tests
5.1 Critical User Journeys
Journey 1: 60-Second Time-to-Value
The defining test for dd0c/alert. Validates the entire pipeline from webhook to Slack notification.
// tests/e2e/journeys/sixty-second-ttv.test.ts
describe('60-Second Time-to-Value', () => {
it('delivers first correlated incident to Slack within 60 seconds of webhook', async () => {
const start = Date.now();
// 1. Send Datadog webhook
await sendWebhook('datadog', fixtures.datadog.singleAlert, { tenant: 'e2e-tenant' });
// 2. Wait for Slack message
const slackMessage = await waitForSlackMessage('e2e-channel', { timeoutMs: 60_000 });
const elapsed = Date.now() - start;
expect(elapsed).toBeLessThan(60_000);
expect(slackMessage.text).toContain('New alert');
expect(slackMessage.blocks).toBeDefined();
});
});
Journey 2: Alert Storm Correlation
// tests/e2e/journeys/alert-storm.test.ts
describe('Alert Storm Correlation', () => {
it('groups 50 alerts in 2 minutes into a single correlated incident', async () => {
// Fire 50 alerts for same service over 2 minutes
for (let i = 0; i < 50; i++) {
await sendWebhook('datadog', makeAlertPayload({
service: 'payment-api',
title: `High latency on payment-api (${i})`,
}));
await sleep(2400); // ~50 alerts in 2 min
}
// Wait for correlation window to close
await sleep(5 * 60 * 1000 + 30_000); // 5min window + buffer
const slackMessages = await getSlackMessages('e2e-channel');
const incidentMessages = slackMessages.filter(m => m.text.includes('Incident'));
expect(incidentMessages).toHaveLength(1);
expect(incidentMessages[0].text).toContain('50 alerts grouped');
});
});
Journey 3: Deploy Correlation
// tests/e2e/journeys/deploy-correlation.test.ts
describe('Deploy Correlation', () => {
it('identifies deploy as trigger when alerts follow within 10 minutes', async () => {
// 1. Send deploy event
await sendWebhook('github-actions', makeDeployPayload({
service: 'payment-api',
commit: 'abc123',
pr_title: 'feat: add retry logic',
}));
// 2. Wait 2 minutes, then fire alerts
await sleep(2 * 60 * 1000);
await sendWebhook('datadog', makeAlertPayload({ service: 'payment-api' }));
await sendWebhook('pagerduty', makeAlertPayload({ service: 'payment-api' }));
// 3. Wait for correlation
await sleep(6 * 60 * 1000);
const slackMessage = await getLatestSlackMessage('e2e-channel');
expect(slackMessage.text).toContain('Deploy #');
expect(slackMessage.text).toContain('abc123');
});
});
Journey 4: Panic Mode
// tests/e2e/journeys/panic-mode.test.ts
describe('Panic Mode', () => {
it('stops all suppression immediately when panic mode is activated', async () => {
// 1. Enable audit mode, verify suppression works
await setGovernanceMode('e2e-tenant', 'audit');
await sendNoisyAlerts(10);
const beforePanic = await getSlackMessages('e2e-channel');
const suppressedBefore = beforePanic.filter(m => m.text.includes('suppressed'));
// 2. Activate panic mode
await fetch('/admin/panic', { method: 'POST' });
// 3. Send more alerts — all should pass through
await sendNoisyAlerts(10);
const afterPanic = await getSlackMessages('e2e-channel');
const rawAlerts = afterPanic.filter(m => !m.text.includes('suppressed'));
expect(rawAlerts.length).toBeGreaterThanOrEqual(10);
});
});
5.2 E2E Infrastructure
# docker-compose.e2e.yml
services:
localstack:
image: localstack/localstack:3
environment:
SERVICES: sqs,s3,dynamodb,apigateway,lambda
ports: ["4566:4566"]
timescaledb:
image: timescale/timescaledb:latest-pg16
environment:
POSTGRES_PASSWORD: test
ports: ["5432:5432"]
redis:
image: redis:7-alpine
ports: ["6379:6379"]
wiremock:
image: wiremock/wiremock:3
ports: ["8080:8080"]
volumes:
- ./fixtures/wiremock:/home/wiremock/mappings
app:
build: .
environment:
AWS_ENDPOINT: http://localstack:4566
REDIS_URL: redis://redis:6379
TIMESCALE_URL: postgres://postgres:test@timescaledb:5432/test
SLACK_API_URL: http://wiremock:8080
depends_on: [localstack, timescaledb, redis, wiremock]
5.3 Synthetic Alert Generation
// tests/e2e/helpers/alert-generator.ts
export function makeAlertPayload(overrides: Partial<AlertPayload> = {}): DatadogWebhookPayload {
return {
id: ulid(),
title: overrides.title ?? `Alert: ${faker.hacker.phrase()}`,
text: faker.lorem.sentence(),
date_happened: Math.floor(Date.now() / 1000),
priority: overrides.priority ?? 'normal',
tags: [`service:${overrides.service ?? 'test-service'}`],
alert_type: overrides.severity ?? 'warning',
...overrides,
};
}
export async function sendNoisyAlerts(count: number, opts?: { service?: string }) {
for (let i = 0; i < count; i++) {
await sendWebhook('datadog', makeAlertPayload({
service: opts?.service ?? 'noisy-service',
title: `Flapping alert #${i}`,
}));
}
}
Section 6: Performance & Load Testing
6.1 Alert Ingestion Throughput
// tests/perf/ingestion-throughput.test.ts
describe('Ingestion Throughput', () => {
it('processes 1000 webhooks/second without dropping payloads', async () => {
const results = await k6.run({
vus: 100,
duration: '30s',
thresholds: {
http_req_duration: ['p95<200'], // 200ms p95
http_req_failed: ['rate<0.001'], // <0.1% failure
},
script: `
import http from 'k6/http';
export default function() {
http.post('${WEBHOOK_URL}/v1/wh/perf-tenant/datadog',
JSON.stringify(makeAlertPayload()),
{ headers: { 'DD-WEBHOOK-SIGNATURE': validSig } }
);
}
`,
});
expect(results.metrics.http_req_failed.rate).toBeLessThan(0.001);
});
});
6.2 Correlation Latency Under Alert Storms
describe('Correlation Storm Performance', () => {
it('correlates 500 alerts across 10 services within 30 seconds', async () => {
const start = Date.now();
// Simulate incident storm: 500 alerts, 10 services, 2 minutes
await generateAlertStorm({ alerts: 500, services: 10, durationMs: 120_000 });
// Wait for all windows to close
await waitForIncidents('perf-tenant', { minCount: 1, timeoutMs: 30_000 });
const elapsed = Date.now() - start - 120_000; // subtract generation time
expect(elapsed).toBeLessThan(30_000);
});
it('Redis memory stays under 50MB during 10K active windows', async () => {
// Open 10K windows across 100 tenants
for (let t = 0; t < 100; t++) {
for (let s = 0; s < 100; s++) {
await correlationEngine.processAlert(makeAlert({
tenant: `tenant-${t}`,
service: `service-${s}`,
}));
}
}
const memoryUsage = await redisClient.info('memory');
const usedMb = parseRedisMemory(memoryUsage);
expect(usedMb).toBeLessThan(50);
});
});
6.3 Noise Scoring Latency
describe('Noise Scoring Performance', () => {
it('scores a correlated incident with 50 alerts in <100ms', async () => {
const incident = makeIncident({ alertCount: 50, withHistory: true });
const start = performance.now();
const score = await noiseScorer.score(incident);
const elapsed = performance.now() - start;
expect(elapsed).toBeLessThan(100);
expect(score).toBeGreaterThanOrEqual(0);
expect(score).toBeLessThanOrEqual(100);
});
});
6.4 Memory Pressure During High-Cardinality Correlation
describe('Memory Pressure', () => {
it('ECS task stays under 512MB with 1000 concurrent correlation windows', async () => {
// Monitor ECS task memory while processing high-cardinality alerts
const memBefore = process.memoryUsage().heapUsed;
await processHighCardinalityAlerts({ tenants: 100, servicesPerTenant: 10 });
const memAfter = process.memoryUsage().heapUsed;
const deltaMb = (memAfter - memBefore) / 1024 / 1024;
expect(deltaMb).toBeLessThan(256); // Leave headroom in 512MB task
});
});
Section 7: CI/CD Pipeline Integration
7.1 Pipeline Stages
┌─────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Pre-Commit │───▶│ PR Gate │───▶│ Merge │───▶│ Staging │───▶│ Prod │
│ (local) │ │ (CI) │ │ (CI) │ │ (CD) │ │ (CD) │
└─────────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
Pre-Commit: lint + format, type check (<10s)
PR Gate: unit tests, integration (<5min)
Merge: full suite, coverage gate (<10min)
Staging: E2E + perf, LocalStack (<15min)
Prod: smoke + canary, deploy event, self-dogfood
7.2 Stage Details
Pre-Commit (local, <10s):
- `eslint` + `prettier` format check
- `tsc --noEmit` type check
- Affected unit tests only (`vitest --changed`)
PR Gate (CI, <5min):
- Full unit test suite
- Integration tests (Testcontainers spin up in CI)
- Schema migration lint (no DROP/RENAME/TYPE changes)
- Decision log presence check for scoring/correlation PRs
- Coverage diff: new code must have ≥80% coverage
Merge to Main (CI, <10min):
- Full test suite (unit + integration)
- Coverage gate: overall ≥80%, scoring engine ≥90%
- CDK synth + diff (infrastructure changes)
- Security scan (`npm audit`, `trivy`)
Staging (CD, <15min):
- Deploy to staging environment
- E2E journey tests against LocalStack
- Performance benchmarks (ingestion throughput, correlation latency)
- Synthetic alert generation + validation
Production (CD):
- Canary deploy (10% traffic for 5 minutes)
- Smoke tests (send test webhook, verify Slack delivery)
- dd0c/alert dogfoods itself: deploy event sent to own webhook
- Automated rollback if error rate >1% during canary
7.3 Coverage Thresholds
| Component | Minimum | Target |
|---|---|---|
| Webhook Parsers | 90% | 95% |
| HMAC Validator | 95% | 100% |
| Correlation Engine | 85% | 90% |
| Noise Scorer | 90% | 95% |
| Governance Policy | 90% | 95% |
| Notification Formatter | 75% | 85% |
| Overall | 80% | 85% |
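These gates could be enforced with vitest's glob-based coverage thresholds; a sketch, with the source paths as assumptions about the repo layout:

```typescript
// vitest.config.ts — sketch of per-component coverage gates via
// vitest's glob-keyed coverage thresholds (paths are illustrative).
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        lines: 80, // overall minimum gate
        'src/parsers/**': { lines: 90 },
        'src/security/**': { lines: 95 },
        'src/correlation/**': { lines: 85 },
        'src/scoring/**': { lines: 90 },
        'src/governance/**': { lines: 90 },
      },
    },
  },
});
```

With glob keys, a PR that drops a single component below its own bar fails CI even if overall coverage stays above 80%.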
7.4 Test Parallelization
# .github/workflows/test.yml
jobs:
unit:
runs-on: ubuntu-latest
strategy:
matrix:
shard: [1, 2, 3, 4]
steps:
- run: vitest --shard=${{ matrix.shard }}/4
integration:
runs-on: ubuntu-latest
strategy:
matrix:
suite: [webhooks, correlation, notifications, storage]
steps:
- run: vitest --project=integration --grep=${{ matrix.suite }}
e2e:
needs: [unit, integration]
runs-on: ubuntu-latest
steps:
- run: docker compose -f docker-compose.e2e.yml up -d
- run: vitest --project=e2e
Section 8: Transparent Factory Tenet Testing
8.1 Atomic Flagging — Suppression Circuit Breaker
describe('Atomic Flagging', () => {
describe('Flag Lifecycle', () => {
it('new scoring rule flag defaults to false (off)', () => {});
it('flag has owner and ttl metadata', () => {});
it('CI blocks when flag at 100% exceeds 14-day TTL', () => {});
});
describe('Circuit Breaker on Suppression Volume', () => {
it('allows suppression when volume is within 2x baseline', () => {});
it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
it('auto-disables the flag when breaker trips', () => {});
it('buffers suppressed alerts in DLQ during normal operation', () => {});
it('replays DLQ alerts when breaker trips', async () => {
// 1. Enable scoring flag, suppress 20 alerts
// 2. Trip the breaker by spiking suppression rate
// 3. Verify all 20 suppressed alerts are re-emitted from DLQ
// 4. Verify flag is now disabled
});
it('DLQ retains alerts for 1 hour before expiry', () => {});
});
describe('Local Evaluation', () => {
it('flag evaluation does not make network calls', () => {});
it('flag state is cached in-memory and refreshed every 60s', () => {});
});
});
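A minimal sketch of the volume-based breaker behind these tests. It assumes a caller records each suppression and a precomputed per-window baseline; the class name, in-memory DLQ, and fixed 30-minute window are all illustrative:

```typescript
// Illustrative suppression circuit breaker: trips when the current window's
// suppression count exceeds 2x the rolling baseline, then stops suppressing.
interface SuppressedAlert { alertId: string; suppressedAt: number }

export class SuppressionBreaker {
  private tripped = false;
  private dlq: SuppressedAlert[] = [];
  private readonly windowMs = 30 * 60 * 1000;

  constructor(private baselinePerWindow: number) {}

  // Record a suppression; returns false once the breaker has tripped,
  // meaning the alert must be delivered rather than suppressed.
  suppress(alertId: string, now = Date.now()): boolean {
    if (this.tripped) return false;
    this.dlq.push({ alertId, suppressedAt: now });
    const inWindow = this.dlq.filter(a => now - a.suppressedAt < this.windowMs).length;
    if (inWindow > 2 * this.baselinePerWindow) this.tripped = true;
    return !this.tripped;
  }

  isTripped(): boolean { return this.tripped; }

  // On trip, previously suppressed alerts are drained for replay.
  drainDlq(): SuppressedAlert[] {
    const drained = this.dlq;
    this.dlq = [];
    return drained;
  }
}
```

With a baseline of 5 per window, the 11th suppression in one window trips the breaker, and the DLQ drain re-emits everything suppressed so far — the behavior the replay test above locks in.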
8.2 Elastic Schema — Migration Validation
describe('Elastic Schema', () => {
describe('Migration Lint', () => {
it('rejects migration with DROP COLUMN statement', () => {
const migration = 'ALTER TABLE alert_events DROP COLUMN old_field;';
expect(lintMigration(migration)).toContainError('DROP not allowed');
});
it('rejects migration with ALTER COLUMN TYPE', () => {
const migration = 'ALTER TABLE alert_events ALTER COLUMN severity TYPE integer;';
expect(lintMigration(migration)).toContainError('TYPE change not allowed');
});
it('rejects migration with RENAME COLUMN', () => {});
it('accepts migration with ADD COLUMN (nullable)', () => {
const migration = 'ALTER TABLE alert_events ADD COLUMN noise_score_v2 integer;';
expect(lintMigration(migration)).toBeValid();
});
it('accepts migration with new table creation', () => {});
});
describe('DynamoDB Schema', () => {
it('rejects attribute type change in table definition', () => {});
it('accepts new attribute addition', () => {});
it('V1 code ignores V2 attributes without error', () => {});
});
describe('Sunset Enforcement', () => {
it('every migration file contains sunset_date comment', () => {
const migrations = glob.sync('migrations/*.sql');
for (const m of migrations) {
const content = fs.readFileSync(m, 'utf-8');
expect(content).toMatch(/-- sunset_date: \d{4}-\d{2}-\d{2}/);
}
});
it('CI warns when migration is past sunset date', () => {});
});
});
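The linter these tests target can start as a plain regex pass over the migration SQL. A sketch, assuming case-insensitive matching and an additive-only rule list (the exact rules and error strings are illustrative):

```typescript
// Illustrative additive-only migration linter: flags destructive DDL,
// lets ADD COLUMN and CREATE TABLE through untouched.
interface LintResult { valid: boolean; errors: string[] }

const FORBIDDEN: Array<[RegExp, string]> = [
  [/\bDROP\s+(COLUMN|TABLE)\b/i, 'DROP not allowed'],
  [/\bALTER\s+COLUMN\s+\w+\s+TYPE\b/i, 'TYPE change not allowed'],
  [/\bRENAME\s+(COLUMN|TO)\b/i, 'RENAME not allowed'],
];

export function lintMigration(sql: string): LintResult {
  const errors = FORBIDDEN
    .filter(([pattern]) => pattern.test(sql))
    .map(([, message]) => message);
  return { valid: errors.length === 0, errors };
}
```

A real version would parse statements rather than pattern-match (regexes miss comments and quoted identifiers), but this is enough to drive the RED tests above.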
8.3 Cognitive Durability — Decision Log Validation
describe('Cognitive Durability', () => {
it('decision_log.json exists for every PR touching scoring/', () => {
// CI hook: check git diff for files in src/scoring/
// If touched, require docs/decisions/*.json in the same PR
});
it('decision log has required fields', () => {
const logs = glob.sync('docs/decisions/*.json');
for (const log of logs) {
const entry = JSON.parse(fs.readFileSync(log, 'utf-8'));
expect(entry).toHaveProperty('reasoning');
expect(entry).toHaveProperty('alternatives_considered');
expect(entry).toHaveProperty('confidence');
expect(entry).toHaveProperty('timestamp');
expect(entry).toHaveProperty('author');
}
});
it('cyclomatic complexity stays under 10 for all scoring functions', () => {
// Run eslint with the complexity rule; execSync throws on a non-zero exit code
expect(() =>
execSync('eslint src/scoring/ --rule "complexity: [error, 10]"')
).not.toThrow();
});
});
8.4 Semantic Observability — OTEL Span Assertions
describe('Semantic Observability', () => {
let spanExporter: InMemorySpanExporter;
beforeEach(() => {
spanExporter = new InMemorySpanExporter();
// Configure OTEL with in-memory exporter for testing
});
describe('Alert Evaluation Spans', () => {
it('emits parent alert_evaluation span for each alert', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const evalSpan = spans.find(s => s.name === 'alert_evaluation');
expect(evalSpan).toBeDefined();
});
it('emits child noise_scoring span with score attributes', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const scoreSpan = spans.find(s => s.name === 'noise_scoring');
expect(scoreSpan).toBeDefined();
expect(scoreSpan.attributes['alert.noise_score']).toBeGreaterThanOrEqual(0);
expect(scoreSpan.attributes['alert.noise_score']).toBeLessThanOrEqual(100);
});
it('emits child correlation_matching span with match data', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const corrSpan = spans.find(s => s.name === 'correlation_matching');
expect(corrSpan).toBeDefined();
expect(corrSpan.attributes).toHaveProperty('alert.correlation_matches');
});
it('emits suppression_decision span with reason', async () => {
await processAlert(makeAlert());
const spans = spanExporter.getFinishedSpans();
const suppSpan = spans.find(s => s.name === 'suppression_decision');
expect(suppSpan).toBeDefined();
expect(suppSpan.attributes).toHaveProperty('alert.suppressed');
expect(suppSpan.attributes).toHaveProperty('alert.suppression_reason');
});
});
describe('PII Protection', () => {
it('never includes raw alert payload in span attributes', async () => {
await processAlert(makeAlert({ title: 'User john@example.com failed login' }));
const spans = spanExporter.getFinishedSpans();
for (const span of spans) {
const attrs = JSON.stringify(span.attributes);
expect(attrs).not.toContain('john@example.com');
}
});
it('uses hashed alert source identifier, not raw', async () => {
await processAlert(makeAlert({ source: 'prod-payment-api' }));
const spans = spanExporter.getFinishedSpans();
const evalSpan = spans.find(s => s.name === 'alert_evaluation');
expect(evalSpan.attributes['alert.source']).toMatch(/^[a-f0-9]+$/);
});
});
});
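The in-memory exporter wiring hinted at in the beforeEach could look like the sketch below. It assumes `@opentelemetry/sdk-trace-base`; note the SDK's provider API has shifted across versions (older releases used `provider.addSpanProcessor(...)`, newer ones take processors in the constructor), so treat this as a shape, not a pinned snippet:

```typescript
// Test-harness tracing setup: capture spans in memory so tests can
// assert on span names and attributes without a collector.
import {
  BasicTracerProvider,
  InMemorySpanExporter,
  SimpleSpanProcessor,
} from '@opentelemetry/sdk-trace-base';

export function makeTestTracing() {
  const exporter = new InMemorySpanExporter();
  const provider = new BasicTracerProvider({
    spanProcessors: [new SimpleSpanProcessor(exporter)], // newer constructor form
  });
  const tracer = provider.getTracer('dd0c-tests');
  return { exporter, tracer, shutdown: () => provider.shutdown() };
}
```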
8.5 Configurable Autonomy — Governance Policy Tests
describe('Configurable Autonomy', () => {
describe('Governance Mode Enforcement', () => {
it('strict mode: annotates but never suppresses', async () => {
setPolicy({ governance_mode: 'strict' });
const result = await processNoisyAlert(makeAlert({ noiseScore: 95 }));
expect(result.suppressed).toBe(false);
expect(result.annotation).toContain('noise_score: 95');
});
it('audit mode: auto-suppresses with logging', async () => {
setPolicy({ governance_mode: 'audit' });
const result = await processNoisyAlert(makeAlert({ noiseScore: 95 }));
expect(result.suppressed).toBe(true);
expect(result.log).toContain('suppressed by audit mode');
});
});
describe('Panic Mode', () => {
it('activates in <1 second via API call', async () => {
const start = Date.now();
await fetch('/admin/panic', { method: 'POST' });
const elapsed = Date.now() - start; // measure the API call alone, not the Redis read
const panicActive = await redisClient.get('dd0c:panic');
expect(elapsed).toBeLessThan(1000);
expect(panicActive).toBe('true');
});
it('stops all suppression when active', async () => {
await activatePanic();
const results = await Promise.all(
Array.from({ length: 10 }, () => processNoisyAlert(makeAlert({ noiseScore: 99 })))
);
expect(results.every(r => r.suppressed === false)).toBe(true);
});
});
describe('Per-Customer Override', () => {
it('customer strict overrides system audit', async () => {
setPolicy({ governance_mode: 'audit' });
setCustomerPolicy('tenant-123', { governance_mode: 'strict' });
const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
expect(result.suppressed).toBe(false);
});
it('customer cannot downgrade from system strict to audit', async () => {
setPolicy({ governance_mode: 'strict' });
setCustomerPolicy('tenant-123', { governance_mode: 'audit' });
const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
expect(result.suppressed).toBe(false); // System strict wins
});
});
});
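The override rule these tests encode is a one-way ratchet: a customer may tighten policy but never loosen it. A minimal resolver sketch (the function name is illustrative):

```typescript
type GovernanceMode = 'strict' | 'audit';

// Resolve the effective mode for a tenant: a customer override may
// tighten (audit -> strict) but never loosen a system-wide strict setting.
export function resolveGovernanceMode(
  systemMode: GovernanceMode,
  customerMode?: GovernanceMode,
): GovernanceMode {
  if (systemMode === 'strict') return 'strict'; // system strict always wins
  return customerMode ?? systemMode;            // customer may opt into strict
}
```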
Section 9: Test Data & Fixtures
9.1 Directory Structure
tests/
fixtures/
webhooks/
datadog/
single-alert.json
batched-alerts.json
monitor-recovered.json
high-priority.json
pagerduty/
incident-triggered.json
incident-resolved.json
incident-acknowledged.json
opsgenie/
alert-created.json
alert-closed.json
grafana/
single-firing.json
multi-firing.json
resolved.json
deploys/
github-actions-success.json
github-actions-failure.json
gitlab-ci-pipeline.json
argocd-sync.json
scenarios/
alert-storm-50-alerts.json
cascading-failure-3-services.json
flapping-alert-10-cycles.json
maintenance-window-suppression.json
deploy-correlated-incident.json
slack/
initial-alert-blocks.json
correlated-incident-blocks.json
weekly-digest-blocks.json
schemas/
canonical-alert.json
incident-record.json
tenant-config.json
9.2 Alert Payload Factory
// tests/helpers/factories.ts
import crypto from 'node:crypto';
import { ulid } from 'ulid';
import { faker } from '@faker-js/faker';

export function makeCanonicalAlert(overrides: Partial<CanonicalAlert> = {}): CanonicalAlert {
return {
alert_id: ulid(),
tenant_id: overrides.tenant_id ?? 'test-tenant',
provider: overrides.provider ?? 'datadog',
service: overrides.service ?? 'test-service',
title: overrides.title ?? `Alert: ${faker.hacker.phrase()}`,
severity: overrides.severity ?? 'warning',
fingerprint: overrides.fingerprint ?? crypto.randomBytes(32).toString('hex'),
timestamp: overrides.timestamp ?? new Date().toISOString(),
raw_payload_s3_key: overrides.raw_payload_s3_key ?? `raw/${ulid()}.json`,
metadata: overrides.metadata ?? {},
...overrides,
};
}
export function makeIncident(overrides: Partial<Incident> = {}): Incident {
const alertCount = overrides.alert_count ?? 5;
return {
incident_id: ulid(),
tenant_id: overrides.tenant_id ?? 'test-tenant',
services: overrides.services ?? ['test-service'],
alert_count: alertCount,
alerts: Array.from({ length: alertCount }, () => makeCanonicalAlert()),
noise_score: overrides.noise_score ?? 0,
deploy_correlation: overrides.deploy_correlation ?? null,
window_opened_at: overrides.window_opened_at ?? new Date().toISOString(),
window_closed_at: overrides.window_closed_at ?? new Date().toISOString(),
...overrides,
};
}
export function makeDeployEvent(overrides: Partial<DeployEvent> = {}): DeployEvent {
return {
deploy_id: ulid(),
tenant_id: overrides.tenant_id ?? 'test-tenant',
service: overrides.service ?? 'test-service',
commit_sha: overrides.commit_sha ?? faker.git.commitSha(),
pr_title: overrides.pr_title ?? faker.git.commitMessage(),
deployed_at: overrides.deployed_at ?? new Date().toISOString(),
provider: overrides.provider ?? 'github-actions',
...overrides,
};
}
9.3 Noise Scenario Fixtures
// tests/helpers/scenarios.ts
// t(s): ISO timestamp s seconds from now — used by the cascading-failure scenario
const t = (s: number) => new Date(Date.now() + s * 1000).toISOString();

export const NOISE_SCENARIOS = {
alertStorm: {
description: '50 alerts for same service in 2 minutes',
alerts: Array.from({ length: 50 }, (_, i) => makeCanonicalAlert({
service: 'payment-api',
title: `High latency variant ${i}`,
timestamp: new Date(Date.now() + i * 2400).toISOString(),
})),
expectedIncidents: 1,
expectedNoiseScore: { min: 70, max: 95 },
},
flappingAlert: {
description: 'Alert fires and resolves 10 times in 1 hour',
alerts: Array.from({ length: 20 }, (_, i) => makeCanonicalAlert({
service: 'health-check',
title: 'Health check failed',
severity: i % 2 === 0 ? 'warning' : 'info', // alternating fire/resolve
timestamp: new Date(Date.now() + i * 3 * 60 * 1000).toISOString(),
})),
expectedNoiseScore: { min: 80, max: 100 },
},
cascadingFailure: {
description: 'Database fails, then API, then frontend',
alerts: [
makeCanonicalAlert({ service: 'database', severity: 'critical', timestamp: t(0) }),
makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(30) }),
makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(45) }),
makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(60) }),
makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(90) }),
],
serviceDependencies: [['api', 'database'], ['frontend', 'api']],
expectedIncidents: 1, // All merged via dependency graph
expectedNoiseScore: { min: 0, max: 30 }, // Real incident, not noise
},
deployCorrelated: {
description: 'Deploy followed by alert storm',
deploy: makeDeployEvent({ service: 'payment-api', pr_title: 'feat: add retry logic' }),
alerts: Array.from({ length: 8 }, () => makeCanonicalAlert({
service: 'payment-api',
severity: 'high',
})),
deployToAlertGapMs: 2 * 60 * 1000, // 2 minutes after deploy
expectedNoiseScore: { min: 50, max: 85 }, // Deploy correlation boosts noise score
},
};
Section 10: TDD Implementation Order
10.1 Bootstrap Sequence
The test infrastructure itself must be built before any product code. This is the order:
Phase 0: Test Infrastructure (Week 0)
├── 0.1 vitest config + TypeScript setup
├── 0.2 Testcontainers helper (Redis, DynamoDB Local, TimescaleDB)
├── 0.3 LocalStack helper (SQS, S3, API Gateway)
├── 0.4 Fixture loader utility
├── 0.5 Factory functions (makeCanonicalAlert, makeIncident, makeDeployEvent)
├── 0.6 WireMock Slack stub
└── 0.7 CI pipeline with test stages
10.2 Epic-by-Epic TDD Order
Phase 1: Webhook Ingestion (Epic 1) — Tests First
├── 1.1 RED: HMAC validator tests (all providers)
├── 1.2 GREEN: Implement HMAC validation
├── 1.3 RED: Datadog parser tests (single + batch)
├── 1.4 GREEN: Implement Datadog parser
├── 1.5 RED: PagerDuty parser tests
├── 1.6 GREEN: Implement PagerDuty parser
├── 1.7 RED: Fingerprint generator tests
├── 1.8 GREEN: Implement fingerprinting
├── 1.9 INTEGRATION: Lambda → SQS contract test
└── 1.10 REFACTOR: Extract provider parser interface
Phase 2: Correlation Engine (Epic 2) — Tests First
├── 2.1 RED: Time-window open/close/extend tests
├── 2.2 GREEN: Implement window manager
├── 2.3 RED: Service graph correlation tests
├── 2.4 GREEN: Implement dependency traversal
├── 2.5 RED: Deploy correlation tests
├── 2.6 GREEN: Implement deploy tracker
├── 2.7 INTEGRATION: Correlation → Redis window tests
├── 2.8 INTEGRATION: Correlation → DynamoDB incident persistence
└── 2.9 INTEGRATION: Correlation → TimescaleDB trend writes
Phase 3: Noise Analysis (Epic 3) — Tests First
├── 3.1 RED: Rule-based noise scoring tests (all rules)
├── 3.2 GREEN: Implement scorer
├── 3.3 RED: Threshold classification tests
├── 3.4 GREEN: Implement classifier
├── 3.5 RED: "What would have happened" calculation tests
├── 3.6 GREEN: Implement historical analysis
└── 3.7 REFACTOR: Extract scoring rules into configurable pipeline
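The configurable pipeline targeted by the refactor in step 3.7 could take this shape: each rule inspects the incident and contributes a score delta, and the pipeline sums and clamps to 0–100. The rule names, inputs, and weights below are illustrative, not the product's actual scoring model:

```typescript
// Illustrative configurable noise-scoring pipeline.
interface ScoringInput { alertCount: number; flapCycles: number; deployCorrelated: boolean }
interface ScoringRule { name: string; score(input: ScoringInput): number }

const DEFAULT_RULES: ScoringRule[] = [
  { name: 'volume', score: i => Math.min(i.alertCount, 50) },        // caps at 50
  { name: 'flapping', score: i => i.flapCycles * 8 },                // fire/resolve churn
  { name: 'deploy_correlation', score: i => (i.deployCorrelated ? 25 : 0) },
];

export function noiseScore(
  input: ScoringInput,
  rules: ScoringRule[] = DEFAULT_RULES,
): number {
  const total = rules.reduce((sum, rule) => sum + rule.score(input), 0);
  return Math.max(0, Math.min(100, total)); // clamp to [0, 100]
}
```

Keeping rules as data rather than branches is what makes each rule unit-testable in isolation, which the RED tests in step 3.1 depend on.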
Phase 4: Notifications (Epic 4) — Integration Tests Lead
├── 4.1 Implement Slack block formatter
├── 4.2 RED: Snapshot tests for all message formats
├── 4.3 INTEGRATION: Notification → Slack (WireMock)
├── 4.4 RED: Rate limiting tests
└── 4.5 GREEN: Implement rate limiter
Phase 5: Governance (Epic 10) — Tests First
├── 5.1 RED: Strict/audit mode enforcement tests
├── 5.2 GREEN: Implement policy engine
├── 5.3 RED: Panic mode tests (<1s activation)
├── 5.4 GREEN: Implement panic mode
├── 5.5 RED: Circuit breaker + DLQ replay tests
├── 5.6 GREEN: Implement circuit breaker
├── 5.7 RED: OTEL span assertion tests
└── 5.8 GREEN: Instrument all components
Phase 6: E2E Validation
├── 6.1 60-second TTV journey
├── 6.2 Alert storm correlation journey
├── 6.3 Deploy correlation journey
├── 6.4 Panic mode journey
└── 6.5 Performance benchmarks
10.3 "Never Ship Without" Checklist
Before any release, these tests must pass:
- All HMAC validation tests (security gate)
- All correlation window tests (correctness gate)
- All noise scoring tests (safety gate — never eat real alerts)
- All governance policy tests (compliance gate)
- Circuit breaker DLQ replay test (safety net gate)
- 60-second TTV E2E journey (product promise gate)
- PII protection span tests (privacy gate)
- Schema migration lint (no breaking changes)
- Coverage ≥80% overall, ≥90% on scoring engine
End of dd0c/alert Test Architecture