# dd0c/alert — Test Architecture & TDD Strategy

**Product:** dd0c/alert — Alert Intelligence Platform
**Author:** Test Architecture Phase
**Date:** February 28, 2026
**Status:** V1 MVP — Solo Founder Scope

---

## Section 1: Testing Philosophy & TDD Workflow

### 1.1 Core Philosophy

dd0c/alert is a **safety-critical observability tool** — a bug that silently suppresses a real alert during an incident is worse than having no tool at all. The test suite is the contract that guarantees "we will never eat your alerts."

Guiding principle: **tests describe observable behavior from the on-call engineer's perspective**. If a test can't be explained as "when X happens, the engineer sees Y," it's testing implementation, not behavior.

For a solo founder, the test suite is also the **regression safety net** — it catches the subtle scoring bugs that would erode customer trust over weeks.

### 1.2 Red-Green-Refactor Adapted to dd0c/alert

```
RED      → Write a failing test that describes the desired behavior
           (e.g., "3 Datadog alerts for the same service within 5 minutes
            should produce 1 correlated incident")

GREEN    → Write the minimum code to make it pass
           (hardcode the window, just make it work)

REFACTOR → Clean up without breaking tests
           (extract the window manager, add Redis backing,
            optimize the fingerprinting)
```

**When to write tests first (strict TDD):**
- All correlation logic (time-window clustering, service graph traversal, deploy correlation)
- All noise scoring algorithms (rule-based scoring, threshold calculations)
- All HMAC signature validation (security-critical)
- All fingerprinting/deduplication logic
- All suppression governance (strict vs.
audit mode)
- All circuit breaker state transitions (suppression DLQ replay)

**When integration tests lead (test-after, then harden):**
- Provider webhook parsers — implement against real payload samples, then lock in with contract tests
- SQS FIFO message ordering — test against LocalStack after implementation
- Slack message formatting — build the blocks, then snapshot test the output

**When E2E tests lead:**
- The 60-second time-to-value journey — define the happy path first, build backward
- Weekly noise digest generation — define expected output, then build the aggregation

### 1.3 Test Naming Conventions

```typescript
// Unit tests (vitest)
describe('CorrelationEngine', () => {
  it('groups alerts for same service within 5min window into single incident', () => {});
  it('extends window by 2min when alert arrives in last 30 seconds', () => {});
  it('caps window extension at 15 minutes total', () => {});
  it('merges downstream service alerts when upstream window is active', () => {});
});

describe('NoiseScorer', () => {
  it('scores deploy-correlated alerts higher when deploy is within 10min', () => {});
  it('returns zero noise score for first-ever alert from a service', () => {});
  it('adds 5 points when PR title matches config or feature-flag', () => {});
});

describe('HmacValidator', () => {
  it('rejects Datadog webhook with missing DD-WEBHOOK-SIGNATURE header', () => {});
  it('rejects PagerDuty webhook with tampered body', () => {});
  it('accepts valid signature and passes payload through', () => {});
});
```

**Rules:**
- Describe the **observable outcome**, not the internal mechanism
- Use present tense ("groups", "rejects", "scores")
- If you need "and" in the name, split into two tests
- Group by component in `describe` blocks

---

## Section 2: Test Pyramid

### 2.1 Ratio

| Level | Target | Count (V1) | Runtime |
|-------|--------|------------|---------|
| Unit | 70% | ~350 tests | <30s |
| Integration | 20% | ~100 tests | <5min |
| E2E/Smoke | 10% | ~20 tests | <10min |

### 2.2 Unit Test Targets (per component)

| Component | Key Behaviors | Est. Tests |
|-----------|--------------|------------|
| Webhook Parsers (Datadog, PD, OpsGenie, Grafana) | Payload normalization, field mapping, batch handling | 60 |
| HMAC Validator | Signature verification per provider, rejection paths | 20 |
| Fingerprint Generator | Deterministic hashing, dedup detection | 15 |
| Correlation Engine | Time-window open/close/extend, service graph merge, deploy correlation | 80 |
| Noise Scorer | Rule-based scoring, deploy proximity weighting, threshold calculations | 60 |
| Suggestion Engine | Suppression recommendations, "what would have happened" calculations | 30 |
| Notification Formatter | Slack block formatting, digest generation, in-place message updates | 25 |
| Governance Policy | Strict/audit mode enforcement, panic mode, per-customer overrides | 30 |
| Feature Flags | Circuit breaker on suppression volume, flag lifecycle | 15 |
| Canonical Schema Mapper | Provider → canonical field mapping, severity normalization | 15 |

### 2.3 Integration Test Boundaries

| Boundary | What's Tested | Infrastructure |
|----------|--------------|----------------|
| Lambda → SQS FIFO | Message ordering, dedup, tenant partitioning | LocalStack |
| SQS → Correlation Engine | Consumer polling, batch processing, error handling | LocalStack |
| Correlation Engine → Redis | Window CRUD, sorted set operations, TTL expiry | Testcontainers Redis |
| Correlation Engine → DynamoDB | Incident persistence, tenant config reads | Testcontainers DynamoDB Local |
| Correlation Engine → TimescaleDB | Time-series writes, continuous aggregate queries | Testcontainers PostgreSQL + TimescaleDB |
| Notification Service → Slack | Block formatting, rate limiting, message update | WireMock |
| API Gateway → Lambda | Webhook routing, auth, throttling | LocalStack |

### 2.4 E2E/Smoke Scenarios

1. **60-Second TTV Journey**: Webhook received → alert in Slack within 60s
2. **Alert Storm Correlation**: 50 alerts in 2 minutes → grouped into 1 incident
3. **Deploy Correlation**: Deploy event + alert storm → deploy identified as trigger
4. **Noise Digest**: 7 days of alerts → weekly Slack digest with noise stats
5. **Multi-Provider Merge**: Datadog + PagerDuty alerts for same service → single incident
6. **Panic Mode**: Enable panic → all suppression stops → alerts pass through raw

---

## Section 3: Unit Test Strategy

### 3.1 Webhook Parsers

Each provider parser is a pure function: payload in, canonical alert(s) out. No side effects, no DB calls.

```typescript
// tests/unit/parsers/datadog.test.ts
describe('DatadogParser', () => {
  it('normalizes single alert payload to canonical schema', () => {});
  it('normalizes batched alert array into multiple canonical alerts', () => {});
  it('maps Datadog P1 to critical, P5 to info', () => {});
  it('extracts service name from tags array', () => {});
  it('handles missing optional fields without throwing', () => {});
  it('generates stable fingerprint from title + service + tenant', () => {});
});

// tests/unit/parsers/pagerduty.test.ts
describe('PagerDutyParser', () => {
  it('normalizes incident.triggered event to canonical alert', () => {});
  it('normalizes incident.resolved event with resolution metadata', () => {});
  it('ignores incident.acknowledged events (not alerts)', () => {});
  it('maps PD urgency high to critical, low to info', () => {});
});

// tests/unit/parsers/opsgenie.test.ts
describe('OpsGenieParser', () => {
  it('normalizes alert.created action to canonical alert', () => {});
  it('extracts priority P1-P5 and maps to severity', () => {});
  it('handles custom fields in details object', () => {});
});

// tests/unit/parsers/grafana.test.ts
describe('GrafanaParser', () => {
  it('normalizes Grafana Alertmanager webhook payload', () => {});
  it('handles multiple alerts in single webhook (Grafana batches)', () => {});
  it('extracts dashboard URL as context link', () => {});
});
```

**Mocking strategy:**
None needed — parsers are pure functions. Use recorded payload fixtures from `fixtures/webhooks/{provider}/`.

**Fixture structure:**

```
fixtures/webhooks/
  datadog/
    single-alert.json
    batched-alerts.json
    monitor-recovered.json
  pagerduty/
    incident-triggered.json
    incident-resolved.json
    incident-acknowledged.json
  opsgenie/
    alert-created.json
    alert-closed.json
  grafana/
    single-firing.json
    multi-firing.json
    resolved.json
```

### 3.2 HMAC Validator

```typescript
describe('HmacValidator', () => {
  // Datadog uses hex-encoded HMAC-SHA256
  it('validates correct Datadog DD-WEBHOOK-SIGNATURE header', () => {});
  it('rejects Datadog webhook with wrong signature', () => {});
  it('rejects Datadog webhook with missing signature header', () => {});

  // PagerDuty uses v1= prefix with HMAC-SHA256
  it('validates correct PagerDuty X-PagerDuty-Signature header', () => {});
  it('rejects PagerDuty webhook with tampered body', () => {});

  // OpsGenie uses different header name
  it('validates correct OpsGenie X-OpsGenie-Signature header', () => {});

  // Edge cases
  it('rejects empty body with any signature', () => {});
  it('handles timing-safe comparison to prevent timing attacks', () => {});
});
```

**Mocking strategy:** None — crypto operations are deterministic. Use known secret + body + expected signature triples.

### 3.3 Fingerprint Generator

```typescript
describe('FingerprintGenerator', () => {
  it('generates deterministic SHA-256 from tenant_id + provider + service + title', () => {});
  it('produces same fingerprint for identical alerts regardless of timestamp', () => {});
  it('produces different fingerprints when service differs', () => {});
  it('normalizes title whitespace before hashing', () => {});
  it('handles unicode characters in title consistently', () => {});
});
```

### 3.4 Correlation Engine

The most complex component. Heavy use of table-driven tests.

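
As a rough illustration of the table-driven style for the time-window rules (5-minute window, 2-minute extension in the last 30 seconds, 15-minute cap), the sketch below drives one pure function from a case table. The function `applyAlert`, the `CorrelationWindow` shape, and the exact boundary handling (an alert with 30 seconds or less remaining triggers the extension) are assumptions made for the example, not the engine's actual API.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical constants mirroring the rules named in the test list.
const BASE_WINDOW_MS = 5 * 60_000; // a new window stays open for 5 minutes
const EXTENSION_MS = 2 * 60_000; // extend by 2 minutes...
const EXTEND_THRESHOLD_MS = 30_000; // ...when an alert lands in the last 30 seconds
const MAX_WINDOW_MS = 15 * 60_000; // total duration never exceeds 15 minutes

interface CorrelationWindow {
  openedAt: number; // epoch ms
  closesAt: number; // epoch ms
}

// Pure function: returns the window state after an alert arrives at `alertAt`.
// A fresh window is opened when none exists or the previous one has expired.
function applyAlert(win: CorrelationWindow | null, alertAt: number): CorrelationWindow {
  if (win === null || alertAt >= win.closesAt) {
    return { openedAt: alertAt, closesAt: alertAt + BASE_WINDOW_MS };
  }
  const remaining = win.closesAt - alertAt;
  if (remaining > EXTEND_THRESHOLD_MS) return win;
  const hardCap = win.openedAt + MAX_WINDOW_MS;
  return { openedAt: win.openedAt, closesAt: Math.min(win.closesAt + EXTENSION_MS, hardCap) };
}

const min = (n: number): number => n * 60_000;
const open: CorrelationWindow = { openedAt: 0, closesAt: min(5) };

// Each row is one behavior; adding a rule means adding a row, not a new test body.
const cases: Array<{
  name: string;
  win: CorrelationWindow | null;
  alertAt: number;
  expectedClose: number;
}> = [
  { name: "opens 5min window on first alert", win: null, alertAt: 0, expectedClose: min(5) },
  { name: "keeps window when alert arrives early", win: open, alertAt: min(1), expectedClose: min(5) },
  { name: "extends by 2min in last 30 seconds", win: open, alertAt: min(5) - 10_000, expectedClose: min(7) },
  {
    name: "caps total duration at 15 minutes",
    win: { openedAt: 0, closesAt: min(14) },
    alertAt: min(14) - 5_000,
    expectedClose: min(15),
  },
  { name: "opens new window after expiry", win: open, alertAt: min(6), expectedClose: min(11) },
];

for (const c of cases) {
  assert.equal(applyAlert(c.win, c.alertAt).closesAt, c.expectedClose, c.name);
}
```

In a vitest suite the same table would typically feed `it.each(cases)`, so each row is reported as its own test.
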
```typescript
describe('CorrelationEngine', () => {
  describe('Time-Window Management', () => {
    it('opens new 5min window on first alert for a service', () => {});
    it('adds subsequent alerts to existing open window', () => {});
    it('extends window by 2min when alert arrives in last 30 seconds', () => {});
    it('caps total window duration at 15 minutes', () => {});
    it('closes window after timeout with no new alerts', () => {});
    it('generates incident record when window closes', () => {});
  });

  describe('Service Graph Correlation', () => {
    it('merges downstream alerts into upstream window when dependency exists', () => {});
    it('does not merge alerts for unrelated services', () => {});
    it('handles circular dependencies without infinite loop', () => {});
    it('traverses multi-level dependency chains (A→B→C)', () => {});
  });

  describe('Deploy Correlation', () => {
    it('tags incident with deploy_id when deploy event within 10min of first alert', () => {});
    it('does not correlate deploy older than 10 minutes', () => {});
    it('correlates deploy to correct service even with multiple recent deploys', () => {});
    it('adds deploy correlation score boost to noise calculation', () => {});
  });

  describe('Multi-Tenant Isolation', () => {
    it('never correlates alerts across different tenants', () => {});
    it('maintains separate windows per tenant', () => {});
    it('handles concurrent alerts from multiple tenants', () => {});
  });
});
```

**Mocking strategy:**
- Mock Redis client (`ioredis-mock`) for window state
- Mock DynamoDB client for service dependency reads
- Mock SQS for downstream message publishing
- Use `vi.useFakeTimers()` for time-window testing (vitest's built-in fake timers, matching the `vi.advanceTimersByTime` calls in the integration tests)

### 3.5 Noise Scorer

```typescript
describe('NoiseScorer', () => {
  describe('Rule-Based Scoring', () => {
    it('returns 0 for first-ever alert from a service (no history)', () => {});
    it('scores higher when alert has fired >5 times in 24 hours', () => {});
    it('scores higher when alert auto-resolved within 5 minutes', () => {});
    it('adds deploy correlation bonus (+15 points) when deploy is recent', () => {});
    it('adds feature-flag bonus (+5 points) when PR title matches config/feature-flag', () => {});
    it('caps total score at 100', () => {});
    it('never scores critical severity alerts above 80 (safety cap)', () => {});
  });

  describe('Threshold Calculations', () => {
    it('classifies score 0-30 as signal (keep)', () => {});
    it('classifies score 31-70 as review (annotate)', () => {});
    it('classifies score 71-100 as noise (suggest suppress)', () => {});
    it('uses tenant-specific thresholds when configured', () => {});
  });

  describe('What-Would-Have-Happened', () => {
    it('calculates suppression count for historical window', () => {});
    it('reports zero false negatives when no suppressed alert was critical', () => {});
    it('flags false negative when suppressed alert was later escalated', () => {});
  });
});
```

**Mocking strategy:** Mock the alert history store (DynamoDB queries). Scorer logic itself is pure calculation.

### 3.6 Notification Formatter

```typescript
describe('NotificationFormatter', () => {
  describe('Slack Blocks', () => {
    it('formats single-alert notification with service, title, severity', () => {});
    it('formats correlated incident with alert count and sources', () => {});
    it('includes deploy trigger when deploy correlation exists', () => {});
    it('includes noise score badge (🟢 signal / 🟡 review / 🔴 noise)', () => {});
    it('includes feedback buttons (👍 Helpful / 👎 Not helpful)', () => {});
    it('formats in-place update message (replaces initial alert)', () => {});
  });

  describe('Weekly Digest', () => {
    it('aggregates 7 days of incidents into summary stats', () => {});
    it('highlights top 3 noisiest services', () => {});
    it('shows suppression savings ("would have saved X pages")', () => {});
  });
});
```

**Mocking strategy:** Snapshot tests — render the Slack blocks to JSON and compare against golden fixtures.

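
A minimal sketch of the golden-fixture comparison, assuming a hypothetical `formatIncidentBlocks` function and `Incident` shape (neither is defined in this document). The score bands reuse the default thresholds from 3.5, and the block layout follows Slack's Block Kit `section` and `actions` block types. The golden value is inlined here for brevity; the real suite would load it from a fixture file.

```typescript
import { strict as assert } from "node:assert";

type Classification = "signal" | "review" | "noise";
type SlackBlock = { type: string; [key: string]: unknown };

// Hypothetical incident shape; field names are illustrative only.
interface Incident {
  service: string;
  title: string;
  severity: string;
  alertCount: number;
  noiseScore: number;
}

// Default bands from section 3.5: 0-30 signal, 31-70 review, 71-100 noise.
function classify(score: number): Classification {
  if (score <= 30) return "signal";
  if (score <= 70) return "review";
  return "noise";
}

const BADGES: Record<Classification, string> = {
  signal: "🟢 signal",
  review: "🟡 review",
  noise: "🔴 noise",
};

// Renders a Block Kit payload: one summary section plus the feedback buttons.
function formatIncidentBlocks(incident: Incident): SlackBlock[] {
  const badge = BADGES[classify(incident.noiseScore)];
  const summary =
    `*${incident.title}*\n` +
    `Service: ${incident.service} | Severity: ${incident.severity} | ` +
    `${incident.alertCount} alerts grouped | ${badge}`;
  return [
    { type: "section", text: { type: "mrkdwn", text: summary } },
    {
      type: "actions",
      elements: [
        { type: "button", text: { type: "plain_text", text: "👍 Helpful" }, action_id: "feedback_helpful" },
        { type: "button", text: { type: "plain_text", text: "👎 Not helpful" }, action_id: "feedback_not_helpful" },
      ],
    },
  ];
}

// Stand-in for a golden fixture file.
const goldenSummary: SlackBlock = {
  type: "section",
  text: {
    type: "mrkdwn",
    text: "*High latency on payment-api*\nService: payment-api | Severity: high | 3 alerts grouped | 🔴 noise",
  },
};

const blocks = formatIncidentBlocks({
  service: "payment-api",
  title: "High latency on payment-api",
  severity: "high",
  alertCount: 3,
  noiseScore: 82,
});

// Structural comparison: any change to the rendered block fails the check.
assert.deepEqual(blocks[0], goldenSummary);
```

Comparing parsed structures rather than raw strings keeps the golden files insensitive to key order and whitespace in the serialized JSON.
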
### 3.7 Governance Policy Engine

```typescript
describe('GovernancePolicy', () => {
  describe('Mode Enforcement', () => {
    it('in strict mode: annotates alerts but never suppresses', () => {});
    it('in audit mode: auto-suppresses with full logging', () => {});
    it('defaults new tenants to strict mode', () => {});
  });

  describe('Panic Mode', () => {
    it('when panic=true: all suppression stops immediately', () => {});
    it('when panic=true: all alerts pass through unmodified', () => {});
    it('panic mode activatable via Redis key check', () => {});
    it('panic mode shows banner in dashboard API response', () => {});
  });

  describe('Per-Customer Override', () => {
    it('customer can set stricter mode than system default', () => {});
    it('customer cannot set less restrictive mode than system default', () => {});
    it('merge logic: max_restrictive(system, customer)', () => {});
  });

  describe('Policy Decision Logging', () => {
    it('logs "suppressed by audit mode" with full context', () => {});
    it('logs "annotation-only, strict mode active" for strict tenants', () => {});
    it('logs "panic mode active — all alerts passing through"', () => {});
  });
});
```

### 3.8 Feature Flag Circuit Breaker

```typescript
describe('SuppressionCircuitBreaker', () => {
  it('allows suppression when volume is within baseline', () => {});
  it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
  it('auto-disables the scoring flag when breaker trips', () => {});
  it('replays suppressed alerts from DLQ when breaker trips', () => {});
  it('resets breaker after manual flag re-enable', () => {});
  it('tracks suppression count per flag in Redis sliding window', () => {});
});
```

---

## Section 4: Integration Test Strategy

### 4.1 Webhook Contract Tests

Each provider integration gets a contract test suite that validates the full path: HTTP request → Lambda → SQS message.

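
The message at the end of that path is the canonical alert, and the contract tests in this section assert on its `tenant_id`, `provider`, `severity`, and 64-character hex `fingerprint`. The sketch below shows one way those fields could be produced and checked. The `CanonicalAlert` shape beyond those four fields, the NUL delimiter between hashed fields, and the whitespace normalization rule are assumptions for illustration; only the hashed inputs (tenant_id + provider + service + title, per 3.3) come from this document.

```typescript
import { createHash } from "node:crypto";
import { strict as assert } from "node:assert";

const SEVERITIES = ["critical", "high", "medium", "low", "info"] as const;
type Severity = (typeof SEVERITIES)[number];

// Hypothetical canonical shape; `service` and `title` are assumed fields.
interface CanonicalAlert {
  tenant_id: string;
  provider: string;
  service: string;
  title: string;
  severity: Severity;
  fingerprint: string;
}

// SHA-256 over tenant_id + provider + service + title (section 3.3).
// Assumptions: fields are joined with a NUL byte so that adjacent values cannot
// run together, and titles are NFC-normalized with whitespace runs collapsed.
function fingerprint(tenantId: string, provider: string, service: string, title: string): string {
  const normalizedTitle = title.normalize("NFC").trim().replace(/\s+/g, " ");
  return createHash("sha256")
    .update([tenantId, provider, service, normalizedTitle].join("\u0000"))
    .digest("hex");
}

// Type guard mirroring the assertions made on the SQS message body.
function isCanonicalAlert(value: unknown): value is CanonicalAlert {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.tenant_id === "string" &&
    typeof v.provider === "string" &&
    typeof v.service === "string" &&
    typeof v.title === "string" &&
    typeof v.severity === "string" &&
    (SEVERITIES as readonly string[]).includes(v.severity) &&
    typeof v.fingerprint === "string" &&
    /^[a-f0-9]{64}$/.test(v.fingerprint)
  );
}

// Timestamps are not hashed, so repeated firings of one alert share a fingerprint.
const first = fingerprint("tenant-123", "datadog", "payment-api", "High  latency on payment-api");
const second = fingerprint("tenant-123", "datadog", "payment-api", " High latency on payment-api ");
assert.equal(first, second);
```

A guard like this can double as the message validator on the consumer side, so the producer and the correlation engine agree on one definition of "canonical".
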
```typescript
// tests/integration/webhooks/datadog.contract.test.ts
describe('Datadog Webhook Contract', () => {
  let localstack: LocalStackContainer;
  let sqsClient: SQSClient;

  beforeAll(async () => {
    localstack = await new LocalStackContainer().start();
    sqsClient = new SQSClient({ endpoint: localstack.getEndpoint() });
    // Create SQS FIFO queue
    await sqsClient.send(new CreateQueueCommand({
      QueueName: 'alert-ingested.fifo',
      Attributes: { FifoQueue: 'true', ContentBasedDeduplication: 'true' }
    }));
  });

  it('accepts valid Datadog webhook and produces canonical SQS message', async () => {
    const payload = loadFixture('webhooks/datadog/single-alert.json');
    const signature = computeHmac(payload, TEST_SECRET);

    const res = await request(app)
      .post('/v1/wh/tenant-123/datadog')
      .set('DD-WEBHOOK-SIGNATURE', signature)
      .send(payload);

    expect(res.status).toBe(200);

    const messages = await pollSqs(sqsClient, 'alert-ingested.fifo');
    expect(messages).toHaveLength(1);
    expect(messages[0].body).toMatchObject({
      tenant_id: 'tenant-123',
      provider: 'datadog',
      severity: expect.stringMatching(/critical|high|medium|low|info/),
      fingerprint: expect.stringMatching(/^[a-f0-9]{64}$/),
    });
  });

  it('rejects webhook with invalid HMAC and produces no SQS message', async () => {
    const payload = loadFixture('webhooks/datadog/single-alert.json');

    const res = await request(app)
      .post('/v1/wh/tenant-123/datadog')
      .set('DD-WEBHOOK-SIGNATURE', 'bad-signature')
      .send(payload);

    expect(res.status).toBe(401);

    const messages = await pollSqs(sqsClient, 'alert-ingested.fifo', { waitMs: 1000 });
    expect(messages).toHaveLength(0);
  });
});
```

Repeat pattern for PagerDuty, OpsGenie, Grafana — each with provider-specific signature headers and payload formats.

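
One way to keep the repeated suites from drifting apart is to drive them from a small provider table. The sketch below is illustrative only: `PROVIDERS` and `signedHeaders` are hypothetical names, and the header names and formats are taken from the comments in 3.2 (hex-encoded HMAC-SHA256 for Datadog, a `v1=` prefix for PagerDuty). The OpsGenie value format is an assumption, and Grafana is left out because this document does not name its signature header.

```typescript
import { createHmac } from "node:crypto";
import { strict as assert } from "node:assert";

interface ProviderSigning {
  header: string;
  // Wraps the hex digest in the provider's header format.
  format: (hexDigest: string) => string;
}

// Header names as listed in section 3.2; see the assumptions noted above.
const PROVIDERS: Record<string, ProviderSigning> = {
  datadog: { header: "DD-WEBHOOK-SIGNATURE", format: (hex) => hex },
  pagerduty: { header: "X-PagerDuty-Signature", format: (hex) => `v1=${hex}` },
  opsgenie: { header: "X-OpsGenie-Signature", format: (hex) => hex }, // value format assumed
};

// Builds the signature header a contract test would attach to its request.
function signedHeaders(provider: string, body: string, secret: string): Record<string, string> {
  const cfg = PROVIDERS[provider];
  if (cfg === undefined) throw new Error(`unknown provider: ${provider}`);
  const hex = createHmac("sha256", secret).update(body).digest("hex");
  return { [cfg.header]: cfg.format(hex) };
}

// The same body and secret yield one header per provider, each in its own format.
const body = JSON.stringify({ title: "High latency", service: "payment-api" });
for (const provider of Object.keys(PROVIDERS)) {
  const headers = signedHeaders(provider, body, "test-secret");
  assert.equal(Object.keys(headers).length, 1);
}
```

Each provider's suite can then loop over its fixtures and call `signedHeaders(provider, rawBody, secret)` instead of hand-building the header, which keeps the accept and reject cases structurally identical across providers.
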
### 4.2 Correlation Engine → Redis Integration

```typescript
// tests/integration/correlation/redis-windows.test.ts
describe('Correlation Engine + Redis', () => {
  let redis: StartedTestContainer;
  let redisClient: Redis;

  beforeAll(async () => {
    redis = await new GenericContainer('redis:7-alpine')
      .withExposedPorts(6379)
      .start();
    redisClient = new Redis({ host: redis.getHost(), port: redis.getMappedPort(6379) });
  });

  it('opens window in Redis sorted set with correct TTL', async () => {
    await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));

    const windows = await redisClient.zrange('windows:tenant-123', 0, -1, 'WITHSCORES');
    expect(windows).toHaveLength(2); // [windowId, closesAtEpoch]

    const ttl = await redisClient.ttl('window:tenant-123:payment-api');
    expect(ttl).toBeGreaterThan(280); // ~5min minus processing time
  });

  it('extends window when alert arrives in last 30 seconds', async () => {
    // Open window, advance clock to T+4m31s, send another alert.
    // Note: vi fake timers move only the engine's clock; the Redis container
    // still expires keys on real time.
    await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
    vi.advanceTimersByTime(4 * 60 * 1000 + 31 * 1000);
    await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));

    const ttl = await redisClient.ttl('window:tenant-123:payment-api');
    expect(ttl).toBeGreaterThan(100); // Extended by ~2min
  });

  it('isolates windows between tenants', async () => {
    await correlationEngine.processAlert(makeAlert({ tenant: 'A', service: 'api' }));
    await correlationEngine.processAlert(makeAlert({ tenant: 'B', service: 'api' }));

    const windowsA = await redisClient.zrange('windows:A', 0, -1);
    const windowsB = await redisClient.zrange('windows:B', 0, -1);
    expect(windowsA).toHaveLength(1);
    expect(windowsB).toHaveLength(1);
    expect(windowsA[0]).not.toBe(windowsB[0]);
  });
});
```

### 4.3 Correlation Engine → DynamoDB Integration

```typescript
// tests/integration/correlation/dynamodb-incidents.test.ts
describe('Correlation Engine + DynamoDB', () => {
  let dynamodb: StartedTestContainer;
  beforeAll(async () => {
    dynamodb = await new GenericContainer('amazon/dynamodb-local:latest')
      .withExposedPorts(8000)
      .start();
    // Create tables: alerts, incidents, tenant_config, service_dependencies
  });

  it('persists incident record when correlation window closes', async () => {
    await correlationEngine.processAlert(makeAlert({ service: 'api' }));
    await correlationEngine.processAlert(makeAlert({ service: 'api' }));
    await correlationEngine.closeExpiredWindows();

    const incidents = await queryIncidents('tenant-123');
    expect(incidents).toHaveLength(1);
    expect(incidents[0].alert_count).toBe(2);
    expect(incidents[0].services).toContain('api');
  });

  it('reads service dependencies for cascading correlation', async () => {
    await putServiceDependency('tenant-123', 'api', 'database');
    await correlationEngine.processAlert(makeAlert({ service: 'database' }));
    await correlationEngine.processAlert(makeAlert({ service: 'api' }));

    // Both should be in the same window
    const windows = await getActiveWindows('tenant-123');
    expect(windows).toHaveLength(1);
    expect(windows[0].services).toEqual(expect.arrayContaining(['api', 'database']));
  });
});
```

### 4.4 Correlation Engine → TimescaleDB Integration

```typescript
// tests/integration/correlation/timescaledb-trends.test.ts
describe('Correlation Engine + TimescaleDB', () => {
  let pg: StartedTestContainer;

  beforeAll(async () => {
    pg = await new GenericContainer('timescale/timescaledb:latest-pg16')
      .withExposedPorts(5432)
      .withEnvironment({ POSTGRES_PASSWORD: 'test' })
      .start();
    // Run migrations: create hypertables, continuous aggregates
  });

  it('writes alert frequency data to hypertable', async () => {
    await correlationEngine.recordAlertEvent(makeAlert({ service: 'api' }));
    const rows = await query('SELECT * FROM alert_events WHERE service = $1', ['api']);
    expect(rows).toHaveLength(1);
  });

  it('continuous aggregate calculates hourly alert counts', async () => {
    // Insert 10 alerts spread over 2 hours
    await insertAlertEvents(10, {
      spreadHours: 2,
    });
    await refreshContinuousAggregate('hourly_alert_summary');

    const summary = await query('SELECT * FROM hourly_alert_summary');
    expect(summary).toHaveLength(2);
    expect(summary.reduce((s, r) => s + r.alert_count, 0)).toBe(10);
  });
});
```

### 4.5 Notification Service → Slack (WireMock)

```typescript
// tests/integration/notifications/slack.test.ts
describe('Notification Service + Slack', () => {
  let wiremock: WireMockContainer;

  beforeAll(async () => {
    wiremock = await new WireMockContainer().start();
    wiremock.stub({
      request: { method: 'POST', urlPath: '/api/chat.postMessage' },
      response: { status: 200, body: JSON.stringify({ ok: true, ts: '1234.5678' }) }
    });
    wiremock.stub({
      request: { method: 'POST', urlPath: '/api/chat.update' },
      response: { status: 200, body: JSON.stringify({ ok: true }) }
    });
  });

  it('sends initial alert notification to correct Slack channel', async () => {});
  it('updates message in-place when correlation completes', async () => {});
  it('respects Slack rate limits (1 msg/sec per channel)', async () => {});
  it('retries on 429 with exponential backoff', async () => {});
  it('includes feedback buttons in correlated incident message', async () => {});
});
```

---

## Section 5: E2E & Smoke Tests

### 5.1 Critical User Journeys

**Journey 1: 60-Second Time-to-Value**

The defining test for dd0c/alert. Validates the entire pipeline from webhook to Slack notification.

```typescript
// tests/e2e/journeys/sixty-second-ttv.test.ts
describe('60-Second Time-to-Value', () => {
  it('delivers first correlated incident to Slack within 60 seconds of webhook', async () => {
    const start = Date.now();

    // 1. Send Datadog webhook
    await sendWebhook('datadog', fixtures.datadog.singleAlert, { tenant: 'e2e-tenant' });

    // 2.
    // Wait for Slack message
    const slackMessage = await waitForSlackMessage('e2e-channel', { timeoutMs: 60_000 });

    const elapsed = Date.now() - start;
    expect(elapsed).toBeLessThan(60_000);
    expect(slackMessage.text).toContain('New alert');
    expect(slackMessage.blocks).toBeDefined();
  });
});
```

**Journey 2: Alert Storm Correlation**

```typescript
// tests/e2e/journeys/alert-storm.test.ts
describe('Alert Storm Correlation', () => {
  it('groups 50 alerts in 2 minutes into a single correlated incident', async () => {
    // Fire 50 alerts for same service over 2 minutes
    for (let i = 0; i < 50; i++) {
      await sendWebhook('datadog', makeAlertPayload({
        service: 'payment-api',
        title: `High latency on payment-api (${i})`,
      }));
      await sleep(2400); // ~50 alerts in 2 min
    }

    // Wait for correlation window to close
    await sleep(5 * 60 * 1000 + 30_000); // 5min window + buffer

    const slackMessages = await getSlackMessages('e2e-channel');
    const incidentMessages = slackMessages.filter(m => m.text.includes('Incident'));
    expect(incidentMessages).toHaveLength(1);
    expect(incidentMessages[0].text).toContain('50 alerts grouped');
  });
});
```

**Journey 3: Deploy Correlation**

```typescript
// tests/e2e/journeys/deploy-correlation.test.ts
describe('Deploy Correlation', () => {
  it('identifies deploy as trigger when alerts follow within 10 minutes', async () => {
    // 1. Send deploy event
    await sendWebhook('github-actions', makeDeployPayload({
      service: 'payment-api',
      commit: 'abc123',
      pr_title: 'feat: add retry logic',
    }));

    // 2. Wait 2 minutes, then fire alerts
    await sleep(2 * 60 * 1000);
    await sendWebhook('datadog', makeAlertPayload({ service: 'payment-api' }));
    await sendWebhook('pagerduty', makeAlertPayload({ service: 'payment-api' }));

    // 3.
    // Wait for correlation
    await sleep(6 * 60 * 1000);

    const slackMessage = await getLatestSlackMessage('e2e-channel');
    expect(slackMessage.text).toContain('Deploy #');
    expect(slackMessage.text).toContain('abc123');
  });
});
```

**Journey 4: Panic Mode**

```typescript
// tests/e2e/journeys/panic-mode.test.ts
describe('Panic Mode', () => {
  it('stops all suppression immediately when panic mode is activated', async () => {
    // 1. Enable audit mode, verify suppression works
    await setGovernanceMode('e2e-tenant', 'audit');
    await sendNoisyAlerts(10);
    const beforePanic = await getSlackMessages('e2e-channel');
    const suppressedBefore = beforePanic.filter(m => m.text.includes('suppressed'));

    // 2. Activate panic mode
    await fetch('/admin/panic', { method: 'POST' });

    // 3. Send more alerts — all should pass through
    await sendNoisyAlerts(10);
    const afterPanic = await getSlackMessages('e2e-channel');
    const rawAlerts = afterPanic.filter(m => !m.text.includes('suppressed'));
    expect(rawAlerts.length).toBeGreaterThanOrEqual(10);
  });
});
```

### 5.2 E2E Infrastructure

```yaml
# docker-compose.e2e.yml
services:
  localstack:
    image: localstack/localstack:3
    environment:
      SERVICES: sqs,s3,dynamodb,apigateway,lambda
    ports: ["4566:4566"]

  timescaledb:
    image: timescale/timescaledb:latest-pg16
    environment:
      POSTGRES_PASSWORD: test
    ports: ["5432:5432"]

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  wiremock:
    image: wiremock/wiremock:3
    ports: ["8080:8080"]
    volumes:
      - ./fixtures/wiremock:/home/wiremock/mappings

  app:
    build: .
    environment:
      AWS_ENDPOINT: http://localstack:4566
      REDIS_URL: redis://redis:6379
      TIMESCALE_URL: postgres://postgres:test@timescaledb:5432/test
      SLACK_API_URL: http://wiremock:8080
    depends_on: [localstack, timescaledb, redis, wiremock]
```

### 5.3 Synthetic Alert Generation

```typescript
// tests/e2e/helpers/alert-generator.ts
export function makeAlertPayload(overrides: Partial = {}): DatadogWebhookPayload {
  return {
    id: ulid(),
    title: overrides.title ??
      `Alert: ${faker.hacker.phrase()}`,
    text: faker.lorem.sentence(),
    date_happened: Math.floor(Date.now() / 1000),
    priority: overrides.priority ?? 'normal',
    tags: [`service:${overrides.service ?? 'test-service'}`],
    alert_type: overrides.severity ?? 'warning',
    ...overrides,
  };
}

export async function sendNoisyAlerts(count: number, opts?: { service?: string }) {
  for (let i = 0; i < count; i++) {
    await sendWebhook('datadog', makeAlertPayload({
      service: opts?.service ?? 'noisy-service',
      title: `Flapping alert #${i}`,
    }));
  }
}
```

---

## Section 6: Performance & Load Testing

### 6.1 Alert Ingestion Throughput

```typescript
// tests/perf/ingestion-throughput.test.ts
describe('Ingestion Throughput', () => {
  it('processes 1000 webhooks/second without dropping payloads', async () => {
    const results = await k6.run({
      vus: 100,
      duration: '30s',
      thresholds: {
        http_req_duration: ['p(95)<200'], // 200ms p95 (k6 percentile syntax)
        http_req_failed: ['rate<0.001'], // <0.1% failure
      },
      script: `
        import http from 'k6/http';
        export default function() {
          http.post('${WEBHOOK_URL}/v1/wh/perf-tenant/datadog',
            JSON.stringify(makeAlertPayload()),
            { headers: { 'DD-WEBHOOK-SIGNATURE': validSig } }
          );
        }
      `,
    });

    expect(results.metrics.http_req_failed.rate).toBeLessThan(0.001);
  });
});
```

### 6.2 Correlation Latency Under Alert Storms

```typescript
describe('Correlation Storm Performance', () => {
  it('correlates 500 alerts across 10 services within 30 seconds', async () => {
    const start = Date.now();

    // Simulate incident storm: 500 alerts, 10 services, 2 minutes
    await generateAlertStorm({ alerts: 500, services: 10, durationMs: 120_000 });

    // Wait for all windows to close
    await waitForIncidents('perf-tenant', { minCount: 1, timeoutMs: 30_000 });

    const elapsed = Date.now() - start - 120_000; // subtract generation time
    expect(elapsed).toBeLessThan(30_000);
  });

  it('Redis memory stays under 50MB during 10K active windows', async () => {
    // Open 10K windows across 100 tenants
    for (let t = 0; t < 100; t++) {
      for (let s = 0; s <
          100; s++) {
        await correlationEngine.processAlert(makeAlert({
          tenant: `tenant-${t}`,
          service: `service-${s}`,
        }));
      }
    }

    const memoryUsage = await redisClient.info('memory');
    const usedMb = parseRedisMemory(memoryUsage);
    expect(usedMb).toBeLessThan(50);
  });
});
```

### 6.3 Noise Scoring Latency

```typescript
describe('Noise Scoring Performance', () => {
  it('scores a correlated incident with 50 alerts in <100ms', async () => {
    const incident = makeIncident({ alertCount: 50, withHistory: true });

    const start = performance.now();
    const score = await noiseScorer.score(incident);
    const elapsed = performance.now() - start;

    expect(elapsed).toBeLessThan(100);
    expect(score).toBeGreaterThanOrEqual(0);
    expect(score).toBeLessThanOrEqual(100);
  });
});
```

### 6.4 Memory Pressure During High-Cardinality Correlation

```typescript
describe('Memory Pressure', () => {
  it('ECS task stays under 512MB with 1000 concurrent correlation windows', async () => {
    // Monitor ECS task memory while processing high-cardinality alerts
    const memBefore = process.memoryUsage().heapUsed;

    await processHighCardinalityAlerts({ tenants: 100, servicesPerTenant: 10 });

    const memAfter = process.memoryUsage().heapUsed;
    const deltaMb = (memAfter - memBefore) / 1024 / 1024;
    expect(deltaMb).toBeLessThan(256); // Leave headroom in 512MB task
  });
});
```

---

## Section 7: CI/CD Pipeline Integration

### 7.1 Pipeline Stages

```
┌─────────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Pre-Commit  │───▶│ PR Gate  │───▶│  Merge   │───▶│ Staging  │───▶│   Prod   │
│   (local)   │    │   (CI)   │    │   (CI)   │    │   (CD)   │    │   (CD)   │
└─────────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
  lint + format     unit tests      full suite      E2E + perf      smoke + canary
  type check        integration     coverage gate   LocalStack      deploy event
  <10s              <5min           <10min          <15min          self-dogfood
```

### 7.2 Stage Details

**Pre-Commit (local, <10s):**
- `eslint` + `prettier` format check
- `tsc --noEmit` type check
- Affected unit tests only (`vitest --changed`)

**PR Gate (CI,
<5min):** - Full unit test suite - Integration tests (Testcontainers spin up in CI) - Schema migration lint (no DROP/RENAME/TYPE changes) - Decision log presence check for scoring/correlation PRs - Coverage diff: new code must have ≥80% coverage **Merge to Main (CI, <10min):** - Full test suite (unit + integration) - Coverage gate: overall ≥80%, scoring engine ≥90% - CDK synth + diff (infrastructure changes) - Security scan (`npm audit`, `trivy`) **Staging (CD, <15min):** - Deploy to staging environment - E2E journey tests against LocalStack - Performance benchmarks (ingestion throughput, correlation latency) - Synthetic alert generation + validation **Production (CD):** - Canary deploy (10% traffic for 5 minutes) - Smoke tests (send test webhook, verify Slack delivery) - dd0c/alert dogfoods itself: deploy event sent to own webhook - Automated rollback if error rate >1% during canary ### 7.3 Coverage Thresholds | Component | Minimum | Target | |-----------|---------|--------| | Webhook Parsers | 90% | 95% | | HMAC Validator | 95% | 100% | | Correlation Engine | 85% | 90% | | Noise Scorer | 90% | 95% | | Governance Policy | 90% | 95% | | Notification Formatter | 75% | 85% | | Overall | 80% | 85% | ### 7.4 Test Parallelization ```yaml # .github/workflows/test.yml jobs: unit: runs-on: ubuntu-latest strategy: matrix: shard: [1, 2, 3, 4] steps: - run: vitest --shard=${{ matrix.shard }}/4 integration: runs-on: ubuntu-latest strategy: matrix: suite: [webhooks, correlation, notifications, storage] steps: - run: vitest --project=integration --grep=${{ matrix.suite }} e2e: needs: [unit, integration] runs-on: ubuntu-latest steps: - run: docker compose -f docker-compose.e2e.yml up -d - run: vitest --project=e2e ``` --- ## Section 8: Transparent Factory Tenet Testing ### 8.1 Atomic Flagging — Suppression Circuit Breaker ```typescript describe('Atomic Flagging', () => { describe('Flag Lifecycle', () => { it('new scoring rule flag defaults to false (off)', () => {}); it('flag has 
owner and ttl metadata', () => {});
    it('CI blocks when flag at 100% exceeds 14-day TTL', () => {});
  });

  describe('Circuit Breaker on Suppression Volume', () => {
    it('allows suppression when volume is within 2x baseline', () => {});
    it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
    it('auto-disables the flag when breaker trips', () => {});
    it('buffers suppressed alerts in DLQ during normal operation', () => {});
    it('replays DLQ alerts when breaker trips', async () => {
      // 1. Enable scoring flag, suppress 20 alerts
      // 2. Trip the breaker by spiking suppression rate
      // 3. Verify all 20 suppressed alerts are re-emitted from DLQ
      // 4. Verify flag is now disabled
    });
    it('DLQ retains alerts for 1 hour before expiry', () => {});
  });

  describe('Local Evaluation', () => {
    it('flag evaluation does not make network calls', () => {});
    it('flag state is cached in-memory and refreshed every 60s', () => {});
  });
});
```

### 8.2 Elastic Schema — Migration Validation

```typescript
describe('Elastic Schema', () => {
  describe('Migration Lint', () => {
    it('rejects migration with DROP COLUMN statement', () => {
      const migration = 'ALTER TABLE alert_events DROP COLUMN old_field;';
      expect(lintMigration(migration)).toContainError('DROP not allowed');
    });
    it('rejects migration with ALTER COLUMN TYPE', () => {
      const migration = 'ALTER TABLE alert_events ALTER COLUMN severity TYPE integer;';
      expect(lintMigration(migration)).toContainError('TYPE change not allowed');
    });
    it('rejects migration with RENAME COLUMN', () => {});
    it('accepts migration with ADD COLUMN (nullable)', () => {
      const migration = 'ALTER TABLE alert_events ADD COLUMN noise_score_v2 integer;';
      expect(lintMigration(migration)).toBeValid();
    });
    it('accepts migration with new table creation', () => {});
  });

  describe('DynamoDB Schema', () => {
    it('rejects attribute type change in table definition', () => {});
    it('accepts new attribute addition', () => {});
    it('V1 code ignores V2 attributes without
error', () => {});
  });

  describe('Sunset Enforcement', () => {
    it('every migration file contains sunset_date comment', () => {
      const migrations = glob.sync('migrations/*.sql');
      for (const m of migrations) {
        const content = fs.readFileSync(m, 'utf-8');
        expect(content).toMatch(/-- sunset_date: \d{4}-\d{2}-\d{2}/);
      }
    });
    it('CI warns when migration is past sunset date', () => {});
  });
});
```

### 8.3 Cognitive Durability — Decision Log Validation

```typescript
import * as fs from 'node:fs';
import { execSync } from 'node:child_process';
import { glob } from 'glob'; // glob v9+ assumed (exposes glob.sync)

describe('Cognitive Durability', () => {
  it('decision_log.json exists for every PR touching scoring/', () => {
    // CI hook: check git diff for files in src/scoring/
    // If touched, require docs/decisions/*.json in the same PR
  });

  it('decision log has required fields', () => {
    const logs = glob.sync('docs/decisions/*.json');
    for (const log of logs) {
      const entry = JSON.parse(fs.readFileSync(log, 'utf-8'));
      expect(entry).toHaveProperty('reasoning');
      expect(entry).toHaveProperty('alternatives_considered');
      expect(entry).toHaveProperty('confidence');
      expect(entry).toHaveProperty('timestamp');
      expect(entry).toHaveProperty('author');
    }
  });

  it('cyclomatic complexity stays under 10 for all scoring functions', () => {
    // Run eslint with the complexity rule. execSync returns stdout and throws
    // when the command exits non-zero, so a clean run means "does not throw".
    expect(() =>
      execSync('eslint src/scoring/ --rule "complexity: [error, 10]"', { stdio: 'pipe' })
    ).not.toThrow();
  });
});
```

### 8.4 Semantic Observability — OTEL Span Assertions

```typescript
describe('Semantic Observability', () => {
  let spanExporter: InMemorySpanExporter;

  beforeEach(() => {
    spanExporter = new InMemorySpanExporter();
    // Configure OTEL with in-memory exporter for testing
  });

  describe('Alert Evaluation Spans', () => {
    it('emits parent alert_evaluation span for each alert', async () => {
      await processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const evalSpan = spans.find(s => s.name === 'alert_evaluation');
      expect(evalSpan).toBeDefined();
    });

    it('emits child noise_scoring span with score attributes', async () => {
      await
        processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const scoreSpan = spans.find(s => s.name === 'noise_scoring');
      expect(scoreSpan).toBeDefined();
      expect(scoreSpan.attributes['alert.noise_score']).toBeGreaterThanOrEqual(0);
      expect(scoreSpan.attributes['alert.noise_score']).toBeLessThanOrEqual(100);
    });

    it('emits child correlation_matching span with match data', async () => {
      await processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const corrSpan = spans.find(s => s.name === 'correlation_matching');
      expect(corrSpan).toBeDefined();
      expect(corrSpan.attributes).toHaveProperty('alert.correlation_matches');
    });

    it('emits suppression_decision span with reason', async () => {
      await processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const suppSpan = spans.find(s => s.name === 'suppression_decision');
      expect(suppSpan.attributes).toHaveProperty('alert.suppressed');
      expect(suppSpan.attributes).toHaveProperty('alert.suppression_reason');
    });
  });

  describe('PII Protection', () => {
    it('never includes raw alert payload in span attributes', async () => {
      await processAlert(makeAlert({ title: 'User john@example.com failed login' }));
      const spans = spanExporter.getFinishedSpans();
      for (const span of spans) {
        const attrs = JSON.stringify(span.attributes);
        expect(attrs).not.toContain('john@example.com');
      }
    });

    it('uses hashed alert source identifier, not raw', async () => {
      await processAlert(makeAlert({ source: 'prod-payment-api' }));
      const spans = spanExporter.getFinishedSpans();
      const evalSpan = spans.find(s => s.name === 'alert_evaluation');
      expect(evalSpan.attributes['alert.source']).toMatch(/^[a-f0-9]+$/);
    });
  });
});
```

### 8.5 Configurable Autonomy — Governance Policy Tests

```typescript
describe('Configurable Autonomy', () => {
  describe('Governance Mode Enforcement', () => {
    it('strict mode: annotates but never suppresses', async () => {
      setPolicy({ governance_mode: 'strict' });
      const result = await
        processNoisyAlert(makeAlert({ noiseScore: 95 }));
      expect(result.suppressed).toBe(false);
      expect(result.annotation).toContain('noise_score: 95');
    });

    it('audit mode: auto-suppresses with logging', async () => {
      setPolicy({ governance_mode: 'audit' });
      const result = await processNoisyAlert(makeAlert({ noiseScore: 95 }));
      expect(result.suppressed).toBe(true);
      expect(result.log).toContain('suppressed by audit mode');
    });
  });

  describe('Panic Mode', () => {
    it('activates in <1 second via API call', async () => {
      const start = Date.now();
      await fetch('/admin/panic', { method: 'POST' });
      const panicActive = await redisClient.get('dd0c:panic');
      expect(Date.now() - start).toBeLessThan(1000);
      expect(panicActive).toBe('true');
    });

    it('stops all suppression when active', async () => {
      await activatePanic();
      const results = await Promise.all(
        Array.from({ length: 10 }, () => processNoisyAlert(makeAlert({ noiseScore: 99 })))
      );
      expect(results.every(r => r.suppressed === false)).toBe(true);
    });
  });

  describe('Per-Customer Override', () => {
    it('customer strict overrides system audit', async () => {
      setPolicy({ governance_mode: 'audit' });
      setCustomerPolicy('tenant-123', { governance_mode: 'strict' });
      const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
      expect(result.suppressed).toBe(false);
    });

    it('customer cannot downgrade from system strict to audit', async () => {
      setPolicy({ governance_mode: 'strict' });
      setCustomerPolicy('tenant-123', { governance_mode: 'audit' });
      const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
      expect(result.suppressed).toBe(false); // System strict wins
    });
  });
});
```

---

## Section 9: Test Data & Fixtures

### 9.1 Directory Structure

```
tests/
  fixtures/
    webhooks/
      datadog/
        single-alert.json
        batched-alerts.json
        monitor-recovered.json
        high-priority.json
      pagerduty/
        incident-triggered.json
        incident-resolved.json
        incident-acknowledged.json
      opsgenie/
        alert-created.json
        alert-closed.json
      grafana/
        single-firing.json
        multi-firing.json
        resolved.json
    deploys/
      github-actions-success.json
      github-actions-failure.json
      gitlab-ci-pipeline.json
      argocd-sync.json
    scenarios/
      alert-storm-50-alerts.json
      cascading-failure-3-services.json
      flapping-alert-10-cycles.json
      maintenance-window-suppression.json
      deploy-correlated-incident.json
    slack/
      initial-alert-blocks.json
      correlated-incident-blocks.json
      weekly-digest-blocks.json
    schemas/
      canonical-alert.json
      incident-record.json
      tenant-config.json
```

### 9.2 Alert Payload Factory

```typescript
// tests/helpers/factories.ts
import * as crypto from 'node:crypto';
import { faker } from '@faker-js/faker';
import { ulid } from 'ulid';
// CanonicalAlert, Incident, and DeployEvent come from the project's domain types.

export function makeCanonicalAlert(overrides: Partial<CanonicalAlert> = {}): CanonicalAlert {
  return {
    alert_id: ulid(),
    tenant_id: overrides.tenant_id ?? 'test-tenant',
    provider: overrides.provider ?? 'datadog',
    service: overrides.service ?? 'test-service',
    title: overrides.title ?? `Alert: ${faker.hacker.phrase()}`,
    severity: overrides.severity ?? 'warning',
    fingerprint: overrides.fingerprint ?? crypto.randomBytes(32).toString('hex'),
    timestamp: overrides.timestamp ?? new Date().toISOString(),
    raw_payload_s3_key: overrides.raw_payload_s3_key ?? `raw/${ulid()}.json`,
    metadata: overrides.metadata ?? {},
    ...overrides,
  };
}

export function makeIncident(overrides: Partial<Incident> = {}): Incident {
  const alertCount = overrides.alert_count ?? 5;
  return {
    incident_id: ulid(),
    tenant_id: overrides.tenant_id ?? 'test-tenant',
    services: overrides.services ?? ['test-service'],
    alert_count: alertCount,
    alerts: Array.from({ length: alertCount }, () => makeCanonicalAlert()),
    noise_score: overrides.noise_score ?? 0,
    deploy_correlation: overrides.deploy_correlation ?? null,
    window_opened_at: overrides.window_opened_at ?? new Date().toISOString(),
    window_closed_at: overrides.window_closed_at ?? new Date().toISOString(),
    ...overrides,
  };
}

export function makeDeployEvent(overrides: Partial<DeployEvent> = {}): DeployEvent {
  return {
    deploy_id: ulid(),
    tenant_id: overrides.tenant_id ?? 'test-tenant',
    service: overrides.service ??
      'test-service',
    commit_sha: overrides.commit_sha ?? faker.git.commitSha(),
    pr_title: overrides.pr_title ?? faker.git.commitMessage(),
    deployed_at: overrides.deployed_at ?? new Date().toISOString(),
    provider: overrides.provider ?? 'github-actions',
    ...overrides,
  };
}
```

### 9.3 Noise Scenario Fixtures

```typescript
// tests/helpers/scenarios.ts
import { makeCanonicalAlert, makeDeployEvent } from './factories';

// Timestamp helper used by the cascading-failure scenario.
// Assumption: offsets are seconds from the scenario start.
const scenarioStart = Date.now();
const t = (offsetSeconds: number): string =>
  new Date(scenarioStart + offsetSeconds * 1000).toISOString();

export const NOISE_SCENARIOS = {
  alertStorm: {
    description: '50 alerts for same service in 2 minutes',
    alerts: Array.from({ length: 50 }, (_, i) => makeCanonicalAlert({
      service: 'payment-api',
      title: `High latency variant ${i}`,
      timestamp: new Date(Date.now() + i * 2400).toISOString(),
    })),
    expectedIncidents: 1,
    expectedNoiseScore: { min: 70, max: 95 },
  },

  flappingAlert: {
    description: 'Alert fires and resolves 10 times in 1 hour',
    alerts: Array.from({ length: 20 }, (_, i) => makeCanonicalAlert({
      service: 'health-check',
      title: 'Health check failed',
      severity: i % 2 === 0 ? 'warning' : 'info', // alternating fire/resolve
      timestamp: new Date(Date.now() + i * 3 * 60 * 1000).toISOString(),
    })),
    expectedNoiseScore: { min: 80, max: 100 },
  },

  cascadingFailure: {
    description: 'Database fails, then API, then frontend',
    alerts: [
      makeCanonicalAlert({ service: 'database', severity: 'critical', timestamp: t(0) }),
      makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(30) }),
      makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(45) }),
      makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(60) }),
      makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(90) }),
    ],
    serviceDependencies: [['api', 'database'], ['frontend', 'api']],
    expectedIncidents: 1, // All merged via dependency graph
    expectedNoiseScore: { min: 0, max: 30 }, // Real incident, not noise
  },

  deployCorrelated: {
    description: 'Deploy followed by alert storm',
    deploy: makeDeployEvent({ service: 'payment-api', pr_title: 'feat: add retry logic' }),
    alerts: Array.from({ length: 8 }, () => makeCanonicalAlert({
      service: 'payment-api',
      severity: 'high',
    })),
    deployToAlertGapMs: 2 * 60 * 1000, // 2 minutes after deploy
    expectedNoiseScore: { min: 50, max: 85 }, // Deploy correlation boosts noise score
  },
};
```

---

## Section 10: TDD Implementation Order

### 10.1 Bootstrap Sequence

The test infrastructure itself must be built before any product code. This is the order:

```
Phase 0: Test Infrastructure (Week 0)
├── 0.1 vitest config + TypeScript setup
├── 0.2 Testcontainers helper (Redis, DynamoDB Local, TimescaleDB)
├── 0.3 LocalStack helper (SQS, S3, API Gateway)
├── 0.4 Fixture loader utility
├── 0.5 Factory functions (makeCanonicalAlert, makeIncident, makeDeployEvent)
├── 0.6 WireMock Slack stub
└── 0.7 CI pipeline with test stages
```

### 10.2 Epic-by-Epic TDD Order

```
Phase 1: Webhook Ingestion (Epic 1) — Tests First
├── 1.1 RED: HMAC validator tests (all providers)
├── 1.2 GREEN: Implement HMAC validation
├── 1.3 RED: Datadog parser tests (single + batch)
├── 1.4 GREEN: Implement Datadog parser
├── 1.5 RED: PagerDuty parser tests
├── 1.6 GREEN: Implement PagerDuty parser
├── 1.7 RED: Fingerprint generator tests
├── 1.8 GREEN: Implement fingerprinting
├── 1.9 INTEGRATION: Lambda → SQS contract test
└── 1.10 REFACTOR: Extract provider parser interface

Phase 2: Correlation Engine (Epic 2) — Tests First
├── 2.1 RED: Time-window open/close/extend tests
├── 2.2 GREEN: Implement window manager
├── 2.3 RED: Service graph correlation tests
├── 2.4 GREEN: Implement dependency traversal
├── 2.5 RED: Deploy correlation tests
├── 2.6 GREEN: Implement deploy tracker
├── 2.7 INTEGRATION: Correlation → Redis window tests
├── 2.8 INTEGRATION: Correlation → DynamoDB incident persistence
└── 2.9 INTEGRATION: Correlation → TimescaleDB trend writes

Phase 3: Noise Analysis (Epic 3) — Tests First
├── 3.1 RED: Rule-based noise scoring tests (all rules)
├── 3.2 GREEN: Implement scorer
├── 3.3 RED: Threshold classification tests
├── 3.4 GREEN: Implement classifier
├── 3.5 RED: "What would have
happened" calculation tests
├── 3.6 GREEN: Implement historical analysis
└── 3.7 REFACTOR: Extract scoring rules into configurable pipeline

Phase 4: Notifications (Epic 4) — Integration Tests Lead
├── 4.1 Implement Slack block formatter
├── 4.2 RED: Snapshot tests for all message formats
├── 4.3 INTEGRATION: Notification → Slack (WireMock)
├── 4.4 RED: Rate limiting tests
└── 4.5 GREEN: Implement rate limiter

Phase 5: Governance (Epic 10) — Tests First
├── 5.1 RED: Strict/audit mode enforcement tests
├── 5.2 GREEN: Implement policy engine
├── 5.3 RED: Panic mode tests (<1s activation)
├── 5.4 GREEN: Implement panic mode
├── 5.5 RED: Circuit breaker + DLQ replay tests
├── 5.6 GREEN: Implement circuit breaker
├── 5.7 RED: OTEL span assertion tests
└── 5.8 GREEN: Instrument all components

Phase 6: E2E Validation
├── 6.1 60-second TTV journey
├── 6.2 Alert storm correlation journey
├── 6.3 Deploy correlation journey
├── 6.4 Panic mode journey
└── 6.5 Performance benchmarks
```

### 10.3 "Never Ship Without" Checklist

Before any release, these tests must pass:

- [ ] All HMAC validation tests (security gate)
- [ ] All correlation window tests (correctness gate)
- [ ] All noise scoring tests (safety gate — never eat real alerts)
- [ ] All governance policy tests (compliance gate)
- [ ] Circuit breaker DLQ replay test (safety net gate)
- [ ] 60-second TTV E2E journey (product promise gate)
- [ ] PII protection span tests (privacy gate)
- [ ] Schema migration lint (no breaking changes)
- [ ] Coverage ≥80% overall, ≥90% on scoring engine

---

*End of dd0c/alert Test Architecture*

---

## 11. Review Remediation Addendum (Post-Gemini Review)

### 11.1 Missing Epic Coverage

#### Epic 6: Dashboard API

```typescript
describe('Dashboard API', () => {
  describe('Authentication', () => {
    it('returns 401 for missing Cognito JWT', async () => {});
    it('returns 401 for expired JWT', async () => {});
    it('returns 401 for JWT signed by wrong issuer', async () => {});
    it('extracts tenantId from JWT claims', async () => {});
  });

  describe('Incident Listing (GET /v1/incidents)', () => {
    it('returns paginated incidents for authenticated tenant', async () => {});
    it('supports cursor-based pagination', async () => {});
    it('filters by status (open, acknowledged, resolved)', async () => {});
    it('filters by severity (critical, warning, info)', async () => {});
    it('filters by time range (since, until)', async () => {});
    it('returns empty array for tenant with no incidents', async () => {});
  });

  describe('Incident Detail (GET /v1/incidents/:id)', () => {
    it('returns full incident with correlated alerts', async () => {});
    it('returns 404 for incident belonging to different tenant', async () => {});
    it('includes timeline of state transitions', async () => {});
  });

  describe('Analytics (GET /v1/analytics)', () => {
    it('returns MTTR for last 7/30/90 days', async () => {});
    it('returns alert volume by source', async () => {});
    it('returns noise reduction percentage', async () => {});
    it('scopes all analytics to authenticated tenant', async () => {});
  });

  describe('Tenant Isolation', () => {
    it('tenant A cannot read tenant B incidents via API', async () => {});
    it('tenant A cannot read tenant B analytics', async () => {});
    it('all DynamoDB queries include tenantId partition key', async () => {});
  });
});
```

#### Epic 7: Dashboard UI (Playwright)

```typescript
// tests/e2e/ui/dashboard.spec.ts
import { test, expect } from '@playwright/test';

test('login redirects to Cognito hosted UI', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page).toHaveURL(/cognito/);
});

test('incident list renders with correct severity
badges', async ({ page }) => {
  await page.goto('/dashboard/incidents');
  await expect(page.locator('[data-testid="incident-card"]')).toHaveCount(5);
  await expect(page.locator('.severity-critical')).toBeVisible();
});

test('incident detail shows correlated alert timeline', async ({ page }) => {
  await page.goto('/dashboard/incidents/inc-123');
  await expect(page.locator('[data-testid="alert-timeline"]')).toBeVisible();
  // Playwright has no "count greater than" matcher, so poll the locator count instead
  await expect.poll(() => page.locator('.timeline-event').count()).toBeGreaterThan(1);
});

test('MTTR chart renders with real data', async ({ page }) => {
  await page.goto('/dashboard/analytics');
  await expect(page.locator('[data-testid="mttr-chart"]')).toBeVisible();
});

test('noise reduction percentage displays correctly', async ({ page }) => {
  await page.goto('/dashboard/analytics');
  const noise = page.locator('[data-testid="noise-reduction"]');
  await expect(noise).toContainText('%');
});

test('webhook setup wizard generates correct URL', async ({ page }) => {
  await page.goto('/dashboard/settings/integrations');
  await page.click('[data-testid="add-datadog"]');
  const url = await page.locator('[data-testid="webhook-url"]').textContent();
  expect(url).toMatch(/\/v1\/webhooks\/ingest\/.+/);
});
```

#### Epic 9: Onboarding & PLG

```typescript
describe('Free Tier Enforcement', () => {
  it('allows up to 10,000 alerts/month on free tier', async () => {});
  it('returns 429 with upgrade prompt at 10,001st alert', async () => {});
  it('resets counter on first of each month', async () => {});
  it('purges alert data older than 7 days on free tier', async () => {});
  it('retains alert data for 90 days on pro tier', async () => {});
});

describe('OAuth Signup', () => {
  it('creates tenant record on first Cognito login', async () => {});
  it('assigns free tier by default', async () => {});
  it('generates unique webhook URL per tenant', async () => {});
});

describe('Stripe Integration', () => {
  it('creates checkout session with correct pricing', async () => {});
  it('upgrades tenant on
checkout.session.completed webhook', async () => {});
  it('downgrades tenant on customer.subscription.deleted webhook', async () => {});
  it('validates Stripe webhook signature', async () => {});
});
```

#### Epic 5.3: Slack Feedback Endpoint

```typescript
describe('Slack Interactive Actions Endpoint', () => {
  it('validates Slack request signature (HMAC-SHA256)', async () => {});
  it('rejects request with invalid signature', async () => {});
  it('handles "helpful" feedback — updates incident quality score', async () => {});
  it('handles "noise" feedback — adds to suppression training data', async () => {});
  it('handles "escalate" action — triggers PagerDuty/OpsGenie', async () => {});
  it('updates original Slack message after action', async () => {});
  it('scopes action to correct tenant', async () => {});
});
```

#### Epic 1.4: S3 Raw Payload Archival

```typescript
describe('Raw Payload Archival', () => {
  it('saves raw webhook payload to S3 asynchronously', async () => {});
  it('S3 key includes tenantId, source, and timestamp', async () => {});
  it('archival failure does not block alert processing', async () => {});
  it('archived payload is retrievable for replay', async () => {});
  it('S3 lifecycle policy deletes after retention period', async () => {});
});
```

### 11.2 Anti-Pattern Fixes

#### Replace ioredis-mock with WindowStore Interface

```typescript
// BEFORE (anti-pattern):
// import RedisMock from 'ioredis-mock';
// const engine = new CorrelationEngine(new RedisMock());

// AFTER (correct):
interface WindowStore {
  addEvent(tenantId: string, key: string, event: Alert, ttlMs: number): Promise<void>;
  getWindow(tenantId: string, key: string): Promise<Alert[]>;
  clearWindow(tenantId: string, key: string): Promise<void>;
}

class InMemoryWindowStore implements WindowStore {
  private store = new Map<string, { events: Alert[]; expiresAt: number }>();

  async addEvent(tenantId: string, key: string, event: Alert, ttlMs: number) {
    const fullKey = `${tenantId}:${key}`;
    const existing = this.store.get(fullKey) || { events: [], expiresAt: Date.now() + ttlMs };
    existing.events.push(event);
    this.store.set(fullKey, existing);
  }

  async getWindow(tenantId: string, key: string): Promise<Alert[]> {
    const fullKey = `${tenantId}:${key}`;
    const entry = this.store.get(fullKey);
    if (!entry || entry.expiresAt < Date.now()) return [];
    return entry.events;
  }

  // Required by the WindowStore interface
  async clearWindow(tenantId: string, key: string): Promise<void> {
    this.store.delete(`${tenantId}:${key}`);
  }
}

// Unit tests use InMemoryWindowStore — no Redis dependency
// Integration tests use RedisWindowStore with Testcontainers
```

#### Replace sinon.useFakeTimers with Clock Interface

```typescript
// BEFORE (anti-pattern):
// sinon.useFakeTimers(new Date('2026-03-01T00:00:00Z'));

// AFTER (correct):
interface Clock {
  now(): number;
  advanceBy(ms: number): void;
}

class FakeClock implements Clock {
  private current: number;
  constructor(start: Date = new Date()) { this.current = start.getTime(); }
  now() { return this.current; }
  advanceBy(ms: number) { this.current += ms; }
}

class SystemClock implements Clock {
  now() { return Date.now(); }
  advanceBy() { throw new Error('Cannot advance system clock'); }
}

// Inject into CorrelationEngine:
const engine = new CorrelationEngine(new InMemoryWindowStore(), new FakeClock());
```

### 11.3 Trace Context Propagation Tests

```typescript
describe('Trace Context Propagation', () => {
  it('API Gateway passes trace_id to Lambda via X-Amzn-Trace-Id', async () => {});

  it('Lambda propagates trace_id into SQS message attributes', async () => {
    // Verify SQS message has MessageAttribute 'traceparent' with W3C format
    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    expect(msg.MessageAttributes.traceparent).toBeDefined();
    expect(msg.MessageAttributes.traceparent.StringValue).toMatch(
      /^00-[0-9a-f]{32}-[0-9a-f]{16}-0[01]$/
    );
  });

  it('ECS Correlation Engine extracts trace_id from SQS message', async () => {
    // Verify the correlation span has the correct parent from SQS
    const spans = inMemoryExporter.getFinishedSpans();
    const correlationSpan = spans.find(s => s.name === 'alert.correlation');
    const ingestSpan = spans.find(s => s.name === 'webhook.ingest');
    expect(correlationSpan.parentSpanId).toBeDefined();
    // Parent chain must trace back to the original ingest span;
    // sharing its trace ID is the minimum condition for that
    expect(correlationSpan.spanContext().traceId).toBe(ingestSpan.spanContext().traceId);
  });

  it('end-to-end trace spans webhook → SQS → correlation → notification', async () => {
    // Fire a webhook, wait for Slack notification, verify all spans share trace_id
    const traceId = await fireWebhookAndGetTraceId();
    const spans = await getSpansByTraceId(traceId);
    const spanNames = spans.map(s => s.name);
    expect(spanNames).toContain('webhook.ingest');
    expect(spanNames).toContain('alert.normalize');
    expect(spanNames).toContain('alert.correlation');
    expect(spanNames).toContain('notification.slack');
  });
});
```

### 11.4 HMAC Security Hardening

```typescript
describe('HMAC Signature Validation (Hardened)', () => {
  it('uses crypto.timingSafeEqual, not === comparison', () => {
    // Inspect the source to verify timing-safe comparison
    const source = fs.readFileSync('src/ingestion/hmac.ts', 'utf8');
    expect(source).toContain('timingSafeEqual');
    expect(source).not.toMatch(/signature\s*===\s*/);
  });

  it('handles case-insensitive header names (dd-webhook-signature vs DD-WEBHOOK-SIGNATURE)', async () => {
    const payload = makeAlertPayload('datadog');
    const sig = computeHMAC(payload, DATADOG_SECRET);

    // Lowercase header
    const resp1 = await ingest(payload, { 'dd-webhook-signature': sig });
    expect(resp1.status).toBe(200);

    // Uppercase header
    const resp2 = await ingest(payload, { 'DD-WEBHOOK-SIGNATURE': sig });
    expect(resp2.status).toBe(200);
  });

  it('rejects completely missing signature header', async () => {
    const resp = await ingest(makeAlertPayload('datadog'), {});
    expect(resp.status).toBe(401);
  });

  it('rejects empty signature header', async () => {
    const resp = await ingest(makeAlertPayload('datadog'), { 'dd-webhook-signature': '' });
    expect(resp.status).toBe(401);
  });
});
```

### 11.5 SQS 256KB Payload Limit

```typescript
describe('Large Payload Handling', () => {
  it('compresses payloads >200KB before sending to SQS', async () => {
    const largePayload =
      makeLargeAlertPayload(300 * 1024); // 300KB
    const resp = await ingest(largePayload);
    expect(resp.status).toBe(200);

    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    // Payload must be compressed or use S3 pointer.
    // The SQS limit is in bytes, so measure byte length rather than string length.
    expect(Buffer.byteLength(msg.Body)).toBeLessThan(256 * 1024);
  });

  it('uses S3 pointer for payloads >256KB after compression', async () => {
    const hugePayload = makeLargeAlertPayload(500 * 1024); // 500KB
    const resp = await ingest(hugePayload);
    expect(resp.status).toBe(200);

    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    const body = JSON.parse(msg.Body);
    expect(body.s3Pointer).toBeDefined();
    expect(body.s3Pointer).toMatch(/^s3:\/\/dd0c-alert-overflow\//);
  });

  it('strips unnecessary fields from Datadog payload before SQS', async () => {
    const payload = makeDatadogPayloadWithLargeTags(100); // 100 tags
    const resp = await ingest(payload);
    expect(resp.status).toBe(200);

    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    const normalized = JSON.parse(msg.Body);
    // Only essential fields should remain
    expect(normalized.tags.length).toBeLessThanOrEqual(20);
  });

  it('rejects payloads >2MB at API Gateway level', async () => {
    const massive = makeLargeAlertPayload(3 * 1024 * 1024);
    const resp = await ingest(massive);
    expect(resp.status).toBe(413);
  });
});
```

### 11.6 DLQ Backpressure & Replay

```typescript
describe('DLQ Replay with Backpressure', () => {
  it('replays DLQ messages in batches of 100', async () => {
    await seedDLQ(10000); // 10K messages
    const replayer = new DLQReplayer({ batchSize: 100, delayBetweenBatchesMs: 500 });
    await replayer.start();

    // Verify batched processing
    expect(replayer.batchesProcessed).toBeGreaterThan(0);
    expect(replayer.maxConcurrentMessages).toBeLessThanOrEqual(100);
  });

  it('pauses replay if correlation engine error rate exceeds 10%', async () => {
    await seedDLQ(1000);
    const replayer = new DLQReplayer({ batchSize: 100, errorThreshold: 0.1 });

    // Simulate correlation engine returning errors
    mockCorrelationEngine.failRate = 0.15;
    await replayer.start();

    expect(replayer.state).toBe('paused');
    expect(replayer.pauseReason).toContain('error rate exceeded');
  });

  it('does not replay if circuit breaker is currently tripped', async () => {
    await seedDLQ(100);
    await tripCircuitBreaker();

    const replayer = new DLQReplayer();
    await replayer.start();

    expect(replayer.messagesReplayed).toBe(0);
    expect(replayer.state).toBe('blocked_by_circuit_breaker');
  });

  it('tracks replay progress for resumability', async () => {
    await seedDLQ(500);
    const replayer = new DLQReplayer({ batchSize: 50 });

    // Process 3 batches then stop
    await replayer.processNBatches(3);
    expect(replayer.checkpoint).toBe(150);

    // Resume from checkpoint
    const replayer2 = new DLQReplayer({ resumeFrom: replayer.checkpoint });
    await replayer2.start();
    expect(replayer2.startedFrom).toBe(150);
  });
});
```

### 11.7 Multi-Tenancy Isolation (DynamoDB)

```typescript
describe('DynamoDB Tenant Isolation', () => {
  it('all DAO methods require tenantId parameter', () => {
    // Source-level check: every public DAO method takes tenantId as its first param
    const daoSource = fs.readFileSync('src/data/incident-dao.ts', 'utf8');
    const methods = extractPublicMethods(daoSource);
    for (const method of methods) {
      expect(method.params[0].name).toBe('tenantId');
    }
  });

  it('query for tenant A returns zero results for tenant B data', async () => {
    const dao = new IncidentDAO(dynamoClient);
    await dao.create('tenant-A', makeIncident());
    await dao.create('tenant-B', makeIncident());

    const results = await dao.list('tenant-A');
    expect(results.every(r => r.tenantId === 'tenant-A')).toBe(true);
  });

  it('partition key always includes tenantId prefix', async () => {
    const dao = new IncidentDAO(dynamoClient);
    await dao.create('tenant-X', makeIncident());

    // Read raw DynamoDB item
    const item = await dynamoClient.scan({ TableName: 'dd0c-alert-main' });
    // toMatch with an anchored regex is built in; toStartWith is not a vitest matcher
    expect(item.Items[0].PK.S).toMatch(/^TENANT#tenant-X/);
  });
});
```

### 11.8 Slack Circuit Breaker
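
The tests in this section assume a breaker that moves between closed, open, and half-open states. A minimal sketch of that state machine follows. `CircuitBreaker`, `recordFailure`, `recordSuccess`, and `currentState` are hypothetical names chosen for illustration, not the project's `SlackClient` API, and message queueing is left out. The default threshold of 10 failures and the 60-second cooldown are taken from the test names.

```typescript
// Hypothetical sketch: CircuitBreaker and its method names are illustrative,
// not the project's actual SlackClient API. Queueing is intentionally omitted.
type CircuitState = 'closed' | 'open' | 'half-open';

interface Clock {
  now(): number;
}

class CircuitBreaker {
  private readonly clock: Clock;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(clock: Clock, failureThreshold = 10, cooldownMs = 60000) {
    this.clock = clock;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // State is derived from the clock on every read, so no background timer is needed.
  currentState(): CircuitState {
    if (this.openedAt === null) return 'closed';
    const elapsed = this.clock.now() - this.openedAt;
    return elapsed >= this.cooldownMs ? 'half-open' : 'open';
  }

  recordFailure(): void {
    if (this.currentState() === 'half-open') {
      // A failed probe restarts the cooldown.
      this.openedAt = this.clock.now();
      return;
    }
    this.consecutiveFailures += 1;
    if (this.openedAt === null && this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.clock.now();
    }
  }

  recordSuccess(): void {
    // Any success, including a half-open probe, closes the circuit.
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }
}
```

Because the state is computed from an injected clock, a test can assert the half-open transition by advancing a fake clock past the cooldown instead of waiting 60 real seconds.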

```typescript
describe('Slack Notification Circuit Breaker', () => {
  it('opens circuit after 10 consecutive 429s from Slack', async () => {
    const slackClient = new SlackClient({ circuitBreakerThreshold: 10 });
    for (let i = 0; i < 10; i++) {
      mockSlack.respondWith(429);
      await slackClient.send(makeMessage()).catch(() => {});
    }
    expect(slackClient.circuitState).toBe('open');
  });

  it('queues notifications while circuit is open', async () => {
    slackClient.openCircuit();
    await slackClient.send(makeMessage());
    expect(slackClient.queuedMessages).toBe(1);
  });

  it('half-opens circuit after 60 seconds', async () => {
    slackClient.openCircuit();
    clock.advanceBy(61000);
    expect(slackClient.circuitState).toBe('half-open');
  });

  it('drains queue on successful half-open probe', async () => {
    slackClient.openCircuit();
    slackClient.queue(makeMessage());
    slackClient.queue(makeMessage());
    clock.advanceBy(61000);
    mockSlack.respondWith(200);
    await slackClient.probe();
    expect(slackClient.circuitState).toBe('closed');
    expect(slackClient.queuedMessages).toBe(0);
  });
});
```

### 11.9 Updated Test Pyramid (Post-Review)

| Level | Original | Revised | Rationale |
|-------|----------|---------|-----------|
| Unit | 70% (~140) | 65% (~180) | More tests total, but integration share grows |
| Integration | 20% (~40) | 25% (~70) | Dashboard API, tenant isolation, trace propagation |
| E2E | 10% (~20) | 10% (~28) | Dashboard UI (Playwright), onboarding flow |

*End of P3 Review Remediation Addendum*

---

## 12. BMad Review Implementation (Must-Have Before Launch)

### 12.1 HMAC Timestamp Freshness (Replay Attack Prevention)

```typescript
describe('HMAC Replay Attack Prevention', () => {
  it('rejects Datadog webhook with timestamp older than 5 minutes', async () => {
    const payload = makeDatadogPayload();
    const staleTimestamp = Math.floor(Date.now() / 1000) - 301; // 5min + 1s
    const sig = computeDatadogHMAC(payload, staleTimestamp);

    const resp = await ingest(payload, {
      'dd-webhook-timestamp': staleTimestamp.toString(),
      'dd-webhook-signature': sig,
    });
    expect(resp.status).toBe(401);
    expect(resp.body.error).toContain('stale timestamp');
  });

  it('rejects PagerDuty webhook with missing timestamp', async () => {
    const payload = makePagerDutyPayload();
    const sig = computePagerDutyHMAC(payload);

    const resp = await ingest(payload, {
      'x-pagerduty-signature': sig,
      // No timestamp header
    });
    expect(resp.status).toBe(401);
  });

  it('rejects OpsGenie webhook replayed after 5 minutes', async () => {
    // OpsGenie doesn't always package timestamp cleanly
    // Must extract from payload body and validate
    const payload = makeOpsGeniePayload({ timestamp: fiveMinutesAgo() });
    const sig = computeOpsGenieHMAC(payload);

    const resp = await ingest(payload, { 'x-opsgenie-signature': sig });
    expect(resp.status).toBe(401);
  });

  it('accepts fresh webhook within 5-minute window', async () => {
    const payload = makeDatadogPayload();
    const freshTimestamp = Math.floor(Date.now() / 1000);
    const sig = computeDatadogHMAC(payload, freshTimestamp);

    const resp = await ingest(payload, {
      'dd-webhook-timestamp': freshTimestamp.toString(),
      'dd-webhook-signature': sig,
    });
    expect(resp.status).toBe(200);
  });
});
```

### 12.2 Cross-Tenant Negative Isolation Tests

```typescript
describe('DynamoDB Tenant Isolation (Negative Tests)', () => {
  it('Tenant A cannot read Tenant B incidents', async () => {
    // Seed data for both tenants
    await createIncident('tenant-a', { title: 'A incident' });
    await createIncident('tenant-b', { title: 'B
incident' }); // Query as Tenant A const results = await dao.listIncidents('tenant-a'); // Explicitly assert Tenant B data is absent const tenantIds = results.map(r => r.tenantId); expect(tenantIds).not.toContain('tenant-b'); expect(results.every(r => r.tenantId === 'tenant-a')).toBe(true); }); it('Tenant A cannot read Tenant B analytics', async () => { await seedAnalytics('tenant-a', { alertCount: 100 }); await seedAnalytics('tenant-b', { alertCount: 200 }); const analytics = await dao.getAnalytics('tenant-a'); expect(analytics.alertCount).toBe(100); // Not 300 (combined) }); it('API returns 404 (not 403) for cross-tenant incident access', async () => { const incident = await createIncident('tenant-b', { title: 'secret' }); const resp = await api.get(`/v1/incidents/${incident.id}`) .set('Authorization', `Bearer ${tenantAToken}`); // 404 not 403 — don't leak existence expect(resp.status).toBe(404); }); }); ``` ### 12.3 Correlation Window Edge Cases ```typescript describe('Out-of-Order Alert Delivery', () => { it('late alert attaches to existing incident (not duplicate)', async () => { const clock = new FakeClock(); const engine = new CorrelationEngine(new InMemoryWindowStore(), clock); // Alert 1 arrives at T=0 const alert1 = makeAlert({ service: 'auth', fingerprint: 'cpu-high', timestamp: 0 }); const incident1 = await engine.process(alert1); // Window closes at T=5min, incident shipped clock.advanceBy(5 * 60 * 1000); await engine.flushWindows(); // Late alert arrives at T=6min with timestamp T=2min (within original window) const lateAlert = makeAlert({ service: 'auth', fingerprint: 'cpu-high', timestamp: 2 * 60 * 1000 }); const result = await engine.process(lateAlert); // Must attach to existing incident, not create new one expect(result.incidentId).toBe(incident1.incidentId); expect(result.action).toBe('attached_to_existing'); }); it('very late alert (>2x window) creates new incident', async () => { const clock = new FakeClock(); const engine = new 
CorrelationEngine(new InMemoryWindowStore(), clock); const alert1 = makeAlert({ service: 'auth', fingerprint: 'cpu-high' }); const incident1 = await engine.process(alert1); // 15 minutes later (3x the 5-min window) clock.advanceBy(15 * 60 * 1000); const lateAlert = makeAlert({ service: 'auth', fingerprint: 'cpu-high' }); const result = await engine.process(lateAlert); expect(result.incidentId).not.toBe(incident1.incidentId); expect(result.action).toBe('new_incident'); }); }); ``` ### 12.4 SQS Claim-Check Round-Trip ```typescript describe('SQS 256KB Claim-Check End-to-End', () => { it('large payload round-trips through S3 pointer', async () => { const largePayload = makeLargeAlertPayload(300 * 1024); // 300KB // Ingestion compresses and stores in S3 const resp = await ingest(largePayload); expect(resp.status).toBe(200); // SQS message contains S3 pointer const sqsMsg = await getLastSQSMessage(localstack, 'alert-queue'); const body = JSON.parse(sqsMsg.Body); expect(body.s3Pointer).toBeDefined(); // Correlation engine fetches from S3 and processes const incident = await waitForIncidentCreated(5000); expect(incident).toBeDefined(); expect(incident.sourceAlertCount).toBeGreaterThan(0); }); it('S3 fetch timeout does not crash correlation engine', async () => { // Inject S3 latency (10 second delay) mockS3.setLatency(10000); const largePayload = makeLargeAlertPayload(300 * 1024); await ingest(largePayload); // Correlation engine should timeout and send to DLQ const dlqMsg = await getDLQMessage(localstack, 'alert-dlq', 15000); expect(dlqMsg).toBeDefined(); // Engine is still healthy const health = await api.get('/health'); expect(health.status).toBe(200); }); }); ``` ### 12.5 Free Tier Enforcement ```typescript describe('Free Tier (10K alerts/month, 7-day retention)', () => { it('accepts alert at 9,999 count', async () => { await setAlertCounter('tenant-free', 9999); const resp = await ingestAsTenat('tenant-free', makeAlert()); expect(resp.status).toBe(200); }); 
it('rejects alert at 10,001 with upgrade prompt', async () => { await setAlertCounter('tenant-free', 10000); const resp = await ingestAsTenant('tenant-free', makeAlert()); expect(resp.status).toBe(429); expect(resp.body.upgrade_url).toContain('stripe'); }); it('counter resets on first of month', async () => { await setAlertCounter('tenant-free', 10000); clock.advanceToFirstOfNextMonth(); await runMonthlyReset(); const resp = await ingestAsTenant('tenant-free', makeAlert()); expect(resp.status).toBe(200); }); it('purges data older than 7 days on free tier', async () => { await createIncident('tenant-free', { createdAt: eightDaysAgo() }); await runRetentionPurge(); const incidents = await dao.listIncidents('tenant-free'); expect(incidents).toHaveLength(0); }); it('retains data for 90 days on pro tier', async () => { await createIncident('tenant-pro', { createdAt: thirtyDaysAgo() }); await runRetentionPurge(); const incidents = await dao.listIncidents('tenant-pro'); expect(incidents).toHaveLength(1); }); }); ``` *End of P3 BMad Implementation*
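**Appendix sketch for 12.1.** The replay-protection tests in 12.1 describe behavior only. As a rough illustration of what could sit behind them, here is a provider-agnostic freshness check. Everything in it is an assumption rather than the actual dd0c/alert validator: the names `signWebhook` and `verifyWebhook`, the `"<timestamp>.<body>"` signing format, and the hex-encoded SHA-256 digest are hypothetical, and real providers each define their own header names and signing schemes.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumption: the same 5-minute tolerance the 12.1 tests use.
const MAX_SKEW_SECONDS = 300;

type Verdict =
  | { ok: true }
  | { ok: false; reason: 'missing timestamp' | 'stale timestamp' | 'bad signature' };

// Hypothetical signing scheme: HMAC-SHA256 over "<timestamp>.<body>", hex-encoded.
// Binding the timestamp into the signed string stops an attacker from
// replaying an old body under a fresh timestamp header.
function signWebhook(secret: string, timestamp: number, body: string): string {
  return createHmac('sha256', secret).update(`${timestamp}.${body}`).digest('hex');
}

function verifyWebhook(
  secret: string,
  body: string,
  timestampHeader: string | undefined,
  signatureHeader: string | undefined,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): Verdict {
  // 1. A missing or non-numeric timestamp is rejected outright.
  if (!timestampHeader) return { ok: false, reason: 'missing timestamp' };
  const timestamp = Number(timestampHeader);
  if (!Number.isInteger(timestamp)) return { ok: false, reason: 'missing timestamp' };

  // 2. Reject anything outside the freshness window (past or future).
  if (Math.abs(nowSeconds - timestamp) > MAX_SKEW_SECONDS) {
    return { ok: false, reason: 'stale timestamp' };
  }

  // 3. Compare signatures in constant time. timingSafeEqual throws on
  //    length mismatch, so the lengths are checked first.
  const expected = Buffer.from(signWebhook(secret, timestamp, body), 'hex');
  const received = Buffer.from(signatureHeader ?? '', 'hex');
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    return { ok: false, reason: 'bad signature' };
  }

  return { ok: true };
}
```

Passing `nowSeconds` explicitly keeps the check deterministic under a fake clock, in the same spirit as the `FakeClock` used in 12.3. Using `Math.abs` also rejects timestamps far in the future; whether that is desirable depends on how much clock skew the providers exhibit.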