dd0c/alert — Test Architecture & TDD Strategy

Product: dd0c/alert — Alert Intelligence Platform
Author: Test Architecture Phase
Date: February 28, 2026
Status: V1 MVP — Solo Founder Scope


Section 1: Testing Philosophy & TDD Workflow

1.1 Core Philosophy

dd0c/alert is a safety-critical observability tool — a bug that silently suppresses a real alert during an incident is worse than having no tool at all. The test suite is the contract that guarantees "we will never eat your alerts."

Guiding principle: tests describe observable behavior from the on-call engineer's perspective. If a test can't be explained as "when X happens, the engineer sees Y," it's testing implementation, not behavior.

For a solo founder, the test suite is also the regression safety net — it catches the subtle scoring bugs that would erode customer trust over weeks.

1.2 Red-Green-Refactor Adapted to dd0c/alert

RED   → Write a failing test that describes the desired behavior
         (e.g., "3 Datadog alerts for the same service within 5 minutes
          should produce 1 correlated incident")

GREEN → Write the minimum code to make it pass
         (hardcode the window, just make it work)

REFACTOR → Clean up without breaking tests
            (extract the window manager, add Redis backing,
             optimize the fingerprinting)

When to write tests first (strict TDD):

  • All correlation logic (time-window clustering, service graph traversal, deploy correlation)
  • All noise scoring algorithms (rule-based scoring, threshold calculations)
  • All HMAC signature validation (security-critical)
  • All fingerprinting/deduplication logic
  • All suppression governance (strict vs. audit mode)
  • All circuit breaker state transitions (suppression DLQ replay)

When integration tests lead (test-after, then harden):

  • Provider webhook parsers — implement against real payload samples, then lock in with contract tests
  • SQS FIFO message ordering — test against LocalStack after implementation
  • Slack message formatting — build the blocks, then snapshot test the output

When E2E tests lead:

  • The 60-second time-to-value journey — define the happy path first, build backward
  • Weekly noise digest generation — define expected output, then build the aggregation

1.3 Test Naming Conventions

// Unit tests (vitest)
describe('CorrelationEngine', () => {
  it('groups alerts for same service within 5min window into single incident', () => {});
  it('extends window by 2min when alert arrives in last 30 seconds', () => {});
  it('caps window extension at 15 minutes total', () => {});
  it('merges downstream service alerts when upstream window is active', () => {});
});

describe('NoiseScorer', () => {
  it('scores deploy-correlated alerts higher when deploy is within 10min', () => {});
  it('returns zero noise score for first-ever alert from a service', () => {});
  it('adds 5 points when PR title matches config or feature-flag', () => {});
});

describe('HmacValidator', () => {
  it('rejects Datadog webhook with missing DD-WEBHOOK-SIGNATURE header', () => {});
  it('rejects PagerDuty webhook with tampered body', () => {});
  it('accepts valid signature and passes payload through', () => {});
});

Rules:

  • Describe the observable outcome, not the internal mechanism
  • Use present tense ("groups", "rejects", "scores")
  • If you need "and" in the name, split into two tests
  • Group by component in describe blocks

Section 2: Test Pyramid

2.1 Ratio

| Level | Target | Count (V1) | Runtime |
|---|---|---|---|
| Unit | 70% | ~350 tests | <30s |
| Integration | 20% | ~100 tests | <5min |
| E2E/Smoke | 10% | ~20 tests | <10min |

2.2 Unit Test Targets (per component)

| Component | Key Behaviors | Est. Tests |
|---|---|---|
| Webhook Parsers (Datadog, PD, OpsGenie, Grafana) | Payload normalization, field mapping, batch handling | 60 |
| HMAC Validator | Signature verification per provider, rejection paths | 20 |
| Fingerprint Generator | Deterministic hashing, dedup detection | 15 |
| Correlation Engine | Time-window open/close/extend, service graph merge, deploy correlation | 80 |
| Noise Scorer | Rule-based scoring, deploy proximity weighting, threshold calculations | 60 |
| Suggestion Engine | Suppression recommendations, "what would have happened" calculations | 30 |
| Notification Formatter | Slack block formatting, digest generation, in-place message updates | 25 |
| Governance Policy | Strict/audit mode enforcement, panic mode, per-customer overrides | 30 |
| Feature Flags | Circuit breaker on suppression volume, flag lifecycle | 15 |
| Canonical Schema Mapper | Provider → canonical field mapping, severity normalization | 15 |

2.3 Integration Test Boundaries

| Boundary | What's Tested | Infrastructure |
|---|---|---|
| Lambda → SQS FIFO | Message ordering, dedup, tenant partitioning | LocalStack |
| SQS → Correlation Engine | Consumer polling, batch processing, error handling | LocalStack |
| Correlation Engine → Redis | Window CRUD, sorted set operations, TTL expiry | Testcontainers Redis |
| Correlation Engine → DynamoDB | Incident persistence, tenant config reads | Testcontainers DynamoDB Local |
| Correlation Engine → TimescaleDB | Time-series writes, continuous aggregate queries | Testcontainers PostgreSQL + TimescaleDB |
| Notification Service → Slack | Block formatting, rate limiting, message update | WireMock |
| API Gateway → Lambda | Webhook routing, auth, throttling | LocalStack |

2.4 E2E/Smoke Scenarios

  1. 60-Second TTV Journey: Webhook received → alert in Slack within 60s
  2. Alert Storm Correlation: 50 alerts in 2 minutes → grouped into 1 incident
  3. Deploy Correlation: Deploy event + alert storm → deploy identified as trigger
  4. Noise Digest: 7 days of alerts → weekly Slack digest with noise stats
  5. Multi-Provider Merge: Datadog + PagerDuty alerts for same service → single incident
  6. Panic Mode: Enable panic → all suppression stops → alerts pass through raw

Section 3: Unit Test Strategy

3.1 Webhook Parsers

Each provider parser is a pure function: payload in, canonical alert(s) out. No side effects, no DB calls.

// tests/unit/parsers/datadog.test.ts
describe('DatadogParser', () => {
  it('normalizes single alert payload to canonical schema', () => {});
  it('normalizes batched alert array into multiple canonical alerts', () => {});
  it('maps Datadog P1 to critical, P5 to info', () => {});
  it('extracts service name from tags array', () => {});
  it('handles missing optional fields without throwing', () => {});
  it('generates stable fingerprint from title + service + tenant', () => {});
});

// tests/unit/parsers/pagerduty.test.ts
describe('PagerDutyParser', () => {
  it('normalizes incident.triggered event to canonical alert', () => {});
  it('normalizes incident.resolved event with resolution metadata', () => {});
  it('ignores incident.acknowledged events (not alerts)', () => {});
  it('maps PD urgency high to critical, low to info', () => {});
});

// tests/unit/parsers/opsgenie.test.ts
describe('OpsGenieParser', () => {
  it('normalizes alert.created action to canonical alert', () => {});
  it('extracts priority P1-P5 and maps to severity', () => {});
  it('handles custom fields in details object', () => {});
});

// tests/unit/parsers/grafana.test.ts
describe('GrafanaParser', () => {
  it('normalizes Grafana Alertmanager webhook payload', () => {});
  it('handles multiple alerts in single webhook (Grafana batches)', () => {});
  it('extracts dashboard URL as context link', () => {});
});

Mocking strategy: None needed — parsers are pure functions. Use recorded payload fixtures from fixtures/webhooks/{provider}/.

Fixture structure:

fixtures/webhooks/
  datadog/
    single-alert.json
    batched-alerts.json
    monitor-recovered.json
  pagerduty/
    incident-triggered.json
    incident-resolved.json
    incident-acknowledged.json
  opsgenie/
    alert-created.json
    alert-closed.json
  grafana/
    single-firing.json
    multi-firing.json
    resolved.json

3.2 HMAC Validator

describe('HmacValidator', () => {
  // Datadog uses hex-encoded HMAC-SHA256
  it('validates correct Datadog DD-WEBHOOK-SIGNATURE header', () => {});
  it('rejects Datadog webhook with wrong signature', () => {});
  it('rejects Datadog webhook with missing signature header', () => {});

  // PagerDuty uses v1= prefix with HMAC-SHA256
  it('validates correct PagerDuty X-PagerDuty-Signature header', () => {});
  it('rejects PagerDuty webhook with tampered body', () => {});

  // OpsGenie uses different header name
  it('validates correct OpsGenie X-OpsGenie-Signature header', () => {});

  // Edge cases
  it('rejects empty body with any signature', () => {});
  it('uses timing-safe comparison to prevent timing attacks', () => {});
});

Mocking strategy: None — crypto operations are deterministic. Use known secret + body + expected signature triples.
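The validator itself reduces to a few lines of node:crypto. A sketch assuming Datadog-style hex-encoded HMAC-SHA256 (`signBody` and `verifySignature` are illustrative names, not the shipped API):

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Computes the hex-encoded HMAC-SHA256 of a raw request body.
export function signBody(secret: string, rawBody: string): string {
  return createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Constant-time comparison: never use === on signatures (timing attacks).
export function verifySignature(secret: string, rawBody: string, header: string | undefined): boolean {
  if (!header || rawBody.length === 0) return false;
  const expected = Buffer.from(signBody(secret, rawBody), 'utf8');
  const received = Buffer.from(header, 'utf8');
  return expected.length === received.length && timesafeEq(expected, received);
}

function timesafeEq(a: Buffer, b: Buffer): boolean {
  return timingSafeEqual(a, b); // only called when lengths match
}
```

Note the length check before `timingSafeEqual`: the Node API throws on unequal-length buffers, so length must be compared first.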

3.3 Fingerprint Generator

describe('FingerprintGenerator', () => {
  it('generates deterministic SHA-256 from tenant_id + provider + service + title', () => {});
  it('produces same fingerprint for identical alerts regardless of timestamp', () => {});
  it('produces different fingerprints when service differs', () => {});
  it('normalizes title whitespace before hashing', () => {});
  it('handles unicode characters in title consistently', () => {});
});
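A minimal generator satisfying those behaviors might look like this (the field order and the `\x1f` separator are illustrative assumptions, not the shipped format):

```typescript
import { createHash } from 'node:crypto';

// Deterministic alert fingerprint: the timestamp is deliberately excluded so
// repeats of the same alert dedup to the same hash.
export function fingerprint(tenantId: string, provider: string, service: string, title: string): string {
  // Unicode-normalize and collapse whitespace so cosmetic title changes don't break dedup.
  const normalizedTitle = title.normalize('NFC').trim().replace(/\s+/g, ' ');
  return createHash('sha256')
    .update([tenantId, provider, service, normalizedTitle].join('\x1f')) // separator avoids field-boundary collisions
    .digest('hex');
}
```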

3.4 Correlation Engine

The most complex component. Heavy use of table-driven tests.

describe('CorrelationEngine', () => {
  describe('Time-Window Management', () => {
    it('opens new 5min window on first alert for a service', () => {});
    it('adds subsequent alerts to existing open window', () => {});
    it('extends window by 2min when alert arrives in last 30 seconds', () => {});
    it('caps total window duration at 15 minutes', () => {});
    it('closes window after timeout with no new alerts', () => {});
    it('generates incident record when window closes', () => {});
  });

  describe('Service Graph Correlation', () => {
    it('merges downstream alerts into upstream window when dependency exists', () => {});
    it('does not merge alerts for unrelated services', () => {});
    it('handles circular dependencies without infinite loop', () => {});
    it('traverses multi-level dependency chains (A→B→C)', () => {});
  });

  describe('Deploy Correlation', () => {
    it('tags incident with deploy_id when deploy event within 10min of first alert', () => {});
    it('does not correlate deploy older than 10 minutes', () => {});
    it('correlates deploy to correct service even with multiple recent deploys', () => {});
    it('adds deploy correlation score boost to noise calculation', () => {});
  });

  describe('Multi-Tenant Isolation', () => {
    it('never correlates alerts across different tenants', () => {});
    it('maintains separate windows per tenant', () => {});
    it('handles concurrent alerts from multiple tenants', () => {});
  });
});

Mocking strategy:

  • Mock Redis client (ioredis-mock) for window state
  • Mock DynamoDB client for service dependency reads
  • Mock SQS for downstream message publishing
  • Use vitest fake timers (vi.useFakeTimers()) for time-window testing
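The open/extend/cap rules above lend themselves to a pure decision function that fake-timer tests can drive directly, keeping Redis out of the unit layer. A sketch under the constants this document specifies (`decideWindow` and the shape of the window record are illustrative):

```typescript
type WindowDecision = { action: 'keep' | 'extend' | 'new'; closesAt: number };

const WINDOW_MS = 5 * 60_000;       // base 5min window
const EXTEND_MS = 2 * 60_000;       // extension granted near close
const LAST_STRETCH_MS = 30_000;     // "last 30 seconds" trigger
const MAX_WINDOW_MS = 15 * 60_000;  // hard cap on total duration

// Pure decision: given the open window (or none), where should this alert land?
export function decideWindow(now: number, win?: { openedAt: number; closesAt: number }): WindowDecision {
  if (!win || now >= win.closesAt) return { action: 'new', closesAt: now + WINDOW_MS };
  if (win.closesAt - now <= LAST_STRETCH_MS) {
    // Extend by 2min, but never past 15min total window duration.
    const capped = Math.min(win.closesAt + EXTEND_MS, win.openedAt + MAX_WINDOW_MS);
    if (capped > win.closesAt) return { action: 'extend', closesAt: capped };
  }
  return { action: 'keep', closesAt: win.closesAt };
}
```

Because the function takes `now` as a parameter, the fake-timer tests only need to pass advanced clock values rather than patching `Date`.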

3.5 Noise Scorer

describe('NoiseScorer', () => {
  describe('Rule-Based Scoring', () => {
    it('returns 0 for first-ever alert from a service (no history)', () => {});
    it('scores higher when alert has fired >5 times in 24 hours', () => {});
    it('scores higher when alert auto-resolved within 5 minutes', () => {});
    it('adds deploy correlation bonus (+15 points) when deploy is recent', () => {});
    it('adds feature-flag bonus (+5 points) when PR title matches config/feature-flag', () => {});
    it('caps total score at 100', () => {});
    it('never scores critical severity alerts above 80 (safety cap)', () => {});
  });

  describe('Threshold Calculations', () => {
    it('classifies score 0-30 as signal (keep)', () => {});
    it('classifies score 31-70 as review (annotate)', () => {});
    it('classifies score 71-100 as noise (suggest suppress)', () => {});
    it('uses tenant-specific thresholds when configured', () => {});
  });

  describe('What-Would-Have-Happened', () => {
    it('calculates suppression count for historical window', () => {});
    it('reports zero false negatives when no suppressed alert was critical', () => {});
    it('flags false negative when suppressed alert was later escalated', () => {});
  });
});

Mocking strategy: Mock the alert history store (DynamoDB queries). Scorer logic itself is pure calculation.
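The threshold table reduces to a small pure function. A sketch with illustrative names (`classifyScore`, tenant override shape assumed):

```typescript
type Verdict = 'signal' | 'review' | 'noise';

const DEFAULT_THRESHOLDS = { reviewFrom: 31, noiseFrom: 71 };

// Classifies a 0-100 noise score; tenant-specific thresholds fall back to defaults.
export function classifyScore(score: number, t = DEFAULT_THRESHOLDS): Verdict {
  const clamped = Math.max(0, Math.min(100, score)); // scorer caps at 100, but be defensive
  if (clamped >= t.noiseFrom) return 'noise';
  if (clamped >= t.reviewFrom) return 'review';
  return 'signal';
}
```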

3.6 Notification Formatter

describe('NotificationFormatter', () => {
  describe('Slack Blocks', () => {
    it('formats single-alert notification with service, title, severity', () => {});
    it('formats correlated incident with alert count and sources', () => {});
    it('includes deploy trigger when deploy correlation exists', () => {});
    it('includes noise score badge (🟢 signal / 🟡 review / 🔴 noise)', () => {});
    it('includes feedback buttons (👍 Helpful / 👎 Not helpful)', () => {});
    it('formats in-place update message (replaces initial alert)', () => {});
  });

  describe('Weekly Digest', () => {
    it('aggregates 7 days of incidents into summary stats', () => {});
    it('highlights top 3 noisiest services', () => {});
    it('shows suppression savings ("would have saved X pages")', () => {});
  });
});

Mocking strategy: Snapshot tests — render the Slack blocks to JSON and compare against golden fixtures.
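A formatter sketch that produces the JSON those snapshots would capture (the block layout and field names here are illustrative, not the shipped design):

```typescript
type Incident = { service: string; title: string; severity: string; noiseScore: number; alertCount: number };

const badge = (score: number) => (score > 70 ? '🔴 noise' : score > 30 ? '🟡 review' : '🟢 signal');

// Renders a correlated incident to Slack Block Kit JSON; this return value is
// what gets compared against the golden fixture in a snapshot test.
export function renderIncidentBlocks(i: Incident) {
  return [
    { type: 'header', text: { type: 'plain_text', text: `${i.service}: ${i.title}` } },
    {
      type: 'section',
      text: { type: 'mrkdwn', text: `*${i.alertCount} alerts grouped* · severity: ${i.severity} · ${badge(i.noiseScore)}` },
    },
    {
      type: 'actions',
      elements: [
        { type: 'button', text: { type: 'plain_text', text: '👍 Helpful' }, action_id: 'feedback_up' },
        { type: 'button', text: { type: 'plain_text', text: '👎 Not helpful' }, action_id: 'feedback_down' },
      ],
    },
  ];
}
```

Keeping the renderer pure (data in, blocks out) is what makes snapshot testing reliable: no timestamps or random IDs leak into the golden fixtures.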

3.7 Governance Policy Engine

describe('GovernancePolicy', () => {
  describe('Mode Enforcement', () => {
    it('in strict mode: annotates alerts but never suppresses', () => {});
    it('in audit mode: auto-suppresses with full logging', () => {});
    it('defaults new tenants to strict mode', () => {});
  });

  describe('Panic Mode', () => {
    it('when panic=true: all suppression stops immediately', () => {});
    it('when panic=true: all alerts pass through unmodified', () => {});
    it('panic mode activatable via Redis key check', () => {});
    it('panic mode shows banner in dashboard API response', () => {});
  });

  describe('Per-Customer Override', () => {
    it('customer can set stricter mode than system default', () => {});
    it('customer cannot set less restrictive mode than system default', () => {});
    it('merge logic: max_restrictive(system, customer)', () => {});
  });

  describe('Policy Decision Logging', () => {
    it('logs "suppressed by audit mode" with full context', () => {});
    it('logs "annotation-only, strict mode active" for strict tenants', () => {});
    it('logs "panic mode active — all alerts passing through"', () => {});
  });
});

3.8 Feature Flag Circuit Breaker

describe('SuppressionCircuitBreaker', () => {
  it('allows suppression when volume is within baseline', () => {});
  it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
  it('auto-disables the scoring flag when breaker trips', () => {});
  it('replays suppressed alerts from DLQ when breaker trips', () => {});
  it('resets breaker after manual flag re-enable', () => {});
  it('tracks suppression count per flag in Redis sliding window', () => {});
});
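The breaker logic can be prototyped in memory before wiring up the Redis sliding window. A sketch (class and method names are assumptions; production would use ZADD/ZREMRANGEBYSCORE on a per-flag key):

```typescript
// In-memory stand-in for the Redis sliding window behind the circuit breaker.
export class SuppressionBreaker {
  private events: number[] = [];
  private tripped = false;

  constructor(private baselinePer30Min: number, private windowMs = 30 * 60_000) {}

  // Records one suppression at time `now` (ms); returns whether the breaker is tripped.
  record(now: number): boolean {
    this.events.push(now);
    this.events = this.events.filter((t) => now - t < this.windowMs); // drop events outside the window
    if (!this.tripped && this.events.length > 2 * this.baselinePer30Min) this.tripped = true;
    return this.tripped;
  }

  allowSuppression(): boolean {
    return !this.tripped;
  }

  // Manual re-enable after the flag owner has investigated.
  reset(): void {
    this.tripped = false;
    this.events = [];
  }
}
```

The DLQ replay and flag auto-disable would hang off the `record(...) === true` transition.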

Section 4: Integration Test Strategy

4.1 Webhook Contract Tests

Each provider integration gets a contract test suite that validates the full path: HTTP request → Lambda → SQS message.

// tests/integration/webhooks/datadog.contract.test.ts
describe('Datadog Webhook Contract', () => {
  let localstack: LocalStackContainer;
  let sqsClient: SQSClient;

  beforeAll(async () => {
    localstack = await new LocalStackContainer().start();
    sqsClient = new SQSClient({ endpoint: localstack.getEndpoint() });
    // Create SQS FIFO queue
    await sqsClient.send(new CreateQueueCommand({
      QueueName: 'alert-ingested.fifo',
      Attributes: { FifoQueue: 'true', ContentBasedDeduplication: 'true' }
    }));
  });

  it('accepts valid Datadog webhook and produces canonical SQS message', async () => {
    const payload = loadFixture('webhooks/datadog/single-alert.json');
    const signature = computeHmac(payload, TEST_SECRET);

    const res = await request(app)
      .post('/v1/wh/tenant-123/datadog')
      .set('DD-WEBHOOK-SIGNATURE', signature)
      .send(payload);

    expect(res.status).toBe(200);

    const messages = await pollSqs(sqsClient, 'alert-ingested.fifo');
    expect(messages).toHaveLength(1);
    expect(messages[0].body).toMatchObject({
      tenant_id: 'tenant-123',
      provider: 'datadog',
      severity: expect.stringMatching(/critical|high|medium|low|info/),
      fingerprint: expect.stringMatching(/^[a-f0-9]{64}$/),
    });
  });

  it('rejects webhook with invalid HMAC and produces no SQS message', async () => {
    const payload = loadFixture('webhooks/datadog/single-alert.json');

    const res = await request(app)
      .post('/v1/wh/tenant-123/datadog')
      .set('DD-WEBHOOK-SIGNATURE', 'bad-signature')
      .send(payload);

    expect(res.status).toBe(401);
    const messages = await pollSqs(sqsClient, 'alert-ingested.fifo', { waitMs: 1000 });
    expect(messages).toHaveLength(0);
  });
});

Repeat pattern for PagerDuty, OpsGenie, Grafana — each with provider-specific signature headers and payload formats.

4.2 Correlation Engine → Redis Integration

// tests/integration/correlation/redis-windows.test.ts
describe('Correlation Engine + Redis', () => {
  let redis: StartedTestContainer;
  let redisClient: Redis;

  beforeAll(async () => {
    redis = await new GenericContainer('redis:7-alpine')
      .withExposedPorts(6379)
      .start();
    redisClient = new Redis({ host: redis.getHost(), port: redis.getMappedPort(6379) });
  });

  it('opens window in Redis sorted set with correct TTL', async () => {
    await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));

    const windows = await redisClient.zrange('windows:tenant-123', 0, -1, 'WITHSCORES');
    expect(windows).toHaveLength(2); // [windowId, closesAtEpoch]
    const ttl = await redisClient.ttl('window:tenant-123:payment-api');
    expect(ttl).toBeGreaterThan(280); // ~5min minus processing time
  });

  it('extends window when alert arrives in last 30 seconds', async () => {
    // Open window, advance clock to T+4m31s, send another alert
    await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));
    vi.advanceTimersByTime(4 * 60 * 1000 + 31 * 1000);
    await correlationEngine.processAlert(makeAlert({ service: 'payment-api' }));

    const ttl = await redisClient.ttl('window:tenant-123:payment-api');
    expect(ttl).toBeGreaterThan(100); // Extended by ~2min
  });

  it('isolates windows between tenants', async () => {
    await correlationEngine.processAlert(makeAlert({ tenant: 'A', service: 'api' }));
    await correlationEngine.processAlert(makeAlert({ tenant: 'B', service: 'api' }));

    const windowsA = await redisClient.zrange('windows:A', 0, -1);
    const windowsB = await redisClient.zrange('windows:B', 0, -1);
    expect(windowsA).toHaveLength(1);
    expect(windowsB).toHaveLength(1);
    expect(windowsA[0]).not.toBe(windowsB[0]);
  });
});

4.3 Correlation Engine → DynamoDB Integration

// tests/integration/correlation/dynamodb-incidents.test.ts
describe('Correlation Engine + DynamoDB', () => {
  let dynamodb: StartedTestContainer;

  beforeAll(async () => {
    dynamodb = await new GenericContainer('amazon/dynamodb-local:latest')
      .withExposedPorts(8000)
      .start();
    // Create tables: alerts, incidents, tenant_config, service_dependencies
  });

  it('persists incident record when correlation window closes', async () => {
    await correlationEngine.processAlert(makeAlert({ service: 'api' }));
    await correlationEngine.processAlert(makeAlert({ service: 'api' }));
    await correlationEngine.closeExpiredWindows();

    const incidents = await queryIncidents('tenant-123');
    expect(incidents).toHaveLength(1);
    expect(incidents[0].alert_count).toBe(2);
    expect(incidents[0].services).toContain('api');
  });

  it('reads service dependencies for cascading correlation', async () => {
    await putServiceDependency('tenant-123', 'api', 'database');
    await correlationEngine.processAlert(makeAlert({ service: 'database' }));
    await correlationEngine.processAlert(makeAlert({ service: 'api' }));

    // Both should be in the same window
    const windows = await getActiveWindows('tenant-123');
    expect(windows).toHaveLength(1);
    expect(windows[0].services).toEqual(expect.arrayContaining(['api', 'database']));
  });
});

4.4 Correlation Engine → TimescaleDB Integration

// tests/integration/correlation/timescaledb-trends.test.ts
describe('Correlation Engine + TimescaleDB', () => {
  let pg: StartedTestContainer;

  beforeAll(async () => {
    pg = await new GenericContainer('timescale/timescaledb:latest-pg16')
      .withExposedPorts(5432)
      .withEnvironment({ POSTGRES_PASSWORD: 'test' })
      .start();
    // Run migrations: create hypertables, continuous aggregates
  });

  it('writes alert frequency data to hypertable', async () => {
    await correlationEngine.recordAlertEvent(makeAlert({ service: 'api' }));
    const rows = await query('SELECT * FROM alert_events WHERE service = $1', ['api']);
    expect(rows).toHaveLength(1);
  });

  it('continuous aggregate calculates hourly alert counts', async () => {
    // Insert 10 alerts spread over 2 hours
    await insertAlertEvents(10, { spreadHours: 2 });
    await refreshContinuousAggregate('hourly_alert_summary');

    const summary = await query('SELECT * FROM hourly_alert_summary');
    expect(summary).toHaveLength(2);
    expect(summary.reduce((s, r) => s + r.alert_count, 0)).toBe(10);
  });
});

4.5 Notification Service → Slack (WireMock)

// tests/integration/notifications/slack.test.ts
describe('Notification Service + Slack', () => {
  let wiremock: WireMockContainer;

  beforeAll(async () => {
    wiremock = await new WireMockContainer().start();
    wiremock.stub({
      request: { method: 'POST', urlPath: '/api/chat.postMessage' },
      response: { status: 200, body: JSON.stringify({ ok: true, ts: '1234.5678' }) }
    });
    wiremock.stub({
      request: { method: 'POST', urlPath: '/api/chat.update' },
      response: { status: 200, body: JSON.stringify({ ok: true }) }
    });
  });

  it('sends initial alert notification to correct Slack channel', async () => {});
  it('updates message in-place when correlation completes', async () => {});
  it('respects Slack rate limits (1 msg/sec per channel)', async () => {});
  it('retries on 429 with exponential backoff', async () => {});
  it('includes feedback buttons in correlated incident message', async () => {});
});

Section 5: E2E & Smoke Tests

5.1 Critical User Journeys

Journey 1: 60-Second Time-to-Value

The defining test for dd0c/alert. Validates the entire pipeline from webhook to Slack notification.

// tests/e2e/journeys/sixty-second-ttv.test.ts
describe('60-Second Time-to-Value', () => {
  it('delivers first correlated incident to Slack within 60 seconds of webhook', async () => {
    const start = Date.now();

    // 1. Send Datadog webhook
    await sendWebhook('datadog', fixtures.datadog.singleAlert, { tenant: 'e2e-tenant' });

    // 2. Wait for Slack message
    const slackMessage = await waitForSlackMessage('e2e-channel', { timeoutMs: 60_000 });

    const elapsed = Date.now() - start;
    expect(elapsed).toBeLessThan(60_000);
    expect(slackMessage.text).toContain('New alert');
    expect(slackMessage.blocks).toBeDefined();
  });
});

Journey 2: Alert Storm Correlation

// tests/e2e/journeys/alert-storm.test.ts
describe('Alert Storm Correlation', () => {
  it('groups 50 alerts in 2 minutes into a single correlated incident', async () => {
    // Fire 50 alerts for same service over 2 minutes
    for (let i = 0; i < 50; i++) {
      await sendWebhook('datadog', makeAlertPayload({
        service: 'payment-api',
        title: `High latency on payment-api (${i})`,
      }));
      await sleep(2400); // ~50 alerts in 2 min
    }

    // Wait for correlation window to close
    await sleep(5 * 60 * 1000 + 30_000); // 5min window + buffer

    const slackMessages = await getSlackMessages('e2e-channel');
    const incidentMessages = slackMessages.filter(m => m.text.includes('Incident'));
    expect(incidentMessages).toHaveLength(1);
    expect(incidentMessages[0].text).toContain('50 alerts grouped');
  });
});

Journey 3: Deploy Correlation

// tests/e2e/journeys/deploy-correlation.test.ts
describe('Deploy Correlation', () => {
  it('identifies deploy as trigger when alerts follow within 10 minutes', async () => {
    // 1. Send deploy event
    await sendWebhook('github-actions', makeDeployPayload({
      service: 'payment-api',
      commit: 'abc123',
      pr_title: 'feat: add retry logic',
    }));

    // 2. Wait 2 minutes, then fire alerts
    await sleep(2 * 60 * 1000);
    await sendWebhook('datadog', makeAlertPayload({ service: 'payment-api' }));
    await sendWebhook('pagerduty', makeAlertPayload({ service: 'payment-api' }));

    // 3. Wait for correlation
    await sleep(6 * 60 * 1000);

    const slackMessage = await getLatestSlackMessage('e2e-channel');
    expect(slackMessage.text).toContain('Deploy #');
    expect(slackMessage.text).toContain('abc123');
  });
});

Journey 4: Panic Mode

// tests/e2e/journeys/panic-mode.test.ts
describe('Panic Mode', () => {
  it('stops all suppression immediately when panic mode is activated', async () => {
    // 1. Enable audit mode, verify suppression works
    await setGovernanceMode('e2e-tenant', 'audit');
    await sendNoisyAlerts(10);
    const beforePanic = await getSlackMessages('e2e-channel');
    const suppressedBefore = beforePanic.filter(m => m.text.includes('suppressed'));
    expect(suppressedBefore.length).toBeGreaterThan(0);

    // 2. Activate panic mode (API_BASE_URL points at the app under test)
    await fetch(`${API_BASE_URL}/admin/panic`, { method: 'POST' });

    // 3. Send more alerts — all should pass through unsuppressed
    await sendNoisyAlerts(10);
    const afterPanic = await getSlackMessages('e2e-channel');
    const newMessages = afterPanic.slice(beforePanic.length);
    const rawAlerts = newMessages.filter(m => !m.text.includes('suppressed'));
    expect(rawAlerts.length).toBeGreaterThanOrEqual(10);
  });
});

5.2 E2E Infrastructure

# docker-compose.e2e.yml
services:
  localstack:
    image: localstack/localstack:3
    environment:
      SERVICES: sqs,s3,dynamodb,apigateway,lambda
    ports: ["4566:4566"]

  timescaledb:
    image: timescale/timescaledb:latest-pg16
    environment:
      POSTGRES_PASSWORD: test
    ports: ["5432:5432"]

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  wiremock:
    image: wiremock/wiremock:3
    ports: ["8080:8080"]
    volumes:
      - ./fixtures/wiremock:/home/wiremock/mappings

  app:
    build: .
    environment:
      AWS_ENDPOINT: http://localstack:4566
      REDIS_URL: redis://redis:6379
      TIMESCALE_URL: postgres://postgres:test@timescaledb:5432/test
      SLACK_API_URL: http://wiremock:8080
    depends_on: [localstack, timescaledb, redis, wiremock]

5.3 Synthetic Alert Generation

// tests/e2e/helpers/alert-generator.ts
export function makeAlertPayload(overrides: Partial<AlertPayload> = {}): DatadogWebhookPayload {
  return {
    id: ulid(),
    title: overrides.title ?? `Alert: ${faker.hacker.phrase()}`,
    text: faker.lorem.sentence(),
    date_happened: Math.floor(Date.now() / 1000),
    priority: overrides.priority ?? 'normal',
    tags: [`service:${overrides.service ?? 'test-service'}`],
    alert_type: overrides.severity ?? 'warning',
    ...overrides,
  };
}

export async function sendNoisyAlerts(count: number, opts?: { service?: string }) {
  for (let i = 0; i < count; i++) {
    await sendWebhook('datadog', makeAlertPayload({
      service: opts?.service ?? 'noisy-service',
      title: `Flapping alert #${i}`,
    }));
  }
}

Section 6: Performance & Load Testing

6.1 Alert Ingestion Throughput

// tests/perf/ingestion-throughput.test.ts
describe('Ingestion Throughput', () => {
  it('processes 1000 webhooks/second without dropping payloads', async () => {
    const results = await k6.run({
      vus: 100,
      duration: '30s',
      thresholds: {
        http_req_duration: ['p(95)<200'],  // 200ms p95 (k6 percentile syntax)
        http_req_failed: ['rate<0.001'],  // <0.1% failure
      },
      script: `
        import http from 'k6/http';
        export default function() {
          http.post('${WEBHOOK_URL}/v1/wh/perf-tenant/datadog', 
            JSON.stringify(makeAlertPayload()),
            { headers: { 'DD-WEBHOOK-SIGNATURE': validSig } }
          );
        }
      `,
    });
    expect(results.metrics.http_req_failed.rate).toBeLessThan(0.001);
  });
});

6.2 Correlation Latency Under Alert Storms

describe('Correlation Storm Performance', () => {
  it('correlates 500 alerts across 10 services within 30 seconds', async () => {
    const start = Date.now();
    
    // Simulate incident storm: 500 alerts, 10 services, 2 minutes
    await generateAlertStorm({ alerts: 500, services: 10, durationMs: 120_000 });
    
    // Wait for all windows to close
    await waitForIncidents('perf-tenant', { minCount: 1, timeoutMs: 30_000 });
    
    const elapsed = Date.now() - start - 120_000; // subtract generation time
    expect(elapsed).toBeLessThan(30_000);
  });

  it('Redis memory stays under 50MB during 10K active windows', async () => {
    // Open 10K windows across 100 tenants
    for (let t = 0; t < 100; t++) {
      for (let s = 0; s < 100; s++) {
        await correlationEngine.processAlert(makeAlert({
          tenant: `tenant-${t}`,
          service: `service-${s}`,
        }));
      }
    }
    const memoryUsage = await redisClient.info('memory');
    const usedMb = parseRedisMemory(memoryUsage);
    expect(usedMb).toBeLessThan(50);
  });
});

6.3 Noise Scoring Latency

describe('Noise Scoring Performance', () => {
  it('scores a correlated incident with 50 alerts in <100ms', async () => {
    const incident = makeIncident({ alertCount: 50, withHistory: true });
    
    const start = performance.now();
    const score = await noiseScorer.score(incident);
    const elapsed = performance.now() - start;
    
    expect(elapsed).toBeLessThan(100);
    expect(score).toBeGreaterThanOrEqual(0);
    expect(score).toBeLessThanOrEqual(100);
  });
});

6.4 Memory Pressure During High-Cardinality Correlation

describe('Memory Pressure', () => {
  it('ECS task stays under 512MB with 1000 concurrent correlation windows', async () => {
    // Monitor ECS task memory while processing high-cardinality alerts
    const memBefore = process.memoryUsage().heapUsed;
    
    await processHighCardinalityAlerts({ tenants: 100, servicesPerTenant: 10 });
    
    const memAfter = process.memoryUsage().heapUsed;
    const deltaMb = (memAfter - memBefore) / 1024 / 1024;
    expect(deltaMb).toBeLessThan(256); // Leave headroom in 512MB task
  });
});

Section 7: CI/CD Pipeline Integration

7.1 Pipeline Stages

┌────────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Pre-Commit │───▶│ PR Gate  │───▶│ Merge    │───▶│ Staging  │───▶│ Prod     │
│ (local)    │    │ (CI)     │    │ (CI)     │    │ (CD)     │    │ (CD)     │
└────────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
  lint + format     unit tests      full suite      E2E + perf     smoke + canary
  type check        integration     coverage gate   LocalStack     deploy event
  <10s              <5min           <10min          <15min         self-dogfood

7.2 Stage Details

Pre-Commit (local, <10s):

  • eslint + prettier format check
  • tsc --noEmit type check
  • Affected unit tests only (vitest --changed)

PR Gate (CI, <5min):

  • Full unit test suite
  • Integration tests (Testcontainers spin up in CI)
  • Schema migration lint (no DROP/RENAME/TYPE changes)
  • Decision log presence check for scoring/correlation PRs
  • Coverage diff: new code must have ≥80% coverage

Merge to Main (CI, <10min):

  • Full test suite (unit + integration)
  • Coverage gate: overall ≥80%, scoring engine ≥90%
  • CDK synth + diff (infrastructure changes)
  • Security scan (npm audit, trivy)

Staging (CD, <15min):

  • Deploy to staging environment
  • E2E journey tests against LocalStack
  • Performance benchmarks (ingestion throughput, correlation latency)
  • Synthetic alert generation + validation

Production (CD):

  • Canary deploy (10% traffic for 5 minutes)
  • Smoke tests (send test webhook, verify Slack delivery)
  • dd0c/alert dogfoods itself: deploy event sent to own webhook
  • Automated rollback if error rate >1% during canary

7.3 Coverage Thresholds

| Component              | Minimum | Target |
|------------------------|---------|--------|
| Webhook Parsers        | 90%     | 95%    |
| HMAC Validator         | 95%     | 100%   |
| Correlation Engine     | 85%     | 90%    |
| Noise Scorer           | 90%     | 95%    |
| Governance Policy      | 90%     | 95%    |
| Notification Formatter | 75%     | 85%    |
| Overall                | 80%     | 85%    |
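
These gates can be expressed directly in the test runner config; Vitest supports glob-scoped coverage thresholds, so the per-component floors above map to entries like the following sketch (the `src/…` paths are illustrative):

```typescript
// vitest.config.ts — coverage gates mirroring the table above (paths illustrative)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        // Overall floor
        lines: 80, branches: 80, functions: 80, statements: 80,
        // Per-component floors, scoped by glob
        'src/ingestion/parsers/**': { lines: 90 },
        'src/ingestion/hmac.ts': { lines: 95 },
        'src/scoring/**': { lines: 90 },
        'src/governance/**': { lines: 90 },
      },
    },
  },
});
```

CI then fails the merge stage automatically when any scoped threshold is violated, rather than relying on a separate coverage-diff script.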

7.4 Test Parallelization

# .github/workflows/test.yml
jobs:
  unit:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - run: vitest --shard=${{ matrix.shard }}/4

  integration:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        suite: [webhooks, correlation, notifications, storage]
    steps:
      - run: vitest --project=integration --grep=${{ matrix.suite }}

  e2e:
    needs: [unit, integration]
    runs-on: ubuntu-latest
    steps:
      - run: docker compose -f docker-compose.e2e.yml up -d
      - run: vitest --project=e2e

Section 8: Transparent Factory Tenet Testing

8.1 Atomic Flagging — Suppression Circuit Breaker

describe('Atomic Flagging', () => {
  describe('Flag Lifecycle', () => {
    it('new scoring rule flag defaults to false (off)', () => {});
    it('flag has owner and ttl metadata', () => {});
    it('CI blocks when flag at 100% exceeds 14-day TTL', () => {});
  });

  describe('Circuit Breaker on Suppression Volume', () => {
    it('allows suppression when volume is within 2x baseline', () => {});
    it('trips breaker when suppression exceeds 2x baseline over 30min', () => {});
    it('auto-disables the flag when breaker trips', () => {});
    it('buffers suppressed alerts in DLQ during normal operation', () => {});
    it('replays DLQ alerts when breaker trips', async () => {
      // 1. Enable scoring flag, suppress 20 alerts
      // 2. Trip the breaker by spiking suppression rate
      // 3. Verify all 20 suppressed alerts are re-emitted from DLQ
      // 4. Verify flag is now disabled
    });
    it('DLQ retains alerts for 1 hour before expiry', () => {});
  });

  describe('Local Evaluation', () => {
    it('flag evaluation does not make network calls', () => {});
    it('flag state is cached in-memory and refreshed every 60s', () => {});
  });
});
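
The volume check at the heart of this breaker can be sketched as a rolling count against a baseline; class and field names below are illustrative, not the actual implementation:

```typescript
// Sketch of the suppression-volume circuit breaker: trips once suppressions
// in the rolling 30-minute window exceed 2x the expected baseline.
class SuppressionBreaker {
  private timestamps: number[] = [];
  tripped = false;

  constructor(
    private baselinePerWindow: number,      // expected suppressions per 30 min
    private windowMs = 30 * 60 * 1000,
    private now: () => number = Date.now,   // injectable clock for tests
  ) {}

  // Record a suppression; returns false once the breaker has tripped,
  // at which point the caller disables the flag and replays the DLQ.
  recordSuppression(): boolean {
    const cutoff = this.now() - this.windowMs;
    this.timestamps = this.timestamps.filter(t => t >= cutoff);
    this.timestamps.push(this.now());
    if (this.timestamps.length > 2 * this.baselinePerWindow) {
      this.tripped = true;
    }
    return !this.tripped;
  }
}
```

The injectable `now` function keeps the 30-minute window deterministic under test, consistent with the Clock interface introduced in Section 11.2.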

8.2 Elastic Schema — Migration Validation

describe('Elastic Schema', () => {
  describe('Migration Lint', () => {
    it('rejects migration with DROP COLUMN statement', () => {
      const migration = 'ALTER TABLE alert_events DROP COLUMN old_field;';
      expect(lintMigration(migration)).toContainError('DROP not allowed');
    });
    it('rejects migration with ALTER COLUMN TYPE', () => {
      const migration = 'ALTER TABLE alert_events ALTER COLUMN severity TYPE integer;';
      expect(lintMigration(migration)).toContainError('TYPE change not allowed');
    });
    it('rejects migration with RENAME COLUMN', () => {});
    it('accepts migration with ADD COLUMN (nullable)', () => {
      const migration = 'ALTER TABLE alert_events ADD COLUMN noise_score_v2 integer;';
      expect(lintMigration(migration)).toBeValid();
    });
    it('accepts migration with new table creation', () => {});
  });

  describe('DynamoDB Schema', () => {
    it('rejects attribute type change in table definition', () => {});
    it('accepts new attribute addition', () => {});
    it('V1 code ignores V2 attributes without error', () => {});
  });

  describe('Sunset Enforcement', () => {
    it('every migration file contains sunset_date comment', () => {
      const migrations = glob.sync('migrations/*.sql');
      for (const m of migrations) {
        const content = fs.readFileSync(m, 'utf-8');
        expect(content).toMatch(/-- sunset_date: \d{4}-\d{2}-\d{2}/);
      }
    });
    it('CI warns when migration is past sunset date', () => {});
  });
});
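
A minimal `lintMigration` satisfying these tests can be simple pattern checks over the SQL text; this sketch is not a real SQL parser and the error strings are only those asserted above:

```typescript
// Minimal migration linter sketch: rejects destructive DDL, accepts additive DDL.
interface LintResult { valid: boolean; errors: string[]; }

function lintMigration(sql: string): LintResult {
  const errors: string[] = [];
  const upper = sql.toUpperCase();
  if (/\bDROP\s+(COLUMN|TABLE)\b/.test(upper)) errors.push('DROP not allowed');
  if (/\bALTER\s+COLUMN\s+\w+\s+TYPE\b/.test(upper)) errors.push('TYPE change not allowed');
  if (/\bRENAME\s+(COLUMN|TO)\b/.test(upper)) errors.push('RENAME not allowed');
  return { valid: errors.length === 0, errors };
}
```

A real implementation would parse statements rather than regex-match (e.g. a quoted string containing "DROP" would false-positive here), but this is enough to drive the RED phase.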

8.3 Cognitive Durability — Decision Log Validation

describe('Cognitive Durability', () => {
  it('decision_log.json exists for every PR touching scoring/', () => {
    // CI hook: check git diff for files in src/scoring/
    // If touched, require docs/decisions/*.json in the same PR
  });

  it('decision log has required fields', () => {
    const logs = glob.sync('docs/decisions/*.json');
    for (const log of logs) {
      const entry = JSON.parse(fs.readFileSync(log, 'utf-8'));
      expect(entry).toHaveProperty('reasoning');
      expect(entry).toHaveProperty('alternatives_considered');
      expect(entry).toHaveProperty('confidence');
      expect(entry).toHaveProperty('timestamp');
      expect(entry).toHaveProperty('author');
    }
  });

  it('cyclomatic complexity stays under 10 for all scoring functions', () => {
    // eslint exits non-zero on violations, which makes execSync throw —
    // so a clean run is simply "does not throw"
    expect(() =>
      execSync('eslint src/scoring/ --rule "complexity: [error, 10]"')
    ).not.toThrow();
  });
});

8.4 Semantic Observability — OTEL Span Assertions

describe('Semantic Observability', () => {
  let spanExporter: InMemorySpanExporter;

  beforeEach(() => {
    spanExporter = new InMemorySpanExporter();
    // Configure OTEL with in-memory exporter for testing
  });

  describe('Alert Evaluation Spans', () => {
    it('emits parent alert_evaluation span for each alert', async () => {
      await processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const evalSpan = spans.find(s => s.name === 'alert_evaluation');
      expect(evalSpan).toBeDefined();
    });

    it('emits child noise_scoring span with score attributes', async () => {
      await processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const scoreSpan = spans.find(s => s.name === 'noise_scoring');
      expect(scoreSpan).toBeDefined();
      expect(scoreSpan.attributes['alert.noise_score']).toBeGreaterThanOrEqual(0);
      expect(scoreSpan.attributes['alert.noise_score']).toBeLessThanOrEqual(100);
    });

    it('emits child correlation_matching span with match data', async () => {
      await processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const corrSpan = spans.find(s => s.name === 'correlation_matching');
      expect(corrSpan).toBeDefined();
      expect(corrSpan.attributes).toHaveProperty('alert.correlation_matches');
    });

    it('emits suppression_decision span with reason', async () => {
      await processAlert(makeAlert());
      const spans = spanExporter.getFinishedSpans();
      const suppSpan = spans.find(s => s.name === 'suppression_decision');
      expect(suppSpan).toBeDefined();
      expect(suppSpan.attributes).toHaveProperty('alert.suppressed');
      expect(suppSpan.attributes).toHaveProperty('alert.suppression_reason');
    });
  });

  describe('PII Protection', () => {
    it('never includes raw alert payload in span attributes', async () => {
      await processAlert(makeAlert({ title: 'User john@example.com failed login' }));
      const spans = spanExporter.getFinishedSpans();
      for (const span of spans) {
        const attrs = JSON.stringify(span.attributes);
        expect(attrs).not.toContain('john@example.com');
      }
    });

    it('uses hashed alert source identifier, not raw', async () => {
      await processAlert(makeAlert({ source: 'prod-payment-api' }));
      const spans = spanExporter.getFinishedSpans();
      const evalSpan = spans.find(s => s.name === 'alert_evaluation');
      expect(evalSpan.attributes['alert.source']).toMatch(/^[a-f0-9]+$/);
    });
  });
});
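
The hashed-source attribute asserted above can be produced with a keyed hash so raw service names never appear in telemetry; the salt parameter is an assumption of this sketch (in practice it would come from tenant config or a secret store):

```typescript
// Keyed SHA-256 of the alert source: spans carry this digest instead of
// the raw identifier, satisfying the /^[a-f0-9]+$/ assertion.
import { createHmac } from 'node:crypto';

function hashSourceIdentifier(source: string, salt: string): string {
  return createHmac('sha256', salt).update(source).digest('hex');
}
```

Using an HMAC rather than a bare hash prevents dictionary reversal of common service names by anyone who can read the telemetry stream.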

8.5 Configurable Autonomy — Governance Policy Tests

describe('Configurable Autonomy', () => {
  describe('Governance Mode Enforcement', () => {
    it('strict mode: annotates but never suppresses', async () => {
      setPolicy({ governance_mode: 'strict' });
      const result = await processNoisyAlert(makeAlert({ noiseScore: 95 }));
      expect(result.suppressed).toBe(false);
      expect(result.annotation).toContain('noise_score: 95');
    });

    it('audit mode: auto-suppresses with logging', async () => {
      setPolicy({ governance_mode: 'audit' });
      const result = await processNoisyAlert(makeAlert({ noiseScore: 95 }));
      expect(result.suppressed).toBe(true);
      expect(result.log).toContain('suppressed by audit mode');
    });
  });

  describe('Panic Mode', () => {
    it('activates in <1 second via API call', async () => {
      const start = Date.now();
      await fetch('/admin/panic', { method: 'POST' });
      const elapsed = Date.now() - start; // measure only the API call, not the Redis read
      const panicActive = await redisClient.get('dd0c:panic');
      expect(elapsed).toBeLessThan(1000);
      expect(panicActive).toBe('true');
    });

    it('stops all suppression when active', async () => {
      await activatePanic();
      const results = await Promise.all(
        Array.from({ length: 10 }, () => processNoisyAlert(makeAlert({ noiseScore: 99 })))
      );
      expect(results.every(r => r.suppressed === false)).toBe(true);
    });
  });

  describe('Per-Customer Override', () => {
    it('customer strict overrides system audit', async () => {
      setPolicy({ governance_mode: 'audit' });
      setCustomerPolicy('tenant-123', { governance_mode: 'strict' });
      const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
      expect(result.suppressed).toBe(false);
    });

    it('customer cannot downgrade from system strict to audit', async () => {
      setPolicy({ governance_mode: 'strict' });
      setCustomerPolicy('tenant-123', { governance_mode: 'audit' });
      const result = await processNoisyAlert(makeAlert({ tenant: 'tenant-123', noiseScore: 95 }));
      expect(result.suppressed).toBe(false); // System strict wins
    });
  });
});
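
The override tests above pin down one rule: the effective mode is always the stricter of the system and customer policies. A minimal resolution sketch (type and function names illustrative):

```typescript
// "Strictest wins" policy resolution: 'strict' (annotate only) always
// beats 'audit' (auto-suppress), in either direction of the override.
type GovernanceMode = 'strict' | 'audit';

function effectiveMode(system: GovernanceMode, customer?: GovernanceMode): GovernanceMode {
  if (system === 'strict' || customer === 'strict') return 'strict';
  return customer ?? system;
}
```

Encoding the rule as a pure function keeps it trivially unit-testable and makes the "customer cannot downgrade" property obvious from the first branch.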

Section 9: Test Data & Fixtures

9.1 Directory Structure

tests/
  fixtures/
    webhooks/
      datadog/
        single-alert.json
        batched-alerts.json
        monitor-recovered.json
        high-priority.json
      pagerduty/
        incident-triggered.json
        incident-resolved.json
        incident-acknowledged.json
      opsgenie/
        alert-created.json
        alert-closed.json
      grafana/
        single-firing.json
        multi-firing.json
        resolved.json
    deploys/
      github-actions-success.json
      github-actions-failure.json
      gitlab-ci-pipeline.json
      argocd-sync.json
    scenarios/
      alert-storm-50-alerts.json
      cascading-failure-3-services.json
      flapping-alert-10-cycles.json
      maintenance-window-suppression.json
      deploy-correlated-incident.json
    slack/
      initial-alert-blocks.json
      correlated-incident-blocks.json
      weekly-digest-blocks.json
    schemas/
      canonical-alert.json
      incident-record.json
      tenant-config.json

9.2 Alert Payload Factory

// tests/helpers/factories.ts
export function makeCanonicalAlert(overrides: Partial<CanonicalAlert> = {}): CanonicalAlert {
  return {
    alert_id: ulid(),
    tenant_id: overrides.tenant_id ?? 'test-tenant',
    provider: overrides.provider ?? 'datadog',
    service: overrides.service ?? 'test-service',
    title: overrides.title ?? `Alert: ${faker.hacker.phrase()}`,
    severity: overrides.severity ?? 'warning',
    fingerprint: overrides.fingerprint ?? crypto.randomBytes(32).toString('hex'),
    timestamp: overrides.timestamp ?? new Date().toISOString(),
    raw_payload_s3_key: overrides.raw_payload_s3_key ?? `raw/${ulid()}.json`,
    metadata: overrides.metadata ?? {},
    ...overrides,
  };
}

export function makeIncident(overrides: Partial<Incident> = {}): Incident {
  const alertCount = overrides.alert_count ?? 5;
  return {
    incident_id: ulid(),
    tenant_id: overrides.tenant_id ?? 'test-tenant',
    services: overrides.services ?? ['test-service'],
    alert_count: alertCount,
    alerts: Array.from({ length: alertCount }, () => makeCanonicalAlert()),
    noise_score: overrides.noise_score ?? 0,
    deploy_correlation: overrides.deploy_correlation ?? null,
    window_opened_at: overrides.window_opened_at ?? new Date().toISOString(),
    window_closed_at: overrides.window_closed_at ?? new Date().toISOString(),
    ...overrides,
  };
}

export function makeDeployEvent(overrides: Partial<DeployEvent> = {}): DeployEvent {
  return {
    deploy_id: ulid(),
    tenant_id: overrides.tenant_id ?? 'test-tenant',
    service: overrides.service ?? 'test-service',
    commit_sha: overrides.commit_sha ?? faker.git.commitSha(),
    pr_title: overrides.pr_title ?? faker.git.commitMessage(),
    deployed_at: overrides.deployed_at ?? new Date().toISOString(),
    provider: overrides.provider ?? 'github-actions',
    ...overrides,
  };
}

9.3 Noise Scenario Fixtures

// tests/helpers/scenarios.ts
export const NOISE_SCENARIOS = {
  alertStorm: {
    description: '50 alerts for same service in 2 minutes',
    alerts: Array.from({ length: 50 }, (_, i) => makeCanonicalAlert({
      service: 'payment-api',
      title: `High latency variant ${i}`,
      timestamp: new Date(Date.now() + i * 2400).toISOString(),
    })),
    expectedIncidents: 1,
    expectedNoiseScore: { min: 70, max: 95 },
  },

  flappingAlert: {
    description: 'Alert fires and resolves 10 times in 1 hour',
    alerts: Array.from({ length: 20 }, (_, i) => makeCanonicalAlert({
      service: 'health-check',
      title: 'Health check failed',
      severity: i % 2 === 0 ? 'warning' : 'info', // alternating fire/resolve
      timestamp: new Date(Date.now() + i * 3 * 60 * 1000).toISOString(),
    })),
    expectedNoiseScore: { min: 80, max: 100 },
  },

  cascadingFailure: {
    description: 'Database fails, then API, then frontend',
    alerts: [
      makeCanonicalAlert({ service: 'database', severity: 'critical', timestamp: t(0) }),
      makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(30) }),
      makeCanonicalAlert({ service: 'api', severity: 'high', timestamp: t(45) }),
      makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(60) }),
      makeCanonicalAlert({ service: 'frontend', severity: 'medium', timestamp: t(90) }),
    ],
    serviceDependencies: [['api', 'database'], ['frontend', 'api']],
    expectedIncidents: 1, // All merged via dependency graph
    expectedNoiseScore: { min: 0, max: 30 }, // Real incident, not noise
  },

  deployCorrelated: {
    description: 'Deploy followed by alert storm',
    deploy: makeDeployEvent({ service: 'payment-api', pr_title: 'feat: add retry logic' }),
    alerts: Array.from({ length: 8 }, () => makeCanonicalAlert({
      service: 'payment-api',
      severity: 'high',
    })),
    deployToAlertGapMs: 2 * 60 * 1000, // 2 minutes after deploy
    expectedNoiseScore: { min: 50, max: 85 }, // Deploy correlation boosts noise score
  },
};

Section 10: TDD Implementation Order

10.1 Bootstrap Sequence

The test infrastructure itself must be built before any product code. This is the order:

Phase 0: Test Infrastructure (Week 0)
  ├── 0.1 vitest config + TypeScript setup
  ├── 0.2 Testcontainers helper (Redis, DynamoDB Local, TimescaleDB)
  ├── 0.3 LocalStack helper (SQS, S3, API Gateway)
  ├── 0.4 Fixture loader utility
  ├── 0.5 Factory functions (makeCanonicalAlert, makeIncident, makeDeployEvent)
  ├── 0.6 WireMock Slack stub
  └── 0.7 CI pipeline with test stages

10.2 Epic-by-Epic TDD Order

Phase 1: Webhook Ingestion (Epic 1) — Tests First
  ├── 1.1 RED: HMAC validator tests (all providers)
  ├── 1.2 GREEN: Implement HMAC validation
  ├── 1.3 RED: Datadog parser tests (single + batch)
  ├── 1.4 GREEN: Implement Datadog parser
  ├── 1.5 RED: PagerDuty parser tests
  ├── 1.6 GREEN: Implement PagerDuty parser
  ├── 1.7 RED: Fingerprint generator tests
  ├── 1.8 GREEN: Implement fingerprinting
  ├── 1.9 INTEGRATION: Lambda → SQS contract test
  └── 1.10 REFACTOR: Extract provider parser interface

Phase 2: Correlation Engine (Epic 2) — Tests First
  ├── 2.1 RED: Time-window open/close/extend tests
  ├── 2.2 GREEN: Implement window manager
  ├── 2.3 RED: Service graph correlation tests
  ├── 2.4 GREEN: Implement dependency traversal
  ├── 2.5 RED: Deploy correlation tests
  ├── 2.6 GREEN: Implement deploy tracker
  ├── 2.7 INTEGRATION: Correlation → Redis window tests
  ├── 2.8 INTEGRATION: Correlation → DynamoDB incident persistence
  └── 2.9 INTEGRATION: Correlation → TimescaleDB trend writes

Phase 3: Noise Analysis (Epic 3) — Tests First
  ├── 3.1 RED: Rule-based noise scoring tests (all rules)
  ├── 3.2 GREEN: Implement scorer
  ├── 3.3 RED: Threshold classification tests
  ├── 3.4 GREEN: Implement classifier
  ├── 3.5 RED: "What would have happened" calculation tests
  ├── 3.6 GREEN: Implement historical analysis
  └── 3.7 REFACTOR: Extract scoring rules into configurable pipeline

Phase 4: Notifications (Epic 4) — Integration Tests Lead
  ├── 4.1 Implement Slack block formatter
  ├── 4.2 RED: Snapshot tests for all message formats
  ├── 4.3 INTEGRATION: Notification → Slack (WireMock)
  ├── 4.4 RED: Rate limiting tests
  └── 4.5 GREEN: Implement rate limiter

Phase 5: Governance (Epic 10) — Tests First
  ├── 5.1 RED: Strict/audit mode enforcement tests
  ├── 5.2 GREEN: Implement policy engine
  ├── 5.3 RED: Panic mode tests (<1s activation)
  ├── 5.4 GREEN: Implement panic mode
  ├── 5.5 RED: Circuit breaker + DLQ replay tests
  ├── 5.6 GREEN: Implement circuit breaker
  ├── 5.7 RED: OTEL span assertion tests
  └── 5.8 GREEN: Instrument all components

Phase 6: E2E Validation
  ├── 6.1 60-second TTV journey
  ├── 6.2 Alert storm correlation journey
  ├── 6.3 Deploy correlation journey
  ├── 6.4 Panic mode journey
  └── 6.5 Performance benchmarks

10.3 "Never Ship Without" Checklist

Before any release, these tests must pass:

  • All HMAC validation tests (security gate)
  • All correlation window tests (correctness gate)
  • All noise scoring tests (safety gate — never eat real alerts)
  • All governance policy tests (compliance gate)
  • Circuit breaker DLQ replay test (safety net gate)
  • 60-second TTV E2E journey (product promise gate)
  • PII protection span tests (privacy gate)
  • Schema migration lint (no breaking changes)
  • Coverage ≥80% overall, ≥90% on scoring engine

End of dd0c/alert Test Architecture


Section 11: Review Remediation Addendum (Post-Gemini Review)

11.1 Missing Epic Coverage

Epic 6: Dashboard API

describe('Dashboard API', () => {
  describe('Authentication', () => {
    it('returns 401 for missing Cognito JWT', async () => {});
    it('returns 401 for expired JWT', async () => {});
    it('returns 401 for JWT signed by wrong issuer', async () => {});
    it('extracts tenantId from JWT claims', async () => {});
  });

  describe('Incident Listing (GET /v1/incidents)', () => {
    it('returns paginated incidents for authenticated tenant', async () => {});
    it('supports cursor-based pagination', async () => {});
    it('filters by status (open, acknowledged, resolved)', async () => {});
    it('filters by severity (critical, warning, info)', async () => {});
    it('filters by time range (since, until)', async () => {});
    it('returns empty array for tenant with no incidents', async () => {});
  });

  describe('Incident Detail (GET /v1/incidents/:id)', () => {
    it('returns full incident with correlated alerts', async () => {});
    it('returns 404 for incident belonging to different tenant', async () => {});
    it('includes timeline of state transitions', async () => {});
  });

  describe('Analytics (GET /v1/analytics)', () => {
    it('returns MTTR for last 7/30/90 days', async () => {});
    it('returns alert volume by source', async () => {});
    it('returns noise reduction percentage', async () => {});
    it('scopes all analytics to authenticated tenant', async () => {});
  });

  describe('Tenant Isolation', () => {
    it('tenant A cannot read tenant B incidents via API', async () => {});
    it('tenant A cannot read tenant B analytics', async () => {});
    it('all DynamoDB queries include tenantId partition key', async () => {});
  });
});
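
For the cursor-based pagination tests, one common approach (assumed here, not prescribed by the spec) is to round-trip DynamoDB's `LastEvaluatedKey` as an opaque base64url cursor:

```typescript
// Opaque pagination cursor: serialize LastEvaluatedKey so clients can't
// depend on (or tamper meaningfully with) the key structure.
function encodeCursor(lastEvaluatedKey: Record<string, unknown>): string {
  return Buffer.from(JSON.stringify(lastEvaluatedKey)).toString('base64url');
}

function decodeCursor(cursor: string): Record<string, unknown> {
  return JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8'));
}
```

Because the decoded key is fed back into a query that is already scoped by the tenant partition key, a forged cursor cannot cross tenant boundaries, which is exactly what the isolation tests assert.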

Epic 7: Dashboard UI (Playwright)

// tests/e2e/ui/dashboard.spec.ts

test('login redirects to Cognito hosted UI', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page).toHaveURL(/cognito/);
});

test('incident list renders with correct severity badges', async ({ page }) => {
  await page.goto('/dashboard/incidents');
  await expect(page.locator('[data-testid="incident-card"]')).toHaveCount(5);
  await expect(page.locator('.severity-critical')).toBeVisible();
});

test('incident detail shows correlated alert timeline', async ({ page }) => {
  await page.goto('/dashboard/incidents/inc-123');
  await expect(page.locator('[data-testid="alert-timeline"]')).toBeVisible();
  // Playwright has no toHaveCountGreaterThan — assert on the resolved count
  expect(await page.locator('.timeline-event').count()).toBeGreaterThan(1);
});

test('MTTR chart renders with real data', async ({ page }) => {
  await page.goto('/dashboard/analytics');
  await expect(page.locator('[data-testid="mttr-chart"]')).toBeVisible();
});

test('noise reduction percentage displays correctly', async ({ page }) => {
  await page.goto('/dashboard/analytics');
  const noise = page.locator('[data-testid="noise-reduction"]');
  await expect(noise).toContainText('%');
});

test('webhook setup wizard generates correct URL', async ({ page }) => {
  await page.goto('/dashboard/settings/integrations');
  await page.click('[data-testid="add-datadog"]');
  const url = await page.locator('[data-testid="webhook-url"]').textContent();
  expect(url).toMatch(/\/v1\/webhooks\/ingest\/.+/);
});

Epic 9: Onboarding & PLG

describe('Free Tier Enforcement', () => {
  it('allows up to 10,000 alerts/month on free tier', async () => {});
  it('returns 429 with upgrade prompt at 10,001st alert', async () => {});
  it('resets counter on first of each month', async () => {});
  it('purges alert data older than 7 days on free tier', async () => {});
  it('retains alert data for 90 days on pro tier', async () => {});
});

describe('OAuth Signup', () => {
  it('creates tenant record on first Cognito login', async () => {});
  it('assigns free tier by default', async () => {});
  it('generates unique webhook URL per tenant', async () => {});
});

describe('Stripe Integration', () => {
  it('creates checkout session with correct pricing', async () => {});
  it('upgrades tenant on checkout.session.completed webhook', async () => {});
  it('downgrades tenant on subscription.deleted webhook', async () => {});
  it('validates Stripe webhook signature', async () => {});
});

Epic 5.3: Slack Feedback Endpoint

describe('Slack Interactive Actions Endpoint', () => {
  it('validates Slack request signature (HMAC-SHA256)', async () => {});
  it('rejects request with invalid signature', async () => {});
  it('handles "helpful" feedback — updates incident quality score', async () => {});
  it('handles "noise" feedback — adds to suppression training data', async () => {});
  it('handles "escalate" action — triggers PagerDuty/OpsGenie', async () => {});
  it('updates original Slack message after action', async () => {});
  it('scopes action to correct tenant', async () => {});
});
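
The signature check these tests exercise follows Slack's published signing-secret scheme: sign `v0:{timestamp}:{body}` with HMAC-SHA256 and compare timing-safely. A self-contained sketch:

```typescript
// Slack request verification: HMAC-SHA256 over "v0:{timestamp}:{body}"
// with the app's signing secret, compared via timingSafeEqual.
import { createHmac, timingSafeEqual } from 'node:crypto';

function verifySlackSignature(
  signingSecret: string,
  timestamp: string,   // X-Slack-Request-Timestamp header
  body: string,        // raw (unparsed) request body
  signature: string,   // X-Slack-Signature header, e.g. "v0=abc123..."
): boolean {
  // Reject stale requests to blunt replay attacks (5-minute window)
  if (Math.abs(Date.now() / 1000 - Number(timestamp)) > 60 * 5) return false;
  const expected = 'v0=' +
    createHmac('sha256', signingSecret).update(`v0:${timestamp}:${body}`).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Note the raw body must be captured before any JSON/body-parser middleware runs, since re-serialized JSON will not byte-match what Slack signed.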

Epic 1.4: S3 Raw Payload Archival

describe('Raw Payload Archival', () => {
  it('saves raw webhook payload to S3 asynchronously', async () => {});
  it('S3 key includes tenantId, source, and timestamp', async () => {});
  it('archival failure does not block alert processing', async () => {});
  it('archived payload is retrievable for replay', async () => {});
  it('S3 lifecycle policy deletes after retention period', async () => {});
});
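
The key layout these tests assert (tenantId, source, timestamp) can be built with a small helper; the exact prefix scheme below is illustrative:

```typescript
// Archival key builder: date-partitioned prefixes keep S3 lifecycle rules
// and replay-by-day listings cheap.
function buildArchiveKey(
  tenantId: string,
  source: string,
  receivedAt: Date,
  alertId: string,
): string {
  const day = receivedAt.toISOString().slice(0, 10); // YYYY-MM-DD
  return `raw/${tenantId}/${source}/${day}/${alertId}.json`;
}
```

Partitioning by day means the retention-period lifecycle rule can target a single prefix depth, and replay of a specific incident window is a bounded `ListObjectsV2` call.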

11.2 Anti-Pattern Fixes

Replace ioredis-mock with WindowStore Interface

// BEFORE (anti-pattern):
// import RedisMock from 'ioredis-mock';
// const engine = new CorrelationEngine(new RedisMock());

// AFTER (correct):
interface WindowStore {
  addEvent(tenantId: string, key: string, event: Alert, ttlMs: number): Promise<void>;
  getWindow(tenantId: string, key: string): Promise<Alert[]>;
  clearWindow(tenantId: string, key: string): Promise<void>;
}

class InMemoryWindowStore implements WindowStore {
  private store = new Map<string, { events: Alert[]; expiresAt: number }>();

  async addEvent(tenantId: string, key: string, event: Alert, ttlMs: number) {
    const fullKey = `${tenantId}:${key}`;
    const existing = this.store.get(fullKey) ?? { events: [], expiresAt: Date.now() + ttlMs };
    existing.events.push(event);
    this.store.set(fullKey, existing);
  }

  async getWindow(tenantId: string, key: string): Promise<Alert[]> {
    const entry = this.store.get(`${tenantId}:${key}`);
    if (!entry || entry.expiresAt < Date.now()) return [];
    return entry.events;
  }

  async clearWindow(tenantId: string, key: string): Promise<void> {
    this.store.delete(`${tenantId}:${key}`);
  }
}

// Unit tests use InMemoryWindowStore — no Redis dependency
// Integration tests use RedisWindowStore with Testcontainers

Replace sinon.useFakeTimers with Clock Interface

// BEFORE (anti-pattern):
// sinon.useFakeTimers(new Date('2026-03-01T00:00:00Z'));

// AFTER (correct):
interface Clock {
  now(): number;
  advanceBy(ms: number): void;
}

class FakeClock implements Clock {
  private current: number;
  constructor(start: Date = new Date()) { this.current = start.getTime(); }
  now() { return this.current; }
  advanceBy(ms: number) { this.current += ms; }
}

class SystemClock implements Clock {
  now() { return Date.now(); }
  advanceBy() { throw new Error('Cannot advance system clock'); }
}

// Inject into CorrelationEngine:
const engine = new CorrelationEngine(new InMemoryWindowStore(), new FakeClock());
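
With the clock injected, window-expiry logic becomes deterministic in unit tests. A self-contained sketch (re-declaring the minimal interfaces from above; `isWindowExpired` is an illustrative helper, not the engine's actual API):

```typescript
// Deterministic expiry checks via an injected clock — no real time passes.
interface Clock { now(): number; advanceBy(ms: number): void; }

class FakeClock implements Clock {
  private current: number;
  constructor(start: Date = new Date()) { this.current = start.getTime(); }
  now() { return this.current; }
  advanceBy(ms: number) { this.current += ms; }
}

function isWindowExpired(openedAtMs: number, ttlMs: number, clock: Clock): boolean {
  return clock.now() - openedAtMs >= ttlMs;
}
```

A test can now "wait" five minutes with `clock.advanceBy(5 * 60_000)` and assert the window closed, with zero wall-clock delay and no global timer patching.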

11.3 Trace Context Propagation Tests

describe('Trace Context Propagation', () => {
  it('API Gateway passes trace_id to Lambda via X-Amzn-Trace-Id', async () => {});
  
  it('Lambda propagates trace_id into SQS message attributes', async () => {
    // Verify SQS message has MessageAttribute 'traceparent' with W3C format
    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    expect(msg.MessageAttributes.traceparent).toBeDefined();
    expect(msg.MessageAttributes.traceparent.StringValue).toMatch(
      /^00-[0-9a-f]{32}-[0-9a-f]{16}-0[01]$/
    );
  });

  it('ECS Correlation Engine extracts trace_id from SQS message', async () => {
    // Verify the correlation span has the correct parent from SQS
    const spans = inMemoryExporter.getFinishedSpans();
    const correlationSpan = spans.find(s => s.name === 'alert.correlation');
    const ingestSpan = spans.find(s => s.name === 'webhook.ingest');
    expect(correlationSpan.parentSpanId).toBeDefined();
    // Parent chain must trace back to the original ingest span
  });

  it('end-to-end trace spans webhook → SQS → correlation → notification', async () => {
    // Fire a webhook, wait for Slack notification, verify all spans share trace_id
    const traceId = await fireWebhookAndGetTraceId();
    const spans = await getSpansByTraceId(traceId);
    const spanNames = spans.map(s => s.name);
    expect(spanNames).toContain('webhook.ingest');
    expect(spanNames).toContain('alert.normalize');
    expect(spanNames).toContain('alert.correlation');
    expect(spanNames).toContain('notification.slack');
  });
});

11.4 HMAC Security Hardening

describe('HMAC Signature Validation (Hardened)', () => {
  it('uses crypto.timingSafeEqual, not === comparison', () => {
    // Inspect the source to verify timing-safe comparison
    const source = fs.readFileSync('src/ingestion/hmac.ts', 'utf8');
    expect(source).toContain('timingSafeEqual');
    expect(source).not.toMatch(/signature\s*===\s*/);
  });

  it('handles case-insensitive header names (dd-webhook-signature vs DD-WEBHOOK-SIGNATURE)', async () => {
    const payload = makeAlertPayload('datadog');
    const sig = computeHMAC(payload, DATADOG_SECRET);
    
    // Lowercase header
    const resp1 = await ingest(payload, { 'dd-webhook-signature': sig });
    expect(resp1.status).toBe(200);
    
    // Uppercase header
    const resp2 = await ingest(payload, { 'DD-WEBHOOK-SIGNATURE': sig });
    expect(resp2.status).toBe(200);
  });

  it('rejects completely missing signature header', async () => {
    const resp = await ingest(makeAlertPayload('datadog'), {});
    expect(resp.status).toBe(401);
  });

  it('rejects empty signature header', async () => {
    const resp = await ingest(makeAlertPayload('datadog'), { 'dd-webhook-signature': '' });
    expect(resp.status).toBe(401);
  });
});

11.5 SQS 256KB Payload Limit

describe('Large Payload Handling', () => {
  it('compresses payloads >200KB before sending to SQS', async () => {
    const largePayload = makeLargeAlertPayload(300 * 1024); // 300KB
    const resp = await ingest(largePayload);
    expect(resp.status).toBe(200);

    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    // Payload must be compressed or use S3 pointer
    expect(msg.Body.length).toBeLessThan(256 * 1024);
  });

  it('uses S3 pointer for payloads >256KB after compression', async () => {
    const hugePayload = makeLargeAlertPayload(500 * 1024); // 500KB
    const resp = await ingest(hugePayload);
    expect(resp.status).toBe(200);

    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    const body = JSON.parse(msg.Body);
    expect(body.s3Pointer).toBeDefined();
    expect(body.s3Pointer).toMatch(/^s3:\/\/dd0c-alert-overflow\//);
  });

  it('strips unnecessary fields from Datadog payload before SQS', async () => {
    const payload = makeDatadogPayloadWithLargeTags(100); // 100 tags
    const resp = await ingest(payload);
    expect(resp.status).toBe(200);

    const msg = await getLastSQSMessage(localstack, 'alert-queue');
    const normalized = JSON.parse(msg.Body);
    // Only essential fields should remain
    expect(normalized.tags.length).toBeLessThanOrEqual(20);
  });

  it('rejects payloads >2MB at API Gateway level', async () => {
    const massive = makeLargeAlertPayload(3 * 1024 * 1024);
    const resp = await ingest(massive);
    expect(resp.status).toBe(413);
  });
});
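
The inline / compress / S3-pointer decision these tests describe can be sketched as a single routing function; the thresholds come from the tests above, while the overflow bucket name and `SqsBody` shape are assumptions of this sketch:

```typescript
// Route an alert payload to the smallest SQS-safe representation:
// inline when small, gzip+base64 when it fits after compression,
// otherwise an S3 pointer for the consumer to dereference.
import { gzipSync } from 'node:zlib';

type SqsBody =
  | { kind: 'inline'; body: string }
  | { kind: 'gzip'; body: string }           // base64 of gzipped payload
  | { kind: 's3Pointer'; s3Pointer: string };

const INLINE_LIMIT = 200 * 1024; // compress above this
const SQS_LIMIT = 256 * 1024;    // hard SQS message limit

function prepareSqsBody(payload: string, s3Key: string): SqsBody {
  if (Buffer.byteLength(payload) <= INLINE_LIMIT) {
    return { kind: 'inline', body: payload };
  }
  const compressed = gzipSync(payload).toString('base64');
  if (Buffer.byteLength(compressed) <= SQS_LIMIT) {
    return { kind: 'gzip', body: compressed };
  }
  // Caller uploads the payload to S3 under s3Key before sending the pointer
  return { kind: 's3Pointer', s3Pointer: `s3://dd0c-alert-overflow/${s3Key}` };
}
```

Note the base64 step inflates the compressed bytes by ~33%, which is why the comparison is against the encoded length, not the raw gzip output.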

11.6 DLQ Backpressure & Replay

```typescript
describe('DLQ Replay with Backpressure', () => {
  it('replays DLQ messages in batches of 100', async () => {
    await seedDLQ(10000); // 10K messages
    const replayer = new DLQReplayer({ batchSize: 100, delayBetweenBatchesMs: 500 });
    await replayer.start();

    // 10,000 messages / batch of 100 = exactly 100 batches,
    // never more than one batch in flight at a time
    expect(replayer.batchesProcessed).toBe(100);
    expect(replayer.maxConcurrentMessages).toBeLessThanOrEqual(100);
  });

  it('pauses replay if correlation engine error rate exceeds 10%', async () => {
    await seedDLQ(1000);
    const replayer = new DLQReplayer({ batchSize: 100, errorThreshold: 0.1 });

    // Simulate correlation engine returning errors
    mockCorrelationEngine.failRate = 0.15;
    await replayer.start();

    expect(replayer.state).toBe('paused');
    expect(replayer.pauseReason).toContain('error rate exceeded');
  });

  it('does not replay if circuit breaker is currently tripped', async () => {
    await seedDLQ(100);
    await tripCircuitBreaker();

    const replayer = new DLQReplayer();
    await replayer.start();

    expect(replayer.messagesReplayed).toBe(0);
    expect(replayer.state).toBe('blocked_by_circuit_breaker');
  });

  it('tracks replay progress for resumability', async () => {
    await seedDLQ(500);
    const replayer = new DLQReplayer({ batchSize: 50 });

    // Process 3 batches of 50, then stop
    await replayer.processNBatches(3);
    expect(replayer.checkpoint).toBe(150);

    // Resume from checkpoint
    const replayer2 = new DLQReplayer({ resumeFrom: replayer.checkpoint });
    await replayer2.start();
    expect(replayer2.startedFrom).toBe(150);
  });
});
```
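The pause-on-error-rate behavior these tests pin down can be modeled as a simple batch loop. The sketch below is an assumption about the replayer's internals (`ReplaySketch` and its fields are illustrative names, not the real `DLQReplayer`); only the batch/threshold behavior the tests describe is modeled:

```typescript
// Illustrative replay loop: batched draining with an error-rate pause.
type ReplayState = 'running' | 'paused' | 'done';

class ReplaySketch {
  state: ReplayState = 'running';
  batchesProcessed = 0;

  constructor(private batchSize: number, private errorThreshold: number) {}

  run(messages: boolean[] /* true = replays cleanly */): void {
    for (let i = 0; i < messages.length; i += this.batchSize) {
      const batch = messages.slice(i, i + this.batchSize);
      const failures = batch.filter(ok => !ok).length;
      this.batchesProcessed++;
      // Pause (rather than drop) when the downstream error rate spikes,
      // so a struggling correlation engine is not hammered by the replay.
      if (failures / batch.length > this.errorThreshold) {
        this.state = 'paused';
        return;
      }
    }
    this.state = 'done';
  }
}

const replayer = new ReplaySketch(100, 0.1);
// 1000 messages with a 15% failure rate: expect a pause on the first batch.
const msgs = Array.from({ length: 1000 }, (_, i) => i % 100 >= 15);
replayer.run(msgs);
console.log(replayer.state, replayer.batchesProcessed); // paused 1
```

The key design choice is that a pause preserves the remaining DLQ messages in place; nothing is acknowledged until its batch replays cleanly.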

### 11.7 Multi-Tenancy Isolation (DynamoDB)

```typescript
describe('DynamoDB Tenant Isolation', () => {
  it('all DAO methods require tenantId parameter', () => {
    // Static source check: every public DAO method must take tenantId
    // as its first parameter
    const daoSource = fs.readFileSync('src/data/incident-dao.ts', 'utf8');
    const methods = extractPublicMethods(daoSource);
    for (const method of methods) {
      expect(method.params[0].name).toBe('tenantId');
    }
  });

  it('query for tenant A returns zero results for tenant B data', async () => {
    const dao = new IncidentDAO(dynamoClient);
    await dao.create('tenant-A', makeIncident());
    await dao.create('tenant-B', makeIncident());

    const results = await dao.list('tenant-A');
    expect(results).toHaveLength(1);
    expect(results.every(r => r.tenantId === 'tenant-A')).toBe(true);
  });

  it('partition key always includes tenantId prefix', async () => {
    const dao = new IncidentDAO(dynamoClient);
    await dao.create('tenant-X', makeIncident());

    // Read the raw DynamoDB item and verify the single-table key shape
    const result = await dynamoClient.scan({ TableName: 'dd0c-alert-main' });
    expect(result.Items[0].PK.S).toMatch(/^TENANT#tenant-X/);
  });
});
```
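The key shape these tests assert on can be centralized in a single key-builder so no DAO method can construct a partition key without a tenant. This is a hypothetical sketch: the `TENANT#`/`INCIDENT#` prefixes match what the tests check, but the `incidentKey` helper and the sort-key format are assumptions, not the real DAO:

```typescript
// Hypothetical single-table key builder. Failing closed on a missing
// tenantId means a bug produces an error, never a cross-tenant read.
function incidentKey(tenantId: string, incidentId: string): { PK: string; SK: string } {
  if (!tenantId) throw new Error('tenantId is required');
  return {
    PK: `TENANT#${tenantId}`,      // partition key: always tenant-prefixed
    SK: `INCIDENT#${incidentId}`,  // sort key format is an assumption
  };
}

const key = incidentKey('tenant-X', 'inc-123');
console.log(key.PK, key.SK); // TENANT#tenant-X INCIDENT#inc-123
```

Routing every read and write through one builder also makes the static source check above cheaper to maintain: only one function needs auditing for key construction.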

### 11.8 Slack Circuit Breaker

```typescript
describe('Slack Notification Circuit Breaker', () => {
  let slackClient: SlackClient;

  beforeEach(() => {
    slackClient = new SlackClient({ circuitBreakerThreshold: 10 });
  });

  it('opens circuit after 10 consecutive 429s from Slack', async () => {
    for (let i = 0; i < 10; i++) {
      mockSlack.respondWith(429);
      await slackClient.send(makeMessage()).catch(() => {});
    }
    expect(slackClient.circuitState).toBe('open');
  });

  it('queues notifications while circuit is open', async () => {
    slackClient.openCircuit();
    await slackClient.send(makeMessage());
    expect(slackClient.queuedMessages).toBe(1);
  });

  it('half-opens circuit after 60 seconds', async () => {
    slackClient.openCircuit();
    clock.advanceBy(61000);
    expect(slackClient.circuitState).toBe('half-open');
  });

  it('drains queue on successful half-open probe', async () => {
    slackClient.openCircuit();
    slackClient.queue(makeMessage());
    slackClient.queue(makeMessage());
    clock.advanceBy(61000);
    mockSlack.respondWith(200);
    await slackClient.probe();
    expect(slackClient.circuitState).toBe('closed');
    expect(slackClient.queuedMessages).toBe(0);
  });
});
```
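The state machine behind these tests is small enough to sketch end to end. The thresholds (10 consecutive failures to open, 60s cool-down to half-open) match the tests above, but `BreakerSketch` and its internals are illustrative assumptions, not the real `SlackClient`:

```typescript
// Minimal circuit-breaker state machine: closed -> open -> half-open -> closed.
type CircuitState = 'closed' | 'open' | 'half-open';

class BreakerSketch {
  state: CircuitState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(private threshold = 10, private cooldownMs = 60_000) {}

  recordFailure(nowMs: number): void {
    this.consecutiveFailures++;
    if (this.state === 'closed' && this.consecutiveFailures >= this.threshold) {
      this.state = 'open';
      this.openedAt = nowMs;
    }
  }

  recordSuccess(): void {
    // A successful half-open probe (or any success) closes the circuit.
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  // Called before each send attempt; false means "queue instead of send".
  allowRequest(nowMs: number): boolean {
    if (this.state === 'open' && nowMs - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // let exactly one probe through
    }
    return this.state !== 'open';
  }
}

const breaker = new BreakerSketch();
for (let i = 0; i < 10; i++) breaker.recordFailure(1_000);
console.log(breaker.state);                               // open
console.log(breaker.allowRequest(62_000), breaker.state); // true half-open
breaker.recordSuccess();
console.log(breaker.state);                               // closed
```

Queuing (rather than dropping) while open is the safety-critical choice here: the breaker protects Slack's rate limits without ever eating a notification.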

### 11.9 Updated Test Pyramid (Post-Review)

| Level       | Original   | Revised    | Rationale                                          |
|-------------|------------|------------|----------------------------------------------------|
| Unit        | 70% (~140) | 65% (~180) | More tests total, but integration share grows      |
| Integration | 20% (~40)  | 25% (~70)  | Dashboard API, tenant isolation, trace propagation |
| E2E         | 10% (~20)  | 10% (~28)  | Dashboard UI (Playwright), onboarding flow         |

End of P3 Review Remediation Addendum