dd0c/products/03-alert-intelligence/architecture/architecture.md
Max Mayfield 5ee95d8b13 dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00


dd0c/alert — Technical Architecture

Alert Intelligence Platform

Version: 1.0 | Date: 2026-02-28 | Phase: 6 — Architecture | Author: dd0c Engineering


1. SYSTEM OVERVIEW

1.1 High-Level Architecture

graph TB
    subgraph Providers["Alert Sources"]
        PD[PagerDuty]
        DD[Datadog]
        GF[Grafana]
        OG[OpsGenie]
        CW[Custom Webhooks]
    end

    subgraph CICD["CI/CD Sources"]
        GHA[GitHub Actions]
        GLC[GitLab CI]
        ARGO[ArgoCD]
    end

    subgraph Ingestion["Ingestion Layer (API Gateway + Lambda)"]
        WH[Webhook Receiver<br/>POST /v1/wh/:tenant_id/:provider]
        HMAC[HMAC Validator]
        NORM[Payload Normalizer]
        SCHEMA[Canonical Schema<br/>Mapper]
    end

    subgraph Queue["Event Bus (SQS/SNS)"]
        ALERT_Q[alert-ingested<br/>SQS FIFO]
        DEPLOY_Q[deploy-event<br/>SQS FIFO]
        CORR_Q[correlation-request<br/>SQS Standard]
        NOTIFY_Q[notification<br/>SQS Standard]
    end

    subgraph Processing["Processing Layer (ECS Fargate)"]
        CE[Correlation Engine]
        DT[Deployment Tracker]
        SE[Suggestion Engine]
        NS[Notification Service]
    end

    subgraph Storage["Data Layer"]
        DDB[(DynamoDB<br/>Alerts + Tenants)]
        TS[(TimescaleDB on RDS<br/>Time-Series Correlation)]
        CACHE[(ElastiCache Redis<br/>Active Windows)]
        S3[(S3<br/>Raw Payloads + Exports)]
    end

    subgraph Output["Delivery"]
        SLACK[Slack Bot]
        DASH[Dashboard API<br/>CloudFront + S3 SPA]
        API[REST API]
    end

    PD & DD & GF & OG & CW --> WH
    GHA & GLC & ARGO --> WH
    WH --> HMAC --> NORM --> SCHEMA
    SCHEMA -- alerts --> ALERT_Q
    SCHEMA -- deploys --> DEPLOY_Q
    ALERT_Q --> CE
    DEPLOY_Q --> DT
    DT -- deploy context --> CE
    CE -- correlation results --> CORR_Q
    CORR_Q --> SE
    SE -- suggestions --> NOTIFY_Q
    NOTIFY_Q --> NS
    NS --> SLACK
    CE & DT & SE --> DDB & TS
    CE --> CACHE
    SCHEMA --> S3
    DASH & API --> DDB & TS

1.2 Component Inventory

| Component | Responsibility | AWS Service | Scaling Model |
| --- | --- | --- | --- |
| Webhook Receiver | Accept, authenticate, and normalize incoming alert/deploy webhooks | API Gateway HTTP API + Lambda | Auto-scales to 10K concurrent; API Gateway handles burst |
| Payload Normalizer | Transform provider-specific payloads into canonical alert schema | Lambda (part of ingestion) | Stateless, scales with webhook volume |
| Event Bus | Decouple ingestion from processing; buffer during spikes | SQS FIFO (alerts/deploys) + SQS Standard (correlation/notify) | Near-unlimited throughput on Standard queues; FIFO provides ordering guarantees on a per-tenant basis |
| Correlation Engine | Time-window clustering, service-dependency matching, deploy correlation | ECS Fargate (long-running) | Horizontal scaling via ECS Service Auto Scaling |
| Deployment Tracker | Ingest CI/CD events, maintain deploy timeline per service | ECS Fargate (shared with CE) | Co-located with Correlation Engine |
| Suggestion Engine | Score noise, generate grouping suggestions, compute "what would have happened" | ECS Fargate | Scales with correlation output volume |
| Notification Service | Format and deliver Slack messages, digests, weekly reports | Lambda (event-driven from SQS) | Auto-scales; Slack rate limits are the bottleneck |
| Alert Store | Canonical alert records, tenant config, correlation results | DynamoDB | On-demand capacity; single-digit-ms reads |
| Time-Series Store | Correlation windows, alert frequency histograms, trend data | TimescaleDB on RDS (PostgreSQL) | Vertical scaling initially; read replicas at scale |
| Active Window Cache | In-flight correlation windows, recent alert fingerprints | ElastiCache Redis | Single node → cluster at 10K+ alerts/day |
| Raw Payload Archive | Original webhook payloads for audit, replay, and simulation | S3 Standard → S3 IA (30d) → Glacier (90d) | Unlimited; lifecycle policies manage cost |
| Dashboard SPA | React frontend for noise reports, suppression logs, integration management | CloudFront + S3 | Static hosting; API calls to backend |
| REST API | Dashboard backend, alert query, correlation results, admin | API Gateway + Lambda | Auto-scales |

1.3 Technology Choices

| Decision | Choice | Justification |
| --- | --- | --- |
| Compute: Ingestion | AWS Lambda | Webhook traffic is bursty (incident storms). Lambda handles 0→10K RPS without pre-provisioning. Pay-per-invocation keeps costs near zero at low volume. Cold starts are acceptable (<200ms with the Node.js runtime). |
| Compute: Processing | ECS Fargate | The Correlation Engine needs long-running processes with in-memory state (active correlation windows). Lambda's 15-min timeout is insufficient. Fargate means no EC2 management, and the service scales to zero tasks during quiet periods. |
| Queue | SQS (FIFO for ingestion, Standard for downstream) | FIFO guarantees per-tenant ordering for alert ingestion (critical for time-window accuracy). Standard queues serve the correlation→notification path where ordering is less critical. SNS fan-out for multi-consumer patterns. No Kafka overhead for V1. |
| Primary Store | DynamoDB | Single-digit-ms latency for alert lookups. On-demand pricing means paying only for what you use. Partition key = tenant_id, sort key = alert_id or timestamp. Global tables for multi-region (V2+). |
| Time-Series | TimescaleDB on RDS | Correlation windows require time-range queries ("all alerts for tenant X, service Y, in the last 5 minutes"). TimescaleDB's hypertables + continuous aggregates are purpose-built for this. PostgreSQL compatibility means standard tooling. |
| Cache | ElastiCache Redis | Sub-ms reads for active correlation windows. Sorted sets for time-windowed alert lookups. TTL-based expiry for window cleanup. Pub/Sub for real-time correlation triggers. |
| Object Store | S3 | Raw payload archival, alert history exports, simulation data. Lifecycle policies for cost management. Event notifications trigger replay/simulation workflows. |
| API Layer | API Gateway HTTP API | Lower latency and cost than the REST API type. JWT authorizer for API key validation. Built-in throttling per API key (maps to tenant rate limits by tier). |
| Frontend | React SPA on CloudFront | Static hosting = zero server cost. CloudFront edge caching. Dashboard is secondary to Slack (most users never open it), so minimal investment. |
| Language | TypeScript (Node.js 20) | Single language across Lambda + ECS + frontend. Strong typing catches schema-mapping bugs at compile time. Excellent AWS SDK support. Fast cold starts on Lambda. |
| IaC | AWS CDK (TypeScript) | Same language as application code. L2 constructs reduce boilerplate. Synthesizes to CloudFormation for drift detection. |
| CI/CD | GitHub Actions | Free for public repos, cheap for private. Native webhook integration (dd0c/alert dogfoods its own deployment tracking). |

1.4 Webhook-First Ingestion Model (60-Second Time-to-Value)

The entire architecture is designed around a single constraint: a new customer must see their first correlated incident in Slack within 60 seconds of pasting a webhook URL.

The flow:

T+0s    Customer signs up → gets unique webhook URL:
        https://hooks.dd0c.com/v1/wh/{tenant_id}/{provider}

T+5s    Customer pastes URL into Datadog notification channel
        (or PagerDuty webhook extension, or Grafana contact point)

T+10s   First alert fires from their monitoring tool

T+10.1s API Gateway receives POST, Lambda validates HMAC,
        normalizes payload, writes to SQS FIFO

T+10.3s Correlation Engine picks up alert, opens a new
        correlation window (default: 5 minutes)

T+10.5s Alert appears in customer's Slack channel:
        "🔔 New alert: [service] [title] — watching for related alerts..."

T+60s   Window closes (or more alerts arrive). Correlation Engine
        groups related alerts. Suggestion Engine scores noise.

T+61s   Slack message updates in-place:
        "📊 Incident #1: 12 alerts grouped → 1 incident
         Sources: Datadog (8), PagerDuty (4)
         Trigger: Deploy #1042 to payment-service (2 min before first alert)
         Noise score: 87% — we'd suggest suppressing 10 of 12
         👍 Helpful  👎 Not helpful"

Design decisions enabling 60-second TTV:

  1. No SDK, no agent, no credentials. Just a URL. The webhook URL encodes tenant ID and provider type — zero configuration needed.
  2. Pre-provisioned Slack connection. During signup, customer connects Slack via OAuth before getting the webhook URL. Slack is ready before the first alert arrives.
  3. Eager first-alert notification. Don't wait for the correlation window to close. Show the first alert immediately in Slack ("watching for related alerts..."), then update the message in-place when correlation completes. The customer sees activity within seconds.
  4. Default correlation window of 5 minutes. Long enough to catch deploy-correlated alert storms, short enough to deliver results quickly. Configurable per tenant.
  5. Webhook URL as the product. The URL IS the integration. No config files, no YAML, no terraform modules. Copy. Paste. Done.
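Decisions 1 and 5 hinge on everything the ingestion layer needs being recoverable from the URL path alone. A minimal TypeScript sketch of that routing (function and type names are illustrative, not the actual handler):

```typescript
// Sketch: extract routing info from the /v1/wh/{tenant_id}/{provider} path.
interface WebhookRoute {
  tenantId: string;
  provider: string;
}

const SUPPORTED_PROVIDERS = new Set([
  'pagerduty', 'datadog', 'grafana', 'opsgenie',
  'github', 'gitlab', 'argocd', 'custom',
]);

function parseWebhookPath(path: string): WebhookRoute | null {
  // Expected shape: /v1/wh/{tenant_id}/{provider}
  const match = path.match(/^\/v1\/wh\/([A-Za-z0-9_-]+)\/([a-z]+)$/);
  if (!match) return null;
  const [, tenantId, provider] = match;
  if (!SUPPORTED_PROVIDERS.has(provider)) return null;
  return { tenantId, provider };
}
```

An unrecognized provider segment returns `null`, which the receiver would presumably map to a 404 rather than accepting an unparseable payload.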

2. CORE COMPONENTS

2.1 Webhook Ingestion Layer

The ingestion layer is the front door. It must be fast (sub-100ms response to the sending provider), reliable (zero dropped webhooks), and flexible (each supported provider's payload format is handled by a pluggable parser, so adding a provider requires no changes to the rest of the pipeline).

Architecture: API Gateway HTTP API → Lambda function → SQS FIFO

sequenceDiagram
    participant P as Provider (Datadog/PD/etc)
    participant AG as API Gateway
    participant L as Ingestion Lambda
    participant SQS as SQS FIFO
    participant S3 as S3 (Raw Archive)

    P->>AG: POST /v1/wh/{tenant_id}/{provider}
    AG->>L: Invoke (< 50ms)
    L->>L: Validate HMAC signature
    L->>L: Parse provider-specific payload
    L->>L: Map to canonical alert schema
    L-->>S3: Async: store raw payload
    L->>SQS: Send canonical alert message
    L->>P: 200 OK (< 100ms total)

Provider Parsers:

Each supported provider has a dedicated parser module that transforms the provider's webhook payload into the canonical alert schema. Parsers are stateless functions — no database calls, no external dependencies.

// Provider parser interface
interface ProviderParser {
  provider: string;
  validateSignature(headers: Headers, body: string, secret: string): boolean;
  parse(payload: unknown): CanonicalAlert[];
  // Some providers send batched payloads (Datadog sends arrays)
}

// Registry — new providers added by implementing the interface
const parsers: Record<string, ProviderParser> = {
  'pagerduty':  new PagerDutyParser(),
  'datadog':    new DatadogParser(),
  'grafana':    new GrafanaParser(),
  'opsgenie':   new OpsGenieParser(),
  'github':     new GitHubActionsParser(),   // deploy events
  'gitlab':     new GitLabCIParser(),        // deploy events
  'argocd':     new ArgoCDParser(),          // deploy events
  'custom':     new CustomWebhookParser(),   // user-defined mapping
};

HMAC Validation per Provider:

| Provider | Signature Header | Algorithm | Payload |
| --- | --- | --- | --- |
| PagerDuty | X-PagerDuty-Signature | HMAC-SHA256 | Raw body |
| Datadog | DD-WEBHOOK-SIGNATURE | HMAC-SHA256 | Raw body |
| Grafana | X-Grafana-Alerting-Signature | HMAC-SHA256 | Raw body |
| OpsGenie | X-OpsGenie-Signature | HMAC-SHA256 | Raw body |
| GitHub | X-Hub-Signature-256 | HMAC-SHA256 | Raw body |
| GitLab | X-Gitlab-Token | Token comparison | N/A |
| Custom | X-DD0C-Signature | HMAC-SHA256 | Raw body |
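Most rows above share the same check: HMAC-SHA256 over the raw body, compared in constant time (GitHub prefixes its hex digest with `sha256=`; the others are assumed to send the bare hex). A minimal sketch using Node's crypto module (function name is illustrative):

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify an HMAC-SHA256 hex signature computed over the raw request body.
function verifyHmacSha256(rawBody: string, secret: string, signatureHeader: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex');
  const provided = signatureHeader.replace(/^sha256=/, ''); // GitHub's prefix
  const a = Buffer.from(provided, 'hex');
  const b = Buffer.from(expected, 'hex');
  // Length check first: timingSafeEqual throws on unequal-length buffers.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```

The constant-time compare matters: a naive `===` on the hex strings can leak timing information about how many leading bytes matched.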

Canonical Alert Schema:

interface CanonicalAlert {
  // Identity
  alert_id: string;          // dd0c-generated ULID (sortable, unique)
  tenant_id: string;         // From webhook URL path
  provider: string;          // 'pagerduty' | 'datadog' | 'grafana' | 'opsgenie' | 'custom'
  provider_alert_id: string; // Original alert ID from provider
  provider_incident_id?: string; // Provider's incident grouping (if any)

  // Classification
  severity: 'critical' | 'high' | 'medium' | 'low' | 'info';
  status: 'triggered' | 'acknowledged' | 'resolved';
  category: 'infrastructure' | 'application' | 'security' | 'deployment' | 'custom';

  // Context
  service: string;           // Normalized service name
  environment: string;       // 'production' | 'staging' | 'development'
  title: string;             // Human-readable alert title
  description?: string;      // Alert body/details (may be stripped in privacy mode)
  tags: Record<string, string>; // Normalized key-value tags

  // Fingerprint (for dedup)
  fingerprint: string;       // SHA-256 of (tenant_id + provider + service + title_normalized)

  // Timestamps
  triggered_at: string;      // ISO 8601 — when the alert originally fired
  received_at: string;       // ISO 8601 — when dd0c received the webhook
  resolved_at?: string;      // ISO 8601 — when resolved (if status=resolved)

  // Metadata
  schema_version: number;     // Canonical schema version (see §3.1)
  raw_payload_s3_key: string; // S3 key for the original payload
  source_url?: string;       // Deep link back to the alert in the provider's UI
}

Key design decisions:

  1. ULID for alert_id. ULIDs are lexicographically sortable by time (unlike UUIDv4), which makes DynamoDB range queries efficient and eliminates the need for a secondary index on timestamp.
  2. Fingerprint for dedup. The fingerprint is a deterministic hash of the alert's identity fields. Two alerts with the same fingerprint from the same provider within a correlation window are duplicates. This catches the "same alert firing every 30 seconds" pattern without ML.
  3. Severity normalization. Each provider uses different severity scales (Datadog: P1-P5, PagerDuty: critical/high/low, Grafana: alerting/ok). The parser normalizes to a 5-level scale. Mapping is configurable per tenant.
  4. Privacy mode. When enabled, description is set to null and tags are hashed. Only structural metadata (service, severity, timestamp) is stored. Reduces intelligence slightly but eliminates sensitive data concerns.
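A sketch of the fingerprint from point 2. The input fields are fixed above; the title-normalization rules shown here (lowercasing, collapsing numbers, squeezing whitespace) are illustrative assumptions:

```typescript
import { createHash } from 'node:crypto';

// Normalize a title so "CPU > 93%" and "CPU > 95%" produce the same fingerprint.
// These exact rules are an assumption; the doc only specifies the input fields.
function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    .replace(/\d+/g, 'N')   // collapse numbers so thresholds don't split fingerprints
    .replace(/\s+/g, ' ')
    .trim();
}

// SHA-256 of (tenant_id + provider + service + title_normalized), per §2.1.
function fingerprint(tenantId: string, provider: string, service: string, title: string): string {
  const identity = [tenantId, provider, service, normalizeTitle(title)].join('|');
  return createHash('sha256').update(identity, 'utf8').digest('hex');
}
```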

2.2 Correlation Engine

The Correlation Engine is the core intelligence of dd0c/alert. It takes a stream of canonical alerts and produces correlated incidents — groups of related alerts that represent a single underlying issue.

Architecture: ECS Fargate service consuming from SQS FIFO, maintaining active correlation windows in Redis, writing results to DynamoDB + TimescaleDB.

graph LR
    subgraph Input
        SQS[SQS FIFO<br/>alert-ingested]
    end

    subgraph CE["Correlation Engine (ECS Fargate)"]
        RECV[Message Receiver]
        FP[Fingerprint Dedup]
        TW[Time-Window<br/>Correlator]
        SDG[Service Dependency<br/>Graph Matcher]
        DC[Deploy Correlation]
        SCORE[Incident Scorer]
    end

    subgraph State
        REDIS[(Redis<br/>Active Windows)]
        DDB[(DynamoDB<br/>Incidents)]
        TSDB[(TimescaleDB<br/>Time-Series)]
    end

    SQS --> RECV
    RECV --> FP --> TW --> SDG --> DC --> SCORE
    TW <--> REDIS
    SDG <--> DDB
    DC <--> DDB
    SCORE --> DDB & TSDB
    SCORE --> CORR_Q[SQS: correlation-request]

Correlation Pipeline (executed per alert):

Alert arrives
  │
  ├─ Step 1: FINGERPRINT DEDUP
  │   Is there an alert with the same fingerprint in the active window?
  │   YES → Increment count on existing alert, skip to scoring
  │   NO  → Continue
  │
  ├─ Step 2: TIME-WINDOW CORRELATION
  │   Find all open correlation windows for this tenant
  │   Does this alert fall within an existing window's time range + service scope?
  │   YES → Add alert to existing window
  │   NO  → Open a new correlation window (default: 5 min, configurable)
  │
  ├─ Step 3: SERVICE-DEPENDENCY MATCHING
  │   Does the service dependency graph show a relationship between
  │   this alert's service and services in any open window?
  │   YES → Merge windows (upstream DB alert + downstream API errors = one incident)
  │   NO  → Keep windows separate
  │
  ├─ Step 4: DEPLOY CORRELATION
  │   Was there a deployment to this service (or an upstream dependency)
  │   within the lookback period (default: 15 min)?
  │   YES → Tag the correlation window with deploy context
  │   NO  → Continue
  │
  └─ Step 5: INCIDENT SCORING
      When a correlation window closes (timeout or manual trigger):
      - Count total alerts in window
      - Count unique services affected
      - Calculate noise score (0-100)
      - Generate incident summary
      - Write Incident record to DynamoDB
      - Emit to correlation-request queue for Suggestion Engine

Time-Window Correlation — The Algorithm:

interface CorrelationWindow {
  window_id: string;           // ULID
  tenant_id: string;
  opened_at: string;           // ISO 8601
  closes_at: string;           // opened_at + window_duration
  window_duration_ms: number;  // Default 300000 (5 min), configurable
  status: 'open' | 'closed';

  // Alerts in this window
  alert_ids: string[];
  alert_count: number;
  unique_fingerprints: number;

  // Services involved
  services: Set<string>;
  environments: Set<string>;

  // Deploy context (if matched)
  deploy_event_id?: string;
  deploy_service?: string;
  deploy_pr?: string;
  deploy_author?: string;
  deploy_timestamp?: string;

  // Scoring (computed on close)
  noise_score?: number;        // 0-100 (100 = pure noise)
  severity_max?: string;       // Highest severity alert in window
}

Window management in Redis:

# Active windows stored as Redis sorted sets (score = closes_at timestamp)
ZADD tenant:{tenant_id}:windows {closes_at_epoch} {window_id}

# Alert-to-window mapping
SADD window:{window_id}:alerts {alert_id}

# Service-to-window index (for dependency matching)
SADD tenant:{tenant_id}:service:{service_name}:windows {window_id}

# Window metadata as hash
HSET window:{window_id} opened_at ... closes_at ... alert_count ...

# TTL on all keys = window_duration + 1 hour (cleanup buffer)

Window extension logic: If a new alert arrives for an open window within the last 30 seconds of the window's lifetime, the window extends by 2 minutes (up to a maximum of 15 minutes total). This catches cascading failures where alerts trickle in over time.
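The extension rule reduces to a pure function over timestamps (Redis wiring omitted; constant and function names are illustrative):

```typescript
// Extension rule from above: +2 min when an alert lands in the final 30s of
// the window, capped at 15 min total from when the window opened.
const EXTENSION_MS = 2 * 60 * 1000;
const TAIL_MS = 30 * 1000;
const MAX_TOTAL_MS = 15 * 60 * 1000;

function maybeExtendWindow(openedAtMs: number, closesAtMs: number, alertAtMs: number): number {
  const inTail = alertAtMs >= closesAtMs - TAIL_MS && alertAtMs < closesAtMs;
  if (!inTail) return closesAtMs;                 // no change outside the tail
  const hardCap = openedAtMs + MAX_TOTAL_MS;      // 15-min ceiling
  return Math.min(closesAtMs + EXTENSION_MS, hardCap);
}
```

In Redis terms, an extension would be a `ZADD` with the new `closes_at` score plus an `HSET` on the window hash.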

Service Dependency Graph:

The dependency graph is built from two sources:

  1. Inferred from alert patterns. If Service A alerts consistently fire 1-3 minutes before Service B alerts, dd0c infers a dependency (A → B). Requires 3+ occurrences to establish.
  2. Explicit configuration. Customers can declare dependencies via the dashboard or API: payment-service → notification-service → email-service.

interface ServiceDependency {
  tenant_id: string;
  upstream_service: string;
  downstream_service: string;
  source: 'inferred' | 'explicit';
  confidence: number;          // 0.0-1.0 (inferred only)
  occurrence_count: number;    // Times this pattern was observed
  last_seen: string;           // ISO 8601
}

Stored in DynamoDB with GSI on tenant_id + upstream_service for fast lookups during correlation.

2.3 Deployment Tracker

The Deployment Tracker ingests CI/CD webhook events and maintains a timeline of deployments per service per tenant. The Correlation Engine queries this timeline to answer: "Was there a deploy to this service (or its dependencies) in the last N minutes?"

Deploy Event Schema:

interface DeployEvent {
  deploy_id: string;           // ULID
  tenant_id: string;
  provider: 'github' | 'gitlab' | 'argocd' | 'custom';
  provider_deploy_id: string;  // e.g., GitHub Actions run_id

  // What was deployed
  service: string;             // Target service name
  environment: string;         // production | staging | development
  version?: string;            // Git SHA, tag, or version string

  // Who and what
  author: string;              // Git commit author or deployer
  commit_sha: string;
  commit_message?: string;
  pr_number?: string;
  pr_url?: string;

  // Timing
  started_at: string;          // ISO 8601
  completed_at?: string;       // ISO 8601
  status: 'in_progress' | 'success' | 'failure' | 'cancelled';

  // Metadata
  source_url: string;          // Link to CI/CD run
  changes_summary?: string;    // Files changed, lines added/removed
}

Deploy-to-Alert Correlation Logic:

When Correlation Engine processes an alert:
  1. Query DynamoDB: all deploys for tenant where
     service IN (alert.service, ...upstream_dependencies)
     AND completed_at > (alert.triggered_at - lookback_window)
     AND completed_at < alert.triggered_at
     AND environment = alert.environment

  2. If match found:
     - Attach deploy context to correlation window
     - Boost noise_score by 15-30 points (deploy-correlated alerts
       are more likely to be transient noise)
     - Include deploy details in Slack incident card

  3. Lookback window defaults:
     - Production: 15 minutes
     - Staging: 30 minutes
     - Configurable per tenant
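The step-1 match can be sketched as a predicate over already-fetched deploy records (the DynamoDB query itself is omitted; the pared-down types and function name are ours):

```typescript
// Minimal shapes for the correlation check; timestamps are ISO 8601 per the schemas.
interface DeployRef { service: string; environment: string; completed_at: string; }
interface AlertRef  { service: string; environment: string; triggered_at: string; }

// True when the deploy hit the alert's service (or one of its upstream
// dependencies) in the same environment, inside the lookback window.
function isDeployCorrelated(
  deploy: DeployRef,
  alert: AlertRef,
  upstreamDeps: string[],
  lookbackMs: number,
): boolean {
  const serviceMatch = deploy.service === alert.service || upstreamDeps.includes(deploy.service);
  if (!serviceMatch || deploy.environment !== alert.environment) return false;
  const completed = Date.parse(deploy.completed_at);
  const triggered = Date.parse(alert.triggered_at);
  return completed < triggered && completed > triggered - lookbackMs;
}
```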

Service name mapping challenge:

The biggest practical challenge is mapping CI/CD service names to monitoring service names. GitHub Actions might deploy payment-api while Datadog monitors prod-payment-api-us-east-1. Solutions:

  1. Convention-based matching. Strip common prefixes/suffixes (prod-, -us-east-1, -service). Fuzzy match on the core name.
  2. Explicit mapping. Dashboard UI where customers map: GitHub: payment-api → Datadog: prod-payment-api-us-east-1.
  3. Tag-based matching. If both CI/CD and monitoring use a common tag (e.g., dd0c.service=payment), match on that.

V1 uses convention-based + explicit mapping. Tag-based matching added in V2.
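A sketch of convention-based matching (solution 1). The specific prefix/suffix lists are illustrative assumptions; a real deployment would make them tenant-configurable:

```typescript
// Strip common environment prefixes, role suffixes, and region suffixes,
// then compare the remaining core names.
const PREFIXES = /^(prod|staging|dev)-/;
const SUFFIXES = /-(service|api)$|-(us|eu|ap)-[a-z]+-\d+$/;

function coreServiceName(name: string): string {
  let core = name.toLowerCase().replace(PREFIXES, '');
  // Peel suffixes until a fixed point: "payment-api-us-east-1" → "payment-api" → "payment"
  let prev: string;
  do {
    prev = core;
    core = core.replace(SUFFIXES, '');
  } while (core !== prev);
  return core;
}

function conventionMatch(ciName: string, monitoringName: string): boolean {
  return coreServiceName(ciName) === coreServiceName(monitoringName);
}
```

This matches the example above: `payment-api` and `prod-payment-api-us-east-1` both reduce to `payment`.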

2.4 Suggestion Engine

The Suggestion Engine takes correlated incidents from the Correlation Engine and generates actionable suggestions. In V1, this is strictly observe-and-suggest — no auto-action.

Suggestion Types:

type SuggestionType =
  | 'group'          // "These 12 alerts are one incident"
  | 'suppress'       // "We'd suppress these 8 alerts (deploy noise)"
  | 'tune'           // "This alert fires 40x/week and is never actioned — consider tuning"
  | 'dependency'     // "Service A alerts always precede Service B — likely upstream dependency"
  | 'runbook'        // "This pattern was resolved by [runbook] 5 times before" (V2+)
  ;

interface Suggestion {
  suggestion_id: string;
  tenant_id: string;
  incident_id: string;
  type: SuggestionType;
  confidence: number;          // 0.0-1.0
  title: string;               // Human-readable summary
  reasoning: string;           // Plain-English explanation of WHY
  affected_alert_ids: string[];
  action_taken: 'none';        // V1: always 'none' (observe-only)
  user_feedback?: 'helpful' | 'not_helpful' | null;
  created_at: string;
}

Noise Scoring Algorithm (V1 — Rule-Based):

function calculateNoiseScore(window: CorrelationWindow): number {
  let score = 0;

  // Factor 1: Duplicate fingerprints (0-30 points)
  const dupRatio = 1 - (window.unique_fingerprints / window.alert_count);
  score += Math.round(dupRatio * 30);

  // Factor 2: Deploy correlation (0-25 points)
  if (window.deploy_event_id) {
    score += 25;
    // Bonus if the deploy PR title/branch suggests a config or feature-flag change
    if (window.deploy_pr?.includes('config') || window.deploy_pr?.includes('feature-flag')) {
      score += 5;
    }
  }

  // Factor 3: Historical pattern (0-20 points)
  // If this service+alert combination has fired and auto-resolved
  // N times in the past 7 days without human action
  const autoResolveRate = getAutoResolveRate(window.tenant_id, window.services);
  score += Math.round(autoResolveRate * 20);

  // Factor 4: Severity distribution (0-15 points)
  // Windows with only low/info severity alerts score higher
  if (window.severity_max === 'low' || window.severity_max === 'info') {
    score += 15;
  } else if (window.severity_max === 'medium') {
    score += 8;
  }
  // critical/high = 0 bonus (never boost noise score for critical alerts)

  // Factor 5: Time of day (0-10 points)
  // Alerts during deploy windows (10am-4pm weekdays) are more likely noise
  const opened = new Date(window.opened_at);
  const hour = opened.getUTCHours();
  const isWeekday = opened.getUTCDay() >= 1 && opened.getUTCDay() <= 5;
  const isBusinessHours = hour >= 14 && hour <= 23; // ~10am-4pm across US timezones
  if (isBusinessHours && isWeekday) score += 10;

  // Cap at 100, floor at 0
  return Math.max(0, Math.min(100, score));
}

"Never Suppress" Safelist (Default):

These alert categories are never scored above 50 (noise), regardless of pattern matching:

const NEVER_SUPPRESS_DEFAULTS = [
  { category: 'security', reason: 'Security alerts require human review' },
  { severity: 'critical', reason: 'Critical severity always surfaces' },
  { service_pattern: /database|db|rds|dynamo/i, reason: 'Database alerts are high-risk' },
  { service_pattern: /payment|billing|stripe|checkout/i, reason: 'Payment path alerts are business-critical' },
  { title_pattern: /data.?loss|corruption|breach/i, reason: 'Data integrity alerts are never noise' },
];

Configurable per tenant. Customers can add/remove patterns. Defaults are opt-out (must explicitly remove).
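A sketch of how the 50-point cap might be enforced after scoring. The rule shapes mirror the defaults above; the flattened window shape and function name are our assumptions:

```typescript
// A rule can match on category, max severity, or regex patterns over the
// window's services and alert titles — same shapes as NEVER_SUPPRESS_DEFAULTS.
interface SafelistRule {
  category?: string;
  severity?: string;
  service_pattern?: RegExp;
  title_pattern?: RegExp;
}

interface ScoredWindow {
  category: string;
  severity_max: string;
  services: string[];
  titles: string[];
}

// If any safelist rule matches, the noise score is capped at 50.
function applyNeverSuppressCap(score: number, w: ScoredWindow, rules: SafelistRule[]): number {
  const isProtected = rules.some(r =>
    (r.category !== undefined && r.category === w.category) ||
    (r.severity !== undefined && r.severity === w.severity_max) ||
    (r.service_pattern !== undefined && w.services.some(s => r.service_pattern!.test(s))) ||
    (r.title_pattern !== undefined && w.titles.some(t => r.title_pattern!.test(t)))
  );
  return isProtected ? Math.min(score, 50) : score;
}
```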

2.5 Notification Service

The Notification Service is the primary delivery mechanism. Slack is the V1 interface — most engineers never open the dashboard.

Slack Message Format (Incident Card):

📊 Incident #47 — payment-service
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔴 Severity: HIGH | Noise Score: 82/100

12 alerts → 1 incident
├─ Datadog: 8 alerts (latency spike, error rate, CPU)
├─ PagerDuty: 3 pages (payment-service P2)
└─ Grafana: 1 alert (downstream notification-service)

🚀 Deploy detected: PR #1042 "Add retry logic to payment processor"
   by @marcus • merged 3 min before first alert
   https://github.com/acme/payment-service/pull/1042

💡 Suggestion: This looks like deploy noise.
   Similar pattern seen 4 times this month — auto-resolved within 8 min each time.
   We'd suppress 10 of 12 alerts. [What would change →]

👍 Helpful   👎 Not helpful   🔇 Mute this pattern

Notification Types:

| Type | Trigger | Channel |
| --- | --- | --- |
| Incident Card | Correlation window closes with grouped alerts | Configured Slack channel |
| Real-time Alert | First alert in a new window (eager notification) | Configured Slack channel |
| Daily Digest | 9:00 AM in tenant's timezone | Configured Slack channel or DM |
| Weekly Noise Report | Monday 9:00 AM | Configured Slack channel + email to admin |
| Integration Health | Webhook volume drops to zero for >2 hours | DM to integration owner |

Daily Digest Format:

📋 dd0c/alert Daily Digest — Feb 28, 2026
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Yesterday: 247 alerts → 18 incidents
Noise ratio: 87% | You'd have been paged 18 times instead of 247

Top noisy alerts:
1. checkout-latency (Datadog) — 43 fires, 0 incidents → 🔇 Tune candidate
2. disk-usage-warning (Grafana) — 31 fires, 0 incidents → 🔇 Tune candidate
3. auth-service-timeout (PagerDuty) — 22 fires, 1 real incident

Deploy-correlated noise: 67% of alerts fired within 15 min of a deploy
Noisiest deploy: PR #1038 "Update feature flags" triggered 34 alerts

💰 Estimated savings: 4.2 engineering hours not spent triaging noise

Slack Integration Architecture:

  • OAuth 2.0 flow during onboarding (Slack App Directory compliant)
  • Bot token stored encrypted in DynamoDB (per-tenant)
  • Uses Slack chat.postMessage for new messages, chat.update for in-place updates
  • Block Kit for rich formatting
  • Interactive components (buttons) handled via Slack Events API → API Gateway → Lambda
  • Rate limiting: Slack allows 1 message/second per channel. Batch notifications during incident storms (queue in SQS, drain at 1/sec)
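The queue-and-drain idea in the last bullet reduces to spacing send times at least one second apart per channel. A sketch over arrival timestamps in milliseconds (the actual SQS consumer would do this with visibility timeouts; names are ours):

```typescript
// Given message arrival times for ONE channel, compute send times that
// respect a minimum gap (default 1000ms, matching Slack's ~1 msg/sec/channel).
function scheduleSends(arrivalsMs: number[], minGapMs = 1000): number[] {
  const sends: number[] = [];
  let earliestNext = 0;
  for (const t of [...arrivalsMs].sort((a, b) => a - b)) {
    const sendAt = Math.max(t, earliestNext); // send immediately unless rate-limited
    sends.push(sendAt);
    earliestNext = sendAt + minGapMs;
  }
  return sends;
}
```

During an incident storm, messages arriving in a burst drain at the limit rather than triggering Slack 429s; in-place `chat.update` calls help here too, since an updated incident card replaces what would otherwise be several new messages.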

3. DATA ARCHITECTURE

3.1 Canonical Alert Schema (Provider-Agnostic)

The canonical schema (defined in §2.1) is the single source of truth for all alert data. Every provider's payload is normalized into this schema at ingestion time. Downstream components never touch raw provider payloads.

Schema evolution strategy: The canonical schema uses a schema_version field (integer, starting at 1). When the schema changes:

  1. New fields are always optional (backward compatible)
  2. Removed fields are deprecated for 2 versions before removal
  3. The ingestion Lambda writes the current schema version; consumers handle version differences gracefully
  4. DynamoDB items carry their schema version — no backfill migrations needed
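Rule 3 (version-tolerant consumers) might look like an upgrade-on-read helper. `schema_version` is from the doc; the v2 field used here is a hypothetical example:

```typescript
// Two illustrative schema versions; only schema_version itself is from the doc.
interface StoredAlertV1 { schema_version: 1; title: string; }
interface StoredAlertV2 { schema_version: 2; title: string; source_url?: string; }
type StoredAlert = StoredAlertV1 | StoredAlertV2;

// Upgrade-on-read: normalize any stored version to the current shape.
// Because new fields are optional (rule 1), a v1 item simply lacks them.
function toCurrent(alert: StoredAlert): StoredAlertV2 {
  if (alert.schema_version === 2) return alert;
  return { schema_version: 2, title: alert.title };
}
```

Consumers always see the current shape; items written years apart never need a backfill migration (rule 4).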

3.2 Event Sourcing for Alert History

All alert data follows an event-sourcing pattern. The raw event stream is the source of truth; materialized views are derived and rebuildable.

Event Stream:

type AlertEvent =
  | { type: 'alert.received';    alert: CanonicalAlert; }
  | { type: 'alert.deduplicated'; alert_id: string; original_alert_id: string; }
  | { type: 'alert.correlated';  alert_id: string; window_id: string; }
  | { type: 'alert.resolved';    alert_id: string; resolved_at: string; }
  | { type: 'window.opened';     window: CorrelationWindow; }
  | { type: 'window.extended';   window_id: string; new_closes_at: string; }
  | { type: 'window.closed';     window_id: string; incident_id: string; }
  | { type: 'incident.created';  incident: Incident; }
  | { type: 'suggestion.created'; suggestion: Suggestion; }
  | { type: 'feedback.received'; suggestion_id: string; feedback: 'helpful' | 'not_helpful'; }
  | { type: 'deploy.received';   deploy: DeployEvent; }
  ;

Storage:

| Store | What | Why | Retention |
| --- | --- | --- | --- |
| S3 (raw payloads) | Original webhook bodies, exactly as received | Audit trail, replay, simulation mode, debugging provider parser issues | Free: 7d, Pro: 90d, Business: 1yr, Enterprise: custom |
| DynamoDB (events) | Canonical alert events, incidents, suggestions, feedback | Primary operational store; fast reads for dashboard, API, correlation lookups | Free: 7d, Pro: 90d, Business: 1yr |
| TimescaleDB (time-series) | Alert counts, noise ratios, correlation metrics per time bucket | Trend analysis, Noise Report Card, business impact dashboard | 1yr rolling (continuous aggregates compress older data) |
| Redis (ephemeral) | Active correlation windows, recent fingerprints, rate counters | Real-time correlation state; ephemeral by design — rebuilt from DynamoDB on cold start | TTL-based: window_duration + 1hr |

Replay capability: Because raw payloads are archived in S3, the entire alert history can be replayed through the ingestion pipeline. This enables:

  • Alert Simulation Mode: Upload historical exports → replay through correlation engine → show "what would have happened"
  • Parser upgrades: When a provider parser is improved, replay recent payloads to backfill better-normalized data
  • Correlation tuning: Replay last 30 days with different window durations to find optimal settings per tenant

3.3 Time-Series Storage for Correlation Windows

TimescaleDB handles all time-range queries that DynamoDB's key-value model handles poorly.

Hypertables:

-- Alert time-series (one row per alert)
CREATE TABLE alert_timeseries (
  tenant_id    TEXT        NOT NULL,
  alert_id     TEXT        NOT NULL,
  service      TEXT        NOT NULL,
  severity     TEXT        NOT NULL,
  provider     TEXT        NOT NULL,
  fingerprint  TEXT        NOT NULL,
  triggered_at TIMESTAMPTZ NOT NULL,
  received_at  TIMESTAMPTZ NOT NULL,
  noise_score  SMALLINT,
  incident_id  TEXT,
  PRIMARY KEY (tenant_id, triggered_at, alert_id)
);
SELECT create_hypertable('alert_timeseries', 'triggered_at',
  chunk_time_interval => INTERVAL '1 day');

-- Continuous aggregate: hourly alert counts per tenant/service
CREATE MATERIALIZED VIEW alert_hourly
WITH (timescaledb.continuous) AS
SELECT
  tenant_id,
  service,
  time_bucket('1 hour', triggered_at) AS bucket,
  COUNT(*)                             AS alert_count,
  COUNT(DISTINCT fingerprint)          AS unique_alerts,
  COUNT(DISTINCT incident_id)          AS incident_count,
  AVG(noise_score)                     AS avg_noise_score
FROM alert_timeseries
GROUP BY tenant_id, service, bucket;

-- Continuous aggregate: daily noise report per tenant
CREATE MATERIALIZED VIEW noise_daily
WITH (timescaledb.continuous) AS
SELECT
  tenant_id,
  time_bucket('1 day', triggered_at)   AS bucket,
  COUNT(*)                              AS total_alerts,
  COUNT(DISTINCT incident_id)           AS total_incidents,
  ROUND(100.0 * (1 - COUNT(DISTINCT incident_id)::NUMERIC / NULLIF(COUNT(*), 0)), 1)
                                        AS noise_pct,
  AVG(noise_score)                      AS avg_noise_score
FROM alert_timeseries
GROUP BY tenant_id, bucket;
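
The noise percentage in `noise_daily` is simply the share of alerts that never joined an incident. Restated in TypeScript for clarity (the function name is illustrative; the SQL's NULLIF guard becomes an explicit zero check here, returning 0 rather than NULL):

```typescript
// Mirror of the SQL: ROUND(100.0 * (1 - incidents / alerts), 1)
export function noisePct(totalAlerts: number, totalIncidents: number): number {
  if (totalAlerts === 0) return 0; // SQL yields NULL here via NULLIF
  return Math.round(1000 * (1 - totalIncidents / totalAlerts)) / 10; // one decimal place
}
```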

Why TimescaleDB over pure DynamoDB:

  • DynamoDB excels at point lookups and narrow range scans. It's terrible at "give me all alerts for this tenant in the last 5 minutes across all services" — that's a full partition scan.
  • TimescaleDB's hypertable chunking + continuous aggregates make time-range queries fast and pre-computed aggregates make dashboard queries instant.
  • The trade-off: TimescaleDB is a managed RDS instance (not serverless). Cost is fixed (~$50/month for db.t4g.medium). Acceptable for V1; evaluate Aurora Serverless v2 at scale.

3.4 Service Dependency Graph Storage

// DynamoDB table: service-dependencies
// Partition key: tenant_id
// Sort key: upstream_service#downstream_service

interface ServiceDependencyRecord {
  tenant_id: string;                    // PK
  edge_key: string;                     // SK: "payment-service#notification-service"
  upstream_service: string;
  downstream_service: string;
  source: 'inferred' | 'explicit';
  confidence: number;                   // 0.0-1.0
  occurrence_count: number;
  first_seen: string;
  last_seen: string;
  avg_lag_ms: number;                   // Average time between upstream and downstream alerts
  ttl?: number;                         // Epoch seconds — inferred edges expire after 30d of no observations
}

Graph query pattern: When the Correlation Engine needs to check dependencies for a service, it does a DynamoDB query:

  • PK = tenant_id, SK begins_with upstream_service# → all downstream dependencies
  • GSI (inverted key downstream#upstream): PK = tenant_id, SK begins_with downstream_service# → all upstream dependencies

This is O(degree) per lookup — fast enough for real-time correlation. The graph is small per tenant (typically <100 edges for a 50-service architecture).
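
The edge-key scheme above can be captured in a few small helpers. These names are illustrative, not the product's actual module; using `#` as the delimiter assumes service names never contain it:

```typescript
// Sort key for the base table: "upstream#downstream"
export function edgeKey(upstream: string, downstream: string): string {
  return `${upstream}#${downstream}`;
}

// Sort key for the inverted GSI: "downstream#upstream", so that
// "all upstream dependencies of X" is a begins_with(X + "#") query.
export function invertedEdgeKey(upstream: string, downstream: string): string {
  return `${downstream}#${upstream}`;
}

// begins_with prefix for "all downstream dependencies of a service"
export function downstreamPrefix(upstream: string): string {
  return `${upstream}#`;
}

export function parseEdgeKey(key: string): { upstream: string; downstream: string } {
  const [upstream, downstream] = key.split("#");
  return { upstream, downstream };
}
```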

3.5 Multi-Tenant Data Isolation

Isolation model: Logical isolation with tenant_id partitioning.

Every data record includes tenant_id as the partition key (DynamoDB) or a required column (TimescaleDB). There is no cross-tenant data access path.

| Layer | Isolation Mechanism |
|---|---|
| API Gateway | API key → `tenant_id` mapping. All requests scoped to tenant. |
| Webhook URLs | Tenant ID embedded in URL path. Validated against tenant record. |
| DynamoDB | `tenant_id` is the partition key on every table. No scan operations in application code — all queries are PK-scoped. |
| TimescaleDB | Row-level security (RLS) policies enforce `tenant_id = current_setting('app.tenant_id')`. Connection pool sets tenant context per request. |
| Redis | All keys prefixed with `tenant:{tenant_id}:`. No `KEYS *` in application code. |
| S3 | Object key prefix: `raw/{tenant_id}/{date}/{alert_id}.json`. Bucket policy prevents cross-prefix access. |
| Slack | Bot token per tenant. Stored encrypted. Never shared across tenants. |

Why not per-tenant databases? Cost. At $19/seat with potentially thousands of tenants, per-tenant RDS instances are economically impossible. Logical isolation with strong key-scoping is the standard pattern for multi-tenant SaaS at this price point. SOC2 auditors accept this model with proper access controls and audit logging.
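
The key-scoping discipline above is easiest to enforce by funneling every key through one constructor, so an unscoped key cannot be built by accident. A minimal sketch — the helper names are assumptions, not the product's actual code:

```typescript
// Every Redis key goes through this helper: a missing tenant_id fails loudly
// instead of silently producing a cross-tenant key.
export function redisKey(tenantId: string, ...parts: string[]): string {
  if (!tenantId) throw new Error("tenant_id is required for all cache keys");
  return `tenant:${tenantId}:${parts.join(":")}`;
}

// S3 object key for a raw payload, matching the prefix-isolation scheme.
export function rawPayloadKey(tenantId: string, isoDate: string, alertId: string): string {
  if (!tenantId) throw new Error("tenant_id is required for all object keys");
  return `raw/${tenantId}/${isoDate}/${alertId}.json`;
}
```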

3.6 Retention Policies

| Tier | Raw Payloads (S3) | Alert Events (DynamoDB) | Time-Series (TimescaleDB) | Correlation Windows |
|---|---|---|---|---|
| Free | 7 days | 7 days | 7 days (no aggregates) | 24 hours |
| Pro | 90 days | 90 days | 90 days + hourly aggregates for 1 yr | 30 days |
| Business | 1 year | 1 year | 1 year + daily aggregates for 2 yr | 90 days |
| Enterprise | Custom | Custom | Custom | Custom |

Implementation:

  • S3: Lifecycle policies transition objects: Standard → IA (30d) → Glacier (90d) → Delete (per tier)
  • DynamoDB: TTL attribute on every item. DynamoDB automatically deletes expired items (eventually consistent, ~48hr window). No Lambda cleanup needed.
  • TimescaleDB: drop_chunks() policy per hypertable. Continuous aggregates survive chunk drops (aggregated data persists longer than raw data).
  • Redis: TTL on all keys. Self-cleaning by design.
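
Since DynamoDB TTL is just an epoch-seconds attribute, the retention table reduces to one small function at write time. The tier keys and helper name here are illustrative:

```typescript
// Days of retention per tier, from the table above. Enterprise retention is
// custom, so no automatic TTL is set for it here.
const RETENTION_DAYS: Record<string, number> = { free: 7, pro: 90, business: 365 };

// Returns the value for the item's TTL attribute, or undefined to skip TTL.
export function itemTtl(tier: string, ingestedAtMs: number): number | undefined {
  const days = RETENTION_DAYS[tier];
  if (days === undefined) return undefined;
  return Math.floor(ingestedAtMs / 1000) + days * 24 * 60 * 60;
}
```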

4. INFRASTRUCTURE

4.1 AWS Architecture

Region: us-east-1 (primary). Single-region for V1. Multi-region (us-west-2 failover) at 10K+ alerts/day or first EU customer requiring data residency.

graph TB
    subgraph Edge["Edge Layer"]
        CF[CloudFront<br/>Dashboard CDN]
        R53[Route 53<br/>hooks.dd0c.com]
    end

    subgraph Ingestion["Ingestion (Serverless)"]
        APIGW[API Gateway HTTP API<br/>Webhook Receiver + REST API]
        L_INGEST[Lambda: webhook-ingest<br/>256MB, 10s timeout]
        L_API[Lambda: api-handler<br/>512MB, 30s timeout]
        L_SLACK[Lambda: slack-events<br/>256MB, 10s timeout]
        L_NOTIFY[Lambda: notification-sender<br/>256MB, 30s timeout]
    end

    subgraph Queue["Message Bus"]
        SQS_ALERT[SQS FIFO<br/>alert-ingested<br/>MessageGroupId=tenant_id]
        SQS_DEPLOY[SQS FIFO<br/>deploy-event<br/>MessageGroupId=tenant_id]
        SQS_CORR[SQS Standard<br/>correlation-result]
        SQS_NOTIFY[SQS Standard<br/>notification]
        DLQ[SQS DLQ<br/>dead-letters]
    end

    subgraph Processing["Processing (ECS Fargate)"]
        ECS_CE[ECS Service: correlation-engine<br/>1 vCPU, 2GB RAM<br/>Desired: 1, Max: 4]
        ECS_SE[ECS Service: suggestion-engine<br/>0.5 vCPU, 1GB RAM<br/>Desired: 1, Max: 2]
    end

    subgraph Data["Data Layer"]
        DDB[(DynamoDB<br/>On-Demand Capacity)]
        RDS[(RDS PostgreSQL 16<br/>TimescaleDB<br/>db.t4g.medium)]
        REDIS[(ElastiCache Redis<br/>cache.t4g.micro)]
        S3_RAW[(S3: dd0c-raw-payloads)]
        S3_STATIC[(S3: dd0c-dashboard)]
    end

    subgraph Ops["Operations"]
        CW_LOGS[CloudWatch Logs]
        CW_METRICS[CloudWatch Metrics]
        CW_ALARMS[CloudWatch Alarms]
        XRAY[X-Ray Tracing]
        SM[Secrets Manager]
    end

    R53 --> APIGW
    CF --> S3_STATIC
    APIGW --> L_INGEST & L_API & L_SLACK
    L_INGEST --> SQS_ALERT & SQS_DEPLOY & S3_RAW
    SQS_ALERT & SQS_DEPLOY --> ECS_CE
    ECS_CE --> SQS_CORR & DDB & RDS & REDIS
    SQS_CORR --> ECS_SE
    ECS_SE --> SQS_NOTIFY & DDB & RDS
    SQS_NOTIFY --> L_NOTIFY
    L_API --> DDB & RDS
    L_SLACK --> DDB
    SQS_ALERT & SQS_DEPLOY & SQS_CORR & SQS_NOTIFY -.-> DLQ
    L_INGEST & L_API & ECS_CE & ECS_SE --> CW_LOGS & XRAY
    ECS_CE & ECS_SE --> SM

DynamoDB Tables:

| Table | PK | SK | GSIs | Purpose |
|---|---|---|---|---|
| alerts | tenant_id | alert_id (ULID) | GSI1: tenant_id + triggered_at | Canonical alert records |
| incidents | tenant_id | incident_id (ULID) | GSI1: tenant_id + created_at | Correlated incident records |
| suggestions | tenant_id | suggestion_id | GSI1: incident_id | Noise suggestions + feedback |
| deploys | tenant_id | deploy_id (ULID) | GSI1: tenant_id + service + completed_at | Deploy events |
| dependencies | tenant_id | upstream#downstream | GSI1: tenant_id + downstream#upstream | Service dependency graph |
| tenants | tenant_id | — | GSI1: api_key | Tenant config, billing, integrations |
| integrations | tenant_id | integration_id | — | Webhook configs, Slack tokens |

All tables use on-demand capacity mode (no capacity planning, pay-per-request). Switch to provisioned with auto-scaling when read/write patterns stabilize (typically at 50K+ alerts/day).

4.2 Real-Time Processing Pipeline

The critical path from webhook receipt to Slack notification must complete in under 10 seconds for the "eager first alert" notification, and under 5 minutes + 10 seconds for the full correlated incident card.

Latency budget:

Webhook received by API Gateway          0ms
├─ Lambda cold start (worst case)       +200ms
├─ HMAC validation + parsing            +10ms
├─ DynamoDB write (raw event)           +5ms
├─ SQS FIFO send                        +20ms
├─ S3 async put (non-blocking)          +0ms (async)
│                                       ─────
│ Total ingestion:                       ~235ms (p99)
│
├─ SQS → ECS polling interval           +100ms (long-polling, 0.1s)
├─ Correlation Engine processing        +15ms
├─ Redis read/write (window state)      +2ms
├─ DynamoDB read (deploy lookup)        +5ms
│                                       ─────
│ Total to correlation decision:         ~357ms (p99)
│
├─ SQS → Lambda (notification)          +50ms
├─ Slack API call                        +300ms
│                                       ─────
│ Total webhook → Slack (eager):         ~707ms (p99)
│
│ ... correlation window (5 min default) ...
│
├─ Window close → Suggestion Engine     +200ms
├─ Slack chat.update (in-place)         +300ms
│                                       ─────
│ Total webhook → full incident card:    ~5min + 1.2s

ECS Fargate task configuration:

// CDK definition (aws-cdk-lib v2)
import { Duration } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as appscaling from 'aws-cdk-lib/aws-applicationautoscaling';

const correlationService = new ecs.FargateService(this, 'CorrelationEngine', {
  cluster,
  taskDefinition: new ecs.FargateTaskDefinition(this, 'CorrTask', {
    cpu: 1024,             // 1 vCPU
    memoryLimitMiB: 2048,  // 2 GB
  }),
  desiredCount: 1,
  minHealthyPercent: 100,
  maxHealthyPercent: 200,
  circuitBreaker: { rollback: true },
});

// Auto-scaling based on SQS queue depth
const scaling = correlationService.autoScaleTaskCount({
  minCapacity: 1,
  maxCapacity: 4,
});
scaling.scaleOnMetric('QueueDepthScaling', {
  metric: alertQueue.metricApproximateNumberOfMessagesVisible(),
  scalingSteps: [
    { upper: 0, change: -1 },      // Scale in when queue empty
    { lower: 100, change: +1 },    // Scale out at 100 messages
    { lower: 1000, change: +2 },   // Scale out faster at 1000
  ],
  // AdjustmentType comes from aws-applicationautoscaling, not aws-ecs
  adjustmentType: appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
  cooldown: Duration.seconds(60),
});

SQS FIFO configuration:

const alertQueue = new sqs.Queue(this, 'AlertIngested', {
  fifo: true,
  contentBasedDeduplication: false,  // We set explicit dedup IDs
  deduplicationScope: sqs.DeduplicationScope.MESSAGE_GROUP,
  fifoThroughputLimit: sqs.FifoThroughputLimit.PER_MESSAGE_GROUP_ID,
  // High-throughput FIFO: ~30K messages/sec per queue, scaled across groups
  visibilityTimeout: Duration.seconds(60),
  retentionPeriod: Duration.days(4),
  deadLetterQueue: {
    queue: dlq,
    maxReceiveCount: 3,
  },
});

Why SQS FIFO with per-tenant message groups:

  • MessageGroupId = tenant_id ensures alerts from the same tenant are processed in order (critical for time-window accuracy)
  • Different tenants are processed in parallel (no head-of-line blocking)
  • High-throughput FIFO mode supports on the order of 30K messages/sec per queue, with throughput partitioned across message groups — far more headroom than any single tenant's alert volume requires
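
A sketch of how the ingestion Lambda might populate the FIFO send parameters. The dedup-ID recipe is an assumption (the text above only fixes MessageGroupId = tenant_id); hashing tenant + provider + the provider's event ID makes a provider redelivery collapse to one message inside SQS's five-minute deduplication window:

```typescript
import { createHash } from "node:crypto";

// Illustrative builder for SendMessage parameters on the FIFO queue.
export function fifoMessageParams(tenantId: string, provider: string, providerEventId: string) {
  return {
    MessageGroupId: tenantId, // per-tenant ordering, cross-tenant parallelism
    MessageDeduplicationId: createHash("sha256")
      .update(`${tenantId}:${provider}:${providerEventId}`)
      .digest("hex"), // stable across provider redeliveries
  };
}
```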

4.3 Cost Estimates

All estimates assume us-east-1 pricing as of 2026. Costs are monthly.

1K alerts/day (~30K/month) — Early Stage

| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 30K requests | $0.03 |
| Lambda (ingestion) | 30K invocations × 256MB × 200ms | $0.02 |
| Lambda (API + notifications) | 50K invocations × 512MB × 300ms | $0.10 |
| SQS (FIFO + Standard) | 120K messages | $0.05 |
| ECS Fargate (Correlation) | 1 task × 1vCPU × 2GB × 24/7 | $48.00 |
| ECS Fargate (Suggestion) | 1 task × 0.5vCPU × 1GB × 24/7 | $18.00 |
| DynamoDB (on-demand) | ~100K reads + 60K writes | $0.15 |
| RDS (TimescaleDB) | db.t4g.micro (free tier eligible) | $0.00 |
| ElastiCache Redis | cache.t4g.micro | $12.00 |
| S3 | <1 GB stored | $0.02 |
| CloudWatch | Logs + metrics | $5.00 |
| Route 53 | 1 hosted zone | $0.50 |
| Secrets Manager | 5 secrets | $2.00 |
| **Total** | | **~$86/month** |

10K alerts/day (~300K/month) — Growth Stage

| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 300K requests | $0.30 |
| Lambda (ingestion) | 300K invocations | $0.20 |
| Lambda (API + notifications) | 500K invocations | $1.00 |
| SQS | 1.2M messages | $0.50 |
| ECS Fargate (Correlation) | 2 tasks avg (auto-scaling) | $96.00 |
| ECS Fargate (Suggestion) | 1 task | $18.00 |
| DynamoDB (on-demand) | ~1M reads + 600K writes | $1.50 |
| RDS (TimescaleDB) | db.t4g.medium | $50.00 |
| ElastiCache Redis | cache.t4g.small | $25.00 |
| S3 | ~10 GB stored + transitions | $2.00 |
| CloudWatch | Logs + metrics + alarms | $15.00 |
| Route 53 + ACM | | $1.00 |
| Secrets Manager | 20 secrets | $8.00 |
| **Total** | | **~$218/month** |

100K alerts/day (~3M/month) — Scale Stage

| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 3M requests | $3.00 |
| Lambda (ingestion) | 3M invocations | $2.00 |
| Lambda (API + notifications) | 5M invocations | $10.00 |
| SQS | 12M messages | $5.00 |
| ECS Fargate (Correlation) | 4 tasks avg | $192.00 |
| ECS Fargate (Suggestion) | 2 tasks avg | $36.00 |
| DynamoDB (on-demand) | ~10M reads + 6M writes | $15.00 |
| RDS (TimescaleDB) | db.r6g.large + read replica | $350.00 |
| ElastiCache Redis | cache.r6g.large (cluster mode) | $200.00 |
| S3 | ~100 GB + lifecycle | $10.00 |
| CloudWatch | Full observability stack | $50.00 |
| CloudFront | Dashboard CDN | $5.00 |
| WAF | API protection | $10.00 |
| **Total** | | **~$888/month** |

Gross margin analysis:

| Scale | Monthly Infra Cost | Estimated MRR | Gross Margin |
|---|---|---|---|
| 1K alerts/day (~35 teams) | $86 | $6,650 | 98.7% |
| 10K alerts/day (~175 teams) | $218 | $33,250 | 99.3% |
| 100K alerts/day (~700 teams) | $888 | $133,000 | 99.3% |

Infrastructure costs are negligible relative to revenue. The cost structure is dominated by the founder's time, not AWS spend. This is the structural advantage of a webhook-based SaaS — no agents to host, no data to scrape, no heavy compute. Just receive, correlate, notify.

4.4 Scaling Strategy

Alert volume is bursty. During a major incident, a single tenant might generate 500 alerts in 2 minutes, then nothing for hours. The architecture must handle this without pre-provisioning for peak.

Burst handling by layer:

| Layer | Burst Strategy | Limit |
|---|---|---|
| API Gateway | Built-in burst capacity: 5,000 RPS default, increasable to 50K+ | Effectively unlimited for our scale |
| Lambda | Concurrency auto-scales. Reserved concurrency per function prevents one function from starving others | 1,000 concurrent (default), increase via support ticket |
| SQS FIFO | High-throughput mode: ~30K msg/sec per queue, partitioned across message groups. Queue absorbs bursts that processing can't handle immediately | Unlimited queue depth |
| ECS Fargate | Auto-scaling on SQS queue depth. Scale-out in ~60 seconds (Fargate task launch time). During the 60s gap, SQS buffers | Min 1, Max 4 (V1). Increase max as needed |
| DynamoDB | On-demand mode handles bursts to 2× previous peak automatically. For sustained spikes, DynamoDB auto-adjusts within minutes | Effectively unlimited with on-demand |
| Redis | Single node handles 100K+ ops/sec. Cluster mode at scale | Not a bottleneck until 100K+ alerts/day |
| TimescaleDB | Write-ahead log buffers burst writes. Hypertable chunking prevents table bloat | RDS instance size is the limit; vertical scaling |

The SQS buffer is the key architectural decision. During an incident storm, the ingestion Lambda writes to SQS in <20ms and returns 200 OK to the provider. The Correlation Engine processes at its own pace. If the engine falls behind, the queue grows — but no webhooks are dropped. This decoupling is what makes the system reliable under burst load.

Scaling triggers and actions:

Alert volume < 1K/day:
  - 1 Correlation Engine task, 1 Suggestion Engine task
  - cache.t4g.micro Redis, db.t4g.micro RDS
  - Total: ~$86/month

Alert volume 1K-10K/day:
  - Auto-scale CE to 2 tasks during bursts
  - Upgrade Redis to cache.t4g.small
  - Upgrade RDS to db.t4g.medium
  - Total: ~$218/month

Alert volume 10K-100K/day:
  - Auto-scale CE to 4 tasks
  - Auto-scale SE to 2 tasks
  - Redis cluster mode (3 shards)
  - RDS db.r6g.large + read replica
  - Add WAF for API protection
  - Total: ~$888/month

Alert volume > 100K/day:
  - Evaluate Kinesis Data Streams replacing SQS for higher throughput
  - Consider Aurora Serverless v2 replacing RDS
  - Multi-region deployment for latency + redundancy
  - Dedicated capacity DynamoDB with auto-scaling
  - Total: $2K-5K/month (still <1% of revenue at this scale)

4.5 CI/CD Pipeline

graph LR
    subgraph Dev
        CODE[Push to main]
        PR[Pull Request]
    end

    subgraph CI["GitHub Actions CI"]
        LINT[Lint + Type Check]
        TEST[Unit Tests]
        INT[Integration Tests<br/>LocalStack]
        BUILD[Docker Build<br/>+ CDK Synth]
    end

    subgraph CD["GitHub Actions CD"]
        STAGING[Deploy to Staging<br/>CDK deploy]
        SMOKE[Smoke Tests<br/>against staging]
        PROD[Deploy to Production<br/>CDK deploy]
        CANARY[Canary: send test<br/>webhook, verify Slack]
    end

    subgraph Dogfood["Dogfooding"]
        DD0C[dd0c/alert receives<br/>its own deploy webhook]
    end

    PR --> LINT & TEST & INT
    CODE --> BUILD --> STAGING --> SMOKE --> PROD --> CANARY
    PROD --> DD0C

Pipeline details:

  1. PR checks (< 3 min): ESLint, TypeScript strict mode, unit tests (vitest), integration tests against LocalStack (DynamoDB, SQS, S3 emulation)
  2. Staging deploy (< 5 min): CDK deploy to staging account. Separate AWS account for isolation.
  3. Smoke tests (< 2 min): Send test webhooks to staging endpoint. Verify: webhook accepted, alert appears in DynamoDB, correlation window opens, Slack notification sent to test channel.
  4. Production deploy (< 5 min): CDK deploy to production. Blue/green for ECS services (CodeDeploy). Lambda versioning with aliases for instant rollback.
  5. Canary (continuous): Post-deploy canary sends a synthetic webhook every 5 minutes. If Slack notification doesn't arrive within 30 seconds, CloudWatch alarm fires → auto-rollback.
  6. Dogfooding: dd0c/alert's own GitHub Actions workflow sends deploy webhooks to dd0c/alert. The product monitors its own deployments. If a deploy causes alert correlation to degrade, dd0c/alert tells you about it.

Rollback strategy:

  • Lambda: Alias shift to previous version (instant, <1 second)
  • ECS: CodeDeploy blue/green rollback (< 2 minutes)
  • DynamoDB: No schema migrations in V1 (schema-on-read). No rollback needed.
  • TimescaleDB: Flyway migrations with rollback scripts. Test in staging first.

5. SECURITY

5.1 Webhook Authentication

We cannot trust unauthenticated webhooks, as an attacker could flood a tenant with fake alerts.

  • HMAC Signatures: Every webhook request is verified using the provider's signature header (e.g., X-PagerDuty-Signature, DD-WEBHOOK-SIGNATURE).
  • Secret Management: Provider secrets are generated upon integration creation, stored in AWS Secrets Manager (or DynamoDB KMS-encrypted), and retrieved by the ingestion Lambda.
  • Timestamp Validation: Signatures must include a timestamp check to prevent replay attacks (requests older than 5 minutes are rejected).
  • Rate Limiting: API Gateway enforces rate limits per tenant based on their tier to prevent noisy neighbor problems and DDoS.
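
The signature and timestamp checks combine into one verification step. A minimal sketch using Node's crypto module — the exact header format and the `timestamp.body` signing scheme vary by provider, so treat the concatenation below as an assumption:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify an HMAC-SHA256 webhook signature with replay protection.
// The timestamp is part of the signed material, so an old signature
// cannot be replayed onto a fresh request body.
export function verifyWebhook(
  secret: string,
  body: string,
  timestampSec: number,        // Unix seconds, from a signed header
  signatureHex: string,
  nowSec: number = Math.floor(Date.now() / 1000),
  toleranceSec: number = 300,  // reject requests older than 5 minutes
): boolean {
  if (Math.abs(nowSec - timestampSec) > toleranceSec) return false;

  const expected = createHmac("sha256", secret)
    .update(`${timestampSec}.${body}`)
    .digest();
  const provided = Buffer.from(signatureHex, "hex");

  // timingSafeEqual throws on length mismatch, so guard first.
  if (provided.length !== expected.length) return false;
  return timingSafeEqual(provided, expected);
}
```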

5.2 API Key Management

For customers using the REST API (Business tier) or Custom Webhooks:

  • API keys are generated as cryptographically secure random strings with a prefix (e.g., dd0c_live_...).
  • Only a one-way hash (SHA-256) is stored in DynamoDB. The raw key is shown only once upon creation.
  • API keys are tied to specific scopes (e.g., write:alerts, read:incidents).
  • API Gateway Lambda Authorizer validates the key and injects the tenant_id into the request context, ensuring strict tenant isolation.
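
Key generation and verification reduce to a few lines of Node crypto. The `dd0c_live_` prefix comes from the text above; the helper names and the 32-byte key length are assumptions:

```typescript
import { createHash, randomBytes } from "node:crypto";

const KEY_PREFIX = "dd0c_live_";

// Cryptographically random key, prefixed so leaked keys are easy to identify.
export function generateApiKey(): string {
  return KEY_PREFIX + randomBytes(32).toString("base64url");
}

// Only this one-way hash is stored; the raw key is shown once at creation.
export function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

export function keyMatches(rawKey: string, storedHash: string): boolean {
  return hashApiKey(rawKey) === storedHash;
}
```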

5.3 Alert Data Sensitivity

Alert payloads often contain sensitive infrastructure details (hostnames, IP addresses) and sometimes PII (error messages containing user data).

  • Payload Stripping Mode (Privacy Mode): Configurable per-tenant. When enabled, the ingestion layer strips the description and raw payload bodies before saving to DynamoDB or S3. Only structural metadata (service, severity, timestamp) is retained.
  • Encryption at Rest: All DynamoDB tables, RDS instances, and S3 buckets use AWS KMS encryption with customer managed keys (CMKs) or AWS managed keys.
  • Encryption in Transit: TLS 1.2+ enforced on all API Gateway endpoints and inter-service communications.
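
Privacy Mode is a pure transform at the ingestion boundary: keep the structural fields, drop the free-text ones. A sketch with assumed field names on the canonical record:

```typescript
interface CanonicalAlert {
  tenant_id: string;
  service: string;
  severity: string;
  triggered_at: string;
  description?: string;   // free text — may contain PII
  raw_payload?: unknown;  // provider body — may contain PII
}

// When Privacy Mode is on, strip the fields that can carry user data
// before anything is written to DynamoDB or S3.
export function applyPrivacyMode(alert: CanonicalAlert, privacyMode: boolean): CanonicalAlert {
  if (!privacyMode) return alert;
  const structural = { ...alert };
  delete structural.description;
  delete structural.raw_payload;
  return structural;
}
```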

5.4 SOC 2 Considerations

While SOC 2 Type II certification is targeted for Month 6-9, the V1 architecture lays the groundwork:

  • Audit Logging: Every configuration change (adding integrations, modifying suppression rules) is logged to an immutable audit table.
  • Access Control: No human access to production databases. Read-only access via AWS SSO for debugging. Changes via CI/CD only.
  • Vulnerability Scanning: ECR image scanning on push, npm audit in CI pipeline.
  • Separation of Duties: Staging and Production are in completely separate AWS accounts.

5.5 Data Residency Options

  • V1: All data resides in us-east-1.
  • V2/Enterprise: The architecture supports multi-region deployment. European customers can be provisioned in eu-central-1. The tenant_id can dictate routing at the edge (e.g., Route 53 latency-based routing or CloudFront Lambda@Edge routing based on tenant prefix).

6. MVP SCOPE

6.1 V1 MVP (Observe-and-Suggest)

The V1 MVP is strictly scoped to prove the 60-second time-to-value constraint and earn engineer trust.

  • Integrations: Datadog, PagerDuty, and GitHub Actions (for deploy events).
  • Core Engine: Time-window clustering and deployment correlation. Rule-based only.
  • Actionability: Observe-and-suggest ONLY. No auto-suppression.
  • Delivery: Slack Bot (incident cards, real-time alerts, daily digests).
  • Dashboard: Minimal UI for generating webhook URLs and viewing the Noise Report Card.

6.2 Deferred to V2+

  • Auto-suppression: Requires explicit user opt-in and a proven track record.
  • More Integrations: Grafana, OpsGenie, GitLab CI, ArgoCD.
  • Semantic Deduplication: Sentence-transformer ML embeddings for fuzzy alert matching.
  • Predictive Severity: ML-based scoring of historical resolution patterns.
  • Advanced Dashboard: Custom charting, RBAC, SSO/SAML.
  • dd0c/run integration: Runbook automation.

6.3 The 60-Second Onboarding Flow

  1. User authenticates via Slack (OAuth).
  2. UI provisions a tenant_id and generates Datadog/PagerDuty webhook URLs.
  3. User pastes the URL into their monitoring tool.
  4. First alert fires → Ingestion Lambda receives it.
  5. Slack bot immediately posts: "🔔 New alert: [service] [title] — watching for related alerts..."
  6. V1 value is proven instantly.

6.4 Technical Debt Budget

Given the 30-day build timeline, intentional technical debt is accepted in specific areas:

  • Testing: Integration tests focus on the golden path (webhook → correlation → Slack). Edge cases in provider parsing will be fixed forward in production.
  • Dashboard UI: Built with off-the-shelf Tailwind components. Not pixel-perfect.
  • Database Migrations: None. Schema-on-read in DynamoDB.
  • Infrastructure Code: Hardcoded region (us-east-1) and basic CI/CD.

6.5 Solo Founder Operational Model

  • Support: Community Slack channel. No SLA for the Free/Pro tiers.
  • On-call: Standard AWS alarms (5XX errors, high queue depth) page the founder.
  • Resilience: The overlay architecture means if dd0c/alert goes down, the customer just receives their raw alerts from PagerDuty/Datadog. It degrades gracefully to the status quo.

7. API DESIGN

7.1 Webhook Receiver Endpoints

  • POST /v1/webhooks/{tenant_id}/datadog
  • POST /v1/webhooks/{tenant_id}/pagerduty
  • POST /v1/webhooks/{tenant_id}/github

Headers must include provider-specific signatures.

7.2 Alert Query & Search API

  • GET /v1/alerts?service={service}&status={status}&start={iso8601}&end={iso8601} Returns paginated canonical alerts.
  • GET /v1/alerts/{alert_id}

7.3 Correlation Results API

  • GET /v1/incidents?status=open Returns active correlation windows and their grouped alerts.
  • GET /v1/incidents/{incident_id}
  • GET /v1/incidents/{incident_id}/suggestions

7.4 Slack Slash Commands

  • /dd0c status — Shows current open correlation windows.
  • /dd0c config — Link to the tenant dashboard.
  • /dd0c mute [service] — Temporarily ignore alerts for a noisy service (adds to suppression list).
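
Parsing these commands is a small exercise; a hypothetical sketch, with subcommand names taken from the list above and everything else assumed:

```typescript
// Parse the text portion of a /dd0c slash command.
// Returns null for unknown subcommands or a bare "mute" with no service.
export function parseSlashCommand(text: string): { command: string; arg?: string } | null {
  const [command, ...rest] = text.trim().split(/\s+/);
  if (!["status", "config", "mute"].includes(command)) return null;
  const arg = rest.join(" ") || undefined;
  if (command === "mute" && !arg) return null;
  return { command, arg };
}
```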

7.5 Dashboard REST API

Backend-for-frontend (BFF) used by the React SPA:

  • GET /api/v1/reports/noise-daily — TimescaleDB aggregation for the Noise Report Card.
  • GET /api/v1/integrations — List configured webhooks and status.
  • POST /api/v1/integrations — Generate new webhook credentials.

7.6 Integration Marketplace Hooks

For native app directory listings (e.g., PagerDuty Marketplace):

  • GET /api/v1/oauth/callback — Handles OAuth flows for third-party integrations.
  • POST /api/v1/lifecycle/uninstall — Cleans up tenant data when the app is removed from a workspace.