# dd0c/alert — Technical Architecture
### Alert Intelligence Platform
**Version:** 1.0 | **Date:** 2026-02-28 | **Phase:** 6 — Architecture | **Author:** dd0c Engineering
---
## 1. SYSTEM OVERVIEW
### 1.1 High-Level Architecture
```mermaid
graph TB
subgraph Providers["Alert Sources"]
PD[PagerDuty]
DD[Datadog]
GF[Grafana]
OG[OpsGenie]
CW[Custom Webhooks]
end
subgraph CICD["CI/CD Sources"]
GHA[GitHub Actions]
GLC[GitLab CI]
ARGO[ArgoCD]
end
subgraph Ingestion["Ingestion Layer (API Gateway + Lambda)"]
WH[Webhook Receiver<br/>POST /v1/wh/:tenant_id/:provider]
HMAC[HMAC Validator]
NORM[Payload Normalizer]
SCHEMA[Canonical Schema<br/>Mapper]
end
subgraph Queue["Event Bus (SQS/SNS)"]
ALERT_Q[alert-ingested<br/>SQS FIFO]
DEPLOY_Q[deploy-event<br/>SQS FIFO]
CORR_Q[correlation-request<br/>SQS Standard]
NOTIFY_Q[notification<br/>SQS Standard]
end
subgraph Processing["Processing Layer (ECS Fargate)"]
CE[Correlation Engine]
DT[Deployment Tracker]
SE[Suggestion Engine]
NS[Notification Service]
end
subgraph Storage["Data Layer"]
DDB[(DynamoDB<br/>Alerts + Tenants)]
TS[(TimescaleDB on RDS<br/>Time-Series Correlation)]
CACHE[(ElastiCache Redis<br/>Active Windows)]
S3[(S3<br/>Raw Payloads + Exports)]
end
subgraph Output["Delivery"]
SLACK[Slack Bot]
DASH[Dashboard API<br/>CloudFront + S3 SPA]
API[REST API]
end
PD & DD & GF & OG & CW --> WH
GHA & GLC & ARGO --> WH
WH --> HMAC --> NORM --> SCHEMA
SCHEMA -- alerts --> ALERT_Q
SCHEMA -- deploys --> DEPLOY_Q
ALERT_Q --> CE
DEPLOY_Q --> DT
DT -- deploy context --> CE
CE -- correlation results --> CORR_Q
CORR_Q --> SE
SE -- suggestions --> NOTIFY_Q
NOTIFY_Q --> NS
NS --> SLACK
CE & DT & SE --> DDB & TS
CE --> CACHE
SCHEMA --> S3
DASH & API --> DDB & TS
```
### 1.2 Component Inventory
| Component | Responsibility | AWS Service | Scaling Model |
|---|---|---|---|
| **Webhook Receiver** | Accept, authenticate, and normalize incoming alert/deploy webhooks | API Gateway HTTP API + Lambda | Auto-scales to 10K concurrent; API Gateway handles burst |
| **Payload Normalizer** | Transform provider-specific payloads into canonical alert schema | Lambda (part of ingestion) | Stateless, scales with webhook volume |
| **Event Bus** | Decouple ingestion from processing; buffer during spikes | SQS FIFO (alerts/deploys) + SQS Standard (correlation/notify) | Unlimited throughput; FIFO for ordering guarantees on per-tenant basis |
| **Correlation Engine** | Time-window clustering, service-dependency matching, deploy correlation | ECS Fargate (long-running) | Horizontal scaling via ECS Service Auto Scaling |
| **Deployment Tracker** | Ingest CI/CD events, maintain deploy timeline per service | ECS Fargate (shared with CE) | Co-located with Correlation Engine |
| **Suggestion Engine** | Score noise, generate grouping suggestions, compute "what would have happened" | ECS Fargate | Scales with correlation output volume |
| **Notification Service** | Format and deliver Slack messages, digests, weekly reports | Lambda (event-driven from SQS) | Auto-scales; Slack rate limits are the bottleneck |
| **Alert Store** | Canonical alert records, tenant config, correlation results | DynamoDB | On-demand capacity; single-digit ms reads |
| **Time-Series Store** | Correlation windows, alert frequency histograms, trend data | TimescaleDB on RDS (PostgreSQL) | Vertical scaling initially; read replicas at scale |
| **Active Window Cache** | In-flight correlation windows, recent alert fingerprints | ElastiCache Redis | Single node → cluster at 10K+ alerts/day |
| **Raw Payload Archive** | Original webhook payloads for audit, replay, and simulation | S3 Standard → S3 IA (30d) → Glacier (90d) | Unlimited; lifecycle policies manage cost |
| **Dashboard SPA** | React frontend for noise reports, suppression logs, integration management | CloudFront + S3 | Static hosting; API calls to backend |
| **REST API** | Dashboard backend, alert query, correlation results, admin | API Gateway + Lambda | Auto-scales |
### 1.3 Technology Choices
| Decision | Choice | Justification |
|---|---|---|
| **Compute: Ingestion** | AWS Lambda | Webhook traffic is bursty (incident storms). Lambda handles 0→10K RPS without pre-provisioning. Pay-per-invocation keeps costs near-zero at low volume. Cold starts acceptable (<200ms with Node.js runtime). |
| **Compute: Processing** | ECS Fargate | Correlation Engine needs long-running processes with in-memory state (active correlation windows). Lambda's 15-min timeout is insufficient. Fargate = no EC2 management, scales to zero tasks during quiet periods. |
| **Queue** | SQS (FIFO for ingestion, Standard for downstream) | FIFO guarantees per-tenant ordering for alert ingestion (critical for time-window accuracy). Standard queues for correlation→notification path where ordering is less critical. SNS fan-out for multi-consumer patterns. No Kafka overhead for V1. |
| **Primary Store** | DynamoDB | Single-digit ms latency for alert lookups. On-demand pricing = pay for what you use. Partition key = `tenant_id`, sort key = `alert_id` or `timestamp`. Global tables for multi-region (V2+). |
| **Time-Series** | TimescaleDB on RDS | Correlation windows require time-range queries ("all alerts for tenant X, service Y, in the last 5 minutes"). TimescaleDB's hypertables + continuous aggregates are purpose-built for this. PostgreSQL compatibility means standard tooling. |
| **Cache** | ElastiCache Redis | Sub-ms reads for active correlation windows. Sorted sets for time-windowed alert lookups. TTL-based expiry for window cleanup. Pub/Sub for real-time correlation triggers. |
| **Object Store** | S3 | Raw payload archival, alert history exports, simulation data. Lifecycle policies for cost management. Event notifications trigger replay/simulation workflows. |
| **API Layer** | API Gateway HTTP API | Lower latency and cost than REST API type. JWT authorizer for API key validation. Built-in throttling per API key (maps to tenant rate limits by tier). |
| **Frontend** | React SPA on CloudFront | Static hosting = zero server cost. CloudFront edge caching. Dashboard is secondary to Slack (most users never open it), so minimal investment. |
| **Language** | TypeScript (Node.js 20) | Single language across Lambda + ECS + frontend. Strong typing catches schema mapping bugs at compile time. Excellent AWS SDK support. Fast cold starts on Lambda. |
| **IaC** | AWS CDK (TypeScript) | Same language as application code. L2 constructs reduce boilerplate. Synthesizes to CloudFormation for drift detection. |
| **CI/CD** | GitHub Actions | Free for public repos, cheap for private. Native webhook integration (dd0c/alert dogfoods its own deployment tracking). |
### 1.4 Webhook-First Ingestion Model (60-Second Time-to-Value)
The entire architecture is designed around a single constraint: **a new customer must see their first correlated incident in Slack within 60 seconds of pasting a webhook URL.**
**The flow:**
```
T+0s Customer signs up → gets unique webhook URL:
https://hooks.dd0c.com/v1/wh/{tenant_id}/{provider}
T+5s Customer pastes URL into Datadog notification channel
(or PagerDuty webhook extension, or Grafana contact point)
T+10s First alert fires from their monitoring tool
T+10.1s API Gateway receives POST, Lambda validates HMAC,
normalizes payload, writes to SQS FIFO
T+10.3s Correlation Engine picks up alert, opens a new
correlation window (default: 5 minutes)
T+10.5s Alert appears in customer's Slack channel:
"🔔 New alert: [service] [title] — watching for related alerts..."
T+60s Window closes (or more alerts arrive). Correlation Engine
groups related alerts. Suggestion Engine scores noise.
T+61s Slack message updates in-place:
"📊 Incident #1: 12 alerts grouped → 1 incident
Sources: Datadog (8), PagerDuty (4)
Trigger: Deploy #1042 to payment-service (2 min before first alert)
Noise score: 87% — we'd suggest suppressing 10 of 12
👍 Helpful 👎 Not helpful"
```
**Design decisions enabling 60-second TTV:**
1. **No SDK, no agent, no credentials.** Just a URL. The webhook URL encodes tenant ID and provider type — zero configuration needed.
2. **Pre-provisioned Slack connection.** During signup, customer connects Slack via OAuth before getting the webhook URL. Slack is ready before the first alert arrives.
3. **Eager first-alert notification.** Don't wait for the correlation window to close. Show the first alert immediately in Slack ("watching for related alerts..."), then update the message in-place when correlation completes. The customer sees activity within seconds.
4. **Default correlation window of 5 minutes.** Long enough to catch deploy-correlated alert storms, short enough to deliver results quickly. Configurable per tenant.
5. **Webhook URL as the product.** The URL IS the integration. No config files, no YAML, no terraform modules. Copy. Paste. Done.
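Because the URL carries all routing information, the ingestion Lambda can recover tenant and provider from the path alone. A minimal sketch of that parsing step (the route shape is from the flow above; the function name and the character classes accepted are our assumptions):

```typescript
// Extract tenant and provider from a webhook path shaped like
//   /v1/wh/{tenant_id}/{provider}
// Returns null when the path does not match the expected shape.
interface WebhookRoute {
  tenantId: string;
  provider: string;
}

function parseWebhookPath(path: string): WebhookRoute | null {
  const match = path.match(/^\/v1\/wh\/([A-Za-z0-9_-]+)\/([a-z]+)$/);
  if (!match) return null;
  return { tenantId: match[1], provider: match[2] };
}
```

Anything that fails to parse is rejected before any database lookup, keeping malformed traffic cheap.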
---
## 2. CORE COMPONENTS
### 2.1 Webhook Ingestion Layer
The ingestion layer is the front door. It must be fast (sub-100ms response to the sending provider), reliable (zero dropped webhooks), and flexible (parse any provider's payload format without code changes for supported providers).
**Architecture:** API Gateway HTTP API → Lambda function → SQS FIFO
```mermaid
sequenceDiagram
participant P as Provider (Datadog/PD/etc)
participant AG as API Gateway
participant L as Ingestion Lambda
participant SQS as SQS FIFO
participant S3 as S3 (Raw Archive)
P->>AG: POST /v1/wh/{tenant_id}/{provider}
AG->>L: Invoke (< 50ms)
L->>L: Validate HMAC signature
L->>L: Parse provider-specific payload
L->>L: Map to canonical alert schema
L-->>S3: Async: store raw payload
L->>SQS: Send canonical alert message
L->>P: 200 OK (< 100ms total)
```
**Provider Parsers:**
Each supported provider has a dedicated parser module that transforms the provider's webhook payload into the canonical alert schema. Parsers are stateless functions — no database calls, no external dependencies.
```typescript
// Provider parser interface
interface ProviderParser {
  provider: string;
  validateSignature(headers: Headers, body: string, secret: string): boolean;
  // Some providers send batched payloads (Datadog sends arrays)
  parse(payload: unknown): CanonicalAlert[];
}

// Registry — new providers added by implementing the interface
const parsers: Record<string, ProviderParser> = {
  'pagerduty': new PagerDutyParser(),
  'datadog':   new DatadogParser(),
  'grafana':   new GrafanaParser(),
  'opsgenie':  new OpsGenieParser(),
  'github':    new GitHubActionsParser(), // deploy events
  'gitlab':    new GitLabCIParser(),      // deploy events
  'argocd':    new ArgoCDParser(),        // deploy events
  'custom':    new CustomWebhookParser(), // user-defined mapping
};
```
**HMAC Validation per Provider:**
| Provider | Signature Header | Algorithm | Payload |
|---|---|---|---|
| PagerDuty | `X-PagerDuty-Signature` | HMAC-SHA256 | Raw body |
| Datadog | `DD-WEBHOOK-SIGNATURE` | HMAC-SHA256 | Raw body |
| Grafana | `X-Grafana-Alerting-Signature` | HMAC-SHA256 | Raw body |
| OpsGenie | `X-OpsGenie-Signature` | HMAC-SHA256 | Raw body |
| GitHub | `X-Hub-Signature-256` | HMAC-SHA256 | Raw body |
| GitLab | `X-Gitlab-Token` | Token comparison | N/A |
| Custom | `X-DD0C-Signature` | HMAC-SHA256 | Raw body |
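All of the HMAC-SHA256 rows in the table above reduce to the same core check. A hedged sketch using Node's `crypto` module with a constant-time comparison (`verifyHmacSha256` is our name, not a library API; note GitHub prefixes its header value with `sha256=`, so the scheme prefix would be stripped before this check):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify an HMAC-SHA256 signature over the raw request body.
// `signature` is the hex digest taken from the provider's header.
function verifyHmacSha256(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signature, "hex");
  // timingSafeEqual requires equal-length buffers; a length mismatch is invalid.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```

Hashing the raw body (not a re-serialized JSON object) matters: any re-encoding of the payload before hashing will produce signature mismatches.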
**Canonical Alert Schema:**
```typescript
interface CanonicalAlert {
  // Identity
  alert_id: string;              // dd0c-generated ULID (sortable, unique)
  tenant_id: string;             // From webhook URL path
  provider: string;              // 'pagerduty' | 'datadog' | 'grafana' | 'opsgenie' | 'custom'
  provider_alert_id: string;     // Original alert ID from provider
  provider_incident_id?: string; // Provider's incident grouping (if any)

  // Classification
  severity: 'critical' | 'high' | 'medium' | 'low' | 'info';
  status: 'triggered' | 'acknowledged' | 'resolved';
  category: 'infrastructure' | 'application' | 'security' | 'deployment' | 'custom';

  // Context
  service: string;               // Normalized service name
  environment: string;           // 'production' | 'staging' | 'development'
  title: string;                 // Human-readable alert title
  description?: string;          // Alert body/details (may be stripped in privacy mode)
  tags: Record<string, string>;  // Normalized key-value tags

  // Fingerprint (for dedup)
  fingerprint: string;           // SHA-256 of (tenant_id + provider + service + title_normalized)

  // Timestamps
  triggered_at: string;          // ISO 8601 — when the alert originally fired
  received_at: string;           // ISO 8601 — when dd0c received the webhook
  resolved_at?: string;          // ISO 8601 — when resolved (if status=resolved)

  // Metadata
  schema_version: number;        // Canonical schema version (see §3.1)
  raw_payload_s3_key: string;    // S3 key for the original payload
  source_url?: string;           // Deep link back to the alert in the provider's UI
}
```
**Key design decisions:**
1. **ULID for alert_id.** ULIDs are lexicographically sortable by time (unlike UUIDv4), which makes DynamoDB range queries efficient and eliminates the need for a secondary index on timestamp.
2. **Fingerprint for dedup.** The fingerprint is a deterministic hash of the alert's identity fields. Two alerts with the same fingerprint from the same provider within a correlation window are duplicates. This catches the "same alert firing every 30 seconds" pattern without ML.
3. **Severity normalization.** Each provider uses different severity scales (Datadog: P1-P5, PagerDuty: critical/high/low, Grafana: alerting/ok). The parser normalizes to a 5-level scale. Mapping is configurable per tenant.
4. **Privacy mode.** When enabled, `description` is set to `null` and `tags` are hashed. Only structural metadata (service, severity, timestamp) is stored. Reduces intelligence slightly but eliminates sensitive data concerns.
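The fingerprint from decision 2 can be sketched as a pure function. The hashed fields come from the schema; the exact title-normalization rules here (lowercasing, collapsing digits and whitespace) are our assumption of what "title_normalized" would mean, chosen so that volatile details like host numbers do not defeat dedup:

```typescript
import { createHash } from "node:crypto";

// Normalize the title so volatile details (numbers, hostnames with digits,
// extra whitespace) do not produce distinct fingerprints. Illustrative rules.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/\d+/g, "N").replace(/\s+/g, " ").trim();
}

// Deterministic fingerprint per §2.1: SHA-256 over the identity fields.
function fingerprint(tenantId: string, provider: string, service: string, title: string): string {
  const material = [tenantId, provider, service, normalizeTitle(title)].join("|");
  return createHash("sha256").update(material).digest("hex");
}
```

With these rules, "CPU > 90% on host-42" and "CPU > 91% on host-17" collapse to the same fingerprint, which is exactly the "same alert firing every 30 seconds" pattern the dedup step targets.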
### 2.2 Correlation Engine
The Correlation Engine is the core intelligence of dd0c/alert. It takes a stream of canonical alerts and produces correlated incidents — groups of related alerts that represent a single underlying issue.
**Architecture:** ECS Fargate service consuming from SQS FIFO, maintaining active correlation windows in Redis, writing results to DynamoDB + TimescaleDB.
```mermaid
graph LR
subgraph Input
SQS[SQS FIFO<br/>alert-ingested]
end
subgraph CE["Correlation Engine (ECS Fargate)"]
RECV[Message Receiver]
FP[Fingerprint Dedup]
TW[Time-Window<br/>Correlator]
SDG[Service Dependency<br/>Graph Matcher]
DC[Deploy Correlation]
SCORE[Incident Scorer]
end
subgraph State
REDIS[(Redis<br/>Active Windows)]
DDB[(DynamoDB<br/>Incidents)]
TSDB[(TimescaleDB<br/>Time-Series)]
end
SQS --> RECV
RECV --> FP --> TW --> SDG --> DC --> SCORE
TW <--> REDIS
SDG <--> DDB
DC <--> DDB
SCORE --> DDB & TSDB
SCORE --> CORR_Q[SQS: correlation-request]
```
**Correlation Pipeline (executed per alert):**
```
Alert arrives
├─ Step 1: FINGERPRINT DEDUP
│ Is there an alert with the same fingerprint in the active window?
│ YES → Increment count on existing alert, skip to scoring
│ NO → Continue
├─ Step 2: TIME-WINDOW CORRELATION
│ Find all open correlation windows for this tenant
│ Does this alert fall within an existing window's time range + service scope?
│ YES → Add alert to existing window
│ NO → Open a new correlation window (default: 5 min, configurable)
├─ Step 3: SERVICE-DEPENDENCY MATCHING
│ Does the service dependency graph show a relationship between
│ this alert's service and services in any open window?
│ YES → Merge windows (upstream DB alert + downstream API errors = one incident)
│ NO → Keep windows separate
├─ Step 4: DEPLOY CORRELATION
│ Was there a deployment to this service (or an upstream dependency)
│ within the lookback period (default: 15 min)?
│ YES → Tag the correlation window with deploy context
│ NO → Continue
└─ Step 5: INCIDENT SCORING
When a correlation window closes (timeout or manual trigger):
- Count total alerts in window
- Count unique services affected
- Calculate noise score (0-100)
- Generate incident summary
- Write Incident record to DynamoDB
- Emit to correlation-request queue for Suggestion Engine
```
**Time-Window Correlation — The Algorithm:**
```typescript
interface CorrelationWindow {
  window_id: string;           // ULID
  tenant_id: string;
  opened_at: string;           // ISO 8601
  closes_at: string;           // opened_at + window_duration
  window_duration_ms: number;  // Default 300000 (5 min), configurable
  status: 'open' | 'closed';

  // Alerts in this window
  alert_ids: string[];
  alert_count: number;
  unique_fingerprints: number;

  // Services involved
  services: Set<string>;
  environments: Set<string>;

  // Deploy context (if matched)
  deploy_event_id?: string;
  deploy_service?: string;
  deploy_pr?: string;
  deploy_author?: string;
  deploy_timestamp?: string;

  // Scoring (computed on close)
  noise_score?: number;        // 0-100 (100 = pure noise)
  severity_max?: string;       // Highest severity alert in window
}
```
**Window management in Redis:**
```
# Active windows stored as Redis sorted sets (score = closes_at timestamp)
ZADD tenant:{tenant_id}:windows {closes_at_epoch} {window_id}
# Alert-to-window mapping
SADD window:{window_id}:alerts {alert_id}
# Service-to-window index (for dependency matching)
SADD tenant:{tenant_id}:service:{service_name}:windows {window_id}
# Window metadata as hash
HSET window:{window_id} opened_at ... closes_at ... alert_count ...
# TTL on all keys = window_duration + 1 hour (cleanup buffer)
```
**Window extension logic:** If a new alert arrives for an open window within the last 30 seconds of the window's lifetime, the window extends by 2 minutes (up to a maximum of 15 minutes total). This catches cascading failures where alerts trickle in over time.
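The extension rule can be expressed as a pure function over epoch-millisecond timestamps (the 30-second trigger, 2-minute extension, and 15-minute cap are from the text; the function name is ours):

```typescript
const EXTENSION_TRIGGER_MS = 30_000;  // alert lands in the last 30s of the window
const EXTENSION_MS = 2 * 60_000;      // extend by 2 minutes
const MAX_WINDOW_MS = 15 * 60_000;    // hard cap: 15 minutes total lifetime

// Given a window's open/close times (epoch ms) and a new alert's arrival time,
// return the (possibly extended) close time.
function extendWindow(openedAt: number, closesAt: number, alertAt: number): number {
  const inFinalStretch = alertAt >= closesAt - EXTENSION_TRIGGER_MS && alertAt < closesAt;
  if (!inFinalStretch) return closesAt;
  const extended = closesAt + EXTENSION_MS;
  return Math.min(extended, openedAt + MAX_WINDOW_MS); // never exceed the cap
}
```

In the Redis layout above, an extension is a `ZADD` with the new `closes_at` score plus an `HSET` on the window hash.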
**Service Dependency Graph:**
The dependency graph is built from two sources:
1. **Inferred from alert patterns.** If Service A alerts consistently fire 1-3 minutes before Service B alerts, dd0c infers a dependency (A → B). Requires 3+ occurrences to establish.
2. **Explicit configuration.** Customers can declare dependencies via the dashboard or API: `payment-service → notification-service → email-service`.
```typescript
interface ServiceDependency {
  tenant_id: string;
  upstream_service: string;
  downstream_service: string;
  source: 'inferred' | 'explicit';
  confidence: number;       // 0.0-1.0 (inferred only)
  occurrence_count: number; // Times this pattern was observed
  last_seen: string;        // ISO 8601
}
```
Stored in DynamoDB with GSI on `tenant_id + upstream_service` for fast lookups during correlation.
### 2.3 Deployment Tracker
The Deployment Tracker ingests CI/CD webhook events and maintains a timeline of deployments per service per tenant. The Correlation Engine queries this timeline to answer: "Was there a deploy to this service (or its dependencies) in the last N minutes?"
**Deploy Event Schema:**
```typescript
interface DeployEvent {
  deploy_id: string;          // ULID
  tenant_id: string;
  provider: 'github' | 'gitlab' | 'argocd' | 'custom';
  provider_deploy_id: string; // e.g., GitHub Actions run_id

  // What was deployed
  service: string;            // Target service name
  environment: string;        // production | staging | development
  version?: string;           // Git SHA, tag, or version string

  // Who and what
  author: string;             // Git commit author or deployer
  commit_sha: string;
  commit_message?: string;
  pr_number?: string;
  pr_url?: string;

  // Timing
  started_at: string;         // ISO 8601
  completed_at?: string;      // ISO 8601
  status: 'in_progress' | 'success' | 'failure' | 'cancelled';

  // Metadata
  source_url: string;         // Link to CI/CD run
  changes_summary?: string;   // Files changed, lines added/removed
}
```
**Deploy-to-Alert Correlation Logic:**
```
When Correlation Engine processes an alert:
1. Query DynamoDB: all deploys for tenant where
service IN (alert.service, ...upstream_dependencies)
AND completed_at > (alert.triggered_at - lookback_window)
AND completed_at < alert.triggered_at
AND environment = alert.environment
2. If match found:
- Attach deploy context to correlation window
- Boost noise_score by 15-30 points (deploy-correlated alerts
are more likely to be transient noise)
- Include deploy details in Slack incident card
3. Lookback window defaults:
- Production: 15 minutes
- Staging: 30 minutes
- Configurable per tenant
```
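Under the logic above, the match condition reduces to a simple predicate. A sketch as an in-memory filter (in production this would be the DynamoDB query described; the trimmed-down types and function name here are ours):

```typescript
interface Deploy {
  service: string;
  environment: string;
  completed_at: number; // epoch ms
}

interface AlertCtx {
  service: string;
  environment: string;
  triggered_at: number; // epoch ms
  upstream_dependencies: string[];
}

// Return deploys that completed inside the lookback window before the alert,
// targeting the alert's service or any upstream dependency, same environment.
function matchDeploys(alert: AlertCtx, deploys: Deploy[], lookbackMs: number): Deploy[] {
  const candidates = new Set([alert.service, ...alert.upstream_dependencies]);
  return deploys.filter(d =>
    candidates.has(d.service) &&
    d.environment === alert.environment &&
    d.completed_at > alert.triggered_at - lookbackMs &&
    d.completed_at < alert.triggered_at
  );
}
```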
**Service name mapping challenge:**
The biggest practical challenge is mapping CI/CD service names to monitoring service names. GitHub Actions might deploy `payment-api` while Datadog monitors `prod-payment-api-us-east-1`. Solutions:
1. **Convention-based matching.** Strip common prefixes/suffixes (`prod-`, `-us-east-1`, `-service`). Fuzzy match on the core name.
2. **Explicit mapping.** Dashboard UI where customers map: `GitHub: payment-api` → `Datadog: prod-payment-api-us-east-1`.
3. **Tag-based matching.** If both CI/CD and monitoring use a common tag (e.g., `dd0c.service=payment`), match on that.
V1 uses convention-based + explicit mapping. Tag-based matching added in V2.
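Convention-based matching might look like the sketch below. The prefix/suffix lists are illustrative, not a fixed spec; in practice they would be extended per tenant:

```typescript
// Illustrative prefixes/suffixes commonly stripped from service names.
const PREFIXES = ["prod-", "staging-", "dev-"];
const SUFFIXES = ["-service", "-api", "-us-east-1", "-us-west-2", "-eu-west-1"];

// Reduce a service name to its "core" form for matching.
function coreServiceName(name: string): string {
  let core = name.toLowerCase();
  for (const p of PREFIXES) if (core.startsWith(p)) core = core.slice(p.length);
  // Suffixes can stack ("-api-us-east-1"), so strip repeatedly.
  let changed = true;
  while (changed) {
    changed = false;
    for (const s of SUFFIXES) {
      if (core.endsWith(s)) { core = core.slice(0, -s.length); changed = true; }
    }
  }
  return core;
}

// Two names match when their cores are equal.
function serviceNamesMatch(a: string, b: string): boolean {
  return coreServiceName(a) === coreServiceName(b);
}
```

With these rules, `payment-api` (GitHub) and `prod-payment-api-us-east-1` (Datadog) both reduce to `payment` and match.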
### 2.4 Suggestion Engine
The Suggestion Engine takes correlated incidents from the Correlation Engine and generates actionable suggestions. In V1, this is strictly observe-and-suggest — no auto-action.
**Suggestion Types:**
```typescript
type SuggestionType =
  | 'group'       // "These 12 alerts are one incident"
  | 'suppress'    // "We'd suppress these 8 alerts (deploy noise)"
  | 'tune'        // "This alert fires 40x/week and is never actioned — consider tuning"
  | 'dependency'  // "Service A alerts always precede Service B — likely upstream dependency"
  | 'runbook';    // "This pattern was resolved by [runbook] 5 times before" (V2+)

interface Suggestion {
  suggestion_id: string;
  tenant_id: string;
  incident_id: string;
  type: SuggestionType;
  confidence: number;   // 0.0-1.0
  title: string;        // Human-readable summary
  reasoning: string;    // Plain-English explanation of WHY
  affected_alert_ids: string[];
  action_taken: 'none'; // V1: always 'none' (observe-only)
  user_feedback?: 'helpful' | 'not_helpful' | null;
  created_at: string;
}
```
**Noise Scoring Algorithm (V1 — Rule-Based):**
```typescript
function calculateNoiseScore(window: CorrelationWindow): number {
  let score = 0;

  // Factor 1: Duplicate fingerprints (0-30 points)
  const dupRatio = 1 - (window.unique_fingerprints / window.alert_count);
  score += Math.round(dupRatio * 30);

  // Factor 2: Deploy correlation (0-25 points, +5 bonus)
  if (window.deploy_event_id) {
    score += 25;
    // Bonus if the PR reference suggests a config or feature-flag change
    if (window.deploy_pr?.includes('config') || window.deploy_pr?.includes('feature-flag')) {
      score += 5;
    }
  }

  // Factor 3: Historical pattern (0-20 points)
  // If this service+alert combination has fired and auto-resolved
  // N times in the past 7 days without human action
  const autoResolveRate = getAutoResolveRate(window.tenant_id, window.services);
  score += Math.round(autoResolveRate * 20);

  // Factor 4: Severity distribution (0-15 points)
  // Windows with only low/info severity alerts score higher
  if (window.severity_max === 'low' || window.severity_max === 'info') {
    score += 15;
  } else if (window.severity_max === 'medium') {
    score += 8;
  }
  // critical/high = 0 bonus (never boost noise score for critical alerts)

  // Factor 5: Time of day (0-10 points)
  // Alerts during deploy windows (10am-4pm US weekdays) are more likely noise
  const openedAt = new Date(window.opened_at);
  const isWeekday = openedAt.getUTCDay() >= 1 && openedAt.getUTCDay() <= 5;
  const isBusinessHours = openedAt.getUTCHours() >= 14 && openedAt.getUTCHours() <= 23; // ~10am-4pm across US timezones
  if (isWeekday && isBusinessHours) score += 10;

  // Cap at 100, floor at 0
  return Math.max(0, Math.min(100, score));
}
```
**"Never Suppress" Safelist (Default):**
These alert categories are never assigned a noise score above 50, regardless of pattern matching:
```typescript
const NEVER_SUPPRESS_DEFAULTS = [
  { category: 'security', reason: 'Security alerts require human review' },
  { severity: 'critical', reason: 'Critical severity always surfaces' },
  { service_pattern: /database|db|rds|dynamo/i, reason: 'Database alerts are high-risk' },
  { service_pattern: /payment|billing|stripe|checkout/i, reason: 'Payment path alerts are business-critical' },
  { title_pattern: /data.?loss|corruption|breach/i, reason: 'Data integrity alerts are never noise' },
];
```
Configurable per tenant. Customers can add/remove patterns. Defaults are opt-out (must explicitly remove).
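Applying the safelist is then a cap on the computed score. A sketch under our own names and shapes (`applySafelist`, `AlertLike`, and the match-any semantics are assumptions about how the rules combine):

```typescript
interface SafeRule {
  category?: string;
  severity?: string;
  service_pattern?: RegExp;
  title_pattern?: RegExp;
  reason: string;
}

interface AlertLike {
  category: string;
  severity: string;
  service: string;
  title: string;
}

const NOISE_CAP = 50; // safelisted alerts are never scored above this

// Return the capped noise score plus the first matching rule's reason, if any.
function applySafelist(alert: AlertLike, score: number, rules: SafeRule[]): { score: number; reason?: string } {
  for (const r of rules) {
    const matches =
      (r.category !== undefined && r.category === alert.category) ||
      (r.severity !== undefined && r.severity === alert.severity) ||
      (r.service_pattern !== undefined && r.service_pattern.test(alert.service)) ||
      (r.title_pattern !== undefined && r.title_pattern.test(alert.title));
    if (matches) return { score: Math.min(score, NOISE_CAP), reason: r.reason };
  }
  return { score };
}
```

Returning the matched rule's `reason` lets the Slack card explain why an alert was kept visible.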
### 2.5 Notification Service
The Notification Service is the primary delivery mechanism. Slack is the V1 interface — most engineers never open the dashboard.
**Slack Message Format (Incident Card):**
```
📊 Incident #47 — payment-service
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔴 Severity: HIGH | Noise Score: 82/100
12 alerts → 1 incident
├─ Datadog: 8 alerts (latency spike, error rate, CPU)
├─ PagerDuty: 3 pages (payment-service P2)
└─ Grafana: 1 alert (downstream notification-service)
🚀 Deploy detected: PR #1042 "Add retry logic to payment processor"
by @marcus • merged 3 min before first alert
https://github.com/acme/payment-service/pull/1042
💡 Suggestion: This looks like deploy noise.
Similar pattern seen 4 times this month — auto-resolved within 8 min each time.
We'd suppress 10 of 12 alerts. [What would change →]
👍 Helpful 👎 Not helpful 🔇 Mute this pattern
```
**Notification Types:**
| Type | Trigger | Channel |
|---|---|---|
| **Incident Card** | Correlation window closes with grouped alerts | Configured Slack channel |
| **Real-time Alert** | First alert in a new window (eager notification) | Configured Slack channel |
| **Daily Digest** | 9:00 AM in tenant's timezone | Configured Slack channel or DM |
| **Weekly Noise Report** | Monday 9:00 AM | Configured Slack channel + email to admin |
| **Integration Health** | Webhook volume drops to zero for >2 hours | DM to integration owner |
**Daily Digest Format:**
```
📋 dd0c/alert Daily Digest — Feb 28, 2026
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Yesterday: 247 alerts → 18 incidents
Noise ratio: 87% | You'd have been paged 18 times instead of 247
Top noisy alerts:
1. checkout-latency (Datadog) — 43 fires, 0 incidents → 🔇 Tune candidate
2. disk-usage-warning (Grafana) — 31 fires, 0 incidents → 🔇 Tune candidate
3. auth-service-timeout (PagerDuty) — 22 fires, 1 real incident
Deploy-correlated noise: 67% of alerts fired within 15 min of a deploy
Noisiest deploy: PR #1038 "Update feature flags" triggered 34 alerts
💰 Estimated savings: 4.2 engineering hours not spent triaging noise
```
**Slack Integration Architecture:**
- OAuth 2.0 flow during onboarding (Slack App Directory compliant)
- Bot token stored encrypted in DynamoDB (per-tenant)
- Uses Slack `chat.postMessage` for new messages, `chat.update` for in-place updates
- Block Kit for rich formatting
- Interactive components (buttons) handled via Slack Events API → API Gateway → Lambda
- Rate limiting: Slack allows 1 message/second per channel. Batch notifications during incident storms (queue in SQS, drain at 1/sec)
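The 1 message/second drain can be sketched as a send schedule: given the messages queued during a storm, compute when each goes out. This is a simplification of the SQS-backed queue described above (function name and shape are ours):

```typescript
// Assign each queued message a send offset (ms from now), pacing at
// `perSecond` messages per channel to respect Slack's rate limit.
function scheduleSends(messageIds: string[], perSecond = 1): Array<{ id: string; delayMs: number }> {
  const intervalMs = 1000 / perSecond;
  return messageIds.map((id, i) => ({ id, delayMs: Math.round(i * intervalMs) }));
}
```

In practice the offsets would become SQS `DelaySeconds` values, and in-place `chat.update` calls for an existing incident card would be coalesced rather than queued separately.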
---
## 3. DATA ARCHITECTURE
### 3.1 Canonical Alert Schema (Provider-Agnostic)
The canonical schema (defined in §2.1) is the single source of truth for all alert data. Every provider's payload is normalized into this schema at ingestion time. Downstream components never touch raw provider payloads.
**Schema evolution strategy:** The canonical schema uses a `schema_version` field (integer, starting at 1). When the schema changes:
1. New fields are always optional (backward compatible)
2. Removed fields are deprecated for 2 versions before removal
3. The ingestion Lambda writes the current schema version; consumers handle version differences gracefully
4. DynamoDB items carry their schema version — no backfill migrations needed
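A version-tolerant read path follows from rules 1-4. A sketch (the target version number and the idea of stamping on read are illustrative; only `schema_version` itself is from the text):

```typescript
interface VersionedRecord {
  schema_version?: number;  // absent on pre-versioning records
  [k: string]: unknown;
}

const CURRENT_SCHEMA_VERSION = 2; // hypothetical current version

// Upgrade any stored record to the current in-memory shape on read.
// Missing schema_version is treated as version 1.
function upgradeOnRead(record: VersionedRecord): VersionedRecord {
  const version = record.schema_version ?? 1;
  const upgraded = { ...record };
  if (version < CURRENT_SCHEMA_VERSION) {
    // Because new fields are always optional (rule 1), upgrading older
    // records is just stamping the current version; no field rewrites.
    upgraded.schema_version = CURRENT_SCHEMA_VERSION;
  }
  return upgraded;
}
```

Because the upgrade happens on read, no backfill migration ever touches DynamoDB (rule 4).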
### 3.2 Event Sourcing for Alert History
All alert data follows an event-sourcing pattern. The raw event stream is the source of truth; materialized views are derived and rebuildable.
**Event Stream:**
```typescript
type AlertEvent =
  | { type: 'alert.received'; alert: CanonicalAlert }
  | { type: 'alert.deduplicated'; alert_id: string; original_alert_id: string }
  | { type: 'alert.correlated'; alert_id: string; window_id: string }
  | { type: 'alert.resolved'; alert_id: string; resolved_at: string }
  | { type: 'window.opened'; window: CorrelationWindow }
  | { type: 'window.extended'; window_id: string; new_closes_at: string }
  | { type: 'window.closed'; window_id: string; incident_id: string }
  | { type: 'incident.created'; incident: Incident }
  | { type: 'suggestion.created'; suggestion: Suggestion }
  | { type: 'feedback.received'; suggestion_id: string; feedback: 'helpful' | 'not_helpful' }
  | { type: 'deploy.received'; deploy: DeployEvent };
```
**Storage:**
| Store | What | Why | Retention |
|---|---|---|---|
| **S3 (raw payloads)** | Original webhook bodies, exactly as received | Audit trail, replay, simulation mode, debugging provider parser issues | Free: 7d, Pro: 90d, Business: 1yr, Enterprise: custom |
| **DynamoDB (events)** | Canonical alert events, incidents, suggestions, feedback | Primary operational store. Fast reads for dashboard, API, correlation lookups | Free: 7d, Pro: 90d, Business: 1yr |
| **TimescaleDB (time-series)** | Alert counts, noise ratios, correlation metrics per time bucket | Trend analysis, Noise Report Card, business impact dashboard | 1yr rolling (continuous aggregates compress older data) |
| **Redis (ephemeral)** | Active correlation windows, recent fingerprints, rate counters | Real-time correlation state. Ephemeral by design — rebuilt from DynamoDB on cold start | TTL-based: window_duration + 1hr |
**Replay capability:** Because raw payloads are archived in S3, the entire alert history can be replayed through the ingestion pipeline. This enables:
- **Alert Simulation Mode:** Upload historical exports → replay through correlation engine → show "what would have happened"
- **Parser upgrades:** When a provider parser is improved, replay recent payloads to backfill better-normalized data
- **Correlation tuning:** Replay last 30 days with different window durations to find optimal settings per tenant
### 3.3 Time-Series Storage for Correlation Windows
TimescaleDB handles all time-range queries that DynamoDB's key-value model handles poorly.
**Hypertables:**
```sql
-- Alert time-series (one row per alert)
CREATE TABLE alert_timeseries (
  tenant_id    TEXT NOT NULL,
  alert_id     TEXT NOT NULL,
  service      TEXT NOT NULL,
  severity     TEXT NOT NULL,
  provider     TEXT NOT NULL,
  fingerprint  TEXT NOT NULL,
  triggered_at TIMESTAMPTZ NOT NULL,
  received_at  TIMESTAMPTZ NOT NULL,
  noise_score  SMALLINT,
  incident_id  TEXT,
  PRIMARY KEY (tenant_id, triggered_at, alert_id)
);

SELECT create_hypertable('alert_timeseries', 'triggered_at',
  chunk_time_interval => INTERVAL '1 day');

-- Hourly alert counts per tenant/service.
-- NOTE: TimescaleDB continuous aggregates do not support DISTINCT aggregates,
-- so views using COUNT(DISTINCT ...) are defined as regular materialized views
-- refreshed on a schedule (or rewritten with approximate distinct counts from
-- timescaledb_toolkit).
CREATE MATERIALIZED VIEW alert_hourly AS
SELECT
  tenant_id,
  service,
  time_bucket('1 hour', triggered_at) AS bucket,
  COUNT(*) AS alert_count,
  COUNT(DISTINCT fingerprint) AS unique_alerts,
  COUNT(DISTINCT incident_id) AS incident_count,
  AVG(noise_score) AS avg_noise_score
FROM alert_timeseries
GROUP BY tenant_id, service, bucket;

-- Daily noise report per tenant (regular materialized view, same reason)
CREATE MATERIALIZED VIEW noise_daily AS
SELECT
  tenant_id,
  time_bucket('1 day', triggered_at) AS bucket,
  COUNT(*) AS total_alerts,
  COUNT(DISTINCT incident_id) AS total_incidents,
  ROUND(100.0 * (1 - COUNT(DISTINCT incident_id)::NUMERIC / NULLIF(COUNT(*), 0)), 1)
    AS noise_pct,
  AVG(noise_score) AS avg_noise_score
FROM alert_timeseries
GROUP BY tenant_id, bucket;
```
**Why TimescaleDB over pure DynamoDB:**
- DynamoDB excels at point lookups and narrow range scans. It's terrible at "give me all alerts for this tenant in the last 5 minutes across all services" — that's a full partition scan.
- TimescaleDB's hypertable chunking + continuous aggregates make time-range queries fast and pre-computed aggregates make dashboard queries instant.
- The trade-off: TimescaleDB is a managed RDS instance (not serverless). Cost is fixed (~$50/month for db.t4g.medium). Acceptable for V1; evaluate Aurora Serverless v2 at scale.
### 3.4 Service Dependency Graph Storage
```typescript
// DynamoDB table: service-dependencies
// Partition key: tenant_id
// Sort key: upstream_service#downstream_service
interface ServiceDependencyRecord {
tenant_id: string; // PK
edge_key: string; // SK: "payment-service#notification-service"
upstream_service: string;
downstream_service: string;
source: 'inferred' | 'explicit';
confidence: number; // 0.0-1.0
occurrence_count: number;
first_seen: string;
last_seen: string;
avg_lag_ms: number; // Average time between upstream and downstream alerts
ttl?: number; // Epoch seconds — inferred edges expire after 30d of no observations
}
```
**Graph query pattern:** When the Correlation Engine needs to check dependencies for a service, it does a DynamoDB query:
- `PK = tenant_id, SK begins_with upstream_service#` → all downstream dependencies
- Inverted GSI (sort key `downstream#upstream`): `PK = tenant_id, SK begins_with downstream_service#` → all upstream dependencies
This is O(degree) per lookup — fast enough for real-time correlation. The graph is small per tenant (typically <100 edges for a 50-service architecture).
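As a sketch, both lookups can be expressed as pure builders of DynamoDB `Query` inputs (the table name `service-dependencies`, index name `inverted`, and attribute name `inverted_key` are assumptions, not the deployed schema):

```typescript
// "All downstream dependencies of a service": query the base table,
// whose sort key is "upstream#downstream".
export function downstreamQuery(tenantId: string, service: string) {
  return {
    TableName: "service-dependencies",
    KeyConditionExpression: "tenant_id = :t AND begins_with(edge_key, :p)",
    ExpressionAttributeValues: { ":t": { S: tenantId }, ":p": { S: `${service}#` } },
  };
}

// "All upstream dependencies of a service": same shape against the inverted
// GSI, whose sort key is "downstream#upstream".
export function upstreamQuery(tenantId: string, service: string) {
  return {
    TableName: "service-dependencies",
    IndexName: "inverted",
    KeyConditionExpression: "tenant_id = :t AND begins_with(inverted_key, :p)",
    ExpressionAttributeValues: { ":t": { S: tenantId }, ":p": { S: `${service}#` } },
  };
}
```

Either object can be passed to the AWS SDK's `QueryCommand` as-is.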
### 3.5 Multi-Tenant Data Isolation
**Isolation model: Logical isolation with tenant_id partitioning.**
Every data record includes `tenant_id` as the partition key (DynamoDB) or a required column (TimescaleDB). There is no cross-tenant data access path.
| Layer | Isolation Mechanism |
|---|---|
| **API Gateway** | API key → tenant_id mapping. All requests scoped to tenant. |
| **Webhook URLs** | Tenant ID embedded in URL path. Validated against tenant record. |
| **DynamoDB** | `tenant_id` is the partition key on every table. No scan operations in application code — all queries are PK-scoped. |
| **TimescaleDB** | Row-level security (RLS) policies enforce `tenant_id = current_setting('app.tenant_id')`. Connection pool sets tenant context per request. |
| **Redis** | All keys prefixed with `tenant:{tenant_id}:`. No `KEYS *` in application code. |
| **S3** | Object key prefix: `raw/{tenant_id}/{date}/{alert_id}.json`. Bucket policy prevents cross-prefix access. |
| **Slack** | Bot token per tenant. Stored encrypted. Never shared across tenants. |
**Why not per-tenant databases?** Cost. At $19/seat with potentially thousands of tenants, per-tenant RDS instances are economically untenable. Logical isolation with strong key-scoping is the standard pattern for multi-tenant SaaS at this price point. SOC 2 auditors accept this model with proper access controls and audit logging.
### 3.6 Retention Policies
| Tier | Raw Payloads (S3) | Alert Events (DynamoDB) | Time-Series (TimescaleDB) | Correlation Windows |
|---|---|---|---|---|
| **Free** | 7 days | 7 days | 7 days (no aggregates) | 24 hours |
| **Pro** | 90 days | 90 days | 90 days + hourly aggregates for 1yr | 30 days |
| **Business** | 1 year | 1 year | 1 year + daily aggregates for 2yr | 90 days |
| **Enterprise** | Custom | Custom | Custom | Custom |
**Implementation:**
- **S3:** Lifecycle policies transition objects: Standard → IA (30d) → Glacier (90d) → Delete (per tier)
- **DynamoDB:** TTL attribute on every item. DynamoDB automatically deletes expired items (eventually consistent, ~48hr window). No Lambda cleanup needed.
- **TimescaleDB:** `drop_chunks()` policy per hypertable. Continuous aggregates survive chunk drops (aggregated data persists longer than raw data).
- **Redis:** TTL on all keys. Self-cleaning by design.
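The DynamoDB side of retention reduces to stamping a `ttl` attribute at write time; a minimal sketch (the tier → days mapping is taken from the retention table above; Enterprise is custom, so it gets no automatic expiry here):

```typescript
// Retention days per tier, from the retention policy table.
const RETENTION_DAYS: Record<string, number> = { free: 7, pro: 90, business: 365 };

// Returns the value for the item's TTL attribute: expiry as epoch seconds.
// Unknown/custom tiers return undefined, i.e. the attribute is omitted and
// DynamoDB never auto-deletes the item.
export function ttlFor(tier: string, writtenAtMs: number): number | undefined {
  const days = RETENTION_DAYS[tier];
  if (days === undefined) return undefined;
  return Math.floor(writtenAtMs / 1000) + days * 86_400;
}
```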
---
## 4. INFRASTRUCTURE
### 4.1 AWS Architecture
**Region:** `us-east-1` (primary). Single-region for V1. Multi-region (us-west-2 failover) at 10K+ alerts/day or first EU customer requiring data residency.
```mermaid
graph TB
subgraph Edge["Edge Layer"]
CF[CloudFront<br/>Dashboard CDN]
R53[Route 53<br/>hooks.dd0c.com]
end
subgraph Ingestion["Ingestion (Serverless)"]
APIGW[API Gateway HTTP API<br/>Webhook Receiver + REST API]
L_INGEST[Lambda: webhook-ingest<br/>256MB, 10s timeout]
L_API[Lambda: api-handler<br/>512MB, 30s timeout]
L_SLACK[Lambda: slack-events<br/>256MB, 10s timeout]
L_NOTIFY[Lambda: notification-sender<br/>256MB, 30s timeout]
end
subgraph Queue["Message Bus"]
SQS_ALERT[SQS FIFO<br/>alert-ingested<br/>MessageGroupId=tenant_id]
SQS_DEPLOY[SQS FIFO<br/>deploy-event<br/>MessageGroupId=tenant_id]
SQS_CORR[SQS Standard<br/>correlation-result]
SQS_NOTIFY[SQS Standard<br/>notification]
DLQ[SQS DLQ<br/>dead-letters]
end
subgraph Processing["Processing (ECS Fargate)"]
ECS_CE[ECS Service: correlation-engine<br/>1 vCPU, 2GB RAM<br/>Desired: 1, Max: 4]
ECS_SE[ECS Service: suggestion-engine<br/>0.5 vCPU, 1GB RAM<br/>Desired: 1, Max: 2]
end
subgraph Data["Data Layer"]
DDB[(DynamoDB<br/>On-Demand Capacity)]
RDS[(RDS PostgreSQL 16<br/>TimescaleDB<br/>db.t4g.medium)]
REDIS[(ElastiCache Redis<br/>cache.t4g.micro)]
S3_RAW[(S3: dd0c-raw-payloads)]
S3_STATIC[(S3: dd0c-dashboard)]
end
subgraph Ops["Operations"]
CW_LOGS[CloudWatch Logs]
CW_METRICS[CloudWatch Metrics]
CW_ALARMS[CloudWatch Alarms]
XRAY[X-Ray Tracing]
SM[Secrets Manager]
end
R53 --> APIGW
CF --> S3_STATIC
APIGW --> L_INGEST & L_API & L_SLACK
L_INGEST --> SQS_ALERT & SQS_DEPLOY & S3_RAW
SQS_ALERT & SQS_DEPLOY --> ECS_CE
ECS_CE --> SQS_CORR & DDB & RDS & REDIS
SQS_CORR --> ECS_SE
ECS_SE --> SQS_NOTIFY & DDB & RDS
SQS_NOTIFY --> L_NOTIFY
L_API --> DDB & RDS
L_SLACK --> DDB
SQS_ALERT & SQS_DEPLOY & SQS_CORR & SQS_NOTIFY -.-> DLQ
L_INGEST & L_API & ECS_CE & ECS_SE --> CW_LOGS & XRAY
ECS_CE & ECS_SE --> SM
```
**DynamoDB Tables:**
| Table | PK | SK | GSIs | Purpose |
|---|---|---|---|---|
| `alerts` | `tenant_id` | `alert_id` (ULID) | GSI1: `tenant_id` + `triggered_at` | Canonical alert records |
| `incidents` | `tenant_id` | `incident_id` (ULID) | GSI1: `tenant_id` + `created_at` | Correlated incident records |
| `suggestions` | `tenant_id` | `suggestion_id` | GSI1: `incident_id` | Noise suggestions + feedback |
| `deploys` | `tenant_id` | `deploy_id` (ULID) | GSI1: `tenant_id` + `service` + `completed_at` | Deploy events |
| `dependencies` | `tenant_id` | `upstream#downstream` | GSI1: `tenant_id` + `downstream#upstream` | Service dependency graph |
| `tenants` | `tenant_id` | `—` | GSI1: `api_key` | Tenant config, billing, integrations |
| `integrations` | `tenant_id` | `integration_id` | — | Webhook configs, Slack tokens |
All tables use on-demand capacity mode (no capacity planning, pay-per-request). Switch to provisioned with auto-scaling when read/write patterns stabilize (typically at 50K+ alerts/day).
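Because `alert_id` and `deploy_id` sort keys are ULIDs, their first 10 characters encode the millisecond timestamp in Crockford Base32, so lexicographic SK conditions double as time-range queries. An illustrative encoder of that time prefix (standard ULID layout, not dd0c-specific code):

```typescript
// Crockford Base32 alphabet used by ULID (no I, L, O, U).
const B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Encodes a millisecond timestamp into the 10-character ULID time prefix.
// Prefixes sort lexicographically in timestamp order, which is what makes
// begins_with / BETWEEN conditions on the SK behave like time-range scans.
export function ulidTimePrefix(timestampMs: number): string {
  let t = timestampMs;
  const chars = new Array<string>(10);
  for (let i = 9; i >= 0; i--) {
    chars[i] = B32[t % 32];
    t = Math.floor(t / 32);
  }
  return chars.join("");
}
```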
### 4.2 Real-Time Processing Pipeline
The critical path from webhook receipt to Slack notification must complete in under 10 seconds for the "eager first alert" notification, and within roughly 10 seconds of the correlation window closing (5 minutes by default) for the full correlated incident card.
**Latency budget:**
```
Webhook received by API Gateway 0ms
├─ Lambda cold start (worst case) +200ms
├─ HMAC validation + parsing +10ms
├─ DynamoDB write (raw event) +5ms
├─ SQS FIFO send +20ms
├─ S3 async put (non-blocking) +0ms (async)
│ ─────
│ Total ingestion: ~235ms (p99)
├─ SQS → ECS polling interval +100ms (long-polling, 0.1s)
├─ Correlation Engine processing +15ms
├─ Redis read/write (window state) +2ms
├─ DynamoDB read (deploy lookup) +5ms
│ ─────
│ Total to correlation decision: ~357ms (p99)
├─ SQS → Lambda (notification) +50ms
├─ Slack API call +300ms
│ ─────
│ Total webhook → Slack (eager): ~707ms (p99)
│ ... correlation window (5 min default) ...
├─ Window close → Suggestion Engine +200ms
├─ Slack chat.update (in-place) +300ms
│ ─────
│ Total webhook → full incident card: ~5min + 1.2s
```
**ECS Fargate task configuration:**
```typescript
// CDK definition
const correlationService = new ecs.FargateService(this, 'CorrelationEngine', {
cluster,
taskDefinition: new ecs.FargateTaskDefinition(this, 'CorrTask', {
cpu: 1024, // 1 vCPU
memoryLimitMiB: 2048, // 2 GB
}),
desiredCount: 1,
minHealthyPercent: 100,
maxHealthyPercent: 200,
circuitBreaker: { rollback: true },
});
// Auto-scaling based on SQS queue depth
const scaling = correlationService.autoScaleTaskCount({
minCapacity: 1,
maxCapacity: 4,
});
scaling.scaleOnMetric('QueueDepthScaling', {
metric: alertQueue.metricApproximateNumberOfMessagesVisible(),
scalingSteps: [
{ upper: 0, change: -1 }, // Scale in when queue empty
{ lower: 100, change: +1 }, // Scale out at 100 messages
{ lower: 1000, change: +2 }, // Scale out faster at 1000
],
  adjustmentType: appscaling.AdjustmentType.CHANGE_IN_CAPACITY, // from aws-cdk-lib/aws-applicationautoscaling
cooldown: Duration.seconds(60),
});
```
**SQS FIFO configuration:**
```typescript
const alertQueue = new sqs.Queue(this, 'AlertIngested', {
fifo: true,
contentBasedDeduplication: false, // We set explicit dedup IDs
deduplicationScope: sqs.DeduplicationScope.MESSAGE_GROUP,
fifoThroughputLimit: sqs.FifoThroughputLimit.PER_MESSAGE_GROUP_ID,
  // High-throughput FIFO: up to ~30K messages/sec per queue, across groups
visibilityTimeout: Duration.seconds(60),
retentionPeriod: Duration.days(4),
deadLetterQueue: {
queue: dlq,
maxReceiveCount: 3,
},
});
```
**Why SQS FIFO with per-tenant message groups:**
- `MessageGroupId = tenant_id` ensures alerts from the same tenant are processed in order (critical for time-window accuracy)
- Different tenants are processed in parallel (no head-of-line blocking)
- High-throughput FIFO mode supports ~30K messages/sec per queue by partitioning across message groups — far beyond our aggregate alert volume. (Within a single group, ordering serializes processing, so per-group throughput is bounded by the consumer.)
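The enqueue side can be sketched as a pure builder of `SendMessage` parameters. The dedup scheme shown (hash of alert fingerprint plus a minute bucket) is an illustrative assumption, as is the parameter shape — the queue URL comes from configuration:

```typescript
import { createHash } from "node:crypto";

// Builds SendMessage parameters for the alert-ingested FIFO queue.
// MessageGroupId = tenant_id preserves per-tenant ordering; the explicit
// MessageDeduplicationId lets SQS drop provider webhook retries that land
// within its 5-minute deduplication window.
export function buildAlertMessage(queueUrl: string, tenantId: string,
                                  fingerprint: string, body: string, nowMs: number) {
  const minuteBucket = Math.floor(nowMs / 60_000); // retries in the same minute dedupe
  return {
    QueueUrl: queueUrl,
    MessageBody: body,
    MessageGroupId: tenantId,
    MessageDeduplicationId: createHash("sha256")
      .update(`${fingerprint}:${minuteBucket}`).digest("hex"),
  };
}
```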
### 4.3 Cost Estimates
All estimates assume us-east-1 pricing as of 2026. Costs are monthly.
#### 1K alerts/day (~30K/month) — Early Stage
| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 30K requests | $0.03 |
| Lambda (ingestion) | 30K invocations × 256MB × 200ms | $0.02 |
| Lambda (API + notifications) | 50K invocations × 512MB × 300ms | $0.10 |
| SQS (FIFO + Standard) | 120K messages | $0.05 |
| ECS Fargate (Correlation) | 1 task × 1vCPU × 2GB × 24/7 | $48.00 |
| ECS Fargate (Suggestion) | 1 task × 0.5vCPU × 1GB × 24/7 | $18.00 |
| DynamoDB (on-demand) | ~100K reads + 60K writes | $0.15 |
| RDS (TimescaleDB) | db.t4g.micro (free tier eligible) | $0.00 |
| ElastiCache Redis | cache.t4g.micro | $12.00 |
| S3 | <1 GB stored | $0.02 |
| CloudWatch | Logs + metrics | $5.00 |
| Route 53 | 1 hosted zone | $0.50 |
| Secrets Manager | 5 secrets | $2.00 |
| **Total** | | **~$86/month** |
#### 10K alerts/day (~300K/month) — Growth Stage
| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 300K requests | $0.30 |
| Lambda (ingestion) | 300K invocations | $0.20 |
| Lambda (API + notifications) | 500K invocations | $1.00 |
| SQS | 1.2M messages | $0.50 |
| ECS Fargate (Correlation) | 2 tasks avg (auto-scaling) | $96.00 |
| ECS Fargate (Suggestion) | 1 task | $18.00 |
| DynamoDB (on-demand) | ~1M reads + 600K writes | $1.50 |
| RDS (TimescaleDB) | db.t4g.medium | $50.00 |
| ElastiCache Redis | cache.t4g.small | $25.00 |
| S3 | ~10 GB stored + transitions | $2.00 |
| CloudWatch | Logs + metrics + alarms | $15.00 |
| Route 53 + ACM | | $1.00 |
| Secrets Manager | 20 secrets | $8.00 |
| **Total** | | **~$218/month** |
#### 100K alerts/day (~3M/month) — Scale Stage
| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 3M requests | $3.00 |
| Lambda (ingestion) | 3M invocations | $2.00 |
| Lambda (API + notifications) | 5M invocations | $10.00 |
| SQS | 12M messages | $5.00 |
| ECS Fargate (Correlation) | 4 tasks avg | $192.00 |
| ECS Fargate (Suggestion) | 2 tasks avg | $36.00 |
| DynamoDB (on-demand) | ~10M reads + 6M writes | $15.00 |
| RDS (TimescaleDB) | db.r6g.large + read replica | $350.00 |
| ElastiCache Redis | cache.r6g.large (cluster mode) | $200.00 |
| S3 | ~100 GB + lifecycle | $10.00 |
| CloudWatch | Full observability stack | $50.00 |
| CloudFront | Dashboard CDN | $5.00 |
| WAF | API protection | $10.00 |
| **Total** | | **~$888/month** |
**Gross margin analysis:**
| Scale | Monthly Infra Cost | Estimated MRR | Gross Margin |
|---|---|---|---|
| 1K alerts/day (~35 teams) | $86 | $6,650 | 98.7% |
| 10K alerts/day (~175 teams) | $218 | $33,250 | 99.3% |
| 100K alerts/day (~700 teams) | $888 | $133,000 | 99.3% |
Infrastructure costs are negligible relative to revenue. The cost structure is dominated by the founder's time, not AWS spend. This is the structural advantage of a webhook-based SaaS — no agents to host, no data to scrape, no heavy compute. Just receive, correlate, notify.
### 4.4 Scaling Strategy
Alert volume is bursty. During a major incident, a single tenant might generate 500 alerts in 2 minutes, then nothing for hours. The architecture must handle this without pre-provisioning for peak.
**Burst handling by layer:**
| Layer | Burst Strategy | Limit |
|---|---|---|
| **API Gateway** | Built-in burst capacity: 5,000 RPS default, increasable to 50K+ | Effectively unlimited for our scale |
| **Lambda** | Concurrency auto-scales. Reserved concurrency per function prevents one function from starving others | 1,000 concurrent (default), increase via support ticket |
| **SQS FIFO** | High-throughput mode: 30K msg/sec per message group. Queue absorbs bursts that processing can't handle immediately | Unlimited queue depth |
| **ECS Fargate** | Auto-scaling on SQS queue depth. Scale-out in ~60 seconds (Fargate task launch time). During the 60s gap, SQS buffers | Min 1, Max 4 (V1). Increase max as needed |
| **DynamoDB** | On-demand mode handles burst to 2x previous peak automatically. For sustained spikes, DynamoDB auto-adjusts within minutes | Effectively unlimited with on-demand |
| **Redis** | Single node handles 100K+ ops/sec. Cluster mode at scale | Not a bottleneck until 100K+ alerts/day |
| **TimescaleDB** | Write-ahead log buffers burst writes. Hypertable chunking prevents table bloat | RDS instance size is the limit; vertical scaling |
**The SQS buffer is the key architectural decision.** During an incident storm, the ingestion Lambda writes to SQS in <20ms and returns 200 OK to the provider. The Correlation Engine processes at its own pace. If the engine falls behind, the queue grows — but no webhooks are dropped. This decoupling is what makes the system reliable under burst load.
**Scaling triggers and actions:**
```
Alert volume < 1K/day:
- 1 Correlation Engine task, 1 Suggestion Engine task
- cache.t4g.micro Redis, db.t4g.micro RDS
- Total: ~$86/month
Alert volume 1K-10K/day:
- Auto-scale CE to 2 tasks during bursts
- Upgrade Redis to cache.t4g.small
- Upgrade RDS to db.t4g.medium
- Total: ~$218/month
Alert volume 10K-100K/day:
- Auto-scale CE to 4 tasks
- Auto-scale SE to 2 tasks
- Redis cluster mode (3 shards)
- RDS db.r6g.large + read replica
- Add WAF for API protection
- Total: ~$888/month
Alert volume > 100K/day:
- Evaluate Kinesis Data Streams replacing SQS for higher throughput
- Consider Aurora Serverless v2 replacing RDS
- Multi-region deployment for latency + redundancy
- Dedicated capacity DynamoDB with auto-scaling
- Total: $2K-5K/month (still <1% of revenue at this scale)
```
### 4.5 CI/CD Pipeline
```mermaid
graph LR
subgraph Dev
CODE[Push to main]
PR[Pull Request]
end
subgraph CI["GitHub Actions CI"]
LINT[Lint + Type Check]
TEST[Unit Tests]
INT[Integration Tests<br/>LocalStack]
BUILD[Docker Build<br/>+ CDK Synth]
end
subgraph CD["GitHub Actions CD"]
STAGING[Deploy to Staging<br/>CDK deploy]
SMOKE[Smoke Tests<br/>against staging]
PROD[Deploy to Production<br/>CDK deploy]
CANARY[Canary: send test<br/>webhook, verify Slack]
end
subgraph Dogfood["Dogfooding"]
DD0C[dd0c/alert receives<br/>its own deploy webhook]
end
PR --> LINT & TEST & INT
CODE --> BUILD --> STAGING --> SMOKE --> PROD --> CANARY
PROD --> DD0C
```
**Pipeline details:**
1. **PR checks (< 3 min):** ESLint, TypeScript strict mode, unit tests (vitest), integration tests against LocalStack (DynamoDB, SQS, S3 emulation)
2. **Staging deploy (< 5 min):** CDK deploy to staging account. Separate AWS account for isolation.
3. **Smoke tests (< 2 min):** Send test webhooks to staging endpoint. Verify: webhook accepted, alert appears in DynamoDB, correlation window opens, Slack notification sent to test channel.
4. **Production deploy (< 5 min):** CDK deploy to production. Blue/green for ECS services (CodeDeploy). Lambda versioning with aliases for instant rollback.
5. **Canary (continuous):** Post-deploy canary sends a synthetic webhook every 5 minutes. If Slack notification doesn't arrive within 30 seconds, CloudWatch alarm fires → auto-rollback.
6. **Dogfooding:** dd0c/alert's own GitHub Actions workflow sends deploy webhooks to dd0c/alert. The product monitors its own deployments. If a deploy causes alert correlation to degrade, dd0c/alert tells you about it.
**Rollback strategy:**
- Lambda: Alias shift to previous version (instant, <1 second)
- ECS: CodeDeploy blue/green rollback (< 2 minutes)
- DynamoDB: No schema migrations in V1 (schema-on-read). No rollback needed.
- TimescaleDB: Flyway migrations with rollback scripts. Test in staging first.
---
## 5. SECURITY
### 5.1 Webhook Authentication
We cannot trust unauthenticated webhooks, as an attacker could flood a tenant with fake alerts.
- **HMAC Signatures:** Every webhook request is verified using the provider's signature header (e.g., `X-PagerDuty-Signature`, `DD-WEBHOOK-SIGNATURE`).
- **Secret Management:** Provider secrets are generated upon integration creation, stored in AWS Secrets Manager (or DynamoDB KMS-encrypted), and retrieved by the ingestion Lambda.
- **Timestamp Validation:** Signatures must include a timestamp check to prevent replay attacks (requests older than 5 minutes are rejected).
- **Rate Limiting:** API Gateway enforces rate limits per tenant based on their tier to prevent noisy neighbor problems and DDoS.
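A minimal verification sketch combining the signature and timestamp checks, assuming a generic `t=<unix>,v1=<hex>` header format (each real provider has its own header layout and signing scheme):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_SECONDS = 300; // reject requests older than 5 minutes (replay guard)

// Verifies an HMAC-SHA256 signature computed over "<timestamp>.<rawBody>".
// The "t=<unix>,v1=<hex>" header format here is illustrative only.
export function verifyWebhook(rawBody: string, sigHeader: string,
                              secret: string, nowSeconds: number): boolean {
  const parts = Object.fromEntries(
    sigHeader.split(",").map((kv) => kv.split("=", 2) as [string, string]),
  );
  const ts = Number(parts["t"]);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > MAX_SKEW_SECONDS) return false;
  const expected = createHmac("sha256", secret).update(`${ts}.${rawBody}`).digest();
  const given = Buffer.from(parts["v1"] ?? "", "hex");
  // Constant-time comparison; length check first because timingSafeEqual throws
  // on mismatched buffer lengths.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```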
### 5.2 API Key Management
For customers using the REST API (Business tier) or Custom Webhooks:
- API keys are generated as cryptographically secure random strings with a prefix (e.g., `dd0c_live_...`).
- Only a one-way hash (SHA-256) is stored in DynamoDB. The raw key is shown only once upon creation.
- API keys are tied to specific scopes (e.g., `write:alerts`, `read:incidents`).
- API Gateway Lambda Authorizer validates the key and injects the `tenant_id` into the request context, ensuring strict tenant isolation.
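A sketch of the key lifecycle described above — prefixed random key, SHA-256 at rest. The `dd0c_live_` prefix comes from the example; the 32-byte length and base64url encoding are assumptions:

```typescript
import { createHash, randomBytes } from "node:crypto";

// One-way hash stored in DynamoDB (and indexed by the tenants table's GSI1).
export function hashApiKey(raw: string): string {
  return createHash("sha256").update(raw).digest("hex");
}

// Generates a key like "dd0c_live_<43 base64url chars>". The raw key is
// returned exactly once at creation; only the hash is persisted.
export function generateApiKey(env: "live" | "test" = "live") {
  const raw = `dd0c_${env}_${randomBytes(32).toString("base64url")}`;
  return { raw, hash: hashApiKey(raw) };
}
```

At request time, the Lambda Authorizer hashes the presented key and looks the hash up on the `api_key` GSI to resolve the `tenant_id`.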
### 5.3 Alert Data Sensitivity
Alert payloads often contain sensitive infrastructure details (hostnames, IP addresses) and sometimes PII (error messages containing user data).
- **Payload Stripping Mode (Privacy Mode):** Configurable per-tenant. When enabled, the ingestion layer strips the `description` and raw payload bodies before saving to DynamoDB or S3. Only structural metadata (service, severity, timestamp) is retained.
- **Encryption at Rest:** All DynamoDB tables, RDS instances, and S3 buckets use AWS KMS encryption with customer master keys (CMK) or AWS-managed keys.
- **Encryption in Transit:** TLS 1.2+ enforced on all API Gateway endpoints and inter-service communications.
### 5.4 SOC 2 Considerations
While SOC 2 Type II certification is targeted for Month 6-9, the V1 architecture lays the groundwork:
- **Audit Logging:** Every configuration change (adding integrations, modifying suppression rules) is logged to an immutable audit table.
- **Access Control:** No human access to production databases. Read-only access via AWS SSO for debugging. Changes via CI/CD only.
- **Vulnerability Scanning:** ECR image scanning on push, npm audit in CI pipeline.
- **Separation of Duties:** Staging and Production are in completely separate AWS accounts.
### 5.5 Data Residency Options
- **V1:** All data resides in `us-east-1`.
- **V2/Enterprise:** The architecture supports multi-region deployment. European customers can be provisioned in `eu-central-1`. The `tenant_id` can dictate routing at the edge (e.g., Route 53 latency-based routing or CloudFront Lambda@Edge routing based on tenant prefix).
---
## 6. MVP SCOPE
### 6.1 V1 MVP (Observe-and-Suggest)
The V1 MVP is strictly scoped to prove the 60-second time-to-value constraint and earn engineer trust.
- **Integrations:** Datadog, PagerDuty, and GitHub Actions (for deploy events).
- **Core Engine:** Time-window clustering and deployment correlation. Rule-based only.
- **Actionability:** Observe-and-suggest ONLY. No auto-suppression.
- **Delivery:** Slack Bot (incident cards, real-time alerts, daily digests).
- **Dashboard:** Minimal UI for generating webhook URLs and viewing the Noise Report Card.
### 6.2 Deferred to V2+
- **Auto-suppression:** Requires explicit user opt-in and a proven track record.
- **More Integrations:** Grafana, OpsGenie, GitLab CI, ArgoCD.
- **Semantic Deduplication:** Sentence-transformer ML embeddings for fuzzy alert matching.
- **Predictive Severity:** ML-based scoring of historical resolution patterns.
- **Advanced Dashboard:** Custom charting, RBAC, SSO/SAML.
- **dd0c/run integration:** Runbook automation.
### 6.3 The 60-Second Onboarding Flow
1. User authenticates via Slack (OAuth).
2. UI provisions a `tenant_id` and generates Datadog/PagerDuty webhook URLs.
3. User pastes the URL into their monitoring tool.
4. First alert fires → Ingestion Lambda receives it.
5. Slack bot immediately posts: *"🔔 New alert: [service] [title] — watching for related alerts..."*
6. V1 value is proven instantly.
### 6.4 Technical Debt Budget
Given the 30-day build timeline, intentional technical debt is accepted in specific areas:
- **Testing:** Integration tests focus on the golden path (webhook → correlation → Slack). Edge cases in provider parsing will be fixed forward in production.
- **Dashboard UI:** Built with off-the-shelf Tailwind components. Not pixel-perfect.
- **Database Migrations:** None. Schema-on-read in DynamoDB.
- **Infrastructure Code:** Hardcoded region (`us-east-1`) and basic CI/CD.
### 6.5 Solo Founder Operational Model
- **Support:** Community Slack channel. No SLA for the Free/Pro tiers.
- **On-call:** Standard AWS alarms (5XX errors, high queue depth) page the founder.
- **Resilience:** The overlay architecture means if dd0c/alert goes down, the customer just receives their raw alerts from PagerDuty/Datadog. It degrades gracefully to the status quo.
---
## 7. API DESIGN
### 7.1 Webhook Receiver Endpoints
- `POST /v1/webhooks/{tenant_id}/datadog`
- `POST /v1/webhooks/{tenant_id}/pagerduty`
- `POST /v1/webhooks/{tenant_id}/github`
*Headers must include provider-specific signatures.*
### 7.2 Alert Query & Search API
- `GET /v1/alerts?service={service}&status={status}&start={iso8601}&end={iso8601}`
*Returns paginated canonical alerts.*
- `GET /v1/alerts/{alert_id}`
### 7.3 Correlation Results API
- `GET /v1/incidents?status=open`
*Returns active correlation windows and their grouped alerts.*
- `GET /v1/incidents/{incident_id}`
- `GET /v1/incidents/{incident_id}/suggestions`
### 7.4 Slack Slash Commands
- `/dd0c status` — Shows current open correlation windows.
- `/dd0c config` — Link to the tenant dashboard.
- `/dd0c mute [service]` — Temporarily ignore alerts for a noisy service (adds to suppression list).
### 7.5 Dashboard REST API
Backend-for-frontend (BFF) used by the React SPA:
- `GET /api/v1/reports/noise-daily` — TimescaleDB aggregation for the Noise Report Card.
- `GET /api/v1/integrations` — List configured webhooks and status.
- `POST /api/v1/integrations` — Generate new webhook credentials.
### 7.6 Integration Marketplace Hooks
For native app directory listings (e.g., PagerDuty Marketplace):
- `GET /api/v1/oauth/callback` — Handles OAuth flows for third-party integrations.
- `POST /api/v1/lifecycle/uninstall` — Cleans up tenant data when the app is removed from a workspace.