# dd0c/alert — Technical Architecture

### Alert Intelligence Platform

**Version:** 1.0 | **Date:** 2026-02-28 | **Phase:** 6 — Architecture | **Author:** dd0c Engineering

---

## 1. SYSTEM OVERVIEW

### 1.1 High-Level Architecture
```mermaid
graph TB
    subgraph Providers["Alert Sources"]
        PD[PagerDuty]
        DD[Datadog]
        GF[Grafana]
        OG[OpsGenie]
        CW[Custom Webhooks]
    end

    subgraph CICD["CI/CD Sources"]
        GHA[GitHub Actions]
        GLC[GitLab CI]
        ARGO[ArgoCD]
    end

    subgraph Ingestion["Ingestion Layer (API Gateway + Lambda)"]
        WH[Webhook Receiver<br/>POST /webhooks/:provider]
        HMAC[HMAC Validator]
        NORM[Payload Normalizer]
        SCHEMA[Canonical Schema<br/>Mapper]
    end

    subgraph Queue["Event Bus (SQS/SNS)"]
        ALERT_Q[alert-ingested<br/>SQS FIFO]
        DEPLOY_Q[deploy-event<br/>SQS FIFO]
        CORR_Q[correlation-request<br/>SQS Standard]
        NOTIFY_Q[notification<br/>SQS Standard]
    end

    subgraph Processing["Processing Layer (ECS Fargate)"]
        CE[Correlation Engine]
        DT[Deployment Tracker]
        SE[Suggestion Engine]
        NS[Notification Service]
    end

    subgraph Storage["Data Layer"]
        DDB[(DynamoDB<br/>Alerts + Tenants)]
        TS[(TimescaleDB on RDS<br/>Time-Series Correlation)]
        CACHE[(ElastiCache Redis<br/>Active Windows)]
        S3[(S3<br/>Raw Payloads + Exports)]
    end

    subgraph Output["Delivery"]
        SLACK[Slack Bot]
        DASH[Dashboard API<br/>CloudFront + S3 SPA]
        API[REST API]
    end

    PD & DD & GF & OG & CW --> WH
    GHA & GLC & ARGO --> WH
    WH --> HMAC --> NORM --> SCHEMA
    SCHEMA -- alerts --> ALERT_Q
    SCHEMA -- deploys --> DEPLOY_Q
    ALERT_Q --> CE
    DEPLOY_Q --> DT
    DT -- deploy context --> CE
    CE -- correlation results --> CORR_Q
    CORR_Q --> SE
    SE -- suggestions --> NOTIFY_Q
    NOTIFY_Q --> NS
    NS --> SLACK
    CE & DT & SE --> DDB & TS
    CE --> CACHE
    SCHEMA --> S3
    DASH & API --> DDB & TS
```

### 1.2 Component Inventory

| Component | Responsibility | AWS Service | Scaling Model |
|---|---|---|---|
| **Webhook Receiver** | Accept, authenticate, and normalize incoming alert/deploy webhooks | API Gateway HTTP API + Lambda | Auto-scales to 10K concurrent; API Gateway handles burst |
| **Payload Normalizer** | Transform provider-specific payloads into canonical alert schema | Lambda (part of ingestion) | Stateless, scales with webhook volume |
| **Event Bus** | Decouple ingestion from processing; buffer during spikes | SQS FIFO (alerts/deploys) + SQS Standard (correlation/notify) | Unlimited throughput; FIFO for ordering guarantees on a per-tenant basis |
| **Correlation Engine** | Time-window clustering, service-dependency matching, deploy correlation | ECS Fargate (long-running) | Horizontal scaling via ECS Service Auto Scaling |
| **Deployment Tracker** | Ingest CI/CD events, maintain deploy timeline per service | ECS Fargate (shared with CE) | Co-located with Correlation Engine |
| **Suggestion Engine** | Score noise, generate grouping suggestions, compute "what would have happened" | ECS Fargate | Scales with correlation output volume |
| **Notification Service** | Format and deliver Slack messages, digests, weekly reports | Lambda (event-driven from SQS) | Auto-scales; Slack rate limits are the bottleneck |
| **Alert Store** | Canonical alert records, tenant config, correlation results | DynamoDB | On-demand capacity; single-digit ms reads |
| **Time-Series Store** | Correlation windows, alert frequency histograms, trend data | TimescaleDB on RDS (PostgreSQL) | Vertical scaling initially; read replicas at scale |
| **Active Window Cache** | In-flight correlation windows, recent alert fingerprints | ElastiCache Redis | Single node → cluster at 10K+ alerts/day |
| **Raw Payload Archive** | Original webhook payloads for audit, replay, and simulation | S3 Standard → S3 IA (30d) → Glacier (90d) | Unlimited; lifecycle policies manage cost |
| **Dashboard SPA** | React frontend for noise reports, suppression logs, integration management | CloudFront + S3 | Static hosting; API calls to backend |
| **REST API** | Dashboard backend, alert query, correlation results, admin | API Gateway + Lambda | Auto-scales |
### 1.3 Technology Choices

| Decision | Choice | Justification |
|---|---|---|
| **Compute: Ingestion** | AWS Lambda | Webhook traffic is bursty (incident storms). Lambda handles 0→10K RPS without pre-provisioning. Pay-per-invocation keeps costs near-zero at low volume. Cold starts acceptable (<200ms with Node.js runtime). |
| **Compute: Processing** | ECS Fargate | Correlation Engine needs long-running processes with in-memory state (active correlation windows). Lambda's 15-min timeout is insufficient. Fargate = no EC2 management, scales to zero tasks during quiet periods. |
| **Queue** | SQS (FIFO for ingestion, Standard for downstream) | FIFO guarantees per-tenant ordering for alert ingestion (critical for time-window accuracy). Standard queues for correlation→notification path where ordering is less critical. SNS fan-out for multi-consumer patterns. No Kafka overhead for V1. |
| **Primary Store** | DynamoDB | Single-digit ms latency for alert lookups. On-demand pricing = pay for what you use. Partition key = `tenant_id`, sort key = `alert_id` or `timestamp`. Global tables for multi-region (V2+). |
| **Time-Series** | TimescaleDB on RDS | Correlation windows require time-range queries ("all alerts for tenant X, service Y, in the last 5 minutes"). TimescaleDB's hypertables + continuous aggregates are purpose-built for this. PostgreSQL compatibility means standard tooling. |
| **Cache** | ElastiCache Redis | Sub-ms reads for active correlation windows. Sorted sets for time-windowed alert lookups. TTL-based expiry for window cleanup. Pub/Sub for real-time correlation triggers. |
| **Object Store** | S3 | Raw payload archival, alert history exports, simulation data. Lifecycle policies for cost management. Event notifications trigger replay/simulation workflows. |
| **API Layer** | API Gateway HTTP API | Lower latency and cost than REST API type. JWT authorizer for API key validation. Built-in throttling per API key (maps to tenant rate limits by tier). |
| **Frontend** | React SPA on CloudFront | Static hosting = zero server cost. CloudFront edge caching. Dashboard is secondary to Slack (most users never open it), so minimal investment. |
| **Language** | TypeScript (Node.js 20) | Single language across Lambda + ECS + frontend. Strong typing catches schema mapping bugs at compile time. Excellent AWS SDK support. Fast cold starts on Lambda. |
| **IaC** | AWS CDK (TypeScript) | Same language as application code. L2 constructs reduce boilerplate. Synthesizes to CloudFormation for drift detection. |
| **CI/CD** | GitHub Actions | Free for public repos, cheap for private. Native webhook integration (dd0c/alert dogfoods its own deployment tracking). |
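
The TimescaleDB row above centers on time-range lookups. As a sketch, that query might be built like this; the table and column names here are hypothetical, chosen only to illustrate the shape of the parameterized query:

```typescript
// Builds the "all alerts for tenant X, service Y, in the last N minutes"
// lookup as a parameterized query (ready for node-postgres `client.query`).
// Table/column names are illustrative assumptions.
function recentAlertsQuery(tenantId: string, service: string, windowMinutes: number) {
  return {
    text: `SELECT alert_id, severity, triggered_at
             FROM alerts
            WHERE tenant_id = $1
              AND service = $2
              AND triggered_at > now() - ($3 * interval '1 minute')
            ORDER BY triggered_at DESC`,
    values: [tenantId, service, windowMinutes] as (string | number)[],
  };
}
```

Keeping the query as data (text + values) makes it trivial to unit-test without a database connection.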

### 1.4 Webhook-First Ingestion Model (60-Second Time-to-Value)

The entire architecture is designed around a single constraint: **a new customer must see their first correlated incident in Slack within 60 seconds of pasting a webhook URL.**

**The flow:**
```
T+0s     Customer signs up → gets unique webhook URL:
         https://hooks.dd0c.com/v1/wh/{tenant_id}/{provider}

T+5s     Customer pastes URL into Datadog notification channel
         (or PagerDuty webhook extension, or Grafana contact point)

T+10s    First alert fires from their monitoring tool

T+10.1s  API Gateway receives POST, Lambda validates HMAC,
         normalizes payload, writes to SQS FIFO

T+10.3s  Correlation Engine picks up alert, opens a new
         correlation window (default: 5 minutes)

T+10.5s  Alert appears in customer's Slack channel:
         "🔔 New alert: [service] [title] — watching for related alerts..."

T+60s    Window closes (or more alerts arrive). Correlation Engine
         groups related alerts. Suggestion Engine scores noise.

T+61s    Slack message updates in-place:
         "📊 Incident #1: 12 alerts grouped → 1 incident
          Sources: Datadog (8), PagerDuty (4)
          Trigger: Deploy #1042 to payment-service (2 min before first alert)
          Noise score: 87% — we'd suggest suppressing 10 of 12
          👍 Helpful  👎 Not helpful"
```

**Design decisions enabling 60-second TTV:**

1. **No SDK, no agent, no credentials.** Just a URL. The webhook URL encodes tenant ID and provider type — zero configuration needed.
2. **Pre-provisioned Slack connection.** During signup, customer connects Slack via OAuth before getting the webhook URL. Slack is ready before the first alert arrives.
3. **Eager first-alert notification.** Don't wait for the correlation window to close. Show the first alert immediately in Slack ("watching for related alerts..."), then update the message in-place when correlation completes. The customer sees activity within seconds.
4. **Default correlation window of 5 minutes.** Long enough to catch deploy-correlated alert storms, short enough to deliver results quickly. Configurable per tenant.
5. **Webhook URL as the product.** The URL IS the integration. No config files, no YAML, no terraform modules. Copy. Paste. Done.
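
Decision 1 relies on the URL itself carrying all routing information. A minimal sketch of how the ingestion path might extract it; the function name and regex are illustrative, not the actual implementation:

```typescript
// Extracts routing info from the documented webhook path shape:
//   /v1/wh/{tenant_id}/{provider}
// Returns null for anything that doesn't match, so the handler can 404 early.
interface WebhookRoute {
  tenant_id: string;
  provider: string;
}

function parseWebhookPath(path: string): WebhookRoute | null {
  const match = path.match(/^\/v1\/wh\/([A-Za-z0-9_-]+)\/([a-z]+)$/);
  if (!match) return null;
  return { tenant_id: match[1], provider: match[2] };
}
```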

---

## 2. CORE COMPONENTS

### 2.1 Webhook Ingestion Layer

The ingestion layer is the front door. It must be fast (sub-100ms response to the sending provider), reliable (zero dropped webhooks), and flexible (parse each supported provider's payload format without per-customer code changes).

**Architecture:** API Gateway HTTP API → Lambda function → SQS FIFO
```mermaid
sequenceDiagram
    participant P as Provider (Datadog/PD/etc)
    participant AG as API Gateway
    participant L as Ingestion Lambda
    participant SQS as SQS FIFO
    participant S3 as S3 (Raw Archive)

    P->>AG: POST /v1/wh/{tenant_id}/{provider}
    AG->>L: Invoke (< 50ms)
    L->>L: Validate HMAC signature
    L->>L: Parse provider-specific payload
    L->>L: Map to canonical alert schema
    L-->>S3: Async: store raw payload
    L->>SQS: Send canonical alert message
    L->>P: 200 OK (< 100ms total)
```

**Provider Parsers:**

Each supported provider has a dedicated parser module that transforms the provider's webhook payload into the canonical alert schema. Parsers are stateless functions — no database calls, no external dependencies.
```typescript
// Provider parser interface
interface ProviderParser {
  provider: string;
  validateSignature(headers: Headers, body: string, secret: string): boolean;
  // Returns an array: some providers send batched payloads (Datadog sends arrays)
  parse(payload: unknown): CanonicalAlert[];
}

// Registry — new providers added by implementing the interface
const parsers: Record<string, ProviderParser> = {
  'pagerduty': new PagerDutyParser(),
  'datadog': new DatadogParser(),
  'grafana': new GrafanaParser(),
  'opsgenie': new OpsGenieParser(),
  'github': new GitHubActionsParser(),   // deploy events
  'gitlab': new GitLabCIParser(),        // deploy events
  'argocd': new ArgoCDParser(),          // deploy events
  'custom': new CustomWebhookParser(),   // user-defined mapping
};
```

**HMAC Validation per Provider:**

| Provider | Signature Header | Algorithm | Payload |
|---|---|---|---|
| PagerDuty | `X-PagerDuty-Signature` | HMAC-SHA256 | Raw body |
| Datadog | `DD-WEBHOOK-SIGNATURE` | HMAC-SHA256 | Raw body |
| Grafana | `X-Grafana-Alerting-Signature` | HMAC-SHA256 | Raw body |
| OpsGenie | `X-OpsGenie-Signature` | HMAC-SHA256 | Raw body |
| GitHub | `X-Hub-Signature-256` | HMAC-SHA256 | Raw body |
| GitLab | `X-Gitlab-Token` | Token comparison | N/A |
| Custom | `X-DD0C-Signature` | HMAC-SHA256 | Raw body |
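
The signature checks above share one shape: HMAC-SHA256 over the raw body. A hedged sketch using Node's `crypto` module — the hex digest encoding is an assumption (some providers prefix or base64-encode their signatures, and GitLab uses plain token comparison instead):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Generic HMAC-SHA256 verification over the raw request body.
// Uses a constant-time comparison to avoid leaking signature prefixes.
function verifyHmacSha256(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signatureHex, "hex");
  if (a.length !== b.length) return false;   // lengths must match before comparing
  return timingSafeEqual(a, b);              // constant-time equality check
}
```

Provider parsers would call this after extracting the header named in the table, stripping any provider-specific prefix first.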

**Canonical Alert Schema:**

```typescript
interface CanonicalAlert {
  // Identity
  alert_id: string;               // dd0c-generated ULID (sortable, unique)
  tenant_id: string;              // From webhook URL path
  provider: string;               // 'pagerduty' | 'datadog' | 'grafana' | 'opsgenie' | 'custom'
  provider_alert_id: string;      // Original alert ID from provider
  provider_incident_id?: string;  // Provider's incident grouping (if any)

  // Classification
  severity: 'critical' | 'high' | 'medium' | 'low' | 'info';
  status: 'triggered' | 'acknowledged' | 'resolved';
  category: 'infrastructure' | 'application' | 'security' | 'deployment' | 'custom';

  // Context
  service: string;                // Normalized service name
  environment: string;            // 'production' | 'staging' | 'development'
  title: string;                  // Human-readable alert title
  description?: string;           // Alert body/details (may be stripped in privacy mode)
  tags: Record<string, string>;   // Normalized key-value tags

  // Fingerprint (for dedup)
  fingerprint: string;            // SHA-256 of (tenant_id + provider + service + title_normalized)

  // Timestamps
  triggered_at: string;           // ISO 8601 — when the alert originally fired
  received_at: string;            // ISO 8601 — when dd0c received the webhook
  resolved_at?: string;           // ISO 8601 — when resolved (if status=resolved)

  // Metadata
  raw_payload_s3_key: string;     // S3 key for the original payload
  source_url?: string;            // Deep link back to the alert in the provider's UI
}
```

**Key design decisions:**

1. **ULID for alert_id.** ULIDs are lexicographically sortable by time (unlike UUIDv4), which makes DynamoDB range queries efficient and eliminates the need for a secondary index on timestamp.
2. **Fingerprint for dedup.** The fingerprint is a deterministic hash of the alert's identity fields. Two alerts with the same fingerprint from the same provider within a correlation window are duplicates. This catches the "same alert firing every 30 seconds" pattern without ML.
3. **Severity normalization.** Each provider uses different severity scales (Datadog: P1-P5, PagerDuty: critical/high/low, Grafana: alerting/ok). The parser normalizes to a 5-level scale. Mapping is configurable per tenant.
4. **Privacy mode.** When enabled, `description` is set to `null` and `tags` are hashed. Only structural metadata (service, severity, timestamp) is stored. Reduces intelligence slightly but eliminates sensitive data concerns.
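
Decision 2's fingerprint could be computed roughly as follows. The exact title normalization (lowercasing, collapsing whitespace, replacing digits) is an assumption — the schema above only specifies `title_normalized`:

```typescript
import { createHash } from "node:crypto";

// Normalization is an assumption: lowercase, replace runs of digits with "N"
// (so "host-14" and "host-7" collapse together), collapse whitespace.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/\d+/g, "N").replace(/\s+/g, " ").trim();
}

// Deterministic dedup fingerprint over the identity fields named in the schema.
function fingerprint(tenantId: string, provider: string, service: string, title: string): string {
  const input = [tenantId, provider, service, normalizeTitle(title)].join("|");
  return createHash("sha256").update(input).digest("hex");
}
```

With digit replacement, a threshold alert that embeds a changing hostname or value still dedupes to one fingerprint.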

### 2.2 Correlation Engine

The Correlation Engine is the core intelligence of dd0c/alert. It takes a stream of canonical alerts and produces correlated incidents — groups of related alerts that represent a single underlying issue.

**Architecture:** ECS Fargate service consuming from SQS FIFO, maintaining active correlation windows in Redis, writing results to DynamoDB + TimescaleDB.
```mermaid
graph LR
    subgraph Input
        SQS[SQS FIFO<br/>alert-ingested]
    end

    subgraph CE["Correlation Engine (ECS Fargate)"]
        RECV[Message Receiver]
        FP[Fingerprint Dedup]
        TW[Time-Window<br/>Correlator]
        SDG[Service Dependency<br/>Graph Matcher]
        DC[Deploy Correlation]
        SCORE[Incident Scorer]
    end

    subgraph State
        REDIS[(Redis<br/>Active Windows)]
        DDB[(DynamoDB<br/>Incidents)]
        TSDB[(TimescaleDB<br/>Time-Series)]
    end

    SQS --> RECV
    RECV --> FP --> TW --> SDG --> DC --> SCORE
    TW <--> REDIS
    SDG <--> DDB
    DC <--> DDB
    SCORE --> DDB & TSDB
    SCORE --> CORR_Q[SQS: correlation-request]
```

**Correlation Pipeline (executed per alert):**

```
Alert arrives
│
├─ Step 1: FINGERPRINT DEDUP
│    Is there an alert with the same fingerprint in the active window?
│    YES → Increment count on existing alert, skip to scoring
│    NO  → Continue
│
├─ Step 2: TIME-WINDOW CORRELATION
│    Find all open correlation windows for this tenant.
│    Does this alert fall within an existing window's time range + service scope?
│    YES → Add alert to existing window
│    NO  → Open a new correlation window (default: 5 min, configurable)
│
├─ Step 3: SERVICE-DEPENDENCY MATCHING
│    Does the service dependency graph show a relationship between
│    this alert's service and services in any open window?
│    YES → Merge windows (upstream DB alert + downstream API errors = one incident)
│    NO  → Keep windows separate
│
├─ Step 4: DEPLOY CORRELATION
│    Was there a deployment to this service (or an upstream dependency)
│    within the lookback period (default: 15 min)?
│    YES → Tag the correlation window with deploy context
│    NO  → Continue
│
└─ Step 5: INCIDENT SCORING
     When a correlation window closes (timeout or manual trigger):
     - Count total alerts in window
     - Count unique services affected
     - Calculate noise score (0-100)
     - Generate incident summary
     - Write Incident record to DynamoDB
     - Emit to correlation-request queue for Suggestion Engine
```

**Time-Window Correlation — The Algorithm:**

```typescript
interface CorrelationWindow {
  window_id: string;            // ULID
  tenant_id: string;
  opened_at: string;            // ISO 8601
  closes_at: string;            // opened_at + window_duration
  window_duration_ms: number;   // Default 300000 (5 min), configurable
  status: 'open' | 'closed';

  // Alerts in this window
  alert_ids: string[];
  alert_count: number;
  unique_fingerprints: number;

  // Services involved
  services: Set<string>;
  environments: Set<string>;

  // Deploy context (if matched)
  deploy_event_id?: string;
  deploy_service?: string;
  deploy_pr?: string;
  deploy_author?: string;
  deploy_timestamp?: string;

  // Scoring (computed on close)
  noise_score?: number;         // 0-100 (100 = pure noise)
  severity_max?: string;        // Highest severity alert in window
}
```

**Window management in Redis:**

```
# Active windows stored as Redis sorted sets (score = closes_at timestamp)
ZADD tenant:{tenant_id}:windows {closes_at_epoch} {window_id}

# Alert-to-window mapping
SADD window:{window_id}:alerts {alert_id}

# Service-to-window index (for dependency matching)
SADD tenant:{tenant_id}:service:{service_name}:windows {window_id}

# Window metadata as hash
HSET window:{window_id} opened_at ... closes_at ... alert_count ...

# TTL on all keys = window_duration + 1 hour (cleanup buffer)
```
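
The writes above can be grouped into one registration step per alert. A sketch that builds the commands as data, suitable for an ioredis-style client that accepts raw command tuples; the `HINCRBY` on `alert_count` is an illustrative addition, not part of the commands listed above:

```typescript
// Builds the Redis commands to register one alert into one window.
// Keeping them as data (instead of issuing calls inline) makes the key
// layout unit-testable without a Redis instance; the real client would
// pipeline these: commands.forEach(cmd => pipeline.call(...cmd)).
type RedisCmd = (string | number)[];

function windowRegistrationCmds(tenantId: string, windowId: string, alertId: string,
                                service: string, closesAtEpoch: number): RedisCmd[] {
  return [
    ["ZADD", `tenant:${tenantId}:windows`, closesAtEpoch, windowId],            // active-window index
    ["SADD", `window:${windowId}:alerts`, alertId],                             // alert-to-window mapping
    ["SADD", `tenant:${tenantId}:service:${service}:windows`, windowId],        // service-to-window index
    ["HINCRBY", `window:${windowId}`, "alert_count", 1],                        // illustrative counter bump
  ];
}
```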

**Window extension logic:** If a new alert arrives for an open window within the last 30 seconds of the window's lifetime, the window extends by 2 minutes (up to a maximum of 15 minutes total). This catches cascading failures where alerts trickle in over time.
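
The extension rule reduces to a small pure function over epoch-millisecond timestamps, a sketch:

```typescript
// Constants come straight from the rule above.
const EXTENSION_MS = 2 * 60_000;     // extend by 2 minutes
const LATE_ARRIVAL_MS = 30_000;      // "within the last 30 seconds" trigger
const MAX_WINDOW_MS = 15 * 60_000;   // hard cap: 15 minutes total

// Returns the (possibly extended) closes_at for a window, given the
// arrival time of a new alert. All values are epoch milliseconds.
function maybeExtend(openedAt: number, closesAt: number, alertAt: number): number {
  const inFinalStretch = alertAt >= closesAt - LATE_ARRIVAL_MS && alertAt < closesAt;
  if (!inFinalStretch) return closesAt;
  return Math.min(closesAt + EXTENSION_MS, openedAt + MAX_WINDOW_MS);
}
```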

**Service Dependency Graph:**

The dependency graph is built from two sources:

1. **Inferred from alert patterns.** If Service A alerts consistently fire 1-3 minutes before Service B alerts, dd0c infers a dependency (A → B). Requires 3+ occurrences to establish.
2. **Explicit configuration.** Customers can declare dependencies via the dashboard or API: `payment-service → notification-service → email-service`.
```typescript
interface ServiceDependency {
  tenant_id: string;
  upstream_service: string;
  downstream_service: string;
  source: 'inferred' | 'explicit';
  confidence: number;        // 0.0-1.0 (inferred only)
  occurrence_count: number;  // Times this pattern was observed
  last_seen: string;         // ISO 8601
}
```

Stored in DynamoDB with GSI on `tenant_id + upstream_service` for fast lookups during correlation.
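
The GSI lookup might take the following shape as an AWS SDK v3 `QueryCommand` input; the table and index names are hypothetical:

```typescript
// Builds the input for a DynamoDB QueryCommand against the GSI described
// above. Table/index names are illustrative assumptions.
function upstreamDepsQuery(tenantId: string, upstreamService: string) {
  return {
    TableName: "service-dependencies",
    IndexName: "tenant-upstream-index",
    KeyConditionExpression: "tenant_id = :t AND upstream_service = :u",
    ExpressionAttributeValues: {
      ":t": { S: tenantId },
      ":u": { S: upstreamService },
    },
  };
}
```

The correlation path would run this once per alert service, then fan out across the returned `downstream_service` values.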

### 2.3 Deployment Tracker

The Deployment Tracker ingests CI/CD webhook events and maintains a timeline of deployments per service per tenant. The Correlation Engine queries this timeline to answer: "Was there a deploy to this service (or its dependencies) in the last N minutes?"

**Deploy Event Schema:**
```typescript
interface DeployEvent {
  deploy_id: string;           // ULID
  tenant_id: string;
  provider: 'github' | 'gitlab' | 'argocd' | 'custom';
  provider_deploy_id: string;  // e.g., GitHub Actions run_id

  // What was deployed
  service: string;             // Target service name
  environment: string;         // production | staging | development
  version?: string;            // Git SHA, tag, or version string

  // Who and what
  author: string;              // Git commit author or deployer
  commit_sha: string;
  commit_message?: string;
  pr_number?: string;
  pr_url?: string;

  // Timing
  started_at: string;          // ISO 8601
  completed_at?: string;       // ISO 8601
  status: 'in_progress' | 'success' | 'failure' | 'cancelled';

  // Metadata
  source_url: string;          // Link to CI/CD run
  changes_summary?: string;    // Files changed, lines added/removed
}
```

**Deploy-to-Alert Correlation Logic:**

```
When Correlation Engine processes an alert:

1. Query DynamoDB: all deploys for tenant where
     service IN (alert.service, ...upstream_dependencies)
     AND completed_at > (alert.triggered_at - lookback_window)
     AND completed_at < alert.triggered_at
     AND environment = alert.environment

2. If match found:
   - Attach deploy context to correlation window
   - Boost noise_score by 15-30 points (deploy-correlated alerts
     are more likely to be transient noise)
   - Include deploy details in Slack incident card

3. Lookback window defaults:
   - Production: 15 minutes
   - Staging: 30 minutes
   - Configurable per tenant
```
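
The step-1 query can be sketched as an in-memory filter over deploy records (field names follow the DeployEvent schema above; in production this is a DynamoDB query, not a scan):

```typescript
// Minimal shape of a deploy record for the filter below.
interface DeployLike {
  service: string;
  environment: string;
  completed_at?: string; // ISO 8601; absent while still in progress
}

// Returns the deploys that could explain an alert: same environment,
// service in the alert's service + upstream set, completed inside the
// lookback window ending at the alert's trigger time.
function matchingDeploys(deploys: DeployLike[], serviceAndUpstreams: Set<string>,
                         environment: string, triggeredAt: Date, lookbackMs: number): DeployLike[] {
  const from = triggeredAt.getTime() - lookbackMs;
  return deploys.filter((d) => {
    if (!d.completed_at) return false; // in-progress deploys don't match
    const t = Date.parse(d.completed_at);
    return serviceAndUpstreams.has(d.service) &&
           d.environment === environment &&
           t > from && t < triggeredAt.getTime();
  });
}
```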

**Service name mapping challenge:**

The biggest practical challenge is mapping CI/CD service names to monitoring service names. GitHub Actions might deploy `payment-api` while Datadog monitors `prod-payment-api-us-east-1`. Solutions:

1. **Convention-based matching.** Strip common prefixes/suffixes (`prod-`, `-us-east-1`, `-service`). Fuzzy match on the core name.
2. **Explicit mapping.** Dashboard UI where customers map: `GitHub: payment-api` → `Datadog: prod-payment-api-us-east-1`.
3. **Tag-based matching.** If both CI/CD and monitoring use a common tag (e.g., `dd0c.service=payment`), match on that.

V1 uses convention-based + explicit mapping. Tag-based matching added in V2.
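
Option 1 (convention-based matching) could be sketched as below. The prefix and suffix lists are assumptions; real tenants will need their own rules:

```typescript
// Strips conventional environment prefixes and region/role suffixes to get
// a comparable core name. The lists here are illustrative, not exhaustive.
const PREFIXES = /^(prod|staging|dev)-/;
const SUFFIXES = /(-(us|eu|ap)-[a-z]+-\d+|-service|-svc)$/;

function coreServiceName(name: string): string {
  let n = name.toLowerCase().replace(PREFIXES, "");
  let prev;
  do { prev = n; n = n.replace(SUFFIXES, ""); } while (n !== prev); // strip stacked suffixes
  return n;
}

// CI/CD name and monitoring name match if their core names agree.
function servicesMatch(ciName: string, monitoringName: string): boolean {
  return coreServiceName(ciName) === coreServiceName(monitoringName);
}
```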

### 2.4 Suggestion Engine

The Suggestion Engine takes correlated incidents from the Correlation Engine and generates actionable suggestions. In V1, this is strictly observe-and-suggest — no auto-action.

**Suggestion Types:**
```typescript
type SuggestionType =
  | 'group'       // "These 12 alerts are one incident"
  | 'suppress'    // "We'd suppress these 8 alerts (deploy noise)"
  | 'tune'        // "This alert fires 40x/week and is never actioned — consider tuning"
  | 'dependency'  // "Service A alerts always precede Service B — likely upstream dependency"
  | 'runbook';    // "This pattern was resolved by [runbook] 5 times before" (V2+)

interface Suggestion {
  suggestion_id: string;
  tenant_id: string;
  incident_id: string;
  type: SuggestionType;
  confidence: number;            // 0.0-1.0
  title: string;                 // Human-readable summary
  reasoning: string;             // Plain-English explanation of WHY
  affected_alert_ids: string[];
  action_taken: 'none';          // V1: always 'none' (observe-only)
  user_feedback?: 'helpful' | 'not_helpful' | null;
  created_at: string;
}
```

**Noise Scoring Algorithm (V1 — Rule-Based):**
```typescript
function calculateNoiseScore(window: CorrelationWindow): number {
  let score = 0;

  // Factor 1: Duplicate fingerprints (0-30 points)
  const dupRatio = 1 - (window.unique_fingerprints / window.alert_count);
  score += Math.round(dupRatio * 30);

  // Factor 2: Deploy correlation (0-25 points)
  if (window.deploy_event_id) {
    score += 25;
    // Bonus if the deploy PR reference mentions config or feature-flag work
    if (window.deploy_pr?.includes('config') || window.deploy_pr?.includes('feature-flag')) {
      score += 5;
    }
  }

  // Factor 3: Historical pattern (0-20 points)
  // If this service+alert combination has fired and auto-resolved
  // N times in the past 7 days without human action
  const autoResolveRate = getAutoResolveRate(window.tenant_id, window.services);
  score += Math.round(autoResolveRate * 20);

  // Factor 4: Severity distribution (0-15 points)
  // Windows with only low/info severity alerts score higher
  if (window.severity_max === 'low' || window.severity_max === 'info') {
    score += 15;
  } else if (window.severity_max === 'medium') {
    score += 8;
  }
  // critical/high = 0 bonus (never boost noise score for critical alerts)

  // Factor 5: Time of day (0-10 points)
  // Alerts during typical deploy windows are more likely noise
  const hour = new Date(window.opened_at).getUTCHours();
  const isBusinessHours = hour >= 14 && hour <= 23; // 14:00-23:00 UTC ≈ 10am ET to 4pm PT
  if (isBusinessHours) score += 10;

  // Cap at 100, floor at 0
  return Math.max(0, Math.min(100, score));
}
```

**"Never Suppress" Safelist (Default):**

These alert categories are never given a noise score above 50, regardless of pattern matching:
```typescript
const NEVER_SUPPRESS_DEFAULTS = [
  { category: 'security', reason: 'Security alerts require human review' },
  { severity: 'critical', reason: 'Critical severity always surfaces' },
  { service_pattern: /database|db|rds|dynamo/i, reason: 'Database alerts are high-risk' },
  { service_pattern: /payment|billing|stripe|checkout/i, reason: 'Payment path alerts are business-critical' },
  { title_pattern: /data.?loss|corruption|breach/i, reason: 'Data integrity alerts are never noise' },
];
```

Configurable per tenant. Customers can add/remove patterns. Defaults are opt-out (must explicitly remove).
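
Applying the safelist to the rule-based score might look like this sketch. The cap value of 50 comes from the paragraph above; field names follow the CanonicalAlert schema:

```typescript
// Subset of CanonicalAlert fields the default safelist rules inspect.
interface SafelistInput {
  category: string;
  severity: string;
  service: string;
  title: string;
}

// Caps the noise score at 50 when any default safelist rule matches.
// The patterns mirror NEVER_SUPPRESS_DEFAULTS above.
function applyNeverSuppress(score: number, alert: SafelistInput): number {
  const protectedAlert =
    alert.category === "security" ||
    alert.severity === "critical" ||
    /database|db|rds|dynamo/i.test(alert.service) ||
    /payment|billing|stripe|checkout/i.test(alert.service) ||
    /data.?loss|corruption|breach/i.test(alert.title);
  return protectedAlert ? Math.min(score, 50) : score;
}
```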

### 2.5 Notification Service

The Notification Service is the primary delivery mechanism. Slack is the V1 interface — most engineers never open the dashboard.

**Slack Message Format (Incident Card):**
```
📊 Incident #47 — payment-service
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔴 Severity: HIGH | Noise Score: 82/100

12 alerts → 1 incident
├─ Datadog: 8 alerts (latency spike, error rate, CPU)
├─ PagerDuty: 3 pages (payment-service P2)
└─ Grafana: 1 alert (downstream notification-service)

🚀 Deploy detected: PR #1042 "Add retry logic to payment processor"
   by @marcus • merged 3 min before first alert
   https://github.com/acme/payment-service/pull/1042

💡 Suggestion: This looks like deploy noise.
   Similar pattern seen 4 times this month — auto-resolved within 8 min each time.
   We'd suppress 10 of 12 alerts. [What would change →]

👍 Helpful   👎 Not helpful   🔇 Mute this pattern
```

**Notification Types:**

| Type | Trigger | Channel |
|---|---|---|
| **Incident Card** | Correlation window closes with grouped alerts | Configured Slack channel |
| **Real-time Alert** | First alert in a new window (eager notification) | Configured Slack channel |
| **Daily Digest** | 9:00 AM in tenant's timezone | Configured Slack channel or DM |
| **Weekly Noise Report** | Monday 9:00 AM | Configured Slack channel + email to admin |
| **Integration Health** | Webhook volume drops to zero for >2 hours | DM to integration owner |
**Daily Digest Format:**

```
📋 dd0c/alert Daily Digest — Feb 28, 2026
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Yesterday: 247 alerts → 18 incidents
Noise ratio: 87% | You'd have been paged 18 times instead of 247

Top noisy alerts:
1. checkout-latency (Datadog) — 43 fires, 0 incidents → 🔇 Tune candidate
2. disk-usage-warning (Grafana) — 31 fires, 0 incidents → 🔇 Tune candidate
3. auth-service-timeout (PagerDuty) — 22 fires, 1 real incident

Deploy-correlated noise: 67% of alerts fired within 15 min of a deploy
Noisiest deploy: PR #1038 "Update feature flags" triggered 34 alerts

💰 Estimated savings: 4.2 engineering hours not spent triaging noise
```
**Slack Integration Architecture:**

- OAuth 2.0 flow during onboarding (Slack App Directory compliant)
- Bot token stored encrypted in DynamoDB (per-tenant)
- Uses Slack `chat.postMessage` for new messages, `chat.update` for in-place updates
- Block Kit for rich formatting
- Interactive components (buttons) handled via Slack Events API → API Gateway → Lambda
- Rate limiting: Slack allows roughly 1 message/second per channel. Batch notifications during incident storms (queue in SQS, drain at 1/sec)

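The 1-message-per-second drain above reduces to a pure scheduling step. A minimal sketch — names and shapes are illustrative, not the production worker:

```typescript
interface QueuedNotification { channel: string; text: string; }

// Assign each queued message a send time so that no channel receives more
// than one message per second (Slack's per-channel posting limit).
function scheduleDrain(
  queue: QueuedNotification[],
  nowMs = 0,
): Array<QueuedNotification & { sendAtMs: number }> {
  const nextSlot = new Map<string, number>();
  return queue.map((msg) => {
    const sendAtMs = Math.max(nextSlot.get(msg.channel) ?? nowMs, nowMs);
    nextSlot.set(msg.channel, sendAtMs + 1000); // reserve the next 1s slot
    return { ...msg, sendAtMs };
  });
}
```

Channels are independent, so a storm in one channel never delays another.
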
---

## 3. DATA ARCHITECTURE

### 3.1 Canonical Alert Schema (Provider-Agnostic)

The canonical schema (defined in §2.1) is the single source of truth for all alert data. Every provider's payload is normalized into this schema at ingestion time. Downstream components never touch raw provider payloads.

**Schema evolution strategy:** The canonical schema uses a `schema_version` field (integer, starting at 1). When the schema changes:

1. New fields are always optional (backward compatible)
2. Removed fields are deprecated for 2 versions before removal
3. The ingestion Lambda writes the current schema version; consumers handle version differences gracefully
4. DynamoDB items carry their schema version — no backfill migrations needed

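A sketch of rule 3 in practice — a consumer reading stored items tolerantly across versions. Field names beyond `schema_version` are illustrative (`runbook_url` stands in for any later optional addition):

```typescript
interface CanonicalAlertView {
  schema_version: number;
  alert_id: string;
  service: string;
  runbook_url?: string; // hypothetical optional field added in a later version
}

// Read a stored item without assuming it was written at the current version:
// a missing version defaults to 1, and newer optional fields simply stay
// undefined on items written before the field existed.
function readAlertItem(item: Record<string, unknown>): CanonicalAlertView {
  return {
    schema_version: typeof item.schema_version === "number" ? item.schema_version : 1,
    alert_id: String(item.alert_id),
    service: String(item.service),
    runbook_url: typeof item.runbook_url === "string" ? item.runbook_url : undefined,
  };
}
```
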
### 3.2 Event Sourcing for Alert History

All alert data follows an event-sourcing pattern. The raw event stream is the source of truth; materialized views are derived and rebuildable.

**Event Stream:**

```typescript
type AlertEvent =
  | { type: 'alert.received'; alert: CanonicalAlert }
  | { type: 'alert.deduplicated'; alert_id: string; original_alert_id: string }
  | { type: 'alert.correlated'; alert_id: string; window_id: string }
  | { type: 'alert.resolved'; alert_id: string; resolved_at: string }
  | { type: 'window.opened'; window: CorrelationWindow }
  | { type: 'window.extended'; window_id: string; new_closes_at: string }
  | { type: 'window.closed'; window_id: string; incident_id: string }
  | { type: 'incident.created'; incident: Incident }
  | { type: 'suggestion.created'; suggestion: Suggestion }
  | { type: 'feedback.received'; suggestion_id: string; feedback: 'helpful' | 'not_helpful' }
  | { type: 'deploy.received'; deploy: DeployEvent };
```

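Rebuilding a materialized view from this stream is just a fold over events. A minimal sketch — the state shape is deliberately simplified relative to the real views:

```typescript
type Ev =
  | { type: "alert.received"; alert_id: string }
  | { type: "alert.correlated"; alert_id: string; window_id: string }
  | { type: "window.closed"; window_id: string; incident_id: string };

interface View {
  alertWindow: Map<string, string>;    // alert_id → window_id
  windowIncident: Map<string, string>; // window_id → incident_id
}

// Replay the event log from the beginning to derive the view. Because the
// stream is the source of truth, the view can be dropped and rebuilt anytime.
function rebuildView(events: Ev[]): View {
  const view: View = { alertWindow: new Map(), windowIncident: new Map() };
  for (const ev of events) {
    if (ev.type === "alert.correlated") view.alertWindow.set(ev.alert_id, ev.window_id);
    if (ev.type === "window.closed") view.windowIncident.set(ev.window_id, ev.incident_id);
  }
  return view;
}
```
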
**Storage:**

| Store | What | Why | Retention |
|---|---|---|---|
| **S3 (raw payloads)** | Original webhook bodies, exactly as received | Audit trail, replay, simulation mode, debugging provider parser issues | Free: 7d, Pro: 90d, Business: 1yr, Enterprise: custom |
| **DynamoDB (events)** | Canonical alert events, incidents, suggestions, feedback | Primary operational store. Fast reads for dashboard, API, correlation lookups | Free: 7d, Pro: 90d, Business: 1yr |
| **TimescaleDB (time-series)** | Alert counts, noise ratios, correlation metrics per time bucket | Trend analysis, Noise Report Card, business impact dashboard | 1yr rolling (continuous aggregates compress older data) |
| **Redis (ephemeral)** | Active correlation windows, recent fingerprints, rate counters | Real-time correlation state. Ephemeral by design — rebuilt from DynamoDB on cold start | TTL-based: window_duration + 1hr |

**Replay capability:** Because raw payloads are archived in S3, the entire alert history can be replayed through the ingestion pipeline. This enables:

- **Alert Simulation Mode:** Upload historical exports → replay through correlation engine → show "what would have happened"
- **Parser upgrades:** When a provider parser is improved, replay recent payloads to backfill better-normalized data
- **Correlation tuning:** Replay the last 30 days with different window durations to find optimal settings per tenant

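The correlation-tuning replay reduces to counting time-clusters for a candidate window duration. A sketch, assuming the window semantics described elsewhere in this document (each new alert extends the open window):

```typescript
// Count how many incidents a given window duration would have produced
// over a set of historical alert timestamps (in milliseconds).
function countIncidents(timestampsMs: number[], windowMs: number): number {
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  let incidents = 0;
  let windowClosesAt = -Infinity;
  for (const t of sorted) {
    if (t > windowClosesAt) incidents += 1; // alert falls outside → new window
    windowClosesAt = t + windowMs;          // each alert extends the window
  }
  return incidents;
}
```

Running this over the same 30 days with several `windowMs` values gives the incident count each setting would have produced.
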
### 3.3 Time-Series Storage for Correlation Windows

TimescaleDB handles all time-range queries that DynamoDB's key-value model handles poorly.

**Hypertables:**

```sql
-- Alert time-series (one row per alert)
CREATE TABLE alert_timeseries (
    tenant_id     TEXT NOT NULL,
    alert_id      TEXT NOT NULL,
    service       TEXT NOT NULL,
    severity      TEXT NOT NULL,
    provider      TEXT NOT NULL,
    fingerprint   TEXT NOT NULL,
    triggered_at  TIMESTAMPTZ NOT NULL,
    received_at   TIMESTAMPTZ NOT NULL,
    noise_score   SMALLINT,
    incident_id   TEXT,
    PRIMARY KEY (tenant_id, triggered_at, alert_id)
);
SELECT create_hypertable('alert_timeseries', 'triggered_at',
    chunk_time_interval => INTERVAL '1 day');

-- Continuous aggregate: hourly alert counts per tenant/service.
-- Caveat: TimescaleDB continuous aggregates do not support DISTINCT
-- aggregates, so the COUNT(DISTINCT ...) columns below require either a
-- plain materialized view refreshed on a schedule, or an approximate
-- distinct count (e.g., hyperloglog from the timescaledb_toolkit).
CREATE MATERIALIZED VIEW alert_hourly
WITH (timescaledb.continuous) AS
SELECT
    tenant_id,
    service,
    time_bucket('1 hour', triggered_at) AS bucket,
    COUNT(*) AS alert_count,
    COUNT(DISTINCT fingerprint) AS unique_alerts,
    COUNT(DISTINCT incident_id) AS incident_count,
    AVG(noise_score) AS avg_noise_score
FROM alert_timeseries
GROUP BY tenant_id, service, bucket;

-- Continuous aggregate: daily noise report per tenant (same caveat applies)
CREATE MATERIALIZED VIEW noise_daily
WITH (timescaledb.continuous) AS
SELECT
    tenant_id,
    time_bucket('1 day', triggered_at) AS bucket,
    COUNT(*) AS total_alerts,
    COUNT(DISTINCT incident_id) AS total_incidents,
    ROUND(100.0 * (1 - COUNT(DISTINCT incident_id)::NUMERIC / NULLIF(COUNT(*), 0)), 1)
        AS noise_pct,
    AVG(noise_score) AS avg_noise_score
FROM alert_timeseries
GROUP BY tenant_id, bucket;
```

**Why TimescaleDB over pure DynamoDB:**

- DynamoDB excels at point lookups and narrow range scans. It's terrible at "give me all alerts for this tenant in the last 5 minutes across all services" — that's a full partition scan.
- TimescaleDB's hypertable chunking makes time-range queries fast, and continuous aggregates pre-compute the rollups that make dashboard queries instant.
- The trade-off: TimescaleDB runs on a managed RDS instance (not serverless). Cost is fixed (~$50/month for db.t4g.medium). Acceptable for V1; evaluate Aurora Serverless v2 at scale.

### 3.4 Service Dependency Graph Storage

```typescript
// DynamoDB table: service-dependencies
// Partition key: tenant_id
// Sort key: upstream_service#downstream_service

interface ServiceDependencyRecord {
  tenant_id: string;          // PK
  edge_key: string;           // SK: "payment-service#notification-service"
  upstream_service: string;
  downstream_service: string;
  source: 'inferred' | 'explicit';
  confidence: number;         // 0.0-1.0
  occurrence_count: number;
  first_seen: string;
  last_seen: string;
  avg_lag_ms: number;         // Average time between upstream and downstream alerts
  ttl?: number;               // Epoch seconds — inferred edges expire after 30d of no observations
}
```

**Graph query pattern:** When the Correlation Engine needs to check dependencies for a service, it does a DynamoDB query:

- `PK = tenant_id, SK begins_with upstream_service#` → all downstream dependencies
- GSI with inverted sort key (`downstream#upstream`): `PK = tenant_id, SK begins_with downstream_service#` → all upstream dependencies

This is O(degree) per lookup — fast enough for real-time correlation. The graph is small per tenant (typically <100 edges for a 50-service architecture).

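The sort-key convention can be captured in two small helpers. The `#` delimiter follows the table definition above; the function names are illustrative:

```typescript
// Build the primary sort key: "upstream#downstream".
function edgeKey(upstream: string, downstream: string): string {
  return `${upstream}#${downstream}`;
}

// Build the inverted GSI sort key ("downstream#upstream") from the primary
// key — this inversion is what turns "all upstream dependencies of X" into
// a cheap begins_with prefix match on the GSI.
function invertedEdgeKey(key: string): string {
  const [upstream, downstream] = key.split("#");
  return `${downstream}#${upstream}`;
}
```
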
### 3.5 Multi-Tenant Data Isolation

**Isolation model: Logical isolation with tenant_id partitioning.**

Every data record includes `tenant_id` as the partition key (DynamoDB) or a required column (TimescaleDB). There is no cross-tenant data access path.

| Layer | Isolation Mechanism |
|---|---|
| **API Gateway** | API key → tenant_id mapping. All requests scoped to tenant. |
| **Webhook URLs** | Tenant ID embedded in URL path. Validated against tenant record. |
| **DynamoDB** | `tenant_id` is the partition key on every table. No scan operations in application code — all queries are PK-scoped. |
| **TimescaleDB** | Row-level security (RLS) policies enforce `tenant_id = current_setting('app.tenant_id')`. Connection pool sets tenant context per request. |
| **Redis** | All keys prefixed with `tenant:{tenant_id}:`. No `KEYS *` in application code. |
| **S3** | Object key prefix: `raw/{tenant_id}/{date}/{alert_id}.json`. Bucket policy prevents cross-prefix access. |
| **Slack** | Bot token per tenant. Stored encrypted. Never shared across tenants. |

**Why not per-tenant databases?** Cost. At $19/seat with potentially thousands of tenants, per-tenant RDS instances are economically impossible. Logical isolation with strong key-scoping is the standard pattern for multi-tenant SaaS at this price point. SOC 2 auditors accept this model with proper access controls and audit logging.

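The Redis prefixing rule is mechanical, but worth centralizing so no call site can forget it. A sketch (helper name illustrative):

```typescript
// Every Redis key is built through this helper, so tenant scoping
// ("tenant:{tenant_id}:...") cannot be bypassed by an individual call site.
function tenantKey(tenantId: string, ...parts: string[]): string {
  return ["tenant", tenantId, ...parts].join(":");
}
```
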
### 3.6 Retention Policies

| Tier | Raw Payloads (S3) | Alert Events (DynamoDB) | Time-Series (TimescaleDB) | Correlation Windows |
|---|---|---|---|---|
| **Free** | 7 days | 7 days | 7 days (no aggregates) | 24 hours |
| **Pro** | 90 days | 90 days | 90 days + hourly aggregates for 1yr | 30 days |
| **Business** | 1 year | 1 year | 1 year + daily aggregates for 2yr | 90 days |
| **Enterprise** | Custom | Custom | Custom | Custom |

**Implementation:**

- **S3:** Lifecycle policies transition objects: Standard → IA (30d) → Glacier (90d) → Delete (per tier)
- **DynamoDB:** TTL attribute on every item. DynamoDB automatically deletes expired items (eventually consistent, ~48hr window). No Lambda cleanup needed.
- **TimescaleDB:** `drop_chunks()` policy per hypertable. Continuous aggregates survive chunk drops (aggregated data persists longer than raw data).
- **Redis:** TTL on all keys. Self-cleaning by design.

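The DynamoDB TTL attribute is a one-line computation at write time. A sketch, with the tier→days mapping taken from the retention table (Enterprise omitted, since its retention is custom):

```typescript
const RETENTION_DAYS: Record<string, number> = { free: 7, pro: 90, business: 365 };

// Return the epoch-seconds TTL to stamp on an item, or undefined when the
// tier has no default retention — DynamoDB never expires items that lack
// the TTL attribute.
function ttlEpochSeconds(tier: string, receivedAtMs: number): number | undefined {
  const days = RETENTION_DAYS[tier];
  if (days === undefined) return undefined;
  return Math.floor(receivedAtMs / 1000) + days * 86_400;
}
```
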
---

## 4. INFRASTRUCTURE

### 4.1 AWS Architecture

**Region:** `us-east-1` (primary). Single-region for V1. Multi-region (us-west-2 failover) at 10K+ alerts/day or at the first EU customer requiring data residency.

```mermaid
graph TB
    subgraph Edge["Edge Layer"]
        CF[CloudFront<br/>Dashboard CDN]
        R53[Route 53<br/>hooks.dd0c.com]
    end

    subgraph Ingestion["Ingestion (Serverless)"]
        APIGW[API Gateway HTTP API<br/>Webhook Receiver + REST API]
        L_INGEST[Lambda: webhook-ingest<br/>256MB, 10s timeout]
        L_API[Lambda: api-handler<br/>512MB, 30s timeout]
        L_SLACK[Lambda: slack-events<br/>256MB, 10s timeout]
        L_NOTIFY[Lambda: notification-sender<br/>256MB, 30s timeout]
    end

    subgraph Queue["Message Bus"]
        SQS_ALERT[SQS FIFO<br/>alert-ingested<br/>MessageGroupId=tenant_id]
        SQS_DEPLOY[SQS FIFO<br/>deploy-event<br/>MessageGroupId=tenant_id]
        SQS_CORR[SQS Standard<br/>correlation-result]
        SQS_NOTIFY[SQS Standard<br/>notification]
        DLQ[SQS DLQ<br/>dead-letters]
    end

    subgraph Processing["Processing (ECS Fargate)"]
        ECS_CE[ECS Service: correlation-engine<br/>1 vCPU, 2GB RAM<br/>Desired: 1, Max: 4]
        ECS_SE[ECS Service: suggestion-engine<br/>0.5 vCPU, 1GB RAM<br/>Desired: 1, Max: 2]
    end

    subgraph Data["Data Layer"]
        DDB[(DynamoDB<br/>On-Demand Capacity)]
        RDS[(RDS PostgreSQL 16<br/>TimescaleDB<br/>db.t4g.medium)]
        REDIS[(ElastiCache Redis<br/>cache.t4g.micro)]
        S3_RAW[(S3: dd0c-raw-payloads)]
        S3_STATIC[(S3: dd0c-dashboard)]
    end

    subgraph Ops["Operations"]
        CW_LOGS[CloudWatch Logs]
        CW_METRICS[CloudWatch Metrics]
        CW_ALARMS[CloudWatch Alarms]
        XRAY[X-Ray Tracing]
        SM[Secrets Manager]
    end

    R53 --> APIGW
    CF --> S3_STATIC
    APIGW --> L_INGEST & L_API & L_SLACK
    L_INGEST --> SQS_ALERT & SQS_DEPLOY & S3_RAW
    SQS_ALERT & SQS_DEPLOY --> ECS_CE
    ECS_CE --> SQS_CORR & DDB & RDS & REDIS
    SQS_CORR --> ECS_SE
    ECS_SE --> SQS_NOTIFY & DDB & RDS
    SQS_NOTIFY --> L_NOTIFY
    L_API --> DDB & RDS
    L_SLACK --> DDB
    SQS_ALERT & SQS_DEPLOY & SQS_CORR & SQS_NOTIFY -.-> DLQ
    L_INGEST & L_API & ECS_CE & ECS_SE --> CW_LOGS & XRAY
    ECS_CE & ECS_SE --> SM
```

**DynamoDB Tables:**

| Table | PK | SK | GSIs | Purpose |
|---|---|---|---|---|
| `alerts` | `tenant_id` | `alert_id` (ULID) | GSI1: `tenant_id` + `triggered_at` | Canonical alert records |
| `incidents` | `tenant_id` | `incident_id` (ULID) | GSI1: `tenant_id` + `created_at` | Correlated incident records |
| `suggestions` | `tenant_id` | `suggestion_id` | GSI1: `incident_id` | Noise suggestions + feedback |
| `deploys` | `tenant_id` | `deploy_id` (ULID) | GSI1: `tenant_id` + `service` + `completed_at` | Deploy events |
| `dependencies` | `tenant_id` | `upstream#downstream` | GSI1: `tenant_id` + `downstream#upstream` | Service dependency graph |
| `tenants` | `tenant_id` | `—` | GSI1: `api_key` | Tenant config, billing, integrations |
| `integrations` | `tenant_id` | `integration_id` | — | Webhook configs, Slack tokens |

All tables use on-demand capacity mode (no capacity planning, pay-per-request). Switch to provisioned with auto-scaling when read/write patterns stabilize (typically at 50K+ alerts/day).

### 4.2 Real-Time Processing Pipeline

The critical path from webhook receipt to Slack notification must complete in under 10 seconds for the "eager first alert" notification, and under 5 minutes + 10 seconds for the full correlated incident card.

**Latency budget:**

```
Webhook received by API Gateway            0ms
├─ Lambda cold start (worst case)       +200ms
├─ HMAC validation + parsing             +10ms
├─ DynamoDB write (raw event)             +5ms
├─ SQS FIFO send                         +20ms
├─ S3 async put (non-blocking)            +0ms (async)
│                                       ─────
│  Total ingestion: ~235ms (p99)
│
├─ SQS → ECS polling interval           +100ms (long-polling, 0.1s)
├─ Correlation Engine processing         +15ms
├─ Redis read/write (window state)        +2ms
├─ DynamoDB read (deploy lookup)          +5ms
│                                       ─────
│  Total to correlation decision: ~357ms (p99)
│
├─ SQS → Lambda (notification)           +50ms
├─ Slack API call                       +300ms
│                                       ─────
│  Total webhook → Slack (eager): ~707ms (p99)
│
│  ... correlation window (5 min default) ...
│
├─ Window close → Suggestion Engine     +200ms
├─ Slack chat.update (in-place)         +300ms
│                                       ─────
│  Total webhook → full incident card: ~5min + 1.2s
```

**ECS Fargate task configuration:**

```typescript
// CDK definition (aws-cdk-lib v2)
import { Duration } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as appscaling from 'aws-cdk-lib/aws-applicationautoscaling';

const correlationService = new ecs.FargateService(this, 'CorrelationEngine', {
  cluster,
  taskDefinition: new ecs.FargateTaskDefinition(this, 'CorrTask', {
    cpu: 1024,              // 1 vCPU
    memoryLimitMiB: 2048,   // 2 GB
  }),
  desiredCount: 1,
  minHealthyPercent: 100,
  maxHealthyPercent: 200,
  circuitBreaker: { rollback: true },
});

// Auto-scaling based on SQS queue depth
const scaling = correlationService.autoScaleTaskCount({
  minCapacity: 1,
  maxCapacity: 4,
});
scaling.scaleOnMetric('QueueDepthScaling', {
  metric: alertQueue.metricApproximateNumberOfMessagesVisible(),
  scalingSteps: [
    { upper: 0, change: -1 },      // Scale in when queue empty
    { lower: 100, change: +1 },    // Scale out at 100 messages
    { lower: 1000, change: +2 },   // Scale out faster at 1000
  ],
  adjustmentType: appscaling.AdjustmentType.CHANGE_IN_CAPACITY,
  cooldown: Duration.seconds(60),
});
```

**SQS FIFO configuration:**

```typescript
const alertQueue = new sqs.Queue(this, 'AlertIngested', {
  fifo: true,
  contentBasedDeduplication: false, // We set explicit dedup IDs
  deduplicationScope: sqs.DeduplicationScope.MESSAGE_GROUP,
  fifoThroughputLimit: sqs.FifoThroughputLimit.PER_MESSAGE_GROUP_ID,
  // High-throughput FIFO: throughput scales per queue, partitioned by message group
  visibilityTimeout: Duration.seconds(60),
  retentionPeriod: Duration.days(4),
  deadLetterQueue: {
    queue: dlq,
    maxReceiveCount: 3,
  },
});
```

**Why SQS FIFO with per-tenant message groups:**

- `MessageGroupId = tenant_id` ensures alerts from the same tenant are processed in order (critical for time-window accuracy)
- Different tenants are processed in parallel (no head-of-line blocking)
- High-throughput FIFO mode scales queue throughput to tens of thousands of messages/sec, partitioned across message groups — far more headroom than any single tenant's alert volume requires

### 4.3 Cost Estimates

All estimates assume us-east-1 pricing as of 2026. Costs are monthly.

#### 1K alerts/day (~30K/month) — Early Stage

| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 30K requests | $0.03 |
| Lambda (ingestion) | 30K invocations × 256MB × 200ms | $0.02 |
| Lambda (API + notifications) | 50K invocations × 512MB × 300ms | $0.10 |
| SQS (FIFO + Standard) | 120K messages | $0.05 |
| ECS Fargate (Correlation) | 1 task × 1vCPU × 2GB × 24/7 | $48.00 |
| ECS Fargate (Suggestion) | 1 task × 0.5vCPU × 1GB × 24/7 | $18.00 |
| DynamoDB (on-demand) | ~100K reads + 60K writes | $0.15 |
| RDS (TimescaleDB) | db.t4g.micro (free tier eligible) | $0.00 |
| ElastiCache Redis | cache.t4g.micro | $12.00 |
| S3 | <1 GB stored | $0.02 |
| CloudWatch | Logs + metrics | $5.00 |
| Route 53 | 1 hosted zone | $0.50 |
| Secrets Manager | 5 secrets | $2.00 |
| **Total** | | **~$86/month** |

#### 10K alerts/day (~300K/month) — Growth Stage

| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 300K requests | $0.30 |
| Lambda (ingestion) | 300K invocations | $0.20 |
| Lambda (API + notifications) | 500K invocations | $1.00 |
| SQS | 1.2M messages | $0.50 |
| ECS Fargate (Correlation) | 2 tasks avg (auto-scaling) | $96.00 |
| ECS Fargate (Suggestion) | 1 task | $18.00 |
| DynamoDB (on-demand) | ~1M reads + 600K writes | $1.50 |
| RDS (TimescaleDB) | db.t4g.medium | $50.00 |
| ElastiCache Redis | cache.t4g.small | $25.00 |
| S3 | ~10 GB stored + transitions | $2.00 |
| CloudWatch | Logs + metrics + alarms | $15.00 |
| Route 53 + ACM | | $1.00 |
| Secrets Manager | 20 secrets | $8.00 |
| **Total** | | **~$218/month** |

#### 100K alerts/day (~3M/month) — Scale Stage

| Service | Configuration | Monthly Cost |
|---|---|---|
| API Gateway HTTP API | 3M requests | $3.00 |
| Lambda (ingestion) | 3M invocations | $2.00 |
| Lambda (API + notifications) | 5M invocations | $10.00 |
| SQS | 12M messages | $5.00 |
| ECS Fargate (Correlation) | 4 tasks avg | $192.00 |
| ECS Fargate (Suggestion) | 2 tasks avg | $36.00 |
| DynamoDB (on-demand) | ~10M reads + 6M writes | $15.00 |
| RDS (TimescaleDB) | db.r6g.large + read replica | $350.00 |
| ElastiCache Redis | cache.r6g.large (cluster mode) | $200.00 |
| S3 | ~100 GB + lifecycle | $10.00 |
| CloudWatch | Full observability stack | $50.00 |
| CloudFront | Dashboard CDN | $5.00 |
| WAF | API protection | $10.00 |
| **Total** | | **~$888/month** |

**Gross margin analysis:**

| Scale | Monthly Infra Cost | Estimated MRR | Gross Margin |
|---|---|---|---|
| 1K alerts/day (~35 teams) | $86 | $6,650 | 98.7% |
| 10K alerts/day (~175 teams) | $218 | $33,250 | 99.3% |
| 100K alerts/day (~700 teams) | $888 | $133,000 | 99.3% |

Infrastructure costs are negligible relative to revenue. The cost structure is dominated by the founder's time, not AWS spend. This is the structural advantage of a webhook-based SaaS — no agents to host, no data to scrape, no heavy compute. Just receive, correlate, notify.

### 4.4 Scaling Strategy

Alert volume is bursty. During a major incident, a single tenant might generate 500 alerts in 2 minutes, then nothing for hours. The architecture must handle this without pre-provisioning for peak.

**Burst handling by layer:**

**Burst handling by layer:**
|
|||
|
|
|
|||
|
|
| Layer | Burst Strategy | Limit |
|
|||
|
|
|---|---|---|
|
|||
|
|
| **API Gateway** | Built-in burst capacity: 5,000 RPS default, increasable to 50K+ | Effectively unlimited for our scale |
|
|||
|
|
| **Lambda** | Concurrency auto-scales. Reserved concurrency per function prevents one function from starving others | 1,000 concurrent (default), increase via support ticket |
|
|||
|
|
| **SQS FIFO** | High-throughput mode: 30K msg/sec per message group. Queue absorbs bursts that processing can't handle immediately | Unlimited queue depth |
|
|||
|
|
| **ECS Fargate** | Auto-scaling on SQS queue depth. Scale-out in ~60 seconds (Fargate task launch time). During the 60s gap, SQS buffers | Min 1, Max 4 (V1). Increase max as needed |
|
|||
|
|
| **DynamoDB** | On-demand mode handles burst to 2x previous peak automatically. For sustained spikes, DynamoDB auto-adjusts within minutes | Effectively unlimited with on-demand |
|
|||
|
|
| **Redis** | Single node handles 100K+ ops/sec. Cluster mode at scale | Not a bottleneck until 100K+ alerts/day |
|
|||
|
|
| **TimescaleDB** | Write-ahead log buffers burst writes. Hypertable chunking prevents table bloat | RDS instance size is the limit; vertical scaling |
|
|||
|
|
|
|||
|
|
**The SQS buffer is the key architectural decision.** During an incident storm, the ingestion Lambda writes to SQS in <20ms and returns 200 OK to the provider. The Correlation Engine processes at its own pace. If the engine falls behind, the queue grows — but no webhooks are dropped. This decoupling is what makes the system reliable under burst load.

**Scaling triggers and actions:**

```
Alert volume < 1K/day:
- 1 Correlation Engine task, 1 Suggestion Engine task
- cache.t4g.micro Redis, db.t4g.micro RDS
- Total: ~$86/month

Alert volume 1K-10K/day:
- Auto-scale CE to 2 tasks during bursts
- Upgrade Redis to cache.t4g.small
- Upgrade RDS to db.t4g.medium
- Total: ~$218/month

Alert volume 10K-100K/day:
- Auto-scale CE to 4 tasks
- Auto-scale SE to 2 tasks
- Redis cluster mode (3 shards)
- RDS db.r6g.large + read replica
- Add WAF for API protection
- Total: ~$888/month

Alert volume > 100K/day:
- Evaluate Kinesis Data Streams replacing SQS for higher throughput
- Consider Aurora Serverless v2 replacing RDS
- Multi-region deployment for latency + redundancy
- Dedicated capacity DynamoDB with auto-scaling
- Total: $2K-5K/month (still <1% of revenue at this scale)
```

### 4.5 CI/CD Pipeline

```mermaid
graph LR
    subgraph Dev
        CODE[Push to main]
        PR[Pull Request]
    end

    subgraph CI["GitHub Actions CI"]
        LINT[Lint + Type Check]
        TEST[Unit Tests]
        INT[Integration Tests<br/>LocalStack]
        BUILD[Docker Build<br/>+ CDK Synth]
    end

    subgraph CD["GitHub Actions CD"]
        STAGING[Deploy to Staging<br/>CDK deploy]
        SMOKE[Smoke Tests<br/>against staging]
        PROD[Deploy to Production<br/>CDK deploy]
        CANARY[Canary: send test<br/>webhook, verify Slack]
    end

    subgraph Dogfood["Dogfooding"]
        DD0C[dd0c/alert receives<br/>its own deploy webhook]
    end

    PR --> LINT & TEST & INT
    CODE --> BUILD --> STAGING --> SMOKE --> PROD --> CANARY
    PROD --> DD0C
```

**Pipeline details:**

1. **PR checks (< 3 min):** ESLint, TypeScript strict mode, unit tests (vitest), integration tests against LocalStack (DynamoDB, SQS, S3 emulation)
2. **Staging deploy (< 5 min):** CDK deploy to staging account. Separate AWS account for isolation.
3. **Smoke tests (< 2 min):** Send test webhooks to the staging endpoint. Verify: webhook accepted, alert appears in DynamoDB, correlation window opens, Slack notification sent to test channel.
4. **Production deploy (< 5 min):** CDK deploy to production. Blue/green for ECS services (CodeDeploy). Lambda versioning with aliases for instant rollback.
5. **Canary (continuous):** Post-deploy canary sends a synthetic webhook every 5 minutes. If the Slack notification doesn't arrive within 30 seconds, a CloudWatch alarm fires → auto-rollback.
6. **Dogfooding:** dd0c/alert's own GitHub Actions workflow sends deploy webhooks to dd0c/alert. The product monitors its own deployments. If a deploy causes alert correlation to degrade, dd0c/alert tells you about it.

**Rollback strategy:**

- Lambda: Alias shift to previous version (instant, <1 second)
- ECS: CodeDeploy blue/green rollback (< 2 minutes)
- DynamoDB: No schema migrations in V1 (schema-on-read). No rollback needed.
- TimescaleDB: Flyway migrations with rollback scripts. Test in staging first.

---

## 5. SECURITY

### 5.1 Webhook Authentication

We cannot trust unauthenticated webhooks, as an attacker could flood a tenant with fake alerts.

- **HMAC Signatures:** Every webhook request is verified using the provider's signature header (e.g., `X-PagerDuty-Signature`, `DD-WEBHOOK-SIGNATURE`).
- **Secret Management:** Provider secrets are generated upon integration creation, stored in AWS Secrets Manager (or DynamoDB KMS-encrypted), and retrieved by the ingestion Lambda.
- **Timestamp Validation:** Signatures must include a timestamp check to prevent replay attacks (requests older than 5 minutes are rejected).
- **Rate Limiting:** API Gateway enforces rate limits per tenant based on their tier to prevent noisy-neighbor problems and DDoS.

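A sketch of the verification step with Node's crypto primitives. Header names and exact signature formats vary by provider; this shows the generic HMAC-SHA256 + freshness pattern:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Reject stale requests first, then compare the provider's signature against
// our own HMAC of the raw body using a constant-time comparison.
function verifyWebhook(
  rawBody: string,
  signatureHex: string,
  sentAtMs: number,
  secret: string,
  nowMs: number = Date.now(),
): boolean {
  if (Math.abs(nowMs - sentAtMs) > 5 * 60 * 1000) return false; // replay guard
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

The length check before `timingSafeEqual` matters: the function throws on unequal buffer lengths.
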
### 5.2 API Key Management

For customers using the REST API (Business tier) or Custom Webhooks:

- API keys are generated as cryptographically secure random strings with a prefix (e.g., `dd0c_live_...`).
- Only a one-way hash (SHA-256) is stored in DynamoDB. The raw key is shown only once upon creation.
- API keys are tied to specific scopes (e.g., `write:alerts`, `read:incidents`).
- An API Gateway Lambda Authorizer validates the key and injects the `tenant_id` into the request context, ensuring strict tenant isolation.

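Issuance and hashing can be sketched with Node's crypto primitives. The prefix follows the example above; the random-byte length is illustrative:

```typescript
import { createHash, randomBytes } from "node:crypto";

// The hash is one-way: the raw key is returned to the caller exactly once,
// and only the SHA-256 digest is persisted for later lookups.
function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

function issueApiKey(): { rawKey: string; storedHash: string } {
  const rawKey = `dd0c_live_${randomBytes(24).toString("base64url")}`;
  return { rawKey, storedHash: hashApiKey(rawKey) };
}
```

On each request, the authorizer recomputes `hashApiKey` over the presented key and looks the digest up in the `tenants` GSI.
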
### 5.3 Alert Data Sensitivity

Alert payloads often contain sensitive infrastructure details (hostnames, IP addresses) and sometimes PII (error messages containing user data).

- **Payload Stripping Mode (Privacy Mode):** Configurable per tenant. When enabled, the ingestion layer strips the `description` and raw payload bodies before saving to DynamoDB or S3. Only structural metadata (service, severity, timestamp) is retained.
- **Encryption at Rest:** All DynamoDB tables, RDS instances, and S3 buckets use AWS KMS encryption with customer managed keys (CMK) or AWS-managed keys.
- **Encryption in Transit:** TLS 1.2+ enforced on all API Gateway endpoints and inter-service communications.

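Privacy Mode stripping is a projection onto the structural fields. A sketch, with field names assumed from the canonical schema discussion:

```typescript
interface IngestedAlert {
  service: string;
  severity: string;
  triggered_at: string;
  description?: string;   // free text — may contain PII
  raw_payload?: unknown;  // original provider body
}

// Drop the free-text fields before persistence; only structural metadata survives.
function stripForPrivacyMode(alert: IngestedAlert): IngestedAlert {
  const { description: _desc, raw_payload: _raw, ...metadata } = alert;
  return metadata;
}
```
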
### 5.4 SOC 2 Considerations

While SOC 2 Type II certification is targeted for Month 6-9, the V1 architecture lays the groundwork:

- **Audit Logging:** Every configuration change (adding integrations, modifying suppression rules) is logged to an immutable audit table.
- **Access Control:** No human access to production databases. Read-only access via AWS SSO for debugging. Changes via CI/CD only.
- **Vulnerability Scanning:** ECR image scanning on push, npm audit in the CI pipeline.
- **Separation of Duties:** Staging and Production are in completely separate AWS accounts.

### 5.5 Data Residency Options

- **V1:** All data resides in `us-east-1`.
- **V2/Enterprise:** The architecture supports multi-region deployment. European customers can be provisioned in `eu-central-1`. The `tenant_id` can dictate routing at the edge (e.g., Route 53 latency-based routing or CloudFront Lambda@Edge routing based on tenant prefix).

---

## 6. MVP SCOPE

### 6.1 V1 MVP (Observe-and-Suggest)

The V1 MVP is strictly scoped to prove the 60-second time-to-value constraint and earn engineer trust.

- **Integrations:** Datadog, PagerDuty, and GitHub Actions (for deploy events).
- **Core Engine:** Time-window clustering and deployment correlation. Rule-based only.
- **Actionability:** Observe-and-suggest ONLY. No auto-suppression.
- **Delivery:** Slack Bot (incident cards, real-time alerts, daily digests).
- **Dashboard:** Minimal UI for generating webhook URLs and viewing the Noise Report Card.

### 6.2 Deferred to V2+

- **Auto-suppression:** Requires explicit user opt-in and a proven track record.
- **More Integrations:** Grafana, OpsGenie, GitLab CI, ArgoCD.
- **Semantic Deduplication:** Sentence-transformer ML embeddings for fuzzy alert matching.
- **Predictive Severity:** ML-based scoring of historical resolution patterns.
- **Advanced Dashboard:** Custom charting, RBAC, SSO/SAML.
- **dd0c/run integration:** Runbook automation.

### 6.3 The 60-Second Onboarding Flow

1. User authenticates via Slack (OAuth).
2. UI provisions a `tenant_id` and generates Datadog/PagerDuty webhook URLs.
3. User pastes the URL into their monitoring tool.
4. First alert fires → Ingestion Lambda receives it.
5. Slack bot immediately posts: *"🔔 New alert: [service] [title] — watching for related alerts..."*
6. V1 value is proven instantly.

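Step 2 of the flow above can be sketched as a small provisioning helper. The base URL, the `t_`-prefixed ID format, and the provider list are placeholders for illustration, not production values:

```typescript
import { randomBytes } from "node:crypto";

// Sketch of onboarding step 2: mint a tenant_id and per-provider webhook
// URLs matching the POST /v1/webhooks/{tenant_id}/{provider} shape.
// The base URL and ID format are assumptions.
export function provisionWebhooks(base = "https://ingest.example.com") {
  const tenantId = `t_${randomBytes(8).toString("hex")}`;
  const providers = ["datadog", "pagerduty", "github"] as const;
  const urls = Object.fromEntries(
    providers.map((p) => [p, `${base}/v1/webhooks/${tenantId}/${p}`])
  );
  return { tenantId, urls };
}
```

The dashboard renders these URLs as copy-paste snippets, which is what keeps onboarding inside the 60-second budget.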
### 6.4 Technical Debt Budget

Given the 30-day build timeline, intentional technical debt is accepted in specific areas:

- **Testing:** Integration tests focus on the golden path (webhook → correlation → Slack). Edge cases in provider parsing will be fixed forward in production.
- **Dashboard UI:** Built with off-the-shelf Tailwind components. Not pixel-perfect.
- **Database Migrations:** None. Schema-on-read in DynamoDB.
- **Infrastructure Code:** Hardcoded region (`us-east-1`) and basic CI/CD.

### 6.5 Solo Founder Operational Model

- **Support:** Community Slack channel. No SLA for the Free/Pro tiers.
- **On-call:** Standard AWS alarms (5XX errors, high queue depth) page the founder.
- **Resilience:** The overlay architecture means if dd0c/alert goes down, the customer just receives their raw alerts from PagerDuty/Datadog. It degrades gracefully to the status quo.

---

## 7. API DESIGN

### 7.1 Webhook Receiver Endpoints

- `POST /v1/webhooks/{tenant_id}/datadog`
- `POST /v1/webhooks/{tenant_id}/pagerduty`
- `POST /v1/webhooks/{tenant_id}/github`

*Headers must include provider-specific signatures.*

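For the GitHub endpoint, signature validation follows GitHub's documented scheme: the raw body is HMAC-SHA256 signed with the webhook secret and delivered as `X-Hub-Signature-256: sha256=<hex>`. Datadog and PagerDuty each use their own headers, so the validator is selected per endpoint. A minimal sketch (secret lookup per tenant is out of scope here):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the GitHub-style signature header value for a raw request body.
export function signBody(rawBody: string, secret: string): string {
  return "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Constant-time comparison of the expected signature against the
// X-Hub-Signature-256 header the provider sent.
export function verifyGithubSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string
): boolean {
  const expected = Buffer.from(signBody(rawBody, secret));
  const received = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so guard first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The comparison must run against the raw bytes as received; re-serializing parsed JSON before hashing is a common way to break verification.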
### 7.2 Alert Query & Search API

- `GET /v1/alerts?service={service}&status={status}&start={iso8601}&end={iso8601}`

*Returns paginated canonical alerts.*

- `GET /v1/alerts/{alert_id}`

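A client builds the list endpoint's query string from whichever filters are set, omitting the rest. The sketch below covers only the parameters defined above; any pagination cursor parameter would be an addition, so it is left out:

```typescript
// Sketch of a query builder for GET /v1/alerts using only the filters
// the spec above defines (service, status, start, end).
export function alertsQuery(params: {
  service?: string;
  status?: string;
  start?: string; // ISO-8601
  end?: string;   // ISO-8601
}): string {
  const qs = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined) qs.set(key, value);
  }
  const s = qs.toString();
  return s ? `/v1/alerts?${s}` : "/v1/alerts";
}
```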
### 7.3 Correlation Results API

- `GET /v1/incidents?status=open`

*Returns active correlation windows and their grouped alerts.*

- `GET /v1/incidents/{incident_id}`
- `GET /v1/incidents/{incident_id}/suggestions`

### 7.4 Slack Slash Commands

- `/dd0c status` — Shows current open correlation windows.
- `/dd0c config` — Link to the tenant dashboard.
- `/dd0c mute [service]` — Temporarily ignore alerts for a noisy service (adds to suppression list).

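Slack delivers the text after the slash command as a single string, so the bot needs a small dispatcher for the three verbs above. A sketch (the `Command` type and its field names are illustrative):

```typescript
// Parse the "text" field Slack sends with /dd0c, e.g. "mute checkout-api".
// The Command shape is an illustrative internal type, not Slack's payload.
type Command =
  | { kind: "status" }
  | { kind: "config" }
  | { kind: "mute"; service: string }
  | { kind: "unknown"; raw: string };

export function parseCommand(text: string): Command {
  const [verb, ...rest] = text.trim().split(/\s+/);
  switch (verb) {
    case "status":
      return { kind: "status" };
    case "config":
      return { kind: "config" };
    case "mute":
      // "mute" requires a service argument; reject it bare so the bot
      // can reply with usage help instead of muting everything.
      return rest[0] ? { kind: "mute", service: rest[0] } : { kind: "unknown", raw: text };
    default:
      return { kind: "unknown", raw: text };
  }
}
```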
### 7.5 Dashboard REST API

A backend-for-frontend (BFF) consumed by the React SPA:

- `GET /api/v1/reports/noise-daily` — TimescaleDB aggregation for the Noise Report Card.
- `GET /api/v1/integrations` — List configured webhooks and status.
- `POST /api/v1/integrations` — Generate new webhook credentials.

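The `noise-daily` aggregation is shown below as an in-memory reduction so the shape of the report is concrete; in the real service this would be a TimescaleDB `time_bucket` query. The alert fields and the "noise ratio" definition (share of alerts folded into an existing correlation window) are illustrative assumptions:

```typescript
// Sketch of the per-service noise aggregation behind
// GET /api/v1/reports/noise-daily. Field names are illustrative.
interface CanonicalAlert {
  service: string;
  firedAt: string;     // ISO-8601
  correlated: boolean; // true if grouped into an incident window
}

export function noiseReport(alerts: CanonicalAlert[]) {
  const byService = new Map<string, { total: number; correlated: number }>();
  for (const a of alerts) {
    const row = byService.get(a.service) ?? { total: 0, correlated: 0 };
    row.total += 1;
    if (a.correlated) row.correlated += 1;
    byService.set(a.service, row);
  }
  // Noise ratio: fraction of a service's alerts that added no new signal.
  return [...byService.entries()].map(([service, r]) => ({
    service,
    total: r.total,
    noiseRatio: r.correlated / r.total,
  }));
}
```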
### 7.6 Integration Marketplace Hooks

For native app directory listings (e.g., PagerDuty Marketplace):

- `GET /api/v1/oauth/callback` — Handles OAuth flows for third-party integrations.
- `POST /api/v1/lifecycle/uninstall` — Cleans up tenant data when the app is removed from a workspace.