dd0c/alert — Alert Intelligence: BDD Acceptance Test Specifications
Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic.
Epic 1: Webhook Ingestion
Feature: HMAC Signature Validation — Datadog
Feature: HMAC signature validation for Datadog webhooks
As the ingestion layer
I want to reject requests with invalid or missing HMAC signatures
So that only legitimate Datadog payloads are processed
Background:
Given the webhook endpoint is "POST /webhooks/datadog"
And a valid Datadog HMAC secret is configured as "dd-secret-abc123"
Scenario: Valid Datadog HMAC signature is accepted
Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}'
And the request includes header "X-Datadog-Webhook-ID" with a valid HMAC-SHA256 signature
When the Lambda ingestion handler receives the request
Then the response status is 200
And the payload is forwarded to the normalization SQS queue
Scenario: Missing HMAC signature header is rejected
Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}'
And the request has no "X-Datadog-Webhook-ID" header
When the Lambda ingestion handler receives the request
Then the response status is 401
And the payload is NOT forwarded to SQS
And a rejection event is logged with reason "missing_signature"
Scenario: Tampered payload with mismatched HMAC is rejected
Given a Datadog alert payload
And the HMAC signature was computed over a different payload body
When the Lambda ingestion handler receives the request
Then the response status is 401
And the payload is NOT forwarded to SQS
And a rejection event is logged with reason "signature_mismatch"
Scenario: Replay attack with expired timestamp is rejected
Given a Datadog alert payload with a valid HMAC signature
And the request timestamp is more than 5 minutes in the past
When the Lambda ingestion handler receives the request
Then the response status is 401
And the rejection reason is "timestamp_expired"
Scenario: HMAC secret rotation — old secret still accepted during grace period
Given the Datadog HMAC secret was rotated 2 minutes ago
And the request uses the previous secret for signing
When the Lambda ingestion handler receives the request
Then the response status is 200
And a warning metric "hmac_old_secret_used" is emitted
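The signature scenarios above can be sketched in Python. This is an illustrative sketch, not the production handler: the function shape is an assumption, and secret rotation is modeled by the caller passing the previous secret alongside the current one only while the grace period lasts.

```python
import hashlib
import hmac
import time

def verify_signature(body: bytes, signature: str, secrets: list,
                     timestamp: float, now=None):
    """Return (accepted, reason). `secrets` lists the current secret first,
    then any previous secret still inside the rotation grace period."""
    now = time.time() if now is None else now
    # Replay protection: timestamps older than 5 minutes are refused.
    if now - timestamp > 300:
        return False, "timestamp_expired"
    for i, secret in enumerate(secrets):
        expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
        # compare_digest avoids timing side-channels on the comparison.
        if hmac.compare_digest(expected, signature):
            return True, "ok" if i == 0 else "hmac_old_secret_used"
    return False, "signature_mismatch"
```

A missing header maps to the `missing_signature` rejection before this function is ever called; everything else falls out of the return reasons above.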
Feature: HMAC Signature Validation — PagerDuty
Feature: HMAC signature validation for PagerDuty webhooks
Background:
Given the webhook endpoint is "POST /webhooks/pagerduty"
And a valid PagerDuty signing secret is configured
Scenario: Valid PagerDuty v3 signature is accepted
Given a PagerDuty webhook payload
And the request includes "X-PagerDuty-Signature" with a valid v3 HMAC-SHA256 value
When the Lambda ingestion handler receives the request
Then the response status is 200
And the payload is enqueued for normalization
Scenario: PagerDuty v1 signature (legacy) is rejected
Given a PagerDuty webhook payload signed with v1 scheme
When the Lambda ingestion handler receives the request
Then the response status is 401
And the rejection reason is "unsupported_signature_version"
Scenario: Missing signature on PagerDuty webhook
Given a PagerDuty webhook payload with no signature header
When the Lambda ingestion handler receives the request
Then the response status is 401
Feature: HMAC Signature Validation — OpsGenie
Feature: HMAC signature validation for OpsGenie webhooks
Background:
Given the webhook endpoint is "POST /webhooks/opsgenie"
And a valid OpsGenie integration API key is configured
Scenario: Valid OpsGenie HMAC is accepted
Given an OpsGenie alert payload
And the request includes "X-OG-Delivery-Signature" with a valid HMAC-SHA256 value
When the Lambda ingestion handler receives the request
Then the response status is 200
Scenario: Invalid OpsGenie signature is rejected
Given an OpsGenie alert payload with a forged signature
When the Lambda ingestion handler receives the request
Then the response status is 401
And the rejection reason is "signature_mismatch"
Feature: HMAC Signature Validation — Grafana
Feature: HMAC signature validation for Grafana webhooks
Background:
Given the webhook endpoint is "POST /webhooks/grafana"
And a Grafana webhook secret is configured
Scenario: Valid Grafana signature is accepted
Given a Grafana alert payload
And the request includes "X-Grafana-Signature" with a valid HMAC-SHA256 value
When the Lambda ingestion handler receives the request
Then the response status is 200
Scenario: Grafana webhook with no secret configured (open mode) is accepted with warning
Given no Grafana webhook secret is configured for the tenant
And a Grafana alert payload arrives without a signature header
When the Lambda ingestion handler receives the request
Then the response status is 200
And a warning metric "grafana_unauthenticated_webhook" is emitted
Feature: Payload Normalization to Canonical Schema
Feature: Normalize incoming webhook payloads to canonical alert schema
Scenario: Datadog payload is normalized to canonical schema
Given a raw Datadog webhook payload with fields "title", "severity", "host", "tags"
When the normalization Lambda processes the payload
Then the canonical alert contains:
| field | value |
| source | datadog |
| severity | mapped from Datadog severity |
| service | extracted from tags |
| fingerprint | SHA-256 of source+title+host |
| received_at | ISO-8601 timestamp |
| raw_payload | original JSON preserved |
Scenario: PagerDuty payload is normalized to canonical schema
Given a raw PagerDuty v3 webhook payload
When the normalization Lambda processes the payload
Then the canonical alert contains "source" = "pagerduty"
And "severity" is mapped from PagerDuty urgency field
And "service" is extracted from the PagerDuty service name
Scenario: OpsGenie payload is normalized to canonical schema
Given a raw OpsGenie webhook payload
When the normalization Lambda processes the payload
Then the canonical alert contains "source" = "opsgenie"
And "severity" is mapped from OpsGenie priority field
Scenario: Grafana payload is normalized to canonical schema
Given a raw Grafana alerting webhook payload
When the normalization Lambda processes the payload
Then the canonical alert contains "source" = "grafana"
And "severity" is mapped from Grafana alert state
Scenario: Unknown source type returns 400
Given a webhook payload posted to "/webhooks/unknown-source"
When the Lambda ingestion handler receives the request
Then the response status is 400
And the error reason is "unknown_source"
Scenario: Malformed JSON payload returns 400
Given a request body that is not valid JSON
When the Lambda ingestion handler receives the request
Then the response status is 400
And the error reason is "invalid_json"
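The canonical schema table above pins the fingerprint to SHA-256 over source+title+host. A minimal sketch of that assembly (field names taken from the table; the delimiter and helper names are assumptions):

```python
import hashlib
from datetime import datetime, timezone

def fingerprint(source: str, title: str, host: str) -> str:
    """SHA-256 over source+title+host, per the canonical schema table.
    Deterministic: identical inputs always yield the same digest."""
    return hashlib.sha256(f"{source}|{title}|{host}".encode()).hexdigest()

def to_canonical(source, title, host, severity, service, tenant_id, raw):
    return {
        "source": source,
        "severity": severity,
        "service": service,
        "fingerprint": fingerprint(source, title, host),
        "received_at": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "raw_payload": raw,  # original JSON preserved verbatim
    }
```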
Feature: Async S3 Archival
Feature: Archive raw webhook payloads to S3 asynchronously
Scenario: Every accepted payload is archived to S3
Given a valid Datadog webhook payload is received and accepted
When the Lambda ingestion handler processes the request
Then the raw payload is written to S3 bucket "dd0c-raw-webhooks"
And the S3 key follows the pattern "raw/{source}/{tenant_id}/{YYYY}/{MM}/{DD}/{uuid}.json"
And the archival happens asynchronously (does not block the 200 response)
Scenario: S3 archival failure does not fail the ingestion
Given a valid webhook payload is received
And the S3 write operation fails with a transient error
When the Lambda ingestion handler processes the request
Then the response status is still 200
And the payload is still forwarded to SQS
And an error metric "s3_archival_failure" is emitted
Scenario: Archived payload includes tenant ID and trace context
Given a valid webhook payload from tenant "tenant-xyz"
When the payload is archived to S3
Then the S3 object metadata includes "tenant_id" = "tenant-xyz"
And the S3 object metadata includes the OTEL trace ID
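A sketch of the best-effort archival path (the client is injected so the sketch stays testable; `emit_metric` is a hypothetical helper). In production the write would run off the request path so the 200 response is never blocked; the sketch only shows the failure-swallowing semantics:

```python
import json
import uuid
from datetime import datetime, timezone

def emit_metric(name: str) -> None:
    # Placeholder for a CloudWatch EMF / StatsD emitter.
    print(f"metric:{name}")

def archive_raw_payload(s3_client, payload: dict, source: str, tenant_id: str,
                        trace_id: str, bucket: str = "dd0c-raw-webhooks") -> bool:
    """Best-effort archival: failures are swallowed so ingestion still succeeds."""
    now = datetime.now(timezone.utc)
    # Key pattern from the scenario: raw/{source}/{tenant_id}/{YYYY}/{MM}/{DD}/{uuid}.json
    key = f"raw/{source}/{tenant_id}/{now:%Y/%m/%d}/{uuid.uuid4()}.json"
    try:
        s3_client.put_object(
            Bucket=bucket, Key=key,
            Body=json.dumps(payload).encode(),
            Metadata={"tenant_id": tenant_id, "trace_id": trace_id},
        )
        return True
    except Exception:
        emit_metric("s3_archival_failure")
        return False
```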
Feature: SQS Payload Size Limit (256KB)
Feature: Handle SQS 256KB payload size limit during ingestion
Scenario: Payload under 256KB is sent directly to SQS
Given a normalized canonical alert payload of 10KB
When the ingestion Lambda forwards it to SQS
Then the message is placed on the SQS queue directly
And no S3 pointer pattern is used
Scenario: Payload exceeding 256KB is stored in S3 and pointer sent to SQS
Given a normalized canonical alert payload of 300KB (e.g. large raw_payload)
When the ingestion Lambda attempts to forward it to SQS
Then the full payload is stored in S3 under "sqs-overflow/{uuid}.json"
And an SQS message is sent containing only the S3 pointer and metadata
And the SQS message size is under 256KB
Scenario: Correlation engine retrieves oversized payload from S3 pointer
Given an SQS message containing an S3 pointer for an oversized payload
When the correlation engine consumer reads the SQS message
Then it fetches the full payload from S3 using the pointer
And processes it as a normal canonical alert
Scenario: S3 pointer fetch fails in correlation engine
Given an SQS message containing an S3 pointer
And the S3 object has been deleted or is unavailable
When the correlation engine attempts to fetch the payload
Then the message is sent to the Dead Letter Queue
And an alert metric "sqs_pointer_fetch_failure" is emitted
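The claim-check pattern above can be sketched as follows (clients injected; the 1 KiB headroom for message attributes is an assumption, as is the pointer envelope shape):

```python
import json
import uuid

SQS_LIMIT = 256 * 1024  # SQS hard cap on message size

def send_alert(sqs_client, s3_client, queue_url: str, alert: dict,
               overflow_bucket: str = "dd0c-raw-webhooks") -> str:
    body = json.dumps(alert)
    if len(body.encode()) <= SQS_LIMIT - 1024:  # headroom for attributes
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
        return "inline"
    # Oversized: park the full payload in S3 and send only a pointer.
    key = f"sqs-overflow/{uuid.uuid4()}.json"
    s3_client.put_object(Bucket=overflow_bucket, Key=key, Body=body.encode())
    pointer = {"s3_pointer": {"bucket": overflow_bucket, "key": key},
               "tenant_id": alert.get("tenant_id")}
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(pointer))
    return "pointer"
```

On the consumer side, a fetch failure for the S3 object routes the message to the DLQ rather than silently dropping the alert.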
Feature: Dead Letter Queue Handling
Feature: Dead Letter Queue overflow and monitoring
Scenario: Message failing max retries is moved to DLQ
Given an SQS message that causes a processing error on every attempt
When the message has been received and failed 3 times (maxReceiveCount = 3)
Then the message is automatically moved to the DLQ "dd0c-ingestion-dlq"
And a CloudWatch alarm "DLQDepthHigh" is triggered when DLQ depth > 10
Scenario: DLQ overflow triggers operational alert
Given the DLQ contains more than 100 messages
When the DLQ depth CloudWatch alarm fires
Then a Slack notification is sent to the ops channel "#dd0c-ops"
And the notification includes the DLQ name and current depth
Scenario: DLQ messages can be replayed after fix
Given 50 messages are sitting in the DLQ
When an operator triggers the DLQ replay Lambda
Then messages are moved back to the main SQS queue in batches of 10
And each replayed message retains its original trace context
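The replay scenario can be sketched with a drain loop that re-sends each message with its original attributes, which is how the trace context survives the round trip (client injected; the exact operator tooling is an assumption):

```python
def replay_dlq(sqs_client, dlq_url: str, main_url: str, batch_size: int = 10) -> int:
    """Drain the DLQ back onto the main queue in batches, preserving attributes."""
    moved = 0
    while True:
        resp = sqs_client.receive_message(
            QueueUrl=dlq_url, MaxNumberOfMessages=batch_size,
            MessageAttributeNames=["All"])
        messages = resp.get("Messages", [])
        if not messages:
            return moved
        for msg in messages:
            # Re-send with original attributes so trace context survives the replay.
            sqs_client.send_message(
                QueueUrl=main_url, MessageBody=msg["Body"],
                MessageAttributes=msg.get("MessageAttributes", {}))
            sqs_client.delete_message(QueueUrl=dlq_url,
                                      ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
```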
Epic 2: Alert Normalization
Feature: Datadog Source Parser
Feature: Parse and normalize Datadog alert payloads
Background:
Given the Datadog parser is registered for source "datadog"
Scenario: Datadog "alert" event maps to severity "critical"
Given a Datadog payload with "alert_type" = "error" and "priority" = "P1"
When the Datadog parser processes the payload
Then the canonical alert "severity" = "critical"
Scenario: Datadog "warning" event maps to severity "warning"
Given a Datadog payload with "alert_type" = "warning"
When the Datadog parser processes the payload
Then the canonical alert "severity" = "warning"
Scenario: Datadog "recovery" event maps to status "resolved"
Given a Datadog payload with "alert_type" = "recovery"
When the Datadog parser processes the payload
Then the canonical alert "status" = "resolved"
And the canonical alert "resolved_at" is set to the event timestamp
Scenario: Service extracted from Datadog tags
Given a Datadog payload with "tags" = ["service:payments", "env:prod", "team:backend"]
When the Datadog parser processes the payload
Then the canonical alert "service" = "payments"
And the canonical alert "environment" = "prod"
Scenario: Service tag absent — service defaults to hostname
Given a Datadog payload with no "service:" tag
And the payload contains "host" = "payments-worker-01"
When the Datadog parser processes the payload
Then the canonical alert "service" = "payments-worker-01"
Scenario: Fingerprint is deterministic for identical alerts
Given two identical Datadog payloads with the same title, host, and tags
When both are processed by the Datadog parser
Then both canonical alerts have the same "fingerprint" value
Scenario: Fingerprint differs when title changes
Given two Datadog payloads differing only in "title"
When both are processed by the Datadog parser
Then the canonical alerts have different "fingerprint" values
Scenario: Datadog payload missing required "title" field
Given a Datadog payload with no "title" field
When the Datadog parser processes the payload
Then a normalization error is raised with reason "missing_required_field:title"
And the alert is sent to the normalization DLQ
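The Datadog mapping rules above can be collected into one parser sketch (payload field names follow the scenarios; the exact Datadog webhook contract should be confirmed against Datadog's docs):

```python
def parse_datadog(payload: dict) -> dict:
    """Map Datadog fields onto the canonical alert, per the scenarios above."""
    if "title" not in payload:
        raise ValueError("missing_required_field:title")
    tags = dict(t.split(":", 1) for t in payload.get("tags", []) if ":" in t)
    alert_type = payload.get("alert_type", "info")
    out = {
        "source": "datadog",
        "severity": {"error": "critical", "warning": "warning"}.get(alert_type, "info"),
        "status": "resolved" if alert_type == "recovery" else "firing",
        # The service: tag wins; otherwise fall back to the hostname.
        "service": tags.get("service", payload.get("host", "unknown")),
        "environment": tags.get("env"),
    }
    if alert_type == "recovery":
        out["resolved_at"] = payload.get("date")  # event timestamp
    return out
```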
Feature: PagerDuty Source Parser
Feature: Parse and normalize PagerDuty webhook payloads
Background:
Given the PagerDuty parser is registered for source "pagerduty"
Scenario: PagerDuty "trigger" event creates a new canonical alert
Given a PagerDuty v3 webhook with "event_type" = "incident.triggered"
When the PagerDuty parser processes the payload
Then the canonical alert "status" = "firing"
And "source" = "pagerduty"
Scenario: PagerDuty "acknowledge" event updates alert status
Given a PagerDuty v3 webhook with "event_type" = "incident.acknowledged"
When the PagerDuty parser processes the payload
Then the canonical alert "status" = "acknowledged"
Scenario: PagerDuty "resolve" event updates alert status
Given a PagerDuty v3 webhook with "event_type" = "incident.resolved"
When the PagerDuty parser processes the payload
Then the canonical alert "status" = "resolved"
Scenario: PagerDuty urgency "high" maps to severity "critical"
Given a PagerDuty payload with "urgency" = "high"
When the PagerDuty parser processes the payload
Then the canonical alert "severity" = "critical"
Scenario: PagerDuty urgency "low" maps to severity "warning"
Given a PagerDuty payload with "urgency" = "low"
When the PagerDuty parser processes the payload
Then the canonical alert "severity" = "warning"
Scenario: PagerDuty service name is extracted correctly
Given a PagerDuty payload with "service.name" = "checkout-api"
When the PagerDuty parser processes the payload
Then the canonical alert "service" = "checkout-api"
Scenario: PagerDuty dedup key used as fingerprint seed
Given a PagerDuty payload with "dedup_key" = "pd-dedup-xyz789"
When the PagerDuty parser processes the payload
Then the canonical alert "fingerprint" incorporates "pd-dedup-xyz789"
Feature: OpsGenie Source Parser
Feature: Parse and normalize OpsGenie webhook payloads
Background:
Given the OpsGenie parser is registered for source "opsgenie"
Scenario: OpsGenie "Create" action maps to status "firing"
Given an OpsGenie webhook with "action" = "Create"
When the OpsGenie parser processes the payload
Then the canonical alert "status" = "firing"
Scenario: OpsGenie "Close" action maps to status "resolved"
Given an OpsGenie webhook with "action" = "Close"
When the OpsGenie parser processes the payload
Then the canonical alert "status" = "resolved"
Scenario: OpsGenie "Acknowledge" action maps to status "acknowledged"
Given an OpsGenie webhook with "action" = "Acknowledge"
When the OpsGenie parser processes the payload
Then the canonical alert "status" = "acknowledged"
Scenario: OpsGenie priority P1 maps to severity "critical"
Given an OpsGenie payload with "priority" = "P1"
When the OpsGenie parser processes the payload
Then the canonical alert "severity" = "critical"
Scenario: OpsGenie priority P3 maps to severity "warning"
Given an OpsGenie payload with "priority" = "P3"
When the OpsGenie parser processes the payload
Then the canonical alert "severity" = "warning"
Scenario: OpsGenie priority P5 maps to severity "info"
Given an OpsGenie payload with "priority" = "P5"
When the OpsGenie parser processes the payload
Then the canonical alert "severity" = "info"
Scenario: OpsGenie tags used for service extraction
Given an OpsGenie payload with "tags" = ["service:inventory", "region:us-east-1"]
When the OpsGenie parser processes the payload
Then the canonical alert "service" = "inventory"
Feature: Grafana Source Parser
Feature: Parse and normalize Grafana alerting webhook payloads
Background:
Given the Grafana parser is registered for source "grafana"
Scenario: Grafana "alerting" state maps to status "firing"
Given a Grafana webhook with "state" = "alerting"
When the Grafana parser processes the payload
Then the canonical alert "status" = "firing"
Scenario: Grafana "ok" state maps to status "resolved"
Given a Grafana webhook with "state" = "ok"
When the Grafana parser processes the payload
Then the canonical alert "status" = "resolved"
Scenario: Grafana "no_data" state maps to severity "warning"
Given a Grafana webhook with "state" = "no_data"
When the Grafana parser processes the payload
Then the canonical alert "severity" = "warning"
And the canonical alert "status" = "firing"
Scenario: Grafana panel URL preserved in canonical alert metadata
Given a Grafana webhook with "ruleUrl" = "https://grafana.example.com/d/abc/panel"
When the Grafana parser processes the payload
Then the canonical alert "metadata.dashboard_url" = "https://grafana.example.com/d/abc/panel"
Scenario: Grafana multi-alert payload (multiple evalMatches) creates one alert per match
Given a Grafana webhook with 3 "evalMatches" entries
When the Grafana parser processes the payload
Then 3 canonical alerts are produced
And each has a unique fingerprint based on the metric name and tags
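The multi-alert fan-out can be sketched as below: one canonical alert per evalMatch, each fingerprinted from the metric name plus sorted tags so the three entries get distinct fingerprints (field names follow Grafana's legacy alerting payload as described in the scenarios; severity mapping is omitted here for brevity):

```python
import hashlib

def parse_grafana(payload: dict) -> list:
    """Fan out one canonical alert per evalMatch entry."""
    status = {"ok": "resolved"}.get(payload.get("state"), "firing")
    alerts = []
    for match in payload.get("evalMatches", [{}]):
        tags = match.get("tags") or {}
        # Fingerprint seed: metric name plus tags in a stable order.
        seed = match.get("metric", "") + "|" + ",".join(
            f"{k}={v}" for k, v in sorted(tags.items()))
        alerts.append({
            "source": "grafana",
            "status": status,
            "metadata": {"dashboard_url": payload.get("ruleUrl")},
            "fingerprint": hashlib.sha256(seed.encode()).hexdigest(),
        })
    return alerts
```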
Feature: Canonical Alert Schema Validation
Feature: Validate canonical alert schema completeness
Scenario: Canonical alert with all required fields passes validation
Given a canonical alert with fields: source, severity, status, service, fingerprint, received_at, tenant_id
When schema validation runs
Then the alert passes validation
Scenario: Canonical alert missing "tenant_id" fails validation
Given a canonical alert with no "tenant_id" field
When schema validation runs
Then validation fails with error "missing_required_field:tenant_id"
And the alert is rejected before SQS enqueue
Scenario: Canonical alert with unknown severity value fails validation
Given a canonical alert with "severity" = "super-critical"
When schema validation runs
Then validation fails with error "invalid_enum_value:severity"
Scenario: Canonical alert schema is additive — unknown extra fields are preserved
Given a canonical alert with an extra field "custom_label" = "team-alpha"
When schema validation runs
Then the alert passes validation
And "custom_label" is preserved in the alert document
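The validation rules reduce to a required-field check plus a severity enum, with unknown fields passed through untouched. A minimal sketch (error strings taken from the scenarios):

```python
REQUIRED = {"source", "severity", "status", "service",
            "fingerprint", "received_at", "tenant_id"}
SEVERITIES = {"critical", "warning", "info"}

def validate_canonical(alert: dict) -> list:
    """Return a list of errors; empty means the alert passes.
    Extra fields are allowed (the schema is additive)."""
    errors = [f"missing_required_field:{f}" for f in sorted(REQUIRED - alert.keys())]
    if "severity" in alert and alert["severity"] not in SEVERITIES:
        errors.append("invalid_enum_value:severity")
    return errors
```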
Epic 3: Correlation Engine
Feature: Time-Window Grouping
Feature: Group alerts into incidents using time-window correlation
Background:
Given the correlation engine is running on ECS Fargate
And the default correlation time window is 5 minutes
Scenario: Two alerts for the same service within the time window are grouped
Given a canonical alert for service "payments" arrives at T=0
And a second canonical alert for service "payments" arrives at T=3min
When the correlation engine processes both alerts
Then they are grouped into a single incident
And the incident "alert_count" = 2
Scenario: Two alerts for the same service outside the time window are NOT grouped
Given a canonical alert for service "payments" arrives at T=0
And a second canonical alert for service "payments" arrives at T=6min
When the correlation engine processes both alerts
Then they are placed in separate incidents
Scenario: Time window is configurable per tenant
Given tenant "enterprise-co" has a custom correlation window of 10 minutes
And two alerts for the same service arrive 8 minutes apart
When the correlation engine processes both alerts
Then they are grouped into a single incident for tenant "enterprise-co"
Scenario: Alerts from different services within the time window are NOT grouped by default
Given a canonical alert for service "payments" at T=0
And a canonical alert for service "auth" at T=1min
When the correlation engine processes both alerts
Then they are placed in separate incidents
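The grouping rule above can be sketched as a sliding-window assignment: same tenant and service within the window joins the open incident, anything else opens a new one. The in-memory state shape is an assumption (the real engine keeps this in Redis/DynamoDB), and service-affinity is a separate pass, covered next:

```python
def assign_incident(open_incidents: list, alert: dict, window_sec: int = 300):
    """Attach the alert to an open incident for the same tenant+service
    inside the window, or open a new incident."""
    for inc in open_incidents:
        if (inc["tenant_id"] == alert["tenant_id"]
                and inc["service"] == alert["service"]
                and alert["ts"] - inc["last_seen"] <= window_sec):
            inc["alert_count"] += 1
            inc["last_seen"] = alert["ts"]  # the window slides with activity
            return inc
    inc = {"tenant_id": alert["tenant_id"], "service": alert["service"],
           "alert_count": 1, "last_seen": alert["ts"]}
    open_incidents.append(inc)
    return inc
```

A per-tenant `window_sec` override covers the configurable-window scenario.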
Feature: Service-Affinity Matching
Feature: Group alerts across related services using service-affinity rules
Background:
Given a service-affinity rule: ["payments", "checkout", "cart"] are related
Scenario: Alerts from affinity-grouped services are correlated into one incident
Given a canonical alert for service "payments" at T=0
And a canonical alert for service "checkout" at T=2min
When the correlation engine applies service-affinity matching
Then both alerts are grouped into a single incident
And the incident "root_service" is set to the first-seen service "payments"
Scenario: Alert from a service not in the affinity group is not merged
Given a canonical alert for service "payments" at T=0
And a canonical alert for service "logging" at T=1min
When the correlation engine applies service-affinity matching
Then they remain in separate incidents
Scenario: Service-affinity rules are tenant-scoped
Given tenant "acme" has affinity rule ["api", "gateway"]
And tenant "globex" has no affinity rules
And both tenants receive alerts for "api" and "gateway" simultaneously
When the correlation engine processes both tenants' alerts
Then "acme"'s alerts are grouped into one incident
And "globex"'s alerts remain in separate incidents
Feature: Fingerprint Deduplication
Feature: Deduplicate alerts with identical fingerprints
Scenario: Duplicate alert with same fingerprint within dedup window is suppressed
Given a canonical alert with fingerprint "fp-abc123" is received at T=0
And an identical alert with fingerprint "fp-abc123" arrives at T=30sec
When the correlation engine checks the Redis dedup window
Then the second alert is suppressed (not added as a new alert)
And the incident "duplicate_count" is incremented by 1
Scenario: Same fingerprint outside dedup window creates a new alert
Given a canonical alert with fingerprint "fp-abc123" was processed at T=0
And the dedup window is 10 minutes
And the same fingerprint arrives at T=11min
When the correlation engine checks the Redis dedup window
Then the alert is treated as a new occurrence
And a new incident entry is created
Scenario: Different fingerprints are never deduplicated
Given two alerts with different fingerprints "fp-abc123" and "fp-xyz789"
When the correlation engine processes both
Then both are treated as distinct alerts
Scenario: Dedup counter is visible in incident metadata
Given fingerprint "fp-abc123" has been suppressed 5 times
When the incident is retrieved via the Dashboard API
Then the incident "dedup_count" = 5
Feature: Redis Sliding Window
Feature: Redis sliding windows for correlation state management
Background:
Given Redis is available and the sliding window TTL is 10 minutes
Scenario: Alert fingerprint is stored in Redis on first occurrence
Given a canonical alert with fingerprint "fp-new001" arrives
When the correlation engine processes the alert
Then a Redis key "dedup:{tenant_id}:fp-new001" is set with TTL 10 minutes
Scenario: Redis key TTL is refreshed on each matching alert
Given a Redis key "dedup:{tenant_id}:fp-new001" exists with 2 minutes remaining
And a new alert with fingerprint "fp-new001" arrives
When the correlation engine processes the alert
Then the Redis key TTL is reset to 10 minutes
Scenario: Redis unavailability causes correlation engine to fail open
Given Redis is unreachable
When a canonical alert arrives for processing
Then the alert is processed without deduplication
And a metric "redis_unavailable_failopen" is emitted
And the alert is NOT dropped
Scenario: Redis sliding window is tenant-isolated
Given tenant "alpha" has fingerprint "fp-shared" in Redis
And tenant "beta" sends an alert with fingerprint "fp-shared"
When the correlation engine checks the dedup window
Then tenant "beta"'s alert is NOT suppressed
And tenant "alpha"'s dedup state is unaffected
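The Redis behavior above maps naturally onto an atomic SET NX EX: a successful set means first occurrence, a miss means duplicate, and the TTL refresh keeps the window sliding. A sketch with an injected client (`emit_metric` is a hypothetical helper; redis-py's `ConnectionError` is caught broadly here to keep the sketch generic):

```python
def emit_metric(name: str) -> None:
    # Placeholder for a CloudWatch EMF / StatsD emitter.
    print(f"metric:{name}")

def check_dedup(redis_client, tenant_id: str, fp: str, ttl: int = 600) -> bool:
    """Return True if the alert is a duplicate inside the window.
    Fails open: if Redis is down, the alert is processed without dedup."""
    key = f"dedup:{tenant_id}:{fp}"  # tenant-scoped key prevents cross-tenant bleed
    try:
        # SET NX returns a falsy value when the key already exists (duplicate).
        if redis_client.set(key, "1", nx=True, ex=ttl):
            return False  # first occurrence
        redis_client.expire(key, ttl)  # refresh TTL: the window slides
        return True
    except Exception:
        emit_metric("redis_unavailable_failopen")
        return False
```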
Feature: Cross-Tenant Isolation in Correlation
Feature: Prevent cross-tenant alert bleed in correlation engine
Scenario: Alerts from different tenants with same fingerprint are never correlated
Given tenant "alpha" sends alert with fingerprint "fp-shared" at T=0
And tenant "beta" sends alert with fingerprint "fp-shared" at T=1min
When the correlation engine processes both alerts
Then each alert is placed in its own tenant-scoped incident
And no incident contains alerts from both tenants
Scenario: Tenant ID is validated before correlation lookup
Given a canonical alert arrives with "tenant_id" = ""
When the correlation engine attempts to process the alert
Then the alert is rejected with error "missing_tenant_id"
And the alert is sent to the correlation DLQ
Scenario: Correlation engine worker processes only one tenant's partition at a time
Given SQS messages are partitioned by tenant_id
When the ECS Fargate worker picks up a batch of messages
Then all messages in the batch belong to the same tenant
And no cross-tenant data is loaded into the worker's memory context
Feature: OTEL Trace Propagation Across SQS Boundary
Feature: Propagate OpenTelemetry trace context across SQS ingestion-to-correlation boundary
Scenario: Trace context is injected into SQS message attributes at ingestion
Given a webhook request arrives with OTEL trace header "traceparent: 00-abc123-def456-01"
When the ingestion Lambda enqueues the message to SQS
Then the SQS message attributes include "traceparent" = "00-abc123-def456-01"
And the SQS message attributes include "tracestate" if present in the original request
Scenario: Correlation engine extracts and continues trace from SQS message
Given an SQS message with "traceparent" attribute "00-abc123-def456-01"
When the correlation engine consumer reads the message
Then a child span is created continuing trace ID "abc123"
And all subsequent operations (Redis lookup, DynamoDB write) are children of this span
Scenario: Missing trace context in SQS message starts a new trace
Given an SQS message with no "traceparent" attribute
When the correlation engine consumer reads the message
Then a new root trace is started
And a metric "trace_context_missing" is emitted
Scenario: Trace ID is stored on the incident record
Given a correlated incident is created from an alert with trace ID "abc123"
When the incident is written to DynamoDB
Then the incident document includes "trace_id" = "abc123"
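The propagation hinges on copying the W3C `traceparent` (and optional `tracestate`) headers into SQS message attributes on the way in, and parsing them back out on the way through. A sketch without the OTEL SDK, showing just the attribute plumbing (attribute names follow the scenarios):

```python
def inject_trace(attributes: dict, traceparent: str, tracestate=None) -> dict:
    """Copy W3C trace context headers into SQS message attributes."""
    attributes["traceparent"] = {"DataType": "String", "StringValue": traceparent}
    if tracestate:
        attributes["tracestate"] = {"DataType": "String", "StringValue": tracestate}
    return attributes

def extract_trace_id(message: dict):
    """Pull the trace ID from a consumed SQS message, or None to start a new root trace."""
    attr = message.get("MessageAttributes", {}).get("traceparent")
    if not attr:
        return None  # caller emits "trace_context_missing" and starts a new trace
    # traceparent format: version-traceid-spanid-flags
    parts = attr["StringValue"].split("-")
    return parts[1] if len(parts) == 4 else None
```

In the real consumer the extracted context would seed an OTEL propagator so Redis and DynamoDB spans become children of the originating request.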
Epic 4: Notification & Escalation
Feature: Slack Block Kit Incident Notifications
Feature: Send Slack Block Kit notifications for new incidents
Background:
Given a Slack webhook URL is configured for tenant "acme"
And the Slack notification Lambda is subscribed to the incidents SNS topic
Scenario: New critical incident triggers Slack notification
Given a new incident is created with severity "critical" for service "payments"
When the notification Lambda processes the incident event
Then a Slack Block Kit message is posted to the configured channel
And the message includes the incident ID, service name, severity, and timestamp
And the message includes action buttons: "Acknowledge", "Escalate", "Mark as Noise"
Scenario: New warning incident triggers Slack notification
Given a new incident is created with severity "warning"
When the notification Lambda processes the incident event
Then a Slack message is posted with severity badge "⚠️ WARNING"
Scenario: Resolved incident posts a resolution message to Slack
Given an existing incident "INC-001" transitions to status "resolved"
When the notification Lambda processes the resolution event
Then a Slack message is posted indicating "INC-001 resolved"
And the message includes time-to-resolution duration
Scenario: Slack Block Kit message includes alert count for correlated incidents
Given an incident contains 7 correlated alerts
When the Slack notification is sent
Then the message body includes "7 correlated alerts"
Scenario: Slack notification includes dashboard deep-link
Given a new incident "INC-042" is created
When the Slack notification is sent
Then the message includes a button "View in Dashboard" linking to "/incidents/INC-042"
Feature: Severity-Based Routing
Feature: Route notifications to different Slack channels based on severity
Background:
Given tenant "acme" has configured:
| severity | channel |
| critical | #incidents-p1 |
| warning | #incidents-p2 |
| info | #monitoring-feed |
Scenario: Critical incident is routed to P1 channel
Given a new incident with severity "critical"
When the notification Lambda routes the alert
Then the Slack message is posted to "#incidents-p1"
Scenario: Warning incident is routed to P2 channel
Given a new incident with severity "warning"
When the notification Lambda routes the alert
Then the Slack message is posted to "#incidents-p2"
Scenario: Info incident is routed to monitoring feed
Given a new incident with severity "info"
When the notification Lambda routes the alert
Then the Slack message is posted to "#monitoring-feed"
Scenario: No routing rule configured — falls back to default channel
Given tenant "beta" has only a default channel "#alerts" configured
And a new incident with severity "critical" arrives
When the notification Lambda routes the alert
Then the Slack message is posted to "#alerts"
Feature: Escalation to PagerDuty if Unacknowledged
Feature: Escalate unacknowledged critical incidents to PagerDuty
Background:
Given the escalation check runs every minute via EventBridge
And the escalation threshold for "critical" incidents is 15 minutes
Scenario: Unacknowledged critical incident is escalated after threshold
Given a critical incident "INC-001" was created 16 minutes ago
And "INC-001" has not been acknowledged
When the escalation Lambda runs
Then a PagerDuty incident is created via the PagerDuty Events API v2
And the incident "INC-001" status is updated to "escalated"
And a Slack message is posted: "INC-001 escalated to PagerDuty"
Scenario: Acknowledged incident is NOT escalated
Given a critical incident "INC-002" was created 20 minutes ago
And "INC-002" was acknowledged 5 minutes ago
When the escalation Lambda runs
Then no PagerDuty incident is created for "INC-002"
Scenario: Warning incident has a longer escalation threshold
Given the escalation threshold for "warning" incidents is 60 minutes
And a warning incident "INC-003" was created 45 minutes ago and is unacknowledged
When the escalation Lambda runs
Then no PagerDuty incident is created for "INC-003"
Scenario: Escalation is idempotent — already-escalated incident is not re-escalated
Given incident "INC-004" was already escalated to PagerDuty
When the escalation Lambda runs again
Then no duplicate PagerDuty incident is created
And the escalation Lambda logs "already_escalated:INC-004"
Scenario: PagerDuty API failure during escalation is retried
Given incident "INC-005" is due for escalation
And the PagerDuty Events API returns a 500 error
When the escalation Lambda attempts to create the PagerDuty incident
Then the Lambda retries up to 3 times with exponential backoff
And if all retries fail, an error metric "pagerduty_escalation_failure" is emitted
And the incident is flagged for manual review
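The selection logic the escalation Lambda runs each minute can be sketched as a single filter over open incidents: past the per-severity threshold, unacknowledged, and not already escalated (which gives idempotency). Thresholds come from the scenarios; the incident record shape is an assumption:

```python
ESCALATION_MINUTES = {"critical": 15, "warning": 60}

def incidents_due_for_escalation(incidents: list, now_min: float) -> list:
    """Select unacknowledged, not-yet-escalated incidents past their threshold."""
    due = []
    for inc in incidents:
        threshold = ESCALATION_MINUTES.get(inc["severity"])
        if threshold is None:
            continue  # e.g. "info" incidents never escalate
        age = now_min - inc["created_min"]
        if (age > threshold and not inc.get("acknowledged")
                and inc.get("status") != "escalated"):  # idempotency guard
            due.append(inc)
    return due
```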
Feature: Daily Noise Report
Feature: Generate and send daily noise reduction report
Background:
Given the daily report Lambda runs at 08:00 UTC via EventBridge
Scenario: Daily noise report is sent to configured Slack channel
Given tenant "acme" has 500 raw alerts and 80 incidents in the past 24 hours
When the daily report Lambda runs
Then a Slack message is posted to "#dd0c-digest"
And the message includes:
| metric | value |
| total_alerts | 500 |
| correlated_incidents | 80 |
| noise_reduction_percent | 84% |
| top_noisy_service | shown |
Scenario: Daily report includes MTTR for resolved incidents
Given 20 incidents were resolved in the past 24 hours with an average MTTR of 23 minutes
When the daily report Lambda runs
Then the Slack message includes "Avg MTTR: 23 min"
Scenario: Daily report is skipped if no alerts in past 24 hours
Given tenant "quiet-co" had 0 alerts in the past 24 hours
When the daily report Lambda runs
Then no Slack message is sent for "quiet-co"
Scenario: Daily report is tenant-scoped — no cross-tenant data leakage
Given tenants "alpha" and "beta" both have activity
When the daily report Lambda runs
Then "alpha"'s report contains only "alpha"'s metrics
And "beta"'s report contains only "beta"'s metrics
Feature: Slack Rate Limiting
Feature: Handle Slack API rate limiting gracefully
Scenario: Slack returns 429 Too Many Requests — notification is retried
Given a Slack notification needs to be sent
And Slack returns HTTP 429 with "Retry-After: 5"
When the notification Lambda handles the response
Then the Lambda waits 5 seconds before retrying
And the notification is eventually delivered
Scenario: Slack rate limit persists beyond Lambda timeout — message queued for retry
Given Slack is rate-limiting for 30 seconds
And the Lambda timeout is 15 seconds
When the notification Lambda cannot deliver within its timeout
Then the SQS message is not deleted and becomes visible again once the visibility timeout expires
And the message is retried by the next Lambda invocation
Scenario: Burst of 50 incidents triggers Slack rate limit protection
Given 50 incidents are created within 1 second
When the notification Lambda processes the burst
Then notifications are batched and sent with 1-second delays between batches
And all 50 notifications are eventually delivered
And a metric "slack_rate_limit_batching" is emitted
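The decision these scenarios describe (honor `Retry-After` in-process when it fits, otherwise fall back to SQS redelivery) can be sketched as a small planner. Names and the safety margin are illustrative, not part of the spec:

```python
def plan_retry(retry_after_s: float, time_remaining_s: float, safety_margin_s: float = 2.0):
    """Decide how to honor a Slack 429. If the wait fits in the Lambda's
    remaining execution time (minus a margin), sleep and retry in-process;
    otherwise leave the SQS message undeleted so it reappears after the
    visibility timeout and the next invocation retries it."""
    if retry_after_s + safety_margin_s <= time_remaining_s:
        return ("sleep_and_retry", retry_after_s)
    return ("requeue", 0)
```

With a 15-second Lambda timeout, a `Retry-After: 5` is retried in-process, while a 30-second rate limit defers to SQS.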
Epic 5: Slack Bot
Feature: Interactive Feedback Buttons
Feature: Slack interactive feedback buttons on incident notifications
Background:
Given an incident notification was posted to Slack with buttons: "Helpful", "Noise", "Escalate"
And the Slack interactivity endpoint is "POST /slack/interactions"
Scenario: User clicks "Helpful" on an incident notification
Given user "@alice" clicks the "Helpful" button on incident "INC-007"
When the Slack interaction payload is received
Then the incident "INC-007" feedback is recorded as "helpful"
And the Slack message is updated to show "✅ Marked helpful by @alice"
And the button is disabled to prevent duplicate feedback
Scenario: User clicks "Noise" on an incident notification
Given user "@bob" clicks the "Noise" button on incident "INC-008"
When the Slack interaction payload is received
Then the incident "INC-008" feedback is recorded as "noise"
And the incident "noise_score" is incremented
And the Slack message is updated to show "🔇 Marked as noise by @bob"
Scenario: User clicks "Escalate" on an incident notification
Given user "@carol" clicks the "Escalate" button on incident "INC-009"
When the Slack interaction payload is received
Then the incident "INC-009" is immediately escalated to PagerDuty
And the Slack message is updated to show "🚨 Escalated by @carol"
And the escalation bypasses the normal time threshold
Scenario: Feedback on an already-resolved incident is rejected
Given incident "INC-010" has status "resolved"
And user "@dave" clicks "Helpful" on the stale Slack message
When the Slack interaction payload is received
Then the Slack message is updated to show "⚠️ Incident already resolved"
And no feedback is recorded
Scenario: Slack interaction payload signature is validated
Given a Slack interaction request with an invalid "X-Slack-Signature" header
When the interaction endpoint receives the request
Then the response status is 401
And the interaction is not processed
Scenario: Duplicate button click by same user is idempotent
Given user "@alice" already marked incident "INC-007" as "helpful"
And "@alice" clicks "Helpful" again on the same message
When the Slack interaction payload is received
Then the feedback count is NOT incremented again
And the response acknowledges the duplicate gracefully
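The signature check in the scenario above follows Slack's documented v0 request signing: an HMAC-SHA256 over `v0:{timestamp}:{body}` with the app's signing secret, compared in constant time, with stale timestamps rejected to block replays. A Python sketch (production code would be Node per Epic 8):

```python
import hashlib
import hmac
import time

def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, now: float = None) -> bool:
    """Validate the X-Slack-Signature header for a request with the given
    X-Slack-Request-Timestamp and raw body. Returns False for stale
    timestamps (possible replay) or a signature mismatch."""
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > 60 * 5:
        return False  # request older than 5 minutes: treat as replay
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```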
Feature: Slash Command — /dd0c status
Feature: /dd0c status slash command
Background:
Given the Slack slash command "/dd0c" is registered
And the command handler endpoint is "POST /slack/commands"
Scenario: /dd0c status returns current open incident count
Given tenant "acme" has 3 open critical incidents and 5 open warning incidents
When user "@alice" runs "/dd0c status" in the Slack workspace
Then the bot responds ephemerally with:
| metric | value |
| open_critical | 3 |
| open_warning | 5 |
| alerts_last_hour | shown |
| system_status | OK |
Scenario: /dd0c status when no open incidents
Given tenant "acme" has 0 open incidents
When user "@alice" runs "/dd0c status"
Then the bot responds with "✅ All clear — no open incidents"
Scenario: /dd0c status responds within Slack's 3-second timeout
Given the command handler receives "/dd0c status"
When the handler processes the request
Then an HTTP 200 response is returned within 3 seconds
And if data retrieval would exceed 3 seconds, an immediate acknowledgment is returned first
And the full response is delivered via response_url
Scenario: /dd0c status is scoped to the requesting tenant
Given user "@alice" belongs to tenant "acme"
When "@alice" runs "/dd0c status"
Then the response contains only "acme"'s incident data
And no data from other tenants is included
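The 3-second-timeout scenario implies a two-path handler: answer inline when the data is ready, otherwise ack immediately and deliver the full result via `response_url`. A hedged sketch, where `get_status_fast` and `enqueue_followup` are hypothetical collaborators:

```python
def format_status(s: dict) -> str:
    """Render the status summary shown in the scenarios."""
    if s["open_critical"] == 0 and s["open_warning"] == 0:
        return "✅ All clear — no open incidents"
    return f"🔴 {s['open_critical']} critical, 🟡 {s['open_warning']} warning incidents open"

def handle_status_command(get_status_fast, enqueue_followup, response_url: str) -> dict:
    """get_status_fast() returns the tenant's status dict, or None if it
    cannot answer inside the 3-second budget. On None, return an immediate
    ephemeral ack and enqueue a worker that POSTs the result to response_url."""
    status = get_status_fast()
    if status is not None:
        return {"response_type": "ephemeral", "text": format_status(status)}
    enqueue_followup(response_url)
    return {"response_type": "ephemeral", "text": "⏳ Fetching status, one moment..."}
```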
Feature: Slash Command — /dd0c anomalies
Feature: /dd0c anomalies slash command
Scenario: /dd0c anomalies returns top noisy services in the last 24 hours
Given service "payments" fired 120 alerts in the last 24 hours
And service "auth" fired 80 alerts
And service "logging" fired 10 alerts
When user "@alice" runs "/dd0c anomalies"
Then the bot responds with a ranked list:
| rank | service | alert_count |
| 1 | payments | 120 |
| 2 | auth | 80 |
| 3 | logging | 10 |
Scenario: /dd0c anomalies with time range argument
Given user "@alice" runs "/dd0c anomalies --last 7d"
When the command handler processes the request
Then the response covers the last 7 days of anomaly data
Scenario: /dd0c anomalies with no data returns helpful message
Given no alerts have been received in the last 24 hours
When user "@alice" runs "/dd0c anomalies"
Then the bot responds with "No anomalies detected in the last 24 hours"
Feature: Slash Command — /dd0c digest
Feature: /dd0c digest slash command
Scenario: /dd0c digest returns on-demand summary report
Given tenant "acme" has activity in the last 24 hours
When user "@alice" runs "/dd0c digest"
Then the bot responds with a summary matching the daily noise report format
And the response includes total alerts, incidents, noise reduction %, and avg MTTR
Scenario: /dd0c digest with custom time range
Given user "@alice" runs "/dd0c digest --last 7d"
When the command handler processes the request
Then the digest covers the last 7 days
Scenario: Unauthorized user cannot run /dd0c commands
Given user "@mallory" is not a member of any configured tenant workspace
When "@mallory" runs "/dd0c status"
Then the bot responds ephemerally with "⛔ You are not authorized to use this command"
And no tenant data is returned
Epic 6: Dashboard API
Feature: Cognito JWT Authentication
Feature: Authenticate Dashboard API requests with Cognito JWT
Background:
Given the Dashboard API requires a valid Cognito JWT in the "Authorization: Bearer <token>" header
Scenario: Valid JWT grants access to the API
Given a user has a valid Cognito JWT for tenant "acme"
When the user calls "GET /api/incidents"
Then the response status is 200
And only "acme"'s incidents are returned
Scenario: Missing Authorization header returns 401
Given a request to "GET /api/incidents" with no Authorization header
When the API Gateway processes the request
Then the response status is 401
And the body contains "error": "missing_token"
Scenario: Expired JWT returns 401
Given a user presents a JWT that expired 10 minutes ago
When the user calls "GET /api/incidents"
Then the response status is 401
And the body contains "error": "token_expired"
Scenario: JWT signed with wrong key returns 401
Given a user presents a JWT signed with a non-Cognito key
When the user calls "GET /api/incidents"
Then the response status is 401
And the body contains "error": "invalid_token_signature"
Scenario: JWT from a different tenant cannot access another tenant's data
Given user "@alice" has a valid JWT for tenant "acme"
When "@alice" calls "GET /api/incidents?tenant_id=globex"
Then the response status is 403
And the body contains "error": "tenant_access_denied"
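The 401/403 branching above can be sketched as claim checks on a decoded JWT payload. This deliberately omits signature verification against the Cognito JWKS (assumed to happen upstream, e.g. in an API Gateway authorizer), and the `custom:tenant_id` claim name is an assumption, not confirmed by the spec:

```python
import base64
import json

def check_jwt_claims(token: str, expected_tenant: str, now: float):
    """Map JWT claim failures to the (status, error) pairs in the scenarios.
    Signature validity is assumed to have been verified already."""
    try:
        payload_b64 = token.split(".")[1]
        payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
        claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    except Exception:
        return (401, "invalid_token")
    if claims.get("exp", 0) < now:
        return (401, "token_expired")
    if claims.get("custom:tenant_id") != expected_tenant:
        return (403, "tenant_access_denied")
    return (200, "ok")
```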
Feature: Incident Listing with Filters
Feature: List incidents with filtering and pagination
Background:
Given the user is authenticated for tenant "acme"
Scenario: List all open incidents
Given tenant "acme" has 15 open incidents
When the user calls "GET /api/incidents?status=open"
Then the response status is 200
And the response contains 15 incidents
And each incident includes: id, severity, service, status, created_at, alert_count
Scenario: Filter incidents by severity
Given tenant "acme" has 5 critical and 10 warning incidents
When the user calls "GET /api/incidents?severity=critical"
Then the response contains exactly 5 incidents
And all returned incidents have severity "critical"
Scenario: Filter incidents by service
Given tenant "acme" has incidents for services "payments", "auth", and "checkout"
When the user calls "GET /api/incidents?service=payments"
Then only incidents for service "payments" are returned
Scenario: Filter incidents by date range
Given incidents exist from the past 30 days
When the user calls "GET /api/incidents?from=2026-02-01&to=2026-02-07"
Then only incidents created between Feb 1 and Feb 7 are returned
Scenario: Pagination returns correct page of results
Given tenant "acme" has 100 incidents
When the user calls "GET /api/incidents?page=2&limit=20"
Then the response contains incidents 21–40
And the response includes "total": 100, "page": 2, "limit": 20
Scenario: Empty result set returns 200 with empty array
Given tenant "acme" has no incidents matching the filter
When the user calls "GET /api/incidents?service=nonexistent"
Then the response status is 200
And the response body is '{"incidents": [], "total": 0}'
Scenario: Incident detail endpoint returns full alert timeline
Given incident "INC-042" has 7 correlated alerts
When the user calls "GET /api/incidents/INC-042"
Then the response includes the incident details
And "alerts" array contains 7 entries with timestamps and sources
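The pagination contract above (1-indexed pages, `page=2&limit=20` returning items 21 through 40, empty result sets as 200 with an empty array) reduces to a slice. A minimal sketch over an already-filtered list:

```python
def paginate(items: list, page: int, limit: int) -> dict:
    """1-indexed pagination over a filtered incident list: page=2 with
    limit=20 returns items 21-40. An empty input yields total=0 and an
    empty incidents array, never an error."""
    start = (page - 1) * limit
    return {
        "incidents": items[start:start + limit],
        "total": len(items),
        "page": page,
        "limit": limit,
    }
```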
Feature: Analytics Endpoints — MTTR
Feature: MTTR analytics endpoint
Background:
Given the user is authenticated for tenant "acme"
Scenario: MTTR endpoint returns average time-to-resolution
Given 10 incidents were resolved in the last 7 days with MTTRs ranging from 5 to 60 minutes
When the user calls "GET /api/analytics/mttr?period=7d"
Then the response includes "avg_mttr_minutes" as a number
And "incident_count" = 10
Scenario: MTTR broken down by service
When the user calls "GET /api/analytics/mttr?period=7d&group_by=service"
Then the response includes a per-service MTTR breakdown
Scenario: MTTR with no resolved incidents returns null
Given no incidents were resolved in the requested period
When the user calls "GET /api/analytics/mttr?period=1d"
Then the response includes "avg_mttr_minutes": null
And "incident_count": 0
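The null-on-empty behavior above is worth pinning down in code, since a naive average would divide by zero. A sketch, taking resolution durations in minutes:

```python
def avg_mttr_minutes(resolved_durations: list) -> dict:
    """Average time-to-resolution in minutes for the period;
    avg_mttr_minutes is None (JSON null) when nothing was resolved."""
    if not resolved_durations:
        return {"avg_mttr_minutes": None, "incident_count": 0}
    avg = round(sum(resolved_durations) / len(resolved_durations), 1)
    return {"avg_mttr_minutes": avg, "incident_count": len(resolved_durations)}
```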
Feature: Analytics Endpoints — Noise Reduction
Feature: Noise reduction analytics endpoint
Scenario: Noise reduction percentage is calculated correctly
Given tenant "acme" received 1000 raw alerts and 150 incidents in the last 7 days
When the user calls "GET /api/analytics/noise-reduction?period=7d"
Then the response includes "noise_reduction_percent": 85
And "raw_alerts": 1000
And "incidents": 150
Scenario: Noise reduction trend over time
When the user calls "GET /api/analytics/noise-reduction?period=30d&granularity=daily"
Then the response includes a daily time series of noise reduction percentages
Scenario: Noise reduction by source
When the user calls "GET /api/analytics/noise-reduction?period=7d&group_by=source"
Then the response includes per-source breakdown (datadog, pagerduty, opsgenie, grafana)
Feature: Tenant Isolation in Dashboard API
Feature: Enforce strict tenant isolation across all API endpoints
Scenario: DynamoDB queries always include tenant_id partition key
Given user "@alice" for tenant "acme" calls any incident endpoint
When the API handler queries DynamoDB
Then the query always includes "tenant_id = acme" as a condition
And no full-table scans are performed
Scenario: TimescaleDB analytics queries are scoped by tenant_id
Given user "@alice" for tenant "acme" calls any analytics endpoint
When the API handler queries TimescaleDB
Then the SQL query includes "WHERE tenant_id = 'acme'"
Scenario: API does not expose tenant_id enumeration
Given user "@alice" calls "GET /api/incidents/INC-999" where INC-999 belongs to tenant "globex"
When the API processes the request
Then the response status is 404 (not 403, to avoid tenant enumeration)
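The 404-not-403 rule falls out naturally when the lookup is keyed on the composite (tenant_id, incident_id): another tenant's incident is then indistinguishable from a nonexistent one. A sketch with a plain dict standing in for the DynamoDB table:

```python
def get_incident(store: dict, tenant_id: str, incident_id: str):
    """Key every read on (tenant_id, incident_id). A miss returns 404
    whether the ID doesn't exist or belongs to another tenant, so the
    response never confirms that a foreign incident ID is valid."""
    item = store.get((tenant_id, incident_id))
    if item is None:
        return (404, {"error": "not_found"})
    return (200, item)
```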
Epic 7: Dashboard UI
Feature: Incident List View
Feature: Incident list page in the React SPA
Background:
Given the user is logged in and the Dashboard SPA is loaded
Scenario: Incident list displays open incidents on load
Given tenant "acme" has 12 open incidents
When the user navigates to "/incidents"
Then the incident list renders 12 rows
And each row shows: incident ID, severity badge, service name, alert count, age
Scenario: Severity badge color coding
Given the incident list contains critical, warning, and info incidents
When the list renders
Then critical incidents show a red badge
And warning incidents show a yellow badge
And info incidents show a blue badge
Scenario: Clicking an incident row navigates to incident detail
Given the incident list is displayed
When the user clicks on incident "INC-042"
Then the browser navigates to "/incidents/INC-042"
Scenario: Filter by severity updates the list in real time
Given the incident list is displayed
When the user selects "Critical" from the severity filter dropdown
Then only critical incidents are shown
And the URL updates to "/incidents?severity=critical"
Scenario: Filter by service updates the list
Given the incident list is displayed
When the user types "payments" in the service search box
Then only incidents for service "payments" are shown
Scenario: Empty state is shown when no incidents match filters
Given no incidents match the current filter
When the list renders
Then a message "No incidents found" is displayed
And a "Clear filters" button is shown
Scenario: Incident list auto-refreshes every 30 seconds
Given the incident list is displayed
When 30 seconds elapse
Then the list silently re-fetches from the API
And new incidents appear without a full page reload
Feature: Alert Timeline View
Feature: Alert timeline within an incident detail page
Scenario: Alert timeline shows all correlated alerts in chronological order
Given incident "INC-042" has 5 correlated alerts from T=0 to T=4min
When the user navigates to "/incidents/INC-042"
Then the timeline renders 5 events in ascending time order
And each event shows: source icon, alert title, severity, timestamp
Scenario: Timeline highlights the root cause alert
Given the first alert in the incident is flagged as "root_cause"
When the timeline renders
Then the root cause alert is visually distinguished (e.g. bold border)
Scenario: Timeline shows deduplication count
Given fingerprint "fp-abc" was suppressed 8 times
When the timeline renders the corresponding alert
Then a badge "×8 duplicates suppressed" is shown on that alert entry
Scenario: Timeline is scrollable for large incidents
Given an incident has 200 correlated alerts
When the timeline renders
Then a virtualized scroll list is used
And the page does not freeze or crash
Feature: MTTR Chart
Feature: MTTR trend chart on the analytics page
Scenario: MTTR chart renders a 7-day trend line
Given the analytics API returns daily MTTR data for the last 7 days
When the user navigates to "/analytics"
Then a line chart is rendered with 7 data points
And the X-axis shows dates and the Y-axis shows minutes
Scenario: MTTR chart shows "No data" state when no resolved incidents
Given no incidents were resolved in the selected period
When the chart renders
Then a "No resolved incidents in this period" message is shown instead of the chart
Scenario: MTTR chart period selector changes the data range
Given the user is on the analytics page
When the user selects "Last 30 days" from the period dropdown
Then the chart re-fetches data for the last 30 days
And the chart updates without a full page reload
Feature: Noise Reduction Percentage Display
Feature: Noise reduction metric display on analytics page
Scenario: Noise reduction percentage is prominently displayed
Given the analytics API returns noise_reduction_percent = 84
When the user views the analytics page
Then a large "84%" figure is displayed under "Noise Reduction"
Scenario: Noise reduction trend sparkline is shown
Given daily noise reduction data is available for 30 days
When the analytics page renders
Then a sparkline chart shows the 30-day trend
Scenario: Noise reduction breakdown by source is shown
Given the API returns per-source noise reduction data
When the user clicks "By Source" tab
Then a bar chart shows noise reduction % for each source (Datadog, PagerDuty, OpsGenie, Grafana)
Feature: Webhook Setup Wizard
Feature: Webhook setup wizard for onboarding new monitoring sources
Scenario: Wizard generates a unique webhook URL for Datadog
Given the user navigates to "/settings/webhooks"
And clicks "Add Webhook Source"
When the user selects "Datadog" from the source dropdown
And clicks "Generate"
Then a unique webhook URL is displayed: "https://ingest.dd0c.io/webhooks/datadog/{tenant_id}/{token}"
And the HMAC secret is shown once for copying
Scenario: Wizard provides copy-paste instructions for each source
Given the user has generated a Datadog webhook URL
When the wizard displays the setup instructions
Then step-by-step instructions for configuring Datadog are shown
And a "Test Webhook" button is available
Scenario: Test webhook button sends a test payload and confirms receipt
Given the user clicks "Test Webhook" for a configured Datadog source
When the test payload is sent
Then the wizard shows "✅ Test payload received successfully"
And the test alert appears in the incident list as a test event
Scenario: Wizard shows validation error if source already configured
Given tenant "acme" already has a Datadog webhook configured
When the user tries to add a second Datadog webhook
Then the wizard shows "A Datadog webhook is already configured. Regenerate token?"
Scenario: Regenerating a webhook token invalidates the old token
Given tenant "acme" has an existing Datadog webhook token
When the user clicks "Regenerate Token" and confirms
Then a new token is generated
And the old token is immediately invalidated
And any requests using the old token return 401
Epic 8: Infrastructure
Feature: CDK Stack — Lambda Ingestion
Feature: CDK provisions Lambda ingestion infrastructure
Scenario: Lambda function is created with correct runtime and memory
Given the CDK stack is synthesized
When the CloudFormation template is inspected
Then a Lambda function "dd0c-ingestion" exists with runtime "nodejs20.x"
And memory is set to 512MB
And timeout is set to 30 seconds
Scenario: Lambda has least-privilege IAM role
Given the CDK stack is synthesized
When the IAM role for "dd0c-ingestion" is inspected
Then the role allows "sqs:SendMessage" only to the ingestion SQS queue ARN
And the role allows "s3:PutObject" only to the "dd0c-raw-webhooks" bucket
And the role does NOT have "s3:*" or "sqs:*" wildcards
Scenario: Lambda is behind API Gateway with throttling
Given the CDK stack is synthesized
When the API Gateway configuration is inspected
Then stage throttling is configured with a burst limit of 1000 requests and a steady-state rate of 500 requests/second
And WAF is attached to the API Gateway stage
Scenario: Lambda environment variables are sourced from SSM Parameter Store
Given the CDK stack is synthesized
When the Lambda environment configuration is inspected
Then HMAC secrets are referenced from SSM parameters (not hardcoded)
And no secrets appear in plaintext in the CloudFormation template
Feature: CDK Stack — ECS Fargate Correlation Engine
Feature: CDK provisions ECS Fargate for the correlation engine
Scenario: ECS service is created with correct task definition
Given the CDK stack is synthesized
When the ECS task definition is inspected
Then the task uses Fargate launch type
And CPU is set to 1024 (1 vCPU) and memory to 2048MB
And the container image is pulled from ECR "dd0c-correlation-engine"
Scenario: ECS service auto-scales based on SQS queue depth
Given the CDK stack is synthesized
When the auto-scaling configuration is inspected
Then a step-scaling policy exists targeting SQS "ApproximateNumberOfMessagesVisible"
And scale-out triggers when queue depth > 100 messages
And scale-in triggers when queue depth < 10 messages
And minimum capacity is 1 and maximum capacity is 10
Scenario: ECS tasks run in private subnets with no public IP
Given the CDK stack is synthesized
When the ECS network configuration is inspected
Then tasks are placed in private subnets
And "assignPublicIp" is DISABLED
And a NAT Gateway provides outbound internet access
Feature: CDK Stack — SQS Queues
Feature: CDK provisions SQS queues with correct configuration
Scenario: Ingestion SQS queue has a Dead Letter Queue configured
Given the CDK stack is synthesized
When the SQS queue "dd0c-ingestion" is inspected
Then a DLQ "dd0c-ingestion-dlq" is attached
And maxReceiveCount is 3
And the DLQ retention period is 14 days
Scenario: SQS queue has server-side encryption enabled
Given the CDK stack is synthesized
When the SQS queue configuration is inspected
Then SSE is enabled using an AWS-managed KMS key
Scenario: SQS visibility timeout exceeds Lambda timeout
Given the Lambda timeout is 30 seconds
When the SQS queue visibility timeout is inspected
Then the visibility timeout is at least 6x the Lambda timeout (180 seconds)
Feature: CDK Stack — DynamoDB
Feature: CDK provisions DynamoDB for incident storage
Scenario: Incidents table has correct key schema
Given the CDK stack is synthesized
When the DynamoDB table "dd0c-incidents" is inspected
Then the partition key is "tenant_id" (String)
And the sort key is "incident_id" (String)
Scenario: Incidents table has a GSI for status queries
Given the CDK stack is synthesized
When the GSIs on "dd0c-incidents" are inspected
Then a GSI "status-created_at-index" exists
And the GSI partition key is "status" and the sort key is "created_at"
Scenario: DynamoDB table has point-in-time recovery enabled
Given the CDK stack is synthesized
When the DynamoDB table settings are inspected
Then PITR is enabled on "dd0c-incidents"
Scenario: DynamoDB TTL is configured for free-tier retention
Given the CDK stack is synthesized
When the DynamoDB TTL configuration is inspected
Then TTL is enabled on attribute "expires_at"
And free-tier records have "expires_at" set to 7 days from creation
Feature: CI/CD Pipeline
Feature: CI/CD pipeline for automated deployment
Scenario: Pull request triggers test suite
Given a developer opens a pull request against "main"
When the CI pipeline runs
Then unit tests, integration tests, and CDK synth all pass before merge is allowed
Scenario: Merge to main triggers staging deployment
Given a PR is merged to "main"
When the CD pipeline runs
Then the CDK stack is deployed to the "staging" environment
And smoke tests run against staging endpoints
Scenario: Production deployment requires manual approval
Given the staging deployment and smoke tests pass
When the CD pipeline reaches the production stage
Then a manual approval gate is presented
And production deployment only proceeds after approval
Scenario: Failed deployment triggers automatic rollback
Given a production deployment fails health checks
When the CD pipeline detects the failure
Then the previous CloudFormation stack version is restored
And a Slack alert is sent to "#dd0c-ops" with the rollback reason
Scenario: CDK diff is posted as a PR comment
Given a developer opens a PR with infrastructure changes
When the CI pipeline runs "cdk diff"
Then the diff output is posted as a comment on the PR
Epic 9: Onboarding & PLG
Feature: OAuth Signup
Feature: User signup via OAuth (Google / GitHub)
Background:
Given the signup page is at "/signup"
Scenario: New user signs up with Google OAuth
Given a new user visits "/signup"
When the user clicks "Sign up with Google"
And completes the Google OAuth flow
Then a new tenant is created for the user's email domain
And the user is assigned the "owner" role for the new tenant
And the user is redirected to the onboarding wizard
Scenario: New user signs up with GitHub OAuth
Given a new user visits "/signup"
When the user clicks "Sign up with GitHub"
And completes the GitHub OAuth flow
Then a new tenant is created
And the user is redirected to the onboarding wizard
Scenario: Existing user signs in via OAuth
Given a user with email "alice@acme.com" already has an account
When the user completes the Google OAuth flow
Then no new tenant is created
And the user is redirected to "/incidents"
Scenario: OAuth failure shows user-friendly error
Given the Google OAuth provider returns an error
When the user is redirected back to the app
Then an error message "Sign-in failed. Please try again." is displayed
And no partial account is created
Scenario: Signup is blocked for disposable email domains
Given a user attempts to sign up with "user@mailinator.com"
When the OAuth flow completes
Then the signup is rejected with "Disposable email addresses are not allowed"
And no tenant is created
Feature: Free Tier — 10K Alerts/Month Limit
Feature: Enforce free tier limit of 10,000 alerts per month
Background:
Given tenant "free-co" is on the free tier
And the monthly alert counter is stored in DynamoDB
Scenario: Alert ingestion succeeds under the 10K limit
Given tenant "free-co" has ingested 9,999 alerts this month
When a new alert arrives
Then the alert is processed normally
And the counter is incremented to 10,000
Scenario: Alert ingestion is blocked at the 10K limit
Given tenant "free-co" has ingested 10,000 alerts this month
When a new alert arrives via webhook
Then the webhook returns HTTP 429
And the response body includes "Free tier limit reached. Upgrade to continue."
And the alert is NOT processed or stored
Scenario: Tenant receives email warning at 80% of limit
Given tenant "free-co" has ingested 8,000 alerts this month
When the 8,001st alert is ingested
Then an email is sent to the tenant owner: "You've used 80% of your free tier quota"
Scenario: Alert counter resets on the 1st of each month
Given tenant "free-co" has ingested 10,000 alerts in January
When February 1st arrives (UTC midnight)
Then the monthly counter is reset to 0
And alert ingestion is unblocked
Scenario: Paid tenant has no alert ingestion limit
Given tenant "paid-co" is on the "pro" plan
And has ingested 50,000 alerts this month
When a new alert arrives
Then the alert is processed normally
And no limit check is applied
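The quota decision across these scenarios (paid plans bypass the check, free tier blocks at 10,000, a warning fires when the 80% line is crossed) can be sketched as one function. The action strings are illustrative labels, not a defined API:

```python
FREE_TIER_LIMIT = 10_000
WARN_THRESHOLD = int(FREE_TIER_LIMIT * 0.8)  # 8,000 alerts

def ingest_check(plan: str, used_this_month: int):
    """Return (http_status, actions) for the next incoming alert.
    Paid plans skip the check; free tier returns 429 at the limit and
    triggers the 80% warning email exactly when crossing the threshold."""
    if plan != "free":
        return (200, [])
    if used_this_month >= FREE_TIER_LIMIT:
        return (429, ["reject:limit_reached"])
    actions = []
    if used_this_month == WARN_THRESHOLD:  # this alert is the 8,001st
        actions.append("email:80_percent_warning")
    return (200, actions)
```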
Feature: 7-Day Retention for Free Tier
Feature: Enforce 7-day data retention for free tier tenants
Scenario: Free tier incidents older than 7 days are expired via DynamoDB TTL
Given tenant "free-co" is on the free tier
And incident "INC-OLD" was created 8 days ago
When DynamoDB TTL runs
Then "INC-OLD" is deleted from the incidents table
Scenario: Free tier raw S3 archives older than 7 days are deleted
Given tenant "free-co" has raw webhook archives in S3 from 8 days ago
When the S3 lifecycle policy runs
Then objects older than 7 days are deleted for free-tier tenants
Scenario: Paid tier incidents are retained for 90 days
Given tenant "paid-co" is on the "pro" plan
And incident "INC-OLD" was created 30 days ago
When DynamoDB TTL runs
Then "INC-OLD" is NOT deleted
Scenario: Retention policy is enforced per-tenant, not globally
Given "free-co" and "paid-co" both have incidents from 10 days ago
When TTL and lifecycle policies run
Then "free-co"'s old incidents are deleted
And "paid-co"'s old incidents are retained
Feature: Stripe Billing Integration
Feature: Stripe billing for plan upgrades
Scenario: User upgrades from free to pro plan via Stripe Checkout
Given user "@alice" is on the free tier
When "@alice" clicks "Upgrade to Pro" in the dashboard
Then a Stripe Checkout session is created
And the user is redirected to the Stripe-hosted payment page
Scenario: Successful Stripe payment activates pro plan
Given a Stripe Checkout session completes successfully
When the Stripe "checkout.session.completed" webhook is received
Then tenant "acme"'s plan is updated to "pro" in DynamoDB
And the alert ingestion limit is removed
And a confirmation email is sent to the tenant owner
Scenario: Failed Stripe payment does not activate pro plan
Given a Stripe payment fails
When the Stripe "payment_intent.payment_failed" webhook is received
Then the tenant remains on the free tier
And a failure notification email is sent
Scenario: Stripe webhook signature is validated
Given a Stripe webhook arrives with an invalid "Stripe-Signature" header
When the billing Lambda processes the request
Then the response status is 401
And the event is not processed
Scenario: Subscription cancellation downgrades tenant to free tier
Given tenant "acme" cancels their pro subscription
When the Stripe "customer.subscription.deleted" webhook is received
Then tenant "acme"'s plan is downgraded to "free"
And the 10K/month limit is re-applied from the next billing cycle
And a downgrade confirmation email is sent
Scenario: Stripe billing is idempotent — duplicate webhook events are ignored
Given a Stripe "checkout.session.completed" event was already processed
When the same event is received again (Stripe retry)
Then the tenant plan is not double-updated
And the response is 200 (idempotent acknowledgment)
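Stripe redelivers webhook events, so the idempotency scenario above hinges on recording processed event IDs. A sketch with an in-memory set; a real handler would persist the ID in DynamoDB with a conditional write so the dedup survives across Lambda invocations:

```python
def process_stripe_event(event: dict, seen_event_ids: set, apply_change):
    """Apply a billing change once per Stripe event ID. Duplicates
    (Stripe retries) are acknowledged with 200 but not re-applied."""
    if event["id"] in seen_event_ids:
        return (200, "duplicate_ignored")
    apply_change(event)           # e.g. set plan to "pro" in the tenant record
    seen_event_ids.add(event["id"])
    return (200, "processed")
```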
Feature: Webhook URL Generation
Feature: Generate unique webhook URLs per tenant per source
Scenario: Webhook URL is generated on tenant creation
Given a new tenant "new-co" is created via OAuth signup
When the onboarding wizard runs
Then a unique webhook URL is generated for each supported source
And each URL follows the pattern "https://ingest.dd0c.io/webhooks/{source}/{tenant_id}/{token}"
And tokens are cryptographically random (32 bytes, URL-safe base64)
Scenario: Webhook token is stored hashed in DynamoDB
Given a webhook token is generated for tenant "new-co"
When the token is stored
Then only the SHA-256 hash of the token is stored in DynamoDB
And the plaintext token is shown to the user exactly once
Scenario: Webhook URL is validated on each ingestion request
Given a request arrives at "POST /webhooks/datadog/{tenant_id}/{token}"
When the ingestion Lambda validates the token
Then the token hash is looked up in DynamoDB for the given tenant_id
And if the hash matches, the request is accepted
And if the hash does not match, the response is 401
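The token lifecycle above (32 random URL-safe base64 bytes, SHA-256 hash at rest, hash comparison on each request) maps directly onto Python's standard library; a sketch:

```python
import hashlib
import hmac
import secrets

def generate_webhook_token() -> str:
    """32 cryptographically random bytes, URL-safe base64 (43 chars, no padding)."""
    return secrets.token_urlsafe(32)

def hash_token(token: str) -> str:
    """Only this digest is stored; the plaintext token is shown once."""
    return hashlib.sha256(token.encode()).hexdigest()

def validate_token(stored_hash: str, presented_token: str) -> int:
    """200 when the presented token hashes to the stored digest, else 401.
    Constant-time compare avoids leaking a prefix match via timing."""
    ok = hmac.compare_digest(stored_hash, hash_token(presented_token))
    return 200 if ok else 401
```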
Scenario: 60-second time-to-value — first correlated alert within 60 seconds of webhook setup
Given a new tenant completes the onboarding wizard and copies their webhook URL
When the tenant sends their first alert to the webhook URL
Then within 60 seconds, a correlated incident appears in the dashboard
And a Slack notification is sent if Slack is configured