# dd0c/alert — Alert Intelligence: BDD Acceptance Test Specifications

> Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic.

---

## Epic 1: Webhook Ingestion

### Feature: HMAC Signature Validation — Datadog

```gherkin
Feature: HMAC signature validation for Datadog webhooks
  As the ingestion layer
  I want to reject requests with invalid or missing HMAC signatures
  So that only legitimate Datadog payloads are processed

  Background:
    Given the webhook endpoint is "POST /webhooks/datadog"
    And a valid Datadog HMAC secret is configured as "dd-secret-abc123"

  Scenario: Valid Datadog HMAC signature is accepted
    Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}'
    And the request includes header "X-Datadog-Webhook-ID" with a valid HMAC-SHA256 signature
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And the payload is forwarded to the normalization SQS queue

  Scenario: Missing HMAC signature header is rejected
    Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}'
    And the request has no "X-Datadog-Webhook-ID" header
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the payload is NOT forwarded to SQS
    And a rejection event is logged with reason "missing_signature"

  Scenario: Tampered payload with mismatched HMAC is rejected
    Given a Datadog alert payload
    And the HMAC signature was computed over a different payload body
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the payload is NOT forwarded to SQS
    And a rejection event is logged with reason "signature_mismatch"

  Scenario: Replay attack with expired timestamp is rejected
    Given a Datadog alert payload with a valid HMAC signature
    And the request timestamp is more than 5 minutes in the past
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the rejection reason is "timestamp_expired"

  Scenario: HMAC secret rotation — old secret still accepted during grace period
    Given the Datadog HMAC secret was rotated 2 minutes ago
    And the request uses the previous secret for signing
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And a warning metric "hmac_old_secret_used" is emitted
```
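
The checks above (missing signature, mismatched digest, replay window, rotation grace period) can be sketched as one validation function. This is a minimal illustration, not the ingestion Lambda itself; the function name, the argument shapes, and the idea of passing the current secret plus any grace-period secret as a list are assumptions.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 5 * 60  # replay window from the scenarios above


def validate_signature(body, signature_hex, timestamp, secrets, now=None):
    """Return "ok" or a rejection reason matching the scenario names.

    `secrets` holds the current secret first, then any previous secret
    still inside its rotation grace period.
    """
    now = time.time() if now is None else now
    if not signature_hex:
        return "missing_signature"
    if now - timestamp > MAX_SKEW_SECONDS:
        return "timestamp_expired"
    for secret in secrets:
        expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
        # constant-time comparison avoids timing side channels
        if hmac.compare_digest(expected, signature_hex):
            return "ok"
    return "signature_mismatch"
```

Accepting a list of secrets is what makes the rotation scenario pass: a request signed with the previous secret still verifies, and the caller can emit the `hmac_old_secret_used` metric when the match was not against the first entry.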

### Feature: HMAC Signature Validation — PagerDuty

```gherkin
Feature: HMAC signature validation for PagerDuty webhooks

  Background:
    Given the webhook endpoint is "POST /webhooks/pagerduty"
    And a valid PagerDuty signing secret is configured

  Scenario: Valid PagerDuty v3 signature is accepted
    Given a PagerDuty webhook payload
    And the request includes "X-PagerDuty-Signature" with a valid v3 HMAC-SHA256 value
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And the payload is enqueued for normalization

  Scenario: PagerDuty v1 signature (legacy) is rejected
    Given a PagerDuty webhook payload signed with v1 scheme
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the rejection reason is "unsupported_signature_version"

  Scenario: Missing signature on PagerDuty webhook
    Given a PagerDuty webhook payload with no signature header
    When the Lambda ingestion handler receives the request
    Then the response status is 401
```

### Feature: HMAC Signature Validation — OpsGenie

```gherkin
Feature: HMAC signature validation for OpsGenie webhooks

  Background:
    Given the webhook endpoint is "POST /webhooks/opsgenie"
    And a valid OpsGenie integration API key is configured

  Scenario: Valid OpsGenie HMAC is accepted
    Given an OpsGenie alert payload
    And the request includes "X-OG-Delivery-Signature" with a valid HMAC-SHA256 value
    When the Lambda ingestion handler receives the request
    Then the response status is 200

  Scenario: Invalid OpsGenie signature is rejected
    Given an OpsGenie alert payload with a forged signature
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the rejection reason is "signature_mismatch"
```

### Feature: HMAC Signature Validation — Grafana

```gherkin
Feature: HMAC signature validation for Grafana webhooks

  Background:
    Given the webhook endpoint is "POST /webhooks/grafana"
    And a Grafana webhook secret is configured

  Scenario: Valid Grafana signature is accepted
    Given a Grafana alert payload
    And the request includes "X-Grafana-Signature" with a valid HMAC-SHA256 value
    When the Lambda ingestion handler receives the request
    Then the response status is 200

  Scenario: Grafana webhook with no secret configured (open mode) is accepted with warning
    Given no Grafana webhook secret is configured for the tenant
    And a Grafana alert payload arrives without a signature header
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And a warning metric "grafana_unauthenticated_webhook" is emitted
```

### Feature: Payload Normalization to Canonical Schema

```gherkin
Feature: Normalize incoming webhook payloads to canonical alert schema

  Scenario: Datadog payload is normalized to canonical schema
    Given a raw Datadog webhook payload with fields "title", "severity", "host", "tags"
    When the normalization Lambda processes the payload
    Then the canonical alert contains:
      | field       | value                        |
      | source      | datadog                      |
      | severity    | mapped from Datadog severity |
      | service     | extracted from tags          |
      | fingerprint | SHA-256 of source+title+host |
      | received_at | ISO-8601 timestamp           |
      | raw_payload | original JSON preserved      |

  Scenario: PagerDuty payload is normalized to canonical schema
    Given a raw PagerDuty v3 webhook payload
    When the normalization Lambda processes the payload
    Then the canonical alert contains "source" = "pagerduty"
    And "severity" is mapped from PagerDuty urgency field
    And "service" is extracted from the PagerDuty service name

  Scenario: OpsGenie payload is normalized to canonical schema
    Given a raw OpsGenie webhook payload
    When the normalization Lambda processes the payload
    Then the canonical alert contains "source" = "opsgenie"
    And "severity" is mapped from OpsGenie priority field

  Scenario: Grafana payload is normalized to canonical schema
    Given a raw Grafana alerting webhook payload
    When the normalization Lambda processes the payload
    Then the canonical alert contains "source" = "grafana"
    And "severity" is mapped from Grafana alert state

  Scenario: Unknown source type returns 400
    Given a webhook payload posted to "/webhooks/unknown-source"
    When the Lambda ingestion handler receives the request
    Then the response status is 400
    And the error reason is "unknown_source"

  Scenario: Malformed JSON payload returns 400
    Given a request body that is not valid JSON
    When the Lambda ingestion handler receives the request
    Then the response status is 400
    And the error reason is "invalid_json"
```
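
The fingerprint row in the table above ("SHA-256 of source+title+host") can be sketched directly. A separator between the parts is an assumption on our side — it avoids ambiguous concatenations such as `("ab", "c")` vs. `("a", "bc")`; the spec only fixes the inputs and the hash.

```python
import hashlib


def fingerprint(source: str, title: str, host: str) -> str:
    """Deterministic SHA-256 over source+title+host, per the canonical schema table."""
    material = "|".join((source, title, host))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Determinism is what the later deduplication scenarios depend on: two identical alerts must hash to the same value, and changing any input must change it.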

### Feature: Async S3 Archival

```gherkin
Feature: Archive raw webhook payloads to S3 asynchronously

  Scenario: Every accepted payload is archived to S3
    Given a valid Datadog webhook payload is received and accepted
    When the Lambda ingestion handler processes the request
    Then the raw payload is written to S3 bucket "dd0c-raw-webhooks"
    And the S3 key follows the pattern "raw/{source}/{tenant_id}/{YYYY}/{MM}/{DD}/{uuid}.json"
    And the archival happens asynchronously (does not block the 200 response)

  Scenario: S3 archival failure does not fail the ingestion
    Given a valid webhook payload is received
    And the S3 write operation fails with a transient error
    When the Lambda ingestion handler processes the request
    Then the response status is still 200
    And the payload is still forwarded to SQS
    And an error metric "s3_archival_failure" is emitted

  Scenario: Archived payload includes tenant ID and trace context
    Given a valid webhook payload from tenant "tenant-xyz"
    When the payload is archived to S3
    Then the S3 object metadata includes "tenant_id" = "tenant-xyz"
    And the S3 object metadata includes the OTEL trace ID
```
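
The key pattern in the first scenario can be generated with a small helper. A sketch only — the function name and the optional override parameters (useful for deterministic tests) are assumptions, not part of the spec.

```python
import uuid
from datetime import datetime, timezone


def archive_key(source: str, tenant_id: str, now=None, uid=None) -> str:
    """Build the S3 key "raw/{source}/{tenant_id}/{YYYY}/{MM}/{DD}/{uuid}.json"."""
    now = now or datetime.now(timezone.utc)
    uid = uid or uuid.uuid4()
    # zero-padded month/day keep keys lexicographically sortable by date
    return f"raw/{source}/{tenant_id}/{now:%Y}/{now:%m}/{now:%d}/{uid}.json"
```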

### Feature: SQS Payload Size Limit (256KB)

```gherkin
Feature: Handle SQS 256KB payload size limit during ingestion

  Scenario: Payload under 256KB is sent directly to SQS
    Given a normalized canonical alert payload of 10KB
    When the ingestion Lambda forwards it to SQS
    Then the message is placed on the SQS queue directly
    And no S3 pointer pattern is used

  Scenario: Payload exceeding 256KB is stored in S3 and pointer sent to SQS
    Given a normalized canonical alert payload of 300KB (e.g. a large raw_payload)
    When the ingestion Lambda attempts to forward it to SQS
    Then the full payload is stored in S3 under "sqs-overflow/{uuid}.json"
    And an SQS message is sent containing only the S3 pointer and metadata
    And the SQS message size is under 256KB

  Scenario: Correlation engine retrieves oversized payload from S3 pointer
    Given an SQS message containing an S3 pointer for an oversized payload
    When the correlation engine consumer reads the SQS message
    Then it fetches the full payload from S3 using the pointer
    And processes it as a normal canonical alert

  Scenario: S3 pointer fetch fails in correlation engine
    Given an SQS message containing an S3 pointer
    And the S3 object has been deleted or is unavailable
    When the correlation engine attempts to fetch the payload
    Then the message is sent to the Dead Letter Queue
    And an alert metric "sqs_pointer_fetch_failure" is emitted
```
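
The pointer pattern above can be sketched on the producer side. A dict stands in for the S3 client here; a real implementation would call `put_object` on the overflow bucket. Function name and message shape are assumptions.

```python
import json
import uuid

SQS_LIMIT = 256 * 1024  # hard SQS message size limit, in bytes


def prepare_message(alert: dict, s3_store: dict) -> dict:
    """Return the SQS message body for `alert`, spilling oversized payloads to S3."""
    body = json.dumps(alert)
    if len(body.encode("utf-8")) <= SQS_LIMIT:
        return {"inline": True, "body": body}
    # oversized: full payload goes to S3, SQS carries only pointer + metadata
    key = f"sqs-overflow/{uuid.uuid4()}.json"
    s3_store[key] = body
    pointer = {
        "s3_key": key,
        "tenant_id": alert.get("tenant_id"),
        "fingerprint": alert.get("fingerprint"),
    }
    return {"inline": False, "body": json.dumps(pointer)}
```

The consumer branch mirrors it: if the body carries an `s3_key`, fetch before processing, and route to the DLQ when the fetch fails.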

### Feature: Dead Letter Queue Handling

```gherkin
Feature: Dead Letter Queue overflow and monitoring

  Scenario: Message failing max retries is moved to DLQ
    Given an SQS message that causes a processing error on every attempt
    When the message has been retried 3 times (maxReceiveCount = 3)
    Then the message is automatically moved to the DLQ "dd0c-ingestion-dlq"
    And a CloudWatch alarm "DLQDepthHigh" is triggered when DLQ depth > 10

  Scenario: DLQ overflow triggers operational alert
    Given the DLQ contains more than 100 messages
    When the DLQ depth CloudWatch alarm fires
    Then a Slack notification is sent to the ops channel "#dd0c-ops"
    And the notification includes the DLQ name and current depth

  Scenario: DLQ messages can be replayed after fix
    Given 50 messages are sitting in the DLQ
    When an operator triggers the DLQ replay Lambda
    Then messages are moved back to the main SQS queue in batches of 10
    And each replayed message retains its original trace context
```
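
The replay scenario's batching (move back in batches of 10, messages unchanged) reduces to a simple chunking helper; the SQS send/delete calls around each batch are omitted. Name and shape are assumptions.

```python
def replay_batches(messages, batch_size=10):
    """Yield DLQ messages in redrive batches of `batch_size`, preserving each
    message dict (and therefore its trace-context attributes) unchanged."""
    for i in range(0, len(messages), batch_size):
        yield messages[i:i + batch_size]
```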

---

## Epic 2: Alert Normalization

### Feature: Datadog Source Parser

```gherkin
Feature: Parse and normalize Datadog alert payloads

  Background:
    Given the Datadog parser is registered for source "datadog"

  Scenario: Datadog "error" event maps to severity "critical"
    Given a Datadog payload with "alert_type" = "error" and "priority" = "P1"
    When the Datadog parser processes the payload
    Then the canonical alert "severity" = "critical"

  Scenario: Datadog "warning" event maps to severity "warning"
    Given a Datadog payload with "alert_type" = "warning"
    When the Datadog parser processes the payload
    Then the canonical alert "severity" = "warning"

  Scenario: Datadog "recovery" event maps to status "resolved"
    Given a Datadog payload with "alert_type" = "recovery"
    When the Datadog parser processes the payload
    Then the canonical alert "status" = "resolved"
    And the canonical alert "resolved_at" is set to the event timestamp

  Scenario: Service extracted from Datadog tags
    Given a Datadog payload with "tags" = ["service:payments", "env:prod", "team:backend"]
    When the Datadog parser processes the payload
    Then the canonical alert "service" = "payments"
    And the canonical alert "environment" = "prod"

  Scenario: Service tag absent — service defaults to hostname
    Given a Datadog payload with no "service:" tag
    And the payload contains "host" = "payments-worker-01"
    When the Datadog parser processes the payload
    Then the canonical alert "service" = "payments-worker-01"

  Scenario: Fingerprint is deterministic for identical alerts
    Given two identical Datadog payloads with the same title, host, and tags
    When both are processed by the Datadog parser
    Then both canonical alerts have the same "fingerprint" value

  Scenario: Fingerprint differs when title changes
    Given two Datadog payloads differing only in "title"
    When both are processed by the Datadog parser
    Then the canonical alerts have different "fingerprint" values

  Scenario: Datadog payload missing required "title" field
    Given a Datadog payload with no "title" field
    When the Datadog parser processes the payload
    Then a normalization error is raised with reason "missing_required_field:title"
    And the alert is sent to the normalization DLQ
```
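
The mapping rules in these scenarios can be sketched as a small parser. The exact contents of the severity map beyond the cases stated, the fallback severity, and the returned field set are assumptions; `resolved_at` handling is elided.

```python
def parse_datadog(payload: dict) -> dict:
    """Sketch of the Datadog mapping rules from the scenarios above."""
    if "title" not in payload:
        raise ValueError("missing_required_field:title")

    # tags like "service:payments" become key/value pairs
    tags = dict(t.split(":", 1) for t in payload.get("tags", []) if ":" in t)
    alert_type = payload.get("alert_type")

    severity = {"error": "critical", "warning": "warning"}.get(alert_type, "info")
    status = "resolved" if alert_type == "recovery" else "firing"

    return {
        "source": "datadog",
        "severity": severity,
        "status": status,
        # service falls back to hostname when no "service:" tag is present
        "service": tags.get("service", payload.get("host")),
        "environment": tags.get("env"),
    }
```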

### Feature: PagerDuty Source Parser

```gherkin
Feature: Parse and normalize PagerDuty webhook payloads

  Background:
    Given the PagerDuty parser is registered for source "pagerduty"

  Scenario: PagerDuty "trigger" event creates a new canonical alert
    Given a PagerDuty v3 webhook with "event_type" = "incident.triggered"
    When the PagerDuty parser processes the payload
    Then the canonical alert "status" = "firing"
    And "source" = "pagerduty"

  Scenario: PagerDuty "acknowledge" event updates alert status
    Given a PagerDuty v3 webhook with "event_type" = "incident.acknowledged"
    When the PagerDuty parser processes the payload
    Then the canonical alert "status" = "acknowledged"

  Scenario: PagerDuty "resolve" event updates alert status
    Given a PagerDuty v3 webhook with "event_type" = "incident.resolved"
    When the PagerDuty parser processes the payload
    Then the canonical alert "status" = "resolved"

  Scenario: PagerDuty urgency "high" maps to severity "critical"
    Given a PagerDuty payload with "urgency" = "high"
    When the PagerDuty parser processes the payload
    Then the canonical alert "severity" = "critical"

  Scenario: PagerDuty urgency "low" maps to severity "warning"
    Given a PagerDuty payload with "urgency" = "low"
    When the PagerDuty parser processes the payload
    Then the canonical alert "severity" = "warning"

  Scenario: PagerDuty service name is extracted correctly
    Given a PagerDuty payload with "service.name" = "checkout-api"
    When the PagerDuty parser processes the payload
    Then the canonical alert "service" = "checkout-api"

  Scenario: PagerDuty dedup key used as fingerprint seed
    Given a PagerDuty payload with "dedup_key" = "pd-dedup-xyz789"
    When the PagerDuty parser processes the payload
    Then the canonical alert "fingerprint" incorporates "pd-dedup-xyz789"
```

### Feature: OpsGenie Source Parser

```gherkin
Feature: Parse and normalize OpsGenie webhook payloads

  Background:
    Given the OpsGenie parser is registered for source "opsgenie"

  Scenario: OpsGenie "Create" action maps to status "firing"
    Given an OpsGenie webhook with "action" = "Create"
    When the OpsGenie parser processes the payload
    Then the canonical alert "status" = "firing"

  Scenario: OpsGenie "Close" action maps to status "resolved"
    Given an OpsGenie webhook with "action" = "Close"
    When the OpsGenie parser processes the payload
    Then the canonical alert "status" = "resolved"

  Scenario: OpsGenie "Acknowledge" action maps to status "acknowledged"
    Given an OpsGenie webhook with "action" = "Acknowledge"
    When the OpsGenie parser processes the payload
    Then the canonical alert "status" = "acknowledged"

  Scenario: OpsGenie priority P1 maps to severity "critical"
    Given an OpsGenie payload with "priority" = "P1"
    When the OpsGenie parser processes the payload
    Then the canonical alert "severity" = "critical"

  Scenario: OpsGenie priority P3 maps to severity "warning"
    Given an OpsGenie payload with "priority" = "P3"
    When the OpsGenie parser processes the payload
    Then the canonical alert "severity" = "warning"

  Scenario: OpsGenie priority P5 maps to severity "info"
    Given an OpsGenie payload with "priority" = "P5"
    When the OpsGenie parser processes the payload
    Then the canonical alert "severity" = "info"

  Scenario: OpsGenie tags used for service extraction
    Given an OpsGenie payload with "tags" = ["service:inventory", "region:us-east-1"]
    When the OpsGenie parser processes the payload
    Then the canonical alert "service" = "inventory"
```

### Feature: Grafana Source Parser

```gherkin
Feature: Parse and normalize Grafana alerting webhook payloads

  Background:
    Given the Grafana parser is registered for source "grafana"

  Scenario: Grafana "alerting" state maps to status "firing"
    Given a Grafana webhook with "state" = "alerting"
    When the Grafana parser processes the payload
    Then the canonical alert "status" = "firing"

  Scenario: Grafana "ok" state maps to status "resolved"
    Given a Grafana webhook with "state" = "ok"
    When the Grafana parser processes the payload
    Then the canonical alert "status" = "resolved"

  Scenario: Grafana "no_data" state maps to severity "warning"
    Given a Grafana webhook with "state" = "no_data"
    When the Grafana parser processes the payload
    Then the canonical alert "severity" = "warning"
    And the canonical alert "status" = "firing"

  Scenario: Grafana panel URL preserved in canonical alert metadata
    Given a Grafana webhook with "ruleUrl" = "https://grafana.example.com/d/abc/panel"
    When the Grafana parser processes the payload
    Then the canonical alert "metadata.dashboard_url" = "https://grafana.example.com/d/abc/panel"

  Scenario: Grafana multi-alert payload (multiple evalMatches) creates one alert per match
    Given a Grafana webhook with 3 "evalMatches" entries
    When the Grafana parser processes the payload
    Then 3 canonical alerts are produced
    And each has a unique fingerprint based on the metric name and tags
```

### Feature: Canonical Alert Schema Validation

```gherkin
Feature: Validate canonical alert schema completeness

  Scenario: Canonical alert with all required fields passes validation
    Given a canonical alert with fields: source, severity, status, service, fingerprint, received_at, tenant_id
    When schema validation runs
    Then the alert passes validation

  Scenario: Canonical alert missing "tenant_id" fails validation
    Given a canonical alert with no "tenant_id" field
    When schema validation runs
    Then validation fails with error "missing_required_field:tenant_id"
    And the alert is rejected before SQS enqueue

  Scenario: Canonical alert with unknown severity value fails validation
    Given a canonical alert with "severity" = "super-critical"
    When schema validation runs
    Then validation fails with error "invalid_enum_value:severity"

  Scenario: Canonical alert schema is additive — unknown extra fields are preserved
    Given a canonical alert with an extra field "custom_label" = "team-alpha"
    When schema validation runs
    Then the alert passes validation
    And "custom_label" is preserved in the alert document
```
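
These rules — required fields, a severity enum, and an additive policy that leaves unknown fields untouched — can be sketched as a validator that reports errors rather than mutating the alert. The exact severity enum beyond "critical"/"warning"/"info" is an assumption.

```python
REQUIRED = ("source", "severity", "status", "service",
            "fingerprint", "received_at", "tenant_id")
SEVERITIES = {"critical", "warning", "info"}


def validate(alert: dict) -> list:
    """Return a list of validation errors; an empty list means the alert passes.

    The schema is additive: extra fields are never rejected or stripped.
    """
    errors = [f"missing_required_field:{f}" for f in REQUIRED if f not in alert]
    if "severity" in alert and alert["severity"] not in SEVERITIES:
        errors.append("invalid_enum_value:severity")
    return errors
```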

---

## Epic 3: Correlation Engine

### Feature: Time-Window Grouping

```gherkin
Feature: Group alerts into incidents using time-window correlation

  Background:
    Given the correlation engine is running on ECS Fargate
    And the default correlation time window is 5 minutes

  Scenario: Two alerts for the same service within the time window are grouped
    Given a canonical alert for service "payments" arrives at T=0
    And a second canonical alert for service "payments" arrives at T=3min
    When the correlation engine processes both alerts
    Then they are grouped into a single incident
    And the incident "alert_count" = 2

  Scenario: Two alerts for the same service outside the time window are NOT grouped
    Given a canonical alert for service "payments" arrives at T=0
    And a second canonical alert for service "payments" arrives at T=6min
    When the correlation engine processes both alerts
    Then they are placed in separate incidents

  Scenario: Time window is configurable per tenant
    Given tenant "enterprise-co" has a custom correlation window of 10 minutes
    And two alerts for the same service arrive 8 minutes apart
    When the correlation engine processes both alerts
    Then they are grouped into a single incident for tenant "enterprise-co"

  Scenario: Alerts from different services within the time window are NOT grouped by default
    Given a canonical alert for service "payments" at T=0
    And a canonical alert for service "auth" at T=1min
    When the correlation engine processes both alerts
    Then they are placed in separate incidents
```
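
The grouping rule above can be sketched without any infrastructure: an alert joins the open incident for its tenant+service pair if it arrived within the window of that incident's last alert, and otherwise opens a new incident. The tuple-based alert shape and field names here are assumptions for illustration.

```python
WINDOW_SECONDS = 5 * 60  # default correlation window


def group_alerts(alerts, window=WINDOW_SECONDS):
    """Group (tenant_id, service, t_seconds) tuples into incident dicts."""
    open_incidents = {}  # (tenant, service) -> most recent incident
    incidents = []
    for tenant, service, t in sorted(alerts, key=lambda a: a[2]):
        key = (tenant, service)
        inc = open_incidents.get(key)
        if inc is not None and t - inc["last_seen"] <= window:
            inc["alert_count"] += 1      # within window: join the open incident
            inc["last_seen"] = t
        else:
            inc = {"tenant": tenant, "service": service,
                   "alert_count": 1, "last_seen": t}
            incidents.append(inc)        # outside window / new pair: new incident
            open_incidents[key] = inc
    return incidents
```

Passing a tenant-specific `window` covers the per-tenant configuration scenario.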

### Feature: Service-Affinity Matching

```gherkin
Feature: Group alerts across related services using service-affinity rules

  Background:
    Given a service-affinity rule: ["payments", "checkout", "cart"] are related

  Scenario: Alerts from affinity-grouped services are correlated into one incident
    Given a canonical alert for service "payments" at T=0
    And a canonical alert for service "checkout" at T=2min
    When the correlation engine applies service-affinity matching
    Then both alerts are grouped into a single incident
    And the incident "root_service" is set to the first-seen service "payments"

  Scenario: Alert from a service not in the affinity group is not merged
    Given a canonical alert for service "payments" at T=0
    And a canonical alert for service "logging" at T=1min
    When the correlation engine applies service-affinity matching
    Then they remain in separate incidents

  Scenario: Service-affinity rules are tenant-scoped
    Given tenant "acme" has affinity rule ["api", "gateway"]
    And tenant "globex" has no affinity rules
    And both tenants receive alerts for "api" and "gateway" simultaneously
    When the correlation engine processes both tenants' alerts
    Then "acme"'s alerts are grouped into one incident
    And "globex"'s alerts remain in separate incidents
```

### Feature: Fingerprint Deduplication

```gherkin
Feature: Deduplicate alerts with identical fingerprints

  Scenario: Duplicate alert with same fingerprint within dedup window is suppressed
    Given a canonical alert with fingerprint "fp-abc123" is received at T=0
    And an identical alert with fingerprint "fp-abc123" arrives at T=30sec
    When the correlation engine checks the Redis dedup window
    Then the second alert is suppressed (not added as a new alert)
    And the incident "duplicate_count" is incremented by 1

  Scenario: Same fingerprint outside dedup window creates a new alert
    Given a canonical alert with fingerprint "fp-abc123" was processed at T=0
    And the dedup window is 10 minutes
    And the same fingerprint arrives at T=11min
    When the correlation engine checks the Redis dedup window
    Then the alert is treated as a new occurrence
    And a new incident entry is created

  Scenario: Different fingerprints are never deduplicated
    Given two alerts with different fingerprints "fp-abc123" and "fp-xyz789"
    When the correlation engine processes both
    Then both are treated as distinct alerts

  Scenario: Dedup counter is visible in incident metadata
    Given fingerprint "fp-abc123" has been suppressed 5 times
    When the incident is retrieved via the Dashboard API
    Then the incident "dedup_count" = 5
```

### Feature: Redis Sliding Window

```gherkin
Feature: Redis sliding windows for correlation state management

  Background:
    Given Redis is available and the sliding window TTL is 10 minutes

  Scenario: Alert fingerprint is stored in Redis on first occurrence
    Given a canonical alert with fingerprint "fp-new001" arrives
    When the correlation engine processes the alert
    Then a Redis key "dedup:{tenant_id}:fp-new001" is set with TTL 10 minutes

  Scenario: Redis key TTL is refreshed on each matching alert
    Given a Redis key "dedup:{tenant_id}:fp-new001" exists with 2 minutes remaining
    And a new alert with fingerprint "fp-new001" arrives
    When the correlation engine processes the alert
    Then the Redis key TTL is reset to 10 minutes

  Scenario: Redis unavailability causes correlation engine to fail open
    Given Redis is unreachable
    When a canonical alert arrives for processing
    Then the alert is processed without deduplication
    And a metric "redis_unavailable_failopen" is emitted
    And the alert is NOT dropped

  Scenario: Redis sliding window is tenant-isolated
    Given tenant "alpha" has fingerprint "fp-shared" in Redis
    And tenant "beta" sends an alert with fingerprint "fp-shared"
    When the correlation engine checks the dedup window
    Then tenant "beta"'s alert is NOT suppressed
    And tenant "alpha"'s dedup state is unaffected
```
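
The sliding-window semantics above — a per-tenant key that is set with a TTL on first sight and refreshed on every hit — can be sketched with an in-memory stand-in for Redis. A real deployment would use `SET key EX ttl`; the class and method names here are assumptions for illustration.

```python
class DedupWindow:
    """In-memory stand-in for the Redis sliding dedup window."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.expiry = {}  # "dedup:{tenant}:{fingerprint}" -> expiry timestamp

    def seen(self, tenant_id, fingerprint, now):
        """Return True if the fingerprint is inside this tenant's window.

        Either way, the key's TTL is (re)set, which is what makes the
        window slide: every matching alert pushes expiry out by `ttl`.
        """
        key = f"dedup:{tenant_id}:{fingerprint}"
        hit = self.expiry.get(key, float("-inf")) > now
        self.expiry[key] = now + self.ttl
        return hit
```

Tenant isolation falls out of the key prefix: the same fingerprint under two tenants touches two different keys.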

### Feature: Cross-Tenant Isolation in Correlation

```gherkin
Feature: Prevent cross-tenant alert bleed in correlation engine

  Scenario: Alerts from different tenants with same fingerprint are never correlated
    Given tenant "alpha" sends alert with fingerprint "fp-shared" at T=0
    And tenant "beta" sends alert with fingerprint "fp-shared" at T=1min
    When the correlation engine processes both alerts
    Then each alert is placed in its own tenant-scoped incident
    And no incident contains alerts from both tenants

  Scenario: Tenant ID is validated before correlation lookup
    Given a canonical alert arrives with "tenant_id" = ""
    When the correlation engine attempts to process the alert
    Then the alert is rejected with error "missing_tenant_id"
    And the alert is sent to the correlation DLQ

  Scenario: Correlation engine worker processes only one tenant's partition at a time
    Given SQS messages are partitioned by tenant_id
    When the ECS Fargate worker picks up a batch of messages
    Then all messages in the batch belong to the same tenant
    And no cross-tenant data is loaded into the worker's memory context
```

### Feature: OTEL Trace Propagation Across SQS Boundary

```gherkin
Feature: Propagate OpenTelemetry trace context across SQS ingestion-to-correlation boundary

  Scenario: Trace context is injected into SQS message attributes at ingestion
    Given a webhook request arrives with OTEL trace header "traceparent: 00-abc123-def456-01"
    When the ingestion Lambda enqueues the message to SQS
    Then the SQS message attributes include "traceparent" = "00-abc123-def456-01"
    And the SQS message attributes include "tracestate" if present in the original request

  Scenario: Correlation engine extracts and continues trace from SQS message
    Given an SQS message with "traceparent" attribute "00-abc123-def456-01"
    When the correlation engine consumer reads the message
    Then a child span is created with parent trace ID "abc123"
    And all subsequent operations (Redis lookup, DynamoDB write) are children of this span

  Scenario: Missing trace context in SQS message starts a new trace
    Given an SQS message with no "traceparent" attribute
    When the correlation engine consumer reads the message
    Then a new root trace is started
    And a metric "trace_context_missing" is emitted

  Scenario: Trace ID is stored on the incident record
    Given a correlated incident is created from an alert with trace ID "abc123"
    When the incident is written to DynamoDB
    Then the incident document includes "trace_id" = "abc123"
```
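
The continue-vs-new-root decision above can be sketched with a small `traceparent` parser. A real service would use the OpenTelemetry SDK's W3C propagator instead; this hand-rolled version is an assumption-laden sketch that also accepts the spec's shortened example IDs without validating the 32/16 hex-digit lengths a strict parser would enforce.

```typescript
// Sketch of extracting W3C trace context from SQS message attributes.
interface TraceContext {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

function extractTraceContext(attrs: Record<string, string>): TraceContext | null {
  const tp = attrs["traceparent"];
  // null tells the caller to start a new root trace and emit "trace_context_missing".
  if (!tp) return null;
  const parts = tp.split("-");
  if (parts.length !== 4 || parts[0] !== "00") return null; // unknown version: treat as absent
  return { traceId: parts[1], parentSpanId: parts[2], sampled: parts[3] === "01" };
}
```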

---

## Epic 4: Notification & Escalation

### Feature: Slack Block Kit Incident Notifications

```gherkin
Feature: Send Slack Block Kit notifications for new incidents

  Background:
    Given a Slack webhook URL is configured for tenant "acme"
    And the Slack notification Lambda is subscribed to the incidents SNS topic

  Scenario: New critical incident triggers Slack notification
    Given a new incident is created with severity "critical" for service "payments"
    When the notification Lambda processes the incident event
    Then a Slack Block Kit message is posted to the configured channel
    And the message includes the incident ID, service name, severity, and timestamp
    And the message includes action buttons: "Acknowledge", "Escalate", "Mark as Noise"

  Scenario: New warning incident triggers Slack notification
    Given a new incident is created with severity "warning"
    When the notification Lambda processes the incident event
    Then a Slack message is posted with severity badge "⚠️ WARNING"

  Scenario: Resolved incident posts a resolution message to Slack
    Given an existing incident "INC-001" transitions to status "resolved"
    When the notification Lambda processes the resolution event
    Then a Slack message is posted indicating "INC-001 resolved"
    And the message includes time-to-resolution duration

  Scenario: Slack Block Kit message includes alert count for correlated incidents
    Given an incident contains 7 correlated alerts
    When the Slack notification is sent
    Then the message body includes "7 correlated alerts"

  Scenario: Slack notification includes dashboard deep-link
    Given a new incident "INC-042" is created
    When the Slack notification is sent
    Then the message includes a button "View in Dashboard" linking to "/incidents/INC-042"
```
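
A minimal sketch of the notification payload described above, using Slack's Block Kit `section` and `actions` block types. The exact copy, the `action_id` naming scheme, and the field layout are assumptions for illustration, not the product's actual message format.

```typescript
// Sketch: build the Block Kit blocks for a new-incident notification.
function incidentBlocks(incidentId: string, service: string, severity: string, createdAt: string) {
  return [
    {
      type: "section",
      text: {
        type: "mrkdwn",
        text: `*${incidentId}* [${severity.toUpperCase()}] service \`${service}\` at ${createdAt}`,
      },
    },
    {
      type: "actions",
      elements: ["Acknowledge", "Escalate", "Mark as Noise"].map((label) => ({
        type: "button",
        text: { type: "plain_text", text: label },
        action_id: label.toLowerCase().replace(/ /g, "_"), // e.g. "mark_as_noise"
        value: incidentId,
      })),
    },
  ];
}
```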

### Feature: Severity-Based Routing

```gherkin
Feature: Route notifications to different Slack channels based on severity

  Background:
    Given tenant "acme" has configured:
      | severity | channel          |
      | critical | #incidents-p1    |
      | warning  | #incidents-p2    |
      | info     | #monitoring-feed |

  Scenario: Critical incident is routed to P1 channel
    Given a new incident with severity "critical"
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#incidents-p1"

  Scenario: Warning incident is routed to P2 channel
    Given a new incident with severity "warning"
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#incidents-p2"

  Scenario: Info incident is routed to monitoring feed
    Given a new incident with severity "info"
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#monitoring-feed"

  Scenario: No routing rule configured — falls back to default channel
    Given tenant "beta" has only a default channel "#alerts" configured
    And a new incident with severity "critical" arrives
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#alerts"
```
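
The routing table plus fallback above reduces to a single lookup. The config shape below is an assumption; the values mirror the Background table.

```typescript
// Sketch: severity-to-channel routing with a default-channel fallback.
interface RoutingConfig {
  bySeverity?: Record<string, string>; // optional per-severity overrides
  defaultChannel: string;              // used when no rule matches
}

function routeChannel(cfg: RoutingConfig, severity: string): string {
  return cfg.bySeverity?.[severity] ?? cfg.defaultChannel;
}
```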

### Feature: Escalation to PagerDuty if Unacknowledged

```gherkin
Feature: Escalate unacknowledged critical incidents to PagerDuty

  Background:
    Given the escalation check runs every minute via EventBridge
    And the escalation threshold for "critical" incidents is 15 minutes

  Scenario: Unacknowledged critical incident is escalated after threshold
    Given a critical incident "INC-001" was created 16 minutes ago
    And "INC-001" has not been acknowledged
    When the escalation Lambda runs
    Then a PagerDuty incident is created via the PagerDuty Events API v2
    And the incident "INC-001" status is updated to "escalated"
    And a Slack message is posted: "INC-001 escalated to PagerDuty"

  Scenario: Acknowledged incident is NOT escalated
    Given a critical incident "INC-002" was created 20 minutes ago
    And "INC-002" was acknowledged 5 minutes ago
    When the escalation Lambda runs
    Then no PagerDuty incident is created for "INC-002"

  Scenario: Warning incident has a longer escalation threshold
    Given the escalation threshold for "warning" incidents is 60 minutes
    And a warning incident "INC-003" was created 45 minutes ago and is unacknowledged
    When the escalation Lambda runs
    Then no PagerDuty incident is created for "INC-003"

  Scenario: Escalation is idempotent — already-escalated incident is not re-escalated
    Given incident "INC-004" was already escalated to PagerDuty
    When the escalation Lambda runs again
    Then no duplicate PagerDuty incident is created
    And the escalation Lambda logs "already_escalated:INC-004"

  Scenario: PagerDuty API failure during escalation is retried
    Given incident "INC-005" is due for escalation
    And the PagerDuty Events API returns a 500 error
    When the escalation Lambda attempts to create the PagerDuty incident
    Then the Lambda retries up to 3 times with exponential backoff
    And if all retries fail, an error metric "pagerduty_escalation_failure" is emitted
    And the incident is flagged for manual review
```
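
The per-incident decision the scheduled Lambda makes can be sketched as a pure predicate. The thresholds come from the Background steps; the incident shape and the `escalated` idempotence flag are assumptions.

```typescript
// Sketch: decide whether a single incident is due for PagerDuty escalation.
interface EscalationCandidate {
  id: string;
  severity: "critical" | "warning";
  createdAtMs: number;
  acknowledgedAtMs?: number; // set once someone acknowledges
  escalated?: boolean;       // set after a successful escalation
}

const ESCALATION_THRESHOLD_MIN: Record<"critical" | "warning", number> = {
  critical: 15,
  warning: 60,
};

function shouldEscalate(inc: EscalationCandidate, nowMs: number): boolean {
  if (inc.escalated) return false;                      // idempotence: never re-escalate
  if (inc.acknowledgedAtMs !== undefined) return false; // an ack stops the clock
  const ageMinutes = (nowMs - inc.createdAtMs) / 60_000;
  return ageMinutes > ESCALATION_THRESHOLD_MIN[inc.severity];
}
```

Keeping the predicate pure makes the four scenarios above directly testable without mocking PagerDuty.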

### Feature: Daily Noise Report

```gherkin
Feature: Generate and send daily noise reduction report

  Background:
    Given the daily report Lambda runs at 08:00 UTC via EventBridge

  Scenario: Daily noise report is sent to configured Slack channel
    Given tenant "acme" has 500 raw alerts and 80 incidents in the past 24 hours
    When the daily report Lambda runs
    Then a Slack message is posted to "#dd0c-digest"
    And the message includes:
      | metric                  | value |
      | total_alerts            | 500   |
      | correlated_incidents    | 80    |
      | noise_reduction_percent | 84%   |
      | top_noisy_service       | shown |

  Scenario: Daily report includes MTTR for resolved incidents
    Given 20 incidents were resolved in the past 24 hours with an average MTTR of 23 minutes
    When the daily report Lambda runs
    Then the Slack message includes "Avg MTTR: 23 min"

  Scenario: Daily report is skipped if no alerts in past 24 hours
    Given tenant "quiet-co" had 0 alerts in the past 24 hours
    When the daily report Lambda runs
    Then no Slack message is sent for "quiet-co"

  Scenario: Daily report is tenant-scoped — no cross-tenant data leakage
    Given tenants "alpha" and "beta" both have activity
    When the daily report Lambda runs
    Then "alpha"'s report contains only "alpha"'s metrics
    And "beta"'s report contains only "beta"'s metrics
```
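
The noise-reduction figure in the report is simple arithmetic: 500 raw alerts collapsed into 80 incidents gives 1 - 80/500 = 0.84, reported as 84%. A sketch, with `null` for the zero-alert case to match the "report is skipped" scenario (the rounding mode is an assumption):

```typescript
// Sketch: noise reduction as the fraction of raw alerts absorbed into incidents.
function noiseReductionPercent(rawAlerts: number, incidents: number): number | null {
  if (rawAlerts === 0) return null; // quiet tenant: no report is sent
  return Math.round((1 - incidents / rawAlerts) * 100);
}
```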

### Feature: Slack Rate Limiting

```gherkin
Feature: Handle Slack API rate limiting gracefully

  Scenario: Slack returns 429 Too Many Requests — notification is retried
    Given a Slack notification needs to be sent
    And Slack returns HTTP 429 with "Retry-After: 5"
    When the notification Lambda handles the response
    Then the Lambda waits 5 seconds before retrying
    And the notification is eventually delivered

  Scenario: Slack rate limit persists beyond Lambda timeout — message queued for retry
    Given Slack is rate-limiting for 30 seconds
    And the Lambda timeout is 15 seconds
    When the notification Lambda cannot deliver within its timeout
    Then the SQS message is not deleted (it becomes visible again after the visibility timeout expires)
    And the message is retried by the next Lambda invocation

  Scenario: Burst of 50 incidents triggers Slack rate limit protection
    Given 50 incidents are created within 1 second
    When the notification Lambda processes the burst
    Then notifications are batched and sent with 1-second delays between batches
    And all 50 notifications are eventually delivered
    And a metric "slack_rate_limit_batching" is emitted
```
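
The burst-protection scenario above can be sketched as a pure batching planner: chunk the queue and assign each chunk a send offset. The batch size is an assumed tuning knob; the 1-second spacing follows the scenario.

```typescript
// Sketch: split a notification burst into delayed batches.
function planBatches<T>(items: T[], batchSize: number, delayMs: number): { sendAtMs: number; items: T[] }[] {
  const plan: { sendAtMs: number; items: T[] }[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    // Each batch is offset by one delay step from the previous one.
    plan.push({ sendAtMs: (i / batchSize) * delayMs, items: items.slice(i, i + batchSize) });
  }
  return plan;
}
```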

---

## Epic 5: Slack Bot

### Feature: Interactive Feedback Buttons

```gherkin
Feature: Slack interactive feedback buttons on incident notifications

  Background:
    Given an incident notification was posted to Slack with buttons: "Helpful", "Noise", "Escalate"
    And the Slack interactivity endpoint is "POST /slack/interactions"

  Scenario: User clicks "Helpful" on an incident notification
    Given user "@alice" clicks the "Helpful" button on incident "INC-007"
    When the Slack interaction payload is received
    Then the incident "INC-007" feedback is recorded as "helpful"
    And the Slack message is updated to show "✅ Marked helpful by @alice"
    And the button is disabled to prevent duplicate feedback

  Scenario: User clicks "Noise" on an incident notification
    Given user "@bob" clicks the "Noise" button on incident "INC-008"
    When the Slack interaction payload is received
    Then the incident "INC-008" feedback is recorded as "noise"
    And the incident "noise_score" is incremented
    And the Slack message is updated to show "🔇 Marked as noise by @bob"

  Scenario: User clicks "Escalate" on an incident notification
    Given user "@carol" clicks the "Escalate" button on incident "INC-009"
    When the Slack interaction payload is received
    Then the incident "INC-009" is immediately escalated to PagerDuty
    And the Slack message is updated to show "🚨 Escalated by @carol"
    And the escalation bypasses the normal time threshold

  Scenario: Feedback on an already-resolved incident is rejected
    Given incident "INC-010" has status "resolved"
    And user "@dave" clicks "Helpful" on the stale Slack message
    When the Slack interaction payload is received
    Then the Slack message is updated to show "⚠️ Incident already resolved"
    And no feedback is recorded

  Scenario: Slack interaction payload signature is validated
    Given a Slack interaction request with an invalid "X-Slack-Signature" header
    When the interaction endpoint receives the request
    Then the response status is 401
    And the interaction is not processed

  Scenario: Duplicate button click by same user is idempotent
    Given user "@alice" already marked incident "INC-007" as "helpful"
    And "@alice" clicks "Helpful" again on the same message
    When the Slack interaction payload is received
    Then the feedback count is NOT incremented again
    And the response acknowledges the duplicate gracefully
```
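
The signature-validation scenario follows Slack's signed-secrets scheme: an HMAC-SHA256 over `v0:{timestamp}:{body}` compared against the `X-Slack-Signature` header, with a freshness window for replay protection. A sketch using only Node's crypto module (the 5-minute window follows Slack's guidance; the function name is ours):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch: verify a Slack interaction request signature (v0 scheme).
function verifySlackSignature(
  signingSecret: string,
  timestamp: string, // X-Slack-Request-Timestamp header
  body: string,      // raw request body, before any JSON parsing
  signature: string, // X-Slack-Signature header, "v0=<hex>"
  nowSec: number = Math.floor(Date.now() / 1000),
): boolean {
  // Reject stale timestamps to defeat replay attacks.
  if (Math.abs(nowSec - Number(timestamp)) > 60 * 5) return false;
  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(`v0:${timestamp}:${body}`).digest("hex");
  if (expected.length !== signature.length) return false; // timingSafeEqual requires equal lengths
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```

Note the check runs on the raw body: re-serializing parsed JSON would change byte order and break the HMAC.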

### Feature: Slash Command — /dd0c status

```gherkin
Feature: /dd0c status slash command

  Background:
    Given the Slack slash command "/dd0c" is registered
    And the command handler endpoint is "POST /slack/commands"

  Scenario: /dd0c status returns current open incident count
    Given tenant "acme" has 3 open critical incidents and 5 open warning incidents
    When user "@alice" runs "/dd0c status" in the Slack workspace
    Then the bot responds ephemerally with:
      | metric           | value |
      | open_critical    | 3     |
      | open_warning     | 5     |
      | alerts_last_hour | shown |
      | system_status    | OK    |

  Scenario: /dd0c status when no open incidents
    Given tenant "acme" has 0 open incidents
    When user "@alice" runs "/dd0c status"
    Then the bot responds with "✅ All clear — no open incidents"

  Scenario: /dd0c status responds within Slack's 3-second timeout
    Given the command handler receives "/dd0c status"
    When the handler processes the request
    Then an HTTP 200 response is returned within 3 seconds
    And if data retrieval takes longer, an immediate acknowledgment is sent
    And the full response is delivered via response_url

  Scenario: /dd0c status is scoped to the requesting tenant
    Given user "@alice" belongs to tenant "acme"
    When "@alice" runs "/dd0c status"
    Then the response contains only "acme"'s incident data
    And no data from other tenants is included
```

### Feature: Slash Command — /dd0c anomalies

```gherkin
Feature: /dd0c anomalies slash command

  Scenario: /dd0c anomalies returns top noisy services in the last 24 hours
    Given service "payments" fired 120 alerts in the last 24 hours
    And service "auth" fired 80 alerts
    And service "logging" fired 10 alerts
    When user "@alice" runs "/dd0c anomalies"
    Then the bot responds with a ranked list:
      | rank | service  | alert_count |
      | 1    | payments | 120         |
      | 2    | auth     | 80          |
      | 3    | logging  | 10          |

  Scenario: /dd0c anomalies with time range argument
    Given user "@alice" runs "/dd0c anomalies --last 7d"
    When the command handler processes the request
    Then the response covers the last 7 days of anomaly data

  Scenario: /dd0c anomalies with no data returns helpful message
    Given no alerts have been received in the last 24 hours
    When user "@alice" runs "/dd0c anomalies"
    Then the bot responds with "No anomalies detected in the last 24 hours"
```

### Feature: Slash Command — /dd0c digest

```gherkin
Feature: /dd0c digest slash command

  Scenario: /dd0c digest returns on-demand summary report
    Given tenant "acme" has activity in the last 24 hours
    When user "@alice" runs "/dd0c digest"
    Then the bot responds with a summary matching the daily noise report format
    And the response includes total alerts, incidents, noise reduction %, and avg MTTR

  Scenario: /dd0c digest with custom time range
    Given user "@alice" runs "/dd0c digest --last 7d"
    When the command handler processes the request
    Then the digest covers the last 7 days

  Scenario: Unauthorized user cannot run /dd0c commands
    Given user "@mallory" is not a member of any configured tenant workspace
    When "@mallory" runs "/dd0c status"
    Then the bot responds ephemerally with "⛔ You are not authorized to use this command"
    And no tenant data is returned
```

---

## Epic 6: Dashboard API

### Feature: Cognito JWT Authentication

```gherkin
Feature: Authenticate Dashboard API requests with Cognito JWT

  Background:
    Given the Dashboard API requires a valid Cognito JWT in the "Authorization: Bearer <token>" header

  Scenario: Valid JWT grants access to the API
    Given a user has a valid Cognito JWT for tenant "acme"
    When the user calls "GET /api/incidents"
    Then the response status is 200
    And only "acme"'s incidents are returned

  Scenario: Missing Authorization header returns 401
    Given a request to "GET /api/incidents" with no Authorization header
    When the API Gateway processes the request
    Then the response status is 401
    And the body contains "error": "missing_token"

  Scenario: Expired JWT returns 401
    Given a user presents a JWT that expired 10 minutes ago
    When the user calls "GET /api/incidents"
    Then the response status is 401
    And the body contains "error": "token_expired"

  Scenario: JWT signed with wrong key returns 401
    Given a user presents a JWT signed with a non-Cognito key
    When the user calls "GET /api/incidents"
    Then the response status is 401
    And the body contains "error": "invalid_token_signature"

  Scenario: JWT from a different tenant cannot access another tenant's data
    Given user "@alice" has a valid JWT for tenant "acme"
    When "@alice" calls "GET /api/incidents?tenant_id=globex"
    Then the response status is 403
    And the body contains "error": "tenant_access_denied"
```
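
The claim checks behind the 401/403 errors can be sketched as below. This is deliberately partial: cryptographic verification of the signature against the Cognito JWKS is assumed to have already happened (e.g. in an API Gateway JWT authorizer), and the `custom:tenant_id` claim name is an assumption; only the expiry and tenant checks are shown.

```typescript
// Sketch: map JWT claims to the error codes used in the scenarios above.
// Assumes the token's signature was already verified upstream.
function checkClaims(token: string, requestedTenant: string, nowSec: number): string {
  const parts = token.split(".");
  if (parts.length !== 3) return "invalid_token_signature"; // not a well-formed JWT
  const payload = JSON.parse(Buffer.from(parts[1], "base64url").toString("utf8"));
  if (typeof payload.exp === "number" && payload.exp < nowSec) return "token_expired";
  if (payload["custom:tenant_id"] !== requestedTenant) return "tenant_access_denied";
  return "ok";
}
```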

### Feature: Incident Listing with Filters

```gherkin
Feature: List incidents with filtering and pagination

  Background:
    Given the user is authenticated for tenant "acme"

  Scenario: List all open incidents
    Given tenant "acme" has 15 open incidents
    When the user calls "GET /api/incidents?status=open"
    Then the response status is 200
    And the response contains 15 incidents
    And each incident includes: id, severity, service, status, created_at, alert_count

  Scenario: Filter incidents by severity
    Given tenant "acme" has 5 critical and 10 warning incidents
    When the user calls "GET /api/incidents?severity=critical"
    Then the response contains exactly 5 incidents
    And all returned incidents have severity "critical"

  Scenario: Filter incidents by service
    Given tenant "acme" has incidents for services "payments", "auth", and "checkout"
    When the user calls "GET /api/incidents?service=payments"
    Then only incidents for service "payments" are returned

  Scenario: Filter incidents by date range
    Given incidents exist from the past 30 days
    When the user calls "GET /api/incidents?from=2026-02-01&to=2026-02-07"
    Then only incidents created between Feb 1 and Feb 7 are returned

  Scenario: Pagination returns correct page of results
    Given tenant "acme" has 100 incidents
    When the user calls "GET /api/incidents?page=2&limit=20"
    Then the response contains incidents 21–40
    And the response includes "total": 100, "page": 2, "limit": 20

  Scenario: Empty result set returns 200 with empty array
    Given tenant "acme" has no incidents matching the filter
    When the user calls "GET /api/incidents?service=nonexistent"
    Then the response status is 200
    And the response body is '{"incidents": [], "total": 0}'

  Scenario: Incident detail endpoint returns full alert timeline
    Given incident "INC-042" has 7 correlated alerts
    When the user calls "GET /api/incidents/INC-042"
    Then the response includes the incident details
    And the "alerts" array contains 7 entries with timestamps and sources
```
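
The pagination arithmetic is worth pinning down: with 1-indexed pages, page 2 at limit 20 starts at offset (2 - 1) * 20 = 20 and returns items 21 through 40. A sketch of the response envelope (field names mirror the scenario; the function is illustrative):

```typescript
// Sketch: page/limit windowing over an already-filtered incident list.
function paginate<T>(items: T[], page: number, limit: number) {
  const start = (page - 1) * limit; // 1-indexed pages
  return { incidents: items.slice(start, start + limit), total: items.length, page, limit };
}
```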

### Feature: Analytics Endpoints — MTTR

```gherkin
Feature: MTTR analytics endpoint

  Background:
    Given the user is authenticated for tenant "acme"

  Scenario: MTTR endpoint returns average time-to-resolution
    Given 10 incidents were resolved in the last 7 days with MTTRs ranging from 5 to 60 minutes
    When the user calls "GET /api/analytics/mttr?period=7d"
    Then the response includes "avg_mttr_minutes" as a number
    And "incident_count" = 10

  Scenario: MTTR broken down by service
    When the user calls "GET /api/analytics/mttr?period=7d&group_by=service"
    Then the response includes a per-service MTTR breakdown

  Scenario: MTTR with no resolved incidents returns null
    Given no incidents were resolved in the requested period
    When the user calls "GET /api/analytics/mttr?period=1d"
    Then the response includes "avg_mttr_minutes": null
    And "incident_count": 0
```
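
The null-vs-number contract above is easy to get wrong (averaging an empty set must not produce 0 or NaN). A sketch of the aggregation, with response field names mirroring the scenarios and rounding as an assumption:

```typescript
// Sketch: average MTTR over resolved incidents, null when there were none.
function avgMttrMinutes(mttrs: number[]): { avg_mttr_minutes: number | null; incident_count: number } {
  if (mttrs.length === 0) return { avg_mttr_minutes: null, incident_count: 0 };
  const avg = mttrs.reduce((sum, m) => sum + m, 0) / mttrs.length;
  return { avg_mttr_minutes: Math.round(avg), incident_count: mttrs.length };
}
```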

### Feature: Analytics Endpoints — Noise Reduction

```gherkin
Feature: Noise reduction analytics endpoint

  Scenario: Noise reduction percentage is calculated correctly
    Given tenant "acme" received 1000 raw alerts and 150 incidents in the last 7 days
    When the user calls "GET /api/analytics/noise-reduction?period=7d"
    Then the response includes "noise_reduction_percent": 85
    And "raw_alerts": 1000
    And "incidents": 150

  Scenario: Noise reduction trend over time
    When the user calls "GET /api/analytics/noise-reduction?period=30d&granularity=daily"
    Then the response includes a daily time series of noise reduction percentages

  Scenario: Noise reduction by source
    When the user calls "GET /api/analytics/noise-reduction?period=7d&group_by=source"
    Then the response includes a per-source breakdown (datadog, pagerduty, opsgenie, grafana)
```

### Feature: Tenant Isolation in Dashboard API

```gherkin
Feature: Enforce strict tenant isolation across all API endpoints

  Scenario: DynamoDB queries always include tenant_id partition key
    Given user "@alice" for tenant "acme" calls any incident endpoint
    When the API handler queries DynamoDB
    Then the query always includes "tenant_id = acme" as a condition
    And no full-table scans are performed

  Scenario: TimescaleDB analytics queries are scoped by tenant_id
    Given user "@alice" for tenant "acme" calls any analytics endpoint
    When the API handler queries TimescaleDB
    Then the SQL query includes "WHERE tenant_id = 'acme'"

  Scenario: API does not expose tenant_id enumeration
    Given user "@alice" calls "GET /api/incidents/INC-999" where INC-999 belongs to tenant "globex"
    When the API processes the request
    Then the response status is 404 (not 403, to avoid tenant enumeration)
```

---

## Epic 7: Dashboard UI

### Feature: Incident List View

```gherkin
Feature: Incident list page in the React SPA

  Background:
    Given the user is logged in and the Dashboard SPA is loaded

  Scenario: Incident list displays open incidents on load
    Given tenant "acme" has 12 open incidents
    When the user navigates to "/incidents"
    Then the incident list renders 12 rows
    And each row shows: incident ID, severity badge, service name, alert count, age

  Scenario: Severity badge color coding
    Given the incident list contains critical, warning, and info incidents
    When the list renders
    Then critical incidents show a red badge
    And warning incidents show a yellow badge
    And info incidents show a blue badge

  Scenario: Clicking an incident row navigates to incident detail
    Given the incident list is displayed
    When the user clicks on incident "INC-042"
    Then the browser navigates to "/incidents/INC-042"

  Scenario: Filter by severity updates the list in real time
    Given the incident list is displayed
    When the user selects "Critical" from the severity filter dropdown
    Then only critical incidents are shown
    And the URL updates to "/incidents?severity=critical"

  Scenario: Filter by service updates the list
    Given the incident list is displayed
    When the user types "payments" in the service search box
    Then only incidents for service "payments" are shown

  Scenario: Empty state is shown when no incidents match filters
    Given no incidents match the current filter
    When the list renders
    Then a message "No incidents found" is displayed
    And a "Clear filters" button is shown

  Scenario: Incident list auto-refreshes every 30 seconds
    Given the incident list is displayed
    When 30 seconds elapse
    Then the list silently re-fetches from the API
    And new incidents appear without a full page reload
```

### Feature: Alert Timeline View

```gherkin
Feature: Alert timeline within an incident detail page

  Scenario: Alert timeline shows all correlated alerts in chronological order
    Given incident "INC-042" has 5 correlated alerts from T=0 to T=4min
    When the user navigates to "/incidents/INC-042"
    Then the timeline renders 5 events in ascending time order
    And each event shows: source icon, alert title, severity, timestamp

  Scenario: Timeline highlights the root cause alert
    Given the first alert in the incident is flagged as "root_cause"
    When the timeline renders
    Then the root cause alert is visually distinguished (e.g. bold border)

  Scenario: Timeline shows deduplication count
    Given fingerprint "fp-abc" was suppressed 8 times
    When the timeline renders the corresponding alert
    Then a badge "×8 duplicates suppressed" is shown on that alert entry

  Scenario: Timeline is scrollable for large incidents
    Given an incident has 200 correlated alerts
    When the timeline renders
    Then a virtualized scroll list is used
    And the page does not freeze or crash
```

### Feature: MTTR Chart

```gherkin
Feature: MTTR trend chart on the analytics page

  Scenario: MTTR chart renders a 7-day trend line
    Given the analytics API returns daily MTTR data for the last 7 days
    When the user navigates to "/analytics"
    Then a line chart is rendered with 7 data points
    And the X-axis shows dates and the Y-axis shows minutes

  Scenario: MTTR chart shows "No data" state when no resolved incidents
    Given no incidents were resolved in the selected period
    When the chart renders
    Then a "No resolved incidents in this period" message is shown instead of the chart

  Scenario: MTTR chart period selector changes the data range
    Given the user is on the analytics page
    When the user selects "Last 30 days" from the period dropdown
    Then the chart re-fetches data for the last 30 days
    And the chart updates without a full page reload
```

### Feature: Noise Reduction Percentage Display

```gherkin
Feature: Noise reduction metric display on analytics page

  Scenario: Noise reduction percentage is prominently displayed
    Given the analytics API returns noise_reduction_percent = 84
    When the user views the analytics page
    Then a large "84%" figure is displayed under "Noise Reduction"

  Scenario: Noise reduction trend sparkline is shown
    Given daily noise reduction data is available for 30 days
    When the analytics page renders
    Then a sparkline chart shows the 30-day trend

  Scenario: Noise reduction breakdown by source is shown
    Given the API returns per-source noise reduction data
    When the user clicks the "By Source" tab
    Then a bar chart shows noise reduction % for each source (Datadog, PagerDuty, OpsGenie, Grafana)
```

### Feature: Webhook Setup Wizard

```gherkin
Feature: Webhook setup wizard for onboarding new monitoring sources

  Scenario: Wizard generates a unique webhook URL for Datadog
    Given the user navigates to "/settings/webhooks"
    And clicks "Add Webhook Source"
    When the user selects "Datadog" from the source dropdown
    And clicks "Generate"
    Then a unique webhook URL is displayed: "https://ingest.dd0c.io/webhooks/datadog/{tenant_id}/{token}"
    And the HMAC secret is shown once for copying

  Scenario: Wizard provides copy-paste instructions for each source
    Given the user has generated a Datadog webhook URL
    When the wizard displays the setup instructions
    Then step-by-step instructions for configuring Datadog are shown
    And a "Test Webhook" button is available

  Scenario: Test webhook button sends a test payload and confirms receipt
    Given the user clicks "Test Webhook" for a configured Datadog source
    When the test payload is sent
    Then the wizard shows "✅ Test payload received successfully"
    And the test alert appears in the incident list as a test event

  Scenario: Wizard shows validation error if source already configured
    Given tenant "acme" already has a Datadog webhook configured
    When the user tries to add a second Datadog webhook
    Then the wizard shows "A Datadog webhook is already configured. Regenerate token?"

  Scenario: Regenerating a webhook token invalidates the old token
    Given tenant "acme" has an existing Datadog webhook token
    When the user clicks "Regenerate Token" and confirms
    Then a new token is generated
    And the old token is immediately invalidated
    And any requests using the old token return 401
```

---

## Epic 8: Infrastructure

### Feature: CDK Stack — Lambda Ingestion

```gherkin
Feature: CDK provisions Lambda ingestion infrastructure

  Scenario: Lambda function is created with correct runtime and memory
    Given the CDK stack is synthesized
    When the CloudFormation template is inspected
    Then a Lambda function "dd0c-ingestion" exists with runtime "nodejs20.x"
    And memory is set to 512MB
    And timeout is set to 30 seconds

  Scenario: Lambda has least-privilege IAM role
    Given the CDK stack is synthesized
    When the IAM role for "dd0c-ingestion" is inspected
    Then the role allows "sqs:SendMessage" only to the ingestion SQS queue ARN
    And the role allows "s3:PutObject" only to the "dd0c-raw-webhooks" bucket
    And the role does NOT have "s3:*" or "sqs:*" wildcards

  Scenario: Lambda is behind API Gateway with throttling
    Given the CDK stack is synthesized
    When the API Gateway configuration is inspected
    Then throttling is set to 1000 requests/second burst and 500 requests/second steady-state
    And WAF is attached to the API Gateway stage

  Scenario: Lambda environment variables are sourced from SSM Parameter Store
    Given the CDK stack is synthesized
    When the Lambda environment configuration is inspected
    Then HMAC secrets are referenced from SSM parameters (not hardcoded)
    And no secrets appear in plaintext in the CloudFormation template
```

### Feature: CDK Stack — ECS Fargate Correlation Engine

```gherkin
Feature: CDK provisions ECS Fargate for the correlation engine

  Scenario: ECS service is created with correct task definition
    Given the CDK stack is synthesized
    When the ECS task definition is inspected
    Then the task uses Fargate launch type
    And CPU is set to 1024 (1 vCPU) and memory to 2048MB
    And the container image is pulled from ECR "dd0c-correlation-engine"

  Scenario: ECS service auto-scales based on SQS queue depth
    Given the CDK stack is synthesized
    When the auto-scaling configuration is inspected
    Then a step-scaling policy exists targeting SQS "ApproximateNumberOfMessagesVisible"
    And scale-out triggers when queue depth > 100 messages
    And scale-in triggers when queue depth < 10 messages
    And minimum capacity is 1 and maximum capacity is 10

  Scenario: ECS tasks run in private subnets with no public IP
    Given the CDK stack is synthesized
    When the ECS network configuration is inspected
    Then tasks are placed in private subnets
    And "assignPublicIp" is DISABLED
    And a NAT Gateway provides outbound internet access
```

### Feature: CDK Stack — SQS Queues

```gherkin
Feature: CDK provisions SQS queues with correct configuration

  Scenario: Ingestion SQS queue has a Dead Letter Queue configured
    Given the CDK stack is synthesized
    When the SQS queue "dd0c-ingestion" is inspected
    Then a DLQ "dd0c-ingestion-dlq" is attached
    And maxReceiveCount is 3
    And the DLQ retention period is 14 days

  Scenario: SQS queue has server-side encryption enabled
    Given the CDK stack is synthesized
    When the SQS queue configuration is inspected
    Then SSE is enabled using an AWS-managed KMS key

  Scenario: SQS visibility timeout exceeds Lambda timeout
    Given the Lambda timeout is 30 seconds
    When the SQS queue visibility timeout is inspected
    Then the visibility timeout is at least 6x the Lambda timeout (180 seconds)
```
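
The queue settings asserted above can be captured as plain values; a sketch (not CDK constructs, and the helper name is illustrative) that encodes the 6x visibility-timeout rule alongside the DLQ and encryption settings:

```typescript
interface QueueConfig {
  queueName: string;
  visibilityTimeoutSeconds: number; // must be >= 6x the consuming Lambda's timeout
  deadLetterQueue: { queueName: string; maxReceiveCount: number; retentionDays: number };
  sseEnabled: boolean;
}

// Derives the ingestion queue settings from the Lambda timeout, per the
// scenarios above: 6x visibility timeout, DLQ after 3 receives, 14-day
// DLQ retention, SSE on.
function ingestionQueueConfig(lambdaTimeoutSeconds: number): QueueConfig {
  return {
    queueName: "dd0c-ingestion",
    visibilityTimeoutSeconds: lambdaTimeoutSeconds * 6,
    deadLetterQueue: { queueName: "dd0c-ingestion-dlq", maxReceiveCount: 3, retentionDays: 14 },
    sseEnabled: true,
  };
}
```

With a 30-second Lambda timeout this yields the 180-second visibility timeout the scenario expects.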

### Feature: CDK Stack — DynamoDB

```gherkin
Feature: CDK provisions DynamoDB for incident storage

  Scenario: Incidents table has correct key schema
    Given the CDK stack is synthesized
    When the DynamoDB table "dd0c-incidents" is inspected
    Then the partition key is "tenant_id" (String)
    And the sort key is "incident_id" (String)

  Scenario: Incidents table has a GSI for status queries
    Given the CDK stack is synthesized
    When the GSIs on "dd0c-incidents" are inspected
    Then a GSI "status-created_at-index" exists
    And its partition key is "status" and its sort key is "created_at"

  Scenario: DynamoDB table has point-in-time recovery enabled
    Given the CDK stack is synthesized
    When the DynamoDB table settings are inspected
    Then PITR is enabled on "dd0c-incidents"

  Scenario: DynamoDB TTL is configured for free-tier retention
    Given the CDK stack is synthesized
    When the DynamoDB TTL configuration is inspected
    Then TTL is enabled on attribute "expires_at"
    And free-tier records have "expires_at" set to 7 days from creation
```

### Feature: CI/CD Pipeline

```gherkin
Feature: CI/CD pipeline for automated deployment

  Scenario: Pull request triggers test suite
    Given a developer opens a pull request against "main"
    When the CI pipeline runs
    Then unit tests, integration tests, and CDK synth all pass before merge is allowed

  Scenario: Merge to main triggers staging deployment
    Given a PR is merged to "main"
    When the CD pipeline runs
    Then the CDK stack is deployed to the "staging" environment
    And smoke tests run against staging endpoints

  Scenario: Production deployment requires manual approval
    Given the staging deployment and smoke tests pass
    When the CD pipeline reaches the production stage
    Then a manual approval gate is presented
    And production deployment only proceeds after approval

  Scenario: Failed deployment triggers automatic rollback
    Given a production deployment fails health checks
    When the CD pipeline detects the failure
    Then the previous CloudFormation stack version is restored
    And a Slack alert is sent to "#dd0c-ops" with the rollback reason

  Scenario: CDK diff is posted as a PR comment
    Given a developer opens a PR with infrastructure changes
    When the CI pipeline runs "cdk diff"
    Then the diff output is posted as a comment on the PR
```

---

## Epic 9: Onboarding & PLG

### Feature: OAuth Signup

```gherkin
Feature: User signup via OAuth (Google / GitHub)

  Background:
    Given the signup page is at "/signup"

  Scenario: New user signs up with Google OAuth
    Given a new user visits "/signup"
    When the user clicks "Sign up with Google"
    And completes the Google OAuth flow
    Then a new tenant is created for the user's email domain
    And the user is assigned the "owner" role for the new tenant
    And the user is redirected to the onboarding wizard

  Scenario: New user signs up with GitHub OAuth
    Given a new user visits "/signup"
    When the user clicks "Sign up with GitHub"
    And completes the GitHub OAuth flow
    Then a new tenant is created
    And the user is redirected to the onboarding wizard

  Scenario: Existing user signs in via OAuth
    Given a user with email "alice@acme.com" already has an account
    When the user completes the Google OAuth flow
    Then no new tenant is created
    And the user is redirected to "/incidents"

  Scenario: OAuth failure shows user-friendly error
    Given the Google OAuth provider returns an error
    When the user is redirected back to the app
    Then an error message "Sign-in failed. Please try again." is displayed
    And no partial account is created

  Scenario: Signup is blocked for disposable email domains
    Given a user attempts to sign up with "user@mailinator.com"
    When the OAuth flow completes
    Then the signup is rejected with "Disposable email addresses are not allowed"
    And no tenant is created
```
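
The disposable-domain check in the last scenario reduces to a domain lookup; a minimal sketch, where the denylist contents are an assumption (a real deployment would use a maintained list or service):

```typescript
// Illustrative denylist; only "mailinator.com" is taken from the scenario.
const DISPOSABLE_DOMAINS = new Set(["mailinator.com", "guerrillamail.com", "10minutemail.com"]);

// Returns true when the email's domain is a known disposable provider.
function isDisposableEmail(email: string): boolean {
  const domain = email.split("@").pop()?.toLowerCase() ?? "";
  return DISPOSABLE_DOMAINS.has(domain);
}
```

The check must run after OAuth completes (when the verified email is known) and before any tenant row is written, so a rejection leaves no partial account.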

### Feature: Free Tier — 10K Alerts/Month Limit

```gherkin
Feature: Enforce free tier limit of 10,000 alerts per month

  Background:
    Given tenant "free-co" is on the free tier
    And the monthly alert counter is stored in DynamoDB

  Scenario: Alert ingestion succeeds under the 10K limit
    Given tenant "free-co" has ingested 9,999 alerts this month
    When a new alert arrives
    Then the alert is processed normally
    And the counter is incremented to 10,000

  Scenario: Alert ingestion is blocked at the 10K limit
    Given tenant "free-co" has ingested 10,000 alerts this month
    When a new alert arrives via webhook
    Then the webhook returns HTTP 429
    And the response body includes "Free tier limit reached. Upgrade to continue."
    And the alert is NOT processed or stored

  Scenario: Tenant receives email warning at 80% of limit
    Given tenant "free-co" has ingested 8,000 alerts this month
    When the 8,001st alert is ingested
    Then an email is sent to the tenant owner: "You've used 80% of your free tier quota"

  Scenario: Alert counter resets on the 1st of each month
    Given tenant "free-co" has ingested 10,000 alerts in January
    When February 1st arrives (UTC midnight)
    Then the monthly counter is reset to 0
    And alert ingestion is unblocked

  Scenario: Paid tenant has no alert ingestion limit
    Given tenant "paid-co" is on the "pro" plan
    And has ingested 50,000 alerts this month
    When a new alert arrives
    Then the alert is processed normally
    And no limit check is applied
```
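
The quota decision across these scenarios can be sketched as a pure function (name and return shape are illustrative; limit and threshold come from the scenarios):

```typescript
type QuotaDecision = "allow" | "warn" | "block";

const FREE_TIER_LIMIT = 10_000;

// "warn" means the alert is processed normally AND the 80%-quota email is
// sent to the tenant owner; "block" maps to HTTP 429 with no processing.
function checkQuota(alertsIngestedThisMonth: number, plan: "free" | "pro"): QuotaDecision {
  if (plan !== "free") return "allow"; // paid tenants have no ingestion limit
  if (alertsIngestedThisMonth >= FREE_TIER_LIMIT) return "block";
  if (alertsIngestedThisMonth === FREE_TIER_LIMIT * 0.8) return "warn"; // the 8,001st alert
  return "allow";
}
```

The counter itself should be incremented atomically (e.g. a DynamoDB `ADD` update expression) so concurrent webhook deliveries cannot slip past the limit.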

### Feature: 7-Day Retention for Free Tier

```gherkin
Feature: Enforce 7-day data retention for free tier tenants

  Scenario: Free tier incidents older than 7 days are expired via DynamoDB TTL
    Given tenant "free-co" is on the free tier
    And incident "INC-OLD" was created 8 days ago
    When DynamoDB TTL runs
    Then "INC-OLD" is deleted from the incidents table

  Scenario: Free tier raw S3 archives older than 7 days are deleted
    Given tenant "free-co" has raw webhook archives in S3 from 8 days ago
    When the S3 lifecycle policy runs
    Then objects older than 7 days are deleted for free-tier tenants

  Scenario: Paid tier incidents are retained for 90 days
    Given tenant "paid-co" is on the "pro" plan
    And incident "INC-OLD" was created 30 days ago
    When DynamoDB TTL runs
    Then "INC-OLD" is NOT deleted

  Scenario: Retention policy is enforced per-tenant, not globally
    Given "free-co" and "paid-co" both have incidents from 10 days ago
    When TTL and lifecycle policies run
    Then "free-co"'s old incidents are deleted
    And "paid-co"'s old incidents are retained
```
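
Per-tenant retention via DynamoDB TTL works by writing a plan-dependent `expires_at` (epoch seconds) on each incident at creation time; a sketch, with the retention windows taken from the scenarios (7 days free, 90 days pro) and the helper name an assumption:

```typescript
// Retention windows per plan, from the scenarios above.
const RETENTION_DAYS: Record<"free" | "pro", number> = { free: 7, pro: 90 };

// Computes the DynamoDB TTL attribute value (epoch seconds) for an incident.
function expiresAt(createdAtEpochSeconds: number, plan: "free" | "pro"): number {
  return createdAtEpochSeconds + RETENTION_DAYS[plan] * 24 * 60 * 60;
}
```

Because the TTL value is stamped per item, a single table-level TTL setting enforces different retention per tenant, which is what the last scenario asserts.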

### Feature: Stripe Billing Integration

```gherkin
Feature: Stripe billing for plan upgrades

  Scenario: User upgrades from free to pro plan via Stripe Checkout
    Given user "@alice" is on the free tier
    When "@alice" clicks "Upgrade to Pro" in the dashboard
    Then a Stripe Checkout session is created
    And the user is redirected to the Stripe-hosted payment page

  Scenario: Successful Stripe payment activates pro plan
    Given a Stripe Checkout session completes successfully
    When the Stripe "checkout.session.completed" webhook is received
    Then tenant "acme"'s plan is updated to "pro" in DynamoDB
    And the alert ingestion limit is removed
    And a confirmation email is sent to the tenant owner

  Scenario: Failed Stripe payment does not activate pro plan
    Given a Stripe payment fails
    When the Stripe "payment_intent.payment_failed" webhook is received
    Then the tenant remains on the free tier
    And a failure notification email is sent

  Scenario: Stripe webhook signature is validated
    Given a Stripe webhook arrives with an invalid "Stripe-Signature" header
    When the billing Lambda processes the request
    Then the response status is 401
    And the event is not processed

  Scenario: Subscription cancellation downgrades tenant to free tier
    Given tenant "acme" cancels their pro subscription
    When the Stripe "customer.subscription.deleted" webhook is received
    Then tenant "acme"'s plan is downgraded to "free"
    And the 10K/month limit is re-applied from the next billing cycle
    And a downgrade confirmation email is sent

  Scenario: Stripe billing is idempotent — duplicate webhook events are ignored
    Given a Stripe "checkout.session.completed" event was already processed
    When the same event is received again (Stripe retry)
    Then the tenant plan is not double-updated
    And the response is 200 (idempotent acknowledgment)
```
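
Stripe signs webhooks by HMAC-SHA256 over `"<timestamp>.<raw body>"` and sends the result in the `Stripe-Signature` header as `t=<timestamp>,v1=<hex sig>`. A minimal verification sketch (in production the official Stripe SDK's verification helper should be preferred, and a timestamp-freshness check added to resist replay):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a Stripe-style signature header against the raw request body.
function verifyStripeSignature(rawBody: string, header: string, secret: string): boolean {
  // Parse "t=...,v1=..." into key/value pairs.
  const parts = Object.fromEntries(
    header.split(",").map((kv) => kv.split("=") as [string, string]),
  );
  const timestamp = parts["t"];
  const expected = parts["v1"];
  if (!timestamp || !expected) return false;

  // Recompute the signature over "<timestamp>.<raw body>".
  const computed = createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");

  // Constant-time comparison to avoid timing side channels.
  if (computed.length !== expected.length) return false;
  return timingSafeEqual(Buffer.from(computed), Buffer.from(expected));
}
```

For the idempotency scenario, the billing handler would additionally record each processed Stripe event ID (e.g. a conditional DynamoDB put) and acknowledge with 200 when the ID has been seen before.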

### Feature: Webhook URL Generation

```gherkin
Feature: Generate unique webhook URLs per tenant per source

  Scenario: Webhook URL is generated on tenant creation
    Given a new tenant "new-co" is created via OAuth signup
    When the onboarding wizard runs
    Then a unique webhook URL is generated for each supported source
    And each URL follows the pattern "https://ingest.dd0c.io/webhooks/{source}/{tenant_id}/{token}"
    And tokens are cryptographically random (32 bytes, URL-safe base64)

  Scenario: Webhook token is stored hashed in DynamoDB
    Given a webhook token is generated for tenant "new-co"
    When the token is stored
    Then only the SHA-256 hash of the token is stored in DynamoDB
    And the plaintext token is shown to the user exactly once

  Scenario: Webhook URL is validated on each ingestion request
    Given a request arrives at "POST /webhooks/datadog/{tenant_id}/{token}"
    When the ingestion Lambda validates the token
    Then the token hash is looked up in DynamoDB for the given tenant_id
    And if the hash matches, the request is accepted
    And if the hash does not match, the response is 401

  Scenario: 60-second time-to-value — first correlated alert within 60 seconds of webhook setup
    Given a new tenant completes the onboarding wizard and copies their webhook URL
    When the tenant sends their first alert to the webhook URL
    Then the tenant's first correlated incident appears in the dashboard within 60 seconds
    And a Slack notification is sent if Slack is configured
```
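
The per-request validation step reduces to hashing the presented token and comparing against the stored hash; a sketch using a constant-time comparison (the DynamoDB lookup that produces `storedHashHex` is left out, and the function name is illustrative):

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Validates a presented webhook token against the SHA-256 hex hash stored
// for the tenant, comparing in constant time to avoid timing side channels.
function isTokenValid(presentedToken: string, storedHashHex: string): boolean {
  const presentedHash = createHash("sha256").update(presentedToken).digest();
  const stored = Buffer.from(storedHashHex, "hex");
  if (presentedHash.length !== stored.length) return false;
  return timingSafeEqual(presentedHash, stored);
}
```

Storing only the hash means a DynamoDB leak does not expose usable webhook URLs, while validation stays a single cheap lookup on the ingestion path.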