dd0c/products/03-alert-intelligence/acceptance-specs/acceptance-specs.md
Max Mayfield 4938674c20 Phase 3: BDD acceptance specs for P2 (drift), P3 (alert), P6 (run)
P2: 2,245 lines, 10 epics — Sonnet subagent (8min)
P3: 1,653 lines, 10 epics — Sonnet subagent (6min)
P6: 2,303 lines, 262 scenarios, 10 epics — Sonnet subagent (7min)
P4 (portal) still in progress
2026-03-01 01:54:35 +00:00


dd0c/alert — Alert Intelligence: BDD Acceptance Test Specifications

Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic.


Epic 1: Webhook Ingestion

Feature: HMAC Signature Validation — Datadog

Feature: HMAC signature validation for Datadog webhooks
  As the ingestion layer
  I want to reject requests with invalid or missing HMAC signatures
  So that only legitimate Datadog payloads are processed

  Background:
    Given the webhook endpoint is "POST /webhooks/datadog"
    And a valid Datadog HMAC secret is configured as "dd-secret-abc123"

  Scenario: Valid Datadog HMAC signature is accepted
    Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}'
    And the request includes header "X-Datadog-Webhook-ID" with a valid HMAC-SHA256 signature
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And the payload is forwarded to the normalization SQS queue

  Scenario: Missing HMAC signature header is rejected
    Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}'
    And the request has no "X-Datadog-Webhook-ID" header
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the payload is NOT forwarded to SQS
    And a rejection event is logged with reason "missing_signature"

  Scenario: Tampered payload with mismatched HMAC is rejected
    Given a Datadog alert payload
    And the HMAC signature was computed over a different payload body
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the payload is NOT forwarded to SQS
    And a rejection event is logged with reason "signature_mismatch"

  Scenario: Replay attack with expired timestamp is rejected
    Given a Datadog alert payload with a valid HMAC signature
    And the request timestamp is more than 5 minutes in the past
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the rejection reason is "timestamp_expired"

  Scenario: HMAC secret rotation — old secret still accepted during grace period
    Given the Datadog HMAC secret was rotated 2 minutes ago
    And the request uses the previous secret for signing
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And a warning metric "hmac_old_secret_used" is emitted
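
The validation flow these scenarios describe can be sketched as follows. This is an illustrative sketch, not the implementation: the function name, the `secrets` list (current plus previous secret, which is what makes the rotation grace-period scenario work), and the return shape are assumptions; the header names and reason strings come from the scenarios above.

```python
import hashlib
import hmac
import time

def verify_webhook(body: bytes, signature: str, secrets: list[str],
                   sent_at: float, max_skew_s: int = 300) -> tuple[bool, str]:
    """Validate an HMAC-SHA256 webhook signature.

    `secrets` may hold both the current and the previous secret so that
    rotation has a grace period. Returns (accepted, reason) where reason
    matches the rejection reasons used in the scenarios.
    """
    if not signature:
        return False, "missing_signature"
    # Replay protection: reject requests older than the allowed skew (5 min).
    if time.time() - sent_at > max_skew_s:
        return False, "timestamp_expired"
    for secret in secrets:
        expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
        # compare_digest avoids leaking match position via timing.
        if hmac.compare_digest(expected, signature):
            return True, "ok"
    return False, "signature_mismatch"
```

In the rotation scenario, the handler would also emit `hmac_old_secret_used` when the match came from the previous secret rather than the current one.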

Feature: HMAC Signature Validation — PagerDuty

Feature: HMAC signature validation for PagerDuty webhooks

  Background:
    Given the webhook endpoint is "POST /webhooks/pagerduty"
    And a valid PagerDuty signing secret is configured

  Scenario: Valid PagerDuty v3 signature is accepted
    Given a PagerDuty webhook payload
    And the request includes "X-PagerDuty-Signature" with a valid v3 HMAC-SHA256 value
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And the payload is enqueued for normalization

  Scenario: PagerDuty v1 signature (legacy) is rejected
    Given a PagerDuty webhook payload signed with v1 scheme
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the rejection reason is "unsupported_signature_version"

  Scenario: Missing signature on PagerDuty webhook
    Given a PagerDuty webhook payload with no signature header
    When the Lambda ingestion handler receives the request
    Then the response status is 401

Feature: HMAC Signature Validation — OpsGenie

Feature: HMAC signature validation for OpsGenie webhooks

  Background:
    Given the webhook endpoint is "POST /webhooks/opsgenie"
    And a valid OpsGenie integration API key is configured

  Scenario: Valid OpsGenie HMAC is accepted
    Given an OpsGenie alert payload
    And the request includes "X-OG-Delivery-Signature" with a valid HMAC-SHA256 value
    When the Lambda ingestion handler receives the request
    Then the response status is 200

  Scenario: Invalid OpsGenie signature is rejected
    Given an OpsGenie alert payload with a forged signature
    When the Lambda ingestion handler receives the request
    Then the response status is 401
    And the rejection reason is "signature_mismatch"

Feature: HMAC Signature Validation — Grafana

Feature: HMAC signature validation for Grafana webhooks

  Background:
    Given the webhook endpoint is "POST /webhooks/grafana"
    And a Grafana webhook secret is configured

  Scenario: Valid Grafana signature is accepted
    Given a Grafana alert payload
    And the request includes "X-Grafana-Signature" with a valid HMAC-SHA256 value
    When the Lambda ingestion handler receives the request
    Then the response status is 200

  Scenario: Grafana webhook with no secret configured (open mode) is accepted with warning
    Given no Grafana webhook secret is configured for the tenant
    And a Grafana alert payload arrives without a signature header
    When the Lambda ingestion handler receives the request
    Then the response status is 200
    And a warning metric "grafana_unauthenticated_webhook" is emitted

Feature: Payload Normalization to Canonical Schema

Feature: Normalize incoming webhook payloads to canonical alert schema

  Scenario: Datadog payload is normalized to canonical schema
    Given a raw Datadog webhook payload with fields "title", "severity", "host", "tags"
    When the normalization Lambda processes the payload
    Then the canonical alert contains:
      | field       | value                        |
      | source      | datadog                      |
      | severity    | mapped from Datadog severity |
      | service     | extracted from tags          |
      | fingerprint | SHA-256 of source+title+host |
      | received_at | ISO-8601 timestamp           |
      | raw_payload | original JSON preserved      |

  Scenario: PagerDuty payload is normalized to canonical schema
    Given a raw PagerDuty v3 webhook payload
    When the normalization Lambda processes the payload
    Then the canonical alert contains "source" = "pagerduty"
    And "severity" is mapped from PagerDuty urgency field
    And "service" is extracted from the PagerDuty service name

  Scenario: OpsGenie payload is normalized to canonical schema
    Given a raw OpsGenie webhook payload
    When the normalization Lambda processes the payload
    Then the canonical alert contains "source" = "opsgenie"
    And "severity" is mapped from OpsGenie priority field

  Scenario: Grafana payload is normalized to canonical schema
    Given a raw Grafana alerting webhook payload
    When the normalization Lambda processes the payload
    Then the canonical alert contains "source" = "grafana"
    And "severity" is mapped from Grafana alert state

  Scenario: Unknown source type returns 400
    Given a webhook payload posted to "/webhooks/unknown-source"
    When the Lambda ingestion handler receives the request
    Then the response status is 400
    And the error reason is "unknown_source"

  Scenario: Malformed JSON payload returns 400
    Given a request body that is not valid JSON
    When the Lambda ingestion handler receives the request
    Then the response status is 400
    And the error reason is "invalid_json"

Feature: Async S3 Archival

Feature: Archive raw webhook payloads to S3 asynchronously

  Scenario: Every accepted payload is archived to S3
    Given a valid Datadog webhook payload is received and accepted
    When the Lambda ingestion handler processes the request
    Then the raw payload is written to S3 bucket "dd0c-raw-webhooks"
    And the S3 key follows the pattern "raw/{source}/{tenant_id}/{YYYY}/{MM}/{DD}/{uuid}.json"
    And the archival happens asynchronously (does not block the 200 response)

  Scenario: S3 archival failure does not fail the ingestion
    Given a valid webhook payload is received
    And the S3 write operation fails with a transient error
    When the Lambda ingestion handler processes the request
    Then the response status is still 200
    And the payload is still forwarded to SQS
    And an error metric "s3_archival_failure" is emitted

  Scenario: Archived payload includes tenant ID and trace context
    Given a valid webhook payload from tenant "tenant-xyz"
    When the payload is archived to S3
    Then the S3 object metadata includes "tenant_id" = "tenant-xyz"
    And the S3 object metadata includes the OTEL trace ID
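
The key pattern in the first scenario can be made concrete with a small sketch. The function name and parameter defaults are illustrative; the key layout is taken verbatim from the scenario.

```python
import uuid
from datetime import datetime, timezone

def archive_key(source: str, tenant_id: str, when=None, object_id=None) -> str:
    """Build the archival key "raw/{source}/{tenant_id}/{YYYY}/{MM}/{DD}/{uuid}.json".

    Date-partitioned keys keep listing and lifecycle rules cheap; the UUID
    suffix avoids collisions when two payloads arrive the same day.
    """
    when = when or datetime.now(timezone.utc)
    object_id = object_id or str(uuid.uuid4())
    return (f"raw/{source}/{tenant_id}/"
            f"{when:%Y}/{when:%m}/{when:%d}/{object_id}.json")
```

The object metadata from the third scenario (`tenant_id`, OTEL trace ID) would be passed separately, e.g. as the `Metadata` argument to an S3 put.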

Feature: SQS Payload Size Limit (256KB)

Feature: Handle SQS 256KB payload size limit during ingestion

  Scenario: Payload under 256KB is sent directly to SQS
    Given a normalized canonical alert payload of 10KB
    When the ingestion Lambda forwards it to SQS
    Then the message is placed on the SQS queue directly
    And no S3 pointer pattern is used

  Scenario: Payload exceeding 256KB is stored in S3 and pointer sent to SQS
    Given a normalized canonical alert payload of 300KB (e.g. large raw_payload)
    When the ingestion Lambda attempts to forward it to SQS
    Then the full payload is stored in S3 under "sqs-overflow/{uuid}.json"
    And an SQS message is sent containing only the S3 pointer and metadata
    And the SQS message size is under 256KB

  Scenario: Correlation engine retrieves oversized payload from S3 pointer
    Given an SQS message containing an S3 pointer for an oversized payload
    When the correlation engine consumer reads the SQS message
    Then it fetches the full payload from S3 using the pointer
    And processes it as a normal canonical alert

  Scenario: S3 pointer fetch fails in correlation engine
    Given an SQS message containing an S3 pointer
    And the S3 object has been deleted or is unavailable
    When the correlation engine attempts to fetch the payload
    Then the message is sent to the Dead Letter Queue
    And an alert metric "sqs_pointer_fetch_failure" is emitted
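
These scenarios describe the claim-check pattern. A minimal sketch, with a plain dict standing in for the S3 bucket (the real code would call `put_object`/`get_object`); the message shape is illustrative, while the 256 KB limit and the `sqs-overflow/` prefix come from the scenarios:

```python
import json
import uuid

SQS_LIMIT = 256 * 1024  # SQS maximum message size in bytes

def to_sqs_message(alert: dict, s3_store: dict) -> dict:
    """Claim-check pattern: send small payloads inline, spill large ones to S3."""
    body = json.dumps(alert)
    if len(body.encode()) <= SQS_LIMIT:
        return {"inline": True, "body": body}
    key = f"sqs-overflow/{uuid.uuid4()}.json"
    s3_store[key] = body                     # stands in for s3.put_object
    return {"inline": False, "s3_pointer": key}

def from_sqs_message(msg: dict, s3_store: dict) -> dict:
    """Consumer side: resolve the pointer back to the full payload.

    If the S3 fetch fails here, the real consumer sends the message to the
    DLQ and emits `sqs_pointer_fetch_failure`, per the last scenario.
    """
    if msg["inline"]:
        return json.loads(msg["body"])
    return json.loads(s3_store[msg["s3_pointer"]])
```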

Feature: Dead Letter Queue Handling

Feature: Dead Letter Queue overflow and monitoring

  Scenario: Message failing max retries is moved to DLQ
    Given an SQS message that causes a processing error on every attempt
    When the message has been received 3 times without success (maxReceiveCount = 3)
    Then the message is automatically moved to the DLQ "dd0c-ingestion-dlq"
    And a CloudWatch alarm "DLQDepthHigh" is triggered when DLQ depth > 10

  Scenario: DLQ overflow triggers operational alert
    Given the DLQ contains more than 100 messages
    When the DLQ depth CloudWatch alarm fires
    Then a Slack notification is sent to the ops channel "#dd0c-ops"
    And the notification includes the DLQ name and current depth

  Scenario: DLQ messages can be replayed after fix
    Given 50 messages are sitting in the DLQ
    When an operator triggers the DLQ replay Lambda
    Then messages are moved back to the main SQS queue in batches of 10
    And each replayed message retains its original trace context

Epic 2: Alert Normalization

Feature: Datadog Source Parser

Feature: Parse and normalize Datadog alert payloads

  Background:
    Given the Datadog parser is registered for source "datadog"

  Scenario: Datadog "alert" event maps to severity "critical"
    Given a Datadog payload with "alert_type" = "error" and "priority" = "P1"
    When the Datadog parser processes the payload
    Then the canonical alert "severity" = "critical"

  Scenario: Datadog "warning" event maps to severity "warning"
    Given a Datadog payload with "alert_type" = "warning"
    When the Datadog parser processes the payload
    Then the canonical alert "severity" = "warning"

  Scenario: Datadog "recovery" event maps to status "resolved"
    Given a Datadog payload with "alert_type" = "recovery"
    When the Datadog parser processes the payload
    Then the canonical alert "status" = "resolved"
    And the canonical alert "resolved_at" is set to the event timestamp

  Scenario: Service extracted from Datadog tags
    Given a Datadog payload with "tags" = ["service:payments", "env:prod", "team:backend"]
    When the Datadog parser processes the payload
    Then the canonical alert "service" = "payments"
    And the canonical alert "environment" = "prod"

  Scenario: Service tag absent — service defaults to hostname
    Given a Datadog payload with no "service:" tag
    And the payload contains "host" = "payments-worker-01"
    When the Datadog parser processes the payload
    Then the canonical alert "service" = "payments-worker-01"

  Scenario: Fingerprint is deterministic for identical alerts
    Given two identical Datadog payloads with the same title, host, and tags
    When both are processed by the Datadog parser
    Then both canonical alerts have the same "fingerprint" value

  Scenario: Fingerprint differs when title changes
    Given two Datadog payloads differing only in "title"
    When both are processed by the Datadog parser
    Then the canonical alerts have different "fingerprint" values

  Scenario: Datadog payload missing required "title" field
    Given a Datadog payload with no "title" field
    When the Datadog parser processes the payload
    Then a normalization error is raised with reason "missing_required_field:title"
    And the alert is sent to the normalization DLQ
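
The fingerprint behavior in the two determinism scenarios follows directly from the canonical-schema definition ("SHA-256 of source+title+host"). A sketch, with the separator character being an assumption added to avoid ambiguous concatenations ("ab"+"c" vs "a"+"bc"):

```python
import hashlib

def fingerprint(source: str, title: str, host: str) -> str:
    """Deterministic alert fingerprint: SHA-256 over source, title, and host.

    Identical inputs always produce the same digest (dedup relies on this);
    changing any one field, e.g. the title, changes the digest.
    """
    return hashlib.sha256(f"{source}|{title}|{host}".encode()).hexdigest()
```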

Feature: PagerDuty Source Parser

Feature: Parse and normalize PagerDuty webhook payloads

  Background:
    Given the PagerDuty parser is registered for source "pagerduty"

  Scenario: PagerDuty "trigger" event creates a new canonical alert
    Given a PagerDuty v3 webhook with "event_type" = "incident.triggered"
    When the PagerDuty parser processes the payload
    Then the canonical alert "status" = "firing"
    And "source" = "pagerduty"

  Scenario: PagerDuty "acknowledge" event updates alert status
    Given a PagerDuty v3 webhook with "event_type" = "incident.acknowledged"
    When the PagerDuty parser processes the payload
    Then the canonical alert "status" = "acknowledged"

  Scenario: PagerDuty "resolve" event updates alert status
    Given a PagerDuty v3 webhook with "event_type" = "incident.resolved"
    When the PagerDuty parser processes the payload
    Then the canonical alert "status" = "resolved"

  Scenario: PagerDuty urgency "high" maps to severity "critical"
    Given a PagerDuty payload with "urgency" = "high"
    When the PagerDuty parser processes the payload
    Then the canonical alert "severity" = "critical"

  Scenario: PagerDuty urgency "low" maps to severity "warning"
    Given a PagerDuty payload with "urgency" = "low"
    When the PagerDuty parser processes the payload
    Then the canonical alert "severity" = "warning"

  Scenario: PagerDuty service name is extracted correctly
    Given a PagerDuty payload with "service.name" = "checkout-api"
    When the PagerDuty parser processes the payload
    Then the canonical alert "service" = "checkout-api"

  Scenario: PagerDuty dedup key used as fingerprint seed
    Given a PagerDuty payload with "dedup_key" = "pd-dedup-xyz789"
    When the PagerDuty parser processes the payload
    Then the canonical alert "fingerprint" incorporates "pd-dedup-xyz789"

Feature: OpsGenie Source Parser

Feature: Parse and normalize OpsGenie webhook payloads

  Background:
    Given the OpsGenie parser is registered for source "opsgenie"

  Scenario: OpsGenie "Create" action maps to status "firing"
    Given an OpsGenie webhook with "action" = "Create"
    When the OpsGenie parser processes the payload
    Then the canonical alert "status" = "firing"

  Scenario: OpsGenie "Close" action maps to status "resolved"
    Given an OpsGenie webhook with "action" = "Close"
    When the OpsGenie parser processes the payload
    Then the canonical alert "status" = "resolved"

  Scenario: OpsGenie "Acknowledge" action maps to status "acknowledged"
    Given an OpsGenie webhook with "action" = "Acknowledge"
    When the OpsGenie parser processes the payload
    Then the canonical alert "status" = "acknowledged"

  Scenario: OpsGenie priority P1 maps to severity "critical"
    Given an OpsGenie payload with "priority" = "P1"
    When the OpsGenie parser processes the payload
    Then the canonical alert "severity" = "critical"

  Scenario: OpsGenie priority P3 maps to severity "warning"
    Given an OpsGenie payload with "priority" = "P3"
    When the OpsGenie parser processes the payload
    Then the canonical alert "severity" = "warning"

  Scenario: OpsGenie priority P5 maps to severity "info"
    Given an OpsGenie payload with "priority" = "P5"
    When the OpsGenie parser processes the payload
    Then the canonical alert "severity" = "info"

  Scenario: OpsGenie tags used for service extraction
    Given an OpsGenie payload with "tags" = ["service:inventory", "region:us-east-1"]
    When the OpsGenie parser processes the payload
    Then the canonical alert "service" = "inventory"
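
The severity mappings in the PagerDuty and OpsGenie features above reduce to small lookup tables. The table entries come directly from the scenarios; the fallback to "warning" for values the spec does not mention (e.g. OpsGenie P2/P4) is an assumption, not part of the spec:

```python
# Mappings taken from the scenarios; the "warning" default is an assumption.
PAGERDUTY_URGENCY = {"high": "critical", "low": "warning"}
OPSGENIE_PRIORITY = {"P1": "critical", "P3": "warning", "P5": "info"}

def map_pagerduty_severity(urgency: str) -> str:
    return PAGERDUTY_URGENCY.get(urgency, "warning")

def map_opsgenie_severity(priority: str) -> str:
    return OPSGENIE_PRIORITY.get(priority, "warning")
```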

Feature: Grafana Source Parser

Feature: Parse and normalize Grafana alerting webhook payloads

  Background:
    Given the Grafana parser is registered for source "grafana"

  Scenario: Grafana "alerting" state maps to status "firing"
    Given a Grafana webhook with "state" = "alerting"
    When the Grafana parser processes the payload
    Then the canonical alert "status" = "firing"

  Scenario: Grafana "ok" state maps to status "resolved"
    Given a Grafana webhook with "state" = "ok"
    When the Grafana parser processes the payload
    Then the canonical alert "status" = "resolved"

  Scenario: Grafana "no_data" state maps to severity "warning"
    Given a Grafana webhook with "state" = "no_data"
    When the Grafana parser processes the payload
    Then the canonical alert "severity" = "warning"
    And the canonical alert "status" = "firing"

  Scenario: Grafana panel URL preserved in canonical alert metadata
    Given a Grafana webhook with "ruleUrl" = "https://grafana.example.com/d/abc/panel"
    When the Grafana parser processes the payload
    Then the canonical alert "metadata.dashboard_url" = "https://grafana.example.com/d/abc/panel"

  Scenario: Grafana multi-alert payload (multiple evalMatches) creates one alert per match
    Given a Grafana webhook with 3 "evalMatches" entries
    When the Grafana parser processes the payload
    Then 3 canonical alerts are produced
    And each has a unique fingerprint based on the metric name and tags

Feature: Canonical Alert Schema Validation

Feature: Validate canonical alert schema completeness

  Scenario: Canonical alert with all required fields passes validation
    Given a canonical alert with fields: source, severity, status, service, fingerprint, received_at, tenant_id
    When schema validation runs
    Then the alert passes validation

  Scenario: Canonical alert missing "tenant_id" fails validation
    Given a canonical alert with no "tenant_id" field
    When schema validation runs
    Then validation fails with error "missing_required_field:tenant_id"
    And the alert is rejected before SQS enqueue

  Scenario: Canonical alert with unknown severity value fails validation
    Given a canonical alert with "severity" = "super-critical"
    When schema validation runs
    Then validation fails with error "invalid_enum_value:severity"

  Scenario: Canonical alert schema is additive — unknown extra fields are preserved
    Given a canonical alert with an extra field "custom_label" = "team-alpha"
    When schema validation runs
    Then the alert passes validation
    And "custom_label" is preserved in the alert document
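
The four scenarios above amount to a required-field check plus a severity enum, with everything else passed through. A sketch (the function name and error-list return are illustrative; field names and error strings are from the scenarios):

```python
REQUIRED = {"source", "severity", "status", "service",
            "fingerprint", "received_at", "tenant_id"}
SEVERITIES = {"critical", "warning", "info"}

def validate_alert(alert: dict) -> list[str]:
    """Return validation errors (empty list means the alert passes).

    Unknown extra fields are deliberately left alone — the schema is
    additive, so `custom_label`-style fields survive validation intact.
    """
    errors = [f"missing_required_field:{f}"
              for f in sorted(REQUIRED - alert.keys())]
    if "severity" in alert and alert["severity"] not in SEVERITIES:
        errors.append("invalid_enum_value:severity")
    return errors
```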

Epic 3: Correlation Engine

Feature: Time-Window Grouping

Feature: Group alerts into incidents using time-window correlation

  Background:
    Given the correlation engine is running on ECS Fargate
    And the default correlation time window is 5 minutes

  Scenario: Two alerts for the same service within the time window are grouped
    Given a canonical alert for service "payments" arrives at T=0
    And a second canonical alert for service "payments" arrives at T=3min
    When the correlation engine processes both alerts
    Then they are grouped into a single incident
    And the incident "alert_count" = 2

  Scenario: Two alerts for the same service outside the time window are NOT grouped
    Given a canonical alert for service "payments" arrives at T=0
    And a second canonical alert for service "payments" arrives at T=6min
    When the correlation engine processes both alerts
    Then they are placed in separate incidents

  Scenario: Time window is configurable per tenant
    Given tenant "enterprise-co" has a custom correlation window of 10 minutes
    And two alerts for the same service arrive 8 minutes apart
    When the correlation engine processes both alerts
    Then they are grouped into a single incident for tenant "enterprise-co"

  Scenario: Alerts from different services within the time window are NOT grouped by default
    Given a canonical alert for service "payments" at T=0
    And a canonical alert for service "auth" at T=1min
    When the correlation engine processes both alerts
    Then they are placed in separate incidents
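
The grouping rule in these scenarios can be sketched as below. One caveat: the spec does not say whether the window anchors on the incident's first or most recent alert; this sketch anchors on the most recent, which is one reasonable reading. All names are illustrative.

```python
def group_by_window(alerts, window_s=300):
    """Time-window correlation sketch.

    `alerts` is an iterable of (service, epoch_seconds) sorted by time.
    Alerts for the same service join the open incident while they arrive
    within `window_s` of its last alert; otherwise a new incident opens.
    Different services never share an incident here (affinity rules are a
    separate feature).
    """
    incidents = []
    open_by_service = {}
    for service, ts in alerts:
        inc = open_by_service.get(service)
        if inc and ts - inc["last_seen"] <= window_s:
            inc["alert_count"] += 1
            inc["last_seen"] = ts
        else:
            inc = {"service": service, "last_seen": ts, "alert_count": 1}
            incidents.append(inc)
            open_by_service[service] = inc
    return incidents
```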

Feature: Service-Affinity Matching

Feature: Group alerts across related services using service-affinity rules

  Background:
    Given a service-affinity rule: ["payments", "checkout", "cart"] are related

  Scenario: Alerts from affinity-grouped services are correlated into one incident
    Given a canonical alert for service "payments" at T=0
    And a canonical alert for service "checkout" at T=2min
    When the correlation engine applies service-affinity matching
    Then both alerts are grouped into a single incident
    And the incident "root_service" is set to the first-seen service "payments"

  Scenario: Alert from a service not in the affinity group is not merged
    Given a canonical alert for service "payments" at T=0
    And a canonical alert for service "logging" at T=1min
    When the correlation engine applies service-affinity matching
    Then they remain in separate incidents

  Scenario: Service-affinity rules are tenant-scoped
    Given tenant "acme" has affinity rule ["api", "gateway"]
    And tenant "globex" has no affinity rules
    And both tenants receive alerts for "api" and "gateway" simultaneously
    When the correlation engine processes both tenants' alerts
    Then "acme"'s alerts are grouped into one incident
    And "globex"'s alerts remain in separate incidents

Feature: Fingerprint Deduplication

Feature: Deduplicate alerts with identical fingerprints

  Scenario: Duplicate alert with same fingerprint within dedup window is suppressed
    Given a canonical alert with fingerprint "fp-abc123" is received at T=0
    And an identical alert with fingerprint "fp-abc123" arrives at T=30sec
    When the correlation engine checks the Redis dedup window
    Then the second alert is suppressed (not added as a new alert)
    And the incident "duplicate_count" is incremented by 1

  Scenario: Same fingerprint outside dedup window creates a new alert
    Given a canonical alert with fingerprint "fp-abc123" was processed at T=0
    And the dedup window is 10 minutes
    And the same fingerprint arrives at T=11min
    When the correlation engine checks the Redis dedup window
    Then the alert is treated as a new occurrence
    And a new incident entry is created

  Scenario: Different fingerprints are never deduplicated
    Given two alerts with different fingerprints "fp-abc123" and "fp-xyz789"
    When the correlation engine processes both
    Then both are treated as distinct alerts

  Scenario: Dedup counter is visible in incident metadata
    Given fingerprint "fp-abc123" has been suppressed 5 times
    When the incident is retrieved via the Dashboard API
    Then the incident "dedup_count" = 5

Feature: Redis Sliding Window

Feature: Redis sliding windows for correlation state management

  Background:
    Given Redis is available and the sliding window TTL is 10 minutes

  Scenario: Alert fingerprint is stored in Redis on first occurrence
    Given a canonical alert with fingerprint "fp-new001" arrives
    When the correlation engine processes the alert
    Then a Redis key "dedup:{tenant_id}:fp-new001" is set with TTL 10 minutes

  Scenario: Redis key TTL is refreshed on each matching alert
    Given a Redis key "dedup:{tenant_id}:fp-new001" exists with 2 minutes remaining
    And a new alert with fingerprint "fp-new001" arrives
    When the correlation engine processes the alert
    Then the Redis key TTL is reset to 10 minutes

  Scenario: Redis unavailability causes correlation engine to fail open
    Given Redis is unreachable
    When a canonical alert arrives for processing
    Then the alert is processed without deduplication
    And a metric "redis_unavailable_failopen" is emitted
    And the alert is NOT dropped

  Scenario: Redis sliding window is tenant-isolated
    Given tenant "alpha" has fingerprint "fp-shared" in Redis
    And tenant "beta" sends an alert with fingerprint "fp-shared"
    When the correlation engine checks the dedup window
    Then tenant "beta"'s alert is NOT suppressed
    And tenant "alpha"'s dedup state is unaffected
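
The Redis behavior in these scenarios — tenant-scoped keys, TTL refresh, and fail-open — can be sketched against redis-py's `set(key, value, nx=True, ex=ttl)` API. The class and method names are illustrative; the key format and fail-open rule come from the scenarios.

```python
class DedupWindow:
    """Sliding-window dedup sketch. `client` is expected to expose redis-py's
    `set(key, value, nx=True, ex=ttl)` (returns truthy only when the key was
    newly set) and `expire(key, ttl)`."""

    def __init__(self, client, ttl_s: int = 600):
        self.client, self.ttl_s = client, ttl_s

    def is_duplicate(self, tenant_id: str, fp: str) -> bool:
        key = f"dedup:{tenant_id}:{fp}"          # tenant-scoped, per the spec
        try:
            first_seen = self.client.set(key, "1", nx=True, ex=self.ttl_s)
        except Exception:
            # Fail open: process without dedup, never drop the alert.
            # Real code would also emit `redis_unavailable_failopen` here.
            return False
        if not first_seen:
            self.client.expire(key, self.ttl_s)  # refresh TTL on each match
        return not first_seen
```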

Feature: Cross-Tenant Isolation in Correlation

Feature: Prevent cross-tenant alert bleed in correlation engine

  Scenario: Alerts from different tenants with same fingerprint are never correlated
    Given tenant "alpha" sends alert with fingerprint "fp-shared" at T=0
    And tenant "beta" sends alert with fingerprint "fp-shared" at T=1min
    When the correlation engine processes both alerts
    Then each alert is placed in its own tenant-scoped incident
    And no incident contains alerts from both tenants

  Scenario: Tenant ID is validated before correlation lookup
    Given a canonical alert arrives with "tenant_id" = ""
    When the correlation engine attempts to process the alert
    Then the alert is rejected with error "missing_tenant_id"
    And the alert is sent to the correlation DLQ

  Scenario: Correlation engine worker processes only one tenant's partition at a time
    Given SQS messages are partitioned by tenant_id
    When the ECS Fargate worker picks up a batch of messages
    Then all messages in the batch belong to the same tenant
    And no cross-tenant data is loaded into the worker's memory context

Feature: OTEL Trace Propagation Across SQS Boundary

Feature: Propagate OpenTelemetry trace context across SQS ingestion-to-correlation boundary

  Scenario: Trace context is injected into SQS message attributes at ingestion
    Given a webhook request arrives with OTEL trace header "traceparent: 00-abc123-def456-01"
    When the ingestion Lambda enqueues the message to SQS
    Then the SQS message attributes include "traceparent" = "00-abc123-def456-01"
    And the SQS message attributes include "tracestate" if present in the original request

  Scenario: Correlation engine extracts and continues trace from SQS message
    Given an SQS message with "traceparent" attribute "00-abc123-def456-01"
    When the correlation engine consumer reads the message
    Then a child span is created with parent trace ID "abc123"
    And all subsequent operations (Redis lookup, DynamoDB write) are children of this span

  Scenario: Missing trace context in SQS message starts a new trace
    Given an SQS message with no "traceparent" attribute
    When the correlation engine consumer reads the message
    Then a new root trace is started
    And a metric "trace_context_missing" is emitted

  Scenario: Trace ID is stored on the incident record
    Given a correlated incident is created from an alert with trace ID "abc123"
    When the incident is written to DynamoDB
    Then the incident document includes "trace_id" = "abc123"
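
Carrying W3C trace context across the SQS hop amounts to copying `traceparent`/`tracestate` into message attributes and reading them back on the consumer side. A sketch using SQS's `{DataType, StringValue}` attribute shape; function names are illustrative:

```python
def trace_attributes(traceparent=None, tracestate=None) -> dict:
    """Producer side: build SQS MessageAttributes carrying W3C trace context.
    `tracestate` is included only if present in the original request."""
    attrs = {}
    if traceparent:
        attrs["traceparent"] = {"DataType": "String", "StringValue": traceparent}
    if tracestate:
        attrs["tracestate"] = {"DataType": "String", "StringValue": tracestate}
    return attrs

def extract_traceparent(attrs: dict):
    """Consumer side: return the traceparent, or None — in which case the
    consumer starts a new root trace and emits `trace_context_missing`."""
    entry = attrs.get("traceparent")
    return entry["StringValue"] if entry else None
```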

Epic 4: Notification & Escalation

Feature: Slack Block Kit Incident Notifications

Feature: Send Slack Block Kit notifications for new incidents

  Background:
    Given a Slack webhook URL is configured for tenant "acme"
    And the Slack notification Lambda is subscribed to the incidents SNS topic

  Scenario: New critical incident triggers Slack notification
    Given a new incident is created with severity "critical" for service "payments"
    When the notification Lambda processes the incident event
    Then a Slack Block Kit message is posted to the configured channel
    And the message includes the incident ID, service name, severity, and timestamp
    And the message includes action buttons: "Acknowledge", "Escalate", "Mark as Noise"

  Scenario: New warning incident triggers Slack notification
    Given a new incident is created with severity "warning"
    When the notification Lambda processes the incident event
    Then a Slack message is posted with severity badge "⚠️ WARNING"

  Scenario: Resolved incident posts a resolution message to Slack
    Given an existing incident "INC-001" transitions to status "resolved"
    When the notification Lambda processes the resolution event
    Then a Slack message is posted indicating "INC-001 resolved"
    And the message includes time-to-resolution duration

  Scenario: Slack Block Kit message includes alert count for correlated incidents
    Given an incident contains 7 correlated alerts
    When the Slack notification is sent
    Then the message body includes "7 correlated alerts"

  Scenario: Slack notification includes dashboard deep-link
    Given a new incident "INC-042" is created
    When the Slack notification is sent
    Then the message includes a button "View in Dashboard" linking to "/incidents/INC-042"

Feature: Severity-Based Routing

Feature: Route notifications to different Slack channels based on severity

  Background:
    Given tenant "acme" has configured:
      | severity  | channel          |
      | critical  | #incidents-p1    |
      | warning   | #incidents-p2    |
      | info      | #monitoring-feed |

  Scenario: Critical incident is routed to P1 channel
    Given a new incident with severity "critical"
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#incidents-p1"

  Scenario: Warning incident is routed to P2 channel
    Given a new incident with severity "warning"
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#incidents-p2"

  Scenario: Info incident is routed to monitoring feed
    Given a new incident with severity "info"
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#monitoring-feed"

  Scenario: No routing rule configured — falls back to default channel
    Given tenant "beta" has only a default channel "#alerts" configured
    And a new incident with severity "critical" arrives
    When the notification Lambda routes the alert
    Then the Slack message is posted to "#alerts"
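
The routing table and fallback behavior above reduce to a dictionary lookup with a default. A minimal sketch (function name is illustrative; channels are from the Background table):

```python
ACME_ROUTES = {"critical": "#incidents-p1",
               "warning": "#incidents-p2",
               "info": "#monitoring-feed"}

def route_channel(severity: str, routes: dict, default_channel: str) -> str:
    """Pick the Slack channel for a severity, falling back to the tenant's
    default channel when no routing rule is configured."""
    return routes.get(severity, default_channel)
```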

Feature: Escalation to PagerDuty if Unacknowledged

Feature: Escalate unacknowledged critical incidents to PagerDuty

  Background:
    Given the escalation check runs every minute via EventBridge
    And the escalation threshold for "critical" incidents is 15 minutes

  Scenario: Unacknowledged critical incident is escalated after threshold
    Given a critical incident "INC-001" was created 16 minutes ago
    And "INC-001" has not been acknowledged
    When the escalation Lambda runs
    Then a PagerDuty incident is created via the PagerDuty Events API v2
    And the incident "INC-001" status is updated to "escalated"
    And a Slack message is posted: "INC-001 escalated to PagerDuty"

  Scenario: Acknowledged incident is NOT escalated
    Given a critical incident "INC-002" was created 20 minutes ago
    And "INC-002" was acknowledged 5 minutes ago
    When the escalation Lambda runs
    Then no PagerDuty incident is created for "INC-002"

  Scenario: Warning incident has a longer escalation threshold
    Given the escalation threshold for "warning" incidents is 60 minutes
    And a warning incident "INC-003" was created 45 minutes ago and is unacknowledged
    When the escalation Lambda runs
    Then no PagerDuty incident is created for "INC-003"

  Scenario: Escalation is idempotent — already-escalated incident is not re-escalated
    Given incident "INC-004" was already escalated to PagerDuty
    When the escalation Lambda runs again
    Then no duplicate PagerDuty incident is created
    And the escalation Lambda logs "already_escalated:INC-004"

  Scenario: PagerDuty API failure during escalation is retried
    Given incident "INC-005" is due for escalation
    And the PagerDuty Events API returns a 500 error
    When the escalation Lambda attempts to create the PagerDuty incident
    Then the Lambda retries up to 3 times with exponential backoff
    And if all retries fail, an error metric "pagerduty_escalation_failure" is emitted
    And the incident is flagged for manual review
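
The escalation decision in these scenarios is a pure predicate over incident state, which also gives the idempotency for free: a sketch under the thresholds stated above (type and function names are illustrative):

```typescript
// Per-severity escalation thresholds in minutes, as given in the scenarios.
const ESCALATION_THRESHOLD_MIN: Record<string, number> = {
  critical: 15,
  warning: 60,
};

interface IncidentState {
  severity: string;
  ageMinutes: number;
  acknowledged: boolean;
  status: string; // e.g. "open" | "escalated" | "resolved"
}

// Decide whether the sweep should page for this incident. Acknowledged,
// already-escalated, or under-threshold incidents are skipped, so running
// the sweep repeatedly never escalates the same incident twice.
function shouldEscalate(inc: IncidentState): boolean {
  if (inc.acknowledged || inc.status === "escalated") return false;
  const threshold = ESCALATION_THRESHOLD_MIN[inc.severity];
  return threshold !== undefined && inc.ageMinutes > threshold;
}
```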

Feature: Daily Noise Report

Feature: Generate and send daily noise reduction report

  Background:
    Given the daily report Lambda runs at 08:00 UTC via EventBridge

  Scenario: Daily noise report is sent to configured Slack channel
    Given tenant "acme" has 500 raw alerts and 80 incidents in the past 24 hours
    When the daily report Lambda runs
    Then a Slack message is posted to "#dd0c-digest"
    And the message includes:
      | metric                  | value |
      | total_alerts            | 500   |
      | correlated_incidents    | 80    |
      | noise_reduction_percent | 84%   |
      | top_noisy_service       | shown |

  Scenario: Daily report includes MTTR for resolved incidents
    Given 20 incidents were resolved in the past 24 hours with an average MTTR of 23 minutes
    When the daily report Lambda runs
    Then the Slack message includes "Avg MTTR: 23 min"

  Scenario: Daily report is skipped if no alerts in past 24 hours
    Given tenant "quiet-co" had 0 alerts in the past 24 hours
    When the daily report Lambda runs
    Then no Slack message is sent for "quiet-co"

  Scenario: Daily report is tenant-scoped — no cross-tenant data leakage
    Given tenants "alpha" and "beta" both have activity
    When the daily report Lambda runs
    Then "alpha"'s report contains only "alpha"'s metrics
    And "beta"'s report contains only "beta"'s metrics
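
The noise-reduction figure in the report is the share of raw alerts that did not become incidents. A sketch of the computation, including the "no alerts, skip the report" case (function name is illustrative):

```typescript
// Noise reduction = (raw alerts - incidents) / raw alerts, as a whole percent.
// Returns null when there were no alerts, signalling the report should be
// skipped for that tenant.
function noiseReductionPercent(rawAlerts: number, incidents: number): number | null {
  if (rawAlerts === 0) return null;
  return Math.round(((rawAlerts - incidents) / rawAlerts) * 100);
}
```

For the scenario above, 500 raw alerts collapsing to 80 incidents yields 84%.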

Feature: Slack Rate Limiting

Feature: Handle Slack API rate limiting gracefully

  Scenario: Slack returns 429 Too Many Requests — notification is retried
    Given a Slack notification needs to be sent
    And Slack returns HTTP 429 with "Retry-After: 5"
    When the notification Lambda handles the response
    Then the Lambda waits 5 seconds before retrying
    And the notification is eventually delivered

  Scenario: Slack rate limit persists beyond Lambda timeout — message queued for retry
    Given Slack is rate-limiting for 30 seconds
    And the Lambda timeout is 15 seconds
    When the notification Lambda cannot deliver within its timeout
    Then the SQS message is not deleted (remains visible after visibility timeout)
    And the message is retried by the next Lambda invocation

  Scenario: Burst of 50 incidents triggers Slack rate limit protection
    Given 50 incidents are created within 1 second
    When the notification Lambda processes the burst
    Then notifications are batched and sent with 1-second delays between batches
    And all 50 notifications are eventually delivered
    And a metric "slack_rate_limit_batching" is emitted
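
The two mechanisms above — honoring `Retry-After` and spacing out bursts — can be sketched as small helpers. Batch size and the fallback delay are illustrative knobs, not values fixed by the spec:

```typescript
// Split a burst of notifications into fixed-size batches; the sender then
// pauses between batches (1 second in the scenario above) to stay under
// Slack's rate limit.
function batchNotifications<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Translate Slack's Retry-After header (seconds) into a wait in milliseconds,
// falling back to a default when the header is missing or malformed.
function retryDelayMs(retryAfterHeader: string | undefined, fallbackMs = 1000): number {
  const s = Number(retryAfterHeader);
  return Number.isFinite(s) && s > 0 ? s * 1000 : fallbackMs;
}
```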

Epic 5: Slack Bot

Feature: Interactive Feedback Buttons

Feature: Slack interactive feedback buttons on incident notifications

  Background:
    Given an incident notification was posted to Slack with buttons: "Helpful", "Noise", "Escalate"
    And the Slack interactivity endpoint is "POST /slack/interactions"

  Scenario: User clicks "Helpful" on an incident notification
    Given user "@alice" clicks the "Helpful" button on incident "INC-007"
    When the Slack interaction payload is received
    Then the incident "INC-007" feedback is recorded as "helpful"
    And the Slack message is updated to show "✅ Marked helpful by @alice"
    And the button is disabled to prevent duplicate feedback

  Scenario: User clicks "Noise" on an incident notification
    Given user "@bob" clicks the "Noise" button on incident "INC-008"
    When the Slack interaction payload is received
    Then the incident "INC-008" feedback is recorded as "noise"
    And the incident "noise_score" is incremented
    And the Slack message is updated to show "🔇 Marked as noise by @bob"

  Scenario: User clicks "Escalate" on an incident notification
    Given user "@carol" clicks the "Escalate" button on incident "INC-009"
    When the Slack interaction payload is received
    Then the incident "INC-009" is immediately escalated to PagerDuty
    And the Slack message is updated to show "🚨 Escalated by @carol"
    And the escalation bypasses the normal time threshold

  Scenario: Feedback on an already-resolved incident is rejected
    Given incident "INC-010" has status "resolved"
    And user "@dave" clicks "Helpful" on the stale Slack message
    When the Slack interaction payload is received
    Then the Slack message is updated to show "⚠️ Incident already resolved"
    And no feedback is recorded

  Scenario: Slack interaction payload signature is validated
    Given a Slack interaction request with an invalid "X-Slack-Signature" header
    When the interaction endpoint receives the request
    Then the response status is 401
    And the interaction is not processed

  Scenario: Duplicate button click by same user is idempotent
    Given user "@alice" already marked incident "INC-007" as "helpful"
    And "@alice" clicks "Helpful" again on the same message
    When the Slack interaction payload is received
    Then the feedback count is NOT incremented again
    And the response acknowledges the duplicate gracefully
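
The signature check in these scenarios follows Slack's documented v0 signing scheme: the `X-Slack-Signature` header carries `v0=` plus an HMAC-SHA256 over `v0:<timestamp>:<raw body>`. A sketch using Node's `crypto` (the function name is illustrative; a production handler would also reject stale timestamps to block replays):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a Slack request signature (v0 scheme). Returns false rather than
// throwing so the endpoint can map failure to a 401.
function verifySlackSignature(
  signingSecret: string,
  timestamp: string,
  rawBody: string,
  signatureHeader: string,
): boolean {
  const expected =
    "v0=" +
    createHmac("sha256", signingSecret)
      .update(`v0:${timestamp}:${rawBody}`)
      .digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```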

Feature: Slash Command — /dd0c status

Feature: /dd0c status slash command

  Background:
    Given the Slack slash command "/dd0c" is registered
    And the command handler endpoint is "POST /slack/commands"

  Scenario: /dd0c status returns current open incident count
    Given tenant "acme" has 3 open critical incidents and 5 open warning incidents
    When user "@alice" runs "/dd0c status" in the Slack workspace
    Then the bot responds ephemerally with:
      | metric              | value |
      | open_critical       | 3     |
      | open_warning        | 5     |
      | alerts_last_hour    | shown |
      | system_status       | OK    |

  Scenario: /dd0c status when no open incidents
    Given tenant "acme" has 0 open incidents
    When user "@alice" runs "/dd0c status"
    Then the bot responds with "✅ All clear — no open incidents"

  Scenario: /dd0c status responds within Slack's 3-second timeout
    Given the command handler receives "/dd0c status"
    When the handler processes the request
    Then an HTTP 200 response is returned within 3 seconds
    And if data retrieval takes longer, an immediate acknowledgment is sent
    And the full response is delivered via response_url

  Scenario: /dd0c status is scoped to the requesting tenant
    Given user "@alice" belongs to tenant "acme"
    When "@alice" runs "/dd0c status"
    Then the response contains only "acme"'s incident data
    And no data from other tenants is included

Feature: Slash Command — /dd0c anomalies

Feature: /dd0c anomalies slash command

  Scenario: /dd0c anomalies returns top noisy services in the last 24 hours
    Given service "payments" fired 120 alerts in the last 24 hours
    And service "auth" fired 80 alerts
    And service "logging" fired 10 alerts
    When user "@alice" runs "/dd0c anomalies"
    Then the bot responds with a ranked list:
      | rank | service  | alert_count |
      | 1    | payments | 120         |
      | 2    | auth     | 80          |
      | 3    | logging  | 10          |

  Scenario: /dd0c anomalies with time range argument
    Given user "@alice" runs "/dd0c anomalies --last 7d"
    When the command handler processes the request
    Then the response covers the last 7 days of anomaly data

  Scenario: /dd0c anomalies with no data returns helpful message
    Given no alerts have been received in the last 24 hours
    When user "@alice" runs "/dd0c anomalies"
    Then the bot responds with "No anomalies detected in the last 24 hours"

Feature: Slash Command — /dd0c digest

Feature: /dd0c digest slash command

  Scenario: /dd0c digest returns on-demand summary report
    Given tenant "acme" has activity in the last 24 hours
    When user "@alice" runs "/dd0c digest"
    Then the bot responds with a summary matching the daily noise report format
    And the response includes total alerts, incidents, noise reduction %, and avg MTTR

  Scenario: /dd0c digest with custom time range
    Given user "@alice" runs "/dd0c digest --last 7d"
    When the command handler processes the request
    Then the digest covers the last 7 days

  Scenario: Unauthorized user cannot run /dd0c commands
    Given user "@mallory" is not a member of any configured tenant workspace
    When "@mallory" runs "/dd0c status"
    Then the bot responds ephemerally with "⛔ You are not authorized to use this command"
    And no tenant data is returned

Epic 6: Dashboard API

Feature: Cognito JWT Authentication

Feature: Authenticate Dashboard API requests with Cognito JWT

  Background:
    Given the Dashboard API requires a valid Cognito JWT in the "Authorization: Bearer <token>" header

  Scenario: Valid JWT grants access to the API
    Given a user has a valid Cognito JWT for tenant "acme"
    When the user calls "GET /api/incidents"
    Then the response status is 200
    And only "acme"'s incidents are returned

  Scenario: Missing Authorization header returns 401
    Given a request to "GET /api/incidents" with no Authorization header
    When the API Gateway processes the request
    Then the response status is 401
    And the body contains "error": "missing_token"

  Scenario: Expired JWT returns 401
    Given a user presents a JWT that expired 10 minutes ago
    When the user calls "GET /api/incidents"
    Then the response status is 401
    And the body contains "error": "token_expired"

  Scenario: JWT signed with wrong key returns 401
    Given a user presents a JWT signed with a non-Cognito key
    When the user calls "GET /api/incidents"
    Then the response status is 401
    And the body contains "error": "invalid_token_signature"

  Scenario: JWT from a different tenant cannot access another tenant's data
    Given user "@alice" has a valid JWT for tenant "acme"
    When "@alice" calls "GET /api/incidents?tenant_id=globex"
    Then the response status is 403
    And the body contains "error": "tenant_access_denied"

Feature: Incident Listing with Filters

Feature: List incidents with filtering and pagination

  Background:
    Given the user is authenticated for tenant "acme"

  Scenario: List all open incidents
    Given tenant "acme" has 15 open incidents
    When the user calls "GET /api/incidents?status=open"
    Then the response status is 200
    And the response contains 15 incidents
    And each incident includes: id, severity, service, status, created_at, alert_count

  Scenario: Filter incidents by severity
    Given tenant "acme" has 5 critical and 10 warning incidents
    When the user calls "GET /api/incidents?severity=critical"
    Then the response contains exactly 5 incidents
    And all returned incidents have severity "critical"

  Scenario: Filter incidents by service
    Given tenant "acme" has incidents for services "payments", "auth", and "checkout"
    When the user calls "GET /api/incidents?service=payments"
    Then only incidents for service "payments" are returned

  Scenario: Filter incidents by date range
    Given incidents exist from the past 30 days
    When the user calls "GET /api/incidents?from=2026-02-01&to=2026-02-07"
    Then only incidents created between Feb 1 and Feb 7 are returned

  Scenario: Pagination returns correct page of results
    Given tenant "acme" has 100 incidents
    When the user calls "GET /api/incidents?page=2&limit=20"
    Then the response contains incidents 21–40
    And the response includes "total": 100, "page": 2, "limit": 20

  Scenario: Empty result set returns 200 with empty array
    Given tenant "acme" has no incidents matching the filter
    When the user calls "GET /api/incidents?service=nonexistent"
    Then the response status is 200
    And the response body is '{"incidents": [], "total": 0}'

  Scenario: Incident detail endpoint returns full alert timeline
    Given incident "INC-042" has 7 correlated alerts
    When the user calls "GET /api/incidents/INC-042"
    Then the response includes the incident details
    And "alerts" array contains 7 entries with timestamps and sources
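
The pagination contract above is plain 1-indexed offset math. A sketch of the slice and the response envelope (names are illustrative):

```typescript
// Translate 1-indexed ?page=&limit= query params into a slice of the result
// set plus the envelope fields the scenarios assert on.
function paginate<T>(
  all: T[],
  page: number,
  limit: number,
): { items: T[]; total: number; page: number; limit: number } {
  const start = (page - 1) * limit; // page 2, limit 20 -> offset 20
  return { items: all.slice(start, start + limit), total: all.length, page, limit };
}
```

An empty result set simply yields `items: []` with `total: 0`, matching the 200-with-empty-array scenario.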

Feature: Analytics Endpoints — MTTR

Feature: MTTR analytics endpoint

  Background:
    Given the user is authenticated for tenant "acme"

  Scenario: MTTR endpoint returns average time-to-resolution
    Given 10 incidents were resolved in the last 7 days with MTTRs ranging from 5 to 60 minutes
    When the user calls "GET /api/analytics/mttr?period=7d"
    Then the response includes "avg_mttr_minutes" as a number
    And "incident_count" = 10

  Scenario: MTTR broken down by service
    When the user calls "GET /api/analytics/mttr?period=7d&group_by=service"
    Then the response includes a per-service MTTR breakdown

  Scenario: MTTR with no resolved incidents returns null
    Given no incidents were resolved in the requested period
    When the user calls "GET /api/analytics/mttr?period=1d"
    Then the response includes "avg_mttr_minutes": null
    And "incident_count": 0
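
The MTTR aggregation, including the null-when-empty contract, is a one-liner worth pinning down (helper name is illustrative):

```typescript
// Average MTTR in minutes over resolved incidents; null when nothing was
// resolved in the period, so the API can emit "avg_mttr_minutes": null.
function avgMttrMinutes(mttrs: number[]): number | null {
  if (mttrs.length === 0) return null;
  const sum = mttrs.reduce((acc, m) => acc + m, 0);
  return Math.round(sum / mttrs.length);
}
```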

Feature: Analytics Endpoints — Noise Reduction

Feature: Noise reduction analytics endpoint

  Scenario: Noise reduction percentage is calculated correctly
    Given tenant "acme" received 1000 raw alerts and 150 incidents in the last 7 days
    When the user calls "GET /api/analytics/noise-reduction?period=7d"
    Then the response includes "noise_reduction_percent": 85
    And "raw_alerts": 1000
    And "incidents": 150

  Scenario: Noise reduction trend over time
    When the user calls "GET /api/analytics/noise-reduction?period=30d&granularity=daily"
    Then the response includes a daily time series of noise reduction percentages

  Scenario: Noise reduction by source
    When the user calls "GET /api/analytics/noise-reduction?period=7d&group_by=source"
    Then the response includes per-source breakdown (datadog, pagerduty, opsgenie, grafana)

Feature: Tenant Isolation in Dashboard API

Feature: Enforce strict tenant isolation across all API endpoints

  Scenario: DynamoDB queries always include tenant_id partition key
    Given user "@alice" for tenant "acme" calls any incident endpoint
    When the API handler queries DynamoDB
    Then the query always includes "tenant_id = acme" as a condition
    And no full-table scans are performed

  Scenario: TimescaleDB analytics queries are scoped by tenant_id
    Given user "@alice" for tenant "acme" calls any analytics endpoint
    When the API handler queries TimescaleDB
    Then the SQL query includes "WHERE tenant_id = 'acme'"

  Scenario: API does not expose tenant_id enumeration
    Given user "@alice" calls "GET /api/incidents/INC-999" where INC-999 belongs to tenant "globex"
    When the API processes the request
    Then the response status is 404 (not 403, to avoid tenant enumeration)
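
The 404-over-403 rule above prevents an attacker from learning that an incident ID exists in someone else's tenant. A sketch of the lookup (the in-memory store and names are illustrative; the real handler queries DynamoDB with the caller's `tenant_id` as partition key):

```typescript
interface IncidentRecord {
  tenantId: string;
  id: string;
}

// Return 404 both when the incident does not exist and when it belongs to a
// different tenant, so the two cases are indistinguishable to the caller.
function getIncidentStatus(
  store: IncidentRecord[],
  callerTenant: string,
  incidentId: string,
): number {
  const hit = store.find((i) => i.id === incidentId);
  if (!hit || hit.tenantId !== callerTenant) return 404;
  return 200;
}
```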

Epic 7: Dashboard UI

Feature: Incident List View

Feature: Incident list page in the React SPA

  Background:
    Given the user is logged in and the Dashboard SPA is loaded

  Scenario: Incident list displays open incidents on load
    Given tenant "acme" has 12 open incidents
    When the user navigates to "/incidents"
    Then the incident list renders 12 rows
    And each row shows: incident ID, severity badge, service name, alert count, age

  Scenario: Severity badge color coding
    Given the incident list contains critical, warning, and info incidents
    When the list renders
    Then critical incidents show a red badge
    And warning incidents show a yellow badge
    And info incidents show a blue badge

  Scenario: Clicking an incident row navigates to incident detail
    Given the incident list is displayed
    When the user clicks on incident "INC-042"
    Then the browser navigates to "/incidents/INC-042"

  Scenario: Filter by severity updates the list in real time
    Given the incident list is displayed
    When the user selects "Critical" from the severity filter dropdown
    Then only critical incidents are shown
    And the URL updates to "/incidents?severity=critical"

  Scenario: Filter by service updates the list
    Given the incident list is displayed
    When the user types "payments" in the service search box
    Then only incidents for service "payments" are shown

  Scenario: Empty state is shown when no incidents match filters
    Given no incidents match the current filter
    When the list renders
    Then a message "No incidents found" is displayed
    And a "Clear filters" button is shown

  Scenario: Incident list auto-refreshes every 30 seconds
    Given the incident list is displayed
    When 30 seconds elapse
    Then the list silently re-fetches from the API
    And new incidents appear without a full page reload

Feature: Alert Timeline View

Feature: Alert timeline within an incident detail page

  Scenario: Alert timeline shows all correlated alerts in chronological order
    Given incident "INC-042" has 5 correlated alerts from T=0 to T=4min
    When the user navigates to "/incidents/INC-042"
    Then the timeline renders 5 events in ascending time order
    And each event shows: source icon, alert title, severity, timestamp

  Scenario: Timeline highlights the root cause alert
    Given the first alert in the incident is flagged as "root_cause"
    When the timeline renders
    Then the root cause alert is visually distinguished (e.g. bold border)

  Scenario: Timeline shows deduplication count
    Given fingerprint "fp-abc" was suppressed 8 times
    When the timeline renders the corresponding alert
    Then a badge "×8 duplicates suppressed" is shown on that alert entry

  Scenario: Timeline is scrollable for large incidents
    Given an incident has 200 correlated alerts
    When the timeline renders
    Then a virtualized scroll list is used
    And the page does not freeze or crash

Feature: MTTR Chart

Feature: MTTR trend chart on the analytics page

  Scenario: MTTR chart renders a 7-day trend line
    Given the analytics API returns daily MTTR data for the last 7 days
    When the user navigates to "/analytics"
    Then a line chart is rendered with 7 data points
    And the X-axis shows dates and the Y-axis shows minutes

  Scenario: MTTR chart shows "No data" state when no resolved incidents
    Given no incidents were resolved in the selected period
    When the chart renders
    Then a "No resolved incidents in this period" message is shown instead of the chart

  Scenario: MTTR chart period selector changes the data range
    Given the user is on the analytics page
    When the user selects "Last 30 days" from the period dropdown
    Then the chart re-fetches data for the last 30 days
    And the chart updates without a full page reload

Feature: Noise Reduction Percentage Display

Feature: Noise reduction metric display on analytics page

  Scenario: Noise reduction percentage is prominently displayed
    Given the analytics API returns noise_reduction_percent = 84
    When the user views the analytics page
    Then a large "84%" figure is displayed under "Noise Reduction"

  Scenario: Noise reduction trend sparkline is shown
    Given daily noise reduction data is available for 30 days
    When the analytics page renders
    Then a sparkline chart shows the 30-day trend

  Scenario: Noise reduction breakdown by source is shown
    Given the API returns per-source noise reduction data
    When the user clicks "By Source" tab
    Then a bar chart shows noise reduction % for each source (Datadog, PagerDuty, OpsGenie, Grafana)

Feature: Webhook Setup Wizard

Feature: Webhook setup wizard for onboarding new monitoring sources

  Scenario: Wizard generates a unique webhook URL for Datadog
    Given the user navigates to "/settings/webhooks"
    And clicks "Add Webhook Source"
    When the user selects "Datadog" from the source dropdown
    And clicks "Generate"
    Then a unique webhook URL is displayed: "https://ingest.dd0c.io/webhooks/datadog/{tenant_id}/{token}"
    And the HMAC secret is shown once for copying

  Scenario: Wizard provides copy-paste instructions for each source
    Given the user has generated a Datadog webhook URL
    When the wizard displays the setup instructions
    Then step-by-step instructions for configuring Datadog are shown
    And a "Test Webhook" button is available

  Scenario: Test webhook button sends a test payload and confirms receipt
    Given the user clicks "Test Webhook" for a configured Datadog source
    When the test payload is sent
    Then the wizard shows "✅ Test payload received successfully"
    And the test alert appears in the incident list as a test event

  Scenario: Wizard shows validation error if source already configured
    Given tenant "acme" already has a Datadog webhook configured
    When the user tries to add a second Datadog webhook
    Then the wizard shows "A Datadog webhook is already configured. Regenerate token?"

  Scenario: Regenerating a webhook token invalidates the old token
    Given tenant "acme" has an existing Datadog webhook token
    When the user clicks "Regenerate Token" and confirms
    Then a new token is generated
    And the old token is immediately invalidated
    And any requests using the old token return 401

Epic 8: Infrastructure

Feature: CDK Stack — Lambda Ingestion

Feature: CDK provisions Lambda ingestion infrastructure

  Scenario: Lambda function is created with correct runtime and memory
    Given the CDK stack is synthesized
    When the CloudFormation template is inspected
    Then a Lambda function "dd0c-ingestion" exists with runtime "nodejs20.x"
    And memory is set to 512MB
    And timeout is set to 30 seconds

  Scenario: Lambda has least-privilege IAM role
    Given the CDK stack is synthesized
    When the IAM role for "dd0c-ingestion" is inspected
    Then the role allows "sqs:SendMessage" only to the ingestion SQS queue ARN
    And the role allows "s3:PutObject" only to the "dd0c-raw-webhooks" bucket
    And the role does NOT have "s3:*" or "sqs:*" wildcards

  Scenario: Lambda is behind API Gateway with throttling
    Given the CDK stack is synthesized
    When the API Gateway configuration is inspected
    Then stage throttling is configured with a burst limit of 1000 requests and a rate limit of 500 requests/second
    And WAF is attached to the API Gateway stage

  Scenario: Lambda environment variables are sourced from SSM Parameter Store
    Given the CDK stack is synthesized
    When the Lambda environment configuration is inspected
    Then HMAC secrets are referenced from SSM parameters (not hardcoded)
    And no secrets appear in plaintext in the CloudFormation template

Feature: CDK Stack — ECS Fargate Correlation Engine

Feature: CDK provisions ECS Fargate for the correlation engine

  Scenario: ECS service is created with correct task definition
    Given the CDK stack is synthesized
    When the ECS task definition is inspected
    Then the task uses Fargate launch type
    And CPU is set to 1024 (1 vCPU) and memory to 2048MB
    And the container image is pulled from ECR "dd0c-correlation-engine"

  Scenario: ECS service auto-scales based on SQS queue depth
    Given the CDK stack is synthesized
    When the auto-scaling configuration is inspected
    Then a step-scaling policy exists targeting SQS "ApproximateNumberOfMessagesVisible"
    And scale-out triggers when queue depth > 100 messages
    And scale-in triggers when queue depth < 10 messages
    And minimum capacity is 1 and maximum capacity is 10

  Scenario: ECS tasks run in private subnets with no public IP
    Given the CDK stack is synthesized
    When the ECS network configuration is inspected
    Then tasks are placed in private subnets
    And "assignPublicIp" is DISABLED
    And a NAT Gateway provides outbound internet access

Feature: CDK Stack — SQS Queues

Feature: CDK provisions SQS queues with correct configuration

  Scenario: Ingestion SQS queue has a Dead Letter Queue configured
    Given the CDK stack is synthesized
    When the SQS queue "dd0c-ingestion" is inspected
    Then a DLQ "dd0c-ingestion-dlq" is attached
    And maxReceiveCount is 3
    And the DLQ retention period is 14 days

  Scenario: SQS queue has server-side encryption enabled
    Given the CDK stack is synthesized
    When the SQS queue configuration is inspected
    Then SSE is enabled using an AWS-managed KMS key

  Scenario: SQS visibility timeout exceeds Lambda timeout
    Given the Lambda timeout is 30 seconds
    When the SQS queue visibility timeout is inspected
    Then the visibility timeout is at least 6x the Lambda timeout (180 seconds)

Feature: CDK Stack — DynamoDB

Feature: CDK provisions DynamoDB for incident storage

  Scenario: Incidents table has correct key schema
    Given the CDK stack is synthesized
    When the DynamoDB table "dd0c-incidents" is inspected
    Then the partition key is "tenant_id" (String)
    And the sort key is "incident_id" (String)

  Scenario: Incidents table has a GSI for status queries
    Given the CDK stack is synthesized
    When the GSIs on "dd0c-incidents" are inspected
    Then a GSI "status-created_at-index" exists
    And the GSI partition key is "status" and its sort key is "created_at"

  Scenario: DynamoDB table has point-in-time recovery enabled
    Given the CDK stack is synthesized
    When the DynamoDB table settings are inspected
    Then PITR is enabled on "dd0c-incidents"

  Scenario: DynamoDB TTL is configured for free-tier retention
    Given the CDK stack is synthesized
    When the DynamoDB TTL configuration is inspected
    Then TTL is enabled on attribute "expires_at"
    And free-tier records have "expires_at" set to 7 days from creation
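
The `expires_at` attribute is just creation time plus the plan's retention window in epoch seconds. A sketch combining this scenario's 7-day free-tier window with the 90-day pro retention described in Epic 9 (names are illustrative):

```typescript
// Retention windows per plan: 7 days for free tier (this scenario),
// 90 days for "pro" (per the Epic 9 retention scenarios).
const RETENTION_DAYS: Record<string, number> = { free: 7, pro: 90 };

// Compute the DynamoDB TTL attribute (epoch seconds) for a new record,
// defaulting unknown plans to the free-tier window.
function expiresAt(createdAtEpochSec: number, plan: string): number {
  const days = RETENTION_DAYS[plan] ?? RETENTION_DAYS.free;
  return createdAtEpochSec + days * 24 * 60 * 60;
}
```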

Feature: CI/CD Pipeline

Feature: CI/CD pipeline for automated deployment

  Scenario: Pull request triggers test suite
    Given a developer opens a pull request against "main"
    When the CI pipeline runs
    Then unit tests, integration tests, and CDK synth all pass before merge is allowed

  Scenario: Merge to main triggers staging deployment
    Given a PR is merged to "main"
    When the CD pipeline runs
    Then the CDK stack is deployed to the "staging" environment
    And smoke tests run against staging endpoints

  Scenario: Production deployment requires manual approval
    Given the staging deployment and smoke tests pass
    When the CD pipeline reaches the production stage
    Then a manual approval gate is presented
    And production deployment only proceeds after approval

  Scenario: Failed deployment triggers automatic rollback
    Given a production deployment fails health checks
    When the CD pipeline detects the failure
    Then the previous CloudFormation stack version is restored
    And a Slack alert is sent to "#dd0c-ops" with the rollback reason

  Scenario: CDK diff is posted as a PR comment
    Given a developer opens a PR with infrastructure changes
    When the CI pipeline runs "cdk diff"
    Then the diff output is posted as a comment on the PR

Epic 9: Onboarding & PLG

Feature: OAuth Signup

Feature: User signup via OAuth (Google / GitHub)

  Background:
    Given the signup page is at "/signup"

  Scenario: New user signs up with Google OAuth
    Given a new user visits "/signup"
    When the user clicks "Sign up with Google"
    And completes the Google OAuth flow
    Then a new tenant is created for the user's email domain
    And the user is assigned the "owner" role for the new tenant
    And the user is redirected to the onboarding wizard

  Scenario: New user signs up with GitHub OAuth
    Given a new user visits "/signup"
    When the user clicks "Sign up with GitHub"
    And completes the GitHub OAuth flow
    Then a new tenant is created
    And the user is redirected to the onboarding wizard

  Scenario: Existing user signs in via OAuth
    Given a user with email "alice@acme.com" already has an account
    When the user completes the Google OAuth flow
    Then no new tenant is created
    And the user is redirected to "/incidents"

  Scenario: OAuth failure shows user-friendly error
    Given the Google OAuth provider returns an error
    When the user is redirected back to the app
    Then an error message "Sign-in failed. Please try again." is displayed
    And no partial account is created

  Scenario: Signup is blocked for disposable email domains
    Given a user attempts to sign up with "user@mailinator.com"
    When the OAuth flow completes
    Then the signup is rejected with "Disposable email addresses are not allowed"
    And no tenant is created

Feature: Free Tier — 10K Alerts/Month Limit

Feature: Enforce free tier limit of 10,000 alerts per month

  Background:
    Given tenant "free-co" is on the free tier
    And the monthly alert counter is stored in DynamoDB

  Scenario: Alert ingestion succeeds under the 10K limit
    Given tenant "free-co" has ingested 9,999 alerts this month
    When a new alert arrives
    Then the alert is processed normally
    And the counter is incremented to 10,000

  Scenario: Alert ingestion is blocked at the 10K limit
    Given tenant "free-co" has ingested 10,000 alerts this month
    When a new alert arrives via webhook
    Then the webhook returns HTTP 429
    And the response body includes "Free tier limit reached. Upgrade to continue."
    And the alert is NOT processed or stored

  Scenario: Tenant receives email warning at 80% of limit
    Given tenant "free-co" has ingested 8,000 alerts this month
    When the 8,001st alert is ingested
    Then an email is sent to the tenant owner: "You've used 80% of your free tier quota"

  Scenario: Alert counter resets on the 1st of each month
    Given tenant "free-co" has ingested 10,000 alerts in January
    When February 1st arrives (UTC midnight)
    Then the monthly counter is reset to 0
    And alert ingestion is unblocked

  Scenario: Paid tenant has no alert ingestion limit
    Given tenant "paid-co" is on the "pro" plan
    And has ingested 50,000 alerts this month
    When a new alert arrives
    Then the alert is processed normally
    And no limit check is applied
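
The quota gate above comes down to two comparisons against the monthly counter: block at the cap, and flag the exact alert that crosses 80% so the warning email fires once. A sketch (names and return shape are illustrative):

```typescript
const FREE_TIER_LIMIT = 10_000;
const WARN_THRESHOLD = 8_000; // 80% of the cap

// Gate an incoming alert for a tenant: `allowed` controls whether ingestion
// proceeds (429 otherwise), `warn` is true only for the alert that pushes
// usage past 80%, so the warning email is sent exactly once. Paid plans
// bypass metering entirely.
function checkQuota(plan: string, usedThisMonth: number): { allowed: boolean; warn: boolean } {
  if (plan !== "free") return { allowed: true, warn: false };
  return {
    allowed: usedThisMonth < FREE_TIER_LIMIT,
    warn: usedThisMonth === WARN_THRESHOLD, // this alert is the 8,001st
  };
}
```

The monthly reset in the last scenario is then just setting `usedThisMonth` back to 0 at UTC midnight on the 1st.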

Feature: 7-Day Retention for Free Tier

Feature: Enforce 7-day data retention for free tier tenants

  Scenario: Free tier incidents older than 7 days are expired via DynamoDB TTL
    Given tenant "free-co" is on the free tier
    And incident "INC-OLD" was created 8 days ago
    When DynamoDB TTL runs
    Then "INC-OLD" is deleted from the incidents table

  Scenario: Free tier raw S3 archives older than 7 days are deleted
    Given tenant "free-co" has raw webhook archives in S3 from 8 days ago
    When the S3 lifecycle policy runs
    Then objects older than 7 days are deleted for free-tier tenants

  Scenario: Paid tier incidents are retained for 90 days
    Given tenant "paid-co" is on the "pro" plan
    And incident "INC-OLD" was created 30 days ago
    When DynamoDB TTL runs
    Then "INC-OLD" is NOT deleted

  Scenario: Retention policy is enforced per-tenant, not globally
    Given "free-co" and "paid-co" both have incidents from 10 days ago
    When TTL and lifecycle policies run
    Then "free-co"'s old incidents are deleted
    And "paid-co"'s old incidents are retained
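Per-tenant retention falls out naturally if the TTL attribute is computed from the tenant's plan at write time, rather than from a global table setting. A minimal sketch, assuming a `pk`/`sk` single-table layout and an `expires_at` TTL attribute (names are illustrative, not the real dd0c schema):

```python
import time

# Retention in days by plan (7-day free tier, 90-day pro, per the scenarios above).
RETENTION_DAYS = {"free": 7, "pro": 90}

def incident_item(tenant_id: str, incident_id: str, plan: str, now=None) -> dict:
    """Build an incidents-table item whose TTL reflects the tenant's plan.

    DynamoDB TTL deletes the item once the epoch timestamp in `expires_at`
    passes, so a free-tier item written today expires in 7 days while a
    pro-tier item written the same day survives for 90.
    """
    created = int(now if now is not None else time.time())
    ttl_days = RETENTION_DAYS.get(plan, RETENTION_DAYS["free"])
    return {
        "pk": f"TENANT#{tenant_id}",
        "sk": f"INCIDENT#{incident_id}",
        "created_at": created,
        "expires_at": created + ttl_days * 86_400,  # seconds per day
    }
```

One caveat this design implies: a plan change does not rewrite existing items, so an upgrade only extends retention for incidents created after the upgrade unless a backfill re-stamps `expires_at`.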

Feature: Stripe Billing Integration

Feature: Stripe billing for plan upgrades

  Scenario: User upgrades from free to pro plan via Stripe Checkout
    Given user "@alice" is on the free tier
    When "@alice" clicks "Upgrade to Pro" in the dashboard
    Then a Stripe Checkout session is created
    And the user is redirected to the Stripe-hosted payment page

  Scenario: Successful Stripe payment activates pro plan
    Given a Stripe Checkout session for tenant "acme" completes successfully
    When the Stripe "checkout.session.completed" webhook is received
    Then tenant "acme"'s plan is updated to "pro" in DynamoDB
    And the alert ingestion limit is removed
    And a confirmation email is sent to the tenant owner

  Scenario: Failed Stripe payment does not activate pro plan
    Given a Stripe payment fails
    When the Stripe "payment_intent.payment_failed" webhook is received
    Then the tenant remains on the free tier
    And a failure notification email is sent

  Scenario: Stripe webhook signature is validated
    Given a Stripe webhook arrives with an invalid "Stripe-Signature" header
    When the billing Lambda processes the request
    Then the response status is 401
    And the event is not processed

  Scenario: Subscription cancellation downgrades tenant to free tier
    Given tenant "acme" cancels their pro subscription
    When the Stripe "customer.subscription.deleted" webhook is received
    Then tenant "acme"'s plan is downgraded to "free"
    And the 10,000 alerts/month limit is re-applied at the start of the next billing cycle
    And a downgrade confirmation email is sent

  Scenario: Stripe billing is idempotent — duplicate webhook events are ignored
    Given a Stripe "checkout.session.completed" event was already processed
    When the same event is received again (Stripe retry)
    Then the tenant plan is not double-updated
    And the response is 200 (idempotent acknowledgment)
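The idempotency scenario hinges on recording each Stripe event id before acting on it: Stripe retries deliveries, so a duplicate id must be acknowledged with 200 without re-applying the plan change. The sketch below uses a plain dict as a stand-in for what would be a DynamoDB conditional put (`attribute_not_exists`) in the Lambda; in production the raw payload would first pass signature verification (e.g. Stripe's `Webhook.construct_event` helper, which checks the `Stripe-Signature` header and returns 401-worthy errors on mismatch) before reaching this dispatcher. Event-type strings are the real Stripe ones; everything else is illustrative.

```python
def handle_stripe_event(event: dict, processed: dict) -> tuple[int, str]:
    """Idempotent dispatch of an already-signature-verified Stripe event.

    `processed` stands in for a durable store with a conditional write;
    the first writer wins, every retry sees the recorded id and no-ops.
    """
    event_id = event["id"]
    if event_id in processed:
        return 200, "duplicate"  # acknowledge so Stripe stops retrying
    processed[event_id] = True

    etype = event["type"]
    if etype == "checkout.session.completed":
        # upgrade tenant plan to "pro", lift ingestion limit, send email (omitted)
        return 200, "upgraded"
    if etype == "payment_intent.payment_failed":
        # tenant stays on free tier; send failure notification (omitted)
        return 200, "payment_failed"
    if etype == "customer.subscription.deleted":
        # downgrade to "free", re-apply limit next cycle, send email (omitted)
        return 200, "downgraded"
    return 200, "ignored"  # unhandled event types are acknowledged, not errored
```

Returning 200 for duplicates and unhandled types matters: any non-2xx response makes Stripe keep retrying the same event.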

Feature: Webhook URL Generation

Feature: Generate unique webhook URLs per tenant per source

  Scenario: Webhook URL is generated on tenant creation
    Given a new tenant "new-co" is created via OAuth signup
    When the onboarding wizard runs
    Then a unique webhook URL is generated for each supported source
    And each URL follows the pattern "https://ingest.dd0c.io/webhooks/{source}/{tenant_id}/{token}"
    And tokens are cryptographically random (32 bytes, URL-safe base64)

  Scenario: Webhook token is stored hashed in DynamoDB
    Given a webhook token is generated for tenant "new-co"
    When the token is stored
    Then only the SHA-256 hash of the token is stored in DynamoDB
    And the plaintext token is shown to the user exactly once

  Scenario: Webhook URL is validated on each ingestion request
    Given a request arrives at "POST /webhooks/datadog/{tenant_id}/{token}"
    When the ingestion Lambda validates the token
    Then the token hash is looked up in DynamoDB for the given tenant_id
    And if the hash matches, the request is accepted
    And if the hash does not match, the response is 401

  Scenario: 60-second time-to-value — first correlated alert within 60 seconds of webhook setup
    Given a new tenant completes the onboarding wizard and copies their webhook URL
    When the tenant sends their first alert to the webhook URL
    Then within 60 seconds, a correlated incident appears in the dashboard
    And a Slack notification is sent if Slack is configured
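The token scenarios above (32 random bytes, URL-safe base64, SHA-256 hash at rest, hash lookup on every request) can be sketched directly with the standard library. Function names are illustrative; the constant-time comparison is an extra precaution against timing side channels, not something the scenarios mandate.

```python
import base64
import hashlib
import hmac
import secrets

def generate_webhook_token() -> tuple[str, str]:
    """Return (plaintext_token, sha256_hex).

    Only the hash is persisted in DynamoDB; the plaintext is shown to the
    user exactly once, in the onboarding wizard.
    """
    raw = secrets.token_bytes(32)  # cryptographically random, per the spec
    token = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return token, hashlib.sha256(token.encode("ascii")).hexdigest()

def validate_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare against the stored hash.

    On mismatch the ingestion Lambda responds 401; on match, the request
    proceeds to normalization.
    """
    digest = hashlib.sha256(presented.encode("ascii")).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

Because only the hash is stored, a leaked database dump does not expose usable webhook URLs; a lost token can only be rotated, never recovered.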