# dd0c/alert — Alert Intelligence: BDD Acceptance Test Specifications > Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic. --- ## Epic 1: Webhook Ingestion ### Feature: HMAC Signature Validation — Datadog ```gherkin Feature: HMAC signature validation for Datadog webhooks As the ingestion layer I want to reject requests with invalid or missing HMAC signatures So that only legitimate Datadog payloads are processed Background: Given the webhook endpoint is "POST /webhooks/datadog" And a valid Datadog HMAC secret is configured as "dd-secret-abc123" Scenario: Valid Datadog HMAC signature is accepted Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}' And the request includes header "X-Datadog-Webhook-ID" with a valid HMAC-SHA256 signature When the Lambda ingestion handler receives the request Then the response status is 200 And the payload is forwarded to the normalization SQS queue Scenario: Missing HMAC signature header is rejected Given a Datadog alert payload with body '{"title":"CPU spike","severity":"high"}' And the request has no "X-Datadog-Webhook-ID" header When the Lambda ingestion handler receives the request Then the response status is 401 And the payload is NOT forwarded to SQS And a rejection event is logged with reason "missing_signature" Scenario: Tampered payload with mismatched HMAC is rejected Given a Datadog alert payload And the HMAC signature was computed over a different payload body When the Lambda ingestion handler receives the request Then the response status is 401 And the payload is NOT forwarded to SQS And a rejection event is logged with reason "signature_mismatch" Scenario: Replay attack with expired timestamp is rejected Given a Datadog alert payload with a valid HMAC signature And the request timestamp is more than 5 minutes in the past When the Lambda ingestion handler receives the request Then the response status is 401 And the rejection reason is "timestamp_expired" 
Scenario: HMAC secret rotation — old secret still accepted during grace period Given the Datadog HMAC secret was rotated 2 minutes ago And the request uses the previous secret for signing When the Lambda ingestion handler receives the request Then the response status is 200 And a warning metric "hmac_old_secret_used" is emitted ``` ### Feature: HMAC Signature Validation — PagerDuty ```gherkin Feature: HMAC signature validation for PagerDuty webhooks Background: Given the webhook endpoint is "POST /webhooks/pagerduty" And a valid PagerDuty signing secret is configured Scenario: Valid PagerDuty v3 signature is accepted Given a PagerDuty webhook payload And the request includes "X-PagerDuty-Signature" with a valid v3 HMAC-SHA256 value When the Lambda ingestion handler receives the request Then the response status is 200 And the payload is enqueued for normalization Scenario: PagerDuty v1 signature (legacy) is rejected Given a PagerDuty webhook payload signed with v1 scheme When the Lambda ingestion handler receives the request Then the response status is 401 And the rejection reason is "unsupported_signature_version" Scenario: Missing signature on PagerDuty webhook Given a PagerDuty webhook payload with no signature header When the Lambda ingestion handler receives the request Then the response status is 401 ``` ### Feature: HMAC Signature Validation — OpsGenie ```gherkin Feature: HMAC signature validation for OpsGenie webhooks Background: Given the webhook endpoint is "POST /webhooks/opsgenie" And a valid OpsGenie integration API key is configured Scenario: Valid OpsGenie HMAC is accepted Given an OpsGenie alert payload And the request includes "X-OG-Delivery-Signature" with a valid HMAC-SHA256 value When the Lambda ingestion handler receives the request Then the response status is 200 Scenario: Invalid OpsGenie signature is rejected Given an OpsGenie alert payload with a forged signature When the Lambda ingestion handler receives the request Then the response status 
is 401 And the rejection reason is "signature_mismatch" ``` ### Feature: HMAC Signature Validation — Grafana ```gherkin Feature: HMAC signature validation for Grafana webhooks Background: Given the webhook endpoint is "POST /webhooks/grafana" And a Grafana webhook secret is configured Scenario: Valid Grafana signature is accepted Given a Grafana alert payload And the request includes "X-Grafana-Signature" with a valid HMAC-SHA256 value When the Lambda ingestion handler receives the request Then the response status is 200 Scenario: Grafana webhook with no secret configured (open mode) is accepted with warning Given no Grafana webhook secret is configured for the tenant And a Grafana alert payload arrives without a signature header When the Lambda ingestion handler receives the request Then the response status is 200 And a warning metric "grafana_unauthenticated_webhook" is emitted ``` ### Feature: Payload Normalization to Canonical Schema ```gherkin Feature: Normalize incoming webhook payloads to canonical alert schema Scenario: Datadog payload is normalized to canonical schema Given a raw Datadog webhook payload with fields "title", "severity", "host", "tags" When the normalization Lambda processes the payload Then the canonical alert contains: | field | value | | source | datadog | | severity | mapped from Datadog severity | | service | extracted from tags | | fingerprint | SHA-256 of source+title+host | | received_at | ISO-8601 timestamp | | raw_payload | original JSON preserved | Scenario: PagerDuty payload is normalized to canonical schema Given a raw PagerDuty v3 webhook payload When the normalization Lambda processes the payload Then the canonical alert contains "source" = "pagerduty" And "severity" is mapped from PagerDuty urgency field And "service" is extracted from the PagerDuty service name Scenario: OpsGenie payload is normalized to canonical schema Given a raw OpsGenie webhook payload When the normalization Lambda processes the payload Then the 
canonical alert contains "source" = "opsgenie" And "severity" is mapped from OpsGenie priority field Scenario: Grafana payload is normalized to canonical schema Given a raw Grafana alerting webhook payload When the normalization Lambda processes the payload Then the canonical alert contains "source" = "grafana" And "severity" is mapped from Grafana alert state Scenario: Unknown source type returns 400 Given a webhook payload posted to "/webhooks/unknown-source" When the Lambda ingestion handler receives the request Then the response status is 400 And the error reason is "unknown_source" Scenario: Malformed JSON payload returns 400 Given a request body that is not valid JSON When the Lambda ingestion handler receives the request Then the response status is 400 And the error reason is "invalid_json" ``` ### Feature: Async S3 Archival ```gherkin Feature: Archive raw webhook payloads to S3 asynchronously Scenario: Every accepted payload is archived to S3 Given a valid Datadog webhook payload is received and accepted When the Lambda ingestion handler processes the request Then the raw payload is written to S3 bucket "dd0c-raw-webhooks" And the S3 key follows the pattern "raw/{source}/{tenant_id}/{YYYY}/{MM}/{DD}/{uuid}.json" And the archival happens asynchronously (does not block the 200 response) Scenario: S3 archival failure does not fail the ingestion Given a valid webhook payload is received And the S3 write operation fails with a transient error When the Lambda ingestion handler processes the request Then the response status is still 200 And the payload is still forwarded to SQS And an error metric "s3_archival_failure" is emitted Scenario: Archived payload includes tenant ID and trace context Given a valid webhook payload from tenant "tenant-xyz" When the payload is archived to S3 Then the S3 object metadata includes "tenant_id" = "tenant-xyz" And the S3 object metadata includes the OTEL trace ID ``` ### Feature: SQS Payload Size Limit (256KB) ```gherkin Feature: 
Handle SQS 256KB payload size limit during ingestion Scenario: Payload under 256KB is sent directly to SQS Given a normalized canonical alert payload of 10KB When the ingestion Lambda forwards it to SQS Then the message is placed on the SQS queue directly And no S3 pointer pattern is used Scenario: Payload exceeding 256KB is stored in S3 and pointer sent to SQS Given a normalized canonical alert payload of 300KB (e.g. large raw_payload) When the ingestion Lambda attempts to forward it to SQS Then the full payload is stored in S3 under "sqs-overflow/{uuid}.json" And an SQS message is sent containing only the S3 pointer and metadata And the SQS message size is under 256KB Scenario: Correlation engine retrieves oversized payload from S3 pointer Given an SQS message containing an S3 pointer for an oversized payload When the correlation engine consumer reads the SQS message Then it fetches the full payload from S3 using the pointer And processes it as a normal canonical alert Scenario: S3 pointer fetch fails in correlation engine Given an SQS message containing an S3 pointer And the S3 object has been deleted or is unavailable When the correlation engine attempts to fetch the payload Then the message is sent to the Dead Letter Queue And an alert metric "sqs_pointer_fetch_failure" is emitted ``` ### Feature: Dead Letter Queue Handling ```gherkin Feature: Dead Letter Queue overflow and monitoring Scenario: Message failing max retries is moved to DLQ Given an SQS message that causes a processing error on every attempt When the message has been retried 3 times (maxReceiveCount = 3) Then the message is automatically moved to the DLQ "dd0c-ingestion-dlq" And a CloudWatch alarm "DLQDepthHigh" is triggered when DLQ depth > 10 Scenario: DLQ overflow triggers operational alert Given the DLQ contains more than 100 messages When the DLQ depth CloudWatch alarm fires Then a Slack notification is sent to the ops channel "#dd0c-ops" And the notification includes the DLQ name and 
current depth Scenario: DLQ messages can be replayed after fix Given 50 messages are sitting in the DLQ When an operator triggers the DLQ replay Lambda Then messages are moved back to the main SQS queue in batches of 10 And each replayed message retains its original trace context ``` --- ## Epic 2: Alert Normalization ### Feature: Datadog Source Parser ```gherkin Feature: Parse and normalize Datadog alert payloads Background: Given the Datadog parser is registered for source "datadog" Scenario: Datadog "alert" event maps to severity "critical" Given a Datadog payload with "alert_type" = "error" and "priority" = "P1" When the Datadog parser processes the payload Then the canonical alert "severity" = "critical" Scenario: Datadog "warning" event maps to severity "warning" Given a Datadog payload with "alert_type" = "warning" When the Datadog parser processes the payload Then the canonical alert "severity" = "warning" Scenario: Datadog "recovery" event maps to status "resolved" Given a Datadog payload with "alert_type" = "recovery" When the Datadog parser processes the payload Then the canonical alert "status" = "resolved" And the canonical alert "resolved_at" is set to the event timestamp Scenario: Service extracted from Datadog tags Given a Datadog payload with "tags" = ["service:payments", "env:prod", "team:backend"] When the Datadog parser processes the payload Then the canonical alert "service" = "payments" And the canonical alert "environment" = "prod" Scenario: Service tag absent — service defaults to hostname Given a Datadog payload with no "service:" tag And the payload contains "host" = "payments-worker-01" When the Datadog parser processes the payload Then the canonical alert "service" = "payments-worker-01" Scenario: Fingerprint is deterministic for identical alerts Given two identical Datadog payloads with the same title, host, and tags When both are processed by the Datadog parser Then both canonical alerts have the same "fingerprint" value Scenario: 
Fingerprint differs when title changes Given two Datadog payloads differing only in "title" When both are processed by the Datadog parser Then the canonical alerts have different "fingerprint" values Scenario: Datadog payload missing required "title" field Given a Datadog payload with no "title" field When the Datadog parser processes the payload Then a normalization error is raised with reason "missing_required_field:title" And the alert is sent to the normalization DLQ ``` ### Feature: PagerDuty Source Parser ```gherkin Feature: Parse and normalize PagerDuty webhook payloads Background: Given the PagerDuty parser is registered for source "pagerduty" Scenario: PagerDuty "trigger" event creates a new canonical alert Given a PagerDuty v3 webhook with "event_type" = "incident.triggered" When the PagerDuty parser processes the payload Then the canonical alert "status" = "firing" And "source" = "pagerduty" Scenario: PagerDuty "acknowledge" event updates alert status Given a PagerDuty v3 webhook with "event_type" = "incident.acknowledged" When the PagerDuty parser processes the payload Then the canonical alert "status" = "acknowledged" Scenario: PagerDuty "resolve" event updates alert status Given a PagerDuty v3 webhook with "event_type" = "incident.resolved" When the PagerDuty parser processes the payload Then the canonical alert "status" = "resolved" Scenario: PagerDuty urgency "high" maps to severity "critical" Given a PagerDuty payload with "urgency" = "high" When the PagerDuty parser processes the payload Then the canonical alert "severity" = "critical" Scenario: PagerDuty urgency "low" maps to severity "warning" Given a PagerDuty payload with "urgency" = "low" When the PagerDuty parser processes the payload Then the canonical alert "severity" = "warning" Scenario: PagerDuty service name is extracted correctly Given a PagerDuty payload with "service.name" = "checkout-api" When the PagerDuty parser processes the payload Then the canonical alert "service" = 
"checkout-api" Scenario: PagerDuty dedup key used as fingerprint seed Given a PagerDuty payload with "dedup_key" = "pd-dedup-xyz789" When the PagerDuty parser processes the payload Then the canonical alert "fingerprint" incorporates "pd-dedup-xyz789" ``` ### Feature: OpsGenie Source Parser ```gherkin Feature: Parse and normalize OpsGenie webhook payloads Background: Given the OpsGenie parser is registered for source "opsgenie" Scenario: OpsGenie "Create" action maps to status "firing" Given an OpsGenie webhook with "action" = "Create" When the OpsGenie parser processes the payload Then the canonical alert "status" = "firing" Scenario: OpsGenie "Close" action maps to status "resolved" Given an OpsGenie webhook with "action" = "Close" When the OpsGenie parser processes the payload Then the canonical alert "status" = "resolved" Scenario: OpsGenie "Acknowledge" action maps to status "acknowledged" Given an OpsGenie webhook with "action" = "Acknowledge" When the OpsGenie parser processes the payload Then the canonical alert "status" = "acknowledged" Scenario: OpsGenie priority P1 maps to severity "critical" Given an OpsGenie payload with "priority" = "P1" When the OpsGenie parser processes the payload Then the canonical alert "severity" = "critical" Scenario: OpsGenie priority P3 maps to severity "warning" Given an OpsGenie payload with "priority" = "P3" When the OpsGenie parser processes the payload Then the canonical alert "severity" = "warning" Scenario: OpsGenie priority P5 maps to severity "info" Given an OpsGenie payload with "priority" = "P5" When the OpsGenie parser processes the payload Then the canonical alert "severity" = "info" Scenario: OpsGenie tags used for service extraction Given an OpsGenie payload with "tags" = ["service:inventory", "region:us-east-1"] When the OpsGenie parser processes the payload Then the canonical alert "service" = "inventory" ``` ### Feature: Grafana Source Parser ```gherkin Feature: Parse and normalize Grafana alerting webhook 
payloads Background: Given the Grafana parser is registered for source "grafana" Scenario: Grafana "alerting" state maps to status "firing" Given a Grafana webhook with "state" = "alerting" When the Grafana parser processes the payload Then the canonical alert "status" = "firing" Scenario: Grafana "ok" state maps to status "resolved" Given a Grafana webhook with "state" = "ok" When the Grafana parser processes the payload Then the canonical alert "status" = "resolved" Scenario: Grafana "no_data" state maps to severity "warning" Given a Grafana webhook with "state" = "no_data" When the Grafana parser processes the payload Then the canonical alert "severity" = "warning" And the canonical alert "status" = "firing" Scenario: Grafana panel URL preserved in canonical alert metadata Given a Grafana webhook with "ruleUrl" = "https://grafana.example.com/d/abc/panel" When the Grafana parser processes the payload Then the canonical alert "metadata.dashboard_url" = "https://grafana.example.com/d/abc/panel" Scenario: Grafana multi-alert payload (multiple evalMatches) creates one alert per match Given a Grafana webhook with 3 "evalMatches" entries When the Grafana parser processes the payload Then 3 canonical alerts are produced And each has a unique fingerprint based on the metric name and tags ``` ### Feature: Canonical Alert Schema Validation ```gherkin Feature: Validate canonical alert schema completeness Scenario: Canonical alert with all required fields passes validation Given a canonical alert with fields: source, severity, status, service, fingerprint, received_at, tenant_id When schema validation runs Then the alert passes validation Scenario: Canonical alert missing "tenant_id" fails validation Given a canonical alert with no "tenant_id" field When schema validation runs Then validation fails with error "missing_required_field:tenant_id" And the alert is rejected before SQS enqueue Scenario: Canonical alert with unknown severity value fails validation Given a canonical 
alert with "severity" = "super-critical" When schema validation runs Then validation fails with error "invalid_enum_value:severity" Scenario: Canonical alert schema is additive — unknown extra fields are preserved Given a canonical alert with an extra field "custom_label" = "team-alpha" When schema validation runs Then the alert passes validation And "custom_label" is preserved in the alert document ``` --- ## Epic 3: Correlation Engine ### Feature: Time-Window Grouping ```gherkin Feature: Group alerts into incidents using time-window correlation Background: Given the correlation engine is running on ECS Fargate And the default correlation time window is 5 minutes Scenario: Two alerts for the same service within the time window are grouped Given a canonical alert for service "payments" arrives at T=0 And a second canonical alert for service "payments" arrives at T=3min When the correlation engine processes both alerts Then they are grouped into a single incident And the incident "alert_count" = 2 Scenario: Two alerts for the same service outside the time window are NOT grouped Given a canonical alert for service "payments" arrives at T=0 And a second canonical alert for service "payments" arrives at T=6min When the correlation engine processes both alerts Then they are placed in separate incidents Scenario: Time window is configurable per tenant Given tenant "enterprise-co" has a custom correlation window of 10 minutes And two alerts for the same service arrive 8 minutes apart When the correlation engine processes both alerts Then they are grouped into a single incident for tenant "enterprise-co" Scenario: Alerts from different services within the time window are NOT grouped by default Given a canonical alert for service "payments" at T=0 And a canonical alert for service "auth" at T=1min When the correlation engine processes both alerts Then they are placed in separate incidents ``` ### Feature: Service-Affinity Matching ```gherkin Feature: Group alerts across 
related services using service-affinity rules Background: Given a service-affinity rule: ["payments", "checkout", "cart"] are related Scenario: Alerts from affinity-grouped services are correlated into one incident Given a canonical alert for service "payments" at T=0 And a canonical alert for service "checkout" at T=2min When the correlation engine applies service-affinity matching Then both alerts are grouped into a single incident And the incident "root_service" is set to the first-seen service "payments" Scenario: Alert from a service not in the affinity group is not merged Given a canonical alert for service "payments" at T=0 And a canonical alert for service "logging" at T=1min When the correlation engine applies service-affinity matching Then they remain in separate incidents Scenario: Service-affinity rules are tenant-scoped Given tenant "acme" has affinity rule ["api", "gateway"] And tenant "globex" has no affinity rules And both tenants receive alerts for "api" and "gateway" simultaneously When the correlation engine processes both tenants' alerts Then "acme"'s alerts are grouped into one incident And "globex"'s alerts remain in separate incidents ``` ### Feature: Fingerprint Deduplication ```gherkin Feature: Deduplicate alerts with identical fingerprints Scenario: Duplicate alert with same fingerprint within dedup window is suppressed Given a canonical alert with fingerprint "fp-abc123" is received at T=0 And an identical alert with fingerprint "fp-abc123" arrives at T=30sec When the correlation engine checks the Redis dedup window Then the second alert is suppressed (not added as a new alert) And the incident "duplicate_count" is incremented by 1 Scenario: Same fingerprint outside dedup window creates a new alert Given a canonical alert with fingerprint "fp-abc123" was processed at T=0 And the dedup window is 10 minutes And the same fingerprint arrives at T=11min When the correlation engine checks the Redis dedup window Then the alert is treated as a 
new occurrence And a new incident entry is created Scenario: Different fingerprints are never deduplicated Given two alerts with different fingerprints "fp-abc123" and "fp-xyz789" When the correlation engine processes both Then both are treated as distinct alerts Scenario: Dedup counter is visible in incident metadata Given fingerprint "fp-abc123" has been suppressed 5 times When the incident is retrieved via the Dashboard API Then the incident "dedup_count" = 5 ``` ### Feature: Redis Sliding Window ```gherkin Feature: Redis sliding windows for correlation state management Background: Given Redis is available and the sliding window TTL is 10 minutes Scenario: Alert fingerprint is stored in Redis on first occurrence Given a canonical alert with fingerprint "fp-new001" arrives When the correlation engine processes the alert Then a Redis key "dedup:{tenant_id}:fp-new001" is set with TTL 10 minutes Scenario: Redis key TTL is refreshed on each matching alert Given a Redis key "dedup:{tenant_id}:fp-new001" exists with 2 minutes remaining And a new alert with fingerprint "fp-new001" arrives When the correlation engine processes the alert Then the Redis key TTL is reset to 10 minutes Scenario: Redis unavailability causes correlation engine to fail open Given Redis is unreachable When a canonical alert arrives for processing Then the alert is processed without deduplication And a metric "redis_unavailable_failopen" is emitted And the alert is NOT dropped Scenario: Redis sliding window is tenant-isolated Given tenant "alpha" has fingerprint "fp-shared" in Redis And tenant "beta" sends an alert with fingerprint "fp-shared" When the correlation engine checks the dedup window Then tenant "beta"'s alert is NOT suppressed And tenant "alpha"'s dedup state is unaffected ``` ### Feature: Cross-Tenant Isolation in Correlation ```gherkin Feature: Prevent cross-tenant alert bleed in correlation engine Scenario: Alerts from different tenants with same fingerprint are never correlated 
Given tenant "alpha" sends alert with fingerprint "fp-shared" at T=0 And tenant "beta" sends alert with fingerprint "fp-shared" at T=1min When the correlation engine processes both alerts Then each alert is placed in its own tenant-scoped incident And no incident contains alerts from both tenants Scenario: Tenant ID is validated before correlation lookup Given a canonical alert arrives with "tenant_id" = "" When the correlation engine attempts to process the alert Then the alert is rejected with error "missing_tenant_id" And the alert is sent to the correlation DLQ Scenario: Correlation engine worker processes only one tenant's partition at a time Given SQS messages are partitioned by tenant_id When the ECS Fargate worker picks up a batch of messages Then all messages in the batch belong to the same tenant And no cross-tenant data is loaded into the worker's memory context ``` ### Feature: OTEL Trace Propagation Across SQS Boundary ```gherkin Feature: Propagate OpenTelemetry trace context across SQS ingestion-to-correlation boundary Scenario: Trace context is injected into SQS message attributes at ingestion Given a webhook request arrives with OTEL trace header "traceparent: 00-abc123-def456-01" When the ingestion Lambda enqueues the message to SQS Then the SQS message attributes include "traceparent" = "00-abc123-def456-01" And the SQS message attributes include "tracestate" if present in the original request Scenario: Correlation engine extracts and continues trace from SQS message Given an SQS message with "traceparent" attribute "00-abc123-def456-01" When the correlation engine consumer reads the message Then a child span is created with parent trace ID "abc123" And all subsequent operations (Redis lookup, DynamoDB write) are children of this span Scenario: Missing trace context in SQS message starts a new trace Given an SQS message with no "traceparent" attribute When the correlation engine consumer reads the message Then a new root trace is started And a 
metric "trace_context_missing" is emitted Scenario: Trace ID is stored on the incident record Given a correlated incident is created from an alert with trace ID "abc123" When the incident is written to DynamoDB Then the incident document includes "trace_id" = "abc123" ``` --- ## Epic 4: Notification & Escalation ### Feature: Slack Block Kit Incident Notifications ```gherkin Feature: Send Slack Block Kit notifications for new incidents Background: Given a Slack webhook URL is configured for tenant "acme" And the Slack notification Lambda is subscribed to the incidents SNS topic Scenario: New critical incident triggers Slack notification Given a new incident is created with severity "critical" for service "payments" When the notification Lambda processes the incident event Then a Slack Block Kit message is posted to the configured channel And the message includes the incident ID, service name, severity, and timestamp And the message includes action buttons: "Acknowledge", "Escalate", "Mark as Noise" Scenario: New warning incident triggers Slack notification Given a new incident is created with severity "warning" When the notification Lambda processes the incident event Then a Slack message is posted with severity badge "⚠️ WARNING" Scenario: Resolved incident posts a resolution message to Slack Given an existing incident "INC-001" transitions to status "resolved" When the notification Lambda processes the resolution event Then a Slack message is posted indicating "INC-001 resolved" And the message includes time-to-resolution duration Scenario: Slack Block Kit message includes alert count for correlated incidents Given an incident contains 7 correlated alerts When the Slack notification is sent Then the message body includes "7 correlated alerts" Scenario: Slack notification includes dashboard deep-link Given a new incident "INC-042" is created When the Slack notification is sent Then the message includes a button "View in Dashboard" linking to "/incidents/INC-042" 
``` ### Feature: Severity-Based Routing ```gherkin Feature: Route notifications to different Slack channels based on severity Background: Given tenant "acme" has configured: | severity | channel | | critical | #incidents-p1 | | warning | #incidents-p2 | | info | #monitoring-feed | Scenario: Critical incident is routed to P1 channel Given a new incident with severity "critical" When the notification Lambda routes the alert Then the Slack message is posted to "#incidents-p1" Scenario: Warning incident is routed to P2 channel Given a new incident with severity "warning" When the notification Lambda routes the alert Then the Slack message is posted to "#incidents-p2" Scenario: Info incident is routed to monitoring feed Given a new incident with severity "info" When the notification Lambda routes the alert Then the Slack message is posted to "#monitoring-feed" Scenario: No routing rule configured — falls back to default channel Given tenant "beta" has only a default channel "#alerts" configured And a new incident with severity "critical" arrives When the notification Lambda routes the alert Then the Slack message is posted to "#alerts" ``` ### Feature: Escalation to PagerDuty if Unacknowledged ```gherkin Feature: Escalate unacknowledged critical incidents to PagerDuty Background: Given the escalation check runs every minute via EventBridge And the escalation threshold for "critical" incidents is 15 minutes Scenario: Unacknowledged critical incident is escalated after threshold Given a critical incident "INC-001" was created 16 minutes ago And "INC-001" has not been acknowledged When the escalation Lambda runs Then a PagerDuty incident is created via the PagerDuty Events API v2 And the incident "INC-001" status is updated to "escalated" And a Slack message is posted: "INC-001 escalated to PagerDuty" Scenario: Acknowledged incident is NOT escalated Given a critical incident "INC-002" was created 20 minutes ago And "INC-002" was acknowledged 5 minutes ago When the 
escalation Lambda runs Then no PagerDuty incident is created for "INC-002" Scenario: Warning incident has a longer escalation threshold Given the escalation threshold for "warning" incidents is 60 minutes And a warning incident "INC-003" was created 45 minutes ago and is unacknowledged When the escalation Lambda runs Then no PagerDuty incident is created for "INC-003" Scenario: Escalation is idempotent — already-escalated incident is not re-escalated Given incident "INC-004" was already escalated to PagerDuty When the escalation Lambda runs again Then no duplicate PagerDuty incident is created And the escalation Lambda logs "already_escalated:INC-004" Scenario: PagerDuty API failure during escalation is retried Given incident "INC-005" is due for escalation And the PagerDuty Events API returns a 500 error When the escalation Lambda attempts to create the PagerDuty incident Then the Lambda retries up to 3 times with exponential backoff And if all retries fail, an error metric "pagerduty_escalation_failure" is emitted And the incident is flagged for manual review ``` ### Feature: Daily Noise Report ```gherkin Feature: Generate and send daily noise reduction report Background: Given the daily report Lambda runs at 08:00 UTC via EventBridge Scenario: Daily noise report is sent to configured Slack channel Given tenant "acme" has 500 raw alerts and 80 incidents in the past 24 hours When the daily report Lambda runs Then a Slack message is posted to "#dd0c-digest" And the message includes: | metric | value | | total_alerts | 500 | | correlated_incidents | 80 | | noise_reduction_percent | 84% | | top_noisy_service | shown | Scenario: Daily report includes MTTR for resolved incidents Given 20 incidents were resolved in the past 24 hours with an average MTTR of 23 minutes When the daily report Lambda runs Then the Slack message includes "Avg MTTR: 23 min" Scenario: Daily report is skipped if no alerts in past 24 hours Given tenant "quiet-co" had 0 alerts in the past 24 hours 
    When the daily report Lambda runs
    Then no Slack message is sent for "quiet-co"

  Scenario: Daily report is tenant-scoped — no cross-tenant data leakage
    Given tenants "alpha" and "beta" both have activity
    When the daily report Lambda runs
    Then "alpha"'s report contains only "alpha"'s metrics
    And "beta"'s report contains only "beta"'s metrics
```

### Feature: Slack Rate Limiting

```gherkin
Feature: Handle Slack API rate limiting gracefully

  Scenario: Slack returns 429 Too Many Requests — notification is retried
    Given a Slack notification needs to be sent
    And Slack returns HTTP 429 with "Retry-After: 5"
    When the notification Lambda handles the response
    Then the Lambda waits 5 seconds before retrying
    And the notification is eventually delivered

  Scenario: Slack rate limit persists beyond Lambda timeout — message queued for retry
    Given Slack is rate-limiting for 30 seconds
    And the Lambda timeout is 15 seconds
    When the notification Lambda cannot deliver within its timeout
    Then the SQS message is not deleted (remains visible after visibility timeout)
    And the message is retried by the next Lambda invocation

  Scenario: Burst of 50 incidents triggers Slack rate limit protection
    Given 50 incidents are created within 1 second
    When the notification Lambda processes the burst
    Then notifications are batched and sent with 1-second delays between batches
    And all 50 notifications are eventually delivered
    And a metric "slack_rate_limit_batching" is emitted
```

---

## Epic 5: Slack Bot

### Feature: Interactive Feedback Buttons

```gherkin
Feature: Slack interactive feedback buttons on incident notifications

  Background:
    Given an incident notification was posted to Slack with buttons: "Helpful", "Noise", "Escalate"
    And the Slack interactivity endpoint is "POST /slack/interactions"

  Scenario: User clicks "Helpful" on an incident notification
    Given user "@alice" clicks the "Helpful" button on incident "INC-007"
    When the Slack interaction payload is received
    Then the incident "INC-007" feedback is recorded as "helpful"
    And the Slack message is updated to show "✅ Marked helpful by @alice"
    And the button is disabled to prevent duplicate feedback

  Scenario: User clicks "Noise" on an incident notification
    Given user "@bob" clicks the "Noise" button on incident "INC-008"
    When the Slack interaction payload is received
    Then the incident "INC-008" feedback is recorded as "noise"
    And the incident "noise_score" is incremented
    And the Slack message is updated to show "🔇 Marked as noise by @bob"

  Scenario: User clicks "Escalate" on an incident notification
    Given user "@carol" clicks the "Escalate" button on incident "INC-009"
    When the Slack interaction payload is received
    Then the incident "INC-009" is immediately escalated to PagerDuty
    And the Slack message is updated to show "🚨 Escalated by @carol"
    And the escalation bypasses the normal time threshold

  Scenario: Feedback on an already-resolved incident is rejected
    Given incident "INC-010" has status "resolved"
    And user "@dave" clicks "Helpful" on the stale Slack message
    When the Slack interaction payload is received
    Then the Slack message is updated to show "⚠️ Incident already resolved"
    And no feedback is recorded

  Scenario: Slack interaction payload signature is validated
    Given a Slack interaction request with an invalid "X-Slack-Signature" header
    When the interaction endpoint receives the request
    Then the response status is 401
    And the interaction is not processed

  Scenario: Duplicate button click by same user is idempotent
    Given user "@alice" already marked incident "INC-007" as "helpful"
    And "@alice" clicks "Helpful" again on the same message
    When the Slack interaction payload is received
    Then the feedback count is NOT incremented again
    And the response acknowledges the duplicate gracefully
```

### Feature: Slash Command — /dd0c status

```gherkin
Feature: /dd0c status slash command

  Background:
    Given the Slack slash command "/dd0c" is registered
    And the command handler endpoint is "POST /slack/commands"

  Scenario: /dd0c status returns current open incident count
    Given tenant "acme" has 3 open critical incidents and 5 open warning incidents
    When user "@alice" runs "/dd0c status" in the Slack workspace
    Then the bot responds ephemerally with:
      | metric           | value |
      | open_critical    | 3     |
      | open_warning     | 5     |
      | alerts_last_hour | shown |
      | system_status    | OK    |

  Scenario: /dd0c status when no open incidents
    Given tenant "acme" has 0 open incidents
    When user "@alice" runs "/dd0c status"
    Then the bot responds with "✅ All clear — no open incidents"

  Scenario: /dd0c status responds within Slack's 3-second timeout
    Given the command handler receives "/dd0c status"
    When the handler processes the request
    Then an HTTP 200 response is returned within 3 seconds
    And if data retrieval takes longer, an immediate acknowledgment is sent
    And the full response is delivered via response_url

  Scenario: /dd0c status is scoped to the requesting tenant
    Given user "@alice" belongs to tenant "acme"
    When "@alice" runs "/dd0c status"
    Then the response contains only "acme"'s incident data
    And no data from other tenants is included
```

### Feature: Slash Command — /dd0c anomalies

```gherkin
Feature: /dd0c anomalies slash command

  Scenario: /dd0c anomalies returns top noisy services in the last 24 hours
    Given service "payments" fired 120 alerts in the last 24 hours
    And service "auth" fired 80 alerts
    And service "logging" fired 10 alerts
    When user "@alice" runs "/dd0c anomalies"
    Then the bot responds with a ranked list:
      | rank | service  | alert_count |
      | 1    | payments | 120         |
      | 2    | auth     | 80          |
      | 3    | logging  | 10          |

  Scenario: /dd0c anomalies with time range argument
    Given user "@alice" runs "/dd0c anomalies --last 7d"
    When the command handler processes the request
    Then the response covers the last 7 days of anomaly data

  Scenario: /dd0c anomalies with no data returns helpful message
    Given no alerts have been received in the last 24 hours
    When user "@alice" runs "/dd0c anomalies"
    Then the bot responds with "No anomalies detected in the last 24 hours"
```

### Feature: Slash Command — /dd0c digest

```gherkin
Feature: /dd0c digest slash command

  Scenario: /dd0c digest returns on-demand summary report
    Given tenant "acme" has activity in the last 24 hours
    When user "@alice" runs "/dd0c digest"
    Then the bot responds with a summary matching the daily noise report format
    And the response includes total alerts, incidents, noise reduction %, and avg MTTR

  Scenario: /dd0c digest with custom time range
    Given user "@alice" runs "/dd0c digest --last 7d"
    When the command handler processes the request
    Then the digest covers the last 7 days

  Scenario: Unauthorized user cannot run /dd0c commands
    Given user "@mallory" is not a member of any configured tenant workspace
    When "@mallory" runs "/dd0c status"
    Then the bot responds ephemerally with "⛔ You are not authorized to use this command"
    And no tenant data is returned
```

---

## Epic 6: Dashboard API

### Feature: Cognito JWT Authentication

```gherkin
Feature: Authenticate Dashboard API requests with Cognito JWT

  Background:
    Given the Dashboard API requires a valid Cognito JWT in the "Authorization: Bearer <token>" header

  Scenario: Valid JWT grants access to the API
    Given a user has a valid Cognito JWT for tenant "acme"
    When the user calls "GET /api/incidents"
    Then the response status is 200
    And only "acme"'s incidents are returned

  Scenario: Missing Authorization header returns 401
    Given a request to "GET /api/incidents" with no Authorization header
    When the API Gateway processes the request
    Then the response status is 401
    And the body contains "error": "missing_token"

  Scenario: Expired JWT returns 401
    Given a user presents a JWT that expired 10 minutes ago
    When the user calls "GET /api/incidents"
    Then the response status is 401
    And the body contains "error": "token_expired"

  Scenario: JWT signed with wrong key returns 401
    Given a user presents a JWT signed with a non-Cognito key
    When the user calls "GET /api/incidents"
    Then the response status is 401
    And the body contains "error": "invalid_token_signature"

  Scenario: JWT from a different tenant cannot access another tenant's data
    Given user "@alice" has a valid JWT for tenant "acme"
    When "@alice" calls "GET /api/incidents?tenant_id=globex"
    Then the response status is 403
    And the body contains "error": "tenant_access_denied"
```

### Feature: Incident Listing with Filters

```gherkin
Feature: List incidents with filtering and pagination

  Background:
    Given the user is authenticated for tenant "acme"

  Scenario: List all open incidents
    Given tenant "acme" has 15 open incidents
    When the user calls "GET /api/incidents?status=open"
    Then the response status is 200
    And the response contains 15 incidents
    And each incident includes: id, severity, service, status, created_at, alert_count

  Scenario: Filter incidents by severity
    Given tenant "acme" has 5 critical and 10 warning incidents
    When the user calls "GET /api/incidents?severity=critical"
    Then the response contains exactly 5 incidents
    And all returned incidents have severity "critical"

  Scenario: Filter incidents by service
    Given tenant "acme" has incidents for services "payments", "auth", and "checkout"
    When the user calls "GET /api/incidents?service=payments"
    Then only incidents for service "payments" are returned

  Scenario: Filter incidents by date range
    Given incidents exist from the past 30 days
    When the user calls "GET /api/incidents?from=2026-02-01&to=2026-02-07"
    Then only incidents created between Feb 1 and Feb 7 are returned

  Scenario: Pagination returns correct page of results
    Given tenant "acme" has 100 incidents
    When the user calls "GET /api/incidents?page=2&limit=20"
    Then the response contains incidents 21–40
    And the response includes "total": 100, "page": 2, "limit": 20

  Scenario: Empty result set returns 200 with empty array
    Given tenant "acme" has no incidents matching the filter
    When the user calls "GET /api/incidents?service=nonexistent"
    Then the response status is 200
    And the response body is '{"incidents": [], "total": 0}'

  Scenario: Incident detail endpoint returns full alert timeline
    Given incident "INC-042" has 7 correlated alerts
    When the user calls "GET /api/incidents/INC-042"
    Then the response includes the incident details
    And the "alerts" array contains 7 entries with timestamps and sources
```

### Feature: Analytics Endpoints — MTTR

```gherkin
Feature: MTTR analytics endpoint

  Background:
    Given the user is authenticated for tenant "acme"

  Scenario: MTTR endpoint returns average time-to-resolution
    Given 10 incidents were resolved in the last 7 days with MTTRs ranging from 5 to 60 minutes
    When the user calls "GET /api/analytics/mttr?period=7d"
    Then the response includes "avg_mttr_minutes" as a number
    And "incident_count" = 10

  Scenario: MTTR broken down by service
    When the user calls "GET /api/analytics/mttr?period=7d&group_by=service"
    Then the response includes a per-service MTTR breakdown

  Scenario: MTTR with no resolved incidents returns null
    Given no incidents were resolved in the requested period
    When the user calls "GET /api/analytics/mttr?period=1d"
    Then the response includes "avg_mttr_minutes": null
    And "incident_count": 0
```

### Feature: Analytics Endpoints — Noise Reduction

```gherkin
Feature: Noise reduction analytics endpoint

  Scenario: Noise reduction percentage is calculated correctly
    Given tenant "acme" received 1000 raw alerts and 150 incidents in the last 7 days
    When the user calls "GET /api/analytics/noise-reduction?period=7d"
    Then the response includes "noise_reduction_percent": 85
    And "raw_alerts": 1000
    And "incidents": 150

  Scenario: Noise reduction trend over time
    When the user calls "GET /api/analytics/noise-reduction?period=30d&granularity=daily"
    Then the response includes a daily time series of noise reduction percentages

  Scenario: Noise reduction by source
    When the user calls "GET /api/analytics/noise-reduction?period=7d&group_by=source"
    Then the response includes a per-source breakdown (datadog, pagerduty, opsgenie, grafana)
```

### Feature: Tenant Isolation in Dashboard API

```gherkin
Feature: Enforce strict tenant isolation across all API endpoints

  Scenario: DynamoDB queries always include tenant_id partition key
    Given user "@alice" for tenant "acme" calls any incident endpoint
    When the API handler queries DynamoDB
    Then the query always includes "tenant_id = acme" as a condition
    And no full-table scans are performed

  Scenario: TimescaleDB analytics queries are scoped by tenant_id
    Given user "@alice" for tenant "acme" calls any analytics endpoint
    When the API handler queries TimescaleDB
    Then the SQL query includes "WHERE tenant_id = 'acme'"

  Scenario: API does not expose tenant_id enumeration
    Given user "@alice" calls "GET /api/incidents/INC-999" where INC-999 belongs to tenant "globex"
    When the API processes the request
    Then the response status is 404 (not 403, to avoid tenant enumeration)
```

---

## Epic 7: Dashboard UI

### Feature: Incident List View

```gherkin
Feature: Incident list page in the React SPA

  Background:
    Given the user is logged in and the Dashboard SPA is loaded

  Scenario: Incident list displays open incidents on load
    Given tenant "acme" has 12 open incidents
    When the user navigates to "/incidents"
    Then the incident list renders 12 rows
    And each row shows: incident ID, severity badge, service name, alert count, age

  Scenario: Severity badge color coding
    Given the incident list contains critical, warning, and info incidents
    When the list renders
    Then critical incidents show a red badge
    And warning incidents show a yellow badge
    And info incidents show a blue badge

  Scenario: Clicking an incident row navigates to incident detail
    Given the incident list is displayed
    When the user clicks on incident "INC-042"
    Then the browser navigates to "/incidents/INC-042"

  Scenario: Filter by severity updates the list in real time
    Given the incident list is displayed
    When the user selects "Critical" from the severity filter dropdown
    Then only critical incidents are shown
    And the URL updates to "/incidents?severity=critical"

  Scenario: Filter by service updates the list
    Given the incident list is displayed
    When the user types "payments" in the service search box
    Then only incidents for service "payments" are shown

  Scenario: Empty state is shown when no incidents match filters
    Given no incidents match the current filter
    When the list renders
    Then a message "No incidents found" is displayed
    And a "Clear filters" button is shown

  Scenario: Incident list auto-refreshes every 30 seconds
    Given the incident list is displayed
    When 30 seconds elapse
    Then the list silently re-fetches from the API
    And new incidents appear without a full page reload
```

### Feature: Alert Timeline View

```gherkin
Feature: Alert timeline within an incident detail page

  Scenario: Alert timeline shows all correlated alerts in chronological order
    Given incident "INC-042" has 5 correlated alerts from T=0 to T=4min
    When the user navigates to "/incidents/INC-042"
    Then the timeline renders 5 events in ascending time order
    And each event shows: source icon, alert title, severity, timestamp

  Scenario: Timeline highlights the root cause alert
    Given the first alert in the incident is flagged as "root_cause"
    When the timeline renders
    Then the root cause alert is visually distinguished (e.g. bold border)

  Scenario: Timeline shows deduplication count
    Given fingerprint "fp-abc" was suppressed 8 times
    When the timeline renders the corresponding alert
    Then a badge "×8 duplicates suppressed" is shown on that alert entry

  Scenario: Timeline is scrollable for large incidents
    Given an incident has 200 correlated alerts
    When the timeline renders
    Then a virtualized scroll list is used
    And the page does not freeze or crash
```

### Feature: MTTR Chart

```gherkin
Feature: MTTR trend chart on the analytics page

  Scenario: MTTR chart renders a 7-day trend line
    Given the analytics API returns daily MTTR data for the last 7 days
    When the user navigates to "/analytics"
    Then a line chart is rendered with 7 data points
    And the X-axis shows dates and the Y-axis shows minutes

  Scenario: MTTR chart shows "No data" state when no resolved incidents
    Given no incidents were resolved in the selected period
    When the chart renders
    Then a "No resolved incidents in this period" message is shown instead of the chart

  Scenario: MTTR chart period selector changes the data range
    Given the user is on the analytics page
    When the user selects "Last 30 days" from the period dropdown
    Then the chart re-fetches data for the last 30 days
    And the chart updates without a full page reload
```

### Feature: Noise Reduction Percentage Display

```gherkin
Feature: Noise reduction metric display on analytics page

  Scenario: Noise reduction percentage is prominently displayed
    Given the analytics API returns noise_reduction_percent = 84
    When the user views the analytics page
    Then a large "84%" figure is displayed under "Noise Reduction"

  Scenario: Noise reduction trend sparkline is shown
    Given daily noise reduction data is available for 30 days
    When the analytics page renders
    Then a sparkline chart shows the 30-day trend

  Scenario: Noise reduction breakdown by source is shown
    Given the API returns per-source noise reduction data
    When the user clicks the "By Source" tab
    Then a bar chart shows noise reduction % for each source (Datadog, PagerDuty, OpsGenie, Grafana)
```

### Feature: Webhook Setup Wizard

```gherkin
Feature: Webhook setup wizard for onboarding new monitoring sources

  Scenario: Wizard generates a unique webhook URL for Datadog
    Given the user navigates to "/settings/webhooks"
    And clicks "Add Webhook Source"
    When the user selects "Datadog" from the source dropdown
    And clicks "Generate"
    Then a unique webhook URL is displayed: "https://ingest.dd0c.io/webhooks/datadog/{tenant_id}/{token}"
    And the HMAC secret is shown once for copying

  Scenario: Wizard provides copy-paste instructions for each source
    Given the user has generated a Datadog webhook URL
    When the wizard displays the setup instructions
    Then step-by-step instructions for configuring Datadog are shown
    And a "Test Webhook" button is available

  Scenario: Test webhook button sends a test payload and confirms receipt
    Given the user clicks "Test Webhook" for a configured Datadog source
    When the test payload is sent
    Then the wizard shows "✅ Test payload received successfully"
    And the test alert appears in the incident list as a test event

  Scenario: Wizard shows validation error if source already configured
    Given tenant "acme" already has a Datadog webhook configured
    When the user tries to add a second Datadog webhook
    Then the wizard shows "A Datadog webhook is already configured. Regenerate token?"
  Scenario: Regenerating a webhook token invalidates the old token
    Given tenant "acme" has an existing Datadog webhook token
    When the user clicks "Regenerate Token" and confirms
    Then a new token is generated
    And the old token is immediately invalidated
    And any requests using the old token return 401
```

---

## Epic 8: Infrastructure

### Feature: CDK Stack — Lambda Ingestion

```gherkin
Feature: CDK provisions Lambda ingestion infrastructure

  Scenario: Lambda function is created with correct runtime and memory
    Given the CDK stack is synthesized
    When the CloudFormation template is inspected
    Then a Lambda function "dd0c-ingestion" exists with runtime "nodejs20.x"
    And memory is set to 512MB
    And timeout is set to 30 seconds

  Scenario: Lambda has least-privilege IAM role
    Given the CDK stack is synthesized
    When the IAM role for "dd0c-ingestion" is inspected
    Then the role allows "sqs:SendMessage" only to the ingestion SQS queue ARN
    And the role allows "s3:PutObject" only to the "dd0c-raw-webhooks" bucket
    And the role does NOT have "s3:*" or "sqs:*" wildcards

  Scenario: Lambda is behind API Gateway with throttling
    Given the CDK stack is synthesized
    When the API Gateway configuration is inspected
    Then throttling is set to a burst limit of 1000 requests and a steady-state rate of 500 requests/second
    And WAF is attached to the API Gateway stage

  Scenario: Lambda environment variables are sourced from SSM Parameter Store
    Given the CDK stack is synthesized
    When the Lambda environment configuration is inspected
    Then HMAC secrets are referenced from SSM parameters (not hardcoded)
    And no secrets appear in plaintext in the CloudFormation template
```

### Feature: CDK Stack — ECS Fargate Correlation Engine

```gherkin
Feature: CDK provisions ECS Fargate for the correlation engine

  Scenario: ECS service is created with correct task definition
    Given the CDK stack is synthesized
    When the ECS task definition is inspected
    Then the task uses the Fargate launch type
    And CPU is set to 1024 (1 vCPU) and memory to 2048MB
    And the container image is pulled from the ECR repository "dd0c-correlation-engine"

  Scenario: ECS service auto-scales based on SQS queue depth
    Given the CDK stack is synthesized
    When the auto-scaling configuration is inspected
    Then a step-scaling policy exists targeting SQS "ApproximateNumberOfMessagesVisible"
    And scale-out triggers when queue depth > 100 messages
    And scale-in triggers when queue depth < 10 messages
    And minimum capacity is 1 and maximum capacity is 10

  Scenario: ECS tasks run in private subnets with no public IP
    Given the CDK stack is synthesized
    When the ECS network configuration is inspected
    Then tasks are placed in private subnets
    And "assignPublicIp" is DISABLED
    And a NAT Gateway provides outbound internet access
```

### Feature: CDK Stack — SQS Queues

```gherkin
Feature: CDK provisions SQS queues with correct configuration

  Scenario: Ingestion SQS queue has a Dead Letter Queue configured
    Given the CDK stack is synthesized
    When the SQS queue "dd0c-ingestion" is inspected
    Then a DLQ "dd0c-ingestion-dlq" is attached
    And maxReceiveCount is 3
    And the DLQ retention period is 14 days

  Scenario: SQS queue has server-side encryption enabled
    Given the CDK stack is synthesized
    When the SQS queue configuration is inspected
    Then SSE is enabled using an AWS-managed KMS key

  Scenario: SQS visibility timeout exceeds Lambda timeout
    Given the Lambda timeout is 30 seconds
    When the SQS queue visibility timeout is inspected
    Then the visibility timeout is at least 6x the Lambda timeout (180 seconds)
```

### Feature: CDK Stack — DynamoDB

```gherkin
Feature: CDK provisions DynamoDB for incident storage

  Scenario: Incidents table has correct key schema
    Given the CDK stack is synthesized
    When the DynamoDB table "dd0c-incidents" is inspected
    Then the partition key is "tenant_id" (String)
    And the sort key is "incident_id" (String)

  Scenario: Incidents table has a GSI for status queries
    Given the CDK stack is synthesized
    When the GSIs on "dd0c-incidents" are inspected
    Then a GSI "status-created_at-index" exists with partition key "status" and sort key "created_at"

  Scenario: DynamoDB table has point-in-time recovery enabled
    Given the CDK stack is synthesized
    When the DynamoDB table settings are inspected
    Then PITR is enabled on "dd0c-incidents"

  Scenario: DynamoDB TTL is configured for free-tier retention
    Given the CDK stack is synthesized
    When the DynamoDB TTL configuration is inspected
    Then TTL is enabled on attribute "expires_at"
    And free-tier records have "expires_at" set to 7 days from creation
```

### Feature: CI/CD Pipeline

```gherkin
Feature: CI/CD pipeline for automated deployment

  Scenario: Pull request triggers test suite
    Given a developer opens a pull request against "main"
    When the CI pipeline runs
    Then unit tests, integration tests, and CDK synth all pass before merge is allowed

  Scenario: Merge to main triggers staging deployment
    Given a PR is merged to "main"
    When the CD pipeline runs
    Then the CDK stack is deployed to the "staging" environment
    And smoke tests run against staging endpoints

  Scenario: Production deployment requires manual approval
    Given the staging deployment and smoke tests pass
    When the CD pipeline reaches the production stage
    Then a manual approval gate is presented
    And production deployment only proceeds after approval

  Scenario: Failed deployment triggers automatic rollback
    Given a production deployment fails health checks
    When the CD pipeline detects the failure
    Then the previous CloudFormation stack version is restored
    And a Slack alert is sent to "#dd0c-ops" with the rollback reason

  Scenario: CDK diff is posted as a PR comment
    Given a developer opens a PR with infrastructure changes
    When the CI pipeline runs "cdk diff"
    Then the diff output is posted as a comment on the PR
```

---

## Epic 9: Onboarding & PLG

### Feature: OAuth Signup

```gherkin
Feature: User signup via OAuth (Google / GitHub)

  Background:
    Given the signup page is at "/signup"

  Scenario: New user signs up with Google OAuth
    Given a new user visits "/signup"
    When the user clicks "Sign up with Google"
    And completes the Google OAuth flow
    Then a new tenant is created for the user's email domain
    And the user is assigned the "owner" role for the new tenant
    And the user is redirected to the onboarding wizard

  Scenario: New user signs up with GitHub OAuth
    Given a new user visits "/signup"
    When the user clicks "Sign up with GitHub"
    And completes the GitHub OAuth flow
    Then a new tenant is created
    And the user is redirected to the onboarding wizard

  Scenario: Existing user signs in via OAuth
    Given a user with email "alice@acme.com" already has an account
    When the user completes the Google OAuth flow
    Then no new tenant is created
    And the user is redirected to "/incidents"

  Scenario: OAuth failure shows user-friendly error
    Given the Google OAuth provider returns an error
    When the user is redirected back to the app
    Then an error message "Sign-in failed. Please try again." is displayed
    And no partial account is created

  Scenario: Signup is blocked for disposable email domains
    Given a user attempts to sign up with "user@mailinator.com"
    When the OAuth flow completes
    Then the signup is rejected with "Disposable email addresses are not allowed"
    And no tenant is created
```

### Feature: Free Tier — 10K Alerts/Month Limit

```gherkin
Feature: Enforce free tier limit of 10,000 alerts per month

  Background:
    Given tenant "free-co" is on the free tier
    And the monthly alert counter is stored in DynamoDB

  Scenario: Alert ingestion succeeds under the 10K limit
    Given tenant "free-co" has ingested 9,999 alerts this month
    When a new alert arrives
    Then the alert is processed normally
    And the counter is incremented to 10,000

  Scenario: Alert ingestion is blocked at the 10K limit
    Given tenant "free-co" has ingested 10,000 alerts this month
    When a new alert arrives via webhook
    Then the webhook returns HTTP 429
    And the response body includes "Free tier limit reached. Upgrade to continue."
    And the alert is NOT processed or stored

  Scenario: Tenant receives email warning at 80% of limit
    Given tenant "free-co" has ingested 7,999 alerts this month
    When the 8,000th alert is ingested
    Then an email is sent to the tenant owner: "You've used 80% of your free tier quota"

  Scenario: Alert counter resets on the 1st of each month
    Given tenant "free-co" has ingested 10,000 alerts in January
    When February 1st arrives (UTC midnight)
    Then the monthly counter is reset to 0
    And alert ingestion is unblocked

  Scenario: Paid tenant has no alert ingestion limit
    Given tenant "paid-co" is on the "pro" plan
    And has ingested 50,000 alerts this month
    When a new alert arrives
    Then the alert is processed normally
    And no limit check is applied
```

### Feature: 7-Day Retention for Free Tier

```gherkin
Feature: Enforce 7-day data retention for free tier tenants

  Scenario: Free tier incidents older than 7 days are expired via DynamoDB TTL
    Given tenant "free-co" is on the free tier
    And incident "INC-OLD" was created 8 days ago
    When DynamoDB TTL runs
    Then "INC-OLD" is deleted from the incidents table

  Scenario: Free tier raw S3 archives older than 7 days are deleted
    Given tenant "free-co" has raw webhook archives in S3 from 8 days ago
    When the S3 lifecycle policy runs
    Then objects older than 7 days are deleted for free-tier tenants

  Scenario: Paid tier incidents are retained for 90 days
    Given tenant "paid-co" is on the "pro" plan
    And incident "INC-OLD" was created 30 days ago
    When DynamoDB TTL runs
    Then "INC-OLD" is NOT deleted

  Scenario: Retention policy is enforced per-tenant, not globally
    Given "free-co" and "paid-co" both have incidents from 10 days ago
    When TTL and lifecycle policies run
    Then "free-co"'s old incidents are deleted
    And "paid-co"'s old incidents are retained
```

### Feature: Stripe Billing Integration

```gherkin
Feature: Stripe billing for plan upgrades

  Scenario: User upgrades from free to pro plan via Stripe Checkout
    Given user "@alice" is on the free tier
    When "@alice" clicks "Upgrade to Pro" in the dashboard
    Then a Stripe Checkout session is created
    And the user is redirected to the Stripe-hosted payment page

  Scenario: Successful Stripe payment activates pro plan
    Given a Stripe Checkout session completes successfully
    When the Stripe "checkout.session.completed" webhook is received
    Then tenant "acme"'s plan is updated to "pro" in DynamoDB
    And the alert ingestion limit is removed
    And a confirmation email is sent to the tenant owner

  Scenario: Failed Stripe payment does not activate pro plan
    Given a Stripe payment fails
    When the Stripe "payment_intent.payment_failed" webhook is received
    Then the tenant remains on the free tier
    And a failure notification email is sent

  Scenario: Stripe webhook signature is validated
    Given a Stripe webhook arrives with an invalid "Stripe-Signature" header
    When the billing Lambda processes the request
    Then the response status is 401
    And the event is not processed

  Scenario: Subscription cancellation downgrades tenant to free tier
    Given tenant "acme" cancels their pro subscription
    When the Stripe "customer.subscription.deleted" webhook is received
    Then tenant "acme"'s plan is downgraded to "free"
    And the 10K/month limit is re-applied from the next billing cycle
    And a downgrade confirmation email is sent

  Scenario: Stripe billing is idempotent — duplicate webhook events are ignored
    Given a Stripe "checkout.session.completed" event was already processed
    When the same event is received again (Stripe retry)
    Then the tenant plan is not double-updated
    And the response is 200 (idempotent acknowledgment)
```

### Feature: Webhook URL Generation

```gherkin
Feature: Generate unique webhook URLs per tenant per source

  Scenario: Webhook URL is generated on tenant creation
    Given a new tenant "new-co" is created via OAuth signup
    When the onboarding wizard runs
    Then a unique webhook URL is generated for each supported source
    And each URL follows the pattern "https://ingest.dd0c.io/webhooks/{source}/{tenant_id}/{token}"
    And tokens are cryptographically random (32 bytes, URL-safe base64)

  Scenario: Webhook token is stored hashed in DynamoDB
    Given a webhook token is generated for tenant "new-co"
    When the token is stored
    Then only the SHA-256 hash of the token is stored in DynamoDB
    And the plaintext token is shown to the user exactly once

  Scenario: Webhook URL is validated on each ingestion request
    Given a request arrives at "POST /webhooks/datadog/{tenant_id}/{token}"
    When the ingestion Lambda validates the token
    Then the token hash is looked up in DynamoDB for the given tenant_id
    And if the hash matches, the request is accepted
    And if the hash does not match, the response is 401

  Scenario: 60-second time-to-value — first correlated alert within 60 seconds of webhook setup
    Given a new tenant completes the onboarding wizard and copies their webhook URL
    When the tenant sends their first alert to the webhook URL
    Then within 60 seconds, a correlated incident appears in the dashboard
    And a Slack notification is sent if Slack is configured
```
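The token hashing and validation scenarios above can be sketched in a few lines. A minimal sketch using Node's `crypto`; the in-memory `TokenStore` stands in for the DynamoDB lookup by `tenant_id`, and the function names are illustrative:

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Stand-in for DynamoDB: tenant_id -> hex-encoded SHA-256 hash of the token.
// Only the hash is ever stored; the plaintext token is shown to the user once.
type TokenStore = Map<string, string>;

export function hashToken(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

// 200 when the presented token's hash matches the stored hash for the tenant,
// 401 for an unknown tenant or a mismatch. timingSafeEqual avoids leaking the
// comparison result through response timing.
export function validateWebhookToken(
  store: TokenStore,
  tenantId: string,
  presentedToken: string,
): 200 | 401 {
  const storedHash = store.get(tenantId);
  if (!storedHash) return 401;
  const expected = Buffer.from(storedHash, "hex");
  const actual = Buffer.from(hashToken(presentedToken), "hex");
  return expected.length === actual.length && timingSafeEqual(expected, actual)
    ? 200
    : 401;
}
```

Because only the hash is persisted, a leaked table entry cannot be replayed as a live webhook credential, which is exactly what the "stored hashed" scenario is protecting.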