# dd0c/drift — BDD Acceptance Test Specifications

> Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic.

---

<!-- TABLE OF CONTENTS -->
- [Epic 1: Drift Detection Agent](#epic-1-drift-detection-agent)
- [Epic 2: Agent Communication](#epic-2-agent-communication)
- [Epic 3: Event Processor](#epic-3-event-processor)
- [Epic 4: Notification Engine](#epic-4-notification-engine)
- [Epic 5: Remediation](#epic-5-remediation)
- [Epic 6: Dashboard UI](#epic-6-dashboard-ui)
- [Epic 7: Dashboard API](#epic-7-dashboard-api)
- [Epic 8: Infrastructure](#epic-8-infrastructure)
- [Epic 9: Onboarding & PLG](#epic-9-onboarding--plg)
- [Epic 10: Transparent Factory](#epic-10-transparent-factory)

---

## Epic 1: Drift Detection Agent

### Feature: Agent Initialization

```gherkin
Feature: Drift Detection Agent Initialization
  As a platform engineer
  I want the drift agent to initialize correctly in my VPC
  So that it can begin scanning infrastructure state

  Background:
    Given the drift agent binary is installed at "/usr/local/bin/drift-agent"
    And a valid agent config exists at "/etc/drift/config.yaml"

  Scenario: Successful agent startup
    Given the config specifies AWS region "us-east-1"
    And valid mTLS certificates are present
    And the SaaS endpoint is reachable
    When the agent starts
    Then the agent logs "drift-agent started" at INFO level
    And the agent registers itself with the SaaS control plane
    And the first scan is scheduled within 15 minutes

  Scenario: Agent startup with missing config
    Given no config file exists at "/etc/drift/config.yaml"
    When the agent starts
    Then the agent exits with code 1
    And logs "config file not found" at ERROR level

  Scenario: Agent startup with invalid AWS credentials
    Given the config references an IAM role that does not exist
    When the agent starts
    Then the agent exits with code 1
    And logs "failed to assume IAM role" at ERROR level

  Scenario: Agent startup with unreachable SaaS endpoint
    Given the SaaS endpoint is not reachable from the VPC
    When the agent starts
    Then the agent retries connection 3 times with exponential backoff
    And after all retries fail, exits with code 1
    And logs "failed to reach control plane after 3 attempts" at ERROR level
```
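
The unreachable-endpoint scenario above specifies three connection attempts with exponential backoff. A minimal Python sketch of that retry loop; the `connect` callable and the delay values are illustrative assumptions, not part of the spec:

```python
import time

def connect_with_backoff(connect, attempts: int = 3, base_delay: float = 1.0,
                         sleep=time.sleep) -> bool:
    """Call `connect` up to `attempts` times, doubling the pause between tries.

    Returns True on the first success; False once every attempt has failed,
    at which point the agent would exit with code 1 per the scenario.
    """
    for attempt in range(attempts):
        if connect():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False
```

Injecting `sleep` keeps the backoff schedule observable in tests without real waiting.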

### Feature: Terraform State Scanning

```gherkin
Feature: Terraform State Scanning
  As a platform engineer
  I want the agent to read Terraform state
  So that it can compare planned state against live AWS resources

  Background:
    Given the agent is running and initialized
    And the stack type is "terraform"

  Scenario: Scan Terraform state from S3 backend
    Given a Terraform state file exists in S3 bucket "my-tfstate" at key "prod/terraform.tfstate"
    And the agent has s3:GetObject permission on that bucket
    When a scan cycle runs
    Then the agent reads the state file successfully
    And parses all resource definitions from the state

  Scenario: Scan Terraform state from local backend
    Given a Terraform state file exists at "/var/drift/stacks/prod/terraform.tfstate"
    When a scan cycle runs
    Then the agent reads the local state file
    And parses all resource definitions

  Scenario: Terraform state file is locked
    Given the Terraform state file is currently locked by another process
    When a scan cycle runs
    Then the agent logs "state file locked, skipping scan" at WARN level
    And schedules a retry for the next cycle
    And does not report a drift event

  Scenario: Terraform state file is malformed
    Given the Terraform state file contains invalid JSON
    When a scan cycle runs
    Then the agent logs "failed to parse state file" at ERROR level
    And emits a health event with status "parse_error"
    And does not report false drift

  Scenario: Terraform state references deleted resources
    Given the Terraform state contains a resource "aws_instance.web"
    And that EC2 instance no longer exists in AWS
    When a scan cycle runs
    Then the agent detects drift of type "resource_deleted"
    And includes the resource ARN in the drift report
```
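
The resource_deleted scenario can be grounded with a small sketch of how an agent might walk a Terraform state document and flag instance IDs that no longer exist in the live account. The state shape used here is the minimal subset of the real tfstate JSON format needed for illustration; the drift-record shape is an assumption:

```python
import json

def find_deleted_resources(state_json: str, live_ids: set) -> list:
    """Parse a (minimal) Terraform state document and report managed resources
    whose instance IDs are absent from the set of live resource IDs."""
    state = json.loads(state_json)
    drifts = []
    for res in state.get("resources", []):
        address = f'{res["type"]}.{res["name"]}'  # e.g. aws_instance.web
        for inst in res.get("instances", []):
            rid = inst.get("attributes", {}).get("id")
            if rid and rid not in live_ids:
                drifts.append(
                    {"type": "resource_deleted", "resource": address, "id": rid})
    return drifts
```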

### Feature: CloudFormation Stack Scanning

```gherkin
Feature: CloudFormation Stack Scanning
  As a platform engineer
  I want the agent to scan CloudFormation stacks
  So that I can detect drift from declared template state

  Background:
    Given the agent is running
    And the stack type is "cloudformation"

  Scenario: Scan a CloudFormation stack successfully
    Given a CloudFormation stack named "prod-api" exists in "us-east-1"
    And the agent has cloudformation:DescribeStackResources permission
    When a scan cycle runs
    Then the agent retrieves all stack resources
    And compares each resource's actual configuration against the template

  Scenario: CloudFormation stack does not exist
    Given the config references a CloudFormation stack "ghost-stack"
    And that stack does not exist in AWS
    When a scan cycle runs
    Then the agent logs "stack not found: ghost-stack" at WARN level
    And emits a drift event of type "stack_missing"

  Scenario: CloudFormation native drift detection result available
    Given CloudFormation has already run drift detection on "prod-api"
    And the result shows 2 drifted resources
    When a scan cycle runs
    Then the agent reads the CloudFormation drift detection result
    And includes both drifted resources in the drift report

  Scenario: CloudFormation drift detection result is stale
    Given the last CloudFormation drift detection ran 48 hours ago
    When a scan cycle runs
    Then the agent triggers a new CloudFormation drift detection
    And waits up to 5 minutes for the result
    And uses the fresh result in the drift report
```
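
The staleness scenario implies two pieces of logic: deciding when a cached drift-detection result is too old, and polling until CloudFormation reports a terminal status. A sketch, assuming a 24-hour staleness cutoff (the spec only says that 48 hours counts as stale); `DETECTION_COMPLETE` and `DETECTION_FAILED` are the terminal values of CloudFormation's `DetectionStatus`:

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(hours=24)  # assumed cutoff, not in the spec

def needs_fresh_detection(last_run: datetime, now: datetime) -> bool:
    """True when the cached drift-detection result is older than the cutoff."""
    return (now - last_run) > STALENESS_THRESHOLD

def wait_for_detection(poll_status, timeout_s: int = 300, interval_s: int = 10,
                       sleep=lambda s: None) -> str:
    """Poll until a terminal DetectionStatus or until the 5-minute budget runs out."""
    waited = 0
    while True:
        status = poll_status()
        if status in ("DETECTION_COMPLETE", "DETECTION_FAILED"):
            return status
        if waited >= timeout_s:
            return "TIMEOUT"
        sleep(interval_s)
        waited += interval_s
```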

### Feature: Kubernetes Resource Scanning

```gherkin
Feature: Kubernetes Resource Scanning
  As a platform engineer
  I want the agent to scan Kubernetes resources
  So that I can detect drift from Helm/Kustomize definitions

  Background:
    Given the agent is running
    And a kubeconfig is available at "/etc/drift/kubeconfig"
    And the stack type is "kubernetes"

  Scenario: Scan Kubernetes Deployment successfully
    Given a Deployment "api-server" exists in namespace "production"
    And the IaC definition specifies 3 replicas
    And the live Deployment has 2 replicas
    When a scan cycle runs
    Then the agent detects drift on field "spec.replicas"
    And the drift report includes expected value "3" and actual value "2"

  Scenario: Kubernetes resource has been manually patched
    Given a ConfigMap "app-config" has been manually edited
    And the live data differs from the Helm chart values
    When a scan cycle runs
    Then the agent detects drift of type "config_modified"
    And includes a field-level diff in the report

  Scenario: Kubernetes API server is unreachable
    Given the Kubernetes API server is not responding
    When a scan cycle runs
    Then the agent logs "k8s API unreachable" at ERROR level
    And emits a health event with status "k8s_unreachable"
    And does not report false drift

  Scenario: Scan across multiple namespaces
    Given the config specifies namespaces ["production", "staging"]
    When a scan cycle runs
    Then the agent scans resources in both namespaces independently
    And drift reports include the namespace as context
```
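
The replica scenario expects a field-level diff keyed by dotted paths like `spec.replicas`. A recursive dict-comparison sketch; the diff-record shape is an assumption for illustration:

```python
def diff_fields(expected: dict, actual: dict, prefix: str = "spec") -> list:
    """Recursively compare desired vs live objects, emitting dotted-path diffs."""
    diffs = []
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}.{key}"
        exp, act = expected.get(key), actual.get(key)
        if isinstance(exp, dict) and isinstance(act, dict):
            diffs.extend(diff_fields(exp, act, path))  # descend into nested specs
        elif exp != act:
            diffs.append({"field": path, "expected": exp, "actual": act})
    return diffs
```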

### Feature: Secret Scrubbing

```gherkin
Feature: Secret Scrubbing Before Transmission
  As a security officer
  I want all secrets scrubbed from drift reports before they leave the VPC
  So that credentials are never transmitted to the SaaS backend

  Background:
    Given the agent is running with secret scrubbing enabled

  Scenario: AWS access key ID detected and scrubbed
    Given a drift report contains a field value matching the AWS access key ID pattern "AKIA[0-9A-Z]{16}"
    When the scrubber processes the report
    Then the field value is replaced with "[REDACTED:aws_access_key_id]"
    And the original value is not present anywhere in the transmitted payload

  Scenario: Generic password field scrubbed
    Given a drift report contains a field named "password" with value "s3cr3tP@ss"
    When the scrubber processes the report
    Then the field value is replaced with "[REDACTED:password]"

  Scenario: Private key block scrubbed
    Given a drift report contains a PEM private key block
    When the scrubber processes the report
    Then the entire PEM block is replaced with "[REDACTED:private_key]"

  Scenario: Nested secret in JSON value scrubbed
    Given a drift report contains a JSON string value with a nested "api_key" field
    When the scrubber processes the report
    Then the nested api_key value is replaced with "[REDACTED:api_key]"
    And the surrounding JSON structure is preserved

  Scenario: Secret scrubber bypass attempt via encoding
    Given a drift report contains a base64-encoded AWS secret key
    When the scrubber processes the report
    Then the encoded value is detected and replaced with "[REDACTED:encoded_secret]"

  Scenario: Secret scrubber bypass attempt via Unicode homoglyphs
    Given a drift report contains a value using Unicode lookalike characters to resemble a secret pattern
    When the scrubber processes the report
    Then the value is flagged and replaced with "[REDACTED:suspicious_value]"

  Scenario: Non-secret value is not scrubbed
    Given a drift report contains a field "instance_type" with value "t3.medium"
    When the scrubber processes the report
    Then the field value remains "t3.medium" unchanged

  Scenario: Scrubber coverage is 100%
    Given a test corpus of 500 known secret patterns
    When the scrubber processes all patterns
    Then every pattern is detected and redacted
    And the scrubber reports 0 missed secrets

  Scenario: Scrubber audit log
    Given the scrubber redacts 3 values from a drift report
    When the report is transmitted
    Then the agent logs a scrubber audit entry with count "3" and field names (not values)
    And the audit log is stored locally for 30 days
```
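
A minimal scrubber sketch covering a few of the behaviours above: sensitive field names, the `AKIA` access-key pattern, PEM blocks, and the bypass-by-encoding check. The pattern list, field-name set, and redaction labels are illustrative; a production scrubber would carry a far larger corpus:

```python
import base64
import binascii
import re

PATTERNS = [
    ("aws_access_key_id", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("private_key", re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----", re.S)),
]
SENSITIVE_FIELDS = {"password", "api_key", "secret", "token"}  # illustrative subset

def scrub(report: dict) -> dict:
    """Redact sensitive field names and known secret patterns in string values."""
    out = {}
    for field, value in report.items():
        if field.lower() in SENSITIVE_FIELDS:
            out[field] = f"[REDACTED:{field.lower()}]"
        elif isinstance(value, str):
            out[field] = _scrub_str(value)
        else:
            out[field] = value
    return out

def _scrub_str(value: str) -> str:
    for label, pattern in PATTERNS:
        value = pattern.sub(f"[REDACTED:{label}]", value)
    # bypass-by-encoding check: decode as base64 and re-run the patterns
    try:
        decoded = base64.b64decode(value, validate=True).decode("ascii")
    except (binascii.Error, ValueError, UnicodeDecodeError):
        return value  # not valid base64-encoded text; keep the scrubbed value
    if any(pattern.search(decoded) for _, pattern in PATTERNS):
        return "[REDACTED:encoded_secret]"
    return value
```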

### Feature: Pulumi State Scanning

```gherkin
Feature: Pulumi State Scanning
  As a platform engineer
  I want the agent to read Pulumi state
  So that I can detect drift from Pulumi-managed resources

  Background:
    Given the agent is running
    And the stack type is "pulumi"

  Scenario: Scan Pulumi state from Pulumi Cloud backend
    Given a Pulumi stack "prod" exists in organization "acme"
    And the agent has a valid Pulumi access token configured
    When a scan cycle runs
    Then the agent fetches the stack's exported state via Pulumi API
    And parses all resource URNs and properties

  Scenario: Scan Pulumi state from self-managed S3 backend
    Given the Pulumi state is stored in S3 bucket "pulumi-state" at key "acme/prod.json"
    When a scan cycle runs
    Then the agent reads the state file from S3
    And parses all resource definitions

  Scenario: Pulumi access token is expired
    Given the configured Pulumi access token has expired
    When a scan cycle runs
    Then the agent logs "pulumi token expired" at ERROR level
    And emits a health event with status "auth_error"
    And does not report false drift
```

### Feature: 15-Minute Scan Cycle

```gherkin
Feature: Scheduled Scan Cycle
  As a platform engineer
  I want scans to run every 15 minutes automatically
  So that drift is detected promptly

  Background:
    Given the agent is running and initialized

  Scenario: Scan runs on schedule
    Given the last scan completed at T+0
    When 15 minutes elapse
    Then a new scan starts automatically
    And the scan completion is logged with timestamp

  Scenario: Scan cycle skipped if previous scan still running
    Given a scan started at T+0 and is still running at T+15
    When the next scheduled scan would start
    Then the new scan is skipped
    And the agent logs "scan skipped: previous scan still in progress" at WARN level

  Scenario: Scan interval is configurable
    Given the config specifies scan_interval_minutes: 30
    When the agent starts
    Then scans run every 30 minutes instead of 15

  Scenario: No drift detected — no report sent
    Given all resources match their IaC definitions
    When a scan cycle completes
    Then no drift report is sent to SaaS
    And the agent logs "scan complete: no drift detected"

  Scenario: Agent recovers scan schedule after restart
    Given the agent was restarted
    When the agent starts
    Then it reads the last scan timestamp from local state
    And schedules the next scan relative to the last completed scan
```
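
The restart-recovery scenario reduces to one computation: the next scan is due one interval after the last completed scan, or immediately if that moment already passed while the agent was down. A sketch; the function name is illustrative:

```python
from datetime import datetime, timedelta

def next_scan_time(last_scan, now: datetime,
                   interval: timedelta = timedelta(minutes=15)) -> datetime:
    """Schedule the next scan relative to the last completed scan.

    With no recorded scan the first one runs immediately; if the interval
    already elapsed while the agent was down, so does the next one."""
    if last_scan is None:
        return now
    due = last_scan + interval
    return due if due > now else now
```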

---

## Epic 2: Agent Communication

### Feature: mTLS Certificate Handshake

```gherkin
Feature: Mutual TLS (mTLS) Authentication
  As a security engineer
  I want all agent-to-SaaS communication to use mTLS
  So that only authenticated agents can submit drift reports

  Background:
    Given the agent has a client certificate issued by the drift CA
    And the SaaS endpoint requires mTLS

  Scenario: Successful mTLS handshake
    Given the agent certificate is valid and not expired
    And the SaaS server certificate is trusted by the agent's CA bundle
    When the agent connects to the SaaS endpoint
    Then the TLS handshake succeeds
    And the connection is established with mutual authentication

  Scenario: Agent certificate is expired
    Given the agent certificate expired 1 day ago
    When the agent attempts to connect to the SaaS endpoint
    Then the TLS handshake fails with "certificate expired"
    And the agent logs "mTLS cert expired, cannot connect" at ERROR level
    And the agent emits a local alert to stderr
    And no data is transmitted

  Scenario: Agent certificate is revoked
    Given the agent certificate has been added to the CRL
    When the agent attempts to connect
    Then the SaaS endpoint rejects the connection
    And the agent logs "certificate revoked" at ERROR level

  Scenario: Agent presents wrong CA certificate
    Given the agent has a certificate from an untrusted CA
    When the agent attempts to connect
    Then the TLS handshake fails
    And the agent logs "unknown certificate authority" at ERROR level

  Scenario: SaaS server certificate is expired
    Given the SaaS server certificate has expired
    When the agent attempts to connect
    Then the agent rejects the server certificate
    And logs "server cert expired" at ERROR level
    And does not transmit any data

  Scenario: Certificate rotation — new cert issued before expiry
    Given the agent certificate expires in 7 days
    And the control plane issues a new certificate
    When the agent receives the new certificate via the cert rotation endpoint
    Then the agent stores the new certificate
    And uses the new certificate for subsequent connections
    And logs "certificate rotated successfully" at INFO level

  Scenario: Certificate rotation fails — agent continues with old cert
    Given the agent certificate expires in 7 days
    And the cert rotation endpoint is unreachable
    When the agent attempts certificate rotation
    Then the agent retries rotation every hour
    And continues using the existing certificate until it expires
    And logs "cert rotation failed, retrying" at WARN level
```
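
The rotation scenarios hinge on noticing that the agent certificate expires within 7 days. Python's stdlib `ssl` module can parse the `notAfter` string a peer certificate carries; a sketch of that check (the 7-day window comes from the scenarios, the function name is illustrative):

```python
import ssl
from datetime import timedelta

ROTATION_WINDOW = timedelta(days=7)  # from the rotation scenarios

def should_rotate(not_after: str, now_s: float) -> bool:
    """`not_after` is a certificate's notAfter field in the string format used
    by ssl.SSLSocket.getpeercert(), e.g. "Jun 10 12:00:00 2030 GMT"."""
    expires_s = ssl.cert_time_to_seconds(not_after)
    return expires_s - now_s <= ROTATION_WINDOW.total_seconds()
```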

### Feature: Heartbeat

```gherkin
Feature: Agent Heartbeat
  As a SaaS operator
  I want agents to send regular heartbeats
  So that I can detect offline or unhealthy agents

  Background:
    Given the agent is running and connected via mTLS

  Scenario: Heartbeat sent on schedule
    Given the heartbeat interval is 60 seconds
    When 60 seconds elapse
    Then the agent sends a heartbeat message to the SaaS control plane
    And the heartbeat includes agent version, last scan time, and stack count

  Scenario: SaaS marks agent as offline after missed heartbeats
    Given the agent has not sent a heartbeat for 5 minutes
    When the SaaS control plane checks agent status
    Then the agent is marked as "offline"
    And a notification is sent to the tenant's configured alert channel

  Scenario: Agent reconnects after network interruption
    Given the agent lost connectivity for 3 minutes
    When connectivity is restored
    Then the agent re-establishes the mTLS connection
    And sends a heartbeat immediately
    And the SaaS marks the agent as "online"

  Scenario: Heartbeat includes health status
    Given the last scan encountered a parse error
    When the agent sends its next heartbeat
    Then the heartbeat payload includes health_status "degraded"
    And includes the error details in the health_details field

  Scenario: Multiple agents from same tenant
    Given tenant "acme" has 3 agents running in different regions
    When all 3 agents send heartbeats
    Then each agent is tracked independently by agent_id
    And the SaaS dashboard shows 3 active agents for tenant "acme"
```
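
The offline-detection scenario is a simple threshold over per-agent last-heartbeat timestamps. A SaaS-side sketch, assuming a `dict` of `agent_id -> last heartbeat time`; the 5-minute cutoff comes from the scenario:

```python
from datetime import datetime, timedelta

OFFLINE_AFTER = timedelta(minutes=5)  # from the missed-heartbeat scenario

def agent_statuses(last_heartbeat: dict, now: datetime) -> dict:
    """Map each agent_id to "online"/"offline" from its last heartbeat time;
    every agent is tracked independently, per the multi-agent scenario."""
    return {
        agent_id: "offline" if now - ts > OFFLINE_AFTER else "online"
        for agent_id, ts in last_heartbeat.items()
    }
```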

### Feature: SQS FIFO Drift Report Ingestion

```gherkin
Feature: SQS FIFO Drift Report Ingestion
  As a SaaS backend engineer
  I want drift reports delivered via SQS FIFO
  So that reports are processed in order without duplicates

  Background:
    Given an SQS FIFO queue "drift-reports.fifo" exists
    And the agent has sqs:SendMessage permission on the queue

  Scenario: Agent publishes drift report to SQS FIFO
    Given a scan detects 2 drifted resources
    When the agent publishes the drift report
    Then a message is sent to "drift-reports.fifo"
    And the MessageGroupId is set to the tenant's agent_id
    And the MessageDeduplicationId is set to the scan's unique scan_id

  Scenario: Duplicate scan report is deduplicated
    Given a drift report with scan_id "scan-abc-123" was already sent
    When the agent sends the same report again (e.g., due to retry)
    Then SQS FIFO deduplicates the message
    And only one copy is delivered to the consumer

  Scenario: SQS message size limit exceeded
    Given a drift report exceeds 256KB (SQS max message size)
    When the agent attempts to publish
    Then the agent splits the report into chunks
    And each chunk is sent as a separate message with a shared batch_id
    And the sequence is indicated by chunk_index and chunk_total fields

  Scenario: SQS publish fails — agent retries with backoff
    Given the SQS endpoint returns a 500 error
    When the agent attempts to publish a drift report
    Then the agent retries up to 5 times with exponential backoff
    And logs each retry attempt at WARN level
    And after all retries fail, stores the report locally for later replay

  Scenario: Agent publishes to correct queue per environment
    Given the agent config specifies environment "production"
    When the agent publishes a drift report
    Then the message is sent to "drift-reports-prod.fifo"
    And not to "drift-reports-staging.fifo"
```
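
The size-limit scenario describes splitting an oversized report into chunks that share a batch_id and carry chunk_index/chunk_total. A sketch of the split and the matching consumer-side reassembly; note this sketch slices by character count, whereas a real agent must measure encoded UTF-8 bytes against the SQS limit:

```python
import json
import math
import uuid

MAX_SQS_BODY = 256 * 1024  # SQS per-message limit

def chunk_report(report: dict, max_bytes: int = MAX_SQS_BODY) -> list:
    """Split a serialized drift report into ordered chunks sharing a batch_id."""
    body = json.dumps(report)
    if len(body) <= max_bytes:
        return [{"batch_id": None, "chunk_index": 0, "chunk_total": 1, "data": body}]
    batch_id = str(uuid.uuid4())
    total = math.ceil(len(body) / max_bytes)
    return [
        {"batch_id": batch_id, "chunk_index": i, "chunk_total": total,
         "data": body[i * max_bytes:(i + 1) * max_bytes]}
        for i in range(total)
    ]

def reassemble(chunks: list) -> dict:
    """Consumer side: reorder by chunk_index and rebuild the original report."""
    chunks = sorted(chunks, key=lambda c: c["chunk_index"])
    assert len(chunks) == chunks[0]["chunk_total"], "incomplete batch"
    return json.loads("".join(c["data"] for c in chunks))
```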

### Feature: Dead Letter Queue Handling

```gherkin
Feature: Dead Letter Queue (DLQ) Handling
  As a SaaS operator
  I want failed messages routed to a DLQ
  So that no drift reports are silently lost

  Background:
    Given a DLQ "drift-reports-dlq.fifo" is configured
    And the maxReceiveCount is set to 3

  Scenario: Message moved to DLQ after max receive count
    Given a drift report message has failed processing 3 times
    When the SQS visibility timeout expires a 3rd time
    Then the message is automatically moved to the DLQ
    And an alarm fires on the DLQ depth metric

  Scenario: DLQ alarm triggers operator notification
    Given the DLQ depth exceeds 0
    When the CloudWatch alarm triggers
    Then a PagerDuty alert is sent to the on-call engineer
    And the alert includes the queue name and approximate message count

  Scenario: DLQ message is replayed after fix
    Given a message in the DLQ was caused by a schema validation bug
    And the bug has been fixed and deployed
    When an operator triggers DLQ replay
    Then the message is moved back to the main queue
    And processed successfully
    And removed from the DLQ

  Scenario: DLQ message contains poison pill — permanently discarded
    Given a DLQ message is malformed beyond repair
    When an operator inspects the message
    Then the operator can mark it as "discarded" via the ops console
    And the discard action is logged with operator identity and reason
    And the message is deleted from the DLQ
```
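
The maxReceiveCount behaviour is enforced by SQS itself through the queue's redrive policy; the sketch below only simulates that contract in-process so the expected outcomes are testable (the receive count increments on each delivery, and the message is dead-lettered after the third failure):

```python
MAX_RECEIVE_COUNT = 3  # mirrors the redrive policy in the Background

def route(message: dict, process) -> str:
    """Deliver `message` repeatedly, as SQS would, until it is processed
    or the receive count reaches MAX_RECEIVE_COUNT (then: dead-letter)."""
    while message.get("receive_count", 0) < MAX_RECEIVE_COUNT:
        message["receive_count"] = message.get("receive_count", 0) + 1
        try:
            process(message)
            return "processed"
        except Exception:
            continue  # visibility timeout expires; message becomes receivable again
    return "dlq"
```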

---

## Epic 3: Event Processor

### Feature: Drift Report Normalization

```gherkin
Feature: Drift Report Normalization
  As a SaaS backend engineer
  I want incoming drift reports normalized to a canonical schema
  So that downstream consumers work with consistent data regardless of IaC type

  Background:
    Given the event processor is running
    And it is consuming from "drift-reports.fifo"

  Scenario: Normalize a Terraform drift report
    Given a raw drift report with type "terraform" arrives on the queue
    When the event processor consumes the message
    Then it maps Terraform resource addresses to canonical resource_id format
    And stores the normalized report in the "drift_events" PostgreSQL table
    And sets the "iac_type" field to "terraform"

  Scenario: Normalize a CloudFormation drift report
    Given a raw drift report with type "cloudformation" arrives
    When the event processor normalizes it
    Then CloudFormation logical resource IDs are mapped to canonical resource_id
    And the "iac_type" field is set to "cloudformation"

  Scenario: Normalize a Kubernetes drift report
    Given a raw drift report with type "kubernetes" arrives
    When the event processor normalizes it
    Then Kubernetes resource URIs (namespace/kind/name) are mapped to canonical resource_id
    And the "iac_type" field is set to "kubernetes"

  Scenario: Unknown IaC type in report
    Given a drift report arrives with iac_type "unknown_tool"
    When the event processor attempts normalization
    Then the message is rejected with error "unsupported_iac_type"
    And the message is moved to the DLQ
    And an error is logged with the raw message_id

  Scenario: Report with missing required fields
    Given a drift report is missing the "tenant_id" field
    When the event processor validates the message
    Then validation fails with "missing required field: tenant_id"
    And the message is moved to the DLQ
    And no partial record is written to PostgreSQL

  Scenario: Chunked report is reassembled before normalization
    Given 3 chunked messages arrive with the same batch_id
    And chunk_total is 3
    When all 3 chunks are received
    Then the event processor reassembles them in chunk_index order
    And normalizes the complete report as a single event

  Scenario: Chunked report — one chunk missing after timeout
    Given 2 of 3 chunks arrive for batch_id "batch-xyz"
    And the third chunk does not arrive within 10 minutes
    When the reassembly timeout fires
    Then the event processor logs "incomplete batch: batch-xyz, missing chunk 2" at WARN level
    And moves the partial batch to the DLQ
```
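
The three normalization scenarios each map an IaC-native identifier onto one canonical resource_id. The `<iac_type>:<identifier>` format below is an assumption for illustration; the spec only requires that the mapping be consistent per IaC type:

```python
def canonical_resource_id(iac_type: str, raw: dict) -> str:
    """Map IaC-specific identifiers to one canonical resource_id string."""
    if iac_type == "terraform":
        return f'terraform:{raw["address"]}'  # e.g. aws_instance.web
    if iac_type == "cloudformation":
        return f'cloudformation:{raw["stack"]}/{raw["logical_id"]}'
    if iac_type == "kubernetes":
        return f'kubernetes:{raw["namespace"]}/{raw["kind"]}/{raw["name"]}'
    # the unknown-type scenario: reject and let the caller dead-letter it
    raise ValueError("unsupported_iac_type")
```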

### Feature: Drift Severity Scoring

```gherkin
Feature: Drift Severity Scoring
  As a platform engineer
  I want each drift event scored by severity
  So that I can prioritize which drifts to address first

  Background:
    Given a normalized drift event is ready for scoring

  Scenario: Security group rule added — HIGH severity
    Given a drift event describes an added inbound rule on a security group
    And the rule opens port 22 to 0.0.0.0/0
    When the severity scorer evaluates the event
    Then the severity is set to "HIGH"
    And the reason is "public SSH access opened"

  Scenario: IAM policy attached — CRITICAL severity
    Given a drift event describes an IAM policy "AdministratorAccess" attached to a role
    When the severity scorer evaluates the event
    Then the severity is set to "CRITICAL"
    And the reason is "admin policy attached outside IaC"

  Scenario: Replica count changed — LOW severity
    Given a drift event describes a Kubernetes Deployment replica count changed from 3 to 2
    When the severity scorer evaluates the event
    Then the severity is set to "LOW"
    And the reason is "non-security configuration drift"

  Scenario: Resource deleted — HIGH severity
    Given a drift event describes an RDS instance being deleted
    When the severity scorer evaluates the event
    Then the severity is set to "HIGH"
    And the reason is "managed resource deleted outside IaC"

  Scenario: Tag-only drift — INFO severity
    Given a drift event describes only tag changes on an EC2 instance
    When the severity scorer evaluates the event
    Then the severity is set to "INFO"
    And the reason is "tag-only drift"

  Scenario: Custom severity rules override defaults
    Given a tenant has configured a custom rule: "any change to resource type aws_s3_bucket = CRITICAL"
    And a drift event describes a tag change on an S3 bucket
    When the severity scorer evaluates the event
    Then the tenant's custom rule takes precedence
    And the severity is set to "CRITICAL"

  Scenario: Severity score is stored with the drift event
    Given a drift event has been scored as "HIGH"
    When the event processor writes to PostgreSQL
    Then the "drift_events" row includes severity "HIGH" and scored_at timestamp
```
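
The scoring scenarios suggest a small rule engine with a fixed precedence: tenant custom rules first, then built-in security rules, then defaults keyed by drift type. A sketch; the rule shapes and event field names are illustrative assumptions:

```python
DEFAULT_SEVERITY = {
    "resource_deleted": ("HIGH", "managed resource deleted outside IaC"),
    "tag_only": ("INFO", "tag-only drift"),
    "config_modified": ("LOW", "non-security configuration drift"),
}

def score(event: dict, custom_rules: dict = None) -> tuple:
    """Return (severity, reason) for a normalized drift event.

    Precedence: tenant custom rules (keyed by resource type) take priority
    over built-in security rules, which take priority over defaults."""
    if custom_rules and event.get("resource_type") in custom_rules:
        return custom_rules[event["resource_type"]], "tenant custom rule"
    if event.get("admin_policy_attached"):
        return "CRITICAL", "admin policy attached outside IaC"
    if event.get("public_ssh_opened"):
        return "HIGH", "public SSH access opened"
    return DEFAULT_SEVERITY.get(event.get("drift_type"), ("LOW", "unclassified drift"))
```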
|
|||
|
|
|
|||
|
|
### Feature: PostgreSQL Storage with RLS
|
|||
|
|
|
|||
|
|
```gherkin
|
|||
|
|
Feature: Drift Event Storage with Row-Level Security
|
|||
|
|
As a SaaS engineer
|
|||
|
|
I want drift events stored in PostgreSQL with RLS
|
|||
|
|
So that tenants can only access their own data
|
|||
|
|
|
|||
|
|
Background:
|
|||
|
|
Given PostgreSQL is running with RLS enabled on the "drift_events" table
|
|||
|
|
And the RLS policy filters rows by "tenant_id = current_setting('app.tenant_id')"
|
|||
|
|
|
|||
|
|
Scenario: Drift event written for tenant A
|
|||
|
|
Given a normalized drift event belongs to tenant "acme"
|
|||
|
|
When the event processor writes the event
|
|||
|
|
Then the row is inserted with tenant_id "acme"
|
|||
|
|
And the row is visible when querying as tenant "acme"
|
|||
|
|
|
|||
|
|
Scenario: Tenant B cannot read tenant A's drift events
|
|||
|
|
Given drift events exist for tenant "acme"
|
|||
|
|
When a query runs with app.tenant_id set to "globex"
|
|||
|
|
Then zero rows are returned
|
|||
|
|
And no error is thrown (RLS silently filters)
|
|||
|
|
|
|||
|
|
Scenario: Superuser bypass is disabled for application role
|
|||
|
|
Given the application database role is "drift_app"
|
|||
|
|
When "drift_app" attempts to query without setting app.tenant_id
|
|||
|
|
Then zero rows are returned due to RLS default-deny policy
|
|||
|
|
|
|||
|
|
Scenario: Drift event deduplication
|
|||
|
|
Given a drift event with scan_id "scan-abc" and resource_id "aws_instance.web" already exists
|
|||
|
|
When the event processor attempts to insert the same event again
|
|||
|
|
Then the INSERT is ignored (ON CONFLICT DO NOTHING)
|
|||
|
|
And no duplicate row is created
|
|||
|
|
|
|||
|
|
Scenario: Database connection pool exhausted
|
|||
|
|
Given all PostgreSQL connections are in use
|
|||
|
|
When the event processor tries to write a drift event
|
|||
|
|
Then it waits up to 5 seconds for a connection
|
|||
|
|
And if no connection is available, the message is nacked and retried
|
|||
|
|
And an alert fires if pool exhaustion persists for more than 60 seconds
|
|||
|
|
|
|||
|
|
Scenario: Schema migration runs without downtime
|
|||
|
|
Given a new additive column "remediation_status" is being added
|
|||
|
|
When the migration runs
|
|||
|
|
Then existing rows are unaffected
|
|||
|
|
And new rows include the "remediation_status" column
|
|||
|
|
And the event processor continues writing without restart
|
|||
|
|
```
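
The default-deny behavior these scenarios specify can be modeled outside the database. The sketch below is an illustrative pure-Python model (not the actual RLS policy — that lives in PostgreSQL `CREATE POLICY` DDL): rows are visible only when the session's `app.tenant_id` matches, and an unset tenant yields zero rows.

```python
# Illustrative model of the RLS behavior specified above. Row and column
# names mirror the scenarios; the real filter is enforced by PostgreSQL.

ROWS = [
    {"tenant_id": "acme", "resource_id": "aws_instance.web"},
    {"tenant_id": "acme", "resource_id": "aws_s3_bucket.logs"},
    {"tenant_id": "globex", "resource_id": "aws_instance.api"},
]

def visible_rows(rows, app_tenant_id=None):
    """Return the rows an RLS-filtered query would see for this session."""
    if app_tenant_id is None:
        return []  # default deny: no tenant set, zero rows, no error
    return [r for r in rows if r["tenant_id"] == app_tenant_id]
```

Note the unset-tenant case returns an empty result rather than raising, matching the "silently filters" scenario.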

### Feature: Idempotent Event Processing

```gherkin
Feature: Idempotent Event Processing
  As a SaaS backend engineer
  I want event processing to be idempotent
  So that retries and replays do not create duplicate records

  Scenario: Same SQS message delivered twice (at-least-once delivery)
    Given an SQS message with MessageId "msg-001" was processed successfully
    When the same message is delivered again due to SQS retry
    Then the event processor detects the duplicate via scan_id lookup
    And skips processing
    And deletes the message from the queue

  Scenario: Event processor restarts mid-batch
    Given the event processor crashed after writing 5 of 10 events
    When the processor restarts and reprocesses the batch
    Then the 5 already-written events are skipped (idempotent)
    And the remaining 5 events are written
    And the final state has exactly 10 events

  Scenario: Replay from DLQ does not create duplicates
    Given a DLQ message is replayed after a bug fix
    And the event was partially processed before the crash
    When the replayed message is processed
    Then the processor uses upsert semantics
    And the final record reflects the correct state
```
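
The mid-batch restart scenario can be sketched as a dedup guard keyed on `(scan_id, resource_id)`. This is a minimal single-process model, assuming an in-memory set where the real processor would rely on the database's unique index:

```python
# Idempotency guard sketch: events already written are skipped on
# redelivery or replay, so reprocessing a batch is safe.

def process_batch(events, written):
    """Write each event exactly once; return the number actually written."""
    inserted = 0
    for event in events:
        key = (event["scan_id"], event["resource_id"])
        if key in written:
            continue  # duplicate delivery or replay: skip silently
        written.add(key)
        inserted += 1
    return inserted
```

Reprocessing the full batch after a crash writes only the missing half, leaving exactly one record per event.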

---

## Epic 4: Notification Engine

### Feature: Slack Block Kit Drift Alerts

```gherkin
Feature: Slack Block Kit Drift Alerts
  As a platform engineer
  I want to receive Slack notifications when drift is detected
  So that I can act on it immediately

  Background:
    Given a tenant has configured a Slack webhook URL
    And the notification engine is running

  Scenario: HIGH severity drift triggers immediate Slack alert
    Given a drift event with severity "HIGH" is stored
    When the notification engine processes the event
    Then a Slack Block Kit message is sent to the configured channel
    And the message includes the resource_id, drift type, and severity badge
    And the message includes an inline diff of expected vs actual values
    And the message includes a "Revert to IaC" action button

  Scenario: CRITICAL severity drift triggers immediate Slack alert with @here mention
    Given a drift event with severity "CRITICAL" is stored
    When the notification engine processes the event
    Then the Slack message includes an "@here" mention
    And the message is sent within 60 seconds of the event being stored

  Scenario: LOW severity drift is batched — not sent immediately
    Given a drift event with severity "LOW" is stored
    When the notification engine processes the event
    Then no immediate Slack message is sent
    And the event is queued for the next daily digest

  Scenario: INFO severity drift is suppressed from Slack
    Given a drift event with severity "INFO" is stored
    When the notification engine processes the event
    Then no Slack message is sent
    And the event is only visible in the dashboard

  Scenario: Slack message includes inline diff
    Given a drift event shows security group rule changed
    And expected value is "port 443 from 10.0.0.0/8"
    And actual value is "port 443 from 0.0.0.0/0"
    When the Slack alert is composed
    Then the message body includes a diff block showing the change
    And removed lines are prefixed with "-" in red
    And added lines are prefixed with "+" in green

  Scenario: Slack webhook returns 429 rate limit
    Given the Slack webhook returns HTTP 429
    When the notification engine attempts to send
    Then it respects the Retry-After header
    And retries after the specified delay
    And logs "slack rate limited, retrying in Xs" at WARN level

  Scenario: Slack webhook URL is invalid
    Given the tenant's Slack webhook URL returns HTTP 404
    When the notification engine attempts to send
    Then it logs "invalid slack webhook" at ERROR level
    And marks the notification as "failed" in the database
    And does not retry indefinitely (max 3 attempts)

  Scenario: Multiple drifts in same scan — grouped in one message
    Given a scan detects 5 drifted resources all with severity "HIGH"
    When the notification engine processes the batch
    Then a single Slack message is sent grouping all 5 resources
    And the message includes a summary "5 resources drifted"
    And each resource is listed with its diff
```
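
One way the alert payload could be composed is sketched below. The block layout and `action_id` value are assumptions for illustration (the scenarios don't pin an exact payload); the `section`/`actions`/`button` block types and the `<!here>` mention syntax are standard Slack Block Kit:

```python
def build_drift_alert(resource_id, severity, expected, actual):
    """Compose a Slack Block Kit payload for a single drifted resource."""
    diff = f"- {expected}\n+ {actual}"  # removed line first, added line second
    blocks = [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": f"*Drift detected:* `{resource_id}` [{severity}]"}},
        {"type": "section",
         "text": {"type": "mrkdwn", "text": f"```{diff}```"}},
        {"type": "actions",
         "elements": [{"type": "button",
                       "text": {"type": "plain_text", "text": "Revert to IaC"},
                       "action_id": "revert_to_iac"}]},
    ]
    if severity == "CRITICAL":
        # "<!here>" is Block Kit mrkdwn for an @here mention
        blocks.insert(0, {"type": "section",
                          "text": {"type": "mrkdwn", "text": "<!here> critical drift"}})
    return {"blocks": blocks}
```

The payload would then be POSTed to the tenant's webhook URL; the `action_id` is what the interaction handler matches on when the button is clicked.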

### Feature: Daily Digest

```gherkin
Feature: Daily Drift Digest
  As a platform engineer
  I want a daily summary of all drift events
  So that I have a consolidated view without alert fatigue

  Background:
    Given a tenant has daily digest enabled
    And the digest is scheduled for 09:00 tenant local time

  Scenario: Daily digest sent with pending LOW/INFO events
    Given 12 LOW severity drift events accumulated since the last digest
    When the digest job runs at 09:00
    Then a single Slack message is sent summarizing all 12 events
    And the message groups events by stack name
    And includes a link to the dashboard for full details

  Scenario: Daily digest skipped when no events
    Given no drift events occurred in the last 24 hours
    When the digest job runs
    Then no Slack message is sent
    And the job logs "digest skipped: no events" at INFO level

  Scenario: Daily digest includes resolved drifts
    Given 3 drift events were detected and then remediated in the last 24 hours
    When the digest runs
    Then the digest includes a "Resolved" section listing those 3 events
    And shows time-to-remediation for each

  Scenario: Digest timezone is per-tenant
    Given tenant "acme" is in timezone "America/New_York" (UTC-5)
    And tenant "globex" is in timezone "Asia/Tokyo" (UTC+9)
    When the digest scheduler runs
    Then "acme" receives their digest at 14:00 UTC
    And "globex" receives their digest at 00:00 UTC

  Scenario: Digest delivery fails — retried next hour
    Given the Slack webhook is temporarily unavailable at 09:00
    When the digest send fails
    Then the system retries at 10:00
    And again at 11:00
    And after 3 failures, marks the digest as "failed" and alerts the operator
```
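
The per-tenant timezone scenario reduces to converting each tenant's 09:00 local time to UTC. A minimal sketch with the stdlib `zoneinfo` module (Python 3.9+); the function name is hypothetical, and the scenario's fixed UTC-5 offset holds for standard (winter) dates:

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

def digest_send_time_utc(local_date, tz_name, hour=9):
    """Return the UTC instant of a tenant's 09:00-local digest on a given date."""
    local = datetime(local_date.year, local_date.month, local_date.day,
                     hour, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC"))
```

Using IANA zone names rather than fixed offsets means DST shifts are handled automatically, so the same rule still fires at 09:00 local in summer.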

### Feature: Severity-Based Routing

```gherkin
Feature: Severity-Based Notification Routing
  As a platform engineer
  I want different severity levels routed to different channels
  So that critical alerts reach the right people immediately

  Background:
    Given a tenant has configured routing rules

  Scenario: CRITICAL routed to #incidents channel
    Given the routing rule maps CRITICAL → "#incidents"
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then the message is sent to "#incidents"
    And not to "#drift-alerts"

  Scenario: HIGH routed to #drift-alerts channel
    Given the routing rule maps HIGH → "#drift-alerts"
    And a HIGH drift event occurs
    When the notification engine routes the alert
    Then the message is sent to "#drift-alerts"

  Scenario: No routing rule configured — fallback to default channel
    Given no routing rules are configured for severity "MEDIUM"
    And a MEDIUM drift event occurs
    When the notification engine routes the alert
    Then the message is sent to the tenant's default Slack channel

  Scenario: Multiple channels for same severity
    Given the routing rule maps CRITICAL → ["#incidents", "#sre-oncall"]
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then the message is sent to both "#incidents" and "#sre-oncall"

  Scenario: PagerDuty integration for CRITICAL severity
    Given the tenant has PagerDuty integration configured
    And the routing rule maps CRITICAL → PagerDuty
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then a PagerDuty incident is created via the Events API
    And the incident includes the drift event details and dashboard link
```
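
The routing rules above (single channel, channel list, or no rule with default fallback) can be resolved with a small pure function. A sketch under the assumption that rules are stored as a severity-keyed mapping:

```python
def route_channels(severity, rules, default_channel):
    """Resolve the target channels for a severity, falling back to the default."""
    target = rules.get(severity)
    if target is None:
        return [default_channel]   # no rule configured: tenant default
    if isinstance(target, str):
        return [target]            # single-channel rule
    return list(target)            # multi-channel rule (fan-out)
```

Returning a list in every case lets the caller fan out one send per channel without special-casing.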

---

## Epic 5: Remediation

### Feature: One-Click Revert via Slack

```gherkin
Feature: One-Click Revert to IaC via Slack
  As a platform engineer
  I want to trigger remediation directly from a Slack alert
  So that I can revert drift without leaving my chat tool

  Background:
    Given a HIGH severity drift alert was sent to Slack
    And the alert includes a "Revert to IaC" button

  Scenario: Engineer clicks Revert to IaC for non-destructive change
    Given the drift is a security group rule addition (non-destructive revert)
    When the engineer clicks "Revert to IaC"
    Then the SaaS backend receives the Slack interaction payload
    And sends a remediation command to the agent via the control plane
    And the agent runs "terraform apply -target=aws_security_group.web"
    And the Slack message is updated to show "Remediation in progress..."

  Scenario: Remediation completes successfully
    Given a remediation command was dispatched to the agent
    When the agent completes the terraform apply
    Then the agent sends a remediation result event to SaaS
    And the Slack message is updated to "✅ Reverted successfully"
    And a new scan is triggered immediately to confirm no drift

  Scenario: Remediation fails — agent reports error
    Given a remediation command was dispatched
    And the terraform apply exits with a non-zero code
    When the agent reports the failure
    Then the Slack message is updated to "❌ Remediation failed"
    And the error output is included in the Slack message (truncated to 500 chars)
    And the drift event status is set to "remediation_failed"

  Scenario: Revert button clicked by unauthorized user
    Given the Slack user "intern@acme.com" is not in the "remediation_approvers" group
    When they click "Revert to IaC"
    Then the SaaS backend rejects the action with "Unauthorized"
    And a Slack ephemeral message is shown: "You don't have permission to trigger remediation"
    And the action is logged with the user's identity

  Scenario: Revert button clicked twice (double-click protection)
    Given a remediation is already in progress for drift event "drift-001"
    When the "Revert to IaC" button is clicked again
    Then the SaaS backend returns "remediation already in progress"
    And a Slack ephemeral message is shown: "Remediation already running"
    And no duplicate remediation is triggered
```
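
The double-click protection scenario amounts to a first-click-wins guard per drift event. A minimal in-memory sketch; a real backend would back this with a database unique constraint or advisory lock so it holds across processes:

```python
class RemediationRegistry:
    """In-process sketch of the in-progress guard: the first click wins."""

    def __init__(self):
        self._in_progress = set()

    def try_start(self, drift_id):
        """Return True if this call claims the remediation, False if already running."""
        if drift_id in self._in_progress:
            return False  # maps to "remediation already in progress"
        self._in_progress.add(drift_id)
        return True

    def finish(self, drift_id):
        """Release the guard once the agent reports a terminal status."""
        self._in_progress.discard(drift_id)
```

A second click while the first remediation is running gets `False`, which the handler turns into the ephemeral "Remediation already running" reply.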

### Feature: Approval Workflow for Destructive Changes

```gherkin
Feature: Approval Workflow for Destructive Remediation
  As a security officer
  I want destructive remediations to require explicit approval
  So that no resources are accidentally deleted

  Background:
    Given a drift event involves a resource that would be deleted during revert
    And the tenant has approval workflow enabled

  Scenario: Destructive revert requires approval
    Given the drift revert would delete an RDS instance
    When an engineer clicks "Revert to IaC"
    Then instead of executing immediately, an approval request is sent
    And the Slack message shows "⚠️ Destructive change — approval required"
    And an approval request is sent to all users in the "remediation_approvers" group

  Scenario: Approver approves destructive remediation
    Given an approval request is pending for drift event "drift-002"
    When an approver clicks "Approve" in Slack
    Then the approval is recorded with the approver's identity and timestamp
    And the remediation command is dispatched to the agent
    And the Slack thread is updated: "Approved by @jane — executing..."

  Scenario: Approver rejects destructive remediation
    Given an approval request is pending
    When an approver clicks "Reject"
    Then the remediation is cancelled
    And the drift event status is set to "remediation_rejected"
    And the Slack message is updated: "❌ Rejected by @jane"
    And the rejection reason (if provided) is logged

  Scenario: Approval timeout — remediation auto-cancelled
    Given an approval request has been pending for 24 hours
    And no approver has responded
    When the approval timeout fires
    Then the remediation is automatically cancelled
    And the drift event status is set to "approval_timeout"
    And a Slack message is sent: "⏰ Approval timed out — remediation cancelled"
    And the event is included in the next daily digest

  Scenario: Approval timeout is configurable
    Given the tenant has set approval_timeout_hours to 4
    When an approval request is pending for 4 hours without response
    Then the timeout fires after 4 hours (not 24)

  Scenario: Self-approval is blocked
    Given engineer "alice@acme.com" triggered the remediation request
    When "alice@acme.com" attempts to approve their own request
    Then the approval is rejected with "Self-approval not permitted"
    And an ephemeral Slack message informs Alice

  Scenario: Minimum approvers requirement
    Given the tenant requires 2 approvals for destructive changes
    And only 1 approver has approved
    When the second approver approves
    Then the quorum is met
    And the remediation is dispatched
```
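
The self-approval and quorum rules combine naturally into one small state object. A sketch with hypothetical names, assuming approvals are a set so a repeated click by the same approver cannot satisfy the quorum:

```python
class ApprovalRequest:
    """Sketch of the quorum and self-approval rules from the scenarios."""

    def __init__(self, requester, required_approvals=2):
        self.requester = requester
        self.required = required_approvals
        self.approvers = set()  # set: the same approver can't count twice

    def approve(self, user):
        """Record an approval; return True once quorum is met (dispatch)."""
        if user == self.requester:
            raise PermissionError("Self-approval not permitted")
        self.approvers.add(user)
        return len(self.approvers) >= self.required
```

The caller dispatches the remediation only when `approve()` returns `True`, and records each approver's identity and timestamp alongside.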

### Feature: Agent-Side Remediation Execution

```gherkin
Feature: Agent-Side Remediation Execution
  As a platform engineer
  I want the agent to apply IaC changes to revert drift
  So that remediation happens inside the customer VPC with proper credentials

  Background:
    Given the agent has received a remediation command from the control plane
    And the agent has the necessary IAM permissions

  Scenario: Terraform revert executed successfully
    Given the remediation command specifies stack "prod" and resource "aws_security_group.web"
    When the agent executes the remediation
    Then it runs "terraform apply -target=aws_security_group.web -auto-approve"
    And captures stdout and stderr
    And reports the result back to SaaS via the control plane

  Scenario: kubectl revert executed successfully
    Given the remediation command specifies a Kubernetes Deployment "api-server"
    When the agent executes the remediation
    Then it runs "kubectl apply -f /etc/drift/stacks/prod/api-server.yaml"
    And reports the result back to SaaS

  Scenario: Remediation command times out
    Given the terraform apply is still running after 10 minutes
    When the remediation timeout fires
    Then the agent kills the terraform process
    And reports status "timeout" to SaaS
    And logs "remediation timed out after 10m" at ERROR level

  Scenario: Agent loses connectivity during remediation
    Given a remediation is in progress
    And the agent loses connectivity to SaaS mid-execution
    When connectivity is restored
    Then the agent reports the final remediation result
    And the SaaS backend reconciles the status

  Scenario: Remediation command is replayed after agent restart
    Given a remediation command was received but the agent restarted before executing
    When the agent restarts
    Then it checks for pending remediation commands
    And executes any pending commands
    And reports results to SaaS

  Scenario: Remediation is blocked when panic mode is active
    Given the tenant's panic mode is active
    When a remediation command is received by the agent
    Then the agent rejects the command
    And logs "remediation blocked: panic mode active" at WARN level
    And reports status "blocked_panic_mode" to SaaS
```
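
The two revert commands the scenarios name can be built as argv lists, which avoids shell-quoting issues and lets the agent run them with a timeout (e.g. `subprocess.run(cmd, timeout=600)` for the 10-minute scenario). A minimal sketch; the dispatch keys are assumptions:

```python
def build_revert_command(iac_type, target):
    """Build the agent's revert command line for the given IaC type."""
    if iac_type == "terraform":
        # Matches the scenario: targeted apply, no interactive prompt
        return ["terraform", "apply", f"-target={target}", "-auto-approve"]
    if iac_type == "kubernetes":
        # target is the path to the desired-state manifest
        return ["kubectl", "apply", "-f", target]
    raise ValueError(f"unsupported iac_type: {iac_type}")
```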

---

## Epic 6: Dashboard UI

### Feature: OAuth Login

```gherkin
Feature: OAuth Login
  As a user
  I want to log in via OAuth
  So that I don't need a separate password for the drift dashboard

  Background:
    Given the dashboard is running at "https://app.drift.dd0c.io"

  Scenario: Successful login via GitHub OAuth
    Given the user navigates to the dashboard login page
    When they click "Sign in with GitHub"
    Then they are redirected to GitHub's OAuth authorization page
    And after authorizing, redirected back with an authorization code
    And the backend exchanges the code for a JWT
    And the user is logged in and sees their tenant's dashboard

  Scenario: Successful login via Google OAuth
    Given the user clicks "Sign in with Google"
    When they complete the Google OAuth flow
    Then they are logged in with a valid JWT
    And the JWT contains their tenant_id and email claims

  Scenario: OAuth callback with invalid state parameter
    Given an OAuth callback arrives with a mismatched state parameter
    When the frontend processes the callback
    Then the login is rejected with "Invalid OAuth state"
    And the user is redirected to the login page with an error message
    And the event is logged as a potential CSRF attempt

  Scenario: JWT expires during session
    Given the user is logged in with a JWT that expires in 1 minute
    When the JWT expires
    Then the dashboard silently refreshes the token using the refresh token
    And the user's session continues uninterrupted

  Scenario: Refresh token is revoked
    Given the user's refresh token has been revoked (e.g., password change)
    When the dashboard attempts to refresh the JWT
    Then the refresh fails
    And the user is redirected to the login page
    And shown "Your session has expired, please log in again"

  Scenario: User belongs to no tenant
    Given a new OAuth user has no tenant association
    When they complete OAuth login
    Then they are redirected to the onboarding flow
    And prompted to create or join a tenant
```
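
The invalid-state scenario is the standard OAuth CSRF defense: an unguessable state value stored in the session before redirecting, then compared in constant time on the callback. A stdlib-only sketch with hypothetical function names:

```python
import hmac
import secrets

def new_oauth_state():
    """Generate an unguessable state value to store in the user's session."""
    return secrets.token_urlsafe(32)

def state_is_valid(session_state, callback_state):
    """Constant-time comparison of the stored state vs the callback's state."""
    if not session_state or not callback_state:
        return False  # missing either side is a rejection, not an exception
    return hmac.compare_digest(session_state, callback_state)
```

A `False` result maps to the "Invalid OAuth state" rejection and the CSRF-attempt log entry.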

### Feature: Stack Overview

```gherkin
Feature: Stack Overview Page
  As a platform engineer
  I want to see all my stacks and their drift status at a glance
  So that I can quickly identify which stacks need attention

  Background:
    Given the user is logged in as a member of tenant "acme"
    And tenant "acme" has 5 stacks configured

  Scenario: Stack overview loads all stacks
    Given the user navigates to the dashboard home
    When the page loads
    Then 5 stack cards are displayed
    And each card shows stack name, IaC type, last scan time, and drift count

  Scenario: Stack with active drift shows warning indicator
    Given stack "prod-api" has 3 active drift events
    When the stack overview loads
    Then the "prod-api" card shows a yellow warning badge with "3 drifts"

  Scenario: Stack with CRITICAL drift shows critical indicator
    Given stack "prod-db" has 1 CRITICAL drift event
    When the stack overview loads
    Then the "prod-db" card shows a red critical badge

  Scenario: Stack with no drift shows healthy indicator
    Given stack "staging" has no active drift events
    When the stack overview loads
    Then the "staging" card shows a green "Healthy" badge

  Scenario: Stack overview auto-refreshes
    Given the user is viewing the stack overview
    When 60 seconds elapse
    Then the page automatically refreshes drift counts without a full page reload

  Scenario: Tenant with no stacks sees empty state
    Given a new tenant has no stacks configured
    When they navigate to the dashboard
    Then an empty state is shown: "No stacks yet — run drift init to get started"
    And a link to the onboarding guide is displayed

  Scenario: Stack overview only shows current tenant's stacks
    Given tenant "acme" has 5 stacks and tenant "globex" has 3 stacks
    When a user from "acme" views the dashboard
    Then only 5 stacks are shown
    And no stacks from "globex" are visible
```

### Feature: Diff Viewer

```gherkin
Feature: Drift Diff Viewer
  As a platform engineer
  I want to see a detailed diff of what changed
  So that I understand exactly what drifted and how

  Background:
    Given the user is viewing a specific drift event

  Scenario: Diff viewer shows field-level changes
    Given a drift event has 3 changed fields
    When the user opens the diff viewer
    Then each changed field is shown with expected and actual values
    And removed values are highlighted in red
    And added values are highlighted in green

  Scenario: Diff viewer shows JSON diff for complex values
    Given a drift event involves a changed IAM policy document (JSON)
    When the user opens the diff viewer
    Then the policy JSON is shown as a structured diff
    And individual JSON fields are highlighted rather than the whole blob

  Scenario: Diff viewer handles large diffs with pagination
    Given a drift event has 50 changed fields
    When the user opens the diff viewer
    Then the first 20 fields are shown
    And a "Show more" button loads the remaining 30

  Scenario: Diff viewer shows resource metadata
    Given a drift event for resource "aws_security_group.web"
    When the user views the diff
    Then the viewer shows resource type, ARN, region, and stack name
    And the scan timestamp is displayed

  Scenario: Diff viewer copy-to-clipboard
    Given the user is viewing a diff
    When they click "Copy diff"
    Then the diff is copied to clipboard in unified diff format
    And a toast notification confirms "Copied to clipboard"
```
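
The "unified diff format" the copy-to-clipboard scenario mentions can be produced directly with the stdlib `difflib` module. A sketch; the file-label convention is an assumption:

```python
import difflib

def unified_drift_diff(resource_id, expected, actual):
    """Render expected vs actual values as a unified diff string."""
    lines = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile=f"{resource_id} (expected)",
        tofile=f"{resource_id} (actual)",
        lineterm="",  # we join with "\n" ourselves
    )
    return "\n".join(lines)
```

The `-`/`+` prefixed lines in the output are exactly what the viewer renders in red and green.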

### Feature: Drift Timeline

```gherkin
Feature: Drift Timeline
  As a platform engineer
  I want to see a timeline of drift events over time
  So that I can identify patterns and recurring issues

  Background:
    Given the user is viewing the drift timeline for stack "prod-api"

  Scenario: Timeline shows events in reverse chronological order
    Given 10 drift events exist for "prod-api" over the last 7 days
    When the user views the timeline
    Then events are listed newest first
    And each event shows timestamp, resource, severity, and status

  Scenario: Timeline filtered by severity
    Given the timeline has HIGH and LOW events
    When the user filters by severity "HIGH"
    Then only HIGH events are shown
    And the filter state is reflected in the URL for shareability

  Scenario: Timeline filtered by date range
    Given the user selects a date range of "last 30 days"
    When the filter is applied
    Then only events within the last 30 days are shown

  Scenario: Timeline shows remediation events
    Given a drift event was remediated
    When the user views the timeline
    Then the event shows a "Remediated" badge
    And the remediation timestamp and actor are shown

  Scenario: Timeline is empty for new stack
    Given a stack was added 1 hour ago and has no drift history
    When the user views the timeline
    Then an empty state is shown: "No drift history yet"

  Scenario: Timeline pagination
    Given 200 drift events exist for a stack
    When the user views the timeline
    Then the first 50 events are shown
    And infinite scroll or pagination loads more on demand
```

---

## Epic 7: Dashboard API

### Feature: JWT Authentication

```gherkin
Feature: JWT Authentication on Dashboard API
  As a SaaS engineer
  I want all API endpoints protected by JWT
  So that only authenticated users can access tenant data

  Background:
    Given the Dashboard API is running at "https://api.drift.dd0c.io"

  Scenario: Valid JWT grants access
    Given a request includes a valid JWT in the Authorization header
    And the JWT is not expired
    And the JWT signature is valid
    When the request reaches the API
    Then the request is processed
    And the response is returned with HTTP 200

  Scenario: Missing JWT returns 401
    Given a request has no Authorization header
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Authentication required"

  Scenario: Expired JWT returns 401
    Given a request includes a JWT that expired 5 minutes ago
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Token expired"

  Scenario: JWT with invalid signature returns 401
    Given a request includes a JWT with a tampered signature
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Invalid token"

  Scenario: JWT with wrong audience claim returns 401
    Given a request includes a JWT issued for a different service
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Invalid audience"

  Scenario: JWT tenant_id claim is used for RLS
    Given a JWT contains tenant_id "acme"
    When the request reaches the API
    Then the API sets PostgreSQL session variable "app.tenant_id" to "acme"
    And all queries are automatically scoped to tenant "acme" via RLS
```
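
The 401 cases above (tampered signature, expiry, wrong audience) can be sketched with a minimal HS256 verifier using only the standard library. This is an illustrative model, not a recommendation to hand-roll JWTs in production — a vetted library would be used there; function names and error strings mirror the scenarios:

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def encode_jwt(claims, secret):
    """Produce a signed HS256 token (header.payload.signature)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"

def verify_jwt(token, secret, audience, now=None):
    """Return (claims, None) on success or (None, error) matching the 401 cases."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return None, "Invalid token"
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        return None, "Invalid token"      # tampered signature
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) <= (now if now is not None else time.time()):
        return None, "Token expired"
    if claims.get("aud") != audience:
        return None, "Invalid audience"   # issued for a different service
    return claims, None
```

On success, the `tenant_id` claim is what the API would copy into the `app.tenant_id` session variable before running any query.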

### Feature: Tenant Isolation via RLS

```gherkin
Feature: Tenant Isolation via Row-Level Security
  As a security engineer
  I want the API to enforce tenant isolation at the database level
  So that a bug in application logic cannot leak cross-tenant data

  Background:
    Given the API uses PostgreSQL with RLS on all tenant-scoped tables

  Scenario: User from tenant A cannot access tenant B's stacks
    Given tenant "acme" has stacks ["prod", "staging"]
    And tenant "globex" has stacks ["prod"]
    When a user from "acme" calls GET /stacks
    Then only "acme"'s stacks are returned
    And "globex"'s stack is not included

  Scenario: Cross-tenant drift event access attempt
    Given drift event "drift-001" belongs to tenant "globex"
    When a user from "acme" calls GET /drifts/drift-001
    Then the API returns HTTP 404
    And no data from "globex" is exposed

  Scenario: Cross-tenant stack update attempt
    Given stack "prod" belongs to tenant "globex"
    When a user from "acme" calls PATCH /stacks/prod
    Then the API returns HTTP 404
    And the stack is not modified

  Scenario: RLS is enforced even if application code has a bug
    Given a hypothetical bug causes the API to omit the tenant_id filter in a query
    When the query executes
    Then PostgreSQL RLS still filters rows to the current tenant
    And no cross-tenant data is returned

  Scenario: Tenant isolation for remediation actions
    Given remediation "rem-001" belongs to tenant "globex"
    When a user from "acme" calls POST /remediations/rem-001/approve
    Then the API returns HTTP 404
    And the remediation is not affected
```

### Feature: Stack CRUD

```gherkin
Feature: Stack CRUD Operations
  As a platform engineer
  I want to manage my stacks via the API
  So that I can add, update, and remove stacks programmatically

  Background:
    Given the user is authenticated as a member of tenant "acme"

  Scenario: Create a new Terraform stack
    Given a POST /stacks request with body:
      """
      { "name": "prod-api", "iac_type": "terraform", "region": "us-east-1" }
      """
    When the request is processed
    Then the API returns HTTP 201
    And the response includes the new stack's id and created_at
    And the stack is visible in GET /stacks

  Scenario: Create stack with duplicate name
    Given a stack named "prod-api" already exists for tenant "acme"
    When a POST /stacks request is made with name "prod-api"
    Then the API returns HTTP 409
    And the response body includes "Stack name already exists"

  Scenario: Create stack exceeding free tier limit
    Given the tenant is on the free tier (max 3 stacks)
    And the tenant already has 3 stacks
    When a POST /stacks request is made
    Then the API returns HTTP 402
    And the response body includes "Free tier limit reached. Upgrade to add more stacks."

  Scenario: Update stack configuration
    Given stack "prod-api" exists
    When a PATCH /stacks/prod-api request updates the scan_interval to 30
    Then the API returns HTTP 200
    And the stack's scan_interval is updated to 30
    And the agent receives the updated config on the next heartbeat

  Scenario: Delete a stack
    Given stack "staging" exists with no active remediations
    When a DELETE /stacks/staging request is made
    Then the API returns HTTP 204
    And the stack is removed from GET /stacks
    And associated drift events are soft-deleted (retained for 90 days)

  Scenario: Delete a stack with active remediation
    Given stack "prod-api" has an active remediation in progress
    When a DELETE /stacks/prod-api request is made
    Then the API returns HTTP 409
    And the response body includes "Cannot delete stack with active remediation"

  Scenario: Get stack by ID
    Given stack "prod-api" exists
    When a GET /stacks/prod-api request is made
    Then the API returns HTTP 200
    And the response includes all stack fields including last_scan_at and drift_count
```
|
|||
|
|
|
|||
|
|
### Feature: Drift Event CRUD
|
|||
|
|
|
|||
|
|
```gherkin
|
|||
|
|
Feature: Drift Event API
|
|||
|
|
As a platform engineer
|
|||
|
|
I want to query and manage drift events via the API
|
|||
|
|
So that I can build integrations and automations
|
|||
|
|
|
|||
|
|
Background:
|
|||
|
|
Given the user is authenticated as a member of tenant "acme"
|
|||
|
|
|
|||
|
|
Scenario: List drift events for a stack
|
|||
|
|
Given stack "prod-api" has 10 drift events
|
|||
|
|
When GET /stacks/prod-api/drifts is called
|
|||
|
|
Then the API returns HTTP 200
|
|||
|
|
And the response includes all 10 events
|
|||
|
|
And events are sorted by detected_at descending
|
|||
|
|
|
|||
|
|
Scenario: Filter drift events by severity
|
|||
|
|
Given drift events include HIGH and LOW severity events
|
|||
|
|
When GET /drifts?severity=HIGH is called
|
|||
|
|
Then only HIGH severity events are returned
|
|||
|
|
|
|||
|
|
Scenario: Filter drift events by status
|
|||
|
|
When GET /drifts?status=active is called
|
|||
|
|
Then only unresolved drift events are returned
|
|||
|
|
|
|||
|
|
Scenario: Mark drift event as acknowledged
|
|||
|
|
Given drift event "drift-001" has status "active"
|
|||
|
|
When POST /drifts/drift-001/acknowledge is called
|
|||
|
|
Then the API returns HTTP 200
|
|||
|
|
And the event status is updated to "acknowledged"
|
|||
|
|
And the acknowledged_by and acknowledged_at fields are set
|
|||
|
|
|
|||
|
|
Scenario: Mark drift event as resolved manually
|
|||
|
|
Given drift event "drift-001" has status "active"
|
|||
|
|
When POST /drifts/drift-001/resolve is called with body {"reason": "manual fix applied"}
|
|||
|
|
Then the API returns HTTP 200
|
|||
|
|
And the event status is updated to "resolved"
|
|||
|
|
And the resolution reason is stored
|
|||
|
|
|
|||
|
|
Scenario: Pagination on drift events list
|
|||
|
|
Given 200 drift events exist
|
|||
|
|
When GET /drifts?page=1&per_page=50 is called
|
|||
|
|
Then 50 events are returned
|
|||
|
|
And the response includes pagination metadata (total, page, per_page, total_pages)
|
|||
|
|
```
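
The pagination scenario pins down the metadata shape (total, page, per_page, total_pages). The arithmetic is small but worth making explicit, sketched here in Python (the function name is an assumption; the real service is TypeScript):

```python
import math

def paginate(total: int, page: int, per_page: int) -> dict:
    # Metadata matching the scenario: 200 events at 50/page -> 4 pages.
    return {
        "total": total,
        "page": page,
        "per_page": per_page,
        "total_pages": math.ceil(total / per_page) if per_page else 0,
    }
```

Note the ceiling division: 201 events at 50 per page is 5 pages, not 4.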

### Feature: API Rate Limiting

```gherkin
Feature: API Rate Limiting
  As a SaaS operator
  I want API rate limits enforced per tenant
  So that one tenant cannot degrade service for others

  Background:
    Given the API enforces rate limits per tenant

  Scenario: Request within rate limit succeeds
    Given the rate limit is 1000 requests per minute
    And the tenant has made 500 requests this minute
    When a new request is made
    Then the API returns HTTP 200
    And the response includes headers X-RateLimit-Remaining and X-RateLimit-Reset

  Scenario: Request exceeds rate limit
    Given the tenant has made 1000 requests this minute
    When a new request is made
    Then the API returns HTTP 429
    And the response includes Retry-After header
    And the response body includes "Rate limit exceeded"

  Scenario: Rate limit resets after window
    Given the tenant hit the rate limit at T+0
    When 60 seconds elapse (new window)
    Then the tenant's request counter resets
    And new requests succeed
```
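
The three scenarios above describe a per-tenant fixed-window counter: count within a 60-second window, return 429 with Retry-After once the limit is hit, and reset at the window boundary. A minimal Python sketch under those assumptions (the real service would back the counters with shared storage such as Redis so all API nodes agree):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Per-tenant fixed-window counter; illustrative, not the service code."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (tenant, window_start) -> count

    def check(self, tenant: str, now: float = None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        key = (tenant, window_start)
        if self.counts[key] >= self.limit:
            # Over the limit: 429 plus how long until the window resets.
            retry_after = int(window_start + self.window - now) + 1
            return 429, {"Retry-After": retry_after}
        self.counts[key] += 1
        return 200, {"X-RateLimit-Remaining": self.limit - self.counts[key],
                     "X-RateLimit-Reset": window_start + self.window}
```

Because the key includes the window start, the counter "resets" simply by moving to a new key when the next window begins.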

---

## Epic 8: Infrastructure

### Feature: Terraform/CDK SaaS Infrastructure Provisioning

```gherkin
Feature: SaaS Infrastructure Provisioning
  As a SaaS platform engineer
  I want the SaaS infrastructure defined as code
  So that environments are reproducible and auditable

  Background:
    Given the infrastructure code lives in "infra/" directory
    And Terraform and CDK are both used for different layers

  Scenario: Terraform plan produces no unexpected changes in production
    Given the production Terraform state is up to date
    When "terraform plan" runs against the production workspace
    Then the plan shows zero resource changes
    And the plan output is stored as a CI artifact

  Scenario: New environment provisioned from scratch
    Given a new environment "staging-eu" is needed
    When "terraform apply -var-file=staging-eu.tfvars" runs
    Then all required resources are created (VPC, RDS, SQS, ECS, etc.)
    And the environment is reachable within 15 minutes
    And outputs (queue URLs, DB endpoints) are stored in SSM Parameter Store

  Scenario: RDS instance is provisioned with encryption at rest
    Given the Terraform module for RDS is applied
    When the RDS instance is created
    Then storage_encrypted is true
    And the KMS key ARN is set to the tenant-specific key

  Scenario: SQS FIFO queues are provisioned with DLQ
    Given the SQS Terraform module is applied
    When the queues are created
    Then "drift-reports.fifo" exists with content_based_deduplication enabled
    And "drift-reports-dlq.fifo" exists as the redrive target
    And maxReceiveCount is set to 3

  Scenario: CDK stack drift detected by drift agent (dogfooding)
    Given the SaaS CDK stacks are monitored by the drift agent itself
    When a CDK resource is manually modified in the AWS console
    Then the drift agent detects the change
    And an internal alert is sent to the SaaS ops channel

  Scenario: Infrastructure destroy is blocked in production
    Given a Terraform workspace is tagged as "production"
    When "terraform destroy" is attempted
    Then the CI pipeline rejects the command
    And logs "destroy blocked in production environment"
```
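
The last scenario implies a small guard in the CI pipeline that inspects the command and the workspace tags before letting Terraform run. A minimal sketch of that check, assuming a tag set per workspace (the function name and return shape are illustrative, not the real pipeline script):

```python
def guard_destroy(command: str, workspace_tags: set) -> tuple:
    """Reject 'terraform destroy' against any workspace tagged production."""
    if "destroy" in command.split() and "production" in workspace_tags:
        return False, "destroy blocked in production environment"
    return True, "ok"
```

In CI this would run before the Terraform step and fail the job when the first element is False, emitting the message as the log line the scenario asserts on.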

### Feature: GitHub Actions CI/CD

```gherkin
Feature: GitHub Actions CI/CD Pipeline
  As a platform engineer
  I want automated CI/CD via GitHub Actions
  So that code changes are tested and deployed safely

  Background:
    Given GitHub Actions workflows are defined in ".github/workflows/"

  Scenario: PR triggers CI checks
    Given a pull request is opened against the main branch
    When the CI workflow triggers
    Then unit tests run for Go agent code
    And unit tests run for TypeScript SaaS code
    And Terraform plan runs for infrastructure changes
    And all checks must pass before merge is allowed

  Scenario: Merge to main triggers staging deployment
    Given a PR is merged to the main branch
    When the deploy workflow triggers
    Then the Go agent binary is built and pushed to ECR
    And the TypeScript services are built and deployed to ECS staging
    And smoke tests run against staging
    And the deployment is marked successful if smoke tests pass

  Scenario: Staging smoke tests fail — production deploy blocked
    Given staging deployment completed
    And smoke tests fail (e.g., health check returns 500)
    When the pipeline evaluates the smoke test results
    Then the production deployment step is skipped
    And a Slack alert is sent to "#deployments" channel
    And the pipeline exits with failure

  Scenario: Production deployment requires manual approval
    Given staging smoke tests passed
    When the pipeline reaches the production deployment step
    Then it pauses and waits for a manual approval in GitHub Actions
    And the approval request is sent to the "production-deployers" team
    And deployment proceeds only after approval

  Scenario: Rollback triggered on production health check failure
    Given a production deployment completed
    And the post-deploy health check fails within 5 minutes
    When the rollback workflow triggers
    Then the previous ECS task definition revision is redeployed
    And a Slack alert is sent: "Production rollback triggered"
    And the failed deployment is logged with the commit SHA

  Scenario: Terraform plan diff is posted to PR as comment
    Given a PR modifies infrastructure code
    When the CI Terraform plan runs
    Then the plan output is posted as a comment on the PR
    And the comment includes a summary of resources to add/change/destroy

  Scenario: Secrets are never logged in CI
    Given the CI pipeline uses AWS credentials and Slack tokens
    When any CI step runs
    Then no secret values appear in the GitHub Actions log output
    And GitHub secret masking is verified in the workflow config
```

### Feature: Database Migrations in CI/CD

```gherkin
Feature: Database Migrations
  As a SaaS engineer
  I want database migrations to run automatically in CI/CD
  So that schema changes are applied safely and consistently

  Background:
    Given migrations are managed with a migration tool (e.g., golang-migrate or Flyway)

  Scenario: Migration runs successfully on deploy
    Given a new migration file "V20_add_remediation_status.sql" exists
    When the deploy pipeline runs
    Then the migration is applied to the target database
    And the migration is recorded in the schema_migrations table
    And the deploy continues

  Scenario: Migration is idempotent — already-applied migration is skipped
    Given migration "V20_add_remediation_status.sql" was already applied
    When the deploy pipeline runs again
    Then the migration is skipped
    And no error is thrown

  Scenario: Migration fails — deploy is halted
    Given a migration contains a syntax error
    When the migration runs
    Then the migration fails and is rolled back
    And the deploy pipeline halts
    And an alert is sent to the engineering team

  Scenario: Additive-only migrations enforced
    Given a migration attempts to drop a column
    When the CI linter checks the migration
    Then the lint check fails with "destructive migration not allowed"
    And the PR is blocked from merging
```
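
The idempotency scenario hinges on the schema_migrations ledger: anything already recorded there is skipped without error, anything new is applied in order. A minimal sketch of that loop (the filename "V19_create_stacks.sql" in the usage note is hypothetical; real tools like golang-migrate or Flyway implement this themselves):

```python
def run_pending(migrations: list, applied: set) -> list:
    """Apply migrations not yet recorded in schema_migrations, in order."""
    ran = []
    for name in sorted(migrations):
        if name in applied:
            continue  # already applied on an earlier deploy: skip, no error
        # ... execute the migration SQL inside a transaction here ...
        applied.add(name)  # record in the schema_migrations table
        ran.append(name)
    return ran
```

Running the same deploy twice therefore applies "V20_add_remediation_status.sql" exactly once; the second pass returns an empty list.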

---

## Epic 9: Onboarding & PLG

### Feature: drift init CLI

```gherkin
Feature: drift init CLI Onboarding
  As a new user
  I want a guided CLI setup experience
  So that I can connect my infrastructure to drift in minutes

  Background:
    Given the drift CLI is installed via "curl -sSL https://get.drift.dd0c.io | sh"

  Scenario: First-time drift init runs guided setup
    Given the user runs "drift init" for the first time
    When the CLI starts
    Then it prompts for: cloud provider, IaC type, region, and stack name
    And validates each input before proceeding
    And generates a config file at "/etc/drift/config.yaml"

  Scenario: drift init detects existing Terraform state
    Given a Terraform state file exists in the current directory
    When the user runs "drift init"
    Then the CLI auto-detects the IaC type as "terraform"
    And pre-fills the stack name from the Terraform workspace name
    And asks the user to confirm

  Scenario: drift init creates IAM role with least-privilege policy
    Given the user confirms IAM role creation
    When "drift init" runs
    Then it creates an IAM role "drift-agent-role" with only required permissions
    And outputs the role ARN for the config

  Scenario: drift init generates and installs agent certificates
    Given the user has authenticated with the SaaS backend
    When "drift init" completes
    Then it fetches a signed mTLS certificate from the SaaS CA
    And stores the certificate at "/etc/drift/certs/agent.crt"
    And stores the private key at "/etc/drift/certs/agent.key" with mode 0600

  Scenario: drift init installs agent as systemd service
    Given the user is on a Linux system with systemd
    When "drift init" completes
    Then it installs a systemd unit file for the drift agent
    And enables and starts the service
    And confirms "drift-agent is running" in the output

  Scenario: drift init fails gracefully on missing AWS credentials
    Given no AWS credentials are configured
    When "drift init" runs
    Then it detects missing credentials
    And outputs a helpful error: "AWS credentials not found. Run 'aws configure' first."
    And exits with code 1

  Scenario: drift init --dry-run shows what would be created
    Given the user runs "drift init --dry-run"
    When the CLI runs
    Then it outputs all actions it would take without executing them
    And no resources are created
    And no config files are written
```
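
The --dry-run scenario is the classic plan/execute split: build the full list of actions up front, print it either way, and perform side effects only when not in dry-run mode. A minimal sketch under that assumption (the action strings and function signature are illustrative; the real CLI is not this Python code):

```python
PLANNED_ACTIONS = [
    "create IAM role drift-agent-role",
    "write config to /etc/drift/config.yaml",
    "install systemd unit drift-agent.service",
]

def drift_init(dry_run: bool, execute=lambda action: None) -> list:
    """Return the plan; perform side effects only when dry_run is False."""
    if not dry_run:
        for action in PLANNED_ACTIONS:
            execute(action)
    return PLANNED_ACTIONS
```

Injecting the executor makes the "no resources are created" assertion directly testable: in dry-run mode the callback is never invoked.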

### Feature: Free Tier — 3 Stacks

```gherkin
Feature: Free Tier Stack Limit
  As a product manager
  I want the free tier limited to 3 stacks
  So that we have a clear upgrade path

  Background:
    Given a tenant is on the free tier

  Scenario: Free tier tenant can add up to 3 stacks
    Given the tenant has 0 stacks
    When they add stacks "prod", "staging", and "dev"
    Then all 3 stacks are created successfully
    And the tenant is not prompted to upgrade

  Scenario: Free tier tenant blocked from adding 4th stack
    Given the tenant has 3 stacks
    When they attempt to add a 4th stack via the CLI
    Then the CLI outputs "Free tier limit reached (3/3 stacks). Upgrade at https://app.drift.dd0c.io/billing"
    And exits with code 1

  Scenario: Free tier tenant blocked from adding 4th stack via API
    Given the tenant has 3 stacks
    When POST /stacks is called
    Then the API returns HTTP 402
    And the response includes upgrade_url

  Scenario: Free tier tenant blocked from adding 4th stack via dashboard
    Given the tenant has 3 stacks
    When they click "Add Stack" in the dashboard
    Then a modal appears: "You've reached the free tier limit"
    And an "Upgrade Plan" button is shown

  Scenario: Upgrading to paid tier unlocks unlimited stacks
    Given the tenant upgrades to the paid plan via Stripe
    When the Stripe webhook confirms payment
    Then the tenant's stack limit is set to unlimited
    And they can immediately add a 4th stack
```
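
CLI, API, and dashboard all enforce the same rule, so the check naturally lives in one shared function. A minimal sketch of that shared check (names and return shape are assumptions; the API layer would map the 402 to the HTTP response with an upgrade_url):

```python
FREE_TIER_STACK_LIMIT = 3

def can_add_stack(plan: str, current_stacks: int):
    """One check behind the CLI error, the API's 402, and the dashboard modal."""
    if plan == "free" and current_stacks >= FREE_TIER_STACK_LIMIT:
        return False, 402
    return True, None  # paid plans are unlimited
```

Centralizing the check is what keeps the three surfaces consistent when the limit changes.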

### Feature: Stripe Billing

```gherkin
Feature: Stripe Billing Integration
  As a product manager
  I want usage-based billing via Stripe
  So that customers are charged $29/stack/month

  Background:
    Given Stripe is configured with the drift product and price

  Scenario: New tenant subscribes to paid plan
    Given a free tier tenant clicks "Upgrade"
    When they complete the Stripe Checkout flow
    Then a Stripe subscription is created for the tenant
    And the subscription includes a metered item for stack count
    And the tenant's plan is updated to "paid" in the database

  Scenario: Monthly invoice calculated at $29/stack
    Given a tenant has 5 stacks active for the full billing month
    When Stripe generates the monthly invoice
    Then the invoice total is $145.00 (5 × $29)
    And the invoice is sent to the tenant's billing email

  Scenario: Stack added mid-month — prorated charge
    Given a tenant adds a 6th stack on the 15th of the month
    When Stripe generates the monthly invoice
    Then the 6th stack is charged prorated (~$14.50 for half month)

  Scenario: Stack deleted mid-month — prorated credit
    Given a tenant deletes a stack on the 10th of the month
    When Stripe generates the monthly invoice
    Then a prorated credit is applied for the unused days

  Scenario: Payment fails — tenant notified and grace period applied
    Given a tenant's payment method is declined
    When Stripe sends the payment_failed webhook
    Then the tenant receives an email: "Payment failed — please update your billing info"
    And a 7-day grace period is applied before service is restricted

  Scenario: Grace period expires — stacks suspended
    Given a tenant's payment has been failing for 7 days
    When the grace period expires
    Then the tenant's stacks are suspended (scans paused)
    And the dashboard shows a banner: "Account suspended — payment required"
    And the agent stops sending reports

  Scenario: Payment updated — service restored immediately
    Given a tenant's stacks are suspended due to non-payment
    When the tenant updates their payment method and payment succeeds
    Then the Stripe webhook triggers service restoration
    And stacks are unsuspended within 60 seconds
    And scans resume on the next scheduled cycle

  Scenario: Stripe webhook signature validation
    Given a webhook arrives at POST /webhooks/stripe
    When the webhook signature is invalid
    Then the API returns HTTP 400
    And the event is ignored
    And the attempt is logged as a potential spoofing attempt

  Scenario: Free tier tenant is never charged
    Given a tenant is on the free tier with 3 stacks
    When the billing cycle runs
    Then no Stripe invoice is generated for this tenant
    And no charge is made
```
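
The invoice scenarios pin down the pricing math: 5 full-month stacks total $145.00, and half a month of one stack is roughly $14.50. Stripe computes proration itself in production; this sketch just makes the day-based arithmetic the scenarios assert on explicit (a 30-day month is an assumption for the half-month example):

```python
PRICE_PER_STACK = 29.00  # $/stack/month

def stack_charge(days_active: int, days_in_month: int = 30) -> float:
    """Day-based proration for a single stack, rounded to cents."""
    return round(PRICE_PER_STACK * days_active / days_in_month, 2)
```

A stack active all 30 days costs the full $29.00; five of them make the $145.00 invoice, and 15 days of the 6th stack comes to $14.50.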

### Feature: Guided Setup Flow

```gherkin
Feature: Guided Setup Flow in Dashboard
  As a new user
  I want a step-by-step setup guide in the dashboard
  So that I can get value from drift quickly

  Background:
    Given a new tenant has just signed up and logged in

  Scenario: Onboarding checklist is shown to new tenants
    Given the tenant has completed 0 onboarding steps
    When they log in for the first time
    Then an onboarding checklist is shown with steps:
      | Step                   | Status  |
      | Install drift agent    | Pending |
      | Add your first stack   | Pending |
      | Configure Slack alerts | Pending |
      | Run your first scan    | Pending |

  Scenario: Checklist step marked complete automatically
    Given the tenant installs the agent and it sends its first heartbeat
    When the dashboard refreshes
    Then the "Install drift agent" step is marked complete
    And a congratulatory message is shown

  Scenario: Onboarding checklist dismissed after all steps complete
    Given all 4 onboarding steps are complete
    When the tenant views the dashboard
    Then the checklist is replaced with the normal stack overview
    And a one-time "You're all set!" banner is shown

  Scenario: Onboarding checklist can be dismissed early
    Given the tenant has completed 2 of 4 steps
    When they click "Dismiss checklist"
    Then the checklist is hidden
    And a "Resume setup" link is available in the settings page
```

---

## Epic 10: Transparent Factory

### Feature: Feature Flags

```gherkin
Feature: Feature Flags
  As a product engineer
  I want feature flags to control rollout of new capabilities
  So that I can ship safely and roll back instantly

  Background:
    Given the feature flag service is running
    And flags are evaluated per-tenant

  Scenario: Feature flag enabled for specific tenant
    Given flag "remediation_v2" is enabled for tenant "acme"
    And flag "remediation_v2" is disabled for tenant "globex"
    When tenant "acme" triggers a remediation
    Then the v2 remediation code path is used
    When tenant "globex" triggers a remediation
    Then the v1 remediation code path is used

  Scenario: Feature flag enabled for percentage rollout
    Given flag "new_diff_viewer" is enabled for 10% of tenants
    When 1000 tenants load the dashboard
    Then approximately 100 tenants see the new diff viewer
    And the remaining 900 see the existing diff viewer

  Scenario: Feature flag disabled globally kills a feature
    Given flag "experimental_pulumi_scan" is globally disabled
    When any tenant attempts to add a Pulumi stack
    Then the API returns HTTP 501 "Feature not available"
    And the dashboard hides the Pulumi option in the stack type selector

  Scenario: Feature flag change takes effect without deployment
    Given flag "slack_digest_v2" is currently disabled
    When an operator enables the flag in the flag management console
    Then within 30 seconds, the notification engine uses the v2 digest format
    And no service restart is required

  Scenario: Feature flag evaluation is logged for audit
    Given flag "remediation_v2" is evaluated for tenant "acme"
    When the flag is checked
    Then the evaluation (flag name, tenant, result, timestamp) is written to the audit log
    And the audit log is queryable for compliance review

  Scenario: Unknown feature flag defaults to disabled
    Given code checks for flag "nonexistent_flag"
    When the flag service evaluates it
    Then the result is "disabled" (safe default)
    And a warning is logged: "unknown flag: nonexistent_flag"
```
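
The percentage-rollout scenario requires two properties: the same tenant always gets the same answer (no flicker between page loads), and roughly 10% of tenants land in the enabled group. Hashing the flag name together with the tenant id into a stable 0-99 bucket gives both. A minimal sketch, assuming this hashing scheme (the real flag service may bucket differently):

```python
import hashlib

def flag_enabled(flag: str, tenant_id: str, percent: int) -> bool:
    """Stable 0-99 bucket per (flag, tenant); enabled below the rollout %."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Including the flag name in the hash decorrelates rollouts, so the 10% of tenants who see "new_diff_viewer" are not the same 10% who get every other experiment.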

### Feature: Additive Schema Migrations

```gherkin
Feature: Additive-Only Schema Migrations
  As a SaaS engineer
  I want all schema changes to be additive
  So that deployments are zero-downtime and rollback-safe

  Background:
    Given the migration linter runs in CI on every PR

  Scenario: Adding a new nullable column is allowed
    Given a migration adds column "remediation_status VARCHAR(50) NULL"
    When the migration linter checks the file
    Then the lint check passes
    And the migration is approved for merge

  Scenario: Adding a new table is allowed
    Given a migration creates a new table "decision_logs"
    When the migration linter checks the file
    Then the lint check passes

  Scenario: Dropping a column is blocked
    Given a migration contains "ALTER TABLE drift_events DROP COLUMN old_field"
    When the migration linter checks the file
    Then the lint check fails with "destructive operation: DROP COLUMN not allowed"
    And the PR is blocked

  Scenario: Dropping a table is blocked
    Given a migration contains "DROP TABLE legacy_alerts"
    When the migration linter checks the file
    Then the lint check fails with "destructive operation: DROP TABLE not allowed"

  Scenario: Renaming a column is blocked
    Given a migration contains "ALTER TABLE stacks RENAME COLUMN name TO stack_name"
    When the migration linter checks the file
    Then the lint check fails with "destructive operation: RENAME COLUMN not allowed"
    And the suggested alternative is to add a new column and deprecate the old one

  Scenario: Adding a NOT NULL column without default is blocked
    Given a migration adds "ALTER TABLE stacks ADD COLUMN owner_id UUID NOT NULL"
    When the migration linter checks the file
    Then the lint check fails with "NOT NULL column without DEFAULT will break existing rows"

  Scenario: Old column marked deprecated — not yet removed
    Given column "legacy_iac_path" is marked with a deprecation comment in the schema
    When the application code is deployed
    Then the column still exists in the database
    And the application ignores it
    And a deprecation notice is logged at startup

  Scenario: Column removal only after 2 release cycles
    Given column "legacy_iac_path" has been deprecated for 2 releases
    And all application code no longer references it
    When an engineer submits a migration to drop the column
    Then the migration linter checks the deprecation age
    And allows the drop if the deprecation period has elapsed
```
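
The linter behavior above reduces to pattern-matching the migration SQL for the blocked operations. A minimal sketch of that check (the pattern list is illustrative and not exhaustive; a production linter would parse the SQL rather than regex it, and would also implement the deprecation-age exception from the last scenario):

```python
import re

# (pattern, message) pairs for the destructive operations the scenarios block.
DESTRUCTIVE = [
    (r"\bDROP\s+COLUMN\b", "DROP COLUMN not allowed"),
    (r"\bDROP\s+TABLE\b", "DROP TABLE not allowed"),
    (r"\bRENAME\s+COLUMN\b", "RENAME COLUMN not allowed"),
]

def lint_migration(sql: str) -> list:
    """Return lint errors for a migration; empty list means it passes."""
    errors = [f"destructive operation: {msg}"
              for pattern, msg in DESTRUCTIVE
              if re.search(pattern, sql, re.IGNORECASE)]
    # NOT NULL without a DEFAULT would fail on tables with existing rows.
    if re.search(r"\bADD\s+COLUMN\b(?!.*\bDEFAULT\b).*\bNOT\s+NULL\b",
                 sql, re.IGNORECASE):
        errors.append("NOT NULL column without DEFAULT will break existing rows")
    return errors
```

CI would run this over every changed file under the migrations directory and block the PR when any file returns a non-empty error list.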

### Feature: Decision Logs

```gherkin
Feature: Decision Logs
  As an engineering lead
  I want architectural and operational decisions logged
  So that the team has a transparent record of why things are the way they are

  Background:
    Given the decision log is stored in "docs/decisions/" as markdown files

  Scenario: New ADR created for significant architectural change
    Given an engineer proposes switching from SQS to Kafka
    When they create "docs/decisions/ADR-042-kafka-vs-sqs.md"
    Then the ADR includes: context, decision, consequences, and status
    And the PR requires at least 2 reviewers from the architecture group

  Scenario: ADR status transitions are tracked
    Given ADR-042 has status "proposed"
    When the team accepts the decision
    Then the status is updated to "accepted"
    And the accepted_at date is recorded
    And the ADR is immutable after acceptance (changes require a new ADR)

  Scenario: Superseded ADR is linked to its replacement
    Given ADR-010 is superseded by ADR-042
    When ADR-042 is accepted
    Then ADR-010's status is updated to "superseded"
    And ADR-010 includes a link to ADR-042

  Scenario: Decision log is searchable
    Given 50 ADRs exist in the decision log
    When an engineer searches for "database"
    Then all ADRs mentioning "database" in title or body are returned

  Scenario: Operational decisions logged for drift remediation
    Given an operator manually overrides a remediation decision
    When the override is applied
    Then a decision log entry is created with: operator identity, reason, timestamp, and affected resource
    And the entry is linked to the drift event
```
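
The status scenarios define a small state machine over the three statuses the feature names (proposed, accepted, superseded): a proposed ADR can be accepted, an accepted ADR is immutable except for being superseded by a newer one. A minimal sketch of that transition check (names are illustrative):

```python
# The only legal moves in the ADR lifecycle described above.
ALLOWED_TRANSITIONS = {
    ("proposed", "accepted"),
    ("accepted", "superseded"),
}

def transition(current: str, new: str) -> str:
    """Accepted ADRs are immutable; the only change left is supersession."""
    if (current, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"invalid ADR transition: {current} -> {new}")
    return new
```

Encoding the transitions as a set makes "immutable after acceptance" a consequence of the data rather than scattered if-statements.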
|
|||
|
|
|
|||
|
|
### Feature: OTEL Tracing
|
|||
|
|
|
|||
|
|
```gherkin
|
|||
|
|
Feature: OpenTelemetry Distributed Tracing
|
|||
|
|
  As a SaaS engineer
  I want end-to-end distributed tracing via OTEL
  So that I can diagnose latency and errors across services

  Background:
    Given OTEL is configured with a Jaeger/Tempo backend
    And all services export traces

  Scenario: Drift report ingestion is fully traced
    Given an agent publishes a drift report to SQS
    When the event processor consumes and processes the message
    Then a trace exists spanning: SQS receive → normalization → severity scoring → DB write
    And each span includes tenant_id and scan_id as attributes
    And the total trace duration is under 2 seconds for normal reports

  Scenario: Slack notification is traced end-to-end
    Given a drift event triggers a Slack notification
    When the notification is sent
    Then a trace exists spanning: event stored → notification engine → Slack API call
    And the Slack API response code is recorded as a span attribute

  Scenario: Remediation flow is fully traced
    Given a remediation is triggered from Slack
    When the remediation completes
    Then a trace exists spanning: Slack interaction → API → control plane → agent → result
    And the trace includes the approver identity and approval timestamp

  Scenario: Slow span triggers latency alert
    Given the DB write span exceeds 500ms
    When the trace is analyzed
    Then a latency alert fires in the observability platform
    And the alert includes the trace_id for direct investigation

  Scenario: Trace context propagated across service boundaries
    Given the agent sends a drift report with a trace context header
    When the event processor receives the message
    Then it extracts the trace context from the SQS message attributes
    And continues the trace as a child span
    And the full trace is visible as a single tree in Jaeger

  Scenario: Traces do not contain PII or secrets
    Given a drift report is processed end-to-end
    When the trace is exported to the tracing backend
    Then no span attributes contain secret values
    And no span attributes contain tenant PII beyond tenant_id
    And the scrubber audit confirms 0 secrets in trace data

  Scenario: OTEL collector is unavailable — service continues
    Given the OTEL collector is down
    When the event processor handles a drift report
    Then the report is processed normally
    And trace export failures are logged at DEBUG level
    And no errors are surfaced to the end user
```
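The context-propagation scenario hinges on carrying a trace context header inside SQS message attributes so the consumer can continue the producer's trace. A minimal Python sketch of that mechanic, assuming W3C `traceparent` formatting handled by hand rather than the real OTEL SDK (the function names `inject_trace_context` and `extract_trace_context` are illustrative, not part of any spec):

```python
import re
import secrets

# W3C traceparent: version "00" - 32-hex trace_id - 16-hex span_id - 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject_trace_context(msg_attrs: dict, trace_id: str, span_id: str) -> None:
    """Producer side: attach a traceparent header as an SQS message attribute."""
    msg_attrs["traceparent"] = {
        "DataType": "String",
        "StringValue": f"00-{trace_id}-{span_id}-01",
    }

def extract_trace_context(msg_attrs: dict):
    """Consumer side: parse the traceparent attribute so the processor's span
    can be created as a child (same trace_id, parent = the agent's span_id)."""
    value = msg_attrs.get("traceparent", {}).get("StringValue", "")
    m = TRACEPARENT_RE.match(value)
    if m is None:
        return None  # no context: start a fresh root trace instead
    trace_id, parent_span_id, _flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id}

# Agent publishes with context; event processor continues the same trace tree.
attrs: dict = {}
trace_id = secrets.token_hex(16)  # 32 hex chars
span_id = secrets.token_hex(8)    # 16 hex chars
inject_trace_context(attrs, trace_id, span_id)
ctx = extract_trace_context(attrs)
assert ctx["trace_id"] == trace_id  # child span joins the agent's trace
```

In a real service the OTEL SDK's propagator would do the inject/extract, but the invariant under test is the same: the trace_id survives the queue hop, so Jaeger renders one tree.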

### Feature: Governance / Panic Mode

```gherkin
Feature: Panic Mode
  As a SaaS operator
  I want a panic mode that halts all automated actions
  So that I can freeze the system during a security incident or outage

  Background:
    Given the panic mode toggle is available in the ops console

  Scenario: Operator activates panic mode globally
    Given panic mode is currently inactive
    When an operator activates panic mode with reason "security incident"
    Then all automated remediations are immediately halted
    And all pending remediation commands are cancelled
    And a Slack alert is sent to "#ops-critical": "⚠️ PANIC MODE ACTIVATED by @operator"
    And the reason and operator identity are logged

  Scenario: Panic mode blocks new remediations
    Given panic mode is active
    When a user clicks "Revert to IaC" in Slack
    Then the SaaS backend rejects the action
    And the user sees: "System is in panic mode — automated actions are disabled"

  Scenario: Panic mode blocks agent remediation commands
    Given panic mode is active
    And an agent receives a remediation command (e.g., from a race condition)
    When the agent checks panic mode status
    Then the agent rejects the command
    And logs "remediation blocked: panic mode active" at WARN level

  Scenario: Panic mode does NOT block drift scanning
    Given panic mode is active
    When the next scan cycle runs
    Then the agent continues scanning normally
    And drift events continue to be reported and stored
    And notifications continue to be sent (read-only operations are unaffected)

  Scenario: Panic mode deactivated by authorized operator
    Given panic mode is active
    When an authorized operator deactivates panic mode
    Then automated remediations are re-enabled
    And a Slack alert is sent: "✅ PANIC MODE DEACTIVATED by @operator"
    And the deactivation is logged with timestamp and operator identity

  Scenario: Panic mode activation requires elevated role
    Given a regular user attempts to activate panic mode
    When they call POST /ops/panic-mode
    Then the API returns HTTP 403
    And the attempt is logged as a security event

  Scenario: Panic mode state is persisted across restarts
    Given panic mode is active
    When the SaaS backend restarts
    Then panic mode remains active after restart
    And the system does not auto-deactivate panic mode on restart

  Scenario: Tenant-level panic mode
    Given tenant "acme" is experiencing an incident
    When an operator activates panic mode for tenant "acme" only
    Then only "acme"'s remediations are halted
    And other tenants are unaffected
    And "acme"'s dashboard shows a panic mode banner
```
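The scenarios above pin down four properties of panic mode: role-gated activation, global vs. tenant scope, persistence across restarts, and a block that applies to remediations but not to read-only work. A minimal Python sketch satisfying all four, assuming a JSON file stands in for the real database-backed state (the class and method names are illustrative):

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

class PanicMode:
    """Persisted panic-mode state: survives restarts, scoped globally or per tenant."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        self.state = {"global": False, "tenants": []}
        if state_file.exists():
            # Restart path: reload persisted state; never auto-deactivate.
            self.state = json.loads(state_file.read_text())

    def activate(self, operator_role: str, tenant: Optional[str] = None) -> None:
        if operator_role != "operator":
            raise PermissionError("403: panic mode requires an elevated role")
        if tenant is None:
            self.state["global"] = True
        elif tenant not in self.state["tenants"]:
            self.state["tenants"].append(tenant)
        self.state_file.write_text(json.dumps(self.state))

    def blocks(self, action: str, tenant: str) -> bool:
        """Remediations are blocked; scanning and notifications continue."""
        if action != "remediate":
            return False
        return self.state["global"] or tenant in self.state["tenants"]

# Tenant-scoped activation: only "acme" is halted, read-only work continues,
# and a fresh instance (simulated restart) sees the same state.
state_path = Path(tempfile.mkdtemp()) / "panic.json"
pm = PanicMode(state_path)
pm.activate("operator", tenant="acme")
assert pm.blocks("remediate", "acme")
assert not pm.blocks("remediate", "other-tenant")
assert not pm.blocks("scan", "acme")
assert PanicMode(state_path).blocks("remediate", "acme")
```

The agent-side check in the race-condition scenario is the same `blocks("remediate", …)` call performed just before executing a command, so a command issued moments before activation is still refused.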

### Feature: Observability — Metrics and Alerts

```gherkin
Feature: Operational Metrics and Alerting
  As a SaaS operator
  I want key metrics exported and alerting configured
  So that I can detect and respond to production issues proactively

  Background:
    Given metrics are exported to CloudWatch and/or Prometheus

  Scenario: Drift report processing latency metric
    Given drift reports are being processed
    When the event processor handles each report
    Then a histogram metric "drift_report_processing_duration_ms" is recorded
    And a P99 alert fires if latency exceeds 5000ms

  Scenario: DLQ depth metric triggers alert
    Given the DLQ depth exceeds 0
    When the CloudWatch alarm evaluates
    Then a PagerDuty alert fires within 5 minutes
    And the alert includes the queue name and message count

  Scenario: Agent offline metric
    Given an agent has not sent a heartbeat for 5 minutes
    When the heartbeat monitor checks
    Then a metric "agents_offline_count" is incremented
    And if any agent is offline for more than 15 minutes, an alert fires

  Scenario: Secret scrubber miss rate metric
    Given the scrubber processes drift reports
    When a scrubber audit runs
    Then a metric "scrubber_miss_rate" is recorded
    And if the miss rate is ever > 0%, a CRITICAL alert fires immediately

  Scenario: Stripe webhook processing metric
    Given Stripe webhooks are received
    When each webhook is processed
    Then a counter "stripe_webhooks_processed_total" is incremented by event type
    And a counter "stripe_webhooks_failed_total" is incremented on failures
    And an alert fires if the failure rate exceeds 1% over 5 minutes

  Scenario: Database connection pool metric
    Given the application maintains a PostgreSQL connection pool
    When pool utilization exceeds 80%
    Then a warning alert fires
    And when utilization exceeds 95%, a critical alert fires
```
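Two of the alert conditions above are threshold checks over windowed samples: P99 latency against 5000ms, and webhook failure rate against 1%. A small Python sketch of how such checks could be evaluated, assuming nearest-rank percentile semantics (the helper names `latency_alert` and `webhook_failure_alert` are illustrative; a real deployment would express these as CloudWatch alarms or Prometheus alerting rules):

```python
import math

def p99(samples):
    """Nearest-rank P99 of a list of duration samples (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[max(rank, 0)]

def latency_alert(samples, threshold_ms=5000):
    """Fires when the drift_report_processing_duration_ms P99 exceeds the SLO."""
    return p99(samples) > threshold_ms

def webhook_failure_alert(processed, failed, threshold=0.01):
    """Fires when the webhook failure rate over the window exceeds 1%."""
    return processed > 0 and (failed / processed) > threshold

# One slow outlier does not trip a P99 alert over 100 samples, but a
# sustained tail (two or more slow reports here) does.
healthy = [100] * 100
degraded = [100] * 98 + [6000, 7000]
assert not latency_alert(healthy)
assert latency_alert(degraded)
assert webhook_failure_alert(processed=1000, failed=25)      # 2.5% > 1%
assert not webhook_failure_alert(processed=1000, failed=5)   # 0.5% <= 1%
```

The scrubber-miss-rate alert is the degenerate case of the same pattern with a threshold of zero: any nonzero sample fires immediately.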

### Feature: Cross-Tenant Isolation Audit

```gherkin
Feature: Cross-Tenant Isolation Audit
  As a security engineer
  I want automated tests that verify cross-tenant isolation
  So that data leakage between tenants is caught before production

  Background:
    Given the test suite includes cross-tenant isolation tests
    And two test tenants "tenant-a" and "tenant-b" exist with separate data

  Scenario: API cross-tenant read isolation
    Given tenant-a has drift event "drift-a-001"
    When tenant-b's JWT is used to call GET /drifts/drift-a-001
    Then the API returns HTTP 404
    And no data from tenant-a is present in the response body

  Scenario: API cross-tenant write isolation
    Given tenant-a has stack "prod"
    When tenant-b's JWT is used to call DELETE /stacks/prod
    Then the API returns HTTP 404
    And tenant-a's stack is not deleted

  Scenario: Database RLS cross-tenant query isolation
    Given a direct database query runs with app.tenant_id set to "tenant-b"
    When the query selects all rows from drift_events
    Then zero rows from tenant-a are returned
    And the query does not error

  Scenario: SQS message from tenant-a cannot be processed as tenant-b
    Given a drift report message from tenant-a arrives on the queue
    When the event processor reads the tenant_id from the message
    Then the event is stored under tenant-a's tenant_id
    And not under tenant-b's tenant_id

  Scenario: Remediation command cannot target another tenant's agent
    Given tenant-b's agent has agent_id "agent-b-001"
    When tenant-a attempts to send a remediation command to "agent-b-001"
    Then the control plane rejects the command with HTTP 403
    And the attempt is logged as a security event

  Scenario: Cross-tenant isolation tests run in CI on every PR
    Given the isolation test suite is part of the CI pipeline
    When a PR is opened
    Then all cross-tenant isolation tests run automatically
    And the PR cannot be merged if any isolation test fails
```
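The read-isolation scenarios assert a specific design choice: cross-tenant lookups return 404, not 403, so a foreign tenant cannot even learn that a resource exists. A Python sketch of the pattern under test, assuming an in-memory store keyed by `(tenant_id, event_id)` as a stand-in for Postgres rows filtered by RLS on `app.tenant_id` (the `DriftStore` class is illustrative):

```python
class DriftStore:
    """Tenant-scoped lookups: every query is implicitly filtered by the
    caller's tenant_id, mirroring row-level security in the database."""

    def __init__(self):
        self._events = {}  # (tenant_id, event_id) -> event payload

    def put(self, tenant_id, event_id, event):
        self._events[(tenant_id, event_id)] = event

    def get(self, caller_tenant_id, event_id):
        # 404 rather than 403: a cross-tenant read must be
        # indistinguishable from a read of a nonexistent resource.
        event = self._events.get((caller_tenant_id, event_id))
        if event is None:
            return 404, None
        return 200, event

store = DriftStore()
store.put("tenant-a", "drift-a-001", {"resource": "sg-123"})

# tenant-b cannot see tenant-a's event, and no tenant-a data leaks.
assert store.get("tenant-b", "drift-a-001") == (404, None)
# tenant-a's own read still succeeds.
assert store.get("tenant-a", "drift-a-001")[0] == 200
```

The CI scenario then reduces to running exactly these assertions (against the real API and database, not an in-memory stand-in) as a merge-blocking test suite on every PR.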

---

*End of BDD Acceptance Test Specifications for dd0c/drift*

*Total epics covered: 10 | Features: 40+ | Scenarios: 200+*