dd0c/drift — BDD Acceptance Test Specifications
Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic.
- Epic 1: Drift Detection Agent
- Epic 2: Agent Communication
- Epic 3: Event Processor
- Epic 4: Notification Engine
- Epic 5: Remediation
- Epic 6: Dashboard UI
- Epic 7: Dashboard API
- Epic 8: Infrastructure
- Epic 9: Onboarding & PLG
- Epic 10: Transparent Factory
Epic 1: Drift Detection Agent
Feature: Drift Detection Agent Initialization
As a platform engineer
I want the drift agent to initialize correctly in my VPC
So that it can begin scanning infrastructure state
Background:
Given the drift agent binary is installed at "/usr/local/bin/drift-agent"
And a valid agent config exists at "/etc/drift/config.yaml"
Scenario: Successful agent startup
Given the config specifies AWS region "us-east-1"
And valid mTLS certificates are present
And the SaaS endpoint is reachable
When the agent starts
Then the agent logs "drift-agent started" at INFO level
And the agent registers itself with the SaaS control plane
And the first scan is scheduled within 15 minutes
Scenario: Agent startup with missing config
Given no config file exists at "/etc/drift/config.yaml"
When the agent starts
Then the agent exits with code 1
And logs "config file not found" at ERROR level
Scenario: Agent startup with invalid AWS credentials
Given the config references an IAM role that does not exist
When the agent starts
Then the agent exits with code 1
And logs "failed to assume IAM role" at ERROR level
Scenario: Agent startup with unreachable SaaS endpoint
Given the SaaS endpoint is not reachable from the VPC
When the agent starts
Then the agent retries connection 3 times with exponential backoff
And after all retries fail, exits with code 1
And logs "failed to reach control plane after 3 attempts" at ERROR level
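The startup retry behavior above can be sketched as follows. This is a minimal illustration, not the agent's implementation; the base delay and function names are assumptions, since the scenarios only fix the 3-attempt limit, exponential backoff, and exit code 1.

```python
import time

BASE_DELAY = 1.0   # seconds; assumed, not specified by the scenarios
MAX_ATTEMPTS = 3   # per "failed to reach control plane after 3 attempts"

def connect_with_backoff(connect, sleep=time.sleep):
    """Try connect() up to MAX_ATTEMPTS times, doubling the delay after each failure."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return connect()
        except ConnectionError:
            if attempt == MAX_ATTEMPTS - 1:
                # All retries exhausted: exit with code 1 as the scenario requires.
                raise SystemExit(1)
            sleep(BASE_DELAY * 2 ** attempt)
```

The `sleep` parameter is injected so the backoff schedule can be asserted in tests without real waiting.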
Feature: Terraform State Scanning
As a platform engineer
I want the agent to read Terraform state
So that it can compare planned state against live AWS resources
Background:
Given the agent is running and initialized
And the stack type is "terraform"
Scenario: Scan Terraform state from S3 backend
Given a Terraform state file exists in S3 bucket "my-tfstate" at key "prod/terraform.tfstate"
And the agent has s3:GetObject permission on that bucket
When a scan cycle runs
Then the agent reads the state file successfully
And parses all resource definitions from the state
Scenario: Scan Terraform state from local backend
Given a Terraform state file exists at "/var/drift/stacks/prod/terraform.tfstate"
When a scan cycle runs
Then the agent reads the local state file
And parses all resource definitions
Scenario: Terraform state file is locked
Given the Terraform state file is currently locked by another process
When a scan cycle runs
Then the agent logs "state file locked, skipping scan" at WARN level
And schedules a retry for the next cycle
And does not report a drift event
Scenario: Terraform state file is malformed
Given the Terraform state file contains invalid JSON
When a scan cycle runs
Then the agent logs "failed to parse state file" at ERROR level
And emits a health event with status "parse_error"
And does not report false drift
Scenario: Terraform state references deleted resources
Given the Terraform state contains a resource "aws_instance.web"
And that EC2 instance no longer exists in AWS
When a scan cycle runs
Then the agent detects drift of type "resource_deleted"
And includes the resource ARN in the drift report
Feature: CloudFormation Stack Scanning
As a platform engineer
I want the agent to scan CloudFormation stacks
So that I can detect drift from declared template state
Background:
Given the agent is running
And the stack type is "cloudformation"
Scenario: Scan a CloudFormation stack successfully
Given a CloudFormation stack named "prod-api" exists in "us-east-1"
And the agent has cloudformation:DescribeStackResources permission
When a scan cycle runs
Then the agent retrieves all stack resources
And compares each resource's actual configuration against the template
Scenario: CloudFormation stack does not exist
Given the config references a CloudFormation stack "ghost-stack"
And that stack does not exist in AWS
When a scan cycle runs
Then the agent logs "stack not found: ghost-stack" at WARN level
And emits a drift event of type "stack_missing"
Scenario: CloudFormation native drift detection result available
Given CloudFormation has already run drift detection on "prod-api"
And the result shows 2 drifted resources
When a scan cycle runs
Then the agent reads the CloudFormation drift detection result
And includes both drifted resources in the drift report
Scenario: CloudFormation drift detection result is stale
Given the last CloudFormation drift detection ran 48 hours ago
When a scan cycle runs
Then the agent triggers a new CloudFormation drift detection
And waits up to 5 minutes for the result
And uses the fresh result in the drift report
Feature: Kubernetes Resource Scanning
As a platform engineer
I want the agent to scan Kubernetes resources
So that I can detect drift from Helm/Kustomize definitions
Background:
Given the agent is running
And a kubeconfig is available at "/etc/drift/kubeconfig"
And the stack type is "kubernetes"
Scenario: Scan Kubernetes Deployment successfully
Given a Deployment "api-server" exists in namespace "production"
And the IaC definition specifies 3 replicas
And the live Deployment has 2 replicas
When a scan cycle runs
Then the agent detects drift on field "spec.replicas"
And the drift report includes expected value "3" and actual value "2"
Scenario: Kubernetes resource has been manually patched
Given a ConfigMap "app-config" has been manually edited
And the live data differs from the Helm chart values
When a scan cycle runs
Then the agent detects drift of type "config_modified"
And includes a field-level diff in the report
Scenario: Kubernetes API server is unreachable
Given the Kubernetes API server is not responding
When a scan cycle runs
Then the agent logs "k8s API unreachable" at ERROR level
And emits a health event with status "k8s_unreachable"
And does not report false drift
Scenario: Scan across multiple namespaces
Given the config specifies namespaces ["production", "staging"]
When a scan cycle runs
Then the agent scans resources in both namespaces independently
And drift reports include the namespace as context
Feature: Secret Scrubbing Before Transmission
As a security officer
I want all secrets scrubbed from drift reports before they leave the VPC
So that credentials are never transmitted to the SaaS backend
Background:
Given the agent is running with secret scrubbing enabled
Scenario: AWS access key ID detected and scrubbed
Given a drift report contains a field value matching the AWS access key ID pattern "AKIA[0-9A-Z]{16}"
When the scrubber processes the report
Then the field value is replaced with "[REDACTED:aws_access_key_id]"
And the original value is not present anywhere in the transmitted payload
Scenario: Generic password field scrubbed
Given a drift report contains a field named "password" with value "s3cr3tP@ss"
When the scrubber processes the report
Then the field value is replaced with "[REDACTED:password]"
Scenario: Private key block scrubbed
Given a drift report contains a PEM private key block
When the scrubber processes the report
Then the entire PEM block is replaced with "[REDACTED:private_key]"
Scenario: Nested secret in JSON value scrubbed
Given a drift report contains a JSON string value with a nested "api_key" field
When the scrubber processes the report
Then the nested api_key value is replaced with "[REDACTED:api_key]"
And the surrounding JSON structure is preserved
Scenario: Secret scrubber bypass attempt via encoding
Given a drift report contains a base64-encoded AWS secret key
When the scrubber processes the report
Then the encoded value is detected and replaced with "[REDACTED:encoded_secret]"
Scenario: Secret scrubber bypass attempt via Unicode homoglyphs
Given a drift report contains a value using Unicode lookalike characters to resemble a secret pattern
When the scrubber processes the report
Then the value is flagged and replaced with "[REDACTED:suspicious_value]"
Scenario: Non-secret value is not scrubbed
Given a drift report contains a field "instance_type" with value "t3.medium"
When the scrubber processes the report
Then the field value remains "t3.medium" unchanged
Scenario: Scrubber coverage is 100%
Given a test corpus of 500 known secret patterns
When the scrubber processes all patterns
Then every pattern is detected and redacted
And the scrubber reports 0 missed secrets
Scenario: Scrubber audit log
Given the scrubber redacts 3 values from a drift report
When the report is transmitted
Then the agent logs a scrubber audit entry with count "3" and field names (not values)
And the audit log is stored locally for 30 days
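A minimal sketch of the scrubbing rules above. The patterns and field-name list are illustrative only; a production scrubber would use a vetted secret-detection ruleset covering encoded and homoglyph variants as well.

```python
import re

# Illustrative patterns; the AKIA regex comes from the scenario above.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.S,
    ),
}
# Field names treated as secrets regardless of value (assumed list).
SECRET_FIELD_NAMES = {"password", "api_key", "secret"}

def scrub_field(name: str, value: str) -> str:
    """Redact a field by name or by value pattern; pass non-secrets through."""
    if name.lower() in SECRET_FIELD_NAMES:
        return f"[REDACTED:{name.lower()}]"
    for label, pattern in PATTERNS.items():
        if pattern.search(value):
            return f"[REDACTED:{label}]"
    return value
```

Non-secret values such as an `instance_type` of "t3.medium" fall through both checks unchanged, matching the negative scenario.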
Feature: Pulumi State Scanning
As a platform engineer
I want the agent to read Pulumi state
So that I can detect drift from Pulumi-managed resources
Background:
Given the agent is running
And the stack type is "pulumi"
Scenario: Scan Pulumi state from Pulumi Cloud backend
Given a Pulumi stack "prod" exists in organization "acme"
And the agent has a valid Pulumi access token configured
When a scan cycle runs
Then the agent fetches the stack's exported state via Pulumi API
And parses all resource URNs and properties
Scenario: Scan Pulumi state from self-managed S3 backend
Given the Pulumi state is stored in S3 bucket "pulumi-state" at key "acme/prod.json"
When a scan cycle runs
Then the agent reads the state file from S3
And parses all resource definitions
Scenario: Pulumi access token is expired
Given the configured Pulumi access token has expired
When a scan cycle runs
Then the agent logs "pulumi token expired" at ERROR level
And emits a health event with status "auth_error"
And does not report false drift
Feature: Scheduled Scan Cycle
As a platform engineer
I want scans to run every 15 minutes automatically
So that drift is detected promptly
Background:
Given the agent is running and initialized
Scenario: Scan runs on schedule
Given the last scan completed at T+0
When 15 minutes elapse
Then a new scan starts automatically
And the scan completion is logged with timestamp
Scenario: Scan cycle skipped if previous scan still running
Given a scan started at T+0 and is still running at T+15
When the next scheduled scan would start
Then the new scan is skipped
And the agent logs "scan skipped: previous scan still in progress" at WARN level
Scenario: Scan interval is configurable
Given the config specifies scan_interval_minutes: 30
When the agent starts
Then scans run every 30 minutes instead of 15
Scenario: No drift detected — no report sent
Given all resources match their IaC definitions
When a scan cycle completes
Then no drift report is sent to SaaS
And the agent logs "scan complete: no drift detected"
Scenario: Agent recovers scan schedule after restart
Given the agent was restarted
When the agent starts
Then it reads the last scan timestamp from local state
And schedules the next scan relative to the last completed scan
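The skip-if-running rule above can be sketched with a non-blocking lock standing in for the agent's real scheduler (class and method names are assumptions):

```python
import threading

class ScanScheduler:
    """Runs a scan on each tick unless the previous scan is still in progress."""

    def __init__(self, run_scan):
        self._run_scan = run_scan
        self._lock = threading.Lock()
        self.skipped = 0

    def tick(self):
        """Called every scan_interval_minutes by a timer."""
        if not self._lock.acquire(blocking=False):
            # "scan skipped: previous scan still in progress"
            self.skipped += 1
            return False
        try:
            self._run_scan()
            return True
        finally:
            self._lock.release()
```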
Epic 2: Agent Communication
Feature: Mutual TLS (mTLS) Authentication
As a security engineer
I want all agent-to-SaaS communication to use mTLS
So that only authenticated agents can submit drift reports
Background:
Given the agent has a client certificate issued by the drift CA
And the SaaS endpoint requires mTLS
Scenario: Successful mTLS handshake
Given the agent certificate is valid and not expired
And the SaaS server certificate is trusted by the agent's CA bundle
When the agent connects to the SaaS endpoint
Then the TLS handshake succeeds
And the connection is established with mutual authentication
Scenario: Agent certificate is expired
Given the agent certificate expired 1 day ago
When the agent attempts to connect to the SaaS endpoint
Then the TLS handshake fails with "certificate expired"
And the agent logs "mTLS cert expired, cannot connect" at ERROR level
And the agent emits a local alert to stderr
And no data is transmitted
Scenario: Agent certificate is revoked
Given the agent certificate has been added to the CRL
When the agent attempts to connect
Then the SaaS endpoint rejects the connection
And the agent logs "certificate revoked" at ERROR level
Scenario: Agent presents wrong CA certificate
Given the agent has a certificate from an untrusted CA
When the agent attempts to connect
Then the TLS handshake fails
And the agent logs "unknown certificate authority" at ERROR level
Scenario: SaaS server certificate is expired
Given the SaaS server certificate has expired
When the agent attempts to connect
Then the agent rejects the server certificate
And logs "server cert expired" at ERROR level
And does not transmit any data
Scenario: Certificate rotation — new cert issued before expiry
Given the agent certificate expires in 7 days
And the control plane issues a new certificate
When the agent receives the new certificate via the cert rotation endpoint
Then the agent stores the new certificate
And uses the new certificate for subsequent connections
And logs "certificate rotated successfully" at INFO level
Scenario: Certificate rotation fails — agent continues with old cert
Given the agent certificate expires in 7 days
And the cert rotation endpoint is unreachable
When the agent attempts certificate rotation
Then the agent retries rotation every hour
And continues using the existing certificate until it expires
And logs "cert rotation failed, retrying" at WARN level
Feature: Agent Heartbeat
As a SaaS operator
I want agents to send regular heartbeats
So that I can detect offline or unhealthy agents
Background:
Given the agent is running and connected via mTLS
Scenario: Heartbeat sent on schedule
Given the heartbeat interval is 60 seconds
When 60 seconds elapse
Then the agent sends a heartbeat message to the SaaS control plane
And the heartbeat includes agent version, last scan time, and stack count
Scenario: SaaS marks agent as offline after missed heartbeats
Given the agent has not sent a heartbeat for 5 minutes
When the SaaS control plane checks agent status
Then the agent is marked as "offline"
And a notification is sent to the tenant's configured alert channel
Scenario: Agent reconnects after network interruption
Given the agent lost connectivity for 3 minutes
When connectivity is restored
Then the agent re-establishes the mTLS connection
And sends a heartbeat immediately
And the SaaS marks the agent as "online"
Scenario: Heartbeat includes health status
Given the last scan encountered a parse error
When the agent sends its next heartbeat
Then the heartbeat payload includes health_status "degraded"
And includes the error details in the health_details field
Scenario: Multiple agents from same tenant
Given tenant "acme" has 3 agents running in different regions
When all 3 agents send heartbeats
Then each agent is tracked independently by agent_id
And the SaaS dashboard shows 3 active agents for tenant "acme"
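The offline-detection check above reduces to a threshold comparison on the last heartbeat timestamp (names illustrative; the 5-minute threshold is from the scenario):

```python
from datetime import datetime, timedelta

OFFLINE_AFTER = timedelta(minutes=5)  # per the missed-heartbeat scenario

def agent_status(last_heartbeat: datetime, now: datetime) -> str:
    """Mark an agent offline if no heartbeat arrived within OFFLINE_AFTER."""
    return "offline" if now - last_heartbeat > OFFLINE_AFTER else "online"
```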
Feature: SQS FIFO Drift Report Ingestion
As a SaaS backend engineer
I want drift reports delivered via SQS FIFO
So that reports are processed in order without duplicates
Background:
Given an SQS FIFO queue "drift-reports.fifo" exists
And the agent has sqs:SendMessage permission on the queue
Scenario: Agent publishes drift report to SQS FIFO
Given a scan detects 2 drifted resources
When the agent publishes the drift report
Then a message is sent to "drift-reports.fifo"
And the MessageGroupId is set to the tenant's agent_id
And the MessageDeduplicationId is set to the scan's unique scan_id
Scenario: Duplicate scan report is deduplicated
Given a drift report with scan_id "scan-abc-123" was already sent
When the agent sends the same report again (e.g., due to retry)
Then SQS FIFO deduplicates the message
And only one copy is delivered to the consumer
Scenario: SQS message size limit exceeded
Given a drift report exceeds 256KB (SQS max message size)
When the agent attempts to publish
Then the agent splits the report into chunks
And each chunk is sent as a separate message with a shared batch_id
And the sequence is indicated by chunk_index and chunk_total fields
Scenario: SQS publish fails — agent retries with backoff
Given the SQS endpoint returns a 500 error
When the agent attempts to publish a drift report
Then the agent retries up to 5 times with exponential backoff
And logs each retry attempt at WARN level
And after all retries fail, stores the report locally for later replay
Scenario: Agent publishes to correct queue per environment
Given the agent config specifies environment "production"
When the agent publishes a drift report
Then the message is sent to "drift-reports-prod.fifo"
And not to "drift-reports-staging.fifo"
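The chunking scheme in the size-limit scenario (shared batch_id, chunk_index, chunk_total) can be sketched as below. Field names come from the scenarios; note this sketch slices by character for simplicity, whereas a real sender would slice the UTF-8 byte payload against the 256 KiB SQS maximum.

```python
import uuid

MAX_BYTES = 256 * 1024  # SQS maximum message size

def split_report(report_json: str, max_bytes: int = MAX_BYTES):
    """Split an oversized report into ordered chunks sharing one batch_id."""
    batch_id = str(uuid.uuid4())
    parts = [report_json[i:i + max_bytes]
             for i in range(0, len(report_json), max_bytes)] or [""]
    return [
        {"batch_id": batch_id, "chunk_index": i,
         "chunk_total": len(parts), "body": p}
        for i, p in enumerate(parts)
    ]

def reassemble(chunks):
    """Rebuild the original report; fails fast on an incomplete batch."""
    chunks = sorted(chunks, key=lambda c: c["chunk_index"])
    assert len(chunks) == chunks[0]["chunk_total"], "incomplete batch"
    return "".join(c["body"] for c in chunks)
```

The consumer-side `reassemble` mirrors the Event Processor's reassembly scenarios in Epic 3.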
Feature: Dead Letter Queue (DLQ) Handling
As a SaaS operator
I want failed messages routed to a DLQ
So that no drift reports are silently lost
Background:
Given a DLQ "drift-reports-dlq.fifo" is configured
And the maxReceiveCount is set to 3
Scenario: Message moved to DLQ after max receive count
Given a drift report message has failed processing 3 times
When the message becomes visible again and its receive count exceeds the maxReceiveCount
Then the message is automatically moved to the DLQ
And an alarm fires on the DLQ depth metric
Scenario: DLQ alarm triggers operator notification
Given the DLQ depth exceeds 0
When the CloudWatch alarm triggers
Then a PagerDuty alert is sent to the on-call engineer
And the alert includes the queue name and approximate message count
Scenario: DLQ message is replayed after fix
Given a message in the DLQ was caused by a schema validation bug
And the bug has been fixed and deployed
When an operator triggers DLQ replay
Then the message is moved back to the main queue
And processed successfully
And removed from the DLQ
Scenario: DLQ message contains poison pill — permanently discarded
Given a DLQ message is malformed beyond repair
When an operator inspects the message
Then the operator can mark it as "discarded" via the ops console
And the discard action is logged with operator identity and reason
And the message is deleted from the DLQ
Epic 3: Event Processor
Feature: Drift Report Normalization
As a SaaS backend engineer
I want incoming drift reports normalized to a canonical schema
So that downstream consumers work with consistent data regardless of IaC type
Background:
Given the event processor is running
And it is consuming from "drift-reports.fifo"
Scenario: Normalize a Terraform drift report
Given a raw drift report with type "terraform" arrives on the queue
When the event processor consumes the message
Then it maps Terraform resource addresses to canonical resource_id format
And stores the normalized report in the "drift_events" PostgreSQL table
And sets the "iac_type" field to "terraform"
Scenario: Normalize a CloudFormation drift report
Given a raw drift report with type "cloudformation" arrives
When the event processor normalizes it
Then CloudFormation logical resource IDs are mapped to canonical resource_id
And the "iac_type" field is set to "cloudformation"
Scenario: Normalize a Kubernetes drift report
Given a raw drift report with type "kubernetes" arrives
When the event processor normalizes it
Then Kubernetes resource URIs (namespace/kind/name) are mapped to canonical resource_id
And the "iac_type" field is set to "kubernetes"
Scenario: Unknown IaC type in report
Given a drift report arrives with iac_type "unknown_tool"
When the event processor attempts normalization
Then the message is rejected with error "unsupported_iac_type"
And the message is moved to the DLQ
And an error is logged with the raw message_id
Scenario: Report with missing required fields
Given a drift report is missing the "tenant_id" field
When the event processor validates the message
Then validation fails with "missing required field: tenant_id"
And the message is moved to the DLQ
And no partial record is written to PostgreSQL
Scenario: Chunked report is reassembled before normalization
Given 3 chunked messages arrive with the same batch_id
And chunk_total is 3
When all 3 chunks are received
Then the event processor reassembles them in chunk_index order
And normalizes the complete report as a single event
Scenario: Chunked report — one chunk missing after timeout
Given 2 of 3 chunks arrive for batch_id "batch-xyz"
And the third chunk does not arrive within 10 minutes
When the reassembly timeout fires
Then the event processor logs "incomplete batch: batch-xyz, missing chunk 2" at WARN level
And moves the partial batch to the DLQ
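The three normalization scenarios map per-IaC identifiers into one canonical resource_id. A hypothetical mapping is sketched below; the exact canonical format (`iac_type:identifier`) and input field names are assumptions, since the scenarios only fix which source identifier each IaC type contributes.

```python
def canonical_resource_id(iac_type: str, raw: dict) -> str:
    """Map a raw per-IaC identifier to the canonical resource_id format (assumed)."""
    if iac_type == "terraform":
        return f"terraform:{raw['address']}"            # e.g. aws_instance.web
    if iac_type == "cloudformation":
        return f"cloudformation:{raw['stack']}/{raw['logical_id']}"
    if iac_type == "kubernetes":
        return f"kubernetes:{raw['namespace']}/{raw['kind']}/{raw['name']}"
    # Unknown tools are rejected and routed to the DLQ per the scenario above.
    raise ValueError("unsupported_iac_type")
```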
Feature: Drift Severity Scoring
As a platform engineer
I want each drift event scored by severity
So that I can prioritize which drifts to address first
Background:
Given a normalized drift event is ready for scoring
Scenario: Security group rule added — HIGH severity
Given a drift event describes an added inbound rule on a security group
And the rule opens port 22 to 0.0.0.0/0
When the severity scorer evaluates the event
Then the severity is set to "HIGH"
And the reason is "public SSH access opened"
Scenario: IAM policy attached — CRITICAL severity
Given a drift event describes an IAM policy "AdministratorAccess" attached to a role
When the severity scorer evaluates the event
Then the severity is set to "CRITICAL"
And the reason is "admin policy attached outside IaC"
Scenario: Replica count changed — LOW severity
Given a drift event describes a Kubernetes Deployment replica count changed from 3 to 2
When the severity scorer evaluates the event
Then the severity is set to "LOW"
And the reason is "non-security configuration drift"
Scenario: Resource deleted — HIGH severity
Given a drift event describes an RDS instance being deleted
When the severity scorer evaluates the event
Then the severity is set to "HIGH"
And the reason is "managed resource deleted outside IaC"
Scenario: Tag-only drift — INFO severity
Given a drift event describes only tag changes on an EC2 instance
When the severity scorer evaluates the event
Then the severity is set to "INFO"
And the reason is "tag-only drift"
Scenario: Custom severity rules override defaults
Given a tenant has configured a custom rule: "any change to resource type aws_s3_bucket = CRITICAL"
And a drift event describes a tag change on an S3 bucket
When the severity scorer evaluates the event
Then the tenant's custom rule takes precedence
And the severity is set to "CRITICAL"
Scenario: Severity score is stored with the drift event
Given a drift event has been scored as "HIGH"
When the event processor writes to PostgreSQL
Then the "drift_events" row includes severity "HIGH" and scored_at timestamp
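The scoring scenarios above reduce to an ordered rule list in which tenant-configured rules take precedence over defaults. The rule shapes and event fields below are assumptions; the scenarios only fix the severities and the precedence.

```python
# Default rules, checked in order; illustrative predicates only.
DEFAULT_RULES = [
    (lambda e: e.get("change") == "tags_only", "INFO"),
    (lambda e: e.get("drift_type") == "resource_deleted", "HIGH"),
    (lambda e: "AdministratorAccess" in e.get("policy", ""), "CRITICAL"),
]

def score(event: dict, tenant_rules=()) -> str:
    """Return the first matching severity; tenant overrides win over defaults."""
    for matches, severity in tenant_rules:   # custom rules take precedence
        if matches(event):
            return severity
    for matches, severity in DEFAULT_RULES:
        if matches(event):
            return severity
    return "LOW"  # assumed fallback for unclassified drift
```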
Feature: Drift Event Storage with Row-Level Security
As a SaaS engineer
I want drift events stored in PostgreSQL with RLS
So that tenants can only access their own data
Background:
Given PostgreSQL is running with RLS enabled on the "drift_events" table
And the RLS policy filters rows by "tenant_id = current_setting('app.tenant_id')"
Scenario: Drift event written for tenant A
Given a normalized drift event belongs to tenant "acme"
When the event processor writes the event
Then the row is inserted with tenant_id "acme"
And the row is visible when querying as tenant "acme"
Scenario: Tenant B cannot read tenant A's drift events
Given drift events exist for tenant "acme"
When a query runs with app.tenant_id set to "globex"
Then zero rows are returned
And no error is thrown (RLS silently filters)
Scenario: Superuser bypass is disabled for application role
Given the application database role is "drift_app"
When "drift_app" attempts to query without setting app.tenant_id
Then zero rows are returned due to RLS default-deny policy
Scenario: Drift event deduplication
Given a drift event with scan_id "scan-abc" and resource_id "aws_instance.web" already exists
When the event processor attempts to insert the same event again
Then the INSERT is ignored (ON CONFLICT DO NOTHING)
And no duplicate row is created
Scenario: Database connection pool exhausted
Given all PostgreSQL connections are in use
When the event processor tries to write a drift event
Then it waits up to 5 seconds for a connection
And if no connection is available, the message is nacked and retried
And an alert fires if pool exhaustion persists for more than 60 seconds
Scenario: Schema migration runs without downtime
Given a new additive column "remediation_status" is being added
When the migration runs
Then existing rows are unaffected
And new rows include the "remediation_status" column
And the event processor continues writing without restart
Feature: Idempotent Event Processing
As a SaaS backend engineer
I want event processing to be idempotent
So that retries and replays do not create duplicate records
Scenario: Same SQS message delivered twice (at-least-once delivery)
Given an SQS message with MessageId "msg-001" was processed successfully
When the same message is delivered again due to SQS retry
Then the event processor detects the duplicate via scan_id lookup
And skips processing
And deletes the message from the queue
Scenario: Event processor restarts mid-batch
Given the event processor crashed after writing 5 of 10 events
When the processor restarts and reprocesses the batch
Then the 5 already-written events are skipped (idempotent)
And the remaining 5 events are written
And the final state has exactly 10 events
Scenario: Replay from DLQ does not create duplicates
Given a DLQ message is replayed after a bug fix
And the event was partially processed before the crash
When the replayed message is processed
Then the processor uses upsert semantics
And the final record reflects the correct state
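The idempotency scenarios hinge on a natural key of (scan_id, resource_id). In this sketch an in-memory dict stands in for the PostgreSQL table with its ON CONFLICT DO NOTHING insert:

```python
class EventStore:
    """In-memory stand-in for the drift_events table with a unique natural key."""

    def __init__(self):
        self.rows = {}

    def insert(self, event: dict) -> bool:
        """Insert once per (scan_id, resource_id); repeat deliveries are no-ops."""
        key = (event["scan_id"], event["resource_id"])
        if key in self.rows:
            return False  # duplicate: skip processing, then delete the SQS message
        self.rows[key] = event
        return True
```

Because inserts are keyed this way, a crashed batch or a DLQ replay converges on the same final row set regardless of how many times each message is delivered.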
Epic 4: Notification Engine
Feature: Slack Block Kit Drift Alerts
As a platform engineer
I want to receive Slack notifications when drift is detected
So that I can act on it immediately
Background:
Given a tenant has configured a Slack webhook URL
And the notification engine is running
Scenario: HIGH severity drift triggers immediate Slack alert
Given a drift event with severity "HIGH" is stored
When the notification engine processes the event
Then a Slack Block Kit message is sent to the configured channel
And the message includes the resource_id, drift type, and severity badge
And the message includes an inline diff of expected vs actual values
And the message includes a "Revert to IaC" action button
Scenario: CRITICAL severity drift triggers immediate Slack alert with @here mention
Given a drift event with severity "CRITICAL" is stored
When the notification engine processes the event
Then the Slack message includes an "@here" mention
And the message is sent within 60 seconds of the event being stored
Scenario: LOW severity drift is batched — not sent immediately
Given a drift event with severity "LOW" is stored
When the notification engine processes the event
Then no immediate Slack message is sent
And the event is queued for the next daily digest
Scenario: INFO severity drift is suppressed from Slack
Given a drift event with severity "INFO" is stored
When the notification engine processes the event
Then no Slack message is sent
And the event is only visible in the dashboard
Scenario: Slack message includes inline diff
Given a drift event shows security group rule changed
And expected value is "port 443 from 10.0.0.0/8"
And actual value is "port 443 from 0.0.0.0/0"
When the Slack alert is composed
Then the message body includes a diff block showing the change
And removed lines are prefixed with "-"
And added lines are prefixed with "+"
Scenario: Slack webhook returns 429 rate limit
Given the Slack webhook returns HTTP 429
When the notification engine attempts to send
Then it respects the Retry-After header
And retries after the specified delay
And logs "slack rate limited, retrying in Xs" at WARN level
Scenario: Slack webhook URL is invalid
Given the tenant's Slack webhook URL returns HTTP 404
When the notification engine attempts to send
Then it logs "invalid slack webhook" at ERROR level
And marks the notification as "failed" in the database
And does not retry indefinitely (max 3 attempts)
Scenario: Multiple drifts in same scan — grouped in one message
Given a scan detects 5 drifted resources all with severity "HIGH"
When the notification engine processes the batch
Then a single Slack message is sent grouping all 5 resources
And the message includes a summary "5 resources drifted"
And each resource is listed with its diff
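A minimal Block Kit payload for the alert above might look like the sketch below. The section/actions block shapes follow Slack's Block Kit schema; the `action_id`, message wording, and function signature are assumptions.

```python
def drift_alert(resource_id: str, severity: str, expected: str, actual: str) -> dict:
    """Build a Block Kit message with an inline diff and a revert button."""
    diff = f"```- {expected}\n+ {actual}```"  # Slack renders this as a code block
    return {
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{severity}* drift on `{resource_id}`\n{diff}",
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Revert to IaC"},
                        "action_id": "revert_to_iac",  # assumed identifier
                    }
                ],
            },
        ]
    }
```

Clicking the button delivers an interaction payload carrying the `action_id` back to the SaaS, which is what the one-click remediation flow in Epic 5 consumes.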
Feature: Daily Drift Digest
As a platform engineer
I want a daily summary of all drift events
So that I have a consolidated view without alert fatigue
Background:
Given a tenant has daily digest enabled
And the digest is scheduled for 09:00 tenant local time
Scenario: Daily digest sent with pending LOW/INFO events
Given 12 LOW severity drift events accumulated since the last digest
When the digest job runs at 09:00
Then a single Slack message is sent summarizing all 12 events
And the message groups events by stack name
And includes a link to the dashboard for full details
Scenario: Daily digest skipped when no events
Given no drift events occurred in the last 24 hours
When the digest job runs
Then no Slack message is sent
And the job logs "digest skipped: no events" at INFO level
Scenario: Daily digest includes resolved drifts
Given 3 drift events were detected and then remediated in the last 24 hours
When the digest runs
Then the digest includes a "Resolved" section listing those 3 events
And shows time-to-remediation for each
Scenario: Digest timezone is per-tenant
Given tenant "acme" is in timezone "America/New_York" (UTC-5)
And tenant "globex" is in timezone "Asia/Tokyo" (UTC+9)
When the digest scheduler runs
Then "acme" receives their digest at 14:00 UTC
And "globex" receives their digest at 00:00 UTC
Scenario: Digest delivery fails — retried next hour
Given the Slack webhook is temporarily unavailable at 09:00
When the digest send fails
Then the system retries at 10:00
And again at 11:00
And after 3 failures, marks the digest as "failed" and alerts the operator
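The per-tenant timezone scenario can be checked with IANA timezone data, matching the America/New_York and Asia/Tokyo examples above (the UTC offsets quoted there are standard time; DST shifts the UTC hour):

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

def digest_utc_time(tz_name: str, local_date: date) -> datetime:
    """UTC instant of the 09:00 local digest on the given date."""
    local = datetime.combine(local_date, time(9, 0), tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)
```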
Feature: Severity-Based Notification Routing
As a platform engineer
I want different severity levels routed to different channels
So that critical alerts reach the right people immediately
Background:
Given a tenant has configured routing rules
Scenario: CRITICAL routed to #incidents channel
Given the routing rule maps CRITICAL → "#incidents"
And a CRITICAL drift event occurs
When the notification engine routes the alert
Then the message is sent to "#incidents"
And not to "#drift-alerts"
Scenario: HIGH routed to #drift-alerts channel
Given the routing rule maps HIGH → "#drift-alerts"
And a HIGH drift event occurs
When the notification engine routes the alert
Then the message is sent to "#drift-alerts"
Scenario: No routing rule configured — fallback to default channel
Given no routing rules are configured for severity "MEDIUM"
And a MEDIUM drift event occurs
When the notification engine routes the alert
Then the message is sent to the tenant's default Slack channel
Scenario: Multiple channels for same severity
Given the routing rule maps CRITICAL → ["#incidents", "#sre-oncall"]
And a CRITICAL drift event occurs
When the notification engine routes the alert
Then the message is sent to both "#incidents" and "#sre-oncall"
Scenario: PagerDuty integration for CRITICAL severity
Given the tenant has PagerDuty integration configured
And the routing rule maps CRITICAL → PagerDuty
And a CRITICAL drift event occurs
When the notification engine routes the alert
Then a PagerDuty incident is created via the Events API
And the incident includes the drift event details and dashboard link
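The routing scenarios above reduce to a severity lookup with a default-channel fallback, where one severity may fan out to multiple targets. A sketch under assumed types (`RoutingRules` and the channel names in the demo are illustrative, not part of the spec):

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface RoutingRules {
  // severity → one or more delivery targets (Slack channels, PagerDuty, ...)
  channels: Partial<Record<Severity, string[]>>;
  defaultChannel: string;
}

// Resolve the delivery targets for an alert; falls back to the tenant's
// default channel when no rule exists for the severity.
function resolveTargets(rules: RoutingRules, severity: Severity): string[] {
  const mapped = rules.channels[severity];
  return mapped && mapped.length > 0 ? mapped : [rules.defaultChannel];
}

const rules: RoutingRules = {
  channels: { CRITICAL: ["#incidents", "#sre-oncall"], HIGH: ["#drift-alerts"] },
  defaultChannel: "#drift-default",
};
console.log(resolveTargets(rules, "CRITICAL")); // [ '#incidents', '#sre-oncall' ]
console.log(resolveTargets(rules, "MEDIUM"));   // [ '#drift-default' ]
```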
Epic 5: Remediation
Feature: One-Click Revert via Slack
Feature: One-Click Revert to IaC via Slack
As a platform engineer
I want to trigger remediation directly from a Slack alert
So that I can revert drift without leaving my chat tool
Background:
Given a HIGH severity drift alert was sent to Slack
And the alert includes a "Revert to IaC" button
Scenario: Engineer clicks Revert to IaC for non-destructive change
Given the drift is a security group rule addition (non-destructive revert)
When the engineer clicks "Revert to IaC"
Then the SaaS backend receives the Slack interaction payload
And sends a remediation command to the agent via the control plane
And the agent runs "terraform apply -target=aws_security_group.web -auto-approve"
And the Slack message is updated to show "Remediation in progress..."
Scenario: Remediation completes successfully
Given a remediation command was dispatched to the agent
When the agent completes the terraform apply
Then the agent sends a remediation result event to SaaS
And the Slack message is updated to "✅ Reverted successfully"
And a new scan is triggered immediately to confirm no drift
Scenario: Remediation fails — agent reports error
Given a remediation command was dispatched
And the terraform apply exits with a non-zero code
When the agent reports the failure
Then the Slack message is updated to "❌ Remediation failed"
And the error output is included in the Slack message (truncated to 500 chars)
And the drift event status is set to "remediation_failed"
Scenario: Revert button clicked by unauthorized user
Given the Slack user "intern@acme.com" is not in the "remediation_approvers" group
When they click "Revert to IaC"
Then the SaaS backend rejects the action with "Unauthorized"
And a Slack ephemeral message is shown: "You don't have permission to trigger remediation"
And the action is logged with the user's identity
Scenario: Revert button clicked twice (double-click protection)
Given a remediation is already in progress for drift event "drift-001"
When the "Revert to IaC" button is clicked again
Then the SaaS backend returns "remediation already in progress"
And a Slack ephemeral message is shown: "Remediation already running"
And no duplicate remediation is triggered
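The double-click protection scenario implies an idempotency guard on the drift event ID. A minimal in-memory sketch of the contract; a production backend would need an atomic check instead (e.g. a conditional database update or Redis SETNX), since multiple API instances may receive the same click:

```typescript
// Guard against duplicate remediation triggers for the same drift event.
class RemediationGuard {
  private inProgress = new Set<string>();

  // Returns true if the caller won the right to start remediation,
  // false if one is already running for this drift event.
  tryStart(driftEventId: string): boolean {
    if (this.inProgress.has(driftEventId)) return false;
    this.inProgress.add(driftEventId);
    return true;
  }

  finish(driftEventId: string): void {
    this.inProgress.delete(driftEventId);
  }
}

const guard = new RemediationGuard();
console.log(guard.tryStart("drift-001")); // true  — remediation dispatched
console.log(guard.tryStart("drift-001")); // false — "remediation already in progress"
guard.finish("drift-001");
console.log(guard.tryStart("drift-001")); // true  — allowed again after completion
```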
Feature: Approval Workflow for Destructive Changes
Feature: Approval Workflow for Destructive Remediation
As a security officer
I want destructive remediations to require explicit approval
So that no resources are accidentally deleted
Background:
Given a drift event involves a resource that would be deleted during revert
And the tenant has approval workflow enabled
Scenario: Destructive revert requires approval
Given the drift revert would delete an RDS instance
When an engineer clicks "Revert to IaC"
Then instead of executing immediately, an approval request is sent
And the Slack message shows "⚠️ Destructive change — approval required"
And an approval request is sent to all users in the "remediation_approvers" group
Scenario: Approver approves destructive remediation
Given an approval request is pending for drift event "drift-002"
When an approver clicks "Approve" in Slack
Then the approval is recorded with the approver's identity and timestamp
And the remediation command is dispatched to the agent
And the Slack thread is updated: "Approved by @jane — executing..."
Scenario: Approver rejects destructive remediation
Given an approval request is pending
When an approver clicks "Reject"
Then the remediation is cancelled
And the drift event status is set to "remediation_rejected"
And the Slack message is updated: "❌ Rejected by @jane"
And the rejection reason (if provided) is logged
Scenario: Approval timeout — remediation auto-cancelled
Given an approval request has been pending for 24 hours
And no approver has responded
When the approval timeout fires
Then the remediation is automatically cancelled
And the drift event status is set to "approval_timeout"
And a Slack message is sent: "⏰ Approval timed out — remediation cancelled"
And the event is included in the next daily digest
Scenario: Approval timeout is configurable
Given the tenant has set approval_timeout_hours to 4
When an approval request is pending for 4 hours without response
Then the timeout fires after 4 hours (not 24)
Scenario: Self-approval is blocked
Given engineer "alice@acme.com" triggered the remediation request
When "alice@acme.com" attempts to approve their own request
Then the approval is rejected with "Self-approval not permitted"
And an ephemeral Slack message informs Alice
Scenario: Minimum approvers requirement
Given the tenant requires 2 approvals for destructive changes
And only 1 approver has approved
When the second approver approves
Then the quorum is met
And the remediation is dispatched
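The self-approval and quorum rules above can be sketched as a single state transition. Types and result names here are illustrative assumptions, not API from the spec:

```typescript
interface ApprovalState {
  requestedBy: string;        // who triggered the remediation
  requiredApprovals: number;  // tenant-configured quorum (e.g. 2)
  approvers: Set<string>;     // distinct users who have approved so far
}

type ApprovalResult = "recorded" | "quorum_met" | "self_approval_blocked" | "duplicate";

// Record one approval, enforcing the self-approval ban and the
// minimum-approvers quorum.
function approve(state: ApprovalState, user: string): ApprovalResult {
  if (user === state.requestedBy) return "self_approval_blocked";
  if (state.approvers.has(user)) return "duplicate";
  state.approvers.add(user);
  return state.approvers.size >= state.requiredApprovals ? "quorum_met" : "recorded";
}

const req: ApprovalState = { requestedBy: "alice@acme.com", requiredApprovals: 2, approvers: new Set() };
console.log(approve(req, "alice@acme.com")); // "self_approval_blocked"
console.log(approve(req, "jane@acme.com"));  // "recorded" (1 of 2)
console.log(approve(req, "bob@acme.com"));   // "quorum_met" — dispatch remediation
```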
Feature: Agent-Side Remediation Execution
Feature: Agent-Side Remediation Execution
As a platform engineer
I want the agent to apply IaC changes to revert drift
So that remediation happens inside the customer VPC with proper credentials
Background:
Given the agent has received a remediation command from the control plane
And the agent has the necessary IAM permissions
Scenario: Terraform revert executed successfully
Given the remediation command specifies stack "prod" and resource "aws_security_group.web"
When the agent executes the remediation
Then it runs "terraform apply -target=aws_security_group.web -auto-approve"
And captures stdout and stderr
And reports the result back to SaaS via the control plane
Scenario: kubectl revert executed successfully
Given the remediation command specifies a Kubernetes Deployment "api-server"
When the agent executes the remediation
Then it runs "kubectl apply -f /etc/drift/stacks/prod/api-server.yaml"
And reports the result back to SaaS
Scenario: Remediation command times out
Given the terraform apply is still running after 10 minutes
When the remediation timeout fires
Then the agent kills the terraform process
And reports status "timeout" to SaaS
And logs "remediation timed out after 10m" at ERROR level
Scenario: Agent loses connectivity during remediation
Given a remediation is in progress
And the agent loses connectivity to SaaS mid-execution
When connectivity is restored
Then the agent reports the final remediation result
And the SaaS backend reconciles the status
Scenario: Remediation command is replayed after agent restart
Given a remediation command was received but the agent restarted before executing
When the agent restarts
Then it checks for pending remediation commands
And executes any pending commands
And reports results to SaaS
Scenario: Remediation is blocked when panic mode is active
Given the tenant's panic mode is active
When a remediation command is received by the agent
Then the agent rejects the command
And logs "remediation blocked: panic mode active" at WARN level
And reports status "blocked_panic_mode" to SaaS
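The restart-replay and panic-mode scenarios share one invariant: every pending command produces exactly one reported outcome, and panic mode wins over execution. The agent itself is Go; this TypeScript sketch (with assumed type names) only illustrates that decision logic, not the real agent code:

```typescript
interface RemediationCommand { id: string; target: string; }
interface CommandOutcome { id: string; status: "executed" | "blocked_panic_mode"; }

// On startup the agent drains any remediation commands received before a
// restart; while panic mode is active it skips execution entirely and
// reports the blocked status back to SaaS instead.
function drainPending(pending: RemediationCommand[], panicModeActive: boolean): CommandOutcome[] {
  return pending.map((cmd) => ({
    id: cmd.id,
    status: panicModeActive ? "blocked_panic_mode" : "executed",
  }));
}

const pending = [{ id: "rem-001", target: "aws_security_group.web" }];
console.log(drainPending(pending, false)); // [ { id: 'rem-001', status: 'executed' } ]
console.log(drainPending(pending, true));  // [ { id: 'rem-001', status: 'blocked_panic_mode' } ]
```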
Epic 6: Dashboard UI
Feature: OAuth Login
Feature: OAuth Login
As a user
I want to log in via OAuth
So that I don't need a separate password for the drift dashboard
Background:
Given the dashboard is running at "https://app.drift.dd0c.io"
Scenario: Successful login via GitHub OAuth
Given the user navigates to the dashboard login page
When they click "Sign in with GitHub"
Then they are redirected to GitHub's OAuth authorization page
And after authorizing, redirected back with an authorization code
And the backend exchanges the code for a JWT
And the user is logged in and sees their tenant's dashboard
Scenario: Successful login via Google OAuth
Given the user clicks "Sign in with Google"
When they complete Google OAuth flow
Then they are logged in with a valid JWT
And the JWT contains their tenant_id and email claims
Scenario: OAuth callback with invalid state parameter
Given an OAuth callback arrives with a mismatched state parameter
When the frontend processes the callback
Then the login is rejected with "Invalid OAuth state"
And the user is redirected to the login page with an error message
And the event is logged as a potential CSRF attempt
Scenario: JWT expires during session
Given the user is logged in with a JWT that expires in 1 minute
When the JWT expires
Then the dashboard silently refreshes the token using the refresh token
And the user's session continues uninterrupted
Scenario: Refresh token is revoked
Given the user's refresh token has been revoked (e.g., password change)
When the dashboard attempts to refresh the JWT
Then the refresh fails
And the user is redirected to the login page
And shown "Your session has expired, please log in again"
Scenario: User belongs to no tenant
Given a new OAuth user has no tenant association
When they complete OAuth login
Then they are redirected to the onboarding flow
And prompted to create or join a tenant
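The invalid-state scenario is the standard OAuth CSRF defense: generate an unguessable state before redirecting, then compare the echoed value in constant time. A sketch using Node's crypto primitives (function names are illustrative):

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// Generate an unguessable state value before redirecting to the provider;
// store it in the user's session.
function newOAuthState(): string {
  return randomBytes(32).toString("hex");
}

// Compare the state echoed back in the callback against the stored one,
// in constant time, to reject forged callbacks (potential CSRF).
function stateMatches(stored: string, received: string): boolean {
  const a = Buffer.from(stored);
  const b = Buffer.from(received);
  return a.length === b.length && timingSafeEqual(a, b);
}

const state = newOAuthState();
console.log(stateMatches(state, state));      // true  — proceed with code exchange
console.log(stateMatches(state, "tampered")); // false — "Invalid OAuth state"
```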
Feature: Stack Overview
Feature: Stack Overview Page
As a platform engineer
I want to see all my stacks and their drift status at a glance
So that I can quickly identify which stacks need attention
Background:
Given the user is logged in as a member of tenant "acme"
And tenant "acme" has 5 stacks configured
Scenario: Stack overview loads all stacks
Given the user navigates to the dashboard home
When the page loads
Then 5 stack cards are displayed
And each card shows stack name, IaC type, last scan time, and drift count
Scenario: Stack with active drift shows warning indicator
Given stack "prod-api" has 3 active drift events
When the stack overview loads
Then the "prod-api" card shows a yellow warning badge with "3 drifts"
Scenario: Stack with CRITICAL drift shows critical indicator
Given stack "prod-db" has 1 CRITICAL drift event
When the stack overview loads
Then the "prod-db" card shows a red critical badge
Scenario: Stack with no drift shows healthy indicator
Given stack "staging" has no active drift events
When the stack overview loads
Then the "staging" card shows a green "Healthy" badge
Scenario: Stack overview auto-refreshes
Given the user is viewing the stack overview
When 60 seconds elapse
Then the page automatically refreshes drift counts without a full page reload
Scenario: Tenant with no stacks sees empty state
Given a new tenant has no stacks configured
When they navigate to the dashboard
Then an empty state is shown: "No stacks yet — run drift init to get started"
And a link to the onboarding guide is displayed
Scenario: Stack overview only shows current tenant's stacks
Given tenant "acme" has 5 stacks and tenant "globex" has 3 stacks
When a user from "acme" views the dashboard
Then only 5 stacks are shown
And no stacks from "globex" are visible
Feature: Diff Viewer
Feature: Drift Diff Viewer
As a platform engineer
I want to see a detailed diff of what changed
So that I understand exactly what drifted and how
Background:
Given the user is viewing a specific drift event
Scenario: Diff viewer shows field-level changes
Given a drift event has 3 changed fields
When the user opens the diff viewer
Then each changed field is shown with expected and actual values
And removed values are highlighted in red
And added values are highlighted in green
Scenario: Diff viewer shows JSON diff for complex values
Given a drift event involves a changed IAM policy document (JSON)
When the user opens the diff viewer
Then the policy JSON is shown as a structured diff
And individual JSON fields are highlighted rather than the whole blob
Scenario: Diff viewer handles large diffs with pagination
Given a drift event has 50 changed fields
When the user opens the diff viewer
Then the first 20 fields are shown
And a "Show more" button loads the remaining 30
Scenario: Diff viewer shows resource metadata
Given a drift event for resource "aws_security_group.web"
When the user views the diff
Then the viewer shows resource type, ARN, region, and stack name
And the scan timestamp is displayed
Scenario: Diff viewer copy-to-clipboard
Given the user is viewing a diff
When they click "Copy diff"
Then the diff is copied to clipboard in unified diff format
And a toast notification confirms "Copied to clipboard"
Feature: Drift Timeline
Feature: Drift Timeline
As a platform engineer
I want to see a timeline of drift events over time
So that I can identify patterns and recurring issues
Background:
Given the user is viewing the drift timeline for stack "prod-api"
Scenario: Timeline shows events in reverse chronological order
Given 10 drift events exist for "prod-api" over the last 7 days
When the user views the timeline
Then events are listed newest first
And each event shows timestamp, resource, severity, and status
Scenario: Timeline filtered by severity
Given the timeline has HIGH and LOW events
When the user filters by severity "HIGH"
Then only HIGH events are shown
And the filter state is reflected in the URL for shareability
Scenario: Timeline filtered by date range
Given the user selects a date range of "last 30 days"
When the filter is applied
Then only events within the last 30 days are shown
Scenario: Timeline shows remediation events
Given a drift event was remediated
When the user views the timeline
Then the event shows a "Remediated" badge
And the remediation timestamp and actor are shown
Scenario: Timeline is empty for new stack
Given a stack was added 1 hour ago and has no drift history
When the user views the timeline
Then an empty state is shown: "No drift history yet"
Scenario: Timeline pagination
Given 200 drift events exist for a stack
When the user views the timeline
Then the first 50 events are shown
And infinite scroll or pagination loads more on demand
Epic 7: Dashboard API
Feature: JWT Authentication
Feature: JWT Authentication on Dashboard API
As a SaaS engineer
I want all API endpoints protected by JWT
So that only authenticated users can access tenant data
Background:
Given the Dashboard API is running at "https://api.drift.dd0c.io"
Scenario: Valid JWT grants access
Given a request includes a valid JWT in the Authorization header
And the JWT is not expired
And the JWT signature is valid
When the request reaches the API
Then the request is processed
And the response is returned with HTTP 200
Scenario: Missing JWT returns 401
Given a request has no Authorization header
When the request reaches the API
Then the API returns HTTP 401
And the response body includes "Authentication required"
Scenario: Expired JWT returns 401
Given a request includes a JWT that expired 5 minutes ago
When the request reaches the API
Then the API returns HTTP 401
And the response body includes "Token expired"
Scenario: JWT with invalid signature returns 401
Given a request includes a JWT with a tampered signature
When the request reaches the API
Then the API returns HTTP 401
And the response body includes "Invalid token"
Scenario: JWT with wrong audience claim returns 401
Given a request includes a JWT issued for a different service
When the request reaches the API
Then the API returns HTTP 401
And the response body includes "Invalid audience"
Scenario: JWT tenant_id claim is used for RLS
Given a JWT contains tenant_id "acme"
When the request reaches the API
Then the API sets PostgreSQL session variable "app.tenant_id" to "acme"
And all queries are automatically scoped to tenant "acme" via RLS
Feature: Tenant Isolation via RLS
Feature: Tenant Isolation via Row-Level Security
As a security engineer
I want the API to enforce tenant isolation at the database level
So that a bug in application logic cannot leak cross-tenant data
Background:
Given the API uses PostgreSQL with RLS on all tenant-scoped tables
Scenario: User from tenant A cannot access tenant B's stacks
Given tenant "acme" has stacks ["prod", "staging"]
And tenant "globex" has stacks ["prod"]
When a user from "acme" calls GET /stacks
Then only "acme"'s stacks are returned
And "globex"'s stack is not included
Scenario: Cross-tenant drift event access attempt
Given drift event "drift-001" belongs to tenant "globex"
When a user from "acme" calls GET /drifts/drift-001
Then the API returns HTTP 404
And no data from "globex" is exposed
Scenario: Cross-tenant stack update attempt
Given stack "prod" belongs to tenant "globex"
When a user from "acme" calls PATCH /stacks/prod
Then the API returns HTTP 404
And the stack is not modified
Scenario: RLS is enforced even if application code has a bug
Given a hypothetical bug causes the API to omit the tenant_id filter in a query
When the query executes
Then PostgreSQL RLS still filters rows to the current tenant
And no cross-tenant data is returned
Scenario: Tenant isolation for remediation actions
Given remediation "rem-001" belongs to tenant "globex"
When a user from "acme" calls POST /remediations/rem-001/approve
Then the API returns HTTP 404
And the remediation is not affected
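The defense-in-depth property above hinges on every request's queries running in a transaction that pins `app.tenant_id`, so a policy like `USING (tenant_id = current_setting('app.tenant_id'))` filters rows even when application code forgets a WHERE clause. A hedged sketch of that per-request wrapper; `query` stands in for a real driver call (e.g. node-postgres):

```typescript
type Query = (sql: string, params?: unknown[]) => Promise<unknown>;

// Run `work` inside a transaction scoped to one tenant. set_config with
// is_local=true reverts the setting automatically at COMMIT/ROLLBACK, so
// a pooled connection cannot leak the tenant id into the next request.
async function withTenant<T>(query: Query, tenantId: string, work: () => Promise<T>): Promise<T> {
  await query("BEGIN");
  try {
    await query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work();
    await query("COMMIT");
    return result;
  } catch (err) {
    await query("ROLLBACK");
    throw err;
  }
}
```

Every handler would route its queries through `withTenant(...)`, making the RLS scoping structural rather than per-query discipline.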
Feature: Stack CRUD
Feature: Stack CRUD Operations
As a platform engineer
I want to manage my stacks via the API
So that I can add, update, and remove stacks programmatically
Background:
Given the user is authenticated as a member of tenant "acme"
Scenario: Create a new Terraform stack
Given a POST /stacks request with body:
"""
{ "name": "prod-api", "iac_type": "terraform", "region": "us-east-1" }
"""
When the request is processed
Then the API returns HTTP 201
And the response includes the new stack's id and created_at
And the stack is visible in GET /stacks
Scenario: Create stack with duplicate name
Given a stack named "prod-api" already exists for tenant "acme"
When a POST /stacks request is made with name "prod-api"
Then the API returns HTTP 409
And the response body includes "Stack name already exists"
Scenario: Create stack exceeding free tier limit
Given the tenant is on the free tier (max 3 stacks)
And the tenant already has 3 stacks
When a POST /stacks request is made
Then the API returns HTTP 402
And the response body includes "Free tier limit reached. Upgrade to add more stacks."
Scenario: Update stack configuration
Given stack "prod-api" exists
When a PATCH /stacks/prod-api request updates the scan_interval to 30
Then the API returns HTTP 200
And the stack's scan_interval is updated to 30
And the agent receives the updated config on next heartbeat
Scenario: Delete a stack
Given stack "staging" exists with no active remediations
When a DELETE /stacks/staging request is made
Then the API returns HTTP 204
And the stack is removed from GET /stacks
And associated drift events are soft-deleted (retained for 90 days)
Scenario: Delete a stack with active remediation
Given stack "prod-api" has an active remediation in progress
When a DELETE /stacks/prod-api request is made
Then the API returns HTTP 409
And the response body includes "Cannot delete stack with active remediation"
Scenario: Get stack by ID
Given stack "prod-api" exists
When a GET /stacks/prod-api request is made
Then the API returns HTTP 200
And the response includes all stack fields including last_scan_at and drift_count
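The create-stack scenarios imply a validation order before insertion: duplicate name → 409, free-tier limit → 402, otherwise 201. A sketch with assumed types; the check ordering is an assumption (the spec does not define which error wins when both apply), and the upgrade URL matches the one used in the CLI scenario later:

```typescript
interface Tenant { id: string; plan: "free" | "paid"; stackNames: string[]; }

type CreateStackResult =
  | { status: 201 }
  | { status: 409; error: string }
  | { status: 402; error: string; upgrade_url: string };

const FREE_TIER_MAX_STACKS = 3;

// Validate a POST /stacks request body's name against tenant state.
function validateCreateStack(tenant: Tenant, name: string): CreateStackResult {
  if (tenant.stackNames.includes(name)) {
    return { status: 409, error: "Stack name already exists" };
  }
  if (tenant.plan === "free" && tenant.stackNames.length >= FREE_TIER_MAX_STACKS) {
    return {
      status: 402,
      error: "Free tier limit reached. Upgrade to add more stacks.",
      upgrade_url: "https://app.drift.dd0c.io/billing",
    };
  }
  return { status: 201 };
}

const acme: Tenant = { id: "acme", plan: "free", stackNames: ["prod", "staging", "dev"] };
console.log(validateCreateStack(acme, "prod").status);     // 409
console.log(validateCreateStack(acme, "prod-api").status); // 402
```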
Feature: Drift Event CRUD
Feature: Drift Event API
As a platform engineer
I want to query and manage drift events via the API
So that I can build integrations and automations
Background:
Given the user is authenticated as a member of tenant "acme"
Scenario: List drift events for a stack
Given stack "prod-api" has 10 drift events
When GET /stacks/prod-api/drifts is called
Then the API returns HTTP 200
And the response includes all 10 events
And events are sorted by detected_at descending
Scenario: Filter drift events by severity
Given drift events include HIGH and LOW severity events
When GET /drifts?severity=HIGH is called
Then only HIGH severity events are returned
Scenario: Filter drift events by status
When GET /drifts?status=active is called
Then only unresolved drift events are returned
Scenario: Mark drift event as acknowledged
Given drift event "drift-001" has status "active"
When POST /drifts/drift-001/acknowledge is called
Then the API returns HTTP 200
And the event status is updated to "acknowledged"
And the acknowledged_by and acknowledged_at fields are set
Scenario: Mark drift event as resolved manually
Given drift event "drift-001" has status "active"
When POST /drifts/drift-001/resolve is called with body {"reason": "manual fix applied"}
Then the API returns HTTP 200
And the event status is updated to "resolved"
And the resolution reason is stored
Scenario: Pagination on drift events list
Given 200 drift events exist
When GET /drifts?page=1&per_page=50 is called
Then 50 events are returned
And the response includes pagination metadata (total, page, per_page, total_pages)
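The pagination metadata in the scenario above is plain arithmetic: `total_pages` is the ceiling of `total / per_page`, and a page is an offset slice of the ordered result set. A minimal sketch (offset pagination assumed; cursor pagination would also satisfy the scenario):

```typescript
interface PageMeta { total: number; page: number; per_page: number; total_pages: number; }

// Metadata for a paginated list response.
function pageMeta(total: number, page: number, perPage: number): PageMeta {
  return { total, page, per_page: perPage, total_pages: Math.ceil(total / perPage) };
}

// Slice one 1-indexed page out of an ordered result set.
function paginate<T>(items: T[], page: number, perPage: number): T[] {
  return items.slice((page - 1) * perPage, page * perPage);
}

console.log(pageMeta(200, 1, 50)); // { total: 200, page: 1, per_page: 50, total_pages: 4 }
```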
Feature: API Rate Limiting
Feature: API Rate Limiting
As a SaaS operator
I want API rate limits enforced per tenant
So that one tenant cannot degrade service for others
Background:
Given the API enforces rate limits per tenant
Scenario: Request within rate limit succeeds
Given the rate limit is 1000 requests per minute
And the tenant has made 500 requests this minute
When a new request is made
Then the API returns HTTP 200
And the response includes headers X-RateLimit-Remaining and X-RateLimit-Reset
Scenario: Request exceeds rate limit
Given the tenant has made 1000 requests this minute
When a new request is made
Then the API returns HTTP 429
And the response includes Retry-After header
And the response body includes "Rate limit exceeded"
Scenario: Rate limit resets after window
Given the tenant hit the rate limit at T+0
When 60 seconds elapse (new window)
Then the tenant's request counter resets
And new requests succeed
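The three scenarios above describe a fixed-window limiter: a per-tenant counter that resets when a new 60-second window starts. A minimal in-memory sketch (production systems often prefer sliding windows or token buckets to avoid bursts at window boundaries, and a shared store such as Redis across API instances):

```typescript
// Fixed-window per-tenant rate limiter.
class RateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs = 60_000) {}

  // Returns whether the request is allowed plus the remaining quota,
  // suitable for the X-RateLimit-Remaining header.
  check(tenantId: string, nowMs: number): { allowed: boolean; remaining: number } {
    const w = this.windows.get(tenantId);
    if (!w || nowMs - w.windowStart >= this.windowMs) {
      // First request of a new window: reset the counter.
      this.windows.set(tenantId, { windowStart: nowMs, count: 1 });
      return { allowed: true, remaining: this.limit - 1 };
    }
    if (w.count >= this.limit) return { allowed: false, remaining: 0 };
    w.count++;
    return { allowed: true, remaining: this.limit - w.count };
  }
}

const limiter = new RateLimiter(1000);
console.log(limiter.check("acme", 0).remaining); // 999
```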
Epic 8: Infrastructure
Feature: Terraform/CDK SaaS Infrastructure Provisioning
Feature: SaaS Infrastructure Provisioning
As a SaaS platform engineer
I want the SaaS infrastructure defined as code
So that environments are reproducible and auditable
Background:
Given the infrastructure code lives in "infra/" directory
And Terraform and CDK are both used for different layers
Scenario: Terraform plan produces no unexpected changes in production
Given the production Terraform state is up to date
When "terraform plan" runs against the production workspace
Then the plan shows zero resource changes
And the plan output is stored as a CI artifact
Scenario: New environment provisioned from scratch
Given a new environment "staging-eu" is needed
When "terraform apply -var-file=staging-eu.tfvars" runs
Then all required resources are created (VPC, RDS, SQS, ECS, etc.)
And the environment is reachable within 15 minutes
And outputs (queue URLs, DB endpoints) are stored in SSM Parameter Store
Scenario: RDS instance is provisioned with encryption at rest
Given the Terraform module for RDS is applied
When the RDS instance is created
Then storage_encrypted is true
And the KMS key ARN is set to the tenant-specific key
Scenario: SQS FIFO queues are provisioned with DLQ
Given the SQS Terraform module is applied
When the queues are created
Then "drift-reports.fifo" exists with content_based_deduplication enabled
And "drift-reports-dlq.fifo" exists as the redrive target
And maxReceiveCount is set to 3
Scenario: CDK stack drift detected by drift agent (dogfooding)
Given the SaaS CDK stacks are monitored by the drift agent itself
When a CDK resource is manually modified in the AWS console
Then the drift agent detects the change
And an internal alert is sent to the SaaS ops channel
Scenario: Infrastructure destroy is blocked in production
Given a Terraform workspace is tagged as "production"
When "terraform destroy" is attempted
Then the CI pipeline rejects the command
And logs "destroy blocked in production environment"
Feature: GitHub Actions CI/CD
Feature: GitHub Actions CI/CD Pipeline
As a platform engineer
I want automated CI/CD via GitHub Actions
So that code changes are tested and deployed safely
Background:
Given GitHub Actions workflows are defined in ".github/workflows/"
Scenario: PR triggers CI checks
Given a pull request is opened against the main branch
When the CI workflow triggers
Then unit tests run for Go agent code
And unit tests run for TypeScript SaaS code
And Terraform plan runs for infrastructure changes
And all checks must pass before merge is allowed
Scenario: Merge to main triggers staging deployment
Given a PR is merged to the main branch
When the deploy workflow triggers
Then the Go agent binary is built and pushed to ECR
And the TypeScript services are built and deployed to ECS staging
And smoke tests run against staging
And the deployment is marked successful if smoke tests pass
Scenario: Staging smoke tests fail — production deploy blocked
Given staging deployment completed
And smoke tests fail (e.g., health check returns 500)
When the pipeline evaluates the smoke test results
Then the production deployment step is skipped
And a Slack alert is sent to "#deployments" channel
And the pipeline exits with failure
Scenario: Production deployment requires manual approval
Given staging smoke tests passed
When the pipeline reaches the production deployment step
Then it pauses and waits for a manual approval in GitHub Actions
And the approval request is sent to the "production-deployers" team
And deployment proceeds only after approval
Scenario: Rollback triggered on production health check failure
Given a production deployment completed
And the post-deploy health check fails within 5 minutes
When the rollback workflow triggers
Then the previous ECS task definition revision is redeployed
And a Slack alert is sent: "Production rollback triggered"
And the failed deployment is logged with the commit SHA
Scenario: Terraform plan diff is posted to PR as comment
Given a PR modifies infrastructure code
When the CI Terraform plan runs
Then the plan output is posted as a comment on the PR
And the comment includes a summary of resources to add/change/destroy
Scenario: Secrets are never logged in CI
Given the CI pipeline uses AWS credentials and Slack tokens
When any CI step runs
Then no secret values appear in the GitHub Actions log output
And GitHub secret masking is verified in the workflow config
Feature: Database Migrations in CI/CD
Feature: Database Migrations
As a SaaS engineer
I want database migrations to run automatically in CI/CD
So that schema changes are applied safely and consistently
Background:
Given migrations are managed with a migration tool (e.g., golang-migrate or Flyway)
Scenario: Migration runs successfully on deploy
Given a new migration file "V20_add_remediation_status.sql" exists
When the deploy pipeline runs
Then the migration is applied to the target database
And the migration is recorded in the schema_migrations table
And the deploy continues
Scenario: Migration is idempotent — already-applied migration is skipped
Given migration "V20_add_remediation_status.sql" was already applied
When the deploy pipeline runs again
Then the migration is skipped
And no error is thrown
Scenario: Migration fails — deploy is halted
Given a migration contains a syntax error
When the migration runs
Then the migration fails and is rolled back
And the deploy pipeline halts
And an alert is sent to the engineering team
Scenario: Additive-only migrations enforced
Given a migration attempts to drop a column
When the CI linter checks the migration
Then the lint check fails with "destructive migration not allowed"
And the PR is blocked from merging
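The additive-only gate above can be approximated with a pattern check over the migration file. This regex sketch only illustrates the CI lint from the scenario; a real linter would parse the SQL (and the pattern list here is an assumed, non-exhaustive set of destructive statements):

```typescript
// Statements considered destructive for the additive-only policy.
const DESTRUCTIVE_PATTERNS = [
  /\bDROP\s+(TABLE|COLUMN|INDEX)\b/i,
  /\bALTER\s+TABLE\b[\s\S]*\bDROP\b/i,
  /\bTRUNCATE\b/i,
];

function lintMigration(sql: string): { ok: boolean; error?: string } {
  for (const pattern of DESTRUCTIVE_PATTERNS) {
    if (pattern.test(sql)) {
      return { ok: false, error: "destructive migration not allowed" };
    }
  }
  return { ok: true };
}

console.log(lintMigration("ALTER TABLE drifts ADD COLUMN remediation_status text;"));
// { ok: true }
console.log(lintMigration("ALTER TABLE drifts DROP COLUMN legacy_flag;"));
// { ok: false, error: 'destructive migration not allowed' }
```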
Epic 9: Onboarding & PLG
Feature: drift init CLI
Feature: drift init CLI Onboarding
As a new user
I want a guided CLI setup experience
So that I can connect my infrastructure to drift in minutes
Background:
Given the drift CLI is installed via "curl -sSL https://get.drift.dd0c.io | sh"
Scenario: First-time drift init runs guided setup
Given the user runs "drift init" for the first time
When the CLI starts
Then it prompts for: cloud provider, IaC type, region, and stack name
And validates each input before proceeding
And generates a config file at "/etc/drift/config.yaml"
Scenario: drift init detects existing Terraform state
Given a Terraform state file exists in the current directory
When the user runs "drift init"
Then the CLI auto-detects the IaC type as "terraform"
And pre-fills the stack name from the Terraform workspace name
And asks the user to confirm
Scenario: drift init creates IAM role with least-privilege policy
Given the user confirms IAM role creation
When "drift init" runs
Then it creates an IAM role "drift-agent-role" with only required permissions
And outputs the role ARN for the config
Scenario: drift init generates and installs agent certificates
Given the user has authenticated with the SaaS backend
When "drift init" completes
Then it fetches a signed mTLS certificate from the SaaS CA
And stores the certificate at "/etc/drift/certs/agent.crt"
And stores the private key at "/etc/drift/certs/agent.key" with mode 0600
Scenario: drift init installs agent as systemd service
Given the user is on a Linux system with systemd
When "drift init" completes
Then it installs a systemd unit file for the drift agent
And enables and starts the service
And confirms "drift-agent is running" in the output
Scenario: drift init fails gracefully on missing AWS credentials
Given no AWS credentials are configured
When "drift init" runs
Then it detects missing credentials
And outputs a helpful error: "AWS credentials not found. Run 'aws configure' first."
And exits with code 1
Scenario: drift init --dry-run shows what would be created
Given the user runs "drift init --dry-run"
When the CLI runs
Then it outputs all actions it would take without executing them
And no resources are created
And no config files are written
Feature: Free Tier — 3 Stacks
Feature: Free Tier Stack Limit
As a product manager
I want the free tier limited to 3 stacks
So that we have a clear upgrade path
Background:
Given a tenant is on the free tier
Scenario: Free tier tenant can add up to 3 stacks
Given the tenant has 0 stacks
When they add stacks "prod", "staging", and "dev"
Then all 3 stacks are created successfully
And the tenant is not prompted to upgrade
Scenario: Free tier tenant blocked from adding 4th stack
Given the tenant has 3 stacks
When they attempt to add a 4th stack via the CLI
Then the CLI outputs "Free tier limit reached (3/3 stacks). Upgrade at https://app.drift.dd0c.io/billing"
And exits with code 1
Scenario: Free tier tenant blocked from adding 4th stack via API
Given the tenant has 3 stacks
When POST /stacks is called
Then the API returns HTTP 402
And the response includes upgrade_url
Scenario: Free tier tenant blocked from adding 4th stack via dashboard
Given the tenant has 3 stacks
When they click "Add Stack" in the dashboard
Then a modal appears: "You've reached the free tier limit"
And an "Upgrade Plan" button is shown
Scenario: Upgrading to paid tier unlocks unlimited stacks
Given the tenant upgrades to the paid plan via Stripe
When the Stripe webhook confirms payment
Then the tenant's stack limit is set to unlimited
And they can immediately add a 4th stack
Feature: Stripe Billing
Feature: Stripe Billing Integration
As a product manager
I want usage-based billing via Stripe
So that customers are charged $29/stack/month
Background:
Given Stripe is configured with the drift product and price
Scenario: New tenant subscribes to paid plan
Given a free tier tenant clicks "Upgrade"
When they complete the Stripe Checkout flow
Then a Stripe subscription is created for the tenant
And the subscription includes a metered item for stack count
And the tenant's plan is updated to "paid" in the database
Scenario: Monthly invoice calculated at $29/stack
Given a tenant has 5 stacks active for the full billing month
When Stripe generates the monthly invoice
Then the invoice total is $145.00 (5 × $29)
And the invoice is sent to the tenant's billing email
Scenario: Stack added mid-month — prorated charge
Given a tenant adds a 6th stack on the 15th of the month
When Stripe generates the monthly invoice
Then the 6th stack is charged prorated (~$14.50 for half month)
Scenario: Stack deleted mid-month — prorated credit
Given a tenant deletes a stack on the 10th of the month
When Stripe generates the monthly invoice
Then a prorated credit is applied for the unused days
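The prorated figures in the two scenarios above can be sanity-checked with day-based arithmetic. This is an illustrative sketch only: Stripe's own proration is timestamp-based and the conventions here (the add/delete date itself is not billed or credited) are assumptions, not the documented billing behavior.

```python
from datetime import date
import calendar

MONTHLY_RATE = 29.00  # $ per stack per month on the paid plan

def prorated_charge(added_on: date) -> float:
    """Charge for a stack added partway through its billing month
    (day-based; the add date itself is not billed in this convention)."""
    days_in_month = calendar.monthrange(added_on.year, added_on.month)[1]
    days_remaining = days_in_month - added_on.day
    return round(MONTHLY_RATE * days_remaining / days_in_month, 2)

def prorated_credit(deleted_on: date) -> float:
    """Credit for the unused days after a stack is deleted mid-month."""
    days_in_month = calendar.monthrange(deleted_on.year, deleted_on.month)[1]
    days_unused = days_in_month - deleted_on.day
    return round(MONTHLY_RATE * days_unused / days_in_month, 2)
```

For a 30-day month, adding a stack on the 15th yields 29 × 15/30 = $14.50, matching the scenario's expected charge.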
Scenario: Payment fails — tenant notified and grace period applied
Given a tenant's payment method is declined
When Stripe sends the payment_failed webhook
Then the tenant receives an email: "Payment failed — please update your billing info"
And a 7-day grace period is applied before service is restricted
Scenario: Grace period expires — stacks suspended
Given a tenant's payment has been failing for 7 days
When the grace period expires
Then the tenant's stacks are suspended (scans paused)
And the dashboard shows a banner: "Account suspended — payment required"
And the agent stops sending reports
Scenario: Payment updated — service restored immediately
Given a tenant's stacks are suspended due to non-payment
When the tenant updates their payment method and payment succeeds
Then the Stripe webhook triggers service restoration
And stacks are unsuspended within 60 seconds
And scans resume on the next scheduled cycle
Scenario: Stripe webhook signature validation
Given a webhook arrives at POST /webhooks/stripe
When the webhook signature is invalid
Then the API returns HTTP 400
And the event is ignored
And the attempt is logged as a potential spoofing attempt
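The signature check in the scenario above follows Stripe's documented `Stripe-Signature` scheme (`t=<timestamp>,v1=<hex HMAC-SHA256 of "timestamp.payload">`). A minimal stdlib sketch, assuming a single `v1` entry per header (real headers may carry several during secret rotation, which Stripe's official SDK handles):

```python
import hashlib
import hmac
import time

def verify_stripe_signature(payload: bytes, sig_header: str,
                            secret: str, tolerance: int = 300) -> bool:
    """Verify a Stripe-Signature header. Returns False on any mismatch,
    malformed header, or stale timestamp, so the caller can respond
    HTTP 400 and log the attempt as potential spoofing."""
    try:
        parts = dict(kv.split("=", 1) for kv in sig_header.split(","))
        timestamp = int(parts["t"])
        candidate = parts["v1"]  # assumes one v1 signature
    except (KeyError, ValueError):
        return False
    if abs(time.time() - timestamp) > tolerance:
        return False  # replay protection
    signed = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, candidate)
```

In practice the official `stripe` library's webhook helper should be preferred; the point here is that verification happens before the event body is trusted at all.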
Scenario: Free tier tenant is never charged
Given a tenant is on the free tier with 3 stacks
When the billing cycle runs
Then no Stripe invoice is generated for this tenant
And no charge is made
Feature: Guided Setup Flow
Feature: Guided Setup Flow in Dashboard
As a new user
I want a step-by-step setup guide in the dashboard
So that I can get value from drift quickly
Background:
Given a new tenant has just signed up and logged in
Scenario: Onboarding checklist is shown to new tenants
Given the tenant has completed 0 onboarding steps
When they log in for the first time
Then an onboarding checklist is shown with steps:
| Step | Status |
| Install drift agent | Pending |
| Add your first stack | Pending |
| Configure Slack alerts | Pending |
| Run your first scan | Pending |
Scenario: Checklist step marked complete automatically
Given the tenant installs the agent and it sends its first heartbeat
When the dashboard refreshes
Then the "Install drift agent" step is marked complete
And a congratulatory message is shown
Scenario: Onboarding checklist dismissed after all steps complete
Given all 4 onboarding steps are complete
When the tenant views the dashboard
Then the checklist is replaced with the normal stack overview
And a one-time "You're all set!" banner is shown
Scenario: Onboarding checklist can be dismissed early
Given the tenant has completed 2 of 4 steps
When they click "Dismiss checklist"
Then the checklist is hidden
And a "Resume setup" link is available in the settings page
Epic 10: Transparent Factory
Feature: Feature Flags
Feature: Feature Flags
As a product engineer
I want feature flags to control rollout of new capabilities
So that I can ship safely and roll back instantly
Background:
Given the feature flag service is running
And flags are evaluated per-tenant
Scenario: Feature flag enabled for specific tenant
Given flag "remediation_v2" is enabled for tenant "acme"
And flag "remediation_v2" is disabled for tenant "globex"
When tenant "acme" triggers a remediation
Then the v2 remediation code path is used
When tenant "globex" triggers a remediation
Then the v1 remediation code path is used
Scenario: Feature flag enabled for percentage rollout
Given flag "new_diff_viewer" is enabled for 10% of tenants
When 1000 tenants load the dashboard
Then approximately 100 tenants see the new diff viewer
And the remaining 900 see the existing diff viewer
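One way the per-tenant and percentage evaluations above could work is deterministic hash bucketing: the same tenant always lands in the same bucket, so a 10% rollout is stable across evaluations rather than re-randomized per request. The flag store and names below are hypothetical illustrations, not the actual flag service schema.

```python
import hashlib

# Hypothetical in-memory flag configuration.
FLAGS = {
    "new_diff_viewer": {"percentage": 10},
    "remediation_v2": {"tenants": {"acme"}},
}

def flag_enabled(flag: str, tenant_id: str) -> bool:
    """Evaluate a flag for a tenant; unknown flags default to disabled."""
    cfg = FLAGS.get(flag)
    if cfg is None:
        # Safe default per the scenarios; real code would also log a warning.
        return False
    if "tenants" in cfg:
        return tenant_id in cfg["tenants"]
    # Deterministic bucket in [0, 100) derived from flag + tenant.
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < cfg.get("percentage", 0)
```

Hashing on `flag:tenant` (rather than tenant alone) keeps independent flags from always selecting the same 10% of tenants.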
Scenario: Feature flag disabled globally kills a feature
Given flag "experimental_pulumi_scan" is globally disabled
When any tenant attempts to add a Pulumi stack
Then the API returns HTTP 501 "Feature not available"
And the dashboard hides the Pulumi option in the stack type selector
Scenario: Feature flag change takes effect without deployment
Given flag "slack_digest_v2" is currently disabled
When an operator enables the flag in the flag management console
Then within 30 seconds, the notification engine uses the v2 digest format
And no service restart is required
Scenario: Feature flag evaluation is logged for audit
Given flag "remediation_v2" is evaluated for tenant "acme"
When the flag is checked
Then the evaluation (flag name, tenant, result, timestamp) is written to the audit log
And the audit log is queryable for compliance review
Scenario: Unknown feature flag defaults to disabled
Given code checks for flag "nonexistent_flag"
When the flag service evaluates it
Then the result is "disabled" (safe default)
And a warning is logged: "unknown flag: nonexistent_flag"
Feature: Additive Schema Migrations
Feature: Additive-Only Schema Migrations
As a SaaS engineer
I want all schema changes to be additive
So that deployments are zero-downtime and rollback-safe
Background:
Given the migration linter runs in CI on every PR
Scenario: Adding a new nullable column is allowed
Given a migration adds column "remediation_status VARCHAR(50) NULL"
When the migration linter checks the file
Then the lint check passes
And the migration is approved for merge
Scenario: Adding a new table is allowed
Given a migration creates a new table "decision_logs"
When the migration linter checks the file
Then the lint check passes
Scenario: Dropping a column is blocked
Given a migration contains "ALTER TABLE drift_events DROP COLUMN old_field"
When the migration linter checks the file
Then the lint check fails with "destructive operation: DROP COLUMN not allowed"
And the PR is blocked
Scenario: Dropping a table is blocked
Given a migration contains "DROP TABLE legacy_alerts"
When the migration linter checks the file
Then the lint check fails with "destructive operation: DROP TABLE not allowed"
Scenario: Renaming a column is blocked
Given a migration contains "ALTER TABLE stacks RENAME COLUMN name TO stack_name"
When the migration linter checks the file
Then the lint check fails with "destructive operation: RENAME COLUMN not allowed"
And the suggested alternative is to add a new column and deprecate the old one
Scenario: Adding a NOT NULL column without default is blocked
Given a migration adds "ALTER TABLE stacks ADD COLUMN owner_id UUID NOT NULL"
When the migration linter checks the file
Then the lint check fails with "NOT NULL column without DEFAULT will break existing rows"
Scenario: Old column marked deprecated — not yet removed
Given column "legacy_iac_path" is marked with a deprecation comment in the schema
When the application code is deployed
Then the column still exists in the database
And the application ignores it
And a deprecation notice is logged at startup
Scenario: Column removal only after 2 release cycles
Given column "legacy_iac_path" has been deprecated for 2 releases
And all application code no longer references it
When an engineer submits a migration to drop the column
Then the migration linter checks the deprecation age
And allows the drop if the deprecation period has elapsed
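The linter behavior specified above can be sketched as pattern matching over the migration SQL. This is a simplified illustration (a production linter would parse the SQL rather than regex it, and would also implement the deprecation-age check); the messages mirror the scenarios.

```python
import re

# Destructive patterns the linter blocks, per the scenarios above.
DESTRUCTIVE = [
    (re.compile(r"\bDROP\s+COLUMN\b", re.I),
     "destructive operation: DROP COLUMN not allowed"),
    (re.compile(r"\bDROP\s+TABLE\b", re.I),
     "destructive operation: DROP TABLE not allowed"),
    (re.compile(r"\bRENAME\s+COLUMN\b", re.I),
     "destructive operation: RENAME COLUMN not allowed"),
    # ADD COLUMN ... NOT NULL with no DEFAULT anywhere after it.
    (re.compile(r"\bADD\s+COLUMN\b.*\bNOT\s+NULL\b(?!.*\bDEFAULT\b)", re.I | re.S),
     "NOT NULL column without DEFAULT will break existing rows"),
]

def lint_migration(sql: str) -> list:
    """Return lint failures for a migration file; an empty list means pass."""
    return [msg for pattern, msg in DESTRUCTIVE if pattern.search(sql)]
```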
Feature: Decision Logs
Feature: Decision Logs
As an engineering lead
I want architectural and operational decisions logged
So that the team has a transparent record of why things are the way they are
Background:
Given the decision log is stored in "docs/decisions/" as markdown files
Scenario: New ADR created for significant architectural change
Given an engineer proposes switching from SQS to Kafka
When they create "docs/decisions/ADR-042-kafka-vs-sqs.md"
Then the ADR includes: context, decision, consequences, and status
And the PR requires at least 2 reviewers from the architecture group
Scenario: ADR status transitions are tracked
Given ADR-042 has status "proposed"
When the team accepts the decision
Then the status is updated to "accepted"
And the accepted_at date is recorded
And the ADR is immutable after acceptance (changes require a new ADR)
Scenario: Superseded ADR is linked to its replacement
Given ADR-010 is superseded by ADR-042
When ADR-042 is accepted
Then ADR-010's status is updated to "superseded"
And ADR-010 includes a link to ADR-042
Scenario: Decision log is searchable
Given 50 ADRs exist in the decision log
When an engineer searches for "database"
Then all ADRs mentioning "database" in title or body are returned
Scenario: Operational decisions logged for drift remediation
Given an operator manually overrides a remediation decision
When the override is applied
Then a decision log entry is created with: operator identity, reason, timestamp, and affected resource
And the entry is linked to the drift event
Feature: OTEL Tracing
Feature: OpenTelemetry Distributed Tracing
As a SaaS engineer
I want end-to-end distributed tracing via OTEL
So that I can diagnose latency and errors across services
Background:
Given OTEL is configured with a Jaeger/Tempo backend
And all services export traces
Scenario: Drift report ingestion is fully traced
Given an agent publishes a drift report to SQS
When the event processor consumes and processes the message
Then a trace exists spanning: SQS receive → normalization → severity scoring → DB write
And each span includes tenant_id and scan_id as attributes
And the total trace duration is under 2 seconds for normal reports
Scenario: Slack notification is traced end-to-end
Given a drift event triggers a Slack notification
When the notification is sent
Then a trace exists spanning: event stored → notification engine → Slack API call
And the Slack API response code is recorded as a span attribute
Scenario: Remediation flow is fully traced
Given a remediation is triggered from Slack
When the remediation completes
Then a trace exists spanning: Slack interaction → API → control plane → agent → result
And the trace includes the approver identity and approval timestamp
Scenario: Slow span triggers latency alert
Given the DB write span exceeds 500ms
When the trace is analyzed
Then a latency alert fires in the observability platform
And the alert includes the trace_id for direct investigation
Scenario: Trace context propagated across service boundaries
Given the agent sends a drift report with a trace context header
When the event processor receives the message
Then it extracts the trace context from the SQS message attributes
And continues the trace as a child span
And the full trace is visible as a single tree in Jaeger
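Context propagation across the SQS boundary typically rides on the W3C `traceparent` header (`version-traceid-spanid-flags`) carried as a message attribute. The helpers below are a hypothetical stdlib sketch of the consumer side: flatten the SQS attributes into a carrier dict and parse the parent identifiers. A real implementation would instead hand the carrier to OpenTelemetry's propagator (`opentelemetry.propagate.extract`) and start a child span from the returned context.

```python
def carrier_from_sqs(message: dict) -> dict:
    """Flatten SQS MessageAttributes into a lowercase-keyed carrier dict
    that a W3C propagator can read."""
    attrs = message.get("MessageAttributes", {})
    return {k.lower(): v["StringValue"] for k, v in attrs.items()
            if v.get("DataType") == "String"}

def parse_traceparent(header: str) -> dict:
    """Parse a W3C traceparent header: version-traceid-spanid-flags."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```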
Scenario: Traces do not contain PII or secrets
Given a drift report is processed end-to-end
When the trace is exported to the tracing backend
Then no span attributes contain secret values
And no span attributes contain tenant PII beyond tenant_id
And the scrubber audit confirms 0 secrets in trace data
Scenario: OTEL collector is unavailable — service continues
Given the OTEL collector is down
When the event processor handles a drift report
Then the report is processed normally
And trace export failures are logged at DEBUG level
And no errors are surfaced to the end user
Feature: Governance / Panic Mode
Feature: Panic Mode
As a SaaS operator
I want a panic mode that halts all automated actions
So that I can freeze the system during a security incident or outage
Background:
Given the panic mode toggle is available in the ops console
Scenario: Operator activates panic mode globally
Given panic mode is currently inactive
When an operator activates panic mode with reason "security incident"
Then all automated remediations are immediately halted
And all pending remediation commands are cancelled
And a Slack alert is sent to "#ops-critical": "⚠️ PANIC MODE ACTIVATED by @operator"
And the reason and operator identity are logged
Scenario: Panic mode blocks new remediations
Given panic mode is active
When a user clicks "Revert to IaC" in Slack
Then the SaaS backend rejects the action
And the user sees: "System is in panic mode — automated actions are disabled"
Scenario: Panic mode blocks agent remediation commands
Given panic mode is active
And an agent receives a remediation command (e.g., from a race condition)
When the agent checks panic mode status
Then the agent rejects the command
And logs "remediation blocked: panic mode active" at WARN level
Scenario: Panic mode does NOT block drift scanning
Given panic mode is active
When the next scan cycle runs
Then the agent continues scanning normally
And drift events continue to be reported and stored
And notifications continue to be sent (read-only operations are unaffected)
Scenario: Panic mode deactivated by authorized operator
Given panic mode is active
When an authorized operator deactivates panic mode
Then automated remediations are re-enabled
And a Slack alert is sent: "✅ PANIC MODE DEACTIVATED by @operator"
And the deactivation is logged with timestamp and operator identity
Scenario: Panic mode activation requires elevated role
Given a regular user attempts to activate panic mode
When they call POST /ops/panic-mode
Then the API returns HTTP 403
And the attempt is logged as a security event
Scenario: Panic mode state is persisted across restarts
Given panic mode is active
When the SaaS backend restarts
Then panic mode remains active after restart
And the system does not auto-deactivate panic mode on restart
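The gate semantics above (global and per-tenant scope, persistence across restarts, no effect on read-only scanning) can be sketched as a small class. The dict store is a stand-in for the durable row the scenarios imply; names and shape are assumptions for illustration.

```python
from typing import Optional

class PanicMode:
    """Panic-mode gate backed by a persistent store (a dict stands in for a
    database row here), so activation survives restarts and never auto-resets."""

    GLOBAL = "__global__"

    def __init__(self, store: dict):
        self.store = store  # in production: loaded from durable storage at startup

    def activate(self, operator: str, reason: str,
                 tenant: Optional[str] = None) -> None:
        # Record who froze the system and why, globally or for one tenant.
        self.store[tenant or self.GLOBAL] = {"operator": operator, "reason": reason}

    def deactivate(self, tenant: Optional[str] = None) -> None:
        self.store.pop(tenant or self.GLOBAL, None)

    def blocks_remediation(self, tenant: str) -> bool:
        # Scans and notifications are read-only and never consult this gate.
        return self.GLOBAL in self.store or tenant in self.store
```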
Scenario: Tenant-level panic mode
Given tenant "acme" is experiencing an incident
When an operator activates panic mode for tenant "acme" only
Then only "acme"'s remediations are halted
And other tenants are unaffected
And "acme"'s dashboard shows a panic mode banner
Feature: Observability — Metrics and Alerts
Feature: Operational Metrics and Alerting
As a SaaS operator
I want key metrics exported and alerting configured
So that I can detect and respond to production issues proactively
Background:
Given metrics are exported to CloudWatch and/or Prometheus
Scenario: Drift report processing latency metric
Given drift reports are being processed
When the event processor handles each report
Then a histogram metric "drift_report_processing_duration_ms" is recorded
And an alert fires if the P99 latency exceeds 5000ms
Scenario: DLQ depth metric triggers alert
Given the DLQ depth exceeds 0
When the CloudWatch alarm evaluates
Then a PagerDuty alert fires within 5 minutes
And the alert includes the queue name and message count
Scenario: Agent offline metric
Given an agent has not sent a heartbeat for 5 minutes
When the heartbeat monitor checks
Then a metric "agents_offline_count" is incremented
And if any agent is offline for more than 15 minutes, an alert fires
Scenario: Secret scrubber miss rate metric
Given the scrubber processes drift reports
When a scrubber audit runs
Then a metric "scrubber_miss_rate" is recorded
And if the miss rate is ever > 0%, a CRITICAL alert fires immediately
Scenario: Stripe webhook processing metric
Given Stripe webhooks are received
When each webhook is processed
Then a counter "stripe_webhooks_processed_total" is incremented by event type
And a counter "stripe_webhooks_failed_total" is incremented on failures
And an alert fires if the failure rate exceeds 1% over 5 minutes
Scenario: Database connection pool metric
Given the application maintains a PostgreSQL connection pool
When pool utilization exceeds 80%
Then a warning alert fires
And when utilization exceeds 95%, a critical alert fires
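As a worked illustration of the latency alert above, P99 can be computed with an approximate nearest-rank quantile over the recorded histogram samples. This is a sketch, not the CloudWatch/Prometheus evaluation itself, which would operate on the exported `drift_report_processing_duration_ms` series.

```python
def p99(samples) -> float:
    """Approximate nearest-rank P99: the value below which ~99% of
    samples fall."""
    ordered = sorted(samples)
    idx = max(0, int(len(ordered) * 0.99) - 1)
    return ordered[idx]

def latency_alert(samples, threshold_ms: float = 5000.0) -> bool:
    """True when P99 processing latency breaches the alert threshold."""
    return p99(samples) > threshold_ms
```

Note that exactly 1% of samples being slow does not trip a P99 alert; the breach must reach past the 99th percentile, which is why the metric is paired with the DLQ-depth and failure-rate alerts rather than relied on alone.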
Feature: Cross-Tenant Isolation Audit
Feature: Cross-Tenant Isolation Audit
As a security engineer
I want automated tests that verify cross-tenant isolation
So that data leakage between tenants is caught before production
Background:
Given the test suite includes cross-tenant isolation tests
And two test tenants "tenant-a" and "tenant-b" exist with separate data
Scenario: API cross-tenant read isolation
Given tenant-a has drift event "drift-a-001"
When tenant-b's JWT is used to call GET /drifts/drift-a-001
Then the API returns HTTP 404
And no data from tenant-a is present in the response body
Scenario: API cross-tenant write isolation
Given tenant-a has stack "prod"
When tenant-b's JWT is used to call DELETE /stacks/prod
Then the API returns HTTP 404
And tenant-a's stack is not deleted
Scenario: Database RLS cross-tenant query isolation
Given a direct database query runs with app.tenant_id set to "tenant-b"
When the query selects all rows from drift_events
Then zero rows from tenant-a are returned
And the query does not error
Scenario: SQS message from tenant-a cannot be processed as tenant-b
Given a drift report message from tenant-a arrives on the queue
When the event processor reads the tenant_id from the message
Then the event is stored under tenant-a's tenant_id
And not under tenant-b's tenant_id
Scenario: Remediation command cannot target another tenant's agent
Given tenant-b's agent has agent_id "agent-b-001"
When tenant-a attempts to send a remediation command to "agent-b-001"
Then the control plane rejects the command with HTTP 403
And the attempt is logged as a security event
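The read/write isolation scenarios above hinge on one design choice: scoping every lookup by the caller's tenant, so a foreign resource id yields 404 rather than 403 and its existence is never revealed. A minimal in-memory stand-in for the drift API (purely illustrative; the real API enforces this via JWT claims plus PostgreSQL RLS):

```python
class InMemoryDriftAPI:
    """Toy model of tenant-scoped lookups: the store key includes the
    caller's tenant_id, so another tenant's event id simply does not
    exist from the caller's point of view (404, not 403)."""

    def __init__(self):
        self._events = {}  # (tenant_id, event_id) -> payload

    def create_event(self, tenant_id: str, event_id: str, payload: dict) -> None:
        self._events[(tenant_id, event_id)] = payload

    def get_event(self, tenant_id: str, event_id: str):
        """Return (status_code, body) as the HTTP layer would."""
        event = self._events.get((tenant_id, event_id))
        return (200, event) if event is not None else (404, None)
```

The CI isolation suite described below can then assert exactly the behavior in these scenarios: a cross-tenant read returns 404 with no leaked data.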
Scenario: Cross-tenant isolation tests run in CI on every PR
Given the isolation test suite is part of the CI pipeline
When a PR is opened
Then all cross-tenant isolation tests run automatically
And the PR cannot be merged if any isolation test fails
End of BDD Acceptance Test Specifications for dd0c/drift
Total epics covered: 10 | Features: 40+ | Scenarios: 200+