# dd0c/drift — BDD Acceptance Test Specifications

> Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic.

---

- [Epic 1: Drift Detection Agent](#epic-1-drift-detection-agent)
- [Epic 2: Agent Communication](#epic-2-agent-communication)
- [Epic 3: Event Processor](#epic-3-event-processor)
- [Epic 4: Notification Engine](#epic-4-notification-engine)
- [Epic 5: Remediation](#epic-5-remediation)
- [Epic 6: Dashboard UI](#epic-6-dashboard-ui)
- [Epic 7: Dashboard API](#epic-7-dashboard-api)
- [Epic 8: Infrastructure](#epic-8-infrastructure)
- [Epic 9: Onboarding & PLG](#epic-9-onboarding--plg)
- [Epic 10: Transparent Factory](#epic-10-transparent-factory)

---

## Epic 1: Drift Detection Agent

### Feature: Agent Initialization

```gherkin
Feature: Drift Detection Agent Initialization
  As a platform engineer
  I want the drift agent to initialize correctly in my VPC
  So that it can begin scanning infrastructure state

  Background:
    Given the drift agent binary is installed at "/usr/local/bin/drift-agent"
    And a valid agent config exists at "/etc/drift/config.yaml"

  Scenario: Successful agent startup
    Given the config specifies AWS region "us-east-1"
    And valid mTLS certificates are present
    And the SaaS endpoint is reachable
    When the agent starts
    Then the agent logs "drift-agent started" at INFO level
    And the agent registers itself with the SaaS control plane
    And the first scan is scheduled within 15 minutes

  Scenario: Agent startup with missing config
    Given no config file exists at "/etc/drift/config.yaml"
    When the agent starts
    Then the agent exits with code 1
    And logs "config file not found" at ERROR level

  Scenario: Agent startup with invalid AWS credentials
    Given the config references an IAM role that does not exist
    When the agent starts
    Then the agent exits with code 1
    And logs "failed to assume IAM role" at ERROR level

  Scenario: Agent startup with unreachable SaaS endpoint
    Given the SaaS endpoint is not reachable from the VPC
    When the agent starts
```
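The unreachable-endpoint scenario requires three connection retries with exponential backoff before the agent exits. A minimal sketch of such a retry schedule in Python; the function name and defaults are illustrative assumptions, not part of the actual agent:

```python
import random

def backoff_delays(attempts: int = 3, base: float = 1.0, cap: float = 30.0,
                   jitter: bool = False) -> list[float]:
    """Delays (seconds) to wait before each connection retry.

    Doubles the delay per attempt, capped at `cap`; optional "full jitter".
    All names and defaults here are illustrative, not taken from the agent.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = random.uniform(0, delay)  # full-jitter variant
        delays.append(delay)
    return delays

print(backoff_delays())  # deterministic schedule: [1.0, 2.0, 4.0]
```

With these assumed defaults the agent would wait 1 s, 2 s, then 4 s between its three attempts before exiting with code 1, matching the scenario's "3 attempts" log line.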
```gherkin
    Then the agent retries connection 3 times with exponential backoff
    And after all retries fail, exits with code 1
    And logs "failed to reach control plane after 3 attempts" at ERROR level
```

### Feature: Terraform State Scanning

```gherkin
Feature: Terraform State Scanning
  As a platform engineer
  I want the agent to read Terraform state
  So that it can compare planned state against live AWS resources

  Background:
    Given the agent is running and initialized
    And the stack type is "terraform"

  Scenario: Scan Terraform state from S3 backend
    Given a Terraform state file exists in S3 bucket "my-tfstate" at key "prod/terraform.tfstate"
    And the agent has s3:GetObject permission on that bucket
    When a scan cycle runs
    Then the agent reads the state file successfully
    And parses all resource definitions from the state

  Scenario: Scan Terraform state from local backend
    Given a Terraform state file exists at "/var/drift/stacks/prod/terraform.tfstate"
    When a scan cycle runs
    Then the agent reads the local state file
    And parses all resource definitions

  Scenario: Terraform state file is locked
    Given the Terraform state file is currently locked by another process
    When a scan cycle runs
    Then the agent logs "state file locked, skipping scan" at WARN level
    And schedules a retry for the next cycle
    And does not report a drift event

  Scenario: Terraform state file is malformed
    Given the Terraform state file contains invalid JSON
    When a scan cycle runs
    Then the agent logs "failed to parse state file" at ERROR level
    And emits a health event with status "parse_error"
    And does not report false drift

  Scenario: Terraform state references deleted resources
    Given the Terraform state contains a resource "aws_instance.web"
    And that EC2 instance no longer exists in AWS
    When a scan cycle runs
    Then the agent detects drift of type "resource_deleted"
    And includes the resource ARN in the drift report
```

### Feature: CloudFormation Stack Scanning

```gherkin
Feature: CloudFormation Stack Scanning
  As a platform engineer
  I want the agent to scan CloudFormation stacks
  So that I can detect drift from declared template state

  Background:
    Given the agent is running
    And the stack type is "cloudformation"

  Scenario: Scan a CloudFormation stack successfully
    Given a CloudFormation stack named "prod-api" exists in "us-east-1"
    And the agent has cloudformation:DescribeStackResources permission
    When a scan cycle runs
    Then the agent retrieves all stack resources
    And compares each resource's actual configuration against the template

  Scenario: CloudFormation stack does not exist
    Given the config references a CloudFormation stack "ghost-stack"
    And that stack does not exist in AWS
    When a scan cycle runs
    Then the agent logs "stack not found: ghost-stack" at WARN level
    And emits a drift event of type "stack_missing"

  Scenario: CloudFormation native drift detection result available
    Given CloudFormation has already run drift detection on "prod-api"
    And the result shows 2 drifted resources
    When a scan cycle runs
    Then the agent reads the CloudFormation drift detection result
    And includes both drifted resources in the drift report

  Scenario: CloudFormation drift detection result is stale
    Given the last CloudFormation drift detection ran 48 hours ago
    When a scan cycle runs
    Then the agent triggers a new CloudFormation drift detection
    And waits up to 5 minutes for the result
    And uses the fresh result in the drift report
```

### Feature: Kubernetes Resource Scanning

```gherkin
Feature: Kubernetes Resource Scanning
  As a platform engineer
  I want the agent to scan Kubernetes resources
  So that I can detect drift from Helm/Kustomize definitions

  Background:
    Given the agent is running
    And a kubeconfig is available at "/etc/drift/kubeconfig"
    And the stack type is "kubernetes"

  Scenario: Scan Kubernetes Deployment successfully
    Given a Deployment "api-server" exists in namespace "production"
    And the IaC definition specifies 3 replicas
    And the live Deployment has 2 replicas
    When a scan cycle runs
    Then the agent detects drift on field "spec.replicas"
    And the drift report includes expected value "3" and actual value "2"

  Scenario: Kubernetes resource has been manually patched
    Given a ConfigMap "app-config" has been manually edited
    And the live data differs from the Helm chart values
    When a scan cycle runs
    Then the agent detects drift of type "config_modified"
    And includes a field-level diff in the report

  Scenario: Kubernetes API server is unreachable
    Given the Kubernetes API server is not responding
    When a scan cycle runs
    Then the agent logs "k8s API unreachable" at ERROR level
    And emits a health event with status "k8s_unreachable"
    And does not report false drift

  Scenario: Scan across multiple namespaces
    Given the config specifies namespaces ["production", "staging"]
    When a scan cycle runs
    Then the agent scans resources in both namespaces independently
    And drift reports include the namespace as context
```

### Feature: Secret Scrubbing

```gherkin
Feature: Secret Scrubbing Before Transmission
  As a security officer
  I want all secrets scrubbed from drift reports before they leave the VPC
  So that credentials are never transmitted to the SaaS backend

  Background:
    Given the agent is running with secret scrubbing enabled

  Scenario: AWS access key detected and scrubbed
    Given a drift report contains a field value matching the AWS access key ID pattern "AKIA[0-9A-Z]{16}"
    When the scrubber processes the report
    Then the field value is replaced with "[REDACTED:aws_access_key]"
    And the original value is not present anywhere in the transmitted payload

  Scenario: Generic password field scrubbed
    Given a drift report contains a field named "password" with value "s3cr3tP@ss"
    When the scrubber processes the report
    Then the field value is replaced with "[REDACTED:password]"

  Scenario: Private key block scrubbed
    Given a drift report contains a PEM private key block
    When the scrubber processes the report
    Then the entire PEM block is replaced with "[REDACTED:private_key]"

  Scenario: Nested secret in JSON value scrubbed
    Given a drift report contains a JSON string value with a nested "api_key" field
    When the scrubber processes the report
    Then the nested api_key value is replaced with "[REDACTED:api_key]"
    And the surrounding JSON structure is preserved

  Scenario: Secret scrubber bypass attempt via encoding
    Given a drift report contains a base64-encoded AWS secret key
    When the scrubber processes the report
    Then the encoded value is detected and replaced with "[REDACTED:encoded_secret]"

  Scenario: Secret scrubber bypass attempt via Unicode homoglyphs
    Given a drift report contains a value using Unicode lookalike characters to resemble a secret pattern
    When the scrubber processes the report
    Then the value is flagged and replaced with "[REDACTED:suspicious_value]"

  Scenario: Non-secret value is not scrubbed
    Given a drift report contains a field "instance_type" with value "t3.medium"
    When the scrubber processes the report
    Then the field value remains "t3.medium" unchanged

  Scenario: Scrubber coverage is 100%
    Given a test corpus of 500 known secret patterns
    When the scrubber processes all patterns
    Then every pattern is detected and redacted
    And the scrubber reports 0 missed secrets

  Scenario: Scrubber audit log
    Given the scrubber redacts 3 values from a drift report
    When the report is transmitted
    Then the agent logs a scrubber audit entry with count "3" and field names (not values)
    And the audit log is stored locally for 30 days
```

### Feature: Pulumi State Scanning

```gherkin
Feature: Pulumi State Scanning
  As a platform engineer
  I want the agent to read Pulumi state
  So that I can detect drift from Pulumi-managed resources

  Background:
    Given the agent is running
    And the stack type is "pulumi"

  Scenario: Scan Pulumi state from Pulumi Cloud backend
    Given a Pulumi stack "prod" exists in organization "acme"
    And the agent has a valid Pulumi access token configured
    When a scan cycle runs
    Then the agent fetches the stack's exported state via Pulumi API
    And parses all resource URNs and properties

  Scenario: Scan Pulumi state from self-managed S3 backend
    Given the Pulumi state is stored in S3 bucket "pulumi-state" at key "acme/prod.json"
    When a scan cycle runs
    Then the agent reads the state file from S3
    And parses all resource definitions

  Scenario: Pulumi access token is expired
    Given the configured Pulumi access token has expired
    When a scan cycle runs
    Then the agent logs "pulumi token expired" at ERROR level
    And emits a health event with status "auth_error"
    And does not report false drift
```

### Feature: 15-Minute Scan Cycle

```gherkin
Feature: Scheduled Scan Cycle
  As a platform engineer
  I want scans to run every 15 minutes automatically
  So that drift is detected promptly

  Background:
    Given the agent is running and initialized

  Scenario: Scan runs on schedule
    Given the last scan completed at T+0
    When 15 minutes elapse
    Then a new scan starts automatically
    And the scan completion is logged with timestamp

  Scenario: Scan cycle skipped if previous scan still running
    Given a scan started at T+0 and is still running at T+15
    When the next scheduled scan would start
    Then the new scan is skipped
    And the agent logs "scan skipped: previous scan still in progress" at WARN level

  Scenario: Scan interval is configurable
    Given the config specifies scan_interval_minutes: 30
    When the agent starts
    Then scans run every 30 minutes instead of 15

  Scenario: No drift detected — no report sent
    Given all resources match their IaC definitions
    When a scan cycle completes
    Then no drift report is sent to SaaS
    And the agent logs "scan complete: no drift detected"

  Scenario: Agent recovers scan schedule after restart
    Given the agent was restarted
    When the agent starts
    Then it reads the last scan timestamp from local state
    And schedules the next scan relative to the last completed scan
```

---

## Epic 2: Agent Communication

### Feature: mTLS Certificate Handshake

```gherkin
Feature: Mutual TLS (mTLS) Authentication
  As a security engineer
  I want all agent-to-SaaS communication to use mTLS
  So that only authenticated agents can submit drift reports

  Background:
    Given the agent has a client certificate issued by the drift CA
    And the SaaS endpoint requires mTLS

  Scenario: Successful mTLS handshake
    Given the agent certificate is valid and not expired
    And the SaaS server certificate is trusted by the agent's CA bundle
    When the agent connects to the SaaS endpoint
    Then the TLS handshake succeeds
    And the connection is established with mutual authentication

  Scenario: Agent certificate is expired
    Given the agent certificate expired 1 day ago
    When the agent attempts to connect to the SaaS endpoint
    Then the TLS handshake fails with "certificate expired"
    And the agent logs "mTLS cert expired, cannot connect" at ERROR level
    And the agent emits a local alert to stderr
    And no data is transmitted

  Scenario: Agent certificate is revoked
    Given the agent certificate has been added to the CRL
    When the agent attempts to connect
    Then the SaaS endpoint rejects the connection
    And the agent logs "certificate revoked" at ERROR level

  Scenario: Agent presents wrong CA certificate
    Given the agent has a certificate from an untrusted CA
    When the agent attempts to connect
    Then the TLS handshake fails
    And the agent logs "unknown certificate authority" at ERROR level

  Scenario: SaaS server certificate is expired
    Given the SaaS server certificate has expired
    When the agent attempts to connect
    Then the agent rejects the server certificate
    And logs "server cert expired" at ERROR level
    And does not transmit any data

  Scenario: Certificate rotation — new cert issued before expiry
    Given the agent certificate expires in 7 days
    And the control plane issues a new certificate
    When the agent receives the new certificate via the cert rotation endpoint
    Then the agent stores the new certificate
    And uses the new certificate for subsequent connections
    And logs "certificate rotated successfully" at INFO level

  Scenario: Certificate rotation fails — agent continues with old cert
    Given the agent certificate expires in 7 days
    And the cert rotation endpoint is unreachable
    When the agent attempts certificate rotation
    Then the agent retries rotation every hour
    And continues using the existing certificate until it expires
    And logs "cert rotation failed, retrying" at WARN level
```

### Feature: Heartbeat

```gherkin
Feature: Agent Heartbeat
  As a SaaS operator
  I want agents to send regular heartbeats
  So that I can detect offline or unhealthy agents

  Background:
    Given the agent is running and connected via mTLS

  Scenario: Heartbeat sent on schedule
    Given the heartbeat interval is 60 seconds
    When 60 seconds elapse
    Then the agent sends a heartbeat message to the SaaS control plane
    And the heartbeat includes agent version, last scan time, and stack count

  Scenario: SaaS marks agent as offline after missed heartbeats
    Given the agent has not sent a heartbeat for 5 minutes
    When the SaaS control plane checks agent status
    Then the agent is marked as "offline"
    And a notification is sent to the tenant's configured alert channel

  Scenario: Agent reconnects after network interruption
    Given the agent lost connectivity for 3 minutes
    When connectivity is restored
    Then the agent re-establishes the mTLS connection
    And sends a heartbeat immediately
    And the SaaS marks the agent as "online"

  Scenario: Heartbeat includes health status
    Given the last scan encountered a parse error
    When the agent sends its next heartbeat
    Then the heartbeat payload includes health_status "degraded"
    And includes the error details in the health_details field

  Scenario: Multiple agents from same tenant
    Given tenant "acme" has 3 agents running in different regions
    When all 3 agents send heartbeats
    Then each agent is tracked independently by agent_id
    And the SaaS dashboard shows 3 active agents for tenant "acme"
```

### Feature: SQS FIFO Drift Report Ingestion

```gherkin
Feature: SQS FIFO Drift Report Ingestion
  As a SaaS backend engineer
  I want drift reports delivered via SQS FIFO
  So that reports are processed in order without duplicates

  Background:
    Given an SQS FIFO queue "drift-reports.fifo" exists
    And the agent has sqs:SendMessage permission on the queue

  Scenario: Agent publishes drift report to SQS FIFO
    Given a scan detects 2 drifted resources
    When the agent publishes the drift report
    Then a message is sent to "drift-reports.fifo"
    And the MessageGroupId is set to the tenant's agent_id
    And the MessageDeduplicationId is set to the scan's unique scan_id

  Scenario: Duplicate scan report is deduplicated
    Given a drift report with scan_id "scan-abc-123" was already sent
    When the agent sends the same report again (e.g., due to retry)
    Then SQS FIFO deduplicates the message
    And only one copy is delivered to the consumer

  Scenario: SQS message size limit exceeded
    Given a drift report exceeds 256KB (SQS max message size)
    When the agent attempts to publish
    Then the agent splits the report into chunks
    And each chunk is sent as a separate message with a shared batch_id
    And the sequence is indicated by chunk_index and chunk_total fields

  Scenario: SQS publish fails — agent retries with backoff
    Given the SQS endpoint returns a 500 error
    When the agent attempts to publish a drift report
    Then the agent retries up to 5 times with exponential backoff
    And logs each retry attempt at WARN level
    And after all retries fail, stores the report locally for later replay

  Scenario: Agent publishes to correct queue per environment
    Given the agent config specifies environment "production"
    When the agent publishes a drift report
    Then the message is sent to "drift-reports-prod.fifo"
    And not to "drift-reports-staging.fifo"
```

### Feature: Dead Letter Queue Handling

```gherkin
Feature: Dead Letter Queue (DLQ) Handling
  As a SaaS operator
  I want failed messages routed to a DLQ
  So that no drift reports are silently lost

  Background:
    Given a DLQ "drift-reports-dlq.fifo" is configured
    And the maxReceiveCount is set to 3

  Scenario: Message moved to DLQ after max receive count
    Given a drift report message has failed processing 3 times
    When the SQS visibility timeout expires a 3rd time
    Then the message is automatically moved to the DLQ
    And an alarm fires on the DLQ depth metric

  Scenario: DLQ alarm triggers operator notification
    Given the DLQ depth exceeds 0
    When the CloudWatch alarm triggers
    Then a PagerDuty alert is sent to the on-call engineer
    And the alert includes the queue name and approximate message count

  Scenario: DLQ message is replayed after fix
    Given a message in the DLQ was caused by a schema validation bug
    And the bug has been fixed and deployed
    When an operator triggers DLQ replay
    Then the message is moved back to the main queue
    And processed successfully
    And removed from the DLQ

  Scenario: DLQ message contains poison pill — permanently discarded
    Given a DLQ message is malformed beyond repair
    When an operator inspects the message
    Then the operator can mark it as "discarded" via the ops console
    And the discard action is logged with operator identity and reason
    And the message is deleted from the DLQ
```

---

## Epic 3: Event Processor

### Feature: Drift Report Normalization

```gherkin
Feature: Drift Report Normalization
  As a SaaS backend engineer
  I want incoming drift reports normalized to a canonical schema
  So that downstream consumers work with consistent data regardless of IaC type

  Background:
    Given the event processor is running
    And it is consuming from "drift-reports.fifo"

  Scenario: Normalize a Terraform drift report
    Given a raw drift report with type "terraform" arrives on the queue
    When the event processor consumes the message
    Then it maps Terraform resource addresses to canonical resource_id format
    And stores the normalized report in the "drift_events" PostgreSQL table
    And sets the "iac_type" field to "terraform"

  Scenario: Normalize a CloudFormation drift report
    Given a raw drift report with type "cloudformation" arrives
    When the event processor normalizes it
    Then CloudFormation logical resource IDs are mapped to canonical resource_id
    And the "iac_type" field is set to "cloudformation"

  Scenario: Normalize a Kubernetes drift report
    Given a raw drift report with type "kubernetes" arrives
    When the event processor normalizes it
    Then Kubernetes resource URIs (namespace/kind/name) are mapped to canonical resource_id
    And the "iac_type" field is set to "kubernetes"

  Scenario: Unknown IaC type in report
    Given a drift report arrives with iac_type "unknown_tool"
    When the event processor attempts normalization
    Then the message is rejected with error "unsupported_iac_type"
    And the message is moved to the DLQ
    And an error is logged with the raw message_id

  Scenario: Report with missing required fields
    Given a drift report is missing the "tenant_id" field
    When the event processor validates the message
    Then validation fails with "missing required field: tenant_id"
    And the message is moved to the DLQ
    And no partial record is written to PostgreSQL

  Scenario: Chunked report is reassembled before normalization
    Given 3 chunked messages arrive with the same batch_id
    And chunk_total is 3
    When all 3 chunks are received
    Then the event processor reassembles them in chunk_index order
    And normalizes the complete report as a single event

  Scenario: Chunked report — one chunk missing after timeout
    Given 2 of 3 chunks arrive for batch_id "batch-xyz"
    And the third chunk does not arrive within 10 minutes
    When the reassembly timeout fires
    Then the event processor logs "incomplete batch: batch-xyz, missing chunk 2" at WARN level
    And moves the partial batch to the DLQ
```

### Feature: Drift Severity Scoring

```gherkin
Feature: Drift Severity Scoring
  As a platform engineer
  I want each drift event scored by severity
  So that I can prioritize which drifts to address first

  Background:
    Given a normalized drift event is ready for scoring

  Scenario: Security group rule added — HIGH severity
    Given a drift event describes an added inbound rule on a security group
    And the rule opens port 22 to 0.0.0.0/0
    When the severity scorer evaluates the event
    Then the severity is set to "HIGH"
    And the reason is "public SSH access opened"

  Scenario: IAM policy attached — CRITICAL severity
    Given a drift event describes an IAM policy "AdministratorAccess" attached to a role
    When the severity scorer evaluates the event
    Then the severity is set to "CRITICAL"
    And the reason is "admin policy attached outside IaC"

  Scenario: Replica count changed — LOW severity
    Given a drift event describes a Kubernetes Deployment replica count changed from 3 to 2
    When the severity scorer evaluates the event
    Then the severity is set to "LOW"
    And the reason is "non-security configuration drift"

  Scenario: Resource deleted — HIGH severity
    Given a drift event describes an RDS instance being deleted
    When the severity scorer evaluates the event
    Then the severity is set to "HIGH"
    And the reason is "managed resource deleted outside IaC"

  Scenario: Tag-only drift — INFO severity
    Given a drift event describes only tag changes on an EC2 instance
    When the severity scorer evaluates the event
    Then the severity is set to "INFO"
    And the reason is "tag-only drift"

  Scenario: Custom severity rules override defaults
    Given a tenant has configured a custom rule: "any change to resource type aws_s3_bucket = CRITICAL"
    And a drift event describes a tag change on an S3 bucket
    When the severity scorer evaluates the event
    Then the tenant's custom rule takes precedence
    And the severity is set to "CRITICAL"

  Scenario: Severity score is stored with the drift event
    Given a drift event has been scored as "HIGH"
    When the event processor writes to PostgreSQL
    Then the "drift_events" row includes severity "HIGH" and scored_at timestamp
```

### Feature: PostgreSQL Storage with RLS

```gherkin
Feature: Drift Event Storage with Row-Level Security
  As a SaaS engineer
  I want drift events stored in PostgreSQL with RLS
  So that tenants can only access their own data

  Background:
    Given PostgreSQL is running with RLS enabled on the "drift_events" table
    And the RLS policy filters rows by "tenant_id = current_setting('app.tenant_id')"

  Scenario: Drift event written for tenant A
    Given a normalized drift event belongs to tenant "acme"
    When the event processor writes the event
    Then the row is inserted with tenant_id "acme"
    And the row is visible when querying as tenant "acme"

  Scenario: Tenant B cannot read tenant A's drift events
    Given drift events exist for tenant "acme"
    When a query runs with app.tenant_id set to "globex"
    Then zero rows are returned
    And no error is thrown (RLS silently filters)

  Scenario: Application role without tenant context sees no rows
    Given the application database role is "drift_app"
    When "drift_app" attempts to query without setting app.tenant_id
    Then zero rows are returned due to RLS default-deny policy

  Scenario: Drift event deduplication
    Given a drift event with scan_id "scan-abc" and resource_id "aws_instance.web" already exists
    When the event processor attempts to insert the same event again
    Then the INSERT is ignored (ON CONFLICT DO NOTHING)
    And no duplicate row is created

  Scenario: Database connection pool exhausted
    Given all PostgreSQL connections are in use
    When the event processor tries to write a drift event
    Then it waits up to 5 seconds for a connection
    And if no connection is available, the message is nacked and retried
    And an alert fires if pool exhaustion persists for more than 60 seconds

  Scenario: Schema migration runs without downtime
    Given a new additive column "remediation_status" is being added
    When the migration runs
    Then existing rows are unaffected
    And new rows include the "remediation_status" column
    And the event processor continues writing without restart
```

### Feature: Idempotent Event Processing

```gherkin
Feature: Idempotent Event Processing
  As a SaaS backend engineer
  I want event processing to be idempotent
  So that retries and replays do not create duplicate records

  Scenario: Same SQS message delivered twice (at-least-once delivery)
    Given an SQS message with MessageId "msg-001" was processed successfully
    When the same message is delivered again due to SQS retry
    Then the event processor detects the duplicate via scan_id lookup
    And skips processing
    And deletes the message from the queue

  Scenario: Event processor restarts mid-batch
    Given the event processor crashed after writing 5 of 10 events
    When the processor restarts and reprocesses the batch
    Then the 5 already-written events are skipped (idempotent)
    And the remaining 5 events are written
    And the final state has exactly 10 events

  Scenario: Replay from DLQ does not create duplicates
    Given a DLQ message is replayed after a bug fix
    And the event was partially processed before the crash
    When the replayed message is processed
    Then the processor uses upsert semantics
    And the final record reflects the correct state
```

---

## Epic 4: Notification Engine

### Feature: Slack Block Kit Drift Alerts

```gherkin
Feature: Slack Block Kit Drift Alerts
  As a platform engineer
  I want to receive Slack notifications when drift is detected
  So that I can act on it immediately

  Background:
    Given a tenant has configured a Slack webhook URL
    And the notification engine is running

  Scenario: HIGH severity drift triggers immediate Slack alert
    Given a drift event with severity "HIGH" is stored
    When the notification engine processes the event
    Then a Slack Block Kit message is sent to the configured channel
    And the message includes the resource_id, drift type, and severity badge
    And the message includes an inline diff of expected vs actual values
    And the message includes a "Revert to IaC" action button

  Scenario: CRITICAL severity drift triggers immediate Slack alert with @here mention
    Given a drift event with severity "CRITICAL" is stored
    When the notification engine processes the event
    Then the Slack message includes an "@here" mention
    And the message is sent within 60 seconds of the event being stored

  Scenario: LOW severity drift is batched — not sent immediately
    Given a drift event with severity "LOW" is stored
    When the notification engine processes the event
    Then no immediate Slack message is sent
    And the event is queued for the next daily digest

  Scenario: INFO severity drift is suppressed from Slack
    Given a drift event with severity "INFO" is stored
    When the notification engine processes the event
    Then no Slack message is sent
    And the event is only visible in the dashboard

  Scenario: Slack message includes inline diff
    Given a drift event shows security group rule changed
    And expected value is "port 443 from 10.0.0.0/8"
    And actual value is "port 443 from 0.0.0.0/0"
    When the Slack alert is composed
    Then the message body includes a diff block showing the change
    And removed lines are prefixed with "-" in red
    And added lines are prefixed with "+" in green

  Scenario: Slack webhook returns 429 rate limit
    Given the Slack webhook returns HTTP 429
    When the notification engine attempts to send
    Then it respects the Retry-After header
    And retries after the specified delay
    And logs "slack rate limited, retrying in Xs" at WARN level

  Scenario: Slack webhook URL is invalid
    Given the tenant's Slack webhook URL returns HTTP 404
    When the notification engine attempts to send
    Then it logs "invalid slack webhook" at ERROR level
    And marks the notification as "failed" in the database
    And does not retry indefinitely (max 3 attempts)

  Scenario: Multiple drifts in same scan — grouped in one message
    Given a scan detects 5 drifted resources all with severity "HIGH"
    When the notification engine processes the batch
    Then a single Slack message is sent grouping all 5 resources
    And the message includes a summary "5 resources drifted"
    And each resource is listed with its diff
```

### Feature: Daily Digest

```gherkin
Feature: Daily Drift Digest
  As a platform engineer
  I want a daily summary of all drift events
  So that I have a consolidated view without alert fatigue

  Background:
    Given a tenant has daily digest enabled
    And the digest is scheduled for 09:00 tenant local time

  Scenario: Daily digest sent with pending LOW/INFO events
    Given 12 LOW severity drift events accumulated since the last digest
    When the digest job runs at 09:00
    Then a single Slack message is sent summarizing all 12 events
    And the message groups events by stack name
    And includes a link to the dashboard for full details

  Scenario: Daily digest skipped when no events
    Given no drift events occurred in the last 24 hours
    When the digest job runs
    Then no Slack message is sent
    And the job logs "digest skipped: no events" at INFO level

  Scenario: Daily digest includes resolved drifts
    Given 3 drift events were detected and then remediated in the last 24 hours
    When the digest runs
    Then the digest includes a "Resolved" section listing those 3 events
    And shows time-to-remediation for each

  Scenario: Digest timezone is per-tenant
    Given tenant "acme" is in timezone "America/New_York" (UTC-5)
    And tenant "globex" is in timezone "Asia/Tokyo" (UTC+9)
    When the digest scheduler runs
    Then "acme" receives their digest at 14:00 UTC
    And "globex" receives their digest at 00:00 UTC

  Scenario: Digest delivery fails — retried next hour
    Given the Slack webhook is temporarily unavailable at 09:00
    When the digest send fails
    Then the system retries at 10:00
    And again at 11:00
    And after 3 failures, marks the digest as "failed" and alerts the operator
```

### Feature: Severity-Based Routing

```gherkin
Feature: Severity-Based Notification Routing
  As a platform engineer
  I want different severity levels routed to different channels
  So that critical alerts reach the right people immediately

  Background:
    Given a tenant has configured routing rules

  Scenario: CRITICAL routed to #incidents channel
    Given the routing rule maps CRITICAL → "#incidents"
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then the message is sent to "#incidents"
    And not to "#drift-alerts"

  Scenario: HIGH routed to #drift-alerts channel
    Given the routing rule maps HIGH → "#drift-alerts"
    And a HIGH drift event occurs
    When the notification engine routes the alert
    Then the message is sent to "#drift-alerts"

  Scenario: No routing rule configured — fallback to default channel
    Given no routing rules are configured for severity "MEDIUM"
    And a MEDIUM drift event occurs
    When the notification engine routes the alert
    Then the message is sent to the tenant's default Slack channel

  Scenario: Multiple channels for same severity
    Given the routing rule maps CRITICAL → ["#incidents", "#sre-oncall"]
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then the message is sent to both "#incidents" and "#sre-oncall"

  Scenario: PagerDuty integration for CRITICAL severity
    Given the tenant has PagerDuty integration configured
    And the routing rule maps CRITICAL → PagerDuty
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then a PagerDuty incident is created via the Events API
    And the incident includes the drift event details and dashboard link
```

---

## Epic 5: Remediation

### Feature: One-Click Revert via Slack

```gherkin
Feature: One-Click Revert to IaC via Slack
  As a platform engineer
  I want to trigger remediation directly from a Slack alert
  So that I can revert drift without leaving my chat tool

  Background:
    Given a HIGH severity drift alert was sent to Slack
    And the alert includes a "Revert to IaC" button

  Scenario: Engineer clicks Revert to IaC for non-destructive change
    Given the drift is a security group rule addition (non-destructive revert)
    When the engineer clicks "Revert to IaC"
    Then the SaaS backend receives the Slack interaction payload
    And sends a remediation command to the agent via the control plane
    And the agent runs "terraform apply -target=aws_security_group.web"
    And the Slack message is updated to show "Remediation in progress..."
```
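The one-click revert flow (receive the Slack interaction payload, dispatch a remediation command via the control plane, update the message) can be sketched in Python. Everything here is an illustrative assumption; the real backend, control-plane API, and field names are not shown. The in-progress guard corresponds to the spec's double-click protection:

```python
from dataclasses import dataclass, field

@dataclass
class RemediationService:
    """Illustrative sketch of the 'Revert to IaC' interaction handler."""
    dispatched: list = field(default_factory=list)  # commands sent to agents
    in_progress: set = field(default_factory=set)   # drift events being remediated

    def handle_revert_click(self, payload: dict) -> str:
        drift_id = payload["drift_event_id"]
        if drift_id in self.in_progress:
            # Double-click protection: never dispatch the same remediation twice.
            return "remediation already in progress"
        self.in_progress.add(drift_id)
        # Dispatch a remediation command to the agent via the control plane
        # (modeled here as an in-memory list).
        self.dispatched.append({"drift_event_id": drift_id,
                                "action": "revert_to_iac"})
        # The Slack message update would happen out-of-band.
        return "Remediation in progress..."

svc = RemediationService()
print(svc.handle_revert_click({"drift_event_id": "drift-001"}))
print(svc.handle_revert_click({"drift_event_id": "drift-001"}))
```

The first click dispatches exactly one command and reports "Remediation in progress..."; the second click for the same drift event is rejected without a duplicate dispatch.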
Scenario: Remediation completes successfully Given a remediation command was dispatched to the agent When the agent completes the terraform apply Then the agent sends a remediation result event to SaaS And the Slack message is updated to "✅ Reverted successfully" And a new scan is triggered immediately to confirm no drift Scenario: Remediation fails — agent reports error Given a remediation command was dispatched And the terraform apply exits with a non-zero code When the agent reports the failure Then the Slack message is updated to "❌ Remediation failed" And the error output is included in the Slack message (truncated to 500 chars) And the drift event status is set to "remediation_failed" Scenario: Revert button clicked by unauthorized user Given the Slack user "intern@acme.com" is not in the "remediation_approvers" group When they click "Revert to IaC" Then the SaaS backend rejects the action with "Unauthorized" And a Slack ephemeral message is shown: "You don't have permission to trigger remediation" And the action is logged with the user's identity Scenario: Revert button clicked twice (double-click protection) Given a remediation is already in progress for drift event "drift-001" When the "Revert to IaC" button is clicked again Then the SaaS backend returns "remediation already in progress" And a Slack ephemeral message is shown: "Remediation already running" And no duplicate remediation is triggered ``` ### Feature: Approval Workflow for Destructive Changes ```gherkin Feature: Approval Workflow for Destructive Remediation As a security officer I want destructive remediations to require explicit approval So that no resources are accidentally deleted Background: Given a drift event involves a resource that would be deleted during revert And the tenant has approval workflow enabled Scenario: Destructive revert requires approval Given the drift revert would delete an RDS instance When an engineer clicks "Revert to IaC" Then instead of executing immediately, an 
approval request is sent And the Slack message shows "⚠️ Destructive change — approval required" And an approval request is sent to all users in the "remediation_approvers" group Scenario: Approver approves destructive remediation Given an approval request is pending for drift event "drift-002" When an approver clicks "Approve" in Slack Then the approval is recorded with the approver's identity and timestamp And the remediation command is dispatched to the agent And the Slack thread is updated: "Approved by @jane — executing..." Scenario: Approver rejects destructive remediation Given an approval request is pending When an approver clicks "Reject" Then the remediation is cancelled And the drift event status is set to "remediation_rejected" And the Slack message is updated: "❌ Rejected by @jane" And the rejection reason (if provided) is logged Scenario: Approval timeout — remediation auto-cancelled Given an approval request has been pending for 24 hours And no approver has responded When the approval timeout fires Then the remediation is automatically cancelled And the drift event status is set to "approval_timeout" And a Slack message is sent: "⏰ Approval timed out — remediation cancelled" And the event is included in the next daily digest Scenario: Approval timeout is configurable Given the tenant has set approval_timeout_hours to 4 When an approval request is pending for 4 hours without response Then the timeout fires after 4 hours (not 24) Scenario: Self-approval is blocked Given engineer "alice@acme.com" triggered the remediation request When "alice@acme.com" attempts to approve their own request Then the approval is rejected with "Self-approval not permitted" And an ephemeral Slack message informs Alice Scenario: Minimum approvers requirement Given the tenant requires 2 approvals for destructive changes And only 1 approver has approved When the second approver approves Then the quorum is met And the remediation is dispatched ``` ### Feature: Agent-Side 
Remediation Execution ```gherkin Feature: Agent-Side Remediation Execution As a platform engineer I want the agent to apply IaC changes to revert drift So that remediation happens inside the customer VPC with proper credentials Background: Given the agent has received a remediation command from the control plane And the agent has the necessary IAM permissions Scenario: Terraform revert executed successfully Given the remediation command specifies stack "prod" and resource "aws_security_group.web" When the agent executes the remediation Then it runs "terraform apply -target=aws_security_group.web -auto-approve" And captures stdout and stderr And reports the result back to SaaS via the control plane Scenario: kubectl revert executed successfully Given the remediation command specifies a Kubernetes Deployment "api-server" When the agent executes the remediation Then it runs "kubectl apply -f /etc/drift/stacks/prod/api-server.yaml" And reports the result back to SaaS Scenario: Remediation command times out Given the terraform apply is still running after 10 minutes When the remediation timeout fires Then the agent kills the terraform process And reports status "timeout" to SaaS And logs "remediation timed out after 10m" at ERROR level Scenario: Agent loses connectivity during remediation Given a remediation is in progress And the agent loses connectivity to SaaS mid-execution When connectivity is restored Then the agent reports the final remediation result And the SaaS backend reconciles the status Scenario: Remediation command is replayed after agent restart Given a remediation command was received but the agent restarted before executing When the agent restarts Then it checks for pending remediation commands And executes any pending commands And reports results to SaaS Scenario: Remediation is blocked when panic mode is active Given the tenant's panic mode is active When a remediation command is received by the agent Then the agent rejects the command And logs 
"remediation blocked: panic mode active" at WARN level And reports status "blocked_panic_mode" to SaaS ``` --- ## Epic 6: Dashboard UI ### Feature: OAuth Login ```gherkin Feature: OAuth Login As a user I want to log in via OAuth So that I don't need a separate password for the drift dashboard Background: Given the dashboard is running at "https://app.drift.dd0c.io" Scenario: Successful login via GitHub OAuth Given the user navigates to the dashboard login page When they click "Sign in with GitHub" Then they are redirected to GitHub's OAuth authorization page And after authorizing, redirected back with an authorization code And the backend exchanges the code for a JWT And the user is logged in and sees their tenant's dashboard Scenario: Successful login via Google OAuth Given the user clicks "Sign in with Google" When they complete the Google OAuth flow Then they are logged in with a valid JWT And the JWT contains their tenant_id and email claims Scenario: OAuth callback with invalid state parameter Given an OAuth callback arrives with a mismatched state parameter When the frontend processes the callback Then the login is rejected with "Invalid OAuth state" And the user is redirected to the login page with an error message And the event is logged as a potential CSRF attempt Scenario: JWT expires during session Given the user is logged in with a JWT that expires in 1 minute When the JWT expires Then the dashboard silently refreshes the token using the refresh token And the user's session continues uninterrupted Scenario: Refresh token is revoked Given the user's refresh token has been revoked (e.g., password change) When the dashboard attempts to refresh the JWT Then the refresh fails And the user is redirected to the login page And shown "Your session has expired, please log in again" Scenario: User belongs to no tenant Given a new OAuth user has no tenant association When they complete OAuth login Then they are redirected to the onboarding flow And prompted to create
or join a tenant ``` ### Feature: Stack Overview ```gherkin Feature: Stack Overview Page As a platform engineer I want to see all my stacks and their drift status at a glance So that I can quickly identify which stacks need attention Background: Given the user is logged in as a member of tenant "acme" And tenant "acme" has 5 stacks configured Scenario: Stack overview loads all stacks Given the user navigates to the dashboard home When the page loads Then 5 stack cards are displayed And each card shows stack name, IaC type, last scan time, and drift count Scenario: Stack with active drift shows warning indicator Given stack "prod-api" has 3 active drift events When the stack overview loads Then the "prod-api" card shows a yellow warning badge with "3 drifts" Scenario: Stack with CRITICAL drift shows critical indicator Given stack "prod-db" has 1 CRITICAL drift event When the stack overview loads Then the "prod-db" card shows a red critical badge Scenario: Stack with no drift shows healthy indicator Given stack "staging" has no active drift events When the stack overview loads Then the "staging" card shows a green "Healthy" badge Scenario: Stack overview auto-refreshes Given the user is viewing the stack overview When 60 seconds elapse Then the page automatically refreshes drift counts without a full page reload Scenario: Tenant with no stacks sees empty state Given a new tenant has no stacks configured When they navigate to the dashboard Then an empty state is shown: "No stacks yet — run drift init to get started" And a link to the onboarding guide is displayed Scenario: Stack overview only shows current tenant's stacks Given tenant "acme" has 5 stacks and tenant "globex" has 3 stacks When a user from "acme" views the dashboard Then only 5 stacks are shown And no stacks from "globex" are visible ``` ### Feature: Diff Viewer ```gherkin Feature: Drift Diff Viewer As a platform engineer I want to see a detailed diff of what changed So that I understand exactly what 
drifted and how Background: Given the user is viewing a specific drift event Scenario: Diff viewer shows field-level changes Given a drift event has 3 changed fields When the user opens the diff viewer Then each changed field is shown with expected and actual values And removed values are highlighted in red And added values are highlighted in green Scenario: Diff viewer shows JSON diff for complex values Given a drift event involves a changed IAM policy document (JSON) When the user opens the diff viewer Then the policy JSON is shown as a structured diff And individual JSON fields are highlighted rather than the whole blob Scenario: Diff viewer handles large diffs with pagination Given a drift event has 50 changed fields When the user opens the diff viewer Then the first 20 fields are shown And a "Show more" button loads the remaining 30 Scenario: Diff viewer shows resource metadata Given a drift event for resource "aws_security_group.web" When the user views the diff Then the viewer shows resource type, ARN, region, and stack name And the scan timestamp is displayed Scenario: Diff viewer copy-to-clipboard Given the user is viewing a diff When they click "Copy diff" Then the diff is copied to clipboard in unified diff format And a toast notification confirms "Copied to clipboard" ``` ### Feature: Drift Timeline ```gherkin Feature: Drift Timeline As a platform engineer I want to see a timeline of drift events over time So that I can identify patterns and recurring issues Background: Given the user is viewing the drift timeline for stack "prod-api" Scenario: Timeline shows events in reverse chronological order Given 10 drift events exist for "prod-api" over the last 7 days When the user views the timeline Then events are listed newest first And each event shows timestamp, resource, severity, and status Scenario: Timeline filtered by severity Given the timeline has HIGH and LOW events When the user filters by severity "HIGH" Then only HIGH events are shown And the 
filter state is reflected in the URL for shareability Scenario: Timeline filtered by date range Given the user selects a date range of "last 30 days" When the filter is applied Then only events within the last 30 days are shown Scenario: Timeline shows remediation events Given a drift event was remediated When the user views the timeline Then the event shows a "Remediated" badge And the remediation timestamp and actor are shown Scenario: Timeline is empty for new stack Given a stack was added 1 hour ago and has no drift history When the user views the timeline Then an empty state is shown: "No drift history yet" Scenario: Timeline pagination Given 200 drift events exist for a stack When the user views the timeline Then the first 50 events are shown And infinite scroll or pagination loads more on demand ``` --- ## Epic 7: Dashboard API ### Feature: JWT Authentication ```gherkin Feature: JWT Authentication on Dashboard API As a SaaS engineer I want all API endpoints protected by JWT So that only authenticated users can access tenant data Background: Given the Dashboard API is running at "https://api.drift.dd0c.io" Scenario: Valid JWT grants access Given a request includes a valid JWT in the Authorization header And the JWT is not expired And the JWT signature is valid When the request reaches the API Then the request is processed And the response is returned with HTTP 200 Scenario: Missing JWT returns 401 Given a request has no Authorization header When the request reaches the API Then the API returns HTTP 401 And the response body includes "Authentication required" Scenario: Expired JWT returns 401 Given a request includes a JWT that expired 5 minutes ago When the request reaches the API Then the API returns HTTP 401 And the response body includes "Token expired" Scenario: JWT with invalid signature returns 401 Given a request includes a JWT with a tampered signature When the request reaches the API Then the API returns HTTP 401 And the response body includes 
"Invalid token" Scenario: JWT with wrong audience claim returns 401 Given a request includes a JWT issued for a different service When the request reaches the API Then the API returns HTTP 401 And the response body includes "Invalid audience" Scenario: JWT tenant_id claim is used for RLS Given a JWT contains tenant_id "acme" When the request reaches the API Then the API sets PostgreSQL session variable "app.tenant_id" to "acme" And all queries are automatically scoped to tenant "acme" via RLS ``` ### Feature: Tenant Isolation via RLS ```gherkin Feature: Tenant Isolation via Row-Level Security As a security engineer I want the API to enforce tenant isolation at the database level So that a bug in application logic cannot leak cross-tenant data Background: Given the API uses PostgreSQL with RLS on all tenant-scoped tables Scenario: User from tenant A cannot access tenant B's stacks Given tenant "acme" has stacks ["prod", "staging"] And tenant "globex" has stacks ["prod"] When a user from "acme" calls GET /stacks Then only "acme"'s stacks are returned And "globex"'s stack is not included Scenario: Cross-tenant drift event access attempt Given drift event "drift-001" belongs to tenant "globex" When a user from "acme" calls GET /drifts/drift-001 Then the API returns HTTP 404 And no data from "globex" is exposed Scenario: Cross-tenant stack update attempt Given stack "prod" belongs to tenant "globex" When a user from "acme" calls PATCH /stacks/prod Then the API returns HTTP 404 And the stack is not modified Scenario: RLS is enforced even if application code has a bug Given a hypothetical bug causes the API to omit the tenant_id filter in a query When the query executes Then PostgreSQL RLS still filters rows to the current tenant And no cross-tenant data is returned Scenario: Tenant isolation for remediation actions Given remediation "rem-001" belongs to tenant "globex" When a user from "acme" calls POST /remediations/rem-001/approve Then the API returns HTTP 404 And the 
remediation is not affected ``` ### Feature: Stack CRUD ```gherkin Feature: Stack CRUD Operations As a platform engineer I want to manage my stacks via the API So that I can add, update, and remove stacks programmatically Background: Given the user is authenticated as a member of tenant "acme" Scenario: Create a new Terraform stack Given a POST /stacks request with body: """ { "name": "prod-api", "iac_type": "terraform", "region": "us-east-1" } """ When the request is processed Then the API returns HTTP 201 And the response includes the new stack's id and created_at And the stack is visible in GET /stacks Scenario: Create stack with duplicate name Given a stack named "prod-api" already exists for tenant "acme" When a POST /stacks request is made with name "prod-api" Then the API returns HTTP 409 And the response body includes "Stack name already exists" Scenario: Create stack exceeding free tier limit Given the tenant is on the free tier (max 3 stacks) And the tenant already has 3 stacks When a POST /stacks request is made Then the API returns HTTP 402 And the response body includes "Free tier limit reached. Upgrade to add more stacks." 
Scenario: Update stack configuration Given stack "prod-api" exists When a PATCH /stacks/prod-api request updates the scan_interval to 30 Then the API returns HTTP 200 And the stack's scan_interval is updated to 30 And the agent receives the updated config on next heartbeat Scenario: Delete a stack Given stack "staging" exists with no active remediations When a DELETE /stacks/staging request is made Then the API returns HTTP 204 And the stack is removed from GET /stacks And associated drift events are soft-deleted (retained for 90 days) Scenario: Delete a stack with active remediation Given stack "prod-api" has an active remediation in progress When a DELETE /stacks/prod-api request is made Then the API returns HTTP 409 And the response body includes "Cannot delete stack with active remediation" Scenario: Get stack by ID Given stack "prod-api" exists When a GET /stacks/prod-api request is made Then the API returns HTTP 200 And the response includes all stack fields including last_scan_at and drift_count ``` ### Feature: Drift Event CRUD ```gherkin Feature: Drift Event API As a platform engineer I want to query and manage drift events via the API So that I can build integrations and automations Background: Given the user is authenticated as a member of tenant "acme" Scenario: List drift events for a stack Given stack "prod-api" has 10 drift events When GET /stacks/prod-api/drifts is called Then the API returns HTTP 200 And the response includes all 10 events And events are sorted by detected_at descending Scenario: Filter drift events by severity Given drift events include HIGH and LOW severity events When GET /drifts?severity=HIGH is called Then only HIGH severity events are returned Scenario: Filter drift events by status When GET /drifts?status=active is called Then only unresolved drift events are returned Scenario: Mark drift event as acknowledged Given drift event "drift-001" has status "active" When POST /drifts/drift-001/acknowledge is called Then the API 
returns HTTP 200 And the event status is updated to "acknowledged" And the acknowledged_by and acknowledged_at fields are set Scenario: Mark drift event as resolved manually Given drift event "drift-001" has status "active" When POST /drifts/drift-001/resolve is called with body {"reason": "manual fix applied"} Then the API returns HTTP 200 And the event status is updated to "resolved" And the resolution reason is stored Scenario: Pagination on drift events list Given 200 drift events exist When GET /drifts?page=1&per_page=50 is called Then 50 events are returned And the response includes pagination metadata (total, page, per_page, total_pages) ``` ### Feature: API Rate Limiting ```gherkin Feature: API Rate Limiting As a SaaS operator I want API rate limits enforced per tenant So that one tenant cannot degrade service for others Background: Given the API enforces rate limits per tenant Scenario: Request within rate limit succeeds Given the rate limit is 1000 requests per minute And the tenant has made 500 requests this minute When a new request is made Then the API returns HTTP 200 And the response includes headers X-RateLimit-Remaining and X-RateLimit-Reset Scenario: Request exceeds rate limit Given the tenant has made 1000 requests this minute When a new request is made Then the API returns HTTP 429 And the response includes a Retry-After header And the response body includes "Rate limit exceeded" Scenario: Rate limit resets after window Given the tenant hit the rate limit at T+0 When 60 seconds elapse (new window) Then the tenant's request counter resets And new requests succeed ``` --- ## Epic 8: Infrastructure ### Feature: Terraform/CDK SaaS Infrastructure Provisioning ```gherkin Feature: SaaS Infrastructure Provisioning As a SaaS platform engineer I want the SaaS infrastructure defined as code So that environments are reproducible and auditable Background: Given the infrastructure code lives in the "infra/" directory And Terraform and CDK are both used for
different layers Scenario: Terraform plan produces no unexpected changes in production Given the production Terraform state is up to date When "terraform plan" runs against the production workspace Then the plan shows zero resource changes And the plan output is stored as a CI artifact Scenario: New environment provisioned from scratch Given a new environment "staging-eu" is needed When "terraform apply -var-file=staging-eu.tfvars" runs Then all required resources are created (VPC, RDS, SQS, ECS, etc.) And the environment is reachable within 15 minutes And outputs (queue URLs, DB endpoints) are stored in SSM Parameter Store Scenario: RDS instance is provisioned with encryption at rest Given the Terraform module for RDS is applied When the RDS instance is created Then storage_encrypted is true And the KMS key ARN is set to the tenant-specific key Scenario: SQS FIFO queues are provisioned with DLQ Given the SQS Terraform module is applied When the queues are created Then "drift-reports.fifo" exists with content_based_deduplication enabled And "drift-reports-dlq.fifo" exists as the redrive target And maxReceiveCount is set to 3 Scenario: CDK stack drift detected by drift agent (dogfooding) Given the SaaS CDK stacks are monitored by the drift agent itself When a CDK resource is manually modified in the AWS console Then the drift agent detects the change And an internal alert is sent to the SaaS ops channel Scenario: Infrastructure destroy is blocked in production Given a Terraform workspace is tagged as "production" When "terraform destroy" is attempted Then the CI pipeline rejects the command And logs "destroy blocked in production environment" ``` ### Feature: GitHub Actions CI/CD ```gherkin Feature: GitHub Actions CI/CD Pipeline As a platform engineer I want automated CI/CD via GitHub Actions So that code changes are tested and deployed safely Background: Given GitHub Actions workflows are defined in ".github/workflows/" Scenario: PR triggers CI checks Given a pull 
request is opened against the main branch When the CI workflow triggers Then unit tests run for Go agent code And unit tests run for TypeScript SaaS code And Terraform plan runs for infrastructure changes And all checks must pass before merge is allowed Scenario: Merge to main triggers staging deployment Given a PR is merged to the main branch When the deploy workflow triggers Then the Go agent binary is built and pushed to ECR And the TypeScript services are built and deployed to ECS staging And smoke tests run against staging And the deployment is marked successful if smoke tests pass Scenario: Staging smoke tests fail — production deploy blocked Given staging deployment completed And smoke tests fail (e.g., health check returns 500) When the pipeline evaluates the smoke test results Then the production deployment step is skipped And a Slack alert is sent to "#deployments" channel And the pipeline exits with failure Scenario: Production deployment requires manual approval Given staging smoke tests passed When the pipeline reaches the production deployment step Then it pauses and waits for a manual approval in GitHub Actions And the approval request is sent to the "production-deployers" team And deployment proceeds only after approval Scenario: Rollback triggered on production health check failure Given a production deployment completed And the post-deploy health check fails within 5 minutes When the rollback workflow triggers Then the previous ECS task definition revision is redeployed And a Slack alert is sent: "Production rollback triggered" And the failed deployment is logged with the commit SHA Scenario: Terraform plan diff is posted to PR as comment Given a PR modifies infrastructure code When the CI Terraform plan runs Then the plan output is posted as a comment on the PR And the comment includes a summary of resources to add/change/destroy Scenario: Secrets are never logged in CI Given the CI pipeline uses AWS credentials and Slack tokens When any CI step 
runs Then no secret values appear in the GitHub Actions log output And GitHub secret masking is verified in the workflow config ``` ### Feature: Database Migrations in CI/CD ```gherkin Feature: Database Migrations As a SaaS engineer I want database migrations to run automatically in CI/CD So that schema changes are applied safely and consistently Background: Given migrations are managed with a migration tool (e.g., golang-migrate or Flyway) Scenario: Migration runs successfully on deploy Given a new migration file "V20_add_remediation_status.sql" exists When the deploy pipeline runs Then the migration is applied to the target database And the migration is recorded in the schema_migrations table And the deploy continues Scenario: Migration is idempotent — already-applied migration is skipped Given migration "V20_add_remediation_status.sql" was already applied When the deploy pipeline runs again Then the migration is skipped And no error is thrown Scenario: Migration fails — deploy is halted Given a migration contains a syntax error When the migration runs Then the migration fails and is rolled back And the deploy pipeline halts And an alert is sent to the engineering team Scenario: Additive-only migrations enforced Given a migration attempts to drop a column When the CI linter checks the migration Then the lint check fails with "destructive migration not allowed" And the PR is blocked from merging ``` --- ## Epic 9: Onboarding & PLG ### Feature: drift init CLI ```gherkin Feature: drift init CLI Onboarding As a new user I want a guided CLI setup experience So that I can connect my infrastructure to drift in minutes Background: Given the drift CLI is installed via "curl -sSL https://get.drift.dd0c.io | sh" Scenario: First-time drift init runs guided setup Given the user runs "drift init" for the first time When the CLI starts Then it prompts for: cloud provider, IaC type, region, and stack name And validates each input before proceeding And generates a config file at 
"/etc/drift/config.yaml" Scenario: drift init detects existing Terraform state Given a Terraform state file exists in the current directory When the user runs "drift init" Then the CLI auto-detects the IaC type as "terraform" And pre-fills the stack name from the Terraform workspace name And asks the user to confirm Scenario: drift init creates IAM role with least-privilege policy Given the user confirms IAM role creation When "drift init" runs Then it creates an IAM role "drift-agent-role" with only required permissions And outputs the role ARN for the config Scenario: drift init generates and installs agent certificates Given the user has authenticated with the SaaS backend When "drift init" completes Then it fetches a signed mTLS certificate from the SaaS CA And stores the certificate at "/etc/drift/certs/agent.crt" And stores the private key at "/etc/drift/certs/agent.key" with mode 0600 Scenario: drift init installs agent as systemd service Given the user is on a Linux system with systemd When "drift init" completes Then it installs a systemd unit file for the drift agent And enables and starts the service And confirms "drift-agent is running" in the output Scenario: drift init fails gracefully on missing AWS credentials Given no AWS credentials are configured When "drift init" runs Then it detects missing credentials And outputs a helpful error: "AWS credentials not found. Run 'aws configure' first." 
And exits with code 1 Scenario: drift init --dry-run shows what would be created Given the user runs "drift init --dry-run" When the CLI runs Then it outputs all actions it would take without executing them And no resources are created And no config files are written ``` ### Feature: Free Tier — 3 Stacks ```gherkin Feature: Free Tier Stack Limit As a product manager I want the free tier limited to 3 stacks So that we have a clear upgrade path Background: Given a tenant is on the free tier Scenario: Free tier tenant can add up to 3 stacks Given the tenant has 0 stacks When they add stacks "prod", "staging", and "dev" Then all 3 stacks are created successfully And the tenant is not prompted to upgrade Scenario: Free tier tenant blocked from adding 4th stack Given the tenant has 3 stacks When they attempt to add a 4th stack via the CLI Then the CLI outputs "Free tier limit reached (3/3 stacks). Upgrade at https://app.drift.dd0c.io/billing" And exits with code 1 Scenario: Free tier tenant blocked from adding 4th stack via API Given the tenant has 3 stacks When POST /stacks is called Then the API returns HTTP 402 And the response includes upgrade_url Scenario: Free tier tenant blocked from adding 4th stack via dashboard Given the tenant has 3 stacks When they click "Add Stack" in the dashboard Then a modal appears: "You've reached the free tier limit" And an "Upgrade Plan" button is shown Scenario: Upgrading to paid tier unlocks unlimited stacks Given the tenant upgrades to the paid plan via Stripe When the Stripe webhook confirms payment Then the tenant's stack limit is set to unlimited And they can immediately add a 4th stack ``` ### Feature: Stripe Billing ```gherkin Feature: Stripe Billing Integration As a product manager I want usage-based billing via Stripe So that customers are charged $29/stack/month Background: Given Stripe is configured with the drift product and price Scenario: New tenant subscribes to paid plan Given a free tier tenant clicks "Upgrade" When 
they complete the Stripe Checkout flow Then a Stripe subscription is created for the tenant And the subscription includes a metered item for stack count And the tenant's plan is updated to "paid" in the database Scenario: Monthly invoice calculated at $29/stack Given a tenant has 5 stacks active for the full billing month When Stripe generates the monthly invoice Then the invoice total is $145.00 (5 × $29) And the invoice is sent to the tenant's billing email Scenario: Stack added mid-month — prorated charge Given a tenant adds a 6th stack on the 15th of the month When Stripe generates the monthly invoice Then the 6th stack is charged a prorated amount (~$14.50 for half a month) Scenario: Stack deleted mid-month — prorated credit Given a tenant deletes a stack on the 10th of the month When Stripe generates the monthly invoice Then a prorated credit is applied for the unused days Scenario: Payment fails — tenant notified and grace period applied Given a tenant's payment method is declined When Stripe sends the payment_failed webhook Then the tenant receives an email: "Payment failed — please update your billing info" And a 7-day grace period is applied before service is restricted Scenario: Grace period expires — stacks suspended Given a tenant's payment has been failing for 7 days When the grace period expires Then the tenant's stacks are suspended (scans paused) And the dashboard shows a banner: "Account suspended — payment required" And the agent stops sending reports Scenario: Payment updated — service restored immediately Given a tenant's stacks are suspended due to non-payment When the tenant updates their payment method and payment succeeds Then the Stripe webhook triggers service restoration And stacks are unsuspended within 60 seconds And scans resume on the next scheduled cycle Scenario: Stripe webhook signature validation Given a webhook arrives at POST /webhooks/stripe When the webhook signature is invalid Then the API returns HTTP 400 And the event is ignored And
the attempt is logged as a potential spoofing attempt Scenario: Free tier tenant is never charged Given a tenant is on the free tier with 3 stacks When the billing cycle runs Then no Stripe invoice is generated for this tenant And no charge is made ``` ### Feature: Guided Setup Flow ```gherkin Feature: Guided Setup Flow in Dashboard As a new user I want a step-by-step setup guide in the dashboard So that I can get value from drift quickly Background: Given a new tenant has just signed up and logged in Scenario: Onboarding checklist is shown to new tenants Given the tenant has completed 0 onboarding steps When they log in for the first time Then an onboarding checklist is shown with steps: | Step | Status | | Install drift agent | Pending | | Add your first stack | Pending | | Configure Slack alerts | Pending | | Run your first scan | Pending | Scenario: Checklist step marked complete automatically Given the tenant installs the agent and it sends its first heartbeat When the dashboard refreshes Then the "Install drift agent" step is marked complete And a congratulatory message is shown Scenario: Onboarding checklist dismissed after all steps complete Given all 4 onboarding steps are complete When the tenant views the dashboard Then the checklist is replaced with the normal stack overview And a one-time "You're all set!" 
banner is shown Scenario: Onboarding checklist can be dismissed early Given the tenant has completed 2 of 4 steps When they click "Dismiss checklist" Then the checklist is hidden And a "Resume setup" link is available in the settings page ``` --- ## Epic 10: Transparent Factory ### Feature: Feature Flags ```gherkin Feature: Feature Flags As a product engineer I want feature flags to control rollout of new capabilities So that I can ship safely and roll back instantly Background: Given the feature flag service is running And flags are evaluated per-tenant Scenario: Feature flag enabled for specific tenant Given flag "remediation_v2" is enabled for tenant "acme" And flag "remediation_v2" is disabled for tenant "globex" When tenant "acme" triggers a remediation Then the v2 remediation code path is used When tenant "globex" triggers a remediation Then the v1 remediation code path is used Scenario: Feature flag enabled for percentage rollout Given flag "new_diff_viewer" is enabled for 10% of tenants When 1000 tenants load the dashboard Then approximately 100 tenants see the new diff viewer And the remaining 900 see the existing diff viewer Scenario: Feature flag disabled globally kills a feature Given flag "experimental_pulumi_scan" is globally disabled When any tenant attempts to add a Pulumi stack Then the API returns HTTP 501 "Feature not available" And the dashboard hides the Pulumi option in the stack type selector Scenario: Feature flag change takes effect without deployment Given flag "slack_digest_v2" is currently disabled When an operator enables the flag in the flag management console Then within 30 seconds, the notification engine uses the v2 digest format And no service restart is required Scenario: Feature flag evaluation is logged for audit Given flag "remediation_v2" is evaluated for tenant "acme" When the flag is checked Then the evaluation (flag name, tenant, result, timestamp) is written to the audit log And the audit log is queryable for compliance 
review Scenario: Unknown feature flag defaults to disabled Given code checks for flag "nonexistent_flag" When the flag service evaluates it Then the result is "disabled" (safe default) And a warning is logged: "unknown flag: nonexistent_flag" ``` ### Feature: Additive Schema Migrations ```gherkin Feature: Additive-Only Schema Migrations As a SaaS engineer I want all schema changes to be additive So that deployments are zero-downtime and rollback-safe Background: Given the migration linter runs in CI on every PR Scenario: Adding a new nullable column is allowed Given a migration adds column "remediation_status VARCHAR(50) NULL" When the migration linter checks the file Then the lint check passes And the migration is approved for merge Scenario: Adding a new table is allowed Given a migration creates a new table "decision_logs" When the migration linter checks the file Then the lint check passes Scenario: Dropping a column is blocked Given a migration contains "ALTER TABLE drift_events DROP COLUMN old_field" When the migration linter checks the file Then the lint check fails with "destructive operation: DROP COLUMN not allowed" And the PR is blocked Scenario: Dropping a table is blocked Given a migration contains "DROP TABLE legacy_alerts" When the migration linter checks the file Then the lint check fails with "destructive operation: DROP TABLE not allowed" Scenario: Renaming a column is blocked Given a migration contains "ALTER TABLE stacks RENAME COLUMN name TO stack_name" When the migration linter checks the file Then the lint check fails with "destructive operation: RENAME COLUMN not allowed" And the suggested alternative is to add a new column and deprecate the old one Scenario: Adding a NOT NULL column without default is blocked Given a migration adds "ALTER TABLE stacks ADD COLUMN owner_id UUID NOT NULL" When the migration linter checks the file Then the lint check fails with "NOT NULL column without DEFAULT will break existing rows" Scenario: Old column 
marked deprecated — not yet removed Given column "legacy_iac_path" is marked with a deprecation comment in the schema When the application code is deployed Then the column still exists in the database And the application ignores it And a deprecation notice is logged at startup Scenario: Column removal only after 2 release cycles Given column "legacy_iac_path" has been deprecated for 2 releases And all application code no longer references it When an engineer submits a migration to drop the column Then the migration linter checks the deprecation age And allows the drop if the deprecation period has elapsed ``` ### Feature: Decision Logs ```gherkin Feature: Decision Logs As an engineering lead I want architectural and operational decisions logged So that the team has a transparent record of why things are the way they are Background: Given the decision log is stored in "docs/decisions/" as markdown files Scenario: New ADR created for significant architectural change Given an engineer proposes switching from SQS to Kafka When they create "docs/decisions/ADR-042-kafka-vs-sqs.md" Then the ADR includes: context, decision, consequences, and status And the PR requires at least 2 reviewers from the architecture group Scenario: ADR status transitions are tracked Given ADR-042 has status "proposed" When the team accepts the decision Then the status is updated to "accepted" And the accepted_at date is recorded And the ADR is immutable after acceptance (changes require a new ADR) Scenario: Superseded ADR is linked to its replacement Given ADR-010 is superseded by ADR-042 When ADR-042 is accepted Then ADR-010's status is updated to "superseded" And ADR-010 includes a link to ADR-042 Scenario: Decision log is searchable Given 50 ADRs exist in the decision log When an engineer searches for "database" Then all ADRs mentioning "database" in title or body are returned Scenario: Operational decisions logged for drift remediation Given an operator manually overrides a remediation 
decision When the override is applied Then a decision log entry is created with: operator identity, reason, timestamp, and affected resource And the entry is linked to the drift event ``` ### Feature: OTEL Tracing ```gherkin Feature: OpenTelemetry Distributed Tracing As a SaaS engineer I want end-to-end distributed tracing via OTEL So that I can diagnose latency and errors across services Background: Given OTEL is configured with a Jaeger/Tempo backend And all services export traces Scenario: Drift report ingestion is fully traced Given an agent publishes a drift report to SQS When the event processor consumes and processes the message Then a trace exists spanning: SQS receive → normalization → severity scoring → DB write And each span includes tenant_id and scan_id as attributes And the total trace duration is under 2 seconds for normal reports Scenario: Slack notification is traced end-to-end Given a drift event triggers a Slack notification When the notification is sent Then a trace exists spanning: event stored → notification engine → Slack API call And the Slack API response code is recorded as a span attribute Scenario: Remediation flow is fully traced Given a remediation is triggered from Slack When the remediation completes Then a trace exists spanning: Slack interaction → API → control plane → agent → result And the trace includes the approver identity and approval timestamp Scenario: Slow span triggers latency alert Given the DB write span exceeds 500ms When the trace is analyzed Then a latency alert fires in the observability platform And the alert includes the trace_id for direct investigation Scenario: Trace context propagated across service boundaries Given the agent sends a drift report with a trace context header When the event processor receives the message Then it extracts the trace context from the SQS message attributes And continues the trace as a child span And the full trace is visible as a single tree in Jaeger Scenario: Traces do not 
contain PII or secrets Given a drift report is processed end-to-end When the trace is exported to the tracing backend Then no span attributes contain secret values And no span attributes contain tenant PII beyond tenant_id And the scrubber audit confirms 0 secrets in trace data Scenario: OTEL collector is unavailable — service continues Given the OTEL collector is down When the event processor handles a drift report Then the report is processed normally And trace export failures are logged at DEBUG level And no errors are surfaced to the end user ``` ### Feature: Governance / Panic Mode ```gherkin Feature: Panic Mode As a SaaS operator I want a panic mode that halts all automated actions So that I can freeze the system during a security incident or outage Background: Given the panic mode toggle is available in the ops console Scenario: Operator activates panic mode globally Given panic mode is currently inactive When an operator activates panic mode with reason "security incident" Then all automated remediations are immediately halted And all pending remediation commands are cancelled And a Slack alert is sent to "#ops-critical": "⚠️ PANIC MODE ACTIVATED by @operator" And the reason and operator identity are logged Scenario: Panic mode blocks new remediations Given panic mode is active When a user clicks "Revert to IaC" in Slack Then the SaaS backend rejects the action And the user sees: "System is in panic mode — automated actions are disabled" Scenario: Panic mode blocks agent remediation commands Given panic mode is active And an agent receives a remediation command (e.g., from a race condition) When the agent checks panic mode status Then the agent rejects the command And logs "remediation blocked: panic mode active" at WARN level Scenario: Panic mode does NOT block drift scanning Given panic mode is active When the next scan cycle runs Then the agent continues scanning normally And drift events continue to be reported and stored And notifications continue to 
be sent (read-only operations are unaffected) Scenario: Panic mode deactivated by authorized operator Given panic mode is active When an authorized operator deactivates panic mode Then automated remediations are re-enabled And a Slack alert is sent: "✅ PANIC MODE DEACTIVATED by @operator" And the deactivation is logged with timestamp and operator identity Scenario: Panic mode activation requires elevated role Given a regular user attempts to activate panic mode When they call POST /ops/panic-mode Then the API returns HTTP 403 And the attempt is logged as a security event Scenario: Panic mode state is persisted across restarts Given panic mode is active When the SaaS backend restarts Then panic mode remains active after restart And the system does not auto-deactivate panic mode on restart Scenario: Tenant-level panic mode Given tenant "acme" is experiencing an incident When an operator activates panic mode for tenant "acme" only Then only "acme"'s remediations are halted And other tenants are unaffected And "acme"'s dashboard shows a panic mode banner ``` ### Feature: Observability — Metrics and Alerts ```gherkin Feature: Operational Metrics and Alerting As a SaaS operator I want key metrics exported and alerting configured So that I can detect and respond to production issues proactively Background: Given metrics are exported to CloudWatch and/or Prometheus Scenario: Drift report processing latency metric Given drift reports are being processed When the event processor handles each report Then a histogram metric "drift_report_processing_duration_ms" is recorded And a P99 alert fires if latency exceeds 5000ms Scenario: DLQ depth metric triggers alert Given the DLQ depth exceeds 0 When the CloudWatch alarm evaluates Then a PagerDuty alert fires within 5 minutes And the alert includes the queue name and message count Scenario: Agent offline metric Given an agent has not sent a heartbeat for 5 minutes When the heartbeat monitor checks Then a metric 
"agents_offline_count" is incremented And if any agent is offline for more than 15 minutes, an alert fires Scenario: Secret scrubber miss rate metric Given the scrubber processes drift reports When a scrubber audit runs Then a metric "scrubber_miss_rate" is recorded And if the miss rate is ever > 0%, a CRITICAL alert fires immediately Scenario: Stripe webhook processing metric Given Stripe webhooks are received When each webhook is processed Then a counter "stripe_webhooks_processed_total" is incremented by event type And a counter "stripe_webhooks_failed_total" is incremented on failures And an alert fires if the failure rate exceeds 1% over 5 minutes Scenario: Database connection pool metric Given the application maintains a PostgreSQL connection pool When pool utilization exceeds 80% Then a warning alert fires And when utilization exceeds 95%, a critical alert fires ``` ### Feature: Cross-Tenant Isolation Audit ```gherkin Feature: Cross-Tenant Isolation Audit As a security engineer I want automated tests that verify cross-tenant isolation So that data leakage between tenants is caught before production Background: Given the test suite includes cross-tenant isolation tests And two test tenants "tenant-a" and "tenant-b" exist with separate data Scenario: API cross-tenant read isolation Given tenant-a has drift event "drift-a-001" When tenant-b's JWT is used to call GET /drifts/drift-a-001 Then the API returns HTTP 404 And no data from tenant-a is present in the response body Scenario: API cross-tenant write isolation Given tenant-a has stack "prod" When tenant-b's JWT is used to call DELETE /stacks/prod Then the API returns HTTP 404 And tenant-a's stack is not deleted Scenario: Database RLS cross-tenant query isolation Given a direct database query runs with app.tenant_id set to "tenant-b" When the query selects all rows from drift_events Then zero rows from tenant-a are returned And the query does not error Scenario: SQS message from tenant-a cannot be processed 
as tenant-b Given a drift report message from tenant-a arrives on the queue When the event processor reads the tenant_id from the message Then the event is stored under tenant-a's tenant_id And not under tenant-b's tenant_id Scenario: Remediation command cannot target another tenant's agent Given tenant-b's agent has agent_id "agent-b-001" When tenant-a attempts to send a remediation command to "agent-b-001" Then the control plane rejects the command with HTTP 403 And the attempt is logged as a security event Scenario: Cross-tenant isolation tests run in CI on every PR Given the isolation test suite is part of the CI pipeline When a PR is opened Then all cross-tenant isolation tests run automatically And the PR cannot be merged if any isolation test fails ``` --- *End of BDD Acceptance Test Specifications for dd0c/drift* *Total epics covered: 10 | Features: 40+ | Scenarios: 200+*
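The proration scenarios in the billing feature ($29/stack, ~$14.50 for half a month) follow simple day-based arithmetic. A minimal sketch of that arithmetic, assuming day-granularity proration; the helper name is hypothetical, and in production Stripe computes proration itself:

```python
def prorated_amount(monthly_price: float, days_used: int, days_in_month: int) -> float:
    """Day-based proration: bill only the fraction of the month the stack was active."""
    return round(monthly_price * days_used / days_in_month, 2)

# A 6th stack added on the 15th of a 30-day month is billed for half the month:
# 29.00 * 15 / 30 = 14.50, matching the "~$14.50" scenario.
mid_month_charge = prorated_amount(29.00, 15, 30)
```

The same formula yields the prorated credit for a deleted stack by applying it to the unused days.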
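The "Stripe webhook signature validation" scenario rejects requests whose signature does not verify. The underlying pattern is a constant-time HMAC comparison; this is a generic sketch, not Stripe's exact signing scheme, and real handlers should use Stripe's official library verification:

```python
import hashlib
import hmac

def signature_valid(payload: bytes, secret: str, received_sig: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw payload and compare in constant time.

    compare_digest avoids timing side channels; a mismatch means the request
    should get HTTP 400 and be logged as a potential spoofing attempt.
    """
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)
```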
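The percentage-rollout scenario ("enabled for 10% of tenants" → "approximately 100 of 1000") implies deterministic per-tenant bucketing rather than random sampling per request. A minimal sketch under that assumption; the function name is hypothetical:

```python
import hashlib

def flag_enabled(flag_name: str, tenant_id: str, rollout_percent: int) -> bool:
    """Stable-hash the (flag, tenant) pair into a bucket in [0, 100).

    The same tenant always lands in the same bucket for a given flag, so a
    tenant's experience does not flip between dashboard loads.
    """
    digest = hashlib.sha256(f"{flag_name}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Over 1000 tenants at 10%, roughly 100 are enabled (hash buckets are uniform).
enabled_count = sum(
    flag_enabled("new_diff_viewer", f"tenant-{i}", 10) for i in range(1000)
)
```

Hashing the flag name into the key keeps bucket assignments independent across flags, so the same 10% of tenants are not always the guinea pigs.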
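The migration-linter scenarios reduce to pattern checks over the migration SQL: destructive statements are rejected outright, and a NOT NULL column without a DEFAULT is flagged. A minimal sketch of those checks, assuming SQL arrives as plain text; rule names and messages mirror the scenarios above, but a real linter would parse the SQL rather than regex it:

```python
import re

# Destructive operations that are always blocked, per the scenarios above.
DESTRUCTIVE_RULES = [
    (re.compile(r"\bDROP\s+COLUMN\b", re.I), "DROP COLUMN not allowed"),
    (re.compile(r"\bDROP\s+TABLE\b", re.I), "DROP TABLE not allowed"),
    (re.compile(r"\bRENAME\s+COLUMN\b", re.I), "RENAME COLUMN not allowed"),
]

NOT_NULL_NO_DEFAULT = re.compile(
    r"\bADD\s+COLUMN\b(?!.*\bDEFAULT\b).*\bNOT\s+NULL\b", re.I | re.S
)

def lint_migration(sql: str) -> list:
    """Return lint errors for a migration; an empty list means the check passes."""
    errors = [
        f"destructive operation: {msg}"
        for pattern, msg in DESTRUCTIVE_RULES
        if pattern.search(sql)
    ]
    if NOT_NULL_NO_DEFAULT.search(sql):
        errors.append("NOT NULL column without DEFAULT will break existing rows")
    return errors
```

Additive changes (new tables, new nullable columns) match no rule and pass cleanly, which is exactly the allow-list behavior the feature describes.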