# dd0c/drift — BDD Acceptance Test Specifications

> Gherkin scenarios for all 10 epics. Each Feature maps to a user story within the epic.

---

<!-- TABLE OF CONTENTS -->
- [Epic 1: Drift Detection Agent](#epic-1-drift-detection-agent)
- [Epic 2: Agent Communication](#epic-2-agent-communication)
- [Epic 3: Event Processor](#epic-3-event-processor)
- [Epic 4: Notification Engine](#epic-4-notification-engine)
- [Epic 5: Remediation](#epic-5-remediation)
- [Epic 6: Dashboard UI](#epic-6-dashboard-ui)
- [Epic 7: Dashboard API](#epic-7-dashboard-api)
- [Epic 8: Infrastructure](#epic-8-infrastructure)
- [Epic 9: Onboarding & PLG](#epic-9-onboarding--plg)
- [Epic 10: Transparent Factory](#epic-10-transparent-factory)

---

## Epic 1: Drift Detection Agent

### Feature: Agent Initialization

```gherkin
Feature: Drift Detection Agent Initialization
  As a platform engineer
  I want the drift agent to initialize correctly in my VPC
  So that it can begin scanning infrastructure state

  Background:
    Given the drift agent binary is installed at "/usr/local/bin/drift-agent"
    And a valid agent config exists at "/etc/drift/config.yaml"

  Scenario: Successful agent startup
    Given the config specifies AWS region "us-east-1"
    And valid mTLS certificates are present
    And the SaaS endpoint is reachable
    When the agent starts
    Then the agent logs "drift-agent started" at INFO level
    And the agent registers itself with the SaaS control plane
    And the first scan is scheduled within 15 minutes

  Scenario: Agent startup with missing config
    Given no config file exists at "/etc/drift/config.yaml"
    When the agent starts
    Then the agent exits with code 1
    And logs "config file not found" at ERROR level

  Scenario: Agent startup with invalid AWS credentials
    Given the config references an IAM role that does not exist
    When the agent starts
    Then the agent exits with code 1
    And logs "failed to assume IAM role" at ERROR level

  Scenario: Agent startup with unreachable SaaS endpoint
    Given the SaaS endpoint is not reachable from the VPC
    When the agent starts
    Then the agent retries connection 3 times with exponential backoff
    And after all retries fail, exits with code 1
    And logs "failed to reach control plane after 3 attempts" at ERROR level
```
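
The unreachable-endpoint scenario above specifies three connection attempts with exponential backoff. A minimal Python sketch of that retry loop; the `connect` callable and the delay values are illustrative assumptions, not part of the spec:

```python
import time

def connect_with_backoff(connect, attempts: int = 3, base_delay: float = 1.0,
                         sleep=time.sleep) -> bool:
    """Call `connect` up to `attempts` times, doubling the pause between tries.

    Returns True on the first success; False once every attempt has failed,
    at which point the agent would exit with code 1 per the scenario.
    """
    for attempt in range(attempts):
        if connect():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False
```

Injecting `sleep` keeps the backoff schedule observable in tests without real waiting.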

### Feature: Terraform State Scanning

```gherkin
Feature: Terraform State Scanning
  As a platform engineer
  I want the agent to read Terraform state
  So that it can compare planned state against live AWS resources

  Background:
    Given the agent is running and initialized
    And the stack type is "terraform"

  Scenario: Scan Terraform state from S3 backend
    Given a Terraform state file exists in S3 bucket "my-tfstate" at key "prod/terraform.tfstate"
    And the agent has s3:GetObject permission on that bucket
    When a scan cycle runs
    Then the agent reads the state file successfully
    And parses all resource definitions from the state

  Scenario: Scan Terraform state from local backend
    Given a Terraform state file exists at "/var/drift/stacks/prod/terraform.tfstate"
    When a scan cycle runs
    Then the agent reads the local state file
    And parses all resource definitions

  Scenario: Terraform state file is locked
    Given the Terraform state file is currently locked by another process
    When a scan cycle runs
    Then the agent logs "state file locked, skipping scan" at WARN level
    And schedules a retry for the next cycle
    And does not report a drift event

  Scenario: Terraform state file is malformed
    Given the Terraform state file contains invalid JSON
    When a scan cycle runs
    Then the agent logs "failed to parse state file" at ERROR level
    And emits a health event with status "parse_error"
    And does not report false drift

  Scenario: Terraform state references deleted resources
    Given the Terraform state contains a resource "aws_instance.web"
    And that EC2 instance no longer exists in AWS
    When a scan cycle runs
    Then the agent detects drift of type "resource_deleted"
    And includes the resource ARN in the drift report
```
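
The resource_deleted scenario can be grounded with a small sketch of how an agent might walk a Terraform state document and flag instance IDs that no longer exist in the live account. The state shape used here is the minimal subset of the real tfstate JSON format needed for illustration; the drift-record shape is an assumption:

```python
import json

def find_deleted_resources(state_json: str, live_ids: set) -> list:
    """Parse a (minimal) Terraform state document and report managed resources
    whose instance IDs are absent from the set of live resource IDs."""
    state = json.loads(state_json)
    drifts = []
    for res in state.get("resources", []):
        address = f'{res["type"]}.{res["name"]}'  # e.g. aws_instance.web
        for inst in res.get("instances", []):
            rid = inst.get("attributes", {}).get("id")
            if rid and rid not in live_ids:
                drifts.append(
                    {"type": "resource_deleted", "resource": address, "id": rid})
    return drifts
```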

### Feature: CloudFormation Stack Scanning

```gherkin
Feature: CloudFormation Stack Scanning
  As a platform engineer
  I want the agent to scan CloudFormation stacks
  So that I can detect drift from declared template state

  Background:
    Given the agent is running
    And the stack type is "cloudformation"

  Scenario: Scan a CloudFormation stack successfully
    Given a CloudFormation stack named "prod-api" exists in "us-east-1"
    And the agent has cloudformation:DescribeStackResources permission
    When a scan cycle runs
    Then the agent retrieves all stack resources
    And compares each resource's actual configuration against the template

  Scenario: CloudFormation stack does not exist
    Given the config references a CloudFormation stack "ghost-stack"
    And that stack does not exist in AWS
    When a scan cycle runs
    Then the agent logs "stack not found: ghost-stack" at WARN level
    And emits a drift event of type "stack_missing"

  Scenario: CloudFormation native drift detection result available
    Given CloudFormation has already run drift detection on "prod-api"
    And the result shows 2 drifted resources
    When a scan cycle runs
    Then the agent reads the CloudFormation drift detection result
    And includes both drifted resources in the drift report

  Scenario: CloudFormation drift detection result is stale
    Given the last CloudFormation drift detection ran 48 hours ago
    When a scan cycle runs
    Then the agent triggers a new CloudFormation drift detection
    And waits up to 5 minutes for the result
    And uses the fresh result in the drift report
```
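
The staleness scenario implies two pieces of logic: deciding when a cached drift-detection result is too old, and polling until CloudFormation reports a terminal status. A sketch, assuming a 24-hour staleness cutoff (the spec only says that 48 hours counts as stale); `DETECTION_COMPLETE` and `DETECTION_FAILED` are the terminal values of CloudFormation's `DetectionStatus`:

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(hours=24)  # assumed cutoff, not in the spec

def needs_fresh_detection(last_run: datetime, now: datetime) -> bool:
    """True when the cached drift-detection result is older than the cutoff."""
    return (now - last_run) > STALENESS_THRESHOLD

def wait_for_detection(poll_status, timeout_s: int = 300, interval_s: int = 10,
                       sleep=lambda s: None) -> str:
    """Poll until a terminal DetectionStatus or until the 5-minute budget runs out."""
    waited = 0
    while True:
        status = poll_status()
        if status in ("DETECTION_COMPLETE", "DETECTION_FAILED"):
            return status
        if waited >= timeout_s:
            return "TIMEOUT"
        sleep(interval_s)
        waited += interval_s
```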

### Feature: Kubernetes Resource Scanning

```gherkin
Feature: Kubernetes Resource Scanning
  As a platform engineer
  I want the agent to scan Kubernetes resources
  So that I can detect drift from Helm/Kustomize definitions

  Background:
    Given the agent is running
    And a kubeconfig is available at "/etc/drift/kubeconfig"
    And the stack type is "kubernetes"

  Scenario: Scan Kubernetes Deployment successfully
    Given a Deployment "api-server" exists in namespace "production"
    And the IaC definition specifies 3 replicas
    And the live Deployment has 2 replicas
    When a scan cycle runs
    Then the agent detects drift on field "spec.replicas"
    And the drift report includes expected value "3" and actual value "2"

  Scenario: Kubernetes resource has been manually patched
    Given a ConfigMap "app-config" has been manually edited
    And the live data differs from the Helm chart values
    When a scan cycle runs
    Then the agent detects drift of type "config_modified"
    And includes a field-level diff in the report

  Scenario: Kubernetes API server is unreachable
    Given the Kubernetes API server is not responding
    When a scan cycle runs
    Then the agent logs "k8s API unreachable" at ERROR level
    And emits a health event with status "k8s_unreachable"
    And does not report false drift

  Scenario: Scan across multiple namespaces
    Given the config specifies namespaces ["production", "staging"]
    When a scan cycle runs
    Then the agent scans resources in both namespaces independently
    And drift reports include the namespace as context
```
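
The replica scenario expects a field-level diff keyed by dotted paths like `spec.replicas`. A recursive dict-comparison sketch; the diff-record shape is an assumption for illustration:

```python
def diff_fields(expected: dict, actual: dict, prefix: str = "spec") -> list:
    """Recursively compare desired vs live objects, emitting dotted-path diffs."""
    diffs = []
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}.{key}"
        exp, act = expected.get(key), actual.get(key)
        if isinstance(exp, dict) and isinstance(act, dict):
            diffs.extend(diff_fields(exp, act, path))  # descend into nested specs
        elif exp != act:
            diffs.append({"field": path, "expected": exp, "actual": act})
    return diffs
```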

### Feature: Secret Scrubbing

```gherkin
Feature: Secret Scrubbing Before Transmission
  As a security officer
  I want all secrets scrubbed from drift reports before they leave the VPC
  So that credentials are never transmitted to the SaaS backend

  Background:
    Given the agent is running with secret scrubbing enabled

  Scenario: AWS access key ID detected and scrubbed
    Given a drift report contains a field value matching the AWS access key ID pattern "AKIA[0-9A-Z]{16}"
    When the scrubber processes the report
    Then the field value is replaced with "[REDACTED:aws_access_key_id]"
    And the original value is not present anywhere in the transmitted payload

  Scenario: Generic password field scrubbed
    Given a drift report contains a field named "password" with value "s3cr3tP@ss"
    When the scrubber processes the report
    Then the field value is replaced with "[REDACTED:password]"

  Scenario: Private key block scrubbed
    Given a drift report contains a PEM private key block
    When the scrubber processes the report
    Then the entire PEM block is replaced with "[REDACTED:private_key]"

  Scenario: Nested secret in JSON value scrubbed
    Given a drift report contains a JSON string value with a nested "api_key" field
    When the scrubber processes the report
    Then the nested api_key value is replaced with "[REDACTED:api_key]"
    And the surrounding JSON structure is preserved

  Scenario: Secret scrubber bypass attempt via encoding
    Given a drift report contains a base64-encoded AWS secret key
    When the scrubber processes the report
    Then the encoded value is detected and replaced with "[REDACTED:encoded_secret]"

  Scenario: Secret scrubber bypass attempt via Unicode homoglyphs
    Given a drift report contains a value using Unicode lookalike characters to resemble a secret pattern
    When the scrubber processes the report
    Then the value is flagged and replaced with "[REDACTED:suspicious_value]"

  Scenario: Non-secret value is not scrubbed
    Given a drift report contains a field "instance_type" with value "t3.medium"
    When the scrubber processes the report
    Then the field value remains "t3.medium" unchanged

  Scenario: Scrubber coverage is 100%
    Given a test corpus of 500 known secret patterns
    When the scrubber processes all patterns
    Then every pattern is detected and redacted
    And the scrubber reports 0 missed secrets

  Scenario: Scrubber audit log
    Given the scrubber redacts 3 values from a drift report
    When the report is transmitted
    Then the agent logs a scrubber audit entry with count "3" and field names (not values)
    And the audit log is stored locally for 30 days
```
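
A minimal scrubber sketch covering a few of the behaviours above: sensitive field names, the `AKIA` access-key pattern, PEM blocks, and the bypass-by-encoding check. The pattern list, field-name set, and redaction labels are illustrative; a production scrubber would carry a far larger corpus:

```python
import base64
import binascii
import re

PATTERNS = [
    ("aws_access_key_id", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("private_key", re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----", re.S)),
]
SENSITIVE_FIELDS = {"password", "api_key", "secret", "token"}  # illustrative subset

def scrub(report: dict) -> dict:
    """Redact sensitive field names and known secret patterns in string values."""
    out = {}
    for field, value in report.items():
        if field.lower() in SENSITIVE_FIELDS:
            out[field] = f"[REDACTED:{field.lower()}]"
        elif isinstance(value, str):
            out[field] = _scrub_str(value)
        else:
            out[field] = value
    return out

def _scrub_str(value: str) -> str:
    for label, pattern in PATTERNS:
        value = pattern.sub(f"[REDACTED:{label}]", value)
    # bypass-by-encoding check: decode as base64 and re-run the patterns
    try:
        decoded = base64.b64decode(value, validate=True).decode("ascii")
    except (binascii.Error, ValueError, UnicodeDecodeError):
        return value  # not valid base64-encoded text; keep the scrubbed value
    if any(pattern.search(decoded) for _, pattern in PATTERNS):
        return "[REDACTED:encoded_secret]"
    return value
```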

### Feature: Pulumi State Scanning

```gherkin
Feature: Pulumi State Scanning
  As a platform engineer
  I want the agent to read Pulumi state
  So that I can detect drift from Pulumi-managed resources

  Background:
    Given the agent is running
    And the stack type is "pulumi"

  Scenario: Scan Pulumi state from Pulumi Cloud backend
    Given a Pulumi stack "prod" exists in organization "acme"
    And the agent has a valid Pulumi access token configured
    When a scan cycle runs
    Then the agent fetches the stack's exported state via Pulumi API
    And parses all resource URNs and properties

  Scenario: Scan Pulumi state from self-managed S3 backend
    Given the Pulumi state is stored in S3 bucket "pulumi-state" at key "acme/prod.json"
    When a scan cycle runs
    Then the agent reads the state file from S3
    And parses all resource definitions

  Scenario: Pulumi access token is expired
    Given the configured Pulumi access token has expired
    When a scan cycle runs
    Then the agent logs "pulumi token expired" at ERROR level
    And emits a health event with status "auth_error"
    And does not report false drift
```

### Feature: 15-Minute Scan Cycle

```gherkin
Feature: Scheduled Scan Cycle
  As a platform engineer
  I want scans to run every 15 minutes automatically
  So that drift is detected promptly

  Background:
    Given the agent is running and initialized

  Scenario: Scan runs on schedule
    Given the last scan completed at T+0
    When 15 minutes elapse
    Then a new scan starts automatically
    And the scan completion is logged with timestamp

  Scenario: Scan cycle skipped if previous scan still running
    Given a scan started at T+0 and is still running at T+15
    When the next scheduled scan would start
    Then the new scan is skipped
    And the agent logs "scan skipped: previous scan still in progress" at WARN level

  Scenario: Scan interval is configurable
    Given the config specifies scan_interval_minutes: 30
    When the agent starts
    Then scans run every 30 minutes instead of 15

  Scenario: No drift detected — no report sent
    Given all resources match their IaC definitions
    When a scan cycle completes
    Then no drift report is sent to SaaS
    And the agent logs "scan complete: no drift detected"

  Scenario: Agent recovers scan schedule after restart
    Given the agent was restarted
    When the agent starts
    Then it reads the last scan timestamp from local state
    And schedules the next scan relative to the last completed scan
```
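
The restart-recovery scenario reduces to one computation: the next scan is due one interval after the last completed scan, or immediately if that moment already passed while the agent was down. A sketch; the function name is illustrative:

```python
from datetime import datetime, timedelta

def next_scan_time(last_scan, now: datetime,
                   interval: timedelta = timedelta(minutes=15)) -> datetime:
    """Schedule the next scan relative to the last completed scan.

    With no recorded scan the first one runs immediately; if the interval
    already elapsed while the agent was down, so does the next one."""
    if last_scan is None:
        return now
    due = last_scan + interval
    return due if due > now else now
```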

---

## Epic 2: Agent Communication

### Feature: mTLS Certificate Handshake

```gherkin
Feature: Mutual TLS (mTLS) Authentication
  As a security engineer
  I want all agent-to-SaaS communication to use mTLS
  So that only authenticated agents can submit drift reports

  Background:
    Given the agent has a client certificate issued by the drift CA
    And the SaaS endpoint requires mTLS

  Scenario: Successful mTLS handshake
    Given the agent certificate is valid and not expired
    And the SaaS server certificate is trusted by the agent's CA bundle
    When the agent connects to the SaaS endpoint
    Then the TLS handshake succeeds
    And the connection is established with mutual authentication

  Scenario: Agent certificate is expired
    Given the agent certificate expired 1 day ago
    When the agent attempts to connect to the SaaS endpoint
    Then the TLS handshake fails with "certificate expired"
    And the agent logs "mTLS cert expired, cannot connect" at ERROR level
    And the agent emits a local alert to stderr
    And no data is transmitted

  Scenario: Agent certificate is revoked
    Given the agent certificate has been added to the CRL
    When the agent attempts to connect
    Then the SaaS endpoint rejects the connection
    And the agent logs "certificate revoked" at ERROR level

  Scenario: Agent presents wrong CA certificate
    Given the agent has a certificate from an untrusted CA
    When the agent attempts to connect
    Then the TLS handshake fails
    And the agent logs "unknown certificate authority" at ERROR level

  Scenario: SaaS server certificate is expired
    Given the SaaS server certificate has expired
    When the agent attempts to connect
    Then the agent rejects the server certificate
    And logs "server cert expired" at ERROR level
    And does not transmit any data

  Scenario: Certificate rotation — new cert issued before expiry
    Given the agent certificate expires in 7 days
    And the control plane issues a new certificate
    When the agent receives the new certificate via the cert rotation endpoint
    Then the agent stores the new certificate
    And uses the new certificate for subsequent connections
    And logs "certificate rotated successfully" at INFO level

  Scenario: Certificate rotation fails — agent continues with old cert
    Given the agent certificate expires in 7 days
    And the cert rotation endpoint is unreachable
    When the agent attempts certificate rotation
    Then the agent retries rotation every hour
    And continues using the existing certificate until it expires
    And logs "cert rotation failed, retrying" at WARN level
```
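
The rotation scenarios hinge on noticing that the agent certificate expires within 7 days. Python's stdlib `ssl` module can parse the `notAfter` string a peer certificate carries; a sketch of that check (the 7-day window comes from the scenarios, the function name is illustrative):

```python
import ssl
from datetime import timedelta

ROTATION_WINDOW = timedelta(days=7)  # from the rotation scenarios

def should_rotate(not_after: str, now_s: float) -> bool:
    """`not_after` is a certificate's notAfter field in the string format used
    by ssl.SSLSocket.getpeercert(), e.g. "Jun 10 12:00:00 2030 GMT"."""
    expires_s = ssl.cert_time_to_seconds(not_after)
    return expires_s - now_s <= ROTATION_WINDOW.total_seconds()
```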

### Feature: Heartbeat

```gherkin
Feature: Agent Heartbeat
  As a SaaS operator
  I want agents to send regular heartbeats
  So that I can detect offline or unhealthy agents

  Background:
    Given the agent is running and connected via mTLS

  Scenario: Heartbeat sent on schedule
    Given the heartbeat interval is 60 seconds
    When 60 seconds elapse
    Then the agent sends a heartbeat message to the SaaS control plane
    And the heartbeat includes agent version, last scan time, and stack count

  Scenario: SaaS marks agent as offline after missed heartbeats
    Given the agent has not sent a heartbeat for 5 minutes
    When the SaaS control plane checks agent status
    Then the agent is marked as "offline"
    And a notification is sent to the tenant's configured alert channel

  Scenario: Agent reconnects after network interruption
    Given the agent lost connectivity for 3 minutes
    When connectivity is restored
    Then the agent re-establishes the mTLS connection
    And sends a heartbeat immediately
    And the SaaS marks the agent as "online"

  Scenario: Heartbeat includes health status
    Given the last scan encountered a parse error
    When the agent sends its next heartbeat
    Then the heartbeat payload includes health_status "degraded"
    And includes the error details in the health_details field

  Scenario: Multiple agents from same tenant
    Given tenant "acme" has 3 agents running in different regions
    When all 3 agents send heartbeats
    Then each agent is tracked independently by agent_id
    And the SaaS dashboard shows 3 active agents for tenant "acme"
```
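
The offline-detection scenario is a simple threshold over per-agent last-heartbeat timestamps. A SaaS-side sketch, assuming a `dict` of `agent_id -> last heartbeat time`; the 5-minute cutoff comes from the scenario:

```python
from datetime import datetime, timedelta

OFFLINE_AFTER = timedelta(minutes=5)  # from the missed-heartbeat scenario

def agent_statuses(last_heartbeat: dict, now: datetime) -> dict:
    """Map each agent_id to "online"/"offline" from its last heartbeat time;
    every agent is tracked independently, per the multi-agent scenario."""
    return {
        agent_id: "offline" if now - ts > OFFLINE_AFTER else "online"
        for agent_id, ts in last_heartbeat.items()
    }
```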

### Feature: SQS FIFO Drift Report Ingestion

```gherkin
Feature: SQS FIFO Drift Report Ingestion
  As a SaaS backend engineer
  I want drift reports delivered via SQS FIFO
  So that reports are processed in order without duplicates

  Background:
    Given an SQS FIFO queue "drift-reports.fifo" exists
    And the agent has sqs:SendMessage permission on the queue

  Scenario: Agent publishes drift report to SQS FIFO
    Given a scan detects 2 drifted resources
    When the agent publishes the drift report
    Then a message is sent to "drift-reports.fifo"
    And the MessageGroupId is set to the tenant's agent_id
    And the MessageDeduplicationId is set to the scan's unique scan_id

  Scenario: Duplicate scan report is deduplicated
    Given a drift report with scan_id "scan-abc-123" was already sent
    When the agent sends the same report again (e.g., due to retry)
    Then SQS FIFO deduplicates the message
    And only one copy is delivered to the consumer

  Scenario: SQS message size limit exceeded
    Given a drift report exceeds 256KB (SQS max message size)
    When the agent attempts to publish
    Then the agent splits the report into chunks
    And each chunk is sent as a separate message with a shared batch_id
    And the sequence is indicated by chunk_index and chunk_total fields

  Scenario: SQS publish fails — agent retries with backoff
    Given the SQS endpoint returns a 500 error
    When the agent attempts to publish a drift report
    Then the agent retries up to 5 times with exponential backoff
    And logs each retry attempt at WARN level
    And after all retries fail, stores the report locally for later replay

  Scenario: Agent publishes to correct queue per environment
    Given the agent config specifies environment "production"
    When the agent publishes a drift report
    Then the message is sent to "drift-reports-prod.fifo"
    And not to "drift-reports-staging.fifo"
```
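
The size-limit scenario describes splitting an oversized report into chunks that share a batch_id and carry chunk_index/chunk_total. A sketch of the split and the matching consumer-side reassembly; note this sketch slices by character count, whereas a real agent must measure encoded UTF-8 bytes against the SQS limit:

```python
import json
import math
import uuid

MAX_SQS_BODY = 256 * 1024  # SQS per-message limit

def chunk_report(report: dict, max_bytes: int = MAX_SQS_BODY) -> list:
    """Split a serialized drift report into ordered chunks sharing a batch_id."""
    body = json.dumps(report)
    if len(body) <= max_bytes:
        return [{"batch_id": None, "chunk_index": 0, "chunk_total": 1, "data": body}]
    batch_id = str(uuid.uuid4())
    total = math.ceil(len(body) / max_bytes)
    return [
        {"batch_id": batch_id, "chunk_index": i, "chunk_total": total,
         "data": body[i * max_bytes:(i + 1) * max_bytes]}
        for i in range(total)
    ]

def reassemble(chunks: list) -> dict:
    """Consumer side: reorder by chunk_index and rebuild the original report."""
    chunks = sorted(chunks, key=lambda c: c["chunk_index"])
    assert len(chunks) == chunks[0]["chunk_total"], "incomplete batch"
    return json.loads("".join(c["data"] for c in chunks))
```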

### Feature: Dead Letter Queue Handling

```gherkin
Feature: Dead Letter Queue (DLQ) Handling
  As a SaaS operator
  I want failed messages routed to a DLQ
  So that no drift reports are silently lost

  Background:
    Given a DLQ "drift-reports-dlq.fifo" is configured
    And the maxReceiveCount is set to 3

  Scenario: Message moved to DLQ after max receive count
    Given a drift report message has failed processing 3 times
    When the SQS visibility timeout expires a 3rd time
    Then the message is automatically moved to the DLQ
    And an alarm fires on the DLQ depth metric

  Scenario: DLQ alarm triggers operator notification
    Given the DLQ depth exceeds 0
    When the CloudWatch alarm triggers
    Then a PagerDuty alert is sent to the on-call engineer
    And the alert includes the queue name and approximate message count

  Scenario: DLQ message is replayed after fix
    Given a message in the DLQ was caused by a schema validation bug
    And the bug has been fixed and deployed
    When an operator triggers DLQ replay
    Then the message is moved back to the main queue
    And processed successfully
    And removed from the DLQ

  Scenario: DLQ message contains poison pill — permanently discarded
    Given a DLQ message is malformed beyond repair
    When an operator inspects the message
    Then the operator can mark it as "discarded" via the ops console
    And the discard action is logged with operator identity and reason
    And the message is deleted from the DLQ
```
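
The maxReceiveCount behaviour is enforced by SQS itself through the queue's redrive policy; the sketch below only simulates that contract in-process so the expected outcomes are testable (the receive count increments on each delivery, and the message is dead-lettered after the third failure):

```python
MAX_RECEIVE_COUNT = 3  # mirrors the redrive policy in the Background

def route(message: dict, process) -> str:
    """Deliver `message` repeatedly, as SQS would, until it is processed
    or the receive count reaches MAX_RECEIVE_COUNT (then: dead-letter)."""
    while message.get("receive_count", 0) < MAX_RECEIVE_COUNT:
        message["receive_count"] = message.get("receive_count", 0) + 1
        try:
            process(message)
            return "processed"
        except Exception:
            continue  # visibility timeout expires; message becomes receivable again
    return "dlq"
```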

---

## Epic 3: Event Processor

### Feature: Drift Report Normalization

```gherkin
Feature: Drift Report Normalization
  As a SaaS backend engineer
  I want incoming drift reports normalized to a canonical schema
  So that downstream consumers work with consistent data regardless of IaC type

  Background:
    Given the event processor is running
    And it is consuming from "drift-reports.fifo"

  Scenario: Normalize a Terraform drift report
    Given a raw drift report with type "terraform" arrives on the queue
    When the event processor consumes the message
    Then it maps Terraform resource addresses to canonical resource_id format
    And stores the normalized report in the "drift_events" PostgreSQL table
    And sets the "iac_type" field to "terraform"

  Scenario: Normalize a CloudFormation drift report
    Given a raw drift report with type "cloudformation" arrives
    When the event processor normalizes it
    Then CloudFormation logical resource IDs are mapped to canonical resource_id
    And the "iac_type" field is set to "cloudformation"

  Scenario: Normalize a Kubernetes drift report
    Given a raw drift report with type "kubernetes" arrives
    When the event processor normalizes it
    Then Kubernetes resource URIs (namespace/kind/name) are mapped to canonical resource_id
    And the "iac_type" field is set to "kubernetes"

  Scenario: Unknown IaC type in report
    Given a drift report arrives with iac_type "unknown_tool"
    When the event processor attempts normalization
    Then the message is rejected with error "unsupported_iac_type"
    And the message is moved to the DLQ
    And an error is logged with the raw message_id

  Scenario: Report with missing required fields
    Given a drift report is missing the "tenant_id" field
    When the event processor validates the message
    Then validation fails with "missing required field: tenant_id"
    And the message is moved to the DLQ
    And no partial record is written to PostgreSQL

  Scenario: Chunked report is reassembled before normalization
    Given 3 chunked messages arrive with the same batch_id
    And chunk_total is 3
    When all 3 chunks are received
    Then the event processor reassembles them in chunk_index order
    And normalizes the complete report as a single event

  Scenario: Chunked report — one chunk missing after timeout
    Given 2 of 3 chunks arrive for batch_id "batch-xyz"
    And the third chunk does not arrive within 10 minutes
    When the reassembly timeout fires
    Then the event processor logs "incomplete batch: batch-xyz, missing chunk 2" at WARN level
    And moves the partial batch to the DLQ
```
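
The three normalization scenarios each map an IaC-native identifier onto one canonical resource_id. The `<iac_type>:<identifier>` format below is an assumption for illustration; the spec only requires that the mapping be consistent per IaC type:

```python
def canonical_resource_id(iac_type: str, raw: dict) -> str:
    """Map IaC-specific identifiers to one canonical resource_id string."""
    if iac_type == "terraform":
        return f'terraform:{raw["address"]}'  # e.g. aws_instance.web
    if iac_type == "cloudformation":
        return f'cloudformation:{raw["stack"]}/{raw["logical_id"]}'
    if iac_type == "kubernetes":
        return f'kubernetes:{raw["namespace"]}/{raw["kind"]}/{raw["name"]}'
    # the unknown-type scenario: reject and let the caller dead-letter it
    raise ValueError("unsupported_iac_type")
```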

### Feature: Drift Severity Scoring

```gherkin
Feature: Drift Severity Scoring
  As a platform engineer
  I want each drift event scored by severity
  So that I can prioritize which drifts to address first

  Background:
    Given a normalized drift event is ready for scoring

  Scenario: Security group rule added — HIGH severity
    Given a drift event describes an added inbound rule on a security group
    And the rule opens port 22 to 0.0.0.0/0
    When the severity scorer evaluates the event
    Then the severity is set to "HIGH"
    And the reason is "public SSH access opened"

  Scenario: IAM policy attached — CRITICAL severity
    Given a drift event describes an IAM policy "AdministratorAccess" attached to a role
    When the severity scorer evaluates the event
    Then the severity is set to "CRITICAL"
    And the reason is "admin policy attached outside IaC"

  Scenario: Replica count changed — LOW severity
    Given a drift event describes a Kubernetes Deployment replica count changed from 3 to 2
    When the severity scorer evaluates the event
    Then the severity is set to "LOW"
    And the reason is "non-security configuration drift"

  Scenario: Resource deleted — HIGH severity
    Given a drift event describes an RDS instance being deleted
    When the severity scorer evaluates the event
    Then the severity is set to "HIGH"
    And the reason is "managed resource deleted outside IaC"

  Scenario: Tag-only drift — INFO severity
    Given a drift event describes only tag changes on an EC2 instance
    When the severity scorer evaluates the event
    Then the severity is set to "INFO"
    And the reason is "tag-only drift"

  Scenario: Custom severity rules override defaults
    Given a tenant has configured a custom rule: "any change to resource type aws_s3_bucket = CRITICAL"
    And a drift event describes a tag change on an S3 bucket
    When the severity scorer evaluates the event
    Then the tenant's custom rule takes precedence
    And the severity is set to "CRITICAL"

  Scenario: Severity score is stored with the drift event
    Given a drift event has been scored as "HIGH"
    When the event processor writes to PostgreSQL
    Then the "drift_events" row includes severity "HIGH" and scored_at timestamp
```
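
The scoring scenarios suggest a small rule engine with a fixed precedence: tenant custom rules first, then built-in security rules, then defaults keyed by drift type. A sketch; the rule shapes and event field names are illustrative assumptions:

```python
DEFAULT_SEVERITY = {
    "resource_deleted": ("HIGH", "managed resource deleted outside IaC"),
    "tag_only": ("INFO", "tag-only drift"),
    "config_modified": ("LOW", "non-security configuration drift"),
}

def score(event: dict, custom_rules: dict = None) -> tuple:
    """Return (severity, reason) for a normalized drift event.

    Precedence: tenant custom rules (keyed by resource type) take priority
    over built-in security rules, which take priority over defaults."""
    if custom_rules and event.get("resource_type") in custom_rules:
        return custom_rules[event["resource_type"]], "tenant custom rule"
    if event.get("admin_policy_attached"):
        return "CRITICAL", "admin policy attached outside IaC"
    if event.get("public_ssh_opened"):
        return "HIGH", "public SSH access opened"
    return DEFAULT_SEVERITY.get(event.get("drift_type"), ("LOW", "unclassified drift"))
```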
|
|||
|
|
|
|||
|
|
### Feature: PostgreSQL Storage with RLS
|
|||
|
|
|
|||
|
|
```gherkin
|
|||
|
|
Feature: Drift Event Storage with Row-Level Security
|
|||
|
|
As a SaaS engineer
|
|||
|
|
I want drift events stored in PostgreSQL with RLS
|
|||
|
|
So that tenants can only access their own data
|
|||
|
|
|
|||
|
|
Background:
|
|||
|
|
Given PostgreSQL is running with RLS enabled on the "drift_events" table
|
|||
|
|
And the RLS policy filters rows by "tenant_id = current_setting('app.tenant_id')"
|
|||
|
|
|
|||
|
|
Scenario: Drift event written for tenant A
|
|||
|
|
Given a normalized drift event belongs to tenant "acme"
|
|||
|
|
When the event processor writes the event
|
|||
|
|
Then the row is inserted with tenant_id "acme"
|
|||
|
|
And the row is visible when querying as tenant "acme"
|
|||
|
|
|
|||
|
|
Scenario: Tenant B cannot read tenant A's drift events
|
|||
|
|
Given drift events exist for tenant "acme"
|
|||
|
|
When a query runs with app.tenant_id set to "globex"
|
|||
|
|
Then zero rows are returned
|
|||
|
|
And no error is thrown (RLS silently filters)
|
|||
|
|
|
|||
|
|
Scenario: Superuser bypass is disabled for application role
|
|||
|
|
Given the application database role is "drift_app"
|
|||
|
|
When "drift_app" attempts to query without setting app.tenant_id
|
|||
|
|
Then zero rows are returned due to RLS default-deny policy
|
|||
|
|
|
|||
|
|
Scenario: Drift event deduplication
|
|||
|
|
Given a drift event with scan_id "scan-abc" and resource_id "aws_instance.web" already exists
|
|||
|
|
When the event processor attempts to insert the same event again
|
|||
|
|
Then the INSERT is ignored (ON CONFLICT DO NOTHING)
|
|||
|
|
And no duplicate row is created
|
|||
|
|
|
|||
|
|
Scenario: Database connection pool exhausted
|
|||
|
|
Given all PostgreSQL connections are in use
|
|||
|
|
When the event processor tries to write a drift event
|
|||
|
|
Then it waits up to 5 seconds for a connection
|
|||
|
|
And if no connection is available, the message is nacked and retried
|
|||
|
|
And an alert fires if pool exhaustion persists for more than 60 seconds
|
|||
|
|
|
|||
|
|
Scenario: Schema migration runs without downtime
|
|||
|
|
Given a new additive column "remediation_status" is being added
|
|||
|
|
When the migration runs
|
|||
|
|
Then existing rows are unaffected
|
|||
|
|
And new rows include the "remediation_status" column
|
|||
|
|
And the event processor continues writing without restart
|
|||
|
|
```
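
The default-deny behavior these scenarios specify can be modeled outside the database. The sketch below is an illustrative pure-Python model (not the actual RLS policy — that lives in PostgreSQL `CREATE POLICY` DDL): rows are visible only when the session's `app.tenant_id` matches, and an unset tenant yields zero rows.

```python
# Illustrative model of the RLS behavior specified above. Row and column
# names mirror the scenarios; the real filter is enforced by PostgreSQL.

ROWS = [
    {"tenant_id": "acme", "resource_id": "aws_instance.web"},
    {"tenant_id": "acme", "resource_id": "aws_s3_bucket.logs"},
    {"tenant_id": "globex", "resource_id": "aws_instance.api"},
]

def visible_rows(rows, app_tenant_id=None):
    """Return the rows an RLS-filtered query would see for this session."""
    if app_tenant_id is None:
        return []  # default deny: no tenant set, zero rows, no error
    return [r for r in rows if r["tenant_id"] == app_tenant_id]
```

Note the unset-tenant case returns an empty result rather than raising, matching the "silently filters" scenario.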

### Feature: Idempotent Event Processing

```gherkin
Feature: Idempotent Event Processing
  As a SaaS backend engineer
  I want event processing to be idempotent
  So that retries and replays do not create duplicate records

  Scenario: Same SQS message delivered twice (at-least-once delivery)
    Given an SQS message with MessageId "msg-001" was processed successfully
    When the same message is delivered again due to SQS retry
    Then the event processor detects the duplicate via scan_id lookup
    And skips processing
    And deletes the message from the queue

  Scenario: Event processor restarts mid-batch
    Given the event processor crashed after writing 5 of 10 events
    When the processor restarts and reprocesses the batch
    Then the 5 already-written events are skipped (idempotent)
    And the remaining 5 events are written
    And the final state has exactly 10 events

  Scenario: Replay from DLQ does not create duplicates
    Given a DLQ message is replayed after a bug fix
    And the event was partially processed before the crash
    When the replayed message is processed
    Then the processor uses upsert semantics
    And the final record reflects the correct state
```
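
The mid-batch restart scenario can be sketched as a dedup guard keyed on `(scan_id, resource_id)`. This is a minimal single-process model, assuming an in-memory set where the real processor would rely on the database's unique index:

```python
# Idempotency guard sketch: events already written are skipped on
# redelivery or replay, so reprocessing a batch is safe.

def process_batch(events, written):
    """Write each event exactly once; return the number actually written."""
    inserted = 0
    for event in events:
        key = (event["scan_id"], event["resource_id"])
        if key in written:
            continue  # duplicate delivery or replay: skip silently
        written.add(key)
        inserted += 1
    return inserted
```

Reprocessing the full batch after a crash writes only the missing half, leaving exactly one record per event.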

---

## Epic 4: Notification Engine

### Feature: Slack Block Kit Drift Alerts

```gherkin
Feature: Slack Block Kit Drift Alerts
  As a platform engineer
  I want to receive Slack notifications when drift is detected
  So that I can act on it immediately

  Background:
    Given a tenant has configured a Slack webhook URL
    And the notification engine is running

  Scenario: HIGH severity drift triggers immediate Slack alert
    Given a drift event with severity "HIGH" is stored
    When the notification engine processes the event
    Then a Slack Block Kit message is sent to the configured channel
    And the message includes the resource_id, drift type, and severity badge
    And the message includes an inline diff of expected vs actual values
    And the message includes a "Revert to IaC" action button

  Scenario: CRITICAL severity drift triggers immediate Slack alert with @here mention
    Given a drift event with severity "CRITICAL" is stored
    When the notification engine processes the event
    Then the Slack message includes an "@here" mention
    And the message is sent within 60 seconds of the event being stored

  Scenario: LOW severity drift is batched — not sent immediately
    Given a drift event with severity "LOW" is stored
    When the notification engine processes the event
    Then no immediate Slack message is sent
    And the event is queued for the next daily digest

  Scenario: INFO severity drift is suppressed from Slack
    Given a drift event with severity "INFO" is stored
    When the notification engine processes the event
    Then no Slack message is sent
    And the event is only visible in the dashboard

  Scenario: Slack message includes inline diff
    Given a drift event shows security group rule changed
    And expected value is "port 443 from 10.0.0.0/8"
    And actual value is "port 443 from 0.0.0.0/0"
    When the Slack alert is composed
    Then the message body includes a diff block showing the change
    And removed lines are prefixed with "-" in red
    And added lines are prefixed with "+" in green

  Scenario: Slack webhook returns 429 rate limit
    Given the Slack webhook returns HTTP 429
    When the notification engine attempts to send
    Then it respects the Retry-After header
    And retries after the specified delay
    And logs "slack rate limited, retrying in Xs" at WARN level

  Scenario: Slack webhook URL is invalid
    Given the tenant's Slack webhook URL returns HTTP 404
    When the notification engine attempts to send
    Then it logs "invalid slack webhook" at ERROR level
    And marks the notification as "failed" in the database
    And does not retry indefinitely (max 3 attempts)

  Scenario: Multiple drifts in same scan — grouped in one message
    Given a scan detects 5 drifted resources all with severity "HIGH"
    When the notification engine processes the batch
    Then a single Slack message is sent grouping all 5 resources
    And the message includes a summary "5 resources drifted"
    And each resource is listed with its diff
```
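
One way the alert payload could be composed is sketched below. The block layout and `action_id` value are assumptions for illustration (the scenarios don't pin an exact payload); the `section`/`actions`/`button` block types and the `<!here>` mention syntax are standard Slack Block Kit:

```python
def build_drift_alert(resource_id, severity, expected, actual):
    """Compose a Slack Block Kit payload for a single drifted resource."""
    diff = f"- {expected}\n+ {actual}"  # removed line first, added line second
    blocks = [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": f"*Drift detected:* `{resource_id}` [{severity}]"}},
        {"type": "section",
         "text": {"type": "mrkdwn", "text": f"```{diff}```"}},
        {"type": "actions",
         "elements": [{"type": "button",
                       "text": {"type": "plain_text", "text": "Revert to IaC"},
                       "action_id": "revert_to_iac"}]},
    ]
    if severity == "CRITICAL":
        # "<!here>" is Block Kit mrkdwn for an @here mention
        blocks.insert(0, {"type": "section",
                          "text": {"type": "mrkdwn", "text": "<!here> critical drift"}})
    return {"blocks": blocks}
```

The payload would then be POSTed to the tenant's webhook URL; the `action_id` is what the interaction handler matches on when the button is clicked.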

### Feature: Daily Digest

```gherkin
Feature: Daily Drift Digest
  As a platform engineer
  I want a daily summary of all drift events
  So that I have a consolidated view without alert fatigue

  Background:
    Given a tenant has daily digest enabled
    And the digest is scheduled for 09:00 tenant local time

  Scenario: Daily digest sent with pending LOW/INFO events
    Given 12 LOW severity drift events accumulated since the last digest
    When the digest job runs at 09:00
    Then a single Slack message is sent summarizing all 12 events
    And the message groups events by stack name
    And includes a link to the dashboard for full details

  Scenario: Daily digest skipped when no events
    Given no drift events occurred in the last 24 hours
    When the digest job runs
    Then no Slack message is sent
    And the job logs "digest skipped: no events" at INFO level

  Scenario: Daily digest includes resolved drifts
    Given 3 drift events were detected and then remediated in the last 24 hours
    When the digest runs
    Then the digest includes a "Resolved" section listing those 3 events
    And shows time-to-remediation for each

  Scenario: Digest timezone is per-tenant
    Given tenant "acme" is in timezone "America/New_York" (UTC-5)
    And tenant "globex" is in timezone "Asia/Tokyo" (UTC+9)
    When the digest scheduler runs
    Then "acme" receives their digest at 14:00 UTC
    And "globex" receives their digest at 00:00 UTC

  Scenario: Digest delivery fails — retried next hour
    Given the Slack webhook is temporarily unavailable at 09:00
    When the digest send fails
    Then the system retries at 10:00
    And again at 11:00
    And after 3 failures, marks the digest as "failed" and alerts the operator
```
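
The per-tenant timezone scenario reduces to converting each tenant's 09:00 local time to UTC. A minimal sketch with the stdlib `zoneinfo` module (Python 3.9+); the function name is hypothetical, and the scenario's fixed UTC-5 offset holds for standard (winter) dates:

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

def digest_send_time_utc(local_date, tz_name, hour=9):
    """Return the UTC instant of a tenant's 09:00-local digest on a given date."""
    local = datetime(local_date.year, local_date.month, local_date.day,
                     hour, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC"))
```

Using IANA zone names rather than fixed offsets means DST shifts are handled automatically, so the same rule still fires at 09:00 local in summer.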

### Feature: Severity-Based Routing

```gherkin
Feature: Severity-Based Notification Routing
  As a platform engineer
  I want different severity levels routed to different channels
  So that critical alerts reach the right people immediately

  Background:
    Given a tenant has configured routing rules

  Scenario: CRITICAL routed to #incidents channel
    Given the routing rule maps CRITICAL → "#incidents"
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then the message is sent to "#incidents"
    And not to "#drift-alerts"

  Scenario: HIGH routed to #drift-alerts channel
    Given the routing rule maps HIGH → "#drift-alerts"
    And a HIGH drift event occurs
    When the notification engine routes the alert
    Then the message is sent to "#drift-alerts"

  Scenario: No routing rule configured — fallback to default channel
    Given no routing rules are configured for severity "MEDIUM"
    And a MEDIUM drift event occurs
    When the notification engine routes the alert
    Then the message is sent to the tenant's default Slack channel

  Scenario: Multiple channels for same severity
    Given the routing rule maps CRITICAL → ["#incidents", "#sre-oncall"]
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then the message is sent to both "#incidents" and "#sre-oncall"

  Scenario: PagerDuty integration for CRITICAL severity
    Given the tenant has PagerDuty integration configured
    And the routing rule maps CRITICAL → PagerDuty
    And a CRITICAL drift event occurs
    When the notification engine routes the alert
    Then a PagerDuty incident is created via the Events API
    And the incident includes the drift event details and dashboard link
```
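
The routing rules above (single channel, channel list, or no rule with default fallback) can be resolved with a small pure function. A sketch under the assumption that rules are stored as a severity-keyed mapping:

```python
def route_channels(severity, rules, default_channel):
    """Resolve the target channels for a severity, falling back to the default."""
    target = rules.get(severity)
    if target is None:
        return [default_channel]   # no rule configured: tenant default
    if isinstance(target, str):
        return [target]            # single-channel rule
    return list(target)            # multi-channel rule (fan-out)
```

Returning a list in every case lets the caller fan out one send per channel without special-casing.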

---

## Epic 5: Remediation

### Feature: One-Click Revert via Slack

```gherkin
Feature: One-Click Revert to IaC via Slack
  As a platform engineer
  I want to trigger remediation directly from a Slack alert
  So that I can revert drift without leaving my chat tool

  Background:
    Given a HIGH severity drift alert was sent to Slack
    And the alert includes a "Revert to IaC" button

  Scenario: Engineer clicks Revert to IaC for non-destructive change
    Given the drift is a security group rule addition (non-destructive revert)
    When the engineer clicks "Revert to IaC"
    Then the SaaS backend receives the Slack interaction payload
    And sends a remediation command to the agent via the control plane
    And the agent runs "terraform apply -target=aws_security_group.web"
    And the Slack message is updated to show "Remediation in progress..."

  Scenario: Remediation completes successfully
    Given a remediation command was dispatched to the agent
    When the agent completes the terraform apply
    Then the agent sends a remediation result event to SaaS
    And the Slack message is updated to "✅ Reverted successfully"
    And a new scan is triggered immediately to confirm no drift

  Scenario: Remediation fails — agent reports error
    Given a remediation command was dispatched
    And the terraform apply exits with a non-zero code
    When the agent reports the failure
    Then the Slack message is updated to "❌ Remediation failed"
    And the error output is included in the Slack message (truncated to 500 chars)
    And the drift event status is set to "remediation_failed"

  Scenario: Revert button clicked by unauthorized user
    Given the Slack user "intern@acme.com" is not in the "remediation_approvers" group
    When they click "Revert to IaC"
    Then the SaaS backend rejects the action with "Unauthorized"
    And a Slack ephemeral message is shown: "You don't have permission to trigger remediation"
    And the action is logged with the user's identity

  Scenario: Revert button clicked twice (double-click protection)
    Given a remediation is already in progress for drift event "drift-001"
    When the "Revert to IaC" button is clicked again
    Then the SaaS backend returns "remediation already in progress"
    And a Slack ephemeral message is shown: "Remediation already running"
    And no duplicate remediation is triggered
```
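
The double-click protection scenario amounts to a first-click-wins guard per drift event. A minimal in-memory sketch; a real backend would back this with a database unique constraint or advisory lock so it holds across processes:

```python
class RemediationRegistry:
    """In-process sketch of the in-progress guard: the first click wins."""

    def __init__(self):
        self._in_progress = set()

    def try_start(self, drift_id):
        """Return True if this call claims the remediation, False if already running."""
        if drift_id in self._in_progress:
            return False  # maps to "remediation already in progress"
        self._in_progress.add(drift_id)
        return True

    def finish(self, drift_id):
        """Release the guard once the agent reports a terminal status."""
        self._in_progress.discard(drift_id)
```

A second click while the first remediation is running gets `False`, which the handler turns into the ephemeral "Remediation already running" reply.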

### Feature: Approval Workflow for Destructive Changes

```gherkin
Feature: Approval Workflow for Destructive Remediation
  As a security officer
  I want destructive remediations to require explicit approval
  So that no resources are accidentally deleted

  Background:
    Given a drift event involves a resource that would be deleted during revert
    And the tenant has approval workflow enabled

  Scenario: Destructive revert requires approval
    Given the drift revert would delete an RDS instance
    When an engineer clicks "Revert to IaC"
    Then instead of executing immediately, an approval request is sent
    And the Slack message shows "⚠️ Destructive change — approval required"
    And an approval request is sent to all users in the "remediation_approvers" group

  Scenario: Approver approves destructive remediation
    Given an approval request is pending for drift event "drift-002"
    When an approver clicks "Approve" in Slack
    Then the approval is recorded with the approver's identity and timestamp
    And the remediation command is dispatched to the agent
    And the Slack thread is updated: "Approved by @jane — executing..."

  Scenario: Approver rejects destructive remediation
    Given an approval request is pending
    When an approver clicks "Reject"
    Then the remediation is cancelled
    And the drift event status is set to "remediation_rejected"
    And the Slack message is updated: "❌ Rejected by @jane"
    And the rejection reason (if provided) is logged

  Scenario: Approval timeout — remediation auto-cancelled
    Given an approval request has been pending for 24 hours
    And no approver has responded
    When the approval timeout fires
    Then the remediation is automatically cancelled
    And the drift event status is set to "approval_timeout"
    And a Slack message is sent: "⏰ Approval timed out — remediation cancelled"
    And the event is included in the next daily digest

  Scenario: Approval timeout is configurable
    Given the tenant has set approval_timeout_hours to 4
    When an approval request is pending for 4 hours without response
    Then the timeout fires after 4 hours (not 24)

  Scenario: Self-approval is blocked
    Given engineer "alice@acme.com" triggered the remediation request
    When "alice@acme.com" attempts to approve their own request
    Then the approval is rejected with "Self-approval not permitted"
    And an ephemeral Slack message informs Alice

  Scenario: Minimum approvers requirement
    Given the tenant requires 2 approvals for destructive changes
    And only 1 approver has approved
    When the second approver approves
    Then the quorum is met
    And the remediation is dispatched
```
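
The self-approval and quorum rules combine naturally into one small state object. A sketch with hypothetical names, assuming approvals are a set so a repeated click by the same approver cannot satisfy the quorum:

```python
class ApprovalRequest:
    """Sketch of the quorum and self-approval rules from the scenarios."""

    def __init__(self, requester, required_approvals=2):
        self.requester = requester
        self.required = required_approvals
        self.approvers = set()  # set: the same approver can't count twice

    def approve(self, user):
        """Record an approval; return True once quorum is met (dispatch)."""
        if user == self.requester:
            raise PermissionError("Self-approval not permitted")
        self.approvers.add(user)
        return len(self.approvers) >= self.required
```

The caller dispatches the remediation only when `approve()` returns `True`, and records each approver's identity and timestamp alongside.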

### Feature: Agent-Side Remediation Execution

```gherkin
Feature: Agent-Side Remediation Execution
  As a platform engineer
  I want the agent to apply IaC changes to revert drift
  So that remediation happens inside the customer VPC with proper credentials

  Background:
    Given the agent has received a remediation command from the control plane
    And the agent has the necessary IAM permissions

  Scenario: Terraform revert executed successfully
    Given the remediation command specifies stack "prod" and resource "aws_security_group.web"
    When the agent executes the remediation
    Then it runs "terraform apply -target=aws_security_group.web -auto-approve"
    And captures stdout and stderr
    And reports the result back to SaaS via the control plane

  Scenario: kubectl revert executed successfully
    Given the remediation command specifies a Kubernetes Deployment "api-server"
    When the agent executes the remediation
    Then it runs "kubectl apply -f /etc/drift/stacks/prod/api-server.yaml"
    And reports the result back to SaaS

  Scenario: Remediation command times out
    Given the terraform apply is still running after 10 minutes
    When the remediation timeout fires
    Then the agent kills the terraform process
    And reports status "timeout" to SaaS
    And logs "remediation timed out after 10m" at ERROR level

  Scenario: Agent loses connectivity during remediation
    Given a remediation is in progress
    And the agent loses connectivity to SaaS mid-execution
    When connectivity is restored
    Then the agent reports the final remediation result
    And the SaaS backend reconciles the status

  Scenario: Remediation command is replayed after agent restart
    Given a remediation command was received but the agent restarted before executing
    When the agent restarts
    Then it checks for pending remediation commands
    And executes any pending commands
    And reports results to SaaS

  Scenario: Remediation is blocked when panic mode is active
    Given the tenant's panic mode is active
    When a remediation command is received by the agent
    Then the agent rejects the command
    And logs "remediation blocked: panic mode active" at WARN level
    And reports status "blocked_panic_mode" to SaaS
```
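
The two revert commands the scenarios name can be built as argv lists, which avoids shell-quoting issues and lets the agent run them with a timeout (e.g. `subprocess.run(cmd, timeout=600)` for the 10-minute scenario). A minimal sketch; the dispatch keys are assumptions:

```python
def build_revert_command(iac_type, target):
    """Build the agent's revert command line for the given IaC type."""
    if iac_type == "terraform":
        # Matches the scenario: targeted apply, no interactive prompt
        return ["terraform", "apply", f"-target={target}", "-auto-approve"]
    if iac_type == "kubernetes":
        # target is the path to the desired-state manifest
        return ["kubectl", "apply", "-f", target]
    raise ValueError(f"unsupported iac_type: {iac_type}")
```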

---

## Epic 6: Dashboard UI

### Feature: OAuth Login

```gherkin
Feature: OAuth Login
  As a user
  I want to log in via OAuth
  So that I don't need a separate password for the drift dashboard

  Background:
    Given the dashboard is running at "https://app.drift.dd0c.io"

  Scenario: Successful login via GitHub OAuth
    Given the user navigates to the dashboard login page
    When they click "Sign in with GitHub"
    Then they are redirected to GitHub's OAuth authorization page
    And after authorizing, redirected back with an authorization code
    And the backend exchanges the code for a JWT
    And the user is logged in and sees their tenant's dashboard

  Scenario: Successful login via Google OAuth
    Given the user clicks "Sign in with Google"
    When they complete the Google OAuth flow
    Then they are logged in with a valid JWT
    And the JWT contains their tenant_id and email claims

  Scenario: OAuth callback with invalid state parameter
    Given an OAuth callback arrives with a mismatched state parameter
    When the frontend processes the callback
    Then the login is rejected with "Invalid OAuth state"
    And the user is redirected to the login page with an error message
    And the event is logged as a potential CSRF attempt

  Scenario: JWT expires during session
    Given the user is logged in with a JWT that expires in 1 minute
    When the JWT expires
    Then the dashboard silently refreshes the token using the refresh token
    And the user's session continues uninterrupted

  Scenario: Refresh token is revoked
    Given the user's refresh token has been revoked (e.g., password change)
    When the dashboard attempts to refresh the JWT
    Then the refresh fails
    And the user is redirected to the login page
    And shown "Your session has expired, please log in again"

  Scenario: User belongs to no tenant
    Given a new OAuth user has no tenant association
    When they complete OAuth login
    Then they are redirected to the onboarding flow
    And prompted to create or join a tenant
```
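
The invalid-state scenario is the standard OAuth CSRF defense: an unguessable state value stored in the session before redirecting, then compared in constant time on the callback. A stdlib-only sketch with hypothetical function names:

```python
import hmac
import secrets

def new_oauth_state():
    """Generate an unguessable state value to store in the user's session."""
    return secrets.token_urlsafe(32)

def state_is_valid(session_state, callback_state):
    """Constant-time comparison of the stored state vs the callback's state."""
    if not session_state or not callback_state:
        return False  # missing either side is a rejection, not an exception
    return hmac.compare_digest(session_state, callback_state)
```

A `False` result maps to the "Invalid OAuth state" rejection and the CSRF-attempt log entry.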

### Feature: Stack Overview

```gherkin
Feature: Stack Overview Page
  As a platform engineer
  I want to see all my stacks and their drift status at a glance
  So that I can quickly identify which stacks need attention

  Background:
    Given the user is logged in as a member of tenant "acme"
    And tenant "acme" has 5 stacks configured

  Scenario: Stack overview loads all stacks
    Given the user navigates to the dashboard home
    When the page loads
    Then 5 stack cards are displayed
    And each card shows stack name, IaC type, last scan time, and drift count

  Scenario: Stack with active drift shows warning indicator
    Given stack "prod-api" has 3 active drift events
    When the stack overview loads
    Then the "prod-api" card shows a yellow warning badge with "3 drifts"

  Scenario: Stack with CRITICAL drift shows critical indicator
    Given stack "prod-db" has 1 CRITICAL drift event
    When the stack overview loads
    Then the "prod-db" card shows a red critical badge

  Scenario: Stack with no drift shows healthy indicator
    Given stack "staging" has no active drift events
    When the stack overview loads
    Then the "staging" card shows a green "Healthy" badge

  Scenario: Stack overview auto-refreshes
    Given the user is viewing the stack overview
    When 60 seconds elapse
    Then the page automatically refreshes drift counts without a full page reload

  Scenario: Tenant with no stacks sees empty state
    Given a new tenant has no stacks configured
    When they navigate to the dashboard
    Then an empty state is shown: "No stacks yet — run drift init to get started"
    And a link to the onboarding guide is displayed

  Scenario: Stack overview only shows current tenant's stacks
    Given tenant "acme" has 5 stacks and tenant "globex" has 3 stacks
    When a user from "acme" views the dashboard
    Then only 5 stacks are shown
    And no stacks from "globex" are visible
```

### Feature: Diff Viewer

```gherkin
Feature: Drift Diff Viewer
  As a platform engineer
  I want to see a detailed diff of what changed
  So that I understand exactly what drifted and how

  Background:
    Given the user is viewing a specific drift event

  Scenario: Diff viewer shows field-level changes
    Given a drift event has 3 changed fields
    When the user opens the diff viewer
    Then each changed field is shown with expected and actual values
    And removed values are highlighted in red
    And added values are highlighted in green

  Scenario: Diff viewer shows JSON diff for complex values
    Given a drift event involves a changed IAM policy document (JSON)
    When the user opens the diff viewer
    Then the policy JSON is shown as a structured diff
    And individual JSON fields are highlighted rather than the whole blob

  Scenario: Diff viewer handles large diffs with pagination
    Given a drift event has 50 changed fields
    When the user opens the diff viewer
    Then the first 20 fields are shown
    And a "Show more" button loads the remaining 30

  Scenario: Diff viewer shows resource metadata
    Given a drift event for resource "aws_security_group.web"
    When the user views the diff
    Then the viewer shows resource type, ARN, region, and stack name
    And the scan timestamp is displayed

  Scenario: Diff viewer copy-to-clipboard
    Given the user is viewing a diff
    When they click "Copy diff"
    Then the diff is copied to clipboard in unified diff format
    And a toast notification confirms "Copied to clipboard"
```
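
The "unified diff format" the copy-to-clipboard scenario mentions can be produced directly with the stdlib `difflib` module. A sketch; the file-label convention is an assumption:

```python
import difflib

def unified_drift_diff(resource_id, expected, actual):
    """Render expected vs actual values as a unified diff string."""
    lines = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile=f"{resource_id} (expected)",
        tofile=f"{resource_id} (actual)",
        lineterm="",  # we join with "\n" ourselves
    )
    return "\n".join(lines)
```

The `-`/`+` prefixed lines in the output are exactly what the viewer renders in red and green.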

### Feature: Drift Timeline

```gherkin
Feature: Drift Timeline
  As a platform engineer
  I want to see a timeline of drift events over time
  So that I can identify patterns and recurring issues

  Background:
    Given the user is viewing the drift timeline for stack "prod-api"

  Scenario: Timeline shows events in reverse chronological order
    Given 10 drift events exist for "prod-api" over the last 7 days
    When the user views the timeline
    Then events are listed newest first
    And each event shows timestamp, resource, severity, and status

  Scenario: Timeline filtered by severity
    Given the timeline has HIGH and LOW events
    When the user filters by severity "HIGH"
    Then only HIGH events are shown
    And the filter state is reflected in the URL for shareability

  Scenario: Timeline filtered by date range
    Given the user selects a date range of "last 30 days"
    When the filter is applied
    Then only events within the last 30 days are shown

  Scenario: Timeline shows remediation events
    Given a drift event was remediated
    When the user views the timeline
    Then the event shows a "Remediated" badge
    And the remediation timestamp and actor are shown

  Scenario: Timeline is empty for new stack
    Given a stack was added 1 hour ago and has no drift history
    When the user views the timeline
    Then an empty state is shown: "No drift history yet"

  Scenario: Timeline pagination
    Given 200 drift events exist for a stack
    When the user views the timeline
    Then the first 50 events are shown
    And infinite scroll or pagination loads more on demand
```

---

## Epic 7: Dashboard API

### Feature: JWT Authentication

```gherkin
Feature: JWT Authentication on Dashboard API
  As a SaaS engineer
  I want all API endpoints protected by JWT
  So that only authenticated users can access tenant data

  Background:
    Given the Dashboard API is running at "https://api.drift.dd0c.io"

  Scenario: Valid JWT grants access
    Given a request includes a valid JWT in the Authorization header
    And the JWT is not expired
    And the JWT signature is valid
    When the request reaches the API
    Then the request is processed
    And the response is returned with HTTP 200

  Scenario: Missing JWT returns 401
    Given a request has no Authorization header
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Authentication required"

  Scenario: Expired JWT returns 401
    Given a request includes a JWT that expired 5 minutes ago
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Token expired"

  Scenario: JWT with invalid signature returns 401
    Given a request includes a JWT with a tampered signature
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Invalid token"

  Scenario: JWT with wrong audience claim returns 401
    Given a request includes a JWT issued for a different service
    When the request reaches the API
    Then the API returns HTTP 401
    And the response body includes "Invalid audience"

  Scenario: JWT tenant_id claim is used for RLS
    Given a JWT contains tenant_id "acme"
    When the request reaches the API
    Then the API sets PostgreSQL session variable "app.tenant_id" to "acme"
    And all queries are automatically scoped to tenant "acme" via RLS
```
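
The 401 cases above (tampered signature, expiry, wrong audience) can be sketched with a minimal HS256 verifier using only the standard library. This is an illustrative model, not a recommendation to hand-roll JWTs in production — a vetted library would be used there; function names and error strings mirror the scenarios:

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def encode_jwt(claims, secret):
    """Produce a signed HS256 token (header.payload.signature)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"

def verify_jwt(token, secret, audience, now=None):
    """Return (claims, None) on success or (None, error) matching the 401 cases."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return None, "Invalid token"
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        return None, "Invalid token"      # tampered signature
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) <= (now if now is not None else time.time()):
        return None, "Token expired"
    if claims.get("aud") != audience:
        return None, "Invalid audience"   # issued for a different service
    return claims, None
```

On success, the `tenant_id` claim is what the API would copy into the `app.tenant_id` session variable before running any query.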

### Feature: Tenant Isolation via RLS

```gherkin
Feature: Tenant Isolation via Row-Level Security
  As a security engineer
  I want the API to enforce tenant isolation at the database level
  So that a bug in application logic cannot leak cross-tenant data

  Background:
    Given the API uses PostgreSQL with RLS on all tenant-scoped tables

  Scenario: User from tenant A cannot access tenant B's stacks
    Given tenant "acme" has stacks ["prod", "staging"]
    And tenant "globex" has stacks ["prod"]
    When a user from "acme" calls GET /stacks
    Then only "acme"'s stacks are returned
    And "globex"'s stack is not included

  Scenario: Cross-tenant drift event access attempt
    Given drift event "drift-001" belongs to tenant "globex"
    When a user from "acme" calls GET /drifts/drift-001
    Then the API returns HTTP 404
    And no data from "globex" is exposed

  Scenario: Cross-tenant stack update attempt
    Given stack "prod" belongs to tenant "globex"
    When a user from "acme" calls PATCH /stacks/prod
    Then the API returns HTTP 404
    And the stack is not modified

  Scenario: RLS is enforced even if application code has a bug
    Given a hypothetical bug causes the API to omit the tenant_id filter in a query
    When the query executes
    Then PostgreSQL RLS still filters rows to the current tenant
    And no cross-tenant data is returned

  Scenario: Tenant isolation for remediation actions
    Given remediation "rem-001" belongs to tenant "globex"
    When a user from "acme" calls POST /remediations/rem-001/approve
    Then the API returns HTTP 404
    And the remediation is not affected
```

### Feature: Stack CRUD

```gherkin
Feature: Stack CRUD Operations
  As a platform engineer
  I want to manage my stacks via the API
  So that I can add, update, and remove stacks programmatically

  Background:
    Given the user is authenticated as a member of tenant "acme"

  Scenario: Create a new Terraform stack
    Given a POST /stacks request with body:
      """
      { "name": "prod-api", "iac_type": "terraform", "region": "us-east-1" }
      """
    When the request is processed
    Then the API returns HTTP 201
    And the response includes the new stack's id and created_at
    And the stack is visible in GET /stacks

  Scenario: Create stack with duplicate name
    Given a stack named "prod-api" already exists for tenant "acme"
    When a POST /stacks request is made with name "prod-api"
    Then the API returns HTTP 409
    And the response body includes "Stack name already exists"

  Scenario: Create stack exceeding free tier limit
    Given the tenant is on the free tier (max 3 stacks)
    And the tenant already has 3 stacks
    When a POST /stacks request is made
    Then the API returns HTTP 402
    And the response body includes "Free tier limit reached. Upgrade to add more stacks."

  Scenario: Update stack configuration
    Given stack "prod-api" exists
    When a PATCH /stacks/prod-api request updates the scan_interval to 30
    Then the API returns HTTP 200
    And the stack's scan_interval is updated to 30
    And the agent receives the updated config on the next heartbeat

  Scenario: Delete a stack
    Given stack "staging" exists with no active remediations
    When a DELETE /stacks/staging request is made
    Then the API returns HTTP 204
    And the stack is removed from GET /stacks
    And associated drift events are soft-deleted (retained for 90 days)

  Scenario: Delete a stack with active remediation
    Given stack "prod-api" has an active remediation in progress
    When a DELETE /stacks/prod-api request is made
    Then the API returns HTTP 409
    And the response body includes "Cannot delete stack with active remediation"

  Scenario: Get stack by ID
    Given stack "prod-api" exists
    When a GET /stacks/prod-api request is made
    Then the API returns HTTP 200
    And the response includes all stack fields including last_scan_at and drift_count
```
|
|||
|
|
|
|||
|
|
### Feature: Drift Event CRUD
|
|||
|
|
|
|||
|
|
```gherkin
|
|||
|
|
Feature: Drift Event API
|
|||
|
|
As a platform engineer
|
|||
|
|
I want to query and manage drift events via the API
|
|||
|
|
So that I can build integrations and automations
|
|||
|
|
|
|||
|
|
Background:
|
|||
|
|
Given the user is authenticated as a member of tenant "acme"
|
|||
|
|
|
|||
|
|
Scenario: List drift events for a stack
|
|||
|
|
Given stack "prod-api" has 10 drift events
|
|||
|
|
When GET /stacks/prod-api/drifts is called
|
|||
|
|
Then the API returns HTTP 200
|
|||
|
|
And the response includes all 10 events
|
|||
|
|
And events are sorted by detected_at descending
|
|||
|
|
|
|||
|
|
Scenario: Filter drift events by severity
|
|||
|
|
Given drift events include HIGH and LOW severity events
|
|||
|
|
When GET /drifts?severity=HIGH is called
|
|||
|
|
Then only HIGH severity events are returned
|
|||
|
|
|
|||
|
|
Scenario: Filter drift events by status
|
|||
|
|
When GET /drifts?status=active is called
|
|||
|
|
Then only unresolved drift events are returned
|
|||
|
|
|
|||
|
|
Scenario: Mark drift event as acknowledged
|
|||
|
|
Given drift event "drift-001" has status "active"
|
|||
|
|
When POST /drifts/drift-001/acknowledge is called
|
|||
|
|
Then the API returns HTTP 200
|
|||
|
|
And the event status is updated to "acknowledged"
|
|||
|
|
And the acknowledged_by and acknowledged_at fields are set
|
|||
|
|
|
|||
|
|
Scenario: Mark drift event as resolved manually
|
|||
|
|
Given drift event "drift-001" has status "active"
|
|||
|
|
When POST /drifts/drift-001/resolve is called with body {"reason": "manual fix applied"}
|
|||
|
|
Then the API returns HTTP 200
|
|||
|
|
And the event status is updated to "resolved"
|
|||
|
|
And the resolution reason is stored
|
|||
|
|
|
|||
|
|
Scenario: Pagination on drift events list
|
|||
|
|
Given 200 drift events exist
|
|||
|
|
When GET /drifts?page=1&per_page=50 is called
|
|||
|
|
Then 50 events are returned
|
|||
|
|
And the response includes pagination metadata (total, page, per_page, total_pages)
|
|||
|
|
```
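
The pagination scenario pins down the metadata shape (total, page, per_page, total_pages). The arithmetic is small but worth making explicit, sketched here in Python (the function name is an assumption; the real service is TypeScript):

```python
import math

def paginate(total: int, page: int, per_page: int) -> dict:
    # Metadata matching the scenario: 200 events at 50/page -> 4 pages.
    return {
        "total": total,
        "page": page,
        "per_page": per_page,
        "total_pages": math.ceil(total / per_page) if per_page else 0,
    }
```

Note the ceiling division: 201 events at 50 per page is 5 pages, not 4.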

### Feature: API Rate Limiting

```gherkin
Feature: API Rate Limiting
  As a SaaS operator
  I want API rate limits enforced per tenant
  So that one tenant cannot degrade service for others

  Background:
    Given the API enforces rate limits per tenant

  Scenario: Request within rate limit succeeds
    Given the rate limit is 1000 requests per minute
    And the tenant has made 500 requests this minute
    When a new request is made
    Then the API returns HTTP 200
    And the response includes headers X-RateLimit-Remaining and X-RateLimit-Reset

  Scenario: Request exceeds rate limit
    Given the tenant has made 1000 requests this minute
    When a new request is made
    Then the API returns HTTP 429
    And the response includes Retry-After header
    And the response body includes "Rate limit exceeded"

  Scenario: Rate limit resets after window
    Given the tenant hit the rate limit at T+0
    When 60 seconds elapse (new window)
    Then the tenant's request counter resets
    And new requests succeed
```
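
The three scenarios above describe a per-tenant fixed-window counter: count within a 60-second window, return 429 with Retry-After once the limit is hit, and reset at the window boundary. A minimal Python sketch under those assumptions (the real service would back the counters with shared storage such as Redis so all API nodes agree):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Per-tenant fixed-window counter; illustrative, not the service code."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (tenant, window_start) -> count

    def check(self, tenant: str, now: float = None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        key = (tenant, window_start)
        if self.counts[key] >= self.limit:
            # Over the limit: 429 plus how long until the window resets.
            retry_after = int(window_start + self.window - now) + 1
            return 429, {"Retry-After": retry_after}
        self.counts[key] += 1
        return 200, {"X-RateLimit-Remaining": self.limit - self.counts[key],
                     "X-RateLimit-Reset": window_start + self.window}
```

Because the key includes the window start, the counter "resets" simply by moving to a new key when the next window begins.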

---

## Epic 8: Infrastructure

### Feature: Terraform/CDK SaaS Infrastructure Provisioning

```gherkin
Feature: SaaS Infrastructure Provisioning
  As a SaaS platform engineer
  I want the SaaS infrastructure defined as code
  So that environments are reproducible and auditable

  Background:
    Given the infrastructure code lives in "infra/" directory
    And Terraform and CDK are both used for different layers

  Scenario: Terraform plan produces no unexpected changes in production
    Given the production Terraform state is up to date
    When "terraform plan" runs against the production workspace
    Then the plan shows zero resource changes
    And the plan output is stored as a CI artifact

  Scenario: New environment provisioned from scratch
    Given a new environment "staging-eu" is needed
    When "terraform apply -var-file=staging-eu.tfvars" runs
    Then all required resources are created (VPC, RDS, SQS, ECS, etc.)
    And the environment is reachable within 15 minutes
    And outputs (queue URLs, DB endpoints) are stored in SSM Parameter Store

  Scenario: RDS instance is provisioned with encryption at rest
    Given the Terraform module for RDS is applied
    When the RDS instance is created
    Then storage_encrypted is true
    And the KMS key ARN is set to the tenant-specific key

  Scenario: SQS FIFO queues are provisioned with DLQ
    Given the SQS Terraform module is applied
    When the queues are created
    Then "drift-reports.fifo" exists with content_based_deduplication enabled
    And "drift-reports-dlq.fifo" exists as the redrive target
    And maxReceiveCount is set to 3

  Scenario: CDK stack drift detected by drift agent (dogfooding)
    Given the SaaS CDK stacks are monitored by the drift agent itself
    When a CDK resource is manually modified in the AWS console
    Then the drift agent detects the change
    And an internal alert is sent to the SaaS ops channel

  Scenario: Infrastructure destroy is blocked in production
    Given a Terraform workspace is tagged as "production"
    When "terraform destroy" is attempted
    Then the CI pipeline rejects the command
    And logs "destroy blocked in production environment"
```
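
The last scenario implies a small guard in the CI pipeline that inspects the command and the workspace tags before letting Terraform run. A minimal sketch of that check, assuming a tag set per workspace (the function name and return shape are illustrative, not the real pipeline script):

```python
def guard_destroy(command: str, workspace_tags: set) -> tuple:
    """Reject 'terraform destroy' against any workspace tagged production."""
    if "destroy" in command.split() and "production" in workspace_tags:
        return False, "destroy blocked in production environment"
    return True, "ok"
```

In CI this would run before the Terraform step and fail the job when the first element is False, emitting the message as the log line the scenario asserts on.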

### Feature: GitHub Actions CI/CD

```gherkin
Feature: GitHub Actions CI/CD Pipeline
  As a platform engineer
  I want automated CI/CD via GitHub Actions
  So that code changes are tested and deployed safely

  Background:
    Given GitHub Actions workflows are defined in ".github/workflows/"

  Scenario: PR triggers CI checks
    Given a pull request is opened against the main branch
    When the CI workflow triggers
    Then unit tests run for Go agent code
    And unit tests run for TypeScript SaaS code
    And Terraform plan runs for infrastructure changes
    And all checks must pass before merge is allowed

  Scenario: Merge to main triggers staging deployment
    Given a PR is merged to the main branch
    When the deploy workflow triggers
    Then the Go agent binary is built and pushed to ECR
    And the TypeScript services are built and deployed to ECS staging
    And smoke tests run against staging
    And the deployment is marked successful if smoke tests pass

  Scenario: Staging smoke tests fail — production deploy blocked
    Given staging deployment completed
    And smoke tests fail (e.g., health check returns 500)
    When the pipeline evaluates the smoke test results
    Then the production deployment step is skipped
    And a Slack alert is sent to "#deployments" channel
    And the pipeline exits with failure

  Scenario: Production deployment requires manual approval
    Given staging smoke tests passed
    When the pipeline reaches the production deployment step
    Then it pauses and waits for a manual approval in GitHub Actions
    And the approval request is sent to the "production-deployers" team
    And deployment proceeds only after approval

  Scenario: Rollback triggered on production health check failure
    Given a production deployment completed
    And the post-deploy health check fails within 5 minutes
    When the rollback workflow triggers
    Then the previous ECS task definition revision is redeployed
    And a Slack alert is sent: "Production rollback triggered"
    And the failed deployment is logged with the commit SHA

  Scenario: Terraform plan diff is posted to PR as comment
    Given a PR modifies infrastructure code
    When the CI Terraform plan runs
    Then the plan output is posted as a comment on the PR
    And the comment includes a summary of resources to add/change/destroy

  Scenario: Secrets are never logged in CI
    Given the CI pipeline uses AWS credentials and Slack tokens
    When any CI step runs
    Then no secret values appear in the GitHub Actions log output
    And GitHub secret masking is verified in the workflow config
```

### Feature: Database Migrations in CI/CD

```gherkin
Feature: Database Migrations
  As a SaaS engineer
  I want database migrations to run automatically in CI/CD
  So that schema changes are applied safely and consistently

  Background:
    Given migrations are managed with a migration tool (e.g., golang-migrate or Flyway)

  Scenario: Migration runs successfully on deploy
    Given a new migration file "V20_add_remediation_status.sql" exists
    When the deploy pipeline runs
    Then the migration is applied to the target database
    And the migration is recorded in the schema_migrations table
    And the deploy continues

  Scenario: Migration is idempotent — already-applied migration is skipped
    Given migration "V20_add_remediation_status.sql" was already applied
    When the deploy pipeline runs again
    Then the migration is skipped
    And no error is thrown

  Scenario: Migration fails — deploy is halted
    Given a migration contains a syntax error
    When the migration runs
    Then the migration fails and is rolled back
    And the deploy pipeline halts
    And an alert is sent to the engineering team

  Scenario: Additive-only migrations enforced
    Given a migration attempts to drop a column
    When the CI linter checks the migration
    Then the lint check fails with "destructive migration not allowed"
    And the PR is blocked from merging
```
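
The idempotency scenario hinges on the schema_migrations ledger: anything already recorded there is skipped without error, anything new is applied in order. A minimal sketch of that loop (the filename "V19_create_stacks.sql" in the usage note is hypothetical; real tools like golang-migrate or Flyway implement this themselves):

```python
def run_pending(migrations: list, applied: set) -> list:
    """Apply migrations not yet recorded in schema_migrations, in order."""
    ran = []
    for name in sorted(migrations):
        if name in applied:
            continue  # already applied on an earlier deploy: skip, no error
        # ... execute the migration SQL inside a transaction here ...
        applied.add(name)  # record in the schema_migrations table
        ran.append(name)
    return ran
```

Running the same deploy twice therefore applies "V20_add_remediation_status.sql" exactly once; the second pass returns an empty list.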

---

## Epic 9: Onboarding & PLG

### Feature: drift init CLI

```gherkin
Feature: drift init CLI Onboarding
  As a new user
  I want a guided CLI setup experience
  So that I can connect my infrastructure to drift in minutes

  Background:
    Given the drift CLI is installed via "curl -sSL https://get.drift.dd0c.io | sh"

  Scenario: First-time drift init runs guided setup
    Given the user runs "drift init" for the first time
    When the CLI starts
    Then it prompts for: cloud provider, IaC type, region, and stack name
    And validates each input before proceeding
    And generates a config file at "/etc/drift/config.yaml"

  Scenario: drift init detects existing Terraform state
    Given a Terraform state file exists in the current directory
    When the user runs "drift init"
    Then the CLI auto-detects the IaC type as "terraform"
    And pre-fills the stack name from the Terraform workspace name
    And asks the user to confirm

  Scenario: drift init creates IAM role with least-privilege policy
    Given the user confirms IAM role creation
    When "drift init" runs
    Then it creates an IAM role "drift-agent-role" with only required permissions
    And outputs the role ARN for the config

  Scenario: drift init generates and installs agent certificates
    Given the user has authenticated with the SaaS backend
    When "drift init" completes
    Then it fetches a signed mTLS certificate from the SaaS CA
    And stores the certificate at "/etc/drift/certs/agent.crt"
    And stores the private key at "/etc/drift/certs/agent.key" with mode 0600

  Scenario: drift init installs agent as systemd service
    Given the user is on a Linux system with systemd
    When "drift init" completes
    Then it installs a systemd unit file for the drift agent
    And enables and starts the service
    And confirms "drift-agent is running" in the output

  Scenario: drift init fails gracefully on missing AWS credentials
    Given no AWS credentials are configured
    When "drift init" runs
    Then it detects missing credentials
    And outputs a helpful error: "AWS credentials not found. Run 'aws configure' first."
    And exits with code 1

  Scenario: drift init --dry-run shows what would be created
    Given the user runs "drift init --dry-run"
    When the CLI runs
    Then it outputs all actions it would take without executing them
    And no resources are created
    And no config files are written
```
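
The --dry-run scenario is the classic plan/execute split: build the full list of actions up front, print it either way, and perform side effects only when not in dry-run mode. A minimal sketch under that assumption (the action strings and function signature are illustrative; the real CLI is not this Python code):

```python
PLANNED_ACTIONS = [
    "create IAM role drift-agent-role",
    "write config to /etc/drift/config.yaml",
    "install systemd unit drift-agent.service",
]

def drift_init(dry_run: bool, execute=lambda action: None) -> list:
    """Return the plan; perform side effects only when dry_run is False."""
    if not dry_run:
        for action in PLANNED_ACTIONS:
            execute(action)
    return PLANNED_ACTIONS
```

Injecting the executor makes the "no resources are created" assertion directly testable: in dry-run mode the callback is never invoked.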

### Feature: Free Tier — 3 Stacks

```gherkin
Feature: Free Tier Stack Limit
  As a product manager
  I want the free tier limited to 3 stacks
  So that we have a clear upgrade path

  Background:
    Given a tenant is on the free tier

  Scenario: Free tier tenant can add up to 3 stacks
    Given the tenant has 0 stacks
    When they add stacks "prod", "staging", and "dev"
    Then all 3 stacks are created successfully
    And the tenant is not prompted to upgrade

  Scenario: Free tier tenant blocked from adding 4th stack
    Given the tenant has 3 stacks
    When they attempt to add a 4th stack via the CLI
    Then the CLI outputs "Free tier limit reached (3/3 stacks). Upgrade at https://app.drift.dd0c.io/billing"
    And exits with code 1

  Scenario: Free tier tenant blocked from adding 4th stack via API
    Given the tenant has 3 stacks
    When POST /stacks is called
    Then the API returns HTTP 402
    And the response includes upgrade_url

  Scenario: Free tier tenant blocked from adding 4th stack via dashboard
    Given the tenant has 3 stacks
    When they click "Add Stack" in the dashboard
    Then a modal appears: "You've reached the free tier limit"
    And an "Upgrade Plan" button is shown

  Scenario: Upgrading to paid tier unlocks unlimited stacks
    Given the tenant upgrades to the paid plan via Stripe
    When the Stripe webhook confirms payment
    Then the tenant's stack limit is set to unlimited
    And they can immediately add a 4th stack
```
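
CLI, API, and dashboard all enforce the same rule, so the check naturally lives in one shared function. A minimal sketch of that shared check (names and return shape are assumptions; the API layer would map the 402 to the HTTP response with an upgrade_url):

```python
FREE_TIER_STACK_LIMIT = 3

def can_add_stack(plan: str, current_stacks: int):
    """One check behind the CLI error, the API's 402, and the dashboard modal."""
    if plan == "free" and current_stacks >= FREE_TIER_STACK_LIMIT:
        return False, 402
    return True, None  # paid plans are unlimited
```

Centralizing the check is what keeps the three surfaces consistent when the limit changes.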

### Feature: Stripe Billing

```gherkin
Feature: Stripe Billing Integration
  As a product manager
  I want usage-based billing via Stripe
  So that customers are charged $29/stack/month

  Background:
    Given Stripe is configured with the drift product and price

  Scenario: New tenant subscribes to paid plan
    Given a free tier tenant clicks "Upgrade"
    When they complete the Stripe Checkout flow
    Then a Stripe subscription is created for the tenant
    And the subscription includes a metered item for stack count
    And the tenant's plan is updated to "paid" in the database

  Scenario: Monthly invoice calculated at $29/stack
    Given a tenant has 5 stacks active for the full billing month
    When Stripe generates the monthly invoice
    Then the invoice total is $145.00 (5 × $29)
    And the invoice is sent to the tenant's billing email

  Scenario: Stack added mid-month — prorated charge
    Given a tenant adds a 6th stack on the 15th of the month
    When Stripe generates the monthly invoice
    Then the 6th stack is charged prorated (~$14.50 for half month)

  Scenario: Stack deleted mid-month — prorated credit
    Given a tenant deletes a stack on the 10th of the month
    When Stripe generates the monthly invoice
    Then a prorated credit is applied for the unused days

  Scenario: Payment fails — tenant notified and grace period applied
    Given a tenant's payment method is declined
    When Stripe sends the payment_failed webhook
    Then the tenant receives an email: "Payment failed — please update your billing info"
    And a 7-day grace period is applied before service is restricted

  Scenario: Grace period expires — stacks suspended
    Given a tenant's payment has been failing for 7 days
    When the grace period expires
    Then the tenant's stacks are suspended (scans paused)
    And the dashboard shows a banner: "Account suspended — payment required"
    And the agent stops sending reports

  Scenario: Payment updated — service restored immediately
    Given a tenant's stacks are suspended due to non-payment
    When the tenant updates their payment method and payment succeeds
    Then the Stripe webhook triggers service restoration
    And stacks are unsuspended within 60 seconds
    And scans resume on the next scheduled cycle

  Scenario: Stripe webhook signature validation
    Given a webhook arrives at POST /webhooks/stripe
    When the webhook signature is invalid
    Then the API returns HTTP 400
    And the event is ignored
    And the attempt is logged as a potential spoofing attempt

  Scenario: Free tier tenant is never charged
    Given a tenant is on the free tier with 3 stacks
    When the billing cycle runs
    Then no Stripe invoice is generated for this tenant
    And no charge is made
```
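
The invoice scenarios pin down the pricing math: 5 full-month stacks total $145.00, and half a month of one stack is roughly $14.50. Stripe computes proration itself in production; this sketch just makes the day-based arithmetic the scenarios assert on explicit (a 30-day month is an assumption for the half-month example):

```python
PRICE_PER_STACK = 29.00  # $/stack/month

def stack_charge(days_active: int, days_in_month: int = 30) -> float:
    """Day-based proration for a single stack, rounded to cents."""
    return round(PRICE_PER_STACK * days_active / days_in_month, 2)
```

A stack active all 30 days costs the full $29.00; five of them make the $145.00 invoice, and 15 days of the 6th stack comes to $14.50.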

### Feature: Guided Setup Flow

```gherkin
Feature: Guided Setup Flow in Dashboard
  As a new user
  I want a step-by-step setup guide in the dashboard
  So that I can get value from drift quickly

  Background:
    Given a new tenant has just signed up and logged in

  Scenario: Onboarding checklist is shown to new tenants
    Given the tenant has completed 0 onboarding steps
    When they log in for the first time
    Then an onboarding checklist is shown with steps:
      | Step                   | Status  |
      | Install drift agent    | Pending |
      | Add your first stack   | Pending |
      | Configure Slack alerts | Pending |
      | Run your first scan    | Pending |

  Scenario: Checklist step marked complete automatically
    Given the tenant installs the agent and it sends its first heartbeat
    When the dashboard refreshes
    Then the "Install drift agent" step is marked complete
    And a congratulatory message is shown

  Scenario: Onboarding checklist dismissed after all steps complete
    Given all 4 onboarding steps are complete
    When the tenant views the dashboard
    Then the checklist is replaced with the normal stack overview
    And a one-time "You're all set!" banner is shown

  Scenario: Onboarding checklist can be dismissed early
    Given the tenant has completed 2 of 4 steps
    When they click "Dismiss checklist"
    Then the checklist is hidden
    And a "Resume setup" link is available in the settings page
```

---

## Epic 10: Transparent Factory

### Feature: Feature Flags

```gherkin
Feature: Feature Flags
  As a product engineer
  I want feature flags to control rollout of new capabilities
  So that I can ship safely and roll back instantly

  Background:
    Given the feature flag service is running
    And flags are evaluated per-tenant

  Scenario: Feature flag enabled for specific tenant
    Given flag "remediation_v2" is enabled for tenant "acme"
    And flag "remediation_v2" is disabled for tenant "globex"
    When tenant "acme" triggers a remediation
    Then the v2 remediation code path is used
    When tenant "globex" triggers a remediation
    Then the v1 remediation code path is used

  Scenario: Feature flag enabled for percentage rollout
    Given flag "new_diff_viewer" is enabled for 10% of tenants
    When 1000 tenants load the dashboard
    Then approximately 100 tenants see the new diff viewer
    And the remaining 900 see the existing diff viewer

  Scenario: Feature flag disabled globally kills a feature
    Given flag "experimental_pulumi_scan" is globally disabled
    When any tenant attempts to add a Pulumi stack
    Then the API returns HTTP 501 "Feature not available"
    And the dashboard hides the Pulumi option in the stack type selector

  Scenario: Feature flag change takes effect without deployment
    Given flag "slack_digest_v2" is currently disabled
    When an operator enables the flag in the flag management console
    Then within 30 seconds, the notification engine uses the v2 digest format
    And no service restart is required

  Scenario: Feature flag evaluation is logged for audit
    Given flag "remediation_v2" is evaluated for tenant "acme"
    When the flag is checked
    Then the evaluation (flag name, tenant, result, timestamp) is written to the audit log
    And the audit log is queryable for compliance review

  Scenario: Unknown feature flag defaults to disabled
    Given code checks for flag "nonexistent_flag"
    When the flag service evaluates it
    Then the result is "disabled" (safe default)
    And a warning is logged: "unknown flag: nonexistent_flag"
```
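
The percentage-rollout scenario requires two properties: the same tenant always gets the same answer (no flicker between page loads), and roughly 10% of tenants land in the enabled group. Hashing the flag name together with the tenant id into a stable 0-99 bucket gives both. A minimal sketch, assuming this hashing scheme (the real flag service may bucket differently):

```python
import hashlib

def flag_enabled(flag: str, tenant_id: str, percent: int) -> bool:
    """Stable 0-99 bucket per (flag, tenant); enabled below the rollout %."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Including the flag name in the hash decorrelates rollouts, so the 10% of tenants who see "new_diff_viewer" are not the same 10% who get every other experiment.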

### Feature: Additive Schema Migrations

```gherkin
Feature: Additive-Only Schema Migrations
  As a SaaS engineer
  I want all schema changes to be additive
  So that deployments are zero-downtime and rollback-safe

  Background:
    Given the migration linter runs in CI on every PR

  Scenario: Adding a new nullable column is allowed
    Given a migration adds column "remediation_status VARCHAR(50) NULL"
    When the migration linter checks the file
    Then the lint check passes
    And the migration is approved for merge

  Scenario: Adding a new table is allowed
    Given a migration creates a new table "decision_logs"
    When the migration linter checks the file
    Then the lint check passes

  Scenario: Dropping a column is blocked
    Given a migration contains "ALTER TABLE drift_events DROP COLUMN old_field"
    When the migration linter checks the file
    Then the lint check fails with "destructive operation: DROP COLUMN not allowed"
    And the PR is blocked

  Scenario: Dropping a table is blocked
    Given a migration contains "DROP TABLE legacy_alerts"
    When the migration linter checks the file
    Then the lint check fails with "destructive operation: DROP TABLE not allowed"

  Scenario: Renaming a column is blocked
    Given a migration contains "ALTER TABLE stacks RENAME COLUMN name TO stack_name"
    When the migration linter checks the file
    Then the lint check fails with "destructive operation: RENAME COLUMN not allowed"
    And the suggested alternative is to add a new column and deprecate the old one

  Scenario: Adding a NOT NULL column without default is blocked
    Given a migration adds "ALTER TABLE stacks ADD COLUMN owner_id UUID NOT NULL"
    When the migration linter checks the file
    Then the lint check fails with "NOT NULL column without DEFAULT will break existing rows"

  Scenario: Old column marked deprecated — not yet removed
    Given column "legacy_iac_path" is marked with a deprecation comment in the schema
    When the application code is deployed
    Then the column still exists in the database
    And the application ignores it
    And a deprecation notice is logged at startup

  Scenario: Column removal only after 2 release cycles
    Given column "legacy_iac_path" has been deprecated for 2 releases
    And all application code no longer references it
    When an engineer submits a migration to drop the column
    Then the migration linter checks the deprecation age
    And allows the drop if the deprecation period has elapsed
```
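
The linter behavior above reduces to pattern-matching the migration SQL for the blocked operations. A minimal sketch of that check (the pattern list is illustrative and not exhaustive; a production linter would parse the SQL rather than regex it, and would also implement the deprecation-age exception from the last scenario):

```python
import re

# (pattern, message) pairs for the destructive operations the scenarios block.
DESTRUCTIVE = [
    (r"\bDROP\s+COLUMN\b", "DROP COLUMN not allowed"),
    (r"\bDROP\s+TABLE\b", "DROP TABLE not allowed"),
    (r"\bRENAME\s+COLUMN\b", "RENAME COLUMN not allowed"),
]

def lint_migration(sql: str) -> list:
    """Return lint errors for a migration; empty list means it passes."""
    errors = [f"destructive operation: {msg}"
              for pattern, msg in DESTRUCTIVE
              if re.search(pattern, sql, re.IGNORECASE)]
    # NOT NULL without a DEFAULT would fail on tables with existing rows.
    if re.search(r"\bADD\s+COLUMN\b(?!.*\bDEFAULT\b).*\bNOT\s+NULL\b",
                 sql, re.IGNORECASE):
        errors.append("NOT NULL column without DEFAULT will break existing rows")
    return errors
```

CI would run this over every changed file under the migrations directory and block the PR when any file returns a non-empty error list.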

### Feature: Decision Logs

```gherkin
Feature: Decision Logs
  As an engineering lead
  I want architectural and operational decisions logged
  So that the team has a transparent record of why things are the way they are

  Background:
    Given the decision log is stored in "docs/decisions/" as markdown files

  Scenario: New ADR created for significant architectural change
    Given an engineer proposes switching from SQS to Kafka
    When they create "docs/decisions/ADR-042-kafka-vs-sqs.md"
    Then the ADR includes: context, decision, consequences, and status
    And the PR requires at least 2 reviewers from the architecture group

  Scenario: ADR status transitions are tracked
    Given ADR-042 has status "proposed"
    When the team accepts the decision
    Then the status is updated to "accepted"
    And the accepted_at date is recorded
    And the ADR is immutable after acceptance (changes require a new ADR)

  Scenario: Superseded ADR is linked to its replacement
    Given ADR-010 is superseded by ADR-042
    When ADR-042 is accepted
    Then ADR-010's status is updated to "superseded"
    And ADR-010 includes a link to ADR-042

  Scenario: Decision log is searchable
    Given 50 ADRs exist in the decision log
    When an engineer searches for "database"
    Then all ADRs mentioning "database" in title or body are returned

  Scenario: Operational decisions logged for drift remediation
    Given an operator manually overrides a remediation decision
    When the override is applied
    Then a decision log entry is created with: operator identity, reason, timestamp, and affected resource
    And the entry is linked to the drift event
```
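
The status scenarios define a small state machine over the three statuses the feature names (proposed, accepted, superseded): a proposed ADR can be accepted, an accepted ADR is immutable except for being superseded by a newer one. A minimal sketch of that transition check (names are illustrative):

```python
# The only legal moves in the ADR lifecycle described above.
ALLOWED_TRANSITIONS = {
    ("proposed", "accepted"),
    ("accepted", "superseded"),
}

def transition(current: str, new: str) -> str:
    """Accepted ADRs are immutable; the only change left is supersession."""
    if (current, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"invalid ADR transition: {current} -> {new}")
    return new
```

Encoding the transitions as a set makes "immutable after acceptance" a consequence of the data rather than scattered if-statements.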
|
|||
|
|
|
|||
|
|
### Feature: OTEL Tracing
|
|||
|
|
|
|||
|
|
```gherkin
|
|||
|
|
Feature: OpenTelemetry Distributed Tracing
|
|||
|
|
  As a SaaS engineer
  I want end-to-end distributed tracing via OTEL
  So that I can diagnose latency and errors across services

  Background:
    Given OTEL is configured with a Jaeger/Tempo backend
    And all services export traces

  Scenario: Drift report ingestion is fully traced
    Given an agent publishes a drift report to SQS
    When the event processor consumes and processes the message
    Then a trace exists spanning: SQS receive → normalization → severity scoring → DB write
    And each span includes tenant_id and scan_id as attributes
    And the total trace duration is under 2 seconds for normal reports

  Scenario: Slack notification is traced end-to-end
    Given a drift event triggers a Slack notification
    When the notification is sent
    Then a trace exists spanning: event stored → notification engine → Slack API call
    And the Slack API response code is recorded as a span attribute

  Scenario: Remediation flow is fully traced
    Given a remediation is triggered from Slack
    When the remediation completes
    Then a trace exists spanning: Slack interaction → API → control plane → agent → result
    And the trace includes the approver identity and approval timestamp

  Scenario: Slow span triggers latency alert
    Given the DB write span exceeds 500ms
    When the trace is analyzed
    Then a latency alert fires in the observability platform
    And the alert includes the trace_id for direct investigation

  Scenario: Trace context propagated across service boundaries
    Given the agent sends a drift report with a trace context header
    When the event processor receives the message
    Then it extracts the trace context from the SQS message attributes
    And continues the trace as a child span
    And the full trace is visible as a single tree in Jaeger

  Scenario: Traces do not contain PII or secrets
    Given a drift report is processed end-to-end
    When the trace is exported to the tracing backend
    Then no span attributes contain secret values
    And no span attributes contain tenant PII beyond tenant_id
    And the scrubber audit confirms 0 secrets in trace data

  Scenario: OTEL collector is unavailable — service continues
    Given the OTEL collector is down
    When the event processor handles a drift report
    Then the report is processed normally
    And trace export failures are logged at DEBUG level
    And no errors are surfaced to the end user
```
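The context-propagation scenario hinges on carrying a trace context header inside SQS message attributes so the consumer can continue the producer's trace. A minimal Python sketch of that mechanic, assuming W3C `traceparent` formatting handled by hand rather than the real OTEL SDK (the function names `inject_trace_context` and `extract_trace_context` are illustrative, not part of any spec):

```python
import re
import secrets

# W3C traceparent: version "00" - 32-hex trace_id - 16-hex span_id - 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject_trace_context(msg_attrs: dict, trace_id: str, span_id: str) -> None:
    """Producer side: attach a traceparent header as an SQS message attribute."""
    msg_attrs["traceparent"] = {
        "DataType": "String",
        "StringValue": f"00-{trace_id}-{span_id}-01",
    }

def extract_trace_context(msg_attrs: dict):
    """Consumer side: parse the traceparent attribute so the processor's span
    can be created as a child (same trace_id, parent = the agent's span_id)."""
    value = msg_attrs.get("traceparent", {}).get("StringValue", "")
    m = TRACEPARENT_RE.match(value)
    if m is None:
        return None  # no context: start a fresh root trace instead
    trace_id, parent_span_id, _flags = m.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id}

# Agent publishes with context; event processor continues the same trace tree.
attrs: dict = {}
trace_id = secrets.token_hex(16)  # 32 hex chars
span_id = secrets.token_hex(8)    # 16 hex chars
inject_trace_context(attrs, trace_id, span_id)
ctx = extract_trace_context(attrs)
assert ctx["trace_id"] == trace_id  # child span joins the agent's trace
```

In a real service the OTEL SDK's propagator would do the inject/extract, but the invariant under test is the same: the trace_id survives the queue hop, so Jaeger renders one tree.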

### Feature: Governance / Panic Mode

```gherkin
Feature: Panic Mode
  As a SaaS operator
  I want a panic mode that halts all automated actions
  So that I can freeze the system during a security incident or outage

  Background:
    Given the panic mode toggle is available in the ops console

  Scenario: Operator activates panic mode globally
    Given panic mode is currently inactive
    When an operator activates panic mode with reason "security incident"
    Then all automated remediations are immediately halted
    And all pending remediation commands are cancelled
    And a Slack alert is sent to "#ops-critical": "⚠️ PANIC MODE ACTIVATED by @operator"
    And the reason and operator identity are logged

  Scenario: Panic mode blocks new remediations
    Given panic mode is active
    When a user clicks "Revert to IaC" in Slack
    Then the SaaS backend rejects the action
    And the user sees: "System is in panic mode — automated actions are disabled"

  Scenario: Panic mode blocks agent remediation commands
    Given panic mode is active
    And an agent receives a remediation command (e.g., from a race condition)
    When the agent checks panic mode status
    Then the agent rejects the command
    And logs "remediation blocked: panic mode active" at WARN level

  Scenario: Panic mode does NOT block drift scanning
    Given panic mode is active
    When the next scan cycle runs
    Then the agent continues scanning normally
    And drift events continue to be reported and stored
    And notifications continue to be sent (read-only operations are unaffected)

  Scenario: Panic mode deactivated by authorized operator
    Given panic mode is active
    When an authorized operator deactivates panic mode
    Then automated remediations are re-enabled
    And a Slack alert is sent: "✅ PANIC MODE DEACTIVATED by @operator"
    And the deactivation is logged with timestamp and operator identity

  Scenario: Panic mode activation requires elevated role
    Given a regular user attempts to activate panic mode
    When they call POST /ops/panic-mode
    Then the API returns HTTP 403
    And the attempt is logged as a security event

  Scenario: Panic mode state is persisted across restarts
    Given panic mode is active
    When the SaaS backend restarts
    Then panic mode remains active after restart
    And the system does not auto-deactivate panic mode on restart

  Scenario: Tenant-level panic mode
    Given tenant "acme" is experiencing an incident
    When an operator activates panic mode for tenant "acme" only
    Then only "acme"'s remediations are halted
    And other tenants are unaffected
    And "acme"'s dashboard shows a panic mode banner
```
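The scenarios above pin down four properties of panic mode: role-gated activation, global vs. tenant scope, persistence across restarts, and a block that applies to remediations but not to read-only work. A minimal Python sketch satisfying all four, assuming a JSON file stands in for the real database-backed state (the class and method names are illustrative):

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

class PanicMode:
    """Persisted panic-mode state: survives restarts, scoped globally or per tenant."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        self.state = {"global": False, "tenants": []}
        if state_file.exists():
            # Restart path: reload persisted state; never auto-deactivate.
            self.state = json.loads(state_file.read_text())

    def activate(self, operator_role: str, tenant: Optional[str] = None) -> None:
        if operator_role != "operator":
            raise PermissionError("403: panic mode requires an elevated role")
        if tenant is None:
            self.state["global"] = True
        elif tenant not in self.state["tenants"]:
            self.state["tenants"].append(tenant)
        self.state_file.write_text(json.dumps(self.state))

    def blocks(self, action: str, tenant: str) -> bool:
        """Remediations are blocked; scanning and notifications continue."""
        if action != "remediate":
            return False
        return self.state["global"] or tenant in self.state["tenants"]

# Tenant-scoped activation: only "acme" is halted, read-only work continues,
# and a fresh instance (simulated restart) sees the same state.
state_path = Path(tempfile.mkdtemp()) / "panic.json"
pm = PanicMode(state_path)
pm.activate("operator", tenant="acme")
assert pm.blocks("remediate", "acme")
assert not pm.blocks("remediate", "other-tenant")
assert not pm.blocks("scan", "acme")
assert PanicMode(state_path).blocks("remediate", "acme")
```

The agent-side check in the race-condition scenario is the same `blocks("remediate", …)` call performed just before executing a command, so a command issued moments before activation is still refused.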

### Feature: Observability — Metrics and Alerts

```gherkin
Feature: Operational Metrics and Alerting
  As a SaaS operator
  I want key metrics exported and alerting configured
  So that I can detect and respond to production issues proactively

  Background:
    Given metrics are exported to CloudWatch and/or Prometheus

  Scenario: Drift report processing latency metric
    Given drift reports are being processed
    When the event processor handles each report
    Then a histogram metric "drift_report_processing_duration_ms" is recorded
    And a P99 alert fires if latency exceeds 5000ms

  Scenario: DLQ depth metric triggers alert
    Given the DLQ depth exceeds 0
    When the CloudWatch alarm evaluates
    Then a PagerDuty alert fires within 5 minutes
    And the alert includes the queue name and message count

  Scenario: Agent offline metric
    Given an agent has not sent a heartbeat for 5 minutes
    When the heartbeat monitor checks
    Then a metric "agents_offline_count" is incremented
    And if any agent is offline for more than 15 minutes, an alert fires

  Scenario: Secret scrubber miss rate metric
    Given the scrubber processes drift reports
    When a scrubber audit runs
    Then a metric "scrubber_miss_rate" is recorded
    And if the miss rate is ever > 0%, a CRITICAL alert fires immediately

  Scenario: Stripe webhook processing metric
    Given Stripe webhooks are received
    When each webhook is processed
    Then a counter "stripe_webhooks_processed_total" is incremented by event type
    And a counter "stripe_webhooks_failed_total" is incremented on failures
    And an alert fires if the failure rate exceeds 1% over 5 minutes

  Scenario: Database connection pool metric
    Given the application maintains a PostgreSQL connection pool
    When pool utilization exceeds 80%
    Then a warning alert fires
    And when utilization exceeds 95%, a critical alert fires
```
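Two of the alert conditions above are threshold checks over windowed samples: P99 latency against 5000ms, and webhook failure rate against 1%. A small Python sketch of how such checks could be evaluated, assuming nearest-rank percentile semantics (the helper names `latency_alert` and `webhook_failure_alert` are illustrative; a real deployment would express these as CloudWatch alarms or Prometheus alerting rules):

```python
import math

def p99(samples):
    """Nearest-rank P99 of a list of duration samples (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[max(rank, 0)]

def latency_alert(samples, threshold_ms=5000):
    """Fires when the drift_report_processing_duration_ms P99 exceeds the SLO."""
    return p99(samples) > threshold_ms

def webhook_failure_alert(processed, failed, threshold=0.01):
    """Fires when the webhook failure rate over the window exceeds 1%."""
    return processed > 0 and (failed / processed) > threshold

# One slow outlier does not trip a P99 alert over 100 samples, but a
# sustained tail (two or more slow reports here) does.
healthy = [100] * 100
degraded = [100] * 98 + [6000, 7000]
assert not latency_alert(healthy)
assert latency_alert(degraded)
assert webhook_failure_alert(processed=1000, failed=25)      # 2.5% > 1%
assert not webhook_failure_alert(processed=1000, failed=5)   # 0.5% <= 1%
```

The scrubber-miss-rate alert is the degenerate case of the same pattern with a threshold of zero: any nonzero sample fires immediately.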

### Feature: Cross-Tenant Isolation Audit

```gherkin
Feature: Cross-Tenant Isolation Audit
  As a security engineer
  I want automated tests that verify cross-tenant isolation
  So that data leakage between tenants is caught before production

  Background:
    Given the test suite includes cross-tenant isolation tests
    And two test tenants "tenant-a" and "tenant-b" exist with separate data

  Scenario: API cross-tenant read isolation
    Given tenant-a has drift event "drift-a-001"
    When tenant-b's JWT is used to call GET /drifts/drift-a-001
    Then the API returns HTTP 404
    And no data from tenant-a is present in the response body

  Scenario: API cross-tenant write isolation
    Given tenant-a has stack "prod"
    When tenant-b's JWT is used to call DELETE /stacks/prod
    Then the API returns HTTP 404
    And tenant-a's stack is not deleted

  Scenario: Database RLS cross-tenant query isolation
    Given a direct database query runs with app.tenant_id set to "tenant-b"
    When the query selects all rows from drift_events
    Then zero rows from tenant-a are returned
    And the query does not error

  Scenario: SQS message from tenant-a cannot be processed as tenant-b
    Given a drift report message from tenant-a arrives on the queue
    When the event processor reads the tenant_id from the message
    Then the event is stored under tenant-a's tenant_id
    And not under tenant-b's tenant_id

  Scenario: Remediation command cannot target another tenant's agent
    Given tenant-b's agent has agent_id "agent-b-001"
    When tenant-a attempts to send a remediation command to "agent-b-001"
    Then the control plane rejects the command with HTTP 403
    And the attempt is logged as a security event

  Scenario: Cross-tenant isolation tests run in CI on every PR
    Given the isolation test suite is part of the CI pipeline
    When a PR is opened
    Then all cross-tenant isolation tests run automatically
    And the PR cannot be merged if any isolation test fails
```
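The read-isolation scenarios assert a specific design choice: cross-tenant lookups return 404, not 403, so a foreign tenant cannot even learn that a resource exists. A Python sketch of the pattern under test, assuming an in-memory store keyed by `(tenant_id, event_id)` as a stand-in for Postgres rows filtered by RLS on `app.tenant_id` (the `DriftStore` class is illustrative):

```python
class DriftStore:
    """Tenant-scoped lookups: every query is implicitly filtered by the
    caller's tenant_id, mirroring row-level security in the database."""

    def __init__(self):
        self._events = {}  # (tenant_id, event_id) -> event payload

    def put(self, tenant_id, event_id, event):
        self._events[(tenant_id, event_id)] = event

    def get(self, caller_tenant_id, event_id):
        # 404 rather than 403: a cross-tenant read must be
        # indistinguishable from a read of a nonexistent resource.
        event = self._events.get((caller_tenant_id, event_id))
        if event is None:
            return 404, None
        return 200, event

store = DriftStore()
store.put("tenant-a", "drift-a-001", {"resource": "sg-123"})

# tenant-b cannot see tenant-a's event, and no tenant-a data leaks.
assert store.get("tenant-b", "drift-a-001") == (404, None)
# tenant-a's own read still succeeds.
assert store.get("tenant-a", "drift-a-001")[0] == 200
```

The CI scenario then reduces to running exactly these assertions (against the real API and database, not an in-memory stand-in) as a merge-blocking test suite on every PR.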

---

*End of BDD Acceptance Test Specifications for dd0c/drift*

*Total epics covered: 10 | Features: 40+ | Scenarios: 200+*