# dd0c/portal — BDD Acceptance Test Specifications

> Gherkin scenarios for all 10 epics. Edge cases included.

---

## Epic 1: AWS Discovery Scanner

### Story 1.1 — IAM Role Assumption into Customer Account

```gherkin
Feature: IAM Role Assumption for Cross-Account Discovery

  Background:
    Given the portal has a tenant with AWS integration configured
    And the tenant has provided an IAM role ARN "arn:aws:iam::123456789012:role/PortalDiscoveryRole"

  Scenario: Successfully assume IAM role and begin scan
    Given the IAM role has the required trust policy allowing the portal's scanner principal
    When the AWS Discovery Scanner initiates a scan for the tenant
    Then the scanner assumes the role via STS AssumeRole
    And receives temporary credentials valid for 1 hour
    And begins scanning the configured AWS regions

  Scenario: IAM role assumption fails due to missing trust policy
    Given the IAM role does NOT have a trust policy for the portal's scanner principal
    When the AWS Discovery Scanner attempts to assume the role
    Then STS returns an "AccessDenied" error
    And the scanner marks the scan job as FAILED with reason "IAM role assumption denied"
    And the tenant receives a notification: "AWS integration error: cannot assume role"
    And no partial scan results are persisted

  Scenario: IAM role ARN is malformed
    Given the tenant has configured an invalid role ARN "not-a-valid-arn"
    When the AWS Discovery Scanner attempts to assume the role
    Then the scanner marks the scan job as FAILED with reason "Invalid role ARN format"
    And logs a structured error with tenant_id and the invalid ARN

  Scenario: Temporary credentials expire mid-scan
    Given the scanner has assumed the IAM role and started scanning
    And the scan has been running for 61 minutes
    When the scanner attempts to call ecs:DescribeClusters
    Then AWS returns an "ExpiredTokenException"
    And the scanner refreshes credentials via a new AssumeRole call
    And resumes scanning from the last checkpoint
    And the scan job status remains IN_PROGRESS

  Scenario: Role assumption succeeds but has insufficient permissions
    Given the IAM role is assumed successfully
    But the role lacks the "ecs:ListClusters" permission
    When the scanner attempts to list ECS clusters
    Then AWS returns an "AccessDenied" error for that specific call
    And the scanner records a partial failure for the ECS resource type
    And continues scanning other resource types (Lambda, CloudFormation, API Gateway)
    And the final scan result includes a warnings list with "ECS: insufficient permissions"
```

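The malformed-ARN scenario implies the scanner validates the role ARN before calling STS. A minimal sketch of such a check (the function name and the exact pattern are assumptions — the spec only requires that "not-a-valid-arn" is rejected):

```python
import re

# IAM role ARNs look like arn:aws:iam::<12-digit-account>:role/<name>.
# The optional partition suffix covers aws-cn / aws-us-gov; this is a
# simplification, not the full ARN grammar.
ROLE_ARN_RE = re.compile(r"^arn:aws(-cn|-us-gov)?:iam::\d{12}:role/[\w+=,.@/-]+$")

def validate_role_arn(arn: str) -> bool:
    """Return True if the ARN is a plausible IAM role ARN."""
    return bool(ROLE_ARN_RE.match(arn))

print(validate_role_arn("arn:aws:iam::123456789012:role/PortalDiscoveryRole"))  # True
print(validate_role_arn("not-a-valid-arn"))  # False
```

Failing fast here lets the scan job be marked FAILED with "Invalid role ARN format" without spending an STS call.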
### Story 1.2 — ECS Service Discovery

```gherkin
Feature: ECS Service Discovery

  Background:
    Given a tenant's AWS account has been successfully authenticated
    And the scanner is configured to scan region "us-east-1"

  Scenario: Discover ECS services across multiple clusters
    Given the AWS account has 3 ECS clusters: "prod-cluster", "staging-cluster", "dev-cluster"
    And "prod-cluster" has 5 services, "staging-cluster" has 3, "dev-cluster" has 2
    When the AWS Discovery Scanner runs the ECS scan step
    Then it lists all 3 clusters via ecs:ListClusters
    And describes all services in each cluster via ecs:DescribeServices
    And emits 10 discovered service events to the catalog ingestion queue
    And each event includes: cluster_name, service_name, task_definition, desired_count, region

  Scenario: ECS cluster has no services
    Given the AWS account has 1 ECS cluster "empty-cluster" with 0 services
    When the scanner runs the ECS scan step
    Then it lists the cluster successfully
    And records 0 services discovered for that cluster
    And does not emit any service events for that cluster
    And the scan step completes without error

  Scenario: ECS DescribeServices returns throttling error
    Given the AWS account has an ECS cluster with 100 services
    When the scanner calls ecs:DescribeServices in batches of 10
    And AWS returns a ThrottlingException on the 5th batch
    Then the scanner applies exponential backoff (2s, 4s, 8s)
    And retries the failed batch up to 3 times
    And if a retry succeeds, continues with the remaining batches
    And the final result includes all 100 services

  Scenario: ECS DescribeServices fails after all retries
    Given AWS returns ThrottlingException on every retry attempt
    When the scanner exhausts all 3 retries for a batch
    Then it marks that batch as partially failed
    And records which service ARNs could not be described
    And continues with the next batch
    And the scan summary includes "ECS partial failure: 10 services could not be described"

  Scenario: Multi-region ECS scan
    Given the tenant has configured scanning for regions: "us-east-1", "eu-west-1", "ap-southeast-1"
    When the AWS Discovery Scanner runs
    Then it scans ECS in all 3 regions in parallel via a Step Functions Map state
    And aggregates results from all regions
    And each discovered service record includes its region
    And duplicate service names across regions are stored as separate catalog entries
```

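The throttling scenarios pin down the retry policy exactly: three retries with 2s, 4s, 8s delays, then record a partial failure. A self-contained sketch of that policy (the exception class and function names are placeholders, not a real botocore API):

```python
import time

class ThrottlingError(Exception):
    """Stand-in for the AWS SDK's ThrottlingException."""

def call_with_backoff(call, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Retry `call` on throttling, sleeping 2s, 4s, 8s between attempts.
    After max_retries the error propagates so the caller can record a
    partial failure for the batch and move on."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ThrottlingError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo: a batch call that is throttled twice, then succeeds.
attempts = {"n": 0}
def flaky_describe():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ThrottlingError()
    return ["svc-1", "svc-2"]

delays = []  # capture the sleeps instead of actually waiting
result = call_with_backoff(flaky_describe, sleep=delays.append)
print(result, delays)  # ['svc-1', 'svc-2'] [2.0, 4.0]
```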
### Story 1.3 — Lambda Function Discovery

```gherkin
Feature: Lambda Function Discovery

  Scenario: Discover all Lambda functions in a region
    Given the AWS account has 25 Lambda functions in "us-east-1"
    When the scanner runs the Lambda scan step
    Then it calls lambda:ListFunctions with pagination
    And retrieves all 25 functions across multiple pages
    And emits 25 service events with: function_name, runtime, memory_size, timeout, last_modified

  Scenario: Lambda function has tags indicating service ownership
    Given a Lambda function "payment-processor" has tags:
      | Key         | Value         |
      | team        | payments-team |
      | service     | payment-svc   |
      | environment | production    |
    When the scanner processes this Lambda function
    Then the catalog entry includes ownership: "payments-team" (source: implicit/tags)
    And the service name is inferred as "payment-svc" from the "service" tag

  Scenario: Lambda function has no tags
    Given a Lambda function "legacy-cron-job" has no tags
    When the scanner processes this Lambda function
    Then the catalog entry has ownership: null (source: heuristic/pending)
    And the entry is flagged for ownership inference via commit history

  Scenario: Lambda ListFunctions pagination
    Given the AWS account has 150 Lambda functions (exceeding the 50-per-page limit)
    When the scanner calls lambda:ListFunctions
    Then it follows NextMarker pagination tokens
    And retrieves all 150 functions across 3 pages
    And does not duplicate any function in the results
```

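The pagination scenario (150 functions, 3 pages, no duplicates) can be sketched as a generic NextMarker drain loop, shown here against a stubbed page fetcher since no real AWS client is assumed:

```python
def list_all_functions(fetch_page):
    """Drain NextMarker-style pagination.
    fetch_page(marker) -> (items, next_marker); next_marker None ends the loop."""
    functions, marker = [], None
    while True:
        items, marker = fetch_page(marker)
        functions.extend(items)
        if marker is None:
            return functions

# Stub: three 50-item pages, mimicking lambda:ListFunctions responses.
PAGE_START = {None: 0, "p1": 50, "p2": 100}
NEXT = {0: "p1", 50: "p2", 100: None}

def fake_page(marker):
    start = PAGE_START[marker]
    return [f"fn-{i}" for i in range(start, start + 50)], NEXT[start]

all_fns = list_all_functions(fake_page)
print(len(all_fns), len(set(all_fns)))  # 150 150 — complete and duplicate-free
```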
### Story 1.4 — CloudFormation Stack Discovery

```gherkin
Feature: CloudFormation Stack Discovery

  Scenario: Discover active CloudFormation stacks
    Given the AWS account has 8 CloudFormation stacks in "us-east-1"
    And 6 stacks have status "CREATE_COMPLETE" or "UPDATE_COMPLETE"
    And 2 stacks have status "ROLLBACK_COMPLETE"
    When the scanner runs the CloudFormation scan step
    Then it discovers all 8 stacks
    And includes stacks of all statuses in the raw results
    And marks ROLLBACK_COMPLETE stacks with health_status: "degraded"

  Scenario: CloudFormation stack outputs contain service metadata
    Given a stack "api-gateway-stack" has outputs:
      | OutputKey   | OutputValue                   |
      | ServiceName | user-api                      |
      | TeamOwner   | platform-team                 |
      | ApiEndpoint | https://api.example.com/users |
    When the scanner processes this stack
    Then the catalog entry uses "user-api" as the service name
    And sets ownership to "platform-team" (source: implicit/stack-output)
    And records the API endpoint in service metadata

  Scenario: Nested stacks are discovered
    Given a parent CloudFormation stack has 3 nested stacks
    When the scanner processes the parent stack
    Then it also discovers and catalogs each nested stack
    And links nested stacks to their parent in the catalog

  Scenario: Stack in DELETE_IN_PROGRESS state
    Given a CloudFormation stack has status "DELETE_IN_PROGRESS"
    When the scanner discovers this stack
    Then it records the stack with health_status: "terminating"
    And does not block the scan on this stack's state
```

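The scenarios fix two points of a stack-status-to-health mapping (ROLLBACK_COMPLETE → "degraded", DELETE_IN_PROGRESS → "terminating"). A sketch filling in the rest — the "healthy" and "unknown" branches are assumptions, not spec'd:

```python
def stack_health(status: str) -> str:
    """Map a CloudFormation stack status to a catalog health label.
    Only "degraded" and "terminating" are required by the scenarios;
    the remaining branches are illustrative defaults."""
    if status == "DELETE_IN_PROGRESS":
        return "terminating"
    if "ROLLBACK" in status:
        return "degraded"
    if status.endswith("_COMPLETE"):
        return "healthy"
    return "unknown"

print(stack_health("ROLLBACK_COMPLETE"))   # degraded
print(stack_health("DELETE_IN_PROGRESS"))  # terminating
print(stack_health("UPDATE_COMPLETE"))     # healthy
```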
### Story 1.5 — API Gateway Discovery

```gherkin
Feature: API Gateway Discovery

  Scenario: Discover REST APIs in API Gateway v1
    Given the AWS account has 4 REST APIs in API Gateway
    When the scanner runs the API Gateway scan step
    Then it calls apigateway:GetRestApis
    And retrieves all 4 APIs with their names, IDs, and endpoints
    And emits catalog events for each API

  Scenario: Discover HTTP APIs in API Gateway v2
    Given the AWS account has 3 HTTP APIs in API Gateway v2
    When the scanner runs the API Gateway v2 scan step
    Then it calls apigatewayv2:GetApis
    And retrieves all 3 HTTP APIs
    And correctly distinguishes them from REST APIs in the catalog

  Scenario: API Gateway API has no associated service tag
    Given an API Gateway REST API "legacy-api" has no resource tags
    When the scanner processes this API
    Then it creates a catalog entry with name "legacy-api"
    And sets ownership to null pending heuristic inference
    And records the API type as "REST" and the invoke URL

  Scenario: API Gateway scan across multiple regions
    Given API Gateway APIs exist in "us-east-1" and "us-west-2"
    When the scanner runs in parallel across both regions
    Then APIs from both regions are discovered
    And each catalog entry includes the region field
    And there are no cross-region duplicates
```

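The "no cross-region duplicates" step implies catalog entries are keyed by something region-qualified rather than by display name. One plausible keying (the entry shape and field names are assumptions):

```python
def catalog_key(entry):
    """Uniqueness key for an API Gateway catalog entry: region plus API id,
    so same-named APIs in different regions remain distinct."""
    return (entry["region"], entry["api_id"])

apis = [
    {"name": "orders-api", "api_id": "abc123", "region": "us-east-1"},
    {"name": "orders-api", "api_id": "def456", "region": "us-west-2"},
]
unique = {catalog_key(e): e for e in apis}
print(len(unique))  # 2 — identical names, but distinct entries
```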
### Story 1.6 — Step Functions Orchestration

```gherkin
Feature: Step Functions Orchestration of Discovery Scan

  Scenario: Full scan executes all resource type steps in parallel
    Given a scan job is triggered for a tenant
    When the Step Functions state machine starts
    Then it executes the ECS, Lambda, CloudFormation, and API Gateway scan steps in parallel
    And waits for all parallel branches to complete
    And aggregates results from all branches
    And transitions to the catalog ingestion step

  Scenario: One parallel branch fails, others succeed
    Given the Step Functions state machine is running a scan
    And the Lambda scan step throws an unhandled exception
    When the state machine catches the error
    Then the ECS, CloudFormation, and API Gateway branches continue to completion
    And the Lambda branch is marked as FAILED in the scan summary
    And the overall scan job status is PARTIAL_SUCCESS
    And discovered resources from successful branches are ingested into the catalog

  Scenario: Scan job is idempotent on retry
    Given a scan job failed midway through execution
    When the Step Functions state machine retries the execution
    Then it does not duplicate already-ingested catalog entries
    And uses upsert semantics based on (tenant_id, resource_arn) as the unique key

  Scenario: Concurrent scan jobs for the same tenant are prevented
    Given a scan job is already IN_PROGRESS for tenant "acme-corp"
    When a second scan job is triggered for "acme-corp"
    Then the system detects the existing in-progress execution
    And rejects the new job with status 409 Conflict
    And returns the ID of the existing running scan job

  Scenario: Scan job timeout
    Given a scan job has been running for more than 30 minutes
    When the Step Functions state machine timeout is reached
    Then the execution is marked as TIMED_OUT
    And partial results already ingested are retained in the catalog
    And the tenant is notified of the timeout
    And the scan job record shows last_successful_step

  Scenario: WebSocket progress events during scan
    Given a user is watching the discovery progress in the UI
    When each Step Functions step completes
    Then a WebSocket event is pushed to the user's connection
    And the event includes: step_name, status, resources_found, timestamp
    And the UI progress bar updates in real time
```

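The idempotent-retry scenario specifies upsert semantics keyed on (tenant_id, resource_arn). A runnable demonstration of that contract using SQLite's `ON CONFLICT ... DO UPDATE` — the production store is Aurora PostgreSQL per Story 3.1, but the upsert shape is the same, and the table columns here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE services (
        tenant_id    TEXT NOT NULL,
        resource_arn TEXT NOT NULL,
        name         TEXT,
        PRIMARY KEY (tenant_id, resource_arn)
    )
""")

UPSERT = """
    INSERT INTO services (tenant_id, resource_arn, name)
    VALUES (?, ?, ?)
    ON CONFLICT (tenant_id, resource_arn) DO UPDATE SET name = excluded.name
"""

arn = "arn:aws:lambda:us-east-1:123456789012:function:pay"
# First ingestion, then a retried scan re-emitting the same resource.
conn.execute(UPSERT, ("acme", arn, "pay-v1"))
conn.execute(UPSERT, ("acme", arn, "pay-v2"))

rows = conn.execute("SELECT name FROM services").fetchall()
print(rows)  # [('pay-v2',)] — one row, updated in place, never duplicated
```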
### Story 1.7 — Multi-Region Scanning

```gherkin
Feature: Multi-Region AWS Scanning

  Scenario: Tenant configures specific regions to scan
    Given a tenant has configured scan regions: ["us-east-1", "eu-central-1"]
    When the discovery scanner runs
    Then it only scans the configured regions
    And does not scan any other AWS regions
    And the scan summary shows per-region resource counts

  Scenario: Tenant configures "all regions" scan
    Given a tenant has configured scan_all_regions: true
    When the discovery scanner runs
    Then it first calls ec2:DescribeRegions to get the current list of enabled regions
    And scans all enabled regions
    And the Step Functions Map state iterates over the dynamic region list

  Scenario: A region is disabled in the AWS account
    Given the tenant's AWS account has "ap-east-1" disabled
    And the tenant has configured scan_all_regions: true
    When the scanner attempts to scan "ap-east-1"
    Then it receives an "OptInRequired" or region-disabled error
    And skips that region gracefully
    And records "ap-east-1: skipped (region disabled)" in the scan summary

  Scenario: Region scan results are isolated per tenant
    Given tenant "acme" and tenant "globex" both have resources in "us-east-1"
    When both tenants run discovery scans simultaneously
    Then acme's catalog only contains acme's resources
    And globex's catalog only contains globex's resources
    And no cross-tenant data leakage occurs
```

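The region scenarios distinguish an explicit region list from a scan-all mode backed by ec2:DescribeRegions. A small planning helper sketching that split — note the spec handles disabled regions at call time via OptInRequired, so pre-filtering against an enabled set, as done here, is an assumption:

```python
def plan_regions(configured, enabled, scan_all):
    """Return (regions to scan, regions to record as skipped).
    `enabled` stands in for the ec2:DescribeRegions result."""
    if scan_all:
        return sorted(enabled), []
    to_scan = [r for r in configured if r in enabled]
    skipped = [r for r in configured if r not in enabled]
    return to_scan, skipped

to_scan, skipped = plan_regions(
    configured=["us-east-1", "ap-east-1"],
    enabled={"us-east-1", "eu-west-1"},
    scan_all=False,
)
print(to_scan, skipped)  # ['us-east-1'] ['ap-east-1']
```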
---

## Epic 2: GitHub Discovery Scanner

### Story 2.1 — GitHub App Installation and Authentication

```gherkin
Feature: GitHub App Authentication

  Background:
    Given the portal has a GitHub App registered with app_id "12345"
    And the app has a private key stored in AWS Secrets Manager

  Scenario: Successfully authenticate as GitHub App installation
    Given a tenant has installed the GitHub App on their organization "acme-org"
    And the installation_id is "67890"
    When the GitHub Discovery Scanner initializes for this tenant
    Then it generates a JWT signed with the app's private key
    And exchanges the JWT for an installation access token via POST /app/installations/67890/access_tokens
    And the token has scopes: contents:read, metadata:read
    And the scanner uses this token for all subsequent API calls

  Scenario: GitHub App JWT generation fails due to expired private key
    Given the private key in Secrets Manager is expired or invalid
    When the scanner attempts to generate a JWT
    Then it fails with a key validation error
    And marks the scan job as FAILED with reason "GitHub App authentication failed: invalid private key"
    And triggers an alert to the tenant admin

  Scenario: Installation access token request returns 404
    Given the GitHub App installation has been uninstalled by the tenant
    When the scanner requests an installation access token
    Then GitHub returns 404 Not Found
    And the scanner marks the GitHub integration as DISCONNECTED
    And notifies the tenant: "GitHub App has been uninstalled. Please reinstall to continue discovery."

  Scenario: Installation access token is proactively refreshed mid-scan
    Given the scanner has a valid installation token (valid for 1 hour)
    And the scan has been running for 55 minutes
    When the scanner detects the token is within 5 minutes of expiry
    Then it proactively refreshes the token via a new access-token request
    And the scan continues without interruption
    And no GraphQL query fails with 401 Unauthorized
```

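The refresh scenario hinges on a "within 5 minutes of expiry" check against the 1-hour token lifetime. A minimal sketch of that check (function and constant names are assumptions):

```python
from datetime import datetime, timedelta, timezone

REFRESH_MARGIN = timedelta(minutes=5)  # refresh 5 min before the 1-hour expiry

def needs_refresh(expires_at, now=None):
    """True once the current time is within the refresh margin of expiry."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at - REFRESH_MARGIN

issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)
print(needs_refresh(expires, now=issued + timedelta(minutes=54)))  # False
print(needs_refresh(expires, now=issued + timedelta(minutes=55)))  # True
```

Checking before each batch of API calls keeps the scan from ever presenting an expired token.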
### Story 2.2 — Repository Discovery via GraphQL

```gherkin
Feature: GitHub Repository Discovery via GraphQL API

  Scenario: Discover all repositories in an organization
    Given the tenant's GitHub organization "acme-org" has 150 repositories
    When the GitHub Discovery Scanner runs the repository listing query
    Then it uses the GraphQL API with cursor-based pagination
    And retrieves all 150 repositories in batches of 100
    And each repository record includes: name, defaultBranch, isArchived, pushedAt, topics

  Scenario: Archived repositories are excluded by default
    Given "acme-org" has 150 active repos and 30 archived repos
    When the scanner runs with default settings
    Then it discovers 150 active repositories
    And excludes the 30 archived repositories
    And the scan summary notes "30 archived repositories skipped"

  Scenario: Tenant opts in to include archived repositories
    Given the tenant has configured include_archived: true
    When the scanner runs
    Then it discovers all 180 repositories including archived ones
    And archived repos are marked with status: "archived" in the catalog

  Scenario: Organization has no repositories
    Given the tenant's GitHub organization has 0 repositories
    When the scanner runs the repository listing query
    Then it returns an empty result set
    And the scan completes successfully with 0 GitHub services discovered
    And no errors are raised

  Scenario: GraphQL query returns partial data with errors
    Given the GraphQL query returns 80 repositories successfully
    But includes a "FORBIDDEN" error for 20 private repositories
    When the scanner processes the response
    Then it ingests the 80 accessible repositories
    And records a warning: "20 repositories inaccessible due to permissions"
    And the scan status is PARTIAL_SUCCESS
```

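GraphQL can return both `data` and `errors` in one payload, which is exactly the partial-data scenario above. A sketch of splitting such a response (the payload shape here is simplified and illustrative, not GitHub's exact schema):

```python
def split_graphql_response(resp):
    """Separate accessible repositories from FORBIDDEN errors in a
    GraphQL payload that carries both `data` and `errors`."""
    repos = [n for n in resp.get("data", {}).get("repositories", []) if n]
    errors = [e for e in resp.get("errors", []) if e.get("type") == "FORBIDDEN"]
    return repos, errors

resp = {
    "data": {"repositories": [{"name": "repo-a"}, {"name": "repo-b"}]},
    "errors": [{"type": "FORBIDDEN", "path": ["repositories", 2]}],
}
repos, errors = split_graphql_response(resp)
status = "PARTIAL_SUCCESS" if errors else "SUCCESS"
print(len(repos), status)  # 2 PARTIAL_SUCCESS
```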
### Story 2.3 — Service Metadata Extraction from Repositories

```gherkin
Feature: Service Metadata Extraction from Repository Files

  Scenario: Extract service name from package.json
    Given a repository "user-service" has a package.json at the root:
      """
      {
        "name": "@acme/user-service",
        "version": "2.1.0",
        "description": "Handles user authentication and profiles"
      }
      """
    When the scanner fetches and parses package.json
    Then the catalog entry uses service name "@acme/user-service"
    And records version "2.1.0"
    And records description "Handles user authentication and profiles"
    And sets the service type as "nodejs"

  Scenario: Extract service metadata from Dockerfile
    Given a repository has a Dockerfile with:
      """
      FROM python:3.11-slim
      LABEL service.name="data-pipeline"
      LABEL service.team="data-engineering"
      EXPOSE 8080
      """
    When the scanner parses the Dockerfile
    Then the catalog entry uses service name "data-pipeline"
    And sets ownership to "data-engineering" (source: implicit/dockerfile-label)
    And records the service type as "python"
    And records exposed port 8080

  Scenario: Extract service metadata from Terraform files
    Given a repository has a main.tf with:
      """
      module "service" {
        source       = "..."
        service_name = "billing-api"
        team         = "billing-team"
      }
      """
    When the scanner parses Terraform files
    Then the catalog entry uses service name "billing-api"
    And sets ownership to "billing-team" (source: implicit/terraform)

  Scenario: Repository has no recognizable service metadata files
    Given a repository "random-scripts" has no package.json, Dockerfile, or Terraform files
    When the scanner processes this repository
    Then it creates a catalog entry using the repository name "random-scripts"
    And sets the service type as "unknown"
    And flags the entry for manual review
    And ownership is set to null pending heuristic inference

  Scenario: Multiple metadata files exist — precedence order
    Given a repository has both a package.json (name: "svc-a") and a Dockerfile (LABEL service.name="svc-b")
    When the scanner processes the repository
    Then it uses package.json as the primary metadata source
    And records the service name as "svc-a"
    And notes the Dockerfile label as secondary metadata

  Scenario: package.json is malformed JSON
    Given a repository has a package.json with invalid JSON syntax
    When the scanner attempts to parse it
    Then it logs a warning: "package.json parse error in repo: user-service"
    And falls back to the Dockerfile or Terraform files for metadata
    And if no fallback exists, uses the repository name
    And does not crash the scan job
```

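The precedence and malformed-JSON scenarios together define a fallback chain: package.json, then Dockerfile label, then the repository name, tolerating parse errors along the way. A sketch of that chain (function name is an assumption):

```python
import json

def infer_service_name(repo_name, package_json=None, dockerfile_label=None):
    """Apply the precedence the scenarios describe: package.json first,
    then the Dockerfile service.name label, then the repository name.
    Malformed JSON falls through to the next source instead of crashing."""
    if package_json is not None:
        try:
            name = json.loads(package_json).get("name")
            if name:
                return name, "package.json"
        except json.JSONDecodeError:
            pass  # would be logged as a parse warning; fall through
    if dockerfile_label:
        return dockerfile_label, "dockerfile-label"
    return repo_name, "repo-name"

print(infer_service_name("r", '{"name": "svc-a"}', "svc-b"))  # ('svc-a', 'package.json')
print(infer_service_name("r", '{broken', "svc-b"))            # ('svc-b', 'dockerfile-label')
print(infer_service_name("random-scripts"))                   # ('random-scripts', 'repo-name')
```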
### Story 2.4 — GitHub Rate Limit Handling

```gherkin
Feature: GitHub API Rate Limit Handling

  Scenario: GraphQL rate limit warning threshold reached
    Given the scanner has consumed 4,500 of 5,000 GraphQL rate limit points
    When the scanner checks the rate limit headers after each response
    Then it detects the threshold (90% consumed)
    And pauses scanning until the rate limit resets
    And logs: "GitHub rate limit threshold reached, waiting for reset at {reset_time}"
    And resumes scanning after the reset window

  Scenario: REST API secondary rate limit hit
    Given the scanner is making rapid REST API calls
    When GitHub returns HTTP 429 with "secondary rate limit" in the response
    Then the scanner reads the Retry-After header
    And waits the specified number of seconds before retrying
    And does not count this as a scan failure

  Scenario: Rate limit exhausted with many repos remaining
    Given the scanner has 500 repositories to scan
    And the rate limit is exhausted after scanning 200 repositories
    When the scanner detects rate limit exhaustion
    Then it checkpoints the current progress (200 repos scanned)
    And schedules a continuation scan after the rate limit reset
    And the scan job status is set to RATE_LIMITED (not FAILED)
    And the tenant is notified of the delay

  Scenario: Rate limit headers are missing from response
    Given GitHub returns a response without X-RateLimit headers
    When the scanner processes the response
    Then it applies a conservative default delay of 1 second between requests
    And logs a warning about missing rate limit headers
    And continues scanning without crashing

  Scenario: Concurrent scans from multiple tenants share rate limit awareness
    Given two tenants share the same GitHub App installation
    When both tenants trigger scans simultaneously
    Then the scanner uses a shared rate limit counter in Redis
    And distributes available rate limit points between the two scans
    And neither scan exceeds the total available rate limit
```

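The first scenario fixes the pause condition precisely: 4,500 of 5,000 points consumed means the 90% threshold is hit. A one-function sketch of that check:

```python
def should_pause(used, limit, threshold=0.9):
    """Pause scanning once the consumed share of the rate limit budget
    reaches the threshold (90% per the scenario)."""
    return used / limit >= threshold

print(should_pause(4500, 5000))  # True  — exactly at 90%
print(should_pause(4499, 5000))  # False — just under
```

In practice `used` would be derived from the rate-limit fields GitHub returns with each response; the boundary being inclusive (>= rather than >) is what makes 4,500/5,000 trigger the pause.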
### Story 2.5 — Ownership Inference from Commit History

```gherkin
Feature: Heuristic Ownership Inference from Commit History

  Scenario: Infer owner from most recent committer
    Given a repository has no explicit ownership config and no team tags
    And the last 10 commits were made by "alice@acme.com" (7 commits) and "bob@acme.com" (3 commits)
    When the ownership inference engine runs
    Then it identifies "alice@acme.com" as the primary contributor
    And maps the email to team "frontend-team" via the org's team membership
    And sets ownership to "frontend-team" (source: heuristic/commit-history)
    And records confidence: 0.7

  Scenario: Ownership inference from CODEOWNERS file
    Given a repository has a CODEOWNERS file:
      """
      * @acme-org/platform-team
      /src/api/ @acme-org/api-team
      """
    When the scanner processes the CODEOWNERS file
    Then it sets ownership to "platform-team" (source: implicit/codeowners)
    And records that the api directory has additional ownership by "api-team"
    And this takes precedence over commit history heuristics

  Scenario: Ownership conflict — multiple teams have equal commit share
    Given a repository has commits split 50/50 between "team-a" and "team-b"
    When the ownership inference engine runs
    Then it records both teams as co-owners
    And sets ownership_confidence: 0.5
    And flags the service for manual ownership resolution
    And the catalog entry shows ownership_status: "conflict"

  Scenario: Explicit ownership config overrides all other sources
    Given a repository has a .portal.yaml file:
      """
      owner: payments-team
      tier: critical
      """
    And the repository also has CODEOWNERS pointing to "platform-team"
    And commit history suggests "devops-team"
    When the scanner processes the repository
    Then ownership is set to "payments-team" (source: explicit/portal-config)
    And CODEOWNERS and commit history are recorded as secondary metadata
    And the explicit config is not overridden by any other source

  Scenario: Commit history API call fails
    Given the GitHub API returns 500 when fetching commit history for a repo
    When the ownership inference engine attempts heuristic inference
    Then it marks ownership as null (source: unknown)
    And records the error in the scan summary
    And does not block catalog ingestion for this service
```

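The commit-history heuristic above (7-of-10 commits gives confidence 0.7; an exact 50/50 split is a conflict) can be sketched as a dominant-committer pick — a simplification that treats each commit equally and looks only at the top two contributors:

```python
from collections import Counter

def infer_owner(commit_authors):
    """Pick the dominant committer from recent commit authors.
    Returns (owner, confidence, status); an exact tie between the top
    two contributors is flagged as a conflict per the scenarios."""
    counts = Counter(commit_authors)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return None, n / len(commit_authors), "conflict"
    return top, n / len(commit_authors), "ok"

authors = ["alice@acme.com"] * 7 + ["bob@acme.com"] * 3
print(infer_owner(authors))  # ('alice@acme.com', 0.7, 'ok')
```

Mapping the winning email to a team (e.g. via GitHub org team membership) is a separate lookup left out here.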
---

## Epic 3: Service Catalog

### Story 3.1 — Catalog Ingestion and Upsert

```gherkin
Feature: Service Catalog Ingestion

  Background:
    Given the catalog uses PostgreSQL Aurora Serverless v2
    And the unique key for a service is (tenant_id, source, resource_id)

  Scenario: New service is ingested into the catalog
    Given a discovery scan emits a new service event for "payment-api"
    When the catalog ingestion handler processes the event
    Then a new row is inserted into the services table
    And the row includes: tenant_id, name, source, resource_id, owner, health_status, discovered_at
    And the Meilisearch index is updated with the new service document
    And the catalog entry is immediately searchable

  Scenario: Existing service is updated on re-scan (upsert)
    Given "payment-api" already exists in the catalog with owner "old-team"
    When a new scan emits an updated event for "payment-api" with owner "payments-team"
    Then the existing row is updated (not duplicated)
    And owner is changed to "payments-team"
    And updated_at is refreshed
    And the Meilisearch document is updated

  Scenario: Service removed from AWS is marked stale
    Given "deprecated-lambda" exists in the catalog from a previous scan
    When the latest scan completes and does not include "deprecated-lambda"
    Then the catalog entry is marked with status: "stale"
    And last_seen_at is not updated
    And the service remains visible in the catalog with a "stale" badge
    And is not immediately deleted

  Scenario: Stale service is purged after retention period
    Given "deprecated-lambda" has been stale for more than 30 days
    When the nightly cleanup job runs
    Then the service is soft-deleted from the catalog
    And removed from the Meilisearch index
    And a deletion event is logged for audit purposes

  Scenario: Catalog ingestion fails due to Aurora connection error
    Given Aurora Serverless is scaling up (cold start)
    When the ingestion handler attempts to write a service record
    Then it retries with exponential backoff up to 5 times
    And if all retries fail, the event is sent to a dead-letter queue
    And an alert is raised for the operations team
    And the scan job is marked PARTIAL_SUCCESS

  Scenario: Bulk ingestion of 500 services from a large account
    Given a scan discovers 500 services across all resource types
    When the catalog ingestion handler processes all events
    Then it uses batch inserts (100 records per batch)
    And completes ingestion within 30 seconds
    And all 500 services appear in the catalog
    And the Meilisearch index reflects all 500 services
```

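The stale-marking scenario is a reconciliation pass: anything present in the catalog but absent from the latest scan flips to "stale" rather than being deleted. A minimal in-memory sketch (the real pass would be a single SQL UPDATE keyed on the scan's seen set):

```python
def mark_stale(catalog, seen_ids):
    """Reconcile the catalog against the latest scan: entries missing
    from `seen_ids` become stale instead of being deleted outright."""
    for rid, entry in catalog.items():
        entry["status"] = "active" if rid in seen_ids else "stale"
    return catalog

catalog = {
    "payment-api":       {"status": "active"},
    "deprecated-lambda": {"status": "active"},
}
mark_stale(catalog, seen_ids={"payment-api"})
print(catalog["deprecated-lambda"]["status"])  # stale
```

A later retention job (30 days stale, per the spec) handles the actual soft delete.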
### Story 3.2 — PagerDuty / OpsGenie On-Call Mapping
|
||
|
|
|
||
|
|
```gherkin
|
||
|
|
Feature: On-Call Ownership Mapping via PagerDuty and OpsGenie
|
||
|
|
|
||
|
|
Scenario: Map service owner to PagerDuty escalation policy
|
||
|
|
Given a service "checkout-api" has owner "payments-team"
|
||
|
|
And PagerDuty integration is configured for the tenant
|
||
|
|
And "payments-team" maps to PagerDuty service "checkout-escalation-policy"
|
||
|
|
When the catalog builds the service detail record
|
||
|
|
Then the service includes on_call_policy: "checkout-escalation-policy"
|
||
|
|
And a link to the PagerDuty service page
|
||
|
|
|
||
|
|
Scenario: Fetch current on-call engineer from PagerDuty
|
||
|
|
Given "checkout-api" has a linked PagerDuty escalation policy
|
||
|
|
When a user views the service detail page
|
||
|
|
Then the portal calls PagerDuty GET /oncalls?escalation_policy_ids[]=...
|
||
|
|
And displays the current on-call engineer's name and contact
|
||
|
|
And caches the result for 5 minutes in Redis
|
||
|
|
|
||
|
|
Scenario: PagerDuty API returns 401 (invalid token)
|
||
|
|
Given the PagerDuty API token has been revoked
|
||
|
|
When the portal attempts to fetch on-call data
|
||
|
|
Then it returns the service detail without on-call info
|
||
|
|
And displays a warning: "On-call data unavailable — check PagerDuty integration"
|
||
|
|
And logs the auth failure for the tenant admin
|
||
|
|
|
||
|
|
Scenario: OpsGenie integration as alternative to PagerDuty
|
||
|
|
Given the tenant uses OpsGenie instead of PagerDuty
|
||
|
|
And OpsGenie integration is configured with an API key
|
||
|
|
When the portal fetches on-call data for a service
|
||
|
|
Then it calls the OpsGenie API instead of PagerDuty
|
||
|
|
And maps the OpsGenie schedule to the service owner
|
||
|
|
And displays the current on-call responder
|
||
|
|
|
||
|
|
Scenario: Service owner has no PagerDuty or OpsGenie mapping
|
||
|
|
Given "internal-tool" has owner "eng-team"
|
||
|
|
But "eng-team" has no mapping in PagerDuty or OpsGenie
|
||
|
|
When the portal builds the service detail
|
||
|
|
Then on_call_policy is null
|
||
|
|
And the UI shows "No on-call policy configured" for this service
|
||
|
|
And suggests linking a PagerDuty/OpsGenie service
|
||
|
|
|
||
|
|
Scenario: Multiple services share the same on-call policy
|
||
|
|
Given 10 services all map to the "platform-oncall" PagerDuty policy
|
||
|
|
When the portal fetches on-call data
|
||
|
|
Then it batches the PagerDuty API call for all 10 services
|
||
|
|
And does not make 10 separate API calls
|
||
|
|
And caches the shared result for all 10 services
|
||
|
|
```
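The batching scenario above (one PagerDuty call for many services) reduces to grouping services by escalation policy before hitting the API. A minimal sketch, assuming a hypothetical `fetch_oncalls` client function that wraps the real PagerDuty/OpsGenie call:

```python
from collections import defaultdict

def oncall_for_services(services, fetch_oncalls):
    """services: iterable of {"id": ..., "policy_id": ...}.
    Returns {service_id: responder or None} with one batched API call."""
    by_policy = defaultdict(list)
    for svc in services:
        by_policy[svc.get("policy_id")].append(svc["id"])
    unmapped = by_policy.pop(None, [])  # services with no on-call mapping
    # Single upstream call covering every distinct policy id.
    responders = fetch_oncalls(sorted(by_policy)) if by_policy else {}
    result = {svc_id: None for svc_id in unmapped}
    for policy_id, svc_ids in by_policy.items():
        for svc_id in svc_ids:
            result[svc_id] = responders.get(policy_id)
    return result
```

Services without a mapping come back as `None`, which is what drives the "No on-call policy configured" state.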

### Story 3.3 — Service Catalog CRUD API

```gherkin
Feature: Service Catalog CRUD API

  Background:
    Given the API requires a valid Cognito JWT
    And the JWT contains tenant_id claim

  Scenario: List services for a tenant
    Given tenant "acme" has 42 services in the catalog
    When GET /api/v1/services is called with a valid JWT for "acme"
    Then the response returns 42 services
    And each service includes: id, name, owner, source, health_status, updated_at
    And results are paginated (default page size: 20)
    And the response includes total_count: 42

  Scenario: List services with pagination
    Given tenant "acme" has 42 services
    When GET /api/v1/services?page=2&limit=20 is called
    Then the response returns services 21-40
    And includes pagination metadata: page, limit, total_count, total_pages

  Scenario: Get single service by ID
    Given service "svc-uuid-123" exists for tenant "acme"
    When GET /api/v1/services/svc-uuid-123 is called with acme's JWT
    Then the response returns the full service detail
    And includes: metadata, owner, on_call_policy, health_scorecard, tags, links

  Scenario: Get service belonging to different tenant returns 404
    Given service "svc-uuid-456" belongs to tenant "globex"
    When GET /api/v1/services/svc-uuid-456 is called with acme's JWT
    Then the response returns 404 Not Found
    And does not reveal that the service exists for another tenant

  Scenario: Manually update service owner via API
    Given service "legacy-api" has owner inferred as "unknown"
    When PATCH /api/v1/services/svc-uuid-789 is called with body:
      """
      { "owner": "platform-team", "owner_source": "manual" }
      """
    Then the service owner is updated to "platform-team"
    And owner_source is set to "manual"
    And the change is recorded in the audit log with the user's identity
    And the Meilisearch document is updated

  Scenario: Delete service from catalog
    Given service "decommissioned-api" exists in the catalog
    When DELETE /api/v1/services/svc-uuid-000 is called
    Then the service is soft-deleted (deleted_at is set)
    And it no longer appears in list or search results
    And the Meilisearch document is removed
    And the deletion is recorded in the audit log

  Scenario: Create service manually (not from discovery)
    When POST /api/v1/services is called with:
      """
      { "name": "manual-service", "owner": "ops-team", "source": "manual" }
      """
    Then a new service is created with source: "manual"
    And it appears in the catalog and search results
    And it is not overwritten by automated discovery scans

  Scenario: Unauthenticated request is rejected
    When GET /api/v1/services is called without an Authorization header
    Then the response returns 401 Unauthorized
    And no service data is returned
```
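The pagination contract these scenarios assert (page, limit, total_count, total_pages) can be sketched as a small helper. Illustrative only; `paginate` is not the portal's actual handler:

```python
import math

def paginate(items, page=1, limit=20):
    """Return one page of `items` plus the metadata the list endpoint exposes."""
    total_count = len(items)
    total_pages = max(1, math.ceil(total_count / limit))
    start = (page - 1) * limit
    return {
        "items": items[start:start + limit],
        "page": page,
        "limit": limit,
        "total_count": total_count,
        "total_pages": total_pages,
    }
```

With 42 services, `page=2, limit=20` yields services 21-40 and `total_pages: 3`, matching the scenario.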

### Story 3.4 — Full-Text Search via Meilisearch

```gherkin
Feature: Catalog Full-Text Search

  Scenario: Search returns relevant services
    Given the catalog has services: "payment-api", "payment-processor", "user-service"
    When a search query "payment" is submitted
    Then Meilisearch returns "payment-api" and "payment-processor"
    And results are ranked by relevance score
    And the response time is under 10ms

  Scenario: Search is scoped to tenant
    Given tenant "acme" has service "auth-service"
    And tenant "globex" has service "auth-service" (same name)
    When tenant "acme" searches for "auth-service"
    Then only acme's "auth-service" is returned
    And globex's service is not included in results

  Scenario: Meilisearch index is corrupted or unavailable
    Given Meilisearch returns a 503 error
    When a search request is made
    Then the API falls back to PostgreSQL full-text search (pg_trgm)
    And returns results within 500ms
    And logs a warning: "Meilisearch unavailable, using PostgreSQL fallback"
    And the user sees results (degraded performance, not an error)

  Scenario: Catalog update triggers Meilisearch re-index
    Given a service "new-api" is added to the catalog
    When the catalog write completes
    Then a Meilisearch index update is triggered asynchronously
    And the service is searchable within 1 second
    And if the Meilisearch update fails, it is retried via a background queue
```
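The fallback scenario above can be sketched as a thin wrapper with both backends injected as callables. The real Meilisearch and PostgreSQL clients are assumptions and not shown here:

```python
import logging

logger = logging.getLogger("catalog.search")

def search_with_fallback(query, meili_search, pg_search):
    """Query Meilisearch first; on any error, fall back to PostgreSQL
    full-text search so the user still gets results."""
    try:
        return {"results": meili_search(query), "backend": "meilisearch"}
    except Exception:
        logger.warning("Meilisearch unavailable, using PostgreSQL fallback")
        return {"results": pg_search(query), "backend": "postgresql"}
```

Surfacing the `backend` field makes the degraded path observable without turning it into a user-facing error.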

---

## Epic 4: Search Engine

### Story 4.1 — Cmd+K Instant Search

```gherkin
Feature: Cmd+K Instant Search Interface

  Scenario: User opens search with keyboard shortcut
    Given the user is on any page of the portal
    When the user presses Cmd+K (Mac) or Ctrl+K (Windows/Linux)
    Then the search modal opens immediately
    And the search input is focused
    And recent searches are shown as suggestions

  Scenario: Search returns results under 10ms
    Given the Meilisearch index has 500 services
    And Redis prefix cache is warm
    When the user types "pay" in the search box
    Then results appear within 10ms
    And the top 5 matching services are shown
    And each result shows: service name, owner, health status

  Scenario: Search with Redis prefix cache hit
    Given "pay" has been searched recently and is cached in Redis
    When the user types "pay"
    Then the API returns cached results from Redis
    And does not query Meilisearch
    And the response time is under 5ms

  Scenario: Search with Redis cache miss
    Given "xyz-unique-query" has never been searched
    When the user types "xyz-unique-query"
    Then the API queries Meilisearch directly
    And stores the result in Redis with TTL of 60 seconds
    And returns results within 10ms

  Scenario: Search cache is invalidated on catalog update
    Given "payment-api" search results are cached in Redis
    When the catalog is updated (new service added or existing service modified)
    Then all Redis cache keys matching affected prefixes are invalidated
    And the next search query fetches fresh results from Meilisearch

  Scenario: Search with no results
    Given no services match the query "zzznomatch"
    When the user searches for "zzznomatch"
    Then the search modal shows "No services found"
    And suggests: "Try a broader search or browse all services"
    And does not show an error

  Scenario: Search is dismissed
    Given the search modal is open
    When the user presses Escape
    Then the search modal closes
    And focus returns to the previously focused element

  Scenario: Search result is selected
    Given search results show "payment-api"
    When the user clicks or presses Enter on "payment-api"
    Then the search modal closes
    And the service detail drawer opens for "payment-api"
    And the URL updates to reflect the selected service
```
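The cache hit/miss behaviour above amounts to a read-through cache with a 60-second TTL. A sketch, using a plain dict of `key -> (expires_at, payload)` in place of Redis and a hypothetical `search_fn` for the Meilisearch query:

```python
import json
import time

CACHE_TTL_S = 60  # TTL named in the cache-miss scenario

def cached_search(tenant_id, query, cache, search_fn, now=time.time):
    """Hit: return cached results without touching Meilisearch.
    Miss: query the index and store the result with a 60s TTL."""
    key = f"search:{tenant_id}:{query}"
    entry = cache.get(key)
    if entry and entry[0] > now():
        return json.loads(entry[1])            # cache hit
    results = search_fn(query)                 # cache miss: query Meilisearch
    cache[key] = (now() + CACHE_TTL_S, json.dumps(results))
    return results
```

The key shape matches the tenant-scoped layout defined in Story 4.2.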

### Story 4.2 — Redis Prefix Caching

```gherkin
Feature: Redis Prefix Caching for Search

  Scenario: Cache key structure is tenant-scoped
    Given tenant "acme" searches for "pay"
    Then the Redis cache key is "search:acme:pay"
    And tenant "globex" searching for "pay" uses key "search:globex:pay"
    And the two cache entries are completely independent

  Scenario: Cache TTL expires and is refreshed
    Given a cache entry for "search:acme:pay" has TTL of 60 seconds
    When 61 seconds pass without a search for "pay"
    Then the cache entry expires
    And the next search for "pay" queries Meilisearch
    And a new cache entry is created with a fresh 60-second TTL

  Scenario: Redis is unavailable
    Given Redis returns a connection error
    When a search request is made
    Then the API bypasses the cache and queries Meilisearch directly
    And logs a warning: "Redis unavailable, cache bypassed"
    And the search still returns results (degraded, not broken)
    And does not attempt to write to Redis while it is down

  Scenario: Cache invalidation on bulk catalog update
    Given a discovery scan adds 50 new services to the catalog
    When the bulk ingestion completes
    Then all Redis search cache keys for the affected tenant are flushed
    And subsequent searches reflect the updated catalog
    And the flush is atomic (all keys or none)
```
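The key layout and the bulk flush above can be sketched with a dict standing in for Redis; production code would SCAN a key pattern and DEL the matches in a pipeline to approximate an atomic flush:

```python
def search_cache_key(tenant_id, prefix):
    """Tenant-scoped key layout from the first scenario: search:<tenant>:<prefix>."""
    return f"search:{tenant_id}:{prefix}"

def flush_tenant_search_cache(cache, tenant_id):
    """Remove every search entry for one tenant, leaving other tenants intact.
    Returns the number of keys removed."""
    pattern = f"search:{tenant_id}:"
    doomed = [key for key in cache if key.startswith(pattern)]
    for key in doomed:
        del cache[key]
    return len(doomed)
```

Scoping the flush to the tenant prefix is what keeps one tenant's bulk ingestion from evicting another tenant's warm cache.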

### Story 4.3 — Meilisearch Index Management

```gherkin
Feature: Meilisearch Index Management

  Scenario: Index is created on first tenant onboarding
    Given a new tenant "startup-co" completes onboarding
    When the tenant's first discovery scan runs
    Then a Meilisearch index "services_startup-co" is created
    And the index is configured with searchable attributes: name, description, owner, tags
    And filterable attributes: source, health_status, owner, region

  Scenario: Index corruption is detected and rebuilt
    Given the Meilisearch index for tenant "acme" returns inconsistent results
    When the health check detects index corruption (checksum mismatch)
    Then an index rebuild is triggered
    And the index is rebuilt from the PostgreSQL catalog (source of truth)
    And during rebuild, search falls back to PostgreSQL full-text search
    And users see a banner: "Search index is being rebuilt, results may be incomplete"
    And the rebuild completes within 5 minutes for up to 10,000 services

  Scenario: Meilisearch index rebuild does not affect other tenants
    Given tenant "acme"'s index is being rebuilt
    When tenant "globex" performs a search
    Then globex's search is unaffected
    And globex's index is not touched during acme's rebuild

  Scenario: Search ranking is configured correctly
    Given the Meilisearch index has ranking rules configured
    When a user searches for "api"
    Then results are ranked by: words, typo, proximity, attribute, sort, exactness
    And services with "api" in the name rank higher than those with "api" in description
    And recently updated services rank higher among equal-relevance results
```
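The index-creation scenario can be sketched as a settings payload builder. The key names follow Meilisearch's documented settings object, the attribute lists come from the scenarios above, and the function itself is illustrative rather than the portal's actual provisioning code:

```python
def index_settings_for_tenant(tenant_slug):
    """Per-tenant index name plus the searchable/filterable configuration
    implied by Story 4.3."""
    return {
        "index_uid": f"services_{tenant_slug}",
        "settings": {
            "searchableAttributes": ["name", "description", "owner", "tags"],
            "filterableAttributes": ["source", "health_status", "owner", "region"],
            "rankingRules": ["words", "typo", "proximity",
                             "attribute", "sort", "exactness"],
        },
    }
```

Because `name` precedes `description` in `searchableAttributes`, the attribute ranking rule makes name matches outrank description matches, as the last scenario requires.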

---

## Epic 5: Dashboard UI

### Story 5.1 — Service Catalog Browse

```gherkin
Feature: Service Catalog Browse UI

  Background:
    Given the user is authenticated and on the catalog page

  Scenario: Catalog page loads with service cards
    Given the tenant has 42 services in the catalog
    When the catalog page loads
    Then 20 service cards are displayed (first page)
    And each card shows: service name, owner, health status badge, source icon, last updated
    And a pagination control shows "Page 1 of 3"

  Scenario: Progressive disclosure — card expands on hover
    Given a service card for "payment-api" is visible
    When the user hovers over the card
    Then the card expands to show additional details:
      | Field       | Value                 |
      | Description | Handles payment flows |
      | Region      | us-east-1             |
      | On-call     | alice@acme.com        |
      | Tech stack  | Node.js, Docker       |
    And the expansion animation completes within 200ms

  Scenario: Filter services by owner
    Given the catalog has services from multiple teams
    When the user selects filter "owner: payments-team"
    Then only services owned by "payments-team" are shown
    And the URL updates with ?owner=payments-team
    And the filter chip is shown as active

  Scenario: Filter services by health status
    Given some services have health_status: "healthy", "degraded", "unknown"
    When the user selects filter "status: degraded"
    Then only degraded services are shown
    And the count badge updates to reflect filtered results

  Scenario: Filter services by source
    Given services come from "aws-ecs", "aws-lambda", "github"
    When the user selects filter "source: aws-lambda"
    Then only Lambda functions are shown in the catalog

  Scenario: Sort services by last updated
    Given the catalog has services with various updated_at timestamps
    When the user selects sort "Last Updated (newest first)"
    Then services are sorted with most recently updated first
    And the sort selection persists across page navigation

  Scenario: Empty catalog state
    Given the tenant has just onboarded and has 0 services
    When the catalog page loads
    Then an empty state is shown: "No services discovered yet"
    And a CTA button: "Run Discovery Scan" is prominently displayed
    And a link to the onboarding guide is shown
```

### Story 5.2 — Service Detail Drawer

```gherkin
Feature: Service Detail Drawer

  Scenario: Open service detail drawer
    Given the user clicks on service card "checkout-api"
    When the drawer opens
    Then it slides in from the right within 300ms
    And displays full service details:
      | Section   | Content                             |
      | Header    | Service name, health badge, source  |
      | Ownership | Team name, on-call engineer         |
      | Metadata  | Region, runtime, version, tags      |
      | Health    | Scorecard with metrics              |
      | Links     | GitHub repo, PagerDuty, AWS console |
    And the main catalog page remains visible behind the drawer

  Scenario: Drawer URL is shareable
    Given the drawer is open for "checkout-api" (id: svc-123)
    When the user copies the URL
    Then the URL is /catalog?service=svc-123
    And sharing this URL opens the catalog with the drawer pre-opened

  Scenario: Close drawer with Escape key
    Given the service detail drawer is open
    When the user presses Escape
    Then the drawer closes
    And focus returns to the service card that was clicked

  Scenario: Navigate between services in drawer
    Given the drawer is open for "checkout-api"
    When the user presses the right arrow or clicks "Next service"
    Then the drawer transitions to the next service in the current filtered/sorted list
    And the URL updates accordingly

  Scenario: Drawer shows stale data warning
    Given service "legacy-api" has status: "stale" (not seen in last scan)
    When the drawer opens for "legacy-api"
    Then a warning banner shows: "This service was not found in the last scan (3 days ago)"
    And a "Re-run scan" button is available

  Scenario: Edit service owner from drawer
    Given the user has editor permissions
    When they click "Edit owner" in the drawer
    Then an inline edit field appears
    And they can type a new owner name with autocomplete from known teams
    And saving updates the catalog immediately
    And the drawer reflects the new owner without a full page reload
```

### Story 5.3 — Cmd+K Search in UI

```gherkin
Feature: Cmd+K Search UI Integration

  Scenario: Search modal shows categorized results
    Given the user types "pay" in the Cmd+K search modal
    When results are returned
    Then they are grouped by category:
      | Category | Examples                       |
      | Services | payment-api, payment-processor |
      | Teams    | payments-team                  |
      | Actions  | Run discovery scan             |
    And keyboard navigation works between categories

  Scenario: Search modal shows recent searches on open
    Given the user has previously searched for "auth", "billing", "gateway"
    When the user opens Cmd+K without typing
    Then recent searches are shown as suggestions
    And clicking a suggestion populates the search input

  Scenario: Search result keyboard navigation
    Given the search modal is open with 5 results
    When the user presses the down arrow key
    Then the first result is highlighted
    And pressing down again highlights the second result
    And pressing Enter navigates to the highlighted result

  Scenario: Search modal is accessible
    Given the search modal is open
    Then it has role="dialog" and aria-label="Search services"
    And the input has aria-label="Search"
    And results have role="listbox" with role="option" items
    And screen readers announce result count changes
```

---

## Epic 6: Analytics Dashboards

### Story 6.1 — Ownership Coverage Dashboard

```gherkin
Feature: Ownership Coverage Analytics

  Scenario: Display ownership coverage percentage
    Given the tenant has 100 services in the catalog
    And 75 services have a confirmed owner
    And 25 services have owner: null or "unknown"
    When the analytics dashboard loads
    Then the ownership coverage metric shows 75%
    And a donut chart shows the breakdown: 75 owned, 25 unowned
    And a trend line shows coverage change over the last 30 days

  Scenario: Ownership coverage by team
    Given multiple teams own services
    When the ownership breakdown table is rendered
    Then it shows each team with their service count and percentage
    And teams are sorted by service count descending
    And clicking a team filters the catalog to that team's services

  Scenario: Ownership coverage drops below threshold
    Given the ownership coverage threshold is set to 80%
    And coverage drops from 82% to 76% after a scan
    When the dashboard refreshes
    Then a warning alert is shown: "Ownership coverage below 80% threshold"
    And the affected unowned services are listed
    And an email notification is sent to the tenant admin

  Scenario: Zero unowned services
    Given all 50 services have confirmed owners
    When the ownership dashboard loads
    Then coverage shows 100%
    And a success state is shown: "All services have owners"
    And no warning alerts are displayed
```
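The coverage metric in these scenarios is plain arithmetic. A sketch, assuming services are dicts where an `owner` of None or "unknown" counts as unowned; an empty catalog reports 100% since nothing is unowned:

```python
def ownership_coverage(services):
    """Return owned/unowned counts and the coverage percentage for the dashboard."""
    total = len(services)
    owned = sum(1 for s in services if s.get("owner") not in (None, "unknown"))
    coverage_pct = 100.0 if total == 0 else round(100.0 * owned / total, 1)
    return {"owned": owned, "unowned": total - owned, "coverage_pct": coverage_pct}
```

Comparing `coverage_pct` against the tenant's configured threshold (80% in the scenario) is what drives the warning alert and admin email.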

### Story 6.2 — Service Health Scorecards

```gherkin
Feature: Service Health Scorecards

  Scenario: Health scorecard is calculated for a service
    Given service "payment-api" has the following signals:
      | Signal             | Value  | Score |
      | Has owner          | Yes    | 20    |
      | Has on-call policy | Yes    | 20    |
      | Has documentation  | Yes    | 15    |
      | Recent deployment  | 2 days | 15    |
      | No open P1 alerts  | True   | 30    |
    When the health scorecard is computed
    Then the overall score is 100/100
    And the health_status is "healthy"

  Scenario: Service with missing ownership scores lower
    Given service "orphan-lambda" has:
      | Signal             | Value   | Score |
      | Has owner          | No      | 0     |
      | Has on-call policy | No      | 0     |
      | Has documentation  | No      | 0     |
      | Recent deployment  | 90 days | 5     |
      | No open P1 alerts  | True    | 30    |
    When the health scorecard is computed
    Then the overall score is 35/100
    And the health_status is "at-risk"
    And the scorecard highlights the missing signals as improvement actions

  Scenario: Health scorecard trend over time
    Given "checkout-api" has had weekly scorecard snapshots for 8 weeks
    When the service detail drawer shows the health trend
    Then a sparkline chart shows the score history
    And the trend direction (improving/declining) is indicated

  Scenario: Team-level health KPI aggregation
    Given "payments-team" owns 10 services with scores: [90, 85, 70, 95, 60, 80, 75, 88, 92, 65]
    When the team KPI dashboard is rendered
    Then the team average score is 80/100
    And the lowest-scoring service is highlighted for attention
    And the team score is compared to the org average

  Scenario: Health scorecard for stale service
    Given service "legacy-api" has status: "stale"
    When the scorecard is computed
    Then the "Recent deployment" signal scores 0
    And a penalty is applied for being stale
    And the overall score reflects the staleness
```
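The scorecard tables above imply a fixed weight per signal. In this sketch the weights are taken from the scenarios, while the status cut-offs (healthy at 80+, at-risk below 50, "degraded" in between) are assumptions: the spec names the statuses but not the exact thresholds.

```python
# Weights from the Story 6.2 signal tables; each signal is capped at its weight.
SIGNAL_WEIGHTS = {
    "has_owner": 20,
    "has_oncall_policy": 20,
    "has_documentation": 15,
    "recent_deployment": 15,
    "no_open_p1_alerts": 30,
}

def health_score(signals):
    """signals: {signal_name: points earned}; returns (score, status).
    Status thresholds here are assumed, not specified."""
    score = sum(min(signals.get(name, 0), cap)
                for name, cap in SIGNAL_WEIGHTS.items())
    if score >= 80:
        status = "healthy"
    elif score >= 50:
        status = "degraded"
    else:
        status = "at-risk"
    return score, status
```

The two worked examples reproduce the tables: all signals at full weight give 100/100 "healthy", and the orphaned Lambda's 5 + 30 points give 35/100 "at-risk".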

### Story 6.3 — Tech Debt Tracking

```gherkin
Feature: Tech Debt Tracking Dashboard

  Scenario: Identify services using deprecated runtimes
    Given the portal has a policy: "Node.js < 18 is deprecated"
    And 5 services use Node.js 16
    When the tech debt dashboard loads
    Then it shows 5 services flagged for "deprecated runtime"
    And each flagged service links to its catalog entry
    And the total tech debt score is calculated

  Scenario: Track services with no recent deployments
    Given the policy threshold is "no deployment in 90 days = tech debt"
    And 8 services have not been deployed in over 90 days
    When the tech debt dashboard loads
    Then these 8 services appear in the "stale deployments" category
    And the owning teams are notified via the dashboard

  Scenario: Tech debt trend over time
    Given tech debt metrics have been tracked for 12 weeks
    When the trend chart is rendered
    Then it shows weekly tech debt item counts
    And highlights weeks where debt increased significantly
    And shows the net change (items resolved vs. items added)
```
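Both tech-debt policies above are simple threshold checks. A sketch, assuming a Lambda-style runtime string ("nodejs16.x") and a timezone-aware `last_deployed_at`; the service dict shape is hypothetical:

```python
from datetime import datetime, timedelta, timezone

DEPRECATED_NODE_BELOW = 18  # "Node.js < 18 is deprecated"
STALE_AFTER_DAYS = 90       # "no deployment in 90 days = tech debt"

def tech_debt_flags(service, now=None):
    """Return the list of tech-debt categories this service falls into."""
    now = now or datetime.now(timezone.utc)
    flags = []
    runtime = service.get("runtime") or ""
    if runtime.startswith("nodejs"):
        major = int(runtime.removeprefix("nodejs").split(".")[0])
        if major < DEPRECATED_NODE_BELOW:
            flags.append("deprecated runtime")
    last_deployed = service.get("last_deployed_at")
    if last_deployed and now - last_deployed > timedelta(days=STALE_AFTER_DAYS):
        flags.append("stale deployment")
    return flags
```

Summing flag counts per week is enough to feed the trend chart in the last scenario.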

---