dd0c/portal — BDD Acceptance Test Specifications
Gherkin scenarios for all 10 epics. Edge cases included.
Epic 1: AWS Discovery Scanner
Story 1.1 — IAM Role Assumption into Customer Account
Feature: IAM Role Assumption for Cross-Account Discovery
Background:
Given the portal has a tenant with AWS integration configured
And the tenant has provided an IAM role ARN "arn:aws:iam::123456789012:role/PortalDiscoveryRole"
Scenario: Successfully assume IAM role and begin scan
Given the IAM role has the required trust policy allowing portal's scanner principal
When the AWS Discovery Scanner initiates a scan for the tenant
Then the scanner assumes the role via STS AssumeRole
And receives temporary credentials valid for 1 hour
And begins scanning the configured AWS regions
Scenario: IAM role assumption fails due to missing trust policy
Given the IAM role does NOT have a trust policy for the portal's scanner principal
When the AWS Discovery Scanner attempts to assume the role
Then STS returns an "AccessDenied" error
And the scanner marks the scan job as FAILED with reason "IAM role assumption denied"
And the tenant receives a notification: "AWS integration error: cannot assume role"
And no partial scan results are persisted
Scenario: IAM role ARN is malformed
Given the tenant has configured an invalid role ARN "not-a-valid-arn"
When the AWS Discovery Scanner attempts to assume the role
Then the scanner marks the scan job as FAILED with reason "Invalid role ARN format"
And logs a structured error with tenant_id and the invalid ARN
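The malformed-ARN scenario implies a format check before any STS call is attempted. A minimal sketch of such a validator — the function name and regex are illustrative, not part of the spec:

```python
import re

# IAM role ARN shape: arn:<partition>:iam::<12-digit account id>:role/<path+name>
# The partition list and name charset here are simplifying assumptions.
ROLE_ARN_RE = re.compile(
    r"^arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+$"
)

def validate_role_arn(arn: str) -> bool:
    """Return True when the ARN matches the IAM role format."""
    return bool(ROLE_ARN_RE.match(arn))
```

Rejecting "not-a-valid-arn" locally lets the scanner fail fast with "Invalid role ARN format" instead of burning an STS call.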
Scenario: Temporary credentials expire mid-scan
Given the scanner has assumed the IAM role and started scanning
And the scan has been running for 61 minutes
When the scanner attempts to call ECS DescribeClusters
Then AWS returns an "ExpiredTokenException"
And the scanner refreshes credentials via a new AssumeRole call
And resumes scanning from the last checkpoint
And the scan job status remains IN_PROGRESS
Scenario: Role assumption succeeds but has insufficient permissions
Given the IAM role is assumed successfully
But the role lacks "ecs:ListClusters" permission
When the scanner attempts to list ECS clusters
Then AWS returns an "AccessDenied" error for that specific call
And the scanner records a partial failure for the ECS resource type
And continues scanning other resource types (Lambda, CloudFormation, API Gateway)
And the final scan result includes a warnings list with "ECS: insufficient permissions"
Story 1.2 — ECS Service Discovery
Feature: ECS Service Discovery
Background:
Given a tenant's AWS account has been successfully authenticated
And the scanner is configured to scan region "us-east-1"
Scenario: Discover ECS services across multiple clusters
Given the AWS account has 3 ECS clusters: "prod-cluster", "staging-cluster", "dev-cluster"
And "prod-cluster" has 5 services, "staging-cluster" has 3, "dev-cluster" has 2
When the AWS Discovery Scanner runs the ECS scan step
Then it lists all 3 clusters via ecs:ListClusters
And describes all services in each cluster via ecs:DescribeServices
And emits 10 discovered service events to the catalog ingestion queue
And each event includes: cluster_name, service_name, task_definition, desired_count, region
Scenario: ECS cluster has no services
Given the AWS account has 1 ECS cluster "empty-cluster" with 0 services
When the scanner runs the ECS scan step
Then it lists the cluster successfully
And records 0 services discovered for that cluster
And does not emit any service events for that cluster
And the scan step completes without error
Scenario: ECS DescribeServices returns throttling error
Given the AWS account has an ECS cluster with 100 services
When the scanner calls ecs:DescribeServices in batches of 10
And AWS returns a ThrottlingException on the 5th batch
Then the scanner applies exponential backoff (2s, 4s, 8s)
And retries the failed batch up to 3 times
And if retry succeeds, continues with remaining batches
And the final result includes all 100 services
Scenario: ECS DescribeServices fails after all retries
Given AWS returns ThrottlingException on every retry attempt
When the scanner exhausts all 3 retries for a batch
Then it marks that batch as partially failed
And records which service ARNs could not be described
And continues with the next batch
And the scan summary includes "ECS partial failure: 10 services could not be described"
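The two throttling scenarios above describe one retry policy: exponential backoff (2s, 4s, 8s), up to 3 retries, and a batch that still fails is recorded and skipped rather than failing the scan. A sketch under those assumptions — `ThrottlingError` stands in for the AWS ThrottlingException:

```python
import time

class ThrottlingError(Exception):
    """Stand-in for AWS ThrottlingException."""

def describe_batch_with_retry(call, batch, retries=3, base_delay=2.0, sleep=time.sleep):
    """Retry a throttled batch with exponential backoff (2s, 4s, 8s).

    Returns (result, None) on success, or (None, batch) when retries are
    exhausted so the caller can record the failed service ARNs and continue
    with the next batch.
    """
    for attempt in range(retries + 1):
        try:
            return call(batch), None
        except ThrottlingError:
            if attempt == retries:
                return None, batch              # mark batch as partially failed
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s
```

The injectable `sleep` keeps the backoff testable without real delays.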
Scenario: Multi-region ECS scan
Given the tenant has configured scanning for regions: "us-east-1", "eu-west-1", "ap-southeast-1"
When the AWS Discovery Scanner runs
Then it scans ECS in all 3 regions in parallel via Step Functions Map state
And aggregates results from all regions
And each discovered service record includes its region
And duplicate service names across regions are stored as separate catalog entries
Story 1.3 — Lambda Function Discovery
Feature: Lambda Function Discovery
Scenario: Discover all Lambda functions in a region
Given the AWS account has 25 Lambda functions in "us-east-1"
When the scanner runs the Lambda scan step
Then it calls lambda:ListFunctions with pagination
And retrieves all 25 functions across multiple pages
And emits 25 service events with: function_name, runtime, memory_size, timeout, last_modified
Scenario: Lambda function has tags indicating service ownership
Given a Lambda function "payment-processor" has tags:
| Key | Value |
| team | payments-team |
| service | payment-svc |
| environment | production |
When the scanner processes this Lambda function
Then the catalog entry includes ownership: "payments-team" (source: implicit/tags)
And the service name is inferred as "payment-svc" from the "service" tag
Scenario: Lambda function has no tags
Given a Lambda function "legacy-cron-job" has no tags
When the scanner processes this Lambda function
Then the catalog entry has ownership: null (source: heuristic/pending)
And the entry is flagged for ownership inference via commit history
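The tagged and untagged cases above can be expressed as one mapping step. A minimal sketch — the returned field names are illustrative, not the real catalog schema:

```python
def ownership_from_tags(tags: dict) -> dict:
    """Derive catalog ownership fields from Lambda resource tags.

    Tag keys ("team", "service") mirror the scenarios above; a function
    with no tags is left unowned and flagged for commit-history inference.
    """
    if "team" in tags:
        return {"owner": tags["team"], "owner_source": "implicit/tags",
                "name": tags.get("service")}
    return {"owner": None, "owner_source": "heuristic/pending", "name": None}
```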
Scenario: Lambda ListFunctions pagination
Given the AWS account has 150 Lambda functions (exceeding the 50-per-page limit)
When the scanner calls lambda:ListFunctions
Then it follows NextMarker pagination tokens
And retrieves all 150 functions across 3 pages
And does not duplicate any function in the results
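The pagination scenario above — follow NextMarker tokens, no duplicates — can be sketched with a stand-in page fetcher in place of the real lambda:ListFunctions call:

```python
def list_all_functions(fetch_page):
    """Follow NextMarker pagination and dedupe by function name.

    `fetch_page(marker)` stands in for lambda:ListFunctions and must return
    a dict shaped like the AWS response: {"Functions": [...], "NextMarker": ...}.
    """
    functions, seen, marker = [], set(), None
    while True:
        page = fetch_page(marker)
        for fn in page["Functions"]:
            if fn["FunctionName"] not in seen:  # guard against duplicates
                seen.add(fn["FunctionName"])
                functions.append(fn)
        marker = page.get("NextMarker")
        if not marker:
            return functions
```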
Story 1.4 — CloudFormation Stack Discovery
Feature: CloudFormation Stack Discovery
Scenario: Discover active CloudFormation stacks
Given the AWS account has 8 CloudFormation stacks in "us-east-1"
And 6 stacks have status "CREATE_COMPLETE" or "UPDATE_COMPLETE"
And 2 stacks have status "ROLLBACK_COMPLETE"
When the scanner runs the CloudFormation scan step
Then it discovers all 8 stacks
And includes stacks of all statuses in the raw results
And marks ROLLBACK_COMPLETE stacks with health_status: "degraded"
Scenario: CloudFormation stack outputs contain service metadata
Given a stack "api-gateway-stack" has outputs:
| OutputKey | OutputValue |
| ServiceName | user-api |
| TeamOwner | platform-team |
| ApiEndpoint | https://api.example.com/users |
When the scanner processes this stack
Then the catalog entry uses "user-api" as the service name
And sets ownership to "platform-team" (source: implicit/stack-output)
And records the API endpoint in service metadata
Scenario: Nested stacks are discovered
Given a parent CloudFormation stack has 3 nested stacks
When the scanner processes the parent stack
Then it also discovers and catalogs each nested stack
And links nested stacks to their parent in the catalog
Scenario: Stack in DELETE_IN_PROGRESS state
Given a CloudFormation stack has status "DELETE_IN_PROGRESS"
When the scanner discovers this stack
Then it records the stack with health_status: "terminating"
And does not block the scan on this stack's state
Story 1.5 — API Gateway Discovery
Feature: API Gateway Discovery
Scenario: Discover REST APIs in API Gateway v1
Given the AWS account has 4 REST APIs in API Gateway
When the scanner runs the API Gateway scan step
Then it calls apigateway:GetRestApis
And retrieves all 4 APIs with their names, IDs, and endpoints
And emits catalog events for each API
Scenario: Discover HTTP APIs in API Gateway v2
Given the AWS account has 3 HTTP APIs in API Gateway v2
When the scanner runs the API Gateway v2 scan step
Then it calls apigatewayv2:GetApis
And retrieves all 3 HTTP APIs
And correctly distinguishes them from REST APIs in the catalog
Scenario: API Gateway API has no associated service tag
Given an API Gateway REST API "legacy-api" has no resource tags
When the scanner processes this API
Then it creates a catalog entry with name "legacy-api"
And sets ownership to null pending heuristic inference
And records the API type as "REST" and the invoke URL
Scenario: API Gateway scan across multiple regions
Given API Gateway APIs exist in "us-east-1" and "us-west-2"
When the scanner runs in parallel across both regions
Then APIs from both regions are discovered
And each catalog entry includes the region field
And there are no cross-region duplicates
Story 1.6 — Step Functions Orchestration
Feature: Step Functions Orchestration of Discovery Scan
Scenario: Full scan executes all resource type steps in parallel
Given a scan job is triggered for a tenant
When the Step Functions state machine starts
Then it executes ECS, Lambda, CloudFormation, and API Gateway scan steps in parallel
And waits for all parallel branches to complete
And aggregates results from all branches
And transitions to the catalog ingestion step
Scenario: One parallel branch fails, others succeed
Given the Step Functions state machine is running a scan
And the Lambda scan step throws an unhandled exception
When the state machine catches the error
Then the ECS, CloudFormation, and API Gateway branches continue to completion
And the Lambda branch is marked as FAILED in the scan summary
And the overall scan job status is PARTIAL_SUCCESS
And discovered resources from successful branches are ingested into the catalog
Scenario: Scan job is idempotent on retry
Given a scan job failed midway through execution
When the Step Functions state machine retries the execution
Then it does not duplicate already-ingested catalog entries
And uses upsert semantics based on (tenant_id, resource_arn) as the unique key
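The idempotent-retry scenario hinges on upsert semantics keyed on (tenant_id, resource_arn). A runnable sketch using SQLite as a stand-in for Aurora PostgreSQL — the table columns are simplified, but the ON CONFLICT clause is the same idea in both engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE services (
        tenant_id    TEXT NOT NULL,
        resource_arn TEXT NOT NULL,
        name         TEXT,
        owner        TEXT,
        PRIMARY KEY (tenant_id, resource_arn)
    )
""")

def upsert_service(tenant_id, resource_arn, name, owner):
    """Insert or update a catalog row; a retried execution never duplicates."""
    conn.execute(
        """INSERT INTO services (tenant_id, resource_arn, name, owner)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(tenant_id, resource_arn)
           DO UPDATE SET name = excluded.name, owner = excluded.owner""",
        (tenant_id, resource_arn, name, owner),
    )
```

Replaying the same event after a failed Step Functions execution updates the existing row in place.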
Scenario: Concurrent scan jobs for the same tenant are prevented
Given a scan job is already IN_PROGRESS for tenant "acme-corp"
When a second scan job is triggered for "acme-corp"
Then the system detects the existing in-progress execution
And rejects the new job with status 409 Conflict
And returns the ID of the existing running scan job
Scenario: Scan job timeout
Given a scan job has been running for more than 30 minutes
When the Step Functions state machine timeout is reached
Then the execution is marked as TIMED_OUT
And partial results already ingested are retained in the catalog
And the tenant is notified of the timeout
And the scan job record shows last_successful_step
Scenario: WebSocket progress events during scan
Given a user is watching the discovery progress in the UI
When each Step Functions step completes
Then a WebSocket event is pushed to the user's connection
And the event includes: step_name, status, resources_found, timestamp
And the UI progress bar updates in real-time
Story 1.7 — Multi-Region Scanning
Feature: Multi-Region AWS Scanning
Scenario: Tenant configures specific regions to scan
Given a tenant has configured scan regions: ["us-east-1", "eu-central-1"]
When the discovery scanner runs
Then it only scans the configured regions
And does not scan any other AWS regions
And the scan summary shows per-region resource counts
Scenario: Tenant configures "all regions" scan
Given a tenant has configured scan_all_regions: true
When the discovery scanner runs
Then it first calls ec2:DescribeRegions to get the current list of enabled regions
And scans all enabled regions
And the Step Functions Map state iterates over the dynamic region list
Scenario: A region is disabled in the AWS account
Given the tenant's AWS account has "ap-east-1" disabled
And the tenant has configured scan_all_regions: true
When the scanner attempts to scan "ap-east-1"
Then it receives an "OptInRequired" or region-disabled error
And skips that region gracefully
And records "ap-east-1: skipped (region disabled)" in the scan summary
Scenario: Region scan results are isolated per tenant
Given tenant "acme" and tenant "globex" both have resources in "us-east-1"
When both tenants run discovery scans simultaneously
Then acme's catalog only contains acme's resources
And globex's catalog only contains globex's resources
And no cross-tenant data leakage occurs
Epic 2: GitHub Discovery Scanner
Story 2.1 — GitHub App Installation and Authentication
Feature: GitHub App Authentication
Background:
Given the portal has a GitHub App registered with app_id "12345"
And the app has a private key stored in AWS Secrets Manager
Scenario: Successfully authenticate as GitHub App installation
Given a tenant has installed the GitHub App on their organization "acme-org"
And the installation_id is "67890"
When the GitHub Discovery Scanner initializes for this tenant
Then it generates a JWT signed with the app's private key
And exchanges the JWT for an installation access token via POST /app/installations/67890/access_tokens
And the token carries the app's granted permissions: contents:read, metadata:read
And the scanner uses this token for all subsequent API calls
Scenario: GitHub App JWT generation fails due to expired private key
Given the private key in Secrets Manager is expired or invalid
When the scanner attempts to generate a JWT
Then it fails with a key validation error
And marks the scan job as FAILED with reason "GitHub App authentication failed: invalid private key"
And triggers an alert to the tenant admin
Scenario: Installation access token request returns 404
Given the GitHub App installation has been uninstalled by the tenant
When the scanner requests an installation access token
Then GitHub returns 404 Not Found
And the scanner marks the GitHub integration as DISCONNECTED
And notifies the tenant: "GitHub App has been uninstalled. Please reinstall to continue discovery."
Scenario: Installation access token nears expiry mid-scan
Given the scanner has a valid installation token (valid for 1 hour)
And the scan has been running for 55 minutes
When the scanner checks the token's remaining lifetime before its next GraphQL query
Then it proactively refreshes the token via a new access-token request
And any in-flight request that still returns 401 Unauthorized is retried with the fresh token
And the scan continues without interruption
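The proactive-refresh behavior reduces to a single timing check. A minimal sketch — the 5-minute margin matches the 55-minute mark on a 1-hour token; the function name is illustrative:

```python
from datetime import datetime, timedelta, timezone

REFRESH_MARGIN = timedelta(minutes=5)  # refresh at the 55-minute mark of a 1-hour token

def needs_refresh(expires_at: datetime, now: datetime) -> bool:
    """True when the installation token is within 5 minutes of expiry."""
    return now >= expires_at - REFRESH_MARGIN
```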
Story 2.2 — Repository Discovery via GraphQL
Feature: GitHub Repository Discovery via GraphQL API
Scenario: Discover all repositories in an organization
Given the tenant's GitHub organization "acme-org" has 150 repositories
When the GitHub Discovery Scanner runs the repository listing query
Then it uses the GraphQL API with cursor-based pagination
And retrieves all 150 repositories in batches of 100
And each repository record includes: name, defaultBranch, isArchived, pushedAt, topics
Scenario: Archived repositories are excluded by default
Given "acme-org" has 150 active repos and 30 archived repos
When the scanner runs with default settings
Then it discovers 150 active repositories
And excludes the 30 archived repositories
And the scan summary notes "30 archived repositories skipped"
Scenario: Tenant opts in to include archived repositories
Given the tenant has configured include_archived: true
When the scanner runs
Then it discovers all 180 repositories including archived ones
And archived repos are marked with status: "archived" in the catalog
Scenario: Organization has no repositories
Given the tenant's GitHub organization has 0 repositories
When the scanner runs the repository listing query
Then it returns an empty result set
And the scan completes successfully with 0 GitHub services discovered
And no errors are raised
Scenario: GraphQL query returns partial data with errors
Given the GraphQL query returns 80 repositories successfully
But includes a "FORBIDDEN" error for 20 private repositories
When the scanner processes the response
Then it ingests the 80 accessible repositories
And records a warning: "20 repositories inaccessible due to permissions"
And the scan status is PARTIAL_SUCCESS
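The partial-data scenario maps cleanly to how the GitHub GraphQL API reports inaccessible nodes: nulls in the node list plus top-level errors with type "FORBIDDEN". A sketch of the response handling — field names are simplified:

```python
def process_graphql_response(resp: dict):
    """Split a GraphQL repository response into accessible repos and warnings.

    Inaccessible repositories appear as null nodes alongside FORBIDDEN
    entries in the top-level "errors" array.
    """
    repos = [n for n in resp["data"]["repositories"]["nodes"] if n is not None]
    forbidden = [e for e in resp.get("errors", []) if e.get("type") == "FORBIDDEN"]
    warnings = []
    if forbidden:
        warnings.append(f"{len(forbidden)} repositories inaccessible due to permissions")
    status = "PARTIAL_SUCCESS" if forbidden else "SUCCESS"
    return repos, warnings, status
```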
Story 2.3 — Service Metadata Extraction from Repositories
Feature: Service Metadata Extraction from Repository Files
Scenario: Extract service name from package.json
Given a repository "user-service" has a package.json at the root:
"""
{
"name": "@acme/user-service",
"version": "2.1.0",
"description": "Handles user authentication and profiles"
}
"""
When the scanner fetches and parses package.json
Then the catalog entry uses service name "@acme/user-service"
And records version "2.1.0"
And records description "Handles user authentication and profiles"
And sets the service type as "nodejs"
Scenario: Extract service metadata from Dockerfile
Given a repository has a Dockerfile with:
"""
FROM python:3.11-slim
LABEL service.name="data-pipeline"
LABEL service.team="data-engineering"
EXPOSE 8080
"""
When the scanner parses the Dockerfile
Then the catalog entry uses service name "data-pipeline"
And sets ownership to "data-engineering" (source: implicit/dockerfile-label)
And records the service type as "python"
And records exposed port 8080
Scenario: Extract service metadata from Terraform files
Given a repository has a main.tf with:
"""
module "service" {
source = "..."
service_name = "billing-api"
team = "billing-team"
}
"""
When the scanner parses Terraform files
Then the catalog entry uses service name "billing-api"
And sets ownership to "billing-team" (source: implicit/terraform)
Scenario: Repository has no recognizable service metadata files
Given a repository "random-scripts" has no package.json, Dockerfile, or terraform files
When the scanner processes this repository
Then it creates a catalog entry using the repository name "random-scripts"
And sets service type as "unknown"
And flags the entry for manual review
And ownership is set to null pending heuristic inference
Scenario: Multiple metadata files exist — precedence order
Given a repository has both a package.json (name: "svc-a") and a Dockerfile (LABEL service.name="svc-b")
When the scanner processes the repository
Then it uses package.json as the primary metadata source
And records the service name as "svc-a"
And notes the Dockerfile label as secondary metadata
Scenario: package.json is malformed JSON
Given a repository has a package.json with invalid JSON syntax
When the scanner attempts to parse it
Then it logs a warning: "package.json parse error in repo: user-service"
And falls back to Dockerfile or terraform for metadata
And if no fallback exists, uses the repository name
And does not crash the scan job
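The precedence and malformed-JSON scenarios above describe one fallback chain: package.json first, Dockerfile labels next, repository name last, with parse errors logged rather than fatal. A sketch under those assumptions — the "container" type and return shape are illustrative:

```python
import json

def extract_metadata(package_json, dockerfile_labels, repo_name):
    """Resolve service metadata with the precedence described above."""
    warnings = []
    if package_json is not None:
        try:
            pkg = json.loads(package_json)
        except json.JSONDecodeError:
            warnings.append(f"package.json parse error in repo: {repo_name}")
        else:
            if "name" in pkg:
                return {"name": pkg["name"], "type": "nodejs", "warnings": warnings}
    if dockerfile_labels and "service.name" in dockerfile_labels:
        return {"name": dockerfile_labels["service.name"], "type": "container",
                "warnings": warnings}
    # No recognizable metadata: fall back to the repository name
    return {"name": repo_name, "type": "unknown", "warnings": warnings}
```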
Story 2.4 — GitHub Rate Limit Handling
Feature: GitHub API Rate Limit Handling
Scenario: GraphQL rate limit warning threshold reached
Given the scanner has consumed 4,500 of 5,000 GraphQL rate limit points
When the scanner checks the rate limit headers after each response
Then it detects the threshold (90% consumed)
And pauses scanning until the rate limit resets
And logs: "GitHub rate limit threshold reached, waiting for reset at {reset_time}"
And resumes scanning after the reset window
Scenario: REST API secondary rate limit hit
Given the scanner is making rapid REST API calls
When GitHub returns HTTP 429 with "secondary rate limit" in the response
Then the scanner reads the Retry-After header
And waits the specified number of seconds before retrying
And does not count this as a scan failure
Scenario: Rate limit exhausted with many repos remaining
Given the scanner has 500 repositories to scan
And the rate limit is exhausted after scanning 200 repositories
When the scanner detects rate limit exhaustion
Then it checkpoints the current progress (200 repos scanned)
And schedules a continuation scan after the rate limit reset
And the scan job status is set to RATE_LIMITED (not FAILED)
And the tenant is notified of the delay
Scenario: Rate limit headers are missing from response
Given GitHub returns a response without X-RateLimit headers
When the scanner processes the response
Then it applies a conservative default delay of 1 second between requests
And logs a warning about missing rate limit headers
And continues scanning without crashing
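The threshold and missing-header scenarios reduce to one pacing decision per response. A sketch — the return values are illustrative labels for the three behaviors described above:

```python
def rate_limit_action(headers):
    """Decide how to pace requests from GitHub rate-limit headers.

    Returns "wait" past the 90% threshold, "default-delay" when the headers
    are absent (conservative 1s pacing), otherwise "continue".
    """
    remaining = headers.get("X-RateLimit-Remaining")
    limit = headers.get("X-RateLimit-Limit")
    if remaining is None or limit is None:
        return "default-delay"
    used = int(limit) - int(remaining)
    if used / int(limit) >= 0.9:
        return "wait"
    return "continue"
```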
Scenario: Concurrent scans from multiple tenants share rate limit awareness
Given two tenants share the same GitHub App installation
When both tenants trigger scans simultaneously
Then the scanner uses a shared rate limit counter in Redis
And distributes available rate limit points between the two scans
And neither scan exceeds the total available rate limit
Story 2.5 — Ownership Inference from Commit History
Feature: Heuristic Ownership Inference from Commit History
Scenario: Infer owner from most recent committer
Given a repository has no explicit ownership config and no team tags
And the last 10 commits were made by "alice@acme.com" (7 commits) and "bob@acme.com" (3 commits)
When the ownership inference engine runs
Then it identifies "alice@acme.com" as the primary contributor
And maps the email to team "frontend-team" via the org's CODEOWNERS or team membership
And sets ownership to "frontend-team" (source: heuristic/commit-history)
And records confidence: 0.7
Scenario: Ownership inference from CODEOWNERS file
Given a repository has a CODEOWNERS file:
"""
* @acme-org/platform-team
/src/api/ @acme-org/api-team
"""
When the scanner processes the CODEOWNERS file
Then it sets ownership to "platform-team" (source: implicit/codeowners)
And records that the api directory has additional ownership by "api-team"
And this takes precedence over commit history heuristics
Scenario: Ownership conflict — multiple teams have equal commit share
Given a repository has commits split 50/50 between "team-a" and "team-b"
When the ownership inference engine runs
Then it records both teams as co-owners
And sets ownership_confidence: 0.5
And flags the service for manual ownership resolution
And the catalog entry shows ownership_status: "conflict"
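The commit-share heuristic in the two scenarios above (7/10 commits → confidence 0.7; a 50/50 split → conflict) can be sketched as a ranking over recent commit authors:

```python
from collections import Counter

def infer_owner_from_commits(commit_authors):
    """Rank contributors by commit share over the recent window.

    Returns (owners, confidence, status); a tie yields co-owners and a
    "conflict" status for manual resolution, per the scenarios above.
    """
    counts = Counter(commit_authors)
    total = sum(counts.values())
    ranked = counts.most_common()
    leaders = [author for author, n in ranked if n == ranked[0][1]]
    confidence = round(ranked[0][1] / total, 2)
    status = "conflict" if len(leaders) > 1 else "inferred"
    return leaders, confidence, status
```

Mapping the winning email to a team (e.g. alice@acme.com → frontend-team) is a separate lookup against org membership.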
Scenario: Explicit ownership config overrides all other sources
Given a repository has a .portal.yaml file:
"""
owner: payments-team
tier: critical
"""
And the repository also has CODEOWNERS pointing to "platform-team"
And commit history suggests "devops-team"
When the scanner processes the repository
Then ownership is set to "payments-team" (source: explicit/portal-config)
And CODEOWNERS and commit history are recorded as secondary metadata
And the explicit config is not overridden by any other source
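The override scenario fixes a total ordering of ownership sources. A sketch of the resolver — the exact precedence list (including where explicit/manual sits relative to portal config) is an assumption; the spec only guarantees explicit sources beat implicit and heuristic ones:

```python
# Highest-precedence source first; lower index wins.
PRECEDENCE = ["explicit/portal-config", "explicit/manual",
              "implicit/codeowners", "heuristic/commit-history"]

def resolve_owner(candidates):
    """Pick the winning (owner, source) pair from source -> owner candidates."""
    for source in PRECEDENCE:
        if source in candidates:
            return candidates[source], source
    return None, "unknown"
```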
Scenario: Commit history API call fails
Given the GitHub API returns 500 when fetching commit history for a repo
When the ownership inference engine attempts heuristic inference
Then it marks ownership as null (source: unknown)
And records the error in the scan summary
And does not block catalog ingestion for this service
Epic 3: Service Catalog
Story 3.1 — Catalog Ingestion and Upsert
Feature: Service Catalog Ingestion
Background:
Given the catalog uses PostgreSQL Aurora Serverless v2
And the unique key for a service is (tenant_id, source, resource_id)
Scenario: New service is ingested into the catalog
Given a discovery scan emits a new service event for "payment-api"
When the catalog ingestion handler processes the event
Then a new row is inserted into the services table
And the row includes: tenant_id, name, source, resource_id, owner, health_status, discovered_at
And the Meilisearch index is updated with the new service document
And the catalog entry is immediately searchable
Scenario: Existing service is updated on re-scan (upsert)
Given "payment-api" already exists in the catalog with owner "old-team"
When a new scan emits an updated event for "payment-api" with owner "payments-team"
Then the existing row is updated (not duplicated)
And owner is changed to "payments-team"
And updated_at is refreshed
And the Meilisearch document is updated
Scenario: Service removed from AWS is marked stale
Given "deprecated-lambda" exists in the catalog from a previous scan
When the latest scan completes and does not include "deprecated-lambda"
Then the catalog entry is marked with status: "stale"
And last_seen_at is not updated
And the service remains visible in the catalog with a "stale" badge
And is not immediately deleted
Scenario: Stale service is purged after retention period
Given "deprecated-lambda" has been stale for more than 30 days
When the nightly cleanup job runs
Then the service is soft-deleted from the catalog
And removed from the Meilisearch index
And a deletion event is logged for audit purposes
Scenario: Catalog ingestion fails due to Aurora connection error
Given Aurora Serverless is scaling up (cold start)
When the ingestion handler attempts to write a service record
Then it retries with exponential backoff up to 5 times
And if all retries fail, the event is sent to a dead-letter queue
And an alert is raised for the operations team
And the scan job is marked PARTIAL_SUCCESS
Scenario: Bulk ingestion of 500 services from a large account
Given a scan discovers 500 services across all resource types
When the catalog ingestion handler processes all events
Then it uses batch inserts (100 records per batch)
And completes ingestion within 30 seconds
And all 500 services appear in the catalog
And the Meilisearch index reflects all 500 services
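The bulk-ingestion scenario's batching (500 services, 100 records per insert) is a straightforward chunking step ahead of the database writes:

```python
def chunk(records, size=100):
    """Split the ingest list into insert batches of `size` (100 per the spec)."""
    return [records[i:i + size] for i in range(0, len(records), size)]
```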
Story 3.2 — PagerDuty / OpsGenie On-Call Mapping
Feature: On-Call Ownership Mapping via PagerDuty and OpsGenie
Scenario: Map service owner to PagerDuty escalation policy
Given a service "checkout-api" has owner "payments-team"
And PagerDuty integration is configured for the tenant
And "payments-team" maps to PagerDuty service "checkout-escalation-policy"
When the catalog builds the service detail record
Then the service includes on_call_policy: "checkout-escalation-policy"
And a link to the PagerDuty service page
Scenario: Fetch current on-call engineer from PagerDuty
Given "checkout-api" has a linked PagerDuty escalation policy
When a user views the service detail page
Then the portal calls PagerDuty GET /oncalls?escalation_policy_ids[]=...
And displays the current on-call engineer's name and contact
And caches the result for 5 minutes in Redis
Scenario: PagerDuty API returns 401 (invalid token)
Given the PagerDuty API token has been revoked
When the portal attempts to fetch on-call data
Then it returns the service detail without on-call info
And displays a warning: "On-call data unavailable — check PagerDuty integration"
And logs the auth failure for the tenant admin
Scenario: OpsGenie integration as alternative to PagerDuty
Given the tenant uses OpsGenie instead of PagerDuty
And OpsGenie integration is configured with an API key
When the portal fetches on-call data for a service
Then it calls the OpsGenie API instead of PagerDuty
And maps the OpsGenie schedule to the service owner
And displays the current on-call responder
Scenario: Service owner has no PagerDuty or OpsGenie mapping
Given "internal-tool" has owner "eng-team"
But "eng-team" has no mapping in PagerDuty or OpsGenie
When the portal builds the service detail
Then on_call_policy is null
And the UI shows "No on-call policy configured" for this service
And suggests linking a PagerDuty/OpsGenie service
Scenario: Multiple services share the same on-call policy
Given 10 services all map to the "platform-oncall" PagerDuty policy
When the portal fetches on-call data
Then it batches the PagerDuty API call for all 10 services
And does not make 10 separate API calls
And caches the shared result for all 10 services
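The shared-policy scenario (10 services, one PagerDuty call) amounts to deduplicating policies before the upstream request. A sketch with a stand-in fetcher in place of the real /oncalls call:

```python
def batch_oncall_lookup(service_policies, fetch_oncalls):
    """Resolve on-call per service with one upstream call for shared policies.

    `service_policies` maps service -> escalation policy id;
    `fetch_oncalls(policy_ids)` stands in for a single batched PagerDuty
    /oncalls request and returns policy_id -> current responder.
    """
    unique = sorted(set(service_policies.values()))
    oncalls = fetch_oncalls(unique)          # one API call, not one per service
    return {svc: oncalls.get(pol) for svc, pol in service_policies.items()}
```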
Story 3.3 — Service Catalog CRUD API
Feature: Service Catalog CRUD API
Background:
Given the API requires a valid Cognito JWT
And the JWT contains tenant_id claim
Scenario: List services for a tenant
Given tenant "acme" has 42 services in the catalog
When GET /api/v1/services is called with a valid JWT for "acme"
Then the response returns 42 services
And each service includes: id, name, owner, source, health_status, updated_at
And results are paginated (default page size: 20)
And the response includes total_count: 42
Scenario: List services with pagination
Given tenant "acme" has 42 services
When GET /api/v1/services?page=2&limit=20 is called
Then the response returns services 21-40
And includes pagination metadata: page, limit, total_count, total_pages
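The pagination envelope described above (page, limit, total_count, total_pages for 42 services at 20 per page) can be computed as:

```python
import math

def page_metadata(total_count, page, limit=20):
    """Build the pagination metadata returned alongside list results."""
    return {
        "page": page,
        "limit": limit,
        "total_count": total_count,
        "total_pages": math.ceil(total_count / limit) if limit else 0,
    }
```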
Scenario: Get single service by ID
Given service "svc-uuid-123" exists for tenant "acme"
When GET /api/v1/services/svc-uuid-123 is called with acme's JWT
Then the response returns the full service detail
And includes: metadata, owner, on_call_policy, health_scorecard, tags, links
Scenario: Get service belonging to different tenant returns 404
Given service "svc-uuid-456" belongs to tenant "globex"
When GET /api/v1/services/svc-uuid-456 is called with acme's JWT
Then the response returns 404 Not Found
And does not reveal that the service exists for another tenant
Scenario: Manually update service owner via API
Given service "legacy-api" has owner inferred as "unknown"
When PATCH /api/v1/services/svc-uuid-789 is called with body:
"""
{ "owner": "platform-team", "owner_source": "manual" }
"""
Then the service owner is updated to "platform-team"
And owner_source is set to "explicit/manual"
And the change is recorded in the audit log with the user's identity
And the Meilisearch document is updated
Scenario: Delete service from catalog
Given service "decommissioned-api" exists in the catalog
When DELETE /api/v1/services/svc-uuid-000 is called
Then the service is soft-deleted (deleted_at is set)
And it no longer appears in list or search results
And the Meilisearch document is removed
And the deletion is recorded in the audit log
Scenario: Create service manually (not from discovery)
When POST /api/v1/services is called with:
"""
{ "name": "manual-service", "owner": "ops-team", "source": "manual" }
"""
Then a new service is created with source: "manual"
And it appears in the catalog and search results
And it is not overwritten by automated discovery scans
Scenario: Unauthenticated request is rejected
When GET /api/v1/services is called without an Authorization header
Then the response returns 401 Unauthorized
And no service data is returned
Story 3.4 — Full-Text Search via Meilisearch
Feature: Catalog Full-Text Search
Scenario: Search returns relevant services
Given the catalog has services: "payment-api", "payment-processor", "user-service"
When a search query "payment" is submitted
Then Meilisearch returns "payment-api" and "payment-processor"
And results are ranked by relevance score
And the response time is under 10ms
Scenario: Search is scoped to tenant
Given tenant "acme" has service "auth-service"
And tenant "globex" has service "auth-service" (same name)
When tenant "acme" searches for "auth-service"
Then only acme's "auth-service" is returned
And globex's service is not included in results
Scenario: Meilisearch index is corrupted or unavailable
Given Meilisearch returns a 503 error
When a search request is made
Then the API falls back to PostgreSQL full-text search (pg_trgm)
And returns results within 500ms
And logs a warning: "Meilisearch unavailable, using PostgreSQL fallback"
And the user sees results (degraded performance, not an error)
Scenario: Catalog update triggers Meilisearch re-index
Given a service "new-api" is added to the catalog
When the catalog write completes
Then a Meilisearch index update is triggered asynchronously
And the service is searchable within 1 second
And if the Meilisearch update fails, it is retried via a background queue
Epic 4: Search Engine
Story 4.1 — Cmd+K Instant Search
Feature: Cmd+K Instant Search Interface
Scenario: User opens search with keyboard shortcut
Given the user is on any page of the portal
When the user presses Cmd+K (Mac) or Ctrl+K (Windows/Linux)
Then the search modal opens immediately
And the search input is focused
And recent searches are shown as suggestions
Scenario: Search returns results under 10ms
Given the Meilisearch index has 500 services
And Redis prefix cache is warm
When the user types "pay" in the search box
Then results appear within 10ms
And the top 5 matching services are shown
And each result shows: service name, owner, health status
Scenario: Search with Redis prefix cache hit
Given "pay" has been searched recently and is cached in Redis
When the user types "pay"
Then the API returns cached results from Redis
And does not query Meilisearch
And the response time is under 5ms
Scenario: Search with Redis cache miss
Given "xyz-unique-query" has never been searched
When the user types "xyz-unique-query"
Then the API queries Meilisearch directly
And stores the result in Redis with TTL of 60 seconds
And returns results within 10ms
Scenario: Search cache is invalidated on catalog update
Given "payment-api" search results are cached in Redis
When the catalog is updated (new service added or existing service modified)
Then all Redis cache keys matching affected prefixes are invalidated
And the next search query fetches fresh results from Meilisearch
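The read-through behavior in the hit/miss scenarios above can be sketched as a small cache wrapper. A plain dict with expiry timestamps stands in for Redis here; the key format matches the `search:{tenant}:{query}` convention used later in Story 4.2, and the wiring is illustrative.

```python
import time

CACHE_TTL_SECONDS = 60

def cache_key(tenant_id, query):
    """Tenant-scoped prefix cache key, e.g. 'search:acme:pay'."""
    return f"search:{tenant_id}:{query}"

class PrefixSearchCache:
    """Read-through cache sketch; a dict with expiry stands in for Redis."""

    def __init__(self, backend_search, clock=time.monotonic):
        self.backend_search = backend_search  # e.g. a Meilisearch query
        self.clock = clock
        self.store = {}  # key -> (expires_at, results)

    def search(self, tenant_id, query):
        key = cache_key(tenant_id, query)
        entry = self.store.get(key)
        if entry and entry[0] > self.clock():
            return entry[1]                         # cache hit, no backend call
        results = self.backend_search(tenant_id, query)   # cache miss
        self.store[key] = (self.clock() + CACHE_TTL_SECONDS, results)
        return results
```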
Scenario: Search with no results
Given no services match the query "zzznomatch"
When the user searches for "zzznomatch"
Then the search modal shows "No services found"
And suggests: "Try a broader search or browse all services"
And does not show an error
Scenario: Search is dismissed
Given the search modal is open
When the user presses Escape
Then the search modal closes
And focus returns to the previously focused element
Scenario: Search result is selected
Given search results show "payment-api"
When the user clicks or presses Enter on "payment-api"
Then the search modal closes
And the service detail drawer opens for "payment-api"
And the URL updates to reflect the selected service
Story 4.2 — Redis Prefix Caching
Feature: Redis Prefix Caching for Search
Scenario: Cache key structure is tenant-scoped
Given tenant "acme" searches for "pay"
Then the Redis cache key is "search:acme:pay"
And tenant "globex" searching for "pay" uses key "search:globex:pay"
And the two cache entries are completely independent
Scenario: Cache TTL expires and is refreshed
Given a cache entry for "search:acme:pay" has TTL of 60 seconds
When 61 seconds pass without a search for "pay"
Then the cache entry expires
And the next search for "pay" queries Meilisearch
And a new cache entry is created with a fresh 60-second TTL
Scenario: Redis is unavailable
Given Redis returns a connection error
When a search request is made
Then the API bypasses the cache and queries Meilisearch directly
And logs a warning: "Redis unavailable, cache bypassed"
And the search still returns results (degraded, not broken)
And does not attempt to write to Redis while it is down
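The degraded path above, including the rule that no write is attempted while Redis is down, could be sketched like this. `RedisDown` stands in for a real client's connection error (e.g. redis-py's `ConnectionError`), and the function wiring is hypothetical; `setex` matches the redis-py method name.

```python
class RedisDown(Exception):
    """Stands in for redis.exceptions.ConnectionError in this sketch."""

def cached_search(cache, backend_search, key, ttl=60, log=print):
    """Serve from cache when possible; bypass the cache entirely when
    Redis is unreachable, and skip the write-back in that case."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except RedisDown:
        log("Redis unavailable, cache bypassed")
        return backend_search()       # no write attempt while Redis is down
    results = backend_search()
    try:
        cache.setex(key, ttl, results)
    except RedisDown:
        log("Redis unavailable, cache bypassed")
    return results
```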
Scenario: Cache invalidation on bulk catalog update
Given a discovery scan adds 50 new services to the catalog
When the bulk ingestion completes
Then all Redis search cache keys for the affected tenant are flushed
And subsequent searches reflect the updated catalog
And the flush is atomic (all keys or none)
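A tenant-scoped flush matching the scenario above might use the redis-py `scan_iter`/`pipeline` API shape: SCAN avoids blocking the server the way KEYS would, and issuing one DELETE with every key makes the flush close to all-or-nothing, since a multi-key DEL executes atomically in Redis. The function name is hypothetical.

```python
def flush_tenant_search_cache(redis_client, tenant_id):
    """Delete every search cache key for one tenant.
    Sketch using the redis-py scan_iter/pipeline call shapes."""
    pattern = f"search:{tenant_id}:*"
    keys = list(redis_client.scan_iter(match=pattern))
    if keys:
        pipe = redis_client.pipeline()
        pipe.delete(*keys)   # one multi-key DEL, atomic on the server
        pipe.execute()
```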
Story 4.3 — Meilisearch Index Management
Feature: Meilisearch Index Management
Scenario: Index is created on first tenant onboarding
Given a new tenant "startup-co" completes onboarding
When the tenant's first discovery scan runs
Then a Meilisearch index "services_startup-co" is created
And the index is configured with searchable attributes: name, description, owner, tags
And filterable attributes: source, health_status, owner, region
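The onboarding configuration above maps directly onto Meilisearch index settings. A sketch of the settings payload (in the camelCase shape Meilisearch's settings endpoint accepts, e.g. via a client's update-settings call) and the per-tenant index UID:

```python
def tenant_index_uid(tenant_id):
    """Per-tenant index UID, e.g. 'services_startup-co'."""
    return f"services_{tenant_id}"

def tenant_index_settings():
    """Settings payload matching the scenario above; attribute order in
    searchableAttributes also sets their relative ranking weight."""
    return {
        "searchableAttributes": ["name", "description", "owner", "tags"],
        "filterableAttributes": ["source", "health_status", "owner", "region"],
    }
```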
Scenario: Index corruption is detected and rebuilt
Given the Meilisearch index for tenant "acme" returns inconsistent results
When the health check detects index corruption (checksum mismatch)
Then an index rebuild is triggered
And the index is rebuilt from the PostgreSQL catalog (source of truth)
And during rebuild, search falls back to PostgreSQL trigram search (pg_trgm)
And users see a banner: "Search index is being rebuilt, results may be incomplete"
And the rebuild completes within 5 minutes for up to 10,000 services
Scenario: Meilisearch index rebuild does not affect other tenants
Given tenant "acme"'s index is being rebuilt
When tenant "globex" performs a search
Then globex's search is unaffected
And globex's index is not touched during acme's rebuild
Scenario: Search ranking is configured correctly
Given the Meilisearch index has ranking rules configured
When a user searches for "api"
Then results are ranked by: words, typo, proximity, attribute, sort, exactness
And services with "api" in the name rank higher than those with "api" in description
And recently updated services rank higher among equal-relevance results
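A ranking-rules configuration satisfying the scenario above might look like this: Meilisearch's six default rules in their default order, followed by a descending custom rule on `updated_at` so recently updated services win among equal-relevance results. The assumption (not stated in the spec) is that each indexed document carries a sortable `updated_at` field.

```python
# Meilisearch default ranking rules, plus one custom rule. The "attribute"
# rule is what makes name matches outrank description matches, given the
# searchableAttributes order [name, description, ...].
RANKING_RULES = [
    "words",
    "typo",
    "proximity",
    "attribute",
    "sort",
    "exactness",
    "updated_at:desc",   # assumes documents carry an updated_at field
]
```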
Epic 5: Dashboard UI
Story 5.1 — Service Catalog Browse
Feature: Service Catalog Browse UI
Background:
Given the user is authenticated and on the catalog page
Scenario: Catalog page loads with service cards
Given the tenant has 42 services in the catalog
When the catalog page loads
Then 20 service cards are displayed (first page)
And each card shows: service name, owner, health status badge, source icon, last updated
And a pagination control shows "Page 1 of 3"
Scenario: Progressive disclosure — card expands on hover
Given a service card for "payment-api" is visible
When the user hovers over the card
Then the card expands to show additional details:
| Field | Value |
| Description | Handles payment flows |
| Region | us-east-1 |
| On-call | alice@acme.com |
| Tech stack | Node.js, Docker |
And the expansion animation completes within 200ms
Scenario: Filter services by owner
Given the catalog has services from multiple teams
When the user selects filter "owner: payments-team"
Then only services owned by "payments-team" are shown
And the URL updates with ?owner=payments-team
And the filter chip is shown as active
Scenario: Filter services by health status
Given some services have health_status: "healthy", "degraded", "unknown"
When the user selects filter "status: degraded"
Then only degraded services are shown
And the count badge updates to reflect filtered results
Scenario: Filter services by source
Given services come from "aws-ecs", "aws-lambda", "github"
When the user selects filter "source: aws-lambda"
Then only Lambda functions are shown in the catalog
Scenario: Sort services by last updated
Given the catalog has services with various updated_at timestamps
When the user selects sort "Last Updated (newest first)"
Then services are sorted with most recently updated first
And the sort selection persists across page navigation
Scenario: Empty catalog state
Given the tenant has just onboarded and has 0 services
When the catalog page loads
Then an empty state is shown: "No services discovered yet"
And a CTA button: "Run Discovery Scan" is prominently displayed
And a link to the onboarding guide is shown
Story 5.2 — Service Detail Drawer
Feature: Service Detail Drawer
Scenario: Open service detail drawer
Given the user clicks on service card "checkout-api"
When the drawer opens
Then it slides in from the right within 300ms
And displays full service details:
| Section | Content |
| Header | Service name, health badge, source |
| Ownership | Team name, on-call engineer |
| Metadata | Region, runtime, version, tags |
| Health | Scorecard with metrics |
| Links | GitHub repo, PagerDuty, AWS console |
And the main catalog page remains visible behind the drawer
Scenario: Drawer URL is shareable
Given the drawer is open for "checkout-api" (id: svc-123)
When the user copies the URL
Then the URL is /catalog?service=svc-123
And sharing this URL opens the catalog with the drawer pre-opened
Scenario: Close drawer with Escape key
Given the service detail drawer is open
When the user presses Escape
Then the drawer closes
And focus returns to the service card that was clicked
Scenario: Navigate between services in drawer
Given the drawer is open for "checkout-api"
When the user presses the right arrow or clicks "Next service"
Then the drawer transitions to the next service in the current filtered/sorted list
And the URL updates accordingly
Scenario: Drawer shows stale data warning
Given service "legacy-api" has status: "stale" (not seen in last scan)
When the drawer opens for "legacy-api"
Then a warning banner shows: "This service was not found in the last scan (3 days ago)"
And a "Re-run scan" button is available
Scenario: Edit service owner from drawer
Given the user has editor permissions
When they click "Edit owner" in the drawer
Then an inline edit field appears
And they can type a new owner name with autocomplete from known teams
And saving updates the catalog immediately
And the drawer reflects the new owner without a full page reload
Story 5.3 — Cmd+K Search in UI
Feature: Cmd+K Search UI Integration
Scenario: Search modal shows categorized results
Given the user types "pay" in the Cmd+K search modal
When results are returned
Then they are grouped by category:
| Category | Examples |
| Services | payment-api, payment-processor |
| Teams | payments-team |
| Actions | Run discovery scan |
And keyboard navigation works between categories
Scenario: Search modal shows recent searches on open
Given the user has previously searched for "auth", "billing", "gateway"
When the user opens Cmd+K without typing
Then recent searches are shown as suggestions
And clicking a suggestion populates the search input
Scenario: Search result keyboard navigation
Given the search modal is open with 5 results
When the user presses the down arrow key
Then the first result is highlighted
And pressing down again highlights the second result
And pressing Enter navigates to the highlighted result
Scenario: Search modal is accessible
Given the search modal is open
Then it has role="dialog" and aria-label="Search services"
And the input has aria-label="Search"
And results have role="listbox" with role="option" items
And screen readers announce result count changes
Epic 6: Analytics Dashboards
Story 6.1 — Ownership Coverage Dashboard
Feature: Ownership Coverage Analytics
Scenario: Display ownership coverage percentage
Given the tenant has 100 services in the catalog
And 75 services have a confirmed owner
And 25 services have owner: null or "unknown"
When the analytics dashboard loads
Then the ownership coverage metric shows 75%
And a donut chart shows the breakdown: 75 owned, 25 unowned
And a trend line shows coverage change over the last 30 days
Scenario: Ownership coverage by team
Given multiple teams own services
When the ownership breakdown table is rendered
Then it shows each team with their service count and percentage
And teams are sorted by service count descending
And clicking a team filters the catalog to that team's services
Scenario: Ownership coverage drops below threshold
Given the ownership coverage threshold is set to 80%
And coverage drops from 82% to 76% after a scan
When the dashboard refreshes
Then a warning alert is shown: "Ownership coverage below 80% threshold"
And the affected unowned services are listed
And an email notification is sent to the tenant admin
Scenario: Zero unowned services
Given all 50 services have confirmed owners
When the ownership dashboard loads
Then coverage shows 100%
And a success state is shown: "All services have owners"
And no warning alerts are displayed
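The coverage metric and threshold alert above reduce to a small calculation, sketched here with hypothetical function names; per the scenarios, an owner of null or "unknown" counts as unowned.

```python
def ownership_coverage(services):
    """Percent of services with a confirmed owner; owner values of
    None or "unknown" count as unowned."""
    if not services:
        return 0
    owned = sum(1 for s in services if s.get("owner") not in (None, "unknown"))
    return round(100 * owned / len(services))

def coverage_alert(coverage_pct, threshold=80):
    """Return the warning message when coverage falls below the threshold."""
    if coverage_pct < threshold:
        return f"Ownership coverage below {threshold}% threshold"
    return None
```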
Story 6.2 — Service Health Scorecards
Feature: Service Health Scorecards
Scenario: Health scorecard is calculated for a service
Given service "payment-api" has the following signals:
| Signal | Value | Score |
| Has owner | Yes | 20 |
| Has on-call policy | Yes | 20 |
| Has documentation | Yes | 15 |
| Recent deployment | 2 days | 15 |
| No open P1 alerts | True | 30 |
When the health scorecard is computed
Then the overall score is 100/100
And the health_status is "healthy"
Scenario: Service with missing ownership scores lower
Given service "orphan-lambda" has:
| Signal | Value | Score |
| Has owner | No | 0 |
| Has on-call policy | No | 0 |
| Has documentation | No | 0 |
| Recent deployment | 90 days | 5 |
| No open P1 alerts | True | 30 |
When the health scorecard is computed
Then the overall score is 35/100
And the health_status is "at-risk"
And the scorecard highlights the missing signals as improvement actions
Scenario: Health scorecard trend over time
Given "checkout-api" has had weekly scorecard snapshots for 8 weeks
When the service detail drawer shows the health trend
Then a sparkline chart shows the score history
And the trend direction (improving/declining) is indicated
Scenario: Team-level health KPI aggregation
Given "payments-team" owns 10 services with scores: [90, 85, 70, 95, 60, 80, 75, 88, 92, 65]
When the team KPI dashboard is rendered
Then the team average score is 80/100
And the lowest-scoring service is highlighted for attention
And the team score is compared to the org average
Scenario: Health scorecard for stale service
Given service "legacy-api" has status: "stale"
When the scorecard is computed
Then the "Recent deployment" signal scores 0
And a penalty is applied for being stale
And the overall score reflects the staleness
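The scorecard tables above can be computed as a weighted sum. The weights (20/20/15/15/30) come from the tables; the deployment-recency tiers and the status cutoffs are hypothetical fills, since the spec gives only three data points (2 days scores 15, 90 days scores 5, stale scores 0) and names two statuses ("healthy", "at-risk").

```python
def deployment_score(days_since_deploy, stale=False):
    """Recency tiers are an assumption interpolated from the spec's
    data points: 2 days -> 15, 90 days -> 5, stale -> 0."""
    if stale:
        return 0
    if days_since_deploy <= 30:
        return 15
    if days_since_deploy <= 90:
        return 5
    return 0

def health_score(has_owner, has_oncall, has_docs, days_since_deploy,
                 no_open_p1, stale=False):
    """Weighted sum of the five signals from the scorecard tables."""
    score = 0
    score += 20 if has_owner else 0
    score += 20 if has_oncall else 0
    score += 15 if has_docs else 0
    score += deployment_score(days_since_deploy, stale)
    score += 30 if no_open_p1 else 0
    return score

def health_status(score):
    """Cutoffs are hypothetical; the spec only shows 100 -> healthy
    and 35 -> at-risk."""
    if score >= 80:
        return "healthy"
    if score >= 50:
        return "degraded"
    return "at-risk"
```

Replaying the two example services: "payment-api" scores 20+20+15+15+30 = 100 (healthy), "orphan-lambda" scores 0+0+0+5+30 = 35 (at-risk), matching the scenarios.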
Story 6.3 — Tech Debt Tracking
Feature: Tech Debt Tracking Dashboard
Scenario: Identify services using deprecated runtimes
Given the portal has a policy: "Node.js < 18 is deprecated"
And 5 services use Node.js 16
When the tech debt dashboard loads
Then it shows 5 services flagged for "deprecated runtime"
And each flagged service links to its catalog entry
And the total tech debt score is calculated
Scenario: Track services with no recent deployments
Given the policy threshold is "no deployment in 90 days = tech debt"
And 8 services have not been deployed in over 90 days
When the tech debt dashboard loads
Then these 8 services appear in the "stale deployments" category
And the owning teams are notified via the dashboard
Scenario: Tech debt trend over time
Given tech debt metrics have been tracked for 12 weeks
When the trend chart is rendered
Then it shows weekly tech debt item counts
And highlights weeks where debt increased significantly
And shows the net change (items resolved vs. items added)