diff --git a/products/04-lightweight-idp/acceptance-specs/acceptance-specs.md b/products/04-lightweight-idp/acceptance-specs/acceptance-specs.md new file mode 100644 index 0000000..59d053e --- /dev/null +++ b/products/04-lightweight-idp/acceptance-specs/acceptance-specs.md @@ -0,0 +1,1177 @@ +# dd0c/portal — BDD Acceptance Test Specifications + +> Gherkin scenarios for all 10 epics. Edge cases included. + +--- + +## Epic 1: AWS Discovery Scanner + +### Story 1.1 — IAM Role Assumption into Customer Account + +```gherkin +Feature: IAM Role Assumption for Cross-Account Discovery + + Background: + Given the portal has a tenant with AWS integration configured + And the tenant has provided an IAM role ARN "arn:aws:iam::123456789012:role/PortalDiscoveryRole" + + Scenario: Successfully assume IAM role and begin scan + Given the IAM role has the required trust policy allowing portal's scanner principal + When the AWS Discovery Scanner initiates a scan for the tenant + Then the scanner assumes the role via STS AssumeRole + And receives temporary credentials valid for 1 hour + And begins scanning the configured AWS regions + + Scenario: IAM role assumption fails due to missing trust policy + Given the IAM role does NOT have a trust policy for the portal's scanner principal + When the AWS Discovery Scanner attempts to assume the role + Then STS returns an "AccessDenied" error + And the scanner marks the scan job as FAILED with reason "IAM role assumption denied" + And the tenant receives a notification: "AWS integration error: cannot assume role" + And no partial scan results are persisted + + Scenario: IAM role ARN is malformed + Given the tenant has configured an invalid role ARN "not-a-valid-arn" + When the AWS Discovery Scanner attempts to assume the role + Then the scanner marks the scan job as FAILED with reason "Invalid role ARN format" + And logs a structured error with tenant_id and the invalid ARN + + Scenario: Temporary credentials expire mid-scan + Given the 
scanner has assumed the IAM role and started scanning + And the scan has been running for 61 minutes + When the scanner attempts to call ECS DescribeClusters + Then AWS returns an "ExpiredTokenException" + And the scanner refreshes credentials via a new AssumeRole call + And resumes scanning from the last checkpoint + And the scan job status remains IN_PROGRESS + + Scenario: Role assumption succeeds but has insufficient permissions + Given the IAM role is assumed successfully + But the role lacks "ecs:ListClusters" permission + When the scanner attempts to list ECS clusters + Then AWS returns an "AccessDenied" error for that specific call + And the scanner records a partial failure for the ECS resource type + And continues scanning other resource types (Lambda, CloudFormation, API Gateway) + And the final scan result includes a warnings list with "ECS: insufficient permissions" +``` + +### Story 1.2 — ECS Service Discovery + +```gherkin +Feature: ECS Service Discovery + + Background: + Given a tenant's AWS account has been successfully authenticated + And the scanner is configured to scan region "us-east-1" + + Scenario: Discover ECS services across multiple clusters + Given the AWS account has 3 ECS clusters: "prod-cluster", "staging-cluster", "dev-cluster" + And "prod-cluster" has 5 services, "staging-cluster" has 3, "dev-cluster" has 2 + When the AWS Discovery Scanner runs the ECS scan step + Then it lists all 3 clusters via ecs:ListClusters + And describes all services in each cluster via ecs:DescribeServices + And emits 10 discovered service events to the catalog ingestion queue + And each event includes: cluster_name, service_name, task_definition, desired_count, region + + Scenario: ECS cluster has no services + Given the AWS account has 1 ECS cluster "empty-cluster" with 0 services + When the scanner runs the ECS scan step + Then it lists the cluster successfully + And records 0 services discovered for that cluster + And does not emit any service events for 
that cluster + And the scan step completes without error + + Scenario: ECS DescribeServices returns throttling error + Given the AWS account has an ECS cluster with 100 services + When the scanner calls ecs:DescribeServices in batches of 10 + And AWS returns a ThrottlingException on the 5th batch + Then the scanner applies exponential backoff (2s, 4s, 8s) + And retries the failed batch up to 3 times + And if retry succeeds, continues with remaining batches + And the final result includes all 100 services + + Scenario: ECS DescribeServices fails after all retries + Given AWS returns ThrottlingException on every retry attempt + When the scanner exhausts all 3 retries for a batch + Then it marks that batch as partially failed + And records which service ARNs could not be described + And continues with the next batch + And the scan summary includes "ECS partial failure: 10 services could not be described" + + Scenario: Multi-region ECS scan + Given the tenant has configured scanning for regions: "us-east-1", "eu-west-1", "ap-southeast-1" + When the AWS Discovery Scanner runs + Then it scans ECS in all 3 regions in parallel via Step Functions Map state + And aggregates results from all regions + And each discovered service record includes its region + And duplicate service names across regions are stored as separate catalog entries +``` + +### Story 1.3 — Lambda Function Discovery + +```gherkin +Feature: Lambda Function Discovery + + Scenario: Discover all Lambda functions in a region + Given the AWS account has 25 Lambda functions in "us-east-1" + When the scanner runs the Lambda scan step + Then it calls lambda:ListFunctions with pagination + And retrieves all 25 functions across multiple pages + And emits 25 service events with: function_name, runtime, memory_size, timeout, last_modified + + Scenario: Lambda function has tags indicating service ownership + Given a Lambda function "payment-processor" has tags: + | Key | Value | + | team | payments-team | + | service | 
payment-svc | + | environment | production | + When the scanner processes this Lambda function + Then the catalog entry includes ownership: "payments-team" (source: implicit/tags) + And the service name is inferred as "payment-svc" from the "service" tag + + Scenario: Lambda function has no tags + Given a Lambda function "legacy-cron-job" has no tags + When the scanner processes this Lambda function + Then the catalog entry has ownership: null (source: heuristic/pending) + And the entry is flagged for ownership inference via commit history + + Scenario: Lambda ListFunctions pagination + Given the AWS account has 150 Lambda functions (exceeding the 50-per-page limit) + When the scanner calls lambda:ListFunctions + Then it follows NextMarker pagination tokens + And retrieves all 150 functions across 3 pages + And does not duplicate any function in the results +``` + +### Story 1.4 — CloudFormation Stack Discovery + +```gherkin +Feature: CloudFormation Stack Discovery + + Scenario: Discover active CloudFormation stacks + Given the AWS account has 8 CloudFormation stacks in "us-east-1" + And 6 stacks have status "CREATE_COMPLETE" or "UPDATE_COMPLETE" + And 2 stacks have status "ROLLBACK_COMPLETE" + When the scanner runs the CloudFormation scan step + Then it discovers all 8 stacks + And includes stacks of all statuses in the raw results + And marks ROLLBACK_COMPLETE stacks with health_status: "degraded" + + Scenario: CloudFormation stack outputs contain service metadata + Given a stack "api-gateway-stack" has outputs: + | OutputKey | OutputValue | + | ServiceName | user-api | + | TeamOwner | platform-team | + | ApiEndpoint | https://api.example.com/users | + When the scanner processes this stack + Then the catalog entry uses "user-api" as the service name + And sets ownership to "platform-team" (source: implicit/stack-output) + And records the API endpoint in service metadata + + Scenario: Nested stacks are discovered + Given a parent CloudFormation stack has 3 nested 
stacks + When the scanner processes the parent stack + Then it also discovers and catalogs each nested stack + And links nested stacks to their parent in the catalog + + Scenario: Stack in DELETE_IN_PROGRESS state + Given a CloudFormation stack has status "DELETE_IN_PROGRESS" + When the scanner discovers this stack + Then it records the stack with health_status: "terminating" + And does not block the scan on this stack's state +``` + +### Story 1.5 — API Gateway Discovery + +```gherkin +Feature: API Gateway Discovery + + Scenario: Discover REST APIs in API Gateway v1 + Given the AWS account has 4 REST APIs in API Gateway + When the scanner runs the API Gateway scan step + Then it calls apigateway:GetRestApis + And retrieves all 4 APIs with their names, IDs, and endpoints + And emits catalog events for each API + + Scenario: Discover HTTP APIs in API Gateway v2 + Given the AWS account has 3 HTTP APIs in API Gateway v2 + When the scanner runs the API Gateway v2 scan step + Then it calls apigatewayv2:GetApis + And retrieves all 3 HTTP APIs + And correctly distinguishes them from REST APIs in the catalog + + Scenario: API Gateway API has no associated service tag + Given an API Gateway REST API "legacy-api" has no resource tags + When the scanner processes this API + Then it creates a catalog entry with name "legacy-api" + And sets ownership to null pending heuristic inference + And records the API type as "REST" and the invoke URL + + Scenario: API Gateway scan across multiple regions + Given API Gateway APIs exist in "us-east-1" and "us-west-2" + When the scanner runs in parallel across both regions + Then APIs from both regions are discovered + And each catalog entry includes the region field + And there are no cross-region duplicates +``` + +### Story 1.6 — Step Functions Orchestration + +```gherkin +Feature: Step Functions Orchestration of Discovery Scan + + Scenario: Full scan executes all resource type steps in parallel + Given a scan job is triggered for a 
tenant + When the Step Functions state machine starts + Then it executes ECS, Lambda, CloudFormation, and API Gateway scan steps in parallel + And waits for all parallel branches to complete + And aggregates results from all branches + And transitions to the catalog ingestion step + + Scenario: One parallel branch fails, others succeed + Given the Step Functions state machine is running a scan + And the Lambda scan step throws an unhandled exception + When the state machine catches the error + Then the ECS, CloudFormation, and API Gateway branches continue to completion + And the Lambda branch is marked as FAILED in the scan summary + And the overall scan job status is PARTIAL_SUCCESS + And discovered resources from successful branches are ingested into the catalog + + Scenario: Scan job is idempotent on retry + Given a scan job failed midway through execution + When the Step Functions state machine retries the execution + Then it does not duplicate already-ingested catalog entries + And uses upsert semantics based on (tenant_id, resource_arn) as the unique key + + Scenario: Concurrent scan jobs for the same tenant are prevented + Given a scan job is already IN_PROGRESS for tenant "acme-corp" + When a second scan job is triggered for "acme-corp" + Then the system detects the existing in-progress execution + And rejects the new job with status 409 Conflict + And returns the ID of the existing running scan job + + Scenario: Scan job timeout + Given a scan job has been running for more than 30 minutes + When the Step Functions state machine timeout is reached + Then the execution is marked as TIMED_OUT + And partial results already ingested are retained in the catalog + And the tenant is notified of the timeout + And the scan job record shows last_successful_step + + Scenario: WebSocket progress events during scan + Given a user is watching the discovery progress in the UI + When each Step Functions step completes + Then a WebSocket event is pushed to the user's 
connection + And the event includes: step_name, status, resources_found, timestamp + And the UI progress bar updates in real-time +``` + +### Story 1.7 — Multi-Region Scanning + +```gherkin +Feature: Multi-Region AWS Scanning + + Scenario: Tenant configures specific regions to scan + Given a tenant has configured scan regions: ["us-east-1", "eu-central-1"] + When the discovery scanner runs + Then it only scans the configured regions + And does not scan any other AWS regions + And the scan summary shows per-region resource counts + + Scenario: Tenant configures "all regions" scan + Given a tenant has configured scan_all_regions: true + When the discovery scanner runs + Then it first calls ec2:DescribeRegions to get the current list of enabled regions + And scans all enabled regions + And the Step Functions Map state iterates over the dynamic region list + + Scenario: A region is disabled in the AWS account + Given the tenant's AWS account has "ap-east-1" disabled + And the tenant has configured scan_all_regions: true + When the scanner attempts to scan "ap-east-1" + Then it receives an "OptInRequired" or region-disabled error + And skips that region gracefully + And records "ap-east-1: skipped (region disabled)" in the scan summary + + Scenario: Region scan results are isolated per tenant + Given tenant "acme" and tenant "globex" both have resources in "us-east-1" + When both tenants run discovery scans simultaneously + Then acme's catalog only contains acme's resources + And globex's catalog only contains globex's resources + And no cross-tenant data leakage occurs +``` + +--- + +## Epic 2: GitHub Discovery Scanner + +### Story 2.1 — GitHub App Installation and Authentication + +```gherkin +Feature: GitHub App Authentication + + Background: + Given the portal has a GitHub App registered with app_id "12345" + And the app has a private key stored in AWS Secrets Manager + + Scenario: Successfully authenticate as GitHub App installation + Given a tenant has installed 
the GitHub App on their organization "acme-org" + And the installation_id is "67890" + When the GitHub Discovery Scanner initializes for this tenant + Then it generates a JWT signed with the app's private key + And exchanges the JWT for an installation access token via POST /app/installations/67890/access_tokens + And the token is granted the app's permissions: contents:read, metadata:read + And the scanner uses this token for all subsequent API calls + + Scenario: GitHub App JWT generation fails due to revoked private key + Given the private key stored in Secrets Manager has been revoked or is invalid + When the scanner attempts to generate a JWT + Then it fails with a key validation error + And marks the scan job as FAILED with reason "GitHub App authentication failed: invalid private key" + And triggers an alert to the tenant admin + + Scenario: Installation access token request returns 404 + Given the GitHub App installation has been uninstalled by the tenant + When the scanner requests an installation access token + Then GitHub returns 404 Not Found + And the scanner marks the GitHub integration as DISCONNECTED + And notifies the tenant: "GitHub App has been uninstalled. Please reinstall to continue discovery." 
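+
+    # Note (illustrative, not part of the original spec): the App JWT referenced
+    # in the scenarios above is a short-lived RS256 token whose lifetime GitHub
+    # caps at 10 minutes; "iat" is conventionally backdated ~60s to absorb clock
+    # drift. A minimal sketch of its claims, using app_id "12345" from the Background:
+    #   { "iat": <now - 60s>, "exp": <now + 600s>, "iss": "12345" }
+    # The installation access token it is exchanged for is what the scanner
+    # actually sends on subsequent REST and GraphQL calls.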
+ + Scenario: Installation access token nears expiry mid-scan + Given the scanner has a valid installation token (valid for 1 hour) + And the scan has been running for 55 minutes + When the scanner detects the token is within 5 minutes of expiry + Then it proactively requests a new installation access token + And retries any in-flight request that received 401 Unauthorized using the fresh token + And the scan continues without interruption +``` + +### Story 2.2 — Repository Discovery via GraphQL + +```gherkin +Feature: GitHub Repository Discovery via GraphQL API + + Scenario: Discover all repositories in an organization + Given the tenant's GitHub organization "acme-org" has 150 repositories + When the GitHub Discovery Scanner runs the repository listing query + Then it uses the GraphQL API with cursor-based pagination + And retrieves all 150 repositories in batches of 100 + And each repository record includes: name, defaultBranch, isArchived, pushedAt, topics + + Scenario: Archived repositories are excluded by default + Given "acme-org" has 150 active repos and 30 archived repos + When the scanner runs with default settings + Then it discovers 150 active repositories + And excludes the 30 archived repositories + And the scan summary notes "30 archived repositories skipped" + + Scenario: Tenant opts in to include archived repositories + Given the tenant has configured include_archived: true + When the scanner runs + Then it discovers all 180 repositories including archived ones + And archived repos are marked with status: "archived" in the catalog + + Scenario: Organization has no repositories + Given the tenant's GitHub organization has 0 repositories + When the scanner runs the repository listing query + Then it returns an empty result set + And the scan completes successfully with 0 GitHub services discovered + And no errors are raised + + Scenario: GraphQL query returns partial data with errors + Given the GraphQL query returns 80 repositories successfully + But includes a "FORBIDDEN" error for 20 private 
repositories + When the scanner processes the response + Then it ingests the 80 accessible repositories + And records a warning: "20 repositories inaccessible due to permissions" + And the scan status is PARTIAL_SUCCESS +``` + +### Story 2.3 — Service Metadata Extraction from Repositories + +```gherkin +Feature: Service Metadata Extraction from Repository Files + + Scenario: Extract service name from package.json + Given a repository "user-service" has a package.json at the root: + """ + { + "name": "@acme/user-service", + "version": "2.1.0", + "description": "Handles user authentication and profiles" + } + """ + When the scanner fetches and parses package.json + Then the catalog entry uses service name "@acme/user-service" + And records version "2.1.0" + And records description "Handles user authentication and profiles" + And sets the service type as "nodejs" + + Scenario: Extract service metadata from Dockerfile + Given a repository has a Dockerfile with: + """ + FROM python:3.11-slim + LABEL service.name="data-pipeline" + LABEL service.team="data-engineering" + EXPOSE 8080 + """ + When the scanner parses the Dockerfile + Then the catalog entry uses service name "data-pipeline" + And sets ownership to "data-engineering" (source: implicit/dockerfile-label) + And records the service type as "python" + And records exposed port 8080 + + Scenario: Extract service metadata from Terraform files + Given a repository has a main.tf with: + """ + module "service" { + source = "..." 
+ service_name = "billing-api" + team = "billing-team" + } + """ + When the scanner parses Terraform files + Then the catalog entry uses service name "billing-api" + And sets ownership to "billing-team" (source: implicit/terraform) + + Scenario: Repository has no recognizable service metadata files + Given a repository "random-scripts" has no package.json, Dockerfile, or terraform files + When the scanner processes this repository + Then it creates a catalog entry using the repository name "random-scripts" + And sets service type as "unknown" + And flags the entry for manual review + And ownership is set to null pending heuristic inference + + Scenario: Multiple metadata files exist — precedence order + Given a repository has both a package.json (name: "svc-a") and a Dockerfile (LABEL service.name="svc-b") + When the scanner processes the repository + Then it uses package.json as the primary metadata source + And records the service name as "svc-a" + And notes the Dockerfile label as secondary metadata + + Scenario: package.json is malformed JSON + Given a repository has a package.json with invalid JSON syntax + When the scanner attempts to parse it + Then it logs a warning: "package.json parse error in repo: user-service" + And falls back to Dockerfile or terraform for metadata + And if no fallback exists, uses the repository name + And does not crash the scan job +``` + +### Story 2.4 — GitHub Rate Limit Handling + +```gherkin +Feature: GitHub API Rate Limit Handling + + Scenario: GraphQL rate limit warning threshold reached + Given the scanner has consumed 4,500 of 5,000 GraphQL rate limit points + When the scanner checks the rate limit headers after each response + Then it detects the threshold (90% consumed) + And pauses scanning until the rate limit resets + And logs: "GitHub rate limit threshold reached, waiting for reset at {reset_time}" + And resumes scanning after the reset window + + Scenario: REST API secondary rate limit hit + Given the scanner is 
making rapid REST API calls + When GitHub returns HTTP 429 with "secondary rate limit" in the response + Then the scanner reads the Retry-After header + And waits the specified number of seconds before retrying + And does not count this as a scan failure + + Scenario: Rate limit exhausted with many repos remaining + Given the scanner has 500 repositories to scan + And the rate limit is exhausted after scanning 200 repositories + When the scanner detects rate limit exhaustion + Then it checkpoints the current progress (200 repos scanned) + And schedules a continuation scan after the rate limit reset + And the scan job status is set to RATE_LIMITED (not FAILED) + And the tenant is notified of the delay + + Scenario: Rate limit headers are missing from response + Given GitHub returns a response without X-RateLimit headers + When the scanner processes the response + Then it applies a conservative default delay of 1 second between requests + And logs a warning about missing rate limit headers + And continues scanning without crashing + + Scenario: Concurrent scans from multiple tenants share rate limit awareness + Given two tenants share the same GitHub App installation + When both tenants trigger scans simultaneously + Then the scanner uses a shared rate limit counter in Redis + And distributes available rate limit points between the two scans + And neither scan exceeds the total available rate limit +``` + +### Story 2.5 — Ownership Inference from Commit History + +```gherkin +Feature: Heuristic Ownership Inference from Commit History + + Scenario: Infer owner from most recent committer + Given a repository has no explicit ownership config and no team tags + And the last 10 commits were made by "alice@acme.com" (7 commits) and "bob@acme.com" (3 commits) + When the ownership inference engine runs + Then it identifies "alice@acme.com" as the primary contributor + And maps the email to team "frontend-team" via the org's CODEOWNERS or team membership + And sets ownership 
to "frontend-team" (source: heuristic/commit-history) + And records confidence: 0.7 + + Scenario: Ownership inference from CODEOWNERS file + Given a repository has a CODEOWNERS file: + """ + * @acme-org/platform-team + /src/api/ @acme-org/api-team + """ + When the scanner processes the CODEOWNERS file + Then it sets ownership to "platform-team" (source: implicit/codeowners) + And records that the api directory has additional ownership by "api-team" + And this takes precedence over commit history heuristics + + Scenario: Ownership conflict — multiple teams have equal commit share + Given a repository has commits split 50/50 between "team-a" and "team-b" + When the ownership inference engine runs + Then it records both teams as co-owners + And sets ownership_confidence: 0.5 + And flags the service for manual ownership resolution + And the catalog entry shows ownership_status: "conflict" + + Scenario: Explicit ownership config overrides all other sources + Given a repository has a .portal.yaml file: + """ + owner: payments-team + tier: critical + """ + And the repository also has CODEOWNERS pointing to "platform-team" + And commit history suggests "devops-team" + When the scanner processes the repository + Then ownership is set to "payments-team" (source: explicit/portal-config) + And CODEOWNERS and commit history are recorded as secondary metadata + And the explicit config is not overridden by any other source + + Scenario: Commit history API call fails + Given the GitHub API returns 500 when fetching commit history for a repo + When the ownership inference engine attempts heuristic inference + Then it marks ownership as null (source: unknown) + And records the error in the scan summary + And does not block catalog ingestion for this service +``` + +--- +## Epic 3: Service Catalog + +### Story 3.1 — Catalog Ingestion and Upsert + +```gherkin +Feature: Service Catalog Ingestion + + Background: + Given the catalog uses PostgreSQL Aurora Serverless v2 + And the unique 
key for a service is (tenant_id, source, resource_id) + + Scenario: New service is ingested into the catalog + Given a discovery scan emits a new service event for "payment-api" + When the catalog ingestion handler processes the event + Then a new row is inserted into the services table + And the row includes: tenant_id, name, source, resource_id, owner, health_status, discovered_at + And the Meilisearch index is updated with the new service document + And the catalog entry is immediately searchable + + Scenario: Existing service is updated on re-scan (upsert) + Given "payment-api" already exists in the catalog with owner "old-team" + When a new scan emits an updated event for "payment-api" with owner "payments-team" + Then the existing row is updated (not duplicated) + And owner is changed to "payments-team" + And updated_at is refreshed + And the Meilisearch document is updated + + Scenario: Service removed from AWS is marked stale + Given "deprecated-lambda" exists in the catalog from a previous scan + When the latest scan completes and does not include "deprecated-lambda" + Then the catalog entry is marked with status: "stale" + And last_seen_at is not updated + And the service remains visible in the catalog with a "stale" badge + And is not immediately deleted + + Scenario: Stale service is purged after retention period + Given "deprecated-lambda" has been stale for more than 30 days + When the nightly cleanup job runs + Then the service is soft-deleted from the catalog + And removed from the Meilisearch index + And a deletion event is logged for audit purposes + + Scenario: Catalog ingestion fails due to Aurora connection error + Given Aurora Serverless is scaling up (cold start) + When the ingestion handler attempts to write a service record + Then it retries with exponential backoff up to 5 times + And if all retries fail, the event is sent to a dead-letter queue + And an alert is raised for the operations team + And the scan job is marked PARTIAL_SUCCESS + 
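+    # Illustrative sketch (not part of the original spec): the upsert semantics
+    # exercised in the re-scan scenario above, assuming a PostgreSQL unique index
+    # on (tenant_id, source, resource_id) per the Background, could be expressed as:
+    #   INSERT INTO services (tenant_id, source, resource_id, name, owner, updated_at)
+    #   VALUES ($1, $2, $3, $4, $5, now())
+    #   ON CONFLICT (tenant_id, source, resource_id)
+    #   DO UPDATE SET name = EXCLUDED.name, owner = EXCLUDED.owner, updated_at = now();
+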
+ Scenario: Bulk ingestion of 500 services from a large account + Given a scan discovers 500 services across all resource types + When the catalog ingestion handler processes all events + Then it uses batch inserts (100 records per batch) + And completes ingestion within 30 seconds + And all 500 services appear in the catalog + And the Meilisearch index reflects all 500 services +``` + +### Story 3.2 — PagerDuty / OpsGenie On-Call Mapping + +```gherkin +Feature: On-Call Ownership Mapping via PagerDuty and OpsGenie + + Scenario: Map service owner to PagerDuty escalation policy + Given a service "checkout-api" has owner "payments-team" + And PagerDuty integration is configured for the tenant + And "payments-team" maps to PagerDuty service "checkout-escalation-policy" + When the catalog builds the service detail record + Then the service includes on_call_policy: "checkout-escalation-policy" + And a link to the PagerDuty service page + + Scenario: Fetch current on-call engineer from PagerDuty + Given "checkout-api" has a linked PagerDuty escalation policy + When a user views the service detail page + Then the portal calls PagerDuty GET /oncalls?escalation_policy_ids[]=... 
+ And displays the current on-call engineer's name and contact + And caches the result for 5 minutes in Redis + + Scenario: PagerDuty API returns 401 (invalid token) + Given the PagerDuty API token has been revoked + When the portal attempts to fetch on-call data + Then it returns the service detail without on-call info + And displays a warning: "On-call data unavailable — check PagerDuty integration" + And logs the auth failure for the tenant admin + + Scenario: OpsGenie integration as alternative to PagerDuty + Given the tenant uses OpsGenie instead of PagerDuty + And OpsGenie integration is configured with an API key + When the portal fetches on-call data for a service + Then it calls the OpsGenie API instead of PagerDuty + And maps the OpsGenie schedule to the service owner + And displays the current on-call responder + + Scenario: Service owner has no PagerDuty or OpsGenie mapping + Given "internal-tool" has owner "eng-team" + But "eng-team" has no mapping in PagerDuty or OpsGenie + When the portal builds the service detail + Then on_call_policy is null + And the UI shows "No on-call policy configured" for this service + And suggests linking a PagerDuty/OpsGenie service + + Scenario: Multiple services share the same on-call policy + Given 10 services all map to the "platform-oncall" PagerDuty policy + When the portal fetches on-call data + Then it batches the PagerDuty API call for all 10 services + And does not make 10 separate API calls + And caches the shared result for all 10 services +``` + +### Story 3.3 — Service Catalog CRUD API + +```gherkin +Feature: Service Catalog CRUD API + + Background: + Given the API requires a valid Cognito JWT + And the JWT contains tenant_id claim + + Scenario: List services for a tenant + Given tenant "acme" has 42 services in the catalog + When GET /api/v1/services is called with a valid JWT for "acme" + Then the response returns 42 services + And each service includes: id, name, owner, source, health_status, updated_at + 
And results are paginated (default page size: 20) + And the response includes total_count: 42 + + Scenario: List services with pagination + Given tenant "acme" has 42 services + When GET /api/v1/services?page=2&limit=20 is called + Then the response returns services 21-40 + And includes pagination metadata: page, limit, total_count, total_pages + + Scenario: Get single service by ID + Given service "svc-uuid-123" exists for tenant "acme" + When GET /api/v1/services/svc-uuid-123 is called with acme's JWT + Then the response returns the full service detail + And includes: metadata, owner, on_call_policy, health_scorecard, tags, links + + Scenario: Get service belonging to different tenant returns 404 + Given service "svc-uuid-456" belongs to tenant "globex" + When GET /api/v1/services/svc-uuid-456 is called with acme's JWT + Then the response returns 404 Not Found + And does not reveal that the service exists for another tenant + + Scenario: Manually update service owner via API + Given service "legacy-api" has owner inferred as "unknown" + When PATCH /api/v1/services/svc-uuid-789 is called with body: + """ + { "owner": "platform-team", "owner_source": "manual" } + """ + Then the service owner is updated to "platform-team" + And owner_source is set to "explicit/manual" + And the change is recorded in the audit log with the user's identity + And the Meilisearch document is updated + + Scenario: Delete service from catalog + Given service "decommissioned-api" exists in the catalog + When DELETE /api/v1/services/svc-uuid-000 is called + Then the service is soft-deleted (deleted_at is set) + And it no longer appears in list or search results + And the Meilisearch document is removed + And the deletion is recorded in the audit log + + Scenario: Create service manually (not from discovery) + When POST /api/v1/services is called with: + """ + { "name": "manual-service", "owner": "ops-team", "source": "manual" } + """ + Then a new service is created with source: "manual" + 
    And it appears in the catalog and search results
    And it is not overwritten by automated discovery scans

  Scenario: Unauthenticated request is rejected
    When GET /api/v1/services is called without an Authorization header
    Then the response returns 401 Unauthorized
    And no service data is returned
```

### Story 3.4 — Full-Text Search via Meilisearch

```gherkin
Feature: Catalog Full-Text Search

  Scenario: Search returns relevant services
    Given the catalog has services: "payment-api", "payment-processor", "user-service"
    When a search query "payment" is submitted
    Then Meilisearch returns "payment-api" and "payment-processor"
    And results are ranked by relevance score
    And the response time is under 10ms

  Scenario: Search is scoped to tenant
    Given tenant "acme" has service "auth-service"
    And tenant "globex" has service "auth-service" (same name)
    When tenant "acme" searches for "auth-service"
    Then only acme's "auth-service" is returned
    And globex's service is not included in results

  Scenario: Meilisearch index is corrupted or unavailable
    Given Meilisearch returns a 503 error
    When a search request is made
    Then the API falls back to PostgreSQL full-text search (pg_trgm)
    And returns results within 500ms
    And logs a warning: "Meilisearch unavailable, using PostgreSQL fallback"
    And the user sees results (degraded performance, not an error)

  Scenario: Catalog update triggers Meilisearch re-index
    Given a service "new-api" is added to the catalog
    When the catalog write completes
    Then a Meilisearch index update is triggered asynchronously
    And the service is searchable within 1 second
    And if the Meilisearch update fails, it is retried via a background queue
```

---

## Epic 4: Search Engine

### Story 4.1 — Cmd+K Instant Search

```gherkin
Feature: Cmd+K Instant Search Interface

  Scenario: User opens search with keyboard shortcut
    Given the user is on any page of the portal
    When the user presses Cmd+K (Mac) or Ctrl+K (Windows/Linux)
    Then the search modal opens immediately
    And the search input is focused
    And recent searches are shown as suggestions

  Scenario: Search returns results under 10ms
    Given the Meilisearch index has 500 services
    And Redis prefix cache is warm
    When the user types "pay" in the search box
    Then results appear within 10ms
    And the top 5 matching services are shown
    And each result shows: service name, owner, health status

  Scenario: Search with Redis prefix cache hit
    Given "pay" has been searched recently and is cached in Redis
    When the user types "pay"
    Then the API returns cached results from Redis
    And does not query Meilisearch
    And the response time is under 5ms

  Scenario: Search with Redis cache miss
    Given "xyz-unique-query" has never been searched
    When the user types "xyz-unique-query"
    Then the API queries Meilisearch directly
    And stores the result in Redis with TTL of 60 seconds
    And returns results within 10ms

  Scenario: Search cache is invalidated on catalog update
    Given "payment-api" search results are cached in Redis
    When the catalog is updated (new service added or existing service modified)
    Then all Redis cache keys matching affected prefixes are invalidated
    And the next search query fetches fresh results from Meilisearch

  Scenario: Search with no results
    Given no services match the query "zzznomatch"
    When the user searches for "zzznomatch"
    Then the search modal shows "No services found"
    And suggests: "Try a broader search or browse all services"
    And does not show an error

  Scenario: Search is dismissed
    Given the search modal is open
    When the user presses Escape
    Then the search modal closes
    And focus returns to the previously focused element

  Scenario: Search result is selected
    Given search results show "payment-api"
    When the user clicks or presses Enter on "payment-api"
    Then the search modal closes
    And the service detail drawer opens for "payment-api"
    And the URL updates to reflect the selected service
```

### Story 4.2 — Redis Prefix Caching

```gherkin
Feature: Redis Prefix Caching for Search

  Scenario: Cache key structure is tenant-scoped
    Given tenant "acme" searches for "pay"
    Then the Redis cache key is "search:acme:pay"
    And tenant "globex" searching for "pay" uses key "search:globex:pay"
    And the two cache entries are completely independent

  Scenario: Cache TTL expires and is refreshed
    Given a cache entry for "search:acme:pay" has TTL of 60 seconds
    When 61 seconds pass without a search for "pay"
    Then the cache entry expires
    And the next search for "pay" queries Meilisearch
    And a new cache entry is created with a fresh 60-second TTL

  Scenario: Redis is unavailable
    Given Redis returns a connection error
    When a search request is made
    Then the API bypasses the cache and queries Meilisearch directly
    And logs a warning: "Redis unavailable, cache bypassed"
    And the search still returns results (degraded, not broken)
    And does not attempt to write to Redis while it is down

  Scenario: Cache invalidation on bulk catalog update
    Given a discovery scan adds 50 new services to the catalog
    When the bulk ingestion completes
    Then all Redis search cache keys for the affected tenant are flushed
    And subsequent searches reflect the updated catalog
    And the flush is atomic (all keys or none)
```

### Story 4.3 — Meilisearch Index Management

```gherkin
Feature: Meilisearch Index Management

  Scenario: Index is created on first tenant onboarding
    Given a new tenant "startup-co" completes onboarding
    When the tenant's first discovery scan runs
    Then a Meilisearch index "services_startup-co" is created
    And the index is configured with searchable attributes: name, description, owner, tags
    And filterable attributes: source, health_status, owner, region

  Scenario: Index corruption is detected and rebuilt
    Given the Meilisearch index for tenant "acme" returns inconsistent results
    When the health check detects index corruption (checksum mismatch)
    Then an index rebuild is triggered
    And the index is rebuilt from the PostgreSQL catalog (source of truth)
    And during rebuild, search falls back to PostgreSQL full-text search
    And users see a banner: "Search index is being rebuilt, results may be incomplete"
    And the rebuild completes within 5 minutes for up to 10,000 services

  Scenario: Meilisearch index rebuild does not affect other tenants
    Given tenant "acme"'s index is being rebuilt
    When tenant "globex" performs a search
    Then globex's search is unaffected
    And globex's index is not touched during acme's rebuild

  Scenario: Search ranking is configured correctly
    Given the Meilisearch index has ranking rules configured
    When a user searches for "api"
    Then results are ranked by: words, typo, proximity, attribute, sort, exactness
    And services with "api" in the name rank higher than those with "api" in description
    And recently updated services rank higher among equal-relevance results
```

---

## Epic 5: Dashboard UI

### Story 5.1 — Service Catalog Browse

```gherkin
Feature: Service Catalog Browse UI

  Background:
    Given the user is authenticated and on the catalog page

  Scenario: Catalog page loads with service cards
    Given the tenant has 42 services in the catalog
    When the catalog page loads
    Then 20 service cards are displayed (first page)
    And each card shows: service name, owner, health status badge, source icon, last updated
    And a pagination control shows "Page 1 of 3"

  Scenario: Progressive disclosure — card expands on hover
    Given a service card for "payment-api" is visible
    When the user hovers over the card
    Then the card expands to show additional details:
      | Field       | Value                 |
      | Description | Handles payment flows |
      | Region      | us-east-1             |
      | On-call     | alice@acme.com        |
      | Tech stack  | Node.js, Docker       |
    And the expansion animation completes within 200ms

  Scenario: Filter services by owner
    Given the catalog has services from multiple teams
    When the user selects filter "owner: payments-team"
    Then only services owned by "payments-team" are shown
    And the URL updates with ?owner=payments-team
    And the filter chip is shown as active

  Scenario: Filter services by health status
    Given some services have health_status: "healthy", "degraded", "unknown"
    When the user selects filter "status: degraded"
    Then only degraded services are shown
    And the count badge updates to reflect filtered results

  Scenario: Filter services by source
    Given services come from "aws-ecs", "aws-lambda", "github"
    When the user selects filter "source: aws-lambda"
    Then only Lambda functions are shown in the catalog

  Scenario: Sort services by last updated
    Given the catalog has services with various updated_at timestamps
    When the user selects sort "Last Updated (newest first)"
    Then services are sorted with most recently updated first
    And the sort selection persists across page navigation

  Scenario: Empty catalog state
    Given the tenant has just onboarded and has 0 services
    When the catalog page loads
    Then an empty state is shown: "No services discovered yet"
    And a CTA button: "Run Discovery Scan" is prominently displayed
    And a link to the onboarding guide is shown
```

### Story 5.2 — Service Detail Drawer

```gherkin
Feature: Service Detail Drawer

  Scenario: Open service detail drawer
    Given the user clicks on service card "checkout-api"
    When the drawer opens
    Then it slides in from the right within 300ms
    And displays full service details:
      | Section   | Content                             |
      | Header    | Service name, health badge, source  |
      | Ownership | Team name, on-call engineer         |
      | Metadata  | Region, runtime, version, tags      |
      | Health    | Scorecard with metrics              |
      | Links     | GitHub repo, PagerDuty, AWS console |
    And the main catalog page remains visible behind the drawer

  Scenario: Drawer URL is shareable
    Given the drawer is open for "checkout-api" (id: svc-123)
    When the user copies the URL
    Then the URL is /catalog?service=svc-123
    And sharing this URL opens the catalog with the drawer pre-opened

  Scenario: Close drawer with Escape key
    Given the service detail drawer is open
    When the user presses Escape
    Then the drawer closes
    And focus returns to the service card that was clicked

  Scenario: Navigate between services in drawer
    Given the drawer is open for "checkout-api"
    When the user presses the right arrow or clicks "Next service"
    Then the drawer transitions to the next service in the current filtered/sorted list
    And the URL updates accordingly

  Scenario: Drawer shows stale data warning
    Given service "legacy-api" has status: "stale" (not seen in last scan)
    When the drawer opens for "legacy-api"
    Then a warning banner shows: "This service was not found in the last scan (3 days ago)"
    And a "Re-run scan" button is available

  Scenario: Edit service owner from drawer
    Given the user has editor permissions
    When they click "Edit owner" in the drawer
    Then an inline edit field appears
    And they can type a new owner name with autocomplete from known teams
    And saving updates the catalog immediately
    And the drawer reflects the new owner without a full page reload
```

### Story 5.3 — Cmd+K Search in UI

```gherkin
Feature: Cmd+K Search UI Integration

  Scenario: Search modal shows categorized results
    Given the user types "pay" in the Cmd+K search modal
    When results are returned
    Then they are grouped by category:
      | Category | Examples                       |
      | Services | payment-api, payment-processor |
      | Teams    | payments-team                  |
      | Actions  | Run discovery scan             |
    And keyboard navigation works between categories

  Scenario: Search modal shows recent searches on open
    Given the user has previously searched for "auth", "billing", "gateway"
    When the user opens Cmd+K without typing
    Then recent searches are shown as suggestions
    And clicking a suggestion populates the search input

  Scenario: Search result keyboard navigation
    Given the search modal is open with 5 results
    When the user presses the down arrow key
    Then the first result is highlighted
    And pressing down again highlights the second result
    And pressing Enter navigates to the highlighted result

  Scenario: Search modal is accessible
    Given the search modal is open
    Then it has role="dialog" and aria-label="Search services"
    And the input has aria-label="Search"
    And results have role="listbox" with role="option" items
    And screen readers announce result count changes
```

---

## Epic 6: Analytics Dashboards

### Story 6.1 — Ownership Coverage Dashboard

```gherkin
Feature: Ownership Coverage Analytics

  Scenario: Display ownership coverage percentage
    Given the tenant has 100 services in the catalog
    And 75 services have a confirmed owner
    And 25 services have owner: null or "unknown"
    When the analytics dashboard loads
    Then the ownership coverage metric shows 75%
    And a donut chart shows the breakdown: 75 owned, 25 unowned
    And a trend line shows coverage change over the last 30 days

  Scenario: Ownership coverage by team
    Given multiple teams own services
    When the ownership breakdown table is rendered
    Then it shows each team with their service count and percentage
    And teams are sorted by service count descending
    And clicking a team filters the catalog to that team's services

  Scenario: Ownership coverage drops below threshold
    Given the ownership coverage threshold is set to 80%
    And coverage drops from 82% to 76% after a scan
    When the dashboard refreshes
    Then a warning alert is shown: "Ownership coverage below 80% threshold"
    And the affected unowned services are listed
    And an email notification is sent to the tenant admin

  Scenario: Zero unowned services
    Given all 50 services have confirmed owners
    When the ownership dashboard loads
    Then coverage shows 100%
    And a success state is shown: "All services have owners"
    And no warning alerts are displayed
```

### Story 6.2 — Service Health Scorecards

```gherkin
Feature: Service Health Scorecards

  Scenario: Health scorecard is calculated for a service
    Given service "payment-api" has the following signals:
      | Signal             | Value  | Score |
      | Has owner          | Yes    | 20    |
      | Has on-call policy | Yes    | 20    |
      | Has documentation  | Yes    | 15    |
      | Recent deployment  | 2 days | 15    |
      | No open P1 alerts  | True   | 30    |
    When the health scorecard is computed
    Then the overall score is 100/100
    And the health_status is "healthy"

  Scenario: Service with missing ownership scores lower
    Given service "orphan-lambda" has:
      | Signal             | Value   | Score |
      | Has owner          | No      | 0     |
      | Has on-call policy | No      | 0     |
      | Has documentation  | No      | 0     |
      | Recent deployment  | 90 days | 5     |
      | No open P1 alerts  | True    | 30    |
    When the health scorecard is computed
    Then the overall score is 35/100
    And the health_status is "at-risk"
    And the scorecard highlights the missing signals as improvement actions

  Scenario: Health scorecard trend over time
    Given "checkout-api" has had weekly scorecard snapshots for 8 weeks
    When the service detail drawer shows the health trend
    Then a sparkline chart shows the score history
    And the trend direction (improving/declining) is indicated

  Scenario: Team-level health KPI aggregation
    Given "payments-team" owns 10 services with scores: [90, 85, 70, 95, 60, 80, 75, 88, 92, 65]
    When the team KPI dashboard is rendered
    Then the team average score is 80/100
    And the lowest-scoring service is highlighted for attention
    And the team score is compared to the org average

  Scenario: Health scorecard for stale service
    Given service "legacy-api" has status: "stale"
    When the scorecard is computed
    Then the "Recent deployment" signal scores 0
    And a penalty is applied for being stale
    And the overall score reflects the staleness
```

### Story 6.3 — Tech Debt Tracking

```gherkin
Feature: Tech Debt Tracking Dashboard

  Scenario: Identify services using deprecated runtimes
    Given the portal has a policy: "Node.js < 18 is deprecated"
    And 5 services use Node.js 16
    When the tech debt dashboard loads
    Then it shows 5 services flagged for "deprecated runtime"
    And each flagged service links to its catalog entry
    And the total tech debt score is calculated

  Scenario: Track services with no recent deployments
    Given the policy threshold is "no deployment in 90 days = tech debt"
    And 8 services have not been deployed in over 90 days
    When the tech debt dashboard loads
    Then these 8 services appear in the "stale deployments" category
    And the owning teams are notified via the dashboard

  Scenario: Tech debt trend over time
    Given tech debt metrics have been tracked for 12 weeks
    When the trend chart is rendered
    Then it shows weekly tech debt item counts
    And highlights weeks where debt increased significantly
    And shows the net change (items resolved vs. items added)
```

---
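The pagination expectations above (Story 3.3: 42 services, page 2 with limit 20 returns services 21-40; Story 5.1: "Page 1 of 3") reduce to a few lines of arithmetic. A minimal Python sketch; the function name and return shape are illustrative, not the portal's actual API:

```python
import math

def paginate(total_count: int, page: int = 1, limit: int = 20) -> dict:
    """Compute the pagination metadata the API returns alongside each page."""
    total_pages = math.ceil(total_count / limit) if total_count else 1
    start = (page - 1) * limit           # 0-based slice start
    end = min(start + limit, total_count)
    return {
        "page": page,
        "limit": limit,
        "total_count": total_count,
        "total_pages": total_pages,
        "slice": (start, end),           # items[start:end] = this page
    }
```

For 42 services, `paginate(42, page=2)` yields `total_pages` of 3 and the slice `(20, 40)`, i.e. the 21st through 40th services, matching the scenario.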
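The tenant-scoped cache-aside flow in Stories 3.4 and 4.2 (key scheme `search:<tenant>:<prefix>`, 60-second TTL, tenant-wide flush on bulk updates) can be sketched without a real Redis. The in-memory store below is a stand-in for Redis, and the class and function names are hypothetical; only the key scheme and TTL come from the spec:

```python
import time

CACHE_TTL_SECONDS = 60  # per Story 4.2

class PrefixCache:
    """In-memory stand-in for the Redis prefix cache, keyed search:<tenant>:<prefix>."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    @staticmethod
    def key(tenant: str, prefix: str) -> str:
        return f"search:{tenant}:{prefix}"

    def get(self, tenant, prefix, now=None):
        now = time.time() if now is None else now
        k = self.key(tenant, prefix)
        entry = self._store.get(k)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:        # TTL elapsed: treat as a miss
            del self._store[k]
            return None
        return value

    def set(self, tenant, prefix, value, now=None):
        now = time.time() if now is None else now
        self._store[self.key(tenant, prefix)] = (now + CACHE_TTL_SECONDS, value)

    def flush_tenant(self, tenant):
        """Drop every search key for one tenant (bulk catalog update)."""
        doomed = [k for k in self._store if k.startswith(f"search:{tenant}:")]
        for k in doomed:
            del self._store[k]

def search(cache, tenant, query, backend):
    """Cache-aside: serve from cache on a hit, else query the engine and cache it."""
    cached = cache.get(tenant, query)
    if cached is not None:
        return cached
    results = backend(tenant, query)   # stand-in for the Meilisearch query
    cache.set(tenant, query, results)  # fresh 60s TTL
    return results
```

Because the tenant slug is part of the key, "acme" and "globex" searching the same prefix never share entries, and flushing one tenant leaves the other's cache warm.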
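The scorecard arithmetic in Story 6.2 can be sketched directly from the scenario tables. The signal weights (20/20/15/15/30) are taken from the spec; the deployment-recency decay (15 points within 30 days, 5 up to 90 days, 0 beyond) and the status cutoffs are assumptions chosen to reproduce both worked examples, not the portal's confirmed rules:

```python
# Weights per the Story 6.2 scenario tables.
SIGNAL_WEIGHTS = {
    "has_owner": 20,
    "has_on_call_policy": 20,
    "has_documentation": 15,
    "recent_deployment": 15,
    "no_open_p1_alerts": 30,
}

def deployment_score(days_since_deploy):
    """Assumed decay: recent deploys earn full points, old ones taper to zero."""
    if days_since_deploy is None:
        return 0
    if days_since_deploy <= 30:
        return 15
    if days_since_deploy <= 90:
        return 5   # matches the "90 days -> 5" row for orphan-lambda
    return 0

def health_score(signals: dict) -> int:
    score = 0
    for name, weight in SIGNAL_WEIGHTS.items():
        if name == "recent_deployment":
            score += deployment_score(signals.get("days_since_deploy"))
        elif signals.get(name):
            score += weight
    return score

def health_status(score: int) -> str:
    """Assumed thresholds consistent with the scenarios' labels."""
    if score >= 80:
        return "healthy"
    if score >= 50:
        return "degraded"
    return "at-risk"
```

With these rules, "payment-api" (all signals present, deployed 2 days ago) scores 100 and is "healthy", while "orphan-lambda" (no owner, on-call, or docs; deployed 90 days ago; no P1 alerts) scores 35 and is "at-risk", matching both scenarios.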
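The ownership-coverage metric in Story 6.1 is a simple ratio; the subtlety is that both `null` and the literal string "unknown" count as unowned. A minimal sketch under that reading; the helper names and the treatment of empty strings are assumptions:

```python
def ownership_coverage(services):
    """Return (coverage_pct, owned_count, unowned_count) for a list of service dicts."""
    owned = sum(1 for s in services if s.get("owner") not in (None, "", "unknown"))
    total = len(services)
    pct = round(100 * owned / total) if total else 0
    return pct, owned, total - owned

def coverage_warnings(services, threshold_pct=80):
    """Emit the below-threshold warning the dashboard shows (80% default per spec)."""
    pct, _, _ = ownership_coverage(services)
    if pct < threshold_pct:
        return [f"Ownership coverage below {threshold_pct}% threshold ({pct}%)"]
    return []
```

For the 100-service scenario (75 owned, 25 `null`/"unknown") this yields 75% and triggers the warning; a fully owned catalog yields 100% and no warnings.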