# dd0c/portal — BDD Acceptance Test Specifications

> Gherkin scenarios for all 10 epics. Edge cases included.

---

## Epic 1: AWS Discovery Scanner

### Story 1.1 — IAM Role Assumption into Customer Account

```gherkin
Feature: IAM Role Assumption for Cross-Account Discovery

  Background:
    Given the portal has a tenant with AWS integration configured
    And the tenant has provided an IAM role ARN "arn:aws:iam::123456789012:role/PortalDiscoveryRole"

  Scenario: Successfully assume IAM role and begin scan
    Given the IAM role has the required trust policy allowing the portal's scanner principal
    When the AWS Discovery Scanner initiates a scan for the tenant
    Then the scanner assumes the role via STS AssumeRole
    And receives temporary credentials valid for 1 hour
    And begins scanning the configured AWS regions

  Scenario: IAM role assumption fails due to missing trust policy
    Given the IAM role does NOT have a trust policy for the portal's scanner principal
    When the AWS Discovery Scanner attempts to assume the role
    Then STS returns an "AccessDenied" error
    And the scanner marks the scan job as FAILED with reason "IAM role assumption denied"
    And the tenant receives a notification: "AWS integration error: cannot assume role"
    And no partial scan results are persisted

  Scenario: IAM role ARN is malformed
    Given the tenant has configured an invalid role ARN "not-a-valid-arn"
    When the AWS Discovery Scanner attempts to assume the role
    Then the scanner marks the scan job as FAILED with reason "Invalid role ARN format"
    And logs a structured error with tenant_id and the invalid ARN

  Scenario: Temporary credentials expire mid-scan
    Given the scanner has assumed the IAM role and started scanning
    And the scan has been running for 61 minutes
    When the scanner attempts to call ECS DescribeClusters
    Then AWS returns an "ExpiredTokenException"
    And the scanner refreshes credentials via a new AssumeRole call
    And resumes scanning from the last checkpoint
    And the scan job status remains IN_PROGRESS

  Scenario: Role assumption succeeds but has insufficient permissions
    Given the IAM role is assumed successfully
    But the role lacks "ecs:ListClusters" permission
    When the scanner attempts to list ECS clusters
    Then AWS returns an "AccessDenied" error for that specific call
    And the scanner records a partial failure for the ECS resource type
    And continues scanning other resource types (Lambda, CloudFormation, API Gateway)
    And the final scan result includes a warnings list with "ECS: insufficient permissions"
```

### Story 1.2 — ECS Service Discovery

```gherkin
Feature: ECS Service Discovery

  Background:
    Given a tenant's AWS account has been successfully authenticated
    And the scanner is configured to scan region "us-east-1"

  Scenario: Discover ECS services across multiple clusters
    Given the AWS account has 3 ECS clusters: "prod-cluster", "staging-cluster", "dev-cluster"
    And "prod-cluster" has 5 services, "staging-cluster" has 3, "dev-cluster" has 2
    When the AWS Discovery Scanner runs the ECS scan step
    Then it lists all 3 clusters via ecs:ListClusters
    And describes all services in each cluster via ecs:DescribeServices
    And emits 10 discovered service events to the catalog ingestion queue
    And each event includes: cluster_name, service_name, task_definition, desired_count, region

  Scenario: ECS cluster has no services
    Given the AWS account has 1 ECS cluster "empty-cluster" with 0 services
    When the scanner runs the ECS scan step
    Then it lists the cluster successfully
    And records 0 services discovered for that cluster
    And does not emit any service events for that cluster
    And the scan step completes without error

  Scenario: ECS DescribeServices returns throttling error
    Given the AWS account has an ECS cluster with 100 services
    When the scanner calls ecs:DescribeServices in batches of 10
    And AWS returns a ThrottlingException on the 5th batch
    Then the scanner applies exponential backoff (2s, 4s, 8s)
    And retries the failed batch up to 3 times
    And if retry succeeds, continues with remaining batches
    And the final result includes all 100 services

  Scenario: ECS DescribeServices fails after all retries
    Given AWS returns ThrottlingException on every retry attempt
    When the scanner exhausts all 3 retries for a batch
    Then it marks that batch as partially failed
    And records which service ARNs could not be described
    And continues with the next batch
    And the scan summary includes "ECS partial failure: 10 services could not be described"

  Scenario: Multi-region ECS scan
    Given the tenant has configured scanning for regions: "us-east-1", "eu-west-1", "ap-southeast-1"
    When the AWS Discovery Scanner runs
    Then it scans ECS in all 3 regions in parallel via Step Functions Map state
    And aggregates results from all regions
    And each discovered service record includes its region
    And duplicate service names across regions are stored as separate catalog entries
```

### Story 1.3 — Lambda Function Discovery

```gherkin
Feature: Lambda Function Discovery

  Scenario: Discover all Lambda functions in a region
    Given the AWS account has 25 Lambda functions in "us-east-1"
    When the scanner runs the Lambda scan step
    Then it calls lambda:ListFunctions with pagination
    And retrieves all 25 functions across multiple pages
    And emits 25 service events with: function_name, runtime, memory_size, timeout, last_modified

  Scenario: Lambda function has tags indicating service ownership
    Given a Lambda function "payment-processor" has tags:
      | Key         | Value         |
      | team        | payments-team |
      | service     | payment-svc   |
      | environment | production    |
    When the scanner processes this Lambda function
    Then the catalog entry includes ownership: "payments-team" (source: implicit/tags)
    And the service name is inferred as "payment-svc" from the "service" tag

  Scenario: Lambda function has no tags
    Given a Lambda function "legacy-cron-job" has no tags
    When the scanner processes this Lambda function
    Then the catalog entry has ownership: null (source: heuristic/pending)
    And the entry is flagged for ownership inference via commit history

  Scenario: Lambda ListFunctions pagination
    Given the AWS account has 150 Lambda functions (exceeding the 50-per-page limit)
    When the scanner calls lambda:ListFunctions
    Then it follows NextMarker pagination tokens
    And retrieves all 150 functions across 3 pages
    And does not duplicate any function in the results
```

### Story 1.4 — CloudFormation Stack Discovery

```gherkin
Feature: CloudFormation Stack Discovery

  Scenario: Discover active CloudFormation stacks
    Given the AWS account has 8 CloudFormation stacks in "us-east-1"
    And 6 stacks have status "CREATE_COMPLETE" or "UPDATE_COMPLETE"
    And 2 stacks have status "ROLLBACK_COMPLETE"
    When the scanner runs the CloudFormation scan step
    Then it discovers all 8 stacks
    And includes stacks of all statuses in the raw results
    And marks ROLLBACK_COMPLETE stacks with health_status: "degraded"

  Scenario: CloudFormation stack outputs contain service metadata
    Given a stack "api-gateway-stack" has outputs:
      | OutputKey   | OutputValue                   |
      | ServiceName | user-api                      |
      | TeamOwner   | platform-team                 |
      | ApiEndpoint | https://api.example.com/users |
    When the scanner processes this stack
    Then the catalog entry uses "user-api" as the service name
    And sets ownership to "platform-team" (source: implicit/stack-output)
    And records the API endpoint in service metadata

  Scenario: Nested stacks are discovered
    Given a parent CloudFormation stack has 3 nested stacks
    When the scanner processes the parent stack
    Then it also discovers and catalogs each nested stack
    And links nested stacks to their parent in the catalog

  Scenario: Stack in DELETE_IN_PROGRESS state
    Given a CloudFormation stack has status "DELETE_IN_PROGRESS"
    When the scanner discovers this stack
    Then it records the stack with health_status: "terminating"
    And does not block the scan on this stack's state
```

### Story 1.5 — API Gateway Discovery

```gherkin
Feature: API Gateway Discovery

  Scenario: Discover REST APIs in API Gateway v1
    Given the AWS account has 4 REST APIs in API Gateway
    When the scanner runs the API Gateway scan step
    Then it calls apigateway:GetRestApis
    And retrieves all 4 APIs with their names, IDs, and endpoints
    And emits catalog events for each API

  Scenario: Discover HTTP APIs in API Gateway v2
    Given the AWS account has 3 HTTP APIs in API Gateway v2
    When the scanner runs the API Gateway v2 scan step
    Then it calls apigatewayv2:GetApis
    And retrieves all 3 HTTP APIs
    And correctly distinguishes them from REST APIs in the catalog

  Scenario: API Gateway API has no associated service tag
    Given an API Gateway REST API "legacy-api" has no resource tags
    When the scanner processes this API
    Then it creates a catalog entry with name "legacy-api"
    And sets ownership to null pending heuristic inference
    And records the API type as "REST" and the invoke URL

  Scenario: API Gateway scan across multiple regions
    Given API Gateway APIs exist in "us-east-1" and "us-west-2"
    When the scanner runs in parallel across both regions
    Then APIs from both regions are discovered
    And each catalog entry includes the region field
    And there are no cross-region duplicates
```

### Story 1.6 — Step Functions Orchestration

```gherkin
Feature: Step Functions Orchestration of Discovery Scan

  Scenario: Full scan executes all resource type steps in parallel
    Given a scan job is triggered for a tenant
    When the Step Functions state machine starts
    Then it executes ECS, Lambda, CloudFormation, and API Gateway scan steps in parallel
    And waits for all parallel branches to complete
    And aggregates results from all branches
    And transitions to the catalog ingestion step

  Scenario: One parallel branch fails, others succeed
    Given the Step Functions state machine is running a scan
    And the Lambda scan step throws an unhandled exception
    When the state machine catches the error
    Then the ECS, CloudFormation, and API Gateway branches continue to completion
    And the Lambda branch is marked as FAILED in the scan summary
    And the overall scan job status is PARTIAL_SUCCESS
    And discovered resources from successful branches are ingested into the catalog

  Scenario: Scan job is idempotent on retry
    Given a scan job failed midway through execution
    When the Step Functions state machine retries the execution
    Then it does not duplicate already-ingested catalog entries
    And uses upsert semantics based on (tenant_id, resource_arn) as the unique key

  Scenario: Concurrent scan jobs for the same tenant are prevented
    Given a scan job is already IN_PROGRESS for tenant "acme-corp"
    When a second scan job is triggered for "acme-corp"
    Then the system detects the existing in-progress execution
    And rejects the new job with status 409 Conflict
    And returns the ID of the existing running scan job

  Scenario: Scan job timeout
    Given a scan job has been running for more than 30 minutes
    When the Step Functions state machine timeout is reached
    Then the execution is marked as TIMED_OUT
    And partial results already ingested are retained in the catalog
    And the tenant is notified of the timeout
    And the scan job record shows last_successful_step

  Scenario: WebSocket progress events during scan
    Given a user is watching the discovery progress in the UI
    When each Step Functions step completes
    Then a WebSocket event is pushed to the user's connection
    And the event includes: step_name, status, resources_found, timestamp
    And the UI progress bar updates in real-time
```

### Story 1.7 — Multi-Region Scanning

```gherkin
Feature: Multi-Region AWS Scanning

  Scenario: Tenant configures specific regions to scan
    Given a tenant has configured scan regions: ["us-east-1", "eu-central-1"]
    When the discovery scanner runs
    Then it only scans the configured regions
    And does not scan any other AWS regions
    And the scan summary shows per-region resource counts

  Scenario: Tenant configures "all regions" scan
    Given a tenant has configured scan_all_regions: true
    When the discovery scanner runs
    Then it first calls ec2:DescribeRegions to get the current list of enabled regions
    And scans all enabled regions
    And the Step Functions Map state iterates over the dynamic region list

  Scenario: A region is disabled in the AWS account
    Given the tenant's AWS account has "ap-east-1" disabled
    And the tenant has configured scan_all_regions: true
    When the scanner attempts to scan "ap-east-1"
    Then it receives an "OptInRequired" or region-disabled error
    And skips that region gracefully
    And records "ap-east-1: skipped (region disabled)" in the scan summary

  Scenario: Region scan results are isolated per tenant
    Given tenant "acme" and tenant "globex" both have resources in "us-east-1"
    When both tenants run discovery scans simultaneously
    Then acme's catalog only contains acme's resources
    And globex's catalog only contains globex's resources
    And no cross-tenant data leakage occurs
```

---

## Epic 2: GitHub Discovery Scanner

### Story 2.1 — GitHub App Installation and Authentication

```gherkin
Feature: GitHub App Authentication

  Background:
    Given the portal has a GitHub App registered with app_id "12345"
    And the app has a private key stored in AWS Secrets Manager

  Scenario: Successfully authenticate as GitHub App installation
    Given a tenant has installed the GitHub App on their organization "acme-org"
    And the installation_id is "67890"
    When the GitHub Discovery Scanner initializes for this tenant
    Then it generates a JWT signed with the app's private key
    And exchanges the JWT for an installation access token via POST /app/installations/67890/access_tokens
    And the token has scopes: contents:read, metadata:read
    And the scanner uses this token for all subsequent API calls

  Scenario: GitHub App JWT generation fails due to expired private key
    Given the private key in Secrets Manager is expired or invalid
    When the scanner attempts to generate a JWT
    Then it fails with a key validation error
    And marks the scan job as FAILED with reason "GitHub App authentication failed: invalid private key"
    And triggers an alert to the tenant admin

  Scenario: Installation access token request returns 404
    Given the GitHub App installation has been uninstalled by the tenant
    When the scanner requests an installation access token
    Then GitHub returns 404 Not Found
    And the scanner marks the GitHub integration as DISCONNECTED
    And notifies the tenant: "GitHub App has been uninstalled. Please reinstall to continue discovery."

  Scenario: Installation access token expires mid-scan
    Given the scanner has a valid installation token (valid for 1 hour)
    And the scan has been running for 58 minutes
    When the token expires during a GraphQL query
    Then GitHub returns 401 Unauthorized
    And the scanner refreshes the token and retries the failed query
    And subsequent scans proactively refresh the token before expiry (at the 55-minute mark)
    And the scan continues without interruption
```

### Story 2.2 — Repository Discovery via GraphQL

```gherkin
Feature: GitHub Repository Discovery via GraphQL API

  Scenario: Discover all repositories in an organization
    Given the tenant's GitHub organization "acme-org" has 150 repositories
    When the GitHub Discovery Scanner runs the repository listing query
    Then it uses the GraphQL API with cursor-based pagination
    And retrieves all 150 repositories in batches of 100
    And each repository record includes: name, defaultBranch, isArchived, pushedAt, topics

  Scenario: Archived repositories are excluded by default
    Given "acme-org" has 150 active repos and 30 archived repos
    When the scanner runs with default settings
    Then it discovers 150 active repositories
    And excludes the 30 archived repositories
    And the scan summary notes "30 archived repositories skipped"

  Scenario: Tenant opts in to include archived repositories
    Given the tenant has configured include_archived: true
    When the scanner runs
    Then it discovers all 180 repositories including archived ones
    And archived repos are marked with status: "archived" in the catalog

  Scenario: Organization has no repositories
    Given the tenant's GitHub organization has 0 repositories
    When the scanner runs the repository listing query
    Then it returns an empty result set
    And the scan completes successfully with 0 GitHub services discovered
    And no errors are raised

  Scenario: GraphQL query returns partial data with errors
    Given the GraphQL query returns 80 repositories successfully
    But includes a "FORBIDDEN" error for 20 private repositories
    When the scanner processes the response
    Then it ingests the 80 accessible repositories
    And records a warning: "20 repositories inaccessible due to permissions"
    And the scan status is PARTIAL_SUCCESS
```

### Story 2.3 — Service Metadata Extraction from Repositories

```gherkin
Feature: Service Metadata Extraction from Repository Files

  Scenario: Extract service name from package.json
    Given a repository "user-service" has a package.json at the root:
      """
      {
        "name": "@acme/user-service",
        "version": "2.1.0",
        "description": "Handles user authentication and profiles"
      }
      """
    When the scanner fetches and parses package.json
    Then the catalog entry uses service name "@acme/user-service"
    And records version "2.1.0"
    And records description "Handles user authentication and profiles"
    And sets the service type as "nodejs"

  Scenario: Extract service metadata from Dockerfile
    Given a repository has a Dockerfile with:
      """
      FROM python:3.11-slim
      LABEL service.name="data-pipeline"
      LABEL service.team="data-engineering"
      EXPOSE 8080
      """
    When the scanner parses the Dockerfile
    Then the catalog entry uses service name "data-pipeline"
    And sets ownership to "data-engineering" (source: implicit/dockerfile-label)
    And records the service type as "python"
    And records exposed port 8080

  Scenario: Extract service metadata from Terraform files
    Given a repository has a main.tf with:
      """
      module "service" {
        source       = "..."
        service_name = "billing-api"
        team         = "billing-team"
      }
      """
    When the scanner parses Terraform files
    Then the catalog entry uses service name "billing-api"
    And sets ownership to "billing-team" (source: implicit/terraform)

  Scenario: Repository has no recognizable service metadata files
    Given a repository "random-scripts" has no package.json, Dockerfile, or terraform files
    When the scanner processes this repository
    Then it creates a catalog entry using the repository name "random-scripts"
    And sets service type as "unknown"
    And flags the entry for manual review
    And ownership is set to null pending heuristic inference

  Scenario: Multiple metadata files exist — precedence order
    Given a repository has both a package.json (name: "svc-a") and a Dockerfile (LABEL service.name="svc-b")
    When the scanner processes the repository
    Then it uses package.json as the primary metadata source
    And records the service name as "svc-a"
    And notes the Dockerfile label as secondary metadata

  Scenario: package.json is malformed JSON
    Given a repository has a package.json with invalid JSON syntax
    When the scanner attempts to parse it
    Then it logs a warning: "package.json parse error in repo: user-service"
    And falls back to Dockerfile or terraform for metadata
    And if no fallback exists, uses the repository name
    And does not crash the scan job
```

### Story 2.4 — GitHub Rate Limit Handling

```gherkin
Feature: GitHub API Rate Limit Handling

  Scenario: GraphQL rate limit warning threshold reached
    Given the scanner has consumed 4,500 of 5,000 GraphQL rate limit points
    When the scanner checks the rate limit headers after each response
    Then it detects the threshold (90% consumed)
    And pauses scanning until the rate limit resets
    And logs: "GitHub rate limit threshold reached, waiting for reset at {reset_time}"
    And resumes scanning after the reset window

  Scenario: REST API secondary rate limit hit
    Given the scanner is making rapid REST API calls
    When GitHub returns HTTP 429 with "secondary rate limit" in the response
    Then the scanner reads the Retry-After header
    And waits the specified number of seconds before retrying
    And does not count this as a scan failure

  Scenario: Rate limit exhausted with many repos remaining
    Given the scanner has 500 repositories to scan
    And the rate limit is exhausted after scanning 200 repositories
    When the scanner detects rate limit exhaustion
    Then it checkpoints the current progress (200 repos scanned)
    And schedules a continuation scan after the rate limit reset
    And the scan job status is set to RATE_LIMITED (not FAILED)
    And the tenant is notified of the delay

  Scenario: Rate limit headers are missing from response
    Given GitHub returns a response without X-RateLimit headers
    When the scanner processes the response
    Then it applies a conservative default delay of 1 second between requests
    And logs a warning about missing rate limit headers
    And continues scanning without crashing

  Scenario: Concurrent scans from multiple tenants share rate limit awareness
    Given two tenants share the same GitHub App installation
    When both tenants trigger scans simultaneously
    Then the scanner uses a shared rate limit counter in Redis
    And distributes available rate limit points between the two scans
    And neither scan exceeds the total available rate limit
```

### Story 2.5 — Ownership Inference from Commit History

```gherkin
Feature: Heuristic Ownership Inference from Commit History

  Scenario: Infer owner from most recent committer
    Given a repository has no explicit ownership config and no team tags
    And the last 10 commits were made by "alice@acme.com" (7 commits) and "bob@acme.com" (3 commits)
    When the ownership inference engine runs
    Then it identifies "alice@acme.com" as the primary contributor
    And maps the email to team "frontend-team" via the org's CODEOWNERS or team membership
    And sets ownership to "frontend-team" (source: heuristic/commit-history)
    And records confidence: 0.7

  Scenario: Ownership inference from CODEOWNERS file
    Given a repository has a CODEOWNERS file:
      """
      * @acme-org/platform-team
      /src/api/ @acme-org/api-team
      """
    When the scanner processes the CODEOWNERS file
    Then it sets ownership to "platform-team" (source: implicit/codeowners)
    And records that the api directory has additional ownership by "api-team"
    And this takes precedence over commit history heuristics

  Scenario: Ownership conflict — multiple teams have equal commit share
    Given a repository has commits split 50/50 between "team-a" and "team-b"
    When the ownership inference engine runs
    Then it records both teams as co-owners
    And sets ownership_confidence: 0.5
    And flags the service for manual ownership resolution
    And the catalog entry shows ownership_status: "conflict"

  Scenario: Explicit ownership config overrides all other sources
    Given a repository has a .portal.yaml file:
      """
      owner: payments-team
      tier: critical
      """
    And the repository also has CODEOWNERS pointing to "platform-team"
    And commit history suggests "devops-team"
    When the scanner processes the repository
    Then ownership is set to "payments-team" (source: explicit/portal-config)
    And CODEOWNERS and commit history are recorded as secondary metadata
    And the explicit config is not overridden by any other source

  Scenario: Commit history API call fails
    Given the GitHub API returns 500 when fetching commit history for a repo
    When the ownership inference engine attempts heuristic inference
    Then it marks ownership as null (source: unknown)
    And records the error in the scan summary
    And does not block catalog ingestion for this service
```

---

## Epic 3: Service Catalog

### Story 3.1 — Catalog Ingestion and Upsert

```gherkin
Feature: Service Catalog Ingestion

  Background:
    Given the catalog uses PostgreSQL Aurora Serverless v2
    And the unique key for a service is (tenant_id, source, resource_id)

  Scenario: New service is ingested into the catalog
    Given a discovery scan emits a new service event for "payment-api"
    When the catalog ingestion handler processes the event
    Then a new row is inserted into the services table
    And the row includes: tenant_id, name, source, resource_id, owner, health_status, discovered_at
    And the Meilisearch index is updated with the new service document
    And the catalog entry is immediately searchable

  Scenario: Existing service is updated on re-scan (upsert)
    Given "payment-api" already exists in the catalog with owner "old-team"
    When a new scan emits an updated event for "payment-api" with owner "payments-team"
    Then the existing row is updated (not duplicated)
    And owner is changed to "payments-team"
    And updated_at is refreshed
    And the Meilisearch document is updated

  Scenario: Service removed from AWS is marked stale
    Given "deprecated-lambda" exists in the catalog from a previous scan
    When the latest scan completes and does not include "deprecated-lambda"
    Then the catalog entry is marked with status: "stale"
    And last_seen_at is not updated
    And the service remains visible in the catalog with a "stale" badge
    And is not immediately deleted

  Scenario: Stale service is purged after retention period
    Given "deprecated-lambda" has been stale for more than 30 days
    When the nightly cleanup job runs
    Then the service is soft-deleted from the catalog
    And removed from the Meilisearch index
    And a deletion event is logged for audit purposes

  Scenario: Catalog ingestion fails due to Aurora connection error
    Given Aurora Serverless is scaling up (cold start)
    When the ingestion handler attempts to write a service record
    Then it retries with exponential backoff up to 5 times
    And if all retries fail, the event is sent to a dead-letter queue
    And an alert is raised for the operations team
    And the scan job is marked PARTIAL_SUCCESS

  Scenario: Bulk ingestion of 500 services from a large account
    Given a scan discovers 500 services across all resource types
    When the catalog ingestion handler processes all events
    Then it uses batch inserts (100 records per batch)
    And completes ingestion within 30 seconds
    And all 500 services appear in the catalog
    And the Meilisearch index reflects all 500 services
```

### Story 3.2 — PagerDuty / OpsGenie On-Call Mapping

```gherkin
Feature: On-Call Ownership Mapping via PagerDuty and OpsGenie

  Scenario: Map service owner to PagerDuty escalation policy
    Given a service "checkout-api" has owner "payments-team"
    And PagerDuty integration is configured for the tenant
    And "payments-team" maps to PagerDuty service "checkout-escalation-policy"
    When the catalog builds the service detail record
    Then the service includes on_call_policy: "checkout-escalation-policy"
    And a link to the PagerDuty service page

  Scenario: Fetch current on-call engineer from PagerDuty
    Given "checkout-api" has a linked PagerDuty escalation policy
    When a user views the service detail page
    Then the portal calls PagerDuty GET /oncalls?escalation_policy_ids[]=...
    And displays the current on-call engineer's name and contact
    And caches the result for 5 minutes in Redis

  Scenario: PagerDuty API returns 401 (invalid token)
    Given the PagerDuty API token has been revoked
    When the portal attempts to fetch on-call data
    Then it returns the service detail without on-call info
    And displays a warning: "On-call data unavailable — check PagerDuty integration"
    And logs the auth failure for the tenant admin

  Scenario: OpsGenie integration as alternative to PagerDuty
    Given the tenant uses OpsGenie instead of PagerDuty
    And OpsGenie integration is configured with an API key
    When the portal fetches on-call data for a service
    Then it calls the OpsGenie API instead of PagerDuty
    And maps the OpsGenie schedule to the service owner
    And displays the current on-call responder

  Scenario: Service owner has no PagerDuty or OpsGenie mapping
    Given "internal-tool" has owner "eng-team"
    But "eng-team" has no mapping in PagerDuty or OpsGenie
    When the portal builds the service detail
    Then on_call_policy is null
    And the UI shows "No on-call policy configured" for this service
    And suggests linking a PagerDuty/OpsGenie service

  Scenario: Multiple services share the same on-call policy
    Given 10 services all map to the "platform-oncall" PagerDuty policy
    When the portal fetches on-call data
    Then it batches the PagerDuty API call for all 10 services
    And does not make 10 separate API calls
    And caches the shared result for all 10 services
```

### Story 3.3 — Service Catalog CRUD API

```gherkin
Feature: Service Catalog CRUD API

  Background:
    Given the API requires a valid Cognito JWT
    And the JWT contains a tenant_id claim

  Scenario: List services for a tenant
    Given tenant "acme" has 42 services in the catalog
    When GET /api/v1/services is called with a valid JWT for "acme"
    Then the response returns 42 services
    And each service includes: id, name, owner, source, health_status, updated_at
    And results are paginated (default page size: 20)
    And the response includes total_count: 42

  Scenario: List services with pagination
    Given tenant "acme" has 42 services
    When GET /api/v1/services?page=2&limit=20 is called
    Then the response returns services 21-40
    And includes pagination metadata: page, limit, total_count, total_pages

  Scenario: Get single service by ID
    Given service "svc-uuid-123" exists for tenant "acme"
    When GET /api/v1/services/svc-uuid-123 is called with acme's JWT
    Then the response returns the full service detail
    And includes: metadata, owner, on_call_policy, health_scorecard, tags, links

  Scenario: Get service belonging to different tenant returns 404
    Given service "svc-uuid-456" belongs to tenant "globex"
    When GET /api/v1/services/svc-uuid-456 is called with acme's JWT
    Then the response returns 404 Not Found
    And does not reveal that the service exists for another tenant

  Scenario: Manually update service owner via API
    Given service "legacy-api" has owner inferred as "unknown"
    When PATCH /api/v1/services/svc-uuid-789 is called with body:
      """
      {
        "owner": "platform-team",
        "owner_source": "manual"
      }
      """
    Then the service owner is updated to "platform-team"
    And owner_source is set to "explicit/manual"
    And the change is recorded in the audit log with the user's identity
    And the Meilisearch document is updated

  Scenario: Delete service from catalog
    Given service "decommissioned-api" exists in the catalog
    When DELETE /api/v1/services/svc-uuid-000 is called
    Then the service is soft-deleted (deleted_at is set)
    And it no longer appears in list or search results
    And the Meilisearch document is removed
    And the deletion is recorded in the audit log

  Scenario: Create service manually (not from discovery)
    When POST /api/v1/services is called with:
      """
      {
        "name": "manual-service",
        "owner": "ops-team",
        "source": "manual"
      }
      """
    Then a new service is created with source: "manual"
    And it appears in the catalog and search results
    And it is not overwritten by automated discovery scans

  Scenario: Unauthenticated request is rejected
    When GET /api/v1/services is called without an Authorization header
    Then the response returns 401 Unauthorized
    And no service data is returned
```

### Story 3.4 — Full-Text Search via Meilisearch

```gherkin
Feature: Catalog Full-Text Search

  Scenario: Search returns relevant services
    Given the catalog has services: "payment-api", "payment-processor", "user-service"
    When a search query "payment" is submitted
    Then Meilisearch returns "payment-api" and "payment-processor"
    And results are ranked by relevance score
    And the response time is under 10ms

  Scenario: Search is scoped to tenant
    Given tenant "acme" has service "auth-service"
    And tenant "globex" has service "auth-service" (same name)
    When tenant "acme" searches for "auth-service"
    Then only acme's "auth-service" is returned
    And globex's service is not included in results

  Scenario: Meilisearch index is corrupted or unavailable
    Given Meilisearch returns a 503 error
    When a search request is made
    Then the API falls back to PostgreSQL full-text search (pg_trgm)
    And returns results within 500ms
    And logs a warning: "Meilisearch unavailable, using PostgreSQL fallback"
    And the user sees results (degraded performance, not an error)

  Scenario: Catalog update triggers Meilisearch re-index
    Given a service "new-api" is added to the catalog
    When the catalog write completes
    Then a Meilisearch index update is triggered asynchronously
    And the service is searchable within 1 second
    And if the Meilisearch update fails, it is retried via a background queue
```

---

## Epic 4: Search Engine

### Story 4.1 — Cmd+K Instant Search

```gherkin
Feature: Cmd+K Instant Search Interface

  Scenario: User opens search with keyboard shortcut
    Given the user is on any page of the portal
    When the user presses Cmd+K (Mac) or Ctrl+K (Windows/Linux)
    Then the search modal opens immediately
    And the search input is focused
    And recent searches are shown as suggestions

  Scenario: Search returns results under 10ms
    Given the Meilisearch index has 500 services
    And the Redis prefix cache is warm
    When the user types "pay" in the search box
    Then results appear within 10ms
    And the top 5 matching services are shown
    And each result shows: service name, owner, health status

  Scenario: Search with Redis prefix cache hit
    Given "pay" has been searched recently and is cached in Redis
    When the user types "pay"
    Then the API returns cached results from Redis
    And does not query Meilisearch
    And the response time is under 5ms

  Scenario: Search with Redis cache miss
    Given "xyz-unique-query" has never been searched
    When the user types "xyz-unique-query"
    Then the API queries Meilisearch directly
    And stores the result in Redis with a TTL of 60 seconds
    And returns results within 10ms

  Scenario: Search cache is invalidated on catalog update
    Given "payment-api" search results are cached in Redis
    When the catalog is updated (new service added or existing service modified)
    Then all Redis cache keys matching affected prefixes are invalidated
    And the next search query fetches fresh results from Meilisearch

  Scenario: Search with no results
    Given no services match the query "zzznomatch"
    When the user searches for "zzznomatch"
    Then the search modal shows "No services found"
    And suggests: "Try a broader search or browse all services"
    And does not show an error

  Scenario: Search is dismissed
    Given the search modal is open
    When the user presses Escape
    Then the search modal closes
    And focus returns to the previously focused element

  Scenario: Search result is selected
    Given search results show "payment-api"
    When the user clicks or presses Enter on "payment-api"
    Then the search modal closes
    And the service detail drawer opens for "payment-api"
    And the URL updates to reflect the selected service
```

### Story 4.2 — Redis Prefix Caching

```gherkin
Feature: Redis Prefix Caching for Search

  Scenario: Cache key structure is tenant-scoped
    Given tenant "acme" searches for "pay"
    Then the Redis cache key is "search:acme:pay"
    And tenant "globex" searching for "pay" uses key "search:globex:pay"
    And the two cache entries are completely independent

  Scenario: Cache TTL expires and is refreshed
    Given a cache entry for "search:acme:pay" has a TTL of 60 seconds
    When 61 seconds pass without a search for "pay"
    Then the cache entry expires
    And the next search for "pay" queries Meilisearch
    And a new cache entry is created with a fresh 60-second TTL

  Scenario: Redis is unavailable
    Given Redis returns a connection error
    When a search request is made
    Then the API bypasses the cache and queries Meilisearch directly
    And logs a warning: "Redis unavailable, cache bypassed"
    And the search still returns results (degraded, not broken)
    And does not attempt to write to Redis while it is down

  Scenario: Cache invalidation on bulk catalog update
    Given a discovery scan adds 50 new services to the catalog
    When the bulk ingestion completes
    Then all Redis search cache keys for the affected tenant are flushed
    And subsequent searches reflect the updated catalog
    And the flush is atomic (all keys or none)
```

### Story 4.3 — Meilisearch Index Management

```gherkin
Feature: Meilisearch Index Management

  Scenario: Index is created on first tenant onboarding
    Given a new tenant "startup-co" completes onboarding
    When the tenant's
```
first discovery scan runs Then a Meilisearch index "services_startup-co" is created And the index is configured with searchable attributes: name, description, owner, tags And filterable attributes: source, health_status, owner, region Scenario: Index corruption is detected and rebuilt Given the Meilisearch index for tenant "acme" returns inconsistent results When the health check detects index corruption (checksum mismatch) Then an index rebuild is triggered And the index is rebuilt from the PostgreSQL catalog (source of truth) And during rebuild, search falls back to PostgreSQL full-text search And users see a banner: "Search index is being rebuilt, results may be incomplete" And the rebuild completes within 5 minutes for up to 10,000 services Scenario: Meilisearch index rebuild does not affect other tenants Given tenant "acme"'s index is being rebuilt When tenant "globex" performs a search Then globex's search is unaffected And globex's index is not touched during acme's rebuild Scenario: Search ranking is configured correctly Given the Meilisearch index has ranking rules configured When a user searches for "api" Then results are ranked by: typo, words, proximity, attribute, sort, exactness And services with "api" in the name rank higher than those with "api" in description And recently updated services rank higher among equal-relevance results ``` --- ## Epic 5: Dashboard UI ### Story 5.1 — Service Catalog Browse ```gherkin Feature: Service Catalog Browse UI Background: Given the user is authenticated and on the catalog page Scenario: Catalog page loads with service cards Given the tenant has 42 services in the catalog When the catalog page loads Then 20 service cards are displayed (first page) And each card shows: service name, owner, health status badge, source icon, last updated And a pagination control shows "Page 1 of 3" Scenario: Progressive disclosure — card expands on hover Given a service card for "payment-api" is visible When the user hovers over the 
card Then the card expands to show additional details: | Field | Value | | Description | Handles payment flows | | Region | us-east-1 | | On-call | alice@acme.com | | Tech stack | Node.js, Docker | And the expansion animation completes within 200ms Scenario: Filter services by owner Given the catalog has services from multiple teams When the user selects filter "owner: payments-team" Then only services owned by "payments-team" are shown And the URL updates with ?owner=payments-team And the filter chip is shown as active Scenario: Filter services by health status Given some services have health_status: "healthy", "degraded", "unknown" When the user selects filter "status: degraded" Then only degraded services are shown And the count badge updates to reflect filtered results Scenario: Filter services by source Given services come from "aws-ecs", "aws-lambda", "github" When the user selects filter "source: aws-lambda" Then only Lambda functions are shown in the catalog Scenario: Sort services by last updated Given the catalog has services with various updated_at timestamps When the user selects sort "Last Updated (newest first)" Then services are sorted with most recently updated first And the sort selection persists across page navigation Scenario: Empty catalog state Given the tenant has just onboarded and has 0 services When the catalog page loads Then an empty state is shown: "No services discovered yet" And a CTA button: "Run Discovery Scan" is prominently displayed And a link to the onboarding guide is shown ``` ### Story 5.2 — Service Detail Drawer ```gherkin Feature: Service Detail Drawer Scenario: Open service detail drawer Given the user clicks on service card "checkout-api" When the drawer opens Then it slides in from the right within 300ms And displays full service details: | Section | Content | | Header | Service name, health badge, source | | Ownership | Team name, on-call engineer | | Metadata | Region, runtime, version, tags | | Health | Scorecard with 
metrics | | Links | GitHub repo, PagerDuty, AWS console | And the main catalog page remains visible behind the drawer Scenario: Drawer URL is shareable Given the drawer is open for "checkout-api" (id: svc-123) When the user copies the URL Then the URL is /catalog?service=svc-123 And sharing this URL opens the catalog with the drawer pre-opened Scenario: Close drawer with Escape key Given the service detail drawer is open When the user presses Escape Then the drawer closes And focus returns to the service card that was clicked Scenario: Navigate between services in drawer Given the drawer is open for "checkout-api" When the user presses the right arrow or clicks "Next service" Then the drawer transitions to the next service in the current filtered/sorted list And the URL updates accordingly Scenario: Drawer shows stale data warning Given service "legacy-api" has status: "stale" (not seen in last scan) When the drawer opens for "legacy-api" Then a warning banner shows: "This service was not found in the last scan (3 days ago)" And a "Re-run scan" button is available Scenario: Edit service owner from drawer Given the user has editor permissions When they click "Edit owner" in the drawer Then an inline edit field appears And they can type a new owner name with autocomplete from known teams And saving updates the catalog immediately And the drawer reflects the new owner without a full page reload ``` ### Story 5.3 — Cmd+K Search in UI ```gherkin Feature: Cmd+K Search UI Integration Scenario: Search modal shows categorized results Given the user types "pay" in the Cmd+K search modal When results are returned Then they are grouped by category: | Category | Examples | | Services | payment-api, payment-processor | | Teams | payments-team | | Actions | Run discovery scan | And keyboard navigation works between categories Scenario: Search modal shows recent searches on open Given the user has previously searched for "auth", "billing", "gateway" When the user opens Cmd+K 
without typing Then recent searches are shown as suggestions And clicking a suggestion populates the search input Scenario: Search result keyboard navigation Given the search modal is open with 5 results When the user presses the down arrow key Then the first result is highlighted And pressing down again highlights the second result And pressing Enter navigates to the highlighted result Scenario: Search modal is accessible Given the search modal is open Then it has role="dialog" and aria-label="Search services" And the input has aria-label="Search" And results have role="listbox" with role="option" items And screen readers announce result count changes ``` --- ## Epic 6: Analytics Dashboards ### Story 6.1 — Ownership Coverage Dashboard ```gherkin Feature: Ownership Coverage Analytics Scenario: Display ownership coverage percentage Given the tenant has 100 services in the catalog And 75 services have a confirmed owner And 25 services have owner: null or "unknown" When the analytics dashboard loads Then the ownership coverage metric shows 75% And a donut chart shows the breakdown: 75 owned, 25 unowned And a trend line shows coverage change over the last 30 days Scenario: Ownership coverage by team Given multiple teams own services When the ownership breakdown table is rendered Then it shows each team with their service count and percentage And teams are sorted by service count descending And clicking a team filters the catalog to that team's services Scenario: Ownership coverage drops below threshold Given the ownership coverage threshold is set to 80% And coverage drops from 82% to 76% after a scan When the dashboard refreshes Then a warning alert is shown: "Ownership coverage below 80% threshold" And the affected unowned services are listed And an email notification is sent to the tenant admin Scenario: Zero unowned services Given all 50 services have confirmed owners When the ownership dashboard loads Then coverage shows 100% And a success state is shown: "All 
services have owners" And no warning alerts are displayed ``` ### Story 6.2 — Service Health Scorecards ```gherkin Feature: Service Health Scorecards Scenario: Health scorecard is calculated for a service Given service "payment-api" has the following signals: | Signal | Value | Score | | Has owner | Yes | 20 | | Has on-call policy | Yes | 20 | | Has documentation | Yes | 15 | | Recent deployment | 2 days | 15 | | No open P1 alerts | True | 30 | When the health scorecard is computed Then the overall score is 100/100 And the health_status is "healthy" Scenario: Service with missing ownership scores lower Given service "orphan-lambda" has: | Signal | Value | Score | | Has owner | No | 0 | | Has on-call policy | No | 0 | | Has documentation | No | 0 | | Recent deployment | 90 days | 5 | | No open P1 alerts | True | 30 | When the health scorecard is computed Then the overall score is 35/100 And the health_status is "at-risk" And the scorecard highlights the missing signals as improvement actions Scenario: Health scorecard trend over time Given "checkout-api" has had weekly scorecard snapshots for 8 weeks When the service detail drawer shows the health trend Then a sparkline chart shows the score history And the trend direction (improving/declining) is indicated Scenario: Team-level health KPI aggregation Given "payments-team" owns 10 services with scores: [90, 85, 70, 95, 60, 80, 75, 88, 92, 65] When the team KPI dashboard is rendered Then the team average score is 80/100 And the lowest-scoring service is highlighted for attention And the team score is compared to the org average Scenario: Health scorecard for stale service Given service "legacy-api" has status: "stale" When the scorecard is computed Then the "Recent deployment" signal scores 0 And a penalty is applied for being stale And the overall score reflects the staleness ``` ### Story 6.3 — Tech Debt Tracking ```gherkin Feature: Tech Debt Tracking Dashboard Scenario: Identify services using deprecated runtimes 
Given the portal has a policy: "Node.js < 18 is deprecated" And 5 services use Node.js 16 When the tech debt dashboard loads Then it shows 5 services flagged for "deprecated runtime" And each flagged service links to its catalog entry And the total tech debt score is calculated Scenario: Track services with no recent deployments Given the policy threshold is "no deployment in 90 days = tech debt" And 8 services have not been deployed in over 90 days When the tech debt dashboard loads Then these 8 services appear in the "stale deployments" category And the owning teams are notified via the dashboard Scenario: Tech debt trend over time Given tech debt metrics have been tracked for 12 weeks When the trend chart is rendered Then it shows weekly tech debt item counts And highlights weeks where debt increased significantly And shows the net change (items resolved vs. items added) ``` ---
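The Redis prefix-caching behavior specified in Story 4.2 (tenant-scoped keys, 60-second TTL fills, and graceful degradation when Redis is down) can be sketched as a cache-aside helper. This is a minimal illustration, not the portal's implementation: the function names and the injected `redis_get`/`redis_setex`/`meili_search` callables are assumptions, and a real redis-py client raises `redis.exceptions.ConnectionError` rather than the builtin `ConnectionError` caught here.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("portal.search")

CACHE_TTL_SECONDS = 60  # per Story 4.2: each fill gets a fresh 60-second TTL


def cache_key(tenant: str, query: str) -> str:
    """Tenant-scoped key: "acme" and "globex" never share cache entries."""
    return f"search:{tenant}:{query}"


def cached_search(
    tenant: str,
    query: str,
    redis_get: Callable[[str], Optional[str]],
    redis_setex: Callable[[str, int, str], None],
    meili_search: Callable[[str, str], str],
) -> str:
    """Cache-aside lookup: hit -> return cached; miss -> Meilisearch, then fill.

    If Redis raises a connection error, bypass the cache entirely: query
    Meilisearch, log a warning, and skip the write while Redis is down.
    """
    key = cache_key(tenant, query)
    try:
        cached = redis_get(key)
    except ConnectionError:
        logger.warning("Redis unavailable, cache bypassed")
        return meili_search(tenant, query)  # degraded, not broken

    if cached is not None:
        return cached  # cache hit: Meilisearch is not queried

    result = meili_search(tenant, query)
    redis_setex(key, CACHE_TTL_SECONDS, result)  # miss: fill with fresh TTL
    return result
```

With a real client, `redis_setex` would map onto the `SETEX` command, which sets the value and TTL atomically, so an expired entry simply disappears and the next lookup falls through to Meilisearch as the TTL scenario describes.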
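Story 4.3's index configuration (per-tenant index name, searchable and filterable attributes, and the ranking order the scenarios assert) can be captured as a settings payload. A sketch only: the helper name is hypothetical, and note that the spec's ranking order puts `typo` before `words`, whereas stock Meilisearch defaults to `words` first, so this ordering would be a deliberate override applied via the index settings API (e.g. the official Python client's `index.update_settings(settings)`).

```python
def index_settings_for_tenant(tenant: str) -> tuple[str, dict]:
    """Return the per-tenant index name and its Meilisearch settings payload.

    Attribute lists and ranking order are taken from Story 4.3; the
    function itself is an illustrative helper, not a confirmed API.
    """
    settings = {
        # Fields matched by full-text search, in priority order
        "searchableAttributes": ["name", "description", "owner", "tags"],
        # Fields usable in filter expressions and facets
        "filterableAttributes": ["source", "health_status", "owner", "region"],
        # Spec order: typo tolerance outranks word count (non-default)
        "rankingRules": ["typo", "words", "proximity", "attribute", "sort", "exactness"],
    }
    return f"services_{tenant}", settings
```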
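The Story 6.2 scorecard arithmetic can be made concrete with the signal weights from the two tables (owner 20, on-call 20, docs 15, deployment 15, no open P1s 30). The deployment-age tiers, the size of the staleness penalty, and the "healthy"/"degraded"/"at-risk" cutoffs below are illustrative assumptions chosen to reproduce the spec's examples (2 days → 15 points, 90 days → 5 points, totals of 100 and 35), not confirmed product rules.

```python
from dataclasses import dataclass

# Weights per the Story 6.2 scorecard tables
WEIGHTS = {
    "has_owner": 20,
    "has_oncall_policy": 20,
    "has_documentation": 15,
    "recent_deployment": 15,
    "no_open_p1_alerts": 30,
}


@dataclass
class ServiceSignals:
    has_owner: bool
    has_oncall_policy: bool
    has_documentation: bool
    days_since_deployment: int
    open_p1_alerts: int
    stale: bool = False


def deployment_score(days: int) -> int:
    """Full credit for recent deploys, decaying with age (tiers assumed)."""
    if days <= 30:
        return WEIGHTS["recent_deployment"]  # e.g. 2 days -> 15
    if days <= 90:
        return 5  # matches the "90 days -> 5" row for orphan-lambda
    return 0


def compute_scorecard(s: ServiceSignals) -> tuple[int, str]:
    """Sum weighted signals, apply the staleness rules, classify the score."""
    score = 0
    score += WEIGHTS["has_owner"] if s.has_owner else 0
    score += WEIGHTS["has_oncall_policy"] if s.has_oncall_policy else 0
    score += WEIGHTS["has_documentation"] if s.has_documentation else 0
    # Stale services score 0 on "Recent deployment" per Story 6.2
    score += 0 if s.stale else deployment_score(s.days_since_deployment)
    score += WEIGHTS["no_open_p1_alerts"] if s.open_p1_alerts == 0 else 0
    if s.stale:
        score = max(0, score - 10)  # additional staleness penalty (amount assumed)
    # Cutoffs assumed: >=80 healthy, <=50 at-risk, else degraded
    status = "healthy" if score >= 80 else ("at-risk" if score <= 50 else "degraded")
    return score, status
```

Under these assumptions, the spec's `payment-api` example yields (100, "healthy") and `orphan-lambda` yields (35, "at-risk"); a team KPI is then just the mean of its services' scores, e.g. the ten scores listed in Story 6.2 average to 80.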
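Story 6.3's "Node.js < 18 is deprecated" policy reduces to a version check over each service's runtime string. A minimal sketch: the Lambda-style `"nodejs16.x"` runtime format, the helper names, and the sorted return order are assumptions for illustration.

```python
import re
from typing import Optional

# Policy from Story 6.3: Node.js majors below this are deprecated
DEPRECATED_NODE_BELOW = 18


def node_major(runtime: str) -> Optional[int]:
    """Extract the major version from a runtime string like "nodejs16.x".

    Returns None for non-Node runtimes (e.g. "python3.12"), which the
    policy does not cover.
    """
    m = re.match(r"nodejs(\d+)", runtime)
    return int(m.group(1)) if m else None


def flag_deprecated_runtimes(services: dict[str, str]) -> list[str]:
    """Return the names of services whose Node.js runtime violates the policy."""
    flagged = []
    for name, runtime in services.items():
        major = node_major(runtime)
        if major is not None and major < DEPRECATED_NODE_BELOW:
            flagged.append(name)
    return sorted(flagged)
```

The "stale deployments" category from the same story follows the same shape: filter services by `days_since_deployment > 90` and group the results by owning team.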