# dd0c/portal — BDD Acceptance Test Specifications

> Gherkin scenarios for all 10 epics. Edge cases included.

---

## Epic 1: AWS Discovery Scanner

### Story 1.1 — IAM Role Assumption into Customer Account

```gherkin
Feature: IAM Role Assumption for Cross-Account Discovery

  Background:
    Given the portal has a tenant with AWS integration configured
    And the tenant has provided an IAM role ARN "arn:aws:iam::123456789012:role/PortalDiscoveryRole"

  Scenario: Successfully assume IAM role and begin scan
    Given the IAM role has the required trust policy allowing the portal's scanner principal
    When the AWS Discovery Scanner initiates a scan for the tenant
    Then the scanner assumes the role via STS AssumeRole
    And receives temporary credentials valid for 1 hour
    And begins scanning the configured AWS regions

  Scenario: IAM role assumption fails due to missing trust policy
    Given the IAM role does NOT have a trust policy for the portal's scanner principal
    When the AWS Discovery Scanner attempts to assume the role
    Then STS returns an "AccessDenied" error
    And the scanner marks the scan job as FAILED with reason "IAM role assumption denied"
    And the tenant receives a notification: "AWS integration error: cannot assume role"
    And no partial scan results are persisted

  Scenario: IAM role ARN is malformed
    Given the tenant has configured an invalid role ARN "not-a-valid-arn"
    When the AWS Discovery Scanner attempts to assume the role
    Then the scanner marks the scan job as FAILED with reason "Invalid role ARN format"
    And logs a structured error with tenant_id and the invalid ARN

  Scenario: Temporary credentials expire mid-scan
    Given the scanner has assumed the IAM role and started scanning
    And the scan has been running for 61 minutes
    When the scanner attempts to call ecs:DescribeClusters
    Then AWS returns an "ExpiredTokenException"
    And the scanner refreshes credentials via a new AssumeRole call
    And resumes scanning from the last checkpoint
    And the scan job status remains IN_PROGRESS

  Scenario: Role assumption succeeds but has insufficient permissions
    Given the IAM role is assumed successfully
    But the role lacks the "ecs:ListClusters" permission
    When the scanner attempts to list ECS clusters
    Then AWS returns an "AccessDenied" error for that specific call
    And the scanner records a partial failure for the ECS resource type
    And continues scanning other resource types (Lambda, CloudFormation, API Gateway)
    And the final scan result includes a warnings list with "ECS: insufficient permissions"
```

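The malformed-ARN scenario implies the scanner validates the role ARN before calling STS. A minimal sketch of such a check (the function name and the exact pattern are assumptions — the spec only requires that "not-a-valid-arn" is rejected):

```python
import re

# IAM role ARNs look like arn:aws:iam::<12-digit-account>:role/<name>.
# The optional partition suffix covers aws-cn / aws-us-gov; this is a
# simplification, not the full ARN grammar.
ROLE_ARN_RE = re.compile(r"^arn:aws(-cn|-us-gov)?:iam::\d{12}:role/[\w+=,.@/-]+$")

def validate_role_arn(arn: str) -> bool:
    """Return True if the ARN is a plausible IAM role ARN."""
    return bool(ROLE_ARN_RE.match(arn))

print(validate_role_arn("arn:aws:iam::123456789012:role/PortalDiscoveryRole"))  # True
print(validate_role_arn("not-a-valid-arn"))  # False
```

Failing fast here lets the scan job be marked FAILED with "Invalid role ARN format" without spending an STS call.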
### Story 1.2 — ECS Service Discovery

```gherkin
Feature: ECS Service Discovery

  Background:
    Given a tenant's AWS account has been successfully authenticated
    And the scanner is configured to scan region "us-east-1"

  Scenario: Discover ECS services across multiple clusters
    Given the AWS account has 3 ECS clusters: "prod-cluster", "staging-cluster", "dev-cluster"
    And "prod-cluster" has 5 services, "staging-cluster" has 3, "dev-cluster" has 2
    When the AWS Discovery Scanner runs the ECS scan step
    Then it lists all 3 clusters via ecs:ListClusters
    And describes all services in each cluster via ecs:DescribeServices
    And emits 10 discovered service events to the catalog ingestion queue
    And each event includes: cluster_name, service_name, task_definition, desired_count, region

  Scenario: ECS cluster has no services
    Given the AWS account has 1 ECS cluster "empty-cluster" with 0 services
    When the scanner runs the ECS scan step
    Then it lists the cluster successfully
    And records 0 services discovered for that cluster
    And does not emit any service events for that cluster
    And the scan step completes without error

  Scenario: ECS DescribeServices returns throttling error
    Given the AWS account has an ECS cluster with 100 services
    When the scanner calls ecs:DescribeServices in batches of 10
    And AWS returns a ThrottlingException on the 5th batch
    Then the scanner applies exponential backoff (2s, 4s, 8s)
    And retries the failed batch up to 3 times
    And if a retry succeeds, continues with the remaining batches
    And the final result includes all 100 services

  Scenario: ECS DescribeServices fails after all retries
    Given AWS returns ThrottlingException on every retry attempt
    When the scanner exhausts all 3 retries for a batch
    Then it marks that batch as partially failed
    And records which service ARNs could not be described
    And continues with the next batch
    And the scan summary includes "ECS partial failure: 10 services could not be described"

  Scenario: Multi-region ECS scan
    Given the tenant has configured scanning for regions: "us-east-1", "eu-west-1", "ap-southeast-1"
    When the AWS Discovery Scanner runs
    Then it scans ECS in all 3 regions in parallel via a Step Functions Map state
    And aggregates results from all regions
    And each discovered service record includes its region
    And duplicate service names across regions are stored as separate catalog entries
```

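The throttling scenarios pin down the retry policy exactly: three retries with 2s, 4s, 8s delays, then record a partial failure. A self-contained sketch of that policy (the exception class and function names are placeholders, not a real botocore API):

```python
import time

class ThrottlingError(Exception):
    """Stand-in for the AWS SDK's ThrottlingException."""

def call_with_backoff(call, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Retry `call` on throttling, sleeping 2s, 4s, 8s between attempts.
    After max_retries the error propagates so the caller can record a
    partial failure for the batch and move on."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ThrottlingError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo: a batch call that is throttled twice, then succeeds.
attempts = {"n": 0}
def flaky_describe():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ThrottlingError()
    return ["svc-1", "svc-2"]

delays = []  # capture the sleeps instead of actually waiting
result = call_with_backoff(flaky_describe, sleep=delays.append)
print(result, delays)  # ['svc-1', 'svc-2'] [2.0, 4.0]
```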
### Story 1.3 — Lambda Function Discovery

```gherkin
Feature: Lambda Function Discovery

  Scenario: Discover all Lambda functions in a region
    Given the AWS account has 25 Lambda functions in "us-east-1"
    When the scanner runs the Lambda scan step
    Then it calls lambda:ListFunctions with pagination
    And retrieves all 25 functions across multiple pages
    And emits 25 service events with: function_name, runtime, memory_size, timeout, last_modified

  Scenario: Lambda function has tags indicating service ownership
    Given a Lambda function "payment-processor" has tags:
      | Key         | Value         |
      | team        | payments-team |
      | service     | payment-svc   |
      | environment | production    |
    When the scanner processes this Lambda function
    Then the catalog entry includes ownership: "payments-team" (source: implicit/tags)
    And the service name is inferred as "payment-svc" from the "service" tag

  Scenario: Lambda function has no tags
    Given a Lambda function "legacy-cron-job" has no tags
    When the scanner processes this Lambda function
    Then the catalog entry has ownership: null (source: heuristic/pending)
    And the entry is flagged for ownership inference via commit history

  Scenario: Lambda ListFunctions pagination
    Given the AWS account has 150 Lambda functions (exceeding the 50-per-page limit)
    When the scanner calls lambda:ListFunctions
    Then it follows NextMarker pagination tokens
    And retrieves all 150 functions across 3 pages
    And does not duplicate any function in the results
```

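The pagination scenario (150 functions, 3 pages, no duplicates) can be sketched as a generic NextMarker drain loop, shown here against a stubbed page fetcher since no real AWS client is assumed:

```python
def list_all_functions(fetch_page):
    """Drain NextMarker-style pagination.
    fetch_page(marker) -> (items, next_marker); next_marker None ends the loop."""
    functions, marker = [], None
    while True:
        items, marker = fetch_page(marker)
        functions.extend(items)
        if marker is None:
            return functions

# Stub: three 50-item pages, mimicking lambda:ListFunctions responses.
PAGE_START = {None: 0, "p1": 50, "p2": 100}
NEXT = {0: "p1", 50: "p2", 100: None}

def fake_page(marker):
    start = PAGE_START[marker]
    return [f"fn-{i}" for i in range(start, start + 50)], NEXT[start]

all_fns = list_all_functions(fake_page)
print(len(all_fns), len(set(all_fns)))  # 150 150 — complete and duplicate-free
```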
### Story 1.4 — CloudFormation Stack Discovery

```gherkin
Feature: CloudFormation Stack Discovery

  Scenario: Discover active CloudFormation stacks
    Given the AWS account has 8 CloudFormation stacks in "us-east-1"
    And 6 stacks have status "CREATE_COMPLETE" or "UPDATE_COMPLETE"
    And 2 stacks have status "ROLLBACK_COMPLETE"
    When the scanner runs the CloudFormation scan step
    Then it discovers all 8 stacks
    And includes stacks of all statuses in the raw results
    And marks ROLLBACK_COMPLETE stacks with health_status: "degraded"

  Scenario: CloudFormation stack outputs contain service metadata
    Given a stack "api-gateway-stack" has outputs:
      | OutputKey   | OutputValue                   |
      | ServiceName | user-api                      |
      | TeamOwner   | platform-team                 |
      | ApiEndpoint | https://api.example.com/users |
    When the scanner processes this stack
    Then the catalog entry uses "user-api" as the service name
    And sets ownership to "platform-team" (source: implicit/stack-output)
    And records the API endpoint in service metadata

  Scenario: Nested stacks are discovered
    Given a parent CloudFormation stack has 3 nested stacks
    When the scanner processes the parent stack
    Then it also discovers and catalogs each nested stack
    And links nested stacks to their parent in the catalog

  Scenario: Stack in DELETE_IN_PROGRESS state
    Given a CloudFormation stack has status "DELETE_IN_PROGRESS"
    When the scanner discovers this stack
    Then it records the stack with health_status: "terminating"
    And does not block the scan on this stack's state
```

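The scenarios fix two points of a stack-status-to-health mapping (ROLLBACK_COMPLETE → "degraded", DELETE_IN_PROGRESS → "terminating"). A sketch filling in the rest — the "healthy" and "unknown" branches are assumptions, not spec'd:

```python
def stack_health(status: str) -> str:
    """Map a CloudFormation stack status to a catalog health label.
    Only "degraded" and "terminating" are required by the scenarios;
    the remaining branches are illustrative defaults."""
    if status == "DELETE_IN_PROGRESS":
        return "terminating"
    if "ROLLBACK" in status:
        return "degraded"
    if status.endswith("_COMPLETE"):
        return "healthy"
    return "unknown"

print(stack_health("ROLLBACK_COMPLETE"))   # degraded
print(stack_health("DELETE_IN_PROGRESS"))  # terminating
print(stack_health("UPDATE_COMPLETE"))     # healthy
```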
### Story 1.5 — API Gateway Discovery

```gherkin
Feature: API Gateway Discovery

  Scenario: Discover REST APIs in API Gateway v1
    Given the AWS account has 4 REST APIs in API Gateway
    When the scanner runs the API Gateway scan step
    Then it calls apigateway:GetRestApis
    And retrieves all 4 APIs with their names, IDs, and endpoints
    And emits catalog events for each API

  Scenario: Discover HTTP APIs in API Gateway v2
    Given the AWS account has 3 HTTP APIs in API Gateway v2
    When the scanner runs the API Gateway v2 scan step
    Then it calls apigatewayv2:GetApis
    And retrieves all 3 HTTP APIs
    And correctly distinguishes them from REST APIs in the catalog

  Scenario: API Gateway API has no associated service tag
    Given an API Gateway REST API "legacy-api" has no resource tags
    When the scanner processes this API
    Then it creates a catalog entry with name "legacy-api"
    And sets ownership to null pending heuristic inference
    And records the API type as "REST" and the invoke URL

  Scenario: API Gateway scan across multiple regions
    Given API Gateway APIs exist in "us-east-1" and "us-west-2"
    When the scanner runs in parallel across both regions
    Then APIs from both regions are discovered
    And each catalog entry includes the region field
    And there are no cross-region duplicates
```

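The "no cross-region duplicates" step implies catalog entries are keyed by something region-qualified rather than by display name. One plausible keying (the entry shape and field names are assumptions):

```python
def catalog_key(entry):
    """Uniqueness key for an API Gateway catalog entry: region plus API id,
    so same-named APIs in different regions remain distinct."""
    return (entry["region"], entry["api_id"])

apis = [
    {"name": "orders-api", "api_id": "abc123", "region": "us-east-1"},
    {"name": "orders-api", "api_id": "def456", "region": "us-west-2"},
]
unique = {catalog_key(e): e for e in apis}
print(len(unique))  # 2 — identical names, but distinct entries
```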
### Story 1.6 — Step Functions Orchestration

```gherkin
Feature: Step Functions Orchestration of Discovery Scan

  Scenario: Full scan executes all resource type steps in parallel
    Given a scan job is triggered for a tenant
    When the Step Functions state machine starts
    Then it executes the ECS, Lambda, CloudFormation, and API Gateway scan steps in parallel
    And waits for all parallel branches to complete
    And aggregates results from all branches
    And transitions to the catalog ingestion step

  Scenario: One parallel branch fails, others succeed
    Given the Step Functions state machine is running a scan
    And the Lambda scan step throws an unhandled exception
    When the state machine catches the error
    Then the ECS, CloudFormation, and API Gateway branches continue to completion
    And the Lambda branch is marked as FAILED in the scan summary
    And the overall scan job status is PARTIAL_SUCCESS
    And discovered resources from successful branches are ingested into the catalog

  Scenario: Scan job is idempotent on retry
    Given a scan job failed midway through execution
    When the Step Functions state machine retries the execution
    Then it does not duplicate already-ingested catalog entries
    And uses upsert semantics based on (tenant_id, resource_arn) as the unique key

  Scenario: Concurrent scan jobs for the same tenant are prevented
    Given a scan job is already IN_PROGRESS for tenant "acme-corp"
    When a second scan job is triggered for "acme-corp"
    Then the system detects the existing in-progress execution
    And rejects the new job with status 409 Conflict
    And returns the ID of the existing running scan job

  Scenario: Scan job timeout
    Given a scan job has been running for more than 30 minutes
    When the Step Functions state machine timeout is reached
    Then the execution is marked as TIMED_OUT
    And partial results already ingested are retained in the catalog
    And the tenant is notified of the timeout
    And the scan job record shows last_successful_step

  Scenario: WebSocket progress events during scan
    Given a user is watching the discovery progress in the UI
    When each Step Functions step completes
    Then a WebSocket event is pushed to the user's connection
    And the event includes: step_name, status, resources_found, timestamp
    And the UI progress bar updates in real time
```

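The idempotent-retry scenario specifies upsert semantics keyed on (tenant_id, resource_arn). A runnable demonstration of that contract using SQLite's `ON CONFLICT ... DO UPDATE` — the production store is Aurora PostgreSQL per Story 3.1, but the upsert shape is the same, and the table columns here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE services (
        tenant_id    TEXT NOT NULL,
        resource_arn TEXT NOT NULL,
        name         TEXT,
        PRIMARY KEY (tenant_id, resource_arn)
    )
""")

UPSERT = """
    INSERT INTO services (tenant_id, resource_arn, name)
    VALUES (?, ?, ?)
    ON CONFLICT (tenant_id, resource_arn) DO UPDATE SET name = excluded.name
"""

arn = "arn:aws:lambda:us-east-1:123456789012:function:pay"
# First ingestion, then a retried scan re-emitting the same resource.
conn.execute(UPSERT, ("acme", arn, "pay-v1"))
conn.execute(UPSERT, ("acme", arn, "pay-v2"))

rows = conn.execute("SELECT name FROM services").fetchall()
print(rows)  # [('pay-v2',)] — one row, updated in place, never duplicated
```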
### Story 1.7 — Multi-Region Scanning

```gherkin
Feature: Multi-Region AWS Scanning

  Scenario: Tenant configures specific regions to scan
    Given a tenant has configured scan regions: ["us-east-1", "eu-central-1"]
    When the discovery scanner runs
    Then it only scans the configured regions
    And does not scan any other AWS regions
    And the scan summary shows per-region resource counts

  Scenario: Tenant configures "all regions" scan
    Given a tenant has configured scan_all_regions: true
    When the discovery scanner runs
    Then it first calls ec2:DescribeRegions to get the current list of enabled regions
    And scans all enabled regions
    And the Step Functions Map state iterates over the dynamic region list

  Scenario: A region is disabled in the AWS account
    Given the tenant's AWS account has "ap-east-1" disabled
    And the tenant has configured scan_all_regions: true
    When the scanner attempts to scan "ap-east-1"
    Then it receives an "OptInRequired" or region-disabled error
    And skips that region gracefully
    And records "ap-east-1: skipped (region disabled)" in the scan summary

  Scenario: Region scan results are isolated per tenant
    Given tenant "acme" and tenant "globex" both have resources in "us-east-1"
    When both tenants run discovery scans simultaneously
    Then acme's catalog only contains acme's resources
    And globex's catalog only contains globex's resources
    And no cross-tenant data leakage occurs
```

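The region scenarios distinguish an explicit region list from a scan-all mode backed by ec2:DescribeRegions. A small planning helper sketching that split — note the spec handles disabled regions at call time via OptInRequired, so pre-filtering against an enabled set, as done here, is an assumption:

```python
def plan_regions(configured, enabled, scan_all):
    """Return (regions to scan, regions to record as skipped).
    `enabled` stands in for the ec2:DescribeRegions result."""
    if scan_all:
        return sorted(enabled), []
    to_scan = [r for r in configured if r in enabled]
    skipped = [r for r in configured if r not in enabled]
    return to_scan, skipped

to_scan, skipped = plan_regions(
    configured=["us-east-1", "ap-east-1"],
    enabled={"us-east-1", "eu-west-1"},
    scan_all=False,
)
print(to_scan, skipped)  # ['us-east-1'] ['ap-east-1']
```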
---

## Epic 2: GitHub Discovery Scanner

### Story 2.1 — GitHub App Installation and Authentication

```gherkin
Feature: GitHub App Authentication

  Background:
    Given the portal has a GitHub App registered with app_id "12345"
    And the app has a private key stored in AWS Secrets Manager

  Scenario: Successfully authenticate as GitHub App installation
    Given a tenant has installed the GitHub App on their organization "acme-org"
    And the installation_id is "67890"
    When the GitHub Discovery Scanner initializes for this tenant
    Then it generates a JWT signed with the app's private key
    And exchanges the JWT for an installation access token via POST /app/installations/67890/access_tokens
    And the token has scopes: contents:read, metadata:read
    And the scanner uses this token for all subsequent API calls

  Scenario: GitHub App JWT generation fails due to expired private key
    Given the private key in Secrets Manager is expired or invalid
    When the scanner attempts to generate a JWT
    Then it fails with a key validation error
    And marks the scan job as FAILED with reason "GitHub App authentication failed: invalid private key"
    And triggers an alert to the tenant admin

  Scenario: Installation access token request returns 404
    Given the GitHub App installation has been uninstalled by the tenant
    When the scanner requests an installation access token
    Then GitHub returns 404 Not Found
    And the scanner marks the GitHub integration as DISCONNECTED
    And notifies the tenant: "GitHub App has been uninstalled. Please reinstall to continue discovery."

  Scenario: Installation access token is proactively refreshed mid-scan
    Given the scanner has a valid installation token (valid for 1 hour)
    And the scan has been running for 55 minutes
    When the scanner detects the token is within 5 minutes of expiry
    Then it proactively refreshes the token via a new access-token request
    And the scan continues without interruption
    And no GraphQL query fails with 401 Unauthorized
```

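The refresh scenario hinges on a "within 5 minutes of expiry" check against the 1-hour token lifetime. A minimal sketch of that check (function and constant names are assumptions):

```python
from datetime import datetime, timedelta, timezone

REFRESH_MARGIN = timedelta(minutes=5)  # refresh 5 min before the 1-hour expiry

def needs_refresh(expires_at, now=None):
    """True once the current time is within the refresh margin of expiry."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at - REFRESH_MARGIN

issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)
print(needs_refresh(expires, now=issued + timedelta(minutes=54)))  # False
print(needs_refresh(expires, now=issued + timedelta(minutes=55)))  # True
```

Checking before each batch of API calls keeps the scan from ever presenting an expired token.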
### Story 2.2 — Repository Discovery via GraphQL

```gherkin
Feature: GitHub Repository Discovery via GraphQL API

  Scenario: Discover all repositories in an organization
    Given the tenant's GitHub organization "acme-org" has 150 repositories
    When the GitHub Discovery Scanner runs the repository listing query
    Then it uses the GraphQL API with cursor-based pagination
    And retrieves all 150 repositories in batches of 100
    And each repository record includes: name, defaultBranch, isArchived, pushedAt, topics

  Scenario: Archived repositories are excluded by default
    Given "acme-org" has 150 active repos and 30 archived repos
    When the scanner runs with default settings
    Then it discovers 150 active repositories
    And excludes the 30 archived repositories
    And the scan summary notes "30 archived repositories skipped"

  Scenario: Tenant opts in to include archived repositories
    Given the tenant has configured include_archived: true
    When the scanner runs
    Then it discovers all 180 repositories including archived ones
    And archived repos are marked with status: "archived" in the catalog

  Scenario: Organization has no repositories
    Given the tenant's GitHub organization has 0 repositories
    When the scanner runs the repository listing query
    Then it returns an empty result set
    And the scan completes successfully with 0 GitHub services discovered
    And no errors are raised

  Scenario: GraphQL query returns partial data with errors
    Given the GraphQL query returns 80 repositories successfully
    But includes a "FORBIDDEN" error for 20 private repositories
    When the scanner processes the response
    Then it ingests the 80 accessible repositories
    And records a warning: "20 repositories inaccessible due to permissions"
    And the scan status is PARTIAL_SUCCESS
```

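GraphQL can return both `data` and `errors` in one payload, which is exactly the partial-data scenario above. A sketch of splitting such a response (the payload shape here is simplified and illustrative, not GitHub's exact schema):

```python
def split_graphql_response(resp):
    """Separate accessible repositories from FORBIDDEN errors in a
    GraphQL payload that carries both `data` and `errors`."""
    repos = [n for n in resp.get("data", {}).get("repositories", []) if n]
    errors = [e for e in resp.get("errors", []) if e.get("type") == "FORBIDDEN"]
    return repos, errors

resp = {
    "data": {"repositories": [{"name": "repo-a"}, {"name": "repo-b"}]},
    "errors": [{"type": "FORBIDDEN", "path": ["repositories", 2]}],
}
repos, errors = split_graphql_response(resp)
status = "PARTIAL_SUCCESS" if errors else "SUCCESS"
print(len(repos), status)  # 2 PARTIAL_SUCCESS
```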
### Story 2.3 — Service Metadata Extraction from Repositories

```gherkin
Feature: Service Metadata Extraction from Repository Files

  Scenario: Extract service name from package.json
    Given a repository "user-service" has a package.json at the root:
      """
      {
        "name": "@acme/user-service",
        "version": "2.1.0",
        "description": "Handles user authentication and profiles"
      }
      """
    When the scanner fetches and parses package.json
    Then the catalog entry uses service name "@acme/user-service"
    And records version "2.1.0"
    And records description "Handles user authentication and profiles"
    And sets the service type as "nodejs"

  Scenario: Extract service metadata from Dockerfile
    Given a repository has a Dockerfile with:
      """
      FROM python:3.11-slim
      LABEL service.name="data-pipeline"
      LABEL service.team="data-engineering"
      EXPOSE 8080
      """
    When the scanner parses the Dockerfile
    Then the catalog entry uses service name "data-pipeline"
    And sets ownership to "data-engineering" (source: implicit/dockerfile-label)
    And records the service type as "python"
    And records exposed port 8080

  Scenario: Extract service metadata from Terraform files
    Given a repository has a main.tf with:
      """
      module "service" {
        source       = "..."
        service_name = "billing-api"
        team         = "billing-team"
      }
      """
    When the scanner parses Terraform files
    Then the catalog entry uses service name "billing-api"
    And sets ownership to "billing-team" (source: implicit/terraform)

  Scenario: Repository has no recognizable service metadata files
    Given a repository "random-scripts" has no package.json, Dockerfile, or Terraform files
    When the scanner processes this repository
    Then it creates a catalog entry using the repository name "random-scripts"
    And sets the service type as "unknown"
    And flags the entry for manual review
    And ownership is set to null pending heuristic inference

  Scenario: Multiple metadata files exist — precedence order
    Given a repository has both a package.json (name: "svc-a") and a Dockerfile (LABEL service.name="svc-b")
    When the scanner processes the repository
    Then it uses package.json as the primary metadata source
    And records the service name as "svc-a"
    And notes the Dockerfile label as secondary metadata

  Scenario: package.json is malformed JSON
    Given a repository has a package.json with invalid JSON syntax
    When the scanner attempts to parse it
    Then it logs a warning: "package.json parse error in repo: user-service"
    And falls back to the Dockerfile or Terraform files for metadata
    And if no fallback exists, uses the repository name
    And does not crash the scan job
```

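The precedence and malformed-JSON scenarios together define a fallback chain: package.json, then Dockerfile label, then the repository name, tolerating parse errors along the way. A sketch of that chain (function name is an assumption):

```python
import json

def infer_service_name(repo_name, package_json=None, dockerfile_label=None):
    """Apply the precedence the scenarios describe: package.json first,
    then the Dockerfile service.name label, then the repository name.
    Malformed JSON falls through to the next source instead of crashing."""
    if package_json is not None:
        try:
            name = json.loads(package_json).get("name")
            if name:
                return name, "package.json"
        except json.JSONDecodeError:
            pass  # would be logged as a parse warning; fall through
    if dockerfile_label:
        return dockerfile_label, "dockerfile-label"
    return repo_name, "repo-name"

print(infer_service_name("r", '{"name": "svc-a"}', "svc-b"))  # ('svc-a', 'package.json')
print(infer_service_name("r", '{broken', "svc-b"))            # ('svc-b', 'dockerfile-label')
print(infer_service_name("random-scripts"))                   # ('random-scripts', 'repo-name')
```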
### Story 2.4 — GitHub Rate Limit Handling

```gherkin
Feature: GitHub API Rate Limit Handling

  Scenario: GraphQL rate limit warning threshold reached
    Given the scanner has consumed 4,500 of 5,000 GraphQL rate limit points
    When the scanner checks the rate limit headers after each response
    Then it detects the threshold (90% consumed)
    And pauses scanning until the rate limit resets
    And logs: "GitHub rate limit threshold reached, waiting for reset at {reset_time}"
    And resumes scanning after the reset window

  Scenario: REST API secondary rate limit hit
    Given the scanner is making rapid REST API calls
    When GitHub returns HTTP 429 with "secondary rate limit" in the response
    Then the scanner reads the Retry-After header
    And waits the specified number of seconds before retrying
    And does not count this as a scan failure

  Scenario: Rate limit exhausted with many repos remaining
    Given the scanner has 500 repositories to scan
    And the rate limit is exhausted after scanning 200 repositories
    When the scanner detects rate limit exhaustion
    Then it checkpoints the current progress (200 repos scanned)
    And schedules a continuation scan after the rate limit reset
    And the scan job status is set to RATE_LIMITED (not FAILED)
    And the tenant is notified of the delay

  Scenario: Rate limit headers are missing from response
    Given GitHub returns a response without X-RateLimit headers
    When the scanner processes the response
    Then it applies a conservative default delay of 1 second between requests
    And logs a warning about missing rate limit headers
    And continues scanning without crashing

  Scenario: Concurrent scans from multiple tenants share rate limit awareness
    Given two tenants share the same GitHub App installation
    When both tenants trigger scans simultaneously
    Then the scanner uses a shared rate limit counter in Redis
    And distributes available rate limit points between the two scans
    And neither scan exceeds the total available rate limit
```

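The first scenario fixes the pause condition precisely: 4,500 of 5,000 points consumed means the 90% threshold is hit. A one-function sketch of that check:

```python
def should_pause(used, limit, threshold=0.9):
    """Pause scanning once the consumed share of the rate limit budget
    reaches the threshold (90% per the scenario)."""
    return used / limit >= threshold

print(should_pause(4500, 5000))  # True  — exactly at 90%
print(should_pause(4499, 5000))  # False — just under
```

In practice `used` would be derived from the rate-limit fields GitHub returns with each response; the boundary being inclusive (>= rather than >) is what makes 4,500/5,000 trigger the pause.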
### Story 2.5 — Ownership Inference from Commit History

```gherkin
Feature: Heuristic Ownership Inference from Commit History

  Scenario: Infer owner from most recent committer
    Given a repository has no explicit ownership config and no team tags
    And the last 10 commits were made by "alice@acme.com" (7 commits) and "bob@acme.com" (3 commits)
    When the ownership inference engine runs
    Then it identifies "alice@acme.com" as the primary contributor
    And maps the email to team "frontend-team" via the org's team membership
    And sets ownership to "frontend-team" (source: heuristic/commit-history)
    And records confidence: 0.7

  Scenario: Ownership inference from CODEOWNERS file
    Given a repository has a CODEOWNERS file:
      """
      * @acme-org/platform-team
      /src/api/ @acme-org/api-team
      """
    When the scanner processes the CODEOWNERS file
    Then it sets ownership to "platform-team" (source: implicit/codeowners)
    And records that the api directory has additional ownership by "api-team"
    And this takes precedence over commit history heuristics

  Scenario: Ownership conflict — multiple teams have equal commit share
    Given a repository has commits split 50/50 between "team-a" and "team-b"
    When the ownership inference engine runs
    Then it records both teams as co-owners
    And sets ownership_confidence: 0.5
    And flags the service for manual ownership resolution
    And the catalog entry shows ownership_status: "conflict"

  Scenario: Explicit ownership config overrides all other sources
    Given a repository has a .portal.yaml file:
      """
      owner: payments-team
      tier: critical
      """
    And the repository also has CODEOWNERS pointing to "platform-team"
    And commit history suggests "devops-team"
    When the scanner processes the repository
    Then ownership is set to "payments-team" (source: explicit/portal-config)
    And CODEOWNERS and commit history are recorded as secondary metadata
    And the explicit config is not overridden by any other source

  Scenario: Commit history API call fails
    Given the GitHub API returns 500 when fetching commit history for a repo
    When the ownership inference engine attempts heuristic inference
    Then it marks ownership as null (source: unknown)
    And records the error in the scan summary
    And does not block catalog ingestion for this service
```

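The commit-history heuristic above (7-of-10 commits gives confidence 0.7; an exact 50/50 split is a conflict) can be sketched as a dominant-committer pick — a simplification that treats each commit equally and looks only at the top two contributors:

```python
from collections import Counter

def infer_owner(commit_authors):
    """Pick the dominant committer from recent commit authors.
    Returns (owner, confidence, status); an exact tie between the top
    two contributors is flagged as a conflict per the scenarios."""
    counts = Counter(commit_authors)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return None, n / len(commit_authors), "conflict"
    return top, n / len(commit_authors), "ok"

authors = ["alice@acme.com"] * 7 + ["bob@acme.com"] * 3
print(infer_owner(authors))  # ('alice@acme.com', 0.7, 'ok')
```

Mapping the winning email to a team (e.g. via GitHub org team membership) is a separate lookup left out here.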
---

## Epic 3: Service Catalog

### Story 3.1 — Catalog Ingestion and Upsert

```gherkin
Feature: Service Catalog Ingestion

  Background:
    Given the catalog uses PostgreSQL Aurora Serverless v2
    And the unique key for a service is (tenant_id, source, resource_id)

  Scenario: New service is ingested into the catalog
    Given a discovery scan emits a new service event for "payment-api"
    When the catalog ingestion handler processes the event
    Then a new row is inserted into the services table
    And the row includes: tenant_id, name, source, resource_id, owner, health_status, discovered_at
    And the Meilisearch index is updated with the new service document
    And the catalog entry is immediately searchable

  Scenario: Existing service is updated on re-scan (upsert)
    Given "payment-api" already exists in the catalog with owner "old-team"
    When a new scan emits an updated event for "payment-api" with owner "payments-team"
    Then the existing row is updated (not duplicated)
    And owner is changed to "payments-team"
    And updated_at is refreshed
    And the Meilisearch document is updated

  Scenario: Service removed from AWS is marked stale
    Given "deprecated-lambda" exists in the catalog from a previous scan
    When the latest scan completes and does not include "deprecated-lambda"
    Then the catalog entry is marked with status: "stale"
    And last_seen_at is not updated
    And the service remains visible in the catalog with a "stale" badge
    And is not immediately deleted

  Scenario: Stale service is purged after retention period
    Given "deprecated-lambda" has been stale for more than 30 days
    When the nightly cleanup job runs
    Then the service is soft-deleted from the catalog
    And removed from the Meilisearch index
    And a deletion event is logged for audit purposes

  Scenario: Catalog ingestion fails due to Aurora connection error
    Given Aurora Serverless is scaling up (cold start)
    When the ingestion handler attempts to write a service record
    Then it retries with exponential backoff up to 5 times
    And if all retries fail, the event is sent to a dead-letter queue
    And an alert is raised for the operations team
    And the scan job is marked PARTIAL_SUCCESS

  Scenario: Bulk ingestion of 500 services from a large account
    Given a scan discovers 500 services across all resource types
    When the catalog ingestion handler processes all events
    Then it uses batch inserts (100 records per batch)
    And completes ingestion within 30 seconds
    And all 500 services appear in the catalog
    And the Meilisearch index reflects all 500 services
```

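The stale-marking scenario is a reconciliation pass: anything present in the catalog but absent from the latest scan flips to "stale" rather than being deleted. A minimal in-memory sketch (the real pass would be a single SQL UPDATE keyed on the scan's seen set):

```python
def mark_stale(catalog, seen_ids):
    """Reconcile the catalog against the latest scan: entries missing
    from `seen_ids` become stale instead of being deleted outright."""
    for rid, entry in catalog.items():
        entry["status"] = "active" if rid in seen_ids else "stale"
    return catalog

catalog = {
    "payment-api":       {"status": "active"},
    "deprecated-lambda": {"status": "active"},
}
mark_stale(catalog, seen_ids={"payment-api"})
print(catalog["deprecated-lambda"]["status"])  # stale
```

A later retention job (30 days stale, per the spec) handles the actual soft delete.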
### Story 3.2 — PagerDuty / OpsGenie On-Call Mapping
|
||
|
|
|
||
|
|
```gherkin
|
||
|
|
Feature: On-Call Ownership Mapping via PagerDuty and OpsGenie
|
||
|
|
|
||
|
|
Scenario: Map service owner to PagerDuty escalation policy
|
||
|
|
Given a service "checkout-api" has owner "payments-team"
|
||
|
|
And PagerDuty integration is configured for the tenant
|
||
|
|
And "payments-team" maps to PagerDuty service "checkout-escalation-policy"
|
||
|
|
When the catalog builds the service detail record
|
||
|
|
Then the service includes on_call_policy: "checkout-escalation-policy"
|
||
|
|
And a link to the PagerDuty service page
|
||
|
|
|
||
|
|
Scenario: Fetch current on-call engineer from PagerDuty
|
||
|
|
Given "checkout-api" has a linked PagerDuty escalation policy
|
||
|
|
When a user views the service detail page
|
||
|
|
Then the portal calls PagerDuty GET /oncalls?escalation_policy_ids[]=...
|
||
|
|
And displays the current on-call engineer's name and contact
|
||
|
|
And caches the result for 5 minutes in Redis
|
||
|
|
|
||
|
|
Scenario: PagerDuty API returns 401 (invalid token)
|
||
|
|
Given the PagerDuty API token has been revoked
|
||
|
|
When the portal attempts to fetch on-call data
|
||
|
|
Then it returns the service detail without on-call info
|
||
|
|
And displays a warning: "On-call data unavailable — check PagerDuty integration"
|
||
|
|
And logs the auth failure for the tenant admin
|
||
|
|
|
||
|
|
Scenario: OpsGenie integration as alternative to PagerDuty
|
||
|
|
Given the tenant uses OpsGenie instead of PagerDuty
|
||
|
|
And OpsGenie integration is configured with an API key
|
||
|
|
When the portal fetches on-call data for a service
|
||
|
|
Then it calls the OpsGenie API instead of PagerDuty
|
||
|
|
And maps the OpsGenie schedule to the service owner
|
||
|
|
And displays the current on-call responder
|
||
|
|
|
||
|
|
Scenario: Service owner has no PagerDuty or OpsGenie mapping
|
||
|
|
Given "internal-tool" has owner "eng-team"
|
||
|
|
But "eng-team" has no mapping in PagerDuty or OpsGenie
|
||
|
|
When the portal builds the service detail
|
||
|
|
Then on_call_policy is null
|
||
|
|
And the UI shows "No on-call policy configured" for this service
|
||
|
|
And suggests linking a PagerDuty/OpsGenie service
|
||
|
|
|
||
|
|
Scenario: Multiple services share the same on-call policy
|
||
|
|
Given 10 services all map to the "platform-oncall" PagerDuty policy
|
||
|
|
When the portal fetches on-call data
|
||
|
|
Then it batches the PagerDuty API call for all 10 services
|
||
|
|
And does not make 10 separate API calls
|
||
|
|
And caches the shared result for all 10 services
|
||
|
|
```
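The batching scenario above (one PagerDuty call for many services) reduces to grouping services by escalation policy before hitting the API. A minimal sketch, assuming a hypothetical `fetch_oncalls` client function that wraps the real PagerDuty/OpsGenie call:

```python
from collections import defaultdict

def oncall_for_services(services, fetch_oncalls):
    """services: iterable of {"id": ..., "policy_id": ...}.
    Returns {service_id: responder or None} with one batched API call."""
    by_policy = defaultdict(list)
    for svc in services:
        by_policy[svc.get("policy_id")].append(svc["id"])
    unmapped = by_policy.pop(None, [])  # services with no on-call mapping
    # Single upstream call covering every distinct policy id.
    responders = fetch_oncalls(sorted(by_policy)) if by_policy else {}
    result = {svc_id: None for svc_id in unmapped}
    for policy_id, svc_ids in by_policy.items():
        for svc_id in svc_ids:
            result[svc_id] = responders.get(policy_id)
    return result
```

Services without a mapping come back as `None`, which is what drives the "No on-call policy configured" state.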

### Story 3.3 — Service Catalog CRUD API

```gherkin
Feature: Service Catalog CRUD API

  Background:
    Given the API requires a valid Cognito JWT
    And the JWT contains tenant_id claim

  Scenario: List services for a tenant
    Given tenant "acme" has 42 services in the catalog
    When GET /api/v1/services is called with a valid JWT for "acme"
    Then the response returns 42 services
    And each service includes: id, name, owner, source, health_status, updated_at
    And results are paginated (default page size: 20)
    And the response includes total_count: 42

  Scenario: List services with pagination
    Given tenant "acme" has 42 services
    When GET /api/v1/services?page=2&limit=20 is called
    Then the response returns services 21-40
    And includes pagination metadata: page, limit, total_count, total_pages

  Scenario: Get single service by ID
    Given service "svc-uuid-123" exists for tenant "acme"
    When GET /api/v1/services/svc-uuid-123 is called with acme's JWT
    Then the response returns the full service detail
    And includes: metadata, owner, on_call_policy, health_scorecard, tags, links

  Scenario: Get service belonging to different tenant returns 404
    Given service "svc-uuid-456" belongs to tenant "globex"
    When GET /api/v1/services/svc-uuid-456 is called with acme's JWT
    Then the response returns 404 Not Found
    And does not reveal that the service exists for another tenant

  Scenario: Manually update service owner via API
    Given service "legacy-api" has owner inferred as "unknown"
    When PATCH /api/v1/services/svc-uuid-789 is called with body:
      """
      { "owner": "platform-team", "owner_source": "manual" }
      """
    Then the service owner is updated to "platform-team"
    And owner_source is set to "manual"
    And the change is recorded in the audit log with the user's identity
    And the Meilisearch document is updated

  Scenario: Delete service from catalog
    Given service "decommissioned-api" exists in the catalog
    When DELETE /api/v1/services/svc-uuid-000 is called
    Then the service is soft-deleted (deleted_at is set)
    And it no longer appears in list or search results
    And the Meilisearch document is removed
    And the deletion is recorded in the audit log

  Scenario: Create service manually (not from discovery)
    When POST /api/v1/services is called with:
      """
      { "name": "manual-service", "owner": "ops-team", "source": "manual" }
      """
    Then a new service is created with source: "manual"
    And it appears in the catalog and search results
    And it is not overwritten by automated discovery scans

  Scenario: Unauthenticated request is rejected
    When GET /api/v1/services is called without an Authorization header
    Then the response returns 401 Unauthorized
    And no service data is returned
```
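The pagination contract these scenarios assert (page, limit, total_count, total_pages) can be sketched as a small helper. Illustrative only; `paginate` is not the portal's actual handler:

```python
import math

def paginate(items, page=1, limit=20):
    """Return one page of `items` plus the metadata the list endpoint exposes."""
    total_count = len(items)
    total_pages = max(1, math.ceil(total_count / limit))
    start = (page - 1) * limit
    return {
        "items": items[start:start + limit],
        "page": page,
        "limit": limit,
        "total_count": total_count,
        "total_pages": total_pages,
    }
```

With 42 services, `page=2, limit=20` yields services 21-40 and `total_pages: 3`, matching the scenario.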

### Story 3.4 — Full-Text Search via Meilisearch

```gherkin
Feature: Catalog Full-Text Search

  Scenario: Search returns relevant services
    Given the catalog has services: "payment-api", "payment-processor", "user-service"
    When a search query "payment" is submitted
    Then Meilisearch returns "payment-api" and "payment-processor"
    And results are ranked by relevance score
    And the response time is under 10ms

  Scenario: Search is scoped to tenant
    Given tenant "acme" has service "auth-service"
    And tenant "globex" has service "auth-service" (same name)
    When tenant "acme" searches for "auth-service"
    Then only acme's "auth-service" is returned
    And globex's service is not included in results

  Scenario: Meilisearch index is corrupted or unavailable
    Given Meilisearch returns a 503 error
    When a search request is made
    Then the API falls back to PostgreSQL full-text search (pg_trgm)
    And returns results within 500ms
    And logs a warning: "Meilisearch unavailable, using PostgreSQL fallback"
    And the user sees results (degraded performance, not an error)

  Scenario: Catalog update triggers Meilisearch re-index
    Given a service "new-api" is added to the catalog
    When the catalog write completes
    Then a Meilisearch index update is triggered asynchronously
    And the service is searchable within 1 second
    And if the Meilisearch update fails, it is retried via a background queue
```
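The fallback scenario above can be sketched as a thin wrapper with both backends injected as callables. The real Meilisearch and PostgreSQL clients are assumptions and not shown here:

```python
import logging

logger = logging.getLogger("catalog.search")

def search_with_fallback(query, meili_search, pg_search):
    """Query Meilisearch first; on any error, fall back to PostgreSQL
    full-text search so the user still gets results."""
    try:
        return {"results": meili_search(query), "backend": "meilisearch"}
    except Exception:
        logger.warning("Meilisearch unavailable, using PostgreSQL fallback")
        return {"results": pg_search(query), "backend": "postgresql"}
```

Surfacing the `backend` field makes the degraded path observable without turning it into a user-facing error.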

---

## Epic 4: Search Engine

### Story 4.1 — Cmd+K Instant Search

```gherkin
Feature: Cmd+K Instant Search Interface

  Scenario: User opens search with keyboard shortcut
    Given the user is on any page of the portal
    When the user presses Cmd+K (Mac) or Ctrl+K (Windows/Linux)
    Then the search modal opens immediately
    And the search input is focused
    And recent searches are shown as suggestions

  Scenario: Search returns results under 10ms
    Given the Meilisearch index has 500 services
    And Redis prefix cache is warm
    When the user types "pay" in the search box
    Then results appear within 10ms
    And the top 5 matching services are shown
    And each result shows: service name, owner, health status

  Scenario: Search with Redis prefix cache hit
    Given "pay" has been searched recently and is cached in Redis
    When the user types "pay"
    Then the API returns cached results from Redis
    And does not query Meilisearch
    And the response time is under 5ms

  Scenario: Search with Redis cache miss
    Given "xyz-unique-query" has never been searched
    When the user types "xyz-unique-query"
    Then the API queries Meilisearch directly
    And stores the result in Redis with TTL of 60 seconds
    And returns results within 10ms

  Scenario: Search cache is invalidated on catalog update
    Given "payment-api" search results are cached in Redis
    When the catalog is updated (new service added or existing service modified)
    Then all Redis cache keys matching affected prefixes are invalidated
    And the next search query fetches fresh results from Meilisearch

  Scenario: Search with no results
    Given no services match the query "zzznomatch"
    When the user searches for "zzznomatch"
    Then the search modal shows "No services found"
    And suggests: "Try a broader search or browse all services"
    And does not show an error

  Scenario: Search is dismissed
    Given the search modal is open
    When the user presses Escape
    Then the search modal closes
    And focus returns to the previously focused element

  Scenario: Search result is selected
    Given search results show "payment-api"
    When the user clicks or presses Enter on "payment-api"
    Then the search modal closes
    And the service detail drawer opens for "payment-api"
    And the URL updates to reflect the selected service
```
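The cache hit/miss behaviour above amounts to a read-through cache with a 60-second TTL. A sketch, using a plain dict of `key -> (expires_at, payload)` in place of Redis and a hypothetical `search_fn` for the Meilisearch query:

```python
import json
import time

CACHE_TTL_S = 60  # TTL named in the cache-miss scenario

def cached_search(tenant_id, query, cache, search_fn, now=time.time):
    """Hit: return cached results without touching Meilisearch.
    Miss: query the index and store the result with a 60s TTL."""
    key = f"search:{tenant_id}:{query}"
    entry = cache.get(key)
    if entry and entry[0] > now():
        return json.loads(entry[1])            # cache hit
    results = search_fn(query)                 # cache miss: query Meilisearch
    cache[key] = (now() + CACHE_TTL_S, json.dumps(results))
    return results
```

The key shape matches the tenant-scoped layout defined in Story 4.2.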

### Story 4.2 — Redis Prefix Caching

```gherkin
Feature: Redis Prefix Caching for Search

  Scenario: Cache key structure is tenant-scoped
    Given tenant "acme" searches for "pay"
    Then the Redis cache key is "search:acme:pay"
    And tenant "globex" searching for "pay" uses key "search:globex:pay"
    And the two cache entries are completely independent

  Scenario: Cache TTL expires and is refreshed
    Given a cache entry for "search:acme:pay" has TTL of 60 seconds
    When 61 seconds pass without a search for "pay"
    Then the cache entry expires
    And the next search for "pay" queries Meilisearch
    And a new cache entry is created with a fresh 60-second TTL

  Scenario: Redis is unavailable
    Given Redis returns a connection error
    When a search request is made
    Then the API bypasses the cache and queries Meilisearch directly
    And logs a warning: "Redis unavailable, cache bypassed"
    And the search still returns results (degraded, not broken)
    And does not attempt to write to Redis while it is down

  Scenario: Cache invalidation on bulk catalog update
    Given a discovery scan adds 50 new services to the catalog
    When the bulk ingestion completes
    Then all Redis search cache keys for the affected tenant are flushed
    And subsequent searches reflect the updated catalog
    And the flush is atomic (all keys or none)
```
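The key layout and the bulk flush above can be sketched with a dict standing in for Redis; production code would SCAN a key pattern and DEL the matches in a pipeline to approximate an atomic flush:

```python
def search_cache_key(tenant_id, prefix):
    """Tenant-scoped key layout from the first scenario: search:<tenant>:<prefix>."""
    return f"search:{tenant_id}:{prefix}"

def flush_tenant_search_cache(cache, tenant_id):
    """Remove every search entry for one tenant, leaving other tenants intact.
    Returns the number of keys removed."""
    pattern = f"search:{tenant_id}:"
    doomed = [key for key in cache if key.startswith(pattern)]
    for key in doomed:
        del cache[key]
    return len(doomed)
```

Scoping the flush to the tenant prefix is what keeps one tenant's bulk ingestion from evicting another tenant's warm cache.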

### Story 4.3 — Meilisearch Index Management

```gherkin
Feature: Meilisearch Index Management

  Scenario: Index is created on first tenant onboarding
    Given a new tenant "startup-co" completes onboarding
    When the tenant's first discovery scan runs
    Then a Meilisearch index "services_startup-co" is created
    And the index is configured with searchable attributes: name, description, owner, tags
    And filterable attributes: source, health_status, owner, region

  Scenario: Index corruption is detected and rebuilt
    Given the Meilisearch index for tenant "acme" returns inconsistent results
    When the health check detects index corruption (checksum mismatch)
    Then an index rebuild is triggered
    And the index is rebuilt from the PostgreSQL catalog (source of truth)
    And during rebuild, search falls back to PostgreSQL full-text search
    And users see a banner: "Search index is being rebuilt, results may be incomplete"
    And the rebuild completes within 5 minutes for up to 10,000 services

  Scenario: Meilisearch index rebuild does not affect other tenants
    Given tenant "acme"'s index is being rebuilt
    When tenant "globex" performs a search
    Then globex's search is unaffected
    And globex's index is not touched during acme's rebuild

  Scenario: Search ranking is configured correctly
    Given the Meilisearch index has ranking rules configured
    When a user searches for "api"
    Then results are ranked by: words, typo, proximity, attribute, sort, exactness
    And services with "api" in the name rank higher than those with "api" in description
    And recently updated services rank higher among equal-relevance results
```
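The index-creation scenario can be sketched as a settings payload builder. The key names follow Meilisearch's documented settings object, the attribute lists come from the scenarios above, and the function itself is illustrative rather than the portal's actual provisioning code:

```python
def index_settings_for_tenant(tenant_slug):
    """Per-tenant index name plus the searchable/filterable configuration
    implied by Story 4.3."""
    return {
        "index_uid": f"services_{tenant_slug}",
        "settings": {
            "searchableAttributes": ["name", "description", "owner", "tags"],
            "filterableAttributes": ["source", "health_status", "owner", "region"],
            "rankingRules": ["words", "typo", "proximity",
                             "attribute", "sort", "exactness"],
        },
    }
```

Because `name` precedes `description` in `searchableAttributes`, the attribute ranking rule makes name matches outrank description matches, as the last scenario requires.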

---

## Epic 5: Dashboard UI

### Story 5.1 — Service Catalog Browse

```gherkin
Feature: Service Catalog Browse UI

  Background:
    Given the user is authenticated and on the catalog page

  Scenario: Catalog page loads with service cards
    Given the tenant has 42 services in the catalog
    When the catalog page loads
    Then 20 service cards are displayed (first page)
    And each card shows: service name, owner, health status badge, source icon, last updated
    And a pagination control shows "Page 1 of 3"

  Scenario: Progressive disclosure — card expands on hover
    Given a service card for "payment-api" is visible
    When the user hovers over the card
    Then the card expands to show additional details:
      | Field       | Value                 |
      | Description | Handles payment flows |
      | Region      | us-east-1             |
      | On-call     | alice@acme.com        |
      | Tech stack  | Node.js, Docker       |
    And the expansion animation completes within 200ms

  Scenario: Filter services by owner
    Given the catalog has services from multiple teams
    When the user selects filter "owner: payments-team"
    Then only services owned by "payments-team" are shown
    And the URL updates with ?owner=payments-team
    And the filter chip is shown as active

  Scenario: Filter services by health status
    Given some services have health_status: "healthy", "degraded", "unknown"
    When the user selects filter "status: degraded"
    Then only degraded services are shown
    And the count badge updates to reflect filtered results

  Scenario: Filter services by source
    Given services come from "aws-ecs", "aws-lambda", "github"
    When the user selects filter "source: aws-lambda"
    Then only Lambda functions are shown in the catalog

  Scenario: Sort services by last updated
    Given the catalog has services with various updated_at timestamps
    When the user selects sort "Last Updated (newest first)"
    Then services are sorted with most recently updated first
    And the sort selection persists across page navigation

  Scenario: Empty catalog state
    Given the tenant has just onboarded and has 0 services
    When the catalog page loads
    Then an empty state is shown: "No services discovered yet"
    And a CTA button: "Run Discovery Scan" is prominently displayed
    And a link to the onboarding guide is shown
```

### Story 5.2 — Service Detail Drawer

```gherkin
Feature: Service Detail Drawer

  Scenario: Open service detail drawer
    Given the user clicks on service card "checkout-api"
    When the drawer opens
    Then it slides in from the right within 300ms
    And displays full service details:
      | Section   | Content                             |
      | Header    | Service name, health badge, source  |
      | Ownership | Team name, on-call engineer         |
      | Metadata  | Region, runtime, version, tags      |
      | Health    | Scorecard with metrics              |
      | Links     | GitHub repo, PagerDuty, AWS console |
    And the main catalog page remains visible behind the drawer

  Scenario: Drawer URL is shareable
    Given the drawer is open for "checkout-api" (id: svc-123)
    When the user copies the URL
    Then the URL is /catalog?service=svc-123
    And sharing this URL opens the catalog with the drawer pre-opened

  Scenario: Close drawer with Escape key
    Given the service detail drawer is open
    When the user presses Escape
    Then the drawer closes
    And focus returns to the service card that was clicked

  Scenario: Navigate between services in drawer
    Given the drawer is open for "checkout-api"
    When the user presses the right arrow or clicks "Next service"
    Then the drawer transitions to the next service in the current filtered/sorted list
    And the URL updates accordingly

  Scenario: Drawer shows stale data warning
    Given service "legacy-api" has status: "stale" (not seen in last scan)
    When the drawer opens for "legacy-api"
    Then a warning banner shows: "This service was not found in the last scan (3 days ago)"
    And a "Re-run scan" button is available

  Scenario: Edit service owner from drawer
    Given the user has editor permissions
    When they click "Edit owner" in the drawer
    Then an inline edit field appears
    And they can type a new owner name with autocomplete from known teams
    And saving updates the catalog immediately
    And the drawer reflects the new owner without a full page reload
```

### Story 5.3 — Cmd+K Search in UI

```gherkin
Feature: Cmd+K Search UI Integration

  Scenario: Search modal shows categorized results
    Given the user types "pay" in the Cmd+K search modal
    When results are returned
    Then they are grouped by category:
      | Category | Examples                       |
      | Services | payment-api, payment-processor |
      | Teams    | payments-team                  |
      | Actions  | Run discovery scan             |
    And keyboard navigation works between categories

  Scenario: Search modal shows recent searches on open
    Given the user has previously searched for "auth", "billing", "gateway"
    When the user opens Cmd+K without typing
    Then recent searches are shown as suggestions
    And clicking a suggestion populates the search input

  Scenario: Search result keyboard navigation
    Given the search modal is open with 5 results
    When the user presses the down arrow key
    Then the first result is highlighted
    And pressing down again highlights the second result
    And pressing Enter navigates to the highlighted result

  Scenario: Search modal is accessible
    Given the search modal is open
    Then it has role="dialog" and aria-label="Search services"
    And the input has aria-label="Search"
    And results have role="listbox" with role="option" items
    And screen readers announce result count changes
```

---

## Epic 6: Analytics Dashboards

### Story 6.1 — Ownership Coverage Dashboard

```gherkin
Feature: Ownership Coverage Analytics

  Scenario: Display ownership coverage percentage
    Given the tenant has 100 services in the catalog
    And 75 services have a confirmed owner
    And 25 services have owner: null or "unknown"
    When the analytics dashboard loads
    Then the ownership coverage metric shows 75%
    And a donut chart shows the breakdown: 75 owned, 25 unowned
    And a trend line shows coverage change over the last 30 days

  Scenario: Ownership coverage by team
    Given multiple teams own services
    When the ownership breakdown table is rendered
    Then it shows each team with their service count and percentage
    And teams are sorted by service count descending
    And clicking a team filters the catalog to that team's services

  Scenario: Ownership coverage drops below threshold
    Given the ownership coverage threshold is set to 80%
    And coverage drops from 82% to 76% after a scan
    When the dashboard refreshes
    Then a warning alert is shown: "Ownership coverage below 80% threshold"
    And the affected unowned services are listed
    And an email notification is sent to the tenant admin

  Scenario: Zero unowned services
    Given all 50 services have confirmed owners
    When the ownership dashboard loads
    Then coverage shows 100%
    And a success state is shown: "All services have owners"
    And no warning alerts are displayed
```
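The coverage metric in these scenarios is plain arithmetic. A sketch, assuming services are dicts where an `owner` of None or "unknown" counts as unowned; an empty catalog reports 100% since nothing is unowned:

```python
def ownership_coverage(services):
    """Return owned/unowned counts and the coverage percentage for the dashboard."""
    total = len(services)
    owned = sum(1 for s in services if s.get("owner") not in (None, "unknown"))
    coverage_pct = 100.0 if total == 0 else round(100.0 * owned / total, 1)
    return {"owned": owned, "unowned": total - owned, "coverage_pct": coverage_pct}
```

Comparing `coverage_pct` against the tenant's configured threshold (80% in the scenario) is what drives the warning alert and admin email.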

### Story 6.2 — Service Health Scorecards

```gherkin
Feature: Service Health Scorecards

  Scenario: Health scorecard is calculated for a service
    Given service "payment-api" has the following signals:
      | Signal             | Value  | Score |
      | Has owner          | Yes    | 20    |
      | Has on-call policy | Yes    | 20    |
      | Has documentation  | Yes    | 15    |
      | Recent deployment  | 2 days | 15    |
      | No open P1 alerts  | True   | 30    |
    When the health scorecard is computed
    Then the overall score is 100/100
    And the health_status is "healthy"

  Scenario: Service with missing ownership scores lower
    Given service "orphan-lambda" has:
      | Signal             | Value   | Score |
      | Has owner          | No      | 0     |
      | Has on-call policy | No      | 0     |
      | Has documentation  | No      | 0     |
      | Recent deployment  | 90 days | 5     |
      | No open P1 alerts  | True    | 30    |
    When the health scorecard is computed
    Then the overall score is 35/100
    And the health_status is "at-risk"
    And the scorecard highlights the missing signals as improvement actions

  Scenario: Health scorecard trend over time
    Given "checkout-api" has had weekly scorecard snapshots for 8 weeks
    When the service detail drawer shows the health trend
    Then a sparkline chart shows the score history
    And the trend direction (improving/declining) is indicated

  Scenario: Team-level health KPI aggregation
    Given "payments-team" owns 10 services with scores: [90, 85, 70, 95, 60, 80, 75, 88, 92, 65]
    When the team KPI dashboard is rendered
    Then the team average score is 80/100
    And the lowest-scoring service is highlighted for attention
    And the team score is compared to the org average

  Scenario: Health scorecard for stale service
    Given service "legacy-api" has status: "stale"
    When the scorecard is computed
    Then the "Recent deployment" signal scores 0
    And a penalty is applied for being stale
    And the overall score reflects the staleness
```
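The scorecard tables above imply a fixed weight per signal. In this sketch the weights are taken from the scenarios, while the status cut-offs (healthy at 80+, at-risk below 50, "degraded" in between) are assumptions: the spec names the statuses but not the exact thresholds.

```python
# Weights from the Story 6.2 signal tables; each signal is capped at its weight.
SIGNAL_WEIGHTS = {
    "has_owner": 20,
    "has_oncall_policy": 20,
    "has_documentation": 15,
    "recent_deployment": 15,
    "no_open_p1_alerts": 30,
}

def health_score(signals):
    """signals: {signal_name: points earned}; returns (score, status).
    Status thresholds here are assumed, not specified."""
    score = sum(min(signals.get(name, 0), cap)
                for name, cap in SIGNAL_WEIGHTS.items())
    if score >= 80:
        status = "healthy"
    elif score >= 50:
        status = "degraded"
    else:
        status = "at-risk"
    return score, status
```

The two worked examples reproduce the tables: all signals at full weight give 100/100 "healthy", and the orphaned Lambda's 5 + 30 points give 35/100 "at-risk".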

### Story 6.3 — Tech Debt Tracking

```gherkin
Feature: Tech Debt Tracking Dashboard

  Scenario: Identify services using deprecated runtimes
    Given the portal has a policy: "Node.js < 18 is deprecated"
    And 5 services use Node.js 16
    When the tech debt dashboard loads
    Then it shows 5 services flagged for "deprecated runtime"
    And each flagged service links to its catalog entry
    And the total tech debt score is calculated

  Scenario: Track services with no recent deployments
    Given the policy threshold is "no deployment in 90 days = tech debt"
    And 8 services have not been deployed in over 90 days
    When the tech debt dashboard loads
    Then these 8 services appear in the "stale deployments" category
    And the owning teams are notified via the dashboard

  Scenario: Tech debt trend over time
    Given tech debt metrics have been tracked for 12 weeks
    When the trend chart is rendered
    Then it shows weekly tech debt item counts
    And highlights weeks where debt increased significantly
    And shows the net change (items resolved vs. items added)
```
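Both tech-debt policies above are simple threshold checks. A sketch, assuming a Lambda-style runtime string ("nodejs16.x") and a timezone-aware `last_deployed_at`; the service dict shape is hypothetical:

```python
from datetime import datetime, timedelta, timezone

DEPRECATED_NODE_BELOW = 18  # "Node.js < 18 is deprecated"
STALE_AFTER_DAYS = 90       # "no deployment in 90 days = tech debt"

def tech_debt_flags(service, now=None):
    """Return the list of tech-debt categories this service falls into."""
    now = now or datetime.now(timezone.utc)
    flags = []
    runtime = service.get("runtime") or ""
    if runtime.startswith("nodejs"):
        major = int(runtime.removeprefix("nodejs").split(".")[0])
        if major < DEPRECATED_NODE_BELOW:
            flags.append("deprecated runtime")
    last_deployed = service.get("last_deployed_at")
    if last_deployed and now - last_deployed > timedelta(days=STALE_AFTER_DAYS):
        flags.append("stale deployment")
    return flags
```

Summing flag counts per week is enough to feed the trend chart in the last scenario.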

---