diff --git a/products/04-lightweight-idp/acceptance-specs/acceptance-specs.md b/products/04-lightweight-idp/acceptance-specs/acceptance-specs.md new file mode 100644 index 0000000..59d053e --- /dev/null +++ b/products/04-lightweight-idp/acceptance-specs/acceptance-specs.md @@ -0,0 +1,1177 @@ +# dd0c/portal — BDD Acceptance Test Specifications + +> Gherkin scenarios for all 10 epics. Edge cases included. + +--- + +## Epic 1: AWS Discovery Scanner + +### Story 1.1 — IAM Role Assumption into Customer Account + +```gherkin +Feature: IAM Role Assumption for Cross-Account Discovery + + Background: + Given the portal has a tenant with AWS integration configured + And the tenant has provided an IAM role ARN "arn:aws:iam::123456789012:role/PortalDiscoveryRole" + + Scenario: Successfully assume IAM role and begin scan + Given the IAM role has the required trust policy allowing portal's scanner principal + When the AWS Discovery Scanner initiates a scan for the tenant + Then the scanner assumes the role via STS AssumeRole + And receives temporary credentials valid for 1 hour + And begins scanning the configured AWS regions + + Scenario: IAM role assumption fails due to missing trust policy + Given the IAM role does NOT have a trust policy for the portal's scanner principal + When the AWS Discovery Scanner attempts to assume the role + Then STS returns an "AccessDenied" error + And the scanner marks the scan job as FAILED with reason "IAM role assumption denied" + And the tenant receives a notification: "AWS integration error: cannot assume role" + And no partial scan results are persisted + + Scenario: IAM role ARN is malformed + Given the tenant has configured an invalid role ARN "not-a-valid-arn" + When the AWS Discovery Scanner attempts to assume the role + Then the scanner marks the scan job as FAILED with reason "Invalid role ARN format" + And logs a structured error with tenant_id and the invalid ARN + + Scenario: Temporary credentials expire mid-scan + Given the 
scanner has assumed the IAM role and started scanning + And the scan has been running for 61 minutes + When the scanner attempts to call ECS DescribeClusters + Then AWS returns an "ExpiredTokenException" + And the scanner refreshes credentials via a new AssumeRole call + And resumes scanning from the last checkpoint + And the scan job status remains IN_PROGRESS + + Scenario: Role assumption succeeds but has insufficient permissions + Given the IAM role is assumed successfully + But the role lacks "ecs:ListClusters" permission + When the scanner attempts to list ECS clusters + Then AWS returns an "AccessDenied" error for that specific call + And the scanner records a partial failure for the ECS resource type + And continues scanning other resource types (Lambda, CloudFormation, API Gateway) + And the final scan result includes a warnings list with "ECS: insufficient permissions" +``` + +### Story 1.2 — ECS Service Discovery + +```gherkin +Feature: ECS Service Discovery + + Background: + Given a tenant's AWS account has been successfully authenticated + And the scanner is configured to scan region "us-east-1" + + Scenario: Discover ECS services across multiple clusters + Given the AWS account has 3 ECS clusters: "prod-cluster", "staging-cluster", "dev-cluster" + And "prod-cluster" has 5 services, "staging-cluster" has 3, "dev-cluster" has 2 + When the AWS Discovery Scanner runs the ECS scan step + Then it lists all 3 clusters via ecs:ListClusters + And describes all services in each cluster via ecs:DescribeServices + And emits 10 discovered service events to the catalog ingestion queue + And each event includes: cluster_name, service_name, task_definition, desired_count, region + + Scenario: ECS cluster has no services + Given the AWS account has 1 ECS cluster "empty-cluster" with 0 services + When the scanner runs the ECS scan step + Then it lists the cluster successfully + And records 0 services discovered for that cluster + And does not emit any service events for 
that cluster + And the scan step completes without error + + Scenario: ECS DescribeServices returns throttling error + Given the AWS account has an ECS cluster with 100 services + When the scanner calls ecs:DescribeServices in batches of 10 + And AWS returns a ThrottlingException on the 5th batch + Then the scanner applies exponential backoff (2s, 4s, 8s) + And retries the failed batch up to 3 times + And if retry succeeds, continues with remaining batches + And the final result includes all 100 services + + Scenario: ECS DescribeServices fails after all retries + Given AWS returns ThrottlingException on every retry attempt + When the scanner exhausts all 3 retries for a batch + Then it marks that batch as partially failed + And records which service ARNs could not be described + And continues with the next batch + And the scan summary includes "ECS partial failure: 10 services could not be described" + + Scenario: Multi-region ECS scan + Given the tenant has configured scanning for regions: "us-east-1", "eu-west-1", "ap-southeast-1" + When the AWS Discovery Scanner runs + Then it scans ECS in all 3 regions in parallel via Step Functions Map state + And aggregates results from all regions + And each discovered service record includes its region + And duplicate service names across regions are stored as separate catalog entries +``` + +### Story 1.3 — Lambda Function Discovery + +```gherkin +Feature: Lambda Function Discovery + + Scenario: Discover all Lambda functions in a region + Given the AWS account has 25 Lambda functions in "us-east-1" + When the scanner runs the Lambda scan step + Then it calls lambda:ListFunctions with pagination + And retrieves all 25 functions across multiple pages + And emits 25 service events with: function_name, runtime, memory_size, timeout, last_modified + + Scenario: Lambda function has tags indicating service ownership + Given a Lambda function "payment-processor" has tags: + | Key | Value | + | team | payments-team | + | service | 
payment-svc | + | environment | production | + When the scanner processes this Lambda function + Then the catalog entry includes ownership: "payments-team" (source: implicit/tags) + And the service name is inferred as "payment-svc" from the "service" tag + + Scenario: Lambda function has no tags + Given a Lambda function "legacy-cron-job" has no tags + When the scanner processes this Lambda function + Then the catalog entry has ownership: null (source: heuristic/pending) + And the entry is flagged for ownership inference via commit history + + Scenario: Lambda ListFunctions pagination + Given the AWS account has 150 Lambda functions (exceeding the 50-per-page limit) + When the scanner calls lambda:ListFunctions + Then it follows NextMarker pagination tokens + And retrieves all 150 functions across 3 pages + And does not duplicate any function in the results +``` + +### Story 1.4 — CloudFormation Stack Discovery + +```gherkin +Feature: CloudFormation Stack Discovery + + Scenario: Discover active CloudFormation stacks + Given the AWS account has 8 CloudFormation stacks in "us-east-1" + And 6 stacks have status "CREATE_COMPLETE" or "UPDATE_COMPLETE" + And 2 stacks have status "ROLLBACK_COMPLETE" + When the scanner runs the CloudFormation scan step + Then it discovers all 8 stacks + And includes stacks of all statuses in the raw results + And marks ROLLBACK_COMPLETE stacks with health_status: "degraded" + + Scenario: CloudFormation stack outputs contain service metadata + Given a stack "api-gateway-stack" has outputs: + | OutputKey | OutputValue | + | ServiceName | user-api | + | TeamOwner | platform-team | + | ApiEndpoint | https://api.example.com/users | + When the scanner processes this stack + Then the catalog entry uses "user-api" as the service name + And sets ownership to "platform-team" (source: implicit/stack-output) + And records the API endpoint in service metadata + + Scenario: Nested stacks are discovered + Given a parent CloudFormation stack has 3 nested 
stacks + When the scanner processes the parent stack + Then it also discovers and catalogs each nested stack + And links nested stacks to their parent in the catalog + + Scenario: Stack in DELETE_IN_PROGRESS state + Given a CloudFormation stack has status "DELETE_IN_PROGRESS" + When the scanner discovers this stack + Then it records the stack with health_status: "terminating" + And does not block the scan on this stack's state +``` + +### Story 1.5 — API Gateway Discovery + +```gherkin +Feature: API Gateway Discovery + + Scenario: Discover REST APIs in API Gateway v1 + Given the AWS account has 4 REST APIs in API Gateway + When the scanner runs the API Gateway scan step + Then it calls apigateway:GetRestApis + And retrieves all 4 APIs with their names, IDs, and endpoints + And emits catalog events for each API + + Scenario: Discover HTTP APIs in API Gateway v2 + Given the AWS account has 3 HTTP APIs in API Gateway v2 + When the scanner runs the API Gateway v2 scan step + Then it calls apigatewayv2:GetApis + And retrieves all 3 HTTP APIs + And correctly distinguishes them from REST APIs in the catalog + + Scenario: API Gateway API has no associated service tag + Given an API Gateway REST API "legacy-api" has no resource tags + When the scanner processes this API + Then it creates a catalog entry with name "legacy-api" + And sets ownership to null pending heuristic inference + And records the API type as "REST" and the invoke URL + + Scenario: API Gateway scan across multiple regions + Given API Gateway APIs exist in "us-east-1" and "us-west-2" + When the scanner runs in parallel across both regions + Then APIs from both regions are discovered + And each catalog entry includes the region field + And there are no cross-region duplicates +``` + +### Story 1.6 — Step Functions Orchestration + +```gherkin +Feature: Step Functions Orchestration of Discovery Scan + + Scenario: Full scan executes all resource type steps in parallel + Given a scan job is triggered for a 
tenant + When the Step Functions state machine starts + Then it executes ECS, Lambda, CloudFormation, and API Gateway scan steps in parallel + And waits for all parallel branches to complete + And aggregates results from all branches + And transitions to the catalog ingestion step + + Scenario: One parallel branch fails, others succeed + Given the Step Functions state machine is running a scan + And the Lambda scan step throws an unhandled exception + When the state machine catches the error + Then the ECS, CloudFormation, and API Gateway branches continue to completion + And the Lambda branch is marked as FAILED in the scan summary + And the overall scan job status is PARTIAL_SUCCESS + And discovered resources from successful branches are ingested into the catalog + + Scenario: Scan job is idempotent on retry + Given a scan job failed midway through execution + When the Step Functions state machine retries the execution + Then it does not duplicate already-ingested catalog entries + And uses upsert semantics based on (tenant_id, resource_arn) as the unique key + + Scenario: Concurrent scan jobs for the same tenant are prevented + Given a scan job is already IN_PROGRESS for tenant "acme-corp" + When a second scan job is triggered for "acme-corp" + Then the system detects the existing in-progress execution + And rejects the new job with status 409 Conflict + And returns the ID of the existing running scan job + + Scenario: Scan job timeout + Given a scan job has been running for more than 30 minutes + When the Step Functions state machine timeout is reached + Then the execution is marked as TIMED_OUT + And partial results already ingested are retained in the catalog + And the tenant is notified of the timeout + And the scan job record shows last_successful_step + + Scenario: WebSocket progress events during scan + Given a user is watching the discovery progress in the UI + When each Step Functions step completes + Then a WebSocket event is pushed to the user's 
connection + And the event includes: step_name, status, resources_found, timestamp + And the UI progress bar updates in real-time +``` + +### Story 1.7 — Multi-Region Scanning + +```gherkin +Feature: Multi-Region AWS Scanning + + Scenario: Tenant configures specific regions to scan + Given a tenant has configured scan regions: ["us-east-1", "eu-central-1"] + When the discovery scanner runs + Then it only scans the configured regions + And does not scan any other AWS regions + And the scan summary shows per-region resource counts + + Scenario: Tenant configures "all regions" scan + Given a tenant has configured scan_all_regions: true + When the discovery scanner runs + Then it first calls ec2:DescribeRegions to get the current list of enabled regions + And scans all enabled regions + And the Step Functions Map state iterates over the dynamic region list + + Scenario: A region is disabled in the AWS account + Given the tenant's AWS account has "ap-east-1" disabled + And the tenant has configured scan_all_regions: true + When the scanner attempts to scan "ap-east-1" + Then it receives an "OptInRequired" or region-disabled error + And skips that region gracefully + And records "ap-east-1: skipped (region disabled)" in the scan summary + + Scenario: Region scan results are isolated per tenant + Given tenant "acme" and tenant "globex" both have resources in "us-east-1" + When both tenants run discovery scans simultaneously + Then acme's catalog only contains acme's resources + And globex's catalog only contains globex's resources + And no cross-tenant data leakage occurs +``` + +--- + +## Epic 2: GitHub Discovery Scanner + +### Story 2.1 — GitHub App Installation and Authentication + +```gherkin +Feature: GitHub App Authentication + + Background: + Given the portal has a GitHub App registered with app_id "12345" + And the app has a private key stored in AWS Secrets Manager + + Scenario: Successfully authenticate as GitHub App installation + Given a tenant has installed 
the GitHub App on their organization "acme-org" + And the installation_id is "67890" + When the GitHub Discovery Scanner initializes for this tenant + Then it generates a JWT signed with the app's private key + And exchanges the JWT for an installation access token via POST /app/installations/67890/access_tokens + And the token is granted the app's permissions: contents:read, metadata:read + And the scanner uses this token for all subsequent API calls + + Scenario: GitHub App JWT generation fails due to revoked private key + Given the private key stored in Secrets Manager has been revoked or is invalid + When the scanner attempts to generate a JWT + Then it fails with a key validation error + And marks the scan job as FAILED with reason "GitHub App authentication failed: invalid private key" + And triggers an alert to the tenant admin + + Scenario: Installation access token request returns 404 + Given the GitHub App installation has been uninstalled by the tenant + When the scanner requests an installation access token + Then GitHub returns 404 Not Found + And the scanner marks the GitHub integration as DISCONNECTED + And notifies the tenant: "GitHub App has been uninstalled. Please reinstall to continue discovery." 
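+
+    # Note (illustrative, not part of the original spec): the App JWT referenced
+    # in the scenarios above is a short-lived RS256 token whose lifetime GitHub
+    # caps at 10 minutes; "iat" is conventionally backdated ~60s to absorb clock
+    # drift. A minimal sketch of its claims, using app_id "12345" from the Background:
+    #   { "iat": <now - 60s>, "exp": <now + 600s>, "iss": "12345" }
+    # The installation access token it is exchanged for is what the scanner
+    # actually sends on subsequent REST and GraphQL calls.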
+ + Scenario: Installation access token nears expiry mid-scan + Given the scanner has a valid installation token (valid for 1 hour) + And the scan has been running for 55 minutes + When the scanner detects the token is within 5 minutes of expiry + Then it proactively requests a new installation access token + And retries any in-flight request that received 401 Unauthorized using the fresh token + And the scan continues without interruption +``` + +### Story 2.2 — Repository Discovery via GraphQL + +```gherkin +Feature: GitHub Repository Discovery via GraphQL API + + Scenario: Discover all repositories in an organization + Given the tenant's GitHub organization "acme-org" has 150 repositories + When the GitHub Discovery Scanner runs the repository listing query + Then it uses the GraphQL API with cursor-based pagination + And retrieves all 150 repositories in batches of 100 + And each repository record includes: name, defaultBranch, isArchived, pushedAt, topics + + Scenario: Archived repositories are excluded by default + Given "acme-org" has 150 active repos and 30 archived repos + When the scanner runs with default settings + Then it discovers 150 active repositories + And excludes the 30 archived repositories + And the scan summary notes "30 archived repositories skipped" + + Scenario: Tenant opts in to include archived repositories + Given the tenant has configured include_archived: true + When the scanner runs + Then it discovers all 180 repositories including archived ones + And archived repos are marked with status: "archived" in the catalog + + Scenario: Organization has no repositories + Given the tenant's GitHub organization has 0 repositories + When the scanner runs the repository listing query + Then it returns an empty result set + And the scan completes successfully with 0 GitHub services discovered + And no errors are raised + + Scenario: GraphQL query returns partial data with errors + Given the GraphQL query returns 80 repositories successfully + But includes a "FORBIDDEN" error for 20 private 
repositories + When the scanner processes the response + Then it ingests the 80 accessible repositories + And records a warning: "20 repositories inaccessible due to permissions" + And the scan status is PARTIAL_SUCCESS +``` + +### Story 2.3 — Service Metadata Extraction from Repositories + +```gherkin +Feature: Service Metadata Extraction from Repository Files + + Scenario: Extract service name from package.json + Given a repository "user-service" has a package.json at the root: + """ + { + "name": "@acme/user-service", + "version": "2.1.0", + "description": "Handles user authentication and profiles" + } + """ + When the scanner fetches and parses package.json + Then the catalog entry uses service name "@acme/user-service" + And records version "2.1.0" + And records description "Handles user authentication and profiles" + And sets the service type as "nodejs" + + Scenario: Extract service metadata from Dockerfile + Given a repository has a Dockerfile with: + """ + FROM python:3.11-slim + LABEL service.name="data-pipeline" + LABEL service.team="data-engineering" + EXPOSE 8080 + """ + When the scanner parses the Dockerfile + Then the catalog entry uses service name "data-pipeline" + And sets ownership to "data-engineering" (source: implicit/dockerfile-label) + And records the service type as "python" + And records exposed port 8080 + + Scenario: Extract service metadata from Terraform files + Given a repository has a main.tf with: + """ + module "service" { + source = "..." 
+ service_name = "billing-api" + team = "billing-team" + } + """ + When the scanner parses Terraform files + Then the catalog entry uses service name "billing-api" + And sets ownership to "billing-team" (source: implicit/terraform) + + Scenario: Repository has no recognizable service metadata files + Given a repository "random-scripts" has no package.json, Dockerfile, or terraform files + When the scanner processes this repository + Then it creates a catalog entry using the repository name "random-scripts" + And sets service type as "unknown" + And flags the entry for manual review + And ownership is set to null pending heuristic inference + + Scenario: Multiple metadata files exist — precedence order + Given a repository has both a package.json (name: "svc-a") and a Dockerfile (LABEL service.name="svc-b") + When the scanner processes the repository + Then it uses package.json as the primary metadata source + And records the service name as "svc-a" + And notes the Dockerfile label as secondary metadata + + Scenario: package.json is malformed JSON + Given a repository has a package.json with invalid JSON syntax + When the scanner attempts to parse it + Then it logs a warning: "package.json parse error in repo: user-service" + And falls back to Dockerfile or terraform for metadata + And if no fallback exists, uses the repository name + And does not crash the scan job +``` + +### Story 2.4 — GitHub Rate Limit Handling + +```gherkin +Feature: GitHub API Rate Limit Handling + + Scenario: GraphQL rate limit warning threshold reached + Given the scanner has consumed 4,500 of 5,000 GraphQL rate limit points + When the scanner checks the rate limit headers after each response + Then it detects the threshold (90% consumed) + And pauses scanning until the rate limit resets + And logs: "GitHub rate limit threshold reached, waiting for reset at {reset_time}" + And resumes scanning after the reset window + + Scenario: REST API secondary rate limit hit + Given the scanner is 
making rapid REST API calls + When GitHub returns HTTP 429 with "secondary rate limit" in the response + Then the scanner reads the Retry-After header + And waits the specified number of seconds before retrying + And does not count this as a scan failure + + Scenario: Rate limit exhausted with many repos remaining + Given the scanner has 500 repositories to scan + And the rate limit is exhausted after scanning 200 repositories + When the scanner detects rate limit exhaustion + Then it checkpoints the current progress (200 repos scanned) + And schedules a continuation scan after the rate limit reset + And the scan job status is set to RATE_LIMITED (not FAILED) + And the tenant is notified of the delay + + Scenario: Rate limit headers are missing from response + Given GitHub returns a response without X-RateLimit headers + When the scanner processes the response + Then it applies a conservative default delay of 1 second between requests + And logs a warning about missing rate limit headers + And continues scanning without crashing + + Scenario: Concurrent scans from multiple tenants share rate limit awareness + Given two tenants share the same GitHub App installation + When both tenants trigger scans simultaneously + Then the scanner uses a shared rate limit counter in Redis + And distributes available rate limit points between the two scans + And neither scan exceeds the total available rate limit +``` + +### Story 2.5 — Ownership Inference from Commit History + +```gherkin +Feature: Heuristic Ownership Inference from Commit History + + Scenario: Infer owner from most recent committer + Given a repository has no explicit ownership config and no team tags + And the last 10 commits were made by "alice@acme.com" (7 commits) and "bob@acme.com" (3 commits) + When the ownership inference engine runs + Then it identifies "alice@acme.com" as the primary contributor + And maps the email to team "frontend-team" via the org's CODEOWNERS or team membership + And sets ownership 
to "frontend-team" (source: heuristic/commit-history) + And records confidence: 0.7 + + Scenario: Ownership inference from CODEOWNERS file + Given a repository has a CODEOWNERS file: + """ + * @acme-org/platform-team + /src/api/ @acme-org/api-team + """ + When the scanner processes the CODEOWNERS file + Then it sets ownership to "platform-team" (source: implicit/codeowners) + And records that the api directory has additional ownership by "api-team" + And this takes precedence over commit history heuristics + + Scenario: Ownership conflict — multiple teams have equal commit share + Given a repository has commits split 50/50 between "team-a" and "team-b" + When the ownership inference engine runs + Then it records both teams as co-owners + And sets ownership_confidence: 0.5 + And flags the service for manual ownership resolution + And the catalog entry shows ownership_status: "conflict" + + Scenario: Explicit ownership config overrides all other sources + Given a repository has a .portal.yaml file: + """ + owner: payments-team + tier: critical + """ + And the repository also has CODEOWNERS pointing to "platform-team" + And commit history suggests "devops-team" + When the scanner processes the repository + Then ownership is set to "payments-team" (source: explicit/portal-config) + And CODEOWNERS and commit history are recorded as secondary metadata + And the explicit config is not overridden by any other source + + Scenario: Commit history API call fails + Given the GitHub API returns 500 when fetching commit history for a repo + When the ownership inference engine attempts heuristic inference + Then it marks ownership as null (source: unknown) + And records the error in the scan summary + And does not block catalog ingestion for this service +``` + +--- +## Epic 3: Service Catalog + +### Story 3.1 — Catalog Ingestion and Upsert + +```gherkin +Feature: Service Catalog Ingestion + + Background: + Given the catalog uses PostgreSQL Aurora Serverless v2 + And the unique 
key for a service is (tenant_id, source, resource_id) + + Scenario: New service is ingested into the catalog + Given a discovery scan emits a new service event for "payment-api" + When the catalog ingestion handler processes the event + Then a new row is inserted into the services table + And the row includes: tenant_id, name, source, resource_id, owner, health_status, discovered_at + And the Meilisearch index is updated with the new service document + And the catalog entry is immediately searchable + + Scenario: Existing service is updated on re-scan (upsert) + Given "payment-api" already exists in the catalog with owner "old-team" + When a new scan emits an updated event for "payment-api" with owner "payments-team" + Then the existing row is updated (not duplicated) + And owner is changed to "payments-team" + And updated_at is refreshed + And the Meilisearch document is updated + + Scenario: Service removed from AWS is marked stale + Given "deprecated-lambda" exists in the catalog from a previous scan + When the latest scan completes and does not include "deprecated-lambda" + Then the catalog entry is marked with status: "stale" + And last_seen_at is not updated + And the service remains visible in the catalog with a "stale" badge + And is not immediately deleted + + Scenario: Stale service is purged after retention period + Given "deprecated-lambda" has been stale for more than 30 days + When the nightly cleanup job runs + Then the service is soft-deleted from the catalog + And removed from the Meilisearch index + And a deletion event is logged for audit purposes + + Scenario: Catalog ingestion fails due to Aurora connection error + Given Aurora Serverless is scaling up (cold start) + When the ingestion handler attempts to write a service record + Then it retries with exponential backoff up to 5 times + And if all retries fail, the event is sent to a dead-letter queue + And an alert is raised for the operations team + And the scan job is marked PARTIAL_SUCCESS + 
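+    # Illustrative sketch (not part of the original spec): the upsert semantics
+    # exercised in the re-scan scenario above, assuming a PostgreSQL unique index
+    # on (tenant_id, source, resource_id) per the Background, could be expressed as:
+    #   INSERT INTO services (tenant_id, source, resource_id, name, owner, updated_at)
+    #   VALUES ($1, $2, $3, $4, $5, now())
+    #   ON CONFLICT (tenant_id, source, resource_id)
+    #   DO UPDATE SET name = EXCLUDED.name, owner = EXCLUDED.owner, updated_at = now();
+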
+ Scenario: Bulk ingestion of 500 services from a large account + Given a scan discovers 500 services across all resource types + When the catalog ingestion handler processes all events + Then it uses batch inserts (100 records per batch) + And completes ingestion within 30 seconds + And all 500 services appear in the catalog + And the Meilisearch index reflects all 500 services +``` + +### Story 3.2 — PagerDuty / OpsGenie On-Call Mapping + +```gherkin +Feature: On-Call Ownership Mapping via PagerDuty and OpsGenie + + Scenario: Map service owner to PagerDuty escalation policy + Given a service "checkout-api" has owner "payments-team" + And PagerDuty integration is configured for the tenant + And "payments-team" maps to PagerDuty service "checkout-escalation-policy" + When the catalog builds the service detail record + Then the service includes on_call_policy: "checkout-escalation-policy" + And a link to the PagerDuty service page + + Scenario: Fetch current on-call engineer from PagerDuty + Given "checkout-api" has a linked PagerDuty escalation policy + When a user views the service detail page + Then the portal calls PagerDuty GET /oncalls?escalation_policy_ids[]=... 
+ And displays the current on-call engineer's name and contact + And caches the result for 5 minutes in Redis + + Scenario: PagerDuty API returns 401 (invalid token) + Given the PagerDuty API token has been revoked + When the portal attempts to fetch on-call data + Then it returns the service detail without on-call info + And displays a warning: "On-call data unavailable — check PagerDuty integration" + And logs the auth failure for the tenant admin + + Scenario: OpsGenie integration as alternative to PagerDuty + Given the tenant uses OpsGenie instead of PagerDuty + And OpsGenie integration is configured with an API key + When the portal fetches on-call data for a service + Then it calls the OpsGenie API instead of PagerDuty + And maps the OpsGenie schedule to the service owner + And displays the current on-call responder + + Scenario: Service owner has no PagerDuty or OpsGenie mapping + Given "internal-tool" has owner "eng-team" + But "eng-team" has no mapping in PagerDuty or OpsGenie + When the portal builds the service detail + Then on_call_policy is null + And the UI shows "No on-call policy configured" for this service + And suggests linking a PagerDuty/OpsGenie service + + Scenario: Multiple services share the same on-call policy + Given 10 services all map to the "platform-oncall" PagerDuty policy + When the portal fetches on-call data + Then it batches the PagerDuty API call for all 10 services + And does not make 10 separate API calls + And caches the shared result for all 10 services +``` + +### Story 3.3 — Service Catalog CRUD API + +```gherkin +Feature: Service Catalog CRUD API + + Background: + Given the API requires a valid Cognito JWT + And the JWT contains tenant_id claim + + Scenario: List services for a tenant + Given tenant "acme" has 42 services in the catalog + When GET /api/v1/services is called with a valid JWT for "acme" + Then the response returns 42 services + And each service includes: id, name, owner, source, health_status, updated_at + 
And results are paginated (default page size: 20) + And the response includes total_count: 42 + + Scenario: List services with pagination + Given tenant "acme" has 42 services + When GET /api/v1/services?page=2&limit=20 is called + Then the response returns services 21-40 + And includes pagination metadata: page, limit, total_count, total_pages + + Scenario: Get single service by ID + Given service "svc-uuid-123" exists for tenant "acme" + When GET /api/v1/services/svc-uuid-123 is called with acme's JWT + Then the response returns the full service detail + And includes: metadata, owner, on_call_policy, health_scorecard, tags, links + + Scenario: Get service belonging to different tenant returns 404 + Given service "svc-uuid-456" belongs to tenant "globex" + When GET /api/v1/services/svc-uuid-456 is called with acme's JWT + Then the response returns 404 Not Found + And does not reveal that the service exists for another tenant + + Scenario: Manually update service owner via API + Given service "legacy-api" has owner inferred as "unknown" + When PATCH /api/v1/services/svc-uuid-789 is called with body: + """ + { "owner": "platform-team", "owner_source": "manual" } + """ + Then the service owner is updated to "platform-team" + And owner_source is set to "explicit/manual" + And the change is recorded in the audit log with the user's identity + And the Meilisearch document is updated + + Scenario: Delete service from catalog + Given service "decommissioned-api" exists in the catalog + When DELETE /api/v1/services/svc-uuid-000 is called + Then the service is soft-deleted (deleted_at is set) + And it no longer appears in list or search results + And the Meilisearch document is removed + And the deletion is recorded in the audit log + + Scenario: Create service manually (not from discovery) + When POST /api/v1/services is called with: + """ + { "name": "manual-service", "owner": "ops-team", "source": "manual" } + """ + Then a new service is created with source: "manual" + 
    And it appears in the catalog and search results
    And it is not overwritten by automated discovery scans

  Scenario: Unauthenticated request is rejected
    When GET /api/v1/services is called without an Authorization header
    Then the response returns 401 Unauthorized
    And no service data is returned
```

### Story 3.4 — Full-Text Search via Meilisearch

```gherkin
Feature: Catalog Full-Text Search

  Scenario: Search returns relevant services
    Given the catalog has services: "payment-api", "payment-processor", "user-service"
    When a search query "payment" is submitted
    Then Meilisearch returns "payment-api" and "payment-processor"
    And results are ranked by relevance score
    And the response time is under 10ms

  Scenario: Search is scoped to tenant
    Given tenant "acme" has service "auth-service"
    And tenant "globex" has service "auth-service" (same name)
    When tenant "acme" searches for "auth-service"
    Then only acme's "auth-service" is returned
    And globex's service is not included in results

  Scenario: Meilisearch index is corrupted or unavailable
    Given Meilisearch returns a 503 error
    When a search request is made
    Then the API falls back to PostgreSQL full-text search (pg_trgm)
    And returns results within 500ms
    And logs a warning: "Meilisearch unavailable, using PostgreSQL fallback"
    And the user sees results (degraded performance, not an error)

  Scenario: Catalog update triggers Meilisearch re-index
    Given a service "new-api" is added to the catalog
    When the catalog write completes
    Then a Meilisearch index update is triggered asynchronously
    And the service is searchable within 1 second
    And if the Meilisearch update fails, it is retried via a background queue
```

---

## Epic 4: Search Engine

### Story 4.1 — Cmd+K Instant Search

```gherkin
Feature: Cmd+K Instant Search Interface

  Scenario: User opens search with keyboard shortcut
    Given the user is on any page of the portal
    When the user presses Cmd+K (Mac) or Ctrl+K (Windows/Linux)
    Then the search modal opens immediately
    And the search input is focused
    And recent searches are shown as suggestions

  Scenario: Search returns results under 10ms
    Given the Meilisearch index has 500 services
    And Redis prefix cache is warm
    When the user types "pay" in the search box
    Then results appear within 10ms
    And the top 5 matching services are shown
    And each result shows: service name, owner, health status

  Scenario: Search with Redis prefix cache hit
    Given "pay" has been searched recently and is cached in Redis
    When the user types "pay"
    Then the API returns cached results from Redis
    And does not query Meilisearch
    And the response time is under 5ms

  Scenario: Search with Redis cache miss
    Given "xyz-unique-query" has never been searched
    When the user types "xyz-unique-query"
    Then the API queries Meilisearch directly
    And stores the result in Redis with TTL of 60 seconds
    And returns results within 10ms

  Scenario: Search cache is invalidated on catalog update
    Given "payment-api" search results are cached in Redis
    When the catalog is updated (new service added or existing service modified)
    Then all Redis cache keys matching affected prefixes are invalidated
    And the next search query fetches fresh results from Meilisearch

  Scenario: Search with no results
    Given no services match the query "zzznomatch"
    When the user searches for "zzznomatch"
    Then the search modal shows "No services found"
    And suggests: "Try a broader search or browse all services"
    And does not show an error

  Scenario: Search is dismissed
    Given the search modal is open
    When the user presses Escape
    Then the search modal closes
    And focus returns to the previously focused element

  Scenario: Search result is selected
    Given search results show "payment-api"
    When the user clicks or presses Enter on "payment-api"
    Then the search modal closes
    And the service detail drawer opens for "payment-api"
    And the URL updates to reflect the selected service
```

### Story 4.2 — Redis Prefix Caching

```gherkin
Feature: Redis Prefix Caching for Search

  Scenario: Cache key structure is tenant-scoped
    Given tenant "acme" searches for "pay"
    Then the Redis cache key is "search:acme:pay"
    And tenant "globex" searching for "pay" uses key "search:globex:pay"
    And the two cache entries are completely independent

  Scenario: Cache TTL expires and is refreshed
    Given a cache entry for "search:acme:pay" has TTL of 60 seconds
    When 61 seconds pass without a search for "pay"
    Then the cache entry expires
    And the next search for "pay" queries Meilisearch
    And a new cache entry is created with a fresh 60-second TTL

  Scenario: Redis is unavailable
    Given Redis returns a connection error
    When a search request is made
    Then the API bypasses the cache and queries Meilisearch directly
    And logs a warning: "Redis unavailable, cache bypassed"
    And the search still returns results (degraded, not broken)
    And does not attempt to write to Redis while it is down

  Scenario: Cache invalidation on bulk catalog update
    Given a discovery scan adds 50 new services to the catalog
    When the bulk ingestion completes
    Then all Redis search cache keys for the affected tenant are flushed
    And subsequent searches reflect the updated catalog
    And the flush is atomic (all keys or none)
```

### Story 4.3 — Meilisearch Index Management

```gherkin
Feature: Meilisearch Index Management

  Scenario: Index is created on first tenant onboarding
    Given a new tenant "startup-co" completes onboarding
    When the tenant's first discovery scan runs
    Then a Meilisearch index "services_startup-co" is created
    And the index is configured with searchable attributes: name, description, owner, tags
    And filterable attributes: source, health_status, owner, region

  Scenario: Index corruption is detected and rebuilt
    Given the Meilisearch index for tenant "acme" returns inconsistent results
    When the health check detects index corruption (checksum mismatch)
    Then an index rebuild is triggered
    And the index is rebuilt from the PostgreSQL catalog (source of truth)
    And during rebuild, search falls back to PostgreSQL full-text search
    And users see a banner: "Search index is being rebuilt, results may be incomplete"
    And the rebuild completes within 5 minutes for up to 10,000 services

  Scenario: Meilisearch index rebuild does not affect other tenants
    Given tenant "acme"'s index is being rebuilt
    When tenant "globex" performs a search
    Then globex's search is unaffected
    And globex's index is not touched during acme's rebuild

  Scenario: Search ranking is configured correctly
    Given the Meilisearch index has ranking rules configured
    When a user searches for "api"
    Then results are ranked by: words, typo, proximity, attribute, sort, exactness
    And services with "api" in the name rank higher than those with "api" in description
    And recently updated services rank higher among equal-relevance results
```

---

## Epic 5: Dashboard UI

### Story 5.1 — Service Catalog Browse

```gherkin
Feature: Service Catalog Browse UI

  Background:
    Given the user is authenticated and on the catalog page

  Scenario: Catalog page loads with service cards
    Given the tenant has 42 services in the catalog
    When the catalog page loads
    Then 20 service cards are displayed (first page)
    And each card shows: service name, owner, health status badge, source icon, last updated
    And a pagination control shows "Page 1 of 3"

  Scenario: Progressive disclosure — card expands on hover
    Given a service card for "payment-api" is visible
    When the user hovers over the card
    Then the card expands to show additional details:
      | Field       | Value                 |
      | Description | Handles payment flows |
      | Region      | us-east-1             |
      | On-call     | alice@acme.com        |
      | Tech stack  | Node.js, Docker       |
    And the expansion animation completes within 200ms

  Scenario: Filter services by owner
    Given the catalog has services from multiple teams
    When the user selects filter "owner: payments-team"
    Then only services owned by "payments-team" are shown
    And the URL updates with ?owner=payments-team
    And the filter chip is shown as active

  Scenario: Filter services by health status
    Given some services have health_status: "healthy", "degraded", "unknown"
    When the user selects filter "status: degraded"
    Then only degraded services are shown
    And the count badge updates to reflect filtered results

  Scenario: Filter services by source
    Given services come from "aws-ecs", "aws-lambda", "github"
    When the user selects filter "source: aws-lambda"
    Then only Lambda functions are shown in the catalog

  Scenario: Sort services by last updated
    Given the catalog has services with various updated_at timestamps
    When the user selects sort "Last Updated (newest first)"
    Then services are sorted with most recently updated first
    And the sort selection persists across page navigation

  Scenario: Empty catalog state
    Given the tenant has just onboarded and has 0 services
    When the catalog page loads
    Then an empty state is shown: "No services discovered yet"
    And a CTA button: "Run Discovery Scan" is prominently displayed
    And a link to the onboarding guide is shown
```

### Story 5.2 — Service Detail Drawer

```gherkin
Feature: Service Detail Drawer

  Scenario: Open service detail drawer
    Given the user clicks on service card "checkout-api"
    When the drawer opens
    Then it slides in from the right within 300ms
    And displays full service details:
      | Section   | Content                             |
      | Header    | Service name, health badge, source  |
      | Ownership | Team name, on-call engineer         |
      | Metadata  | Region, runtime, version, tags      |
      | Health    | Scorecard with metrics              |
      | Links     | GitHub repo, PagerDuty, AWS console |
    And the main catalog page remains visible behind the drawer

  Scenario: Drawer URL is shareable
    Given the drawer is open for "checkout-api" (id: svc-123)
    When the user copies the URL
    Then the URL is /catalog?service=svc-123
    And sharing this URL opens the catalog with the drawer pre-opened

  Scenario: Close drawer with Escape key
    Given the service detail drawer is open
    When the user presses Escape
    Then the drawer closes
    And focus returns to the service card that was clicked

  Scenario: Navigate between services in drawer
    Given the drawer is open for "checkout-api"
    When the user presses the right arrow or clicks "Next service"
    Then the drawer transitions to the next service in the current filtered/sorted list
    And the URL updates accordingly

  Scenario: Drawer shows stale data warning
    Given service "legacy-api" has status: "stale" (not seen in last scan)
    When the drawer opens for "legacy-api"
    Then a warning banner shows: "This service was not found in the last scan (3 days ago)"
    And a "Re-run scan" button is available

  Scenario: Edit service owner from drawer
    Given the user has editor permissions
    When they click "Edit owner" in the drawer
    Then an inline edit field appears
    And they can type a new owner name with autocomplete from known teams
    And saving updates the catalog immediately
    And the drawer reflects the new owner without a full page reload
```

### Story 5.3 — Cmd+K Search in UI

```gherkin
Feature: Cmd+K Search UI Integration

  Scenario: Search modal shows categorized results
    Given the user types "pay" in the Cmd+K search modal
    When results are returned
    Then they are grouped by category:
      | Category | Examples                       |
      | Services | payment-api, payment-processor |
      | Teams    | payments-team                  |
      | Actions  | Run discovery scan             |
    And keyboard navigation works between categories

  Scenario: Search modal shows recent searches on open
    Given the user has previously searched for "auth", "billing", "gateway"
    When the user opens Cmd+K without typing
    Then recent searches are shown as suggestions
    And clicking a suggestion populates the search input

  Scenario: Search result keyboard navigation
    Given the search modal is open with 5 results
    When the user presses the down arrow key
    Then the first result is highlighted
    And pressing down again highlights the second result
    And pressing Enter navigates to the highlighted result

  Scenario: Search modal is accessible
    Given the search modal is open
    Then it has role="dialog" and aria-label="Search services"
    And the input has aria-label="Search"
    And results have role="listbox" with role="option" items
    And screen readers announce result count changes
```

---

## Epic 6: Analytics Dashboards

### Story 6.1 — Ownership Coverage Dashboard

```gherkin
Feature: Ownership Coverage Analytics

  Scenario: Display ownership coverage percentage
    Given the tenant has 100 services in the catalog
    And 75 services have a confirmed owner
    And 25 services have owner: null or "unknown"
    When the analytics dashboard loads
    Then the ownership coverage metric shows 75%
    And a donut chart shows the breakdown: 75 owned, 25 unowned
    And a trend line shows coverage change over the last 30 days

  Scenario: Ownership coverage by team
    Given multiple teams own services
    When the ownership breakdown table is rendered
    Then it shows each team with their service count and percentage
    And teams are sorted by service count descending
    And clicking a team filters the catalog to that team's services

  Scenario: Ownership coverage drops below threshold
    Given the ownership coverage threshold is set to 80%
    And coverage drops from 82% to 76% after a scan
    When the dashboard refreshes
    Then a warning alert is shown: "Ownership coverage below 80% threshold"
    And the affected unowned services are listed
    And an email notification is sent to the tenant admin

  Scenario: Zero unowned services
    Given all 50 services have confirmed owners
    When the ownership dashboard loads
    Then coverage shows 100%
    And a success state is shown: "All services have owners"
    And no warning alerts are displayed
```

### Story 6.2 — Service Health Scorecards

```gherkin
Feature: Service Health Scorecards

  Scenario: Health scorecard is calculated for a service
    Given service "payment-api" has the following signals:
      | Signal             | Value  | Score |
      | Has owner          | Yes    | 20    |
      | Has on-call policy | Yes    | 20    |
      | Has documentation  | Yes    | 15    |
      | Recent deployment  | 2 days | 15    |
      | No open P1 alerts  | True   | 30    |
    When the health scorecard is computed
    Then the overall score is 100/100
    And the health_status is "healthy"

  Scenario: Service with missing ownership scores lower
    Given service "orphan-lambda" has:
      | Signal             | Value   | Score |
      | Has owner          | No      | 0     |
      | Has on-call policy | No      | 0     |
      | Has documentation  | No      | 0     |
      | Recent deployment  | 90 days | 5     |
      | No open P1 alerts  | True    | 30    |
    When the health scorecard is computed
    Then the overall score is 35/100
    And the health_status is "at-risk"
    And the scorecard highlights the missing signals as improvement actions

  Scenario: Health scorecard trend over time
    Given "checkout-api" has had weekly scorecard snapshots for 8 weeks
    When the service detail drawer shows the health trend
    Then a sparkline chart shows the score history
    And the trend direction (improving/declining) is indicated

  Scenario: Team-level health KPI aggregation
    Given "payments-team" owns 10 services with scores: [90, 85, 70, 95, 60, 80, 75, 88, 92, 65]
    When the team KPI dashboard is rendered
    Then the team average score is 80/100
    And the lowest-scoring service is highlighted for attention
    And the team score is compared to the org average

  Scenario: Health scorecard for stale service
    Given service "legacy-api" has status: "stale"
    When the scorecard is computed
    Then the "Recent deployment" signal scores 0
    And a penalty is applied for being stale
    And the overall score reflects the staleness
```

### Story 6.3 — Tech Debt Tracking

```gherkin
Feature: Tech Debt Tracking Dashboard

  Scenario: Identify services using deprecated runtimes
    Given the portal has a policy: "Node.js < 18 is deprecated"
    And 5 services use Node.js 16
    When the tech debt dashboard loads
    Then it shows 5 services flagged for "deprecated runtime"
    And each flagged service links to its catalog entry
    And the total tech debt score is calculated

  Scenario: Track services with no recent deployments
    Given the policy threshold is "no deployment in 90 days = tech debt"
    And 8 services have not been deployed in over 90 days
    When the tech debt dashboard loads
    Then these 8 services appear in the "stale deployments" category
    And the owning teams are notified via the dashboard

  Scenario: Tech debt trend over time
    Given tech debt metrics have been tracked for 12 weeks
    When the trend chart is rendered
    Then it shows weekly tech debt item counts
    And highlights weeks where debt increased significantly
    And shows the net change (items resolved vs. items added)
```

---
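The pagination expectations above (Story 3.3: 42 services, page 2 with limit 20 returns services 21-40; Story 5.1: "Page 1 of 3") reduce to a few lines of arithmetic. A minimal Python sketch; the function name and return shape are illustrative, not the portal's actual API:

```python
import math

def paginate(total_count: int, page: int = 1, limit: int = 20) -> dict:
    """Compute the pagination metadata the API returns alongside each page."""
    total_pages = math.ceil(total_count / limit) if total_count else 1
    start = (page - 1) * limit           # 0-based slice start
    end = min(start + limit, total_count)
    return {
        "page": page,
        "limit": limit,
        "total_count": total_count,
        "total_pages": total_pages,
        "slice": (start, end),           # items[start:end] = this page
    }
```

For 42 services, `paginate(42, page=2)` yields `total_pages` of 3 and the slice `(20, 40)`, i.e. the 21st through 40th services, matching the scenario.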
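The tenant-scoped cache-aside flow in Stories 3.4 and 4.2 (key scheme `search:<tenant>:<prefix>`, 60-second TTL, tenant-wide flush on bulk updates) can be sketched without a real Redis. The in-memory store below is a stand-in for Redis, and the class and function names are hypothetical; only the key scheme and TTL come from the spec:

```python
import time

CACHE_TTL_SECONDS = 60  # per Story 4.2

class PrefixCache:
    """In-memory stand-in for the Redis prefix cache, keyed search:<tenant>:<prefix>."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    @staticmethod
    def key(tenant: str, prefix: str) -> str:
        return f"search:{tenant}:{prefix}"

    def get(self, tenant, prefix, now=None):
        now = time.time() if now is None else now
        k = self.key(tenant, prefix)
        entry = self._store.get(k)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:        # TTL elapsed: treat as a miss
            del self._store[k]
            return None
        return value

    def set(self, tenant, prefix, value, now=None):
        now = time.time() if now is None else now
        self._store[self.key(tenant, prefix)] = (now + CACHE_TTL_SECONDS, value)

    def flush_tenant(self, tenant):
        """Drop every search key for one tenant (bulk catalog update)."""
        doomed = [k for k in self._store if k.startswith(f"search:{tenant}:")]
        for k in doomed:
            del self._store[k]

def search(cache, tenant, query, backend):
    """Cache-aside: serve from cache on a hit, else query the engine and cache it."""
    cached = cache.get(tenant, query)
    if cached is not None:
        return cached
    results = backend(tenant, query)   # stand-in for the Meilisearch query
    cache.set(tenant, query, results)  # fresh 60s TTL
    return results
```

Because the tenant slug is part of the key, "acme" and "globex" searching the same prefix never share entries, and flushing one tenant leaves the other's cache warm.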
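The scorecard arithmetic in Story 6.2 can be sketched directly from the scenario tables. The signal weights (20/20/15/15/30) are taken from the spec; the deployment-recency decay (15 points within 30 days, 5 up to 90 days, 0 beyond) and the status cutoffs are assumptions chosen to reproduce both worked examples, not the portal's confirmed rules:

```python
# Weights per the Story 6.2 scenario tables.
SIGNAL_WEIGHTS = {
    "has_owner": 20,
    "has_on_call_policy": 20,
    "has_documentation": 15,
    "recent_deployment": 15,
    "no_open_p1_alerts": 30,
}

def deployment_score(days_since_deploy):
    """Assumed decay: recent deploys earn full points, old ones taper to zero."""
    if days_since_deploy is None:
        return 0
    if days_since_deploy <= 30:
        return 15
    if days_since_deploy <= 90:
        return 5   # matches the "90 days -> 5" row for orphan-lambda
    return 0

def health_score(signals: dict) -> int:
    score = 0
    for name, weight in SIGNAL_WEIGHTS.items():
        if name == "recent_deployment":
            score += deployment_score(signals.get("days_since_deploy"))
        elif signals.get(name):
            score += weight
    return score

def health_status(score: int) -> str:
    """Assumed thresholds consistent with the scenarios' labels."""
    if score >= 80:
        return "healthy"
    if score >= 50:
        return "degraded"
    return "at-risk"
```

With these rules, "payment-api" (all signals present, deployed 2 days ago) scores 100 and is "healthy", while "orphan-lambda" (no owner, on-call, or docs; deployed 90 days ago; no P1 alerts) scores 35 and is "at-risk", matching both scenarios.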
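The ownership-coverage metric in Story 6.1 is a simple ratio; the subtlety is that both `null` and the literal string "unknown" count as unowned. A minimal sketch under that reading; the helper names and the treatment of empty strings are assumptions:

```python
def ownership_coverage(services):
    """Return (coverage_pct, owned_count, unowned_count) for a list of service dicts."""
    owned = sum(1 for s in services if s.get("owner") not in (None, "", "unknown"))
    total = len(services)
    pct = round(100 * owned / total) if total else 0
    return pct, owned, total - owned

def coverage_warnings(services, threshold_pct=80):
    """Emit the below-threshold warning the dashboard shows (80% default per spec)."""
    pct, _, _ = ownership_coverage(services)
    if pct < threshold_pct:
        return [f"Ownership coverage below {threshold_pct}% threshold ({pct}%)"]
    return []
```

For the 100-service scenario (75 owned, 25 `null`/"unknown") this yields 75% and triggers the warning; a fully owned catalog yields 100% and no warnings.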