Max Mayfield 5ee95d8b13 dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00


dd0c/portal — V1 MVP Epics & Stories

This document outlines the complete set of Epics for the V1 MVP of dd0c/portal, a Lightweight Internal Developer Portal. The scope is strictly limited to the "5-Minute Miracle" auto-discovery from AWS and GitHub, Cmd+K search, basic catalog UI, Slack bot, and self-serve PLG onboarding. No AI agent, no GitLab, no scorecards in V1.


Epic 1: AWS Discovery Engine

Description: Build the core AWS scanning capability using a read-only cross-account IAM role. This engine must enumerate CloudFormation stacks, ECS services, Lambda functions, API Gateway APIs, and RDS instances, and group them into inferred "services" based on naming conventions and tags.

User Stories

Story 1.1: Cross-Account Role Assumption As a Platform Engineer, I want the discovery engine to securely assume a read-only IAM role in my AWS account using an External ID, so that my infrastructure data can be scanned without sharing long-lived credentials.

  • Acceptance Criteria:
    • System successfully assumes cross-account role using AWS STS.
    • Role assumption enforces a tenant-specific ExternalId.
    • Failure to assume role surfaces a clear error (e.g., "Role not found" or "Invalid ExternalId").
  • Estimate: 2
  • Dependencies: None
  • Technical Notes: Use boto3 STS client. Ensure the Step Functions orchestrator passes the correct tenant configuration.
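
A minimal sketch of the role-assumption step, assuming boto3; the helper names and the error-message mapping are illustrative, not part of the spec:

```python
# Hypothetical helpers for Story 1.1. Keeping the parameter builder and
# error classifier pure makes the error-surface AC testable without AWS.

def build_assume_role_params(role_arn: str, external_id: str) -> dict:
    """Parameters for boto3's sts.assume_role call."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": "dd0c-discovery",
        "ExternalId": external_id,   # tenant-specific, enforced by trust policy
        "DurationSeconds": 900,      # short-lived: one scan per session
    }

def classify_assume_role_error(aws_error_code: str) -> str:
    """Map STS error codes to the user-facing messages in the ACs."""
    if aws_error_code == "AccessDenied":
        return "Invalid ExternalId or trust policy mismatch"
    if aws_error_code in ("NoSuchEntity", "ValidationError"):
        return "Role not found"
    return "Unexpected STS error: " + aws_error_code

def assume_tenant_role(role_arn: str, external_id: str):
    """Actual call; requires boto3 and AWS credentials at runtime."""
    import boto3  # imported lazily so the pure helpers stay testable
    from botocore.exceptions import ClientError
    sts = boto3.client("sts")
    try:
        return sts.assume_role(**build_assume_role_params(role_arn, external_id))
    except ClientError as err:
        raise RuntimeError(classify_assume_role_error(err.response["Error"]["Code"]))
```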

Story 1.2: CloudFormation & Tag Scanner As a System, I want to scan CloudFormation stacks and tagged Resource Groups, so that I can group individual AWS resources into high-confidence "services".

  • Acceptance Criteria:
    • System extracts stack names and maps them to service names (Confidence: 0.95).
    • System extracts service, team, or project tags from resources.
    • Lists resources belonging to each stack/tag group.
  • Estimate: 3
  • Dependencies: Story 1.1
  • Technical Notes: Use cloudformation:DescribeStacks and resourcegroupstaggingapi:GetResources. Parallelize across regions if specified.
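
The stack-to-service mapping could start as a simple suffix-stripping heuristic; the suffix list and function name below are assumptions for illustration, while the 0.95 confidence comes from the AC:

```python
import re

# Hypothetical helper: derive a service name from a CloudFormation stack
# name by stripping common environment/stage suffixes.
ENV_SUFFIXES = re.compile(r"-(prod|production|staging|stage|dev|test)$")

def service_from_stack(stack_name: str) -> tuple:
    """Return (inferred service name, confidence) for a stack name."""
    name = ENV_SUFFIXES.sub("", stack_name.lower())
    return name, 0.95
```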

Story 1.3: Compute Resource Enumeration (ECS & Lambda) As a System, I want to list all ECS services and Lambda functions, so that I can discover deployable compute units.

  • Acceptance Criteria:
    • System lists all ECS clusters, services, and task definitions (extracting container images).
    • System lists all Lambda functions and their API Gateway event source mappings.
    • Standalone compute resources without a CFN stack are still captured as potential services.
  • Estimate: 5
  • Dependencies: Story 1.1
  • Technical Notes: Requires pagination handling. Lambda cold starts for this Python scanner are acceptable. Output mapped payload to SQS for the Reconciler.
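
The pagination requirement amounts to a token loop; the sketch below uses a stub in place of a real AWS list call so the control flow is clear:

```python
# Token-based pagination loop of the kind the ECS/Lambda scanners need.
# list_page stands in for a paginated AWS API call (e.g. listing services);
# fake_api is a stub so the control flow runs without AWS.
def paginate(list_page, page_size=100):
    """Yield every item across pages until nextToken is exhausted."""
    token = None
    while True:
        page = list_page(next_token=token, max_results=page_size)
        yield from page["items"]
        token = page.get("nextToken")
        if not token:
            break

def fake_api(next_token=None, max_results=100):
    """Stub API serving 250 items in pages of max_results."""
    start = int(next_token or 0)
    items = list(range(start, min(start + max_results, 250)))
    nxt = str(start + max_results) if start + max_results < 250 else None
    return {"items": items, "nextToken": nxt}
```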

Story 1.4: Database Enumeration (RDS) As a System, I want to list RDS instances, so that I can associate data stores with their consuming services.

  • Acceptance Criteria:
    • System lists RDS instances and their tags.
    • Maps databases to services based on naming prefixes or CFN stack membership.
  • Estimate: 2
  • Dependencies: Story 1.1, Story 1.2
  • Technical Notes: Use rds:DescribeDBInstances. These are marked as infrastructure attached to a service, rather than services themselves.

Epic 2: GitHub Discovery

Description: Implement the org-wide GitHub scanning to extract repositories, primary languages, CODEOWNERS, README content, and GitHub Actions deployments. Cross-reference this with the AWS scanner output.

User Stories

Story 2.1: Org & Repo Scanner As a System, I want to use the GitHub GraphQL API to list all repositories, their primary language, and commit history, so that I can map the codebase landscape.

  • Acceptance Criteria:
    • System lists all active (non-archived, non-forked) repos in the connected org.
    • System extracts primary language and top 5 committers per repo.
    • GraphQL queries are batched to avoid rate limits (up to 100 repos per call).
  • Estimate: 3
  • Dependencies: None
  • Technical Notes: Implement using octokit in a Node.js Lambda function.

Story 2.2: CODEOWNERS & README Extraction As a System, I want to extract and parse the CODEOWNERS and README files from the default branch, so that I can infer service ownership and descriptions.

  • Acceptance Criteria:
    • System parses HEAD:CODEOWNERS and extracts mapped GitHub teams or individuals.
    • System parses HEAD:README.md and extracts the first descriptive paragraph.
    • If a team is listed in CODEOWNERS, it becomes a candidate for service ownership (Weight: 0.40).
  • Estimate: 3
  • Dependencies: Story 2.1
  • Technical Notes: The GraphQL expression HEAD:CODEOWNERS retrieves blob text efficiently in the same repo query.
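
CODEOWNERS parsing can stay minimal for V1; this sketch (not the production parser) extracts @team and @user entries per pattern and skips comments:

```python
def parse_codeowners(text: str) -> dict:
    """Map each CODEOWNERS pattern to its list of @-prefixed owners."""
    owners = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        pattern, *entries = line.split()
        owners[pattern] = [e for e in entries if e.startswith("@")]
    return owners
```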

Story 2.3: CI/CD Target Extraction As a System, I want to scan .github/workflows/deploy.yml for deployment targets, so that I can link repositories to specific AWS infrastructure (ECS/Lambda).

  • Acceptance Criteria:
    • System parses workflow YAML to find deployment actions (e.g., aws-actions/amazon-ecs-deploy-task-definition).
    • System links the repo to a discovered AWS service if the task definition name matches.
  • Estimate: 5
  • Dependencies: Story 2.1, Epic 1
  • Technical Notes: This is crucial for cross-referencing in the Reconciliation Engine.

Story 2.4: Reconciliation & Ownership Inference Engine As a System, I want to merge the AWS infrastructure payloads and the GitHub repository payloads into a unified service entity and calculate a confidence score for ownership, so that the catalog presents a single authoritative record per service.

  • Acceptance Criteria:
    • AWS resources are deduplicated and mapped to corresponding GitHub repos based on tags, deploy workflows, or fuzzy naming (e.g., "payment-api" matches "payment-service").
    • Ownership is scored based on CODEOWNERS, Git blame frequency, and tags.
    • Final merged entity is pushed to PostgreSQL as a "Service".
  • Estimate: 8
  • Dependencies: Story 1.2, 1.3, 2.1, 2.2, 2.3
  • Technical Notes: This runs in a Node.js Lambda triggered by the Step Functions orchestrator. It processes batches from the SQS queues holding AWS and GitHub scan results.
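
The fuzzy-naming rule from the first AC could look like the following; the suffix list and 0.85 threshold are assumptions:

```python
from difflib import SequenceMatcher

# Common deploy-unit suffixes to normalize away before comparing names.
SUFFIXES = ("-api", "-service", "-svc", "-app", "-worker")

def normalize(name: str) -> str:
    name = name.lower()
    for s in SUFFIXES:
        if name.endswith(s):
            return name[: -len(s)]
    return name

def names_match(aws_name: str, repo_name: str, threshold: float = 0.85) -> bool:
    """True if the normalized names are equal or very similar."""
    a, b = normalize(aws_name), normalize(repo_name)
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold
```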

Epic 3: Service Catalog

Description: Implement the primary datastore (Aurora Serverless v2 PostgreSQL), the Service API (CRUD), ownership mapping logic, and manual enrichment endpoints. This is the source of truth for the entire platform.

User Stories

Story 3.1: Core Catalog Schema Setup As a System, I want a PostgreSQL relational database to store tenants, services, users, and teams, so that the discovered catalog data is durably stored.

  • Acceptance Criteria:
    • Create the schema per the architecture (tenants, users, connections, services, teams, service_ownership, discovery_runs).
    • Apply multi-tenant Row-Level Security (RLS) on all core tables using tenant_id.
  • Estimate: 3
  • Dependencies: None
  • Technical Notes: Use Prisma or raw SQL migrations. Implement API middleware to set tenant_id on every request.

Story 3.2: Service Ownership Model Implementation As a System, I want a service_ownership table and scoring logic, so that multiple ownership signals (CODEOWNERS vs Git Blame vs CFN tags) can be tracked and weighted.

  • Acceptance Criteria:
    • System stores multiple candidate teams per service with ownership types and confidence scores.
    • The highest-confidence team becomes the primary owner.
    • If top scores are tied or under 0.50, flag the service as "ambiguous" or "unowned".
  • Estimate: 5
  • Dependencies: Story 3.1
  • Technical Notes: Implement scoring algorithm in Python Lambda and map back to PostgreSQL.
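
A sketch of the scoring logic, using the 0.40 CODEOWNERS weight mentioned in Epic 2; the other weights and the status labels are assumptions for illustration:

```python
WEIGHTS = {"codeowners": 0.40, "cfn_tags": 0.35, "git_blame": 0.25}

def resolve_owner(signals):
    """signals: list of (team, signal_type). Returns (primary_team, status),
    flagging ties and scores under 0.50 per the acceptance criteria."""
    scores = {}
    for team, signal in signals:
        scores[team] = scores.get(team, 0.0) + WEIGHTS.get(signal, 0.0)
    if not scores:
        return None, "unowned"
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_team, top_score = ranked[0]
    tied = len(ranked) > 1 and abs(ranked[1][1] - top_score) < 1e-9
    if tied or top_score < 0.50:
        return None, "ambiguous"
    return top_team, "owned"
```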

Story 3.3: Manual Correction API As an Engineer, I want to manually override the inferred owner or description of a service, so that I can fix incorrect auto-discovered data.

  • Acceptance Criteria:
    • PATCH /api/v1/services/{service_id} allows updates to team_id, description, etc.
    • Manual corrections override inferred data with a confidence score of 1.00 (source="user_correction").
    • The correction is saved in the corrections log table.
  • Estimate: 3
  • Dependencies: Story 3.1
  • Technical Notes: The update should trigger an async background job (SQS) to update the Meilisearch index immediately.
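
The override semantics (confidence 1.00, source "user_correction") might be applied like this; field names beyond the AC are illustrative:

```python
from datetime import datetime, timezone

ALLOWED_FIELDS = {"team_id", "description"}  # per the PATCH contract

def apply_correction(service: dict, patch: dict, user_id: str) -> dict:
    """Apply a manual override; corrections win with confidence 1.00."""
    updated = dict(service)
    for field, value in patch.items():
        if field not in ALLOWED_FIELDS:
            raise ValueError("field not patchable: " + field)
        updated[field] = value
        updated[field + "_source"] = "user_correction"
        updated[field + "_confidence"] = 1.00
    updated["corrected_by"] = user_id
    updated["corrected_at"] = datetime.now(timezone.utc).isoformat()
    return updated
```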

Story 3.4: PagerDuty / OpsGenie Integration As an Engineering Manager, I want to link my team to a PagerDuty schedule, so that the service catalog shows the current on-call engineer for each service.

  • Acceptance Criteria:
    • The API allows saving an encrypted PagerDuty API key per tenant.
    • The system maps PagerDuty schedules to inferred Teams.
    • GET /api/v1/services/{service_id} returns the active on-call individual.
  • Estimate: 5
  • Dependencies: Story 3.1
  • Technical Notes: Credentials must be stored in AWS Secrets Manager using KMS. Use the PagerDuty GET /schedules API endpoint.

Epic 4: Search Engine

Description: Deploy Meilisearch and a Redis cache to support a <100ms Cmd+K search bar for the entire portal. The search must be typo-tolerant and strictly isolate each tenant's data.

User Stories

Story 4.1: Meilisearch Index Sync As a System, I want to sync service entities from PostgreSQL to Meilisearch, so that I have a fast, full-text index for the UI.

  • Acceptance Criteria:
    • On every service upsert in PostgreSQL, publish to SQS.
    • A Lambda consumes the SQS queue and updates the Meilisearch index via addDocuments.
    • The index configuration is applied (searchable attributes, filterable attributes, typo-tolerance enabled).
  • Estimate: 5
  • Dependencies: Story 3.1
  • Technical Notes: The index sync must map relational data to a flat JSON structure.
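
Flattening the relational row for Meilisearch could look like the following; the attribute set is an assumption, apart from tenant_id, which the search API filters on for isolation:

```python
def to_search_document(service: dict, team, tenant_id: str) -> dict:
    """Flatten a relational Service row (plus joined team) into the flat
    JSON shape Meilisearch indexes."""
    return {
        "id": service["id"],
        "tenant_id": tenant_id,          # filterable: enforced at query time
        "name": service["name"],
        "description": service.get("description", ""),
        "owner_team": team["name"] if team else None,
        "language": service.get("language"),
        "tags": service.get("tags", []),
    }
```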

Story 4.2: Cmd+K Search Endpoint & Security As an Engineer, I want a fast /api/v1/services/search endpoint that queries Meilisearch, so that I can instantly find services.

  • Acceptance Criteria:
    • API receives the search query and proxies it to Meilisearch.
    • API enforces tenant isolation by injecting tenant_id = '{current_tenant}' filter.
    • Search returns results in <100ms.
  • Estimate: 3
  • Dependencies: Story 4.1
  • Technical Notes: Implement in Node.js Fastify.

Story 4.3: Prefix Caching with Redis As a System, I want to cache the most common searches in Redis, so that I can return results in <10ms for repeated queries.

  • Acceptance Criteria:
    • Cache identical query prefixes per tenant in ElastiCache Redis.
    • Set TTL to 5 minutes or invalidate on service upserts.
    • Redis cache miss falls back to Meilisearch.
  • Estimate: 2
  • Dependencies: Story 4.2
  • Technical Notes: ElastiCache Serverless pricing scales with actual usage. Use normalized queries as the cache key.
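
Query normalization for the cache key might look like this; keys are namespaced per tenant so cached results can never leak across tenants:

```python
CACHE_TTL_SECONDS = 300  # 5 minutes, per the acceptance criteria

def cache_key(tenant_id: str, query: str) -> str:
    """Normalized per-tenant cache key: lowercase, collapsed whitespace."""
    normalized = " ".join(query.lower().split())
    return "search:" + tenant_id + ":" + normalized
```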

Epic 5: Service Cards UI

Description: Build the React Single Page Application (SPA) providing a fast, scannable catalog interface, Cmd+K search dialog, and detailed service cards.

User Stories

Story 5.1: SPA Framework & Routing Setup As an Engineer, I want the portal built in React with Vite and TailwindCSS, so that the UI is fast and responsive.

  • Acceptance Criteria:
    • Setup React + Vite + React Router.
    • Deploy to S3 and CloudFront with edge caching.
    • Implement a basic authentication context tied to GitHub OAuth.
  • Estimate: 2
  • Dependencies: None
  • Technical Notes: The frontend state must be minimal, relying heavily on SWR or React Query for data fetching.

Story 5.2: The Cmd+K Modal As an Engineer, I want to press Cmd+K to open a global search bar, so that I can find a service or team instantly from anywhere.

  • Acceptance Criteria:
    • Pressing Cmd+K (or Ctrl+K) opens a modal overlay.
    • Input debounces keystrokes by 150ms before calling /api/v1/services/search.
    • Arrow keys navigate search results, Enter selects a service.
  • Estimate: 5
  • Dependencies: Story 4.2, Story 5.1
  • Technical Notes: Use an accessible dialog library like Radix UI or Headless UI. Ensure ARIA labels are correct.

Story 5.3: Scannable Service List View As an Engineering Director, I want to view all my organization's services in a single-line-per-service table, so that I can quickly review ownership and health.

  • Acceptance Criteria:
    • The default dashboard displays a paginated or infinite-scroll table of services.
    • Columns: Name, Owner (with confidence badge), Health, Last Deploy, Repo Link.
    • The table is filterable by team, health, and tech stack.
  • Estimate: 5
  • Dependencies: Story 3.1, Story 5.1
  • Technical Notes: Use a lightweight table component (e.g., TanStack Table).

Story 5.4: Service Detail View (Progressive Disclosure) As an Engineer, I want to click on a service row to see its full details in a slide-over panel or expanded card, so that I can dive deeper when necessary.

  • Acceptance Criteria:
    • Clicking a service expands an in-page panel showing full details.
    • Tabs separate data: Overview (description, stack), Infra (AWS ARNs), On-Call, Corrections.
    • There is a "Correct" button next to the inferred owner or description.
  • Estimate: 8
  • Dependencies: Story 3.4, Story 5.1, Story 5.3
  • Technical Notes: Lazy-load tab contents from the API (GET /api/v1/services/{service_id}) to minimize initial render payload.

Epic 6: Dashboard & Overview

Description: Implement the org-wide and team-specific dashboards that provide aggregate health, ownership matrix, and discovery status.

User Stories

Story 6.1: The Director Dashboard As an Engineering Director, I want an org-wide view of my total services, teams, unowned services, and discovery accuracy, so that I can report to leadership and ensure compliance.

  • Acceptance Criteria:
    • The dashboard displays four summary KPIs: Total Services, Total Teams, Unowned Services, Accuracy Trend (week-over-week).
    • Contains a specific card showing "Services needing review".
    • A "Recent Activity" feed lists the latest deploys or ownership changes.
  • Estimate: 5
  • Dependencies: Story 3.1, Story 5.1
  • Technical Notes: The API needs a new /api/v1/dashboard/stats endpoint to compute these KPIs efficiently using PostgreSQL aggregations.

Story 6.2: "My Team" Dashboard Focus As an Engineer, I want the dashboard to default to showing services owned by my team, so that I don't have to filter out the noise of the entire org.

  • Acceptance Criteria:
    • The UI uses the logged-in user's GitHub ID to find their Team.
    • A "Your Team" section lists only their services with immediate health status.
    • "Recent" section shows the last 5 services the user viewed.
  • Estimate: 3
  • Dependencies: Story 3.1, Story 6.1
  • Technical Notes: Use browser local storage to save the "recent views".

Epic 7: Slack Bot

Description: Build the Slack integration to allow engineers to query the service catalog using /dd0c who owns <service> and /dd0c oncall <service>.

User Stories

Story 7.1: Slack App & OAuth Setup As an Administrator, I want to add a dd0c Slack app to my workspace, so that engineers can use slash commands.

  • Acceptance Criteria:
    • Create the Slack App using @slack/bolt.
    • The API handles the OAuth flow and saves the workspace token to the tenant connections table.
    • The bot is added to a tenant's workspace and receives slash commands.
  • Estimate: 3
  • Dependencies: Story 3.1
  • Technical Notes: The Slack Bot Lambda must verify Slack request signatures and return a 200 OK within 3 seconds.
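
Slack's documented v0 request-signing scheme (HMAC-SHA256 over "v0:{timestamp}:{body}") can be verified with the standard library alone; the function name is illustrative:

```python
import hashlib
import hmac
import time

def verify_slack_signature(signing_secret: str, timestamp: str,
                           body: str, signature: str, now=None) -> bool:
    """Verify a Slack request signature (version v0)."""
    now = time.time() if now is None else now
    if abs(now - float(timestamp)) > 60 * 5:  # replay protection window
        return False
    base = ("v0:" + timestamp + ":" + body).encode()
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest("v0=" + digest, signature)
```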

Story 7.2: Slash Command: /dd0c who owns As an Engineer, I want to type /dd0c who owns <service_name>, so that the bot instantly replies with the owner, repo, and health.

  • Acceptance Criteria:
    • The bot receives the command, extracts the service name, and calls GET /api/v1/services/search.
    • The bot formats the response as a Slack block kit message with the service name, owner team, confidence score, repo link, and a link to the portal.
    • The response is ephemeral unless the user specifies otherwise.
  • Estimate: 3
  • Dependencies: Story 4.2, Story 7.1
  • Technical Notes: Use Meilisearch directly or the API. Ensure the search handles typo-tolerance if the user misspells the service.
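
A sketch of the Block Kit payload; the exact block layout is a design choice for illustration, not mandated by the ACs:

```python
def who_owns_blocks(service: dict, portal_url: str) -> list:
    """Block Kit blocks for the '/dd0c who owns' reply."""
    confidence = "{:.0%}".format(service["confidence"])
    return [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": ("*" + service["name"] + "* is owned by *"
                           + service["owner_team"] + "* ("
                           + confidence + " confidence)")}},
        {"type": "context",
         "elements": [
             {"type": "mrkdwn", "text": "<" + service["repo_url"] + "|Repository>"},
             {"type": "mrkdwn",
              "text": "<" + portal_url + "/services/" + service["id"] + "|Open in dd0c>"},
         ]},
    ]
```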

Story 7.3: Slash Command: /dd0c oncall As an Engineer, I want to type /dd0c oncall <service_name>, so that the bot instantly tells me who is currently on-call for that service.

  • Acceptance Criteria:
    • The bot receives the command, looks up the service, and queries PagerDuty (via the API).
    • The bot returns the active on-call individual and schedule details.
  • Estimate: 3
  • Dependencies: Story 3.4, Story 7.2
  • Technical Notes: If no on-call is configured, the bot returns a friendly error with a link to set it up in the portal.

Epic 8: Infrastructure & DevOps

Description: Implement the AWS infrastructure for the dd0c platform itself using Infrastructure as Code, set up the CI/CD pipeline via GitHub Actions, and deploy the foundational resources including VPC, ECS Fargate clusters, RDS Aurora, and the cross-account IAM role assumption engine.

User Stories

Story 8.1: Core AWS Foundation (VPC, RDS, ElastiCache) As a System, I need a secure VPC and data tier, so that the Portal API and Discovery Engine have durable, isolated storage.

  • Acceptance Criteria:
    • Deploy VPC with public and private subnets.
    • Provision Aurora Serverless v2 PostgreSQL database in private subnets.
    • Provision ElastiCache Redis (Serverless) for caching and sessions.
    • Deploy KMS keys for credential encryption.
  • Estimate: 5
  • Dependencies: None
  • Technical Notes: Use AWS CDK or CloudFormation. Ensure all data stores are encrypted at rest using KMS.

Story 8.2: ECS Fargate Cluster & Portal API Deployment As a System, I need an ECS Fargate cluster running the Portal API and Meilisearch, so that the web application and search engine are highly available.

  • Acceptance Criteria:
    • Create ECS cluster.
    • Deploy Portal API Fargate service behind an Application Load Balancer.
    • Deploy Meilisearch Fargate service with an EFS volume for persistent index storage.
    • Configure auto-scaling rules based on CPU and request count.
  • Estimate: 5
  • Dependencies: Story 8.1
  • Technical Notes: Meilisearch only needs 1 task initially. Use multi-stage Docker builds to keep image sizes small.

Story 8.3: Discovery Engine Serverless Deployment As a System, I need the Step Functions orchestrator and Lambda scanners deployed, so that the auto-discovery pipeline can execute.

  • Acceptance Criteria:
    • Deploy AWS Scanner, GitHub Scanner, Reconciler, and Inference Lambdas.
    • Deploy the Step Functions state machine linking the Lambdas.
    • Provision SQS FIFO queues for discovery events.
  • Estimate: 3
  • Dependencies: Epic 1, Epic 2, Story 8.1
  • Technical Notes: Lambdas must have VPC access to write to Aurora, but need a NAT Gateway to reach the GitHub API.

Story 8.4: CI/CD Pipeline via GitHub Actions As an Engineer, I want automated CI/CD pipelines, so that I can safely build, test, and deploy the platform without manual intervention.

  • Acceptance Criteria:
    • CI workflow runs linting, unit tests, and Trivy container scanning on every PR.
    • CD workflow deploys to staging environment, runs a discovery accuracy smoke test, and waits for manual approval to deploy to production.
    • Deployment updates ECS services and Lambda aliases seamlessly.
  • Estimate: 5
  • Dependencies: Story 8.2, Story 8.3
  • Technical Notes: Keep it simple—use GitHub Actions natively, avoid complex external CD tools for V1.

Story 8.5: Customer IAM Role Template As an Administrator, I want a standardized CloudFormation template to deploy in my AWS account, so that I can easily grant dd0c read-only access.

  • Acceptance Criteria:
    • Template creates an IAM role with a strict read-only policy mapped to dd0c's required services.
    • Trust policy mandates a tenant-specific ExternalId.
    • Template output provides the Role ARN.
    • Template is hosted publicly on S3.
  • Estimate: 2
  • Dependencies: Epic 1
  • Technical Notes: Avoid using ReadOnlyAccess managed policy. Explicitly deny IAM, S3 object access, and Secrets Manager.
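
The trust policy the template must emit would look roughly like this; the helper and the account ID are placeholders:

```python
import json

def trust_policy(dd0c_account_id: str, external_id: str) -> str:
    """Trust policy for the customer role: only dd0c's account may assume
    it, and only when the tenant-specific ExternalId is presented."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::" + dd0c_account_id + ":root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }, indent=2)
```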

Epic 9: Onboarding & PLG

Description: Build the 5-minute self-serve onboarding wizard that drives the Product-Led Growth (PLG) motion. This flow must flawlessly guide the user through GitHub OAuth, AWS connection, and trigger the initial real-time discovery scan.

User Stories

Story 9.1: GitHub OAuth & Tenant Creation As a New User, I want to sign up using my GitHub account, so that I don't have to create a new password and the system can instantly identify my organization.

  • Acceptance Criteria:
    • User clicks "Sign up with GitHub" and authorizes the dd0c GitHub App.
    • System extracts user identity and organization context.
    • System creates a new Tenant and User record in PostgreSQL.
    • JWT session is established and user is routed to the setup wizard.
  • Estimate: 3
  • Dependencies: Story 3.1, Story 5.1
  • Technical Notes: Request minimal scopes initially. Store the installation ID securely.

Story 9.2: Stripe Billing Integration (Free & Team Tiers) As a New User, I want to select a pricing plan, so that I can start using the product within my budget.

  • Acceptance Criteria:
    • Wizard prompts user to select "Free" (up to 10 engineers) or "Team" ($10/engineer/mo).
    • If Team is selected, user is redirected to Stripe Checkout.
    • Webhook listener updates the tenant subscription status upon successful payment.
  • Estimate: 5
  • Dependencies: Story 9.1
  • Technical Notes: Use Stripe Checkout sessions. Keep the webhook Lambda extremely fast to avoid Stripe timeouts.

Story 9.3: AWS Connection Wizard Step As a New User, I want a frictionless way to connect my AWS account, so that the discovery engine can access my infrastructure.

  • Acceptance Criteria:
    • UI displays a one-click CloudFormation deployment link pre-populated with a unique ExternalId.
    • User pastes the generated Role ARN back into the UI.
    • API validates the role via sts:AssumeRole before proceeding.
  • Estimate: 3
  • Dependencies: Story 8.5, Story 9.1
  • Technical Notes: The sts:AssumeRole call validates both the ARN and the ExternalId. Give clear error messages if validation fails.

Story 9.4: Real-Time Discovery WebSocket As a New User, I want to see the discovery engine working in real-time, so that I trust the system and get that "Holy Shit" moment.

  • Acceptance Criteria:
    • API Gateway WebSocket API maintains a connection with the onboarding SPA.
    • Step Functions and Lambdas push progress events (e.g., "Found 47 AWS resources...") to the UI via the WebSocket.
    • When complete, the user is automatically redirected to their populated Service Catalog.
  • Estimate: 5
  • Dependencies: Epic 1, Epic 2, Story 9.3
  • Technical Notes: Implement via API Gateway WebSocket API and a simple broadcasting Lambda.

Epic 10: Transparent Factory Compliance

Description: Cross-cutting epic ensuring dd0c/portal adheres to the 5 Transparent Factory tenets. As an Internal Developer Platform, portal is the control plane for other teams' services — governance and observability are existential requirements, not nice-to-haves.

Story 10.1: Atomic Flagging — Feature Flags for Discovery & Catalog Behaviors

As a solo founder, I want every new auto-discovery heuristic, catalog enrichment, and scorecard rule behind a feature flag (default: off), so that a bad discovery rule doesn't pollute the service catalog with phantom services.

Acceptance Criteria:

  • OpenFeature SDK integrated into the catalog service. V1: JSON file provider.
  • All flags evaluate locally — no network calls during service discovery scans.
  • Every flag has owner and ttl (max 14 days). CI blocks if expired flags remain at 100%.
  • Automated circuit breaker: if a flagged discovery rule creates >5 unconfirmed services in a single scan, the flag auto-disables and the phantom entries are quarantined (not deleted).
  • Flags required for: new discovery sources (GitHub, GitLab, K8s), scorecard criteria, ownership inference rules, template generators.

Estimate: 5 points
Dependencies: Epic 1 (AWS Discovery Engine)
Technical Notes:

  • Quarantine pattern: flagged phantom services get status: quarantined rather than deletion. Allows review before purge.
  • Discovery scans are batch operations — flag check happens once per scan config, not per-service.
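
The circuit-breaker AC reduces to a small pure function; the >5 threshold comes from the AC, while field names are assumptions:

```python
QUARANTINE_THRESHOLD = 5  # from the AC: >5 unconfirmed services per scan

def circuit_break(flag: dict, scan_results: list) -> dict:
    """Auto-disable a discovery flag that created too many phantom
    services in one scan. Phantoms are quarantined, never deleted."""
    unconfirmed = [s for s in scan_results
                   if s["created_by_flag"] == flag["name"] and not s["confirmed"]]
    if len(unconfirmed) > QUARANTINE_THRESHOLD:
        flag = dict(flag, enabled=False, disabled_reason="circuit_breaker")
        for svc in unconfirmed:
            svc["status"] = "quarantined"  # reviewable, not purged
    return flag
```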

Story 10.2: Elastic Schema — Additive-Only for Service Catalog Store

As a solo founder, I want all catalog schema changes to be strictly additive, so that rollbacks never corrupt the service catalog or lose ownership mappings.

Acceptance Criteria:

  • CI rejects any schema change that removes, renames, or changes the type of an existing catalog column.
  • New columns use a _v2 suffix for breaking changes.
  • All service catalog parsers ignore unknown fields (tolerant deserialization).
  • Dual-write during migration windows inside a single database transaction.
  • Every schema change includes sunset_date comment (max 30 days).

Estimate: 3 points
Dependencies: Epic 3 (Service Catalog)
Technical Notes:

  • Service catalog is the source of truth for org topology — schema corruption here cascades to scorecards, ownership, and templates.
  • Version catalog rows with a _v column; use row-level versioning, not table duplication.
  • Ownership mappings are especially sensitive — never overwrite, always append with timestamp.

Story 10.3: Cognitive Durability — Decision Logs for Ownership Inference

As a future maintainer, I want every change to ownership inference algorithms, scorecard weights, or discovery heuristics accompanied by a decision_log.json, so that I understand why service X was assigned to team Y.

Acceptance Criteria:

  • decision_log.json schema: { prompt, reasoning, alternatives_considered, confidence, timestamp, author }.
  • CI requires a decision log for PRs touching the discovery, scoring, or ownership modules.
  • Cyclomatic complexity cap of 10, enforced by the linter; PRs exceeding it are blocked.
  • Decision logs in docs/decisions/.

Estimate: 2 points
Dependencies: None
Technical Notes:

  • Ownership inference is the highest-risk logic — wrong assignments erode trust in the entire platform.
  • Document: "Why CODEOWNERS > git blame frequency > PR reviewer count for ownership signals?"
  • Scorecard weight changes must include before/after examples showing how scores shift.

Story 10.4: Semantic Observability — AI Reasoning Spans on Discovery & Scoring

As a platform engineer debugging a wrong ownership assignment, I want every discovery and scoring decision to emit an OpenTelemetry span with reasoning metadata, so that I can trace why a service was assigned to the wrong team.

Acceptance Criteria:

  • Every discovery scan emits a parent catalog_scan span. Each service evaluation emits child spans: ownership_inference, scorecard_evaluation.
  • Span attributes: catalog.service_id, catalog.ownership_signals (JSON array of signal sources + weights), catalog.confidence_score, catalog.scorecard_result.
  • If AI-assisted inference is used: ai.prompt_hash, ai.model_version, ai.reasoning_chain.
  • Spans export via OTLP. No PII — repo names and team names are hashed in spans.

Estimate: 3 points
Dependencies: Epic 1 (AWS Discovery Engine), Epic 3 (Service Catalog)
Technical Notes:

  • Use the OpenTelemetry SDKs for the Python and Node.js Lambdas. Batch export to avoid per-service overhead during large scans.
  • Ownership inference spans should include ALL signals considered, not just the winning one — this is critical for debugging.

Story 10.5: Configurable Autonomy — Governance for Catalog Mutations

As a solo founder, I want a policy.json that controls whether the platform can auto-update ownership, auto-create services, or only suggest changes, so that teams maintain control over their catalog entries.

Acceptance Criteria:

  • policy.json defines governance_mode: strict (suggest-only, no auto-mutations) or audit (auto-apply with logging).
  • Default: strict. Auto-discovery populates a "pending review" queue rather than directly mutating the catalog.
  • panic_mode: when true, all discovery scans halt, catalog is frozen read-only, and a "maintenance" banner shows in the UI.
  • Per-team governance override: teams can lock their services to strict even if system is in audit mode.
  • All policy decisions logged: "Service X auto-created in audit mode", "Ownership change for Y queued for review in strict mode".

Estimate: 3 points
Dependencies: Epic 1 (AWS Discovery Engine)
Technical Notes:

  • The "pending review" queue is a PostgreSQL table with status = 'pending'. The UI shows a review inbox for platform admins.
  • Per-team locks: a governance_override field on the team_settings table.
  • Panic mode: POST /admin/panic or env var. Catalog API returns 503 for all write operations.
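
The policy decision in the ACs can be captured in one function; the return values and field names are illustrative:

```python
def decide_mutation(policy: dict, team_settings: dict) -> str:
    """Return 'reject', 'queue', or 'apply' for a proposed catalog mutation."""
    if policy.get("panic_mode"):
        return "reject"                                # catalog frozen read-only
    mode = policy.get("governance_mode", "strict")     # strict is the default
    # Per-team override: a team may lock itself to strict even in audit mode.
    if team_settings.get("governance_override") == "strict":
        mode = "strict"
    return "apply" if mode == "audit" else "queue"
```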

Epic 10 Summary

Story   Tenet                    Points
10.1    Atomic Flagging          5
10.2    Elastic Schema           3
10.3    Cognitive Durability     2
10.4    Semantic Observability   3
10.5    Configurable Autonomy    3
Total                            16