
dd0c/portal — Technical Architecture

Product: Lightweight Internal Developer Portal
Phase: 6 — Architecture Design
Date: 2026-02-28
Author: Solutions Architecture
Status: Draft


1. SYSTEM OVERVIEW

High-Level Architecture

graph TB
    subgraph "Customer Environment"
        AWS_ACCOUNT["Customer AWS Account(s)"]
        GH_ORG["GitHub Organization"]
        PD["PagerDuty / OpsGenie"]
    end

    subgraph "dd0c Platform — Control Plane (us-east-1)"
        subgraph "Ingress"
            ALB["Application Load Balancer<br/>+ WAF + CloudFront"]
        end

        subgraph "API Layer"
            API["Portal API<br/>(ECS Fargate)"]
            WS["WebSocket Gateway<br/>(API Gateway v2)"]
        end

        subgraph "Discovery Engine"
            ORCH["Discovery Orchestrator<br/>(Step Functions)"]
            AWS_SCAN["AWS Scanner<br/>(Lambda)"]
            GH_SCAN["GitHub Scanner<br/>(Lambda)"]
            RECONCILER["Reconciliation Engine<br/>(Lambda)"]
            INFERENCE["Ownership Inference<br/>(Lambda)"]
        end

        subgraph "Data Layer"
            PG["PostgreSQL (RDS Aurora Serverless v2)<br/>Service Catalog + Tenants"]
            REDIS["ElastiCache Redis<br/>Session + Cache + Search"]
            S3_DATA["S3<br/>Discovery Snapshots + Exports"]
            SQS["SQS FIFO<br/>Discovery Events"]
        end

        subgraph "Search"
            MEILI["Meilisearch<br/>(ECS Fargate)<br/>Full-text + Faceted Search"]
        end

        subgraph "Integrations"
            SLACK_BOT["Slack Bot<br/>(Lambda)"]
            WEBHOOK_OUT["Outbound Webhooks<br/>(EventBridge → Lambda)"]
        end

        subgraph "Frontend"
            SPA["React SPA<br/>(CloudFront + S3)"]
        end
    end

    subgraph "dd0c Platform Modules"
        DD0C_COST["dd0c/cost"]
        DD0C_ALERT["dd0c/alert"]
        DD0C_RUN["dd0c/run"]
    end

    %% Customer → Platform connections
    AWS_ACCOUNT -- "AssumeRole<br/>(read-only)" --> AWS_SCAN
    GH_ORG -- "OAuth / GitHub App<br/>(read-only)" --> GH_SCAN
    PD -- "API Key<br/>(read-only)" --> API

    %% User flows
    SPA --> ALB --> API
    SPA --> WS

    %% Discovery flow
    ORCH --> AWS_SCAN
    ORCH --> GH_SCAN
    AWS_SCAN --> SQS
    GH_SCAN --> SQS
    SQS --> RECONCILER
    RECONCILER --> INFERENCE
    INFERENCE --> PG
    PG --> MEILI

    %% API reads
    API --> PG
    API --> MEILI
    API --> REDIS

    %% Integrations
    SLACK_BOT --> API
    API --> WEBHOOK_OUT

    %% dd0c platform
    API <-- "Internal API" --> DD0C_COST
    API <-- "Internal API" --> DD0C_ALERT
    API <-- "Internal API" --> DD0C_RUN

Component Inventory

| Component | Responsibility | Technology | Justification |
|---|---|---|---|
| Portal API | REST/GraphQL API for catalog CRUD, search proxy, auth, billing | Node.js (Fastify) on ECS Fargate | Fastify is among the fastest Node frameworks. Fargate eliminates server management. Node aligns with the React frontend for code sharing (types, validation schemas). |
| Discovery Orchestrator | Coordinates multi-source discovery runs; manages the state machine for the scan → reconcile → infer → index pipeline | AWS Step Functions | Native retry/error handling, visual debugging, pay-per-transition. Well suited to long-running multi-step workflows. |
| AWS Scanner | Scans customer AWS accounts via cross-account AssumeRole. Enumerates CloudFormation stacks, ECS services, Lambda functions, API Gateway APIs, RDS instances, and tagged resources. | Python (Lambda) | boto3 is the canonical AWS SDK. Lambda cold starts are acceptable for background scanning (not user-facing). Python's AWS ecosystem is mature and deep. |
| GitHub Scanner | Scans the GitHub org: repos, languages, CODEOWNERS, README content, workflow files, team memberships, recent commit authors. | Node.js (Lambda) | Octokit (the GitHub SDK) is TypeScript-native. Shares types with the API layer. |
| Reconciliation Engine | Merges AWS + GitHub scan results into unified service entities. Deduplicates, cross-references repo → infra mappings, resolves conflicts. | Node.js (Lambda) | Core business logic. Shares domain types with the API. |
| Ownership Inference | Determines service ownership from CODEOWNERS, git blame frequency, team membership, CloudFormation tags, and historical corrections. Produces confidence scores. | Python (Lambda) | Scoring/ML-adjacent logic. Python's data-processing libraries (pandas for frequency analysis) are a strong fit. |
| PostgreSQL | Primary datastore: service catalog, tenant data, user accounts, discovery history, corrections, billing state. | Aurora Serverless v2 | Scales down to its ACU floor during low traffic (solo-founder cost control). The relational model fits the service catalog's structured data. Aurora's auto-scaling handles growth without capacity planning. |
| Redis | Session store, API response cache, rate limiting, real-time search suggestions (prefix trie). | ElastiCache Redis (Serverless) | Sub-millisecond reads for Cmd+K autocomplete. Serverless pricing aligns with variable load. |
| Meilisearch | Full-text search index for Cmd+K. Typo-tolerant, faceted, <50ms response. | Meilisearch on ECS Fargate (single container) | Over Elasticsearch: far simpler to operate (single binary, no JVM, no cluster management), typo tolerance out of the box, <50ms search on 10K documents. A solo founder can't babysit an ES cluster. Over Typesense: better faceted search and a more active open-source community. |
| React SPA | Portal UI: service catalog, Cmd+K search, service detail cards, team directory, correction UI, onboarding wizard. | React + Vite + TailwindCSS, hosted on CloudFront + S3 | SPA for instant Cmd+K interactions without server round-trips for UI state. CloudFront for global edge caching. Vite for fast builds. |
| Slack Bot | Responds to `/dd0c who owns <service>` commands. Passive viral loop. | Node.js (Lambda) via Slack Bolt | Lambda costs nothing when idle. Bolt is Slack's official SDK. |
| WebSocket Gateway | Pushes real-time discovery progress to the UI during onboarding ("Found 47 services... 89 services... 147 services..."). | API Gateway WebSocket API + Lambda | Managed WebSocket infrastructure. Only needed during discovery runs — Lambda scales to zero otherwise. |

Technology Choices — Key Decisions

Why Not Serverless-Everything (Lambda for API)? The Portal API handles Cmd+K search requests that must respond in <100ms. Lambda cold starts (500ms-2s for Node.js) are unacceptable for the primary user interaction. ECS Fargate with minimum 1 task provides warm, consistent latency. Discovery Lambdas are background tasks where cold starts are irrelevant.

Why Meilisearch Over Algolia/Elasticsearch?

  • Algolia: SaaS pricing at scale ($1/1K search requests) becomes expensive with high DAU. Self-hosted Meilisearch is ~$0 marginal cost per search.
  • Elasticsearch: Operational complexity is prohibitive for a solo founder. Requires JVM tuning, cluster management, index lifecycle policies. Meilisearch is a single binary with zero configuration.
  • Meilisearch: Typo-tolerant by default (critical for Cmd+K UX), faceted filtering, <50ms on 100K documents, single Docker container, 200MB RAM for 10K services. Perfect for the scale and operational model.

Why PostgreSQL Over DynamoDB? The service catalog is inherently relational: services belong to teams, teams have members, services have dependencies on other services, services map to repos, repos map to infrastructure. DynamoDB's single-table design would require complex GSIs and denormalization that increases development time. Aurora Serverless v2 scales down to its 0.5 ACU floor (~$43/month at idle) and handles relational queries natively. At the scale of 50-1000 services per tenant, PostgreSQL is more than sufficient.

Why Not a Graph Database for Dependencies (V1)? Service dependency graphs are a V1.1 feature. For V1, dependencies are stored as adjacency lists in PostgreSQL (service_dependencies join table). This is sufficient for "what does this service depend on?" queries. A dedicated graph database (Neptune at $0.10/hour minimum = $73/month, or Neo4j) is premature optimization for V1. If dependency visualization becomes a core feature in V1.1+, evaluate Neptune Serverless or an in-app graph traversal library (graphology.js).
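A minimal sketch of the adjacency-list approach, using an in-memory SQLite table in place of PostgreSQL (column names follow the SERVICE_DEPENDENCY model described later in this document; the sample rows are illustrative):

```python
import sqlite3

# In-memory stand-in for the PostgreSQL service_dependencies join table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE service_dependencies (
        source_service_id TEXT,
        target_service_id TEXT,
        dependency_type   TEXT
    )
""")
conn.executemany(
    "INSERT INTO service_dependencies VALUES (?, ?, ?)",
    [
        ("payment-gateway", "payment-processor", "calls"),
        ("payment-gateway", "audit-log", "publishes_to"),
        ("checkout", "payment-gateway", "calls"),
    ],
)

def direct_dependencies(service_id: str) -> list[str]:
    """Answer 'what does this service depend on?' with a single indexed lookup."""
    rows = conn.execute(
        "SELECT target_service_id FROM service_dependencies "
        "WHERE source_service_id = ? ORDER BY target_service_id",
        (service_id,),
    )
    return [r[0] for r in rows]

deps = direct_dependencies("payment-gateway")
# deps == ["audit-log", "payment-processor"] (alphabetical via ORDER BY)
```

If transitive traversal becomes a V1.1 need, a recursive CTE over the same table covers it before any graph database is warranted.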

The 5-Minute Auto-Discovery Flow — Core Architectural Driver

This is the most important sequence in the entire system. Every architectural decision serves this flow.

sequenceDiagram
    participant User as Engineer
    participant UI as Portal UI
    participant API as Portal API
    participant SF as Step Functions
    participant AWS as AWS Scanner (λ)
    participant GH as GitHub Scanner (λ)
    participant SQS as SQS FIFO
    participant REC as Reconciler (λ)
    participant INF as Inference (λ)
    participant DB as PostgreSQL
    participant MS as Meilisearch
    participant WS as WebSocket

    Note over User,WS: Minute 0:00 — Signup
    User->>UI: Sign up (GitHub OAuth)
    UI->>API: POST /auth/github
    API->>DB: Create tenant + user

    Note over User,WS: Minute 1:00 — Connect AWS
    User->>UI: Deploy CloudFormation template (1-click)
    UI->>API: POST /connections/aws {roleArn, externalId}
    API->>API: sts:AssumeRole validation
    API->>DB: Store connection credentials (encrypted)

    Note over User,WS: Minute 2:00 — Connect GitHub (already done via OAuth)
    API->>DB: Store GitHub org connection

    Note over User,WS: Minute 2:30 — Trigger Discovery
    API->>SF: StartExecution {tenantId, connections}
    SF->>WS: Push "Discovery started..."

    Note over User,WS: Minute 2:30-3:30 — Parallel Scanning
    par AWS Scan
        SF->>AWS: Scan CloudFormation stacks
        AWS->>SQS: {type: cfn_stack, resources: [...]}
        SF->>AWS: Scan ECS services
        AWS->>SQS: {type: ecs_service, services: [...]}
        SF->>AWS: Scan Lambda functions
        AWS->>SQS: {type: lambda_fn, functions: [...]}
        SF->>AWS: Scan API Gateway APIs
        AWS->>SQS: {type: apigw, apis: [...]}
        SF->>AWS: Scan RDS instances
        AWS->>SQS: {type: rds, instances: [...]}
    and GitHub Scan
        SF->>GH: Scan repos (non-archived, non-fork)
        GH->>SQS: {type: gh_repo, repos: [...]}
        SF->>GH: Scan CODEOWNERS files
        GH->>SQS: {type: codeowners, mappings: [...]}
        SF->>GH: Scan team memberships
        GH->>SQS: {type: gh_teams, teams: [...]}
    end

    WS-->>UI: Push "Found 47 AWS resources..."
    WS-->>UI: Push "Found 89 GitHub repos..."

    Note over User,WS: Minute 3:30-4:00 — Reconciliation
    SQS->>REC: Batch process discovery events
    REC->>REC: Cross-reference AWS resources ↔ GitHub repos
    REC->>REC: Deduplicate (CFN stack name = ECS service = repo name)
    REC->>REC: Merge into unified service entities
    REC->>DB: Upsert service entities

    Note over User,WS: Minute 4:00-4:30 — Ownership Inference
    SF->>INF: Infer ownership for all services
    INF->>INF: Score: CODEOWNERS (weight: 0.4) + git blame (0.25) + CFN tags (0.2) + team membership (0.15)
    INF->>DB: Update services with owner + confidence score
    INF->>MS: Index services for search

    WS-->>UI: Push "Discovered 147 services. Catalog ready."

    Note over User,WS: Minute 5:00 — First Search
    User->>UI: Cmd+K → "payment"
    UI->>API: GET /search?q=payment
    API->>MS: Search
    MS->>API: Results in <50ms
    API->>UI: payment-gateway, payment-processor, payment-webhook
    User->>User: "Holy shit, this actually works."

Critical timing constraints:

  • AWS scanning must complete in <60 seconds for accounts with up to 500 resources. Achieved via parallel Lambda invocations per resource type.
  • GitHub scanning must complete in <60 seconds for orgs with up to 500 repos. Achieved via GitHub GraphQL API (batch queries) instead of REST (one request per repo).
  • Reconciliation must complete in <30 seconds. Single Lambda invocation processing all SQS messages in batch.
  • Total pipeline: <120 seconds from trigger to searchable catalog. The "5-minute" promise includes signup + AWS connection time.

Why Step Functions (not a simple Lambda chain)?

  • Built-in retry with exponential backoff per step (AWS API throttling is common)
  • Parallel execution of AWS + GitHub scans with automatic join
  • Visual execution history for debugging failed discoveries
  • Error handling: if GitHub scan fails, AWS results still proceed (partial discovery > no discovery)
  • State machine is inspectable — critical for debugging accuracy issues in production
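The points above can be sketched as an Amazon States Language skeleton, expressed here as a Python dict (illustrative only; state names, ARNs, and retry settings are placeholders, not the production definition). The Parallel state runs both scans; only the GitHub branch carries a Catch, so a GitHub outage still yields a partial, AWS-only catalog:

```python
import json

# Illustrative ASL skeleton. All names and ARNs below are placeholders.
state_machine = {
    "StartAt": "ParallelScan",
    "States": {
        "ParallelScan": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "AwsScan",
                    "States": {
                        "AwsScan": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:aws-scanner",
                            # Exponential backoff per step (AWS API throttling is common).
                            "Retry": [{
                                "ErrorEquals": ["States.ALL"],
                                "IntervalSeconds": 2,
                                "BackoffRate": 2.0,
                                "MaxAttempts": 3,
                            }],
                            "End": True,
                        }
                    },
                },
                {
                    "StartAt": "GithubScan",
                    "States": {
                        "GithubScan": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:github-scanner",
                            # Catch instead of fail: partial discovery > no discovery.
                            "Catch": [{
                                "ErrorEquals": ["States.ALL"],
                                "Next": "GithubScanFailed",
                            }],
                            "End": True,
                        },
                        "GithubScanFailed": {
                            "Type": "Pass",
                            "Result": {"repos": []},  # empty GitHub result, AWS branch proceeds
                            "End": True,
                        },
                    },
                },
            ],
            "Next": "Reconcile",  # automatic join of both branches
        },
        "Reconcile": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:reconciler",
            "End": True,
        },
    },
}

definition_json = json.dumps(state_machine)  # what StartExecution's state machine would be created from
```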

2. CORE COMPONENTS

2.1 Discovery Engine

The discovery engine is the product. Everything else is UI on top of discovered data. If discovery is wrong, nothing else matters.

Architecture

graph TB
    subgraph "Discovery Orchestrator (Step Functions)"
        TRIGGER["Trigger<br/>(API call / Schedule)"]
        PLAN["Plan Phase<br/>Determine scan scope"]

        subgraph "Scan Phase (Parallel)"
            subgraph "AWS Scanners"
                CFN["CloudFormation<br/>Scanner"]
                ECS_S["ECS<br/>Scanner"]
                LAMBDA_S["Lambda<br/>Scanner"]
                APIGW_S["API Gateway<br/>Scanner"]
                RDS_S["RDS<br/>Scanner"]
                TAG_S["Resource Groups<br/>Tag Scanner"]
            end
            subgraph "GitHub Scanners"
                REPO_S["Repository<br/>Scanner"]
                CODEOWNERS_S["CODEOWNERS<br/>Parser"]
                TEAM_S["Team Membership<br/>Scanner"]
                README_S["README<br/>Extractor"]
                WORKFLOW_S["Actions Workflow<br/>Scanner"]
            end
        end

        RECONCILE["Reconciliation Phase"]
        INFER["Inference Phase"]
        INDEX["Index Phase"]
    end

    TRIGGER --> PLAN
    PLAN --> CFN & ECS_S & LAMBDA_S & APIGW_S & RDS_S & TAG_S
    PLAN --> REPO_S & CODEOWNERS_S & TEAM_S & README_S & WORKFLOW_S
    CFN & ECS_S & LAMBDA_S & APIGW_S & RDS_S & TAG_S --> RECONCILE
    REPO_S & CODEOWNERS_S & TEAM_S & README_S & WORKFLOW_S --> RECONCILE
    RECONCILE --> INFER --> INDEX

AWS Scanner — Resource-to-Service Mapping Strategy

The hardest problem in auto-discovery: what constitutes a "service"? AWS resources are granular (individual Lambdas, ECS tasks, RDS instances), but engineers think in services (payment-service, auth-service, user-api). The scanner must infer service boundaries from infrastructure patterns.

Service Identification Heuristics (priority order):

| Signal | Confidence | Logic |
|---|---|---|
| CloudFormation stack | 0.95 | Each stack is almost always a service or a closely related group. Stack name → service name. Stack tags (`service`, `team`, `project`) → metadata. |
| ECS service | 0.90 | Each ECS service is a deployable unit. Service name → service name. Task definition → tech stack (container image). |
| Lambda function with API Gateway trigger | 0.85 | Lambda + APIGW = API service. Group Lambdas sharing the same APIGW by API name. |
| Lambda function (standalone) | 0.60 | Standalone Lambdas may be services, cron jobs, or glue code. Group by naming prefix (e.g., `payment-*` → payment service). |
| Tagged resource group | 0.80 | Resources sharing a `service` or `project` tag are grouped. Tag value → service name. |
| RDS instance | 0.50 | Databases are infrastructure, not services — but map to the owning service via naming convention or CFN association. |
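The standalone-Lambda heuristic (group by naming prefix) can be sketched as follows; the split-on-first-hyphen rule and the two-function minimum are assumptions of this sketch:

```python
from collections import defaultdict

def group_lambdas_by_prefix(function_names: list[str], min_group: int = 2) -> dict[str, list[str]]:
    """Group standalone Lambdas by their leading name segment (e.g. payment-*).

    Sketch of the 0.60-confidence heuristic above: functions sharing a prefix
    become one candidate service; singletons are left for user review.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name in function_names:
        prefix = name.split("-", 1)[0]   # assumed delimiter convention
        groups[prefix].append(name)
    return {p: sorted(fns) for p, fns in groups.items() if len(fns) >= min_group}

candidates = group_lambdas_by_prefix([
    "payment-webhook-handler",
    "payment-retry-worker",
    "nightly-cleanup",   # singleton: not promoted to a service candidate
])
# candidates == {"payment": ["payment-retry-worker", "payment-webhook-handler"]}
```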

AWS API Calls per Scan (estimated):

cloudformation:ListStacks                    → 1 call (paginated)
cloudformation:DescribeStacks                → 1 call per stack (batched)
cloudformation:ListStackResources            → 1 call per stack
ecs:ListClusters + ListServices              → 2-5 calls
ecs:DescribeServices + DescribeTaskDefinition → 1 per service
lambda:ListFunctions                         → 1-3 calls (paginated)
lambda:ListEventSourceMappings               → 1 per function (batched)
apigateway:GetRestApis + GetResources        → 2-5 calls
apigatewayv2:GetApis                         → 1 call
rds:DescribeDBInstances                      → 1 call
resourcegroupstaggingapi:GetResources        → 1-5 calls (paginated, filtered by service/team tags)

Total: ~50-200 API calls per scan for a typical 50-service account. Well within AWS API rate limits. Parallel execution across resource types keeps total scan time under 30 seconds.

Cross-Account AssumeRole Pattern:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DDOC_PLATFORM_ACCOUNT:role/dd0c-discovery-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "{{tenant-specific-external-id}}"
        }
      }
    }
  ]
}

The customer deploys a CloudFormation template (provided by dd0c) that creates a read-only IAM role with:

  • ReadOnlyAccess managed policy (or a custom policy scoped to the specific services above)
  • Trust policy allowing dd0c's platform account to assume the role
  • ExternalId unique per tenant (prevents confused deputy attacks)
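On the platform side, the scanner assumes the customer role with short-lived credentials. A sketch of how the sts:AssumeRole parameters might be assembled (the customer-side role name, session naming, and duration here are illustrative, not the shipped template):

```python
import uuid

def build_assume_role_request(customer_account_id: str, external_id: str) -> dict:
    """Build sts:AssumeRole parameters for one tenant scan.

    ExternalId must match the StringEquals condition in the customer's trust
    policy, blocking confused-deputy use of the role on behalf of another
    tenant. Role and session names below are illustrative.
    """
    return {
        "RoleArn": f"arn:aws:iam::{customer_account_id}:role/dd0c-readonly",
        "RoleSessionName": f"dd0c-discovery-{uuid.uuid4().hex[:8]}",
        "ExternalId": external_id,
        "DurationSeconds": 900,  # scans finish in <60s; keep credentials short-lived
    }

# In the scanner Lambda these kwargs would feed boto3 directly, e.g.:
#   creds = boto3.client("sts").assume_role(**build_assume_role_request(acct, ext_id))
req = build_assume_role_request("123456789012", "tnt-3f9a")
```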

GitHub Scanner — Repository-to-Service Mapping

GraphQL Batch Query (single request for up to 100 repos):

query($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor, isArchived: false, isFork: false) {
      nodes {
        name
        description
        primaryLanguage { name }
        languages(first: 5) { nodes { name } }
        defaultBranchRef {
          target {
            ... on Commit {
              history(first: 1) {
                nodes { committedDate author { user { login } } }
              }
            }
          }
        }
        codeowners: object(expression: "HEAD:CODEOWNERS") {
          ... on Blob { text }
        }
        readme: object(expression: "HEAD:README.md") {
          ... on Blob { text }
        }
        catalogInfo: object(expression: "HEAD:catalog-info.yaml") {
          ... on Blob { text }
        }
        deployWorkflow: object(expression: "HEAD:.github/workflows/deploy.yml") {
          ... on Blob { text }
        }
      }
      pageInfo { hasNextPage endCursor }
    }
    teams(first: 100) {
      nodes {
        name slug
        members(first: 100) { nodes { login name } }
        repositories(first: 100) { nodes { name } }
      }
    }
  }
}

Key extraction logic:

  • CODEOWNERS → parse ownership patterns, map @org/team-name to team entities
  • README.md → extract first paragraph as service description (LLM-assisted summarization in V2)
  • catalog-info.yaml → if present (Backstage migrators), parse existing metadata as high-confidence input
  • .github/workflows/deploy.yml → extract deployment target (ECS service name, Lambda function name) to cross-reference with AWS scan
  • primaryLanguage → tech stack
  • Recent commit authors → contributor frequency for ownership inference
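A minimal CODEOWNERS parser for the first bullet might look like this (simplified: real CODEOWNERS semantics, such as last-match-wins and section headers, are richer than this sketch):

```python
def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Parse CODEOWNERS into (pattern, team_handles) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        pattern, *owners = line.split()
        # Keep only team handles like @org/team-name; those map to team entities.
        teams = [o for o in owners if o.startswith("@") and "/" in o]
        if teams:
            entries.append((pattern, teams))
    return entries

sample = """
# Service ownership
/src/payments/  @acme/payments-team
*.tf            @acme/platform-team jane@example.com
"""
# parse_codeowners(sample) ==
#   [("/src/payments/", ["@acme/payments-team"]), ("*.tf", ["@acme/platform-team"])]
```

Individual-user owners (email addresses, bare @logins) are dropped here; in practice they would feed the git-blame signal instead of the team mapping.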

Service Relationship Inference

Cross-referencing AWS and GitHub data to build the service graph:

MATCHING RULES (priority order):

1. EXPLICIT TAG MATCH
   AWS resource tag "github_repo" = "org/repo-name"
   → Direct link. Confidence: 0.95

2. CFN STACK → GITHUB ACTIONS DEPLOY TARGET
   GitHub workflow deploys to ECS service "payment-api"
   CFN stack contains ECS service "payment-api"
   → Link repo to CFN stack's service. Confidence: 0.90

3. NAME MATCH (normalized)
   GitHub repo: "payment-service"
   ECS service: "payment-service" or "payment-svc"
   → Fuzzy name match (Levenshtein distance ≤ 2). Confidence: 0.75

4. ECR IMAGE → GITHUB REPO
   ECS task definition references ECR image "payment-api:latest"
   ECR image was built from GitHub repo "payment-api" (via image tag or build metadata)
   → Confidence: 0.85

5. LAMBDA FUNCTION NAME → REPO NAME
   Lambda: "payment-webhook-handler"
   Repo: "payment-webhook" or "payment-service" (contains Lambda deploy workflow)
   → Confidence: 0.70
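Rule 3's normalized fuzzy match can be sketched as follows; the suffix alias table (svc → service) is an assumption of this sketch, not a documented mapping:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Assumed alias table: expand common abbreviations before comparing.
_ALIASES = {"svc": "service"}

def normalize(name: str) -> str:
    parts = name.lower().replace("_", "-").split("-")
    return "-".join(_ALIASES.get(p, p) for p in parts)

def names_match(repo: str, resource: str, max_distance: int = 2) -> bool:
    """Rule 3: normalized names within Levenshtein distance 2 → confidence 0.75."""
    return levenshtein(normalize(repo), normalize(resource)) <= max_distance

names_match("payment-service", "payment-svc")      # True (svc expands to service)
names_match("payment-service", "billing-service")  # False
```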

Confidence Score Calculation:

Each service entity gets a composite confidence score:

def weighted_average(scores: list[tuple[str, float]]) -> float:
    # Mean of the (dimension, score) pairs; swap in per-dimension weights
    # if some dimensions should count more than others.
    if not scores:
        return 0.0
    return sum(value for _, value in scores) / len(scores)


def calculate_confidence(service: "Service") -> float:
    scores: list[tuple[str, float]] = []

    # Existence confidence: how sure are we this is a real service?
    if service.source == "cloudformation_stack":
        scores.append(("existence", 0.95))
    elif service.source == "ecs_service":
        scores.append(("existence", 0.90))
    elif service.source == "github_repo_only":
        scores.append(("existence", 0.60))  # repo exists but no infra found

    # Ownership confidence
    if service.owner_source == "codeowners":
        scores.append(("ownership", 0.90))
    elif service.owner_source == "cfn_tag":
        scores.append(("ownership", 0.85))
    elif service.owner_source == "git_blame_frequency":
        scores.append(("ownership", 0.65))
    elif service.owner_source == "inferred_from_team_membership":
        scores.append(("ownership", 0.50))

    # Repo linkage confidence
    if service.repo_link_source == "explicit_tag":
        scores.append(("repo_link", 0.95))
    elif service.repo_link_source == "deploy_workflow":
        scores.append(("repo_link", 0.90))
    elif service.repo_link_source == "name_match":
        scores.append(("repo_link", 0.75))

    return weighted_average(scores)

The >80% accuracy target is measured as:

accuracy = (services_correct_without_user_correction) / (total_services_discovered)

Where "correct" means: service exists, owner is right, repo link is right. Measured during beta by asking each beta customer to review their catalog and mark corrections.

Discovery Scheduling

| Trigger | Frequency | Scope |
|---|---|---|
| Initial onboarding | Once | Full scan (all resource types, all repos) |
| Scheduled refresh | Every 6 hours (configurable: 1h-24h) | Incremental — only scan resources modified since the last scan (CloudFormation events, GitHub push webhooks) |
| Manual trigger | On-demand (UI button) | Full scan |
| Webhook-driven | Real-time | GitHub push to CODEOWNERS → re-infer ownership for affected repos. CloudFormation stack events → update service metadata. |
| User correction | Immediate | Re-score the ownership model for similar services when a user corrects one |

2.2 Service Catalog

The service catalog is the central data model. Everything reads from it, everything writes to it.

Service Entity Model

┌─────────────────────────────────────────────────────────┐
│ SERVICE                                                  │
├─────────────────────────────────────────────────────────┤
│ id: uuid (PK)                                           │
│ tenant_id: uuid (FK → tenants)                          │
│ name: varchar(255)                                      │
│ display_name: varchar(255)                              │
│ description: text (extracted from README)                │
│ service_type: enum [api, worker, cron, database, queue] │
│ lifecycle: enum [production, staging, deprecated, eol]  │
│ tier: enum [critical, standard, experimental]           │
│ tech_stack: jsonb (languages, frameworks, runtime)      │
│ repo_url: varchar(500)                                  │
│ repo_default_branch: varchar(100)                       │
│ infrastructure: jsonb (aws_resources, regions, accounts)│
│ health_status: enum [healthy, degraded, down, unknown]  │
│ last_deploy_at: timestamptz                             │
│ last_discovered_at: timestamptz                         │
│ confidence_score: decimal(3,2) [0.00-1.00]             │
│ discovery_sources: jsonb (which scanners found this)    │
│ metadata: jsonb (extensible key-value pairs)            │
│ created_at: timestamptz                                 │
│ updated_at: timestamptz                                 │
├─────────────────────────────────────────────────────────┤
│ INDEXES: tenant_id, name (unique per tenant),           │
│          confidence_score, lifecycle                     │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ TEAM                                                     │
├─────────────────────────────────────────────────────────┤
│ id: uuid (PK)                                           │
│ tenant_id: uuid (FK → tenants)                          │
│ name: varchar(255)                                      │
│ slug: varchar(255)                                      │
│ github_team_slug: varchar(255)                          │
│ slack_channel: varchar(255)                             │
│ pagerduty_schedule_id: varchar(255)                     │
│ members: jsonb (user references)                        │
│ created_at: timestamptz                                 │
│ updated_at: timestamptz                                 │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ SERVICE_OWNERSHIP                                        │
├─────────────────────────────────────────────────────────┤
│ service_id: uuid (FK → services)                        │
│ team_id: uuid (FK → teams)                              │
│ ownership_type: enum [primary, contributing, on_call]   │
│ confidence: decimal(3,2)                                │
│ source: enum [codeowners, cfn_tag, git_blame,           │
│               team_membership, user_correction]          │
│ verified_by: uuid (FK → users, nullable)                │
│ verified_at: timestamptz (nullable)                     │
│ created_at: timestamptz                                 │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ SERVICE_DEPENDENCY (V1.1)                                │
├─────────────────────────────────────────────────────────┤
│ source_service_id: uuid (FK → services)                 │
│ target_service_id: uuid (FK → services)                 │
│ dependency_type: enum [calls, publishes_to, reads_from] │
│ confidence: decimal(3,2)                                │
│ source: enum [vpc_flow, apigw_integration,              │
│               lambda_event_source, user_defined]         │
│ created_at: timestamptz                                 │
└─────────────────────────────────────────────────────────┘

Ownership Mapping

Ownership is the highest-value data point in the catalog. The inference engine uses a weighted scoring model:

OWNERSHIP SCORING MODEL

Input signals (per service):
  1. CODEOWNERS file match          → weight: 0.40
  2. CloudFormation/resource tags   → weight: 0.20
  3. Git blame frequency (top team) → weight: 0.25
  4. GitHub team → repo association → weight: 0.15

Process:
  - For each candidate team, sum weighted scores across all signals
  - Normalize to [0, 1]
  - Assign primary owner = highest scoring team
  - If top score < 0.50 → mark as "unowned" (flag for user review)
  - If top two scores within 0.10 → mark as "ambiguous" (flag for user review)

User corrections:
  - When a user corrects ownership, the correction is stored as source="user_correction"
  - User corrections have implicit weight 1.0 (override all inference)
  - Corrections propagate: if user says "payment-* repos belong to @payments-team",
    apply to all matching repos with confidence 0.85

Metadata Enrichment

Beyond ownership, the catalog enriches each service with:

| Field | Source | Extraction Method |
|---|---|---|
| Description | GitHub README | First-paragraph extraction (regex: first non-heading, non-badge paragraph) |
| Tech stack | GitHub primaryLanguage + languages | Direct from the GitHub API |
| Runtime | ECS task definition / Lambda runtime | Direct from the AWS API |
| Last deploy | GitHub Actions last successful workflow run / ECS last deployment | Most recent timestamp |
| On-call | PagerDuty schedule mapped to team | PagerDuty API: GET /schedules → match by team name or escalation policy |
| Health | CloudWatch alarm state for associated resources | Aggregate: all alarms OK → healthy, any alarm → degraded, critical alarm → down |
| Cost | dd0c/cost module (when connected) | Internal API: GET /cost/services/{serviceId}/monthly |
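The health roll-up rule from the table can be sketched as a pure function; the severity field on alarm records is an assumption of this sketch:

```python
def aggregate_health(alarms: list[dict]) -> str:
    """Roll CloudWatch alarm states up to a service health status.

    Mirrors the rule above: no alarms → unknown, all OK → healthy, any firing
    alarm → degraded, any firing alarm tagged critical → down.
    """
    if not alarms:
        return "unknown"
    firing = [a for a in alarms if a["state"] == "ALARM"]
    if not firing:
        return "healthy"
    if any(a.get("severity") == "critical" for a in firing):
        return "down"
    return "degraded"

aggregate_health([{"state": "OK"}, {"state": "ALARM", "severity": "critical"}])  # "down"
```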

2.3 Search Engine

The Cmd+K search bar is the daily-use hook. It must be faster than asking in Slack.

Search Architecture

graph LR
    USER["User types in Cmd+K"] --> SPA["React SPA"]
    SPA -- "debounce 150ms" --> API["Portal API"]
    API --> REDIS["Redis<br/>Prefix cache<br/>(hot queries)"]
    REDIS -- "cache miss" --> MEILI["Meilisearch"]
    MEILI --> API
    API --> SPA
    SPA --> USER

    style REDIS fill:#f9f,stroke:#333
    style MEILI fill:#bbf,stroke:#333

Performance budget:

  • Keystroke to API request: <150ms (debounce)
  • API to Meilisearch: <10ms (same VPC, same AZ)
  • Meilisearch query execution: <50ms (for 10K documents)
  • API response to UI render: <50ms
  • Total perceived latency: <200ms (target: feels instant)

Meilisearch Index Configuration:

{
  "index": "services",
  "primaryKey": "id",
  "searchableAttributes": [
    "name",
    "display_name",
    "description",
    "team_name",
    "tech_stack",
    "repo_name",
    "tags"
  ],
  "filterableAttributes": [
    "tenant_id",
    "lifecycle",
    "tier",
    "team_name",
    "tech_stack",
    "health_status",
    "confidence_score"
  ],
  "sortableAttributes": [
    "name",
    "last_deploy_at",
    "confidence_score",
    "updated_at"
  ],
  "rankingRules": [
    "words",
    "typo",
    "proximity",
    "attribute",
    "sort",
    "exactness"
  ],
  "typoTolerance": {
    "enabled": true,
    "minWordSizeForTypos": {
      "oneTypo": 3,
      "twoTypos": 6
    }
  }
}

Multi-tenant isolation in search: Every document in Meilisearch includes tenant_id. Every query includes a mandatory filter: tenant_id = '{current_tenant}'. This is enforced at the API layer — the SPA never queries Meilisearch directly.

Search result format:

{
  "hits": [
    {
      "id": "svc_abc123",
      "name": "payment-gateway",
      "display_name": "Payment Gateway",
      "description": "Handles payment processing via Stripe integration",
      "team_name": "Payments Team",
      "repo_url": "https://github.com/acme/payment-gateway",
      "health_status": "healthy",
      "tech_stack": ["TypeScript", "Node.js"],
      "confidence_score": 0.92,
      "last_deploy_at": "2026-02-27T14:30:00Z",
      "_matchesPosition": { "name": [{"start": 0, "length": 7}] }
    }
  ],
  "query": "payment",
  "processingTimeMs": 12,
  "estimatedTotalHits": 3
}

Redis Prefix Cache

For the most common queries (top 100 per tenant), cache the Meilisearch response in Redis with a 5-minute TTL. This reduces Meilisearch load and provides <5ms response for repeated queries.

Key pattern: search:{tenant_id}:{normalized_query_prefix}
TTL: 300 seconds
Invalidation: on any service upsert for the tenant (conservative but simple)
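A sketch of the key construction and the read-through path (the redis-py client and the Meilisearch fallthrough appear only as comments; the whitespace normalization is an assumption):

```python
def cache_key(tenant_id: str, query: str) -> str:
    """Normalize the query and build the Redis key per the pattern above."""
    normalized = " ".join(query.lower().split())   # lowercase, collapse whitespace
    return f"search:{tenant_id}:{normalized}"

SEARCH_TTL_SECONDS = 300

# Read-through path, assuming a redis-py client and the Meilisearch call above:
#
#   key = cache_key(tenant_id, query)
#   cached = redis_client.get(key)
#   if cached is None:
#       cached = json.dumps(search_meilisearch(tenant_id, query))
#       redis_client.set(key, cached, ex=SEARCH_TTL_SECONDS)

cache_key("tnt_1", "  Payment  ")  # "search:tnt_1:payment"
```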

2.4 AI Agent — "Ask Your Infra" (V2)

Deferred to V2 (Month 7-12), but the architecture must accommodate it from day one.

Design

graph TB
    USER["User: 'Which services handle PII?'"]
    AGENT["AI Agent (Lambda)"]
    LLM["LLM (Claude / GPT-4o)"]
    CATALOG["Service Catalog (PostgreSQL)"]
    SEARCH["Meilisearch"]
    COST["dd0c/cost API"]
    ALERT["dd0c/alert API"]

    USER --> AGENT
    AGENT --> LLM
    LLM -- "tool_call: search_services" --> SEARCH
    LLM -- "tool_call: query_catalog" --> CATALOG
    LLM -- "tool_call: get_cost" --> COST
    LLM -- "tool_call: get_incidents" --> ALERT
    LLM --> AGENT
    AGENT --> USER

RAG approach: The AI agent does NOT embed the entire catalog into a vector store. Instead, it uses structured tool calls:

  1. User asks a natural language question
  2. LLM receives the question + a system prompt describing available tools (search, SQL query, cost API, alert API)
  3. LLM generates tool calls to retrieve relevant data
  4. Results are injected into context
  5. LLM synthesizes a natural language answer with citations

Why tool-use over vector RAG?

  • The service catalog is structured data (tables, relationships). SQL queries are more precise than semantic similarity search.
  • The catalog is small enough (<10K services) that tool calls retrieve exact data, not "similar" data.
  • No embedding pipeline to maintain. No vector database to operate. Simpler architecture for a solo founder.

V2 scope:

  • Natural language queries via portal UI and Slack bot
  • Tool calls: search_services, get_service_detail, query_services_by_attribute, get_team_services, get_service_cost, get_service_incidents
  • Guardrails: tenant isolation (LLM can only query current tenant's data), no write operations, response length limits
  • Cost control: cache identical queries for 5 minutes, rate limit to 50 queries/user/day

2.5 Dashboard

The dashboard serves two audiences with different needs:

Engineers (daily use): Cmd+K search bar front and center. Recent services visited. Team's services quick-access. That's it. Calm surface.

Directors (weekly use): Org-wide metrics. Service count by team. Ownership coverage (% of services with verified owners). Health overview. Discovery accuracy trend. Exportable for compliance.

Dashboard Component Architecture

┌─────────────────────────────────────────────────────────────────┐
│ PORTAL DASHBOARD                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  🔍 Cmd+K: Search services, teams, or keywords...       │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐            │
│  │ 147 Services │ │ 12 Teams     │ │ 89% Accuracy │            │
│  │ 3 unowned    │ │ 2 on-call    │ │ ↑ from 82%   │            │
│  └──────────────┘ └──────────────┘ └──────────────┘            │
│                                                                  │
│  RECENT ─────────────────────────────────────────────           │
│  payment-gateway  │ @payments │ healthy │ 2h ago               │
│  auth-service     │ @platform │ healthy │ 1d ago               │
│  order-engine     │ @orders   │ degraded│ 3h ago               │
│                                                                  │
│  YOUR TEAM (@platform) ──────────────────────────────           │
│  auth-service     │ healthy │ ts/node  │ repo ↗                │
│  api-gateway      │ healthy │ ts/node  │ repo ↗                │
│  user-service     │ degraded│ python   │ repo ↗                │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ SERVICE DETAIL (expanded on click)                       │    │
│  │ ┌─────────┬──────────┬──────────┬──────────┬──────────┐ │    │
│  │ │ Overview│ Infra    │ On-Call  │ Cost     │ Incidents│ │    │
│  │ ├─────────┴──────────┴──────────┴──────────┴──────────┤ │    │
│  │ │ Owner: @payments-team (92% confidence) [Correct ✏️]  │ │    │
│  │ │ Repo: github.com/acme/payment-gateway               │ │    │
│  │ │ Stack: TypeScript, Node.js, ECS Fargate              │ │    │
│  │ │ Last Deploy: 2h ago by @sarah                        │ │    │
│  │ │ Health: ✅ All CloudWatch alarms OK                   │ │    │
│  │ │ On-Call: @mike (PagerDuty, ends in 4h)               │ │    │
│  │ │ Cost: $847/mo (dd0c/cost) ↑12% from last month      │ │    │
│  │ │ Incidents: 2 this month (dd0c/alert)                 │ │    │
│  │ └─────────────────────────────────────────────────────┘ │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Progressive disclosure in action:

  • Default: one-line-per-service table (name, owner, health, last deploy)
  • Click: expanded service card with tabs (overview, infra, on-call, cost, incidents)
  • Each tab loads data on demand (lazy loading) — no upfront cost for data the user doesn't need

3. DATA ARCHITECTURE

3.1 Complete Database Schema

Core Entities

-- Tenant isolation: every table has tenant_id. Every query filters by it. No exceptions.

CREATE TABLE tenants (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            VARCHAR(255) NOT NULL,
    slug            VARCHAR(255) NOT NULL UNIQUE,
    plan            VARCHAR(50) NOT NULL DEFAULT 'free', -- free, team, business
    stripe_customer_id VARCHAR(255),
    stripe_subscription_id VARCHAR(255),
    settings        JSONB NOT NULL DEFAULT '{}',
    -- settings: { discovery_interval_hours: 6, auto_refresh: true, slack_workspace_id: "T..." }
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    github_id       BIGINT NOT NULL,
    github_login    VARCHAR(255) NOT NULL,
    email           VARCHAR(255),
    display_name    VARCHAR(255),
    avatar_url      VARCHAR(500),
    role            VARCHAR(50) NOT NULL DEFAULT 'member', -- admin, member
    last_active_at  TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, github_id)
);

CREATE TABLE connections (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    provider        VARCHAR(50) NOT NULL, -- aws, github, pagerduty, opsgenie
    status          VARCHAR(50) NOT NULL DEFAULT 'pending', -- pending, active, error, revoked
    credentials     JSONB NOT NULL, -- encrypted at rest (KMS)
    -- aws: { role_arn, external_id, regions: ["us-east-1", "us-west-2"] }
    -- github: { installation_id, org_login, access_token_encrypted }
    -- pagerduty: { api_key_encrypted, subdomain }
    last_scan_at    TIMESTAMPTZ,
    last_error      TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, provider)
);

CREATE TABLE services (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    name            VARCHAR(255) NOT NULL,
    display_name    VARCHAR(255),
    description     TEXT,
    service_type    VARCHAR(50), -- api, worker, cron, database, queue, frontend, library
    lifecycle       VARCHAR(50) NOT NULL DEFAULT 'production', -- production, deprecated, eol
    tier            VARCHAR(50) NOT NULL DEFAULT 'standard', -- critical, standard, experimental
    tech_stack      JSONB DEFAULT '[]', -- ["TypeScript", "Node.js", "Express"]
    repo_url        VARCHAR(500),
    repo_default_branch VARCHAR(100) DEFAULT 'main',
    infrastructure  JSONB DEFAULT '{}',
    -- { aws_resources: [{type: "ecs_service", arn: "...", region: "us-east-1"}],
    --   aws_account_id: "123456789012" }
    health_status   VARCHAR(50) DEFAULT 'unknown', -- healthy, degraded, down, unknown
    last_deploy_at  TIMESTAMPTZ,
    last_discovered_at TIMESTAMPTZ,
    confidence_score DECIMAL(3,2) DEFAULT 0.00,
    discovery_sources JSONB DEFAULT '[]', -- ["cloudformation", "github_repo", "ecs_service"]
    metadata        JSONB DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, name)
);
CREATE INDEX idx_services_tenant ON services(tenant_id);
CREATE INDEX idx_services_lifecycle ON services(tenant_id, lifecycle);
CREATE INDEX idx_services_confidence ON services(tenant_id, confidence_score);
CREATE INDEX idx_services_health ON services(tenant_id, health_status);

CREATE TABLE teams (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    name            VARCHAR(255) NOT NULL,
    slug            VARCHAR(255) NOT NULL,
    github_team_slug VARCHAR(255),
    slack_channel_id VARCHAR(255),
    slack_channel_name VARCHAR(255),
    pagerduty_schedule_id VARCHAR(255),
    opsgenie_team_id VARCHAR(255),
    contact_email   VARCHAR(255),
    members         JSONB DEFAULT '[]',
    -- [{ github_login: "sarah", name: "Sarah Chen", role: "lead" }]
    metadata        JSONB DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, slug)
);

CREATE TABLE service_ownership (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_id      UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
    team_id         UUID NOT NULL REFERENCES teams(id) ON DELETE CASCADE,
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    ownership_type  VARCHAR(50) NOT NULL DEFAULT 'primary', -- primary, contributing, on_call
    confidence      DECIMAL(3,2) NOT NULL DEFAULT 0.00,
    source          VARCHAR(50) NOT NULL, -- codeowners, cfn_tag, git_blame, team_membership, user_correction
    verified_by     UUID REFERENCES users(id),
    verified_at     TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(service_id, team_id, ownership_type)
);
CREATE INDEX idx_ownership_service ON service_ownership(service_id);
CREATE INDEX idx_ownership_team ON service_ownership(team_id);

CREATE TABLE service_dependencies (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    source_service_id UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
    target_service_id UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
    dependency_type VARCHAR(50) NOT NULL, -- calls, publishes_to, reads_from, triggers
    confidence      DECIMAL(3,2) NOT NULL DEFAULT 0.00,
    source          VARCHAR(50) NOT NULL, -- vpc_flow, apigw_integration, lambda_event_source, user_defined
    metadata        JSONB DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(source_service_id, target_service_id, dependency_type)
);

Discovery Event Log

Every discovery run produces an immutable event log. This is critical for debugging accuracy issues, auditing what changed, and measuring improvement over time.

CREATE TABLE discovery_runs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    trigger_type    VARCHAR(50) NOT NULL, -- onboarding, scheduled, manual, webhook
    status          VARCHAR(50) NOT NULL DEFAULT 'running', -- running, completed, partial, failed
    step_function_execution_arn VARCHAR(500),
    started_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at    TIMESTAMPTZ,
    stats           JSONB DEFAULT '{}',
    -- { aws_resources_found: 234, github_repos_found: 89,
    --   services_created: 12, services_updated: 135, services_unchanged: 0,
    --   ownership_inferred: 140, ownership_ambiguous: 7,
    --   scan_duration_ms: 28400, reconcile_duration_ms: 4200 }
    errors          JSONB DEFAULT '[]'
    -- [{ phase: "aws_scan", resource: "lambda", error: "ThrottlingException", retried: true }]
);
CREATE INDEX idx_discovery_runs_tenant ON discovery_runs(tenant_id, started_at DESC);

CREATE TABLE discovery_events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    run_id          UUID NOT NULL REFERENCES discovery_runs(id) ON DELETE CASCADE,
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    event_type      VARCHAR(50) NOT NULL,
    -- service_created, service_updated, service_removed,
    -- ownership_changed, ownership_ambiguous,
    -- repo_linked, repo_unlinked,
    -- resource_discovered, resource_removed
    service_id      UUID REFERENCES services(id),
    payload         JSONB NOT NULL,
    -- { field: "owner", old_value: "@platform", new_value: "@payments",
    --   old_confidence: 0.65, new_confidence: 0.88,
    --   reason: "CODEOWNERS updated" }
    confidence      DECIMAL(3,2),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_discovery_events_run ON discovery_events(run_id);
CREATE INDEX idx_discovery_events_service ON discovery_events(service_id, created_at DESC);

-- Partition discovery_events by month for efficient cleanup
-- Retain 90 days of events, archive to S3 after that

User Corrections (Feedback Loop)

CREATE TABLE corrections (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    service_id      UUID NOT NULL REFERENCES services(id),
    user_id         UUID NOT NULL REFERENCES users(id),
    field           VARCHAR(100) NOT NULL, -- owner, description, tier, lifecycle, repo_url
    old_value       JSONB,
    new_value       JSONB,
    applied         BOOLEAN NOT NULL DEFAULT TRUE,
    propagated      BOOLEAN NOT NULL DEFAULT FALSE,
    -- propagated: did this correction update inference for similar services?
    propagation_count INT DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_corrections_tenant ON corrections(tenant_id, created_at DESC);

Corrections are the most valuable data in the system. They:

  1. Immediately fix the corrected service
  2. Feed back into the ownership inference model (increase weight for the corrected signal)
  3. Propagate to similar services when patterns are detected (e.g., "user corrected 3 services in payment-* repos to @payments-team → auto-apply to remaining payment-* repos with confidence 0.85")
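A sketch of step 3's propagation rule, assuming the pattern key is the repo-name prefix (as in the payment-* example) and the thresholds from the text; the real inference model would use richer signals:

```python
from collections import Counter

PROPAGATION_THRESHOLD = 3    # corrections needed before a pattern auto-applies
PROPAGATED_CONFIDENCE = 0.85 # confidence assigned to propagated ownership

def repo_prefix(name: str) -> str:
    # Crude pattern key: token before the first hyphen, e.g. "payment-gateway" -> "payment".
    return name.split("-")[0]

def propagate(corrections, services):
    """corrections: [(service_name, corrected_team)]; services: {name: current_team}.
    Returns {service_name: (team, confidence)} for uncorrected services whose
    name prefix was corrected to the same team >= PROPAGATION_THRESHOLD times."""
    votes = Counter((repo_prefix(name), team) for name, team in corrections)
    corrected = {name for name, _ in corrections}
    proposed = {}
    for (prefix, team), count in votes.items():
        if count < PROPAGATION_THRESHOLD:
            continue
        for name in services:
            if name not in corrected and repo_prefix(name) == prefix:
                proposed[name] = (team, PROPAGATED_CONFIDENCE)
    return proposed
```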

3.2 Search Index Design

Meilisearch document structure (denormalized from PostgreSQL for search performance):

{
  "id": "svc_abc123",
  "tenant_id": "tenant_xyz",
  "name": "payment-gateway",
  "display_name": "Payment Gateway",
  "description": "Handles payment processing via Stripe. Exposes REST API for checkout flow.",
  "service_type": "api",
  "lifecycle": "production",
  "tier": "critical",
  "team_name": "Payments Team",
  "team_slug": "payments",
  "owner_confidence": 0.92,
  "tech_stack": ["TypeScript", "Node.js"],
  "repo_name": "payment-gateway",
  "repo_url": "https://github.com/acme/payment-gateway",
  "health_status": "healthy",
  "last_deploy_at": 1740667800,
  "aws_services": ["ecs", "rds", "elasticache"],
  "aws_region": "us-east-1",
  "tags": ["payments", "stripe", "checkout", "critical-path"],
  "confidence_score": 0.92,
  "updated_at": 1740667800
}

Index sync strategy:

  • On every service upsert in PostgreSQL → publish to SQS → Lambda consumer → Meilisearch addDocuments (batch, async)
  • Latency: service update → searchable in <5 seconds
  • Full reindex: triggered on Meilisearch restart or schema change. Reads all services from PostgreSQL, batches of 1000 documents. For 10K services: ~10 seconds.
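The SQS consumer side of this pipeline can be sketched as below. The document shape is abbreviated and `add_documents` stands in for the Meilisearch client call; field handling is illustrative:

```python
import json

BATCH_SIZE = 1000  # matches the full-reindex batch size above

def to_document(service: dict) -> dict:
    """Denormalize a service row into the flat Meilisearch document shape
    (abbreviated: the real document carries all searchable/facet fields)."""
    return {
        "id": service["id"],
        "tenant_id": service["tenant_id"],
        "name": service["name"],
        "team_name": (service.get("team") or {}).get("name"),
        "confidence_score": service.get("confidence_score", 0.0),
    }

def handle_index_events(records, add_documents):
    """Lambda-style consumer: parse SQS records, denormalize, and push
    documents to the index in batches via add_documents."""
    docs = [to_document(json.loads(r["body"])) for r in records]
    for i in range(0, len(docs), BATCH_SIZE):
        add_documents(docs[i:i + BATCH_SIZE])
    return len(docs)
```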

Index sizing:

  • Average document size: ~1KB
  • 1,000 services: ~1MB index, ~50MB RAM
  • 10,000 services: ~10MB index, ~200MB RAM
  • Meilisearch on a single Fargate task (0.5 vCPU, 1GB RAM) handles 10K+ services comfortably

3.3 Graph Database Decision: Not Yet

V1: PostgreSQL adjacency list. Service dependencies stored in service_dependencies table. Queries like "what does service X depend on?" are simple JOINs. Queries like "what's the full transitive dependency tree?" use recursive CTEs:

WITH RECURSIVE dep_tree AS (
    SELECT target_service_id, 1 as depth
    FROM service_dependencies
    WHERE source_service_id = :service_id AND tenant_id = :tenant_id
    UNION ALL
    SELECT sd.target_service_id, dt.depth + 1
    FROM service_dependencies sd
    JOIN dep_tree dt ON sd.source_service_id = dt.target_service_id
    WHERE dt.depth < 10  -- prevent infinite loops
)
SELECT DISTINCT s.* FROM dep_tree dt
JOIN services s ON s.id = dt.target_service_id;

At the scale of 50-1000 services with an average of 3-5 dependencies each, this recursive CTE executes in <50ms on Aurora. A graph database is unnecessary.

V1.1+ evaluation criteria for Neptune Serverless:

  • If dependency visualization becomes a core feature with >100 daily queries
  • If customers have >5,000 services with deep dependency chains (>10 levels)
  • If graph traversal queries (shortest path, impact radius) become latency-sensitive
  • Neptune Serverless minimum cost: ~$0.12/NCU-hour × 2.5 NCU minimum = ~$220/month. Only justified when dependency features drive measurable retention.

3.4 Multi-Tenant Data Isolation

Strategy: Shared database, tenant_id column, application-level enforcement.

Why not database-per-tenant:

  • Aurora Serverless v2 charges per ACU. One database with 50 tenants is cheaper than 50 databases.
  • Schema migrations are applied once, not 50 times.
  • Cross-tenant analytics (anonymized, for product metrics) are simple queries.

Enforcement layers:

| Layer | Mechanism |
| --- | --- |
| API middleware | Every authenticated request extracts tenant_id from JWT. Injected into every database query. No query can omit tenant_id. |
| PostgreSQL RLS (Row-Level Security) | Backup enforcement. Even if application code has a bug, RLS prevents cross-tenant data access. |
| Meilisearch filter | Every search query includes a mandatory tenant_id filter. Enforced at API layer. |
| S3 prefix | Discovery snapshots stored at s3://dd0c-data/{tenant_id}/snapshots/. IAM policy scopes Lambda access to the tenant prefix during discovery. |
| Logging | All API logs include tenant_id. Anomaly detection: alert if a single request touches multiple tenant_ids. |

PostgreSQL RLS implementation:

ALTER TABLE services ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON services
    USING (tenant_id = current_setting('app.current_tenant_id')::UUID);

-- Set per-request in API middleware:
-- SET LOCAL app.current_tenant_id = 'tenant-uuid-here';
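A sketch of the per-request middleware that pairs with the RLS policy above, in psycopg-style pseudocode (the connection API is assumed, not prescribed). Using set_config with a bound parameter avoids interpolating the tenant ID into SQL, and the transaction-local scope means pooled connections cannot leak the setting:

```python
def with_tenant(conn, tenant_id: str, fn):
    """Run fn(cursor) inside a transaction with the RLS tenant set.
    set_config(..., true) is transaction-local (equivalent to SET LOCAL),
    so the setting disappears when the transaction ends."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        cur = conn.cursor()
        # Parameterized: never interpolate tenant_id into the SQL string.
        cur.execute("SELECT set_config('app.current_tenant_id', %s, true)", (tenant_id,))
        return fn(cur)
```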

3.5 Sync/Refresh Strategy

| Event | Trigger | Scope | Latency |
| --- | --- | --- | --- |
| Initial discovery | User completes onboarding | Full scan: all AWS resource types + all GitHub repos | <120 seconds |
| Scheduled refresh | EventBridge cron (default: every 6h) | Incremental: CloudFormation events since last scan, GitHub repos with pushes since last scan | <60 seconds |
| GitHub webhook | Push to CODEOWNERS, README, or deploy workflow | Single repo: re-extract metadata, re-infer ownership | <10 seconds |
| CloudFormation event | Stack create/update/delete (via EventBridge rule in customer account) | Single stack: update associated service | <10 seconds |
| User correction | User clicks "Correct" in UI | Single service + propagation to similar services | <5 seconds |
| Manual full rescan | User clicks "Rescan" in settings | Full scan (same as initial) | <120 seconds |

Incremental scan optimization:

For scheduled refreshes, avoid re-scanning everything:

  1. AWS: Use CloudTrail events (if available) or compare CloudFormation stack LastUpdatedTime to skip unchanged stacks. For ECS/Lambda, compare resource tags and configuration hashes.
  2. GitHub: Use the GitHub Events API or webhook payloads to identify repos with changes since last scan. Only re-scan changed repos.
  3. Result: Incremental scans touch 5-15% of resources, completing in <30 seconds instead of 120.
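The CloudFormation half of step 1 reduces to a timestamp filter over stack summaries. A sketch, mirroring the boto3 ListStacks shape (LastUpdatedTime is absent for stacks that were created but never updated, so CreationTime is the fallback):

```python
def stacks_to_rescan(stacks, last_scan_at):
    """Filter CloudFormation stack summaries down to those changed since
    the last scan, so an incremental refresh skips unchanged stacks."""
    changed = []
    for stack in stacks:
        changed_at = stack.get("LastUpdatedTime") or stack["CreationTime"]
        if changed_at > last_scan_at:
            changed.append(stack["StackName"])
    return changed
```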

Staleness detection:

If a service hasn't been seen in 3 consecutive full scans:

  • Mark as lifecycle: deprecated with a note "Not found in recent discovery scans"
  • Surface in dashboard: "3 services may have been removed from your infrastructure"
  • After 5 consecutive misses: mark as lifecycle: eol, remove from default search results (still accessible via filter)
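The staleness rule above is a small state transition on the consecutive-miss count. A sketch with the thresholds from the text:

```python
DEPRECATE_AFTER = 3  # consecutive full scans without seeing the service
EOL_AFTER = 5

def next_lifecycle(current: str, consecutive_misses: int) -> str:
    """Downgrade lifecycle based on how many consecutive discovery scans
    failed to find the service; otherwise leave it unchanged."""
    if consecutive_misses >= EOL_AFTER:
        return "eol"
    if consecutive_misses >= DEPRECATE_AFTER:
        return "deprecated"
    return current
```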

4. INFRASTRUCTURE

4.1 AWS Architecture

graph TB
    subgraph "us-east-1 — Primary Region"
        subgraph "Public Subnet"
            CF["CloudFront Distribution<br/>SPA + API Cache"]
            ALB["Application Load Balancer<br/>+ WAF v2"]
        end

        subgraph "Private Subnet — App Tier"
            ECS_API["ECS Fargate<br/>Portal API<br/>(min: 1, max: 10 tasks)<br/>0.5 vCPU / 1GB RAM"]
            ECS_MEILI["ECS Fargate<br/>Meilisearch<br/>(1 task)<br/>0.5 vCPU / 1GB RAM<br/>+ EFS volume"]
        end

        subgraph "Private Subnet — Compute"
            SF["Step Functions<br/>Discovery Orchestrator"]
            L_AWS["Lambda — AWS Scanner<br/>Python 3.12<br/>512MB / 5min timeout"]
            L_GH["Lambda — GitHub Scanner<br/>Node.js 20<br/>512MB / 5min timeout"]
            L_REC["Lambda — Reconciler<br/>Node.js 20<br/>1GB / 5min timeout"]
            L_INF["Lambda — Inference<br/>Python 3.12<br/>512MB / 2min timeout"]
            L_SLACK["Lambda — Slack Bot<br/>Node.js 20<br/>256MB / 30s timeout"]
            L_WEBHOOK["Lambda — Webhook Processor<br/>Node.js 20<br/>256MB / 30s timeout"]
        end

        subgraph "Private Subnet — Data Tier"
            AURORA["Aurora Serverless v2<br/>PostgreSQL 15<br/>0.5-8 ACU<br/>Multi-AZ"]
            REDIS_C["ElastiCache Redis<br/>Serverless<br/>1-5 ECPUs"]
        end

        subgraph "Storage & Messaging"
            S3_SPA["S3 — SPA Assets"]
            S3_DATA["S3 — Discovery Snapshots<br/>+ Exports"]
            SQS_DISC["SQS FIFO<br/>Discovery Events"]
            SQS_INDEX["SQS Standard<br/>Search Index Updates"]
            EB["EventBridge<br/>Scheduled Discovery<br/>+ Webhook Routing"]
        end

        subgraph "Security & Observability"
            KMS["KMS — Encryption Keys<br/>(credentials, PII)"]
            SM["Secrets Manager<br/>GitHub tokens, PD keys"]
            CW["CloudWatch<br/>Logs + Metrics + Alarms"]
            XRAY["X-Ray<br/>Distributed Tracing"]
        end

        subgraph "API Management"
            APIGW["API Gateway v2<br/>WebSocket API<br/>(discovery progress)"]
        end
    end

    CF --> S3_SPA
    CF --> ALB
    ALB --> ECS_API
    ECS_API --> AURORA
    ECS_API --> REDIS_C
    ECS_API --> ECS_MEILI
    ECS_API --> SQS_INDEX
    SQS_INDEX --> L_WEBHOOK
    L_WEBHOOK --> ECS_MEILI

    SF --> L_AWS & L_GH
    L_AWS --> SQS_DISC
    L_GH --> SQS_DISC
    SQS_DISC --> L_REC
    L_REC --> L_INF
    L_INF --> AURORA
    L_INF --> SQS_INDEX

    EB --> SF
    APIGW --> ECS_API

    ECS_MEILI --> ECS_MEILI_EFS["EFS Volume<br/>(Meilisearch data persistence)"]

4.2 Customer-Side: Read-Only IAM Role

The customer deploys a single CloudFormation template provided by dd0c. This is the only thing the customer installs.

CloudFormation Template (provided to customer):

AWSTemplateFormatVersion: '2010-09-09'
Description: dd0c/portal read-only discovery role

Parameters:
  ExternalId:
    Type: String
    Description: Unique identifier provided by dd0c during onboarding
    NoEcho: true
  Dd0cAccountId:
    Type: String
    Default: '123456789012'  # dd0c platform AWS account
    Description: dd0c platform account ID

Resources:
  Dd0cDiscoveryRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-portal-discovery
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${Dd0cAccountId}:role/dd0c-scanner-role'
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: !Ref ExternalId
      ManagedPolicyArns: []  # No managed policies — custom policy only
      Policies:
        - PolicyName: dd0c-discovery-readonly
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # CloudFormation — read stacks and resources
              - Effect: Allow
                Action:
                  - cloudformation:ListStacks
                  - cloudformation:DescribeStacks
                  - cloudformation:ListStackResources
                  - cloudformation:GetTemplate
                Resource: '*'
              # ECS — read clusters, services, task definitions
              - Effect: Allow
                Action:
                  - ecs:ListClusters
                  - ecs:ListServices
                  - ecs:DescribeServices
                  - ecs:DescribeClusters
                  - ecs:DescribeTaskDefinition
                  - ecs:ListTaskDefinitions
                Resource: '*'
              # Lambda — read functions and event sources
              - Effect: Allow
                Action:
                  - lambda:ListFunctions
                  - lambda:GetFunction
                  - lambda:ListEventSourceMappings
                  - lambda:ListTags
                Resource: '*'
              # API Gateway — read APIs and resources
              - Effect: Allow
                Action:
                  - apigateway:GET
                Resource: '*'
              # RDS — read instances
              - Effect: Allow
                Action:
                  - rds:DescribeDBInstances
                  - rds:DescribeDBClusters
                  - rds:ListTagsForResource
                Resource: '*'
              # Resource Groups — read tags
              - Effect: Allow
                Action:
                  - tag:GetResources
                  - tag:GetTagKeys
                  - tag:GetTagValues
                  - resourcegroupstaggingapi:GetResources
                Resource: '*'
              # CloudWatch — read alarm states for health
              - Effect: Allow
                Action:
                  - cloudwatch:DescribeAlarms
                  - cloudwatch:GetMetricData
                Resource: '*'
              # STS — for identity verification
              - Effect: Allow
                Action:
                  - sts:GetCallerIdentity
                Resource: '*'

              # EXPLICIT DENIES — defense in depth
              - Effect: Deny
                Action:
                  - iam:*
                  - s3:GetObject
                  - s3:PutObject
                  - secretsmanager:GetSecretValue
                  - ssm:GetParameter*
                  - kms:Decrypt
                  - logs:GetLogEvents
                Resource: '*'

Outputs:
  RoleArn:
    Value: !GetAtt Dd0cDiscoveryRole.Arn
    Description: Provide this ARN to dd0c during onboarding

Key security decisions:

  • Explicit deny on IAM, S3 object access, Secrets Manager, SSM Parameter Store, KMS, and CloudWatch Logs. Even if AWS adds new read actions to a managed policy, these denies prevent access to sensitive data.
  • No ReadOnlyAccess managed policy — too broad. Custom policy scoped to exactly the services dd0c needs.
  • ExternalId prevents confused deputy attacks.
  • Role name is fixed (dd0c-portal-discovery) so customers can audit it easily.
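On the platform side, the scanner assumes this role via STS, passing the per-tenant ExternalId that the trust policy requires. A sketch (the session name and duration are illustrative choices; 900 seconds is the STS minimum, which comfortably covers a scan):

```python
def assume_discovery_role(sts_client, role_arn: str, external_id: str):
    """Assume the customer's dd0c-portal-discovery role. The ExternalId is
    what defeats the confused-deputy attack: STS rejects the call unless it
    matches the value baked into the role's trust policy."""
    resp = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="dd0c-discovery",
        ExternalId=external_id,
        DurationSeconds=900,  # STS minimum; a discovery scan needs only minutes
    )
    return resp["Credentials"]
```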

4.3 GitHub/GitLab Integration

V1: GitHub App (preferred over OAuth for org-level access)

| Permission | Access | Justification |
| --- | --- | --- |
| Repository contents | Read | CODEOWNERS, README, workflow files |
| Repository metadata | Read | Repo name, description, language, topics |
| Organization members | Read | Team membership for ownership inference |
| Organization administration | Read | Team structure |

No write permissions. No webhook creation (V1). No code push. No issue creation.

The GitHub App is installed at the org level. The customer clicks "Install" on the GitHub Marketplace listing, selects their org, and grants read-only access. The installation ID is stored in the connections table.

GitLab (V2): GitLab Group Access Token with read_api scope. Same pattern — read-only, scoped to the group, no write access.

4.4 Cost Estimates

All costs in USD/month. Assumes us-east-1 pricing as of 2026.

50 Services (10-20 engineers, Free/Team tier, ~5 tenants)

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| Aurora Serverless v2 | 0.5 ACU min (mostly idle) | $43 |
| ElastiCache Redis Serverless | Minimal ECPU usage | $15 |
| ECS Fargate — API | 1 task, 0.5 vCPU, 1GB | $18 |
| ECS Fargate — Meilisearch | 1 task, 0.5 vCPU, 1GB + EFS | $20 |
| Lambda (all functions) | ~50K invocations/month | $2 |
| Step Functions | ~150 state transitions/month | $1 |
| SQS | ~10K messages/month | $1 |
| S3 | <1GB storage | $1 |
| CloudFront | <10GB transfer | $2 |
| ALB | 1 ALB, minimal LCUs | $18 |
| KMS | 2 keys | $2 |
| Secrets Manager | 5 secrets | $3 |
| CloudWatch | Logs + basic metrics | $10 |
| Route 53 | 1 hosted zone | $1 |
| Total | | ~$137/month |

Revenue at 50 services (5 tenants × 10 eng × $10): $500/month → 73% gross margin

200 Services (50-100 engineers, ~15 tenants)

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| Aurora Serverless v2 | 0.5-2 ACU (scales with queries) | $90 |
| ElastiCache Redis Serverless | Moderate ECPU | $30 |
| ECS Fargate — API | 2 tasks (auto-scaling) | $36 |
| ECS Fargate — Meilisearch | 1 task, 1 vCPU, 2GB | $36 |
| Lambda | ~500K invocations/month | $10 |
| Step Functions | ~1,500 transitions/month | $5 |
| SQS | ~100K messages/month | $2 |
| S3 | ~5GB | $2 |
| CloudFront | ~50GB transfer | $8 |
| ALB | Moderate LCUs | $25 |
| Observability (CW + X-Ray) | | $30 |
| Other (KMS, SM, R53) | | $10 |
| Total | | ~$284/month |

Revenue at 200 services (15 tenants × ~5 eng avg × $10): ~$750/month → 62% gross margin

Note: conservative — many tenants will have 20-50 engineers, pushing revenue to $2-5K/month

1,000 Services (200-500 engineers, ~50 tenants)

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| Aurora Serverless v2 | 2-8 ACU | $350 |
| ElastiCache Redis Serverless | Higher ECPU | $80 |
| ECS Fargate — API | 3-5 tasks | $90 |
| ECS Fargate — Meilisearch | 1 task, 2 vCPU, 4GB | $72 |
| Lambda | ~5M invocations/month | $50 |
| Step Functions | ~15K transitions/month | $20 |
| SQS + EventBridge | | $10 |
| S3 | ~50GB | $5 |
| CloudFront | ~200GB transfer | $25 |
| ALB | Higher LCUs | $40 |
| Observability | | $80 |
| WAF | | $10 |
| Other | | $20 |
| Total | | ~$852/month |

Revenue at 1,000 services (50 tenants × ~7 eng avg × $10): ~$3,500/month → 76% gross margin

At scale, Aurora and Fargate efficiency improves. Gross margin stays healthy.

4.5 Scaling Strategy

Phase 1 (0-50 tenants): Single-region, minimal resources

  • Aurora Serverless v2 scales ACUs automatically
  • ECS API auto-scales 1-3 tasks based on CPU/request count
  • Meilisearch single instance (handles 100K+ documents easily)
  • No read replicas, no multi-region

Phase 2 (50-200 tenants): Optimize hot paths

  • Add Aurora read replica for search/dashboard queries (write to primary, read from replica)
  • Redis cluster mode for session/cache scaling
  • Meilisearch: evaluate moving to dedicated EC2 instance for cost efficiency at sustained load
  • Add CloudFront caching for API responses (service cards change infrequently — 60s TTL)

Phase 3 (200+ tenants): Multi-region consideration

  • Evaluate additional regional deployments for latency (us-west-2; eu-west-1 for EU customers)
  • Aurora Global Database for cross-region reads
  • CloudFront + Lambda@Edge for API routing
  • This is a $100K+ MRR problem — don't solve it prematurely

4.6 CI/CD Pipeline

graph LR
    DEV["Developer Push<br/>(GitHub)"] --> GHA["GitHub Actions"]

    subgraph "CI Pipeline"
        GHA --> LINT["Lint + Type Check"]
        LINT --> TEST["Unit Tests<br/>+ Integration Tests"]
        TEST --> BUILD["Docker Build<br/>(API + Meilisearch config)"]
        BUILD --> SCAN["Trivy Container Scan"]
        SCAN --> ECR["Push to ECR"]
    end

    subgraph "CD Pipeline"
        ECR --> STAGING["Deploy to Staging<br/>(ECS rolling update)"]
        STAGING --> SMOKE["Smoke Tests<br/>(discovery accuracy suite)"]
        SMOKE --> APPROVE["Manual Approval<br/>(solo founder reviews)"]
        APPROVE --> PROD["Deploy to Production<br/>(ECS rolling update<br/>+ Lambda version publish)"]
        PROD --> CANARY["Canary Check<br/>(5 min health check)"]
        CANARY --> DONE["✅ Deploy Complete"]
    end

Key CI/CD decisions:

  • GitHub Actions (not CodePipeline) — simpler, cheaper, Brian already knows it
  • Docker multi-stage builds for API (Node.js) and scanner Lambdas (Python/Node.js)
  • Staging environment: minimal Aurora (0.5 ACU) + single Fargate task. Cost: ~$60/month
  • Discovery accuracy regression suite: run discovery against a known test AWS account + GitHub org, assert >80% accuracy. If accuracy drops, block deploy.
  • Lambda deployments via SAM/CDK with versioning and aliases for instant rollback
  • Database migrations via Prisma Migrate (or raw SQL migrations) — run in CI before ECS deploy
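The accuracy-regression gate can be sketched as a small CI step. The 80% floor comes from the suite described above; the expected/discovered shapes and the exit-on-failure behavior are illustrative assumptions:

```python
import sys

ACCURACY_FLOOR = 0.80  # matches the >80% deploy gate above

def check_accuracy(expected: dict, discovered: dict) -> float:
    """Fraction of known test-account services whose inferred owner matches
    the hand-verified ground truth."""
    hits = sum(1 for svc, owner in expected.items() if discovered.get(svc) == owner)
    return hits / len(expected)

def gate(expected: dict, discovered: dict) -> float:
    """Exit non-zero (blocking the deploy) if accuracy dropped below the floor."""
    accuracy = check_accuracy(expected, discovered)
    if accuracy < ACCURACY_FLOOR:
        sys.exit(f"Discovery accuracy {accuracy:.0%} below {ACCURACY_FLOOR:.0%} — blocking deploy")
    return accuracy
```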

5. SECURITY

5.1 IAM Role Design for Customer AWS Accounts

The trust model is the single most sensitive aspect of dd0c/portal. Customers are granting a third-party SaaS read access to their infrastructure topology. This must be treated with the gravity it deserves.

Principle: Minimum viable access, maximum transparency.

Role Architecture

┌─────────────────────────────────────────────────────────────┐
│ CUSTOMER AWS ACCOUNT                                         │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ dd0c-portal-discovery (IAM Role)                      │   │
│  │                                                        │   │
│  │ Trust: dd0c platform account + ExternalId             │   │
│  │                                                        │   │
│  │ ALLOW:                                                │   │
│  │   cloudformation:List*, Describe*                     │   │
│  │   ecs:List*, Describe*                                │   │
│  │   lambda:List*, Get* (config only)                    │   │
│  │   apigateway:GET                                      │   │
│  │   rds:Describe*, ListTags*                            │   │
│  │   tag:Get*                                            │   │
│  │   cloudwatch:DescribeAlarms, GetMetricData            │   │
│  │                                                        │   │
│  │ EXPLICIT DENY:                                        │   │
│  │   iam:*                                               │   │
│  │   s3:GetObject, PutObject (no data access)            │   │
│  │   secretsmanager:GetSecretValue                       │   │
│  │   ssm:GetParameter*                                   │   │
│  │   kms:Decrypt                                         │   │
│  │   logs:GetLogEvents (no application logs)             │   │
│  │   ec2:GetConsoleOutput, GetPasswordData               │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  dd0c CANNOT:                                               │
│  ✗ Read S3 objects (no customer data)                       │
│  ✗ Read secrets or parameters                               │
│  ✗ Read application logs                                    │
│  ✗ Modify any resource                                      │
│  ✗ Create/delete/update anything                            │
│  ✗ Access IAM users, roles, or policies                     │
│  ✗ Decrypt any KMS-encrypted data                           │
└─────────────────────────────────────────────────────────────┘

Confused deputy prevention:

  • Every tenant gets a unique ExternalId (UUID v4) generated at onboarding
  • The customer's trust policy requires this ExternalId in the sts:AssumeRole condition
  • dd0c's scanner Lambda passes the tenant-specific ExternalId when assuming the role
  • Without the correct ExternalId, AssumeRole fails — even if an attacker knows the role ARN

Credential rotation:

  • The cross-account role uses temporary STS credentials (1-hour expiry by default)
  • No long-lived access keys are stored
  • The ExternalId can be rotated by the customer at any time (update CFN stack + update dd0c connection settings)
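
A sketch of what the scanner side looks like. The parameter dict mirrors the bullets above and would be passed to boto3's `sts.assume_role`; the helper names are illustrative, not the actual codebase:

```python
import uuid

def new_external_id() -> str:
    """Generated once per tenant at onboarding (UUID v4); the customer's
    trust policy requires it in the sts:ExternalId condition."""
    return str(uuid.uuid4())

def assume_role_request(role_arn: str, external_id: str) -> dict:
    """Kwargs the scanner Lambda passes to boto3's sts.assume_role().
    Without the tenant's exact ExternalId, STS denies the call even if an
    attacker knows the role ARN (confused deputy prevention)."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": "dd0c-portal-discovery",
        "ExternalId": external_id,
        "DurationSeconds": 3600,  # temporary credentials, 1-hour expiry
    }
```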

Audit trail:

  • Every AssumeRole call appears in the customer's CloudTrail
  • dd0c provides a "Discovery Activity Log" in the UI showing exactly which API calls were made, when, and what was returned (metadata only, not full responses)
  • Customers can verify dd0c's access patterns against their own CloudTrail

5.2 GitHub/GitLab Token Scoping

GitHub App permissions (V1):

| Permission | Level | Justification | What it CANNOT do |
|---|---|---|---|
| Contents | Read | Read CODEOWNERS, README, workflow files | Cannot push code, create branches, or modify files |
| Metadata | Read | Repo name, description, language, topics | Cannot modify repo settings |
| Members | Read | Org member list for ownership inference | Cannot invite/remove members |
| Administration | Read | Team structure and membership | Cannot create/modify teams |

What the GitHub App explicitly cannot do:

  • ✗ Push code or create pull requests
  • ✗ Create/close issues
  • ✗ Modify repository settings
  • ✗ Access private repository secrets
  • ✗ Trigger or modify GitHub Actions workflows
  • ✗ Access GitHub Packages or Container Registry
  • ✗ Create webhooks (V1 — webhooks added in V1.1 with explicit customer consent)

Token storage:

  • GitHub App installation tokens are short-lived (1 hour)
  • The GitHub App private key is stored in AWS Secrets Manager (KMS-encrypted)
  • Installation tokens are generated on-demand per scan, never persisted

5.3 Service Catalog Data Sensitivity

The service catalog contains infrastructure topology — a map of what services exist, who owns them, how they're connected, and what technology they use. This is sensitive data.

Threat model:

| Threat | Impact | Mitigation |
|---|---|---|
| Catalog data breach | Attacker learns customer's service topology, tech stack, and team structure. Enables targeted attacks. | Encryption at rest (Aurora + S3 + KMS). Encryption in transit (TLS 1.3 everywhere). Multi-tenant RLS. |
| Cross-tenant data leak | Tenant A sees Tenant B's services. Reputational catastrophe. | PostgreSQL RLS + application-level tenant_id enforcement + automated cross-tenant access tests in CI. |
| Insider threat (dd0c employee) | Solo founder has access to all tenant data. | Audit logging on all database access. Principle: Brian should never need to query customer data directly. Build admin tools that log every access. |
| Supply chain attack | Compromised npm/pip dependency exfiltrates catalog data. | Dependabot + Snyk. Pin dependency versions. Minimal dependency tree. Lambda functions have no outbound internet access except to AWS APIs (VPC + NAT gateway scoped to AWS endpoints). |
| Customer credential compromise | Attacker steals the cross-account IAM role ARN + ExternalId. | ExternalId is a UUID — not guessable. Role trust policy limits to dd0c's specific account. Even with both, attacker only gets read-only access to infrastructure metadata (not data). |

Data classification:

| Data Type | Classification | Storage | Retention |
|---|---|---|---|
| Service names, descriptions | Internal | Aurora (encrypted) | Active tenant lifetime |
| Team names, members | Internal | Aurora (encrypted) | Active tenant lifetime |
| AWS resource ARNs | Confidential | Aurora (encrypted) | Active tenant lifetime |
| GitHub repo URLs | Internal | Aurora (encrypted) | Active tenant lifetime |
| Discovery event logs | Internal | Aurora → S3 archive | 90 days hot, 1 year archive |
| IAM role ARNs + ExternalIds | Confidential | Secrets Manager (KMS) | Active connection lifetime |
| GitHub App private key | Secret | Secrets Manager (KMS) | Rotated annually |
| User sessions | Internal | Redis (encrypted in transit) | 24-hour TTL |

5.4 SOC 2 Considerations

dd0c/portal will need SOC 2 Type II certification by the time Business tier launches (Month 12+). Engineering directors buying at $25/engineer need compliance evidence.

SOC 2 Trust Service Criteria — Architecture Alignment:

| Criteria | Requirement | dd0c Architecture |
|---|---|---|
| CC6.1 — Logical Access | Restrict access to authorized users | GitHub OAuth SSO. Tenant isolation via RLS. No shared accounts. |
| CC6.3 — Encryption | Encrypt data in transit and at rest | TLS 1.3 (ALB, CloudFront). Aurora encryption (AES-256). S3 SSE-KMS. Redis in-transit encryption. |
| CC6.6 — System Boundaries | Define and protect system boundaries | VPC with private subnets. Security groups restrict inter-service communication. WAF on ALB. |
| CC7.1 — Monitoring | Detect anomalies and security events | CloudWatch alarms. CloudTrail for API access. GuardDuty for threat detection. |
| CC7.2 — Incident Response | Respond to security incidents | PagerDuty alerting for dd0c's own infrastructure. Incident response runbook (documented). |
| CC8.1 — Change Management | Control changes to infrastructure and code | GitHub PRs with required reviews (when team grows). CI/CD pipeline with staging. |
| A1.2 — Availability | Maintain system availability | Aurora Multi-AZ. ECS multi-task. CloudFront edge caching. Health checks + auto-recovery. |

Pre-SOC 2 actions (build into V1):

  • Enable CloudTrail in dd0c's AWS account (all API calls logged)
  • Enable GuardDuty (threat detection)
  • Enable AWS Config (configuration compliance)
  • Implement audit logging in the application (who accessed what, when)
  • Document data retention and deletion policies
  • Build a "delete my data" endpoint (GDPR + SOC 2 requirement)

5.5 The Trust Model

This is the hardest sell in the product. Customers are giving dd0c read access to their infrastructure graph. The architecture must make this trust decision as easy as possible.

Trust-building mechanisms:

  1. Transparency: The CloudFormation template is public and auditable. Customers can read every IAM permission before deploying. No hidden access.

  2. Customer-controlled revocation: Delete the CloudFormation stack → dd0c loses all access instantly. No "please contact support to revoke." The customer is always in control.

  3. Minimal blast radius: Even if dd0c is fully compromised, the attacker gets read-only access to infrastructure metadata (service names, resource ARNs, team names). They do NOT get application data, secrets, logs, or write access. The worst case is an attacker learning "Acme Corp has a payment-gateway service running on ECS in us-east-1." Sensitive, but not catastrophic.

  4. Open-source discovery agent (V1.1): Open-source the AWS scanner and GitHub scanner code. Customers can audit exactly what API calls are made and what data is collected. This is the strongest trust signal possible.

  5. Data residency: All customer data stored in the same AWS region as the customer's primary infrastructure (us-east-1 by default, eu-west-1 for EU customers at Business tier). No cross-region data transfer without explicit consent.

  6. Deletion guarantee: When a customer disconnects or churns, all their data (services, teams, discovery logs, corrections) is hard-deleted within 30 days. S3 objects are deleted. Meilisearch index entries are removed. Backups are excluded from restore for that tenant.

Trust comparison with competitors:

| Aspect | Backstage | Port/Cortex | dd0c/portal |
|---|---|---|---|
| Data location | Self-hosted (customer controls) | Vendor SaaS | Vendor SaaS |
| AWS access | N/A (manual YAML) | Similar IAM role | Read-only IAM role |
| Code auditability | Open source | Closed source | Closed source (scanners open-sourced V1.1) |
| Revocation | N/A | Contact support | Delete CFN stack (instant) |
| Blast radius | N/A | Read + sometimes write | Read-only, explicit denies |

dd0c's trust model is weaker than self-hosted Backstage (customer controls everything) but stronger than Port/Cortex (more transparent, easier revocation, smaller blast radius). The open-source scanner in V1.1 closes the gap significantly.


6. MVP SCOPE

6.1 V1 Technical Scope — "The 5-Minute Miracle"

V1 ships exactly four capabilities. Nothing else.

┌─────────────────────────────────────────────────────────────┐
│ V1 SCOPE                                                     │
│                                                              │
│  ✅ IN SCOPE                        ❌ OUT OF SCOPE          │
│  ─────────────                      ──────────────           │
│  AWS auto-discovery                 AI agent ("Ask Your      │
│    - CloudFormation                   Infra")                │
│    - ECS                            GitLab support           │
│    - Lambda                         Dependency visualization │
│    - API Gateway                    Scorecards / maturity    │
│    - RDS                            Kubernetes discovery     │
│    - Resource tags                  Terraform state parsing  │
│                                     Custom plugins           │
│  GitHub org scanning                Advanced RBAC            │
│    - Repos + languages              SSO (Okta/Azure AD)     │
│    - CODEOWNERS                     Self-hosted option       │
│    - README extraction              Multi-cloud (GCP/Azure) │
│    - Team memberships               Compliance reports       │
│    - Actions workflows              Software templates       │
│                                     Change feed              │
│  Service catalog UI                 Zombie service detection │
│    - Service cards                  Cost anomaly per service │
│    - Cmd+K search (<200ms)                                   │
│    - Team directory                                          │
│    - Correction UI                                           │
│    - Confidence scores                                       │
│                                                              │
│  Integrations                                                │
│    - PagerDuty/OpsGenie (on-call)                           │
│    - Slack bot (/dd0c who owns)                             │
│    - GitHub OAuth (auth)                                     │
│    - Stripe (billing)                                        │
└─────────────────────────────────────────────────────────────┘

6.2 What's Deferred to V2

| Feature | V2 Target | Dependency |
|---|---|---|
| AI Agent ("Ask Your Infra") | Month 7-9 | Requires stable catalog data + LLM integration |
| GitLab support | Month 8-10 | Separate scanner Lambda, GitLab Group Access Token flow |
| Dependency visualization | Month 5-7 (V1.1) | Requires VPC flow log analysis or API Gateway integration mapping |
| Scorecards | Month 5-6 (V1.1) | Requires stable service entity model + enough metadata signals |
| dd0c/cost integration | Month 7-9 | Requires dd0c/cost to be live with per-service attribution |
| dd0c/alert integration | Month 7-9 | Requires dd0c/alert to be live with service-level routing |
| Backstage YAML importer | Month 5-6 (V1.1) | Low effort, high acquisition value for Backstage refugees |
| Change feed | Month 10-12 | Requires discovery event log to be stable + diff computation |
| Advanced RBAC | Month 12+ (Business tier) | Requires team-level permission model |
| SSO (Okta/Azure AD) | Month 12+ (Business tier) | Requires SAML/OIDC integration |

6.3 The 5-Minute Onboarding Flow — Technical Implementation

sequenceDiagram
    participant U as Engineer
    participant SPA as Portal SPA
    participant API as Portal API
    participant GH as GitHub OAuth
    participant STRIPE as Stripe
    participant AWS_CFN as Customer AWS<br/>(CloudFormation)
    participant SF as Step Functions

    Note over U,SF: Step 1: Sign Up (30 seconds)
    U->>SPA: Click "Sign up with GitHub"
    SPA->>GH: OAuth redirect (scope: read:org, read:user)
    GH->>SPA: Authorization code
    SPA->>API: POST /auth/github {code}
    API->>GH: Exchange code for token
    API->>API: Create tenant + user
    API->>SPA: JWT + redirect to onboarding wizard

    Note over U,SF: Step 2: Select Plan (30 seconds)
    SPA->>U: "Free (≤10 eng) or Team ($10/eng)?"
    U->>SPA: Select Team
    SPA->>STRIPE: Stripe Checkout Session
    STRIPE->>U: Enter credit card
    STRIPE->>API: Webhook: checkout.session.completed
    API->>API: Activate subscription

    Note over U,SF: Step 3: Connect AWS (90 seconds)
    SPA->>U: "Deploy this CloudFormation template"
    Note right of SPA: One-click link:<br/>https://console.aws.amazon.com/cloudformation/<br/>home#/stacks/create/review?<br/>templateURL=https://dd0c-public.s3.amazonaws.com/<br/>cfn/dd0c-discovery-role.yaml&<br/>param_ExternalId={{generated-uuid}}
    U->>AWS_CFN: Click "Create Stack" in AWS Console
    AWS_CFN->>AWS_CFN: Create IAM role (~60 seconds)
    U->>SPA: Paste Role ARN
    SPA->>API: POST /connections/aws {roleArn, externalId}
    API->>API: sts:AssumeRole (validate access)
    API->>SPA: ✅ "AWS connected"

    Note over U,SF: Step 4: Connect GitHub (already done — OAuth grants org access)
    API->>API: List orgs from GitHub token
    SPA->>U: "Select your GitHub org"
    U->>SPA: Select org
    SPA->>API: POST /connections/github {orgLogin}
    API->>SPA: ✅ "GitHub connected"

    Note over U,SF: Step 5: Auto-Discovery (90 seconds)
    API->>SF: StartExecution {tenantId}
    SF-->>SPA: WebSocket: "Scanning AWS resources..."
    SF-->>SPA: WebSocket: "Found 234 AWS resources..."
    SF-->>SPA: WebSocket: "Scanning 89 GitHub repos..."
    SF-->>SPA: WebSocket: "Reconciling services..."
    SF-->>SPA: WebSocket: "Inferring ownership..."
    SF-->>SPA: WebSocket: "✅ Discovered 147 services"
    SPA->>U: Redirect to catalog view

    Note over U,SF: Total elapsed: ~4 minutes

Critical UX decisions:

  • The CloudFormation template link is pre-populated with the ExternalId. The customer clicks one link, lands in the AWS Console with the template loaded, and clicks "Create Stack." Three clicks total.
  • GitHub org access is granted during OAuth signup — no separate connection step. The onboarding wizard just asks "which org?" from the list of orgs the user belongs to.
  • WebSocket progress updates during discovery create the "Holy Shit" moment. Watching services appear in real-time (47... 89... 120... 147) is the emotional hook that drives screenshots and sharing.
  • If discovery takes >120 seconds, show a "This is taking longer than usual" message with a progress bar. Never show a spinner with no context.

6.4 >80% Discovery Accuracy — How to Measure It

Definition of accuracy:

accuracy = correct_services / total_discovered_services

Where "correct" means ALL of:
  1. The service actually exists (not a phantom/duplicate)
  2. The service name is recognizable to the team
  3. The primary owner is correct (or marked "unowned" if truly unowned)
  4. The repo link is correct (if a repo exists)

Measurement methodology:

  1. Beta measurement (Month 2-3): Personal call with each of 20 beta customers. Walk through their catalog together. For each service, ask: "Is this right?" Record corrections. Calculate accuracy per customer and aggregate.

  2. Production measurement (Month 3+): Track the correction rate.

    correction_rate = services_corrected_within_7_days / services_discovered
    accuracy_estimate = 1 - correction_rate
    

    The correction rate understates the true error rate — some incorrect services are never corrected because users don't notice or don't care — so this estimate is an upper bound on accuracy. But it's a continuous, automated metric.

  3. Accuracy by signal source:

    SELECT
      discovery_sources,
      COUNT(*) as total,
      COUNT(*) FILTER (WHERE corrected_within_7d = false) as uncorrected,
      1.0 - (COUNT(*) FILTER (WHERE corrected_within_7d = true)::decimal / COUNT(*)) as accuracy
    FROM services
    WHERE tenant_id = :tenant_id
    GROUP BY discovery_sources
    ORDER BY accuracy ASC;
    

    This reveals which discovery signals are weakest (e.g., "Lambda-only services have 60% accuracy" → invest in Lambda grouping heuristics).

  4. Accuracy improvement tracking:

    Week 1: 78% accuracy (initial discovery)
    Week 2: 85% accuracy (after user corrections + propagation)
    Week 4: 91% accuracy (after model improvements from correction patterns)
    Week 8: 93% accuracy (steady state)
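
The production estimate from step 2 is simple enough to compute inline wherever corrections are counted. A minimal sketch (function name is illustrative):

```python
def accuracy_estimate(services_discovered: int, corrected_within_7d: int) -> float:
    """Continuous production metric: 1 - correction_rate. Because some
    incorrect services are never corrected, this overstates true accuracy."""
    if services_discovered == 0:
        return 0.0
    return 1.0 - corrected_within_7d / services_discovered
```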
    

If accuracy is below 80% on first run:

  • Show a prominent "Review your catalog" banner: "We found 147 services. Some may need corrections. Help us learn your infrastructure by reviewing the flagged items."
  • Sort low-confidence services to the top of the review queue
  • Gamify corrections: "12 services need your review. ~5 minutes."
  • Each correction improves the model for similar services — show this: "Your correction improved confidence for 3 similar services."

6.5 Technical Debt Budget

V1 is built fast. Some shortcuts are intentional. Document them so they don't become surprises.

| Debt Item | Shortcut | Proper Solution | When to Fix |
|---|---|---|---|
| Meilisearch persistence | EFS volume (slow, but works) | Dedicated EC2 instance with local SSD | >200 tenants |
| Search index sync | SQS → Lambda → Meilisearch (eventual consistency, ~5s delay) | Change Data Capture from Aurora → real-time sync | If users complain about stale search results |
| GitHub rate limiting | Simple retry with backoff | Distributed rate limiter with token bucket per GitHub App installation | >50 tenants scanning simultaneously |
| Discovery scheduling | EventBridge cron (same time for all tenants) | Distributed scheduler with jitter to spread load | >100 tenants |
| Monitoring | CloudWatch basic metrics + alarms | Datadog or Grafana Cloud for full observability | >$10K MRR (can afford $200/month for monitoring) |
| Database migrations | Raw SQL files run in CI | Prisma Migrate or Flyway with proper versioning | When team grows beyond solo founder |
| Error handling | Generic error pages, console.error logging | Structured error codes, Sentry integration, user-facing error messages | Month 3-4 |
| Test coverage | Integration tests for discovery accuracy, minimal unit tests | 80%+ unit test coverage, E2E tests with Playwright | Month 4-6 |

6.6 Solo Founder Operational Model

What Brian operates:

  • 1 AWS account (dd0c platform)
  • 1 GitHub org (dd0c)
  • 1 Stripe account
  • 1 Slack workspace (dd0c community + bot)
  • 1 PagerDuty account (dd0c's own alerting)

Operational runbook (daily):

  • Check CloudWatch dashboard (5 minutes): any alarms firing? Any discovery failures?
  • Check Stripe dashboard: new signups? Churns? Failed payments?
  • Check support channel (Slack/email): any customer issues?
  • Total daily ops: ~15 minutes when nothing is broken

Alerting (PagerDuty):

  • P1 (page immediately): API 5xx rate >5%, Aurora connection failures, discovery pipeline stuck >30 minutes
  • P2 (page during business hours): Discovery accuracy drop >10% for any tenant, Meilisearch index lag >60 seconds, Stripe webhook failures
  • P3 (daily digest): Lambda error rate >1%, CloudWatch log anomalies, dependency vulnerability alerts

On-call: Brian is the only on-call. This is sustainable up to ~50 tenants if the architecture is reliable. At $20K MRR, hire a part-time contractor for L1 support (Slack responses, known-issue triage).


7. API DESIGN

The Portal API is a RESTful JSON API. It serves the SPA frontend, the Slack bot, and internal integrations (dd0c/cost, dd0c/alert). In V1, there is no public API for customers to programmatically query their catalog, but the internal API is designed cleanly enough to be exposed later.

All requests require authentication. The SPA uses HTTP-only cookies (JWT). Integrations use internal IAM/VPC auth or signed tokens.

7.1 Discovery API

Manages the auto-discovery lifecycle.

POST /api/v1/discovery/run Triggers a manual full discovery scan.

  • Request: { "connections": ["aws", "github"] }
  • Response: { "run_id": "uuid", "status": "started" }

GET /api/v1/discovery/runs/{run_id} Polls the status of a specific discovery run.

  • Response:
    {
      "id": "uuid",
      "status": "running",
      "started_at": "2026-02-28T10:00:00Z",
      "progress": {
        "aws_resources_found": 142,
        "github_repos_found": 89,
        "current_phase": "reconciliation"
      }
    }
    

WebSocket: wss://api.dd0c.com/v1/discovery/stream?run_id=uuid Real-time progress events pushed to the UI during onboarding.

  • Events: phase_started, resources_discovered, services_reconciled, ownership_inferred, completed.
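
For clients that can't hold a WebSocket open (a CLI, a cron job), the run endpoint can be polled instead. A minimal sketch — the injectable `fetch` parameter is an illustration for testability, not part of the API:

```python
import json
import time
import urllib.request

def poll_discovery_run(base_url: str, run_id: str, interval: float = 2.0, fetch=None):
    """Poll GET /api/v1/discovery/runs/{run_id} until the run leaves the
    'running' status; returns the final run document."""
    if fetch is None:  # default fetcher; tests can inject a stub
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    while True:
        run = fetch(f"{base_url}/api/v1/discovery/runs/{run_id}")
        if run["status"] != "running":
            return run
        time.sleep(interval)
```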

7.2 Service API

CRUD and search operations for the catalog.

GET /api/v1/services/search?q={query}&limit=10 The Cmd+K search endpoint. Proxies to Meilisearch or Redis cache.

  • Response: Array of matched services with highlighted snippets and confidence scores. (See Section 2.3).

GET /api/v1/services/{service_id} Retrieves the full, expanded detail view of a single service.

  • Response:
    {
      "id": "svc_123",
      "name": "payment-gateway",
      "display_name": "Payment Gateway",
      "description": "Handles Stripe checkout",
      "lifecycle": "production",
      "owner": {
        "team_id": "team_456",
        "name": "Payments Team",
        "confidence": 0.92,
        "source": "codeowners"
      },
      "repo": {
        "url": "https://github.com/acme/payment-gateway",
        "default_branch": "main"
      },
      "infrastructure": {
        "aws_resources": [
          {"type": "ecs_service", "arn": "arn:aws:ecs:...", "region": "us-east-1"}
        ]
      },
      "health_status": "healthy",
      "last_deploy_at": "2026-02-27T14:30:00Z"
    }
    

PATCH /api/v1/services/{service_id} Allows users to correct service metadata (e.g., fix a wrong owner).

  • Request: { "team_id": "team_789", "correction_reason": "Team reorg" }
  • Response: 200 OK. (Triggers background propagation to similar services).

7.3 Team & Ownership API

GET /api/v1/teams Lists all inferred and synced teams.

GET /api/v1/teams/{team_id}/services Lists all services owned by a specific team.

  • Query params: ?role=primary|contributing|on_call
  • Response: Array of service summaries.

7.4 Slack Bot API

The Slack bot translates slash commands into API queries. The bot Lambda receives the webhook from Slack, authenticates the workspace, maps it to a tenant, and calls the internal Portal API.

Command: /dd0c who owns payment-gateway

  • Bot logic: Calls GET /api/v1/services/search?q=payment-gateway&limit=1.
  • Bot response (ephemeral or in-channel):

    Payment Gateway is owned by @payments-team (92% confidence). Repo: acme/payment-gateway | Health: Healthy | On-Call: @mike

Command: /dd0c oncall auth-service

  • Bot logic: Looks up service, finds owner team, queries mapped PagerDuty schedule.
  • Bot response:

    Primary on-call for Auth Service (@platform-team) is Sarah Chen until 5:00 PM.
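
The bot Lambda's translation step can be sketched as follows (function names are illustrative assumptions; the search URL matches the Service API endpoint above):

```python
from urllib.parse import quote

def parse_command(text: str) -> tuple[str, str]:
    """Map the text Slack sends after '/dd0c' to (intent, service query)."""
    if text.startswith("who owns "):
        return ("owner", text[len("who owns "):].strip())
    if text.startswith("oncall "):
        return ("oncall", text[len("oncall "):].strip())
    return ("help", "")

def search_url(base_url: str, query: str) -> str:
    """Internal Portal API lookup the bot performs for either intent."""
    return f"{base_url}/api/v1/services/search?q={quote(query)}&limit=1"
```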

7.5 Webhooks (Outbound)

V1 supports outbound webhooks so customers can react to catalog changes.

POST https://customer-endpoint.com/webhooks/dd0c

  • Event: service.ownership.changed
  • Payload:
    {
      "event_id": "evt_abc123",
      "type": "service.ownership.changed",
      "timestamp": "2026-02-28T12:00:00Z",
      "data": {
        "service_id": "svc_123",
        "service_name": "payment-gateway",
        "old_owner_id": "team_456",
        "new_owner_id": "team_789",
        "confidence": 0.95,
        "source": "user_correction"
      }
    }
    
  • Other events: service.discovered, service.health.degraded, discovery.run.completed.
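
On the receiving side, a customer endpoint might dispatch on the event type like this (a minimal sketch; the handler bodies are placeholders for whatever the customer's system does):

```python
import json

def handle_dd0c_webhook(raw_body: str) -> str:
    """Route a dd0c webhook payload to a handler based on its type."""
    event = json.loads(raw_body)
    handlers = {
        "service.ownership.changed": lambda d: (
            f"reassign {d['service_name']}: {d['old_owner_id']} -> {d['new_owner_id']}"
        ),
        "service.discovered": lambda d: f"new service: {d['service_name']}",
        "discovery.run.completed": lambda d: "refresh local catalog cache",
    }
    handler = handlers.get(event["type"])
    return handler(event["data"]) if handler else "ignored"
```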

7.6 Platform Integration APIs (Internal)

These endpoints allow other dd0c modules to enrich the portal, and the portal to enrich other modules. These use internal IAM auth and bypass the public API Gateway.

dd0c/cost Integration

  • Portal calls Cost: GET http://cost.internal/api/v1/services/{service_id}/spend Retrieves the trailing 30-day AWS spend for the resources mapped to this service. Displayed on the service card.
  • Cost calls Portal: GET http://portal.internal/api/v1/resources/{arn}/service When dd0c/cost detects an anomaly on an RDS instance, it asks the portal "which service owns this ARN?" so it can alert the correct team.

dd0c/alert Integration

  • Portal calls Alert: GET http://alert.internal/api/v1/services/{service_id}/incidents?status=active Retrieves active incidents to update the service's health_status badge.
  • Alert calls Portal: GET http://portal.internal/api/v1/services/{service_id}/routing When an alert fires for a service, dd0c/alert asks the portal for the primary owner's Slack channel and PagerDuty schedule to route the page correctly.

dd0c/run Integration

  • Portal calls Run: GET http://run.internal/api/v1/services/{service_id}/runbooks Links executable runbooks directly on the service detail card.