
dd0c/portal — Technical Architecture

Product: Lightweight Internal Developer Portal
Phase: 6 — Architecture Design
Date: 2026-02-28
Author: Solutions Architecture
Status: Draft


1. SYSTEM OVERVIEW

High-Level Architecture

graph TB
    subgraph "Customer Environment"
        AWS_ACCOUNT["Customer AWS Account(s)"]
        GH_ORG["GitHub Organization"]
        PD["PagerDuty / OpsGenie"]
    end

    subgraph "dd0c Platform — Control Plane (us-east-1)"
        subgraph "Ingress"
            ALB["Application Load Balancer<br/>+ WAF + CloudFront"]
        end

        subgraph "API Layer"
            API["Portal API<br/>(ECS Fargate)"]
            WS["WebSocket Gateway<br/>(API Gateway v2)"]
        end

        subgraph "Discovery Engine"
            ORCH["Discovery Orchestrator<br/>(Step Functions)"]
            AWS_SCAN["AWS Scanner<br/>(Lambda)"]
            GH_SCAN["GitHub Scanner<br/>(Lambda)"]
            RECONCILER["Reconciliation Engine<br/>(Lambda)"]
            INFERENCE["Ownership Inference<br/>(Lambda)"]
        end

        subgraph "Data Layer"
            PG["PostgreSQL (RDS Aurora Serverless v2)<br/>Service Catalog + Tenants"]
            REDIS["ElastiCache Redis<br/>Session + Cache + Search"]
            S3_DATA["S3<br/>Discovery Snapshots + Exports"]
            SQS["SQS FIFO<br/>Discovery Events"]
        end

        subgraph "Search"
            MEILI["Meilisearch<br/>(ECS Fargate)<br/>Full-text + Faceted Search"]
        end

        subgraph "Integrations"
            SLACK_BOT["Slack Bot<br/>(Lambda)"]
            WEBHOOK_OUT["Outbound Webhooks<br/>(EventBridge → Lambda)"]
        end

        subgraph "Frontend"
            SPA["React SPA<br/>(CloudFront + S3)"]
        end
    end

    subgraph "dd0c Platform Modules"
        DD0C_COST["dd0c/cost"]
        DD0C_ALERT["dd0c/alert"]
        DD0C_RUN["dd0c/run"]
    end

    %% Customer → Platform connections
    AWS_ACCOUNT -- "AssumeRole<br/>(read-only)" --> AWS_SCAN
    GH_ORG -- "OAuth / GitHub App<br/>(read-only)" --> GH_SCAN
    PD -- "API Key<br/>(read-only)" --> API

    %% User flows
    SPA --> ALB --> API
    SPA --> WS

    %% Discovery flow
    ORCH --> AWS_SCAN
    ORCH --> GH_SCAN
    AWS_SCAN --> SQS
    GH_SCAN --> SQS
    SQS --> RECONCILER
    RECONCILER --> INFERENCE
    INFERENCE --> PG
    PG --> MEILI

    %% API reads
    API --> PG
    API --> MEILI
    API --> REDIS

    %% Integrations
    SLACK_BOT --> API
    API --> WEBHOOK_OUT

    %% dd0c platform
    API <-- "Internal API" --> DD0C_COST
    API <-- "Internal API" --> DD0C_ALERT
    API <-- "Internal API" --> DD0C_RUN

Component Inventory

| Component | Responsibility | Technology | Justification |
|---|---|---|---|
| Portal API | REST/GraphQL API for catalog CRUD, search proxy, auth, billing | Node.js (Fastify) on ECS Fargate | Fastify is among the fastest Node frameworks. Fargate eliminates server management. Node aligns with the React frontend for code sharing (types, validation schemas). |
| Discovery Orchestrator | Coordinates multi-source discovery runs; manages the state machine for the scan → reconcile → infer → index pipeline | AWS Step Functions | Native retry/error handling, visual debugging, pay-per-transition. Well suited to long-running multi-step workflows. |
| AWS Scanner | Scans customer AWS accounts via cross-account AssumeRole. Enumerates CloudFormation stacks, ECS services, Lambda functions, API Gateway APIs, RDS instances, and tagged resources. | Python (Lambda) | boto3 is the canonical AWS SDK. Lambda cold starts are acceptable for background scanning (not user-facing). Python's AWS ecosystem is mature and deep. |
| GitHub Scanner | Scans the GitHub org: repos, languages, CODEOWNERS, README content, workflow files, team memberships, recent commit authors. | Node.js (Lambda) | Octokit (the GitHub SDK) is TypeScript-native. Shares types with the API layer. |
| Reconciliation Engine | Merges AWS + GitHub scan results into unified service entities. Deduplicates, cross-references repo → infra mappings, resolves conflicts. | Node.js (Lambda) | Core business logic. Shares domain types with the API. |
| Ownership Inference | Determines service ownership from CODEOWNERS, git blame frequency, team membership, CloudFormation tags, and historical corrections. Produces confidence scores. | Python (Lambda) | Scoring/ML-adjacent logic. Python's data-processing libraries (pandas for frequency analysis) are a strong fit. |
| PostgreSQL | Primary datastore: service catalog, tenant data, user accounts, discovery history, corrections, billing state. | Aurora Serverless v2 | Scales down to its ACU floor during low traffic (solo-founder cost control). The relational model fits the service catalog's structured data. Aurora's auto-scaling handles growth without capacity planning. |
| Redis | Session store, API response cache, rate limiting, real-time search suggestions (prefix trie). | ElastiCache Redis (Serverless) | Sub-millisecond reads for Cmd+K autocomplete. Serverless pricing aligns with variable load. |
| Meilisearch | Full-text search index for Cmd+K. Typo-tolerant, faceted, <50ms response. | Meilisearch on ECS Fargate (single container) | Over Elasticsearch: far simpler to operate (single binary, no JVM, no cluster management), typo tolerance out of the box, <50ms search on 10K documents. A solo founder can't babysit an ES cluster. Over Typesense: better faceted search and a more active open-source community. |
| React SPA | Portal UI: service catalog, Cmd+K search, service detail cards, team directory, correction UI, onboarding wizard. | React + Vite + TailwindCSS, hosted on CloudFront + S3 | SPA for instant Cmd+K interactions without server round-trips for UI state. CloudFront for global edge caching. Vite for fast builds. |
| Slack Bot | Responds to `/dd0c who owns <service>` commands. Passive viral loop. | Node.js (Lambda) via Slack Bolt | Lambda costs nothing when idle. Bolt is Slack's official SDK. |
| WebSocket Gateway | Pushes real-time discovery progress to the UI during onboarding ("Found 47 services... 89 services... 147 services..."). | API Gateway WebSocket API + Lambda | Managed WebSocket infrastructure. Only needed during discovery runs — Lambda scales to zero otherwise. |

Technology Choices — Key Decisions

Why Not Serverless-Everything (Lambda for API)? The Portal API handles Cmd+K search requests that must respond in <100ms. Lambda cold starts (500ms-2s for Node.js) are unacceptable for the primary user interaction. ECS Fargate with minimum 1 task provides warm, consistent latency. Discovery Lambdas are background tasks where cold starts are irrelevant.

Why Meilisearch Over Algolia/Elasticsearch?

  • Algolia: SaaS pricing at scale ($1/1K search requests) becomes expensive with high DAU. Self-hosted Meilisearch is ~$0 marginal cost per search.
  • Elasticsearch: Operational complexity is prohibitive for a solo founder. Requires JVM tuning, cluster management, index lifecycle policies. Meilisearch is a single binary with zero configuration.
  • Meilisearch: Typo-tolerant by default (critical for Cmd+K UX), faceted filtering, <50ms on 100K documents, single Docker container, 200MB RAM for 10K services. Perfect for the scale and operational model.

Why PostgreSQL Over DynamoDB? The service catalog is inherently relational: services belong to teams, teams have members, services have dependencies on other services, services map to repos, repos map to infrastructure. DynamoDB's single-table design would require complex GSIs and denormalization that increases development time. Aurora Serverless v2 scales down to its 0.5 ACU floor (~$43/month at idle) and handles relational queries natively. At the scale of 50-1000 services per tenant, PostgreSQL is more than sufficient.

Why Not a Graph Database for Dependencies (V1)? Service dependency graphs are a V1.1 feature. For V1, dependencies are stored as adjacency lists in PostgreSQL (service_dependencies join table). This is sufficient for "what does this service depend on?" queries. A dedicated graph database (Neptune at $0.10/hour minimum = $73/month, or Neo4j) is premature optimization for V1. If dependency visualization becomes a core feature in V1.1+, evaluate Neptune Serverless or an in-app graph traversal library (graphology.js).
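A minimal sketch of the adjacency-list approach, using an in-memory SQLite table in place of PostgreSQL (column names follow the SERVICE_DEPENDENCY model described later in this document; the sample rows are illustrative):

```python
import sqlite3

# In-memory stand-in for the PostgreSQL service_dependencies join table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE service_dependencies (
        source_service_id TEXT,
        target_service_id TEXT,
        dependency_type   TEXT
    )
""")
conn.executemany(
    "INSERT INTO service_dependencies VALUES (?, ?, ?)",
    [
        ("payment-gateway", "payment-processor", "calls"),
        ("payment-gateway", "audit-log", "publishes_to"),
        ("checkout", "payment-gateway", "calls"),
    ],
)

def direct_dependencies(service_id: str) -> list[str]:
    """Answer 'what does this service depend on?' with a single indexed lookup."""
    rows = conn.execute(
        "SELECT target_service_id FROM service_dependencies "
        "WHERE source_service_id = ? ORDER BY target_service_id",
        (service_id,),
    )
    return [r[0] for r in rows]

deps = direct_dependencies("payment-gateway")
# deps == ["audit-log", "payment-processor"] (alphabetical via ORDER BY)
```

If transitive traversal becomes a V1.1 need, a recursive CTE over the same table covers it before any graph database is warranted.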

The 5-Minute Auto-Discovery Flow — Core Architectural Driver

This is the most important sequence in the entire system. Every architectural decision serves this flow.

sequenceDiagram
    participant User as Engineer
    participant UI as Portal UI
    participant API as Portal API
    participant SF as Step Functions
    participant AWS as AWS Scanner (λ)
    participant GH as GitHub Scanner (λ)
    participant SQS as SQS FIFO
    participant REC as Reconciler (λ)
    participant INF as Inference (λ)
    participant DB as PostgreSQL
    participant MS as Meilisearch
    participant WS as WebSocket

    Note over User,WS: Minute 0:00 — Signup
    User->>UI: Sign up (GitHub OAuth)
    UI->>API: POST /auth/github
    API->>DB: Create tenant + user

    Note over User,WS: Minute 1:00 — Connect AWS
    User->>UI: Deploy CloudFormation template (1-click)
    UI->>API: POST /connections/aws {roleArn, externalId}
    API->>API: sts:AssumeRole validation
    API->>DB: Store connection credentials (encrypted)

    Note over User,WS: Minute 2:00 — Connect GitHub (already done via OAuth)
    API->>DB: Store GitHub org connection

    Note over User,WS: Minute 2:30 — Trigger Discovery
    API->>SF: StartExecution {tenantId, connections}
    SF->>WS: Push "Discovery started..."

    Note over User,WS: Minute 2:30-3:30 — Parallel Scanning
    par AWS Scan
        SF->>AWS: Scan CloudFormation stacks
        AWS->>SQS: {type: cfn_stack, resources: [...]}
        SF->>AWS: Scan ECS services
        AWS->>SQS: {type: ecs_service, services: [...]}
        SF->>AWS: Scan Lambda functions
        AWS->>SQS: {type: lambda_fn, functions: [...]}
        SF->>AWS: Scan API Gateway APIs
        AWS->>SQS: {type: apigw, apis: [...]}
        SF->>AWS: Scan RDS instances
        AWS->>SQS: {type: rds, instances: [...]}
    and GitHub Scan
        SF->>GH: Scan repos (non-archived, non-fork)
        GH->>SQS: {type: gh_repo, repos: [...]}
        SF->>GH: Scan CODEOWNERS files
        GH->>SQS: {type: codeowners, mappings: [...]}
        SF->>GH: Scan team memberships
        GH->>SQS: {type: gh_teams, teams: [...]}
    end

    WS-->>UI: Push "Found 47 AWS resources..."
    WS-->>UI: Push "Found 89 GitHub repos..."

    Note over User,WS: Minute 3:30-4:00 — Reconciliation
    SQS->>REC: Batch process discovery events
    REC->>REC: Cross-reference AWS resources ↔ GitHub repos
    REC->>REC: Deduplicate (CFN stack name = ECS service = repo name)
    REC->>REC: Merge into unified service entities
    REC->>DB: Upsert service entities

    Note over User,WS: Minute 4:00-4:30 — Ownership Inference
    SF->>INF: Infer ownership for all services
    INF->>INF: Score: CODEOWNERS (weight: 0.4) + git blame (0.25) + CFN tags (0.2) + team membership (0.15)
    INF->>DB: Update services with owner + confidence score
    INF->>MS: Index services for search

    WS-->>UI: Push "Discovered 147 services. Catalog ready."

    Note over User,WS: Minute 5:00 — First Search
    User->>UI: Cmd+K → "payment"
    UI->>API: GET /search?q=payment
    API->>MS: Search
    MS->>API: Results in <50ms
    API->>UI: payment-gateway, payment-processor, payment-webhook
    User->>User: "Holy shit, this actually works."

Critical timing constraints:

  • AWS scanning must complete in <60 seconds for accounts with up to 500 resources. Achieved via parallel Lambda invocations per resource type.
  • GitHub scanning must complete in <60 seconds for orgs with up to 500 repos. Achieved via GitHub GraphQL API (batch queries) instead of REST (one request per repo).
  • Reconciliation must complete in <30 seconds. Single Lambda invocation processing all SQS messages in batch.
  • Total pipeline: <120 seconds from trigger to searchable catalog. The "5-minute" promise includes signup + AWS connection time.

Why Step Functions (not a simple Lambda chain)?

  • Built-in retry with exponential backoff per step (AWS API throttling is common)
  • Parallel execution of AWS + GitHub scans with automatic join
  • Visual execution history for debugging failed discoveries
  • Error handling: if GitHub scan fails, AWS results still proceed (partial discovery > no discovery)
  • State machine is inspectable — critical for debugging accuracy issues in production
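The points above can be sketched as an Amazon States Language skeleton, expressed here as a Python dict (illustrative only; state names, ARNs, and retry settings are placeholders, not the production definition). The Parallel state runs both scans; only the GitHub branch carries a Catch, so a GitHub outage still yields a partial, AWS-only catalog:

```python
import json

# Illustrative ASL skeleton. All names and ARNs below are placeholders.
state_machine = {
    "StartAt": "ParallelScan",
    "States": {
        "ParallelScan": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "AwsScan",
                    "States": {
                        "AwsScan": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:aws-scanner",
                            # Exponential backoff per step (AWS API throttling is common).
                            "Retry": [{
                                "ErrorEquals": ["States.ALL"],
                                "IntervalSeconds": 2,
                                "BackoffRate": 2.0,
                                "MaxAttempts": 3,
                            }],
                            "End": True,
                        }
                    },
                },
                {
                    "StartAt": "GithubScan",
                    "States": {
                        "GithubScan": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:github-scanner",
                            # Catch instead of fail: partial discovery > no discovery.
                            "Catch": [{
                                "ErrorEquals": ["States.ALL"],
                                "Next": "GithubScanFailed",
                            }],
                            "End": True,
                        },
                        "GithubScanFailed": {
                            "Type": "Pass",
                            "Result": {"repos": []},  # empty GitHub result, AWS branch proceeds
                            "End": True,
                        },
                    },
                },
            ],
            "Next": "Reconcile",  # automatic join of both branches
        },
        "Reconcile": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:reconciler",
            "End": True,
        },
    },
}

definition_json = json.dumps(state_machine)  # what StartExecution's state machine would be created from
```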

2. CORE COMPONENTS

2.1 Discovery Engine

The discovery engine is the product. Everything else is UI on top of discovered data. If discovery is wrong, nothing else matters.

Architecture

graph TB
    subgraph "Discovery Orchestrator (Step Functions)"
        TRIGGER["Trigger<br/>(API call / Schedule)"]
        PLAN["Plan Phase<br/>Determine scan scope"]

        subgraph "Scan Phase (Parallel)"
            subgraph "AWS Scanners"
                CFN["CloudFormation<br/>Scanner"]
                ECS_S["ECS<br/>Scanner"]
                LAMBDA_S["Lambda<br/>Scanner"]
                APIGW_S["API Gateway<br/>Scanner"]
                RDS_S["RDS<br/>Scanner"]
                TAG_S["Resource Groups<br/>Tag Scanner"]
            end
            subgraph "GitHub Scanners"
                REPO_S["Repository<br/>Scanner"]
                CODEOWNERS_S["CODEOWNERS<br/>Parser"]
                TEAM_S["Team Membership<br/>Scanner"]
                README_S["README<br/>Extractor"]
                WORKFLOW_S["Actions Workflow<br/>Scanner"]
            end
        end

        RECONCILE["Reconciliation Phase"]
        INFER["Inference Phase"]
        INDEX["Index Phase"]
    end

    TRIGGER --> PLAN
    PLAN --> CFN & ECS_S & LAMBDA_S & APIGW_S & RDS_S & TAG_S
    PLAN --> REPO_S & CODEOWNERS_S & TEAM_S & README_S & WORKFLOW_S
    CFN & ECS_S & LAMBDA_S & APIGW_S & RDS_S & TAG_S --> RECONCILE
    REPO_S & CODEOWNERS_S & TEAM_S & README_S & WORKFLOW_S --> RECONCILE
    RECONCILE --> INFER --> INDEX

AWS Scanner — Resource-to-Service Mapping Strategy

The hardest problem in auto-discovery: what constitutes a "service"? AWS resources are granular (individual Lambdas, ECS tasks, RDS instances), but engineers think in services (payment-service, auth-service, user-api). The scanner must infer service boundaries from infrastructure patterns.

Service Identification Heuristics (priority order):

| Signal | Confidence | Logic |
|---|---|---|
| CloudFormation stack | 0.95 | Each stack is almost always a service or a closely related group. Stack name → service name. Stack tags (`service`, `team`, `project`) → metadata. |
| ECS service | 0.90 | Each ECS service is a deployable unit. Service name → service name. Task definition → tech stack (container image). |
| Lambda function with API Gateway trigger | 0.85 | Lambda + APIGW = API service. Group Lambdas sharing the same APIGW by API name. |
| Lambda function (standalone) | 0.60 | Standalone Lambdas may be services, cron jobs, or glue code. Group by naming prefix (e.g., `payment-*` → payment service). |
| Tagged resource group | 0.80 | Resources sharing a `service` or `project` tag are grouped. Tag value → service name. |
| RDS instance | 0.50 | Databases are infrastructure, not services — but map to the owning service via naming convention or CFN association. |
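The standalone-Lambda heuristic (group by naming prefix) can be sketched as follows; the split-on-first-hyphen rule and the two-function minimum are assumptions of this sketch:

```python
from collections import defaultdict

def group_lambdas_by_prefix(function_names: list[str], min_group: int = 2) -> dict[str, list[str]]:
    """Group standalone Lambdas by their leading name segment (e.g. payment-*).

    Sketch of the 0.60-confidence heuristic above: functions sharing a prefix
    become one candidate service; singletons are left for user review.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name in function_names:
        prefix = name.split("-", 1)[0]   # assumed delimiter convention
        groups[prefix].append(name)
    return {p: sorted(fns) for p, fns in groups.items() if len(fns) >= min_group}

candidates = group_lambdas_by_prefix([
    "payment-webhook-handler",
    "payment-retry-worker",
    "nightly-cleanup",   # singleton: not promoted to a service candidate
])
# candidates == {"payment": ["payment-retry-worker", "payment-webhook-handler"]}
```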

AWS API Calls per Scan (estimated):

cloudformation:ListStacks                    → 1 call (paginated)
cloudformation:DescribeStacks                → 1 call per stack (batched)
cloudformation:ListStackResources            → 1 call per stack
ecs:ListClusters + ListServices              → 2-5 calls
ecs:DescribeServices + DescribeTaskDefinition → 1 per service
lambda:ListFunctions                         → 1-3 calls (paginated)
lambda:ListEventSourceMappings               → 1 per function (batched)
apigateway:GetRestApis + GetResources        → 2-5 calls
apigatewayv2:GetApis                         → 1 call
rds:DescribeDBInstances                      → 1 call
resourcegroupstaggingapi:GetResources        → 1-5 calls (paginated, filtered by service/team tags)

Total: ~50-200 API calls per scan for a typical 50-service account. Well within AWS API rate limits. Parallel execution across resource types keeps total scan time under 30 seconds.

Cross-Account AssumeRole Pattern:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DDOC_PLATFORM_ACCOUNT:role/dd0c-discovery-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "{{tenant-specific-external-id}}"
        }
      }
    }
  ]
}

The customer deploys a CloudFormation template (provided by dd0c) that creates a read-only IAM role with:

  • ReadOnlyAccess managed policy (or a custom policy scoped to the specific services above)
  • Trust policy allowing dd0c's platform account to assume the role
  • ExternalId unique per tenant (prevents confused deputy attacks)
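On the platform side, the scanner assumes the customer role with short-lived credentials. A sketch of how the sts:AssumeRole parameters might be assembled (the customer-side role name, session naming, and duration here are illustrative, not the shipped template):

```python
import uuid

def build_assume_role_request(customer_account_id: str, external_id: str) -> dict:
    """Build sts:AssumeRole parameters for one tenant scan.

    ExternalId must match the StringEquals condition in the customer's trust
    policy, blocking confused-deputy use of the role on behalf of another
    tenant. Role and session names below are illustrative.
    """
    return {
        "RoleArn": f"arn:aws:iam::{customer_account_id}:role/dd0c-readonly",
        "RoleSessionName": f"dd0c-discovery-{uuid.uuid4().hex[:8]}",
        "ExternalId": external_id,
        "DurationSeconds": 900,  # scans finish in <60s; keep credentials short-lived
    }

# In the scanner Lambda these kwargs would feed boto3 directly, e.g.:
#   creds = boto3.client("sts").assume_role(**build_assume_role_request(acct, ext_id))
req = build_assume_role_request("123456789012", "tnt-3f9a")
```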

GitHub Scanner — Repository-to-Service Mapping

GraphQL Batch Query (single request for up to 100 repos):

query($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor, isArchived: false, isFork: false) {
      nodes {
        name
        description
        primaryLanguage { name }
        languages(first: 5) { nodes { name } }
        defaultBranchRef {
          target {
            ... on Commit {
              history(first: 1) {
                nodes { committedDate author { user { login } } }
              }
            }
          }
        }
        codeowners: object(expression: "HEAD:CODEOWNERS") {
          ... on Blob { text }
        }
        readme: object(expression: "HEAD:README.md") {
          ... on Blob { text }
        }
        catalogInfo: object(expression: "HEAD:catalog-info.yaml") {
          ... on Blob { text }
        }
        deployWorkflow: object(expression: "HEAD:.github/workflows/deploy.yml") {
          ... on Blob { text }
        }
      }
      pageInfo { hasNextPage endCursor }
    }
    teams(first: 100) {
      nodes {
        name slug
        members(first: 100) { nodes { login name } }
        repositories(first: 100) { nodes { name } }
      }
    }
  }
}

Key extraction logic:

  • CODEOWNERS → parse ownership patterns, map @org/team-name to team entities
  • README.md → extract first paragraph as service description (LLM-assisted summarization in V2)
  • catalog-info.yaml → if present (Backstage migrators), parse existing metadata as high-confidence input
  • .github/workflows/deploy.yml → extract deployment target (ECS service name, Lambda function name) to cross-reference with AWS scan
  • primaryLanguage → tech stack
  • Recent commit authors → contributor frequency for ownership inference
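A minimal CODEOWNERS parser for the first bullet might look like this (simplified: real CODEOWNERS semantics, such as last-match-wins and section headers, are richer than this sketch):

```python
def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Parse CODEOWNERS into (pattern, team_handles) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        pattern, *owners = line.split()
        # Keep only team handles like @org/team-name; those map to team entities.
        teams = [o for o in owners if o.startswith("@") and "/" in o]
        if teams:
            entries.append((pattern, teams))
    return entries

sample = """
# Service ownership
/src/payments/  @acme/payments-team
*.tf            @acme/platform-team jane@example.com
"""
# parse_codeowners(sample) ==
#   [("/src/payments/", ["@acme/payments-team"]), ("*.tf", ["@acme/platform-team"])]
```

Individual-user owners (email addresses, bare @logins) are dropped here; in practice they would feed the git-blame signal instead of the team mapping.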

Service Relationship Inference

Cross-referencing AWS and GitHub data to build the service graph:

MATCHING RULES (priority order):

1. EXPLICIT TAG MATCH
   AWS resource tag "github_repo" = "org/repo-name"
   → Direct link. Confidence: 0.95

2. CFN STACK → GITHUB ACTIONS DEPLOY TARGET
   GitHub workflow deploys to ECS service "payment-api"
   CFN stack contains ECS service "payment-api"
   → Link repo to CFN stack's service. Confidence: 0.90

3. NAME MATCH (normalized)
   GitHub repo: "payment-service"
   ECS service: "payment-service" or "payment-svc"
   → Fuzzy name match (Levenshtein distance ≤ 2). Confidence: 0.75

4. ECR IMAGE → GITHUB REPO
   ECS task definition references ECR image "payment-api:latest"
   ECR image was built from GitHub repo "payment-api" (via image tag or build metadata)
   → Confidence: 0.85

5. LAMBDA FUNCTION NAME → REPO NAME
   Lambda: "payment-webhook-handler"
   Repo: "payment-webhook" or "payment-service" (contains Lambda deploy workflow)
   → Confidence: 0.70
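Rule 3's normalized fuzzy match can be sketched as follows; the suffix alias table (svc → service) is an assumption of this sketch, not a documented mapping:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Assumed alias table: expand common abbreviations before comparing.
_ALIASES = {"svc": "service"}

def normalize(name: str) -> str:
    parts = name.lower().replace("_", "-").split("-")
    return "-".join(_ALIASES.get(p, p) for p in parts)

def names_match(repo: str, resource: str, max_distance: int = 2) -> bool:
    """Rule 3: normalized names within Levenshtein distance 2 → confidence 0.75."""
    return levenshtein(normalize(repo), normalize(resource)) <= max_distance

names_match("payment-service", "payment-svc")      # True (svc expands to service)
names_match("payment-service", "billing-service")  # False
```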

Confidence Score Calculation:

Each service entity gets a composite confidence score:

def weighted_average(scores: list[tuple[str, float]]) -> float:
    # Mean of the (dimension, score) pairs; swap in per-dimension weights
    # if some dimensions should count more than others.
    if not scores:
        return 0.0
    return sum(value for _, value in scores) / len(scores)


def calculate_confidence(service: "Service") -> float:
    scores: list[tuple[str, float]] = []

    # Existence confidence: how sure are we this is a real service?
    if service.source == "cloudformation_stack":
        scores.append(("existence", 0.95))
    elif service.source == "ecs_service":
        scores.append(("existence", 0.90))
    elif service.source == "github_repo_only":
        scores.append(("existence", 0.60))  # repo exists but no infra found

    # Ownership confidence
    if service.owner_source == "codeowners":
        scores.append(("ownership", 0.90))
    elif service.owner_source == "cfn_tag":
        scores.append(("ownership", 0.85))
    elif service.owner_source == "git_blame_frequency":
        scores.append(("ownership", 0.65))
    elif service.owner_source == "inferred_from_team_membership":
        scores.append(("ownership", 0.50))

    # Repo linkage confidence
    if service.repo_link_source == "explicit_tag":
        scores.append(("repo_link", 0.95))
    elif service.repo_link_source == "deploy_workflow":
        scores.append(("repo_link", 0.90))
    elif service.repo_link_source == "name_match":
        scores.append(("repo_link", 0.75))

    return weighted_average(scores)

The >80% accuracy target is measured as:

accuracy = (services_correct_without_user_correction) / (total_services_discovered)

Where "correct" means: service exists, owner is right, repo link is right. Measured during beta by asking each beta customer to review their catalog and mark corrections.

Discovery Scheduling

| Trigger | Frequency | Scope |
|---|---|---|
| Initial onboarding | Once | Full scan (all resource types, all repos) |
| Scheduled refresh | Every 6 hours (configurable: 1h-24h) | Incremental — only scan resources modified since the last scan (CloudFormation events, GitHub push webhooks) |
| Manual trigger | On-demand (UI button) | Full scan |
| Webhook-driven | Real-time | GitHub push to CODEOWNERS → re-infer ownership for affected repos. CloudFormation stack events → update service metadata. |
| User correction | Immediate | Re-score the ownership model for similar services when a user corrects one |

2.2 Service Catalog

The service catalog is the central data model. Everything reads from it, everything writes to it.

Service Entity Model

┌─────────────────────────────────────────────────────────┐
│ SERVICE                                                  │
├─────────────────────────────────────────────────────────┤
│ id: uuid (PK)                                           │
│ tenant_id: uuid (FK → tenants)                          │
│ name: varchar(255)                                      │
│ display_name: varchar(255)                              │
│ description: text (extracted from README)                │
│ service_type: enum [api, worker, cron, database, queue] │
│ lifecycle: enum [production, staging, deprecated, eol]  │
│ tier: enum [critical, standard, experimental]           │
│ tech_stack: jsonb (languages, frameworks, runtime)      │
│ repo_url: varchar(500)                                  │
│ repo_default_branch: varchar(100)                       │
│ infrastructure: jsonb (aws_resources, regions, accounts)│
│ health_status: enum [healthy, degraded, down, unknown]  │
│ last_deploy_at: timestamptz                             │
│ last_discovered_at: timestamptz                         │
│ confidence_score: decimal(3,2) [0.00-1.00]             │
│ discovery_sources: jsonb (which scanners found this)    │
│ metadata: jsonb (extensible key-value pairs)            │
│ created_at: timestamptz                                 │
│ updated_at: timestamptz                                 │
├─────────────────────────────────────────────────────────┤
│ INDEXES: tenant_id, name (unique per tenant),           │
│          confidence_score, lifecycle                     │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ TEAM                                                     │
├─────────────────────────────────────────────────────────┤
│ id: uuid (PK)                                           │
│ tenant_id: uuid (FK → tenants)                          │
│ name: varchar(255)                                      │
│ slug: varchar(255)                                      │
│ github_team_slug: varchar(255)                          │
│ slack_channel: varchar(255)                             │
│ pagerduty_schedule_id: varchar(255)                     │
│ members: jsonb (user references)                        │
│ created_at: timestamptz                                 │
│ updated_at: timestamptz                                 │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ SERVICE_OWNERSHIP                                        │
├─────────────────────────────────────────────────────────┤
│ service_id: uuid (FK → services)                        │
│ team_id: uuid (FK → teams)                              │
│ ownership_type: enum [primary, contributing, on_call]   │
│ confidence: decimal(3,2)                                │
│ source: enum [codeowners, cfn_tag, git_blame,           │
│               team_membership, user_correction]          │
│ verified_by: uuid (FK → users, nullable)                │
│ verified_at: timestamptz (nullable)                     │
│ created_at: timestamptz                                 │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ SERVICE_DEPENDENCY (V1.1)                                │
├─────────────────────────────────────────────────────────┤
│ source_service_id: uuid (FK → services)                 │
│ target_service_id: uuid (FK → services)                 │
│ dependency_type: enum [calls, publishes_to, reads_from] │
│ confidence: decimal(3,2)                                │
│ source: enum [vpc_flow, apigw_integration,              │
│               lambda_event_source, user_defined]         │
│ created_at: timestamptz                                 │
└─────────────────────────────────────────────────────────┘

Ownership Mapping

Ownership is the highest-value data point in the catalog. The inference engine uses a weighted scoring model:

OWNERSHIP SCORING MODEL

Input signals (per service):
  1. CODEOWNERS file match          → weight: 0.40
  2. CloudFormation/resource tags   → weight: 0.20
  3. Git blame frequency (top team) → weight: 0.25
  4. GitHub team → repo association → weight: 0.15

Process:
  - For each candidate team, sum weighted scores across all signals
  - Normalize to [0, 1]
  - Assign primary owner = highest scoring team
  - If top score < 0.50 → mark as "unowned" (flag for user review)
  - If top two scores within 0.10 → mark as "ambiguous" (flag for user review)

User corrections:
  - When a user corrects ownership, the correction is stored as source="user_correction"
  - User corrections have implicit weight 1.0 (override all inference)
  - Corrections propagate: if user says "payment-* repos belong to @payments-team",
    apply to all matching repos with confidence 0.85

Metadata Enrichment

Beyond ownership, the catalog enriches each service with:

| Field | Source | Extraction Method |
|---|---|---|
| Description | GitHub README | First-paragraph extraction (regex: first non-heading, non-badge paragraph) |
| Tech stack | GitHub primaryLanguage + languages | Direct from the GitHub API |
| Runtime | ECS task definition / Lambda runtime | Direct from the AWS API |
| Last deploy | GitHub Actions last successful workflow run / ECS last deployment | Most recent timestamp |
| On-call | PagerDuty schedule mapped to team | PagerDuty API: GET /schedules → match by team name or escalation policy |
| Health | CloudWatch alarm state for associated resources | Aggregate: all alarms OK → healthy, any alarm → degraded, critical alarm → down |
| Cost | dd0c/cost module (when connected) | Internal API: GET /cost/services/{serviceId}/monthly |
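The health roll-up rule from the table can be sketched as a pure function; the severity field on alarm records is an assumption of this sketch:

```python
def aggregate_health(alarms: list[dict]) -> str:
    """Roll CloudWatch alarm states up to a service health status.

    Mirrors the rule above: no alarms → unknown, all OK → healthy, any firing
    alarm → degraded, any firing alarm tagged critical → down.
    """
    if not alarms:
        return "unknown"
    firing = [a for a in alarms if a["state"] == "ALARM"]
    if not firing:
        return "healthy"
    if any(a.get("severity") == "critical" for a in firing):
        return "down"
    return "degraded"

aggregate_health([{"state": "OK"}, {"state": "ALARM", "severity": "critical"}])  # "down"
```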

2.3 Search Engine

The Cmd+K search bar is the daily-use hook. It must be faster than asking in Slack.

Search Architecture

graph LR
    USER["User types in Cmd+K"] --> SPA["React SPA"]
    SPA -- "debounce 150ms" --> API["Portal API"]
    API --> REDIS["Redis<br/>Prefix cache<br/>(hot queries)"]
    REDIS -- "cache miss" --> MEILI["Meilisearch"]
    MEILI --> API
    API --> SPA
    SPA --> USER

    style REDIS fill:#f9f,stroke:#333
    style MEILI fill:#bbf,stroke:#333

Performance budget:

  • Keystroke to API request: <150ms (debounce)
  • API to Meilisearch: <10ms (same VPC, same AZ)
  • Meilisearch query execution: <50ms (for 10K documents)
  • API response to UI render: <50ms
  • Total perceived latency: <200ms (target: feels instant)

Meilisearch Index Configuration:

{
  "index": "services",
  "primaryKey": "id",
  "searchableAttributes": [
    "name",
    "display_name",
    "description",
    "team_name",
    "tech_stack",
    "repo_name",
    "tags"
  ],
  "filterableAttributes": [
    "tenant_id",
    "lifecycle",
    "tier",
    "team_name",
    "tech_stack",
    "health_status",
    "confidence_score"
  ],
  "sortableAttributes": [
    "name",
    "last_deploy_at",
    "confidence_score",
    "updated_at"
  ],
  "rankingRules": [
    "words",
    "typo",
    "proximity",
    "attribute",
    "sort",
    "exactness"
  ],
  "typoTolerance": {
    "enabled": true,
    "minWordSizeForTypos": {
      "oneTypo": 3,
      "twoTypos": 6
    }
  }
}

Multi-tenant isolation in search: Every document in Meilisearch includes tenant_id. Every query includes a mandatory filter: tenant_id = '{current_tenant}'. This is enforced at the API layer — the SPA never queries Meilisearch directly.

Search result format:

{
  "hits": [
    {
      "id": "svc_abc123",
      "name": "payment-gateway",
      "display_name": "Payment Gateway",
      "description": "Handles payment processing via Stripe integration",
      "team_name": "Payments Team",
      "repo_url": "https://github.com/acme/payment-gateway",
      "health_status": "healthy",
      "tech_stack": ["TypeScript", "Node.js"],
      "confidence_score": 0.92,
      "last_deploy_at": "2026-02-27T14:30:00Z",
      "_matchesPosition": { "name": [{"start": 0, "length": 7}] }
    }
  ],
  "query": "payment",
  "processingTimeMs": 12,
  "estimatedTotalHits": 3
}

Redis Prefix Cache

For the most common queries (top 100 per tenant), cache the Meilisearch response in Redis with a 5-minute TTL. This reduces Meilisearch load and provides <5ms response for repeated queries.

Key pattern: search:{tenant_id}:{normalized_query_prefix}
TTL: 300 seconds
Invalidation: on any service upsert for the tenant (conservative but simple)
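A sketch of the key construction and the read-through path (the redis-py client and the Meilisearch fallthrough appear only as comments; the whitespace normalization is an assumption):

```python
def cache_key(tenant_id: str, query: str) -> str:
    """Normalize the query and build the Redis key per the pattern above."""
    normalized = " ".join(query.lower().split())   # lowercase, collapse whitespace
    return f"search:{tenant_id}:{normalized}"

SEARCH_TTL_SECONDS = 300

# Read-through path, assuming a redis-py client and the Meilisearch call above:
#
#   key = cache_key(tenant_id, query)
#   cached = redis_client.get(key)
#   if cached is None:
#       cached = json.dumps(search_meilisearch(tenant_id, query))
#       redis_client.set(key, cached, ex=SEARCH_TTL_SECONDS)

cache_key("tnt_1", "  Payment  ")  # "search:tnt_1:payment"
```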

2.4 AI Agent — "Ask Your Infra" (V2)

Deferred to V2 (Month 7-12), but the architecture must accommodate it from day one.

Design

graph TB
    USER["User: 'Which services handle PII?'"]
    AGENT["AI Agent (Lambda)"]
    LLM["LLM (Claude / GPT-4o)"]
    CATALOG["Service Catalog (PostgreSQL)"]
    SEARCH["Meilisearch"]
    COST["dd0c/cost API"]
    ALERT["dd0c/alert API"]

    USER --> AGENT
    AGENT --> LLM
    LLM -- "tool_call: search_services" --> SEARCH
    LLM -- "tool_call: query_catalog" --> CATALOG
    LLM -- "tool_call: get_cost" --> COST
    LLM -- "tool_call: get_incidents" --> ALERT
    LLM --> AGENT
    AGENT --> USER

RAG approach: The AI agent does NOT embed the entire catalog into a vector store. Instead, it uses structured tool calls:

  1. User asks a natural language question
  2. LLM receives the question + a system prompt describing available tools (search, SQL query, cost API, alert API)
  3. LLM generates tool calls to retrieve relevant data
  4. Results are injected into context
  5. LLM synthesizes a natural language answer with citations

Why tool-use over vector RAG?

  • The service catalog is structured data (tables, relationships). SQL queries are more precise than semantic similarity search.
  • The catalog is small enough (<10K services) that tool calls retrieve exact data, not "similar" data.
  • No embedding pipeline to maintain. No vector database to operate. Simpler architecture for a solo founder.

V2 scope:

  • Natural language queries via portal UI and Slack bot
  • Tool calls: search_services, get_service_detail, query_services_by_attribute, get_team_services, get_service_cost, get_service_incidents
  • Guardrails: tenant isolation (LLM can only query current tenant's data), no write operations, response length limits
  • Cost control: cache identical queries for 5 minutes, rate limit to 50 queries/user/day

2.5 Dashboard

The dashboard serves two audiences with different needs:

Engineers (daily use): Cmd+K search bar front and center. Recent services visited. Team's services quick-access. That's it. Calm surface.

Directors (weekly use): Org-wide metrics. Service count by team. Ownership coverage (% of services with verified owners). Health overview. Discovery accuracy trend. Exportable for compliance.

Dashboard Component Architecture

┌─────────────────────────────────────────────────────────────────┐
│ PORTAL DASHBOARD                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  🔍 Cmd+K: Search services, teams, or keywords...       │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐            │
│  │ 147 Services │ │ 12 Teams     │ │ 89% Accuracy │            │
│  │ 3 unowned    │ │ 2 on-call    │ │ ↑ from 82%   │            │
│  └──────────────┘ └──────────────┘ └──────────────┘            │
│                                                                  │
│  RECENT ─────────────────────────────────────────────           │
│  payment-gateway  │ @payments │ healthy │ 2h ago               │
│  auth-service     │ @platform │ healthy │ 1d ago               │
│  order-engine     │ @orders   │ degraded│ 3h ago               │
│                                                                  │
│  YOUR TEAM (@platform) ──────────────────────────────           │
│  auth-service     │ healthy │ ts/node  │ repo ↗                │
│  api-gateway      │ healthy │ ts/node  │ repo ↗                │
│  user-service     │ degraded│ python   │ repo ↗                │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ SERVICE DETAIL (expanded on click)                       │    │
│  │ ┌─────────┬──────────┬──────────┬──────────┬──────────┐ │    │
│  │ │ Overview│ Infra    │ On-Call  │ Cost     │ Incidents│ │    │
│  │ ├─────────┴──────────┴──────────┴──────────┴──────────┤ │    │
│  │ │ Owner: @payments-team (92% confidence) [Correct ✏️]  │ │    │
│  │ │ Repo: github.com/acme/payment-gateway               │ │    │
│  │ │ Stack: TypeScript, Node.js, ECS Fargate              │ │    │
│  │ │ Last Deploy: 2h ago by @sarah                        │ │    │
│  │ │ Health: ✅ All CloudWatch alarms OK                   │ │    │
│  │ │ On-Call: @mike (PagerDuty, ends in 4h)               │ │    │
│  │ │ Cost: $847/mo (dd0c/cost) ↑12% from last month      │ │    │
│  │ │ Incidents: 2 this month (dd0c/alert)                 │ │    │
│  │ └─────────────────────────────────────────────────────┘ │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Progressive disclosure in action:

  • Default: one-line-per-service table (name, owner, health, last deploy)
  • Click: expanded service card with tabs (overview, infra, on-call, cost, incidents)
  • Each tab loads data on demand (lazy loading) — no upfront cost for data the user doesn't need

3. DATA ARCHITECTURE

3.1 Complete Database Schema

Core Entities

-- Tenant isolation: every table has tenant_id. Every query filters by it. No exceptions.

CREATE TABLE tenants (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            VARCHAR(255) NOT NULL,
    slug            VARCHAR(255) NOT NULL UNIQUE,
    plan            VARCHAR(50) NOT NULL DEFAULT 'free', -- free, team, business
    stripe_customer_id VARCHAR(255),
    stripe_subscription_id VARCHAR(255),
    settings        JSONB NOT NULL DEFAULT '{}',
    -- settings: { discovery_interval_hours: 6, auto_refresh: true, slack_workspace_id: "T..." }
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    github_id       BIGINT NOT NULL,
    github_login    VARCHAR(255) NOT NULL,
    email           VARCHAR(255),
    display_name    VARCHAR(255),
    avatar_url      VARCHAR(500),
    role            VARCHAR(50) NOT NULL DEFAULT 'member', -- admin, member
    last_active_at  TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, github_id)
);

CREATE TABLE connections (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    provider        VARCHAR(50) NOT NULL, -- aws, github, pagerduty, opsgenie
    status          VARCHAR(50) NOT NULL DEFAULT 'pending', -- pending, active, error, revoked
    credentials     JSONB NOT NULL, -- encrypted at rest (KMS)
    -- aws: { role_arn, external_id, regions: ["us-east-1", "us-west-2"] }
    -- github: { installation_id, org_login, access_token_encrypted }
    -- pagerduty: { api_key_encrypted, subdomain }
    last_scan_at    TIMESTAMPTZ,
    last_error      TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, provider)
);

CREATE TABLE services (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    name            VARCHAR(255) NOT NULL,
    display_name    VARCHAR(255),
    description     TEXT,
    service_type    VARCHAR(50), -- api, worker, cron, database, queue, frontend, library
    lifecycle       VARCHAR(50) NOT NULL DEFAULT 'production', -- production, deprecated, eol
    tier            VARCHAR(50) NOT NULL DEFAULT 'standard', -- critical, standard, experimental
    tech_stack      JSONB DEFAULT '[]', -- ["TypeScript", "Node.js", "Express"]
    repo_url        VARCHAR(500),
    repo_default_branch VARCHAR(100) DEFAULT 'main',
    infrastructure  JSONB DEFAULT '{}',
    -- { aws_resources: [{type: "ecs_service", arn: "...", region: "us-east-1"}],
    --   aws_account_id: "123456789012" }
    health_status   VARCHAR(50) DEFAULT 'unknown', -- healthy, degraded, down, unknown
    last_deploy_at  TIMESTAMPTZ,
    last_discovered_at TIMESTAMPTZ,
    confidence_score DECIMAL(3,2) DEFAULT 0.00,
    discovery_sources JSONB DEFAULT '[]', -- ["cloudformation", "github_repo", "ecs_service"]
    metadata        JSONB DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, name)
);
CREATE INDEX idx_services_tenant ON services(tenant_id);
CREATE INDEX idx_services_lifecycle ON services(tenant_id, lifecycle);
CREATE INDEX idx_services_confidence ON services(tenant_id, confidence_score);
CREATE INDEX idx_services_health ON services(tenant_id, health_status);

CREATE TABLE teams (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    name            VARCHAR(255) NOT NULL,
    slug            VARCHAR(255) NOT NULL,
    github_team_slug VARCHAR(255),
    slack_channel_id VARCHAR(255),
    slack_channel_name VARCHAR(255),
    pagerduty_schedule_id VARCHAR(255),
    opsgenie_team_id VARCHAR(255),
    contact_email   VARCHAR(255),
    members         JSONB DEFAULT '[]',
    -- [{ github_login: "sarah", name: "Sarah Chen", role: "lead" }]
    metadata        JSONB DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, slug)
);

CREATE TABLE service_ownership (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_id      UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
    team_id         UUID NOT NULL REFERENCES teams(id) ON DELETE CASCADE,
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    ownership_type  VARCHAR(50) NOT NULL DEFAULT 'primary', -- primary, contributing, on_call
    confidence      DECIMAL(3,2) NOT NULL DEFAULT 0.00,
    source          VARCHAR(50) NOT NULL, -- codeowners, cfn_tag, git_blame, team_membership, user_correction
    verified_by     UUID REFERENCES users(id),
    verified_at     TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(service_id, team_id, ownership_type)
);
CREATE INDEX idx_ownership_service ON service_ownership(service_id);
CREATE INDEX idx_ownership_team ON service_ownership(team_id);

CREATE TABLE service_dependencies (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    source_service_id UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
    target_service_id UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
    dependency_type VARCHAR(50) NOT NULL, -- calls, publishes_to, reads_from, triggers
    confidence      DECIMAL(3,2) NOT NULL DEFAULT 0.00,
    source          VARCHAR(50) NOT NULL, -- vpc_flow, apigw_integration, lambda_event_source, user_defined
    metadata        JSONB DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(source_service_id, target_service_id, dependency_type)
);

Discovery Event Log

Every discovery run produces an immutable event log. This is critical for debugging accuracy issues, auditing what changed, and measuring improvement over time.

CREATE TABLE discovery_runs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    trigger_type    VARCHAR(50) NOT NULL, -- onboarding, scheduled, manual, webhook
    status          VARCHAR(50) NOT NULL DEFAULT 'running', -- running, completed, partial, failed
    step_function_execution_arn VARCHAR(500),
    started_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at    TIMESTAMPTZ,
    stats           JSONB DEFAULT '{}',
    -- { aws_resources_found: 234, github_repos_found: 89,
    --   services_created: 12, services_updated: 135, services_unchanged: 0,
    --   ownership_inferred: 140, ownership_ambiguous: 7,
    --   scan_duration_ms: 28400, reconcile_duration_ms: 4200 }
    errors          JSONB DEFAULT '[]'
    -- [{ phase: "aws_scan", resource: "lambda", error: "ThrottlingException", retried: true }]
);
CREATE INDEX idx_discovery_runs_tenant ON discovery_runs(tenant_id, started_at DESC);

CREATE TABLE discovery_events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    run_id          UUID NOT NULL REFERENCES discovery_runs(id) ON DELETE CASCADE,
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    event_type      VARCHAR(50) NOT NULL,
    -- service_created, service_updated, service_removed,
    -- ownership_changed, ownership_ambiguous,
    -- repo_linked, repo_unlinked,
    -- resource_discovered, resource_removed
    service_id      UUID REFERENCES services(id),
    payload         JSONB NOT NULL,
    -- { field: "owner", old_value: "@platform", new_value: "@payments",
    --   old_confidence: 0.65, new_confidence: 0.88,
    --   reason: "CODEOWNERS updated" }
    confidence      DECIMAL(3,2),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_discovery_events_run ON discovery_events(run_id);
CREATE INDEX idx_discovery_events_service ON discovery_events(service_id, created_at DESC);

-- Partition discovery_events by month for efficient cleanup
-- Retain 90 days of events, archive to S3 after that

User Corrections (Feedback Loop)

CREATE TABLE corrections (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES tenants(id),
    service_id      UUID NOT NULL REFERENCES services(id),
    user_id         UUID NOT NULL REFERENCES users(id),
    field           VARCHAR(100) NOT NULL, -- owner, description, tier, lifecycle, repo_url
    old_value       JSONB,
    new_value       JSONB,
    applied         BOOLEAN NOT NULL DEFAULT TRUE,
    propagated      BOOLEAN NOT NULL DEFAULT FALSE,
    -- propagated: did this correction update inference for similar services?
    propagation_count INT DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_corrections_tenant ON corrections(tenant_id, created_at DESC);

Corrections are the most valuable data in the system. They:

  1. Immediately fix the corrected service
  2. Feed back into the ownership inference model (increase weight for the corrected signal)
  3. Propagate to similar services when patterns are detected (e.g., "user corrected 3 services in payment-* repos to @payments-team → auto-apply to remaining payment-* repos with confidence 0.85")
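A sketch of step 3's propagation rule, assuming the pattern key is the repo-name prefix (as in the payment-* example) and the thresholds from the text; the real inference model would use richer signals:

```python
from collections import Counter

PROPAGATION_THRESHOLD = 3    # corrections needed before a pattern auto-applies
PROPAGATED_CONFIDENCE = 0.85 # confidence assigned to propagated ownership

def repo_prefix(name: str) -> str:
    # Crude pattern key: token before the first hyphen, e.g. "payment-gateway" -> "payment".
    return name.split("-")[0]

def propagate(corrections, services):
    """corrections: [(service_name, corrected_team)]; services: {name: current_team}.
    Returns {service_name: (team, confidence)} for uncorrected services whose
    name prefix was corrected to the same team >= PROPAGATION_THRESHOLD times."""
    votes = Counter((repo_prefix(name), team) for name, team in corrections)
    corrected = {name for name, _ in corrections}
    proposed = {}
    for (prefix, team), count in votes.items():
        if count < PROPAGATION_THRESHOLD:
            continue
        for name in services:
            if name not in corrected and repo_prefix(name) == prefix:
                proposed[name] = (team, PROPAGATED_CONFIDENCE)
    return proposed
```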

3.2 Search Index Design

Meilisearch document structure (denormalized from PostgreSQL for search performance):

{
  "id": "svc_abc123",
  "tenant_id": "tenant_xyz",
  "name": "payment-gateway",
  "display_name": "Payment Gateway",
  "description": "Handles payment processing via Stripe. Exposes REST API for checkout flow.",
  "service_type": "api",
  "lifecycle": "production",
  "tier": "critical",
  "team_name": "Payments Team",
  "team_slug": "payments",
  "owner_confidence": 0.92,
  "tech_stack": ["TypeScript", "Node.js"],
  "repo_name": "payment-gateway",
  "repo_url": "https://github.com/acme/payment-gateway",
  "health_status": "healthy",
  "last_deploy_at": 1740667800,
  "aws_services": ["ecs", "rds", "elasticache"],
  "aws_region": "us-east-1",
  "tags": ["payments", "stripe", "checkout", "critical-path"],
  "confidence_score": 0.92,
  "updated_at": 1740667800
}

Index sync strategy:

  • On every service upsert in PostgreSQL → publish to SQS → Lambda consumer → Meilisearch addDocuments (batch, async)
  • Latency: service update → searchable in <5 seconds
  • Full reindex: triggered on Meilisearch restart or schema change. Reads all services from PostgreSQL, batches of 1000 documents. For 10K services: ~10 seconds.
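The SQS consumer side of this pipeline can be sketched as below. The document shape is abbreviated and `add_documents` stands in for the Meilisearch client call; field handling is illustrative:

```python
import json

BATCH_SIZE = 1000  # matches the full-reindex batch size above

def to_document(service: dict) -> dict:
    """Denormalize a service row into the flat Meilisearch document shape
    (abbreviated: the real document carries all searchable/facet fields)."""
    return {
        "id": service["id"],
        "tenant_id": service["tenant_id"],
        "name": service["name"],
        "team_name": (service.get("team") or {}).get("name"),
        "confidence_score": service.get("confidence_score", 0.0),
    }

def handle_index_events(records, add_documents):
    """Lambda-style consumer: parse SQS records, denormalize, and push
    documents to the index in batches via add_documents."""
    docs = [to_document(json.loads(r["body"])) for r in records]
    for i in range(0, len(docs), BATCH_SIZE):
        add_documents(docs[i:i + BATCH_SIZE])
    return len(docs)
```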

Index sizing:

  • Average document size: ~1KB
  • 1,000 services: ~1MB index, ~50MB RAM
  • 10,000 services: ~10MB index, ~200MB RAM
  • Meilisearch on a single Fargate task (0.5 vCPU, 1GB RAM) handles 10K+ services comfortably

3.3 Graph Database Decision: Not Yet

V1: PostgreSQL adjacency list. Service dependencies stored in service_dependencies table. Queries like "what does service X depend on?" are simple JOINs. Queries like "what's the full transitive dependency tree?" use recursive CTEs:

WITH RECURSIVE dep_tree AS (
    SELECT target_service_id, 1 as depth
    FROM service_dependencies
    WHERE source_service_id = :service_id AND tenant_id = :tenant_id
    UNION ALL
    SELECT sd.target_service_id, dt.depth + 1
    FROM service_dependencies sd
    JOIN dep_tree dt ON sd.source_service_id = dt.target_service_id
    WHERE dt.depth < 10  -- prevent infinite loops
)
SELECT DISTINCT s.* FROM dep_tree dt
JOIN services s ON s.id = dt.target_service_id;

At the scale of 50-1000 services with an average of 3-5 dependencies each, this recursive CTE executes in <50ms on Aurora. A graph database is unnecessary.

V1.1+ evaluation criteria for Neptune Serverless:

  • If dependency visualization becomes a core feature with >100 daily queries
  • If customers have >5,000 services with deep dependency chains (>10 levels)
  • If graph traversal queries (shortest path, impact radius) become latency-sensitive
  • Neptune Serverless minimum cost: ~$0.12/NCU-hour × 2.5 NCU minimum = ~$220/month. Only justified when dependency features drive measurable retention.

3.4 Multi-Tenant Data Isolation

Strategy: Shared database, tenant_id column, application-level enforcement.

Why not database-per-tenant:

  • Aurora Serverless v2 charges per ACU. One database with 50 tenants is cheaper than 50 databases.
  • Schema migrations are applied once, not 50 times.
  • Cross-tenant analytics (anonymized, for product metrics) are simple queries.

Enforcement layers:

| Layer | Mechanism |
| --- | --- |
| API middleware | Every authenticated request extracts tenant_id from JWT. Injected into every database query. No query can omit tenant_id. |
| PostgreSQL RLS (Row-Level Security) | Backup enforcement. Even if application code has a bug, RLS prevents cross-tenant data access. |
| Meilisearch filter | Every search query includes a mandatory tenant_id filter. Enforced at API layer. |
| S3 prefix | Discovery snapshots stored at s3://dd0c-data/{tenant_id}/snapshots/. IAM policy scopes Lambda access to the tenant prefix during discovery. |
| Logging | All API logs include tenant_id. Anomaly detection: alert if a single request touches multiple tenant_ids. |

PostgreSQL RLS implementation:

ALTER TABLE services ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON services
    USING (tenant_id = current_setting('app.current_tenant_id')::UUID);

-- Set per-request in API middleware:
-- SET LOCAL app.current_tenant_id = 'tenant-uuid-here';
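A sketch of the per-request middleware that pairs with the RLS policy above, in psycopg-style pseudocode (the connection API is assumed, not prescribed). Using set_config with a bound parameter avoids interpolating the tenant ID into SQL, and the transaction-local scope means pooled connections cannot leak the setting:

```python
def with_tenant(conn, tenant_id: str, fn):
    """Run fn(cursor) inside a transaction with the RLS tenant set.
    set_config(..., true) is transaction-local (equivalent to SET LOCAL),
    so the setting disappears when the transaction ends."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        cur = conn.cursor()
        # Parameterized: never interpolate tenant_id into the SQL string.
        cur.execute("SELECT set_config('app.current_tenant_id', %s, true)", (tenant_id,))
        return fn(cur)
```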

3.5 Sync/Refresh Strategy

| Event | Trigger | Scope | Latency |
| --- | --- | --- | --- |
| Initial discovery | User completes onboarding | Full scan: all AWS resource types + all GitHub repos | <120 seconds |
| Scheduled refresh | EventBridge cron (default: every 6h) | Incremental: CloudFormation events since last scan, GitHub repos with pushes since last scan | <60 seconds |
| GitHub webhook | Push to CODEOWNERS, README, or deploy workflow | Single repo: re-extract metadata, re-infer ownership | <10 seconds |
| CloudFormation event | Stack create/update/delete (via EventBridge rule in customer account) | Single stack: update associated service | <10 seconds |
| User correction | User clicks "Correct" in UI | Single service + propagation to similar services | <5 seconds |
| Manual full rescan | User clicks "Rescan" in settings | Full scan (same as initial) | <120 seconds |

Incremental scan optimization:

For scheduled refreshes, avoid re-scanning everything:

  1. AWS: Use CloudTrail events (if available) or compare CloudFormation stack LastUpdatedTime to skip unchanged stacks. For ECS/Lambda, compare resource tags and configuration hashes.
  2. GitHub: Use the GitHub Events API or webhook payloads to identify repos with changes since last scan. Only re-scan changed repos.
  3. Result: Incremental scans touch 5-15% of resources, completing in <30 seconds instead of 120.
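The CloudFormation half of step 1 reduces to a timestamp filter over stack summaries. A sketch, mirroring the boto3 ListStacks shape (LastUpdatedTime is absent for stacks that were created but never updated, so CreationTime is the fallback):

```python
def stacks_to_rescan(stacks, last_scan_at):
    """Filter CloudFormation stack summaries down to those changed since
    the last scan, so an incremental refresh skips unchanged stacks."""
    changed = []
    for stack in stacks:
        changed_at = stack.get("LastUpdatedTime") or stack["CreationTime"]
        if changed_at > last_scan_at:
            changed.append(stack["StackName"])
    return changed
```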

Staleness detection:

If a service hasn't been seen in 3 consecutive full scans:

  • Mark as lifecycle: deprecated with a note "Not found in recent discovery scans"
  • Surface in dashboard: "3 services may have been removed from your infrastructure"
  • After 5 consecutive misses: mark as lifecycle: eol, remove from default search results (still accessible via filter)
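The staleness rule above is a small state transition on the consecutive-miss count. A sketch with the thresholds from the text:

```python
DEPRECATE_AFTER = 3  # consecutive full scans without seeing the service
EOL_AFTER = 5

def next_lifecycle(current: str, consecutive_misses: int) -> str:
    """Downgrade lifecycle based on how many consecutive discovery scans
    failed to find the service; otherwise leave it unchanged."""
    if consecutive_misses >= EOL_AFTER:
        return "eol"
    if consecutive_misses >= DEPRECATE_AFTER:
        return "deprecated"
    return current
```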

4. INFRASTRUCTURE

4.1 AWS Architecture

graph TB
    subgraph "us-east-1 — Primary Region"
        subgraph "Public Subnet"
            CF["CloudFront Distribution<br/>SPA + API Cache"]
            ALB["Application Load Balancer<br/>+ WAF v2"]
        end

        subgraph "Private Subnet — App Tier"
            ECS_API["ECS Fargate<br/>Portal API<br/>(min: 1, max: 10 tasks)<br/>0.5 vCPU / 1GB RAM"]
            ECS_MEILI["ECS Fargate<br/>Meilisearch<br/>(1 task)<br/>0.5 vCPU / 1GB RAM<br/>+ EFS volume"]
        end

        subgraph "Private Subnet — Compute"
            SF["Step Functions<br/>Discovery Orchestrator"]
            L_AWS["Lambda — AWS Scanner<br/>Python 3.12<br/>512MB / 5min timeout"]
            L_GH["Lambda — GitHub Scanner<br/>Node.js 20<br/>512MB / 5min timeout"]
            L_REC["Lambda — Reconciler<br/>Node.js 20<br/>1GB / 5min timeout"]
            L_INF["Lambda — Inference<br/>Python 3.12<br/>512MB / 2min timeout"]
            L_SLACK["Lambda — Slack Bot<br/>Node.js 20<br/>256MB / 30s timeout"]
            L_WEBHOOK["Lambda — Webhook Processor<br/>Node.js 20<br/>256MB / 30s timeout"]
        end

        subgraph "Private Subnet — Data Tier"
            AURORA["Aurora Serverless v2<br/>PostgreSQL 15<br/>0.5-8 ACU<br/>Multi-AZ"]
            REDIS_C["ElastiCache Redis<br/>Serverless<br/>1-5 ECPUs"]
        end

        subgraph "Storage & Messaging"
            S3_SPA["S3 — SPA Assets"]
            S3_DATA["S3 — Discovery Snapshots<br/>+ Exports"]
            SQS_DISC["SQS FIFO<br/>Discovery Events"]
            SQS_INDEX["SQS Standard<br/>Search Index Updates"]
            EB["EventBridge<br/>Scheduled Discovery<br/>+ Webhook Routing"]
        end

        subgraph "Security & Observability"
            KMS["KMS — Encryption Keys<br/>(credentials, PII)"]
            SM["Secrets Manager<br/>GitHub tokens, PD keys"]
            CW["CloudWatch<br/>Logs + Metrics + Alarms"]
            XRAY["X-Ray<br/>Distributed Tracing"]
        end

        subgraph "API Management"
            APIGW["API Gateway v2<br/>WebSocket API<br/>(discovery progress)"]
        end
    end

    CF --> S3_SPA
    CF --> ALB
    ALB --> ECS_API
    ECS_API --> AURORA
    ECS_API --> REDIS_C
    ECS_API --> ECS_MEILI
    ECS_API --> SQS_INDEX
    SQS_INDEX --> L_WEBHOOK
    L_WEBHOOK --> ECS_MEILI

    SF --> L_AWS & L_GH
    L_AWS --> SQS_DISC
    L_GH --> SQS_DISC
    SQS_DISC --> L_REC
    L_REC --> L_INF
    L_INF --> AURORA
    L_INF --> SQS_INDEX

    EB --> SF
    APIGW --> ECS_API

    ECS_MEILI --> ECS_MEILI_EFS["EFS Volume<br/>(Meilisearch data persistence)"]

4.2 Customer-Side: Read-Only IAM Role

The customer deploys a single CloudFormation template provided by dd0c. This is the only thing the customer installs.

CloudFormation Template (provided to customer):

AWSTemplateFormatVersion: '2010-09-09'
Description: dd0c/portal read-only discovery role

Parameters:
  ExternalId:
    Type: String
    Description: Unique identifier provided by dd0c during onboarding
    NoEcho: true
  Dd0cAccountId:
    Type: String
    Default: '123456789012'  # dd0c platform AWS account
    Description: dd0c platform account ID

Resources:
  Dd0cDiscoveryRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dd0c-portal-discovery
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${Dd0cAccountId}:role/dd0c-scanner-role'
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: !Ref ExternalId
      ManagedPolicyArns: []  # No managed policies — custom policy only
      Policies:
        - PolicyName: dd0c-discovery-readonly
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # CloudFormation — read stacks and resources
              - Effect: Allow
                Action:
                  - cloudformation:ListStacks
                  - cloudformation:DescribeStacks
                  - cloudformation:ListStackResources
                  - cloudformation:GetTemplate
                Resource: '*'
              # ECS — read clusters, services, task definitions
              - Effect: Allow
                Action:
                  - ecs:ListClusters
                  - ecs:ListServices
                  - ecs:DescribeServices
                  - ecs:DescribeClusters
                  - ecs:DescribeTaskDefinition
                  - ecs:ListTaskDefinitions
                Resource: '*'
              # Lambda — read functions and event sources
              - Effect: Allow
                Action:
                  - lambda:ListFunctions
                  - lambda:GetFunction
                  - lambda:ListEventSourceMappings
                  - lambda:ListTags
                Resource: '*'
              # API Gateway — read APIs and resources
              - Effect: Allow
                Action:
                  - apigateway:GET
                Resource: '*'
              # RDS — read instances
              - Effect: Allow
                Action:
                  - rds:DescribeDBInstances
                  - rds:DescribeDBClusters
                  - rds:ListTagsForResource
                Resource: '*'
              # Resource Groups — read tags
              - Effect: Allow
                Action:
                  - tag:GetResources
                  - tag:GetTagKeys
                  - tag:GetTagValues
                  - resourcegroupstaggingapi:GetResources
                Resource: '*'
              # CloudWatch — read alarm states for health
              - Effect: Allow
                Action:
                  - cloudwatch:DescribeAlarms
                  - cloudwatch:GetMetricData
                Resource: '*'
              # STS — for identity verification
              - Effect: Allow
                Action:
                  - sts:GetCallerIdentity
                Resource: '*'

              # EXPLICIT DENIES — defense in depth
              - Effect: Deny
                Action:
                  - iam:*
                  - s3:GetObject
                  - s3:PutObject
                  - secretsmanager:GetSecretValue
                  - ssm:GetParameter*
                  - kms:Decrypt
                  - logs:GetLogEvents
                Resource: '*'

Outputs:
  RoleArn:
    Value: !GetAtt Dd0cDiscoveryRole.Arn
    Description: Provide this ARN to dd0c during onboarding

Key security decisions:

  • Explicit deny on IAM, S3 object access, Secrets Manager, SSM Parameter Store, KMS, and CloudWatch Logs. Even if AWS adds new read actions to a managed policy, these denies prevent access to sensitive data.
  • No ReadOnlyAccess managed policy — too broad. Custom policy scoped to exactly the services dd0c needs.
  • ExternalId prevents confused deputy attacks.
  • Role name is fixed (dd0c-portal-discovery) so customers can audit it easily.
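On the platform side, the scanner assumes this role via STS, passing the per-tenant ExternalId that the trust policy requires. A sketch (the session name and duration are illustrative choices; 900 seconds is the STS minimum, which comfortably covers a scan):

```python
def assume_discovery_role(sts_client, role_arn: str, external_id: str):
    """Assume the customer's dd0c-portal-discovery role. The ExternalId is
    what defeats the confused-deputy attack: STS rejects the call unless it
    matches the value baked into the role's trust policy."""
    resp = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="dd0c-discovery",
        ExternalId=external_id,
        DurationSeconds=900,  # STS minimum; a discovery scan needs only minutes
    )
    return resp["Credentials"]
```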

4.3 GitHub/GitLab Integration

V1: GitHub App (preferred over OAuth for org-level access)

| Permission | Access | Justification |
| --- | --- | --- |
| Repository contents | Read | CODEOWNERS, README, workflow files |
| Repository metadata | Read | Repo name, description, language, topics |
| Organization members | Read | Team membership for ownership inference |
| Organization administration | Read | Team structure |

No write permissions. No webhook creation (V1). No code push. No issue creation.

The GitHub App is installed at the org level. The customer clicks "Install" on the GitHub Marketplace listing, selects their org, and grants read-only access. The installation ID is stored in the connections table.

GitLab (V2): GitLab Group Access Token with read_api scope. Same pattern — read-only, scoped to the group, no write access.

4.4 Cost Estimates

All costs in USD/month. Assumes us-east-1 pricing as of 2026.

50 Services (10-20 engineers, Free/Team tier, ~5 tenants)

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| Aurora Serverless v2 | 0.5 ACU min (mostly idle) | $43 |
| ElastiCache Redis Serverless | Minimal ECPU usage | $15 |
| ECS Fargate — API | 1 task, 0.5 vCPU, 1GB | $18 |
| ECS Fargate — Meilisearch | 1 task, 0.5 vCPU, 1GB + EFS | $20 |
| Lambda (all functions) | ~50K invocations/month | $2 |
| Step Functions | ~150 state transitions/month | $1 |
| SQS | ~10K messages/month | $1 |
| S3 | <1GB storage | $1 |
| CloudFront | <10GB transfer | $2 |
| ALB | 1 ALB, minimal LCUs | $18 |
| KMS | 2 keys | $2 |
| Secrets Manager | 5 secrets | $3 |
| CloudWatch | Logs + basic metrics | $10 |
| Route 53 | 1 hosted zone | $1 |
| Total | | ~$137/month |

Revenue at 50 services (5 tenants × 10 eng × $10): $500/month → 73% gross margin

200 Services (50-100 engineers, ~15 tenants)

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| Aurora Serverless v2 | 0.5-2 ACU (scales with queries) | $90 |
| ElastiCache Redis Serverless | Moderate ECPU | $30 |
| ECS Fargate — API | 2 tasks (auto-scaling) | $36 |
| ECS Fargate — Meilisearch | 1 task, 1 vCPU, 2GB | $36 |
| Lambda | ~500K invocations/month | $10 |
| Step Functions | ~1,500 transitions/month | $5 |
| SQS | ~100K messages/month | $2 |
| S3 | ~5GB | $2 |
| CloudFront | ~50GB transfer | $8 |
| ALB | Moderate LCUs | $25 |
| Observability (CW + X-Ray) | | $30 |
| Other (KMS, SM, R53) | | $10 |
| Total | | ~$284/month |

Revenue at 200 services (15 tenants × ~5 eng avg × $10): ~$750/month → 62% gross margin

Note: conservative — many tenants will have 20-50 engineers, pushing revenue to $2-5K/month

1,000 Services (200-500 engineers, ~50 tenants)

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| Aurora Serverless v2 | 2-8 ACU | $350 |
| ElastiCache Redis Serverless | Higher ECPU | $80 |
| ECS Fargate — API | 3-5 tasks | $90 |
| ECS Fargate — Meilisearch | 1 task, 2 vCPU, 4GB | $72 |
| Lambda | ~5M invocations/month | $50 |
| Step Functions | ~15K transitions/month | $20 |
| SQS + EventBridge | | $10 |
| S3 | ~50GB | $5 |
| CloudFront | ~200GB transfer | $25 |
| ALB | Higher LCUs | $40 |
| Observability | | $80 |
| WAF | | $10 |
| Other | | $20 |
| Total | | ~$852/month |

Revenue at 1,000 services (50 tenants × ~7 eng avg × $10): ~$3,500/month → 76% gross margin

At scale, Aurora and Fargate efficiency improves. Gross margin stays healthy.

4.5 Scaling Strategy

Phase 1 (0-50 tenants): Single-region, minimal resources

  • Aurora Serverless v2 scales ACUs automatically
  • ECS API auto-scales 1-3 tasks based on CPU/request count
  • Meilisearch single instance (handles 100K+ documents easily)
  • No read replicas, no multi-region

Phase 2 (50-200 tenants): Optimize hot paths

  • Add Aurora read replica for search/dashboard queries (write to primary, read from replica)
  • Redis cluster mode for session/cache scaling
  • Meilisearch: evaluate moving to dedicated EC2 instance for cost efficiency at sustained load
  • Add CloudFront caching for API responses (service cards change infrequently — 60s TTL)

Phase 3 (200+ tenants): Multi-region consideration

  • Evaluate additional regional deployments for latency (us-west-2; eu-west-1 for EU customers)
  • Aurora Global Database for cross-region reads
  • CloudFront + Lambda@Edge for API routing
  • This is a $100K+ MRR problem — don't solve it prematurely

4.6 CI/CD Pipeline

graph LR
    DEV["Developer Push<br/>(GitHub)"] --> GHA["GitHub Actions"]

    subgraph "CI Pipeline"
        GHA --> LINT["Lint + Type Check"]
        LINT --> TEST["Unit Tests<br/>+ Integration Tests"]
        TEST --> BUILD["Docker Build<br/>(API + Meilisearch config)"]
        BUILD --> SCAN["Trivy Container Scan"]
        SCAN --> ECR["Push to ECR"]
    end

    subgraph "CD Pipeline"
        ECR --> STAGING["Deploy to Staging<br/>(ECS rolling update)"]
        STAGING --> SMOKE["Smoke Tests<br/>(discovery accuracy suite)"]
        SMOKE --> APPROVE["Manual Approval<br/>(solo founder reviews)"]
        APPROVE --> PROD["Deploy to Production<br/>(ECS rolling update<br/>+ Lambda version publish)"]
        PROD --> CANARY["Canary Check<br/>(5 min health check)"]
        CANARY --> DONE["✅ Deploy Complete"]
    end

Key CI/CD decisions:

  • GitHub Actions (not CodePipeline) — simpler, cheaper, Brian already knows it
  • Docker multi-stage builds for API (Node.js) and scanner Lambdas (Python/Node.js)
  • Staging environment: minimal Aurora (0.5 ACU) + single Fargate task. Cost: ~$60/month
  • Discovery accuracy regression suite: run discovery against a known test AWS account + GitHub org, assert >80% accuracy. If accuracy drops, block deploy.
  • Lambda deployments via SAM/CDK with versioning and aliases for instant rollback
  • Database migrations via Prisma Migrate (or raw SQL migrations) — run in CI before ECS deploy
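The accuracy-regression gate can be sketched as a small CI step. The 80% floor comes from the suite described above; the expected/discovered shapes and the exit-on-failure behavior are illustrative assumptions:

```python
import sys

ACCURACY_FLOOR = 0.80  # matches the >80% deploy gate above

def check_accuracy(expected: dict, discovered: dict) -> float:
    """Fraction of known test-account services whose inferred owner matches
    the hand-verified ground truth."""
    hits = sum(1 for svc, owner in expected.items() if discovered.get(svc) == owner)
    return hits / len(expected)

def gate(expected: dict, discovered: dict) -> float:
    """Exit non-zero (blocking the deploy) if accuracy dropped below the floor."""
    accuracy = check_accuracy(expected, discovered)
    if accuracy < ACCURACY_FLOOR:
        sys.exit(f"Discovery accuracy {accuracy:.0%} below {ACCURACY_FLOOR:.0%} — blocking deploy")
    return accuracy
```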

5. SECURITY

5.1 IAM Role Design for Customer AWS Accounts

The trust model is the single most sensitive aspect of dd0c/portal. Customers are granting a third-party SaaS read access to their infrastructure topology. This must be treated with the gravity it deserves.

Principle: Minimum viable access, maximum transparency.

Role Architecture

┌─────────────────────────────────────────────────────────────┐
│ CUSTOMER AWS ACCOUNT                                         │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ dd0c-portal-discovery (IAM Role)                      │   │
│  │                                                        │   │
│  │ Trust: dd0c platform account + ExternalId             │   │
│  │                                                        │   │
│  │ ALLOW:                                                │   │
│  │   cloudformation:List*, Describe*                     │   │
│  │   ecs:List*, Describe*                                │   │
│  │   lambda:List*, Get* (config only)                    │   │
│  │   apigateway:GET                                      │   │
│  │   rds:Describe*, ListTags*                            │   │
│  │   tag:Get*                                            │   │
│  │   cloudwatch:DescribeAlarms, GetMetricData            │   │
│  │                                                        │   │
│  │ EXPLICIT DENY:                                        │   │
│  │   iam:*                                               │   │
│  │   s3:GetObject, PutObject (no data access)            │   │
│  │   secretsmanager:GetSecretValue                       │   │
│  │   ssm:GetParameter*                                   │   │
│  │   kms:Decrypt                                         │   │
│  │   logs:GetLogEvents (no application logs)             │   │
│  │   ec2:GetConsoleOutput, GetPasswordData               │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  dd0c CANNOT:                                               │
│  ✗ Read S3 objects (no customer data)                       │
│  ✗ Read secrets or parameters                               │
│  ✗ Read application logs                                    │
│  ✗ Modify any resource                                      │
│  ✗ Create/delete/update anything                            │
│  ✗ Access IAM users, roles, or policies                     │
│  ✗ Decrypt any KMS-encrypted data                           │
└─────────────────────────────────────────────────────────────┘

Confused deputy prevention:

  • Every tenant gets a unique ExternalId (UUID v4) generated at onboarding
  • The customer's trust policy requires this ExternalId in the sts:AssumeRole condition
  • dd0c's scanner Lambda passes the tenant-specific ExternalId when assuming the role
  • Without the correct ExternalId, AssumeRole fails — even if an attacker knows the role ARN

Credential rotation:

  • The cross-account role uses temporary STS credentials (1-hour expiry by default)
  • No long-lived access keys are stored
  • The ExternalId can be rotated by the customer at any time (update CFN stack + update dd0c connection settings)
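
A sketch of what the scanner side looks like. The parameter dict mirrors the bullets above and would be passed to boto3's `sts.assume_role`; the helper names are illustrative, not the actual codebase:

```python
import uuid

def new_external_id() -> str:
    """Generated once per tenant at onboarding (UUID v4); the customer's
    trust policy requires it in the sts:ExternalId condition."""
    return str(uuid.uuid4())

def assume_role_request(role_arn: str, external_id: str) -> dict:
    """Kwargs the scanner Lambda passes to boto3's sts.assume_role().
    Without the tenant's exact ExternalId, STS denies the call even if an
    attacker knows the role ARN (confused deputy prevention)."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": "dd0c-portal-discovery",
        "ExternalId": external_id,
        "DurationSeconds": 3600,  # temporary credentials, 1-hour expiry
    }
```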

Audit trail:

  • Every AssumeRole call appears in the customer's CloudTrail
  • dd0c provides a "Discovery Activity Log" in the UI showing exactly which API calls were made, when, and what was returned (metadata only, not full responses)
  • Customers can verify dd0c's access patterns against their own CloudTrail

5.2 GitHub/GitLab Token Scoping

GitHub App permissions (V1):

| Permission | Level | Justification | What it CANNOT do |
|---|---|---|---|
| Contents | Read | Read CODEOWNERS, README, workflow files | Cannot push code, create branches, or modify files |
| Metadata | Read | Repo name, description, language, topics | Cannot modify repo settings |
| Members | Read | Org member list for ownership inference | Cannot invite/remove members |
| Administration | Read | Team structure and membership | Cannot create/modify teams |

What the GitHub App explicitly cannot do:

  • ✗ Push code or create pull requests
  • ✗ Create/close issues
  • ✗ Modify repository settings
  • ✗ Access private repository secrets
  • ✗ Trigger or modify GitHub Actions workflows
  • ✗ Access GitHub Packages or Container Registry
  • ✗ Create webhooks (V1 — webhooks added in V1.1 with explicit customer consent)

Token storage:

  • GitHub App installation tokens are short-lived (1 hour)
  • The GitHub App private key is stored in AWS Secrets Manager (KMS-encrypted)
  • Installation tokens are generated on-demand per scan, never persisted

5.3 Service Catalog Data Sensitivity

The service catalog contains infrastructure topology — a map of what services exist, who owns them, how they're connected, and what technology they use. This is sensitive data.

Threat model:

| Threat | Impact | Mitigation |
|---|---|---|
| Catalog data breach | Attacker learns customer's service topology, tech stack, and team structure. Enables targeted attacks. | Encryption at rest (Aurora + S3 + KMS). Encryption in transit (TLS 1.3 everywhere). Multi-tenant RLS. |
| Cross-tenant data leak | Tenant A sees Tenant B's services. Reputational catastrophe. | PostgreSQL RLS + application-level tenant_id enforcement + automated cross-tenant access tests in CI. |
| Insider threat (dd0c employee) | Solo founder has access to all tenant data. | Audit logging on all database access. Principle: Brian should never need to query customer data directly. Build admin tools that log every access. |
| Supply chain attack | Compromised npm/pip dependency exfiltrates catalog data. | Dependabot + Snyk. Pin dependency versions. Minimal dependency tree. Lambda functions have no outbound internet access except to AWS APIs (VPC + NAT gateway scoped to AWS endpoints). |
| Customer credential compromise | Attacker steals the cross-account IAM role ARN + ExternalId. | ExternalId is a UUID — not guessable. Role trust policy limits to dd0c's specific account. Even with both, attacker only gets read-only access to infrastructure metadata (not data). |

Data classification:

| Data Type | Classification | Storage | Retention |
|---|---|---|---|
| Service names, descriptions | Internal | Aurora (encrypted) | Active tenant lifetime |
| Team names, members | Internal | Aurora (encrypted) | Active tenant lifetime |
| AWS resource ARNs | Confidential | Aurora (encrypted) | Active tenant lifetime |
| GitHub repo URLs | Internal | Aurora (encrypted) | Active tenant lifetime |
| Discovery event logs | Internal | Aurora → S3 archive | 90 days hot, 1 year archive |
| IAM role ARNs + ExternalIds | Confidential | Secrets Manager (KMS) | Active connection lifetime |
| GitHub App private key | Secret | Secrets Manager (KMS) | Rotated annually |
| User sessions | Internal | Redis (encrypted in transit) | 24-hour TTL |

5.4 SOC 2 Considerations

dd0c/portal will need SOC 2 Type II certification by the time Business tier launches (Month 12+). Engineering directors buying at $25/engineer need compliance evidence.

SOC 2 Trust Service Criteria — Architecture Alignment:

| Criteria | Requirement | dd0c Architecture |
|---|---|---|
| CC6.1 — Logical Access | Restrict access to authorized users | GitHub OAuth SSO. Tenant isolation via RLS. No shared accounts. |
| CC6.3 — Encryption | Encrypt data in transit and at rest | TLS 1.3 (ALB, CloudFront). Aurora encryption (AES-256). S3 SSE-KMS. Redis in-transit encryption. |
| CC6.6 — System Boundaries | Define and protect system boundaries | VPC with private subnets. Security groups restrict inter-service communication. WAF on ALB. |
| CC7.1 — Monitoring | Detect anomalies and security events | CloudWatch alarms. CloudTrail for API access. GuardDuty for threat detection. |
| CC7.2 — Incident Response | Respond to security incidents | PagerDuty alerting for dd0c's own infrastructure. Incident response runbook (documented). |
| CC8.1 — Change Management | Control changes to infrastructure and code | GitHub PRs with required reviews (when team grows). CI/CD pipeline with staging. |
| A1.2 — Availability | Maintain system availability | Aurora Multi-AZ. ECS multi-task. CloudFront edge caching. Health checks + auto-recovery. |

Pre-SOC 2 actions (build into V1):

  • Enable CloudTrail in dd0c's AWS account (all API calls logged)
  • Enable GuardDuty (threat detection)
  • Enable AWS Config (configuration compliance)
  • Implement audit logging in the application (who accessed what, when)
  • Document data retention and deletion policies
  • Build a "delete my data" endpoint (GDPR + SOC 2 requirement)

5.5 The Trust Model

This is the hardest sell in the product. Customers are giving dd0c read access to their infrastructure graph. The architecture must make this trust decision as easy as possible.

Trust-building mechanisms:

  1. Transparency: The CloudFormation template is public and auditable. Customers can read every IAM permission before deploying. No hidden access.

  2. Customer-controlled revocation: Delete the CloudFormation stack → dd0c loses all access instantly. No "please contact support to revoke." The customer is always in control.

  3. Minimal blast radius: Even if dd0c is fully compromised, the attacker gets read-only access to infrastructure metadata (service names, resource ARNs, team names). They do NOT get application data, secrets, logs, or write access. The worst case is an attacker learning "Acme Corp has a payment-gateway service running on ECS in us-east-1." Sensitive, but not catastrophic.

  4. Open-source discovery agent (V1.1): Open-source the AWS scanner and GitHub scanner code. Customers can audit exactly what API calls are made and what data is collected. This is the strongest trust signal possible.

  5. Data residency: All customer data stored in the same AWS region as the customer's primary infrastructure (us-east-1 by default, eu-west-1 for EU customers at Business tier). No cross-region data transfer without explicit consent.

  6. Deletion guarantee: When a customer disconnects or churns, all their data (services, teams, discovery logs, corrections) is hard-deleted within 30 days. S3 objects are deleted. Meilisearch index entries are removed. Backups are excluded from restore for that tenant.

Trust comparison with competitors:

| Aspect | Backstage | Port/Cortex | dd0c/portal |
|---|---|---|---|
| Data location | Self-hosted (customer controls) | Vendor SaaS | Vendor SaaS |
| AWS access | N/A (manual YAML) | Similar IAM role | Read-only IAM role |
| Code auditability | Open source | Closed source | Closed source (scanners open-sourced V1.1) |
| Revocation | N/A | Contact support | Delete CFN stack (instant) |
| Blast radius | N/A | Read + sometimes write | Read-only, explicit denies |

dd0c's trust model is weaker than self-hosted Backstage (customer controls everything) but stronger than Port/Cortex (more transparent, easier revocation, smaller blast radius). The open-source scanner in V1.1 closes the gap significantly.


6. MVP SCOPE

6.1 V1 Technical Scope — "The 5-Minute Miracle"

V1 ships exactly four capabilities. Nothing else.

┌─────────────────────────────────────────────────────────────┐
│ V1 SCOPE                                                     │
│                                                              │
│  ✅ IN SCOPE                        ❌ OUT OF SCOPE          │
│  ─────────────                      ──────────────           │
│  AWS auto-discovery                 AI agent ("Ask Your      │
│    - CloudFormation                   Infra")                │
│    - ECS                            GitLab support           │
│    - Lambda                         Dependency visualization │
│    - API Gateway                    Scorecards / maturity    │
│    - RDS                            Kubernetes discovery     │
│    - Resource tags                  Terraform state parsing  │
│                                     Custom plugins           │
│  GitHub org scanning                Advanced RBAC            │
│    - Repos + languages              SSO (Okta/Azure AD)     │
│    - CODEOWNERS                     Self-hosted option       │
│    - README extraction              Multi-cloud (GCP/Azure) │
│    - Team memberships               Compliance reports       │
│    - Actions workflows              Software templates       │
│                                     Change feed              │
│  Service catalog UI                 Zombie service detection │
│    - Service cards                  Cost anomaly per service │
│    - Cmd+K search (<200ms)                                   │
│    - Team directory                                          │
│    - Correction UI                                           │
│    - Confidence scores                                       │
│                                                              │
│  Integrations                                                │
│    - PagerDuty/OpsGenie (on-call)                           │
│    - Slack bot (/dd0c who owns)                             │
│    - GitHub OAuth (auth)                                     │
│    - Stripe (billing)                                        │
└─────────────────────────────────────────────────────────────┘

6.2 What's Deferred to V2

| Feature | V2 Target | Dependency |
|---|---|---|
| AI Agent ("Ask Your Infra") | Month 7-9 | Requires stable catalog data + LLM integration |
| GitLab support | Month 8-10 | Separate scanner Lambda, GitLab Group Access Token flow |
| Dependency visualization | Month 5-7 (V1.1) | Requires VPC flow log analysis or API Gateway integration mapping |
| Scorecards | Month 5-6 (V1.1) | Requires stable service entity model + enough metadata signals |
| dd0c/cost integration | Month 7-9 | Requires dd0c/cost to be live with per-service attribution |
| dd0c/alert integration | Month 7-9 | Requires dd0c/alert to be live with service-level routing |
| Backstage YAML importer | Month 5-6 (V1.1) | Low effort, high acquisition value for Backstage refugees |
| Change feed | Month 10-12 | Requires discovery event log to be stable + diff computation |
| Advanced RBAC | Month 12+ (Business tier) | Requires team-level permission model |
| SSO (Okta/Azure AD) | Month 12+ (Business tier) | Requires SAML/OIDC integration |

6.3 The 5-Minute Onboarding Flow — Technical Implementation

sequenceDiagram
    participant U as Engineer
    participant SPA as Portal SPA
    participant API as Portal API
    participant GH as GitHub OAuth
    participant STRIPE as Stripe
    participant AWS_CFN as Customer AWS<br/>(CloudFormation)
    participant SF as Step Functions

    Note over U,SF: Step 1: Sign Up (30 seconds)
    U->>SPA: Click "Sign up with GitHub"
    SPA->>GH: OAuth redirect (scope: read:org, read:user)
    GH->>SPA: Authorization code
    SPA->>API: POST /auth/github {code}
    API->>GH: Exchange code for token
    API->>API: Create tenant + user
    API->>SPA: JWT + redirect to onboarding wizard

    Note over U,SF: Step 2: Select Plan (30 seconds)
    SPA->>U: "Free (≤10 eng) or Team ($10/eng)?"
    U->>SPA: Select Team
    SPA->>STRIPE: Stripe Checkout Session
    STRIPE->>U: Enter credit card
    STRIPE->>API: Webhook: checkout.session.completed
    API->>API: Activate subscription

    Note over U,SF: Step 3: Connect AWS (90 seconds)
    SPA->>U: "Deploy this CloudFormation template"
    Note right of SPA: One-click link:<br/>https://console.aws.amazon.com/cloudformation/<br/>home#/stacks/create/review?<br/>templateURL=https://dd0c-public.s3.amazonaws.com/<br/>cfn/dd0c-discovery-role.yaml&<br/>param_ExternalId={{generated-uuid}}
    U->>AWS_CFN: Click "Create Stack" in AWS Console
    AWS_CFN->>AWS_CFN: Create IAM role (~60 seconds)
    U->>SPA: Paste Role ARN
    SPA->>API: POST /connections/aws {roleArn, externalId}
    API->>API: sts:AssumeRole (validate access)
    API->>SPA: ✅ "AWS connected"

    Note over U,SF: Step 4: Connect GitHub (already done — OAuth grants org access)
    API->>API: List orgs from GitHub token
    SPA->>U: "Select your GitHub org"
    U->>SPA: Select org
    SPA->>API: POST /connections/github {orgLogin}
    API->>SPA: ✅ "GitHub connected"

    Note over U,SF: Step 5: Auto-Discovery (90 seconds)
    API->>SF: StartExecution {tenantId}
    SF-->>SPA: WebSocket: "Scanning AWS resources..."
    SF-->>SPA: WebSocket: "Found 234 AWS resources..."
    SF-->>SPA: WebSocket: "Scanning 89 GitHub repos..."
    SF-->>SPA: WebSocket: "Reconciling services..."
    SF-->>SPA: WebSocket: "Inferring ownership..."
    SF-->>SPA: WebSocket: "✅ Discovered 147 services"
    SPA->>U: Redirect to catalog view

    Note over U,SF: Total elapsed: ~4 minutes

Critical UX decisions:

  • The CloudFormation template link is pre-populated with the ExternalId. The customer clicks one link, lands in the AWS Console with the template loaded, and clicks "Create Stack." Three clicks total.
  • GitHub org access is granted during OAuth signup — no separate connection step. The onboarding wizard just asks "which org?" from the list of orgs the user belongs to.
  • WebSocket progress updates during discovery create the "Holy Shit" moment. Watching services appear in real-time (47... 89... 120... 147) is the emotional hook that drives screenshots and sharing.
  • If discovery takes >120 seconds, show a "This is taking longer than usual" message with a progress bar. Never show a spinner with no context.

6.4 >80% Discovery Accuracy — How to Measure It

Definition of accuracy:

accuracy = correct_services / total_discovered_services

Where "correct" means ALL of:
  1. The service actually exists (not a phantom/duplicate)
  2. The service name is recognizable to the team
  3. The primary owner is correct (or marked "unowned" if truly unowned)
  4. The repo link is correct (if a repo exists)

Measurement methodology:

  1. Beta measurement (Month 2-3): Personal call with each of 20 beta customers. Walk through their catalog together. For each service, ask: "Is this right?" Record corrections. Calculate accuracy per customer and aggregate.

  2. Production measurement (Month 3+): Track the correction rate.

    correction_rate = services_corrected_within_7_days / services_discovered
    accuracy_estimate = 1 - correction_rate
    

    The correction rate understates the true error rate — some incorrect services are never corrected because users don't notice or don't care — so this estimate is an upper bound on accuracy. But it's a continuous, automated metric.

  3. Accuracy by signal source:

    SELECT
      discovery_sources,
      COUNT(*) as total,
      COUNT(*) FILTER (WHERE corrected_within_7d = false) as uncorrected,
      1.0 - (COUNT(*) FILTER (WHERE corrected_within_7d = true)::decimal / COUNT(*)) as accuracy
    FROM services
    WHERE tenant_id = :tenant_id
    GROUP BY discovery_sources
    ORDER BY accuracy ASC;
    

    This reveals which discovery signals are weakest (e.g., "Lambda-only services have 60% accuracy" → invest in Lambda grouping heuristics).

  4. Accuracy improvement tracking:

    Week 1: 78% accuracy (initial discovery)
    Week 2: 85% accuracy (after user corrections + propagation)
    Week 4: 91% accuracy (after model improvements from correction patterns)
    Week 8: 93% accuracy (steady state)
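
The production estimate from step 2 is simple enough to compute inline wherever corrections are counted. A minimal sketch (function name is illustrative):

```python
def accuracy_estimate(services_discovered: int, corrected_within_7d: int) -> float:
    """Continuous production metric: 1 - correction_rate. Because some
    incorrect services are never corrected, this overstates true accuracy."""
    if services_discovered == 0:
        return 0.0
    return 1.0 - corrected_within_7d / services_discovered
```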
    

If accuracy is below 80% on first run:

  • Show a prominent "Review your catalog" banner: "We found 147 services. Some may need corrections. Help us learn your infrastructure by reviewing the flagged items."
  • Sort low-confidence services to the top of the review queue
  • Gamify corrections: "12 services need your review. ~5 minutes."
  • Each correction improves the model for similar services — show this: "Your correction improved confidence for 3 similar services."

6.5 Technical Debt Budget

V1 is built fast. Some shortcuts are intentional. Document them so they don't become surprises.

| Debt Item | Shortcut | Proper Solution | When to Fix |
|---|---|---|---|
| Meilisearch persistence | EFS volume (slow, but works) | Dedicated EC2 instance with local SSD | >200 tenants |
| Search index sync | SQS → Lambda → Meilisearch (eventual consistency, ~5s delay) | Change Data Capture from Aurora → real-time sync | If users complain about stale search results |
| GitHub rate limiting | Simple retry with backoff | Distributed rate limiter with token bucket per GitHub App installation | >50 tenants scanning simultaneously |
| Discovery scheduling | EventBridge cron (same time for all tenants) | Distributed scheduler with jitter to spread load | >100 tenants |
| Monitoring | CloudWatch basic metrics + alarms | Datadog or Grafana Cloud for full observability | >$10K MRR (can afford $200/month for monitoring) |
| Database migrations | Raw SQL files run in CI | Prisma Migrate or Flyway with proper versioning | When team grows beyond solo founder |
| Error handling | Generic error pages, console.error logging | Structured error codes, Sentry integration, user-facing error messages | Month 3-4 |
| Test coverage | Integration tests for discovery accuracy, minimal unit tests | 80%+ unit test coverage, E2E tests with Playwright | Month 4-6 |

6.6 Solo Founder Operational Model

What Brian operates:

  • 1 AWS account (dd0c platform)
  • 1 GitHub org (dd0c)
  • 1 Stripe account
  • 1 Slack workspace (dd0c community + bot)
  • 1 PagerDuty account (dd0c's own alerting)

Operational runbook (daily):

  • Check CloudWatch dashboard (5 minutes): any alarms firing? Any discovery failures?
  • Check Stripe dashboard: new signups? Churns? Failed payments?
  • Check support channel (Slack/email): any customer issues?
  • Total daily ops: ~15 minutes when nothing is broken

Alerting (PagerDuty):

  • P1 (page immediately): API 5xx rate >5%, Aurora connection failures, discovery pipeline stuck >30 minutes
  • P2 (page during business hours): Discovery accuracy drop >10% for any tenant, Meilisearch index lag >60 seconds, Stripe webhook failures
  • P3 (daily digest): Lambda error rate >1%, CloudWatch log anomalies, dependency vulnerability alerts

On-call: Brian is the only on-call. This is sustainable up to ~50 tenants if the architecture is reliable. At $20K MRR, hire a part-time contractor for L1 support (Slack responses, known-issue triage).


7. API DESIGN

The Portal API is a RESTful JSON API. It serves the SPA frontend, the Slack bot, and internal integrations (dd0c/cost, dd0c/alert). In V1, there is no public API for customers to programmatically query their catalog, but the internal API is designed cleanly enough to be exposed later.

All requests require authentication. The SPA uses HTTP-only cookies (JWT). Integrations use internal IAM/VPC auth or signed tokens.

7.1 Discovery API

Manages the auto-discovery lifecycle.

POST /api/v1/discovery/run Triggers a manual full discovery scan.

  • Request: { "connections": ["aws", "github"] }
  • Response: { "run_id": "uuid", "status": "started" }

GET /api/v1/discovery/runs/{run_id} Polls the status of a specific discovery run.

  • Response:
    {
      "id": "uuid",
      "status": "running",
      "started_at": "2026-02-28T10:00:00Z",
      "progress": {
        "aws_resources_found": 142,
        "github_repos_found": 89,
        "current_phase": "reconciliation"
      }
    }
    

WebSocket: wss://api.dd0c.com/v1/discovery/stream?run_id=uuid Real-time progress events pushed to the UI during onboarding.

  • Events: phase_started, resources_discovered, services_reconciled, ownership_inferred, completed.
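
For clients that can't hold a WebSocket open (a CLI, a cron job), the run endpoint can be polled instead. A minimal sketch — the injectable `fetch` parameter is an illustration for testability, not part of the API:

```python
import json
import time
import urllib.request

def poll_discovery_run(base_url: str, run_id: str, interval: float = 2.0, fetch=None):
    """Poll GET /api/v1/discovery/runs/{run_id} until the run leaves the
    'running' status; returns the final run document."""
    if fetch is None:  # default fetcher; tests can inject a stub
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    while True:
        run = fetch(f"{base_url}/api/v1/discovery/runs/{run_id}")
        if run["status"] != "running":
            return run
        time.sleep(interval)
```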

7.2 Service API

CRUD and search operations for the catalog.

GET /api/v1/services/search?q={query}&limit=10 The Cmd+K search endpoint. Proxies to Meilisearch or Redis cache.

  • Response: Array of matched services with highlighted snippets and confidence scores. (See Section 2.3).

GET /api/v1/services/{service_id} Retrieves the full, expanded detail view of a single service.

  • Response:
    {
      "id": "svc_123",
      "name": "payment-gateway",
      "display_name": "Payment Gateway",
      "description": "Handles Stripe checkout",
      "lifecycle": "production",
      "owner": {
        "team_id": "team_456",
        "name": "Payments Team",
        "confidence": 0.92,
        "source": "codeowners"
      },
      "repo": {
        "url": "https://github.com/acme/payment-gateway",
        "default_branch": "main"
      },
      "infrastructure": {
        "aws_resources": [
          {"type": "ecs_service", "arn": "arn:aws:ecs:...", "region": "us-east-1"}
        ]
      },
      "health_status": "healthy",
      "last_deploy_at": "2026-02-27T14:30:00Z"
    }
    

PATCH /api/v1/services/{service_id} Allows users to correct service metadata (e.g., fix a wrong owner).

  • Request: { "team_id": "team_789", "correction_reason": "Team reorg" }
  • Response: 200 OK. (Triggers background propagation to similar services).

7.3 Team & Ownership API

GET /api/v1/teams Lists all inferred and synced teams.

GET /api/v1/teams/{team_id}/services Lists all services owned by a specific team.

  • Query params: ?role=primary|contributing|on_call
  • Response: Array of service summaries.

7.4 Slack Bot API

The Slack bot translates slash commands into API queries. The bot Lambda receives the webhook from Slack, authenticates the workspace, maps it to a tenant, and calls the internal Portal API.

Command: /dd0c who owns payment-gateway

  • Bot logic: Calls GET /api/v1/services/search?q=payment-gateway&limit=1.
  • Bot response (ephemeral or in-channel):

    Payment Gateway is owned by @payments-team (92% confidence). Repo: acme/payment-gateway | Health: Healthy | On-Call: @mike

Command: /dd0c oncall auth-service

  • Bot logic: Looks up service, finds owner team, queries mapped PagerDuty schedule.
  • Bot response:

    Primary on-call for Auth Service (@platform-team) is Sarah Chen until 5:00 PM.
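
The bot Lambda's translation step can be sketched as follows (function names are illustrative assumptions; the search URL matches the Service API endpoint above):

```python
from urllib.parse import quote

def parse_command(text: str) -> tuple[str, str]:
    """Map the text Slack sends after '/dd0c' to (intent, service query)."""
    if text.startswith("who owns "):
        return ("owner", text[len("who owns "):].strip())
    if text.startswith("oncall "):
        return ("oncall", text[len("oncall "):].strip())
    return ("help", "")

def search_url(base_url: str, query: str) -> str:
    """Internal Portal API lookup the bot performs for either intent."""
    return f"{base_url}/api/v1/services/search?q={quote(query)}&limit=1"
```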

7.5 Webhooks (Outbound)

V1 supports outbound webhooks so customers can react to catalog changes.

POST https://customer-endpoint.com/webhooks/dd0c

  • Event: service.ownership.changed
  • Payload:
    {
      "event_id": "evt_abc123",
      "type": "service.ownership.changed",
      "timestamp": "2026-02-28T12:00:00Z",
      "data": {
        "service_id": "svc_123",
        "service_name": "payment-gateway",
        "old_owner_id": "team_456",
        "new_owner_id": "team_789",
        "confidence": 0.95,
        "source": "user_correction"
      }
    }
    
  • Other events: service.discovered, service.health.degraded, discovery.run.completed.
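
On the receiving side, a customer endpoint might dispatch on the event type like this (a minimal sketch; the handler bodies are placeholders for whatever the customer's system does):

```python
import json

def handle_dd0c_webhook(raw_body: str) -> str:
    """Route a dd0c webhook payload to a handler based on its type."""
    event = json.loads(raw_body)
    handlers = {
        "service.ownership.changed": lambda d: (
            f"reassign {d['service_name']}: {d['old_owner_id']} -> {d['new_owner_id']}"
        ),
        "service.discovered": lambda d: f"new service: {d['service_name']}",
        "discovery.run.completed": lambda d: "refresh local catalog cache",
    }
    handler = handlers.get(event["type"])
    return handler(event["data"]) if handler else "ignored"
```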

7.6 Platform Integration APIs (Internal)

These endpoints allow other dd0c modules to enrich the portal, and the portal to enrich other modules. These use internal IAM auth and bypass the public API Gateway.

dd0c/cost Integration

  • Portal calls Cost: GET http://cost.internal/api/v1/services/{service_id}/spend Retrieves the trailing 30-day AWS spend for the resources mapped to this service. Displayed on the service card.
  • Cost calls Portal: GET http://portal.internal/api/v1/resources/{arn}/service When dd0c/cost detects an anomaly on an RDS instance, it asks the portal "which service owns this ARN?" so it can alert the correct team.

dd0c/alert Integration

  • Portal calls Alert: GET http://alert.internal/api/v1/services/{service_id}/incidents?status=active Retrieves active incidents to update the service's health_status badge.
  • Alert calls Portal: GET http://portal.internal/api/v1/services/{service_id}/routing When an alert fires for a service, dd0c/alert asks the portal for the primary owner's Slack channel and PagerDuty schedule to route the page correctly.

dd0c/run Integration

  • Portal calls Run: GET http://run.internal/api/v1/services/{service_id}/runbooks Links executable runbooks directly on the service detail card.