# dd0c/portal — Technical Architecture
**Product:** Lightweight Internal Developer Portal
**Phase:** 6 — Architecture Design
**Date:** 2026-02-28
**Author:** Solutions Architecture
**Status:** Draft
---
## 1. SYSTEM OVERVIEW
### High-Level Architecture
```mermaid
graph TB
subgraph "Customer Environment"
AWS_ACCOUNT["Customer AWS Account(s)"]
GH_ORG["GitHub Organization"]
PD["PagerDuty / OpsGenie"]
end
subgraph "dd0c Platform — Control Plane (us-east-1)"
subgraph "Ingress"
ALB["Application Load Balancer<br/>+ WAF + CloudFront"]
end
subgraph "API Layer"
API["Portal API<br/>(ECS Fargate)"]
WS["WebSocket Gateway<br/>(API Gateway v2)"]
end
subgraph "Discovery Engine"
ORCH["Discovery Orchestrator<br/>(Step Functions)"]
AWS_SCAN["AWS Scanner<br/>(Lambda)"]
GH_SCAN["GitHub Scanner<br/>(Lambda)"]
RECONCILER["Reconciliation Engine<br/>(Lambda)"]
INFERENCE["Ownership Inference<br/>(Lambda)"]
end
subgraph "Data Layer"
PG["PostgreSQL (RDS Aurora Serverless v2)<br/>Service Catalog + Tenants"]
REDIS["ElastiCache Redis<br/>Session + Cache + Search"]
S3_DATA["S3<br/>Discovery Snapshots + Exports"]
SQS["SQS FIFO<br/>Discovery Events"]
end
subgraph "Search"
MEILI["Meilisearch<br/>(ECS Fargate)<br/>Full-text + Faceted Search"]
end
subgraph "Integrations"
SLACK_BOT["Slack Bot<br/>(Lambda)"]
WEBHOOK_OUT["Outbound Webhooks<br/>(EventBridge → Lambda)"]
end
subgraph "Frontend"
SPA["React SPA<br/>(CloudFront + S3)"]
end
end
subgraph "dd0c Platform Modules"
DD0C_COST["dd0c/cost"]
DD0C_ALERT["dd0c/alert"]
DD0C_RUN["dd0c/run"]
end
%% Customer → Platform connections
AWS_ACCOUNT -- "AssumeRole<br/>(read-only)" --> AWS_SCAN
GH_ORG -- "OAuth / GitHub App<br/>(read-only)" --> GH_SCAN
PD -- "API Key<br/>(read-only)" --> API
%% User flows
SPA --> ALB --> API
SPA --> WS
%% Discovery flow
ORCH --> AWS_SCAN
ORCH --> GH_SCAN
AWS_SCAN --> SQS
GH_SCAN --> SQS
SQS --> RECONCILER
RECONCILER --> INFERENCE
INFERENCE --> PG
PG --> MEILI
%% API reads
API --> PG
API --> MEILI
API --> REDIS
%% Integrations
SLACK_BOT --> API
API --> WEBHOOK_OUT
%% dd0c platform
API <-- "Internal API" --> DD0C_COST
API <-- "Internal API" --> DD0C_ALERT
API <-- "Internal API" --> DD0C_RUN
```
### Component Inventory
| Component | Responsibility | Technology | Justification |
|-----------|---------------|------------|---------------|
| **Portal API** | REST/GraphQL API for catalog CRUD, search proxy, auth, billing | Node.js (Fastify) on ECS Fargate | Fastify is among the fastest Node.js frameworks. Fargate eliminates server management. Node aligns with the React frontend for code sharing (types, validation schemas). |
| **Discovery Orchestrator** | Coordinates multi-source discovery runs, manages state machine for scan → reconcile → infer → index pipeline | AWS Step Functions | Native retry/error handling, visual debugging, pay-per-transition. Perfect for long-running multi-step workflows. |
| **AWS Scanner** | Scans customer AWS accounts via cross-account AssumeRole. Enumerates CloudFormation stacks, ECS services, Lambda functions, API Gateway APIs, RDS instances, tagged resources. | Python (Lambda) | boto3 is the canonical AWS SDK. Lambda cold starts acceptable for background scanning (not user-facing). Python's AWS ecosystem is unmatched. |
| **GitHub Scanner** | Scans GitHub org: repos, languages, CODEOWNERS, README content, workflow files, team memberships, recent commit authors. | Node.js (Lambda) | Octokit (GitHub SDK) is TypeScript-native. Shares types with API layer. |
| **Reconciliation Engine** | Merges AWS + GitHub scan results into unified service entities. Deduplicates, cross-references repo→infra mappings, resolves conflicts. | Node.js (Lambda) | Core business logic. Shares domain types with API. |
| **Ownership Inference** | Determines service ownership from CODEOWNERS, git blame frequency, team membership, CloudFormation tags, and historical corrections. Produces confidence scores. | Python (Lambda) | Scoring/ML-adjacent logic. Python's data processing libraries (pandas for frequency analysis) are superior. |
| **PostgreSQL** | Primary datastore: service catalog, tenant data, user accounts, discovery history, corrections, billing state. | Aurora Serverless v2 | Scales down to its ACU floor during low traffic (solo founder cost control). Relational model fits service catalog's structured data. Aurora's auto-scaling handles growth without capacity planning. |
| **Redis** | Session store, API response cache, rate limiting, real-time search suggestions (prefix trie). | ElastiCache Redis (Serverless) | Sub-millisecond reads for Cmd+K autocomplete. Serverless pricing aligns with variable load. |
| **Meilisearch** | Full-text search index for Cmd+K. Typo-tolerant, faceted, <50ms response. | Meilisearch on ECS Fargate (single container) | Meilisearch over Elasticsearch: 10x simpler to operate (single binary, no JVM, no cluster management), typo-tolerance out of the box, <50ms search on 10K documents. Solo founder can't babysit an ES cluster. Over Typesense: Meilisearch has better faceted search and a more active open-source community. |
| **React SPA** | Portal UI: service catalog, Cmd+K search, service detail cards, team directory, correction UI, onboarding wizard. | React + Vite + TailwindCSS, hosted on CloudFront + S3 | SPA for instant Cmd+K interactions without server round-trips for UI state. CloudFront for global edge caching. Vite for fast builds. |
| **Slack Bot** | Responds to `/dd0c who owns <service>` commands. Passive viral loop. | Node.js (Lambda) via Slack Bolt | Lambda for zero-cost when idle. Bolt is Slack's official SDK. |
| **WebSocket Gateway** | Pushes real-time discovery progress to the UI during onboarding ("Found 47 services... 89 services... 147 services..."). | API Gateway WebSocket API + Lambda | Managed WebSocket infrastructure. Only needed during discovery runs — Lambda scales to zero otherwise. |
### Technology Choices — Key Decisions
**Why Not Serverless-Everything (Lambda for API)?**
The Portal API handles Cmd+K search requests that must respond in <100ms. Lambda cold starts (500ms-2s for Node.js) are unacceptable for the primary user interaction. ECS Fargate with minimum 1 task provides warm, consistent latency. Discovery Lambdas are background tasks where cold starts are irrelevant.
**Why Meilisearch Over Algolia/Elasticsearch?**
- Algolia: SaaS pricing at scale ($1/1K search requests) becomes expensive with high DAU. Self-hosted Meilisearch is ~$0 marginal cost per search.
- Elasticsearch: Operational complexity is prohibitive for a solo founder. Requires JVM tuning, cluster management, index lifecycle policies. Meilisearch is a single binary with zero configuration.
- Meilisearch: Typo-tolerant by default (critical for Cmd+K UX), faceted filtering, <50ms on 100K documents, single Docker container, 200MB RAM for 10K services. Perfect for the scale and operational model.
**Why PostgreSQL Over DynamoDB?**
The service catalog is inherently relational: services belong to teams, teams have members, services have dependencies on other services, services map to repos, repos map to infrastructure. DynamoDB's single-table design would require complex GSIs and denormalization that increases development time. Aurora Serverless v2 scales down to a 0.5 ACU floor (~$43/month) at idle and handles relational queries natively. At the scale of 50-1000 services per tenant, PostgreSQL is more than sufficient.
**Why Not a Graph Database for Dependencies (V1)?**
Service dependency graphs are a V1.1 feature. For V1, dependencies are stored as adjacency lists in PostgreSQL (`service_dependencies` join table). This is sufficient for "what does this service depend on?" queries. A dedicated graph database (Neptune at $0.10/hour minimum = $73/month, or Neo4j) is premature optimization for V1. If dependency visualization becomes a core feature in V1.1+, evaluate Neptune Serverless or an in-app graph traversal library (graphology.js).
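The "what does this service depend on?" query over an adjacency list is a plain graph traversal in application code. A minimal sketch (the dependency data is illustrative, standing in for rows loaded from the `service_dependencies` join table):
```python
from collections import deque

# Illustrative adjacency list, as loaded from service_dependencies
# (source_service_id -> list of target_service_ids).
DEPS = {
    "payment-gateway": ["payment-processor", "auth-service"],
    "payment-processor": ["ledger-db"],
    "auth-service": [],
    "ledger-db": [],
}

def transitive_dependencies(service: str, deps: dict[str, list[str]]) -> set[str]:
    """BFS over the adjacency list: everything `service` depends on, directly or transitively."""
    seen: set[str] = set()
    queue = deque(deps.get(service, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(deps.get(node, []))
    return seen
```
At 50-1000 services per tenant this traversal is microseconds; a graph database buys nothing at this scale.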
### The 5-Minute Auto-Discovery Flow — Core Architectural Driver
This is the most important sequence in the entire system. Every architectural decision serves this flow.
```mermaid
sequenceDiagram
participant User as Engineer
participant UI as Portal UI
participant API as Portal API
participant SF as Step Functions
participant AWS as AWS Scanner (λ)
participant GH as GitHub Scanner (λ)
participant SQS as SQS FIFO
participant REC as Reconciler (λ)
participant INF as Inference (λ)
participant DB as PostgreSQL
participant MS as Meilisearch
participant WS as WebSocket
Note over User,WS: Minute 0:00 — Signup
User->>UI: Sign up (GitHub OAuth)
UI->>API: POST /auth/github
API->>DB: Create tenant + user
Note over User,WS: Minute 1:00 — Connect AWS
User->>UI: Deploy CloudFormation template (1-click)
UI->>API: POST /connections/aws {roleArn, externalId}
API->>API: sts:AssumeRole validation
API->>DB: Store connection credentials (encrypted)
Note over User,WS: Minute 2:00 — Connect GitHub (already done via OAuth)
API->>DB: Store GitHub org connection
Note over User,WS: Minute 2:30 — Trigger Discovery
API->>SF: StartExecution {tenantId, connections}
SF->>WS: Push "Discovery started..."
Note over User,WS: Minute 2:30-3:30 — Parallel Scanning
par AWS Scan
SF->>AWS: Scan CloudFormation stacks
AWS->>SQS: {type: cfn_stack, resources: [...]}
SF->>AWS: Scan ECS services
AWS->>SQS: {type: ecs_service, services: [...]}
SF->>AWS: Scan Lambda functions
AWS->>SQS: {type: lambda_fn, functions: [...]}
SF->>AWS: Scan API Gateway APIs
AWS->>SQS: {type: apigw, apis: [...]}
SF->>AWS: Scan RDS instances
AWS->>SQS: {type: rds, instances: [...]}
and GitHub Scan
SF->>GH: Scan repos (non-archived, non-fork)
GH->>SQS: {type: gh_repo, repos: [...]}
SF->>GH: Scan CODEOWNERS files
GH->>SQS: {type: codeowners, mappings: [...]}
SF->>GH: Scan team memberships
GH->>SQS: {type: gh_teams, teams: [...]}
end
WS-->>UI: Push "Found 47 AWS resources..."
WS-->>UI: Push "Found 89 GitHub repos..."
Note over User,WS: Minute 3:30-4:00 — Reconciliation
SQS->>REC: Batch process discovery events
REC->>REC: Cross-reference AWS resources ↔ GitHub repos
REC->>REC: Deduplicate (CFN stack name = ECS service = repo name)
REC->>REC: Merge into unified service entities
REC->>DB: Upsert service entities
Note over User,WS: Minute 4:00-4:30 — Ownership Inference
SF->>INF: Infer ownership for all services
INF->>INF: Score: CODEOWNERS (weight: 0.4) + git blame (0.25) + CFN tags (0.2) + team membership (0.15)
INF->>DB: Update services with owner + confidence score
INF->>MS: Index services for search
WS-->>UI: Push "Discovered 147 services. Catalog ready."
Note over User,WS: Minute 5:00 — First Search
User->>UI: Cmd+K → "payment"
UI->>API: GET /search?q=payment
API->>MS: Search
MS->>API: Results in <50ms
API->>UI: payment-gateway, payment-processor, payment-webhook
User->>User: "Holy shit, this actually works."
```
**Critical timing constraints:**
- AWS scanning must complete in <60 seconds for accounts with up to 500 resources. Achieved via parallel Lambda invocations per resource type.
- GitHub scanning must complete in <60 seconds for orgs with up to 500 repos. Achieved via GitHub GraphQL API (batch queries) instead of REST (one request per repo).
- Reconciliation must complete in <30 seconds. Single Lambda invocation processing all SQS messages in batch.
- Total pipeline: <120 seconds from trigger to searchable catalog. The "5-minute" promise includes signup + AWS connection time.
**Why Step Functions (not a simple Lambda chain)?**
- Built-in retry with exponential backoff per step (AWS API throttling is common)
- Parallel execution of AWS + GitHub scans with automatic join
- Visual execution history for debugging failed discoveries
- Error handling: if GitHub scan fails, AWS results still proceed (partial discovery > no discovery)
- State machine is inspectable — critical for debugging accuracy issues in production
---
## 2. CORE COMPONENTS
### 2.1 Discovery Engine
The discovery engine is the product. Everything else is UI on top of discovered data. If discovery is wrong, nothing else matters.
#### Architecture
```mermaid
graph TB
subgraph "Discovery Orchestrator (Step Functions)"
TRIGGER["Trigger<br/>(API call / Schedule)"]
PLAN["Plan Phase<br/>Determine scan scope"]
subgraph "Scan Phase (Parallel)"
subgraph "AWS Scanners"
CFN["CloudFormation<br/>Scanner"]
ECS_S["ECS<br/>Scanner"]
LAMBDA_S["Lambda<br/>Scanner"]
APIGW_S["API Gateway<br/>Scanner"]
RDS_S["RDS<br/>Scanner"]
TAG_S["Resource Groups<br/>Tag Scanner"]
end
subgraph "GitHub Scanners"
REPO_S["Repository<br/>Scanner"]
CODEOWNERS_S["CODEOWNERS<br/>Parser"]
TEAM_S["Team Membership<br/>Scanner"]
README_S["README<br/>Extractor"]
WORKFLOW_S["Actions Workflow<br/>Scanner"]
end
end
RECONCILE["Reconciliation Phase"]
INFER["Inference Phase"]
INDEX["Index Phase"]
end
TRIGGER --> PLAN
PLAN --> CFN & ECS_S & LAMBDA_S & APIGW_S & RDS_S & TAG_S
PLAN --> REPO_S & CODEOWNERS_S & TEAM_S & README_S & WORKFLOW_S
CFN & ECS_S & LAMBDA_S & APIGW_S & RDS_S & TAG_S --> RECONCILE
REPO_S & CODEOWNERS_S & TEAM_S & README_S & WORKFLOW_S --> RECONCILE
RECONCILE --> INFER --> INDEX
```
#### AWS Scanner — Resource-to-Service Mapping Strategy
The hardest problem in auto-discovery: what constitutes a "service"? AWS resources are granular (individual Lambdas, ECS tasks, RDS instances), but engineers think in services (payment-service, auth-service, user-api). The scanner must infer service boundaries from infrastructure patterns.
**Service Identification Heuristics (priority order):**
| Signal | Confidence | Logic |
|--------|-----------|-------|
| CloudFormation stack | 0.95 | Each stack is almost always a service or a closely related group. Stack name → service name. Stack tags (`service`, `team`, `project`) → metadata. |
| ECS service | 0.90 | Each ECS service is a deployable unit. Service name → service name. Task definition → tech stack (container image). |
| Lambda function with API Gateway trigger | 0.85 | Lambda + APIGW = API service. Group Lambdas sharing the same APIGW by API name. |
| Lambda function (standalone) | 0.60 | Standalone Lambdas may be services, cron jobs, or glue code. Group by naming prefix (e.g., `payment-*` → payment service). |
| Tagged resource group | 0.80 | Resources sharing a `service` or `project` tag are grouped. Tag value → service name. |
| RDS instance | 0.50 | Databases are infrastructure, not services — map each instance to its owning service via naming convention or CFN stack association. |
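The standalone-Lambda heuristic (group by naming prefix) might be sketched as follows — a simplification that takes only the first hyphen/underscore-separated token as the candidate service name:
```python
import re
from collections import defaultdict

def group_lambdas_by_prefix(function_names: list[str]) -> dict[str, list[str]]:
    """Group standalone Lambda functions by shared naming prefix.

    Heuristic sketch: the candidate service name is the first
    hyphen/underscore-separated token; functions sharing it are grouped.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for fn in function_names:
        prefix = re.split(r"[-_]", fn.lower())[0]
        groups[prefix].append(fn)
    return dict(groups)
```
So `payment-webhook-handler` and `payment-refund` collapse into a candidate `payment` service, consistent with the 0.60 confidence this signal carries.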
**AWS API Calls per Scan (estimated):**
```
cloudformation:ListStacks → 1 call (paginated)
cloudformation:DescribeStacks → 1 call per stack (batched)
cloudformation:ListStackResources → 1 call per stack
ecs:ListClusters + ListServices → 2-5 calls
ecs:DescribeServices + DescribeTaskDefinition → 1 per service
lambda:ListFunctions → 1-3 calls (paginated)
lambda:ListEventSourceMappings → 1 per function (batched)
apigateway:GetRestApis + GetResources → 2-5 calls
apigatewayv2:GetApis → 1 call
rds:DescribeDBInstances → 1 call
resourcegroupstaggingapi:GetResources → 1-5 calls (paginated, filtered by service/team tags)
```
**Total: ~50-200 API calls per scan for a typical 50-service account.** Well within AWS API rate limits. Parallel execution across resource types keeps total scan time under 30 seconds.
**Cross-Account AssumeRole Pattern:**
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::DDOC_PLATFORM_ACCOUNT:role/dd0c-discovery-role"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "{{tenant-specific-external-id}}"
}
}
}
]
}
```
The customer deploys a CloudFormation template (provided by dd0c) that creates a read-only IAM role with:
- `ReadOnlyAccess` managed policy (or a custom policy scoped to the read-only API calls listed above)
- Trust policy allowing dd0c's platform account to assume the role
- ExternalId unique per tenant (prevents confused deputy attacks)
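For illustration, the trust policy above can be rendered per tenant by the template generator. A sketch — the function name and the platform account id are hypothetical:
```python
# Placeholder platform account id for illustration only.
DDOC_PLATFORM_ACCOUNT = "123456789012"

def build_trust_policy(external_id: str) -> dict:
    """Render the cross-account trust policy with a per-tenant ExternalId.

    The ExternalId condition is what prevents the confused deputy attack:
    dd0c's role can only be assumed when the caller presents the tenant's id.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{DDOC_PLATFORM_ACCOUNT}:role/dd0c-discovery-role"
                },
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }
```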
#### GitHub Scanner — Repository-to-Service Mapping
**GraphQL Batch Query (single request for up to 100 repos):**
```graphql
query($org: String!, $cursor: String) {
organization(login: $org) {
repositories(first: 100, after: $cursor, isArchived: false, isFork: false) {
nodes {
name
description
primaryLanguage { name }
languages(first: 5) { nodes { name } }
defaultBranchRef {
target {
... on Commit {
history(first: 1) {
nodes { committedDate author { user { login } } }
}
}
}
}
codeowners: object(expression: "HEAD:CODEOWNERS") {
... on Blob { text }
}
readme: object(expression: "HEAD:README.md") {
... on Blob { text }
}
catalogInfo: object(expression: "HEAD:catalog-info.yaml") {
... on Blob { text }
}
deployWorkflow: object(expression: "HEAD:.github/workflows/deploy.yml") {
... on Blob { text }
}
}
pageInfo { hasNextPage endCursor }
}
teams(first: 100) {
nodes {
name slug
members(first: 100) { nodes { login name } }
repositories(first: 100) { nodes { name } }
}
}
}
}
```
**Key extraction logic:**
- `CODEOWNERS` → parse ownership patterns, map `@org/team-name` to team entities
- `README.md` → extract first paragraph as service description (LLM-assisted summarization in V2)
- `catalog-info.yaml` → if present (Backstage migrators), parse existing metadata as high-confidence input
- `.github/workflows/deploy.yml` → extract deployment target (ECS service name, Lambda function name) to cross-reference with AWS scan
- `primaryLanguage` → tech stack
- Recent commit authors → contributor frequency for ownership inference
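The CODEOWNERS extraction step can be sketched as a small parser — a simplification that keeps only `@user` / `@org/team` handles (email-address owners, which the format also permits, are dropped here):
```python
def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Parse CODEOWNERS text into (path_pattern, owners) pairs.

    Strips comments and blank lines; later rules take precedence in the
    real format, so order is preserved.
    """
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, [o for o in owners if o.startswith("@")]))
    return rules
```
The `@org/team-name` handles then map directly onto the team entities discovered by the team-membership scan.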
#### Service Relationship Inference
Cross-referencing AWS and GitHub data to build the service graph:
```
MATCHING RULES (priority order):
1. EXPLICIT TAG MATCH
AWS resource tag "github_repo" = "org/repo-name"
→ Direct link. Confidence: 0.95
2. CFN STACK → GITHUB ACTIONS DEPLOY TARGET
GitHub workflow deploys to ECS service "payment-api"
CFN stack contains ECS service "payment-api"
→ Link repo to CFN stack's service. Confidence: 0.90
3. NAME MATCH (normalized)
GitHub repo: "payment-service"
ECS service: "payment-service" or "payment-svc"
→ Fuzzy name match (Levenshtein distance ≤ 2). Confidence: 0.75
4. ECR IMAGE → GITHUB REPO
ECS task definition references ECR image "payment-api:latest"
ECR image was built from GitHub repo "payment-api" (via image tag or build metadata)
→ Confidence: 0.85
5. LAMBDA FUNCTION NAME → REPO NAME
Lambda: "payment-webhook-handler"
Repo: "payment-webhook" or "payment-service" (contains Lambda deploy workflow)
→ Confidence: 0.70
```
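Rule 3's fuzzy name match can be sketched concretely. Note that "payment-service" vs "payment-svc" only lands within distance ≤ 2 after normalization expands common abbreviations, so normalization does the heavy lifting (the abbreviation table is illustrative):
```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Illustrative abbreviation table; the real list would be tuned from data.
ABBREVIATIONS = {"svc": "service", "fn": "function", "db": "database"}

def normalize(name: str) -> str:
    tokens = [ABBREVIATIONS.get(t, t) for t in name.lower().replace("_", "-").split("-")]
    return "-".join(tokens)

def names_match(repo: str, resource: str, max_distance: int = 2) -> bool:
    return levenshtein(normalize(repo), normalize(resource)) <= max_distance
```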
**Confidence Score Calculation:**
Each service entity gets a composite confidence score:
```python
def calculate_confidence(service: Service) -> float:
scores = []
# Existence confidence: how sure are we this is a real service?
if service.source == "cloudformation_stack":
scores.append(("existence", 0.95))
elif service.source == "ecs_service":
scores.append(("existence", 0.90))
elif service.source == "github_repo_only":
scores.append(("existence", 0.60)) # repo exists but no infra found
# Ownership confidence
if service.owner_source == "codeowners":
scores.append(("ownership", 0.90))
elif service.owner_source == "cfn_tag":
scores.append(("ownership", 0.85))
elif service.owner_source == "git_blame_frequency":
scores.append(("ownership", 0.65))
elif service.owner_source == "inferred_from_team_membership":
scores.append(("ownership", 0.50))
# Repo linkage confidence
if service.repo_link_source == "explicit_tag":
scores.append(("repo_link", 0.95))
elif service.repo_link_source == "deploy_workflow":
scores.append(("repo_link", 0.90))
elif service.repo_link_source == "name_match":
scores.append(("repo_link", 0.75))
return weighted_average(scores)
```
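`weighted_average` is left undefined above. One minimal interpretation, renormalizing over whichever dimensions were actually scored — the 0.40/0.35/0.25 split between dimensions is hypothetical, not specified by the model:
```python
# Hypothetical per-dimension weights (illustrative values).
DIMENSION_WEIGHTS = {"existence": 0.40, "ownership": 0.35, "repo_link": 0.25}

def weighted_average(scores: list[tuple[str, float]]) -> float:
    """Combine per-dimension scores, renormalizing over the dimensions present,
    so a service with no repo link isn't penalized for the missing dimension."""
    if not scores:
        return 0.0
    total_weight = sum(DIMENSION_WEIGHTS.get(dim, 0.0) for dim, _ in scores)
    if total_weight == 0:
        return 0.0
    return sum(DIMENSION_WEIGHTS.get(dim, 0.0) * s for dim, s in scores) / total_weight
```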
**The >80% accuracy target** is measured as:
```
accuracy = (services_correct_without_user_correction) / (total_services_discovered)
```
Where "correct" means: service exists, owner is right, repo link is right. Measured during beta by asking each beta customer to review their catalog and mark corrections.
#### Discovery Scheduling
| Trigger | Frequency | Scope |
|---------|-----------|-------|
| Initial onboarding | Once | Full scan (all resource types, all repos) |
| Scheduled refresh | Every 6 hours (configurable: 1h-24h) | Incremental — only scan resources modified since last scan (CloudFormation events, GitHub push webhooks) |
| Manual trigger | On-demand (UI button) | Full scan |
| Webhook-driven | Real-time | GitHub push to CODEOWNERS → re-infer ownership for affected repos. CloudFormation stack events → update service metadata. |
| User correction | Immediate | Re-score ownership model for similar services when user corrects one |
### 2.2 Service Catalog
The service catalog is the central data model. Everything reads from it, everything writes to it.
#### Service Entity Model
```
┌─────────────────────────────────────────────────────────┐
│ SERVICE │
├─────────────────────────────────────────────────────────┤
│ id: uuid (PK) │
│ tenant_id: uuid (FK → tenants) │
│ name: varchar(255) │
│ display_name: varchar(255) │
│ description: text (extracted from README) │
│ service_type: enum [api, worker, cron, database, queue] │
│ lifecycle: enum [production, staging, deprecated, eol] │
│ tier: enum [critical, standard, experimental] │
│ tech_stack: jsonb (languages, frameworks, runtime) │
│ repo_url: varchar(500) │
│ repo_default_branch: varchar(100) │
│ infrastructure: jsonb (aws_resources, regions, accounts)│
│ health_status: enum [healthy, degraded, down, unknown] │
│ last_deploy_at: timestamptz │
│ last_discovered_at: timestamptz │
│ confidence_score: decimal(3,2) [0.00-1.00] │
│ discovery_sources: jsonb (which scanners found this) │
│ metadata: jsonb (extensible key-value pairs) │
│ created_at: timestamptz │
│ updated_at: timestamptz │
├─────────────────────────────────────────────────────────┤
│ INDEXES: tenant_id, name (unique per tenant), │
│ confidence_score, lifecycle │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ TEAM │
├─────────────────────────────────────────────────────────┤
│ id: uuid (PK) │
│ tenant_id: uuid (FK → tenants) │
│ name: varchar(255) │
│ slug: varchar(255) │
│ github_team_slug: varchar(255) │
│ slack_channel: varchar(255) │
│ pagerduty_schedule_id: varchar(255) │
│ members: jsonb (user references) │
│ created_at: timestamptz │
│ updated_at: timestamptz │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ SERVICE_OWNERSHIP │
├─────────────────────────────────────────────────────────┤
│ service_id: uuid (FK → services) │
│ team_id: uuid (FK → teams) │
│ ownership_type: enum [primary, contributing, on_call] │
│ confidence: decimal(3,2) │
│ source: enum [codeowners, cfn_tag, git_blame, │
│ team_membership, user_correction] │
│ verified_by: uuid (FK → users, nullable) │
│ verified_at: timestamptz (nullable) │
│ created_at: timestamptz │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ SERVICE_DEPENDENCY (V1.1) │
├─────────────────────────────────────────────────────────┤
│ source_service_id: uuid (FK → services) │
│ target_service_id: uuid (FK → services) │
│ dependency_type: enum [calls, publishes_to, reads_from] │
│ confidence: decimal(3,2) │
│ source: enum [vpc_flow, apigw_integration, │
│ lambda_event_source, user_defined] │
│ created_at: timestamptz │
└─────────────────────────────────────────────────────────┘
```
#### Ownership Mapping
Ownership is the highest-value data point in the catalog. The inference engine uses a weighted scoring model:
```
OWNERSHIP SCORING MODEL
Input signals (per service):
1. CODEOWNERS file match → weight: 0.40
2. CloudFormation/resource tags → weight: 0.20
3. Git blame frequency (top team) → weight: 0.25
4. GitHub team → repo association → weight: 0.15
Process:
- For each candidate team, sum weighted scores across all signals
- Normalize to [0, 1]
- Assign primary owner = highest scoring team
- If top score < 0.50 → mark as "unowned" (flag for user review)
- If top two scores within 0.10 → mark as "ambiguous" (flag for user review)
User corrections:
- When a user corrects ownership, the correction is stored as source="user_correction"
- User corrections have implicit weight 1.0 (override all inference)
- Corrections propagate: if user says "payment-* repos belong to @payments-team",
apply to all matching repos with confidence 0.85
```
#### Metadata Enrichment
Beyond ownership, the catalog enriches each service with:
| Field | Source | Extraction Method |
|-------|--------|-------------------|
| Description | GitHub README | First paragraph extraction (regex: first non-heading, non-badge paragraph) |
| Tech stack | GitHub `primaryLanguage` + `languages` | Direct from GitHub API |
| Runtime | ECS task definition / Lambda runtime | Direct from AWS API |
| Last deploy | GitHub Actions last successful workflow run / ECS last deployment | Most recent timestamp |
| On-call | PagerDuty schedule mapped to team | PagerDuty API: `GET /schedules` → match by team name or escalation policy |
| Health | CloudWatch alarm state for associated resources | Aggregate: all alarms OK → healthy, any alarm in ALARM state → degraded, a critical-tier alarm in ALARM → down |
| Cost | dd0c/cost module (when connected) | Internal API: `GET /cost/services/{serviceId}/monthly` |
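The description extraction ("first non-heading, non-badge paragraph") might look like this — a sketch; the badge regex covers the common `[![alt](img)](link)` markdown form only:
```python
import re

def extract_description(readme: str) -> str:
    """Return the first README paragraph that is not a heading, badge row,
    code fence, or raw HTML block."""
    for para in re.split(r"\n\s*\n", readme.strip()):
        stripped = para.strip()
        if not stripped:
            continue
        if stripped.startswith(("#", "```", "<")):
            continue  # heading, code fence, raw HTML
        if re.fullmatch(r"(\[?!\[[^\]]*\]\([^)]*\)\]?(\([^)]*\))?\s*)+", stripped):
            continue  # badge-only paragraph
        return stripped
    return ""
```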
### 2.3 Search Engine
The Cmd+K search bar is the daily-use hook. It must be faster than asking in Slack.
#### Search Architecture
```mermaid
graph LR
USER["User types in Cmd+K"] --> SPA["React SPA"]
SPA -- "debounce 150ms" --> API["Portal API"]
API --> REDIS["Redis<br/>Prefix cache<br/>(hot queries)"]
REDIS -- "cache miss" --> MEILI["Meilisearch"]
MEILI --> API
API --> SPA
SPA --> USER
style REDIS fill:#f9f,stroke:#333
style MEILI fill:#bbf,stroke:#333
```
**Performance budget:**
- Keystroke to API request: <150ms (debounce)
- API to Meilisearch: <10ms (same VPC, same AZ)
- Meilisearch query execution: <50ms (for 10K documents)
- API response to UI render: <50ms
- **Total perceived latency: <200ms.** The 150ms debounce overlaps continued typing, so latency after the final keystroke is ~110ms (target: feels instant)
**Meilisearch Index Configuration:**
```json
{
"index": "services",
"primaryKey": "id",
"searchableAttributes": [
"name",
"display_name",
"description",
"team_name",
"tech_stack",
"repo_name",
"tags"
],
"filterableAttributes": [
"tenant_id",
"lifecycle",
"tier",
"team_name",
"tech_stack",
"health_status",
"confidence_score"
],
"sortableAttributes": [
"name",
"last_deploy_at",
"confidence_score",
"updated_at"
],
"rankingRules": [
"words",
"typo",
"proximity",
"attribute",
"sort",
"exactness"
],
"typoTolerance": {
"enabled": true,
"minWordSizeForTypos": {
"oneTypo": 3,
"twoTypos": 6
}
}
}
```
**Multi-tenant isolation in search:** Every document in Meilisearch includes `tenant_id`. Every query includes a mandatory filter: `tenant_id = '{current_tenant}'`. This is enforced at the API layer — the SPA never queries Meilisearch directly.
**Search result format:**
```json
{
"hits": [
{
"id": "svc_abc123",
"name": "payment-gateway",
"display_name": "Payment Gateway",
"description": "Handles payment processing via Stripe integration",
"team_name": "Payments Team",
"repo_url": "https://github.com/acme/payment-gateway",
"health_status": "healthy",
"tech_stack": ["TypeScript", "Node.js"],
"confidence_score": 0.92,
"last_deploy_at": "2026-02-27T14:30:00Z",
"_matchesPosition": { "name": [{"start": 0, "length": 7}] }
}
],
"query": "payment",
"processingTimeMs": 12,
"estimatedTotalHits": 3
}
```
#### Redis Prefix Cache
For the most common queries (top 100 per tenant), cache the Meilisearch response in Redis with a 5-minute TTL. This reduces Meilisearch load and provides <5ms response for repeated queries.
```
Key pattern: search:{tenant_id}:{normalized_query_prefix}
TTL: 300 seconds
Invalidation: on any service upsert for the tenant (conservative but simple)
```
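A sketch of the key construction (the normalization rules — lowercasing, whitespace collapsing, length cap — are illustrative choices, not specified above):
```python
def search_cache_key(tenant_id: str, query: str) -> str:
    """Normalize the query and build the per-tenant cache key.

    Normalization collapses whitespace and lowercases so trivially
    different inputs hit the same cache entry; length is capped to
    keep keys bounded.
    """
    normalized = " ".join(query.lower().split())[:64]
    return f"search:{tenant_id}:{normalized}"

def invalidation_pattern(tenant_id: str) -> str:
    """Glob pattern for SCAN-based invalidation on any service upsert."""
    return f"search:{tenant_id}:*"
```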
### 2.4 AI Agent — "Ask Your Infra" (V2)
Deferred to V2 (Month 7-12), but the architecture must accommodate it from day one.
#### Design
```mermaid
graph TB
USER["User: 'Which services handle PII?'"]
AGENT["AI Agent (Lambda)"]
LLM["LLM (Claude / GPT-4o)"]
CATALOG["Service Catalog (PostgreSQL)"]
SEARCH["Meilisearch"]
COST["dd0c/cost API"]
ALERT["dd0c/alert API"]
USER --> AGENT
AGENT --> LLM
LLM -- "tool_call: search_services" --> SEARCH
LLM -- "tool_call: query_catalog" --> CATALOG
LLM -- "tool_call: get_cost" --> COST
LLM -- "tool_call: get_incidents" --> ALERT
LLM --> AGENT
AGENT --> USER
```
**RAG approach:** The AI agent does NOT embed the entire catalog into a vector store. Instead, it uses structured tool calls:
1. User asks a natural language question
2. LLM receives the question + a system prompt describing available tools (search, SQL query, cost API, alert API)
3. LLM generates tool calls to retrieve relevant data
4. Results are injected into context
5. LLM synthesizes a natural language answer with citations
**Why tool-use over vector RAG?**
- The service catalog is structured data (tables, relationships). SQL queries are more precise than semantic similarity search.
- The catalog is small enough (<10K services) that tool calls retrieve exact data, not "similar" data.
- No embedding pipeline to maintain. No vector database to operate. Simpler architecture for a solo founder.
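The tenant-isolation guardrail can be sketched at the tool-dispatch layer: the tenant id is injected server-side from the session, and any tenant id the model tries to pass is discarded. Everything here is illustrative — the in-memory catalog stands in for the real search/SQL backends:
```python
# Fake per-tenant catalog standing in for Meilisearch/PostgreSQL.
FAKE_CATALOG = {
    "t_acme": [{"name": "payment-gateway"}],
    "t_other": [{"name": "secret-service"}],
}

def search_services(tenant_id: str, q: str) -> list[dict]:
    return [s for s in FAKE_CATALOG.get(tenant_id, []) if q in s["name"]]

# Only read-only tools are registered; there is no write tool to call.
READ_ONLY_TOOLS = {"search_services": search_services}

def dispatch_tool_call(session_tenant_id: str, name: str, args: dict) -> list[dict]:
    """Execute an LLM tool call under the tenant-isolation guardrail."""
    if name not in READ_ONLY_TOOLS:
        raise ValueError(f"unknown or non-read-only tool: {name}")
    args = dict(args)
    args.pop("tenant_id", None)  # never trust a model-supplied tenant_id
    return READ_ONLY_TOOLS[name](tenant_id=session_tenant_id, **args)
```
Even a prompt-injected `tenant_id` in the tool arguments cannot cross tenants, because the dispatcher overwrites it with the session's value.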
**V2 scope:**
- Natural language queries via portal UI and Slack bot
- Tool calls: `search_services`, `get_service_detail`, `query_services_by_attribute`, `get_team_services`, `get_service_cost`, `get_service_incidents`
- Guardrails: tenant isolation (LLM can only query current tenant's data), no write operations, response length limits
- Cost control: cache identical queries for 5 minutes, rate limit to 50 queries/user/day
### 2.5 Dashboard
The dashboard serves two audiences with different needs:
**Engineers (daily use):** Cmd+K search bar front and center. Recent services visited. Team's services quick-access. That's it. Calm surface.
**Directors (weekly use):** Org-wide metrics. Service count by team. Ownership coverage (% of services with verified owners). Health overview. Discovery accuracy trend. Exportable for compliance.
#### Dashboard Component Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ PORTAL DASHBOARD │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 🔍 Cmd+K: Search services, teams, or keywords... │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ 147 Services │ │ 12 Teams │ │ 89% Accuracy │ │
│ │ 3 unowned │ │ 2 on-call │ │ ↑ from 82% │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ RECENT ───────────────────────────────────────────── │
│ payment-gateway │ @payments │ healthy │ 2h ago │
│ auth-service │ @platform │ healthy │ 1d ago │
│ order-engine │ @orders │ degraded│ 3h ago │
│ │
│ YOUR TEAM (@platform) ────────────────────────────── │
│ auth-service │ healthy │ ts/node │ repo ↗ │
│ api-gateway │ healthy │ ts/node │ repo ↗ │
│ user-service │ degraded│ python │ repo ↗ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SERVICE DETAIL (expanded on click) │ │
│ │ ┌─────────┬──────────┬──────────┬──────────┬──────────┐ │ │
│ │ │ Overview│ Infra │ On-Call │ Cost │ Incidents│ │ │
│ │ ├─────────┴──────────┴──────────┴──────────┴──────────┤ │ │
│ │ │ Owner: @payments-team (92% confidence) [Correct ✏️] │ │ │
│ │ │ Repo: github.com/acme/payment-gateway │ │ │
│ │ │ Stack: TypeScript, Node.js, ECS Fargate │ │ │
│ │ │ Last Deploy: 2h ago by @sarah │ │ │
│ │ │ Health: ✅ All CloudWatch alarms OK │ │ │
│ │ │ On-Call: @mike (PagerDuty, ends in 4h) │ │ │
│ │ │ Cost: $847/mo (dd0c/cost) ↑12% from last month │ │ │
│ │ │ Incidents: 2 this month (dd0c/alert) │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
**Progressive disclosure in action:**
- Default: one-line-per-service table (name, owner, health, last deploy)
- Click: expanded service card with tabs (overview, infra, on-call, cost, incidents)
- Each tab loads data on demand (lazy loading) — no upfront cost for data the user doesn't need
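The lazy-loading behavior above can be sketched as a small client-side loader that fetches each (service, tab) pane at most once and reuses the in-flight promise if a tab is reopened while the first request is pending. This is a hypothetical sketch (the class and tab names are illustrative); the real SPA would likely wire this through a data-fetching library:

```typescript
// Hypothetical tab-data loader: fetch each (service, tab) pane at most once,
// and reuse the in-flight promise on repeated opens.
type TabName = "overview" | "infra" | "oncall" | "cost" | "incidents";
type Fetcher = (serviceId: string, tab: TabName) => Promise<unknown>;

class TabLoader {
  private cache = new Map<string, Promise<unknown>>();
  constructor(private fetcher: Fetcher) {}

  load(serviceId: string, tab: TabName): Promise<unknown> {
    const key = `${serviceId}:${tab}`;
    let pending = this.cache.get(key);
    if (!pending) {
      pending = this.fetcher(serviceId, tab); // only fetch on first open
      this.cache.set(key, pending);
    }
    return pending;
  }

  invalidate(serviceId: string): void {
    // e.g. after a discovery refresh touches this service
    for (const key of this.cache.keys()) {
      if (key.startsWith(`${serviceId}:`)) this.cache.delete(key);
    }
  }
}
```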
---
## 3. DATA ARCHITECTURE
### 3.1 Complete Database Schema
#### Core Entities
```sql
-- Tenant isolation: every table has tenant_id. Every query filters by it. No exceptions.
CREATE TABLE tenants (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
slug VARCHAR(255) NOT NULL UNIQUE,
plan VARCHAR(50) NOT NULL DEFAULT 'free', -- free, team, business
stripe_customer_id VARCHAR(255),
stripe_subscription_id VARCHAR(255),
settings JSONB NOT NULL DEFAULT '{}',
-- settings: { discovery_interval_hours: 6, auto_refresh: true, slack_workspace_id: "T..." }
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
github_id BIGINT NOT NULL,
github_login VARCHAR(255) NOT NULL,
email VARCHAR(255),
display_name VARCHAR(255),
avatar_url VARCHAR(500),
role VARCHAR(50) NOT NULL DEFAULT 'member', -- admin, member
last_active_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(tenant_id, github_id)
);
CREATE TABLE connections (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
provider VARCHAR(50) NOT NULL, -- aws, github, pagerduty, opsgenie
status VARCHAR(50) NOT NULL DEFAULT 'pending', -- pending, active, error, revoked
credentials JSONB NOT NULL, -- encrypted at rest (KMS)
-- aws: { role_arn, external_id, regions: ["us-east-1", "us-west-2"] }
-- github: { installation_id, org_login, access_token_encrypted }
-- pagerduty: { api_key_encrypted, subdomain }
last_scan_at TIMESTAMPTZ,
last_error TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(tenant_id, provider)
);
CREATE TABLE services (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
name VARCHAR(255) NOT NULL,
display_name VARCHAR(255),
description TEXT,
service_type VARCHAR(50), -- api, worker, cron, database, queue, frontend, library
lifecycle VARCHAR(50) NOT NULL DEFAULT 'production', -- production, staging, deprecated, eol
tier VARCHAR(50) NOT NULL DEFAULT 'standard', -- critical, standard, experimental
tech_stack JSONB DEFAULT '[]', -- ["TypeScript", "Node.js", "Express"]
repo_url VARCHAR(500),
repo_default_branch VARCHAR(100) DEFAULT 'main',
infrastructure JSONB DEFAULT '{}',
-- { aws_resources: [{type: "ecs_service", arn: "...", region: "us-east-1"}],
-- aws_account_id: "123456789012" }
health_status VARCHAR(50) DEFAULT 'unknown', -- healthy, degraded, down, unknown
last_deploy_at TIMESTAMPTZ,
last_discovered_at TIMESTAMPTZ,
confidence_score DECIMAL(3,2) DEFAULT 0.00,
discovery_sources JSONB DEFAULT '[]', -- ["cloudformation", "github_repo", "ecs_service"]
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(tenant_id, name)
);
CREATE INDEX idx_services_tenant ON services(tenant_id);
CREATE INDEX idx_services_lifecycle ON services(tenant_id, lifecycle);
CREATE INDEX idx_services_confidence ON services(tenant_id, confidence_score);
CREATE INDEX idx_services_health ON services(tenant_id, health_status);
CREATE TABLE teams (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
name VARCHAR(255) NOT NULL,
slug VARCHAR(255) NOT NULL,
github_team_slug VARCHAR(255),
slack_channel_id VARCHAR(255),
slack_channel_name VARCHAR(255),
pagerduty_schedule_id VARCHAR(255),
opsgenie_team_id VARCHAR(255),
contact_email VARCHAR(255),
members JSONB DEFAULT '[]',
-- [{ github_login: "sarah", name: "Sarah Chen", role: "lead" }]
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(tenant_id, slug)
);
CREATE TABLE service_ownership (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
service_id UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
team_id UUID NOT NULL REFERENCES teams(id) ON DELETE CASCADE,
tenant_id UUID NOT NULL REFERENCES tenants(id),
ownership_type VARCHAR(50) NOT NULL DEFAULT 'primary', -- primary, contributing, on_call
confidence DECIMAL(3,2) NOT NULL DEFAULT 0.00,
source VARCHAR(50) NOT NULL, -- codeowners, cfn_tag, git_blame, team_membership, user_correction
verified_by UUID REFERENCES users(id),
verified_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(service_id, team_id, ownership_type)
);
CREATE INDEX idx_ownership_service ON service_ownership(service_id);
CREATE INDEX idx_ownership_team ON service_ownership(team_id);
CREATE TABLE service_dependencies (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
source_service_id UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
target_service_id UUID NOT NULL REFERENCES services(id) ON DELETE CASCADE,
dependency_type VARCHAR(50) NOT NULL, -- calls, publishes_to, reads_from, triggers
confidence DECIMAL(3,2) NOT NULL DEFAULT 0.00,
source VARCHAR(50) NOT NULL, -- vpc_flow, apigw_integration, lambda_event_source, user_defined
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(source_service_id, target_service_id, dependency_type)
);
```
#### Discovery Event Log
Every discovery run produces an immutable event log. This is critical for debugging accuracy issues, auditing what changed, and measuring improvement over time.
```sql
CREATE TABLE discovery_runs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
trigger_type VARCHAR(50) NOT NULL, -- onboarding, scheduled, manual, webhook
status VARCHAR(50) NOT NULL DEFAULT 'running', -- running, completed, partial, failed
step_function_execution_arn VARCHAR(500),
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
stats JSONB DEFAULT '{}',
-- { aws_resources_found: 234, github_repos_found: 89,
-- services_created: 12, services_updated: 135, services_unchanged: 0,
-- ownership_inferred: 140, ownership_ambiguous: 7,
-- scan_duration_ms: 28400, reconcile_duration_ms: 4200 }
errors JSONB DEFAULT '[]'
-- [{ phase: "aws_scan", resource: "lambda", error: "ThrottlingException", retried: true }]
);
CREATE INDEX idx_discovery_runs_tenant ON discovery_runs(tenant_id, started_at DESC);
CREATE TABLE discovery_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
run_id UUID NOT NULL REFERENCES discovery_runs(id) ON DELETE CASCADE,
tenant_id UUID NOT NULL REFERENCES tenants(id),
event_type VARCHAR(50) NOT NULL,
-- service_created, service_updated, service_removed,
-- ownership_changed, ownership_ambiguous,
-- repo_linked, repo_unlinked,
-- resource_discovered, resource_removed
service_id UUID REFERENCES services(id),
payload JSONB NOT NULL,
-- { field: "owner", old_value: "@platform", new_value: "@payments",
-- old_confidence: 0.65, new_confidence: 0.88,
-- reason: "CODEOWNERS updated" }
confidence DECIMAL(3,2),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_discovery_events_run ON discovery_events(run_id);
CREATE INDEX idx_discovery_events_service ON discovery_events(service_id, created_at DESC);
-- Partition discovery_events by month for efficient cleanup
-- Retain 90 days of events, archive to S3 after that
```
#### User Corrections (Feedback Loop)
```sql
CREATE TABLE corrections (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
service_id UUID NOT NULL REFERENCES services(id),
user_id UUID NOT NULL REFERENCES users(id),
field VARCHAR(100) NOT NULL, -- owner, description, tier, lifecycle, repo_url
old_value JSONB,
new_value JSONB,
applied BOOLEAN NOT NULL DEFAULT TRUE,
propagated BOOLEAN NOT NULL DEFAULT FALSE,
-- propagated: did this correction update inference for similar services?
propagation_count INT DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_corrections_tenant ON corrections(tenant_id, created_at DESC);
```
Corrections are the most valuable data in the system. They:
1. Immediately fix the corrected service
2. Feed back into the ownership inference model (increase weight for the corrected signal)
3. Propagate to similar services when patterns are detected (e.g., "user corrected 3 services in `payment-*` repos to @payments-team → auto-apply to remaining `payment-*` repos with confidence 0.85")
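The propagation heuristic in step 3 might look like the following sketch. The `Correction` shape, the hyphen-prefix pattern rule, and the 0.85 confidence floor are illustrative, matching the `payment-*` example above:

```typescript
// Sketch of correction propagation: when >= `threshold` owner corrections
// share a service-name prefix and point at the same team, propose applying
// that ownership to remaining services matching the prefix.
interface Correction {
  serviceName: string;
  field: string;      // e.g. "owner"
  newValue: string;   // e.g. "@payments-team"
}

interface Propagation {
  pattern: string;    // e.g. "payment-*"
  team: string;
  support: number;    // how many corrections back this pattern
  confidence: number;
}

function proposePropagations(corrections: Correction[], threshold = 3): Propagation[] {
  const groups = new Map<string, number>();
  for (const c of corrections) {
    if (c.field !== "owner") continue;
    const prefix = c.serviceName.split("-")[0] + "-*";
    const key = `${prefix}|${c.newValue}`;
    groups.set(key, (groups.get(key) ?? 0) + 1);
  }
  return [...groups.entries()]
    .filter(([, support]) => support >= threshold)
    .map(([key, support]) => {
      const [pattern, team] = key.split("|");
      return { pattern, team, support, confidence: 0.85 };
    });
}
```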
### 3.2 Search Index Design
**Meilisearch document structure** (denormalized from PostgreSQL for search performance):
```json
{
"id": "svc_abc123",
"tenant_id": "tenant_xyz",
"name": "payment-gateway",
"display_name": "Payment Gateway",
"description": "Handles payment processing via Stripe. Exposes REST API for checkout flow.",
"service_type": "api",
"lifecycle": "production",
"tier": "critical",
"team_name": "Payments Team",
"team_slug": "payments",
"owner_confidence": 0.92,
"tech_stack": ["TypeScript", "Node.js"],
"repo_name": "payment-gateway",
"repo_url": "https://github.com/acme/payment-gateway",
"health_status": "healthy",
"last_deploy_at": 1740667800,
"aws_services": ["ecs", "rds", "elasticache"],
"aws_region": "us-east-1",
"tags": ["payments", "stripe", "checkout", "critical-path"],
"confidence_score": 0.92,
"updated_at": 1740667800
}
```
**Index sync strategy:**
- On every service upsert in PostgreSQL → publish to SQS → Lambda consumer → Meilisearch `addDocuments` (batch, async)
- Latency: service update → searchable in <5 seconds
- Full reindex: triggered on Meilisearch restart or schema change. Reads all services from PostgreSQL, batches of 1000 documents. For 10K services: ~10 seconds.
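The denormalization step in the sync pipeline might look like the following sketch. Field names follow the document structure above; the input row shapes are assumptions, and timestamps become Unix seconds so Meilisearch can sort and filter on them numerically:

```typescript
// Sketch: flatten a service row plus its resolved team into the Meilisearch
// document shape shown above.
interface ServiceRow {
  id: string;
  tenant_id: string;
  name: string;
  display_name: string | null;
  description: string | null;
  tech_stack: string[];
  health_status: string;
  confidence_score: number;
  last_deploy_at: Date | null;
  updated_at: Date;
}

interface TeamRow { name: string; slug: string; }

function toSearchDocument(svc: ServiceRow, team: TeamRow | null) {
  return {
    id: svc.id,
    tenant_id: svc.tenant_id,
    name: svc.name,
    display_name: svc.display_name ?? svc.name,
    description: svc.description ?? "",
    team_name: team?.name ?? null,
    team_slug: team?.slug ?? null,
    tech_stack: svc.tech_stack,
    health_status: svc.health_status,
    confidence_score: svc.confidence_score,
    last_deploy_at: svc.last_deploy_at ? Math.floor(svc.last_deploy_at.getTime() / 1000) : null,
    updated_at: Math.floor(svc.updated_at.getTime() / 1000),
  };
}
```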
**Index sizing:**
- Average document size: ~1KB
- 1,000 services: ~1MB index, ~50MB RAM
- 10,000 services: ~10MB index, ~200MB RAM
- Meilisearch on a single Fargate task (0.5 vCPU, 1GB RAM) handles 10K+ services comfortably
### 3.3 Graph Database Decision: Not Yet
**V1: PostgreSQL adjacency list.** Service dependencies stored in `service_dependencies` table. Queries like "what does service X depend on?" are simple JOINs. Queries like "what's the full transitive dependency tree?" use recursive CTEs:
```sql
WITH RECURSIVE dep_tree AS (
SELECT target_service_id, 1 as depth
FROM service_dependencies
WHERE source_service_id = :service_id AND tenant_id = :tenant_id
UNION ALL
SELECT sd.target_service_id, dt.depth + 1
FROM service_dependencies sd
JOIN dep_tree dt ON sd.source_service_id = dt.target_service_id
  WHERE dt.depth < 10 -- bound recursion depth; terminates even if the graph has cycles
)
SELECT DISTINCT s.* FROM dep_tree dt
JOIN services s ON s.id = dt.target_service_id;
```
At the scale of 50-1000 services with an average of 3-5 dependencies each, this recursive CTE executes in <50ms on Aurora. A graph database is unnecessary.
**V1.1+ evaluation criteria for Neptune Serverless:**
- If dependency visualization becomes a core feature with >100 daily queries
- If customers have >5,000 services with deep dependency chains (>10 levels)
- If graph traversal queries (shortest path, impact radius) become latency-sensitive
- Neptune Serverless minimum cost: ~$0.12/NCU-hour × 2.5 NCU minimum = ~$220/month. Only justified when dependency features drive measurable retention.
### 3.4 Multi-Tenant Data Isolation
**Strategy: Shared database, tenant_id column, application-level enforcement.**
Why not database-per-tenant:
- Aurora Serverless v2 charges per ACU. One database with 50 tenants is cheaper than 50 databases.
- Schema migrations are applied once, not 50 times.
- Cross-tenant analytics (anonymized, for product metrics) are simple queries.
**Enforcement layers:**
| Layer | Mechanism |
|-------|-----------|
| **API middleware** | Every authenticated request extracts `tenant_id` from JWT. Injected into every database query. No query can omit `tenant_id`. |
| **PostgreSQL RLS (Row-Level Security)** | Backup enforcement. Even if application code has a bug, RLS prevents cross-tenant data access. |
| **Meilisearch filter** | Every search query includes mandatory `tenant_id` filter. Enforced at API layer. |
| **S3 prefix** | Discovery snapshots stored at `s3://dd0c-data/{tenant_id}/snapshots/`. IAM policy scopes Lambda access to tenant prefix during discovery. |
| **Logging** | All API logs include `tenant_id`. Anomaly detection: alert if a single request touches multiple tenant_ids. |
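The Meilisearch row in the table above can be sketched as a builder the API layer routes every search through, so user-supplied filters can never drop the tenant clause. The filter grammar is standard Meilisearch; the function itself is a hypothetical sketch:

```typescript
// Sketch: force every search request through this builder, which prepends a
// tenant_id clause that user-supplied filters cannot remove.
function buildSearchFilter(tenantId: string, userFilter?: string): string {
  // Quote and escape the tenant id for the Meilisearch filter grammar.
  const tenantClause = `tenant_id = "${tenantId.replace(/"/g, '\\"')}"`;
  return userFilter ? `(${tenantClause}) AND (${userFilter})` : tenantClause;
}
```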
**PostgreSQL RLS implementation:**
```sql
ALTER TABLE services ENABLE ROW LEVEL SECURITY;
-- FORCE is required: without it, RLS does not apply to the table owner,
-- so a connection using the owning role would silently bypass the policy.
ALTER TABLE services FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON services
USING (tenant_id = current_setting('app.current_tenant_id')::UUID);
-- Set per-request in API middleware, inside the request's transaction:
-- SET LOCAL app.current_tenant_id = 'tenant-uuid-here';
```
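Because `SET LOCAL` does not accept bind parameters, the middleware has to interpolate the tenant id into the statement. A minimal sketch (the helper name is hypothetical) validates it as a UUID first so nothing injectable can reach the session variable:

```typescript
// Sketch: build the per-request statement that scopes the connection to one
// tenant. The tenant id is validated as a UUID before interpolation.
// (A parameterized alternative: SELECT set_config('app.current_tenant_id', $1, true).)
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function tenantContextStatement(tenantId: string): string {
  if (!UUID_RE.test(tenantId)) {
    throw new Error("invalid tenant id: refusing to set tenant context");
  }
  // Runs inside the request's transaction; resets automatically on COMMIT/ROLLBACK.
  return `SET LOCAL app.current_tenant_id = '${tenantId}'`;
}
```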
### 3.5 Sync/Refresh Strategy
| Event | Trigger | Scope | Latency |
|-------|---------|-------|---------|
| **Initial discovery** | User completes onboarding | Full scan: all AWS resource types + all GitHub repos | <120 seconds |
| **Scheduled refresh** | EventBridge cron (default: every 6h) | Incremental: CloudFormation events since last scan, GitHub repos with pushes since last scan | <60 seconds |
| **GitHub webhook** | Push to CODEOWNERS, README, or deploy workflow | Single repo: re-extract metadata, re-infer ownership | <10 seconds |
| **CloudFormation event** | Stack create/update/delete (via EventBridge rule in customer account) | Single stack: update associated service | <10 seconds |
| **User correction** | User clicks "Correct" in UI | Single service + propagation to similar services | <5 seconds |
| **Manual full rescan** | User clicks "Rescan" in settings | Full scan (same as initial) | <120 seconds |
**Incremental scan optimization:**
For scheduled refreshes, avoid re-scanning everything:
1. **AWS:** Use CloudTrail events (if available) or compare CloudFormation stack `LastUpdatedTime` to skip unchanged stacks. For ECS/Lambda, compare resource tags and configuration hashes.
2. **GitHub:** Use the GitHub Events API or webhook payloads to identify repos with changes since last scan. Only re-scan changed repos.
3. **Result:** Incremental scans touch 5-15% of resources, completing in <30 seconds instead of 120.
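The "configuration hashes" in step 1 can be computed over a canonical JSON encoding so that key order never causes a spurious rescan. A sketch:

```typescript
import { createHash } from "node:crypto";

// Sketch: hash a resource's configuration canonically (object keys sorted at
// every depth) so two scans of an unchanged resource yield the same digest.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>)
        .sort(([a], [b]) => (a < b ? -1 : 1))
        .map(([k, v]) => [k, canonicalize(v)]),
    );
  }
  return value;
}

function configHash(config: object): string {
  return createHash("sha256").update(JSON.stringify(canonicalize(config))).digest("hex");
}
```

If the stored hash for a resource matches the freshly computed one, the scanner skips it; only changed resources flow into reconciliation.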
**Staleness detection:**
If a service hasn't been seen in 3 consecutive full scans:
- Mark as `lifecycle: deprecated` with a note "Not found in recent discovery scans"
- Surface in dashboard: "3 services may have been removed from your infrastructure"
- After 5 consecutive misses: mark as `lifecycle: eol`, remove from default search results (still accessible via filter)
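Assuming the reconciler keeps a per-service counter of consecutive full scans that missed the service (a counter not shown in the schema above, so an assumption), the transition rule reduces to:

```typescript
// Sketch of the staleness rule: 3 consecutive missed full scans demote a
// service to "deprecated", 5 demote it to "eol". A sighting resets the
// counter and the reconciler restores the prior lifecycle.
function lifecycleAfterMisses(current: string, consecutiveMisses: number): string {
  if (consecutiveMisses >= 5) return "eol";
  if (consecutiveMisses >= 3) return "deprecated";
  return current;
}
```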
---
## 4. INFRASTRUCTURE
### 4.1 AWS Architecture
```mermaid
graph TB
subgraph "us-east-1 — Primary Region"
subgraph "Public Subnet"
CF["CloudFront Distribution<br/>SPA + API Cache"]
ALB["Application Load Balancer<br/>+ WAF v2"]
end
subgraph "Private Subnet — App Tier"
ECS_API["ECS Fargate<br/>Portal API<br/>(min: 1, max: 10 tasks)<br/>0.5 vCPU / 1GB RAM"]
ECS_MEILI["ECS Fargate<br/>Meilisearch<br/>(1 task)<br/>0.5 vCPU / 1GB RAM<br/>+ EFS volume"]
end
subgraph "Private Subnet — Compute"
SF["Step Functions<br/>Discovery Orchestrator"]
L_AWS["Lambda — AWS Scanner<br/>Python 3.12<br/>512MB / 5min timeout"]
L_GH["Lambda — GitHub Scanner<br/>Node.js 20<br/>512MB / 5min timeout"]
L_REC["Lambda — Reconciler<br/>Node.js 20<br/>1GB / 5min timeout"]
L_INF["Lambda — Inference<br/>Python 3.12<br/>512MB / 2min timeout"]
L_SLACK["Lambda — Slack Bot<br/>Node.js 20<br/>256MB / 30s timeout"]
L_WEBHOOK["Lambda — Webhook Processor<br/>Node.js 20<br/>256MB / 30s timeout"]
end
subgraph "Private Subnet — Data Tier"
AURORA["Aurora Serverless v2<br/>PostgreSQL 15<br/>0.5-8 ACU<br/>Multi-AZ"]
REDIS_C["ElastiCache Redis<br/>Serverless<br/>1-5 ECPUs"]
end
subgraph "Storage & Messaging"
S3_SPA["S3 — SPA Assets"]
S3_DATA["S3 — Discovery Snapshots<br/>+ Exports"]
SQS_DISC["SQS FIFO<br/>Discovery Events"]
SQS_INDEX["SQS Standard<br/>Search Index Updates"]
EB["EventBridge<br/>Scheduled Discovery<br/>+ Webhook Routing"]
end
subgraph "Security & Observability"
KMS["KMS — Encryption Keys<br/>(credentials, PII)"]
SM["Secrets Manager<br/>GitHub tokens, PD keys"]
CW["CloudWatch<br/>Logs + Metrics + Alarms"]
XRAY["X-Ray<br/>Distributed Tracing"]
end
subgraph "API Management"
APIGW["API Gateway v2<br/>WebSocket API<br/>(discovery progress)"]
end
end
CF --> S3_SPA
CF --> ALB
ALB --> ECS_API
ECS_API --> AURORA
ECS_API --> REDIS_C
ECS_API --> ECS_MEILI
ECS_API --> SQS_INDEX
SQS_INDEX --> L_WEBHOOK
L_WEBHOOK --> ECS_MEILI
SF --> L_AWS & L_GH
L_AWS --> SQS_DISC
L_GH --> SQS_DISC
SQS_DISC --> L_REC
L_REC --> L_INF
L_INF --> AURORA
L_INF --> SQS_INDEX
EB --> SF
APIGW --> ECS_API
ECS_MEILI --> ECS_MEILI_EFS["EFS Volume<br/>(Meilisearch data persistence)"]
```
### 4.2 Customer-Side: Read-Only IAM Role
The customer deploys a single CloudFormation template provided by dd0c. This is the only thing the customer installs.
**CloudFormation Template (provided to customer):**
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: dd0c/portal read-only discovery role
Parameters:
ExternalId:
Type: String
Description: Unique identifier provided by dd0c during onboarding
NoEcho: true
Dd0cAccountId:
Type: String
Default: '123456789012' # dd0c platform AWS account
Description: dd0c platform account ID
Resources:
Dd0cDiscoveryRole:
Type: AWS::IAM::Role
Properties:
RoleName: dd0c-portal-discovery
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
AWS: !Sub 'arn:aws:iam::${Dd0cAccountId}:role/dd0c-scanner-role'
Action: sts:AssumeRole
Condition:
StringEquals:
sts:ExternalId: !Ref ExternalId
ManagedPolicyArns: [] # No managed policies — custom policy only
Policies:
- PolicyName: dd0c-discovery-readonly
PolicyDocument:
Version: '2012-10-17'
Statement:
# CloudFormation — read stacks and resources
- Effect: Allow
Action:
- cloudformation:ListStacks
- cloudformation:DescribeStacks
- cloudformation:ListStackResources
- cloudformation:GetTemplate
Resource: '*'
# ECS — read clusters, services, task definitions
- Effect: Allow
Action:
- ecs:ListClusters
- ecs:ListServices
- ecs:DescribeServices
- ecs:DescribeClusters
- ecs:DescribeTaskDefinition
- ecs:ListTaskDefinitions
Resource: '*'
# Lambda — read functions and event sources
- Effect: Allow
Action:
- lambda:ListFunctions
- lambda:GetFunction
- lambda:ListEventSourceMappings
- lambda:ListTags
Resource: '*'
# API Gateway — read APIs and resources
- Effect: Allow
Action:
- apigateway:GET
Resource: '*'
# RDS — read instances
- Effect: Allow
Action:
- rds:DescribeDBInstances
- rds:DescribeDBClusters
- rds:ListTagsForResource
Resource: '*'
# Resource Groups — read tags
- Effect: Allow
Action:
- tag:GetResources
- tag:GetTagKeys
- tag:GetTagValues
Resource: '*'
# CloudWatch — read alarm states for health
- Effect: Allow
Action:
- cloudwatch:DescribeAlarms
- cloudwatch:GetMetricData
Resource: '*'
# STS — for identity verification
- Effect: Allow
Action:
- sts:GetCallerIdentity
Resource: '*'
# EXPLICIT DENIES — defense in depth
- Effect: Deny
Action:
- iam:*
- s3:GetObject
- s3:PutObject
- secretsmanager:GetSecretValue
- ssm:GetParameter*
- kms:Decrypt
- logs:GetLogEvents
- ec2:GetConsoleOutput
- ec2:GetPasswordData
Resource: '*'
Outputs:
RoleArn:
Value: !GetAtt Dd0cDiscoveryRole.Arn
Description: Provide this ARN to dd0c during onboarding
```
**Key security decisions:**
- Explicit deny on IAM, S3 object access, Secrets Manager, SSM Parameter Store, KMS, and CloudWatch Logs. Even if the allow list later broadens, or a customer attaches a managed policy by mistake, these denies keep sensitive data out of reach.
- No `ReadOnlyAccess` managed policy — too broad. Custom policy scoped to exactly the services dd0c needs.
- ExternalId prevents confused deputy attacks.
- Role name is fixed (`dd0c-portal-discovery`) so customers can audit it easily.
### 4.3 GitHub/GitLab Integration
**V1: GitHub App (preferred over OAuth for org-level access)**
| Permission | Access | Justification |
|-----------|--------|---------------|
| Repository contents | Read | CODEOWNERS, README, workflow files |
| Repository metadata | Read | Repo name, description, language, topics |
| Organization members | Read | Team membership for ownership inference |
| Organization administration | Read | Team structure |
**No write permissions. No webhook creation (V1). No code push. No issue creation.**
The GitHub App is installed at the org level. The customer clicks "Install" on the GitHub Marketplace listing, selects their org, and grants read-only access. The installation ID is stored in the `connections` table.
**GitLab (V2):** GitLab Group Access Token with `read_api` scope. Same pattern — read-only, scoped to the group, no write access.
### 4.4 Cost Estimates
All costs in USD/month. Assumes us-east-1 pricing as of 2026.
#### 50 Services (10-20 engineers, Free/Team tier, ~5 tenants)
| Service | Configuration | Monthly Cost |
|---------|--------------|-------------|
| Aurora Serverless v2 | 0.5 ACU min (mostly idle) | $43 |
| ElastiCache Redis Serverless | Minimal ECPU usage | $15 |
| ECS Fargate — API | 1 task, 0.5 vCPU, 1GB | $18 |
| ECS Fargate — Meilisearch | 1 task, 0.5 vCPU, 1GB + EFS | $20 |
| Lambda (all functions) | ~50K invocations/month | $2 |
| Step Functions | ~150 state transitions/month | $1 |
| SQS | ~10K messages/month | $1 |
| S3 | <1GB storage | $1 |
| CloudFront | <10GB transfer | $2 |
| ALB | 1 ALB, minimal LCUs | $18 |
| KMS | 2 keys | $2 |
| Secrets Manager | 5 secrets | $3 |
| CloudWatch | Logs + basic metrics | $10 |
| Route 53 | 1 hosted zone | $1 |
| **Total** | | **~$137/month** |
**Revenue at 50 services (5 tenants × 10 eng × $10):** $500/month → **73% gross margin**
#### 200 Services (50-100 engineers, ~15 tenants)
| Service | Configuration | Monthly Cost |
|---------|--------------|-------------|
| Aurora Serverless v2 | 0.5-2 ACU (scales with queries) | $90 |
| ElastiCache Redis Serverless | Moderate ECPU | $30 |
| ECS Fargate — API | 2 tasks (auto-scaling) | $36 |
| ECS Fargate — Meilisearch | 1 task, 1 vCPU, 2GB | $36 |
| Lambda | ~500K invocations/month | $10 |
| Step Functions | ~1,500 transitions/month | $5 |
| SQS | ~100K messages/month | $2 |
| S3 | ~5GB | $2 |
| CloudFront | ~50GB transfer | $8 |
| ALB | Moderate LCUs | $25 |
| Observability (CW + X-Ray) | | $30 |
| Other (KMS, SM, R53) | | $10 |
| **Total** | | **~$284/month** |
**Revenue at 200 services (15 tenants × ~5 eng avg × $10):** ~$750/month → **62% gross margin**
*Note: conservative — many tenants will have 20-50 engineers, pushing revenue to $2-5K/month*
#### 1,000 Services (200-500 engineers, ~50 tenants)
| Service | Configuration | Monthly Cost |
|---------|--------------|-------------|
| Aurora Serverless v2 | 2-8 ACU | $350 |
| ElastiCache Redis Serverless | Higher ECPU | $80 |
| ECS Fargate — API | 3-5 tasks | $90 |
| ECS Fargate — Meilisearch | 1 task, 2 vCPU, 4GB | $72 |
| Lambda | ~5M invocations/month | $50 |
| Step Functions | ~15K transitions/month | $20 |
| SQS + EventBridge | | $10 |
| S3 | ~50GB | $5 |
| CloudFront | ~200GB transfer | $25 |
| ALB | Higher LCUs | $40 |
| Observability | | $80 |
| WAF | | $10 |
| Other | | $20 |
| **Total** | | **~$852/month** |
**Revenue at 1,000 services (50 tenants × ~7 eng avg × $10):** ~$3,500/month → **76% gross margin**
*At scale, Aurora and Fargate efficiency improves. Gross margin stays healthy.*
### 4.5 Scaling Strategy
**Phase 1 (0-50 tenants): Single-region, minimal resources**
- Aurora Serverless v2 scales ACUs automatically
- ECS API auto-scales 1-3 tasks based on CPU/request count
- Meilisearch single instance (handles 100K+ documents easily)
- No read replicas, no multi-region
**Phase 2 (50-200 tenants): Optimize hot paths**
- Add Aurora read replica for search/dashboard queries (write to primary, read from replica)
- Redis cluster mode for session/cache scaling
- Meilisearch: evaluate moving to dedicated EC2 instance for cost efficiency at sustained load
- Add CloudFront caching for API responses (service cards change infrequently — 60s TTL)
**Phase 3 (200+ tenants): Multi-region consideration**
- Evaluate us-west-2 deployment for latency (EU customers → eu-west-1)
- Aurora Global Database for cross-region reads
- CloudFront + Lambda@Edge for API routing
- This is a $100K+ MRR problem — don't solve it prematurely
### 4.6 CI/CD Pipeline
```mermaid
graph LR
DEV["Developer Push<br/>(GitHub)"] --> GHA["GitHub Actions"]
subgraph "CI Pipeline"
GHA --> LINT["Lint + Type Check"]
LINT --> TEST["Unit Tests<br/>+ Integration Tests"]
TEST --> BUILD["Docker Build<br/>(API + Meilisearch config)"]
BUILD --> SCAN["Trivy Container Scan"]
SCAN --> ECR["Push to ECR"]
end
subgraph "CD Pipeline"
ECR --> STAGING["Deploy to Staging<br/>(ECS rolling update)"]
STAGING --> SMOKE["Smoke Tests<br/>(discovery accuracy suite)"]
SMOKE --> APPROVE["Manual Approval<br/>(solo founder reviews)"]
APPROVE --> PROD["Deploy to Production<br/>(ECS rolling update<br/>+ Lambda version publish)"]
PROD --> CANARY["Canary Check<br/>(5 min health check)"]
CANARY --> DONE["✅ Deploy Complete"]
end
```
**Key CI/CD decisions:**
- GitHub Actions (not CodePipeline) — simpler, cheaper, Brian already knows it
- Docker multi-stage builds for API (Node.js) and scanner Lambdas (Python/Node.js)
- Staging environment: minimal Aurora (0.5 ACU) + single Fargate task. Cost: ~$60/month
- Discovery accuracy regression suite: run discovery against a known test AWS account + GitHub org, assert >80% accuracy. If accuracy drops, block deploy.
- Lambda deployments via SAM/CDK with versioning and aliases for instant rollback
- Database migrations via Prisma Migrate (or raw SQL migrations) — run in CI before ECS deploy
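The accuracy regression gate can be sketched as a comparison of a discovery run against a golden fixture for the test account. The function names, fixture shape, and the 0.8 threshold mirror the bullet above but are otherwise illustrative:

```typescript
// Sketch: CI gate comparing a discovery run against a known-good fixture.
// Accuracy = services whose (name, owner) both match / total expected.
interface ExpectedService { name: string; owner: string; }

function discoveryAccuracy(
  expected: ExpectedService[],
  discovered: Map<string, string>, // service name -> inferred owner
): number {
  const hits = expected.filter((e) => discovered.get(e.name) === e.owner).length;
  return hits / expected.length;
}

function gate(accuracy: number, threshold = 0.8): void {
  if (accuracy < threshold) {
    throw new Error(`discovery accuracy ${accuracy.toFixed(2)} below ${threshold}: blocking deploy`);
  }
}
```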
---
## 5. SECURITY
### 5.1 IAM Role Design for Customer AWS Accounts
The trust model is the single most sensitive aspect of dd0c/portal. Customers are granting a third-party SaaS read access to their infrastructure topology. This must be treated with the gravity it deserves.
**Principle: Minimum viable access, maximum transparency.**
#### Role Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ CUSTOMER AWS ACCOUNT │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ dd0c-portal-discovery (IAM Role) │ │
│ │ │ │
│ │ Trust: dd0c platform account + ExternalId │ │
│ │ │ │
│ │ ALLOW: │ │
│ │ cloudformation:List*, Describe* │ │
│ │ ecs:List*, Describe* │ │
│ │ lambda:List*, Get* (config only) │ │
│ │ apigateway:GET │ │
│ │ rds:Describe*, ListTags* │ │
│ │ tag:Get* │ │
│ │ cloudwatch:DescribeAlarms, GetMetricData │ │
│ │ │ │
│ │ EXPLICIT DENY: │ │
│ │ iam:* │ │
│ │ s3:GetObject, PutObject (no data access) │ │
│ │ secretsmanager:GetSecretValue │ │
│ │ ssm:GetParameter* │ │
│ │ kms:Decrypt │ │
│ │ logs:GetLogEvents (no application logs) │ │
│ │ ec2:GetConsoleOutput, GetPasswordData │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ dd0c CANNOT: │
│ ✗ Read S3 objects (no customer data) │
│ ✗ Read secrets or parameters │
│ ✗ Read application logs │
│ ✗ Modify any resource │
│ ✗ Create/delete/update anything │
│ ✗ Access IAM users, roles, or policies │
│ ✗ Decrypt any KMS-encrypted data │
└─────────────────────────────────────────────────────────────┘
```
**Confused deputy prevention:**
- Every tenant gets a unique `ExternalId` (UUID v4) generated at onboarding
- The customer's trust policy requires this ExternalId in the `sts:AssumeRole` condition
- dd0c's scanner Lambda passes the tenant-specific ExternalId when assuming the role
- Without the correct ExternalId, AssumeRole fails — even if an attacker knows the role ARN
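The scanner's side of the handshake might look like this sketch, shaped like the input to the AWS SDK v3 `AssumeRoleCommand`; the connection-record fields are assumptions:

```typescript
// Sketch: build the STS AssumeRole input for one tenant's connection.
// Without the tenant-specific ExternalId, the customer's trust policy
// rejects the call even if the role ARN leaks.
interface AwsConnection {
  roleArn: string;     // arn:aws:iam::<customer-account>:role/dd0c-portal-discovery
  externalId: string;  // UUID v4 generated at onboarding
}

function assumeRoleInput(tenantId: string, conn: AwsConnection) {
  return {
    RoleArn: conn.roleArn,
    ExternalId: conn.externalId,           // confused-deputy guard
    RoleSessionName: `dd0c-scan-${tenantId}`.slice(0, 64),
    DurationSeconds: 3600,                 // temporary credentials only
  };
}
```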
**Credential rotation:**
- The cross-account role uses temporary STS credentials (1-hour expiry by default)
- No long-lived access keys are stored
- The ExternalId can be rotated by the customer at any time (update CFN stack + update dd0c connection settings)
**Audit trail:**
- Every AssumeRole call appears in the customer's CloudTrail
- dd0c provides a "Discovery Activity Log" in the UI showing exactly which API calls were made, when, and what was returned (metadata only, not full responses)
- Customers can verify dd0c's access patterns against their own CloudTrail
### 5.2 GitHub/GitLab Token Scoping
**GitHub App permissions (V1):**
| Permission | Level | Justification | What it CANNOT do |
|-----------|-------|---------------|-------------------|
| Contents | Read | Read CODEOWNERS, README, workflow files | Cannot push code, create branches, or modify files |
| Metadata | Read | Repo name, description, language, topics | Cannot modify repo settings |
| Members | Read | Org member list for ownership inference | Cannot invite/remove members |
| Administration | Read | Team structure and membership | Cannot create/modify teams |
**What the GitHub App explicitly cannot do:**
- ✗ Push code or create pull requests
- ✗ Create/close issues
- ✗ Modify repository settings
- ✗ Access private repository secrets
- ✗ Trigger or modify GitHub Actions workflows
- ✗ Access GitHub Packages or Container Registry
- ✗ Create webhooks (V1 — webhooks added in V1.1 with explicit customer consent)
**Token storage:**
- GitHub App installation tokens are short-lived (1 hour)
- The GitHub App private key is stored in AWS Secrets Manager (KMS-encrypted)
- Installation tokens are generated on-demand per scan, never persisted
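Installation tokens are minted with a short-lived App JWT, and GitHub caps the JWT lifetime at 10 minutes after `iat`. A sketch of the claims (signing with the private key from Secrets Manager and the installation-token exchange call are omitted):

```typescript
// Sketch: claims for the GitHub App JWT used to mint an installation token.
// iat is backdated 60s to tolerate clock drift; exp stays under GitHub's
// 10-minute cap.
function appJwtClaims(appId: string, nowSeconds: number) {
  return {
    iss: appId,              // the GitHub App ID
    iat: nowSeconds - 60,
    exp: nowSeconds + 9 * 60,
  };
}
```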
### 5.3 Service Catalog Data Sensitivity
The service catalog contains infrastructure topology — a map of what services exist, who owns them, how they're connected, and what technology they use. This is sensitive data.
**Threat model:**
| Threat | Impact | Mitigation |
|--------|--------|------------|
| **Catalog data breach** | Attacker learns customer's service topology, tech stack, and team structure. Enables targeted attacks. | Encryption at rest (Aurora + S3 + KMS). Encryption in transit (TLS 1.3 everywhere). Multi-tenant RLS. |
| **Cross-tenant data leak** | Tenant A sees Tenant B's services. Reputational catastrophe. | PostgreSQL RLS + application-level tenant_id enforcement + automated cross-tenant access tests in CI. |
| **Insider threat (dd0c employee)** | Solo founder has access to all tenant data. | Audit logging on all database access. Principle: Brian should never need to query customer data directly. Build admin tools that log every access. |
| **Supply chain attack** | Compromised npm/pip dependency exfiltrates catalog data. | Dependabot + Snyk. Pin dependency versions. Minimal dependency tree. Lambda functions have no outbound internet access except to AWS APIs (VPC + NAT gateway scoped to AWS endpoints). |
| **Customer credential compromise** | Attacker steals the cross-account IAM role ARN + ExternalId. | ExternalId is a UUID — not guessable. Role trust policy limits to dd0c's specific account. Even with both, attacker only gets read-only access to infrastructure metadata (not data). |
**Data classification:**
| Data Type | Classification | Storage | Retention |
|-----------|---------------|---------|-----------|
| Service names, descriptions | Internal | Aurora (encrypted) | Active tenant lifetime |
| Team names, members | Internal | Aurora (encrypted) | Active tenant lifetime |
| AWS resource ARNs | Confidential | Aurora (encrypted) | Active tenant lifetime |
| GitHub repo URLs | Internal | Aurora (encrypted) | Active tenant lifetime |
| Discovery event logs | Internal | Aurora → S3 archive | 90 days hot, 1 year archive |
| IAM role ARNs + ExternalIds | Confidential | Secrets Manager (KMS) | Active connection lifetime |
| GitHub App private key | Secret | Secrets Manager (KMS) | Rotated annually |
| User sessions | Internal | Redis (encrypted in transit) | 24-hour TTL |
### 5.4 SOC 2 Considerations
dd0c/portal will need SOC 2 Type II certification by the time Business tier launches (Month 12+). Engineering directors buying at $25/engineer need compliance evidence.
**SOC 2 Trust Service Criteria — Architecture Alignment:**
| Criteria | Requirement | dd0c Architecture |
|----------|------------|-------------------|
| **CC6.1 — Logical Access** | Restrict access to authorized users | GitHub OAuth SSO. Tenant isolation via RLS. No shared accounts. |
| **CC6.3 — Encryption** | Encrypt data in transit and at rest | TLS 1.3 (ALB, CloudFront). Aurora encryption (AES-256). S3 SSE-KMS. Redis in-transit encryption. |
| **CC6.6 — System Boundaries** | Define and protect system boundaries | VPC with private subnets. Security groups restrict inter-service communication. WAF on ALB. |
| **CC7.1 — Monitoring** | Detect anomalies and security events | CloudWatch alarms. CloudTrail for API access. GuardDuty for threat detection. |
| **CC7.2 — Incident Response** | Respond to security incidents | PagerDuty alerting for dd0c's own infrastructure. Incident response runbook (documented). |
| **CC8.1 — Change Management** | Control changes to infrastructure and code | GitHub PRs with required reviews (when team grows). CI/CD pipeline with staging. |
| **A1.2 — Availability** | Maintain system availability | Aurora Multi-AZ. ECS multi-task. CloudFront edge caching. Health checks + auto-recovery. |
**Pre-SOC 2 actions (build into V1):**
- Enable CloudTrail in dd0c's AWS account (all API calls logged)
- Enable GuardDuty (threat detection)
- Enable AWS Config (configuration compliance)
- Implement audit logging in the application (who accessed what, when)
- Document data retention and deletion policies
- Build a "delete my data" endpoint (GDPR + SOC 2 requirement)
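The application audit logging called for above can start as a thin decorator that every data-access path passes through. A minimal sketch, assuming an append-only log store; function and field names are illustrative, not dd0c's actual schema:

```python
import functools
import json
import time
from typing import Callable

AUDIT_LOG: list[dict] = []  # stand-in for an append-only store (e.g., an Aurora table)

def audited(action: str) -> Callable:
    """Record who accessed what, when, before the data access runs."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(actor_id: str, tenant_id: str, *args, **kwargs):
            AUDIT_LOG.append({
                "ts": time.time(),
                "actor": actor_id,
                "tenant": tenant_id,
                "action": action,
                "args": json.dumps(args, default=str),
            })
            return fn(actor_id, tenant_id, *args, **kwargs)
        return wrapper
    return decorator

@audited("service.read")
def get_service(actor_id: str, tenant_id: str, service_id: str) -> dict:
    return {"id": service_id, "tenant": tenant_id}  # placeholder query

get_service("admin_brian", "tenant_acme", "svc_123")
```

This directly supports the "Brian should never query customer data without a logged trail" principle from Section 5: admin tooling goes through the same decorator.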
### 5.5 The Trust Model
This is the hardest sell in the product. Customers are giving dd0c read access to their infrastructure graph. The architecture must make this trust decision as easy as possible.
**Trust-building mechanisms:**
1. **Transparency:** The CloudFormation template is public and auditable. Customers can read every IAM permission before deploying. No hidden access.
2. **Customer-controlled revocation:** Delete the CloudFormation stack → dd0c loses all access instantly. No "please contact support to revoke." The customer is always in control.
3. **Minimal blast radius:** Even if dd0c is fully compromised, the attacker gets read-only access to infrastructure metadata (service names, resource ARNs, team names). They do NOT get application data, secrets, logs, or write access. The worst case is an attacker learning "Acme Corp has a payment-gateway service running on ECS in us-east-1." Sensitive, but not catastrophic.
4. **Open-source discovery agent (V1.1):** Open-source the AWS scanner and GitHub scanner code. Customers can audit exactly what API calls are made and what data is collected. This is the strongest trust signal possible.
5. **Data residency:** All customer data stored in the same AWS region as the customer's primary infrastructure (us-east-1 by default, eu-west-1 for EU customers at Business tier). No cross-region data transfer without explicit consent.
6. **Deletion guarantee:** When a customer disconnects or churns, all their data (services, teams, discovery logs, corrections) is hard-deleted within 30 days. S3 objects are deleted. Meilisearch index entries are removed. The tenant's rows in database backups are excluded from any future restore.
**Trust comparison with competitors:**
| Aspect | Backstage | Port/Cortex | dd0c/portal |
|--------|-----------|-------------|-------------|
| Data location | Self-hosted (customer controls) | Vendor SaaS | Vendor SaaS |
| AWS access | N/A (manual YAML) | Similar IAM role | Read-only IAM role |
| Code auditability | Open source | Closed source | Closed source (scanners open-sourced V1.1) |
| Revocation | N/A | Contact support | Delete CFN stack (instant) |
| Blast radius | N/A | Read + sometimes write | Read-only, explicit denies |
dd0c's trust model is weaker than self-hosted Backstage (customer controls everything) but stronger than Port/Cortex (more transparent, easier revocation, smaller blast radius). The open-source scanner in V1.1 closes the gap significantly.
---
## 6. MVP SCOPE
### 6.1 V1 Technical Scope — "The 5-Minute Miracle"
V1 ships exactly four capabilities. Nothing else.
```
┌─────────────────────────────────────────────────────────────┐
│ V1 SCOPE │
│ │
│ ✅ IN SCOPE ❌ OUT OF SCOPE │
│ ───────────── ────────────── │
│ AWS auto-discovery AI agent ("Ask Your │
│ - CloudFormation Infra") │
│ - ECS GitLab support │
│ - Lambda Dependency visualization │
│ - API Gateway Scorecards / maturity │
│ - RDS Kubernetes discovery │
│ - Resource tags Terraform state parsing │
│ Custom plugins │
│ GitHub org scanning Advanced RBAC │
│ - Repos + languages SSO (Okta/Azure AD) │
│ - CODEOWNERS Self-hosted option │
│ - README extraction Multi-cloud (GCP/Azure) │
│ - Team memberships Compliance reports │
│ - Actions workflows Software templates │
│ Change feed │
│ Service catalog UI Zombie service detection │
│ - Service cards Cost anomaly per service │
│ - Cmd+K search (<200ms) │
│ - Team directory │
│ - Correction UI │
│ - Confidence scores │
│ │
│ Integrations │
│ - PagerDuty/OpsGenie (on-call) │
│ - Slack bot (/dd0c who owns) │
│ - GitHub OAuth (auth) │
│ - Stripe (billing) │
└─────────────────────────────────────────────────────────────┘
```
### 6.2 What's Deferred to V2
| Feature | V2 Target | Dependency |
|---------|-----------|------------|
| AI Agent ("Ask Your Infra") | Month 7-9 | Requires stable catalog data + LLM integration |
| GitLab support | Month 8-10 | Separate scanner Lambda, GitLab Group Access Token flow |
| Dependency visualization | Month 5-7 (V1.1) | Requires VPC flow log analysis or API Gateway integration mapping |
| Scorecards | Month 5-6 (V1.1) | Requires stable service entity model + enough metadata signals |
| dd0c/cost integration | Month 7-9 | Requires dd0c/cost to be live with per-service attribution |
| dd0c/alert integration | Month 7-9 | Requires dd0c/alert to be live with service-level routing |
| Backstage YAML importer | Month 5-6 (V1.1) | Low effort, high acquisition value for Backstage refugees |
| Change feed | Month 10-12 | Requires discovery event log to be stable + diff computation |
| Advanced RBAC | Month 12+ (Business tier) | Requires team-level permission model |
| SSO (Okta/Azure AD) | Month 12+ (Business tier) | Requires SAML/OIDC integration |
### 6.3 The 5-Minute Onboarding Flow — Technical Implementation
```mermaid
sequenceDiagram
participant U as Engineer
participant SPA as Portal SPA
participant API as Portal API
participant GH as GitHub OAuth
participant STRIPE as Stripe
participant AWS_CFN as Customer AWS<br/>(CloudFormation)
participant SF as Step Functions
Note over U,SF: Step 1: Sign Up (30 seconds)
U->>SPA: Click "Sign up with GitHub"
SPA->>GH: OAuth redirect (scope: read:org, read:user)
GH->>SPA: Authorization code
SPA->>API: POST /auth/github {code}
API->>GH: Exchange code for token
API->>API: Create tenant + user
API->>SPA: JWT + redirect to onboarding wizard
Note over U,SF: Step 2: Select Plan (30 seconds)
SPA->>U: "Free (≤10 eng) or Team ($10/eng)?"
U->>SPA: Select Team
SPA->>STRIPE: Stripe Checkout Session
STRIPE->>U: Enter credit card
STRIPE->>API: Webhook: checkout.session.completed
API->>API: Activate subscription
Note over U,SF: Step 3: Connect AWS (90 seconds)
SPA->>U: "Deploy this CloudFormation template"
Note right of SPA: One-click link:<br/>https://console.aws.amazon.com/cloudformation/<br/>home#/stacks/create/review?<br/>templateURL=https://dd0c-public.s3.amazonaws.com/<br/>cfn/dd0c-discovery-role.yaml&<br/>param_ExternalId={{generated-uuid}}
U->>AWS_CFN: Click "Create Stack" in AWS Console
AWS_CFN->>AWS_CFN: Create IAM role (~60 seconds)
U->>SPA: Paste Role ARN
SPA->>API: POST /connections/aws {roleArn, externalId}
API->>API: sts:AssumeRole (validate access)
API->>SPA: ✅ "AWS connected"
Note over U,SF: Step 4: Connect GitHub (already done — OAuth grants org access)
API->>API: List orgs from GitHub token
SPA->>U: "Select your GitHub org"
U->>SPA: Select org
SPA->>API: POST /connections/github {orgLogin}
API->>SPA: ✅ "GitHub connected"
Note over U,SF: Step 5: Auto-Discovery (90 seconds)
API->>SF: StartExecution {tenantId}
SF-->>SPA: WebSocket: "Scanning AWS resources..."
SF-->>SPA: WebSocket: "Found 234 AWS resources..."
SF-->>SPA: WebSocket: "Scanning 89 GitHub repos..."
SF-->>SPA: WebSocket: "Reconciling services..."
SF-->>SPA: WebSocket: "Inferring ownership..."
SF-->>SPA: WebSocket: "✅ Discovered 147 services"
SPA->>U: Redirect to catalog view
Note over U,SF: Total elapsed: ~4 minutes
```
**Critical UX decisions:**
- The CloudFormation template link is pre-populated with the ExternalId. The customer clicks one link, lands in the AWS Console with the template loaded, and clicks "Create Stack." Three clicks total.
- GitHub org access is granted during OAuth signup — no separate connection step. The onboarding wizard just asks "which org?" from the list of orgs the user belongs to.
- WebSocket progress updates during discovery create the "Holy Shit" moment. Watching services appear in real-time (47... 89... 120... 147) is the emotional hook that drives screenshots and sharing.
- If discovery takes >120 seconds, show a "This is taking longer than usual" message with a progress bar. Never show a spinner with no context.
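The pre-populated one-click link from Step 3 can be assembled server-side when the tenant is created. A sketch following the URL format shown in the sequence diagram; the bucket and parameter names come from that diagram, and the exact quick-create query parameters should be verified against AWS's documented link format before shipping:

```python
import uuid
from urllib.parse import urlencode

# Public template location, per the onboarding diagram.
TEMPLATE_URL = "https://dd0c-public.s3.amazonaws.com/cfn/dd0c-discovery-role.yaml"

def build_cfn_link(external_id: str) -> str:
    """One-click CloudFormation 'Create Stack' link with the tenant's
    ExternalId pre-filled as a template parameter."""
    query = urlencode({
        "templateURL": TEMPLATE_URL,
        "param_ExternalId": external_id,
    })
    return (
        "https://console.aws.amazon.com/cloudformation/"
        f"home#/stacks/create/review?{query}"
    )

link = build_cfn_link(str(uuid.uuid4()))
```

The ExternalId is generated once per tenant, stored in Secrets Manager, and baked into this link, so the customer never has to copy it by hand.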
### 6.4 >80% Discovery Accuracy — How to Measure It
**Definition of accuracy:**
```
accuracy = correct_services / total_discovered_services
Where "correct" means ALL of:
1. The service actually exists (not a phantom/duplicate)
2. The service name is recognizable to the team
3. The primary owner is correct (or marked "unowned" if truly unowned)
4. The repo link is correct (if a repo exists)
```
**Measurement methodology:**
1. **Beta measurement (Month 2-3):** Personal call with each of 20 beta customers. Walk through their catalog together. For each service, ask: "Is this right?" Record corrections. Calculate accuracy per customer and aggregate.
2. **Production measurement (Month 3+):** Track the correction rate.
```
correction_rate = services_corrected_within_7_days / services_discovered
accuracy_estimate = 1 - correction_rate
```
This is an upper bound: some incorrect services won't be corrected because users don't notice or don't care, so true accuracy may be lower than the estimate. But it's a continuous, automated metric.
3. **Accuracy by signal source:**
```sql
SELECT
discovery_sources,
COUNT(*) as total,
COUNT(*) FILTER (WHERE corrected_within_7d = false) as uncorrected,
1.0 - (COUNT(*) FILTER (WHERE corrected_within_7d = true)::decimal / COUNT(*)) as accuracy
FROM services
WHERE tenant_id = :tenant_id
GROUP BY discovery_sources
ORDER BY accuracy ASC;
```
This reveals which discovery signals are weakest (e.g., "Lambda-only services have 60% accuracy" → invest in Lambda grouping heuristics).
4. **Accuracy improvement tracking:**
```
Week 1: 78% accuracy (initial discovery)
Week 2: 85% accuracy (after user corrections + propagation)
Week 4: 91% accuracy (after model improvements from correction patterns)
Week 8: 93% accuracy (steady state)
```
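The correction-rate estimate from step 2 reduces to a one-liner; a worked example with illustrative numbers:

```python
def accuracy_estimate(services_corrected_within_7_days: int,
                      services_discovered: int) -> float:
    """1 - correction_rate, per the production measurement above.
    Uncorrected-but-wrong services inflate this estimate, so treat it
    as optimistic."""
    correction_rate = services_corrected_within_7_days / services_discovered
    return 1.0 - correction_rate

# e.g., 18 of 147 discovered services corrected in the first week:
est = accuracy_estimate(18, 147)  # ≈ 0.878
```

Tracking this weekly per tenant yields exactly the improvement curve shown above.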
**If accuracy is below 80% on first run:**
- Show a prominent "Review your catalog" banner: "We found 147 services. Some may need corrections. Help us learn your infrastructure by reviewing the flagged items."
- Sort low-confidence services to the top of the review queue
- Gamify corrections: "12 services need your review. ~5 minutes."
- Each correction improves the model for similar services — show this: "Your correction improved confidence for 3 similar services."
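The review-queue behavior above is a sort by confidence plus a prompt string. A minimal sketch; the threshold and the per-correction time estimate are illustrative assumptions:

```python
def build_review_queue(services: list[dict],
                       threshold: float = 0.8) -> tuple[list[dict], str]:
    """Low-confidence services first, with a gamified review prompt."""
    queue = sorted(services, key=lambda s: s["confidence"])
    flagged = [s for s in queue if s["confidence"] < threshold]
    # Assume ~25 seconds per correction when estimating effort.
    minutes = max(1, len(flagged) * 25 // 60)
    prompt = f"{len(flagged)} services need your review. ~{minutes} min."
    return queue, prompt

services = [
    {"name": "payment-gateway", "confidence": 0.92},
    {"name": "legacy-cron", "confidence": 0.41},
    {"name": "auth-service", "confidence": 0.77},
]
queue, prompt = build_review_queue(services)
```

With 12 flagged services this yields the "~5 minutes" framing used in the banner copy above.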
### 6.5 Technical Debt Budget
V1 is built fast. Some shortcuts are intentional. Document them so they don't become surprises.
| Debt Item | Shortcut | Proper Solution | When to Fix |
|-----------|----------|----------------|-------------|
| **Meilisearch persistence** | EFS volume (slow, but works) | Dedicated EC2 instance with local SSD | >200 tenants |
| **Search index sync** | SQS → Lambda → Meilisearch (eventual consistency, ~5s delay) | Change Data Capture from Aurora → real-time sync | If users complain about stale search results |
| **GitHub rate limiting** | Simple retry with backoff | Distributed rate limiter with token bucket per GitHub App installation | >50 tenants scanning simultaneously |
| **Discovery scheduling** | EventBridge cron (same time for all tenants) | Distributed scheduler with jitter to spread load | >100 tenants |
| **Monitoring** | CloudWatch basic metrics + alarms | Datadog or Grafana Cloud for full observability | >$10K MRR (can afford $200/month for monitoring) |
| **Database migrations** | Raw SQL files run in CI | Prisma Migrate or Flyway with proper versioning | When team grows beyond solo founder |
| **Error handling** | Generic error pages, console.error logging | Structured error codes, Sentry integration, user-facing error messages | Month 3-4 |
| **Test coverage** | Integration tests for discovery accuracy, minimal unit tests | 80%+ unit test coverage, E2E tests with Playwright | Month 4-6 |
### 6.6 Solo Founder Operational Model
**What Brian operates:**
- 1 AWS account (dd0c platform)
- 1 GitHub org (dd0c)
- 1 Stripe account
- 1 Slack workspace (dd0c community + bot)
- 1 PagerDuty account (dd0c's own alerting)
**Operational runbook (daily):**
- Check CloudWatch dashboard (5 minutes): any alarms firing? Any discovery failures?
- Check Stripe dashboard: new signups? Churns? Failed payments?
- Check support channel (Slack/email): any customer issues?
- Total daily ops: ~15 minutes when nothing is broken
**Alerting (PagerDuty):**
- P1 (page immediately): API 5xx rate >5%, Aurora connection failures, discovery pipeline stuck >30 minutes
- P2 (page during business hours): Discovery accuracy drop >10% for any tenant, Meilisearch index lag >60 seconds, Stripe webhook failures
- P3 (daily digest): Lambda error rate >1%, CloudWatch log anomalies, dependency vulnerability alerts
**On-call:** Brian is the only on-call. This is sustainable up to ~50 tenants if the architecture is reliable. At $20K MRR, hire a part-time contractor for L1 support (Slack responses, known-issue triage).
---
## 7. API DESIGN
The Portal API is a RESTful JSON API. It serves the SPA frontend, the Slack bot, and internal integrations (dd0c/cost, dd0c/alert). In V1, there is no public API for customers to programmatically query their catalog, but the internal API is designed cleanly enough to be exposed later.
All requests require authentication. The SPA uses HTTP-only cookies (JWT). Integrations use internal IAM/VPC auth or signed tokens.
### 7.1 Discovery API
Manages the auto-discovery lifecycle.
**`POST /api/v1/discovery/run`**
Triggers a manual full discovery scan.
- **Request:** `{ "connections": ["aws", "github"] }`
- **Response:** `{ "run_id": "uuid", "status": "started" }`
**`GET /api/v1/discovery/runs/{run_id}`**
Polls the status of a specific discovery run.
- **Response:**
```json
{
"id": "uuid",
"status": "running",
"started_at": "2026-02-28T10:00:00Z",
"progress": {
"aws_resources_found": 142,
"github_repos_found": 89,
"current_phase": "reconciliation"
}
}
```
**WebSocket: `wss://api.dd0c.com/v1/discovery/stream?run_id=uuid`**
Real-time progress events pushed to the UI during onboarding.
- **Events:** `phase_started`, `resources_discovered`, `services_reconciled`, `ownership_inferred`, `completed`.
### 7.2 Service API
CRUD and search operations for the catalog.
**`GET /api/v1/services/search?q={query}&limit=10`**
The Cmd+K search endpoint. Proxies to Meilisearch, with frequent queries served from the Redis cache.
- **Response:** Array of matched services with highlighted snippets and confidence scores. (See Section 2.3).
**`GET /api/v1/services/{service_id}`**
Retrieves the full, expanded detail view of a single service.
- **Response:**
```json
{
"id": "svc_123",
"name": "payment-gateway",
"display_name": "Payment Gateway",
"description": "Handles Stripe checkout",
"lifecycle": "production",
"owner": {
"team_id": "team_456",
"name": "Payments Team",
"confidence": 0.92,
"source": "codeowners"
},
"repo": {
"url": "https://github.com/acme/payment-gateway",
"default_branch": "main"
},
"infrastructure": {
"aws_resources": [
{"type": "ecs_service", "arn": "arn:aws:ecs:...", "region": "us-east-1"}
]
},
"health_status": "healthy",
"last_deploy_at": "2026-02-27T14:30:00Z"
}
```
**`PATCH /api/v1/services/{service_id}`**
Allows users to correct service metadata (e.g., fix a wrong owner).
- **Request:** `{ "team_id": "team_789", "correction_reason": "Team reorg" }`
- **Response:** `200 OK`. (Triggers background propagation to similar services).
### 7.3 Team & Ownership API
**`GET /api/v1/teams`**
Lists all inferred and synced teams.
**`GET /api/v1/teams/{team_id}/services`**
Lists all services owned by a specific team.
- **Query params:** `?role=primary|contributing|on_call`
- **Response:** Array of service summaries.
### 7.4 Slack Bot API
The Slack bot translates slash commands into API queries. The bot Lambda receives the webhook from Slack, authenticates the workspace, maps it to a tenant, and calls the internal Portal API.
**Command:** `/dd0c who owns payment-gateway`
- **Bot logic:** Calls `GET /api/v1/services/search?q=payment-gateway&limit=1`.
- **Bot response (ephemeral or in-channel):**
> **Payment Gateway** is owned by **@payments-team** (92% confidence).
> Repo: [acme/payment-gateway](https://github.com/...) | Health: ✅ Healthy | On-Call: @mike
**Command:** `/dd0c oncall auth-service`
- **Bot logic:** Looks up service, finds owner team, queries mapped PagerDuty schedule.
- **Bot response:**
> Primary on-call for **Auth Service** (@platform-team) is **Sarah Chen** until 5:00 PM.
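The bot's reply can be rendered by a pure function over the search hit, keeping the Lambda handler thin. A sketch using the service payload shape from Section 7.2; the exact message layout is illustrative:

```python
def format_who_owns(service: dict) -> str:
    """Render the '/dd0c who owns' reply (Slack mrkdwn) from a search hit."""
    owner = service["owner"]
    confidence_pct = round(owner["confidence"] * 100)
    return (
        f"*{service['display_name']}* is owned by *@{owner['name']}* "
        f"({confidence_pct}% confidence).\n"
        f"Repo: {service['repo']['url']} | Health: {service['health_status']}"
    )

# Shape follows the GET /api/v1/services/{id} response in Section 7.2.
hit = {
    "display_name": "Payment Gateway",
    "owner": {"name": "payments-team", "confidence": 0.92},
    "repo": {"url": "https://github.com/acme/payment-gateway"},
    "health_status": "healthy",
}
message = format_who_owns(hit)
```

Keeping formatting pure makes the Slack bot trivially unit-testable without mocking Slack or the Portal API.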
### 7.5 Webhooks (Outbound)
V1 supports outbound webhooks so customers can react to catalog changes.
**`POST https://customer-endpoint.com/webhooks/dd0c`**
- **Event:** `service.ownership.changed`
- **Payload:**
```json
{
"event_id": "evt_abc123",
"type": "service.ownership.changed",
"timestamp": "2026-02-28T12:00:00Z",
"data": {
"service_id": "svc_123",
"service_name": "payment-gateway",
"old_owner_id": "team_456",
"new_owner_id": "team_789",
"confidence": 0.95,
"source": "user_correction"
}
}
```
- **Other events:** `service.discovered`, `service.health.degraded`, `discovery.run.completed`.
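All outbound events share the envelope shown above. A sketch of constructing it; the event ID format and timestamp rendering here are implementation choices, not part of the payload contract:

```python
import uuid
from datetime import datetime, timezone

def build_webhook_event(event_type: str, data: dict) -> dict:
    """Envelope per Section 7.5: event_id, type, timestamp, data."""
    return {
        "event_id": f"evt_{uuid.uuid4().hex[:12]}",  # illustrative ID scheme
        "type": event_type,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": data,
    }

event = build_webhook_event("service.ownership.changed", {
    "service_id": "svc_123",
    "service_name": "payment-gateway",
    "old_owner_id": "team_456",
    "new_owner_id": "team_789",
    "confidence": 0.95,
    "source": "user_correction",
})
```

The EventBridge → Lambda delivery path then POSTs this envelope to the customer endpoint with retries.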
### 7.6 Platform Integration APIs (Internal)
These endpoints allow other dd0c modules to enrich the portal, and the portal to enrich other modules. These use internal IAM auth and bypass the public API Gateway.
**dd0c/cost Integration**
- **Portal calls Cost:** `GET http://cost.internal/api/v1/services/{service_id}/spend`
Retrieves the trailing 30-day AWS spend for the resources mapped to this service. Displayed on the service card.
- **Cost calls Portal:** `GET http://portal.internal/api/v1/resources/{arn}/service`
When dd0c/cost detects an anomaly on an RDS instance, it asks the portal "which service owns this ARN?" so it can alert the correct team.
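The reverse lookup dd0c/cost needs ("which service owns this ARN?") is an index over the service-to-resource mapping already stored in the catalog. A minimal in-memory sketch; the production endpoint would query Aurora, and the names here are illustrative:

```python
def build_arn_index(services: list[dict]) -> dict[str, str]:
    """Map each AWS resource ARN to the ID of the service it belongs to."""
    index: dict[str, str] = {}
    for svc in services:
        for res in svc["infrastructure"]["aws_resources"]:
            index[res["arn"]] = svc["id"]
    return index

# Shape follows the service payload in Section 7.2.
services = [{
    "id": "svc_123",
    "infrastructure": {"aws_resources": [
        {"type": "ecs_service",
         "arn": "arn:aws:ecs:us-east-1:111122223333:service/payments"},
    ]},
}]
index = build_arn_index(services)
owner = index.get("arn:aws:ecs:us-east-1:111122223333:service/payments")
```

An unknown ARN returns `None`, which dd0c/cost should treat as "unattributed spend" rather than an error.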
**dd0c/alert Integration**
- **Portal calls Alert:** `GET http://alert.internal/api/v1/services/{service_id}/incidents?status=active`
Retrieves active incidents to update the service's `health_status` badge.
- **Alert calls Portal:** `GET http://portal.internal/api/v1/services/{service_id}/routing`
When an alert fires for a service, dd0c/alert asks the portal for the primary owner's Slack channel and PagerDuty schedule to route the page correctly.
**dd0c/run Integration**
- **Portal calls Run:** `GET http://run.internal/api/v1/services/{service_id}/runbooks`
Links executable runbooks directly on the service detail card.
---