dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00
commit 5ee95d8b13
51 changed files with 36935 additions and 0 deletions

# dd0c/portal — Brainstorm Session
**Product:** Lightweight Internal Developer Portal ("The Anti-Backstage")
**Facilitator:** Carson, Elite Brainstorming Specialist
**Date:** 2026-02-28
> *Every idea gets a seat at the table. We filter later. Let's GO.*
---
## Phase 1: Problem Space (25 ideas)
### Why Does Backstage Suck?
1. **YAML Cemetery** — Backstage requires hand-written `catalog-info.yaml` in every repo. Engineers write it once, never update it. Within 6 months your catalog is a graveyard of lies.
2. **Plugin Roulette** — Backstage plugins break on every upgrade. The plugin ecosystem is wide but shallow — half-maintained community plugins that rot.
3. **Dedicated Platform Team Required** — You need 1-2 full-time engineers just to keep Backstage running. For a 30-person team, that's 3-7% of your engineering headcount babysitting a developer portal.
4. **React Monolith From Hell** — Backstage is a massive React app you have to fork, customize, build, and deploy yourself. It's not a product, it's a framework. Spotify built it for Spotify.
5. **Upgrade Treadmill** — Backstage releases constantly. Each upgrade risks breaking your custom plugins and templates. Teams fall behind and get stuck on ancient versions.
6. **Cold Start Problem** — Day 1 of Backstage: empty catalog. You have to manually register every service. Nobody does it. The portal launches to crickets.
7. **No Opinions** — Backstage is infinitely configurable, which means it ships with zero useful defaults. You have to decide everything: what metadata to track, what plugins to install, how to organize the catalog.
8. **Search Is Terrible** — Backstage's built-in search is basic. Finding "who owns the payment service" requires navigating a clunky UI tree.
9. **Authentication Nightmare** — Setting up auth (Okta, GitHub, Google) in Backstage requires custom provider configuration that's poorly documented.
10. **No Auto-Discovery** — Backstage doesn't discover anything. It's a static registry that depends entirely on humans keeping it current. Humans don't.
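For contrast, this is the artifact ideas #1 and #10 complain about: the hand-written `catalog-info.yaml` Backstage expects in every repo. A minimal, illustrative example (service and team names are hypothetical) — note that every field here is something a human must write and then remember to update:

```yaml
# Illustrative catalog-info.yaml — the per-repo file Backstage requires.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles card payments        # goes stale first
spec:
  type: service
  lifecycle: production                     # rarely updated after launch
  owner: payments-team                      # wrong the moment teams reorg
```

Nothing validates this against reality, which is exactly why catalogs rot.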
### What Do Engineers Actually Need? (The 80/20)
11. **"Who owns this?"** — The #1 question. When something breaks at 3 AM, you need to know who to page. That's it. That's the killer feature.
12. **"What does this service do?"** — A one-paragraph description, its dependencies, and its API docs. Not a 40-page Confluence novel.
13. **"Is it healthy right now?"** — Green/yellow/red. Deployment status. Last deploy time. Current error rate. One glance.
14. **"Where's the runbook?"** — When the service is on fire, where do I go? Link to the runbook, the dashboard, the logs.
15. **"What depends on this?"** — Dependency graph. If I change this service, what breaks?
16. **"How do I set up my dev environment for this?"** — README, setup scripts, required env vars. Onboarding in 10 minutes, not 10 days.
### The Pain of NOT Having an IDP
17. **Tribal Knowledge Monopoly** — "Ask Dave, he built that service 3 years ago." Dave left 6 months ago. Now nobody knows.
18. **Confluence Graveyard** — Teams document services in Confluence pages that are 2 years stale. New engineers follow outdated instructions and waste days.
19. **Slack Archaeology** — "I think someone posted the architecture diagram in #platform-eng last March?" Engineers spend hours searching Slack history for institutional knowledge.
20. **Incident Response Roulette** — Alert fires → nobody knows who owns the service → 30-minute delay finding the right person → MTTR doubles.
21. **Onboarding Black Hole** — New engineer joins. Spends first 2 weeks asking "what is this service?" and "who do I talk to about X?" in Slack. Productivity = zero.
22. **Duplicate Services** — Without a catalog, Team A builds a notification service. Team B doesn't know it exists. Team B builds another notification service. Now you have two.
23. **Zombie Services** — Services that nobody owns, nobody uses, but nobody is brave enough to turn off. They accumulate like barnacles, costing money and creating security risk.
24. **Compliance Panic** — Auditor asks "show me all services that handle PII and their owners." Without an IDP, this is a multi-week scavenger hunt.
25. **Shadow Architecture** — The actual architecture diverges from every diagram ever drawn. Nobody has a true picture of what's running in production.
---
## Phase 2: Solution Space (59 ideas)
### Auto-Discovery Approaches
26. **AWS Resource Tagger** — Scan AWS accounts via read-only IAM role. Discover EC2, ECS, Lambda, RDS, S3, API Gateway. Map them to services using tags, naming conventions, and CloudFormation stack associations.
27. **GitHub/GitLab Repo Scanner** — Scan org repos. Infer services from repo names, `Dockerfile` presence, CI/CD configs, deployment manifests. Extract README descriptions automatically.
28. **Kubernetes Label Harvester** — Connect to K8s clusters. Discover deployments, services, ingresses. Map labels (`app`, `team`, `owner`) to catalog entries.
29. **Terraform State Reader** — Parse Terraform state files (S3 backends). Build infrastructure graph from resource relationships. Know exactly what infra each service uses.
30. **CI/CD Pipeline Analyzer** — Read GitHub Actions / GitLab CI / Jenkins configs. Infer deployment targets, environments, and service relationships from pipeline definitions.
31. **DNS/Route53 Reverse Map** — Scan DNS records to discover all public-facing services and map them back to infrastructure.
32. **CloudFormation Stack Walker** — Parse CF stacks to understand resource groupings and cross-stack references. Build dependency graphs automatically.
33. **Package.json / go.mod / pom.xml Dependency Inference** — Read dependency files to infer internal service-to-service relationships (shared libraries = likely communication).
34. **Git Blame Ownership** — Infer service ownership from git commit history. Who commits most to this repo? That's probably the owner (or at least knows who is).
35. **PagerDuty/OpsGenie Schedule Import** — Pull on-call schedules to auto-populate "who to page" for each service.
36. **OpenAPI/Swagger Auto-Ingest** — Detect and index API specs from repos. Surface them in the portal as live, searchable API documentation.
37. **Docker Compose Graph** — Parse `docker-compose.yml` files to understand local development service topologies.
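The tag-mapping idea (#26) can be sketched in a few lines. This is a minimal illustration, not dd0c's implementation: the `service`/`team` tag keys, the `-prod` naming convention, and the sample resources are all assumptions; real discovery would pull resources from the AWS APIs via the read-only role.

```python
"""Sketch of tag-based service mapping (idea #26)."""
from collections import defaultdict

def map_resources_to_services(resources):
    """Group discovered cloud resources into catalog entries by tag."""
    catalog = defaultdict(lambda: {"team": None, "resources": []})
    for res in resources:
        tags = res.get("tags", {})
        # Fall back to a naming convention when the `service` tag is absent.
        service = tags.get("service") or res["name"].split("-prod")[0]
        entry = catalog[service]
        entry["resources"].append(res["arn"])
        entry["team"] = entry["team"] or tags.get("team")
    return dict(catalog)

discovered = [  # stand-ins for resources found via the AWS APIs
    {"name": "payments-prod-api", "arn": "arn:aws:ecs:eu-1:acct:svc/api",
     "tags": {"service": "payments", "team": "payments-team"}},
    {"name": "payments-prod-db", "arn": "arn:aws:rds:eu-1:acct:db/pay",
     "tags": {}},  # untagged: mapped purely by naming convention
]
catalog = map_resources_to_services(discovered)
# "payments" entry now holds both resources and the inferred owning team
```

The fallback chain (explicit tag → naming convention) is the same pattern ideas #27–#34 apply to repos, pipelines, and dependency files.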
### Service Catalog Features
38. **One-Line Service Card** — Every service gets a card: name, owner, health, last deploy, language, repo link. Scannable in 2 seconds.
39. **Dependency Graph Visualization** — Interactive graph showing service-to-service dependencies. Click a node to see details. Highlight blast radius.
40. **Health Dashboard** — Aggregate health from multiple sources (CloudWatch, Datadog, Grafana, custom health endpoints). Show unified red/yellow/green.
41. **Ownership Registry** — Team → services mapping. Click a team, see everything they own. Click a service, see the team and on-call rotation.
42. **Runbook Linker** — Auto-detect runbooks in repos (markdown files in `/runbooks`, `/docs`, or linked in README). Surface them on the service card.
43. **Environment Matrix** — Show all environments (dev, staging, prod) for each service. Current version deployed in each. Drift between environments highlighted.
44. **SLO Tracker** — Define SLOs per service. Show current burn rate. Alert when SLO budget is burning too fast. Simple — not a full SLO platform, just visibility.
45. **Cost Attribution** — Pull from dd0c/cost. Show monthly AWS cost per service. "This service costs $847/month." Engineers never see this data today.
46. **Tech Radar Integration** — Tag services with their tech stack. Surface org-wide technology adoption. "We have 47 services on Node 18, 3 still on Node 14."
47. **README Renderer** — Pull and render the repo README directly in the portal. No context switching to GitHub.
48. **Changelog Feed** — Show recent deployments, config changes, and incidents per service. "What happened to this service this week?"
### Developer Experience
49. **Instant Search (Cmd+K)** — Algolia-fast search across all services, teams, APIs, runbooks. The portal IS the search bar.
50. **Slack Bot** — `/dd0c who owns payment-service` → instant answer in Slack. No need to open the portal.
51. **CLI Tool** — `dd0c portal search "payment"` → results in terminal. For engineers who live in the terminal.
52. **Browser New Tab** — dd0c/portal as the browser new tab page. Every time an engineer opens a tab, they see their team's services, recent incidents, and deployment status.
53. **VS Code Extension** — Right-click a service import → "View in dd0c/portal" → opens service card. See ownership and docs without leaving the editor.
54. **GitHub PR Enrichment** — Bot comments on PRs with service context: "This PR affects payment-service (owned by @payments-team, 99.9% SLO, last incident 3 days ago)."
55. **Mobile-Friendly View** — When you're on-call and get paged on your phone, the portal should be usable on mobile. Backstage is not.
56. **Deep Links** — Every service, team, runbook, and API has a stable URL. Paste it in Slack, Jira, anywhere. It just works.
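The Slack bot (#50) reduces to a one-function handler. A hedged sketch — the in-memory `CATALOG` dict, the service names, and the response wording are placeholders for the real portal API:

```python
"""Minimal sketch of the `/dd0c who owns <service>` command (idea #50)."""

# Stand-in for the portal's catalog API.
CATALOG = {
    "payment-service": {"owner": "@payments-team", "oncall": "@sarah"},
}

def handle_command(text, catalog=CATALOG):
    """Answer 'who owns <service>' with owner and current on-call."""
    if text.startswith("who owns "):
        name = text[len("who owns "):].strip()
        entry = catalog.get(name)
        if entry is None:
            return f"No service named '{name}' in the catalog."
        return f"{name} is owned by {entry['owner']} (on-call: {entry['oncall']})"
    return "Usage: /dd0c who owns <service>"

print(handle_command("who owns payment-service"))
```

The point of the sketch: the hard part is never the bot, it's having a catalog trustworthy enough that the answer is worth paging someone over.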
### Zero-Config Magic
57. **Convention Over Configuration** — If your repo is named `payment-service`, the service is named `payment-service`. If it has a `Dockerfile`, it's a deployable service. If it has `owner` in CODEOWNERS, that's the owner. Zero YAML needed.
58. **Smart Defaults** — First run: connect AWS account + GitHub org. Portal auto-populates with everything it finds. Engineer reviews and corrects, not creates from scratch.
59. **Progressive Enhancement** — Start with auto-discovered data (maybe 60% accurate). Let teams enrich over time. Never require manual entry as a prerequisite.
60. **Confidence Scores** — Show "we're 85% sure @payments-team owns this" based on git history and AWS tags. Let humans confirm or correct. Learn from corrections.
61. **Ghost Service Detection** — Find AWS resources that don't map to any known repo or team. Surface them as "orphaned infrastructure" — potential zombie services or cost waste.
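Ideas #57 and #60 combine naturally: infer fields from conventions, and attach a confidence score to each inference. A minimal sketch, assuming CODEOWNERS and git history as the two evidence sources — the weights (0.6 / 0.3) are illustrative, not tuned:

```python
"""Sketch of convention-based inference with confidence (ideas #57, #60)."""

def infer_service(repo):
    """Infer catalog fields from what's already in the repo; zero YAML."""
    inferred = {
        "name": repo["name"],                       # convention: repo name
        "deployable": "Dockerfile" in repo["files"],  # convention: Dockerfile
    }
    evidence, owner = [], None
    if repo.get("codeowners"):
        owner = repo["codeowners"][0]
        evidence.append(("CODEOWNERS", 0.6))
    if repo.get("top_committer_team"):
        owner = owner or repo["top_committer_team"]
        evidence.append(("git history", 0.3))
    inferred["owner"] = owner
    # Confidence: capped sum of evidence weights (illustrative model).
    inferred["confidence"] = min(1.0, sum(w for _, w in evidence))
    return inferred

svc = infer_service({
    "name": "payment-service",
    "files": ["Dockerfile", "README.md"],
    "codeowners": ["@payments-team"],
    "top_committer_team": "@payments-team",
})
# svc: deployable=True, owner from CODEOWNERS, confidence ~0.9
```

Corrections from humans (#60) would feed back into these weights over time.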
### Scorecard / Maturity Model
62. **Production Readiness Score** — Does this service have: health check? Logging? Alerting? Runbook? Score it 0-100. Gamify production readiness.
63. **Documentation Coverage** — Does the repo have a README? API docs? Architecture decision records? Score it.
64. **Security Posture** — Are dependencies up to date? Any known CVEs? Is the Docker image scanned? Secrets in env vars vs. secrets manager?
65. **On-Call Readiness** — Is there an on-call rotation defined? Is the runbook current? Has the team done a recent incident drill?
66. **Leaderboard** — Team-level maturity scores. Friendly competition. "Platform team is at 92%, payments team is at 67%." Gamification drives adoption.
67. **Improvement Suggestions** — "Your service is missing a health check endpoint. Here's a template for Express/FastAPI/Go." Actionable, not just a score.
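The scorecard (#62) plus actionable suggestions (#67) is mechanically simple. A sketch under assumptions — the four checks and their equal weighting are illustrative, not a proposed maturity model:

```python
"""Sketch of a 0-100 production-readiness score (ideas #62, #67)."""

# Illustrative check list; a real scorecard would weight and extend these.
CHECKS = ["health_check", "logging", "alerting", "runbook"]

def readiness_score(service):
    """Score a service by the fraction of checks it passes."""
    passed = [c for c in CHECKS if service.get(c)]
    missing = [c for c in CHECKS if not service.get(c)]
    return {
        "score": round(100 * len(passed) / len(CHECKS)),
        # Actionable gaps, per idea #67 — not just a number.
        "suggestions": [f"Add a {c.replace('_', ' ')}" for c in missing],
    }

result = readiness_score({"health_check": True, "logging": True})
# score 50, with suggestions for the two missing checks
```

The gamification in #66 falls out of this for free: aggregate scores per team and sort.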
### dd0c Module Integration
68. **Alert → Owner Routing (dd0c/alert)** — Alert fires → portal knows the owner → alert routes directly to the right person. No more generic #alerts channel.
69. **Drift Visibility (dd0c/drift)** — Service card shows "⚠️ 3 infrastructure drifts detected." Click to see details in dd0c/drift.
70. **Cost Per Service (dd0c/cost)** — Service card shows monthly AWS cost. "This Lambda costs $234/month." Engineers finally see the money.
71. **Runbook Execution (dd0c/run)** — Runbook linked in portal is executable via dd0c/run. "Service is down → click runbook → AI walks you through recovery."
72. **LLM Cost Per Service (dd0c/route)** — If the service uses LLM APIs, show the AI spend. "This service spent $1,200 on GPT-4o last month."
73. **Unified Incident View** — When an incident happens, the portal becomes the war room: service health, owner, runbook, recent changes, cost impact — all on one screen.
### Wild Ideas 🔥
74. **The IDP Is Just a Search Engine** — Forget the catalog UI. The entire product is a search bar. Type anything: service name, team name, API endpoint, error message. Get instant answers. Google for your infrastructure.
75. **AI Agent: "Ask Your Infra"** — Natural language queries: "Who owns the service that handles Stripe webhooks?" "What changed in production in the last 24 hours?" "Which services don't have runbooks?" The AI queries the catalog and answers.
76. **Auto-Generated Architecture Diagrams** — From discovered services and dependencies, auto-generate C4 / system context diagrams. Always up-to-date because they're generated from reality, not drawn by hand.
77. **"New Engineer" Mode** — A guided tour for new hires. "Here are the 10 most important services. Here's who owns what. Here's how to set up your dev environment." Onboarding in 1 hour, not 1 week.
78. **Service DNA** — Every service gets a unique fingerprint: its tech stack, dependencies, deployment pattern, cost profile, health history. Use this to find similar services, suggest best practices, detect anomalies.
79. **Incident Replay** — After an incident, the portal shows a timeline: what changed, what broke, who responded, how it was fixed. Auto-generated post-mortem skeleton.
80. **"What If" Simulator** — "What if we deprecate service X?" Show the blast radius: which services depend on it, which teams are affected, estimated migration effort.
81. **Service Lifecycle Tracker** — Track services from creation → active → deprecated → decommissioned. Prevent zombie services by making the lifecycle visible.
82. **Auto-PR for Missing Metadata** — Portal detects a service is missing an owner tag. Automatically opens a PR to the repo adding a `CODEOWNERS` file with a suggested owner based on git history.
83. **Ambient Dashboard (TV Mode)** — Full-screen mode for office TVs. Show team services, health status, recent deploys, SLO burn rates. The engineering floor's heartbeat monitor.
84. **Service Comparison** — Side-by-side comparison of two services: tech stack, cost, health, maturity score. Useful for migration planning or standardization.
---
## Phase 3: Differentiation & Moat (18 ideas)
### How to Beat Backstage
85. **Time-to-Value: 5 Minutes vs. 5 Months** — Backstage takes months to set up. dd0c/portal takes 5 minutes (connect AWS + GitHub, auto-discover, done). This is the entire pitch. Speed kills.
86. **Zero Maintenance** — Backstage is self-hosted and requires constant upgrades. dd0c/portal is SaaS. We handle upgrades, scaling, and plugin compatibility. Your platform team can go back to building platforms.
87. **Auto-Discovery vs. Manual Entry** — Backstage requires humans to write YAML. dd0c/portal discovers everything automatically. The catalog is always current because it's generated from reality, not maintained by humans.
88. **Opinionated > Configurable** — Backstage gives you a blank canvas. dd0c/portal gives you a finished painting. We make the decisions so you don't have to. Convention over configuration.
89. **"Backstage Migrator"** — One-click import from existing Backstage `catalog-info.yaml` files. Lower the switching cost to zero. Eat their lunch.
### How to Beat Port/Cortex/OpsLevel
90. **Price: $10/eng vs. $200+/eng** — Port, Cortex, and OpsLevel charge enterprise prices ($20K+/year). dd0c/portal is $10/engineer/month with self-serve signup. No sales calls. No procurement process.
91. **Self-Serve vs. Sales-Led** — You can start using dd0c/portal today. Port requires a demo call, a POC, and a 6-week evaluation. By the time their sales cycle completes, you've been using dd0c for 2 months.
92. **Simplicity as Feature** — Port and Cortex have massive feature sets designed for 1000+ engineer orgs. dd0c/portal has 20% of the features for 80% of the value. For a 30-person team, less is more.
93. **dd0c Platform Integration** — Port is a standalone IDP. dd0c/portal is part of a unified platform (cost, alerts, drift, runbooks). The IDP that knows your costs, routes your alerts, and executes your runbooks. Nobody else can do this.
### The Moat
94. **Data Network Effect** — The more services discovered, the better the dependency graph, the smarter the ownership inference, the more accurate the health aggregation. Data compounds.
95. **Platform Lock-In (The Good Kind)** — Once dd0c/portal is the browser homepage for every engineer, switching costs are enormous. It's the operating system for your engineering org.
96. **Cross-Module Flywheel** — Portal makes alerts smarter (route to owner). Alerts make portal stickier (engineers open it during incidents). Cost data makes portal indispensable (engineers check service costs). Each module reinforces the others.
97. **AI-Powered Inference Engine** — Over time, dd0c learns patterns across all customers (anonymized): common service architectures, typical ownership structures, standard tech stacks. The AI gets smarter with scale. New customers get better auto-discovery on day 1.
98. **Community Catalog Templates** — Open-source library of service templates (Express API, Lambda function, ECS service). New services created from templates are automatically portal-ready. The community builds the ecosystem.
99. **"Agent Control Plane" Positioning** — As agentic AI grows, AI agents need a source of truth about services. dd0c/portal becomes the registry that AI agents query. "Which service handles payments?" The IDP isn't just for humans anymore — it's for AI agents too.
100. **Compliance Moat** — Once dd0c/portal is the system of record for service ownership and maturity, it becomes compliance infrastructure. SOC 2 auditors love it. Ripping it out means losing your compliance evidence.
101. **Integration Depth** — Build deep integrations with the tools teams already use (GitHub, Slack, PagerDuty, Datadog, AWS). Each integration makes dd0c/portal harder to replace.
102. **Open-Source Discovery Agent** — Open-source the discovery agent (runs in their VPC). Proprietary SaaS dashboard. The OSS agent builds trust and community. The dashboard is the business.
---
## Phase 4: Anti-Ideas & Red Team (14 ideas)
### Why Would This Fail?
103. **"Lightweight" = "Toy"** — Teams might dismiss dd0c/portal as too simple. "We need Backstage because we're a serious engineering org." Perception problem: lightweight sounds like it can't scale.
104. **GitHub Ships a Built-In Catalog** — GitHub already has repository topics, CODEOWNERS, and dependency graphs. If they add a "Service Catalog" tab, dd0c/portal's value proposition evaporates overnight.
105. **Backstage Gets Easy** — Roadie (managed Backstage) is improving. If Backstage 2.0 ships with auto-discovery and zero-config setup, the "Anti-Backstage" positioning dies.
106. **AWS Ships a Good IDP** — AWS has Service Catalog, but it's terrible. If they build a real IDP integrated with their ecosystem, they have distribution dd0c can't match.
107. **Discovery Accuracy Problem** — Auto-discovery sounds magical but might be 60% accurate. If engineers open the portal and see wrong data, they'll never trust it again. First impressions are everything.
108. **Small Teams Don't Need an IDP** — A 15-person team might genuinely not need a service catalog. They all sit in the same room. The TAM might be smaller than expected.
109. **Enterprise Gravity** — As teams grow past 100 engineers, they'll "graduate" to Port or Cortex. dd0c/portal might be a stepping stone, not a destination. High churn at the top end.
110. **Solo Founder Risk** — Building an IDP requires integrations with dozens of tools (AWS, GCP, Azure, GitHub, GitLab, Bitbucket, PagerDuty, OpsGenie, Datadog, Grafana...). That's a massive surface area for one person.
111. **The "Free" Competitor Problem** — Backstage is free. Convincing teams to pay $10/eng/month when a free option exists requires the value gap to be enormous and obvious.
112. **Data Sensitivity** — The portal needs read access to AWS accounts and GitHub orgs. Security teams at larger companies will block this. The trust barrier is real.
113. **Agentic AI Makes IDPs Obsolete** — If AI agents can answer "who owns this service?" by reading git history and Slack in real-time, do you need a static catalog at all?
114. **Platform Engineering Fatigue** — Teams are tired of adopting new tools. "We just finished setting up Backstage, we're not switching." Migration fatigue is real.
115. **The "Good Enough" Spreadsheet** — Many teams track services in a Google Sheet or Notion database. It's ugly but it works. Convincing them to pay for a dedicated tool is harder than it sounds.
116. **Churn from Simplicity** — If the product is truly lightweight, there's less surface area for stickiness. Users might churn because they feel they've "outgrown" it.
---
## Phase 5: Synthesis
### Top 10 Ideas (Ranked)
| Rank | Idea | Why It Wins |
|------|------|-------------|
| 1 | **5-Minute Auto-Discovery Setup** (#57, #58) | THE differentiator. Connect AWS + GitHub → catalog populated. Zero YAML. This is the entire pitch against Backstage. |
| 2 | **Cmd+K Instant Search** (#49, #74) | The portal IS the search bar. "Who owns X?" answered in 2 seconds. This is the daily-use hook that makes it the browser homepage. |
| 3 | **AI "Ask Your Infra" Agent** (#75) | Natural language queries against your service catalog. "What changed in prod today?" This is the 2026 differentiator that no competitor has. |
| 4 | **Ownership Registry + PagerDuty Sync** (#41, #35) | The #1 use case: who owns this, who's on-call. Auto-populated from PagerDuty/OpsGenie + git history. Solves the 3 AM problem. |
| 5 | **dd0c Cross-Module Integration** (#68-73, #96) | Alerts route to owners. Costs attributed to services. Runbooks linked and executable. The platform flywheel that standalone IDPs can't match. |
| 6 | **Production Readiness Scorecard** (#62-67) | Gamified maturity model. Teams compete to improve scores. Drives adoption AND improves engineering practices. Two birds, one stone. |
| 7 | **Slack Bot** (#50) | `/dd0c who owns payment-service` — meet engineers where they already are. Reduces friction to zero. Drives organic adoption. |
| 8 | **Auto-Generated Dependency Graphs** (#39, #76) | Visual blast radius. "If this service goes down, these 12 services are affected." Always current because it's generated from reality. |
| 9 | **Backstage Migrator** (#89) | One-click import from Backstage YAML. Lowers switching cost to zero. Directly targets the frustrated Backstage user base. |
| 10 | **$10/eng Self-Serve Pricing** (#90, #91) | No sales calls. No procurement. Credit card signup. This alone disqualifies Port/Cortex/OpsLevel for 80% of the market. |
### 3 Wild Cards 🃏
| # | Wild Card | Why It's Wild |
|---|-----------|---------------|
| 🃏1 | **"New Engineer" Guided Onboarding Mode** (#77) | Turns the IDP into an onboarding tool. "Welcome to Acme Corp. Here are your 47 services in 5 minutes." HR teams would champion this. Completely different buyer persona. |
| 🃏2 | **Agent Control Plane** (#99) | Position dd0c/portal as the registry that AI agents query, not just humans. As agentic DevOps explodes in 2026, this could be the defining use case. The IDP becomes infrastructure for AI. |
| 🃏3 | **Auto-PR for Missing Metadata** (#82) | The portal doesn't just show gaps — it fixes them. Detects missing CODEOWNERS, opens a PR with suggested owners. The catalog improves itself. Self-healing metadata. |
### Recommended V1 Scope
**Core (Must Ship):**
- AWS auto-discovery (EC2, ECS, Lambda, RDS, API Gateway via read-only IAM role)
- GitHub org scan (repos, languages, CODEOWNERS, README)
- Service cards (name, owner, description, repo, health, last deploy)
- Cmd+K instant search
- Team → services ownership mapping
- PagerDuty/OpsGenie on-call schedule import
- Slack bot (`/dd0c who owns X`)
- Self-serve signup, $10/engineer/month
**V1.1 (Fast Follow):**
- Production readiness scorecard
- Dependency graph visualization
- Kubernetes discovery
- Backstage YAML importer
- dd0c/alert integration (route alerts to service owners)
- dd0c/cost integration (show cost per service)
**V1.2 (Differentiator):**
- AI "Ask Your Infra" natural language queries
- Auto-PR for missing metadata
- New engineer onboarding mode
- Terraform state parsing
**Explicitly NOT V1:**
- Custom plugins/extensions (that's Backstage's trap)
- GCP/Azure support (AWS-first, expand later)
- Software templates / scaffolding (stay focused on catalog)
- Full SLO management (just show basic health)
- Self-hosted option (SaaS only to start)
---
> **Total ideas generated: 116**
> *Session complete. The Anti-Backstage has a blueprint. Now go build it.* 🔥

# dd0c/portal — Design Thinking Session
**Product:** Lightweight Internal Developer Portal ("The Anti-Backstage")
**Facilitator:** Maya, Design Thinking Maestro
**Date:** 2026-02-28
> *"Design is jazz. You listen before you play. You feel the room before you solo. And the room right now? It's full of engineers who are lost, frustrated, and pretending they're not."*
---
## Phase 1: EMPATHIZE 🎭
The cardinal sin of developer tooling is building for the architect's ego instead of the practitioner's panic. Before we sketch a single wireframe, we need to sit in three very different chairs and feel the splinters.
---
### Persona 1: The New Hire — "Jordan"
**Profile:** Software Engineer II, 2 years experience, just joined a 60-person engineering org. First day was Monday. It's now Wednesday. Jordan has imposter syndrome and a laptop full of bookmarks that don't work.
#### Empathy Map
| Dimension | Details |
|-----------|---------|
| **SAYS** | "Where's the documentation for this?" · "Who should I ask about the payments service?" · "I don't want to bother anyone with a dumb question" · "The wiki says to use Jenkins but I think we use GitHub Actions now?" · "Can someone add me to the right Slack channels?" |
| **THINKS** | "Everyone else seems to know how this all fits together" · "I'm going to look stupid if I ask this again" · "This architecture diagram is from 2023, is it still accurate?" · "I have no idea what half these services do" · "Am I even looking at the right repo?" |
| **DOES** | Searches Slack history obsessively · Opens 47 browser tabs trying to map the system · Reads READMEs that haven't been updated in 18 months · Asks their buddy (assigned mentor) the same question three different ways · Builds a personal Notion doc trying to map services to teams · Sits in standup not understanding half the service names mentioned |
| **FEELS** | Overwhelmed — the system landscape is a fog · Anxious — "I should be productive by now" · Isolated — everyone's too busy to hand-hold · Frustrated — information exists but it's scattered across 6 tools · Embarrassed — asking "who owns this?" feels like admitting ignorance |
#### Pain Points
1. **The Scavenger Hunt** — Finding who owns what requires asking in Slack, searching Confluence, checking GitHub CODEOWNERS, and sometimes just guessing. There's no single source of truth.
2. **Stale Documentation** — 70% of internal docs are outdated. Jordan follows a setup guide, hits an error on step 3, and has no idea if the guide is wrong or they are.
3. **Invisible Architecture** — No one has drawn an accurate system diagram in over a year. The mental model of "how things connect" lives exclusively in senior engineers' heads.
4. **Social Cost of Questions** — Every question Jordan asks interrupts someone. After the third "hey, quick question" DM in a day, Jordan stops asking and starts guessing.
5. **Tool Sprawl** — Services are documented across GitHub READMEs, Confluence, Notion, Google Docs, and Slack pinned messages. There's no index of indexes.
#### Current Workarounds
- Personal Notion database mapping services → owners → repos (manually maintained, already drifting)
- Screenshot collection of architecture whiteboard photos from onboarding
- Bookmarked Slack threads with "useful" context that's already stale
- Relying heavily on one senior engineer who's becoming a bottleneck
- Reading git blame to figure out who last touched a service
#### Jobs To Be Done (JTBD)
- **When** I join a new company, **I want to** quickly understand the service landscape **so I can** contribute meaningfully without feeling lost.
- **When** I'm assigned a ticket involving an unfamiliar service, **I want to** find the owner and documentation in under 60 seconds **so I can** unblock myself without interrupting teammates.
- **When** I hear a service name in standup, **I want to** look it up instantly **so I can** follow the conversation and not feel like an outsider.
#### Day-in-the-Life Scenario
> **9:00 AM** — Jordan opens Slack. 47 unread messages across 12 channels. Half reference services Jordan has never heard of.
>
> **9:30 AM** — Standup. Tech lead mentions "the inventory-sync Lambda is flaking again." Jordan nods. Has no idea what inventory-sync does or where it lives.
>
> **10:00 AM** — Assigned first real ticket: "Add retry logic to the order-processor." Jordan searches GitHub for "order-processor." Three repos come up. Which one is the right one? The README in the first one says "DEPRECATED — see order-processor-v2." The v2 repo has no README.
>
> **10:45 AM** — Jordan DMs their mentor: "Hey, which repo is the actual order-processor?" Mentor is in a meeting. No response for 2 hours.
>
> **11:00 AM** — Jordan searches Confluence for "order-processor." Finds a page from 2024 with an architecture diagram that references services that no longer exist.
>
> **12:00 PM** — Lunch. Jordan feels behind. Other new hires from the same cohort seem to be figuring things out faster (they're not — they're just better at hiding it).
>
> **2:00 PM** — Mentor responds: "Oh yeah, it's order-processor-v2 but we actually call it order-engine now. The repo name is wrong. Talk to @sarah for the runbook."
>
> **2:30 PM** — Jordan DMs Sarah. Sarah is on PTO until Friday.
>
> **3:00 PM** — Jordan has spent 5 hours and written zero lines of code. The ticket is untouched. The frustration is real.
>
> **5:00 PM** — Jordan updates their personal Notion doc with everything learned today. Tomorrow they'll forget half of it.
---
### Persona 2: The Platform Engineer — "Alex"
**Profile:** Senior Platform Engineer, 6 years experience, has been maintaining the company's Backstage instance for 14 months. Alex is burned out. The Backstage instance is a Frankenstein monster of custom plugins, broken upgrades, and YAML files that nobody updates. Alex fantasizes about deleting the entire thing.
#### Empathy Map
| Dimension | Details |
|-----------|---------|
| **SAYS** | "I spend 40% of my time maintaining Backstage instead of building actual platform tooling" · "Nobody updates their catalog-info.yaml" · "The plugin broke again after the upgrade" · "I've become the Backstage janitor" · "We need to simplify this" |
| **THINKS** | "I didn't become an engineer to babysit a React monolith" · "If I leave, nobody can maintain this" · "Backstage was supposed to save us time, not create more work" · "Maybe we should just use a spreadsheet" · "The catalog is 60% lies at this point" |
| **DOES** | Writes custom Backstage plugins that break on every upgrade · Sends monthly "please update your catalog-info.yaml" Slack messages that everyone ignores · Manually fixes broken service entries · Runs Backstage upgrade migrations that take 2-3 days each · Writes documentation for Backstage that nobody reads · Attends Backstage community calls hoping for solutions that never come |
| **FEELS** | Exhausted — maintaining Backstage is a full-time job on top of their actual job · Resentful — "I'm a platform engineer, not a portal admin" · Trapped — too much invested to abandon Backstage, too broken to love it · Lonely — nobody else on the team understands or cares about the IDP · Skeptical — "every new tool promises to be different" |
#### Pain Points
1. **The Maintenance Tax** — Backstage requires constant care: plugin updates, dependency bumps, custom provider maintenance, auth config changes. It's a pet, not cattle.
2. **The YAML Lie** — `catalog-info.yaml` files are the foundation of Backstage, and they're fiction. Teams write them once during onboarding and never update them. The catalog drifts from reality within weeks.
3. **Plugin Fragility** — Community plugins are maintained by volunteers who disappear. Custom plugins break on Backstage upgrades. The plugin ecosystem is a house of cards.
4. **Zero Adoption Metrics** — Alex has no idea if anyone actually uses Backstage. There's no analytics. The portal might have 3 daily users or 30. Alex suspects it's closer to 3.
5. **Upgrade Dread** — Every Backstage version bump is a multi-day migration project. Alex is 4 versions behind because the last upgrade broke 3 plugins and took a week to fix.
#### Current Workarounds
- Wrote a cron job that checks for missing `catalog-info.yaml` files and posts in Slack (everyone mutes the channel)
- Maintains a "known broken" list of Backstage features they tell people to avoid
- Built a simple internal API that returns service ownership from GitHub CODEOWNERS as a backup
- Runs a quarterly "Backstage cleanup day" that nobody attends
- Has a secret spreadsheet that's actually more accurate than Backstage
#### Jobs To Be Done (JTBD)
- **When** I'm responsible for the developer portal, **I want** it to maintain itself **so I can** focus on building actual platform capabilities.
- **When** the catalog data drifts from reality, **I want** automatic reconciliation from source-of-truth systems **so I can** trust the data without manual intervention.
- **When** leadership asks "how's the IDP going?", **I want** real adoption metrics **so I can** prove value or make the case to change course.
#### Day-in-the-Life Scenario
> **8:30 AM** — Alex opens their laptop. Three Slack DMs overnight: "Backstage is showing the wrong owner for auth-service," "The API docs plugin isn't loading," and "Can you add our new service to the catalog?"
>
> **9:00 AM** — Alex investigates the wrong-owner issue. The `catalog-info.yaml` in the auth-service repo lists the previous team. The team was reorged 4 months ago. Nobody updated the YAML. Alex manually fixes it. This is the third time this month.
>
> **9:30 AM** — The API docs plugin. It broke after last week's Backstage patch update. Alex checks the plugin's GitHub repo. Last commit: 8 months ago. The maintainer hasn't responded to issues in 6 months. Alex starts debugging the plugin source code.
>
> **11:00 AM** — Still debugging the plugin. Alex considers writing a replacement from scratch. Estimates 2 weeks of work. Puts it on the backlog that's already 40 items deep.
>
> **11:30 AM** — "Can you add our new service?" Alex sends the team the `catalog-info.yaml` template for the 50th time. They'll fill it out wrong. Alex will fix it later.
>
> **1:00 PM** — Alex's actual platform work (building a new CI/CD pipeline template) has been untouched for 3 days. The sprint review is tomorrow. Alex has nothing to show.
>
> **3:00 PM** — Engineering Director asks: "How many teams are actively using the portal?" Alex doesn't know. Backstage has no built-in analytics. Alex says "most teams" and hopes nobody asks for specifics.
>
> **5:00 PM** — Alex updates their resume. Not seriously. Mostly seriously.
---
### Persona 3: The Engineering Director — "Priya"
**Profile:** Director of Engineering, manages 8 teams (62 engineers), reports to the VP of Engineering. Priya needs to answer hard questions about service ownership, production readiness, and engineering maturity — and currently can't answer any of them with confidence.
#### Empathy Map
| Dimension | Details |
|-----------|---------|
| **SAYS** | "Which teams own which services? I need a complete map" · "Are we production-ready for the SOC 2 audit?" · "How many services don't have runbooks?" · "I need to justify the platform team's headcount" · "Why did that incident take 45 minutes to route to the right team?" |
| **THINKS** | "I'm flying blind on service maturity" · "If an auditor asks me about ownership, I'm going to look incompetent" · "We're spending $200K/year on a platform engineer to maintain Backstage — is that worth it?" · "I bet there are zombie services costing us money that nobody owns" · "The new hires are taking too long to ramp up" |
| **DOES** | Asks team leads for service ownership spreadsheets (gets different answers from each) · Runs quarterly "production readiness" reviews that are manual and painful · Approves platform team budget without clear ROI metrics · Escalates during incidents when nobody knows who owns the broken service · Presents engineering maturity metrics to the VP that are mostly guesswork |
| **FEELS** | Anxious — lack of visibility into the service landscape is a leadership risk · Frustrated — simple questions ("who owns this?") shouldn't require a research project · Pressured — SOC 2 audit is in 3 months and the compliance evidence is scattered · Guilty — knows the platform team is drowning but can't justify more headcount without data · Impatient — "we've been talking about fixing this for a year" |
#### Pain Points
1. **The Ownership Black Hole** — No single, trustworthy source of service-to-team mapping. During incidents, precious minutes are wasted finding the right responder.
2. **Compliance Anxiety** — SOC 2 auditors will ask "show me every service that handles PII and its owner." Today, answering this requires a multi-week manual audit.
3. **Invisible ROI** — The platform team maintains Backstage, but Priya can't quantify the value. Is it saving time? Is anyone using it? She's spending $200K/year on vibes.
4. **Onboarding Drag** — New engineers take 3-4 weeks to become productive. Priya suspects poor internal tooling is a major factor but can't prove it.
5. **Zombie Infrastructure** — Priya knows there are services and AWS resources that nobody owns. They cost money, create security risk, and nobody's accountable.
#### Current Workarounds
- Quarterly manual audit where each team lead submits a spreadsheet of their services (takes 2 weeks, outdated by the time it's compiled)
- Incident post-mortems that repeatedly cite "unclear ownership" as a contributing factor
- A shared Google Sheet titled "Service Ownership Map" that 4 different people maintain with conflicting data
- Relies on tribal knowledge from senior engineers who've been at the company 3+ years
- Asks the platform engineer (Alex) for ad-hoc reports that take days to compile
#### Jobs To Be Done (JTBD)
- **When** an incident occurs, **I want** instant, authoritative service-to-owner mapping **so that** mean time to resolution isn't inflated by "who owns this?" confusion.
- **When** preparing for a compliance audit, **I want** an always-current inventory of services, owners, and maturity attributes **so that** I can produce evidence in minutes, not weeks.
- **When** justifying platform team investment, **I want** concrete adoption and value metrics **so that** I can defend the budget with data, not anecdotes.
#### Day-in-the-Life Scenario
> **7:30 AM** — Priya checks her phone. PagerDuty alert from 2 AM: "payment-gateway latency spike." The incident channel shows 15 minutes of "does anyone know who owns payment-gateway?" before the right engineer was found. MTTR: 47 minutes. Should have been 15.
>
> **9:00 AM** — Leadership sync. VP asks: "How many of our services meet production readiness standards?" Priya says "most of the critical path services." The VP asks for a number. Priya doesn't have one. She promises a report by Friday.
>
> **10:00 AM** — Priya asks Alex (platform engineer) to generate a production readiness report. Alex says it'll take 3-4 days because the data is scattered across Backstage (partially accurate), GitHub (partially complete), and team leads' heads (partially available).
>
> **11:00 AM** — SOC 2 prep meeting. Compliance team asks: "Can you provide a list of all services that process customer data, their owners, and their security controls?" Priya's stomach drops. She knows this will be a fire drill.
>
> **1:00 PM** — 1:1 with a new hire (Jordan). Jordan mentions it took 3 days to figure out which repo to work in. Priya makes a mental note to "fix onboarding" — the same note she's made every quarter for a year.
>
> **3:00 PM** — Budget review. CFO asks why the platform team needs another engineer. Priya can't quantify the current IDP's value. The headcount request is deferred to next quarter.
>
> **4:30 PM** — Priya opens the "Service Ownership Map" Google Sheet. It was last updated 6 weeks ago. Three services listed don't exist anymore. Two new services aren't listed. She closes the tab and sighs.
>
> **6:00 PM** — Priya drafts an email to her team leads: "Please update the service ownership spreadsheet by EOW." She knows from experience that 3 out of 8 will respond, and the data will be inconsistent.
---
### Cross-Persona Insight: The Emotional Thread
There's a jazz chord that connects all three personas — it's the **anxiety of not knowing**:
- **Jordan** doesn't know where things are and is afraid to ask.
- **Alex** doesn't know if anyone uses what they've built and is afraid to find out.
- **Priya** doesn't know the true state of her engineering org and is afraid to be exposed.
The product that resolves this anxiety — that replaces "I don't know" with "I can look it up in 2 seconds" — wins not just their budget, but their loyalty.
> *"The best products don't just solve problems. They dissolve the anxiety that surrounds the problem. That's the difference between a tool and a companion."*
---
## Phase 2: DEFINE 🎯
Time to distill the empathy into precision. We've sat in the chairs. We've felt the splinters. Now we name the splinters — because you can't pull out what you can't name.
---
### Point-of-View (POV) Statements
A POV statement is a design brief in one sentence. It reframes the problem from the user's emotional reality, not the product team's assumptions.
**POV 1 — The New Hire (Jordan)**
> Jordan, a newly hired engineer drowning in scattered tribal knowledge, needs a way to instantly understand the service landscape and find who owns what — because every hour spent on a scavenger hunt for basic context is an hour of productivity lost and confidence eroded.
**POV 2 — The Platform Engineer (Alex)**
> Alex, a burned-out platform engineer trapped maintaining a Backstage instance that nobody trusts, needs a developer portal that maintains itself from real infrastructure data — because the current model of human-maintained YAML catalogs is a lie that creates more work than it eliminates.
**POV 3 — The Engineering Director (Priya)**
> Priya, an engineering director who can't answer basic questions about service ownership and maturity, needs always-current, authoritative visibility into her engineering org's service landscape — because flying blind on ownership creates incident response delays, compliance risk, and an inability to justify platform investment.
---
### "How Might We" Questions
HMW questions are the bridge between empathy and ideation. Each one is a door. Some lead to hallways. Some lead to ballrooms. Let's open them all.
#### Discoverability & Knowledge
1. **HMW** make the entire service landscape searchable in under 2 seconds, so that "who owns this?" is never a Slack question again?
2. **HMW** eliminate stale documentation by generating service context directly from infrastructure reality (AWS, GitHub, K8s) instead of human-written YAML?
3. **HMW** surface the "invisible architecture" — the actual dependency graph — without requiring anyone to draw or maintain diagrams?
4. **HMW** make a new hire's first week feel like orientation instead of archaeology?
#### Maintenance & Trust
5. **HMW** build a catalog that stays accurate without requiring any engineer to manually update it — ever?
6. **HMW** give platform engineers their time back by replacing portal maintenance with portal auto-healing?
7. **HMW** create confidence scores for catalog data so users know what's verified vs. inferred, rather than treating everything as equally trustworthy (or untrustworthy)?
8. **HMW** make the portal's data freshness visible and transparent, so trust is earned through proof, not promises?
#### Visibility & Accountability
9. **HMW** give engineering leaders a real-time, always-current view of service ownership, maturity, and gaps — without quarterly manual audits?
10. **HMW** turn "who owns this?" from a 15-minute incident delay into a 0-second lookup?
11. **HMW** make production readiness measurable and visible so that teams self-correct without top-down mandates?
12. **HMW** surface zombie services and orphaned infrastructure automatically, so cost waste and security risk don't hide in the shadows?
#### Adoption & Stickiness
13. **HMW** make the portal so useful for daily work that engineers voluntarily set it as their browser homepage?
14. **HMW** meet engineers where they already are (Slack, terminal, VS Code) instead of asking them to visit yet another dashboard?
15. **HMW** create a "5-minute setup to first insight" experience that makes Backstage's months-long onboarding feel absurd?
16. **HMW** design the portal so that using it is faster than NOT using it — so adoption is driven by selfishness, not mandates?
---
### Key Insights
These are the non-obvious truths that emerged from empathy. They're the foundation of everything we design.
**Insight 1: The Real Problem Isn't Missing Data — It's Scattered Data**
The information exists. Ownership is in CODEOWNERS. Descriptions are in READMEs. Health is in CloudWatch. On-call is in PagerDuty. The problem isn't that nobody documented anything — it's that nobody aggregated it. The portal's job isn't to create new data. It's to unify existing data into one searchable surface.
**Insight 2: Manual Maintenance Is a Design Flaw, Not a User Failure**
Backstage blames engineers for not updating YAML. That's like blaming users for not reading the manual. If your product requires humans to do repetitive, low-reward maintenance to function, your product is broken. Auto-discovery isn't a feature — it's the correction of a fundamental design error.
**Insight 3: The New Hire Is the Canary in the Coal Mine**
If a new hire can't find what they need in their first week, the entire org has a knowledge management problem. They're just the ones who feel it most acutely. Solving for Jordan solves for everyone — because every engineer is a "new hire" to the 80% of services they've never touched.
**Insight 4: Trust Is Binary — And It's Earned on First Contact**
If an engineer opens the portal and sees one wrong owner or one stale description, they close the tab and never come back. Trust in a catalog is binary: either you trust it or you don't. There's no "mostly trust." This means accuracy on day one matters more than feature completeness. Ship less, but ship truth.
**Insight 5: The Portal Must Be Faster Than Slack**
The current workaround for "who owns this?" is asking in Slack. If the portal is slower than typing a Slack message and waiting for a response, the portal loses. The bar isn't "better than Backstage." The bar is "faster than asking a human." That's Cmd+K in under 2 seconds.
**Insight 6: Directors Don't Need Dashboards — They Need Answers**
Priya doesn't want a dashboard with 47 widgets. She wants to answer three questions: "Who owns what?", "Are we production-ready?", and "What's falling through the cracks?" If the portal can answer those three questions on demand, Priya becomes the champion who sells it to the VP.
---
### Core Tension: Comprehensiveness vs. Simplicity
Here's the jazz dissonance at the heart of this product. It's the tension that will define every design decision:
```
COMPREHENSIVENESS SIMPLICITY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Show me everything about "Just tell me who owns
every service: dependencies, this and where the repo
health, cost, SLOs, runbooks, is. I don't need a
deployment history, tech stack, dissertation."
security posture, compliance
status..."
→ This is Backstage/Port/Cortex. → This is a spreadsheet.
Powerful but overwhelming. Simple but insufficient.
Nobody uses it because it's Everyone outgrows it
too much. in 3 months.
```
**The Resolution: Progressive Disclosure**
dd0c/portal must be a spreadsheet on the surface and a platform underneath. The default view is radically simple: service name, owner, health, repo link. One line per service. Scannable in 2 seconds.
But depth is one click away. Click the service → dependencies, cost, deployment history, runbook, scorecard. The complexity exists but it doesn't assault you on arrival.
The design principle: **"Calm surface, deep ocean."**
This is how you beat both Backstage (too complex on arrival) AND the Google Sheet (too shallow to grow into). You start simpler than a spreadsheet and grow deeper than Backstage — but only when the user asks for depth.
> *"In jazz, the notes you don't play matter more than the ones you do. The portal's default state should be silence — clean, calm, just the essentials. The solo comes when you lean in."*
---
# dd0c/portal — V1 MVP Epics & Stories
This document outlines the complete set of Epics for the V1 MVP of dd0c/portal, a Lightweight Internal Developer Portal. The scope is strictly limited to the "5-Minute Miracle" auto-discovery from AWS and GitHub, Cmd+K search, basic catalog UI, Slack bot, and self-serve PLG onboarding. No AI agent, no GitLab, no scorecards in V1.
---
## Epic 1: AWS Discovery Engine
**Description:** Build the core AWS scanning capability using a read-only cross-account IAM role. This engine must enumerate CloudFormation stacks, ECS services, Lambda functions, API Gateway APIs, and RDS instances, and group them into inferred "services" based on naming conventions and tags.
### User Stories
**Story 1.1: Cross-Account Role Assumption**
*As a Platform Engineer, I want the discovery engine to securely assume a read-only IAM role in my AWS account using an External ID, so that my infrastructure data can be scanned without sharing long-lived credentials.*
- **Acceptance Criteria:**
- System successfully assumes cross-account role using AWS STS.
- Role assumption enforces a tenant-specific `ExternalId`.
- Failure to assume role surfaces a clear error (e.g., "Role not found" or "Invalid ExternalId").
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Use `boto3` STS client. Ensure the Step Functions orchestrator passes the correct tenant configuration.
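A minimal Python sketch of the role-assumption step, assuming a per-tenant connection record that holds the role ARN and External ID (the `TenantConnection` shape and error-message strings are illustrative, not the actual schema):

```python
from dataclasses import dataclass


@dataclass
class TenantConnection:
    role_arn: str      # e.g. arn:aws:iam::<account>:role/dd0c-readonly
    external_id: str   # tenant-specific ExternalId (confused-deputy guard)


def assume_role_params(conn: TenantConnection, session_name: str = "dd0c-discovery") -> dict:
    # Parameters for sts.assume_role(); short-lived credentials, one scan per session
    return {
        "RoleArn": conn.role_arn,
        "RoleSessionName": session_name,
        "ExternalId": conn.external_id,
        "DurationSeconds": 900,
    }


def friendly_sts_error(error_code: str) -> str:
    # Map raw STS error codes to the clear messages the acceptance criteria call for
    return {
        "AccessDenied": "Invalid ExternalId or missing trust policy",
        "NoSuchEntity": "Role not found",
    }.get(error_code, f"AssumeRole failed: {error_code}")


def assume_tenant_role(conn: TenantConnection) -> dict:
    # Imported lazily so the pure helpers above stay testable without AWS
    import boto3
    from botocore.exceptions import ClientError

    sts = boto3.client("sts")
    try:
        return sts.assume_role(**assume_role_params(conn))["Credentials"]
    except ClientError as e:
        raise RuntimeError(friendly_sts_error(e.response["Error"]["Code"])) from e
```

The Step Functions orchestrator would call `assume_tenant_role` once per tenant and hand the temporary credentials to each scanner.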
**Story 1.2: CloudFormation & Tag Scanner**
*As a System, I want to scan CloudFormation stacks and tagged Resource Groups, so that I can group individual AWS resources into high-confidence "services".*
- **Acceptance Criteria:**
- System extracts stack names and maps them to service names (Confidence: 0.95).
- System extracts `service`, `team`, or `project` tags from resources.
- Lists resources belonging to each stack/tag group.
- **Estimate:** 3
- **Dependencies:** Story 1.1
- **Technical Notes:** Use `cloudformation:DescribeStacks` and `resourcegroupstaggingapi:GetResources`. Parallelize across regions if specified.
**Story 1.3: Compute Resource Enumeration (ECS & Lambda)**
*As a System, I want to list all ECS services and Lambda functions, so that I can discover deployable compute units.*
- **Acceptance Criteria:**
- System lists all ECS clusters, services, and task definitions (extracting container images).
- System lists all Lambda functions and their API Gateway event source mappings.
- Standalone compute resources without a CFN stack are still captured as potential services.
- **Estimate:** 5
- **Dependencies:** Story 1.1
- **Technical Notes:** Requires pagination handling. Lambda cold starts for this Python scanner are acceptable. Output mapped payload to SQS for the Reconciler.
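The pagination handling boils down to a token loop. boto3 ships `client.get_paginator()` for most `list_*` calls; this hand-rolled sketch just shows the shape (the token key name is a parameter because it varies per API):

```python
def paginate(list_page, items_key, token_key="nextToken", **kwargs):
    """Generic token-based pagination loop.

    Yields items from every page until the API stops returning a token.
    `list_page` is any callable with boto3-style list semantics.
    """
    token = None
    while True:
        page = list_page(**kwargs, **({token_key: token} if token else {}))
        yield from page.get(items_key, [])
        token = page.get(token_key)
        if not token:
            break
```

ECS's `list_services` uses `nextToken` on both request and response; Lambda's `list_functions` uses `Marker` in and `NextMarker` out — one reason to prefer boto3's built-in paginators in production code.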
**Story 1.4: Database Enumeration (RDS)**
*As a System, I want to list RDS instances, so that I can associate data stores with their consuming services.*
- **Acceptance Criteria:**
- System lists RDS instances and their tags.
- Maps databases to services based on naming prefixes or CFN stack membership.
- **Estimate:** 2
- **Dependencies:** Story 1.1, Story 1.2
- **Technical Notes:** Use `rds:DescribeDBInstances`. These are marked as infrastructure attached to a service, rather than services themselves.
## Epic 2: GitHub Discovery
**Description:** Implement the org-wide GitHub scanning to extract repositories, primary languages, CODEOWNERS, README content, and GitHub Actions deployments. Cross-reference this with the AWS scanner output.
### User Stories
**Story 2.1: Org & Repo Scanner**
*As a System, I want to use the GitHub GraphQL API to list all repositories, their primary language, and commit history, so that I can map the codebase landscape.*
- **Acceptance Criteria:**
- System lists all active (non-archived, non-forked) repos in the connected org.
- System extracts primary language and top 5 committers per repo.
- GraphQL queries are batched to avoid rate limits (up to 100 repos per call).
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Implement using `octokit` in a Node.js Lambda function.
**Story 2.2: CODEOWNERS & README Extraction**
*As a System, I want to extract and parse the CODEOWNERS and README files from the default branch, so that I can infer service ownership and descriptions.*
- **Acceptance Criteria:**
- System parses `HEAD:CODEOWNERS` and extracts mapped GitHub teams or individuals.
- System parses `HEAD:README.md` and extracts the first descriptive paragraph.
- If a team is listed in CODEOWNERS, it becomes a candidate for service ownership (Weight: 0.40).
- **Estimate:** 3
- **Dependencies:** Story 2.1
- **Technical Notes:** The GraphQL expression `HEAD:CODEOWNERS` retrieves blob text efficiently in the same repo query.
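Once the blob text is in hand, the CODEOWNERS parse is straightforward. A sketch assuming standard CODEOWNERS syntax (`#` comments, then a file pattern followed by owner handles); email-style owners are deliberately skipped here since ownership inference maps to GitHub teams and users:

```python
def parse_codeowners(text: str) -> list[str]:
    """Extract owner handles (@user or @org/team) from a CODEOWNERS blob.

    Later lines win in GitHub's own matching rules, but for ownership
    inference we just collect every distinct @-handle, in order.
    """
    seen, owners = set(), []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        if not line:
            continue
        _pattern, *entries = line.split()     # a bare pattern (no owners) is valid
        for entry in entries:
            if entry.startswith("@") and entry not in seen:
                seen.add(entry)
                owners.append(entry)
    return owners
```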
**Story 2.3: CI/CD Target Extraction**
*As a System, I want to scan `.github/workflows/deploy.yml` for deployment targets, so that I can link repositories to specific AWS infrastructure (ECS/Lambda).*
- **Acceptance Criteria:**
- System parses workflow YAML to find deployment actions (e.g., `aws-actions/amazon-ecs-deploy-task-definition`).
- System links the repo to a discovered AWS service if the task definition name matches.
- **Estimate:** 5
- **Dependencies:** Story 2.1, Epic 1
- **Technical Notes:** This is crucial for cross-referencing in the Reconciliation Engine.
**Story 2.4: Reconciliation & Ownership Inference Engine**
*As a System, I want to merge the AWS infrastructure payloads and the GitHub repository payloads into a unified service entity, and calculate a confidence score for ownership.*
- **Acceptance Criteria:**
- AWS resources are deduplicated and mapped to corresponding GitHub repos based on tags, deploy workflows, or fuzzy naming (e.g., "payment-api" matches "payment-service").
- Ownership is scored based on CODEOWNERS, Git blame frequency, and tags.
- Final merged entity is pushed to PostgreSQL as a "Service".
- **Estimate:** 8
- **Dependencies:** Story 1.2, 1.3, 2.1, 2.2, 2.3
- **Technical Notes:** This runs in a Node.js Lambda triggered by the Step Functions orchestrator. It processes batches from the SQS queues holding AWS and GitHub scan results.
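The fuzzy-naming match (e.g. "payment-api" → "payment-service") can be sketched with stdlib `difflib` — the suffix list and the 0.9 stem-match score are assumptions for illustration, not tuned values:

```python
from difflib import SequenceMatcher

# Suffixes that commonly differ between infra names and repo names (illustrative)
COMMON_SUFFIXES = ("-api", "-service", "-svc", "-app", "-worker", "-lambda")


def stem(name: str) -> str:
    name = name.lower()
    for suffix in COMMON_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def name_match_score(aws_name: str, repo_name: str) -> float:
    """Score how likely an AWS service and a GitHub repo are the same entity."""
    a, b = stem(aws_name), stem(repo_name)
    if a == b:
        return 0.9   # identical stems, e.g. "payment-api" vs "payment-service"
    return round(SequenceMatcher(None, a, b).ratio(), 2)
```

In the reconciler, exact tag or deploy-workflow matches would trump a fuzzy name score, which only serves as a weaker fallback signal.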
## Epic 3: Service Catalog
**Description:** Implement the primary datastore (Aurora Serverless v2 PostgreSQL), the Service API (CRUD), ownership mapping logic, and manual enrichment endpoints. This is the source of truth for the entire platform.
### User Stories
**Story 3.1: Core Catalog Schema Setup**
*As a System, I want a PostgreSQL relational database to store tenants, services, users, and teams, so that the discovered catalog data is durably stored.*
- **Acceptance Criteria:**
- Create the schema per the architecture (tenants, users, connections, services, teams, service_ownership, discovery_runs).
- Apply multi-tenant Row-Level Security (RLS) on all core tables using `tenant_id`.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Use Prisma or raw SQL migrations. Implement API middleware to set `tenant_id` on every request.
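The RLS setup the acceptance criteria describe could look like this — a sketch assuming the API middleware sets an `app.tenant_id` session variable per request (table and setting names are illustrative):

```sql
ALTER TABLE services ENABLE ROW LEVEL SECURITY;
ALTER TABLE services FORCE ROW LEVEL SECURITY;  -- apply to the table owner too

CREATE POLICY tenant_isolation ON services
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- API middleware, once per request/transaction:
-- SET LOCAL app.tenant_id = '<tenant uuid from the auth token>';
```

The same policy would be repeated for each core table (tenants, users, connections, teams, service_ownership, discovery_runs).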
**Story 3.2: Service Ownership Model Implementation**
*As a System, I want a `service_ownership` table and scoring logic, so that multiple ownership signals (CODEOWNERS vs Git Blame vs CFN tags) can be tracked and weighted.*
- **Acceptance Criteria:**
- System stores multiple candidate teams per service with ownership types and confidence scores.
- The highest-confidence team becomes the primary owner.
- If top scores are tied or under 0.50, flag the service as "ambiguous" or "unowned".
- **Estimate:** 5
- **Dependencies:** Story 3.1
- **Technical Notes:** Implement scoring algorithm in Python Lambda and map back to PostgreSQL.
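A sketch of the scoring algorithm. The 0.40 CODEOWNERS weight comes from Story 2.2; the other weights are assumed placeholders, and the tie/sub-0.50 flagging follows the acceptance criteria above:

```python
# CODEOWNERS weight per Story 2.2; the other two weights are assumptions
SIGNAL_WEIGHTS = {"codeowners": 0.40, "cfn_tag": 0.35, "git_blame": 0.25}


def score_ownership(signals):
    """signals: list of (team_name, signal_type) pairs from the scanners.

    Returns (primary_owner, status, per_team_scores). A tied or sub-0.50
    top score flags the service as ambiguous; no signals means unowned.
    """
    scores: dict[str, float] = {}
    for team, signal in signals:
        scores[team] = round(scores.get(team, 0.0) + SIGNAL_WEIGHTS.get(signal, 0.0), 2)
    if not scores:
        return None, "unowned", scores
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    top_team, top_score = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_score
    if top_score < 0.50 or tied:
        return top_team, "ambiguous", scores
    return top_team, "confirmed", scores
```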
**Story 3.3: Manual Correction API**
*As an Engineer, I want to manually override the inferred owner or description of a service, so that I can fix incorrect auto-discovered data.*
- **Acceptance Criteria:**
- `PATCH /api/v1/services/{service_id}` allows updates to `team_id`, `description`, etc.
- Manual corrections override inferred data with a confidence score of 1.00 (`source="user_correction"`).
- The correction is saved in the `corrections` log table.
- **Estimate:** 3
- **Dependencies:** Story 3.1
- **Technical Notes:** The update should trigger an async background job (SQS) to update the Meilisearch index immediately.
**Story 3.4: PagerDuty / OpsGenie Integration**
*As an Engineering Manager, I want to link my team to a PagerDuty schedule, so that the service catalog shows the current on-call engineer for each service.*
- **Acceptance Criteria:**
- The API allows saving an encrypted PagerDuty API key per tenant.
- The system maps PagerDuty schedules to inferred Teams.
- `GET /api/v1/services/{service_id}` returns the active on-call individual.
- **Estimate:** 5
- **Dependencies:** Story 3.1
- **Technical Notes:** Credentials must be stored in AWS Secrets Manager using KMS. Use the PagerDuty `GET /schedules` API endpoint.
## Epic 4: Search Engine
**Description:** Deploy Meilisearch and a Redis cache to support a <100ms Cmd+K search bar for the entire portal. The search must be typo-tolerant and isolate tenant data perfectly.
### User Stories
**Story 4.1: Meilisearch Index Sync**
*As a System, I want to sync service entities from PostgreSQL to Meilisearch, so that I have a fast, full-text index for the UI.*
- **Acceptance Criteria:**
- On every service upsert in PostgreSQL, publish to SQS.
- A Lambda consumes the SQS queue and pushes updates to Meilisearch via `addDocuments`.
- The index configuration is applied (searchable attributes, filterable attributes, typo-tolerance enabled).
- **Estimate:** 5
- **Dependencies:** Story 3.1
- **Technical Notes:** The index sync must map relational data to a flat JSON structure.
**Story 4.2: Cmd+K Search Endpoint & Security**
*As an Engineer, I want a fast `/api/v1/services/search` endpoint that queries Meilisearch, so that I can instantly find services.*
- **Acceptance Criteria:**
- API receives query and proxy queries to Meilisearch.
- API enforces tenant isolation by injecting `tenant_id = '{current_tenant}'` filter.
- Search returns results in <100ms.
- **Estimate:** 3
- **Dependencies:** Story 4.1
- **Technical Notes:** Implement in Node.js Fastify.
**Story 4.3: Prefix Caching with Redis**
*As a System, I want to cache the most common searches in Redis, so that I can return results in <10ms for repeated queries.*
- **Acceptance Criteria:**
- Cache identical query prefixes per tenant in ElastiCache Redis.
- Set TTL to 5 minutes or invalidate on service upserts.
- Redis cache miss falls back to Meilisearch.
- **Estimate:** 2
- **Dependencies:** Story 4.2
- **Technical Notes:** Serverless Redis pricing scales with usage, so the cache stays cheap at low query volume. Use normalized queries as the cache key.
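The cache-aside pattern above can be sketched with redis-py's `get`/`setex`; the key scheme (SHA-1 of the normalized query, namespaced by tenant) is an assumption:

```python
import hashlib
import json

TTL_SECONDS = 300  # 5 minutes, per the acceptance criteria


def cache_key(tenant_id: str, query: str) -> str:
    normalized = " ".join(query.lower().split())   # case/whitespace-insensitive
    digest = hashlib.sha1(normalized.encode()).hexdigest()
    return f"search:{tenant_id}:{digest}"


def cached_search(redis_client, tenant_id, query, search_fn):
    """Cache-aside: a Redis hit returns in ~1ms, a miss falls through to Meilisearch."""
    key = cache_key(tenant_id, query)
    hit = redis_client.get(key)
    if hit is not None:
        return json.loads(hit)
    results = search_fn(query)                                  # Meilisearch fallback
    redis_client.setex(key, TTL_SECONDS, json.dumps(results))   # redis-py setex(name, time, value)
    return results
```

Upsert-driven invalidation would delete the tenant's `search:{tenant_id}:*` keys when the catalog changes, so stale results never outlive the 5-minute TTL.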
## Epic 5: Service Cards UI
**Description:** Build the React Single Page Application (SPA) providing a fast, scannable catalog interface, Cmd+K search dialog, and detailed service cards.
### User Stories
**Story 5.1: SPA Framework & Routing Setup**
*As an Engineer, I want the portal built in React with Vite and TailwindCSS, so that the UI is fast and responsive.*
- **Acceptance Criteria:**
- Setup React + Vite + React Router.
- Deploy to S3 and CloudFront with edge caching.
- Implement a basic authentication context tied to GitHub OAuth.
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** The frontend state must be minimal, relying heavily on SWR or React Query for data fetching.
**Story 5.2: The Cmd+K Modal**
*As an Engineer, I want to press Cmd+K to open a global search bar, so that I can find a service or team instantly from anywhere.*
- **Acceptance Criteria:**
- Pressing Cmd+K (or Ctrl+K) opens a modal overlay.
- Input debounces keystrokes by 150ms before calling `/api/v1/services/search`.
- Arrow keys navigate search results, Enter selects a service.
- **Estimate:** 5
- **Dependencies:** Story 4.2, Story 5.1
- **Technical Notes:** Use an accessible dialog library like Radix UI or Headless UI. Ensure ARIA labels are correct.
**Story 5.3: Scannable Service List View**
*As an Engineering Director, I want to view all my organization's services in a single-line-per-service table, so that I can quickly review ownership and health.*
- **Acceptance Criteria:**
- The default dashboard displays a paginated or infinite-scroll table of services.
- Columns: Name, Owner (with confidence badge), Health, Last Deploy, Repo Link.
- The table is filterable by team, health, and tech stack.
- **Estimate:** 5
- **Dependencies:** Story 3.1, Story 5.1
- **Technical Notes:** Use a lightweight table component (e.g., TanStack Table).
**Story 5.4: Service Detail View (Progressive Disclosure)**
*As an Engineer, I want to click on a service row to see its full details in a slide-over panel or expanded card, so that I can dive deeper when necessary.*
- **Acceptance Criteria:**
- Clicking a service expands an in-page panel showing full details.
- Tabs separate data: Overview (description, stack), Infra (AWS ARNs), On-Call, Corrections.
- There is a "Correct" button next to the inferred owner or description.
- **Estimate:** 8
- **Dependencies:** Story 3.4, Story 5.1, Story 5.3
- **Technical Notes:** Lazy-load tab contents from the API (`GET /api/v1/services/{service_id}`) to minimize initial render payload.
## Epic 6: Dashboard & Overview
**Description:** Implement the org-wide and team-specific dashboards that provide aggregate health, ownership matrix, and discovery status.
### User Stories
**Story 6.1: The Director Dashboard**
*As an Engineering Director, I want an org-wide view of my total services, teams, unowned services, and discovery accuracy, so that I can report to leadership and ensure compliance.*
- **Acceptance Criteria:**
- The dashboard displays four summary KPIs: Total Services, Total Teams, Unowned Services, Accuracy Trend (week-over-week).
- Contains a specific card showing "Services needing review".
- A "Recent Activity" feed lists the latest deploys or ownership changes.
- **Estimate:** 5
- **Dependencies:** Story 3.1, Story 5.1
- **Technical Notes:** The API needs a new `/api/v1/dashboard/stats` endpoint to compute these KPIs efficiently using PostgreSQL aggregations.
**Story 6.2: "My Team" Dashboard Focus**
*As an Engineer, I want the dashboard to default to showing services owned by my team, so that I don't have to filter out the noise of the entire org.*
- **Acceptance Criteria:**
- The UI uses the logged-in user's GitHub ID to find their Team.
- A "Your Team" section lists only their services with immediate health status.
- "Recent" section shows the last 5 services the user viewed.
- **Estimate:** 3
- **Dependencies:** Story 3.1, Story 6.1
- **Technical Notes:** Use browser local storage to save the "recent views".
## Epic 7: Slack Bot
**Description:** Build the Slack integration to allow engineers to query the service catalog using `/dd0c who owns <service>` and `/dd0c oncall <service>`.
### User Stories
**Story 7.1: Slack App & OAuth Setup**
*As an Administrator, I want to add a dd0c Slack app to my workspace, so that engineers can use slash commands.*
- **Acceptance Criteria:**
- Create the Slack App using `@slack/bolt`.
- The API handles the OAuth flow and saves the workspace token to the tenant `connections` table.
- The bot is added to a tenant's workspace and receives slash commands.
- **Estimate:** 3
- **Dependencies:** Story 3.1
- **Technical Notes:** The Slack Bot Lambda must verify request signatures using the app's signing secret (legacy verification tokens are deprecated) and return a 200 OK within 3 seconds.
**Story 7.2: Slash Command: /dd0c who owns**
*As an Engineer, I want to type `/dd0c who owns <service_name>`, so that the bot instantly replies with the owner, repo, and health.*
- **Acceptance Criteria:**
- The bot receives the command, extracts the service name, and calls `GET /api/v1/services/search`.
- The bot formats the response as a Slack block kit message with the service name, owner team, confidence score, repo link, and a link to the portal.
- The response is ephemeral unless the user specifies otherwise.
- **Estimate:** 3
- **Dependencies:** Story 4.2, Story 7.1
- **Technical Notes:** Use Meilisearch directly or the API. Ensure the search handles typo-tolerance if the user misspells the service.
**Story 7.3: Slash Command: /dd0c oncall**
*As an Engineer, I want to type `/dd0c oncall <service_name>`, so that the bot instantly tells me who is currently on-call for that service.*
- **Acceptance Criteria:**
- The bot receives the command, looks up the service, and queries PagerDuty (via the API).
- The bot returns the active on-call individual and schedule details.
- **Estimate:** 3
- **Dependencies:** Story 3.4, Story 7.2
- **Technical Notes:** If no on-call is configured, the bot returns a friendly error with a link to set it up in the portal.
## Epic 8: Infrastructure & DevOps
**Description:** Implement the AWS infrastructure for the dd0c platform itself using Infrastructure as Code, set up the CI/CD pipeline via GitHub Actions, and deploy the foundational resources including VPC, ECS Fargate clusters, RDS Aurora, and the cross-account IAM role assumption engine.
### User Stories
**Story 8.1: Core AWS Foundation (VPC, RDS, ElastiCache)**
*As a System, I need a secure VPC and data tier, so that the Portal API and Discovery Engine have durable, isolated storage.*
- **Acceptance Criteria:**
- Deploy VPC with public and private subnets.
- Provision Aurora Serverless v2 PostgreSQL database in private subnets.
- Provision ElastiCache Redis (Serverless) for caching and sessions.
- Deploy KMS keys for credential encryption.
- **Estimate:** 5
- **Dependencies:** None
- **Technical Notes:** Use AWS CDK or CloudFormation. Ensure all data stores are encrypted at rest using KMS.
**Story 8.2: ECS Fargate Cluster & Portal API Deployment**
*As a System, I need an ECS Fargate cluster running the Portal API and Meilisearch, so that the web application and search engine are highly available.*
- **Acceptance Criteria:**
- Create ECS cluster.
- Deploy Portal API Fargate service behind an Application Load Balancer.
- Deploy Meilisearch Fargate service with an EFS volume for persistent index storage.
- Configure auto-scaling rules based on CPU and request count.
- **Estimate:** 5
- **Dependencies:** Story 8.1
- **Technical Notes:** Meilisearch only needs 1 task initially. Use multi-stage Docker builds to keep image sizes small.
**Story 8.3: Discovery Engine Serverless Deployment**
*As a System, I need the Step Functions orchestrator and Lambda scanners deployed, so that the auto-discovery pipeline can execute.*
- **Acceptance Criteria:**
- Deploy AWS Scanner, GitHub Scanner, Reconciler, and Inference Lambdas.
- Deploy the Step Functions state machine linking the Lambdas.
- Provision SQS FIFO queues for discovery events.
- **Estimate:** 3
- **Dependencies:** Epic 1, Epic 2, Story 8.1
- **Technical Notes:** Lambdas must have VPC access to write to Aurora, but need a NAT Gateway to reach the GitHub API.
**Story 8.4: CI/CD Pipeline via GitHub Actions**
*As an Engineer, I want automated CI/CD pipelines, so that I can safely build, test, and deploy the platform without manual intervention.*
- **Acceptance Criteria:**
- CI workflow runs linting, unit tests, and Trivy container scanning on every PR.
- CD workflow deploys to staging environment, runs a discovery accuracy smoke test, and waits for manual approval to deploy to production.
- Deployment updates ECS services and Lambda aliases seamlessly.
- **Estimate:** 5
- **Dependencies:** Story 8.2, Story 8.3
- **Technical Notes:** Keep it simple—use GitHub Actions natively, avoid complex external CD tools for V1.
**Story 8.5: Customer IAM Role Template**
*As an Administrator, I want a standardized CloudFormation template to deploy in my AWS account, so that I can easily grant dd0c read-only access.*
- **Acceptance Criteria:**
- Template creates an IAM role with a strict read-only policy mapped to dd0c's required services.
- Trust policy mandates a tenant-specific `ExternalId`.
- Template output provides the Role ARN.
- Template is hosted publicly on S3.
- **Estimate:** 2
- **Dependencies:** Epic 1
- **Technical Notes:** Avoid using `ReadOnlyAccess` managed policy. Explicitly deny IAM, S3 object access, and Secrets Manager.
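The trust policy the template emits might look like the following sketch, using the standard cross-account `ExternalId` condition (the confused-deputy mitigation); the placeholders are filled in per tenant at generation time:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<DD0C_ACCOUNT_ID>:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<TENANT_SPECIFIC_EXTERNAL_ID>" }
      }
    }
  ]
}
```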
## Epic 9: Onboarding & PLG
**Description:** Build the 5-minute self-serve onboarding wizard that drives the Product-Led Growth (PLG) motion. This flow must flawlessly guide the user through GitHub OAuth, AWS connection, and trigger the initial real-time discovery scan.
### User Stories
**Story 9.1: GitHub OAuth & Tenant Creation**
*As a New User, I want to sign up using my GitHub account, so that I don't have to create a new password and the system can instantly identify my organization.*
- **Acceptance Criteria:**
- User clicks "Sign up with GitHub" and authorizes the dd0c GitHub App.
- System extracts user identity and organization context.
- System creates a new Tenant and User record in PostgreSQL.
- JWT session is established and user is routed to the setup wizard.
- **Estimate:** 3
- **Dependencies:** Story 3.1, Story 5.1
- **Technical Notes:** Request minimal scopes initially. Store the installation ID securely.
**Story 9.2: Stripe Billing Integration (Free & Team Tiers)**
*As a New User, I want to select a pricing plan, so that I can start using the product within my budget.*
- **Acceptance Criteria:**
- Wizard prompts user to select "Free" (up to 10 engineers) or "Team" ($10/engineer/mo).
- If Team is selected, user is redirected to Stripe Checkout.
- Webhook listener updates the tenant subscription status upon successful payment.
- **Estimate:** 5
- **Dependencies:** Story 9.1
- **Technical Notes:** Use Stripe Checkout sessions. Keep the webhook Lambda extremely fast to avoid Stripe timeouts.
**Story 9.3: AWS Connection Wizard Step**
*As a New User, I want a frictionless way to connect my AWS account, so that the discovery engine can access my infrastructure.*
- **Acceptance Criteria:**
- UI displays a one-click CloudFormation deployment link pre-populated with a unique `ExternalId`.
- User pastes the generated Role ARN back into the UI.
- API validates the role via `sts:AssumeRole` before proceeding.
- **Estimate:** 3
- **Dependencies:** Story 8.5, Story 9.1
- **Technical Notes:** The `sts:AssumeRole` call validates both the ARN and the ExternalId. Give clear error messages if validation fails.
**Story 9.4: Real-Time Discovery WebSocket**
*As a New User, I want to see the discovery engine working in real-time, so that I trust the system and get that "Holy Shit" moment.*
- **Acceptance Criteria:**
- API Gateway WebSocket API maintains a connection with the onboarding SPA.
- Step Functions and Lambdas push progress events (e.g., "Found 47 AWS resources...") to the UI via the WebSocket.
- When complete, the user is automatically redirected to their populated Service Catalog.
- **Estimate:** 5
- **Dependencies:** Epic 1, Epic 2, Story 9.3
- **Technical Notes:** Implement via API Gateway WebSocket API and a simple broadcasting Lambda.
---
## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/portal adheres to the 5 Transparent Factory tenets. As an Internal Developer Platform, portal is the control plane for other teams' services — governance and observability are existential requirements, not nice-to-haves.
### Story 10.1: Atomic Flagging — Feature Flags for Discovery & Catalog Behaviors
**As a** solo founder, **I want** every new auto-discovery heuristic, catalog enrichment, and scorecard rule behind a feature flag (default: off), **so that** a bad discovery rule doesn't pollute the service catalog with phantom services.
**Acceptance Criteria:**
- OpenFeature SDK integrated into the catalog service. V1: JSON file provider.
- All flags evaluate locally — no network calls during service discovery scans.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged discovery rule creates >5 unconfirmed services in a single scan, the flag auto-disables and the phantom entries are quarantined (not deleted).
- Flags required for: new discovery sources (GitHub, GitLab, K8s), scorecard criteria, ownership inference rules, template generators.
**Estimate:** 5 points
**Dependencies:** Epic 1 (Service Discovery)
**Technical Notes:**
- Quarantine pattern: flagged phantom services get `status: quarantined` rather than deletion. Allows review before purge.
- Discovery scans are batch operations — flag check happens once per scan config, not per-service.
### Story 10.2: Elastic Schema — Additive-Only for Service Catalog Store
**As a** solo founder, **I want** all DynamoDB catalog schema changes to be strictly additive, **so that** rollbacks never corrupt the service catalog or lose ownership mappings.
**Acceptance Criteria:**
- CI rejects any schema change that removes, renames, or changes type of existing DynamoDB attributes.
- New attributes use `_v2` suffix for breaking changes.
- All service catalog parsers ignore unknown fields (Go's `encoding/json` does this by default; never enable `DisallowUnknownFields` on catalog reads).
- Dual-write during migration windows within `TransactWriteItems`.
- Every schema change includes `sunset_date` comment (max 30 days).
**Estimate:** 3 points
**Dependencies:** Epic 2 (Catalog Store)
**Technical Notes:**
- Service catalog is the source of truth for org topology — schema corruption here cascades to scorecards, ownership, and templates.
- DynamoDB Single Table Design: version items with `_v` attribute. Use item-level versioning, not table duplication.
- Ownership mappings are especially sensitive — never overwrite, always append with timestamp.
### Story 10.3: Cognitive Durability — Decision Logs for Ownership Inference
**As a** future maintainer, **I want** every change to ownership inference algorithms, scorecard weights, or discovery heuristics accompanied by a `decision_log.json`, **so that** I understand why service X was assigned to team Y.
**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/discovery/`, `pkg/scoring/`, or `pkg/ownership/`.
- Cyclomatic complexity cap of 10 via `golangci-lint`. PRs exceeding this are blocked.
- Decision logs in `docs/decisions/`.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- Ownership inference is the highest-risk logic — wrong assignments erode trust in the entire platform.
- Document: "Why CODEOWNERS > git blame frequency > PR reviewer count for ownership signals?"
- Scorecard weight changes must include before/after examples showing how scores shift.
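An illustrative `decision_log.json` entry matching the schema above (contents invented for the example):

```json
{
  "prompt": "Change ownership inference to prefer CODEOWNERS over git blame frequency",
  "reasoning": "CODEOWNERS is an explicit, reviewed declaration; blame frequency misattributes refactoring sweeps and bot commits.",
  "alternatives_considered": [
    "git blame frequency only",
    "PR reviewer count only",
    "weighted blend of all three signals"
  ],
  "confidence": 0.8,
  "timestamp": "2026-02-28T17:00:00Z",
  "author": "founder"
}
```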
### Story 10.4: Semantic Observability — AI Reasoning Spans on Discovery & Scoring
**As a** platform engineer debugging a wrong ownership assignment, **I want** every discovery and scoring decision to emit an OpenTelemetry span with reasoning metadata, **so that** I can trace why a service was assigned to the wrong team.
**Acceptance Criteria:**
- Every discovery scan emits a parent `catalog_scan` span. Each service evaluation emits child spans: `ownership_inference`, `scorecard_evaluation`.
- Span attributes: `catalog.service_id`, `catalog.ownership_signals` (JSON array of signal sources + weights), `catalog.confidence_score`, `catalog.scorecard_result`.
- If AI-assisted inference is used: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- Spans export via OTLP. No PII — repo names and team names are hashed in spans.
**Estimate:** 3 points
**Dependencies:** Epic 1 (Service Discovery), Epic 3 (Scorecards)
**Technical Notes:**
- Use `go.opentelemetry.io/otel`. Batch export to avoid per-service overhead during large scans.
- Ownership inference spans should include ALL signals considered, not just the winning one — this is critical for debugging.
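One way to satisfy the no-PII rule while keeping spans correlatable is a truncated SHA-256 of the raw name; the prefix length and scheme below are assumptions, and the result would be attached to spans via `attribute.String("catalog.service_id", ...)`:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashForSpan produces a stable, non-reversible identifier to use in
// span attributes instead of raw repo/team names. Because the hash is
// deterministic, all spans for the same service line up across scans,
// and a short prefix keeps attribute values readable in trace UIs.
func hashForSpan(name string) string {
	sum := sha256.Sum256([]byte(name))
	return hex.EncodeToString(sum[:])[:16] // 16 hex chars is plenty for correlation
}

func main() {
	// Same input, same hash — traces from different scans correlate.
	fmt.Println(hashForSpan("acme/payments-api") == hashForSpan("acme/payments-api")) // true
	// Distinct names never collide in practice.
	fmt.Println(hashForSpan("acme/payments-api") == hashForSpan("team-payments")) // false
}
```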
### Story 10.5: Configurable Autonomy — Governance for Catalog Mutations
**As a** solo founder, **I want** a `policy.json` that controls whether the platform can auto-update ownership, auto-create services, or only suggest changes, **so that** teams maintain control over their catalog entries.
**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (suggest-only, no auto-mutations) or `audit` (auto-apply with logging).
- Default: `strict`. Auto-discovery populates a "pending review" queue rather than directly mutating the catalog.
- `panic_mode`: when true, all discovery scans halt, catalog is frozen read-only, and a "maintenance" banner shows in the UI.
- Per-team governance override: teams can lock their services to `strict` even if system is in `audit` mode.
- All policy decisions logged: "Service X auto-created in audit mode", "Ownership change for Y queued for review in strict mode".
**Estimate:** 3 points
**Dependencies:** Epic 1 (Service Discovery)
**Technical Notes:**
- "Pending review" queue is a DynamoDB table with `status: pending`. UI shows a review inbox for platform admins.
- Per-team locks: `team_settings` item in DynamoDB with `governance_override` field.
- Panic mode: `POST /admin/panic` or env var. Catalog API returns 503 for all write operations.
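A sketch of what `policy.json` could look like — only `governance_mode`, `panic_mode`, and the per-team override are named in the story; the exact nesting and key names are assumptions:

```json
{
  "governance_mode": "strict",
  "panic_mode": false,
  "team_overrides": {
    "team-payments": { "governance_override": "strict" }
  }
}
```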
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |

---
# dd0c/portal — Advisory Board "Party Mode" Review
## Round 1: INDIVIDUAL REVIEWS
### 1. The VC
**What excites me:** The wedge. You're attacking the bottom of the market that Port and Cortex are completely ignoring because their VC masters demand $50k enterprise ACVs. The $10/eng price point is a beautiful bottom-up PLG motion.
**What worries me:** The moat. Specifically, GitHub. It's an existential kill shot. If GitHub adds a "Service Catalog" tab that just aggregates CODEOWNERS and Actions, your entire TAM evaporates overnight. Also, building integrations for AWS, GitHub, PagerDuty, and Slack is a whole team's job, not a solo dev's.
**Vote:** CONDITIONAL (Prove the GitHub moat).
### 2. The CTO
**What excites me:** Finally, someone admitting that `catalog-info.yaml` is a lie. Generating the catalog from infrastructure truth (AWS tags, Terraform state) instead of human intentions is exactly how it should work.
**What worries me:** Auto-discovery accuracy. You claim 80% day-one accuracy. I've been doing this 20 years, and I've never seen an AWS environment clean enough to auto-discover accurately. If my devs log in and see 40% garbage data, they will never trust the tool again. Trust is binary here.
**Vote:** CONDITIONAL (Show me the discovery engine works on a messy, real-world AWS account).
### 3. The Bootstrap Founder
**What excites me:** The $10/engineer/month self-serve model. 50 engineers = $500/mo. You only need 20 mid-sized customers to hit $10k MRR. The "Backstage Migrator" and the calculator tool are genius low-friction acquisition channels.
**What worries me:** The surface area. A portal needs constant care and feeding of API integrations. AWS changes an API? Your portal breaks. GitHub rate limits? Portal breaks. Supporting this solo while trying to market it is a fast track to burnout.
**Vote:** GO (If you keep the scope aggressively small).
### 4. The Platform Engineer
**What excites me:** No YAML. Oh my god, no YAML. I spent 14 months babysitting Backstage and fighting community plugins. If I can connect an IAM role and GitHub OAuth and get a 2-second Cmd+K search for my team, I'll put my corporate card in right now.
**What worries me:** Will people actually use it daily? I already have a Confluence page that's *mostly* right. If this is just another dashboard I have to bookmark, it'll die. It has to be faster than asking in Slack.
**Vote:** GO. Take my money and save me from Spotify's monolith.
### 5. The Contrarian
**What excites me:** Nothing. You're building a feature, not a product.
**What worries me:** You are trying to sell a glorified spreadsheet. "We'll map your AWS to your GitHub!" Cool, I can do that in Notion with a Python script on a cron job. You're betting against human nature—if the org is chaotic enough to need this, they won't have the clean AWS tags required for your auto-discovery to work. If they have clean tags, they don't need you.
**Vote:** NO-GO.
## Round 2: CROSS-EXAMINATION
**1. The VC:** Okay, Mr. Bootstrap. You love the $10/eng math. "Just 20 customers to $10k MRR!" But what is the TAM, really? Are there enough 50-engineer teams that will actually pull out a credit card for a service catalog?
**2. The Bootstrap Founder:** The SAM is $840M, easily. A solo founder doesn't need to conquer the world. We just need 100 teams to hit half a mil ARR. The Backstage graveyard is full of teams practically begging for this.
**3. The VC:** Yeah, until those 50-eng teams grow to 200 engineers, realize they need RBAC and compliance reports, and immediately churn to Port or Cortex. You're building a stepping stone, not a billion-dollar company. You have zero enterprise retention.
**4. The Bootstrap Founder:** Who cares about a billion dollars?! $50k MRR with 90% margins, no sales team, and no board meetings is the ultimate dream. Let them churn to Cortex when they're huge. We own the bottom of the funnel. Why feed the VC machine?
**5. The Platform Engineer:** Time out. Let's talk about the real problem. CTO, you said auto-discovery won't work on a messy AWS account. Have you ever actually tried to maintain a Backstage YAML file?
**6. The CTO:** I've been in AWS accounts where every Lambda is named `test-function-final-2` and there isn't a single tag in sight. Good luck inferring service ownership from that garbage. Auto-discovery only works if your infra is already pristine.
**7. The Platform Engineer:** But that's the beauty of it! If the portal shows you the garbage, it forces teams to fix the root cause. We don't have to manually write YAML, we just fix the actual AWS tags, which we should be doing anyway!
**8. The CTO:** You're assuming they'll stick around to fix it. If the first impression is 60% accuracy, developers will bounce immediately. Engineers absolutely despise tools that lie to them. Trust is binary.
**9. The Contrarian:** Why are we even having this debate? If you're a 50-person team, just use a Notion database! It's free, it's already there, and it takes five minutes to update. You're trying to solve a social problem with a $500/month SaaS tool.
**10. The Platform Engineer:** Are you kidding me? A Notion database rots in exactly two weeks! Have you ever tried to keep an engineering org's Confluence page updated while 15 devs are pushing to production every day? It's a full-time job!
**11. The Contrarian:** Oh, please. It takes 5 minutes a month per team to update a table. You want to pay a third party $6,000 a year just so you don't have to type "payments team" next to a repository link? This isn't a startup, it's a weekend project!
**12. The VC:** Actually, the Contrarian is missing the point. Notion doesn't have a PagerDuty integration that dynamically syncs with GitHub CODEOWNERS. When a P1 incident hits at 3 AM, a Cmd+K Slack integration that instantly routes to the right on-call engineer is easily worth $500 a month in saved MTTR.
**13. The Contrarian:** Then write a Zapier webhook for your Notion page. Or better yet, just ask in Slack. If your MTTR is tanking because people don't know who owns what, you have a management problem, not a tooling problem.
**14. The Bootstrap Founder:** Spoken like someone who hasn't built software in 10 years. People pay for convenience. This "weekend project" replaces a 14-month Backstage deployment nightmare. I'd pay $10/head tomorrow.
## Round 3: STRESS TEST
### Threat 1: GitHub Ships a Native Service Catalog
**The Attack:** GitHub already has CODEOWNERS, dependency graphs, and Actions. Microsoft is arguably already building this. If they launch a "Services" tab that renders all of this perfectly for free, the market is dead.
- **Severity:** 10/10. Existential threat.
- **Mitigations:** Multi-cloud and multi-source data fusion. GitHub only knows about code and CI; it doesn't know about PagerDuty on-call schedules, AWS CloudWatch health metrics, or AWS costs. dd0c/portal must integrate deeply with the operational tools GitHub ignores.
- **Pivot Options:** Double down on the dd0c platform flywheel. If GitHub owns the catalog layer natively, pivot dd0c/portal into a free aggregation layer that drives users directly into paid modules like dd0c/cost, dd0c/alert, and dd0c/run.
### Threat 2: Auto-Discovery Accuracy is Only 70% (And Trust Dies)
**The Attack:** The 5-minute magical first run maps AWS and GitHub, but because of messy tags and chaotic monorepos, 30% of the data is garbage. Engineers log in, see a mess, declare it useless, and churn on day 1.
- **Severity:** 8/10. Fatal to adoption and PLG motion.
- **Mitigations:** Introduce explicit confidence scores immediately. Do not state facts; state probabilities ("We are 75% confident @payments owns this"). Make the first experience a guided calibration wizard rather than a static presentation of broken data.
- **Pivot Options:** Switch the core pitch from "zero-config auto-discovery" to "AI-assisted onboarding." If fully autonomous discovery fails, use LLMs to analyze repos and actively chat with team leads to build the catalog interactively.
### Threat 3: Backstage 2.0 Dramatically Simplifies Setup
**The Attack:** The CNCF gets tired of the complaints, or Roadie open-sources a zero-config setup wizard. The "Backstage takes 6 months" wedge evaporates.
- **Severity:** 6/10. Harmful to the acquisition model, but survivable.
- **Mitigations:** Backstage will never stop being a massive React monolith that requires heavy hosting and maintenance. Its DNA is "unopinionated framework." Lean into the "Opinionated SaaS vs. Self-Hosted Monolith" angle.
- **Pivot Options:** Focus heavily on proprietary features Backstage won't build, like AI natural language querying ("Ask Your Infra") and deep daily-use habits (the Slack bot and Cmd+K shortcut). Backstage is a dashboard you visit quarterly; dd0c is a workflow tool you use daily.
## Round 4: FINAL VERDICT
**The Decision:** SPLIT DECISION (4-1 in favor). The Contrarian dissents, citing "Glorified Spreadsheet."
**Revised Priority in dd0c lineup:**
Portal is the **Hub**, but it should be built **Third**. Launch `dd0c/cost` and `dd0c/alert` first. Use those to capture initial revenue and validate the pain points, then launch `dd0c/portal` as the connective tissue that makes the whole platform sticky. It's the ultimate upsell mechanism, not the first wedge.
**Top 3 Must-Get-Right Items:**
1. **The 5-Minute "Holy Shit" Moment.** If the auto-discovery engine doesn't map 80% of an AWS/GitHub environment on the first try with high confidence, the PLG motion dies in the water.
2. **Speed > Features.** The Cmd+K search and Slack bot must be sub-500ms. It has to be faster than asking a human in Slack, or the behavioral habit will never form.
3. **The dd0c Flywheel.** It cannot be just a standalone catalog. It must immediately show cross-module value via cost visibility (dd0c/cost) and alert routing (dd0c/alert).
**The One Kill Condition:**
If GitHub announces a native "Service Catalog" at GitHub Universe 2026, and it integrates with PagerDuty and AWS natively, kill `dd0c/portal` as a standalone product immediately. Pivot to making it a free internal feature of the dd0c platform to drive adoption of the paid operational modules.
**Final Verdict:** **GO.**
The IDP space is a graveyard of $30M VC-funded over-engineered monoliths and abandoned Spotify YAML files. There is a massive, starving market of 50-engineer teams who just want to know who is on call for the payment gateway at 3 AM. Stop asking them to write configuration files. Give them the search bar. Take their $10 a month. Build the Anti-Backstage.

---
# dd0c/portal — Product Brief
**Product:** Lightweight Internal Developer Portal ("The Anti-Backstage")
**Version:** 1.0
**Date:** 2026-02-28
**Author:** Product Strategy Team
**Status:** Phase 5 — Product Brief (Investor-Ready)
**Board Decision:** GO (4-1) — Build Third in dd0c Sequence
---
## 1. EXECUTIVE SUMMARY
### Elevator Pitch
dd0c/portal is a zero-configuration internal developer portal that auto-discovers your services from AWS and GitHub in 5 minutes, replaces months of Backstage setup with a single IAM role connection, and gives every engineer a Cmd+K search bar that answers "who owns this?" faster than asking in Slack. $10/engineer/month. No YAML. No dedicated platform team. No lies in your catalog.
### Problem Statement
The internal developer portal market is broken in both directions:
**The Enterprise Trap:** Backstage (open-source, Spotify) requires 1-2 dedicated engineers to maintain, takes 3-6 months to deploy, and depends on `catalog-info.yaml` files that engineers write once and never update. After 12 months, most Backstage instances have <30% catalog accuracy. Port, Cortex, and OpsLevel charge $25-50/engineer/month with enterprise sales cycles, pricing out the 80% of engineering teams with 20-200 engineers.
**The Spreadsheet Trap:** Teams that can't afford enterprise IDPs track services in Google Sheets and Notion databases that rot within weeks. When a P1 incident fires at 3 AM, precious minutes are wasted in Slack asking "who owns this?" — a question that should take 0 seconds to answer.
**The data tells the story:**
- Engineers spend an average of 3-5 hours/week searching for internal service information (Humanitec 2025 Platform Engineering Survey)
- New hires take 3-4 weeks to become productive, with poor internal tooling cited as the #1 friction point
- 70%+ of internal documentation is stale within 6 months of creation
- Incident MTTR increases by 15-30 minutes when service ownership is unclear
- <5% of organizations have a functioning IDP in 2026 (Gartner), despite 80% recognizing the need
- The Backstage graveyard is growing — community forums and Reddit are filled with teams who invested months and got minimal value
### Solution Overview
dd0c/portal is a SaaS developer portal that generates its service catalog from infrastructure reality instead of human-maintained YAML files:
1. **5-Minute Auto-Discovery:** Connect a read-only AWS IAM role and GitHub OAuth. The discovery engine scans EC2, ECS, Lambda, RDS, API Gateway, CloudFormation stacks, GitHub repos, CODEOWNERS, and README files. Within 30 seconds, the catalog is populated with >80% accuracy — including service names, inferred owners, descriptions, tech stacks, and repo links.
2. **Cmd+K Instant Search:** The entire portal is a search bar. Type any service name, team name, or keyword — results in <500ms. Faster than asking in Slack. This is the daily-use hook that makes the portal the browser homepage.
3. **Zero Maintenance:** The catalog auto-refreshes from infrastructure sources. No YAML to maintain. No platform engineer babysitting plugins. The catalog stays accurate because it's generated from truth, not maintained by humans.
4. **dd0c Platform Integration:** Portal is the hub of the dd0c platform. dd0c/cost shows per-service AWS spend. dd0c/alert routes incidents to the right owner. dd0c/run links executable runbooks. Each module makes the portal richer; the portal makes each module smarter.
### Target Customer
**Primary:** Engineering teams of 20-200 engineers, AWS-primary, using GitHub, who either:
- Tried Backstage and abandoned it (or are limping along with a broken instance)
- Evaluated enterprise IDPs (Port, Cortex) and couldn't justify the $60-150K/year price tag
- Currently rely on Google Sheets, Notion, or Slack tribal knowledge for service ownership
**Buyer Personas:**
- **The Platform Engineer (Alex):** Burned out maintaining Backstage. Wants a portal that maintains itself so they can build actual platform capabilities.
- **The Engineering Director (Priya):** Needs always-current service ownership for incident response, compliance audits, and headcount justification. Can't answer "who owns what?" with confidence today.
- **The New Hire (Jordan):** Drowning in scattered tribal knowledge. Needs to understand the service landscape in hours, not weeks.
**Firmographic Profile:**
- 20-200 engineers (sweet spot: 40-100)
- AWS as primary cloud (70%+ of infrastructure)
- GitHub organization (not GitLab/Bitbucket at launch)
- Series A through Series D startups, or mid-market companies with modern engineering practices
- US/EU-based (timezone alignment for solo founder support)
### Key Differentiators
| Dimension | Backstage | Port/Cortex | dd0c/portal |
|-----------|-----------|-------------|-------------|
| Time to value | 3-6 months | 2-6 weeks | 5 minutes |
| Catalog maintenance | Manual YAML (1-2 FTEs) | Semi-manual (0.25 FTE) | Auto-discovery (0 FTE) |
| Day-1 accuracy | 0% (empty catalog) | 30-50% (import + manual) | 80%+ (auto-discovered) |
| Pricing (50 eng) | $0 + $200-400K labor | $60-150K/year | $6,000/year |
| Daily-use stickiness | Low (nobody visits) | Medium | High (Cmd+K, Slack bot, browser homepage) |
| AI capabilities | None | Basic | Native (Ask Your Infra, V2) |
| Platform integration | Standalone | Standalone | dd0c flywheel (cost, alerts, drift, runbooks) |
| Setup requirement | Fork React monolith, configure plugins, write YAML | Sales call, POC, weeks of configuration | Connect AWS + GitHub, done |
---
## 2. MARKET OPPORTUNITY
### Market Sizing
**Total Addressable Market (TAM): $4.3B/year**
- Global software developers: ~28 million (Evans Data, 2026)
- Developers in organizations with 20+ engineers (IDP-relevant): ~12 million
- Average willingness to pay for developer portal tooling: ~$30/dev/month (blended across tiers)
- TAM = 12M × $30 × 12 = $4.3B
**Serviceable Addressable Market (SAM): $840M/year**
- AWS-primary organizations (dd0c's initial integration scope): ~40% of cloud market
- Teams of 20-500 engineers (too big for a spreadsheet, too small for Port/Cortex)
- Estimated developer population in segment: ~2.8M developers
- At $25/dev/month blended: SAM = 2.8M × $25 × 12 = $840M
**Serviceable Obtainable Market (SOM):**
- Year 1-2 realistic penetration: 0.1-1% of SAM
- Conservative (0.1%): ~2,800 developers across 50-80 orgs → $336K ARR at $10/eng
- Aggressive (1% by Year 3): ~28,000 developers → $3.36M ARR
- Expected value at Month 12: ~$300K ARR (probability-weighted across scenarios)
**Market growth:** IDP market growing at 30-40% CAGR as platform engineering becomes mainstream. Gartner estimates <5% of organizations have a functioning IDP in 2026, with 80% recognizing the need. The gap between awareness and adoption is dd0c's opportunity.
### Competitive Landscape
#### Tier 1: Open Source Incumbent — Backstage (Spotify/CNCF)
- 27K+ GitHub stars, CNCF Incubating project
- Free but requires 1-2 FTEs to maintain ($200-400K/year true cost)
- `catalog-info.yaml` model is fundamentally broken — depends on humans doing unpaid maintenance
- Most instances have <30% catalog accuracy after 12 months
- Plugin ecosystem is wide but shallow; community plugins rot
- **dd0c's angle:** "Backstage takes 5 months. We take 5 minutes. Backstage requires YAML. We require nothing."
#### Tier 2: Managed Backstage — Roadie
- Managed Backstage-as-a-Service, ~$30-50/engineer/month
- Removes hosting burden but still fundamentally Backstage (still requires YAML, still has cold-start problem)
- Validates that Backstage maintenance is a real pain point — which is dd0c's thesis
- **Threat level:** Low. Roadie is a better Backstage; dd0c is a different paradigm.
#### Tier 3: Enterprise IDP Platforms — Port ($33M Series A), Cortex ($35M+), OpsLevel
- Enterprise pricing: $25-50/engineer/month, sales-led, minimum contracts $30K+/year
- Feature-rich: self-service workflows, advanced RBAC, compliance reports, custom scorecards
- Built for 500+ engineer orgs with dedicated platform teams and procurement budgets
- Setup takes weeks. Requires significant configuration. Pricing excludes small-to-mid teams entirely.
- **dd0c's angle:** 10-20x cheaper, 100x faster to set up, zero maintenance. For the 80% of teams that Port/Cortex ignores.
#### Tier 4: Shadow Competitors (Existential Threats)
- **GitHub:** Already has CODEOWNERS, dependency graphs, repository topics, Actions. If GitHub ships a "Services" tab, the lightweight IDP market contracts overnight. **Severity: 10/10.**
- **Datadog Service Catalog:** Basic today, but Datadog has $2B+ revenue and massive distribution. If bundled effectively with monitoring, it's "free" for existing customers.
- **Atlassian Compass:** Integrated with Jira/Confluence. Currently mediocre, but Atlassian has massive mid-market distribution.
- **AWS Service Catalog:** Terrible UX today, but AWS has native infrastructure data access.
#### Competitive Positioning Matrix
```
Factor | Backstage | Port/Cortex | Roadie | dd0c/portal
------------------------|-----------|-------------|--------|------------
Setup Time | Months | Weeks | Days | 5 minutes
Catalog Maintenance | Manual | Semi-manual | Manual | Auto
Day-1 Accuracy | 0% | 30-50% | 0% | 80%+
Price (50 engineers) | $200K+ | $60-150K | $18-30K| $6K
Daily Active Usage | <10% | 20-30% | 15-25% | Target: 40%+
AI-Native | No | Basic | No | Core (V2)
Platform Integration | Plugins | Standalone | Plugins| dd0c flywheel
Solo Founder Viable | No | No | No | Yes
```
### Timing Thesis: Why Now
Four forces are converging to create a window of opportunity:
**1. Backstage Fatigue (2024-2026)**
The Backstage hype cycle has peaked and entered the trough of disillusionment. Teams that adopted Backstage in 2023-2024 are 12-18 months in and discovering the maintenance burden is unsustainable, catalog accuracy has degraded below 50%, and the platform engineer maintaining it is burned out or has quit. This creates a massive pool of "Backstage refugees" — teams that believe in the IDP concept but are disillusioned with the execution. These are dd0c's first customers.
**2. Platform Engineering Goes Mainstream (2025-2026)**
Platform engineering is no longer a luxury — it's expected. Budget is being allocated specifically for developer experience tooling. The question has shifted from "do we need an IDP?" to "which one?" This means more buyers, more budget, and less evangelism required.
**3. AI-Native Expectations (2026)**
Engineers use Copilot, Cursor, and Claude daily. A developer portal that requires manual YAML maintenance feels archaic. The expectation is: "Why can't it just figure out what services we have?" dd0c's auto-discovery + AI query model aligns with 2026 developer expectations. Backstage's manual model feels like 2020.
**4. FinOps + Platform Engineering Convergence**
Engineering leaders want cost-per-service alongside ownership and health. No standalone IDP does this well. dd0c's platform approach (portal + cost + alerts) is uniquely positioned for this convergence.
**Regulatory & Market Tailwinds:**
- SOC 2 / ISO 27001 adoption increasing — auditors ask "show me service ownership." An always-current IDP is compliance infrastructure.
- EU DORA (Digital Operational Resilience Act) requires service mapping and incident response capabilities for financial services.
- US Executive Order on AI requires organizations to inventory AI-powered services and their owners.
- Platform engineering job postings up 340% since 2023 — the buyer persona is proliferating.
**The window is open. It won't stay open forever.** GitHub could ship a native service catalog. Datadog could invest seriously in theirs. The 12-18 month head start window is the entire strategic opportunity.
---
## 3. PRODUCT DEFINITION
### Value Proposition
**For platform engineers** who are drowning in Backstage maintenance, dd0c/portal is a self-maintaining developer portal that generates its catalog from infrastructure reality. Unlike Backstage, which requires manual YAML files that nobody updates, dd0c/portal auto-discovers services from AWS and GitHub with >80% accuracy in 5 minutes and zero ongoing maintenance.
**For engineering directors** who can't answer "who owns what?" with confidence, dd0c/portal provides always-current, authoritative service ownership mapping. Unlike quarterly manual audits or stale Google Sheets, dd0c/portal auto-refreshes from source-of-truth systems and answers compliance questions in seconds, not weeks.
**For every engineer** who wastes time searching for service information, dd0c/portal is a Cmd+K search bar that's faster than asking in Slack. Unlike scattered documentation across Confluence, Notion, and pinned Slack messages, dd0c/portal is one surface with everything: owner, repo, health, on-call, cost, runbook.
**Core design principle: "Calm surface, deep ocean."** The default view is radically simple — service name, owner, health, repo link. One line per service. Scannable in 2 seconds. But depth is one click away: dependencies, cost, deployment history, runbook, scorecard. Start simpler than a spreadsheet, grow deeper than Backstage — but only when the user asks for depth.
### Personas
#### Persona 1: Jordan — The New Hire
- Software Engineer II, 2 years experience, just joined a 60-person engineering org
- Spends first week on a scavenger hunt across Slack, Confluence, GitHub, and Google Docs trying to map the service landscape
- Assigned a ticket involving an unfamiliar service — can't find the right repo, the owner, or current documentation
- **JTBD:** "When I join a new company, I want to quickly understand the service landscape so I can contribute meaningfully without feeling lost."
- **dd0c moment:** Jordan opens the portal, hits Cmd+K, types the service name from standup, and instantly sees: owner, repo, description, on-call, last deploy. What used to take 3 days takes 3 seconds.
#### Persona 2: Alex — The Platform Engineer
- Senior Platform Engineer, 6 years experience, maintaining Backstage for 14 months
- Spends 40% of time on Backstage maintenance instead of building actual platform tooling
- Sends monthly "please update your catalog-info.yaml" Slack messages that everyone ignores
- Has a secret spreadsheet that's more accurate than Backstage
- **JTBD:** "When I'm responsible for the developer portal, I want it to maintain itself so I can focus on building actual platform capabilities."
- **dd0c moment:** Alex connects the AWS IAM role and GitHub OAuth. 30 seconds later, 147 services appear — with owners, repos, and health status. Alex's spreadsheet is obsolete. The monthly YAML nagging stops forever.
#### Persona 3: Priya — The Engineering Director
- Director of Engineering, manages 8 teams (62 engineers), reports to VP of Engineering
- Can't answer "which teams own which services?" without a multi-week manual audit
- SOC 2 audit in 3 months — compliance team needs service ownership evidence she can't produce
- Incident MTTR inflated by 15+ minutes because nobody knows who to page
- **JTBD:** "When preparing for a compliance audit, I want an always-current inventory of services, owners, and maturity attributes so I can produce evidence in minutes, not weeks."
- **dd0c moment:** Priya opens the portal dashboard. Every service, every owner, every health status — live, accurate, exportable. The SOC 2 evidence that used to take 2 weeks is generated in 30 seconds.
### Feature Roadmap
#### MVP (V1) — "The 5-Minute Miracle" — Months 1-3
The MVP is ruthlessly scoped. It does three things and does them perfectly:
**Auto-Discovery Engine (THE product)**
- AWS discovery via read-only IAM role: EC2, ECS, Lambda, RDS, API Gateway, CloudFormation stacks
- Infers "services" from CloudFormation stacks, ECS services, tagged resource groups, and naming conventions
- GitHub org scanner: repos, languages, CODEOWNERS, README extraction, deployment workflows
- Cross-references AWS resources ↔ GitHub repos to build service-to-repo mapping
- Ownership inference from CODEOWNERS, git blame frequency, and team membership
- Confidence scores on every data point: "85% confident @payments-team owns this (source: CODEOWNERS + git history)"
- Auto-refresh on configurable schedule (default: every 6 hours)
- **Target: >80% accuracy on first run. This is the entire business.**
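The inference heuristics above — explicit tags beat CloudFormation stack grouping, which beats naming conventions, each with its own base confidence — can be sketched in miniature. This is an illustrative sketch, not dd0c's actual engine: the `infer_services` function, the resource record shape, and the specific confidence values are all assumptions, and the real AWS API calls (boto3) are omitted.

```python
from collections import defaultdict

# Hypothetical resource records as a discovery agent might emit them
# after calling the read-only AWS APIs (boto3 calls omitted).
RESOURCES = [
    {"arn": "arn:aws:ecs:us-east-1:111:service/payment-gateway",
     "stack": "payments", "tags": {"team": "payments-team"}},
    {"arn": "arn:aws:lambda:us-east-1:111:function:payment-webhook",
     "stack": "payments", "tags": {}},
    {"arn": "arn:aws:rds:us-east-1:111:db/auth-db",
     "stack": None, "tags": {"service": "auth-service"}},
]

def infer_services(resources):
    """Group raw AWS resources into candidate services.

    Heuristic order, strongest signal first: explicit `service` tag,
    then CloudFormation stack, then the first token of the resource
    name. Each signal carries a different base confidence.
    """
    services = defaultdict(lambda: {"resources": [], "confidence": 0.0})
    for r in resources:
        if "service" in r["tags"]:
            key, conf = r["tags"]["service"], 0.9   # explicit tag
        elif r["stack"]:
            key, conf = r["stack"], 0.75            # CFN stack grouping
        else:
            name = r["arn"].rsplit("/", 1)[-1].rsplit(":", 1)[-1]
            key, conf = name.split("-")[0], 0.5     # naming convention
        svc = services[key]
        svc["resources"].append(r["arn"])
        svc["confidence"] = max(svc["confidence"], conf)
    return dict(services)
```

Run against the sample records, this yields two candidate services: `payments` (two resources, 0.75 confidence via stack grouping) and `auth-service` (0.9 via explicit tag) — exactly the kind of per-data-point confidence the UI would surface.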
**Service Catalog UI**
- Service cards: name, owner (with confidence), description (extracted from README), repo link, tech stack, health status, last deploy timestamp
- Cmd+K instant search across all services, teams, and keywords — results in <500ms
- Progressive disclosure: default view is one-line-per-service table. Click to expand full service detail.
- Team directory: which team owns which services, team members, contact info
- Correction UI: one-click to fix wrong ownership or add missing data. Corrections feed back into the discovery model.
**Integrations (Minimum Viable)**
- AWS: read-only IAM role (runs in customer's VPC, pushes metadata to SaaS)
- GitHub: OAuth app for org access
- PagerDuty/OpsGenie: import on-call schedules, map to services ("Who's on-call for this right now?")
- Slack bot: `/dd0c who owns <service>` — responds in <2 seconds
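The core of the Slack bot is a single catalog lookup; the Slack wiring (request signing, HTTP endpoint) is routine and omitted here. A minimal sketch, assuming a hypothetical in-memory catalog — function and field names are illustrative, not the shipped implementation:

```python
CATALOG = {  # hypothetical catalog rows keyed by service name
    "auth-service": {"owner": "@identity-team", "oncall": "Dana K.",
                     "repo": "github.com/acme/auth-service"},
}

def who_owns(service_name, catalog=CATALOG):
    """Resolve `/dd0c who owns <service>` into a one-line Slack reply."""
    query = service_name.strip().lower()
    svc = catalog.get(query)
    if svc is None:
        # Fall back to substring match so partial names still resolve.
        matches = [k for k in catalog if query in k]
        if len(matches) == 1:
            query, svc = matches[0], catalog[matches[0]]
        else:
            return f"No service matching '{service_name}' found."
    return (f"*{query}* is owned by {svc['owner']} "
            f"(on-call: {svc['oncall']}) — {svc['repo']}")
```

Because the lookup is a dictionary hit (plus one linear fallback scan), the <2-second response budget is dominated by Slack's round trip, not the catalog query.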
**Infrastructure**
- Auth: GitHub OAuth (SSO via GitHub org membership)
- Billing: Stripe, $10/engineer/month, self-serve credit card signup
- Onboarding: connect AWS + GitHub in 3 clicks, auto-discovery runs, catalog populated
**What V1 explicitly does NOT include:**
- No scorecards or maturity models
- No dependency graphs
- No software templates or scaffolding
- No custom plugins or extensibility
- No Kubernetes or Terraform discovery (AWS + GitHub only)
- No advanced RBAC (org-level access only)
- No self-hosted option
#### V1.1 — "The Daily Habit" — Months 4-6
**Dependency Visualization**
- Auto-inferred service dependency graph from AWS VPC flow logs, API Gateway routes, and Lambda event sources
- Visual dependency map (click a service → see what it calls and what calls it)
- Impact radius: "If this service goes down, these 5 services are affected"
**Scorecards (Lightweight)**
- Production readiness checklist per service: has README? Has CODEOWNERS? Has runbook? Has monitoring? Has on-call rotation?
- Org-wide scorecard dashboard: "72% of services meet production readiness standards"
- Exportable for compliance evidence (SOC 2, ISO 27001)
**Backstage Migrator**
- One-click import from existing `catalog-info.yaml` files
- Maps Backstage catalog entries to dd0c services, merges with auto-discovered data
- "Migrate from Backstage in 10 minutes" — the acquisition wedge for Backstage refugees
**Enhanced Discovery**
- Terraform state file parsing (service infrastructure mapping)
- Kubernetes label/annotation discovery (for K8s-based services)
- Improved accuracy via ML-assisted pattern matching from user corrections across customers
#### V2 — "Ask Your Infra" — Months 7-12
**AI Natural Language Querying (The Differentiator)**
- "Ask Your Infra" agent: natural language questions against the service catalog
- Examples:
- "Which services handle PII data?"
- "Who owns the services that the checkout flow depends on?"
- "Show me all services that haven't been deployed in 90 days"
- "What's the total AWS cost of the payments team's services?"
- "Which services don't have runbooks?"
- Powered by LLM with the service catalog as structured context — not hallucinating, querying verified data
- Available in portal UI, Slack bot, and CLI
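The "querying verified data, not hallucinating" claim means the LLM's job is to translate a question into a filter over catalog rows; the answer always comes from the catalog itself. A minimal sketch of one such query ("services not deployed in 90 days"), with a hand-rolled filter standing in for the LLM-emitted one and a hypothetical two-row catalog:

```python
from datetime import date, timedelta

def stale_services(catalog, days=90, today=date(2026, 2, 28)):
    """Answer 'show me services not deployed in N days' by filtering
    verified catalog rows. In the real flow an LLM would emit this
    filter; the data it runs against is always the discovered catalog."""
    cutoff = today - timedelta(days=days)
    return [s["name"] for s in catalog if s["last_deploy"] < cutoff]

CATALOG = [
    {"name": "payment-gateway", "last_deploy": date(2026, 2, 20)},
    {"name": "legacy-report",   "last_deploy": date(2025, 9, 1)},
]
```

Here `stale_services(CATALOG)` returns only `legacy-report` — and if the catalog is wrong, the answer is wrong in an auditable way, which is the point of grounding the agent in structured data.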
**dd0c Platform Integration**
- dd0c/cost integration: per-service AWS cost attribution on every service card
- dd0c/alert integration: alert routing to service owner, incident history on service card
- dd0c/run integration: linked runbooks per service, AI-assisted incident response
- Cross-module dashboard: "Service X costs $847/mo, had 3 incidents this month, has 2 IaC drifts"
**Advanced Features**
- Change feed: "What changed in the service landscape this week?" (new services, ownership changes, health status changes)
- Zombie service detection: services with no deployments, no traffic, and no owner for 90+ days
- Cost anomaly per service: "Service X cost jumped 340% this week"
#### V3 — "The Platform" — Months 12-18
**Multi-Cloud**
- GCP discovery (Cloud Run, GKE, Cloud Functions)
- Azure discovery (App Service, AKS, Functions)
- Multi-cloud service catalog with unified view
**Enterprise Features**
- Advanced RBAC (team-level permissions, service-level visibility controls)
- SSO (Okta, Azure AD) beyond GitHub OAuth
- Audit logs (who viewed/changed what, when)
- Compliance reports (SOC 2 evidence packages, auto-generated)
- API access (programmatic catalog queries for CI/CD integration)
**Agent Control Plane**
- Registry for AI agents operating in the infrastructure
- "Which AI agents have access to which services?"
- Agent activity monitoring and governance
- Position dd0c/portal as the source of truth that AI agents query — not competing with agents, but enabling them
### User Journey: First 30 Minutes
```
Minute 0:00 — Engineer discovers dd0c/portal (blog post, HN, colleague recommendation)
Minute 0:30 — Signs up with GitHub OAuth. Enters credit card. 30 seconds.
Minute 1:00 — Onboarding wizard: "Connect your AWS account." Provides CloudFormation
template for read-only IAM role. One-click deploy.
Minute 3:00 — AWS connected. GitHub org already connected via OAuth.
"Starting auto-discovery..."
Minute 3:30 — Discovery complete. "Found 147 services across 89 repos and 12 AWS accounts."
The "Holy Shit" moment.
Minute 4:00 — Engineer sees the catalog. Services listed with names, owners (with confidence
scores), repos, health status. Scans the list. "This is... actually right."
Minute 5:00 — Hits Cmd+K. Types "payment." Instant results: payment-gateway, payment-processor,
payment-webhook. Clicks payment-gateway → sees owner, repo, on-call, last deploy,
tech stack. All auto-discovered.
Minute 6:00 — Notices one wrong owner. Clicks "Correct" → selects the right team → done.
System learns. Confidence score on similar services adjusts.
Minute 8:00 — Copies the Slack bot install link. Adds /dd0c to the team Slack.
Minute 10:00 — Types /dd0c who owns auth-service in #engineering. Bot responds instantly
with owner, on-call, repo link. Three colleagues react with 👀.
Minute 15:00 — Sets dd0c/portal as browser homepage.
Minute 30:00 — Shares screenshot in team Slack: "Look what I just found."
Three teammates sign up within the hour.
```
### Pricing
**Free Tier — "Try It"**
- Up to 10 engineers
- Up to 25 discovered services
- Cmd+K search, basic service cards
- No Slack bot, no PagerDuty integration
- Purpose: let individual engineers try the product without budget approval. They become internal champions.
**Team Tier — $10/engineer/month — "The Sweet Spot"**
- Unlimited services
- Full auto-discovery (AWS + GitHub)
- Cmd+K search, Slack bot, PagerDuty/OpsGenie integration
- Confidence scores, correction UI, auto-refresh
- Scorecards (V1.1+)
- Self-serve signup, credit card, no sales call
- No minimum commitment, cancel anytime
- **Target customer: 20-100 engineers → $200-$1,000/month**
**Business Tier — $25/engineer/month — "The Platform" (Month 12+)**
- Everything in Team, plus:
- dd0c module integrations (cost, alert, drift, run)
- "Ask Your Infra" AI agent
- Dependency graphs
- Advanced RBAC, SSO (Okta/Azure AD)
- Compliance reports (SOC 2 evidence packages)
- Priority support
- **Target customer: 100-500 engineers → $2,500-$12,500/month**
**Pricing rationale:**
- $10/engineer removes the procurement barrier. For a 50-person team, that's $500/month — within most engineering managers' discretionary spending authority. No procurement process, no legal review, no 6-month sales cycle.
- The ROI calculation is trivial: "Does this save each engineer more than 15 minutes per month?" The Cmd+K search alone saves 15 minutes in the first week.
- $10/engineer is structurally impossible for VC-backed competitors to match. Port and Cortex have 50+ employees and $30M+ in funding. Their cost structure requires $25-50/engineer. dd0c's cost structure is one person + cloud hosting.
- Pricing evolution planned: $10 at launch → introduce Business tier at $25 by month 12 → effective ARPU rises to $15-20 as customers add modules and upgrade.
---
## 4. GO-TO-MARKET PLAN
### Launch Strategy: Targeting Backstage Refugees
The primary acquisition channel is not paid ads, not outbound sales, and not conference sponsorships. It's content-driven PLG targeting the single most receptive audience in DevOps: **teams that tried Backstage and failed.**
These teams have already self-selected. They believe in the IDP concept. They've invested 3-18 months. They've felt the pain of YAML rot, plugin maintenance, and catalog decay. They are actively searching for alternatives. They are dd0c's first 100 customers.
**Where they congregate:**
- Backstage Discord and GitHub Discussions (frustrated questions, feature requests that will never ship)
- r/devops and r/platformengineering (posts titled "Backstage alternatives?" appear monthly)
- Platform Engineering Slack community (~15K members)
- PlatformCon conference attendees and speakers
- Blog posts titled "What We Learned From Our Backstage Implementation" (translation: "Why We Failed")
- Google searches for "Backstage alternatives," "Backstage too complex," "IDP without YAML"
### Content Strategy: Engineering-as-Marketing
Every piece of content serves one of two purposes: (1) capture Backstage refugees, or (2) demonstrate the 5-minute magic.
**Tier 1: Flagship Content (Pre-Launch)**
1. **"I Maintained Backstage for 18 Months. Here's Why I Quit."**
- Honest, technical post-mortem. Not a hit piece — a relatable story.
- Target: HN front page, r/devops top post, Twitter/X viral thread
- CTA: "We built the alternative. Try it in 5 minutes."
- This single piece of content, if it resonates, generates the first 50-100 signups.
2. **"The Backstage Migration Calculator"**
- Free web tool: input your Backstage metrics (engineers, hours/week on maintenance, catalog accuracy %)
- Output: total cost of ownership, comparison to dd0c/portal pricing, projected time savings
- Lead capture: email required for full report
- SEO targets: "Backstage cost," "Backstage alternatives," "Backstage vs"
- This tool validates demand before a single line of portal code is written.
3. **"Is Your IDP Actually Used? A 5-Minute Audit"**
- Checklist/scorecard format. "How many engineers visited your IDP this week? Is your catalog >80% accurate?"
- Most teams score poorly → creates urgency → CTA to dd0c
**Tier 2: SEO & Thought Leadership (Launch + Ongoing)**
4. **"Zero-Config Service Discovery: How We Auto-Map Your AWS Infrastructure"** — technical deep-dive
5. **"The Internal Developer Portal Buyer's Guide (2026)"** — honest comparison including dd0c's weaknesses
6. **"Why Your Service Catalog Is Lying to You"** — thought piece on manual vs. auto-discovered catalogs
7. **"How [Company] Replaced Backstage in 5 Minutes"** — customer case study video walkthroughs
**Tier 3: Community & Social Proof (Post-Launch)**
8. Customer case studies (as soon as first 3 customers are live)
9. Monthly "State of the Catalog" reports — anonymized data on discovery accuracy and adoption patterns
10. Open-source the discovery agent — builds trust, enables security audits, creates community contributions
### Growth Loops
**Loop 1: The Slack Bot Viral Loop**
Every time an engineer uses `/dd0c who owns <service>` in a public Slack channel, it's a product demo. Colleagues see the instant response, ask "what is this?", and sign up. The Slack bot is a passive viral mechanism that scales with usage.
**Loop 2: The Screenshot Loop**
The "Holy Shit" moment — 147 services auto-discovered in 30 seconds — is inherently shareable. Engineers screenshot the catalog and share it in team Slack, Twitter/X, and engineering blogs. The visual impact of a populated catalog appearing from nothing is the product's best marketing asset.
**Loop 3: The Org Expansion Loop**
One team adopts dd0c/portal → other teams in the org notice → engineering director sees cross-team adoption → approves org-wide rollout. Bottom-up adoption creates top-down demand. This is the Slack/Figma playbook.
**Loop 4: The dd0c Platform Upsell Loop**
Portal customer sees per-service cost data (dd0c/cost integration) → "Wait, Service X costs $847/month?!" → upgrades to dd0c/cost → sees alert routing (dd0c/alert) → upgrades again. Portal is the top-of-funnel for the entire dd0c platform.
**Viral coefficient target:** k=0.3 (each new user brings 0.3 additional users through Slack sharing and team adoption). At k=0.3, organic growth supplements content marketing but doesn't replace it.
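Why k=0.3 supplements rather than replaces content marketing falls out of the geometric series: each acquired user brings 0.3 more, who bring 0.09 more, and so on. A quick sketch:

```python
def organic_multiplier(k):
    """Steady-state users per directly-acquired user for viral
    coefficient k < 1: the series 1 + k + k^2 + ... = 1/(1-k)."""
    return 1 / (1 - k)
```

At k=0.3 every 100 content-acquired signups compound to roughly 143 total users — a 1.43x amplifier on marketing, nowhere near the self-sustaining growth of k>1.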
### Partnership Strategy
**AWS Marketplace (Month 6)**
- List dd0c/portal on AWS Marketplace
- Customers with committed AWS spend (EDPs) can use marketplace credits to pay for dd0c — removes the budget objection entirely
- AWS Marketplace provides credibility signaling and co-marketing opportunities via ISV Partner program
**GitHub Marketplace (Month 3)**
- List the GitHub App on GitHub Marketplace
- Lower friction than AWS Marketplace, higher discovery volume
- Natural discovery point for GitHub-centric engineering teams
**PagerDuty / OpsGenie Integration Partners**
- Deep integration with on-call tools is a key feature
- Co-marketing: "Route alerts to the right owner automatically"
- PagerDuty partner ecosystem promotes integrated tools
**Strategic Non-Partnerships:**
- Do NOT partner with Datadog (competing service catalog)
- Do NOT seek AWS investment (maintain independence and future multi-cloud optionality)
- Do NOT pursue SI/consulting partnerships (wrong channel for $10/eng PLG product)
### 90-Day Launch Timeline

**Weeks 1-4: Build the Core**
| Week | Deliverable |
|------|------------|
| 1 | AWS auto-discovery engine: EC2, ECS, Lambda, RDS via read-only IAM role. Map resources to services using CloudFormation stacks, tags, naming conventions. |
| 2 | GitHub org scanner: repos, languages, CODEOWNERS, README extraction. Cross-reference with AWS to build service-to-repo mapping. |
| 3 | Service catalog UI: service cards, Cmd+K instant search (<500ms). Auth (GitHub OAuth), billing (Stripe). |
| 4 | Onboarding flow (connect AWS + GitHub in 3 clicks). Confidence scores. Correction UI. |
**Weeks 5-8: Polish & Beta**
| Week | Deliverable |
|------|------------|
| 5 | Slack bot: `/dd0c who owns <service>`. Responds in <2 seconds. |
| 6 | PagerDuty/OpsGenie integration: import on-call schedules, map to services. |
| 7 | Backstage YAML importer. Landing page. Waitlist. |
| 8 | Beta launch: invite 20 teams. Personal onboarding call with each. Obsessively collect feedback on discovery accuracy. |
**Weeks 9-12: Launch & First Revenue**
| Week | Deliverable |
|------|------------|
| 9 | Incorporate beta feedback. Fix top 5 discovery accuracy issues. |
| 10 | Publish "I Maintained Backstage for 18 Months" blog post. Ship Backstage Migration Calculator. |
| 11 | Public launch: HN "Show HN," Reddit (r/devops, r/platformengineering, r/aws), Twitter/X thread. Coordinated for maximum simultaneous visibility. |
| 12 | Convert beta users to paid. Target: 10 paying customers by end of week 12. |
**Pre-launch content (before writing portal code):**
- Ship the Backstage Migration Calculator and "Why I Quit Backstage" blog post first
- Validate demand with content before investing in engineering
- If the content doesn't resonate, the product won't either
---
## 5. BUSINESS MODEL
### Revenue Model
**Primary revenue:** Per-seat SaaS subscription at $10/engineer/month (Team tier).
**Revenue expansion mechanisms:**
1. **Seat expansion:** As customer engineering teams grow, revenue grows automatically
2. **Tier upgrade:** Team ($10/eng) → Business ($25/eng) as customers need RBAC, compliance, AI queries
3. **Module attach:** Portal customers add dd0c/cost, dd0c/alert, dd0c/run — each at $10-15/eng/month
4. **Effective ARPU trajectory:** $10/eng (portal only) → $25/eng (portal + 1 module) → $40/eng (portal + 2 modules + Business tier)
**Revenue model is NOT:**
- Usage-based (predictable revenue > usage volatility for a solo founder)
- Enterprise sales-led (no sales team, no SDRs, no demo calls)
- Freemium-dependent (free tier is acquisition, not the business)
### Unit Economics
**Cost structure (solo founder, Month 6):**
| Cost Item | Monthly |
|-----------|---------|
| Cloud infrastructure (AWS/Vercel) | $500-1,500 |
| Stripe payment processing (2.9% + $0.30) | ~3% of revenue |
| Domain, email, tooling | $200 |
| LLM API costs (discovery + AI queries, V2) | $200-500 |
| Brian's time (opportunity cost) | $15,000 (imputed) |
| **Total operating cost (excl. founder)** | **~$2,000-2,500/month** |
**Gross margin:** ~90% (SaaS infrastructure costs are minimal at this scale)
**Customer Acquisition Cost (CAC):** Near-zero for content-driven PLG. Primary cost is Brian's time writing blog posts and building free tools. Imputed CAC: ~$50-100 per customer (time spent on content ÷ customers acquired).
**Lifetime Value (LTV):** At $500/month average (50 engineers × $10), with 5% monthly churn → average customer lifetime of 20 months → LTV = $10,000. LTV:CAC ratio > 100:1. (This is unusually high because CAC is near-zero for PLG.)
**Payback period:** <1 month (self-serve credit card, no sales cycle, immediate revenue).
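The LTV figure above follows from the standard identity that average customer lifetime is the reciprocal of monthly churn:

```python
def ltv(monthly_revenue, monthly_churn):
    """Expected lifetime value: average lifetime in months is
    1/churn, so LTV = monthly revenue / churn rate."""
    return monthly_revenue / monthly_churn
```

At $500/month and 5% monthly churn: lifetime = 1/0.05 = 20 months, LTV = $10,000 — matching the figures above.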
### Path to Revenue Milestones
**$10K MRR — "Validation" (Target: Month 9-12)**
| Metric | Requirement |
|--------|-------------|
| Customers | 20-25 |
| Avg engineers per customer | 40-50 |
| Avg MRR per customer | $400-500 |
| Monthly churn | <5% |
| **Milestone significance** | Product-market fit confirmed. Solo founder business is viable. |
**$50K MRR — "Sustainability" (Target: Month 15-18)**
| Metric | Requirement |
|--------|-------------|
| Customers | 80-100 |
| Avg engineers per customer | 50-60 |
| Avg MRR per customer | $500-625 |
| Module attach rate | >20% (portal + at least one other dd0c module) |
| Monthly churn | <3% |
| **Milestone significance** | $600K ARR. Hire first contractor (support + integration maintenance). Brian focuses on product + growth. |
**$100K MRR — "Scale" (Target: Month 20-24)**
| Metric | Requirement |
|--------|-------------|
| Customers | 150-200 |
| Avg engineers per customer | 55-65 |
| Avg MRR per customer | $500-667 |
| Business tier adoption | >15% of customers |
| dd0c platform revenue (all modules) | $200K+ MRR combined |
| Monthly churn | <2.5% |
| **Milestone significance** | $1.2M ARR. Hire 1-2 FTEs (engineer + DevRel). Consider seed funding if growth warrants acceleration. |
### Solo Founder Constraints & Sequencing Rationale
The Party Mode board recommended building dd0c/portal **third** in the dd0c launch sequence, after dd0c/route (LLM cost router) and dd0c/cost (AWS cost anomaly). This sequencing is strategically correct for three reasons:
**1. Revenue before platform.**
dd0c/route and dd0c/cost have immediate monetary ROI — they save companies money on LLM and AWS bills in week one. This generates revenue and political capital before portal launches. Portal is a harder sell as a standalone ("pay us to organize your services") but an easy sell as a platform add-on ("you're already saving $2K/month with dd0c — now see which services cost what").
**2. Data before catalog.**
dd0c/cost generates per-service cost data. dd0c/alert generates per-service incident data. When portal launches, it can immediately show cost and incident data on every service card — making the portal 10x more valuable on day one than it would be as a standalone catalog. The modules create the data; the portal visualizes it.
**3. Customers before cold-start.**
By the time portal launches (Month 7-12 of the dd0c platform), dd0c already has paying customers on route and cost. These customers are the portal's first users — no cold-start problem, no "will anyone try this?" uncertainty. They're already in the dd0c ecosystem and the portal is a natural expansion.
**The sequencing risk:** If dd0c/route and dd0c/cost fail to gain traction, portal never launches. This is acceptable — if the FinOps wedge doesn't work, the platform thesis is wrong, and building a standalone IDP at $10/engineer is an even harder path.
**Solo founder capacity allocation (portal phase):**
- 50% — Discovery engine accuracy (the product)
- 20% — UI/UX and integrations
- 15% — Content marketing and community
- 10% — Infrastructure and ops
- 5% — Customer support (automated first, personal for top accounts)
---
## 6. RISKS & MITIGATIONS
### Risk 1: GitHub Ships a Native Service Catalog
**Probability:** 40% within 24 months
**Impact:** Catastrophic (existential)
**Severity:** CRITICAL
GitHub already has CODEOWNERS, dependency graphs, repository topics, and Actions. If GitHub adds a "Services" tab that aggregates these into a searchable catalog with auto-discovery from Actions deployment targets, dd0c/portal's core value proposition evaporates for GitHub-only shops.
**Why it might not happen:**
- GitHub's product strategy is focused on AI (Copilot) and security (Advanced Security). IDP is not their priority.
- Microsoft/GitHub has historically been slow to build platform features (GitHub Projects took years and is still mediocre).
- A real IDP requires cross-platform data (AWS, PagerDuty, Datadog) that GitHub doesn't have and may not want to integrate.
**Mitigations:**
- Build value GitHub can't replicate: cross-platform integration (AWS + GitHub + PagerDuty + Slack), the dd0c module flywheel, and AI-native querying
- Position dd0c as "GitHub + AWS + everything else" — not "GitHub but better"
- If GitHub announces a service catalog, immediately pivot to positioning dd0c as the multi-source aggregation layer that includes GitHub's data alongside operational tools
- **Speed is the primary mitigation.** Establish the beachhead and build switching costs before GitHub moves. Every month of head start is a month of habit formation.
**Kill trigger:** If GitHub announces a native service catalog at GitHub Universe 2026 that integrates with PagerDuty and AWS natively, kill dd0c/portal as a standalone product. Pivot to making it a free feature within the dd0c platform to drive adoption of paid modules.
### Risk 2: Auto-Discovery Accuracy Falls Below 80%
**Probability:** 35%
**Impact:** Critical (fatal to PLG motion)
**Severity:** CRITICAL
The entire product thesis rests on auto-discovery being "good enough" on first run. If engineers connect their AWS account and see 60% accuracy with wrong owners and phantom services, they close the tab and never return. Trust is binary. One bad first impression is fatal.
**Technical challenges:**
- AWS resources don't always map cleanly to "services" (is each Lambda a service? Each ECS task definition?)
- GitHub repos don't always map to deployed services (monorepos, shared libraries, archived repos)
- Ownership inference from git blame is noisy
- Naming conventions vary wildly across organizations
**Mitigations:**
- Confidence scores, not assertions: "85% confident @payments-team owns this (source: CODEOWNERS + git history). Confirm or correct."
- Conservative discovery: show 80 services at 90% accuracy rather than 150 services at 60% accuracy
- Rapid feedback loop: user corrections improve the model. After 10 corrections, accuracy should be >95%.
- Invest 50% of engineering time on discovery accuracy in year 1. This is the product.
- If fully autonomous discovery fails, pivot to "AI-assisted onboarding" — LLMs analyze repos and chat with team leads to build the catalog interactively.
**Kill trigger:** If accuracy is <60% after 3 months of engineering on diverse real-world AWS accounts, the technical thesis is wrong. Kill.
### Risk 3: Solo Founder Burnout / Capacity Ceiling
**Probability:** 50%
**Impact:** Critical
**Severity:** CRITICAL
Building an IDP with AWS integration, GitHub integration, PagerDuty, Slack bot, Cmd+K search, auto-discovery engine, SaaS infrastructure, billing, auth, and customer support — as one person — is enormous. The failure mode isn't "Brian can't code it." It's "Brian builds it, gets 30 customers, and drowns in support tickets, bug reports, and integration requests."
**Mitigations:**
- Ruthless scope control: V1 is auto-discovery + service cards + Cmd+K + Slack bot + billing. That's it.
- Architecture for zero maintenance: serverless infrastructure, automated deployments, automated monitoring
- AI-assisted development: Cursor/Copilot/Claude for 50%+ of code generation
- Kill criteria with deadlines: if milestones aren't hit, kill the product. Don't let sunk cost fallacy turn a 6-month experiment into a 2-year death march.
- Hire first contractor at $20K MRR for support and integration maintenance
### Risk 4: Datadog Bundles a Good-Enough Service Catalog
**Probability:** 30%
**Impact:** High
**Severity:** HIGH
Datadog already has a Service Catalog feature. With $2B+ revenue and massive distribution, if they invest seriously, teams already paying for Datadog get an IDP for "free."
**Mitigations:**
- Datadog's catalog is monitoring-centric (discovers from APM traces, not infrastructure). dd0c discovers from AWS + GitHub, which is more comprehensive.
- Datadog's pricing ($23+/host/month) means their IDP is "free" only for existing $50K+/year Datadog customers. For teams using CloudWatch or Grafana, Datadog's IDP is irrelevant.
- Position dd0c as monitoring-agnostic. Works with Datadog, Grafana, CloudWatch, or nothing.
### Risk 5: Market Is Smaller Than Estimated
**Probability:** 25%
**Impact:** High
**Severity:** HIGH
Do teams of 20-50 engineers actually need — and will they pay for — a service catalog? Many function fine with a Google Sheet and Slack.
**Mitigations:**
- Free tier tests this hypothesis at zero cost. If teams under 30 don't convert, raise minimum target to 50+.
- Beachhead is "teams that tried Backstage and failed" — already self-selected as needing an IDP.
- If market is smaller than expected, portal becomes a free feature driving adoption of paid dd0c modules (cost, alerts, drift).
### Kill Criteria Summary
| Milestone | Deadline | Kill Trigger |
|-----------|----------|-------------|
| Auto-discovery >75% accuracy on test accounts | Month 3 | <60% accuracy after 3 months of engineering |
| 10 beta users actively using weekly | Month 4 | Can't get 10 free users |
| 5 paying customers ($10/eng) | Month 6 | Can't convert 5 teams from free to paid |
| 20 paying customers, <10% monthly churn | Month 9 | Churn exceeds 10%/month (novelty, not habit) |
| $10K MRR | Month 12 | Market too small or GTM broken |
| GitHub announces native service catalog | Any time | Assess if dd0c differentiation survives. If GitHub covers 70%+ of dd0c value, kill standalone portal. |
### Pivot Options
If dd0c/portal fails as a standalone product, three pivot paths exist:
1. **Free portal → paid modules.** Make portal free. Use it as top-of-funnel for dd0c/cost and dd0c/alert. "Free service catalog → discover your services → see that Service X costs $847/month → upgrade to dd0c/cost." Portal becomes the world's most effective upsell mechanism.
2. **AI-assisted onboarding.** If autonomous discovery fails, pivot from "zero-config auto-discovery" to "AI-assisted catalog building." LLMs analyze repos, Slack history, and AWS resources, then chat with team leads to build the catalog interactively. Still faster than Backstage YAML, but with a human-in-the-loop.
3. **Agent Control Plane.** If AI agents make static catalogs obsolete, pivot dd0c/portal to be the registry and governance layer for AI agents operating in infrastructure. "Which agents have access to which services? What did they do? Who authorized it?" The portal becomes infrastructure for AI, not a competitor to AI.
---
## 7. SUCCESS METRICS
### North Star Metric: Daily Active Users (DAU), Not Seats
Most IDP companies measure seats (how many engineers have accounts). This is a vanity metric. An engineer can have an account and never log in. dd0c/portal's north star is **Daily Active Users as a percentage of total seats (DAU/Seats).**
**Why DAU, not seats:**
- DAU measures habit formation. If engineers open the portal daily, it's their browser homepage. If it's their browser homepage, switching costs are enormous.
- DAU correlates with retention. High DAU = low churn. Low DAU = the product is a novelty that will be forgotten.
- DAU drives the viral loop. Active users use the Slack bot, share screenshots, and recommend the product to colleagues. Inactive seat-holders do nothing.
**Target:** >40% DAU/Seats by Month 6. (For context: Slack achieves ~60% DAU/Seats. Linear achieves ~40%. Most enterprise SaaS tools achieve <15%.)
### Leading Indicators (Weekly Review)
| Metric | Month 3 Target | Month 6 Target | Month 12 Target |
|--------|---------------|----------------|-----------------|
| DAU / Total Seats | >25% | >40% | >45% |
| Cmd+K searches per user per week | >3 | >5 | >7 |
| Slack bot queries per org per week | >5 | >10 | >15 |
| Auto-discovery accuracy (1 - correction rate) | >75% | >85% | >90% |
| Time-to-first-value (signup → first search) | <10 min | <5 min | <5 min |
| Organic signup rate (word-of-mouth %) | >20% | >30% | >40% |
### Lagging Indicators (Monthly Review)
| Metric | Month 6 Target | Month 12 Target |
|--------|---------------|-----------------|
| Paying customers | 10-25 | 50-120 |
| MRR | $4K-12.5K | $25K-72K |
| Net Revenue Retention (NRR) | >105% | >110% |
| Monthly logo churn | <8% | <5% |
| Monthly revenue churn | <6% | <3% |
| Catalog completeness (% of actual services discovered) | >85% | >90% |
| Module attach rate (portal customers adding 2nd dd0c module) | N/A | >20% |
### Milestones
| Milestone | Target Date | Significance |
|-----------|------------|--------------|
| Auto-discovery engine working (>75% accuracy) | Month 3 | Technical thesis validated |
| 10 beta users with weekly active usage | Month 4 | Product resonates with real users |
| First 5 paying customers | Month 6 | Willingness to pay confirmed |
| "I Quit Backstage" post hits 10K+ views | Month 3-4 | Content-market fit validated |
| 20 paying customers, <10% churn | Month 9 | Retention thesis validated |
| $10K MRR | Month 12 | Solo founder business viable |
| First customer adds second dd0c module | Month 12 | Platform flywheel activated |
| $50K MRR | Month 18 | Hire first team member |
| >90% auto-discovery accuracy | Month 12 | Technical moat established |
| AWS Marketplace listing live | Month 6 | Distribution channel opened |
### The Anti-Metric: Feature Count
Do NOT measure or celebrate feature count. Every feature added is maintenance burden for a solo founder. The goal is maximum value from minimum features. Measure value delivered per feature, not features shipped. The product with 10 features used daily beats the product with 100 features used quarterly.
---
## APPENDIX: SCENARIO PROJECTIONS
### Scenario A: "The Rocket" (15% probability)
Everything works. Auto-discovery is accurate. Content goes viral. Backstage refugees flock to dd0c.
| Month | Customers | Avg Engineers | MRR | ARR |
|-------|-----------|--------------|-----|-----|
| 6 | 25 | 50 | $12,500 | $150K |
| 9 | 60 | 55 | $33,000 | $396K |
| 12 | 120 | 60 | $72,000 | $864K |
### Scenario B: "The Grind" (50% probability)
Product works but growth is slower. Content gets moderate traction. Word of mouth builds gradually.
| Month | Customers | Avg Engineers | MRR | ARR |
|-------|-----------|--------------|-----|-----|
| 6 | 10 | 40 | $4,000 | $48K |
| 9 | 25 | 45 | $11,250 | $135K |
| 12 | 50 | 50 | $25,000 | $300K |
### Scenario C: "The Stall" (25% probability)
Discovery accuracy is a persistent challenge. Market is smaller than expected.
| Month | Customers | Avg Engineers | MRR | ARR |
|-------|-----------|--------------|-----|-----|
| 6 | 5 | 35 | $1,750 | $21K |
| 9 | 10 | 35 | $3,500 | $42K |
| 12 | 15 | 40 | $6,000 | $72K |
### Scenario D: "The Kill" (10% probability)
GitHub ships a service catalog, or discovery never reaches acceptable accuracy.
| Month | Action |
|-------|--------|
| 6 | <5 paying customers. Reassess. |
| 9 | No improvement. Kill standalone portal. |
| 9+ | Pivot: portal becomes free feature within dd0c platform to drive paid module adoption. |
**Expected Value (Month 12 ARR):**
E(ARR) = (0.15 × $864K) + (0.50 × $300K) + (0.25 × $72K) + (0.10 × $0) = **~$298K**
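The same arithmetic, executable (figures in $K, taken from the scenario tables above):

```python
# Probability-weighted Month-12 ARR across the four scenarios (figures in $K).
scenarios = {
    "rocket": (0.15, 864),  # Scenario A
    "grind":  (0.50, 300),  # Scenario B
    "stall":  (0.25, 72),   # Scenario C
    "kill":   (0.10, 0),    # Scenario D
}

expected_arr_k = sum(p * arr for p, arr in scenarios.values())
print(f"E(ARR) = ${expected_arr_k:.1f}K")  # $297.6K, i.e. ~$298K
```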
---
*This product brief synthesizes findings from four prior phases: Brainstorm, Design Thinking, Innovation Strategy, and Party Mode Advisory Board Review. All contradictions between phases have been resolved in favor of the most conservative, execution-focused position. The brief is designed for investor-ready presentation and solo founder execution planning.*
*Build the portal. Build it third. Build it fast. And build the discovery engine like your business depends on it — because it does.*

# dd0c/portal — Test Architecture & TDD Strategy
**Product:** Lightweight Internal Developer Portal
**Phase:** 6 — Architecture Design
**Date:** 2026-02-28
**Status:** Draft
---
## 1. Testing Philosophy & TDD Workflow
### Core Principle
dd0c/portal's most critical logic — ownership inference, discovery reconciliation, and confidence scoring — is pure algorithmic code with well-defined inputs and outputs. This is ideal TDD territory. The test suite is the specification.
The product's >80% discovery accuracy target is not a QA metric — it's a product promise. Tests enforce it continuously.
### Red-Green-Refactor Adapted to This Product
```
RED → Write a failing test that encodes a discovery heuristic or ownership rule
GREEN → Write the minimum code to pass it (no clever abstractions yet)
REFACTOR → Clean up once the rule is proven correct against real-world fixtures
```
**Adapted cycle for discovery heuristics:**
1. Capture a real-world failure case (e.g., "Lambda functions named `payment-*` were not grouped into a service")
2. Write a unit test encoding the expected grouping behavior using a fixture of that Lambda response
3. Fix the heuristic
4. Add the fixture to the regression suite permanently
This means every production accuracy bug becomes a permanent test. The test suite grows as a living record of every edge case the discovery engine has encountered.
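Step 2 of the cycle might look like this in practice (the fixture values and the grouping helper are hypothetical stand-ins for the real heuristic):

```python
# Regression test born from a real production miss: "payment-*" Lambdas were
# not grouped into one service. The fixture freezes the offending input; the
# assertion encodes the expected grouping permanently.
def group_lambdas_by_prefix(function_names, min_siblings=2):
    """Toy stand-in for the real heuristic: group on the shared first token."""
    groups = {}
    for name in function_names:
        groups.setdefault(name.split("-")[0], []).append(name)
    return {prefix: fns for prefix, fns in groups.items() if len(fns) >= min_siblings}

def test_payment_lambdas_grouped_into_single_service():
    fixture = ["payment-webhook", "payment-processor", "payment-refund", "data-export-job"]
    services = group_lambdas_by_prefix(fixture)
    assert set(services) == {"payment"}
    assert len(services["payment"]) == 3

test_payment_lambdas_grouped_into_single_service()
```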
### When to Write Tests First vs. Integration Tests Lead
| Scenario | Approach | Rationale |
|----------|----------|-----------|
| Ownership scoring algorithm | Unit-first TDD | Pure function, deterministic, no I/O |
| Discovery heuristics (CFN → service mapping) | Unit-first TDD | Deterministic logic over fixture data |
| GitHub GraphQL query construction | Unit-first TDD | Query builder logic is pure |
| AWS API pagination handling | Integration-first | Behavior depends on real API shape |
| Meilisearch index sync | Integration-first | Depends on Meilisearch document model |
| DynamoDB schema migrations | Integration-first | Requires real DynamoDB Local behavior |
| WebSocket progress events | E2E-first | Requires full pipeline to be meaningful |
| Stripe webhook handling | Integration-first | Depends on Stripe event payload shape |
### Test Naming Conventions
All tests follow the pattern: `[unit under test]_[scenario]_[expected outcome]`
**TypeScript/Node.js (Jest):**
```typescript
describe('OwnershipInferenceEngine', () => {
describe('scoreOwnership', () => {
it('returns_primary_owner_when_codeowners_present_with_high_confidence', () => {})
it('marks_service_unowned_when_top_score_below_threshold', () => {})
it('marks_service_ambiguous_when_top_two_scores_within_tolerance', () => {})
})
})
```
**Python (pytest):**
```python
class TestOwnershipScorer:
def test_codeowners_signal_weighted_highest_among_all_signals(self): ...
def test_git_blame_frequency_used_when_codeowners_absent(self): ...
def test_confidence_below_threshold_flags_service_as_unowned(self): ...
```
**File naming:**
- Unit tests: `*.test.ts` / `test_*.py` co-located with source
- Integration tests: `*.integration.test.ts` / `test_*_integration.py` in `tests/integration/`
- E2E tests: `tests/e2e/*.spec.ts` (Playwright)
---
## 2. Test Pyramid
### Recommended Ratio: 70 / 20 / 10
```
┌─────────────┐
│ E2E / Smoke│ 10% (~30 tests)
│ (Playwright)│ Critical user journeys only
├─────────────┤
│ Integration │ 20% (~80 tests)
│ (real deps) │ Service boundaries, API contracts
├─────────────┤
│ Unit │ 70% (~280 tests)
│ (pure logic)│ All heuristics, scoring, parsing
└─────────────┘
```
### Unit Test Targets (per component)
| Component | Language | Test Framework | Target Coverage |
|-----------|----------|---------------|----------------|
| AWS Scanner (heuristics) | Python | pytest | 90% |
| GitHub Scanner (parsers) | Node.js | Jest | 90% |
| Reconciliation Engine | Node.js | Jest | 85% |
| Ownership Inference | Python | pytest | 95% |
| Portal API (route handlers) | Node.js | Jest + Supertest | 80% |
| Search proxy + cache logic | Node.js | Jest | 85% |
| Slack Bot command handlers | Node.js | Jest | 80% |
| Feature flag evaluation | Node.js/Python | Jest/pytest | 95% |
| Governance policy engine | Node.js | Jest | 95% |
| Schema migration validators | Node.js | Jest | 100% |
### Integration Test Boundaries
| Boundary | What to Test | Tool |
|----------|-------------|------|
| Discovery → GitHub API | GraphQL query shape, pagination, rate limit handling | MSW (mock service worker) or nock |
| Discovery → AWS APIs | boto3 call sequences, pagination, error handling | moto (AWS mock library) |
| Reconciler → PostgreSQL | Upsert logic, conflict resolution, RLS enforcement | Testcontainers (PostgreSQL) |
| Inference → PostgreSQL | Ownership write, confidence update, correction propagation | Testcontainers (PostgreSQL) |
| API → Meilisearch | Index sync, search query construction, tenant filter injection | Meilisearch test instance (Docker) |
| API → Redis | Cache set/get/invalidation, TTL behavior | ioredis-mock or Testcontainers (Redis) |
| Slack Bot → Portal API | Command → search → format response | Supertest against local API |
| Stripe webhook → API | Subscription activation, plan change, cancellation | Stripe CLI webhook forwarding |
### E2E / Smoke Test Scenarios
1. Full onboarding: GitHub OAuth → AWS connection → discovery trigger → catalog populated
2. Cmd+K search returns results in <200ms after discovery
3. Ownership correction propagates to similar services
4. Slack `/dd0c who owns` returns correct owner
5. Discovery accuracy: synthetic org with known ground truth scores >80%
6. Governance strict mode: discovery populates pending queue, not catalog directly
7. Panic mode: all catalog writes return 503
---
## 3. Unit Test Strategy (Per Component)
### 3.1 AWS Scanner (Python / pytest)
**What to test:**
- Resource-to-service grouping heuristics (the core logic)
- Confidence score assignment per signal type
- Pagination handling for each AWS API
- Cross-region scan aggregation
- Error handling for throttling, missing permissions, empty accounts
**Key test cases:**
```python
# tests/unit/test_cfn_scanner.py
class TestCloudFormationScanner:
    def test_stack_name_becomes_service_name_with_high_confidence(self):
        """Given a CFN stack named "payment-api",
        expect a service entity with name="payment-api", confidence=0.95."""

    def test_stack_tags_extracted_as_service_metadata(self):
        """Given a stack with tags {"service": "payment", "team": "payments"},
        expect service.metadata to include both tags."""

    def test_stacks_in_multiple_regions_deduplicated_by_name(self):
        """Given the same stack name in us-east-1 and us-west-2,
        expect a single service entity with both regions in infrastructure."""

    def test_deleted_stacks_excluded_from_results(self):
        """Given a stack with status DELETE_COMPLETE,
        expect it to be excluded from discovered services."""

    def test_pagination_fetches_all_stacks_beyond_first_page(self):
        """Given a mock returning 2 pages of stacks,
        expect all stacks from both pages to be processed."""

class TestLambdaScanner:
    def test_lambdas_with_shared_prefix_grouped_into_single_service(self):
        """Given ["payment-webhook", "payment-processor", "payment-refund"],
        expect a single service "payment" with confidence=0.60."""

    def test_lambda_with_apigw_trigger_gets_higher_confidence(self):
        """Given a Lambda with an API Gateway event source mapping,
        expect confidence=0.85 (not 0.60)."""

    def test_standalone_lambda_without_prefix_pattern_kept_as_individual(self):
        """Given a Lambda named "data-export-job" with no siblings,
        expect an individual service entity, not a group."""

class TestServiceGroupingHeuristics:
    def test_cfn_stack_takes_priority_over_ecs_service_for_same_name(self):
        """Given CFN stack "payment-api" AND ECS service "payment-api",
        expect a single service entity (not a duplicate), source=cloudformation."""

    def test_explicit_github_repo_tag_overrides_name_matching(self):
        """Given an AWS resource with tag github_repo="acme/payments-v2",
        expect repo_link="acme/payments-v2" with confidence=0.95
        (not the fuzzy name match result)."""
```
**Mocking strategy:**
- Use `moto` to mock all boto3 calls — no real AWS calls in unit tests
- Fixture files in `tests/fixtures/aws/` contain realistic API response payloads
- Each fixture named after the scenario: `cfn_stacks_multi_region.json`, `lambda_functions_with_apigw.json`
```python
import pytest
from moto import mock_cloudformation, mock_ecs, mock_lambda

@pytest.fixture
def mock_aws(aws_credentials):
    with mock_cloudformation(), mock_ecs(), mock_lambda():
        yield
def test_full_scan_produces_expected_service_count(mock_aws, cfn_fixture):
setup_mock_cfn_stacks(cfn_fixture)
result = AWSScanner(tenant_id="test", role_arn="arn:aws:iam::123:role/test").scan()
assert len(result.services) == cfn_fixture["expected_service_count"]
```
---
### 3.2 GitHub Scanner (Node.js / Jest)
**What to test:**
- GraphQL query construction and batching
- CODEOWNERS file parsing (all valid formats)
- README first-paragraph extraction
- Deploy workflow target extraction
- Rate limit detection and backoff
**Key test cases:**
```typescript
// tests/unit/github-scanner/codeowners-parser.test.ts
describe('CODEOWNERSParser', () => {
it('parses_simple_wildcard_ownership_to_team', () => {
const input = '* @acme/platform-team'
expect(parse(input)).toEqual([{ pattern: '*', owners: ['@acme/platform-team'] }])
})
it('parses_path_specific_ownership', () => {
const input = '/src/payments/ @acme/payments-team'
expect(parse(input)).toEqual([{ pattern: '/src/payments/', owners: ['@acme/payments-team'] }])
})
it('handles_multiple_owners_per_pattern', () => {
const input = '*.ts @acme/frontend @acme/platform'
    expect(parse(input)[0].owners).toHaveLength(2)
})
it('ignores_comment_lines', () => {
const input = '# This is a comment\n* @acme/team'
expect(parse(input)).toHaveLength(1)
})
it('returns_empty_array_for_missing_codeowners_file', () => {
expect(parse(null)).toEqual([])
})
it('handles_individual_user_ownership_not_just_teams', () => {
const input = '* @sarah-chen'
expect(parse(input)[0].owners[0]).toBe('@sarah-chen')
})
})
describe('READMEExtractor', () => {
it('extracts_first_non_heading_non_badge_paragraph', () => {
const readme = `# Payment Gateway\n\n![build](badge.svg)\n\nHandles Stripe checkout flows.`
expect(extractDescription(readme)).toBe('Handles Stripe checkout flows.')
})
it('returns_null_when_readme_has_only_headings_and_badges', () => {
const readme = `# Title\n\n![badge](url)`
expect(extractDescription(readme)).toBeNull()
})
})
describe('WorkflowTargetExtractor', () => {
it('extracts_ecs_service_name_from_deploy_workflow', () => {
const yaml = loadFixture('deploy-workflow-ecs.yml')
expect(extractDeployTarget(yaml)).toEqual({
type: 'ecs_service',
name: 'payment-api',
cluster: 'production'
})
})
it('extracts_lambda_function_name_from_serverless_deploy', () => {
const yaml = loadFixture('deploy-workflow-lambda.yml')
expect(extractDeployTarget(yaml)).toEqual({
type: 'lambda_function',
name: 'payment-webhook-handler'
})
})
})
```
**Mocking strategy:**
- Use `nock` or `msw` to intercept GitHub GraphQL API calls
- Fixture files in `tests/fixtures/github/` for realistic API responses
- Test the GraphQL query builder separately from the HTTP client
---
### 3.3 Reconciliation Engine (Node.js / Jest)
**What to test:**
- Cross-referencing AWS resources with GitHub repos (all 5 matching rules)
- Deduplication when multiple signals point to the same service
- Conflict resolution when signals disagree
- Batch processing of SQS messages
**Key test cases:**
```typescript
describe('ReconciliationEngine', () => {
describe('matchAWSToGitHub', () => {
it('explicit_tag_match_takes_highest_priority', () => {
const awsService = buildAWSService({ tags: { github_repo: 'acme/payment-gateway' } })
const ghRepo = buildGHRepo({ name: 'payment-gateway', org: 'acme' })
const result = reconcile([awsService], [ghRepo])
expect(result[0].repoLinkSource).toBe('explicit_tag')
expect(result[0].repoLinkConfidence).toBe(0.95)
})
it('deploy_workflow_match_used_when_no_explicit_tag', () => {
const awsService = buildAWSService({ name: 'payment-api' })
const ghRepo = buildGHRepo({ deployTarget: 'payment-api' })
const result = reconcile([awsService], [ghRepo])
expect(result[0].repoLinkSource).toBe('deploy_workflow')
})
it('fuzzy_name_match_used_as_fallback', () => {
const awsService = buildAWSService({ name: 'payment-service' })
const ghRepo = buildGHRepo({ name: 'payment-svc' })
const result = reconcile([awsService], [ghRepo])
expect(result[0].repoLinkSource).toBe('name_match')
expect(result[0].repoLinkConfidence).toBe(0.75)
})
it('no_match_produces_aws_only_service_entity', () => {
const awsService = buildAWSService({ name: 'legacy-monolith' })
const result = reconcile([awsService], [])
expect(result[0].repoUrl).toBeNull()
expect(result[0].discoverySources).toContain('cloudformation')
expect(result[0].discoverySources).not.toContain('github_repo')
})
it('deduplicates_cfn_stack_and_ecs_service_with_same_name', () => {
const cfnService = buildAWSService({ source: 'cloudformation', name: 'payment-api' })
const ecsService = buildAWSService({ source: 'ecs_service', name: 'payment-api' })
const result = reconcile([cfnService, ecsService], [])
expect(result).toHaveLength(1)
expect(result[0].discoverySources).toContain('cloudformation')
expect(result[0].discoverySources).toContain('ecs_service')
})
})
})
```
---
### 3.4 Ownership Inference Engine (Python / pytest)
This is the highest-value unit test target. Ownership inference is the most complex logic and the most likely source of accuracy failures.
**Key test cases:**
```python
class TestOwnershipScorer:
def test_codeowners_weighted_highest_at_0_40(self):
signals = [Signal(type='codeowners', team='payments', raw_score=1.0)]
result = score_ownership(signals)
assert result['payments'].weighted_score == pytest.approx(0.40)
def test_multiple_signals_summed_correctly(self):
signals = [
Signal(type='codeowners', team='payments', raw_score=1.0), # 0.40
Signal(type='cfn_tag', team='payments', raw_score=1.0), # 0.20
Signal(type='git_blame_frequency', team='payments', raw_score=1.0), # 0.25
]
result = score_ownership(signals)
assert result['payments'].total_score == pytest.approx(0.85)
def test_primary_owner_is_highest_scoring_team(self):
signals = [
Signal(type='codeowners', team='payments', raw_score=1.0),
Signal(type='git_blame_frequency', team='platform', raw_score=1.0),
]
result = score_ownership(signals)
assert result.primary_owner == 'payments'
def test_service_marked_unowned_when_top_score_below_0_50(self):
signals = [Signal(type='git_blame_frequency', team='unknown', raw_score=0.3)]
result = score_ownership(signals)
assert result.status == 'unowned'
def test_service_marked_ambiguous_when_top_two_within_0_10(self):
signals = [
Signal(type='codeowners', team='payments', raw_score=0.8),
Signal(type='codeowners', team='platform', raw_score=0.75),
]
result = score_ownership(signals)
assert result.status == 'ambiguous'
def test_user_correction_overrides_all_inference_with_score_1_00(self):
signals = [
Signal(type='codeowners', team='payments', raw_score=1.0),
Signal(type='user_correction', team='platform', raw_score=1.0),
]
result = score_ownership(signals)
assert result.primary_owner == 'platform'
assert result.primary_confidence == 1.00
assert result.primary_source == 'user_correction'
def test_correction_propagation_applies_to_matching_repo_prefix(self):
correction = Correction(repo='payment-gateway', team='payments')
candidates = ['payment-processor', 'payment-webhook', 'auth-service']
propagated = propagate_correction(correction, candidates)
assert 'payment-processor' in propagated
assert 'payment-webhook' in propagated
assert 'auth-service' not in propagated
```
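A minimal scorer consistent with the cases above; the signal weights (0.40 / 0.25 / 0.20) and thresholds (0.50 unowned floor, 0.10 ambiguity band) come from the tests, while the function shape and return format are illustrative assumptions:

```python
from collections import defaultdict
from dataclasses import dataclass

# Weights per the test cases; unknown signal types contribute nothing.
WEIGHTS = {"codeowners": 0.40, "git_blame_frequency": 0.25, "cfn_tag": 0.20}

@dataclass
class Signal:
    type: str
    team: str
    raw_score: float

def score_ownership(signals):
    # A user correction is authoritative: it overrides all inference.
    for s in signals:
        if s.type == "user_correction":
            return {"primary_owner": s.team, "confidence": 1.00,
                    "source": "user_correction", "status": "owned"}
    totals = defaultdict(float)
    for s in signals:
        totals[s.team] += WEIGHTS.get(s.type, 0.0) * s.raw_score
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    primary, top = ranked[0]
    if len(ranked) > 1 and top - ranked[1][1] < 0.10:
        status = "ambiguous"
    elif top < 0.50:
        status = "unowned"
    else:
        status = "owned"
    return {"primary_owner": primary, "confidence": top,
            "source": "inference", "status": status}
```

Note the ordering choice: the ambiguity check runs before the unowned threshold, so two close contenders surface for human review rather than being silently dropped.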
---
### 3.5 Portal API — Route Handlers (Node.js / Jest + Supertest)
**What to test:**
- Tenant isolation enforcement (tenant_id injected into every query)
- Search endpoint proxies to Meilisearch with mandatory tenant filter
- PATCH /services enforces correction logging
- Auth middleware rejects unauthenticated requests
```typescript
describe('GET /api/v1/services/search', () => {
it('injects_tenant_id_filter_into_meilisearch_query', async () => {
const spy = jest.spyOn(meilisearchClient, 'search')
await request(app).get('/api/v1/services/search?q=payment').set('Authorization', `Bearer ${tenantAToken}`)
expect(spy).toHaveBeenCalledWith(expect.objectContaining({
filter: expect.stringContaining(`tenant_id = '${TENANT_A_ID}'`)
}))
})
it('returns_401_when_no_auth_token_provided', async () => {
const res = await request(app).get('/api/v1/services/search?q=payment')
expect(res.status).toBe(401)
})
it('tenant_a_cannot_see_tenant_b_services', async () => {
// Seed Meilisearch with services for both tenants
// Query as tenant A, assert no tenant B results
})
})
describe('PATCH /api/v1/services/:id', () => {
it('stores_correction_in_corrections_table', async () => {
await request(app)
.patch(`/api/v1/services/${SERVICE_ID}`)
.send({ team_id: NEW_TEAM_ID })
.set('Authorization', `Bearer ${adminToken}`)
const correction = await db.corrections.findFirst({ where: { service_id: SERVICE_ID } })
expect(correction).toBeDefined()
expect(correction.new_value).toMatchObject({ team_id: NEW_TEAM_ID })
})
it('sets_confidence_to_1_00_on_user_correction', async () => {
    await request(app).patch(`/api/v1/services/${SERVICE_ID}`).send({ team_id: NEW_TEAM_ID }).set('Authorization', `Bearer ${adminToken}`)
const ownership = await db.service_ownership.findFirst({ where: { service_id: SERVICE_ID } })
expect(ownership.confidence).toBe(1.00)
expect(ownership.source).toBe('user_correction')
})
})
```
### 3.6 Slack Bot Command Handlers (Node.js / Jest)
**What to test:**
- Command parsing (`/dd0c who owns <service>`)
- Typo tolerance matching logic (delegated to search, but bot needs to handle 0 results)
- Block kit message formatting
- Error handling (unauthorized workspace, missing service)
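The production handlers are Node.js; as a language-neutral sketch, the parsing and zero-result contract under test looks roughly like this (the command grammar and reply wording are assumptions):

```python
import re

# Hypothetical parser for the `/dd0c who owns <service>` slash command.
COMMAND_RE = re.compile(r"^who\s+owns\s+(?P<service>\S+)$", re.IGNORECASE)

def parse_command(text):
    match = COMMAND_RE.match(text.strip())
    if not match:
        return {"error": "usage: /dd0c who owns <service>"}
    return {"action": "who_owns", "service": match.group("service")}

def format_reply(search_hits, service):
    # Zero results must produce a helpful message, never an empty reply.
    if not search_hits:
        return f"No service matching `{service}` found. Try `/dd0c search {service}`."
    top = search_hits[0]
    return f"*{top['name']}* is owned by {top['owner']} (confidence {top['confidence']:.0%})."
```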
### 3.7 Feature Flags & Governance Policy (Node.js / Jest)
**What to test:**
- Flag evaluation (`openfeature` provider)
- Governance strict vs. audit mode
- Panic mode blocking writes
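The three behaviors under test reduce to a small decision function. A sketch, assuming the mode names from Section 8 (the function shapes are illustrative):

```python
# Illustrative policy decision for discovered services and catalog writes.
def route_discovered_service(policy):
    if policy.get("panic_mode"):
        return "rejected"          # no writes of any kind during panic mode
    if policy.get("mode") == "strict":
        return "pending_review"    # humans approve before catalog insert
    return "catalog"               # audit mode: write through, log for review

def catalog_write_status(policy):
    # The API layer maps a panic-mode rejection to HTTP 503.
    return 503 if policy.get("panic_mode") else 200
```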
---
## 4. Integration Test Strategy
Integration tests verify that our code correctly interacts with external boundaries: databases, caches, search indices, and third-party APIs.
### 4.1 Service Boundary Tests
- **Discovery ↔ GitHub/GitLab:** Use `nock` or `MSW` to mock the GitHub GraphQL endpoint. Assert that the Node.js scanner constructs the correct query and handles rate limits (HTTP 403/429) via retries.
- **Catalog ↔ PostgreSQL:** Use Testcontainers for PostgreSQL to verify complex `upsert` queries, foreign key constraints, and RLS (Row-Level Security) tenant isolation.
- **API ↔ Meilisearch:** Use a Meilisearch Docker container. Assert that document syncing (PostgreSQL -> SQS -> Meilisearch) completes and search queries with `tenant_id` filters return the expected subset of data.
### 4.2 Git Provider API Contract Tests
- Write scheduled "contract tests" that run against the *live* GitHub API daily using a dedicated test org.
- These detect if GitHub changes their GraphQL schema or rate limit behavior.
- Assert that `HEAD:CODEOWNERS` blob extraction still works.
### 4.3 Testcontainers for Local Infrastructure
- **Database:** `testcontainers-node` spinning up `postgres:15-alpine`.
- **Search:** `getmeili/meilisearch:latest`.
- **Cache:** `redis:7-alpine`.
- Run these in GitHub Actions via Docker-in-Docker.
---
## 5. E2E & Smoke Tests
E2E tests treat the system as a black box, interacting only through the API and the React UI. We keep these fast and focused on the "5-Minute Miracle" critical path.
### 5.1 Critical User Journeys (Playwright)
1. **The Onboarding Flow:** Mock GitHub OAuth login -> Connect AWS (mock CFN role ARN validation) -> Trigger Discovery -> Wait for WebSocket completion -> Verify 147 services appear in catalog.
2. **Cmd+K Search:** Open modal (`Cmd+K`) -> type "pay" -> assert "payment-gateway" is highlighted in < 200ms -> press Enter -> assert service detail card opens.
3. **Correcting Ownership:** Open service detail -> Click "Correct Owner" -> select new team -> assert badge changes to 100% confidence -> assert Meilisearch is updated.
### 5.2 The >80% Auto-Discovery Accuracy Validation
- **The "Party Mode" Org:** Maintain a real GitHub org and a mock AWS environment with exactly 100 known services, 10 known teams, and specific chaotic naming conventions.
- **The Assertion:** Run discovery. Assert that > 80 of the services are correctly inferred with the right primary owner and repo link.
- *This is the most important test in the suite. If a PR drops this below 80%, it cannot be merged.*
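The accuracy gate itself is a few lines. A sketch assuming a dict-shaped catalog and ground truth (field names are illustrative):

```python
# Compare the discovered catalog against the synthetic org's known ground
# truth: a service counts as correct only if owner AND repo link both match.
def discovery_accuracy(discovered, ground_truth):
    correct = 0
    for name, expected in ground_truth.items():
        found = discovered.get(name)
        if found and found["owner"] == expected["owner"] and found["repo"] == expected["repo"]:
            correct += 1
    return correct / len(ground_truth)

def assert_accuracy_gate(discovered, ground_truth, threshold=0.80):
    accuracy = discovery_accuracy(discovered, ground_truth)
    assert accuracy >= threshold, f"accuracy {accuracy:.0%} below gate {threshold:.0%}"
```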
### 5.3 Synthetic Topology Generation
- Script to generate `N` mock CFN stacks, `M` ECS services, and `K` GitHub repos to feed the E2E environment without hitting AWS/GitHub limits.
---
## 6. Performance & Load Testing
Load tests ensure the serverless architecture scales correctly and the Cmd+K search remains instantaneous.
### 6.1 Discovery Scan Benchmarks
- **Target:** 500 AWS resources + 500 GitHub repos scanned and reconciled in < 120 seconds.
- **Tooling:** K6 or Artillery. Push 5,000 synthetic SQS messages into the Reconciler queue and measure Lambda batch processing throughput.
### 6.2 Catalog Query Latency
- **Target:** API search endpoint returns in < 100ms at the 99th percentile.
- **Test:** Load Meilisearch with 10,000 service documents. Fire Cmd+K search requests at a sustained 50 requests/second. Assert p99 latency stays under the 100ms target.
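K6 and Artillery report percentiles natively; for ad-hoc verification, a nearest-rank p99 over collected latencies is enough (the exact percentile definition varies slightly between tools):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile over raw latency samples (milliseconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Gate: assert percentile(collected_latencies_ms, 99) < 100
```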
### 6.3 Concurrent Scorecard Evaluation
- Ensure the Python inference Lambda can evaluate 1,000 services concurrently without database connection exhaustion (using Aurora Serverless v2 connection pooling).
---
## 7. CI/CD Pipeline Integration
The test pyramid is enforced through GitHub Actions.
### 7.1 Test Stages
- **Pre-commit:** Husky runs ESLint, Prettier, and fast unit tests (Jest/pytest) for changed files only.
- **PR Gate:** Runs the full Unit and Integration test suites. Blocks merge if coverage drops or tests fail.
- **Merge (Main):** Deploys to Staging. Runs E2E Critical User Journeys and the 80% Accuracy Validation suite against the Party Mode org.
- **Post-Deploy:** Smoke tests verify health endpoints and ALB routing in production.
### 7.2 Coverage Thresholds
- Global Unit Test Coverage: 80%
- Ownership Inference & Reconciliation Logic: 95%
- Feature Flag & Governance Evaluators: 100%
### 7.3 Test Parallelization
- Jest tests run with `--maxWorkers=50%` locally, `100%` in CI.
- Integration tests using Testcontainers run serially per file to avoid database port conflicts, or use dynamic port binding and separate schemas for parallel execution.
---
## 8. Transparent Factory Tenet Testing
Testing the governance and compliance features of the IDP itself.
### 8.1 Feature Flag Circuit Breakers
- **Test:** Enable a flagged discovery heuristic that generates 10 phantom services.
- **Assert:** The system detects the threshold (>5 unconfirmed), auto-disables the flag, and marks the 10 services as `status: quarantined`.
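The breaker logic under test might look like this — the function and field names are assumptions; the real evaluator lives in the flag service, with the test driving it through 10 phantom services:

```python
UNCONFIRMED_THRESHOLD = 5  # >5 unconfirmed discoveries trips the breaker

def evaluate_circuit_breaker(flag: dict, discovered_services: list) -> tuple:
    """Auto-disable a discovery flag that produced too many unconfirmed services."""
    unconfirmed = [s for s in discovered_services if not s.get("confirmed")]
    if flag["enabled"] and len(unconfirmed) > UNCONFIRMED_THRESHOLD:
        flag["enabled"] = False
        flag["disabled_reason"] = "circuit_breaker"
        for service in unconfirmed:
            # Quarantined services are hidden until a human confirms them.
            service["status"] = "quarantined"
    return flag, discovered_services
```

The test then asserts both halves: the flag flipped off, and every phantom carries `status: quarantined`.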
### 8.2 Schema Migration Validation
- **Test:** Attempt to apply a PR that drops a column from the `services` table.
- **Assert:** CI migration validator script fails the build (additive-only rule).
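A naive sketch of the additive-only validator — regex-scanning the migration SQL is an assumption for illustration; a production validator would parse the statements properly:

```python
import re

# Destructive operations that violate the additive-only rule.
FORBIDDEN_PATTERNS = [
    r"\bDROP\s+COLUMN\b",
    r"\bDROP\s+TABLE\b",
    r"\bALTER\s+COLUMN\b.*\bTYPE\b",  # in-place type changes can lose data
]

def validate_migration(sql: str) -> list:
    """Return the forbidden patterns found; an empty list means the migration is additive."""
    flat = " ".join(sql.split()).upper()  # normalize whitespace and case
    return [pat for pat in FORBIDDEN_PATTERNS if re.search(pat, flat)]
```

CI fails the build whenever the returned list is non-empty, which is exactly the assertion in the test above.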
### 8.3 Decision Log Enforcement
- **Test:** Run a discovery scan where service ownership is inferred from `git blame`.
- **Assert:** A `decision_log` entry is written to PostgreSQL with the prompt/reasoning, alternatives, and confidence.
### 8.4 OTEL Span Assertions
- **Test:** Run the Reconciler Lambda.
- **Assert:** The `catalog_scan` parent span contains child spans for `ownership_inference` with attributes for `catalog.service_id`, `catalog.ownership_signals`, and `catalog.confidence_score`. Use an in-memory OTEL exporter for testing.
### 8.5 Governance Policy Enforcement
- **Test:** Set tenant policy to `strict` mode. Simulate auto-discovery finding a new service.
- **Assert:** Service is placed in the "pending review" queue and NOT visible in the main catalog.
- **Test:** Set `panic_mode: true`. Attempt a `PATCH /api/v1/services/123`.
- **Assert:** HTTP 503 Service Unavailable.
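Both policy behaviors reduce to small pure functions, which is what makes them unit-testable. A sketch under assumed names (the real evaluators would sit in the API middleware and the discovery pipeline respectively):

```python
def reject_writes_in_panic(policy: dict) -> int:
    """Panic mode fails every mutating API request with 503 Service Unavailable."""
    return 503 if policy.get("panic_mode") else 200

def admit_discovered_service(policy: dict, service: dict) -> dict:
    """Strict mode parks auto-discovered services in the review queue, hidden from the catalog."""
    if policy.get("mode") == "strict":
        return {**service, "queue": "pending_review", "visible_in_catalog": False}
    return {**service, "visible_in_catalog": True}
```

Keeping policy evaluation free of I/O means the 100% coverage threshold for governance evaluators (section 7.2) is cheap to hit.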
---
## 9. Test Data & Fixtures
High-quality fixtures are the lifeblood of this TDD strategy.
### 9.1 GitHub/GitLab API Response Factories
- JSON files containing real (obfuscated) GraphQL responses for Repositories, `CODEOWNERS` blobs, and Team memberships.
- Use factories (e.g., `fishery` or custom functions) to easily override fields: `buildGHRepo({ name: 'auth-service', languages: ['Go'] })`.
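A Python analogue of the fishery-style factory mentioned above — the field set is an assumption about the obfuscated response shape:

```python
def build_gh_repo(**overrides) -> dict:
    """Return a realistic mock repo record; keyword args override any field."""
    base = {
        "name": "example-service",
        "default_branch": "main",
        "languages": ["TypeScript"],
        "codeowners_present": True,
        "archived": False,
    }
    return {**base, **overrides}

# Usage mirrors the example in the bullet above:
repo = build_gh_repo(name="auth-service", languages=["Go"])
```

Defaults stay realistic so tests only spell out the fields they actually care about.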
### 9.2 Synthetic Topology Generators
- Scripts that generate interconnected AWS resources (e.g., a CFN stack containing an API Gateway routing to 3 Lambdas interacting with 1 RDS instance).
### 9.3 `CODEOWNERS` and Git Blame Mocks
- Diverse `CODEOWNERS` files covering edge cases: wildcard matching, deep path matching, invalid syntax, user-vs-team owners.
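A tiny last-match-wins resolver used to sanity-check these fixtures — `fnmatch` is a simplification (real `CODEOWNERS` uses gitignore-style globs, which is precisely why the edge-case fixtures above exist):

```python
from fnmatch import fnmatch

def resolve_owner(codeowners_lines, path):
    """Resolve the primary owner for a path; later rules win, as in real CODEOWNERS."""
    owner = None
    for line in codeowners_lines:
        line = line.split("#")[0].strip()  # strip comments and blank lines
        if not line:
            continue
        pattern, *owners = line.split()
        if not owners:
            continue  # invalid syntax: a pattern with no owner is ignored
        if fnmatch(path, pattern):
            owner = owners[0]  # keep overwriting: the last matching rule wins
    return owner
```

Directory patterns like `docs/` need richer handling than this sketch provides, which the fixture suite should expose.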
---
## 10. TDD Implementation Order
To bootstrap the platform efficiently, testing and development should follow this sequence based on Epic dependencies:
1. **Epic 2 (GitHub Parsers):** Write pure unit tests for `CODEOWNERS` parser and `README` extractor. *Value: High ROI, zero dependencies.*
2. **Epic 1 (AWS Heuristics):** Write unit tests for mapping CFN stacks and Tags to Service entities. *Value: Core product logic.*
3. **Epic 2 (Ownership Inference):** TDD the scoring algorithm. Build the weighting math. *Value: The brain of the platform.*
4. **Epic 3 (Service Catalog Schema):** Integration tests for PostgreSQL RLS and upserting services. *Value: Data durability.*
5. **Epic 2 (Reconciliation):** Unit tests merging AWS and GitHub mock entities. *Value: Pipeline glue.*
6. **Epic 4 (Search Sync):** Integration tests for pushing DB updates to Meilisearch.
7. **Epic 5 (API & UI):** E2E test for the Cmd+K search flow.
8. **Epic 10 (Governance & Flags):** Unit tests for feature flag circuit breakers and strict mode.
9. **Epic 9 (Onboarding):** Playwright E2E for the 5-Minute Miracle flow.
This sequence ensures the most complex algorithmic logic is proven before it is wired to databases and APIs.
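The weighting math in step 3 might be driven out test-first from a skeleton like this — signal names and weights are illustrative assumptions, not the shipped values:

```python
# Each signal votes for a team with a fixed weight; confidence is the
# winner's share of the total possible weight.
WEIGHTS = {
    "codeowners": 0.5,    # strongest signal: an explicit declaration
    "cfn_team_tag": 0.3,
    "git_blame": 0.2,     # weakest: recent committers merely correlate with ownership
}

def infer_owner(signals: dict):
    """signals maps signal name -> team; returns (best_team, confidence in [0, 1])."""
    scores = {}
    for name, team in signals.items():
        scores[team] = scores.get(team, 0.0) + WEIGHTS.get(name, 0.0)
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(WEIGHTS.values())
```

A skeleton like this lets the first failing test pin down tie-breaking and the confidence scale before any AWS or GitHub client exists, which is the point of doing steps 1-3 dependency-free.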