dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode, product-brief, architecture, epics (incl. Epic 10 TF compliance), test-architecture (TDD strategy)
Brand strategy and market research included.
2026-02-28 17:35:02 +00:00
commit 5ee95d8b13
51 changed files with 36935 additions and 0 deletions
--- a/products/04-lightweight-idp/brainstorm/session.md
+++ b/products/04-lightweight-idp/brainstorm/session.md
@@ -0,0 +1,245 @@
+# dd0c/portal — Brainstorm Session
+**Product:** Lightweight Internal Developer Portal ("The Anti-Backstage")  
+**Facilitator:** Carson, Elite Brainstorming Specialist  
+**Date:** 2026-02-28  
+
+> *Every idea gets a seat at the table. We filter later. Let's GO.*
+
+---
+
+## Phase 1: Problem Space (25 ideas)
+
+### Why Does Backstage Suck?
+
+1. **YAML Cemetery** — Backstage requires hand-written `catalog-info.yaml` in every repo. Engineers write it once, never update it. Within 6 months your catalog is a graveyard of lies.
+2. **Plugin Roulette** — Backstage plugins break on every upgrade. The plugin ecosystem is wide but shallow — half-maintained community plugins that rot.
+3. **Dedicated Platform Team Required** — You need 1-2 full-time engineers just to keep Backstage running. For a 30-person team, that's 3-7% of your engineering headcount babysitting a developer portal.
+4. **React Monolith From Hell** — Backstage is a massive React app you have to fork, customize, build, and deploy yourself. It's not a product, it's a framework. Spotify built it for Spotify.
+5. **Upgrade Treadmill** — Backstage releases constantly. Each upgrade risks breaking your custom plugins and templates. Teams fall behind and get stuck on ancient versions.
+6. **Cold Start Problem** — Day 1 of Backstage: empty catalog. You have to manually register every service. Nobody does it. The portal launches to crickets.
+7. **No Opinions** — Backstage is infinitely configurable, which means it ships with zero useful defaults. You have to decide everything: what metadata to track, what plugins to install, how to organize the catalog.
+8. **Search Is Terrible** — Backstage's built-in search is basic. Finding "who owns the payment service" requires navigating a clunky UI tree.
+9. **Authentication Nightmare** — Setting up auth (Okta, GitHub, Google) in Backstage requires custom provider configuration that's poorly documented.
+10. **No Auto-Discovery** — Backstage doesn't discover anything. It's a static registry that depends entirely on humans keeping it current. Humans don't.
+
+### What Do Engineers Actually Need? (The 80/20)
+
+11. **"Who owns this?"** — The #1 question. When something breaks at 3 AM, you need to know who to page. That's it. That's the killer feature.
+12. **"What does this service do?"** — A one-paragraph description, its dependencies, and its API docs. Not a 40-page Confluence novel.
+13. **"Is it healthy right now?"** — Green/yellow/red. Deployment status. Last deploy time. Current error rate. One glance.
+14. **"Where's the runbook?"** — When the service is on fire, where do I go? Link to the runbook, the dashboard, the logs.
+15. **"What depends on this?"** — Dependency graph. If I change this service, what breaks?
+16. **"How do I set up my dev environment for this?"** — README, setup scripts, required env vars. Onboarding in 10 minutes, not 10 days.
+
+### The Pain of NOT Having an IDP
+
+17. **Tribal Knowledge Monopoly** — "Ask Dave, he built that service 3 years ago." Dave left 6 months ago. Now nobody knows.
+18. **Confluence Graveyard** — Teams document services in Confluence pages that are 2 years stale. New engineers follow outdated instructions and waste days.
+19. **Slack Archaeology** — "I think someone posted the architecture diagram in #platform-eng last March?" Engineers spend hours searching Slack history for institutional knowledge.
+20. **Incident Response Roulette** — Alert fires → nobody knows who owns the service → 30-minute delay finding the right person → MTTR doubles.
+21. **Onboarding Black Hole** — New engineer joins. Spends first 2 weeks asking "what is this service?" and "who do I talk to about X?" in Slack. Productivity = zero.
+22. **Duplicate Services** — Without a catalog, Team A builds a notification service. Team B doesn't know it exists. Team B builds another notification service. Now you have two.
+23. **Zombie Services** — Services that nobody owns, nobody uses, but nobody is brave enough to turn off. They accumulate like barnacles, costing money and creating security risk.
+24. **Compliance Panic** — Auditor asks "show me all services that handle PII and their owners." Without an IDP, this is a multi-week scavenger hunt.
+25. **Shadow Architecture** — The actual architecture diverges from every diagram ever drawn. Nobody has a true picture of what's running in production.
+
+---
+
+## Phase 2: Solution Space (59 ideas)
+
+### Auto-Discovery Approaches
+
+26. **AWS Resource Tagger** — Scan AWS accounts via read-only IAM role. Discover EC2, ECS, Lambda, RDS, S3, API Gateway. Map them to services using tags, naming conventions, and CloudFormation stack associations.
+27. **GitHub/GitLab Repo Scanner** — Scan org repos. Infer services from repo names, `Dockerfile` presence, CI/CD configs, deployment manifests. Extract README descriptions automatically.
+28. **Kubernetes Label Harvester** — Connect to K8s clusters. Discover deployments, services, ingresses. Map labels (`app`, `team`, `owner`) to catalog entries.
+29. **Terraform State Reader** — Parse Terraform state files (S3 backends). Build infrastructure graph from resource relationships. Know exactly what infra each service uses.
+30. **CI/CD Pipeline Analyzer** — Read GitHub Actions / GitLab CI / Jenkins configs. Infer deployment targets, environments, and service relationships from pipeline definitions.
+31. **DNS/Route53 Reverse Map** — Scan DNS records to discover all public-facing services and map them back to infrastructure.
+32. **CloudFormation Stack Walker** — Parse CF stacks to understand resource groupings and cross-stack references. Build dependency graphs automatically.
+33. **Package.json / go.mod / pom.xml Dependency Inference** — Read dependency files to infer internal service-to-service relationships (shared libraries = likely communication).
+34. **Git Blame Ownership** — Infer service ownership from git commit history. Who commits most to this repo? That's probably the owner (or at least knows who is).
+35. **PagerDuty/OpsGenie Schedule Import** — Pull on-call schedules to auto-populate "who to page" for each service.
+36. **OpenAPI/Swagger Auto-Ingest** — Detect and index API specs from repos. Surface them in the portal as live, searchable API documentation.
+37. **Docker Compose Graph** — Parse `docker-compose.yml` files to understand local development service topologies.
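
Here is a minimal sketch of how the AWS Resource Tagger (#26) might map discovered resources to services. It assumes the scanner has already produced a list of resources, each with an ARN, a name, and tags. The `service` and `app` tag keys and the `<service>-<env>` naming convention are illustrative assumptions. `aws:cloudformation:stack-name` is the tag CloudFormation adds to the resources it creates. Anything that matches neither rule lands in an `<unmapped>` bucket, which is the raw material for Ghost Service Detection (#61).

```python
import re
from collections import defaultdict
from typing import Optional

# Tag keys checked in priority order. "service" and "app" are assumed team
# conventions; "aws:cloudformation:stack-name" is set by CloudFormation itself.
SERVICE_TAG_KEYS = ("service", "app", "aws:cloudformation:stack-name")

# Assumed naming convention: "<service>-<env>" with an optional "-<suffix>",
# e.g. "payment-service-prod" or "payment-service-prod-db".
ENV_SUFFIX = re.compile(r"-(dev|staging|prod)(-.*)?$")


def service_name_for(resource: dict) -> Optional[str]:
    """Infer the owning service for one discovered resource, or None."""
    tags = resource.get("tags", {})
    for key in SERVICE_TAG_KEYS:
        if tags.get(key):
            # Stack names often carry an env suffix too; strip it.
            return ENV_SUFFIX.sub("", tags[key])
    name = resource.get("name", "")
    if ENV_SUFFIX.search(name):
        return ENV_SUFFIX.sub("", name)
    return None  # no tag and no naming-convention match


def group_by_service(resources: list[dict]) -> dict[str, list[str]]:
    """Bucket resource ARNs by inferred service; unmatched ones go to "<unmapped>"."""
    groups: dict[str, list[str]] = defaultdict(list)
    for resource in resources:
        groups[service_name_for(resource) or "<unmapped>"].append(resource["arn"])
    return dict(groups)
```

Tags win over names because they are an explicit declaration. The naming convention is only a fallback, and teams would need to tune it to their own conventions.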
+
+### Service Catalog Features
+
+38. **One-Line Service Card** — Every service gets a card: name, owner, health, last deploy, language, repo link. Scannable in 2 seconds.
+39. **Dependency Graph Visualization** — Interactive graph showing service-to-service dependencies. Click a node to see details. Highlight blast radius.
+40. **Health Dashboard** — Aggregate health from multiple sources (CloudWatch, Datadog, Grafana, custom health endpoints). Show unified red/yellow/green.
+41. **Ownership Registry** — Team → services mapping. Click a team, see everything they own. Click a service, see the team and on-call rotation.
+42. **Runbook Linker** — Auto-detect runbooks in repos (markdown files in `/runbooks`, `/docs`, or linked in README). Surface them on the service card.
+43. **Environment Matrix** — Show all environments (dev, staging, prod) for each service. Current version deployed in each. Drift between environments highlighted.
+44. **SLO Tracker** — Define SLOs per service. Show the current error-budget burn rate. Alert when the error budget is burning too fast. Simple — not a full SLO platform, just visibility.
+45. **Cost Attribution** — Pull from dd0c/cost. Show monthly AWS cost per service. "This service costs $847/month." Engineers never see this data today.
+46. **Tech Radar Integration** — Tag services with their tech stack. Surface org-wide technology adoption. "We have 47 services on Node 18, 3 still on Node 14."
+47. **README Renderer** — Pull and render the repo README directly in the portal. No context switching to GitHub.
+48. **Changelog Feed** — Show recent deployments, config changes, and incidents per service. "What happened to this service this week?"
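
Blast-radius highlighting (#39) is a reachability query on the reversed dependency graph. A failure propagates to every service that calls the failed one, either directly or transitively. Below is a minimal breadth-first sketch. It assumes dependencies are already available as a hypothetical `depends_on` map from each service to the services it calls.

```python
from collections import defaultdict, deque


def blast_radius(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each service, record who calls it.
    dependents: dict[str, list[str]] = defaultdict(list)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents[current]:
            # Skip already-visited callers and cycles back to the failed node.
            if caller not in affected and caller != failed:
                affected.add(caller)
                queue.append(caller)
    return affected
```

The same query also answers the "What If" Simulator question (#80) for deprecations. The affected set is the list of services that would need to migrate.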
+
+### Developer Experience
+
+49. **Instant Search (Cmd+K)** — Algolia-fast search across all services, teams, APIs, runbooks. The portal IS the search bar.
+50. **Slack Bot** — `/dd0c who owns payment-service` → instant answer in Slack. No need to open the portal.
+51. **CLI Tool** — `dd0c portal search "payment"` → results in terminal. For engineers who live in the terminal.
+52. **Browser New Tab** — dd0c/portal as the browser new tab page. Every time an engineer opens a tab, they see their team's services, recent incidents, and deployment status.
+53. **VS Code Extension** — Right-click a service import → "View in dd0c/portal" → opens service card. See ownership and docs without leaving the editor.
+54. **GitHub PR Enrichment** — Bot comments on PRs with service context: "This PR affects payment-service (owned by @payments-team, 99.9% SLO, last incident 3 days ago)."
+55. **Mobile-Friendly View** — When you're on-call and get paged on your phone, the portal should be usable on mobile. Backstage is not.
+56. **Deep Links** — Every service, team, runbook, and API has a stable URL. Paste it in Slack, Jira, anywhere. It just works.
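
Slack delivers everything after the slash command in the `text` field of the slash-command payload. So `/dd0c who owns payment-service` arrives as `text = "who owns payment-service"`. Below is a minimal sketch of the parsing and lookup step for the Slack Bot (#50). It assumes a hypothetical in-memory `owners` map in place of the real catalog, and it leaves out the HTTP endpoint and Slack request-signature verification.

```python
import re

# Matches "who owns <service>" case-insensitively, tolerating extra spaces.
WHO_OWNS = re.compile(r"^\s*who\s+owns\s+(\S+)\s*$", re.IGNORECASE)


def handle_dd0c_command(text: str, owners: dict[str, str]) -> str:
    """Build the reply for `/dd0c <text>`.

    `text` is the slash-command payload's `text` field; `owners` is a
    hypothetical stand-in for the catalog's service -> owner lookup.
    """
    match = WHO_OWNS.match(text)
    if not match:
        return "Usage: /dd0c who owns <service>"
    service = match.group(1)
    owner = owners.get(service)
    if owner is None:
        return f"No owner found for {service}."
    return f"{service} is owned by {owner}."
```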
+
+### Zero-Config Magic
+
+57. **Convention Over Configuration** — If your repo is named `payment-service`, the service is named `payment-service`. If it has a `Dockerfile`, it's a deployable service. If CODEOWNERS assigns a team to the whole repo (the `*` rule), that team is the owner. Zero YAML needed.
+58. **Smart Defaults** — First run: connect AWS account + GitHub org. Portal auto-populates with everything it finds. Engineer reviews and corrects, not creates from scratch.
+59. **Progressive Enhancement** — Start with auto-discovered data (maybe 60% accurate). Let teams enrich over time. Never require manual entry as a prerequisite.
+60. **Confidence Scores** — Show "we're 85% sure @payments-team owns this" based on git history and AWS tags. Let humans confirm or correct. Learn from corrections.
+61. **Ghost Service Detection** — Find AWS resources that don't map to any known repo or team. Surface them as "orphaned infrastructure" — potential zombie services or cost waste.
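
The rules in #57 and the confidence scores in #60 can be sketched as a single inference function. This is a minimal illustration that rests on four assumptions:

- The repo's file list, CODEOWNERS text, and commit-author list have already been fetched.
- An explicit CODEOWNERS catch-all (`*`) rule is trusted at a fixed, hypothetical confidence of 0.95.
- When CODEOWNERS has no such rule, the fallback uses the top committer's share of commits as the confidence.
- Mapping an author to a team is out of scope, because git history yields individuals rather than teams.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceGuess:
    name: str
    deployable: bool
    owner: Optional[str]
    confidence: float  # 0.0-1.0, how sure we are about `owner`
    source: str        # which signal produced the owner guess


def codeowners_root_owner(codeowners: str) -> Optional[str]:
    """First owner on the catch-all `*` rule, if any.

    CODEOWNERS is last-match-wins, so a later `*` line overrides an earlier
    one, and a bare `*` with no owners clears ownership.
    """
    owner = None
    for line in codeowners.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split()
        if parts[0] == "*":
            owner = parts[1] if len(parts) >= 2 else None
    return owner


def infer_service(repo: str, files: set[str], codeowners: str,
                  commit_authors: list[str]) -> ServiceGuess:
    name = repo                      # repo name doubles as service name
    deployable = "Dockerfile" in files
    owner = codeowners_root_owner(codeowners)
    if owner:
        # Explicit declaration: high but not absolute trust (assumed 0.95).
        return ServiceGuess(name, deployable, owner, 0.95, "CODEOWNERS")
    if commit_authors:
        # Fallback: top committer, confidence = their share of commits.
        top, count = Counter(commit_authors).most_common(1)[0]
        share = round(count / len(commit_authors), 2)
        return ServiceGuess(name, deployable, top, share, "git history")
    return ServiceGuess(name, deployable, None, 0.0, "none")
```

A repo where one author made 17 of 20 commits gets an owner guess at confidence 0.85. That is the "we're 85% sure" prompt from #60, which a human then confirms or corrects.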
+
+### Scorecard / Maturity Model
+
+62. **Production Readiness Score** — Does this service have: health check? Logging? Alerting? Runbook? Score it 0-100. Gamify production readiness.
+63. **Documentation Coverage** — Does the repo have a README? API docs? Architecture decision records? Score it.
+64. **Security Posture** — Are dependencies up to date? Any known CVEs? Is the Docker image scanned? Secrets in env vars vs. secrets manager?
+65. **On-Call Readiness** — Is there an on-call rotation defined? Is the runbook current? Has the team done a recent incident drill?
+66. **Leaderboard** — Team-level maturity scores. Friendly competition. "Platform team is at 92%, payments team is at 67%." Gamification drives adoption.
+67. **Improvement Suggestions** — "Your service is missing a health check endpoint. Here's a template for Express/FastAPI/Go." Actionable, not just a score.
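
Below is a minimal sketch of the Production Readiness Score (#62) paired with Improvement Suggestions (#67). It assumes the per-service facts, such as whether a health check exists or alerting is configured, have already been detected. The check names, weights, and suggestion text are hypothetical. The weights sum to 100 so the score reads directly on the 0-100 scale.

```python
# Hypothetical checks and weights; they sum to 100.
CHECKS = {
    "health_check": 30,
    "alerting": 30,
    "runbook": 25,
    "logging": 15,
}

# Hypothetical suggestion text shown for each failed check.
SUGGESTIONS = {
    "health_check": "Add a health check endpoint (e.g. GET /healthz).",
    "alerting": "Configure at least one alert on error rate or latency.",
    "runbook": "Add a runbook under /runbooks and link it from the README.",
    "logging": "Ship structured logs to your log aggregator.",
}


def readiness(facts: dict[str, bool]) -> tuple[int, list[str]]:
    """Return (score 0-100, suggestions for every failed check)."""
    score = sum(weight for check, weight in CHECKS.items() if facts.get(check))
    suggestions = [SUGGESTIONS[check] for check in CHECKS if not facts.get(check)]
    return score, suggestions
```

Returning the suggestions alongside the number keeps the score actionable. A team sees what to fix, not just how far behind it is.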
+
+### dd0c Module Integration
+
+68. **Alert → Owner Routing (dd0c/alert)** — Alert fires → portal knows the owner → alert routes directly to the right person. No more generic #alerts channel.
+69. **Drift Visibility (dd0c/drift)** — Service card shows "⚠️ 3 infrastructure drifts detected." Click to see details in dd0c/drift.
+70. **Cost Per Service (dd0c/cost)** — Service card shows monthly AWS cost. "This Lambda costs $234/month." Engineers finally see the money.
+71. **Runbook Execution (dd0c/run)** — Runbook linked in portal is executable via dd0c/run. "Service is down → click runbook → AI walks you through recovery."
+72. **LLM Cost Per Service (dd0c/route)** — If the service uses LLM APIs, show the AI spend. "This service spent $1,200 on GPT-4o last month."
+73. **Unified Incident View** — When an incident happens, the portal becomes the war room: service health, owner, runbook, recent changes, cost impact — all on one screen.
+
+### Wild Ideas 🔥
+
+74. **The IDP Is Just a Search Engine** — Forget the catalog UI. The entire product is a search bar. Type anything: service name, team name, API endpoint, error message. Get instant answers. Google for your infrastructure.
+75. **AI Agent: "Ask Your Infra"** — Natural language queries: "Who owns the service that handles Stripe webhooks?" "What changed in production in the last 24 hours?" "Which services don't have runbooks?" The AI queries the catalog and answers.
+76. **Auto-Generated Architecture Diagrams** — From discovered services and dependencies, auto-generate C4 / system context diagrams. Always up-to-date because they're generated from reality, not drawn by hand.
+77. **"New Engineer" Mode** — A guided tour for new hires. "Here are the 10 most important services. Here's who owns what. Here's how to set up your dev environment." Onboarding in 1 hour, not 1 week.
+78. **Service DNA** — Every service gets a unique fingerprint: its tech stack, dependencies, deployment pattern, cost profile, health history. Use this to find similar services, suggest best practices, detect anomalies.
+79. **Incident Replay** — After an incident, the portal shows a timeline: what changed, what broke, who responded, how it was fixed. Auto-generated post-mortem skeleton.
+80. **"What If" Simulator** — "What if we deprecate service X?" Show the blast radius: which services depend on it, which teams are affected, estimated migration effort.
+81. **Service Lifecycle Tracker** — Track services from creation → active → deprecated → decommissioned. Prevent zombie services by making the lifecycle visible.
+82. **Auto-PR for Missing Metadata** — Portal detects a service is missing an owner tag. Automatically opens a PR to the repo adding a `CODEOWNERS` file with a suggested owner based on git history.
+83. **Ambient Dashboard (TV Mode)** — Full-screen mode for office TVs. Show team services, health status, recent deploys, SLO burn rates. The engineering floor's heartbeat monitor.
+84. **Service Comparison** — Side-by-side comparison of two services: tech stack, cost, health, maturity score. Useful for migration planning or standardization.
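
The Service Lifecycle Tracker (#81) is essentially a small state machine. Below is a minimal sketch built on the four stages named above. It assumes that rejecting skipped stages (for example, jumping straight from created to decommissioned) is the desired policy. It also assumes a deprecated service may return to active.

```python
from enum import Enum


class Stage(Enum):
    CREATED = "created"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    DECOMMISSIONED = "decommissioned"


# Allowed moves. DEPRECATED -> ACTIVE (un-deprecating) is an assumption;
# DECOMMISSIONED is terminal.
TRANSITIONS = {
    Stage.CREATED: {Stage.ACTIVE},
    Stage.ACTIVE: {Stage.DEPRECATED},
    Stage.DEPRECATED: {Stage.ACTIVE, Stage.DECOMMISSIONED},
    Stage.DECOMMISSIONED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```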
+
+---
+
+## Phase 3: Differentiation & Moat (18 ideas)
+
+### How to Beat Backstage
+
+85. **Time-to-Value: 5 Minutes vs. 5 Months** — Backstage takes months to set up. dd0c/portal takes 5 minutes (connect AWS + GitHub, auto-discover, done). This is the entire pitch. Speed kills.
+86. **Zero Maintenance** — Backstage is self-hosted and requires constant upgrades. dd0c/portal is SaaS. We handle upgrades, scaling, and integration maintenance. Your platform team can go back to building platforms.
+87. **Auto-Discovery vs. Manual Entry** — Backstage requires humans to write YAML. dd0c/portal discovers everything automatically. The catalog is always current because it's generated from reality, not maintained by humans.
+88. **Opinionated > Configurable** — Backstage gives you a blank canvas. dd0c/portal gives you a finished painting. We make the decisions so you don't have to. Convention over configuration.
+89. **"Backstage Migrator"** — One-click import from existing Backstage `catalog-info.yaml` files. Lower the switching cost to zero. Eat their lunch.
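
Below is a minimal sketch of the per-entity mapping behind the Backstage Migrator (#89). It assumes each `catalog-info.yaml` has already been parsed into Python dicts. A single file can hold several `---`-separated entities, and PyYAML's `yaml.safe_load_all` returns them one by one. The input fields (`metadata.name`, `spec.owner`, `spec.lifecycle`, `spec.type`, `spec.dependsOn`) follow Backstage's `Component` entity format. The output record's field names are hypothetical.

```python
from typing import Optional


def from_backstage(entity: dict) -> Optional[dict]:
    """Map one parsed Backstage entity to a flat dd0c service record.

    Only `Component` entities are mapped; other kinds return None so a bulk
    import can skip them. Output field names are hypothetical.
    """
    if entity.get("kind") != "Component":
        return None
    metadata = entity.get("metadata", {})
    spec = entity.get("spec", {})
    return {
        "name": metadata["name"],                       # required in Backstage
        "description": metadata.get("description", ""),
        "owner": spec.get("owner"),                     # e.g. "group:payments-team"
        "type": spec.get("type"),                       # e.g. "service", "website"
        "lifecycle": spec.get("lifecycle"),             # e.g. "production"
        "depends_on": list(spec.get("dependsOn", [])),  # entity refs, kept as-is
    }
```

Non-`Component` entities such as `Group` or `API` are skipped here. A fuller migrator would likely map `Group` entities to teams.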
+
+### How to Beat Port/Cortex/OpsLevel
+
+90. **Price: $10/eng vs. $200+/eng** — Port, Cortex, and OpsLevel charge enterprise prices ($20K+/year). dd0c/portal is $10/engineer/month with self-serve signup. No sales calls. No procurement process.
+91. **Self-Serve vs. Sales-Led** — You can start using dd0c/portal today. Port requires a demo call, a POC, and a 6-week evaluation. By the time their sales cycle completes, you've been using dd0c for 2 months.
+92. **Simplicity as Feature** — Port and Cortex have massive feature sets designed for 1000+ engineer orgs. dd0c/portal has 20% of the features for 80% of the value. For a 30-person team, less is more.
+93. **dd0c Platform Integration** — Port is a standalone IDP. dd0c/portal is part of a unified platform (cost, alerts, drift, runbooks). The IDP that knows your costs, routes your alerts, and executes your runbooks. Nobody else can do this.
+
+### The Moat
+
+94. **Data Network Effect** — The more services discovered, the better the dependency graph, the smarter the ownership inference, the more accurate the health aggregation. Data compounds.
+95. **Platform Lock-In (The Good Kind)** — Once dd0c/portal is the browser homepage for every engineer, switching costs are enormous. It's the operating system for your engineering org.
+96. **Cross-Module Flywheel** — Portal makes alerts smarter (route to owner). Alerts make portal stickier (engineers open it during incidents). Cost data makes portal indispensable (engineers check service costs). Each module reinforces the others.
+97. **AI-Powered Inference Engine** — Over time, dd0c learns patterns across all customers (anonymized): common service architectures, typical ownership structures, standard tech stacks. The AI gets smarter with scale. New customers get better auto-discovery on day 1.
+98. **Community Catalog Templates** — Open-source library of service templates (Express API, Lambda function, ECS service). New services created from templates are automatically portal-ready. The community builds the ecosystem.
+99. **"Agent Control Plane" Positioning** — As agentic AI grows, AI agents need a source of truth about services. dd0c/portal becomes the registry that AI agents query. "Which service handles payments?" The IDP isn't just for humans anymore — it's for AI agents too.
+100. **Compliance Moat** — Once dd0c/portal is the system of record for service ownership and maturity, it becomes compliance infrastructure. SOC 2 auditors love it. Ripping it out means losing your compliance evidence.
+101. **Integration Depth** — Build deep integrations with the tools teams already use (GitHub, Slack, PagerDuty, Datadog, AWS). Each integration makes dd0c/portal harder to replace.
+102. **Open-Source Discovery Agent** — Open-source the discovery agent (runs in their VPC). Proprietary SaaS dashboard. The OSS agent builds trust and community. The dashboard is the business.
+
+---
+
+## Phase 4: Anti-Ideas & Red Team (14 ideas)
+
+### Why Would This Fail?
+
+103. **"Lightweight" = "Toy"** — Teams might dismiss dd0c/portal as too simple. "We need Backstage because we're a serious engineering org." Perception problem: lightweight sounds like it can't scale.
+104. **GitHub Ships a Built-In Catalog** — GitHub already has repository topics, CODEOWNERS, and dependency graphs. If they add a "Service Catalog" tab, dd0c/portal's value proposition evaporates overnight.
+105. **Backstage Gets Easy** — Roadie (managed Backstage) is improving. If Backstage 2.0 ships with auto-discovery and zero-config setup, the "Anti-Backstage" positioning dies.
+106. **AWS Ships a Good IDP** — AWS has Service Catalog, but it governs approved provisioning templates rather than cataloging services and their owners. If AWS builds a real IDP integrated with its ecosystem, it has distribution dd0c can't match.
+107. **Discovery Accuracy Problem** — Auto-discovery sounds magical but might be 60% accurate. If engineers open the portal and see wrong data, they'll never trust it again. First impressions are everything.
+108. **Small Teams Don't Need an IDP** — A 15-person team might genuinely not need a service catalog. They all sit in the same room. The TAM might be smaller than expected.
+109. **Enterprise Gravity** — As teams grow past 100 engineers, they'll "graduate" to Port or Cortex. dd0c/portal might be a stepping stone, not a destination. High churn at the top end.
+110. **Solo Founder Risk** — Building an IDP requires integrations with dozens of tools (AWS, GCP, Azure, GitHub, GitLab, Bitbucket, PagerDuty, OpsGenie, Datadog, Grafana...). That's a massive surface area for one person.
+111. **The "Free" Competitor Problem** — Backstage is free. Convincing teams to pay $10/eng/month when a free option exists requires the value gap to be enormous and obvious.
+112. **Data Sensitivity** — The portal needs read access to AWS accounts and GitHub orgs. Security teams at larger companies will block this. The trust barrier is real.
+113. **Agentic AI Makes IDPs Obsolete** — If AI agents can answer "who owns this service?" by reading git history and Slack in real-time, do you need a static catalog at all?
+114. **Platform Engineering Fatigue** — Teams are tired of adopting new tools. "We just finished setting up Backstage, we're not switching." Migration fatigue is real.
+115. **The "Good Enough" Spreadsheet** — Many teams track services in a Google Sheet or Notion database. It's ugly but it works. Convincing them to pay for a dedicated tool is harder than it sounds.
+116. **Churn from Simplicity** — If the product is truly lightweight, there's less surface area for stickiness. Users might churn because they feel they've "outgrown" it.
+
+---
+
+## Phase 5: Synthesis
+
+### Top 10 Ideas (Ranked)
+
+| Rank | Idea | Why It Wins |
+|------|------|-------------|
+| 1 | **5-Minute Auto-Discovery Setup** (#57, #58) | THE differentiator. Connect AWS + GitHub → catalog populated. Zero YAML. This is the entire pitch against Backstage. |
+| 2 | **Cmd+K Instant Search** (#49, #74) | The portal IS the search bar. "Who owns X?" answered in 2 seconds. This is the daily-use hook that makes it the browser homepage. |
+| 3 | **AI "Ask Your Infra" Agent** (#75) | Natural language queries against your service catalog. "What changed in prod today?" This is the 2026 differentiator that no competitor has. |
+| 4 | **Ownership Registry + PagerDuty Sync** (#41, #35) | The #1 use case: who owns this, who's on-call. Auto-populated from PagerDuty/OpsGenie + git history. Solves the 3 AM problem. |
+| 5 | **dd0c Cross-Module Integration** (#68-73, #96) | Alerts route to owners. Costs attributed to services. Runbooks linked and executable. The platform flywheel that standalone IDPs can't match. |
+| 6 | **Production Readiness Scorecard** (#62-67) | Gamified maturity model. Teams compete to improve scores. Drives adoption AND improves engineering practices. Two birds, one stone. |
+| 7 | **Slack Bot** (#50) | `/dd0c who owns payment-service` — meet engineers where they already are. Reduces friction to zero. Drives organic adoption. |
+| 8 | **Auto-Generated Dependency Graphs** (#39, #76) | Visual blast radius. "If this service goes down, these 12 services are affected." Always current because it's generated from reality. |
+| 9 | **Backstage Migrator** (#89) | One-click import from Backstage YAML. Lowers switching cost to zero. Directly targets the frustrated Backstage user base. |
+| 10 | **$10/eng Self-Serve Pricing** (#90, #91) | No sales calls. No procurement. Credit card signup. This alone disqualifies Port/Cortex/OpsLevel for 80% of the market. |
+
+### 3 Wild Cards 🃏
+
+| # | Wild Card | Why It's Wild |
+|---|-----------|---------------|
+| 🃏1 | **"New Engineer" Guided Onboarding Mode** (#77) | Turns the IDP into an onboarding tool. "Welcome to Acme Corp. Here are your 47 services in 5 minutes." HR teams would champion this. Completely different buyer persona. |
+| 🃏2 | **Agent Control Plane** (#99) | Position dd0c/portal as the registry that AI agents query, not just humans. As agentic DevOps explodes in 2026, this could be the defining use case. The IDP becomes infrastructure for AI. |
+| 🃏3 | **Auto-PR for Missing Metadata** (#82) | The portal doesn't just show gaps — it fixes them. Detects missing CODEOWNERS, opens a PR with suggested owners. The catalog improves itself. Self-healing metadata. |
+
+### Recommended V1 Scope
+
+**Core (Must Ship):**
+- AWS auto-discovery (EC2, ECS, Lambda, RDS, API Gateway via read-only IAM role)
+- GitHub org scan (repos, languages, CODEOWNERS, README)
+- Service cards (name, owner, description, repo, health, last deploy)
+- Cmd+K instant search
+- Team → services ownership mapping
+- PagerDuty/OpsGenie on-call schedule import
+- Slack bot (`/dd0c who owns X`)
+- Self-serve signup, $10/engineer/month
+
+**V1.1 (Fast Follow):**
+- Production readiness scorecard
+- Dependency graph visualization
+- Kubernetes discovery
+- Backstage YAML importer
+- dd0c/alert integration (route alerts to service owners)
+- dd0c/cost integration (show cost per service)
+
+**V1.2 (Differentiator):**
+- AI "Ask Your Infra" natural language queries
+- Auto-PR for missing metadata
+- New engineer onboarding mode
+- Terraform state parsing
+
+**Explicitly NOT V1:**
+- Custom plugins/extensions (that's Backstage's trap)
+- GCP/Azure support (AWS-first, expand later)
+- Software templates / scaffolding (stay focused on catalog)
+- Full SLO management (just show basic health)
+- Self-hosted option (SaaS only to start)
+
+---
+
+> **Total ideas generated: 116**  
+> *Session complete. The Anti-Backstage has a blueprint. Now go build it.* 🔥