dd0c/portal — Design Thinking Session
Product: Lightweight Internal Developer Portal ("The Anti-Backstage")
Facilitator: Maya, Design Thinking Maestro
Date: 2026-02-28
"Design is jazz. You listen before you play. You feel the room before you solo. And the room right now? It's full of engineers who are lost, frustrated, and pretending they're not."
Phase 1: EMPATHIZE 🎭
The cardinal sin of developer tooling is building for the architect's ego instead of the practitioner's panic. Before we sketch a single wireframe, we need to sit in three very different chairs and feel the splinters.
Persona 1: The New Hire — "Jordan"
Profile: Software Engineer II, 2 years experience, just joined a 60-person engineering org. First day was Monday. It's now Wednesday. Jordan has imposter syndrome and a laptop full of bookmarks that don't work.
Empathy Map
| Dimension | Details |
|---|---|
| SAYS | "Where's the documentation for this?" · "Who should I ask about the payments service?" · "I don't want to bother anyone with a dumb question" · "The wiki says to use Jenkins but I think we use GitHub Actions now?" · "Can someone add me to the right Slack channels?" |
| THINKS | "Everyone else seems to know how this all fits together" · "I'm going to look stupid if I ask this again" · "This architecture diagram is from 2023, is it still accurate?" · "I have no idea what half these services do" · "Am I even looking at the right repo?" |
| DOES | Searches Slack history obsessively · Opens 47 browser tabs trying to map the system · Reads READMEs that haven't been updated in 18 months · Asks their buddy (assigned mentor) the same question three different ways · Builds a personal Notion doc trying to map services to teams · Sits in standup not understanding half the service names mentioned |
| FEELS | Overwhelmed — the system landscape is a fog · Anxious — "I should be productive by now" · Isolated — everyone's too busy to hand-hold · Frustrated — information exists but it's scattered across 6 tools · Embarrassed — asking "who owns this?" feels like admitting ignorance |
Pain Points
- The Scavenger Hunt — Finding who owns what requires asking in Slack, searching Confluence, checking GitHub CODEOWNERS, and sometimes just guessing. There's no single source of truth.
- Stale Documentation — 70% of internal docs are outdated. Jordan follows a setup guide, hits an error on step 3, and has no idea if the guide is wrong or they are.
- Invisible Architecture — No one has drawn an accurate system diagram in over a year. The mental model of "how things connect" lives exclusively in senior engineers' heads.
- Social Cost of Questions — Every question Jordan asks interrupts someone. After the third "hey, quick question" DM in a day, Jordan stops asking and starts guessing.
- Tool Sprawl — Services are documented across GitHub READMEs, Confluence, Notion, Google Docs, and Slack pinned messages. There's no index of indexes.
Current Workarounds
- Personal Notion database mapping services → owners → repos (manually maintained, already drifting)
- Screenshot collection of architecture whiteboard photos from onboarding
- Bookmarked Slack threads with "useful" context that's already stale
- Relying heavily on one senior engineer who's becoming a bottleneck
- Reading git blame to figure out who last touched a service
Jobs To Be Done (JTBD)
- When I join a new company, I want to quickly understand the service landscape so I can contribute meaningfully without feeling lost.
- When I'm assigned a ticket involving an unfamiliar service, I want to find the owner and documentation in under 60 seconds so I can unblock myself without interrupting teammates.
- When I hear a service name in standup, I want to look it up instantly so I can follow the conversation and not feel like an outsider.
Day-in-the-Life Scenario
9:00 AM — Jordan opens Slack. 47 unread messages across 12 channels. Half reference services Jordan has never heard of.
9:30 AM — Standup. Tech lead mentions "the inventory-sync Lambda is flaking again." Jordan nods. Has no idea what inventory-sync does or where it lives.
10:00 AM — Assigned first real ticket: "Add retry logic to the order-processor." Jordan searches GitHub for "order-processor." Three repos come up. Which one is the right one? The README in the first one says "DEPRECATED — see order-processor-v2." The v2 repo has no README.
10:45 AM — Jordan DMs their mentor: "Hey, which repo is the actual order-processor?" Mentor is in a meeting. No response for more than 3 hours.
11:00 AM — Jordan searches Confluence for "order-processor." Finds a page from 2024 with an architecture diagram that references services that no longer exist.
12:00 PM — Lunch. Jordan feels behind. Other new hires from the same cohort seem to be figuring things out faster (they're not — they're just better at hiding it).
2:00 PM — Mentor responds: "Oh yeah, it's order-processor-v2 but we actually call it order-engine now. The repo name is wrong. Talk to @sarah for the runbook."
2:30 PM — Jordan DMs Sarah. Sarah is on PTO until Friday.
3:00 PM — Jordan has spent 5 hours and written zero lines of code. The ticket is untouched. The frustration is real.
5:00 PM — Jordan updates their personal Notion doc with everything learned today. Tomorrow they'll forget half of it.
Persona 2: The Platform Engineer — "Alex"
Profile: Senior Platform Engineer, 6 years experience, has been maintaining the company's Backstage instance for 14 months. Alex is burned out. The Backstage instance is a Frankenstein monster of custom plugins, broken upgrades, and YAML files that nobody updates. Alex fantasizes about deleting the entire thing.
Empathy Map
| Dimension | Details |
|---|---|
| SAYS | "I spend 40% of my time maintaining Backstage instead of building actual platform tooling" · "Nobody updates their catalog-info.yaml" · "The plugin broke again after the upgrade" · "I've become the Backstage janitor" · "We need to simplify this" |
| THINKS | "I didn't become an engineer to babysit a React monolith" · "If I leave, nobody can maintain this" · "Backstage was supposed to save us time, not create more work" · "Maybe we should just use a spreadsheet" · "The catalog is 60% lies at this point" |
| DOES | Writes custom Backstage plugins that break on every upgrade · Sends monthly "please update your catalog-info.yaml" Slack messages that everyone ignores · Manually fixes broken service entries · Runs Backstage upgrade migrations that take 2-3 days each · Writes documentation for Backstage that nobody reads · Attends Backstage community calls hoping for solutions that never come |
| FEELS | Exhausted — maintaining Backstage is a full-time job on top of their actual job · Resentful — "I'm a platform engineer, not a portal admin" · Trapped — too much invested to abandon Backstage, too broken to love it · Lonely — nobody else on the team understands or cares about the IDP · Skeptical — "every new tool promises to be different" |
Pain Points
- The Maintenance Tax — Backstage requires constant care: plugin updates, dependency bumps, custom provider maintenance, auth config changes. It's a pet, not cattle.
- The YAML Lie — catalog-info.yaml files are the foundation of Backstage, and they're fiction. Teams write them once during onboarding, never update them. The catalog drifts from reality within weeks.
- Plugin Fragility — Community plugins are maintained by volunteers who disappear. Custom plugins break on Backstage upgrades. The plugin ecosystem is a house of cards.
- Zero Adoption Metrics — Alex has no idea if anyone actually uses Backstage. There's no analytics. The portal might have 3 daily users or 30. Alex suspects it's closer to 3.
- Upgrade Dread — Every Backstage version bump is a multi-day migration project. Alex is 4 versions behind because the last upgrade broke 3 plugins and took a week to fix.
Current Workarounds
- Wrote a cron job that checks for missing catalog-info.yaml files and posts in Slack (everyone mutes the channel)
- Maintains a "known broken" list of Backstage features they tell people to avoid
- Built a simple internal API that returns service ownership from GitHub CODEOWNERS as a backup
- Runs a quarterly "Backstage cleanup day" that nobody attends
- Has a secret spreadsheet that's actually more accurate than Backstage
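Alex's backup API over CODEOWNERS suggests how little code an ownership lookup needs. The following is a minimal Python sketch under stated assumptions: the pattern matching is deliberately simplified (it does not implement GitHub's full gitignore-style rules), everything after `#` is treated as a comment, and the team handles and the names `parse_codeowners` and `owners_for` are hypothetical, not taken from any existing tool.

```python
from fnmatch import fnmatchcase


def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Return (pattern, owners) pairs in file order, skipping comments and blanks."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' always starts a comment
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules


def _matches(pattern: str, path: str) -> bool:
    # Simplified matching, NOT full gitignore semantics:
    # "*" matches everything, "dir/" matches by prefix, anything else is a glob.
    if pattern == "*":
        return True
    normalized = pattern.lstrip("/")
    if normalized.endswith("/"):
        return path.startswith(normalized)
    return fnmatchcase(path, normalized)


def owners_for(path: str, rules: list[tuple[str, list[str]]]) -> list[str]:
    """The last matching rule wins, so later lines override earlier ones."""
    matched: list[str] = []
    for pattern, owners in rules:
        if _matches(pattern, path):
            matched = owners
    return matched


# Hypothetical CODEOWNERS content for illustration.
SAMPLE = """
# default owner
*                     @platform-team
/services/payments/   @payments-team
*.md                  @docs-team
"""

rules = parse_codeowners(SAMPLE)
print(owners_for("services/payments/handler.py", rules))
```

A lookup like this only answers "who owns this path?". It says nothing about whether the listed team still exists, which is the drift problem the rest of this session keeps circling.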
Jobs To Be Done (JTBD)
- When I'm responsible for the developer portal, I want it to maintain itself so I can focus on building actual platform capabilities.
- When the catalog data drifts from reality, I want automatic reconciliation from source-of-truth systems so I can trust the data without manual intervention.
- When leadership asks "how's the IDP going?", I want real adoption metrics so I can prove value or make the case to change course.
Day-in-the-Life Scenario
8:30 AM — Alex opens their laptop. Three Slack DMs overnight: "Backstage is showing the wrong owner for auth-service," "The API docs plugin isn't loading," and "Can you add our new service to the catalog?"
9:00 AM — Alex investigates the wrong-owner issue. The catalog-info.yaml in the auth-service repo lists the previous team. The team was reorged 4 months ago. Nobody updated the YAML. Alex manually fixes it. This is the third time this month.
9:30 AM — The API docs plugin. It broke after last week's Backstage patch update. Alex checks the plugin's GitHub repo. Last commit: 8 months ago. The maintainer hasn't responded to issues in 6 months. Alex starts debugging the plugin source code.
11:00 AM — Still debugging the plugin. Alex considers writing a replacement from scratch. Estimates 2 weeks of work. Puts it on the backlog that's already 40 items deep.
11:30 AM — "Can you add our new service?" Alex sends the team the catalog-info.yaml template for the 50th time. They'll fill it out wrong. Alex will fix it later.
1:00 PM — Alex's actual platform work (building a new CI/CD pipeline template) has been untouched for 3 days. The sprint review is tomorrow. Alex has nothing to show.
3:00 PM — Engineering Director asks: "How many teams are actively using the portal?" Alex doesn't know. Backstage has no built-in analytics. Alex says "most teams" and hopes nobody asks for specifics.
5:00 PM — Alex updates their resume. Not seriously. Mostly seriously.
Persona 3: The Engineering Director — "Priya"
Profile: Director of Engineering, manages 8 teams (62 engineers), reports to the VP of Engineering. Priya needs to answer hard questions about service ownership, production readiness, and engineering maturity — and currently can't answer any of them with confidence.
Empathy Map
| Dimension | Details |
|---|---|
| SAYS | "Which teams own which services? I need a complete map" · "Are we production-ready for the SOC 2 audit?" · "How many services don't have runbooks?" · "I need to justify the platform team's headcount" · "Why did that incident take 45 minutes to route to the right team?" |
| THINKS | "I'm flying blind on service maturity" · "If an auditor asks me about ownership, I'm going to look incompetent" · "We're spending $200K/year on a platform engineer to maintain Backstage — is that worth it?" · "I bet there are zombie services costing us money that nobody owns" · "The new hires are taking too long to ramp up" |
| DOES | Asks team leads for service ownership spreadsheets (gets different answers from each) · Runs quarterly "production readiness" reviews that are manual and painful · Approves platform team budget without clear ROI metrics · Escalates during incidents when nobody knows who owns the broken service · Presents engineering maturity metrics to the VP that are mostly guesswork |
| FEELS | Anxious — lack of visibility into the service landscape is a leadership risk · Frustrated — simple questions ("who owns this?") shouldn't require a research project · Pressured — SOC 2 audit is in 3 months and the compliance evidence is scattered · Guilty — knows the platform team is drowning but can't justify more headcount without data · Impatient — "we've been talking about fixing this for a year" |
Pain Points
- The Ownership Black Hole — No single, trustworthy source of service-to-team mapping. During incidents, precious minutes are wasted finding the right responder.
- Compliance Anxiety — SOC 2 auditors will ask "show me every service that handles PII and its owner." Today, answering this requires a multi-week manual audit.
- Invisible ROI — The platform team maintains Backstage, but Priya can't quantify the value. Is it saving time? Is anyone using it? She's spending $200K/year on vibes.
- Onboarding Drag — New engineers take 3-4 weeks to become productive. Priya suspects poor internal tooling is a major factor but can't prove it.
- Zombie Infrastructure — Priya knows there are services and AWS resources that nobody owns. They cost money, create security risk, and nobody's accountable.
Current Workarounds
- Quarterly manual audit where each team lead submits a spreadsheet of their services (takes 2 weeks, outdated by the time it's compiled)
- Incident post-mortems that repeatedly cite "unclear ownership" as a contributing factor
- A shared Google Sheet titled "Service Ownership Map" that 4 different people maintain with conflicting data
- Relies on tribal knowledge from senior engineers who've been at the company 3+ years
- Asks the platform engineer (Alex) for ad-hoc reports that take days to compile
Jobs To Be Done (JTBD)
- When an incident occurs, I want instant, authoritative service-to-owner mapping so that mean time to resolution isn't inflated by "who owns this?" confusion.
- When preparing for a compliance audit, I want an always-current inventory of services, owners, and maturity attributes so that I can produce evidence in minutes, not weeks.
- When justifying platform team investment, I want concrete adoption and value metrics so that I can defend the budget with data, not anecdotes.
Day-in-the-Life Scenario
7:30 AM — Priya checks her phone. PagerDuty alert from 2 AM: "payment-gateway latency spike." The incident channel shows 15 minutes of "does anyone know who owns payment-gateway?" before the right engineer was found. MTTR: 47 minutes. Should have been 15.
9:00 AM — Leadership sync. VP asks: "How many of our services meet production readiness standards?" Priya says "most of the critical path services." The VP asks for a number. Priya doesn't have one. She promises a report by Friday.
10:00 AM — Priya asks Alex (platform engineer) to generate a production readiness report. Alex says it'll take 3-4 days because the data is scattered across Backstage (partially accurate), GitHub (partially complete), and team leads' heads (partially available).
11:00 AM — SOC 2 prep meeting. Compliance team asks: "Can you provide a list of all services that process customer data, their owners, and their security controls?" Priya's stomach drops. She knows this will be a fire drill.
1:00 PM — 1:1 with a new hire (Jordan). Jordan mentions it took 3 days to figure out which repo to work in. Priya makes a mental note to "fix onboarding" — the same note she's made every quarter for a year.
3:00 PM — Budget review. CFO asks why the platform team needs another engineer. Priya can't quantify the current IDP's value. The headcount request is deferred to next quarter.
4:30 PM — Priya opens the "Service Ownership Map" Google Sheet. It was last updated 6 weeks ago. Three services listed don't exist anymore. Two new services aren't listed. She closes the tab and sighs.
6:00 PM — Priya drafts an email to her team leads: "Please update the service ownership spreadsheet by EOW." She knows from experience that 3 out of 8 will respond, and the data will be inconsistent.
Cross-Persona Insight: The Emotional Thread
There's a jazz chord that connects all three personas — it's the anxiety of not knowing:
- Jordan doesn't know where things are and is afraid to ask.
- Alex doesn't know if anyone uses what they've built and is afraid to find out.
- Priya doesn't know the true state of her engineering org and is afraid to be exposed.
The product that resolves this anxiety — that replaces "I don't know" with "I can look it up in 2 seconds" — wins not just their budget, but their loyalty.
"The best products don't just solve problems. They dissolve the anxiety that surrounds the problem. That's the difference between a tool and a companion."
Phase 2: DEFINE 🎯
Time to distill the empathy into precision. We've sat in the chairs. We've felt the splinters. Now we name the splinters — because you can't pull out what you can't name.
Point-of-View (POV) Statements
A POV statement is a design brief in one sentence. It reframes the problem from the user's emotional reality, not the product team's assumptions.
POV 1 — The New Hire (Jordan)
Jordan, a newly hired engineer drowning in scattered tribal knowledge, needs a way to instantly understand the service landscape and find who owns what — because every hour spent on a scavenger hunt for basic context is an hour of productivity lost and confidence eroded.
POV 2 — The Platform Engineer (Alex)
Alex, a burned-out platform engineer trapped maintaining a Backstage instance that nobody trusts, needs a developer portal that maintains itself from real infrastructure data — because the current model of human-maintained YAML catalogs is a lie that creates more work than it eliminates.
POV 3 — The Engineering Director (Priya)
Priya, an engineering director who can't answer basic questions about service ownership and maturity, needs always-current, authoritative visibility into her engineering org's service landscape — because flying blind on ownership creates incident response delays, compliance risk, and an inability to justify platform investment.
"How Might We" Questions
HMW questions are the bridge between empathy and ideation. Each one is a door. Some lead to hallways. Some lead to ballrooms. Let's open them all.
Discoverability & Knowledge
- HMW make the entire service landscape searchable in under 2 seconds, so that "who owns this?" is never a Slack question again?
- HMW eliminate stale documentation by generating service context directly from infrastructure reality (AWS, GitHub, K8s) instead of human-written YAML?
- HMW surface the "invisible architecture" — the actual dependency graph — without requiring anyone to draw or maintain diagrams?
- HMW make a new hire's first week feel like orientation instead of archaeology?
Maintenance & Trust
- HMW build a catalog that stays accurate without requiring any engineer to manually update it — ever?
- HMW give platform engineers their time back by replacing portal maintenance with portal auto-healing?
- HMW create confidence scores for catalog data so users know what's verified vs. inferred, rather than treating everything as equally trustworthy (or untrustworthy)?
- HMW make the portal's data freshness visible and transparent, so trust is earned through proof, not promises?
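One way to make the confidence-score and freshness questions concrete is to attach a source kind and a last-checked timestamp to every catalog field, then let confidence decay with age. The sketch below is illustrative only: the source kinds, the base weights, and the 30-day half-life are assumptions chosen for the example, not decisions made in this session.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical base confidence per source kind; the numbers are illustrative.
BASE_CONFIDENCE = {"verified": 1.0, "declared": 0.8, "inferred": 0.5}


@dataclass
class CatalogField:
    value: str
    source_kind: str        # "verified", "declared", or "inferred"
    last_checked: datetime  # when the value was last confirmed against its source


def confidence(field: CatalogField, now: datetime, half_life_days: float = 30.0) -> float:
    """Base confidence for the source kind, halved every `half_life_days` of age."""
    age_days = max((now - field.last_checked).total_seconds() / 86400, 0.0)
    return BASE_CONFIDENCE[field.source_kind] * 0.5 ** (age_days / half_life_days)


now = datetime(2026, 2, 28, tzinfo=timezone.utc)
fresh_owner = CatalogField("@payments-team", "verified", now)
stale_guess = CatalogField("@legacy-team", "inferred", now - timedelta(days=30))

print(confidence(fresh_owner, now))
print(confidence(stale_guess, now))
```

Showing the score next to the value, together with the "last checked" date, is what lets a user distinguish a verified owner from a month-old guess.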
Visibility & Accountability
- HMW give engineering leaders a real-time, always-current view of service ownership, maturity, and gaps — without quarterly manual audits?
- HMW turn "who owns this?" from a 15-minute incident delay into a 0-second lookup?
- HMW make production readiness measurable and visible so that teams self-correct without top-down mandates?
- HMW surface zombie services and orphaned infrastructure automatically, so cost waste and security risk don't hide in the shadows?
Adoption & Stickiness
- HMW make the portal so useful for daily work that engineers voluntarily set it as their browser homepage?
- HMW meet engineers where they already are (Slack, terminal, VS Code) instead of asking them to visit yet another dashboard?
- HMW create a "5-minute setup to first insight" experience that makes Backstage's months-long onboarding feel absurd?
- HMW design the portal so that using it is faster than NOT using it — so adoption is driven by selfishness, not mandates?
Key Insights
These are the non-obvious truths that emerged from empathy. They're the foundation of everything we design.
Insight 1: The Real Problem Isn't Missing Data — It's Scattered Data The information exists. Ownership is in CODEOWNERS. Descriptions are in READMEs. Health is in CloudWatch. On-call is in PagerDuty. The problem isn't that nobody documented anything — it's that nobody aggregated it. The portal's job isn't to create new data. It's to unify existing data into one searchable surface.
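To illustrate what "unify existing data" could look like, here is a minimal Python sketch that merges partial records from several sources into one record per service while remembering where each field came from. The source names, field names, and sample values are hypothetical, and the sources are plain dictionaries rather than real API calls.

```python
from collections import defaultdict


def unify(sources: dict[str, dict[str, dict[str, str]]]) -> dict[str, dict[str, dict[str, str]]]:
    """Merge per-source records into one record per service.

    `sources` maps source name -> service name -> {field: value}.
    The result maps service name -> field -> {"value": ..., "source": ...}.
    If two sources supply the same field, the source listed first wins.
    """
    catalog: dict[str, dict[str, dict[str, str]]] = defaultdict(dict)
    for source_name, records in sources.items():
        for service, fields in records.items():
            for field_name, value in fields.items():
                catalog[service].setdefault(
                    field_name, {"value": value, "source": source_name}
                )
    return dict(catalog)


# Hypothetical snapshots of data that already exists in separate tools.
catalog = unify({
    "codeowners": {"order-engine": {"owner": "@orders-team"}},
    "readme": {"order-engine": {"description": "Processes customer orders"}},
    "pagerduty": {"order-engine": {"on_call": "primary-orders", "owner": "@old-team"}},
})

print(catalog["order-engine"]["owner"])
```

Keeping the source alongside each value matters as much as the merge itself: it is what lets the portal show where an answer came from instead of asking users to take it on faith.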
Insight 2: Manual Maintenance Is a Design Flaw, Not a User Failure Backstage blames engineers for not updating YAML. That's like blaming users for not reading the manual. If your product requires humans to do repetitive, low-reward maintenance to function, your product is broken. Auto-discovery isn't a feature — it's the correction of a fundamental design error.
Insight 3: The New Hire Is the Canary in the Coal Mine If a new hire can't find what they need in their first week, the entire org has a knowledge management problem. They're just the ones who feel it most acutely. Solving for Jordan solves for everyone — because every engineer is a "new hire" to the 80% of services they've never touched.
Insight 4: Trust Is Binary — And It's Earned on First Contact If an engineer opens the portal and sees one wrong owner or one stale description, they close the tab and never come back. Trust in a catalog is binary: either you trust it or you don't. There's no "mostly trust." This means accuracy on day one matters more than feature completeness. Ship less, but ship truth.
Insight 5: The Portal Must Be Faster Than Slack The current workaround for "who owns this?" is asking in Slack. If the portal is slower than typing a Slack message and waiting for a response, the portal loses. The bar isn't "better than Backstage." The bar is "faster than asking a human." That's Cmd+K in under 2 seconds.
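A quick-lookup of this kind can be sketched as a ranked in-memory search over names and aliases. Aliases matter because of cases like Jordan's: the repo is called order-processor-v2 but the team says order-engine. The ranking scheme, the `search` function, and the sample owners below are assumptions for illustration, not a specified design.

```python
def search(query: str, services: list[dict], limit: int = 5) -> list[dict]:
    """Rank exact name matches first, then name prefixes, then substring hits
    in the name or any alias. Ties are broken alphabetically by name."""
    q = query.strip().lower()
    if not q:
        return []
    scored = []
    for svc in services:
        name = svc["name"].lower()
        haystack = " ".join([name, *(a.lower() for a in svc.get("aliases", []))])
        if name == q:
            rank = 0
        elif name.startswith(q):
            rank = 1
        elif q in haystack:
            rank = 2
        else:
            continue
        scored.append((rank, name, svc))
    scored.sort(key=lambda item: (item[0], item[1]))
    return [svc for _, _, svc in scored[:limit]]


# Hypothetical catalog entries; owners are made up for the example.
SERVICES = [
    {"name": "order-engine", "aliases": ["order-processor", "order-processor-v2"],
     "owner": "@orders-team"},
    {"name": "order-history", "owner": "@orders-team"},
    {"name": "inventory-sync", "owner": "@supply-team"},
]

print([s["name"] for s in search("order-processor", SERVICES)])
```

With a catalog of a few hundred services, a linear scan like this returns in well under the 2-second bar; an index only becomes necessary at a much larger scale.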
Insight 6: Directors Don't Need Dashboards — They Need Answers Priya doesn't want a dashboard with 47 widgets. She wants to answer three questions: "Who owns what?", "Are we production-ready?", and "What's falling through the cracks?" If the portal can answer those three questions on demand, Priya becomes the champion who sells it to the VP.
Core Tension: Comprehensiveness vs. Simplicity
Here's the jazz dissonance at the heart of this product. It's the tension that will define every design decision:
| COMPREHENSIVENESS | SIMPLICITY |
|---|---|
| "Show me everything about every service: dependencies, health, cost, SLOs, runbooks, deployment history, tech stack, security posture, compliance status..." | "Just tell me who owns this and where the repo is. I don't need a dissertation." |
| → This is Backstage/Port/Cortex. Powerful but overwhelming. Nobody uses it because it's too much. | → This is a spreadsheet. Simple but insufficient. Everyone outgrows it in 3 months. |
The Resolution: Progressive Disclosure
dd0c/portal must be a spreadsheet on the surface and a platform underneath. The default view is radically simple: service name, owner, health, repo link. One line per service. Scannable in 2 seconds.
But depth is one click away. Click the service → dependencies, cost, deployment history, runbook, scorecard. The complexity exists but it doesn't assault you on arrival.
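In data-model terms, progressive disclosure could mean a small summary record that is always loaded and a detail record that is fetched only when someone clicks through. The Python sketch below assumes hypothetical type names (`ServiceEntry`, `ServiceDetail`), a made-up repo URL, and an invented dependency purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ServiceDetail:
    """The deep view: loaded only when the user asks for it."""
    dependencies: list[str] = field(default_factory=list)
    monthly_cost_usd: Optional[float] = None
    runbook_url: Optional[str] = None


@dataclass
class ServiceEntry:
    """The calm surface: the four fields shown in the default list."""
    name: str
    owner: str
    health: str
    repo_url: str
    load_detail: Callable[[], ServiceDetail]
    _detail: Optional[ServiceDetail] = field(default=None, repr=False)

    def summary(self) -> str:
        return f"{self.name} | {self.owner} | {self.health} | {self.repo_url}"

    def detail(self) -> ServiceDetail:
        # Fetch on first access, then reuse the cached result.
        if self._detail is None:
            self._detail = self.load_detail()
        return self._detail


calls: list[str] = []


def fetch_order_engine_detail() -> ServiceDetail:
    calls.append("order-engine")  # records that the expensive lookup ran
    return ServiceDetail(dependencies=["payment-gateway"])


entry = ServiceEntry(
    "order-engine", "@orders-team", "healthy",
    "https://example.com/org/order-engine", fetch_order_engine_detail,
)
print(entry.summary())
```

Rendering the list view touches only `summary()`, so the cost of the deeper data is paid per click rather than per page load.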
The design principle: "Calm surface, deep ocean."
This is how you beat both Backstage (too complex on arrival) AND the Google Sheet (too shallow to grow into). You start simpler than a spreadsheet and grow deeper than Backstage — but only when the user asks for depth.
"In jazz, the notes you don't play matter more than the ones you do. The portal's default state should be silence — clean, calm, just the essentials. The solo comes when you lean in."