dd0c/route — V1 MVP Epics
This document outlines the core Epics and User Stories for the V1 MVP of dd0c/route, designed for a solo founder to implement in 1-3 day chunks per story.
Epic 1: Proxy Engine
Description: Core Rust proxy that sits between the client application and LLM providers. Must maintain strict OpenAI API compatibility, support SSE streaming, and introduce <5ms latency overhead.
User Stories
- Story 1.1: As a developer, I want to swap my `OPENAI_BASE_URL` to the proxy endpoint, so that my existing OpenAI SDK works without code changes.
- Story 1.2: As a developer, I want streaming support (SSE) preserved, so that my chat applications remain responsive while using the proxy.
- Story 1.3: As a platform engineer, I want the proxy latency overhead to be <5ms, so that intelligent routing doesn't degrade our application's user experience.
- Story 1.4: As a developer, I want provider errors (e.g., rate limits) to be passed through transparently, so that my app's existing error handling continues to work.
Acceptance Criteria
- Implements `POST /v1/chat/completions` for both streaming (`stream: true`) and non-streaming requests.
- Validates the `Authorization: Bearer` header against a Redis cache (falling back to the DB).
- Successfully forwards requests to OpenAI and Anthropic, translating formats if necessary.
- Asynchronously emits telemetry events to an in-memory channel without blocking the hot path.
- P99 latency overhead is measured at <5ms.
Estimate: 13 points
Dependencies: None
Technical Notes:
- Stack: Rust, `tokio`, `hyper`, `axum`.
- Use connection pooling for upstream providers to eliminate TLS handshake overhead.
- For streaming, parse only the first chunk/headers to make a routing decision, then pass the rest through. Count tokens from the final usage-bearing SSE chunk (the last payload before `[DONE]`).
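The final-chunk accounting can be sketched with plain string handling. This is a std-only illustration; the helper name and the assumption that the last `data:` payload before `[DONE]` carries a `usage` object are mine, and production code would parse each chunk with `serde_json` instead of string search:

```rust
/// Minimal sketch: scan a buffered SSE body and pull `total_tokens` from the
/// last JSON payload before the `data: [DONE]` sentinel.
fn total_tokens_from_sse(body: &str) -> Option<u64> {
    let mut last_payload: Option<&str> = None;
    for line in body.lines() {
        let Some(payload) = line.strip_prefix("data: ") else { continue };
        if payload.trim() == "[DONE]" {
            break; // everything after the sentinel is ignored
        }
        last_payload = Some(payload);
    }
    // Naive extraction of `"total_tokens": N` from the final payload.
    let payload = last_payload?;
    let idx = payload.find("\"total_tokens\"")?;
    let rest = &payload[idx..];
    let colon = rest.find(':')?;
    let digits: String = rest[colon + 1..]
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn main() {
    let body = "data: {\"choices\":[]}\n\
                data: {\"usage\":{\"total_tokens\": 42}}\n\
                data: [DONE]\n";
    assert_eq!(total_tokens_from_sse(body), Some(42));
}
```

Because only the last payload is inspected, the hot path can stream chunks through untouched and do this work once per request.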
Epic 2: Router Brain
Description: The intelligence core of dd0c/route embedded within the proxy. It evaluates incoming requests against routing rules, classifies complexity heuristically, checks cost tables, and executes fallback chains.
User Stories
- Story 2.1: As an engineering manager, I want the router to classify the complexity of requests, so that simple extraction tasks are downgraded to cheaper models.
- Story 2.2: As an engineering manager, I want to configure routing rules (e.g., if feature=classify -> use cheapest from [gpt-4o-mini, claude-haiku]), so that I can automatically save money on predictable workloads.
- Story 2.3: As an application developer, I want the router to automatically fallback to an alternative model if the primary model fails or rate-limits, so that my application remains highly available.
- Story 2.4: As an engineering manager, I want cost savings calculated instantly based on up-to-date provider pricing, so that my dashboard data is immediately accurate.
Acceptance Criteria
- Heuristic complexity classifier runs in <2ms based on token count, task patterns (regex on system prompt), and model hints.
- Evaluates first-match routing rules based on request tags (`X-DD0C-Feature`, `X-DD0C-Team`).
- Executes "passthrough", "cheapest", "quality-first", and "cascading" routing strategies.
- Enforces circuit breakers on downstream providers (e.g., open circuit if error rate > 10%).
- Calculates `cost_saved = cost_original - cost_actual` on the fly using in-memory cost tables.
Estimate: 8 points
Dependencies: Epic 1 (Proxy Engine)
Technical Notes:
- Stack: Rust.
- Run purely in-memory on the proxy hot path. No DB queries per request.
- Cost tables and routing rules must be loaded at startup and refreshed via a background task every 60s.
- Use `serde_json` to inspect the `messages` array for complexity classification, but do not persist the prompt.
- Circuit breaker state must be shared via Redis so all proxy instances agree on provider health.
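A std-only sketch of the two hot-path pieces above: the heuristic complexity score and the `cost_saved` arithmetic. Thresholds, pattern lists, and prices are illustrative assumptions, not the shipped values:

```rust
use std::collections::HashMap;

/// Rough token estimate: ~4 chars per token for English-ish prompts.
fn estimate_tokens(text: &str) -> u64 {
    (text.len() as u64 + 3) / 4
}

/// Heuristic complexity in [0.0, 1.0]: long prompts and "reasoning" wording
/// score high; extraction/classification wording scores low.
fn complexity_score(system_prompt: &str, user_text: &str) -> f64 {
    let tokens = estimate_tokens(system_prompt) + estimate_tokens(user_text);
    let mut score = (tokens as f64 / 4000.0).min(0.5); // length component
    let p = system_prompt.to_ascii_lowercase();
    if p.contains("extract") || p.contains("classify") {
        score -= 0.2; // simple task pattern
    }
    if p.contains("reason") || p.contains("step by step") {
        score += 0.4; // hard task pattern
    }
    score.clamp(0.0, 1.0)
}

/// cost_saved = cost_original - cost_actual, from per-1K-token price tables.
fn cost_saved(prices_per_1k: &HashMap<&str, f64>, original: &str, actual: &str, tokens: u64) -> f64 {
    let k = tokens as f64 / 1000.0;
    (prices_per_1k[original] - prices_per_1k[actual]) * k
}

fn main() {
    let prices = HashMap::from([("gpt-4o", 0.0050), ("gpt-4o-mini", 0.0006)]);
    let score = complexity_score("Classify the sentiment.", "great product!");
    assert!(score < 0.5); // simple task: safe to downgrade
    let saved = cost_saved(&prices, "gpt-4o", "gpt-4o-mini", 2000);
    assert!((saved - 0.0088).abs() < 1e-9);
}
```

Everything here is in-memory lookups and string scans, which is what keeps the classifier under the 2 ms budget.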
Epic 3: Analytics Pipeline
Description: High-throughput logging and aggregation system using TimescaleDB. Focuses on ingesting asynchronous telemetry from the Proxy Engine without blocking request processing.
User Stories
- Story 3.1: As a platform engineer, I want the proxy to emit telemetry without blocking the main request thread, so that our application performance remains unaffected.
- Story 3.2: As an engineering manager, I want my dashboard queries to be lightning fast even with millions of rows, so that I can quickly slice and dice our AI spend.
- Story 3.3: As an engineering manager, I want historical telemetry to be compressed or aged out automatically, so that the database storage costs remain minimal.
Acceptance Criteria
- Proxy emits a `RequestEvent` over an in-memory `mpsc` channel via `tokio::spawn`.
- A background worker batches events and inserts them into TimescaleDB every 1s or 100 events using PostgreSQL's bulk `COPY`.
- Continuous aggregates (`hourly_cost_summary`, `daily_cost_summary`) are created and refreshed on schedule to pre-calculate `total_cost`, `total_saved`, and `avg_latency`.
- TimescaleDB compression policies compress chunks older than 7 days, targeting 90%+ storage reduction.
- The proxy must degrade gracefully if the analytics database is unavailable.
Estimate: 8 points
Dependencies: Epic 1 (Proxy Engine)
Technical Notes:
- Stack: Rust (worker), PostgreSQL/TimescaleDB.
- Write the TimescaleDB migration scripts for the `request_events` hypertable and the continuous aggregates.
- Batching must be robust to worker panics (use bounded channels).
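The flush-on-100-events-or-1-second batching can be sketched with std `mpsc`. The real worker is async (`tokio`) and bulk-inserts into TimescaleDB; here `flush` stands in for that insert, and the constants mirror the acceptance criteria:

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

const MAX_BATCH: usize = 100;
const MAX_WAIT: Duration = Duration::from_secs(1);

/// Collects events into batches; `flush` stands in for the bulk COPY insert.
/// Flushes when the batch hits MAX_BATCH or MAX_WAIT elapses, and drains the
/// remainder when the producer side disconnects.
fn run_batcher<T>(rx: mpsc::Receiver<T>, mut flush: impl FnMut(Vec<T>)) {
    let mut batch = Vec::with_capacity(MAX_BATCH);
    let mut deadline = Instant::now() + MAX_WAIT;
    loop {
        let timeout = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(timeout) {
            Ok(event) => batch.push(event),
            Err(mpsc::RecvTimeoutError::Timeout) => {}
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() { flush(batch); }
                return; // producer gone: final flush, then exit
            }
        }
        if batch.len() >= MAX_BATCH || Instant::now() >= deadline {
            if !batch.is_empty() {
                flush(std::mem::take(&mut batch));
            }
            deadline = Instant::now() + MAX_WAIT;
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..250 { tx.send(i).unwrap(); }
    drop(tx); // close the channel so the batcher drains and exits
    let mut batches = Vec::new();
    run_batcher(rx, |b| batches.push(b.len()));
    assert_eq!(batches, vec![100, 100, 50]);
}
```

Using a bounded channel for `tx` in the real system gives backpressure instead of unbounded memory growth if the database falls behind.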
Epic 4: Dashboard API
Description: Axum REST API providing authentication, org/team management, routing rule CRUD, and data endpoints for the frontend dashboard. Focuses on frictionless developer onboarding.
User Stories
- Story 4.1: As an engineering manager, I want to authenticate via GitHub OAuth, so that I can create an organization and get an API key in under 60 seconds without remembering a password.
- Story 4.2: As an engineering manager, I want to manage my organization's routing rules and provider API keys securely, so that dd0c/route can successfully broker requests to OpenAI and Anthropic.
- Story 4.3: As an engineering manager, I want an endpoint that provides my historical spend and savings summary, so that I can visualize it in the UI.
- Story 4.4: As a platform engineer, I want to revoke an active API key, so that compromised credentials are immediately blocked.
Acceptance Criteria
- Implements the `/api/auth/github` OAuth flow, issuing JWTs and refresh tokens.
- Implements `/api/orgs` CRUD for managing an organization and API keys.
- Implements `/api/dashboard/summary` and `/api/dashboard/treemap` queries hitting the TimescaleDB continuous aggregates.
- Implements `/api/requests` for the request inspector with filters (e.g., `model`, `feature`, `team`).
- Securely stores and encrypts provider API keys in PostgreSQL using an AES-256-GCM data encryption key.
- Enforces an RBAC model (Owner, Member) per organization.
Estimate: 13 points
Dependencies: Epic 3 (Analytics Pipeline)
Technical Notes:
- Stack: Rust (`axum`), PostgreSQL.
- Reuse the `tokio` runtime to minimize context switching for a solo founder.
- Use the `oauth2` crate for GitHub integration. JWTs are signed with RS256; refresh tokens live in Redis.
- Ensure API keys are hashed (SHA-256) before storage; raw keys are never stored.
Epic 5: Dashboard UI
Description: The React SPA serving the cost attribution dashboard. Visualizes the AI spend treemap, routing rules editor, real-time ticker, and request inspector. This is the product's primary visual "Aha" moment.
User Stories
- Story 5.1: As an engineering manager, I want to see a treemap of my organization's AI spend broken down by team, feature, and model, so that I can instantly identify the most expensive areas of my application.
- Story 5.2: As an engineering manager, I want a real-time counter showing "You saved $X this week," so that I feel confident the tool is paying for itself.
- Story 5.3: As a platform engineer, I want an interface to configure routing rules (e.g., drag-to-reorder priority), so that I can instruct the proxy without editing config files.
- Story 5.4: As a platform engineer, I want a request inspector that displays metadata, cost, latency, and the specific routing decision for every request, so that I can debug why a certain model was chosen.
Acceptance Criteria
- React + Vite SPA deployed as static assets to S3 + CloudFront.
- Treemap visualization renders cost aggregations dynamically over selected time periods (7d/30d/90d).
- A routing rules editor allows CRUD operations and priority reordering for a team's rules.
- Request Inspector table displays paginated, filterable (`feature`, `team`, `status`) lists of telemetry without showing prompt content.
- Allows an admin to securely input OpenAI and Anthropic API keys.
Estimate: 13 points
Dependencies: Epic 4 (Dashboard API)
Technical Notes:
- Stack: React, TypeScript, Vite, Tailwind CSS.
- No SSR required for V1 (keep it simple). Use `react-query` or similar for data fetching and caching.
- Build the treemap with a charting library like D3 or Recharts.
- Emphasize speed—data fetches should resolve from continuous aggregates in <200ms.
Epic 6: Shadow Audit CLI
Description: The PLG "Shadow Audit" command-line tool (npx dd0c-scan). It analyzes a local codebase for LLM API calls, estimates monthly cost based on prompt templates, and projects savings with dd0c/route.
User Stories
- Story 6.1: As a developer, I want a zero-setup CLI tool (`npx dd0c-scan`) that scans my codebase and estimates how much money I'm currently wasting on overqualified LLMs, so that I can convince my manager to use dd0c/route.
- Story 6.2: As an engineering manager, I want the CLI to run locally without sending my source code to a third party, so that I can securely audit my own projects.
- Story 6.3: As an engineering manager, I want a clean, visually appealing terminal report showing "Top Opportunities" for model downgrades, so that I immediately see the value of routing.
Acceptance Criteria
- Parses a local directory for OpenAI or Anthropic SDK usage in TypeScript/JavaScript/Python files.
- Identifies the models requested in the code and estimates token usage heuristically based on the strings passed to the SDK.
- Hits `/api/v1/pricing/current` to fetch the latest cost tables and calculates an estimated monthly bill and projected savings.
- Outputs a formatted terminal report showing total potential savings and a breakdown of the highest-impact files.
- Anonymized scan summary is sent to the server only if the user explicitly opts in.
Estimate: 8 points
Dependencies: Epic 4 (Dashboard API - Pricing Endpoint)
Technical Notes:
- Stack: Node.js, `commander`, `chalk`, simple regex parsers for the Python/JS SDKs.
- Keep the CLI lightweight, fast, and as dependency-free as possible. No actual tokenizer; use heuristics (string length/structure) for token estimates.
- Must run completely offline if the pricing table is cached.
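The CLI itself is Node, but the estimation heuristic is language-agnostic. A Rust sketch of the string-length approach; the chars-per-token ratio, the placeholder "tax", and the call-volume figure are all illustrative assumptions:

```rust
/// Sketch of the dd0c-scan heuristic: no tokenizer, just string length and
/// structure of the prompt template found in the code.
fn estimate_tokens(template: &str) -> u64 {
    // ~4 chars/token, plus a flat tax per interpolation placeholder since
    // runtime values are usually longer than the `${...}` literal.
    let base = (template.len() as u64 + 3) / 4;
    let placeholders = template.matches("${").count() as u64;
    base + placeholders * 50
}

/// Projected monthly cost for one call site, given calls/month and a
/// per-1K-token price for the model found in the code.
fn monthly_cost(template: &str, calls_per_month: u64, price_per_1k: f64) -> f64 {
    estimate_tokens(template) as f64 / 1000.0 * price_per_1k * calls_per_month as f64
}

fn main() {
    let prompt = "Summarize the following ticket: ${ticket_body}";
    let tokens = estimate_tokens(prompt);
    assert_eq!(tokens, 62); // 46 chars -> 12 base tokens + 50 placeholder tax
    // 100k calls/month at an assumed $0.005 / 1K tokens:
    let cost = monthly_cost(prompt, 100_000, 0.005);
    assert!((cost - 31.0).abs() < 1e-6);
}
```

Crude, but it only has to rank call sites by impact, not bill anyone.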
Epic 7: Slack Integration
Description: The primary retention mechanism and anomaly alerting system. An asynchronous worker task dispatches weekly savings digests and threshold-based budget alerts to Slack and Email.
User Stories
- Story 7.1: As an engineering manager, I want an automated weekly digest summarizing my team's AI savings, so that I can easily report to the CFO that our tooling investment is paying off.
- Story 7.2: As a platform engineer, I want to configure a budget limit (e.g., alert if daily spend > $100) and receive a Slack webhook notification immediately, so that I can stop a retry storm before the bill gets out of hand.
- Story 7.3: As an engineering manager, I want an email version of the weekly digest, so that I can forward it straight to my leadership team.
Acceptance Criteria
- A standalone asynchronous worker (`dd0c-worker`) evaluates continuous aggregates (via TimescaleDB) every hour.
- Generates a "Monday Morning Digest" email via AWS SES.
- Emits Slack webhook payloads when a threshold alert is triggered (`threshold_amount`, `threshold_pct`).
- Adds an `X-DD0C-Signature` header to outbound webhooks to prevent spoofing.
Estimate: 8 points
Dependencies: Epic 3 (Analytics Pipeline), Epic 4 (Dashboard API)
Technical Notes:
- Stack: Rust (`tokio-cron`), `reqwest` (for webhooks), AWS SES.
- The worker is a singleton container (one task) running alongside the proxy to avoid lock contention on cron tasks.
- Ensure alerts maintain state (using PostgreSQL `alert_configs` and `last_fired_at`) so users aren't spammed for the same incident.
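The `last_fired_at` dedup can be sketched as follows. In production the state lives in PostgreSQL; a `HashMap` stands in here, and the one-hour cool-down is an illustrative assumption:

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

const COOL_DOWN: Duration = Duration::from_secs(60 * 60); // 1 hour

struct AlertState {
    last_fired_at: HashMap<u64, SystemTime>, // alert_config_id -> last fire
}

impl AlertState {
    fn new() -> Self {
        Self { last_fired_at: HashMap::new() }
    }

    /// Returns true (and records the fire) only if the threshold is breached
    /// and the alert has not fired within the cool-down window.
    fn should_fire(&mut self, config_id: u64, spend: f64, threshold: f64, now: SystemTime) -> bool {
        if spend <= threshold {
            return false;
        }
        match self.last_fired_at.get(&config_id) {
            Some(&last) if now.duration_since(last).unwrap_or_default() < COOL_DOWN => false,
            _ => {
                self.last_fired_at.insert(config_id, now);
                true
            }
        }
    }
}

fn main() {
    let mut state = AlertState::new();
    let t0 = SystemTime::UNIX_EPOCH;
    assert!(state.should_fire(1, 150.0, 100.0, t0)); // first breach fires
    assert!(!state.should_fire(1, 160.0, 100.0, t0 + Duration::from_secs(600))); // suppressed
    assert!(state.should_fire(1, 160.0, 100.0, t0 + Duration::from_secs(7200))); // re-fires
}
```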
Epic 8: Infrastructure & DevOps
Description: Containerized ECS Fargate deployment, AWS native networking, basic monitoring, and fully automated CI/CD for the entire dd0c stack. Essential for a solo founder to deploy safely and frequently.
User Stories
- Story 8.1: As a solo founder, I want to use AWS ECS Fargate, so that I don't have to manage EC2 instances or worry about OS-level patching.
- Story 8.2: As a solo founder, I want a GitHub Actions CI/CD pipeline, so that `git push` automatically runs tests, builds containers, and deploys rolling updates with zero downtime.
- Story 8.3: As an operator, I want standard AWS CloudWatch alarms (e.g., P99 proxy latency > 50ms) connected to PagerDuty, so that I am only woken up when a critical threshold is breached.
- Story 8.4: As a solo founder, I want a strict separation between my configuration (PostgreSQL) and telemetry (TimescaleDB) stores, so that I can scale analytics independently from org/auth state.
Acceptance Criteria
- Full AWS infrastructure defined via CDK (TypeScript) or Terraform.
- ALB routes `/v1/*` to the proxy container and `/api/*` to the dashboard API container.
- Dashboard static assets deployed to an S3 bucket with CloudFront caching.
- `docker build` produces three optimized images from a single Rust workspace (`dd0c-proxy`, `dd0c-api`, `dd0c-worker`).
- CloudWatch dashboards and minimum alarms configured (CPU > 80%, proxy error rate > 5%, ALB 5xx rate).
- `git push` to `main` triggers a GitHub Action to test, lint, build, push to ECR, and update the ECS Fargate services.
Estimate: 13 points
Dependencies: Epic 1 (Proxy Engine), Epic 4 (Dashboard API)
Technical Notes:
- Stack: AWS ECS Fargate, ALB, CloudFront, S3, RDS (PostgreSQL/TimescaleDB), ElastiCache (Redis), GitHub Actions.
- Ensure the ALB utilizes path-based routing correctly and handles TLS termination.
- For cost optimization on AWS, explore consolidating NAT Gateways or utilizing VPC Endpoints for S3/ECR/CloudWatch.
Epic 9: Onboarding & PLG
Description: Self-serve signup, free tier, API key management, and a getting-started flow that gets users routing their first LLM call through dd0c/route in under 2 minutes. This is the growth engine.
User Stories
- Story 9.1: As a new user, I want to sign up with GitHub OAuth in one click, so that I can start using dd0c/route without filling out forms.
- Story 9.2: As a new user, I want a free tier (up to $50/month in routed LLM spend), so that I can evaluate the product with real traffic before committing.
- Story 9.3: As a developer, I want to generate and manage API keys from the dashboard, so that I can integrate dd0c/route into my applications.
- Story 9.4: As a new user, I want a guided "First Route" onboarding flow that gives me a working curl command, so that I see cost savings within 2 minutes of signing up.
- Story 9.5: As a team lead, I want to invite team members via email, so that my team can share a single org and see aggregated savings.
Acceptance Criteria
- GitHub OAuth signup creates org + first API key automatically.
- Free tier enforced at the proxy level — requests beyond $50/month routed spend return 429 with upgrade CTA.
- API key CRUD: create, list, revoke, rotate. Keys are hashed at rest (bcrypt), only shown once on creation.
- Onboarding wizard: 3 steps — (1) copy API key, (2) paste curl command, (3) see first request in dashboard. Completion rate tracked.
- Team invite sends email with magic link. Invited user joins existing org on signup.
- Stripe Checkout integration for upgrade from free → paid ($49/month base).
Estimate: 8 points
Dependencies: Epic 4 (Dashboard API), Epic 5 (Dashboard UI)
Technical Notes:
- Use Stripe Checkout Sessions for payment — no custom billing UI needed for V1.
- Free tier enforcement happens in the proxy hot path — must be O(1) lookup (Redis counter per org, reset monthly via cron).
- Onboarding completion events tracked via PostHog or simple DB events for funnel analysis.
- Magic link invites use signed JWTs with 72-hour expiry, stored in a `pending_invites` table.
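The O(1) free-tier gate might look like this sketch: a `HashMap` stands in for the per-org Redis counter, and keying by month gives the monthly reset without a cron job (field names and the per-request cost accounting are illustrative):

```rust
use std::collections::HashMap;

const FREE_TIER_LIMIT: f64 = 50.0; // $/month of routed LLM spend

struct TierGate {
    // key: (org_id, "YYYY-MM"); monthly reset = a new key each month
    spend: HashMap<(u64, String), f64>,
}

impl TierGate {
    fn new() -> Self {
        Self { spend: HashMap::new() }
    }

    /// Returns false (HTTP 429 with upgrade CTA upstream) once the org has
    /// exceeded the free tier for the given month.
    fn admit(&mut self, org_id: u64, month: &str, request_cost: f64) -> bool {
        let counter = self.spend.entry((org_id, month.to_string())).or_insert(0.0);
        if *counter >= FREE_TIER_LIMIT {
            return false;
        }
        *counter += request_cost; // Redis equivalent: INCRBYFLOAT, O(1)
        true
    }
}

fn main() {
    let mut gate = TierGate::new();
    // 50 requests at $1.00 each exhaust the free tier...
    for _ in 0..50 {
        assert!(gate.admit(7, "2025-01", 1.0));
    }
    // ...the 51st is rejected, but a new month starts a fresh counter.
    assert!(!gate.admit(7, "2025-01", 1.0));
    assert!(gate.admit(7, "2025-02", 1.0));
}
```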
Epic 10: Transparent Factory Compliance
Description: Cross-cutting epic ensuring dd0c/route adheres to the 5 Transparent Factory architectural tenets: Atomic Flagging, Elastic Schema, Cognitive Durability, Semantic Observability, and Configurable Autonomy. These stories are woven across the existing system — they don't add features, they add engineering discipline.
Story 10.1: Atomic Flagging — Feature Flag Infrastructure
As a solo founder, I want every new routing rule, cost threshold, and provider failover behavior wrapped in a feature flag (default: off), so that I can deploy code continuously without risking production traffic.
Acceptance Criteria:
- OpenFeature SDK integrated into the Rust proxy via a compatible provider (e.g., a `flagd` sidecar or an env-based provider for V1).
- All flags evaluate locally (in-memory or sidecar) — zero network calls on the hot path.
- Every flag has an `owner` field and a `ttl` (max 14 days). CI blocks deployment if any flag exceeds its TTL at 100% rollout.
- Automated circuit breaker: if a flagged code path increases P99 latency by >5% or error rate by >2%, the flag auto-disables within 30 seconds.
- Flags exist for: model routing strategies, complexity classifier thresholds, provider failover chains, new dashboard features.
Estimate: 5 points
Dependencies: Epic 1 (Proxy Engine), Epic 2 (Router Brain)
Technical Notes:
- Use OpenFeature Rust SDK. For V1, a simple JSON file or env-var provider is fine — no LaunchDarkly needed.
- Circuit breaker integration: extend the existing Redis-backed circuit breaker to also flip flags.
- Flag cleanup: add a `make flag-audit` target that lists expired flags.
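The `flag-audit` check could be sketched like this; the `Flag` fields mirror the acceptance criteria (`owner`, 14-day TTL cap), while the struct layout itself is an illustrative assumption:

```rust
use std::time::{Duration, SystemTime};

const MAX_TTL: Duration = Duration::from_secs(14 * 24 * 60 * 60); // 14 days

#[allow(dead_code)]
struct Flag {
    name: &'static str,
    owner: &'static str,
    created_at: SystemTime,
    ttl: Duration,
}

/// Returns the names of flags past their TTL (CI fails if this is non-empty
/// at 100% rollout). TTLs longer than 14 days are clamped to the maximum.
fn expired_flags(flags: &[Flag], now: SystemTime) -> Vec<&'static str> {
    flags
        .iter()
        .filter(|f| {
            let ttl = f.ttl.min(MAX_TTL);
            now.duration_since(f.created_at).unwrap_or_default() > ttl
        })
        .map(|f| f.name)
        .collect()
}

fn main() {
    let t0 = SystemTime::UNIX_EPOCH;
    let day = Duration::from_secs(24 * 60 * 60);
    let flags = [
        Flag { name: "cascading_routing", owner: "dc", created_at: t0, ttl: 7 * day },
        Flag { name: "new_treemap", owner: "dc", created_at: t0, ttl: 30 * day }, // clamped to 14d
    ];
    let now = t0 + 20 * day;
    assert_eq!(expired_flags(&flags, now), vec!["cascading_routing", "new_treemap"]);
}
```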
Story 10.2: Elastic Schema — Additive-Only Migration Discipline
As a solo founder, I want all TimescaleDB and Redis schema changes to be strictly additive, so that I can roll back any deployment instantly without data loss or broken readers.
Acceptance Criteria:
- CI lint step rejects any migration containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns.
- New fields use a `_v2` suffix or a new table when breaking changes are unavoidable.
- All Rust structs rely on serde's default of ignoring unknown fields (i.e., never add `#[serde(deny_unknown_fields)]`), so V1 code tolerates V2 fields.
- Dual-write pattern documented and enforced: during migration windows, the API writes to both old and new schema targets within the same DB transaction.
- Every migration file includes a `sunset_date` comment (max 30 days). A CI check warns if any migration is past sunset without cleanup.
Estimate: 3 points
Dependencies: Epic 3 (Analytics Pipeline)
Technical Notes:
- Use `sqlx` migration files. Add a pre-commit hook or CI step that greps for forbidden DDL keywords.
- Redis key schema: version keys with a prefix (e.g., `route:v1:config`, `route:v2:config`). Never rename keys.
- For the `request_events` hypertable, new columns are always nullable, with defaults.
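The forbidden-DDL lint can be sketched as a keyword scan. The keyword list and the `ADD COLUMN` carve-out are illustrative; a stricter check would parse the SQL rather than grep it:

```rust
const FORBIDDEN: [&str; 3] = ["DROP ", "ALTER TABLE", "RENAME "];
const ALLOWED_ALTER: &str = "ADD COLUMN"; // additive ALTERs stay legal

/// Returns the forbidden keywords found in a migration, ignoring additive
/// `ALTER TABLE ... ADD COLUMN` statements.
fn lint_migration(sql: &str) -> Vec<&'static str> {
    let upper = sql.to_ascii_uppercase();
    FORBIDDEN
        .iter()
        .copied()
        .filter(|kw| {
            upper.lines().any(|line| {
                line.contains(*kw) && !(*kw == "ALTER TABLE" && line.contains(ALLOWED_ALTER))
            })
        })
        .collect()
}

fn main() {
    let good = "ALTER TABLE request_events ADD COLUMN cost_saved NUMERIC NULL;";
    assert!(lint_migration(good).is_empty());

    let bad = "DROP COLUMN cost;\nALTER TABLE request_events RENAME cost TO price;";
    assert_eq!(lint_migration(bad), vec!["DROP ", "ALTER TABLE", "RENAME "]);
}
```

Wired into CI, a non-empty result fails the build before the migration can ship.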
Story 10.3: Cognitive Durability — Decision Logs for Routing Logic
As a future maintainer (or future me), I want every change to routing algorithms, cost models, or provider selection logic accompanied by a decision_log.json, so that I can understand why a decision was made months later in under 60 seconds.
Acceptance Criteria:
- `decision_log.json` schema defined: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a `decision_log.json` entry for any PR touching `src/router/`, `src/cost/`, or migration files.
- Cyclomatic complexity cap of 10 enforced via `cargo clippy` or a custom lint. PRs exceeding the cap are blocked.
- Decision logs are committed alongside code in a `docs/decisions/` directory, one file per significant change.
Estimate: 2 points
Dependencies: None
Technical Notes:
- Use a PR template that prompts for the decision log fields.
- For the complexity cap, use `cargo clippy -W clippy::cognitive_complexity` with the threshold set to 10 (via `clippy.toml`).
- Decision logs for cost table updates should include: source of pricing data, comparison with previous rates, expected savings impact.
Story 10.4: Semantic Observability — AI Reasoning Spans on Routing Decisions
As a platform engineer debugging a misrouted request, I want every proxy routing decision to emit an OpenTelemetry span with structured AI reasoning metadata, so that I can trace exactly which model was chosen, why, and what alternatives were rejected.
Acceptance Criteria:
- Every `/v1/chat/completions` request generates an `ai_routing_decision` span as a child of the request trace.
- Span attributes include: `ai.model_selected`, `ai.model_alternatives` (JSON array of rejected models plus reasons), `ai.cost_delta` (savings vs. default), `ai.complexity_score`, and `ai.routing_strategy` (passthrough/cheapest/quality-first/cascading).
- `ai.prompt_hash` (SHA-256 of the first 500 chars of the system prompt) included for correlation — never raw prompt content.
- Spans export to any OTLP-compatible backend (Grafana Cloud, Jaeger, etc.).
- No PII in any span attribute. Prompt content is hashed, not logged.
Estimate: 3 points
Dependencies: Epic 1 (Proxy Engine), Epic 2 (Router Brain)
Technical Notes:
- Use `tracing` plus the `opentelemetry` Rust crates with an OTLP exporter.
- The span should be created inside the router decision function, not as middleware — it needs access to the alternatives list.
- For V1, export to stdout in OTLP JSON format. Production: OTLP gRPC to a collector.
Story 10.5: Configurable Autonomy — Governance Policy for Automated Routing
As a solo founder, I want a policy.json governance file that controls what the system is allowed to do autonomously (e.g., switch models, update cost tables, add providers), so that I maintain human oversight as the system grows.
Acceptance Criteria:
- `policy.json` defines `governance_mode`: `strict` (all changes require manual approval) or `audit` (changes auto-apply but are logged).
- The proxy checks `governance_mode` before applying any runtime config change (routing rule update, cost table refresh, provider addition).
- `panic_mode` flag: when set to `true`, the proxy freezes all routing rules to their last-known-good state, disables auto-failover, and routes everything to a single hardcoded provider.
- Governance drift monitoring: a weekly cron job logs the ratio of auto-applied vs. manually-approved changes. If auto-applied changes exceed 80% in `strict` mode, an alert fires.
- All policy check decisions are logged: "Allowed by audit mode", "Blocked by strict mode", "Panic mode active — frozen".
Estimate: 3 points
Dependencies: Epic 2 (Router Brain)
Technical Notes:
- `policy.json` lives in the repo root and is loaded at startup, plus watched for changes via the `notify` crate.
- For V1 as a solo founder, start in `audit` mode. `strict` mode is for when you hire or add AI agents to the pipeline.
- Panic mode should be triggerable via a single API call (`POST /admin/panic`) or by setting an env var — whichever is faster in an emergency.
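The policy gate could be sketched as follows; the enum and decision names are illustrative, and in the real system the mode would come from the watched `policy.json`:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum GovernanceMode {
    Strict, // all changes require manual approval
    Audit,  // changes auto-apply but are logged
}

#[derive(Debug)]
enum Decision {
    Apply(&'static str), // applied, with an audit-log entry
    Queue(&'static str), // held for manual approval
    Frozen,              // panic mode: nothing changes
}

/// Consulted before any runtime config change (routing rule update, cost
/// table refresh, provider addition). Panic mode overrides everything.
fn check_policy(mode: GovernanceMode, panic_mode: bool, change: &'static str) -> Decision {
    if panic_mode {
        return Decision::Frozen; // "Panic mode active — frozen"
    }
    match mode {
        GovernanceMode::Audit => Decision::Apply(change),  // "Allowed by audit mode"
        GovernanceMode::Strict => Decision::Queue(change), // "Blocked by strict mode"
    }
}

fn main() {
    assert!(matches!(
        check_policy(GovernanceMode::Audit, false, "cost_table_refresh"),
        Decision::Apply(_)
    ));
    assert!(matches!(
        check_policy(GovernanceMode::Strict, false, "routing_rule_update"),
        Decision::Queue(_)
    ));
    assert!(matches!(
        check_policy(GovernanceMode::Audit, true, "provider_addition"),
        Decision::Frozen
    ));
}
```

Keeping the gate a pure function makes every "allowed/blocked/frozen" log line trivially testable.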
Epic 10 Summary
| Story | Tenet | Points |
|---|---|---|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| Total |  | 16 |