dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
products/01-llm-cost-router/epics/epics.md

# dd0c/route — V1 MVP Epics

This document outlines the core Epics and User Stories for the V1 MVP of dd0c/route, designed for a solo founder to implement in 1-3 day chunks per story.

---

## Epic 1: Proxy Engine

**Description:** Core Rust proxy that sits between the client application and LLM providers. Must maintain strict OpenAI API compatibility, support SSE streaming, and introduce <5ms latency overhead.

### User Stories

- **Story 1.1:** As a developer, I want to swap my `OPENAI_BASE_URL` to the proxy endpoint, so that my existing OpenAI SDK works without code changes.
- **Story 1.2:** As a developer, I want streaming support (SSE) preserved, so that my chat applications remain responsive while using the proxy.
- **Story 1.3:** As a platform engineer, I want the proxy latency overhead to be <5ms, so that intelligent routing doesn't degrade our application's user experience.
- **Story 1.4:** As a developer, I want provider errors (e.g., rate limits) to be passed through transparently, so that my app's existing error handling continues to work.

### Acceptance Criteria

- Implements `POST /v1/chat/completions` for both streaming (`stream: true`) and non-streaming requests.
- Validates the `Authorization: Bearer` header against a Redis cache (falling back to the DB).
- Successfully forwards requests to OpenAI and Anthropic, translating formats where necessary.
- Asynchronously emits telemetry events to an in-memory channel without blocking the hot path.
- P99 latency overhead is measured at <5ms.

### Estimate: 13 points

### Dependencies: None

### Technical Notes:

- Stack: Rust, `tokio`, `hyper`, `axum`.
- Use connection pooling for upstream providers to eliminate TLS handshake overhead.
- For streaming, parse only the first chunk/headers to make a routing decision, then pass the stream through. Count tokens from the final usage chunk that precedes the `[DONE]` sentinel.
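
The token-counting note above can be sketched in a few lines. This is an illustrative, stdlib-only scan for the usage figure in a final SSE chunk; the field names follow the OpenAI streaming format, but `extract_total_tokens` is a hypothetical helper, not the production parser.

```rust
// Hypothetical sketch: pull `total_tokens` out of a final SSE usage chunk
// with a plain string scan, keeping the hot path allocation-light.
fn extract_total_tokens(sse_line: &str) -> Option<u64> {
    // SSE data lines look like: data: {"usage":{"total_tokens":57,...},...}
    let payload = sse_line.strip_prefix("data: ")?;
    if payload.trim() == "[DONE]" {
        return None; // the sentinel chunk carries no usage data
    }
    let idx = payload.find("\"total_tokens\":")?;
    let rest = &payload[idx + "\"total_tokens\":".len()..];
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

fn main() {
    let chunk = r#"data: {"id":"x","usage":{"prompt_tokens":12,"completion_tokens":45,"total_tokens":57}}"#;
    assert_eq!(extract_total_tokens(chunk), Some(57));
    assert_eq!(extract_total_tokens("data: [DONE]"), None);
    println!("usage parsing ok");
}
```

A real implementation would parse the JSON properly (e.g., with `serde_json`), but the scan shows why the passthrough path never needs to buffer the whole stream.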

---

## Epic 2: Router Brain

**Description:** The intelligence core of dd0c/route, embedded within the proxy. It evaluates incoming requests against routing rules, classifies complexity heuristically, checks cost tables, and executes fallback chains.

### User Stories

- **Story 2.1:** As an engineering manager, I want the router to classify the complexity of requests, so that simple extraction tasks are downgraded to cheaper models.
- **Story 2.2:** As an engineering manager, I want to configure routing rules (e.g., if feature=classify -> use cheapest from [gpt-4o-mini, claude-haiku]), so that I can automatically save money on predictable workloads.
- **Story 2.3:** As an application developer, I want the router to automatically fall back to an alternative model if the primary model fails or rate-limits, so that my application remains highly available.
- **Story 2.4:** As an engineering manager, I want cost savings calculated instantly from up-to-date provider pricing, so that my dashboard data is immediately accurate.

### Acceptance Criteria

- Heuristic complexity classifier runs in <2ms based on token count, task patterns (regex on the system prompt), and model hints.
- Evaluates first-match routing rules based on request tags (`X-DD0C-Feature`, `X-DD0C-Team`).
- Executes "passthrough", "cheapest", "quality-first", and "cascading" routing strategies.
- Enforces circuit breakers on downstream providers (e.g., open the circuit if the error rate exceeds 10%).
- Calculates `cost_saved = cost_original - cost_actual` on the fly using in-memory cost tables.

### Estimate: 8 points

### Dependencies: Epic 1 (Proxy Engine)

### Technical Notes:

- Stack: Rust.
- Runs purely in-memory on the proxy hot path. No DB queries per request.
- Cost tables and routing rules are loaded at startup and refreshed by a background task every 60s.
- Use `serde_json` to inspect the `messages` array for complexity classification, but do not persist the prompt.
- Circuit breaker state is shared via Redis so all proxy instances agree on provider health.
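
The "cheapest" strategy and the on-the-fly savings formula above can be sketched minimally. The cost table maps model name to USD per 1K tokens; the prices below are placeholders, not real provider rates.

```rust
use std::collections::HashMap;

// Minimal sketch of the "cheapest" routing strategy over an in-memory
// cost table, plus the cost_saved = cost_original - cost_actual formula.
fn pick_cheapest<'a>(candidates: &[&'a str], cost_per_1k: &HashMap<&str, f64>) -> Option<&'a str> {
    candidates
        .iter()
        .filter(|m| cost_per_1k.contains_key(*m))
        .min_by(|a, b| cost_per_1k[*a].partial_cmp(&cost_per_1k[*b]).unwrap())
        .copied()
}

// Savings for one request: price delta scaled by tokens actually used.
fn cost_saved(tokens: f64, original_per_1k: f64, actual_per_1k: f64) -> f64 {
    (original_per_1k - actual_per_1k) * tokens / 1000.0
}

fn main() {
    let mut table = HashMap::new();
    table.insert("gpt-4o", 5.00);        // placeholder pricing
    table.insert("gpt-4o-mini", 0.15);
    table.insert("claude-haiku", 0.25);
    let chosen = pick_cheapest(&["gpt-4o-mini", "claude-haiku"], &table).unwrap();
    assert_eq!(chosen, "gpt-4o-mini");
    // 10K tokens routed to the cheaper model instead of the requested gpt-4o:
    let saved = cost_saved(10_000.0, table["gpt-4o"], table[chosen]);
    assert!((saved - 48.5).abs() < 1e-6);
}
```

Because the table lives in memory and is refreshed by the 60s background task, this decision adds no I/O to the hot path.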

---

## Epic 3: Analytics Pipeline

**Description:** High-throughput logging and aggregation system using TimescaleDB. Focuses on ingesting asynchronous telemetry from the Proxy Engine without blocking request processing.

### User Stories

- **Story 3.1:** As a platform engineer, I want the proxy to emit telemetry without blocking the main request thread, so that our application performance remains unaffected.
- **Story 3.2:** As an engineering manager, I want my dashboard queries to be lightning fast even with millions of rows, so that I can quickly slice and dice our AI spend.
- **Story 3.3:** As an engineering manager, I want historical telemetry to be compressed or aged out automatically, so that database storage costs remain minimal.

### Acceptance Criteria

- Proxy emits a `RequestEvent` over an in-memory `mpsc` channel via `tokio::spawn`.
- A background worker batches events and inserts them into TimescaleDB every 1s or 100 events using a bulk `COPY`.
- Continuous aggregates (`hourly_cost_summary`, `daily_cost_summary`) are created and refreshed on schedule to pre-calculate `total_cost`, `total_saved`, and `avg_latency`.
- TimescaleDB compression policies compress chunks older than 7 days by 90%+.
- The proxy must degrade gracefully if the analytics database is unavailable.

### Estimate: 8 points

### Dependencies: Epic 1 (Proxy Engine)

### Technical Notes:

- Stack: Rust (worker), PostgreSQL/TimescaleDB.
- Write the TimescaleDB migration scripts for the `request_events` hypertable and the continuous aggregates.
- Batching must be robust to worker panics and backpressure (use bounded channels so a slow or crashed worker cannot exhaust proxy memory).
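
The "flush every 1s or 100 events" rule can be sketched with a standard channel. This is an illustrative, stdlib-only batcher (`std::sync::mpsc` standing in for the `tokio` channel); `drain_batch` and the `String` event type are our own names.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

// Illustrative batcher for the telemetry worker: collect events until either
// `max` are buffered or the time window elapses, then hand the batch to the
// bulk writer.
fn drain_batch(rx: &Receiver<String>, max: usize, window: Duration) -> Vec<String> {
    let deadline = Instant::now() + window;
    let mut batch = Vec::with_capacity(max);
    while batch.len() < max {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(event) => batch.push(event),
            // Window elapsed or all senders dropped: flush what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}

fn main() {
    let (tx, rx) = channel();
    for i in 0..3 {
        tx.send(format!("event-{i}")).unwrap();
    }
    drop(tx); // sender gone: drain returns whatever is buffered
    let batch = drain_batch(&rx, 100, Duration::from_millis(50));
    assert_eq!(batch.len(), 3);
    println!("flushed {} events", batch.len());
}
```

The async version with `tokio::select!` over a ticker has the same shape; the point is that the proxy side only ever does a non-blocking send.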

---

## Epic 4: Dashboard API

**Description:** Axum REST API providing authentication, org/team management, routing rule CRUD, and data endpoints for the frontend dashboard. Focuses on frictionless developer onboarding.

### User Stories

- **Story 4.1:** As an engineering manager, I want to authenticate via GitHub OAuth, so that I can create an organization and get an API key in under 60 seconds without remembering a password.
- **Story 4.2:** As an engineering manager, I want to manage my organization's routing rules and provider API keys securely, so that dd0c/route can successfully broker requests to OpenAI and Anthropic.
- **Story 4.3:** As an engineering manager, I want an endpoint that provides my historical spend and savings summary, so that I can visualize it in the UI.
- **Story 4.4:** As a platform engineer, I want to revoke an active API key, so that compromised credentials are immediately blocked.

### Acceptance Criteria

- Implements the `/api/auth/github` OAuth flow, issuing JWTs and refresh tokens.
- Implements `/api/orgs` CRUD for managing an organization and its API keys.
- Implements `/api/dashboard/summary` and `/api/dashboard/treemap` queries hitting the TimescaleDB continuous aggregates.
- Implements `/api/requests` for the request inspector with filters (e.g., `model`, `feature`, `team`).
- Securely stores provider API keys in PostgreSQL, encrypted with an AES-256-GCM Data Encryption Key.
- Enforces an RBAC model (Owner, Member) per organization.

### Estimate: 13 points

### Dependencies: Epic 3 (Analytics Pipeline)

### Technical Notes:

- Stack: Rust (`axum`), PostgreSQL.
- Reuse the same `tokio` async stack across services to minimize context-switching for a solo founder.
- Use the `oauth2` crate for GitHub integration. JWTs are signed with RS256; refresh tokens live in Redis.
- Ensure API keys are hashed (SHA-256) before storage; raw keys are never stored.
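
The two-role RBAC model is small enough to show whole. This is a sketch; the `Action` list is illustrative, not a complete permission matrix.

```rust
// Tiny sketch of the per-org RBAC check. Only two roles exist in V1.
#[derive(PartialEq)]
enum Role { Owner, Member }

// Illustrative actions; the real set would cover every mutating endpoint.
enum Action { ViewDashboard, ManageApiKeys, ManageOrg }

fn is_allowed(role: &Role, action: &Action) -> bool {
    match action {
        Action::ViewDashboard => true, // both roles can read
        Action::ManageApiKeys | Action::ManageOrg => *role == Role::Owner,
    }
}

fn main() {
    assert!(is_allowed(&Role::Member, &Action::ViewDashboard));
    assert!(!is_allowed(&Role::Member, &Action::ManageApiKeys));
    assert!(is_allowed(&Role::Owner, &Action::ManageOrg));
}
```

Keeping the check a pure function makes it trivial to call from an `axum` middleware layer and to unit-test exhaustively.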

---

## Epic 5: Dashboard UI

**Description:** The React SPA serving the cost attribution dashboard. Visualizes the AI spend treemap, routing rules editor, real-time ticker, and request inspector. This is the product's primary visual "Aha" moment.

### User Stories

- **Story 5.1:** As an engineering manager, I want to see a treemap of my organization's AI spend broken down by team, feature, and model, so that I can instantly identify the most expensive areas of my application.
- **Story 5.2:** As an engineering manager, I want a real-time counter showing "You saved $X this week," so that I feel confident the tool is paying for itself.
- **Story 5.3:** As a platform engineer, I want an interface to configure routing rules (e.g., drag-to-reorder priority), so that I can instruct the proxy without editing config files.
- **Story 5.4:** As a platform engineer, I want a request inspector that displays metadata, cost, latency, and the specific routing decision for every request, so that I can debug why a certain model was chosen.

### Acceptance Criteria

- React + Vite SPA deployed as static assets to S3 + CloudFront.
- Treemap visualization renders cost aggregations dynamically over selected time periods (7d/30d/90d).
- A routing rules editor allows CRUD operations and priority reordering for a team's rules.
- Request inspector table displays paginated, filterable (`feature`, `team`, `status`) lists of telemetry without showing prompt content.
- Allows an admin to securely input OpenAI and Anthropic API keys.

### Estimate: 13 points

### Dependencies: Epic 4 (Dashboard API)

### Technical Notes:

- Stack: React, TypeScript, Vite, Tailwind CSS.
- No SSR required for V1 (keep it simple). Use `react-query` or similar for data fetching and caching.
- Build the treemap with a charting library like D3 or Recharts.
- Emphasize speed: data fetches should resolve from continuous aggregates in <200ms.
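
The server-side rollup that feeds the treemap is simple to sketch. This is an illustration of the shape of the data, not the actual `/api/dashboard/treemap` handler; `CostRow` and `rollup` are hypothetical names, and the real query would come from the TimescaleDB continuous aggregates rather than in-process aggregation.

```rust
use std::collections::HashMap;

// Flat telemetry rows aggregated into (team, feature) cost buckets —
// the nesting the treemap renders. Field layout is illustrative.
struct CostRow {
    team: String,
    feature: String,
    cost_usd: f64,
}

fn rollup(rows: &[CostRow]) -> HashMap<(String, String), f64> {
    let mut buckets: HashMap<(String, String), f64> = HashMap::new();
    for r in rows {
        *buckets.entry((r.team.clone(), r.feature.clone())).or_insert(0.0) += r.cost_usd;
    }
    buckets
}

fn main() {
    let rows = vec![
        CostRow { team: "search".into(), feature: "classify".into(), cost_usd: 1.25 },
        CostRow { team: "search".into(), feature: "classify".into(), cost_usd: 0.75 },
        CostRow { team: "support".into(), feature: "summarize".into(), cost_usd: 3.00 },
    ];
    let buckets = rollup(&rows);
    let key = ("search".to_string(), "classify".to_string());
    assert_eq!(buckets[&key], 2.0);
    assert_eq!(buckets.len(), 2);
}
```

The UI only ever receives these pre-summed buckets, which is what keeps the <200ms target realistic at millions of raw rows.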

---

## Epic 6: Shadow Audit CLI

**Description:** The PLG "Shadow Audit" command-line tool (`npx dd0c-scan`). It analyzes a local codebase for LLM API calls, estimates monthly cost based on prompt templates, and projects savings with dd0c/route.

### User Stories

- **Story 6.1:** As a developer, I want a zero-setup CLI tool (`npx dd0c-scan`) that scans my codebase and estimates how much money I'm currently wasting on overqualified LLMs, so that I can convince my manager to use dd0c/route.
- **Story 6.2:** As an engineering manager, I want the CLI to run locally without sending my source code to a third party, so that I can securely audit my own projects.
- **Story 6.3:** As an engineering manager, I want a clean, visually appealing terminal report showing "Top Opportunities" for model downgrades, so that I immediately see the value of routing.

### Acceptance Criteria

- Parses a local directory for OpenAI or Anthropic SDK usage in TypeScript/JavaScript/Python files.
- Identifies the models requested in the code and estimates token usage heuristically from the strings passed to the SDK.
- Hits `/api/v1/pricing/current` to fetch the latest cost tables, then calculates an estimated monthly bill and projected savings.
- Outputs a formatted terminal report showing total potential savings and a breakdown of the highest-impact files.
- Anonymized scan summary is sent to the server only if the user explicitly opts in.

### Estimate: 8 points

### Dependencies: Epic 4 (Dashboard API - Pricing Endpoint)

### Technical Notes:

- Stack: Node.js, `commander`, `chalk`, simple regex parsers for the Python/JS SDKs.
- Keep the CLI lightweight, fast, and as dependency-light as possible. No actual LLM parsing; use heuristics (string length/structure) for token estimates.
- Must run completely offline if the pricing table is cached.
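
The string-length heuristic mentioned above can be made concrete. Shown in Rust for consistency with the rest of this document (the CLI itself is Node.js); the ~4-characters-per-token rule of thumb, the prices, and the call volume are all illustrative assumptions.

```rust
// Heuristic token estimate: ~4 characters per token, rounded up.
// This is a rule-of-thumb approximation, not a real tokenizer.
fn estimate_tokens(prompt_template: &str) -> u64 {
    (prompt_template.chars().count() as u64 + 3) / 4
}

// Projected monthly bill, given a per-1M-token price from the pricing table.
fn monthly_cost_usd(tokens_per_call: u64, calls_per_month: u64, usd_per_1m_tokens: f64) -> f64 {
    (tokens_per_call * calls_per_month) as f64 / 1_000_000.0 * usd_per_1m_tokens
}

fn main() {
    let tokens = estimate_tokens("Classify the sentiment of the following review:");
    // A simple classification prompt at high volume, premium vs. budget model
    // (prices are placeholders):
    let premium = monthly_cost_usd(tokens, 1_000_000, 5.00);
    let budget = monthly_cost_usd(tokens, 1_000_000, 0.15);
    assert!(premium > budget);
    println!("projected monthly savings: ${:.2}", premium - budget);
}
```

This is exactly the level of precision the "Shadow Audit" needs: good enough to rank files by impact, with no tokenizer dependency.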

---

## Epic 7: Slack Integration

**Description:** The primary retention mechanism and anomaly alerting system. An asynchronous worker task dispatches weekly savings digests and threshold-based budget alerts to Slack and email.

### User Stories

- **Story 7.1:** As an engineering manager, I want an automated weekly digest summarizing my team's AI savings, so that I can easily report to the CFO that our tooling investment is paying off.
- **Story 7.2:** As a platform engineer, I want to configure a budget limit (e.g., alert if daily spend > $100) and receive a Slack webhook notification immediately, so that I can stop a retry storm before the bill gets out of hand.
- **Story 7.3:** As an engineering manager, I want an email version of the weekly digest, so that I can forward it straight to my leadership team.

### Acceptance Criteria

- A standalone asynchronous worker (`dd0c-worker`) evaluates the TimescaleDB continuous aggregates every hour.
- Generates a "Monday Morning Digest" email via AWS SES.
- Emits Slack webhook payloads when a threshold alert is triggered (`threshold_amount`, `threshold_pct`).
- Adds an `X-DD0C-Signature` header to outbound webhooks to prevent spoofing.

### Estimate: 8 points

### Dependencies: Epic 3 (Analytics Pipeline), Epic 4 (Dashboard API)

### Technical Notes:

- Stack: Rust (`tokio-cron`), `reqwest` (for webhooks), AWS SES.
- Worker is a singleton container (1 task) running alongside the proxy to avoid lock contention on cron tasks.
- Ensure alerts maintain state (using PostgreSQL `alert_configs` and `last_fired_at`) so users aren't spammed for the same incident.
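
The "don't spam the same incident" rule above reduces to a small predicate: fire only when spend crosses the threshold and the previous firing is outside a cooldown window. A minimal sketch, with epoch-second timestamps and a hypothetical `should_fire` helper standing in for the worker's check against `last_fired_at`:

```rust
// Threshold alert with cooldown. `last_fired_at` comes from the
// alert_configs row; `now` is the current epoch time in seconds.
fn should_fire(spend: f64, threshold: f64, last_fired_at: Option<u64>, now: u64, cooldown_s: u64) -> bool {
    if spend <= threshold {
        return false;
    }
    match last_fired_at {
        None => true, // first breach always fires
        Some(t) => now.saturating_sub(t) >= cooldown_s,
    }
}

fn main() {
    // Daily spend $120 against a $100 limit, 1h cooldown:
    assert!(should_fire(120.0, 100.0, None, 10_000, 3_600));
    // Already fired 10 minutes ago -> suppressed:
    assert!(!should_fire(120.0, 100.0, Some(9_400), 10_000, 3_600));
    // Cooldown elapsed -> fire again:
    assert!(should_fire(120.0, 100.0, Some(5_000), 10_000, 3_600));
}
```

On a fire, the worker would update `last_fired_at` in the same transaction that enqueues the Slack payload, so a crash cannot double-send.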

---

## Epic 8: Infrastructure & DevOps

**Description:** Containerized ECS Fargate deployment, AWS-native networking, basic monitoring, and fully automated CI/CD for the entire dd0c stack. Essential for a solo founder to deploy safely and frequently.

### User Stories

- **Story 8.1:** As a solo founder, I want to use AWS ECS Fargate, so that I don't have to manage EC2 instances or worry about OS-level patching.
- **Story 8.2:** As a solo founder, I want a GitHub Actions CI/CD pipeline, so that `git push` automatically runs tests, builds containers, and deploys rolling updates with zero downtime.
- **Story 8.3:** As an operator, I want standard AWS CloudWatch alarms (e.g., P99 proxy latency > 50ms) connected to PagerDuty, so that I am only woken up when a critical threshold is breached.
- **Story 8.4:** As a solo founder, I want a strict separation between my configuration (PostgreSQL) and telemetry (TimescaleDB) stores, so that I can scale analytics independently from org/auth state.

### Acceptance Criteria

- Full AWS infrastructure defined via CDK (TypeScript) or Terraform.
- ALB routes `/v1/*` to the proxy container and `/api/*` to the dashboard API container.
- Dashboard static assets deployed to an S3 bucket with CloudFront caching.
- `docker build` produces three optimized images from a single Rust workspace (`dd0c-proxy`, `dd0c-api`, `dd0c-worker`).
- CloudWatch dashboards and a minimum set of alarms configured (CPU >80%, proxy error rate >5%, ALB 5xx rate).
- `git push main` triggers a GitHub Action to test, lint, build, push to ECR, and update the ECS Fargate services.

### Estimate: 13 points

### Dependencies: Epic 1 (Proxy Engine), Epic 4 (Dashboard API)

### Technical Notes:

- Stack: AWS ECS Fargate, ALB, CloudFront, S3, RDS (PostgreSQL/TimescaleDB), ElastiCache (Redis), GitHub Actions.
- Ensure the ALB uses path-based routing correctly and handles TLS termination.
- For AWS cost optimization, explore consolidating NAT Gateways or using VPC Endpoints for S3/ECR/CloudWatch.

---

## Epic 9: Onboarding & PLG

**Description:** Self-serve signup, free tier, API key management, and a getting-started flow that gets users routing their first LLM call through dd0c/route in under 2 minutes. This is the growth engine.

### User Stories

- **Story 9.1:** As a new user, I want to sign up with GitHub OAuth in one click, so that I can start using dd0c/route without filling out forms.
- **Story 9.2:** As a new user, I want a free tier (up to $50/month in routed LLM spend), so that I can evaluate the product with real traffic before committing.
- **Story 9.3:** As a developer, I want to generate and manage API keys from the dashboard, so that I can integrate dd0c/route into my applications.
- **Story 9.4:** As a new user, I want a guided "First Route" onboarding flow that gives me a working curl command, so that I see cost savings within 2 minutes of signing up.
- **Story 9.5:** As a team lead, I want to invite team members via email, so that my team can share a single org and see aggregated savings.

### Acceptance Criteria

- GitHub OAuth signup creates the org and its first API key automatically.
- Free tier enforced at the proxy level — requests beyond $50/month of routed spend return 429 with an upgrade CTA.
- API key CRUD: create, list, revoke, rotate. Keys are hashed at rest (SHA-256, matching Epic 4 — a fast hash is required for per-request validation) and only shown once on creation.
- Onboarding wizard: 3 steps — (1) copy API key, (2) paste curl command, (3) see the first request in the dashboard. Completion rate is tracked.
- Team invite sends an email with a magic link. The invited user joins the existing org on signup.
- Stripe Checkout integration for the upgrade from free → paid ($49/month base).

### Estimate: 8 points

### Dependencies: Epic 4 (Dashboard API), Epic 5 (Dashboard UI)

### Technical Notes:

- Use Stripe Checkout Sessions for payment — no custom billing UI needed for V1.
- Free tier enforcement happens in the proxy hot path — it must be an O(1) lookup (Redis counter per org, reset monthly via cron).
- Onboarding completion events tracked via PostHog or simple DB events for funnel analysis.
- Magic-link invites use signed JWTs with a 72-hour expiry, stored in a `pending_invites` table.
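
The O(1) free-tier check above can be sketched with an in-process map standing in for the per-org Redis counter (in production this would be a Redis `INCRBYFLOAT` + compare; the function name and the `u16` status return are our own shorthand):

```rust
use std::collections::HashMap;

// Free-tier enforcement sketch. A HashMap stands in for the Redis
// counter per org; the cron reset is out of scope here.
const FREE_TIER_LIMIT_USD: f64 = 50.0;

fn record_and_check(spend: &mut HashMap<String, f64>, org: &str, cost_usd: f64) -> u16 {
    let total = spend.entry(org.to_string()).or_insert(0.0);
    *total += cost_usd;
    if *total > FREE_TIER_LIMIT_USD { 429 } else { 200 }
}

fn main() {
    let mut spend = HashMap::new();
    assert_eq!(record_and_check(&mut spend, "org-1", 49.0), 200);
    assert_eq!(record_and_check(&mut spend, "org-1", 2.0), 429); // over $50 -> upgrade CTA
}
```

The 429 response would carry the upgrade CTA in its body; the counter itself is the only state touched on the hot path.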

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/route adheres to the 5 Transparent Factory architectural tenets: Atomic Flagging, Elastic Schema, Cognitive Durability, Semantic Observability, and Configurable Autonomy. These stories are woven across the existing system — they don't add features, they add engineering discipline.

### Story 10.1: Atomic Flagging — Feature Flag Infrastructure

**As a** solo founder, **I want** every new routing rule, cost threshold, and provider failover behavior wrapped in a feature flag (default: off), **so that** I can deploy code continuously without risking production traffic.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the Rust proxy via a compatible provider (e.g., `flagd` sidecar or an env-based provider for V1).
- All flags evaluate locally (in-memory or sidecar) — zero network calls on the hot path.
- Every flag has an `owner` field and a `ttl` (max 14 days). CI blocks deployment if any flag exceeds its TTL at 100% rollout.
- Automated circuit breaker: if a flagged code path increases P99 latency by >5% or error rate by >2%, the flag auto-disables within 30 seconds.
- Flags exist for: model routing strategies, complexity classifier thresholds, provider failover chains, new dashboard features.

**Estimate:** 5 points

**Dependencies:** Epic 1 (Proxy Engine), Epic 2 (Router Brain)

**Technical Notes:**

- Use the OpenFeature Rust SDK. For V1, a simple JSON file or env-var provider is fine — no LaunchDarkly needed.
- Circuit breaker integration: extend the existing Redis-backed circuit breaker to also flip flags.
- Flag cleanup: add a `make flag-audit` target that lists expired flags.
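
The `flag-audit` check above reduces to filtering the flag registry by age. A minimal sketch — the `Flag` struct and day-granularity timestamps are our own simplification; a real registry would come from the OpenFeature provider's config:

```rust
// Sketch of the flag-audit check: every flag carries an owner and a TTL;
// flags past their TTL are reported so CI can block the deploy.
struct Flag {
    name: &'static str,
    owner: &'static str,
    created_day: u32, // day number, stand-in for a real timestamp
    ttl_days: u32,
}

fn expired<'a>(flags: &'a [Flag], today: u32) -> Vec<&'a str> {
    flags
        .iter()
        .filter(|f| today.saturating_sub(f.created_day) > f.ttl_days)
        .map(|f| f.name)
        .collect()
}

fn main() {
    let flags = [
        Flag { name: "cascading_routing", owner: "founder", created_day: 100, ttl_days: 14 },
        Flag { name: "new_treemap", owner: "founder", created_day: 110, ttl_days: 14 },
    ];
    // Day 120: the first flag is 20 days old (> 14), the second only 10.
    assert_eq!(expired(&flags, 120), vec!["cascading_routing"]);
}
```

Printing the `owner` alongside each expired flag name is what makes the CI failure actionable.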

### Story 10.2: Elastic Schema — Additive-Only Migration Discipline

**As a** solo founder, **I want** all TimescaleDB and Redis schema changes to be strictly additive, **so that** I can roll back any deployment instantly without data loss or broken readers.

**Acceptance Criteria:**

- CI lint step rejects any migration containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns.
- New fields use a `_v2` suffix or a new table when breaking changes are unavoidable.
- Rust structs avoid `#[serde(deny_unknown_fields)]` (serde's default already skips unknown fields), so V1 code ignores V2 fields.
- Dual-write pattern documented and enforced: during migration windows, the API writes to both old and new schema targets within the same DB transaction.
- Every migration file includes a `sunset_date` comment (max 30 days). A CI check warns if any migration is past sunset without cleanup.

**Estimate:** 3 points

**Dependencies:** Epic 3 (Analytics Pipeline)

**Technical Notes:**

- Use `sqlx` migration files. Add a pre-commit hook or CI step that greps for forbidden DDL keywords.
- Redis key schema: version keys with a prefix (e.g., `route:v1:config`, `route:v2:config`). Never rename keys.
- For the `request_events` hypertable, new columns are always nullable or added with a default.
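
The forbidden-DDL grep from the acceptance criteria can be sketched as a substring scan. This is a deliberately naive first pass (`forbidden_ddl` is a hypothetical helper); a real check would parse the SQL to avoid false positives in comments and identifiers:

```rust
// CI lint sketch: reject migrations containing destructive DDL.
// Returns the first matching keyword so CI can print a useful error.
fn forbidden_ddl(sql: &str) -> Option<&'static str> {
    let upper = sql.to_uppercase();
    for kw in ["DROP ", "RENAME ", "ALTER COLUMN", " TYPE "] {
        if upper.contains(kw) {
            return Some(kw);
        }
    }
    None
}

fn main() {
    // Additive change passes:
    assert_eq!(forbidden_ddl("ALTER TABLE request_events ADD COLUMN team_v2 TEXT;"), None);
    // Destructive change is rejected:
    assert!(forbidden_ddl("ALTER TABLE request_events DROP COLUMN team;").is_some());
}
```

Running this over every new file under the `sqlx` migrations directory in CI is enough to enforce the additive-only rule mechanically.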

### Story 10.3: Cognitive Durability — Decision Logs for Routing Logic

**As a** future maintainer (or future me), **I want** every change to routing algorithms, cost models, or provider selection logic accompanied by a `decision_log.json`, **so that** I can understand *why* a decision was made months later in under 60 seconds.

**Acceptance Criteria:**

- `decision_log.json` schema defined: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a `decision_log.json` entry for any PR touching `src/router/`, `src/cost/`, or migration files.
- Cyclomatic complexity cap of 10 enforced via `cargo clippy` or a custom lint. PRs exceeding this are blocked.
- Decision logs are committed alongside code in a `docs/decisions/` directory, one file per significant change.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- Use a PR template that prompts for the decision log fields.
- For the complexity cap, use `cargo clippy -W clippy::cognitive_complexity` with a threshold of 10.
- Decision logs for cost table updates should include: the source of pricing data, a comparison with previous rates, and the expected savings impact.

### Story 10.4: Semantic Observability — AI Reasoning Spans on Routing Decisions

**As a** platform engineer debugging a misrouted request, **I want** every proxy routing decision to emit an OpenTelemetry span with structured AI reasoning metadata, **so that** I can trace exactly which model was chosen, why, and what alternatives were rejected.

**Acceptance Criteria:**

- Every `/v1/chat/completions` request generates an `ai_routing_decision` span as a child of the request trace.
- Span attributes include: `ai.model_selected`, `ai.model_alternatives` (JSON array of rejected models + reasons), `ai.cost_delta` (savings vs. default), `ai.complexity_score`, and `ai.routing_strategy` (passthrough/cheapest/quality-first/cascading).
- `ai.prompt_hash` (SHA-256 of the first 500 chars of the system prompt) is included for correlation — never raw prompt content.
- Spans export to any OTLP-compatible backend (Grafana Cloud, Jaeger, etc.).
- No PII in any span attribute. Prompt content is hashed, not logged.

**Estimate:** 3 points

**Dependencies:** Epic 1 (Proxy Engine), Epic 2 (Router Brain)

**Technical Notes:**

- Use `tracing` + the `opentelemetry` Rust crates with an OTLP exporter.
- The span should be created *inside* the router decision function, not as middleware — it needs access to the alternatives list.
- For V1, export to stdout in OTLP JSON format. Production: OTLP gRPC to a collector.

### Story 10.5: Configurable Autonomy — Governance Policy for Automated Routing

**As a** solo founder, **I want** a `policy.json` governance file that controls what the system is allowed to do autonomously (e.g., switch models, update cost tables, add providers), **so that** I maintain human oversight as the system grows.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (all changes require manual approval) or `audit` (changes auto-apply but are logged).
- The proxy checks `governance_mode` before applying any runtime config change (routing rule update, cost table refresh, provider addition).
- `panic_mode` flag: when set to `true`, the proxy freezes all routing rules to their last-known-good state, disables auto-failover, and routes everything to a single hardcoded provider.
- Governance drift monitoring: a weekly cron job logs the ratio of auto-applied vs. manually approved changes. If auto-applied changes exceed 80% in `strict` mode, an alert fires.
- All policy check decisions are logged: "Allowed by audit mode", "Blocked by strict mode", "Panic mode active — frozen".

**Estimate:** 3 points

**Dependencies:** Epic 2 (Router Brain)

**Technical Notes:**

- `policy.json` lives in the repo root and is loaded at startup, then watched for changes via the `notify` crate.
- For V1 as a solo founder, start in `audit` mode. `strict` mode is for when you hire or add AI agents to the pipeline.
- Panic mode should be triggerable via a single API call (`POST /admin/panic`) or by setting an env var — whichever is faster in an emergency.
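
The governance gate described above is a small decision function: panic mode trumps everything, then the mode decides. A minimal sketch using the decision strings from the acceptance criteria (the enum and `policy_check` names are ours):

```rust
// Governance gate sketch: mode + panic flag decide whether a runtime
// config change applies, with the log strings from the acceptance criteria.
enum GovernanceMode { Strict, Audit }

fn policy_check(mode: &GovernanceMode, panic_mode: bool, approved: bool) -> (bool, &'static str) {
    if panic_mode {
        return (false, "Panic mode active — frozen");
    }
    match mode {
        GovernanceMode::Audit => (true, "Allowed by audit mode"),
        GovernanceMode::Strict if approved => (true, "Approved in strict mode"),
        GovernanceMode::Strict => (false, "Blocked by strict mode"),
    }
}

fn main() {
    assert_eq!(policy_check(&GovernanceMode::Audit, false, false).1, "Allowed by audit mode");
    assert_eq!(policy_check(&GovernanceMode::Strict, false, false).1, "Blocked by strict mode");
    assert!(!policy_check(&GovernanceMode::Audit, true, false).0);
}
```

Returning the log string with the verdict keeps the audit trail and the enforcement decision from ever drifting apart.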

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |