Phase 1 (Security Critical):
- Auth plugin encapsulation: replaced global addHook with Fastify plugin scope
- Removed startsWith URL matching; public routes registered outside auth scope
- JWT verify now enforces algorithms: ['HS256'] (prevents algorithm confusion)
- Raw pool no longer exported from db.ts; systemQuery() + getPoolForAuth() instead
- withTenant() remains primary tenant-scoped query path
Phase 2 (Infrastructure):
- docker-compose.yml: all secrets via env var substitution (${VAR:-default})
- Per-service Postgres users (dd0c_drift, dd0c_alert, etc.) in docker-init-db.sh
- .env.example with all configurable secrets
- build-push.sh uses $REGISTRY_PASSWORD instead of hardcoded
- .gitignore excludes .env files
- @fastify/rate-limit: 100 req/min global, 5/min login, 3/min signup
- CORS_ORIGIN default changed from '*' to 'http://localhost:5173'
Phase 3 (Product):
- Team invite flow: tenant_invites table, POST /invite, GET /invites, DELETE /invites/:id
- Signup accepts optional invite_token to join existing tenant
- Async webhook ingestion (P3): LPUSH to Redis, BRPOP worker, dead-letter queue
Console:
- All 5 product modules wired: drift, alert, portal, cost, run
- PageHeader accepts children prop
- 71 modules, 70KB gzipped production build
All 6 projects compile clean (tsc --noEmit).
ARCHITECTURE SPEC & IMPLEMENTATION PLAN
Project: dd0c DevOps SaaS Platform
Author: Dr. Quinn, BMad Master Problem Solver
Date: 2026-03-02
1. ROOT CAUSE ANALYSIS
We have systematically analyzed the 10 adversarial review findings and grouped them by their structural root causes.
Group A: Middleware & Request Pipeline Integrity (Issues 1, 7, 8, 9, 10)
- Root Cause: The authentication middleware is implemented as a global app.addHook('onRequest') using brittle string manipulation (.startsWith()) to bypass public routes. Fastify's strength lies in its plugin encapsulation model, which is completely bypassed here. Furthermore, foundational web security controls (CORS scoping, rate limiting, request validation via schema) are missing, and JWT verification does not pin the accepted algorithm, opening the door to algorithm confusion.
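The weakness of prefix matching can be shown without Fastify at all. The sketch below is illustrative only: the prefix list and route set are hypothetical stand-ins, not the actual values in the codebase. It compares a startsWith() check against an exact match on the normalized path.

```typescript
// Hypothetical public prefixes; the real list in the codebase may differ.
const PUBLIC_PREFIXES = ["/health", "/webhooks", "/api/v1/auth/login"];

// Brittle pattern: any URL that merely begins with a public prefix skips auth.
function isPublicByPrefix(url: string): boolean {
  return PUBLIC_PREFIXES.some((prefix) => url.startsWith(prefix));
}

// Hypothetical exact route table, closer to how a router decides.
const PUBLIC_ROUTES = new Set(["/health", "/api/v1/auth/login"]);

function isPublicByExactRoute(url: string): boolean {
  // The WHATWG URL parser resolves "../" segments and strips the query string.
  const path = new URL(url, "http://placeholder.invalid").pathname;
  return PUBLIC_ROUTES.has(path);
}

for (const url of ["/api/v1/auth/login-hack", "/webhooks/../api/protected"]) {
  console.log(url, {
    prefix: isPublicByPrefix(url),
    exact: isPublicByExactRoute(url),
  });
}
```

Both probe URLs pass the prefix check and fail the exact check. Registering protected routes inside their own plugin scope removes the need for either function, because the hook only exists where those routes are declared.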
Group B: Tenant Isolation & Data Security (Issues 2, 3)
- Root Cause: The system's multi-tenancy model is incomplete.
  - Data Leakage Risk: The Postgres pool object is exported directly from db.ts. Even though withTenant() exists to enforce Row-Level Security (RLS) via SET LOCAL, exporting pool means developers can inadvertently bypass RLS by calling pool.query() directly.
  - Product Gap: The data model strictly creates a 1:1 mapping between a new signup and a new tenant. There is no relational entity (e.g., tenant_invites) to securely map a new user to an existing tenant, breaking the "built for teams" promise.
Group C: Infrastructure Configuration Secrets (Issues 5, 6)
- Root Cause: Infrastructure-as-Code (Docker Compose) is being used as a secrets manager. All 5 services authenticate to Postgres using the root dd0c superuser, and secrets (JWT_SECRET, passwords) are hardcoded into the compose YAML. If one service is compromised (e.g., via SQLi), the attacker gains superuser access to every service's database.
Group D: Architectural Reliability Constraints (Issue 4)
- Root Cause: Webhooks operate synchronously on containers configured to scale-to-zero. External providers (PagerDuty, Datadog) have strict timeout thresholds (typically 5-10s). Fly.io container cold-starts often exceed these thresholds, causing providers to drop payloads before the container can awaken and process the request.
2. ARCHITECTURE DECISIONS
Considering the constraints of a solo founder, a current NAS deployment with a path to Fly.io, and the need to preserve existing tests, we will adopt the following architectural standards:
- Fastify Plugin Encapsulation (Fixes #1, #10): We will stop using global hooks. Public routes (health, webhooks) will be registered on the main app instance. Authenticated routes will be registered inside an encapsulated Fastify plugin where the auth hook is applied safely without string checking. We will use @fastify/type-provider-zod for built-in, strict request schema validation.
- Strict Data Access Module (Fixes #3): The db.ts file will no longer export the raw pool. It will export a db object containing exactly two methods: withTenant<T>(tenantId, callback) for business logic, and systemQuery<T>(query), explicitly marked and audited, for system-level background tasks.
- Config-Driven Secrets & Least Privilege DB Roles (Fixes #5, #6): docker-compose.yml will be scrubbed of secrets. Secrets will be loaded via a .env file. We will update 01-init-db.sh to create service-specific PostgreSQL users with access only to their respective databases.
- Invite-Based Onboarding (Fixes #2): We will introduce a tenant_invites table. The signup flow will be modified: if a user signs up with a valid invite token, they bypass tenant creation and are added to the existing tenant with the role defined in the invite.
- Decoupled Webhook Ingestion (Fixes #4): To support scale-to-zero without losing webhooks, we will leverage the already-running Upstash Redis instance. Webhooks will hit a lightweight, non-scaling ingestion function (or the Rust route-proxy, which is always on) that simply LPUSHes payloads to Redis. The main Node services can safely wake up, BRPOP from the queue, and process asynchronously.
3. IMPLEMENTATION PLAN
This plan is phased. Existing tests MUST be run (./integration-test.sh) after each phase.
Phase 1: Security Critical (Do this before production)
Task 1.1: Fastify Auth Encapsulation
- Change: Modify shared/auth.ts and 03-alert-intelligence/src/auth/middleware.ts. Remove app.addHook('onRequest'). Export a Fastify plugin: export const requireAuth = fp(async (fastify) => { fastify.addHook(...) }).
- Change: In index.ts, register public routes first. Then register the requireAuth plugin, then register protected routes.
- Effort: Medium
- Risk if deferred: High (Auth bypass easily discoverable by automated scanners).
Task 1.2: Hardened JWT Validation
- Change: In auth/middleware.ts, update jwt.verify(token, jwtSecret) to jwt.verify(token, jwtSecret, { algorithms: ['HS256'] }).
- Effort: Small
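To make concrete what the algorithms option guards against, here is a minimal sketch of algorithm pinning using only node:crypto. It is not how the JWT library is implemented and is not meant to replace jwt.verify; the helper names signHS256 and verifyHS256 are hypothetical, and expiry checks are deliberately omitted.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const b64url = (text: string): string => Buffer.from(text, "utf8").toString("base64url");

// Hypothetical helper: issues a token signed with HMAC-SHA256.
function signHS256(payload: object, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const sig = createHmac("sha256", secret).update(`${header}.${body}`).digest("base64url");
  return `${header}.${body}.${sig}`;
}

// Hypothetical helper: rejects any token whose header names an algorithm other than HS256.
function verifyHS256(token: string, secret: string): unknown {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("malformed token");
  const [header, body, sig] = parts as [string, string, string];

  // The header is attacker-controlled, so the verifier decides the algorithm, not the token.
  const alg: unknown = JSON.parse(Buffer.from(header, "base64url").toString("utf8")).alg;
  if (alg !== "HS256") throw new Error(`algorithm not allowed: ${String(alg)}`);

  const expected = createHmac("sha256", secret).update(`${header}.${body}`).digest();
  const actual = Buffer.from(sig, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) {
    throw new Error("bad signature");
  }
  return JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
}
```

A token whose header claims alg "none" is rejected before any signature comparison happens, which is the behavior the pinned algorithms list is meant to guarantee.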
Task 1.3: Restrict Database Pool Access
- Change: In saas/src/data/db.ts (and shared equivalents), remove export const pool. Wrap queries in a strictly exported systemQuery function, and enforce withTenant for the rest. Update all services relying on pool.query() to use the new paradigm.
- Effort: Medium
- Risk if deferred: Critical (Any developer mistake results in massive cross-tenant data leakage).
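One possible shape for the restricted module is sketched below. It is written against minimal structural types so it runs without a database driver. The setting name app.tenant_id, the createDb factory, and the use of set_config(..., true) as a parameterizable, transaction-local equivalent of SET LOCAL are assumptions, not details taken from the existing db.ts.

```typescript
// Minimal structural types standing in for the driver's pool and client.
interface QueryResult<R> { rows: R[] }
interface Client {
  query<R = unknown>(text: string, params?: unknown[]): Promise<QueryResult<R>>;
  release(): void;
}
interface Pool { connect(): Promise<Client> }

// Hypothetical factory: the pool is captured in a closure and never exported.
function createDb(pool: Pool) {
  return {
    // Tenant-scoped path: every statement in fn runs inside one transaction
    // with the tenant id set for the duration of that transaction only.
    async withTenant<T>(tenantId: string, fn: (client: Client) => Promise<T>): Promise<T> {
      const client = await pool.connect();
      try {
        await client.query("BEGIN");
        await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
        const result = await fn(client);
        await client.query("COMMIT");
        return result;
      } catch (err) {
        await client.query("ROLLBACK");
        throw err;
      } finally {
        client.release();
      }
    },

    // System path: no tenant context, intended only for audited background tasks.
    async systemQuery<R>(text: string, params: unknown[] = []): Promise<R[]> {
      const client = await pool.connect();
      try {
        return (await client.query<R>(text, params)).rows;
      } finally {
        client.release();
      }
    },
  };
}
```

Because callers only ever receive the object returned by createDb, a stray pool.query() call has nothing to import and fails at compile time.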
Phase 2: Architecture (Structural Readiness)
Task 2.1: Secrets Management & DB Roles
- Change: Update docker-compose.yml to use variable substitution (e.g., ${POSTGRES_PASSWORD}). Provide an .env.example.
- Change: Update docker-init-db.sh to execute CREATE USER dd0c_alert WITH PASSWORD '...'; GRANT ALL ON DATABASE dd0c_alert TO dd0c_alert;. Update services to use their designated credentials.
- Effort: Medium
- Dependencies: None.
Task 2.2: Rate Limiting, CORS, and Zod Integration
- Change: Install @fastify/rate-limit. Apply a strict limit (e.g., 5 requests/min) to /api/v1/auth/*.
- Change: In config/index.ts, enforce CORS_ORIGIN using Zod regex (e.g., ^https?://.*\.dd0c\.localhost$).
- Change: Integrate @fastify/type-provider-zod into route definitions to reject bad payloads at the Fastify schema level.
- Effort: Medium
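The example CORS pattern is worth exercising before it is adopted. The sketch below uses a plain RegExp so it runs without Zod; the same pattern could be passed to a Zod string schema's regex check. STRICT_ORIGIN and assertCorsOrigin are hypothetical, shown only to illustrate one way of tightening the wildcard.

```typescript
// Pattern quoted in the plan: ".*" accepts any characters before ".dd0c.localhost".
const LOOSE_ORIGIN = /^https?:\/\/.*\.dd0c\.localhost$/;

// Hypothetical tighter variant: one or more DNS labels, then an optional port.
const STRICT_ORIGIN = /^https?:\/\/([a-z0-9-]+\.)+dd0c\.localhost(:\d{1,5})?$/;

// Hypothetical config guard: fail fast at startup instead of serving with a bad origin.
function assertCorsOrigin(value: string, pattern: RegExp = STRICT_ORIGIN): string {
  if (!pattern.test(value)) {
    throw new Error(`CORS_ORIGIN rejected: ${value}`);
  }
  return value;
}

const candidates = [
  "http://app.dd0c.localhost",
  "http://app.dd0c.localhost:5173",
  "http://evil.example/x.dd0c.localhost",
  "*",
];

for (const origin of candidates) {
  console.log(origin, {
    loose: LOOSE_ORIGIN.test(origin),
    strict: STRICT_ORIGIN.test(origin),
  });
}
```

Two things stand out when this runs: the loose pattern accepts a string with a foreign host followed by a path, and it rejects any origin that carries a port. Whether either matters depends on how CORS_ORIGIN is populated in each environment.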
Phase 3: Product (Feature Blockers)
Task 3.1: Team Invite Flow
- Change: Create a new migration for tenant_invites (id, tenant_id, email, token, role, expires_at).
- Change: Add POST /api/v1/auth/invite (generates token) and update POST /api/v1/auth/signup to accept an optional invite_token.
- Effort: Large
- Dependencies: Database migrations.
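The branch that signup has to take can be sketched in memory, independent of the HTTP and database layers. The field names follow the tenant_invites columns listed above. The 7-day expiry, the email-match rule, and the function names createInvite and resolveSignup are assumptions made for illustration.

```typescript
import { randomBytes } from "node:crypto";

interface TenantInvite {
  id: string;
  tenant_id: string;
  email: string;
  token: string;
  role: string;
  expires_at: Date;
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000; // assumed lifetime, not from the spec

// Hypothetical helper behind POST /api/v1/auth/invite.
function createInvite(tenantId: string, email: string, role: string, now: Date = new Date()): TenantInvite {
  return {
    id: randomBytes(8).toString("hex"),
    tenant_id: tenantId,
    email: email.toLowerCase(),
    token: randomBytes(32).toString("base64url"), // 256 bits of randomness
    role,
    expires_at: new Date(now.getTime() + SEVEN_DAYS_MS),
  };
}

type SignupOutcome =
  | { kind: "create-tenant" }
  | { kind: "join-tenant"; tenantId: string; role: string };

// Hypothetical decision step inside POST /api/v1/auth/signup.
function resolveSignup(
  email: string,
  inviteToken: string | undefined,
  invites: TenantInvite[],
  now: Date = new Date(),
): SignupOutcome {
  if (inviteToken === undefined) return { kind: "create-tenant" };

  const invite = invites.find((i) => i.token === inviteToken);
  if (!invite) throw new Error("invalid invite token");
  if (invite.expires_at.getTime() <= now.getTime()) throw new Error("invite expired");
  // Assumption: an invite is only redeemable by the address it was sent to.
  if (invite.email !== email.toLowerCase()) throw new Error("invite issued to a different email");

  return { kind: "join-tenant", tenantId: invite.tenant_id, role: invite.role };
}
```

An invalid or expired token raises an error rather than silently falling back to tenant creation, so a mistyped invite link cannot leave the user in a fresh, empty tenant.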
Task 3.2: Async Webhook Ingestion
- Change: Shift webhook endpoints to simply validate signatures and LPUSH the raw payload to Redis.
- Change: Create a background worker loop in the Node service that uses BRPOP to pull and process webhooks. (Alternatively, route webhooks through the constantly-running Rust proxy.)
- Effort: Large
- Risk if deferred: Medium (External webhook timeouts on Fly.io scale-to-zero).
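The ingest-then-drain split can be modeled with an in-memory list that mimics the head-push, tail-pop ordering of LPUSH and BRPOP. This is only a sketch of the control flow: ListQueue stands in for Redis, the hex-encoded HMAC-SHA256 signature scheme is an assumption (providers differ), and the function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// In-memory stand-in for a Redis list: push at the head, pop from the tail (FIFO overall).
class ListQueue {
  private items: string[] = [];
  lpush(value: string): number {
    this.items.unshift(value);
    return this.items.length;
  }
  rpop(): string | undefined {
    return this.items.pop();
  }
  get length(): number {
    return this.items.length;
  }
}

// Assumed scheme: hex-encoded HMAC-SHA256 over the raw request body.
function signatureIsValid(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  return given.length === expected.length && timingSafeEqual(given, expected);
}

// Ingestion path: verify, enqueue, acknowledge. No business logic runs here.
function ingest(queue: ListQueue, rawBody: string, signatureHex: string, secret: string): 202 | 401 {
  if (!signatureIsValid(rawBody, signatureHex, secret)) return 401;
  queue.lpush(rawBody);
  return 202;
}

// Worker path: process in arrival order; failures go to a dead-letter list.
function drain(queue: ListQueue, deadLetter: ListQueue, handle: (payload: string) => void): number {
  let processed = 0;
  for (let raw = queue.rpop(); raw !== undefined; raw = queue.rpop()) {
    try {
      handle(raw);
      processed += 1;
    } catch {
      deadLetter.lpush(raw);
    }
  }
  return processed;
}
```

In the real worker, the blocking BRPOP replaces the polling loop in drain, so an idle service waits on Redis instead of spinning. The provider only ever waits for the enqueue, which keeps the response time independent of container cold starts.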
4. TESTING STRATEGY
To verify fixes without breaking the 27 integration tests:
- Auth Bypass: Write a new test in the smoke suite that attempts to hit /api/v1/auth/login-hack and /webhooks/../api/protected. Expect 404 Not Found or 401 Unauthorized.
- RLS Protection: After restricting pool, run ./integration-test.sh. Any query that was improperly bypassing withTenant will cause TypeScript compilation to fail (since pool.query is no longer available), ensuring safe refactoring.
- DB Roles: Spin up a clean docker-compose environment. Use a Postgres client to verify that the dd0c_alert user cannot run SELECT * FROM dd0c_route.users.
- Webhooks: Simulate a Fly.io cold start by pausing the dd0c-alert container, firing a webhook to the ingestion endpoint, and verifying the payload is queued in Redis and processed upon container resume.
- Invite Flow: Add a multi-user flow to integration-test.sh asserting User B can be invited by User A and both can query the same tenant_id records.