dd0c/products/SECURITY-ARCHITECTURE-PLAN.md
Max Mayfield eb953cdea5
Security hardening: auth encapsulation, pool restriction, rate limiting, invites, async webhooks
Phase 1 (Security Critical):
- Auth plugin encapsulation: replaced global addHook with Fastify plugin scope
- Removed startsWith URL matching; public routes registered outside auth scope
- JWT verify now enforces algorithms: ['HS256'] (prevents algorithm confusion)
- Raw pool no longer exported from db.ts; systemQuery() + getPoolForAuth() instead
- withTenant() remains primary tenant-scoped query path

Phase 2 (Infrastructure):
- docker-compose.yml: all secrets via env var substitution (${VAR:-default})
- Per-service Postgres users (dd0c_drift, dd0c_alert, etc.) in docker-init-db.sh
- .env.example with all configurable secrets
- build-push.sh uses $REGISTRY_PASSWORD instead of hardcoded
- .gitignore excludes .env files
- @fastify/rate-limit: 100 req/min global, 5/min login, 3/min signup
- CORS_ORIGIN default changed from '*' to 'http://localhost:5173'

Phase 3 (Product):
- Team invite flow: tenant_invites table, POST /invite, GET /invites, DELETE /invites/:id
- Signup accepts optional invite_token to join existing tenant
- Async webhook ingestion (P3): LPUSH to Redis, BRPOP worker, dead-letter queue

Console:
- All 5 product modules wired: drift, alert, portal, cost, run
- PageHeader accepts children prop
- 71 modules, 70KB gzipped production build

All 6 projects compile clean (tsc --noEmit).
2026-03-02 23:53:55 +00:00

# ARCHITECTURE SPEC & IMPLEMENTATION PLAN
**Project:** dd0c DevOps SaaS Platform
**Author:** Dr. Quinn, BMad Master Problem Solver
**Date:** 2026-03-02
## 1. ROOT CAUSE ANALYSIS
We have systematically analyzed the 10 adversarial review findings and grouped them by their structural root causes.
### Group A: Middleware & Request Pipeline Integrity (Issues 1, 7, 8, 9, 10)
* **Root Cause:** The authentication middleware is implemented as a global `app.addHook('onRequest')` that uses brittle string matching (`.startsWith()`) to exempt public routes. Fastify's strength lies in its plugin encapsulation model, which this approach bypasses entirely. Furthermore, foundational web security controls (CORS scoping, rate limiting, schema-based request validation) are missing, and JWT verification does not pin the accepted algorithm, opening the door to algorithm confusion.
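
To make the prefix-matching weakness concrete, here is a minimal sketch. The list of public paths is hypothetical (the real list lives in the auth middleware); it only illustrates why `.startsWith()` exempts more than intended.

```typescript
// Hypothetical public-route list, for illustration only.
const PUBLIC_PREFIXES = ["/health", "/webhooks", "/api/v1/auth/login"];

// Prefix matching: any path that merely begins with a public prefix is exempted.
function isPublicByPrefix(path: string): boolean {
  return PUBLIC_PREFIXES.some((prefix) => path.startsWith(prefix));
}

// Exact matching, for comparison: only the listed paths are exempted.
const PUBLIC_PATHS = new Set(PUBLIC_PREFIXES);
function isPublicExact(path: string): boolean {
  return PUBLIC_PATHS.has(path);
}

console.log(isPublicByPrefix("/api/v1/auth/login-hack")); // true
console.log(isPublicExact("/api/v1/auth/login-hack")); // false
```

Exact matching is shown only as a contrast. The plan avoids path matching altogether by relying on plugin scope.
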
### Group B: Tenant Isolation & Data Security (Issues 2, 3)
* **Root Cause:** The system's multi-tenancy model is incomplete.
* **Data Leakage Risk:** The Postgres `pool` object is exported directly from `db.ts`. Even though `withTenant()` exists to enforce Row-Level Security (RLS) via `SET LOCAL`, exporting `pool` means developers can inadvertently bypass RLS by calling `pool.query()` directly.
* **Product Gap:** The data model strictly creates a 1:1 mapping between a new signup and a new tenant. There is no relational entity (e.g., `tenant_invites`) to securely map a new user to an existing tenant, breaking the "built for teams" promise.
### Group C: Infrastructure Configuration Secrets (Issues 5, 6)
* **Root Cause:** Infrastructure-as-Code (Docker Compose) is being used as a secrets manager. All 5 services authenticate to Postgres using the root `dd0c` superuser, and secrets (`JWT_SECRET`, passwords) are hardcoded into the compose YAML. If one service is compromised (e.g., via SQLi), the attacker gains root access to all databases.
### Group D: Architectural Reliability Constraints (Issue 4)
* **Root Cause:** Webhooks operate synchronously on containers configured to scale-to-zero. External providers (PagerDuty, Datadog) have strict timeout thresholds (typically 5-10s). Fly.io container cold-starts often exceed these thresholds, causing providers to drop payloads before the container can awaken and process the request.
---
## 2. ARCHITECTURE DECISIONS
Considering the constraints of a solo founder, a current NAS deployment with a path to Fly.io, and the need to preserve existing tests, we will adopt the following architectural standards:
1. **Fastify Plugin Encapsulation (Fixes #1, #10):** We will stop using global hooks. Public routes (health, webhooks) will be registered on the main `app` instance. Authenticated routes will be registered inside an encapsulated Fastify plugin, where the auth hook applies by scope rather than by string checks on the URL. We will use `fastify-type-provider-zod` for strict request schema validation at the route level.
2. **Strict Data Access Module (Fixes #3):** The `db.ts` file will no longer export the raw `pool`. It will export a `db` object containing exactly two methods: `withTenant<T>(tenantId, callback)` for business logic, and `systemQuery<T>(query)` explicitly marked and audited for system-level background tasks.
3. **Config-Driven Secrets & Least Privilege DB Roles (Fixes #5, #6):** `docker-compose.yml` will be scrubbed of secrets. Secrets will be loaded via a `.env` file. We will update `docker-init-db.sh` to create service-specific PostgreSQL users with access *only* to their respective databases.
4. **Invite-Based Onboarding (Fixes #2):** We will introduce a `tenant_invites` table. The signup flow will be modified: if a user signs up with a valid invite token, they bypass tenant creation and are appended to the existing tenant with the defined role.
5. **Decoupled Webhook Ingestion (Fixes #4):** To support scale-to-zero without losing webhooks, we will leverage the already-running Upstash Redis instance. Webhooks will hit a lightweight, non-scaling ingestion function (or the Rust `route-proxy` which is always on) that simply `LPUSH`es payloads to Redis. The main Node services can safely wake up, `BRPOP` from the queue, and process asynchronously.
---
## 3. IMPLEMENTATION PLAN
This plan is phased. Existing tests MUST be run (`./integration-test.sh`) after each phase.
### Phase 1: Security Critical (Do this before production)
**Task 1.1: Fastify Auth Encapsulation**
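* **Change:** Modify `shared/auth.ts` and `03-alert-intelligence/src/auth/middleware.ts`. Remove the global `app.addHook('onRequest')`. Export a Fastify plugin instead: `export const requireAuth = fp(async (fastify) => { fastify.addHook(...) })`. Because `fastify-plugin` (`fp`) exposes the hook to whichever scope registers it, `requireAuth` must only be registered inside the encapsulated scope that holds the protected routes, never on the root `app`.
* **Change:** In `index.ts`, register public routes on the root `app` first. Then open an encapsulated scope with `app.register(...)` and, inside it, register `requireAuth` followed by the protected routes.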
* **Effort:** Medium
* **Risk if deferred:** High (Auth bypass easily discoverable by automated scanners).
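
The scoping described in Task 1.1 can be sketched as follows. This is a minimal illustration, not the project's code: the route paths and the header check are placeholders, and a plain (unwrapped) plugin is used so the hook stays inside its scope.

```typescript
import Fastify, { type FastifyInstance } from "fastify";

// Sketch only: paths and the header check are placeholders.
export function buildApp(): FastifyInstance {
  const app = Fastify();

  // Public route: registered on the root instance, outside the auth scope.
  app.get("/health", async () => ({ status: "ok" }));

  // Encapsulated scope: the hook added here is visible only to routes
  // registered inside this plugin (and its children).
  app.register(async (protectedScope) => {
    protectedScope.addHook("onRequest", async (request, reply) => {
      // Placeholder check; real code would verify the JWT here.
      if (!request.headers.authorization) {
        return reply.code(401).send({ error: "unauthorized" });
      }
    });

    protectedScope.get("/api/v1/me", async () => ({ user: "placeholder" }));
  });

  return app;
}
```

With this layout, `app.inject({ method: "GET", url: "/health" })` should succeed without credentials, while `/api/v1/me` returns 401 unless an `Authorization` header is present.
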
**Task 1.2: Hardened JWT Validation**
* **Change:** In `auth/middleware.ts`, update `jwt.verify(token, jwtSecret)` to `jwt.verify(token, jwtSecret, { algorithms: ['HS256'] })`.
* **Effort:** Small
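
To show what pinning the algorithm protects against, here is a dependency-free sketch that uses only `node:crypto`. It is an illustration of the check that `algorithms: ['HS256']` enforces, not a replacement for `jwt.verify`, and it deliberately skips claims such as `exp`. The function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustration only: production code should keep using jwt.verify with
// { algorithms: ['HS256'] }. This sketch does not validate exp/nbf/aud.
export function verifyHs256(token: string, secret: string): Record<string, unknown> {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("malformed token");
  const [headerB64, payloadB64, signatureB64] = parts;

  // The header is attacker-controlled, so the accepted algorithm is pinned
  // here instead of being read from the token and trusted.
  const header = JSON.parse(Buffer.from(headerB64, "base64url").toString("utf8"));
  if (header.alg !== "HS256") throw new Error(`rejected algorithm: ${header.alg}`);

  const expected = createHmac("sha256", secret).update(`${headerB64}.${payloadB64}`).digest();
  const actual = Buffer.from(signatureB64, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) {
    throw new Error("bad signature");
  }
  return JSON.parse(Buffer.from(payloadB64, "base64url").toString("utf8"));
}

// Helper used only to build sample tokens for the demonstration.
export function signHs256(payload: object, secret: string): string {
  const enc = (value: object) => Buffer.from(JSON.stringify(value)).toString("base64url");
  const body = `${enc({ alg: "HS256", typ: "JWT" })}.${enc(payload)}`;
  return `${body}.${createHmac("sha256", secret).update(body).digest("base64url")}`;
}
```

A token whose header declares `"alg": "none"` and carries an empty signature is rejected at the algorithm check, before any signature comparison takes place.
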
**Task 1.3: Restrict Database Pool Access**
* **Change:** In `saas/src/data/db.ts` (and shared equivalents), remove `export const pool`. Expose system-level queries only through an explicitly exported `systemQuery` function, and require `withTenant` for everything else. Update every call site that uses `pool.query()` accordingly.
* **Effort:** Medium
* **Risk if deferred:** Critical (Any developer mistake results in massive cross-tenant data leakage).
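
One possible shape for the restricted module is sketched below. The `ClientLike` and `PoolLike` interfaces are minimal stand-ins assumed to be structurally compatible with the `pg` pool. The setting name `app.tenant_id` is also an assumption and would have to match whatever the RLS policies read.

```typescript
// Minimal structural types, assumed compatible with the `pg` Pool / PoolClient.
interface ClientLike {
  query(text: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
  release(): void;
}
interface PoolLike {
  connect(): Promise<ClientLike>;
}

// createDb closes over the pool, so callers never receive it directly.
export function createDb(pool: PoolLike) {
  return {
    async withTenant<T>(tenantId: string, callback: (client: ClientLike) => Promise<T>): Promise<T> {
      const client = await pool.connect();
      try {
        await client.query("BEGIN");
        // set_config(..., true) is the parameterizable equivalent of SET LOCAL.
        await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
        const result = await callback(client);
        await client.query("COMMIT");
        return result;
      } catch (err) {
        await client.query("ROLLBACK");
        throw err;
      } finally {
        client.release();
      }
    },

    // Runs outside tenant scope; intended for audited background tasks only.
    async systemQuery<T>(text: string, params?: unknown[]): Promise<T[]> {
      const client = await pool.connect();
      try {
        const { rows } = await client.query(text, params);
        return rows as T[];
      } finally {
        client.release();
      }
    },
  };
}
```

Because the tenant setting is transaction-local, it is discarded at `COMMIT` or `ROLLBACK` and cannot leak to the next request that reuses the same pooled connection.
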
### Phase 2: Architecture (Structural Readiness)
**Task 2.1: Secrets Management & DB Roles**
* **Change:** Update `docker-compose.yml` to use variable substitution (e.g., `${POSTGRES_PASSWORD}`). Provide an `.env.example`.
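* **Change:** Update `docker-init-db.sh` to execute `CREATE USER dd0c_alert WITH PASSWORD '...'; GRANT ALL ON DATABASE dd0c_alert TO dd0c_alert;`. Also run `REVOKE CONNECT ON DATABASE dd0c_alert FROM PUBLIC;` for each database, because PostgreSQL grants `CONNECT` to every role by default and the other service users could otherwise still connect. Update services to use their designated credentials.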
* **Effort:** Medium
* **Dependencies:** None.
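
On the service side, moving secrets into the environment works best when a missing value fails loudly at startup. The following is a minimal sketch under that assumption. `loadSecrets` is a hypothetical helper, and only the two variable names mentioned in this plan are checked.

```typescript
// Only the two secrets named in this plan are listed; extend as needed.
const REQUIRED_SECRETS = ["JWT_SECRET", "POSTGRES_PASSWORD"] as const;
type SecretName = (typeof REQUIRED_SECRETS)[number];

// Fails fast when a secret is missing or empty instead of falling back
// to a hardcoded default.
function loadSecrets(env: Record<string, string | undefined>): Record<SecretName, string> {
  const missing = REQUIRED_SECRETS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required secrets: ${missing.join(", ")}`);
  }
  return {
    JWT_SECRET: env.JWT_SECRET as string,
    POSTGRES_PASSWORD: env.POSTGRES_PASSWORD as string,
  };
}
```

A service would call `loadSecrets(process.env)` once during boot, so a misconfigured deployment stops before it accepts traffic.
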
**Task 2.2: Rate Limiting, CORS, and Zod Integration**
* **Change:** Install `@fastify/rate-limit`. Apply a strict limit (e.g., 5 requests/min) to `/api/v1/auth/*`.
* **Change:** In `config/index.ts`, enforce `CORS_ORIGIN` using Zod regex (e.g., `^https?://.*\.dd0c\.localhost$`).
* **Change:** Integrate `fastify-type-provider-zod` into route definitions to reject bad payloads at the Fastify schema level.
* **Effort:** Medium
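
The example `CORS_ORIGIN` pattern uses `.*`, which also accepts values such as `http://evil.example/.dd0c.localhost`. A tighter variant is sketched below. The exact policy (a single DNS label plus an optional port) is an assumption, and the same `RegExp` could be passed to Zod's `z.string().regex(...)` in `config/index.ts`.

```typescript
// The pattern from the plan, kept for comparison.
const LOOSE_ORIGIN = /^https?:\/\/.*\.dd0c\.localhost$/;

// Tightened variant (assumed policy): one DNS label before `.dd0c.localhost`,
// with an optional port and nothing else.
const ALLOWED_ORIGIN = /^https?:\/\/[a-z0-9-]+\.dd0c\.localhost(:\d{1,5})?$/;

function isAllowedOrigin(origin: string): boolean {
  return ALLOWED_ORIGIN.test(origin);
}

console.log(LOOSE_ORIGIN.test("http://evil.example/.dd0c.localhost")); // true
console.log(isAllowedOrigin("http://evil.example/.dd0c.localhost")); // false
console.log(isAllowedOrigin("https://app.dd0c.localhost")); // true
```
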
### Phase 3: Product (Feature Blockers)
**Task 3.1: Team Invite Flow**
* **Change:** Create a new migration for `tenant_invites (id, tenant_id, email, token, role, expires_at)`.
* **Change:** Add `POST /api/v1/auth/invite` (generates token) and update `POST /api/v1/auth/signup` to accept an optional `invite_token`.
* **Effort:** Large
* **Dependencies:** Database migrations.
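
The signup branching can be sketched as a pure function, which keeps it easy to test separately from the database. The field names mirror the proposed `tenant_invites` columns. The email-match rule and the helper names are assumptions, not decisions made in this plan.

```typescript
import { randomBytes } from "node:crypto";

// Mirrors the proposed tenant_invites columns; field names are assumptions.
interface TenantInvite {
  id: string;
  tenantId: string;
  email: string;
  token: string;
  role: string;
  expiresAt: Date;
}

type SignupPlan =
  | { kind: "create_tenant" }
  | { kind: "join_tenant"; tenantId: string; role: string };

// 32 random bytes, hex-encoded: unguessable and URL-safe.
function generateInviteToken(): string {
  return randomBytes(32).toString("hex");
}

// Decides what signup should do. `invite` is the row looked up by
// invite_token, or null when no such row exists.
function planSignup(
  email: string,
  inviteToken: string | undefined,
  invite: TenantInvite | null,
  now: Date,
): SignupPlan {
  if (inviteToken === undefined) return { kind: "create_tenant" };
  if (invite === null || invite.token !== inviteToken) throw new Error("invalid invite token");
  if (invite.expiresAt.getTime() <= now.getTime()) throw new Error("invite expired");
  // Assumed rule: an invite can only be redeemed by the address it was sent to.
  if (invite.email.toLowerCase() !== email.toLowerCase()) {
    throw new Error("invite issued for a different email");
  }
  return { kind: "join_tenant", tenantId: invite.tenantId, role: invite.role };
}
```

Signups without a token keep today's behaviour (a new tenant per signup), so existing tests for that path should not need to change.
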
**Task 3.2: Async Webhook Ingestion**
* **Change:** Shift webhook endpoints to simply validate signatures and `LPUSH` the raw payload to Redis.
* **Change:** Create a background worker loop in the Node service that uses `BRPOP` to pull and process webhooks. (Alternatively, route webhooks through the constantly-running Rust proxy).
* **Effort:** Large
* **Risk if deferred:** Medium (External webhook timeouts on Fly.io scale-to-zero).
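
A minimal sketch of the enqueue and worker sides follows. `QueueClient` is a hypothetical interface whose method shapes are assumed to match an ioredis-style client, and the key names are placeholders. Signature validation is expected to happen before `enqueueWebhook` is called.

```typescript
// Minimal queue interface; method shapes follow ioredis-style clients (assumption).
interface QueueClient {
  lpush(key: string, value: string): Promise<number>;
  brpop(key: string, timeoutSeconds: number): Promise<[string, string] | null>;
}

// Hypothetical key names.
const QUEUE_KEY = "webhooks:incoming";
const DEAD_LETTER_KEY = "webhooks:dead";

// Ingestion side: called after signature validation; it only enqueues.
async function enqueueWebhook(client: QueueClient, rawPayload: string): Promise<void> {
  await client.lpush(QUEUE_KEY, rawPayload);
}

// Worker side: handles one payload; returns false when the wait timed out.
async function processNext(
  client: QueueClient,
  handler: (payload: string) => Promise<void>,
  timeoutSeconds = 5,
): Promise<boolean> {
  const item = await client.brpop(QUEUE_KEY, timeoutSeconds);
  if (item === null) return false;
  const [, payload] = item;
  try {
    await handler(payload);
  } catch {
    // Failed payloads are parked for inspection rather than dropped.
    await client.lpush(DEAD_LETTER_KEY, payload);
  }
  return true;
}
```

`LPUSH` paired with `BRPOP` yields first-in, first-out ordering. A worker would call `processNext` in a loop. Note that a payload popped but not yet handled is lost if the worker crashes mid-handler, so whether that window is acceptable is a design decision for this task.
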
---
## 4. TESTING STRATEGY
To verify fixes without breaking the 27 integration tests:
1. **Auth Bypass:** Write a new test in the smoke suite that attempts to hit `/api/v1/auth/login-hack` and `/webhooks/../api/protected`. Expect `404 Not Found` or `401 Unauthorized`.
2. **RLS Protection:** After restricting `pool`, run `./integration-test.sh`. Any query that was improperly bypassing `withTenant` will cause TypeScript compilation to fail (since `pool.query` is no longer available), ensuring safe refactoring.
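3. **DB Roles:** Spin up a clean docker-compose environment. Use a Postgres client to verify that the `dd0c_alert` user cannot connect to the `dd0c_route` database, and therefore cannot run `SELECT * FROM users` there. (PostgreSQL has no cross-database queries, so the check must be made by connecting to `dd0c_route` directly.)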
4. **Webhooks:** Simulate a Fly.io cold start by pausing the `dd0c-alert` container, firing a webhook to the ingestion endpoint, and verifying the payload is queued in Redis and processed upon container resume.
5. **Invite Flow:** Add a multi-user flow to `integration-test.sh` asserting User B can be invited by User A and both can query the same `tenant_id` records.