# dd0c Platform - BDD Specification Gap Analysis
## Executive Summary
This gap analysis compares the BDD acceptance specifications against the currently implemented Node.js/Fastify source code and PostgreSQL database migrations for the dd0c monorepo (P2-P6).
Overall, the **Dashboard APIs** required by the React Console are largely implemented across all services, so the frontend can render and operate against them. The major gaps lie in the out-of-band background workers, external agents, robust message queuing (SQS/DLQ), and advanced intelligence/scoring heuristics.
**Estimated Implementation Completion:**
* **P4 - Lightweight IDP:** ~75% (Core scanners, catalog, and search are functional)
* **P3 - Alert Intelligence:** ~65% (Ingestion, basic correlation, and UI APIs are solid)
* **P5 - AWS Cost Anomaly:** ~50% (Scorer and APIs exist, but CloudTrail ingestion is missing)
* **P6 - Runbook Automation:** ~40% (APIs and Slackbot exist; parsing, classification, and agent execution are completely missing)
* **P2 - IaC Drift Detection:** ~30% (SaaS ingestion APIs exist; the entire external agent, mTLS, and diff engines are missing)
---
## Per-Service Breakdown by Epic
### P2: IaC Drift Detection
* **Epic 1: Drift Detection Agent** **MISSING** - No Go agent binary. Terraform, CloudFormation, Kubernetes, and Pulumi state scanning engines do not exist. Secret scrubbing logic is missing.
* **Epic 2: Agent Communication** 🟡 **PARTIAL** - Basic HTTP ingestion route exists (`/v1/ingest/drift`), but mTLS authentication and SQS FIFO message queues are not implemented.
* **Epic 3: Event Processor** 🟡 **PARTIAL** - Ingestion, nonce replay prevention, and PostgreSQL persistence with RLS are implemented. Missing canonical schema normalization and chunked report reassembly.
* **Epic 4: Notification Engine** 🟡 **PARTIAL** - Slack Block Kit, Email (Resend), Webhook, and PagerDuty dispatchers are implemented. Missing Daily Digest job and severity-based routing logic.
* **Epic 5: Remediation** **MISSING** - Interactive Slack buttons exist in notification payloads, but the backend workflow engine, approval tracking, and agent-side execution dispatch are missing.
* **Epic 6 & 7: Dashboard UI & API** **IMPLEMENTED** - `fetchStacks`, `fetchStackHistory`, and `fetchLatestReport` endpoints are fully implemented with tenant RLS.
* **Epic 8 & 9: Infrastructure / PLG** **MISSING** - No CDK templates, CI/CD pipelines, Stripe billing, or CLI setup logic.
* **Epic 10: Transparent Factory** 🟡 **PARTIAL** - Database migrations and RLS are implemented. Missing Feature Flag service and OTEL Tracing.
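
The nonce replay prevention noted under Epic 3 can be illustrated with a minimal in-memory sketch. This is not the dd0c implementation: the `NonceGuard` class, its `ttlMs` parameter, and the per-tenant key format are hypothetical, and a production version would more likely keep nonces in a shared store so every API instance sees them.

```typescript
// Hypothetical sketch of nonce replay prevention; names are illustrative only.
class NonceGuard {
  // Maps "tenantId:nonce" to the epoch-millisecond time at which it expires.
  private readonly seen = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  /** Records a fresh nonce and returns true; returns false for a replay. */
  accept(tenantId: string, nonce: string, nowMs: number = Date.now()): boolean {
    this.evictExpired(nowMs);
    const key = `${tenantId}:${nonce}`;
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.set(key, nowMs + this.ttlMs);
    return true;
  }

  /** Drops nonces whose retention window has passed. */
  private evictExpired(nowMs: number): void {
    this.seen.forEach((expiresAt, key) => {
      if (expiresAt <= nowMs) {
        this.seen.delete(key);
      }
    });
  }
}
```

A guard like this only holds if reports older than `ttlMs` are also rejected by timestamp; otherwise an expired nonce could be replayed once it has been evicted.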
### P3: Alert Intelligence
* **Epic 1: Webhook Ingestion** 🟡 **PARTIAL** - Webhook routes and HMAC validation for Datadog, PagerDuty, OpsGenie, and Grafana are implemented, with payloads queued through Redis. Missing S3 archival, oversized payload handling, and SQS/DLQ.
* **Epic 2: Alert Normalization** 🟡 **PARTIAL** - Basic provider mapping logic exists in `webhook-processor.ts`.
* **Epic 3: Correlation Engine** 🟡 **PARTIAL** - Time-window correlation and fingerprint deduplication are implemented using Redis. Missing Service-Affinity matching and strict cross-tenant worker isolation.
* **Epic 4: Notification & Escalation** 🟡 **PARTIAL** - Slack, Email, and Webhook dispatchers are implemented. Missing PagerDuty auto-escalation cron and Daily Noise Report.
* **Epic 5: Slack Bot** 🟡 **PARTIAL** - Missing interactive feedback button handlers (`/slack/interactions`) for noise/helpful marking, and missing `/dd0c` slash commands.
* **Epic 6 & 7: Dashboard UI & API** 🟡 **PARTIAL** - Incident CRUD, filtering, and summary endpoints are implemented. Missing `MTTR` and `Noise Reduction` analytics endpoints requested by the spec.
* **Epic 8 & 9: Infrastructure / PLG** **MISSING** - No CDK, Stripe billing, or Free Tier (10K alerts/month) limit enforcement.
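
To make the Epic 3 terms concrete, the following is a rough sketch of fingerprint deduplication combined with a sliding time window. It assumes an in-memory map rather than Redis, and the `Alert` fields, the `fingerprintOf` helper, and the `Correlator` class are all hypothetical; the real fingerprint inputs and window semantics may differ.

```typescript
// Hypothetical sketch; field names and fingerprint inputs are assumptions.
interface Alert {
  tenantId: string;
  source: string;
  service: string;
  title: string;
  receivedAtMs: number;
}

interface Incident {
  fingerprint: string;
  firstSeenMs: number;
  lastSeenMs: number;
  alertCount: number;
}

// Builds a stable key from normalized fields. A real system would likely hash it.
function fingerprintOf(alert: Alert): string {
  return [alert.tenantId, alert.source, alert.service, alert.title]
    .map((part) => part.trim().toLowerCase())
    .join("|");
}

class Correlator {
  private readonly open = new Map<string, Incident>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  /** Folds the alert into a matching incident if one was seen within the window. */
  ingest(alert: Alert): { incident: Incident; isNew: boolean } {
    const fingerprint = fingerprintOf(alert);
    const existing = this.open.get(fingerprint);
    if (existing !== undefined && alert.receivedAtMs - existing.lastSeenMs <= this.windowMs) {
      // Sliding window: each duplicate extends the incident's lifetime.
      existing.lastSeenMs = alert.receivedAtMs;
      existing.alertCount += 1;
      return { incident: existing, isNew: false };
    }
    const incident: Incident = {
      fingerprint,
      firstSeenMs: alert.receivedAtMs,
      lastSeenMs: alert.receivedAtMs,
      alertCount: 1,
    };
    this.open.set(fingerprint, incident);
    return { incident, isNew: true };
  }
}
```

Because the tenant ID is part of the fingerprint, alerts from different tenants never merge in this sketch. That is weaker than the strict cross-tenant worker isolation the spec asks for.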
### P4: Lightweight IDP
* **Epic 1: AWS Discovery Scanner** 🟡 **PARTIAL** - ECS, Lambda, and RDS resource discovery implemented. Missing CloudFormation, API Gateway, and Step Functions orchestration.
* **Epic 2: GitHub Discovery Scanner** 🟡 **PARTIAL** - Repository fetching, pagination, and basic `package.json`/`Dockerfile` heuristics implemented. Missing advanced CODEOWNERS and commit history parsing.
* **Epic 3: Service Catalog** 🟡 **PARTIAL** - Catalog ingestion, partial update staging, ownership resolution, and DB APIs implemented. Missing PagerDuty/OpsGenie on-call mapping.
* **Epic 4: Search Engine** 🟡 **PARTIAL** - Meilisearch integration with PostgreSQL fallback implemented. Missing Redis prefix caching for `Cmd+K` performance optimization.
* **Epic 5: Dashboard API** **IMPLEMENTED** - Service CRUD and ownership summary endpoints are fully functional and align with Console requirements.
* **Epic 6: Analytics Dashboards** **MISSING** - API endpoints for Ownership Coverage, Health Scorecards, and Tech Debt tracking are missing.
### P5: AWS Cost Anomaly
* **Epic 1: CloudTrail Ingestion** **MISSING** - A batch ingestion API exists, but the AWS EventBridge cross-account rules, SQS FIFO, and Lambda normalizer are entirely missing.
* **Epic 2: Anomaly Detection** 🟡 **PARTIAL** - Welford's algorithm and basic Z-Score computation are implemented. Missing novelty scoring, cold-start fast path, and composite scoring logic.
* **Epic 3: Zombie Hunter** **MISSING** - No scheduled jobs or logic to detect idle EC2, RDS, or EBS resources.
* **Epic 4: Notification & Remediation** 🟡 **PARTIAL** - Slack notification generation is implemented. Missing the `/slack/interactions` endpoint to process remediation buttons (e.g., Stop Instance).
* **Epic 6 & 7: Dashboard UI & API** **IMPLEMENTED** - Anomalies, Baselines, and Governance rule CRUD endpoints match Console expectations.
* **Epic 10: Transparent Factory** 🟡 **PARTIAL** - The 14-day `GovernanceEngine` (Shadow -> Audit -> Enforce) auto-promotion and Panic Mode logic is implemented. Missing Circuit Breakers and OTEL spans.
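
For reference, the Welford running statistics and Z-score computation mentioned under Epic 2 can be sketched as follows. The `RunningStats` class and its method names are hypothetical, and the choice of sample variance (dividing by `n - 1`) is an assumption; the existing scorer may differ in both.

```typescript
// Hypothetical sketch of Welford's online algorithm with a Z-score helper.
class RunningStats {
  private count = 0;
  private mean = 0;
  private m2 = 0; // Running sum of squared deviations from the mean.

  /** Incorporates one observation in O(1) time and memory. */
  update(value: number): void {
    this.count += 1;
    const delta = value - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (value - this.mean);
  }

  average(): number {
    return this.mean;
  }

  /** Sample variance; zero until at least two observations exist. */
  sampleVariance(): number {
    return this.count > 1 ? this.m2 / (this.count - 1) : 0;
  }

  stdDev(): number {
    return Math.sqrt(this.sampleVariance());
  }

  /** Distance of a value from the baseline in standard deviations. */
  zScore(value: number): number {
    const sd = this.stdDev();
    return sd === 0 ? 0 : (value - this.mean) / sd;
  }
}
```

Returning `0` when the standard deviation is zero keeps a flat or single-sample baseline from producing infinite scores. That is exactly the situation the missing cold-start fast path would need to handle more deliberately.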
### P6: Runbook Automation
* **Epic 1: Runbook Parser** **MISSING** - The system currently expects raw YAML inputs. Confluence HTML, Notion Markdown, and LLM step extraction parsing engines are entirely missing.
* **Epic 2: Action Classifier** **MISSING** - Neither the deterministic regex safety scanner nor the secondary LLM risk classifier exists.
* **Epic 3: Execution Engine** 🟡 **PARTIAL** - Basic state transitions are handled in `api/runbooks.ts`. Missing Trust Level enforcement, network partition recovery, and step idempotency logic.
* **Epic 4: Agent** **MISSING** - No Go agent binary, gRPC bidirectional streaming, or local sandbox execution environments exist.
* **Epic 5: Audit Trail** 🟡 **PARTIAL** - Basic Postgres `audit_entries` table exists. Missing the immutable append-only hash chain logic and CSV/PDF compliance export APIs.
* **Epic 6: Dashboard API** **IMPLEMENTED** - Runbook, execution, and approval APIs are implemented. Redis pub/sub Agent Bridge exists. Slackbot interaction handlers are fully implemented with signature verification.
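
The hash chain missing from Epic 5 could look roughly like the sketch below, which links each audit entry to the hash of its predecessor so that any later edit is detectable. The `AuditEntry` shape, the `appendEntry` and `verifyChain` helpers, and the all-zero genesis hash are assumptions for illustration and do not reflect the actual `audit_entries` columns.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; the real audit_entries table may differ.
interface AuditEntry {
  sequence: number;
  actor: string;
  action: string;
  prevHash: string;
  hash: string;
}

// Placeholder predecessor hash for the first entry (an assumption).
const GENESIS_HASH = "0".repeat(64);

// Hashes the entry's content together with the previous entry's hash.
function hashEntry(sequence: number, actor: string, action: string, prevHash: string): string {
  return createHash("sha256")
    .update(JSON.stringify([sequence, actor, action, prevHash]))
    .digest("hex");
}

/** Returns a new chain with one entry appended; the input is not mutated. */
function appendEntry(chain: readonly AuditEntry[], actor: string, action: string): AuditEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS_HASH;
  const sequence = chain.length;
  const hash = hashEntry(sequence, actor, action, prevHash);
  return [...chain, { sequence, actor, action, prevHash, hash }];
}

/** Recomputes every link; false means an entry was altered, removed, or reordered. */
function verifyChain(chain: readonly AuditEntry[]): boolean {
  let expectedPrev = GENESIS_HASH;
  for (let i = 0; i < chain.length; i++) {
    const entry = chain[i];
    if (entry.sequence !== i || entry.prevHash !== expectedPrev) {
      return false;
    }
    if (hashEntry(entry.sequence, entry.actor, entry.action, entry.prevHash) !== entry.hash) {
      return false;
    }
    expectedPrev = entry.hash;
  }
  return true;
}
```

The chain only makes tampering detectable. Append-only behaviour would still have to be enforced at the database level, for example by denying `UPDATE` and `DELETE` on the table.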
---
## Priority Ranking (What to Implement Next)
This ranking is based on minimizing time-to-value: it prioritizes services where the Console UI is already supported, the backend logic is mostly complete, and the remaining gaps are well-defined.
**1. P4 - Lightweight IDP**
* **Why:** It is functionally the most complete. The Console APIs work, Meilisearch sync works, and basic AWS/GitHub discovery is operational.
* **Next Steps:** Implement the missing AWS scanners (CloudFormation, API Gateway) and the `Redis` prefix caching for search. Add the analytics endpoints (Ownership, Health, Tech Debt) to unlock the remaining UI views.
**2. P3 - Alert Intelligence**
* **Why:** The core pipeline (Webhook -> Redis -> Worker -> DB) is functional and deduplication logic works. Console APIs are satisfied.
* **Next Steps:** Build the `MTTR` and `Noise Reduction` analytics SQL queries, add PagerDuty escalation triggers, and implement the interactive Slack button handlers.
**3. P5 - AWS Cost Anomaly**
* **Why:** The complex math (Welford running stats) and database governance logic are done, making the dashboard functional for demo data.
* **Next Steps:** The biggest blocker is that there is no data pipeline. Implement the CDK stack to deploy the EventBridge rules and the `Lambda Normalizer` to translate CloudTrail events into the existing `/v1/ingest` API.
**4. P6 - Runbook Automation**
* **Why:** The API orchestration, Slack integrations, and Redis Pub/Sub bridges are nicely implemented, but it is currently a "brain without a body."
* **Next Steps:** It requires two massive standalone systems: the `Runbook Parser` (LLM + AST logic) and the actual external `Agent` (Go binary with gRPC and sandboxing).
**5. P2 - IaC Drift Detection**
* **Why:** Furthest from completion. While the SaaS API exists, it requires a highly complex external Go agent capable of reading Terraform/K8s/Pulumi state, a secure mTLS CA registration system, and a diffing/scoring engine. None of these currently exist.