# dd0c Platform - BDD Specification Gap Analysis

## Executive Summary

This gap analysis compares the BDD acceptance specifications against the currently implemented Node.js/Fastify source code and PostgreSQL database migrations for the dd0c monorepo (P2-P6).

Overall, the **Dashboard APIs** required by the React Console are largely implemented across all services, so the frontend should render and operate. The major gaps are in the out-of-band background workers, external agents, robust message queuing (SQS/DLQ), and advanced intelligence/scoring heuristics.

**Estimated Implementation Completion:**
* **P4 - Lightweight IDP:** ~75% (Core scanners, catalog, and search are functional)
* **P3 - Alert Intelligence:** ~65% (Ingestion, basic correlation, and UI APIs are solid)
* **P5 - AWS Cost Anomaly:** ~50% (Scorer and APIs exist, but CloudTrail ingestion is missing)
* **P6 - Runbook Automation:** ~40% (APIs and Slackbot exist; parsing, classification, and agent execution are completely missing)
* **P2 - IaC Drift Detection:** ~30% (SaaS ingestion APIs exist; the entire external agent, mTLS, and diff engines are missing)

---

## Per-Service Breakdown by Epic

### P2: IaC Drift Detection
* **Epic 1: Drift Detection Agent** ❌ **MISSING** - No Go agent binary. Terraform, CloudFormation, Kubernetes, and Pulumi state scanning engines do not exist. Secret scrubbing logic is missing.
* **Epic 2: Agent Communication** 🟡 **PARTIAL** - A basic HTTP ingestion route exists (`/v1/ingest/drift`), but mTLS authentication and SQS FIFO message queues are not implemented.
* **Epic 3: Event Processor** 🟡 **PARTIAL** - Ingestion, nonce replay prevention, and PostgreSQL persistence with RLS are implemented. Missing canonical schema normalization and chunked report reassembly.
* **Epic 4: Notification Engine** 🟡 **PARTIAL** - Slack Block Kit, Email (Resend), Webhook, and PagerDuty dispatchers are implemented. Missing Daily Digest job and severity-based routing logic.
* **Epic 5: Remediation** ❌ **MISSING** - Interactive Slack buttons exist in notification payloads, but the backend workflow engine, approval tracking, and agent-side execution dispatch are missing.
* **Epic 6 & 7: Dashboard UI & API** ✅ **IMPLEMENTED** - `fetchStacks`, `fetchStackHistory`, and `fetchLatestReport` endpoints are fully implemented with tenant RLS.
* **Epic 8 & 9: Infrastructure / PLG** ❌ **MISSING** - No CDK templates, CI/CD pipelines, Stripe billing, or CLI setup logic.
* **Epic 10: Transparent Factory** 🟡 **PARTIAL** - Database migrations and RLS are implemented. Missing Feature Flag service and OTEL Tracing.

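
Epic 3 lists nonce replay prevention as implemented. For readers unfamiliar with the technique, the sketch below shows the general idea. It is not the repository's code: `NonceStore`, its in-memory `Map`, and the `ttlMs` parameter are hypothetical names, and it assumes requests older than the TTL are already rejected by a separate timestamp check.

```typescript
// Hypothetical in-memory nonce store (illustration only). A service running
// several API instances would presumably need a shared store instead.
class NonceStore {
  // Maps "tenant:nonce" to the time (ms) after which the entry may be dropped.
  private readonly seen = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  /** Returns true and records the nonce if it is fresh; false if it is a replay. */
  accept(tenantId: string, nonce: string, nowMs: number = Date.now()): boolean {
    this.evictExpired(nowMs);
    const key = tenantId + ":" + nonce;
    if (this.seen.has(key)) {
      return false; // same tenant already used this nonce within the TTL
    }
    this.seen.set(key, nowMs + this.ttlMs);
    return true;
  }

  // Drops entries whose TTL has elapsed so the map does not grow without bound.
  private evictExpired(nowMs: number): void {
    this.seen.forEach((expiresAtMs, key) => {
      if (expiresAtMs <= nowMs) {
        this.seen.delete(key);
      }
    });
  }
}
```

Scoping the key by tenant keeps one tenant's nonces from colliding with another's, which matches the tenant isolation the RLS policies enforce at the database layer.
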
### P3: Alert Intelligence
* **Epic 1: Webhook Ingestion** 🟡 **PARTIAL** - Webhook routes and HMAC validation for Datadog, PagerDuty, OpsGenie, and Grafana are implemented and feed a Redis queue. Missing S3 archival, oversized payload handling, and SQS/DLQ.
* **Epic 2: Alert Normalization** 🟡 **PARTIAL** - Basic provider mapping logic exists in `webhook-processor.ts`.
* **Epic 3: Correlation Engine** 🟡 **PARTIAL** - Time-window correlation and fingerprint deduplication are implemented using Redis. Missing Service-Affinity matching and strict cross-tenant worker isolation.
* **Epic 4: Notification & Escalation** 🟡 **PARTIAL** - Slack, Email, and Webhook dispatchers are implemented. Missing PagerDuty auto-escalation cron and Daily Noise Report.
* **Epic 5: Slack Bot** 🟡 **PARTIAL** - Missing interactive feedback button handlers (`/slack/interactions`) for noise/helpful marking, and missing `/dd0c` slash commands.
* **Epic 6 & 7: Dashboard UI & API** 🟡 **PARTIAL** - Incident CRUD, filtering, and summary endpoints are implemented. Missing `MTTR` and `Noise Reduction` analytics endpoints required by the spec.
* **Epic 8 & 9: Infrastructure / PLG** ❌ **MISSING** - No CDK, Stripe billing, or Free Tier (10K alerts/month) limit enforcement.

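
To make Epic 3's "time-window correlation and fingerprint deduplication" concrete, here is a minimal in-memory sketch. The real implementation uses Redis and is not shown in this analysis, so every name here (`WindowCorrelator`, `IncomingAlert`, `OpenIncident`) is hypothetical. The sliding window measured from the last matching alert is also an assumption; a fixed window from the first alert is equally plausible.

```typescript
// Hypothetical shapes for illustration; the real schema may differ.
interface IncomingAlert {
  tenantId: string;
  fingerprint: string; // assumed to be a stable hash of the alert's identity
  receivedAtMs: number;
}

interface OpenIncident {
  id: number;
  tenantId: string;
  fingerprint: string;
  firstSeenMs: number;
  lastSeenMs: number;
  alertCount: number;
}

class WindowCorrelator {
  private readonly open = new Map<string, OpenIncident>();
  private readonly windowMs: number;
  private nextId = 1;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  /** Folds the alert into an open incident, or opens a new incident. */
  ingest(alert: IncomingAlert): OpenIncident {
    // Keying by tenant first keeps correlation from crossing tenants.
    const key = alert.tenantId + ":" + alert.fingerprint;
    const current = this.open.get(key);
    if (
      current !== undefined &&
      alert.receivedAtMs - current.lastSeenMs <= this.windowMs
    ) {
      // Duplicate inside the window: extend the incident, do not create a new one.
      current.lastSeenMs = Math.max(current.lastSeenMs, alert.receivedAtMs);
      current.alertCount += 1;
      return current;
    }
    const incident: OpenIncident = {
      id: this.nextId,
      tenantId: alert.tenantId,
      fingerprint: alert.fingerprint,
      firstSeenMs: alert.receivedAtMs,
      lastSeenMs: alert.receivedAtMs,
      alertCount: 1,
    };
    this.nextId += 1;
    this.open.set(key, incident);
    return incident;
  }
}
```

The missing Service-Affinity matching would presumably widen the key beyond an exact fingerprint match. That is outside what this sketch covers.
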
### P4: Lightweight IDP
* **Epic 1: AWS Discovery Scanner** 🟡 **PARTIAL** - ECS, Lambda, and RDS resource discovery implemented. Missing CloudFormation, API Gateway, and Step Functions orchestration.
* **Epic 2: GitHub Discovery Scanner** 🟡 **PARTIAL** - Repository fetching, pagination, and basic `package.json`/`Dockerfile` heuristics implemented. Missing advanced CODEOWNERS and commit history parsing.
* **Epic 3: Service Catalog** 🟡 **PARTIAL** - Catalog ingestion, partial update staging, ownership resolution, and DB APIs implemented. Missing PagerDuty/OpsGenie on-call mapping.
* **Epic 4: Search Engine** 🟡 **PARTIAL** - Meilisearch integration with PostgreSQL fallback implemented. Missing Redis prefix caching for `Cmd+K` performance optimization.
* **Epic 5: Dashboard API** ✅ **IMPLEMENTED** - Service CRUD and ownership summary endpoints are fully functional and align with Console requirements.
* **Epic 6: Analytics Dashboards** ❌ **MISSING** - API endpoints for Ownership Coverage, Health Scorecards, and Tech Debt tracking are missing.

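
The prefix caching that Epic 4 lists as missing could take roughly the following shape. This is a sketch under stated assumptions, not a design from the spec: `PrefixCache` and `SearchBackend` are hypothetical names, an in-memory `Map` stands in for Redis, and the backend is synchronous only to keep the example short (a Meilisearch call would be asynchronous).

```typescript
// Hypothetical backend signature; stands in for the Meilisearch/PostgreSQL query.
type SearchBackend = (query: string) => string[];

interface CacheEntry {
  results: string[];
  expiresAtMs: number;
}

class PrefixCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly backend: SearchBackend;
  private readonly ttlMs: number;

  constructor(backend: SearchBackend, ttlMs: number) {
    this.backend = backend;
    this.ttlMs = ttlMs;
  }

  /** Serves a typed prefix from cache when fresh, otherwise queries the backend. */
  search(rawQuery: string, nowMs: number = Date.now()): string[] {
    // Normalizing lets "Pay " and "pay" share one cache entry.
    const prefix = rawQuery.trim().toLowerCase();
    const hit = this.entries.get(prefix);
    if (hit !== undefined && hit.expiresAtMs > nowMs) {
      return hit.results;
    }
    const results = this.backend(prefix);
    this.entries.set(prefix, { results, expiresAtMs: nowMs + this.ttlMs });
    return results;
  }
}
```

A short TTL bounds how stale results can get after a catalog update. A production version would also need per-tenant keys, which this sketch omits.
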
### P5: AWS Cost Anomaly
* **Epic 1: CloudTrail Ingestion** ❌ **MISSING** - A batch ingestion API exists, but the AWS EventBridge cross-account rules, SQS FIFO, and Lambda normalizer are entirely missing.
* **Epic 2: Anomaly Detection** 🟡 **PARTIAL** - Welford's algorithm and basic Z-Score computation are implemented. Missing novelty scoring, cold-start fast path, and composite scoring logic.
* **Epic 3: Zombie Hunter** ❌ **MISSING** - No scheduled jobs or logic to detect idle EC2, RDS, or EBS resources.
* **Epic 4: Notification & Remediation** 🟡 **PARTIAL** - Slack notification generation is implemented. Missing the `/slack/interactions` endpoint to process remediation buttons (e.g., Stop Instance).
* **Epic 6 & 7: Dashboard UI & API** ✅ **IMPLEMENTED** - Anomalies, Baselines, and Governance rule CRUD endpoints match Console expectations.
* **Epic 10: Transparent Factory** 🟡 **PARTIAL** - The 14-day `GovernanceEngine` (Shadow -> Audit -> Enforce) auto-promotion and Panic Mode logic is implemented. Missing Circuit Breakers and OTEL spans.

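
For reference, Welford's algorithm with a basic Z-score (Epic 2) fits in a few lines. The sketch below shows the standard algorithm, not the service's scorer. `RunningStats` and its method names are hypothetical, and using the sample variance (dividing by n - 1) is an assumption.

```typescript
// Welford's online algorithm: updates mean and variance one value at a time
// without storing the history. Class and method names are illustrative.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the current mean

  update(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    // Uses the updated mean for the second factor; this is what keeps it stable.
    this.m2 += delta * (x - this.mean);
  }

  getCount(): number {
    return this.n;
  }

  getMean(): number {
    return this.mean;
  }

  /** Sample variance; defined as 0 until at least two values have been seen. */
  variance(): number {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0;
  }

  stdDev(): number {
    return Math.sqrt(this.variance());
  }

  /** Distance of x from the mean in standard deviations; 0 if there is no spread yet. */
  zScore(x: number): number {
    const sd = this.stdDev();
    return sd === 0 ? 0 : (x - this.mean) / sd;
  }
}
```

Returning 0 when the standard deviation is 0 means a new baseline never flags an anomaly. That gap is roughly what the missing cold-start fast path and novelty scoring would need to address.
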
### P6: Runbook Automation
* **Epic 1: Runbook Parser** ❌ **MISSING** - The system currently expects raw YAML inputs. Confluence HTML, Notion Markdown, and LLM step extraction parsing engines are entirely missing.
* **Epic 2: Action Classifier** ❌ **MISSING** - Neither the deterministic regex safety scanner nor the secondary LLM risk classifier exists.
* **Epic 3: Execution Engine** 🟡 **PARTIAL** - Basic state transitions are handled in `api/runbooks.ts`. Missing Trust Level enforcement, network partition recovery, and step idempotency logic.
* **Epic 4: Agent** ❌ **MISSING** - No Go agent binary, gRPC bidirectional streaming, or local sandbox execution environments exist.
* **Epic 5: Audit Trail** 🟡 **PARTIAL** - Basic Postgres `audit_entries` table exists. Missing the immutable append-only hash chain logic and CSV/PDF compliance export APIs.
* **Epic 6: Dashboard API** ✅ **IMPLEMENTED** - Runbook, execution, and approval APIs are implemented. Redis pub/sub Agent Bridge exists. Slackbot interaction handlers are fully implemented with signature verification.

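
To give a sense of scope for Epic 2, a deterministic regex safety scanner might look like the sketch below. The rule list, the three risk levels, and the choice to treat unmatched commands as `caution` are all assumptions made for illustration; the spec's actual rules are not reproduced in this analysis.

```typescript
type Risk = "safe" | "caution" | "dangerous";

interface Rule {
  pattern: RegExp;
  risk: Risk;
  reason: string;
}

interface Classification {
  risk: Risk;
  reasons: string[];
}

// Illustrative rules only; a real scanner would need a far larger, reviewed set.
const RULES: Rule[] = [
  { pattern: /\brm\s+-[a-zA-Z]*r[a-zA-Z]*\b/, risk: "dangerous", reason: "recursive delete" },
  { pattern: /\bdrop\s+(table|database)\b/i, risk: "dangerous", reason: "destructive SQL" },
  { pattern: /\bkubectl\s+delete\b/, risk: "dangerous", reason: "deletes cluster resources" },
  { pattern: /\bsystemctl\s+(restart|stop)\b/, risk: "caution", reason: "service interruption" },
  { pattern: /^\s*kubectl\s+(get|describe|logs)\b/, risk: "safe", reason: "read-only kubectl" },
];

const SEVERITY: Record<Risk, number> = { safe: 0, caution: 1, dangerous: 2 };

/** Returns the highest risk among matching rules; unknown commands are never "safe". */
function classifyCommand(command: string): Classification {
  const matched = RULES.filter((rule) => rule.pattern.test(command));
  if (matched.length === 0) {
    return { risk: "caution", reasons: ["no rule matched; treated as unknown"] };
  }
  const risk = matched.reduce<Risk>(
    (worst, rule) => (SEVERITY[rule.risk] > SEVERITY[worst] ? rule.risk : worst),
    "safe",
  );
  return { risk, reasons: matched.map((rule) => rule.reason) };
}
```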
---

## Priority Ranking (What to Implement Next)

This ranking is based on minimizing time-to-value: it prioritizes services where the Console UI is already supported, the backend logic is mostly complete, and the remaining gaps are well-defined.

**1. P4 - Lightweight IDP**
* **Why:** It is functionally the most complete. The Console APIs work, Meilisearch sync works, and basic AWS/GitHub discovery is operational.
* **Next Steps:** Implement the missing AWS scanners (CloudFormation, API Gateway) and the Redis prefix caching for search. Add the analytics endpoints (Ownership, Health, Tech Debt) to unlock the remaining UI views.

**2. P3 - Alert Intelligence**
* **Why:** The core pipeline (Webhook -> Redis -> Worker -> DB) is functional and deduplication logic works. Console APIs are satisfied.
* **Next Steps:** Build the `MTTR` and `Noise Reduction` analytics SQL queries, add PagerDuty escalation triggers, and implement the interactive Slack button handlers.

**3. P5 - AWS Cost Anomaly**
* **Why:** The complex math (Welford running stats) and database governance logic are done, making the dashboard functional for demo data.
* **Next Steps:** The biggest blocker is that there is no data pipeline. Implement the CDK stack to deploy the EventBridge rules and the `Lambda Normalizer` to translate CloudTrail events into the existing `/v1/ingest` API.

**4. P6 - Runbook Automation**
* **Why:** The API orchestration, Slack integrations, and Redis Pub/Sub bridges are well implemented, but it is currently a "brain without a body."
* **Next Steps:** It requires two large standalone systems: the `Runbook Parser` (LLM + AST logic) and the actual external `Agent` (Go binary with gRPC and sandboxing).

**5. P2 - IaC Drift Detection**
* **Why:** Furthest from completion. While the SaaS API exists, it requires a highly complex external Go agent capable of reading Terraform/K8s/Pulumi state, a secure mTLS CA registration system, and a diffing/scoring engine, none of which currently exist.