dd0c/products/SPEC-GAP-ANALYSIS.md
Max f1f4dee7ab
feat(cost): add zombie hunter, Slack interactions, composite scoring
- Zombie resource hunter: detects idle EC2/RDS/EBS/EIP/NAT resources
- Slack interactive handler: acknowledge, snooze, create-ticket actions
- Composite anomaly scorer: Z-Score + rate-of-change + pattern + novelty
- Cold-start fast path for new resources (<7 days data)
- 005_zombies.sql migration
2026-03-03 06:39:20 +00:00

dd0c Platform - BDD Specification Gap Analysis

Executive Summary

This gap analysis compares the BDD acceptance specifications against the currently implemented Node.js/Fastify source code and PostgreSQL database migrations for the dd0c monorepo (P2-P6).

Overall, the Dashboard APIs required by the React Console are largely implemented across all services, so the frontend should render and operate against them. The major gaps lie in the out-of-band background workers, the external agents, robust message queuing (SQS/DLQ), and the advanced intelligence/scoring heuristics.

Estimated Implementation Completion:

  • P4 - Lightweight IDP: ~75% (Core scanners, catalog, and search are functional)
  • P3 - Alert Intelligence: ~65% (Ingestion, basic correlation, and UI APIs are solid)
  • P5 - AWS Cost Anomaly: ~50% (Scorer and APIs exist, but CloudTrail ingestion is missing)
  • P6 - Runbook Automation: ~40% (APIs and Slackbot exist; parsing, classification, and agent execution are completely missing)
  • P2 - IaC Drift Detection: ~30% (SaaS ingestion APIs exist; the entire external agent, mTLS, and diff engines are missing)

Per-Service Breakdown by Epic

P2: IaC Drift Detection

  • Epic 1: Drift Detection Agent MISSING - No Go agent binary. Terraform, CloudFormation, Kubernetes, and Pulumi state scanning engines do not exist. Secret scrubbing logic is missing.
  • Epic 2: Agent Communication 🟡 PARTIAL - Basic HTTP ingestion route exists (/v1/ingest/drift), but mTLS authentication and SQS FIFO message queues are not implemented.
  • Epic 3: Event Processor 🟡 PARTIAL - Ingestion, nonce replay prevention, and PostgreSQL persistence with RLS are implemented. Missing canonical schema normalization and chunked report reassembly.
  • Epic 4: Notification Engine 🟡 PARTIAL - Slack Block Kit, Email (Resend), Webhook, and PagerDuty dispatchers are implemented. Missing Daily Digest job and severity-based routing logic.
  • Epic 5: Remediation MISSING - Interactive Slack buttons exist in notification payloads, but the backend workflow engine, approval tracking, and agent-side execution dispatch are missing.
  • Epic 6 & 7: Dashboard UI & API IMPLEMENTED - fetchStacks, fetchStackHistory, and fetchLatestReport endpoints are fully implemented with tenant RLS.
  • Epic 8 & 9: Infrastructure / PLG MISSING - No CDK templates, CI/CD pipelines, Stripe billing, or CLI setup logic.
  • Epic 10: Transparent Factory 🟡 PARTIAL - Database migrations and RLS are implemented. Missing Feature Flag service and OTEL Tracing.
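
The nonce replay prevention noted under Epic 3 can be illustrated with a minimal sketch. This is not the repository's code: the `NonceGuard` class, its TTL parameter, and the in-memory `Map` are assumptions made for illustration, and a multi-instance deployment would more plausibly keep this state in Redis or PostgreSQL.

```typescript
// Hypothetical in-memory replay guard (illustrative only).
// A nonce is accepted once per tenant until its TTL expires.
class NonceGuard {
  private seen = new Map<string, number>(); // key -> expiry in ms since epoch
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  /** Returns true and records the nonce if it is fresh; false if it is a replay. */
  accept(tenantId: string, nonce: string, nowMs: number = Date.now()): boolean {
    this.evictExpired(nowMs);
    const key = tenantId + ":" + nonce;
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.set(key, nowMs + this.ttlMs);
    return true;
  }

  /** Drops every nonce whose expiry time has passed. */
  private evictExpired(nowMs: number): void {
    this.seen.forEach((expiresAtMs, key) => {
      if (expiresAtMs <= nowMs) {
        this.seen.delete(key);
      }
    });
  }
}
```

Scoping the key by tenant keeps one tenant's nonces from colliding with another's, which matches the tenant isolation the ingestion path already enforces through RLS.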

P3: Alert Intelligence

  • Epic 1: Webhook Ingestion 🟡 PARTIAL - Webhook routes with HMAC validation for Datadog, PagerDuty, OpsGenie, and Grafana are implemented and feed a Redis queue. Missing S3 archival, oversized payload handling, and SQS/DLQ.
  • Epic 2: Alert Normalization 🟡 PARTIAL - Basic provider mapping logic exists in webhook-processor.ts.
  • Epic 3: Correlation Engine 🟡 PARTIAL - Time-window correlation and fingerprint deduplication are implemented using Redis. Missing Service-Affinity matching and strict cross-tenant worker isolation.
  • Epic 4: Notification & Escalation 🟡 PARTIAL - Slack, Email, and Webhook dispatchers are implemented. Missing PagerDuty auto-escalation cron and Daily Noise Report.
  • Epic 5: Slack Bot 🟡 PARTIAL - Missing the interactive feedback button handlers (/slack/interactions) for noise/helpful marking and the /dd0c slash commands.
  • Epic 6 & 7: Dashboard UI & API 🟡 PARTIAL - Incident CRUD, filtering, and summary endpoints are implemented. Missing MTTR and Noise Reduction analytics endpoints requested by the spec.
  • Epic 8 & 9: Infrastructure / PLG MISSING - No CDK, Stripe billing, or Free Tier (10K alerts/month) limit enforcement.
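
To make the Epic 3 terms concrete, here is a minimal sketch of time-window correlation combined with fingerprint deduplication. It is an in-memory illustration rather than the Redis-backed implementation. The `Alert` and `Incident` shapes, the fingerprint recipe (tenant, service, normalized title), and the sliding-window rule are all assumptions.

```typescript
interface Alert {
  tenantId: string;
  service: string;
  title: string;
  receivedAtMs: number;
}

interface Incident {
  fingerprint: string;
  firstSeenMs: number;
  lastSeenMs: number;
  alertCount: number;
}

// Assumed fingerprint: tenant + service + case-insensitive, trimmed title.
function fingerprint(alert: Alert): string {
  return [alert.tenantId, alert.service, alert.title.trim().toLowerCase()].join("|");
}

class Correlator {
  private open = new Map<string, Incident>();
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  /**
   * Folds the alert into the open incident with the same fingerprint if that
   * incident was last seen within the window; otherwise opens a new incident.
   */
  ingest(alert: Alert): Incident {
    const fp = fingerprint(alert);
    const existing = this.open.get(fp);
    if (existing !== undefined && alert.receivedAtMs - existing.lastSeenMs <= this.windowMs) {
      existing.lastSeenMs = alert.receivedAtMs;
      existing.alertCount += 1;
      return existing;
    }
    const incident: Incident = {
      fingerprint: fp,
      firstSeenMs: alert.receivedAtMs,
      lastSeenMs: alert.receivedAtMs,
      alertCount: 1,
    };
    this.open.set(fp, incident);
    return incident;
  }
}
```

Because the tenant ID is part of the fingerprint, alerts from different tenants never merge in this sketch. The missing Service-Affinity matching would need an additional grouping key beyond the fingerprint.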

P4: Lightweight IDP

  • Epic 1: AWS Discovery Scanner 🟡 PARTIAL - ECS, Lambda, and RDS resource discovery implemented. Missing CloudFormation, API Gateway, and Step Functions orchestration.
  • Epic 2: GitHub Discovery Scanner 🟡 PARTIAL - Repository fetching, pagination, and basic package.json/Dockerfile heuristics implemented. Missing advanced CODEOWNERS and commit history parsing.
  • Epic 3: Service Catalog 🟡 PARTIAL - Catalog ingestion, partial update staging, ownership resolution, and DB APIs implemented. Missing PagerDuty/OpsGenie on-call mapping.
  • Epic 4: Search Engine 🟡 PARTIAL - Meilisearch integration with PostgreSQL fallback implemented. Missing Redis prefix caching for Cmd+K performance optimization.
  • Epic 5: Dashboard API IMPLEMENTED - Service CRUD and ownership summary endpoints are fully functional and align with Console requirements.
  • Epic 6: Analytics Dashboards MISSING - API endpoints for Ownership Coverage, Health Scorecards, and Tech Debt tracking are missing.
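
The prefix caching that Epic 4 lists as missing could take roughly the following shape. This is only a sketch under stated assumptions: `PrefixCache`, `SearchFn`, and the TTL policy are hypothetical names and choices, and an in-process `Map` stands in for Redis.

```typescript
// Hypothetical search backend signature (e.g. Meilisearch with a PostgreSQL fallback).
type SearchFn = (tenantId: string, prefix: string) => Promise<string[]>;

interface CacheEntry {
  results: Promise<string[]>;
  expiresAtMs: number;
}

class PrefixCache {
  private entries = new Map<string, CacheEntry>();
  private search: SearchFn;
  private ttlMs: number;
  private now: () => number;

  constructor(search: SearchFn, ttlMs: number, now: () => number = () => Date.now()) {
    this.search = search;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  /**
   * Returns cached results for the normalized prefix, or runs the search once
   * and caches the pending promise so concurrent keystrokes share one request.
   * Expired entries are only replaced on the next lookup of the same key.
   */
  lookup(tenantId: string, rawQuery: string): Promise<string[]> {
    const prefix = rawQuery.trim().toLowerCase();
    const key = tenantId + ":" + prefix;
    const nowMs = this.now();
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAtMs > nowMs) {
      return hit.results;
    }
    const results = this.search(tenantId, prefix);
    this.entries.set(key, { results: results, expiresAtMs: nowMs + this.ttlMs });
    // Do not keep failed lookups in the cache.
    results.catch(() => {
      this.entries.delete(key);
    });
    return results;
  }
}
```

Caching the promise rather than the resolved value is a design choice: it collapses bursts of identical Cmd+K queries into a single backend call.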

P5: AWS Cost Anomaly

  • Epic 1: CloudTrail Ingestion MISSING - A batch ingestion API exists, but the AWS EventBridge cross-account rules, SQS FIFO, and Lambda normalizer are entirely missing.
  • Epic 2: Anomaly Detection 🟡 PARTIAL - Welford's algorithm and basic Z-Score computation are implemented. Missing novelty scoring, cold-start fast path, and composite scoring logic.
  • Epic 3: Zombie Hunter MISSING - No scheduled jobs or logic to detect idle EC2, RDS, or EBS resources.
  • Epic 4: Notification & Remediation 🟡 PARTIAL - Slack notification generation is implemented. Missing the /slack/interactions endpoint to process remediation buttons (e.g., Stop Instance).
  • Epic 6 & 7: Dashboard UI & API IMPLEMENTED - Anomalies, Baselines, and Governance rule CRUD endpoints match Console expectations.
  • Epic 10: Transparent Factory 🟡 PARTIAL - The 14-day GovernanceEngine auto-promotion (Shadow -> Audit -> Enforce) and Panic Mode logic are implemented. Missing Circuit Breakers and OTEL spans.
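
For reference, the Welford running statistics and basic Z-Score mentioned under Epic 2 can be sketched as follows. This is a generic textbook formulation, not the service's actual scorer. The `RunningStats` name and the choice of sample variance (n - 1 denominator) are assumptions.

```typescript
// Welford's online algorithm: mean and variance in one pass, O(1) memory.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the mean

  update(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  get count(): number {
    return this.n;
  }

  get average(): number {
    return this.mean;
  }

  /** Sample variance; 0 until at least two samples have been seen. */
  get variance(): number {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0;
  }

  /** Z-score of x against the baseline; 0 when the spread is still undefined. */
  zScore(x: number): number {
    const sd = Math.sqrt(this.variance);
    return sd === 0 ? 0 : (x - this.mean) / sd;
  }
}

// Example: build a baseline from hourly cost samples, then score a new value.
const baseline = new RunningStats();
[2, 4, 4, 4, 5, 5, 7, 9].forEach((cost) => baseline.update(cost));
const score = baseline.zScore(9);
```

A baseline with fewer than two samples yields a Z-score of 0 here, which is the kind of case a separate cold-start path would need to handle.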

P6: Runbook Automation

  • Epic 1: Runbook Parser MISSING - The system currently expects raw YAML inputs. Confluence HTML, Notion Markdown, and LLM step extraction parsing engines are entirely missing.
  • Epic 2: Action Classifier MISSING - Neither the deterministic regex safety scanner nor the secondary LLM risk classifier exists.
  • Epic 3: Execution Engine 🟡 PARTIAL - Basic state transitions are handled in api/runbooks.ts. Missing Trust Level enforcement, network partition recovery, and step idempotency logic.
  • Epic 4: Agent MISSING - No Go agent binary, gRPC bidirectional streaming, or local sandbox execution environments exist.
  • Epic 5: Audit Trail 🟡 PARTIAL - Basic Postgres audit_entries table exists. Missing the immutable append-only hash chain logic and CSV/PDF compliance export APIs.
  • Epic 6: Dashboard API IMPLEMENTED - Runbook, execution, and approval APIs are implemented. Redis pub/sub Agent Bridge exists. Slackbot interaction handlers are fully implemented with signature verification.
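
The append-only hash chain that Epic 5 lists as missing could be approached along these lines. This is a sketch only: the `AuditEntry` fields, the genesis value, and the JSON-array serialization are assumptions, not the spec's required format. It uses Node's built-in `createHash`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; the real audit_entries columns may differ.
interface AuditEntry {
  seq: number;
  actor: string;
  action: string;
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64);

// Hash covers the entry's content plus the previous entry's hash.
function hashEntry(seq: number, actor: string, action: string, prevHash: string): string {
  return createHash("sha256")
    .update(JSON.stringify([seq, actor, action, prevHash]))
    .digest("hex");
}

/** Returns a new chain with one entry appended; existing entries are untouched. */
function appendEntry(chain: AuditEntry[], actor: string, action: string): AuditEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS_HASH;
  const seq = chain.length;
  const entry: AuditEntry = {
    seq: seq,
    actor: actor,
    action: action,
    prevHash: prevHash,
    hash: hashEntry(seq, actor, action, prevHash),
  };
  return chain.concat([entry]);
}

/** Recomputes every hash and link; any edit, reorder, or deletion fails the check. */
function verifyChain(chain: AuditEntry[]): boolean {
  let expectedPrev = GENESIS_HASH;
  for (let i = 0; i < chain.length; i++) {
    const e = chain[i];
    const recomputed = hashEntry(e.seq, e.actor, e.action, e.prevHash);
    if (e.seq !== i || e.prevHash !== expectedPrev || e.hash !== recomputed) {
      return false;
    }
    expectedPrev = e.hash;
  }
  return true;
}
```

Truncating the tail of the chain is not detectable by this check alone. Anchoring the latest hash somewhere outside the table would be one way to cover that case.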

Priority Ranking (What to Implement Next)

This ranking aims to minimize time-to-value: it prioritizes services where the Console UI is already supported, the backend logic is mostly complete, and the remaining gaps are well-defined.

1. P4 - Lightweight IDP

  • Why: It is functionally the most complete. The Console APIs work, Meilisearch sync works, and basic AWS/GitHub discovery is operational.
  • Next Steps: Implement the missing AWS scanners (CloudFormation, API Gateway) and the Redis prefix caching for search. Add the analytics endpoints (Ownership, Health, Tech Debt) to unlock the remaining UI views.

2. P3 - Alert Intelligence

  • Why: The core pipeline (Webhook -> Redis -> Worker -> DB) is functional and deduplication logic works. Console APIs are satisfied.
  • Next Steps: Build the MTTR and Noise Reduction analytics SQL queries, add PagerDuty escalation triggers, and implement the interactive Slack button handlers.
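
As a rough illustration of what the two analytics would compute, the sketch below expresses them in TypeScript rather than SQL. The field names and both definitions are assumptions, since the spec's exact formulas are not quoted here: MTTR is taken as the mean open-to-resolve time over resolved incidents, and noise reduction as the share of raw alerts that did not become separate incidents.

```typescript
// Hypothetical projection of an incident row.
interface IncidentTimes {
  openedAtMs: number;
  resolvedAtMs: number | null; // null while the incident is still open
}

/** Mean time to resolve in minutes, ignoring open incidents; null if none are resolved. */
function mttrMinutes(incidents: IncidentTimes[]): number | null {
  let totalMs = 0;
  let resolved = 0;
  for (const incident of incidents) {
    if (incident.resolvedAtMs !== null) {
      totalMs += incident.resolvedAtMs - incident.openedAtMs;
      resolved += 1;
    }
  }
  return resolved === 0 ? null : totalMs / resolved / 60000;
}

/** Percentage of raw alerts absorbed by correlation; null when there were no alerts. */
function noiseReductionPct(rawAlerts: number, incidents: number): number | null {
  if (rawAlerts <= 0) {
    return null;
  }
  return (1 - incidents / rawAlerts) * 100;
}
```

Returning `null` instead of 0 for empty inputs lets the Console distinguish "no data yet" from a genuinely zero value.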

3. P5 - AWS Cost Anomaly

  • Why: The complex math (Welford running stats) and database governance logic are done, making the dashboard functional for demo data.
  • Next Steps: The biggest blocker is that there is no data pipeline. Implement the CDK stack to deploy the EventBridge rules and the Lambda Normalizer to translate CloudTrail events into the existing /v1/ingest API.
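
The core of such a Lambda normalizer might look like the sketch below. The input fields (`eventID`, `eventTime`, `eventSource`, `eventName`, `awsRegion`, `recipientAccountId`) are standard CloudTrail record fields. The output `CostEvent` shape, the allowlist of event names, and the function name are hypothetical, since the payload expected by /v1/ingest is not described here.

```typescript
// Subset of a CloudTrail record that the normalizer reads.
interface CloudTrailRecord {
  eventID: string;
  eventTime: string; // ISO 8601 timestamp
  eventSource: string; // e.g. "ec2.amazonaws.com"
  eventName: string; // e.g. "RunInstances"
  awsRegion: string;
  recipientAccountId: string;
}

// Hypothetical ingest payload; the real /v1/ingest schema may differ.
interface CostEvent {
  id: string;
  accountId: string;
  region: string;
  service: string;
  action: string;
  occurredAt: string;
}

// Assumed allowlist of resource-creating calls that tend to affect cost.
const COST_RELEVANT_EVENTS = ["RunInstances", "CreateDBInstance", "CreateNatGateway"];

/** Maps a CloudTrail record to a CostEvent, or null if the event is not cost relevant. */
function normalizeRecord(record: CloudTrailRecord): CostEvent | null {
  if (COST_RELEVANT_EVENTS.indexOf(record.eventName) === -1) {
    return null;
  }
  return {
    id: record.eventID,
    accountId: record.recipientAccountId,
    region: record.awsRegion,
    service: record.eventSource.replace(/\.amazonaws\.com$/, ""),
    action: record.eventName,
    occurredAt: new Date(record.eventTime).toISOString(),
  };
}
```

Filtering before forwarding keeps read-only API calls out of the ingest path, which would otherwise dominate CloudTrail volume.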

4. P6 - Runbook Automation

  • Why: The API orchestration, Slack integrations, and Redis Pub/Sub bridges are implemented, but with no parser or agent to act on them the service is currently a "brain without a body."
  • Next Steps: It requires two massive standalone systems: the Runbook Parser (LLM + AST logic) and the actual external Agent (Go binary with gRPC and sandboxing).

5. P2 - IaC Drift Detection

  • Why: It is the furthest from completion. While the SaaS API exists, the service still needs a highly complex external Go agent capable of reading Terraform/K8s/Pulumi state, a secure mTLS CA registration system, and a diffing/scoring engine, none of which currently exist.