dd0c Platform - BDD Specification Gap Analysis
Executive Summary
This gap analysis compares the BDD acceptance specifications against the currently implemented Node.js/Fastify source code and PostgreSQL database migrations for the dd0c monorepo (P2-P6).
Overall, the Dashboard APIs required by the React Console are largely implemented across all services, so the frontend will render and operate. The major gaps are the out-of-band background workers, external agents, robust message queuing (SQS/DLQ), and advanced intelligence/scoring heuristics.
Estimated Implementation Completion:
- P4 - Lightweight IDP: ~75% (Core scanners, catalog, and search are functional)
- P3 - Alert Intelligence: ~65% (Ingestion, basic correlation, and UI APIs are solid)
- P5 - AWS Cost Anomaly: ~50% (Scorer and APIs exist, but CloudTrail ingestion is missing)
- P6 - Runbook Automation: ~40% (APIs and Slackbot exist; parsing, classification, and agent execution are completely missing)
- P2 - IaC Drift Detection: ~30% (SaaS ingestion APIs exist; the entire external agent, mTLS, and diff engines are missing)
Per-Service Breakdown by Epic
P2: IaC Drift Detection
- Epic 1: Drift Detection Agent ❌ MISSING - No Go agent binary. Terraform, CloudFormation, Kubernetes, and Pulumi state scanning engines do not exist. Secret scrubbing logic is missing.
- Epic 2: Agent Communication 🟡 PARTIAL - Basic HTTP ingestion route exists (/v1/ingest/drift), but mTLS authentication and SQS FIFO message queues are not implemented.
- Epic 3: Event Processor 🟡 PARTIAL - Ingestion, nonce replay prevention, and PostgreSQL persistence with RLS are implemented. Missing canonical schema normalization and chunked report reassembly.
- Epic 4: Notification Engine 🟡 PARTIAL - Slack Block Kit, Email (Resend), Webhook, and PagerDuty dispatchers are implemented. Missing Daily Digest job and severity-based routing logic.
- Epic 5: Remediation ❌ MISSING - Interactive Slack buttons exist in notification payloads, but the backend workflow engine, approval tracking, and agent-side execution dispatch are missing.
- Epic 6 & 7: Dashboard UI & API ✅ IMPLEMENTED - fetchStacks, fetchStackHistory, and fetchLatestReport endpoints are fully implemented with tenant RLS.
- Epic 8 & 9: Infrastructure / PLG ❌ MISSING - No CDK templates, CI/CD pipelines, Stripe billing, or CLI setup logic.
- Epic 10: Transparent Factory 🟡 PARTIAL - Database migrations and RLS are implemented. Missing Feature Flag service and OTEL Tracing.
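The nonce replay prevention listed under Epic 3 can be illustrated with a minimal in-memory sketch. It assumes each drift report carries a unique nonce and a client timestamp. The `NonceGuard` class, the window length, and the in-memory `Map` are hypothetical stand-ins and do not describe how the implemented service stores nonces.

```typescript
// Hypothetical sketch: reject a report whose nonce was already seen inside
// the acceptance window, or whose timestamp falls outside that window.
class NonceGuard {
  private seen = new Map<string, number>(); // nonce -> expiry (ms since epoch)
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  /** Returns true when the timestamp is fresh and the nonce is new. */
  accept(nonce: string, timestampMs: number, nowMs: number = Date.now()): boolean {
    // Forget nonces whose timestamps can no longer pass the freshness check.
    this.seen.forEach((expiry, key) => {
      if (expiry < nowMs) this.seen.delete(key);
    });
    // Stale or far-future timestamps are rejected outright.
    if (Math.abs(nowMs - timestampMs) > this.windowMs) return false;
    // A nonce already recorded inside the window is a replay.
    if (this.seen.has(nonce)) return false;
    this.seen.set(nonce, timestampMs + this.windowMs);
    return true;
  }
}
```

Pairing the nonce with a timestamp bounds how long each nonce has to be remembered, which is why the sketch can safely evict entries once they expire.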
P3: Alert Intelligence
- Epic 1: Webhook Ingestion 🟡 PARTIAL - Webhook routes and HMAC validation for Datadog, PagerDuty, OpsGenie, and Grafana are implemented, with accepted payloads handed to a Redis queue. Missing S3 archival, oversized payload handling, and SQS/DLQ.
- Epic 2: Alert Normalization 🟡 PARTIAL - Basic provider mapping logic exists in webhook-processor.ts.
- Epic 3: Correlation Engine 🟡 PARTIAL - Time-window correlation and fingerprint deduplication are implemented using Redis. Missing Service-Affinity matching and strict cross-tenant worker isolation.
- Epic 4: Notification & Escalation 🟡 PARTIAL - Slack, Email, and Webhook dispatchers are implemented. Missing PagerDuty auto-escalation cron and Daily Noise Report.
- Epic 5: Slack Bot 🟡 PARTIAL - Missing interactive feedback button handlers (/slack/interactions) for noise/helpful marking, and missing /dd0c slash commands.
- Epic 6 & 7: Dashboard UI & API 🟡 PARTIAL - Incident CRUD, filtering, and summary endpoints are implemented. Missing MTTR and Noise Reduction analytics endpoints requested by the spec.
- Epic 8 & 9: Infrastructure / PLG ❌ MISSING - No CDK, Stripe billing, or Free Tier (10K alerts/month) limit enforcement.
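The HMAC validation mentioned under Epic 1 can be sketched generically as follows. Each provider (Datadog, PagerDuty, OpsGenie, Grafana) defines its own header name and signing scheme, so the hex-encoded HMAC-SHA256 over the raw body used here is an assumption for illustration. `verifyWebhookSignature` and `signPayload` are hypothetical names, not functions from the codebase.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: compute a hex HMAC-SHA256 signature over the raw body.
function signPayload(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Hypothetical helper: check a received hex signature against the expected one.
function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  // Constant-time comparison avoids leaking how many bytes matched.
  return timingSafeEqual(received, expected);
}
```

The signature has to be computed over the exact bytes received, so a handler would need access to the raw request body rather than a re-serialized JSON object.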
P4: Lightweight IDP
- Epic 1: AWS Discovery Scanner 🟡 PARTIAL - ECS, Lambda, and RDS resource discovery implemented. Missing CloudFormation, API Gateway, and Step Functions orchestration.
- Epic 2: GitHub Discovery Scanner 🟡 PARTIAL - Repository fetching, pagination, and basic package.json/Dockerfile heuristics implemented. Missing advanced CODEOWNERS and commit history parsing.
- Epic 3: Service Catalog 🟡 PARTIAL - Catalog ingestion, partial update staging, ownership resolution, and DB APIs implemented. Missing PagerDuty/OpsGenie on-call mapping.
- Epic 4: Search Engine 🟡 PARTIAL - Meilisearch integration with PostgreSQL fallback implemented. Missing Redis prefix caching for Cmd+K performance optimization.
- Epic 5: Dashboard API ✅ IMPLEMENTED - Service CRUD and ownership summary endpoints are fully functional and align with Console requirements.
- Epic 6: Analytics Dashboards ❌ MISSING - API endpoints for Ownership Coverage, Health Scorecards, and Tech Debt tracking are missing.
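The missing prefix cache for Cmd+K search (Epic 4) could look roughly like the sketch below. It uses an in-memory `Map` in place of Redis, and the `PrefixCache` class, the tenant-scoped key format, and the TTL are all assumptions made for illustration, since the spec's actual cache design is not shown here.

```typescript
// Hypothetical sketch: cache search results per tenant and typed prefix,
// with a short TTL so catalog updates become visible quickly.
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // ms since epoch
}

class PrefixCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Normalize the query so "Pay" and " pay " share one cache entry.
  private key(tenantId: string, query: string): string {
    return `${tenantId}:${query.trim().toLowerCase()}`;
  }

  get(tenantId: string, query: string, nowMs: number = Date.now()): T | undefined {
    const k = this.key(tenantId, query);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    if (entry.expiresAt <= nowMs) {
      this.entries.delete(k); // expired: fall through to the search backend
      return undefined;
    }
    return entry.value;
  }

  set(tenantId: string, query: string, value: T, nowMs: number = Date.now()): void {
    this.entries.set(this.key(tenantId, query), { value, expiresAt: nowMs + this.ttlMs });
  }
}
```

Including the tenant in the key keeps one tenant's cached results from being served to another, in line with the RLS-based isolation used elsewhere in the platform.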
P5: AWS Cost Anomaly
- Epic 1: CloudTrail Ingestion ❌ MISSING - A batch ingestion API exists, but the AWS EventBridge cross-account rules, SQS FIFO, and Lambda normalizer are entirely missing.
- Epic 2: Anomaly Detection 🟡 PARTIAL - Welford's algorithm and basic Z-Score computation are implemented. Missing novelty scoring, cold-start fast path, and composite scoring logic.
- Epic 3: Zombie Hunter ❌ MISSING - No scheduled jobs or logic to detect idle EC2, RDS, or EBS resources.
- Epic 4: Notification & Remediation 🟡 PARTIAL - Slack notification generation is implemented. Missing the /slack/interactions endpoint to process remediation buttons (e.g., Stop Instance).
- Epic 6 & 7: Dashboard UI & API ✅ IMPLEMENTED - Anomalies, Baselines, and Governance rule CRUD endpoints match Console expectations.
- Epic 10: Transparent Factory 🟡 PARTIAL - The 14-day GovernanceEngine (Shadow -> Audit -> Enforce) auto-promotion and Panic Mode logic is implemented. Missing Circuit Breakers and OTEL spans.
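Epic 2 relies on Welford's algorithm and a basic Z-Score. The following is a minimal sketch of that combination, assuming a single numeric cost series per resource. The `RunningStats` class and its method names are hypothetical and not taken from the codebase.

```typescript
// Hypothetical sketch of Welford's online algorithm: mean and variance are
// updated one sample at a time, without storing the full history.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the mean

  update(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  average(): number {
    return this.mean;
  }

  /** Sample variance; 0 until at least two samples have been seen. */
  variance(): number {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0;
  }

  /** Z-Score of x against the baseline; 0 when the baseline has no spread. */
  zScore(x: number): number {
    const sd = Math.sqrt(this.variance());
    return sd === 0 ? 0 : (x - this.mean) / sd;
  }
}

const baseline = new RunningStats();
[10, 12, 14].forEach((cost) => baseline.update(cost));
const score = baseline.zScore(18); // mean 12, sample std dev 2, so the score is 3
```

Returning 0 for a flat or single-sample baseline is one possible convention. It also shows why a separate cold-start path is listed as a gap: with very little data the Z-Score carries almost no signal.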
P6: Runbook Automation
- Epic 1: Runbook Parser ❌ MISSING - The system currently expects raw YAML inputs. Confluence HTML, Notion Markdown, and LLM step extraction parsing engines are entirely missing.
- Epic 2: Action Classifier ❌ MISSING - Neither the deterministic regex safety scanner nor the secondary LLM risk classifier exists.
- Epic 3: Execution Engine 🟡 PARTIAL - Basic state transitions are handled in api/runbooks.ts. Missing Trust Level enforcement, network partition recovery, and step idempotency logic.
- Epic 4: Agent ❌ MISSING - No Go agent binary, gRPC bidirectional streaming, or local sandbox execution environments exist.
- Epic 5: Audit Trail 🟡 PARTIAL - Basic Postgres audit_entries table exists. Missing the immutable append-only hash chain logic and CSV/PDF compliance export APIs.
- Epic 6: Dashboard API ✅ IMPLEMENTED - Runbook, execution, and approval APIs are implemented. Redis pub/sub Agent Bridge exists. Slackbot interaction handlers are fully implemented with signature verification.
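The append-only hash chain that Epic 5 lists as missing can be sketched as below. Each entry commits to the hash of the previous one, so editing any earlier row invalidates every later hash. The `AuditEntry` shape, the field names, and the `|`-delimited hashing input are assumptions for illustration and do not reflect the actual audit_entries schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; the real audit_entries columns may differ.
interface AuditEntry {
  seq: number;
  action: string;
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64); // placeholder predecessor for the first entry

function hashEntry(seq: number, action: string, prevHash: string): string {
  return createHash("sha256").update(`${seq}|${action}|${prevHash}`).digest("hex");
}

// Returns a new chain with one entry appended; existing entries are never modified.
function appendEntry(chain: AuditEntry[], action: string): AuditEntry[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS_HASH;
  const seq = chain.length;
  return [...chain, { seq, action, prevHash, hash: hashEntry(seq, action, prevHash) }];
}

// Recomputes every hash from the start and checks each link to its predecessor.
function verifyChain(chain: AuditEntry[]): boolean {
  let prevHash = GENESIS_HASH;
  let expectedSeq = 0;
  for (const entry of chain) {
    if (entry.seq !== expectedSeq || entry.prevHash !== prevHash) return false;
    if (entry.hash !== hashEntry(entry.seq, entry.action, entry.prevHash)) return false;
    prevHash = entry.hash;
    expectedSeq += 1;
  }
  return true;
}
```

A compliance export could run a check like `verifyChain` before producing output, so a tampered trail is detected rather than silently exported.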
Priority Ranking (What to Implement Next)
This ranking aims to minimize time-to-value: it prioritizes services where the Console UI is already supported, the backend logic is mostly complete, and the remaining gaps are well-defined.
1. P4 - Lightweight IDP
- Why: It is functionally the most complete. The Console APIs work, Meilisearch sync works, and basic AWS/GitHub discovery is operational.
- Next Steps: Implement the missing AWS scanners (CloudFormation, API Gateway) and the Redis prefix caching for search. Add the analytics endpoints (Ownership, Health, Tech Debt) to unlock the remaining UI views.
2. P3 - Alert Intelligence
- Why: The core pipeline (Webhook -> Redis -> Worker -> DB) is functional and deduplication logic works. Console APIs are satisfied.
- Next Steps: Build the MTTR and Noise Reduction analytics SQL queries, add PagerDuty escalation triggers, and implement the interactive Slack button handlers.
3. P5 - AWS Cost Anomaly
- Why: The complex math (Welford running stats) and database governance logic are done, making the dashboard functional for demo data.
- Next Steps: The biggest blocker is that there is no data pipeline. Implement the CDK stack to deploy the EventBridge rules and the Lambda Normalizer to translate CloudTrail events into the existing /v1/ingest API.
4. P6 - Runbook Automation
- Why: The API orchestration, Slack integrations, and Redis Pub/Sub bridges are nicely implemented, but it is currently a "brain without a body."
- Next Steps: It requires two massive standalone systems: the Runbook Parser (LLM + AST logic) and the actual external Agent (Go binary with gRPC and sandboxing).
5. P2 - IaC Drift Detection
- Why: Furthest from completion. The SaaS API exists, but the service depends on a highly complex external Go agent capable of reading Terraform/K8s/Pulumi state, a secure mTLS CA registration system, and a diffing/scoring engine, none of which currently exist.