# dd0c/run — V1 MVP Epics

This document breaks down the V1 MVP of dd0c/run into implementation Epics and User Stories. Scope is strictly limited to V1 (read-only execution 🟢, copilot approval for 🟡/🔴, no autopilot, no terminal watcher).

---

## Epic 1: Runbook Parser
**Description:** The "5-second wow moment." Ingest raw unstructured text from a paste event, normalize it, and use a fast LLM (e.g., Claude Haiku via dd0c/route) to extract a structured JSON representation of executable steps, variables, conditional branches, and prerequisites in under 5 seconds.

**Dependencies:** None. This is the foundational data structure.

**Technical Notes:**
- Pure Rust normalizer to strip HTML/Markdown before hitting the LLM to save tokens and latency.
- LLM prompt must enforce a strict JSON schema.
- Deterministic parsing: the same text at temperature 0 should produce the same output.
- Must run in < 3.5s to leave room for the Action Classifier within the 5s SLA.

### User Stories

**Story 1.1: Raw Text Normalization & Ingestion**
*As a runbook author (Morgan), I want to paste raw text from Confluence or Notion into the system, so that I don't have to learn a proprietary YAML DSL.*
- **Acceptance Criteria:**
  - System accepts raw text payload via API.
  - Rust-based normalizer strips HTML tags, markdown formatting, and normalizes whitespace/bullet points.
  - Original raw text and hash are preserved in the DB for audit/re-parsing.
- **Story Points:** 2
- **Dependencies:** None

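The normalization step in Story 1.1 could be sketched roughly as below. This is a minimal, std-only illustration rather than the project's normalizer: the function names, the `HTML_TAGS` list, and the choice of `DefaultHasher` are assumptions (a production audit hash would more likely be a cryptographic digest). Unknown `<...>` spans are deliberately kept so that placeholders such as `<instance-id>` survive for Story 1.3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical tag list; anything else in angle brackets is left untouched.
const HTML_TAGS: &[&str] = &["p", "br", "div", "span", "ul", "ol", "li", "b", "i", "strong", "em", "code", "pre", "a"];

fn is_html_tag(inner: &str) -> bool {
    let name = inner
        .trim_start_matches('/')
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric())
        .collect::<String>()
        .to_ascii_lowercase();
    HTML_TAGS.iter().any(|t| *t == name.as_str())
}

/// Replace recognized HTML tags with a space; keep other `<...>` text as-is.
fn strip_tags(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(open) = rest.find('<') {
        out.push_str(&rest[..open]);
        match rest[open..].find('>') {
            Some(close) if is_html_tag(&rest[open + 1..open + close]) => {
                out.push(' ');
                rest = &rest[open + close + 1..];
            }
            _ => {
                out.push('<');
                rest = &rest[open + 1..];
            }
        }
    }
    out.push_str(rest);
    out
}

/// Strip tags, heading and bold markers, collapse whitespace, unify bullets.
pub fn normalize(raw: &str) -> String {
    let text = strip_tags(raw);
    let mut lines: Vec<String> = Vec::new();
    for line in text.lines() {
        let collapsed = line.split_whitespace().collect::<Vec<_>>().join(" ");
        let cleaned = collapsed.trim_start_matches('#').trim_start().replace("**", "");
        if cleaned.is_empty() {
            continue;
        }
        let bullet = ["* ", "• ", "+ ", "- "].iter().find_map(|b| cleaned.strip_prefix(*b));
        let normalized = match bullet {
            Some(item) => format!("- {}", item.trim_start()),
            None => cleaned.clone(),
        };
        lines.push(normalized);
    }
    lines.join("\n")
}

/// Non-cryptographic fingerprint of the original text (placeholder for a real digest).
pub fn fingerprint(raw: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    raw.hash(&mut hasher);
    hasher.finish()
}
```

Calling `normalize` on a pasted Confluence fragment yields plain lines with `- ` bullets, while `fingerprint` is computed over the untouched original so the raw text can be re-parsed later.
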
**Story 1.2: LLM Structured Step Extraction**
*As a system, I want to pass normalized text to a fast LLM to extract an ordered JSON array of steps and commands, so that the runbook becomes machine-readable.*
- **Acceptance Criteria:**
  - Sends normalized text to `dd0c/route` with a strict JSON schema prompt.
  - Correctly extracts step order, natural language description, and embedded CLI/shell commands.
  - P95 latency for extraction is < 3.5 seconds.
  - Returns a graceful error if the text contains no actionable steps.
- **Story Points:** 3
- **Dependencies:** 1.1

**Story 1.3: Variable & Prerequisite Detection**
*As an on-call engineer (Riley), I want the parser to identify implicit prerequisites (like VPN) and variables (like `<instance-id>`), so that I'm fully prepared before the runbook starts.*
- **Acceptance Criteria:**
  - Regex/heuristic scanner identifies common placeholders (`$VAR`, `<var>`, `{var}`).
  - LLM identifies implicit requirements ("ensure you are on VPN", "requires prod AWS profile").
  - Outputs a structured array of `variables` and `prerequisites` in the JSON payload.
- **Story Points:** 2
- **Dependencies:** 1.2

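As an illustration of the heuristic half of Story 1.3, the placeholder scan might look like the sketch below. It is hand-rolled with the standard library only (the story allows regex or heuristics); `find_placeholders` and `is_name_char` are hypothetical names, and the rule that hyphens are allowed only inside bracketed forms is an assumption.

```rust
/// Identifier characters; hyphens are accepted only inside `<...>` and `{...}`.
fn is_name_char(c: char, allow_dash: bool) -> bool {
    c.is_ascii_alphanumeric() || c == '_' || (allow_dash && c == '-')
}

/// Collect unique placeholder names written as `$VAR`, `<var>`, or `{var}`,
/// in order of first appearance.
pub fn find_placeholders(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut found: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Which closing character (if any) must follow the name?
        let close: Option<char> = match chars[i] {
            '$' => None,
            '<' => Some('>'),
            '{' => Some('}'),
            _ => {
                i += 1;
                continue;
            }
        };
        let start = i + 1;
        let mut end = start;
        while end < chars.len() && is_name_char(chars[end], close.is_some()) {
            end += 1;
        }
        let name: String = chars[start..end].iter().collect();
        let closed = match close {
            None => true,
            Some(c) => end < chars.len() && chars[end] == c,
        };
        if !name.is_empty() && closed {
            if !found.contains(&name) {
                found.push(name);
            }
            i = end; // continue scanning after the identifier
        } else {
            i += 1;
        }
    }
    found
}
```

A shell redirect such as `cmd < file` is not reported because no identifier directly follows the `<`; the LLM pass described in the same story would still be needed for implicit prerequisites.
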
**Story 1.4: Branching & Ambiguity Highlighting**
*As a runbook author (Morgan), I want the system to map conditional logic (if/else) and flag vague instructions, so that my runbooks are deterministic and clear.*
- **Acceptance Criteria:**
  - Parser detects simple "if X then Y" statements and maps them to a step execution DAG (Directed Acyclic Graph).
  - Identifies vague steps (e.g., "check the logs" without specifying which logs) and adds them to an `ambiguities` array for author review.
- **Story Points:** 3
- **Dependencies:** 1.2

## Epic 2: Action Classifier
**Description:** The safety-critical core of the system. Classifies every extracted command as 🟢 Safe, 🟡 Caution, 🔴 Dangerous, or ⬜ Unknown. Uses a defense-in-depth "dual-key" architecture: a deterministic compiled Rust scanner overrides an advisory LLM classifier. A misclassification here is an existential risk.

**Dependencies:** Epic 1 (needs parsed steps to classify)

**Technical Notes:**
- The deterministic scanner must use tree-sitter or compiled regex sets.
- Merge rules must be hardcoded in Rust (not configurable).
- LLM classification must run in parallel across steps to meet the 5-second total parse+classify SLA.
- If the LLM confidence is low or unknown, risk escalates. Scanner wins disagreements.

### User Stories

**Story 2.1: Deterministic Safety Scanner**
*As a system, I want to pattern-match commands against a compiled database of known signatures, so that destructive commands are definitively blocked without relying on an LLM.*
- **Acceptance Criteria:**
  - Rust library with compiled regex sets (allowlist 🟢, caution 🟡, blocklist 🔴).
  - Uses tree-sitter to parse SQL/shell ASTs to detect destructive patterns (e.g., `DELETE` without `WHERE`, piped `rm -rf`).
  - Executes in < 1ms per command.
  - Defaults to `Unknown` (🟡 minimum) if no patterns match.
- **Story Points:** 5
- **Dependencies:** None (standalone library)

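To make the scanner's precedence concrete, here is a deliberately simplified sketch. It uses plain substring checks in place of the compiled regex sets and tree-sitter ASTs the story calls for, and every pattern in the three lists is an invented example, not the project's signature database. What it does show is the ordering: blocklist first, then caution, then an allowlist that must cover every pipeline segment, with `Unknown` as the fallback.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Safe,
    Caution,
    Dangerous,
    Unknown,
}

// Illustrative patterns only (assumptions, not the real signature set).
const BLOCKLIST: &[&str] = &["rm -rf", "drop table", "terminate-instances", "mkfs"];
const CAUTIONLIST: &[&str] = &["systemctl restart", "kubectl delete", "kubectl scale"];
const ALLOWLIST: &[&str] = &["kubectl get", "kubectl describe", "ls", "cat", "df"];

pub fn scan(command: &str) -> Verdict {
    let cmd = command.to_ascii_lowercase();

    // Crude stand-in for an AST check: DELETE statement with no WHERE clause.
    let delete_without_where = cmd.contains("delete from") && !cmd.contains(" where ");
    if delete_without_where || BLOCKLIST.iter().any(|p| cmd.contains(*p)) {
        return Verdict::Dangerous;
    }
    if CAUTIONLIST.iter().any(|p| cmd.contains(*p)) {
        return Verdict::Caution;
    }

    // Safe only if every segment of a pipeline or command chain is allowlisted.
    let segments: Vec<&str> = cmd
        .split(|c: char| c == '|' || c == ';' || c == '&')
        .map(str::trim)
        .filter(|seg| !seg.is_empty())
        .collect();
    let all_allowlisted = !segments.is_empty()
        && segments.iter().all(|seg| {
            ALLOWLIST
                .iter()
                .any(|p| *seg == *p || seg.starts_with(&format!("{} ", p)))
        });

    if all_allowlisted {
        Verdict::Safe
    } else {
        Verdict::Unknown
    }
}
```

Because the blocklist is checked against the whole command string, `cat files.txt | xargs rm -rf` is flagged as dangerous even though its first segment is allowlisted.
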
**Story 2.2: LLM Contextual Classifier**
*As a system, I want an LLM to assess the command in the context of surrounding steps, so that implicit state changes or custom scripts are caught.*
- **Acceptance Criteria:**
  - Sends the command, surrounding steps, and infrastructure context to an LLM via a strict JSON prompt.
  - Returns a classification (🟢/🟡/🔴), a confidence score, and suggested rollbacks.
  - Escalates the classification to the next risk tier if confidence < 0.9.
- **Story Points:** 3
- **Dependencies:** Epic 1 (requires structured context)

**Story 2.3: Classification Merge Engine**
*As an on-call engineer (Riley), I want the system to safely merge the LLM and deterministic scanner results, so that I can trust the final 🟢/🟡/🔴 label.*
- **Acceptance Criteria:**
  - Implements the 5 hardcoded merge rules exactly as architected (e.g., if Scanner says 🔴, final is 🔴; if Scanner says 🟡 and LLM says 🟢, final is 🟡).
  - The final risk level is `Safe` ONLY if both the Scanner and the LLM agree it is safe.
- **Story Points:** 2
- **Dependencies:** 2.1, 2.2

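The five architected merge rules are not listed in this document, so the sketch below is only one way to satisfy the constraints that are stated: scanner 🔴 always wins, scanner 🟡 beats LLM 🟢, `Safe` requires agreement, low LLM confidence escalates one tier (Story 2.2), and an `Unknown` scanner result counts as at least 🟡 (Story 2.1). Taking the maximum of the two severities is an assumption that happens to meet all of these; the type and function names are hypothetical.

```rust
// Variant order matters: the derived `Ord` ranks Safe < Caution < Dangerous.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Risk {
    Safe,
    Caution,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScanResult {
    Known(Risk),
    Unknown,
}

pub struct LlmVerdict {
    pub risk: Risk,
    pub confidence: f64,
}

/// Move one tier up; Dangerous stays Dangerous.
fn escalate(risk: Risk) -> Risk {
    match risk {
        Risk::Safe => Risk::Caution,
        _ => Risk::Dangerous,
    }
}

pub fn merge(scan: ScanResult, llm: &LlmVerdict) -> Risk {
    // Low-confidence LLM answers are escalated before merging.
    let llm_risk = if llm.confidence < 0.9 {
        escalate(llm.risk)
    } else {
        llm.risk
    };
    // No scanner match means the step can never be better than Caution.
    let scan_risk = match scan {
        ScanResult::Known(risk) => risk,
        ScanResult::Unknown => Risk::Caution,
    };
    // The LLM can raise the final level but never lower the scanner's.
    scan_risk.max(llm_risk)
}
```

Keeping the merge as a pure function over two small enums makes it straightforward to cover every input combination with a table-driven test.
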
**Story 2.4: Immutable Classification Audit**
*As an SRE manager (Jordan), I want every classification decision logged with full context, so that I have a forensic record of why the system made its choice.*
- **Acceptance Criteria:**
  - Emits a `runbook.classified` event to the PostgreSQL database for every step.
  - Log includes: scanner result (matched patterns), LLM reasoning, confidence scores, final classification, and merge rule applied.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 3: Execution Engine
**Description:** A state machine orchestrating the step-by-step runbook execution, enforcing the Trust Gradient (Level 0, 1, 2) at every transition. It never auto-executes 🟡 or 🔴 commands in V1. Handles timeouts, rollbacks, and tracking skipped steps.

**Dependencies:** Epics 1 & 2 (needs classified runbooks to execute)

**Technical Notes:**
- V1 is Copilot-only. State transitions must block 🔴 steps until explicitly confirmed and prompt for approval on 🟡 steps.
- Engine must communicate with the Agent via gRPC over mTLS.
- Each step execution receives a unique ID to prevent duplicate deliveries.

### User Stories

**Story 3.1: Execution State Machine**
*As a system, I want a state machine to orchestrate step-by-step runbook execution, so that no two commands execute simultaneously and the Trust Gradient is enforced.*
- **Acceptance Criteria:**
  - Starts in `Pending` state upon alert match.
  - Progresses to `AutoExecute` ONLY if the step is 🟢 and trust level allows.
  - Transitions to `AwaitApproval` for 🟡/🔴 steps, blocking execution until approved.
  - Aborts or transitions to `ManualIntervention` on timeout (e.g., 60s for 🟢, 300s for 🔴).
- **Story Points:** 5
- **Dependencies:** 2.3

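The transitions listed in Story 3.1 can be expressed as a small pure function, sketched below. `Pending`, `AutoExecute`, `AwaitApproval`, and `ManualIntervention` come from the acceptance criteria; the `Executing` state, the `Event` type, and the rule that a 🟢 step falls back to approval when the trust level does not allow auto-execution are assumptions made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Risk {
    Safe,
    Caution,
    Dangerous,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Pending,
    AutoExecute,
    AwaitApproval,
    Executing, // hypothetical: an approved step that is now running
    ManualIntervention,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    StepClassified(Risk),
    Approved,
    TimedOut,
}

/// Compute the next state; events that do not apply leave the state unchanged.
pub fn next_state(state: State, event: Event, trust_allows_auto: bool) -> State {
    match (state, event) {
        // Only Safe steps may auto-execute, and only if the trust level permits it.
        (State::Pending, Event::StepClassified(Risk::Safe)) if trust_allows_auto => {
            State::AutoExecute
        }
        // Everything else waits for a human.
        (State::Pending, Event::StepClassified(_)) => State::AwaitApproval,
        (State::AwaitApproval, Event::Approved) => State::Executing,
        // A timeout in any state hands control back to the engineer.
        (_, Event::TimedOut) => State::ManualIntervention,
        (unchanged, _) => unchanged,
    }
}
```

Because there is no arm that leads from `Pending` to `AutoExecute` for a 🟡 or 🔴 step, the "never auto-executes" rule holds by construction rather than by runtime check.
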
**Story 3.2: gRPC Agent Communication Protocol**
*As a system, I want to communicate securely with the dd0c Agent in the customer VPC, so that commands are executed safely without inbound firewall rules.*
- **Acceptance Criteria:**
  - Outbound-only gRPC streaming connection initiated by the Agent.
  - Engine sends `ExecuteStep` payload (command, timeout, risk level, environment variables).
  - Agent streams `StepOutput` (stdout/stderr) back to the Engine.
  - Agent returns `StepResult` (exit code, duration, stdout/stderr hashes) on completion.
- **Story Points:** 5
- **Dependencies:** 3.1

**Story 3.3: Rollback Integration**
*As an on-call engineer (Riley), I want every state-changing step to record its inverse command, so that I can undo it with one click if it fails.*
- **Acceptance Criteria:**
  - If a step fails, the Engine transitions to `RollbackAvailable` and awaits human approval.
  - Engine stores the rollback command before executing the forward command.
  - Executing rollback triggers an independent execution step and returns the state machine to `StepReady` or `ManualIntervention`.
- **Story Points:** 3
- **Dependencies:** 3.1

**Story 3.4: Divergence Analysis**
*As an SRE manager (Jordan), I want the system to track what the engineer actually did vs. what the runbook prescribed, so that runbooks get better over time.*
- **Acceptance Criteria:**
  - Post-execution analyzer compares prescribed steps with executed/skipped steps and any unlisted commands detected by the agent.
  - Flags skipped steps, modified commands, and unlisted actions.
  - Emits a `divergence.detected` event with suggested runbook updates.
- **Story Points:** 3
- **Dependencies:** 3.1

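A rough sketch of the comparison behind Story 3.4 follows. The input shape is an assumption: each executed command carries the index of the prescribed step it belongs to, or `None` if the agent observed it outside the runbook. The `Divergence` enum and `analyze` function are hypothetical names, and producing suggested runbook updates is out of scope here.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Divergence {
    Skipped { step: usize },
    Modified { step: usize, prescribed: String, executed: String },
    Unlisted { command: String },
}

/// Compare prescribed commands with what actually ran.
/// `executed` pairs an optional prescribed-step index with the command text.
pub fn analyze(prescribed: &[&str], executed: &[(Option<usize>, &str)]) -> Vec<Divergence> {
    let mut findings = Vec::new();

    for (idx, want) in prescribed.iter().enumerate() {
        match executed.iter().find(|entry| entry.0 == Some(idx)) {
            // No execution record for this step: it was skipped.
            None => findings.push(Divergence::Skipped { step: idx }),
            // Ran, but with different text than the runbook prescribed.
            Some(entry) if entry.1 != *want => findings.push(Divergence::Modified {
                step: idx,
                prescribed: want.to_string(),
                executed: entry.1.to_string(),
            }),
            Some(_) => {}
        }
    }

    // Commands with no matching step were never in the runbook.
    for entry in executed {
        if entry.0.is_none() {
            findings.push(Divergence::Unlisted { command: entry.1.to_string() });
        }
    }
    findings
}
```

Findings are ordered by prescribed step first and unlisted commands last, which keeps the output stable for a timeline view.
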
## Epic 4: Slack Bot Copilot
**Description:** The primary 3am interface for on-call engineers. Guides the engineer step-by-step through an active incident, enforcing the Trust Gradient via interactive buttons for 🟡 actions and explicit typed confirmations for 🔴 actions.

**Dependencies:** Epic 3 (needs execution state to drive the UI)

**Technical Notes:**
- Uses Rust + Slack Bolt SDK in Socket Mode (no inbound webhooks).
- Must use Slack Block Kit for formatting.
- Respects Slack's 1 message/sec/channel rate limit by batching rapid updates.
- Uses dedicated threads to keep main channels clean.

### User Stories

**Story 4.1: Alert Matching & Notification**
*As an on-call engineer (Riley), I want a Slack message with the relevant runbook when an alert fires, so that I don't have to search for it.*
- **Acceptance Criteria:**
  - Integrates with PagerDuty/OpsGenie webhook payloads.
  - Matches alert context (service, region, keywords) to the most relevant runbook.
  - Posts a `🔔 Runbook matched` message with [▶ Start Copilot] button in the incident channel.
- **Story Points:** 3
- **Dependencies:** None

**Story 4.2: Step-by-Step Interactive UI**
*As an on-call engineer (Riley), I want the bot to guide me through the runbook step-by-step in a thread, so that I can focus on one command at a time.*
- **Acceptance Criteria:**
  - Tapping [▶ Start Copilot] opens a thread and triggers the Execution Engine.
  - Shows 🟢 steps auto-executing with live stdout snippets.
  - Shows 🟡/🔴 steps awaiting approval with command details and rollback.
  - Includes [⏭ Skip] and [🛑 Abort] buttons for every step.
- **Story Points:** 5
- **Dependencies:** 3.1, 4.1

**Story 4.3: Risk-Aware Approval Gates**
*As a system, I want to block dangerous commands until explicit confirmation is provided, so that accidental misclicks don't cause outages.*
- **Acceptance Criteria:**
  - 🟡 steps require clicking [✅ Approve] or [✏️ Edit].
  - 🔴 steps require opening a text input modal and typing the resource name exactly.
  - Cannot bulk-approve steps. Each step must be individually gated.
  - Approver's Slack identity is captured and passed to the Execution Engine.
- **Story Points:** 3
- **Dependencies:** 4.2

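The per-step gate in Story 4.3 reduces to a small decision function, sketched here under assumptions: the `Confirmation` enum is a hypothetical model of what the Slack interaction delivered, and treating 🟢 steps as needing no confirmation follows the V1 scope rather than anything stated in this story. The comparison for 🔴 steps is exact, with no trimming or case folding, to match "typing the resource name exactly".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Risk {
    Safe,
    Caution,
    Dangerous,
}

/// What the engineer did in Slack for this one step (hypothetical model).
pub enum Confirmation<'a> {
    Missing,
    ButtonClick,
    TypedName(&'a str),
}

/// Decide whether a single step may proceed. Called once per step,
/// so there is no way to approve several steps with one interaction.
pub fn is_approved(risk: Risk, confirmation: &Confirmation, resource_name: &str) -> bool {
    match (risk, confirmation) {
        (Risk::Safe, _) => true,
        (Risk::Caution, Confirmation::ButtonClick) => true,
        // Dangerous steps need the resource name typed character for character.
        (Risk::Dangerous, Confirmation::TypedName(typed)) => *typed == resource_name,
        _ => false,
    }
}
```

A button click on a 🔴 step is rejected, which is what protects against the accidental misclick the story describes.
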
**Story 4.4: Real-Time Output Streaming**
*As an on-call engineer (Riley), I want to see the execution output of commands in real-time, so that I know what's happening.*
- **Acceptance Criteria:**
  - Slack messages update dynamically with command stdout/stderr.
  - Batches rapid output updates to respect Slack rate limits.
  - Truncates long outputs and provides a link to the full log in the dashboard.
- **Story Points:** 2
- **Dependencies:** 4.2

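Batching and truncation from Story 4.4 might be sketched as follows. Time is passed in explicitly as milliseconds so the logic stays testable; the `OutputBatcher` type, the function names, and the wording of the truncation notice are all assumptions, and no Slack API calls are shown.

```rust
/// Accumulates output chunks and releases them at most once per interval.
pub struct OutputBatcher {
    buffer: String,
    last_flush_ms: u64,
    min_interval_ms: u64,
}

impl OutputBatcher {
    pub fn new(min_interval_ms: u64) -> Self {
        Self { buffer: String::new(), last_flush_ms: 0, min_interval_ms }
    }

    /// Buffer a chunk; return everything buffered so far once the
    /// minimum interval since the previous flush has elapsed.
    pub fn push(&mut self, chunk: &str, now_ms: u64) -> Option<String> {
        self.buffer.push_str(chunk);
        if now_ms.saturating_sub(self.last_flush_ms) >= self.min_interval_ms {
            self.last_flush_ms = now_ms;
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}

/// Keep at most `max_chars` characters and append a pointer to the full log.
/// Counting `char`s rather than bytes avoids cutting a UTF-8 sequence in half.
pub fn truncate_for_slack(output: &str, max_chars: usize, log_url: &str) -> String {
    if output.chars().count() <= max_chars {
        return output.to_string();
    }
    let head: String = output.chars().take(max_chars).collect();
    format!("{}\n[output truncated, full log: {}]", head, log_url)
}
```

With `OutputBatcher::new(1000)` the bot would edit the Slack message at most once per second, in line with the rate limit noted in the Epic 4 technical notes. A real implementation would also need a final flush when the step completes.
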
## Epic 5: Audit Trail
**Description:** The compliance backbone and forensic record. An append-only, immutable, partitioned PostgreSQL log that tracks every runbook parse, classification, approval, execution, and rollback.

**Dependencies:** Epics 2, 3, 4 (needs events to log)

**Technical Notes:**
- PostgreSQL partitioned table (by month) for query performance.
- Application role must have `INSERT` and `SELECT` only. No `UPDATE` or `DELETE` grants.
- Row-Level Security (RLS) enforces tenant isolation at the database level.
- V1 includes a basic PDF/CSV export for SOC 2 readiness.

### User Stories

**Story 5.1: Append-Only Audit Schema**
*As a system, I want a partitioned database table to store all execution events immutably, so that no one can tamper with the forensic record.*
- **Acceptance Criteria:**
  - Creates a PostgreSQL table partitioned by month.
  - Revokes `UPDATE` and `DELETE` grants from the application role.
  - Schema captures `tenant_id`, `event_type`, `execution_id`, `actor_id`, `event_data` (JSONB), and `created_at`.
- **Story Points:** 2
- **Dependencies:** None

**Story 5.2: Event Ingestion Pipeline**
*As an SRE manager (Jordan), I want every action recorded in real-time, so that I know exactly who did what during an incident.*
- **Acceptance Criteria:**
  - Ingests events from Parser, Classifier, Execution Engine, and Slack Bot.
  - Logs `runbook.parsed`, `runbook.classified`, `step.auto_executed`, `step.approved`, `step.executed`, etc., exactly as architected.
  - Captures the exact command executed, exit code, and stdout/stderr hashes.
- **Story Points:** 3
- **Dependencies:** 5.1

**Story 5.3: Compliance Export**
*As an SRE manager (Jordan), I want to export timestamped execution logs, so that I can provide evidence to SOC 2 auditors.*
- **Acceptance Criteria:**
  - Generates a PDF or CSV of an execution run, including the approval chain and audit trail.
  - Includes risk classifications and any divergence/modifications made by the engineer.
  - Links to S3 for full stdout/stderr logs if needed.
- **Story Points:** 3
- **Dependencies:** 5.2

**Story 5.4: Multi-Tenant Data Isolation (RLS)**
*As a system, I want to ensure no tenant can see another tenant's data, so that security boundaries are guaranteed at the database level.*
- **Acceptance Criteria:**
  - Enables Row-Level Security (RLS) on all tenant-scoped tables (`runbooks`, `executions`, `audit_events`).
  - API middleware sets `app.current_tenant_id` session variable on every database connection.
  - Cross-tenant queries return zero rows, not an error.
- **Story Points:** 2
- **Dependencies:** 5.1

## Epic 6: Dashboard API
**Description:** The control plane REST API served by the dd0c API Gateway. Handles runbook CRUD, parsing endpoints, execution history, and integration with the Alert-Runbook Matcher. Provides the backend for the Dashboard UI and external integrations.

**Dependencies:** Epics 1, 3, 5

**Technical Notes:**
- Served via Axum (Rust) and secured with JWT.
- Integrates with the shared dd0c API Gateway.
- Enforces tenant isolation (RLS) on every request.

### User Stories

**Story 6.1: Runbook CRUD & Parsing Endpoints**
*As a runbook author (Morgan), I want to create, read, update, and soft-delete runbooks via a REST API, so that I can manage my team's active runbooks.*
- **Acceptance Criteria:**
  - Implements `POST /runbooks` (paste raw text → auto-parse), `GET /runbooks`, `GET /runbooks/:id/versions`, `PUT /runbooks/:id`, `DELETE /runbooks/:id`.
  - Implements `POST /runbooks/parse-preview` for the 5-second wow moment (parses without saving).
  - Returns 422 if parsing fails.
- **Story Points:** 3
- **Dependencies:** 1.2

**Story 6.2: Execution History & Status Queries**
*As an SRE manager (Jordan), I want to view active and completed execution runs, so that I can monitor incident response times and skipped steps.*
- **Acceptance Criteria:**
  - Implements `POST /executions` to start copilot manually.
  - Implements `GET /executions` and `GET /executions/:id` to retrieve status, steps, exit codes, and durations.
  - Implements `GET /executions/:id/divergence` to get the post-execution analysis (skipped steps, modified commands).
- **Story Points:** 2
- **Dependencies:** 3.1, 5.2

**Story 6.3: Alert-Runbook Matcher Integration**
*As a system, I want to match incoming webhooks (e.g., PagerDuty) to the correct runbook, so that the Slack bot can post the runbook immediately.*
- **Acceptance Criteria:**
  - Implements `POST /webhooks/pagerduty`, `POST /webhooks/opsgenie`, and `POST /webhooks/dd0c-alert`.
  - Uses keyword + metadata matching (and optionally pgvector similarity) to find the best runbook for the alert payload.
  - Generates the `alert_context` and triggers the Slack bot notification (Epic 4).
- **Story Points:** 3
- **Dependencies:** 6.1

**Story 6.4: Classification Query Endpoints**
*As a runbook author (Morgan), I want to query the classification engine directly, so that I can test commands before publishing a runbook.*
- **Acceptance Criteria:**
  - Implements `POST /classify` for testing/debugging.
  - Implements `GET /classifications/:step_id` to retrieve full classification details.
  - Rate-limited to 30 req/min per tenant.
- **Story Points:** 1
- **Dependencies:** 2.3

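The 30 req/min per-tenant limit in Story 6.4 could be enforced with something like the fixed-window counter below. The document does not say which algorithm is intended, so fixed windows are an assumption (a token bucket or sliding window would also fit), as are the `RateLimiter` name and the in-memory `HashMap`, which would not be shared across multiple API instances.

```rust
use std::collections::HashMap;

/// Fixed-window request counter keyed by tenant ID.
pub struct RateLimiter {
    limit: u32,
    window_secs: u64, // must be greater than zero
    windows: HashMap<String, (u64, u32)>, // tenant -> (window start, count)
}

impl RateLimiter {
    pub fn new(limit: u32, window_secs: u64) -> Self {
        Self { limit, window_secs, windows: HashMap::new() }
    }

    /// Returns true and counts the request if the tenant is under its limit
    /// for the window containing `now_secs`.
    pub fn allow(&mut self, tenant_id: &str, now_secs: u64) -> bool {
        let window_start = now_secs - now_secs % self.window_secs;
        let entry = self
            .windows
            .entry(tenant_id.to_string())
            .or_insert((window_start, 0));
        // A new window resets the counter.
        if entry.0 != window_start {
            *entry = (window_start, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

A handler for `POST /classify` would call `allow(tenant_id, now)` before doing any work and return HTTP 429 when it yields `false`. Fixed windows permit short bursts across a window boundary, which may or may not be acceptable for this endpoint.
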
## Epic 7: Dashboard UI
**Description:** A React Single-Page Application (SPA) for runbook authors and managers to view runbooks, past executions, and risk classifications. The primary interface for onboarding, reviewing runbooks, and analyzing post-incident data.

**Dependencies:** Epic 6 (needs APIs to consume)

**Technical Notes:**
- React SPA, integrates with the shared dd0c portal.
- Displays the 5-second "wow moment" parse preview.
- Must visually distinguish 🟢 Safe, 🟡 Caution, and 🔴 Dangerous commands.

### User Stories

**Story 7.1: Runbook Paste & Preview UI**
*As a runbook author (Morgan), I want to paste raw text and instantly see the parsed steps, so that I can verify the structure before saving.*
- **Acceptance Criteria:**
  - Large text area for pasting raw runbook text.
  - Calls `POST /runbooks/parse-preview` and displays the structured steps.
  - Highlights variables, prerequisites, and ambiguities.
  - Allows editing the raw text to trigger a re-parse.
- **Story Points:** 3
- **Dependencies:** 6.1

**Story 7.2: Execution Timeline & Divergence View**
*As an SRE manager (Jordan), I want to view a timeline of an incident's execution, so that I can see what was skipped or modified.*
- **Acceptance Criteria:**
  - Displays the execution run with a visual timeline of steps.
  - Shows who approved each step, the exact command executed, and the exit code.
  - Highlights divergence (skipped steps, modified commands, unlisted actions).
  - Provides an "Apply Updates" button to update the runbook based on divergence.
- **Story Points:** 5
- **Dependencies:** 6.2

**Story 7.3: Trust Level & Risk Visualization**
*As a runbook author (Morgan), I want to clearly see the risk level of each step in my runbook, so that I know what requires approval.*
- **Acceptance Criteria:**
  - Color-codes steps: 🟢 Green (Safe), 🟡 Yellow (Caution), 🔴 Red (Dangerous).
  - Clicking a risk badge shows the classification reasoning (scanner rules matched, LLM explanation).
  - Displays the runbook's overall Trust Level (0, 1, 2) and allows authorized users to change it (V1 max = 2).
- **Story Points:** 3
- **Dependencies:** 7.1

**Story 7.4: Basic Health & MTTR Dashboard**
*As an SRE manager (Jordan), I want a high-level view of my team's runbooks and execution stats, so that I can measure the value of dd0c/run.*
- **Acceptance Criteria:**
  - Displays a list of runbooks, their coverage, and average staleness.
  - Shows MTTR (Mean Time To Resolution) for incidents handled via Copilot.
  - Displays the total number of Copilot runs and skipped steps per month.
- **Story Points:** 2
- **Dependencies:** 6.2

## Epic 8: Infrastructure & DevOps
**Description:** The foundational cloud infrastructure and CI/CD pipelines required to run the dd0c/run SaaS platform and build the customer-facing Agent. Ensures strict security isolation (mTLS), observability, and zero-downtime ECS Fargate deployments.

**Dependencies:** None (can be built in parallel with Epic 1).

**Technical Notes:**
- All infrastructure defined as code (Terraform) and deployed to AWS.
- ECS Fargate for all core services (Parser, Classifier, Engine, APIs).
- Agent binary built for Linux x86_64 and ARM64 via GitHub Actions.
- mTLS certificate generation and rotation pipeline for Agent authentication.

### User Stories

**Story 8.1: Core AWS Infrastructure Provisioning**
*As a system administrator (Brian), I want the base AWS infrastructure provisioned via Terraform, so that the services have a secure, scalable environment to run in.*
- **Acceptance Criteria:**
  - Terraform configures VPC, public/private subnets, NAT Gateway, and ALB.
  - Provisions RDS PostgreSQL 16 (Multi-AZ) with pgvector and S3 buckets for audit/logs.
  - Provisions ECS Fargate cluster and SQS queues.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.2: CI/CD Pipeline & Agent Build**
*As a developer (Brian), I want automated build and deployment pipelines, so that I can ship code to production safely and distribute the Agent binary.*
- **Acceptance Criteria:**
  - GitHub Actions pipeline runs `cargo clippy`, `cargo test`, and the "canary test suite" on every PR.
  - Merges to `main` auto-deploy ECS Fargate services with zero downtime.
  - Release tags compile the Agent binary (x86_64, aarch64), sign it, and publish to GitHub Releases / S3.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.3: Agent mTLS & gRPC Setup**
*As a system, I want secure outbound-only gRPC communication between the Agent and the Execution Engine, so that commands are transmitted safely without VPNs or inbound firewall rules.*
- **Acceptance Criteria:**
  - Internal CA issues tenant-scoped mTLS certificates.
  - API Gateway terminates mTLS and validates the tenant ID in the certificate against the request.
  - Agent successfully establishes a persistent, outbound-only gRPC stream to the Engine.
- **Story Points:** 5
- **Dependencies:** 8.1

**Story 8.4: Observability & Party Mode Alerting**
*As the solo on-call engineer (Brian), I want comprehensive monitoring and alerting, so that I am woken up if a safety invariant is violated (Party Mode).*
- **Acceptance Criteria:**
  - OpenTelemetry (OTEL) collector deployed and routing metrics/traces to Grafana Cloud.
  - PagerDuty integration configured for P1 alerts.
  - Alert fires immediately if the "Party Mode" database flag is set or if Agent heartbeats drop unexpectedly.
- **Story Points:** 2
- **Dependencies:** 8.1

## Epic 9: Onboarding & PLG
**Description:** Product-Led Growth (PLG) and frictionless onboarding. Focuses on the "5-second wow moment" where a user can paste a runbook and instantly see the parsed, classified steps without installing an agent, plus a self-serve tier to capture leads.

**Dependencies:** Epics 1, 6, 7.

**Technical Notes:**
- The demo uses the `POST /runbooks/parse-preview` endpoint (does not persist data).
- Tenant provisioning must be fully automated upon signup.
- Free tier enforced via database limits (no Stripe required initially for free tier).

### User Stories

**Story 9.1: The 5-Second Interactive Demo**
*As a prospective customer, I want to paste my messy runbook into the marketing website and see it parsed instantly, so that I immediately understand the product's value without signing up.*
- **Acceptance Criteria:**
  - Landing page features a prominent text area for pasting a runbook.
  - Submitting calls the unauthenticated parse-preview endpoint (rate-limited by IP).
  - Renders the structured steps with 🟢/🟡/🔴 risk badges dynamically.
- **Story Points:** 3
- **Dependencies:** 6.1, 7.1

**Story 9.2: Self-Serve Signup & Tenant Provisioning**
*As a new user, I want to create an account and get my tenant provisioned instantly, so that I can start using the product without talking to sales.*
- **Acceptance Criteria:**
  - OAuth/Email signup flow creates a User and a Tenant record in PostgreSQL.
  - Automated worker initializes the tenant workspace, generates the initial mTLS certs, and provides the Agent installation command.
  - Limits the free tier to 5 active runbooks and 50 executions/month.
- **Story Points:** 3
- **Dependencies:** 8.1

**Story 9.3: Agent Installation Wizard**
*As a new user, I want a simple copy-paste command to install the read-only Agent in my infrastructure, so that I can execute my first runbook within 10 minutes of signup.*
- **Acceptance Criteria:**
  - Dashboard provides a dynamically generated `curl | bash` or `kubectl apply` snippet.
  - Snippet includes the tenant-specific mTLS certificate and Agent binary download.
  - Dashboard shows a live "Waiting for Agent heartbeat..." state that turns green when the Agent connects.
- **Story Points:** 3
- **Dependencies:** 8.3, 9.2

**Story 9.4: First Runbook Copilot Walkthrough**
*As a new user, I want an interactive guide for my first runbook execution, so that I learn how the Copilot approval flow and Trust Gradient work safely.*
- **Acceptance Criteria:**
  - System automatically provisions a "Hello World" runbook containing safe read-only commands (e.g., `kubectl get nodes`).
  - In-app tooltip guides the user to trigger the runbook via the Slack integration.
  - Completing this first execution unlocks the "Runbook Author" badge/status.
- **Story Points:** 2
- **Dependencies:** 4.2, 9.3

---

## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/run adheres to the 5 Transparent Factory tenets. Of all 6 dd0c products, runbook automation carries the highest governance burden — it executes commands in production infrastructure. Every tenet here is load-bearing.

### Story 10.1: Atomic Flagging — Feature Flags for Execution Behaviors
**As a** solo founder, **I want** every new runbook execution capability, command parser, and auto-remediation behavior behind a feature flag (default: off), **so that** a bad parser change never executes the wrong command in a customer's production environment.

**Acceptance Criteria:**
- OpenFeature SDK integrated into the execution engine. V1: JSON file provider.
- All flags evaluate locally — no network calls during runbook execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged execution path fails >2 times in any 10-minute window, the flag auto-disables and all in-flight executions using that path are paused (not killed).
- Flags required for: new command parsers, conditional logic handlers, new infrastructure targets (K8s, AWS, bare metal), auto-retry behaviors, approval bypass rules.
- **Hard rule:** any flag that controls execution of destructive commands (`rm`, `delete`, `terminate`, `drop`) requires a 48-hour bake time at 10% rollout before full enablement.

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine)
**Technical Notes:**
- Circuit breaker threshold is intentionally low (2 failures) because production execution failures are high-severity.
- Pause vs. kill: paused executions hold state and can be resumed after review. Killing mid-execution risks partial state.
- 48-hour bake for destructive flags: enforced via flag metadata `destructive: true` + CI check.

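The circuit breaker described in Story 10.1 (more than 2 failures in any 10-minute window disables the flag) can be sketched with a sliding window of failure timestamps. This is an illustration only: `CircuitBreaker` is a hypothetical type, timestamps are passed in as seconds for testability, and wiring the result to OpenFeature and to pausing in-flight executions is left out.

```rust
use std::collections::VecDeque;

/// Sliding-window failure counter guarding one feature flag.
pub struct CircuitBreaker {
    max_failures: usize, // tolerated failures per window (2 in Story 10.1)
    window_secs: u64,    // window length (600 for ten minutes)
    failures: VecDeque<u64>,
    enabled: bool,
}

impl CircuitBreaker {
    pub fn new(max_failures: usize, window_secs: u64) -> Self {
        Self { max_failures, window_secs, failures: VecDeque::new(), enabled: true }
    }

    /// Record a failure at `now_secs`. Returns true only when this failure
    /// trips the breaker, so the caller can pause in-flight executions once.
    pub fn record_failure(&mut self, now_secs: u64) -> bool {
        self.failures.push_back(now_secs);
        // Discard failures that are no longer inside the window.
        while let Some(&oldest) = self.failures.front() {
            if now_secs.saturating_sub(oldest) >= self.window_secs {
                self.failures.pop_front();
            } else {
                break;
            }
        }
        if self.enabled && self.failures.len() > self.max_failures {
            self.enabled = false;
            return true;
        }
        false
    }

    pub fn is_enabled(&self) -> bool {
        self.enabled
    }
}
```

With `CircuitBreaker::new(2, 600)`, the third failure inside ten minutes flips the flag off. There is intentionally no method to re-enable it, on the assumption that turning a tripped flag back on is a manual decision made after review.
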
### Story 10.2: Elastic Schema — Additive-Only for Execution Audit Trail
**As a** solo founder, **I want** all execution log and runbook state schema changes to be strictly additive, **so that** the forensic audit trail of every production command ever executed is never corrupted or lost.

**Acceptance Criteria:**
- CI rejects any migration that removes, renames, or changes the type of existing columns/attributes.
- Breaking changes are introduced as new fields with a `_v2` suffix; the original field stays in place.
- All execution log parsers ignore unknown fields.
- During migration windows, old and new fields are dual-written within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).
- **Hard rule:** execution audit records are immutable. No `UPDATE` or `DELETE` statements are permitted on the `execution_log` table/collection. Ever.

**Estimate:** 3 points
**Dependencies:** Epic 4 (Audit & Forensics)
**Technical Notes:**
- Execution logs are append-only by design and by policy. Use DynamoDB with no `UpdateItem` or `DeleteItem` IAM permission on the audit table.
- For runbook definition versioning: every edit creates a new version record. Old versions are never mutated.
- Schema for parsed runbook steps: version the parser output format. V1 steps and V2 steps coexist.
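The additive-only CI rule in this story could be checked with a simple schema diff. The sketch below assumes schemas are available as flat `{field_name: type_name}` dictionaries; that representation, the function name `additive_violations`, and the example field names are all hypothetical. A rename is indistinguishable from a removal plus an addition, so it is reported as a removal.

```python
from __future__ import annotations


def additive_violations(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return the ways `new` breaks the additive-only rule relative to `old`."""
    violations = []
    for name, old_type in old.items():
        if name not in new:
            violations.append(f"removed or renamed: {name}")
        elif new[name] != old_type:
            violations.append(f"type changed: {name} ({old_type} -> {new[name]})")
    return violations


# Hypothetical schema for the execution log.
current = {"execution_id": "string", "exit_code": "number"}

# Allowed: a breaking change arrives as a new `_v2` field; the old one stays.
additive = {**current, "exit_code_v2": "string"}

# Rejected: `exit_code` was renamed, and `execution_id` changed type.
breaking = {"execution_id": "number", "exit_status": "number"}

ok_result = additive_violations(current, additive)
bad_result = additive_violations(current, breaking)
```

A CI job would fail the build whenever the returned list is non-empty. New fields never appear in the result, which is what makes the check additive-only.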
### Story 10.3: Cognitive Durability — Decision Logs for Execution Logic
**As a** future maintainer, **I want** every change to runbook parsing, step classification, or execution logic to be accompanied by a `decision_log.json`, **so that** I understand why the system interpreted step 3 as "destructive" and required approval.

**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/parser/`, `pkg/classifier/`, `pkg/executor/`, or `pkg/approval/`.
- Cyclomatic complexity cap of 10 enforced via linter. PRs exceeding this are blocked.
- Decision logs are stored in `docs/decisions/`.
- **Additional requirement:** every change to the "destructive command" classification list requires a decision log entry explaining why the command was added/removed.

**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- The destructive command list is the most sensitive config in the entire product. Changes must be documented, reviewed, and logged.
- Parser logic decision logs should include sample runbook snippets showing before/after interpretation.
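The decision-log gate above can be sketched as two small checks: one validating that a log carries every field in the schema, and one deciding whether a PR's changed files require a log at all. The field list and guarded paths come from the acceptance criteria; the function names, the example file names, and the rule that any `.json` file under `docs/decisions/` counts as a decision log are assumptions.

```python
from __future__ import annotations

import json

REQUIRED_FIELDS = {
    "prompt", "reasoning", "alternatives_considered",
    "confidence", "timestamp", "author",
}
GUARDED_PREFIXES = ("pkg/parser/", "pkg/classifier/", "pkg/executor/", "pkg/approval/")
DECISIONS_DIR = "docs/decisions/"


def missing_fields(decision_log_text: str) -> set[str]:
    """Return the required schema fields absent from a decision log document."""
    entry = json.loads(decision_log_text)
    return REQUIRED_FIELDS - set(entry)


def pr_lacks_decision_log(changed_files: list[str]) -> bool:
    """True when a PR touches guarded code but adds no decision log."""
    touches_guarded = any(path.startswith(GUARDED_PREFIXES) for path in changed_files)
    has_log = any(
        path.startswith(DECISIONS_DIR) and path.endswith(".json")
        for path in changed_files
    )
    return touches_guarded and not has_log


# Hypothetical log that forgot the `confidence` field.
incomplete = json.dumps({
    "prompt": "Classify `truncate table` as destructive?",
    "reasoning": "Irreversible data loss.",
    "alternatives_considered": ["treat as ambiguous"],
    "timestamp": "2025-01-01T00:00:00Z",
    "author": "example",
})
gaps = missing_fields(incomplete)
blocked = pr_lacks_decision_log(["pkg/classifier/rules.go", "README.md"])
```

In CI, a non-empty `missing_fields` result or a `True` from `pr_lacks_decision_log` would block the PR.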
### Story 10.4: Semantic Observability — AI Reasoning Spans on Classification & Execution
**As an** SRE manager investigating an incident caused by automated execution, **I want** every runbook classification and execution decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I have a complete forensic record of what the system decided, why, and what it executed.

**Acceptance Criteria:**
- Every runbook execution creates a parent `runbook_execution` span. Child spans for each step: `step_classification`, `step_approval_check`, `step_execution`.
- `step_classification` attributes: `step.text_hash`, `step.classified_as` (safe/destructive/ambiguous), `step.confidence_score`, `step.alternatives_considered`.
- `step_execution` attributes: `step.command_hash`, `step.target_host_hash`, `step.exit_code`, `step.duration_ms`, `step.output_truncated` (first 500 chars, hashed).
- If classification is AI-assisted, also record: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- `step_approval_check` attributes: `step.approval_required` (bool), `step.approval_source` (human/policy/auto), `step.approval_latency_ms`.
- No PII and no raw commands in spans; everything is hashed. Full commands appear only in the encrypted audit log.

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine), Epic 4 (Audit & Forensics)
**Technical Notes:**
- This is the highest-effort observability story across all 6 products because execution tracing is forensic-grade.
- Span hierarchy: `runbook_execution` → `step_N` → `[classification, approval, execution]`. Three levels deep.
- Output hashing: SHA-256 of command output for correlation. Raw output only in encrypted audit store.
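The hashing rule for `step_execution` attributes might look like the following sketch. It builds a plain attribute dictionary rather than calling a tracing SDK, so it makes no claims about any OpenTelemetry API. The helper names are hypothetical, and hashing the first 500 characters of output (rather than the full output) is one reading of the acceptance criterion.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def step_execution_attributes(
    command: str, target_host: str, exit_code: int, duration_ms: int, output: str
) -> dict:
    """Build span attributes for one executed step without exposing raw values."""
    return {
        "step.command_hash": sha256_hex(command),
        "step.target_host_hash": sha256_hex(target_host),
        "step.exit_code": exit_code,
        "step.duration_ms": duration_ms,
        # Assumption: hash only the first 500 characters of the output.
        "step.output_truncated": sha256_hex(output[:500]),
    }


attrs = step_execution_attributes(
    command="kubectl get pods -n payments",
    target_host="bastion-01.internal",
    exit_code=0,
    duration_ms=412,
    output="NAME READY STATUS\napi-7c9 1/1 Running\n",
)
```

Because the same command always yields the same hash, spans can be correlated with the encrypted audit log by hashing the logged command and comparing, without the raw command ever entering the tracing backend.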
### Story 10.5: Configurable Autonomy — Governance for Production Execution
**As a** solo founder, **I want** a `policy.json` that strictly controls what dd0c/run is allowed to execute autonomously vs. what requires human approval, **so that** no automated system ever runs a destructive command without explicit authorization.

**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (all steps require human approval) or `audit` (safe steps auto-execute, destructive steps require approval).
- Default for ALL customers: `strict`. There is no "fully autonomous" mode for dd0c/run. Even in `audit` mode, destructive commands always require approval.
- `panic_mode`: when true, ALL execution halts immediately. In-flight steps are paused (not killed). All pending approvals are revoked. System enters read-only forensic mode.
- Governance drift monitoring: weekly report of auto-executed vs. human-approved steps. If the auto-execution ratio exceeds the configured threshold, the system auto-downgrades to `strict`.
- Per-customer, per-runbook governance overrides. Customers can lock specific runbooks to `strict` regardless of system mode.
- **Hard rule:** `panic_mode` must be triggerable in <1 second via API call, CLI command, Slack command, OR a physical hardware button (webhook endpoint).

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine), Epic 5 (Approval Workflow)
**Technical Notes:**
- No "full auto" mode is a deliberate product decision. dd0c/run assists humans; it doesn't replace them for destructive actions.
- Panic mode implementation: Redis key `dd0c:panic` checked at the top of every execution loop iteration. Webhook endpoint `POST /admin/panic` sets the key and broadcasts a halt signal via pub/sub.
- The <1 second requirement means panic cannot depend on a DB write; it relies on Redis pub/sub only.
- Governance drift: cron job queries execution logs weekly. Threshold configurable per-org (default: 70% auto-execution triggers downgrade).
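The governance rules in this story reduce to a small decision function plus a weekly drift check. The sketch below is illustrative only: the `Policy` class, the function names, and the returned strings are assumptions, and it keeps state in memory instead of reading `policy.json` or Redis. It also assumes that `ambiguous` steps are treated like destructive ones and that an unrecognized `governance_mode` fails closed to approval.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    governance_mode: str = "strict"   # "strict" or "audit"; strict is the default
    panic_mode: bool = False
    drift_threshold: float = 0.70     # default auto-execution ratio that triggers downgrade


def decide(policy: Policy, classification: str, runbook_locked_strict: bool = False) -> str:
    """Return 'halt', 'require_approval', or 'auto_execute' for one step."""
    if policy.panic_mode:
        return "halt"
    # Anything other than an explicit "audit" mode fails closed to approval.
    if policy.governance_mode != "audit" or runbook_locked_strict:
        return "require_approval"
    # Audit mode: only steps classified as safe may run without a human.
    return "auto_execute" if classification == "safe" else "require_approval"


def apply_drift_check(policy: Policy, auto_executed: int, human_approved: int) -> Policy:
    """Downgrade to strict when the auto-execution ratio exceeds the threshold."""
    total = auto_executed + human_approved
    ratio = auto_executed / total if total else 0.0
    if ratio > policy.drift_threshold:
        policy.governance_mode = "strict"
    return policy


audit = Policy(governance_mode="audit")
safe_step = decide(audit, "safe")
destructive_step = decide(audit, "destructive")
after_drift = apply_drift_check(Policy(governance_mode="audit"), 80, 20)
```

With these inputs, a safe step auto-executes in `audit` mode, a destructive step still requires approval, and an 80% auto-execution ratio pushes the policy back to `strict`.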
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 5 |
| 10.5 | Configurable Autonomy | 5 |
| **Total** | | **20** |