# dd0c/run — V1 MVP Epics

This document breaks down the V1 MVP of dd0c/run into implementation Epics and User Stories. Scope is strictly limited to V1 (read-only execution 🟢, copilot approval for 🟡/🔴, no autopilot, no terminal watcher).

---

## Epic 1: Runbook Parser

**Description:** The "5-second wow moment." Ingest raw unstructured text from a paste event, normalize it, and use a fast LLM (e.g., Claude Haiku via dd0c/route) to extract a structured JSON representation of executable steps, variables, conditional branches, and prerequisites in under 5 seconds.

**Dependencies:** None. This is the foundational data structure.

**Technical Notes:**
- Pure Rust normalizer to strip HTML/Markdown before hitting the LLM to save tokens and latency.
- LLM prompt must enforce a strict JSON schema.
- Idempotent parsing: same text + temperature 0 = same output.
- Must run in < 3.5s to leave room for the Action Classifier within the 5s SLA.

### User Stories

**Story 1.1: Raw Text Normalization & Ingestion**

*As a runbook author (Morgan), I want to paste raw text from Confluence or Notion into the system, so that I don't have to learn a proprietary YAML DSL.*

- **Acceptance Criteria:**
  - System accepts raw text payload via API.
  - Rust-based normalizer strips HTML tags, markdown formatting, and normalizes whitespace/bullet points.
  - Original raw text and hash are preserved in the DB for audit/re-parsing.
- **Story Points:** 2
- **Dependencies:** None

**Story 1.2: LLM Structured Step Extraction**

*As a system, I want to pass normalized text to a fast LLM to extract an ordered JSON array of steps and commands, so that the runbook becomes machine-readable.*

- **Acceptance Criteria:**
  - Sends normalized text to `dd0c/route` with a strict JSON schema prompt.
  - Correctly extracts step order, natural language description, and embedded CLI/shell commands.
  - P95 latency for extraction is < 3.5 seconds.
  - Rejects/errors gracefully if the text contains no actionable steps.
- **Story Points:** 3
- **Dependencies:** 1.1

**Story 1.3: Variable & Prerequisite Detection**

*As an on-call engineer (Riley), I want the parser to identify implicit prerequisites (like VPN) and variables (like `<VAR>`), so that I'm fully prepared before the runbook starts.*

- **Acceptance Criteria:**
  - Regex/heuristic scanner identifies common placeholders (`$VAR`, `<VAR>`, `{var}`).
  - LLM identifies implicit requirements ("ensure you are on VPN", "requires prod AWS profile").
  - Outputs a structured array of `variables` and `prerequisites` in the JSON payload.
- **Story Points:** 2
- **Dependencies:** 1.2

**Story 1.4: Branching & Ambiguity Highlighting**

*As a runbook author (Morgan), I want the system to map conditional logic (if/else) and flag vague instructions, so that my runbooks are deterministic and clear.*

- **Acceptance Criteria:**
  - Parser detects simple "if X then Y" statements and maps them to a step execution DAG (Directed Acyclic Graph).
  - Identifies vague steps (e.g., "check the logs" without specifying which logs) and adds them to an `ambiguities` array for author review.
- **Story Points:** 3
- **Dependencies:** 1.2

## Epic 2: Action Classifier

**Description:** The safety-critical core of the system. Classifies every extracted command as 🟢 Safe, 🟡 Caution, 🔴 Dangerous, or ⬜ Unknown. Uses a defense-in-depth "dual-key" architecture: a deterministic compiled Rust scanner overrides an advisory LLM classifier. A misclassification here is an existential risk.

**Dependencies:** Epic 1 (needs parsed steps to classify)

**Technical Notes:**
- The deterministic scanner must use tree-sitter or compiled regex sets.
- Merge rules must be hardcoded in Rust (not configurable).
- LLM classification must run in parallel across steps to meet the 5-second total parse+classify SLA.
- If the LLM confidence is low or unknown, risk escalates. Scanner wins disagreements.
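The "scanner wins, risk only escalates" merge policy described above can be sketched in Rust. This is a minimal illustration, not the actual dd0c/run merge engine: the `Risk` type, the tier ordering, and the mapping of low-confidence verdicts are assumptions (the 0.9 confidence cutoff comes from Story 2.2). The ⬜ Unknown verdict is folded into 🟡 here, matching the "🟡 minimum" default.

```rust
// Risk tiers ordered by severity; deriving Ord uses declaration order,
// so Safe < Caution < Dangerous. Unknown scanner results are assumed to
// floor to Caution before reaching this function.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Risk {
    Safe,      // 🟢
    Caution,   // 🟡
    Dangerous, // 🔴
}

/// Merge the deterministic scanner verdict with the advisory LLM verdict.
/// The scanner wins disagreements, the final label is never less severe
/// than either input, and low LLM confidence escalates one tier.
fn merge(scanner: Risk, llm: Risk, llm_confidence: f64) -> Risk {
    // Low confidence: treat the LLM verdict as one tier more severe.
    let llm = if llm_confidence < 0.9 {
        match llm {
            Risk::Safe => Risk::Caution,
            _ => Risk::Dangerous,
        }
    } else {
        llm
    };
    // Scanner 🔴 is final, regardless of what the LLM says.
    if scanner == Risk::Dangerous {
        return Risk::Dangerous;
    }
    // Otherwise the final risk is the more severe of the two verdicts,
    // so 🟢 requires both parties to agree the step is safe.
    scanner.max(llm)
}

fn main() {
    // Both agree safe with high confidence → 🟢.
    assert_eq!(merge(Risk::Safe, Risk::Safe, 0.95), Risk::Safe);
    // Scanner 🟡 overrides LLM 🟢 → 🟡.
    assert_eq!(merge(Risk::Caution, Risk::Safe, 0.95), Risk::Caution);
    // A low-confidence 🟢 from the LLM escalates → 🟡.
    assert_eq!(merge(Risk::Safe, Risk::Safe, 0.5), Risk::Caution);
}
```

Because the merge is a pure function over two enums and a float, it can be exhaustively table-tested in CI, which is what makes hardcoding it in Rust (rather than config) practical.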
### User Stories

**Story 2.1: Deterministic Safety Scanner**

*As a system, I want to pattern-match commands against a compiled database of known signatures, so that destructive commands are definitively blocked without relying on an LLM.*

- **Acceptance Criteria:**
  - Rust library with compiled regex sets (allowlist 🟢, caution 🟡, blocklist 🔴).
  - Uses tree-sitter to parse SQL/shell ASTs to detect destructive patterns (e.g., `DELETE` without `WHERE`, piped `rm -rf`).
  - Executes in < 1ms per command.
  - Defaults to `Unknown` (🟡 minimum) if no patterns match.
- **Story Points:** 5
- **Dependencies:** None (standalone library)

**Story 2.2: LLM Contextual Classifier**

*As a system, I want an LLM to assess the command in the context of surrounding steps, so that implicit state changes or custom scripts are caught.*

- **Acceptance Criteria:**
  - Sends the command, surrounding steps, and infrastructure context to an LLM via a strict JSON prompt.
  - Returns a classification (🟢/🟡/🔴), a confidence score, and suggested rollbacks.
  - Escalates the classification to the next risk tier if confidence < 0.9.
- **Story Points:** 3
- **Dependencies:** Epic 1 (requires structured context)

**Story 2.3: Classification Merge Engine**

*As an on-call engineer (Riley), I want the system to safely merge the LLM and deterministic scanner results, so that I can trust the final 🟢/🟡/🔴 label.*

- **Acceptance Criteria:**
  - Implements the 5 hardcoded merge rules exactly as architected (e.g., if Scanner says 🔴, final is 🔴; if Scanner says 🟡 and LLM says 🟢, final is 🟡).
  - The final risk level is `Safe` ONLY if both the Scanner and the LLM agree it is safe.
- **Story Points:** 2
- **Dependencies:** 2.1, 2.2

**Story 2.4: Immutable Classification Audit**

*As an SRE manager (Jordan), I want every classification decision logged with full context, so that I have a forensic record of why the system made its choice.*

- **Acceptance Criteria:**
  - Emits a `runbook.classified` event to the PostgreSQL database for every step.
  - Log includes: scanner result (matched patterns), LLM reasoning, confidence scores, final classification, and merge rule applied.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 3: Execution Engine

**Description:** A state machine orchestrating step-by-step runbook execution, enforcing the Trust Gradient (Level 0, 1, 2) at every transition. It never auto-executes 🟡 or 🔴 commands in V1. Handles timeouts, rollbacks, and skipped-step tracking.

**Dependencies:** Epics 1 & 2 (needs classified runbooks to execute)

**Technical Notes:**
- V1 is Copilot-only. State transitions must block 🔴 and prompt for 🟡.
- Engine must communicate with the Agent via gRPC over mTLS.
- Each step execution receives a unique ID to prevent duplicate deliveries.

### User Stories

**Story 3.1: Execution State Machine**

*As a system, I want a state machine to orchestrate step-by-step runbook execution, so that no two commands execute simultaneously and the Trust Gradient is enforced.*

- **Acceptance Criteria:**
  - Starts in `Pending` state upon alert match.
  - Progresses to `AutoExecute` ONLY if the step is 🟢 and the trust level allows.
  - Transitions to `AwaitApproval` for 🟡/🔴 steps, blocking execution until approved.
  - Aborts or transitions to `ManualIntervention` on timeout (e.g., 60s for 🟢, 300s for 🔴).
- **Story Points:** 5
- **Dependencies:** 2.3

**Story 3.2: gRPC Agent Communication Protocol**

*As a system, I want to communicate securely with the dd0c Agent in the customer VPC, so that commands are executed safely without inbound firewall rules.*

- **Acceptance Criteria:**
  - Outbound-only gRPC streaming connection initiated by the Agent.
  - Engine sends `ExecuteStep` payload (command, timeout, risk level, environment variables).
  - Agent streams `StepOutput` (stdout/stderr) back to the Engine.
  - Agent returns `StepResult` (exit code, duration, stdout/stderr hashes) on completion.
- **Story Points:** 5
- **Dependencies:** 3.1

**Story 3.3: Rollback Integration**

*As an on-call engineer (Riley), I want every state-changing step to record its inverse command, so that I can undo it with one click if it fails.*

- **Acceptance Criteria:**
  - Engine stores the rollback command before executing the forward command.
  - If a step fails, the Engine transitions to `RollbackAvailable` and awaits human approval.
  - Executing a rollback triggers an independent execution step and returns the state machine to `StepReady` or `ManualIntervention`.
- **Story Points:** 3
- **Dependencies:** 3.1

**Story 3.4: Divergence Analysis**

*As an SRE manager (Jordan), I want the system to track what the engineer actually did vs. what the runbook prescribed, so that runbooks get better over time.*

- **Acceptance Criteria:**
  - Post-execution analyzer compares prescribed steps with executed/skipped steps and any unlisted commands detected by the agent.
  - Flags skipped steps, modified commands, and unlisted actions.
  - Emits a `divergence.detected` event with suggested runbook updates.
- **Story Points:** 3
- **Dependencies:** 3.1

## Epic 4: Slack Bot Copilot

**Description:** The primary 3am interface for on-call engineers. Guides the engineer step-by-step through an active incident, enforcing the Trust Gradient via interactive buttons for 🟡 actions and explicit typed confirmations for 🔴 actions.
**Dependencies:** Epic 3 (needs execution state to drive the UI)

**Technical Notes:**
- Rust Slack client running in Socket Mode (no inbound webhooks).
- Must use Slack Block Kit for formatting.
- Respects Slack's 1 message/sec/channel rate limit by batching rapid updates.
- Uses dedicated threads to keep main channels clean.

### User Stories

**Story 4.1: Alert Matching & Notification**

*As an on-call engineer (Riley), I want a Slack message with the relevant runbook when an alert fires, so that I don't have to search for it.*

- **Acceptance Criteria:**
  - Integrates with PagerDuty/OpsGenie webhook payloads.
  - Matches alert context (service, region, keywords) to the most relevant runbook.
  - Posts a `🔔 Runbook matched` message with a [▶ Start Copilot] button in the incident channel.
- **Story Points:** 3
- **Dependencies:** None

**Story 4.2: Step-by-Step Interactive UI**

*As an on-call engineer (Riley), I want the bot to guide me through the runbook step-by-step in a thread, so that I can focus on one command at a time.*

- **Acceptance Criteria:**
  - Tapping [▶ Start Copilot] opens a thread and triggers the Execution Engine.
  - Shows 🟢 steps auto-executing with live stdout snippets.
  - Shows 🟡/🔴 steps awaiting approval with command details and rollback.
  - Includes [⏭ Skip] and [🛑 Abort] buttons for every step.
- **Story Points:** 5
- **Dependencies:** 3.1, 4.1

**Story 4.3: Risk-Aware Approval Gates**

*As a system, I want to block dangerous commands until explicit confirmation is provided, so that accidental misclicks don't cause outages.*

- **Acceptance Criteria:**
  - 🟡 steps require clicking [✅ Approve] or [✏️ Edit].
  - 🔴 steps require opening a text input modal and typing the resource name exactly.
  - Cannot bulk-approve steps. Each step must be individually gated.
  - Approver's Slack identity is captured and passed to the Execution Engine.
- **Story Points:** 3
- **Dependencies:** 4.2

**Story 4.4: Real-Time Output Streaming**

*As an on-call engineer (Riley), I want to see the execution output of commands in real-time, so that I know what's happening.*

- **Acceptance Criteria:**
  - Slack messages update dynamically with command stdout/stderr.
  - Batches rapid output updates to respect Slack rate limits.
  - Truncates long outputs and provides a link to the full log in the dashboard.
- **Story Points:** 2
- **Dependencies:** 4.2

## Epic 5: Audit Trail

**Description:** The compliance backbone and forensic record. An append-only, immutable, partitioned PostgreSQL log that tracks every runbook parse, classification, approval, execution, and rollback.

**Dependencies:** Epics 2, 3, 4 (needs events to log)

**Technical Notes:**
- PostgreSQL partitioned table (by month) for query performance.
- Application role must have `INSERT` and `SELECT` only. No `UPDATE` or `DELETE` grants.
- Row-Level Security (RLS) enforces tenant isolation at the database level.
- V1 includes a basic PDF/CSV export for SOC 2 readiness.

### User Stories

**Story 5.1: Append-Only Audit Schema**

*As a system, I want a partitioned database table to store all execution events immutably, so that no one can tamper with the forensic record.*

- **Acceptance Criteria:**
  - Creates a PostgreSQL table partitioned by month.
  - Revokes `UPDATE` and `DELETE` grants from the application role.
  - Schema captures `tenant_id`, `event_type`, `execution_id`, `actor_id`, `event_data` (JSONB), and `created_at`.
- **Story Points:** 2
- **Dependencies:** None

**Story 5.2: Event Ingestion Pipeline**

*As an SRE manager (Jordan), I want every action recorded in real-time, so that I know exactly who did what during an incident.*

- **Acceptance Criteria:**
  - Ingests events from the Parser, Classifier, Execution Engine, and Slack Bot.
  - Logs `runbook.parsed`, `runbook.classified`, `step.auto_executed`, `step.approved`, `step.executed`, etc., exactly as architected.
  - Captures the exact command executed, exit code, and stdout/stderr hashes.
- **Story Points:** 3
- **Dependencies:** 5.1

**Story 5.3: Compliance Export**

*As an SRE manager (Jordan), I want to export timestamped execution logs, so that I can provide evidence to SOC 2 auditors.*

- **Acceptance Criteria:**
  - Generates a PDF or CSV of an execution run, including the approval chain and audit trail.
  - Includes risk classifications and any divergence/modifications made by the engineer.
  - Links to S3 for full stdout/stderr logs if needed.
- **Story Points:** 3
- **Dependencies:** 5.2

**Story 5.4: Multi-Tenant Data Isolation (RLS)**

*As a system, I want to ensure no tenant can see another tenant's data, so that security boundaries are guaranteed at the database level.*

- **Acceptance Criteria:**
  - Enables Row-Level Security (RLS) on all tenant-scoped tables (`runbooks`, `executions`, `audit_events`).
  - API middleware sets the `app.current_tenant_id` session variable on every database connection.
  - Cross-tenant queries return zero rows, not an error.
- **Story Points:** 2
- **Dependencies:** 5.1

## Epic 6: Dashboard API

**Description:** The control plane REST API served by the dd0c API Gateway. Handles runbook CRUD, parsing endpoints, execution history, and integration with the Alert-Runbook Matcher. Provides the backend for the Dashboard UI and external integrations.

**Dependencies:** Epics 1, 3, 5

**Technical Notes:**
- Served via Axum (Rust) and secured with JWT.
- Integrates with the shared dd0c API Gateway.
- Enforces tenant isolation (RLS) on every request.

### User Stories

**Story 6.1: Runbook CRUD & Parsing Endpoints**

*As a runbook author (Morgan), I want to create, read, update, and soft-delete runbooks via a REST API, so that I can manage my team's active runbooks.*

- **Acceptance Criteria:**
  - Implements `POST /runbooks` (paste raw text → auto-parse), `GET /runbooks`, `GET /runbooks/:id/versions`, `PUT /runbooks/:id`, `DELETE /runbooks/:id`.
  - Implements `POST /runbooks/parse-preview` for the 5-second wow moment (parses without saving).
  - Returns 422 if parsing fails.
- **Story Points:** 3
- **Dependencies:** 1.2

**Story 6.2: Execution History & Status Queries**

*As an SRE manager (Jordan), I want to view active and completed execution runs, so that I can monitor incident response times and skipped steps.*

- **Acceptance Criteria:**
  - Implements `POST /executions` to start the copilot manually.
  - Implements `GET /executions` and `GET /executions/:id` to retrieve status, steps, exit codes, and durations.
  - Implements `GET /executions/:id/divergence` to get the post-execution analysis (skipped steps, modified commands).
- **Story Points:** 2
- **Dependencies:** 3.1, 5.2

**Story 6.3: Alert-Runbook Matcher Integration**

*As a system, I want to match incoming webhooks (e.g., PagerDuty) to the correct runbook, so that the Slack bot can post the runbook immediately.*

- **Acceptance Criteria:**
  - Implements `POST /webhooks/pagerduty`, `POST /webhooks/opsgenie`, and `POST /webhooks/dd0c-alert`.
  - Uses keyword + metadata matching (and optionally pgvector similarity) to find the best runbook for the alert payload.
  - Generates the `alert_context` and triggers the Slack bot notification (Epic 4).
- **Story Points:** 3
- **Dependencies:** 6.1

**Story 6.4: Classification Query Endpoints**

*As a runbook author (Morgan), I want to query the classification engine directly, so that I can test commands before publishing a runbook.*

- **Acceptance Criteria:**
  - Implements `POST /classify` for testing/debugging.
  - Implements `GET /classifications/:step_id` to retrieve full classification details.
  - Rate-limited to 30 req/min per tenant.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 7: Dashboard UI

**Description:** A React Single-Page Application (SPA) for runbook authors and managers to view runbooks, past executions, and risk classifications.
The primary interface for onboarding, reviewing runbooks, and analyzing post-incident data.

**Dependencies:** Epic 6 (needs APIs to consume)

**Technical Notes:**
- React SPA, integrates with the shared dd0c portal.
- Displays the 5-second "wow moment" parse preview.
- Must visually distinguish 🟢 Safe, 🟡 Caution, and 🔴 Dangerous commands.

### User Stories

**Story 7.1: Runbook Paste & Preview UI**

*As a runbook author (Morgan), I want to paste raw text and instantly see the parsed steps, so that I can verify the structure before saving.*

- **Acceptance Criteria:**
  - Large text area for pasting raw runbook text.
  - Calls `POST /runbooks/parse-preview` and displays the structured steps.
  - Highlights variables, prerequisites, and ambiguities.
  - Allows editing the raw text to trigger a re-parse.
- **Story Points:** 3
- **Dependencies:** 6.1

**Story 7.2: Execution Timeline & Divergence View**

*As an SRE manager (Jordan), I want to view a timeline of an incident's execution, so that I can see what was skipped or modified.*

- **Acceptance Criteria:**
  - Displays the execution run with a visual timeline of steps.
  - Shows who approved each step, the exact command executed, and the exit code.
  - Highlights divergence (skipped steps, modified commands, unlisted actions).
  - Provides an "Apply Updates" button to update the runbook based on divergence.
- **Story Points:** 5
- **Dependencies:** 6.2

**Story 7.3: Trust Level & Risk Visualization**

*As a runbook author (Morgan), I want to clearly see the risk level of each step in my runbook, so that I know what requires approval.*

- **Acceptance Criteria:**
  - Color-codes steps: 🟢 Green (Safe), 🟡 Yellow (Caution), 🔴 Red (Dangerous).
  - Clicking a risk badge shows the classification reasoning (scanner rules matched, LLM explanation).
  - Displays the runbook's overall Trust Level (0, 1, 2) and allows authorized users to change it (V1 max = 2).
- **Story Points:** 3
- **Dependencies:** 7.1

**Story 7.4: Basic Health & MTTR Dashboard**

*As an SRE manager (Jordan), I want a high-level view of my team's runbooks and execution stats, so that I can measure the value of dd0c/run.*

- **Acceptance Criteria:**
  - Displays a list of runbooks, their coverage, and average staleness.
  - Shows MTTR (Mean Time To Resolution) for incidents handled via Copilot.
  - Displays the total number of Copilot runs and skipped steps per month.
- **Story Points:** 2
- **Dependencies:** 6.2

## Epic 8: Infrastructure & DevOps

**Description:** The foundational cloud infrastructure and CI/CD pipelines required to run the dd0c/run SaaS platform and build the customer-facing Agent. Ensures strict security isolation (mTLS), observability, and zero-downtime ECS Fargate deployments.

**Dependencies:** None (can be built in parallel with Epic 1).

**Technical Notes:**
- All infrastructure defined as code (Terraform) and deployed to AWS.
- ECS Fargate for all core services (Parser, Classifier, Engine, APIs).
- Agent binary built for Linux x86_64 and ARM64 via GitHub Actions.
- mTLS certificate generation and rotation pipeline for Agent authentication.

### User Stories

**Story 8.1: Core AWS Infrastructure Provisioning**

*As a system administrator (Brian), I want the base AWS infrastructure provisioned via Terraform, so that the services have a secure, scalable environment to run in.*

- **Acceptance Criteria:**
  - Terraform configures VPC, public/private subnets, NAT Gateway, and ALB.
  - Provisions RDS PostgreSQL 16 (Multi-AZ) with pgvector and S3 buckets for audit/logs.
  - Provisions ECS Fargate cluster and SQS queues.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.2: CI/CD Pipeline & Agent Build**

*As a developer (Brian), I want automated build and deployment pipelines, so that I can ship code to production safely and distribute the Agent binary.*

- **Acceptance Criteria:**
  - GitHub Actions pipeline runs `cargo clippy`, `cargo test`, and the "canary test suite" on every PR.
  - Merges to `main` auto-deploy ECS Fargate services with zero downtime.
  - Release tags compile the Agent binary (x86_64, aarch64), sign it, and publish to GitHub Releases / S3.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.3: Agent mTLS & gRPC Setup**

*As a system, I want secure outbound-only gRPC communication between the Agent and the Execution Engine, so that commands are transmitted safely without VPNs or inbound firewall rules.*

- **Acceptance Criteria:**
  - Internal CA issues tenant-scoped mTLS certificates.
  - API Gateway terminates mTLS and validates the tenant ID in the certificate against the request.
  - Agent successfully establishes a persistent, outbound-only gRPC stream to the Engine.
- **Story Points:** 5
- **Dependencies:** 8.1

**Story 8.4: Observability & Party Mode Alerting**

*As the solo on-call engineer (Brian), I want comprehensive monitoring and alerting, so that I am woken up if a safety invariant is violated (Party Mode).*

- **Acceptance Criteria:**
  - OpenTelemetry (OTEL) collector deployed and routing metrics/traces to Grafana Cloud.
  - PagerDuty integration configured for P1 alerts.
  - Alert fires immediately if the "Party Mode" database flag is set or if Agent heartbeats drop unexpectedly.
- **Story Points:** 2
- **Dependencies:** 8.1

## Epic 9: Onboarding & PLG

**Description:** Product-Led Growth (PLG) and frictionless onboarding. Focuses on the "5-second wow moment" where a user can paste a runbook and instantly see the parsed, classified steps without installing an agent, plus a self-serve tier to capture leads.

**Dependencies:** Epics 1, 6, 7.
**Technical Notes:**
- The demo uses the `POST /runbooks/parse-preview` endpoint (does not persist data).
- Tenant provisioning must be fully automated upon signup.
- Free tier enforced via database limits (no Stripe required initially for the free tier).

### User Stories

**Story 9.1: The 5-Second Interactive Demo**

*As a prospective customer, I want to paste my messy runbook into the marketing website and see it parsed instantly, so that I immediately understand the product's value without signing up.*

- **Acceptance Criteria:**
  - Landing page features a prominent text area for pasting a runbook.
  - Submitting calls the unauthenticated parse-preview endpoint (rate-limited by IP).
  - Renders the structured steps with 🟢/🟡/🔴 risk badges dynamically.
- **Story Points:** 3
- **Dependencies:** 6.1, 7.1

**Story 9.2: Self-Serve Signup & Tenant Provisioning**

*As a new user, I want to create an account and get my tenant provisioned instantly, so that I can start using the product without talking to sales.*

- **Acceptance Criteria:**
  - OAuth/Email signup flow creates a User and a Tenant record in PostgreSQL.
  - Automated worker initializes the tenant workspace, generates the initial mTLS certs, and provides the Agent installation command.
  - Limits the free tier to 5 active runbooks and 50 executions/month.
- **Story Points:** 3
- **Dependencies:** 8.1

**Story 9.3: Agent Installation Wizard**

*As a new user, I want a simple copy-paste command to install the read-only Agent in my infrastructure, so that I can execute my first runbook within 10 minutes of signup.*

- **Acceptance Criteria:**
  - Dashboard provides a dynamically generated `curl | bash` or `kubectl apply` snippet.
  - Snippet includes the tenant-specific mTLS certificate and Agent binary download.
  - Dashboard shows a live "Waiting for Agent heartbeat..." state that turns green when the Agent connects.
- **Story Points:** 3
- **Dependencies:** 8.3, 9.2

**Story 9.4: First Runbook Copilot Walkthrough**

*As a new user, I want an interactive guide for my first runbook execution, so that I learn how the Copilot approval flow and Trust Gradient work safely.*

- **Acceptance Criteria:**
  - System automatically provisions a "Hello World" runbook containing safe read-only commands (e.g., `kubectl get nodes`).
  - In-app tooltip guides the user to trigger the runbook via the Slack integration.
  - Completing this first execution unlocks the "Runbook Author" badge/status.
- **Story Points:** 2
- **Dependencies:** 4.2, 9.3

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/run adheres to the 5 Transparent Factory tenets. Of all 6 dd0c products, runbook automation carries the highest governance burden — it executes commands in production infrastructure. Every tenet here is load-bearing.

### Story 10.1: Atomic Flagging — Feature Flags for Execution Behaviors

**As a** solo founder, **I want** every new runbook execution capability, command parser, and auto-remediation behavior behind a feature flag (default: off), **so that** a bad parser change never executes the wrong command in a customer's production environment.

**Acceptance Criteria:**
- OpenFeature SDK integrated into the execution engine. V1: JSON file provider.
- All flags evaluate locally — no network calls during runbook execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged execution path fails >2 times in any 10-minute window, the flag auto-disables and all in-flight executions using that path are paused (not killed).
- Flags required for: new command parsers, conditional logic handlers, new infrastructure targets (K8s, AWS, bare metal), auto-retry behaviors, approval bypass rules.
- **Hard rule:** any flag that controls execution of destructive commands (`rm`, `delete`, `terminate`, `drop`) requires a 48-hour bake time at 10% rollout before full enablement.

**Estimate:** 5 points

**Dependencies:** Epic 3 (Execution Engine)

**Technical Notes:**
- Circuit breaker threshold is intentionally low (2 failures) because production execution failures are high-severity.
- Pause vs. kill: paused executions hold state and can be resumed after review. Killing mid-execution risks partial state.
- 48-hour bake for destructive flags: enforced via flag metadata `destructive: true` + CI check.

### Story 10.2: Elastic Schema — Additive-Only for Execution Audit Trail

**As a** solo founder, **I want** all execution log and runbook state schema changes to be strictly additive, **so that** the forensic audit trail of every production command ever executed is never corrupted or lost.

**Acceptance Criteria:**
- CI rejects any migration that removes, renames, or changes the type of existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes.
- All execution log parsers ignore unknown fields.
- Dual-write during migration windows within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).
- **Hard rule:** execution audit records are immutable. No `UPDATE` or `DELETE` statements are permitted on the `execution_log` table/collection. Ever.

**Estimate:** 3 points

**Dependencies:** Epic 5 (Audit Trail)

**Technical Notes:**
- Execution logs are append-only by design and by policy. Per Epic 5, the PostgreSQL application role has no `UPDATE` or `DELETE` grants on the audit table.
- For runbook definition versioning: every edit creates a new version record. Old versions are never mutated.
- Schema for parsed runbook steps: version the parser output format. V1 steps and V2 steps coexist.
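The coexisting V1/V2 parser output can be sketched as a versioned envelope. This is a minimal illustration under assumptions: the type and field names (`StepV1`, `StepV2`, `rollback_command`) are hypothetical, and a real implementation would derive serde traits, which ignore unknown fields on decode by default.

```rust
// Illustrative versioned shapes for parsed runbook steps. Evolution is
// additive only: StepV2 keeps every StepV1 field and adds new optional
// ones, so old records never need rewriting.
#[derive(Debug, Clone, PartialEq)]
struct StepV1 {
    order: u32,
    description: String,
    command: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct StepV2 {
    order: u32,
    description: String,
    command: Option<String>,
    rollback_command: Option<String>, // added in V2, absent in old data
}

// Both versions coexist in storage; readers match on the envelope.
#[derive(Debug, Clone, PartialEq)]
enum ParsedStep {
    V1(StepV1),
    V2(StepV2),
}

impl ParsedStep {
    /// Upgrade-on-read: old records are widened with defaults instead of
    /// being mutated in place, keeping the stored history immutable.
    fn into_latest(self) -> StepV2 {
        match self {
            ParsedStep::V1(s) => StepV2 {
                order: s.order,
                description: s.description,
                command: s.command,
                rollback_command: None,
            },
            ParsedStep::V2(s) => s,
        }
    }
}

fn main() {
    let old = ParsedStep::V1(StepV1 {
        order: 1,
        description: "check pods".into(),
        command: Some("kubectl get pods".into()),
    });
    // V1 record read through the V2 interface without touching storage.
    assert_eq!(old.into_latest().rollback_command, None);
}
```

Upgrade-on-read (rather than a rewrite migration) is what lets the `_v2` rule and the immutable `execution_log` rule coexist: the stored bytes never change, only the in-memory view widens.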
### Story 10.3: Cognitive Durability — Decision Logs for Execution Logic

**As a** future maintainer, **I want** every change to runbook parsing, step classification, or execution logic accompanied by a `decision_log.json`, **so that** I understand why the system interpreted step 3 as "destructive" and required approval.

**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/parser/`, `pkg/classifier/`, `pkg/executor/`, or `pkg/approval/`.
- Cyclomatic complexity cap of 10 enforced via linter. PRs exceeding this are blocked.
- Decision logs live in `docs/decisions/`.
- **Additional requirement:** every change to the "destructive command" classification list requires a decision log entry explaining why the command was added/removed.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**
- The destructive command list is the most sensitive config in the entire product. Changes must be documented, reviewed, and logged.
- Parser logic decision logs should include sample runbook snippets showing before/after interpretation.

### Story 10.4: Semantic Observability — AI Reasoning Spans on Classification & Execution

**As an** SRE manager investigating an incident caused by automated execution, **I want** every runbook classification and execution decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I have a complete forensic record of what the system decided, why, and what it executed.

**Acceptance Criteria:**
- Every runbook execution creates a parent `runbook_execution` span. Child spans for each step: `step_classification`, `step_approval_check`, `step_execution`.
- `step_classification` attributes: `step.text_hash`, `step.classified_as` (safe/destructive/ambiguous), `step.confidence_score`, `step.alternatives_considered`.
- `step_execution` attributes: `step.command_hash`, `step.target_host_hash`, `step.exit_code`, `step.duration_ms`, `step.output_truncated` (first 500 chars, hashed).
- If AI-assisted classification: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- `step_approval_check` attributes: `step.approval_required` (bool), `step.approval_source` (human/policy/auto), `step.approval_latency_ms`.
- No PII. No raw commands in spans — everything hashed. Full commands only in the encrypted audit log.

**Estimate:** 5 points

**Dependencies:** Epic 3 (Execution Engine), Epic 5 (Audit Trail)

**Technical Notes:**
- This is the highest-effort observability story across all 6 products because execution tracing is forensic-grade.
- Span hierarchy: `runbook_execution` → `step_N` → `[classification, approval, execution]`. Three levels deep.
- Output hashing: SHA-256 of command output for correlation. Raw output only in the encrypted audit store.

### Story 10.5: Configurable Autonomy — Governance for Production Execution

**As a** solo founder, **I want** a `policy.json` that strictly controls what dd0c/run is allowed to execute autonomously vs. what requires human approval, **so that** no automated system ever runs a destructive command without explicit authorization.

**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (all steps require human approval) or `audit` (safe steps auto-execute, destructive steps require approval).
- Default for ALL customers: `strict`. There is no "fully autonomous" mode for dd0c/run. Even in `audit` mode, destructive commands always require approval.
- `panic_mode`: when true, ALL execution halts immediately. In-flight steps are paused (not killed). All pending approvals are revoked. System enters read-only forensic mode.
- Governance drift monitoring: weekly report of auto-executed vs. human-approved steps. If the auto-execution ratio exceeds the configured threshold, the system auto-downgrades to `strict`.
- Per-customer, per-runbook governance overrides. Customers can lock specific runbooks to `strict` regardless of system mode.
- **Hard rule:** `panic_mode` must be triggerable in <1 second via API call, CLI command, Slack command, OR a physical hardware button (webhook endpoint).

**Estimate:** 5 points

**Dependencies:** Epic 3 (Execution Engine), Epic 4 (Slack Bot Copilot approval gates)

**Technical Notes:**
- No "full auto" mode is a deliberate product decision. dd0c/run assists humans; it doesn't replace them for destructive actions.
- Panic mode implementation: Redis key `dd0c:panic` checked at the top of every execution loop iteration. Webhook endpoint `POST /admin/panic` sets the key and broadcasts a halt signal via pub/sub.
- The <1 second requirement means panic cannot depend on a DB write — Redis pub/sub only.
- Governance drift: a cron job queries execution logs weekly. Threshold configurable per-org (default: 70% auto-execution triggers downgrade).

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 5 |
| 10.5 | Configurable Autonomy | 5 |
| **Total** | | **20** |
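The panic-mode check from Story 10.5's technical notes can be sketched as a process-local flag that mirrors the `dd0c:panic` Redis key. The Redis pub/sub wiring is omitted, and `PanicSwitch` and `run_steps` are illustrative names, not dd0c/run APIs; the point is that the execution loop only reads an in-process atomic, so no network round-trip sits on the hot path and the <1-second budget holds.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Process-local mirror of the `dd0c:panic` Redis key. A pub/sub listener
/// thread would call `trip()` when the halt signal is broadcast; the
/// execution loop only ever reads the flag.
#[derive(Clone)]
struct PanicSwitch(Arc<AtomicBool>);

impl PanicSwitch {
    fn new() -> Self {
        PanicSwitch(Arc::new(AtomicBool::new(false)))
    }
    fn trip(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_tripped(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Run steps until done or panic; unexecuted steps stay pending (paused,
/// not killed), so the run can resume after forensic review.
fn run_steps(switch: &PanicSwitch, steps: &[&str]) -> Vec<String> {
    let mut executed = Vec::new();
    for step in steps {
        // Checked at the top of every iteration, per the technical notes.
        if switch.is_tripped() {
            break;
        }
        executed.push(step.to_string());
    }
    executed
}

fn main() {
    let switch = PanicSwitch::new();
    let done = run_steps(&switch, &["kubectl get nodes", "kubectl get pods"]);
    assert_eq!(done.len(), 2);

    switch.trip();
    // Once tripped, nothing further executes until the switch is reset.
    assert!(run_steps(&switch, &["kubectl get nodes"]).is_empty());
}
```

In the real engine the listener thread subscribing to the halt broadcast would flip this flag, giving every in-flight execution loop sub-second reaction time without a database read per step.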