products/06-runbook-automation/epics/epics.md

# dd0c/run — V1 MVP Epics

This document breaks down the V1 MVP of dd0c/run into implementation Epics and User Stories. Scope is strictly limited to V1 (read-only execution 🟢, copilot approval for 🟡/🔴, no autopilot, no terminal watcher).

---

## Epic 1: Runbook Parser
**Description:** The "5-second wow moment." Ingest raw unstructured text from a paste event, normalize it, and use a fast LLM (e.g., Claude Haiku via dd0c/route) to extract a structured JSON representation of executable steps, variables, conditional branches, and prerequisites in under 5 seconds.

**Dependencies:** None. This is the foundational data structure.

**Technical Notes:** 
- Pure Rust normalizer to strip HTML/Markdown before hitting the LLM to save tokens and latency.
- LLM prompt must enforce a strict JSON schema.
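- Repeatable parsing: the same text at temperature 0 should produce the same output. Temperature 0 reduces but does not eliminate LLM nondeterminism, so verify repeatability in tests rather than assuming it.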
- Must run in < 3.5s to leave room for the Action Classifier within the 5s SLA.

### User Stories

**Story 1.1: Raw Text Normalization & Ingestion**
*As a runbook author (Morgan), I want to paste raw text from Confluence or Notion into the system, so that I don't have to learn a proprietary YAML DSL.*
- **Acceptance Criteria:**
  - System accepts raw text payload via API.
  - Rust-based normalizer strips HTML tags, markdown formatting, and normalizes whitespace/bullet points.
  - Original raw text and hash are preserved in the DB for audit/re-parsing.
- **Story Points:** 2
- **Dependencies:** None
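
The normalization in Story 1.1 can be sketched as follows. This is a minimal Python illustration only (the story specifies a Rust normalizer); the tag allowlist and the `normalize` and `ingest` names are assumptions made for this example, not part of the architecture.

```python
import hashlib
import html
import re

# Only well-known HTML tags are stripped, so runbook placeholders such as
# <instance-id> survive for the variable detector in Story 1.3.
# The tag list is an assumption for this sketch.
HTML_TAG = re.compile(
    r"</?(?:p|br|div|span|ul|ol|li|b|i|strong|em|code|pre|h[1-6]|a)(?=[\s/>])[^>]*>",
    re.IGNORECASE,
)
HEADING = re.compile(r"^\s*#{1,6}\s+")   # markdown heading markers
BULLET = re.compile(r"^\s*[-*•]\s+")     # bullet glyphs at line start


def normalize(raw: str) -> str:
    """Strip HTML tags and light markdown, unify bullets, collapse whitespace."""
    text = html.unescape(HTML_TAG.sub(" ", raw))
    lines = []
    for line in text.splitlines():
        line = HEADING.sub("", line)
        line = line.replace("**", "")      # bold markers only; single * and _ stay
        line = BULLET.sub("- ", line)
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


def ingest(raw: str) -> dict:
    """Keep the original text and its hash alongside the normalized form."""
    return {
        "raw": raw,
        "raw_sha256": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
        "normalized": normalize(raw),
    }
```

Single `*` and `_` characters are deliberately left alone, because they appear in real commands (`rm /tmp/cache_*`) and stripping them would corrupt the input to the parser.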

**Story 1.2: LLM Structured Step Extraction**
*As a system, I want to pass normalized text to a fast LLM to extract an ordered JSON array of steps and commands, so that the runbook becomes machine-readable.*
- **Acceptance Criteria:**
  - Sends normalized text to `dd0c/route` with a strict JSON schema prompt.
  - Correctly extracts step order, natural language description, and embedded CLI/shell commands.
  - P95 latency for extraction is < 3.5 seconds.
  - Rejects/errors gracefully if the text contains no actionable steps.
- **Story Points:** 3
- **Dependencies:** 1.1

**Story 1.3: Variable & Prerequisite Detection**
*As an on-call engineer (Riley), I want the parser to identify implicit prerequisites (like VPN) and variables (like `<instance-id>`), so that I'm fully prepared before the runbook starts.*
- **Acceptance Criteria:**
  - Regex/heuristic scanner identifies common placeholders (`$VAR`, `<var>`, `{var}`).
  - LLM identifies implicit requirements ("ensure you are on VPN", "requires prod AWS profile").
  - Outputs a structured array of `variables` and `prerequisites` in the JSON payload.
- **Story Points:** 2
- **Dependencies:** 1.2
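
A hedged sketch of the regex half of Story 1.3: one pattern covering the three placeholder styles named in the acceptance criteria. The function name `find_variables` and the exact character classes are assumptions for illustration.

```python
import re

PLACEHOLDER = re.compile(
    r"\$\{?(?P<shell>[A-Za-z_][A-Za-z0-9_]*)\}?"   # $VAR or ${VAR}
    r"|<(?P<angle>[A-Za-z][A-Za-z0-9_-]*)>"        # <instance-id>
    r"|\{(?P<brace>[A-Za-z_][A-Za-z0-9_]*)\}"      # {var}
)


def find_variables(command: str) -> list:
    """Return placeholder names in order of first appearance, without duplicates."""
    seen = []
    for match in PLACEHOLDER.finditer(command):
        name = match.group("shell") or match.group("angle") or match.group("brace")
        if name not in seen:
            seen.append(name)
    return seen
```

Requiring a letter or underscore as the first character keeps positional shell and awk references such as `$1` out of the `variables` array.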

**Story 1.4: Branching & Ambiguity Highlighting**
*As a runbook author (Morgan), I want the system to map conditional logic (if/else) and flag vague instructions, so that my runbooks are deterministic and clear.*
- **Acceptance Criteria:**
  - Parser detects simple "if X then Y" statements and maps them to a step execution DAG (Directed Acyclic Graph).
  - Identifies vague steps (e.g., "check the logs" without specifying which logs) and adds them to an `ambiguities` array for author review.
- **Story Points:** 3
- **Dependencies:** 1.2

## Epic 2: Action Classifier
**Description:** The safety-critical core of the system. Classifies every extracted command as 🟢 Safe, 🟡 Caution, 🔴 Dangerous, or ⬜ Unknown. Uses a defense-in-depth "dual-key" architecture: a deterministic compiled Rust scanner overrides an advisory LLM classifier. A misclassification here is an existential risk.

**Dependencies:** Epic 1 (needs parsed steps to classify)

**Technical Notes:** 
- The deterministic scanner must use compiled regex sets, with tree-sitter AST parsing for SQL/shell where pattern matching alone is not enough (see Story 2.1).
- Merge rules must be hardcoded in Rust (not configurable).
- LLM classification must run in parallel across steps to meet the 5-second total parse+classify SLA.
- If the LLM confidence is low or unknown, risk escalates. Scanner wins disagreements.

### User Stories

**Story 2.1: Deterministic Safety Scanner**
*As a system, I want to pattern-match commands against a compiled database of known signatures, so that destructive commands are definitively blocked without relying on an LLM.*
- **Acceptance Criteria:**
  - Rust library with compiled regex sets (allowlist 🟢, caution 🟡, blocklist 🔴).
  - Uses tree-sitter to parse SQL/shell ASTs to detect destructive patterns (e.g., `DELETE` without `WHERE`, piped `rm -rf`).
  - Executes in < 1ms per command.
  - Defaults to `Unknown` (🟡 minimum) if no patterns match.
- **Story Points:** 5
- **Dependencies:** None (standalone library)
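
The scanner's precedence can be illustrated with a small sketch. It is Python and regex-only, whereas the story calls for a compiled Rust library with tree-sitter ASTs; every pattern below is a hypothetical example, not the real signature database.

```python
import re
from enum import IntEnum
from typing import Optional, Tuple


class Risk(IntEnum):
    SAFE = 0
    CAUTION = 1
    DANGEROUS = 2


def _compile(*patterns: str):
    return [re.compile(p, re.IGNORECASE) for p in patterns]


# Hypothetical signatures for illustration only.
BLOCK_PATTERNS = _compile(
    r"\brm\s+-(?=\w*r)(?=\w*f)\w+",            # rm with both -r and -f
    r"\bdelete\s+from\s+\S+\s*(?:;|$)",        # DELETE without WHERE
    r"\bdrop\s+(?:table|database)\b",
)
CAUTION_PATTERNS = _compile(
    r"\bkubectl\s+(?:scale|apply|rollout\s+restart)\b",
    r"\bsystemctl\s+(?:restart|stop)\b",
)
ALLOW_PATTERNS = _compile(
    r"^\s*kubectl\s+(?:get|describe|logs)\b",
    r"^\s*(?:ls|cat|df|uptime)\b",
)
# Pipes, redirects, and substitutions disqualify a command from the allowlist.
CHAINING = re.compile(r"[;&|<>`]|\$\(")


def scan(command: str) -> Tuple[Risk, Optional[str]]:
    """Return (risk, matched pattern). No match means Unknown, floored at CAUTION."""
    for risk, patterns in ((Risk.DANGEROUS, BLOCK_PATTERNS), (Risk.CAUTION, CAUTION_PATTERNS)):
        for pattern in patterns:
            if pattern.search(command):
                return risk, pattern.pattern
    if not CHAINING.search(command):
        for pattern in ALLOW_PATTERNS:
            if pattern.search(command):
                return Risk.SAFE, pattern.pattern
    return Risk.CAUTION, None
```

Checking the blocklist first means a safe-looking prefix cannot mask a destructive suffix: `kubectl get pods | xargs rm -rf /data` is flagged before the allowlist is ever consulted.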

**Story 2.2: LLM Contextual Classifier**
*As a system, I want an LLM to assess the command in the context of surrounding steps, so that implicit state changes or custom scripts are caught.*
- **Acceptance Criteria:**
  - Sends the command, surrounding steps, and infrastructure context to an LLM via a strict JSON prompt.
  - Returns a classification (🟢/🟡/🔴), a confidence score, and suggested rollbacks.
  - Escalates the classification to the next risk tier if confidence < 0.9.
- **Story Points:** 3
- **Dependencies:** Epic 1 (requires structured context)

**Story 2.3: Classification Merge Engine**
*As an on-call engineer (Riley), I want the system to safely merge the LLM and deterministic scanner results, so that I can trust the final 🟢/🟡/🔴 label.*
- **Acceptance Criteria:**
  - Implements the 5 hardcoded merge rules exactly as architected (e.g., if Scanner says 🔴, final is 🔴; if Scanner says 🟡 and LLM says 🟢, final is 🟡).
  - The final risk level is `Safe` ONLY if both the Scanner and the LLM agree it is safe.
- **Story Points:** 2
- **Dependencies:** 2.1, 2.2
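
The confidence escalation from Story 2.2 and the merge from Story 2.3 can be sketched together. The five architected merge rules are not reproduced in this document, so the sketch assumes "take the stricter verdict", which satisfies the examples given above; treat it as an illustration in Python, not the hardcoded Rust rules.

```python
from enum import IntEnum


class Risk(IntEnum):
    SAFE = 0
    CAUTION = 1
    DANGEROUS = 2


CONFIDENCE_FLOOR = 0.9  # from Story 2.2


def escalate_if_unsure(llm_risk: Risk, confidence: float) -> Risk:
    """Bump the LLM verdict one tier when its confidence is below the floor."""
    if confidence < CONFIDENCE_FLOOR and llm_risk < Risk.DANGEROUS:
        return Risk(llm_risk + 1)
    return llm_risk


def merge(scanner: Risk, llm: Risk, llm_confidence: float) -> Risk:
    """Assumed merge: the stricter of the scanner and the (escalated) LLM verdict.

    The result is SAFE only when both inputs are SAFE, and a scanner verdict
    can never be lowered by the LLM.
    """
    return max(scanner, escalate_if_unsure(llm, llm_confidence))
```

For example, `merge(Risk.CAUTION, Risk.SAFE, 0.99)` yields `Risk.CAUTION`, and a confident-looking `SAFE` from both sides still becomes `CAUTION` if the LLM confidence is 0.5.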

**Story 2.4: Immutable Classification Audit**
*As an SRE manager (Jordan), I want every classification decision logged with full context, so that I have a forensic record of why the system made its choice.*
- **Acceptance Criteria:**
  - Emits a `runbook.classified` event to the PostgreSQL database for every step.
  - Log includes: scanner result (matched patterns), LLM reasoning, confidence scores, final classification, and merge rule applied.
- **Story Points:** 1
- **Dependencies:** 2.3

## Epic 3: Execution Engine
**Description:** A state machine orchestrating the step-by-step runbook execution, enforcing the Trust Gradient (Level 0, 1, 2) at every transition. It never auto-executes 🟡 or 🔴 commands in V1. Handles timeouts, rollbacks, and tracking skipped steps.

**Dependencies:** Epics 1 & 2 (needs classified runbooks to execute)

**Technical Notes:** 
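- V1 is Copilot-only. No state transition may auto-execute 🟡 or 🔴 steps: both wait in `AwaitApproval`, with a button approval for 🟡 and a typed confirmation for 🔴 (Epic 4).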
- Engine must communicate with the Agent via gRPC over mTLS.
- Each step execution receives a unique ID to prevent duplicate deliveries.
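
One way to read the unique-ID note is as an idempotency key: if the same step is delivered twice, the receiver replays the stored result instead of running the command again. The sketch below assumes that reading; `StepDispatcher` and its in-memory store are hypothetical names for illustration.

```python
import uuid


class StepDispatcher:
    """Runs each step ID at most once and replays the result for duplicates."""

    def __init__(self):
        self._results = {}  # step_id -> result of the first delivery

    def new_step_id(self) -> str:
        return str(uuid.uuid4())

    def deliver(self, step_id: str, run):
        if step_id in self._results:
            # Duplicate delivery: do not execute the command a second time.
            return self._results[step_id]
        result = run()
        self._results[step_id] = result
        return result
```

A real implementation would need durable storage for the seen IDs so that the guarantee survives a restart.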

### User Stories

**Story 3.1: Execution State Machine**
*As a system, I want a state machine to orchestrate step-by-step runbook execution, so that no two commands execute simultaneously and the Trust Gradient is enforced.*
- **Acceptance Criteria:**
  - Starts in `Pending` state upon alert match.
  - Progresses to `AutoExecute` ONLY if the step is 🟢 and trust level allows.
  - Transitions to `AwaitApproval` for 🟡/🔴 steps, blocking execution until approved.
  - Aborts or transitions to `ManualIntervention` on timeout (e.g., 60s for 🟢, 300s for 🔴).
- **Story Points:** 5
- **Dependencies:** 2.3
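
The gating rules in Story 3.1 reduce to a small transition table. The sketch below uses the state names from this epic plus a hypothetical `EXECUTING` state; it also assumes that a 🟢 step falls back to `AwaitApproval` when the trust level does not allow auto-execution, which the story does not spell out.

```python
from enum import Enum, auto


class Risk(Enum):
    SAFE = auto()
    CAUTION = auto()
    DANGEROUS = auto()


class State(Enum):
    PENDING = auto()
    AUTO_EXECUTE = auto()
    AWAIT_APPROVAL = auto()
    EXECUTING = auto()            # hypothetical name, not in the story
    MANUAL_INTERVENTION = auto()


def gate(risk: Risk, trust_allows_auto: bool) -> State:
    """Only a SAFE step may skip the approval gate, and only if trust allows it."""
    if risk is Risk.SAFE and trust_allows_auto:
        return State.AUTO_EXECUTE
    return State.AWAIT_APPROVAL


TRANSITIONS = {
    (State.AWAIT_APPROVAL, "approved"): State.EXECUTING,
    (State.AWAIT_APPROVAL, "timeout"): State.MANUAL_INTERVENTION,
    (State.AUTO_EXECUTE, "timeout"): State.MANUAL_INTERVENTION,
}


def on_event(state: State, event: str) -> State:
    """Apply an event; anything not in the table is rejected, never guessed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}")
```

Rejecting unknown `(state, event)` pairs keeps the engine fail-closed: an unexpected message cannot move a step into execution.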

**Story 3.2: gRPC Agent Communication Protocol**
*As a system, I want to communicate securely with the dd0c Agent in the customer VPC, so that commands are executed safely without inbound firewall rules.*
- **Acceptance Criteria:**
  - Outbound-only gRPC streaming connection initiated by the Agent.
  - Engine sends `ExecuteStep` payload (command, timeout, risk level, environment variables).
  - Agent streams `StepOutput` (stdout/stderr) back to the Engine.
  - Agent returns `StepResult` (exit code, duration, stdout/stderr hashes) on completion.
- **Story Points:** 5
- **Dependencies:** 3.1

**Story 3.3: Rollback Integration**
*As an on-call engineer (Riley), I want every state-changing step to record its inverse command, so that I can undo it with one click if it fails.*
- **Acceptance Criteria:**
  - If a step fails, the Engine transitions to `RollbackAvailable` and awaits human approval.
  - Engine stores the rollback command before executing the forward command.
  - Executing rollback triggers an independent execution step and returns the state machine to `StepReady` or `ManualIntervention`.
- **Story Points:** 3
- **Dependencies:** 3.1

**Story 3.4: Divergence Analysis**
*As an SRE manager (Jordan), I want the system to track what the engineer actually did vs. what the runbook prescribed, so that runbooks get better over time.*
- **Acceptance Criteria:**
  - Post-execution analyzer compares prescribed steps with executed/skipped steps and any unlisted commands detected by the agent.
  - Flags skipped steps, modified commands, and unlisted actions.
  - Emits a `divergence.detected` event with suggested runbook updates.
- **Story Points:** 3
- **Dependencies:** 3.1

## Epic 4: Slack Bot Copilot
**Description:** The primary 3am interface for on-call engineers. Guides the engineer step-by-step through an active incident, enforcing the Trust Gradient via interactive buttons for 🟡 actions and explicit typed confirmations for 🔴 actions.

**Dependencies:** Epic 3 (needs execution state to drive the UI)

**Technical Notes:** 
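- Rust service connected to Slack in Socket Mode (no inbound webhooks). Slack's official Bolt SDKs cover JavaScript, Python, and Java only, so the Rust side needs a community Slack client or a thin Bolt sidecar.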
- Must use Slack Block Kit for formatting.
- Respects Slack's 1 message/sec/channel rate limit by batching rapid updates.
- Uses dedicated threads to keep main channels clean.

### User Stories

**Story 4.1: Alert Matching & Notification**
*As an on-call engineer (Riley), I want a Slack message with the relevant runbook when an alert fires, so that I don't have to search for it.*
- **Acceptance Criteria:**
  - Integrates with PagerDuty/OpsGenie webhook payloads.
  - Matches alert context (service, region, keywords) to the most relevant runbook.
  - Posts a `🔔 Runbook matched` message with [▶ Start Copilot] button in the incident channel.
- **Story Points:** 3
- **Dependencies:** None

**Story 4.2: Step-by-Step Interactive UI**
*As an on-call engineer (Riley), I want the bot to guide me through the runbook step-by-step in a thread, so that I can focus on one command at a time.*
- **Acceptance Criteria:**
  - Tapping [▶ Start Copilot] opens a thread and triggers the Execution Engine.
  - Shows 🟢 steps auto-executing with live stdout snippets.
  - Shows 🟡/🔴 steps awaiting approval with the command details and the rollback command.
  - Includes [⏭ Skip] and [🛑 Abort] buttons for every step.
- **Story Points:** 5
- **Dependencies:** 3.1, 4.1

**Story 4.3: Risk-Aware Approval Gates**
*As a system, I want to block dangerous commands until explicit confirmation is provided, so that accidental misclicks don't cause outages.*
- **Acceptance Criteria:**
  - 🟡 steps require clicking [✅ Approve] or [✏️ Edit].
  - 🔴 steps require opening a text input modal and typing the resource name exactly.
  - Cannot bulk-approve steps. Each step must be individually gated.
  - Approver's Slack identity is captured and passed to the Execution Engine.
- **Story Points:** 3
- **Dependencies:** 4.2

**Story 4.4: Real-Time Output Streaming**
*As an on-call engineer (Riley), I want to see the execution output of commands in real-time, so that I know what's happening.*
- **Acceptance Criteria:**
  - Slack messages update dynamically with command stdout/stderr.
  - Batches rapid output updates to respect Slack rate limits.
  - Truncates long outputs and provides a link to the full log in the dashboard.
- **Story Points:** 2
- **Dependencies:** 4.2
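
The batching and truncation in Story 4.4 can be sketched as a buffer that releases at most one update per interval. `OutputBatcher`, the `max_chars` default, and the truncation marker text are assumptions for this example; timestamps are passed in explicitly so the behavior is easy to test.

```python
from typing import Optional


class OutputBatcher:
    """Accumulates command output and emits at most one update per interval."""

    def __init__(self, min_interval_s: float = 1.0, max_chars: int = 3000):
        self.min_interval_s = min_interval_s
        self.max_chars = max_chars          # hypothetical display budget
        self._output = ""
        self._dirty = False
        self._last_flush: Optional[float] = None

    def push(self, chunk: str, now: float) -> Optional[str]:
        """Buffer a chunk; return message text if an update is due, else None."""
        self._output += chunk
        self._dirty = True
        return self.flush_if_due(now)

    def flush_if_due(self, now: float) -> Optional[str]:
        if not self._dirty:
            return None
        if self._last_flush is not None and now - self._last_flush < self.min_interval_s:
            return None
        self._last_flush = now
        self._dirty = False
        return self._render()

    def _render(self) -> str:
        if len(self._output) <= self.max_chars:
            return self._output
        # Keep the tail: the most recent output matters most during an incident.
        return "[truncated, full log in dashboard]\n" + self._output[-self.max_chars:]
```

A caller would also invoke `flush_if_due` on a timer so that the last buffered chunk is not stranded when output stops.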

## Epic 5: Audit Trail
**Description:** The compliance backbone and forensic record. An append-only, immutable, partitioned PostgreSQL log that tracks every runbook parse, classification, approval, execution, and rollback.

**Dependencies:** Epics 2, 3, 4 (needs events to log)

**Technical Notes:** 
- PostgreSQL partitioned table (by month) for query performance.
- Application role must have `INSERT` and `SELECT` only. No `UPDATE` or `DELETE` grants.
- Row-Level Security (RLS) enforces tenant isolation at the database level.
- V1 includes a basic PDF/CSV export for SOC 2 readiness.

### User Stories

**Story 5.1: Append-Only Audit Schema**
*As a system, I want a partitioned database table to store all execution events immutably, so that no one can tamper with the forensic record.*
- **Acceptance Criteria:**
  - Creates a PostgreSQL table partitioned by month.
  - Revokes `UPDATE` and `DELETE` grants from the application role.
  - Schema captures `tenant_id`, `event_type`, `execution_id`, `actor_id`, `event_data` (JSONB), and `created_at`.
- **Story Points:** 2
- **Dependencies:** None

**Story 5.2: Event Ingestion Pipeline**
*As an SRE manager (Jordan), I want every action recorded in real-time, so that I know exactly who did what during an incident.*
- **Acceptance Criteria:**
  - Ingests events from Parser, Classifier, Execution Engine, and Slack Bot.
  - Logs `runbook.parsed`, `runbook.classified`, `step.auto_executed`, `step.approved`, `step.executed`, etc., exactly as architected.
  - Captures the exact command executed, exit code, and stdout/stderr hashes.
- **Story Points:** 3
- **Dependencies:** 5.1

**Story 5.3: Compliance Export**
*As an SRE manager (Jordan), I want to export timestamped execution logs, so that I can provide evidence to SOC 2 auditors.*
- **Acceptance Criteria:**
  - Generates a PDF or CSV of an execution run, including the approval chain and audit trail.
  - Includes risk classifications and any divergence/modifications made by the engineer.
  - Links to S3 for full stdout/stderr logs if needed.
- **Story Points:** 3
- **Dependencies:** 5.2

**Story 5.4: Multi-Tenant Data Isolation (RLS)**
*As a system, I want to ensure no tenant can see another tenant's data, so that security boundaries are guaranteed at the database level.*
- **Acceptance Criteria:**
  - Enables Row-Level Security (RLS) on all tenant-scoped tables (`runbooks`, `executions`, `audit_events`).
  - API middleware sets `app.current_tenant_id` session variable on every database connection.
  - Cross-tenant queries return zero rows, not an error.
- **Story Points:** 2
- **Dependencies:** 5.1

## Epic 6: Dashboard API
**Description:** The control plane REST API served by the dd0c API Gateway. Handles runbook CRUD, parsing endpoints, execution history, and integration with the Alert-Runbook Matcher. Provides the backend for the Dashboard UI and external integrations.

**Dependencies:** Epics 1, 3, 5

**Technical Notes:** 
- Served via Axum (Rust) and secured with JWT.
- Integrates with the shared dd0c API Gateway.
- Enforces tenant isolation (RLS) on every request.

### User Stories

**Story 6.1: Runbook CRUD & Parsing Endpoints**
*As a runbook author (Morgan), I want to create, read, update, and soft-delete runbooks via a REST API, so that I can manage my team's active runbooks.*
- **Acceptance Criteria:**
  - Implements `POST /runbooks` (paste raw text → auto-parse), `GET /runbooks`, `GET /runbooks/:id/versions`, `PUT /runbooks/:id`, `DELETE /runbooks/:id`.
  - Implements `POST /runbooks/parse-preview` for the 5-second wow moment (parses without saving).
  - Returns 422 if parsing fails.
- **Story Points:** 3
- **Dependencies:** 1.2

**Story 6.2: Execution History & Status Queries**
*As an SRE manager (Jordan), I want to view active and completed execution runs, so that I can monitor incident response times and skipped steps.*
- **Acceptance Criteria:**
  - Implements `POST /executions` to start copilot manually.
  - Implements `GET /executions` and `GET /executions/:id` to retrieve status, steps, exit codes, and durations.
  - Implements `GET /executions/:id/divergence` to get the post-execution analysis (skipped steps, modified commands).
- **Story Points:** 2
- **Dependencies:** 3.1, 5.2

**Story 6.3: Alert-Runbook Matcher Integration**
*As a system, I want to match incoming webhooks (e.g., PagerDuty) to the correct runbook, so that the Slack bot can post the runbook immediately.*
- **Acceptance Criteria:**
  - Implements `POST /webhooks/pagerduty`, `POST /webhooks/opsgenie`, and `POST /webhooks/dd0c-alert`.
  - Uses keyword + metadata matching (and optionally pgvector similarity) to find the best runbook for the alert payload.
  - Generates the `alert_context` and triggers the Slack bot notification (Epic 4).
- **Story Points:** 3
- **Dependencies:** 6.1

**Story 6.4: Classification Query Endpoints**
*As a runbook author (Morgan), I want to query the classification engine directly, so that I can test commands before publishing a runbook.*
- **Acceptance Criteria:**
  - Implements `POST /classify` for testing/debugging.
  - Implements `GET /classifications/:step_id` to retrieve full classification details.
  - Rate-limited to 30 req/min per tenant.
- **Story Points:** 1
- **Dependencies:** 2.3
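
The 30 req/min per-tenant limit could be enforced with a sliding-window counter. This is a single-process sketch with hypothetical names; a deployment with several API instances would need shared state instead of an in-memory dictionary.

```python
from collections import defaultdict, deque


class TenantRateLimiter:
    """Sliding-window limiter: at most `limit` requests per `window_s` per tenant."""

    def __init__(self, limit: int = 30, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # tenant_id -> timestamps of accepted requests

    def allow(self, tenant_id: str, now: float) -> bool:
        hits = self._hits[tenant_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Rejected requests are not recorded, so a tenant that keeps retrying while throttled does not extend its own lockout.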

## Epic 7: Dashboard UI
**Description:** A React Single-Page Application (SPA) for runbook authors and managers to view runbooks, past executions, and risk classifications. The primary interface for onboarding, reviewing runbooks, and analyzing post-incident data.

**Dependencies:** Epic 6 (needs APIs to consume)

**Technical Notes:** 
- React SPA, integrates with the shared dd0c portal.
- Displays the 5-second "wow moment" parse preview.
- Must visually distinguish 🟢 Safe, 🟡 Caution, and 🔴 Dangerous commands.

### User Stories

**Story 7.1: Runbook Paste & Preview UI**
*As a runbook author (Morgan), I want to paste raw text and instantly see the parsed steps, so that I can verify the structure before saving.*
- **Acceptance Criteria:**
  - Large text area for pasting raw runbook text.
  - Calls `POST /runbooks/parse-preview` and displays the structured steps.
  - Highlights variables, prerequisites, and ambiguities.
  - Allows editing the raw text to trigger a re-parse.
- **Story Points:** 3
- **Dependencies:** 6.1

**Story 7.2: Execution Timeline & Divergence View**
*As an SRE manager (Jordan), I want to view a timeline of an incident's execution, so that I can see what was skipped or modified.*
- **Acceptance Criteria:**
  - Displays the execution run with a visual timeline of steps.
  - Shows who approved each step, the exact command executed, and the exit code.
  - Highlights divergence (skipped steps, modified commands, unlisted actions).
  - Provides an "Apply Updates" button to update the runbook based on divergence.
- **Story Points:** 5
- **Dependencies:** 6.2

**Story 7.3: Trust Level & Risk Visualization**
*As a runbook author (Morgan), I want to clearly see the risk level of each step in my runbook, so that I know what requires approval.*
- **Acceptance Criteria:**
  - Color-codes steps: 🟢 Green (Safe), 🟡 Yellow (Caution), 🔴 Red (Dangerous).
  - Clicking a risk badge shows the classification reasoning (scanner rules matched, LLM explanation).
  - Displays the runbook's overall Trust Level (0, 1, 2) and allows authorized users to change it (V1 max = 2).
- **Story Points:** 3
- **Dependencies:** 7.1

**Story 7.4: Basic Health & MTTR Dashboard**
*As an SRE manager (Jordan), I want a high-level view of my team's runbooks and execution stats, so that I can measure the value of dd0c/run.*
- **Acceptance Criteria:**
  - Displays a list of runbooks, their coverage, and average staleness.
  - Shows MTTR (Mean Time To Resolution) for incidents handled via Copilot.
  - Displays the total number of Copilot runs and skipped steps per month.
- **Story Points:** 2
- **Dependencies:** 6.2

## Epic 8: Infrastructure & DevOps
**Description:** The foundational cloud infrastructure and CI/CD pipelines required to run the dd0c/run SaaS platform and build the customer-facing Agent. Ensures strict security isolation (mTLS), observability, and zero-downtime ECS Fargate deployments.

**Dependencies:** None (can be built in parallel with Epic 1).

**Technical Notes:** 
- All infrastructure defined as code (Terraform) and deployed to AWS.
- ECS Fargate for all core services (Parser, Classifier, Engine, APIs).
- Agent binary built for Linux x86_64 and ARM64 via GitHub Actions.
- mTLS certificate generation and rotation pipeline for Agent authentication.

### User Stories

**Story 8.1: Core AWS Infrastructure Provisioning**
*As a system administrator (Brian), I want the base AWS infrastructure provisioned via Terraform, so that the services have a secure, scalable environment to run in.*
- **Acceptance Criteria:**
  - Terraform configures VPC, public/private subnets, NAT Gateway, and ALB.
  - Provisions RDS PostgreSQL 16 (Multi-AZ) with pgvector and S3 buckets for audit/logs.
  - Provisions ECS Fargate cluster and SQS queues.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.2: CI/CD Pipeline & Agent Build**
*As a developer (Brian), I want automated build and deployment pipelines, so that I can ship code to production safely and distribute the Agent binary.*
- **Acceptance Criteria:**
  - GitHub Actions pipeline runs `cargo clippy`, `cargo test`, and the "canary test suite" on every PR.
  - Merges to `main` auto-deploy ECS Fargate services with zero downtime.
  - Release tags compile the Agent binary (x86_64, aarch64), sign it, and publish to GitHub Releases / S3.
- **Story Points:** 3
- **Dependencies:** None

**Story 8.3: Agent mTLS & gRPC Setup**
*As a system, I want secure outbound-only gRPC communication between the Agent and the Execution Engine, so that commands are transmitted safely without VPNs or inbound firewall rules.*
- **Acceptance Criteria:**
  - Internal CA issues tenant-scoped mTLS certificates.
  - API Gateway terminates mTLS and validates the tenant ID in the certificate against the request.
  - Agent successfully establishes a persistent, outbound-only gRPC stream to the Engine.
- **Story Points:** 5
- **Dependencies:** 8.1

**Story 8.4: Observability & Party Mode Alerting**
*As the solo on-call engineer (Brian), I want comprehensive monitoring and alerting, so that I am woken up if a safety invariant is violated (Party Mode).*
- **Acceptance Criteria:**
  - OpenTelemetry (OTEL) collector deployed and routing metrics/traces to Grafana Cloud.
  - PagerDuty integration configured for P1 alerts.
  - Alert fires immediately if the "Party Mode" database flag is set or if Agent heartbeats drop unexpectedly.
- **Story Points:** 2
- **Dependencies:** 8.1

## Epic 9: Onboarding & PLG
**Description:** Product-Led Growth (PLG) and frictionless onboarding. Focuses on the "5-second wow moment" where a user can paste a runbook and instantly see the parsed, classified steps without installing an agent, plus a self-serve tier to capture leads.

**Dependencies:** Epics 1, 6, 7.

**Technical Notes:** 
- The demo uses the `POST /runbooks/parse-preview` endpoint (does not persist data).
- Tenant provisioning must be fully automated upon signup.
- Free tier enforced via database limits (no Stripe required initially for free tier).

### User Stories

**Story 9.1: The 5-Second Interactive Demo**
*As a prospective customer, I want to paste my messy runbook into the marketing website and see it parsed instantly, so that I immediately understand the product's value without signing up.*
- **Acceptance Criteria:**
  - Landing page features a prominent text area for pasting a runbook.
  - Submitting calls the unauthenticated parse-preview endpoint (rate-limited by IP).
  - Renders the structured steps with 🟢/🟡/🔴 risk badges dynamically.
- **Story Points:** 3
- **Dependencies:** 6.1, 7.1

**Story 9.2: Self-Serve Signup & Tenant Provisioning**
*As a new user, I want to create an account and get my tenant provisioned instantly, so that I can start using the product without talking to sales.*
- **Acceptance Criteria:**
  - OAuth/Email signup flow creates a User and a Tenant record in PostgreSQL.
  - Automated worker initializes the tenant workspace, generates the initial mTLS certs, and provides the Agent installation command.
  - Limits the free tier to 5 active runbooks and 50 executions/month.
- **Story Points:** 3
- **Dependencies:** 8.1

**Story 9.3: Agent Installation Wizard**
*As a new user, I want a simple copy-paste command to install the read-only Agent in my infrastructure, so that I can execute my first runbook within 10 minutes of signup.*
- **Acceptance Criteria:**
  - Dashboard provides a dynamically generated `curl | bash` or `kubectl apply` snippet.
  - Snippet includes the tenant-specific mTLS certificate and Agent binary download.
  - Dashboard shows a live "Waiting for Agent heartbeat..." state that turns green when the Agent connects.
- **Story Points:** 3
- **Dependencies:** 8.3, 9.2

**Story 9.4: First Runbook Copilot Walkthrough**
*As a new user, I want an interactive guide for my first runbook execution, so that I learn how the Copilot approval flow and Trust Gradient work safely.*
- **Acceptance Criteria:**
  - System automatically provisions a "Hello World" runbook containing safe read-only commands (e.g., `kubectl get nodes`).
  - In-app tooltip guides the user to trigger the runbook via the Slack integration.
  - Completing this first execution unlocks the "Runbook Author" badge/status.
- **Story Points:** 2
- **Dependencies:** 4.2, 9.3


---

## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/run adheres to the 5 Transparent Factory tenets. Of all 6 dd0c products, runbook automation carries the highest governance burden — it executes commands in production infrastructure. Every tenet here is load-bearing.

### Story 10.1: Atomic Flagging — Feature Flags for Execution Behaviors
**As a** solo founder, **I want** every new runbook execution capability, command parser, and auto-remediation behavior behind a feature flag (default: off), **so that** a bad parser change never executes the wrong command in a customer's production environment.

**Acceptance Criteria:**
- OpenFeature SDK integrated into the execution engine. V1: JSON file provider.
- All flags evaluate locally — no network calls during runbook execution.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged execution path fails >2 times in any 10-minute window, the flag auto-disables and all in-flight executions using that path are paused (not killed).
- Flags required for: new command parsers, conditional logic handlers, new infrastructure targets (K8s, AWS, bare metal), auto-retry behaviors, approval bypass rules.
- **Hard rule:** any flag that controls execution of destructive commands (`rm`, `delete`, `terminate`, `drop`) requires a 48-hour bake time at 10% rollout before full enablement.

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine)
**Technical Notes:**
- Circuit breaker threshold is intentionally low (it trips on more than 2 failures in a 10-minute window) because production execution failures are high-severity.
- Pause vs. kill: paused executions hold state and can be resumed after review. Killing mid-execution risks partial state.
- 48-hour bake for destructive flags: enforced via flag metadata `destructive: true` + CI check.
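
The circuit breaker from the acceptance criteria can be sketched as a rolling failure window. `FlagCircuitBreaker` is a hypothetical name, and pausing in-flight executions is left out; the sketch only shows when the flag flips off.

```python
from collections import deque


class FlagCircuitBreaker:
    """Disables a flag once failures exceed `max_failures` within `window_s`."""

    def __init__(self, max_failures: int = 2, window_s: float = 600.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.enabled = True
        self._failures = deque()  # timestamps of recent failures

    def record_failure(self, now: float) -> bool:
        """Record one failure and return whether the flag is still enabled."""
        self._failures.append(now)
        # Forget failures older than the window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) > self.max_failures:
            # Stays off until a human re-enables it after review.
            self.enabled = False
        return self.enabled
```

With the defaults, the third failure inside any 10-minute span disables the flag, while failures spread further apart do not.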

### Story 10.2: Elastic Schema — Additive-Only for Execution Audit Trail
**As a** solo founder, **I want** all execution log and runbook state schema changes to be strictly additive, **so that** the forensic audit trail of every production command ever executed is never corrupted or lost.

**Acceptance Criteria:**
- CI rejects any migration that removes, renames, or changes type of existing columns/attributes.
- New fields use `_v2` suffix for breaking changes.
- All execution log parsers ignore unknown fields.
- Dual-write during migration windows within the same transaction.
- Every migration includes `sunset_date` comment (max 30 days).
- **Hard rule:** execution audit records are immutable. No `UPDATE` or `DELETE` statements are permitted on the `execution_log` table/collection. Ever.

**Estimate:** 3 points
**Dependencies:** Epic 5 (Audit Trail)
**Technical Notes:**
- Execution logs are append-only by design and by policy. Consistent with Epic 5, enforce this in PostgreSQL by granting the application role `INSERT` and `SELECT` only (no `UPDATE` or `DELETE`).
- For runbook definition versioning: every edit creates a new version record. Old versions are never mutated.
- Schema for parsed runbook steps: version the parser output format. V1 steps and V2 steps coexist.
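
"Parsers ignore unknown fields" is the tolerant-reader pattern, sketched here with a reduced record. The three field names come from the audit schema in Story 5.1; `ExecutionLogEntry` and `parse_entry` are hypothetical names for this illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExecutionLogEntry:
    execution_id: str
    event_type: str
    created_at: str


def parse_entry(raw_json: str) -> ExecutionLogEntry:
    """Build an entry from known fields and silently skip anything newer."""
    data = json.loads(raw_json)
    known = {f.name for f in fields(ExecutionLogEntry)}
    return ExecutionLogEntry(**{k: v for k, v in data.items() if k in known})
```

A reader written this way keeps working when a later migration adds a column such as `exit_code_v2`, which is what makes additive-only changes safe to roll out.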

### Story 10.3: Cognitive Durability — Decision Logs for Execution Logic
**As a** future maintainer, **I want** every change to runbook parsing, step classification, or execution logic accompanied by a `decision_log.json`, **so that** I understand why the system interpreted step 3 as "destructive" and required approval.

**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/parser/`, `pkg/classifier/`, `pkg/executor/`, or `pkg/approval/`.
- Cyclomatic complexity cap of 10 enforced via linter. PRs exceeding this are blocked.
- Decision logs in `docs/decisions/`.
- **Additional requirement:** every change to the "destructive command" classification list requires a decision log entry explaining why the command was added/removed.

**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- The destructive command list is the most sensitive config in the entire product. Changes must be documented, reviewed, and logged.
- Parser logic decision logs should include sample runbook snippets showing before/after interpretation.

### Story 10.4: Semantic Observability — AI Reasoning Spans on Classification & Execution
**As an** SRE manager investigating an incident caused by automated execution, **I want** every runbook classification and execution decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I have a complete forensic record of what the system decided, why, and what it executed.

**Acceptance Criteria:**
- Every runbook execution creates a parent `runbook_execution` span. Each step gets a `step_N` child span, which in turn parents three spans: `step_classification`, `step_approval_check`, and `step_execution`.
- `step_classification` attributes: `step.text_hash`, `step.classified_as` (safe/caution/dangerous/unknown, matching the Epic 2 risk tiers), `step.confidence_score`, `step.alternatives_considered`.
- `step_execution` attributes: `step.command_hash`, `step.target_host_hash`, `step.exit_code`, `step.duration_ms`, `step.output_truncated` (first 500 chars, hashed).
- If AI-assisted classification: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- `step_approval_check` attributes: `step.approval_required` (bool), `step.approval_source` (human/policy/auto), `step.approval_latency_ms`.
- No PII. No raw commands in spans — everything hashed. Full commands only in the encrypted audit log.

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine), Epic 5 (Audit Trail)
**Technical Notes:**
- This is the highest-effort observability story across all 6 products because execution tracing is forensic-grade.
- Span hierarchy: `runbook_execution` → `step_N` → `[classification, approval, execution]`. Three levels deep.
- Output hashing: SHA-256 of command output for correlation. Raw output only in encrypted audit store.
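
The "everything hashed" rule for `step_execution` attributes can be sketched without any tracing SDK, by building the attribute dictionary that would be attached to the span. The helper names are assumptions; hashing the first 500 characters of output follows the acceptance criterion above.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def step_execution_attributes(
    command: str, target_host: str, exit_code: int, duration_ms: int, output: str
) -> dict:
    """Span attributes for one executed step; no raw command, host, or output."""
    return {
        "step.command_hash": sha256_hex(command),
        "step.target_host_hash": sha256_hex(target_host),
        "step.exit_code": exit_code,
        "step.duration_ms": duration_ms,
        "step.output_truncated": sha256_hex(output[:500]),
    }
```

Because the same command always hashes to the same value, spans can still be correlated with the encrypted audit log without exposing the command text.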

### Story 10.5: Configurable Autonomy — Governance for Production Execution
**As a** solo founder, **I want** a `policy.json` that strictly controls what dd0c/run is allowed to execute autonomously vs. what requires human approval, **so that** no automated system ever runs a destructive command without explicit authorization.

**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (all steps require human approval) or `audit` (safe steps auto-execute, destructive steps require approval).
- Default for ALL customers: `strict`. There is no "fully autonomous" mode for dd0c/run. Even in `audit` mode, destructive commands always require approval.
- `panic_mode`: when true, ALL execution halts immediately. In-flight steps are paused (not killed). All pending approvals are revoked. System enters read-only forensic mode.
- Governance drift monitoring: weekly report of auto-executed vs. human-approved steps. If auto-execution ratio exceeds the configured threshold, system auto-downgrades to `strict`.
- Per-customer, per-runbook governance overrides. Customers can lock specific runbooks to `strict` regardless of system mode.
- **Hard rule:** `panic_mode` must be triggerable in <1 second via API call, CLI command, Slack command, OR a physical hardware button (webhook endpoint).

**Estimate:** 5 points
**Dependencies:** Epic 3 (Execution Engine), Epic 4 (Slack Bot Copilot approval gates)
**Technical Notes:**
- No "full auto" mode is a deliberate product decision. dd0c/run assists humans, it doesn't replace them for destructive actions.
- Panic mode implementation: Redis key `dd0c:panic` checked at the top of every execution loop iteration. Webhook endpoint `POST /admin/panic` sets the key and broadcasts a halt signal via pub/sub.
- The <1 second requirement means panic cannot depend on a DB write — Redis pub/sub only.
- Governance drift: cron job queries execution logs weekly. Threshold configurable per-org (default: 70% auto-execution triggers downgrade).
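
The governance rules above can be condensed into two small functions. This sketch assumes a step is either "safe" or not (how 🟡 steps map onto `audit` mode is not stated here), and the function names are hypothetical.

```python
def decide(governance_mode: str, step_is_safe: bool, panic_mode: bool) -> str:
    """Return 'halt', 'auto_execute', or 'require_approval' for one step."""
    if governance_mode not in ("strict", "audit"):
        raise ValueError(f"unknown governance_mode: {governance_mode!r}")
    if panic_mode:
        return "halt"                      # panic overrides everything
    if governance_mode == "audit" and step_is_safe:
        return "auto_execute"
    return "require_approval"              # strict mode, or any non-safe step


def effective_mode(
    configured_mode: str, auto_executed: int, human_approved: int, threshold: float = 0.70
) -> str:
    """Downgrade to 'strict' when the auto-execution ratio exceeds the threshold."""
    total = auto_executed + human_approved
    if total and auto_executed / total > threshold:
        return "strict"
    return configured_mode
```

For example, a week with 80 auto-executed and 20 human-approved steps has a ratio of 0.8, so `effective_mode("audit", 80, 20)` returns `"strict"`.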

### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 5 |
| 10.5 | Configurable Autonomy | 5 |
| **Total** | | **20** |