Max Mayfield 5ee95d8b13 dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00

dd0c/run — V1 MVP Epics

This document breaks down the V1 MVP of dd0c/run into implementation Epics and User Stories. Scope is strictly limited to V1 (read-only execution 🟢, copilot approval for 🟡/🔴, no autopilot, no terminal watcher).


Epic 1: Runbook Parser

Description: The "5-second wow moment." Ingest raw unstructured text from a paste event, normalize it, and use a fast LLM (e.g., Claude Haiku via dd0c/route) to extract a structured JSON representation of executable steps, variables, conditional branches, and prerequisites in under 5 seconds.

Dependencies: None. This is the foundational data structure.

Technical Notes:

  • Pure Rust normalizer to strip HTML/Markdown before hitting the LLM to save tokens and latency.
  • LLM prompt must enforce a strict JSON schema.
  • Idempotent parsing: same text + temperature 0 = same output.
  • Must run in < 3.5s to leave room for the Action Classifier within the 5s SLA.

User Stories

Story 1.1: Raw Text Normalization & Ingestion As a runbook author (Morgan), I want to paste raw text from Confluence or Notion into the system, so that I don't have to learn a proprietary YAML DSL.

  • Acceptance Criteria:
    • System accepts raw text payload via API.
    • Rust-based normalizer strips HTML tags, markdown formatting, and normalizes whitespace/bullet points.
    • Original raw text and hash are preserved in the DB for audit/re-parsing.
  • Story Points: 2
  • Dependencies: None
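
A minimal sketch of the normalization pass, using only the Rust standard library (function names are illustrative; the production normalizer would also decode HTML entities and strip Markdown syntax):

```rust
// Strip HTML tags from pasted runbook text. Sketch only: a production
// normalizer would also decode entities (&amp;) and handle Markdown.
fn strip_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

// Collapse whitespace runs within lines and drop blank lines, so the
// LLM sees a stable input regardless of the source tool's formatting.
fn normalize(input: &str) -> String {
    strip_tags(input)
        .lines()
        .map(|l| l.split_whitespace().collect::<Vec<_>>().join(" "))
        .filter(|l| !l.is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "<p>Step 1:   restart   the service</p>\n\n<ul><li>check logs</li></ul>";
    println!("{}", normalize(raw));
}
```

Keeping this step pure Rust means no tokens are spent on markup before the text reaches the LLM.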

Story 1.2: LLM Structured Step Extraction As a system, I want to pass normalized text to a fast LLM to extract an ordered JSON array of steps and commands, so that the runbook becomes machine-readable.

  • Acceptance Criteria:
    • Sends normalized text to dd0c/route with a strict JSON schema prompt.
    • Correctly extracts step order, natural language description, and embedded CLI/shell commands.
    • P95 latency for extraction is < 3.5 seconds.
    • Rejects/errors gracefully if the text contains no actionable steps.
  • Story Points: 3
  • Dependencies: 1.1
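
One plausible shape for the extracted payload, shown here only as an illustration (the actual field names are fixed by the schema in the LLM prompt):

```json
{
  "steps": [
    {
      "order": 1,
      "description": "Restart the payments service",
      "command": "kubectl rollout restart deploy/payments -n prod"
    },
    {
      "order": 2,
      "description": "Verify the pods are healthy",
      "command": "kubectl get pods -n prod -l app=payments"
    }
  ],
  "variables": [],
  "prerequisites": [],
  "ambiguities": []
}
```

The empty variables, prerequisites, and ambiguities arrays are populated by Stories 1.3 and 1.4.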

Story 1.3: Variable & Prerequisite Detection As an on-call engineer (Riley), I want the parser to identify implicit prerequisites (like VPN) and variables (like <instance-id>), so that I'm fully prepared before the runbook starts.

  • Acceptance Criteria:
    • Regex/heuristic scanner identifies common placeholders ($VAR, <var>, {var}).
    • LLM identifies implicit requirements ("ensure you are on VPN", "requires prod AWS profile").
    • Outputs a structured array of variables and prerequisites in the JSON payload.
  • Story Points: 2
  • Dependencies: 1.2
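
A sketch of the heuristic pass, assuming simple character scanning in place of the compiled regex sets (the angle-bracket branch would mis-fire on prose like "x < 5", which the real scanner's regexes avoid):

```rust
// Find $VAR, <var>, and {var} placeholder tokens in a step's text.
// Illustrative sketch only; Story 1.3 specifies regex/heuristic scanning.
fn find_placeholders(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut vars = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            '$' => {
                // $VAR: consume a run of [A-Za-z0-9_] after the dollar sign.
                let start = i + 1;
                let mut j = start;
                while j < chars.len() && (chars[j].is_alphanumeric() || chars[j] == '_') {
                    j += 1;
                }
                if j > start {
                    vars.push(chars[start..j].iter().collect());
                }
                i = j;
            }
            open @ ('<' | '{') => {
                let close = if open == '<' { '>' } else { '}' };
                if let Some(len) = chars[i + 1..].iter().position(|&c| c == close) {
                    let name: String = chars[i + 1..i + 1 + len].iter().collect();
                    // Only identifier-like tokens count as variables.
                    if !name.is_empty()
                        && name.chars().all(|c| c.is_alphanumeric() || "-_".contains(c))
                    {
                        vars.push(name);
                    }
                    i += len + 2;
                } else {
                    i += 1;
                }
            }
            _ => i += 1,
        }
    }
    vars
}

fn main() {
    let step = "ssh into <instance-id>, export $AWS_PROFILE, deploy to {region}";
    println!("{:?}", find_placeholders(step));
}
```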

Story 1.4: Branching & Ambiguity Highlighting As a runbook author (Morgan), I want the system to map conditional logic (if/else) and flag vague instructions, so that my runbooks are deterministic and clear.

  • Acceptance Criteria:
    • Parser detects simple "if X then Y" statements and maps them to a step execution DAG (Directed Acyclic Graph).
    • Identifies vague steps (e.g., "check the logs" without specifying which logs) and adds them to an ambiguities array for author review.
  • Story Points: 3
  • Dependencies: 1.2

Epic 2: Action Classifier

Description: The safety-critical core of the system. Classifies every extracted command as 🟢 Safe, 🟡 Caution, 🔴 Dangerous, or Unknown. Uses a defense-in-depth "dual-key" architecture: a deterministic compiled Rust scanner overrides an advisory LLM classifier. A misclassification here is an existential risk.

Dependencies: Epic 1 (needs parsed steps to classify)

Technical Notes:

  • The deterministic scanner must use tree-sitter or compiled regex sets.
  • Merge rules must be hardcoded in Rust (not configurable).
  • LLM classification must run in parallel across steps to meet the 5-second total parse+classify SLA.
  • If the LLM confidence is low or unknown, risk escalates. Scanner wins disagreements.

User Stories

Story 2.1: Deterministic Safety Scanner As a system, I want to pattern-match commands against a compiled database of known signatures, so that destructive commands are definitively blocked without relying on an LLM.

  • Acceptance Criteria:
    • Rust library with compiled regex sets (allowlist 🟢, caution 🟡, blocklist 🔴).
    • Uses tree-sitter to parse SQL/shell ASTs to detect destructive patterns (e.g., DELETE without WHERE, piped rm -rf).
    • Executes in < 1ms per command.
    • Defaults to Unknown (🟡 minimum) if no patterns match.
  • Story Points: 5
  • Dependencies: None (standalone library)
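
An illustrative skeleton of the scanner's shape. The patterns here are toy substrings standing in for the compiled regex sets and tree-sitter queries described above:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Risk { Safe, Caution, Dangerous, Unknown }

// Toy signature sets: the real scanner uses compiled regex sets and
// tree-sitter ASTs, not substring matching.
fn scan(command: &str) -> Risk {
    let cmd = command.to_lowercase();
    let dangerous = ["rm -rf", "drop table", "terminate-instances"];
    let caution = ["systemctl restart", "kubectl rollout restart"];
    let safe = ["kubectl get", "cat ", "grep ", "aws s3 ls"];

    if dangerous.iter().any(|p| cmd.contains(*p)) {
        return Risk::Dangerous;
    }
    // "DELETE without WHERE" heuristic; the real check parses the SQL AST.
    if cmd.contains("delete from") && !cmd.contains("where") {
        return Risk::Dangerous;
    }
    if caution.iter().any(|p| cmd.contains(*p)) {
        return Risk::Caution;
    }
    if safe.iter().any(|p| cmd.starts_with(*p)) {
        return Risk::Safe;
    }
    Risk::Unknown // treated as 🟡 minimum downstream
}

fn main() {
    println!("{:?}", scan("rm -rf /var/log"));
    println!("{:?}", scan("kubectl get pods -n prod"));
}
```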

Story 2.2: LLM Contextual Classifier As a system, I want an LLM to assess the command in the context of surrounding steps, so that implicit state changes or custom scripts are caught.

  • Acceptance Criteria:
    • Sends the command, surrounding steps, and infrastructure context to an LLM via a strict JSON prompt.
    • Returns a classification (🟢/🟡/🔴), a confidence score, and suggested rollbacks.
    • Escalate classification to the next risk tier if confidence < 0.9.
  • Story Points: 3
  • Dependencies: Epic 1 (requires structured context)

Story 2.3: Classification Merge Engine As an on-call engineer (Riley), I want the system to safely merge the LLM and deterministic scanner results, so that I can trust the final 🟢/🟡/🔴 label.

  • Acceptance Criteria:
    • Implements the 5 hardcoded merge rules exactly as architected (e.g., if Scanner says 🔴, final is 🔴; if Scanner says 🟡 and LLM says 🟢, final is 🟡).
    • The final risk level is Safe ONLY if both the Scanner and the LLM agree it is safe.
  • Story Points: 2
  • Dependencies: 2.1, 2.2
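
A sketch consistent with the stated rules, assuming Unknown has already been mapped to 🟡 before merging: the final label is the stricter of the two verdicts, and low LLM confidence escalates one tier first.

```rust
#[derive(Debug, PartialEq, PartialOrd, Clone, Copy)]
enum Risk { Safe, Caution, Dangerous } // declaration order = severity order

// Low confidence bumps the LLM's verdict one tier (Story 2.2).
fn escalate(r: Risk) -> Risk {
    match r {
        Risk::Safe => Risk::Caution,
        _ => Risk::Dangerous,
    }
}

// The scanner is authoritative, the LLM advisory: the final label is the
// stricter of the two, so Safe requires both to agree. Sketch only; the
// actual five hardcoded rules live in the architecture doc.
fn merge(scanner: Risk, llm: Risk, llm_confidence: f64) -> Risk {
    let llm = if llm_confidence < 0.9 { escalate(llm) } else { llm };
    if scanner >= llm { scanner } else { llm }
}

fn main() {
    println!("{:?}", merge(Risk::Caution, Risk::Safe, 0.95));
}
```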

Story 2.4: Immutable Classification Audit As an SRE manager (Jordan), I want every classification decision logged with full context, so that I have a forensic record of why the system made its choice.

  • Acceptance Criteria:
    • Emits a runbook.classified event to the PostgreSQL database for every step.
    • Log includes: scanner result (matched patterns), LLM reasoning, confidence scores, final classification, and merge rule applied.
  • Story Points: 1
  • Dependencies: 2.3

Epic 3: Execution Engine

Description: A state machine orchestrating the step-by-step runbook execution, enforcing the Trust Gradient (Level 0, 1, 2) at every transition. It never auto-executes 🟡 or 🔴 commands in V1. Handles timeouts, rollbacks, and skipped-step tracking.

Dependencies: Epics 1 & 2 (needs classified runbooks to execute)

Technical Notes:

  • V1 is Copilot-only. State transitions must block 🔴 and prompt for 🟡.
  • Engine must communicate with the Agent via gRPC over mTLS.
  • Each step execution receives a unique ID to prevent duplicate deliveries.

User Stories

Story 3.1: Execution State Machine As a system, I want a state machine to orchestrate step-by-step runbook execution, so that no two commands execute simultaneously and the Trust Gradient is enforced.

  • Acceptance Criteria:
    • Starts in Pending state upon alert match.
    • Progresses to AutoExecute ONLY if the step is 🟢 and trust level allows.
    • Transitions to AwaitApproval for 🟡/🔴 steps, blocking execution until approved.
    • Aborts or transitions to ManualIntervention on timeout (e.g., 60s for 🟢, 300s for 🔴).
  • Story Points: 5
  • Dependencies: 2.3
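
The transition rules above can be sketched as a pure function over risk and trust level (state and variant names are illustrative, not the actual engine's types):

```rust
#[derive(Debug, PartialEq)]
enum StepState { Pending, AutoExecute, AwaitApproval, Executing, ManualIntervention, Aborted }

#[derive(Clone, Copy)]
enum Risk { Safe, Caution, Dangerous }

// V1 rule: only 🟢 steps may auto-execute, and only when the runbook's
// trust level permits it. 🟡/🔴 always block on human approval.
fn next_state(risk: Risk, trust_level: u8) -> StepState {
    match risk {
        Risk::Safe if trust_level >= 1 => StepState::AutoExecute,
        Risk::Safe => StepState::AwaitApproval, // Level 0: everything is gated
        Risk::Caution | Risk::Dangerous => StepState::AwaitApproval,
    }
}

// Timeouts never silently proceed: a stalled approval or execution hands
// control back to the human.
fn on_timeout(state: StepState) -> StepState {
    match state {
        StepState::AwaitApproval | StepState::Executing => StepState::ManualIntervention,
        other => other,
    }
}

fn main() {
    println!("{:?}", next_state(Risk::Safe, 1));
    println!("{:?}", on_timeout(StepState::AwaitApproval));
}
```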

Story 3.2: gRPC Agent Communication Protocol As a system, I want to communicate securely with the dd0c Agent in the customer VPC, so that commands are executed safely without inbound firewall rules.

  • Acceptance Criteria:
    • Outbound-only gRPC streaming connection initiated by the Agent.
    • Engine sends ExecuteStep payload (command, timeout, risk level, environment variables).
    • Agent streams StepOutput (stdout/stderr) back to the Engine.
    • Agent returns StepResult (exit code, duration, stdout/stderr hashes) on completion.
  • Story Points: 5
  • Dependencies: 3.1
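
A sketch of what the streaming contract could look like in proto3. Message and field names are assumptions for illustration, not the actual .proto:

```protobuf
syntax = "proto3";

service ExecutionEngine {
  // The Agent dials out and holds a bidirectional stream; the Engine never
  // initiates inbound connections into the customer VPC.
  rpc StepStream(stream AgentMessage) returns (stream EngineMessage);
}

message EngineMessage {
  ExecuteStep execute = 1;
}

message ExecuteStep {
  string step_id = 1;        // unique per delivery, for idempotency
  string command = 2;
  uint32 timeout_seconds = 3;
  string risk_level = 4;
  map<string, string> env = 5;
}

message AgentMessage {
  oneof payload {
    StepOutput output = 1;   // streamed stdout/stderr chunks
    StepResult result = 2;   // terminal status for the step
  }
}

message StepOutput {
  string step_id = 1;
  bytes chunk = 2;
  bool is_stderr = 3;
}

message StepResult {
  string step_id = 1;
  int32 exit_code = 2;
  uint64 duration_ms = 3;
  string stdout_hash = 4;
  string stderr_hash = 5;
}
```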

Story 3.3: Rollback Integration As an on-call engineer (Riley), I want every state-changing step to record its inverse command, so that I can undo it with one click if it fails.

  • Acceptance Criteria:
    • If a step fails, the Engine transitions to RollbackAvailable and awaits human approval.
    • Engine stores the rollback command before executing the forward command.
    • Executing rollback triggers an independent execution step and returns the state machine to StepReady or ManualIntervention.
  • Story Points: 3
  • Dependencies: 3.1

Story 3.4: Divergence Analysis As an SRE manager (Jordan), I want the system to track what the engineer actually did vs. what the runbook prescribed, so that runbooks get better over time.

  • Acceptance Criteria:
    • Post-execution analyzer compares prescribed steps with executed/skipped steps and any unlisted commands detected by the agent.
    • Flags skipped steps, modified commands, and unlisted actions.
    • Emits a divergence.detected event with suggested runbook updates.
  • Story Points: 3
  • Dependencies: 3.1

Epic 4: Slack Bot Copilot

Description: The primary 3am interface for on-call engineers. Guides the engineer step-by-step through an active incident, enforcing the Trust Gradient via interactive buttons for 🟡 actions and explicit typed confirmations for 🔴 actions.

Dependencies: Epic 3 (needs execution state to drive the UI)

Technical Notes:

  • Uses Rust + Slack Bolt SDK in Socket Mode (no inbound webhooks).
  • Must use Slack Block Kit for formatting.
  • Respects Slack's 1 message/sec/channel rate limit by batching rapid updates.
  • Uses dedicated threads to keep main channels clean.

User Stories

Story 4.1: Alert Matching & Notification As an on-call engineer (Riley), I want a Slack message with the relevant runbook when an alert fires, so that I don't have to search for it.

  • Acceptance Criteria:
    • Integrates with PagerDuty/OpsGenie webhook payloads.
    • Matches alert context (service, region, keywords) to the most relevant runbook.
    • Posts a 🔔 Runbook matched message with [▶ Start Copilot] button in the incident channel.
  • Story Points: 3
  • Dependencies: None

Story 4.2: Step-by-Step Interactive UI As an on-call engineer (Riley), I want the bot to guide me through the runbook step-by-step in a thread, so that I can focus on one command at a time.

  • Acceptance Criteria:
    • Tapping [▶ Start Copilot] opens a thread and triggers the Execution Engine.
    • Shows 🟢 steps auto-executing with live stdout snippets.
    • Shows 🟡/🔴 steps awaiting approval with command details and rollback.
    • Includes [⏭ Skip] and [🛑 Abort] buttons for every step.
  • Story Points: 5
  • Dependencies: 3.1, 4.1

Story 4.3: Risk-Aware Approval Gates As a system, I want to block dangerous commands until explicit confirmation is provided, so that accidental misclicks don't cause outages.

  • Acceptance Criteria:
    • 🟡 steps require clicking [✅ Approve] or [✏️ Edit].
    • 🔴 steps require opening a text input modal and typing the resource name exactly.
    • Cannot bulk-approve steps. Each step must be individually gated.
    • Approver's Slack identity is captured and passed to the Execution Engine.
  • Story Points: 3
  • Dependencies: 4.2

Story 4.4: Real-Time Output Streaming As an on-call engineer (Riley), I want to see the execution output of commands in real-time, so that I know what's happening.

  • Acceptance Criteria:
    • Slack messages update dynamically with command stdout/stderr.
    • Batches rapid output updates to respect Slack rate limits.
    • Truncates long outputs and provides a link to the full log in the dashboard.
  • Story Points: 2
  • Dependencies: 4.2

Epic 5: Audit Trail

Description: The compliance backbone and forensic record. An append-only, immutable, partitioned PostgreSQL log that tracks every runbook parse, classification, approval, execution, and rollback.

Dependencies: Epics 2, 3, 4 (needs events to log)

Technical Notes:

  • PostgreSQL partitioned table (by month) for query performance.
  • Application role must have INSERT and SELECT only. No UPDATE or DELETE grants.
  • Row-Level Security (RLS) enforces tenant isolation at the database level.
  • V1 includes a basic PDF/CSV export for SOC 2 readiness.

User Stories

Story 5.1: Append-Only Audit Schema As a system, I want a partitioned database table to store all execution events immutably, so that no one can tamper with the forensic record.

  • Acceptance Criteria:
    • Creates a PostgreSQL table partitioned by month.
    • Revokes UPDATE and DELETE grants from the application role.
    • Schema captures tenant_id, event_type, execution_id, actor_id, event_data (JSONB), and created_at.
  • Story Points: 2
  • Dependencies: None
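
An illustrative PostgreSQL DDL sketch (the app_role name is assumed; bigserial is used rather than an identity column because identity columns on partitioned tables require PostgreSQL 17, and Epic 8 provisions PostgreSQL 16):

```sql
-- Append-only audit log, partitioned by month. The primary key must
-- include the partition key on a partitioned table.
CREATE TABLE audit_events (
    id           bigserial,
    tenant_id    uuid        NOT NULL,
    event_type   text        NOT NULL,
    execution_id uuid,
    actor_id     text,
    event_data   jsonb       NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE audit_events_2026_03 PARTITION OF audit_events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Append-only: the application role may insert and read, never mutate.
REVOKE ALL ON audit_events FROM app_role;
GRANT INSERT, SELECT ON audit_events TO app_role;
```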

Story 5.2: Event Ingestion Pipeline As an SRE manager (Jordan), I want every action recorded in real-time, so that I know exactly who did what during an incident.

  • Acceptance Criteria:
    • Ingests events from Parser, Classifier, Execution Engine, and Slack Bot.
    • Logs runbook.parsed, runbook.classified, step.auto_executed, step.approved, step.executed, etc., exactly as architected.
    • Captures the exact command executed, exit code, and stdout/stderr hashes.
  • Story Points: 3
  • Dependencies: 5.1

Story 5.3: Compliance Export As an SRE manager (Jordan), I want to export timestamped execution logs, so that I can provide evidence to SOC 2 auditors.

  • Acceptance Criteria:
    • Generates a PDF or CSV of an execution run, including the approval chain and audit trail.
    • Includes risk classifications and any divergence/modifications made by the engineer.
    • Links to S3 for full stdout/stderr logs if needed.
  • Story Points: 3
  • Dependencies: 5.2

Story 5.4: Multi-Tenant Data Isolation (RLS) As a system, I want to ensure no tenant can see another tenant's data, so that security boundaries are guaranteed at the database level.

  • Acceptance Criteria:
    • Enables Row-Level Security (RLS) on all tenant-scoped tables (runbooks, executions, audit_events).
    • API middleware sets app.current_tenant_id session variable on every database connection.
    • Cross-tenant queries return zero rows, not an error.
  • Story Points: 2
  • Dependencies: 5.1
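
A sketch of the RLS setup for one table, assuming tenant_id is a uuid column:

```sql
ALTER TABLE runbooks ENABLE ROW LEVEL SECURITY;
-- FORCE applies the policy even to the table owner.
ALTER TABLE runbooks FORCE ROW LEVEL SECURITY;

-- Rows are visible only when tenant_id matches the session variable the
-- API middleware sets on each connection. A mismatch filters to zero
-- rows rather than raising an error, matching the acceptance criteria.
CREATE POLICY tenant_isolation ON runbooks
    USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

-- Middleware runs this before issuing queries on a pooled connection:
-- SET app.current_tenant_id = '<tenant uuid>';
```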

Epic 6: Dashboard API

Description: The control plane REST API served by the dd0c API Gateway. Handles runbook CRUD, parsing endpoints, execution history, and integration with the Alert-Runbook Matcher. Provides the backend for the Dashboard UI and external integrations.

Dependencies: Epics 1, 3, 5

Technical Notes:

  • Served via Axum (Rust) and secured with JWT.
  • Integrates with the shared dd0c API Gateway.
  • Enforces tenant isolation (RLS) on every request.

User Stories

Story 6.1: Runbook CRUD & Parsing Endpoints As a runbook author (Morgan), I want to create, read, update, and soft-delete runbooks via a REST API, so that I can manage my team's active runbooks.

  • Acceptance Criteria:
    • Implements POST /runbooks (paste raw text → auto-parse), GET /runbooks, GET /runbooks/:id/versions, PUT /runbooks/:id, DELETE /runbooks/:id.
    • Implements POST /runbooks/parse-preview for the 5-second wow moment (parses without saving).
    • Returns 422 if parsing fails.
  • Story Points: 3
  • Dependencies: 1.2

Story 6.2: Execution History & Status Queries As an SRE manager (Jordan), I want to view active and completed execution runs, so that I can monitor incident response times and skipped steps.

  • Acceptance Criteria:
    • Implements POST /executions to start copilot manually.
    • Implements GET /executions and GET /executions/:id to retrieve status, steps, exit codes, and durations.
    • Implements GET /executions/:id/divergence to get the post-execution analysis (skipped steps, modified commands).
  • Story Points: 2
  • Dependencies: 3.1, 5.2

Story 6.3: Alert-Runbook Matcher Integration As a system, I want to match incoming webhooks (e.g., PagerDuty) to the correct runbook, so that the Slack bot can post the runbook immediately.

  • Acceptance Criteria:
    • Implements POST /webhooks/pagerduty, POST /webhooks/opsgenie, and POST /webhooks/dd0c-alert.
    • Uses keyword + metadata matching (and optionally pgvector similarity) to find the best runbook for the alert payload.
    • Generates the alert_context and triggers the Slack bot notification (Epic 4).
  • Story Points: 3
  • Dependencies: 6.1

Story 6.4: Classification Query Endpoints As a runbook author (Morgan), I want to query the classification engine directly, so that I can test commands before publishing a runbook.

  • Acceptance Criteria:
    • Implements POST /classify for testing/debugging.
    • Implements GET /classifications/:step_id to retrieve full classification details.
    • Rate-limited to 30 req/min per tenant.
  • Story Points: 1
  • Dependencies: 2.3

Epic 7: Dashboard UI

Description: A React Single-Page Application (SPA) for runbook authors and managers to view runbooks, past executions, and risk classifications. The primary interface for onboarding, reviewing runbooks, and analyzing post-incident data.

Dependencies: Epic 6 (needs APIs to consume)

Technical Notes:

  • React SPA, integrates with the shared dd0c portal.
  • Displays the 5-second "wow moment" parse preview.
  • Must visually distinguish 🟢 Safe, 🟡 Caution, and 🔴 Dangerous commands.

User Stories

Story 7.1: Runbook Paste & Preview UI As a runbook author (Morgan), I want to paste raw text and instantly see the parsed steps, so that I can verify the structure before saving.

  • Acceptance Criteria:
    • Large text area for pasting raw runbook text.
    • Calls POST /runbooks/parse-preview and displays the structured steps.
    • Highlights variables, prerequisites, and ambiguities.
    • Allows editing the raw text to trigger a re-parse.
  • Story Points: 3
  • Dependencies: 6.1

Story 7.2: Execution Timeline & Divergence View As an SRE manager (Jordan), I want to view a timeline of an incident's execution, so that I can see what was skipped or modified.

  • Acceptance Criteria:
    • Displays the execution run with a visual timeline of steps.
    • Shows who approved each step, the exact command executed, and the exit code.
    • Highlights divergence (skipped steps, modified commands, unlisted actions).
    • Provides an "Apply Updates" button to update the runbook based on divergence.
  • Story Points: 5
  • Dependencies: 6.2

Story 7.3: Trust Level & Risk Visualization As a runbook author (Morgan), I want to clearly see the risk level of each step in my runbook, so that I know what requires approval.

  • Acceptance Criteria:
    • Color-codes steps: 🟢 Green (Safe), 🟡 Yellow (Caution), 🔴 Red (Dangerous).
    • Clicking a risk badge shows the classification reasoning (scanner rules matched, LLM explanation).
    • Displays the runbook's overall Trust Level (0, 1, 2) and allows authorized users to change it (V1 max = 2).
  • Story Points: 3
  • Dependencies: 7.1

Story 7.4: Basic Health & MTTR Dashboard As an SRE manager (Jordan), I want a high-level view of my team's runbooks and execution stats, so that I can measure the value of dd0c/run.

  • Acceptance Criteria:
    • Displays a list of runbooks, their coverage, and average staleness.
    • Shows MTTR (Mean Time To Resolution) for incidents handled via Copilot.
    • Displays the total number of Copilot runs and skipped steps per month.
  • Story Points: 2
  • Dependencies: 6.2

Epic 8: Infrastructure & DevOps

Description: The foundational cloud infrastructure and CI/CD pipelines required to run the dd0c/run SaaS platform and build the customer-facing Agent. Ensures strict security isolation (mTLS), observability, and zero-downtime ECS Fargate deployments.

Dependencies: None (can be built in parallel with Epic 1).

Technical Notes:

  • All infrastructure defined as code (Terraform) and deployed to AWS.
  • ECS Fargate for all core services (Parser, Classifier, Engine, APIs).
  • Agent binary built for Linux x86_64 and ARM64 via GitHub Actions.
  • mTLS certificate generation and rotation pipeline for Agent authentication.

User Stories

Story 8.1: Core AWS Infrastructure Provisioning As a system administrator (Brian), I want the base AWS infrastructure provisioned via Terraform, so that the services have a secure, scalable environment to run in.

  • Acceptance Criteria:
    • Terraform configures VPC, public/private subnets, NAT Gateway, and ALB.
    • Provisions RDS PostgreSQL 16 (Multi-AZ) with pgvector and S3 buckets for audit/logs.
    • Provisions ECS Fargate cluster and SQS queues.
  • Story Points: 3
  • Dependencies: None

Story 8.2: CI/CD Pipeline & Agent Build As a developer (Brian), I want automated build and deployment pipelines, so that I can ship code to production safely and distribute the Agent binary.

  • Acceptance Criteria:
    • GitHub Actions pipeline runs cargo clippy, cargo test, and the "canary test suite" on every PR.
    • Merges to main auto-deploy ECS Fargate services with zero downtime.
    • Release tags compile the Agent binary (x86_64, aarch64), sign it, and publish to GitHub Releases / S3.
  • Story Points: 3
  • Dependencies: None

Story 8.3: Agent mTLS & gRPC Setup As a system, I want secure outbound-only gRPC communication between the Agent and the Execution Engine, so that commands are transmitted safely without VPNs or inbound firewall rules.

  • Acceptance Criteria:
    • Internal CA issues tenant-scoped mTLS certificates.
    • API Gateway terminates mTLS and validates the tenant ID in the certificate against the request.
    • Agent successfully establishes a persistent, outbound-only gRPC stream to the Engine.
  • Story Points: 5
  • Dependencies: 8.1

Story 8.4: Observability & Party Mode Alerting As the solo on-call engineer (Brian), I want comprehensive monitoring and alerting, so that I am woken up if a safety invariant is violated (Party Mode).

  • Acceptance Criteria:
    • OpenTelemetry (OTEL) collector deployed and routing metrics/traces to Grafana Cloud.
    • PagerDuty integration configured for P1 alerts.
    • Alert fires immediately if the "Party Mode" database flag is set or if Agent heartbeats drop unexpectedly.
  • Story Points: 2
  • Dependencies: 8.1

Epic 9: Onboarding & PLG

Description: Product-Led Growth (PLG) and frictionless onboarding. Focuses on the "5-second wow moment" where a user can paste a runbook and instantly see the parsed, classified steps without installing an agent, plus a self-serve tier to capture leads.

Dependencies: Epics 1, 6, 7.

Technical Notes:

  • The demo uses the POST /runbooks/parse-preview endpoint (does not persist data).
  • Tenant provisioning must be fully automated upon signup.
  • Free tier enforced via database limits (no Stripe required initially for free tier).

User Stories

Story 9.1: The 5-Second Interactive Demo As a prospective customer, I want to paste my messy runbook into the marketing website and see it parsed instantly, so that I immediately understand the product's value without signing up.

  • Acceptance Criteria:
    • Landing page features a prominent text area for pasting a runbook.
    • Submitting calls the unauthenticated parse-preview endpoint (rate-limited by IP).
    • Renders the structured steps with 🟢/🟡/🔴 risk badges dynamically.
  • Story Points: 3
  • Dependencies: 6.1, 7.1

Story 9.2: Self-Serve Signup & Tenant Provisioning As a new user, I want to create an account and get my tenant provisioned instantly, so that I can start using the product without talking to sales.

  • Acceptance Criteria:
    • OAuth/Email signup flow creates a User and a Tenant record in PostgreSQL.
    • Automated worker initializes the tenant workspace, generates the initial mTLS certs, and provides the Agent installation command.
    • Limits the free tier to 5 active runbooks and 50 executions/month.
  • Story Points: 3
  • Dependencies: 8.1

Story 9.3: Agent Installation Wizard As a new user, I want a simple copy-paste command to install the read-only Agent in my infrastructure, so that I can execute my first runbook within 10 minutes of signup.

  • Acceptance Criteria:
    • Dashboard provides a dynamically generated curl | bash or kubectl apply snippet.
    • Snippet includes the tenant-specific mTLS certificate and Agent binary download.
    • Dashboard shows a live "Waiting for Agent heartbeat..." state that turns green when the Agent connects.
  • Story Points: 3
  • Dependencies: 8.3, 9.2

Story 9.4: First Runbook Copilot Walkthrough As a new user, I want an interactive guide for my first runbook execution, so that I learn how the Copilot approval flow and Trust Gradient work safely.

  • Acceptance Criteria:
    • System automatically provisions a "Hello World" runbook containing safe read-only commands (e.g., kubectl get nodes).
    • In-app tooltip guides the user to trigger the runbook via the Slack integration.
    • Completing this first execution unlocks the "Runbook Author" badge/status.
  • Story Points: 2
  • Dependencies: 4.2, 9.3

Epic 10: Transparent Factory Compliance

Description: Cross-cutting epic ensuring dd0c/run adheres to the 5 Transparent Factory tenets. Of all 6 dd0c products, runbook automation carries the highest governance burden — it executes commands in production infrastructure. Every tenet here is load-bearing.

Story 10.1: Atomic Flagging — Feature Flags for Execution Behaviors

As a solo founder, I want every new runbook execution capability, command parser, and auto-remediation behavior behind a feature flag (default: off), so that a bad parser change never executes the wrong command in a customer's production environment.

Acceptance Criteria:

  • OpenFeature SDK integrated into the execution engine. V1: JSON file provider.
  • All flags evaluate locally — no network calls during runbook execution.
  • Every flag has owner and ttl (max 14 days). CI blocks if expired flags remain at 100%.
  • Automated circuit breaker: if a flagged execution path fails >2 times in any 10-minute window, the flag auto-disables and all in-flight executions using that path are paused (not killed).
  • Flags required for: new command parsers, conditional logic handlers, new infrastructure targets (K8s, AWS, bare metal), auto-retry behaviors, approval bypass rules.
  • Hard rule: any flag that controls execution of destructive commands (rm, delete, terminate, drop) requires a 48-hour bake time at 10% rollout before full enablement.
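
With the V1 JSON file provider, a flag definition might look like this (loosely following the flagd file format; the owner, ttl, and destructive metadata keys are this project's convention, not part of any spec):

```json
{
  "flags": {
    "executor.auto-retry-v1": {
      "state": "ENABLED",
      "defaultVariant": "off",
      "variants": { "on": true, "off": false },
      "metadata": {
        "owner": "brian",
        "ttl": "2026-03-14",
        "destructive": false
      }
    }
  }
}
```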

Estimate: 5 points
Dependencies: Epic 3 (Execution Engine)

Technical Notes:

  • Circuit breaker threshold is intentionally low (2 failures) because production execution failures are high-severity.
  • Pause vs. kill: paused executions hold state and can be resumed after review. Killing mid-execution risks partial state.
  • 48-hour bake for destructive flags: enforced via flag metadata destructive: true + CI check.

Story 10.2: Elastic Schema — Additive-Only for Execution Audit Trail

As a solo founder, I want all execution log and runbook state schema changes to be strictly additive, so that the forensic audit trail of every production command ever executed is never corrupted or lost.

Acceptance Criteria:

  • CI rejects any migration that removes, renames, or changes type of existing columns/attributes.
  • New fields use _v2 suffix for breaking changes.
  • All execution log parsers ignore unknown fields.
  • Dual-write during migration windows within the same transaction.
  • Every migration includes sunset_date comment (max 30 days).
  • Hard rule: execution audit records are immutable. No UPDATE or DELETE statements are permitted on the execution_log table/collection. Ever.

Estimate: 3 points
Dependencies: Epic 5 (Audit Trail)

Technical Notes:

  • Execution logs are append-only by design and by policy. Enforce it at the database level: the application role holds no UPDATE or DELETE grants on the audit table (per Epic 5's schema).
  • For runbook definition versioning: every edit creates a new version record. Old versions are never mutated.
  • Schema for parsed runbook steps: version the parser output format. V1 steps and V2 steps coexist.

Story 10.3: Cognitive Durability — Decision Logs for Execution Logic

As a future maintainer, I want every change to runbook parsing, step classification, or execution logic accompanied by a decision_log.json, so that I understand why the system interpreted step 3 as "destructive" and required approval.

Acceptance Criteria:

  • decision_log.json schema: { prompt, reasoning, alternatives_considered, confidence, timestamp, author }.
  • CI requires a decision log for PRs touching pkg/parser/, pkg/classifier/, pkg/executor/, or pkg/approval/.
  • Cyclomatic complexity cap of 10 enforced via linter. PRs exceeding this are blocked.
  • Decision logs in docs/decisions/.
  • Additional requirement: every change to the "destructive command" classification list requires a decision log entry explaining why the command was added/removed.
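
An illustrative decision_log.json entry using the schema above (all values invented for the example):

```json
{
  "prompt": "Add 'kubectl delete pvc' to the destructive command list",
  "reasoning": "Deleting a PVC can destroy persistent data even though deleting pods is routinely safe; scoping the pattern to pvc avoids over-flagging.",
  "alternatives_considered": [
    "Flag all 'kubectl delete' invocations as dangerous (rejected: too noisy)",
    "Rely on the LLM classifier alone (rejected: the scanner must stay deterministic)"
  ],
  "confidence": 0.85,
  "timestamp": "2026-02-28T17:00:00Z",
  "author": "brian"
}
```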

Estimate: 2 points
Dependencies: None

Technical Notes:

  • The destructive command list is the most sensitive config in the entire product. Changes must be documented, reviewed, and logged.
  • Parser logic decision logs should include sample runbook snippets showing before/after interpretation.

Story 10.4: Semantic Observability — AI Reasoning Spans on Classification & Execution

As an SRE manager investigating an incident caused by automated execution, I want every runbook classification and execution decision to emit an OpenTelemetry span with full reasoning metadata, so that I have a complete forensic record of what the system decided, why, and what it executed.

Acceptance Criteria:

  • Every runbook execution creates a parent runbook_execution span. Child spans for each step: step_classification, step_approval_check, step_execution.
  • step_classification attributes: step.text_hash, step.classified_as (safe/destructive/ambiguous), step.confidence_score, step.alternatives_considered.
  • step_execution attributes: step.command_hash, step.target_host_hash, step.exit_code, step.duration_ms, step.output_truncated (first 500 chars, hashed).
  • If AI-assisted classification: ai.prompt_hash, ai.model_version, ai.reasoning_chain.
  • step_approval_check attributes: step.approval_required (bool), step.approval_source (human/policy/auto), step.approval_latency_ms.
  • No PII. No raw commands in spans — everything hashed. Full commands only in the encrypted audit log.

Estimate: 5 points
Dependencies: Epic 3 (Execution Engine), Epic 5 (Audit Trail)

Technical Notes:

  • This is the highest-effort observability story across all 6 products because execution tracing is forensic-grade.
  • Span hierarchy: runbook_execution → step_N → [classification, approval, execution]. Three levels deep.
  • Output hashing: SHA-256 of command output for correlation. Raw output only in encrypted audit store.

Story 10.5: Configurable Autonomy — Governance for Production Execution

As a solo founder, I want a policy.json that strictly controls what dd0c/run is allowed to execute autonomously vs. what requires human approval, so that no automated system ever runs a destructive command without explicit authorization.

Acceptance Criteria:

  • policy.json defines governance_mode: strict (all steps require human approval) or audit (safe steps auto-execute, destructive steps require approval).
  • Default for ALL customers: strict. There is no "fully autonomous" mode for dd0c/run. Even in audit mode, destructive commands always require approval.
  • panic_mode: when true, ALL execution halts immediately. In-flight steps are paused (not killed). All pending approvals are revoked. System enters read-only forensic mode.
  • Governance drift monitoring: weekly report of auto-executed vs. human-approved steps. If auto-execution ratio exceeds the configured threshold, system auto-downgrades to strict.
  • Per-customer, per-runbook governance overrides. Customers can lock specific runbooks to strict regardless of system mode.
  • Hard rule: panic_mode must be triggerable in <1 second via API call, CLI command, Slack command, OR a physical hardware button (webhook endpoint).
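
An illustrative policy.json. Only governance_mode and panic_mode come from the criteria above; the remaining keys are assumed names for the drift threshold and per-runbook overrides:

```json
{
  "governance_mode": "strict",
  "panic_mode": false,
  "drift": {
    "auto_execution_ratio_threshold": 0.7,
    "report_cadence": "weekly"
  },
  "overrides": {
    "runbook:prod-db-failover": { "governance_mode": "strict", "locked": true }
  }
}
```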

Estimate: 5 points
Dependencies: Epic 3 (Execution Engine), Epic 4 (Slack Bot Copilot)

Technical Notes:

  • No "full auto" mode is a deliberate product decision. dd0c/run assists humans, it doesn't replace them for destructive actions.
  • Panic mode implementation: Redis key dd0c:panic checked at the top of every execution loop iteration. Webhook endpoint POST /admin/panic sets the key and broadcasts a halt signal via pub/sub.
  • The <1 second requirement means panic cannot depend on a DB write — Redis pub/sub only.
  • Governance drift: cron job queries execution logs weekly. Threshold configurable per-org (default: 70% auto-execution triggers downgrade).
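
The hot-path check can be sketched as follows; an AtomicBool stands in for the Redis key dd0c:panic so the shape of the check is visible without a Redis client:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Checked at the top of every execution loop iteration. In production the
// flag mirrors `dd0c:panic` (updated via Redis pub/sub), so halting never
// waits on a database write — the check is allocation- and I/O-free.
fn may_execute(panic_flag: &AtomicBool) -> bool {
    !panic_flag.load(Ordering::SeqCst)
}

fn main() {
    let panic_flag = AtomicBool::new(false);
    assert!(may_execute(&panic_flag));

    // POST /admin/panic (or the Slack/CLI trigger) would flip this flag.
    panic_flag.store(true, Ordering::SeqCst);
    assert!(!may_execute(&panic_flag));
    println!("panic mode engaged; in-flight steps paused");
}
```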

Epic 10 Summary

Story   Tenet                     Points
10.1    Atomic Flagging                5
10.2    Elastic Schema                 3
10.3    Cognitive Durability           2
10.4    Semantic Observability         5
10.5    Configurable Autonomy          5
Total                                 20