dd0c/run — V1 MVP Epics
This document breaks down the V1 MVP of dd0c/run into implementation Epics and User Stories. Scope is strictly limited to V1 (read-only execution 🟢, copilot approval for 🟡/🔴, no autopilot, no terminal watcher).
Epic 1: Runbook Parser
Description: The "5-second wow moment." Ingest raw unstructured text from a paste event, normalize it, and use a fast LLM (e.g., Claude Haiku via dd0c/route) to extract a structured JSON representation of executable steps, variables, conditional branches, and prerequisites in under 5 seconds.
Dependencies: None. This is the foundational data structure.
Technical Notes:
- Pure Rust normalizer to strip HTML/Markdown before hitting the LLM to save tokens and latency.
- LLM prompt must enforce a strict JSON schema.
- Idempotent parsing: same text + temperature 0 = same output.
- Must run in < 3.5s to leave room for the Action Classifier within the 5s SLA.
User Stories
Story 1.1: Raw Text Normalization & Ingestion As a runbook author (Morgan), I want to paste raw text from Confluence or Notion into the system, so that I don't have to learn a proprietary YAML DSL.
- Acceptance Criteria:
- System accepts raw text payload via API.
- Rust-based normalizer strips HTML tags, markdown formatting, and normalizes whitespace/bullet points.
- Original raw text and hash are preserved in the DB for audit/re-parsing.
- Story Points: 2
- Dependencies: None
Story 1.2: LLM Structured Step Extraction As a system, I want to pass normalized text to a fast LLM to extract an ordered JSON array of steps and commands, so that the runbook becomes machine-readable.
- Acceptance Criteria:
- Sends normalized text to `dd0c/route` with a strict JSON schema prompt.
- Correctly extracts step order, natural language description, and embedded CLI/shell commands.
- P95 latency for extraction is < 3.5 seconds.
- Rejects/errors gracefully if the text contains no actionable steps.
- Story Points: 3
- Dependencies: 1.1
Story 1.3: Variable & Prerequisite Detection
As an on-call engineer (Riley), I want the parser to identify implicit prerequisites (like VPN) and variables (like <instance-id>), so that I'm fully prepared before the runbook starts.
- Acceptance Criteria:
- Regex/heuristic scanner identifies common placeholders (`$VAR`, `<var>`, `{var}`).
- LLM identifies implicit requirements ("ensure you are on VPN", "requires prod AWS profile").
- Outputs structured `variables` and `prerequisites` arrays in the JSON payload.
- Story Points: 2
- Dependencies: 1.2
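The placeholder heuristic in Story 1.3 can be sketched as a single left-to-right scan. This is a minimal stdlib-only illustration (function name and the identifier rules are assumptions, not the shipped scanner, which would use compiled regex sets):

```rust
/// Minimal sketch of the placeholder scan: find `$VAR`, `<var>`, and
/// `{var}` patterns and return the variable names in document order.
fn extract_placeholders(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut vars = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            // $VAR: identifier characters after a dollar sign
            '$' => {
                let start = i + 1;
                let mut j = start;
                while j < chars.len() && (chars[j].is_ascii_alphanumeric() || chars[j] == '_') {
                    j += 1;
                }
                if j > start {
                    vars.push(chars[start..j].iter().collect());
                }
                i = j;
            }
            // <var> or {var}: delimited placeholder with a simple name
            '<' | '{' => {
                let close = if chars[i] == '<' { '>' } else { '}' };
                if let Some(end) = chars[i + 1..].iter().position(|&c| c == close) {
                    let inner: String = chars[i + 1..i + 1 + end].iter().collect();
                    if !inner.is_empty()
                        && inner.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
                    {
                        vars.push(inner);
                    }
                    i += end + 2;
                } else {
                    i += 1;
                }
            }
            _ => i += 1,
        }
    }
    vars
}

fn main() {
    let step = "aws ec2 stop-instances --instance-ids <instance-id> --profile $AWS_PROFILE";
    let found = extract_placeholders(step);
    assert_eq!(found, vec!["instance-id".to_string(), "AWS_PROFILE".to_string()]);
    println!("{:?}", found);
}
```

The LLM pass then handles anything this heuristic misses (prose-embedded prerequisites like "ensure you are on VPN").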
Story 1.4: Branching & Ambiguity Highlighting As a runbook author (Morgan), I want the system to map conditional logic (if/else) and flag vague instructions, so that my runbooks are deterministic and clear.
- Acceptance Criteria:
- Parser detects simple "if X then Y" statements and maps them to a step execution DAG (Directed Acyclic Graph).
- Identifies vague steps (e.g., "check the logs" without specifying which logs) and adds them to an `ambiguities` array for author review.
- Story Points: 3
- Dependencies: 1.2
Epic 2: Action Classifier
Description: The safety-critical core of the system. Classifies every extracted command as 🟢 Safe, 🟡 Caution, 🔴 Dangerous, or ⬜ Unknown. Uses a defense-in-depth "dual-key" architecture: a deterministic compiled Rust scanner overrides an advisory LLM classifier. A misclassification here is an existential risk.
Dependencies: Epic 1 (needs parsed steps to classify)
Technical Notes:
- The deterministic scanner must use tree-sitter or compiled regex sets.
- Merge rules must be hardcoded in Rust (not configurable).
- LLM classification must run in parallel across steps to meet the 5-second total parse+classify SLA.
- If the LLM confidence is low or unknown, risk escalates. Scanner wins disagreements.
User Stories
Story 2.1: Deterministic Safety Scanner As a system, I want to pattern-match commands against a compiled database of known signatures, so that destructive commands are definitively blocked without relying on an LLM.
- Acceptance Criteria:
- Rust library with compiled regex sets (allowlist 🟢, caution 🟡, blocklist 🔴).
- Uses tree-sitter to parse SQL/shell ASTs and detect destructive patterns (e.g., `DELETE` without `WHERE`, piped `rm -rf`).
- Executes in < 1 ms per command.
- Defaults to `Unknown` (treated as 🟡 minimum) if no patterns match.
- Story Points: 5
- Dependencies: None (standalone library)
Story 2.2: LLM Contextual Classifier As a system, I want an LLM to assess the command in the context of surrounding steps, so that implicit state changes or custom scripts are caught.
- Acceptance Criteria:
- Sends the command, surrounding steps, and infrastructure context to an LLM via a strict JSON prompt.
- Returns a classification (🟢/🟡/🔴), a confidence score, and suggested rollbacks.
- Escalates the classification to the next risk tier if confidence is < 0.9.
- Story Points: 3
- Dependencies: Epic 1 (requires structured context)
Story 2.3: Classification Merge Engine As an on-call engineer (Riley), I want the system to safely merge the LLM and deterministic scanner results, so that I can trust the final 🟢/🟡/🔴 label.
- Acceptance Criteria:
- Implements the 5 hardcoded merge rules exactly as architected (e.g., if Scanner says 🔴, final is 🔴; if Scanner says 🟡 and LLM says 🟢, final is 🟡).
- The final risk level is `Safe` ONLY if both the Scanner and the LLM agree it is safe.
- Story Points: 2
- Dependencies: 2.1, 2.2
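The merge rules above reduce to a "most severe wins" ordering: the deterministic scanner always wins disagreements, and `Safe` requires agreement from both sources. A minimal Rust sketch (type and function names are illustrative assumptions, not the shipped API), including the Story 2.2 confidence escalation:

```rust
/// Risk tiers in increasing severity; `Unknown` results from the
/// scanner are floored to `Caution` before merging.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Risk {
    Safe,      // 🟢
    Caution,   // 🟡
    Dangerous, // 🔴
}

/// Merged result is the more severe of the two inputs, so the final
/// label is `Safe` only when the scanner AND the LLM both say safe.
fn merge(scanner: Risk, llm: Risk) -> Risk {
    std::cmp::max(scanner, llm)
}

/// Story 2.2: a low-confidence LLM verdict escalates one tier.
fn escalate_if_unsure(r: Risk, confidence: f64) -> Risk {
    if confidence < 0.9 {
        match r {
            Risk::Safe => Risk::Caution,
            _ => Risk::Dangerous,
        }
    } else {
        r
    }
}

fn main() {
    use Risk::*;
    assert_eq!(merge(Dangerous, Safe), Dangerous); // scanner 🔴 ⇒ final 🔴
    assert_eq!(merge(Caution, Safe), Caution);     // scanner 🟡 + LLM 🟢 ⇒ 🟡
    assert_eq!(merge(Safe, Caution), Caution);     // LLM escalations are honored
    assert_eq!(merge(Safe, Safe), Safe);           // 🟢 only when both agree
    assert_eq!(escalate_if_unsure(Safe, 0.8), Caution);
}
```

Hardcoding this as a total order (rather than a configurable rule table) is what makes the merge auditable: every outcome is derivable from the two inputs alone.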
Story 2.4: Immutable Classification Audit As an SRE manager (Jordan), I want every classification decision logged with full context, so that I have a forensic record of why the system made its choice.
- Acceptance Criteria:
- Emits a `runbook.classified` event to the PostgreSQL database for every step.
- Log includes: scanner result (matched patterns), LLM reasoning, confidence scores, final classification, and merge rule applied.
- Story Points: 1
- Dependencies: 2.3
Epic 3: Execution Engine
Description: A state machine orchestrating the step-by-step runbook execution, enforcing the Trust Gradient (Level 0, 1, 2) at every transition. It never auto-executes 🟡 or 🔴 commands in V1. Handles timeouts, rollbacks, and tracking skipped steps.
Dependencies: Epics 1 & 2 (needs classified runbooks to execute)
Technical Notes:
- V1 is Copilot-only. State transitions must block 🔴 and prompt for 🟡.
- Engine must communicate with the Agent via gRPC over mTLS.
- Each step execution receives a unique ID to prevent duplicate deliveries.
User Stories
Story 3.1: Execution State Machine As a system, I want a state machine to orchestrate step-by-step runbook execution, so that no two commands execute simultaneously and the Trust Gradient is enforced.
- Acceptance Criteria:
- Starts in the `Pending` state upon alert match.
- Progresses to `AutoExecute` ONLY if the step is 🟢 and the trust level allows.
- Transitions to `AwaitApproval` for 🟡/🔴 steps, blocking execution until approved.
- Aborts or transitions to `ManualIntervention` on timeout (e.g., 60s for 🟢, 300s for 🔴).
- Story Points: 5
- Dependencies: 2.3
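The state transitions above can be sketched as a pure function over (risk, trust level, timeout). State names follow the acceptance criteria; the assumption that trust level ≥ 1 permits 🟢 auto-execution is illustrative, not the final Trust Gradient spec:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Risk { Green, Yellow, Red }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StepState {
    Pending,
    AutoExecute,
    AwaitApproval,
    ManualIntervention,
}

/// Decide the next state for a single pending step. In V1 (Copilot-only)
/// 🟡/🔴 steps ALWAYS block on human approval; only 🟢 steps may
/// auto-execute, and only when the trust level allows it.
fn next_state(risk: Risk, trust_level: u8, timed_out: bool) -> StepState {
    if timed_out {
        return StepState::ManualIntervention;
    }
    match risk {
        Risk::Green if trust_level >= 1 => StepState::AutoExecute,
        Risk::Green => StepState::AwaitApproval,
        Risk::Yellow | Risk::Red => StepState::AwaitApproval,
    }
}

fn main() {
    assert_eq!(next_state(Risk::Green, 1, false), StepState::AutoExecute);
    assert_eq!(next_state(Risk::Red, 2, false), StepState::AwaitApproval); // never auto in V1
    assert_eq!(next_state(Risk::Green, 1, true), StepState::ManualIntervention);
}
```

Keeping the transition logic a pure function makes the "no two commands execute simultaneously" invariant testable without a live agent.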
Story 3.2: gRPC Agent Communication Protocol As a system, I want to communicate securely with the dd0c Agent in the customer VPC, so that commands are executed safely without inbound firewall rules.
- Acceptance Criteria:
- Outbound-only gRPC streaming connection initiated by the Agent.
- Engine sends an `ExecuteStep` payload (command, timeout, risk level, environment variables).
- Agent streams `StepOutput` (stdout/stderr) back to the Engine.
- Agent returns a `StepResult` (exit code, duration, stdout/stderr hashes) on completion.
- Story Points: 5
- Dependencies: 3.1
Story 3.3: Rollback Integration As an on-call engineer (Riley), I want every state-changing step to record its inverse command, so that I can undo it with one click if it fails.
- Acceptance Criteria:
- If a step fails, the Engine transitions to `RollbackAvailable` and awaits human approval.
- Engine stores the rollback command before executing the forward command.
- Executing a rollback triggers an independent execution step and returns the state machine to `StepReady` or `ManualIntervention`.
- Story Points: 3
- Dependencies: 3.1
Story 3.4: Divergence Analysis As an SRE manager (Jordan), I want the system to track what the engineer actually did vs. what the runbook prescribed, so that runbooks get better over time.
- Acceptance Criteria:
- Post-execution analyzer compares prescribed steps with executed/skipped steps and any unlisted commands detected by the agent.
- Flags skipped steps, modified commands, and unlisted actions.
- Emits a `divergence.detected` event with suggested runbook updates.
- Story Points: 3
- Dependencies: 3.1
Epic 4: Slack Bot Copilot
Description: The primary 3am interface for on-call engineers. Guides the engineer step-by-step through an active incident, enforcing the Trust Gradient via interactive buttons for 🟡 actions and explicit typed confirmations for 🔴 actions.
Dependencies: Epic 3 (needs execution state to drive the UI)
Technical Notes:
- Uses Rust + Slack Bolt SDK in Socket Mode (no inbound webhooks).
- Must use Slack Block Kit for formatting.
- Respects Slack's 1 message/sec/channel rate limit by batching rapid updates.
- Uses dedicated threads to keep main channels clean.
User Stories
Story 4.1: Alert Matching & Notification As an on-call engineer (Riley), I want a Slack message with the relevant runbook when an alert fires, so that I don't have to search for it.
- Acceptance Criteria:
- Integrates with PagerDuty/OpsGenie webhook payloads.
- Matches alert context (service, region, keywords) to the most relevant runbook.
- Posts a "🔔 Runbook matched" message with a [▶ Start Copilot] button in the incident channel.
- Story Points: 3
- Dependencies: None
Story 4.2: Step-by-Step Interactive UI As an on-call engineer (Riley), I want the bot to guide me through the runbook step-by-step in a thread, so that I can focus on one command at a time.
- Acceptance Criteria:
- Tapping [▶ Start Copilot] opens a thread and triggers the Execution Engine.
- Shows 🟢 steps auto-executing with live stdout snippets.
- Shows 🟡/🔴 steps awaiting approval with command details and rollback.
- Includes [⏭ Skip] and [🛑 Abort] buttons for every step.
- Story Points: 5
- Dependencies: 3.1, 4.1
Story 4.3: Risk-Aware Approval Gates As a system, I want to block dangerous commands until explicit confirmation is provided, so that accidental misclicks don't cause outages.
- Acceptance Criteria:
- 🟡 steps require clicking [✅ Approve] or [✏️ Edit].
- 🔴 steps require opening a text input modal and typing the resource name exactly.
- Cannot bulk-approve steps. Each step must be individually gated.
- Approver's Slack identity is captured and passed to the Execution Engine.
- Story Points: 3
- Dependencies: 4.2
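The 🔴 gate boils down to an exact-match check on the typed resource name. A tiny sketch (function name and the whitespace-trimming choice are assumptions; everything else follows the acceptance criteria):

```rust
/// 🔴 confirmation gate: the text typed into the Slack modal must match
/// the target resource name exactly. Trimming surrounding whitespace is
/// an assumed convenience; case and every other character must match.
fn red_step_confirmed(typed: &str, resource_name: &str) -> bool {
    typed.trim() == resource_name
}

fn main() {
    assert!(red_step_confirmed("prod-db-01", "prod-db-01"));
    assert!(!red_step_confirmed("prod-db-1", "prod-db-01"));  // near-miss rejected
    assert!(!red_step_confirmed("PROD-DB-01", "prod-db-01")); // case matters
}
```

Requiring a typed name (rather than a button) is what makes a 3am misclick on a 🔴 step physically impossible.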
Story 4.4: Real-Time Output Streaming As an on-call engineer (Riley), I want to see the execution output of commands in real-time, so that I know what's happening.
- Acceptance Criteria:
- Slack messages update dynamically with command stdout/stderr.
- Batches rapid output updates to respect Slack rate limits.
- Truncates long outputs and provides a link to the full log in the dashboard.
- Story Points: 2
- Dependencies: 4.2
Epic 5: Audit Trail
Description: The compliance backbone and forensic record. An append-only, immutable, partitioned PostgreSQL log that tracks every runbook parse, classification, approval, execution, and rollback.
Dependencies: Epics 2, 3, 4 (needs events to log)
Technical Notes:
- PostgreSQL partitioned table (by month) for query performance.
- Application role must have `INSERT` and `SELECT` only. No `UPDATE` or `DELETE` grants.
- Row-Level Security (RLS) enforces tenant isolation at the database level.
- V1 includes a basic PDF/CSV export for SOC 2 readiness.
User Stories
Story 5.1: Append-Only Audit Schema As a system, I want a partitioned database table to store all execution events immutably, so that no one can tamper with the forensic record.
- Acceptance Criteria:
- Creates a PostgreSQL table partitioned by month.
- Revokes `UPDATE` and `DELETE` grants from the application role.
- Schema captures `tenant_id`, `event_type`, `execution_id`, `actor_id`, `event_data` (JSONB), and `created_at`.
- Story Points: 2
- Dependencies: None
Story 5.2: Event Ingestion Pipeline As an SRE manager (Jordan), I want every action recorded in real-time, so that I know exactly who did what during an incident.
- Acceptance Criteria:
- Ingests events from Parser, Classifier, Execution Engine, and Slack Bot.
- Logs `runbook.parsed`, `runbook.classified`, `step.auto_executed`, `step.approved`, `step.executed`, etc., exactly as architected.
- Captures the exact command executed, exit code, and stdout/stderr hashes.
- Story Points: 3
- Dependencies: 5.1
Story 5.3: Compliance Export As an SRE manager (Jordan), I want to export timestamped execution logs, so that I can provide evidence to SOC 2 auditors.
- Acceptance Criteria:
- Generates a PDF or CSV of an execution run, including the approval chain and audit trail.
- Includes risk classifications and any divergence/modifications made by the engineer.
- Links to S3 for full stdout/stderr logs if needed.
- Story Points: 3
- Dependencies: 5.2
Story 5.4: Multi-Tenant Data Isolation (RLS) As a system, I want to ensure no tenant can see another tenant's data, so that security boundaries are guaranteed at the database level.
- Acceptance Criteria:
- Enables Row-Level Security (RLS) on all tenant-scoped tables (`runbooks`, `executions`, `audit_events`).
- API middleware sets the `app.current_tenant_id` session variable on every database connection.
- Cross-tenant queries return zero rows, not an error.
- Story Points: 2
- Dependencies: 5.1
Epic 6: Dashboard API
Description: The control plane REST API served by the dd0c API Gateway. Handles runbook CRUD, parsing endpoints, execution history, and integration with the Alert-Runbook Matcher. Provides the backend for the Dashboard UI and external integrations.
Dependencies: Epics 1, 3, 5
Technical Notes:
- Served via Axum (Rust) and secured with JWT.
- Integrates with the shared dd0c API Gateway.
- Enforces tenant isolation (RLS) on every request.
User Stories
Story 6.1: Runbook CRUD & Parsing Endpoints As a runbook author (Morgan), I want to create, read, update, and soft-delete runbooks via a REST API, so that I can manage my team's active runbooks.
- Acceptance Criteria:
- Implements `POST /runbooks` (paste raw text → auto-parse), `GET /runbooks`, `GET /runbooks/:id/versions`, `PUT /runbooks/:id`, and `DELETE /runbooks/:id`.
- Implements `POST /runbooks/parse-preview` for the 5-second wow moment (parses without saving).
- Returns 422 if parsing fails.
- Story Points: 3
- Dependencies: 1.2
Story 6.2: Execution History & Status Queries As an SRE manager (Jordan), I want to view active and completed execution runs, so that I can monitor incident response times and skipped steps.
- Acceptance Criteria:
- Implements `POST /executions` to start the copilot manually.
- Implements `GET /executions` and `GET /executions/:id` to retrieve status, steps, exit codes, and durations.
- Implements `GET /executions/:id/divergence` to get the post-execution analysis (skipped steps, modified commands).
- Story Points: 2
- Dependencies: 3.1, 5.2
Story 6.3: Alert-Runbook Matcher Integration As a system, I want to match incoming webhooks (e.g., PagerDuty) to the correct runbook, so that the Slack bot can post the runbook immediately.
- Acceptance Criteria:
- Implements `POST /webhooks/pagerduty`, `POST /webhooks/opsgenie`, and `POST /webhooks/dd0c-alert`.
- Uses keyword + metadata matching (and optionally pgvector similarity) to find the best runbook for the alert payload.
- Generates the `alert_context` and triggers the Slack bot notification (Epic 4).
- Story Points: 3
- Dependencies: 6.1
Story 6.4: Classification Query Endpoints As a runbook author (Morgan), I want to query the classification engine directly, so that I can test commands before publishing a runbook.
- Acceptance Criteria:
- Implements `POST /classify` for testing/debugging.
- Implements `GET /classifications/:step_id` to retrieve full classification details.
- Rate-limited to 30 req/min per tenant.
- Story Points: 1
- Dependencies: 2.3
Epic 7: Dashboard UI
Description: A React Single-Page Application (SPA) for runbook authors and managers to view runbooks, past executions, and risk classifications. The primary interface for onboarding, reviewing runbooks, and analyzing post-incident data.
Dependencies: Epic 6 (needs APIs to consume)
Technical Notes:
- React SPA, integrates with the shared dd0c portal.
- Displays the 5-second "wow moment" parse preview.
- Must visually distinguish 🟢 Safe, 🟡 Caution, and 🔴 Dangerous commands.
User Stories
Story 7.1: Runbook Paste & Preview UI As a runbook author (Morgan), I want to paste raw text and instantly see the parsed steps, so that I can verify the structure before saving.
- Acceptance Criteria:
- Large text area for pasting raw runbook text.
- Calls `POST /runbooks/parse-preview` and displays the structured steps.
- Highlights variables, prerequisites, and ambiguities.
- Allows editing the raw text to trigger a re-parse.
- Story Points: 3
- Dependencies: 6.1
Story 7.2: Execution Timeline & Divergence View As an SRE manager (Jordan), I want to view a timeline of an incident's execution, so that I can see what was skipped or modified.
- Acceptance Criteria:
- Displays the execution run with a visual timeline of steps.
- Shows who approved each step, the exact command executed, and the exit code.
- Highlights divergence (skipped steps, modified commands, unlisted actions).
- Provides an "Apply Updates" button to update the runbook based on divergence.
- Story Points: 5
- Dependencies: 6.2
Story 7.3: Trust Level & Risk Visualization As a runbook author (Morgan), I want to clearly see the risk level of each step in my runbook, so that I know what requires approval.
- Acceptance Criteria:
- Color-codes steps: 🟢 Green (Safe), 🟡 Yellow (Caution), 🔴 Red (Dangerous).
- Clicking a risk badge shows the classification reasoning (scanner rules matched, LLM explanation).
- Displays the runbook's overall Trust Level (0, 1, 2) and allows authorized users to change it (V1 max = 2).
- Story Points: 3
- Dependencies: 7.1
Story 7.4: Basic Health & MTTR Dashboard As an SRE manager (Jordan), I want a high-level view of my team's runbooks and execution stats, so that I can measure the value of dd0c/run.
- Acceptance Criteria:
- Displays a list of runbooks, their coverage, and average staleness.
- Shows MTTR (Mean Time To Resolution) for incidents handled via Copilot.
- Displays the total number of Copilot runs and skipped steps per month.
- Story Points: 2
- Dependencies: 6.2
Epic 8: Infrastructure & DevOps
Description: The foundational cloud infrastructure and CI/CD pipelines required to run the dd0c/run SaaS platform and build the customer-facing Agent. Ensures strict security isolation (mTLS), observability, and zero-downtime ECS Fargate deployments.
Dependencies: None (can be built in parallel with Epic 1).
Technical Notes:
- All infrastructure defined as code (Terraform) and deployed to AWS.
- ECS Fargate for all core services (Parser, Classifier, Engine, APIs).
- Agent binary built for Linux x86_64 and ARM64 via GitHub Actions.
- mTLS certificate generation and rotation pipeline for Agent authentication.
User Stories
Story 8.1: Core AWS Infrastructure Provisioning As a system administrator (Brian), I want the base AWS infrastructure provisioned via Terraform, so that the services have a secure, scalable environment to run in.
- Acceptance Criteria:
- Terraform configures VPC, public/private subnets, NAT Gateway, and ALB.
- Provisions RDS PostgreSQL 16 (Multi-AZ) with pgvector and S3 buckets for audit/logs.
- Provisions ECS Fargate cluster and SQS queues.
- Story Points: 3
- Dependencies: None
Story 8.2: CI/CD Pipeline & Agent Build As a developer (Brian), I want automated build and deployment pipelines, so that I can ship code to production safely and distribute the Agent binary.
- Acceptance Criteria:
- GitHub Actions pipeline runs `cargo clippy`, `cargo test`, and the "canary test suite" on every PR.
- Merges to `main` auto-deploy ECS Fargate services with zero downtime.
- Release tags compile the Agent binary (x86_64, aarch64), sign it, and publish it to GitHub Releases / S3.
- Story Points: 3
- Dependencies: None
Story 8.3: Agent mTLS & gRPC Setup As a system, I want secure outbound-only gRPC communication between the Agent and the Execution Engine, so that commands are transmitted safely without VPNs or inbound firewall rules.
- Acceptance Criteria:
- Internal CA issues tenant-scoped mTLS certificates.
- API Gateway terminates mTLS and validates the tenant ID in the certificate against the request.
- Agent successfully establishes a persistent, outbound-only gRPC stream to the Engine.
- Story Points: 5
- Dependencies: 8.1
Story 8.4: Observability & Party Mode Alerting As the solo on-call engineer (Brian), I want comprehensive monitoring and alerting, so that I am woken up if a safety invariant is violated (Party Mode).
- Acceptance Criteria:
- OpenTelemetry (OTEL) collector deployed and routing metrics/traces to Grafana Cloud.
- PagerDuty integration configured for P1 alerts.
- Alert fires immediately if the "Party Mode" database flag is set or if Agent heartbeats drop unexpectedly.
- Story Points: 2
- Dependencies: 8.1
Epic 9: Onboarding & PLG
Description: Product-Led Growth (PLG) and frictionless onboarding. Focuses on the "5-second wow moment" where a user can paste a runbook and instantly see the parsed, classified steps without installing an agent, plus a self-serve tier to capture leads.
Dependencies: Epics 1, 6, 7.
Technical Notes:
- The demo uses the `POST /runbooks/parse-preview` endpoint (does not persist data).
- Tenant provisioning must be fully automated upon signup.
- Free tier enforced via database limits (no Stripe integration required initially).
User Stories
Story 9.1: The 5-Second Interactive Demo As a prospective customer, I want to paste my messy runbook into the marketing website and see it parsed instantly, so that I immediately understand the product's value without signing up.
- Acceptance Criteria:
- Landing page features a prominent text area for pasting a runbook.
- Submitting calls the unauthenticated parse-preview endpoint (rate-limited by IP).
- Renders the structured steps with 🟢/🟡/🔴 risk badges dynamically.
- Story Points: 3
- Dependencies: 6.1, 7.1
Story 9.2: Self-Serve Signup & Tenant Provisioning As a new user, I want to create an account and get my tenant provisioned instantly, so that I can start using the product without talking to sales.
- Acceptance Criteria:
- OAuth/Email signup flow creates a User and a Tenant record in PostgreSQL.
- Automated worker initializes the tenant workspace, generates the initial mTLS certs, and provides the Agent installation command.
- Limits the free tier to 5 active runbooks and 50 executions/month.
- Story Points: 3
- Dependencies: 8.1
Story 9.3: Agent Installation Wizard As a new user, I want a simple copy-paste command to install the read-only Agent in my infrastructure, so that I can execute my first runbook within 10 minutes of signup.
- Acceptance Criteria:
- Dashboard provides a dynamically generated `curl | bash` or `kubectl apply` snippet.
- Snippet includes the tenant-specific mTLS certificate and the Agent binary download.
- Dashboard shows a live "Waiting for Agent heartbeat..." state that turns green when the Agent connects.
- Story Points: 3
- Dependencies: 8.3, 9.2
Story 9.4: First Runbook Copilot Walkthrough As a new user, I want an interactive guide for my first runbook execution, so that I learn how the Copilot approval flow and Trust Gradient work safely.
- Acceptance Criteria:
- System automatically provisions a "Hello World" runbook containing safe read-only commands (e.g., `kubectl get nodes`).
- In-app tooltip guides the user to trigger the runbook via the Slack integration.
- Completing this first execution unlocks the "Runbook Author" badge/status.
- Story Points: 2
- Dependencies: 4.2, 9.3
Epic 10: Transparent Factory Compliance
Description: Cross-cutting epic ensuring dd0c/run adheres to the 5 Transparent Factory tenets. Of all 6 dd0c products, runbook automation carries the highest governance burden — it executes commands in production infrastructure. Every tenet here is load-bearing.
Story 10.1: Atomic Flagging — Feature Flags for Execution Behaviors
As a solo founder, I want every new runbook execution capability, command parser, and auto-remediation behavior behind a feature flag (default: off), so that a bad parser change never executes the wrong command in a customer's production environment.
Acceptance Criteria:
- OpenFeature SDK integrated into the execution engine. V1: JSON file provider.
- All flags evaluate locally — no network calls during runbook execution.
- Every flag has an `owner` and a `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged execution path fails >2 times in any 10-minute window, the flag auto-disables and all in-flight executions using that path are paused (not killed).
- Flags required for: new command parsers, conditional logic handlers, new infrastructure targets (K8s, AWS, bare metal), auto-retry behaviors, approval bypass rules.
- Hard rule: any flag that controls execution of destructive commands (`rm`, `delete`, `terminate`, `drop`) requires a 48-hour bake time at 10% rollout before full enablement.
Estimate: 5 points
Dependencies: Epic 3 (Execution Engine)
Technical Notes:
- Circuit breaker threshold is intentionally low (2 failures) because production execution failures are high-severity.
- Pause vs. kill: paused executions hold state and can be resumed after review. Killing mid-execution risks partial state.
- 48-hour bake for destructive flags: enforced via flag metadata `destructive: true` + a CI check.
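The circuit breaker is a sliding-window failure counter. A stdlib-only sketch (struct and method names are illustrative; timestamps are plain seconds for testability, where a real implementation would use `std::time::Instant`):

```rust
use std::collections::VecDeque;

/// Sketch of the per-flag circuit breaker: more than 2 failures inside
/// any 10-minute window trips the breaker, auto-disabling the flag.
struct CircuitBreaker {
    failures: VecDeque<u64>, // failure timestamps, in seconds
    window_secs: u64,
    max_failures: usize,
    tripped: bool,
}

impl CircuitBreaker {
    fn new() -> Self {
        Self { failures: VecDeque::new(), window_secs: 600, max_failures: 2, tripped: false }
    }

    /// Record a failure; returns true once the flag must auto-disable.
    fn record_failure(&mut self, now: u64) -> bool {
        self.failures.push_back(now);
        // Evict failures that fell out of the sliding window.
        while let Some(&t) = self.failures.front() {
            if now - t > self.window_secs {
                self.failures.pop_front();
            } else {
                break;
            }
        }
        if self.failures.len() > self.max_failures {
            self.tripped = true; // flag disables; in-flight runs pause, not die
        }
        self.tripped
    }
}

fn main() {
    let mut cb = CircuitBreaker::new();
    assert!(!cb.record_failure(0));
    assert!(!cb.record_failure(100)); // 2 failures: still within tolerance
    assert!(cb.record_failure(200));  // 3rd failure inside 10 min: trip
}
```

Note the breaker latches once tripped: re-enabling a flag after a trip is a human decision, never automatic.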
Story 10.2: Elastic Schema — Additive-Only for Execution Audit Trail
As a solo founder, I want all execution log and runbook state schema changes to be strictly additive, so that the forensic audit trail of every production command ever executed is never corrupted or lost.
Acceptance Criteria:
- CI rejects any migration that removes, renames, or changes type of existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes.
- All execution log parsers ignore unknown fields.
- Dual-write during migration windows within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).
- Hard rule: execution audit records are immutable. No `UPDATE` or `DELETE` statements are permitted on the `execution_log` table/collection. Ever.
Estimate: 3 points
Dependencies: Epic 5 (Audit Trail)
Technical Notes:
- Execution logs are append-only by design and by policy. If DynamoDB is used for the audit table, grant no `UpdateItem` IAM permission on it.
- For runbook definition versioning: every edit creates a new version record. Old versions are never mutated.
- Schema for parsed runbook steps: version the parser output format. V1 steps and V2 steps coexist.
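The CI migration gate can start as a naive lint over the SQL text. This sketch (function name and the forbidden-pattern list are assumptions; a production check would parse statements rather than substring-match) shows the additive-only rule:

```rust
/// Naive sketch of the additive-only CI check: reject migrations
/// containing destructive or mutating DDL on existing columns/tables.
fn migration_is_additive(sql: &str) -> bool {
    let upper = sql.to_uppercase();
    const FORBIDDEN: [&str; 4] = ["DROP COLUMN", "RENAME COLUMN", "ALTER COLUMN", "DROP TABLE"];
    !FORBIDDEN.iter().any(|p| upper.contains(p))
}

fn main() {
    // Additive change with a _v2 suffix: allowed.
    assert!(migration_is_additive("ALTER TABLE runbooks ADD COLUMN tags_v2 JSONB;"));
    // Column removal: blocked by CI.
    assert!(!migration_is_additive("ALTER TABLE runbooks DROP COLUMN tags;"));
}
```

Substring matching over-blocks (e.g., a comment mentioning "drop table"), which is the safe failure mode for a gate protecting a forensic audit trail.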
Story 10.3: Cognitive Durability — Decision Logs for Execution Logic
As a future maintainer, I want every change to runbook parsing, step classification, or execution logic accompanied by a decision_log.json, so that I understand why the system interpreted step 3 as "destructive" and required approval.
Acceptance Criteria:
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `pkg/parser/`, `pkg/classifier/`, `pkg/executor/`, or `pkg/approval/`.
- Cyclomatic complexity cap of 10 enforced via linter. PRs exceeding the cap are blocked.
- Decision logs live in `docs/decisions/`.
- Additional requirement: every change to the "destructive command" classification list requires a decision log entry explaining why the command was added/removed.
Estimate: 2 points
Dependencies: None
Technical Notes:
- The destructive command list is the most sensitive config in the entire product. Changes must be documented, reviewed, and logged.
- Parser logic decision logs should include sample runbook snippets showing before/after interpretation.
Story 10.4: Semantic Observability — AI Reasoning Spans on Classification & Execution
As an SRE manager investigating an incident caused by automated execution, I want every runbook classification and execution decision to emit an OpenTelemetry span with full reasoning metadata, so that I have a complete forensic record of what the system decided, why, and what it executed.
Acceptance Criteria:
- Every runbook execution creates a parent `runbook_execution` span, with child spans for each step: `step_classification`, `step_approval_check`, `step_execution`.
- `step_classification` attributes: `step.text_hash`, `step.classified_as` (safe/destructive/ambiguous), `step.confidence_score`, `step.alternatives_considered`.
- `step_execution` attributes: `step.command_hash`, `step.target_host_hash`, `step.exit_code`, `step.duration_ms`, `step.output_truncated` (first 500 chars, hashed).
- If classification is AI-assisted: `ai.prompt_hash`, `ai.model_version`, `ai.reasoning_chain`.
- `step_approval_check` attributes: `step.approval_required` (bool), `step.approval_source` (human/policy/auto), `step.approval_latency_ms`.
- No PII. No raw commands in spans; everything is hashed. Full commands live only in the encrypted audit log.
Estimate: 5 points
Dependencies: Epic 3 (Execution Engine), Epic 5 (Audit Trail)
Technical Notes:
- This is the highest-effort observability story across all 6 products because execution tracing is forensic-grade.
- Span hierarchy: `runbook_execution` → `step_N` → `[classification, approval, execution]`. Three levels deep.
- Output hashing: SHA-256 of the command output for correlation. Raw output lives only in the encrypted audit store.
Story 10.5: Configurable Autonomy — Governance for Production Execution
As a solo founder, I want a policy.json that strictly controls what dd0c/run is allowed to execute autonomously vs. what requires human approval, so that no automated system ever runs a destructive command without explicit authorization.
Acceptance Criteria:
- `policy.json` defines `governance_mode`: `strict` (all steps require human approval) or `audit` (safe steps auto-execute, destructive steps require approval).
- Default for ALL customers: `strict`. There is no "fully autonomous" mode for dd0c/run. Even in `audit` mode, destructive commands always require approval.
- `panic_mode`: when true, ALL execution halts immediately. In-flight steps are paused (not killed). All pending approvals are revoked. System enters read-only forensic mode.
- Governance drift monitoring: weekly report of auto-executed vs. human-approved steps. If the auto-execution ratio exceeds the configured threshold, the system auto-downgrades to `strict`.
- Per-customer, per-runbook governance overrides. Customers can lock specific runbooks to `strict` regardless of system mode.
- Hard rule: `panic_mode` must be triggerable in <1 second via API call, CLI command, Slack command, OR a physical hardware button (webhook endpoint).
Estimate: 5 points
Dependencies: Epic 3 (Execution Engine), Epic 4 (Slack Bot Copilot approval gates)
Technical Notes:
- No "full auto" mode is a deliberate product decision. dd0c/run assists humans, it doesn't replace them for destructive actions.
- Panic mode implementation: a Redis key `dd0c:panic` is checked at the top of every execution loop iteration. The webhook endpoint `POST /admin/panic` sets the key and broadcasts a halt signal via pub/sub.
- Governance drift: cron job queries execution logs weekly. Threshold configurable per-org (default: 70% auto-execution triggers downgrade).
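The control flow of the panic gate can be sketched without Redis. Here a process-local `AtomicBool` stands in for the shared `dd0c:panic` key (an assumption purely for testability; the real design uses Redis pub/sub so the <1s bound holds across processes):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the shared panic flag (`dd0c:panic` in the real design).
static PANIC: AtomicBool = AtomicBool::new(false);

/// Set by POST /admin/panic, the CLI, the Slack command, or the button.
fn trigger_panic() {
    PANIC.store(true, Ordering::SeqCst);
}

/// Checked at the top of every execution loop iteration. When true,
/// in-flight steps are paused (not killed) and nothing new dispatches.
fn should_halt() -> bool {
    PANIC.load(Ordering::SeqCst)
}

fn main() {
    let steps = ["kubectl get pods", "systemctl restart app"];
    let mut dispatched = 0;
    for (i, _step) in steps.iter().enumerate() {
        if should_halt() {
            break; // enter read-only forensic mode
        }
        dispatched += 1;
        if i == 0 {
            trigger_panic(); // simulate an operator hitting the panic button
        }
    }
    assert_eq!(dispatched, 1); // second step never dispatched
}
```

Checking the flag per iteration (rather than per runbook) is what bounds the halt latency to a single step boundary.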
Epic 10 Summary
| Story | Tenet | Points |
|---|---|---|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 5 |
| 10.5 | Configurable Autonomy | 5 |
| Total | | 20 |