dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
commit 5ee95d8b13 (2026-02-28 17:35:02 +00:00)
51 changed files with 36935 additions and 0 deletions

# dd0c/alert — V1 MVP Epics
**Product:** dd0c/alert (Alert Intelligence Platform)
**Phase:** 7 — Epics & Stories
---
## Epic 1: Webhook Ingestion
**Description:** The front door of dd0c/alert. Responsible for receiving alert payloads from monitoring providers via webhooks, validating their authenticity, normalizing them into a canonical schema, and queuing them securely for the Correlation Engine. Must support high burst volume (incident storms) and guarantee zero dropped payloads.
### User Stories
**Story 1.1: Datadog Webhook Ingestion**
* **As a** Platform Engineer, **I want** to send Datadog webhooks to a unique dd0c URL, **so that** my Datadog alerts enter the correlation pipeline.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/datadog`
- Normalizes Datadog JSON (handles arrays/batched alerts) into the Canonical Alert Schema.
- Normalizes Datadog P1-P5 severities into critical/high/medium/low/info.
* **Estimate:** 3 points
**Story 1.2: PagerDuty Webhook Ingestion**
* **As a** Platform Engineer, **I want** to send PagerDuty v3 webhooks to dd0c, **so that** my PD incidents are tracked.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/pagerduty`
- Normalizes PagerDuty JSON into the Canonical Alert Schema.
* **Estimate:** 3 points
**Story 1.3: HMAC Signature Validation**
* **As a** Security Admin, **I want** all incoming webhooks to have their HMAC signatures validated, **so that** bad actors cannot inject fake alerts.
* **Acceptance Criteria:**
- Rejects payloads with missing or invalid `DD-WEBHOOK-SIGNATURE` or `X-PagerDuty-Signature` headers with 401 Unauthorized.
- Compares against the integration secret stored in DynamoDB/Secrets Manager.
* **Estimate:** 3 points
**Story 1.4: Payload Normalization & Deduplication (Fingerprinting)**
* **As an** On-Call Engineer, **I want** identical alerts to be deterministically fingerprinted, **so that** flapping or duplicated payloads are instantly recognized.
* **Acceptance Criteria:**
- Generates a SHA-256 fingerprint based on `tenant_id + provider + service + normalized_title`.
- Pushes canonical alert to SQS FIFO queue with `MessageGroupId=tenant_id`.
- Saves raw payload asynchronously to S3 for audit/replay.
* **Estimate:** 5 points
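The fingerprint itself is a one-liner; a sketch, where the title normalization (trim and lowercase) and the field separator are illustrative assumptions beyond the criteria above:

```python
import hashlib

def alert_fingerprint(tenant_id: str, provider: str,
                      service: str, normalized_title: str) -> str:
    """Deterministic SHA-256 fingerprint over the identity fields of an alert.

    A separator that cannot appear in the fields avoids ambiguity
    ("ab"+"c" vs "a"+"bc"); "\x1f" (unit separator) is one safe choice.
    """
    key = "\x1f".join(
        (tenant_id, provider, service, normalized_title.strip().lower()))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```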
### Dependencies
- Story 1.3 depends on Tenant/Integration configuration existing (Epic 9).
- Story 1.4 depends on Canonical Alert Schema definition.
### Technical Notes
- **Infra:** API Gateway HTTP API -> Lambda -> SQS FIFO.
- Lambda must return 200 OK to the provider in <100ms. S3 raw payload storage must be non-blocking (async).
- Use ULIDs for `alert_id` for time-sortability.
## Epic 2: Correlation Engine
**Description:** The intelligence core. Consumes the normalized SQS FIFO queue, groups alerts based on time windows and service dependencies, and outputs correlated incidents.
### User Stories
**Story 2.1: Time-Window Clustering**
* **As an** On-Call Engineer, **I want** alerts firing within a brief time window for the same service to be grouped together, **so that** I don't get paged 10 times for one failure.
* **Acceptance Criteria:**
- Opens a 5-minute (configurable) correlation window in Redis when a new alert fingerprint arrives.
- Groups subsequent alerts for the same tenant/service into the active window.
- Stores the correlation state in ElastiCache Redis.
* **Estimate:** 5 points
**Story 2.2: Cascading Failure Correlation (Service Graph)**
* **As an** On-Call Engineer, **I want** cascading failures across dependent services to be merged into a single incident, **so that** I can see the blast radius of an issue.
* **Acceptance Criteria:**
- Reads explicit service dependencies from DynamoDB (`upstream -> downstream`).
- If a window is open for an upstream service, downstream service alerts are merged into the same window.
* **Estimate:** 8 points
**Story 2.3: Active Window Extension**
* **As an** On-Call Engineer, **I want** the correlation window to automatically extend if alerts are still trickling in, **so that** long-running, cascading incidents are correctly grouped.
* **Acceptance Criteria:**
- If a new alert arrives within the last 30 seconds of a window, the window extends by 2 minutes (max 15 minutes).
- Updates the `closes_at` timestamp in Redis.
* **Estimate:** 3 points
**Story 2.4: Incident Generation & Persistence**
* **As an** On-Call Engineer, **I want** completed time windows to be saved as durable incidents, **so that** I have a permanent record of the correlated event.
* **Acceptance Criteria:**
- When a window closes, it generates an Incident record in DynamoDB.
- Generates an event in TimescaleDB for trend tracking.
- Pushes a `correlation-request` to the Suggestion Engine SQS queue.
* **Estimate:** 5 points
### Dependencies
- Story 2.1 depends on Epic 1 (normalized SQS queue).
- Story 2.2 depends on a basic service dependency mapping (either config or API).
### Technical Notes
- **Infra:** ECS Fargate consuming SQS FIFO.
- Must use Redis Sorted Sets for active window management (`closes_at_epoch` as score).
- The correlation engine must be stateless (relying on Redis) so it can scale horizontally to handle incident storms.
## Epic 3: Noise Analysis
**Description:** The Suggestion Engine. Calculates a noise score (0-100) for correlated incidents and generates observe-only suppression suggestions. It strictly adheres to V1 constraints by *never* taking auto-action.
### User Stories
**Story 3.1: Rule-Based Noise Scoring**
* **As an** On-Call Engineer, **I want** every incident to receive a noise score based on objective data points, **so that** I have a metric to understand if this incident is likely a false positive.
* **Acceptance Criteria:**
- Calculates a 0-100 noise score when an incident is generated.
- Scores based on duplicate fingerprints (flapping), severity distribution (info vs critical), and time of day.
- Score is capped at 100 and floored at 0.
* **Estimate:** 5 points
**Story 3.2: "Never Suppress" Safelist Execution**
* **As a** Platform Engineer, **I want** critical services (databases, billing) to be excluded from high noise scoring regardless of pattern, **so that** I never miss a genuine P1.
* **Acceptance Criteria:**
- Implements a default safelist regex (e.g., `db|rds|payment|billing`).
- Forces the noise score below 50 if the service or title matches the safelist, or if severity is critical.
* **Estimate:** 3 points
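A sketch of the scoring function combining Stories 3.1 and 3.2; the individual weights are illustrative assumptions (the stories only fix the 0-100 range, the safelist regex, and the below-50 cap):

```python
import re

# default "never suppress" safelist from Story 3.2
SAFELIST = re.compile(r"db|rds|payment|billing", re.IGNORECASE)

def noise_score(duplicate_count: int, severities: list[str],
                service: str, title: str) -> int:
    """Rule-based 0-100 noise score; higher = more likely noise.

    Weights are placeholders, not spec. Critical or safelisted
    incidents are forced below 50 per Story 3.2.
    """
    score = 0
    score += min(duplicate_count * 10, 40)  # flapping fingerprints
    informational = sum(1 for s in severities if s in ("info", "low"))
    if severities:
        score += int(40 * informational / len(severities))  # severity mix
    score = max(0, min(100, score))
    # hard cap below 50 for critical severity or safelist matches
    if ("critical" in severities or SAFELIST.search(service)
            or SAFELIST.search(title)):
        score = min(score, 49)
    return score
```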
**Story 3.3: Observe-Only Suppression Suggestions**
* **As an** On-Call Engineer, **I want** the system to tell me what it *would* have suppressed, **so that** I can build trust in its intelligence without risking an outage.
* **Acceptance Criteria:**
- If a noise score > 80, the system generates a `suppress` suggestion record in DynamoDB.
- Generates plain-English reasoning for the suggestion (e.g., "This pattern was resolved automatically 4 times this month.").
- `action_taken` is always hardcoded to `none` for V1.
* **Estimate:** 5 points
**Story 3.4: Incident Scoring Metrics Collection**
* **As an** Engineering Manager, **I want** the noise scores and counts to be stored as time-series data, **so that** I can view trends in our alert hygiene over time.
* **Acceptance Criteria:**
- Writes noise score, alert counts, and unique fingerprints to TimescaleDB `alert_timeseries` table.
* **Estimate:** 3 points
### Dependencies
- Story 3.1 depends on Epic 2 for Incident Generation.
- Story 3.3 depends on Epic 5 (Slack Bot) to display the suggestion.
### Technical Notes
- **Infra:** ECS Fargate consuming from the `correlation-request` SQS queue.
- Use PostgreSQL (TimescaleDB) for historical frequency lookups ("how many times has this fired in 7 days?") to inform the score.
## Epic 4: CI/CD Correlation
**Description:** Ingests deployment events and correlates them with alert storms. The "killer feature" mandated by the Party Mode board for V1 MVP, answering "did this break right after a deploy?"
### User Stories
**Story 4.1: GitHub Actions Deploy Ingestion**
* **As a** Platform Engineer, **I want** to connect my GitHub Actions deployment webhooks, **so that** dd0c/alert knows exactly when and who deployed to production.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/github`
- Validates `X-Hub-Signature-256`.
- Normalizes GHA workflow run payload into `DeployEvent` canonical schema.
- Pushes deploy event to SQS FIFO queue (`deploy-event`).
* **Estimate:** 3 points
**Story 4.2: Deploy-to-Alert Correlation**
* **As an** On-Call Engineer, **I want** an alert cluster to be automatically tagged with a recent deployment to that service, **so that** I don't waste 15 minutes checking deploy logs manually.
* **Acceptance Criteria:**
- When the Correlation Engine opens a window, it queries DynamoDB for deployments to the affected service within a configurable lookback window (default 15m for prod, 30m for staging).
- If a match is found, the deploy context (`deploy_pr`, `deploy_author`, `source_url`) is attached to the window state.
* **Estimate:** 8 points
**Story 4.3: Deploy-Weighted Noise Scoring**
* **As an** On-Call Engineer, **I want** alerts that are highly correlated with deployments to be scored as more likely to be noise (if they aren't critical), **so that** feature flags and config refreshes don't wake me up.
* **Acceptance Criteria:**
- If a deploy event is attached to an incident, boost the noise score by 15-30 points.
- Additional +5 points if the PR title matches `config` or `feature-flag`.
* **Estimate:** 2 points
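The deploy-weighting rule is small enough to sketch directly; the default boost of 15 within the 15-30 range is an assumption:

```python
import re

CONFIG_PR = re.compile(r"config|feature-flag", re.IGNORECASE)

def apply_deploy_weighting(score: int, deploy_attached: bool,
                           pr_title: str = "", boost: int = 15) -> int:
    """Boost the noise score when a recent deploy is attached (Story 4.3).

    `boost` must fall in 15-30 per the criteria; 15 is a default
    assumption. Result stays clamped at 100.
    """
    if not deploy_attached:
        return score
    score += boost
    if CONFIG_PR.search(pr_title):
        score += 5  # config / feature-flag deploys are rarely real incidents
    return min(score, 100)
```

The critical-severity safelist cap from Story 3.2 still applies after weighting, so a deploy-correlated critical alert cannot be scored into suppression territory.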
### Dependencies
- Story 4.2 depends on Epic 2 (Correlation Engine) and Epic 3 (Noise Analysis).
- Service name mapping between GitHub and Datadog/PagerDuty (convention-based string matching).
### Technical Notes
- **Infra:** The Deployment Tracker runs as a module within the Correlation Engine ECS Task to avoid network latency.
- DynamoDB needs a Global Secondary Index (GSI): `tenant_id` + `service` + `completed_at` to quickly find recent deploys.
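The GSI lookup reduces to a range filter per tenant and service; a pure-function sketch (record shape is an assumption for illustration, and in production this is a single DynamoDB `Query` against the GSI rather than a scan):

```python
def recent_deploys(deploys: list[dict], tenant_id: str, service: str,
                   window_opened_at: float,
                   lookback_seconds: int = 900) -> list[dict]:
    """Find deploys to `service` inside the lookback window (default 15 min).

    Stands in for the DynamoDB GSI query (tenant_id + service +
    completed_at); each record would carry deploy_pr, deploy_author,
    source_url for attachment to the window state.
    """
    cutoff = window_opened_at - lookback_seconds
    return [
        d for d in deploys
        if d["tenant_id"] == tenant_id
        and d["service"] == service
        and cutoff <= d["completed_at"] <= window_opened_at
    ]
```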
## Epic 5: Slack Bot
**Description:** The primary interface for on-call engineers. Delivers correlated incident summaries, observe-only suppression suggestions, and daily alert digests directly into Slack. Provides interactive buttons for engineers to acknowledge or validate suggestions.
### User Stories
**Story 5.1: Incident Summary Notifications**
* **As an** On-Call Engineer, **I want** to receive a single, concise Slack message when an alert storm is correlated, **so that** I don't get flooded with dozens of individual alert notifications.
* **Acceptance Criteria:**
- Bot sends a formatted Slack Block Kit message to a configured channel.
- Message groups all related alerts under a single incident title.
- Displays the total number of correlated alerts, affected services, and start time.
* **Estimate:** 5 points
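A sketch of the Block Kit payload builder; the `header`/`section`/`mrkdwn` block types are real Slack primitives, while the field labels and layout here are illustrative:

```python
def incident_summary_blocks(title: str, alert_count: int,
                            services: list[str], started_at: str) -> list[dict]:
    """Build the Block Kit blocks for an incident summary (Story 5.1).

    The returned list is posted as the `blocks` field of
    chat.postMessage.
    """
    return [
        {"type": "header",
         "text": {"type": "plain_text", "text": f"Incident: {title}"}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Correlated alerts:* {alert_count}"},
            {"type": "mrkdwn", "text": f"*Services:* {', '.join(services)}"},
            {"type": "mrkdwn", "text": f"*Started:* {started_at}"},
        ]},
    ]
```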
**Story 5.2: Observe-Only Suppression Suggestions in Slack**
* **As an** On-Call Engineer, **I want** the Slack message to include the system's noise score and suppression recommendation, **so that** I can evaluate its accuracy in real-time.
* **Acceptance Criteria:**
- If noise score > 80, the message includes a specific "Suggestion" block (e.g., "Would have auto-suppressed: 95% noise score").
- Includes the plain-English reasoning generated in Epic 3.
* **Estimate:** 3 points
**Story 5.3: Interactive Feedback Actions**
* **As an** On-Call Engineer, **I want** to click "Good Catch" or "Bad Suggestion" on the Slack message, **so that** I can help train the noise analysis engine for future versions.
* **Acceptance Criteria:**
- Slack message includes interactive buttons for feedback.
- Clicking a button sends a payload back to dd0c/alert to record the user's validation in the database.
- Updates the Slack message to acknowledge the feedback.
* **Estimate:** 5 points
**Story 5.4: Daily Alert Digest**
* **As an** Engineering Manager, **I want** a daily summary of the noisiest services and total incidents dropped into Slack, **so that** my team can prioritize technical debt.
* **Acceptance Criteria:**
- A scheduled job runs daily at 9 AM (configurable timezone).
- Aggregates the previous 24 hours of data from TimescaleDB.
- Posts a summary of "Top 3 Noisiest Services" and "Total Time Saved" (estimated) to the channel.
* **Estimate:** 5 points
### Dependencies
- Story 5.1 depends on Epic 2 (Correlation Engine).
- Story 5.2 depends on Epic 3 (Noise Analysis).
### Technical Notes
- **Infra:** AWS Lambda for handling incoming Slack interactions (buttons) via API Gateway.
- Use Slack's Block Kit Builder for UI consistency.
- Requires storing Slack Workspace and Channel tokens securely in AWS Secrets Manager or DynamoDB.
## Epic 6: Dashboard API
**Description:** The backend REST API that powers the dd0c/alert web dashboard. Provides secure endpoints for authentication, querying historical incidents, analyzing alert volume, and managing tenant configuration.
### User Stories
**Story 6.1: Tenant Authentication & Authorization**
* **As a** Platform Engineer, **I want** to securely log in to the dashboard API, **so that** I can manage my organization's alert data safely.
* **Acceptance Criteria:**
- Implement JWT-based authentication.
- Enforce tenant isolation on all API endpoints (users can only access data for their `tenant_id`).
* **Estimate:** 5 points
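The isolation check itself is tiny once the JWT is verified; a sketch, assuming `claims` is the already-verified token payload (signature verification would use a library such as PyJWT and is omitted), with a hypothetical handler shape for illustration:

```python
def authorize_tenant(claims: dict, path_tenant_id: str) -> bool:
    """Tenant isolation check run on every request.

    A user may only touch the tenant baked into their token,
    never the one named in the URL alone.
    """
    return claims.get("tenant_id") == path_tenant_id

def handle_request(claims: dict, path_tenant_id: str) -> dict:
    # middleware-style gate before any data access (hypothetical shape)
    if not authorize_tenant(claims, path_tenant_id):
        return {"status": 403, "body": "forbidden: cross-tenant access"}
    return {"status": 200, "body": "ok"}
```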
**Story 6.2: Incident Query Endpoints**
* **As an** On-Call Engineer, **I want** to fetch a paginated list of historical incidents and their associated alerts, **so that** I can review past outages.
* **Acceptance Criteria:**
- `GET /v1/incidents` supports pagination, time-range filtering, and service filtering.
- `GET /v1/incidents/{incident_id}/alerts` returns the raw alerts correlated into that incident.
* **Estimate:** 5 points
**Story 6.3: Analytics & Noise Score API**
* **As an** Engineering Manager, **I want** to query aggregated metrics about alert noise and volume, **so that** I can populate charts on the dashboard.
* **Acceptance Criteria:**
- `GET /v1/analytics/noise` returns time-series data of average noise scores per service.
- Queries TimescaleDB efficiently using materialized views or continuous aggregates if necessary.
* **Estimate:** 8 points
**Story 6.4: Configuration Management Endpoints**
* **As a** Platform Engineer, **I want** to manage my integration webhooks and routing rules via API, **so that** I can script my onboarding or use the UI.
* **Acceptance Criteria:**
- CRUD endpoints for managing Slack channel destinations.
- Endpoints to generate and rotate inbound webhook secrets for Datadog/PagerDuty.
* **Estimate:** 3 points
### Dependencies
- Story 6.2 and 6.3 depend on TimescaleDB schema and data from Epics 2 and 3.
### Technical Notes
- **Infra:** API Gateway HTTP API -> AWS Lambda (Node.js/Go).
- Strict validation middleware required for tenant isolation.
- Use standard OpenAPI 3.0 specification for documentation.
## Epic 7: Dashboard UI
**Description:** The React Single Page Application (SPA) for dd0c/alert. Gives users a visual interface to view the incident timeline, inspect alert correlation details, and understand the noise scoring.
### User Stories
**Story 7.1: Incident Timeline View**
* **As an** On-Call Engineer, **I want** a main feed showing all correlated incidents chronologically, **so that** I can see the current state of my systems at a glance.
* **Acceptance Criteria:**
- React SPA fetches and displays data from `GET /v1/incidents`.
- Visual distinction between high-noise (suggested suppressed) and low-noise (critical) incidents.
- Real-time updates or auto-refresh every 30 seconds.
* **Estimate:** 8 points
**Story 7.2: Alert Correlation Visualizer**
* **As an** On-Call Engineer, **I want** to click on an incident and see exactly which alerts were grouped together, **so that** I understand why the engine correlated them.
* **Acceptance Criteria:**
- Detail pane showing the timeline of individual alerts within the incident window.
- Displays the deployment context (Epic 4) if applicable.
* **Estimate:** 5 points
**Story 7.3: Noise Score Breakdown**
* **As a** Platform Engineer, **I want** to see the exact factors that contributed to an incident's noise score, **so that** I can trust the engine's reasoning.
* **Acceptance Criteria:**
- UI component displaying the 0-100 noise score gauge.
- Lists the bulleted reasoning (e.g., "+20 points: Occurred 10 times this week", "+15 points: Recent deployment").
* **Estimate:** 3 points
**Story 7.4: Analytics Dashboard**
* **As an** Engineering Manager, **I want** charts showing alert volume and noise trends over the last 30 days, **so that** I can track improvements in our alert hygiene.
* **Acceptance Criteria:**
- Integrates a charting library (e.g., Recharts or Chart.js).
- Displays a bar chart of total alerts vs. correlated incidents to show "noise reduction" value.
* **Estimate:** 5 points
### Dependencies
- Depends entirely on Epic 6 (Dashboard API).
### Technical Notes
- **Infra:** Hosted on AWS S3 + CloudFront or Vercel.
- Framework: React (Next.js or Vite).
- Tailwind CSS for rapid styling.
## Epic 8: Infrastructure & DevOps
**Description:** The foundational cloud infrastructure and deployment pipelines necessary to run dd0c/alert reliably, securely, and with observability.
### User Stories
**Story 8.1: Infrastructure as Code (IaC)**
* **As a** Developer, **I want** all AWS resources defined in code, **so that** I can easily spin up staging and production environments identically.
* **Acceptance Criteria:**
- Terraform or AWS CDK defines VPC, API Gateway, Lambda functions, ECS Fargate clusters, SQS queues, and DynamoDB tables.
- State is stored securely in an S3 backend with DynamoDB locking.
* **Estimate:** 8 points
**Story 8.2: CI/CD Pipelines**
* **As a** Developer, **I want** automated testing and deployment when I push to main, **so that** I can ship features quickly without manual steps.
* **Acceptance Criteria:**
- GitHub Actions workflow runs unit tests and linters on PRs.
- Merges to `main` trigger a deployment to the staging environment, followed by a manual approval for production.
* **Estimate:** 5 points
**Story 8.3: System Monitoring & Logging**
* **As a** System Admin, **I want** central logging and metrics for the dd0c/alert services, **so that** I can debug issues when the platform itself fails.
* **Acceptance Criteria:**
- All Lambda and ECS logs route to CloudWatch Logs.
- CloudWatch Alarms configured for API 5xx errors and SQS Dead Letter Queue (DLQ) messages.
* **Estimate:** 3 points
**Story 8.4: Database Provisioning (Timescale & Redis)**
* **As a** Database Admin, **I want** managed, highly available instances for TimescaleDB and Redis, **so that** the correlation engine runs with low latency and durable storage.
* **Acceptance Criteria:**
- Provisions AWS ElastiCache for Redis (for active window state).
- Provisions RDS for PostgreSQL with TimescaleDB extension, or uses Timescale Cloud.
* **Estimate:** 5 points
### Dependencies
- Blocked until architectural decisions are finalized.
- Blocks Epics 1, 2, 3 from being deployed to production.
### Technical Notes
- Optimize for a solo founder: keep infrastructure simple and prefer managed services over self-hosted.
- Ensure appropriate IAM roles with least privilege access between Lambda/ECS and DynamoDB/SQS.
## Epic 9: Onboarding & PLG
**Description:** Product-Led Growth and the critical 60-second time-to-value flow. Ensures a frictionless setup experience for new users to connect their monitoring tools and Slack workspace immediately.
### User Stories
**Story 9.1: Frictionless Sign-Up**
* **As a** New User, **I want** to sign up using my GitHub or Google account, **so that** I don't have to create and remember a new password.
* **Acceptance Criteria:**
- Implement OAuth2 login (GitHub/Google).
- Automatically provisions a new `tenant_id` and default configuration upon successful first login.
* **Estimate:** 5 points
**Story 9.2: Webhook Setup Wizard**
* **As a** New User, **I want** a step-by-step wizard to configure my Datadog or PagerDuty webhooks, **so that** I can start sending data to dd0c/alert immediately.
* **Acceptance Criteria:**
- UI wizard provides copy-paste ready webhook URLs and secrets.
- Includes a "Waiting for first payload..." state that updates in real-time via WebSockets or polling when the first alert arrives.
* **Estimate:** 8 points
**Story 9.3: Slack App Installation Flow**
* **As a** New User, **I want** a 1-click "Add to Slack" button, **so that** I can authorize dd0c/alert to post in my incident channels.
* **Acceptance Criteria:**
- Implements the standard Slack OAuth v2 flow.
- Allows the user to select the default channel for incident summaries.
* **Estimate:** 5 points
**Story 9.4: Free Tier Limitations**
* **As a** Product Owner, **I want** a free tier that limits the number of processed alerts or retention period, **so that** users can try the product without me incurring massive AWS costs.
* **Acceptance Criteria:**
- Free tier limits enforced at the ingestion API (e.g., max 10,000 alerts/month).
- UI displays a usage quota bar.
- Data in TimescaleDB is automatically purged after 7 days for free-tier tenants.
* **Estimate:** 5 points
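The ingestion-time quota gate can be sketched as a pure function; the return shape and the unlimited-paid-plan simplification are assumptions:

```python
FREE_TIER_MONTHLY_ALERTS = 10_000  # example cap from the criteria

def check_quota(tenant_plan: str, alerts_this_month: int) -> dict:
    """Quota check run at the ingestion API (Story 9.4).

    Only the free tier is capped here; paid plans pass through.
    `remaining` feeds the UI usage bar.
    """
    if tenant_plan != "free":
        return {"allowed": True, "remaining": None}
    remaining = max(0, FREE_TIER_MONTHLY_ALERTS - alerts_this_month)
    return {"allowed": remaining > 0, "remaining": remaining}
```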
### Dependencies
- Depends on Epic 6 (Dashboard API) and Epic 7 (Dashboard UI).
- Story 9.2 depends on Epic 1 (Webhook Ingestion) being live.
### Technical Notes
- Use Auth0, Clerk, or AWS Cognito to minimize authentication development time for the Solo Founder.
- Real-time "Waiting for payload" can be implemented via a lightweight polling endpoint if WebSockets add too much complexity.
---
## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/alert adheres to the 5 Transparent Factory tenets. For an alert intelligence platform, Semantic Observability is paramount — a tool that reasons about alerts must make its own reasoning fully transparent.
### Story 10.1: Atomic Flagging — Feature Flags for Correlation & Scoring Rules
**As a** solo founder, **I want** every new correlation rule, noise scoring algorithm, and suppression behavior behind a feature flag (default: off), **so that** a bad scoring change doesn't silence critical alerts in production.
**Acceptance Criteria:**
- OpenFeature SDK integrated into the alert processing pipeline. V1: env-var or JSON file provider.
- All flags evaluate locally — no network calls in the alert ingestion hot path.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule suppresses >2x the baseline alert volume over 30 minutes, the flag auto-disables and all suppressed alerts are re-emitted.
- Flags required for: new correlation patterns, CI/CD deployment correlation, noise scoring thresholds, notification channel routing.
**Estimate:** 5 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- Circuit breaker is critical here — a bad suppression rule is worse than no suppression. Track suppression counts per flag in Redis with 30-min sliding window.
- Re-emission: suppressed alerts buffered in a dead-letter queue for 1 hour. On circuit break, replay the queue.
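The sliding-window breaker can be sketched with a sorted timestamp list standing in for the Redis sorted set (ZADD to record, ZREMRANGEBYSCORE to trim, ZCARD to count); the per-flag baseline is supplied by the caller:

```python
import bisect

class SuppressionBreaker:
    """30-minute sliding-window circuit breaker for one flagged rule.

    Trips when suppressions exceed 2x the baseline volume inside the
    window; once tripped, the flag auto-disables and the DLQ replay of
    suppressed alerts begins.
    """
    WINDOW = 30 * 60

    def __init__(self, baseline_per_window: int):
        self.threshold = 2 * baseline_per_window  # ">2x the baseline"
        self._events: list[float] = []
        self.tripped = False

    def record_suppression(self, now: float) -> bool:
        """Record one suppression; return True iff the breaker just tripped."""
        bisect.insort(self._events, now)
        cutoff = now - self.WINDOW
        # drop events older than the 30-minute window
        self._events = self._events[bisect.bisect_right(self._events, cutoff):]
        if not self.tripped and len(self._events) > self.threshold:
            self.tripped = True
            return True
        return False
```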
### Story 10.2: Elastic Schema — Additive-Only for Alert Event Store
**As a** solo founder, **I want** all alert event schema changes to be strictly additive, **so that** historical alert correlation data remains queryable after any deployment.
**Acceptance Criteria:**
- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use `_v2` suffix for breaking changes. Old fields remain readable.
- All event parsers configured to ignore unknown fields (Pydantic `model_config = {"extra": "ignore"}` or equivalent).
- Dual-write during migration windows within the same DB transaction.
- Every migration includes `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
**Estimate:** 3 points
**Dependencies:** Epic 2 (Correlation Engine — event persistence)
**Technical Notes:**
- Alert events are append-only by nature — leverage this. Never mutate historical events.
- For correlation metadata (enrichments added post-ingestion), store as separate linked records rather than mutating the original event.
- TimescaleDB compression policies must handle both V1 and V2 column layouts.
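The CI migration gate reduces to a pattern scan; a sketch in which the exact forbidden-pattern list is an assumption (real migrations would be linted file by file in the pipeline):

```python
import re

# statements that break the additive-only rule on existing columns
FORBIDDEN = [
    re.compile(r"\bDROP\s+(TABLE|COLUMN)\b", re.IGNORECASE),
    re.compile(r"\bALTER\s+COLUMN\b.*\bTYPE\b", re.IGNORECASE),
    re.compile(r"\bRENAME\b", re.IGNORECASE),
]

def lint_migration(sql: str) -> list[str]:
    """Return the forbidden patterns found in a migration.

    Empty list means the migration is purely additive and CI may pass.
    """
    return [p.pattern for p in FORBIDDEN if p.search(sql)]
```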
### Story 10.3: Cognitive Durability — Decision Logs for Scoring Logic
**As a** future maintainer, **I want** every change to noise scoring weights, correlation rules, or suppression thresholds accompanied by a `decision_log.json`, **so that** I can understand why alert X was classified as noise vs. signal.
**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/correlation/`, or `src/suppression/`.
- Cyclomatic complexity cap of 10 enforced in CI. Scoring functions must be decomposable and testable.
- Decision logs in `docs/decisions/`, one per significant logic change.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- Scoring weight changes are especially important to document — "why is deployment correlation weighted 0.7 and not 0.5?"
- Include sample alert scenarios in decision logs showing before/after scoring behavior.
### Story 10.4: Semantic Observability — AI Reasoning Spans on Alert Classification
**As an** on-call engineer investigating a missed critical alert, **I want** every alert scoring and correlation decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why an alert was scored as noise when it was actually a P1 incident.
**Acceptance Criteria:**
- Every alert ingestion creates a parent `alert_evaluation` span. Child spans for `noise_scoring`, `correlation_matching`, and `suppression_decision`.
- Span attributes: `alert.source`, `alert.noise_score`, `alert.correlation_matches` (JSON array), `alert.suppressed` (bool), `alert.suppression_reason`.
- If AI-assisted classification is used: `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score`, `ai.reasoning_chain` (summarized).
- CI/CD correlation spans include: `alert.deployment_correlation_score`, `alert.deployment_id`, `alert.time_since_deploy_seconds`.
- No PII in spans. Alert payloads are hashed for correlation, not logged raw.
**Estimate:** 3 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- This is the most important tenet for dd0c/alert. If the tool suppresses an alert, the reasoning MUST be traceable.
- Use `opentelemetry-python` with OTLP exporter. Batch span export to avoid per-alert overhead.
- For V1 without AI: `alert.suppression_reason` is the rule name + threshold. When AI scoring is added, the full reasoning chain is captured.
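Building the span attributes as a pure function keeps the hot path testable; with opentelemetry-python the resulting dict can be passed to `span.set_attributes(...)`. A sketch, with the function shape itself an assumption:

```python
import hashlib
import json

def evaluation_span_attributes(alert: dict, noise_score: int,
                               matches: list[str], suppressed: bool,
                               reason: str) -> dict:
    """Attributes for the parent `alert_evaluation` span (Story 10.4).

    The raw payload is hashed, never logged, so no PII reaches the
    tracing backend.
    """
    return {
        "alert.source": alert["provider"],
        "alert.payload_hash": hashlib.sha256(
            json.dumps(alert, sort_keys=True).encode()).hexdigest(),
        "alert.noise_score": noise_score,
        "alert.correlation_matches": json.dumps(matches),
        "alert.suppressed": suppressed,
        "alert.suppression_reason": reason,
    }
```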
### Story 10.5: Configurable Autonomy — Governance for Alert Suppression
**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/alert can auto-suppress alerts or only annotate them, **so that** customers never lose visibility into their alerts without explicit opt-in.
**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (annotate-only, never suppress) or `audit` (auto-suppress with full logging).
- Default for all new customers: `strict`. Suppression requires explicit opt-in.
- `panic_mode`: when true, all suppression stops immediately. Every alert passes through unmodified. A "panic active" banner appears in the dashboard.
- Per-customer governance override: customers can only be MORE restrictive than system default.
- All policy decisions logged with full context: "Alert X suppressed by audit mode, rule Y, score Z" or "Alert X annotation-only, strict mode active".
**Estimate:** 3 points
**Dependencies:** Epic 5 (Slack Bot — notification delivery)
**Technical Notes:**
- `strict` mode is the safe default — dd0c/alert adds value even without suppression by annotating alerts with correlation data and noise scores.
- Panic mode: single Redis key `dd0c:panic`. All suppression checks short-circuit on this key. Triggerable via `POST /admin/panic` or env var.
- Customer override: stored in org settings. Merge: `max_restrictive(system, customer)`.
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |