dd0c: full product research pipeline - 6 products, 8 phases each

Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
commit 5ee95d8b13 (2026-02-28 17:35:02 +00:00)
51 changed files with 36935 additions and 0 deletions

# dd0c/alert — V1 MVP Epics
**Product:** dd0c/alert (Alert Intelligence Platform)
**Phase:** 7 — Epics & Stories
---
## Epic 1: Webhook Ingestion
**Description:** The front door of dd0c/alert. Responsible for receiving alert payloads from monitoring providers via webhooks, validating their authenticity, normalizing them into a canonical schema, and queuing them securely for the Correlation Engine. Must support high burst volume (incident storms) and guarantee zero dropped payloads.
### User Stories
**Story 1.1: Datadog Webhook Ingestion**
* **As a** Platform Engineer, **I want** to send Datadog webhooks to a unique dd0c URL, **so that** my Datadog alerts enter the correlation pipeline.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/datadog`
- Normalizes Datadog JSON (handles arrays/batched alerts) into the Canonical Alert Schema.
- Normalizes Datadog P1-P5 severities into critical/high/medium/low/info.
* **Estimate:** 3 points
**Story 1.2: PagerDuty Webhook Ingestion**
* **As a** Platform Engineer, **I want** to send PagerDuty v3 webhooks to dd0c, **so that** my PD incidents are tracked.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/pagerduty`
- Normalizes PagerDuty JSON into the Canonical Alert Schema.
* **Estimate:** 3 points
**Story 1.3: HMAC Signature Validation**
* **As a** Security Admin, **I want** all incoming webhooks to have their HMAC signatures validated, **so that** bad actors cannot inject fake alerts.
* **Acceptance Criteria:**
- Rejects payloads with missing or invalid `DD-WEBHOOK-SIGNATURE` or `X-PagerDuty-Signature` headers with 401 Unauthorized.
- Compares against the integration secret stored in DynamoDB/Secrets Manager.
* **Estimate:** 3 points
**Story 1.4: Payload Normalization & Deduplication (Fingerprinting)**
* **As an** On-Call Engineer, **I want** identical alerts to be deterministically fingerprinted, **so that** flapping or duplicated payloads are instantly recognized.
* **Acceptance Criteria:**
- Generates a SHA-256 fingerprint based on `tenant_id + provider + service + normalized_title`.
- Pushes canonical alert to SQS FIFO queue with `MessageGroupId=tenant_id`.
- Saves raw payload asynchronously to S3 for audit/replay.
* **Estimate:** 5 points
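The fingerprint itself is a one-liner; a sketch, where the title normalization (trim and lowercase) and the field separator are illustrative assumptions beyond the criteria above:

```python
import hashlib

def alert_fingerprint(tenant_id: str, provider: str,
                      service: str, normalized_title: str) -> str:
    """Deterministic SHA-256 fingerprint over the identity fields of an alert.

    A separator that cannot appear in the fields avoids ambiguity
    ("ab"+"c" vs "a"+"bc"); "\x1f" (unit separator) is one safe choice.
    """
    key = "\x1f".join(
        (tenant_id, provider, service, normalized_title.strip().lower()))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```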
### Dependencies
- Story 1.3 depends on Tenant/Integration configuration existing (Epic 9).
- Story 1.4 depends on Canonical Alert Schema definition.
### Technical Notes
- **Infra:** API Gateway HTTP API -> Lambda -> SQS FIFO.
- Lambda must return 200 OK to the provider in <100ms. S3 raw payload storage must be non-blocking (async).
- Use ULIDs for `alert_id` for time-sortability.
## Epic 2: Correlation Engine
**Description:** The intelligence core. Consumes the normalized SQS FIFO queue, groups alerts based on time windows and service dependencies, and outputs correlated incidents.
### User Stories
**Story 2.1: Time-Window Clustering**
* **As an** On-Call Engineer, **I want** alerts firing within a brief time window for the same service to be grouped together, **so that** I don't get paged 10 times for one failure.
* **Acceptance Criteria:**
- Opens a 5-minute (configurable) correlation window in Redis when a new alert fingerprint arrives.
- Groups subsequent alerts for the same tenant/service into the active window.
- Stores the correlation state in ElastiCache Redis.
* **Estimate:** 5 points
**Story 2.2: Cascading Failure Correlation (Service Graph)**
* **As an** On-Call Engineer, **I want** cascading failures across dependent services to be merged into a single incident, **so that** I can see the blast radius of an issue.
* **Acceptance Criteria:**
- Reads explicit service dependencies from DynamoDB (`upstream -> downstream`).
- If a window is open for an upstream service, downstream service alerts are merged into the same window.
* **Estimate:** 8 points
**Story 2.3: Active Window Extension**
* **As an** On-Call Engineer, **I want** the correlation window to automatically extend if alerts are still trickling in, **so that** long-running, cascading incidents are correctly grouped.
* **Acceptance Criteria:**
- If a new alert arrives within the last 30 seconds of a window, the window extends by 2 minutes (max 15 minutes).
- Updates the `closes_at` timestamp in Redis.
* **Estimate:** 3 points
**Story 2.4: Incident Generation & Persistence**
* **As an** On-Call Engineer, **I want** completed time windows to be saved as durable incidents, **so that** I have a permanent record of the correlated event.
* **Acceptance Criteria:**
- When a window closes, it generates an Incident record in DynamoDB.
- Generates an event in TimescaleDB for trend tracking.
- Pushes a `correlation-request` to the Suggestion Engine SQS queue.
* **Estimate:** 5 points
### Dependencies
- Story 2.1 depends on Epic 1 (normalized SQS queue).
- Story 2.2 depends on a basic service dependency mapping (either config or API).
### Technical Notes
- **Infra:** ECS Fargate consuming SQS FIFO.
- Must use Redis Sorted Sets for active window management (`closes_at_epoch` as score).
- The correlation engine must be stateless (relying on Redis) so it can scale horizontally to handle incident storms.
## Epic 3: Noise Analysis
**Description:** The Suggestion Engine. Calculates a noise score (0-100) for correlated incidents and generates observe-only suppression suggestions. It strictly adheres to V1 constraints by *never* taking auto-action.
### User Stories
**Story 3.1: Rule-Based Noise Scoring**
* **As an** On-Call Engineer, **I want** every incident to receive a noise score based on objective data points, **so that** I have a metric to understand if this incident is likely a false positive.
* **Acceptance Criteria:**
- Calculates a 0-100 noise score when an incident is generated.
- Scores based on duplicate fingerprints (flapping), severity distribution (info vs critical), and time of day.
- Score is capped at 100 and floored at 0.
* **Estimate:** 5 points
**Story 3.2: "Never Suppress" Safelist Execution**
* **As a** Platform Engineer, **I want** critical services (databases, billing) to be excluded from high noise scoring regardless of pattern, **so that** I never miss a genuine P1.
* **Acceptance Criteria:**
- Implements a default safelist regex (e.g., `db|rds|payment|billing`).
- Forces the noise score below 50 if the service or title matches the safelist, or if severity is critical.
* **Estimate:** 3 points
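A sketch of the scoring function combining Stories 3.1 and 3.2; the individual weights are illustrative assumptions (the stories only fix the 0-100 range, the safelist regex, and the below-50 cap):

```python
import re

# default "never suppress" safelist from Story 3.2
SAFELIST = re.compile(r"db|rds|payment|billing", re.IGNORECASE)

def noise_score(duplicate_count: int, severities: list[str],
                service: str, title: str) -> int:
    """Rule-based 0-100 noise score; higher = more likely noise.

    Weights are placeholders, not spec. Critical or safelisted
    incidents are forced below 50 per Story 3.2.
    """
    score = 0
    score += min(duplicate_count * 10, 40)  # flapping fingerprints
    informational = sum(1 for s in severities if s in ("info", "low"))
    if severities:
        score += int(40 * informational / len(severities))  # severity mix
    score = max(0, min(100, score))
    # hard cap below 50 for critical severity or safelist matches
    if ("critical" in severities or SAFELIST.search(service)
            or SAFELIST.search(title)):
        score = min(score, 49)
    return score
```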
**Story 3.3: Observe-Only Suppression Suggestions**
* **As an** On-Call Engineer, **I want** the system to tell me what it *would* have suppressed, **so that** I can build trust in its intelligence without risking an outage.
* **Acceptance Criteria:**
- If a noise score > 80, the system generates a `suppress` suggestion record in DynamoDB.
- Generates plain-English reasoning for the suggestion (e.g., "This pattern was resolved automatically 4 times this month.").
- `action_taken` is always hardcoded to `none` for V1.
* **Estimate:** 5 points
**Story 3.4: Incident Scoring Metrics Collection**
* **As an** Engineering Manager, **I want** the noise scores and counts to be stored as time-series data, **so that** I can view trends in our alert hygiene over time.
* **Acceptance Criteria:**
- Writes noise score, alert counts, and unique fingerprints to TimescaleDB `alert_timeseries` table.
* **Estimate:** 3 points
### Dependencies
- Story 3.1 depends on Epic 2 for Incident Generation.
- Story 3.3 depends on Epic 5 (Slack Bot) to display the suggestion.
### Technical Notes
- **Infra:** ECS Fargate consuming from the `correlation-request` SQS queue.
- Use PostgreSQL (TimescaleDB) for historical frequency lookups ("how many times has this fired in 7 days?") to inform the score.
## Epic 4: CI/CD Correlation
**Description:** Ingests deployment events and correlates them with alert storms. The "killer feature" mandated by the Party Mode board for V1 MVP, answering "did this break right after a deploy?"
### User Stories
**Story 4.1: GitHub Actions Deploy Ingestion**
* **As a** Platform Engineer, **I want** to connect my GitHub Actions deployment webhooks, **so that** dd0c/alert knows exactly when and who deployed to production.
* **Acceptance Criteria:**
- System exposes `POST /v1/wh/{tenant_id}/github`
- Validates `X-Hub-Signature-256`.
- Normalizes GHA workflow run payload into `DeployEvent` canonical schema.
- Pushes deploy event to SQS FIFO queue (`deploy-event`).
* **Estimate:** 3 points
**Story 4.2: Deploy-to-Alert Correlation**
* **As an** On-Call Engineer, **I want** an alert cluster to be automatically tagged with a recent deployment to that service, **so that** I don't waste 15 minutes checking deploy logs manually.
* **Acceptance Criteria:**
- When the Correlation Engine opens a window, it queries DynamoDB for deployments to the affected service within a configurable lookback window (default 15m for prod, 30m for staging).
- If a match is found, the deploy context (`deploy_pr`, `deploy_author`, `source_url`) is attached to the window state.
* **Estimate:** 8 points
**Story 4.3: Deploy-Weighted Noise Scoring**
* **As an** On-Call Engineer, **I want** alerts that are highly correlated with deployments to be scored as more likely to be noise (if they aren't critical), **so that** feature flags and config refreshes don't wake me up.
* **Acceptance Criteria:**
- If a deploy event is attached to an incident, boost the noise score by 15-30 points.
- Additional +5 points if the PR title matches `config` or `feature-flag`.
* **Estimate:** 2 points
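The deploy-weighting rule is small enough to sketch directly; the default boost of 15 within the 15-30 range is an assumption:

```python
import re

CONFIG_PR = re.compile(r"config|feature-flag", re.IGNORECASE)

def apply_deploy_weighting(score: int, deploy_attached: bool,
                           pr_title: str = "", boost: int = 15) -> int:
    """Boost the noise score when a recent deploy is attached (Story 4.3).

    `boost` must fall in 15-30 per the criteria; 15 is a default
    assumption. Result stays clamped at 100.
    """
    if not deploy_attached:
        return score
    score += boost
    if CONFIG_PR.search(pr_title):
        score += 5  # config / feature-flag deploys are rarely real incidents
    return min(score, 100)
```

The critical-severity safelist cap from Story 3.2 still applies after weighting, so a deploy-correlated critical alert cannot be scored into suppression territory.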
### Dependencies
- Story 4.2 depends on Epic 2 (Correlation Engine) and Epic 3 (Noise Analysis).
- Service name mapping between GitHub and Datadog/PagerDuty (convention-based string matching).
### Technical Notes
- **Infra:** The Deployment Tracker runs as a module within the Correlation Engine ECS Task to avoid network latency.
- DynamoDB needs a Global Secondary Index (GSI): `tenant_id` + `service` + `completed_at` to quickly find recent deploys.
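The GSI lookup reduces to a range filter per tenant and service; a pure-function sketch (record shape is an assumption for illustration, and in production this is a single DynamoDB `Query` against the GSI rather than a scan):

```python
def recent_deploys(deploys: list[dict], tenant_id: str, service: str,
                   window_opened_at: float,
                   lookback_seconds: int = 900) -> list[dict]:
    """Find deploys to `service` inside the lookback window (default 15 min).

    Stands in for the DynamoDB GSI query (tenant_id + service +
    completed_at); each record would carry deploy_pr, deploy_author,
    source_url for attachment to the window state.
    """
    cutoff = window_opened_at - lookback_seconds
    return [
        d for d in deploys
        if d["tenant_id"] == tenant_id
        and d["service"] == service
        and cutoff <= d["completed_at"] <= window_opened_at
    ]
```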
## Epic 5: Slack Bot
**Description:** The primary interface for on-call engineers. Delivers correlated incident summaries, observe-only suppression suggestions, and daily alert digests directly into Slack. Provides interactive buttons for engineers to acknowledge or validate suggestions.
### User Stories
**Story 5.1: Incident Summary Notifications**
* **As an** On-Call Engineer, **I want** to receive a single, concise Slack message when an alert storm is correlated, **so that** I don't get flooded with dozens of individual alert notifications.
* **Acceptance Criteria:**
- Bot sends a formatted Slack Block Kit message to a configured channel.
- Message groups all related alerts under a single incident title.
- Displays the total number of correlated alerts, affected services, and start time.
* **Estimate:** 5 points
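A sketch of the Block Kit payload builder; the `header`/`section`/`mrkdwn` block types are real Slack primitives, while the field labels and layout here are illustrative:

```python
def incident_summary_blocks(title: str, alert_count: int,
                            services: list[str], started_at: str) -> list[dict]:
    """Build the Block Kit blocks for an incident summary (Story 5.1).

    The returned list is posted as the `blocks` field of
    chat.postMessage.
    """
    return [
        {"type": "header",
         "text": {"type": "plain_text", "text": f"Incident: {title}"}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Correlated alerts:* {alert_count}"},
            {"type": "mrkdwn", "text": f"*Services:* {', '.join(services)}"},
            {"type": "mrkdwn", "text": f"*Started:* {started_at}"},
        ]},
    ]
```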
**Story 5.2: Observe-Only Suppression Suggestions in Slack**
* **As an** On-Call Engineer, **I want** the Slack message to include the system's noise score and suppression recommendation, **so that** I can evaluate its accuracy in real-time.
* **Acceptance Criteria:**
- If noise score > 80, the message includes a specific "Suggestion" block (e.g., "Would have auto-suppressed: 95% noise score").
- Includes the plain-English reasoning generated in Epic 3.
* **Estimate:** 3 points
**Story 5.3: Interactive Feedback Actions**
* **As an** On-Call Engineer, **I want** to click "Good Catch" or "Bad Suggestion" on the Slack message, **so that** I can help train the noise analysis engine for future versions.
* **Acceptance Criteria:**
- Slack message includes interactive buttons for feedback.
- Clicking a button sends a payload back to dd0c/alert to record the user's validation in the database.
- Updates the Slack message to acknowledge the feedback.
* **Estimate:** 5 points
**Story 5.4: Daily Alert Digest**
* **As an** Engineering Manager, **I want** a daily summary of the noisiest services and total incidents dropped into Slack, **so that** my team can prioritize technical debt.
* **Acceptance Criteria:**
- A scheduled job runs daily at 9 AM (configurable timezone).
- Aggregates the previous 24 hours of data from TimescaleDB.
- Posts a summary of "Top 3 Noisiest Services" and "Total Time Saved" (estimated) to the channel.
* **Estimate:** 5 points
### Dependencies
- Story 5.1 depends on Epic 2 (Correlation Engine).
- Story 5.2 depends on Epic 3 (Noise Analysis).
### Technical Notes
- **Infra:** AWS Lambda for handling incoming Slack interactions (buttons) via API Gateway.
- Use Slack's Block Kit Builder for UI consistency.
- Requires storing Slack Workspace and Channel tokens securely in AWS Secrets Manager or DynamoDB.
## Epic 6: Dashboard API
**Description:** The backend REST API that powers the dd0c/alert web dashboard. Provides secure endpoints for authentication, querying historical incidents, analyzing alert volume, and managing tenant configuration.
### User Stories
**Story 6.1: Tenant Authentication & Authorization**
* **As a** Platform Engineer, **I want** to securely log in to the dashboard API, **so that** I can manage my organization's alert data safely.
* **Acceptance Criteria:**
- Implement JWT-based authentication.
- Enforce tenant isolation on all API endpoints (users can only access data for their `tenant_id`).
* **Estimate:** 5 points
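The isolation check itself is tiny once the JWT is verified; a sketch, assuming `claims` is the already-verified token payload (signature verification would use a library such as PyJWT and is omitted), with a hypothetical handler shape for illustration:

```python
def authorize_tenant(claims: dict, path_tenant_id: str) -> bool:
    """Tenant isolation check run on every request.

    A user may only touch the tenant baked into their token,
    never the one named in the URL alone.
    """
    return claims.get("tenant_id") == path_tenant_id

def handle_request(claims: dict, path_tenant_id: str) -> dict:
    # middleware-style gate before any data access (hypothetical shape)
    if not authorize_tenant(claims, path_tenant_id):
        return {"status": 403, "body": "forbidden: cross-tenant access"}
    return {"status": 200, "body": "ok"}
```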
**Story 6.2: Incident Query Endpoints**
* **As an** On-Call Engineer, **I want** to fetch a paginated list of historical incidents and their associated alerts, **so that** I can review past outages.
* **Acceptance Criteria:**
- `GET /v1/incidents` supports pagination, time-range filtering, and service filtering.
- `GET /v1/incidents/{incident_id}/alerts` returns the raw alerts correlated into that incident.
* **Estimate:** 5 points
**Story 6.3: Analytics & Noise Score API**
* **As an** Engineering Manager, **I want** to query aggregated metrics about alert noise and volume, **so that** I can populate charts on the dashboard.
* **Acceptance Criteria:**
- `GET /v1/analytics/noise` returns time-series data of average noise scores per service.
- Queries TimescaleDB efficiently using materialized views or continuous aggregates if necessary.
* **Estimate:** 8 points
**Story 6.4: Configuration Management Endpoints**
* **As a** Platform Engineer, **I want** to manage my integration webhooks and routing rules via API, **so that** I can script my onboarding or use the UI.
* **Acceptance Criteria:**
- CRUD endpoints for managing Slack channel destinations.
- Endpoints to generate and rotate inbound webhook secrets for Datadog/PagerDuty.
* **Estimate:** 3 points
### Dependencies
- Story 6.2 and 6.3 depend on TimescaleDB schema and data from Epics 2 and 3.
### Technical Notes
- **Infra:** API Gateway HTTP API -> AWS Lambda (Node.js/Go).
- Strict validation middleware required for tenant isolation.
- Use standard OpenAPI 3.0 specification for documentation.
## Epic 7: Dashboard UI
**Description:** The React Single Page Application (SPA) for dd0c/alert. Gives users a visual interface to view the incident timeline, inspect alert correlation details, and understand the noise scoring.
### User Stories
**Story 7.1: Incident Timeline View**
* **As an** On-Call Engineer, **I want** a main feed showing all correlated incidents chronologically, **so that** I can see the current state of my systems at a glance.
* **Acceptance Criteria:**
- React SPA fetches and displays data from `GET /v1/incidents`.
- Visual distinction between high-noise (suggested suppressed) and low-noise (critical) incidents.
- Real-time updates or auto-refresh every 30 seconds.
* **Estimate:** 8 points
**Story 7.2: Alert Correlation Visualizer**
* **As an** On-Call Engineer, **I want** to click on an incident and see exactly which alerts were grouped together, **so that** I understand why the engine correlated them.
* **Acceptance Criteria:**
- Detail pane showing the timeline of individual alerts within the incident window.
- Displays the deployment context (Epic 4) if applicable.
* **Estimate:** 5 points
**Story 7.3: Noise Score Breakdown**
* **As a** Platform Engineer, **I want** to see the exact factors that contributed to an incident's noise score, **so that** I can trust the engine's reasoning.
* **Acceptance Criteria:**
- UI component displaying the 0-100 noise score gauge.
- Lists the bulleted reasoning (e.g., "+20 points: Occurred 10 times this week", "+15 points: Recent deployment").
* **Estimate:** 3 points
**Story 7.4: Analytics Dashboard**
* **As an** Engineering Manager, **I want** charts showing alert volume and noise trends over the last 30 days, **so that** I can track improvements in our alert hygiene.
* **Acceptance Criteria:**
- Integrates a charting library (e.g., Recharts or Chart.js).
- Displays a bar chart of total alerts vs. correlated incidents to show "noise reduction" value.
* **Estimate:** 5 points
### Dependencies
- Depends entirely on Epic 6 (Dashboard API).
### Technical Notes
- **Infra:** Hosted on AWS S3 + CloudFront or Vercel.
- Framework: React (Next.js or Vite).
- Tailwind CSS for rapid styling.
## Epic 8: Infrastructure & DevOps
**Description:** The foundational cloud infrastructure and deployment pipelines necessary to run dd0c/alert reliably, securely, and with observability.
### User Stories
**Story 8.1: Infrastructure as Code (IaC)**
* **As a** Developer, **I want** all AWS resources defined in code, **so that** I can easily spin up staging and production environments identically.
* **Acceptance Criteria:**
- Terraform or AWS CDK defines VPC, API Gateway, Lambda functions, ECS Fargate clusters, SQS queues, and DynamoDB tables.
- State is stored securely in an S3 backend with DynamoDB locking.
* **Estimate:** 8 points
**Story 8.2: CI/CD Pipelines**
* **As a** Developer, **I want** automated testing and deployment when I push to main, **so that** I can ship features quickly without manual steps.
* **Acceptance Criteria:**
- GitHub Actions workflow runs unit tests and linters on PRs.
- Merges to `main` trigger a deployment to the staging environment, followed by a manual approval for production.
* **Estimate:** 5 points
**Story 8.3: System Monitoring & Logging**
* **As a** System Admin, **I want** central logging and metrics for the dd0c/alert services, **so that** I can debug issues when the platform itself fails.
* **Acceptance Criteria:**
- All Lambda and ECS logs route to CloudWatch Logs.
- CloudWatch Alarms configured for API 5xx errors and SQS Dead Letter Queue (DLQ) messages.
* **Estimate:** 3 points
**Story 8.4: Database Provisioning (Timescale & Redis)**
* **As a** Database Admin, **I want** managed, highly available instances for TimescaleDB and Redis, **so that** the correlation engine runs with low latency and durable storage.
* **Acceptance Criteria:**
- Provisions AWS ElastiCache for Redis (for active window state).
- Provisions RDS for PostgreSQL with TimescaleDB extension, or uses Timescale Cloud.
* **Estimate:** 5 points
### Dependencies
- Blocked until architectural decisions are finalized.
- Blocks Epics 1, 2, 3 from being deployed to production.
### Technical Notes
- Optimize for a solo founder: keep infrastructure simple and prefer managed services over self-hosted.
- Ensure appropriate IAM roles with least privilege access between Lambda/ECS and DynamoDB/SQS.
## Epic 9: Onboarding & PLG
**Description:** Product-Led Growth and the critical 60-second time-to-value flow. Ensures a frictionless setup experience for new users to connect their monitoring tools and Slack workspace immediately.
### User Stories
**Story 9.1: Frictionless Sign-Up**
* **As a** New User, **I want** to sign up using my GitHub or Google account, **so that** I don't have to create and remember a new password.
* **Acceptance Criteria:**
- Implement OAuth2 login (GitHub/Google).
- Automatically provisions a new `tenant_id` and default configuration upon successful first login.
* **Estimate:** 5 points
**Story 9.2: Webhook Setup Wizard**
* **As a** New User, **I want** a step-by-step wizard to configure my Datadog or PagerDuty webhooks, **so that** I can start sending data to dd0c/alert immediately.
* **Acceptance Criteria:**
- UI wizard provides copy-paste ready webhook URLs and secrets.
- Includes a "Waiting for first payload..." state that updates in real-time via WebSockets or polling when the first alert arrives.
* **Estimate:** 8 points
**Story 9.3: Slack App Installation Flow**
* **As a** New User, **I want** a 1-click "Add to Slack" button, **so that** I can authorize dd0c/alert to post in my incident channels.
* **Acceptance Criteria:**
- Implements the standard Slack OAuth v2 flow.
- Allows the user to select the default channel for incident summaries.
* **Estimate:** 5 points
**Story 9.4: Free Tier Limitations**
* **As a** Product Owner, **I want** a free tier that limits the number of processed alerts or retention period, **so that** users can try the product without me incurring massive AWS costs.
* **Acceptance Criteria:**
- Free tier limits enforced at the ingestion API (e.g., max 10,000 alerts/month).
- UI displays a usage quota bar.
- Data in TimescaleDB is automatically purged after 7 days for free-tier tenants.
* **Estimate:** 5 points
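The ingestion-time quota gate can be sketched as a pure function; the return shape and the unlimited-paid-plan simplification are assumptions:

```python
FREE_TIER_MONTHLY_ALERTS = 10_000  # example cap from the criteria

def check_quota(tenant_plan: str, alerts_this_month: int) -> dict:
    """Quota check run at the ingestion API (Story 9.4).

    Only the free tier is capped here; paid plans pass through.
    `remaining` feeds the UI usage bar.
    """
    if tenant_plan != "free":
        return {"allowed": True, "remaining": None}
    remaining = max(0, FREE_TIER_MONTHLY_ALERTS - alerts_this_month)
    return {"allowed": remaining > 0, "remaining": remaining}
```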
### Dependencies
- Depends on Epic 6 (Dashboard API) and Epic 7 (Dashboard UI).
- Story 9.2 depends on Epic 1 (Webhook Ingestion) being live.
### Technical Notes
- Use Auth0, Clerk, or AWS Cognito to minimize authentication development time for the Solo Founder.
- Real-time "Waiting for payload" can be implemented via a lightweight polling endpoint if WebSockets add too much complexity.
---
## Epic 10: Transparent Factory Compliance
**Description:** Cross-cutting epic ensuring dd0c/alert adheres to the 5 Transparent Factory tenets. For an alert intelligence platform, Semantic Observability is paramount — a tool that reasons about alerts must make its own reasoning fully transparent.
### Story 10.1: Atomic Flagging — Feature Flags for Correlation & Scoring Rules
**As a** solo founder, **I want** every new correlation rule, noise scoring algorithm, and suppression behavior behind a feature flag (default: off), **so that** a bad scoring change doesn't silence critical alerts in production.
**Acceptance Criteria:**
- OpenFeature SDK integrated into the alert processing pipeline. V1: env-var or JSON file provider.
- All flags evaluate locally — no network calls in the alert ingestion hot path.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule suppresses >2x the baseline alert volume over 30 minutes, the flag auto-disables and all suppressed alerts are re-emitted.
- Flags required for: new correlation patterns, CI/CD deployment correlation, noise scoring thresholds, notification channel routing.
**Estimate:** 5 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- Circuit breaker is critical here — a bad suppression rule is worse than no suppression. Track suppression counts per flag in Redis with 30-min sliding window.
- Re-emission: suppressed alerts buffered in a dead-letter queue for 1 hour. On circuit break, replay the queue.
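The sliding-window breaker can be sketched with a sorted timestamp list standing in for the Redis sorted set (ZADD to record, ZREMRANGEBYSCORE to trim, ZCARD to count); the per-flag baseline is supplied by the caller:

```python
import bisect

class SuppressionBreaker:
    """30-minute sliding-window circuit breaker for one flagged rule.

    Trips when suppressions exceed 2x the baseline volume inside the
    window; once tripped, the flag auto-disables and the DLQ replay of
    suppressed alerts begins.
    """
    WINDOW = 30 * 60

    def __init__(self, baseline_per_window: int):
        self.threshold = 2 * baseline_per_window  # ">2x the baseline"
        self._events: list[float] = []
        self.tripped = False

    def record_suppression(self, now: float) -> bool:
        """Record one suppression; return True iff the breaker just tripped."""
        bisect.insort(self._events, now)
        cutoff = now - self.WINDOW
        # drop events older than the 30-minute window
        self._events = self._events[bisect.bisect_right(self._events, cutoff):]
        if not self.tripped and len(self._events) > self.threshold:
            self.tripped = True
            return True
        return False
```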
### Story 10.2: Elastic Schema — Additive-Only for Alert Event Store
**As a** solo founder, **I want** all alert event schema changes to be strictly additive, **so that** historical alert correlation data remains queryable after any deployment.
**Acceptance Criteria:**
- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use `_v2` suffix for breaking changes. Old fields remain readable.
- All event parsers configured to ignore unknown fields (Pydantic `model_config = {"extra": "ignore"}` or equivalent).
- Dual-write during migration windows within the same DB transaction.
- Every migration includes `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
**Estimate:** 3 points
**Dependencies:** Epic 2 (Correlation Engine — event persistence)
**Technical Notes:**
- Alert events are append-only by nature — leverage this. Never mutate historical events.
- For correlation metadata (enrichments added post-ingestion), store as separate linked records rather than mutating the original event.
- TimescaleDB compression policies must handle both V1 and V2 column layouts.
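The CI migration gate reduces to a pattern scan; a sketch in which the exact forbidden-pattern list is an assumption (real migrations would be linted file by file in the pipeline):

```python
import re

# statements that break the additive-only rule on existing columns
FORBIDDEN = [
    re.compile(r"\bDROP\s+(TABLE|COLUMN)\b", re.IGNORECASE),
    re.compile(r"\bALTER\s+COLUMN\b.*\bTYPE\b", re.IGNORECASE),
    re.compile(r"\bRENAME\b", re.IGNORECASE),
]

def lint_migration(sql: str) -> list[str]:
    """Return the forbidden patterns found in a migration.

    Empty list means the migration is purely additive and CI may pass.
    """
    return [p.pattern for p in FORBIDDEN if p.search(sql)]
```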
### Story 10.3: Cognitive Durability — Decision Logs for Scoring Logic
**As a** future maintainer, **I want** every change to noise scoring weights, correlation rules, or suppression thresholds accompanied by a `decision_log.json`, **so that** I can understand why alert X was classified as noise vs. signal.
**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/correlation/`, or `src/suppression/`.
- Cyclomatic complexity cap of 10 enforced in CI. Scoring functions must be decomposable and testable.
- Decision logs in `docs/decisions/`, one per significant logic change.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- Scoring weight changes are especially important to document — "why is deployment correlation weighted 0.7 and not 0.5?"
- Include sample alert scenarios in decision logs showing before/after scoring behavior.
### Story 10.4: Semantic Observability — AI Reasoning Spans on Alert Classification
**As an** on-call engineer investigating a missed critical alert, **I want** every alert scoring and correlation decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why an alert was scored as noise when it was actually a P1 incident.
**Acceptance Criteria:**
- Every alert ingestion creates a parent `alert_evaluation` span. Child spans for `noise_scoring`, `correlation_matching`, and `suppression_decision`.
- Span attributes: `alert.source`, `alert.noise_score`, `alert.correlation_matches` (JSON array), `alert.suppressed` (bool), `alert.suppression_reason`.
- If AI-assisted classification is used: `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score`, `ai.reasoning_chain` (summarized).
- CI/CD correlation spans include: `alert.deployment_correlation_score`, `alert.deployment_id`, `alert.time_since_deploy_seconds`.
- No PII in spans. Alert payloads are hashed for correlation, not logged raw.
**Estimate:** 3 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- This is the most important tenet for dd0c/alert. If the tool suppresses an alert, the reasoning MUST be traceable.
- Use `opentelemetry-python` with OTLP exporter. Batch span export to avoid per-alert overhead.
- For V1 without AI: `alert.suppression_reason` is the rule name + threshold. When AI scoring is added, the full reasoning chain is captured.
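Building the span attributes as a pure function keeps the hot path testable; with opentelemetry-python the resulting dict can be passed to `span.set_attributes(...)`. A sketch, with the function shape itself an assumption:

```python
import hashlib
import json

def evaluation_span_attributes(alert: dict, noise_score: int,
                               matches: list[str], suppressed: bool,
                               reason: str) -> dict:
    """Attributes for the parent `alert_evaluation` span (Story 10.4).

    The raw payload is hashed, never logged, so no PII reaches the
    tracing backend.
    """
    return {
        "alert.source": alert["provider"],
        "alert.payload_hash": hashlib.sha256(
            json.dumps(alert, sort_keys=True).encode()).hexdigest(),
        "alert.noise_score": noise_score,
        "alert.correlation_matches": json.dumps(matches),
        "alert.suppressed": suppressed,
        "alert.suppression_reason": reason,
    }
```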
### Story 10.5: Configurable Autonomy — Governance for Alert Suppression
**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/alert can auto-suppress alerts or only annotate them, **so that** customers never lose visibility into their alerts without explicit opt-in.
**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (annotate-only, never suppress) or `audit` (auto-suppress with full logging).
- Default for all new customers: `strict`. Suppression requires explicit opt-in.
- `panic_mode`: when true, all suppression stops immediately. Every alert passes through unmodified. A "panic active" banner appears in the dashboard.
- Per-customer governance override: customers can only be MORE restrictive than system default.
- All policy decisions logged with full context: "Alert X suppressed by audit mode, rule Y, score Z" or "Alert X annotation-only, strict mode active".
**Estimate:** 3 points
**Dependencies:** Epic 5 (Slack Bot — notification delivery)
**Technical Notes:**
- `strict` mode is the safe default — dd0c/alert adds value even without suppression by annotating alerts with correlation data and noise scores.
- Panic mode: single Redis key `dd0c:panic`. All suppression checks short-circuit on this key. Triggerable via `POST /admin/panic` or env var.
- Customer override: stored in org settings. Merge: `max_restrictive(system, customer)`.
### Epic 10 Summary
| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |