# dd0c/alert — V1 MVP Epics

**Product:** dd0c/alert (Alert Intelligence Platform)
**Phase:** 7 — Epics & Stories

---

## Epic 1: Webhook Ingestion

**Description:** The front door of dd0c/alert. Responsible for receiving alert payloads from monitoring providers via webhooks, validating their authenticity, normalizing them into a canonical schema, and queuing them securely for the Correlation Engine. Must support high burst volume (incident storms) and guarantee zero dropped payloads.

### User Stories

**Story 1.1: Datadog Webhook Ingestion**

* **As a** Platform Engineer, **I want** to send Datadog webhooks to a unique dd0c URL, **so that** my Datadog alerts enter the correlation pipeline.
* **Acceptance Criteria:**
  - System exposes `POST /v1/wh/{tenant_id}/datadog`
  - Normalizes Datadog JSON (handles arrays/batched alerts) into the Canonical Alert Schema.
  - Normalizes Datadog P1-P5 severities into critical/high/medium/low/info.
* **Estimate:** 3 points

**Story 1.2: PagerDuty Webhook Ingestion**

* **As a** Platform Engineer, **I want** to send PagerDuty v3 webhooks to dd0c, **so that** my PD incidents are tracked.
* **Acceptance Criteria:**
  - System exposes `POST /v1/wh/{tenant_id}/pagerduty`
  - Normalizes PagerDuty JSON into the Canonical Alert Schema.
* **Estimate:** 3 points

**Story 1.3: HMAC Signature Validation**

* **As a** Security Admin, **I want** all incoming webhooks to have their HMAC signatures validated, **so that** bad actors cannot inject fake alerts.
* **Acceptance Criteria:**
  - Rejects payloads with missing or invalid `DD-WEBHOOK-SIGNATURE` or `X-PagerDuty-Signature` headers with 401 Unauthorized.
  - Compares against the integration secret stored in DynamoDB/Secrets Manager.
* **Estimate:** 3 points

**Story 1.4: Payload Normalization & Deduplication (Fingerprinting)**

* **As an** On-Call Engineer, **I want** identical alerts to be deterministically fingerprinted, **so that** flapping or duplicated payloads are instantly recognized.
* **Acceptance Criteria:**
  - Generates a SHA-256 fingerprint based on `tenant_id + provider + service + normalized_title`.
  - Pushes canonical alert to SQS FIFO queue with `MessageGroupId=tenant_id`.
  - Saves raw payload asynchronously to S3 for audit/replay.
* **Estimate:** 5 points

### Dependencies

- Story 1.3 depends on Tenant/Integration configuration existing (Epic 9).
- Story 1.4 depends on Canonical Alert Schema definition.

### Technical Notes

- **Infra:** API Gateway HTTP API -> Lambda -> SQS FIFO.
- Lambda must return 200 OK to the provider in <100ms. S3 raw payload storage must be non-blocking (async).
- Use ULIDs for `alert_id` for time-sortability.

## Epic 2: Correlation Engine

**Description:** The intelligence core. Consumes the normalized SQS FIFO queue, groups alerts based on time windows and service dependencies, and outputs correlated incidents.

### User Stories

**Story 2.1: Time-Window Clustering**

* **As an** On-Call Engineer, **I want** alerts firing within a brief time window for the same service to be grouped together, **so that** I don't get paged 10 times for one failure.
* **Acceptance Criteria:**
  - Opens a 5-minute (configurable) correlation window in Redis when a new alert fingerprint arrives.
  - Groups subsequent alerts for the same tenant/service into the active window.
  - Stores the correlation state in ElastiCache Redis.
* **Estimate:** 5 points

**Story 2.2: Cascading Failure Correlation (Service Graph)**

* **As an** On-Call Engineer, **I want** cascading failures across dependent services to be merged into a single incident, **so that** I can see the blast radius of an issue.
* **Acceptance Criteria:**
  - Reads explicit service dependencies from DynamoDB (`upstream -> downstream`).
  - If a window is open for an upstream service, downstream service alerts are merged into the same window.
* **Estimate:** 8 points

**Story 2.3: Active Window Extension**

* **As an** On-Call Engineer, **I want** the correlation window to automatically extend if alerts are still trickling in, **so that** long-running, cascading incidents are correctly grouped.
* **Acceptance Criteria:**
  - If a new alert arrives within the last 30 seconds of a window, the window extends by 2 minutes (max 15 minutes).
  - Updates the `closes_at` timestamp in Redis.
* **Estimate:** 3 points

**Story 2.4: Incident Generation & Persistence**

* **As an** On-Call Engineer, **I want** completed time windows to be saved as durable incidents, **so that** I have a permanent record of the correlated event.
* **Acceptance Criteria:**
  - When a window closes, it generates an Incident record in DynamoDB.
  - Generates an event in TimescaleDB for trend tracking.
  - Pushes a `correlation-request` to the Suggestion Engine SQS queue.
* **Estimate:** 5 points

### Dependencies

- Story 2.1 depends on Epic 1 (normalized SQS queue).
- Story 2.2 depends on a basic service dependency mapping (either config or API).

### Technical Notes

- **Infra:** ECS Fargate consuming SQS FIFO.
- Must use Redis Sorted Sets for active window management (`closes_at_epoch` as score).
- The correlation engine must be stateless (relying on Redis) so it can scale horizontally to handle incident storms.

## Epic 3: Noise Analysis

**Description:** The Suggestion Engine. Calculates a noise score (0-100) for correlated incidents and generates observe-only suppression suggestions. It strictly adheres to V1 constraints by *never* taking auto-action.

### User Stories

**Story 3.1: Rule-Based Noise Scoring**

* **As an** On-Call Engineer, **I want** every incident to receive a noise score based on objective data points, **so that** I have a metric to understand if this incident is likely a false positive.
* **Acceptance Criteria:**
  - Calculates a 0-100 noise score when an incident is generated.
  - Scores based on duplicate fingerprints (flapping), severity distribution (info vs critical), and time of day.
  - Cap at 100, floor at 0.
* **Estimate:** 5 points

**Story 3.2: "Never Suppress" Safelist Execution**

* **As a** Platform Engineer, **I want** critical services (databases, billing) to be excluded from high noise scoring regardless of pattern, **so that** I never miss a genuine P1.
* **Acceptance Criteria:**
  - Implements a default safelist regex (e.g., `db|rds|payment|billing`).
  - Forces the noise score below 50 if the service or title matches the safelist, or if severity is critical.
* **Estimate:** 3 points

**Story 3.3: Observe-Only Suppression Suggestions**

* **As an** On-Call Engineer, **I want** the system to tell me what it *would* have suppressed, **so that** I can build trust in its intelligence without risking an outage.
* **Acceptance Criteria:**
  - If a noise score > 80, the system generates a `suppress` suggestion record in DynamoDB.
  - Generates plain-English reasoning for the suggestion (e.g., "This pattern was resolved automatically 4 times this month.").
  - `action_taken` is always hardcoded to `none` for V1.
* **Estimate:** 5 points

**Story 3.4: Incident Scoring Metrics Collection**

* **As an** Engineering Manager, **I want** the noise scores and counts to be stored as time-series data, **so that** I can view trends in our alert hygiene over time.
* **Acceptance Criteria:**
  - Writes noise score, alert counts, and unique fingerprints to TimescaleDB `alert_timeseries` table.
* **Estimate:** 3 points

### Dependencies

- Story 3.1 depends on Epic 2 for Incident Generation.
- Story 3.3 depends on Epic 5 (Slack Bot) to display the suggestion.

### Technical Notes

- **Infra:** ECS Fargate consuming from the `correlation-request` SQS queue.
- Use PostgreSQL (TimescaleDB) for historical frequency lookups ("how many times has this fired in 7 days?") to inform the score.
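The scoring rules in Stories 3.1 and 3.2 can be sketched as a single pure function. The weights, the input shape, and the omission of the time-of-day factor are illustrative assumptions; only the safelist regex and the below-50 clamp come from the acceptance criteria:

```python
import re

# Default safelist from Story 3.2; matched services never score as pure noise.
SAFELIST = re.compile(r"db|rds|payment|billing", re.IGNORECASE)

def noise_score(service, title, severity, duplicate_count, info_ratio):
    """Rule-based 0-100 noise score. Weights are illustrative, not final."""
    score = 0
    score += min(duplicate_count * 5, 40)  # flapping: repeated fingerprints
    score += int(info_ratio * 30)          # share of info-level alerts
    # Safelist clamp (Story 3.2): force the score below 50 for critical
    # severity or safelisted services/titles, regardless of the pattern.
    if severity == "critical" or SAFELIST.search(service) or SAFELIST.search(title):
        return max(0, min(score, 49))
    return max(0, min(score, 100))
```

Keeping the function pure like this also helps with the cyclomatic-complexity and testability requirements in Story 10.3.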
## Epic 4: CI/CD Correlation

**Description:** Ingests deployment events and correlates them with alert storms. The "killer feature" mandated by the Party Mode board for V1 MVP, answering "did this break right after a deploy?"

### User Stories

**Story 4.1: GitHub Actions Deploy Ingestion**

* **As a** Platform Engineer, **I want** to connect my GitHub Actions deployment webhooks, **so that** dd0c/alert knows exactly when and who deployed to production.
* **Acceptance Criteria:**
  - System exposes `POST /v1/wh/{tenant_id}/github`
  - Validates `X-Hub-Signature-256`.
  - Normalizes GHA workflow run payload into `DeployEvent` canonical schema.
  - Pushes deploy event to SQS FIFO queue (`deploy-event`).
* **Estimate:** 3 points

**Story 4.2: Deploy-to-Alert Correlation**

* **As an** On-Call Engineer, **I want** an alert cluster to be automatically tagged with a recent deployment to that service, **so that** I don't waste 15 minutes checking deploy logs manually.
* **Acceptance Criteria:**
  - When the Correlation Engine opens a window, it queries DynamoDB for deployments to the affected service within a configurable lookback window (default 15m for prod, 30m for staging).
  - If a match is found, the deploy context (`deploy_pr`, `deploy_author`, `source_url`) is attached to the window state.
* **Estimate:** 8 points

**Story 4.3: Deploy-Weighted Noise Scoring**

* **As an** On-Call Engineer, **I want** alerts that are highly correlated with deployments to be scored as more likely to be noise (if they aren't critical), **so that** feature flags and config refreshes don't wake me up.
* **Acceptance Criteria:**
  - If a deploy event is attached to an incident, boost the noise score by 15-30 points.
  - Additional +5 points if the PR title matches `config` or `feature-flag`.
* **Estimate:** 2 points

### Dependencies

- Story 4.2 depends on Epic 2 (Correlation Engine) and Epic 3 (Noise Analysis).
- Service name mapping between GitHub and Datadog/PagerDuty (convention-based string matching).
### Technical Notes

- **Infra:** The Deployment Tracker runs as a module within the Correlation Engine ECS Task to avoid network latency.
- DynamoDB needs a Global Secondary Index (GSI) on `tenant_id` + `service` + `completed_at` to quickly find recent deploys.

## Epic 5: Slack Bot

**Description:** The primary interface for on-call engineers. Delivers correlated incident summaries, observe-only suppression suggestions, and daily alert digests directly into Slack. Provides interactive buttons for engineers to acknowledge or validate suggestions.

### User Stories

**Story 5.1: Incident Summary Notifications**

* **As an** On-Call Engineer, **I want** to receive a single, concise Slack message when an alert storm is correlated, **so that** I don't get flooded with dozens of individual alert notifications.
* **Acceptance Criteria:**
  - Bot sends a formatted Slack Block Kit message to a configured channel.
  - Message groups all related alerts under a single incident title.
  - Displays the total number of correlated alerts, affected services, and start time.
* **Estimate:** 5 points

**Story 5.2: Observe-Only Suppression Suggestions in Slack**

* **As an** On-Call Engineer, **I want** the Slack message to include the system's noise score and suppression recommendation, **so that** I can evaluate its accuracy in real-time.
* **Acceptance Criteria:**
  - If noise score > 80, the message includes a specific "Suggestion" block (e.g., "Would have auto-suppressed: 95% noise score").
  - Includes the plain-English reasoning generated in Epic 3.
* **Estimate:** 3 points

**Story 5.3: Interactive Feedback Actions**

* **As an** On-Call Engineer, **I want** to click "Good Catch" or "Bad Suggestion" on the Slack message, **so that** I can help train the noise analysis engine for future versions.
* **Acceptance Criteria:**
  - Slack message includes interactive buttons for feedback.
  - Clicking a button sends a payload back to dd0c/alert to record the user's validation in the database.
  - Updates the Slack message to acknowledge the feedback.
* **Estimate:** 5 points

**Story 5.4: Daily Alert Digest**

* **As an** Engineering Manager, **I want** a daily summary of the noisiest services and total incidents dropped into Slack, **so that** my team can prioritize technical debt.
* **Acceptance Criteria:**
  - A scheduled job runs daily at 9 AM (configurable timezone).
  - Aggregates the previous 24 hours of data from TimescaleDB.
  - Posts a summary of "Top 3 Noisiest Services" and "Total Time Saved" (estimated) to the channel.
* **Estimate:** 5 points

### Dependencies

- Story 5.1 depends on Epic 2 (Correlation Engine).
- Story 5.2 depends on Epic 3 (Noise Analysis).

### Technical Notes

- **Infra:** AWS Lambda for handling incoming Slack interactions (buttons) via API Gateway.
- Use Slack's Block Kit Builder for UI consistency.
- Requires storing Slack Workspace and Channel tokens securely in AWS Secrets Manager or DynamoDB.

## Epic 6: Dashboard API

**Description:** The backend REST API that powers the dd0c/alert web dashboard. Provides secure endpoints for authentication, querying historical incidents, analyzing alert volume, and managing tenant configuration.

### User Stories

**Story 6.1: Tenant Authentication & Authorization**

* **As a** Platform Engineer, **I want** to securely log in to the dashboard API, **so that** I can manage my organization's alert data safely.
* **Acceptance Criteria:**
  - Implement JWT-based authentication.
  - Enforce tenant isolation on all API endpoints (users can only access data for their `tenant_id`).
* **Estimate:** 5 points

**Story 6.2: Incident Query Endpoints**

* **As an** On-Call Engineer, **I want** to fetch a paginated list of historical incidents and their associated alerts, **so that** I can review past outages.
* **Acceptance Criteria:**
  - `GET /v1/incidents` supports pagination, time-range filtering, and service filtering.
  - `GET /v1/incidents/{incident_id}/alerts` returns the raw alerts correlated into that incident.
* **Estimate:** 5 points

**Story 6.3: Analytics & Noise Score API**

* **As an** Engineering Manager, **I want** to query aggregated metrics about alert noise and volume, **so that** I can populate charts on the dashboard.
* **Acceptance Criteria:**
  - `GET /v1/analytics/noise` returns time-series data of average noise scores per service.
  - Queries TimescaleDB efficiently, using materialized views or continuous aggregates if necessary.
* **Estimate:** 8 points

**Story 6.4: Configuration Management Endpoints**

* **As a** Platform Engineer, **I want** to manage my integration webhooks and routing rules via API, **so that** I can script my onboarding or use the UI.
* **Acceptance Criteria:**
  - CRUD endpoints for managing Slack channel destinations.
  - Endpoints to generate and rotate inbound webhook secrets for Datadog/PagerDuty.
* **Estimate:** 3 points

### Dependencies

- Stories 6.2 and 6.3 depend on the TimescaleDB schema and data from Epics 2 and 3.

### Technical Notes

- **Infra:** API Gateway HTTP API -> AWS Lambda (Node.js/Go).
- Strict validation middleware required for tenant isolation.
- Use the standard OpenAPI 3.0 specification for documentation.

## Epic 7: Dashboard UI

**Description:** The React Single Page Application (SPA) for dd0c/alert. Gives users a visual interface to view the incident timeline, inspect alert correlation details, and understand the noise scoring.

### User Stories

**Story 7.1: Incident Timeline View**

* **As an** On-Call Engineer, **I want** a main feed showing all correlated incidents chronologically, **so that** I can see the current state of my systems at a glance.
* **Acceptance Criteria:**
  - React SPA fetches and displays data from `GET /v1/incidents`.
  - Visual distinction between high-noise (suggested suppressed) and low-noise (critical) incidents.
  - Real-time updates or auto-refresh every 30 seconds.
* **Estimate:** 8 points

**Story 7.2: Alert Correlation Visualizer**

* **As an** On-Call Engineer, **I want** to click on an incident and see exactly which alerts were grouped together, **so that** I understand why the engine correlated them.
* **Acceptance Criteria:**
  - Detail pane showing the timeline of individual alerts within the incident window.
  - Displays the deployment context (Epic 4) if applicable.
* **Estimate:** 5 points

**Story 7.3: Noise Score Breakdown**

* **As a** Platform Engineer, **I want** to see the exact factors that contributed to an incident's noise score, **so that** I can trust the engine's reasoning.
* **Acceptance Criteria:**
  - UI component displaying the 0-100 noise score gauge.
  - Lists the bulleted reasoning (e.g., "+20 points: Occurred 10 times this week", "+15 points: Recent deployment").
* **Estimate:** 3 points

**Story 7.4: Analytics Dashboard**

* **As an** Engineering Manager, **I want** charts showing alert volume and noise trends over the last 30 days, **so that** I can track improvements in our alert hygiene.
* **Acceptance Criteria:**
  - Integrates a charting library (e.g., Recharts or Chart.js).
  - Displays a bar chart of total alerts vs. correlated incidents to show "noise reduction" value.
* **Estimate:** 5 points

### Dependencies

- Depends entirely on Epic 6 (Dashboard API).

### Technical Notes

- **Infra:** Hosted on AWS S3 + CloudFront or Vercel.
- Framework: React (Next.js or Vite).
- Tailwind CSS for rapid styling.

## Epic 8: Infrastructure & DevOps

**Description:** The foundational cloud infrastructure and deployment pipelines necessary to run dd0c/alert reliably, securely, and with observability.

### User Stories

**Story 8.1: Infrastructure as Code (IaC)**

* **As a** Developer, **I want** all AWS resources defined in code, **so that** I can easily spin up staging and production environments identically.
* **Acceptance Criteria:**
  - Terraform or AWS CDK defines the VPC, API Gateway, Lambda functions, ECS Fargate clusters, SQS queues, and DynamoDB tables.
  - State is stored securely in an S3 backend with DynamoDB locking.
* **Estimate:** 8 points

**Story 8.2: CI/CD Pipelines**

* **As a** Developer, **I want** automated testing and deployment when I push to main, **so that** I can ship features quickly without manual steps.
* **Acceptance Criteria:**
  - GitHub Actions workflow runs unit tests and linters on PRs.
  - Merges to `main` trigger a deployment to the staging environment, followed by a manual approval for production.
* **Estimate:** 5 points

**Story 8.3: System Monitoring & Logging**

* **As a** System Admin, **I want** central logging and metrics for the dd0c/alert services, **so that** I can debug issues when the platform itself fails.
* **Acceptance Criteria:**
  - All Lambda and ECS logs route to CloudWatch Logs.
  - CloudWatch Alarms configured for API 5xx errors and SQS Dead Letter Queue (DLQ) messages.
* **Estimate:** 3 points

**Story 8.4: Database Provisioning (Timescale & Redis)**

* **As a** Database Admin, **I want** managed, highly available instances for TimescaleDB and Redis, **so that** the correlation engine runs with low latency and durable storage.
* **Acceptance Criteria:**
  - Provisions AWS ElastiCache for Redis (for active window state).
  - Provisions RDS for PostgreSQL with the TimescaleDB extension, or uses Timescale Cloud.
* **Estimate:** 5 points

### Dependencies

- Blocked by architectural decisions being finalized.
- Blocks Epics 1, 2, and 3 from being deployed to production.

### Technical Notes

- Optimize for a solo founder: keep infrastructure simple, preferring managed services over self-hosted.
- Ensure appropriate IAM roles with least-privilege access between Lambda/ECS and DynamoDB/SQS.

## Epic 9: Onboarding & PLG

**Description:** Product-Led Growth and the critical 60-second time-to-value flow.
Ensures a frictionless setup experience for new users to connect their monitoring tools and Slack workspace immediately.

### User Stories

**Story 9.1: Frictionless Sign-Up**

* **As a** New User, **I want** to sign up using my GitHub or Google account, **so that** I don't have to create and remember a new password.
* **Acceptance Criteria:**
  - Implement OAuth2 login (GitHub/Google).
  - Automatically provisions a new `tenant_id` and default configuration upon successful first login.
* **Estimate:** 5 points

**Story 9.2: Webhook Setup Wizard**

* **As a** New User, **I want** a step-by-step wizard to configure my Datadog or PagerDuty webhooks, **so that** I can start sending data to dd0c/alert immediately.
* **Acceptance Criteria:**
  - UI wizard provides copy-paste-ready webhook URLs and secrets.
  - Includes a "Waiting for first payload..." state that updates in real-time (via WebSockets or polling) when the first alert arrives.
* **Estimate:** 8 points

**Story 9.3: Slack App Installation Flow**

* **As a** New User, **I want** a 1-click "Add to Slack" button, **so that** I can authorize dd0c/alert to post in my incident channels.
* **Acceptance Criteria:**
  - Implements the standard Slack OAuth v2 flow.
  - Allows the user to select the default channel for incident summaries.
* **Estimate:** 5 points

**Story 9.4: Free Tier Limitations**

* **As a** Product Owner, **I want** a free tier that limits the number of processed alerts or the retention period, **so that** users can try the product without me incurring massive AWS costs.
* **Acceptance Criteria:**
  - Free-tier limits enforced at the ingestion API (e.g., max 10,000 alerts/month).
  - UI displays a usage quota bar.
  - Data retention in TimescaleDB automatically purged after 7 days for free-tier tenants.
* **Estimate:** 5 points

### Dependencies

- Depends on Epic 6 (Dashboard API) and Epic 7 (Dashboard UI).
- Story 9.2 depends on Epic 1 (Webhook Ingestion) being live.
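The ingestion-side quota check from Story 9.4 is small enough to sketch. The in-memory counter here is a stand-in for a DynamoDB or Redis counter keyed by tenant and month; the key format and the 429 response are assumptions, not final design:

```python
from datetime import datetime, timezone

FREE_TIER_MONTHLY_LIMIT = 10_000  # from Story 9.4 acceptance criteria

# Stand-in for a DynamoDB/Redis counter; key format is an assumption.
_usage: dict = {}

def try_consume_quota(tenant_id, limit=FREE_TIER_MONTHLY_LIMIT):
    """Increment this month's counter; return False once the quota is spent."""
    month = datetime.now(timezone.utc).strftime("%Y-%m")
    key = f"{tenant_id}:{month}"
    if _usage.get(key, 0) >= limit:
        return False  # the ingestion endpoint would respond 429 here
    _usage[key] = _usage.get(key, 0) + 1
    return True
```

Because the counter resets with the month key, no scheduled cleanup job is needed for the quota itself; the 7-day TimescaleDB retention purge remains a separate concern.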
### Technical Notes

- Use Auth0, Clerk, or AWS Cognito to minimize authentication development time for the Solo Founder.
- The real-time "Waiting for payload" state can be implemented via a lightweight polling endpoint if WebSockets add too much complexity.

---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/alert adheres to the 5 Transparent Factory tenets. For an alert intelligence platform, Semantic Observability is paramount — a tool that reasons about alerts must make its own reasoning fully transparent.

### Story 10.1: Atomic Flagging — Feature Flags for Correlation & Scoring Rules

**As a** solo founder, **I want** every new correlation rule, noise scoring algorithm, and suppression behavior behind a feature flag (default: off), **so that** a bad scoring change doesn't silence critical alerts in production.

**Acceptance Criteria:**
- OpenFeature SDK integrated into the alert processing pipeline. V1: env-var or JSON file provider.
- All flags evaluate locally — no network calls in the alert ingestion hot path.
- Every flag has `owner` and `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule suppresses >2x the baseline alert volume over 30 minutes, the flag auto-disables and all suppressed alerts are re-emitted.
- Flags required for: new correlation patterns, CI/CD deployment correlation, noise scoring thresholds, notification channel routing.

**Estimate:** 5 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- The circuit breaker is critical here — a bad suppression rule is worse than no suppression. Track suppression counts per flag in Redis with a 30-minute sliding window.
- Re-emission: suppressed alerts are buffered in a dead-letter queue for 1 hour. On circuit break, replay the queue.
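The circuit breaker above can be sketched as a sliding window of suppression timestamps. Production would keep the timestamps in Redis (for example, a sorted set trimmed by score); an in-memory deque is used here for illustration, and the class name is hypothetical:

```python
import time
from collections import deque

WINDOW_SECONDS = 30 * 60  # 30-minute sliding window from the AC

class SuppressionBreaker:
    """Per-flag circuit breaker (Story 10.1): trip at >2x baseline volume."""

    def __init__(self, baseline_per_window):
        self.baseline = baseline_per_window
        self.events = deque()   # suppression timestamps, oldest first
        self.tripped = False

    def record_suppression(self, now=None):
        now = time.time() if now is None else now
        self.events.append(now)
        # Drop suppressions that fell out of the 30-minute window.
        while self.events and self.events[0] < now - WINDOW_SECONDS:
            self.events.popleft()
        # Auto-disable the flag once volume exceeds 2x the baseline;
        # the caller then replays the buffered alerts from the DLQ.
        if len(self.events) > 2 * self.baseline:
            self.tripped = True
```

The `baseline_per_window` value would come from historical suppression counts for the same flag; how that baseline is measured is left open here.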
### Story 10.2: Elastic Schema — Additive-Only for Alert Event Store

**As a** solo founder, **I want** all alert event schema changes to be strictly additive, **so that** historical alert correlation data remains queryable after any deployment.

**Acceptance Criteria:**
- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes. Old fields remain readable.
- All event parsers configured to ignore unknown fields (Pydantic `model_config = {"extra": "ignore"}` or equivalent).
- Dual-write during migration windows within the same DB transaction.
- Every migration includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.

**Estimate:** 3 points
**Dependencies:** Epic 3 (Noise Analysis; owns the TimescaleDB event store)
**Technical Notes:**
- Alert events are append-only by nature — leverage this. Never mutate historical events.
- For correlation metadata (enrichments added post-ingestion), store separate linked records rather than mutating the original event.
- TimescaleDB compression policies must handle both V1 and V2 column layouts.

### Story 10.3: Cognitive Durability — Decision Logs for Scoring Logic

**As a** future maintainer, **I want** every change to noise scoring weights, correlation rules, or suppression thresholds accompanied by a `decision_log.json`, **so that** I can understand why alert X was classified as noise vs. signal.

**Acceptance Criteria:**
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/correlation/`, or `src/suppression/`.
- A cyclomatic complexity cap of 10 is enforced in CI. Scoring functions must be decomposable and testable.
- Decision logs live in `docs/decisions/`, one per significant logic change.
**Estimate:** 2 points
**Dependencies:** None
**Technical Notes:**
- Scoring weight changes are especially important to document — "why is deployment correlation weighted 0.7 and not 0.5?"
- Include sample alert scenarios in decision logs showing before/after scoring behavior.

### Story 10.4: Semantic Observability — AI Reasoning Spans on Alert Classification

**As an** on-call engineer investigating a missed critical alert, **I want** every alert scoring and correlation decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why an alert was scored as noise when it was actually a P1 incident.

**Acceptance Criteria:**
- Every alert ingestion creates a parent `alert_evaluation` span, with child spans for `noise_scoring`, `correlation_matching`, and `suppression_decision`.
- Span attributes: `alert.source`, `alert.noise_score`, `alert.correlation_matches` (JSON array), `alert.suppressed` (bool), `alert.suppression_reason`.
- If AI-assisted classification is used: `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score`, `ai.reasoning_chain` (summarized).
- CI/CD correlation spans include: `alert.deployment_correlation_score`, `alert.deployment_id`, `alert.time_since_deploy_seconds`.
- No PII in spans. Alert payloads are hashed for correlation, not logged raw.

**Estimate:** 3 points
**Dependencies:** Epic 2 (Correlation Engine)
**Technical Notes:**
- This is the most important tenet for dd0c/alert. If the tool suppresses an alert, the reasoning MUST be traceable.
- Use `opentelemetry-python` with the OTLP exporter. Batch span export to avoid per-alert overhead.
- For V1 without AI: `alert.suppression_reason` is the rule name + threshold. When AI scoring is added, the full reasoning chain is captured.
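The span attributes from the acceptance criteria can be assembled before emission. This sketch builds only the attribute dict and the payload hash; the actual `opentelemetry` span creation is omitted to keep the snippet dependency-free, and `alert.payload_hash` is a hypothetical attribute name:

```python
import hashlib
import json

def evaluation_span_attributes(alert, noise_score, matches, suppressed, reason):
    """Attributes for the parent `alert_evaluation` span (Story 10.4).

    The raw payload is hashed, never logged, keeping PII out of spans.
    """
    payload_hash = hashlib.sha256(
        json.dumps(alert, sort_keys=True).encode()
    ).hexdigest()
    return {
        "alert.source": alert.get("provider", "unknown"),
        "alert.payload_hash": payload_hash,       # hypothetical attribute
        "alert.noise_score": noise_score,
        "alert.correlation_matches": json.dumps(matches),
        "alert.suppressed": suppressed,
        "alert.suppression_reason": reason,       # V1: rule name + threshold
    }
```

In production these attributes would be set on a span started via `tracer.start_as_current_span("alert_evaluation")`, with the child spans nested inside it.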
### Story 10.5: Configurable Autonomy — Governance for Alert Suppression

**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/alert can auto-suppress alerts or only annotate them, **so that** customers never lose visibility into their alerts without explicit opt-in.

**Acceptance Criteria:**
- `policy.json` defines `governance_mode`: `strict` (annotate-only, never suppress) or `audit` (auto-suppress with full logging).
- Default for all new customers: `strict`. Suppression requires explicit opt-in.
- `panic_mode`: when true, all suppression stops immediately. Every alert passes through unmodified. A "panic active" banner appears in the dashboard.
- Per-customer governance override: customers can only be MORE restrictive than the system default.
- All policy decisions are logged with full context: "Alert X suppressed by audit mode, rule Y, score Z" or "Alert X annotation-only, strict mode active".

**Estimate:** 3 points
**Dependencies:** Epic 5 (Slack Bot notification path)
**Technical Notes:**
- `strict` mode is the safe default — dd0c/alert adds value even without suppression by annotating alerts with correlation data and noise scores.
- Panic mode: a single Redis key, `dd0c:panic`. All suppression checks short-circuit on this key. Triggerable via `POST /admin/panic` or an env var.
- Customer override: stored in org settings. Merge: `max_restrictive(system, customer)`.

### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |
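As a closing sketch for Story 10.5, the `max_restrictive(system, customer)` merge and the panic short-circuit might look like the following, where the restrictiveness ordering (`strict` over `audit`) is an assumption drawn from the mode definitions:

```python
# Assumed ordering: strict (annotate-only) is more restrictive than audit.
RESTRICTIVENESS = {"audit": 0, "strict": 1}

def effective_mode(system_mode, customer_mode):
    """max_restrictive(system, customer) from the Story 10.5 tech notes:
    a customer override can only ever make the policy MORE restrictive."""
    return max(system_mode, customer_mode, key=RESTRICTIVENESS.__getitem__)

def may_suppress(mode, panic_active):
    """Panic mode halts all suppression regardless of governance mode.

    In production, `panic_active` would be read from the dd0c:panic Redis key.
    """
    if panic_active:
        return False
    return mode == "audit"
```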