Max Mayfield 5ee95d8b13 dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
        product-brief, architecture, epics (incl. Epic 10 TF compliance),
        test-architecture (TDD strategy)

Brand strategy and market research included.
2026-02-28 17:35:02 +00:00


dd0c/alert — V1 MVP Epics

Product: dd0c/alert (Alert Intelligence Platform)
Phase: 7 — Epics & Stories


Epic 1: Webhook Ingestion

Description: The front door of dd0c/alert. Responsible for receiving alert payloads from monitoring providers via webhooks, validating their authenticity, normalizing them into a canonical schema, and queuing them securely for the Correlation Engine. Must support high burst volume (incident storms) and guarantee zero dropped payloads.

User Stories

Story 1.1: Datadog Webhook Ingestion

  • As a Platform Engineer, I want to send Datadog webhooks to a unique dd0c URL, so that my Datadog alerts enter the correlation pipeline.
  • Acceptance Criteria:
    • System exposes POST /v1/wh/{tenant_id}/datadog
    • Normalizes Datadog JSON (handles arrays/batched alerts) into the Canonical Alert Schema.
    • Normalizes Datadog P1-P5 severities into critical/high/medium/low/info.
  • Estimate: 3 points
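
The severity normalization above can be sketched as a lookup table. The exact P-level mapping below is an assumption for illustration; the brief only fixes the two ends of the scale:

```python
# Illustrative Datadog P1-P5 -> canonical severity mapping (assumed, not from the brief).
DATADOG_SEVERITY = {
    "P1": "critical",
    "P2": "high",
    "P3": "medium",
    "P4": "low",
    "P5": "info",
}

def normalize_severity(raw: str) -> str:
    # Unknown or missing priorities fall back to "info" rather than failing ingestion.
    return DATADOG_SEVERITY.get(raw.upper().strip(), "info")
```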

Story 1.2: PagerDuty Webhook Ingestion

  • As a Platform Engineer, I want to send PagerDuty v3 webhooks to dd0c, so that my PD incidents are tracked.
  • Acceptance Criteria:
    • System exposes POST /v1/wh/{tenant_id}/pagerduty
    • Normalizes PagerDuty JSON into the Canonical Alert Schema.
  • Estimate: 3 points

Story 1.3: HMAC Signature Validation

  • As a Security Admin, I want all incoming webhooks to have their HMAC signatures validated, so that bad actors cannot inject fake alerts.
  • Acceptance Criteria:
    • Rejects payloads with missing or invalid DD-WEBHOOK-SIGNATURE or X-PagerDuty-Signature headers with 401 Unauthorized.
    • Compares against the integration secret stored in DynamoDB/Secrets Manager.
  • Estimate: 3 points
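
A minimal validation sketch for the HMAC check, assuming a hex-encoded SHA-256 digest (each provider documents its own signing scheme, so the digest format here is an assumption):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, header_sig: str) -> bool:
    """Return True only if the HMAC-SHA256 hex digest of the body matches the header.

    compare_digest runs in constant time, which avoids timing side channels.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_sig)
```

Requests failing this check would be rejected with 401 Unauthorized before any normalization work happens.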

Story 1.4: Payload Normalization & Deduplication (Fingerprinting)

  • As an On-Call Engineer, I want identical alerts to be deterministically fingerprinted, so that flapping or duplicated payloads are instantly recognized.
  • Acceptance Criteria:
    • Generates a SHA-256 fingerprint based on tenant_id + provider + service + normalized_title.
    • Pushes canonical alert to SQS FIFO queue with MessageGroupId=tenant_id.
    • Saves raw payload asynchronously to S3 for audit/replay.
  • Estimate: 5 points
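
The deterministic fingerprint could look like the sketch below; the delimiter and title canonicalization (lowercase, trimmed) are assumptions, the field list is from the acceptance criteria:

```python
import hashlib

def alert_fingerprint(tenant_id: str, provider: str,
                      service: str, normalized_title: str) -> str:
    # A delimiter prevents ("ab", "c") and ("a", "bc") from colliding.
    material = "\x1f".join(
        (tenant_id, provider, service, normalized_title.lower().strip())
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```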

Dependencies

  • Story 1.3 depends on Tenant/Integration configuration existing (Epic 9).
  • Story 1.4 depends on Canonical Alert Schema definition.

Technical Notes

  • Infra: API Gateway HTTP API -> Lambda -> SQS FIFO.
  • Lambda must return 200 OK to the provider in <100ms. S3 raw payload storage must be non-blocking (async).
  • Use ULIDs for alert_id for time-sortability.

Epic 2: Correlation Engine

Description: The intelligence core. Consumes the normalized SQS FIFO queue, groups alerts based on time windows and service dependencies, and outputs correlated incidents.

User Stories

Story 2.1: Time-Window Clustering

  • As an On-Call Engineer, I want alerts firing within a brief time window for the same service to be grouped together, so that I don't get paged 10 times for one failure.
  • Acceptance Criteria:
    • Opens a 5-minute (configurable) correlation window in Redis when a new alert fingerprint arrives.
    • Groups subsequent alerts for the same tenant/service into the active window.
    • Stores the correlation state in ElastiCache Redis.
  • Estimate: 5 points

Story 2.2: Cascading Failure Correlation (Service Graph)

  • As an On-Call Engineer, I want cascading failures across dependent services to be merged into a single incident, so that I can see the blast radius of an issue.
  • Acceptance Criteria:
    • Reads explicit service dependencies from DynamoDB (upstream -> downstream).
    • If a window is open for an upstream service, downstream service alerts are merged into the same window.
  • Estimate: 8 points

Story 2.3: Active Window Extension

  • As an On-Call Engineer, I want the correlation window to automatically extend if alerts are still trickling in, so that long-running, cascading incidents are correctly grouped.
  • Acceptance Criteria:
    • If a new alert arrives within the last 30 seconds of a window, the window extends by 2 minutes (max 15 minutes).
    • Updates the closes_at timestamp in Redis.
  • Estimate: 3 points
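
The extension rule above reduces to a small pure function over epoch timestamps (the same values the Redis sorted-set scores would hold):

```python
def extend_window(opened_at: float, closes_at: float, now: float,
                  tail: float = 30.0, bump: float = 120.0,
                  max_len: float = 900.0) -> float:
    """Return the (possibly extended) close time for an active window.

    Extends by `bump` only when the alert lands in the final `tail` seconds,
    and never past `opened_at + max_len` (the 15-minute hard cap).
    """
    if closes_at - tail <= now < closes_at:
        return min(closes_at + bump, opened_at + max_len)
    return closes_at
```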

Story 2.4: Incident Generation & Persistence

  • As an On-Call Engineer, I want completed time windows to be saved as durable incidents, so that I have a permanent record of the correlated event.
  • Acceptance Criteria:
    • When a window closes, it generates an Incident record in DynamoDB.
    • Generates an event in TimescaleDB for trend tracking.
    • Pushes a correlation-request to the Suggestion Engine SQS queue.
  • Estimate: 5 points

Dependencies

  • Story 2.1 depends on Epic 1 (normalized SQS queue).
  • Story 2.2 depends on a basic service dependency mapping (either config or API).

Technical Notes

  • Infra: ECS Fargate consuming SQS FIFO.
  • Must use Redis Sorted Sets for active window management (closes_at_epoch as score).
  • The correlation engine must be stateless (relying on Redis) so it can scale horizontally to handle incident storms.

Epic 3: Noise Analysis

Description: The Suggestion Engine. Calculates a noise score (0-100) for correlated incidents and generates observe-only suppression suggestions. It strictly adheres to V1 constraints by never taking auto-action.

User Stories

Story 3.1: Rule-Based Noise Scoring

  • As an On-Call Engineer, I want every incident to receive a noise score based on objective data points, so that I have a metric to understand if this incident is likely a false positive.
  • Acceptance Criteria:
    • Calculates a 0-100 noise score when an incident is generated.
    • Scores based on duplicate fingerprints (flapping), severity distribution (info vs critical), and time of day.
    • Cap at 100, floor at 0.
  • Estimate: 5 points

Story 3.2: "Never Suppress" Safelist Execution

  • As a Platform Engineer, I want critical services (databases, billing) to be excluded from high noise scoring regardless of pattern, so that I never miss a genuine P1.
  • Acceptance Criteria:
    • Implements a default safelist regex (e.g., db|rds|payment|billing).
    • Forces the noise score below 50 if the service or title matches the safelist, or if severity is critical.
  • Estimate: 3 points
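
Stories 3.1 and 3.2 combine into a scoring sketch like the following; the individual weights are illustrative assumptions, while the clamp, the safelist regex, and the below-50 rule come from the acceptance criteria:

```python
import re

SAFELIST = re.compile(r"db|rds|payment|billing", re.IGNORECASE)

def noise_score(duplicate_count: int, info_ratio: float, off_hours: bool,
                service: str, title: str, severity: str) -> int:
    """Rule-based 0-100 noise score; weight values here are illustrative only."""
    score = min(duplicate_count * 8, 40)   # flapping fingerprints
    score += round(info_ratio * 30)        # mostly-info storms look noisy
    score += 10 if off_hours else 0        # assumed time-of-day weight
    score = max(0, min(100, score))        # cap at 100, floor at 0
    # Safelist rule: critical severity or protected services stay below 50.
    if severity == "critical" or SAFELIST.search(service) or SAFELIST.search(title):
        score = min(score, 49)
    return score
```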

Story 3.3: Observe-Only Suppression Suggestions

  • As an On-Call Engineer, I want the system to tell me what it would have suppressed, so that I can build trust in its intelligence without risking an outage.
  • Acceptance Criteria:
    • If a noise score > 80, the system generates a suppress suggestion record in DynamoDB.
    • Generates plain-English reasoning for the suggestion (e.g., "This pattern was resolved automatically 4 times this month.").
    • action_taken is always hardcoded to none for V1.
  • Estimate: 5 points

Story 3.4: Incident Scoring Metrics Collection

  • As an Engineering Manager, I want the noise scores and counts to be stored as time-series data, so that I can view trends in our alert hygiene over time.
  • Acceptance Criteria:
    • Writes noise score, alert counts, and unique fingerprints to TimescaleDB alert_timeseries table.
  • Estimate: 3 points

Dependencies

  • Story 3.1 depends on Epic 2 for Incident Generation.
  • Story 3.3 depends on Epic 5 (Slack Bot) to display the suggestion.

Technical Notes

  • Infra: ECS Fargate consuming from the correlation-request SQS queue.
  • Use PostgreSQL (TimescaleDB) for historical frequency lookups ("how many times has this fired in 7 days?") to inform the score.

Epic 4: CI/CD Correlation

Description: Ingests deployment events and correlates them with alert storms. The "killer feature" mandated by the Party Mode board for V1 MVP, answering "did this break right after a deploy?"

User Stories

Story 4.1: GitHub Actions Deploy Ingestion

  • As a Platform Engineer, I want to connect my GitHub Actions deployment webhooks, so that dd0c/alert knows exactly when and who deployed to production.
  • Acceptance Criteria:
    • System exposes POST /v1/wh/{tenant_id}/github
    • Validates X-Hub-Signature-256.
    • Normalizes GHA workflow run payload into DeployEvent canonical schema.
    • Pushes deploy event to SQS FIFO queue (deploy-event).
  • Estimate: 3 points

Story 4.2: Deploy-to-Alert Correlation

  • As an On-Call Engineer, I want an alert cluster to be automatically tagged with a recent deployment to that service, so that I don't waste 15 minutes checking deploy logs manually.
  • Acceptance Criteria:
    • When the Correlation Engine opens a window, it queries DynamoDB for deployments to the affected service within a configurable lookback window (default 15m for prod, 30m for staging).
    • If a match is found, the deploy context (deploy_pr, deploy_author, source_url) is attached to the window state.
  • Estimate: 8 points

Story 4.3: Deploy-Weighted Noise Scoring

  • As an On-Call Engineer, I want alerts that are highly correlated with deployments to be scored as more likely to be noise (if they aren't critical), so that feature flags and config refreshes don't wake me up.
  • Acceptance Criteria:
    • If a deploy event is attached to an incident, boost the noise score by 15-30 points.
    • Additional +5 points if the PR title matches config or feature-flag.
  • Estimate: 2 points
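
A sketch of the deploy-weighted boost; how the 15-30 point range splits by deploy recency is an assumption, and the critical-severity guard mirrors the Epic 3 safelist rule:

```python
import re

CONFIG_PR = re.compile(r"config|feature-flag", re.IGNORECASE)

def deploy_weighted_score(base_score: int, minutes_since_deploy: float,
                          pr_title: str, severity: str) -> int:
    """Boost a noise score when a recent deploy is attached (weights illustrative)."""
    if severity == "critical":
        return base_score  # never push critical alerts toward suppression
    boost = 30 if minutes_since_deploy <= 5 else 15  # assumed split of 15-30 range
    if CONFIG_PR.search(pr_title):
        boost += 5
    return min(base_score + boost, 100)
```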

Dependencies

  • Story 4.2 depends on Epic 2 (Correlation Engine) and Epic 3 (Noise Analysis).
  • Service name mapping between GitHub and Datadog/PagerDuty (convention-based string matching).

Technical Notes

  • Infra: The Deployment Tracker runs as a module within the Correlation Engine ECS Task to avoid network latency.
  • DynamoDB needs a Global Secondary Index (GSI): tenant_id + service + completed_at to quickly find recent deploys.

Epic 5: Slack Bot

Description: The primary interface for on-call engineers. Delivers correlated incident summaries, observe-only suppression suggestions, and daily alert digests directly into Slack. Provides interactive buttons for engineers to acknowledge or validate suggestions.

User Stories

Story 5.1: Incident Summary Notifications

  • As an On-Call Engineer, I want to receive a single, concise Slack message when an alert storm is correlated, so that I don't get flooded with dozens of individual alert notifications.
  • Acceptance Criteria:
    • Bot sends a formatted Slack Block Kit message to a configured channel.
    • Message groups all related alerts under a single incident title.
    • Displays the total number of correlated alerts, affected services, and start time.
  • Estimate: 5 points
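
A minimal summary message might be assembled like this; the block shapes follow Slack's Block Kit format, while the `incident` field names are hypothetical:

```python
def incident_blocks(incident: dict) -> list[dict]:
    # `incident` keys (title, alert_count, services, started_at) are illustrative.
    return [
        {"type": "header",
         "text": {"type": "plain_text", "text": incident["title"]}},
        {"type": "section",
         "text": {"type": "mrkdwn", "text": (
             f"*{incident['alert_count']} correlated alerts* | "
             f"Services: {', '.join(incident['services'])} | "
             f"Started: {incident['started_at']}"
         )}},
    ]
```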

Story 5.2: Observe-Only Suppression Suggestions in Slack

  • As an On-Call Engineer, I want the Slack message to include the system's noise score and suppression recommendation, so that I can evaluate its accuracy in real-time.
  • Acceptance Criteria:
    • If noise score > 80, the message includes a specific "Suggestion" block (e.g., "Would have auto-suppressed: 95% noise score").
    • Includes the plain-English reasoning generated in Epic 3.
  • Estimate: 3 points

Story 5.3: Interactive Feedback Actions

  • As an On-Call Engineer, I want to click "Good Catch" or "Bad Suggestion" on the Slack message, so that I can help train the noise analysis engine for future versions.
  • Acceptance Criteria:
    • Slack message includes interactive buttons for feedback.
    • Clicking a button sends a payload back to dd0c/alert to record the user's validation in the database.
    • Updates the Slack message to acknowledge the feedback.
  • Estimate: 5 points

Story 5.4: Daily Alert Digest

  • As an Engineering Manager, I want a daily summary of the noisiest services and total incidents dropped into Slack, so that my team can prioritize technical debt.
  • Acceptance Criteria:
    • A scheduled job runs daily at 9 AM (configurable timezone).
    • Aggregates the previous 24 hours of data from TimescaleDB.
    • Posts a summary of "Top 3 Noisiest Services" and "Total Time Saved" (estimated) to the channel.
  • Estimate: 5 points

Dependencies

  • Story 5.1 depends on Epic 2 (Correlation Engine).
  • Story 5.2 depends on Epic 3 (Noise Analysis).

Technical Notes

  • Infra: AWS Lambda for handling incoming Slack interactions (buttons) via API Gateway.
  • Use Slack's Block Kit Builder for UI consistency.
  • Requires storing Slack Workspace and Channel tokens securely in AWS Secrets Manager or DynamoDB.

Epic 6: Dashboard API

Description: The backend REST API that powers the dd0c/alert web dashboard. Provides secure endpoints for authentication, querying historical incidents, analyzing alert volume, and managing tenant configuration.

User Stories

Story 6.1: Tenant Authentication & Authorization

  • As a Platform Engineer, I want to securely log in to the dashboard API, so that I can manage my organization's alert data safely.
  • Acceptance Criteria:
    • Implement JWT-based authentication.
    • Enforce tenant isolation on all API endpoints (users can only access data for their tenant_id).
  • Estimate: 5 points

Story 6.2: Incident Query Endpoints

  • As an On-Call Engineer, I want to fetch a paginated list of historical incidents and their associated alerts, so that I can review past outages.
  • Acceptance Criteria:
    • GET /v1/incidents supports pagination, time-range filtering, and service filtering.
    • GET /v1/incidents/{incident_id}/alerts returns the raw alerts correlated into that incident.
  • Estimate: 5 points

Story 6.3: Analytics & Noise Score API

  • As an Engineering Manager, I want to query aggregated metrics about alert noise and volume, so that I can populate charts on the dashboard.
  • Acceptance Criteria:
    • GET /v1/analytics/noise returns time-series data of average noise scores per service.
    • Queries TimescaleDB efficiently using materialized views or continuous aggregates if necessary.
  • Estimate: 8 points

Story 6.4: Configuration Management Endpoints

  • As a Platform Engineer, I want to manage my integration webhooks and routing rules via API, so that I can script my onboarding or use the UI.
  • Acceptance Criteria:
    • CRUD endpoints for managing Slack channel destinations.
    • Endpoints to generate and rotate inbound webhook secrets for Datadog/PagerDuty.
  • Estimate: 3 points

Dependencies

  • Story 6.2 and 6.3 depend on TimescaleDB schema and data from Epics 2 and 3.

Technical Notes

  • Infra: API Gateway HTTP API -> AWS Lambda (Node.js/Go).
  • Strict validation middleware required for tenant isolation.
  • Use standard OpenAPI 3.0 specification for documentation.

Epic 7: Dashboard UI

Description: The React Single Page Application (SPA) for dd0c/alert. Gives users a visual interface to view the incident timeline, inspect alert correlation details, and understand the noise scoring.

User Stories

Story 7.1: Incident Timeline View

  • As an On-Call Engineer, I want a main feed showing all correlated incidents chronologically, so that I can see the current state of my systems at a glance.
  • Acceptance Criteria:
    • React SPA fetches and displays data from GET /v1/incidents.
    • Visual distinction between high-noise (suggested suppressed) and low-noise (critical) incidents.
    • Real-time updates or auto-refresh every 30 seconds.
  • Estimate: 8 points

Story 7.2: Alert Correlation Visualizer

  • As an On-Call Engineer, I want to click on an incident and see exactly which alerts were grouped together, so that I understand why the engine correlated them.
  • Acceptance Criteria:
    • Detail pane showing the timeline of individual alerts within the incident window.
    • Displays the deployment context (Epic 4) if applicable.
  • Estimate: 5 points

Story 7.3: Noise Score Breakdown

  • As a Platform Engineer, I want to see the exact factors that contributed to an incident's noise score, so that I can trust the engine's reasoning.
  • Acceptance Criteria:
    • UI component displaying the 0-100 noise score gauge.
    • Lists the bulleted reasoning (e.g., "+20 points: Occurred 10 times this week", "+15 points: Recent deployment").
  • Estimate: 3 points

Story 7.4: Analytics Dashboard

  • As an Engineering Manager, I want charts showing alert volume and noise trends over the last 30 days, so that I can track improvements in our alert hygiene.
  • Acceptance Criteria:
    • Integrates a charting library (e.g., Recharts or Chart.js).
    • Displays a bar chart of total alerts vs. correlated incidents to show "noise reduction" value.
  • Estimate: 5 points

Dependencies

  • Depends entirely on Epic 6 (Dashboard API).

Technical Notes

  • Infra: Hosted on AWS S3 + CloudFront or Vercel.
  • Framework: React (Next.js or Vite).
  • Tailwind CSS for rapid styling.

Epic 8: Infrastructure & DevOps

Description: The foundational cloud infrastructure and deployment pipelines necessary to run dd0c/alert reliably, securely, and with observability.

User Stories

Story 8.1: Infrastructure as Code (IaC)

  • As a Developer, I want all AWS resources defined in code, so that I can easily spin up staging and production environments identically.
  • Acceptance Criteria:
    • Terraform or AWS CDK defines VPC, API Gateway, Lambda functions, ECS Fargate clusters, SQS queues, and DynamoDB tables.
    • State is stored securely in an S3 backend with DynamoDB locking.
  • Estimate: 8 points

Story 8.2: CI/CD Pipelines

  • As a Developer, I want automated testing and deployment when I push to main, so that I can ship features quickly without manual steps.
  • Acceptance Criteria:
    • GitHub Actions workflow runs unit tests and linters on PRs.
    • Merges to main trigger a deployment to the staging environment, followed by a manual approval for production.
  • Estimate: 5 points

Story 8.3: System Monitoring & Logging

  • As a System Admin, I want central logging and metrics for the dd0c/alert services, so that I can debug issues when the platform itself fails.
  • Acceptance Criteria:
    • All Lambda and ECS logs route to CloudWatch Logs.
    • CloudWatch Alarms configured for API 5xx errors and SQS Dead Letter Queue (DLQ) messages.
  • Estimate: 3 points

Story 8.4: Database Provisioning (Timescale & Redis)

  • As a Database Admin, I want managed, highly available instances for TimescaleDB and Redis, so that the correlation engine runs with low latency and durable storage.
  • Acceptance Criteria:
    • Provisions AWS ElastiCache for Redis (for active window state).
    • Provisions RDS for PostgreSQL with TimescaleDB extension, or uses Timescale Cloud.
  • Estimate: 5 points

Dependencies

  • Blocked by architectural decisions being finalized.
  • Blocks Epics 1, 2, 3 from being deployed to production.

Technical Notes

  • Optimize for Solo Founder: Keep infrastructure simple. Managed services over self-hosted.
  • Ensure appropriate IAM roles with least privilege access between Lambda/ECS and DynamoDB/SQS.

Epic 9: Onboarding & PLG

Description: Product-Led Growth and the critical 60-second time-to-value flow. Ensures a frictionless setup experience for new users to connect their monitoring tools and Slack workspace immediately.

User Stories

Story 9.1: Frictionless Sign-Up

  • As a New User, I want to sign up using my GitHub or Google account, so that I don't have to create and remember a new password.
  • Acceptance Criteria:
    • Implement OAuth2 login (GitHub/Google).
    • Automatically provisions a new tenant_id and default configuration upon successful first login.
  • Estimate: 5 points

Story 9.2: Webhook Setup Wizard

  • As a New User, I want a step-by-step wizard to configure my Datadog or PagerDuty webhooks, so that I can start sending data to dd0c/alert immediately.
  • Acceptance Criteria:
    • UI wizard provides copy-paste ready webhook URLs and secrets.
    • Includes a "Waiting for first payload..." state that updates in real-time via WebSockets or polling when the first alert arrives.
  • Estimate: 8 points

Story 9.3: Slack App Installation Flow

  • As a New User, I want a 1-click "Add to Slack" button, so that I can authorize dd0c/alert to post in my incident channels.
  • Acceptance Criteria:
    • Implements the standard Slack OAuth v2 flow.
    • Allows the user to select the default channel for incident summaries.
  • Estimate: 5 points

Story 9.4: Free Tier Limitations

  • As a Product Owner, I want a free tier that limits the number of processed alerts or retention period, so that users can try the product without me incurring massive AWS costs.
  • Acceptance Criteria:
    • Free tier limits enforced at the ingestion API (e.g., max 10,000 alerts/month).
    • UI displays a usage quota bar.
    • Data retention in TimescaleDB automatically purged after 7 days for free tier tenants.
  • Estimate: 5 points

Dependencies

  • Depends on Epic 6 (Dashboard API) and Epic 7 (Dashboard UI).
  • Story 9.2 depends on Epic 1 (Webhook Ingestion) being live.

Technical Notes

  • Use Auth0, Clerk, or AWS Cognito to minimize authentication development time for the Solo Founder.
  • Real-time "Waiting for payload" can be implemented via a lightweight polling endpoint if WebSockets add too much complexity.

Epic 10: Transparent Factory Compliance

Description: Cross-cutting epic ensuring dd0c/alert adheres to the 5 Transparent Factory tenets. For an alert intelligence platform, Semantic Observability is paramount — a tool that reasons about alerts must make its own reasoning fully transparent.

Story 10.1: Atomic Flagging — Feature Flags for Correlation & Scoring Rules

As a solo founder, I want every new correlation rule, noise scoring algorithm, and suppression behavior behind a feature flag (default: off), so that a bad scoring change doesn't silence critical alerts in production.

Acceptance Criteria:

  • OpenFeature SDK integrated into the alert processing pipeline. V1: env-var or JSON file provider.
  • All flags evaluate locally — no network calls in the alert ingestion hot path.
  • Every flag has owner and ttl (max 14 days). CI blocks if expired flags remain at 100%.
  • Automated circuit breaker: if a flagged scoring rule suppresses >2x the baseline alert volume over 30 minutes, the flag auto-disables and all suppressed alerts are re-emitted.
  • Flags required for: new correlation patterns, CI/CD deployment correlation, noise scoring thresholds, notification channel routing.

Estimate: 5 points

Dependencies: Epic 2 (Correlation Engine)

Technical Notes:

  • Circuit breaker is critical here — a bad suppression rule is worse than no suppression. Track suppression counts per flag in Redis with 30-min sliding window.
  • Re-emission: suppressed alerts buffered in a dead-letter queue for 1 hour. On circuit break, replay the queue.

Story 10.2: Elastic Schema — Additive-Only for Alert Event Store

As a solo founder, I want all alert event schema changes to be strictly additive, so that historical alert correlation data remains queryable after any deployment.

Acceptance Criteria:

  • CI rejects migrations containing DROP, ALTER ... TYPE, or RENAME on existing columns/attributes.
  • New fields use _v2 suffix for breaking changes. Old fields remain readable.
  • All event parsers configured to ignore unknown fields (Pydantic model_config = {"extra": "ignore"} or equivalent).
  • Dual-write during migration windows within the same DB transaction.
  • Every migration includes sunset_date comment (max 30 days). CI warns on overdue cleanups.

Estimate: 3 points

Dependencies: Epic 3 (Noise Analysis)

Technical Notes:

  • Alert events are append-only by nature — leverage this. Never mutate historical events.
  • For correlation metadata (enrichments added post-ingestion), store as separate linked records rather than mutating the original event.
  • TimescaleDB compression policies must handle both V1 and V2 column layouts.

Story 10.3: Cognitive Durability — Decision Logs for Scoring Logic

As a future maintainer, I want every change to noise scoring weights, correlation rules, or suppression thresholds accompanied by a decision_log.json, so that I can understand why alert X was classified as noise vs. signal.

Acceptance Criteria:

  • decision_log.json schema: { prompt, reasoning, alternatives_considered, confidence, timestamp, author }.
  • CI requires a decision log for PRs touching src/scoring/, src/correlation/, or src/suppression/.
  • Cyclomatic complexity cap of 10 enforced in CI. Scoring functions must be decomposable and testable.
  • Decision logs in docs/decisions/, one per significant logic change.

Estimate: 2 points

Dependencies: None

Technical Notes:

  • Scoring weight changes are especially important to document — "why is deployment correlation weighted 0.7 and not 0.5?"
  • Include sample alert scenarios in decision logs showing before/after scoring behavior.

Story 10.4: Semantic Observability — AI Reasoning Spans on Alert Classification

As an on-call engineer investigating a missed critical alert, I want every alert scoring and correlation decision to emit an OpenTelemetry span with full reasoning metadata, so that I can trace exactly why an alert was scored as noise when it was actually a P1 incident.

Acceptance Criteria:

  • Every alert ingestion creates a parent alert_evaluation span. Child spans for noise_scoring, correlation_matching, and suppression_decision.
  • Span attributes: alert.source, alert.noise_score, alert.correlation_matches (JSON array), alert.suppressed (bool), alert.suppression_reason.
  • If AI-assisted classification is used: ai.prompt_hash, ai.model_version, ai.confidence_score, ai.reasoning_chain (summarized).
  • CI/CD correlation spans include: alert.deployment_correlation_score, alert.deployment_id, alert.time_since_deploy_seconds.
  • No PII in spans. Alert payloads are hashed for correlation, not logged raw.

Estimate: 3 points

Dependencies: Epic 2 (Correlation Engine)

Technical Notes:

  • This is the most important tenet for dd0c/alert. If the tool suppresses an alert, the reasoning MUST be traceable.
  • Use opentelemetry-python with OTLP exporter. Batch span export to avoid per-alert overhead.
  • For V1 without AI: alert.suppression_reason is the rule name + threshold. When AI scoring is added, the full reasoning chain is captured.
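
A helper for assembling the span's attribute set might look like this; attribute names follow the acceptance criteria, while `alert.payload_hash`, the `alert` field names, and the hashing scheme are assumptions:

```python
import hashlib
import json

def evaluation_span_attributes(alert: dict, suppressed: bool, reason: str) -> dict:
    """Flat attribute dict for the alert_evaluation span (OTel attribute values
    must be scalars, so the matches list is JSON-encoded)."""
    payload_hash = hashlib.sha256(
        json.dumps(alert["payload"], sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "alert.source": alert["source"],
        "alert.noise_score": alert["noise_score"],
        "alert.correlation_matches": json.dumps(alert.get("matches", [])),
        "alert.suppressed": suppressed,
        "alert.suppression_reason": reason,  # V1: rule name + threshold
        "alert.payload_hash": payload_hash,  # hashed, never the raw payload (no PII)
    }
```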

Story 10.5: Configurable Autonomy — Governance for Alert Suppression

As a solo founder, I want a policy.json that controls whether dd0c/alert can auto-suppress alerts or only annotate them, so that customers never lose visibility into their alerts without explicit opt-in.

Acceptance Criteria:

  • policy.json defines governance_mode: strict (annotate-only, never suppress) or audit (auto-suppress with full logging).
  • Default for all new customers: strict. Suppression requires explicit opt-in.
  • panic_mode: when true, all suppression stops immediately. Every alert passes through unmodified. A "panic active" banner appears in the dashboard.
  • Per-customer governance override: customers can only be MORE restrictive than system default.
  • All policy decisions logged with full context: "Alert X suppressed by audit mode, rule Y, score Z" or "Alert X annotation-only, strict mode active".

Estimate: 3 points

Dependencies: Epic 5 (Slack Bot)

Technical Notes:

  • strict mode is the safe default — dd0c/alert adds value even without suppression by annotating alerts with correlation data and noise scores.
  • Panic mode: single Redis key dd0c:panic. All suppression checks short-circuit on this key. Triggerable via POST /admin/panic or env var.
  • Customer override: stored in org settings. Merge: max_restrictive(system, customer).

Epic 10 Summary

Story   Tenet                     Points
10.1    Atomic Flagging                5
10.2    Elastic Schema                 3
10.3    Cognitive Durability           2
10.4    Semantic Observability         3
10.5    Configurable Autonomy          3
Total                                 16