Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.
dd0c/alert — V1 MVP Epics
Product: dd0c/alert (Alert Intelligence Platform) Phase: 7 — Epics & Stories
Epic 1: Webhook Ingestion
Description: The front door of dd0c/alert. Responsible for receiving alert payloads from monitoring providers via webhooks, validating their authenticity, normalizing them into a canonical schema, and queuing them securely for the Correlation Engine. Must support high burst volume (incident storms) and guarantee zero dropped payloads.
User Stories
Story 1.1: Datadog Webhook Ingestion
- As a Platform Engineer, I want to send Datadog webhooks to a unique dd0c URL, so that my Datadog alerts enter the correlation pipeline.
- Acceptance Criteria:
- System exposes `POST /v1/wh/{tenant_id}/datadog`.
- Normalizes Datadog JSON (handles arrays/batched alerts) into the Canonical Alert Schema.
- Normalizes Datadog P1-P5 severities into critical/high/medium/low/info.
- Estimate: 3 points
Story 1.2: PagerDuty Webhook Ingestion
- As a Platform Engineer, I want to send PagerDuty v3 webhooks to dd0c, so that my PD incidents are tracked.
- Acceptance Criteria:
- System exposes `POST /v1/wh/{tenant_id}/pagerduty`.
- Normalizes PagerDuty JSON into the Canonical Alert Schema.
- Estimate: 3 points
Story 1.3: HMAC Signature Validation
- As a Security Admin, I want all incoming webhooks to have their HMAC signatures validated, so that bad actors cannot inject fake alerts.
- Acceptance Criteria:
- Rejects payloads with missing or invalid `DD-WEBHOOK-SIGNATURE` or `X-PagerDuty-Signature` headers with 401 Unauthorized.
- Compares against the integration secret stored in DynamoDB/Secrets Manager.
- Estimate: 3 points
Story 1.4: Payload Normalization & Deduplication (Fingerprinting)
- As an On-Call Engineer, I want identical alerts to be deterministically fingerprinted, so that flapping or duplicated payloads are instantly recognized.
- Acceptance Criteria:
- Generates a SHA-256 fingerprint based on `tenant_id + provider + service + normalized_title`.
- Pushes the canonical alert to the SQS FIFO queue with `MessageGroupId=tenant_id`.
- Saves the raw payload asynchronously to S3 for audit/replay.
- Estimate: 5 points
Dependencies
- Story 1.3 depends on Tenant/Integration configuration existing (Epic 9).
- Story 1.4 depends on Canonical Alert Schema definition.
Technical Notes
- Infra: API Gateway HTTP API -> Lambda -> SQS FIFO.
- Lambda must return 200 OK to the provider in <100ms. S3 raw payload storage must be non-blocking (async).
- Use ULIDs for `alert_id` for time-sortability.
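The deterministic fingerprinting described in Story 1.4 can be sketched as follows. The normalization step (lowercasing, collapsing whitespace) and the field separator are illustrative assumptions, not a fixed spec:

```python
import hashlib

def alert_fingerprint(tenant_id: str, provider: str, service: str, title: str) -> str:
    """SHA-256 fingerprint so duplicate or flapping payloads collapse to one key."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized_title = " ".join(title.lower().split())
    key = f"{tenant_id}|{provider}|{service}|{normalized_title}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Payloads that differ only in casing/whitespace map to the same fingerprint:
fp1 = alert_fingerprint("t1", "datadog", "checkout", "CPU  High")
fp2 = alert_fingerprint("t1", "datadog", "checkout", "cpu high")
```

Because the hash is computed over normalized inputs only, retried or batched deliveries of the same alert are recognized without any lookup state.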
Epic 2: Correlation Engine
Description: The intelligence core. Consumes the normalized SQS FIFO queue, groups alerts based on time windows and service dependencies, and outputs correlated incidents.
User Stories
Story 2.1: Time-Window Clustering
- As an On-Call Engineer, I want alerts firing within a brief time window for the same service to be grouped together, so that I don't get paged 10 times for one failure.
- Acceptance Criteria:
- Opens a 5-minute (configurable) correlation window in Redis when a new alert fingerprint arrives.
- Groups subsequent alerts for the same tenant/service into the active window.
- Stores the correlation state in ElastiCache Redis.
- Estimate: 5 points
Story 2.2: Cascading Failure Correlation (Service Graph)
- As an On-Call Engineer, I want cascading failures across dependent services to be merged into a single incident, so that I can see the blast radius of an issue.
- Acceptance Criteria:
- Reads explicit service dependencies from DynamoDB (`upstream -> downstream`).
- If a window is open for an upstream service, downstream service alerts are merged into the same window.
- Estimate: 8 points
Story 2.3: Active Window Extension
- As an On-Call Engineer, I want the correlation window to automatically extend if alerts are still trickling in, so that long-running, cascading incidents are correctly grouped.
- Acceptance Criteria:
- If a new alert arrives within the last 30 seconds of a window, the window extends by 2 minutes (max 15 minutes).
- Updates the `closes_at` timestamp in Redis.
- Estimate: 3 points
Story 2.4: Incident Generation & Persistence
- As an On-Call Engineer, I want completed time windows to be saved as durable incidents, so that I have a permanent record of the correlated event.
- Acceptance Criteria:
- When a window closes, it generates an Incident record in DynamoDB.
- Generates an event in TimescaleDB for trend tracking.
- Pushes a `correlation-request` to the Suggestion Engine SQS queue.
- Estimate: 5 points
Dependencies
- Story 2.1 depends on Epic 1 (normalized SQS queue).
- Story 2.2 depends on a basic service dependency mapping (either config or API).
Technical Notes
- Infra: ECS Fargate consuming SQS FIFO.
- Must use Redis Sorted Sets for active window management (`closes_at_epoch` as score).
- The correlation engine must be stateless (relying on Redis) so it can scale horizontally to handle incident storms.
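The window-extension rule from Story 2.3 and the sorted-set layout can be sketched together. The extension logic is pure; the Redis call at the end shows the intended sorted-set usage (the key name and member format are assumptions):

```python
EXTEND_THRESHOLD_S = 30      # "arrives within the last 30 seconds of a window"
EXTEND_BY_S = 120            # extend by 2 minutes
MAX_WINDOW_S = 15 * 60       # hard cap: 15 minutes total

def maybe_extend(opened_at: float, closes_at: float, now: float) -> float:
    """Return the (possibly extended) closes_at when a new alert lands at `now`."""
    if closes_at - now <= EXTEND_THRESHOLD_S:
        return min(closes_at + EXTEND_BY_S, opened_at + MAX_WINDOW_S)
    return closes_at

def store_window(r, tenant_id: str, service: str, closes_at: float) -> None:
    # `r` is a redis.Redis client. Using closes_at as the sorted-set score
    # lets a sweeper pop expired windows with ZRANGEBYSCORE(0, now).
    r.zadd("active_windows", {f"{tenant_id}:{service}": closes_at})
```

Keeping all window state in Redis is what lets the consumer tasks stay stateless and scale horizontally, as the technical notes require.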
Epic 3: Noise Analysis
Description: The Suggestion Engine. Calculates a noise score (0-100) for correlated incidents and generates observe-only suppression suggestions. It strictly adheres to V1 constraints by never taking auto-action.
User Stories
Story 3.1: Rule-Based Noise Scoring
- As an On-Call Engineer, I want every incident to receive a noise score based on objective data points, so that I have a metric to understand if this incident is likely a false positive.
- Acceptance Criteria:
- Calculates a 0-100 noise score when an incident is generated.
- Scores based on duplicate fingerprints (flapping), severity distribution (info vs critical), and time of day.
- Cap at 100, floor at 0.
- Estimate: 5 points
Story 3.2: "Never Suppress" Safelist Execution
- As a Platform Engineer, I want critical services (databases, billing) to be excluded from high noise scoring regardless of pattern, so that I never miss a genuine P1.
- Acceptance Criteria:
- Implements a default safelist regex (e.g., `db|rds|payment|billing`).
- Forces the noise score below 50 if the service or title matches the safelist, or if severity is critical.
- Estimate: 3 points
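Stories 3.1 and 3.2 compose naturally into one scoring function. The individual weights below are illustrative assumptions; only the cap/floor and the safelist-forces-below-50 behavior come from the acceptance criteria:

```python
import re

SAFELIST = re.compile(r"db|rds|payment|billing", re.IGNORECASE)

def noise_score(dup_count: int, info_ratio: float, off_hours: bool,
                service: str, title: str, severity: str) -> int:
    """Rule-based 0-100 noise score (weights are illustrative, not spec)."""
    score = 0
    score += min(dup_count * 5, 40)   # flapping / duplicate fingerprints
    score += int(info_ratio * 30)     # mostly-info storms look like noise
    score += 10 if off_hours else 0   # time-of-day factor
    score = max(0, min(score, 100))   # floor at 0, cap at 100
    # "Never suppress" safelist: force below the suggestion threshold.
    if severity == "critical" or SAFELIST.search(service) or SAFELIST.search(title):
        score = min(score, 49)
    return score
```

Because the safelist check runs last, no combination of weights can push a critical or safelisted service into suggestion territory.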
Story 3.3: Observe-Only Suppression Suggestions
- As an On-Call Engineer, I want the system to tell me what it would have suppressed, so that I can build trust in its intelligence without risking an outage.
- Acceptance Criteria:
- If the noise score is > 80, the system generates a `suppress` suggestion record in DynamoDB.
- Generates plain-English reasoning for the suggestion (e.g., "This pattern was resolved automatically 4 times this month.").
- `action_taken` is always hardcoded to `none` for V1.
- Estimate: 5 points
Story 3.4: Incident Scoring Metrics Collection
- As an Engineering Manager, I want the noise scores and counts to be stored as time-series data, so that I can view trends in our alert hygiene over time.
- Acceptance Criteria:
- Writes noise score, alert counts, and unique fingerprints to the TimescaleDB `alert_timeseries` table.
- Estimate: 3 points
Dependencies
- Story 3.1 depends on Epic 2 for Incident Generation.
- Story 3.3 depends on Epic 5 (Slack Bot) to display the suggestion.
Technical Notes
- Infra: ECS Fargate consuming from the `correlation-request` SQS queue.
- Use PostgreSQL (TimescaleDB) for historical frequency lookups ("how many times has this fired in 7 days?") to inform the score.
Epic 4: CI/CD Correlation
Description: Ingests deployment events and correlates them with alert storms. The "killer feature" mandated by the Party Mode board for V1 MVP, answering "did this break right after a deploy?"
User Stories
Story 4.1: GitHub Actions Deploy Ingestion
- As a Platform Engineer, I want to connect my GitHub Actions deployment webhooks, so that dd0c/alert knows exactly when and who deployed to production.
- Acceptance Criteria:
- System exposes `POST /v1/wh/{tenant_id}/github`.
- Validates `X-Hub-Signature-256`.
- Normalizes the GHA workflow run payload into the `DeployEvent` canonical schema.
- Pushes the deploy event to the SQS FIFO queue (`deploy-event`).
- Estimate: 3 points
Story 4.2: Deploy-to-Alert Correlation
- As an On-Call Engineer, I want an alert cluster to be automatically tagged with a recent deployment to that service, so that I don't waste 15 minutes checking deploy logs manually.
- Acceptance Criteria:
- When the Correlation Engine opens a window, it queries DynamoDB for deployments to the affected service within a configurable lookback window (default 15m for prod, 30m for staging).
- If a match is found, the deploy context (`deploy_pr`, `deploy_author`, `source_url`) is attached to the window state.
- Estimate: 8 points
Story 4.3: Deploy-Weighted Noise Scoring
- As an On-Call Engineer, I want alerts that are highly correlated with deployments to be scored as more likely to be noise (if they aren't critical), so that feature flags and config refreshes don't wake me up.
- Acceptance Criteria:
- If a deploy event is attached to an incident, boost the noise score by 15-30 points.
- Additional +5 points if the PR title matches `config` or `feature-flag`.
- Estimate: 2 points
Dependencies
- Story 4.2 depends on Epic 2 (Correlation Engine) and Epic 3 (Noise Analysis).
- Service name mapping between GitHub and Datadog/PagerDuty (convention-based string matching).
Technical Notes
- Infra: The Deployment Tracker runs as a module within the Correlation Engine ECS Task to avoid network latency.
- DynamoDB needs a Global Secondary Index (GSI): `tenant_id` + `service` + `completed_at` to quickly find recent deploys.
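A recent-deploy lookup against that GSI might build its query like this. The index name, the composite `tenant#service` hash key, and the attribute names are assumptions for illustration; the resulting dict would be passed to `table.query(**kwargs)` via boto3:

```python
from datetime import datetime, timedelta, timezone

def recent_deploys_query(tenant_id: str, service: str,
                         lookback_minutes: int = 15) -> dict:
    """Build DynamoDB Query kwargs for an assumed GSI with
    tenant_service as hash key and completed_at as range key."""
    since = (datetime.now(timezone.utc)
             - timedelta(minutes=lookback_minutes)).isoformat()
    return {
        "IndexName": "tenant-service-completed_at-index",  # assumed GSI name
        "KeyConditionExpression": "tenant_service = :ts AND completed_at >= :since",
        "ExpressionAttributeValues": {
            ":ts": f"{tenant_id}#{service}",  # assumed composite hash key
            ":since": since,
        },
        "ScanIndexForward": False,  # newest deploy first
    }
```

The 15-minute default mirrors the prod lookback window in Story 4.2; staging would pass `lookback_minutes=30`.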
Epic 5: Slack Bot
Description: The primary interface for on-call engineers. Delivers correlated incident summaries, observe-only suppression suggestions, and daily alert digests directly into Slack. Provides interactive buttons for engineers to acknowledge or validate suggestions.
User Stories
Story 5.1: Incident Summary Notifications
- As an On-Call Engineer, I want to receive a single, concise Slack message when an alert storm is correlated, so that I don't get flooded with dozens of individual alert notifications.
- Acceptance Criteria:
- Bot sends a formatted Slack Block Kit message to a configured channel.
- Message groups all related alerts under a single incident title.
- Displays the total number of correlated alerts, affected services, and start time.
- Estimate: 5 points
Story 5.2: Observe-Only Suppression Suggestions in Slack
- As an On-Call Engineer, I want the Slack message to include the system's noise score and suppression recommendation, so that I can evaluate its accuracy in real-time.
- Acceptance Criteria:
- If noise score > 80, the message includes a specific "Suggestion" block (e.g., "Would have auto-suppressed: 95% noise score").
- Includes the plain-English reasoning generated in Epic 3.
- Estimate: 3 points
Story 5.3: Interactive Feedback Actions
- As an On-Call Engineer, I want to click "Good Catch" or "Bad Suggestion" on the Slack message, so that I can help train the noise analysis engine for future versions.
- Acceptance Criteria:
- Slack message includes interactive buttons for feedback.
- Clicking a button sends a payload back to dd0c/alert to record the user's validation in the database.
- Updates the Slack message to acknowledge the feedback.
- Estimate: 5 points
Story 5.4: Daily Alert Digest
- As an Engineering Manager, I want a daily summary of the noisiest services and total incidents dropped into Slack, so that my team can prioritize technical debt.
- Acceptance Criteria:
- A scheduled job runs daily at 9 AM (configurable timezone).
- Aggregates the previous 24 hours of data from TimescaleDB.
- Posts a summary of "Top 3 Noisiest Services" and "Total Time Saved" (estimated) to the channel.
- Estimate: 5 points
Dependencies
- Story 5.1 depends on Epic 2 (Correlation Engine).
- Story 5.2 depends on Epic 3 (Noise Analysis).
Technical Notes
- Infra: AWS Lambda for handling incoming Slack interactions (buttons) via API Gateway.
- Use Slack's Block Kit Builder for UI consistency.
- Requires storing Slack Workspace and Channel tokens securely in AWS Secrets Manager or DynamoDB.
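A minimal Block Kit payload covering Stories 5.1-5.3 could look like the sketch below. The block structure follows Slack's Block Kit format; the copy, `action_id` names, and the >80 threshold gate are assumptions drawn from the stories:

```python
def incident_message(title: str, alert_count: int, services: list,
                     noise_score: int, reasoning: str) -> dict:
    """Build a Slack Block Kit payload for a correlated incident summary."""
    blocks = [
        {"type": "header", "text": {"type": "plain_text", "text": title}},
        {"type": "section", "text": {"type": "mrkdwn",
            "text": f"*{alert_count} alerts* correlated across "
                    f"*{', '.join(services)}*"}},
    ]
    if noise_score > 80:  # observe-only suggestion block (Story 5.2)
        blocks.append({"type": "section", "text": {"type": "mrkdwn",
            "text": f"Would have auto-suppressed: {noise_score}% noise score\n"
                    f"{reasoning}"}})
    blocks.append({"type": "actions", "elements": [  # feedback (Story 5.3)
        {"type": "button", "text": {"type": "plain_text", "text": "Good Catch"},
         "action_id": "feedback_good"},
        {"type": "button", "text": {"type": "plain_text", "text": "Bad Suggestion"},
         "action_id": "feedback_bad"},
    ]})
    return {"blocks": blocks}
```

The `action_id` values round-trip back through the interaction Lambda so feedback clicks can be attributed to a specific incident.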
Epic 6: Dashboard API
Description: The backend REST API that powers the dd0c/alert web dashboard. Provides secure endpoints for authentication, querying historical incidents, analyzing alert volume, and managing tenant configuration.
User Stories
Story 6.1: Tenant Authentication & Authorization
- As a Platform Engineer, I want to securely log in to the dashboard API, so that I can manage my organization's alert data safely.
- Acceptance Criteria:
- Implement JWT-based authentication.
- Enforce tenant isolation on all API endpoints (users can only access data for their `tenant_id`).
- Estimate: 5 points
Story 6.2: Incident Query Endpoints
- As an On-Call Engineer, I want to fetch a paginated list of historical incidents and their associated alerts, so that I can review past outages.
- Acceptance Criteria:
- `GET /v1/incidents` supports pagination, time-range filtering, and service filtering.
- `GET /v1/incidents/{incident_id}/alerts` returns the raw alerts correlated into that incident.
- Estimate: 5 points
Story 6.3: Analytics & Noise Score API
- As an Engineering Manager, I want to query aggregated metrics about alert noise and volume, so that I can populate charts on the dashboard.
- Acceptance Criteria:
- `GET /v1/analytics/noise` returns time-series data of average noise scores per service.
- Queries TimescaleDB efficiently using materialized views or continuous aggregates if necessary.
- Estimate: 8 points
Story 6.4: Configuration Management Endpoints
- As a Platform Engineer, I want to manage my integration webhooks and routing rules via API, so that I can script my onboarding or use the UI.
- Acceptance Criteria:
- CRUD endpoints for managing Slack channel destinations.
- Endpoints to generate and rotate inbound webhook secrets for Datadog/PagerDuty.
- Estimate: 3 points
Dependencies
- Story 6.2 and 6.3 depend on TimescaleDB schema and data from Epics 2 and 3.
Technical Notes
- Infra: API Gateway HTTP API -> AWS Lambda (Node.js/Go).
- Strict validation middleware required for tenant isolation.
- Use standard OpenAPI 3.0 specification for documentation.
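The tenant-isolation middleware from Story 6.1 reduces to one check once the JWT is verified. This sketch assumes the signature has already been validated by a library such as PyJWT and that `claims` is the decoded payload; the claim name `tenant_id` is taken from the story:

```python
def authorize(claims: dict, path_tenant_id: str) -> None:
    """Reject any request whose JWT tenant claim doesn't match the resource path.
    `claims` is the already-verified JWT payload (signature checking omitted)."""
    token_tenant = claims.get("tenant_id")
    if token_tenant is None or token_tenant != path_tenant_id:
        # The API layer would map this to a 403 response.
        raise PermissionError(f"tenant mismatch: token={token_tenant!r}")
```

Running this as shared middleware, rather than per-handler checks, is what makes the isolation "strict": no endpoint can forget to call it.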
Epic 7: Dashboard UI
Description: The React Single Page Application (SPA) for dd0c/alert. Gives users a visual interface to view the incident timeline, inspect alert correlation details, and understand the noise scoring.
User Stories
Story 7.1: Incident Timeline View
- As an On-Call Engineer, I want a main feed showing all correlated incidents chronologically, so that I can see the current state of my systems at a glance.
- Acceptance Criteria:
- React SPA fetches and displays data from `GET /v1/incidents`.
- Visual distinction between high-noise (suggested suppressed) and low-noise (critical) incidents.
- Real-time updates or auto-refresh every 30 seconds.
- Estimate: 8 points
Story 7.2: Alert Correlation Visualizer
- As an On-Call Engineer, I want to click on an incident and see exactly which alerts were grouped together, so that I understand why the engine correlated them.
- Acceptance Criteria:
- Detail pane showing the timeline of individual alerts within the incident window.
- Displays the deployment context (Epic 4) if applicable.
- Estimate: 5 points
Story 7.3: Noise Score Breakdown
- As a Platform Engineer, I want to see the exact factors that contributed to an incident's noise score, so that I can trust the engine's reasoning.
- Acceptance Criteria:
- UI component displaying the 0-100 noise score gauge.
- Lists the bulleted reasoning (e.g., "+20 points: Occurred 10 times this week", "+15 points: Recent deployment").
- Estimate: 3 points
Story 7.4: Analytics Dashboard
- As an Engineering Manager, I want charts showing alert volume and noise trends over the last 30 days, so that I can track improvements in our alert hygiene.
- Acceptance Criteria:
- Integrates a charting library (e.g., Recharts or Chart.js).
- Displays a bar chart of total alerts vs. correlated incidents to show "noise reduction" value.
- Estimate: 5 points
Dependencies
- Depends entirely on Epic 6 (Dashboard API).
Technical Notes
- Infra: Hosted on AWS S3 + CloudFront or Vercel.
- Framework: React (Next.js or Vite).
- Tailwind CSS for rapid styling.
Epic 8: Infrastructure & DevOps
Description: The foundational cloud infrastructure and deployment pipelines necessary to run dd0c/alert reliably, securely, and with observability.
User Stories
Story 8.1: Infrastructure as Code (IaC)
- As a Developer, I want all AWS resources defined in code, so that I can easily spin up staging and production environments identically.
- Acceptance Criteria:
- Terraform or AWS CDK defines VPC, API Gateway, Lambda functions, ECS Fargate clusters, SQS queues, and DynamoDB tables.
- State is stored securely in an S3 backend with DynamoDB locking.
- Estimate: 8 points
Story 8.2: CI/CD Pipelines
- As a Developer, I want automated testing and deployment when I push to main, so that I can ship features quickly without manual steps.
- Acceptance Criteria:
- GitHub Actions workflow runs unit tests and linters on PRs.
- Merges to `main` trigger a deployment to the staging environment, followed by a manual approval for production.
- Estimate: 5 points
Story 8.3: System Monitoring & Logging
- As a System Admin, I want central logging and metrics for the dd0c/alert services, so that I can debug issues when the platform itself fails.
- Acceptance Criteria:
- All Lambda and ECS logs route to CloudWatch Logs.
- CloudWatch Alarms configured for API 5xx errors and SQS Dead Letter Queue (DLQ) messages.
- Estimate: 3 points
Story 8.4: Database Provisioning (Timescale & Redis)
- As a Database Admin, I want managed, highly available instances for TimescaleDB and Redis, so that the correlation engine runs with low latency and durable storage.
- Acceptance Criteria:
- Provisions AWS ElastiCache for Redis (for active window state).
- Provisions RDS for PostgreSQL with TimescaleDB extension, or uses Timescale Cloud.
- Estimate: 5 points
Dependencies
- Blocked by architectural decisions being finalized.
- Blocks Epics 1, 2, 3 from being deployed to production.
Technical Notes
- Optimize for Solo Founder: Keep infrastructure simple. Managed services over self-hosted.
- Ensure appropriate IAM roles with least privilege access between Lambda/ECS and DynamoDB/SQS.
Epic 9: Onboarding & PLG
Description: Product-Led Growth and the critical 60-second time-to-value flow. Ensures a frictionless setup experience for new users to connect their monitoring tools and Slack workspace immediately.
User Stories
Story 9.1: Frictionless Sign-Up
- As a New User, I want to sign up using my GitHub or Google account, so that I don't have to create and remember a new password.
- Acceptance Criteria:
- Implement OAuth2 login (GitHub/Google).
- Automatically provisions a new `tenant_id` and default configuration upon successful first login.
- Estimate: 5 points
Story 9.2: Webhook Setup Wizard
- As a New User, I want a step-by-step wizard to configure my Datadog or PagerDuty webhooks, so that I can start sending data to dd0c/alert immediately.
- Acceptance Criteria:
- UI wizard provides copy-paste ready webhook URLs and secrets.
- Includes a "Waiting for first payload..." state that updates in real-time via WebSockets or polling when the first alert arrives.
- Estimate: 8 points
Story 9.3: Slack App Installation Flow
- As a New User, I want a 1-click "Add to Slack" button, so that I can authorize dd0c/alert to post in my incident channels.
- Acceptance Criteria:
- Implements the standard Slack OAuth v2 flow.
- Allows the user to select the default channel for incident summaries.
- Estimate: 5 points
Story 9.4: Free Tier Limitations
- As a Product Owner, I want a free tier that limits the number of processed alerts or retention period, so that users can try the product without me incurring massive AWS costs.
- Acceptance Criteria:
- Free tier limits enforced at the ingestion API (e.g., max 10,000 alerts/month).
- UI displays a usage quota bar.
- Data retention in TimescaleDB automatically purged after 7 days for free tier tenants.
- Estimate: 5 points
Dependencies
- Depends on Epic 6 (Dashboard API) and Epic 7 (Dashboard UI).
- Story 9.2 depends on Epic 1 (Webhook Ingestion) being live.
Technical Notes
- Use Auth0, Clerk, or AWS Cognito to minimize authentication development time for the Solo Founder.
- Real-time "Waiting for payload" can be implemented via a lightweight polling endpoint if WebSockets add too much complexity.
Epic 10: Transparent Factory Compliance
Description: Cross-cutting epic ensuring dd0c/alert adheres to the 5 Transparent Factory tenets. For an alert intelligence platform, Semantic Observability is paramount — a tool that reasons about alerts must make its own reasoning fully transparent.
Story 10.1: Atomic Flagging — Feature Flags for Correlation & Scoring Rules
As a solo founder, I want every new correlation rule, noise scoring algorithm, and suppression behavior behind a feature flag (default: off), so that a bad scoring change doesn't silence critical alerts in production.
Acceptance Criteria:
- OpenFeature SDK integrated into the alert processing pipeline. V1: env-var or JSON file provider.
- All flags evaluate locally — no network calls in the alert ingestion hot path.
- Every flag has an `owner` and a `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule suppresses >2x the baseline alert volume over 30 minutes, the flag auto-disables and all suppressed alerts are re-emitted.
- Flags required for: new correlation patterns, CI/CD deployment correlation, noise scoring thresholds, notification channel routing.
Estimate: 5 points
Dependencies: Epic 2 (Correlation Engine)
Technical Notes:
- Circuit breaker is critical here — a bad suppression rule is worse than no suppression. Track suppression counts per flag in Redis with 30-min sliding window.
- Re-emission: suppressed alerts buffered in a dead-letter queue for 1 hour. On circuit break, replay the queue.
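The circuit breaker's sliding-window logic can be sketched in memory; the story says production counts live in Redis (e.g., a per-flag sorted set of timestamps), but the trip condition is the same. The baseline parameter and class name are assumptions:

```python
from collections import deque
from typing import Optional
import time

WINDOW_S = 30 * 60  # 30-minute sliding window from the story

class SuppressionBreaker:
    """Auto-disable a flagged rule when suppressions exceed 2x baseline
    within the sliding window. In production the timestamps would live in
    Redis so the breaker survives task restarts."""

    def __init__(self, baseline_per_window: int):
        self.baseline = baseline_per_window
        self.events: deque = deque()  # suppression timestamps
        self.tripped = False

    def record_suppression(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        self.events.append(now)
        while self.events and self.events[0] < now - WINDOW_S:
            self.events.popleft()  # expire events outside the window
        if len(self.events) > 2 * self.baseline:
            self.tripped = True    # flag auto-disables; DLQ gets replayed
        return self.tripped
```

On trip, the re-emission step would replay the one-hour dead-letter buffer described in the notes so no suppressed alert is lost.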
Story 10.2: Elastic Schema — Additive-Only for Alert Event Store
As a solo founder, I want all alert event schema changes to be strictly additive, so that historical alert correlation data remains queryable after any deployment.
Acceptance Criteria:
- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes. Old fields remain readable.
- All event parsers are configured to ignore unknown fields (Pydantic `model_config = {"extra": "ignore"}` or equivalent).
- Dual-write during migration windows within the same DB transaction.
- Every migration includes a `sunset_date` comment (max 30 days). CI warns on overdue cleanups.
Estimate: 3 points
Dependencies: Epic 3 (Event Store)
Technical Notes:
- Alert events are append-only by nature — leverage this. Never mutate historical events.
- For correlation metadata (enrichments added post-ingestion), store as separate linked records rather than mutating the original event.
- TimescaleDB compression policies must handle both V1 and V2 column layouts.
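The "ignore unknown fields" requirement is what keeps old parsers alive through additive changes. The criteria name Pydantic's `extra="ignore"`; this stdlib-only sketch shows the equivalent behavior with a dataclass, so the contract is concrete without adding a dependency (field names are illustrative):

```python
from dataclasses import dataclass, fields

@dataclass
class AlertEvent:
    # V1 fields; V2 additions would arrive as new optional *_v2 fields.
    alert_id: str
    service: str
    severity: str

def parse_event(raw: dict) -> AlertEvent:
    """Drop unknown keys before constructing the event, mirroring
    Pydantic's extra="ignore". Old code keeps parsing V2 payloads."""
    known = {f.name for f in fields(AlertEvent)}
    return AlertEvent(**{k: v for k, v in raw.items() if k in known})
```

With this in place, a deployment that adds `*_v2` columns can never break a reader that predates it.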
Story 10.3: Cognitive Durability — Decision Logs for Scoring Logic
As a future maintainer, I want every change to noise scoring weights, correlation rules, or suppression thresholds accompanied by a decision_log.json, so that I can understand why alert X was classified as noise vs. signal.
Acceptance Criteria:
- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/correlation/`, or `src/suppression/`.
- A cyclomatic complexity cap of 10 is enforced in CI. Scoring functions must be decomposable and testable.
- Decision logs live in `docs/decisions/`, one per significant logic change.
Estimate: 2 points
Dependencies: None
Technical Notes:
- Scoring weight changes are especially important to document — "why is deployment correlation weighted 0.7 and not 0.5?"
- Include sample alert scenarios in decision logs showing before/after scoring behavior.
Story 10.4: Semantic Observability — AI Reasoning Spans on Alert Classification
As an on-call engineer investigating a missed critical alert, I want every alert scoring and correlation decision to emit an OpenTelemetry span with full reasoning metadata, so that I can trace exactly why an alert was scored as noise when it was actually a P1 incident.
Acceptance Criteria:
- Every alert ingestion creates a parent `alert_evaluation` span, with child spans for `noise_scoring`, `correlation_matching`, and `suppression_decision`.
- Span attributes: `alert.source`, `alert.noise_score`, `alert.correlation_matches` (JSON array), `alert.suppressed` (bool), `alert.suppression_reason`.
- If AI-assisted classification is used: `ai.prompt_hash`, `ai.model_version`, `ai.confidence_score`, `ai.reasoning_chain` (summarized).
- CI/CD correlation spans include: `alert.deployment_correlation_score`, `alert.deployment_id`, `alert.time_since_deploy_seconds`.
- No PII in spans. Alert payloads are hashed for correlation, not logged raw.
Estimate: 3 points
Dependencies: Epic 2 (Correlation Engine)
Technical Notes:
- This is the most important tenet for dd0c/alert. If the tool suppresses an alert, the reasoning MUST be traceable.
- Use `opentelemetry-python` with the OTLP exporter. Batch span export to avoid per-alert overhead.
- For V1 without AI: `alert.suppression_reason` is the rule name + threshold. When AI scoring is added, the full reasoning chain is captured.
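To keep the sketch dependency-free, the example below builds only the attribute dict the `alert_evaluation` span would carry; with `opentelemetry-python` it would be applied via `span.set_attributes(...)` inside a tracer context. Attribute names come from the acceptance criteria; the payload-hash scheme is an assumption satisfying the no-PII rule:

```python
import hashlib
import json

def evaluation_span_attributes(alert: dict, noise_score: int,
                               matches: list, suppressed: bool,
                               reason: str) -> dict:
    """Attributes for the parent alert_evaluation span."""
    return {
        "alert.source": alert["provider"],
        "alert.noise_score": noise_score,
        "alert.correlation_matches": json.dumps(matches),
        "alert.suppressed": suppressed,
        "alert.suppression_reason": reason,  # V1: rule name + threshold
        # Payload is hashed for correlation, never logged raw (no PII).
        "alert.payload_hash": hashlib.sha256(
            json.dumps(alert, sort_keys=True).encode()).hexdigest(),
    }
```

When a suppressed alert turns out to be a real P1, these attributes are what let an on-call engineer replay the exact scoring decision from the trace.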
Story 10.5: Configurable Autonomy — Governance for Alert Suppression
As a solo founder, I want a policy.json that controls whether dd0c/alert can auto-suppress alerts or only annotate them, so that customers never lose visibility into their alerts without explicit opt-in.
Acceptance Criteria:
- `policy.json` defines `governance_mode`: `strict` (annotate-only, never suppress) or `audit` (auto-suppress with full logging).
- Default for all new customers: `strict`. Suppression requires explicit opt-in.
- `panic_mode`: when true, all suppression stops immediately. Every alert passes through unmodified. A "panic active" banner appears in the dashboard.
- Per-customer governance override: customers can only be MORE restrictive than the system default.
- All policy decisions logged with full context: "Alert X suppressed by audit mode, rule Y, score Z" or "Alert X annotation-only, strict mode active".
Estimate: 3 points
Dependencies: Epic 4 (Notification Router)
Technical Notes:
- `strict` mode is the safe default: dd0c/alert adds value even without suppression by annotating alerts with correlation data and noise scores.
- Panic mode: single Redis key `dd0c:panic`. All suppression checks short-circuit on this key. Triggerable via `POST /admin/panic` or env var.
- Customer override: stored in org settings. Merge: `max_restrictive(system, customer)`.
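The `max_restrictive` merge and the panic short-circuit fit in a few lines. The ordering assumption here is that `strict` outranks `audit`; in production `panic` would be read from the `dd0c:panic` Redis key rather than passed in:

```python
RESTRICTIVENESS = {"audit": 0, "strict": 1}  # higher = more restrictive

def effective_mode(system_mode: str, customer_mode: str) -> str:
    """max_restrictive(system, customer): customers can only tighten policy."""
    return max(system_mode, customer_mode, key=RESTRICTIVENESS.__getitem__)

def may_suppress(mode: str, panic: bool) -> bool:
    """Suppression is allowed only in audit mode and never during panic.
    `panic` would come from the dd0c:panic Redis key in production."""
    return mode == "audit" and not panic
```

Checking panic inside the same gate as the mode guarantees there is exactly one code path that can suppress, which is what makes the panic switch trustworthy.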
Epic 10 Summary
| Story | Tenet | Points |
|---|---|---|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| Total |  | 16 |