dd0c/products/05-aws-cost-anomaly/epics/epics.md
Max Mayfield · 2026-02-28

dd0c/cost — V1 MVP Epics

This document breaks down the dd0c/cost MVP into implementable Epics and Stories. Stories are sized for a solo founder to complete in 1-3 days (typically 1-5 points).

Epic 1: CloudTrail Ingestion

Description: Build the real-time event pipeline that receives CloudTrail events from customer accounts, filters for cost-relevant actions (EC2, RDS, Lambda), normalizes them into CostEvents, and estimates their on-demand cost. This is the foundational data ingestion layer.

User Stories

Story 1.1: Cross-Account EventBridge Bus

  • As a dd0c system, I want to receive CloudTrail events from external customer AWS accounts via EventBridge, so that I can process them centrally without running agents in customer accounts.
  • Acceptance Criteria:
    • dd0c-cost-bus created in dd0c's AWS account.
    • Resource policy allows events:PutEvents from any AWS account (scoped by external ID/trust later, but fundamentally open to receive).
    • Test events sent from a separate AWS account successfully arrive on the bus.
  • Estimate: 2
  • Dependencies: None
  • Technical Notes: Use AWS CDK. Ensure the bus is configured in us-east-1.

Story 1.2: SQS Ingestion Queue & Dead Letter Queue

  • As a data pipeline, I want events routed from EventBridge to an SQS FIFO queue, so that I can process them in order, deduplicate them, and handle bursts without dropping data.
  • Acceptance Criteria:
    • EventBridge rule routes matching events to event-ingestion.fifo queue.
    • SQS FIFO configured with MessageGroupId = accountId and deduplication enabled.
    • DLQ configured after 3 retries.
  • Estimate: 2
  • Dependencies: Story 1.1
  • Technical Notes: CloudTrail can emit duplicates; use eventID for SQS deduplication ID.

Story 1.3: Static Pricing Tables

  • As an event processor, I want local static lookup tables for EC2, RDS, and Lambda on-demand pricing, so that I can estimate hourly costs in milliseconds without calling the slow AWS Pricing API.
  • Acceptance Criteria:
    • JSON/TypeScript dicts created for top 20 instance types for EC2 and RDS, plus Lambda per-GB-second rates.
    • Pricing covers us-east-1 (and placeholder for others if needed).
  • Estimate: 2
  • Dependencies: None
  • Technical Notes: Keep it simple for V1. Hardcode the most common instance types. We don't need the entire AWS price list yet.
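A minimal sketch of the lookup tables in TypeScript. The rates below are illustrative approximations of us-east-1 on-demand pricing, not authoritative figures, and the `estimateHourlyCost` name and signature are assumptions:

```typescript
// Static pricing lookup sketch. Rates are illustrative approximations of
// us-east-1 on-demand pricing, not authoritative figures.
type ServicePricing = Record<string, number>; // instance type -> $/hr

const EC2_PRICING: ServicePricing = {
  "t3.micro": 0.0104,
  "t3.medium": 0.0416,
  "m5.large": 0.096,
  "p3.2xlarge": 3.06,
};

const RDS_PRICING: ServicePricing = {
  "db.t3.micro": 0.017,
  "db.m5.large": 0.171,
};

// Lambda is priced per GB-second of execution rather than per instance hour.
const LAMBDA_PER_GB_SECOND = 0.0000166667;

export function estimateHourlyCost(
  service: "ec2" | "rds" | "lambda",
  instanceType?: string,
  memoryMb?: number,
): number | undefined {
  switch (service) {
    case "ec2":
      return instanceType ? EC2_PRICING[instanceType] : undefined;
    case "rds":
      return instanceType ? RDS_PRICING[instanceType] : undefined;
    case "lambda":
      // Worst case: the function runs continuously for a full hour.
      return memoryMb ? (memoryMb / 1024) * 3600 * LAMBDA_PER_GB_SECOND : undefined;
  }
}
```

Returning undefined for unknown types (rather than guessing) lets the processor write the event without a cost estimate and flag the gap.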

Story 1.4: Event Processor Lambda

  • As an event pipeline, I want a Lambda function to poll the SQS queue, normalize raw CloudTrail events into CostEvent schemas, and write them to DynamoDB, so that downstream systems have clean, standardized data.
  • Acceptance Criteria:
    • Lambda polls SQS (batch size 10).
    • Parses RunInstances, CreateDBInstance, CreateFunction20150331, etc.
    • Extracts actor (IAM User/Role ARN), resource ID, region.
    • Looks up pricing and appends estimatedHourlyCost.
    • Writes CostEvent to DynamoDB dd0c-cost-main table.
  • Estimate: 5
  • Dependencies: Story 1.2, Story 1.3
  • Technical Notes: Implement idempotency. Use DynamoDB Single-Table Design. Partition key: ACCOUNT#<id>, Sort key: EVENT#<timestamp>#<eventId>.
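The normalization step can be sketched as a pure function. The key format comes from the technical notes above; the CostEvent field names and the responseElements extraction paths are assumptions based on typical CloudTrail payload shapes:

```typescript
// Sketch of CloudTrail -> CostEvent normalization. Field names beyond the
// pk/sk format are assumptions, not a final schema.
interface CostEvent {
  pk: string;          // ACCOUNT#<id>
  sk: string;          // EVENT#<timestamp>#<eventId>
  eventName: string;
  actorArn: string;
  region: string;
  resourceId?: string;
}

interface RawCloudTrailEvent {
  eventID: string;
  eventName: string;
  eventTime: string;
  awsRegion: string;
  recipientAccountId: string;
  userIdentity: { arn: string };
  responseElements?: any;
}

export function normalize(raw: RawCloudTrailEvent): CostEvent {
  return {
    pk: `ACCOUNT#${raw.recipientAccountId}`,
    sk: `EVENT#${raw.eventTime}#${raw.eventID}`,
    eventName: raw.eventName,
    actorArn: raw.userIdentity.arn,
    region: raw.awsRegion,
    resourceId:
      // RunInstances puts instance IDs under instancesSet.items.
      raw.responseElements?.instancesSet?.items?.[0]?.instanceId ??
      // CreateDBInstance returns the DB instance identifier.
      raw.responseElements?.dBInstanceIdentifier,
  };
}
```

Because the sort key embeds eventID, a conditional put (attribute_not_exists) on the same key gives idempotency against CloudTrail duplicates for free.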

Epic 2: Anomaly Detection Engine

Description: Implement the baseline learning and anomaly scoring algorithms. The engine evaluates incoming CostEvent records against account-specific, service-specific historical spending baselines to flag unusual spikes, new instance types, or unusual actors.

User Stories

Story 2.1: Baseline Storage & Retrieval

  • As an anomaly scorer, I want to read and write spending baselines per account/service/resource from DynamoDB, so that I have a statistical foundation to evaluate new events against.
  • Acceptance Criteria:
    • Baseline schema created in DynamoDB (BASELINE#<account_id>).
    • Read/Write logic implemented for running means, standard deviations, max observed, and expected actors/instance types.
  • Estimate: 3
  • Dependencies: Story 1.4
  • Technical Notes: Update baseline with ADD expressions in DynamoDB to avoid race conditions.
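One way to make those ADD expressions work is to store only additive accumulators (count, sum, sum of squares) and derive mean and standard deviation on read. A sketch, with field names that are assumptions:

```typescript
// Additive accumulators: count, sum, and sumSquares can all be updated with a
// single DynamoDB `ADD #count :one, #sum :c, #sumSq :c2` expression, so
// concurrent writers never race on a read-modify-write cycle.
// (maxObserved is the exception: it needs a conditional SET, not ADD.)
interface Baseline {
  count: number;
  sum: number;
  sumSquares: number;
  maxObserved: number;
}

export function applyEvent(b: Baseline, cost: number): Baseline {
  return {
    count: b.count + 1,
    sum: b.sum + cost,
    sumSquares: b.sumSquares + cost * cost,
    maxObserved: Math.max(b.maxObserved, cost),
  };
}

export function mean(b: Baseline): number {
  return b.count === 0 ? 0 : b.sum / b.count;
}

export function stdDev(b: Baseline): number {
  if (b.count < 2) return 0;
  const m = mean(b);
  // Population variance from the accumulators; clamped for floating-point noise.
  return Math.sqrt(Math.max(0, b.sumSquares / b.count - m * m));
}
```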

Story 2.2: Cold-Start Absolute Thresholds

  • As a new customer, I want my account to immediately flag highly expensive resources (>$5/hr) even if I have no baseline, so that I don't wait 14 days for the system to "learn" a $3,000 mistake.
  • Acceptance Criteria:
    • Implement absolute threshold heuristics: >$0.50/hr = INFO, >$5/hr = WARNING, >$25/hr = CRITICAL.
    • Apply logic when account maturity is cold-start (<14 days or <20 events).
  • Estimate: 2
  • Dependencies: Story 2.1
  • Technical Notes: Implement a scoreAnomaly function that checks the maturity state of the baseline.
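The cold-start rules above translate directly into two small pure functions (names are assumptions; thresholds come from the acceptance criteria):

```typescript
type Severity = "none" | "info" | "warning" | "critical";

// Absolute thresholds from the acceptance criteria, applied while the
// baseline is still immature.
export function coldStartSeverity(hourlyCost: number): Severity {
  if (hourlyCost > 25) return "critical";
  if (hourlyCost > 5) return "warning";
  if (hourlyCost > 0.5) return "info";
  return "none";
}

// Cold start = account has <14 days of baseline data or <20 observed events.
export function isColdStart(baselineDays: number, eventCount: number): boolean {
  return baselineDays < 14 || eventCount < 20;
}
```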

Story 2.3: Statistical Anomaly Scoring

  • As an anomaly scorer, I want to calculate composite anomaly scores using Z-scores, instance novelty, and actor novelty, so that I reduce false positives and only flag truly unusual behavior.
  • Acceptance Criteria:
    • Implement Z-score calculation (event cost vs baseline mean).
    • Implement novelty checks (is this instance type or actor new?).
    • Composite score logic computes severity (info, warning, critical).
    • Creates an AnomalyRecord in DynamoDB if threshold crossed.
  • Estimate: 5
  • Dependencies: Story 2.1
  • Technical Notes: Add unit tests covering various edge cases (new actor + cheap instance vs. familiar actor + expensive instance).
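A sketch of the composite scorer. The weights and severity cut-offs below are illustrative placeholders, not tuned production values:

```typescript
// Hypothetical composite scoring: Z-score contributes 0..100 (clamped),
// novelty signals add fixed bonuses. Weights are illustrative only.
interface ScoreInput {
  cost: number;
  baselineMean: number;
  baselineStdDev: number;
  isNewInstanceType: boolean;
  isNewActor: boolean;
}

export function zScore(i: ScoreInput): number {
  if (i.baselineStdDev === 0) return i.cost > i.baselineMean ? Infinity : 0;
  return (i.cost - i.baselineMean) / i.baselineStdDev;
}

export function compositeScore(i: ScoreInput): number {
  // Clamp so one runaway signal cannot dominate unboundedly.
  const z = Math.min(Math.max(zScore(i), 0), 10);
  let score = z * 10; // 0..100 from the statistical signal
  if (i.isNewInstanceType) score += 20;
  if (i.isNewActor) score += 15;
  return Math.min(score, 100);
}

export function severityFor(score: number): "none" | "info" | "warning" | "critical" {
  if (score >= 75) return "critical";
  if (score >= 50) return "warning";
  if (score >= 25) return "info";
  return "none";
}
```

Note how the two edge cases from the technical notes fall out: a new actor launching a cheap instance scores only the novelty bonus, while a familiar actor launching an expensive instance scores on the Z-score alone.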

Story 2.4: Feedback Loop ("Mark as Expected")

  • As an anomaly engine, I want to update baselines when a user marks an anomaly as expected, so that I learn from feedback and stop alerting on normal workflows.
  • Acceptance Criteria:
    • Provide a function to append a resource type and actor to expectedInstanceTypes and expectedActors.
    • Future events matching this suppressed pattern get a reduced anomaly score.
  • Estimate: 3
  • Dependencies: Story 2.3
  • Technical Notes: This API will be called by the Slack action handler.

Epic 3: Notification Service

Description: Build the Slack-first notification engine. Deliver rich Block Kit alerts containing anomaly context, estimated costs, and manual remediation suggestions. This is the product's primary user interface for V1.

User Stories

Story 3.1: SQS Alert Queue & Notifier Lambda

  • As a notification engine, I want to poll an alert queue and trigger a Lambda function for every new anomaly, so that I can format and send alerts asynchronously without blocking the ingestion path.
  • Acceptance Criteria:
    • Create standard SQS alert-queue for anomalies.
    • Create notifier Lambda that polls the queue.
    • SQS retries via visibility timeout on Slack API rate limits (429).
  • Estimate: 2
  • Dependencies: Story 2.3
  • Technical Notes: The scorer Lambda pushes the anomaly ID to this queue.

Story 3.2: Slack Block Kit Formatting

  • As a user, I want anomaly alerts formatted nicely in Slack, so that I can instantly understand what resource launched, who launched it, the estimated cost, and why it was flagged.
  • Acceptance Criteria:
    • Use Slack Block Kit to design a highly readable card.
    • Include: Resource Type, Region, Cost/hr, Actor, Timestamp, and the reason (e.g., "New instance type never seen").
    • Test rendering for EC2, RDS, and Lambda anomalies.
  • Estimate: 3
  • Dependencies: Story 3.1
  • Technical Notes: Include a "Why this alert" section detailing the anomaly signals.
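A sketch of the card builder. The exact block layout is an assumption; the fields shown match the acceptance criteria:

```typescript
// Builds a Slack Block Kit payload for one anomaly. Layout is a sketch,
// not the final design.
interface AnomalySummary {
  resourceType: string;
  region: string;
  hourlyCost: number;
  actorArn: string;
  timestamp: string;
  reason: string;
}

export function buildAlertBlocks(a: AnomalySummary): object[] {
  return [
    {
      type: "header",
      text: { type: "plain_text", text: `Cost anomaly: ${a.resourceType}` },
    },
    {
      type: "section",
      fields: [
        { type: "mrkdwn", text: `*Region:*\n${a.region}` },
        { type: "mrkdwn", text: `*Cost/hr:*\n$${a.hourlyCost.toFixed(2)}` },
        { type: "mrkdwn", text: `*Actor:*\n${a.actorArn}` },
        { type: "mrkdwn", text: `*When:*\n${a.timestamp}` },
      ],
    },
    {
      // "Why this alert" section with the anomaly signals.
      type: "context",
      elements: [{ type: "mrkdwn", text: `*Why this alert:* ${a.reason}` }],
    },
  ];
}
```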

Story 3.3: Manual Remediation Suggestions

  • As a user, I want the Slack alert to include CLI commands to stop or terminate the anomalous resource, so that I can fix the issue immediately even before one-click buttons are available.
  • Acceptance Criteria:
    • Block Kit template appends a Suggested actions section.
    • Generate a valid aws ec2 stop-instances or aws rds stop-db-instance command based on the resource type and region.
  • Estimate: 2
  • Dependencies: Story 3.2
  • Technical Notes: For V1, no actual remediation API calls are made by dd0c. This prevents accidental deletions and builds trust first.
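The command generator is a simple switch over resource type; a sketch (the function name is an assumption, and dd0c never executes these commands itself in V1):

```typescript
// Produces a copy-pasteable AWS CLI suggestion for the alert's
// "Suggested actions" section. dd0c only suggests; it never executes.
export function suggestRemediation(
  resourceType: "ec2" | "rds",
  resourceId: string,
  region: string,
): string {
  switch (resourceType) {
    case "ec2":
      return `aws ec2 stop-instances --instance-ids ${resourceId} --region ${region}`;
    case "rds":
      return `aws rds stop-db-instance --db-instance-identifier ${resourceId} --region ${region}`;
  }
}
```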

Story 3.4: Daily Digest Generator

  • As a user, I want a daily summary of my spending and any minor anomalies, so that I don't get paged for every $0.50 resource but still have visibility.
  • Acceptance Criteria:
    • Create an EventBridge Scheduler rule (e.g., cron at 09:00 UTC).
    • Lambda queries the last 24h of anomalies and baseline metrics.
    • Sends a digest message (Spend Estimate, Anomalies Resolved vs. Open, Zombie Watch summary).
  • Estimate: 5
  • Dependencies: Story 3.2
  • Technical Notes: Query DynamoDB GSI for recent anomalies (ANOMALY#<id>#STATUS#*).

Epic 4: Customer Onboarding

Description: Automate the 5-minute setup experience. Create the CloudFormation templates and cross-account IAM roles required for dd0c to securely read CloudTrail events and resource metadata without touching customer data or secrets.

User Stories

Story 4.1: IAM Read-Only CloudFormation Template

  • As a customer, I want to deploy a simple, open-source CloudFormation template, so that I can grant dd0c secure, read-only access to my AWS account without worrying about compromised credentials.
  • Acceptance Criteria:
    • Create dd0c-cost-readonly.yaml template.
    • Role dd0c-cost-readonly with sts:AssumeRole policy.
    • Requires ExternalId parameter.
    • Allows ec2:Describe*, rds:Describe*, lambda:List*, cloudwatch:Get*, cloudwatch:List*, ce:GetCostAndUsage, tag:GetResources (read-only actions only; no cloudwatch:* wildcard, which would grant write access).
    • Hosted on a public S3 bucket (dd0c-cf-templates).
  • Estimate: 3
  • Dependencies: None
  • Technical Notes: Include an EventBridge rule that forwards cost-relevant CloudTrail events to dd0c's EventBridge bus (arn:aws:events:...:dd0c-cost-bus).

Story 4.2: Cognito User Pool Authentication

  • As a platform, I want a secure identity provider, so that users can sign up quickly using GitHub or Google SSO.
  • Acceptance Criteria:
    • Configure Amazon Cognito User Pool.
    • Enable GitHub and Google OIDC providers.
    • Provide a login URL and redirect to the dd0c app.
  • Estimate: 3
  • Dependencies: None
  • Technical Notes: AWS Cognito is free for the first 50k MAU, keeping V1 costs at zero.

Story 4.3: Account Setup API Endpoint

  • As a new user, I want an API that initializes my tenant and generates a secure CloudFormation "quick-create" link, so that I can click one button to install the required AWS permissions.
  • Acceptance Criteria:
    • POST /v1/accounts/setup created in API Gateway.
    • Validates Cognito JWT.
    • Generates a unique UUIDv4 externalId per tenant/account.
    • Returns a URL pointing to the AWS Console CloudFormation quick-create page with pre-filled parameters.
  • Estimate: 3
  • Dependencies: Story 4.1, Story 4.2
  • Technical Notes: The API Lambda should store the generated externalId in DynamoDB under the tenant record.

Story 4.4: Role Validation & Activation

  • As a dd0c system, I want to validate a user's AWS account connection by assuming their newly created role, so that I know I can receive events and start anomaly detection.
  • Acceptance Criteria:
    • POST /v1/accounts API created (receives awsAccountId, roleArn).
    • Calls sts:AssumeRole using the roleArn and externalId.
    • On success, updates the account status to active in DynamoDB.
    • Automatically triggers a "Zombie Resource Scan" on connection.
  • Estimate: 5
  • Dependencies: Story 4.3
  • Technical Notes: This is the critical moment. If the AssumeRole fails, return an error explaining the ExternalId mismatch or missing permissions.

Epic 5: Dashboard API

Description: Build the REST API for anomaly querying, account management, and basic metrics. V1 relies entirely on Slack for interaction, but a minimal API is needed for account settings and the upcoming V2 dashboard.

User Stories

Story 5.1: Account Retrieval API

  • As a user, I want to see my connected AWS accounts, so that I can view their health status and disconnect them if needed.
  • Acceptance Criteria:
    • GET /v1/accounts API created (returns accountId, status, baselineMaturity).
    • DELETE /v1/accounts/{id} API created.
    • Returns 401 Unauthorized without a valid Cognito JWT.
    • Scopes database query to tenantId.
  • Estimate: 3
  • Dependencies: Story 4.4
  • Technical Notes: The disconnect endpoint should mark the account as disconnecting and trigger a background Lambda to delete the data within 72 hours.

Story 5.2: Anomaly Listing API

  • As a user, I want to view a list of recent anomalies, so that I can review past alerts or check if anything was missed.
  • Acceptance Criteria:
    • GET /v1/anomalies API created.
    • Queries DynamoDB GSI3 (ANOMALY#<id>#STATUS#*) for the authenticated account.
    • Supports since, status, and severity filters.
    • Implements basic pagination.
  • Estimate: 5
  • Dependencies: Story 2.3
  • Technical Notes: Include slackMessageUrl if the anomaly triggered a Slack alert.

Story 5.3: Baseline Overrides

  • As a user, I want to adjust anomaly sensitivity for specific services or resource types, so that I don't get paged for expected batch processing spikes.
  • Acceptance Criteria:
    • PATCH /v1/accounts/{id}/baselines/{service}/{type} API created.
    • Modifies the DynamoDB baseline record to update sensitivityOverride (low, medium, high).
  • Estimate: 2
  • Dependencies: Story 2.1
  • Technical Notes: Valid values must be enforced by the API schema.

Epic 6: Dashboard UI

Description: Build the initial Next.js/React frontend. While V1 focuses on Slack, the web dashboard handles onboarding, account connection, Slack OAuth, and basic anomaly viewing for users who prefer the web.

User Stories

Story 6.1: Next.js Boilerplate & Auth

  • As a user, I want to sign in to the dd0c/cost portal, so that I can configure my account and view my AWS connections.
  • Acceptance Criteria:
    • Initialize Next.js app with Tailwind CSS.
    • Implement AWS Amplify or next-auth for Cognito integration.
    • Landing page with Start Free button.
    • Protect /dashboard routes.
  • Estimate: 3
  • Dependencies: Story 4.2
  • Technical Notes: Keep the design clean and Vercel-like. The goal is to get the user authenticated in <10 seconds.

Story 6.2: Onboarding Flow

  • As a new user, I want a simple 3-step wizard to connect AWS and Slack, so that I don't get lost in documentation.
  • Acceptance Criteria:
    • "Connect AWS Account" screen.
    • Generates CloudFormation quick-create URL.
    • Polls /v1/accounts/{id}/health for successful connection.
    • "Connect Slack" screen initiates OAuth flow.
  • Estimate: 5
  • Dependencies: Story 4.3, Story 4.4, Story 6.1
  • Technical Notes: Provide a fallback manual input field if the auto-polling fails or the user closes the AWS Console window early.

Story 6.3: Basic Dashboard View

  • As a user, I want a simple dashboard showing my connected accounts, recent anomalies, and estimated monthly cost, so that I have a high-level view outside of Slack.
  • Acceptance Criteria:
    • Render an Account Overview table.
    • Fetch anomalies via /v1/anomalies and display in a simple list or timeline.
    • Indicate the account's baseline learning phase (e.g., "14 days left in learning phase").
  • Estimate: 5
  • Dependencies: Story 5.1, Story 5.2, Story 6.1
  • Technical Notes: V1 UI shouldn't be complex. Avoid graphs or heavy chart libraries for MVP.

Epic 7: Slack Bot

Description: Build the Slack bot interaction model. This includes the OAuth installation flow, parsing incoming slash commands (/dd0c status, /dd0c anomalies, /dd0c digest), and handling interactive message payloads for actions like snoozing or marking alerts as expected.

User Stories

Story 7.1: Slack OAuth Installation Flow

  • As a user, I want to securely install the dd0c app to my Slack workspace, so that the bot can send alerts to my designated channels.
  • Acceptance Criteria:
    • GET /v1/slack/install initiates the Slack OAuth v2 flow.
    • GET /v1/slack/oauth_redirect handles the callback, exchanging the code for a bot token.
    • Bot token and workspace details are securely stored in DynamoDB under the tenant's record.
  • Estimate: 3
  • Dependencies: Story 4.2
  • Technical Notes: Request minimum scopes: chat:write, commands, incoming-webhook. Encrypt the Slack bot token at rest.

Story 7.2: Slash Command Parser & Router

  • As a Slack user, I want to use commands like /dd0c status, so that I can interact with the system without leaving my chat window.
  • Acceptance Criteria:
    • POST /v1/slack/commands API endpoint created to receive Slack command webhooks.
    • Validates Slack request signatures (HMAC-SHA256).
    • Routes /dd0c status, /dd0c anomalies, and /dd0c digest to respective handler functions.
  • Estimate: 3
  • Dependencies: Story 7.1
  • Technical Notes: Respond within 3 seconds or defer with a 200 OK and use the response_url for delayed execution.
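The signature check follows Slack's documented v0 scheme: HMAC-SHA256 of `v0:<timestamp>:<raw body>` with the signing secret, compared in constant time, with stale timestamps rejected (Slack suggests roughly a five-minute window). A sketch with assumed function and parameter names:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies Slack's v0 request signature:
//   v0=HMAC_SHA256(signingSecret, "v0:<timestamp>:<rawBody>")
export function verifySlackSignature(
  signingSecret: string,
  timestamp: string,
  rawBody: string,
  signature: string,
  nowSeconds = Math.floor(Date.now() / 1000),
): boolean {
  // Reject stale requests to prevent replay attacks (5-minute window).
  if (Math.abs(nowSeconds - Number(timestamp)) > 60 * 5) return false;
  const base = `v0:${timestamp}:${rawBody}`;
  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(base).digest("hex");
  // timingSafeEqual throws on length mismatch, so guard first.
  if (expected.length !== signature.length) return false;
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```

Verify against the raw request body exactly as received; re-serializing a parsed body will break the HMAC.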

Story 7.3: Interactive Action Handler

  • As a user, I want to click buttons on anomaly alerts to snooze them or mark them as expected, so that I can tune the system's noise level instantly.
  • Acceptance Criteria:
    • POST /v1/slack/actions API endpoint created to receive interactive payloads.
    • Validates Slack request signatures.
    • Handles mark_expected action by updating the anomaly record and retraining the baseline.
    • Handles snooze_Xh actions by updating the snoozeUntil attribute.
    • Updates the original Slack message using the Slack API to reflect the action taken.
  • Estimate: 5
  • Dependencies: Story 3.2, Story 7.2
  • Technical Notes: V1 only implements non-destructive actions (snooze, mark expected). No actual AWS remediation API calls yet.

Epic 8: Infrastructure & DevOps

Description: Define the serverless infrastructure using AWS CDK. This epic covers the deployment of the EventBridge buses, SQS queues, Lambda functions, DynamoDB tables, and setting up the CI/CD pipeline for automated testing and deployment.

User Stories

Story 8.1: Core Serverless Stack (CDK)

  • As a developer, I want the core ingestion and data storage infrastructure defined as code, so that I can deploy the dd0c platform reliably and repeatedly.
  • Acceptance Criteria:
    • AWS CDK (TypeScript) project initialized.
    • dd0c-cost-main DynamoDB table defined with GSIs and TTL.
    • dd0c-cost-bus EventBridge bus configured with resource policies allowing external puts.
    • event-ingestion.fifo and alert-queue SQS queues created.
  • Estimate: 3
  • Dependencies: None
  • Technical Notes: Ensure DynamoDB is set to PAY_PER_REQUEST (on-demand) to minimize baseline costs.

Story 8.2: Lambda Deployments & Triggers

  • As a developer, I want to deploy the Lambda functions and connect them to their respective triggers, so that the event-driven architecture functions end-to-end.
  • Acceptance Criteria:
    • CDK definitions for event-processor, anomaly-scorer, notifier, and API handlers.
    • SQS event source mappings configured for processor and notifier Lambdas.
    • API Gateway REST API configured with routes pointing to the API handler Lambda.
  • Estimate: 5
  • Dependencies: Story 8.1
  • Technical Notes: Bundle Lambdas using NodejsFunction construct (esbuild) to minimize cold starts. Set explicit memory and timeout values.

Story 8.3: Observability & Alarms

  • As an operator, I want automated monitoring of the infrastructure, so that I am alerted if ingestion fails or components throttle.
  • Acceptance Criteria:
    • CloudWatch Alarms created for Lambda error rates (>5% in 5 mins).
    • Alarms created for SQS DLQ depth (ApproximateNumberOfMessagesVisible > 0).
    • Alarms send notifications to an SNS ops-alerts topic.
  • Estimate: 2
  • Dependencies: Story 8.2
  • Technical Notes: Keep V1 alarms simple to avoid alert fatigue.

Story 8.4: CI/CD Pipeline Setup

  • As a solo founder, I want GitHub Actions to automatically test and deploy my code, so that I can push to main and have it live in minutes without manual deployment steps.
  • Acceptance Criteria:
    • GitHub Actions workflow created for PRs (lint, test).
    • Workflow created for main branch (lint, test, cdk deploy --require-approval broadening).
    • OIDC provider configured in AWS for passwordless GitHub Actions authentication.
  • Estimate: 3
  • Dependencies: Story 8.1
  • Technical Notes: Use the aws-actions/configure-aws-credentials action with role-to-assume pointing at the OIDC deploy role.

Epic 9: PLG & Free Tier

Description: Implement the product-led growth (PLG) foundations. This involves building a seamless self-serve signup flow, enforcing free tier limits (1 AWS account), and providing the mechanism to upgrade to a paid tier via Stripe.

User Stories

Story 9.1: Free Tier Enforcement

  • As a platform, I want to limit free users to 1 connected AWS account, so that I can control infrastructure costs while letting users experience the product's value.
  • Acceptance Criteria:
    • POST /v1/accounts/setup checks the tenant's current account count.
    • Rejects the request with 403 Forbidden and an upgrade prompt if the limit (1) is reached on the free tier.
  • Estimate: 2
  • Dependencies: Story 4.3
  • Technical Notes: Check the TENANT#<id> metadata record to determine the subscription tier.
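The limit check itself is a small pure function. Only the free-tier cap of 1 comes from the story; the ACCOUNT_LIMITS map and the return shape are assumptions:

```typescript
// Tier -> max connected AWS accounts. Only the free cap of 1 is specified;
// the pro limit here is a placeholder.
const ACCOUNT_LIMITS: Record<string, number> = {
  free: 1,
  pro: Number.POSITIVE_INFINITY,
};

export function canConnectAccount(
  tier: string,
  currentCount: number,
): { allowed: true } | { allowed: false; status: 403; message: string } {
  const limit = ACCOUNT_LIMITS[tier] ?? 0; // unknown tier: fail closed
  if (currentCount >= limit) {
    return {
      allowed: false,
      status: 403,
      message: `The ${tier} tier allows ${limit} connected account(s). Upgrade to connect more.`,
    };
  }
  return { allowed: true };
}
```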

Story 9.2: Stripe Integration & Upgrade Flow

  • As a user, I want to easily upgrade to a paid subscription, so that I can connect multiple AWS accounts and access premium features.
  • Acceptance Criteria:
    • Create a Stripe Checkout session endpoint (POST /v1/billing/checkout).
    • Configure a Stripe webhook handler to listen for checkout.session.completed and customer.subscription.deleted.
    • Update the tenant's tier to pro in DynamoDB upon successful payment.
  • Estimate: 5
  • Dependencies: Story 4.2
  • Technical Notes: The Pro tier is $19/account/month. Use Stripe Billing's per-unit pricing model tied to the number of active AWS accounts.

Story 9.3: API Key Management (V1 Foundation)

  • As a power user, I want to generate an API key, so that I can programmatically interact with my dd0c account in the future.
  • Acceptance Criteria:
    • POST /v1/api-keys endpoint to generate a secure, scoped API key.
    • Hash the API key before storing it in DynamoDB (TENANT#<id>#APIKEY#<hash>).
    • Display the plain-text key only once during creation.
  • Estimate: 3
  • Dependencies: Story 5.1
  • Technical Notes: This lays the groundwork for the V2 Business tier API access.

Epic 10: Transparent Factory Compliance

Description: Cross-cutting epic ensuring dd0c/cost adheres to the 5 Transparent Factory tenets. A cost anomaly detector that auto-alerts on spending must itself be governed — false positives erode trust, false negatives cost money.

Story 10.1: Atomic Flagging — Feature Flags for Anomaly Detection Rules

As a solo founder, I want every new anomaly scoring algorithm, baseline model, and alert threshold behind a feature flag (default: off), so that a bad scoring change doesn't flood customers with false-positive cost alerts.

Acceptance Criteria:

  • OpenFeature SDK integrated into the anomaly scoring engine. V1: env-var or JSON file provider.
  • All flags evaluate locally — no network calls during cost event processing.
  • Every flag has an owner and a ttl (max 14 days). CI blocks if a flag past its ttl is still rolled out at 100% instead of being removed.
  • Automated circuit breaker: if a flagged scoring rule generates >3x baseline alert volume over 1 hour, the flag auto-disables. Suppressed alerts buffered in DLQ for review.
  • Flags required for: new baseline algorithms, Z-score thresholds, instance novelty scoring, actor novelty detection, new AWS service parsers.

Estimate: 5 points
Dependencies: Epic 2 (Anomaly Detection Engine)
Technical Notes:

  • Circuit breaker tracks alert-per-account rate in Redis with 1-hour sliding window.
  • DLQ: SQS queue. On circuit break, alerts are replayed once the flag is fixed or removed.
  • For the "no baseline" fast-path (>$5/hr resources), this is NOT behind a flag — it's a safety net that's always on.

Story 10.2: Elastic Schema — Additive-Only for Cost Event Tables

As a solo founder, I want all DynamoDB cost event and TimescaleDB baseline schema changes to be strictly additive, so that rollbacks never corrupt historical spending data or break baseline calculations.

Acceptance Criteria:

  • CI rejects migrations containing DROP, ALTER ... TYPE, or RENAME on existing columns/attributes.
  • New fields use _v2 suffix for breaking changes.
  • All event parsers ignore unknown fields (e.g., Zod schemas that strip unknown keys in TypeScript, or Pydantic extra="ignore" in Python services).
  • Dual-write during migration windows within the same transaction.
  • Every migration includes sunset_date comment (max 30 days).

Estimate: 3 points
Dependencies: Epic 1 (CloudTrail Ingestion)
Technical Notes:

  • CostEvent records in DynamoDB are append-only — never mutate historical events.
  • Baseline models in TimescaleDB: new algorithm versions write to new continuous aggregate, old aggregate remains queryable during transition.
  • GSI changes: add new GSIs, never remove old ones until sunset.

Story 10.3: Cognitive Durability — Decision Logs for Scoring Algorithms

As a future maintainer, I want every change to anomaly scoring weights, Z-score thresholds, or baseline learning rates accompanied by a decision_log.json, so that I understand why the system flagged (or missed) a $3,000 EC2 instance.

Acceptance Criteria:

  • decision_log.json schema: { prompt, reasoning, alternatives_considered, confidence, timestamp, author }.
  • CI requires a decision log for PRs touching src/scoring/, src/baseline/, or src/detection/.
  • Cyclomatic complexity cap of 10 enforced in CI.
  • Decision logs in docs/decisions/.

Estimate: 2 points
Dependencies: None
Technical Notes:

  • Threshold changes are the highest-risk decisions — document: "Why Z-score > 2.5 and not 2.0? What's the false-positive rate at each threshold?"
  • Include sample cost events showing before/after scoring behavior in decision logs.

Story 10.4: Semantic Observability — AI Reasoning Spans on Anomaly Scoring

As an engineer investigating a missed cost anomaly, I want every anomaly scoring decision to emit an OpenTelemetry span with full reasoning metadata, so that I can trace exactly why a $500/hr GPU instance wasn't flagged.

Acceptance Criteria:

  • Every CostEvent evaluation creates an anomaly_scoring span.
  • Span attributes: cost.account_id_hash, cost.service, cost.anomaly_score, cost.z_score, cost.instance_novelty, cost.actor_novelty, cost.alert_triggered (bool), cost.baseline_days (how many days of baseline data existed).
  • If no baseline exists: cost.fast_path_triggered (bool) and cost.hourly_rate.
  • Spans export via OTLP. No PII — account IDs hashed, actor ARNs hashed.

Estimate: 3 points
Dependencies: Epic 2 (Anomaly Detection Engine)
Technical Notes:

  • Use the OpenTelemetry Node.js SDK with the OTLP exporter (the Lambdas are TypeScript). Batch span export, since cost events can be high volume.
  • The "no baseline fast-path" span is especially important — it's the safety net for new accounts.
  • Include cost.baseline_days so you can correlate alert accuracy with baseline maturity.

Story 10.5: Configurable Autonomy — Governance for Cost Alerting

As a solo founder, I want a policy.json that controls whether dd0c/cost can auto-alert customers or only log anomalies internally, so that I can validate scoring accuracy before enabling customer-facing notifications.

Acceptance Criteria:

  • policy.json defines governance_mode: strict (log-only, no customer alerts) or audit (auto-alert with logging).
  • Default for new accounts: strict for first 14 days (baseline learning period), then auto-promote to audit.
  • panic_mode: when true, all alerting stops. Anomalies are still scored and logged but no notifications sent. Dashboard shows "alerting paused" banner.
  • Per-account governance override: customers can set their own mode. Can only be MORE restrictive.
  • All policy decisions logged: "Alert for account X suppressed by strict mode", "Auto-promoted account Y to audit mode after 14-day baseline".

Estimate: 3 points
Dependencies: Epic 3 (Notification Service)
Technical Notes:

  • The 14-day auto-promotion is key — it prevents alert spam during baseline learning while ensuring customers eventually get value.
  • Auto-promotion check: daily cron. If account has ≥14 days of baseline data AND false-positive rate <10%, promote to audit.
  • Panic mode: Redis key dd0c:panic. Notification engine short-circuits on this key.
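The mode-resolution rules above (overrides can only tighten, panic halts all alerting, 14-day promotion gate) are small pure functions. A sketch with assumed names:

```typescript
type Mode = "strict" | "audit";

// strict (log-only) is more restrictive than audit (auto-alert with logging).
const RESTRICTIVENESS: Record<Mode, number> = { audit: 0, strict: 1 };

// Customer overrides may only tighten, never loosen, the platform mode.
export function effectiveMode(platformMode: Mode, accountOverride?: Mode): Mode {
  if (!accountOverride) return platformMode;
  return RESTRICTIVENESS[accountOverride] >= RESTRICTIVENESS[platformMode]
    ? accountOverride
    : platformMode;
}

export function shouldNotify(mode: Mode, panic: boolean): boolean {
  if (panic) return false; // panic_mode halts all customer alerting
  return mode === "audit"; // strict = score and log only
}

// Daily cron check from the notes: >=14 days of baseline AND <10% false positives.
export function shouldPromote(baselineDays: number, falsePositiveRate: number): boolean {
  return baselineDays >= 14 && falsePositiveRate < 0.1;
}
```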

Epic 10 Summary

Story   Tenet                     Points
10.1    Atomic Flagging           5
10.2    Elastic Schema            3
10.3    Cognitive Durability      2
10.4    Semantic Observability    3
10.5    Configurable Autonomy     3
Total                             16