dd0c: full product research pipeline - 6 products, 8 phases each
Products: route, drift, alert, portal, cost, run
Phases: brainstorm, design-thinking, innovation-strategy, party-mode,
product-brief, architecture, epics (incl. Epic 10 TF compliance),
test-architecture (TDD strategy)
Brand strategy and market research included.

products/05-aws-cost-anomaly/epics/epics.md

# dd0c/cost — V1 MVP Epics

This document breaks down the dd0c/cost MVP into implementable Epics and Stories. Stories are sized for a solo founder to complete in 1-3 days (typically 1-5 points).

## Epic 1: CloudTrail Ingestion

**Description:** Build the real-time event pipeline that receives CloudTrail events from customer accounts, filters for cost-relevant actions (EC2, RDS, Lambda), normalizes them into `CostEvents`, and estimates their on-demand cost. This is the foundational data ingestion layer.

### User Stories

**Story 1.1: Cross-Account EventBridge Bus**

- **As a** dd0c system, **I want** to receive CloudTrail events from external customer AWS accounts via EventBridge, **so that** I can process them centrally without running agents in customer accounts.
- **Acceptance Criteria:**
  - `dd0c-cost-bus` created in dd0c's AWS account.
  - Resource policy allows `events:PutEvents` from any AWS account (scoped by external ID/trust later, but fundamentally open to receive).
  - Test events sent from a separate AWS account successfully arrive on the bus.
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Use AWS CDK. Ensure the bus is configured in `us-east-1`.

**Story 1.2: SQS Ingestion Queue & Dead-Letter Queue**

- **As a** data pipeline, **I want** events routed from EventBridge to an SQS FIFO queue, **so that** I can process them in order, deduplicate them, and handle bursts without dropping data.
- **Acceptance Criteria:**
  - EventBridge rule routes matching events to the `event-ingestion.fifo` queue.
  - SQS FIFO configured with `MessageGroupId` = accountId and deduplication enabled.
  - DLQ configured after 3 retries.
- **Estimate:** 2
- **Dependencies:** Story 1.1
- **Technical Notes:** CloudTrail can emit duplicates; use `eventID` as the SQS deduplication ID.

**Story 1.3: Static Pricing Tables**

- **As an** event processor, **I want** local static lookup tables for EC2, RDS, and Lambda on-demand pricing, **so that** I can estimate hourly costs in milliseconds without calling the slow AWS Pricing API.
- **Acceptance Criteria:**
  - JSON/TypeScript dicts created for the top 20 instance types for EC2 and RDS, plus Lambda per-GB-second rates.
  - Pricing covers `us-east-1` (and placeholders for other regions if needed).
- **Estimate:** 2
- **Dependencies:** None
- **Technical Notes:** Keep it simple for V1. Hardcode the most common instance types; we don't need the entire AWS price list yet.

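The static pricing table in Story 1.3 can be sketched as a plain TypeScript module. The rates below are illustrative placeholders, not current AWS prices, and the function names are ours:

```typescript
// Illustrative static pricing tables (example rates, NOT live AWS prices).
type ServicePricing = Record<string, number>; // instance type -> USD per hour

const EC2_ON_DEMAND: ServicePricing = {
  "t3.micro": 0.0104,
  "m5.large": 0.096,
  "c5.xlarge": 0.17,
  "p3.2xlarge": 3.06,
};

const RDS_ON_DEMAND: ServicePricing = {
  "db.t3.micro": 0.017,
  "db.m5.large": 0.171,
};

// Lambda is priced per GB-second; convert to an hourly estimate.
const LAMBDA_PER_GB_SECOND = 0.0000166667;

function estimateHourlyCost(
  service: "ec2" | "rds" | "lambda",
  instanceType: string,
  lambdaMemoryMb = 128,
): number | undefined {
  switch (service) {
    case "ec2":
      return EC2_ON_DEMAND[instanceType];
    case "rds":
      return RDS_ON_DEMAND[instanceType];
    case "lambda":
      // Worst case: the function runs continuously for a full hour.
      return (lambdaMemoryMb / 1024) * LAMBDA_PER_GB_SECOND * 3600;
  }
}
```

Returning `undefined` for an unknown instance type lets the processor fall back to a "price unknown" event rather than failing the batch.
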
**Story 1.4: Event Processor Lambda**

- **As an** event pipeline, **I want** a Lambda function to poll the SQS queue, normalize raw CloudTrail events into the `CostEvent` schema, and write them to DynamoDB, **so that** downstream systems have clean, standardized data.
- **Acceptance Criteria:**
  - Lambda polls SQS (batch size 10).
  - Parses `RunInstances`, `CreateDBInstance`, `CreateFunction20150331`, etc.
  - Extracts actor (IAM user/role ARN), resource ID, and region.
  - Looks up pricing and appends `estimatedHourlyCost`.
  - Writes the `CostEvent` to the DynamoDB `dd0c-cost-main` table.
- **Estimate:** 5
- **Dependencies:** Story 1.2, Story 1.3
- **Technical Notes:** Implement idempotency. Use DynamoDB single-table design. Partition key: `ACCOUNT#<id>`, sort key: `EVENT#<timestamp>#<eventId>`.

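The key scheme from Story 1.4's technical notes might look like the sketch below; the `putCondition` idempotency guard is our assumption about how duplicate CloudTrail deliveries would be rejected:

```typescript
// Key builders for the dd0c-cost-main single-table design (sketch).
// PK: ACCOUNT#<id>, SK: EVENT#<iso-timestamp>#<eventId> — events sort
// chronologically within an account partition.
interface CostEventKey {
  pk: string;
  sk: string;
}

function costEventKey(accountId: string, timestamp: string, eventId: string): CostEventKey {
  return {
    pk: `ACCOUNT#${accountId}`,
    sk: `EVENT#${timestamp}#${eventId}`,
  };
}

// Idempotency: a conditional put on the full PK/SK rejects a duplicate
// delivery of the same eventId instead of double-counting it.
const putCondition = "attribute_not_exists(pk) AND attribute_not_exists(sk)";
```

Because `eventId` is part of the sort key, a redelivered event maps to the same item and the condition expression turns the second write into a no-op.
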
## Epic 2: Anomaly Detection Engine

**Description:** Implement the baseline learning and anomaly scoring algorithms. The engine evaluates incoming `CostEvent` records against account-specific, service-specific historical spending baselines to flag unusual spikes, new instance types, or unusual actors.

### User Stories

**Story 2.1: Baseline Storage & Retrieval**

- **As an** anomaly scorer, **I want** to read and write spending baselines per account/service/resource from DynamoDB, **so that** I have a statistical foundation to evaluate new events against.
- **Acceptance Criteria:**
  - `Baseline` schema created in DynamoDB (`BASELINE#<account_id>`).
  - Read/write logic implemented for running means, standard deviations, maximum observed values, and expected actors/instance types.
- **Estimate:** 3
- **Dependencies:** Story 1.4
- **Technical Notes:** Update baselines with DynamoDB `ADD` expressions to avoid race conditions.

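One way to make running means and standard deviations compatible with atomic `ADD` expressions, as Story 2.1's notes suggest, is to store only count, sum, and sum of squares, and derive the statistics at read time. The attribute names here are illustrative:

```typescript
// Counters that DynamoDB `ADD` expressions can maintain atomically:
// count, sum, and sum of squares. Mean/std are derived on read.
interface BaselineCounters {
  count: number;
  sum: number;
  sumSq: number;
}

function addObservation(b: BaselineCounters, cost: number): BaselineCounters {
  return { count: b.count + 1, sum: b.sum + cost, sumSq: b.sumSq + cost * cost };
}

function mean(b: BaselineCounters): number {
  return b.count === 0 ? 0 : b.sum / b.count;
}

// Population standard deviation: sqrt(E[x^2] - E[x]^2), clamped at zero
// to absorb floating-point drift.
function stdDev(b: BaselineCounters): number {
  if (b.count < 2) return 0;
  const m = mean(b);
  return Math.sqrt(Math.max(0, b.sumSq / b.count - m * m));
}
```

Each counter maps to a numeric attribute, so concurrent scorer Lambdas can update the same baseline record without a read-modify-write race.
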
**Story 2.2: Cold-Start Absolute Thresholds**

- **As a** new customer, **I want** my account to immediately flag highly expensive resources (>$5/hr) even if I have no baseline, **so that** I don't wait 14 days for the system to "learn" a $3,000 mistake.
- **Acceptance Criteria:**
  - Implement absolute threshold heuristics: >$0.50/hr = INFO, >$5/hr = WARNING, >$25/hr = CRITICAL.
  - Apply this logic when account maturity is `cold-start` (<14 days or <20 events).
- **Estimate:** 2
- **Dependencies:** Story 2.1
- **Technical Notes:** Implement a `scoreAnomaly` function that checks the maturity state of the baseline.

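The cold-start heuristics above reduce to a small pure function. The thresholds come straight from the acceptance criteria; the maturity gate mirrors the "<14 days or <20 events" rule:

```typescript
type Severity = "none" | "info" | "warning" | "critical";

// Absolute thresholds from the acceptance criteria:
// >$0.50/hr = INFO, >$5/hr = WARNING, >$25/hr = CRITICAL.
function coldStartSeverity(hourlyCost: number): Severity {
  if (hourlyCost > 25) return "critical";
  if (hourlyCost > 5) return "warning";
  if (hourlyCost > 0.5) return "info";
  return "none";
}

// Maturity gate: use cold-start logic below 14 days or 20 observed events.
function isColdStart(baselineDays: number, eventCount: number): boolean {
  return baselineDays < 14 || eventCount < 20;
}
```
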
**Story 2.3: Statistical Anomaly Scoring**

- **As an** anomaly scorer, **I want** to calculate composite anomaly scores using Z-scores, instance novelty, and actor novelty, **so that** I reduce false positives and only flag truly unusual behavior.
- **Acceptance Criteria:**
  - Implement Z-score calculation (event cost vs. baseline mean).
  - Implement novelty checks (is this instance type or actor new?).
  - Composite score logic computes severity (`info`, `warning`, `critical`).
  - Creates an `AnomalyRecord` in DynamoDB if a threshold is crossed.
- **Estimate:** 5
- **Dependencies:** Story 2.1
- **Technical Notes:** Add unit tests covering edge cases (new actor + cheap instance vs. familiar actor + expensive instance).

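A minimal sketch of the composite scorer described above. The weights and severity cut-offs are placeholder assumptions to be tuned, not values specified by the story:

```typescript
interface BaselineView {
  mean: number;
  stdDev: number;
  expectedInstanceTypes: Set<string>;
  expectedActors: Set<string>;
}

interface ScoredEvent {
  score: number;
  severity: "none" | "info" | "warning" | "critical";
}

function scoreAnomaly(
  hourlyCost: number,
  instanceType: string,
  actorArn: string,
  baseline: BaselineView,
): ScoredEvent {
  // Z-score: how far this event's cost sits from the account's norm.
  const z = baseline.stdDev > 0 ? (hourlyCost - baseline.mean) / baseline.stdDev : 0;
  const instanceNovelty = baseline.expectedInstanceTypes.has(instanceType) ? 0 : 1;
  const actorNovelty = baseline.expectedActors.has(actorArn) ? 0 : 1;

  // Weighted composite (illustrative weights): cost deviation dominates,
  // novelty signals amplify it.
  const score = Math.max(0, z) + 1.5 * instanceNovelty + 1.0 * actorNovelty;

  let severity: ScoredEvent["severity"] = "none";
  if (score >= 5) severity = "critical";
  else if (score >= 3) severity = "warning";
  else if (score >= 1.5) severity = "info";
  return { score, severity };
}
```

This shape makes the story's edge cases directly testable: a familiar actor launching a familiar, normally priced instance scores zero, while a novel actor plus a novel, expensive instance stacks all three signals.
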
**Story 2.4: Feedback Loop ("Mark as Expected")**

- **As an** anomaly engine, **I want** to update baselines when a user marks an anomaly as expected, **so that** I learn from feedback and stop alerting on normal workflows.
- **Acceptance Criteria:**
  - Provide a function to append a resource type and actor to `expectedInstanceTypes` and `expectedActors`.
  - Future events matching this suppressed pattern get a reduced anomaly score.
- **Estimate:** 3
- **Dependencies:** Story 2.3
- **Technical Notes:** This API will be called by the Slack action handler.

## Epic 3: Notification Service

**Description:** Build the Slack-first notification engine. Deliver rich Block Kit alerts containing anomaly context, estimated costs, and manual remediation suggestions. This is the product's primary user interface for V1.

### User Stories

**Story 3.1: SQS Alert Queue & Notifier Lambda**

- **As a** notification engine, **I want** to poll an alert queue and trigger a Lambda function for every new anomaly, **so that** I can format and send alerts asynchronously without blocking the ingestion path.
- **Acceptance Criteria:**
  - Create a standard SQS `alert-queue` for anomalies.
  - Create a `notifier` Lambda that polls the queue.
  - SQS retries via visibility timeout on Slack API rate limits (429).
- **Estimate:** 2
- **Dependencies:** Story 2.3
- **Technical Notes:** The scorer Lambda pushes the anomaly ID to this queue.

**Story 3.2: Slack Block Kit Formatting**

- **As a** user, **I want** anomaly alerts formatted nicely in Slack, **so that** I can instantly understand what resource launched, who launched it, the estimated cost, and why it was flagged.
- **Acceptance Criteria:**
  - Use Slack Block Kit to design a highly readable card.
  - Include: resource type, region, cost/hr, actor, timestamp, and the reason (e.g., "New instance type never seen").
  - Test rendering for EC2, RDS, and Lambda anomalies.
- **Estimate:** 3
- **Dependencies:** Story 3.1
- **Technical Notes:** Include a "Why this alert" section detailing the anomaly signals.

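One possible Block Kit layout covering the required fields. The exact copy and block arrangement are open design choices, not the final card:

```typescript
// Minimal Block Kit card for an anomaly alert (field set per the
// acceptance criteria; layout is a sketch).
interface AnomalyView {
  resourceType: string;
  region: string;
  costPerHour: number;
  actor: string;
  timestamp: string;
  reason: string;
}

function anomalyBlocks(a: AnomalyView): object[] {
  return [
    { type: "header", text: { type: "plain_text", text: "Cost anomaly detected" } },
    {
      type: "section",
      fields: [
        { type: "mrkdwn", text: `*Resource:*\n${a.resourceType}` },
        { type: "mrkdwn", text: `*Region:*\n${a.region}` },
        { type: "mrkdwn", text: `*Cost/hr:*\n$${a.costPerHour.toFixed(2)}` },
        { type: "mrkdwn", text: `*Actor:*\n${a.actor}` },
      ],
    },
    {
      type: "context",
      elements: [
        { type: "mrkdwn", text: `*Why this alert:* ${a.reason} • ${a.timestamp}` },
      ],
    },
  ];
}
```

The array goes straight into the `blocks` field of a `chat.postMessage` payload, so the same builder can be exercised in unit tests against EC2, RDS, and Lambda fixtures.
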
**Story 3.3: Manual Remediation Suggestions**

- **As a** user, **I want** the Slack alert to include CLI commands to stop or terminate the anomalous resource, **so that** I can fix the issue immediately even before one-click buttons are available.
- **Acceptance Criteria:**
  - Block Kit template appends a `Suggested actions` section.
  - Generate a valid `aws ec2 stop-instances` or `aws rds stop-db-instance` command based on the resource type and region.
- **Estimate:** 2
- **Dependencies:** Story 3.2
- **Technical Notes:** For V1, dd0c makes no actual remediation API calls. This prevents accidental deletions and builds trust first.

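The suggested-action generator can be a simple string template over the resource type, ID, and region. Both commands are standard AWS CLI invocations; per the technical notes, dd0c only suggests them and never executes them in V1:

```typescript
// Build the copy-pasteable CLI command for the "Suggested actions" section.
function remediationCommand(
  resourceType: "ec2" | "rds",
  resourceId: string,
  region: string,
): string {
  switch (resourceType) {
    case "ec2":
      return `aws ec2 stop-instances --instance-ids ${resourceId} --region ${region}`;
    case "rds":
      return `aws rds stop-db-instance --db-instance-identifier ${resourceId} --region ${region}`;
  }
}
```
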
**Story 3.4: Daily Digest Generator**

- **As a** user, **I want** a daily summary of my spending and any minor anomalies, **so that** I don't get paged for every $0.50 resource but still have visibility.
- **Acceptance Criteria:**
  - Create an EventBridge Scheduler rule (e.g., cron at 09:00 UTC).
  - Lambda queries the last 24h of anomalies and baseline metrics.
  - Sends a digest message (spend estimate, anomalies resolved vs. open, zombie watch summary).
- **Estimate:** 5
- **Dependencies:** Story 3.2
- **Technical Notes:** Query the DynamoDB GSI for recent anomalies (`ANOMALY#<id>#STATUS#*`).

## Epic 4: Customer Onboarding

**Description:** Automate the 5-minute setup experience. Create the CloudFormation templates and cross-account IAM roles required for dd0c to securely read CloudTrail events and resource metadata without touching customer data or secrets.

### User Stories

**Story 4.1: IAM Read-Only CloudFormation Template**

- **As a** customer, **I want** to deploy a simple, open-source CloudFormation template, **so that** I can grant dd0c secure, read-only access to my AWS account without worrying about compromised credentials.
- **Acceptance Criteria:**
  - Create the `dd0c-cost-readonly.yaml` template.
  - Role `dd0c-cost-readonly` with an `sts:AssumeRole` trust policy.
  - Requires an `ExternalId` parameter.
  - Allows `ec2:Describe*`, `rds:Describe*`, `lambda:List*`, `cloudwatch:*`, `ce:GetCostAndUsage`, `tag:GetResources`.
  - Hosted on a public S3 bucket (`dd0c-cf-templates`).
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Include an EventBridge rule that forwards cost-relevant CloudTrail events to dd0c's EventBridge bus (`arn:aws:events:...:dd0c-cost-bus`).

**Story 4.2: Cognito User Pool Authentication**

- **As a** platform, **I want** a secure identity provider, **so that** users can sign up quickly using GitHub or Google SSO.
- **Acceptance Criteria:**
  - Configure an Amazon Cognito user pool.
  - Enable GitHub and Google OIDC providers.
  - Provide a login URL and redirect to the dd0c app.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Amazon Cognito is free for the first 50k MAU, keeping V1 costs at zero.

**Story 4.3: Account Setup API Endpoint**

- **As a** new user, **I want** an API that initializes my tenant and generates a secure CloudFormation "quick-create" link, **so that** I can click one button to install the required AWS permissions.
- **Acceptance Criteria:**
  - `POST /v1/accounts/setup` created in API Gateway.
  - Validates the Cognito JWT.
  - Generates a unique UUIDv4 `externalId` per tenant/account.
  - Returns a URL pointing to the AWS Console CloudFormation quick-create page with pre-filled parameters.
- **Estimate:** 3
- **Dependencies:** Story 4.1, Story 4.2
- **Technical Notes:** The API Lambda should store the generated `externalId` in DynamoDB under the tenant record.

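The quick-create link from the acceptance criteria can be assembled following AWS's documented quick-create URL scheme (`#/stacks/create/review` with `templateURL`, `stackName`, and `param_<Name>` query parameters). The bucket, template, and stack names below come from Story 4.1:

```typescript
// Build a CloudFormation quick-create link pre-filled with the tenant's
// ExternalId. URL scheme per AWS's quick-create-link documentation.
function quickCreateUrl(externalId: string, region = "us-east-1"): string {
  const templateUrl = encodeURIComponent(
    "https://dd0c-cf-templates.s3.amazonaws.com/dd0c-cost-readonly.yaml",
  );
  return (
    `https://${region}.console.aws.amazon.com/cloudformation/home?region=${region}` +
    `#/stacks/create/review?templateURL=${templateUrl}` +
    `&stackName=dd0c-cost-readonly` +
    `&param_ExternalId=${encodeURIComponent(externalId)}`
  );
}
```

The user lands on the stack review page with everything pre-filled, so installation is a single "Create stack" click.
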
**Story 4.4: Role Validation & Activation**

- **As a** dd0c system, **I want** to validate a user's AWS account connection by assuming their newly created role, **so that** I know I can receive events and start anomaly detection.
- **Acceptance Criteria:**
  - `POST /v1/accounts` API created (receives `awsAccountId`, `roleArn`).
  - Calls `sts:AssumeRole` using the `roleArn` and `externalId`.
  - On success, updates the account status to `active` in DynamoDB.
  - Automatically triggers a "Zombie Resource Scan" on connection.
- **Estimate:** 5
- **Dependencies:** Story 4.3
- **Technical Notes:** This is the critical moment. If the `AssumeRole` call fails, return an error explaining the `ExternalId` mismatch or missing permissions.

## Epic 5: Dashboard API

**Description:** Build the REST API for anomaly querying, account management, and basic metrics. V1 relies entirely on Slack for interaction, but a minimal API is needed for account settings and the upcoming V2 dashboard.

### User Stories

**Story 5.1: Account Retrieval API**

- **As a** user, **I want** to see my connected AWS accounts, **so that** I can view their health status and disconnect them if needed.
- **Acceptance Criteria:**
  - `GET /v1/accounts` API created (returns `accountId`, status, `baselineMaturity`).
  - `DELETE /v1/accounts/{id}` API created.
  - Returns `401 Unauthorized` without a valid Cognito JWT.
  - Scopes database queries to `tenantId`.
- **Estimate:** 3
- **Dependencies:** Story 4.4
- **Technical Notes:** The disconnect endpoint should mark the account as `disconnecting` and trigger a background Lambda to delete the data within 72 hours.

**Story 5.2: Anomaly Listing API**

- **As a** user, **I want** to view a list of recent anomalies, **so that** I can review past alerts or check if anything was missed.
- **Acceptance Criteria:**
  - `GET /v1/anomalies` API created.
  - Queries DynamoDB GSI3 (`ANOMALY#<id>#STATUS#*`) for the authenticated account.
  - Supports `since`, `status`, and `severity` filters.
  - Implements basic pagination.
- **Estimate:** 5
- **Dependencies:** Story 2.3
- **Technical Notes:** Include `slackMessageUrl` if the anomaly triggered a Slack alert.

**Story 5.3: Baseline Overrides**

- **As a** user, **I want** to adjust anomaly sensitivity for specific services or resource types, **so that** I don't get paged for expected batch-processing spikes.
- **Acceptance Criteria:**
  - `PATCH /v1/accounts/{id}/baselines/{service}/{type}` API created.
  - Modifies the DynamoDB baseline record to update `sensitivityOverride` (`low`, `medium`, `high`).
- **Estimate:** 2
- **Dependencies:** Story 2.1
- **Technical Notes:** Valid values must be enforced by the API schema.

## Epic 6: Dashboard UI

**Description:** Build the initial Next.js/React frontend. While V1 focuses on Slack, the web dashboard handles onboarding, account connection, Slack OAuth, and basic anomaly viewing for users who prefer the web.

### User Stories

**Story 6.1: Next.js Boilerplate & Auth**

- **As a** user, **I want** to sign in to the dd0c/cost portal, **so that** I can configure my account and view my AWS connections.
- **Acceptance Criteria:**
  - Initialize a Next.js app with Tailwind CSS.
  - Implement AWS Amplify or `next-auth` for Cognito integration.
  - Landing page with a `Start Free` button.
  - Protect `/dashboard` routes.
- **Estimate:** 3
- **Dependencies:** Story 4.2
- **Technical Notes:** Keep the design clean and Vercel-like. The goal is to get the user authenticated in under 10 seconds.

**Story 6.2: Onboarding Flow**

- **As a** new user, **I want** a simple 3-step wizard to connect AWS and Slack, **so that** I don't get lost in documentation.
- **Acceptance Criteria:**
  - "Connect AWS Account" screen.
  - Generates the CloudFormation quick-create URL.
  - Polls `/v1/accounts/{id}/health` for a successful connection.
  - "Connect Slack" screen initiates the OAuth flow.
- **Estimate:** 5
- **Dependencies:** Story 4.3, Story 4.4, Story 6.1
- **Technical Notes:** Provide a fallback manual input field if auto-polling fails or the user closes the AWS Console window early.

**Story 6.3: Basic Dashboard View**

- **As a** user, **I want** a simple dashboard showing my connected accounts, recent anomalies, and estimated monthly cost, **so that** I have a high-level view outside of Slack.
- **Acceptance Criteria:**
  - Render an `Account Overview` table.
  - Fetch anomalies via `/v1/anomalies` and display them in a simple list or timeline.
  - Indicate the account's baseline learning phase (e.g., "14 days left in learning phase").
- **Estimate:** 5
- **Dependencies:** Story 5.1, Story 5.2, Story 6.1
- **Technical Notes:** The V1 UI shouldn't be complex. Avoid graphs and heavy chart libraries for the MVP.

## Epic 7: Slack Bot

**Description:** Build the Slack bot interaction model. This includes the OAuth installation flow, parsing incoming slash commands (`/dd0c status`, `/dd0c anomalies`, `/dd0c digest`), and handling interactive message payloads for actions like snoozing or marking alerts as expected.

### User Stories

**Story 7.1: Slack OAuth Installation Flow**

- **As a** user, **I want** to securely install the dd0c app to my Slack workspace, **so that** the bot can send alerts to my designated channels.
- **Acceptance Criteria:**
  - `GET /v1/slack/install` initiates the Slack OAuth v2 flow.
  - `GET /v1/slack/oauth_redirect` handles the callback, exchanging the code for a bot token.
  - Bot token and workspace details are securely stored in DynamoDB under the tenant's record.
- **Estimate:** 3
- **Dependencies:** Story 4.2
- **Technical Notes:** Request minimum scopes: `chat:write`, `commands`, `incoming-webhook`. Encrypt the Slack bot token at rest.

**Story 7.2: Slash Command Parser & Router**

- **As a** Slack user, **I want** to use commands like `/dd0c status`, **so that** I can interact with the system without leaving my chat window.
- **Acceptance Criteria:**
  - `POST /v1/slack/commands` API endpoint created to receive Slack command webhooks.
  - Validates Slack request signatures (HMAC-SHA256).
  - Routes `/dd0c status`, `/dd0c anomalies`, and `/dd0c digest` to their respective handler functions.
- **Estimate:** 3
- **Dependencies:** Story 7.1
- **Technical Notes:** Respond within 3 seconds, or acknowledge immediately with a 200 OK and use the `response_url` for delayed execution.

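Signature validation follows Slack's v0 signing scheme: HMAC-SHA256 over `v0:{timestamp}:{rawBody}` with the app's signing secret, compared against the `x-slack-signature` header. A sketch with replay protection:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a Slack request per the v0 signing scheme.
function verifySlackSignature(
  signingSecret: string,
  timestamp: string, // x-slack-request-timestamp header
  rawBody: string,   // unparsed request body
  signature: string, // x-slack-signature header, "v0=<hex>"
  nowSeconds = Math.floor(Date.now() / 1000),
): boolean {
  // Reject replays: stale timestamps (> 5 minutes) fail outright.
  if (Math.abs(nowSeconds - Number(timestamp)) > 60 * 5) return false;

  const base = `v0:${timestamp}:${rawBody}`;
  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(base).digest("hex");

  // Constant-time comparison to avoid timing side channels.
  if (expected.length !== signature.length) return false;
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```

The raw body must be hashed before any parsing; re-serializing a parsed payload will not reproduce Slack's signature.
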
**Story 7.3: Interactive Action Handler**

- **As a** user, **I want** to click buttons on anomaly alerts to snooze them or mark them as expected, **so that** I can tune the system's noise level instantly.
- **Acceptance Criteria:**
  - `POST /v1/slack/actions` API endpoint created to receive interactive payloads.
  - Validates Slack request signatures.
  - Handles the `mark_expected` action by updating the anomaly record and retraining the baseline.
  - Handles `snooze_Xh` actions by updating the `snoozeUntil` attribute.
  - Updates the original Slack message via the Slack API to reflect the action taken.
- **Estimate:** 5
- **Dependencies:** Story 3.2, Story 7.2
- **Technical Notes:** V1 only implements non-destructive actions (snooze, mark expected). No actual AWS remediation API calls yet.

## Epic 8: Infrastructure & DevOps

**Description:** Define the serverless infrastructure using AWS CDK. This epic covers deployment of the EventBridge buses, SQS queues, Lambda functions, and DynamoDB tables, plus setting up the CI/CD pipeline for automated testing and deployment.

### User Stories

**Story 8.1: Core Serverless Stack (CDK)**

- **As a** developer, **I want** the core ingestion and data storage infrastructure defined as code, **so that** I can deploy the dd0c platform reliably and repeatably.
- **Acceptance Criteria:**
  - AWS CDK (TypeScript) project initialized.
  - `dd0c-cost-main` DynamoDB table defined with GSIs and TTL.
  - `dd0c-cost-bus` EventBridge bus configured with resource policies allowing external puts.
  - `event-ingestion.fifo` and `alert-queue` SQS queues created.
- **Estimate:** 3
- **Dependencies:** None
- **Technical Notes:** Set DynamoDB to PAY_PER_REQUEST (on-demand) to minimize baseline costs.

**Story 8.2: Lambda Deployments & Triggers**

- **As a** developer, **I want** to deploy the Lambda functions and connect them to their respective triggers, **so that** the event-driven architecture works end-to-end.
- **Acceptance Criteria:**
  - CDK definitions for `event-processor`, `anomaly-scorer`, `notifier`, and the API handlers.
  - SQS event source mappings configured for the processor and notifier Lambdas.
  - API Gateway REST API configured with routes pointing to the API handler Lambda.
- **Estimate:** 5
- **Dependencies:** Story 8.1
- **Technical Notes:** Bundle Lambdas with the `NodejsFunction` construct (esbuild) to minimize cold starts. Set explicit memory and timeout values.

**Story 8.3: Observability & Alarms**

- **As an** operator, **I want** automated monitoring of the infrastructure, **so that** I am alerted if ingestion fails or components throttle.
- **Acceptance Criteria:**
  - CloudWatch alarms created for Lambda error rates (>5% in 5 minutes).
  - Alarms created for SQS DLQ depth (`ApproximateNumberOfMessagesVisible` > 0).
  - Alarms send notifications to an SNS `ops-alerts` topic.
- **Estimate:** 2
- **Dependencies:** Story 8.2
- **Technical Notes:** Keep V1 alarms simple to avoid alert fatigue.

**Story 8.4: CI/CD Pipeline Setup**

- **As a** solo founder, **I want** GitHub Actions to automatically test and deploy my code, **so that** I can push to `main` and have it live in minutes without manual deployment steps.
- **Acceptance Criteria:**
  - GitHub Actions workflow created for PRs (lint, test).
  - Workflow created for the `main` branch (lint, test, `cdk deploy --require-approval broadening`).
  - OIDC provider configured in AWS for passwordless GitHub Actions authentication.
- **Estimate:** 3
- **Dependencies:** Story 8.1
- **Technical Notes:** Use the AWS `configure-aws-credentials` action with `role-to-assume`.

## Epic 9: PLG & Free Tier

**Description:** Implement the product-led growth (PLG) foundations. This involves building a seamless self-serve signup flow, enforcing free-tier limits (1 AWS account), and providing the mechanism to upgrade to a paid tier via Stripe.

### User Stories

**Story 9.1: Free Tier Enforcement**

- **As a** platform, **I want** to limit free users to 1 connected AWS account, **so that** I can control infrastructure costs while letting users experience the product's value.
- **Acceptance Criteria:**
  - `POST /v1/accounts/setup` checks the tenant's current account count.
  - Rejects the request with `403 Forbidden` and an upgrade prompt if the free-tier limit (1) is reached.
- **Estimate:** 2
- **Dependencies:** Story 4.3
- **Technical Notes:** Check the `TENANT#<id>` metadata record to determine the subscription tier.

**Story 9.2: Stripe Integration & Upgrade Flow**

- **As a** user, **I want** to easily upgrade to a paid subscription, **so that** I can connect multiple AWS accounts and access premium features.
- **Acceptance Criteria:**
  - Create a Stripe Checkout session endpoint (`POST /v1/billing/checkout`).
  - Configure a Stripe webhook handler to listen for `checkout.session.completed` and `customer.subscription.deleted`.
  - Update the tenant's tier to `pro` in DynamoDB upon successful payment.
- **Estimate:** 5
- **Dependencies:** Story 4.2
- **Technical Notes:** The Pro tier is $19/account/month. Use Stripe Billing's per-unit pricing model tied to the number of active AWS accounts.

**Story 9.3: API Key Management (V1 Foundation)**

- **As a** power user, **I want** to generate an API key, **so that** I can programmatically interact with my dd0c account in the future.
- **Acceptance Criteria:**
  - `POST /v1/api-keys` endpoint to generate a secure, scoped API key.
  - Hash the API key before storing it in DynamoDB (`TENANT#<id>#APIKEY#<hash>`).
  - Display the plaintext key only once, at creation.
- **Estimate:** 3
- **Dependencies:** Story 5.1
- **Technical Notes:** This lays the groundwork for the V2 Business-tier API access.

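Key generation and hashing per the acceptance criteria might look like this; the `dd0c_` prefix and 32-byte key length are illustrative choices, not specified by the story:

```typescript
import { createHash, randomBytes } from "node:crypto";

// Lookup at request time: hash the presented key and fetch
// TENANT#<id>#APIKEY#<hash> from DynamoDB. Only the hash is ever stored.
function hashApiKey(key: string): string {
  return createHash("sha256").update(key).digest("hex");
}

// Generate once; the plaintext is shown to the user a single time and
// then discarded server-side.
function generateApiKey(): { plaintext: string; hash: string } {
  const plaintext = "dd0c_" + randomBytes(32).toString("base64url");
  return { plaintext, hash: hashApiKey(plaintext) };
}
```

A recognizable prefix like `dd0c_` also makes leaked keys easy to find with secret scanners.
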
---

## Epic 10: Transparent Factory Compliance

**Description:** Cross-cutting epic ensuring dd0c/cost adheres to the 5 Transparent Factory tenets. A cost anomaly detector that auto-alerts on spending must itself be governed — false positives erode trust, false negatives cost money.

### Story 10.1: Atomic Flagging — Feature Flags for Anomaly Detection Rules

**As a** solo founder, **I want** every new anomaly scoring algorithm, baseline model, and alert threshold behind a feature flag (default: off), **so that** a bad scoring change doesn't flood customers with false-positive cost alerts.

**Acceptance Criteria:**

- OpenFeature SDK integrated into the anomaly scoring engine. V1: env-var or JSON-file provider.
- All flags evaluate locally — no network calls during cost event processing.
- Every flag has an `owner` and a `ttl` (max 14 days). CI blocks if expired flags remain at 100%.
- Automated circuit breaker: if a flagged scoring rule generates >3x the baseline alert volume over 1 hour, the flag auto-disables. Suppressed alerts are buffered in a DLQ for review.
- Flags required for: new baseline algorithms, Z-score thresholds, instance novelty scoring, actor novelty detection, and new AWS service parsers.

**Estimate:** 5 points

**Dependencies:** Epic 2 (Anomaly Engine)

**Technical Notes:**

- The circuit breaker tracks the per-account alert rate in Redis with a 1-hour sliding window.
- DLQ: SQS queue. On circuit break, alerts are replayed once the flag is fixed or removed.
- The "no baseline" fast path (>$5/hr resources) is NOT behind a flag — it's a safety net that's always on.

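The circuit breaker above reduces to a sliding-window counter. Production would keep the window in Redis as the notes say; an in-memory array stands in here so the logic is self-contained, and the class name is ours:

```typescript
// Sliding-window circuit breaker for a flagged scoring rule (sketch;
// a Redis sorted set would replace the in-memory array in production).
class AlertCircuitBreaker {
  private timestamps: number[] = [];

  constructor(
    private readonly baselineAlertsPerHour: number,
    private readonly windowMs = 60 * 60 * 1000, // 1-hour window
    private readonly tripMultiplier = 3,        // >3x baseline trips the flag
  ) {}

  /** Record an alert; returns true when the flag should auto-disable. */
  recordAlert(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    // Keep only alerts inside the sliding window.
    this.timestamps = this.timestamps.filter((t) => nowMs - t < this.windowMs);
    return this.timestamps.length > this.tripMultiplier * this.baselineAlertsPerHour;
  }
}
```

When `recordAlert` returns true, the caller flips the flag off and routes subsequent suppressed alerts to the DLQ for replay.
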
### Story 10.2: Elastic Schema — Additive-Only for Cost Event Tables

**As a** solo founder, **I want** all DynamoDB cost event and TimescaleDB baseline schema changes to be strictly additive, **so that** rollbacks never corrupt historical spending data or break baseline calculations.

**Acceptance Criteria:**

- CI rejects migrations containing `DROP`, `ALTER ... TYPE`, or `RENAME` on existing columns/attributes.
- New fields use a `_v2` suffix for breaking changes.
- All event parsers ignore unknown fields (Pydantic `extra="ignore"` or the Go equivalent).
- Dual-write during migration windows happens within the same transaction.
- Every migration includes a `sunset_date` comment (max 30 days).

**Estimate:** 3 points

**Dependencies:** Epic 1 (CloudTrail Ingestion)

**Technical Notes:**

- `CostEvent` records in DynamoDB are append-only — never mutate historical events.
- Baseline models in TimescaleDB: new algorithm versions write to a new continuous aggregate; the old aggregate remains queryable during the transition.
- GSI changes: add new GSIs, never remove old ones until sunset.

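The CI migration guard can start as a few regex checks. These patterns are a starting point, not a full SQL parser, and will need tightening against false positives:

```typescript
// CI guard: reject destructive statements in migration files (sketch).
const FORBIDDEN = [
  /\bDROP\s+(TABLE|COLUMN|INDEX)\b/i, // structural deletions
  /\bALTER\s+COLUMN\b[\s\S]*\bTYPE\b/i, // in-place type changes
  /\bRENAME\b/i, // renames break readers mid-deploy
];

function migrationIsAdditive(sql: string): boolean {
  return !FORBIDDEN.some((re) => re.test(sql));
}
```

A migration that only adds columns (e.g. `ALTER TABLE events ADD COLUMN cost_v2 numeric`) passes, matching the `_v2`-suffix convention in the acceptance criteria.
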
### Story 10.3: Cognitive Durability — Decision Logs for Scoring Algorithms

**As a** future maintainer, **I want** every change to anomaly scoring weights, Z-score thresholds, or baseline learning rates accompanied by a `decision_log.json`, **so that** I understand why the system flagged (or missed) a $3,000 EC2 instance.

**Acceptance Criteria:**

- `decision_log.json` schema: `{ prompt, reasoning, alternatives_considered, confidence, timestamp, author }`.
- CI requires a decision log for PRs touching `src/scoring/`, `src/baseline/`, or `src/detection/`.
- A cyclomatic complexity cap of 10 is enforced in CI.
- Decision logs live in `docs/decisions/`.

**Estimate:** 2 points

**Dependencies:** None

**Technical Notes:**

- Threshold changes are the highest-risk decisions — document: "Why Z-score > 2.5 and not 2.0? What's the false-positive rate at each threshold?"
- Include sample cost events showing before/after scoring behavior in decision logs.

### Story 10.4: Semantic Observability — AI Reasoning Spans on Anomaly Scoring

**As an** engineer investigating a missed cost anomaly, **I want** every anomaly scoring decision to emit an OpenTelemetry span with full reasoning metadata, **so that** I can trace exactly why a $500/hr GPU instance wasn't flagged.

**Acceptance Criteria:**

- Every `CostEvent` evaluation creates an `anomaly_scoring` span.
- Span attributes: `cost.account_id_hash`, `cost.service`, `cost.anomaly_score`, `cost.z_score`, `cost.instance_novelty`, `cost.actor_novelty`, `cost.alert_triggered` (bool), `cost.baseline_days` (how many days of baseline data existed).
- If no baseline exists: `cost.fast_path_triggered` (bool) and `cost.hourly_rate`.
- Spans export via OTLP. No PII — account IDs hashed, actor ARNs hashed.

**Estimate:** 3 points

**Dependencies:** Epic 2 (Anomaly Engine)

**Technical Notes:**

- Use the OpenTelemetry Python SDK with the OTLP exporter. Batch exports — cost events can be high volume.
- The "no baseline fast-path" span is especially important — it's the safety net for new accounts.
- Include `cost.baseline_days` so alert accuracy can be correlated with baseline maturity.

### Story 10.5: Configurable Autonomy — Governance for Cost Alerting

**As a** solo founder, **I want** a `policy.json` that controls whether dd0c/cost can auto-alert customers or only log anomalies internally, **so that** I can validate scoring accuracy before enabling customer-facing notifications.

**Acceptance Criteria:**

- `policy.json` defines `governance_mode`: `strict` (log-only, no customer alerts) or `audit` (auto-alert with logging).
- Default for new accounts: `strict` for the first 14 days (baseline learning period), then auto-promote to `audit`.
- `panic_mode`: when true, all alerting stops. Anomalies are still scored and logged, but no notifications are sent. The dashboard shows an "alerting paused" banner.
- Per-account governance override: customers can set their own mode, but only a MORE restrictive one.
- All policy decisions are logged: "Alert for account X suppressed by strict mode", "Auto-promoted account Y to audit mode after 14-day baseline".

**Estimate:** 3 points

**Dependencies:** Epic 3 (Notification Service)

**Technical Notes:**

- The 14-day auto-promotion is key — it prevents alert spam during baseline learning while ensuring customers eventually get value.
- Auto-promotion check: daily cron. If an account has ≥14 days of baseline data AND a false-positive rate <10%, promote to `audit`.
- Panic mode: Redis key `dd0c:panic`. The notification engine short-circuits on this key.

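The notify-or-suppress decision implied by the criteria above can be captured in one function. Field names mirror the `policy.json` sketch; the exact schema is still open:

```typescript
type GovernanceMode = "strict" | "audit";

interface TenantPolicy {
  governanceMode: GovernanceMode;
  panicMode: boolean;
}

// Should a scored anomaly become a customer-facing alert?
function shouldNotify(policy: TenantPolicy, accountOverride?: GovernanceMode): boolean {
  if (policy.panicMode) return false; // panic: score and log only, never alert

  // Per-account overrides may only tighten the policy, never loosen it.
  const effective: GovernanceMode =
    accountOverride === "strict" ? "strict" : policy.governanceMode;

  return effective === "audit";
}
```

Anomaly scoring itself runs unconditionally; this gate only decides whether the notifier Lambda sends the Slack message, which is what keeps panic mode and strict mode non-destructive.
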
### Epic 10 Summary

| Story | Tenet | Points |
|-------|-------|--------|
| 10.1 | Atomic Flagging | 5 |
| 10.2 | Elastic Schema | 3 |
| 10.3 | Cognitive Durability | 2 |
| 10.4 | Semantic Observability | 3 |
| 10.5 | Configurable Autonomy | 3 |
| **Total** | | **16** |