# dd0c Platform — PLG Instrumentation Brainstorm

**Session:** Carson (Brainstorming Coach) — Cross-Product PLG Analytics
**Date:** March 1, 2026
**Scope:** All 6 dd0c products

---

## The Problem

We built 6 products with onboarding flows, free tiers, and Stripe billing — but zero product analytics. We can't answer:

- How many users reach the "aha moment" vs. bounce?
- Where in the funnel do free users drop off before upgrading?
- Which features drive retention vs. which are ignored?
- Are users churning because of alert fatigue, false positives, or just not getting value?
- What's our time-to-first-value per product?

Without instrumentation, PLG iteration is guesswork.

---

## Brainstorm: What to Instrument

### 1. Unified Event Taxonomy

Every dd0c product shares a common event naming convention:

```
<domain>.<object>.<action>

Examples:
  account.signup.completed
  account.aws.connected
  anomaly.alert.sent
  anomaly.alert.snoozed
  slack.bot.installed
  billing.checkout.started
  billing.upgrade.completed
  feature.flag.evaluated
```

**Rules:**
- Past tense for completed actions (`completed`, `sent`, `clicked`)
- State names (adjective or participle) for state changes (`active`, `learning`, `paused`)
- Always include `tenant_id`, `timestamp`, `product` (route/drift/alert/portal/cost/run)
- Never include PII — hash emails, account IDs
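
A minimal sketch of how the shared SDK could enforce these rules in Python. The names `build_event` and `hash_identifier`, the salt parameter, and the strict three-segment regex are assumptions for illustration, not an agreed design. Several names later in this document (`alert.ingested`, `routing.shadow.audit.run`) have two or four segments and would fail this strict check as written.

```python
import hashlib
import re
from datetime import datetime, timezone
from typing import Optional

# <domain>.<object>.<action>: exactly three lowercase snake_case segments.
EVENT_NAME = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2}")
PRODUCTS = {"route", "drift", "alert", "portal", "cost", "run"}


def hash_identifier(value: str, salt: str) -> str:
    """One-way salted hash so raw emails and account IDs never reach analytics."""
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def build_event(name: str, tenant_id: str, product: str,
                properties: Optional[dict] = None) -> dict:
    """Validate the event name and attach the fields every event must carry."""
    if not EVENT_NAME.fullmatch(name):
        raise ValueError(f"event name {name!r} is not <domain>.<object>.<action>")
    if product not in PRODUCTS:
        raise ValueError(f"unknown product {product!r}")
    return {
        "event": name,
        "tenant_id": tenant_id,
        "product": product,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "properties": dict(properties or {}),
    }
```

A salted hash keeps raw emails out of the analytics store, but the hash is still a stable pseudonymous identifier, so it deserves the same access controls as other tenant data.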

### 2. Per-Product Activation Metrics

The "aha moment" is different for each product:

| Product | Aha Moment | Metric | Target |
|---------|-----------|--------|--------|
| dd0c/route | First dollar saved by model routing | `routing.savings.first_dollar` | <24hr from signup |
| dd0c/drift | First drift detected in real stack | `drift.detection.first_found` | <1hr from agent install |
| dd0c/alert | First alert correlated (not just forwarded) | `alert.correlation.first_match` | <60sec from first alert |
| dd0c/portal | First service auto-discovered | `portal.discovery.first_service` | <5min from install |
| dd0c/cost | First anomaly detected in real account | `cost.anomaly.first_detected` | <24hr from AWS connect |
| dd0c/run | First runbook executed successfully | `run.execution.first_success` | <10min from setup |
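
To make the Target column measurable, time-to-first-value can be computed as the gap between the event that starts the clock and the aha event. The sketch below is illustrative only: `ACTIVATION_TARGETS`, `time_to_first_value`, and `met_target` are hypothetical names, and it assumes both timestamps are already available per tenant.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets copied from the table above, keyed by product.
# Each clock starts at the event named in the Target column
# (signup, agent install, first alert, install, AWS connect, setup).
ACTIVATION_TARGETS = {
    "route": timedelta(hours=24),
    "drift": timedelta(hours=1),
    "alert": timedelta(seconds=60),
    "portal": timedelta(minutes=5),
    "cost": timedelta(hours=24),
    "run": timedelta(minutes=10),
}


def time_to_first_value(started_at: datetime,
                        aha_at: Optional[datetime]) -> Optional[timedelta]:
    """Return the elapsed time to the aha event, or None if it never happened."""
    if aha_at is None:
        return None
    return aha_at - started_at


def met_target(product: str, started_at: datetime,
               aha_at: Optional[datetime]) -> bool:
    """True only when the aha event happened strictly inside the target window."""
    ttfv = time_to_first_value(started_at, aha_at)
    return ttfv is not None and ttfv < ACTIVATION_TARGETS[product]
```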

### 3. Conversion Funnel (Universal)

Every product shares this funnel shape:

```
Signup → Connect (AWS/Slack/Git) → First Value → Habit → Upgrade
```

Events per stage:

**Stage 1: Signup**
- `account.signup.started` — landed on signup page
- `account.signup.completed` — account created
- `account.signup.method` — github_sso / google_sso / email

**Stage 2: Connect**
- `account.integration.started` — began connecting external service
- `account.integration.completed` — connection verified
- `account.integration.failed` — connection failed (include `error_type`)
- Product-specific: `account.aws.connected`, `account.slack.installed`, `account.git.connected`

**Stage 3: First Value**
- Product-specific aha moment event (see table above)
- `onboarding.wizard.step_completed` — which step, how long
- `onboarding.wizard.abandoned` — which step they quit on

**Stage 4: Habit**
- `session.daily.active` — DAU ping
- `session.weekly.active` — WAU ping
- `feature.<name>.used` — per-feature usage
- `notification.digest.opened` — are they reading digests?
- `slack.command.used` — which slash commands, how often

**Stage 5: Upgrade**
- `billing.checkout.started`
- `billing.checkout.completed`
- `billing.checkout.abandoned`
- `billing.plan.changed` — upgrade/downgrade
- `billing.churn.detected` — subscription cancelled
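
As a rough sketch of how drop-off between stages could be computed from raw events (PostHog's built-in funnels would normally do this for us): the choice of one representative event per stage and the function names are assumptions, and the sketch ignores event ordering in time.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def funnel_stages(aha_event: str) -> List[str]:
    """One representative event per stage; the aha event is product-specific."""
    return [
        "account.signup.completed",       # Stage 1: Signup
        "account.integration.completed",  # Stage 2: Connect
        aha_event,                        # Stage 3: First Value
        "session.weekly.active",          # Stage 4: Habit
        "billing.checkout.completed",     # Stage 5: Upgrade
    ]


def funnel_counts(events: Iterable[Tuple[str, str]], stages: List[str]) -> List[int]:
    """Count tenants per stage; a tenant must have reached every earlier stage."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for tenant_id, name in events:
        seen[tenant_id].add(name)
    counts = [0] * len(stages)
    for names in seen.values():
        for i, stage in enumerate(stages):
            if stage not in names:
                break
            counts[i] += 1
    return counts


def step_conversion(counts: List[int]) -> List[Optional[float]]:
    """Stage-to-stage conversion rate; None when the earlier stage is empty."""
    return [None if prev == 0 else cur / prev
            for prev, cur in zip(counts, counts[1:])]
```

The lowest value returned by `step_conversion` points at the stage where free users drop off most.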

### 4. Feature Usage Events (Per Product)

**dd0c/route (LLM Cost Router)**
- `routing.request.processed` — model selected, latency, cost
- `routing.override.manual` — user forced a specific model
- `routing.savings.calculated` — weekly savings digest generated
- `routing.shadow.audit.run` — shadow mode comparison completed
- `dashboard.cost.viewed` — opened cost dashboard

**dd0c/drift (IaC Drift Detection)**
- `drift.scan.completed` — scan finished, drifts found count
- `drift.remediation.clicked` — user clicked "fix drift"
- `drift.remediation.applied` — drift actually fixed
- `drift.false_positive.marked` — user dismissed a drift
- `drift.agent.heartbeat` — agent is alive and scanning

**dd0c/alert (Alert Intelligence)**
- `alert.ingested` — raw alert received
- `alert.correlated` — alerts grouped into incident
- `alert.suppressed` — duplicate/noise suppressed
- `alert.escalated` — sent to on-call
- `alert.feedback.helpful` / `alert.feedback.noise` — user feedback
- `alert.mttr.measured` — time from alert to resolution

**dd0c/portal (Lightweight IDP)**
- `portal.service.discovered` — auto-discovery found a service
- `portal.service.claimed` — team claimed ownership
- `portal.scorecard.viewed` — someone checked service health
- `portal.scorecard.action_taken` — acted on a recommendation
- `portal.search.performed` — searched the catalog

**dd0c/cost (AWS Cost Anomaly)**
- `cost.event.ingested` — CloudTrail event processed
- `cost.anomaly.scored` — anomaly scoring completed
- `cost.anomaly.alerted` — Slack alert sent
- `cost.anomaly.snoozed` — user snoozed alert
- `cost.anomaly.expected` — user marked as expected
- `cost.remediation.clicked` — user clicked Stop/Terminate
- `cost.remediation.executed` — remediation completed
- `cost.zombie.detected` — idle resource found
- `cost.digest.sent` — daily digest delivered

**dd0c/run (Runbook Automation)**
- `run.runbook.created` — new runbook authored
- `run.execution.started` — runbook execution began
- `run.execution.completed` — execution finished (include `success`/`failed`)
- `run.execution.approval_requested` — human approval needed
- `run.execution.approval_granted` — human approved
- `run.execution.rolled_back` — rollback triggered
- `run.sandbox.test.run` — dry-run in sandbox

### 5. Health Scoring (Churn Prediction)

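Composite health score per tenant, recomputed daily. Each input is normalized to the range 0 to 1, so the weighted sum (the weights total 1.0) also falls between 0 and 1: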

```
health_score = (
  0.3 * activation_complete +    // did they hit aha moment?
  0.2 * weekly_active_days +     // how many days active this week?
  0.2 * feature_breadth +        // how many features used?
  0.15 * integration_depth +     // how many integrations connected?
  0.15 * feedback_sentiment       // positive vs negative actions
)
```

Thresholds:
- `health_score > 0.7` → Healthy (green)
- `0.4 ≤ health_score ≤ 0.7` → At Risk (yellow) → trigger re-engagement email
- `health_score < 0.4` → Churning (red) → trigger founder outreach
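
A minimal runnable sketch of the score and its bands. The weights and thresholds come from this section; how each input is scaled to 0..1 (active days out of 7, features and integrations as a fraction of what is available, sentiment already in 0..1) is an assumption, as are the function names.

```python
WEIGHTS = {
    "activation_complete": 0.30,
    "weekly_active_days": 0.20,
    "feature_breadth": 0.20,
    "integration_depth": 0.15,
    "feedback_sentiment": 0.15,
}


def _clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def _ratio(part: float, whole: float) -> float:
    """Fraction of `whole` covered by `part`, clamped to 0..1."""
    return 0.0 if whole <= 0 else _clamp01(part / whole)


def health_score(activated: bool, active_days: int,
                 features_used: int, features_total: int,
                 integrations_connected: int, integrations_total: int,
                 feedback_sentiment: float) -> float:
    """Weighted sum of normalized components; the result stays in 0..1."""
    components = {
        "activation_complete": 1.0 if activated else 0.0,
        "weekly_active_days": _ratio(active_days, 7),
        "feature_breadth": _ratio(features_used, features_total),
        "integration_depth": _ratio(integrations_connected, integrations_total),
        "feedback_sentiment": _clamp01(feedback_sentiment),
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())


def health_band(score: float) -> str:
    """Map a score to the bands above; 0.4 and 0.7 both count as at risk."""
    if score > 0.7:
        return "healthy"
    if score >= 0.4:
        return "at_risk"
    return "churning"
```

With these assumptions, an activated tenant with 3 active days, 2 of 5 features, 1 of 3 integrations, and neutral sentiment (0.5) scores about 0.59 and lands in the at-risk band.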

### 6. Analytics Stack Recommendation

**PostHog** (self-hosted on AWS):
- Open source, self-hostable → no vendor lock-in
- No per-event fees when self-hosted; the only cost is the AWS infrastructure it runs on
- Built-in: funnels, retention, feature flags, session replay
- Supports custom events via REST API or JS/Python SDK
- Can run on a single t3.medium for V1 traffic

**Why not Segment/Amplitude/Mixpanel:**
- Segment: $120/mo minimum, overkill for solo founder
- Amplitude: free tier is generous but cloud-only, data leaves your infra
- Mixpanel: same cloud-only concern
- PostHog self-hosted: $0/mo in licensing, data stays in your AWS account, GDPR-friendly

**Integration pattern:**
```
Lambda/API → PostHog REST API (async, fire-and-forget)
Next.js UI → PostHog JS SDK (auto-captures pageviews, clicks)
Slack Bot → PostHog Python SDK (command usage, action clicks)
```
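
For the server-side path, a hedged sketch of what a capture call could look like from Python using only the standard library. The `/capture/` path and the body fields (`api_key`, `event`, `distinct_id`, `properties`, `timestamp`) reflect PostHog's capture API as we understand it and should be checked against the current API reference. Using the tenant ID as `distinct_id`, the example host, and the function names are assumptions.

```python
import json
import urllib.request
from typing import Optional


def build_capture_request(host: str, api_key: str, event: str, tenant_id: str,
                          properties: dict,
                          timestamp: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for a single analytics event."""
    body = {
        "api_key": api_key,
        "event": event,
        "distinct_id": tenant_id,  # assumption: events are keyed by tenant
        "properties": properties,
    }
    if timestamp is not None:
        body["timestamp"] = timestamp
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/capture/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 2.0) -> bool:
    """Send the event; analytics failures are swallowed so they never break a request."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

In Lambda, work still running after the handler returns may be frozen along with the execution environment, so a truly fire-and-forget call can drop events. Calling `send` with a short timeout before returning is the safer default.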

### 7. Cross-Product Flywheel Metrics

dd0c is a platform — users on one product should discover others:

- `platform.cross_sell.impression` — "Try dd0c/alert" banner shown
- `platform.cross_sell.clicked` — user clicked cross-sell
- `platform.cross_sell.activated` — user activated second product
- `platform.products.active_count` — how many dd0c products per tenant

**Flywheel hypothesis:** Users who activate 2+ dd0c products churn at roughly one-third the rate of single-product users. This is an untested assumption; we need data to confirm or reject it.
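
Once `platform.products.active_count` and churn events are flowing, the hypothesis reduces to a ratio of two churn rates. A small sketch, assuming each tenant is summarized as an `(active_product_count, churned)` pair over the same observation window; the function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def churn_rate(churned_flags: Iterable[bool]) -> Optional[float]:
    """Share of tenants that churned; None for an empty cohort."""
    flags = list(churned_flags)
    return None if not flags else sum(flags) / len(flags)


def flywheel_churn_ratio(tenants: Iterable[Tuple[int, bool]]) -> Optional[float]:
    """Single-product churn rate divided by multi-product churn rate.

    A result near 3.0 would support the hypothesis. Returns None when either
    cohort is empty or the multi-product cohort has no churn to divide by.
    """
    rows = list(tenants)
    single = churn_rate(churned for count, churned in rows if count == 1)
    multi = churn_rate(churned for count, churned in rows if count >= 2)
    if single is None or not multi:
        return None
    return single / multi
```

Even a ratio near 3 would only show correlation: tenants who adopt a second product may simply be the ones who were already more engaged.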

---

## Epic 11 Proposal: PLG Instrumentation

### Scope
Cross-cutting epic added to all 6 products. Shared analytics SDK, per-product event implementations, funnel dashboards, health scoring.

### Stories (Draft)
1. **PostHog Infrastructure** — CDK stack for self-hosted PostHog on ECS Fargate
2. **Analytics SDK** — Shared TypeScript/Python wrapper with standard event schema
3. **Funnel Dashboard** — PostHog dashboard template per product
4. **Activation Tracking** — Per-product aha moment detection and logging
5. **Health Scoring Engine** — Daily cron that computes tenant health scores
6. **Cross-Sell Instrumentation** — Platform-level cross-product discovery events
7. **Churn Alert Pipeline** — Health score → Slack alert to founder when tenant goes red
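
For story 7, one design question is whether to alert every day a tenant is red or only when it turns red. A sketch of the second option, comparing two daily score snapshots; `newly_red` is a hypothetical name, and treating tenants with no previous score as previously healthy is an assumption.

```python
from typing import Dict, List

RED_THRESHOLD = 0.4  # "Churning (red)" cutoff from the health scoring section


def newly_red(previous: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Tenant IDs that are red today but were not red in the previous snapshot."""
    return sorted(
        tenant
        for tenant, score in current.items()
        if score < RED_THRESHOLD and previous.get(tenant, 1.0) >= RED_THRESHOLD
    )
```

Alerting on the transition rather than the state keeps the founder's Slack channel from repeating the same tenant every morning.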

### Estimate
~25 story points across all products (shared infrastructure + per-product event wiring)

---

*This brainstorm establishes the "what" and "why." Party Mode advisory board should stress-test: Is PostHog the right choice? Is the event taxonomy too granular? Should health scoring be V1 or V2? Is 25 points realistic?*