feat: confluence benchmark, pattern extractor, agent KB, UX spec

- extract-patterns.js: mines layered arch, ArgoCD appsets, cloud regions,
  CIDR allocations, naming conventions, sync waves, tech stack from code
- agent-kb.js: token-efficient JSON rendering of same doc tree
- eval-confluence-ref-questions.json: 32 reference-only benchmark questions
- wiggum-v2.sh: Ralph Wiggum loop targeting confluence baseline (77.8%)
- docs/human-ux-spec.md: BMad UX designer spec for human doc structure
- Eval results: V2 at 28.7% vs confluence 77.8% baseline
- Hub/spoke ownership now correctly extracted (95% on that question)
- Naming conventions, regions, CIDRs surfaced in system-architecture.md
Jarvis Prime
2026-03-10 14:20:35 +00:00
parent 049609a358
commit 0265ec7a60
844 changed files with 2129910 additions and 30 deletions

129
docs/architecture.md Normal file

@@ -0,0 +1,129 @@
# Dev Intel V2 Architecture Document
## 1. Introduction & Goals
The goal of the Dev Intel V2 pipeline improvements is to elevate the documentation quality from purely descriptive to deeply explanatory. The current state struggles to answer the "why" behind the infrastructure architecture and falls short in mapping flow paths and Terraform structures. This document details the design for closing these gaps, satisfying the PRD requirements for Terraform extraction, Flow Tracing, and Change Impact analysis.
## 2. Component Architecture
```mermaid
graph TD
subgraph Extraction Phase
TS[extract.js<br/>Tree-sitter Code]
HELM[extract-helm.js<br/>Helm + Templates]
TF[extract-terraform.js<br/>HCL / Regex Hybrid]
end
subgraph Knowledge Graph
GRAPH[graph.js<br/>In-Memory Store]
TS -->|Node/Edges| GRAPH
HELM -->|Node/Edges| GRAPH
TF -->|Node/Edges| GRAPH
end
subgraph Enrichment & Analysis Phase
FLOW[flow.js<br/>Entry Point Auto-Detector]
IMPACT[impact.js<br/>Change Impact Query]
PROSE[prose.js<br/>Explanatory LLM]
GRAPH <--> FLOW
GRAPH <--> IMPACT
GRAPH --> PROSE
end
subgraph Outputs
DOCS[sysdoc.js<br/>Diataxis Docs]
PROSE --> DOCS
FLOW --> DOCS
IMPACT --> DOCS
end
```
## 3. Data Flow & Component Design
### 3.1 Terraform Extraction (`extract-terraform.js`)
**Problem:** The current naive regex misses modules, locals, and complex blocks, resulting in ~0% coverage of `control-core`.
**Design:** A hybrid extraction module that attempts to load `tree-sitter-hcl` first. If the grammar is unavailable or cannot be pinned, it falls back to a multi-pass regex parser.
- **Pass 1:** Extract block declarations (`module`, `resource`, `data`, `variable`, `output`, `provider`, `locals`).
- **Pass 2:** Extract dependencies. Within each block's body, run regexes to find references to variables (`var\.([a-zA-Z0-9_-]+)`), locals (`local\.([a-zA-Z0-9_-]+)`), modules (`module\.([a-zA-Z0-9_-]+)\.`), and resources (for example `aws_s3_bucket\.([a-zA-Z0-9_-]+)\.`).
- **Entity Mapping:** Nodes are created with `kind: 'terraform'`, `type: blockType`. Edges of type `DEPENDS_ON` are generated from references.
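The two passes can be illustrated with a minimal sketch of the regex fallback. It is not the actual `extract-terraform.js`: the function names, the entity `id` scheme, and the brace-matching helper are assumptions made for illustration, and resource-type references (such as `aws_s3_bucket.<name>`) are left out for brevity.

```javascript
// Hypothetical sketch of the regex fallback. Only `kind: 'terraform'`,
// `type: blockType` and the DEPENDS_ON edge type come from the design above.
const BLOCK_HEADER =
  /^(module|resource|data|variable|output|provider|locals)\b((?:\s+"[^"]+")*)\s*\{/gm;
const REFERENCE = /\b(var|local|module)\.([a-zA-Z0-9_-]+)/g;

// Return the text between the brace at `openIdx` and its matching close.
// Naive on purpose: braces inside strings or comments are counted too.
function blockBody(src, openIdx) {
  let depth = 0;
  for (let i = openIdx; i < src.length; i++) {
    if (src[i] === '{') depth++;
    else if (src[i] === '}' && --depth === 0) return src.slice(openIdx + 1, i);
  }
  return src.slice(openIdx + 1); // unbalanced input: take the rest
}

function extractTerraformFallback(src, file) {
  const entities = [];
  const relationships = [];
  // Pass 1: top-level block declarations.
  for (const m of src.matchAll(BLOCK_HEADER)) {
    const labels = [...m[2].matchAll(/"([^"]+)"/g)].map((l) => l[1]);
    const id = [m[1], ...labels].join('.');
    entities.push({ id, kind: 'terraform', type: m[1], file });
    // Pass 2: references inside this block's body, de-duplicated.
    const openIdx = m.index + m[0].length - 1;
    const seen = new Set();
    for (const r of blockBody(src, openIdx).matchAll(REFERENCE)) {
      // Map `var.x` onto the id of its `variable "x"` declaration.
      const target = `${r[1] === 'var' ? 'variable' : r[1]}.${r[2]}`;
      if (seen.has(target)) continue;
      seen.add(target);
      relationships.push({ from: id, to: target, type: 'DEPENDS_ON' });
    }
  }
  return { file, language: 'hcl', entities, relationships };
}
```

Because the header regex is anchored at column zero, nested blocks such as `lifecycle { ... }` are treated as part of the enclosing block's body rather than as separate entities.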
### 3.2 Explanatory Prose Generator (`prose.js`)
**Problem:** The LLM generates "what" instead of "why" because prompts only pass structural components.
**Design:** The `prose.js` prompt generator will be restructured to feed enriched context:
1. **Dependency Matrix:** Explicitly list upstream and downstream components.
2. **Anomaly Flags:** If a subsystem has zero functions but many variables, or high fan-in/fan-out, pass this to the LLM.
3. **Prompt Update:** Change the prompt instruction from *"Describe this system"* to *"Explain the architectural rationale behind this subsystem. Why does it depend on X and Y? Why does it exhibit anomaly Z?"*
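As a rough illustration of the enriched context, the sketch below assembles a prompt from a dependency matrix and anomaly flags. The subsystem shape (`functionCount`, `variableCount`, `dependsOn`, `dependedOnBy`), the function names, and the fan threshold of 10 are all assumptions; the real `prose.js` may structure this differently.

```javascript
// Hypothetical subsystem shape: { name, functionCount, variableCount,
// dependsOn: string[], dependedOnBy: string[] }. The threshold is an assumption.
function detectAnomalies(subsystem, fanThreshold = 10) {
  const flags = [];
  if (subsystem.functionCount === 0 && subsystem.variableCount > 0) {
    flags.push(`zero functions but ${subsystem.variableCount} variables`);
  }
  if (subsystem.dependedOnBy.length >= fanThreshold) {
    flags.push(`high fan-in (${subsystem.dependedOnBy.length} dependents)`);
  }
  if (subsystem.dependsOn.length >= fanThreshold) {
    flags.push(`high fan-out (${subsystem.dependsOn.length} dependencies)`);
  }
  return flags;
}

// Build the prompt text: dependency matrix first, then anomalies, then the
// "why" instruction from the design.
function buildProsePrompt(subsystem) {
  const list = (items) => (items.length > 0 ? items.join(', ') : 'none');
  return [
    `Subsystem: ${subsystem.name}`,
    `Upstream (depends on): ${list(subsystem.dependsOn)}`,
    `Downstream (depended on by): ${list(subsystem.dependedOnBy)}`,
    `Anomalies: ${list(detectAnomalies(subsystem))}`,
    '',
    'Explain the architectural rationale behind this subsystem.',
    'Why does it depend on the upstream components listed above?',
    'Why does it exhibit each listed anomaly?',
  ].join('\n');
}
```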
### 3.3 Entry Point Auto-Detection (`flow.js`)
**Problem:** No start-to-finish execution paths exist in the generated docs.
**Design:** `flow.js` will scan the `GraphStore` for nodes matching specific heuristics:
- **K8s:** Nodes of kind `Deployment` or `StatefulSet` that have an incoming edge from a `Service` or `Ingress`.
- **Scripts:** Bash files containing `main()` or ending with an execution call.
- **Python:** Files containing `if __name__ == '__main__':`.
- **CI/CD:** Files in `.github/workflows/` or `.gitlab-ci.yml`.
Once identified, a breadth-first search (BFS) follows outbound `CALLS` or `DEPENDS_ON` edges to map the execution flow.
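The heuristics and the BFS might look roughly like the sketch below. The graph shape used here (`nodes` as a `Map`, `edges` as an array of `{ from, to, type }`, and `path`/`content` fields on file nodes) is an assumption, since the `GraphStore` API is not shown in this document. The "ends with an execution call" heuristic for shell scripts is omitted.

```javascript
// Assumed graph shape (not the real GraphStore API):
//   graph.nodes: Map<string, { id, kind, path?, content? }>
//   graph.edges: Array<{ from, to, type }>
function detectEntryPoints(graph) {
  // Workloads that have an incoming edge from a Service or Ingress.
  const exposed = new Set(
    graph.edges
      .filter((e) => ['Service', 'Ingress'].includes(graph.nodes.get(e.from)?.kind))
      .map((e) => e.to),
  );
  return [...graph.nodes.values()].filter((n) => {
    if (['Deployment', 'StatefulSet'].includes(n.kind)) return exposed.has(n.id);
    const path = n.path ?? '';
    const text = n.content ?? '';
    if (path.endsWith('.sh')) return /\bmain\s*\(\)/.test(text);
    if (path.endsWith('.py')) return /if __name__ == ['"]__main__['"]:/.test(text);
    return path.startsWith('.github/workflows/') || path.endsWith('.gitlab-ci.yml');
  });
}

// Breadth-first walk over outbound CALLS / DEPENDS_ON edges, returned as a tree.
function traceExecution(graph, startNodeId) {
  const root = { id: startNodeId, children: [] };
  const visited = new Set([startNodeId]);
  const queue = [root];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const e of graph.edges) {
      if (e.from !== current.id) continue;
      if (e.type !== 'CALLS' && e.type !== 'DEPENDS_ON') continue;
      if (visited.has(e.to)) continue; // cycle guard
      visited.add(e.to);
      const child = { id: e.to, children: [] };
      current.children.push(child);
      queue.push(child);
    }
  }
  return root;
}
```

The `visited` set keeps the trace finite when the graph contains cycles, at the cost of showing each node only once in the tree.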
### 3.4 Change Impact Query Interface (`graph.js` / `impact.js`)
**Problem:** Engineers cannot determine the blast radius of a change.
**Design:** A new query interface that traverses the graph backwards.
- Given a `nodeId` (e.g., a Secret or Terraform Module), traverse all *inbound* `DEPENDS_ON` or `CALLS` edges recursively.
- Return a hierarchical JSON payload or Markdown tree representing all downstream systems forced to redeploy or re-evaluate if the target node is modified.
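A minimal sketch of the reverse traversal follows, written against the same assumed graph shape (`edges` as an array of `{ from, to, type }`) rather than the real `GraphStore`. It returns a flat list with depth information; the `depth` and `via` fields are illustrative additions from which a tree could be rebuilt.

```javascript
// Level-by-level reverse BFS: follow inbound DEPENDS_ON / CALLS edges from the
// target. Assumed graph shape: graph.edges is Array<{ from, to, type }>.
function queryImpact(graph, targetNodeId, maxDepth = 10) {
  const impacted = [];
  const visited = new Set([targetNodeId]);
  let frontier = [targetNodeId];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const e of graph.edges) {
      if (!frontier.includes(e.to)) continue;
      if (e.type !== 'DEPENDS_ON' && e.type !== 'CALLS') continue;
      if (visited.has(e.from)) continue; // already reported at a shallower depth
      visited.add(e.from);
      impacted.push({ id: e.from, depth, via: e.to });
      next.push(e.from);
    }
    frontier = next;
  }
  return impacted;
}
```

For the PRD's example question, `queryImpact(graph, 'vault-secret')` would list every chart that depends on the secret directly or transitively, together with the hop count at which it was reached.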
### 3.5 Index Enrichment (`extract-helm.js` / `sysdoc.js`)
**Problem:** AI agents get lost when wrapper charts only provide links to sub-charts.
**Design:** During the indexing phase, when a wrapper chart lists dependencies in `Chart.yaml`, `sysdoc.js` will query the graph for those sub-charts and inline their key entities (Deployments, Services, ConfigMaps) directly into the wrapper chart's markdown section. This gives the LLM the full chart context in a single document instead of a chain of links.
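The inlining step could be sketched as follows. The chart objects (`name`, `dependencies`, `entities`) and the `chartsByName` lookup are hypothetical stand-ins for whatever `sysdoc.js` reads from the graph, and the Markdown layout is only one possible rendering.

```javascript
// Kinds named in the design as "key entities" worth inlining.
const INLINED_KINDS = ['Deployment', 'Service', 'ConfigMap'];

// Hypothetical inputs: chart = { name, dependencies: string[] } and
// chartsByName = Map<string, { entities: Array<{ kind, name }> }>.
function renderWrapperChart(chart, chartsByName) {
  const lines = [`## ${chart.name}`, ''];
  for (const depName of chart.dependencies) {
    lines.push(`### Sub-chart: ${depName}`);
    const dep = chartsByName.get(depName);
    if (!dep) {
      lines.push('- (not found in graph)');
      continue;
    }
    // Inline one level only; sub-charts of sub-charts are not expanded.
    for (const entity of dep.entities) {
      if (INLINED_KINDS.includes(entity.kind)) {
        lines.push(`- ${entity.kind}: \`${entity.name}\``);
      }
    }
  }
  return lines.join('\n');
}
```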
## 4. Interface Contracts
```javascript
// extract-terraform.js
/**
 * @param {string} filePath - Absolute path to the .tf file.
 * @param {string} repoRoot - Base path of the repository.
 * @returns {Object} { file, language: 'hcl', entities: [...], relationships: [...] }
 */
function extractTerraform(filePath, repoRoot) {
  throw new Error('Not implemented'); // contract stub only
}

// flow.js
/**
 * @param {GraphStore} graph - The populated knowledge graph.
 * @returns {Array<Object>} List of entry point nodes.
 */
function detectEntryPoints(graph) {
  throw new Error('Not implemented'); // contract stub only
}

/**
 * @param {GraphStore} graph - The populated knowledge graph.
 * @param {string} startNodeId - The entry point ID.
 * @returns {Object} A tree representing the execution flow.
 */
function traceExecution(graph, startNodeId) {
  throw new Error('Not implemented'); // contract stub only
}

// impact.js
/**
 * @param {GraphStore} graph - The populated knowledge graph.
 * @param {string} targetNodeId - The node being modified.
 * @param {number} [maxDepth=10] - Max traversal depth.
 * @returns {Array<Object>} List of impacted downstream nodes.
 */
function queryImpact(graph, targetNodeId, maxDepth = 10) {
  throw new Error('Not implemented'); // contract stub only
}
```
## 5. Key Technical Decisions with Rationale
1. **Hybrid Parsing for Terraform (HCL):**
   * *Decision:* Try `tree-sitter-hcl`, fall back to regex.
   * *Rationale:* The PRD noted `tree-sitter-hcl` might not be pinned or available in Node 22. A pure tree-sitter approach risks failing completely. The regex fallback ensures we still extract the most critical blocks (modules, resources) even if the grammar fails to load.
2. **In-Memory Graph Traversal for Impact:**
   * *Decision:* Use BFS on the existing `Map`-based graph rather than introducing a graph database (e.g., Neo4j).
   * *Rationale:* The scope is limited to monorepos parsed at runtime. Adding an external database violates the "no external dependencies" philosophy of `graph.js` and slows down the pipeline.
3. **Inlining Dependencies in Helm:**
   * *Decision:* Inline sub-chart structures into the wrapper chart's markdown.
   * *Rationale:* It solves the Tier 1 issue of context-switching for AI coding agents and human readers, despite the risk of increasing document length.
## 6. Risk Mitigations
| Risk | Mitigation |
| :--- | :--- |
| **LLM Context Bloat (T1.1, T1.2)** | Limit the inlining depth of sub-charts to 1 level. Truncate anomaly explanations if the prompt exceeds the configured token threshold (e.g., 100k tokens). |
| **Noisy Graph Edges Breaking Impact Analysis (T2.3)** | Introduce edge "confidence scores" or strict edge typing (`EXPLICIT_DEPENDS_ON` vs `INFERRED_CALLS`). The impact query will only traverse high-confidence, explicitly defined edges. |
| **Regex Fallback Missing Complex HCL (T2.1)** | Focus the regex fallback purely on extracting block headers and specific reference patterns (`var.`, `module.`). This is expected to cover about 80% of structural relationships, in line with the PRD's Terraform coverage target (> 80% of `control-core` entities). |

67
docs/human-ux-spec.md Normal file

@@ -0,0 +1,67 @@
# Human-Facing Documentation UX Specification
## 1. Information Architecture (Directory Structure)
The documentation should follow the Diataxis framework but be organized in a way that aligns with a developer's mental model, starting from a high-level overview and allowing for deep dives into specific domains.
```text
docs/
├── index.md # Landing page (High-level architecture & entry points)
├── architecture/ # Progressive Disclosure: High-level concepts
│ ├── index.md
│ ├── system-context.md # C4 Context / System overview
│ └── data-flow.md # Mermaid diagrams of data movement
├── tutorials/ # Learning-oriented (Step-by-step guides for onboarding)
│ ├── index.md
│ └── local-setup.md
├── how-to/ # Problem-oriented (Recipes for specific tasks)
│ ├── index.md
│ ├── deploy-new-service.md
│ └── debug-helm-chart.md
├── explanation/ # Understanding-oriented (Why things are the way they are)
│ ├── index.md
│ └── decisions/ # ADRs (Architecture Decision Records)
└── reference/ # Information-oriented (Auto-generated deep dives)
    ├── index.md
    ├── helm-charts/ # Helm value schemas and usage
    └── terraform/ # Module inputs/outputs and resources
```
## 2. Visual Hierarchy
To prevent the documentation from feeling like a dense text dump, we must use visual elements to create a clear hierarchy and break up the content.
* **Headers:** Use `H1` (`#`) strictly for the document title. Use `H2` (`##`) for major sections and `H3` (`###`) for subsections. Do not go deeper than `H4`.
* **Mermaid Diagrams:** Use diagrams early in `architecture` and `explanation` documents. Visuals should precede dense text to set the context.
* **Callouts/Admonitions:** Use blockquotes or specialized markdown extensions (like GitHub's alert syntax) for critical information to draw the eye.
  * `> [!NOTE]` for general tips.
  * `> [!WARNING]` for destructive actions or important caveats.
  * `> [!TIP]` for best practices.
* **Tables:** Use tables strictly for structured data, such as environment variables, API parameters, or Helm chart values. Do not use tables for layout.
* **Code Blocks:** Always specify the language for syntax highlighting (e.g., `bash`, `yaml`, `hcl`). Keep code blocks concise; link to the actual source file if the snippet exceeds 30 lines.
## 3. Navigation
Navigation must be explicit and prevent the user from hitting "dead ends."
* **Breadcrumbs:** Every page (except the root `index.md`) should begin with a breadcrumb trail back to the root and its parent category.
  * *Example (for `reference/helm-charts/index.md`):* `[Home](../../index.md) > [Reference](../index.md) > Helm Charts`
* **Table of Contents (TOC):** Any document longer than two screens of text must include a generated TOC immediately following the title.
* **Next Steps/See Also:** Every document must end with a "Related Links" or "Next Steps" section. For example, a "How-To" on deploying a service should link to the "Reference" for the specific Helm chart used.
* **Index Pages:** Every directory must contain an `index.md` that lists and briefly describes all documents within that directory. It serves as the local "table of contents."
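Since the reference pages are auto-generated, the breadcrumb rule can be applied mechanically. The following is a small sketch, assuming page paths are given relative to `docs/` and that link labels can be derived from directory and file names; `breadcrumb` and `titleCase` are hypothetical helpers, not part of the existing pipeline.

```javascript
// Turn a slug such as "helm-charts" into a label such as "Helm Charts".
function titleCase(slug) {
  return slug
    .split('-')
    .map((w) => w.charAt(0).toUpperCase() + w.slice(1))
    .join(' ');
}

// Build a breadcrumb line for a page path relative to docs/,
// e.g. "how-to/deploy-new-service.md". Returns '' for the root index.md.
function breadcrumb(docPath) {
  const parts = docPath.split('/');
  const file = parts.pop();
  const isIndex = file === 'index.md';
  const depth = parts.length; // directory levels between the page and docs/
  if (depth === 0) return '';
  // An index page represents its own directory, so that directory is the
  // current (unlinked) crumb rather than a parent link.
  const parents = isIndex ? parts.slice(0, -1) : parts;
  const up = (n) => (n === 0 ? './' : '../'.repeat(n));
  const crumbs = [`[Home](${up(depth)}index.md)`];
  parents.forEach((dir, i) => {
    crumbs.push(`[${titleCase(dir)}](${up(depth - i - 1)}index.md)`);
  });
  const current = isIndex ? parts[parts.length - 1] : file.replace(/\.md$/, '');
  crumbs.push(titleCase(current));
  return crumbs.join(' > ');
}
```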
## 4. Progressive Disclosure
The documentation must cater to both the new hire needing a high-level overview and the senior engineer debugging a specific Terraform state.
1. **Level 1: The Landing Page (`docs/index.md`)**
   * Goal: Orient the user.
   * Content: A brief summary of the monorepo's purpose, a high-level Mermaid C4 Context diagram, and prominent links to the four Diataxis quadrants.
2. **Level 2: The Domain Overview (`architecture/index.md`)**
   * Goal: Explain how the pieces fit together.
   * Content: System architecture diagrams, data flow descriptions, and links to the underlying infrastructure components.
3. **Level 3: Component Deep Dive (e.g., `reference/helm-charts/my-service.md`)**
   * Goal: Provide exhaustive detail for implementation.
   * Content: Auto-generated tables of values, specific configurations, and links to the actual source code.
**The Golden Rule:** Never show Level 3 information on a Level 1 or 2 page. Provide summaries and clear links to drill down into the specifics.

35
docs/party-review-v3.md Normal file

@@ -0,0 +1,35 @@
# Party Mode Review: Dev Intel V3 PRD
**🎸 The Punk**
Finally, someone gets it! Nuking 1500 lines of custom garbage to just use `terraform-docs` and a bash script is the most punk rock thing I've seen all week. Burn `ratchet.js` to the ground, we don't need a bloated JavaScript orchestrator to do a simple while-loop.
**🧪 The Scientist**
I appreciate the strict constraints—targeting a sub-10 minute execution time and a $1.00 cost per release provides highly testable metrics. However, asserting we'll maintain a 93% agent eval score while ripping out the custom evaluation logic in favor of `promptfoo` requires empirical validation we don't have yet. Show me the benchmark data comparing the two evaluators.
**💀 The Skeptic**
You're replacing a system that technically works with a "Ralph Wiggum" bash loop and hoping an OSS tool won't randomly break your pipeline. Relying on `helm-docs` while admitting it can't handle cross-chart analysis means you're just shifting the complexity to this magical "Glue Layer" that's going to become the new maintenance nightmare. I give it two weeks before the bash script is 500 lines long and unreadable.
**🎪 The Hype Beast**
This is a game-changer! 🚀 By offloading the boring stuff to open source, we can focus all our energy on that sweet, sweet AI prose generation! A hybrid architecture that is fast, cheap, AND smart is exactly what's going to take Dev Intel to the moon! 🌕✨ We're basically building the ultimate AI brain for our infrastructure!
**🔧 The Mechanic**
Using off-the-shelf binaries is fine, but how exactly does this "minimal orchestration" feed `terraform-docs` output back into the `prose.js` graph builder? The PRD completely glosses over the actual data contract between the OSS tools and the custom tree-sitter extraction. It sounds nice on paper, but wiring that pipeline up in bash is going to be incredibly brittle when edge cases hit.
---
### Panel Verdict
**Top 3 Strengths:**
1. Massive reduction in custom code maintenance (2000 lines down to 500).
2. Clear, measurable, and aggressive constraints (Under 10 mins, <= $1 cost).
3. Embracing industry-standard OSS (`terraform-docs`, `helm-docs`, `promptfoo`) instead of reinventing the wheel.
**Top 3 Risks:**
1. The bash "Glue Layer" becoming a brittle, unmaintainable mess of pipes and regex.
2. Loss of the nuanced context that the custom V2 extractors provided for cross-chart and cross-file graph edges.
3. Assuming `promptfoo` will perfectly replicate the custom `eval.js` logic and maintain the 93% score without regressions.
**One thing we'd change:**
Define the exact data contract/JSON interface between the OSS tool outputs and the remaining `prose.js` / graph builders, instead of hand-waving it as "minimal orchestration."
**Final Score:** 7.5/10

59
docs/prd-v3.md Normal file

@@ -0,0 +1,59 @@
# Product Requirements Document: Dev Intel V3
## 1. Problem Statement
Dev Intel V2 successfully generates documentation from our Foxtrot monorepo, achieving a 93% agent eval and 78% human eval score. However, the pipeline relies on ~2000 lines of custom JavaScript. Much of this custom code duplicates the functionality of well-established Open Source Software (OSS). We need to simplify the architecture, reduce the maintenance burden, and embrace community-standard tools without sacrificing output quality. Our "ratchet loop" is functionally just a "Ralph Wiggum" loop, and we should embrace a simplified, brute-force bash loop with clear objective completion criteria rather than complex custom code.
## 2. Architecture
The V3 architecture adopts a hybrid approach: "OSS for the heavy lifting, custom code for the magic."
### OSS Replacements
* **Terraform Documentation:** `terraform-docs` (Replaces `extract-terraform.js`)
* **Helm Chart Documentation:** `helm-docs` (Replaces `extract-helm.js` & `sysdoc.js` chart section)
* **Evaluation Harness:** `promptfoo` (Replaces `eval-agent.js`, `eval-human.js`, `eval.js`)
* **Documentation Serving:** `mkdocs-material` (Replaces custom doc serving)
* **Ratchet Loop:** Simple Ralph Wiggum bash loop (Replaces `ratchet.js`)
### Retained Custom Components (The Value Add)
* **Graph Builder (`graph.js` + `extract.js`):** Tree-sitter extraction to build a unified knowledge graph across 13 repositories.
* **Subsystem Aggregator (`subsystem.js`):** Grouping files into logical subsystems and detecting cross-cutting concerns.
* **Cross-Chart Interaction Analysis:** Analyzing shared secrets, ports, and service references across Helm charts (which `helm-docs` cannot do natively).
* **LLM Prose Enrichment (`prose.js`):** Feeding the dependency matrix and anomaly flags into Claude to generate "why" explanations.
* **Glue Layer:** Minimal orchestration connecting OSS tools and custom analysis into unified output.
## 3. Requirements
* **LLM Engine:** Use `http://192.168.86.11:8000/v1` with the `claude-haiku-4.5` model.
* **Scale:** Must handle the Foxtrot monorepo (13 subdirectories, 17K+ files).
* **Footprint constraint:** The pipeline should be composed of ~500 lines of custom Node.js code plus config files.
* **Speed constraint:** Must run end-to-end in under 10 minutes (excluding LLM execution wait times).
* **Cost constraint:** Target cost is $1.00 per release.
* **Code Implementation:** Replace the existing terraform and per-chart Helm doc generation with the CLI tools (`terraform-docs` and `helm-docs`).
* **Docs Website:** Implement an `mkdocs.yml` configuration to serve the output as a searchable site.
* **Evaluation Implementation:** Configure `promptfoo` via YAML to act as the objective judge.
## 4. Ralph Wiggum Loop Spec
The previous `ratchet.js` implementation will be replaced by a `bash` script. This runs an AI agent in a simple, well-known ratchet pattern: loop until objective completion criteria are met.
**Execution Flow:**
1. **Generate:** Run the Dev Intel V3 pipeline.
2. **Evaluate:** Run `promptfoo eval` against the pipeline's output.
3. **Diagnose:** Check the `promptfoo` score against the required threshold.
4. **Condition:**
   * **If Score >= Threshold:** Success, exit the loop.
   * **If Score < Threshold:** Feed the previous output and the evaluation feedback (the failure context) back into the generator prompt.
5. **Repeat:** Return to step 1 until the criteria are met or *N* iterations have run.
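The control flow above is small enough to sketch directly. The PRD calls for a bash script; the version below expresses the same loop in JavaScript purely as an illustration, with the pipeline run and the `promptfoo` evaluation passed in as hypothetical synchronous callbacks (`generate`, `evaluate`) so that no CLI flags or output formats have to be assumed.

```javascript
// Illustrative only: `generate` and `evaluate` are hypothetical callbacks that
// stand in for "run the pipeline" and "run the evaluation and read the score".
function ralphWiggumLoop({ generate, evaluate, threshold, maxIterations }) {
  let previousOutput = null;
  let feedback = null;
  let score = null;
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    // Steps 1-2: generate with the prior output and feedback, then evaluate.
    const output = generate({ previousOutput, feedback });
    const result = evaluate(output);
    score = result.score;
    // Steps 3-4: exit on success, otherwise carry the failure context forward.
    if (score >= threshold) {
      return { success: true, iterations: iteration, score, output };
    }
    previousOutput = output;
    feedback = result.feedback;
  }
  // Step 5: iteration budget exhausted without meeting the threshold.
  return { success: false, iterations: maxIterations, score, output: previousOutput };
}
```

Capping the loop at `maxIterations` matters for the cost constraint: each extra pass is another round of LLM calls, so an unbounded loop could exceed the per-release budget without ever converging.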
## 5. Success Metrics
* **Quality Parity or Better:** Agent eval score >= 93%, Human eval score >= 78%.
* **Simplicity:** Custom codebase shrinks from ~2000 lines to ~500 lines.
* **Performance:** Execution overhead is under 10 minutes.
* **Efficiency:** Pipeline inference costs remain <= $1 per release.
## 6. Migration Plan
To safely deprecate V2 while maintaining documentation pipelines:
1. **Remove Custom Extractors:** Delete `extract-terraform.js`, `extract-helm.js`, and the Helm-specific logic inside `sysdoc.js`.
2. **Remove Custom Evaluators:** Delete `eval-agent.js`, `eval-human.js`, and `eval.js`.
3. **Remove Custom Ratchet:** Delete `ratchet.js`.
4. **Integrate CLI Binaries:** Install and wire up `terraform-docs` and `helm-docs`.
5. **Add Configs:** Write `promptfoo.yaml` for evaluations and `mkdocs.yml` for serving docs.
6. **Implement Bash Script:** Write the Ralph Wiggum loop.
7. **Re-wire Glue Code:** Connect the outputs from the OSS tools into the preserved `prose.js` module.

36
docs/prd.md Normal file

@@ -0,0 +1,36 @@
# Product Requirements Document: Dev Intel V2
## 1. Problem Statement
Dev Intel V2 currently extracts code entities and Helm chart structures to build a unified knowledge graph and generate Diataxis-structured documentation for infrastructure monorepos. While the pipeline performs well for AI agents (93.4% eval score), human engineers are struggling (78.6% eval score) because the generated prose is purely descriptive rather than explanatory. Furthermore, critical infrastructure components like Terraform are missing from the extraction, and architectural flow tracing is non-existent, leaving significant gaps in the generated documentation's usefulness for understanding change impact and structural anomalies.
## 2. User Personas
* **Infrastructure Engineer:** Needs to understand the "why" behind the architecture, trace execution flows across boundaries, and quickly assess the blast radius of changes (e.g., modifying a secret or Helm chart).
* **AI Coding Agent:** Relies on high-fidelity, highly structured knowledge graphs and inlined dependencies to reliably answer questions about the codebase without getting lost in nested wrapper charts.
## 3. Requirements
### Tier 1: Fix What's Broken (Explanation & Accuracy)
* **T1.1: Inline Sub-chart Dependencies:** Wrapper charts must inline their sub-chart dependencies in the index to ensure dependency queries do not fail.
* **T1.2: Explanatory LLM Prose:** Update the LLM enrichment prompts to explain *why* subsystems depend on each other and *why* certain structural anomalies exist (e.g., subsystems with zero functions).
* **T1.3: Architectural Anomaly Resolution:** Documentation must explicitly address and explain structural anomalies in the architecture to improve the current 30% success rate on architectural "why" questions.
### Tier 2: Fill Real Gaps (Coverage & Tracing)
* **T2.1: Terraform Extraction (`extract-terraform.js`):** Implement robust Terraform entity extraction. Currently, only 1 module is detected out of 336 files in `control-core`.
* **T2.2: Auto-Detection of Entry Points:** Implement flow tracing by automatically detecting entry points. Target: Helm Deployments with Services, `main()` in shell scripts, `__main__` in Python, and CI pipelines.
* **T2.3: Change Impact Analysis Interface:** Build a query interface leveraging existing knowledge graph edges to answer change impact questions (e.g., "If I modify `vault-secret`, which charts redeploy?").
## 4. Success Metrics
* **Agent Eval Score:** Maintain > 90%.
* **Human Eval Score:** Increase from 78.6% to > 90%.
* **Terraform Coverage:** Increase from ~0% to > 80% of `control-core` entities extracted.
* **Flow Traces:** Document at least 5 meaningful entry-to-exit execution paths.
## 5. Out of Scope
* Support for new languages outside of the current stack (Python, Go, TypeScript, Shell, HCL/Terraform).
* Interactive UI dashboards (focus remains on markdown generation and query interfaces).
* Modifying the core Diataxis structural framework.
## 6. Dependencies and Risks
* **Risk (LLM Context Limits):** Inlining sub-chart dependencies and expanding explanatory prose could bloat the context window for the evaluating LLM.
* **Dependency:** The change impact query interface relies heavily on the accuracy of the existing graph edges; if current edges are noisy, the impact analysis will be flawed.
* **Dependency:** Terraform extraction requires successfully parsing HCL, whose module resolution can be more complex than what the standard tree-sitter code extraction handles.