feat: repo-agnostic refactor (BMad spec-test-build loop)

- NEW: repo-profiler.js — deterministic archetype detection (Infra, Frontend, Backend, etc.)
- NEW: extract-dynamic.js — generic extractor replacing hardcoded Foxtrot patterns
- NEW: eval-generator.js — dynamic ground-truth question generation from any repo graph
- NEW: specs/bmad-agnostic-refactor-spec.md — full BMad spec with acceptance criteria
- REFACTORED: prose.js — two-pass LLM synthesis with rich context (shared secrets, ports, service refs)
- REFACTORED: sysdoc.js — wired repo-profiler + extract-dynamic, --legacy escape hatch
- REFACTORED: wiggum-v2.sh — uses eval-generator before benchmarks
- FIXED: graph.js — _edgeSet rebuilt on loadSnapshot() (edge dedup was broken)
- FIXED: graph.js — recursive sortKeys() for deep equality in diffing
- FIXED: prose.js — robust JSON array extraction from LLM output
- FIXED: ratchet.js — syntax validation (node --check) before saving LLM mutations
- FIXED: extract-dynamic.js — centralized state services regex, added console.warn for silent failures
- TESTS: test-eval-generator, test-repo-profiler, test-synthesis-quality + mock fixtures

Eval: 81.5% on Foxtrot (fully repo-agnostic, no hardcoded reference pages)
BMad reviews: Architect B+, Dev Lead B-, TEA B-
This commit is contained in:
Jarvis Prime
2026-03-11 14:40:31 +00:00
parent 15fb1a753b
commit b8403be96c
26 changed files with 4653 additions and 1037 deletions

@@ -0,0 +1,30 @@
# Implementation Plan: Repo-Agnostic Synthesis
## 1. Data Collection & Formatting
We already have the data in `sysdoc.js`:
- `deepData`: Output of `extract-deep.js` (addons, tfConfigs, scriptParams, helmValues, stateServices).
- `helmGraph`: The Helm dependencies and charts.
- `patterns`: System patterns, layers, sync waves.
- `subs`: The extracted subsystems.
- *Action*: In `sysdoc.js`, format this raw data into a large stringified JSON or Markdown list to serve as context for the synthesis LLM.
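
A minimal sketch of that formatting step, assuming the four inputs are plain JSON-serializable values. The helper name `buildExtractedContext` and the section titles are hypothetical and do not come from `sysdoc.js`:

```javascript
// Hypothetical helper: the name and output layout are assumptions for
// illustration, not the actual sysdoc.js code.
function buildExtractedContext({ deepData = {}, helmGraph = {}, patterns = {}, subs = [] }) {
  const sections = [
    ['Deep extraction', deepData],
    ['Helm graph', helmGraph],
    ['Patterns', patterns],
    ['Subsystems', subs],
  ];
  return sections
    // Skip empty objects/arrays so the prompt is not padded with noise.
    .filter(([, value]) => value && Object.keys(value).length > 0)
    // One Markdown heading per source, with the raw data as pretty-printed JSON.
    .map(([title, value]) => `## ${title}\n${JSON.stringify(value, null, 2)}`)
    .join('\n\n');
}
```

Dropping empty sections keeps the context short for repos where some extractors find nothing.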
## 2. LLM Synthesis Module (`prose.js`)
Create a new exported function: `synthesizeReferencePages(extractedContext, outDir, llmOpts)`.
This function will make parallel or sequential LLM calls to generate specific reference topics based on the extracted context.
**Prompts:**
* `network-architecture.md`: Focus on CIDR allocations, VPCs, network routing, NAT, bastions found in the `tfConfigs` and `helmValues`.
* `operations.md`: Focus on CI/CD pipelines, Jenkins jobs, branch mappings, timeout parameters, and deployment flows found in `scriptParams` and repo patterns.
* `configuration.md`: Focus on config merge orders, region code logic, identifiers, naming conventions, and default values found in `helmValues` and `tfConfigs`.
* `dependencies.md`: Focus on vertical layer dependencies, Helm chart consumption (e.g., runtime consuming common), and cross-repo PR dependencies.
* `index.md`: An LLM call that takes summaries of the 4 generated pages and produces a keyword-rich routing table.
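
The prompt list above could be driven by a small topic table, sketched here with sequential calls. This is only an illustration: it assumes the LLM client is injected as `llmOpts.callLLM(prompt)` and returns a `Promise<string>`, which the plan does not specify, and the prompt wording is a placeholder.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Topic list taken from the plan; the focus strings are shortened paraphrases.
const REFERENCE_TOPICS = [
  { filename: 'network-architecture.md', focus: 'CIDR allocations, VPCs, routing, NAT, bastions' },
  { filename: 'operations.md', focus: 'CI/CD pipelines, Jenkins jobs, branch mappings, timeouts' },
  { filename: 'configuration.md', focus: 'config merge order, region codes, naming, defaults' },
  { filename: 'dependencies.md', focus: 'layer dependencies, Helm chart consumption, cross-repo PRs' },
];

// ASSUMPTION: llmOpts.callLLM(prompt) resolves to the generated Markdown string.
async function synthesizeReferencePages(extractedContext, outDir, llmOpts) {
  await fs.mkdir(outDir, { recursive: true });
  const summaries = [];
  for (const { filename, focus } of REFERENCE_TOPICS) {
    const prompt =
      `Using only the facts below, write ${filename}. Focus on: ${focus}.\n\n${extractedContext}`;
    const markdown = await llmOpts.callLLM(prompt);
    await fs.writeFile(path.join(outDir, filename), markdown, 'utf8');
    // Use the first line of each page as a crude summary for the index prompt.
    summaries.push(`${filename}: ${markdown.split('\n')[0]}`);
  }
  const indexPrompt =
    `Produce a keyword-rich routing table for these pages:\n${summaries.join('\n')}`;
  const indexMarkdown = await llmOpts.callLLM(indexPrompt);
  await fs.writeFile(path.join(outDir, 'index.md'), indexMarkdown, 'utf8');
  return [...REFERENCE_TOPICS.map((t) => t.filename), 'index.md'];
}
```

Sequential calls are the simpler of the two options the plan mentions. The four page calls are independent, so they could be run with `Promise.all` if latency matters.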
## 3. Pipeline Update (`sysdoc.js`)
At the end of the `generateDocs` function, once the basic files have been written and before the final write, check whether `opts.prose` is true. If so, call `await proseMod.synthesizeReferencePages(extractedContext, referenceDir, llmOpts)`.
## 4. Cleanup
- `rm generate-reference-pages.js`
- Edit `wiggum-v2.sh` to remove the call to `generate-reference-pages.js`.
## 5. Execution
Run `wiggum-v2.sh` to generate the docs dynamically, then trigger the agent evaluation. The score should remain high without relying on hardcoded reference pages.

@@ -0,0 +1,21 @@
# Spec: Repo-Agnostic Reference Page Synthesis
## Context
The Dev-Intel V2 pipeline currently uses a highly bespoke script (`generate-reference-pages.js`) to generate core reference documentation (`network-architecture.md`, `operations.md`, `configuration.md`, `dependencies.md`, `index.md`). This script hardcodes Foxtrot-specific facts (e.g., CIDR ranges, ArgoCD deployment flows, branch mappings) instead of deriving them from the codebase.
This renders the pipeline incapable of documenting other Reltio repositories (e.g., AnyCloud, BCE) without manual intervention.
## Objective
Refactor the reference page generation to be completely repository-agnostic. The system must extract raw facts from the source code (using existing structural extractors) and use an LLM to synthesize those facts into human- and agent-readable reference pages dynamically.
## Requirements
1. **Remove Hardcoding**: Delete `generate-reference-pages.js` completely.
2. **Generic Fact Extraction**: Ensure the existing `extract-deep.js`, `extract-helm.js`, and `sysdoc.js` patterns are collected into a single context object.
3. **LLM Synthesis**: Create a new function in `prose.js` (e.g., `synthesizeReferencePages(facts, outDir)`) that uses `opus-think` or standard models to generate the 4 core reference pages based *only* on the extracted facts.
4. **Dynamic Index**: Generate the `reference/index.md` file dynamically using the LLM to map the generated pages to their topics.
5. **Pipeline Integration**: Update `sysdoc.js` to call the new synthesis function, passing the extracted data (`deepData`, `patterns`, `subs`).
6. **Execution Script**: Update `wiggum-v2.sh` to reflect the removal of the bespoke script.
## Success Criteria
- Running `wiggum-v2.sh` generates `network-architecture.md`, `operations.md`, `configuration.md`, and `dependencies.md` without using hardcoded strings.
- The output format must still meet the evaluation standards (targeting >77% on the Confluence benchmark).
- The code must be capable of running against any arbitrary repository and producing relevant reference pages based on what it finds.

@@ -0,0 +1,79 @@
# BMad Spec: Dev-Intel V2 Repo-Agnostic Refactor
## 1. Problem Statement
The Dev-Intel V2 pipeline currently possesses a fatal flaw: it is severely overfit to the "Foxtrot" infrastructure monorepo. While the AST parsing (`extract.js`) and graph construction (`graph.js`, `subsystem.js`) are generic (~40% of the codebase), the deep extraction, synthesis, and evaluation layers (~60%) are entirely bespoke to Foxtrot's specific tech stack and naming conventions.
**What breaks when pointing at a non-Foxtrot repo:**
- **Extraction (`extract-deep.js`, `extract-patterns.js`)**: Hardcodes regexes for `vpc_cidr`, `product_id`, `ou_id`, EKS addon block formats, AWS/GCP region names, and specific state services (`elasticsearch`, `redis`, `cassandra`). A non-infra repo (e.g., a frontend React app or a Java microservice) yields zero deep insights. `LAYER_PATTERNS` is hardcoded to `app`, `compute`, `network`.
- **Synthesis (`prose.js`)**: The `synthesizeReferencePages` function hardcodes prompts expecting CIDR allocations, VPCs, and Jenkins jobs, and hardcodes the output files (`network-architecture.md`, `operations.md`, `configuration.md`, `dependencies.md`).
- **Evaluation (`eval-questions.js`)**: Ground-truth questions are explicitly hardcoded to ask about `mdm-app`, `cassandra`, `jenkins`, `vault-secret`. Running the eval against any other repo results in a 0% score because the questions are invalid for that repo.
## 2. Architecture
The refactored pipeline shifts from a static, rule-based extraction/generation model to a dynamic, LLM-guided schema discovery model.
**Pipeline Flow:**
1. **Generic Extraction (`extract.js`, `extract-helm.js`)**: Stays largely the same. Extracts ASTs, dependencies, and resources.
2. **Semantic Profiling (`repo-profiler.js` - NEW)**: Before deep extraction, an LLM analyzes the graph and root configuration files (e.g., `package.json`, `Chart.yaml`, `go.mod`) to determine the repository's "Archetype" (e.g., Infrastructure, Frontend SPA, Backend Microservices, Data Pipeline).
3. **Dynamic Deep Extraction (`extract-dynamic.js` - REPLACES `extract-deep.js` and `extract-patterns.js`)**: Based on the archetype, generic heuristics and LLM prompts scan for archetype-specific configuration surfaces, state boundaries, and network contracts.
4. **Adaptive Synthesis (`prose.js`)**: `synthesizeReferencePages` dynamically determines which reference pages to generate. It asks the LLM: "Given these extracted facts and this repo archetype, what are the 3-5 most critical reference topics?" It then generates those pages (e.g., `ui-components.md` for a frontend, instead of `network-architecture.md`).
5. **Generative Evaluation (`eval-generator.js` - REPLACES `eval-questions.js`)**: The question bank is no longer hardcoded. An LLM agent generates valid, repo-specific Q&A pairs by reading the generated AST graph and code snippets, establishing a dynamic ground truth for the agent-browsing benchmark.
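
The profiling step (step 2) is specified as LLM-driven, while the commit summary describes `repo-profiler.js` as deterministic. One way to reconcile the two is a deterministic pre-pass over file paths, sketched below. The marker rules and the function name `detectArchetype` are assumptions for illustration, not the actual profiler logic:

```javascript
// ASSUMPTION: marker rules and archetype labels are illustrative only.
const ARCHETYPE_MARKERS = [
  { archetype: 'Infrastructure', test: (f) => f.endsWith('.tf') || f.endsWith('Chart.yaml') },
  { archetype: 'Frontend SPA', test: (f) => /\.(jsx|tsx|vue)$/.test(f) },
  { archetype: 'Backend Microservices', test: (f) => /(^|\/)(go\.mod|pom\.xml)$/.test(f) },
];

// Count marker files per archetype and return the best-scoring label.
function detectArchetype(filePaths) {
  const scores = ARCHETYPE_MARKERS.map(({ archetype, test }) => ({
    archetype,
    score: filePaths.filter(test).length,
  }));
  // Array.prototype.sort is stable, so ties keep the order of ARCHETYPE_MARKERS.
  scores.sort((a, b) => b.score - a.score);
  return scores[0].score > 0 ? scores[0].archetype : 'Unknown';
}
```

Returning `'Unknown'` instead of throwing keeps the profiler in line with the resilience requirement for unfamiliar repos.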
**Module Boundaries:**
- **Extractor Layer**: Purely deterministic AST/YAML/HCL parsing. No repo-specific logic.
- **Context/Profile Layer**: LLM-driven determination of what the repo *is* and what matters.
- **Synthesis Layer**: Transforms context into Divio-structured Markdown dynamically.
- **Eval Layer**: Independent subsystem that generates tests from the raw graph, then tests the agent against the synthesized docs.
## 3. Acceptance Criteria
1. **No Hardcoded Values**: Zero occurrences of Foxtrot-specific strings (`vpc_cidr`, `elasticsearch`, `mdm-app`, AWS regions) in pipeline source code.
2. **Dynamic Outputs**: `sysdoc.js` successfully generates a different set of reference markdown files depending on the repo (e.g., must not generate `network-architecture.md` for a pure frontend repo).
3. **Repo-Agnostic Eval**: Running `eval-generator.js` against an arbitrary open-source repo (e.g., `expressjs/express` or a generic Helm chart) produces ≥ 20 valid, specific ground-truth questions.
4. **Threshold Met**: The pipeline runs on Foxtrot and achieves ≥ 77% on the generated eval, AND runs on a test non-Foxtrot repo (e.g., BCE or AnyCloud) and achieves ≥ 70% on its respective generated eval.
5. **Resilience**: Pipeline does not crash or throw unhandled exceptions when encountering unknown languages or missing configuration files.
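
Criterion 1 can be checked mechanically. The sketch below scans source text for the strings named in the criterion. The helper name `findHardcodedStrings` and the input shape (`{ path: sourceText }`) are hypothetical, and AWS region names are left out because the spec does not enumerate them:

```javascript
// Strings quoted in acceptance criterion 1.
const FORBIDDEN = ['vpc_cidr', 'elasticsearch', 'mdm-app'];

// Hypothetical helper: takes { filePath: sourceText } and returns every hit
// as { file, line, needle }, with 1-based line numbers.
function findHardcodedStrings(sources, forbidden = FORBIDDEN) {
  const hits = [];
  for (const [file, text] of Object.entries(sources)) {
    text.split('\n').forEach((line, i) => {
      for (const needle of forbidden) {
        if (line.includes(needle)) hits.push({ file, line: i + 1, needle });
      }
    });
  }
  return hits;
}
```

An empty result means the criterion holds for the scanned files. Test fixtures that mention these strings on purpose would need to be excluded by the caller.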
## 4. Test-First Plan
Before changing the implementation, the following tests must be established:
1. **Repo-Agnostic Eval Question Generation (Unit/Integration)**
- **Test**: Run `eval-generator.js` (to be written) against a mock "Microservice" repo graph and a mock "Infra" repo graph.
- **Assert**: Verify that generated questions do not reference Foxtrot artifacts, and that the answers are strictly derived from the provided graph.
2. **Synthesis Quality Tests (Unit)**
- **Test**: Pass a mock context (e.g., a React frontend archetype) to `synthesizeReferencePages`.
- **Assert**: Verify the LLM determines appropriate page titles (e.g., `components.md`, `state-management.md`) and does not output infra-specific pages.
3. **Pipeline Integration Tests (E2E)**
- **Test**: Execute `wiggum-v2.sh` against a tiny, non-Foxtrot fixture repository (e.g., a simple Node.js Express API).
- **Assert**: Docs are generated without errors. The generated index maps to valid, generated reference files.
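
The assertions in test 1 could be expressed as a validator like the one below. It is a sketch under simplifying assumptions: questions are `{ question, answer }` objects, the graph is reduced to a list of node names, and "derived from the graph" is approximated as "the answer is a node name". None of these shapes are confirmed by the spec.

```javascript
// Foxtrot-specific terms quoted in the problem statement.
const FOXTROT_TERMS = ['mdm-app', 'cassandra', 'jenkins', 'vault-secret'];

// ASSUMPTION: question shape and graph representation are illustrative.
function validateQuestions(questions, graphNodeNames) {
  const known = new Set(graphNodeNames.map((n) => n.toLowerCase()));
  const errors = [];
  for (const { question, answer } of questions) {
    const text = `${question} ${answer}`.toLowerCase();
    // A Foxtrot term is only a leak if the repo under test does not contain it.
    const leaked = FOXTROT_TERMS.filter((t) => text.includes(t) && !known.has(t));
    if (leaked.length > 0) {
      errors.push(`Foxtrot artifact in "${question}": ${leaked.join(', ')}`);
    }
    if (!known.has(String(answer).toLowerCase())) {
      errors.push(`Answer not in graph: "${answer}"`);
    }
  }
  return errors;
}
```

A unit test would then assert that `validateQuestions(generated, mockGraphNodes)` returns an empty array for both the mock "Microservice" and mock "Infra" graphs.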
## 5. Implementation Plan
**Step 1: Overhaul Evaluation (The Yardstick)**
- Delete hardcoded questions in `eval-questions.js`.
- Write `eval-generator.js` that uses `callLLM` to generate ground truth questions from `GraphStore` and `discoverCharts`.
- Manually verify the generated questions for Foxtrot are high quality.
**Step 2: Abstract Deep Extraction**
- Deprecate `extract-deep.js` and `extract-patterns.js`.
- Create `repo-profiler.js` to establish the Repo Archetype.
- Create `extract-dynamic.js` that uses LLM prompts to extract state services, config surfaces, and architectural patterns generically based on the Archetype.
**Step 3: Dynamic Synthesis**
- Modify `prose.js` -> `synthesizeReferencePages`.
- Implement a two-pass LLM prompt:
1. "What 4 reference pages should be created for this repo?" -> Returns JSON array of `{ title, filename, focus }`.
2. For each page, generate the markdown content using the extracted context.
- Update `sysdoc.js` to dynamically write these files instead of hardcoding the filenames.
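
Pass 1 returns JSON embedded in free-form LLM output, so the plan needs tolerant parsing and validation before any file is written. The following is a minimal sketch with hypothetical helper names (`extractJsonArray`, `parsePagePlan`) and is not the code in `prose.js`:

```javascript
// Take the span from the first '[' to the last ']' and try to parse it.
// Returns null when no parseable array is found.
function extractJsonArray(llmOutput) {
  const start = llmOutput.indexOf('[');
  const end = llmOutput.lastIndexOf(']');
  if (start === -1 || end <= start) return null;
  try {
    const parsed = JSON.parse(llmOutput.slice(start, end + 1));
    return Array.isArray(parsed) ? parsed : null;
  } catch {
    return null;
  }
}

// Keep only well-formed { title, filename, focus } entries. The filename
// pattern rejects path separators so the LLM cannot write outside outDir.
function parsePagePlan(llmOutput) {
  const pages = extractJsonArray(llmOutput) ?? [];
  return pages
    .filter((p) => p && typeof p.title === 'string' && typeof p.focus === 'string')
    .filter((p) => /^[a-z0-9-]+\.md$/.test(p.filename ?? ''));
}
```

Pass 2 can then iterate over the returned entries. If the plan comes back empty, falling back to a fixed default page list is one option, although the spec does not define a fallback.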
**Step 4: Script Cleanup**
- Update `wiggum-v2.sh` to trigger `eval-generator.js` before running the agent benchmark.
- Remove any remaining bespoke scripts.
**Step 5: Run & Tune**
- Run the full loop on Foxtrot. Tune prompts until the score meets the 77% acceptance threshold.
- Run the full loop on a secondary repo. Tune prompts until the score meets the 70% acceptance threshold.
## 6. Risk Assessment
- **LLM Quality Variance**: Relying on the LLM to dynamically determine reference pages and extract facts makes output quality vary between runs and models, and also increases token usage and latency. *Mitigation: Use strong models (Sonnet/Opus) for schema/page definition, use Haiku for bulk prose generation. Implement heavy JSON-schema enforcement for extraction.*
- **Extraction Gaps for Non-Infra Repos**: The current AST extractor may not capture enough semantic meaning for frontend/backend repos compared to Helm/TF, leading to thin docs. *Mitigation: Ensure `extract.js` captures standard imports and package dependencies correctly to give the LLM enough context.*
- **Eval Score Regression**: Foxtrot scores might drop because the eval questions are generated dynamically and might be harder or more ambiguous than the hardcoded ones. *Mitigation: The `eval-generator.js` must instruct the LLM to generate highly specific, "exact match" or "list" type questions to prevent subjective scoring failures.*
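
Restricting the eval to "exact match" and "list" questions allows deterministic scoring. The sketch below shows one possible scorer. The question shape (`{ type, expected }`) and the use of recall for list questions are assumptions, not part of the spec:

```javascript
// ASSUMPTION: question shape { type: 'exact' | 'list', expected } is illustrative.
const normalize = (s) => String(s).trim().toLowerCase();

// Returns a score in [0, 1]: 0 or 1 for exact questions, and the fraction of
// expected items found (recall) for list questions.
function scoreAnswer(question, agentAnswer) {
  if (question.type === 'exact') {
    return normalize(agentAnswer) === normalize(question.expected) ? 1 : 0;
  }
  if (question.type === 'list') {
    const expected = new Set(question.expected.map(normalize));
    // Accept either a single string or an array from the agent.
    const given = new Set([].concat(agentAnswer).map(normalize));
    const hits = [...expected].filter((item) => given.has(item)).length;
    return expected.size === 0 ? 0 : hits / expected.size;
  }
  throw new Error(`Unsupported question type: ${question.type}`);
}
```

Because no LLM judge is involved, two runs over the same answers always produce the same score.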