feat: repo-agnostic refactor (BMad spec-test-build loop)

- NEW: repo-profiler.js — deterministic archetype detection (Infra, Frontend, Backend, etc.)
- NEW: extract-dynamic.js — generic extractor replacing hardcoded Foxtrot patterns
- NEW: eval-generator.js — dynamic ground-truth question generation from any repo graph
- NEW: specs/bmad-agnostic-refactor-spec.md — full BMad spec with acceptance criteria
- REFACTORED: prose.js — two-pass LLM synthesis with rich context (shared secrets, ports, service refs)
- REFACTORED: sysdoc.js — wired repo-profiler + extract-dynamic, --legacy escape hatch
- REFACTORED: wiggum-v2.sh — uses eval-generator before benchmarks
- FIXED: graph.js — _edgeSet rebuilt on loadSnapshot() (edge dedup was broken)
- FIXED: graph.js — recursive sortKeys() for deep equality in diffing
- FIXED: prose.js — robust JSON array extraction from LLM output
- FIXED: ratchet.js — syntax validation (node --check) before saving LLM mutations
- FIXED: extract-dynamic.js — centralized state services regex, added console.warn for silent failures
- TESTS: test-eval-generator, test-repo-profiler, test-synthesis-quality + mock fixtures

Eval: 81.5% on Foxtrot (fully repo-agnostic, no hardcoded reference pages)
BMad reviews: Architect B+, Dev Lead B-, TEA B-
30
specs/agnostic-synthesis-plan.md
Normal file
@@ -0,0 +1,30 @@
# Implementation Plan: Repo-Agnostic Synthesis

## 1. Data Collection & Formatting
We already have the data in `sysdoc.js`:
- `deepData`: Output of `extract-deep.js` (addons, tfConfigs, scriptParams, helmValues, stateServices).
- `helmGraph`: The Helm dependencies and charts.
- `patterns`: System patterns, layers, sync waves.
- `subs`: The extracted subsystems.
- *Action*: In `sysdoc.js`, format this raw data into a large stringified JSON or Markdown list to serve as context for the synthesis LLM.
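The formatting action above could look roughly like the following. This is a minimal sketch, not the code in `sysdoc.js`: the helper name `buildSynthesisContext`, the section labels, and the `maxChars` cap are all assumptions made for illustration.

```javascript
// Hypothetical helper: the name, the section labels, and the size cap are
// assumptions for illustration, not part of the existing sysdoc.js.
function buildSynthesisContext({ deepData, helmGraph, patterns, subs }, maxChars = 60000) {
  const sections = [
    ['Deep extraction (extract-deep.js)', deepData],
    ['Helm graph', helmGraph],
    ['Patterns', patterns],
    ['Subsystems', subs],
  ];
  const parts = [];
  for (const [title, value] of sections) {
    if (value === undefined || value === null) continue; // skip missing inputs
    parts.push(`## ${title}\n${JSON.stringify(value, null, 2)}`);
  }
  const text = parts.join('\n\n');
  // Truncate rather than fail if the context grows past the budget.
  return text.length > maxChars ? text.slice(0, maxChars) + '\n[truncated]' : text;
}
```

Labelling each block with a heading keeps the context readable for the model and makes it easy to see which extractor contributed which facts.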

## 2. LLM Synthesis Module (`prose.js`)
Create a new exported function: `synthesizeReferencePages(extractedContext, outDir, llmOpts)`.
This function makes parallel or sequential LLM calls to generate specific reference topics from the extracted context.

**Prompts:**
* `network-architecture.md`: Focus on CIDR allocations, VPCs, network routing, NAT, and bastions found in `tfConfigs` and `helmValues`.
* `operations.md`: Focus on CI/CD pipelines, Jenkins jobs, branch mappings, timeout parameters, and deployment flows found in `scriptParams` and repo patterns.
* `configuration.md`: Focus on config merge orders, region code logic, identifiers, naming conventions, and default values found in `helmValues` and `tfConfigs`.
* `dependencies.md`: Focus on vertical layer dependencies, Helm chart consumption (e.g., runtime consuming common), and PR cross-repo dependencies.
* `index.md`: An LLM call that takes summaries of the four generated pages and produces a keyword-rich routing table.
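One way the prompt list above might translate into code is sketched here. It is illustrative only: `synthesizePages` is a hypothetical name, `callLLM` is injected and assumed to be an async `(prompt, llmOpts) => string` function, and the prompt wording and 200-character summaries are placeholders rather than the prompts `prose.js` uses.

```javascript
// Sketch only: `callLLM` is injected and assumed to be an async function
// (prompt, llmOpts) => string. The prompt wording is illustrative.
const PAGE_FOCUS = {
  'network-architecture.md': 'CIDR allocations, VPCs, network routing, NAT, and bastions',
  'operations.md': 'CI/CD pipelines, Jenkins jobs, branch mappings, timeout parameters, and deployment flows',
  'configuration.md': 'config merge orders, region code logic, identifiers, naming conventions, and default values',
  'dependencies.md': 'vertical layer dependencies, Helm chart consumption, and PR cross-repo dependencies',
};

async function synthesizePages(extractedContext, llmOpts, callLLM) {
  const names = Object.keys(PAGE_FOCUS);
  // The four topic pages are independent, so they can be generated in parallel.
  const bodies = await Promise.all(names.map((name) =>
    callLLM(`Write ${name}. Focus on ${PAGE_FOCUS[name]}.\nUse only these facts:\n${extractedContext}`, llmOpts)));
  const pages = {};
  names.forEach((name, i) => { pages[name] = bodies[i]; });
  // index.md depends on the other pages, so it runs last.
  const summaries = names.map((n) => `${n}: ${pages[n].slice(0, 200)}`).join('\n');
  pages['index.md'] = await callLLM(`Produce a keyword-rich routing table for these pages:\n${summaries}`, llmOpts);
  return pages; // filename -> markdown; the caller writes them under outDir
}
```

Returning a filename-to-markdown map instead of writing files directly keeps the sketch testable with a fake `callLLM`.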

## 3. Pipeline Update (`sysdoc.js`)
Near the end of the `generateDocs` function (either just before the final files are written or just after the basic ones are), check whether `opts.prose` is true. If so, call `await proseMod.synthesizeReferencePages(extractedContext, referenceDir, llmOpts)`.
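The gating step might be wrapped as below. This is a sketch under stated assumptions: the helper name `maybeSynthesize` is hypothetical, the rest of `generateDocs` is omitted, and the `try`/`catch` that downgrades a synthesis failure to a warning is a design suggestion rather than something the plan specifies.

```javascript
// Sketch: `proseMod` and `extractedContext` mirror the names used in the plan;
// the helper name and the error handling are assumptions for illustration.
async function maybeSynthesize(opts, proseMod, extractedContext, referenceDir, llmOpts) {
  if (!opts || !opts.prose) return false; // prose generation is opt-in
  try {
    await proseMod.synthesizeReferencePages(extractedContext, referenceDir, llmOpts);
    return true;
  } catch (err) {
    // A failed LLM call should not discard the deterministic docs already written.
    console.warn(`reference page synthesis failed: ${err.message}`);
    return false;
  }
}
```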

## 4. Cleanup
- `rm generate-reference-pages.js`
- Edit `wiggum-v2.sh` to remove the call to `generate-reference-pages.js`.

## 5. Execution
Run `wiggum-v2.sh` to generate the docs dynamically, then trigger the agent evaluation. The score should remain high without relying on hardcoded, Foxtrot-specific answers.
21
specs/agnostic-synthesis-spec.md
Normal file
@@ -0,0 +1,21 @@
# Spec: Repo-Agnostic Reference Page Synthesis

## Context
The Dev-Intel V2 pipeline currently uses a highly bespoke script (`generate-reference-pages.js`) to generate core reference documentation (`network-architecture.md`, `operations.md`, `configuration.md`, `dependencies.md`, `index.md`). This script hardcodes Foxtrot-specific facts (e.g., CIDR ranges, ArgoCD deployment flows, branch mappings) instead of deriving them from the codebase.
This renders the pipeline incapable of documenting other Reltio repositories (e.g., AnyCloud, BCE) without manual intervention.

## Objective
Refactor the reference page generation to be completely repository-agnostic. The system must extract raw facts from the source code (using existing structural extractors) and use an LLM to synthesize those facts into human- and agent-readable reference pages dynamically.

## Requirements
1. **Remove Hardcoding**: Delete `generate-reference-pages.js` completely.
2. **Generic Fact Extraction**: Ensure the output of the existing `extract-deep.js` and `extract-helm.js` extractors and the patterns computed in `sysdoc.js` are collected into a single context object.
3. **LLM Synthesis**: Create a new function in `prose.js` (e.g., `synthesizeReferencePages(facts, outDir)`) that uses `opus-think` or standard models to generate the four core reference pages based *only* on the extracted facts.
4. **Dynamic Index**: Generate the `reference/index.md` file dynamically using the LLM to map the generated pages to their topics.
5. **Pipeline Integration**: Update `sysdoc.js` to call the new synthesis function, passing the extracted data (`deepData`, `patterns`, `subs`).
6. **Execution Script**: Update `wiggum-v2.sh` to reflect the removal of the bespoke script.

## Success Criteria
- Running `wiggum-v2.sh` generates `network-architecture.md`, `operations.md`, `configuration.md`, and `dependencies.md` without using hardcoded strings.
- The output format must still meet the evaluation standards (targeting >77% on the Confluence benchmark).
- The code must run against any repository and produce reference pages relevant to what it finds there.
79
specs/bmad-agnostic-refactor-spec.md
Normal file
@@ -0,0 +1,79 @@
# BMad Spec: Dev-Intel V2 Repo-Agnostic Refactor

## 1. Problem Statement
The Dev-Intel V2 pipeline has a fatal flaw: it is severely overfit to the "Foxtrot" infrastructure monorepo. While the AST parsing (`extract.js`) and graph construction (`graph.js`, `subsystem.js`) are generic (~40% of the codebase), the deep extraction, synthesis, and evaluation layers (~60%) are entirely bespoke to Foxtrot's specific tech stack and naming conventions.

**What breaks when pointing at a non-Foxtrot repo:**
- **Extraction (`extract-deep.js`, `extract-patterns.js`)**: Hardcodes regexes for `vpc_cidr`, `product_id`, `ou_id`, EKS addon block formats, AWS/GCP region names, and specific state services (`elasticsearch`, `redis`, `cassandra`). A non-infra repo (e.g., a frontend React app or a Java microservice) yields zero deep insights. `LAYER_PATTERNS` is hardcoded to `app`, `compute`, `network`.
- **Synthesis (`prose.js`)**: The `synthesizeReferencePages` function hardcodes prompts expecting CIDR allocations, VPCs, and Jenkins jobs, and hardcodes the output files (`network-architecture.md`, `operations.md`, `configuration.md`, `dependencies.md`).
- **Evaluation (`eval-questions.js`)**: Ground-truth questions are explicitly hardcoded to ask about `mdm-app`, `cassandra`, `jenkins`, `vault-secret`. Running the eval against any other repo results in a 0% score because the questions are invalid for that repo.

## 2. Architecture
The refactored pipeline shifts from a static, rule-based extraction/generation model to a dynamic, LLM-guided schema discovery model.

**Pipeline Flow:**
1. **Generic Extraction (`extract.js`, `extract-helm.js`)**: Stays largely the same. Extracts ASTs, dependencies, and resources.
2. **Semantic Profiling (`repo-profiler.js` - NEW)**: Before deep extraction, an LLM analyzes the graph and root configuration files (e.g., `package.json`, `Chart.yaml`, `go.mod`) to determine the repository's "Archetype" (e.g., Infrastructure, Frontend SPA, Backend Microservices, Data Pipeline).
3. **Dynamic Deep Extraction (`extract-dynamic.js` - REPLACES `extract-deep.js` and `extract-patterns.js`)**: Based on the archetype, generic heuristics and LLM prompts scan for archetype-specific configuration surfaces, state boundaries, and network contracts.
4. **Adaptive Synthesis (`prose.js`)**: `synthesizeReferencePages` dynamically determines which reference pages to generate. It asks the LLM: "Given these extracted facts and this repo archetype, what are the 3-5 most critical reference topics?" It then generates those pages (e.g., `ui-components.md` for a frontend, instead of `network-architecture.md`).
5. **Generative Evaluation (`eval-generator.js` - REPLACES `eval-questions.js`)**: The question bank is no longer hardcoded. An LLM agent generates valid, repo-specific Q&A pairs by reading the generated AST graph and code snippets, establishing a dynamic ground truth for the agent-browsing benchmark.
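The profiling step can be illustrated with a small deterministic scorer over file paths, in the spirit of the "deterministic archetype detection" mentioned in the commit message. This is a sketch under stated assumptions: the marker table, the archetype labels, and the `detectArchetype` name are invented for illustration and are not the rules used by `repo-profiler.js`.

```javascript
// Hypothetical marker table: the file patterns and archetype labels are assumptions
// chosen for illustration, not the rules used by repo-profiler.js.
const ARCHETYPE_MARKERS = {
  Infrastructure: [/\.tf$/, /(^|\/)Chart\.yaml$/, /(^|\/)values\.yaml$/],
  Frontend: [/\.(jsx|tsx|vue)$/, /(^|\/)vite\.config\.[jt]s$/],
  Backend: [/(^|\/)go\.mod$/, /(^|\/)pom\.xml$/, /(^|\/)Dockerfile$/],
};

function detectArchetype(filePaths) {
  const scores = {};
  for (const [archetype, patterns] of Object.entries(ARCHETYPE_MARKERS)) {
    // Count the files that match at least one marker for this archetype.
    scores[archetype] = filePaths.filter((p) => patterns.some((re) => re.test(p))).length;
  }
  // Highest score wins; ties keep the first archetype in table order.
  let best = 'Unknown';
  let bestScore = 0;
  for (const [archetype, score] of Object.entries(scores)) {
    if (score > bestScore) { best = archetype; bestScore = score; }
  }
  return { archetype: best, scores };
}
```

Returning the per-archetype scores alongside the winner leaves room for a later stage (for example an LLM pass) to break ties or handle mixed repositories.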

**Module Boundaries:**
- **Extractor Layer**: Purely deterministic AST/YAML/HCL parsing. No repo-specific logic.
- **Context/Profile Layer**: LLM-driven determination of what the repo *is* and what matters.
- **Synthesis Layer**: Transforms context into Divio-structured Markdown dynamically.
- **Eval Layer**: Independent subsystem that generates tests from the raw graph, then tests the agent against the synthesized docs.

## 3. Acceptance Criteria
1. **No Hardcoded Values**: Zero occurrences of Foxtrot-specific strings (`vpc_cidr`, `elasticsearch`, `mdm-app`, AWS regions) in pipeline source code.
2. **Dynamic Outputs**: `sysdoc.js` generates a different set of reference Markdown files depending on the repo (e.g., it must not generate `network-architecture.md` for a pure frontend repo).
3. **Repo-Agnostic Eval**: Running `eval-generator.js` against an arbitrary open-source repo (e.g., `expressjs/express` or a generic Helm chart) produces at least 20 valid, specific ground-truth questions.
4. **Threshold Met**: The pipeline runs on Foxtrot and achieves at least 77% on the generated eval, and runs on a non-Foxtrot test repo (e.g., BCE or AnyCloud) and achieves at least 70% on that repo's generated eval.
5. **Resilience**: The pipeline does not crash or throw unhandled exceptions when it encounters unknown languages or missing configuration files.
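Criterion 1 lends itself to an automated check. The sketch below scans in-memory source text for the strings named above; the function name `findHardcodedStrings` and the region regex are assumptions about what "AWS regions" should match, and a real check would also need to read the pipeline files from disk.

```javascript
// Illustrative list taken from the strings named in criterion 1; the region
// pattern is an assumption about what "AWS regions" should match.
const FORBIDDEN = [
  /vpc_cidr/,
  /elasticsearch/,
  /mdm-app/,
  /\b(us|eu|ap)-(east|west|central|south|north)\w*-\d\b/,
];

// `sources` maps a file path to its text, so the check can run without touching disk.
function findHardcodedStrings(sources) {
  const hits = [];
  for (const [file, text] of Object.entries(sources)) {
    text.split('\n').forEach((line, i) => {
      for (const re of FORBIDDEN) {
        if (re.test(line)) hits.push({ file, line: i + 1, pattern: re.source });
      }
    });
  }
  return hits;
}
```

An empty result would indicate the criterion is met for the scanned files; test fixtures and the `--legacy` code path would likely need to be excluded from such a scan.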

## 4. Test-First Plan
Before changing the implementation, the following tests must be established:

1. **Repo-Agnostic Eval Question Generation (Unit/Integration)**
   - **Test**: Run `eval-generator.js` (to be written) against a mock "Microservice" repo graph and a mock "Infra" repo graph.
   - **Assert**: Verify that generated questions do not reference Foxtrot artifacts, and that the answers are strictly derived from the provided graph.

2. **Synthesis Quality Tests (Unit)**
   - **Test**: Pass a mock context (e.g., a React frontend archetype) to `synthesizeReferencePages`.
   - **Assert**: Verify the LLM determines appropriate page titles (e.g., `components.md`, `state-management.md`) and does not output infra-specific pages.

3. **Pipeline Integration Tests (E2E)**
   - **Test**: Execute `wiggum-v2.sh` against a tiny, non-Foxtrot fixture repository (e.g., a simple Node.js Express API).
   - **Assert**: Docs are generated without errors. The generated index maps to valid, generated reference files.
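The assertion in the first test could be expressed as a validator over the generated questions. This is a sketch only: the question shape (`{ question, answer, sourceNodes }`), the artifact list, and the `validateQuestions` name are assumptions, since the spec does not define the output format of `eval-generator.js`.

```javascript
// Sketch of the first assertion: the question shape ({ question, answer, sourceNodes })
// and the artifact list are assumptions for illustration.
const FOXTROT_ARTIFACTS = ['mdm-app', 'cassandra', 'jenkins', 'vault-secret'];

function validateQuestions(questions, graphNodeIds) {
  const known = new Set(graphNodeIds);
  const problems = [];
  questions.forEach((q, i) => {
    const text = `${q.question} ${q.answer}`.toLowerCase();
    for (const artifact of FOXTROT_ARTIFACTS) {
      // Only a leak if the artifact is absent from the graph under test.
      if (text.includes(artifact) && !known.has(artifact)) problems.push(`q${i}: mentions ${artifact}`);
    }
    const sources = q.sourceNodes || [];
    if (sources.length === 0 || !sources.every((id) => known.has(id))) {
      problems.push(`q${i}: answer not traceable to graph nodes`);
    }
  });
  return problems;
}
```

Requiring every question to cite graph node IDs is one way to make "strictly derived from the provided graph" checkable without calling an LLM in the test.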

## 5. Implementation Plan

**Step 1: Overhaul Evaluation (The Yardstick)**
- Delete hardcoded questions in `eval-questions.js`.
- Write `eval-generator.js` that uses `callLLM` to generate ground truth questions from `GraphStore` and `discoverCharts`.
- Manually verify the generated questions for Foxtrot are high quality.

**Step 2: Abstract Deep Extraction**
- Deprecate `extract-deep.js` and `extract-patterns.js`.
- Create `repo-profiler.js` to establish the Repo Archetype.
- Create `extract-dynamic.js` that uses LLM prompts to extract state services, config surfaces, and architectural patterns generically based on the Archetype.

**Step 3: Dynamic Synthesis**
- Modify `prose.js` -> `synthesizeReferencePages`.
- Implement a two-pass LLM prompt:
  1. "What 4 reference pages should be created for this repo?" -> Returns JSON array of `{ title, filename, focus }`.
  2. For each page, generate the markdown content using the extracted context.
- Update `sysdoc.js` to dynamically write these files instead of hardcoding the filenames.
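The two-pass prompt in Step 3 depends on recovering a JSON array from free-form model output. A minimal sketch follows, assuming an injected async `callLLM(prompt)` that returns text; `extractJsonArray` and `twoPassSynthesis` are hypothetical names, and the filename check is an added safety assumption rather than something the spec requires.

```javascript
// Pull the first parseable JSON array out of free-form LLM text (hypothetical helper).
function extractJsonArray(text) {
  for (let start = text.indexOf('['); start !== -1; start = text.indexOf('[', start + 1)) {
    for (let end = text.lastIndexOf(']'); end > start; end = text.lastIndexOf(']', end - 1)) {
      try {
        const parsed = JSON.parse(text.slice(start, end + 1));
        if (Array.isArray(parsed)) return parsed;
      } catch (_) { /* not valid JSON for this span; try a shorter one */ }
    }
  }
  return null;
}

// Pass 1 asks for a page plan; pass 2 generates each planned page.
// `callLLM` is assumed to be an async (prompt) => string function supplied by the caller.
async function twoPassSynthesis(context, callLLM) {
  const planText = await callLLM(
    `List the reference pages for this repo as a JSON array of {title, filename, focus}:\n${context}`);
  // Keep only entries with a plain "name.md" filename so output cannot escape the docs directory.
  const plan = (extractJsonArray(planText) || [])
    .filter((p) => p && /^[\w-]+\.md$/.test(p.filename));
  const pages = [];
  for (const page of plan) {
    const body = await callLLM(`Write "${page.title}" focusing on ${page.focus}.\nFacts:\n${context}`);
    pages.push({ filename: page.filename, body });
  }
  return pages;
}
```

Scanning bracket pairs and attempting `JSON.parse` on each candidate tolerates code fences and surrounding commentary, at the cost of quadratic work on very long outputs.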

**Step 4: Script Cleanup**
- Update `wiggum-v2.sh` to trigger `eval-generator.js` before running the agent benchmark.
- Remove any remaining bespoke scripts.

**Step 5: Run & Tune**
- Run the full loop on Foxtrot. Tune prompts until the score reaches at least 77%.
- Run the full loop on a secondary repo. Tune prompts until the score reaches at least 70%.

## 6. Risk Assessment
- **LLM Quality Variance**: Relying on the LLM to dynamically determine reference pages and extract facts makes output quality less predictable and increases token usage and latency. *Mitigation: Use strong models (Sonnet/Opus) for schema/page definition, use Haiku for bulk prose generation. Implement heavy JSON-schema enforcement for extraction.*
- **Extraction Gaps for Non-Infra Repos**: The current AST extractor may not capture enough semantic meaning for frontend/backend repos compared to Helm/TF, leading to thin docs. *Mitigation: Ensure `extract.js` captures standard imports and package dependencies correctly to give the LLM enough context.*
- **Eval Score Regression**: Foxtrot scores might drop because the eval questions are generated dynamically and might be harder or more ambiguous than the hardcoded ones. *Mitigation: The `eval-generator.js` must instruct the LLM to generate highly specific, "exact match" or "list" type questions to prevent subjective scoring failures.*
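The "exact match" and "list" question types in the last mitigation can be scored without an LLM judge. The sketch below shows one possible scoring rule; the function names and the normalization (trim, lowercase, collapse whitespace) are assumptions, not the scoring used by the benchmark.

```javascript
// Normalization rules here are assumptions; real scoring may need to be stricter or looser.
const normalize = (s) => String(s).trim().toLowerCase().replace(/\s+/g, ' ');

function scoreAnswer(expected, actual) {
  if (Array.isArray(expected)) {
    // "List" questions: fraction of expected items found, order-insensitive.
    if (expected.length === 0) return 1;
    const got = new Set((Array.isArray(actual) ? actual : []).map(normalize));
    const found = expected.filter((item) => got.has(normalize(item))).length;
    return found / expected.length;
  }
  // "Exact match" questions: all-or-nothing after normalization.
  return normalize(expected) === normalize(actual) ? 1 : 0;
}
```

Because both rules are deterministic, the same generated question set yields the same score on every run, which keeps prompt tuning comparable across iterations.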