- NEW: repo-profiler.js - deterministic archetype detection (Infra, Frontend, Backend, etc.)
- NEW: extract-dynamic.js - generic extractor replacing hardcoded Foxtrot patterns
- NEW: eval-generator.js - dynamic ground-truth question generation from any repo graph
- NEW: specs/bmad-agnostic-refactor-spec.md - full BMad spec with acceptance criteria
- REFACTORED: prose.js - two-pass LLM synthesis with rich context (shared secrets, ports, service refs)
- REFACTORED: sysdoc.js - wired repo-profiler + extract-dynamic, --legacy escape hatch
- REFACTORED: wiggum-v2.sh - uses eval-generator before benchmarks
- FIXED: graph.js - _edgeSet rebuilt on loadSnapshot() (edge dedup was broken)
- FIXED: graph.js - recursive sortKeys() for deep equality in diffing
- FIXED: prose.js - robust JSON array extraction from LLM output
- FIXED: ratchet.js - syntax validation (node --check) before saving LLM mutations
- FIXED: extract-dynamic.js - centralized state services regex, added console.warn for silent failures
- TESTS: test-eval-generator, test-repo-profiler, test-synthesis-quality + mock fixtures

Eval: 81.5% on Foxtrot (fully repo-agnostic, no hardcoded reference pages)
BMad reviews: Architect B+, Dev Lead B-, TEA B-
BMad Spec: Dev-Intel V2 Repo-Agnostic Refactor
1. Problem Statement
The Dev-Intel V2 pipeline has a fatal flaw: it is severely overfit to the "Foxtrot" infrastructure monorepo. The AST parsing (extract.js) and graph construction (graph.js, subsystem.js) are generic and make up roughly 40% of the codebase, but the deep extraction, synthesis, and evaluation layers (roughly 60%) are entirely bespoke to Foxtrot's tech stack and naming conventions.
What breaks when pointing at a non-Foxtrot repo:
- Extraction (extract-deep.js, extract-patterns.js): Hardcodes regexes for vpc_cidr, product_id, ou_id, EKS addon block formats, AWS/GCP region names, and specific state services (elasticsearch, redis, cassandra). A non-infra repo (e.g., a frontend React app or a Java microservice) yields zero deep insights. LAYER_PATTERNS are hardcoded to app, compute, network.
- Synthesis (prose.js): The synthesizeReferencePages function hardcodes prompts expecting CIDR allocations, VPCs, and Jenkins jobs, and hardcodes the output files (network-architecture.md, operations.md, configuration.md, dependencies.md).
- Evaluation (eval-questions.js): Ground-truth questions are explicitly hardcoded to ask about mdm-app, cassandra, jenkins, vault-secret. Running the eval against any other repo results in a 0% score because the questions are invalid for that repo.
2. Architecture
The refactored pipeline shifts from a static, rule-based extraction/generation model to a dynamic, LLM-guided schema discovery model.
Pipeline Flow:
- Generic Extraction (extract.js, extract-helm.js): Stays largely the same. Extracts ASTs, dependencies, and resources.
- Semantic Profiling (repo-profiler.js - NEW): Before deep extraction, an LLM analyzes the graph and root configuration files (e.g., package.json, Chart.yaml, go.mod) to determine the repository's "Archetype" (e.g., Infrastructure, Frontend SPA, Backend Microservices, Data Pipeline).
- Dynamic Deep Extraction (extract-dynamic.js - REPLACES extract-deep/patterns.js): Based on the archetype, generic heuristics and LLM prompts scan for archetype-specific configuration surfaces, state boundaries, and network contracts.
- Adaptive Synthesis (prose.js): synthesizeReferencePages dynamically determines which reference pages to generate. It asks the LLM: "Given these extracted facts and this repo archetype, what are the 3-5 most critical reference topics?" It then generates those pages (e.g., ui-components.md for a frontend, instead of network-architecture.md).
- Generative Evaluation (eval-generator.js - REPLACES eval-questions.js): The question bank is no longer hardcoded. An LLM agent generates valid, repo-specific Q&A pairs by reading the generated AST graph and code snippets, establishing a dynamic ground truth for the agent-browsing benchmark.
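The profiling step can be illustrated with a small deterministic pre-classifier over root configuration files. This is only a sketch: the rule table, the archetype labels used as return values, and the names detectArchetype and hasDep are assumptions made for illustration, not the contents of repo-profiler.js.

```javascript
// Hypothetical marker-file rules; the real repo-profiler.js rules are not shown in this spec.
const ARCHETYPE_RULES = [
  { archetype: 'Infrastructure', test: (f) => f === 'Chart.yaml' || f.endsWith('.tf') },
  { archetype: 'Frontend SPA', test: (f, pkg) => f === 'package.json' && hasDep(pkg, 'react') },
  {
    archetype: 'Backend Microservices',
    test: (f, pkg) => f === 'go.mod' || (f === 'package.json' && hasDep(pkg, 'express')),
  },
];

// True when a parsed package.json lists `name` as a dependency or devDependency.
function hasDep(pkg, name) {
  if (!pkg) return false;
  const deps = { ...(pkg.dependencies || {}), ...(pkg.devDependencies || {}) };
  return Object.prototype.hasOwnProperty.call(deps, name);
}

// rootFiles: file names at the repo root; pkg: parsed package.json, or null if absent.
function detectArchetype(rootFiles, pkg = null) {
  const scores = new Map();
  for (const file of rootFiles) {
    for (const rule of ARCHETYPE_RULES) {
      if (rule.test(file, pkg)) {
        scores.set(rule.archetype, (scores.get(rule.archetype) || 0) + 1);
      }
    }
  }
  // Unknown repos get a neutral label instead of an exception.
  if (scores.size === 0) return 'Unknown';
  return [...scores.entries()].sort((a, b) => b[1] - a[1])[0][0];
}
```

A result like this could be passed to the LLM as a hint rather than treated as final. Returning 'Unknown' for unrecognized inputs keeps the step from throwing on repos it has no rules for.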
Module Boundaries:
- Extractor Layer: Purely deterministic AST/YAML/HCL parsing. No repo-specific logic.
- Context/Profile Layer: LLM-driven determination of what the repo is and what matters.
- Synthesis Layer: Transforms context into Divio-structured Markdown dynamically.
- Eval Layer: Independent subsystem that generates tests from the raw graph, then tests the agent against the synthesized docs.
3. Acceptance Criteria
- No Hardcoded Values: Zero occurrences of Foxtrot-specific strings (vpc_cidr, elasticsearch, mdm-app, AWS regions) in pipeline source code.
- Dynamic Outputs: sysdoc.js successfully generates a different set of reference markdown files depending on the repo (e.g., must not generate network-architecture.md for a pure frontend repo).
- Repo-Agnostic Eval: Running eval-generator.js against an arbitrary open-source repo (e.g., expressjs/express or a generic Helm chart) produces ≥ 20 valid, specific ground-truth questions.
- Threshold Met: The pipeline runs on Foxtrot and achieves ≥ 77% on the generated eval, AND runs on a test non-Foxtrot repo (e.g., BCE or AnyCloud) and achieves ≥ 70% on its respective generated eval.
- Resilience: Pipeline does not crash or throw unhandled exceptions when encountering unknown languages or missing configuration files.
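The "No Hardcoded Values" criterion lends itself to an automated check. The sketch below scans source text for a list of forbidden terms. The function name findHardcodedStrings, the input shape (a map of file name to file text), and the term list are assumptions for illustration; a real check would also need the AWS region names and would read files from disk.

```javascript
// Strings named in the acceptance criteria; illustrative, not exhaustive.
const FORBIDDEN = ['vpc_cidr', 'elasticsearch', 'mdm-app'];

// sources: object mapping a file name to its text content (hypothetical input shape).
// Returns one entry per (file, line, term) match, using 1-based line numbers.
function findHardcodedStrings(sources, forbidden = FORBIDDEN) {
  const hits = [];
  for (const [file, text] of Object.entries(sources)) {
    text.split('\n').forEach((line, i) => {
      const lower = line.toLowerCase();
      for (const term of forbidden) {
        if (lower.includes(term.toLowerCase())) {
          hits.push({ file, line: i + 1, term });
        }
      }
    });
  }
  return hits;
}
```

An empty result means the criterion holds for the scanned files. Matching is case-insensitive so that variants such as Elasticsearch are caught as well.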
4. Test-First Plan
Before changing the implementation, the following tests must be established:
- Repo-Agnostic Eval Question Generation (Unit/Integration)
  - Test: Run eval-generator.js (to be written) against a mock "Microservice" repo graph and a mock "Infra" repo graph.
  - Assert: Verify that generated questions do not reference Foxtrot artifacts, and that the answers are strictly derived from the provided graph.
- Synthesis Quality Tests (Unit)
  - Test: Pass a mock context (e.g., a React frontend archetype) to synthesizeReferencePages.
  - Assert: Verify the LLM determines appropriate page titles (e.g., components.md, state-management.md) and does not output infra-specific pages.
- Pipeline Integration Tests (E2E)
  - Test: Execute wiggum-v2.sh against a tiny, non-Foxtrot fixture repository (e.g., a simple Node.js Express API).
  - Assert: Docs are generated without errors. The generated index maps to valid, generated reference files.
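The first assertion in the plan above (no Foxtrot artifacts, answers derived from the graph) could be checked with a validator along these lines. It is a sketch under assumed data shapes: the graph is taken to be { nodes: [{ id }] } and each question { question, answer, sourceNodeIds }. Neither shape is confirmed by this spec, and validateQuestions is a hypothetical name.

```javascript
// Hypothetical shapes: graph = { nodes: [{ id }] },
// question = { question, answer, sourceNodeIds }.
// Returns a list of human-readable problems; an empty list means all questions pass.
function validateQuestions(questions, graph, forbiddenTerms = []) {
  const nodeIds = new Set(graph.nodes.map((n) => n.id));
  const errors = [];
  questions.forEach((q, i) => {
    const text = `${q.question} ${q.answer}`.toLowerCase();
    for (const term of forbiddenTerms) {
      if (text.includes(term.toLowerCase())) {
        errors.push(`question ${i}: mentions "${term}"`);
      }
    }
    // Every question must cite at least one node that exists in the graph.
    const sources = q.sourceNodeIds || [];
    if (sources.length === 0) errors.push(`question ${i}: no source nodes`);
    for (const id of sources) {
      if (!nodeIds.has(id)) errors.push(`question ${i}: unknown node "${id}"`);
    }
  });
  return errors;
}
```

Requiring each question to cite graph nodes is one way to make "strictly derived from the provided graph" mechanically testable. It does not prove the answer text is correct, only that it is anchored to something the graph contains.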
5. Implementation Plan
Step 1: Overhaul Evaluation (The Yardstick)
- Delete hardcoded questions in eval-questions.js.
- Write eval-generator.js that uses callLLM to generate ground truth questions from GraphStore and discoverCharts.
- Manually verify the generated questions for Foxtrot are high quality.
Step 2: Abstract Deep Extraction
- Deprecate extract-deep.js and extract-patterns.js.
- Create repo-profiler.js to establish the Repo Archetype.
- Create extract-dynamic.js that uses LLM prompts to extract state services, config surfaces, and architectural patterns generically based on the Archetype.
Step 3: Dynamic Synthesis
- Modify prose.js -> synthesizeReferencePages.
- Implement a two-pass LLM prompt:
  - "What 4 reference pages should be created for this repo?" -> Returns JSON array of { title, filename, focus }.
  - For each page, generate the markdown content using the extracted context.
- Update sysdoc.js to dynamically write these files instead of hardcoding the filenames.
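The two-pass flow can be sketched as follows. The LLM is injected as a plain async function (prompt in, text out), because the real callLLM signature is not given here. The names extractJsonArray and planAndWritePages, the prompt wording, and the filename rule are assumptions for illustration and not the prose.js implementation.

```javascript
// Pull the first parseable JSON array out of free-form LLM output.
// Tries each '[' as a start and shrinks the candidate from the last ']' inward.
function extractJsonArray(text) {
  for (let start = text.indexOf('['); start !== -1; start = text.indexOf('[', start + 1)) {
    for (let end = text.lastIndexOf(']'); end > start; end = text.lastIndexOf(']', end - 1)) {
      try {
        const parsed = JSON.parse(text.slice(start, end + 1));
        if (Array.isArray(parsed)) return parsed;
      } catch (_) {
        // Not valid JSON; try a shorter candidate.
      }
    }
  }
  return null;
}

// llm: async (prompt: string) => string. Hypothetical stand-in for the pipeline's LLM call.
async function planAndWritePages(context, llm) {
  // Pass 1: ask for the page plan as a JSON array of { title, filename, focus }.
  const planText = await llm(
    `List the reference pages for this repo as a JSON array of {title, filename, focus}.\n` +
      JSON.stringify(context)
  );
  // Keep only well-formed entries; the filename rule also blocks path traversal.
  const plan = (extractJsonArray(planText) || []).filter(
    (p) => p && typeof p.title === 'string' && /^[a-z0-9-]+\.md$/.test(String(p.filename))
  );
  // Pass 2: generate each page body from the extracted context.
  const pages = [];
  for (const page of plan) {
    const body = await llm(
      `Write the "${page.title}" reference page. Focus: ${page.focus}\n` + JSON.stringify(context)
    );
    pages.push({ filename: page.filename, body });
  }
  return pages;
}
```

Validating filenames before anything is written matters once the file list comes from model output instead of a hardcoded array. The caller decides what to do when the plan comes back empty.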
Step 4: Script Cleanup
- Update wiggum-v2.sh to trigger eval-generator.js before running the agent benchmark.
- Remove any remaining bespoke scripts.
Step 5: Run & Tune
- Run the full loop on Foxtrot. Tune prompts until the score reaches at least 77%.
- Run the full loop on a secondary repo. Tune prompts until the score reaches at least 70%.
6. Risk Assessment
- LLM Quality Variance: Relying on the LLM to dynamically determine reference pages and extract facts increases token usage and latency. Mitigation: Use strong models (Sonnet/Opus) for schema/page definition, use Haiku for bulk prose generation. Implement heavy JSON-schema enforcement for extraction.
- Extraction Gaps for Non-Infra Repos: The current AST extractor may not capture enough semantic meaning for frontend/backend repos compared to Helm/TF, leading to thin docs. Mitigation: Ensure extract.js captures standard imports and package dependencies correctly to give the LLM enough context.
- Eval Score Regression: Foxtrot scores might drop because the eval questions are generated dynamically and might be harder or more ambiguous than the hardcoded ones. Mitigation: The eval-generator.js must instruct the LLM to generate highly specific, "exact match" or "list" type questions to prevent subjective scoring failures.
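Restricting questions to "exact match" and "list" types allows scoring to be fully mechanical. Below is a minimal sketch of such a scorer. The name scoreAnswer, the normalization rules, and the decision to score lists by the fraction of expected items found are assumptions, not the benchmark's actual scoring.

```javascript
// Case- and whitespace-insensitive comparison key.
const normalize = (s) => String(s).trim().toLowerCase().replace(/\s+/g, ' ');

// expected: a string for "exact match" questions, an array of strings for "list" questions.
// actual: the agent's answer; for list questions either an array or a comma-separated string.
// Returns a score between 0 and 1.
function scoreAnswer(expected, actual) {
  if (Array.isArray(expected)) {
    const want = new Set(expected.map(normalize).filter(Boolean));
    const rawItems = Array.isArray(actual) ? actual : String(actual).split(',');
    const got = new Set(rawItems.map(normalize).filter(Boolean));
    if (want.size === 0) return got.size === 0 ? 1 : 0;
    let hits = 0;
    for (const item of want) {
      if (got.has(item)) hits += 1;
    }
    // Fraction of expected items present; extra items are not penalized in this sketch.
    return hits / want.size;
  }
  return normalize(expected) === normalize(actual) ? 1 : 0;
}
```

For example, `scoreAnswer(['redis', 'postgres'], 'Postgres, redis')` returns 1, because order and case are ignored. A stricter variant could also penalize extra items, which would stop an agent from scoring well by listing everything it saw.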