
Jarvis Prime b8403be96c feat: repo-agnostic refactor (BMad spec-test-build loop)

- NEW: repo-profiler.js — deterministic archetype detection (Infra, Frontend, Backend, etc.)
- NEW: extract-dynamic.js — generic extractor replacing hardcoded Foxtrot patterns
- NEW: eval-generator.js — dynamic ground-truth question generation from any repo graph
- NEW: specs/bmad-agnostic-refactor-spec.md — full BMad spec with acceptance criteria
- REFACTORED: prose.js — two-pass LLM synthesis with rich context (shared secrets, ports, service refs)
- REFACTORED: sysdoc.js — wired repo-profiler + extract-dynamic, --legacy escape hatch
- REFACTORED: wiggum-v2.sh — uses eval-generator before benchmarks
- FIXED: graph.js — _edgeSet rebuilt on loadSnapshot() (edge dedup was broken)
- FIXED: graph.js — recursive sortKeys() for deep equality in diffing
- FIXED: prose.js — robust JSON array extraction from LLM output
- FIXED: ratchet.js — syntax validation (node --check) before saving LLM mutations
- FIXED: extract-dynamic.js — centralized state services regex, added console.warn for silent failures
- TESTS: test-eval-generator, test-repo-profiler, test-synthesis-quality + mock fixtures

Eval: 81.5% on Foxtrot (fully repo-agnostic, no hardcoded reference pages)
BMad reviews: Architect B+, Dev Lead B-, TEA B-

2026-03-11 14:40:31 +00:00


BMad Spec: Dev-Intel V2 Repo-Agnostic Refactor

1. Problem Statement

The Dev-Intel V2 pipeline has a fatal flaw: it is severely overfit to the "Foxtrot" infrastructure monorepo. While the AST parsing (extract.js) and graph construction (graph.js, subsystem.js) are generic (~40% of the codebase), the deep extraction, synthesis, and evaluation layers (~60%) are entirely bespoke to Foxtrot's tech stack and naming conventions.

What breaks when pointing at a non-Foxtrot repo:

Extraction (extract-deep.js, extract-patterns.js): Hardcodes regexes for vpc_cidr, product_id, ou_id, EKS addon block formats, AWS/GCP region names, and specific state services (elasticsearch, redis, cassandra). A non-infra repo (e.g., a frontend React app or a Java microservice) yields zero deep insights. LAYER_PATTERNS are hardcoded to app, compute, network.
Synthesis (prose.js): The synthesizeReferencePages function hardcodes prompts expecting CIDR allocations, VPCs, and Jenkins jobs, and hardcodes the output files (network-architecture.md, operations.md, configuration.md, dependencies.md).
Evaluation (eval-questions.js): Ground-truth questions are explicitly hardcoded to ask about mdm-app, cassandra, jenkins, vault-secret. Running the eval against any other repo results in a 0% score because the questions are invalid for that repo.

2. Architecture

The refactored pipeline shifts from a static, rule-based extraction/generation model to a dynamic, LLM-guided schema discovery model.

Pipeline Flow:

Generic Extraction (extract.js, extract-helm.js): Stays largely the same. Extracts ASTs, dependencies, and resources.
Semantic Profiling (repo-profiler.js - NEW): Before deep extraction, an LLM analyzes the graph and root configuration files (e.g., package.json, Chart.yaml, go.mod) to determine the repository's "Archetype" (e.g., Infrastructure, Frontend SPA, Backend Microservices, Data Pipeline).
Dynamic Deep Extraction (extract-dynamic.js - REPLACES extract-deep/patterns.js): Based on the archetype, generic heuristics and LLM prompts scan for archetype-specific configuration surfaces, state boundaries, and network contracts.
Adaptive Synthesis (prose.js): synthesizeReferencePages dynamically determines which reference pages to generate. It asks the LLM: "Given these extracted facts and this repo archetype, what are the 3-5 most critical reference topics?" It then generates those pages (e.g., ui-components.md for a frontend, instead of network-architecture.md).
Generative Evaluation (eval-generator.js - REPLACES eval-questions.js): The question bank is no longer hardcoded. An LLM agent generates valid, repo-specific Q&A pairs by reading the generated AST graph and code snippets, establishing a dynamic ground truth for the agent-browsing benchmark.
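
The commit message describes repo-profiler.js as deterministic archetype detection. A minimal sketch of what such a profiler could look like is shown below. The rule table, the marker files, and the function name `detectArchetype` are illustrative assumptions, not the shipped implementation; an LLM pass could refine the result when no rule matches.

```javascript
// Hypothetical rule table: each archetype is recognized by marker files.
// Labels follow the archetype examples in the spec; the markers are assumptions.
const ARCHETYPE_RULES = [
  { archetype: "Infrastructure", markers: [/(^|\/)Chart\.yaml$/, /\.tf$/] },
  { archetype: "Frontend SPA", markers: [/\.(jsx|tsx|vue)$/] },
  { archetype: "Backend Microservices", markers: [/(^|\/)go\.mod$/, /(^|\/)pom\.xml$/] },
];

// Returns the archetype whose markers match the most file paths,
// or "Unknown" when nothing matches (the pipeline should not crash here).
function detectArchetype(filePaths) {
  const scores = ARCHETYPE_RULES.map(({ archetype, markers }) => ({
    archetype,
    hits: filePaths.filter((p) => markers.some((re) => re.test(p))).length,
  }));
  // Array.prototype.sort is stable, so ties keep the rule-table order.
  scores.sort((a, b) => b.hits - a.hits);
  const best = scores[0];
  return best.hits > 0 ? best.archetype : "Unknown";
}
```

Returning "Unknown" instead of throwing keeps the profiler in line with the resilience requirement: downstream stages can fall back to generic extraction.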

Module Boundaries:

Extractor Layer: Purely deterministic AST/YAML/HCL parsing. No repo-specific logic.
Context/Profile Layer: LLM-driven determination of what the repo is and what matters.
Synthesis Layer: Transforms context into Divio-structured Markdown dynamically.
Eval Layer: Independent subsystem that generates tests from the raw graph, then tests the agent against the synthesized docs.

3. Acceptance Criteria

No Hardcoded Values: Zero occurrences of Foxtrot-specific strings (vpc_cidr, elasticsearch, mdm-app, AWS regions) in pipeline source code.
Dynamic Outputs: sysdoc.js successfully generates a different set of reference markdown files depending on the repo (e.g., must not generate network-architecture.md for a pure frontend repo).
Repo-Agnostic Eval: Running eval-generator.js against an arbitrary open-source repo (e.g., expressjs/express or a generic Helm chart) produces at least 20 valid, specific ground-truth questions.
Threshold Met: The pipeline runs on Foxtrot and achieves at least 77% on the generated eval, AND runs on a test non-Foxtrot repo (e.g., BCE or AnyCloud) and achieves at least 70% on its respective generated eval.
Resilience: Pipeline does not crash or throw unhandled exceptions when encountering unknown languages or missing configuration files.
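
The "No Hardcoded Values" criterion can be checked mechanically. The sketch below assumes the pipeline sources have already been read into an object keyed by file path; the function name `findForbiddenStrings` and the term list are hypothetical, with the terms taken from the examples in the criterion.

```javascript
// Assumed list of Foxtrot-specific terms, taken from the examples above.
// In practice this list would live in test code, not in pipeline source.
const FORBIDDEN = ["vpc_cidr", "elasticsearch", "mdm-app"];

// sourceByFile: { "path/to/file.js": "file contents", ... }
// Returns one record per offending line so failures are easy to locate.
function findForbiddenStrings(sourceByFile, forbidden = FORBIDDEN) {
  const violations = [];
  for (const [file, text] of Object.entries(sourceByFile)) {
    text.split("\n").forEach((line, idx) => {
      const lower = line.toLowerCase();
      for (const term of forbidden) {
        if (lower.includes(term.toLowerCase())) {
          violations.push({ file, line: idx + 1, term });
        }
      }
    });
  }
  return violations;
}
```

A test would assert that the returned array is empty for every file under the pipeline source directory.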

4. Test-First Plan

Before changing the implementation, the following tests must be established:

Repo-Agnostic Eval Question Generation (Unit/Integration)
- Test: Run eval-generator.js (to be written) against a mock "Microservice" repo graph and a mock "Infra" repo graph.
- Assert: Verify that generated questions do not reference Foxtrot artifacts, and that the answers are strictly derived from the provided graph.
Synthesis Quality Tests (Unit)
- Test: Pass a mock context (e.g., a React frontend archetype) to synthesizeReferencePages.
- Assert: Verify the LLM determines appropriate page titles (e.g., components.md, state-management.md) and does not output infra-specific pages.
Pipeline Integration Tests (E2E)
- Test: Execute wiggum-v2.sh against a tiny, non-Foxtrot fixture repository (e.g., a simple Node.js Express API).
- Assert: Docs are generated without errors. The generated index maps to valid, generated reference files.
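
The assertions for the eval-generator test could be sketched as a validator like the one below. The question shape (`question`, `answer`, `sourceNodeIds`) and the graph shape (`nodes` with `id`) are assumptions made for illustration; the real GraphStore format may differ.

```javascript
// Checks generated Q&A pairs against two rules from the test plan:
//  1. no Foxtrot artifacts appear in the question or answer text;
//  2. every question cites at least one node that exists in the graph.
// Returns a list of human-readable error strings (empty means valid).
function validateQuestions(questions, graph, forbiddenTerms) {
  const nodeIds = new Set(graph.nodes.map((n) => n.id));
  const errors = [];
  questions.forEach((q, i) => {
    const text = `${q.question} ${q.answer}`.toLowerCase();
    for (const term of forbiddenTerms) {
      if (text.includes(term.toLowerCase())) {
        errors.push(`Q${i}: mentions "${term}"`);
      }
    }
    const sources = q.sourceNodeIds || [];
    if (sources.length === 0) errors.push(`Q${i}: no source nodes cited`);
    for (const id of sources) {
      if (!nodeIds.has(id)) errors.push(`Q${i}: unknown node "${id}"`);
    }
  });
  return errors;
}
```

Citing source nodes is only a proxy for "strictly derived from the graph": it shows the generator pointed at real nodes, not that the answer text is correct.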

5. Implementation Plan

Step 1: Overhaul Evaluation (The Yardstick)

Delete hardcoded questions in eval-questions.js.
Write eval-generator.js, which uses callLLM to generate ground-truth questions from GraphStore and discoverCharts.
Manually verify the generated questions for Foxtrot are high quality.

Step 2: Abstract Deep Extraction

Deprecate extract-deep.js and extract-patterns.js.
Create repo-profiler.js to establish the Repo Archetype.
Create extract-dynamic.js that uses LLM prompts to extract state services, config surfaces, and architectural patterns generically based on the Archetype.

Step 3: Dynamic Synthesis

Modify prose.js -> synthesizeReferencePages.
Implement a two-pass LLM prompt:
1. "What 3-5 reference pages should be created for this repo?" -> Returns JSON array of { title, filename, focus }.
2. For each page, generate the markdown content using the extracted context.
Update sysdoc.js to dynamically write these files instead of hardcoding the filenames.
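
The two-pass flow can be sketched as follows. This is a sketch under stated assumptions, not the code in prose.js: `callLLM` is assumed to take a prompt string and return a Promise of the raw model text, and `extractJsonArray`, `isValidPageSpec`, and `synthesizePagesSketch` are hypothetical names. The extraction helper tolerates prose or citations around the JSON, which LLM output often contains.

```javascript
// Finds the first substring that parses as a JSON array, trying the
// longest candidate span for each "[" first. Returns null if none parses.
function extractJsonArray(text) {
  for (let start = text.indexOf("["); start !== -1; start = text.indexOf("[", start + 1)) {
    for (let end = text.lastIndexOf("]"); end > start; end = text.lastIndexOf("]", end - 1)) {
      try {
        const parsed = JSON.parse(text.slice(start, end + 1));
        if (Array.isArray(parsed)) return parsed;
      } catch {
        // Not valid JSON for this span; try a shorter one.
      }
    }
  }
  return null;
}

// Accepts only { title, filename, focus } with a plain kebab-case .md filename,
// so a model cannot request a path outside the output directory.
function isValidPageSpec(p) {
  return p !== null && typeof p === "object" &&
    ["title", "filename", "focus"].every((k) => typeof p[k] === "string" && p[k].length > 0) &&
    /^[a-z0-9-]+\.md$/.test(p.filename);
}

// Pass 1 plans the pages; pass 2 writes each one. Returns { filename: markdown }.
async function synthesizePagesSketch(context, callLLM) {
  const ctx = JSON.stringify(context);
  const planText = await callLLM(
    `What reference pages should be created for this repo? ` +
    `Return a JSON array of { title, filename, focus }.\n${ctx}`
  );
  const pages = (extractJsonArray(planText) || []).filter(isValidPageSpec);
  const out = {};
  for (const page of pages) {
    out[page.filename] = await callLLM(
      `Write the reference page "${page.title}". Focus: ${page.focus}.\n${ctx}`
    );
  }
  return out;
}
```

With this shape, sysdoc.js would only need to iterate over the returned object and write each entry, so no filename is hardcoded.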

Step 4: Script Cleanup

Update wiggum-v2.sh to trigger eval-generator.js before running the agent benchmark.
Remove any remaining bespoke scripts.

Step 5: Run & Tune

Run the full loop on Foxtrot. Tune prompts until the score is at least 77%.
Run the full loop on a secondary repo. Tune prompts until the score is at least 70%.

6. Risk Assessment

LLM Quality Variance: Relying on the LLM to dynamically determine reference pages and extract facts makes output quality less predictable and increases token usage and latency. Mitigation: Use strong models (Sonnet/Opus) for schema/page definition, use Haiku for bulk prose generation. Implement heavy JSON-schema enforcement for extraction.
Extraction Gaps for Non-Infra Repos: The current AST extractor may not capture enough semantic meaning for frontend/backend repos compared to Helm/TF, leading to thin docs. Mitigation: Ensure extract.js captures standard imports and package dependencies correctly to give the LLM enough context.
Eval Score Regression: Foxtrot scores might drop because the eval questions are generated dynamically and might be harder or more ambiguous than the hardcoded ones. Mitigation: The eval-generator.js must instruct the LLM to generate highly specific, "exact match" or "list" type questions to prevent subjective scoring failures.
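
To illustrate why "exact match" and "list" questions reduce subjective scoring, here is a minimal scorer sketch. The question shape (`type`, `answer`) and the function names are assumptions for illustration. List questions are scored by recall only, so extra items in the agent's answer are not penalized in this version.

```javascript
// Case- and whitespace-insensitive comparison key.
function normalize(s) {
  return String(s).trim().toLowerCase().replace(/\s+/g, " ");
}

// Returns a score in [0, 1] for one question.
function scoreAnswer(question, agentAnswer) {
  if (question.type === "exact") {
    return normalize(agentAnswer) === normalize(question.answer) ? 1 : 0;
  }
  if (question.type === "list") {
    const expected = new Set(question.answer.map(normalize));
    const items = Array.isArray(agentAnswer) ? agentAnswer : String(agentAnswer).split(",");
    const given = new Set(items.map(normalize));
    let hits = 0;
    for (const item of expected) if (given.has(item)) hits++;
    return expected.size === 0 ? 0 : hits / expected.size;
  }
  throw new Error(`Unknown question type: ${question.type}`);
}

// Percentage score over a question bank; answers[i] answers questions[i].
function evalScore(questions, answers) {
  if (questions.length === 0) return 0;
  const total = questions.reduce((sum, q, i) => sum + scoreAnswer(q, answers[i]), 0);
  return (100 * total) / questions.length;
}
```

Because both question types compare normalized strings, two runs over the same answers always produce the same score, which makes threshold checks such as 77% reproducible.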
