# BMad Spec: Dev-Intel V2 Repo-Agnostic Refactor

## 1. Problem Statement

The Dev-Intel V2 pipeline has a fatal flaw: it is severely overfit to the "Foxtrot" infrastructure monorepo. While the AST parsing (`extract.js`) and graph construction (`graph.js`, `subsystem.js`) are generic (~40% of the codebase), the deep extraction, synthesis, and evaluation layers (~60%) are entirely bespoke to Foxtrot's tech stack and naming conventions.

**What breaks when pointing at a non-Foxtrot repo:**

- **Extraction (`extract-deep.js`, `extract-patterns.js`)**: Hardcodes regexes for `vpc_cidr`, `product_id`, `ou_id`, EKS addon block formats, AWS/GCP region names, and specific state services (`elasticsearch`, `redis`, `cassandra`). `LAYER_PATTERNS` is hardcoded to `app`, `compute`, `network`. A non-infra repo (e.g., a frontend React app or a Java microservice) yields zero deep insights.
- **Synthesis (`prose.js`)**: The `synthesizeReferencePages` function hardcodes prompts that expect CIDR allocations, VPCs, and Jenkins jobs, and hardcodes the output files (`network-architecture.md`, `operations.md`, `configuration.md`, `dependencies.md`).
- **Evaluation (`eval-questions.js`)**: Ground-truth questions are hardcoded to ask about `mdm-app`, `cassandra`, `jenkins`, `vault-secret`. Running the eval against any other repo yields a 0% score because the questions are invalid for that repo.

## 2. Architecture

The refactored pipeline shifts from a static, rule-based extraction/generation model to a dynamic, LLM-guided schema discovery model.

**Pipeline Flow:**

1. **Generic Extraction (`extract.js`, `extract-helm.js`)**: Stays largely the same. Extracts ASTs, dependencies, and resources.
2. **Semantic Profiling (`repo-profiler.js` - NEW)**: Before deep extraction, an LLM analyzes the graph and root configuration files (e.g., `package.json`, `Chart.yaml`, `go.mod`) to determine the repository's "Archetype" (e.g., Infrastructure, Frontend SPA, Backend Microservices, Data Pipeline).
3. **Dynamic Deep Extraction (`extract-dynamic.js` - REPLACES `extract-deep.js` and `extract-patterns.js`)**: Based on the archetype, generic heuristics and LLM prompts scan for archetype-specific configuration surfaces, state boundaries, and network contracts.
4. **Adaptive Synthesis (`prose.js`)**: `synthesizeReferencePages` dynamically determines which reference pages to generate. It asks the LLM: "Given these extracted facts and this repo archetype, what are the 3-5 most critical reference topics?" It then generates those pages (e.g., `ui-components.md` for a frontend, instead of `network-architecture.md`).
5. **Generative Evaluation (`eval-generator.js` - REPLACES `eval-questions.js`)**: The question bank is no longer hardcoded. An LLM agent generates valid, repo-specific Q&A pairs by reading the generated AST graph and code snippets, establishing a dynamic ground truth for the agent-browsing benchmark.

**Module Boundaries:**

- **Extractor Layer**: Purely deterministic AST/YAML/HCL parsing. No repo-specific logic.
- **Context/Profile Layer**: LLM-driven determination of what the repo *is* and what matters.
- **Synthesis Layer**: Transforms context into Divio-structured Markdown dynamically.
- **Eval Layer**: Independent subsystem that generates tests from the raw graph, then tests the agent against the synthesized docs.

## 3. Acceptance Criteria

1. **No Hardcoded Values**: Zero occurrences of Foxtrot-specific strings (`vpc_cidr`, `elasticsearch`, `mdm-app`, AWS regions) in pipeline source code.
2. **Dynamic Outputs**: `sysdoc.js` generates a different set of reference markdown files depending on the repo (e.g., it must not generate `network-architecture.md` for a pure frontend repo).
3. **Repo-Agnostic Eval**: Running `eval-generator.js` against an arbitrary open-source repo (e.g., `expressjs/express` or a generic Helm chart) produces $\ge$ 20 valid, specific ground-truth questions.
4. **Threshold Met**: The pipeline runs on Foxtrot and achieves $\ge$ 77% on the generated eval, AND runs on a test non-Foxtrot repo (e.g., BCE or AnyCloud) and achieves $\ge$ 70% on its respective generated eval.
5. **Resilience**: The pipeline does not crash or throw unhandled exceptions when it encounters unknown languages or missing configuration files.

## 4. Test-First Plan

Before changing the implementation, the following tests must be established:

1. **Repo-Agnostic Eval Question Generation (Unit/Integration)**
   - **Test**: Run `eval-generator.js` (to be written) against a mock "Microservice" repo graph and a mock "Infra" repo graph.
   - **Assert**: Verify that generated questions do not reference Foxtrot artifacts, and that the answers are strictly derived from the provided graph.
2. **Synthesis Quality Tests (Unit)**
   - **Test**: Pass a mock context (e.g., a React frontend archetype) to `synthesizeReferencePages`.
   - **Assert**: Verify the LLM determines appropriate page titles (e.g., `components.md`, `state-management.md`) and does not output infra-specific pages.
3. **Pipeline Integration Tests (E2E)**
   - **Test**: Execute `wiggum-v2.sh` against a tiny, non-Foxtrot fixture repository (e.g., a simple Node.js Express API).
   - **Assert**: Docs are generated without errors. Every entry in the generated index maps to a reference file that was actually generated.

## 5. Implementation Plan

**Step 1: Overhaul Evaluation (The Yardstick)**

- Delete hardcoded questions in `eval-questions.js`.
- Write `eval-generator.js` that uses `callLLM` to generate ground-truth questions from `GraphStore` and `discoverCharts`.
- Manually verify that the generated questions for Foxtrot are high quality.

**Step 2: Abstract Deep Extraction**

- Deprecate `extract-deep.js` and `extract-patterns.js`.
- Create `repo-profiler.js` to establish the Repo Archetype.
- Create `extract-dynamic.js` that uses LLM prompts to extract state services, config surfaces, and architectural patterns generically based on the Archetype.

**Step 3: Dynamic Synthesis**

- Modify `prose.js` -> `synthesizeReferencePages`.
- Implement a two-pass LLM prompt:
  1. "What 3-5 reference pages should be created for this repo?" -> Returns JSON array of `{ title, filename, focus }`.
  2. For each page, generate the markdown content using the extracted context.
- Update `sysdoc.js` to dynamically write these files instead of hardcoding the filenames.

**Step 4: Script Cleanup**

- Update `wiggum-v2.sh` to trigger `eval-generator.js` before running the agent benchmark.
- Remove any remaining bespoke scripts.

**Step 5: Run & Tune**

- Run the full loop on Foxtrot. Tune prompts until the score reaches $\ge$ 77%.
- Run the full loop on a secondary repo. Tune prompts until the score reaches $\ge$ 70%.

## 6. Risk Assessment

- **LLM Quality Variance**: Relying on the LLM to dynamically determine reference pages and extract facts makes output quality less predictable and increases token usage and latency. *Mitigation: Use strong models (Sonnet/Opus) for schema/page definition and Haiku for bulk prose generation. Enforce strict JSON schemas on extraction output.*
- **Extraction Gaps for Non-Infra Repos**: The current AST extractor may not capture enough semantic meaning for frontend/backend repos compared to Helm/TF, leading to thin docs. *Mitigation: Ensure `extract.js` captures standard imports and package dependencies correctly to give the LLM enough context.*
- **Eval Score Regression**: Foxtrot scores might drop because the eval questions are generated dynamically and might be harder or more ambiguous than the hardcoded ones. *Mitigation: `eval-generator.js` must instruct the LLM to generate highly specific, "exact match" or "list" type questions to prevent subjective scoring failures.*
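
To illustrate the last mitigation, here is a minimal sketch of how "exact match" and "list" questions could be scored deterministically. The question shape (`type`, `answer`), all function names, the sample fixture data, and the use of Jaccard overlap for list answers are assumptions made for this example; the spec does not define a question schema or a scoring rule.

```javascript
// Hypothetical scoring helpers for generated eval questions.
// Assumed question shapes (not defined by the spec):
//   { type: "exact", question: string, answer: string }
//   { type: "list",  question: string, answer: string[] }

// Case- and whitespace-insensitive comparison key.
function normalize(value) {
  return String(value).trim().toLowerCase().replace(/\s+/g, " ");
}

// 1 when the normalized strings are identical, otherwise 0.
function scoreExact(expected, actual) {
  return normalize(expected) === normalize(actual) ? 1 : 0;
}

// Order-insensitive Jaccard overlap: missing and extra items both lower the score.
function scoreList(expected, actual) {
  const want = new Set(expected.map(normalize));
  const got = new Set(actual.map(normalize));
  if (want.size === 0 && got.size === 0) return 1;
  let overlap = 0;
  for (const item of want) {
    if (got.has(item)) overlap += 1;
  }
  return overlap / (want.size + got.size - overlap);
}

// Dispatch on question type; anything else is rejected as not objectively scorable.
function scoreAnswer(question, agentAnswer) {
  switch (question.type) {
    case "exact":
      return scoreExact(question.answer, agentAnswer);
    case "list": {
      const items = agentAnswer == null
        ? []
        : Array.isArray(agentAnswer) ? agentAnswer : [agentAnswer];
      return scoreList(question.answer, items);
    }
    default:
      throw new Error(`Unsupported question type: ${question.type}`);
  }
}

// Mean score across a run, in the range [0, 1].
function scoreRun(questions, answers) {
  if (questions.length === 0) return 0;
  const total = questions.reduce(
    (sum, q, i) => sum + scoreAnswer(q, answers[i]),
    0
  );
  return total / questions.length;
}

// Usage with made-up fixture data (not taken from any real repo).
const questions = [
  { type: "exact", question: "Which port does the API listen on?", answer: "3000" },
  { type: "list", question: "Which packages does the service import?", answer: ["express", "pg"] },
];
const answers = [" 3000 ", ["Express"]];
console.log(scoreRun(questions, answers)); // prints 0.75
```

Rejecting unknown question types at scoring time is one way to keep subjective, free-form questions out of the benchmark. A score from `scoreRun` can be compared directly against the 0.77 and 0.70 thresholds in the acceptance criteria.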