diff --git a/README.md b/README.md
new file mode 100644
index 0000000..47242c7
--- /dev/null
+++ b/README.md
@@ -0,0 +1,113 @@
+# Developer Intelligence Pipeline v2
+
+Multi-language semantic graph extractor that builds a knowledge graph from source code. Produces function-level call graphs, cross-file dependency maps, and semantic diffs, all without LLM calls.
+
+## Quick Start
+
+```bash
+npm install
+node pipeline.js batch /path/to/repo --output /tmp/output
+```
+
+## What It Does
+
+Parses source code into a directed graph of entities (modules, functions, classes, configs) and relationships (CALLS, IMPORTS, CONTAINS, IMPLEMENTS). It then diffs snapshots to detect breaking changes, compute impact scores, and identify affected callers.
+
+## Supported Languages
+
+| Language | Parser | Entities |
+|----------|--------|----------|
+| TypeScript/JavaScript | tree-sitter | Modules, Functions, Classes, Imports |
+| Python | tree-sitter | Modules, Functions, Classes (with `_`/`__` visibility) |
+| Go | tree-sitter | Modules, Functions, Structs, Receiver Methods |
+| Java | tree-sitter | Modules, Functions, Classes, Interfaces |
+| Bash | tree-sitter | Modules, Functions, `source` imports, Commands |
+| YAML | js-yaml | Config keys (K8s manifests, Helm, KCL) |
+| Terraform/HCL | regex | Resources, Data, Modules, Providers |
+
+## Pipeline Phases
+
+### Phase 1: Entity Extraction (`extract.js`)
+```bash
+node extract.js /path/to/file.ts /repo/root
+```
+Outputs JSON with entities and relationships.
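The README does not document the JSON schema, so the following is only a sketch of how a consumer might walk the Phase 1 output. It assumes top-level `entities` and `relationships` arrays; the field names `id`, `type`, `from`, and `to`, the `parseArgs` entity, and the `indexByType` helper are all hypothetical and may not match what `extract.js` actually emits.

```javascript
// Hypothetical shape: the README only states that the output contains
// entities and relationships. Every field name here is an assumption.
const sample = {
  entities: [
    { id: "cli/route.ts", type: "module" },
    { id: "cli/route.ts:tryRouteCli", type: "function" },
    { id: "cli/route.ts:parseArgs", type: "function" }, // made-up entity
  ],
  relationships: [
    { type: "CONTAINS", from: "cli/route.ts", to: "cli/route.ts:tryRouteCli" },
    { type: "CONTAINS", from: "cli/route.ts", to: "cli/route.ts:parseArgs" },
    { type: "CALLS", from: "cli/route.ts:tryRouteCli", to: "cli/route.ts:parseArgs" },
  ],
};

// Group relationship targets by source ID for a single relationship type.
function indexByType(result, type) {
  const index = new Map();
  for (const rel of result.relationships) {
    if (rel.type !== type) continue;
    if (!index.has(rel.from)) index.set(rel.from, []);
    index.get(rel.from).push(rel.to);
  }
  return index;
}

// List what tryRouteCli calls according to the sample data.
const calls = indexByType(sample, "CALLS");
console.log(calls.get("cli/route.ts:tryRouteCli"));
```

An adjacency map keyed by source ID keeps "what does this function call" lookups cheap; the reverse index (callers of a function) could be built the same way by swapping `from` and `to`.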
+
+### Phase 2: Graph Store (`graph.js`)
+```bash
+node graph.js build /dir/of/jsons snapshot.json
+node graph.js query snapshot.json "cli/route.ts:tryRouteCli"
+node graph.js diff old.json new.json
+```
+
+### Phase 3: Namespace Registry (`namespace.js`)
+```bash
+node namespace.js build snap-a.json snap-b.json --output registry.json
+node namespace.js resolve graph.json registry.json
+node namespace.js lookup registry.json functionName
+```
+Cross-repo resolution tries three tiers in order: exact ID → normalized path → name-only.
+
+### Phase 4: Semantic Diff (`semantic-diff.js`)
+```bash
+node semantic-diff.js diff old.json new.json
+node semantic-diff.js score old.json new.json
+```
+Categorizes changes as breaking, significant, internal, or cosmetic. The impact score ranges from 0 to 100.
+
+### Phase 5: Pipeline (`pipeline.js`)
+```bash
+node pipeline.js batch /repo --output /tmp/out     # Full extraction
+node pipeline.js benchmark /repo --samples 20      # Performance test
+node pipeline.js run /repo --snapshot prev.json    # Incremental diff
+```
+
+## Benchmark (OpenClaw repo)
+
+| Metric | Value |
+|--------|-------|
+| Files | 4,325 |
+| Extracted | 4,259 (98.5%) |
+| Nodes | 21,646 |
+| Edges | 133,979 |
+| Time | 67 seconds |
+| Avg/file | 15ms |
+
+## V1 vs V2
+
+| Metric | V1 POC | V2 Pipeline |
+|--------|--------|-------------|
+| Parse time | ~2s | 552ms |
+| Total time | 15-20 min (LLM) | 552ms |
+| Entities | files + imports | 457 (4 types) |
+| CALLS edges | 0 | 1,290 |
+| Cross-file calls | No | 51 resolved |
+| Languages | Go only | 8 |
+| Semantic diff | No | Yes |
+| Impact scoring | No | Yes |
+| Cost | ~$0 (Ollama) | $0 |
+
+*Tested on labstack/echo (44 Go files)*
+
+## Testing
+
+```bash
+bash test/run-all.sh      # 9/9 ground truth benchmark
+node test/test-graph.js   # 25/25 graph store tests
+```
+
+## Architecture
+
+```
+source files → extract.js → JSON → graph.js → snapshot.json
+                                                   ↓
+                                              semantic-diff.js → impact report
+                                                   ↓
+                                              namespace.js → cross-repo links
+```
+
+Zero external runtime dependencies beyond tree-sitter grammars and js-yaml (used for YAML).
+
+## License
+
+MIT