Commit Graph

1 Commit

Author SHA1 Message Date
Jarvis Prime
304f0a9e9f Phase 9c: Split eval into Agent (file-browsing) and Human (readability) tracks
Agent eval: 54.3% (22 questions, 40.9% NOT_FOUND)
Human eval: 63.9% (28 questions, 17.9% NOT_FOUND)
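
The NOT_FOUND percentages above can be turned back into question counts. A minimal sketch, assuming each rate is the share of that track's questions answered NOT_FOUND (the commit does not define it); the helper name `not_found_count` is hypothetical:

```python
def not_found_count(total_questions: int, not_found_pct: float) -> int:
    """Recover the whole number of NOT_FOUND answers implied by a rounded percentage.

    Assumption: the percentage is NOT_FOUND answers / total questions in the track.
    """
    return round(total_questions * not_found_pct / 100)


# (total questions, reported NOT_FOUND %) taken from the two eval lines above
tracks = {"agent": (22, 40.9), "human": (28, 17.9)}

for name, (total, pct) in tracks.items():
    count = not_found_count(total, pct)
    # Recompute the percentage from the integer count to check it matches the report
    print(f"{name}: {count}/{total} NOT_FOUND = {100 * count / total:.1f}%")
```

Under that assumption the rates correspond to 9 of 22 agent questions and 5 of 28 human questions.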

Key findings:
- Agent navigation is the bottleneck (2.09/5): long path-based filenames hurt discoverability
- Human findability is decent (3.46/5), but dependency questions fail (0%) because the chart docs for wrapper charts don't surface their sub-chart dependencies
- Both tracks show strong precision (4.4+/5), i.e. very low hallucination
- Resources (91%) and interactions (95%) score very well for humans
- Configuration and contracts are solid across both tracks
2026-03-09 23:55:54 +00:00