dev-intel-v2/eval-agent-report.json at 049609a35820331a339fec7a48c9fb753aa541a2

Jarvis Prime 304f0a9e9f Phase 9c: Split eval into Agent (file-browsing) and Human (readability) tracks

Agent eval: 54.3% (22 questions, 40.9% NOT_FOUND)
Human eval: 63.9% (28 questions, 17.9% NOT_FOUND)

Key findings:
- Agent navigation is the bottleneck (2.09/5): long path-based filenames hurt discoverability
- Human findability is decent (3.46/5), but dependency questions fail (0%) because the chart docs for wrapper charts don't surface their sub-chart dependencies
- Both tracks show strong precision (4.4+/5), so hallucination is very low
- Resources (91%) and interactions (95%) score well for humans
- Configuration and contracts are solid across both tracks
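
The NOT_FOUND rates in the headline figures can be sanity-checked with a few lines of Python. This is a minimal sketch, not the report's own code: the helper name `not_found_rate` is hypothetical, and the counts 9 and 5 are not stated in the source. They are inferred as the only whole numbers that reproduce 40.9% of 22 and 17.9% of 28, assuming each rate is NOT_FOUND answers divided by the track's total question count.

```python
def not_found_rate(not_found: int, total: int) -> float:
    """Return the NOT_FOUND share as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * not_found / total, 1)


# Counts below are inferred from the reported percentages (assumption).
agent_rate = not_found_rate(9, 22)   # Agent track: 22 questions
human_rate = not_found_rate(5, 28)   # Human track: 28 questions

print(f"Agent NOT_FOUND: {agent_rate}%")
print(f"Human NOT_FOUND: {human_rate}%")
```

Under that assumption, the agent track fails to locate an answer for 9 of 22 questions versus 5 of 28 for the human track, which is consistent with navigation being named as the agent-side bottleneck.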

2026-03-09 23:55:54 +00:00

37 KiB
