Add deep extractors, reference pages, keyword index; eval 53.3%
- extract-deep.js: mines addon versions, TF configs, script params, helm values, state services
- generate-reference-pages.js: creates operations.md, configuration.md, network-architecture.md
- reference/index.md: keyword-rich topic-to-file routing table
- Enriched CIDR extractor with inline comment capture
- Eval progression: 28.7% -> 33.4% -> 46.7% -> 52.5% -> 53.3%
- NOT_FOUND: 25 -> 20 -> 16 -> 10 -> 11
- Top scores: config-region-code 95%, argo-gen-params 95%, multiple 100%s
- Remaining gap: agent planner (haiku) doesn't consistently follow index routing
87
wiggum-v2-ref-3.log
Normal file
@@ -0,0 +1,87 @@
🔁 Ralph Wiggum Loop (V2) — max 3 iterations, target 77%
Benchmark: Confluence Gold Standard (/home/node/.openclaw/workspace/projects/dev-intel-v2/eval-confluence-ref-questions.json)
=== Iteration 1/3 ===
📝 Running V2 pipeline...
Generating prose for subsystem: compute-common...
Generating prose for subsystem: compute-tools...
Generating prose for subsystem: control-core...
Generating prose for subsystem: ipam-core...
Generating prose for subsystem: ipam-tools...
Generating prose for subsystem: network-common...
Generating prose for subsystem: network-core...
Generating prose for subsystem: runtime...
Generating prose for subsystem: root...
Generating prose for 124 contracts...
Agent KB: 12 subsystems, 76 charts
Generated docs in ./foxtrot-docs
- 12 subsystems
- 124 contracts
- 0 flows
📊 Running agent file-browsing eval against Confluence questions...
Using model: claude-haiku-4.5
Agent Eval: 32 machine-audience questions
[1/32] arch-layered-order... 30% (A:1 C:2 P:1 N:2) files:5
[2/32] arch-hub-spoke-ownership... 60% (A:3 C:2 P:4 N:3) files:5
[3/32] arch-aws-regions... 50% (A:2 C:5 P:1 N:2) files:5
[4/32] arch-gcp-shared-vpc-host... 40% (A:2 C:1 P:4 N:1) files:5 [NOT_FOUND]
[5/32] arch-cidr-employee-access... 30% (A:0 C:0 P:5 N:1) files:5 [NOT_FOUND]
[6/32] arch-production-cidr... 0% (A:0 C:0 P:0 N:0) files:5 [NOT_FOUND]
[7/32] dep-runtime-common-horizontal... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[8/32] dep-vertical-layers... 35% (A:1 C:2 P:1 N:3) files:5
[9/32] dep-create-account-repos... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[10/32] dep-create-cluster-repos... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[11/32] dep-compute-common-deps... 40% (A:2 C:1 P:3 N:2) files:5
[12/32] ops-argocd-deployment-flow... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[13/32] ops-ebf-release-pattern... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[14/32] ops-rollback-procedure... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[15/32] ops-branch-cluster-mapping... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[16/32] ops-jenkins-jobs... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[17/32] ops-create-cluster-timeout... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[18/32] config-cloud-resource-naming... 35% (A:2 C:2 P:2 N:1) files:5
[19/32] config-region-code-algorithm... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[20/32] config-app-config-merge-order... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[21/32] config-account-creation-product-id... 20% (A:0 C:0 P:4 N:0) files:5 [NOT_FOUND]
[22/32] config-ipam-rds-backup... 100% (A:5 C:5 P:5 N:5) files:5
[23/32] config-dev-artifact-naming... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[24/32] services-tech-stack-orchestration... 35% (A:2 C:2 P:1 N:2) files:5
[25/32] services-state-management... 60% (A:3 C:4 P:2 N:3) files:5
[26/32] services-eks-addon-versions... 100% (A:5 C:5 P:5 N:5) files:4
[27/32] services-aws-nat-egress-model... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[28/32] services-ipam-netbox-role... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[29/32] contracts-argo-gen-params-required... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[30/32] contracts-azure-xrd-naming... 25% (A:0 C:0 P:5 N:0) files:5 [NOT_FOUND]
[31/32] contracts-helm-chart-required-values... 20% (A:1 C:1 P:1 N:1) files:5
[32/32] contracts-sync-wave-ordering... 15% (A:0 C:1 P:1 N:1) files:5
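Each per-question line above appears to follow a fixed shape, and the percentage matches the four rubric scores summed and divided by the maximum of 20 (for example, A:3 C:2 P:4 N:3 gives 12/20 = 60%). The sketch below re-derives the per-question and overall numbers from the log text. The names `parseResultLine` and `summarize` are hypothetical, and the scoring formula and the reading of A/C/P/N as accuracy, completeness, precision, and navigation are inferred from this log rather than taken from the eval source.

```javascript
// Hypothetical helpers for re-checking the eval log; not part of the commit.
// Assumption: percent = (A + C + P + N) / 20, inferred from the log lines.
const MAX_TOTAL = 20; // four rubric scores, each on a 0-5 scale
const LINE_RE =
  /^\[(\d+)\/(\d+)\] (\S+)\.\.\. (\d+)% \(A:(\d) C:(\d) P:(\d) N:(\d)\) files:(\d+)( \[NOT_FOUND\])?$/;

function parseResultLine(line) {
  const m = LINE_RE.exec(line.trim());
  if (!m) return null; // not a per-question line
  const [, index, total, id, pct, a, c, p, n, files, notFound] = m;
  const sum = Number(a) + Number(c) + Number(p) + Number(n);
  return {
    index: Number(index),
    total: Number(total),
    id,
    reportedPct: Number(pct),
    computedPct: (sum * 100) / MAX_TOTAL,
    files: Number(files),
    notFound: Boolean(notFound),
  };
}

function summarize(lines) {
  // Lines that do not match the per-question shape are skipped.
  const rows = lines.map(parseResultLine).filter(Boolean);
  const overallPct = rows.reduce((s, r) => s + r.computedPct, 0) / rows.length;
  const notFound = rows.filter((r) => r.notFound).length;
  return { count: rows.length, overallPct, notFound };
}
```

Feeding all 32 lines of this run through `summarize` should reproduce the 33.4% overall score and the 20 NOT_FOUND answers reported below, provided the inferred formula holds.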
════════════════════════════════════════════════════════════
AGENT EVAL REPORT
════════════════════════════════════════════════════════════
Overall Score: 33.4%
Accuracy: 0.91/5 Completeness: 1.03/5 Precision: 3.75/5 Navigation: 1.00/5
Not Found: 20/32 (62.5%)
By Category:
architecture: 35.0% (6 questions)
dependencies: 30.0% (5 questions)
operations: 25.0% (6 questions)
configuration: 38.3% (6 questions)
services: 49.0% (5 questions)
contracts: 21.3% (4 questions)
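The category averages above line up with the question ID prefixes (`arch-`, `dep-`, `ops-`, `config-`, `services-`, `contracts-`). As a rough sketch, the breakdown can be rebuilt by bucketing per-question percentages on that prefix. The prefix-to-category map is an assumption inferred from this log (the eval may well read an explicit category field from the questions file), and `averageByCategory` is a hypothetical name.

```javascript
// Hypothetical mapping, inferred from the question IDs in this log.
const PREFIX_TO_CATEGORY = new Map([
  ['arch', 'architecture'],
  ['dep', 'dependencies'],
  ['ops', 'operations'],
  ['config', 'configuration'],
  ['services', 'services'],
  ['contracts', 'contracts'],
]);

// results: array of { id: string, pct: number }
function averageByCategory(results) {
  const buckets = new Map();
  for (const { id, pct } of results) {
    const prefix = id.split('-')[0];
    const category = PREFIX_TO_CATEGORY.get(prefix) ?? 'unknown';
    const bucket = buckets.get(category) ?? { sum: 0, count: 0 };
    bucket.sum += pct;
    bucket.count += 1;
    buckets.set(category, bucket);
  }
  const report = {};
  for (const [category, { sum, count }] of buckets) {
    // Round to one decimal place, as the report does.
    report[category] = { avgPct: Math.round((sum / count) * 10) / 10, questions: count };
  }
  return report;
}
```

With the four contracts questions from this run (25, 25, 20, and 15 percent) the sketch yields 21.3% over 4 questions, matching the report line.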
By Difficulty:
easy: 38.0% (10 questions)
medium: 25.3% (17 questions)
hard: 52.0% (5 questions)
Weakest:
[arch-production-cidr] 0% — What is the CIDR range for production workloads on AWS and on GCP?... (read: reference/subsystems/network-core.md, reference/helm/charts/network-common-charts-foxtrot-aws-vpc.md, reference/helm/charts/network-common-charts-foxtrot-gcp-vpc.md, reference/subsystems/network-common.md, reference/system-architecture.md)
[contracts-sync-wave-ordering] 15% — What are the ArgoCD sync wave values and what resource types are deplo... (read: reference/helm/charts/app-common-charts-argocd-apps.md, reference/helm/index.md, reference/subsystems/app-common.md, diagrams/helm-interactions.mmd, reference/system-architecture.md)
[config-account-creation-product-id] 20% — What is the AWS Service Catalog product ID used by account-common for ... (read: reference/helm/charts/account-common-charts-account-creation.md, reference/subsystems/account-common.md, reference/contracts/index.md, reference/helm/index.md, agent-kb.json)
[contracts-helm-chart-required-values] 20% — What are the five required values that all app Helm charts must define... (read: reference/helm/index.md, reference/subsystems/app-common.md, reference/contracts/index.md, reference/system-architecture.md, reference/helm/charts/app-common-charts-cluster.md)
[dep-runtime-common-horizontal] 25% — Which runtime repositories consume charts from which common repositori... (read: reference/subsystems/runtime.md, reference/helm/index.md, reference/system-architecture.md, reference/contracts/index.md, diagrams/helm-interactions.mmd)
Full report: /home/node/.openclaw/workspace/projects/dev-intel-v2/eval-wiggum-v2-iter-1.json
🏁 Iteration 1 Score: 33% (Target: 77%)
❌ Below threshold. To iterate, we need a diagnosis and code fix step here.
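The run stops after iteration 1 because the loop has no diagnosis and fix step yet. One possible shape for the control flow is sketched below, under the stated limits of 3 iterations and a 77% target. `wiggumLoop`, `runPipelineAndEval`, and `diagnoseAndFix` are hypothetical names, the callbacks are placeholders for work this log does not describe, and the sketch is synchronous for brevity where a real pipeline would most likely be asynchronous.

```javascript
// Hypothetical loop driver; the two callbacks are placeholders, not real APIs.
function wiggumLoop({ maxIterations, targetPct, runPipelineAndEval, diagnoseAndFix }) {
  const history = [];
  for (let i = 1; i <= maxIterations; i++) {
    const score = runPipelineAndEval(i); // regenerate docs, run the agent eval
    history.push(score);
    if (score >= targetPct) {
      return { passed: true, iterations: i, history };
    }
    // Only attempt a fix when another iteration will actually run.
    if (i < maxIterations) {
      diagnoseAndFix(i, score);
    }
  }
  return { passed: false, iterations: maxIterations, history };
}
```

Skipping the fix step after the final iteration avoids changing the pipeline without re-evaluating it, so the last recorded score always reflects the code as it stands.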