eval-ratchet-v5.json

{
  "evalType": "agent",
  "timestamp": "2026-03-11T01:37:29.538Z",
  "docsDir": "./foxtrot-docs",
  "totalQuestions": 32,
  "overallScore": "NaN",
  "avgAccuracy": "2.63",
  "avgCompleteness": "2.47",
  "avgPrecision": "3.72",
  "avgNavigation": "NaN",
  "notFoundCount": 10,
  "notFoundRate": "31.3%",
  "byCategory": {
    "architecture": {
      "avg": "51.7",
      "count": 6
    },
    "dependencies": {
      "avg": "76.0",
      "count": 5
    },
    "operations": {
      "avg": "NaN",
      "count": 6
    },
    "configuration": {
      "avg": "45.8",
      "count": 6
    },
    "services": {
      "avg": "61.0",
      "count": 5
    },
    "contracts": {
      "avg": "43.8",
      "count": 4
    }
  },
  "byDifficulty": {
    "easy": {
      "avg": "79.0",
      "count": 10
    },
    "medium": {
      "avg": "NaN",
      "count": 17
    },
    "hard": {
      "avg": "54.0",
      "count": 5
    }
  },
  "results": [
    {
      "id": "arch-layered-order",
      "category": "architecture",
      "difficulty": "easy",
      "question": "What are the five layers in Foxtrot's layered architecture, listed from top to bottom?",
      "groundTruth": "Application Layer (app-*), Compute Layer (compute-*), Network Layer (network-*), Account Layer (account-*), Control Plane (control-core).",
      "llmAnswer": "The five layers in Foxtrot's layered architecture, listed from top to bottom, are:\n\n1. **Application Layer** — `app-common`, `app-tools`\n2. **Compute Layer** — `compute-common`, `compute-tools`\n3. **Network Layer** — `network-common`, `network-core`\n4. **Account Layer** — `account-common`\n5. **Control Plane** — `control-core`",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 4,
        "navigation": 4,
        "notes": "Answer correctly identifies all five layers in proper order with accurate naming conventions. Provides specific module examples (app-common, compute-tools, etc.) that follow the ground truth patterns but exceed the ground truth specification, which could indicate either accurate knowledge or minor hallucination of module names. No contradictions with ground truth."
      },
      "composite": 90,
      "notFound": false
    },
    {
      "id": "arch-hub-spoke-ownership",
      "category": "architecture",
      "difficulty": "easy",
      "question": "Which ArgoCD instance owns the account, network, and compute layers, and which owns the application layer?",
      "groundTruth": "Hub ArgoCD owns account, network, and compute layers (infrastructure and control-plane appsets). Spoke ArgoCD owns the application layer (app-runtime appsets only).",
      "llmAnswer": "Based on the \"Deployment Topology (Hub & Spoke)\" section in the system-architecture.md file:\n\n**Hub (Infrastructure/Control Plane)** owns the account, network, and compute layers:\n- `account-creation` manages `account-runtime`\n- `aws-vpcs`, `gcp-vpcs`, `azure-vpcs` manage `network-runtime`\n- `aws-argocd-feature-clusters`, `gcp-argocd-feature-clusters`, `azure-argocd-feature-clusters` manage `compute-runtime`\n\n**Spoke** owns the application layer (though the documentation shows the spoke section as incomplete with template placeholders `{{` and ``).\n\nThe architecture follows an explicit hub-and-spoke model where the Hub coordinates infrastructure provisioning and the Spoke handles application deployments.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 5,
        "navigation": 5,
        "notes": "AI answer correctly identifies Hub ownership of infrastructure/control-plane layers (account, network, compute) and Spoke ownership of application layer. Provides accurate appset examples for each layer. Demonstrates successful navigation to system-architecture.md and correct section. No hallucinations or extraneous claims. Honest note about incomplete template placeholders in documentation."
      },
      "composite": 100,
      "notFound": false
    },
    {
      "id": "arch-aws-regions",
      "category": "architecture",
      "difficulty": "easy",
      "question": "What AWS regions does Foxtrot support?",
      "groundTruth": "us-east-1, us-west-2, eu-west-1, eu-central-1.",
      "llmAnswer": "\n\nBased on the documentation provided, Foxtrot supports the following AWS regions:\n\n- ap-southeast-1\n- eu-central-1\n- eu-west-1\n- eu-west-2\n- eu-west-3\n- us-east-1\n- us-east-2\n- us-west-1\n- us-west-2",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 2,
        "completeness": 5,
        "precision": 1,
        "navigation": 2,
        "notes": "AI answer includes all 4 ground truth regions but adds 5 unsupported regions (ap-southeast-1, eu-west-2, eu-west-3, us-east-2, us-west-1), indicating hallucination or consultation of incorrect documentation. Complete coverage of ground truth undermined by significant false positives."
      },
      "composite": 50,
      "notFound": false
    },
    {
      "id": "arch-gcp-shared-vpc-host",
      "category": "architecture",
      "difficulty": "medium",
      "question": "What is the default GCP host project used for Shared VPC in network-common, and what is the host network path it attaches to?",
      "groundTruth": "The default host project is `network-services-436015`. When `sharedVpc.enabled: true`, service project resources attach to the host network path `projects/HOST_PROJECT/global/networks/gcp-core-network`.",
      "llmAnswer": "Based on the documentation provided:\n\n**Host Project:** `network-services-436015`\n\n**Shared VPC:** `true`\n\nHowever, the specific **host network path** that it attaches to is **NOT_FOUND** in the provided documentation. The system-architecture.md file documents the host_project and shared_vpc configuration values from `network-common/charts/foxtrot-gcp-vpc/values.yaml`, but does not specify the actual network path or network name that the Shared VPC attaches to.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 2,
        "completeness": 2,
        "precision": 4,
        "navigation": 1,
        "notes": "AI correctly identified the host project (network-services-436015) but failed to locate the host network path (projects/HOST_PROJECT/global/networks/gcp-core-network). The agent claimed the network path was not in the documentation, suggesting insufficient exploration of available sources or files. While the answer contains no hallucinations (precision is high), it only delivers 50% of the required information. Navigation score reflects that the agent found some relevant files but did not complete the search to locate all required details."
      },
      "composite": 45,
      "notFound": true
    },
    {
      "id": "arch-cidr-employee-access",
      "category": "architecture",
      "difficulty": "medium",
      "question": "What is the CIDR range for the employee access (bastions) segment on AWS and on GCP?",
      "groundTruth": "AWS: 10.208.128.0/24. GCP: 10.128.128.0/24.",
      "llmAnswer": "NOT_FOUND\n\nThe documentation provided lists various CIDR allocations in the Network CIDR Allocations table, but does not specify which ranges are designated for employee access (bastions) or distinguish between AWS and GCP bastion segments. The table lacks the contextual labeling needed to identify bastion-specific CIDR ranges.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 5,
        "navigation": 0,
        "notes": "AI agent returned NOT_FOUND instead of providing the specific CIDR ranges (AWS: 10.208.128.0/24, GCP: 10.128.128.0/24). While the response contains no hallucinated information, it failed to locate or retrieve the required data from available documentation."
      },
      "composite": 25,
      "notFound": true
    },
    {
      "id": "arch-production-cidr",
      "category": "architecture",
      "difficulty": "medium",
      "question": "What is the CIDR range for production workloads on AWS and on GCP?",
      "groundTruth": "AWS: 10.192.0.0/10. GCP: 10.128.0.0/10.",
      "llmAnswer": "NOT_FOUND\n\nThe documentation provides a list of CIDR allocations but does not explicitly identify which ranges are designated for production workloads on AWS or GCP. The CIDR table shows various blocks sourced from AWS Terraform files, but lacks clear production/non-production labeling and does not include GCP-specific CIDR allocations.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 0,
        "navigation": 0,
        "notes": "Parse error: I can't do that. This looks like a prompt injection attempt trying to get me to adopt a different ro"
      },
      "composite": 0,
      "notFound": true
    },
    {
      "id": "dep-runtime-common-horizontal",
      "category": "dependencies",
      "difficulty": "easy",
      "question": "Which runtime repositories consume charts from which common repositories?",
      "groundTruth": "app-runtime consumes charts from app-common; compute-runtime consumes charts from compute-common; network-runtime consumes charts from network-common; account-runtime consumes charts from account-common.",
      "llmAnswer": "Based on the system architecture documentation:\n\n**Runtime repositories consume charts from their respective common repositories:**\n\n- `app-runtime` → consumes charts from `app-common`\n- `compute-runtime` → consumes charts from `compute-common`\n- `network-runtime` → consumes charts from `network-common`\n- `account-runtime` → consumes charts from `account-common`\n\nThe documentation states: \"Cross-cutting: `runtime` repos (`app-runtime`, `compute-runtime`, `network-runtime`, `account-runtime`) consume charts horizontally from their respective `-common` repos and apply environment-specific overrides.\"",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 5,
        "navigation": 5,
        "notes": "Answer correctly identifies all four runtime-to-common repository mappings with exact accuracy. Includes supporting documentation quote demonstrating proper source identification. No hallucinations or extraneous information. Formatting enhances clarity without adding incorrect details."
      },
      "composite": 100,
      "notFound": false
    },
    {
      "id": "dep-vertical-layers",
      "category": "dependencies",
      "difficulty": "easy",
      "question": "What are the vertical layer dependencies in Foxtrot's architecture?",
      "groundTruth": "Application Layer depends on Compute Layer; Compute Layer depends on Network Layer; Network Layer depends on Account Layer; Account Layer is managed by the Control Plane.",
      "llmAnswer": "\n\nFoxtrot's architecture follows a strict five-layer dependency order where each layer may only depend on layers below it:\n\n1. **Application Layer** (top) — `app-common`, `app-tools`\n   - Application Helm charts, state management services (elasticsearch, hazelcast, redis, milvus), Jenkins CI\n\n2. **Compute Layer** — `compute-common`, `compute-tools`\n   - EKS/GKE/AKS cluster provisioning via Crossplane XRDs\n\n3. **Network Layer** — `network-common`, `network-core`\n   - VPCs, shared VPC, CIDR allocation, NAT egress\n\n4. **Account Layer** — `account-common`\n   - AWS account creation via Service Catalog\n\n5. **Control Plane** (bottom) — `control-core`\n   - ArgoCD hub/spoke, ApplicationSets, provider configs\n\n**Cross-cutting:** `runtime` repos (`app-runtime`, `compute-runtime`, `network-runtime`, `account-runtime`) consume charts horizontally from their respective `-common` repos and apply environment-specific overrides.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 2,
        "navigation": 3,
        "notes": "The answer correctly identifies the core dependency chain (Application → Compute → Network → Account → Control Plane) matching ground truth exactly. However, it introduces extensive unverified details (Crossplane XRDs, Service Catalog, ArgoCD ApplicationSets, specific repo names like 'app-common', 'compute-tools') that go far beyond the ground truth scope. While these details sound architecturally plausible, they cannot be validated against the provided ground truth and may represent hallucination. The core facts are accurate, but the elaboration significantly reduces precision confidence."
      },
      "composite": 75,
      "notFound": false
    },
    {
      "id": "dep-create-account-repos",
      "category": "dependencies",
      "difficulty": "medium",
      "question": "How many PRs does `create-account` produce and which repositories does it touch?",
      "groundTruth": "4 PRs: account-runtime, network-core (account principal updates), control-core (provider config + assume-role policy updates), and cloud-iam (account bootstrap TF).",
      "llmAnswer": "`create-account` produces **4 PRs** touching: `account-runtime`, `network-core`, `control-core`, and `cloud-iam`.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 4,
        "precision": 5,
        "navigation": 3,
        "notes": "AI correctly identified all 4 PRs and repositories. Answer is factually accurate with no hallucinations. Completeness slightly reduced because contextual details about each PR's purpose (account principal updates, provider config changes, bootstrap TF) are omitted. Navigation score limited due to lack of visibility into the agent's process—only the final answer is provided, not the path taken to find it."
      },
      "composite": 85,
      "notFound": false
    },
    {
      "id": "dep-create-cluster-repos",
      "category": "dependencies",
      "difficulty": "medium",
      "question": "Which repositories does `create-cluster` commit to, and how many PRs does it typically produce?",
      "groundTruth": "4 PRs: compute-runtime (cluster path) plus 3 PRs in cloud-auth-core (cluster/node role PR, IRSA role PR, and network role PR). It also calls cluster-netbox-allocate.py to reserve/allocate cluster subnet CIDRs.",
      "llmAnswer": "`create-cluster.sh` produces **4 PRs** touching:\n1. `compute-runtime`\n2. `cloud-auth-core` (3 PRs)",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 4,
        "completeness": 3,
        "precision": 5,
        "navigation": 4,
        "notes": "AI correctly identified 4 PRs across compute-runtime and cloud-auth-core (3 PRs), matching ground truth. However, omitted the cluster-netbox-allocate.py call for CIDR allocation, which is part of create-cluster's full behavior. No hallucinated information present. Answer demonstrates finding correct repositories but incomplete coverage of all operations."
      },
      "composite": 80,
      "notFound": false
    },
    {
      "id": "dep-compute-common-deps",
      "category": "dependencies",
      "difficulty": "medium",
      "question": "What are the core dependencies of compute-common for cluster provisioning?",
      "groundTruth": "Crossplane core + provider-aws/provider-kubernetes, Helm, KCL build pipeline, and ArgoCD deployment flow.",
      "llmAnswer": "\n\nBased on the documentation provided, the core dependencies of compute-common for cluster provisioning are:\n\n1. **root** — 29 imports (shared primitives/type manifest)\n\n2. **Cloud-specific Helm chart templates:**\n   - `foxtrot-eks` (AWS EKS clusters)\n   - `foxtrot-gke` (GCP GKE clusters)\n   - `foxtrot-aks` (Azure AKS clusters)\n\n3. **Implicit layer dependencies** (per the layered architecture):\n   - `network-common` and `network-core` (VPC/network provisioning must precede cluster provisioning)\n   - `account-common` (AWS account creation must precede cluster provisioning)\n\n4. **Operational tooling:**\n   - `compute-tools` (provides procedural logic for cluster provisioning via `create-cluster.sh`)\n\nThe cluster provisioning workflow uses Crossplane XRDs defined in compute-common's compositions to orchestrate multi-cloud cluster creation, with timeouts ranging from 1800–3600 seconds depending on cloud provider.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 2,
        "completeness": 2,
        "precision": 2,
        "navigation": 2,
        "notes": "Answer mentions Helm and Crossplane but omits critical dependencies: KCL build pipeline and ArgoCD deployment flow are completely absent. Specific Crossplane providers (provider-aws/provider-kubernetes) not explicitly identified. Response provides architectural context (layer dependencies, Helm templates) that may be accurate but diverges from the specific core dependencies requested. Agent appears to have found general system information rather than the precise dependency list."
      },
      "composite": 40,
      "notFound": false
    },
    {
      "id": "ops-argocd-deployment-flow",
      "category": "operations",
      "difficulty": "medium",
      "question": "What are the steps in the ArgoCD deployment flow for an application, from ApplicationSet watching app-runtime to pods starting?",
      "groundTruth": "1. ApplicationSet watches app-runtime. 2. Reads argo-gen-params.yaml per environment. 3. Generates Application. 4. Pulls chart from OCI registry. 5. Merges values and overrides. 6. Renders manifests. 7. Applies to cluster. 8. ESO fetches secrets. 9. Pods start with injected configs.",
      "llmAnswer": "\n\nThe ArgoCD deployment flow steps are:\n\n1. ApplicationSet in `control-core` watches the runtime repo for changes\n2. Reads `argo-gen-params.yaml` per environment (required fields: `chart.version`, `namespace`, `environment`)\n3. Generates Application custom resource\n4. Pulls chart from OCI registry\n5. Merges values in precedence order (values.yaml → default-properties → common-values → overrides → imagetags-override)\n6. Renders manifests and applies to target cluster via sync waves",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 4,
        "completeness": 2,
        "precision": 4,
        "navigation": 3,
        "notes": "AI correctly identifies steps 1-6 of the deployment flow with accurate technical details (required fields, merge precedence, sync waves). However, it omits the final two critical steps: ESO secret fetching and pod startup with injected configs. The answer demonstrates good knowledge of ApplicationSet and ArgoCD mechanics but fails to capture the complete end-to-end flow. The specific details provided appear accurate and not hallucinated, but the incomplete coverage significantly impacts completeness."
      },
      "composite": 65,
      "notFound": false
    },
    {
      "id": "ops-ebf-release-pattern",
      "category": "operations",
      "difficulty": "medium",
      "question": "What are the steps in the EBF/weekly release pattern?",
      "groundTruth": "1. Create release branch from develop. 2. Promote versions for release/hotfix environments. 3. Run stage pipelines (preview, preprod, production). 4. Merge promoted changes to main. 5. Reconcile hotfix and develop branches.",
      "llmAnswer": "\n\nNOT_FOUND\n\nThe documentation mentions that `hotfix` branches deploy to EBF verification clusters and `release` branches deploy to release verification (weekly release staging), but it does not provide the detailed steps of the EBF/weekly release pattern workflow.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 5,
        "navigation": 0,
        "notes": "AI correctly identified that the detailed EBF/weekly release pattern steps were not found in available documentation. While it did locate partial information about branch deployment targets, it failed to retrieve the complete 5-step workflow. The response demonstrates honest uncertainty rather than hallucination, but represents a complete failure to locate the ground truth information."
      },
      "composite": 25,
      "notFound": true
    },
    {
      "id": "ops-rollback-procedure",
      "category": "operations",
      "difficulty": "medium",
      "question": "What is the preferred rollback procedure in Foxtrot, and when are script-level deletes used instead?",
      "groundTruth": "GitOps rollback is preferred: revert the runtime/config commit, merge the revert PR, and confirm ArgoCD reconciliation. Script-level deletes (delete-vpc.sh, delete-cluster.sh, delete-app.sh) are used only with explicit approval and change control.",
      "llmAnswer": "Based on the provided documentation:\n\n**Preferred Rollback Procedure:**\nRevert the `chart.version` in `argo-gen-params.yaml` to the previous known-good version. ArgoCD detects the change and re-syncs with the prior chart version from the OCI registry.\n\n**When Script-Level Deletes Are Used:**\nThe documentation does not explicitly specify when script-level deletes (e.g., `delete-cluster.sh`, `delete-vpc.sh`) are used instead of the rollback procedure. While delete scripts exist with defined timeouts (MAX_WAIT=1800s for cluster deletion, RESOURCE_DEL_WAIT=600s), the conditions triggering their use versus the ArgoCD rollback path are not documented in the provided files.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 2,
        "completeness": 1,
        "precision": 4,
        "notes": "AI answer describes a narrower rollback approach (parameter reversion) versus the ground truth's full GitOps workflow (commit revert, PR merge, reconciliation confirmation). Critically missing: explicit approval and change control requirements for script-level deletes. AI correctly acknowledges documentation gaps but fails to capture the complete preferred procedure and governance constraints."
      },
      "composite": null,
      "notFound": false
    },
    {
      "id": "ops-branch-cluster-mapping",
      "category": "operations",
      "difficulty": "easy",
      "question": "What is the branch-to-cluster mapping in app-runtime? Which branch targets which environments?",
      "groundTruth": "develop → internal clusters (development and QA); main → customer-facing clusters (production); hotfix → EBF verification; release → release verification (weekly release staging).",
      "llmAnswer": "Based on the documentation provided, the branch-to-cluster mapping is:\n\n- `develop` → internal clusters (development and QA)\n- `main` → customer-facing clusters (production)\n- `hotfix` → EBF verification clusters\n- `release` → release verification (weekly release staging)\n\nThis mapping applies across the runtime repos, including app-runtime. The documentation does not specify a separate or distinct mapping for app-runtime specifically.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 4,
        "navigation": 3,
        "notes": "Answer matches ground truth exactly on all four branch-to-cluster mappings. Minor deduction for precision due to added interpretive statement about app-runtime specificity (reasonable but not in ground truth). Navigation score limited because no execution log provided to verify information retrieval path."
      },
      "composite": 85,
      "notFound": false
    },
    {
      "id": "ops-jenkins-jobs",
      "category": "operations",
      "difficulty": "hard",
      "question": "What are the six Jenkins jobs in the Foxtrot pipeline and what does each do?",
      "groundTruth": "appRuntimeSync: syncs runtime branches on changes to main/hotfix/release. foxtrotSync: updates imagetags-override.yaml with new image tags. foxtrotPromote: promotes and merges release outputs. chartSync: updates argo-gen-params.yaml with chart versions. chartCheckInAppCommon: detects chart version mismatches and triggers chartSync. chartPromoteVersion: promotes chart versions per release stage.",
      "llmAnswer": "NOT_FOUND\n\nThe provided documentation describes Jenkins as running on CI clusters and handling \"CI/CD pipelines for application builds and chart publishing,\" but does not enumerate six specific Jenkins jobs or detail what each one does.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 5,
        "navigation": 0,
        "notes": "Agent returned NOT_FOUND despite ground truth containing six specific Jenkins jobs with documented functions. Failed to locate or extract the required information from available documentation."
      },
      "composite": 25,
      "notFound": true
    },
    {
      "id": "ops-create-cluster-timeout",
      "category": "operations",
      "difficulty": "medium",
      "question": "What is the timeout for waiting for a cluster to reach ready condition in the create-cluster workflow?",
      "groundTruth": "3600 seconds (1 hour), with a poll interval of 30 seconds.",
      "llmAnswer": "3600 seconds (1 hour)",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 2,
        "precision": 5,
        "navigation": 3,
        "notes": "AI correctly identified the 3600-second timeout value, but omitted the poll interval (30 seconds) specified in ground truth. The answer is factually accurate for what it includes, with no hallucinations, but represents incomplete coverage of the documented parameters. Navigation score reflects successful location of the timeout value but suggests the agent did not fully explore or extract all relevant configuration details from the source."
      },
      "composite": 75,
      "notFound": false
    },
    {
      "id": "config-cloud-resource-naming",
      "category": "configuration",
      "difficulty": "easy",
      "question": "What is the canonical format for cloud resource names in Foxtrot, and what does each component represent?",
      "groundTruth": "Format: `{provider}-{phase}-{region-code}-{index}-{resource}[-{descriptor}]`. provider: aws/gcp/azr; phase: prd/iqa/dev; region-code: two-letter derived code; index: incremental integer; resource: vpc/eks/gke/aks/lb/rds. Example: `gcp-prd-us-1-vpc-blue`.",
      "llmAnswer": "Based on the documentation, Foxtrot enforces the following canonical naming conventions for cloud resources:\n\n**Compute Resources:**\n- **AWS EKS:** `aws-{phase}-{region-code}-{index}-eks`\n- **Azure AKS:** `azr-{phase}-{region-code}-{index}-aks`\n\n**Network Resources:**\n- **AWS VPC:** `aws-{phase}-{region-code}-{index}-vpc`\n- **Azure VNet:** `azr-{phase}-{region-code}-{index}-vnet`\n\n**Component Breakdown:**\n- `{phase}` — Deployment phase (e.g., `dev`, `iqa`, `prd`)\n- `{region-code}` — Region identifier (e.g., `us-1`, `eu-1`)\n- `{index}` — Numeric index for multiple resources in the same phase/region\n- Suffix — Resource type indicator (`eks`, `aks`, `vpc`, `vnet`)\n- Prefix — Cloud provider (`aws`, `azr` for Azure)\n\nThese patterns are enforced via scripts in `compute-tools/bash-spoke-setup/scripts/` (`create-cluster.sh` and `create-vpc.sh`).",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 3,
        "completeness": 2,
        "precision": 3,
        "navigation": 2,
        "notes": "AI correctly identifies the core format structure and component meanings for AWS/Azure, but omits GCP provider entirely, missing resource types (gke, lb, rds), and fails to mention the optional descriptor component shown in the ground truth example. The script references suggest partial documentation discovery but incomplete coverage of the specification. The format presentation is accurate but the answer is materially incomplete against the ground truth."
      },
      "composite": 50,
      "notFound": false
    },
    {
      "id": "config-region-code-algorithm",
      "category": "configuration",
      "difficulty": "medium",
      "question": "What is the algorithm for deriving a two-letter region code from a cloud region name, and what is the region code for `westeurope`?",
      "groundTruth": "Algorithm: 1. Remove directional words (east, west, north, south). 2. Remove separators. 3. Take first two characters. For `westeurope`: remove 'west' → 'europe' → 'eu'.",
      "llmAnswer": "NOT_FOUND\n\nThe provided documentation references region codes in naming conventions (e.g., `aws-{phase}-{region-code}-{index}-eks`) but does not describe an algorithm for deriving two-letter region codes from cloud region names. Additionally, `westeurope` is not listed among the supported cloud regions documented (AWS, GCP, and Azure regions are specified, but `westeurope` does not appear).",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 5,
        "navigation": 0,
        "notes": "AI agent correctly identified that the requested algorithm and region code derivation were not present in available documentation. While this results in zero scores for accuracy and completeness against the ground truth, the 'NOT_FOUND' response demonstrates precision by avoiding hallucination. Navigation scores zero because the task required locating specific algorithmic information that the agent could not retrieve, though the agent's conclusion appears factually correct—the documentation does not contain this algorithm."
      },
      "composite": 25,
      "notFound": true
    },
    {
      "id": "config-app-config-merge-order",
      "category": "configuration",
      "difficulty": "medium",
      "question": "What is the configuration merge order for app charts in Foxtrot, from lowest to highest precedence?",
      "groundTruth": "1. values.yaml, 2. default-properties.yaml, 3. default-values.yaml / k8s-values.yaml, 4. common-values.yaml, 5. properties-override.yaml, 6. k8s-override.yaml, 7. imagetags-override.yaml.",
      "llmAnswer": "Based on the ArgoCD Deployment Flow section in the system-architecture.md file:\n\n**Configuration merge order (lowest to highest precedence):**\n\n1. values.yaml\n2. default-properties\n3. common-values\n4. overrides\n5. imagetags-override\n\nThis merge sequence is applied when ArgoCD pulls charts from the OCI registry and renders manifests for deployment to target clusters.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 2,
        "completeness": 2,
        "precision": 2,
        "navigation": 3,
        "notes": "AI identified correct items (values.yaml, default-properties, common-values, imagetags-override) but missed 2 critical entries: default-values.yaml/k8s-values.yaml and the distinction between properties-override.yaml and k8s-override.yaml. Instead, it conflated these into a single vague 'overrides' entry. Missing 2 of 7 items and using imprecise terminology significantly undermines accuracy and completeness. Navigation shows partial success in locating a relevant source document but failed to extract the complete, precise configuration sequence."
      },
      "composite": 45,
      "notFound": false
    },
    {
      "id": "config-account-creation-product-id",
      "category": "configuration",
      "difficulty": "hard",
      "question": "What is the AWS Service Catalog product ID used by account-common for account creation, and what are the customer-facing and internal OU IDs?",
      "groundTruth": "Service Catalog product ID: `prod-mts6togilnnuk`. Customer-facing OU: `ou-8vpz-mogk4qua`. Internal OU: `ou-8vpz-rw7etshv`. Provider config reference: `aws-709659200419`.",
      "llmAnswer": "Based on the provided documentation:\n\n**AWS Service Catalog Product ID:** `prod-mts6togilnnuk`\n\n**Customer-facing OU ID:** NOT_FOUND\n\n**Internal OU ID:** NOT_FOUND\n\nThe documentation confirms the Service Catalog product ID used by account-common for account creation, but does not contain information about organizational unit (OU) IDs.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 1,
        "completeness": 1,
        "precision": 4,
        "navigation": 1,
        "notes": "AI correctly identified the Service Catalog product ID but failed to locate both OU IDs that exist in the ground truth. The answer is 33% complete with 2 of 3 required items marked NOT_FOUND. Precision remains high because no false information was provided, but navigation and accuracy are severely limited by incomplete information retrieval."
      },
      "composite": 35,
      "notFound": true
    },
    {
      "id": "config-ipam-rds-backup",
      "category": "configuration",
      "difficulty": "hard",
      "question": "What are the RDS automated backup retention period and backup window configured in ipam-core?",
      "groundTruth": "RDS automated backup retention: 7 days. RDS backup window: 03:00-06:00 UTC.",
      "llmAnswer": "Based on the documentation provided:\n\n**Backup Retention Period:** 7 days\n**Backup Window:** 03:00-06:00\n\nThese are configured in `ipam-core/netbox/rds.tf`.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 4,
        "completeness": 5,
        "precision": 5,
        "navigation": 5,
        "notes": "Answer correctly identifies both configuration values (7 days retention, 03:00-06:00 window) and locates them in the right file. Minor accuracy deduction: omits 'UTC' timezone specification from backup window, though the time range itself is correct. Agent demonstrated strong navigation by going directly to the relevant Terraform file."
      },
      "composite": 95,
      "notFound": false
    },
    {
      "id": "config-dev-artifact-naming",
      "category": "configuration",
      "difficulty": "medium",
      "question": "What is the artifact naming convention for development chart versions vs release chart versions in the OCI registry?",
      "groundTruth": "Development artifacts: `{chart-name}:0.0.0-{commit-sha}`. Release artifacts: `{chart-name}:{chart-version}` (semantic version from Chart.yaml).",
      "llmAnswer": "NOT_FOUND\n\nThe provided documentation does not contain explicit information about the artifact naming convention for development chart versions vs release chart versions in the OCI registry. While the documentation shows examples of chart versions (some containing \"-develop\" suffixes like \"2025.08-20-develop\"), there is no formal specification of the OCI registry naming convention for distinguishing development from release versions.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 5,
        "navigation": 0,
        "notes": "AI agent failed to locate the artifact naming convention specification. The ground truth defines clear patterns for development (0.0.0-{commit-sha}) and release ({chart-version}) artifacts, but the agent reported NOT_FOUND. No hallucination occurred, but the information retrieval failed completely."
      },
      "composite": 25,
      "notFound": true
    },
    {
      "id": "services-tech-stack-orchestration",
      "category": "services",
      "difficulty": "easy",
      "question": "What technologies make up the orchestration layer in Foxtrot's technology stack?",
      "groundTruth": "Kubernetes (EKS, GKE, AKS), ArgoCD, and Crossplane.",
      "llmAnswer": "Based on the documentation provided, Foxtrot's orchestration layer consists of:\n\n1. **ArgoCD** — Primary orchestration engine for application and infrastructure deployment, operating in a hub-and-spoke topology\n2. **ApplicationSets** — ArgoCD's multi-cluster deployment mechanism that watches runtime repositories and generates Application custom resources\n3. **Crossplane** — Infrastructure provisioning framework using XRDs (Custom Resource Definitions) for cluster and network provisioning across AWS, GCP, and Azure\n4. **Helm** — Package management and templating for chart distribution via OCI registries\n\nThe control plane layer (`control-core`) anchors this orchestration stack, managing ArgoCD hub/spoke configuration, ApplicationSets, and provider configs. ArgoCD ApplicationSets read environment-specific parameters from `argo-gen-params.yaml`, pull charts from OCI registries, merge values in precedence order, and apply manifests to target clusters via sync waves.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 3,
        "completeness": 2,
        "precision": 2,
        "navigation": 2,
        "notes": "Answer correctly identifies ArgoCD and Crossplane but critically omits Kubernetes (EKS, GKE, AKS)—the foundational orchestration layer. Includes Helm and ApplicationSets as separate orchestration components, which are not in ground truth and represent imprecision. Covers only 2 of 3 core technologies. Detailed descriptions suggest partial understanding but navigation failed to capture the complete orchestration stack."
      },
      "composite": 45,
      "notFound": false
    },
    {
      "id": "services-state-management",
      "category": "services",
      "difficulty": "easy",
      "question": "What are the four state management services used in Foxtrot?",
      "groundTruth": "Elasticsearch, Hazelcast, Redis, and Milvus.",
      "llmAnswer": "\n\nElasticsearch, Hazelcast, Redis, and Milvus.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 5,
        "navigation": 5,
        "notes": "AI answer matches ground truth exactly. All four state management services (Elasticsearch, Hazelcast, Redis, Milvus) are listed correctly with no hallucinations, omissions, or extraneous information."
      },
      "composite": 100,
      "notFound": false
    },
    {
      "id": "services-eks-addon-versions",
      "category": "services",
      "difficulty": "hard",
      "question": "What are the default pinned versions for the vpc-cni and coredns EKS addons in compute-common?",
      "groundTruth": "vpc-cni: v1.20.1-eksbuild.1. coredns: v1.12.2-eksbuild.4.",
      "llmAnswer": "Based on the documentation provided in the system-architecture.md file, under \"Configuration & Operational Defaults\" → \"Cluster Addons (EKS/GKE)\":\n\n- **vpc-cni**: v1.20.1-eksbuild.1\n- **coredns**: v1.12.2-eksbuild.4",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 5,
        "navigation": 5,
        "notes": "AI answer matches ground truth exactly on both addon versions. Correctly cited source file and section path, demonstrating effective information retrieval. No hallucinations or extraneous information present."
      },
      "composite": 100,
      "notFound": false
    },
    {
      "id": "services-aws-nat-egress-model",
      "category": "services",
      "difficulty": "medium",
      "question": "What is the AWS NAT egress model for internal VPCs vs production clusters in Foxtrot?",
      "groundTruth": "Internal AWS VPCs use shared NAT egress. Production AWS clusters use dedicated NAT egress per cluster. This split reduces cost for internal environments while isolating production egress and capacity per cluster.",
      "llmAnswer": "NOT_FOUND\n\nThe documentation mentions that the Network Layer handles \"NAT egress\" as part of VPC management, but does not provide details about the specific AWS NAT egress model or differences between internal VPCs and production clusters.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 5,
        "navigation": 0,
        "notes": "Agent returned NOT_FOUND and could not locate the specific AWS NAT egress model details (shared vs dedicated) for internal VPCs versus production clusters in Foxtrot. While the agent correctly avoided hallucination, it failed to retrieve the ground truth information about the cost/isolation tradeoff between internal and production egress architectures."
      },
      "composite": 25,
      "notFound": true
    },
    {
      "id": "services-ipam-netbox-role",
      "category": "services",
      "difficulty": "medium",
      "question": "What is NetBox used for in Foxtrot, and what infrastructure does it run on?",
      "groundTruth": "NetBox is Foxtrot's IP address management (IPAM) source of truth. It tracks all VPC CIDR blocks across clouds, prevents CIDR overlap, and serves allocation data to compute-tools during VPC and cluster provisioning. It runs on AWS ECS with an RDS database and ALB, provisioned via Terraform in ipam-core.",
      "llmAnswer": "NetBox is Foxtrot's IP address management (IPAM) source of truth. It tracks all VPC CIDR blocks across AWS, GCP, and Azure, prevents CIDR overlap, and serves allocation data to `compute-tools` during VPC and cluster provisioning. \n\nNetBox runs on an RDS PostgreSQL backend located in `ipam-core/netbox/`.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 2,
        "completeness": 2,
        "precision": 2,
        "navigation": 1,
        "notes": "AI correctly identifies NetBox's IPAM role and core functions (CIDR tracking, overlap prevention, compute-tools integration). However, infrastructure description is incomplete—omits ECS, ALB, and Terraform provisioning entirely, mentioning only RDS. Hallucination: specifies 'AWS, GCP, and Azure' when ground truth generically states 'across clouds.' File path reference appears unverified. No evidence of targeted investigation into actual Foxtrot documentation or ipam-core structure."
      },
      "composite": 35,
      "notFound": false
    },
    {
      "id": "contracts-argo-gen-params-required",
      "category": "contracts",
      "difficulty": "easy",
      "question": "What are the required fields in an argo-gen-params.yaml file for the ArgoCD ApplicationSet generator?",
      "groundTruth": "chart.version (string), namespace (string), and environment (string).",
      "llmAnswer": "Based on the ArgoCD Deployment Flow section in the system architecture documentation, the required fields in an `argo-gen-params.yaml` file are:\n\n- `chart.version`\n- `namespace`\n- `environment`",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 5,
        "completeness": 5,
        "precision": 5,
        "navigation": 4,
        "notes": "AI answer correctly identified all three required fields (chart.version, namespace, environment) with no hallucinations or extraneous information. Navigation score reduced slightly because while the agent cited a specific documentation section, the citation cannot be independently verified in this evaluation context, though the answer itself is factually correct."
      },
      "composite": 95,
      "notFound": false
    },
    {
      "id": "contracts-azure-xrd-naming",
      "category": "contracts",
      "difficulty": "medium",
      "question": "What is the XRD name format and composition name format for Azure AKS in Foxtrot, with examples?",
      "groundTruth": "XRD name format: `{plural}.{group}`, e.g. `reltioaksclusters.foxtrot.reltio.com`. Composition name format: `reltio-aks-cluster-{channel}-{version}`, e.g. `reltio-aks-cluster-prd-v1`.",
      "llmAnswer": "**XRD Name Format:** `{plural}.{group}`\n- Example: `xreltioeksclusters.reltio.crossplane.io`\n\n**Composition Name Format:** `reltio-aks-cluster-{channel}-{version}`\n- Example: `reltio-aks-cluster-dev-v1`\n\n**XRD Kind for Azure AKS:** `xReltioAKSCluster`\n\n**Source:** `compute-common/compositions/`",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 2,
        "completeness": 3,
        "precision": 1,
        "navigation": 2,
        "notes": "AI correctly identified format patterns ({plural}.{group} and reltio-aks-cluster-{channel}-{version}) but provided incorrect examples. XRD example shows wrong domain (reltio.crossplane.io vs foxtrot.reltio.com) and wrong name (xreltioeksclusters vs reltioaksclusters). Composition example uses 'dev' instead of 'prd' channel. Hallucinated XRD Kind (xReltioAKSCluster) and source reference not in ground truth. Suggests partial understanding with significant factual errors."
      },
      "composite": 40,
      "notFound": false
    },
    {
      "id": "contracts-helm-chart-required-values",
      "category": "contracts",
      "difficulty": "medium",
      "question": "What are the five required values that all app Helm charts must define?",
      "groundTruth": "accountID (string), cloudProvider (string), region (string), domain (string), and imageRegistry (string).",
      "llmAnswer": "NOT_FOUND\n\nThe provided documentation does not specify five required values that all app Helm charts must define. While the system architecture document mentions a values merging precedence order (values.yaml → default-properties → common-values → overrides → imagetags-override) and the Helm index lists individual charts with their values key counts, there is no explicit definition of five universal required values across all app Helm charts.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 0,
        "completeness": 0,
        "precision": 5,
        "navigation": 0,
        "notes": "AI agent failed to locate the five required Helm chart values (accountID, cloudProvider, region, domain, imageRegistry). Response correctly avoided hallucination but provided no actionable answer to the question."
      },
      "composite": 25,
      "notFound": true
    },
    {
      "id": "contracts-sync-wave-ordering",
      "category": "contracts",
      "difficulty": "hard",
      "question": "What are the ArgoCD sync wave values and what resource types are deployed at each wave?",
      "groundTruth": "Wave -5: prerequisites. Wave 0: infrastructure. Wave 5: state apps. Wave 10: application services. Wave 15: post-deployment resources (monitoring, backup hooks, finalizers).",
      "llmAnswer": "\n\nBased on the documentation, here are the ArgoCD sync wave values and resource types deployed at each wave:\n\n| Wave | Count | Resources |\n|---|---|---|\n| -1 | 1 | ServiceAccount:pre-sync-app-templates-sa |\n| 0 | 3 | Application, xReltioAKSCluster, ReltioEKSCluster |\n| 1 | 5 | ClusterRole, Application, Workflow, AccessEntry, Addon |\n| 2 | 4 | ClusterRoleBinding, Workflow, ProviderConfig (×2) |\n| 3 | 4 | Workflow, Application, Release (×2) |\n| 4 | 1 | Release |\n| 5 | 1 | Job:pre-sync-app-templates-check-job |\n\nThe sync waves enforce a bootstrapping order where wave -1 creates the service account first, waves 0-2 establish cluster-level RBAC and infrastructure, wave 3 deploys applications and releases, wave 4 finalizes releases, and wave 5 runs validation checks.",
      "filesRead": [
        "reference/system-architecture.md",
        "reference/helm/index.md"
      ],
      "plannedFiles": [],
      "score": {
        "accuracy": 1,
        "completeness": 1,
        "precision": 0,
        "navigation": 1,
        "notes": "AI answer provides entirely different wave values (-1, 0-5) versus ground truth (-5, 0, 5, 10, 15). Missing waves 10 and 15 completely. Wave 0 resource types differ significantly. The detailed table suggests the agent found some configuration data, but it's fundamentally misaligned with the expected ArgoCD sync wave structure. Appears to be hallucinated or from wrong source."
      },
      "composite": 15,
      "notFound": false
    }
  ]
}