# Big Indexer — Validation Evidence

> **Public evidence record** — 100 scored runs across 5 repos (validating Python + Go + TypeScript, with core support since expanded to 11 query-backed languages).
> Mirrors the validation page at `website/public/validation.html` (served at `bigindexer.com/validation`).

## Does Big Indexer actually help AI coding assistants?

We ran OpenCode on 5 production open-source repos in three modes and measured four things: **evidence coverage** (recall of architectural facts), **boundary accuracy** (correct seam identification), **actionability** (1–5, does the AI give implementable guidance), and **hallucinations** (incorrect claims).

---

## Three-stage core + three-model replication

Each stage adds a layer. The numbers show what each layer contributed.

| Metric | No BGI (baseline) | BGI MCP | BGI MCP + TWIN | TWIN delta |
|---|---|---|---|---|
| Actionability (1–5) | 4.00 | 4.00 | **4.75** | **+0.75** |
| Evidence coverage | 78.7% | 84.9% | 79.9% / **96%†** | — |
| Boundary accuracy | 0.95 | **1.00** | **1.00** | held |
| Hallucination flags | 0 | 0 | 0 | 0 |
| Median latency | 133.8s | 66.2s | 68.5s | — |

† 79.9% is all-prompt mean for the 20-run TWIN refresh. **96%** is the p04 (safe implementation path) slice across 5 repos — the most actionability-relevant prompt.

**What each stage fixed:**
- **BGI MCP** fixed boundary accuracy (0.95 → 1.00) and halved latency (133.8s → 66.2s)
- **BGI TWIN** fixed actionability (4.00 → 4.75) by surfacing behavioral twins + seam + rubric

### Three-model independent replication (GPT-4o + Gemini auto)

We re-ran the full TWIN prompt pack on two additional models (`azure/gpt-4o` and `gemini/auto`) across all 5 repos (20 runs each).

| Metric | BGI-TWIN (deepseek-v4-flash) | BGI-TWIN (GPT-4o) | BGI-TWIN (Gemini auto) |
|---|---|---|---|
| Actionability (1–5) | 4.75 | **4.85** | 4.25 |
| Evidence coverage (strict) | 79.9% (96% p04) | 47.9% (49.3% p04) | 62.4% |
| Evidence (tag-relaxed)\* | **94.8%** (100% p04) | 59.5% (62.7% p04) | 83.4% |
| Boundary accuracy | 1.00 | **1.00** | 0.95 |
| Hallucination flags | 0 | **0** | **0** |
| Median latency | 68.5s | **41.6s** | 65.8s |

Interpretation: actionability (≥4.25), boundary accuracy (≥0.95), and zero-hallucination held across all three models. Gemini's 0.95 boundary reflects one genuine architectural miss (django/p02 — depth-first on `query.py`, not a scoring calibration error).

\* Tag-relaxed evidence score formula:
`min(100, evidence_coverage_pct + min(25, (unlabeled_repo_anchor_lines / checklist_items) * 100 * 0.15))`

Repo-anchor lines are non-log lines mentioning concrete repo files/modules (e.g. `*.py`, `*.go`, `*.ts`) without explicit `VERIFIED/HYPOTHESIS/UNKNOWN` tags.

#### Evidence-gap interpretation (explicit)

The evidence-coverage gap is real and should be read directly:
- deepseek TWIN refresh (20 runs): **278** explicit `VERIFIED/HYPOTHESIS/UNKNOWN` labels (**13.9/run**)
- Gemini auto (20 runs): **~167** labels (**8–15/run**)
- GPT-4o TWIN replication (20 runs): **139** labels (**6.95/run**)

On the p04 implementation slice, actionability stayed high across all three (deepseek 4.75, GPT-4o 4.85, Gemini 4.25), boundary stayed ≥0.95, and hallucinations stayed zero.

This means the evidence metric is sensitive to explicit tagging style, not only to correctness. deepseek follows the `VERIFIED/HYPOTHESIS/UNKNOWN` protocol most strictly; Gemini is intermediate; GPT-4o gives correct answers with fewer explicit labels. The tag-relaxed second score closes most of this gap. We keep scores unnormalized and publish raw outputs for independent re-scoring.

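
The tag-relaxed formula quoted above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's scoring script: the function name `tag_relaxed_evidence`, the guard against an empty checklist, and the sample inputs are assumptions made here for demonstration and are not taken from any published run.

```python
def tag_relaxed_evidence(evidence_coverage_pct: float,
                         unlabeled_repo_anchor_lines: int,
                         checklist_items: int) -> float:
    """Strict evidence score plus capped credit for unlabeled repo-anchor lines."""
    # Assumption of this sketch: an empty checklist is treated as an error.
    if checklist_items <= 0:
        raise ValueError("checklist_items must be positive")
    # Credit grows with anchor lines per checklist item, capped at 25 points.
    bonus = min(25.0, (unlabeled_repo_anchor_lines / checklist_items) * 100 * 0.15)
    # The combined score never exceeds 100.
    return min(100.0, evidence_coverage_pct + bonus)


# Hypothetical inputs: 50% strict coverage, 4 unlabeled anchor lines, 10 checklist items.
# Bonus is min(25, 0.4 * 100 * 0.15), so the result is roughly 56.
print(tag_relaxed_evidence(50.0, 4, 10))

# Hypothetical inputs where both caps apply: the bonus saturates at 25 and the total at 100.
print(tag_relaxed_evidence(90.0, 20, 10))
```

The two caps explain why the relaxed score can only narrow the gap between models: a run with many untagged file references gains at most 25 points over its strict score.
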
Example from the same prompt (fastapi p03, blast radius of `solve_dependencies`):

```text
deepseek (validation/runs/fastapi/opencode_mcp_p03_twin_refresh_r2.txt)
| Definition spans lines 598–735 ... | VERIFIED |
| get_request_handler ... | VERIFIED |
| get_websocket_app ... | VERIFIED |
VERIFIED: Only 3 call sites exist in the codebase.

GPT-4o (validation/runs/fastapi/opencode_mcp_p03_twin_refresh_gpt4o_r1.txt)
1. VERIFIED: Changes to solve_dependencies will directly impact tests.
2. HYPOTHESIS: Other clusters may have dependent implications.
```

---

## What is BGI-TWIN?

Three deterministic MCP tools that convert architecture context into implementation-ready guidance.

| Tool | What it does |
|---|---|
| `task_fingerprint(task)` | NL task → COV token set (deterministic, no LLM) |
| `behavioral_twins(task)` | Top-3 code units ranked by Jaccard overlap with task fingerprint |
| `twin_context(task)` | Combined: task COV + top twins + seam suggestion + 5-point rubric + confidence gate |

BGI-TWIN is a context compiler. It does not generate code, does not call an LLM, and does not speculate. Every output is derived directly from the indexed graph and fuse artifacts.

---

## Per-repo breakdown

5 repos, 4 slices, 100 scored runs.

| Repo | Mode | Runs | Median Latency | Evidence Cov. | Boundary | Actionability |
|---|---|---|---|---|---|---|
| django/django | Baseline | 4 | 99.8s | 73.3% | 1.00 | 4.0 |
| django/django | BGI MCP | 4 | 73.1s | **84.0%** | 1.00 | 4.0 |
| django/django | **BGI TWIN** | 4 | 60.9s | 75.3% | 1.00 | **5.0** |
| django/django | **BGI TWIN (GPT-4o)** | 4 | 48.3s | 47.0% | 1.00 | **5.0** |
| django/django | **BGI TWIN (Gemini auto)** | 4 | 61.9s | 45.7% | 0.75 | 4.0 |
| tiangolo/fastapi | Baseline | 3 | 131.3s | 93.2% | 1.00 | 4.3 |
| tiangolo/fastapi | BGI MCP | 3 | **54.8s** | 66.7% | 1.00 | 4.3 |
| tiangolo/fastapi | **BGI TWIN** | 4 | 79.5s | 82.0% | 1.00 | **5.0** |
| tiangolo/fastapi | **BGI TWIN (GPT-4o)** | 4 | 31.6s | 45.4% | 1.00 | **5.0** |
| tiangolo/fastapi | **BGI TWIN (Gemini auto)** | 4 | 65.8s | 78.7% | 1.00 | 4.75 |
| pydantic/pydantic-core | Baseline | 4 | 192.2s | 48.6% | 0.75 | 4.0 |
| pydantic/pydantic-core | BGI MCP | 4 | 63.3s | **86.7%** | **1.00** | 4.0 |
| pydantic/pydantic-core | **BGI TWIN** | 4 | **47.5s** | 71.3% | 1.00 | **4.7** |
| pydantic/pydantic-core | **BGI TWIN (GPT-4o)** | 4 | 40.4s | 43.8% | 1.00 | **4.5** |
| pydantic/pydantic-core | **BGI TWIN (Gemini auto)** | 4 | 54.8s | 47.5% | 1.00 | 3.50 |
| prometheus/prometheus | Baseline | 6 | **89.9s** | 90.0% | 1.00 | 4.0 |
| prometheus/prometheus | BGI MCP | 6 | 119.9s | 90.0% | 1.00 | 4.0 |
| prometheus/prometheus | **BGI TWIN** | 4 | 70.0s | 80.8% | 1.00 | **5.0** |
| prometheus/prometheus | **BGI TWIN (GPT-4o)** | 4 | 53.4s | 59.6% | 1.00 | **4.8** |
| prometheus/prometheus | **BGI TWIN (Gemini auto)** | 4 | 69.7s | 71.4% | 1.00 | 4.25 |
| vercel/next.js | Baseline | 3 | 291.8s | 89.2% | 1.00 | 3.7 |
| vercel/next.js | BGI MCP | 3 | **66.4s** | **91.7%** | 1.00 | 3.7 |
| vercel/next.js | **BGI TWIN** | 4 | 88.9s | 63.4% | 1.00 | **4.0** |
| vercel/next.js | **BGI TWIN (GPT-4o)** | 4 | 33.6s | 44.0% | 1.00 | **5.0** |
| vercel/next.js | **BGI TWIN (Gemini auto)** | 4 | 96.2s | 68.5% | 1.00 | 4.75 |

BGI TWIN rows are post-shipment refresh runs (p01–p04, `CallToolRequest` evidence confirmed in every run).

---

## Notable findings

### pydantic-core — the clearest result in the dataset

Baseline p01: evidence **0%**, boundary **0**. The model described a pure-Python architecture. The repo is Python + Rust with a `pyo3` bridge that the baseline model never found.

BGI MCP p01: evidence **80%**, boundary **1.0**. BGI injected the exact `pyo3` boundary and the model identified it correctly on the first attempt.

BGI-TWIN p04: evidence **100%**, actionability **5/5**. The safe-implementation prompt produced a copy-paste-ready patch path with specific file and function references.

### fastapi — honest reporting of a mixed result

Evidence coverage dropped on two fastapi MCP runs (p03: 33.3%, p04: 66.7% vs baseline 90%, 100%). This is real:

The baseline model, with no architecture context, read every source file individually and built a detailed verified-claim table. The MCP model received blast-radius context (1,614 impacted units) and treated that as the full picture — it made fewer granular verifications.

What this reveals: MCP architecture context trades file-reading breadth for boundary accuracy. On well-structured repos with good baseline exploration, the evidence-coverage gain is smaller. Boundary accuracy was perfect (1.0) in all fastapi modes.

BGI-TWIN's refresh recovered evidence to 82% mean with 5/5 actionability because behavioral twins anchor the model to specific files rather than summaries.

Raw outputs: [validation/runs/fastapi/](../validation/runs/fastapi/)

### Prometheus (Go) — cross-language neutral result

Evidence is flat at 90.0% in baseline and MCP modes. MCP is slower (119.9s vs 89.9s) on this Go codebase. BGI-TWIN improved actionability to 5/5 and reduced latency to 70s.

Key insight: MCP accuracy gains are largest when baseline models are architecturally blind. BGI-TWIN's actionability gains hold regardless.

### next.js (TypeScript) — large monorepo signal

Baseline latency 291.8s — BGI reduces it to 66–89s in all modes. Boundary accuracy perfect across all three modes. Actionability improved from 3.7 to 4.0 with BGI-TWIN.

### Hallucination rate: 0 across 100 scored runs

No factually incorrect module or file claim in any baseline, MCP, or TWIN run.

---

## External benchmark vs Louvain

The 100-run study above measures whether BGI improves AI assistant outputs. A separate, complementary question is whether BGI's behavioral edges are stronger than raw import edges as a clustering signal — measured against an external ground truth, with no rubric scoring.

**Setup.** For each repo, we use the maintainers' own top-level package layout as architectural ground truth (e.g. `django/db/`, `django/forms/`, `django/contrib/auth/`). We then compare four file-level clusterings:

- `bgi_native` — BGI's own `clusters` field (file-level majority vote)
- `louvain_imports` — Louvain on the language-native import graph (Python AST or Go imports)
- `louvain_bgi_hard` — Louvain on BGI's HARD edges projected to file level
- `louvain_bgi_all` — Louvain on BGI's HARD + PREDICTED edges projected to file level

Same algorithm (networkx Louvain, seed=42), same ground truth, same metric harness. The two `louvain_bgi_*` methods isolate BGI's edge contribution from any unit-vs-file granularity confound.

**Metrics.** Pairwise precision/recall/F1 on co-clustering decisions, plus MoJoFM (Tzerpos & Holt) — the canonical architecture-recovery similarity score. Higher is better; 1.0 = exact match (label-equivalent).

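
The pairwise part of the metric described above can be sketched as follows. This is an illustrative sketch, not the harness in `scripts/external_benchmark/`: the function name `pairwise_scores`, the toy file names, the restriction to files present in both clusterings, and the convention of returning 0.0 when a denominator is empty are all assumptions made here. MoJoFM is not covered.

```python
from itertools import combinations


def pairwise_scores(predicted: dict, truth: dict) -> tuple:
    """Pairwise precision, recall, and F1 over co-clustering decisions.

    Both arguments map a file path to a cluster label. A file pair counts as
    "co-clustered" in a clustering when both files carry the same label.
    """
    # Assumption: only files known to both clusterings are scored.
    files = sorted(set(predicted) & set(truth))
    tp = fp = fn = 0
    for a, b in combinations(files, 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            tp += 1  # pair grouped together, and it really belongs together
        elif same_pred:
            fp += 1  # pair grouped together, but ground truth separates it
        elif same_true:
            fn += 1  # pair separated, but ground truth groups it
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical files and labels, not from any benchmarked repo).
truth = {"a.py": "db", "b.py": "db", "c.py": "forms", "d.py": "forms"}
predicted = {"a.py": "c0", "b.py": "c0", "c.py": "c0", "d.py": "c1"}

p, r, f1 = pairwise_scores(predicted, truth)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.333 recall=0.500 f1=0.400
```

Because the score is computed over pairs, a clustering made of many tiny clusters makes few co-clustering claims: it can reach high precision while recall stays near zero, which is the pattern to look for when reading the tables below.
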
### Results

**django (Python, 564 files, 31 ground-truth clusters):**

| method | clusters | precision | recall | F1 | MoJoFM |
|---|---:|---:|---:|---:|---:|
| bgi_native | 100 | 0.246 | 0.167 | 0.199 | 0.190 |
| louvain_imports | 17 | 0.258 | 0.337 | 0.292 | 0.335 |
| **louvain_bgi_hard** | 287 | **1.000** | 0.235 | **0.381** | 0.440 |
| **louvain_bgi_all** | 139 | 0.312 | 0.348 | 0.329 | **0.453** |

BGI's edges win on every metric: in each column the best score belongs to one of the two `louvain_bgi_*` variants. HARD-only achieves perfect precision: every file pair BGI's confident edges co-cluster does in fact share a real package. HARD+PREDICTED gives the best recall and MoJoFM by trading some precision for recall.

**prometheus (Go, 665 files, 18 ground-truth clusters):**

| method | clusters | precision | recall | F1 | MoJoFM |
|---|---:|---:|---:|---:|---:|
| bgi_native | 620 | 1.000 | 0.005 | 0.010 | -0.204 |
| **louvain_imports** | 11 | **0.445** | **0.678** | **0.537** | **0.396** |
| louvain_bgi_hard | 663 | 1.000 | 0.000 | 0.000 | -0.290 |
| louvain_bgi_all | 368 | 0.152 | 0.074 | 0.100 | -0.118 |

**gin (Go, 96 files, 7 ground-truth clusters):**

| method | clusters | precision | recall | F1 | MoJoFM |
|---|---:|---:|---:|---:|---:|
| bgi_native | 96 | 0.000 | 0.000 | 0.000 | -0.534 |
| **louvain_imports** | 32 | **0.620** | 0.364 | **0.459** | **0.207** |
| louvain_bgi_hard | 96 | 0.000 | 0.000 | 0.000 | -0.534 |
| louvain_bgi_all | 61 | 0.425 | 0.088 | 0.145 | -0.172 |

Raw imports dominate on both Go repos. Root cause is **structural, not algorithmic**: of BGI's 47,204 HARD edges on prometheus, 47,202 stay inside one file. Token mix is the upstream driver — ~70% of Go emissions are INTAKE/OUTPUT/CONDITIONAL/LOOP, which gate-2 deliberately scopes to same-file to prevent O(N²) noise. Cross-file pair density (FETCH/PERSIST, EMIT/SUBSCRIBE, ROUTE/AUTHENTICATE) is lower on Go than on Python.
We expanded the Go scanner's data-flow patterns (HTTP routing, channel receive→SUBSCRIBE, marshal/unmarshal→TRANSFORM, lock cleanup→TEARDOWN) and re-ran on gin — token coverage improved (249 ROUTE emissions on gin vs 0 on prometheus's older scan), but the spectral-mask interaction means cross-file edges remain sparse on Go relative to Python.

### Honest read

**Where BGI's edges add real value:** tier-1 query-backed languages (`.scm`-based: Python, TypeScript, JavaScript, Go, Rust, Java, C#, PHP, Ruby, Kotlin, Scala). The HARD-only precision of 1.000 on django is the strongest single result here — when BGI promotes an edge to HARD, those two files genuinely belong together. This validates the README's "scope-constrained edge generation" claim on the languages where the scanner produces dense cross-file behavioral edges.

**Where BGI's edges currently underperform:** tier-2 scanner-backed languages (and previously Go, prior to its query-backed upgrade) on cluster-recovery benchmarks. The user-visible MCP product (boundary detection, twin retrieval, AI-assistant context) still works on Go — boundary accuracy 1.0, actionability 4.25–5.0 across all three models in the 100-run study. The cluster-recovery gap is real and documented in the language-tier section of the README.

**What this changes:**
- The README's `DRS cluster` glossary line is softened — `clusters` is honestly described as unit-level grouping, mostly intra-file. File-level architectural components are better expressed via the BGI edge graph + Louvain (or via the fuse-graph boundary signal).
- The language-tier section adds a cross-file edge density caveat.

### Reproduce

Per-repo command, raw outputs, and harness code:

```bash
# Python repo
python3 scripts/external_benchmark/run_repo.py \
  --repo-slug django --repo-root /tmp/bgi-ab-repos/django \
  --package-root django --bgi-graph output/validation/mcp-ab/django/bgi-graph.json \
  --ext py --language python --truth-split contrib \
  --out output/benchmarks/external/django

# Go repo
python3 scripts/external_benchmark/run_repo.py \
  --repo-slug gin --repo-root /tmp/bgi-ab-repos/gin \
  --package-root . --bgi-graph output/benchmarks/external/gin/bgi-graph.json \
  --ext go --language go \
  --out output/benchmarks/external/gin
```

Per-repo `summary.json`, `metrics.csv`, and full `clusterings.json` are committed under `output/benchmarks/external/{django,prometheus,gin}/`. Harness source: `scripts/external_benchmark/`.

---

## Limitations

We publish limitations before readers find them. A reader who discovers a flaw themselves trusts evidence less than one who was told.

**Self-reported scoring.** Checklists were written by us, scored by us. The checklists were defined before scoring by reading actual source code. The full rubric is at [validation/SCORING_RUBRIC.md](../validation/SCORING_RUBRIC.md). Every raw output is public at [validation/runs/](../validation/runs/). Re-score independently and open an issue if you disagree.

**5 repos is not a large sample.** Python + Go + TypeScript (with core support since expanded to 11 first-class query-backed languages) is broader than Python-only, but still limited. The pydantic-core finding stands on its own. Three-model replication (GPT-4o + Gemini auto) is now complete, but we still need additional repos and external replications.

**BGI-TWIN refresh is MCP-only.** No updated baseline was run alongside the refresh. We have no reason to believe the baseline changed, but this is a real experimental design limitation.

**Evidence coverage is style-sensitive across models.** The rubric rewards explicit claim-level `VERIFIED/HYPOTHESIS/UNKNOWN` labeling with citations. deepseek follows this most strictly; Gemini is intermediate; GPT-4o tags less explicitly. All three produce correct answers with high actionability and zero hallucinations.

**One invalid MCP run.** One next.js p04 original A/B run had no `CallToolRequest` evidence and is marked explicitly unscored in `runs.csv`. All 60 TWIN refresh runs (deepseek, GPT-4o, Gemini) have invocation evidence.

**We still need external replication.** We have three-model replication (GPT-4o + Gemini auto), but we still need external teams to run and publish the protocol on their own repos.

---

## Methodology

| Item | Detail |
|---|---|
| Repos | tiangolo/fastapi, django/django, pydantic/pydantic-core, prometheus/prometheus, vercel/next.js |
| CLI | opencode 1.14.41 / gemini CLI (auto) |
| Model | deepseek-v4-flash + azure/gpt-4o + gemini/auto |
| MCP server | `bgi mcp --graph ... --fuse-graph ...` |
| TWIN invocation | `twin_context` explicitly required in prompt; `CallToolRequest` confirmed in every TWIN run |
| Evidence coverage | Recall of architectural facts vs ground-truth checklist (sensitive to explicit label/citation style) |
| Evidence (tag-relaxed, second score) | Primary evidence score + capped credit for unlabeled repo-anchor lines (no reruns required) |
| Boundary accuracy | 0/1 — correct seam identification |
| Actionability | 1–5 rubric: 5 = immediately actionable (copy-paste), 1 = vague |
| Hallucination flags | Count of factually incorrect module/file claims |
| Total scored runs | 100 (20 baseline + 20 MCP + 20 TWIN deepseek + 20 TWIN GPT-4o + 20 TWIN Gemini auto) |
| Full rubric | [validation/SCORING_RUBRIC.md](../validation/SCORING_RUBRIC.md) |
| All run artifacts | [validation/runs/](../validation/runs/) |
| Run log | [validation/runs.csv](../validation/runs.csv) |

---

## Reproduce

```bash
# Install
pip install bigindexer

# Clone any repo and build the index
git clone --depth 1 https://github.com/tiangolo/fastapi
bgi scan fastapi/ --out output/

# Start MCP server (includes BGI-TWIN tools)
bgi mcp --graph output/bgi-graph.json --fuse-graph output/fuse-graph.json

# Run with OpenCode and a local config in the repo dir:
# { "mcp": { "bgi": { "command": "bgi", "args": ["mcp", ...] } } }
opencode
# AI receives architecture summary + behavioral twins + seam + rubric
```

Full setup: [docs/MCP_SETUP.md](../docs/MCP_SETUP.md)

---

*Big Indexer — Architecture-aware context for AI coding assistants.*
*https://bigindexer.com*