# Big Indexer — Validation Evidence

> **Public evidence record** — 100 scored runs across 5 repos (validating Python + Go + TypeScript, with core support since expanded to 11 query-backed languages).
> Mirrors the validation page at `website/public/validation.html` (served at `bigindexer.com/validation`).

## Does Big Indexer actually help AI coding assistants?

We ran OpenCode on 5 production open-source repos in three modes (baseline, BGI MCP, BGI MCP + TWIN) and measured four things:
**evidence coverage** (recall of architectural facts), **boundary accuracy** (correct seam identification),
**actionability** (1–5, does the AI give implementable guidance), and **hallucinations** (incorrect claims).

---

## Three-stage core + three-model replication

Each stage adds a layer. The numbers show what each layer contributed.

| Metric | No BGI (baseline) | BGI MCP | BGI MCP + TWIN | TWIN delta |
|---|---|---|---|---|
| Actionability (1–5) | 4.00 | 4.00 | **4.75** | **+0.75** |
| Evidence coverage | 78.7% | 84.9% | 79.9% / **96%†** | — |
| Boundary accuracy | 0.95 | **1.00** | **1.00** | held |
| Hallucination flags | 0 | 0 | 0 | 0 |
| Median latency | 133.8s | 66.2s | 68.5s | — |

† 79.9% is the all-prompt mean for the 20-run TWIN refresh. **96%** is the p04 (safe implementation path) slice across 5 repos, the most actionability-relevant prompt.

**What each stage fixed:**
- **BGI MCP** fixed boundary accuracy (0.95 → 1.00) and halved latency (133.8s → 66.2s)
- **BGI TWIN** fixed actionability (4.00 → 4.75) by surfacing behavioral twins + seam + rubric

### Three-model independent replication (GPT-4o + Gemini auto)

We re-ran the full TWIN prompt pack on two additional models (`azure/gpt-4o` and `gemini/auto`) across all 5 repos (20 runs each).

| Metric | BGI-TWIN (deepseek-v4-flash) | BGI-TWIN (GPT-4o) | BGI-TWIN (Gemini auto) |
|---|---|---|---|
| Actionability (1–5) | 4.75 | **4.85** | 4.25 |
| Evidence coverage (strict) | 79.9% (96% p04) | 47.9% (49.3% p04) | 62.4% |
| Evidence (tag-relaxed)\* | **94.8%** (100% p04) | 59.5% (62.7% p04) | 83.4% |
| Boundary accuracy | 1.00 | **1.00** | 0.95 |
| Hallucination flags | 0 | **0** | **0** |
| Median latency | 68.5s | **41.6s** | 65.8s |

Interpretation: actionability (≥4.25), boundary accuracy (≥0.95), and zero-hallucination held across all three models.
Gemini's 0.95 boundary reflects one genuine architectural miss (django/p02 — depth-first on `query.py`, not a scoring calibration error).
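
Because boundary accuracy is scored 0/1 per run, a single miss is enough to produce both Gemini figures (0.95 overall, 0.75 on django). A minimal sketch of that arithmetic, assuming the aggregate is a plain mean over runs; the function name is illustrative and not part of the harness:

```python
def boundary_accuracy(scores):
    """Mean of per-run 0/1 boundary scores (assumed aggregation)."""
    if not scores:
        raise ValueError("need at least one scored run")
    return sum(scores) / len(scores)


# Gemini auto: 20 TWIN runs with one miss (django p02).
gemini_all = [1] * 19 + [0]
# The four django runs (p01..p04) with the same single miss on p02.
gemini_django = [1, 0, 1, 1]

print(boundary_accuracy(gemini_all))     # 0.95
print(boundary_accuracy(gemini_django))  # 0.75
```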

\* Tag-relaxed evidence score formula:  
`min(100, evidence_coverage_pct + min(25, (unlabeled_repo_anchor_lines / checklist_items) * 100 * 0.15))`  
Repo-anchor lines are non-log lines mentioning concrete repo files/modules (e.g. `*.py`, `*.go`, `*.ts`) without explicit `VERIFIED/HYPOTHESIS/UNKNOWN` tags.
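
The formula translates directly into a small function. This is a sketch of the published expression only; the function name and the example inputs are illustrative and are not taken from any scored run:

```python
def tag_relaxed_score(evidence_coverage_pct, unlabeled_repo_anchor_lines, checklist_items):
    """Strict evidence score plus capped credit for unlabeled repo-anchor lines."""
    if checklist_items <= 0:
        raise ValueError("checklist_items must be positive")
    # Credit is 15% of the anchor-line ratio (as a percentage), capped at 25 points.
    bonus = min(25.0, (unlabeled_repo_anchor_lines / checklist_items) * 100 * 0.15)
    # The combined score never exceeds 100.
    return min(100.0, evidence_coverage_pct + bonus)


# Illustrative inputs (hypothetical, not real run data):
print(tag_relaxed_score(60.0, 3, 10))   # bonus 4.5 on top of 60
print(tag_relaxed_score(50.0, 20, 10))  # raw bonus 30 is capped at 25
print(tag_relaxed_score(90.0, 20, 10))  # total is capped at 100
```

The two caps mean unlabeled anchors can add at most 25 points, so a run with no explicit labels at all cannot score above 25 on the relaxed metric.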

#### Evidence-gap interpretation (explicit)

The evidence-coverage gap is real and should be read directly:
- deepseek TWIN refresh (20 runs): **278** explicit `VERIFIED/HYPOTHESIS/UNKNOWN` labels (**13.9/run**)
- Gemini auto (20 runs): **~167** labels (**8–15/run**)
- GPT-4o TWIN replication (20 runs): **139** labels (**6.95/run**)

Despite the label gap, actionability stayed high across all three models (all-prompt means: deepseek 4.75, GPT-4o 4.85, Gemini 4.25), boundary accuracy stayed ≥0.95, and hallucinations stayed at zero.

This means the evidence metric is sensitive to explicit tagging style, not only to correctness. deepseek follows the `VERIFIED/HYPOTHESIS/UNKNOWN` protocol most strictly; Gemini is intermediate; GPT-4o gives correct answers with fewer explicit labels. The tag-relaxed second score raises all three models but narrows the gap only for Gemini (from 17.5 to 11.4 points behind deepseek); GPT-4o remains about 35 points behind. We keep scores unnormalized and publish raw outputs for independent re-scoring.

Example from the same prompt (fastapi p03, blast radius of `solve_dependencies`):

```text
deepseek (validation/runs/fastapi/opencode_mcp_p03_twin_refresh_r2.txt)
| Definition spans lines 598–735 ... | VERIFIED |
| get_request_handler ...             | VERIFIED |
| get_websocket_app ...               | VERIFIED |
VERIFIED: Only 3 call sites exist in the codebase.

GPT-4o (validation/runs/fastapi/opencode_mcp_p03_twin_refresh_gpt4o_r1.txt)
1. VERIFIED: Changes to solve_dependencies will directly impact tests.
2. HYPOTHESIS: Other clusters may have dependent implications.
```
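
Label counts like the ones above can be approximated with a simple scan of the raw output. The sketch below assumes a label is any whole-word occurrence of `VERIFIED`, `HYPOTHESIS`, or `UNKNOWN`; the rubric's exact counting rule may differ, and `count_labels` is a hypothetical helper, not part of the harness:

```python
import re
from collections import Counter

# Assumption: every whole-word occurrence counts as one explicit label.
LABEL_RE = re.compile(r"\b(VERIFIED|HYPOTHESIS|UNKNOWN)\b")


def count_labels(text):
    """Count explicit evidence labels in one run's output."""
    return Counter(LABEL_RE.findall(text))


# The two excerpts quoted above, copied as-is.
deepseek_excerpt = """\
| Definition spans lines 598–735 ... | VERIFIED |
| get_request_handler ...             | VERIFIED |
| get_websocket_app ...               | VERIFIED |
VERIFIED: Only 3 call sites exist in the codebase.
"""

gpt4o_excerpt = """\
1. VERIFIED: Changes to solve_dependencies will directly impact tests.
2. HYPOTHESIS: Other clusters may have dependent implications.
"""

print(count_labels(deepseek_excerpt))
print(count_labels(gpt4o_excerpt))
```

On these excerpts the deepseek output carries four `VERIFIED` labels and the GPT-4o output carries one `VERIFIED` and one `HYPOTHESIS`.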

---

## What is BGI-TWIN?

Three deterministic MCP tools that convert architecture context into implementation-ready guidance.

| Tool | What it does |
|---|---|
| `task_fingerprint(task)` | NL task → COV token set (deterministic, no LLM) |
| `behavioral_twins(task)` | Top-3 code units ranked by Jaccard overlap with task fingerprint |
| `twin_context(task)` | Combined: task COV + top twins + seam suggestion + 5-point rubric + confidence gate |

BGI-TWIN is a context compiler. It does not generate code, does not call an LLM, and does not speculate. Every output is derived directly from the indexed graph and fuse artifacts.
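
To make the ranking step concrete, here is a minimal sketch of top-3 selection by Jaccard overlap between token sets. It is not BGI's implementation: the unit names, the token sets, and the alphabetical tie-break are all hypothetical, and only the Jaccard measure and the top-3 cut come from the table above.

```python
def jaccard(a, b):
    """Jaccard overlap |a ∩ b| / |a ∪ b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_twins(task_tokens, units, k=3):
    """Rank code units by Jaccard overlap with the task fingerprint."""
    scored = [(jaccard(task_tokens, tokens), name) for name, tokens in units.items()]
    # Highest overlap first; ties broken by name so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:k]]


# Hypothetical fingerprint and indexed units (illustrative only).
task = {"ROUTE", "AUTHENTICATE", "FETCH"}
units = {
    "handlers.login": {"ROUTE", "AUTHENTICATE", "FETCH", "PERSIST"},
    "handlers.profile": {"ROUTE", "FETCH"},
    "jobs.export": {"FETCH", "TRANSFORM", "PERSIST"},
    "events.bus": {"EMIT", "SUBSCRIBE"},
}

for name, score in top_twins(task, units):
    print(f"{name}: {score:.2f}")
```

Because the score depends only on set membership, the same task and index always yield the same ranking, which matches the deterministic, no-LLM property described above.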

---

## Per-repo breakdown

5 repos, 4 prompt slices (p01–p04), 100 scored runs.

| Repo | Mode | Runs | Median Latency | Evidence Cov. | Boundary | Actionability |
|---|---|---|---|---|---|---|
| django/django | Baseline | 4 | 99.8s | 73.3% | 1.00 | 4.0 |
| django/django | BGI MCP | 4 | 73.1s | **84.0%** | 1.00 | 4.0 |
| django/django | **BGI TWIN** | 4 | 60.9s | 75.3% | 1.00 | **5.0** |
| django/django | **BGI TWIN (GPT-4o)** | 4 | 48.3s | 47.0% | 1.00 | **5.0** |
| django/django | **BGI TWIN (Gemini auto)** | 4 | 61.9s | 45.7% | 0.75 | 4.0 |
| tiangolo/fastapi | Baseline | 3 | 131.3s | 93.2% | 1.00 | 4.3 |
| tiangolo/fastapi | BGI MCP | 3 | **54.8s** | 66.7% | 1.00 | 4.3 |
| tiangolo/fastapi | **BGI TWIN** | 4 | 79.5s | 82.0% | 1.00 | **5.0** |
| tiangolo/fastapi | **BGI TWIN (GPT-4o)** | 4 | 31.6s | 45.4% | 1.00 | **5.0** |
| tiangolo/fastapi | **BGI TWIN (Gemini auto)** | 4 | 65.8s | 78.7% | 1.00 | 4.75 |
| pydantic/pydantic-core | Baseline | 4 | 192.2s | 48.6% | 0.75 | 4.0 |
| pydantic/pydantic-core | BGI MCP | 4 | 63.3s | **86.7%** | **1.00** | 4.0 |
| pydantic/pydantic-core | **BGI TWIN** | 4 | **47.5s** | 71.3% | 1.00 | **4.7** |
| pydantic/pydantic-core | **BGI TWIN (GPT-4o)** | 4 | 40.4s | 43.8% | 1.00 | **4.5** |
| pydantic/pydantic-core | **BGI TWIN (Gemini auto)** | 4 | 54.8s | 47.5% | 1.00 | 3.50 |
| prometheus/prometheus | Baseline | 6 | **89.9s** | 90.0% | 1.00 | 4.0 |
| prometheus/prometheus | BGI MCP | 6 | 119.9s | 90.0% | 1.00 | 4.0 |
| prometheus/prometheus | **BGI TWIN** | 4 | 70.0s | 80.8% | 1.00 | **5.0** |
| prometheus/prometheus | **BGI TWIN (GPT-4o)** | 4 | 53.4s | 59.6% | 1.00 | **4.8** |
| prometheus/prometheus | **BGI TWIN (Gemini auto)** | 4 | 69.7s | 71.4% | 1.00 | 4.25 |
| vercel/next.js | Baseline | 3 | 291.8s | 89.2% | 1.00 | 3.7 |
| vercel/next.js | BGI MCP | 3 | **66.4s** | **91.7%** | 1.00 | 3.7 |
| vercel/next.js | **BGI TWIN** | 4 | 88.9s | 63.4% | 1.00 | **4.0** |
| vercel/next.js | **BGI TWIN (GPT-4o)** | 4 | 33.6s | 44.0% | 1.00 | **5.0** |
| vercel/next.js | **BGI TWIN (Gemini auto)** | 4 | 96.2s | 68.5% | 1.00 | 4.75 |

BGI TWIN rows are post-shipment refresh runs (p01–p04, `CallToolRequest` evidence confirmed in every run).
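
The per-repo rows can be tied back to the summary table. The sketch below assumes the summary evidence-coverage figures for the baseline and BGI MCP columns are run-weighted means of the per-repo rows. The aggregation method is not stated here, but this assumption reproduces both published values to one decimal. The helper name is illustrative.

```python
def run_weighted_mean(rows):
    """rows: (runs, value) pairs, one per repo; returns the run-weighted mean."""
    rows = list(rows)
    total_runs = sum(runs for runs, _ in rows)
    if total_runs == 0:
        raise ValueError("no runs to average")
    return sum(runs * value for runs, value in rows) / total_runs


# (runs, evidence coverage %) copied from the table above, in the order
# django, fastapi, pydantic-core, prometheus, next.js.
baseline = [(4, 73.3), (3, 93.2), (4, 48.6), (6, 90.0), (3, 89.2)]
bgi_mcp = [(4, 84.0), (3, 66.7), (4, 86.7), (6, 90.0), (3, 91.7)]

print(round(run_weighted_mean(baseline), 1))  # 78.7
print(round(run_weighted_mean(bgi_mcp), 1))   # 84.9
```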

---

## Notable findings

### pydantic-core — the clearest result in the dataset

Baseline p01: evidence **0%**, boundary **0**. The model described a pure-Python architecture. The repo is Python + Rust with a `pyo3` bridge that the baseline model never found.

BGI MCP p01: evidence **80%**, boundary **1.0**. BGI injected the exact `pyo3` boundary and the model identified it correctly on the first attempt.

BGI-TWIN p04: evidence **100%**, actionability **5/5**. The safe-implementation prompt produced a copy-paste-ready patch path with specific file and function references.

### fastapi — honest reporting of a mixed result

Evidence coverage dropped on two fastapi MCP runs (p03: 33.3%, p04: 66.7% vs baseline 90%, 100%). This is real:

The baseline model, with no architecture context, read every source file individually and built a detailed verified-claim table. The MCP model received blast-radius context (1,614 impacted units) and treated that as the full picture — it made fewer granular verifications.

What this reveals: MCP architecture context trades file-reading breadth for boundary accuracy. On well-structured repos with good baseline exploration, the evidence-coverage gain shrinks and, as here, can turn negative (93.2% → 66.7% mean). Boundary accuracy was perfect (1.0) in all fastapi modes. BGI-TWIN's refresh partially recovered evidence to 82% mean (still below the 93.2% baseline) with 5/5 actionability because behavioral twins anchor the model to specific files rather than summaries.

Raw outputs: [validation/runs/fastapi/](../validation/runs/fastapi/)

### Prometheus (Go) — cross-language neutral result

Evidence is flat at 90.0% in baseline and MCP modes. MCP is slower (119.9s vs 89.9s) on this Go codebase. BGI-TWIN improved actionability to 5/5 and reduced latency to 70s.

Key insight: MCP accuracy gains are largest when baseline models are architecturally blind. BGI-TWIN's actionability gains hold regardless.

### next.js (TypeScript) — large monorepo signal

Baseline median latency is 291.8s; BGI MCP cuts it to 66.4s and BGI TWIN (deepseek) to 88.9s. Boundary accuracy is perfect across all three modes. Actionability improved from 3.7 to 4.0 with BGI-TWIN.

### Hallucination rate: 0 across 100 scored runs

No factually incorrect module or file claim was flagged in any baseline, MCP, or TWIN run.

---

## External benchmark vs Louvain

The 100-run study above measures whether BGI improves AI assistant outputs. A separate, complementary question is whether BGI's behavioral edges are stronger than raw import edges as a clustering signal — measured against an external ground truth, with no rubric scoring.

**Setup.** For each repo, we use the maintainers' own top-level package layout as architectural ground truth (e.g. `django/db/`, `django/forms/`, `django/contrib/auth/`). We then compare four file-level clusterings:

- `bgi_native` — BGI's own `clusters` field (file-level majority vote)
- `louvain_imports` — Louvain on the language-native import graph (Python AST or Go imports)
- `louvain_bgi_hard` — Louvain on BGI's HARD edges projected to file level
- `louvain_bgi_all` — Louvain on BGI's HARD + PREDICTED edges projected to file level
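
The "file-level majority vote" used for `bgi_native` can be sketched as follows. The input shapes (`unit_to_file`, `unit_to_cluster`) and the tie-break rule are assumptions for illustration; the actual harness lives in `scripts/external_benchmark/` and may differ.

```python
from collections import Counter, defaultdict


def file_level_clusters(unit_to_file, unit_to_cluster):
    """Assign each file the cluster label held by most of its units."""
    votes = defaultdict(Counter)
    for unit, path in unit_to_file.items():
        if unit in unit_to_cluster:
            votes[path][unit_to_cluster[unit]] += 1
    # Assumed tie-break: highest vote count, then the smallest label.
    return {
        path: min(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))[0]
        for path, counts in votes.items()
    }


# Hypothetical units and cluster labels.
unit_to_file = {"u1": "a.py", "u2": "a.py", "u3": "a.py", "u4": "b.py"}
unit_to_cluster = {"u1": "c1", "u2": "c1", "u3": "c2", "u4": "c2"}

print(file_level_clusters(unit_to_file, unit_to_cluster))
```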

Same algorithm (networkx Louvain, seed=42), same ground truth, same metric harness. The two `louvain_bgi_*` methods isolate BGI's edge contribution from any unit-vs-file granularity confound.
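
Projecting unit-level edges to file level is the step that removes the granularity confound. A minimal sketch, assuming edges are `(source_unit, target_unit)` pairs and that edges within a single file are dropped and counted; the data shapes and names are hypothetical, not the harness's own:

```python
from collections import Counter


def project_to_files(unit_edges, unit_to_file):
    """Collapse unit-level edges into weighted, undirected file-level edges.

    Returns (file_edges, intra_file_count). Edges whose endpoints live in
    the same file carry no cross-file signal, so they are only counted.
    """
    file_edges = Counter()
    intra_file = 0
    for src, dst in unit_edges:
        file_a, file_b = unit_to_file[src], unit_to_file[dst]
        if file_a == file_b:
            intra_file += 1
            continue
        # Sort the pair so (a, b) and (b, a) land on the same key.
        file_edges[tuple(sorted((file_a, file_b)))] += 1
    return file_edges, intra_file


# Hypothetical units and edges.
unit_to_file = {"u1": "a.py", "u2": "a.py", "u3": "b.py", "u4": "c.py"}
unit_edges = [("u1", "u2"), ("u1", "u3"), ("u3", "u2"), ("u3", "u4")]

edges, intra = project_to_files(unit_edges, unit_to_file)
print(dict(edges), intra)
```

The weighted file pairs can then be loaded into a `networkx.Graph` and clustered with `networkx.community.louvain_communities(G, weight="weight", seed=42)` in recent networkx releases. When nearly all edges are intra-file, the projected graph is almost empty and Louvain has little to group.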

**Metrics.** Pairwise precision/recall/F1 on co-clustering decisions, plus MoJoFM (Tzerpos & Holt) — the canonical architecture-recovery similarity score. Higher is better; 1.0 = exact match (label-equivalent).
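
The pairwise metrics can be sketched in a few lines: a pair of files is a positive if both files land in the same cluster. The code below is an illustration, not the harness itself. Returning 0.0 when there are no pairs is an assumed convention, and MoJoFM is not implemented here.

```python
from itertools import combinations


def co_clustered_pairs(assignment):
    """Set of unordered item pairs that share a cluster label."""
    by_cluster = {}
    for item, label in assignment.items():
        by_cluster.setdefault(label, []).append(item)
    pairs = set()
    for members in by_cluster.values():
        pairs.update(frozenset(pair) for pair in combinations(members, 2))
    return pairs


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall, and F1 over files present in both maps."""
    common = predicted.keys() & truth.keys()
    pred_pairs = co_clustered_pairs({k: predicted[k] for k in common})
    true_pairs = co_clustered_pairs({k: truth[k] for k in common})
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical four-file example.
truth = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
predicted = {"a": 1, "b": 1, "c": 1, "d": 2}

print(pairwise_prf(predicted, truth))  # precision 1/3, recall 1/2, F1 0.4
```

In this example the prediction co-clusters three pairs, of which only (a, b) is correct, so over-merging lowers precision while splitting the true pair (c, d) lowers recall.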

### Results

**django (Python, 564 files, 31 ground-truth clusters):**

| method | clusters | precision | recall | F1 | MoJoFM |
|---|---:|---:|---:|---:|---:|
| bgi_native | 100 | 0.246 | 0.167 | 0.199 | 0.190 |
| louvain_imports | 17 | 0.258 | 0.337 | 0.292 | 0.335 |
| **louvain_bgi_hard** | 287 | **1.000** | 0.235 | **0.381** | 0.440 |
| **louvain_bgi_all** | 139 | 0.312 | 0.348 | 0.329 | **0.453** |

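On every metric the top score comes from one of the two BGI-edge variants, though neither wins all four: HARD-only trails raw imports on recall (0.235 vs 0.337). HARD-only achieves perfect precision: every file pair that BGI's confident edges co-cluster does share a real package. HARD+PREDICTED gives the best recall and MoJoFM by trading precision for recall.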

**prometheus (Go, 665 files, 18 ground-truth clusters):**

| method | clusters | precision | recall | F1 | MoJoFM |
|---|---:|---:|---:|---:|---:|
| bgi_native | 620 | 1.000 | 0.005 | 0.010 | -0.204 |
| **louvain_imports** | 11 | **0.445** | **0.678** | **0.537** | **0.396** |
| louvain_bgi_hard | 663 | 1.000 | 0.000 | 0.000 | -0.290 |
| louvain_bgi_all | 368 | 0.152 | 0.074 | 0.100 | -0.118 |

**gin (Go, 96 files, 7 ground-truth clusters):**

| method | clusters | precision | recall | F1 | MoJoFM |
|---|---:|---:|---:|---:|---:|
| bgi_native | 96 | 0.000 | 0.000 | 0.000 | -0.534 |
| **louvain_imports** | 32 | **0.620** | 0.364 | **0.459** | **0.207** |
| louvain_bgi_hard | 96 | 0.000 | 0.000 | 0.000 | -0.534 |
| louvain_bgi_all | 61 | 0.425 | 0.088 | 0.145 | -0.172 |

Raw imports dominate on both Go repos. Root cause is **structural, not algorithmic**: of BGI's 47,204 HARD edges on prometheus, 47,202 stay inside one file. Token mix is the upstream driver — ~70% of Go emissions are INTAKE/OUTPUT/CONDITIONAL/LOOP, which gate-2 deliberately scopes to same-file to prevent O(N²) noise. Cross-file pair density (FETCH/PERSIST, EMIT/SUBSCRIBE, ROUTE/AUTHENTICATE) is lower on Go than on Python. We expanded the Go scanner's data-flow patterns (HTTP routing, channel receive→SUBSCRIBE, marshal/unmarshal→TRANSFORM, lock cleanup→TEARDOWN) and re-ran on gin — token coverage improved (249 ROUTE emissions on gin vs 0 on prometheus's older scan), but the spectral-mask interaction means cross-file edges remain sparse on Go relative to Python.

### Honest read

**Where BGI's edges add real value:** tier-1 query-backed languages (`.scm`-based: Python, TypeScript, JavaScript, Go, Rust, Java, C#, PHP, Ruby, Kotlin, Scala). The HARD-only precision of 1.000 on django is the strongest single result here — when BGI promotes an edge to HARD, those two files genuinely belong together. This validates the README's "scope-constrained edge generation" claim on the languages where the scanner produces dense cross-file behavioral edges.

**Where BGI's edges currently underperform:** tier-2 scanner-backed languages (and previously Go, prior to its query-backed upgrade) on cluster-recovery benchmarks. The user-visible MCP product (boundary detection, twin retrieval, AI-assistant context) still works on Go — boundary accuracy 1.0, actionability 4.25–5.0 across all three models in the 100-run study. The cluster-recovery gap is real and documented in the language-tier section of the README.

**What this changes:**
- The README's `DRS cluster` glossary line is softened — `clusters` is honestly described as unit-level grouping, mostly intra-file. File-level architectural components are better expressed via the BGI edge graph + Louvain (or via the fuse-graph boundary signal).
- The language-tier section adds a cross-file edge density caveat.

### Reproduce

Per-repo command, raw outputs, and harness code:

```bash
# Python repo
python3 scripts/external_benchmark/run_repo.py \
  --repo-slug django --repo-root /tmp/bgi-ab-repos/django \
  --package-root django --bgi-graph output/validation/mcp-ab/django/bgi-graph.json \
  --ext py --language python --truth-split contrib \
  --out output/benchmarks/external/django

# Go repo
python3 scripts/external_benchmark/run_repo.py \
  --repo-slug gin --repo-root /tmp/bgi-ab-repos/gin \
  --package-root . --bgi-graph output/benchmarks/external/gin/bgi-graph.json \
  --ext go --language go \
  --out output/benchmarks/external/gin
```

Per-repo `summary.json`, `metrics.csv`, and full `clusterings.json` are committed under `output/benchmarks/external/{django,prometheus,gin}/`. Harness source: `scripts/external_benchmark/`.

---

## Limitations

We publish limitations before readers find them. A reader who discovers a flaw themselves trusts evidence less than one who was told.

**Self-reported scoring.** Checklists were written by us, scored by us. The checklists were defined before scoring by reading actual source code. The full rubric is at [validation/SCORING_RUBRIC.md](../validation/SCORING_RUBRIC.md). Every raw output is public at [validation/runs/](../validation/runs/). Re-score independently and open an issue if you disagree.

**5 repos is not a large sample.** Python + Go + TypeScript (with core support since expanded to 11 first-class query-backed languages) is broader than Python-only, but still limited. The pydantic-core finding stands on its own. Three-model replication (GPT-4o + Gemini auto) is now complete, but the repo sample itself still needs to grow.

**BGI-TWIN refresh is MCP-only.** No updated baseline was run alongside the refresh. We have no reason to believe the baseline changed, but this is a real experimental design limitation.

**Evidence coverage is style-sensitive across models.** The rubric rewards explicit claim-level `VERIFIED/HYPOTHESIS/UNKNOWN` labeling with citations. deepseek follows this most strictly; Gemini is intermediate; GPT-4o tags less explicitly. All three produce correct answers with high actionability and zero hallucinations.

**One invalid MCP run.** One next.js p04 original A/B run had no `CallToolRequest` evidence and is marked explicitly unscored in `runs.csv`. All 60 TWIN refresh runs (deepseek, GPT-4o, Gemini) have invocation evidence.

**We still need external replication.** We have three-model replication (GPT-4o + Gemini auto), but we still need external teams to run and publish the protocol on their own repos.

---

## Methodology

| Item | Detail |
|---|---|
| Repos | tiangolo/fastapi, django/django, pydantic/pydantic-core, prometheus/prometheus, vercel/next.js |
| CLI | opencode 1.14.41 / gemini CLI (auto) |
| Models | deepseek-v4-flash + azure/gpt-4o + gemini/auto |
| MCP server | `bgi mcp --graph ... --fuse-graph ...` |
| TWIN invocation | `twin_context` explicitly required in prompt; `CallToolRequest` confirmed in every TWIN run |
| Evidence coverage | Recall of architectural facts vs ground-truth checklist (sensitive to explicit label/citation style) |
| Evidence (tag-relaxed, second score) | Primary evidence score + capped credit for unlabeled repo-anchor lines (no reruns required) |
| Boundary accuracy | 0/1 — correct seam identification |
| Actionability | 1–5 rubric: 5 = immediately actionable (copy-paste), 1 = vague |
| Hallucination flags | Count of factually incorrect module/file claims |
| Total scored runs | 100 (20 baseline + 20 MCP + 20 TWIN deepseek + 20 TWIN GPT-4o + 20 TWIN Gemini auto) |
| Full rubric | [validation/SCORING_RUBRIC.md](../validation/SCORING_RUBRIC.md) |
| All run artifacts | [validation/runs/](../validation/runs/) |
| Run log | [validation/runs.csv](../validation/runs.csv) |

---

## Reproduce

```bash
# Install
pip install bigindexer

# Clone any repo and build the index
git clone --depth 1 https://github.com/tiangolo/fastapi
bgi scan fastapi/ --out output/

# Start MCP server (includes BGI-TWIN tools)
bgi mcp --graph output/bgi-graph.json --fuse-graph output/fuse-graph.json

# Run with OpenCode and a local config in the repo dir:
# { "mcp": { "bgi": { "command": "bgi", "args": ["mcp", ...] } } }
opencode  # AI receives architecture summary + behavioral twins + seam + rubric
```

Full setup: [docs/MCP_SETUP.md](../docs/MCP_SETUP.md)

---

*Big Indexer — Architecture-aware context for AI coding assistants.*  
*https://bigindexer.com*