# Retrieval Quality Benchmark

Benchmark for measuring semantic code search quality of octocode's full pipeline: chunking, embedding, vector search, and reranking.

## Ground Truth

`code.csv` contains 127 queries with 1-3 annotated code locations each (100 standard + 27 hard).

**Based on commit:** `b1771ba48214ce7404ad7158e277eb60680912b4`
**Config used:** [`benchmark/config.toml`](config.toml) — contextual retrieval enabled, Voyage reranker 2.5, RaBitQ quantization

### How ground truth was created

1. **AI-generated candidates** — queries and expected code/doc locations were generated by an LLM with full codebase context, covering all major modules and documentation files
2. **Source verification** — every referenced file and line range was read and validated against the actual source to confirm the code/docs at those lines answers the query
3. **Multi-agent validation** — parallel validation agents independently checked line ranges, file paths, and relevance scores across all 254 queries
4. **Search-informed corrections** — queries that the search missed were analyzed: if the search found a valid alternative location (same logic in a different file), it was added as a secondary result rather than removed
5. **Hard query design** — 27 code queries and ~14 doc queries use natural language that deliberately avoids mirroring function names or section titles, testing semantic understanding over keyword matching

Format:
```
query,result1,result2,result3
```

Each result: `src/path/file.rs:start_line-end_line:relevance`
- Relevance `2` = primary (directly answers the query)
- Relevance `1` = secondary (related, useful context)

Matching uses **line range overlap**: a search result matches a ground truth entry when both point at the same file and their line ranges intersect. For example, a result at lines 40-90 matches ground truth 45-92.
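The format and matching rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `score.py`: the names `Location`, `parse_location`, and `overlaps` are hypothetical, and treating both line bounds as inclusive is an assumption.

```python
from typing import NamedTuple


class Location(NamedTuple):
    path: str
    start: int
    end: int
    relevance: int  # 2 = primary, 1 = secondary


def parse_location(cell: str) -> Location:
    """Parse 'path:start-end:relevance'.

    Splitting from the right keeps any ':' inside the path intact.
    """
    path, line_range, relevance = cell.rsplit(":", 2)
    start, end = line_range.split("-")
    return Location(path, int(start), int(end), int(relevance))


def overlaps(truth: Location, path: str, start: int, end: int) -> bool:
    """True when the result is in the same file and the line ranges intersect.

    Both ranges are treated as inclusive (an assumption of this sketch).
    """
    return truth.path == path and start <= truth.end and truth.start <= end


truth = parse_location("src/path/file.rs:45-92:2")
print(overlaps(truth, "src/path/file.rs", 40, 90))   # True
print(overlaps(truth, "src/path/file.rs", 93, 120))  # False
```

When reading the CSV itself, Python's `csv` module is a safer choice than splitting on commas, since a query may contain commas inside quotes.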

## Metrics

### Hit@k

Binary per query: did ANY correct result appear in the top-k? Averaged across all queries.

A Hit@5 of 0.85 means 85% of queries had at least one relevant result in the top 5.
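As a minimal sketch of this definition (not the `score.py` implementation), assume each query has already been reduced to the 1-based rank of its first correct result, or `None` when nothing matched. The function name `hit_at_k` is hypothetical.

```python
def hit_at_k(first_hit_ranks, k):
    """Fraction of queries whose first correct result is within the top k.

    first_hit_ranks: one entry per query, the 1-based rank of the first
    correct result, or None if no result matched.
    """
    hits = sum(1 for rank in first_hit_ranks if rank is not None and rank <= k)
    return hits / len(first_hit_ranks)


# Five queries: three hit within the top 5, one hits at rank 7, one misses.
ranks = [1, 2, None, 7, 3]
print(hit_at_k(ranks, 5))   # 0.6
print(hit_at_k(ranks, 10))  # 0.8
```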

### MRR (Mean Reciprocal Rank)

Reciprocal of the rank of the **first** correct result, averaged across queries.

| First hit at rank | Score |
|-------------------|-------|
| 1                 | 1.0   |
| 2                 | 0.5   |
| 3                 | 0.33  |
| 5                 | 0.2   |
| Not found         | 0.0   |

Measures how high the first relevant result appears.
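The table translates directly into code. The sketch below is illustrative only and uses the same hypothetical input as a per-query list of first-hit ranks, with `None` for queries where nothing matched.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """Average of 1/rank over all queries; a miss (None) contributes 0."""
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)


# Ranks 1, 2, 5 and one miss: (1.0 + 0.5 + 0.2 + 0.0) / 4
print(round(mean_reciprocal_rank([1, 2, 5, None]), 3))  # 0.425
```

Note that misses stay in the denominator, so a query with no correct result pulls the average down rather than being ignored.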

### NDCG@10 (Normalized Discounted Cumulative Gain)

Accounts for both **relevance grades** (2 vs 1) and **position** in the ranking:

```
DCG@k  = sum( rel_i / log2(i + 1) )   for i = 1..k
IDCG@k = DCG of the ideal ranking (ground truth sorted by relevance desc)
NDCG   = DCG / IDCG
```

A relevance=2 result at position 1 contributes more than a relevance=1 result at position 5. Measures whether the **most relevant** results are ranked highest.
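A minimal Python sketch of the formula follows. It assumes each returned result has already been mapped to a relevance grade (0 when it matches no ground truth entry) and that each ground truth entry is credited at most once; how `score.py` handles several results overlapping the same entry is not specified here, so treat that as an assumption. The names `dcg` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), i starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_relevances, truth_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: grade of each returned result in rank order (0 = no match).
    truth_relevances:  grades of all ground truth entries for the query.
    """
    ideal = sorted(truth_relevances, reverse=True)[:k]
    idcg = dcg(ideal)
    if idcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / idcg


# Miss at rank 1, relevance 2 at rank 2, relevance 1 at rank 3.
print(round(ndcg_at_k([0, 2, 1], [2, 1]), 2))  # 0.67
# The ideal ordering scores exactly 1.0.
print(ndcg_at_k([2, 1], [2, 1]))               # 1.0
```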

### Recall@k

Fraction of ground truth entries found in the top-k results.

If a query has 3 ground truth blocks and search finds 2 of them in the top 10:
```
Recall@10 = 2/3 = 0.67
```

Measures completeness: how many relevant blocks did we find?
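One detail worth making explicit is that recall counts distinct ground truth entries, so two results that overlap the same entry count once. The sketch below illustrates this under the assumption of inclusive line ranges and same-file matching; the function name and the tuple layout are hypothetical, not taken from `score.py`.

```python
def recall_at_k(results, truth, k=10):
    """Fraction of ground truth entries matched by at least one top-k result.

    results: ranked list of (path, start_line, end_line) tuples.
    truth:   list of (path, start_line, end_line) ground truth entries.
    """
    found = set()
    for path, start, end in results[:k]:
        for idx, (t_path, t_start, t_end) in enumerate(truth):
            # Same file and intersecting inclusive ranges.
            if path == t_path and start <= t_end and t_start <= end:
                found.add(idx)
    return len(found) / len(truth) if truth else 0.0


truth = [("a.rs", 10, 50), ("b.rs", 20, 30), ("c.rs", 1, 5)]
# Two results hit the same a.rs entry, one hits b.rs, c.rs is never found.
results = [("a.rs", 30, 60), ("a.rs", 5, 15), ("b.rs", 25, 35)]
print(round(recall_at_k(results, truth), 2))  # 0.67
```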

### Worked Example

Ground truth: `fileA:10-50:2, fileB:20-30:1`
Search returns: `[fileC:1-10, fileA:30-60, fileB:25-35, ...]`

- fileA overlaps (30-60 intersects 10-50) at rank 2, relevance=2
- fileB overlaps (25-35 intersects 20-30) at rank 3, relevance=1

| Metric    | Calculation                         | Score |
|-----------|-------------------------------------|-------|
| Hit@5     | found a match in the top 5          | 1     |
| MRR       | first hit at rank 2 = 1/2           | 0.50  |
| DCG@10    | 2/log2(3) + 1/log2(4) = 1.26 + 0.50 | 1.76  |
| IDCG@10   | 2/log2(2) + 1/log2(3) = 2.00 + 0.63 | 2.63  |
| NDCG@10   | 1.76 / 2.63                         | 0.67  |
| Recall@10 | 2 of 2 ground truth entries found   | 1.00  |

## Ground Truth Files

| File | Mode | Queries | What it tests |
|------|------|---------|---------------|
| `code.csv` | `code` | 127 | Code search: functions, structs, logic blocks |
| `docs.csv` | `docs` | 127 | Doc search: markdown sections, config guides, architecture |

## Usage

```bash
# Benchmark code search (default)
python3 benchmark/score.py --verbose

# Benchmark documentation search
python3 benchmark/score.py --mode docs --csv benchmark/docs.csv --verbose

# With custom settings
python3 benchmark/score.py --threshold 0.5 --max-results 10

# Quiet mode (summary only)
python3 benchmark/score.py
```

Exit code is 1 if Hit@5 drops below 0.70.

## Coverage

The 100 standard queries in `code.csv` cover all major modules; the 27 hard queries are counted separately:

| Area                      | Queries |
|---------------------------|---------|
| Code chunking/extraction  | 12      |
| Markdown processing       | 12      |
| Contextual enrichment     | 5       |
| Differential indexing     | 3       |
| Embedding generation      | 5       |
| Store data structures     | 10      |
| Store operations          | 10      |
| Vector optimizer/table ops| 4       |
| Batch converter           | 1       |
| GraphRAG types            | 5       |
| GraphRAG relationships    | 6       |
| GraphRAG database/utils   | 7       |
| GraphRAG AI/builder       | 6       |
| MCP server                | 5       |
| LSP integration           | 4       |
| LLM client                | 2       |
| Search/rendering          | 2       |
| File watcher              | 1       |
| Config/storage/state      | 7       |
| **Subtotal (standard)**   | **100** |
| **Hard queries**          | **27**  |
| **Total**                 | **127** |

## Hard Queries

The last 27 queries (101-127) use **natural language that doesn't mirror code comments or function names**. They test semantic understanding rather than keyword matching:

| # | Query intent | Why it's hard |
|---|-------------|---------------|
| 101 | Preventing infinite loops in overlapping chunks | 5-line guard clause, no code keywords in query |
| 102 | Skipping indexing when repo unchanged | Buried 100+ lines into a 900-line function |
| 103 | Avoiding duplicate embeddings for unchanged code | Hash-based dedup flow across functions |
| 104 | Vector quantization compression ratios | Answer is in doc comments, not executable code |
| 105 | Consistent database paths across developers | Intent-based query about identity hashing |
| 106 | Minimum declarations before grouping | Specific constant + its usage site (2 lines) |
| 107 | Why some symbols are hidden in results | UX behavior question, small helper function |
| 108 | AI vs rule-based decision for code analysis | Decision logic, no direct keyword overlap |
| 109 | Filtering dissimilar search results | 7-line block within a 60-line method |
| 110 | Knowing which files to reindex after git commit | Multi-step git diff logic |
| 111 | Cleaning markdown fences from LLM JSON responses | Natural language, function name not in query |
| 112 | Behavior when embedding API keeps failing | Failure/retry path, not happy path |
| 113 | Skipping files unchanged on disk (mtime) | Performance optimization buried in main loop |
| 114 | Preventing duplicate graph nodes during rebuild | Dedup check within batch processing (25 lines) |
| 115 | Switching embedding model with different dimensions | Schema migration logic, not obvious location |
| 116 | Metadata not saved if flush fails | Atomicity pattern, answer in CRITICAL comment |
| 117 | Preventing concurrent reindexing in MCP server | AtomicBool + compare_exchange (concurrency) |
| 118 | Deduplicating results from multiple queries | Single function call, concept-level query |
| 119 | Non-code files handled as chunked text | Edge case handling, 6 lines in 900-line function |
| 120 | Forced flush after removing changed file blocks | Crash safety, explained only in code comment |
| 121 | Avoiding redundant table opens | Cache with double-check locking pattern (20 lines) |
| 122 | Reranker score to distance conversion | 2-line conversion in a map closure |
| 123 | Rough token estimation without tokenizer | Single line: `s.len() / 4` |
| 124 | Additional delay before background reindex | 3 lines with timing rationale |
| 125 | Cleaning up files recently added to gitignore | 8 lines within cleanup loop |
| 126 | Similarity-to-distance threshold conversion | 6 lines, two separate locations |
| 127 | GraphRAG build from existing DB when no new files | Decision tree within indexing pipeline |

If the standard queries score ~1.0 but hard queries score significantly lower, the benchmark is working correctly and exposes real retrieval weaknesses.

## Variant Matrix (`run_matrix.py`)

`score.py` evaluates one configuration. `run_matrix.py` sweeps **retrieval
parameters** over a single shared index and writes a comparison table to
[`RESULTS.md`](RESULTS.md) (+ `RESULTS.json`). It derives each variant config
from `config.toml` by mutating only the relevant keys — no hand-maintained
strict config — and accumulates results across runs (so a focused `ONLY=` run
merges into the existing table).

**Pin the corpus first** — ground truth is annotated against commit `b1771ba`,
so indexing any other checkout drifts the line ranges:

```bash
git worktree add /home/box/bench_corpus b1771ba

# Full local matrix (fastembed, no API key): vector-only vs hybrid weight sweep
CORPUS=/home/box/bench_corpus \
  OCTO_BIN=./target/release/octocode \
  python3 benchmark/run_matrix.py

# Add a local cross-encoder reranker (still no key) — reuses the index
CORPUS=/home/box/bench_corpus OCTO_BIN=./target/release/octocode \
  SKIP_INDEX=1 ONLY=rerank RERANK_MODEL=fastembed:bge-reranker-base \
  python3 benchmark/run_matrix.py
```

| Env | Purpose |
|-----|---------|
| `CORPUS` | (required) pinned checkout to index + search |
| `OCTO_BIN` | octocode binary (default: `octocode` on `PATH`) |
| `SKIP_INDEX=1` | reuse the existing index (only search-time params changed) |
| `ONLY=<substr>` | run only variants whose name matches (e.g. `rerank`) |
| `RERANK_MODEL` | reranker model; `fastembed:bge-reranker-base` is local/no-key |
| `CODE_MODEL` / `TEXT_MODEL` | override embedding models |
| `VOYAGE_API_KEY` etc. | if set, enables paid-reranker variants automatically |

Variants: `vector_only`, `hybrid_70_30` (default weights), `hybrid_30_70`
(keyword-tilted), `+graph` (GraphRAG file-level expansion — only moves results
when a reranker re-scores the enlarged set), `+rerank` (cross-encoder).