# Retrieval Quality Benchmark Benchmark for measuring semantic code search quality of octocode's full pipeline: chunking, embedding, vector search, and reranking. ## Ground Truth `code.csv` contains 127 queries with 1-3 annotated code locations each (100 standard + 27 hard). **Based on commit:** `b1771ba48214ce7404ad7158e277eb60680912b4` **Config used:** [`benchmark/config.toml`](config.toml) — contextual retrieval enabled, Voyage reranker 2.5, RaBitQ quantization ### How ground truth was created 1. **AI-generated candidates** — queries and expected code/doc locations were generated by an LLM with full codebase context, covering all major modules and documentation files 2. **Source verification** — every referenced file and line range was read and validated against the actual source to confirm the code/docs at those lines answers the query 3. **Multi-agent validation** — parallel validation agents independently checked line ranges, file paths, and relevance scores across all 254 queries 4. **Search-informed corrections** — queries that the search missed were analyzed: if the search found a valid alternative location (same logic in a different file), it was added as a secondary result rather than removed 5. **Hard query design** — 27 code queries and ~14 doc queries use natural language that deliberately avoids mirroring function names or section titles, testing semantic understanding over keyword matching Format: ``` query,result1,result2,result3 ``` Each result: `src/path/file.rs:start_line-end_line:relevance` - Relevance `2` = primary (directly answers the query) - Relevance `1` = secondary (related, useful context) Matching uses **line range overlap**: a search result at lines 40-90 matches ground truth 45-92 if ranges intersect. ## Metrics ### Hit@k Binary per query: did ANY correct result appear in the top-k? Averaged across all queries. A Hit@5 of 0.85 means 85% of queries had at least one relevant result in the top 5. ### MRR (Mean Reciprocal Rank) Reciprocal of the rank of the **first** correct result, averaged across queries. | First hit at rank | Score | |-------------------|-------| | 1 | 1.0 | | 2 | 0.5 | | 3 | 0.33 | | 5 | 0.2 | | Not found | 0.0 | Measures how high the first relevant result appears. ### NDCG@10 (Normalized Discounted Cumulative Gain) Accounts for both **relevance grades** (2 vs 1) and **position** in the ranking: ``` DCG@k = sum( rel_i / log2(i + 1) ) for i = 1..k IDCG@k = DCG of the ideal ranking (ground truth sorted by relevance desc) NDCG = DCG / IDCG ``` A relevance=2 result at position 1 contributes more than a relevance=1 result at position 5. Measures whether the **most relevant** results are ranked highest. ### Recall@k Fraction of ground truth entries found in the top-k results. If a query has 3 ground truth blocks and search finds 2 of them in the top 10: ``` Recall@10 = 2/3 = 0.67 ``` Measures completeness: how many relevant blocks did we find? ### Worked Example Ground truth: `fileA:10-50:2, fileB:20-30:1` Search returns: `[fileC:1-10, fileA:30-60, fileB:25-35, ...]` - fileA overlaps (30-60 intersects 10-50) at rank 2, relevance=2 - fileB overlaps (25-35 intersects 20-30) at rank 3, relevance=1 | Metric | Calculation | Score | |-----------|--------------------------------------------------------|-------| | Hit@5 | found a match | 1 | | MRR | first hit at rank 2 = 1/2 | 0.50 | | DCG | 2/log2(3) + 1/log2(4) = 1.26 + 0.50 | 1.76 | | IDCG | 2/log2(2) + 1/log2(3) = 2.00 + 0.63 | 2.63 | | NDCG@10 | 1.76 / 2.63 | 0.67 | | Recall@10 | 2 of 2 ground truth found | 1.00 | ## Ground Truth Files | File | Mode | Queries | What it tests | |------|------|---------|---------------| | `code.csv` | `code` | 127 | Code search: functions, structs, logic blocks | | `docs.csv` | `docs` | 127 | Doc search: markdown sections, config guides, architecture | ## Usage ```bash # Benchmark code search (default) python3 benchmark/score.py --verbose # Benchmark documentation search python3 benchmark/score.py --mode docs --csv benchmark/docs.csv --verbose # With custom settings python3 benchmark/score.py --threshold 0.5 --max-results 10 # Quiet mode (summary only) python3 benchmark/score.py ``` Exit code is 1 if Hit@5 drops below 0.70. ## Coverage The 100 queries cover all major modules: | Area | Queries | |---------------------------|---------| | Code chunking/extraction | 12 | | Markdown processing | 12 | | Contextual enrichment | 5 | | Differential indexing | 3 | | Embedding generation | 5 | | Store data structures | 10 | | Store operations | 10 | | Vector optimizer/table ops| 4 | | Batch converter | 1 | | GraphRAG types | 5 | | GraphRAG relationships | 6 | | GraphRAG database/utils | 7 | | GraphRAG AI/builder | 6 | | MCP server | 5 | | LSP integration | 4 | | LLM client | 2 | | Search/rendering | 2 | | File watcher | 1 | | Config/storage/state | 7 | | **Subtotal (standard)** | **100** | | **Hard queries** | **27** | | **Total** | **127** | ## Hard Queries The last 27 queries (101-127) use **natural language that doesn't mirror code comments or function names**. They test semantic understanding rather than keyword matching: | # | Query intent | Why it's hard | |---|-------------|---------------| | 101 | Preventing infinite loops in overlapping chunks | 5-line guard clause, no code keywords in query | | 102 | Skipping indexing when repo unchanged | Buried 100+ lines into a 900-line function | | 103 | Avoiding duplicate embeddings for unchanged code | Hash-based dedup flow across functions | | 104 | Vector quantization compression ratios | Answer is in doc comments, not executable code | | 105 | Consistent database paths across developers | Intent-based query about identity hashing | | 106 | Minimum declarations before grouping | Specific constant + its usage site (2 lines) | | 107 | Why some symbols are hidden in results | UX behavior question, small helper function | | 108 | AI vs rule-based decision for code analysis | Decision logic, no direct keyword overlap | | 109 | Filtering dissimilar search results | 7-line block within a 60-line method | | 110 | Knowing which files to reindex after git commit | Multi-step git diff logic | | 111 | Cleaning markdown fences from LLM JSON responses | Natural language, function name not in query | | 112 | Behavior when embedding API keeps failing | Failure/retry path, not happy path | | 113 | Skipping files unchanged on disk (mtime) | Performance optimization buried in main loop | | 114 | Preventing duplicate graph nodes during rebuild | Dedup check within batch processing (25 lines) | | 115 | Switching embedding model with different dimensions | Schema migration logic, not obvious location | | 116 | Metadata not saved if flush fails | Atomicity pattern, answer in CRITICAL comment | | 117 | Preventing concurrent reindexing in MCP server | AtomicBool + compare_exchange (concurrency) | | 118 | Deduplicating results from multiple queries | Single function call, concept-level query | | 119 | Non-code files handled as chunked text | Edge case handling, 6 lines in 900-line function | | 120 | Forced flush after removing changed file blocks | Crash safety, explained only in code comment | | 121 | Avoiding redundant table opens | Cache with double-check locking pattern (20 lines) | | 122 | Reranker score to distance conversion | 2-line conversion in a map closure | | 123 | Rough token estimation without tokenizer | Single line: `s.len() / 4` | | 124 | Additional delay before background reindex | 3 lines with timing rationale | | 125 | Cleaning up files recently added to gitignore | 8 lines within cleanup loop | | 126 | Similarity-to-distance threshold conversion | 6 lines, two separate locations | | 127 | GraphRAG build from existing DB when no new files | Decision tree within indexing pipeline | If the standard queries score ~1.0 but hard queries score significantly lower, the benchmark is working correctly and exposes real retrieval weaknesses. ## Variant Matrix (`run_matrix.py`) `score.py` evaluates one configuration. `run_matrix.py` sweeps **retrieval parameters** over a single shared index and writes a comparison table to [`RESULTS.md`](RESULTS.md) (+ `RESULTS.json`). It derives each variant config from `config.toml` by mutating only the relevant keys — no hand-maintained strict config — and accumulates results across runs (so a focused `ONLY=` run merges into the existing table). **Pin the corpus first** — ground truth is annotated against commit `b1771ba`, so indexing any other checkout drifts the line ranges: ```bash git worktree add /home/box/bench_corpus b1771ba # Full local matrix (fastembed, no API key): vector-only vs hybrid weight sweep CORPUS=/home/box/bench_corpus \ OCTO_BIN=./target/release/octocode \ python3 benchmark/run_matrix.py # Add a local cross-encoder reranker (still no key) — reuses the index CORPUS=/home/box/bench_corpus OCTO_BIN=./target/release/octocode \ SKIP_INDEX=1 ONLY=rerank RERANK_MODEL=fastembed:bge-reranker-base \ python3 benchmark/run_matrix.py ``` | Env | Purpose | |-----|---------| | `CORPUS` | (required) pinned checkout to index + search | | `OCTO_BIN` | octocode binary (default: `octocode` on `PATH`) | | `SKIP_INDEX=1` | reuse the existing index (only search-time params changed) | | `ONLY=` | run only variants whose name matches (e.g. `rerank`) | | `RERANK_MODEL` | reranker model; `fastembed:bge-reranker-base` is local/no-key | | `CODE_MODEL` / `TEXT_MODEL` | override embedding models | | `VOYAGE_API_KEY` etc. | if set, enables paid-reranker variants automatically | Variants: `vector_only`, `hybrid_70_30` (default weights), `hybrid_30_70` (keyword-tilted), `+graph` (GraphRAG file-level expansion — only moves results when a reranker re-scores the enlarged set), `+rerank` (cross-encoder).