# Reranker Leaderboard

Evaluation of 12 reranking models using LLM-as-judge pairwise comparisons across 6 datasets.

## Leaderboard

| Rank | Model | ELO | Win Rate |
|------|-------|-----|----------|
| 1 | Zerank 2 | 1638 | 57% |
| 2 | Cohere Rerank 4 Pro | 1629 | 58% |
| 3 | Zerank 1 | 1573 | 57% |
| 4 | Voyage AI Rerank 2.5 | 1544 | 58% |
| 5 | Zerank 1 Small | 1539 | 55% |
| 6 | Voyage AI Rerank 2.5 Lite | 1520 | 53% |
| 7 | Cohere Rerank 4 Fast | 1510 | 50% |
| 8 | Qwen3 Reranker 8B | 1473 | 51% |
| 9 | Contextual AI Rerank v2 Instruct | 1469 | 42% |
| 10 | Cohere Rerank 3.5 | 1451 | 41% |
| 11 | BAAI/BGE Reranker v2 M3 | 1327 | 29% |
| 12 | Jina Reranker v2 Base Multilingual | 1327 | 28% |


## Datasets

- MS MARCO (web search)
- ArguAna (argument mining)
- FiQA (financial Q&A)
- Business Reports
- Paul Graham Essays
- DBPedia (entity retrieval)

## Methodology

1. Embed documents with BAAI/bge-small-en-v1.5
2. Retrieve top-50 candidates using FAISS
3. Rerank the 50 candidates with each model and keep the top 15
4. Generate pairwise judgments using GPT-5
5. Calculate ELO ratings from pairwise comparisons
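
As a rough illustration of step 5, the sketch below turns a list of pairwise judgments into ELO ratings and win rates. It is not the code in `compare-rerankers.py`. The function names, the starting rating of 1500, the K-factor of 32, and the choice to count a tie as half a win are all assumptions made for the example, and the actual pipeline may use different values or a different fitting procedure.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_judgments(judgments, k=32.0, initial=1500.0):
    """Compute ratings and win rates from pairwise judgments.

    judgments: iterable of (model_a, model_b, score_a), where score_a is
    1.0 if the judge preferred A, 0.0 if it preferred B, 0.5 for a tie.
    k and initial are assumed values, not taken from the pipeline.
    """
    ratings = defaultdict(lambda: initial)
    wins = defaultdict(float)
    games = defaultdict(int)

    for model_a, model_b, score_a in judgments:
        # Sequential update: the result depends on judgment order.
        delta = k * (score_a - expected_score(ratings[model_a], ratings[model_b]))
        ratings[model_a] += delta
        ratings[model_b] -= delta

        # Ties count as half a win for each side (an assumption).
        wins[model_a] += score_a
        wins[model_b] += 1.0 - score_a
        games[model_a] += 1
        games[model_b] += 1

    win_rate = {model: wins[model] / games[model] for model in games}
    return dict(ratings), win_rate


if __name__ == "__main__":
    # Hypothetical model names and judgments, for illustration only.
    sample = [
        ("reranker-x", "reranker-y", 1.0),
        ("reranker-y", "reranker-z", 0.5),
        ("reranker-z", "reranker-x", 0.0),
    ]
    ratings, win_rate = elo_from_judgments(sample)
    for model in sorted(ratings, key=ratings.get, reverse=True):
        print(f"{model}: {ratings[model]:.0f} ({win_rate[model]:.0%})")
```

Because each update moves the two ratings by equal and opposite amounts, the average rating stays at the starting value. Sequential updates are order-dependent, so shuffling the judgments and averaging over several passes is one common way to stabilize the ranking.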

## Usage

See [eval-pipeline/ADD_NEW_RERANKER_GUIDE.md](eval-pipeline/ADD_NEW_RERANKER_GUIDE.md) for instructions on adding new rerankers.

## Project Structure

```
eval-pipeline/
├── config.yaml              Configuration
├── pipeline/                Evaluation pipeline
│   └── stages/
│       ├── embed.py         Document embedding
│       ├── retrieve.py      FAISS retrieval
│       ├── rerank.py        Reranker integrations
│       └── llm-judge.py     LLM-as-judge evaluation
├── add-reranker.py          Add new reranker
├── compare-rerankers.py     Compare all rerankers (ELO)
└── aggregate-all-results.py Aggregate cross-dataset results
```