# llama-server reranker (local): Qwen3-Reranker, self-hosted ZE, any ZE-wire-shape provider

[`llama-server`](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md) is the HTTP wrapper that ships with llama.cpp. With `--reranking`, it exposes an OpenAI-style `POST /v1/rerank` endpoint that returns `{results: [{index, relevance_score}]}`, exactly the wire shape gbrain already drives for ZeroEntropy's hosted reranker. The `llama-server-reranker` recipe (added in v0.40.6.1) routes `gateway.rerank()` at your local llama.cpp instance instead of ZE.

Two flavors of "local" this recipe covers:

- **Qwen3-Reranker** (0.6B / 4B / 8B): open-weight cross-encoder; pull the GGUF from HuggingFace and serve.
- **Self-hosted ZeroEntropy** (`zerank-2`, `zerank-1-small`): the weights are on HuggingFace too. GGUF-convert them and serve them the same way. **Quality is not guaranteed to match ZE-hosted:** GGUF conversion + quantization + pooling/rank metadata + tokenizer special tokens all affect scores. If you self-host ZE for production retrieval, pin your own brain-relevant eval ([docs/eval-bench.md](../eval-bench.md)) as a regression guard.

This recipe is the path override + recipe shape. Any provider whose request/response wire matches ZE/llama.cpp can use it by just pointing at a different base URL. Providers whose wire shape differs (Voyage uses `top_k` not `top_n`, returns `data[]` not `results[]`) need a separate recipe with adapter hooks; that lands in a follow-up plan.

## Setup

### 1. Build llama.cpp (or download a release)

```bash
# Clone and build (CPU only; add `-DGGML_CUDA=ON` for GPU)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```

Pin a specific commit when you ship: `llama-server`'s path aliases (`/rerank`, `/v1/rerank`, `/reranking`, `/v1/reranking`) have shifted across releases. The recipe sends to `/v1/rerank`.

### 2. Pull a reranker GGUF

For Qwen3-Reranker-4B (quantized Q4_K_M is the sweet spot for CPU):

```bash
# Pick a quant level; Q4_K_M is the usual CPU sweet spot.
huggingface-cli download \
  Qwen/Qwen3-Reranker-4B-GGUF qwen3-reranker-4b-q4_k_m.gguf \
  --local-dir ./models
```

For self-hosted ZeroEntropy weights, find a community GGUF conversion or convert from the HuggingFace weights yourself (out of scope of this doc; see llama.cpp's `convert_hf_to_gguf.py`).

### 3. Launch llama-server with --reranking AND --alias

```bash
./build/bin/llama-server \
  --model ./models/qwen3-reranker-4b-q4_k_m.gguf \
  --alias qwen3-reranker-4b \
  --reranking \
  --port 8081
```

The `--alias` matters: without it, llama-server's `/v1/models` (and the `model` field rerank requests echo) defaults to the full gguf file path, which makes the gbrain config string ugly and brittle. With `--alias qwen3-reranker-4b`, your config string is short and stable.

`--reranking` and `--embeddings` are mutually exclusive at server launch. If you also run a local embedder via the [`llama-server`](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md) recipe, run two separate llama-server processes on two different ports (typically 8080 for embeddings, 8081 for reranking; gbrain's defaults match that convention).

### 4. Wire gbrain at your server

```bash
# Point gbrain at the llama.cpp host (skip if running locally on default port)
gbrain config set provider_base_urls.llama-server-reranker http://your-host:8081/v1

# Tell search to use this reranker
gbrain config set search.reranker.model llama-server-reranker:qwen3-reranker-4b
gbrain config set search.reranker.enabled true
```

The `qwen3-reranker-4b` after the colon is your `--alias` value from step 3. Any string works as long as it matches your server's alias.

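
Before relying on gbrain's own checks, it can help to hit the server directly and confirm that the alias and wire shape are what the config expects. The sketch below is a rough smoke test, not part of gbrain: the helper names (`model_string`, `check_alias`, `smoke_rerank`) are hypothetical, the `http://localhost:8081/v1` fallback is an assumed default, and the alias check is a crude substring match on the `/v1/models` body.

```bash
# Base URL of the llama-server instance; the localhost fallback is an assumption.
BASE_URL="${LLAMA_SERVER_RERANKER_BASE_URL:-http://localhost:8081/v1}"

# Hypothetical helper: build the "<recipe>:<alias>" string used in step 4.
model_string() {
  printf 'llama-server-reranker:%s\n' "$1"
}

# Hypothetical helper: succeed if the alias appears in the /v1/models body.
# Crude substring match; a JSON-aware check would be stricter.
check_alias() {
  curl -fsS "$BASE_URL/models" | grep -q "\"$1\""
}

# Hypothetical helper: send a two-document rerank request and print the raw
# response, which should contain results[] entries with index and
# relevance_score if the server is in reranking mode.
smoke_rerank() {
  curl -fsS "$BASE_URL/rerank" \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "'"$1"'",
      "query": "what is a cross-encoder?",
      "top_n": 2,
      "documents": [
        "A cross-encoder scores a query and a document jointly.",
        "Bananas are yellow."
      ]
    }'
}

# Example (needs a running server):
#   check_alias qwen3-reranker-4b && smoke_rerank qwen3-reranker-4b
#   gbrain config set search.reranker.model "$(model_string qwen3-reranker-4b)"
```

If `check_alias` fails while the server is up, the most likely cause is a missing or different `--alias` at launch.
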
Env vars work too as an alternative to the config set above:

```bash
export LLAMA_SERVER_RERANKER_BASE_URL=http://your-host:8081/v1
# Optional: if you front llama-server with nginx + bearer auth
export LLAMA_SERVER_RERANKER_API_KEY=your-bearer-token
```

### 5. Verify

```bash
gbrain models doctor
# Expect: ✔ reranker_config llama-server-reranker:qwen3-reranker-4b ok
#         ✔ reranker_config llama-server-reranker:qwen3-reranker-4b ok (reachability)

gbrain search "some query" --json | jq '.[].rerank_score'
# Expect: rerank_score on every row
```

If `gbrain models doctor` reports the reachability probe as `network` status, two common causes:

1. The server is reachable but in embedding mode, not reranking mode. `--reranking` and `--embeddings` are mutually exclusive at launch; relaunch the right one.
2. The recipe path doesn't match what your llama.cpp version serves. This recipe sends `/v1/rerank`; older llama.cpp installs may only serve `/rerank`. Pin to a recent llama.cpp commit.

## Cold-start headroom

CPU-only first-call warmup on a 4B reranker can take 8-15 seconds. The recipe declares `default_timeout_ms: 30000` so the first call after a server restart doesn't fail-open silently. That value flows through search-mode resolution unless you override it:

```bash
# Tighten or loosen per-search timeout (overrides recipe default):
gbrain config set search.reranker.timeout_ms 60000
```

Per-call overrides in `SearchOpts.reranker_timeout_ms` still win for any single call.

## Budget caps + local rerank

The recipe declares `cost_per_1m_tokens_usd: 0` and registers under `FREE_LOCAL_RERANK_PROVIDERS` in the budget tracker, so `--max-cost`-bounded callers (autopilot loops, batch jobs) do NOT hard-fail when configured for local rerank. Local rerank costs electricity, not API tokens.

```bash
GBRAIN_MAX_USD=0.01 gbrain search "..." --reranker llama-server-reranker:qwen3-reranker-4b
# Works: rerank fires, recorded at $0, cumulative cap untouched.
```

## Fail-open contract preserved

`applyReranker` in `src/core/search/rerank.ts` still has the fail-open posture: any error class (network, timeout, malformed response) logs to `~/.gbrain/audit/rerank-failures-*.jsonl` and returns the original RRF order unchanged. Search reliability beats reranker quality. If your llama.cpp host goes down, your searches keep working; they just stop ranking against the cross-encoder until you restart the server.
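
To make the fail-open idea concrete outside gbrain, here is a minimal shell sketch of the same pattern: try the reranker, and on any failure return the input order untouched. It is an illustration only, not how `applyReranker` is implemented. The function name `rerank_or_passthrough` and the `RERANK_*` variables are hypothetical, it assumes `curl` and `jq` (1.6 or newer, for `$ARGS.positional`) are installed, and it writes the failure notice to stderr instead of an audit file.

```bash
# Illustrative settings; none of these names come from gbrain.
RERANK_URL="${RERANK_URL:-http://localhost:8081/v1/rerank}"
RERANK_MODEL="${RERANK_MODEL:-qwen3-reranker-4b}"
RERANK_TIMEOUT_S="${RERANK_TIMEOUT_S:-30}"   # mirrors the 30000 ms recipe default

# Usage: rerank_or_passthrough <query> <doc>...
# Prints one document per line: reranked on success, original order on any failure.
rerank_or_passthrough() {
  local query="$1"; shift
  local payload response ranked
  if payload=$(jq -n --arg m "$RERANK_MODEL" --arg q "$query" \
        '{model: $m, query: $q, documents: $ARGS.positional}' \
        --args "$@" 2>/dev/null) &&
     response=$(curl -fsS --max-time "$RERANK_TIMEOUT_S" \
        -H 'Content-Type: application/json' \
        -d "$payload" "$RERANK_URL" 2>/dev/null) &&
     ranked=$(printf '%s' "$response" | jq -r --argjson p "$payload" \
        '.results | sort_by(-.relevance_score) | .[] | $p.documents[.index]' \
        2>/dev/null) &&
     [ -n "$ranked" ]; then
    # Success: emit documents sorted by descending relevance_score.
    printf '%s\n' "$ranked"
  else
    # Network error, timeout, or malformed response: keep the caller's order.
    echo "rerank failed; keeping original order" >&2
    printf '%s\n' "$@"
  fi
}
```

Calling `rerank_or_passthrough "some query" "doc a" "doc b"` while the server is stopped prints the two documents in the order given, which is the behavior the fail-open contract describes.
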