# Retrieval benchmarks (R0)

## How to read these numbers

The system under test is an MCP (Model Context Protocol) memory server: a local service that stores text snippets, called memories, for an AI assistant and finds them again by search. Every benchmark on this page asks the same question: when you search, does the right memory come back, and how high in the result list?

A gold set is a fixed list of test questions where the correct answer (the exact memory or conversation session that should be found) is known in advance. Scoring is mechanical: run each query, then check whether that item came back and at what rank.

precision@1 is the share of questions where the correct item was the top result. recall@5 is the share where it appeared anywhere in the top 5: a recall@5 of 90% means that out of every 10 questions, the right memory was in the top 5 results for 9 of them. Other suites write the same idea as R@10, recall_any@5, or hit@5; only the cutoff changes. MRR (mean reciprocal rank) tracks how high the correct item ranks on average: a first-place hit counts 1, second place 1/2, third 1/3, and a miss 0, so values near 1.0 mean the right answer is usually at the top.

Most tables compare two modes. The base mode is hybrid search, which combines vector similarity with keyword matching. The rerank mode adds a cross-encoder reranker: a second model that re-sorts the top results. It is slower but usually more accurate; one suite below (ConvoMem) shows a case where it hurts.

Every number on this page can be reproduced on your own machine with the commands listed in each section. Nothing is self-reported from a cloud API: the embedding model, the index, and the data all run and stay local, at **$0 per token** and **zero network egress**.

The first harness, R0, is the project's own reproducible, fully-local retrieval-quality benchmark, run with the real embedding model and the real production handlers.
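The scoring definitions above reduce to a few lines of arithmetic. The sketch below is a minimal TypeScript illustration, not the project's own metrics code: the `GoldRank` type and both function names are hypothetical, and it assumes each query has exactly one gold item whose 1-based rank (or `null` for a miss) is already known.

```typescript
// Hypothetical input shape: for each query, the 1-based rank at which the
// gold item came back, or null if it was not returned at all.
type GoldRank = number | null;

// Share of queries whose gold item appears in the top k results.
// With k = 1 this is precision@1; with k = 5 it is recall@5 / hit@5.
function hitRateAtK(ranks: GoldRank[], k: number): number {
  if (ranks.length === 0) return 0;
  const hits = ranks.filter((r) => r !== null && r <= k).length;
  return hits / ranks.length;
}

// Mean reciprocal rank: 1/rank per query, with a miss counting as 0.
function meanReciprocalRank(ranks: GoldRank[]): number {
  if (ranks.length === 0) return 0;
  const total = ranks.reduce<number>(
    (sum, r) => sum + (r === null ? 0 : 1 / r),
    0,
  );
  return total / ranks.length;
}

// Five example queries: two first-place hits, one at rank 2, one at rank 5,
// and one miss.
const ranks: GoldRank[] = [1, 2, null, 1, 5];
console.log(hitRateAtK(ranks, 1)); // 2 of 5 queries
console.log(hitRateAtK(ranks, 5)); // 4 of 5 queries
console.log(meanReciprocalRank(ranks)); // (1 + 1/2 + 0 + 1 + 1/5) / 5
```

Under the single-gold-item assumption, recall@k and hit@k are the same number, which is why the suites on this page can use the names interchangeably.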
The later sections run four public benchmarks (LongMemEval-S, LOCOMO, ConvoMem, MemBench) through the same production code.

## What it measures

The harness builds a fixed gold-set corpus ("Helios", a 24-memory B2B billing SaaS knowledge base) and a set of natural-language queries with a single known best hit each. It stores every memory through the real `memory_store` handler and runs `memory_search` over the gold queries **twice**: once with the hybrid RRF ranker alone, and once with the local cross-encoder reranker as the final stage. It then reports:

| Metric | Meaning |
|---|---|
| `precision_at_1` | Fraction of queries whose gold hit is ranked #1. |
| `precision_at_3` | Fraction whose gold hit is in the top 3. |
| `mrr` | Mean Reciprocal Rank (mean of 1/rank, misses count 0). |
| `store_latency` | avg + p95 ms per `memory_store` call. |
| `search_latency` | avg + p95 ms per `memory_search` call (per mode). |
| `precision_at_1_lift` / `mrr_lift` | Reranker gain over the hybrid baseline. |

The metric math lives in a small dependency-free module, [`scripts/bench/metrics.mjs`](../scripts/bench/metrics.mjs), and is covered by a deterministic unit test (`src/__tests__/bench-metrics.test.ts`), so the formulas can't silently drift.

## How it runs

The harness loads the real model, `TransformersEmbeddingProvider` wrapped in `CachedEmbeddingProvider` (all-MiniLM-L6-v2, 384-dim), and the real tool handlers (`handleStore`, `handleSearch`), not the test mock. There is no build step: a tiny esbuild-based loader ([`scripts/bench/ts-loader.mjs`](../scripts/bench/ts-loader.mjs)) runs the TypeScript sources directly, so the benchmark exercises exactly the code that ships.

```bash
# Default: 24-row in-memory corpus, both ranking modes.
npm run bench

# Latency at scale: duplicate the corpus to ~1000 rows (24 * 42).
BENCH_CORPUS_SCALE=42 npm run bench

# On-disk SQLite instead of :memory: (clears the file first).
BENCH_DB=/tmp/bench.db npm run bench
```

Output is machine-readable JSON on stdout (model load logs go to stderr), so it pipes straight into `jq` or a file:

```bash
npm run bench --silent > bench-result.json
```

At `BENCH_CORPUS_SCALE > 1` the corpus is duplicated with a distinct content suffix so every row is a unique vector. This loads the index for latency measurement without polluting the gold set; only the first, unsuffixed copy is graded.

## Results

Run on the maintainer's local machine (Apple Silicon, CPU-only inference, all-MiniLM-L6-v2 embedder + ms-marco-MiniLM-L-6-v2 reranker). Re-run with `npm run bench` to reproduce; numbers vary slightly with hardware.

### Quality (24-row gold set, 16 queries)

| Mode | precision@1 | precision@3 | MRR | search avg / p95 (ms) |
|---|---|---|---|---|
| Hybrid (RRF) | 0.563 | 0.750 | 0.704 | ~3 / ~4 |
| + Cross-encoder rerank | **0.813** | **0.875** | **0.867** | ~120 / ~230 |
| **Reranker lift** | **+0.250** | **+0.125** | **+0.163** | n/a |

The reranker gives a large per-query precision win on the weak 384-dim MiniLM base, consistent with Anthropic's published finding that reranking materially reduces retrieval failures. It is still 100% local.

### Latency at scale (GAP 3: measured 1K / 10K / 50K)

Measured by `scripts/battle/verify-scale.mjs` with the real embedder (`Xenova/all-MiniLM-L6-v2`, 384-dim) through the real production `handleStore` / `handleSearch` handlers on a file-backed SQLite DB. Each row is a distinct synthetic engineering-doc sentence with per-row lexical salt, so every vector is unique and the index actually reaches the target size (no dedup folding). The 50K run landed **49,931 vectors** in the `memories_vec` table.

Hardware: Apple Silicon laptop (CPU-only inference). Search latency is over 60 queries (no rerank) / 10 queries (with rerank).

| Vectors | store rows/sec | search p50 / p95 / max (ms, **no rerank**) | search p50 / p95 (ms, **+rerank**) | sub-second @ p95? |
|---|---|---|---|---|
| 1,000 | 96 | 0.9 / 3.7 / 8.5 | 202 / 297 | yes |
| 10,000 | 61 | 4.4 / 9.1 / 26.1 | 199 / 258 | yes |
| 50,000 | 26 | 20.1 / 30.2 / 37.4 | 207 / 223 | yes |

The sqlite-vec KNN + RRF hot path stays sub-second well past the 10K goal: p95 is **9.1 ms at 10K** and still only **30.2 ms at 50K**, roughly 110× and 33× under the 1-second target. The cross-encoder reranker adds a roughly **constant** ~200 ms (it scores a fixed top-50 candidate set, so its cost does not grow with the corpus) and stays comfortably sub-second at every size.

Where it degrades: two O(n) hotspots, both inherent to sqlite-vec's brute-force KNN (no ANN/HNSW index).

1. **Search** scales ~linearly with corpus size (p50 0.9 → 4.4 → 20.1 ms across 1K/10K/50K) because each query's `embedding MATCH ? AND k = ?` is a full linear scan of the vec index. In C this is fast (50K × 384-dim is ~30 ms), so it is nowhere near the sub-second budget, but it is genuinely O(n) and would cross 1 s somewhere in the low-millions of rows.
2. **Store throughput** drops from 96 → 26 rows/sec (1K → 50K) because every store runs **two** brute-force KNN scans on the growing index, plus the embed: the `detectConflicts` dedup scan (`k=10`) and `buildSimilarityEdges`' `findNearDuplicates` scan (`k=7`). Both scans are O(n), so per-store cost rises with the live row count. This is a write-path cost, not the retrieval hot path GAP 3 targets; it is acceptable for a personal/team memory store (tens of thousands of rows) but would dominate a bulk import of hundreds of thousands of rows.

The fix is architectural (an ANN index such as HNSW, or skipping the similarity-edge weave during bulk ingest) rather than a small local patch, so it is documented here rather than changed.

> Regenerate on the target hardware with `node scripts/battle/verify-scale.mjs`
> (knobs: `SCALE_SIZES`, `SCALE_DB`, `SCALE_SEARCH_ITERS`, `SCALE_TIME_BUDGET_MS`).
> A companion run that lets the store path's dedup/supersede logic fold
> near-identical synthetic rows collapsed 50K store *calls* into a ~16.4K-vector
> index and showed the same sub-second profile (p95 11.9 ms no-rerank); that is,
> the conflict resolver also caps index growth under repetitive writes.

## Framing vs. competitors

Every memory-layer competitor (mem0 / Zep / Letta / Cognee / Supermemory) and every native memory (ChatGPT / Claude) leads with self-reported, cloud-hosted benchmark numbers and bills per token. This harness is the opposite by construction.

The model, the index, and the data never leave the machine. There is no API and no metering; the cost is CPU time you already own, at $0 per token. The corpus and the runner are committed, so anyone can clone the repo and re-run. The gold set and the misses are printed, not hidden.

Even mid-pack accuracy would be a useful result on those terms: comparable retrieval quality at **0% cloud exposure and $0/token**. The reranker lift above is the kind of measured, not asserted, gain the rest of the roadmap is held to.

## LongMemEval-S (public benchmark)

[LongMemEval](https://github.com/xiaowu0162/LongMemEval) (Wu et al., ICLR 2025, MIT) measures long-term conversational memory: 500 questions, each over a ~115k-token haystack of 38–62 chat sessions. We run the **retrieval** stage, the part a memory server owns, against this server's real production handlers (`handleStore` → `handleSearch`, hybrid RRF, optional local cross-encoder rerank) with the stock embedder (`Xenova/all-MiniLM-L6-v2`, 384-dim). One doc per session (user turns only), per-question isolated corpus, query = the raw question string. **Zero LongMemEval-specific tuning**: production defaults.
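The LongMemEval tables in this section report two different session-level recalls: recall_any@k (at least one evidence session in the top k) and recall_all@k (every evidence session in the top k). The sketch below is a minimal illustration of that difference, assuming each question comes with a ranked list of session ids and a list of evidence session ids; the function names and the sample ids are hypothetical, and this is not the runner's code.

```typescript
// Hypothetical helper: the set of session ids in the first k results.
function topKSet(rankedSessionIds: string[], k: number): Set<string> {
  return new Set(rankedSessionIds.slice(0, k));
}

// recall_any@k: at least one evidence session appears in the top k.
function recallAnyAtK(ranked: string[], evidence: string[], k: number): boolean {
  const topIds = topKSet(ranked, k);
  return evidence.some((id) => topIds.has(id));
}

// recall_all@k: every evidence session appears in the top k.
// A question with no evidence sessions is treated as a miss here (assumption).
function recallAllAtK(ranked: string[], evidence: string[], k: number): boolean {
  const topIds = topKSet(ranked, k);
  return evidence.length > 0 && evidence.every((id) => topIds.has(id));
}

// One example question with two evidence sessions, found at ranks 2 and 4.
const ranked = ["s7", "s2", "s9", "s4"];
const evidence = ["s2", "s4"];
console.log(recallAnyAtK(ranked, evidence, 3)); // true: s2 is in the top 3
console.log(recallAllAtK(ranked, evidence, 3)); // false: s4 is at rank 4
console.log(recallAllAtK(ranked, evidence, 4)); // true: both are in the top 4
```

Because recall_all is the stricter check, it can only be equal to or lower than recall_any on the same set of questions, which is one reason the two aggregations should not be compared directly.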
```bash
npm run bench:longmemeval                                 # full S (downloads ~277 MB once)
npm run bench:longmemeval -- --dataset oracle --limit 50  # quick smoke
```

Two aggregations from the same run (2026-06-10, Apple Silicon, ~5s/question for both modes; full report JSON printed by the runner):

**MemPalace-comparable**: methodology copied from [MemPalace's published benchmark](https://github.com/MemPalace/mempalace) (all 500 questions incl. the 30 abstention items, session-level **recall_any@k**: ≥1 evidence session in the top-k). Cells marked n/a are numbers MemPalace does not publish:

| System | R@1 | R@3 | R@5 | R@10 |
|---|---|---|---|---|
| MemPalace "raw" (their headline, same MiniLM embedder) | n/a | n/a | 96.6% | n/a |
| MemPalace "hybrid v4" (450-q held-out; benchmark-specific boosts) | n/a | n/a | 98.4% | 99.8% |
| **mcp-memory hybrid (no rerank)** | 60.0% | 92.0% | 95.2% | 98.8% |
| **mcp-memory hybrid + local rerank** | **92.2%** | **97.4%** | **97.8%** | **98.8%** |

Per-type recall_any@5 (rerank on): knowledge-update 100%, multi-session 100%, single-session-user 98.6%, single-session-assistant 98.2%, temporal-reasoning 97.0%, single-session-preference 83.3%.

**Official-style**: the official `eval_utils.py` aggregation (skips abstention + assistant-only-evidence questions → 419 questions; headline = **recall_all@k**, every evidence session retrieved, + binary NDCG):

| Mode | recall_all@5 | NDCG@5 | recall_all@10 | NDCG@10 |
|---|---|---|---|---|
| hybrid (no rerank) | 83.8% | 0.772 | 94.3% | 0.798 |
| hybrid + local rerank | **92.8%** | **0.930** | **97.1%** | **0.939** |

Honest notes:

- The two aggregations are not interchangeable: recall_any@5 over 500 is the number comparable to MemPalace's 96.6%; recall_all@5 over 419 is the number comparable to the LongMemEval paper's baselines. The runner prints both.
- MemPalace's 98.4% "hybrid v4" adds keyword/temporal/preference-regex boosts developed against this benchmark family (their own docs flag the tuning risk and use a held-out split). Our pipeline has no benchmark-specific logic; 97.8% is the same production path every MCP request takes.
- 14 of ~23,850 session stores (0.06%) were dropped by the production exact-duplicate gate (`ingest_integrity.non_add_operations`). This can only *lower* our score, never inflate it.
- Retrieval recall is not end-to-end QA accuracy; no LLM reader is involved.

## LOCOMO (public benchmark)

[LOCOMO](https://github.com/snap-research/locomo) (Maharana et al., ACL 2024) measures very-long-term conversational memory: 10 conversations of up to ~35 sessions / ~26k tokens each, ~2,000 questions across 5 categories (multi-hop, temporal, open-domain, single-hop, adversarial). Same setup as LongMemEval above: real production handlers, stock MiniLM embedder, **zero LOCOMO-specific tuning**.

```bash
npm run bench:locomo               # full 10 conversations (downloads once)
npm run bench:locomo -- --limit 2  # quick smoke
```

The runner reports recall at **session** granularity (≥ coverage of the evidence sessions, the level MemPalace publishes) and the harder **turn** granularity from the same run. Full run 2026-06-11, Apple Silicon, ~131 min, hybrid + local rerank (the production MCP default):

| Granularity | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|
| session | 52.3% | 73.3% | **82.2%** | **100%** |
| turn | 39.4% | 57.0% | 61.0% | 68.1% |

Per-category session R@10: temporal 85.5%, single-hop 84.4%, adversarial 86.1%, multi-hop 72.3%, open-domain 62.4%.

Honest notes:

- MemPalace publishes session R@10 = 60.3% for their baseline and **88.9%** for their "hybrid v5", which adds keyword/temporal-regex boosts developed against this benchmark family. Our untuned production path lands at 82.2%: 22 points above their baseline, 6.7 below their benchmark-tuned pipeline.
- R@50 = 100%: every evidence session is always in the top-50, so the misses at k=10 are ranking failures, not recall failures.
- **Reranker A/B (2026-06-11):** because R@50 = 100%, the lever is the final reranker, not retrieval. We replayed the LOCOMO candidate set through three latency-viable local cross-encoders. The shipping model wins:

  | Reranker (local, $0/token) | session R@10 | ~latency / 50 docs |
  |---|---|---|
  | **ms-marco-MiniLM-L-6-v2 (shipping)** | **82.2%** | ~0.5 s |
  | ms-marco-MiniLM-L-12-v2 | 81.0% | ~1.0 s |
  | jina-reranker-v1-turbo-en | 80.2% | ~0.6 s |

  Heavier rerankers (bge-reranker-base, mxbai-rerank-\*) were excluded up front on latency (2–4 s per query, a 4–8× regression for no recall gain). No tested swap improves the untuned number, so **82.2% is the best this production path reaches with any of the rerankers tried**; the gap to MemPalace's 88.9% is their benchmark-specific tuning, which we don't do.
- Retrieval recall is not end-to-end QA accuracy; no LLM reader is involved.

## ConvoMem (public benchmark)

[ConvoMem](https://huggingface.co/datasets/Salesforce/ConvoMem) (Pakhomov et al., Salesforce, arXiv 2511.10523) is a 75k-question conversational-memory benchmark. We run the **retrieval** slice MemPalace publishes against: the `1_evidence` single-message-evidence files, 5 categories × 50 items, one document per message of each item's own conversations, scored hit@10 by bidirectional substring match (this exactly mirrors MemPalace's `convomem_bench.py`). Same setup as above: real production handlers, stock MiniLM embedder, **zero tuning**.
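As a rough picture of what "hit@10 by bidirectional substring match" means, the sketch below counts a question as a hit when any of the top-k retrieved documents contains an evidence message or is contained in one. It is a minimal illustration only: the function names are hypothetical, and the trimming and lower-casing step is an assumption rather than something taken from `convomem_bench.py`.

```typescript
// Bidirectional substring test: either string may contain the other.
// Normalisation (trim + lower-case) is an assumption made for this sketch.
function bidirectionalMatch(a: string, b: string): boolean {
  const x = a.trim().toLowerCase();
  const y = b.trim().toLowerCase();
  // An empty string is a substring of everything, so never count it as a match.
  if (x.length === 0 || y.length === 0) return false;
  return x.includes(y) || y.includes(x);
}

// hit@k: true if any of the first k retrieved documents matches any evidence message.
function hitAtK(retrieved: string[], evidenceMessages: string[], k: number): boolean {
  return retrieved
    .slice(0, k)
    .some((doc) => evidenceMessages.some((ev) => bidirectionalMatch(doc, ev)));
}

// Example: the evidence message is the second retrieved document, with extra text around it.
const retrieved = [
  "Let's move the sync to Thursday.",
  "Sure. My favourite editor is Neovim, by the way.",
  "Invoices go out on the 1st.",
];
const evidenceMessages = ["my favourite editor is Neovim"];
console.log(hitAtK(retrieved, evidenceMessages, 10)); // true
console.log(hitAtK(retrieved, evidenceMessages, 1)); // false: the match is at rank 2
```

A scorer of this shape rewards literal presence of the evidence text in the top k, which is consistent with the note below that a semantic reranker can lose points by demoting the literal evidence message.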
```bash
npm run bench:convomem               # 50 items × 5 categories
npm run bench:convomem -- --limit 3  # quick smoke
```

| Mode | overall recall | user | assistant-facts | abstention | preference | implicit-connection |
|---|---|---|---|---|---|---|
| **hybrid (no rerank)** | **93.5%** | 98% | 100% | 98% | 86% | 86% |
| hybrid + local rerank | 86.2% | 98% | 100% | 91% | 76% | 66% |

Honest notes:

- **93.5% (hybrid) beats MemPalace's published 92.9%** on the same slice + same embedder, untuned. The no-rerank number is the apples-to-apples comparator (MemPalace runs raw ChromaDB vector search, no reranker).
- The cross-encoder reranker **hurts** here (−7.3 pts): these are very short single-message corpora scored by exact-substring presence, where the reranker's semantic-relevance reordering can demote the literal evidence message. It's a real, honest property of this easy-slice benchmark; the reranker's win shows on the harder LongMemEval / LOCOMO sets above.
- This is the `1_evidence` parity slice MemPalace reports, not the full 75k-item benchmark; `changing_evidence` has no `1_evidence` split upstream and is skipped (as MemPalace does).

## MemBench (public benchmark)

[MemBench](https://github.com/import-myself/Membench) (Tan et al., Findings of ACL 2025) is an agent-memory benchmark. We run MemPalace's parity slice, the FirstAgent `movie`/`roles`/`events` topics (8,500 items), one document per turn, scored hit@5 with their generous dual id-matching, through the **untuned production** search path.
```bash
npm run bench:membench               # full 8,500-item slice
npm run bench:membench -- --limit 3  # quick smoke
```

| Mode | overall hit@5 | strong categories | designed-hard categories |
|---|---|---|---|
| **hybrid (no rerank), untuned** | **78.7%** | simple 97% · comparative 99% · aggregative 99% · lowlevel-rec 100% | noisy 38% · conditional 50% · post-processing 50% |

Honest notes:

- MemPalace publishes **80.3%**, which is their **tuned** `hybrid` mode (over-retrieve 3× + keyword predicate-overlap rescoring developed against this benchmark). Our 78.7% is the **untuned** production path, within 1.6 pts of their tuned number with no benchmark-specific logic.
- Our production store **deduplicated 8,115 of the turn writes** (near-identical turns within an item fold to one row). Unlike MemPalace's ChromaDB (which indexes every turn), a folded target turn becomes unretrievable, so 78.7% is a **conservative floor**; an un-deduplicated index scores higher.
- The per-category profile matches MemPalace's published shape (their hardest cases are ours too: noisy, conditional, post-processing), a sanity check that the harness measures the same thing.

## Roadmap (BATTLE-PLAN §6.D)

This R0 harness is the foundation. Planned additions, each gated on measured numbers committed here:

- ~~LongMemEval-S runner~~ Done; see above (`scripts/bench/longmemeval.mjs`).
- ~~LOCOMO runner~~ Done; see above (`scripts/bench/locomo.mjs`).
- ~~ConvoMem runner~~ Done; see above (`scripts/bench/convomem.mjs`).
- ~~MemBench runner~~ Done; see above (`scripts/bench/membench.mjs`). **4/4 public-benchmark parity with MemPalace.**
- Bigger held-out gold set; before/after numbers for each retrieval change.
- ~~Latency dashboard at 1K / 10K / 100K vectors.~~ Done at 1K / 10K / 50K; see "Latency at scale" above (`scripts/battle/verify-scale.mjs`). 100K is bounded by the documented O(n) write-path cost, not the sub-second retrieval goal.
## Benchmark integrity

Aggregate claims are only as good as their auditability:

- Committed per-question artifacts: every public-benchmark runner accepts `--out …` and writes the full report plus per-question rows. Committed artifacts live in [`benchmarks/results/`](../benchmarks/results/). Currently that is the full LOCOMO run (session R@10 = 0.822 rerank-on / 0.742 rerank-off, R@50 = 1.0), which reproduces the published headline; the other three suites regenerate with `npm run bench:longmemeval`, `npm run bench:convomem`, and `npm run bench:membench`, each followed by `-- --out …`.
- Zero benchmark-specific tuning: all numbers come from the production `handleStore`/`handleSearch` handlers with the stock MiniLM embedder, the same code path every user runs. There is no bench-only flag anywhere in the pipeline: the query sanitizer's verbatim gate (512 code points) sits above the longest genuine benchmark question (466 cp, MemBench), so the published numbers hold by construction, not by special-casing.
- Hermetic gates: the bench/battle harnesses pin the config loader to a no-config baseline so a developer machine's personal `~/.mcp-memory/config.json` (whose `defaults` now affect store scope/namespace and the embedding context prefix) cannot move the numbers.
- Honest deltas: where a competitor's tuned number beats ours untuned, the tables above say so (MemBench 78.7 untuned vs 80.3 tuned; LongMemEval-S 97.8 vs their held-out tuned 98.4).
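The verbatim gate mentioned above is measured in code points, which differs from JavaScript's `String.length` (UTF-16 code units) for text containing emoji or other astral characters. The sketch below only illustrates that counting difference; the constant and function names are hypothetical, the inclusive `<=` comparison is an assumption, and it is not the server's sanitizer code.

```typescript
// Gate size taken from the text above; the name is hypothetical.
const VERBATIM_GATE_CODE_POINTS = 512;

// Count Unicode code points. Array.from iterates a string by code point,
// so a surrogate pair (for example an emoji) counts once, not twice.
function codePointLength(text: string): number {
  return Array.from(text).length;
}

// Assumption for this sketch: a query at or under the gate passes verbatim.
function withinVerbatimGate(query: string): boolean {
  return codePointLength(query) <= VERBATIM_GATE_CODE_POINTS;
}

const sample = "h\u00e9llo \u{1F44B}";
console.log(sample.length); // UTF-16 code units
console.log(codePointLength(sample)); // code points
console.log(withinVerbatimGate("q".repeat(466))); // the longest benchmark question length cited above
```

Counting in code points keeps the threshold independent of how a character happens to be encoded in UTF-16, so a question with emoji is not penalised relative to plain ASCII text of the same visible length.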