# LongMemEval-S Benchmark Results

[LongMemEval](https://arxiv.org/abs/2410.10813) (ICLR 2025) is an academic benchmark for evaluating long-term memory in chat assistants. It tests five core abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

## Setup

- **Dataset**: LongMemEval-S (500 questions, ~48 sessions per question, ~115K tokens)
- **Source**: [xiaowu0162/longmemeval-cleaned](https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned)
- **Metric**: `recall_any@K` — does ANY gold session appear in the top-K retrieved results?
- **Embedding model**: `all-MiniLM-L6-v2` (384 dimensions, local, no API key)
- **No LLM in the loop**: pure retrieval evaluation, no answer generation or judge

## Results

| System | R@5 | R@10 | R@20 | NDCG@10 | MRR |
|---|---|---|---|---|---|
| **agentmemory BM25+Vector** | **95.2%** | **98.6%** | **99.4%** | **87.9%** | **88.2%** |
| agentmemory BM25-only | 86.2% | 94.6% | 98.6% | 73.0% | 71.5% |
| MemPalace raw (vector-only) | 96.6% | ~97.6% | — | — | — |

### By Question Type (BM25+Vector)

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 98.7% | 100.0% | 78 |
| multi-session | 97.7% | 100.0% | 133 |
| single-session-assistant | 96.4% | 98.2% | 56 |
| temporal-reasoning | 95.5% | 97.7% | 133 |
| single-session-user | 90.0% | 97.1% | 70 |
| single-session-preference | 83.3% | 96.7% | 30 |

### By Question Type (BM25-only)

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 92.3% | 98.7% | 78 |
| single-session-user | 91.4% | 95.7% | 70 |
| temporal-reasoning | 88.0% | 94.7% | 133 |
| multi-session | 86.5% | 96.2% | 133 |
| single-session-assistant | 80.4% | 91.1% | 56 |
| single-session-preference | 60.0% | 80.0% | 30 |

## Analysis

1. **BM25+Vector (95.2%) nearly matches pure vector search (96.6%)** at R@5, with only a 1.4pp gap. Both use the same embedding model (all-MiniLM-L6-v2).
2. **BM25 alone reaches 86.2%** — keyword search with Porter stemming and synonym expansion is surprisingly effective on conversational data.
3. **Adding vectors to BM25 gives +9pp at R@5** (86.2% → 95.2%), the largest improvement from any single component.
4. **Preferences are the hardest category** for both BM25 (60.0%) and the hybrid (83.3%) at R@5. These questions require matching implicit or indirectly stated preferences rather than explicit keywords.
5. **Multi-session and knowledge-update are the strongest categories** (97.7%+ R@5 for the hybrid). The hybrid approach excels when the relevant facts are distributed across sessions.
6. **Hybrid R@10 reaches 98.6%** — nearly all gold sessions are found within the top 10 results.

## Important Notes on Methodology

- These are **retrieval recall** scores, not end-to-end QA accuracy. The official LongMemEval metric is QA accuracy (retrieve + generate an answer + GPT-4o judge).
- Systems on the actual LongMemEval QA leaderboard score 60-95% depending on the LLM reader (Oracle GPT-4o gets ~82.4%).
- We do NOT claim these as "LongMemEval scores" — they are retrieval-only evaluations on the LongMemEval-S haystack.
- Each question builds a fresh index from its ~48 sessions, searches with the question text, and checks whether gold session IDs appear in the results (see the sketches below).
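As a concrete illustration of the last point, here is a minimal, runnable sketch of the per-question evaluation loop. The token-overlap `rankSessions` scorer is a hypothetical stand-in for the real BM25+vector retriever, and the dataset field names (`haystack_sessions`, `haystack_session_ids`, `answer_session_ids`) are assumed to follow the LongMemEval JSON schema; adjust them if the cleaned mirror differs.

```typescript
// Sketch of the per-question eval loop (stand-in retriever, not agentmemory's).
import { readFileSync } from "node:fs";

interface Turn { role: string; content: string }
interface Question {
  question: string;
  haystack_session_ids: string[];
  haystack_sessions: Turn[][];
  answer_session_ids: string[];
}

const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Stand-in retriever: score each session by query-token overlap, return ranked IDs.
function rankSessions(q: Question, query: string): string[] {
  const qTokens = new Set(tokenize(query));
  return q.haystack_sessions
    .map((session, i) => {
      const text = session.map((t) => t.content).join(" ");
      const score = tokenize(text).filter((tok) => qTokens.has(tok)).length;
      return { id: q.haystack_session_ids[i], score };
    })
    .sort((a, b) => b.score - a.score)
    .map((s) => s.id);
}

const questions: Question[] = JSON.parse(
  readFileSync("benchmark/data/longmemeval_s_cleaned.json", "utf8"),
);

const K = 10;
let hits = 0;
for (const q of questions) {
  // Fresh index per question: only this question's ~48 sessions are searchable.
  const topK = rankSessions(q, q.question).slice(0, K);
  // recall_any@K: a hit if ANY gold session appears in the top K.
  if (topK.some((id) => q.answer_session_ids.includes(id))) hits++;
}
console.log(`recall_any@${K}: ${((100 * hits) / questions.length).toFixed(1)}%`);
```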
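The NDCG@10 and MRR columns in the results table can be computed from the same ranked lists. A sketch assuming binary relevance (a retrieved session is relevant iff its ID is in the gold set):

```typescript
// MRR: reciprocal rank of the first gold session (0 if none retrieved).
function mrr(ranked: string[], gold: Set<string>): number {
  const i = ranked.findIndex((id) => gold.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}

// NDCG@K with binary gains: DCG over the top K, normalized by the ideal
// ordering in which every gold session is ranked first.
function ndcgAtK(ranked: string[], gold: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (gold.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(gold.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```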
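Analysis points 1 and 3 hinge on how the BM25 and vector rankings are combined. This document does not describe agentmemory's actual fusion method, so the sketch below uses reciprocal rank fusion (RRF), a common default, purely as an illustration of one plausible scheme:

```typescript
// Reciprocal rank fusion (assumption — not confirmed to be agentmemory's method):
// each list contributes 1/(k + rank) per document; sum and re-sort.
function reciprocalRankFusion(
  bm25Ranked: string[],
  vectorRanked: string[],
  k = 60, // conventional RRF damping constant
): string[] {
  const scores = new Map<string, number>();
  for (const ranked of [bm25Ranked, vectorRanked]) {
    ranked.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

RRF is rank-based, so it sidesteps the need to normalize BM25 scores and cosine similarities onto a common scale.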
## Reproducibility

```bash
# Download the dataset (264 MB)
pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='xiaowu0162/longmemeval-cleaned',
                filename='longmemeval_s_cleaned.json',
                repo_type='dataset',
                local_dir='benchmark/data')
"

# Run BM25-only
npx tsx benchmark/longmemeval-bench.ts bm25

# Run BM25+Vector hybrid (requires @xenova/transformers)
npx tsx benchmark/longmemeval-bench.ts hybrid
```
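The hybrid runs pull `all-MiniLM-L6-v2` through `@xenova/transformers`. For orientation, a minimal local-embedding sketch using the Xenova ONNX port of that model (the model ID `Xenova/all-MiniLM-L6-v2` is an assumption; substitute whatever the benchmark script actually loads):

```typescript
// Local embedding with @xenova/transformers (no API key needed).
import { pipeline } from "@xenova/transformers";

const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

// Mean-pooled, L2-normalized 384-dim embedding; cosine similarity between
// two such embeddings then reduces to a plain dot product.
const output = await extractor("What did I say about my travel plans?", {
  pooling: "mean",
  normalize: true,
});
const embedding = Array.from(output.data as Float32Array);
console.log(embedding.length); // 384
```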