# agentmemory v0.6.0 — Search Quality Evaluation (Internal Dataset)

> For results on the academic LongMemEval-S benchmark (ICLR 2025, 500 questions), see [`LONGMEMEVAL.md`](LONGMEMEVAL.md) — **95.2% R@5, 98.6% R@10**.

**Date:** 2026-03-18T07:44:43.397Z
**Dataset:** 240 synthetic observations across 30 sessions (internal coding project)
**Queries:** 20 labeled queries with ground-truth relevance
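**Metric definitions:** Recall@K (fraction of all relevant docs that appear in the top K), Precision@K (fraction of the top K results that are relevant), NDCG@10 (normalized discounted cumulative gain over the top 10; rewards placing relevant results near the top), MRR (mean reciprocal rank: 1/rank of the first relevant result, averaged over queries)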
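
The definitions above can be sketched in a few lines of Python. This is a minimal illustration assuming binary relevance labels and a log2 rank discount for NDCG; the function names and the example ranking are hypothetical and are not taken from the agentmemory evaluation harness. The example ranking is constructed to be consistent with the "rate limiting implementation" row in the per-query table (10 relevant, first hit at rank 3); it is not the actual result list.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant (divides by k)."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking: relevant docs sit at ranks 3..10, two more are never retrieved.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = {f"d{i}" for i in range(3, 11)} | {"missed-1", "missed-2"}

print(round(recall_at_k(ranked, relevant, 10), 3))    # 0.8
print(round(precision_at_k(ranked, relevant, 5), 3))  # 0.6
print(round(ndcg_at_k(ranked, relevant, 10), 3))      # 0.641
print(round(reciprocal_rank(ranked, relevant), 3))    # 0.333
```

MRR in the tables is the mean of `reciprocal_rank` over all queries for a system.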

## Head-to-Head Comparison

| System | Recall@5 | Recall@10 | Precision@5 | NDCG@10 | MRR | Latency | Tokens/query |
|--------|----------|-----------|-------------|---------|-----|---------|--------------|
| Built-in (CLAUDE.md / grep) | 37.0% | 55.8% | 78.0% | 80.3% | 82.5% | 0.50ms | 22,610 |
| Built-in (200-line MEMORY.md) | 27.4% | 37.8% | 63.0% | 56.4% | 65.5% | 0.16ms | 7,938 |
| BM25-only | 43.8% | 55.9% | 95.0% | 82.7% | 95.5% | 0.17ms | 3,142 |
| Dual-stream (BM25+Vector) | 42.4% | 58.6% | 90.0% | 84.7% | 95.4% | 0.71ms | 3,142 |
| Triple-stream (BM25+Vector+Graph) | 36.8% | 58.0% | 87.0% | 81.7% | 87.9% | 1.02ms | 3,142 |

## Why This Matters

**Recall improvement:** agentmemory triple-stream finds 58.0% of relevant memories at K=10 vs 55.8% for keyword grep (+2.2 percentage points, about 4% relative).
**Token savings:** agentmemory returns only the top 10 results (3,142 tokens) instead of loading everything into context (22,610 tokens), an 86% reduction.
**200-line cap:** Claude Code's MEMORY.md is capped at 200 lines. With 240 observations, the capped file reaches only 37.8% recall at K=10, because memories from later sessions fall past the cap and are invisible.
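
The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical; the inputs are the values from the head-to-head table.

```python
def point_change(new_pct: float, old_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_change(new: float, old: float) -> float:
    """Change of `new` relative to `old`, as a fraction of `old`."""
    return (new - old) / old


def token_reduction(tokens_after: int, tokens_before: int) -> float:
    """Fraction of tokens saved when going from `tokens_before` to `tokens_after`."""
    return 1 - tokens_after / tokens_before


# Recall@10: triple-stream (58.0%) vs keyword grep (55.8%)
print(round(point_change(58.0, 55.8), 1))            # 2.2 percentage points
print(round(relative_change(58.0, 55.8) * 100, 1))   # 3.9 percent relative

# Tokens per query: top 10 results (3,142) vs full context (22,610)
print(round(token_reduction(3142, 22610) * 100))     # 86 percent
```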

## Per-Query Breakdown (Triple-Stream)

| Query | Category | Recall@10 | NDCG@10 | MRR | Relevant observations | Latency |
|-------|----------|-----------|---------|-----|----------|---------|
| How did we set up authentication? | semantic | 50.0% | 100.0% | 100.0% | 20 | 1.7ms |
| JWT token validation middleware | exact | 50.0% | 64.9% | 100.0% | 10 | 1.2ms |
| PostgreSQL connection issues | semantic | 33.3% | 100.0% | 100.0% | 30 | 1.0ms |
| Playwright test configuration | exact | 100.0% | 100.0% | 100.0% | 10 | 1.1ms |
| Why did the production deployment fail? | cross-session | 33.3% | 100.0% | 100.0% | 30 | 0.8ms |
| rate limiting implementation | exact | 80.0% | 64.1% | 33.3% | 10 | 0.7ms |
| What security measures did we add? | semantic | 33.3% | 100.0% | 100.0% | 30 | 0.7ms |
| database performance optimization | semantic | 0.0% | 0.0% | 7.1% | 25 | 0.8ms |
| Kubernetes pod crash debugging | entity | 100.0% | 96.7% | 100.0% | 5 | 1.2ms |
| Docker containerization setup | entity | 100.0% | 100.0% | 100.0% | 10 | 0.9ms |
| How does caching work in the app? | semantic | 25.0% | 64.9% | 100.0% | 20 | 0.8ms |
| test infrastructure and factories | exact | 50.0% | 64.9% | 100.0% | 10 | 0.7ms |
| What happened with the OAuth callback error? | cross-session | 100.0% | 54.1% | 16.7% | 5 | 1.1ms |
| monitoring and observability setup | semantic | 66.7% | 100.0% | 100.0% | 15 | 0.8ms |
| Prisma ORM configuration | entity | 25.7% | 93.6% | 100.0% | 35 | 1.8ms |
| CI/CD pipeline configuration | exact | 20.0% | 64.9% | 100.0% | 25 | 1.0ms |
| memory leak debugging | cross-session | 100.0% | 100.0% | 100.0% | 5 | 0.7ms |
| API design decisions | semantic | 25.0% | 64.9% | 100.0% | 20 | 1.4ms |
| zod validation schemas | entity | 66.7% | 100.0% | 100.0% | 15 | 0.7ms |
| infrastructure as code Terraform | entity | 100.0% | 100.0% | 100.0% | 5 | 1.5ms |
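
When reading the Recall@10 column, note that a query with more than 10 relevant observations cannot reach 100% recall at K=10, since at most 10 results are returned. The sketch below computes that ceiling and recovers the implied hit count from a table row. The function names are hypothetical, and the hit count is inferred from the rounded percentages in the table, not read from the evaluation logs.

```python
def recall_ceiling(num_relevant: int, k: int = 10) -> float:
    """Highest Recall@k achievable when only k results are returned."""
    if num_relevant <= 0:
        raise ValueError("num_relevant must be positive")
    return min(k, num_relevant) / num_relevant


def implied_hits(recall: float, num_relevant: int) -> int:
    """Number of relevant results in the top k implied by a reported recall."""
    return round(recall * num_relevant)


# "Prisma ORM configuration": 35 relevant observations, 25.7% Recall@10
print(round(recall_ceiling(35) * 100, 1))  # 28.6 (best possible at K=10)
print(implied_hits(0.257, 35))             # 9 of the top 10 were relevant

# "How did we set up authentication?": 20 relevant, 50.0% Recall@10
print(round(recall_ceiling(20) * 100, 1))  # 50.0 (the reported value is the ceiling)
```

Under this reading, several low Recall@10 values in the table reflect a large relevant set and not ranking errors, which is consistent with their 100% NDCG@10.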

## By Query Category

| Category | Avg Recall@10 | Avg NDCG@10 | Avg MRR | Queries |
|----------|---------------|-------------|---------|---------|
| exact | 60.0% | 71.8% | 86.7% | 5 |
| semantic | 33.3% | 75.7% | 86.7% | 7 |
| cross-session | 77.8% | 84.7% | 72.2% | 3 |
| entity | 78.5% | 98.1% | 100.0% | 5 |
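
The category rows are unweighted means of the per-query rows (each query counts once, regardless of how many relevant observations it has). A minimal sketch that reproduces two of the categories from the per-query table; the data layout and function name are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean

# (category, Recall@10, NDCG@10, MRR) copied from the per-query table, in percent.
rows = [
    ("cross-session", 33.3, 100.0, 100.0),  # production deployment failure
    ("cross-session", 100.0, 54.1, 16.7),   # OAuth callback error
    ("cross-session", 100.0, 100.0, 100.0), # memory leak debugging
    ("entity", 100.0, 96.7, 100.0),         # Kubernetes pod crash debugging
    ("entity", 100.0, 100.0, 100.0),        # Docker containerization setup
    ("entity", 25.7, 93.6, 100.0),          # Prisma ORM configuration
    ("entity", 66.7, 100.0, 100.0),         # zod validation schemas
    ("entity", 100.0, 100.0, 100.0),        # infrastructure as code Terraform
]


def category_averages(rows):
    """Return {category: (avg recall, avg NDCG, avg MRR)}, rounded to one decimal."""
    grouped = defaultdict(list)
    for category, recall, ndcg, mrr in rows:
        grouped[category].append((recall, ndcg, mrr))
    return {
        category: tuple(round(fmean(column), 1) for column in zip(*values))
        for category, values in grouped.items()
    }


averages = category_averages(rows)
print(averages["cross-session"])  # (77.8, 84.7, 72.2)
print(averages["entity"])         # (78.5, 98.1, 100.0)
```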

## Context Window Analysis

The fundamental problem with built-in agent memory is that context cost grows with the corpus while the reachable share of memories shrinks:

| Observations | MEMORY.md tokens | agentmemory tokens (top 10) | Savings | MEMORY.md reachable |
|-------------|-----------------|---------------------------|---------|-------------------|
| 240 | 12,000 | 3,142 | 74% | 83% |
| 500 | 25,000 | 3,142 | 87% | 40% |
| 1,000 | 50,000 | 3,142 | 94% | 20% |
| 5,000 | 250,000 | 3,142 | 99% | 4% |

At 240 observations (our dataset), MEMORY.md already hits its 200-line cap and loses access to the most recent 40 observations. At 1,000 observations, 80% of memories are invisible. By contrast, agentmemory always searches the full corpus. The MEMORY.md token column scales at 50 tokens per observation (12,000 / 240).
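
The rows of the table follow from three numbers, which the sketch below makes explicit. It assumes one observation per MEMORY.md line, 50 tokens per observation (the ratio implied by the 240-observation row), and a constant 3,142 tokens for the top 10 results; the constant and function names are hypothetical.

```python
LINE_CAP = 200               # MEMORY.md line cap stated in this report
TOKENS_PER_OBSERVATION = 50  # assumption: implied by 12,000 tokens / 240 observations
TOP10_TOKENS = 3142          # tokens returned for the top 10 results


def context_row(observations: int) -> tuple[int, float, float]:
    """Return (MEMORY.md tokens, savings fraction, reachable fraction)."""
    memory_md_tokens = observations * TOKENS_PER_OBSERVATION
    savings = 1 - TOP10_TOKENS / memory_md_tokens
    reachable = min(1.0, LINE_CAP / observations)
    return memory_md_tokens, savings, reachable


for n in (240, 500, 1000, 5000):
    tokens, savings, reachable = context_row(n)
    print(f"{n:>5} | {tokens:>7,} | {savings:.0%} | {reachable:.0%}")
```

Because the top-10 token count does not depend on corpus size in this model, the savings approach 100% as observations grow, while the reachable fraction falls as 200 / N.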

---

*100 evaluations (20 queries × 5 systems). Ground-truth labels were assigned by concept matching against observation metadata.*