# Benchmarks

This is the public benchmark hub for Lerim.

The rule is simple: public numbers must point to raw artifacts or cited external sources. Generated report copies are kept for auditability, but raw `report.json` files are the source of truth for Lerim numbers.

Launch-grade benchmark artifacts should be rerun from a clean commit and pass the clean/tracked public benchmark gate. The `v0.3.0` public artifacts passed that release gate. Future artifacts with `git_dirty: true` should still be treated as pre-release evidence until rerun after commit.

## Start Here

| Page | Use it for |
| --- | --- |
| [Benchmark Suite](benchmark-suite.md) | Plain-English explanation of each benchmark surface and boundary |
| [Lerim Results](lerim-results.md) | Detailed Lerim-only benchmark results, raw artifact references, commands, and boundaries |
| [Market Comparison](market-comparison.md) | Lerim vs other memory systems, with normalized rows, cited external numbers, and watchlist rows kept separate |

Generated reports live under `benchmarks/results/reports/` as audit copies. Use the pages above for the public reading path; use generated reports when you need to trace a table back to raw artifacts.

Raw artifacts are tracked in this repo under `benchmarks/results/raw/`. Generated audit copies are tracked under `benchmarks/results/reports/`. Those paths must be included in the release commit before public benchmark links are treated as launch-grade evidence.
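The pre-release rule stated at the top of this page can be illustrated with a minimal sketch. It assumes `git_dirty` is a top-level boolean in `report.json`; that layout and the helper names `is_pre_release` and `load_report` are hypothetical and not part of the Lerim runners.

```python
import json
from pathlib import Path


def is_pre_release(report: dict) -> bool:
    """Return True when a raw report should be treated as pre-release evidence."""
    # Assumption: `git_dirty` is a top-level boolean in report.json.
    # A missing flag is treated as dirty so unverified artifacts never pass.
    return bool(report.get("git_dirty", True))


def load_report(path: Path) -> dict:
    """Read one raw report.json file (hypothetical helper)."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A clean flag alone does not make an artifact launch-grade; the artifact still has to pass the clean/tracked public benchmark gate described above.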
## Artifact Map

| Path | Purpose | How to read it |
| --- | --- | --- |
| `docs/benchmarks/index.md` | Public benchmark hub | Start here |
| `docs/benchmarks/benchmark-suite.md` | Benchmark surface explanations | Use when learning what each benchmark means |
| `docs/benchmarks/lerim-results.md` | Public Lerim-only results | Use for first-party claims |
| `docs/benchmarks/market-comparison.md` | Public market comparison | Use for competitor/market claims |
| `benchmarks/lerim_evidence/` | Lerim benchmark runners | Code that produces Lerim numbers |
| `benchmarks/competitors/` | Source-backed competitor importers | Competitor evidence normalization, not product code |
| `benchmarks/results/raw/` | Raw benchmark artifacts | Numeric source of truth |
| `benchmarks/results/reports/` | Generated Markdown reports | Audit copies generated from raw artifacts |

Do not edit numbers by hand in docs. Change the runner, rerun it, then update the generated artifacts.

## Current Evidence

| Surface | Current evidence | Where to read |
| --- | --- | --- |
| LongMemEval-S retrieval | Full 500-question retrieval-only runs for hybrid and lexical modes | [Lerim Results](lerim-results.md#longmemeval-s-retrieval) |
| Context budget | Full 500-question context-selection run using a Hugging Face tokenizer | [Lerim Results](lerim-results.md#context-budget) |
| Retrieval latency | Partial local scale run on LongMemEval-S sessions | [Lerim Results](lerim-results.md#retrieval-latency) |
| Trace ingestion cost/performance | Small public-trace sample with measured LLM calls and unavailable-cost disclosure | [Lerim Results](lerim-results.md#trace-ingestion-costperformance) |
| MCP integration | Config validation, local stdio MCP probes, trace-submit idempotency, 0 trace-submit extraction acceptances, and a Gemini CLI live tool-call acceptance artifact | [Lerim Results](lerim-results.md#mcp-integration) |
| Extraction quality | Aggregate-only 47-case diagnostic report from a `MiniMax-M2.7` agent artifact judged by `MiniMax-M2.5`; not launch-grade | [Lerim Results](lerim-results.md#extraction-eval-status) |
| Market comparison | Source-backed market table with comparable and not-yet-comparable rows separated | [Market Comparison](market-comparison.md) |

## Surface Map

| Surface | Public question answered | Current evidence | Not proven |
| --- | --- | --- | --- |
| LongMemEval-S retrieval | Can Lerim find answer-bearing sessions? | Full 500-question retrieval-only run | Answer generation or official LongMemEval QA accuracy |
| Context budget | How much context does Lerim select after retrieval? | Same 500 LongMemEval-S questions, Hugging Face tokenizer counts, recall shown beside reduction | Dollar cost savings, answer quality, or a replacement for the retrieval benchmark |
| Retrieval latency | How fast is local search on this machine? | Local timings over LongMemEval-S sessions | Hosted/server load performance |
| Trace ingestion cost/performance | How much time, LLM-call count, and local DB growth does the write path use? | Small LongMemEval-S public-trace sample through DSPy ingestion | Extraction quality, answer quality, or dollar cost when provider usage is unavailable |
| MCP integration | Does Lerim's config and MCP plumbing work? | Config validation, local stdio tools/context probes, trace-submit idempotency, 0 trace-submit extraction acceptances, and Gemini CLI live tool-call acceptance | Autonomous live tool use by every external client or successful trace-submit extraction in this artifact |
| Extraction | Can Lerim extract durable records from source sessions? | Aggregate-only report from one 47-case internal eval | Launch-grade public claim or market comparison |
| Market comparison | How does Lerim compare to alternatives? | Market table with source/provenance per row | Full same-boundary market ranking |

## Reporting Rules

- `report.json` is the numeric source of truth for Lerim rows.
- Use `predictions.jsonl` for per-question benchmark rows.
- Use `details.jsonl` for integration probe rows.
- Do not publish partial slices as final benchmark results.
- Do not call retrieval-only scores official LongMemEval QA scores.
- Do not use context-budget numbers without recall.
- Do not reuse retrieval numbers as extraction-quality numbers.
- Treat Lerim's trace-to-context extraction eval as first-party/private until a competitor runner feeds the same traces into another system and scores its saved memories with the same labels and judge.
- Do not publish competitor numbers without matching metric boundaries and source-backed provenance.
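
Two of the rules above lend themselves to a mechanical check before a row is published. The sketch below is illustrative only: the row keys (`surface`, `partial`, `final`, `recall`) and the function name `rule_violations` are hypothetical and do not describe the actual `report.json` schema.

```python
def rule_violations(row: dict) -> list[str]:
    """Return the reporting rules a candidate public row would break."""
    problems = []
    # Hypothetical keys: `partial` marks a partial slice, `final` marks a row
    # presented as a final benchmark result.
    if row.get("partial") and row.get("final"):
        problems.append("partial slice published as final")
    # Hypothetical keys: `surface` names the benchmark surface and `recall`
    # carries the recall value shown beside a context-budget number.
    if row.get("surface") == "context-budget" and row.get("recall") is None:
        problems.append("context-budget number without recall")
    return problems
```

An empty list means the row passes only these two checks; the remaining rules depend on provenance and metric boundaries that still need human review.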