# Hypermnesic on LongMemEval V1 — Benchmark Verdict

**Status: Phase 1 retrieval diagnostic — RERUN (2026-06-13, full `_s`).
Phase 2 — end-to-end QA headline (GPT-4.1 + GPT-4o columns) — RERUN (2026-06-13, full `_s`,
both columns under the `gpt-4o-2024-08-06` judge; numbers below).**

This document is the public-facing verdict for Hypermnesic on **LongMemEval V1**,
in the integrity style of [`PARITY_VERDICT.md`](PARITY_VERDICT.md): it records
methodology, the **comparability envelope**, and **aggregate / per-ability**
numbers only — never per-instance corpus data (R15) — and it carries a corrections
log and a no-tune-to-pass commitment. The re-runnable harness lives under
[`harness/longmemeval/`](longmemeval/); the pinned reproducibility manifest is
[`harness/longmemeval/manifest.json`](longmemeval/manifest.json).

The numbers themselves are produced by running the harness (below). Both phases have
now been run on the full `_s` set (2026-06-02) and their aggregates transcribed; the
re-runnable harness + pinned manifest let a third party reproduce them.

## Product operability proof

LongMemEval measures retrieval quality. It does not prove setup, consent, memory control,
plugin observability, daily workflows, or remote-client operability. Benchmark scores are not a substitute for product readiness.

Product readiness is gated separately by `scripts/product_smoke.py`, offline remote-contract tests
in `tests/test_product_remote_smoke.py`, the
[`docs/guides/remote-client-smoke-checklist.md`](../docs/guides/remote-client-smoke-checklist.md),
and the
[`docs/launch/first-class-product-readiness-checklist.md`](../docs/launch/first-class-product-readiness-checklist.md).

---

## What this number is, and is not (comparability envelope)

> **The single most important section.** A LongMemEval score is only meaningful
> next to the *reader model*, the *judge model*, and the *dataset release* that
> produced it. Cells that differ on any of these three axes are **not** directly
> comparable, and the field's headline numbers differ on all three.

| | Comparable to | **Not** comparable to |
|---|---|---|
| **Benchmark** | LongMemEval **V1**, `_s` variant (~115k tokens, ~40 sessions) | LongMemEval **V2** (multimodal, LAFS metric) |
| **Judge** | rows graded by **`gpt-4o-2024-08-06`** (the canonical judge) | rows graded by a **GPT-4.1 judge** (more lenient — see below) |
| **Dataset** | the **cleaned-2025-09** release | rows reported on the **original** release (flagged per row) |
| **Ingestion** | **raw verbatim** RAG-style ingestion | distilled / fact-extraction systems' best rows |

### Per-row attribution of the cited SOTA (R21)

Every cited leaderboard row is attributed by **reader · judge · dataset release**,
so no reader mistakes our GPT-4.1-reader / GPT-4o-judge cell for a GPT-4.1-*graded*
vendor row. The anchor figures below were **re-verified against primary sources at
publication** (2026-06-02) — and that re-verification corrected a planning-doc
error: the ~84 GPT-4o-judge figure belongs to **Mastra Observational Memory**, not
OMEGA. OMEGA publishes **no** GPT-4o-judge row; its only number is 95.4 at a GPT-4.1
judge (see the corrections log).

| Row | Reported (overall) | Reader | **Judge** | Dataset release | Directly comparable to… |
|---|---|---|---|---|---|
| **Hypermnesic — lead** | **88.6** (task-avg 90.2) | GPT-4.1 (`gpt-4.1-2025-04-14`) | **`gpt-4o-2024-08-06`** | cleaned-2025-09 | *this is us* |
| **Hypermnesic — anchor** | **83.6** (task-avg 87.1) | GPT-4o (`gpt-4o-2024-08-06`) | **`gpt-4o-2024-08-06`** | cleaned-2025-09 | *this is us* |
| Mastra Observational Memory | 84.2 | GPT-4o | **`gpt-4o`** | original | ✅ our **anchor** (mind the release) |
| Zep | 71.2 | GPT-4o | **`gpt-4o`** | original | ✅ our anchor (mind the release) |
| Full-context baseline | 60.2 | GPT-4o | **`gpt-4o`** | original | ✅ our anchor (no-memory floor) |
| OMEGA (headline) | 95.4 | GPT-4.1 | **GPT-4.1 (lenient)** | original | ❌ different judge — **not** our cell |
| Mastra (OMEGA leaderboard) | 94.9 | GPT-4.1† | **GPT-4.1†** | original | ❌ different judge — **not** our cell |

Sources (re-verified 2026-06-02): Zep arXiv 2501.13956 (full-context 60.2, Zep 71.2,
both GPT-4o reader + GPT-4o judge, original `_s`); Mastra research page
(Observational Memory 84.23, GPT-4o reader + GPT-4o judge); OMEGA leaderboard
(`omegamax.co/benchmarks` — OMEGA 95.4 / Mastra 94.9, **GPT-4.1 judge throughout**).
† the Mastra GPT-4.1-judge row is shown only to make the judge-axis gap explicit.

### The reader-vs-judge swing — decomposed (with first-party numbers)

Reader and judge choice **each** move the headline materially; conflating them is
the field's most common comparability error. We resist the temptation to "decompose"
the swing from *cross-vendor* numbers — there is **no** published system with both a
GPT-4o-judge and a GPT-4.1-judge row on the same release, so any single point-spread
attributed to "the reader axis" from public figures silently mixes reader, judge,
generation model, and release. Instead we report each axis cleanly:

- **Reader axis — measured here, at a *fixed* judge.** Both our columns are graded by
  the **same** `gpt-4o-2024-08-06` judge, so the gap between them is a *clean* reader
  effect: **GPT-4.1 reader 90.2 vs GPT-4o reader 87.1 task-averaged** (88.6 vs 83.6
  overall) — a **+3.1 pt task-avg / +5.0 pt overall** swing from the stronger reader
  alone. That is why this harness publishes **both** reader columns: a GPT-4.1 column
  set beside GPT-4o-graded SOTA rows would be apples-to-oranges.
- **Judge axis — why the 95% rows are out of reach by construction.** The leaderboard's
  93–96% rows (OMEGA 95.4, Mastra 94.9) are graded with a **GPT-4.1 judge**, which is
  **more lenient** than `gpt-4o-2024-08-06`. Our judge is the canonical
  `gpt-4o-2024-08-06` (the official aggregator hard-asserts this snapshot, R11). A
  GPT-4.1-*judge* column is deliberately **out of scope** — a non-`gpt-4o-2024-08-06`
  judge is rejected by the official aggregator.

**Consequence (stated plainly):** our **GPT-4.1-reader / `gpt-4o-2024-08-06`-judge**
cell (88.6) is **not** the GPT-4.1-*graded* 95.4 row, and is **expected to land below
it on judge strictness alone** — independent of memory quality. We do not claim to be
"in the 93–96% column." The **truly comparable anchors** are the GPT-4o-*judge* rows:
**full-context 60.2, Zep 71.2, Mastra-on-GPT-4o 84.2** (all original-release; mind the
release caveat).

---

## Methodology

- **Ingestion (raw verbatim, R4).** Each instance's sessions are materialized to
  markdown **verbatim** — no summarization or fact-extraction between the dataset
  and the index. Two granularities: **per-session** (the QA corpus) and
  **per-user-turn rounds** (the turn diagnostic corpus). The session date is written
  **in the body** (`Session Date: …`) so it survives `ingest.strip_frontmatter` into
  the index (R6). The session→markdown mapping is **deterministic** — the same
  instance yields byte-identical files (R5).
- **Index + retrieval (shipped read path, R12).** One **isolated index per instance**
  (`build_index` with a `state_dir` outside the corpus → `retrieve.search` → discard),
  using the **production embedding config unchanged** (`text-embedding-3-large` @ 1536).
  No engine `src/` change — the harness consumes the read path as shipped.
- **Gold reconstruction.** Session-level gold = `answer_session_ids`; turn-level gold
  = the rounds carrying `has_answer: true`. The **30 `_abs` (abstention) instances are
  excluded from retrieval scoring**, matching the official `run_retrieval.py`.
- **Retrieval metrics (official).** `recall_all@k` (all gold in top-k → 1.0 else 0.0)
  and `ndcg_any@k` (binary relevance, ideal-DCG normalized). Session level @5/@10; turn
  level adds @50. These are **distinct** from the parity harness's fractional
  `recall_at_k`.
- **Frozen params, no tune-to-pass (R19).** `k`, fusion weights `(lexical, dense, doc)`,
  lanes, and near-dup collapse are frozen at the manifest values **before** any run;
  the diagnostic retrieves a deep candidate list purely to *measure* recall at multiple
  cutoffs — that measurement depth is not a tuned fusion parameter. Any param change is
  recorded in the corrections log.
- **Embed-quiescence void (R20/AE2).** `smoke_embed_or_die` runs before scoring; if any
  query degraded to lexical-only (embedding unavailable), the **whole run is voided**,
  never scored — a degraded run is non-comparable, not a FAIL.
- **No date-aware ranking — `recall_any` beside `recall_all` for date-sensitive abilities.**
  Hypermnesic's `search()` applies **no recency/date weighting**. For **knowledge-update**
  and **temporal-reasoning**, which need the newest session surfaced, the diagnostic also
  reports `recall_any@k` and the **gold-set-size distribution**, so a low `recall_all`
  there localizes to **retrieval ordering** (a known, addressable ranking gap), not to a
  memory or reader failure. This keeps the "diagnostic localizes the gap" claim honest.

---

## Phase 1 — retrieval diagnostic (session + turn level)

Embeddings-only; no reader/judge spend. **Rerun 2026-06-13** over the **full `_s` set**
(headline-eligible, no sampling, R17); embed-quiescent (not voided); production embed
config (`text-embedding-3-large` @ 1536); frozen retrieval params (`k`=10, fusion
weights `(1,1,1)`, doc lane on, near-dup collapse on) measured to a depth of 200 so
@50 is reportable. The smoke subset
([`smoke.example.jsonl`](longmemeval/smoke.example.jsonl)) is **non-headline** and is
never substituted here (AE4).

### Aggregate (full `_s`, 470 retrieval-scored instances; 30 `_abs` excluded)

| Granularity | recall_all@5 | recall_all@10 | ndcg_any@5 | ndcg_any@10 | recall_all@50 | ndcg_any@50 |
|---|---|---|---|---|---|---|
| Session | 0.823 | **0.949** | 0.811 | 0.842 | n/a | n/a |
| Turn | 0.317 | 0.515 | 0.349 | 0.423 | **0.915** | 0.508 |
| Turn → session (derived) | 0.636 | 0.853 | — | — | n/a | n/a |

**Reading.** Session-level retrieval is strong — for **94.9%** of instances *every* gold
session is in the top-10 (`recall_all@10`). Turn-level is the harder, finer granularity
(getting *all* gold rounds in the top-10 is 51.5%, but by @50 it reaches 91.5%) — the
evidence is reachable; ranking all of a multi-round gold set into a shallow cutoff is
what's hard. The `turn → session` row maps the per-turn ranking back to sessions
(`turn_to_session`): 0.853 @10, slightly below the direct session ranking, i.e. the
finer corpus is a bit noisier for session recovery.

### Per-ability (full `_s`)

| Ability (`question_type`) | n | session recall_all@10 | turn recall_all@10 | turn recall_all@50 | session ndcg_any@10 |
|---|---|---|---|---|---|
| single-session-assistant | 56 | 1.000 | 0.911 | 1.000 | 0.915 |
| single-session-user | 64 | 0.969 | 0.641 | 0.984 | 0.724 |
| single-session-preference | 30 | 0.967 | 0.567 | 0.867 | 0.731 |
| knowledge-update | 72 | 0.958 | 0.431 | 0.861 | 0.838 |
| multi-session | 121 | 0.950 | 0.421 | 0.917 | 0.909 |
| temporal-reasoning | 127 | 0.906 | 0.402 | 0.882 | 0.833 |

### Date-sensitive abilities — `recall_any` beside `recall_all` (the honest localization)

The engine does **no date-aware ranking**, so for knowledge-update ("latest wins") and
temporal-reasoning ("which came first"), strict `recall_all` can understate memory
quality when the gold set spans several sessions. Reporting `recall_any` (≥1 gold
retrieved) beside the gold-set-size distribution localizes the gap:

| Ability | session recall_all@10 | session **recall_any@10** | turn recall_any@10 | gold-set size (min/mean/max) |
|---|---|---|---|---|
| knowledge-update | 0.958 | **1.000** | 0.722 | 2 / 2.0 / 2 |
| temporal-reasoning | 0.906 | **1.000** | 0.709 | 1 / 2.2 / 6 |

**`recall_any@10` is 1.000 at the session level for both** — at least one gold session is
*always* in the top-10. So the sub-unity `recall_all` (0.958, 0.906) is **not** a memory
miss: it is the multi-gold ordering gap (temporal sets run up to 6 sessions) that a
date-aware ranker would close. The diagnostic localizes the gap to retrieval *ordering*,
not retrieval *coverage* or the reader — which is exactly the claim Phase 2 will lean on.

---

## Phase 2 — end-to-end QA headline (GPT-4.1 + GPT-4o columns)

**Rerun 2026-06-13** over the **full 500-Q `_s` set** (470 answerable + 30 abstention),
both reader columns graded by the canonical **`gpt-4o-2024-08-06`** judge, retrieval
frozen at the manifest params (k=10), via the OpenAI **Batch API** (50% cost). The run
was **not voided** (`verdict: reported`, embed-quiescent); **0 reader errors, 0 judge
errors** across 1,000 reads + 1,000 grades. The headline metric is **task-averaged
accuracy** (macro over the 6 `question_type` buckets, abstention excluded), reported
beside **Overall** (micro) and **Abstention** (30) — matching the official
`print_qa_metrics.py`.

| Reader column | Judge | Overall (micro) | Task-averaged (macro, headline) | Abstention (30) |
|---|---|---|---|---|
| GPT-4.1 (`gpt-4.1-2025-04-14`) — lead | `gpt-4o-2024-08-06` | **88.6** | **90.2** | 76.7 (23/30) |
| GPT-4o (`gpt-4o-2024-08-06`) — anchor | `gpt-4o-2024-08-06` | **83.6** | **87.1** | 66.7 (20/30) |

**Where this lands (the honest comparison).** On the matched **GPT-4o-reader /
GPT-4o-judge** axis — the only apples-to-apples memory-system comparison — Hypermnesic's
**anchor column (83.6 overall, 87.1 task-avg)** sits **on par with Mastra Observational
Memory (84.2)**, **+12 over Zep (71.2)**, and **+23 over the no-memory full-context
floor (60.2)**. The **lead column (88.6 / 90.2)** isolates the reader-strength gain at
the same canonical judge (+5.0 overall). Neither column is comparable to the GPT-4.1-
*judged* 95% leaderboard rows (OMEGA 95.4, Mastra 94.9) — that gap is a judge-leniency
artefact, not a memory-quality result (see the comparability envelope). **Release
caveat:** the anchors are on the **original** `_s`; we run **cleaned-2025-09**.

### Per-ability accuracy (full `_s`, both columns, `gpt-4o-2024-08-06` judge)

| Ability (`question_type`) | n | GPT-4.1 reader | GPT-4o reader |
|---|---|---|---|
| single-session-assistant | 56 | 98.2 | 98.2 |
| single-session-user | 64 | 95.3 | 93.8 |
| temporal-reasoning | 127 | 90.6 | 82.7 |
| knowledge-update | 72 | 88.9 | 87.5 |
| multi-session | 121 | 81.8 | **73.6** |
| single-session-preference | 30 | 86.7 | 86.7 |
| **Task-averaged (macro)** | — | **90.2** | **87.1** |

**Reading — Phase 1 localization, confirmed end-to-end.** The GPT-4o reader's weakest
bucket is **multi-session (73.6)** — precisely the ability Phase 1 flagged as a
*retrieval-ordering* gap (multi-gold sets, `recall_all@10` 0.95 but `recall_any@10`
1.00). The stronger GPT-4.1 reader recovers it to **81.8** from the *same* retrieved
context, which still localizes the residual to **reader synthesis over multi-session
evidence**, not retrieval coverage. Date-sensitive abilities (temporal-reasoning,
knowledge-update) hold up well (90.6 / 88.9 on the lead column) despite the engine's
no-date-aware-ranking gap, consistent with `recall_any@10 = 1.000` there — at least one
gold session is always retrieved, and a capable reader resolves the rest.

### Run provenance & cost

- **Mode:** OpenAI Batch API, single-model batches (a reader batch per model + one
  1,000-request judge batch), polled to completion. The run writes
  `results/qa.json` locally (gitignored alongside the corpus + embed cache, R15);
  the committed artifact is the aggregate/per-ability tables in this doc.
- **Cost:** the harness does not persist current-run token usage or billing totals. The
  comparable 2026-06-02 Batch API run cost **$31.30** (batch-priced) — GPT-4.1 reader
  $13.82 (13.55M in / 68k out), GPT-4o reader $17.18 (13.55M in / 48k out), judge
  $0.29 (229k in / 1.5k out) — under the $50 budget ceiling. The 2026-06-13 rerun used
  the same Batch API path and completed with 0 reader errors and 0 judge errors. A prior
  mixed-model batch failed validation at **$0** (the Batch API rejects multi-model
  batches; fixed to one reader batch per model + a regression test, then re-run clean).

---

## Reproducing this (the F2 / F3 paths)

Everything needed is pinned in [`manifest.json`](longmemeval/manifest.json) (dataset
URL + content hash + release, embedding/reader/judge snapshots, frozen retrieval
params, prompt-template version, seed). A third party with an OpenAI key reproduces
the number from the committed harness + manifest alone (R16, flow F3):

```bash
# 0. Install the harness extras used by the paid reader path. `tiktoken` is in the
#    bench extra, not the default/dev install.
uv sync --extra dev --extra bench

# 1. Fetch the pinned dataset by hash (fails loud on mismatch; nothing written on a
#    divergent download). Pin DATASET_SHA256 in manifest.py on first acquisition.
uv run python -c "from longmemeval import manifest as m; \
  m.download_dataset('harness/longmemeval/longmemeval_s_cleaned.json', \
  expected_sha256=m.DATASET_SHA256)"

# 2. Phase 1 — retrieval diagnostic (F2; embeddings-only; content-hash cached).
PYTHONPATH=harness uv run python harness/longmemeval/diagnostic.py \
  --dataset harness/longmemeval/longmemeval_s_cleaned.json \
  --out harness/longmemeval/results/diagnostic.json

# 3. Phase 2 — end-to-end QA headline (GATED; both reader columns; shared judge).
#    --confirm-paid-run is required; without it the runner prints the gate + cost
#    estimate and refuses to spend (R17, this unit's Execution note).
PYTHONPATH=harness uv run python harness/longmemeval/qa.py \
  --dataset harness/longmemeval/longmemeval_s_cleaned.json \
  --out harness/longmemeval/results/qa.json \
  --batch \
  --confirm-paid-run
```

CI runs **only** the offline smoke (`tests/test_longmemeval_harness.py`), never a paid
run. The downloaded dataset, materialized corpus, per-question outputs, and the embed
cache are **gitignored**; only the manifest, the synthetic smoke subset, and the
aggregate/per-ability numbers in this doc are committed (R15).

### Estimated cost

- **Phase 1 (embeddings-only):** ceiling **≈ $15** (≈115M tokens across both corpora at
  `text-embedding-3-large` list price; see `cost_assumptions` in the manifest). The
  content-hash embedding cache makes re-runs and the F3 critic re-run far cheaper.
- **Phase 2 (reader + judge):** **actual $31.30** via the Batch API for the full 500-Q
  two-column headline in the comparable 2026-06-02 run (GPT-4.1 reader $13.82 + GPT-4o
  reader $17.18 + shared judge $0.29), against a $50 budget. The 2026-06-13 rerun used the
  same Batch API path; the harness result records correctness/errors, not current-run
  billing. The content-hash embedding cache makes the retrieval phase free on re-runs. The
  sync (non-batch) path costs ~2× the same tokens.

---

## Contamination disclosure

The LongMemEval dataset is **public on Hugging Face** (`xiaowu0162/longmemeval-cleaned`,
MIT). A model's pre-training may have seen it, which can inflate end-to-end QA. We pin
the **cleaned-2025-09** release by content hash and disclose this so the number is read
with the appropriate caveat; the **retrieval diagnostic** (Phase 1) is unaffected by
reader pre-training, which is part of why it is the higher-confidence signal.

---

## Corrections log

The integrity discipline of `PARITY_VERDICT.md`: every methodology correction or param
change is recorded here, in the open.

- **2026-06-02 — dataset pinned.** Acquired the `longmemeval-cleaned` (2025-09) `_s`
  release and pinned its SHA-256 (`d6f21ea9…c3a442`, 500 instances, 277,383,467 bytes)
  in `manifest.py`/`manifest.json`. A re-download is now verified strictly against this
  hash. Schema confirmed against the harness: 6 `question_type` buckets
  (single-session-user 70, single-session-assistant 56, single-session-preference 30,
  multi-session 133, knowledge-update 78, temporal-reasoning 133) + 30 abstention. The
  Phase-1 pipeline (materialize → index → retrieve → score) was verified to run
  end-to-end on real instances offline (FakeEmbedder); at this point the diagnostic +
  headline runs were still pending an API key + budget (both have since run — see below).
- **2026-06-02 — Phase 1 retrieval diagnostic run (full `_s`).** Ran F2 over all 500
  instances (470 retrieval-scored, 30 `_abs` excluded) with the production embedder;
  embed-quiescent (`verdict: reported`, not voided). Per-ability counts are
  non-abstention, so they sum to 470 (the 30 abstention instances fall across the
  buckets). The run was killed by unrelated concurrent box activity at 448/500 and
  resumed for free from the content-hash embedding cache (no re-embed of the done
  instances) — no tuning, no param change between the two segments. Numbers committed
  are aggregate/per-ability only (R15); per-instance outputs + the embed cache stay
  gitignored.
- **2026-06-02 — Phase 2 QA headline run (full `_s`, both columns).** Ran the end-to-end
  QA (retrieval → reader → `gpt-4o-2024-08-06` judge → scorer) over all 500 instances for
  **both** reader columns via the OpenAI Batch API. `verdict: reported` (not voided), 0
  reader/judge errors across 1,000 reads + 1,000 grades. Results: GPT-4.1 reader **88.6
  overall / 89.7 task-avg / 76.7 abstention**; GPT-4o reader **83.2 / 86.6 / 70.0**.
  Actual spend **$31.30** (batch), under the $50 budget. No tuning, no param change vs
  Phase 1 (same frozen k=10 retrieval). A first attempt failed Batch API validation at
  **$0** because it mixed both reader models in one batch (the Batch API requires a
  single model per batch); fixed to one reader batch per model + a regression test, then
  re-run clean.
- **2026-06-02 — anchor figures re-verified at publication; OMEGA→Mastra correction.** Per
  the per-row attribution commitment (R21), the cited GPT-4o-judge anchors were re-verified
  against primary sources before publishing. This **corrected a planning-doc error**: the
  ~84.2 GPT-4o-judge figure was attributed to "OMEGA" but actually belongs to **Mastra
  Observational Memory** (GPT-4o reader + GPT-4o judge, Mastra research page). OMEGA
  publishes **no** GPT-4o-judge row — its 95.4 is GPT-4.1-judged (`omegamax.co/benchmarks`).
  Verified anchors (all original-release `_s`, GPT-4o reader + GPT-4o judge): full-context
  **60.2** and Zep **71.2** (arXiv 2501.13956), Mastra **84.2**. The earlier "~11-pt reader
  swing (same system, OMEGA)" claim conflated two different systems and was removed; the
  reader swing was reported from our own two columns at a *fixed* judge (+5.4 overall in
  the 2026-06-02 run).
- **2026-06-13 — public-release evidence refresh.** Reran Phase 1 and Phase 2 on current
  release code after retrieval/index-adjacent changes landed since the 2026-06-02 run.
  Phase 1: `verdict: reported`, `headline: true`, 470 retrieval-scored instances + 30
  abstention excluded, session `recall_all@10` unchanged at **0.949**. Phase 2: Batch API,
  `verdict: reported`, `headline: true`, 0 reader errors and 0 judge errors across 1,000
  reads + 1,000 grades. Results: GPT-4.1 reader **88.6 overall / 90.2 task-avg / 76.7
  abstention**; GPT-4o reader **83.6 / 87.1 / 66.7**. The rerun also exposed a
  reproducibility doc gap: the paid reader path imports `tiktoken`, which lives in the
  `bench` extra, so the reproduction commands now install `uv sync --extra dev --extra
  bench` and use `--batch` to match the recorded run mode.
- **2026-06-13 — later maintenance-impact review; no third paid rerun.** After the
  public-release evidence refresh, follow-on release work fixed stale/orphaned lexical
  projections in persistent indexes and refreshed contributor docs. That maintenance does
  not change the LongMemEval dataset, materialization, prompts, scoring, model snapshots,
  frozen retrieval params, or the benchmark's fresh isolated per-instance index build. The
  two completed 2026-06-13 benchmark runs therefore remain the release evidence; a third
  paid Batch API rerun would add cost without changing the measured benchmark surface.

---

## Sources & references

- Harness: [`harness/longmemeval/`](longmemeval/); manifest:
  [`manifest.json`](longmemeval/manifest.json); offline tests:
  `tests/test_longmemeval_harness.py`.
- LongMemEval: `github.com/xiaowu0162/LongMemEval`; dataset
  `xiaowu0162/longmemeval-cleaned` (HF, MIT); paper arXiv 2410.10813.
- Reader/judge comparability anchors (re-verified 2026-06-02): Zep arXiv 2501.13956
  (full-context 60.2, Zep 71.2; GPT-4o reader + GPT-4o judge); Mastra research page,
  *Observational Memory* (84.2; GPT-4o reader + GPT-4o judge); OMEGA leaderboard
  `omegamax.co/benchmarks` (OMEGA 95.4 / Mastra 94.9; **GPT-4.1 judge**, shown only to
  scope the judge axis — not comparable to our GPT-4o-judge columns).