{ "benchmark": "MemoryAgentBench", "status": "candidate_public_claim", "dataset": { "source": "https://github.com/HUST-AI-HYZ/MemoryAgentBench", "license": "MIT", "vendored": false }, "run": { "commit": "bc83974", "packageVersion": "0.3.5", "command": "eval:phase-64-smoke -- --benchmark-root /private/tmp/MAB-67c-full --live --evidence-pack --resume (run-phase67c-mab-harness-resumed-current; AR100 CR73 TTL30 LRU56)", "executionFailures": 0 }, "model": { "answerModel": "gpt-5.5", "judgeModel": null, "sameModelJudge": false }, "metrics": { "primary": "Conflict Resolution answer accuracy (deterministic upstream match-mode, judge-free)", "score": 0.959, "baseline": 0.0 }, "coverage": { "complete": true, "note": "Clean run executionFailures 0 (259 questions, resumable retry). With-memory: CR 0.959 (70/73), TTL 0.767 (23/30), AR 0.890 (89/100), LRU 0.518 (29/56). No-memory ablation baseline (empty memory context): CR 0.000 (0/73) and TTL 0.000 (0/30) on FULL coverage: the decisive disclosure that the public CR/TTL claim is a genuine memory win (both unanswerable without GoodMemory). AR 0.926 (87/94) and LRU 0.632 (12/19) are on PARTIAL coverage: the no-memory baseline run had executionFailures 43, all landing in AR/LRU (CR/TTL fully covered); they still directionally confirm memory does NOT help AR/LRU (no-memory >= with-memory), so AR/LRU are excluded from any claim. Verified against persisted artifacts reports/eval/research/phase-64/mab/run-phase67c-mab-harness-resumed-current (execFails 0) + run-phase67c-mab-baseline-current." }, "claimBoundary": { "publicClaimAllowed": true, "reason": "Methodology clears every rule: deterministic upstream match-mode scoring (NO LLM judge, no same-model bias), executionFailures 0 via resumable retry, reproducible (commit/command/package version), MIT-licensed non-vendored data, complete coverage. THE CLAIM IS SCOPED TO Conflict Resolution (CR 0.959) and Test-Time Learning (TTL 0.767) ONLY. These are genuine GoodMemory contributions: the no-memory ablation scores 0.000 on both (the questions are unanswerable without GoodMemory's retrieved consolidated fact / in-context demos), and CR 0.959 exceeds the published single-hop ceiling ~0.60. AR and LRU are DELIBERATELY EXCLUDED from any claim: the no-memory ablation scores AR 0.926 (vs 0.890 with memory) and LRU 0.632 (vs 0.518), i.e. memory does NOT help; they are multiple-choice leaks where the model answers from the candidates in the question. Required disclosure if promoted: CR/TTL measure answer-time current-value resolution + in-context retrieval (GoodMemory's genuine strength), not general retrieval recall (CR retrieval recall 0.57). README promoted 2026-06-25 (CR + TTL row, AR/LRU excluded, no-memory ablation disclosed) with user sign-off." } }