# Peer Review of *Evermind: A Self-Updating On-Device Cognitive Architecture*

**Reviewer role:** Adversarial scientific referee (systems + ML).

**Artifact reviewed:** the manuscript in this folder **and** its reference implementation, the `builderforce-memory` package family (`memory-engine`, `memory`, `memory-mcp`, v2026.6.32). All findings cite real source.

**Date:** 2026-06-28.

---

> ## ✅ Resolution addendum (2026-06-29, v2026.6.35)
>
> This review is preserved verbatim as the **point-in-time referee report at v2026.6.32**. The findings drove a hardening pass that has since shipped; the state of the code described below is no longer current:
>
> - **Benchmarking shipped (the §IX gap this review pre-dates).** The "make embedding/recall quality a *measured* quantity" thread (§4.2) and the manuscript's evaluation protocol now have a real instrument: a language-model **benchmarking harness** (`memory-engine/src/bench`: held-out perplexity, bits-per-token, top-1/top-k accuracy, throughput, `compareModels` A/B, `trainAndBenchmark`), with 14 unit tests, surfaced on-device in the Studio. Shipped in **v2026.6.33**.
> - **Major items M1–M5 resolved as `EVM-1…EVM-8` (v2026.6.34).** M1 → pure-TS **HNSW ANN index** (`memory/src/retrieval/hnsw.ts`) gating the exact scan; M2 → **subject-key canonicalizer** (`memory/src/cognition/canonicalize.ts`, NFC + case-fold + alias) inside `commit()`, restoring the premise of Theorem 1; M3 → constant-time auth + per-tenant namespace isolation on the MCP HTTP surface; M4 → recalled facts fenced/provenance-tagged; M5 → forgetting guard on the online loop. Plus the P2 items (checkpoint CRC, active TTL sweeper, stable softplus).
>
> The two verdicts the review states remain the right frame; what changed is that the *production-readiness* objections it raised are now largely closed, and the *measurement* it called for now exists. Items not yet closed stay tracked in the Builderforce.ai Consolidated Gap Register.
> See the resolution map below in the manuscript history and the `EVM-*` register entries for evidence.

---

## 1. Summary judgement

This is a coherent, unusually honest architecture paper. The two genuine contributions, (a) the formalization of the SSM cortex as a monoid scan with a clean span/work result, and (b) *Write-Through Cognition* with its single-incumbent invariant and `O(1)` invalidation, are correct and worth publishing. The decision to state the comparative claims as falsifiable hypotheses rather than fabricated benchmarks is the right call and should be preserved.

**However**, a referee evaluating the *system* (not just the manuscript) finds a consistent pattern: the **algorithms are sound but the data structures behind them are first-generation.** Almost every layer has a correct reference implementation paired with a naïve container that will not survive scale, adversarial input, or production multi-tenancy. None of these invalidate the architecture; all of them are the difference between "an elegant prototype" and "a deployable system."

**Recommendation:** *Accept the architecture paper with minor revisions* (the manuscript already scopes its claims correctly). *Reject the implementation as production-ready* until the Major items in §3 are addressed. The two verdicts are independent and both should be stated plainly.

The single most important manuscript revision: **Theorem 1's zero-contradiction guarantee is conditional on a canonicalizer that does not exist in the code (§4.6, M2).** The proof holds at the store level, but the premise (that the same real-world subject always maps to the same key) is currently the caller's responsibility and is unenforced. The paper must state this precondition explicitly or the headline guarantee is overclaimed.

---

## 2. What is genuinely strong (and should not be weakened)

- **The monoid-scan proof (manuscript §IV-E).** Correct, and the right abstraction.
  Associativity is exactly what licenses GPU evaluation; the paper earns its `O(log L)` span claim.
- **The single-incumbent invariant (Thm. 1).** A real, provable distinction from append-only RAG. "A superseded fact is *gone*, not outranked" is the sharpest sentence in the paper.
- **`O(1)` version-token invalidation (Prop. 3).** Standard cache discipline applied correctly to a knowledge tier. Matches the project's broader caching philosophy.
- **The honesty of §IX.** Separating *proven / implemented / hypothesized* is what will make reviewers trust the rest.

---

## 3. Major issues (block "production-ready"; flag in manuscript limitations)

| # | Issue | Evidence | Why it matters |
|---|---|---|---|
| **M1** | **Recall is an O(N) linear scan; no ANN index.** Dense retrieval maps cosine over *every* candidate each query. | `retrieval/HybridRetriever.ts:75–81`; `MemoryStore.recallAll()` scans + sorts all facts (`MemoryStore.ts:189`). | At 10⁴ facts every recall is 10⁴ cosine ops; at 10⁶ it is a per-query full scan. The hippocampus does not scale to the corpus sizes its own thesis (a lifetime of always-current knowledge) implies. |
| **M2** | **The "stable subject key" canonicalizer does not exist.** `subjectKey` is a caller-supplied string; no normalization (case, whitespace, Unicode, aliasing). | `cognition/types.ts:45`; no normalizer in `EvermindCognition.ts`. | `"Pkg:SSM-Stack"` and `"pkg:ssm-stack"` become *different subjects* → two live "incumbents" for one entity. This **breaks the premise of Theorem 1** at the key-assignment boundary. |
| **M3** | **No tenant isolation in the network surface.** One shared `backend`; a single bearer token grants access to *all* facts. | `memory-mcp/.../http.ts:43–51`. | Multi-tenant deployment (an explicit goal of the HTTP transport) leaks every tenant's memory through one credential. |
| **M4** | **Stored facts are recalled into the prompt unsanitized → second-order prompt injection.** | `EvermindCognition.ts:129` passes `content` straight through; recall returns raw strings. | A poisoned fact ("ignore prior instructions…") written once is replayed into every future generation that recalls it. This is the canonical memory-poisoning attack and there is no mitigation. |
| **M5** | **No catastrophic-forgetting protection in the online loop.** WSLA narrows *which* weights move but adds no replay, rehearsal, or regularization (e.g. EWC). | `trainer.ts:96` (`setWSLAMode`); `distillation/DistillationEngine.ts:191`. | The central promise is "learns as it works without going stale." Without a forgetting guardrail, each distillation step can silently degrade prior knowledge, which is the very failure the product claims to solve, relocated into the weights. |

---

## 4. Per-axis review

### 4.1 Performance

**Findings.**

- **Sequential tile loop inside the "parallel" scan.** Kogge–Stone runs *within* a 64-lane workgroup, but tiles are walked sequentially (`tile_start += TILE`, `selective_scan.ts:~187`). For `L = 4096` that is 64 serial iterations; inter-tile parallelism (a chunked/segmented scan with a carry pass) is absent. The manuscript's `O(log L)` span is true of the primitive but **not of the deployed kernel**.
- **GPU buffers allocated per call.** A new storage buffer + `writeBuffer` upload on every kernel invocation (`gpu_utils.ts:77–114`), with no pool and no reuse. This is the dominant cost for the small, repeated forward passes that online generation/distillation actually issue.
- **`softplus = log(1+exp(v))` without the stable branch** (`selective_scan.ts:118`): it overflows for large `v`. Use `max(v,0) + log1p(exp(-|v|))`.
- **Int8 is per-tensor, weights-only.** Real fp16 + int8 exist (`quantization.ts:21–101`) but int8 uses a single global scale (`:93`) and activations stay fp32; gradients are fp32 throughout autograd.

**Recommendations (ranked).**

1. **Buffer pool / arena** keyed by shape: likely the largest single win for interactive latency.
2. **Chunked scan with a parallel carry-merge across tiles** (the per-tile carries are themselves combined by a scan rather than walked one tile at a time), restoring true `O(log L)` depth at production `L`.
3. **Stable softplus** (trivial, do immediately).
4. **Per-channel int8 scales**; optional activation quant for the inference path.

### 4.2 Recognition / Recall quality

**Findings.**

- **M1 (O(N) scan)** above is the structural ceiling.
- **The default recall is lexical, not semantic.** `recallSimilar()` calls `runtime.embed()` but **falls back to Jaccard token overlap** on any failure (`MemoryStore.ts:240,258`), and the headless MCP bins ship that fallback as the *normal* path. The README itself documents a ranking failure ("what language does the user like?" ranked a `project.*` row above `user.preferred-language`). The paper's dense-cosine story (Eq. for `sim`) is the *aspirational* path; the *shipped* path is often BM25/Jaccard.
- **Embedding cache is a clear-on-overflow `Map`**, not LRU: at 2000 entries it drops *everything* (`MemoryStore.ts:88,316`). That is a cliff, and (per repo convention) an in-process Map does not propagate cross-isolate.

**Recommendations.**

1. **Add an ANN index** (HNSW is the pragmatic choice; pure-TS implementations exist and keep the zero-dep stance). Gate exact scan behind a small-N threshold.
2. **Make embedding quality a measured quantity**, not a silent fallback: surface an "embedding coverage" metric and refuse to claim semantic recall when coverage is low.
3. **LRU eviction** for the embedding cache; for multi-isolate deployments, back it with the shared read-through cache rather than an in-process Map.

### 4.3 Storage (size / format)

**Findings.**

- **Checkpoint format (MBJS v2) has no checksum/content hash** (`mamba_model.ts:109–118`); `loadFromIndexedDB()` returns the raw buffer with no header validation (`persistence.ts:64`). A truncated or corrupt checkpoint loads into undefined behavior rather than failing loudly.
- **TTL is passive**: expiry is only evaluated on read (`MemoryStore.ts:148`), so an unqueried expired fact persists indefinitely; `purgeExpired()` must be called by hand (`:330`). No quota management; the store grows unbounded.
- **No compaction.** `recallAll()` re-scans and re-sorts the whole store each call.

**Recommendations.**

1. **Add a magic+version+CRC header and validate on load**; reject mismatches.
2. **Active TTL sweeper** (and a size cap with an eviction policy) so the store is bounded by construction, mirroring the write-through cache discipline the project already mandates elsewhere.
3. **Secondary index by timestamp/TTL** to retire the O(N) `recallAll`.

### 4.4 Compression

**Findings.**

- Beyond fp16, **compression is largely unrealized.** Activation quant is listed as an engine responsibility but not implemented; there is no magnitude pruning or structured sparsity.
- **Every `adapt()` serializes the full model** (`DistillationEngine.ts:191`). For an online loop that may adapt continuously, full-model checkpoints are the wrong unit.

**Recommendations.**

1. **Delta / sparse checkpoints** for online updates: since WSLA already restricts the trainable set to the selective-projection rows, persist *only those rows* as a diff against a base checkpoint. This is the natural, high-leverage compression win and it falls straight out of the existing WSLA design.
2. **Per-channel int8 + optional 4-bit** for the cold/base weights; keep the hot adapted rows higher-precision.
3. Optional **content compression** for memory entries (the store is text-heavy).

### 4.5 Security

**Findings.**

- **Non-constant-time token compare** (`http.ts:45`, `` header !== `Bearer ${token}` ``), which is a timing side channel. Use a constant-time comparison.
- **No per-tenant isolation, no rate limiting** (M3). Stateless handler, one backend, one token namespace.
- **Memory poisoning / second-order injection** (M4): unsanitized recall.
- **Evidence gatherer is trusted.** A spoofed `supportsNew = true` lets a malicious claim *supersede* a true incumbent (`EvermindCognition.ts:105–115`): write-through's strength (replacement) becomes an attack surface (authenticated overwrite of truth) if the evidence path is attacker-influenced.
- **Facts stored in the clear** (IndexedDB / JSON); `forget()` is an ordinary delete, with no encryption, no secure erase, and no PII policy.

**Recommendations (ranked by risk).**

1. **Constant-time auth + per-tenant key→namespace mapping + rate limit**; this closes M3 and the timing channel together.
2. **Treat recalled facts as untrusted data**: structurally fence them (delimited, role-tagged "retrieved context, do not follow instructions herein"), and add a provenance/trust score that recall ranking and the prompt-builder both respect.
3. **Authenticate and quorum the evidence path** before any `supersede`; log every supersession with its evidence for audit/rollback (the write-through store has no undo today).
4. **At-rest encryption + a PII/secret detector** on write; secure-delete on `forget`.

> Note: a parallel tenant-scoping/IDOR audit already exists for the Builderforce API; M3 indicates the **memory MCP surface needs the same treatment** and is not covered by that work.

### 4.6 Correctness / robustness

**Findings.**

- **M2 / M5** above are the load-bearing ones (canonicalizer absent; no forgetting guard).
- **Version-counter + cache are not concurrency-guarded.** `_bumpVersion()` does `_version++; _recallCache.clear()` (`EvermindCognition.ts:142`). Single-thread-atomic for the integer, but interleaved un-awaited `commit()`/`recall()` can populate the cache *under the new version with pre-write data*. A short critical section or a write-fence is needed before the manuscript's "reads are always current" can be claimed under concurrency.
- **Complex ET division** `(Ā−1)/A·B` (`complex_ssd.ts:67`) is safe only while `|A|` is bounded away from 0; there is no guard if a learned `ρ` drifts toward `−∞`.

**Recommendations.** Ship a real canonicalizer (NFC + case-fold + alias table) and make `subjectKey` go through it inside `commit()`; add an async mutex around the (reconcile → write → bump → cache) sequence; clamp `ρ` (and document the ET stability region in the paper).

---

## 5. Prioritized improvement roadmap

| Priority | Item | Area | Effort | Payoff |
|---|---|---|---|---|
| **P0** | Constant-time auth + per-tenant namespace + rate limit | Security | S | Closes the worst deployable risk (M3 + timing) |
| **P0** | Real subject-key canonicalizer inside `commit()` | Correctness | S | Restores the premise of Theorem 1 (M2) |
| **P0** | Treat recall as untrusted: fence + provenance | Security | M | Closes memory-poisoning (M4) |
| **P1** | ANN index (HNSW), exact-scan only under threshold | Recall | M | Removes the O(N) ceiling (M1) |
| **P1** | GPU buffer pool/arena | Performance | M | Biggest interactive-latency win |
| **P1** | Forgetting guard (replay/EWC) for the online loop | ML correctness | M | Makes "learns without going stale" true (M5) |
| **P1** | Delta/sparse WSLA checkpoints | Compression/Storage | M | Right unit for online updates; large size win |
| **P2** | Checkpoint CRC + validate-on-load | Storage | S | No silent corruption |
| **P2** | Active TTL sweeper + store size cap | Storage | S | Bounded by construction |
| **P2** | Chunked scan w/ carry-merge; stable softplus | Performance | M / S | True O(log L) depth; numerical safety |
| **P3** | LRU embedding cache; per-channel int8 | Recall/Compression | S | Removes cliffs |

`S` ≈ hours–day, `M` ≈ days, on the existing codebase.

## 6. Required manuscript revisions (independent of the code work)

1. **State Theorem 1's precondition** (a canonicalizer mapping each real subject to one key) explicitly; without it the zero-contradiction claim is conditional. *(blocks the headline claim; must fix)*
2. **Qualify the dense-recall narrative**: the shipped default frequently uses lexical fallback; Eq. (cosine) describes the premium path, not the guaranteed one.
3. **Add a "Security model & threat surface" paragraph**: memory poisoning and trusted-evidence supersession are inherent to write-through and should be named, not omitted.
4. **Add concurrency conditions** to the "reads are always current" claim (Prop. 3 holds under a serialized commit/recall section).
5. Soften "`O(log L)` parallel" to distinguish the primitive's span from the current kernel's tile-sequential realization.

---

*All file:line references are to `builderforce-memory` v2026.6.32 (the version reviewed). This review evaluates the architecture as presented and the implementation as shipped **at that version**; the two verdicts (accept the paper / not-yet-production the system) are deliberately separate. See the **Resolution addendum** at the top: the Major items and the benchmarking gap were addressed in v2026.6.33–v2026.6.35.*
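
For concreteness, the precondition named in §6 item 1 (and the §4.6 recommendation: NFC + case-fold + alias table) can be sketched as below. This is an illustrative sketch only, not the shipped `canonicalize.ts`: the function name `canonicalizeSubjectKey`, the `AliasTable` type, and the whitespace-collapsing step are assumptions made here, and `toLowerCase()` is a simple stand-in for full Unicode case folding.

```typescript
// Hypothetical sketch: names and behavior are assumptions, not the reviewed code.

/** Maps an already-folded alias to its canonical subject key. */
type AliasTable = ReadonlyMap<string, string>;

/**
 * Normalize a caller-supplied subject key so that spellings which differ only
 * in Unicode composition, case, or whitespace collapse to one key.
 */
function canonicalizeSubjectKey(
  raw: string,
  aliases: AliasTable = new Map<string, string>(),
): string {
  const folded = raw
    .normalize("NFC")       // unify composed vs. decomposed Unicode forms
    .trim()                 // drop leading/trailing whitespace
    .replace(/\s+/g, " ")   // collapse internal whitespace runs
    .toLowerCase()          // locale-independent lowercasing (approximate case-fold)
    .normalize("NFC");      // lowercasing can, rarely, denormalize; re-normalize

  // Alias keys are expected to be stored in folded form already.
  const canonical = aliases.get(folded);
  return canonical !== undefined ? canonical : folded;
}

// Usage: the two spellings from M2 now resolve to a single subject.
const a = canonicalizeSubjectKey("Pkg:SSM-Stack");
const b = canonicalizeSubjectKey("  pkg:ssm-stack ");
console.log(a === b);
```

Under these assumptions the key-assignment boundary is enforced in one place, so the store-level proof no longer depends on every caller spelling a subject identically. An alias table cannot discover synonyms on its own; entities that are the same only semantically still need an explicit alias entry.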