# MaxSim Optimization **Implemented:** 2026-03-24 **Result:** 47x speedup on MaxSim scoring, zero quality regression. **Benchmark:** GenCodeSearchNet MRR@10 83.64% (unchanged), latency p50 942ms → 502ms. --- ## Architecture Three-tier MaxSim acceleration with automatic fallback: | Tier | Engine | Speedup | Portability | |------|--------|---------|-------------| | 1 | Native Rust + Rayon | 47x | Platform-specific `.node` addon | | 2 | WASM SIMD (Rust/LLVM) | 16x | Universal — `core/infrastructure/maxsim.wasm` (4.4KB, ~1.6KB gzipped) | | 3 | JS (norm-cached flat buffer) | 3.5x | Universal — pure JS fallback | Auto-detected at startup. Tier 1 requires building with `npm run build:native`. Tier 2 ships in the package. Tier 3 is always available. ### Key files - `core/late-interaction-index.js` — MaxSim scoring, 3-tier dispatch - `core/simd-distance.js` — WASM/native loader, tier routing - `core/infrastructure/maxsim.wasm` — Rust-compiled WASM SIMD kernel - `crates/wasm-maxsim/` — Rust source for WASM kernel - `crates/sweet-search-native/` — Rust source for native addon (rayon + NEON/AVX2) --- ## What was implemented ### Always-on (no quality impact) 1. **`quantizeToInt8` bugfix** — fixed `Math.min(...floatArray)` stack overflow for 1024+ token documents at 128d. 2. **Norm caching** — pre-computes L2 norms for query and document tokens. Avoids 3x redundant FLOPs per cosine similarity. 3. **Flat buffer scoring** — `maxSimScoreFlat()` operates on contiguous Float32Array with offset indexing, avoiding N sub-array allocations per candidate. 4. **Pooled dequantization buffer** — reuses a single Float32Array across candidates instead of allocating 256KB per candidate. 5. **WASM SIMD kernel** (`core/infrastructure/maxsim.wasm`) — Rust-compiled f32x4 SIMD with LLVM optimization. Handles arbitrary dimensions (scalar tail for non-multiple-of-4). 6. **Native Rust + Rayon** (`crates/sweet-search-native/`) — parallel candidate scoring across all CPU cores. Zero-copy buffer access via napi-rs. 7. **`tokenNorms` persistence** — computed at `add()` time, reconstructed on `load()` via `_rebuildTokenNorms()`. ### Opt-in (quality/speed tradeoff) Available via `scoreWithLateInteraction(query, candidates, options)`: | Option | Effect | Quality impact | |--------|--------|---------------| | `maxCandidates: N` | Only MaxSim-score top N by initial score | Risky if initial scores don't correlate | | `maxQueryTokens: N` | Keep N highest-norm query tokens | <1% at 75% budget | | `maxDocTokens: N` | Stride-subsample document tokens | -1.5% to -8% depending on aggressiveness | ### Evaluated and not shipped - **Block-Max MaxSim** — component-wise Cauchy-Schwarz upper bounds. Mathematically correct but ineffective for unit-normalized ColBERT-style vectors (bounds always ~1.0 while actual similarities ~0.3). Would activate for models with varying token norms (>0.05 range). - **Fused WASM dequantization** — dequantize int8→f32 inside the WASM kernel. Slower than JS dequant + WASM f32 scoring because the per-element widening pipeline (i8→i16→i32→f32) costs more than the float32 copy savings. - **PLAID/EMVB** — ColBERT-scale infrastructure. Not justified at reranker scale. Do not promote if the result is only a faster micro-benchmark with no real search impact. --- ## Benchmark results ### Microbenchmark (synthetic, 512 tokens, Q=32, dim=128) | Candidates | Original JS | Native Rust+Rayon | Speedup | |------------|------------|-------------------|---------| | 10 | 55ms | 1.3ms | 42x | | 20 | 109ms | 2.5ms | 45x | | 50 | 273ms | 5.9ms | 47x | | 100 | 546ms | 14ms | 41x | | 231 | 1,261ms | 27ms | 47x | ### GenCodeSearchNet (6000 queries, profile=full, concurrency=12) | Metric | Before | After | Delta | |--------|--------|-------|-------| | MRR@10 | 83.5% | 83.64% | +0.1pp | | Recall@20 | 93.8% | 94.25% | +0.5pp | | Latency p50 | 942ms | 502ms | -47% | Per-language: Python 98.0%, Go 94.4%, Java 84.8%, JS 71.6%, PHP 77.6%, Ruby 75.4%. Zero regressions. ### Key learnings 1. **Sub-array allocation was the hidden bottleneck.** `Array.from()` per token per candidate cost more than the dot products. 2. **V8 JIT optimizes regular Arrays better than Float32Array.** Packed doubles with TurboFan beat typed array accessors. 3. **Block-Max requires non-uniform norms.** For L2-normalized ColBERT outputs, Cauchy-Schwarz bounds are trivially loose (~1.0 vs actual ~0.3). 4. **Fused WASM dequant is slower than JS dequant + WASM scoring.** The i8→i16→i32→f32 widening pipeline costs more than copying 256KB of float32. 5. **f32 precision improves quality.** The WASM kernel using f32 (model's native precision) produces marginally better rankings than JS f64. ## Build ```bash # WASM kernel (ships in package, universal) npm run build:wasm # Native addon (local build, platform-specific, optional) npm run build:native ```