# LI Quantization Strategy > **Goal**: Minimize memory footprint, disk size, load time, and MaxSim > scoring latency of the late interaction index so Sweet Search runs well > on developer laptops with 8-16 GB RAM, while preserving retrieval > quality (NDCG@10 regression < 0.5 pp). **Status**: Binary LI storage and INT4 quantization are implemented. Plain INT4 is now the default product path for late-interaction indexes. WHT remains available for A/B work but is not the default. Full TurboQuant remains deferred. **Implementation approach**: All kernels are built from scratch — no external quantization/scoring dependencies. Existing open-source implementations (see [Reference Implementations](#reference-implementations-algorithm-reference-only--no-dependencies)) serve as algorithm references only. This keeps the dependency tree minimal, avoids license entanglements, and ensures every kernel is tuned for our specific pipeline (asymmetric float32-query × quantized-doc MaxSim scoring on L2-normalized code embeddings at d=128). **Related**: `INFERENCE_SPEEDUP_PLAN.md` covers ONNX forward-pass speedups (worker threads, warmup, session config) for both the embedding and LI models. That plan speeds up inference; this plan speeds up storage, loading, and MaxSim scoring. Gains are multiplicative. ## Current Decision - Implemented: fully binary LI segment storage, persisted LI quantization metadata, INT4 doc-token storage, and query-time loading that follows stored index metadata. - Default shipping path: plain INT4, no WHT, no pooling, no Phase 5 work. - Benchmarked outcome: INT4 matched the prior INT8 baseline closely enough to ship while materially reducing LI index size. - Deferred: WHT as a product default, aggressive token pooling, Matryoshka-dependent paths, and full TurboQuant / PolarQuant / QJL work. ### Current Outcome Summary | Area | Current Result | Notes | |------|----------------|-------| | **LI storage** | Materially smaller with INT4 | Full benchmark runs showed the LI index shrinking from the previous INT8 baseline without requiring any query-time flags. | | **Retrieval quality** | Good enough to ship | Full A/Bs showed plain INT4 staying effectively in line with the shipped INT8 baseline. | | **WHT rotation** | Deferred from default | Benchmarks were mixed: small or inconsistent quality gains, with real extra indexing/query work in some runs. | | **Pooling** | Rejected as-is | `poolFactor=2` caused unacceptable quality loss on the benchmark suite. | | **Phase 5 / TurboQuant** | Deferred | Not needed to justify the current product default after INT4 validated successfully. | ### Native Binary Strategy All new kernels (4-bit dequant, WHT-rotated scoring) ship as: 1. **Native N-API addon** per platform (`@sweet-search/native-{platform}-{arch}`) — Rayon parallel + NEON/AVX2 SIMD 2. **WASM SIMD fallback** (universal, `core/infrastructure/maxsim.wasm`) — single-threaded f32x4 3. **Pure JS fallback** — always available This follows the existing 3-tier architecture from INIT_PLAN.md and matches how `crates/sweet-search-native/` is already built and shipped. --- ## Background [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) is a data-oblivious vector quantization algorithm from Google that compresses vectors to 3-4 bits with near-zero quality loss. Its core insight: **random orthogonal rotation (Walsh-Hadamard) equalizes coordinate variance, making aggressive scalar quantization safe.** A multi-agent research swarm (6 specialists: quantization math, WASM engineering, IR quality, performance, devil's advocate, architecture) analyzed TurboQuant's applicability to Sweet Search's late interaction pipeline. This plan synthesizes those findings with additional review from Codex. ### Why Not Full TurboQuant? Full TurboQuant (PolarQuant + QJL) is **not** the right first move: | Concern | Detail | |---------|--------| | **Wrong validation domain** | Validated on LLM KV caches (unnormalized), not L2-normalized retrieval embeddings. Distributional assumptions may differ. | | **d=48 disqualified** | The edge model (`lateon-code-edge`, d=48) is outside the high-dimensional concentration regime. Full TurboQuant cannot be a universal solution across both model tiers. | | **QJL complexity for marginal gain** | With asymmetric scoring (float32 query x quantized doc), QJL's bias correction is less critical. Adds 16 bytes/token overhead + random projection matrix management. | | **3-bit packing is SIMD-hostile** | No native 3-bit extract in NEON or WASM SIMD. turbo3 decoding is ~1.5x slower than INT8; turbo4 (nibble) is at parity. | | **We build our own** | Several reference implementations exist in Rust/Python (see [Reference Implementations](#reference-implementations-algorithm-reference-only--no-dependencies)), but we write our own kernels tuned for asymmetric float32-query × quantized-doc MaxSim at d=128. No external quantization dependencies. | | **Bigger wins available first** | JSON storage overhead and token count are higher-leverage targets. | ### What We Take From TurboQuant The **rotation insight** is real and proven in production (Weaviate's [8-bit Rotational Quantization](https://weaviate.io/blog/8-bit-rotational-quantization)): random rotation equalizes dimension variance so that scalar quantization captures more information per bit. We already ship the infrastructure (`walshHadamardTransform`, `fastRotate`, `generateSignVector` in `core/infrastructure/quantization.js:39-91`, re-exported from `core/embedding/embedding-service.js:337-346`). --- ## Current State **Index artifact**: Segmented directory with `manifest.json` + per-segment `.bin` files. Each segment has a 64-byte binary header (magic `0x4C495345` / "LISE", version u16, docCount u32, tokenDim u16, useInt8 u8, 51 bytes padding) followed by a JSON body containing the document array with `Array.from(doc.tokens)` serialized int8 values (`core/ranking/late-interaction-index.js:221-245`). A stub JSON at the main index path points to the segment directory. Legacy single-file JSON (v2.1) is still supported for small indexes. **Quantization**: Global per-document min/scale INT8 (`core/ranking/late-interaction-index.js:21-48`). Single `min` and `scale` across ALL tokens of a document — outlier dimensions waste quantization bins for every other dimension. **On-disk format**: Segmented binary-header + JSON-body hybrid. Each segment is written via `Array.from(doc.tokens)` (lines 227, 735 in `core/ranking/late-interaction-index.js`). A `manifest.json` per segment directory tracks document counts and paths. A stub at the main index path points to the segment directory. Legacy single-file JSON is still supported for small indexes. Int8 byte values are serialized as decimal ASCII integers with commas and brackets. Measured overhead: **~3.4x** bloat vs raw payload. **Measured on a real index** (Codex-verified): - 17,034 documents, 3,141,806 tokens (184.4 avg tokens/doc) - Raw INT8 payload: **383.5 MiB** (3.14M tokens x 128 bytes) - On-disk JSON file: **1.343 GiB** (3.4x overhead from JSON encoding) **Scoring kernels** (3-tier, `core/infrastructure/simd-distance.js`): - Tier 1: Native Rust + Rayon (`crates/sweet-search-native/src/lib.rs`) — 47x JS - Tier 2: WASM SIMD (`crates/wasm-maxsim/src/lib.rs`) — 16x JS - Tier 3: Pure JS fallback **Models**: - `lateon-code`: 149M params, d=128 (2^7, WHT-friendly) - `lateon-code-edge`: 17M params, d=48 (NOT power-of-2, needs padding to 64) --- ## Historical Implementation Plan The remaining sections preserve the original implementation plan and research notes. They are useful as design history, but they are not the current product decision. The active default is the INT4 path summarized above. ### Phase 0: Binary LI Storage Format **What**: Replace the JSON body inside each segment's binary-header+JSON hybrid format with a fully binary payload. The current segment format already has a 64-byte binary header (magic "LISE", version, docCount, tokenDim, useInt8) but the body is still JSON-serialized `Array.from()` int8 values — the same ~3.4x ASCII bloat as the old monolithic format, just split across segment files. **Why**: The single highest-leverage change. Zero quality risk, zero kernel changes, ~3.4x disk reduction and faster load times. On an 8 GB laptop, going from 1.34 GB to ~400 MB on disk (and in I/O buffer during load) is the difference between usable and not. **Format**: Each segment keeps the `.bin` extension but upgrades from the current "LISE" hybrid format (binary header + JSON body) to the "SSLX" fully-binary format. The existing `manifest.json` gains a `"format": "sslx-v3"` field. Backward compatibility: `_readSegmentFile()` checks the magic — `0x4C495345` ("LISE") uses the legacy hybrid path, `0x53534C58` ("SSLX") uses the new fully-binary path. Segment file names remain `segment-NNNN.bin`. Per-segment binary format (`segment-NNNN.bin`, SSLX v3): ``` HEADER (fixed 64 bytes): [0..3] magic: "SSLX" (Sweet Search Late indeX) [4..5] version: u16 = 3 [6..6] quantBits: u8 (8 = int8, 4 = future) [7..7] tokenDim: u8 (128 or 48) [8..11] numDocuments: u32 [12..15] poolFactor: u8, reserved[3] [16..47] modelId: 32 bytes utf8 (zero-padded) [48..51] whtSeed: u32 (0 = no rotation, >0 = WHT rotation applied) [52..63] reserved (zero) DOCUMENT TABLE (numDocuments x 20 bytes): Per entry: [0..3] tokenDataOffset: u32 (byte offset into token slab) [4..5] numTokens: u16 [6..9] min: f32 [10..13] scale: f32 [14..19] reserved ID TABLE: [0..3] totalIdBytes: u32 Per document: [0..1] idLen: u16 [2..N] id: utf8 bytes TOKEN SLAB (contiguous): Per document (in order): numTokens * tokenDim bytes (int8) numTokens * 4 bytes (tokenNorms as f32) FOOTER: [0..3] checksum: CRC32 of everything above ``` **Estimated sizes** (17K docs, 3.14M tokens, d=128): | Component | Size | |-----------|------| | Header | 64 B | | Doc table | 17K x 20 = 340 KB | | ID table | ~500 KB (avg 30-byte IDs) | | Token slab (int8) | 3.14M x 128 = 383.5 MiB | | Token norms | 3.14M x 4 = 12.0 MiB | | **Total** | **~396 MiB** (vs 1.343 GiB JSON = **3.4x smaller**) | **Backward compatibility**: `_readSegmentFile()` already checks the magic u32 at offset 0 (`core/ranking/late-interaction-index.js:251`). Add an `else if` branch for `0x53534C58` ("SSLX") that reads the new fully-binary layout. `_writeSegmentFile()` always writes SSLX v3. Old "LISE" hybrid segments still load. No migration tool needed — re-indexing produces the new format. The manifest stub at the main index path is unchanged. **Changes**: - `core/ranking/late-interaction-index.js`: - `_writeSegmentFile()`: Replace JSON body with packed binary (doc table + ID table + token slab + norms + CRC32) - `_readSegmentFile()`: Add SSLX branch for binary deserialization - `save()`: Write `"format": "sslx-v3"` into `manifest.json` - `core/infrastructure/config/ranking.js`: No changes needed **Go/no-go**: Load time < 2s for 17K docs. File size within 5% of theoretical minimum. --- ### Phase 1: Per-Token Quantization **What**: Change INT8 quantization from per-document min/scale to per-token min/scale. **Why**: The current scheme (`quantizeToInt8` at `core/ranking/late-interaction-index.js:21`) uses a single `min` and `scale` across the **entire flattened token buffer** of a document. If one token has an outlier dimension at -0.4 and another has all values in [-0.1, 0.1], the latter uses only ~25% of the INT8 range. Per-token quantization eliminates this waste. **Storage cost**: Two extra f32 values per token (8 bytes). At 128d, token size goes from 128 to 136 bytes (+6.25%). But quantization error drops significantly — each token uses the full INT8 range. **Quality impact**: Strictly positive. Better quantization fidelity at the same bit-width. **Kernel changes required**: The current kernels take **scalar** min/scale per candidate, not per-token arrays: - Native: `MaxSimCandidate` struct has `min: f64, scale: f64` (`crates/sweet-search-native/src/lib.rs:84-86`) - WASM: `maxsim_dequant()` takes `min: f32, scale: f32` as scalar params (`crates/wasm-maxsim/src/lib.rs:137-138`) - JS bridge: `doc.min, doc.scale` passed as scalars (`core/ranking/late-interaction-index.js:577`) Per-token quantization requires passing `Float32Array` of min/scale (one pair per token) instead of scalars. The dequant loop must index by token position. **Changes**: - `core/ranking/late-interaction-index.js`: - `quantizeToInt8()`: Per-token mode — compute min/scale per token slice of `tokenDim` values instead of across the entire buffer - `add()`: Store per-token `minArray` and `scaleArray` alongside tokens - Scoring bridge: Pass min/scale arrays to native/WASM kernels - Save/load: Persist per-token min/scale in binary segment format (8 bytes per token in the token slab) - `crates/sweet-search-native/src/lib.rs`: - `MaxSimCandidate`: Replace `min: f64, scale: f64` with `min_array: Float32Array, scale_array: Float32Array` - `dequantize()`: Accept token index, read per-token min/scale - `crates/wasm-maxsim/src/lib.rs`: - `maxsim_dequant()`: Accept `min_ptr` and `scale_ptr` (pointers to per-token float arrays) instead of scalar `min, scale` - Inner loop: Index min/scale by `token_idx` during dequant - `core/infrastructure/simd-distance.js`: - JS fallback: Update dequant to use per-token min/scale arrays **Go/no-go**: MaxSim score correlation (Kendall tau) >= 0.998 vs float32 ground truth on eval corpus. --- ### Phase 2: WHT Rotation + INT8 ("Poor Man's TurboQuant") **What**: Apply Walsh-Hadamard rotation before INT8 quantization. **Why**: After WHT rotation, coordinate values become near-Gaussian with equalized variance. This makes per-coordinate scalar quantization near-optimal. Weaviate reports 99.4% recall with this approach. Combined with per-token quantization from Phase 1, this squeezes maximum quality from 8 bits. **The key mathematical insight**: ``` = (Pi orthogonal) ``` Rotate query tokens once at query time (cost: ~0.1ms for 32 tokens at 128d). Score in the rotated domain. No inverse rotation ever needed. **Zero changes to the scoring kernels** — they're agnostic to whether vectors are in original or rotated space. **Implementation** (~40 lines of changes): At **index time** (in `add()`): ```js import { fastRotate, generateSignVector } from '../infrastructure/quantization.js'; // Generate deterministic sign vector on first add if (!this.signVector) { this.signVector = generateSignVector(this.tokenDim, this.whtSeed); } // Rotate each token before quantization for (let t = 0; t < truncated.length; t++) { truncated[t] = fastRotate(truncated[t], this.signVector); } // Then quantize as before (per-token from Phase 1) ``` At **query time** (in `scoreWithLateInteraction()`): ```js // Rotate query tokens once (amortized across all candidates) const rotatedQuery = queryTokens.map(q => fastRotate(new Float32Array(q), this.signVector) ); // Score with existing MaxSim kernels — no changes needed ``` **d=48 handling**: Pad to 64 (next power-of-2) for WHT, truncate result back to 48. `fastRotate()` already handles this (`core/infrastructure/quantization.js:74-91`). **Binary format**: Store `whtSeed` in header (already reserved at offset 48-51). If `whtSeed > 0`, query-time rotation is required. **Go/no-go**: Kendall tau >= 0.995 on eval corpus. Latency overhead < 0.5ms per query. --- ### Phase 3: Token Count Reduction **What**: Reduce average tokens per document from 184 to ~92 via pooling or learned pruning. **Why**: Token count is a **multiplicative** lever — halving tokens halves storage regardless of bit-width. The infrastructure already exists: `poolTokens()` in `core/ranking/late-interaction-model.js:358` and `poolFactor` config. ColBERTv2's headline was 6-10x reduction from token compression, not from sub-byte coding. **Options** (evaluated independently): 1. **poolFactor=2**: Average consecutive token pairs. Already implemented. Instant 2x token reduction. Quality impact: ~1-2pp MRR regression (based on ColBERT pooling literature). 2. **Norm-based pruning**: Drop tokens with L2 norm below threshold. `tokenNorms` already stored. Low-norm tokens carry little signal. 3. **Attention-score pruning**: Keep only tokens that the model's attention mechanism deemed important (requires model modification). **Concrete impact** (Phase 0+1+3 combined): | Configuration | Bytes/token | Tokens (17K docs) | Total | |---|---|---|---| | Current JSON INT8 | ~440 on disk | 3.14M | 1.34 GiB | | Phase 0 (binary) | 132 | 3.14M | 396 MiB | | Phase 0+1 (per-token quant) | 140 | 3.14M | 419 MiB | | Phase 0+1+3 pool=2 | 140 | 1.57M | 210 MiB | **Go/no-go**: MRR@10 regression < 2pp on GenCodeSearchNet. User-visible latency improvement from fewer MaxSim comparisons. --- ### Phase 4: WHT + 4-bit Quantization + Native Kernel Rewrite **What**: Drop from 8-bit to 4-bit per coordinate in the WHT-rotated domain, and simultaneously rewrite the MaxSim kernel for speed (both native and WASM). **Why**: After rotation, coordinate values are near-Gaussian with well-characterized variance. 4-bit (16 levels) captures sufficient information for near-float quality. Combined with per-token quantization, this gives **2x compression vs INT8**. The kernel rewrite eliminates redundant norm computation and per-thread allocations. **Storage format**: Packed nibbles (2 values per byte). 128d = 64 bytes/token. Plus 4 bytes min + 4 bytes scale per token = 72 bytes/token (vs 140 for Phase 1 INT8). #### MaxSim Kernel Optimizations (applies to all bit-widths) **Current redundant work in `crates/sweet-search-native/src/lib.rs`:** ```rust // d_norm_sq recomputed for every (qi, di) pair — Q*D times // when it only needs D times. With Q=32, D=100: 3200 vs 100. for qi in 0..num_q { for di in 0..num_d { let mut d_norm_sq: f32 = 0.0; // redundant! for i in 0..dim { d_norm_sq += d_slice[i] * d_slice[i]; } } } ``` **Fix**: Accept pre-stored `tokenNorms` from the binary format (Phase 0 already stores them). Eliminates ~40% of inner-loop FLOPs. **Current per-thread allocation:** ```rust // Line 141: allocates Vec per candidate per Rayon thread let mut doc_f32 = Vec::with_capacity(*num_d * *d); dequantize(int8_data, *min, *scale, &mut doc_f32); ``` **Fix with 4-bit**: Centroid LUT scoring works directly on packed data. No dequantization buffer needed. Replace per-thread `Vec` with in-place centroid lookup during the dot-product loop. **Current napi bridge overhead:** ```rust // Line 131: copies entire Int8Array from JS heap per candidate let int8: Vec = c.tokens.iter().map(|&b| b as i8).collect(); ``` **Fix with 4-bit**: 2x smaller copy (64 bytes/token vs 128). Consider zero-copy via shared ArrayBuffer for the token slab. #### Native N-API Kernel (`crates/sweet-search-native/src/lib.rs`) New entry point: `maxsim_score_batch_4bit()` with: - **Pre-stored norms**: `tokenNorms: Float32Array` passed from JS alongside token data. Eliminates d_norm_sq from inner loop. - **Centroid LUT scoring**: 16-entry f32 table in registers. On NEON: `TBL` with 4-register 64-byte table — one instruction per 16 centroid lookups. On x86-64: `_mm256_i32gather_ps` for 8-wide AVX2 gather. - **No dequant allocation**: Score directly from packed nibbles. - **Rayon parallelism**: Unchanged (parallel across candidates). Estimated speedup over current native INT8 kernel: **~30-40%** from combined norm reuse + eliminated alloc + smaller copies. Ships as `@sweet-search/native-darwin-arm64`, `@sweet-search/native-darwin-x64`, `@sweet-search/native-linux-x64`, `@sweet-search/native-linux-arm64` (same platform packages from INIT_PLAN.md). #### WASM SIMD Kernel (`crates/wasm-maxsim/src/lib.rs`) New entry point: `maxsim_dequant_4bit()` with: - Nibble extract: `v128.and(v, 0x0F)` + `i8x16.shr_u(v, 4)` — 2 ops per 32 values - Centroid lookup: `i8x16.swizzle` with 16-entry LUT — 1 instruction per 16 values - Pre-stored norms via pointer parameter (no d_norm_sq accumulation) - Compiled with `RUSTFLAGS="-C target-feature=+simd128"` #### JS Fallback (`core/infrastructure/simd-distance.js`) Pure JS `maxsimDequant4bit()` with manual nibble unpacking + LUT array indexing. Slower but always available. #### Performance Projections (M3 Max, 50 candidates, Q=32, D=100, d=128) | Metric | Current INT8 | Phase 4 (WHT+4bit+norms) | Improvement | |--------|-------------|--------------------------|-------------| | MaxSim latency (native) | 5.9ms | ~3.5-4.0ms (est.) | **30-40%** | | MaxSim latency (WASM) | ~30ms (est.) | ~20ms (est.) | ~33% | | Bytes/token | 128 | 64+8 (norms) = 72 | 1.8x smaller | | Token slab (3.14M tokens) | 383.5 MiB | 215 MiB | 1.8x smaller | | Total index (binary fmt) | ~396 MiB | ~230 MiB | 1.7x smaller | | napi buffer copy | 128 B/token | 72 B/token | 1.8x less | **Quality analysis** (from research swarm IR specialist): Inner product variance at 4-bit, d=128: `sigma_cosine ~ 0.012` Rank inversion probability: | Score gap | P(inversion) | |-----------|-------------| | 0.01 | 5.2% | | 0.02 | 0.19% | | 0.03 | ~0% | Expected NDCG@10 loss: **< 0.2%**. Safe for production. **Required changes**: - `core/infrastructure/config/ranking.js`: **Prerequisite schema change.** The current config uses `quantization: 'int8'` (line 181) but the constructor in `core/ranking/late-interaction-index.js:99` resolves this to a boolean `useInt8`. The manifest (`save()` at line 686) and segment header (`_writeSegmentFile()` at line 242) also persist `useInt8` as a boolean. This must be replaced with a scheme enum: ``` quantization: 'int8' | 'int8-per-token' | 'wht-int8' | 'wht-int4' ``` The constructor, manifest, segment header, and load path all need to read/write the scheme string instead of a boolean. This is a cross-cutting change that should land as a separate prep commit before the Phase 4 kernel work. - `core/ranking/late-interaction-index.js`: - New `quantizeToInt4()` / `dequantizeFromInt4()` with nibble packing - Constructor: Replace `this.useInt8` with `this.quantScheme` - `add()`: Branch on scheme for quantization path - Save/load: Persist `quantScheme` in manifest and segment header (use header byte at offset 6 for `quantBits`: 8 or 4) - `crates/sweet-search-native/src/lib.rs`: New `maxsim_score_batch_4bit()` with norm params, centroid LUT, zero-alloc scoring - `crates/wasm-maxsim/src/lib.rs`: New `maxsim_dequant_4bit()` with SIMD nibble extract + swizzle - `core/infrastructure/simd-distance.js`: Tier detection for 4-bit kernel availability, new JS fallback - `crates/sweet-search-native/Cargo.toml`: Keep existing INT8 entry points for backward compat **Go/no-go**: NDCG@10 regression < 0.5pp. MaxSim score Kendall tau >= 0.990 vs float32. --- ### Phase 5 (Deferred): Full TurboQuant **When**: Only if plain INT4 stops meeting product goals or we later need materially more compression than the current default delivers. **What**: PolarQuant (optimal per-coordinate scalar quantization using Lloyd-Max codebooks) + optional QJL (1-bit residual correction for unbiased inner products). **Why we might need it**: At 3.25 bits (turbo3), storage could drop further than INT4. That may matter for much larger corpora or tighter memory targets. **Why it is deferred now**: The current plain INT4 path already achieved the required quality/size tradeoff. The remaining savings from full TurboQuant do not currently justify the extra kernel complexity, persistence surface, and validation cost. **Prerequisites**: - Empirical validation that LateOn-Code embeddings (L2-normalized) match TurboQuant's distributional assumptions at d=128 - Algorithm well-documented (multiple reference implementations exist in Rust and Python — see [Reference Implementations](#reference-implementations-algorithm-reference-only--no-dependencies)) - 3-bit packing SIMD kernel proven viable on WASM **Open questions on QJL**: Google explicitly describes QJL as balancing high-precision queries with low-precision data (asymmetric scoring). This is exactly our use case. The swarm dismissed QJL too quickly — it deserves empirical evaluation in Phase 5, not upfront dismissal. --- ## Historical Research Addendum (March 2026) > **Added after deep research sweep across 30+ papers, community > implementations, and production deployments (Oct 2025 – Mar 2026).** > These are optimizations discovered *beyond* the original TurboQuant > paper that are relevant to our pipeline. > > **CRITICAL RULE: Every optimization below MUST be A/B tested against > the previous baseline before being accepted.** No optimization ships > without empirical proof that it helps *our* specific pipeline > (LateOn-Code, d=128, code retrieval). Academic gains do not guarantee > gains on our data. The eval harness (`eval/retrieval-harness.js`) is > the single source of truth. Each item below includes its specific A/B > test protocol. --- ### CRA-1: Hierarchical Token Pooling (Upgrades Phase 3) **What**: Replace our three Phase 3 options (consecutive-pair pooling, norm-based pruning, attention-score pruning) with **hierarchical token pooling** as the primary strategy. **Why**: LIR'26 Workshop (Johns Hopkins, [arXiv 2603.22434](https://arxiv.org/abs/2603.22434), March 2026) conclusively proved that **token pooling is strictly superior to token pruning** for multi-vector retrieval. The gap is large: | Method | Keep ratio | Rel. nDCG@10 | Rel. Recall@100 | |--------|-----------|-------------|-----------------| | **Hierarchical Pooling** | 20% (5x) | **95.7%** | **98.1%** | | IDF Pruning | 50% (2x) | 92.4% | 98.9% | | Attention Pruning | 20% | ~75% | ~80% | Pooling achieves **2.5x better compression than pruning at equal quality**. At extreme compression (10-20% keep), pruning catastrophically fails while pooling degrades gracefully. Two pooling variants worth testing: 1. **Hierarchical pooling** (Ward clustering + cosine distance) — best overall quality, O(L² · d + L² log L) per document 2. **Attention-based pooling** — nearly as good, much cheaper O(L·L̃·d), better for online indexing scenarios Our existing `poolTokens()` averages consecutive pairs — this is the *worst* pooling strategy (no semantic awareness). It should be replaced with similarity-based clustering. **A/B test protocol**: 1. Index the eval corpus with each strategy at keep ratios [0.20, 0.33, 0.50, 0.75] 2. Measure: nDCG@10, MRR@10, Recall@100, index time, index size 3. Compare: hierarchical pooling vs attention pooling vs current consecutive-pair pooling vs no pooling baseline 4. **Ship only if**: nDCG@10 regression < 2pp at 2x+ token reduction 5. **Reject if**: Index-time cost > 3x increase with no quality gain over attention pooling --- ### CRA-2: Quantile-Based (Non-Uniform) Bucket Boundaries (Upgrades Phase 4) **What**: Replace uniform INT4 quantization boundaries with **quantile-based bucket boundaries** that allocate more quantization levels to densely populated data regions. **Why**: WARP ([arXiv 2501.17788](https://arxiv.org/abs/2501.17788), ETH Zurich/Stanford, Jan 2025) uses quantile-based boundaries for their 4-bit residual quantization, contributing to their 7.3x memory reduction and 3x speedup over PLAID. The insight: embedding coordinate values are NOT uniformly distributed after rotation — they follow a near-Gaussian distribution. Uniform quantization wastes levels in the sparse tails. Quantile boundaries match level density to data density. **Implementation**: During index build, collect a sample of coordinate values post-rotation, compute 16 quantile boundaries (for 4-bit), store the 16 boundary values in the binary header (16 × f32 = 64 bytes). Quantization becomes: `bucket = bisect(boundaries, value)` instead of `bucket = clamp(floor((value - min) / scale * 16), 0, 15)`. **Cost**: Zero runtime cost — same 4-bit storage, same LUT scoring. The only change is *which* 16 centroids the LUT contains. **A/B test protocol**: 1. Build two indexes: uniform INT4 vs quantile INT4 (same WHT rotation seed) 2. Measure: Kendall tau vs float32, nDCG@10, MRR@10 3. **Ship if**: Any measurable quality improvement (even 0.1pp) 4. **Reject if**: Quality is identical or worse (quantile overhead wasted) --- ### CRA-3: WUSH Data-Aware Calibrated Rotation (Upgrades Phase 2) **What**: Augment the Phase 2 Walsh-Hadamard rotation with a **data-dependent scaling step** calibrated on actual embedding statistics, following the WUSH transform. **Why**: WUSH ([arXiv 2512.00956](https://arxiv.org/abs/2512.00956), ETH Zurich/ISTA/Red Hat, Nov 2025) proved that a pure data-oblivious Hadamard rotation is the *orthogonal component* of the optimal transform, but the full optimal transform is: ``` T_wush = H × S^(-1/2) × U^T × W'^T ``` where S, U come from SVD of the data's second-moment matrix. This adds a **non-orthogonal scaling** step that compensates for non-uniform dimension variance in the actual embeddings. Reported gain: **+2.8 average accuracy points** over pure Hadamard at W4A4 quantization. **Implementation** (~40 lines): 1. During index build, collect a sample of N=10K token embeddings 2. Compute covariance matrix C = (1/N) × X^T × X 3. SVD: C = U × S × V^T 4. Pre-multiply the Hadamard rotation: T = H × diag(S^(-1/2)) × U^T 5. Apply T instead of bare H at index time; apply T to queries at search time 6. Store the calibration matrix in the index file (128×128 × f32 = 64 KB) **Caution**: This violates TurboQuant's "data-oblivious" property. If our embeddings are already well-distributed post-WHT (which L2-normalized embeddings tend to be), the gain may be negligible. The academic gain was measured on LLM weight/activation quantization, NOT retrieval embeddings. Must A/B test. **A/B test protocol**: 1. Build three indexes: no rotation (baseline), pure WHT (Phase 2), WUSH-calibrated 2. Measure: Kendall tau vs float32, nDCG@10, MRR@10 at both 8-bit and 4-bit 3. **Ship if**: Kendall tau improvement >= 0.002 OR nDCG@10 improvement >= 0.3pp 4. **Reject if**: Gains < noise floor (~0.1pp), since WUSH adds calibration complexity and 64KB to the index --- ### CRA-4: Sequency-Ordered Walsh Matrices (Upgrades Phase 2) **What**: Replace standard Hadamard ordering in `fastRotate()` (`core/infrastructure/quantization.js:74`) with **sequency-ordered Walsh matrices** (GSR). **Why**: GSR ([arXiv 2505.03810](https://arxiv.org/abs/2505.03810), Seoul National University, May 2025) showed that sequency ordering clusters similar-frequency components together, reducing quantization error compared to standard Hadamard. The block-diagonal variant (e.g., 32×32 blocks) isolates outlier impact and matches optimization-based *learned* rotations — without any training. Especially effective at extreme low bit-widths (2-bit). **Implementation**: Change the Hadamard matrix construction to use Walsh (sequency) ordering instead of natural ordering. Same O(d log d) transform, same butterfly structure, different permutation. Optionally use block-diagonal structure (four 32×32 blocks for d=128). **Cost**: Zero — identical computation, different matrix ordering. **A/B test protocol**: 1. Build indexes with: natural Hadamard vs sequency Walsh vs block-diagonal Walsh (32×32) 2. Test at both 8-bit (Phase 1) and 4-bit (Phase 4) 3. Measure: Kendall tau vs float32, coordinate variance uniformity, nDCG@10 4. **Ship if**: Measurable quality improvement at either bit-width 5. **Reject if**: No difference (our d=128 already concentrates well enough) --- ### CRA-5: Implicit Decompression / Zero-Alloc Scoring (Upgrades Phase 4) **What**: Score MaxSim **directly from packed nibbles without ever materializing f32 vectors**, following WARP's implicit decompression pattern. **Why**: Our Phase 4 plan already eliminates the per-thread Vec allocation via centroid LUT scoring, but the WARP paper's algebraic decomposition goes further: the dot product between a float32 query token and a quantized document token can be computed as: ``` dot(q, d_quantized) = dot(q, centroid) + Σ_i q[i] × bucket_weight[bucket[i]] ``` The centroid contribution is **pre-computed once** per (query_token, cluster) pair and reused across all documents in that cluster. Only the residual sum needs per-document computation. This eliminates even the LUT gather-and-accumulate step for the centroid component. **A/B test protocol**: 1. Implement both: (a) standard LUT scoring (Phase 4 plan), (b) implicit decompression 2. Measure: MaxSim latency (native + WASM), numerical equivalence to float32 3. **Ship if**: Latency improvement > 10% on the 50-candidate benchmark 4. **Reject if**: Complexity increase not justified by latency gain, or numerical divergence from standard path --- ### CRA-6: Learned Token Importance Weights for MaxSim (New — All Phases) **What**: Add **per-token learned importance weights** to the MaxSim scoring function, changing it from: ``` score(Q, D) = Σ_i max_j (q_i^T · d_j) ``` to: ``` score(Q, D) = Σ_i w_i × max_j (q_i^T · d_j) ``` **Why**: [arXiv 2511.16106](https://arxiv.org/abs/2511.16106) (Nov 2025) showed that adding token importance weights (analogous to IDF in BM25) improves late interaction retrieval quality. Two approaches: 1. **Zero-shot (IDF-based)**: Weight each document token by its inverse document frequency. No training needed. 2. **Few-shot (learned)**: Learn weights via contrastive loss on a small labeled set. This is **orthogonal to compression** — it improves quality at any quantization level. The weights cost only 4 bytes per token (stored alongside tokenNorms in the binary format). The weights could also **guide adaptive bit allocation**: important tokens get more bits. **A/B test protocol**: 1. Implement IDF-based weighting (zero-shot, no training needed) 2. Measure: nDCG@10, MRR@10 on GenCodeSearchNet with and without weights 3. Test at multiple compression levels: uncompressed, INT8, INT4 4. **Ship if**: nDCG@10 improvement >= 0.5pp at any compression level 5. **Reject if**: No improvement on code retrieval (IDF may not help for code tokens the way it helps natural language) --- ### CRA-7: Metal Constant Memory for Centroid LUT (Upgrades Phase 4) **What**: Place the 4-bit centroid LUT in **Metal constant memory** on Apple Silicon, rather than shared or device memory. **Why**: TheTom/turboquant_plus ([sparse-v-dequant.md](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/sparse-v-dequant.md)) discovered that constant memory has dedicated hardware caching on M-series chips. The 16-entry f32 LUT for 4-bit quantization (64 bytes) fits perfectly. This is a single annotation change in the Metal/Rust kernel. **A/B test protocol**: 1. Benchmark native kernel on M3 Max with LUT in: shared memory vs constant memory 2. Measure: MaxSim latency over 1000 queries, 50 candidates each 3. **Ship if**: Measurable latency reduction (even 5%) 4. **Reject if**: No difference (M3 cache hierarchy may already optimize this) --- ### CRA-8: Matryoshka + Quantization Combo (Upgrades Phase 5) **What**: If LateOn-Code supports (or can be fine-tuned to support) **Matryoshka-style dimensional truncation**, combine dimension reduction with 4-bit quantization for extreme compression. **Why**: SMEC ([arXiv 2510.12474](https://arxiv.org/abs/2510.12474), EMNLP 2025) showed Matryoshka embeddings allow flexible truncation where prefixes retain semantic utility. Combined with quantization: | Dimensions | Bits | Bytes/token | Compression vs d=128 INT8 | |-----------|------|-------------|---------------------------| | 128 | 8 | 128 | 1x (baseline) | | 128 | 4 | 64 | 2x | | 64 | 4 | 32 | **4x** | | 32 | 4 | 16 | **8x** | At d=64 + 4-bit + pool=2, the index shrinks to ~30 MiB for 17K docs — **45x** smaller than current JSON INT8. **Caution**: Requires model-level changes (Matryoshka fine-tuning of LateOn-Code). This is a Phase 5+ direction, not a drop-in optimization. **A/B test protocol**: 1. Fine-tune LateOn-Code with Matryoshka objective (if pursued) 2. Evaluate at d=[32, 64, 96, 128] × bits=[4, 8] on GenCodeSearchNet 3. Plot Pareto frontier of nDCG@10 vs bytes/token 4. **Ship if**: d=64 achieves nDCG@10 within 1pp of d=128 5. **Reject if**: Truncation below d=96 causes > 2pp regression on code retrieval --- ### CRA-9: Voronoi-Guided Token Importance (Upgrades Phase 3) **What**: Use **Voronoi cell volume** in embedding space as the token importance metric for pruning decisions. **Why**: [arXiv 2603.09933](https://arxiv.org/abs/2603.09933) (Sorbonne, March 2026) formalized token pruning as a Voronoi estimation problem. Tokens with larger Voronoi regions are more "unique" and harder to replace by neighboring tokens. Achieves **70% token removal with <1.5% nDCG drop** on in-domain data. **Use case**: Hybrid strategy — use Voronoi importance to identify tokens that are safe to prune (tiny Voronoi cells = redundant tokens), then pool the remaining tokens. This could outperform pure pooling. **A/B test protocol**: 1. Implement Voronoi cell estimation for document token sets 2. Compare: Voronoi pruning vs norm pruning vs IDF pruning (all at same keep ratio) 3. Test hybrid: Voronoi prune bottom 30% → hierarchical pool remaining to target count 4. **Ship if**: Hybrid outperforms pure hierarchical pooling by >= 0.3pp nDCG@10 5. **Reject if**: Voronoi computation cost (O(L² · d) per doc) not justified by gain --- ### CRA-10: SAQ-Style Adaptive Bit Allocation per Dimension (Upgrades Phase 4) **What**: After WHT rotation, allocate **non-uniform bits per dimension segment** using dynamic programming optimization. **Why**: SAQ ([arXiv 2509.12086](https://arxiv.org/abs/2509.12086), Wuhan/CUHK/Huawei, Sep 2025) partitions PCA-projected vectors into segments and allocates more bits to high-magnitude segments, fewer to trailing segments. Uses DP to minimize total quantization error under a fixed bit budget. **Nuance**: After WHT rotation, dimensions *should* be near-uniform — that's the whole point. So the gain may be minimal. But if any residual non-uniformity remains (e.g., from the structure of code embeddings), adaptive allocation captures it. **A/B test protocol**: 1. After WHT rotation, measure per-dimension variance across 10K tokens 2. If variance ratio (max/min) > 1.5, implement DP bit allocation 3. Compare: uniform 4-bit vs adaptive (avg 4-bit, range 3-6 per segment) 4. **Ship if**: Kendall tau improvement >= 0.002 5. **Reject if**: Post-WHT dimensions are already uniform (ratio < 1.2), skip entirely --- ### CRA-11: ConstBERT Fixed-Count Embeddings (Future Direction) **What**: Train the model to produce a **fixed number of document embeddings** (e.g., 32 per document) via learned pooling projection. **Why**: ConstBERT ([arXiv 2504.01818](https://arxiv.org/abs/2504.01818), U Glasgow/Pinecone, Apr 2025) achieves >50% index size reduction with comparable nDCG@10 on BEIR. The killer benefit: **uniform memory layout** (every document has exactly C vectors), which eliminates variable-length token slabs, simplifies the binary format, enables perfect SIMD alignment, and dramatically improves OS paging and cache behavior. **Caution**: Requires model architecture changes to LateOn-Code. This is a future model generation decision, not a compression technique. **A/B test protocol** (if pursued): 1. Train ConstBERT variant of LateOn-Code with C=[16, 32, 64] 2. Evaluate on GenCodeSearchNet: nDCG@10, MRR@10 3. Compare total system performance: index size + load time + query latency 4. **Ship if**: C=32 matches baseline nDCG@10 within 1pp 5. **Reject if**: Code retrieval quality degrades (code tokens may be more diverse than natural language, requiring more embeddings) --- ### CRA-12: Per-Vector Learned Quantizers / NVQ (Upgrades Phase 5) **What**: Replace uniform scalar quantization with **per-vector learned nonlinear quantizers** individually calibrated for each indexed vector. **Why**: NVQ ([arXiv 2509.18471](https://arxiv.org/abs/2509.18471), IBM, Sep 2025) achieves 3x storage reduction with <0.01 recall impact above 0.95. Each vector gets its own quantization function (not just its own min/scale as in Phase 1). **Caution**: Encoding cost is very high (4,000µs per 1536-d vector for Extended RaBitQ's approach). May be prohibitive for our 3.14M tokens. Weaviate's RQ approach simplifies encoding to 2µs by falling back to scalar quantization. **A/B test protocol**: 1. Implement per-token nonlinear quantization with calibration 2. Compare: per-token linear (Phase 1) vs per-token nonlinear at 4-bit 3. Measure: Kendall tau, encoding time, nDCG@10 4. **Ship if**: Quality improvement >= 0.5pp AND encoding time < 10µs/token 5. **Reject if**: Encoding too slow for re-indexing or quality gain < noise --- ### CRA-13: Fused Kernel with L1 Cache-Resident LUT (Upgrades Phase 4) **What**: Design the Phase 4 native kernel explicitly for **L1 cache residency** of the centroid LUT, and **fuse dot-product + max-reduce** into a single pass. **Why**: The dejan.ai Triton kernel implementation demonstrated that the optimal pattern is: 1. Pre-rotate query (single matmul) 2. Load uint8 indices from HBM (4x less bandwidth than f32) 3. Gather centroids from L1-cached LUT (16 entries = 64 bytes = 1 cache line) 4. Fused dot-product + max-reduce in one kernel launch → ~1.2x speedup The critical insight: the 16-entry LUT is **reused across all sequence positions**, so the bottleneck is HBM bandwidth for index loading, not compute. Our kernel should be designed bandwidth-first. **A/B test protocol**: 1. Implement two kernel variants: (a) separate dequant + score, (b) fused 2. Benchmark on M3 Max: 50 candidates, Q=32, D=100, d=128 3. **Ship if**: Fused kernel is >= 15% faster 4. **Reject if**: Fusion complexity hurts maintainability with < 10% gain --- ### CRA-14: Fast Pseudo-Random Rotation (Weaviate RQ Approach) **What**: Evaluate Weaviate's **fast pseudo-random rotation** (7µs) as an alternative to full WHT matrix-vector multiply (~1,700µs for dense). **Why**: Weaviate's 8-bit Rotational Quantization ([blog](https://weaviate.io/blog/8-bit-rotational-quantization)) uses simplified pseudo-random rotations that achieve >99% recall on high-dimensional datasets while being **~240x faster** to apply than standard matrix-vector multiplication. Trades theoretical unbiased guarantees for practical performance. **Applicability**: Our `fastRotate()` uses the O(d log d) butterfly WHT, which is already fast (~0.01ms for d=128). The Weaviate approach may not offer meaningful speedup at our dimensionality. More relevant if we scale to d=768+ with future models. **A/B test protocol**: 1. Implement Weaviate-style fast rotation alongside WHT 2. Measure: rotation latency, quantization quality (Kendall tau), nDCG@10 3. **Ship if**: Equivalent quality with measurable latency improvement 4. **Reject if**: Quality regression > 0.1pp (theoretical guarantees matter for us) --- ### CRA-15: WebGPU Compute Shader Tier (Future — New Tier 0) **What**: Add a **WebGPU compute shader** tier above the existing 3-tier architecture for GPU-accelerated MaxSim scoring in-browser. **Why**: WebGPU 1.1 (2025 Q1) added subgroup operations enabling wave-level parallelism. For 4-bit nibble operations, WebGPU uses bit manipulation (`(x >> 4) & 0xF`) in compute shaders. Benchmarks show 10x+ speedup over CPU for large array operations. **Tier architecture** (updated): - Tier 0: WebGPU compute — GPU-accelerated, in-browser - Tier 1: Native N-API (Rust + Rayon + NEON/AVX2) — existing - Tier 2: WASM SIMD — existing - Tier 3: Pure JS fallback — existing **Applicability**: Only relevant if Sweet Search runs in-browser. Low priority for server-side CLI search. **A/B test protocol**: 1. Implement WebGPU MaxSim kernel with 4-bit support 2. Benchmark vs WASM SIMD tier in Chrome/Safari 3. **Ship if**: >= 3x speedup over WASM SIMD in supported browsers 4. **Reject if**: Browser support too fragmented or < 2x gain --- ### Historical Projection Snapshot This section is preserved as pre-benchmark research context only. It is not the current roadmap and should not be read as an expected product outcome. In practice, the aggressive pooling assumptions did not hold up in A/B testing, and Phase 5 remains deferred. | Item | Current status | |------|----------------| | Binary storage | Shipped | | Plain INT4 | Shipped and default | | WHT as default | Deferred | | Pooling (`poolFactor=2`) | Rejected as implemented | | Full TurboQuant / Phase 5 | Deferred | --- ## Community Research References (Added) - [WUSH: Near-Optimal Adaptive Transforms (arXiv 2512.00956)](https://arxiv.org/abs/2512.00956) - [GSR: Grouped Sequency-Arranged Rotation (arXiv 2505.03810)](https://arxiv.org/abs/2505.03810) - [LIR'26: Token Pooling vs Pruning (arXiv 2603.22434)](https://arxiv.org/abs/2603.22434) - [Voronoi Cell Token Pruning (arXiv 2603.09933)](https://arxiv.org/abs/2603.09933) - [Token Importance Weighting (arXiv 2511.16106)](https://arxiv.org/abs/2511.16106) - [ConstBERT: Fixed-Count Embeddings (arXiv 2504.01818)](https://arxiv.org/abs/2504.01818) - [SAQ: Dimension Segmentation (arXiv 2509.12086)](https://arxiv.org/abs/2509.12086) - [NVQ: Per-Vector Quantizers (arXiv 2509.18471)](https://arxiv.org/abs/2509.18471) - [SMEC: Matryoshka Compression (arXiv 2510.12474)](https://arxiv.org/abs/2510.12474) - [Lossless ColBERT Token Pruning (arXiv 2504.12778)](https://arxiv.org/abs/2504.12778) - [ROSAQ: Saliency-Aware Rotation (arXiv 2506.13472)](https://arxiv.org/abs/2506.13472) - [OptRot: Data-Free Rotation (arXiv 2512.24124)](https://arxiv.org/abs/2512.24124) - [QuEPT: Elastic Precision (arXiv 2602.12609)](https://arxiv.org/abs/2602.12609) - [Extended RaBitQ (GitHub)](https://github.com/VectorDB-NTU/Extended-RaBitQ) - [Weaviate 8-bit RQ (blog)](https://weaviate.io/blog/8-bit-rotational-quantization) - [Milvus IVF_RABITQ (blog)](https://milvus.io/blog/bring-vector-compression-to-the-extreme-how-milvus-serves-3%C3%97-more-queries-with-rabitq.md) - [dejan.ai TurboQuant Triton Kernel (blog)](https://dejan.ai/blog/turboquant/) - [turboquant_plus sparse-v-dequant (GitHub)](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/sparse-v-dequant.md) - [ColBERT-Att: Late Interaction + Attention (arXiv 2603.25248)](https://arxiv.org/abs/2603.25248) - [RSQ: Learning from Important Tokens (arXiv 2503.01820)](https://arxiv.org/abs/2503.01820) --- ## Current Outcome This is the operative summary for the repository today. | Decision area | Outcome | |---------------|---------| | LI storage format | Fully binary segment storage shipped | | LI quantization default | Plain INT4 shipped as default | | Query-time flags | Not required for normal use; stored LI metadata drives load behavior | | WHT rotation | Available for benchmarking, not the product default | | Token pooling | Not shipped as a default optimization | | Full TurboQuant / PolarQuant / QJL | Deferred pending a materially stronger need | | Benchmark conclusion | Result | |----------------------|--------| | INT4 vs prior INT8 baseline | Close enough on retrieval quality to ship | | INT4 storage footprint | Materially smaller than the prior INT8 baseline | | WHT defaulting decision | Mixed benchmark results, so not defaulted | | Pooling decision | Quality regression too large to ship as implemented | --- ## Validation Status Completed validation work: - Full A/B benchmarks were run against the repository's current retrieval harnesses before changing the product default. - Those runs were sufficient to ship plain INT4 as the default LI path. - WHT was left available for experimentation, but not defaulted, because quality gains were inconsistent across benchmarks. - Pooling was not accepted because the observed quality regression was too large. Future validation, if deferred work is revisited: - Re-run full benchmark A/Bs before changing the default again. - Treat any WHT, pooling, Matryoshka, or full TurboQuant work as benchmark-gated rather than assumption-driven. - Keep microbenchmarks as supporting evidence only; ship decisions should continue to follow retrieval-quality benchmarks. --- ## Current Priority 1. Keep plain INT4 as the default LI quantization path. 2. Rebuild or migrate existing indexes when adopting the new default in production environments. 3. Defer WHT defaulting, pooling, and full TurboQuant until a new benchmark-backed product need appears. 4. Continue to treat retrieval-quality benchmarks as the ship gate for any future LI quantization changes. --- ## Key References - [TurboQuant Paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874) - [Google Research Blog](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) - [Weaviate 8-bit Rotational Quantization](https://weaviate.io/blog/8-bit-rotational-quantization) - [QuIP# Hadamard Incoherence (arXiv 2402.04396)](https://arxiv.org/abs/2402.04396) - [PolarQuant KV Cache (arXiv 2502.02617)](https://arxiv.org/html/2502.02617v1) - [ColBERTv2 (arXiv 2112.01488)](https://arxiv.org/abs/2112.01488) - [WARP Multi-Vector Execution (arXiv 2501.17788)](https://arxiv.org/abs/2501.17788) - [TheTom/turboquant_plus (Metal implementation)](https://github.com/TheTom/turboquant_plus) - [llama.cpp TurboQuant Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969) - [OpenSearch Random Rotation Benefits](https://opensearch.org/blog/the-benefits-of-random-rotation-in-quantized-vector-search/) ### Community Research (Added March 2026) - [WUSH: Near-Optimal Adaptive Transforms (arXiv 2512.00956)](https://arxiv.org/abs/2512.00956) - [GSR: Grouped Sequency-Arranged Rotation (arXiv 2505.03810)](https://arxiv.org/abs/2505.03810) - [LIR'26: Token Pooling vs Pruning Comparison (arXiv 2603.22434)](https://arxiv.org/abs/2603.22434) - [Voronoi Cell Token Pruning (arXiv 2603.09933)](https://arxiv.org/abs/2603.09933) - [Token Importance Weighting for ColBERT (arXiv 2511.16106)](https://arxiv.org/abs/2511.16106) - [ConstBERT: Constant-Space Multi-Vector (arXiv 2504.01818)](https://arxiv.org/abs/2504.01818) - [SAQ: Dimension Segmentation VQ (arXiv 2509.12086)](https://arxiv.org/abs/2509.12086) - [NVQ: Per-Vector Learned Quantizers (arXiv 2509.18471)](https://arxiv.org/abs/2509.18471) - [SMEC: Matryoshka Embedding Compression (arXiv 2510.12474)](https://arxiv.org/abs/2510.12474) - [Lossless ColBERT Token Pruning (arXiv 2504.12778)](https://arxiv.org/abs/2504.12778) - [ROSAQ: Saliency-Aware Rotation (arXiv 2506.13472)](https://arxiv.org/abs/2506.13472) - [Extended RaBitQ Multi-Bit (GitHub)](https://github.com/VectorDB-NTU/Extended-RaBitQ) - [Weaviate 8-bit Rotational Quantization (blog)](https://weaviate.io/blog/8-bit-rotational-quantization) - [Milvus IVF_RABITQ (blog)](https://milvus.io/blog/bring-vector-compression-to-the-extreme-how-milvus-serves-3%C3%97-more-queries-with-rabitq.md) - [dejan.ai TurboQuant Triton Kernel (blog)](https://dejan.ai/blog/turboquant/) - [turboquant_plus sparse-v-dequant (GitHub)](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/sparse-v-dequant.md) ### Reference Implementations (Algorithm Reference Only — No Dependencies) We build all kernels from scratch. These implementations exist in the ecosystem and are useful for **understanding algorithms, validating correctness, and comparing performance** — not as dependencies. | Implementation | Language | What to study | Phase | |---|---|---|---| | [animehacker/llama-turboquant](https://github.com/animehacker/llama-turboquant) | C/CUDA | Per-block 32×32 WHT with fused `vec_dot` kernel scoring in rotated space. Proves block-diagonal WHT (CRA-4) works. Lloyd-Max codebook values for Gaussian: `{±0.243, ±0.743, ±1.334, ±2.157}` | 2, 4 | | [scos-lab/turboquant](https://github.com/scos-lab/turboquant) | Python | Key finding: MSE > Prod for attention (contradicts paper). Mixed precision closes the gap. Good validation methodology. | 5 | | [dhawalc/turboQuantDC](https://github.com/dhawalc/turboQuantDC) | Python (7.3K LOC) | Full TurboQuant with Lloyd-Max codebooks, WHT rotation, QJL. MIT license. Codebook computation reference. | 4, 5 | | [cksac/turboquant-model](https://github.com/cksac/turboquant-model) | Python/PyTorch | Confirms Hadamard matches QR quality at O(d) vs O(d²) storage. Residual quantization benchmarks. | 2 | | [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus) | Python + Metal | 5K+ stars. Sparse V dequant analysis: Metal constant memory LUT fits 16 entries in 1 cache line. QJL dropped by 5 independent groups. | 4, CRA-7 | | [jlscheerer/xtr-warp](https://github.com/jlscheerer/xtr-warp) | Python + C++ | WARP engine. Quantile-based 4-bit bucket boundaries. Implicit decompression (zero-alloc scoring). Nibble unpacking patterns. | CRA-2, CRA-5 | | [arXiv 2409.14683](https://arxiv.org/abs/2409.14683) (token pooling) | Python (SciPy) | Hierarchical Ward clustering for ColBERT token pooling. Pool factor 2 = 50% reduction at 100.6% perf. Integrated into upstream ColBERT. | 3, CRA-1 | | [mixedbread maxsim-cpu](https://www.mixedbread.com/blog/multimodal-late-interaction-billion-scale) | Rust | Production MaxSim scoring on CPU with custom SIMD kernels. Billion-scale late interaction. | 4 | | [pylate-rs](https://lightonai.github.io/pylate-rs/) | Rust + WASM | ColBERT scoring crate with dedicated WASM compilation target. Multi-platform SIMD dispatch patterns. | 4 | | dejan.ai Triton kernel | Python/Triton | Fused centroid LUT + dot product. Pre-rotate query once, inner loop is LUT gather + accumulate. Bandwidth-first kernel design. | CRA-13 | **Rust crates on crates.io** (algorithm reference, NOT used as deps): - `turbo-quant` — 1K SLoC PolarQuant + QJL at d=128. WHT rotation. - `turboquant-rs` — 3.5K SLoC, AVX2/FMA, llama.cpp-compatible 3-bit packing. - `BitPolar` — Rust + WASM + Python bindings. `WhtRotation` module. **npm** (NOT used as dep): - `turboquant-wasm` — Zig→WASM with relaxed SIMD. Full polar TQ, not our simpler WHT+scalar approach. **Key algorithm constants** (from reference impls, validated across 6+ independent implementations): - Lloyd-Max 4-bit (16 centroids) for N(0,1): precompute from quantile function at build time - Lloyd-Max 3-bit (8 centroids) for N(0,1): `{-2.157, -1.334, -0.743, -0.243, +0.243, +0.743, +1.334, +2.157}` - WHT butterfly: O(d log d), d must be power-of-2 (pad d=48→64) - Post-WHT distribution: converges to N(0, σ²) by CLT for Walsh transforms