# MLX Framework Port (Experimental) TurboQuant KV cache compression is being ported to Apple's [MLX framework](https://github.com/ml-explore/mlx) for native Python/Swift inference on Apple Silicon. **Fork:** [TheTom/mlx `feature/turboquant-plus`](https://github.com/TheTom/mlx/tree/feature/turboquant-plus) ### Results (M5 Max) **Qwen2.5-3B 4bit — delegated KVCache (5-run avg, 500 decode tokens, [`7ad7500`](https://github.com/TheTom/mlx/commit/7ad75002bbf87f5d1774859e94e2a76ccbd3a56d)):** | Config | Decode tok/s | vs Baseline | PPL | PPL Delta | |--------|-------------|-------------|-----|-----------| | Baseline (f16 KV) | 172.6 | 100% | 1.8764 | — | | **Sym turbo4** | **171.2** | **99.2%** | 1.9083 | +1.70% | | **Asym (K=FP16, V=turbo4)** | **171.0** | **99.0%** | 1.8859 | +0.51% | **Quality:** Output text indistinguishable from baseline. KL divergence < 0.001, cosine similarity > 0.989. **35B MoE (Qwen3.5-35B-A3B 8bit):** | Config | Prefill | Decode | vs Baseline | |--------|---------|--------|-------------| | Baseline | 11.4 | 95.7 | 100% | | turbo4 fused + boundary | 132.7 | **94.2** | **96%** | **Qwen3.5-27B Dense 8bit (16/64 KV layers):** | Config | PPL | PPL Delta | Decode | vs Baseline | |--------|-----|-----------|--------|-------------| | Baseline | 1.4800 | — | 17.9 | 100% | | turbo4 asymmetric | 1.5082 | +1.91% | 15.5 | 87% | | turbo4 symmetric | 1.5219 | +2.83% | 15.4 | 86% | **Quality Validation (Qwen2.5-7B 8bit, dense, all 28 layers KV):** | Test | Symmetric turbo4 | Asymmetric (K=FP16) | |------|-----------------|---------------------| | **KLD** | 6.86 (broken) | **0.003** | | **Top-1 match** | 10.5% (broken) | **98.1%** | | **NIAH** | 0/15 FAIL | **15/15 PASS** | > [!warning] **Symmetric turbo is catastrophic on dense models.** All K layers compressed → softmax error compounds across 28 layers. Asymmetric (K=FP16, V=turbo4) is mandatory for dense architectures. Hybrid models (Qwen3.5) with delta net layers are accidentally safe because only a fraction of layers use KV cache. **Dense models (short context, deferred compression):** | Model | Baseline Decode | turbo4 asym Decode | PPL Delta | |-------|----------------|-------------------|-----------| | Qwen2.5-7B 8bit | 64.2 | 64.1 | 0.00% | | phi-4 8bit | 32.9 | 32.7 | 0.00% | **M2 Pro — Qwen2.5-1.5B 8bit (dense, 28/28 KV layers, asymmetric):** | Test | Result | |------|--------| | KLD | 0.004 | | Top-1 match | 96.8% | | NIAH | 30/30 PASS | | Context | Baseline Decode | Turbo Asymmetric | vs Baseline | |---------|----------------|-----------------|-------------| | 128 | 34.8 | 35.2 | 101% | | 4096 | 46.9 | 21.6 | 46% | M2 Pro shows more decode regression at long context — lower memory bandwidth amplifies turbo overhead. **M5 Max Context Scaling (Qwen2.5-7B 8bit, delegated KVCache, [`7ad7500`](https://github.com/TheTom/mlx/commit/7ad75002bbf87f5d1774859e94e2a76ccbd3a56d)):** | Context | Baseline | Sym turbo4 | vs Baseline | Asym (K=FP16) | vs Baseline | |---------|----------|-----------|-------------|---------------|-------------| | 512 | 63.6 | 63.6 | **100%** | 64.0 | **101%** | | 1K | 63.1 | 62.8 | **100%** | 62.6 | **99%** | | 2K | 62.7 | 61.8 | **98%** | 62.2 | **99%** | | 4K | 61.0 | 60.2 | **99%** | 61.0 | **100%** | | 8K | 58.2 | 56.9 | **98%** | 57.7 | **99%** | | 16K | 54.6 | 53.0 | **97%** | 53.8 | **99%** | > Previous numbers (61-83%) were measured before the delegated KVCache optimization (`7ad7500`). Root cause was `mx.concatenate` allocating new arrays every decode step × n_layers. Fixed by delegating FP16 storage to an internal KVCache with pre-allocated buffers. **MLX Python vs llama.cpp (Qwen2.5-7B, M5 Max):** | Framework | Prefill (400 tok) | Decode | Memory | |-----------|------------------|--------|--------| | llama.cpp (Q8_0) | 387 | 20.9 | 7.5 GB | | MLX (8bit) | 243 | **21.2** | 8.5 GB | MLX decode matches llama.cpp. Prefill 37% slower (lazy graph vs pre-compiled). **MLX Python vs llama.cpp (M2 Pro, Qwen2.5-7B):** | Framework | Prefill (400 tok) | Decode | |-----------|------------------|--------| | llama.cpp | 387 | 20.9 | | MLX | 243 | 21.3 | > **Note:** Future benchmark logs should record Apple Silicon power mode (Low / Auto / High) when known, as it can materially affect throughput. ### Quick Start (MLX Python) ```python import mlx_lm from mlx.nn.layers.turbo_kv_cache import TurboKVCache model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-8bit") n_layers = len(model.model.layers) cache = [TurboKVCache(bits=4, key_bits=4) for _ in range(n_layers)] text = mlx_lm.generate(model, tokenizer, prompt="Hello!", max_tokens=200, prompt_cache=cache, verbose=True) ``` ```bash pip install git+https://github.com/TheTom/mlx.git@feature/turboquant-plus pip install mlx-lm ``` ### How it works `TurboKVCache` is a drop-in replacement for mlx-lm's `KVCache` that adds TurboQuant 4-bit K+V compression. Compatible with **mlx-lm** and **mlx-vlm** — no framework changes needed. **Delegated KVCache architecture** ([`7ad7500`](https://github.com/TheTom/mlx/commit/7ad75002bbf87f5d1774859e94e2a76ccbd3a56d)): During prefill, stores raw FP16. On first decode step, compresses to packed TurboQuant storage and seeds an internal `KVCache` with decoded FP16. Subsequent decode tokens go through the native KVCache (pre-allocated buffers, zero-alloc slice-assign). Packed storage updated in background via periodic batch recompression on CPU stream. - **97–100% baseline decode speed** across 512–16K context (Qwen2.5-7B, M5 Max) - +0.51% PPL (asymmetric), +1.70% PPL (symmetric) - 99% answer agreement with baseline (520 multimodal samples) - Works with stock mlx-lm and mlx-vlm, no fork needed - All TurboQuant+ papers applied (beta centroids, dual SRHT signs, boundary layers) ### Quick Start — mlx-vlm (multimodal) ```python from mlx_vlm import load from mlx_vlm.models.cache import make_prompt_cache from mlx_lm.models.cache import KVCache from mlx.nn.layers.turbo_kv_cache import TurboKVCacheLite, compact_turbo_cache model, processor = load("mlx-community/gemma-4-26b-a4b-it-bf16") # Wrap KV layers with TurboKVCacheLite cache = make_prompt_cache(model.language_model) kv_indices = [i for i, c in enumerate(cache) if isinstance(c, KVCache)] for idx in kv_indices: cache[idx] = TurboKVCacheLite(cache[idx], bits=4, key_bits=4) # Generate as normal — prefill stores FP16 from mlx_vlm import generate generate(model, processor, prompt="...", max_tokens=1, prompt_cache=cache) # Compact: compress K+V to 4-bit TurboQuant compact_turbo_cache(cache) # Continue generating — native SDPA at full speed generate(model, processor, prompt="Continue.", max_tokens=200, prompt_cache=cache) ``` ```bash pip install git+https://github.com/TheTom/mlx.git@feature/turboquant-plus pip install mlx-vlm ``` ### Quick Start — mlx-lm (text) ```python import mlx_lm from mlx.nn.layers.turbo_kv_cache import make_turbo_cache, compact_turbo_cache model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-8bit") cache = make_turbo_cache(model, bits=4) mlx_lm.generate(model, tokenizer, prompt="Hello!", max_tokens=1, prompt_cache=cache) compact_turbo_cache(cache) mlx_lm.generate(model, tokenizer, prompt="Continue.", max_tokens=200, prompt_cache=cache, verbose=True) ``` ### MM-NIAH Multimodal Benchmark (520 samples) gemma-4-26b-a4b-it · BF16 · 4-bit TQ+ Compact · MM-NIAH (val) · M5 Max 128GB | Bucket | BL Acc | TQ+ Acc | Agree | BL Decode | TQ+ Decode | Speedup | BL KV | TQ+ KV | KV saved | |--------|--------|---------|-------|-----------|-----------|---------|-------|--------|----------| | ~1K | 85% | 84% | 99% | 55.1 | 54.7 | 0.99x | 0.21G | 0.19G | 10% | | ~3K | 81% | 79% | 99% | 55.2 | 54.1 | 0.98x | 0.27G | 0.21G | 22% | | ~7K | 80% | 81% | 99% | 54.1 | 51.7 | 0.96x | 0.37G | 0.24G | 35% | | ~15K | 76% | 76% | 100% | 52.0 | 47.8 | 0.92x | 0.53G | 0.28G | 47% | | ~30K | 77% | 75% | 98% | 46.9 | 40.2 | 0.86x | 0.87G | 0.36G | 59% | | ~60K | 75% | 76% | 99% | 42.6 | 33.7 | 0.79x | 1.30G | 0.47G | 64% | | **Total** | **79%** | **78%** | **99%** | **51.1** | **47.2** | **0.92x** | **0.58G** | **0.29G** | **50%** | 99% answer agreement with baseline across all context lengths — zero systematic quality degradation. KV savings of 10–64% where TQ+ is active. Decode speedup scales from 0.99x at ~1K to 0.79x at ~60K (dequant-once overhead on longer prefills).