# Storage Block Size Optimization for TurboQuant KV Cache Compression **Tom Turney** Independent Researcher GitHub: [@TheTom](https://github.com/TheTom) --- ## Abstract TurboQuant's turbo3 format stores quantized KV cache values in 32-element blocks, each with its own 16-bit norm. Since the Walsh-Hadamard rotation and norm correction operate on 128-element groups (matching head_dim), all 4 blocks within a group receive the same corrected norm value. This redundancy wastes 6 bytes per group (3 duplicate norms). We tested whether increasing the storage block size from 32 to 128 (one block per rotation group) affects perplexity, speed, or correctness. Across 3 model architectures (dense, dense Qwen, hybrid MoE), 2 context lengths (512, 8192), and 2 Apple Silicon platforms (M5 Max, M2 Pro), perplexity was identical to 4 decimal places at all block sizes. Speed deltas were within measurement noise. The result is a free compression improvement: turbo3 at block_size=128 achieves 5.12x compression (vs 4.57x at block_size=32) with no measurable quality or speed cost. All testing used `q8_0-K + turbo-V` cache configurations on Metal (Apple Silicon). Symmetric turbo-K paths and CUDA backends were not validated in this pass. --- ## 1. Background ### 1.1 Block Size vs Group Size TurboQuant's quantization operates at two granularities: - **Rotation group** (`QK_TURBO3_GROUP = 128`): The Walsh-Hadamard Transform (WHT) operates on 128-element groups matching the model's head dimension. After rotation, coordinates are approximately Gaussian with known variance, enabling optimal scalar quantization. The group-level norm correction (`grp_norm / recon_norm`) is computed over all 128 elements. - **Storage block** (`QK_TURBO3 = 32`): Each block of 32 quantized elements is stored with its own 16-bit norm, quantization indices, and sign bits. One rotation group contains 4 storage blocks. The storage block size affects only how many elements share one stored norm value. It does not affect the rotation, centroid selection, or norm correction math. ### 1.2 Compression at Different Block Sizes | Block Size | Block Layout | bits/value | Compression vs fp16 | |:----------:|:------------|:----------:|:-------------------:| | 32 (current) | norm(2B) + qs(8B) + signs(4B) = 14B per 32 | 3.50 | 4.57x | | 64 | norm(2B) + qs(16B) + signs(8B) = 26B per 64 | 3.25 | 4.92x | | 128 | norm(2B) + qs(32B) + signs(16B) = 50B per 128 | 3.125 | 5.12x | The compression improvement comes from amortizing the 2-byte norm over more elements. At block_size=128, one norm covers the entire rotation group, eliminating all redundancy. ### 1.3 Why PPL Should Be Identical All 4 blocks within a 128-element rotation group receive the same corrected norm. This is because norm correction is computed at the group level: ``` grp_norm = ||original_group|| recon_norm = ||reconstructed_centroids|| corrected_norm = grp_norm / recon_norm ``` Every block in the group gets `corrected_norm`. Changing from 4 blocks to 1 block stores this value once instead of four times, but the quantization and dequantization math is identical. --- ## 2. Experimental Setup ### 2.1 Implementation Block size is controlled by `QK_TURBO3` and `QK_TURBO2` defines in `ggml/src/ggml-common.h`. We added derived macros for the flash attention template parameters: ```c #define NL_TURBO3 (QK_TURBO3 / 16) #define NL_TURBO3_VEC (QK_TURBO3 / 4) ``` This replaced ~250 hardcoded `nl` values in Metal flash attention template instantiations, making block size a one-line edit. The dequantization functions, set-rows kernels, and quantization functions all use `QK_TURBO3` symbolically and require no additional changes. ### 2.2 Models | Model | Architecture | Weights | Layers | head_dim | |-------|-------------|---------|--------|----------| | phi-4 | Dense (Phi-3) | Q8_0 | 40 | 128 | | Qwen2.5-7B-Instruct | Dense (Qwen2) | Q4_K_M | 28 | 128 | | Qwen3.5-35B-A3B | Hybrid MoE (GDN + attention) | Q8_0 | 40 (10 KV) | 128 | ### 2.3 Hardware | Machine | SoC | Memory | Bandwidth | |---------|-----|--------|-----------| | M5 Max | Apple M5 Max | 128 GB | 546 GB/s | | M2 Pro | Apple M2 Pro | 16 GB | 200 GB/s | ### 2.4 Cache Configuration All PPL and speed tests used `--cache-type-k q8_0 --cache-type-v turbo3` (asymmetric). This is the recommended production configuration for models where K precision matters. Symmetric turbo-K paths were not tested in this block-size pass. ### 2.5 Benchmark Tools - **PPL**: `llama-perplexity` on wikitext-2-raw, 512 and 8192 context, 4-20 chunks - **Speed**: `llama-bench` with pp512/pp4096 prefill and tg128 decode --- ## 3. Results ### 3.1 Perplexity: Block Size Has Zero Effect #### Short Context (512, wikitext-2-raw) | Model | Block 32 | Block 64 | Block 128 | |-------|:--------:|:--------:|:---------:| | phi-4 Q8_0 (M5, 20 chunks) | 6.6105 | 6.6105 | 6.6105 | | Qwen2.5-7B Q4_K_M (M5, 10 chunks) | 7.4471 | — | 7.4471 | | Qwen2.5-7B Q4_K_M (M2, 10 chunks) | 7.4727 | — | 7.4727 | | Qwen3.5-35B-A3B Q8_0 (M5, 10 chunks) | 7.0298 | — | 7.0298 | #### Long Context (wikitext-2-raw) | Model | Context | Block 32 | Block 128 | |-------|:-------:|:--------:|:---------:| | phi-4 Q8_0 (M5, 4 chunks) | 8K | 5.7134 | 5.7134 | | Qwen2.5-7B Q4_K_M (M5, 4 chunks) | 8K | 5.9547 | 5.9547 | | Qwen2.5-7B Q4_K_M (M2, 4 chunks) | 8K | 5.9477 | 5.9477 | | phi-4 Q8_0 (M5, 4 chunks) | 32K | 6.1873 | 6.1873 | PPL is identical to 4 decimal places across all tested models, context lengths (512, 8K, 32K), and hardware. This confirms the theoretical prediction: block size does not affect quantization math. #### Turbo Type and Cache Path Parity (M5, 512 ctx, 10 chunks) | Cache Config | Model | Block 32 | Block 128 | |:------------:|-------|:--------:|:---------:| | q8_0-K / turbo3-V | phi-4 Q8_0 | 6.6105 | 6.6105 | | q8_0-K / turbo2-V | phi-4 Q8_0 | 6.7346 | 6.7346 | | q8_0-K / turbo2-V | Qwen2.5-7B Q4_K_M | 7.7989 | 7.7989 | | q8_0-K / turbo2-V | Qwen2.5-7B Q4_K_M (M2) | 7.7807 | 7.7807 | | turbo3-K / turbo3-V | phi-4 Q8_0 | 6.9208 | 6.9208 | | turbo3-K / turbo3-V | Qwen3.5-35B-A3B Q8_0 | 7.0627 | 7.0627 | | turbo4-K / turbo4-V (always block128) | phi-4 Q8_0 | 6.6624 | 6.6624 | PPL is identical across all tested cache paths — both asymmetric (q8_0-K + turbo-V) and symmetric (turbo3-K + turbo3-V). turbo2 parity confirmed on both phi-4 and Qwen2.5-7B (including M2). turbo4 already uses block_size=128 natively. #### NIAH (phi-4 Q8_0, M5, symmetric turbo3, block_size=128) | Depth | 4K Context | |:-----:|:----------:| | 0% | pass | | 50% | pass | | 100% | pass | 3/3 pass at block_size=128. This is a block128-only run (not a strict A/B with block32), but given that PPL is byte-identical across all other tests, NIAH parity is expected. #### Boundary V (TURBO_LAYER_ADAPTIVE=7, q8_0-K + turbo2-V) | Model | Hardware | Block 32 | Block 128 | |-------|----------|:--------:|:---------:| | phi-4 Q8_0 (14B, dense) | M5 | 6.7398 | 6.7398 | | Qwen2.5-7B Q4_K_M (dense) | M5 | 7.7852 | 7.7852 | | Qwen3.5-35B-A3B Q8_0 (MoE) | M5 | 7.0725 | 7.0725 | | Qwen2.5-7B Q4_K_M (dense) | M2 | 7.8384 | 7.8384 | PPL identical across all 3 architectures and both hardware platforms. The layer-adaptive V policy (boundary layers q8_0-V, rest turbo2-V) is block-size-independent. ### 3.2 Speed: No Measurable Regression #### M5 Max | Model | Metric | Block 32 | Block 128 | Delta | |-------|--------|:--------:|:---------:|------:| | phi-4 14B | pp4096 (t/s) | 701 ± 3 | 708 ± 37 | +1.0% | | phi-4 14B | tg128 (t/s) | 29.55 ± 0.36 | 29.38 ± 0.88 | -0.6% | | Qwen2.5-7B | pp512 (t/s) | 1939 ± 283 | 1886 ± 316 | -2.7% | | Qwen2.5-7B | tg128 (t/s) | 77.84 ± 0.26 | 78.34 ± 0.60 | +0.6% | | Qwen3.5-35B MoE | pp512 (t/s) | 2775 ± 35 | 2773 ± 23 | -0.1% | | Qwen3.5-35B MoE | tg128 (t/s) | 77.54 ± 0.64 | 77.75 ± 0.50 | +0.3% | All deltas are within measurement noise. The Qwen2.5-7B prefill shows high variance (±283 t/s) making the -2.7% unreliable. Decode throughput is flat across all models. #### M2 Pro — Matched Performance Follow-up (Qwen2.5-1.5B Q4_K_M) A targeted follow-up tested the same model at multiple context lengths on both M2 Pro and M5 Max to determine whether block_size=128 produces a speed benefit specifically on bandwidth-constrained Apple Silicon. **M2 Pro (16 GB, 200 GB/s bandwidth):** | Context | Metric | Block 32 | Block 128 | Delta | |:-------:|--------|:--------:|:---------:|------:| | 512 | prefill (t/s) | 1458 ± 8 | 1466 ± 4 | +0.6% | | 512 | decode (t/s) | 64.42 ± 0.14 | 66.32 ± 0.40 | +3.0% | | 8K | prefill (t/s) | 808 ± 0.3 | 811 ± 0.3 | +0.4% | | 8K | decode (t/s) | 64.05 ± 0.32 | 68.34 ± 0.97 | +6.7% | | 16K | prefill (t/s) | 538 ± 0.2 | 539 ± 0.3 | +0.3% | | 16K | decode (t/s) | 64.03 ± 0.44 | 66.60 ± 0.58 | +4.0% | M2 Pro shows a consistent decode improvement at block_size=128: +3.0% at 512, +6.7% at 8K, +4.0% at 16K. Prefill is effectively flat. **M5 Max matched comparison (128 GB, 546 GB/s bandwidth, same model/config):** | Context | Metric | Block 32 | Block 128 | Delta | |:-------:|--------|:--------:|:---------:|------:| | 512 | prefill (t/s) | 9521 ± 45 | 9377 ± 49 | -1.5% | | 512 | decode (t/s) | 161.41 ± 2.85 | 152.47 ± 4.49 | -5.5%* | | 8K | prefill (t/s) | 4853 ± 49 | 4802 ± 68 | -1.0% | | 8K | decode (t/s) | 161.93 ± 0.39 | 159.80 ± 1.42 | -1.3% | | 16K | prefill (t/s) | 2999 ± 42 | 3074 ± 43 | +2.5% | | 16K | decode (t/s) | 155.13 ± 2.71 | 161.52 ± 0.28 | +4.1% | *The 512 decode -5.5% has high variance (±4.49 t/s). M5 decode deltas are inconsistent in direction and within noise, consistent with earlier phi-4 14B results on M5. The M2 decode gain is consistent and reproducible across three context lengths. The matched M5 run does not show a comparable benefit. One plausible interpretation is that M2's lower memory bandwidth (200 GB/s vs 546 GB/s) makes each byte of reduced norm storage proportionally more valuable during V-cache dequant. This is a hypothesis based on the hardware difference, not a proven mechanism. **Caveats:** This is a single-model result (Qwen2.5-1.5B) on a single M2 variant (M2 Pro). Larger models crash in llama-bench on M2 16GB. Other M2 variants and M1-family chips have different memory subsystems and may show different behavior. ### 3.3 Memory Savings For phi-4 (40 layers, head_dim=128), V cache memory at 512 context: | Block Size | V Cache (MiB) | Savings vs Block 32 | |:----------:|:-------------:|:-------------------:| | 32 | 43.75 | baseline | | 64 | 40.62 | 7.2% | | 128 | 39.06 | 10.7% | The savings scale linearly with context length. At 128K context, the difference is ~1.2 GB. --- ## 4. Interpretation The block size result is mechanically straightforward. The WHT rotation, centroid codebook, and norm correction are all computed at the 128-element group level. The storage block size determines only how many 2-byte norms are written per group. At block_size=32, four identical norm values are stored. At block_size=128, one is stored. The quantization indices and sign bits are identical regardless. This means the current block_size=32 default has been storing 3 redundant norms per group since the initial implementation. The 4.57x compression figure reported for turbo3 was conservative. The achievable compression for the same quantization math is 5.12x. --- ## 5. Relation to External Implementations Credit to [@AmesianX](https://github.com/AmesianX) whose CUDA TurboQuant implementation with block_size=256 prompted this investigation. His implementation reports 5.2x compression at +1.7% PPL. We have not inspected the internals of his implementation. One plausible interpretation is that 256 refers to the rotation group size (not just storage block size), which would use a 256x256 rotation matrix for models with head_dim=256 (e.g., Qwen3.5-27B). That would be a fundamentally different experiment from ours (changing rotation scope, not just storage chunking). For models with head_dim=128, a 256-element rotation group would not work. Whether his approach changes the rotation group size is not established here. --- ## 6. Limitations 1. **Backend scope.** Tested on Metal (Apple Silicon) only. CUDA block size changes would require separate kernel updates and validation. 2. **M2 speed coverage.** M2 speed testing used Qwen2.5-1.5B (the only model small enough for llama-bench on 16 GB). The +3-7% decode gain was consistent across 3 context lengths but is a single-model result. Larger-model speed behavior on M2 is inferred from PPL parity, not directly measured. 3. **NIAH.** Tested at block_size=128 only (phi-4, symmetric turbo3, 3 depths at 4K). 3/3 pass. Strict A/B vs block32 was not performed, but is expected to match given byte-identical PPL. 4. **Serialization.** Changing block size changes the on-disk turbo3/turbo2 block struct layout. Existing serialized cache state files would be incompatible. Since this is a pre-release branch, this is acceptable. 5. **Symmetric turbo4-K.** Symmetric turbo3-K + turbo3-V was validated (phi-4 dense, Qwen3.5-35B MoE). Symmetric turbo4-K + turbo4-V was not separately tested at block_size=128, though turbo4 already uses block_size=128 natively. --- ## 7. Recommendation For the tested Metal (Apple Silicon) paths, block_size=128 is a safe default change: - Zero PPL regression across all tested configurations: - Asymmetric q8_0-K + turbo{2,3}-V - Symmetric turbo3-K + turbo3-V (phi-4 dense, Qwen3.5-35B MoE) - Boundary V / LA-V7 (phi-4, Qwen2.5-7B, Qwen3.5-35B MoE, M5 + M2) - 3 model architectures, 3 context lengths (512, 8K, 32K), 2 hardware platforms - 12% better compression (5.12x vs 4.57x for turbo3) - No speed regression on M5 Max (flat across 3 models) - Consistent +3-7% decode improvement on the tested M2 Pro setup (Qwen2.5-1.5B, 3 context lengths) - NIAH 3/3 pass at block_size=128 (phi-4, symmetric turbo3) - Eliminates redundant norm storage Before shipping as a default change, CUDA backends should be separately validated. --- ## 8. Reproduction To test block_size=128 locally: 1. Edit `ggml/src/ggml-common.h`: ```c #define QK_TURBO3 128 // was 32 #define QK_TURBO2 128 // was 32 ``` 2. Rebuild: `cmake --build build -j$(nproc)` 3. Run PPL: ```bash ./build/bin/llama-perplexity \ -m your-model.gguf \ -f wikitext-2-raw/wiki.test.raw \ --cache-type-k q8_0 --cache-type-v turbo3 \ -c 512 --chunks 10 -ngl 99 ``` The derived macros (`NL_TURBO3`, `NL_TURBO3_VEC`, `NL_TURBO2`, `NL_TURBO2_VEC`) propagate the block size change through all Metal flash attention template instantiations automatically. --- ## Addendum: Independent Validation (2026-03-31) The block size 128 finding has been independently confirmed: - **@dusterbloom** — RTX 3090 (2026-03-30): Tested TBQ3 Flash Attention across 5 model families with block_size=128 builds. Results consistent with our Metal findings. - **@HyperionMS2040** — RTX 3090 (2026-03-30): CUDA warp-to-block mapping fix (PR #32) enabled correct block_size=128 support on CUDA. PPL validated on multiple models after fix. - **@AmesianX** — Blackwell DGX Spark (2026-03-30): Used block size 256 (QK_K) for fused FA kernels. Different tradeoff: 256 gives slightly better compression but our testing shows 128 is better quality-per-bit on smaller models (7B-14B).