# Getting Started with TurboQuant+ Get up and running with TurboQuant+ KV cache compression in under 5 minutes. ## Prerequisites - A GGUF model (download from [HuggingFace](https://huggingface.co)) - CMake 3.14+ - One of: Apple Silicon Mac, NVIDIA GPU (CUDA), AMD GPU (ROCm/HIP) ## Build ```bash git clone https://github.com/TheTom/llama-cpp-turboquant cd llama-cpp-turboquant git checkout feature/turboquant-kv-cache # Apple Silicon (Metal) cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release cmake --build build -j # NVIDIA (CUDA) cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j # AMD (ROCm/HIP) cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j # AMD (ROCm/HIP) on Windows — clang requires explicit optimization flags cmake -S . -B build -G Ninja \ -DGPU_TARGETS=gfx1201 \ -DGGML_HIP=ON \ -DCMAKE_C_COMPILER=clang \ -DCMAKE_CXX_COMPILER=clang++ \ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_C_FLAGS_RELEASE="-O3 -DNDEBUG" \ -DCMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG" cmake --build build --config Release # Windows (CUDA, use Developer Command Prompt or WSL2) cmake -B build -DGGML_CUDA=ON cmake --build build --config Release -j ``` ## Quick Start Run a model with TurboQuant+ KV cache compression: ```bash # Interactive chat ./build/bin/llama-cli -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192 # Server mode ./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo3 -fa on -ngl 99 -c 8192 ``` ## Choosing Your Config ### Safe default (works on any model) ```bash -ctk q8_0 -ctv turbo4 -fa on ``` Asymmetric: full precision K, compressed V. Safe on all tested models including sensitive ones like Qwen2.5. > **Known limitation:** Models with `head_dim=64` (e.g., GPT-OSS-120B) may crash or produce degraded output with turbo V compression. The WHT kernel requires sufficient dimensionality for CLT convergence, and d=64 is at the lower boundary. If you hit issues on a head_dim=64 model, fall back to `-ctk q8_0 -ctv q8_0` (no turbo compression). The paper validated at head_dim=128 and 256. ### More compression (works on most models) ```bash -ctk q8_0 -ctv turbo3 -fa on ``` 5.12x V compression with minimal quality loss (+1-2% PPL). ### Maximum compression (validated large models only) ```bash -ctk turbo3 -ctv turbo3 -fa on ``` Symmetric. Works great on Llama 70B, Command-R+ 104B, Mistral 24B, Qwen3.5 MoE. **Do NOT use on Qwen2.5 with Q4_K_M weights** (catastrophic PPL). ### When to use what | Your model | Recommended config | |---|---| | Q8_0 weights, any size | `-ctk turbo4 -ctv turbo4` or `-ctk turbo3 -ctv turbo3` | | Q4_K_M, unknown model | `-ctk q8_0 -ctv turbo4` (safe default) | | Q4_K_M, Qwen2.5 | `-ctk q8_0 -ctv turbo3` (must be asymmetric) | | Q4_K_M, Llama/Mistral/Cohere 24B+ | `-ctk turbo3 -ctv turbo3` (symmetric works) | For full details see [Configuration Recommendations](turboquant-recommendations.md). ## Built-in Optimizations These features are enabled by default on recent builds. No extra flags needed. ### Sparse V Dequant (Metal only) Skips V dequantization for positions where the softmax attention weight is negligible (< 1e-6). At long context, 90%+ of positions have negligible weight, so this eliminates roughly half the total dequant cost. - +22.8% decode throughput at 32K context on MoE models - Zero perplexity impact (validated at 32K, 50 chunks, wikitext-103) - Works with any KV cache type (turbo3, q8_0, q4_0) - Opt-out: `TURBO_SPARSE_V=0` See [Sparse V paper](papers/sparse-v-dequant.md). ### Boundary V (auto-enabled with turbo2-V) Protects the first 2 + last 2 transformer layers with q8_0-V while compressing all remaining layers with turbo2-V. Recovers 37-91% of the quality gap between turbo2 and turbo3, with no speed penalty. - Auto-enabled when you use `-ctv turbo2` - Opt-out: `TURBO_LAYER_ADAPTIVE=0` - Manually enable on older builds: `TURBO_LAYER_ADAPTIVE=7` See [Boundary V paper](papers/layer-aware-v-compression.md). ### Maximum V Compression ```bash # turbo2-V with Boundary V (auto-enabled) # Best for extreme memory pressure. +5-9.5% PPL. ./build/bin/llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa on -ngl 99 ``` ## Benchmarking Help the project by sharing your numbers. Here's how to generate comparable results. > **Important: Pick the right config before benchmarking.** Symmetric turbo (e.g., `-ctk turbo3 -ctv turbo3`) is catastrophic on some model families (Qwen2.5, Qwen3 MoE) with Q4 weight quantization. If you benchmark with the wrong config, your PPL numbers will be misleading -- you'll measure the damage from a bad config, not the actual compression quality. Start with asymmetric (`-ctk q8_0 -ctv turbo4`) unless you've validated symmetric on your specific model. See [Configuration Recommendations](turboquant-recommendations.md) for which configs are safe on which models. ### Compression reference | Config | K bpv | V bpv | Avg bpv | KV compression vs fp16 | |--------|-------|-------|---------|------------------------| | fp16/fp16 | 16.0 | 16.0 | 16.0 | 1.0x (baseline) | | q8_0/q8_0 | 8.5 | 8.5 | 8.5 | 1.88x | | q8_0/turbo4 | 8.5 | 4.25 | 6.375 | 2.51x | | q8_0/turbo3 | 8.5 | 3.125 | 5.8125 | 2.75x | | q8_0/turbo2 | 8.5 | 2.125 | 5.3125 | 3.01x | | turbo4/turbo4 | 4.25 | 4.25 | 4.25 | 3.76x | | turbo3/turbo3 | 3.125 | 3.125 | 3.125 | 5.12x | Formula: `compression = 16 / avg_bpv`. For asymmetric: `avg_bpv = (k_bpv + v_bpv) / 2`. ### How to measure actual KV cache size llama-server logs the KV cache allocation at startup. Look for lines like: ``` llama_kv_cache_init: Metal KV buffer size = 1234.56 MiB ``` Compare this across configs to see real memory savings. You can also use `llama-bench` output which reports KV cache size. ### Step 1: Download test data ```bash # Wikitext-2 raw text for perplexity testing # Option 1: Direct download (no HF account needed) wget https://huggingface.co/nisten/llama3-8b-instruct-32k-gguf/raw/main/wiki.test.raw # Option 2: Via Hugging Face CLI (requires HF token for some models) # pip install huggingface-hub # huggingface-cli download Salesforce/wikitext wikitext-2-raw-v1/test-00000-of-00001.parquet ``` ### Step 2: Run baseline (always do this first) ```bash ./build/bin/llama-perplexity \ -m model.gguf \ -ctk q8_0 -ctv q8_0 \ -fa on -ngl 99 \ -f wiki.test.raw \ -c 512 --chunks 20 ``` ### Step 3: Run turbo configs ```bash # Asymmetric (safe for any model) ./build/bin/llama-perplexity \ -m model.gguf \ -ctk q8_0 -ctv turbo3 \ -fa on -ngl 99 \ -f wiki.test.raw \ -c 512 --chunks 20 # Symmetric (if your model supports it) ./build/bin/llama-perplexity \ -m model.gguf \ -ctk turbo3 -ctv turbo3 \ -fa on -ngl 99 \ -f wiki.test.raw \ -c 512 --chunks 20 ``` ### Step 4: Speed benchmarks ```bash # Short context ./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 # Long context (compare turbo3 vs q8_0 at each length) ./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 8192 -r 1 ./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 8192 -r 1 ./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1 -p 32768 -r 1 ./build/bin/llama-bench -m model.gguf -ctk turbo3 -ctv turbo3 -fa 1 -p 32768 -r 1 ``` ### Step 5: Share results Post your numbers on [PR #45](https://github.com/TheTom/llama-cpp-turboquant/pull/45). This is where we are tracking all test results before merge. Include: model name, weight quantization, GPU, VRAM, turbo config, PPL, and speed numbers. ## Weight Compression (TQ4_1S) — Experimental > **Backend support:** Metal (Apple Silicon), CUDA (NVIDIA), and AMD HIP/ROCm (tested on RX 9070 XT, RDNA 4). The quantization step (llama-quantize) works on any platform. Windows builds supported for both MSVC (CUDA) and clang (HIP). TQ4_1S applies WHT rotation + Lloyd-Max polar quantization to model weights (not just KV cache). This is post-training quantization -- no retraining or calibration required. Apply directly to Q8_0 GGUF models. **Code:** [PR #45](https://github.com/TheTom/llama-cpp-turboquant/pull/45) (branch `pr/tq4-weight-compression`). Build from that branch to use weight compression. See the [weight compression paper](papers/weight-compression-tq4.md) for full methodology and results. ### Quick Test (copy-paste, 5 minutes) Clone, build, compress a known-good model, and benchmark. Paste the llama-bench output back in the [PR #45 comments](https://github.com/TheTom/llama-cpp-turboquant/pull/45). ```bash # 1. Clone and build from PR #45 git clone https://github.com/TheTom/llama-cpp-turboquant.git cd llama-cpp-turboquant git checkout pr/tq4-weight-compression # Pick your backend: cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release # Apple Silicon cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release # NVIDIA cmake --build build -j$(nproc 2>/dev/null || sysctl -n hw.ncpu) # 2. Download a known-good test model (Qwen3.5-27B Q8_0, 26.6 GB) # Fits on 32GB+ VRAM. For 24GB cards, use Qwen2.5-7B-Instruct Q8_0 instead. huggingface-cli download Qwen/Qwen3.5-27B-GGUF qwen3.5-27b-q8_0.gguf --local-dir models/ # 3. Compress with Config I python3 -c " n_layers = 64 # 64 for Qwen3.5-27B, 28 for Qwen2.5-7B boundary = 2 for i in range(boundary, n_layers - boundary): for t in ['attn_q', 'attn_k', 'attn_v', 'attn_output', 'ffn_gate', 'ffn_up']: print(f'blk.{i}.{t}.weight=tq4_1s') print(f'blk.{i}.ffn_down.weight=q4_k') " > config_i.txt ./build/bin/llama-quantize --allow-requantize --tensor-type-file config_i.txt \ models/qwen3.5-27b-q8_0.gguf models/qwen3.5-27b-config-i.gguf Q8_0 # Show size before/after python3 -c " import os src_path = 'models/qwen3.5-27b-q8_0.gguf' dst_path = 'models/qwen3.5-27b-config-i.gguf' src = os.path.getsize(src_path) / (1024**3) dst = os.path.getsize(dst_path) / (1024**3) print(f'{src_path}: {src:.2f} GB -> {dst_path}: {dst:.2f} GB ({(1-dst/src)*100:.0f}% smaller)') " # 4. Benchmark — run all 3 and paste the output in the PR ./build/bin/llama-bench -m models/qwen3.5-27b-q8_0.gguf -fa 1 -ngl 99 -p 512 -n 128 ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -fa 1 -ngl 99 -p 512 -n 128 ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -ctk q8_0 -ctv turbo4 -fa 1 -ngl 99 -p 512 -n 128 ``` Expected results (Qwen3.5-27B): - Q8_0 source: 26.6 GB - Config I output: ~19.1 GB (28% smaller) - Quality: +1.3% PPL - Speed (Metal): 94-102% of Q8_0. Speed (CUDA): varies by GPU, collecting data **What to report:** paste all 3 llama-bench outputs in [PR #45](https://github.com/TheTom/llama-cpp-turboquant/pull/45) along with your GPU model. Crashes, errors, or unexpected output are equally valuable. ### Other Models and Configs The Quick Test above uses Config I, which works for Qwen, Phi, and most non-Llama models. Adjust `n_layers` for your model (28 for 1.5B, 64 for 27B, 80 for 70B, etc.). Source must be Q8_0 GGUF. For **Llama-family models**, use Hybrid (Q4_K for ALL FFN) or Premium (Q5_K/Q6_K FFN): ```bash # Llama Hybrid: TQ4_1S attention, Q4_K all FFN python3 -c " n_layers = 80 # Llama 3.1 70B for i in range(2, n_layers - 2): for t in ['attn_q', 'attn_k', 'attn_v', 'attn_output']: print(f'blk.{i}.{t}.weight=tq4_1s') for t in ['ffn_gate', 'ffn_up', 'ffn_down']: print(f'blk.{i}.{t}.weight=q4_k') " > llama_hybrid.txt # Llama Premium: TQ4_1S attention, Q5_K/Q6_K FFN (better quality, +8G) python3 -c " n_layers = 80 for i in range(4, n_layers - 4): for t in ['attn_q', 'attn_k', 'attn_v', 'attn_output']: print(f'blk.{i}.{t}.weight=tq4_1s') for t in ['ffn_gate', 'ffn_up']: print(f'blk.{i}.{t}.weight=q5_k') print(f'blk.{i}.ffn_down.weight=q6_k') " > llama_premium.txt ./build/bin/llama-quantize \ --allow-requantize \ --tensor-type-file llama_hybrid.txt \ model-Q8_0.gguf model-hybrid.gguf Q8_0 ``` Source requirement: Q8_0 GGUF. Models already at Q4_K_M have minimal compression headroom. ### Model Compatibility Matrix Based on architecture analysis and tested results. Models marked with * are predictions based on code analysis of `convert_hf_to_gguf.py` and need community validation. ### Tested Models — Validated Results | Model | Size (Q8_0) | Size (Compressed) | Config | PPL Delta | Decode | NIAH | |-------|-------------|-------------------|--------|-----------|--------|------| | Qwen2.5-1.5B | 1.76G | **1.28G** | Config I | +1.9% | 96% | 6/6 | | Qwen3.5-27B | 26.6G | **19.1G** | Config I | +1.3% | 99% | 3/3 | | Qwen3.5-35B-A3B MoE | 34.4G | **21.6G** | Config I | +1.4% | 102% | — | | **Qwen2.5-72B** | **72.0G** | **45.8G** | **Config I** | **+3.9%** | **95%** | **3/3** | | Phi-4 14B | 14.5G | **9.3G** | Config I | +1.0% | **254%** | 3/3 | | Llama 3.1 70B | 69.8G | **49.8G** | Premium | +5.8% | fast | 3/3 | | Llama 3.1 70B | 69.8G | **40.2G** | Hybrid | +16% | 133% | 3/3 | ### Predicted Models — Based on Architecture Analysis Models marked with * need community validation. Predictions based on `convert_hf_to_gguf.py` analysis. | Model Family | Config | Expected PPL Delta | Size Reduction | Notes | |---|---|---|---|---| | Phi-3* | Config I | ~+1% | ~36% | Same family as tested Phi-4 | | Llama 2/3/3.2* | Hybrid/Premium | ~+6-20% | ~29-42% | Same family as tested Llama 3.1 | | CodeLlama* | Hybrid/Premium | ~+6-20% | ~29-42% | Extends LlamaModel | | Mistral* | Hybrid/Premium | ~+6-20% | ~29-42% | Extends LlamaModel, has permutation | | Mixtral* | Hybrid/Premium | ~+6-20% | ~29-42% | MoE, extends LlamaModel | | Granite (IBM)* | Hybrid/Premium | ~+6-20% | ~29-42% | Extends LlamaModel | | Llama 4* | Config I | ~+1-4% | ~30-38% | `undo_permute=False`, no permutation | | Gemma 2/3* | Config I | ~+1-4% | ~30-38% | No permutation | | Command-R/R+* | Config I | ~+1-4% | ~30-38% | No permutation | | DeepSeek V2/R1* | Config I | unknown | ~30% | MLA attention, less certain | | Falcon* | Config I | ~+1-4% | ~30-38% | No permutation | | InternLM* | Config I | ~+1-4% | ~30-38% | No permutation | | OLMo* | Config I | ~+1-4% | ~30-38% | No permutation | | Yi* | Config I | ~+1-4% | ~30-38% | No permutation | **Config I:** attn+gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2. For Qwen, Phi, and non-Llama models. **Hybrid:** attn=TQ4_1S, ALL FFN=Q4_K, boundary 2+2. For Llama-family. Max compression, +16% PPL. **Premium:** attn=TQ4_1S, ffn_gate/up=Q5_K, ffn_down=Q6_K, boundary 4+4. For Llama-family. Best quality, +5.8% PPL. The key discriminator is whether the model's GGUF conversion permutes Q/K weights for RoPE (`undo_permute` in `convert_hf_to_gguf.py`). Models without permutation tend to work well with Config I. However, permutation alone does not fully explain the sensitivity difference (see paper Section 5.7). Treat predictions as directional. **Community validation on untested models is very welcome** -- post results on [PR #45](https://github.com/TheTom/llama-cpp-turboquant/pull/45). ### Benchmark a Compressed Model ```bash # PPL baseline (Q8_0) ./build/bin/llama-perplexity -m model-Q8_0.gguf -ngl 99 -fa 1 \ -f wiki.test.raw -c 512 --chunks 20 # PPL compressed ./build/bin/llama-perplexity -m model-config-i.gguf -ngl 99 -fa 1 \ -f wiki.test.raw -c 512 --chunks 20 # Speed ./build/bin/llama-bench -m model-config-i.gguf -fa 1 # Combined with TurboQuant KV compression ./build/bin/llama-server -m model-config-i.gguf -ngl 99 -fa 1 \ -ctk q8_0 -ctv turbo3 ``` ## Apple Silicon: Large Models at Long Context If you're running 70B+ models at long context on Apple Silicon, macOS caps GPU memory at ~75% of RAM by default. This causes Metal to hang at ~49K context. Fix: ```bash # Recommended: 90% of physical RAM (safe for sustained inference) # 128GB Mac sudo sysctl iogpu.wired_limit_mb=117964 # 96GB Mac sudo sysctl iogpu.wired_limit_mb=88474 # 64GB Mac sudo sysctl iogpu.wired_limit_mb=58982 ``` No reboot required. Resets on reboot. ## Resources - [Configuration Recommendations](turboquant-recommendations.md) (full config guide with all tested models) - [M5 Max Stress Test](papers/m5-max-stress-test.md) (70B and 104B results) - [Sparse V Dequantization](papers/sparse-v-dequant.md) - [Boundary V](papers/layer-aware-v-compression.md) - [Block Size Optimization](papers/block-size-experiment.md) - [TurboQuant paper (Google Research)](https://arxiv.org/abs/2504.19874)